pdftoppm raises Syntax issues and Errors on a variety of documents.
But it still produces usable results despite the failures. From the
user's perspective it's best to have a document even if imperfect than
having none at all. For this reason, we ignore non-relevant output.
Some documents were reporting the following error when running them
over pdftoppm:
Syntax Error: Missing language pack for 'Adobe-Japan1' mapping
This did not necessarily make the document fail but it could be
that some fonts were not properly rendered due to the missing package.
Allow users to disable timeouts via the CLI, with the
`--disable-timeouts` argument. By default, the timeouts are always
enabled.
This option applies both to the CLI version of Dangerzone, and the GUI
one. For the latter, the user must start the GUI from their CLI (i.e.,
`dangerzone --disable-timeouts ...`)
Introduce proportional timeouts in the container code, where the
conversion logic runs.
Previously, we had a single timeout for each command (120 seconds),
which didn't scale well either with the number of pages in a document,
or with the size of the document.
In this commit, we look into each operation, and we're trying to figure
out the following:
1. What's the number of pages we will operate on?
2. How large is the document?
Knowing the above, we can break down a command into multiple operations,
at least conceptually. Having a number of operations and a sane timeout
value per operation (10 seconds), we can multiply those and reach to a
timeout that fits the command better.
Fixes#306Fixes#314
Refs #327
Convert the Dangerzone script that in the container to run commands
asynchronously, via the asyncio module.
The main advantage of this approach is that it's fast, easy, and safe to
consume the command's streams, while the command is running in the
background.
Previously, we had implemented an approach that used non-blocking
sockets, but those are easy to get wrong. For instance, timeouts were
not exact, capturing output was brittle.
Fixes#325
Commit d7be28ec2a assumed that OpenJDK was
required for the PDFtk package, which is no longer installed in the
Dangerzone image, and thus was removed.
Turns out that while LibreOffice does not depend on OpenJDK, it may
produce corrupted PDFs if installed without it, and will not abort the
operation.
Reinstate OpenJDK to fix the issue of corrupted PDFs.
Fixes#315
default-jre and java dependencies dependencies had been added initially
[1] because of libreoffice-java-common, which is no longer present.
Then, when the image was changed from ubuntu to alpine [2], default-jre
was replaced with openjdk-8.
If java is still a dependency for libreoffice, then it should be pulled
automatically.
[1] 9ecdb9e995
[2] 650ae6eee1
PDFtk actually isn't needed. It was being used for breaking a PDF
into pages but this is something that be replaced by the already present
'pdftoppm'. Furthermore, by removing this dependency we contribute to
reproducible builds and overall supply chain security because it was
obtained from gitlab with no signature verification or version pinning.
The replacement 'pdftoppm' enabled us to do a shortcut:
- before: PDF -> PDF pages -> PNG images -> RGB images
- after: PDF -> PPM images -> RGB images
And this last conversion step is trivial since the RGB format we were
using is just a PPM file without the metadata in its header.
Bump the global timeout used for various steps from 1 minute to 2
minutes. The reason is that we've seen several reports of operations
failing due to timeout reasons, that were otherwise legitimately
running.
Also, bump the timeout used for compression, which has been reported as
problematic as well.
Refs #146
Refs #149