Commit graph

34 commits

Author SHA1 Message Date
Alex Pyrgiotis
5a0c4d0a03
Bump timeouts
Perform the following timeout bumps:

1. Increase the minimum timeout per page/MiB by a factor of 3. The
   rationale is that 10 seconds is a reasonable timeout, but to be on the
   safe side, it's best if we multiply it by a safety factor.
2. Increase the minimum timeout from 10 seconds to 60 seconds. 10
   seconds may be too little if the application runtime (e.g.,
   LibreOffice) is slow to start due to background CPU thrashing.
2023-03-08 17:38:59 +02:00
deeplow
541fe7f382
Container: ignore non-progress pdftoppm output
pdftoppm reports syntax issues and errors on a variety of documents,
but it still produces usable results despite the failures. From the
user's perspective, it's better to have an imperfect document than
none at all. For this reason, we ignore non-relevant output.
2023-02-21 19:05:21 +00:00
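A minimal sketch of the filtering described in commit 541fe7f382, in Python. It assumes pdftoppm is run with progress reporting enabled and that each progress line has the form "<page> <total pages> <output file>". The function name and the regular expression are hypothetical and not taken from the Dangerzone code.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed shape of a progress line: "<page> <total pages> <output file>".
PROGRESS_LINE = re.compile(r"^(\d+) (\d+) (.+)$")


def iter_progress(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (page, total_pages) for progress lines and skip everything else."""
    for line in lines:
        match = PROGRESS_LINE.match(line.strip())
        if match is None:
            # Non-progress output, e.g. "Syntax Error: ...", is ignored
            # instead of aborting the conversion.
            continue
        yield int(match.group(1)), int(match.group(2))
```

Anything that does not look like a progress line is dropped, so a noisy but otherwise successful run still reports page progress.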
deeplow
dbd0450542
Add poppler-data package due to missing fonts
Some documents were reporting the following error when running them
over pdftoppm:

    Syntax Error: Missing language pack for 'Adobe-Japan1' mapping

This did not necessarily make the document fail but it could be
that some fonts were not properly rendered due to the missing package.
2023-02-21 18:39:14 +00:00
Alex Pyrgiotis
93a06d72f0
Allow users to disable timeouts
Allow users to disable timeouts via the CLI, with the
`--disable-timeouts` argument. By default, the timeouts are always
enabled.

This option applies both to the CLI version of Dangerzone, and the GUI
one. For the latter, the user must start the GUI from their CLI (i.e.,
`dangerzone --disable-timeouts ...`)
2023-02-15 23:48:36 +02:00
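A minimal sketch of how a flag such as `--disable-timeouts` could be wired up, in Python. It uses argparse purely for illustration; the library Dangerzone actually uses, the default value, and the helper names below are assumptions.

```python
import argparse
from typing import List, Optional

DEFAULT_TIMEOUT = 120.0  # hypothetical default, in seconds


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dangerzone")
    # Timeouts stay enabled unless the flag is passed explicitly.
    parser.add_argument(
        "--disable-timeouts",
        action="store_true",
        help="run conversions without any timeout",
    )
    return parser


def effective_timeout(
    argv: List[str], timeout: float = DEFAULT_TIMEOUT
) -> Optional[float]:
    """Return None (wait indefinitely) when timeouts are disabled."""
    args = build_parser().parse_args(argv)
    return None if args.disable_timeouts else timeout
```

Returning `None` fits Python APIs such as `asyncio.wait_for`, where a timeout of `None` means waiting without a limit.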
Alex Pyrgiotis
f2a4f29cff
container: Introduce proportional timeouts
Introduce proportional timeouts in the container code, where the
conversion logic runs.

Previously, we had a single timeout for each command (120 seconds),
which didn't scale well either with the number of pages in a document,
or with the size of the document.

In this commit, we look into each operation and try to figure out the
following:

1. What's the number of pages we will operate on?
2. How large is the document?

Knowing the above, we can break down a command into multiple operations,
at least conceptually. Having a number of operations and a sane timeout
value per operation (10 seconds), we can multiply those and arrive at a
timeout that fits the command better.

Fixes #306
Fixes #314
Refs #327
2023-02-15 23:46:53 +02:00
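The proportional timeout described in commit f2a4f29cff can be sketched in Python as follows. Only the 10 seconds per operation comes from the commit message. Taking the larger of the page count and the size in MiB as the number of operations, rounding the size up, and applying a 10-second floor are assumptions made for this illustration, as are all the names.

```python
import math
from typing import Optional

SECONDS_PER_OPERATION = 10.0  # per-operation value from the commit message
MINIMUM_TIMEOUT = 10.0        # assumed floor for very small documents
MIB = 1024 * 1024


def proportional_timeout(size_bytes: int, pages: Optional[int] = None) -> float:
    """Scale the timeout with the document's size and page count."""
    # One conceptual operation per MiB (rounded up) ...
    size_operations = math.ceil(size_bytes / MIB)
    # ... or one per page, whichever is larger (assumption).
    operations = max(size_operations, pages or 0)
    return max(operations * SECONDS_PER_OPERATION, MINIMUM_TIMEOUT)
```

Under these assumptions, a 3 MiB document with 12 pages gets 12 operations and therefore a 120-second timeout, while a tiny one-page file falls back to the floor.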
Alex Pyrgiotis
aeeed411a0
container: Run commands asynchronously
Convert the Dangerzone script that runs in the container to run commands
asynchronously, via the asyncio module.

The main advantage of this approach is that it's fast, easy, and safe to
consume the command's streams, while the command is running in the
background.

Previously, we had implemented an approach that used non-blocking
sockets, but those are easy to get wrong. For instance, timeouts were
not exact and capturing output was brittle.

Fixes #325
2023-02-07 18:52:49 +02:00
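A minimal sketch of the approach in commit aeeed411a0, in Python: run a command with asyncio, consume its output streams while it runs, and enforce a timeout. The function names and the choice to kill the process on timeout are assumptions, not the Dangerzone implementation.

```python
import asyncio
from typing import List, Optional, Tuple


async def read_lines(stream: asyncio.StreamReader, sink: List[str]) -> None:
    """Collect decoded lines from a stream until it reaches EOF."""
    while True:
        line = await stream.readline()
        if not line:
            break
        sink.append(line.decode(errors="replace").rstrip())


async def run_command(
    args: List[str], timeout: Optional[float] = None
) -> Tuple[int, List[str], List[str]]:
    """Run a command, reading stdout and stderr while it is running."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out: List[str] = []
    err: List[str] = []
    try:
        # Both readers and the process wait share a single deadline.
        await asyncio.wait_for(
            asyncio.gather(
                read_lines(proc.stdout, out),
                read_lines(proc.stderr, err),
                proc.wait(),
            ),
            timeout,
        )
    except asyncio.TimeoutError:
        try:
            proc.kill()
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime
        await proc.wait()
        raise
    return proc.returncode, out, err
```

A caller would invoke it with something like `asyncio.run(run_command(["pdftoppm", "..."], timeout=120))`, where the arguments shown are placeholders.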
Alex Pyrgiotis
24975fabd5
container: Reinstate OpenJDK 8 dependency
Commit d7be28ec2a assumed that OpenJDK was
required for the PDFtk package, which is no longer installed in the
Dangerzone image, and thus was removed.

Turns out that while LibreOffice does not depend on OpenJDK, it may
produce corrupted PDFs if installed without it, and will not abort the
operation.

Reinstate OpenJDK to fix the issue of corrupted PDFs.

Fixes #315
2023-02-07 18:52:49 +02:00
deeplow
2da973232b
Remove sudo: no longer needed
Fixes #232
2023-01-23 14:13:56 +00:00
deeplow
d7be28ec2a
Remove openjdk-8 as a dependency.
default-jre and java dependencies had been added initially
[1] because of libreoffice-java-common, which is no longer present.
Then, when the image was changed from ubuntu to alpine [2], default-jre
was replaced with openjdk-8.

If java is still a dependency for libreoffice, then it should be pulled
automatically.

[1] 9ecdb9e995
[2] 650ae6eee1
2023-01-23 14:13:48 +00:00
deeplow
272d25aee0
Make pdf to ppm conversion dependent on num pages
2023-01-23 14:01:32 +00:00
deeplow
d28aa5a25b
Remove PDFtk dependency (replace w/ pdftoppm)
PDFtk actually isn't needed. It was being used for breaking a PDF
into pages but this is something that can be replaced by the already present
'pdftoppm'. Furthermore, by removing this dependency we contribute to
reproducible builds and overall supply chain security because it was
obtained from gitlab with no signature verification or version pinning.

The replacement 'pdftoppm' enabled us to take a shortcut:
 - before: PDF -> PDF pages -> PNG images -> RGB images
 - after:  PDF -> PPM images -> RGB images

And this last conversion step is trivial since the RGB format we were
using is just a PPM file without the metadata in its header.
2023-01-23 14:00:57 +00:00
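The last step mentioned in commit d28aa5a25b, turning a PPM image into raw RGB data by dropping its header, can be sketched in Python as below. The sketch assumes binary PPM ("P6") input with 8-bit channels; the function name is hypothetical.

```python
from typing import List, Tuple


def ppm_to_rgb(ppm: bytes) -> Tuple[int, int, bytes]:
    """Strip the header of a binary PPM and return (width, height, pixels)."""
    tokens: List[bytes] = []
    pos = 0
    # The header holds four whitespace-separated tokens:
    # magic number, width, height and maximum channel value.
    while len(tokens) < 4:
        while pos < len(ppm) and ppm[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(ppm):
            raise ValueError("truncated PPM header")
        if ppm[pos:pos + 1] == b"#":
            # Header comments run until the end of the line.
            while pos < len(ppm) and ppm[pos:pos + 1] != b"\n":
                pos += 1
            continue
        start = pos
        while pos < len(ppm) and not ppm[pos:pos + 1].isspace():
            pos += 1
        tokens.append(ppm[start:pos])

    magic, width, height, maxval = (
        tokens[0], int(tokens[1]), int(tokens[2]), int(tokens[3])
    )
    if magic != b"P6" or maxval != 255:
        raise ValueError("only 8-bit binary PPM (P6) is handled in this sketch")

    # Exactly one whitespace byte separates the header from the pixel data.
    pixels = ppm[pos + 1:]
    if len(pixels) != width * height * 3:
        raise ValueError("pixel data does not match the header dimensions")
    return width, height, pixels
```

The returned bytes are the pixels in row-major order, three bytes (R, G, B) per pixel, with no metadata attached.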
deeplow
642d86899b
Fix timeout message: replace pdfseparate with pdftk
2022-12-01 14:51:52 +00:00
Alex Pyrgiotis
57fdf06f0f
Bump global timeout to two minutes
Bump the global timeout used for various steps from 1 minute to 2
minutes. The reason is that we've seen several reports of operations
failing due to timeouts, even though they were otherwise running
legitimately.

Also, bump the timeout used for compression, which has been reported as
problematic as well.

Refs #146
Refs #149
2022-11-23 18:13:41 +02:00
Guthrie McAfee Armstrong
2085405d05
Remove redundant f-strings
2022-11-10 09:59:09 +00:00
deeplow
968fd20ac7
fix comma typo
2022-11-10 09:59:08 +00:00
deeplow
e4ff9801ee
make lint happy
2022-11-10 09:59:05 +00:00
Guthrie McAfee Armstrong
1bd8354228
simplify setting percentage to 0.0
2022-11-10 09:59:04 +00:00
Guthrie McAfee Armstrong
9989ffea37
catch ValueError, simplify try/except on top-level job runs
See https://github.com/freedomofpress/dangerzone/pull/167#discussion_r915757189
2022-11-10 09:59:02 +00:00
Guthrie McAfee Armstrong
6b44db9043
Update container/dangerzone.py
Co-authored-by: deeplow <47065258+deeplow@users.noreply.github.com>
2022-11-10 09:59:01 +00:00
Guthrie McAfee Armstrong
3ef8b183e2
Update container/dangerzone.py
Co-authored-by: deeplow <47065258+deeplow@users.noreply.github.com>
2022-11-10 09:58:59 +00:00
Guthrie McAfee Armstrong
2533eac4be
Rename ConversionJob back to DangerzoneConverter
Co-authored-by: deeplow <47065258+deeplow@users.noreply.github.com>
2022-11-10 09:58:57 +00:00
Guthrie McAfee Armstrong
5a4bf99211
Remove another "END OF FOR LOOP" comment
2022-11-10 09:58:54 +00:00
Guthrie McAfee Armstrong
c18f170caf
Remove "END OF FOR LOOP" comment
Co-authored-by: deeplow <47065258+deeplow@users.noreply.github.com>
2022-11-10 09:58:53 +00:00
Guthrie McAfee Armstrong
17939cb70c
Wrap dangerzone.py back into a class to keep track of percentage
2022-11-10 09:58:51 +00:00
Guthrie McAfee Armstrong
eaa08c9c3d
refactor dangerzone.py, raise exceptions instead of returning int
Standardize calls to subprocess.run to shrink file by about 100 lines
2022-11-10 09:58:50 +00:00
Guthrie McAfee Armstrong
7a84b89410
(container functions): Replace int return codes with raised exceptions
2022-11-10 09:58:48 +00:00
Guthrie McAfee Armstrong
c78b1ea71b
Flatten DangerzoneConverter methods into functions
2022-11-10 09:58:45 +00:00
deeplow
092456434b
don't type check dev scripts
2022-08-22 12:28:48 +01:00
deeplow
23e30ae40a
check that OCR_LANGUAGE has also been set
2022-08-22 12:28:46 +01:00
deeplow
463ff97b97
add type hints to container dz py code
2022-08-22 12:28:44 +01:00
deeplow
4d8e4c53e3
sort imports with isort linter
2022-08-22 10:15:26 +01:00
deeplow
21a9a6c98c
running dangerzone without root in container
There was previously a user created in the container but it was not
used via the dockerfile RUN directive (as pointed out by
gmarmstrong[1]).

Fixes #169

[1]: https://github.com/freedomofpress/dangerzone/issues/169#issue-1268399245
2022-08-22 08:43:58 +01:00
Micah Lee
8052220034
Get rid of wrapper scripts in the container
2021-11-29 15:39:24 -08:00
Micah Lee
2de2b6dca5
Rename dangerzone-converter to container
2021-11-29 15:30:21 -08:00