The container image does not need the TESSDATA_PREFIX env variable since
its PyMuPDF version is new enough to support `tessdata` as an argument
when calling the PyMuPDF tesseract method.
Since the progress information is now inferred on host based on the
number of pages obtained, progress-tracking variables should be removed
from the server.
Remove timeouts due to several reasons:
1. Lost purpose: after implementing the containers page streaming the
only subprocess we have left is LibreOffice. So don't have such a
big risk of commands hanging (the original reason for timeouts).
2. Little benefit: predicting execution time is generically unsolvable
computer science problem. Ultimately we were guessing an arbitrary
time based on the number of pages and the document size. As a guess
we made it pretty lax (30s per page or MB). A document hanging for
this long will probably lead to user frustration in any case and the
user may be compelled to abort the conversion.
3. Technical Challenges with non-blocking timeout: there have been
several technical challenges in keeping timeouts that we've made effort
to accommodate. A significant one was having to do non-blocking read to
ensure we could timeout when reading conversion stream (and then used
here)
Fixes#687
This reverts commit fea193e935.
This is part of the purge of timeout-related code since we no longer
need it [1]. Non-blocking reads were introduced in the reverted commit
in order to be able to cut a stream mid-way due to a timeout. This is
no longer needed now that we're getting rid of timeouts.
[1]: https://github.com/freedomofpress/dangerzone/issues/687
If we increased the number of parallel conversions, we'd run into an
issue where the streams were getting mixed together. This was because
the Converter.proc was a single attribute. This breaks it down into a
local variable such that this mixup doesn't happen.
Conversions methods had changed and that was part of the reason why
the tests were failing. Furthermore, due to the `provider.proc`, which
stores the associated qrexec / container process, "server" exceptions
raise a IterruptedConversion error (now ConverterProcException), which
then requires interpretation of the process exit code to obtain the
"real" exception.
Now that only the second container can send JSON-encoded progress
information, we can the untrusted JSON parsing. The parse_progress was
also renamed to `parse_progress_trusted` to ensure future developers
don't mistake this as a safe method.
The old methods for sending untrusted JSON were repurposed to send the
progress instead to stderr for troubleshooting in development mode.
Fixes#456
If one converted more than one document, since the state of
IsolationProvider.percentage would be stored in the IsolationProvider
instance, it would get reused for the second document. The fix is to
keep it as a local variable, but we can explore having progress stored
on the document itself, for example. Or having one IsolationProvider per
conversion.
Merge Qubes and Containers isolation providers core code into the class
parent IsolationProviders abstract class.
This is done by streaming pages in containers for exclusively in first
conversion process. The commit is rather large due to the multiple
interdependencies of the code, making it difficult to split into various
commits.
The main conversion method (_convert) now in the superclass simply calls
two methods:
- doc_to_pixels()
- pixels_to_pdf()
Critically, doc_to_pixels is implemented in the superclass, diverging
only in a specialized method called "start_doc_to_pixels_proc()". This
method obtains the process responsible that communicates with the
isolation provider (container / disp VM) via `podman/docker` and qrexec
on Containers and Qubes respectively.
Known regressions:
- progress reports stopped working on containers
Fixes#443
Some tests [1] lead to the conclusion that ocr_compression does the same
to the file (performance and size-wise) to the file as deflating images
when saving the file. However, both methods active do add a bit of extra
time. For this reason we're disabling the image deflation (default
option).
[1]: https://github.com/freedomofpress/dangerzone/pull/622#discussion_r1434042296
Qubes does on-host pixels-to-pdf whereas the containers version doesn't.
This leads to an issue where on the containers version it tries to load
fitz, which isn't installed there, just because it's trying to check if
it should run the Qubes version.
The error it was showing was something like this:
ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'.
tests/__init__.py:8: in <module>
from dangerzone.document import SAFE_EXTENSION
dangerzone/__init__.py:16: in <module>
from .gui import gui_main as main
dangerzone/gui/__init__.py:28: in <module>
from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion
dangerzone/isolation_provider/qubes.py:15: in <module>
from ..conversion.pixels_to_pdf import PixelsToPDF
dangerzone/conversion/pixels_to_pdf.py:16: in <module>
import fitz
E ModuleNotFoundError: No module named 'fitz'
For context see discussion in [1].
[1]: https://github.com/freedomofpress/dangerzone/pull/622#issuecomment-1839164885
The original document was larger in dimensions than the original one due
to a mismatch in DPI settings. When converting documents to pixels we
were setting the DPI to 150 pixels per inch. Then when converting back
into a PDF we were using 70 DPI. This difference would result in an
overall larger document in dimensions (though not necessarily in file
size).
Fixes#626
Adding PyMuPDF essentially make the code much simpler since it can do
everything that we'd need multiple programs for. It also includes
tesseract-OCR integration, which this commit makes use of.
Timeout can no longer be used since we're not calling a subprocess. We
could still implement it, but it's more worthy to reply in
yet-to-implement client-side timeouts (in containers).
Use PyMuPDF (AGPL-licensed) within the container conversion to replace
the pdf conversion to RGB. This massively simplifies the code since
PyMuPDF is a native python library.
This PR reverts the patch that disables HWP / HWPX conversion on MacOS M1.
It does not fix conversion on Qubes OS (#494).
Previously, HWP / HWPX conversion didn't work on MacOS (Apple silicon CPU) (#498)
because libreoffice wasn't built with Java support on Alpine Linux for ARM (aarch64).
Gratefully, the Alpine team has enabled Java support on the aarch64
system [1], so we can enable it again for ARM architectures.
And this patch is included in Alpine 3.19
This commit was included in #541 and reverted on #562 due to a stability issue.
Fixes#498
[1]: 74d443f479
Fix a bug in the "Change Selection" action, whereby changing your
selection and picking files from another directory results in:
"Dangerzone does not support adding documents from multiple
locations. The newly added documents were ignored."
To fix this, change the output directory when we change selection as
well.
The original intention of leaving the update checkbox in the hamburger
menu was to let non-supported Linux distros (e.g. compiled from source)
to check for updates. However, on Linux it ended up being disabled
forcefully by default on startup.
This takes into account an overriden update checkbox.
Fixes#596
If a Qubes conversion encounters an exception that is not a subclass of
ConversionException, it will still show a preview of a file that does
not exist.
Send an error progress report in that case, so that the GUI code can
detect that an error occurred and not open a file preview
Fixes#581
Create a temporary dir before the conversion begins, and store every
file necessary for the conversion there. We are mostly concerned about
the second stage of the conversion, which runs in the host. The first
stage runs in a disposable qube and cleanup is implicit.
Fixes#575Fixes#436
Extend the PixelsToPDF converter by adding an additional `tempdir`
argument. This argument can be used to make the conversion use a
different temporary directory other than `/tmp`.
For containers, this extra arguments makes no difference, as it won't be
used. For Qubes, this argument will allow storing files in a temporary
dir that will be cleaned up once the conversion completes. Previously,
these files would linger in the user's `/tmp`.
Refs #575
If a command encounters an error or times out during the second stage of
the conversion in Qubes, handle it the same way as we would have handled
it in the first stage:
1. Get its error message.
2. Throw an UnexpectedConversionError exception, with the original
message.
Note that, because the second stage takes place locally, users will see
the original content of the error.
Refs #567Closes#430
This should only affect the alpha version of Qubes OS (in containers
it only allows the attacker to control the timeout). In short, an
attacker could have PDF metadata that would show before "Pages:" in
the `pdfinfo` command output and this would essentially override the
number of pages measured in the server. This could enable the attacker
to shorten the number of pages of a document for example.
Fixes#565
When qrexec-client-vm fails, it could be a symptom of various issues:
- the system being out of RAM
- dz-dvm not existing
The exit code is the same in all cases (126), which makes it
particularly tricky to solve in the client application. For this reason
the approach is now to tell the user to see the qubes error notification
on the top right of their screen.
Add a sanity check at the end of the conversion from doc to pixels, to
ensure that the resulting document will have the same number of pages as
the original one.
Refs #560