Commit graph

62 commits

Author SHA1 Message Date
deeplow
f3032a7142
Make big endian explicit in int to bytes
Fix issues in older distros that don't yet support python 3.11 where
endianness was not a default argument [1]. This is in response to CI
failures [2].

[1]: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
[2]: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/2186/workflows/e340ca21-85ce-42b6-9bc3-09e66f96684a/jobs/27380y
2024-02-06 19:42:41 +00:00
deeplow
1835756b45
Allow each conversion to have its own proc
If we increased the number of parallel conversions, we'd run into an
issue where the streams were getting mixed together. This was because
the Converter.proc was a single attribute. This breaks it down into a
local variable such that this mixup doesn't happen.
2024-02-06 19:42:41 +00:00
deeplow
61e7a3c107
Fix isolation provider tests
Conversions methods had changed and that was part of the reason why
the tests were failing. Furthermore, due to the `provider.proc`, which
stores the associated qrexec / container process, "server" exceptions
raise a IterruptedConversion error (now ConverterProcException), which
then requires interpretation of the process exit code to obtain the
"real" exception.
2024-02-06 19:42:41 +00:00
deeplow
550786adfe
Remove untrusted progress parsing (stderr instead)
Now that only the second container can send JSON-encoded progress
information, we can the untrusted JSON parsing. The parse_progress was
also renamed to `parse_progress_trusted` to ensure future developers
don't mistake this as a safe method.

The old methods for sending untrusted JSON were repurposed to send the
progress instead to stderr for troubleshooting in development mode.

Fixes #456
2024-02-06 19:42:40 +00:00
deeplow
0a099540c8
Stream pages in containers: merge isolation providers
Merge Qubes and Containers isolation providers core code into the class
parent IsolationProviders abstract class.

This is done by streaming pages in containers for exclusively in first
conversion process. The commit is rather large due to the multiple
interdependencies of the code, making it difficult to split into various
commits.

The main conversion method (_convert) now in the superclass simply calls
two methods:
  - doc_to_pixels()
  - pixels_to_pdf()

Critically, doc_to_pixels is implemented in the superclass, diverging
only in a specialized method called "start_doc_to_pixels_proc()". This
method obtains the process responsible that communicates with the
isolation provider (container / disp VM) via `podman/docker` and qrexec
on Containers and Qubes respectively.

Known regressions:
  - progress reports stopped working on containers

Fixes #443
2024-02-06 19:42:33 +00:00
deeplow
331b6514e8
Containers: remove debug messages (via files)
Remove container_log messages ahead of debug info being sent over
standard streams.
2024-02-06 18:54:39 +00:00
deeplow
cd99122385
Adds file formats: epub svg bmp pnm bpm ppm
Partially fix for #660. Missing some files due to limitations [1]:
- PSD - only available from PyMuPDF>=1.23.0 (qubes-fedora is lower)
- TXT - only available from PyMuPDF>=1.23.7 (qubes-fedora is lower)
- JXR - PyMuPDF was refusing to due to missing codec [1]
- JPX - Generated test file was rejected by PyMuPDF [2]
- FB2 - Most often cannot be detected by mime type alone [3]
- CBZ - (idem)
- XPS - (idem)
- MOBI - (idem)
- PAM - General version of other file format already included, so I
  decided not to include this extension [0]

New test files were generated locally:
 - epub - generated with calibre's convert-ebook from another
   sample file
 - svg - generated with inkscape from a mix of a default template
   (hexagons) and a logo's PNG file
 - bmp, pnm, bpm, ppm - generated with ImageMagick's 'convert' from
   tests/test_docs/sample-png.png

[0]: https://github.com/freedomofpress/dangerzone/issues/660#issuecomment-1914681487
[1]: https://github.com/freedomofpress/dangerzone/issues/660#issuecomment-1916803201
[2]: https://github.com/freedomofpress/dangerzone/issues/660#issuecomment-1916870347
[3]: https://github.com/freedomofpress/dangerzone/issues/688
2024-01-31 19:58:48 +00:00
deeplow
4e720aa6e2
Replace 'None' conversion type with "PyMuPDF"
Replaced for clarity over the fact that this conversion is in fact
handled by PyMuPDF.
2024-01-31 19:58:36 +00:00
deeplow
f1d90c6fa9
Compress per page when not using OCR
Make the compression happen per page when OCR is not enabled [1].

[1]: https://github.com/freedomofpress/dangerzone/pull/622#discussion_r1410986342
2024-01-03 12:58:36 +00:00
deeplow
e2531279c0
FIXUP Revert "Disable image compression when saving PDF"
This reverts commit f074db0beaa50389634203657f9b46307164a353.
2024-01-03 12:58:36 +00:00
deeplow
ee35e28aa6
Disable image compression when saving PDF
Some tests [1] lead to the conclusion that ocr_compression does the same
to the file (performance and size-wise) to the file as deflating images
when saving the file. However, both methods active do add a bit of extra
time. For this reason we're disabling the image deflation (default
option).

[1]: https://github.com/freedomofpress/dangerzone/pull/622#discussion_r1434042296
2024-01-03 12:58:36 +00:00
deeplow
6f61e44502
Solve import errors by lazy-loading fitz module
Qubes does on-host pixels-to-pdf whereas the containers version doesn't.
This leads to an issue where on the containers version it tries to load
fitz, which isn't installed there, just because it's trying to check if
it should run the Qubes version.

The error it was showing was something like this:

    ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'.
        tests/__init__.py:8: in <module>
            from dangerzone.document import SAFE_EXTENSION
        dangerzone/__init__.py:16: in <module>
            from .gui import gui_main as main
        dangerzone/gui/__init__.py:28: in <module>
            from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion
        dangerzone/isolation_provider/qubes.py:15: in <module>
            from ..conversion.pixels_to_pdf import PixelsToPDF
        dangerzone/conversion/pixels_to_pdf.py:16: in <module>
            import fitz
        E   ModuleNotFoundError: No module named 'fitz'

For context see discussion in [1].

[1]: https://github.com/freedomofpress/dangerzone/pull/622#issuecomment-1839164885
2024-01-03 12:58:36 +00:00
deeplow
80db7bb02e
Remove pre-pymupdf exceptions and detect pymupdf ones 2024-01-03 12:58:35 +00:00
deeplow
b75417bbec
Remove all server-side timeouts from doc to pixels
Now we're using client-side timeouts so the server side-ones are not
needed. Implemented following the suggestion from @apyrgio [1].

[1]: https://github.com/freedomofpress/dangerzone/pull/622#discussion_r1413906514
2024-01-03 12:58:35 +00:00
deeplow
576cbd3382
Fix DPI mismatch between doc2pixels and pixels2pdf
The original document was larger in dimensions than the original one due
to a mismatch in DPI settings. When converting documents to pixels we
were setting the DPI to 150 pixels per inch. Then when converting back
into a PDF we were using 70 DPI. This difference would result in an
overall larger document in dimensions (though not necessarily in file
size).

Fixes #626
2024-01-03 12:58:34 +00:00
deeplow
e5dbe25abb
Replace 'convert' with PyMuPDF for images
PyMuPDF can also convert images of the types we already support so we
don't need ImageMagick's 'convert'.
2024-01-03 12:58:34 +00:00
deeplow
77d5ea5940
Add PyMuPDF in pixels_to_pdf replacing old logic
Adding PyMuPDF essentially make the code much simpler since it can do
everything that we'd need multiple programs for. It also includes
tesseract-OCR integration, which this commit makes use of.
2024-01-03 12:56:33 +00:00
deeplow
ba17016643
Doc_to_pixels: remove unneeded timeout
Timeout can no longer be used since we're not calling a subprocess. We
could still implement it, but it's more worthy to reply in
yet-to-implement client-side timeouts (in containers).
2024-01-03 12:40:45 +00:00
deeplow
317deadbe4
Replace pdfinfo logic (get # pages) with PyMuPDF 2024-01-03 12:40:45 +00:00
deeplow
327ab8791f
Replace pdftoppm logic with PyMuPDF (native python)
Use PyMuPDF (AGPL-licensed) within the container conversion to replace
the pdf conversion to RGB. This massively simplifies the code since
PyMuPDF is a native python library.
2024-01-03 12:40:45 +00:00
Moon Sungjoon
63aea4cb45
Enable HWP conversion on MacOS (Apple silicon CPU)
This PR reverts the patch that disables HWP / HWPX conversion on MacOS M1.
It does not fix conversion on Qubes OS (#494).

Previously, HWP / HWPX conversion didn't work on MacOS (Apple silicon CPU) (#498)
because libreoffice wasn't built with Java support on Alpine Linux for ARM (aarch64).

Gratefully, the Alpine team has enabled Java support on the aarch64
system [1], so we can enable it again for ARM architectures.
And this patch is included in Alpine 3.19

This commit was included in #541 and reverted on #562 due to a stability issue.

Fixes #498

[1]: 74d443f479
2023-12-13 12:57:22 +02:00
Alex Pyrgiotis
f37d89f042
conversion: Allow using a temp dir other than /tmp
Extend the PixelsToPDF converter by adding an additional `tempdir`
argument. This argument can be used to make the conversion use a
different temporary directory other than `/tmp`.

For containers, this extra arguments makes no difference, as it won't be
used. For Qubes, this argument will allow storing files in a temporary
dir that will be cleaned up once the conversion completes. Previously,
these files would linger in the user's `/tmp`.

Refs #575
2023-10-04 14:00:53 +03:00
Alex Pyrgiotis
2016965c84
Revert "Enable HWP conversion on MacOS M1"
This reverts commit 214ce9720d. The
rationale is that we want to wait until the LibreOffice package that
allows HWP conversion in Alpine Linux lands in `alpine:latest`.

For more info, read
https://github.com/freedomofpress/dangerzone/issues/498#issuecomment-1739894100
2023-10-02 14:22:47 +03:00
deeplow
7daeccdfea
Prevent PDF from overwriting num_pages in Qubes
This should only affect the alpha version of Qubes OS (in containers
it only allows the attacker to control the timeout). In short, an
attacker could have PDF metadata that would show before "Pages:" in
the `pdfinfo` command output and this would essentially override the
number of pages measured in the server. This could enable the attacker
to shorten the number of pages of a document for example.

Fixes #565
2023-10-02 12:18:12 +01:00
deeplow
dabdf6c286
FIXUP: rename to QubesQrexecFailed instead 2023-10-02 12:06:18 +01:00
deeplow
eb488b16c5
FIXUP: rename QubesNotEnoughRAMError to QubesConversionStartFailed 2023-10-02 11:51:55 +01:00
deeplow
9cfac7ac2a
Generalize "out of RAM" error to reflect other issues
When qrexec-client-vm fails, it could be a symptom of various issues:
  - the system being out of RAM
  - dz-dvm not existing

The exit code is the same in all cases (126), which makes it
particularly tricky to solve in the client application. For this reason
the approach is now to tell the user to see the qubes error notification
on the top right of their screen.
2023-10-02 11:06:17 +01:00
Alex Pyrgiotis
ccf4132ea0
conversion: Add sanity check for page count
Add a sanity check at the end of the conversion from doc to pixels, to
ensure that the resulting document will have the same number of pages as
the original one.

Refs #560
2023-09-28 22:50:54 +03:00
Alex Pyrgiotis
b4e5cf5be7
qubes: Stream page data in real time
Stream page data back to the caller, immediately after we read them from
pdftoppm. This way, we have more accurate progress reports and timeouts.

Fixes #557
2023-09-28 22:50:54 +03:00
Alex Pyrgiotis
4bb959f220
conversion: Add anchor points for streaming page data/metadata
Introduce 4 new methods that can be overloaded by the Qubes isolation
provider to stream page data/metadata back to the caller. For the time
being, these methods do what they did before, i.e., write this info in
files within the pixels directory.
2023-09-28 22:50:53 +03:00
Alex Pyrgiotis
6012cd1491
Improve EOF detection when reading command output
Do not read a line from the command output and then check if
we are at EOF, because it's possible that the writer immediately exited
after writing the last line of output. Instead, switch the order of
actions.

This is a very serious bug that can lead to Dangerzone excluding the
last page of the document. It should have bit us right from the start
(see aeeed411a0), but it seems that the
small period of time it takes the kernel to close the file descriptors
was hiding this bug.

Fixes #560
2023-09-28 22:50:53 +03:00
deeplow
0a6b33ebed
Qubes: detect qube failing to start (missing RAM)
In Qubes OS it's often the case that the user doesn't have enough
RAM to start the conversion. In this case it raises BrokenPipeException
and exits with code 126.

It didn't seem possible to distinguish this kind of failure to one
where the user has misconfigured qrexec policies.

NOTE: this approach is not ideal UX-wise. After the first doc failing
the next one will also try and fail. Upon first failure we should
inform the user that they need to close some programs or qubes.
2023-09-28 11:08:50 +01:00
deeplow
63f03d5bcd
Add limit and test to max width and height of docs 2023-09-28 11:08:47 +01:00
deeplow
54b8ffbf96
Add page limit of 10000
Theoretically the max pages would be 65536 (2byte unsigned int.
However this limit is much higher than practical documents have
and larger ones can lead to unforseen problems, for example RAM
limitations.

We thus opted to use a lower limit of 10K. The limit must be
detected client-side, given that the server is distrusted. However
we also check it in the server, just as a fail-early mechanism.
2023-09-28 11:01:14 +01:00
Alex Pyrgiotis
30196ff35b
errors: Add error for interrupted conversions
Add an error for interrupted conversions, in order to better
differentiate this scenario from other ValueErrors that may be raised
throughout the code's lifetime.
2023-09-26 17:35:26 +03:00
Alex Pyrgiotis
e64d1da61f
qubes: Pass OCR parameters properly
Pass OCR parameters to conversion functions as arguments, instead of
setting environment variables.

Fixes #455
2023-09-20 18:04:40 +03:00
Alex Pyrgiotis
20157bef58
Fix typo 2023-09-20 17:45:44 +03:00
Alex Pyrgiotis
c547ffc3b4
conversion: Factor out calculate_timeout
Factor out the logic behind the calculate_timeout() method, used in
Dangerzone conversions, so that isolation providers can call it
directly.
2023-09-20 17:14:24 +03:00
deeplow
94f569cdf5
Add error code for unexpected errors in conversion 2023-09-19 15:52:47 +01:00
deeplow
8e4f04a52e
Shift to conversion exit codes by 128
Distinguish from podman or other errors in called binaries by shifting
the error codes by 128.
2023-09-19 15:34:00 +01:00
deeplow
b4c3e07d36
Remove attacker-controlled error messages
Creates exceptions in the server code to be shared with the client via an
identifying exit code. These exceptions are then reconstructed in the
client.

Refs #456 but does not completely fix it. Unexpected exceptions and
progress descriptions are still passed in Containers.
2023-09-19 15:33:20 +01:00
Moon Sungjoon
214ce9720d
Enable HWP conversion on MacOS M1
This PR reverts the patch that disables HWP / HWPX conversion on MacOS
M1. It does not fix conversion on Qubes OS (#494)

Previously, HWP / HWPX conversion didn't work on MacOS M1 systems (#498)
because libreoffice wasn't built with Java support on Alpine Linux for
ARM (aarch64).

Gratefully, the Alpine team has enabled Java support on the aarch64
system [1], so we can enable it again for ARM architectures.

Fixes #498

[1]: 74d443f479
2023-09-06 13:10:18 +03:00
deeplow
fa215063ee
Add logging for second container 2023-08-22 16:11:38 +01:00
deeplow
75369cf621
Adapt code so it works for reporting script
Reporting script now parses JunitXML instead of a series of
".container_log" files. The script in in changed submodule.

Additionally it makes failed tests actually fail so that this is
recorded in the JunitXML report.
2023-08-22 16:11:36 +01:00
deeplow
eb16285790
Replace container output command prefix ">>>"
In the junitxml this prefix would look ugly ("&gt&gt&gt") because it has
to escape any non-xml tags.
2023-08-22 16:11:35 +01:00
deeplow
48b2e7bc3c
Log command to debug log for traceback purposes
Log commands so we can trace back which errors / outputs are from each
command.
2023-08-22 16:11:34 +01:00
deeplow
95cef8cf0a
Containers: capture conversion logs
Store the conversion log to a file (captured-output.txt) in the
container and when in development mode, have its output displayed on the
terminal output.
2023-08-22 16:11:26 +01:00
deeplow
874b8865e2
Qubes: strategy for capturing conversion logs
Use qrexec stdout to send conversion data (pixels) and stderr to send
conversion progress at the end of the conversion. This happens
regardless of whether or not the conversion is in developer mode or not.

It's the client that decides if it reads the debug data from stderr or
not. In this case, it only reads it if developer mode is enabled.
2023-08-22 16:11:20 +01:00
Alex Pyrgiotis
e3a8a651f1
Disable HWP / HWPX conversion on MacOS M1 / Qubes
The HWP / HWPX conversion feature does not work on the following
platforms:

* MacOS with Apple Silicon CPU
* Native Qubes OS

For this reason, we need to:

1. Disable it on the GUI side, by not allowing the user to select these
   files.
2. Throw an error on the isolation provider side, in case the user
   directly attempts to convert the file (either through CLI or via
   "Open With").

Refs #494
Refs #498
2023-08-05 16:50:49 +01:00
Alex Pyrgiotis
bc83341d2a
conversion: Detect when LibreOffice silently fails
Sometimes, LibreOffice returns with status code 0, but in reality, it
fails. It doesn't create a file, and Dangerzone does not detect this.
What happens next is that it fails in the next command, and throws an
unrelated error.

Detect that LibreOffice fails, by checking if the output file exists,
after the PDF conversion.
2023-08-05 16:50:47 +01:00