Commit graph

81 commits

Author SHA1 Message Date
a49e4e92a7
docs: document why get_tmp_dir is required in the imports 2024-06-05 13:04:35 +02:00
7b0384d682
fix: do not catch bare exceptions
Bare excepts will catch keyboard-exit exceptions, system-exit etc. which
is probably not what we want.
2024-06-05 13:04:34 +02:00
9e18be4355
chore: minor linting
A few minor changes about when to use `==` and when to use `is`.
Basically, this uses `is` for booleans, and `==` for other values.

With a few other changes about coding style which was enforced by
`ruff`.
2024-06-05 13:04:34 +02:00
0535022020
chore: remove unused code
This commit removes code that's not being used, it can be exceptions
with the `as e` where the exception itself is not used, the same with
`with` statements, and some other parts where there were duplicated
code.
2024-06-05 13:04:34 +02:00
a1cfc96118
chore(imports): remove useless imports
As detected by [ruff](https://github.com/astral-sh/ruff)

Related to #254, although it doesn't provide the command to lint the
codebase itself.
2024-06-05 13:04:33 +02:00
Alex Pyrgiotis
ff25fa3045
Fix stuck conversion processes
Gracefully terminate certain conversion processes that may get stuck
when writing lots of data to stdout. Also, handle a race condition when
a conversion process terminates slightly after the associated container.

Fixes #791
2024-05-09 16:46:15 +03:00
Alex Pyrgiotis
37bf9badf4
Remove extraneous log sanitization
Remove an extra call to `replace_control_chars()`, as well as an
unnecessary method.
2024-05-09 15:57:42 +03:00
Alex Pyrgiotis
0b45360384
Keep newlines when reading debug logs
In d632908a44 we improved our
`replace_control_chars()` function, by replacing every control or
invalid Unicode character with a placeholder one. This change, however,
made our debug logs harder to read, since newlines were not preserved.

There are indeed various cases in which replacing newlines is wise
(e.g., in filenames), so we should keep this behavior by default.
However, specifically for reading debug logs, we add an option to keep
newlines to improve readability, at no expense to security.
2024-05-09 15:57:42 +03:00
Alex Pyrgiotis
d6202cd028
Invoke external command on Windows properly
On Windows, if we don't use the `startupinfo=` argument of
subprocess.Popen, then a terminal window will flash while running the
command.

Use `startupinfo=` when killing a container, as we do for every other
command.
2024-05-09 15:57:42 +03:00
Alex Pyrgiotis
f57d2f7191
isolation_provider: Always terminate spawned process
Previously, we always assumed that the spawned process would quit
within 3 seconds. This was an arbitrary call, and did not work in
practice.

We can improve our standing here by doing the following:

1. Make `Popen.wait()` calls take a generous amount of time (since they
   are usually on the sad path), and handle any timeout errors that they
   throw. This way, a slow conversion process cleanup does not take too
   much of our users time, nor is it reported as an error.
2. Always make sure that once the conversion of doc to pixels is over,
   the corresponding process will finish within a reasonable amount of
   time as well.

Fixes #749
2024-04-24 14:39:15 +03:00
Alex Pyrgiotis
cd4cbdb00a
isolation_provider: Get exit code without timing out
Get the exit code of the spawned process for the doc-to-pixels phase,
without timing out. More specifically, if the spawned process has not
finished within a generous amount of time (hardcode to 15 seconds),
return UnexpectedConversionError, with a custom message.

This way, the happy path is not affected, and we still make our best to
learn the underlying cause of the I/O error.
2024-04-24 14:36:14 +03:00
Alex Pyrgiotis
171a7eca52
isolation_provider: Terminate doc-to-pixels proc
Extend the IsolationProvider class with a
`terminate_doc_to_pixels_proc()` method, which must be implemented by
the Qubes/Container providers and gracefully terminate a process started
for the doc to pixels phase.

Refs #563
2024-04-24 14:36:14 +03:00
Alex Pyrgiotis
a63f4b85eb
isolation_provider: Set a unique name for spawned containers
Set a unique name for spawned containers, based on the ID of the
provided document. This ID is not globally unique, as it has few bits of
entropy.  However, since we only want to avoid collisions within a
single Dangerzone invocation, and since we can't support multiple
containers running in parallel, this ID will suffice.
2024-04-24 14:33:33 +03:00
Alex Pyrgiotis
6850d31edc
isolation_provider: Pass doc when creating doc-to-pixels proc
Pass the Document instance that will be converted to the
`IsolationProvider.start_doc_to_pixels_proc()` method. Concrete classes
can then associate this name with the started process, so that they can
later on kill it.
2024-04-24 14:33:33 +03:00
Alex Pyrgiotis
a31f3370d0
Capture missing logs in second-stage conversion
For a while now, we didn't get logs for the second-stage conversion
when using containers. Extend the code to log any captured output from
the second stage conversion, only if we run Dangerzone via our dev
entrypoint.

Note that the Qubes isolation provider was always logging output from
the second stage of the conversion.
2024-03-13 20:59:50 +02:00
deeplow
0449840ec3
dz.ConvertDev: do not teleport .pyc files
On Qubes the conversion in dev mode would fail when converting from a
Fedora 38 development qube via a Fedora 39 disposable qube. The reason
was that dz.ConvertDev was receiving `.pyc` files, which were compiled
for python 3.11 but running on python 3.12.

Unfortunately PyZipFile objects cannot send source python files, even
though the documentation is a little bit unclear on this [1].

Fixes #723

[1]: https://docs.python.org/3/library/zipfile.html#pyzipfile-objects
2024-03-13 07:13:39 +00:00
Alex Pyrgiotis
634523dac9
Get underlying error when conversion fails
When we get an early EOF from the converter process, we should
immediately get the exit code of that process, to find out the actual
underlying error. Currently, the exception we raise masks the underlying
error.

Raise a ConverterProcException, that in turns makes our error handling
code read the exit code of the spawned process, and converts it to a
helpful error message.

Fixes #714
2024-02-20 15:55:45 +02:00
Alex Pyrgiotis
6ee1d14c9a
Start conversion process earlier
Start the conversion process earlier, so that we have a reference to the
Popen object in case of an exception.
2024-02-20 15:55:45 +02:00
deeplow
e4a5dbce46
Don't show 50% duplicated progress info
50% would show twice in the conversion progress due to an overlap in
conversion progress values. The doc_to_pixels would be from 0-50% and
the pixels_to_pdf from 50%-100%.

This commit makes the first part go from 0 to 49% instead.

Fixes #715
2024-02-20 13:47:15 +00:00
deeplow
879fca6f9f
Remove uneeded TESSDATA_PREFIX setting in container
The container image does not need the TESSDATA_PREFIX env variable since
its PyMuPDF version is new enough to support `tessdata` as an argument
when calling the PyMuPDF tesseract method.
2024-02-07 13:14:08 +00:00
deeplow
69c2a02d81
Remove timeouts
Remove timeouts due to several reasons:

1. Lost purpose: after implementing the containers page streaming the
   only subprocess we have left is LibreOffice. So don't have such a
   big risk of commands hanging (the original reason for timeouts).

2. Little benefit: predicting execution time is generically unsolvable
   computer science problem. Ultimately we were guessing an arbitrary
   time based on the number of pages and the document size. As a guess
   we made it pretty lax (30s per page or MB). A document hanging for
   this long will probably lead to user frustration in any case and the
   user may be compelled to abort the conversion.

3. Technical Challenges with non-blocking timeout: there have been
several technical challenges in keeping timeouts that we've made effort
to accommodate. A significant one was having to do non-blocking read to
ensure we could timeout when reading conversion stream (and then used
here)

Fixes #687
2024-02-06 20:11:43 +00:00
deeplow
07dd54cd13
Fix hanging: disable container logging
The conversion was hanging arbitrarily [1] on some systems. Sometimes it
would send the full page other times stop half-way.

Originally found by @apyrgio.

Co-authored-by: @apyrgio

[1]: https://github.com/freedomofpress/dangerzone/pull/627#issuecomment-1892491968
2024-02-06 19:42:41 +00:00
deeplow
f3032a7142
Make big endian explicit in int to bytes
Fix issues in older distros that don't yet support python 3.11 where
endianness was not a default argument [1]. This is in response to CI
failures [2].

[1]: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
[2]: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/2186/workflows/e340ca21-85ce-42b6-9bc3-09e66f96684a/jobs/27380y
2024-02-06 19:42:41 +00:00
deeplow
1835756b45
Allow each conversion to have its own proc
If we increased the number of parallel conversions, we'd run into an
issue where the streams were getting mixed together. This was because
the Converter.proc was a single attribute. This breaks it down into a
local variable such that this mixup doesn't happen.
2024-02-06 19:42:41 +00:00
deeplow
61e7a3c107
Fix isolation provider tests
Conversions methods had changed and that was part of the reason why
the tests were failing. Furthermore, due to the `provider.proc`, which
stores the associated qrexec / container process, "server" exceptions
raise a IterruptedConversion error (now ConverterProcException), which
then requires interpretation of the process exit code to obtain the
"real" exception.
2024-02-06 19:42:41 +00:00
deeplow
550786adfe
Remove untrusted progress parsing (stderr instead)
Now that only the second container can send JSON-encoded progress
information, we can the untrusted JSON parsing. The parse_progress was
also renamed to `parse_progress_trusted` to ensure future developers
don't mistake this as a safe method.

The old methods for sending untrusted JSON were repurposed to send the
progress instead to stderr for troubleshooting in development mode.

Fixes #456
2024-02-06 19:42:40 +00:00
deeplow
c991e530d0
Fix IsolationProvider.percentage variable reuse
If one converted more than one document, since the state of
IsolationProvider.percentage would be stored in the IsolationProvider
instance, it would get reused for the second document. The fix is to
keep it as a local variable, but we can explore having progress stored
on the document itself, for example. Or having one IsolationProvider per
conversion.
2024-02-06 19:42:40 +00:00
deeplow
0a099540c8
Stream pages in containers: merge isolation providers
Merge Qubes and Containers isolation providers core code into the class
parent IsolationProviders abstract class.

This is done by streaming pages in containers for exclusively in first
conversion process. The commit is rather large due to the multiple
interdependencies of the code, making it difficult to split into various
commits.

The main conversion method (_convert) now in the superclass simply calls
two methods:
  - doc_to_pixels()
  - pixels_to_pdf()

Critically, doc_to_pixels is implemented in the superclass, diverging
only in a specialized method called "start_doc_to_pixels_proc()". This
method obtains the process responsible that communicates with the
isolation provider (container / disp VM) via `podman/docker` and qrexec
on Containers and Qubes respectively.

Known regressions:
  - progress reports stopped working on containers

Fixes #443
2024-02-06 19:42:33 +00:00
deeplow
331b6514e8
Containers: remove debug messages (via files)
Remove container_log messages ahead of debug info being sent over
standard streams.
2024-02-06 18:54:39 +00:00
deeplow
dca46d0a6b
Homogenize qubes and containers inner convert method
Simple rename of the __convert() method in the Qubes conversion to make
the code structurally similar.
2024-02-06 18:54:31 +00:00
deeplow
77d5ea5940
Add PyMuPDF in pixels_to_pdf replacing old logic
Adding PyMuPDF essentially make the code much simpler since it can do
everything that we'd need multiple programs for. It also includes
tesseract-OCR integration, which this commit makes use of.
2024-01-03 12:56:33 +00:00
Alex Pyrgiotis
edfba0c783
Qubes: Fix progress in first stage of Qubes conversion 2023-10-13 22:44:37 +03:00
Alex Pyrgiotis
3daf0e2cb7
Do not show file previews in case of exceptions
If a Qubes conversion encounters an exception that is not a subclass of
ConversionException, it will still show a preview of a file that does
not exist.

Send an error progress report in that case, so that the GUI code can
detect that an error occurred and not open a file preview

Fixes #581
2023-10-05 11:11:42 +03:00
Alex Pyrgiotis
bdf3f8babc
qubes: Clean up temporary files
Create a temporary dir before the conversion begins, and store every
file necessary for the conversion there. We are mostly concerned about
the second stage of the conversion, which runs in the host. The first
stage runs in a disposable qube and cleanup is implicit.

Fixes #575
Fixes #436
2023-10-04 14:05:23 +03:00
Alex Pyrgiotis
6232062146
Add missing newline char 2023-10-02 15:41:29 +03:00
Alex Pyrgiotis
b7b76174ab
qubes: Log captured output for the second stage
Log the captured command output during the second stage, only in dev
environments. This follows what we have already done for the first
stage.
2023-10-02 15:41:29 +03:00
Alex Pyrgiotis
16603875d6
qubes: Display all errors in second stage
If a command encounters an error or times out during the second stage of
the conversion in Qubes, handle it the same way as we would have handled
it in the first stage:

1. Get its error message.
2. Throw an UnexpectedConversionError exception, with the original
   message.

Note that, because the second stage takes place locally, users will see
the original content of the error.

Refs #567
Closes #430
2023-10-02 15:41:17 +03:00
deeplow
0a6b33ebed
Qubes: detect qube failing to start (missing RAM)
In Qubes OS it's often the case that the user doesn't have enough
RAM to start the conversion. In this case it raises BrokenPipeException
and exits with code 126.

It didn't seem possible to distinguish this kind of failure to one
where the user has misconfigured qrexec policies.

NOTE: this approach is not ideal UX-wise. After the first doc failing
the next one will also try and fail. Upon first failure we should
inform the user that they need to close some programs or qubes.
2023-09-28 11:08:50 +01:00
deeplow
63f03d5bcd
Add limit and test to max width and height of docs 2023-09-28 11:08:47 +01:00
deeplow
54b8ffbf96
Add page limit of 10000
Theoretically the max pages would be 65536 (2byte unsigned int.
However this limit is much higher than practical documents have
and larger ones can lead to unforseen problems, for example RAM
limitations.

We thus opted to use a lower limit of 10K. The limit must be
detected client-side, given that the server is distrusted. However
we also check it in the server, just as a fail-early mechanism.
2023-09-28 11:01:14 +01:00
Alex Pyrgiotis
18b73d94b0
qubes: Find out reason of interrupted conversions
If a conversion has been interrupted (usually due to an EOF), figure out
why this happened by checking the exit code of the spawned process.
2023-09-26 17:35:26 +03:00
Alex Pyrgiotis
30196ff35b
errors: Add error for interrupted conversions
Add an error for interrupted conversions, in order to better
differentiate this scenario from other ValueErrors that may be raised
throughout the code's lifetime.
2023-09-26 17:35:26 +03:00
Alex Pyrgiotis
0273522fb1
qubes: Store the process for the spawned qube
Store, in an instance attribute, the process that we have started for
the spawned disposable qube. In subsequent commits, we will use it from
other places as well, aside from the `_convert` method.

Note that this commit does not alter the conversion logic, and only does
the following:
1. Renames `p.` to `self.proc.`
2. Adds an `__init__` method to the Qubes isolation provider, and
   initializes the `self.proc` attribute to `None`.
3. Adds an assert that `self.proc` is not `None` after it's spawned, to
   placate Mypy.
2023-09-26 17:35:25 +03:00
deeplow
e08b6defc3
Round conversion progress from float to int
Fixes #553
2023-09-26 15:20:41 +01:00
deeplow
8d37ff15e0
Remove duplicated Qubes message: "Safe PDF Created"
Fixes #555.  This is a leftover from when we didn't have progress
reports from the second stage conversion (AKA. pixels to PDF) in #429.
2023-09-26 12:16:48 +01:00
Alex Pyrgiotis
e64d1da61f
qubes: Pass OCR parameters properly
Pass OCR parameters to conversion functions as arguments, instead of
setting environment variables.

Fixes #455
2023-09-20 18:04:40 +03:00
Alex Pyrgiotis
8a0c0a4673
Make parameter actually optional 2023-09-20 17:58:39 +03:00
Alex Pyrgiotis
99dd5f5139
qubes: Add client-side timeouts
Extend the client-side capabilities of the Qubes isolation provider, by
adding client-side timeout logic.

This implementation brings the same logic that we used server-side to
the client, by taking into account the original file size and the number
of pages that the server returns.

Since the code does not have the exact same insight as the server has,
the calculated timeouts are in two places:

1. The timeout for getting the number of pages. This timeout takes into
   account:
   * the disposable qube startup time, and
   * the time it takes to convert a file type to PDF
2. The total timeout for converting the PDF into pixels, in the same way
   that we do it on the server-side.

Besides these changes, we also ensure that partial reads (e.g., due to
EOF) are detected (see exact=... argument)

Some things that are not resolved in this commit are:
* We have both client-side and server-side timeouts for the first phase
  of the conversion. Once containers can stream data back to the
  application (see #443), these server-side timeouts can be removed.
* We do not show a proper error message when a timeout occurs. This will
  be part of the error handling PR (see #430)

Fixes #446
Refs #443
Refs #430
2023-09-20 17:32:42 +03:00
Alex Pyrgiotis
55a4491ced
Consolidate import statements 2023-09-20 17:14:24 +03:00
deeplow
b4c3e07d36
Remove attacker-controlled error messages
Creates exceptions in the server code to be shared with the client via an
identifying exit code. These exceptions are then reconstructed in the
client.

Refs #456 but does not completely fix it. Unexpected exceptions and
progress descriptions are still passed in Containers.
2023-09-19 15:33:20 +01:00