Extend the base isolation provider to immediately convert each page to
a PDF, and optionally use OCR. In contract with the way we did things
previously, there are no more two separate stages (document to pixels,
pixels to PDF). We now handle each page individually, for two main
reasons:
1. We don't want to buffer pixel data, either on disk or in memory,
since they take a lot of space, and can potentially leave traces.
2. We can perform these operations in parallel, saving time. This is
more evident when OCR is not used, where the time to convert a page
to pixels, and then back to a PDF are comparable.
The PyMuPDF package was previously mainly used within the Dangerzone
container, as well as on Qubes. With on-host conversion, PyMuPDF will be
used in all supported platforms by default. For this reason, we can
promote it to a main dependency.
Add a new way to detect where the Tesseract data are stored in a user's
system. On Linux, the Tesseract data should be installed via the package
manager. On macOS and Windows, they should be bundled with the
Dangerzone application.
There is also the exception of running Dangerzone locally, where even
on Linux, we should get the Tesseract data from the Dangerzone share/
folder.
Add a Python script that can run in all supported platforms, and can
download and extract the Tesseract language data from GitHub, while
also:
1. Checking that the expected hash matches.
2. Informing the user if the language data have already been downloaded.
3. Extracting only the subset of language data that Dangerzone needs
Install PyMuPDF under ./dangerzone/vendor, right before we build the
.deb package. We vendor PyMuPDF just for Debian, since the provided
versions don't have OCR support enabled.
Currently, we don't use PyMuPDf on the host, but this will change once
we fully implement the on-host conversion feature.
Refs #625
Add a script that installs PyMuPDF under ./dangerzone/vendor. This will
be useful in subsequent commits, for vendoring PyMuPDF when building
Debian packages.
Do not use the `provider_wait` fixture in our termination logic tests,
and switch instead to the `provider` fixture, which instantiates a
typical isolation provider.
The `provider_wait` fixture's goal was to emulate how would the process
behave if it had fully spawned. In practice, this masked some
termination logic issues that became apparent in the WIP on-host
conversion PR. Now that we kill the spawned process via its process
group, we can just use the default isolation provider in our tests.
In practice, in this PR we just do `s/provider_wait/provider`, and
remove some stale code.
Add a fixture that returns our stock Dummy provider. Also, explicitly
use a blocking Dummy provider (`DummyWait`) for a specific test case.
This will prove useful when we stop using the `provider_wait` variant of
our isolation providers in the next commits.
Make the dummy provider behave a bit more like the other providers, with
a proper function and termination logic. This will be helpful soon in
the tests.
Instead of killing just the invoked Podman/Docker/qrexec process, kill
the whole process group, to make sure that other components that have
been spawned die as well. In the case of Podman, conmon is one of the
processes that lingers, so that's one way to kill it.
Start the conversion process in a new session, so that we can later on
kill the process group, without killing the controlling script (i.e.,
the Dangezone UI). This should not affect the conversion process in any
other way.
GitHub provides an Intel macOS runner as `macos-13`. Add it alongside
our M1 macOS runner (`macos-latest`), in order to cover all of our
target environments.
It seems that we need to specify that Python files have LF line endings
on Windows environments, else they will get converted to CRLF. If this
happens, then the container image we build in this environment will have
Python files with wrong endings, and tests will break.
Refs #838 for previous attempt.
Make sure that the Debian package we build conforms to the expected
naming scheme else, it's possible that something is off. A scenario
we've encountered is bumping `share/version.txt`, but not
`debian/changelog`, which would create a Debian package with an older
version.
As part of this change, the dev (build) and end-user test images names
changed from `dangerzone.rocks/*` to `ghcr.io`.
A new `--sync` option is provided in the `env.py` command, in order to
retrieve the images from the registry, or build and upload otherwise.