Commit graph

1375 commits

Author SHA1 Message Date
Alex Pyrgiotis
fff7be7535
WIP for progress report 2024-10-15 19:38:48 +03:00
Alex Pyrgiotis
6b658812f0
FIXUP: Include more tessdata dirs 2024-10-09 22:30:58 +03:00
Alex Pyrgiotis
073a6e69f8
FIXUP: Fix lint and log to stderr 2024-10-09 22:30:47 +03:00
Alex Pyrgiotis
297fe5ecdd
FIXUP: Make run-tests CI job require cached tessdata 2024-10-09 22:15:36 +03:00
Alex Pyrgiotis
d31a10f9ec
FIXUP: Fix lint errors 2024-10-09 22:04:39 +03:00
Alex Pyrgiotis
d9eaec4a9a
FIXUP: Make _convert a public method 2024-10-09 22:03:46 +03:00
Alex Pyrgiotis
149ba235d9
FIXUP: Fix a deprecation warning for filter= 2024-10-09 21:55:01 +03:00
Alex Pyrgiotis
c37ff7322d
FIXUP: Replace print statements with logging 2024-10-09 21:54:42 +03:00
Alex Pyrgiotis
8db9261ccf
Revert "FIXUP: Factor out git_root"
This reverts commit 6a5b6e4249.
2024-10-09 20:22:52 +03:00
Alex Pyrgiotis
b3d8ddc086
FIXUP: Detect proper tessdata dir for Linux systems 2024-10-09 20:21:31 +03:00
Alex Pyrgiotis
6a5b6e4249
FIXUP: Factor out git_root 2024-10-09 20:03:44 +03:00
Alex Pyrgiotis
e690b2503f
FIXUP: Add tesseract-ocr-all as a required dependency for Debian 2024-10-09 19:16:51 +03:00
Alex Pyrgiotis
79f6fccb0a
FIXUP: debian: Explain why we ignore share/tessdata 2024-10-09 19:08:19 +03:00
Alex Pyrgiotis
09bb12593a
FIXUP: Use pathlib.Path for newer code 2024-10-09 18:57:33 +03:00
Alex Pyrgiotis
80e972b456
ci: Check OCR in Debian/Fedora tests 2024-10-09 18:56:41 +03:00
Alex Pyrgiotis
3e12aa3dfd
FIXUP: Fix progress percentages 2024-10-09 18:36:29 +03:00
Alex Pyrgiotis
4c7db48c59
FIXUP: Use 'in' instead of '==' 2024-10-09 18:26:06 +03:00
Alex Pyrgiotis
1302a1fcbf
tests: Improve test for top-level conversion errors 2024-10-08 19:15:00 +03:00
Alex Pyrgiotis
328ddbe5be
tests: Remove provider_wait fixtures 2024-10-08 19:15:00 +03:00
Alex Pyrgiotis
eddc06b436
Make Dummy isolation provider more realistic
Make the Dummy isolation provider follow the rest of the isolation
providers and perform the second part of the conversion on the host. The
first part of the conversion is just a dummy script that reads a file
from stdin and prints pixels to stdout.
2024-10-08 19:15:00 +03:00
Alex Pyrgiotis
2c08b5f9c3
Remove dead docs 2024-10-08 19:15:00 +03:00
Alex Pyrgiotis
1ab3aab08e
Remove dead code 2024-10-08 19:15:00 +03:00
Alex Pyrgiotis
62c32673f1
Update the way we get debug logs
Move the logic for grabbing debug logs to a new place, now that we have
merged the two conversion stages (doc to pixels, pixels to PDF).
2024-10-08 19:15:00 +03:00
Alex Pyrgiotis
bae637c974
Perform on-host pixels to PDF conversion
Extend the base isolation provider to immediately convert each page to
a PDF, and optionally use OCR. In contrast with the way we did things
previously, there are no longer two separate stages (document to pixels,
pixels to PDF). We now handle each page individually, for two main
reasons:

1. We don't want to buffer pixel data, either on disk or in memory,
   since they take a lot of space, and can potentially leave traces.
2. We can perform these operations in parallel, saving time. This is
   more evident when OCR is not used, where the times to convert a page
   to pixels, and then back to a PDF, are comparable.
2024-10-08 19:15:00 +03:00
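The per-page flow described in commit bae637c974 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `render_pages` and `page_to_pdf` are hypothetical callables standing in for the sandboxed document-to-pixels step and the on-host pixels-to-PDF step. A queue of size one keeps at most one page of pixel data waiting in memory, while the next page is rendered in parallel with the conversion of the current one.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the page stream


def stream_convert(render_pages, page_to_pdf):
    """Convert pages one at a time instead of buffering all pixel data.

    `render_pages` (hypothetical) is a callable returning an iterator of
    per-page pixel buffers; `page_to_pdf` (hypothetical) turns one buffer
    into a PDF page object.
    """
    pages = queue.Queue(maxsize=1)  # at most one rendered page waits here

    def producer():
        try:
            for pixels in render_pages():
                pages.put(pixels)  # blocks until the consumer catches up
        finally:
            pages.put(_DONE)  # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    pdf_pages = []
    while True:
        item = pages.get()
        if item is _DONE:
            break
        pdf_pages.append(page_to_pdf(item))  # runs while the next page renders
    worker.join()
    return pdf_pages
```

The bounded queue is the design choice that matters here: it trades a little throughput for a hard cap on how much pixel data exists at any moment.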
Alex Pyrgiotis
afe8179a51
Update .deb/.rpm dependencies
Update .deb/.rpm specs to include PyMuPDF as a required package.
2024-10-08 19:15:00 +03:00
Alex Pyrgiotis
0f2be58167
Make PyMuPDF a main Dangerzone dependency
The PyMuPDF package was previously mainly used within the Dangerzone
container, as well as on Qubes. With on-host conversion, PyMuPDF will be
used in all supported platforms by default. For this reason, we can
promote it to a main dependency.
2024-10-08 19:14:59 +03:00
Alex Pyrgiotis
b9e5c59520
Add new way to detect tessdata dir
Add a new way to detect where the Tesseract data are stored on a user's
system. On Linux, the Tesseract data should be installed via the package
manager. On macOS and Windows, they should be bundled with the
Dangerzone application.

There is also the exception of running Dangerzone locally, where even
on Linux, we should get the Tesseract data from the Dangerzone share/
folder.
2024-10-08 19:14:59 +03:00
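The lookup rules described in commit b9e5c59520 might be sketched like this. The function names and the concrete Linux paths are assumptions made for illustration (the commit message does not list them), so treat this as a sketch of the decision logic only.

```python
from pathlib import Path

# Assumed, illustrative locations where Linux package managers may install
# Tesseract language data. The real list used by the project may differ.
ASSUMED_LINUX_TESSDATA_DIRS = [
    Path("/usr/share/tessdata"),
    Path("/usr/share/tesseract-ocr/5/tessdata"),
]


def candidate_tessdata_dirs(system, running_from_source, share_dir):
    """Return the directories to probe, in order of preference.

    `system` is a value such as platform.system() returns
    ("Linux", "Darwin", "Windows").
    """
    # macOS/Windows builds bundle the data; a local source checkout uses
    # the share/ folder even on Linux.
    if running_from_source or system in ("Darwin", "Windows"):
        return [Path(share_dir) / "tessdata"]
    return list(ASSUMED_LINUX_TESSDATA_DIRS)


def find_tessdata_dir(system, running_from_source, share_dir):
    """Return the first candidate directory that exists, or raise."""
    for candidate in candidate_tessdata_dirs(system, running_from_source, share_dir):
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError("No Tesseract data directory found")
```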
Alex Pyrgiotis
c6475ed526
Ignore tesseract data when building DEB/RPM packages 2024-10-08 19:14:59 +03:00
Alex Pyrgiotis
84a4ae7fdd
ci: Add GitHub action for tessdata 2024-10-08 19:14:59 +03:00
Alex Pyrgiotis
23caf9faf7
Update build instructions 2024-10-08 19:14:59 +03:00
Alex Pyrgiotis
0921cc23e7
Add script for downloading Tesseract data
Add a Python script that can run on all supported platforms, and can
download and extract the Tesseract language data from GitHub, while
also:

1. Checking that the expected hash matches.
2. Informing the user if the language data have already been downloaded.
3. Extracting only the subset of language data that Dangerzone needs.
2024-10-08 19:10:02 +03:00
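The three checks listed in commit 0921cc23e7 could look roughly like the sketch below. All function names are hypothetical, and the choice of SHA-256 and of `<lang>.traineddata` file names are assumptions for illustration; the download step itself is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError if the file does not match the expected hash."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"Hash mismatch for {path}: got {actual}")


def already_downloaded(dest_dir, languages):
    """Return True if every wanted language file is already present."""
    return all((Path(dest_dir) / f"{lang}.traineddata").is_file() for lang in languages)


def wanted_files(archive_names, languages):
    """Pick only the archive members for the languages we need."""
    wanted = {f"{lang}.traineddata" for lang in languages}
    return [name for name in archive_names if name.rsplit("/", 1)[-1] in wanted]
```

A caller would typically check `already_downloaded` first, then download, call `verify_download`, and finally extract only the members returned by `wanted_files`.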
Alex Pyrgiotis
6547998633
Provide sanitized version of output filename 2024-10-08 19:10:02 +03:00
Alex Pyrgiotis
17fa82297e
Better way to collect tests 2024-10-08 19:10:02 +03:00
Alex Pyrgiotis
54dc22c410
ci: Be explicit about the Debian package we install in end-user envs 2024-10-08 19:10:02 +03:00
Alex Pyrgiotis
25ac980b0b
FIXUP: Fix for vendoring PyMuPDF 2024-10-08 19:10:02 +03:00
Alex Pyrgiotis
6fd0f925a8
FIXUP: Fix a lint 2024-10-08 13:34:33 +03:00
Alex Pyrgiotis
30b4f24d77
FIXUP: Use the proper pip argument 2024-10-08 13:34:33 +03:00
Alex Pyrgiotis
e027d853c2
FIXUP: Implement review comments 2024-10-08 13:34:33 +03:00
Alex Pyrgiotis
07921566ba
FIXUP: Make Dockerfile work with latest wheels 2024-10-08 13:34:33 +03:00
Alex Pyrgiotis
eef4e8b548
debian: Vendor PyMuPDF when building Debian package
Install PyMuPDF under ./dangerzone/vendor, right before we build the
.deb package. We vendor PyMuPDF just for Debian, since the provided
versions don't have OCR support enabled.

Currently, we don't use PyMuPDF on the host, but this will change once
we fully implement the on-host conversion feature.

Refs #625
2024-10-08 13:34:32 +03:00
Alex Pyrgiotis
ed55124a8b
Add an import preference for vendored packages
Prefer importing packages from ./dangerzone/vendor, if there is one
there, instead of using the system ones.
2024-10-08 13:34:32 +03:00
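One common way to get the import preference described in commit ed55124a8b is to put the vendor directory at the front of `sys.path`. The sketch below assumes that approach; `prefer_vendored` is a hypothetical helper name, and the commit itself may implement the preference differently.

```python
import sys
from pathlib import Path


def prefer_vendored(package_dir):
    """Put <package_dir>/vendor first on sys.path, if that directory exists.

    Returns True when the vendor directory was found and prioritized,
    False when there is nothing vendored and system packages are used.
    """
    vendor = Path(package_dir) / "vendor"
    if not vendor.is_dir():
        return False
    vendor_str = str(vendor)
    # Avoid duplicate entries if this is called more than once.
    if vendor_str in sys.path:
        sys.path.remove(vendor_str)
    sys.path.insert(0, vendor_str)
    return True
```

This has to run before the first import of the vendored package, for example at the top of the package's `__init__.py`, because modules that are already imported are served from `sys.modules` regardless of `sys.path`.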
Alex Pyrgiotis
f61097e9b3
install: Add script for vendoring PyMuPDF
Add a script that installs PyMuPDF under ./dangerzone/vendor. This will
be useful in subsequent commits, for vendoring PyMuPDF when building
Debian packages.
2024-10-08 13:34:32 +03:00
Alex Pyrgiotis
c22f945614
dev_scripts: Install pip in dev environments
Install pip in dev environments, so that we can use it to vendor
PyMuPDF in subsequent commits.
2024-10-08 13:34:32 +03:00
Alex Pyrgiotis
892dfaf1bc
Bump our Poetry dependencies 2024-10-08 13:34:32 +03:00
Alex Pyrgiotis
00711fa9e2
Add missing .pybuild dir in .gitignore 2024-10-08 13:34:32 +03:00
Alex Pyrgiotis
93b960cd23
Bump H2ORestart to version 0.6.6
Follow Debian's lead [1] and bump this version to 0.6.6. This change
should bring some stability improvements to our CI tests as well.

[1]: https://packages.debian.org/unstable/text/libreoffice-h2orestart
2024-10-07 18:36:06 +03:00
bnewc
752eff02d8
Prevent user from using illegal characters in output filename
Add some checks in the Dangerzone GUI and CLI that will prevent a user
from mistakenly adding illegal characters in the output filename.
2024-10-07 18:04:47 +03:00
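A check like the one described in commit 752eff02d8 might look as follows. The commit message does not say which characters count as illegal, so the set below (characters that Windows rejects in file names, plus ASCII control characters) is an assumption, and the function names are hypothetical.

```python
# Assumed set: characters that Windows does not allow in file names.
ASSUMED_ILLEGAL_CHARS = frozenset('<>:"/\\|?*')


def illegal_chars_in(filename):
    """Return the sorted, de-duplicated illegal characters found in a name."""
    found = {
        ch
        for ch in filename
        if ch in ASSUMED_ILLEGAL_CHARS or ord(ch) < 32  # control characters
    }
    return sorted(found)


def validate_output_filename(filename):
    """Raise ValueError with a readable message if the name is not usable."""
    bad = illegal_chars_in(filename)
    if bad:
        shown = ", ".join(repr(ch) for ch in bad)
        raise ValueError(f"Illegal characters in output filename: {shown}")
```

Reporting the offending characters, rather than silently stripping them, lets both a GUI and a CLI show the user exactly what to change.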
Alex Pyrgiotis
275189587e
tests: Test termination logic under default conditions
Do not use the `provider_wait` fixture in our termination logic tests,
and switch instead to the `provider` fixture, which instantiates a
typical isolation provider.

The `provider_wait` fixture's goal was to emulate how the process would
behave if it had fully spawned. In practice, this masked some
termination logic issues that became apparent in the WIP on-host
conversion PR. Now that we kill the spawned process via its process
group, we can just use the default isolation provider in our tests.

In practice, in this PR we just do `s/provider_wait/provider`, and
remove some stale code.
2024-10-07 17:37:57 +03:00
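Killing a spawned process via its process group, as mentioned in commit 275189587e, can be sketched on POSIX systems as below. This is a generic illustration and not the project's termination code: the helper names and the timeout are assumptions, and `os.killpg` is not available on Windows.

```python
import os
import signal
import subprocess


def spawn_in_new_group(cmd):
    """Start `cmd` as the leader of a new session (and process group)."""
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, timeout=5.0):
    """Signal the whole process group, escalating to SIGKILL if needed.

    Because the child leads its own group, its group ID equals its PID,
    so any helpers it spawned receive the signal too.
    """
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group is already gone
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()
```

Signalling the group works whether or not the child has finished starting up, which is why a test no longer needs a fixture that waits for the process to be fully spawned.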
Alex Pyrgiotis
b5130b08b6
tests: Improve Dummy provider tests
Add a fixture that returns our stock Dummy provider. Also, explicitly
use a blocking Dummy provider (`DummyWait`) for a specific test case.
This will prove useful when we stop using the `provider_wait` variant of
our isolation providers in the next commits.
2024-10-07 17:37:42 +03:00
Alex Pyrgiotis
dc8a22c8e7
Fix the dummy provider
Make the dummy provider behave a bit more like the other providers, with
a proper function and termination logic. This will be helpful soon in
the tests.
2024-10-07 17:37:42 +03:00