Commit graph

6 commits

Author SHA1 Message Date
Alex Pyrgiotis
28b7249a6a
Add new way to detect tessdata dir
Add a new way to detect where the Tesseract data are stored in a user's
system. On Linux, the Tesseract data should be installed via the package
manager. On macOS and Windows, they should be bundled with the
Dangerzone application.

There is also the exception of running Dangerzone locally, where even
on Linux, we should get the Tesseract data from the Dangerzone share/
folder.
2024-10-17 15:50:11 +03:00
Alex Pyrgiotis
4ea0650f42
tests: Skip a test for missing OCR files on Qubes
We have a container-specific test that deals with missing OCR files in
the container image. This test _can_ be run under Qubes, and it may
fail since it requires Podman.

Make the pytest guard more strict and don't allow running this test on
Qubes.

Also, fix a typo in the word "omission".
2024-06-27 22:11:50 +03:00
Etienne Perot
f03bc71855
Sandbox all Dangerzone document processing within gVisor.
This wraps the existing container image inside a gVisor-based sandbox.

gVisor is an open-source OCI-compliant container runtime.
It is a userspace reimplementation of the Linux kernel in a
memory-safe language.

It works by creating a sandboxed environment in which regular Linux
applications run, but their system calls are intercepted by gVisor.
gVisor then redirects these system calls and reinterprets them in
its own kernel. This means the host Linux kernel is isolated
from the sandboxed application, thereby providing protection against
Linux container escape attacks.

It also uses `seccomp-bpf` to provide a secondary layer of defense
against container escapes. Even if its userspace kernel gets
compromised, attackers would have to additionally have a Linux
container escape vector, and that exploit would have to fit within
the restricted `seccomp-bpf` rules that gVisor adds on itself.

Fixes #126
Fixes #224
Fixes #225
Fixes #228
2024-06-12 13:40:04 +03:00
deeplow
69c2a02d81
Remove timeouts
Remove timeouts due to several reasons:

1. Lost purpose: after implementing the containers page streaming the
   only subprocess we have left is LibreOffice. So don't have such a
   big risk of commands hanging (the original reason for timeouts).

2. Little benefit: predicting execution time is generically unsolvable
   computer science problem. Ultimately we were guessing an arbitrary
   time based on the number of pages and the document size. As a guess
   we made it pretty lax (30s per page or MB). A document hanging for
   this long will probably lead to user frustration in any case and the
   user may be compelled to abort the conversion.

3. Technical Challenges with non-blocking timeout: there have been
several technical challenges in keeping timeouts that we've made effort
to accommodate. A significant one was having to do non-blocking read to
ensure we could timeout when reading conversion stream (and then used
here)

Fixes #687
2024-02-06 20:11:43 +00:00
deeplow
f676891482
Remove Dockerfile dependencies replaced by PyMuPDF
PyMuPDF replaced the need for almost all dependencies, which this commit
now removes.

We are also removing tesseract-ocr as a dependency since
(to our surprise) PyMuPDF ships directly with tesseract binaries [1].
However, now that tesseract-ocr is not available directly as a binary
tool, the `test_ocr.py` needed to be changed.

Fixes #658

[1]: https://github.com/freedomofpress/dangerzone/issues/658#issuecomment-1861033149
2024-01-03 12:58:36 +00:00
Alex Pyrgiotis
641aa131c9
ci: Add test for OCR languages
Test that the languages that we provide to users for OCR match the
languages that are installed in the container image

Fixes #417
2023-05-24 13:43:29 +03:00