Commit graph

11 commits

Author SHA1 Message Date
deeplow
814d533c3b
Restructure container code
The files in `container/` no longer make sense to have that name since
the "document to pixels" part will run in Qubes OS in its own virtual
machine.

To adapt to this, this PR does the following:
- Moves all the files in `container` to `dangerzone/conversion`
- Splits the old `container/dangerzone.py` into its two components
  `dangerzone/conversion/{doc_to_pixels,pixels_to_pdf}.py` with a
  `common.py` file for shared functions
- Moves the Dockerfile to the project root and adapts it to the new
  container code location
- Updates the CircleCI config to properly cache Docker images.
- Updates our install scripts to properly build Docker images.
- Adds the new conversion module to the container image, so that it can
  be imported as a package.
- Adapts the container isolation provider to use the new way of calling
  the code.

NOTE: We have made zero changes to the conversion code in this commit,
except for necessary imports in order to factor out some common parts.
Any changes necessary for Qubes integration follow in the subsequent
commits.
2023-06-21 11:44:47 +03:00
Alex Pyrgiotis
a0d6f0d719
container: Grab trained OCR models from GitHub
Grab Tesseract's trained models from GitHub, instead of from the Alpine
Linux repos. Over the past few months, the models in the Alpine Linux
repos did not remain stable, leading to CI issues.

Since the models are already pre-trained and available through
Tesseract's repo on GitHub, we can use the release tarball that they
offer to install them in the container image, which is basically what
the upstream packages are doing as well.

In order to make sure that we have no regressions, at the time of this
commit we ensured that the hashes of the models offered through the
Alpine Linux repos and the models offered from the GitHub release are
the same. Also, in order to detect future regressions or foul play, we
check the downloaded models against a known checksum. Given that these
models change every few years, updating the checksum should not be an
issue.

Fix #357
2023-05-23 16:27:40 +03:00
Alex Pyrgiotis
3d822e1aa3
container: Install a renamed package
The tesseract-ocr-data-ell package for the Greek language has been
renamed to tesseract-ocr-data-grc. Use the new name in our Dockerfile.
2023-05-17 20:29:13 +03:00
deeplow
dbd0450542
Add poppler-data package due to missing fonts
Some documents were reporting the following error when running them
over pdftoppm:

    Syntax Error: Missing language pack for 'Adobe-Japan1' mapping

This did not necessarily make the document fail but it could be
that some fonts were not properly rendered due to the missing package.
2023-02-21 18:39:14 +00:00
Alex Pyrgiotis
24975fabd5
container: Reinstate OpenJDK 8 dependency
Commit d7be28ec2a assumed that OpenJDK was
required for the PDFtk package, which is no longer installed in the
Dangerzone image, and thus was removed.

Turns out that while LibreOffice does not depend on OpenJDK, it may
produce corrupted PDFs if installed without it, and will not abort the
operation.

Reinstate OpenJDK to fix the issue of corrupted PDFs.

Fixes #315
2023-02-07 18:52:49 +02:00
deeplow
2da973232b
Remove sudo: no longer needed
Fixes #232
2023-01-23 14:13:56 +00:00
deeplow
d7be28ec2a
Remove openjdk-8 as a dependency.
default-jre and java dependencies dependencies had been added initially
[1] because of libreoffice-java-common, which is no longer present.
Then, when the image was changed from ubuntu to alpine [2], default-jre
was replaced with openjdk-8.

If java is still a dependency for libreoffice, then it should be pulled
automatically.

[1] 9ecdb9e995
[2] 650ae6eee1
2023-01-23 14:13:48 +00:00
deeplow
d28aa5a25b
Remove PDFtk dependency (replace w/ pdftoppm)
PDFtk actually isn't needed. It was being used for breaking a PDF
into pages but this is something that be replaced by the already present
'pdftoppm'. Furthermore, by removing this dependency we contribute to
reproducible builds and overall supply chain security because it was
obtained from gitlab with no signature verification or version pinning.

The replacement 'pdftoppm' enabled us to do a shortcut:
 - before: PDF -> PDF pages -> PNG images -> RGB images
 - after:  PDF -> PPM images -> RGB images

And this last conversion step is trivial since the RGB format we were
using is just a PPM file without the metadata in its header.
2023-01-23 14:00:57 +00:00
deeplow
21a9a6c98c
running dangerzone without root in container
There was previously a user created in the container but it was not
used via the dockerfile RUN directive (as pointed out by
gmarmstrong[1]).

Fixes #169

[1]: https://github.com/freedomofpress/dangerzone/issues/169#issue-1268399245
2022-08-22 08:43:58 +01:00
Micah Lee
8052220034
Get rid of wrapper scripts in the container 2021-11-29 15:39:24 -08:00
Micah Lee
2de2b6dca5
Rename dangerzone-converter to container 2021-11-29 15:30:21 -08:00
Renamed from dangerzone-converter/Dockerfile (Browse further)