Remove Dockerfile dependencies replaced by PyMuPDF

PyMuPDF made most of the container image's document-processing
dependencies redundant; this commit removes them from the Dockerfile.

We are also removing tesseract-ocr as a dependency since
(to our surprise) PyMuPDF ships directly with tesseract binaries [1].
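However, now that tesseract is no longer available as a standalone
binary in the container, `test_ocr.py` had to change: it now lists the
`*.traineddata` files under `/usr/share/tessdata/` instead of running
`tesseract --list-langs`.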
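
As a rough illustration of the new approach, the sketch below turns the output of `find /usr/share/tessdata/ -name "*.traineddata"` into a set of language codes. The helper name `langs_from_find_output` and the sample paths are assumptions made for this example and are not part of the commit; the test itself does the same thing inline with `Path(...).name.split(".traineddata")[0]`.

```python
from pathlib import Path

SUFFIX = ".traineddata"


def langs_from_find_output(output: str) -> set[str]:
    """Map the stdout of `find ... -name '*.traineddata'` to language codes.

    Hypothetical helper, for illustration only.
    """
    langs = set()
    for line in output.strip().splitlines():
        # Keep only the file name, e.g. "eng.traineddata".
        name = Path(line.strip()).name
        if name.endswith(SUFFIX):
            # Drop the suffix to get the language code, e.g. "eng".
            langs.add(name[: -len(SUFFIX)])
    return langs


# Sample output (assumed paths, not taken from a real container run).
sample = (
    "/usr/share/tessdata/eng.traineddata\n"
    "/usr/share/tessdata/chi_sim.traineddata\n"
)
print(sorted(langs_from_find_output(sample)))
```

Unlike `tesseract --list-langs`, `find` prints no header line, so no leading line has to be skipped. This sketch also returns an empty set when `find` prints nothing.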

Fixes #658

[1]: https://github.com/freedomofpress/dangerzone/issues/658#issuecomment-1861033149
deeplow 2023-12-22 11:43:33 +00:00
parent ee35e28aa6
commit f676891482
GPG key ID: 577982871529A52A
3 changed files with 19 additions and 7 deletions

@@ -13,6 +13,7 @@ since 0.4.1, and this project adheres to [Semantic Versioning](https://semver.or
 ### Changed
 - Feature: Add support for HWP/HWPX files (Hancom Office) for macOS Apple Silicon devices ([issue #498](https://github.com/freedomofpress/dangerzone/issues/498), thanks to [@OctopusET](https://github.com/OctopusET))
+- Replace Dangerzone document rendering engine from pdftoppm to PyMuPDF, essentially replacing a variety of tools (gm / tesseract / pdfunite / ps2pdf) ([issue #658](https://github.com/freedomofpress/dangerzone/issues/658))
 ## Dangerzone 0.5.1

@@ -55,14 +55,10 @@ FROM alpine:latest
 # Install dependencies
 RUN apk --no-cache -U upgrade && \
     apk --no-cache add \
-    ghostscript \
     libreoffice \
     openjdk8 \
-    poppler-utils \
-    poppler-data \
     python3 \
     py3-magic \
-    tesseract-ocr \
     font-noto-cjk
 COPY --from=pymupdf-build /usr/lib/python3.11/site-packages/fitz/ /usr/lib/python3.11/site-packages/fitz

@@ -1,5 +1,6 @@
 import platform
 import subprocess
+from pathlib import Path
 import pytest
@@ -16,16 +17,30 @@ def test_ocr_ommisions() -> None:
     # Create the command that will list all the installed languages in the container
     # image.
     runtime = Container.get_runtime()
-    command = [runtime, "run", Container.CONTAINER_NAME, "tesseract", "--list-langs"]
+    command = [
+        runtime,
+        "run",
+        Container.CONTAINER_NAME,
+        "find",
+        "/usr/share/tessdata/",
+        "-name",
+        "*.traineddata",
+    ]
     # Run the command, strip any extra whitespace, and remove the following first line
     # from the result:
     #
     #     List of available languages in "/usr/share/tessdata/" ...
-    installed_langs = set(
+    installed_langs_filenames = (
         subprocess.run(command, text=True, check=True, stdout=subprocess.PIPE)
         .stdout.strip()
-        .split("\n")[1:]
+        .split("\n")
     )
+    installed_langs = set(
+        [
+            Path(filename).name.split(".traineddata")[0]
+            for filename in installed_langs_filenames
+        ]
+    )
     # Remove the "osd" and "equ" languages from the list of installed languages, since