mirror of https://github.com/freedomofpress/dangerzone.git (synced 2025-04-28 18:02:38 +02:00)
container: Grab trained OCR models from GitHub
Grab Tesseract's trained models from GitHub, instead of from the Alpine Linux repos. Over the past few months, the models in the Alpine Linux repos did not remain stable, leading to CI issues.

Since the models are already pre-trained and available through Tesseract's repo on GitHub, we can use the release tarball that they offer to install them in the container image, which is basically what the upstream packages are doing as well.

In order to make sure that we have no regressions, at the time of this commit we ensured that the hashes of the models offered through the Alpine Linux repos and the models offered from the GitHub release are the same. Also, in order to detect future regressions or foul play, we check the downloaded models against a known checksum. Given that these models change every few years, updating the checksum should not be an issue.

Fix #357
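The checksum check described in the commit message can be sketched in isolation as follows. This is a minimal illustration, not the project's code: the `verify_checksum` helper, the `models.tar.gz` file name, and the file contents are placeholders, and the digest is computed locally rather than taken from a real tessdata release.

```shell
#!/bin/sh
# Sketch: verify a file against a pinned SHA-256 digest, as the commit
# message describes. File names and contents below are placeholders.
set -eu

# verify_checksum FILE EXPECTED_HEX (hypothetical helper)
# Returns 0 if FILE hashes to EXPECTED_HEX, non-zero otherwise.
verify_checksum() {
    # sha256sum -c reads "<digest>  <path>" lines (two spaces) from stdin.
    echo "$2  $1" | sha256sum -c >/dev/null 2>&1
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
archive="$workdir/models.tar.gz"   # placeholder, not the real tessdata tarball

printf 'placeholder model data\n' > "$archive"

# Pin the digest once; in the Dockerfile this role is played by TESSDATA_CHECKSUM.
pinned=$(sha256sum "$archive" | cut -d' ' -f1)

if verify_checksum "$archive" "$pinned"; then
    echo "unmodified archive: checksum matches"
fi

# Simulate an upstream change or tampering.
printf 'extra bytes\n' >> "$archive"
if ! verify_checksum "$archive" "$pinned"; then
    echo "modified archive: checksum mismatch, build would stop here"
fi
```

When a new tessdata release is adopted on purpose, the same `sha256sum` invocation on the downloaded tarball yields the value to pin.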
parent 8059c8e1f1
commit a0d6f0d719
1 changed file with 21 additions and 65 deletions
@@ -1,5 +1,7 @@
 FROM alpine:latest
 
+ARG TESSDATA_CHECKSUM=990fffb9b7a9b52dc9a2d053a9ef6852ca2b72bd8dfb22988b0b990a700fd3c7
+
 # Install dependencies
 RUN apk -U upgrade && \
     apk add \
@@ -11,71 +13,25 @@ RUN apk -U upgrade && \
     poppler-data \
     python3 \
     py3-magic \
-    tesseract-ocr \
-    tesseract-ocr-data-afr \
-    tesseract-ocr-data-ara \
-    tesseract-ocr-data-aze \
-    tesseract-ocr-data-bel \
-    tesseract-ocr-data-ben \
-    tesseract-ocr-data-bul \
-    tesseract-ocr-data-cat \
-    tesseract-ocr-data-ces \
-    tesseract-ocr-data-chi_sim \
-    tesseract-ocr-data-chi_tra \
-    tesseract-ocr-data-chr \
-    tesseract-ocr-data-dan \
-    tesseract-ocr-data-deu \
-    tesseract-ocr-data-grc \
-    tesseract-ocr-data-enm \
-    tesseract-ocr-data-epo \
-    tesseract-ocr-data-equ \
-    tesseract-ocr-data-est \
-    tesseract-ocr-data-eus \
-    tesseract-ocr-data-fin \
-    tesseract-ocr-data-fra \
-    tesseract-ocr-data-frk \
-    tesseract-ocr-data-frm \
-    tesseract-ocr-data-glg \
-    tesseract-ocr-data-grc \
-    tesseract-ocr-data-heb \
-    tesseract-ocr-data-hin \
-    tesseract-ocr-data-hrv \
-    tesseract-ocr-data-hun \
-    tesseract-ocr-data-ind \
-    tesseract-ocr-data-isl \
-    tesseract-ocr-data-ita \
-    tesseract-ocr-data-ita_old \
-    tesseract-ocr-data-jpn \
-    tesseract-ocr-data-kan \
-    tesseract-ocr-data-kat \
-    tesseract-ocr-data-kor \
-    tesseract-ocr-data-lav \
-    tesseract-ocr-data-lit \
-    tesseract-ocr-data-mal \
-    tesseract-ocr-data-mkd \
-    tesseract-ocr-data-mlt \
-    tesseract-ocr-data-msa \
-    tesseract-ocr-data-nld \
-    tesseract-ocr-data-nor \
-    tesseract-ocr-data-pol \
-    tesseract-ocr-data-por \
-    tesseract-ocr-data-ron \
-    tesseract-ocr-data-rus \
-    tesseract-ocr-data-slk \
-    tesseract-ocr-data-slv \
-    tesseract-ocr-data-spa \
-    tesseract-ocr-data-spa_old \
-    tesseract-ocr-data-sqi \
-    tesseract-ocr-data-srp \
-    tesseract-ocr-data-swa \
-    tesseract-ocr-data-swe \
-    tesseract-ocr-data-tam \
-    tesseract-ocr-data-tel \
-    tesseract-ocr-data-tgl \
-    tesseract-ocr-data-tha \
-    tesseract-ocr-data-tur \
-    tesseract-ocr-data-ukr \
-    tesseract-ocr-data-vie
+    tesseract-ocr
+
+# Download the trained models from the latest GitHub release of Tesseract, and
+# store them under /usr/share/tessdata. This is basically what distro packages
+# do under the hood.
+#
+# Because the GitHub release contains more files than just the trained models,
+# we use `find` to fetch only the '*.traineddata' files in the top directory.
+#
+# Before we untar the models, we also check if the checksum is the expected one.
+RUN mkdir tessdata && cd tessdata \
+    && TESSDATA_VERSION=$(wget -O- -nv https://api.github.com/repos/tesseract-ocr/tessdata/releases/latest \
+        | sed -n 's/^.*"tag_name": "\([0-9.]\+\)".*$/\1/p') \
+    && apk --purge del jq \
+    && wget https://github.com/tesseract-ocr/tessdata/archive/$TESSDATA_VERSION/tessdata-$TESSDATA_VERSION.tar.gz \
+    && echo "$TESSDATA_CHECKSUM  tessdata-$TESSDATA_VERSION.tar.gz" | sha256sum -c \
+    && tar -xzvf tessdata-$TESSDATA_VERSION.tar.gz -C . \
+    && find . -name '*.traineddata' -maxdepth 2 -exec cp {} /usr/share/tessdata \; \
+    && cd .. && rm -r tessdata
 
 COPY dangerzone.py /usr/local/bin/
 RUN chmod +x /usr/local/bin/dangerzone.py
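
The version lookup in the new `RUN` step pipes the GitHub API response through `sed` to pull out `tag_name`. The sketch below runs the same `sed` expression offline against a hand-written sample response, to show what it matches. The `extract_tag` wrapper, the sample JSON, and the tag value `4.1.0` are illustrative assumptions, not data from the real API.

```shell
#!/bin/sh
# Sketch: extract the release tag from a GitHub "latest release" response
# using the same sed expression as the Dockerfile. Sample data is made up.
set -eu

# extract_tag (hypothetical helper): reads JSON on stdin and prints the
# numeric tag name, or nothing if no such tag is found.
extract_tag() {
    sed -n 's/^.*"tag_name": "\([0-9.]\+\)".*$/\1/p'
}

# Hand-written sample response; the tag value is a placeholder.
sample='{
  "url": "https://api.github.com/repos/example/example/releases/1",
  "tag_name": "4.1.0",
  "name": "example release"
}'

TESSDATA_VERSION=$(printf '%s\n' "$sample" | extract_tag)
echo "tag: $TESSDATA_VERSION"
```

The expression accepts only digits and dots, so a tag with any other prefix or suffix produces an empty result and the later `wget` URL would be malformed. Note also that `\+` is an extension supported by GNU and BusyBox `sed`, not by every `sed` implementation.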