7 commits

Author SHA1 Message Date
Alex Pyrgiotis
cbca9110ca
Switch to tessdata-fast Tesseract model
Switch to the tessdata-fast Tesseract model, instead of the tessdata
one. The tessdata-fast Tesseract model is much smaller, and a bit faster
than the other one. Also, it's the model that Debian/Fedora ship by
default.

Closes #545
2023-09-25 12:48:05 +03:00
Alex Pyrgiotis
5bd609781d
Remove Kurdish (Arabic) language
Remove the Kurdish (Arabic) language ("kur_ara") from the list of
languages that we offer for OCR, since it's not included in the
installed languages.

Interestingly, it is not present in the Alpine Linux repos either, so
this was probably an omission in the first place.
2023-05-24 13:43:29 +03:00
Alex Pyrgiotis
35e439f9e8
Restore the OCR languages
Restore the OCR languages to the state they were in as of commit
66d3c40163, with some minor changes. We can now do so because we
download all the trained models, not just the ones that Alpine Linux
offers.
2023-05-24 13:43:29 +03:00
deeplow
58332fdd6e
tesseract: add new languages and others
Tagalog was replaced with Filipino [1] in newer Tesseract versions, so it
doesn't make sense for us to use the new name and map it to the old
"tgl" name (Tagalog) under the hood.

Language names obtained from tesseract's man page [2].

[1]: 58f7a72f00
[2]: https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
2023-03-16 14:23:30 +00:00
deeplow
d8d83ff036
Remove languages not supported
When the OCR languages list was originally introduced (commit b527776),
the container was running on Ubuntu 18.04 [1]. Later it changed to
Alpine Linux, which unfortunately offers fewer languages than Ubuntu.

This commit removes the languages that are no longer available. Fixes #355

[1]: b527776e28 (diff-ec032b25a6c2af24eaf4128c85090c5ce0dcbab64e64eace10be9f4e4683a71bR1)
2023-03-16 14:23:28 +00:00
deeplow
66d3c40163
Sort OCR languages by tesseract arg name
Make it easier to compare the list of languages with the output of
`tesseract --list-langs`.
2023-03-16 14:23:25 +00:00
deeplow
2d6826afa9
move ocr_languages from global_common to share/
ocr_languages can be treated as a plain JSON file instead of living
in global_common. This makes it easier to maintain and keeps
global_common cleaner.
2022-09-15 10:40:34 +01:00
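
Several of the commits above are about keeping the offered OCR language list in sync with what `tesseract --list-langs` reports. A minimal sketch of that check follows. It assumes the JSON file maps a display name to a Tesseract argument name (for example `"English": "eng"`); that layout, the helper names, and the sample data are illustrative assumptions, not the project's actual code.

```python
from __future__ import annotations

import json
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumes one code per line, preceded by a header line that starts
    with "List of"; that header line is skipped.
    """
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        if line and not line.startswith("List of"):
            codes.add(line)
    return codes


def installed_languages() -> set[str]:
    """Ask the local tesseract binary which languages it has installed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams, in case a given version prints the list to stderr.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


def sort_by_arg_name(languages: dict[str, str]) -> dict[str, str]:
    """Order entries by Tesseract arg name so they line up with the CLI output."""
    return dict(sorted(languages.items(), key=lambda item: item[1]))


def missing_languages(languages: dict[str, str], installed: set[str]) -> list[str]:
    """Return offered arg names that are not actually installed."""
    return sorted(code for code in languages.values() if code not in installed)


# Hypothetical sample data standing in for the JSON file and the CLI output.
sample_json = '{"Kurdish (Arabic)": "kur_ara", "English": "eng", "Afrikaans": "afr"}'
sample_cli_output = "List of available languages (2):\nafr\neng\n"

offered = sort_by_arg_name(json.loads(sample_json))
missing = missing_languages(offered, parse_list_langs(sample_cli_output))
print(list(offered.values()))
print(missing)
```

On a machine with Tesseract installed, `missing_languages(offered, installed_languages())` would run the same comparison against the real binary; any code it returns is a candidate for removal from the list or for installing the matching trained model.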