OCR: How to extend the character set of the German language OCR model with additional characters?

Frank on 10 Dec 2024 at 19:36
Commented: Frank on 12 Dec 2024 at 18:22
My German-language documents contain "§". The character set of the model in the German language pack used for OCR does not contain this character, but the English model does. The aim is to enable OCR to recognize this special character in German-language documents.
How can I extend the character set of the German language model by "§" (or another character)?
I ran many searches but did not find an example or an answer that solves this.
Thanks a lot!
Frank

Accepted Answer

Sreeram on 12 Dec 2024 at 5:33
Hi Frank,
To enable OCR to recognise the "§" character in German language documents, you may train a new OCR model on "§" by fine-tuning the German model using the “trainOCR” function. Detailed instructions can be found in the documentation for “trainOCR”.
If training a new model is not feasible, you may specify multiple models (German and English) to use for detection simultaneously. This can be done by passing them as a cell array in the “Model” argument of the “ocr” function, as described in the documentation for “ocr”:
ocr(I,model={"german","english"})
Note that this approach might result in some German characters being misclassified as English.
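As a minimal sketch of the multi-model call above: it assumes the German language data from the OCR language support package is installed, and "document.png" is a hypothetical file name standing in for your own scanned page. It simply checks whether "§" appears in the recognized text.

```matlab
% Hypothetical input file; replace with your own scanned page.
I = imread("document.png");

% Pass both language models so that a character missing from one
% character set can be picked up from the other.
result = ocr(I, Model={"german","english"});

% result is an ocrText object; its Text property holds the recognized text.
if contains(result.Text, "§")
    disp("The section sign was recognized.");
else
    disp("The section sign was not recognized.");
end
```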
I hope this helps!
3 Comments
Frank on 12 Dec 2024 at 18:11
Hi Sreeram and anyone else interested,
I copied the file "deu.traineddata" from the folder "C:\ProgramData\MATLAB\SupportPackages\R2024b\3P.instrset\tesseract-ocr-languages-deu.instrset\tessdata_best" to the folder "D:\Programs\MATLAB\R2024b\toolbox\vision\visionutilities\tessdata_best" and now it works (line: "result = ocr(I,LayoutAnalysis="page",Model={"german","english"})").
In my Windows 11 MATLAB installation the program folder is on drive D:\... but the support packages were automatically installed to C:\... Maybe this is not the expected standard setup.
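The copy step described above could be scripted roughly as follows. This is only a sketch: the source path is the one quoted in this comment and may differ on other machines, the destination is built from matlabroot on the assumption that the target folder lies inside the MATLAB installation, and writing there may require administrator rights.

```matlab
% Source: where the support package placed the German model
% (path taken from the comment above; adjust for your machine).
srcFile = fullfile("C:\ProgramData\MATLAB\SupportPackages\R2024b", ...
    "3P.instrset", "tesseract-ocr-languages-deu.instrset", ...
    "tessdata_best", "deu.traineddata");

% Destination: tessdata_best folder inside the MATLAB installation.
dstDir = fullfile(matlabroot, "toolbox", "vision", ...
    "visionutilities", "tessdata_best");

if isfile(srcFile) && isfolder(dstDir)
    % copyfile returns a status flag and a message instead of erroring.
    [ok, msg] = copyfile(srcFile, dstDir);
    if ~ok
        warning("Copy failed: %s", msg);
    end
else
    warning("Source file or destination folder not found; check the paths.");
end
```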
Thanks a lot!
Frank
Frank on 12 Dec 2024 at 18:22
Important notice for anyone interested: the order of the model names in the statement matters.
Model={"english","german"} works much better than Model={"german","english"} ...
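One way to compare the two orders on your own documents is sketched below. "document.png" is a hypothetical file name, and counting "§" together with the mean word confidence is just one possible yardstick, not an established quality measure.

```matlab
I = imread("document.png");  % hypothetical input file

% The two model orders to compare.
orders = {{"english","german"}, {"german","english"}};

for k = 1:numel(orders)
    r = ocr(I, LayoutAnalysis="page", Model=orders{k});

    % Build a readable label such as "english+german".
    label = strjoin([orders{k}{:}], "+");

    % Report how many section signs were found and the average
    % confidence over all recognized words.
    fprintf("%s: %d section signs, mean word confidence %.2f\n", ...
        label, count(r.Text, "§"), mean(r.WordConfidences));
end
```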
Final message now ;-)
