OCR: How to extend the character set of the German-language OCR model with additional characters?

Question

Frank on 10 Dec 2024

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/2171947-ocr-how-to-extend-the-character-set-for-german-language-ocr-model-set-by-additional-characters

Commented: Frank on 12 Dec 2024

My German-language documents contain the character "§". The character set of the German language-pack model used for OCR does not contain this character, but the English model does. The aim is to enable OCR to recognize this special character in German-language documents.

How can I extend the character set of the German language model with "§" (or another character)?

I ran a lot of searches but did not find an example or an answer that solves this.

Thanks a lot!

Frank

Answer 1

Sreeram on 12 Dec 2024

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/2171947-ocr-how-to-extend-the-character-set-for-german-language-ocr-model-set-by-additional-characters#answer_1555563

Hi Frank,

To enable OCR to recognise the "§" character in German language documents, you may train a new OCR model on "§" by fine-tuning the German model using the “trainOCR” function. Detailed instructions can be found in the following documentation:

https://www.mathworks.com/help/releases/R2024b/vision/ref/trainocr.html?#d126e301790
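
A minimal sketch of that fine-tuning workflow, under several assumptions: the file name gTruthGerman.mat, the variable name gTruth, the label name "Text", the attribute name "Word", the output folder, and the model name are hypothetical placeholders, and whether "german" is accepted as the base-model name depends on the installed OCR language data support package. It has not been run against this data, so treat it as an outline rather than a confirmed recipe.

```matlab
% Sketch only: fine-tune a German base model so that it also learns "§".
% Hypothetical names: gTruthGerman.mat, gTruth, "Text", "Word", "OCRModel",
% "germanWithSection", scan.png.
ld = load("gTruthGerman.mat");          % groundTruth object exported from the labeling app
gTruth = ld.gTruth;

% Extract images, boxes, and transcriptions (including "§") for training.
[imds, boxds, txtds] = ocrTrainingData(gTruth, "Text", "Word");
trainingData = combine(imds, boxds, txtds);

outputDir = "OCRModel";
if ~isfolder(outputDir)
    mkdir(outputDir);
end
opts = ocrTrainingOptions(MaxEpochs=10, OutputLocation=outputDir);

% Assumption: "german" is a valid base-model name once the language pack is installed.
ocrModel = trainOCR(trainingData, "germanWithSection", "german", opts);

% Use the fine-tuned model for recognition.
I = imread("scan.png");                 % hypothetical test image
results = ocr(I, Model=ocrModel);
disp(results.Text)
```

If the labeled text contains characters that the base model does not know, it may be worth checking the character-set options of ocrTrainingOptions in the documentation before training.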

If training a new model is not feasible, you may specify multiple models (German and English) to be used together for recognition. Pass them as a cell array in the “Model” argument of the “ocr” function, as described in the following documentation:

https://www.mathworks.com/help/releases/R2024b/vision/ref/ocr.html#bu76sfz

results = ocr(I, Model={"german","english"});

Note that this approach might result in some German characters being misclassified as English.

I hope this helps!

3 comments

Frank on 12 Dec 2024

Hi Sreeram,

Thank you very much for your answer. Unfortunately, neither suggestion works:

1) If I train the German model by labeling many "§" characters and exporting the gTruth data with the label attribute values set to "§", MATLAB hangs and must be killed (maybe it cannot handle the unknown character "§").

While I am at it: what do all the "joined" phrases in the OCR text results mean when using self-trained data?

2) The proposed model={"german","english"} delivers exactly the same results as using only the English model (no differences).

Do you have another idea?

If not, I will run two separate OCR passes (German and English), identify the positions of all "§" in the English text, and substitute them at those positions in the German OCR text. It is a little laborious but will probably be a workable workaround.

I want to use MATLAB OCR to validate German legal documents, which contain a lot of "§" signs. To navigate through the document texts using different fuzzy inference systems to work out the semantics, I need to know where the "§" signs are located.

Thanks a lot for the great service! And by the way, MATLAB is really great. I have been programming for more than 30 years, so I know.
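
The two-pass workaround described in the previous paragraph could be sketched roughly as follows. It assumes that CharacterBoundingBoxes holds one row per character of Text, and that the German model emitted exactly one (wrong) character where the English model found "§"; the file name scan.png is a hypothetical placeholder.

```matlab
% Sketch only: copy "§" hits from an English OCR pass into the German result.
I = imread("scan.png");                       % hypothetical input image
deu = ocr(I, Model="german");
eng = ocr(I, Model="english");

deuText = char(deu.Text);
engText = char(eng.Text);
deuBoxes = deu.CharacterBoundingBoxes;        % assumed: one [x y w h] row per character
engBoxes = eng.CharacterBoundingBoxes;
assert(size(engBoxes,1) == numel(engText) && size(deuBoxes,1) == numel(deuText), ...
    "Expected one bounding box per character.");

sectionIdx = find(engText == '§');            % positions of "§" in the English text
sectionBoxes = zeros(0, 4);                   % image locations of accepted "§" hits
for k = sectionIdx
    % Intersection area between this "§" box and every German character box.
    overlap = rectint(engBoxes(k,:), deuBoxes);
    [bestArea, j] = max(overlap);
    if bestArea > 0
        deuText(j) = '§';                     % overwrite the misread German character
        sectionBoxes(end+1,:) = engBoxes(k,:); %#ok<AGROW>
    end
end

disp(deuText)
```

Matching by bounding-box overlap rather than by text index is a deliberate choice here, since the two models may split or merge characters differently and the two strings need not have the same length.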

Frank from Berlin

Frank on 12 Dec 2024

Hi Sreeram and anyone else interested,

I copied the file "deu.traineddata" from folder "C:\ProgramData\MATLAB\SupportPackages\R2024b\3P.instrset\tesseract-ocr-languages-deu.instrset\tessdata_best" to folder "D:\Programs\MATLAB\R2024b\toolbox\vision\visionutilities\tessdata_best" and now it works (line: "result = ocr(I,LayoutAnalysis="page",Model={"german","english"})")

In my Windows 11 MATLAB installation the program folder is on drive D:\... but the support packages were automatically installed to C:\... Maybe this is not the expected default.
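
For anyone who prefers to script that copy step, here is a small sketch. The source folder is the one quoted above; the destination is built from matlabroot on the assumption that it points to the same installation folder as in my case. Writing into the MATLAB installation folder may require administrator rights.

```matlab
% Sketch only: copy the German traineddata file next to the built-in models.
srcDir = "C:\ProgramData\MATLAB\SupportPackages\R2024b\3P.instrset\tesseract-ocr-languages-deu.instrset\tessdata_best";
dstDir = fullfile(matlabroot, "toolbox", "vision", "visionutilities", "tessdata_best");
srcFile = fullfile(srcDir, "deu.traineddata");

if isfile(srcFile) && isfolder(dstDir)
    [ok, msg] = copyfile(srcFile, dstDir);
    if ~ok
        warning("ocrfix:copyFailed", "Copy failed: %s", msg);
    end
else
    warning("ocrfix:pathMissing", "Source file or destination folder not found.");
end
```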

Thanks a lot!

Frank on 12 Dec 2024

Important notice for everyone interested: the order of the model names in the statement matters.

Model={"english","german"} works much better than Model={"german","english"} ...
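
A quick way to compare the two orderings on a given page is to count the recognized "§" characters for each. This is only a sketch; scan.png is a hypothetical file name, and the count is a rough indicator, not a full accuracy measure.

```matlab
% Sketch only: compare how many "§" each model order recognizes.
I = imread("scan.png");                       % hypothetical input image
orders = {{"english","german"}, {"german","english"}};
counts = zeros(1, numel(orders));

for k = 1:numel(orders)
    res = ocr(I, LayoutAnalysis="page", Model=orders{k});
    counts(k) = count(string(res.Text), "§");
    fprintf("%s first: %d section signs\n", orders{k}{1}, counts(k));
end
```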

Final message now ;-)
