Presenting Results and Training Data of Expanded Evaluation Experiment of HMM-Based Arabic Omni Font-Written OCR. Mohamed Attia & Mohamed El-Mahallawy, RDI.


1 In the name of God, the Most Gracious, the Most Merciful. Presenting Results and Training Data of Expanded Evaluation Experiment of HMM-Based Arabic Omni Font-Written OCR. Mohamed Attia & Mohamed El-Mahallawy. RDI's Meeting Room; Oct.

2 Overall Results
The Omni quality of an OCR system is measured by its capabilities at:
- Assimilation: how good it is at recognizing pages (whose text contents are not included in the training data) printed in fonts represented in the training data. Ultimate predefined goal: WER_A around 3%.
- Generalization: how good it is at recognizing pages printed in fonts not represented in the training data. Ultimate predefined goal: WER_G around 3·WER_A.
Recognition models are built in both cases from the significant writing-size range of 7 distinct MS-Windows fonts + 2 distinct Mac. fonts. The generalization test is run on the significant writing-size range of 3 distinct Mac. fonts.
Results: WER_A = 3.08%, CER_A = 0.77%; WER_G = 10.32%, CER_G = 2.58%.
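The reported WER/CER figures are the standard edit-distance measures. As a minimal illustration in Python (not the authors' actual scoring code; the function name and whitespace tokenization are assumptions):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over word tokens,
    normalized by the reference length. CER is the same measure
    computed over characters instead of words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)
```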

3 Error Analysis of Assimilation Test Regarding Font Shape/Size
[Table: WER_A % per font shape and writing size. Columns (shapes): Demashq, Baghdad, Courier New, Tahoma, Akhbar MT, Traditional Arabic, Koufi, Mudir, Simplified Arabic, plus ∑WER_A % over shapes per size. Rows (sizes): Small, Medium, Large, plus ∑WER_A % over sizes per shape. The per-cell figures are not recoverable from the transcript.]

4 Error Analysis of Assimilation Test Regarding the Most Frequent Recognition Mistakes
These are the most frequent 17 mistakes, which together contribute about 63.15% of WER_A.
[Table: most frequent recognition mistakes; columns: original ligature, the ligature it is replaced by, and frequency as a % of the total WER. The Arabic ligature pairs are not reliably recoverable from the transcript; the largest legible contributors are grapheme-decomposition errors (6.33% of the total WER), the confusion between صـ and ضـ (5.03%), and spurious insertions of ا (2.37%).]

5 Error Analysis of Generalization Test Regarding Font Shape/Size
[Table: WER_G % per font shape and writing size. Columns (shapes): Giza, Naskh, Nadeem, plus ∑WER_G % over shapes per size. Rows (sizes): Small, Medium, Large, plus ∑WER_G % over sizes per shape. The per-cell figures are not recoverable from the transcript.]
Sample pages from 2 books of typical quality have also been tried. The WER_G of book #1's sample pages (1,700 words) is 11.70%, and that of book #2's samples (1,100 words) is 7.25%.

6 Error Analysis of Generalization Test Regarding the Most Frequent Recognition Mistakes
These are the most frequent 19 mistakes, which together contribute about 55.90% of WER_G.
[Table: most frequent recognition mistakes; columns: original ligature, the ligature it is replaced by, and frequency as a % of the total WER. The Arabic ligature pairs are not reliably recoverable from the transcript; legible entries include the confusion between لمـ and لـ (3.81% of the total WER), the confusion between ق and ف (1.44%), and decomposition errors (1.12%).]

7 Training and Evaluation Data
9 distinct fonts, each over its significant writing-size range, are used for training/building the recognition models; 7 of them are MS-Windows fonts and 2 are Mac. fonts.
- At each size of each font, 25 different pages are used for training and another 5 different ones are used for the assimilation test.
- This sums up to 9·6·25 = 1,350 pages ≈ 1,350·200 = 270,000 words ≈ 270,000·4 = 1,080,000 graphemes for training, and 9·6·5 = 270 pages ≈ 54,000 words ≈ 216,000 graphemes for assimilation testing.
3 Mac. OS fonts over their full size range are used for the generalization test.
- At each size of each of these 3 fonts, 5 pages are used, which sums up to 5·6·3 = 90 pages ≈ 18,000 words ≈ 72,000 graphemes for generalization testing (these counts are reproduced in the sketch below).
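The counts above follow from the implied averages of 6 sizes per font, 200 words per page, and 4 graphemes per word. A minimal sketch that reproduces them (the constants and function name are illustrative, not from the slides' tooling):

```python
WORDS_PER_PAGE = 200       # rough average implied by the slide
GRAPHEMES_PER_WORD = 4     # rough average implied by the slide
SIZES_PER_FONT = 6         # the "significant writing size range"

def corpus_size(n_fonts: int, pages_per_size: int):
    """Return (pages, words, graphemes) for one data set."""
    pages = n_fonts * SIZES_PER_FONT * pages_per_size
    words = pages * WORDS_PER_PAGE
    return pages, words, words * GRAPHEMES_PER_WORD

print(corpus_size(9, 25))  # training:       (1350, 270000, 1080000)
print(corpus_size(9, 5))   # assimilation:   (270, 54000, 216000)
print(corpus_size(3, 5))   # generalization: (90, 18000, 72000)
```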

8 Effect of Language Model
Our language model is constrained neither by a certain lexicon nor by a set of linguistic rules; i.e. it is an open-vocabulary language model. Our statistical language model (SLM) is an m-gram one built using a Bayes / Good-Turing / back-off methodology. The unit of our SLM is the grapheme, i.e. the ligature, and the order of the SLM deployed in our system is 2. Our SLM is built from the NEMLAR raw-text corpus, with a size of 550,000 words (≈ 2,200,000 graphemes) distributed over the 13 most significant domains of modern and heritage standard Arabic. Deploying/neutralizing the SLM has the following effect on the realized WER of our system:
SLM deployed: WER_A = 3.08%, WER_G = 10.32%. SLM neutralized: WER_A = 6.13%, WER_G = 14.60%.
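To make the SLM concrete, here is a minimal grapheme-level bigram model with back-off to unigrams, in Python. This is only a sketch: it substitutes interpolated absolute discounting for the Bayes / Good-Turing discounting named above, and the class and parameter names are illustrative.

```python
from collections import Counter, defaultdict

class BackoffBigramLM:
    """Grapheme-level bigram LM that backs off to unigram estimates,
    in the spirit of the slide's order-2 m-gram SLM. Uses interpolated
    absolute discounting (not Good-Turing) for brevity."""

    def __init__(self, sequences, discount=0.5):
        self.d = discount
        self.uni, self.bi = Counter(), Counter()
        self.followers = defaultdict(set)  # distinct continuations per history
        for seq in sequences:              # each seq: a list/str of graphemes
            toks = ["<s>"] + list(seq) + ["</s>"]
            self.uni.update(toks)
            for a, b in zip(toks, toks[1:]):
                self.bi[(a, b)] += 1
                self.followers[a].add(b)
        self.total = sum(self.uni.values())

    def prob(self, prev, cur):
        """P(cur | prev): discounted bigram estimate plus the back-off
        mass redistributed according to the unigram distribution."""
        p_uni = self.uni[cur] / self.total
        c_prev = self.uni[prev]
        if c_prev == 0:                    # unseen history: pure back-off
            return p_uni
        backoff_mass = self.d * len(self.followers[prev]) / c_prev
        return max(self.bi[(prev, cur)] - self.d, 0) / c_prev + backoff_mass * p_uni
```

During recognition, lm.prob(previous_grapheme, candidate_grapheme) would then weight competing HMM hypotheses.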

9 Appreciating How Distinct the Fonts Used for Training, Assimilation, and Generalization Testing Are
- Simplified Arabic (training and assimilation testing): limited ligatures (151), gross graphemes, thin contours, tends to be sharp-cornered, clear openings, separate dots, no touching tails.
- Mudir (training and assimilation testing): limited ligatures (151), gross graphemes, thick contours, round-cornered, clear openings, partially connected dots, no touching tails.
- Koufi (training and assimilation testing): limited ligatures (151), gross graphemes, very thick contours, very sharp-cornered, small openings, separate dots, no touching tails.
- Traditional Arabic (training and assimilation testing): broadest ligature set (220), minute graphemes, thin contours, round-cornered, almost closed openings, partially connected dots, no touching tails.
- Akhbar MT (training and assimilation testing): limited ligatures (156), minute graphemes, medium contour thickness, tends to be sharp-cornered, almost closed openings, connected dots, no touching tails.
- Tahoma (training and assimilation testing): limited ligatures (151), gross graphemes, thin contours, round-cornered, clear openings, separate dots, touching tails; some ligatures have odd shapes (e.g. middle Ha).
- Courier New (training and assimilation testing): very broad ligature set (209), minute graphemes (but very wide), thin contours, sharp-cornered, clear openings, connected dots, no touching tails.
- Baghdad (training and assimilation testing): rich ligature set (167), minute graphemes, thick contours, round-cornered, almost closed openings, connected dots, touching tails.
- Demashq (training and assimilation testing): limited ligatures (154), minute graphemes, medium contour thickness, tends to be sharp-cornered, almost closed openings, connected dots, touching tails.
- Nadeem (generalization testing): limited ligatures (154), minute graphemes, medium contour thickness, tends to be sharp-cornered, almost closed openings, partially connected dots, touching tails.
- Naskh (generalization testing): rich ligature set (167), minute graphemes, thin contours, round-cornered, closed openings, connected dots, no touching tails.
- Giza (generalization testing): limited ligatures (154), minute graphemes, thin contours, sharp-cornered, semi-clear openings, partially connected dots, no touching tails.
(Sample images of each font appear on slides 14–25.)

10 Can our OCR System Statistically Build Concepts of Font Shapes? A Case Study
Some fonts that are conceptually distinct from the ones comprising the training data are very challenging for generalization testing; i.e. WER_G >> WER_A. In our first trial of a generalization test, the recognition models were built from the 7 MS-Windows fonts and the testing data was composed of 3 Mac. OS fonts. Under these conditions we got the poor result of WER_G ≈ 35% ≈ 11·WER_A (WER_G >> WER_A). After error analysis and some contemplation, we realized that Mac. OS fonts are built with different concepts not covered by the 7 MS-Windows fonts; e.g. connected dots, overlapping of the tails of some graphemes, etc. After adding 2 Mac. OS fonts to introduce those concepts into the training data, we achieved the dramatic enhancement of WER_G = 10.32% ≈ 3.4·WER_A. Our OCR system can statistically build font-shape concepts.

11 Current Parameter Settings and Computational Capacity
Computational capacity of the current pilot system:
- Runtime (recognition phase): somewhat slow but bearable.
- Offline (training phase): very slow! As per the experiment reported here, building the codebook takes about 45 hours, and building the HMMs takes about 53 hours.
Parameter settings:
- HMM type: discrete, 1st-order, left-to-right.
- Codebook size: 2,048.
- HMM states/model: 14 states, except for slim graphemes (7 states) and wide graphemes (18 states).
- HMM algorithms: Viterbi and Baum-Welch for initialization, Baum-Welch for embedded training, and Viterbi for recognition (see the sketch below).
- Training data size: 1,080,000 graphemes; assimilation testing data size: 216,000 graphemes; generalization testing data size: 72,000 graphemes.
As our pilot system is built from a hybrid of off-the-shelf tools (some voluntarily built), a professionally optimized s/w implementation of the system may save up to 50% of the training/recognition time. Another 25% may be saved by using more powerful contemporary hardware.
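For reference, a minimal Viterbi decoder for a discrete, 1st-order, left-to-right HMM of the kind parameterized above. This is an illustrative sketch, not the pilot system's code; the function name and array layout are assumptions (the left-to-right structure would be expressed by -inf entries below the diagonal of log_A).

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely state path for a discrete HMM.
    obs    : sequence of codebook indices (ints in [0, M))
    log_A  : (N, N) log transition matrix; for a left-to-right HMM,
             log_A[i, j] = -inf whenever j < i
    log_B  : (N, M) log emission probabilities over codebook symbols
    log_pi : (N,) log initial-state distribution"""
    N, T = log_A.shape[0], len(obs)
    delta = np.full((T, N), -np.inf)       # best log-score ending in each state
    psi = np.zeros((T, N), dtype=int)      # best predecessor per state
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: move i -> j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]           # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```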

12 Conclusion
Is our OCR system truly omni? Yes. It can both assimilate and generalize at a remarkable WER under tough training and testing data sets. In fact, the obtained WERs are the best reported in the published literature in this regard.
Is there room for further enhancement? Yes, regarding both:
- Reducing WER_G by building recognition models from more distinct fonts (esp. Mac. ones), sizes, and writing styles.
- Reducing the training/recognition time by a professionally optimized re-build of the core system, as well as by using more powerful hardware.
WER_A = 3.08%, CER_A = 0.77%; WER_G = 10.32%, CER_G = 2.58%.

13 Thank you for your kind attention To probe further, contact..

14 Simplified Arabic (MS-Windows)

15 Mudir (MS-Windows)

16 Koufi (MS-Windows)

17 Traditional Arabic (MS-Windows)

18 Akhbar MT (MS-Windows)

19 Tahoma (MS-Windows)

20 Courier New (MS-Windows)

21 Baghdad (Mac.)

22 Demashq (Mac.)

23 Nadeem (Mac.)

24 Naskh (Mac.)

25 Giza (Mac.)

