
Omni Font OCR Error Correction with Effect on Retrieval Walid Magdy 1 Kareem Darwish 2 1 Faculty of Engineering, Cairo University, Egypt 1 School of Computing,


1 Omni Font OCR Error Correction with Effect on Retrieval
Walid Magdy (1), Kareem Darwish (2)
(1) Faculty of Engineering, Cairo University, Egypt
(1) School of Computing, Dublin City University, Ireland
(2) Faculty of Computers and Information, Cairo University, Egypt
(2) Cairo Microsoft Innovation Center, Microsoft Research, Egypt
ISDA, 30 November 2010

2 What do I mean by
Clean text: Printed text is converted into digital text through optical character recognition (OCR) process. Some errors can exist, which affect search
After OCR: Printecl text i8 convenled into diyital tex throuyh optical chavacter recognition (0CK) process. Some ettors can exist, which attect search

3 State-of-the-art
Error Model: Printed ↔ Printecl → d ↔ cl
Needs manual effort
Needs accurate algorithm for alignment
Dependent on font

4 Question of Research
Can we create a correction model for OCR:
Font independent (Omni font)
Totally unsupervised
Comparable with state-of-the-art:
  Correction ability
  Retrieval effectiveness

5 Approach
[Diagram: OCR Text (example: cokkection) → Generate Candidates → List of poss. corr. → Select Correction → Corr. Text. Other boxes in the diagram: Error Model, Calculate Edit Distance (ED), Language Model, Context.]

6 Initial Long List of Candidates
cokkection: collection, correction, …, pyramids
Index the dictionary of words
collection (index): {c, o, l, l, e, c, t, i, o, n, #c, co, ol, ll, le, ec, ct, ti, io, on, n#, #co, col, oll, lle, lec, ect, cti, tio, ion, on#, 10}
cokkection (search): {c, o, k, k, e, c, t, i, o, n, #c, co, ok, kk, ke, ec, ct, ti, io, on, n#, #co, cok, okk, kke, kec, ect, cti, tio, ion, on#, 10}
1000 initial candidates to calculate ED
ED + unigram probability = prior probability
LM probability of word trigrams = posterior probability
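The candidate-generation step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`char_ngram_features`, `edit_distance`, `shortlist`) are hypothetical, the shortlist is ranked by a brute-force count of shared features rather than by a real search index, and the edit distance is plain unit-cost Levenshtein, which the slides do not specify.

```python
from collections import Counter


def char_ngram_features(word, max_n=3):
    """Character unigrams, '#'-padded bigrams and trigrams, plus word length."""
    features = list(word)                      # unigrams, no boundary marker
    padded = f"#{word}#"                       # '#' marks word start and end
    for n in range(2, max_n + 1):
        features.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    features.append(str(len(word)))            # word length as a final token
    return features


def edit_distance(a, b):
    """Unit-cost Levenshtein distance (assumed; the slides do not give costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def shortlist(ocr_word, dictionary, k=1000):
    """Return the k dictionary words sharing the most features with ocr_word."""
    query = Counter(char_ngram_features(ocr_word))

    def overlap(word):
        # Multiset intersection: repeated n-grams count as often as they match.
        return sum((query & Counter(char_ngram_features(word))).values())

    return sorted(dictionary, key=overlap, reverse=True)[:k]


# Toy dictionary taken from the slide's example words.
candidates = shortlist("cokkection", ["pyramids", "collection", "correction"], k=2)
ranked = sorted(candidates, key=lambda w: edit_distance("cokkection", w))
print(ranked)
```

With a real dictionary, the overlap ranking would be replaced by a query against an index of these features, and only the retrieved shortlist would be passed to the edit-distance step.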

7 Experimental Setup
Two Arabic OCR document collections:
  ZAD: religious book → WER = 39%
  TREC AFP: newspapers → WER = 31%
Correction using Error Model (EM):
  ZAD: 2000 training words
  AFP: 4000 training words
Two domain-specific language models
Test EM vs ED correction:
  Error reduction
  Retrieval effectiveness

8 Error Reduction
Collection          Correction   WER     Error Reduction
ZAD (WER = 39%)     ED           17%     56%
ZAD (WER = 39%)     EM (ref)     12%     70%
AFP (WER = 31%)     ED           7.3%    76%
AFP (WER = 31%)     EM (ref)     5.9%    81%
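The error-reduction column appears to be the relative drop in WER, that is (WER before − WER after) / WER before. The slides do not state the formula, so this is an assumption; the short check below recomputes it from the rounded WER values on the slide, and the helper name `error_reduction` is hypothetical.

```python
def error_reduction(wer_before, wer_after):
    """Relative WER reduction in percent (assumed definition)."""
    return 100.0 * (wer_before - wer_after) / wer_before


# (collection, method): (WER before correction, WER after correction), in percent
results = {
    ("ZAD", "ED"): (39.0, 17.0),
    ("ZAD", "EM"): (39.0, 12.0),
    ("AFP", "ED"): (31.0, 7.3),
    ("AFP", "EM"): (31.0, 5.9),
}

for (collection, method), (before, after) in results.items():
    print(f"{collection} {method}: {error_reduction(before, after):.1f}%")
```

The recomputed values agree with the reported ones to within about one percentage point; small differences are expected because the WER figures on the slide are themselves rounded.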

9 Retrieval results for ZAD

10 Conclusion
Omni font correction:
  Reduces errors up to 75%
  Slightly lower than correction based on error model (EM)
  Statistically indistinguishable from EM correction for search
  No training required
  Independent of font or language

11 Advice
Enjoy your stay in Egypt:
  Cairo: Pyramids, Nile
  Luxor, Aswan: Temples, Nile
  Sharm El-Sheikh: Red Sea, Safari
Do not drive unless you are Egyptian
Do not cross the road alone
Do not ask questions
Thank you

12 Equations
S_ED(w_i) =

