Effective Retrieval of MisRecognized Scanned Documents R. Manmatha Multimedia Indexing and Retrieval Group, Center for Intelligent Information Retrieval, Dept. of Computer Science, University of Massachusetts, Amherst, MA, USA firstname.lastname@example.org
Copyright: R. Manmatha How commercial OCRs work (true for many commercial OCR systems). Find the layout of the page: columns, paragraphs, images, lines and so on, using whitespace as the divider. Segment lines into characters using whitespace. Layout analysis has a substantial effect on performance. Recognize individual characters using a classifier. If word confidence is low, replace the word with the closest dictionary word. Unfortunately, many mistakes remain because word confidence is often high even for misrecognized words, so they are never replaced.
What research papers say: OCR uses a Hidden Markov Model (HMM). A language model constrains the possible words. Example: the character bigram ‘aa’ is rare in English; the only common word containing it is ‘aardvark’, so the model tends to replace it with a more common bigram. Weaknesses: People and company names may be a problem, for example ‘Aarhus’. Uncommon words may be misrecognized. It doesn’t deal well with mixed text and graphics, e.g. chemical formulae or model numbers. It is much slower.
OCR Problems. Isolated character recognition is relatively easy. Problems occur because the OCR doesn’t know what a character is or where it begins and ends. Poor quality images: smudges, low resolution, poor bitonal thresholding, page turned at the edge of the image. Non-standard fonts: unusual fonts, typewriter fonts, dot matrix fonts, script fonts. Small text: less than 10 pixels high is problematic. Complex layouts: lots of boxes and lines, touching lines, text in graphs. Text against image backgrounds.
Examples of OCR Errors OCR errors caused by non-touching nearby lines. AlGerian SONY 20" CRT TV IKV20FS120 20" STEREO COLOR TV, (COMPONENT VIDEO INPUT, EARPHONE JACK R.U34G5-H82 Fry* Offers A Performance Service Contra $2199 Script font causes problems. and of poteritial pathogens and is belisved to be a prerequisite for susceptibility - Co infection c-c the host. For Escherichia coli isolated -Prompatiens with urinary tract infection the severity of infection produced in vivo S strongly relaLed to the capacity to adhere tc human urinary L..act epithelia! cells invitro. Typewritten patent text (from Wolfgang Thielemann).
Solutions. Improve input image quality. N-gram indexing for search (some slowdown during retrieval). Correct OCR errors (offline). Document-level OCR (offline). Some techniques will work for both OCR errors and human-generated errors.
Image Input. Garbage in, garbage out: OCR works better with good quality scans. 300 dpi gray-level scans are best; use a higher resolution for small text. 300 dpi binary may be sufficient for modern laser-printed text. Recent USPTO scans seem to be of good quality.
Dictionary-Based Approaches. Find the nearest word in the lexicon and replace the current word with it. This fails for non-dictionary words, e.g. people’s names, product names, model numbers, chemical compounds.
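The nearest-lexicon-word idea can be sketched in a few lines. This is only an illustration, assuming a tiny hypothetical lexicon and using the similarity ratio from Python's standard `difflib` module as the notion of "nearest"; the slide does not say which distance a commercial OCR uses.

```python
import difflib

# Hypothetical toy lexicon; a real system would load a full dictionary.
LEXICON = ["television", "stereo", "component", "infection", "pathogens"]

def correct_with_lexicon(word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon word, or the word unchanged if none is close.

    The cutoff value is an assumption chosen for this example.
    """
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_with_lexicon("televlsion"))   # close to a lexicon entry
print(correct_with_lexicon("IKV20FS120"))   # model number: nothing close, left as-is
```

A model number such as IKV20FS120 is left untouched here, but a name or product code that happens to lie close to a dictionary word would be silently "corrected" to it, which is the failure mode the slide describes.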
N-gram Indexing. Index character n-grams. Example: ‘Simvastatin’ has the 3-grams sim, imv, mva, vas, ast, sta, tat, ati, tin; index all of these. At query time, split the query word into n-grams, search for the n-grams, and score by the count or percentage that match. Example: ‘Simvestatin’ has the 3-grams sim, imv, mve, ves, est, sta, tat, ati, tin, so 6/9 n-grams are still in common. Intuition: n-gram search tries to find what is common. Example references: Harding, Croft and Weir, Probabilistic Retrieval of Degraded Documents, European Conf. in Digital Libraries, 1997. Zobel and Dart, Finding Approximate Matches in Large Lexicons, Software, Practice and Experience, 25 (1995).
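The 3-gram example above can be reproduced with a short sketch. The function names are illustrative, and scoring by a raw shared count is just one of the options the slide mentions (count or percentage).

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, in order, lowercased."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_overlap(query, indexed, n=3):
    """Return (shared, total): how many n-grams of the indexed word occur in the query."""
    query_grams = set(char_ngrams(query, n))
    indexed_grams = char_ngrams(indexed, n)
    shared = sum(1 for g in indexed_grams if g in query_grams)
    return shared, len(indexed_grams)

print(char_ngrams("Simvastatin"))
print(ngram_overlap("Simvestatin", "Simvastatin"))  # the 6/9 case from the slide
```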
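N-gram Indexing. What about ‘Simvastativ’? 8/9 n-grams are in common, so it seems like a better match than ‘Simvestatin’ (6/9) even though both contain 1 error. This is an end effect: a character at the end of the word appears in only one 3-gram, while a character in the middle appears in three. The fix is to index circular n-grams, which wrap around from the end of the word to its beginning. Example: Simvastatin: sim, imv, mva, vas, ast, sta, tat, ati, tin, ins, nsi. Simvastativ: sim, imv, mva, vas, ast, sta, tat, ati, tiv, ivs, vsi. Now 8/11 are in common, which is consistent with the 8/11 obtained for a single error in the middle of the word. Different distance measures can be used to find similarity. N-gram indexing is not effective if there are many errors in the same word.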
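A small sketch of circular n-grams, comparing them with plain n-grams on the slide's examples. The helper names are illustrative; the wrap-around is done by appending the first n-1 characters of the word to its end.

```python
def plain_ngrams(word, n=3):
    """Ordinary character n-grams."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def circular_ngrams(word, n=3):
    """Character n-grams that wrap from the end of the word back to its start."""
    word = word.lower()
    wrapped = word + word[:n - 1]
    return [wrapped[i:i + n] for i in range(len(word))]

def shared(a, b, grams):
    """Number of distinct n-grams that two words have in common."""
    return len(set(grams(a)) & set(grams(b)))

for variant in ("Simvestatin", "Simvastativ"):
    print(variant,
          shared("Simvastatin", variant, plain_ngrams),
          shared("Simvastatin", variant, circular_ngrams))
```

With plain 3-grams the two single-error variants score differently; with circular 3-grams they score the same, which is the consistency the slide is after.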
Levenshtein distance. The minimum number of single-character insertions, deletions and substitutions required to transform one string into another. Examples: ‘Television’ vs ‘Televlsion’, distance 1; ‘Television’ vs ‘Telesion’, distance 2. It is slow to compute, so apply it after the n-gram match. Open source Lucene search engine: n-grams plus a windowed Levenshtein distance are implemented, with n-grams indexed as fields. Unfortunately it has a ‘weird’ scoring function that is hard to modify; the default scorer weights beginning and end n-grams more.
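For reference, a minimal sketch of the standard dynamic-programming Levenshtein distance, plus a hypothetical `rerank` helper showing how it might be applied only to the short candidate list returned by an n-gram match. This is not Lucene's implementation and it omits the windowing mentioned on the slide.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def rerank(query, candidates):
    """Order candidates (e.g. from an n-gram match) by edit distance to the query."""
    return sorted(candidates, key=lambda c: levenshtein(query, c))

print(levenshtein("Television", "Televlsion"))
print(levenshtein("Television", "Telesion"))
```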
Document OCR. Current OCRs recognize words in isolation and ignore document context. But multiple examples of the same word occur in a document. Cluster each word with words that are (say) 1 character away. Example: methyl, methy1, methyl, methyi, methyl. Use statistics to decide the correct word, for example a majority vote, or a confusion matrix with probabilities: 1, i and l are often confused. Determine the probability of confusion from a training corpus.
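The clustering-plus-majority-vote idea might look like the sketch below. It makes two simplifying assumptions that are not in the slides: "1 character away" is restricted to a single substitution (no insertions or deletions), and each token is simply replaced by the most frequent token in its neighbourhood.

```python
from collections import Counter

def within_one_substitution(a, b):
    """True if a and b have equal length and differ in at most one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def majority_vote_correct(tokens):
    """Replace each token with the most frequent token within one substitution of it."""
    counts = Counter(tokens)
    corrected = []
    for tok in tokens:
        cluster = [w for w in counts if within_one_substitution(tok, w)]
        corrected.append(max(cluster, key=lambda w: counts[w]))
    return corrected

print(majority_vote_correct(["methyl", "methy1", "methyl", "methyi", "methyl"]))
```

A plain vote like this would also merge genuinely different words that happen to differ by one character, which is one reason to weight the decision with confusion probabilities estimated from a training corpus.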
Word Spotting + OCR. Cluster word images (other examples of the same word) using word matching. This has been done for handwriting (Manmatha et al.). Example: for one cluster, let the OCR outputs be Alexandria, Alexanria and Alexandria. Decide the correct answer, e.g. by majority vote.
Example: Telugu. 75 million speakers. No commercial OCR exists for Telugu. A simple word-level research OCR scores 57% (labelled WER on the slide; since the figure rises as results improve, it presumably denotes word accuracy rather than word error rate). Document (book) level OCR: cluster similar word images, run OCR over each cluster, and take a majority vote over the cluster. The score improves from 57% to 82%. Preliminary work. Joint work with Anand Kumar, V. Rasagna and C. V. Jawahar at IIIT Hyderabad in India. Issues (figure): original image, cuts, merges.
Context of Other Words. Documents often contain distinctive words that are repeated many times. Say the OCR output contains penicillin, penicillin and penicillan. The context of the document can establish that penicillan should actually be penicillin, e.g. antibiotic may co-occur with penicillin but not with penicillan. Or say two television models are possible: HLCD2 (720 dpi) and HLCD3 (1080 dpi). Context, e.g. the resolution, may distinguish HLCD2 from HLCD3 even if the 2 or 3 is erroneously recognized.
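One very simple way to use context is to score each candidate by how many of its known co-occurring words appear nearby. The sketch below assumes hypothetical co-occurrence profiles (in practice these would have to be learned, for example from cleaner text); it is an illustration of the idea, not a method taken from the slides.

```python
# Hypothetical co-occurrence profiles: words expected near each term.
PROFILES = {
    "penicillin": {"antibiotic", "infection"},
    "HLCD2": {"720"},
    "HLCD3": {"1080"},
}

def pick_by_context(candidates, context_words, profiles=PROFILES):
    """Return the candidate whose profile overlaps most with the context words.

    Candidates without a profile score zero; ties go to the earlier candidate.
    """
    context = {w.lower() for w in context_words}
    return max(candidates, key=lambda c: len(profiles.get(c, set()) & context))

print(pick_by_context(["penicillin", "penicillan"], ["an", "antibiotic", "such", "as"]))
print(pick_by_context(["HLCD2", "HLCD3"], ["resolution", "1080", "dpi"]))
```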
Aligning multiple OCR outputs. Feng and Manmatha, A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books (JCDL’06): automatically align OCR output to groundtruth using a hierarchical HMM. Stewart, Crane and Babeu, A new generation of textual corpora: mining corpora from very large collections (JCDL'07): use the above technique to align two OCR strings to improve OCR quality. Multiple sources can also be used for alignment. Example: given the outputs ‘There is a tide in the affairs’, ‘There is a tide in the affeirs’ and ‘There timed in the affairs’, alignment corrects ‘e’ to ‘a’ in ‘affairs’ and shows that ‘timed’ is ‘tide’. Second example: ‘... his sentiments and my own perfectly agreed with ...’ aligned with ‘... his sentiments wo perfecbly agreed with’.
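To show what aligning two OCR outputs yields, here is a small sketch that uses `difflib.SequenceMatcher` from the Python standard library as a stand-in aligner. It is not the hierarchical HMM of Feng and Manmatha; it only surfaces the word spans on which two outputs disagree, which is where a vote or other correction would then be applied.

```python
import difflib

def diff_segments(ocr_a, ocr_b):
    """Align two OCR outputs word by word and return the spans that differ."""
    a, b = ocr_a.split(), ocr_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

reference = "There is a tide in the affairs"
print(diff_segments(reference, "There is a tide in the affeirs"))
print(diff_segments(reference, "There timed in the affairs"))
```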
OCR Errors: Chemical Names (from Wolfgang Thielemann’s slides). WO2007096753: 6(R)-[2-(8'(S)-2",2"-dimethylbutyryloxy-2'(S),6'(R)-dimethyl- l',2',6',7,'8',8a'(R)- hexahydronapthyl-l'(S))-ethyl]-4(R)-hydroxy -3,4-5,6-tetrahydro- 2H-pyran-2-one. WO2005095374: 6(R)-[2-[8(5)-(2,2-dimethyl.butyyloxy)-2 (S), 6 (R)-dimethyl-1, 2, 6, 7, 8, 8a(R)- hexahydro-l (S)-napthylelhyl/-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one. 141 patents; none of them match the correct chemical name. Word order is messed up and there are extra characters. Normalization is needed, for example removing everything that is not a letter or digit before indexing. This is domain specific. N-grams and alignment would be useful. Question: how far apart are chemical names? Technical terms? They are easier to correct when far apart.
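The normalization step suggested above (drop everything that is not a letter or digit) is a one-liner. In this sketch, lowercasing and the restriction to ASCII letters and digits are added assumptions, not something the slide specifies.

```python
import re

def normalize(name):
    """Strip everything except ASCII letters and digits, then lowercase."""
    return re.sub(r"[^A-Za-z0-9]", "", name).lower()

# Tail ends of the two chemical names shown on the slide.
print(normalize("-3,4-5,6-tetrahydro- 2H-pyran-2-one"))
print(normalize("3, 4, 5, 6-tetrahydro-2H-pyran-2 one"))
```

After normalization the two tails become identical even though their spacing and punctuation differ. Character-level OCR errors elsewhere in the names still remain, which is where n-grams or alignment would come in.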
Conclusion. There are a number of ways to improve search results in spite of OCR errors: improve input images; use n-gram indexing and Levenshtein distance for OCR errors; correct OCR errors; use document-level OCR.