Effective Retrieval of MisRecognized Scanned Documents R. Manmatha Multimedia Indexing and Retrieval Group, Center for Intelligent Information Retrieval,


Effective Retrieval of MisRecognized Scanned Documents R. Manmatha Multimedia Indexing and Retrieval Group, Center for Intelligent Information Retrieval, Dept. of Computer Science, University of Massachusetts, Amherst, MA, USA

Copyright: R. Manmatha

How commercial OCRs work

The following is true for many commercial OCR engines. Find the layout of the page: columns, paragraphs, images, lines and so on, using whitespace as the divider. Segment lines into characters, again using whitespace. Layout analysis has a substantial effect on performance. Recognize individual characters using a classifier. If word confidence is low, replace the word with the closest dictionary word. Unfortunately, many mistakes remain because the word confidence is high even when the word is wrong.

What research papers say

OCR uses a Hidden Markov Model (HMM). A language model constrains the possible words. Example: the character bigram 'aa' is rare in English; the only common word containing it is 'aardvark'. The model tends to replace it with a more common bigram.

Weaknesses: People and company names may be a problem, for example 'Aarhus'. Uncommon words may be misrecognized. The approach does not deal well with mixed text and graphics, e.g. chemical formulae or model numbers. It is also much slower.

OCR Problems

Isolated character recognition is relatively easy. Problems occur because the OCR does not know what a character is or where it begins and ends.

Poor quality images: smudges, low resolution, poor bitonal thresholding, page turned at the edge of the image.
Non-standard fonts: unusual fonts, typewriter fonts, dot matrix fonts, script fonts.
Small text: anything less than 10 pixels high is problematic.
Complex layouts: lots of boxes and lines, touching lines, text in graphs.
Text against image backgrounds.

Examples of OCR Errors

OCR errors caused by non-touching nearby lines.

AlGerian SONY 20" CRT TV IKV20FS120 20" STEREO COLOR TV, (COMPONENT VIDEO INPUT, EARPHONE JACK R.U34G5-H82 Fry* Offers A Performance Service Contra $2199

Script font causes problems.

and of poteritial pathogens and is belisved to be a prerequisite for susceptibility - Co infection c-c the host. For Escherichia coli isolated -Prompatiens with urinary tract infection the severity of infection produced in vivo S strongly relaLed to the capacity to adhere tc human urinary L..act epithelia! cells invitro.

Typewritten patent text (from Wolfgang Thielemann).

Solutions

Improve input image quality.
N-gram indexing for search. Some slowdown during retrieval.
Correct OCR errors. Done offline.
Document-level OCR. Done offline.
Some techniques will work for both OCR errors and errors made by people.

Image Input

Garbage in, garbage out. OCR works better with good quality scans. 300 dpi graylevel scans are best, and higher resolution is needed if the text is small. 300 dpi binary may be sufficient for modern laser-printed text. Recent USPTO scans seem to be of good quality.

Dictionary Based Approaches

Find the nearest word in the lexicon and replace the current word with it. This fails for non-dictionary words, e.g. people's names, product names, model numbers and chemical compounds.
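The dictionary lookup described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular OCR engine: it uses the similarity ratio from Python's standard `difflib` module as the notion of "nearest", and the function name `nearest_lexicon_word`, the tiny lexicon and the `cutoff` value are all assumptions made for the example.

```python
import difflib

def nearest_lexicon_word(word, lexicon, cutoff=0.8):
    """Return the closest lexicon entry, or the word unchanged if nothing is close.

    'Closest' here means the highest difflib similarity ratio, which is an
    assumption of this sketch; a real system might use edit distance instead.
    """
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Hypothetical lexicon for illustration only.
lexicon = ["television", "stereo", "color", "component"]

print(nearest_lexicon_word("televlsion", lexicon))  # corrected to a dictionary word
print(nearest_lexicon_word("KV20FS120", lexicon))   # model number: no entry is close
```

The second call shows the limitation noted on the slide: a model number has no neighbour in the lexicon, so a misrecognized one could not be repaired this way.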

N-gram Indexing

Index character n-grams. Example: 'Simvastatin' has the 3-grams sim, imv, mva, vas, ast, sta, tat, ati, tin. Index all of these.

Query: split the query word into n-grams, search for the n-grams, and score by the count or percentage that match. Example: 'Simvestatin' has the 3-grams sim, imv, mve, ves, est, sta, tat, ati, tin, so 6/9 n-grams are still in common.

Intuition: n-gram search tries to find what is common.

Example references: Harding, Croft and Weir, Probabilistic Retrieval of Degraded Documents, European Conf. in Digital Libraries. Zobel and Dart, Finding Approximate Matches in Large Lexicons, Software, Practice and Experience, 25 (1995).
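A small sketch of the indexing and query steps may help. It builds an inverted index from 3-grams to words and scores each candidate by the fraction of query n-grams it contains. The function names, the three-word vocabulary and the lowercasing step are assumptions for illustration; they are not taken from the cited papers.

```python
from collections import Counter, defaultdict

def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word, lowercased."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_index(vocabulary, n=3):
    """Map every n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index

def search(query, index, n=3):
    """Rank indexed words by the fraction of distinct query n-grams they share."""
    grams = set(char_ngrams(query, n))
    hits = Counter()
    for gram in grams:
        for word in index.get(gram, ()):
            hits[word] += 1
    return [(word, count / len(grams)) for word, count in hits.most_common()]

# Hypothetical vocabulary for illustration only.
index = build_index(["simvastatin", "lovastatin", "aspirin"])
results = search("Simvestatin", index)
print(results)  # 'simvastatin' ranks first with 6 of 9 n-grams in common
```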

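N-gram Indexing (continued)

What about 'Simvastativ'? 8/9 n-grams are in common. This looks like a better match even though it also contains just one error. The difference is an end effect: an error in the last character changes only one n-gram. To compensate, index circular n-grams, which wrap around from the end of the word to the beginning.

Example:
Simvastatin: sim, imv, mva, vas, ast, sta, tat, ati, tin, ins, nsi
Simvastativ: sim, imv, mva, vas, ast, sta, tat, ati, tiv, ivs, vsi
8/11 are in common, so a single error now costs the same number of n-grams wherever it occurs.

Different distance measures can be used to find similarity. N-gram indexing is not effective if there are many errors in the same word.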
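Circular n-grams are easy to generate by appending the first n-1 characters of the word to its end before slicing. The sketch below reproduces the 8/11 count from the example; the function names are assumptions for illustration.

```python
def circular_ngrams(word, n=3):
    """Return one n-gram per character position, wrapping around the word end."""
    word = word.lower()
    wrapped = word + word[:n - 1]
    return [wrapped[i:i + n] for i in range(len(word))]

def shared_circular_ngrams(a, b, n=3):
    """Count the distinct circular n-grams that two words have in common."""
    return len(set(circular_ngrams(a, n)) & set(circular_ngrams(b, n)))

print(circular_ngrams("Simvastatin"))
print(shared_circular_ngrams("Simvastatin", "Simvastativ"))  # 8 of 11
```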

Levenshtein distance

The minimum number of additions, deletions or substitutions of a single character required to transform one string into another. Example: 'Television' vs 'Televlsion' has distance 1. 'Television' vs 'Telesion' has distance 2. The computation is slow, so do it after the n-gram match.

Open source Lucene search engine: n-grams plus a windowed Levenshtein distance are implemented, with n-grams indexed as fields. Unfortunately the scoring function is 'weird' and hard to modify. The default scorer weights beginning and end n-grams more.
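For reference, the distance can be computed with the standard dynamic-programming recurrence, keeping only two rows of the table. This is a plain textbook sketch, not the windowed variant that the slide attributes to Lucene.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("Television", "Televlsion"))  # 1
print(levenshtein("Television", "Telesion"))    # 2
```

Because each comparison costs time proportional to the product of the two word lengths, it makes sense to apply it only to the short candidate list returned by the n-gram match.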

Document OCR

Current OCRs recognize words in isolation and ignore document context. Multiple examples of the same word occur in a document. Cluster each word with the words that are (say) 1 character away. Example: methyl, methy1, methyl, methyi, methyl.

Use statistics to decide the correct word. Example: majority vote. Alternatively, use a confusion matrix with probabilities: 1, i and l are often confused. Determine the probability of confusion from a training corpus.
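The clustering and voting idea can be sketched as follows. Two simplifications are assumed here and are not from the slides: "1 character away" is taken to mean a single substitution in words of equal length, and clustering is greedy against the first member of each cluster. The confusion-matrix weighting is not modelled.

```python
from collections import Counter

def differs_by_at_most_one(a, b):
    """True if the words have equal length and at most one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def cluster_tokens(tokens):
    """Greedily group tokens with the first cluster whose first member is close."""
    clusters = []
    for token in tokens:
        for cluster in clusters:
            if differs_by_at_most_one(token, cluster[0]):
                cluster.append(token)
                break
        else:
            clusters.append([token])
    return clusters

def correct_tokens(tokens):
    """Replace every token with the most frequent spelling in its cluster."""
    corrected = {}
    for cluster in cluster_tokens(tokens):
        winner = Counter(cluster).most_common(1)[0][0]
        for token in cluster:
            corrected[token] = winner
    return [corrected[token] for token in tokens]

ocr_tokens = ["methyl", "methy1", "methyl", "methyi", "methyl", "ethanol"]
print(correct_tokens(ocr_tokens))
```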

Word Spotting + OCR

Cluster word images (other examples of the same word) using word matching. This has been done for handwriting (Manmatha et al.). Example: for one cluster, let the OCR outputs be Alexandria, Alexanria and Alexandria. Decide the correct answer, e.g. by majority vote.

Example: Telugu

75 million speakers. No commercial OCR exists for Telugu. A simple word-level research OCR scores 57% (labelled WER on the slide; since the figure rises when results improve, it presumably denotes word-level accuracy rather than word error rate).

Document (book) level OCR: cluster similar word images, run OCR over each cluster, and take a majority vote over the cluster. The OCR figure improves from 57% to 82%.

Preliminary work. Joint work with Anand Kumar, V. Rasagna and C. V. Jawahar at IIIT Hyderabad in India.

Issues: cuts and merges. (Figure: word images labelled Original, Cut, Merge.)

Context of other Words

Documents often contain distinctive words that are repeated many times. Say a document contains penicillin, penicillin and penicillan. The context of the document can establish that penicillan should actually be penicillin. For example, antibiotic may co-occur with penicillin but not with penicillan.

Say two television models are possible:
HLCD2: 720 dpi
HLCD3: 1080 dpi
Context, e.g. the resolution, may distinguish HLCD2 from HLCD3 even if the 2 or 3 is erroneously recognized.

Aligning multiple OCR outputs

Feng and Manmatha, A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books (JCDL'06): automatically align OCR output to groundtruth using a hierarchical HMM.

Stewart, Crane and Babeu, A new generation of textual corpora: mining corpora from very large collections (JCDL'07): use the above technique to align two OCR strings to improve OCR quality. Multiple sources can also be used for alignment.

Example: correct 'e' to 'a' in affairs; 'timed' is 'tide'.
There is a tide in the affairs
There is a tide in the affeirs
There timed in the affairs

his sentiments and my own perfectly agreed with …...
his sentiments wo perfecbly agreed with
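To make the alignment idea concrete, here is a minimal sketch that aligns two OCR outputs word by word and reports where they disagree. It deliberately swaps in the generic sequence matcher from Python's standard `difflib` module in place of the hierarchical HMM used in the cited papers, so it illustrates only the general idea, not their method. The function name `disagreements` is an assumption.

```python
import difflib

def disagreements(ocr_a, ocr_b):
    """Align two OCR outputs at word level and return the spans that differ."""
    words_a, words_b = ocr_a.split(), ocr_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

pairs = disagreements(
    "There is a tide in the affeirs",
    "There timed in the affairs",
)
print(pairs)  # [('is a tide', 'timed'), ('affeirs', 'affairs')]
```

Each disagreeing pair is a candidate correction; a third OCR output or a dictionary could then be used to pick the winner.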

OCR Errors: Chemical Names (from Wolfgang Thielemann's slides)

WO (R)-[2-(8'(S)-2",2"-dimethylbutyryloxy-2'(S),6'(R)-dimethyl- l',2',6',7,'8',8a'(R)- hexahydronapthyl-l'(S))-ethyl]-4(R)-hydroxy -3,4-5,6-tetrahydro- 2H-pyran-2-one

WO (R)-[2-[8(5)-(2,2-dimethyl.butyyloxy)-2 (S), 6 (R)-dimethyl-1, 2, 6, 7, 8, 8a(R)- hexahydro-l (S)-napthylelhyl/-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one

141 patents. None of them match the correct chemical name. The word order is messed up and there are extra characters.

Normalization is needed. Example: remove everything that is not a letter or digit before indexing. This is domain specific. N-grams and alignment would be useful.

Question: how far apart are chemical names? Technical terms? Correction is easier when they are far apart.
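The normalization step suggested above is a one-line regular expression. In this sketch the lowercasing and the restriction to ASCII letters and digits are extra assumptions beyond what the slide states.

```python
import re

def normalize(name):
    """Lowercase the name and drop everything that is not an ASCII letter or digit."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Fragments of the two OCR variants shown above, differing only in punctuation.
variant_a = "3,4-5,6-tetrahydro- 2H-pyran-2-one"
variant_b = "3, 4, 5, 6-tetrahydro-2H-pyran-2 one"

print(normalize(variant_a))
print(normalize(variant_a) == normalize(variant_b))  # True
```

Normalization removes punctuation and spacing differences only. Character confusions such as 'l' for '1' survive it, which is why n-gram matching or alignment would still be needed afterwards.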

Conclusion

There are a number of ways to improve search results in spite of OCR errors:
Improve input images.
N-gram indexing and Levenshtein distance for OCR errors.
OCR error correction.
Document-level OCR.