EMOP Post-OCR Triage Diagnosing Page Image Problems with Post-OCR Triage for eMOP Matthew Christy, Loretta Auvil, Dr. Ricardo Gutierrez- Osuna, Boris Capitanu,

Slides:



Advertisements
Similar presentations
Mina Emad Azmy Research Assistant – Signal and Image Processing Lab.
Advertisements

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Word Spotting DTW.
Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010.
Orchard Harvest™ LIS Review Results Training
What you want is not what you get: Predicting sharing policies for text-based content on Facebook Arunesh Sinha*, Yan Li †, Lujo Bauer* *Carnegie Mellon.
CAPTURE SOFTWARE Please take a few moments to review the following slides. Please take a few moments to review the following slides. The filing of documents.
CAPTURE SOFTWARE Please take a few moments to review the following slides. Please take a few moments to review the following slides. The filing of documents.
Content Categorization A Road Map Julia Marshall USAID (Bridgeborn Inc.)
Optical Character Recognition Qurat-ul-Ain (Ainie) Akram Sarmad Hussain Center for language Engineering Al-Khawarizmi Institute of Computer Science University.
Extracting Personal Names from Applying Named Entity Recognition to Informal Text Einat Minkov & Richard C. Wang Language Technologies Institute.
Information Retrieval in Practice
Aletheia Apostolos Antonacopoulos PRImA Lab, The University of Salford, United Kingdom
Highlights Lecture on the image part (10) Automatic Perception 16
1/20 Document Segmentation for Image Compression 27/10/2005 Emma Jonasson Supervisor: Dr. Peter Tischer.
Recognizing User Interest and Document Value from Reading and Organizing Activities in Document Triage Rajiv Badi, Soonil Bae, J. Michael Moore, Konstantinos.
Beyond the Early Modern OCR Project or, What are We Going to do With Ourselves Now? Matthew Christy, Laura Mandell, Elizabeth Grumbach.
Overview of Search Engines
Proposal in Detail – Part 2
Handwritten Character Recognition using Hidden Markov Models Quantifying the marginal benefit of exploiting correlations between adjacent characters and.
Web Information Retrieval Projects Ida Mele. Rules Students can work in teams (max 3 people) The project must be delivered by the deadline that will be.
SCOPUS AND SCIVAL EVALUATION AND PROMOTION OF UKRAINIAN RESEARCH RESULTS PIOTR GOŁKIEWICZ PRODUCT SALES MANAGER, CENTRAL AND EASTERN EUROPE KIEV, 31 JANUARY.
Lecture #32 WWW Search. Review: Data Organization Kinds of things to organize –Menu items –Text –Images –Sound –Videos –Records (I.e. a person ’ s name,
Learning Unit Documents and Examples. Learning Units - basic building block of a course For iGETT a Learning Unit consists of –Three parts Instructor.
--Caesar Cat.  Write an optical character recognition application that identifies and recognizes printed text within an image.
Making a great Project 2 OCR 1994/2360. User Documentation Your User Documentation is the instructions for how to use the system you have created It is.
More HTRC Loretta Auvil, Boris Capitanu University of Illinois at Urbana-Champaign
The Early Modern OCR Project Big Data in the Humanities Matthew Christy, Laura Mandell, Elizabeth Grumbach.
Boris Babenko Department of Computer Science and Engineering University of California, San Diego Semi-supervised and Unsupervised Feature Scaling.
Options for digital delivery Record Society Conference, April 19 th 2007 Bruce Tate Project Manager British History Online.
1 Cronbach’s Alpha It is very common in psychological research to collect multiple measures of the same construct. For example, in a questionnaire designed.
Text Mining In InQuery Vasant Kumar, Peter Richards August 25th, 1999.
Proposed EOB Workflow. EOB Dilemma  Unstructured Documents, data appears in different areas  Small font affects ability for OCR software to bring back.
ISV Innovation Presented by ISV Innovation Presented by Business Intelligence Fundamentals: Data Cleansing Ola Ekdahl IT Mentors 9/12/08.
Aid Management Platform (AMP) Advanced User Training, Module Entering Activities.
Automatic Ground Truth Generation of Camera Captured Documents Using Document Image Retrieval Sheraz Ahmed, Koichi Kise, Masakazu Iwamura, Marcus Liwicki,
Working with the VB IDE. Running a Program u Clicking the”start” tool begins the program u The “break” tool pauses a program in mid-execution u The “end”
Tablet PC Capstone CSE 481b Richard Anderson Valentin Razmov.
Introduction Use machine learning and various classifying techniques to be able to create an algorithm that can decipher between spam and ham s. .
FIX Eye FIX Eye Getting started: The guide EPAM Systems B2BITS.
EXAMPLES OF DIGITAL ARCHIVES AND LIBRARIES Advanced Techniques in Processing Images Advanced Techniques in Processing Images Chapter 6. Slide 57.
Visualizations, Mashups and Dashboards University of Illinois at Urbana-Champaign.
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Lexile Project Guidelines for Data Collection and Analysis.
Observations on Scientific Writing in Social Sciences Dr. Abdalla Kafeel Associate Professor, Management Garden City College for Science & Technology
Handout Six: Sample Size, Effect Size, Power, and Assumptions of ANOVA EPSE 592 Experimental Designs and Analysis in Educational Research Instructor: Dr.
Using the Web for Language Independent Spellchecking and Auto correction Authors: C. Whitelaw, B. Hutchinson, G. Chung, and G. Ellis Google Inc. Published.
Comprehensive Continuous Improvement Plan(CCIP) Training Module 3 Funding Application.
© 2015 Lexmark International, Inc. All rights reserved. December 8 – 16 th, x Enovia Upgrade Training.
Optical Character Recognition
Information Retrieval in Practice
Access Problems and Solutions for Full-text Articles or E-books
Training Demo: DuPont Vendor Clients December 2016
IFLA Newspapers pre-conference Geneva, Arturs Zogla
Document Development Cycle
S.Rajeswari Head , Scientific Information Resource Division
Positive Posting & Public Online Presence
Training Demo: Mary Kay Vendor Clients
E-NOTIFY and CAER OnLine Training
2018 NM Community Survey Data Entry Training
Data Mining Practical Machine Learning Tools and Techniques
Why Compress? To reduce the volume of data to be transmitted (text, fax, images) To reduce the bandwidth required for transmission and to reduce storage.
Access Problems and Solutions for Full-text Articles or E-books
Object Detection Creation from Scratch Samsung R&D Institute Ukraine
Text Analytics and Machine Learning Workshop Machine Learning Session
Unit #1 Transformations
Re- engineeniering.
M. Kezunovic (P.I.) S. S. Luo D. Ristanovic Texas A&M University
Extracting Information from Diverse and Noisy Scanned Document Images
Quick and Dirty: the art of OCR
Presentation transcript:

eMOP Post-OCR Triage Diagnosing Page Image Problems with Post-OCR Triage for eMOP Matthew Christy, Loretta Auvil, Dr. Ricardo Gutierrez- Osuna, Boris Capitanu, Anshul Gupta, Elizabeth Grumbach

 emop.tamu.edu/ emop.tamu.edu/  DH2014 Presentation  emop.tamu.edu/post- processing  eMOP Workflows  emop.tamu.edu/workflow s emop.tamu.edu/workflow s  Mellon Grant Proposal  idhmc.tamu.edu/projects /Mellon/eMOPPublic.pdf idhmc.tamu.edu/projects /Mellon/eMOPPublic.pdf eMOP Info eMOP WebsiteMore eMOP  Facebook  The Early Modern OCR Project  Twitter  #emop DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 2

The Numbers Page Images  Early English Books online (Proquest) EEBO : ~125,000 documents, ~13 million pages images ( )  Eighteenth Century Collections Online (Gale Cengage) ECCO : ~182,000 documents, ~32 million page images ( )  Total : >300,000 documents & 45 million page images. Ground Truth  Text Creation Partnership TCP : ~46,000 double-keyed hand transcribed docuemnts  44,000 EEBO  2,200 ECCO DH Book History and Software Tools: Examining Typefaces for OCR Training in eMOP 3

Page Images DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 4

The Constraints  45 million page images!  Only 2 years  Small IDHMC team focused on gather data and training Tesseract for early modern typefaces  Great team of collaborators focusing on post-processing  Software Environment for the Advancement of Scholarly Research (SEASR) – University of Illinois, Urbana-Champaign  Perception, Sensing, and Instrumentation (PSI) Lab, Texas A&M University  Everything must be open- source  Focus our efforts on post- processing triage and recovery  Triage system will score page results and route pages to be corrected or analyzed for problems  Results: 1.Good quality, corrected OCR output 2.A DB of tagged pages indicating pre-processing needs DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 5 Solution

Post-Processing Triage DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 6

Post-Processing DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 7 Triage TreatmentDiagnosis

Triage: De-noising DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 8  Uses hOCR results 1.Determine average word bounding box size 2.Weed out boxes are that too big or too small 3.But keep small boxes that have neighbors that are “words”

Triage: De-noising DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 9 Before: 35%After: 58%

Triage: De-noising DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 10 BeforeAfter

Triage: Estimated Correctability DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 11 Page Evaluation  Determine how correctable a page’s OCR results are by examining the text.  The score is based on the ratio of words that fit the correctable profile to the total number of words Correctable Profile 1.Clean tokens:  remove leading and trailing punctuation  remaining token must have at least 3 letters 2.Spell check tokens >1 character 3.Check token profile :  contain at most 2 non-alpha characters, and  at least 1 alpha character,  have a length of at least 3,  and do not contain 4 or more repeated characters in a run 4.Also consider length of tokens compared to average for the page

DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 12 Triage: Estimated Correctability

Treatment: Page Correction DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 13 1.Preliminary cleanup  remove punctuation from begin/end of tokens  remove empty lines and empty tokens  combine hyphenated tokens that appear at the end of a line  retain cleaned & original tokens as “suggestions” 2.Apply common transformations and period specific dictionary lookups to gather suggestions for words.  transformation rules: rn->m; c->e; 1->l; e

DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 14 Treatment: Page Correction 3.Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our(sanitized, period-specific) Google 3- gram dataset  if no context is found and only one additional suggestion was made from transformation or dictionary, then replace with this suggestion  if no context and “clean” token from above is in the dictionary, replace with this token

DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 15 Treatment: Page Correction window: tbat l thoughc Candidates used for context matching:  tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)  l -> Set(l)  thoughc -> Set(thoughc, thought, though) ContextMatch: that l thought (matchCount: 1844, volCount: 1474) window: l thoughc Ihe Candidates used for context matching:  l -> Set(l)  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the) ContextMatch: l though the (matchCount: 497, volCount: 486) ContextMatch: l thought she (matchCount: 1538, volCount: 997) ContextMatch: l thought the (matchCount: 2496, volCount: 1905) tbat I thoughc Ihe Was

DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 16 Treatment: Page Correction window: thoughc Ihe Was Candidates used for context matching:  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)  Was -> Set(Was) ContextMatch: though ice was (matchCount: 121, volCount: 120) ContextMatch: though ike was (matchCount: 65, volCount: 59) ContextMatch: though she was (matchCount: 556,763, volCount: 364,965) ContextMatch: though the was (matchCount: 197, volCount: 196) ContextMatch: thought ice was (matchCount: 45, volCount: 45) ContextMatch: thought ike was (matchCount: 112, volCount: 108) ContextMatch: thought she was (matchCount: 549,531, volCount: 325,822) ContextMatch: thought the was (matchCount: 91, volCount: 91) that I thought she was

Treatment: Results DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 17

Treatment: Results DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 18

Diagnosis: Page Tagging DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 19  Tags pages with problems that prevent good OCR results  Can be used to apply appropriate pre- processing and re- OCRing  Eventually, will end up with a list of pages that simply need to be re- digitized  This will be the first time any comprehensive analysis has been done on these page images.  Users tag sample pages in a desktop version of Picasa  Machine learning algorithms use those tags to learn how to recognize skew, warp, noise, etc.  Have developed algorithms to:  measure skew  measure noise

Further/Current Work  Identifying multiple pages/columns in an image  Predicting juxta scores for documents without corresponding groundtruth  Identifying warp  Identify and fixing incorrect word order in hOCR output  can occur on pages with skew, vertical lines, decorative drop-caps, etc.  will affect scoring and context-based corrections Develop measure of noisiness Develop measure of skew-ness DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 20

The end For eMOP questions please contact us at : DH Diagnosing Page Image Problems with Post-OCR Triage for eMOP 21