Omni Font OCR Error Correction with Effect on Retrieval

Presentation transcript:

Omni Font OCR Error Correction with Effect on Retrieval
Walid Magdy (1), Kareem Darwish (2)
(1) Faculty of Engineering, Cairo University, Egypt
(1) School of Computing, Dublin City University, Ireland
(2) Faculty of Computers and Information, Cairo University, Egypt
(2) Cairo Microsoft Innovation Center, Microsoft Research, Egypt
ISDA, 30 November 2010

What do I mean by
Printed text is converted into digital text through an optical character recognition (OCR) process. Some errors can exist, which affect search.
OCR output: Printecl text i8 convenled into diyital tex throuyh optical chavacter recognition (0CK) process. Some ettors can exist, which attect search

State-of-the-art
Error Model: Printed ↔ Printecl → d ↔ cl
  Needs manual effort
  Needs an accurate algorithm for alignment
  Dependent on font

Research Question
Can we create a correction model for OCR that is:
  Font independent (omni font)
  Totally unsupervised
  Comparable with the state of the art in:
    Correction ability
    Retrieval effectiveness

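Approach
Pipeline (diagram): OCR text → Generate Candidates → List of possible corrections → Select Correction → Corrected text
  Generate Candidates: Error Model, or Calculate Edit Distance (ED)
  Select Correction: Language Model, Context
Example OCR word: cokkection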
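The "Calculate Edit Distance (ED)" step can be sketched as follows. This is a minimal illustration that assumes plain Levenshtein distance with unit costs; the slides do not say which edit-distance variant or cost scheme was used, and the function names `edit_distance` and `rank_by_edit_distance` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_by_edit_distance(ocr_word, candidates):
    """Order candidates by distance to the OCR word (ties alphabetical)."""
    return sorted(candidates, key=lambda w: (edit_distance(ocr_word, w), w))


print(edit_distance("cokkection", "collection"))  # 2
print(edit_distance("cokkection", "correction"))  # 2
print(rank_by_edit_distance("cokkection",
                            ["pyramids", "correction", "collection"]))
```

Both "collection" and "correction" are two substitutions away from "cokkection", so edit distance alone cannot separate them; that is where the language model and context come in.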

Initial Long List of Candidates
cokkection: collection, correction, …, pyramids
Index the dictionary of words:
  collection (index): {c, o, l, l, e, c, t, i, o, n, #c, co, ol, ll, le, ec, ct, ti, io, on, n#, #co, col, oll, lle, lec, ect, cti, tio, ion, on#, 10}
  cokkection (search): {c, o, k, k, e, c, t, i, o, n, #c, co, ok, kk, ke, ec, ct, ti, io, on, n#, #co, cok, okk, kke, kec, ect, cti, tio, ion, on#, 10}
1000 initial candidates to calculate ED
ED + unigram probability = prior probability
LM probability of word trigrams = posterior probability
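The indexing step above can be sketched as follows. The feature function reproduces the token lists shown on the slide (character unigrams, boundary-marked bigrams and trigrams, and the word length). The inverted index and the overlap-count ranking are assumptions made for illustration; the slides do not say how the index was searched or scored, and `build_index` and `shortlist` are hypothetical names.

```python
from collections import Counter, defaultdict


def char_ngram_features(word):
    """Unigrams of the word, bigrams and trigrams of '#word#', plus length."""
    padded = f"#{word}#"
    unigrams = list(word)
    bigrams = [padded[i:i + 2] for i in range(len(padded) - 1)]
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    return unigrams + bigrams + trigrams + [str(len(word))]


def build_index(dictionary):
    """Inverted index: feature -> set of dictionary words containing it."""
    index = defaultdict(set)
    for word in dictionary:
        for feat in set(char_ngram_features(word)):
            index[feat].add(word)
    return index


def shortlist(ocr_word, index, limit=1000):
    """Top `limit` dictionary words by number of distinct shared features
    (an assumed scoring scheme; ties are broken alphabetically)."""
    overlap = Counter()
    for feat in set(char_ngram_features(ocr_word)):
        for word in index.get(feat, ()):
            overlap[word] += 1
    ranked = sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:limit]]


index = build_index(["collection", "correction", "pyramids"])
print(shortlist("cokkection", index))
# ['collection', 'correction', 'pyramids']
```

The shortlist only narrows the dictionary cheaply; edit distance is then computed on the shortlisted words rather than on the whole dictionary.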

Experimental Setup
Two Arabic OCR document collections:
  ZAD: religious book, WER = 39%
  TREC AFP: newspapers, WER = 31%
Correction using Error Model (EM):
  ZAD: 2000 training words
  AFP: 4000 training words
Two domain-specific language models
Test EM vs ED correction:
  Error reduction
  Retrieval effectiveness

Error Reduction

Collection   Original WER   Correction   WER     Error Reduction
ZAD          39%            ED           17%     56%
ZAD          39%            EM (ref)     12%     70%
AFP          31%            ED           7.3%    76%
AFP          31%            EM (ref)     5.9%    81%
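The error-reduction figures are consistent with relative WER reduction, (WER before − WER after) / WER before. The sketch below assumes that definition, which the slides do not state explicitly; `relative_error_reduction` is a hypothetical name. Recomputing from the rounded WERs on the slide gives about 69% for the ZAD EM row rather than the reported 70%, presumably because the slide's figure was computed from unrounded values.

```python
def relative_error_reduction(wer_before, wer_after):
    """Relative WER reduction in percent (assumed definition)."""
    return 100.0 * (wer_before - wer_after) / wer_before


rows = [
    ("ZAD", "ED", 39.0, 17.0),
    ("ZAD", "EM (ref)", 39.0, 12.0),
    ("AFP", "ED", 31.0, 7.3),
    ("AFP", "EM (ref)", 31.0, 5.9),
]
for collection, method, before, after in rows:
    reduction = relative_error_reduction(before, after)
    print(f"{collection} {method}: {reduction:.1f}%")
```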

Retrieval results for ZAD

Conclusion
Omni font correction:
  Reduces errors by up to 75%
  Slightly lower than correction based on an error model (EM)
  Statistically indistinguishable from EM correction for search
  No training required
  Independent of font and language

Advice
Enjoy your stay in Egypt:
  Cairo: Pyramids, Nile
  Luxor, Aswan: Temples, Nile
  Sharm El-Sheikh: Red Sea, Safari
Do not drive unless you are Egyptian
Do not cross the road alone
Do not ask questions
Thank you

Equations
S_ED(w_i) =