Scanned Documents INST 734 Module 10 Doug Oard

Presentation transcript:

Scanned Documents INST 734 Module 10 Doug Oard

Agenda
Document image retrieval
→ Representation
Retrieval
Thanks to David Doermann for most of these slides

Images
A collection of dots called “pixels”
  – Pixels often binary-valued (black, white)
Greyscale or color is sometimes needed
  – Arranged in a grid and called a “bitmap”
300 dots per inch (dpi) gives good results
  – Often stored in TIFF or PDF format
Images are fairly large (~1 MB per page)
“Content” is in the relation between pixels
  – Image analysis seeks to mimic human visual behavior
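
The "~1 MB per page" figure can be checked with a little arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches) stored uncompressed; the page size and the function name are illustrative assumptions, not taken from the slides.

```python
def bitmap_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page: US Letter, 8.5 x 11 inches, scanned at 300 dpi.
binary_page = bitmap_size_bytes(8.5, 11, 300, 1)  # 1 bit per pixel (black/white)
grey_page = bitmap_size_bytes(8.5, 11, 300, 8)    # 8 bits per pixel (greyscale)

print(binary_page)  # 1051875 bytes, about 1 MB
print(grey_page)    # 8415000 bytes, about 8 MB
```

Under these assumptions a binary page is about 1 MB and a greyscale page about eight times larger, before any compression.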

Document Image Analysis

Page Analysis
Skew correction
  – Based on finding the primary orientation of lines
Image and text region detection
  – Based on texture and dominant orientation
Structural classification
  – Infer logical structure from physical layout
Text region classification
  – Title, author, letterhead, signature block, etc.
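
One common way to find the primary orientation of text lines is a projection profile search: try a range of candidate angles, and keep the one at which the black pixels pile up into the fewest rows. The slides do not say which method is meant, so the following is only a minimal sketch of that general idea. The function names, the candidate range of ±5 degrees, and the synthetic test page are all assumptions made for illustration.

```python
import math
from collections import Counter


def profile_sharpness(pixels, angle_deg):
    """Sum of squared row counts after undoing a skew of angle_deg.

    pixels is a list of (x, y) coordinates of black pixels. The score is
    largest when each text line collapses into as few rows as possible.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    rows = Counter(round(y * cos_a - x * sin_a) for x, y in pixels)
    return sum(count * count for count in rows.values())


def estimate_skew(pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (in degrees) with the sharpest profile."""
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda ang: profile_sharpness(pixels, ang))


def synthetic_page(skew_deg, baselines=(20, 40, 60, 80, 100), width=200):
    """Hypothetical test data: thin 'text lines' tilted by skew_deg."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]


page = synthetic_page(3.0)
estimated = estimate_skew(page)
print(estimated)
```

Once the angle is known, the page can be rotated by its negative before region detection. A real system would search more finely and work on the full bitmap rather than a pixel list.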

Page Layer Segmentation

Image Detection

Text Region Detection

Application to Page Segmentation
Printed text
Handwriting
Noise

Language Identification
Language-independent skew detection
  – Accommodate horizontal and vertical writing
Script class recognition
  – Asian scripts have blocky characters
  – Connected scripts can’t be segmented easily
Language identification
  – Shape statistics work well for western languages
  – Competing classifiers work for Asian languages

Optical Character Recognition
Pattern-matching approach
  – Standard approach in commercial systems
  – Segment individual characters
  – Recognize using a neural network classifier
Hidden Markov model approach
  – Experimental approach
  – Segment into sub-character slices
  – Limited lookahead to find best character choice
  – Useful for connected scripts (e.g., Arabic)
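
To make the "segment, then recognize" step concrete, here is a deliberately tiny sketch of recognizing one already-segmented character. It is not the neural network classifier the slide mentions: it swaps in nearest-template matching by Hamming distance, which is easier to show in a few lines. The 3x3 glyph templates and all names are hypothetical.

```python
# Hypothetical 3x3 binary templates ("1" = black pixel), one per character.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def hamming(glyph_a, glyph_b):
    """Number of pixel positions at which two same-sized glyphs differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(glyph_a, glyph_b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize(glyph):
    """Return the label of the template closest to the segmented glyph."""
    return min(TEMPLATES, key=lambda label: hamming(glyph, TEMPLATES[label]))


# A "T" with one extra black pixel in the bottom-right corner.
noisy_t = ("111", "010", "011")
print(recognize(noisy_t))  # T
```

The sketch also hints at why segmentation errors are costly in this approach: the classifier only ever sees the pixels it is handed, so a badly cut glyph is matched against the wrong shape.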

OCR Accuracy Problems
Character segmentation errors
  – In English, segmentation often changes “m” to “rn”
Character confusion
  – Characters with similar shapes often confounded
OCR on copies is much worse than on originals
  – Pixel bloom, character splitting, binding bend
Uncommon fonts can cause problems
  – If not used to train the neural network character recognizers

Improving OCR Accuracy
Image preprocessing
  – Mathematical morphology for bloom and splitting
  – Particularly important for degraded images
“Voting” between several OCR engines helps
  – Individual systems depend on specific training data
Linguistic analysis can correct some errors
  – Use confusion statistics, word lists, syntax, …
  – But more harmful errors might be introduced
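
Two of these ideas, voting and word-list correction, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the engine outputs are assumed to be already aligned to the same length, and the confusion pairs and the lexicon are made-up examples rather than measured confusion statistics.

```python
from collections import Counter

# Assumed (wrong, right) confusion pairs; real systems would estimate
# these from OCR output compared against ground truth.
CONFUSIONS = [("rn", "m"), ("cl", "d"), ("1", "l"), ("0", "o")]


def vote(outputs):
    """Character-level majority vote across equal-length OCR outputs."""
    if len({len(o) for o in outputs}) != 1:
        raise ValueError("this sketch assumes outputs are aligned to equal length")
    return "".join(Counter(chars).most_common(1)[0][0] for chars in zip(*outputs))


def correct(word, lexicon):
    """Undo one known confusion if that turns word into a lexicon entry."""
    if word in lexicon:
        return word
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in lexicon:
                return candidate
            start = word.find(wrong, start + 1)
    return word  # no safe correction found; leave the word alone


lexicon = {"document", "modern", "clear"}
print(vote(["c1ear", "clear", "cleor"]))  # clear
print(correct("docurnent", lexicon))      # document
print(correct("modern", lexicon))         # modern (already a word, untouched)
```

The warning on the slide applies directly: with a lexicon containing "com" but not "corn", this function would rewrite a correctly recognized "corn" to "com", introducing an error that is harder to spot than the original one.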

Logical Page Analysis (Reading Order)
Can be hard to guess in some cases
  – Newspaper columns, figure captions, appendices, …
Sometimes there are explicit guides
  – “Continued on page 4” (but page 4 may be big!)
Structural cues can help
  – Column 1 might continue to column 2
Content analysis is also useful
  – Word co-occurrence statistics, syntax analysis
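
The "column 1 might continue to column 2" cue can be turned into a simple ordering rule: group text blocks into columns by their left edge, read columns left to right, and read each column top to bottom. The sketch below assumes blocks are given as (left_x, top_y, text) tuples and uses a hypothetical gap threshold to separate columns. It ignores the harder cases the slide lists, such as captions and continuation notes.

```python
def reading_order(blocks, column_gap=50):
    """Order text blocks column by column, top to bottom within a column.

    blocks is a list of (left_x, top_y, text). Two blocks are placed in the
    same column when their left edges are less than column_gap apart
    (an assumed threshold, in pixels).
    """
    columns = []
    for block in sorted(blocks, key=lambda b: b[0]):
        if columns and block[0] - columns[-1][-1][0] < column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return [text for _, _, text in ordered]


# Hypothetical two-column page, listed in scrambled order.
blocks = [
    (300, 100, "col 2, top"),
    (40, 400, "col 1, bottom"),
    (42, 100, "col 1, top"),
    (302, 400, "col 2, bottom"),
]
print(reading_order(blocks))
```

For the example page this yields the two blocks of column 1 followed by the two blocks of column 2.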

Agenda
Document image retrieval
Representation
→ Retrieval
Thanks to David Doermann for most of these slides