Optical Character Recognition

Optical Character Recognition
Senthil K.N, Chimera Technologies

What is OCR
Optical Character Recognition, or OCR, is a technology that converts different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data.

Why OCR
Quick digital search: OCR software converts scanned text into a word-processing file, which you can then search by keyword or phrase. You can search hundreds of documents and locate what you need within seconds.
Edit text: Once a document has been scanned, its text can be edited easily, so you can add relevant information later without having to find the original document.
Access: Invoices, papers and other records can be scanned and then accessed from anywhere, which is especially convenient for organizations that have branches in different locations.
Retyping can be avoided: If a digital file is accidentally erased, you can recreate it by scanning the original paper document again instead of typing it from scratch.
Storage space: Storing documents in digital form saves a lot of physical space. A backup can also be created to protect the documents against damage.

Where we use OCR
Text-to-speech
Text mining
Pattern recognition
Artificial intelligence
Computer vision

Architecture

Techniques

Open source
Tesseract OCR
GOCR
VietOCR
FreeOCR

Commercial products
Adobe Acrobat Pro
Captiva
Abbyy
Informatica

Tesseract OCR
Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006. Tesseract is considered one of the most accurate open source OCR engines currently available.
Steps involved in OCR
1. Training phase.
2. Test phase.

Training phase
Commands:
tesseract
unicharset_extractor
mftraining
cntraining
combine_tessdata

Train Tesseract for your own font
Create boxes
Edit boxes manually
Extract unicharset
Shape clustering
Combine files

Steps
1. Install Tesseract on your computer
2. Get your TIFs in order
First of all, you need some sample data to train the scanner with. Try to get images that are as clean as possible: tidy them up with Photoshop or equivalent software, and make the background white and the text black. The more samples you have, the better. Every character that you need to recognize must appear in the images. Cut everything else out of the image and put the characters on a single line. Make sure the characters are separated by enough space that they do not bleed into each other. Create at least five different images containing the characters in different orders. Remember that Tesseract will try to learn to recognize words, so if you plan to scan codes, take care not to put the characters in the same order in every sample.
Create a folder somewhere on your filesystem to keep all the training files in one place. Name the images [language].[fontname].exp[samplenumber].tif, for example:
eng.strangelabelmachinefont.exp0.tif
Example sample: an image of a phone number.
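The naming convention above can be generated and checked with a few lines of Python. This is only a sketch: the helper names sample_name and parse_sample_name are made up for illustration, and the pattern assumes that neither the language nor the font name contains a dot.

```python
import re

# Pattern for the [language].[fontname].exp[samplenumber].tif convention.
# Assumption: language and font name contain no dots.
NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$")


def sample_name(lang, font, num):
    """Build a training image name for the given sample number."""
    return f"{lang}.{font}.exp{num}.tif"


def parse_sample_name(name):
    """Return (language, fontname, samplenumber), or None if the name does not fit."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    return m["lang"], m["font"], int(m["num"])


# Five samples, as the slide recommends at least five images.
names = [sample_name("eng", "strangelabelmachinefont", i) for i in range(5)]
print(names[0])  # eng.strangelabelmachinefont.exp0.tif
```

Checking every file in the training folder with parse_sample_name before starting is a cheap way to catch a misnamed sample early.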

3. Create the box files
Now, for each of the sample files, run Tesseract to create a box file. A box file lists every character that Tesseract recognizes, together with the position of that character. Open a terminal and, for each of the TIFs, type:
tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] batch.nochop makebox
or, as in our case:
tesseract eng.strangelabelmachinefont.exp0.tif eng.strangelabelmachinefont.exp0 batch.nochop makebox
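With several samples, typing the command by hand gets error-prone. The sketch below only builds the argument list for the makebox command quoted in this step and does not run Tesseract. The function name makebox_command is an illustrative choice, not part of any Tesseract tooling.

```python
import shlex
from pathlib import Path


def makebox_command(tif_name):
    """Argument list for the box-file command shown on the slide."""
    # The output base is the image name without its .tif extension.
    base = str(Path(tif_name).with_suffix(""))
    return ["tesseract", tif_name, base, "batch.nochop", "makebox"]


cmd = makebox_command("eng.strangelabelmachinefont.exp0.tif")
print(shlex.join(cmd))
```

Once Tesseract is installed, the same list could be passed to subprocess.run(cmd, check=True) from inside the training folder.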

4. Correct the box files
+ 27 21 57 50 0
4 65 20 89 55 0
? 92 20 116 55 0
5 119 23 142 58 0
8 147 23 171 58 0
0 175 23 199 58 0
6 388 22 413 56 0
6 417 21 441 56 0
The leftmost character on each row is the character that Tesseract thinks it found. The rest of the row holds the pixel coordinates of its bounding box (left, bottom, right, top) followed by the page number. As you can see, it made a mistake with the character "7" and guessed it to be a "?". The positions of the characters would be VERY hard to check by hand unless you have an amazing talent for imagining pixels in your head. Luckily there are some tools available to help with this step. The only tool that I found to work and be useful is jTessBoxEditor. You can get it here: http://vietocr.sourceforge.net/training.html
Correct the characters that are wrong, and make sure that each surrounding box encloses its entire character. If not, correct the values in the top row. Once you are happy with the result, press save and move on to the next file.
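For a quick check without a GUI, the box rows can also be read and corrected in a script. The following is a minimal sketch using the rows from this slide. The Box type and helper names are invented for illustration, and replacing every "?" with "7" is only right for this particular sample.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one row: the character, four coordinates, then the page number."""
    char, *nums = line.split()
    left, bottom, right, top, page = map(int, nums)
    return Box(char, left, bottom, right, top, page)


def format_box(box):
    """Turn a Box back into a box-file row."""
    return " ".join(str(field) for field in box)


BOX_TEXT = """\
+ 27 21 57 50 0
4 65 20 89 55 0
? 92 20 116 55 0
5 119 23 142 58 0
8 147 23 171 58 0
0 175 23 199 58 0
6 388 22 413 56 0
6 417 21 441 56 0
"""

boxes = [parse_box_line(line) for line in BOX_TEXT.splitlines()]

# Sample-specific fix: the "?" on the third row is really a "7".
fixed = [b._replace(char="7") if b.char == "?" else b for b in boxes]

# Sanity check: every box must have positive width and height.
for b in fixed:
    assert b.left < b.right and b.bottom < b.top

print("".join(b.char for b in fixed))  # +4758066
```

A script like this can fix labels, but it cannot tell whether a box actually encloses its glyph, so a visual pass in a box editor is still worthwhile.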

5. Training time
Now that you have some good boxes, it's time to start the actual training of the scanner. For each of your TIF/box pairs, run the following command:
tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] box.train
or
tesseract eng.strangelabelmachinefont.exp0.tif eng.strangelabelmachinefont.exp0 box.train
6. Create the unicharset file
Run unicharset_extractor with each of the box files as a parameter:
unicharset_extractor eng.strangelabelmachinefont.exp0.box eng.strangelabelmachinefont.exp1.box…
You will probably not need to edit this file, unless you are on some strange old system like Windows 95.
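Steps 5 and 6 both iterate over the same set of samples, so the command lines can be derived from one list of image names. This sketch only prints the commands quoted in these steps. The function names are illustrative, and the two-sample list is an assumption for the demo.

```python
import shlex
from pathlib import Path


def train_command(tif_name):
    """Step 5: the box.train command for one TIF/box pair."""
    base = str(Path(tif_name).with_suffix(""))
    return ["tesseract", tif_name, base, "box.train"]


def unicharset_command(tif_names):
    """Step 6: unicharset_extractor with every matching .box file as a parameter."""
    boxes = [str(Path(name).with_suffix(".box")) for name in tif_names]
    return ["unicharset_extractor", *boxes]


tifs = [f"eng.strangelabelmachinefont.exp{i}.tif" for i in range(2)]
for tif in tifs:
    print(shlex.join(train_command(tif)))
print(shlex.join(unicharset_command(tifs)))
```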

7. Create the font_properties file
Create a new file and name it lang.font_properties. In this file, create a row for each font you are using in your training files. If you are like me, trying to scan a receipt or label with a strange unknown font, you will likely need just one row. Each row starts with the name of the font, followed by a boolean value (0 or 1) for each of the possible font properties:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
Example:
somestrangelabelmachinefont 0 0 1 0 0
Important: make sure to add an extra line break at the end of the file.
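The row format is simple enough to generate, which also takes care of the trailing line break. A minimal sketch follows. The helper names are invented for illustration, and it reads "extra line break" as meaning that every row, including the last, ends with a newline.

```python
def font_properties_row(fontname, italic=False, bold=False,
                        fixed=False, serif=False, fraktur=False):
    """One row: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname, *(str(int(flag)) for flag in flags)])


def font_properties_text(rows):
    """Join rows so that each one, including the last, ends with a newline."""
    return "".join(row + "\n" for row in rows)


row = font_properties_row("somestrangelabelmachinefont", fixed=True)
text = font_properties_text([row])
print(repr(text))
```

The resulting text could then be saved with pathlib.Path("lang.font_properties").write_text(text).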

8. Clustering
Time to cluster all the features of the trained font. Enter the following in the terminal:
shapeclustering -F font_properties -U unicharset [language].[fontname].exp0.tr [language].[fontname].exp1.tr…
9. Shapetable
mftraining -F font_properties -U unicharset [language].[fontname].exp0.tr [language].[fontname].exp1.tr…
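Both commands take the same options followed by the .tr files, and the options must be typed with plain ASCII hyphens (-F and -U); dashes pasted from slides or word processors are a common source of errors. Building the argument list in code sidesteps that. This is a sketch only: training_command is an illustrative name, and the two .tr file names are assumed for the demo.

```python
import shlex


def training_command(tool, tr_files,
                     font_properties="font_properties",
                     unicharset="unicharset"):
    """Argument list shared by shapeclustering and mftraining in steps 8 and 9."""
    return [tool, "-F", font_properties, "-U", unicharset, *tr_files]


tr_files = [f"eng.strangelabelmachinefont.exp{i}.tr" for i in range(2)]
for tool in ("shapeclustering", "mftraining"):
    print(shlex.join(training_command(tool, tr_files)))
```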

11. unicharambigs
This file is created manually and is meant to list commonly mistaken characters and what to substitute for them:
language.unicharambigs
Also put a blank line at the end of the file.
12. Wrapping everything up
Now you're good to go ahead and create the final training file that will be used in your app, unless you had ANY errors prior to this step. Each file to be combined should be named with the language prefix:
language.filename
Then run the final command:
combine_tessdata lang.
This produces language.traineddata.

Test phase

Test Character/Word
Pre-processing
De-skew: If the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise to make lines of text perfectly horizontal or vertical.
Despeckle: Remove positive and negative spots and smooth edges.
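As a rough illustration of despeckling, the sketch below flips any pixel whose neighbours all have the opposite value, which removes both isolated dark specks and isolated light holes. It assumes the image is already a grid of 0 (background) and 1 (ink) values, and it is a deliberately naive rule rather than what any particular OCR engine does; it does not attempt edge smoothing.

```python
def despeckle(binary):
    """Flip pixels that disagree with every one of their (up to 8) neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    out = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            neighbours = [
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            # Decisions are based on the input image only, so the result
            # does not depend on the order in which pixels are visited.
            if neighbours and all(n != binary[y][x] for n in neighbours):
                out[y][x] = 1 - binary[y][x]
    return out


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # isolated speck
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
for row in despeckle(noisy):
    print("".join("#" if px else "." for px in row))
```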

Binarisation: Convert an image from colour or greyscale to black-and-white (called a "binary image" because there are two colours). Binarisation is a simple way of separating the text (or any other desired image component) from the background.
Line removal: Clean up non-glyph boxes and lines.
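The simplest form of binarisation is a fixed global threshold, sketched below on a tiny greyscale grid. The threshold of 128 and the 0 to 255 value range are assumptions for the example; real documents with uneven lighting usually need an adaptive or automatically chosen threshold.

```python
def binarise(gray, threshold=128):
    """Return a binary image: 1 for dark (text) pixels, 0 for light background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# A tiny greyscale patch (0 = black, 255 = white) with a dark stroke in it.
gray = [
    [250, 248, 30, 245],
    [251, 20, 25, 249],
    [247, 246, 35, 250],
]
binary = binarise(gray)
for row in binary:
    print("".join("#" if px else "." for px in row))
```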

Layout analysis: Identifies columns, paragraphs, captions, etc. as distinct blocks.
Line and word detection: Establishes the baseline for word and character shapes and separates words if necessary.

Segmentation: For per-character OCR, multiple characters that are connected due to image artifacts must be separated, and single characters that are broken into multiple pieces due to artifacts must be connected.
Vertical grid based.
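The slide does not spell out what the vertical grid approach involves. One common reading, assumed here, is to cut a text line wherever a whole pixel column is empty. The sketch below does exactly that on a binary grid (1 = ink); it handles only characters separated by at least one blank column and does nothing for touching or broken characters.

```python
def split_columns(binary):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(binary[0]) if binary else 0
    ink = [any(row[x] for row in binary) for x in range(width)]
    spans = []
    start = None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x  # a character begins
        elif not has_ink and start is not None:
            spans.append((start, x))  # a blank column ends it
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
print(split_columns(line))  # [(0, 2), (4, 5), (6, 8)]
```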

Character recognition
Matrix matching compares an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching", "pattern recognition", or "image correlation".
First pass: read character.
Second pass: understand character.
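Matrix matching can be sketched as counting the pixels on which a candidate image and each stored glyph agree, then picking the glyph with the highest score. The two 3x5 templates below are made up for the example, and the sketch assumes the image has already been scaled to the template size.

```python
def match_score(image, glyph):
    """Fraction of pixels on which two same-sized binary images agree."""
    total = sum(len(row) for row in glyph)
    same = sum(
        a == b
        for image_row, glyph_row in zip(image, glyph)
        for a, b in zip(image_row, glyph_row)
    )
    return same / total


def recognise(image, templates):
    """Return the character whose stored glyph best matches the image."""
    return max(templates, key=lambda ch: match_score(image, templates[ch]))


# Hypothetical stored glyphs, 3 pixels wide and 5 pixels high.
TEMPLATES = {
    "1": [[0, 1, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0], [1, 1, 1]],
    "7": [[1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 1, 0]],
}

# A "7" with one noisy pixel in the fourth row.
noisy_seven = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 1, 1], [0, 1, 0]]
print(recognise(noisy_seven, TEMPLATES))  # 7
```

Because every pixel counts equally, this approach is sensitive to font, scale and alignment, which is why it works best on a known, fixed typeface.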

Post-processing
Doc format
Stored in DB
Plain text format
Annotated PDF

Example

Reference links
https://blog.cedric.ws/how-to-train-tesseract-301
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
https://en.wikipedia.org/wiki/Optical_character_recognition

Thank you