Electronic Document Management. Naji Shukri Alzaza, University of Palestine, February 2010.


1 Electronic Document Management. Naji Shukri Alzaza, University of Palestine, February 2010

2 Optical Character Recognition (OCR) Naji Shukri Alzaza, EDM, University of Palestine, February 2010

3 What is OCR? Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. It is used to convert paper books and documents into electronic files, for instance to computerize an old record-keeping system in an office or to publish the text on a website.

4 What is OCR?... OCR replaces each block of pixels that resembles a particular character (such as a letter, digit or punctuation mark) or word with that character or word. This makes it possible to edit printed text, search it for a given word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it.
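The idea of replacing a block of pixels with the character it most resembles can be sketched as a toy template matcher. This is only an illustration under strong assumptions: the 3x3 bitmaps, the `TEMPLATES` table and the `recognize` function are hypothetical, and real OCR engines use far richer features than a pixel-by-pixel comparison.

```python
# Hypothetical 3x3 bitmaps for three characters ("1" = ink, "0" = background).
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def mismatches(block, template):
    """Count the pixels where the scanned block differs from a template."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(block, template)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def recognize(block):
    """Return the character whose template differs from the block in the fewest pixels."""
    return min(TEMPLATES, key=lambda ch: mismatches(block, TEMPLATES[ch]))


# A scanned "L" with one pixel lost to noise is still closest to the "L" template.
noisy_l = ("100", "100", "110")
print(recognize(noisy_l))
```

Once every block on the page has been replaced this way, the result is ordinary text that can be edited, searched and stored compactly.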

5 What is OCR?... OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Though academic research in the field continues, the focus on OCR has shifted to implementation of proven techniques. Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields.

6 What is OCR?... Because very few applications that use true optical techniques survive, the term OCR has been broadened to include digital image processing as well. Early systems required training to read a specific font: they had to be programmed with images of each character, and they worked on only one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common.

7 What is OCR?... Some systems are even capable of reproducing formatted output that closely approximates the original scanned page, including images, columns and other non-textual components.

8 OCR Technology's Current State The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem in applications where clear imaging is available, such as scanning of printed documents. Typical accuracy rates in these applications exceed 99%; total accuracy can only be achieved by human review. Other areas, including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those with a very large number of characters), are still the subject of active research.

9 OCR Technology… Accuracy rates can be measured in several ways, and the choice of measure can greatly affect the reported rate. For example, if word context (basically a lexicon of words) is not used to correct non-existent words produced by the software, a character error rate of 1% (99% accuracy) may become a word error rate of 5% (95% accuracy) or worse, because a word counts as correct only when every one of its letters is recognized correctly.
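The jump from a 1% character error rate to roughly a 5% word error rate can be reproduced with a short calculation. The sketch below assumes that character errors are independent and that a word is correct only if all of its characters are; the function name `word_accuracy` and the word lengths are illustrative choices, not part of the slides.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word is recognized correctly,
    assuming each character is recognized independently."""
    return char_accuracy ** word_length


# 99% character accuracy applied to words of different lengths.
for length in (3, 5, 8):
    acc = word_accuracy(0.99, length)
    print(f"{length}-letter word: {acc:.1%} word accuracy, {1 - acc:.1%} word error")
```

Under these assumptions a five-letter word is read correctly about 95% of the time, which matches the 5% word error rate quoted above, and longer words fare worse.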

10 On-line Character Recognition On-line character recognition is sometimes confused with OCR. OCR is an instance of off-line character recognition, where the system recognizes the fixed, static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used on a PDA or Tablet PC, can tell whether a horizontal mark was drawn right-to-left or left-to-right.
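The difference can be made concrete with a small sketch. It assumes that an on-line device delivers a stroke as a list of (x, y) pen samples in the order they were captured, with x increasing to the right; the function `horizontal_direction` is a hypothetical helper, not an API of any particular device.

```python
def horizontal_direction(points):
    """Classify a stroke by comparing its first and last pen samples.

    points: list of (x, y) tuples in capture order.
    """
    dx = points[-1][0] - points[0][0]
    if dx > 0:
        return "left-to-right"
    if dx < 0:
        return "right-to-left"
    return "no horizontal movement"


stroke = [(0, 0), (5, 0), (10, 1)]
print(horizontal_direction(stroke))        # left-to-right
print(horizontal_direction(stroke[::-1]))  # right-to-left
```

Both strokes leave the same mark on the page, so an off-line system working from a scanned image has no way to recover this distinction.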

11 On-line Character Recognition… On-line character recognition is also referred to by other terms such as dynamic character recognition, real-time character recognition, and Intelligent Character Recognition (ICR). On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years. Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this kind of product.

12 On-line Character Recognition… The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem.

13 OCR Accuracy Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications. Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information.
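A quick calculation shows where "dozens of errors per page" comes from. The figure of 300 characters per hand-printed page is an assumption chosen only for illustration, and `expected_errors` is a hypothetical helper.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> int:
    """Expected number of misrecognized characters on one page."""
    return round(chars_per_page * (1.0 - accuracy))


CHARS_PER_PAGE = 300  # assumed page size, not a figure from the slides

for accuracy in (0.90, 0.80):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors} errors per page")
```

With this assumed page size, 90% accuracy leaves about 30 errors per page and 80% accuracy about 60.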

14 OCR Accuracy… For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the amount line of a cheque (which is always a written-out number) is a case where using a smaller dictionary can greatly increase recognition rates. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be, for example, a verb or a noun, allowing greater accuracy.
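The benefit of a small dictionary can be sketched by snapping each raw recognizer output to the closest word in a restricted lexicon. This is a minimal illustration, not the method of any particular product: the word list `AMOUNT_WORDS`, the helper `snap_to_lexicon` and the similarity cutoff are assumptions, and the matching uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical lexicon for the written-out amount line of a cheque.
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "twenty", "thirty", "forty", "fifty", "hundred", "thousand", "and",
]


def snap_to_lexicon(raw_word, lexicon=AMOUNT_WORDS, cutoff=0.6):
    """Return the lexicon word most similar to raw_word, or raw_word unchanged
    if nothing in the lexicon is similar enough."""
    matches = difflib.get_close_matches(raw_word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else raw_word


raw_output = ["f0rty", "thausand"]  # imagined recognizer output with character errors
print([snap_to_lexicon(word) for word in raw_output])
```

Because only a handful of words are possible on the amount line, a misread character or two rarely makes one valid word look like another, which is why the smaller dictionary helps.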

15 OCR Accuracy… The shapes of individual cursive characters simply do not contain enough information to recognize all handwritten cursive script accurately (at better than 98%). It is necessary to understand that OCR is a basic technology that is also used in advanced scanning applications. For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations.

16 OCR Accuracy… One technique that is having considerable success with difficult words and character groups, in documents otherwise amenable to computer OCR, is to submit them automatically to humans through the reCAPTCHA system. reCAPTCHA was originally developed to help digitize the text of books while protecting websites from bots attempting to access restricted areas. On September 16, 2009, Google acquired reCAPTCHA.

17 Sakhr OCR (Automatic Reader) Sakhr OCR converts scans of Arabic printed documents into digital text. Sakhr is rated #1 in recognizing clean-copy Arabic text, with an output accuracy of 99%. Sakhr is the leading OCR provider for security and business needs in the Middle East, the U.S., and Europe.

18 Sakhr OCR (Automatic Reader) High performance: 99.8% accuracy for high-quality documents; 96% accuracy for low-quality documents; supports Arabic, Farsi, Pashto, Jawi, and Urdu; auto-detects translation language; supports bilingual documents.

19 Sakhr OCR (Automatic Reader) Features: available as a standalone SDK or integrated with document management systems; user-friendly output editor (WYSIWYG); robust zoning with individual settings; multithreaded, with concurrent recognition sessions.

20 Sakhr OCR (Automatic Reader) Challenges of Arabic OCR Sakhr's OCR engine overcomes numerous complexities of Arabic fonts and language, including the following. Arabic is written cursively, so several characters are connected to form "blocks of characters". Arabic can be written in many fonts, so a "block of characters" can have more than one baseline. Arabic uses many types of external objects, such as dots, "Hamza" and "Madda".

21 Sakhr OCR (Automatic Reader) Challenges of Arabic OCR … Arabic characters can have more than one shape, depending on their position inside the block of characters (initial, middle, final or standalone). Overlapping also makes it difficult to determine the spacing between blocks of characters and between words. Arabic font suppliers do not follow a common standard.
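The positional shapes mentioned above are visible in Unicode itself, which encodes the isolated, final, initial and medial glyphs of each Arabic letter as separate "presentation form" code points. The sketch below uses the letter Beh as an example and Python's standard `unicodedata` module; it illustrates only the shape variation, not how Sakhr or any other engine works internally.

```python
import unicodedata

# Presentation forms of the Arabic letter Beh (base letter U+0628).
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

for position, glyph in BEH_FORMS.items():
    # NFKC normalization folds each positional glyph back to the base letter.
    base = unicodedata.normalize("NFKC", glyph)
    print(f"{position:9} U+{ord(glyph):04X} {unicodedata.name(glyph)} -> U+{ord(base):04X}")
```

An OCR engine faces the reverse problem: it sees four different shapes on the page and has to map all of them to the same letter.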

