OCR at INIS. Branko Krznarić. INIS Training Seminar, 12-16 October 2015, Vienna


1 OCR at INIS Branko Krznarić

2 Outline
• What is OCR?
• OCR Objectives
• Principles
• Techniques
• Software
INIS Training Seminar 12-16 October 2015, Vienna, Austria

3 What is OCR? (source: pcmag.com)

4 Optical Character Recognition (OCR)
• OCR is the “conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text.” [1]
• Make digitized images of printed documents searchable.
• Font encoding issues.

5 OCR Objectives
• Data entry from printed records.
• OCR adds extra value to your image.
• OCR brings your digitized collection to life.
We can “find the needle in the haystack.”

6 OCR Objectives (cont.)
OCR is a method of digitizing printed texts so that they can be:
• Electronically edited
• Searched
• Stored more compactly
• Displayed online
• Used in machine processes

7 OCR Techniques
• Pre-processing
• De-skew
• Despeckle
• Binarization
• Line removal
• Layout analysis (zoning)
• Post-processing (dictionary)
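Two of the pre-processing steps listed above, binarization and despeckling, can be illustrated with a minimal sketch. It assumes a greyscale page held as rows of 0-255 integers. The function names (`otsu_threshold`, `binarize`, `despeckle`) and the sample `page` are illustrative only and are not taken from any OCR product. Otsu's method is one common way to pick a binarization threshold, not necessarily the one the tools named in this talk use. De-skew, line removal and zoning are not shown.

```python
def otsu_threshold(pixels):
    """Return the grey level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue              # no pixels at or below t yet
        w_light = total - w_dark
        if w_light == 0:
            break                 # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a greyscale image to 1 (ink) / 0 (paper) using Otsu's threshold."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]


def despeckle(bitmap):
    """Remove ink pixels that have no ink among their eight neighbours."""
    rows, cols = len(bitmap), len(bitmap[0])
    out = [row[:] for row in bitmap]
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1:
                continue
            neighbours = sum(
                bitmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            if neighbours == 0:
                out[r][c] = 0
    return out


# Hypothetical 4x6 scan: a 2x2 block of ink plus one isolated dark speck.
page = [
    [250, 250, 250, 250, 250, 250],
    [250,  30,  35, 250, 250, 250],
    [250,  40,  30, 250, 250,  25],
    [250, 250, 250, 250, 250, 250],
]
clean = despeckle(binarize(page))
```

In this sample, `clean` keeps the 2x2 block and drops the isolated speck. Real pages would need a more tolerant speck test, for example a minimum connected-component size.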

8 Scanned vs. Vector Image

9 “Do not look at the trees (letters); try to see the forest (sentences).”
F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N, P3RH4P8 7H3 M087 1MP0R74N7 R0L3 1N 7H3 0P3R4710N 0F 4 D16174L 4RCH1V3 18 M4N461N6 7H3 1D3N717Y, 1N736R17Y 4ND QU4L17Y 0F 7H3 4RCH1V38 1783LF 48 4 7RU873D 80URC3 0F 7H3 CUL7UR4L R3C0RD.
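A reader resolves the digit-for-letter substitutions on this slide from context. Software can only approximate that with explicit rules, and the following is a rough sketch of one such rule, offered as an assumption rather than as how any particular OCR engine works: map look-alike digits to letters, but accept the result only if it is a known word. The `LOOKALIKES` table, the `restore` function and the tiny `vocabulary` are all hypothetical.

```python
# Digits that resemble letters in the slide's sample text (0->O, 1->I, 3->E, ...).
LOOKALIKES = str.maketrans("0134678", "OIEAGTS")


def restore(text, vocabulary):
    """Replace look-alike digits only when the result is a known word."""
    words = []
    for token in text.split():
        core = token.strip(".,;:")            # ignore trailing punctuation
        candidate = core.translate(LOOKALIKES)
        if candidate in vocabulary:
            token = token.replace(core, candidate)
        words.append(token)                   # unknown tokens are left untouched
    return " ".join(words)


vocabulary = {"FOR", "ASSURING", "THE", "LONGEVITY", "OF", "INFORMATION"}
fixed = restore("F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N, 2015", vocabulary)
print(fixed)  # FOR ASSURING THE LONGEVITY OF INFORMATION, 2015
```

The dictionary check is what keeps a genuine number such as `2015` intact. Without it, the blind substitution would corrupt every digit in the text.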

10 Verdana Font
FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.

11 Brush Script MT (Windows Font)
FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.

12 PCs ≠ Humans
• OCR compares patterns and selects the closest match. It can be forced to a specific context, but requires customization.
• People adapt to circumstances and can circumvent misspellings if context is clear.
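The "compare patterns and select the closest match" idea can be sketched as simple template matching. This is a toy illustration under stated assumptions: the 5x3 glyph bitmaps, the `TEMPLATES` table and the `classify` function are invented for the example, and production OCR engines use far richer features. The optional `allowed` argument stands in for "forcing a specific context" by restricting the candidate set.

```python
# Hypothetical 5x3 templates; "1" is ink, "0" is paper.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
    "O": ["111", "101", "101", "101", "111"],
}


def hamming(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def classify(glyph, templates, allowed=None):
    """Return (label, distance) of the closest template, optionally restricted."""
    candidates = {
        label: bitmap
        for label, bitmap in templates.items()
        if allowed is None or label in allowed
    }
    label = min(candidates, key=lambda name: hamming(glyph, candidates[name]))
    return label, hamming(glyph, candidates[label])


# An "I" with one noisy pixel in the middle row.
noisy = ["111", "010", "011", "010", "111"]
print(classify(noisy, TEMPLATES))                 # ('I', 1)
print(classify(noisy, TEMPLATES, allowed="LTO"))  # ('T', 3)
```

The second call shows the cost of a wrong context: with "I" excluded, the matcher still returns its best remaining guess rather than reporting that nothing fits.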

13 True or false Usually, printed text is adequately sampled if each line is at least two pixels wide:
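The two-pixel rule of thumb quoted on this slide can be turned into a quick calculation relating stroke width to scan resolution. The sketch below only does the unit conversion (25.4 mm per inch). The 0.254 mm stroke width is a hypothetical value chosen to give round numbers, not a measurement from the seminar material, and the helper names are illustrative.

```python
MM_PER_INCH = 25.4


def stroke_width_px(stroke_mm, dpi):
    """Width in pixels of a stroke of stroke_mm millimetres scanned at dpi."""
    return stroke_mm * dpi / MM_PER_INCH


def min_dpi(stroke_mm, min_pixels=2):
    """Lowest resolution at which the stroke spans at least min_pixels pixels."""
    return min_pixels * MM_PER_INCH / stroke_mm


# Hypothetical thin stroke of 0.254 mm (one hundredth of an inch).
width_at_300 = stroke_width_px(0.254, 300)   # about 3 pixels
needed_dpi = min_dpi(0.254)                  # about 200 dpi
```

Under these assumptions a 300 dpi scan renders the stroke about three pixels wide, and roughly 200 dpi is the point at which it drops to two.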

14 Zoom in

15 Zoom in

16 Results from OCR
It is in this context that I…
… and an additional protocol on the basis…

17 Chinese Raster Image (scanned)

18 Chinese Vector Image (OCR) 滤器

19 Arabic Raster Image (scanned)

20 Arabic Vector Image (OCR) هذ ا وشملت

21 Japanese Raster Image (scanned)

22 Japanese Vector Image (OCR)

23 Font Encoding

24 Font Encoding (cont.)

25 OCR Software
• High degree of recognition accuracy
• Reproducing formatted output
• OCR Software at INIS:
  - ABBYY FineReader (multilingual OCR)
  - Adobe Acrobat
  - InftyReader

26 ABBYY FineReader (interface)

27 InftyReader - an OCR System for Math Documents

28 Reference
[1] “Optical character recognition”. http://en.wikipedia.org/wiki/Optical_character_recognition. Retrieved 2015-09-29.

29 Thank you!

