Document Analysis: Introduction. Prof. Rolf Ingold, University of Fribourg. Master course, spring semester 2008.


1 Document Analysis: Introduction  Prof. Rolf Ingold, University of Fribourg  Master course, spring semester 2008

2 © Prof. Rolf Ingold 2 Outline  Introduction: definition and aims  Applications overview  Methodologies  Possibility & limits  Experience of the DIVA research group  Course content and structure

3 What is a document?  Data = abstract binary representation of any kind of information to be stored, transmitted or processed by computers  Information = data associated with an implicit or explicit interpretation  Document = piece of information that can be perceived and interpreted by humans  To be perceived, documents have to be rendered:  displayed  projected on screens  printed  played on speakers  …

4 Taxonomy of documents  Documents may be  synthetic (structured) or captured (unstructured)  static (non-temporal, printable) or dynamic (temporal)  viewable, audible or tactile  [Figure: grid of document types by origin (synthetic data, captured data) and by temporal nature (static documents, dynamic documents); labels include text (printed), graphics, images, animation, audio, speech (synthetic), video, off-line handwriting, on-line handwriting]

5 What is document analysis?  Document analysis aims at extracting symbolic information  text (words, expressions, continuous text)  graphics (vector graphics, shapes, symbols)  layout structures  logical structures  numeric data  writer / speaker identities, ...  from different captured sources  images (scanned, camera based, synthesized)  video  on-line handwriting  sound

6 Importance of document structures  Document = Content + Structures  Structures convey abstract high-level information  They are revealed by styles

7 Structural document analysis  [Figure: example document page]  Document analysis = image analysis of static documents to extract content and structures  Document analysis is applicable to  captured images (from scanner, camera)  synthetic images of electronic documents, available in unstructured or poorly structured form

8 Analysis of Electronic Documents  Most electronic documents are unstructured or poorly structured  Document understanding can be seen as a reverse-engineering task using a fixed-layout document format (such as PDF or XPS) as a pivot format  [Figure, with the label ASCII]

9 Visual Audio Processing Chain  Visual Audio aims at recovering sound from old records by image analysis

10 Usefulness of document analysis  Extracting information from captured documents is useful in different contexts  to avoid cumbersome keyboarding  to capture information remotely  to study the document’s content  to categorize, classify and index digitized documents  for digital libraries  culture preservation  to reuse document chunks  to reedit and restyle an existing document  to extract information for integrated applications  office automation  database management  information systems  to perform multimodal alignment

11 Typical applications of document analysis  Commercial products are available for  text reading (OCR products)  office automation (mail reading and dispatching)  form processing (for dedicated applications)  More specialized products  postal address reading  check reading and processing

12 Form processing  Performance of form processing depends  on form complexity  on form variability  Fields are located easily  if their positions are fixed  when using different colors  Content recognition is hard for several reasons  degraded images  approximate positioning of symbols  variability of handwriting

13 Check reading  Check reading can be automated at >90%  Difficulties: textured background, variability of writing  Favorable factors: fixed vocabulary, redundancy (legal & courtesy amount), availability of contextual information (client database)  [Figure: sample check with the fields payee name, date, legal amount, courtesy amount, signature and MICR labeled]

14 Table of contents recognition  Aim: to extract information from the TOC to index journals  associate titles and authors with page numbers  Advantages  Very precise goal  Regular layout for a given journal  Difficulties  Complex layout  Great variability when considering journals universally

15 Analysis of historical documents  Aim: to extract information to index historical documents  Challenges  degradations  irregular layout  rich typography, ornaments  old scripts (no OCR)  Possible approach  word spotting

16 Logical & physical document structures  Logical document structures  Reflect the author’s point of view  Independent of presentation  Composed of application-dependent logical entities  Chapters, sections  Specific to the application and document class  Physical document structures  Reflect the editor’s point of view  Composed of a hierarchy of physical entities  Text blocks, text lines and tokens  Graphical primitives  Universal and independent of the document class

17 Document processing cycle  [Figure: cycle connecting logical document, physical document, paper document and document image; arrows labeled formatting, printing, rendering, digitizing, analysis and recognition]  Document analysis can be considered as the reverse of formatting

18 Relation between logical and physical structure  [Figure: logical structure and physical structure linked by formatting (using styles) in one direction and analysis in the other; further labels: edit, print, display]  Document formatting is straightforward...  But document analysis is a nontrivial task that generally cannot be fully automated

19 Processing chain  [Figure: processing chain from image to structured document; stages labeled preprocessing, segmentation, layout analysis, OCR, OFR, postanalysis, document understanding; intermediate data labeled blocks, simple text, fonts]

20 Pre-processing  Pre-processing aims at preparing the document image for further analysis; it includes  Brightness / contrast enhancement  Noise removal  Skew / aberration correction  Binarization / color clustering  Shape smoothing
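The binarization step named on this slide can be illustrated with a global threshold. The sketch below uses Otsu's method, which is one common choice and not necessarily the one taught in the course; the function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a plain list of rows of 8-bit gray levels.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray levels in 0..255.
    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of gray levels to 1 (dark, ink) or 0 (background)."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


image = [[10, 200], [210, 20]]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
```

A single global threshold assumes fairly uniform lighting; textured or degraded backgrounds, mentioned elsewhere in this lecture, usually call for locally adaptive thresholds instead.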

21 Segmentation  Document segmentation aims at splitting the image into regions of interest; it includes  Page segmentation into blocks  Text, graphics and image separation  Hairline and frame detection  Text block segmentation into text lines, words and characters  In form processing, field separation  Graphics segmentation into vectors and symbols
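Splitting a text block into text lines is often illustrated with a horizontal projection profile. The sketch below is one such illustration, assuming a binarized, skew-corrected image given as a list of rows with 1 for ink and 0 for background; `segment_lines` is a hypothetical name, not taken from the slides.

```python
def segment_lines(binary_image):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    A text line is taken to be a maximal run of consecutive rows
    that contain at least one ink pixel.
    """
    # Horizontal projection profile: amount of ink in each row.
    profile = [sum(row) for row in binary_image]
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                  # a line begins
        elif ink == 0 and start is not None:
            lines.append((start, y))   # a line ends at the first empty row
            start = None
    if start is not None:              # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
lines = segment_lines(page)
```

The same idea applied to columns of a single line gives a rough word or character segmentation, although touching characters and residual skew quickly break this simple rule.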

22 Optical Character Recognition (OCR)  OCR aims at extracting character codes (ASCII) from text images  OCR was one of the earliest computer vision applications  Early patents were filed in the 1910s, 30 years before the computer age!  OCR deals with many situations  Isolated characters vs. complete words or phrases  Different character classes (digits, uppercase letters, full text, …)  Restricted or open vocabulary  Machine-printed vs. handwritten text  Different languages (with various diacritics) and different scripts (Latin, Greek, Hebrew, Arabic, Farsi, various Asian scripts, …)  Imperfect image quality (low resolution, textured background, distortions, noise, …)

23 Text recognition related problems  Text analysis must also consider other aspects  In case of printed text  Font recognition (family, size and style)  Font categorization (with/without serifs, fixed vs. proportional font)  In case of handwritten text  Writer identification or verification  Writer classification

24 Layout analysis  Layout analysis aims at extracting physical structures of documents; it consists of  locating, delimiting and identifying  text blocks  graphics  tables  formulas  handwritten text fields  annotations  associating figures and captions  locating and delimiting headers and footers  recovering the reading order (of multicolumn documents)
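Recovering the reading order of a multicolumn page can be sketched as follows, under strong simplifying assumptions: blocks are axis-aligned rectangles, columns do not overlap horizontally, and no block spans several columns. The `Block` type and `reading_order` function are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int


def reading_order(blocks):
    """Order blocks column by column (left to right), top to bottom."""
    columns = []  # each entry: [left, right, list of blocks]
    for block in sorted(blocks, key=lambda b: b.x):
        for col in columns:
            # Horizontal overlap with an existing column: same column.
            if block.x < col[1] and block.x + block.width > col[0]:
                col[0] = min(col[0], block.x)
                col[1] = max(col[1], block.x + block.width)
                col[2].append(block)
                break
        else:
            columns.append([block.x, block.x + block.width, [block]])
    ordered = []
    for col in sorted(columns, key=lambda c: c[0]):
        ordered.extend(sorted(col[2], key=lambda b: b.y))
    return ordered


blocks = [
    Block("D", 125, 60, 90, 50),
    Block("B", 0, 60, 100, 50),
    Block("C", 120, 0, 100, 50),
    Block("A", 0, 0, 100, 50),
]
order = [b.name for b in reading_order(blocks)]
```

A full-width title or figure would merge all columns into one under this rule, which is one reason real layout analysis needs more than geometric sorting.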

25 Example: layout modeling of scientific journals

26 Optical Font Recognition (OFR)  OFR aims at identifying the fonts used  OFR is useful  for improving OCR accuracy, by using dedicated classifiers  to distinguish “O” and “0”, “I” and “1”, …  for assigning logical labels, for logical structure recognition  Two strategies may be applied for OFR  A priori OFR (without considering the content)  A posteriori OFR (when the content is supposed to be known)

27 Document structure recognition  Document structure recognition (also referred to as document understanding) is the first step towards document interpretation  Document understanding deals with  Logical labeling  Logical structure recognition  Two levels of granularity are considered  macro-structure analysis: labeling paragraphs / blocks  micro-structure analysis: labeling words / strings  Document structure recognition is still considered an open issue  There is no universal approach  Solutions exist for dedicated document classes (museum notices, checks, tables of contents, scientific papers, newspapers, …)

28 Two Levels of Structural Document Analysis  Physical structure analysis (also layout analysis)  to locate and identify text blocks, graphics, tables, formulas, handwritten text fields, annotations, …  to recover the reading order  Logical structure analysis (also document understanding)  to assign a hierarchy of logical labels  first step towards interpretation

29 Use Case: Intelligent Newspaper Indexing  Full text indexing is not adequate for complex documents  The following items have to be identified  headlines  editorial  articles (with title, author & function, summary, content, links, ...)  captions (associated with images)  readers’ letters  advertisements  ...

30 Use case: Understanding Museum Notices  [Figure: museum notices annotated with the fields Group Vedette, Area Title, Principal Title, End of the title, Area Address / Date, Address, Date, Area Collection, Group Cote; from A. Belaïd, LORIA-CNRS Nancy]

31 Possibilities and limits of DA  Layout analysis is considered as almost solved for printed documents  It can be achieved generically  Problems remain for textured backgrounds and degraded documents (historical & handwritten documents)  Document understanding is much less mature  Solutions are application dependent  Application of specific knowledge is needed (document models)

32 Need for Document Recognition Models  There is no universal approach!  Document recognition systems must be tuned  for specific applications  for specific document classes  Contextual information is required  Models provide information like  generic document structures (DTD or XML-schema)  geometrical and typographical attributes (style information)  semantic information (keywords, dictionaries, databases, ...)  statistical information

33 Content of document models  Generic structure  Document Type Definition (DTD) or XML-schema  Style information  Absolute or relative positioning  Typographical attributes & formatting rules  Semantics (if available)  Linguistic information, keywords  Application specific ontology  Probabilistic information  Frequencies of items or sequences, co-occurrences
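To make the "style information" part of a document model concrete, here is a deliberately small sketch of an explicit model that assigns logical labels from typographical attributes alone. Everything in it is hypothetical: the `StyleRule` and `TextBlock` types, the labels and the font-size ranges are invented for illustration and do not come from the course or from any real system.

```python
from dataclasses import dataclass


@dataclass
class StyleRule:
    label: str             # logical label assigned when the rule matches
    min_font_size: float
    max_font_size: float
    bold: bool


@dataclass
class TextBlock:
    text: str
    font_size: float
    bold: bool


# Hypothetical model for one document class; values are made up.
ARTICLE_MODEL = [
    StyleRule("title", 16.0, 30.0, True),
    StyleRule("section-heading", 11.0, 15.9, True),
    StyleRule("paragraph", 9.0, 12.0, False),
]


def label_block(block, model):
    """Return the label of the first matching rule, or "unknown"."""
    for rule in model:
        size_ok = rule.min_font_size <= block.font_size <= rule.max_font_size
        if size_ok and rule.bold == block.bold:
            return rule.label
    return "unknown"


heading = label_block(TextBlock("1 Introduction", 12.0, True), ARTICLE_MODEL)
body = label_block(TextBlock("Document analysis aims at ...", 10.0, False), ARTICLE_MODEL)
```

A model of this kind only works for the document class it was written for, which is the tuning burden the surrounding slides describe; a fuller model would add the generic structure, semantic and probabilistic information listed above.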

34 Trouble with document models  Document models are hard to produce and to maintain  implicit models (hard coded in the application)  => hard to modify, adapt, extend  explicit models, written in a formal language  => cumbersome to produce, need high expertise  abstract models, learned automatically  => need a lot of training data (with ground truth!)  Need for more flexible tools:  assisted environments with friendly user interfaces  recognition improving with use  models are learned incrementally

35 Pattern Based Document Understanding (2-CREM) [Robaday 03]  Configurations consist of  Set of vertices  Labeled (type)  Attributed (pos, typo, ...)  Edges between vertices  Labeled (neighborhood relation)  Attributed (geom, ...)  Model consists of  Extraction rules  For each class  Attribute selector  List of patterns  [Figure: pipeline with the labels document image, extraction, configuration, classification, model, rules, pattern selector, id]

36 Performance evaluation  Performance evaluation is an important issue  to compare algorithms  to estimate correction costs of real applications  Ground-truthed databases are required  cost reduction by document analysis tools (bootstrap)  synthetic data as alternative
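One widely used way to compare recognition output against ground truth is the character error rate, computed from the edit (Levenshtein) distance. The slides do not name a metric, so the sketch below is an assumption about what such an evaluation could look like; `edit_distance` and `character_error_rate` are hypothetical names.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference`."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the ground truth."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# A typical OCR confusion: "m" read as "rn".
cer = character_error_rate("document", "docurnent")
```

Because each edit operation roughly corresponds to one manual correction, a measure like this also gives a first estimate of the correction cost mentioned on the slide.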

37 List of Lessons  1. Introduction to document analysis and recognition  2. Document image processing  3. Fundamentals of pattern recognition I  4. Fundamentals of pattern recognition II  5. Printed text recognition  6. Font recognition  7. Layout analysis and segmentation  8. Logical structure analysis  9. Graphics recognition  10. Handwriting recognition  11. Reverse engineering of documents  12. Multimodal applications

38 Conclusion on document analysis  Document analysis is useful for many applications  Commercial systems solve some of them  Advanced document analysis prototypes are developed in many research labs around the world  No universal documentation system is on the way  User-assisted approaches may be a good trade-off for midsize applications  Structural document analysis will not disappear with exclusive electronic document handling (paperless office)

39 Organization of the course  Professor: Rolf Ingold, Pérolles-2, B421, 026 300 84 66  Assistant: Jean-Luc Bloechle, Pérolles-2, B440, 026 300 92 94  Course: Tuesday, 09:15-10:00 & 10:15-11:00  Exercise: Wednesday, 11:15-12:00  Requirements: 2/3 of series returned, 1/2 considered satisfactory  Homework: estimated at 4-6 hours a week  Website: http://diuf.unifr.ch/diva/web/  Examination:  oral, 20 minutes (alternatively written, 120 min)  after spring semester (June 2008) or summer (August-September 2008)  Credits: 5 ECTS

