
1 UC Berkeley CS294-9, Fall 2000
Document Image Analysis
Lecture 5: Metrics
Richard J. Fateman, Henry S. Baird
University of California – Berkeley; Xerox Palo Alto Research Center

2 The course so far…
Reminder: all course materials are online: http://www-inst.eecs.berkeley.edu/~cs294-9/
Overview of the DIA research field
Some applications (postal addresses, checks)
Research objectives: more systematic modeling and design
Some basic engineering

3 How well are we doing?
Cost to achieve a useful result
Compare the digital version to
– hand keying / digitizing
– verification
– correction
Correction cost may dominate total system cost

4 When is a result nearly correct?
Character model
– Correct
– Reject
– Error
String model
– Insertion
– Deletion
– Rejection
– Substitution [wrong letter identification]

5 Using ASCII character labels
ABCDEFGHIJKL = s1
ACD~~OIIUKL = s2
Insert B after A in s2
Substitute E for ~, F for ~ [~ = reject]
Substitute G for O in s2
Substitute H for I in s2
Substitute I for U … etc.
(Really, H was recognized as II, and IJ was recognized as U.)

6 ASCII labels are inadequate
Unicode + font + point size + tag information..

7 Simple measures may mislead
Increase the rejection rate and this “error rate” decreases.
Reject all characters to get 0/0?
Some applications (e.g. the post office) demand a very low error rate, even if correct but low-confidence results are sometimes rejected.
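The effect of rejection on a naive error rate can be made concrete with a small sketch. It assumes the “error rate” on the slide means errors divided by accepted (non-rejected) characters, which is what makes the 0/0 case possible; the function name and return layout are illustrative, not taken from the lecture.

```python
def recognition_rates(correct, rejected, errors):
    """Return (error rate over all characters,
               error rate over accepted characters,
               reject rate).

    Assumption: every character is scored as exactly one of
    correct, rejected, or error (the character model).
    """
    total = correct + rejected + errors
    if total == 0:
        raise ValueError("no characters were scored")
    accepted = correct + errors
    error_rate_all = errors / total
    # Undefined (0/0) when every character is rejected, so report None.
    error_rate_accepted = errors / accepted if accepted else None
    reject_rate = rejected / total
    return error_rate_all, error_rate_accepted, reject_rate


# Rejecting more low-confidence characters lowers the accepted-error rate.
print(recognition_rates(correct=90, rejected=5, errors=5))
print(recognition_rates(correct=80, rejected=19, errors=1))
print(recognition_rates(correct=0, rejected=100, errors=0))
```

Reporting the reject rate alongside the error rate keeps the trade-off visible instead of hiding it.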

8 Some errors are acceptable
Keyword search: if the keyword occurs many times, it does little harm when it is occasionally rejected.
Erroneous (nonsense) words are unlikely to be found by a search.
Caveat: if a keyword is consistently changed to a nearby word, it may be missed (e.g. "dumptruck" is always read as "durnptruck", so a search never finds it).

9 Example: UNLV-ISRI document collection
20 million pages of scientific, legal, and official memos from DOE and contractors
– Rock mining
– Maps
– Safe transportation of nuclear waste
– Average length 44 pages

10 Example: UNLV-ISRI document collection
DOE’s Licensing Support System prototype
– 104,000 page images, 2,600 documents
– Manually typed “correct” text
– OCR text
To determine relevance to queries, 3 methods were used
– Geology students' ranking (0/1)
– OCR keyword search
– “correct” text search

11 Example: UNLV-ISRI document collection
Exact match on 71 queries:
– 632 returned by correct text
– 617 returned by OCR
– Essentially: OCR is OK for this application.
Probabilistic ranking / frequency:
– Excessive OCR errors affected ranking
– On average, similar results
Feedback on relevance was not helpful for poor OCR
Benchmarking: similar relevance = good results

12 Example: UNLV-ISRI document collection
One surprising result is that, for some standard tests of precision and recall, searching the OCR text did better than searching the actual text. [Crummy OCR meant that some terms were not recognized, but the documents containing them were irrelevant….]
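A toy calculation shows how OCR errors can raise precision. The document IDs and result sets below are invented for illustration and are not UNLV-ISRI data; the sketch only assumes the standard set-based definitions of precision and recall.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: d1-d3 are judged relevant by a human.
relevant = {"d1", "d2", "d3"}
# The clean text also matches two irrelevant documents.
clean_hits = {"d1", "d2", "d3", "d4", "d5"}
# OCR garbled the query term in irrelevant document d5, so it is not returned.
ocr_hits = {"d1", "d2", "d3", "d4"}

print(precision_recall(clean_hits, relevant))  # (0.6, 1.0)
print(precision_recall(ocr_hits, relevant))    # (0.75, 1.0)
```

In this made-up case the OCR text scores higher precision at equal recall purely because an error removed an irrelevant hit.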

13 A theory for computing accuracy
Consider the result of OCR to be a string
– Idealization: most common errors involve miscounting the number of spaces!
– Ignores size, font, absolute position, etc.

14 Computing the shortest edit distance
Bio-informatics sequencing
Associate a cost with each correspondence. For example,
– Match or substitute (cost 0 or 1)
– Insert or delete (cost 2)

15 Attempt to align AUGGAA to ACUGAUGUGA. Distances were calculated using the following parameters: s(a,b) = 0 when a equals b; s(a,b) = 1 when a differs from b; insert or delete cost = 2. One of the possible optimal paths is indicated by a solid line connecting cells. It corresponds to the following alignment:
ACUGAUGUGA
A-UG--G-AA
[explain dynamic programming here?]
[Figure: dynamic-programming matrix for AUGGAA against ACUGAUGUGA]
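As a minimal sketch of the dynamic programming behind this alignment, the function below fills the usual cost table with the slide's parameters (match 0, substitute 1, insert or delete 2). The function and parameter names are illustrative; it returns only the total cost, not the alignment path.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=2):
    """Minimum-cost alignment of strings a and b.

    d[i][j] holds the cheapest way to align a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    # Aligning a prefix with the empty string costs one indel per character.
    for i in range(1, rows):
        d[i][0] = i * indel_cost
    for j in range(1, cols):
        d[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost)
            delete = d[i - 1][j] + indel_cost   # drop a[i-1]
            insert = d[i][j - 1] + indel_cost   # add b[j-1]
            d[i][j] = min(diagonal, delete, insert)
    return d[-1][-1]


print(edit_distance("AUGGAA", "ACUGAUGUGA"))  # 9
```

The result 9 agrees with the alignment shown: four gaps at cost 2 each plus one substitution (G against A) at cost 1.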

16 Computing the shortest edit distance
Also useful for other tasks (e.g. recognizing speech)
There are many ways to organize the dynamic programming, but it is still O(n²).
Probably of more interest is word accuracy, or accuracy on non-stopwords (excluding "and", "the", "of", … etc.)
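One possible way to measure accuracy on non-stopwords is sketched below. It is an assumption-laden illustration: the stopword list is a tiny hypothetical one, words are split on whitespace, and alignment uses Python's `difflib.SequenceMatcher` rather than the cost-based edit distance above.

```python
from difflib import SequenceMatcher

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def content_words(text):
    """Lower-case, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def word_accuracy(truth, ocr):
    """Fraction of ground-truth non-stopwords matched, in order, in the OCR text."""
    truth_words = content_words(truth)
    ocr_words = content_words(ocr)
    if not truth_words:
        raise ValueError("ground truth has no non-stopwords")
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


print(word_accuracy("the safe transportation of nuclear waste",
                    "the safe transportation of nuclcar waste"))  # 0.75
```

A single wrong character costs a whole word here, which is why word accuracy is usually lower than character accuracy on the same page.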

17 Correct zoning is essential
Reading order in multi-column pages
How to compare competing programs on their handling of repeated headers
What to do with figures and logos
[Figure: page zones labeled 1 2 3 and 4 5 6]

18 Document Attribute Format Specification: DAFS
“While many formats exist for composing a document from electronic storage onto paper, no satisfactory standard exists for the reverse process. DAFS is intended to be a standard for document decomposition. It will be used in applications such as OCR and document image understanding. There are three storage formats: DAFS-Unicode, DAFS-ASCII and a more compact DAFS-Binary form. DAFS is a file format specification for documents with a variety of uses. It is developed under the Document Image Understanding (DIMUND) project funded by ARPA.”
www.raf.com, Illuminator, UW CD-ROMs (English and Japanese)

19 DAFS vs SGML
DAFS = SGML + Unicode + CCITT Fax 4
SGML requires a DTD (document type definition)
SGML is intended for structure, not appearance (e.g. not bold, italic)
Images whose data accidentally contains the ASCII text of a markup tag can be problematic
– Solved by putting images in separate files!

20 Perfect results: how to obtain ground truth?
Painfully enter it by hand, or
Painfully correct OCR results, or
Compute some kind of average of OCR programs
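The slide does not say what “some kind of average” means. One simple reading is a per-character majority vote, sketched here under a strong assumption: the outputs of the OCR programs have already been aligned to equal length. The function name and the use of ~ as the reject symbol (borrowed from the earlier slide) are illustrative choices.

```python
from collections import Counter


def majority_vote(outputs, reject="~"):
    """Combine equal-length OCR outputs character by character.

    A character is accepted only if a strict majority of the
    outputs agree on it; otherwise the position is rejected.
    """
    if len({len(s) for s in outputs}) != 1:
        raise ValueError("outputs must be non-empty and pre-aligned to equal length")
    result = []
    for chars in zip(*outputs):
        (best, count), = Counter(chars).most_common(1)
        result.append(best if count > len(chars) / 2 else reject)
    return "".join(result)


print(majority_vote(["nuclear", "nuclcar", "nuclear"]))  # nuclear
print(majority_vote(["ab", "cb", "db"]))                 # ~b
```

Voting can only approximate ground truth: if most programs make the same mistake, the mistake wins.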

21 Perfect ground truth: a synthetic approach
(Kanungo, UMD): start with TeX,
– Produce the ground truth for layout from TeX
– Extract character positions and glyphs by analyzing DVI files
– This provides essentially every bit position of each character.

22 Ground truth
Next, commit to paper:
– Print the DVI files
– Scan a calibration page
– Compute parameters of the 2D → 2D transformation T imposed by physics
– Scan the printout
– Align the page
– Run the recognizer
– Compare reported positions (mapped through T⁻¹) to the correct ones
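The calibration and comparison steps can be sketched as follows, assuming for simplicity that T is an affine map estimated from exactly three calibration points. The slide does not specify the form of T (a projective model or a least-squares fit over many points may be more realistic), and all function names and coordinates here are hypothetical.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(m, v):
    """Solve the 3x3 system m * x = v by Cramer's rule."""
    d = _det3(m)
    if abs(d) < 1e-12:
        raise ValueError("calibration points are collinear")
    solution = []
    for k in range(3):
        mk = [row[:] for row in m]
        for r in range(3):
            mk[r][k] = v[r]
        solution.append(_det3(mk) / d)
    return solution


def fit_affine(ideal, scanned):
    """Fit x' = a*x + b*y + c, y' = d*x + e*y + f from three point pairs."""
    m = [[x, y, 1.0] for x, y in ideal]
    a, b, c = _solve3(m, [p[0] for p in scanned])
    d, e, f = _solve3(m, [p[1] for p in scanned])
    return a, b, c, d, e, f


def apply_inverse(t, point):
    """Map a scanned position back to ideal page coordinates (T inverse)."""
    a, b, c, d, e, f = t
    det = a * e - b * d
    if abs(det) < 1e-12:
        raise ValueError("transformation is not invertible")
    u, v = point[0] - c, point[1] - f
    return (e * u - b * v) / det, (a * v - d * u) / det


# Hypothetical calibration marks: ideal positions and where the scanner saw them.
t = fit_affine([(0, 0), (100, 0), (0, 100)], [(5, 3), (105, 5), (3, 103)])
x, y = apply_inverse(t, (54, 54))  # a position reported by the recognizer
print(round(x, 6), round(y, 6))    # 50.0 50.0
```

Once reported positions are mapped back through the inverse, they can be compared directly with the character positions extracted from the DVI file.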

23 Change of pace
Assignment 1
– What does it mean to write a program? Documentation, demo, instructions for use (perhaps optional)
– Extensions, limitations, discussion
Discussion questions

