September 25, 2006 NASA Feasibility Study Status Update.

1 September 25, 2006 NASA Feasibility Study Status Update

2 September 25, 2006 NASA Milestones
A. Feasibility study to identify the NASA document types - report - May 31, 2006
B. Form identification and template development - template set - Aug 31, 2006
C. Enhance classification algorithm for two specific classes - software packaged - Oct 31, 2006
D. Process study for inter-organizational collections - configuration software - Dec 1, 2006
E. Enhance engine to recognize two major classes - software packaged - Dec 15, 2006
F. Evaluation of extraction process - report - Feb 28, 2006

3 September 25, 2006 Form Identification and Template Development August 31 Deliverable

4 September 25, 2006 Form Identification and Template Development August 31 Deliverable DEMO

5 September 25, 2006 Active Tasks for Future NASA Milestones
- Standard intermediate representation of the scanned document (IDM)
- Design classification algorithm

6 September 25, 2006 Independent Document Model (IDM)
Platform-independent document model
Motivation
- Dramatic XML schema change between OmniPage 14 and 15
- Ties the template engine to a stable specification
- Protects from linking directly to a specific OCR product
- Allows us to include statistics for enhanced feature usage
  - Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc.)
  - Supports Pointpage detection and classification
Use XSLT 2.0 stylesheets to transform
- Supporting a new OCR schema only requires generating a new XSLT stylesheet; the engine does not change
- Chain a series of sheets to add functionality (CleanML)
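
The statistics named on this slide could be computed along the following lines. This is only a minimal sketch: the Word type, the idm_statistics function, and the reading of avgDocWordCount as "average words per page" are assumptions made for illustration, not the project's actual definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    """Hypothetical stand-in for one OCR word carried in the IDM."""
    text: str
    font_size: float


def idm_statistics(pages: list[list[Word]]) -> dict:
    """Compute document-level and page-level statistics.

    `pages` is a list of pages, each page a list of Word objects.
    The keys reuse the names from the slide; their exact meaning is assumed.
    """
    all_words = [word for page in pages for word in page]
    return {
        "wordCount": len(all_words),
        "avgDocFontSize": mean(w.font_size for w in all_words) if all_words else 0.0,
        # One average per page, in page order.
        "avgPageFontSize": [
            mean(w.font_size for w in page) if page else 0.0 for page in pages
        ],
        # Assumed meaning: average number of words per page.
        "avgDocWordCount": len(all_words) / len(pages) if pages else 0.0,
    }
```

Comparing a page average against the document average is one way such numbers might support page-level decisions, for example flagging a page whose typical font size is well above the rest of the document.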

7 September 25, 2006 IDM Usage
- Each incoming XML schema requires a specific XSLT 2.0 stylesheet
- Resulting IDM doc used for “form-based” templates
- IDM transformed into CleanML for “non-form” templates
[Diagram: OmniPage 14 XML Doc, OmniPage 15 XML Doc, and Other OCR Output XML Doc pass through docTreeModelOmni14.xsl, docTreeModelOmni15.xsl, and docTreeModelOther.xsl to produce the IDM XML Doc, which feeds Form Based Extraction; docTreeModelCleanML.xsl turns the IDM XML Doc into the CleanML XML Doc, which feeds Non Form Extraction]
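
The flow on this slide (pick a converter by input schema, then optionally chain a second step for non-form templates) can be sketched as follows. Python's standard library has no XSLT processor, so each stylesheet is stood in for by a plain function. All element names (wd, Word, idm, cleanml, text), the schema keys, and the function names are hypothetical; the real OmniPage, IDM, and CleanML schemas are not shown in the slides.

```python
import xml.etree.ElementTree as ET
from typing import Callable

Transform = Callable[[ET.Element], ET.Element]


def _to_idm(root: ET.Element, page_tag: str, word_tag: str) -> ET.Element:
    """Copy pages and words from an OCR-specific tree into a neutral IDM tree."""
    idm = ET.Element("idm")
    for page in root.iter(page_tag):
        idm_page = ET.SubElement(idm, "page")
        for word in page.iter(word_tag):
            ET.SubElement(idm_page, "word").text = word.text
    return idm


# One converter per input schema, playing the role of docTreeModelOmni14.xsl etc.
def omni14_to_idm(root: ET.Element) -> ET.Element:
    return _to_idm(root, "page", "wd")


def omni15_to_idm(root: ET.Element) -> ET.Element:
    return _to_idm(root, "Page", "Word")


def idm_to_cleanml(idm: ET.Element) -> ET.Element:
    """Second step in the chain, playing the role of docTreeModelCleanML.xsl."""
    clean = ET.Element("cleanml")
    for page in idm.iter("page"):
        words = (w.text or "" for w in page.iter("word"))
        ET.SubElement(clean, "text").text = " ".join(words)
    return clean


TO_IDM: dict[str, Transform] = {"omni14": omni14_to_idm, "omni15": omni15_to_idm}


def prepare(xml_text: str, schema: str, form_based: bool) -> ET.Element:
    """Return IDM for form-based templates, CleanML for non-form templates."""
    steps: list[Transform] = [TO_IDM[schema]]
    if not form_based:
        steps.append(idm_to_cleanml)
    tree = ET.fromstring(xml_text)
    for step in steps:
        tree = step(tree)
    return tree
```

Under this arrangement, supporting another OCR product means adding one entry to TO_IDM; prepare and everything downstream of the IDM stay unchanged.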

8 September 25, 2006 Classification Algorithm
Two approaches:
- Classification (switching) based on image classification
- Post-hoc classification via validation

9 September 25, 2006 Post-hoc classification via validation
- Attempt metadata extraction with all plausible templates
- Validate each result set, assigning confidence scores
- Field-specific validation rules, may combine
  - statistical models derived for each field of
    - text length
    - % of words from English dictionary
    - % of phrases from knowledge base prepared for that field
  - pattern matching
- Select metadata set with highest confidence score
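
The steps above can be sketched as follows. This is an illustrative sketch only: the templates are stood in for by plain functions, the rules are simple 0-to-1 scorers rather than fitted statistical models, scores are combined by plain averaging, and every name here (ENGLISH, RULES, confidence, classify) is hypothetical.

```python
import re
from statistics import mean
from typing import Callable

Metadata = dict[str, str]
Rule = Callable[[str], float]         # maps a field value to a score in [0, 1]
Template = Callable[[str], Metadata]  # stand-in for a real extraction template

# Toy word list; a real system would load a full English dictionary.
ENGLISH = {"validation", "of", "extracted", "metadata"}


def dictionary_fraction(value: str) -> float:
    """Fraction of alphabetic words found in the dictionary."""
    words = re.findall(r"[A-Za-z]+", value.lower())
    return sum(w in ENGLISH for w in words) / len(words) if words else 0.0


def length_between(lo: int, hi: int) -> Rule:
    """Crude substitute for a statistical model of text length."""
    return lambda value: 1.0 if lo <= len(value) <= hi else 0.0


def matches(pattern: str) -> Rule:
    compiled = re.compile(pattern)
    return lambda value: 1.0 if compiled.fullmatch(value) else 0.0


# Field-specific rules; each field may combine several.
RULES: dict[str, list[Rule]] = {
    "title": [dictionary_fraction, length_between(5, 200)],
    "date": [matches(r"[A-Z][a-z]+ \d{1,2}, \d{4}")],
}


def confidence(metadata: Metadata) -> float:
    """Average the per-field scores; a missing field scores as empty text."""
    field_scores = []
    for field, rules in RULES.items():
        value = metadata.get(field, "")
        field_scores.append(mean(rule(value) for rule in rules))
    return mean(field_scores)


def classify(document: str, templates: dict[str, Template]):
    """Try every template and return (name, metadata, score) of the best one."""
    scored = []
    for name, template in templates.items():
        metadata = template(document)
        scored.append((confidence(metadata), name, metadata))
    best = max(scored, key=lambda item: item[0])
    return best[1], best[2], best[0]
```

A template that assigns the title text to the date field fails the date pattern and scores low, so the template whose output validates best is selected without any up-front classification of the document.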

10 September 25, 2006 Sample set of extracted metadata bindings Steven J. Zeil Old Dominion University Technical Report 2006-24 September 12, 2006 Validation of Extracted Metadata A lengthy discussion of techniques for validating metadata is

11 September 25, 2006 Validation template customized for the collection <val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">

13 September 25, 2006 Annotated version of the metadata bindings Steven J. Zeil <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report 2006-24 September 12, 2006 Validation of Extracted Metadata A lengthy discussion of techniques for validating metadata is
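
Annotation of this kind can be sketched with Python's standard ElementTree module. Only the organization element and its confidence and warning attributes appear in the slide; the metadata and author element names, the annotate function, and the shape of the scores mapping are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def annotate(
    metadata: ET.Element,
    scores: dict[str, tuple[float, Optional[str]]],
) -> ET.Element:
    """Attach validation results to each scored field, in place.

    `scores` maps a field's tag name to (confidence, warning or None).
    Fields without a score are left untouched.
    """
    for field in metadata:
        if field.tag not in scores:
            continue
        conf, warning = scores[field.tag]
        field.set("confidence", f"{conf:.2f}")
        if warning:
            field.set("warning", warning)
    return metadata


if __name__ == "__main__":
    doc = ET.fromstring(
        "<metadata>"
        "<author>Steven J. Zeil</author>"
        "<organization>Old Dominion University</organization>"
        "</metadata>"
    )
    annotate(doc, {"organization": (0.42, "inappropriate vocabulary")})
    print(ET.tostring(doc, encoding="unicode"))
```

Keeping the annotations as attributes on the existing elements leaves the extracted values themselves unchanged, so downstream consumers that ignore the attributes still see the same metadata.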

