Approved for Public Release U.S. Government Work (17 USC§105) Not copyrighted in the U.S. Defense Research & Engineering Information for the Warfighter.

Automated Metadata Extraction
April 5, 2006
Kurt Maly
maly@cs.odu.edu

Outline
Background and Motivation
Challenges and Approaches
Metadata Extraction Experience at ODU CS
Architecture for Metadata Extraction
Experiments with DTIC Documents
Experiments with limited GPO Documents
Conclusions

Digital Library Research at ODU
http://dlib.cs.odu.edu/
Digital Libraries
Content Creation
  – New Content: Publication Tools: Kepler, Compopt (NSF, US Navy)
  – Process Existing Content (DTIC)
Content Sharing
  – Centralized Model: Harvesting, OAI-PMH
      Arc/Archon (NSF), Kepler (NSF), TRI (NASA, LANL, SANDIA), DL Grid (Andrew Mellon), Secure DL (NSF, IBM)
  – Real Time: LFDL
  – Distributed Model – P2P (NSF)

Motivation
Metadata enhances the value of a document collection
  – Using metadata helps resource discovery. A company that uses metadata in its intranet may save about $8,200 per employee by reducing the time employees spend searching for, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop).
  – Using metadata helps make collections interoperable with OAI-PMH
Manual metadata extraction is costly and time-consuming
  – It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop). Automatic metadata extraction tools are essential to reduce the cost.
  – Automatic extraction tools are essential for rapid dissemination at reasonable cost
OCR is not sufficient for making ‘legacy’ documents searchable.

Challenges
A successful metadata extraction system must:
  – extract metadata accurately
  – scale to large document collections
  – cope with heterogeneity within a collection
  – maintain accuracy, with minimal reprogramming/training cost, as the collection evolves over time
  – have a validation/correction process

Approaches
Machine Learning
  – HMM
  – SVM
Rule-Based
  – Ad Hoc
  – Expert Systems
  – Template-Based (ODU CS)

Comparison
Machine-Learning Approach
  – Good adaptability, but it has to be trained from samples, which is very time consuming
  – Performance degrades with increasing heterogeneity
  – Difficult to add new fields to be extracted
  – Difficult to select the right features for training
Rule-Based Approach
  – No need for training from samples
  – Can extract different metadata from different documents
  – Rule writing may require significant technical expertise

Metadata Extraction Experience at ODU CS
DTIC (2004, 2005)
  – Developed software to automate the task of extracting metadata and basic structure from DTIC PDF documents
      explored alternatives including SVM, HMM, expert systems
      origin of the ODU template-based engine
GPO (in progress)
NASA (in progress)
  – Feasibility study to apply the template-based approach to the CASI collection

Meeting the Challenges
All techniques achieved reasonable accuracy for small collections
  – Possible to scale to large homogeneous collections
Heterogeneity remains a problem
  – Ad hoc rule-based systems tend toward complex monoliths
  – Expert systems tend toward large rule sets with complex, poorly understood interactions
  – Machine-learning approaches must choose between reduced accuracy and confidence or state explosion
Evolution is problematic for machine-learning approaches
  – Older documents may have a higher rate of OCR errors
  – Expensive retraining is required to accommodate changes in the collection
  – Potential lag time during which accuracy decays until sufficient training instances are acquired
Validation: a largely unexplored area
  – Machine-learning approaches offer some support via confidence measures

Architecture for Metadata Extraction

Our Approach: Meeting the Challenges
Bi-level architecture
  – Classification based upon document similarity
  – Simple templates (rule-based) written for each emerging class

Our Approach: Meeting the Challenges
Heterogeneity
  – Classification, in effect, reduces the problem to multiple homogeneous collections
  – Multiple templates required, but each template is comparatively simple
      only needs to accommodate one class of documents that share a common layout and style
Evolution
  – New classes of documents accommodated by writing a new template
      templates are comparatively simple
      no lengthy retraining required
      potentially rapid response to changes in collection
  – Enriching the template engine by introducing new features to reduce complexity of templates
Validation
  – Exploring a variety of techniques drawn from automated software testing & validation
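
The bi-level idea (classify first, then apply the class's template) can be sketched roughly as follows. This is only an illustration under stated assumptions: the slides do not say how similarity is measured, so Jaccard overlap of feature tokens is assumed here, and the prototype features, class names, threshold, and template file names are all hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical class prototypes: each class is summarized by a set of feature tokens.
PROTOTYPES: Dict[str, Set[str]] = {
    "report_doc_page": {"has_form_grid", "label:REPORT DATE", "label:AUTHOR(S)"},
    "thesis_cover": {"centered_title", "large_font_top", "word:thesis"},
}

# Hypothetical mapping from a document class to the template file written for it.
TEMPLATES: Dict[str, str] = {
    "report_doc_page": "rdp.xml",
    "thesis_cover": "thesis.xml",
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(features: Set[str], threshold: float = 0.5) -> Optional[str]:
    """Level 1: return the most similar class, or None if nothing is close enough."""
    best = max(PROTOTYPES, key=lambda name: jaccard(features, PROTOTYPES[name]))
    return best if jaccard(features, PROTOTYPES[best]) >= threshold else None


def select_template(features: Set[str]) -> Optional[str]:
    """Level 2 entry point: pick the template written for the document's class."""
    doc_class = classify(features)
    return TEMPLATES.get(doc_class) if doc_class else None
```

In this sketch a result of None marks a document that fits no known class, which is the point at which a new template would be written rather than a model retrained.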

Metadata Extraction – Template-based
Template-based approach
  – Classify documents into classes based on similarity
  – For each document class, create a template, or a set of rules
  – Decoupling rules from coding
      A template is kept in a separate file
Advantages
  – Easy to extend
      For a new document class, just create a template
  – Rules are simpler
  – Rules can be refined easily

Classes of documents

Template engine

Document features
Layout features
  – Boldness, i.e., whether text is in bold font or not
  – Font size, i.e., the font size used in the text, e.g., font size 12, font size 14, etc.
  – Alignment, i.e., whether text is left, right, centered, or adjusted (justified)
  – Geometric location, for example, a block starting at coordinates (0, 0) and ending at coordinates (100, 200)
  – Geometric relation, for example, a block located below the title block
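
One way to picture these layout features is as fields on a block record, with the geometric location and relation expressed as small predicates. The sketch below is an assumption for illustration only: the field names, the coordinate convention (y growing downward from the top of the page), and both helper functions are hypothetical and not taken from the ODU engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A text block carrying the layout features listed above (names are illustrative)."""
    text: str
    bold: bool
    font_size: float
    alignment: str  # "left", "right", "center", or "justified"
    x0: float       # top-left corner
    y0: float
    x1: float       # bottom-right corner
    y1: float


def is_below(block: Block, other: Block) -> bool:
    """Geometric relation: block starts at or below the bottom edge of other.

    Assumes y grows downward from the top of the page.
    """
    return block.y0 >= other.y1


def starts_within(block: Block, x0: float, y0: float, x1: float, y1: float) -> bool:
    """Geometric location: block's top-left corner lies inside the given rectangle."""
    return x0 <= block.x0 <= x1 and y0 <= block.y0 <= y1
```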

Document features
Textual features
  – Special words, for example, a string starting with “abstract”
  – Special patterns, for example, a string matching the regular expression “[1-2][0-9][0-9][0-9]”
  – Statistics features, for example, a string with more than 20 words, a string with more than 100 letters, or a string with more than 50% of its letters in upper case
  – Knowledge features, for example, a string containing a last name from a name dictionary
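
The textual features above map naturally onto a handful of boolean tests. The following is a minimal sketch, assuming each feature is evaluated on a single string; the function name, the feature keys, and the two-entry name dictionary are hypothetical stand-ins (a real name dictionary would be far larger).

```python
import re
from typing import Dict, Set

YEAR_PATTERN = re.compile(r"[1-2][0-9][0-9][0-9]")  # the pattern quoted on the slide

# Tiny stand-in for a name dictionary; purely illustrative.
LAST_NAMES: Set[str] = {"maly", "smith"}


def textual_features(s: str) -> Dict[str, bool]:
    """Evaluate the special-word, pattern, statistics, and knowledge features on s."""
    words = s.split()
    letters = [c for c in s if c.isalpha()]
    upper = [c for c in letters if c.isupper()]
    return {
        "starts_with_abstract": s.strip().lower().startswith("abstract"),
        "has_year_pattern": YEAR_PATTERN.search(s) is not None,
        "more_than_20_words": len(words) > 20,
        "more_than_100_letters": len(letters) > 100,
        "mostly_upper": bool(letters) and len(upper) / len(letters) > 0.5,
        "has_known_last_name": any(w.strip(".,;").lower() in LAST_NAMES for w in words),
    }
```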

Template language
XML based
Related to document features
XML schema
Simple document model
  – Document
  – page-zone-region-column-row-paragraphs-lines-words-character
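
The actual template syntax is not reproduced in this transcript, so the sketch below invents a tiny XML vocabulary purely to illustrate how rules kept in a separate XML file can drive extraction without changing code. The element names (template, field, rule), the attribute names, and the two supported features are all assumptions, not the ODU template language.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical template; element and attribute names are invented for illustration.
TEMPLATE_XML = """
<template class="sample_cover">
  <field name="title"><rule feature="min_font_size" value="14"/></field>
  <field name="date"><rule feature="matches" value="[1-2][0-9][0-9][0-9]"/></field>
</template>
"""

Line = Dict[str, object]  # e.g. {"text": "...", "font_size": 12.0}


def rule_holds(rule: ET.Element, line: Line) -> bool:
    """Check one rule (a layout or textual feature test) against one line."""
    feature, value = rule.get("feature"), rule.get("value", "")
    if feature == "min_font_size":
        return float(line["font_size"]) >= float(value)
    if feature == "matches":
        return re.search(value, str(line["text"])) is not None
    return False  # unknown features never match


def extract(template_xml: str, lines: List[Line]) -> Dict[str, str]:
    """For each field, take the first line that satisfies all of the field's rules."""
    result: Dict[str, str] = {}
    for field in ET.fromstring(template_xml).findall("field"):
        rules = field.findall("rule")
        for line in lines:
            if all(rule_holds(r, line) for r in rules):
                result[field.get("name", "")] = str(line["text"])
                break
    return result
```

Because the rules live in the XML string (or file) rather than in the Python code, supporting a new document class in this sketch means writing another template, not another extractor.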

Template sample

Sample document pdf

Scan OCR output

‘Clean’ XML output

Template (part)

Metadata extracted

Results Summary from DTIC Project

Experiment with Limited GPO Documents
14 GPO Documents having Technical Report Documentation Page
57 GPO Documents without Technical Report Documentation Page
16 Congressional Reports
16 Public Law Documents

GPO Report Documentation Page

GPO Document

Congressional Report

Public Law Document

Conclusions
OCR software works very well on current documents
Template-based approach allows automatic metadata extraction, with a high degree of accuracy, from
  – Dynamically changing collections
  – Heterogeneous, large collections
  – Report document pages
Feasibility of structure metadata extraction (e.g., table of contents, tables, equations, sections)

Additional Slides

Metadata Extraction: Machine-Learning Approach
Learn the relationship between input and output from samples and make predictions for new data
This approach has good adaptability, but it has to be trained from samples
HMM (Hidden Markov Model) & SVM (Support Vector Machine)

Machine Learning - Hidden Markov Models
“Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in discrete-time series” (Alan B. Poritz, Hidden Markov Models: A Guided Tour, ICASSP 1988)
An HMM is a probabilistic finite state automaton
  – Transits from state to state
  – Emits a symbol on visiting each state
  – States are hidden
[Figure: state diagram with states A, B, C, D]

Hidden Markov Models
A Hidden Markov Model consists of
  – A set of hidden states (e.g., coin1, coin2, coin3)
  – A set of observation symbols (e.g., H and T)
  – Transition probabilities: the probability of moving from one state to another
  – Emission probabilities: the probability of emitting each symbol in each state
  – Initial probabilities: the probability of each state being chosen as the first state
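
The five components above can be written down directly as tables. The sketch below uses the coin example from the slide, with made-up probability values (none of the numbers come from the presentation), and adds the standard forward algorithm to show how the tables combine into the probability of an observed sequence.

```python
from typing import Dict, Sequence

STATES = ("coin1", "coin2", "coin3")
SYMBOLS = ("H", "T")

# All probabilities below are made-up illustration values.
INITIAL: Dict[str, float] = {"coin1": 0.5, "coin2": 0.3, "coin3": 0.2}
TRANSITION: Dict[str, Dict[str, float]] = {
    "coin1": {"coin1": 0.6, "coin2": 0.3, "coin3": 0.1},
    "coin2": {"coin1": 0.2, "coin2": 0.5, "coin3": 0.3},
    "coin3": {"coin1": 0.3, "coin2": 0.3, "coin3": 0.4},
}
EMISSION: Dict[str, Dict[str, float]] = {
    "coin1": {"H": 0.5, "T": 0.5},
    "coin2": {"H": 0.8, "T": 0.2},
    "coin3": {"H": 0.1, "T": 0.9},
}


def sequence_probability(observations: Sequence[str]) -> float:
    """Forward algorithm: P(observations), summed over all hidden state paths."""
    if not observations:
        return 1.0
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        alpha = {
            s: sum(alpha[p] * TRANSITION[p][s] for p in STATES) * EMISSION[s][symbol]
            for s in STATES
        }
    return sum(alpha.values())
```

A quick sanity check on such tables is that the probabilities of all observation sequences of a given length sum to 1.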

HMM - Metadata Extraction
  – A document is a sequence of words produced by some hidden states (title, author, etc.)
  – The parameters of the HMM are learned from samples in advance
  – Metadata extraction means finding the most probable sequence of states (title, author, etc.) for a given sequence of words
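
Finding the most probable state sequence is usually done with the Viterbi algorithm. The sketch below is a toy instance with two hidden states and a four-word vocabulary; every probability is invented for illustration, whereas in practice the parameters would be estimated from labeled samples.

```python
from typing import Dict, List, Sequence

STATES = ("title", "author")

# Made-up parameters; a real system would learn them from training documents.
INITIAL = {"title": 0.9, "author": 0.1}
TRANSITION = {
    "title": {"title": 0.7, "author": 0.3},
    "author": {"title": 0.1, "author": 0.9},
}
EMISSION = {
    "title": {"metadata": 0.5, "extraction": 0.4, "kurt": 0.05, "maly": 0.05},
    "author": {"metadata": 0.05, "extraction": 0.05, "kurt": 0.45, "maly": 0.45},
}


def viterbi(words: Sequence[str]) -> List[str]:
    """Return the most probable hidden state sequence for the given words.

    Words outside the toy vocabulary raise KeyError; a real model needs smoothing.
    """
    if not words:
        return []
    # best[s] = probability of the best path so far that ends in state s
    best = {s: INITIAL[s] * EMISSION[s][words[0]] for s in STATES}
    back: List[Dict[str, str]] = []
    for word in words[1:]:
        prev_of: Dict[str, str] = {}
        nxt: Dict[str, float] = {}
        for s in STATES:
            p = max(STATES, key=lambda q: best[q] * TRANSITION[q][s])
            prev_of[s] = p
            nxt[s] = best[p] * TRANSITION[p][s] * EMISSION[s][word]
        back.append(prev_of)
        best = nxt
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for prev_of in reversed(back):
        state = prev_of[state]
        path.append(state)
    return path[::-1]
```

With these toy numbers, the words "metadata extraction kurt maly" are labeled title, title, author, author.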

Machine Learning: Support Vector Machines
Binary classifier (classifies data into two classes)
  – It represents data with pre-defined features
  – It finds the plane with the largest margin that separates the two classes in the samples
  – It classifies data into two classes based on which side of the plane they fall
[Figure: scatter plot with axes "Font size" and "Line number", showing the separating hyperplane and its margin]
The figure shows an SVM example that classifies a line into two classes, title and not title, using two features: font size and line number (1, 2, 3, etc.). Each dot represents a line. Red dot: title; blue dot: not title.
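
A full SVM solves an optimization problem over all training lines. As a deliberately simplified sketch, the code below handles only the smallest possible case, one title line and one non-title line, where the largest-margin boundary has a closed form (the perpendicular bisector of the two points). The function names and the sample points are assumptions for illustration and no SVM library is used.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (font size, line number)


def max_margin_from_pair(pos: Point, neg: Point) -> Tuple[Point, float]:
    """Max-margin hyperplane separating exactly two points.

    Returns (w, b) scaled so that w.pos + b = +1 and w.neg + b = -1.
    """
    dx, dy = pos[0] - neg[0], pos[1] - neg[1]
    norm_sq = dx * dx + dy * dy
    if norm_sq == 0:
        raise ValueError("the two points must differ")
    w = (2 * dx / norm_sq, 2 * dy / norm_sq)
    b = 1 - (w[0] * pos[0] + w[1] * pos[1])
    return w, b


def decision(w: Point, b: float, x: Point) -> float:
    """Signed score: positive means 'title', negative means 'not title'."""
    return w[0] * x[0] + w[1] * x[1] + b


def margin_width(w: Point) -> float:
    """Distance between the two margin lines, 2 / ||w||."""
    return 2 / math.hypot(w[0], w[1])
```

For example, with a title line at (18, 1) and a non-title line at (10, 5), a new line at (20, 2) scores positive and one at (11, 9) scores negative.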

SVM - Metadata Extraction
Widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc.
Basic idea
  – Classes correspond to metadata elements
  – Extracting metadata from a document becomes classifying each line (or block) into the appropriate class
  – For example, extracting the document title from a document becomes classifying each line to see whether it is part of the title or not

Metadata Extraction: Rule-based
Basic idea
  – Use a set of rules, based on human observation, to define how to extract metadata
  – For example, a rule may be “The first line is the title”
Advantages
  – Can be implemented straightforwardly
  – No need for training
Disadvantages
  – Lack of adaptability (works only for similar documents)
  – Difficult to work with a large number of features
  – Difficult to tune the system when errors occur because rules are usually fixed
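
A fixed-rule extractor of this kind can be only a few lines long, which is both its appeal and its weakness. In the sketch below, the first rule is the one quoted on the slide; the second rule (a line starting with "by " names the author) and the function name are hypothetical additions for illustration.

```python
from typing import Dict, List


def extract_by_rules(lines: List[str]) -> Dict[str, str]:
    """Apply fixed, hand-written rules to the lines of a document."""
    non_empty = [ln.strip() for ln in lines if ln.strip()]
    metadata: Dict[str, str] = {}
    # Rule from the slide: the first line is the title.
    if non_empty:
        metadata["title"] = non_empty[0]
    # Hypothetical second rule: a line starting with "by " names the author.
    for ln in non_empty[1:]:
        if ln.lower().startswith("by "):
            metadata["author"] = ln[3:].strip()
            break
    return metadata
```

The lack of adaptability shows up as soon as the layout changes: a document whose first line is a release banner would have that banner returned as its title.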

Metadata Extraction - Rule-based
Expert system approach
  – Build a large rule base using a standard language such as Prolog
  – Use an existing expert system engine (for example, SWI-Prolog)
Advantages
  – Can use an existing engine
Disadvantages
  – Building the rule base is time-consuming
[Figure: Doc, Parser, Facts, Knowledge Base, Expert System Engine, metadata]

Metadata Extraction Experience at ODU CS
We have a knowledge database obtained from analyzing the Arc and DTIC collections
  – Authors (4 million strings from http://arc.cs.odu.edu)
  – Organizations (79 from DTIC250, 200 from DTIC 600)
  – Universities (52 from DTIC250)
