1 Metadata Extraction Experiments with DTIC Collections. Department of Computer Science, Old Dominion University. 2/25/2005. Work in Progress. Metadata Extraction Project sponsored by DTIC.

2 Outline: 2004 Project (problem statement, motivations, objectives, deliverables, tasks); Background (digital libraries and OAI, metadata extraction); System architecture; Experiments (image PDF, normal (text) PDF, color PDF); Status; 2005 (potential application to the DTIC production ingestion process).

3 2004 Project: Problem Statement. Problems of legacy documents: "… most paper documents, even when represented as images …, in a dangerously inconvenient state: compared with encoded data, they are relatively illegible, unsearchable, and unbrowseable." Any information that is difficult to find, access and reuse risks being trapped in a second-class status. (Henry S. Baird, "Digital Libraries and Document Image Analysis", ICDAR 2003)

4 2004 Project: Problem Statement. Even after OCR, documents remain in a "dangerously inconvenient state": the lack of metadata for these resources hampers their discovery and dispersion over the Web, and it hampers interoperability with resources from other organizations. They also lack logical structure, which is very useful for better preservation, discovery, and presentation.

5 2004 Project: Motivations for metadata extraction. Using metadata helps resource discovery: a company may save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop). Using metadata also helps make collections interoperable via OAI-PMH.

6 2004 Project: Motivations for metadata extraction. However, creating metadata manually for a large collection is expensive: it would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop). Automatic metadata extraction tools are essential to reduce the cost.

7 2004 Project: Motivations - logical structure. Converting a document into XML format with logical structure helps information preservation: the information remains accessible, and the document can still be presented in an appropriate way, even when the software needed to open the original document is no longer available. It helps information presentation: with different XSL stylesheets, an XML document can be presented differently and delivered to different devices such as web browsers, PDAs, etc.; it also allows different access levels for different users (for example, registered users get full access to a document while guests see only a part of it, such as the introduction). It helps information discovery: it allows retrieval based on logical components (for example, searching only in the introduction) and special searches such as equation search.

8 2004 Project: Approach. Organizations need to move towards effective conversion of existing corpora into a DTD-compliant XML format, and integration of modern authoring tools into a publication process that produces DTD-compliant XML.

9 2004 Project: Objectives. Our objective is to automate the task of extracting metadata and basic structure from DTIC PDF documents: replace the current manual process for entering incoming documents into the DTIC digital library, and batch-process existing large collections for structural metadata.

10 2004 Project: Deliverables. A software package, written mostly in Java, that accesses PDF documents on a file system, extracts the metadata, and stores it in a local file system; validated against the metadata already extracted manually for the selected PDF files. Viewer software, written in Java, to view/edit the extracted metadata.

11 2004 Project: Deliverables. A technical report presenting the results of the research, and documentation to help in using the software listed above. A feasibility report on extracting complex objects such as figures, equations, references, and tables from the document and representing them in a DTD-compliant XML format. Ingestion software, written in Java, to insert extracted metadata into the existing system used by DTIC.

12 Background: OAI and digital libraries; metadata extraction approaches.

13 Digital Library and OAI. A Digital Library (DL) is a network-accessible and searchable collection of digital information; it provides a way to store, organize, preserve, and share information. Interoperability problem: DLs are usually created separately, using different technologies and different metadata schemas.

14 Open Archives Initiative (OAI). The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework to provide interoperability among heterogeneous DLs. It is based on metadata harvesting: a service provider can harvest metadata from a data provider. A data provider accepts OAI-PMH requests and provides metadata over the network; a service provider issues OAI-PMH requests to get metadata and builds services on top of it. Each data provider can support its own metadata formats, but it has to support at least the Dublin Core (DC) metadata set.
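
As a hedged illustration of the harvesting described on this slide, the sketch below issues a single OAI-PMH ListRecords request for Dublin Core records; the base URL is a placeholder, and a real harvester would also parse the XML response and follow any resumption token.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OaiHarvestSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder base URL of a data provider; replace with a real repository.
        String baseUrl = "http://example.org/oai";
        // Request all records in the mandatory Dublin Core format.
        URI uri = URI.create(baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The body is an XML document containing <record> elements; a full
        // harvester would parse it and follow any <resumptionToken>.
        System.out.println(response.body());
    }
}
```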

16 Dublin Core Metadata Set. It supports 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Relation, Coverage, Rights. All fields are optional. http://dublincore.org/documents/dces/
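
As a sketch of what the Dublin Core format looks like when wrapping extracted fields, the hypothetical helper below builds a minimal oai_dc record for three of the 15 elements; the field values are invented and XML escaping is omitted for brevity.

```java
public class DublinCoreSketch {
    // Wrap a few extracted fields in a minimal oai_dc record.
    // Only title, creator, and date are shown; all 15 elements are optional.
    // Note: values are not XML-escaped here, which a real system must do.
    static String toOaiDc(String title, String creator, String date) {
        return "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\"\n"
             + "           xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n"
             + "  <dc:title>" + title + "</dc:title>\n"
             + "  <dc:creator>" + creator + "</dc:creator>\n"
             + "  <dc:date>" + date + "</dc:date>\n"
             + "</oai_dc:dc>";
    }

    public static void main(String[] args) {
        System.out.println(toOaiDc("Metadata Extraction Experiments with DTIC Collections",
                                   "Old Dominion University", "2005-02-25"));
    }
}
```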

17 Metadata Extraction: Rule-based. Basic idea: use a set of rules, derived from human observation, that define how to extract metadata; for example, a rule may be "the first line is the title". Advantages: straightforward to implement; no need for training. Disadvantages: lacks adaptability (works only for similar documents); difficult to work with a large number of features; difficult to tune the system when errors occur, because the rules are usually fixed.
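
A minimal sketch of the example rule quoted above ("the first line is the title"), assuming the document has already been split into text lines; real rule sets are larger and use layout information as well.

```java
import java.util.List;

public class FirstLineTitleRule {
    // Rule: the first non-empty line of the document is taken as the title.
    static String extractTitle(List<String> lines) {
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                return line.trim();
            }
        }
        return null; // no non-empty line found
    }

    public static void main(String[] args) {
        List<String> doc = List.of("",
                "Metadata Extraction Experiments with DTIC Collections",
                "Old Dominion University");
        System.out.println(extractTitle(doc)); // prints the title line
    }
}
```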

18 Metadata Extraction: Machine-Learning Approach. Learn the relationship between input and output from samples and make predictions for new data. This approach has good adaptability, but it has to be trained on samples. Examples: HMM (Hidden Markov Model) and SVM (Support Vector Machine).

19 Hidden Markov Model - general. Overview: the HMM was introduced by Baum in the late 1960s. It is the dominant technology for speech recognition and is widely used in other areas such as DNA segmentation and gene recognition. HMMs have recently been used in information extraction: address parsing (Borkar 2001, etc.), name recognition (Klein 2003, etc.), reference parsing (Borkar 2001), and metadata extraction (Seymore 1999, Freitag 2000, etc.).

20 Support Vector Machine - general. Overview: the SVM was introduced by Vapnik in the late 1970s and is now receiving increasing attention. It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc. A list of SVM applications is available at http://www.clopinet.com/isabelle/Projects/SVM/applist.html. It is also used in text analysis (Joachims 1998, etc.) and metadata extraction (Han 2003).

21 SVM - Metadata Extraction. Basic idea: classes correspond to metadata elements, so extracting metadata from a document becomes classifying each line (or block) into the appropriate class. For example, to extract the document title, classify each line to see whether it is part of the title or not. Related work: Automatic Document Metadata Extraction Using Support Vector Machines (H. Han, 2003); an overall accuracy of 92.9% was reported.

22 System Architecture

23 System Architecture (cont.). Main components: OCR/Converter: commercial OCR software is used to OCR image PDF files; normal PDF files are converted to XML files directly. Metadata Extractor: extracts metadata using rules and machine-learning techniques; the extracted metadata are stored in a local database, and in order to support Dublin Core it may be necessary to map the extracted metadata to the Dublin Core format. OAI layer: makes the digital collection interoperable; it accepts OAI requests, gets the information from the database, and encodes the metadata in XML as responses. Search Engine.

24 Metadata Extraction - Rule-based. Expert system approach: build a large rule base using a standard language such as Prolog, and use an existing expert system engine (for example, SWI-Prolog). Advantage: can use an existing engine. Disadvantage: building the rule base is time-consuming. [Diagram: Doc -> Parser -> Facts -> Expert System Engine (with Knowledge Base) -> metadata]

25 Metadata Extraction - Rule-based. Template-based approach: classify documents into classes based on similarity; for each document class, create a template, i.e., a set of rules. Rules are decoupled from the code: a template is kept in a separate file. Advantages: easy to extend (for a new document class, just create a template); rules are simpler; rules can be refined easily. A hypothetical template-file sketch follows below. [Diagram: each document (Doc1, Doc2, Doc3) is matched to its template (template1, template2) and passed to metadata extraction, which outputs the metadata.]
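
The slides do not give the template file format, so the sketch below assumes a hypothetical plain-text template in Java properties syntax that maps each metadata element to a simple rule for one document class; it only shows how keeping the template in a separate file decouples the rules from the code.

```java
import java.io.FileReader;
import java.util.Properties;

public class TemplateSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical template file for one document class, e.g. class1.template:
        //   title  = line 1
        //   author = line 2
        //   date   = line 4
        Properties template = new Properties();
        try (FileReader in = new FileReader("class1.template")) {
            template.load(in);
        }
        // A template-driven extractor would read these rules and apply them
        // to every document assigned to this class.
        System.out.println("title rule: " + template.getProperty("title"));
    }
}
```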

26 Metadata Extraction - Machine Learning Approach: SVM. Feature extraction: line-level features (line length, number of words in the line, percentage of capitalized words, percentage of possible names, etc.) and word-level features (is an English word, is a person name, is a city name, is a state name, etc.). [Diagram: a tagged document goes through feature extraction, supported by the knowledge base, into the SVM learner, which produces models; a new document goes through the same feature extraction into the SVM classifiers, which output the metadata.]
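
A hedged sketch of a few of the line-level features named on this slide (line length, word count, percentage of capitalized words); the word-level and name-dictionary features would come from the knowledge base and are omitted.

```java
public class LineFeatureSketch {
    // Compute a few line-level features for one text line.
    static double[] lineFeatures(String line) {
        String[] words = line.trim().split("\\s+");
        int wordCount = line.trim().isEmpty() ? 0 : words.length;
        int capitalized = 0;
        for (String w : words) {
            if (!w.isEmpty() && Character.isUpperCase(w.charAt(0))) {
                capitalized++;
            }
        }
        double pctCapitalized = wordCount == 0 ? 0.0 : (double) capitalized / wordCount;
        // Feature vector: [line length in characters, word count, % capitalized words]
        return new double[] { line.length(), wordCount, pctCapitalized };
    }

    public static void main(String[] args) {
        double[] f = lineFeatures("Metadata Extraction Experiments with DTIC Collections");
        System.out.printf("length=%.0f words=%.0f capitalized=%.2f%n", f[0], f[1], f[2]);
    }
}
```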

27 Experiments: image PDF; normal (text) PDF; color PDF.

28 Performance Measure. For an individual metadata element, classifications fall into four counts: a = in the class and classified as in the class; b = in the class but classified as not in the class; c = not in the class but classified as in the class; d = not in the class and classified as not in the class. Precision = a/(a+c); Recall = a/(a+b); Accuracy = (a+d)/(a+b+c+d). Overall accuracy is the ratio of the number of items classified correctly to the total number of items.
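
A small sketch computing the three measures from the four counts defined above; the counts are invented for illustration.

```java
public class PerformanceMeasures {
    public static void main(String[] args) {
        // Hypothetical counts for one metadata element:
        // a = in class, classified in class      b = in class, classified out
        // c = out of class, classified in class  d = out of class, classified out
        int a = 90, b = 10, c = 5, d = 895;

        double precision = (double) a / (a + c);
        double recall    = (double) a / (a + b);
        double accuracy  = (double) (a + d) / (a + b + c + d);

        System.out.printf("precision=%.3f recall=%.3f accuracy=%.3f%n",
                          precision, recall, accuracy);
    }
}
```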

29 Experiments with image PDF: OCR (OmniPage, PrimeOCR); metadata extraction (SVM, template-based); structure extraction & markup.

30 Image PDF: OCR. ScanSoft OmniPage Pro 14.0. Features: supports PDF input and standard XML output, and automatically processes newly added documents in its watched folder; we developed a clean-up module to produce standard XML output. PrimeOCR SDK. Features: supports PDF input but not XML output; we wrote code against its API to process documents and write the results in our XML format.

31 Image/metadata extraction: SVM. Objective: evaluate the performance of SVM on different data sets to see how well SVM works for metadata extraction. Data sets. Data Set 1: Seymore935, downloaded from http://www-2.cs.cmu.edu/~kseymore/ie.html; 935 manually tagged document headers; the first 500 are used for training and the rest for testing. Data Set 2: DTIC100; 100 PDF files selected from the DTIC website based on the Z39.18 standard; the first pages were OCRed and converted to text format, and the 100 document headers were manually tagged; the first 75 are used for training and the rest for testing.

32 Image/metadata extraction: SVM (cont.). Data Set 3: DTIC33, a subset of DTIC100; 33 tagged document headers with identical layout; the first 24 are used for training and the rest for testing. [Diagram: the data sets DTIC33, DTIC100, and Seymore935 arranged in order of increasing heterogeneity.]

33 Image/metadata extraction: SVM (cont.). Overall accuracy of title, author, affiliation, and date.

34 Image/metadata extraction: SVM (cont.)

35 Experiments/Image/metadata: SVM (cont.)

36 Image/metadata extraction: Template. Objective: evaluate the performance of our rule-based approach of defining a template for each class. Experiment: use data set DTIC100 (100 XML files with font size and bold information); it is divided into 7 classes according to layout information; for each class, a template is developed after checking the first one or two documents in the class, and this template is applied to the remaining documents to get performance data.

37 Image/metadata extraction: Template (cont.)

38 Image/metadata extraction: Template (cont.)

39 Image/metadata extraction: Template (cont.). We applied this approach to an enlarged data set, extracting metadata from the raw XML files obtained after OCR. We divided the 546 documents into 15 classes and used a template for each document class. Demo2 shows metadata extraction from the 546 documents with 15 classes.

40 Image/metadata extraction: Template (cont.)

41 Image PDF/Structure Markup. Extract the document's logical structure and represent the document in XML format using hard-coded rules: classify lines to see whether they are subtitles or not (by line length, layout information, etc.); group the subtitles into classes based on their features, where each class represents a level in the final hierarchical structure. See the structure extraction results for 2 PDF files (the metadata part was extracted by the template-based approach). A mockup example shows how a marked-up document may look in the future.
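
A minimal sketch of the grouping step described above, under the assumption that each detected subtitle carries a font size and that subtitles with the same font size belong to the same hierarchy level (larger font means a higher level); the actual rules also use other layout features.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

public class SubtitleGroupingSketch {
    record Subtitle(String text, double fontSize) {}

    // Group subtitles by font size; larger fonts become higher (outer) levels.
    static TreeMap<Double, List<Subtitle>> groupByLevel(List<Subtitle> subtitles) {
        TreeMap<Double, List<Subtitle>> levels = new TreeMap<>(Comparator.reverseOrder());
        for (Subtitle s : subtitles) {
            levels.computeIfAbsent(s.fontSize(), k -> new ArrayList<>()).add(s);
        }
        return levels;
    }

    public static void main(String[] args) {
        List<Subtitle> subs = List.of(
                new Subtitle("1. Introduction", 14.0),
                new Subtitle("1.1 Background", 12.0),
                new Subtitle("2. Method", 14.0));
        int level = 1;
        for (var entry : groupByLevel(subs).entrySet()) {
            System.out.println("level " + level++ + ": " + entry.getValue());
        }
    }
}
```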

42 Text PDF. Metadata extraction with the expert system approach. Cover page detection: classify whether a page is a cover page or not. SF298 form detection and processing: given a document, find its SF298 form page and extract information from the form.

43 Text PDF/metadata extraction: Expert System. Objective: can an expert system be used to recognize documents that do not fall into known categories? Preliminary experiment: we used Prolog and the SWI-Prolog engine, wrote rules for title extraction, encoded documents into a Prolog data structure, and extracted titles from these documents. Results: we succeeded in extracting the title from about 85% of randomly selected DTIC documents.

44 Text PDF/metadata extraction: Cover Page detection. We wrote a program to classify whether a page is a "cover page" or not based on rules such as: a cover page contains fewer words and fewer lines; it has fewer words per line; it has some lines in the second half of the page; etc. Currently these rules are hard-coded. We tested the code with about 30 DTIC documents (see the classification result): all cover pages were found, and the 2 documents with no cover page were identified.
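
A hedged sketch of the cover-page rules listed above; the thresholds are invented for illustration and are not the ones used in the project.

```java
import java.util.List;

public class CoverPageRuleSketch {
    // A text line with its vertical position on the page (0.0 = top, 1.0 = bottom).
    record Line(String text, double y) {}

    // Rules from the slide: few words, few lines, few words per line,
    // and some lines in the second half of the page. Thresholds are illustrative.
    static boolean isCoverPage(List<Line> lines) {
        int wordCount = 0;
        boolean hasLineInSecondHalf = false;
        for (Line line : lines) {
            if (line.text().trim().isEmpty()) continue;
            wordCount += line.text().trim().split("\\s+").length;
            if (line.y() > 0.5) hasLineInSecondHalf = true;
        }
        double wordsPerLine = lines.isEmpty() ? 0 : (double) wordCount / lines.size();
        return lines.size() < 40 && wordCount < 150 && wordsPerLine < 8 && hasLineInSecondHalf;
    }
}
```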

45 Text PDF/metadata extraction: SF298 form detection and processing. We downloaded about 1000 PDF files from the DTIC collection and wrote code to locate and process SF298 forms: locate the SF298 form by matching some special strings; detect feature changes within the form to determine whether a string is part of a field name or part of a field value; convert multi-line field names into a single line; when a field value and a field name are on the same line, separate them; and assign each value string to its field name, i.e., the field name with minimum distance located above it. Demo6 shows the results of SF298 form processing on these 1000 PDF documents: 100% of SF298 pages were located in the 1000 documents (when present); about 95% of the 4 major fields were correctly identified*; about 90% of up to 27 (maximum) fields were correctly extracted*. (*based on random sample validation)
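
A hedged sketch of the field-name assignment rule described above: a value string is matched to the field name that lies above it at the minimum distance. Coordinates and the distance measure are simplifying assumptions, and the example field names are only illustrative.

```java
import java.util.List;

public class Sf298FieldMatchSketch {
    // A recognized string with its position on the page (y grows downward).
    record Box(String text, double x, double y) {}

    // Assign a value box to the closest field-name box located above it.
    static Box nearestFieldNameAbove(Box value, List<Box> fieldNames) {
        Box best = null;
        double bestDist = Double.MAX_VALUE;
        for (Box name : fieldNames) {
            if (name.y() >= value.y()) continue;        // must be above the value
            double dist = Math.hypot(name.x() - value.x(), name.y() - value.y());
            if (dist < bestDist) {
                bestDist = dist;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Box> names = List.of(new Box("1. REPORT DATE", 50, 100),
                                  new Box("4. TITLE AND SUBTITLE", 50, 160));
        Box value = new Box("Metadata Extraction Experiments", 55, 180);
        System.out.println(nearestFieldNameAbove(value, names).text());
    }
}
```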

46 Experiments with color PDFs. We downloaded 18 documents and explored both options: OCR with OmniPage to produce raw XML, then our cleanup module to produce CleanXML; and OCR with PrimeOCR, modified to produce CleanXML directly. We applied the template approach, found 8 classes, and extracted the metadata correctly from 17 of the 18 documents.

47 Status. We have gauged the accuracy of the individual components of the architecture through experiments on the DTIC collection. We have implementations of the metadata extraction modules: template-based, SVM, HMM, and rule-based expert system.

48 Status. We have a feature set for expressing rules, and templates for 15 classes of DTIC documents. We have an automated process: manually OCR a set of PDF files into a directory; then, automatically for all files, extract the metadata, create an OAI-compliant XML metadata record, and insert it into an OAI-compliant digital library.

49 Status. We have a Java editor for XML records. We have OCR cleanup modules: input (image PDF, text PDF); OCR output (WordML, raw XML); cleanup output (CleanXML). We have a very preliminary prototype structure markup module.

50 Experiment summary

51 Status. We have a knowledge database obtained from analyzing the Arc and DTIC collections: authors (4 million strings from http://arc.cs.odu.edu), organizations (79 from DTIC250), universities (52 from DTIC250).

52 Proposal 2005-06. Goal: develop a computer-assisted production process and supporting software to extract metadata from color PDFs and insert records into the DTIC collection. Assumptions: on the order of 10,000 documents/yr; on the order of 100 types of documents.

53 Proposal 2005-06. Needed additional software: analyze the existing DTIC collection and create a knowledge database of metadata (e.g., all known authors, organizations, ...); create a software environment in which humans interact with system-generated analyses to correct metadata; create a voting module to predict the need for human intervention.

54 Proposal 2005-06 Tasks. Phase 1 (5 months): analyze the DTIC system environment; develop software modules. Phase 2 (4 months): integrate software to insert records into the DTIC collection; train the learning modules; create an initial set of templates for the rule module.

55 Proposal 2005-06 Tasks. Phase 3 (2 months): observe humans in the actual process and gather data on real production; develop editors for templates and features to handle new types of documents not handled by the system.

56 Proposal 2005-06 Tasks. Phase 4 (1 month): monitored production run.

57 Proposal 2005-06: Outcome. We expect to be able to run the system (with developer support) such that only 20% of documents will need human intervention. We expect that, in the case of human intervention, we will reduce the processing time by 80%.


60 Metadata Extraction: Rule-based. Related work: Automated labeling algorithms for biomedical document images (Kim J, 2003): extracts metadata from the first pages of biomedical journals; accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (76 articles used for testing). Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000): extracts metadata from the U-Wash document corpus with 979 journal pages; good results for some elements (e.g., page number with 90% recall and 98% precision) but poor results for others (abstract: 35% recall and 90% precision; biography: 80% recall and 35% precision).

61 2004 Project: Tasks. Working with DTIC to identify the set of documents and the metadata of interest. Developing software for metadata and structure extraction from the selected set of PDF documents. Feasibility study for extracting complex objects and representing the complete document in XML.

62 End

63 Summary. For image PDF documents, OCR software is very important; to automate the whole process, the OCR software has to either support automatic processing or provide an API. We chose OmniPage and the PrimeOCR SDK based on our research. Lessons learned: different OCR software may use different formats, so a desirable approach is to use an internal format; OCR software may output many formats, some of which are difficult to process, so choosing which format to work with is important; and developing a good post-OCR processing tool is very important, since in our experience poor post-OCR processing can significantly degrade overall performance.

64 Summary (cont.). For text PDF documents, to avoid OCR errors, we converted them to XML format without OCR. Lesson learned: sometimes reordering the text strings is necessary, especially for SF298 forms. We showed that SVM is a feasible way to extract metadata by applying it to several data sets. Lesson learned: the less heterogeneous the data set, the better the performance.

65 Summary (cont.). Our template-based approach achieved high accuracy while keeping the template for each class very simple. It also provided a way to extract different metadata from different document classes. Lesson learned: it is a challenge to make it scale to a large collection when you do not know how many classes there are.

66 Summary (cont.). Our expert system approach aims to develop more general rules for extracting metadata from a large collection. Lesson learned: building a rule base for a large heterogeneous collection is time-consuming; a more feasible way is to develop general rules for a specific situation (for example, extracting the title from a cover page) and then combine them.

67 Summary (cont.). We showed that it is feasible to classify whether a page is a "cover page" or not with some simple rules. We handled a special case, the SF298 form, by detecting it in a document and extracting information from it.

68 Conclusion and Future Work. Our template-based approach can provide high accuracy with a set of simple rules. In the future, we need to develop a tool to classify documents into different classes or assign a new document to a class; for documents that do not belong to a known class, either leave them for the user to define more templates or use an expert system with general rules to process them. We believe that integrating the machine-learning approach will improve the performance of our system. To improve performance further, a feedback loop needs to be implemented so that users can check the results and the system can learn from their actions.

70 Future Work: Metadata Extraction

71 Hidden Markov Model - general. "Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in discrete-time series" (Alan B. Poritz, Hidden Markov Models: A Guided Tour, ICASSP 1988). An HMM is a probabilistic finite state automaton: it transitions from state to state and emits a symbol at each state it visits; the states are hidden. [Diagram: a four-state automaton with states A, B, C, D.]

72 Hidden Markov Model - general. A Hidden Markov Model consists of: a set of hidden states (e.g., coin1, coin2, coin3); a set of observation symbols (e.g., H and T); transition probabilities (the probability of moving from one state to another); emission probabilities (the probability of emitting each symbol in each state); and initial probabilities (the probability of each state being chosen as the first state).

73 HMM - Metadata Extraction. A document is a sequence of words produced by some hidden states (title, author, etc.). The parameters of the HMM are learned from samples in advance. Metadata extraction then amounts to finding the most probable sequence of states (title, author, etc.) for a given sequence of words.

74 Example: the header text "Challenges in Building Federation Services over Harvested Metadata, Kurt Maly, Mohammad Zubair, 2003" is labeled so that "Challenges in Building Federation …" maps to the title state, "Kurt Maly …" to the author state, and "2003 …" to the date state.
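
To make "the most probable sequence of states" concrete, here is a small, self-contained Viterbi decoding sketch with invented parameters for three states (title, author, date) and a toy three-symbol vocabulary; a real extractor learns these parameters from tagged samples and uses a much richer vocabulary.

```java
public class ViterbiSketch {
    // Find the most probable state sequence for an observation sequence.
    // start[s], trans[s][s'] and emit[s][o] are the HMM parameters.
    static int[] viterbi(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int nStates = start.length;
        double[][] prob = new double[obs.length][nStates];
        int[][] back = new int[obs.length][nStates];

        for (int s = 0; s < nStates; s++) {
            prob[0][s] = start[s] * emit[s][obs[0]];
        }
        for (int t = 1; t < obs.length; t++) {
            for (int s = 0; s < nStates; s++) {
                for (int p = 0; p < nStates; p++) {
                    double cand = prob[t - 1][p] * trans[p][s] * emit[s][obs[t]];
                    if (cand > prob[t][s]) {
                        prob[t][s] = cand;
                        back[t][s] = p;
                    }
                }
            }
        }
        // Pick the best final state, then backtrack.
        int[] path = new int[obs.length];
        for (int s = 1; s < nStates; s++) {
            if (prob[obs.length - 1][s] > prob[obs.length - 1][path[obs.length - 1]]) {
                path[obs.length - 1] = s;
            }
        }
        for (int t = obs.length - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        String[] states = { "title", "author", "date" };
        // Invented parameters; observation symbols: 0 = "word", 1 = "name", 2 = "year".
        double[] start = { 0.8, 0.15, 0.05 };
        double[][] trans = { { 0.6, 0.3, 0.1 }, { 0.1, 0.6, 0.3 }, { 0.1, 0.1, 0.8 } };
        double[][] emit  = { { 0.8, 0.15, 0.05 }, { 0.2, 0.7, 0.1 }, { 0.1, 0.1, 0.8 } };
        int[] obs = { 0, 0, 1, 2 };   // word word name year

        for (int s : viterbi(obs, start, trans, emit)) {
            System.out.print(states[s] + " ");   // prints: title title author date
        }
        System.out.println();
    }
}
```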

75 HMM - Metadata Extraction. Related work: K. Seymore, A. McCallum, and R. Rosenfeld, Learning hidden Markov model structure for information extraction. Result: an overall accuracy of 90.1% was reported.

76 Support Vector Machine - general. A binary classifier (it classifies data into two classes): it represents data with pre-defined features; it finds the hyperplane with the largest margin separating the two classes in the training samples; and it classifies new data based on which side of the hyperplane they fall on. [Figure: an SVM example classifying a line into two classes (title, not title) using two features, font size and line number (1, 2, 3, etc.); each dot represents a line, red dots are title lines and blue dots are not; the hyperplane and its margin separate the two groups.]

77 Support Vector Machine - general. Extension to a non-linear decision boundary: a mapping function transforms the data from the input space into a higher-dimensional feature space, where a linear separation can be found. [Figure: points in the input space mapped into the feature space by a mapping function.]

78 Support Vector Machine - general. Extension to multiple classes. One-vs-rest: the two classes are "in this class" and "not in this class"; positive training samples are the data in the class and negative samples are the rest; this requires K binary SVMs (K is the number of classes). One-vs-one: the two classes are "in class one" and "in class two"; positive samples are the data in one class and negative samples are the data in the other; this requires K(K-1)/2 binary SVMs.

