1 Tools for Extracting Metadata and Structure from DTIC Documents. Digital Library Group, Department of Computer Science, Old Dominion University. December, 2004

2 Problem Statement: Manual metadata extraction and logical structure extraction are expensive. Metadata improves discovery and interoperability (OAI-PMH). Logical structure supports preservation and different presentation formats (e.g., mobile devices in the future).

3 Motivations – Metadata Extraction: Using metadata helps resource discovery. A company may save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching for, verifying, and organizing files (estimate by Mike Doane at a DCMI 2003 workshop). Using metadata helps make collections interoperable through OAI-PMH. On the other hand, creating metadata manually for a large collection is expensive: it would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at a DCMI 2003 workshop).

4 Motivations – Logical Structure Extraction: Converting a document into XML format with logical structure helps information preservation. The information in a document remains accessible, and the document can still be presented appropriately, when the software that opens the original is no longer available. Converting a document into XML format with logical structure helps information presentation. With different XSL stylesheets, an XML document can be presented differently, and it can be presented differently on different devices such as web browsers, PDAs, etc. It also allows different users to have different levels of access; for example, a registered user may see all parts of a document while a guest can access only the introduction section. Converting a document into XML format with logical structure helps information discovery. It allows retrieval based on logical components (for example, searching only in the introduction) and some special searches such as equation search.

5 Objectives: (1) To develop a flexible and adaptable approach for extracting metadata from physical collections, with a focus on DTIC (Defense Technical Information Center) collections. (2) To develop techniques to extract the basic logical structure of scanned full-text documents. (3) To develop techniques to extract and represent complex objects such as equations, figures, etc.

6 Background: Metadata Extraction (rule-based approach; machine-learning approach: Hidden Markov Model, Support Vector Machine). Logical Structure Extraction (basic logical structure extraction; reference extraction and reference linking). OAI and Digital Library. Note: OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting, is a framework that provides interoperability among distributed collections.

7 Background - Metadata Extraction: Rule-based approach; machine-learning approach (Hidden Markov Model, Support Vector Machine).

8 Metadata Extraction: Rule-based. Basic idea: use a set of rules, derived from human observation, that define how to extract metadata. For example, a rule may be "The first line is the title". Advantages: straightforward to implement; no training needed. Disadvantages: lacks adaptability (works only for similar documents); difficult to use with a large number of features; difficult to tune when errors occur, because the rules are usually fixed.
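A minimal sketch of the rule-based idea, assuming the document is already available as a list of text lines. The first rule is the one quoted on the slide (applied to the first non-blank line); the "by" rule and all function names are hypothetical illustrations, not rules used in the project.

```python
from typing import Callable, Dict, List, Optional

# A rule inspects the lines of a document and returns a value or None.
Rule = Callable[[List[str]], Optional[str]]


def first_line_is_title(lines: List[str]) -> Optional[str]:
    """Rule from the slide: the first line is the title (blank lines skipped)."""
    for line in lines:
        if line.strip():
            return line.strip()
    return None


def by_line_is_author(lines: List[str]) -> Optional[str]:
    """Hypothetical rule: a line starting with 'by ' names the author."""
    for line in lines:
        text = line.strip()
        if text.lower().startswith("by "):
            return text[3:].strip()
    return None


# Fixed rule set: one rule per metadata field.
RULES: Dict[str, Rule] = {
    "title": first_line_is_title,
    "author": by_line_is_author,
}


def extract_metadata(lines: List[str], rules: Dict[str, Rule] = RULES) -> Dict[str, Optional[str]]:
    """Apply every rule to the document and collect the results."""
    return {field: rule(lines) for field, rule in rules.items()}
```

Because the rules are fixed functions, a document with a different layout (say, a report number on the first line) is mislabeled until someone edits the rule, which is the adaptability problem noted above.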

9 Metadata Extraction: Rule-based. Related work: "Automated labeling algorithms for biomedical document images" (Kim J, 2003) extracts metadata from the first pages of biomedical journals. Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (76 articles were used for testing). "Document Structure Analysis Based on Layout and Textual Features" (Stefan Klink, 2000) extracts metadata from the U-Wash document corpus of 979 journal pages. Results are good for some elements (page number: 90% recall and 98% precision) but poor for others (abstract: 35% recall and 90% precision; biography: 80% recall and 35% precision).

10 Metadata Extraction: Machine-Learning Approach. Basic idea: learn the relationship between input and output from samples, then make predictions for new data. This approach adapts well, but it has to be trained on samples.

11 HMM - Metadata Extraction: A document is modeled as a sequence of words produced by hidden states (title, author, etc.). The parameters of the HMM are learned from samples in advance. Metadata extraction then means finding the most probable sequence of states (title, author, etc.) for a given sequence of words.
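Finding the most probable state sequence is usually done with the Viterbi algorithm. The sketch below is a toy illustration only: the three states, all probabilities, the coarse observation symbols ("word", "name", "year"), and the small name lexicon are hand-picked assumptions, whereas a real system would learn its parameters from labeled samples.

```python
import math
from typing import Dict, List

STATES = ["title", "author", "date"]

# Hand-picked toy parameters (assumptions, not learned values).
START = {"title": 0.8, "author": 0.1, "date": 0.1}
TRANS = {
    "title": {"title": 0.8, "author": 0.2, "date": 0.0},
    "author": {"title": 0.0, "author": 0.7, "date": 0.3},
    "date": {"title": 0.0, "author": 0.0, "date": 1.0},
}
EMIT = {
    "title": {"word": 0.90, "name": 0.08, "year": 0.02},
    "author": {"word": 0.10, "name": 0.85, "year": 0.05},
    "date": {"word": 0.05, "name": 0.05, "year": 0.90},
}

# Hypothetical lexicon used to map words to the "name" symbol.
NAME_LEXICON = {"Kurt", "Maly", "Mohammad", "Zubair"}


def observe(word: str) -> str:
    """Map a raw word to one of the coarse observation symbols."""
    if word.isdigit() and len(word) == 4:
        return "year"
    if word in NAME_LEXICON:
        return "name"
    return "word"


def log(p: float) -> float:
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations: List[str]) -> List[str]:
    """Return the most probable state sequence for the observations."""
    # best[s] = (log-probability of the best path ending in s, that path)
    best: Dict[str, tuple] = {
        s: (log(START[s]) + log(EMIT[s][observations[0]]), [s]) for s in STATES
    }
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Pick the best predecessor state for s.
            score, path = max(
                ((best[p][0] + log(TRANS[p][s]), best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (score + log(EMIT[s][obs]), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def label(words: List[str]) -> List[str]:
    """Label each word of a document with a metadata state."""
    return viterbi([observe(w) for w in words])
```

Scores are kept in log space so that long word sequences do not underflow to zero.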

12 Challenges in Building Federation Services over Harvested Metadata, Kurt Maly, Mohammad Zubair, 2003 Challenges in Building Federation … Kurt Maly … 2003 …

13 HMM - Metadata Extraction: Related work: K. Seymore, A. McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction. Result: an overall accuracy of 90.1% was reported.

14 Support Vector Machine - general: Overview. It was introduced by Vapnik in the late 1970s and is now receiving increasing attention. It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc. A list of SVM applications is available at http://www.clopinet.com/isabelle/Projects/SVM/applist.html. It is also used in text analysis (Joachims 1998, etc.) and metadata extraction (Han 2003).

15 Support Vector Machine - general: Many decision boundaries can separate these two classes. Which one should we choose? [Figure: two point sets, Class 1 and Class 2. Courtesy: Martin Law]

16 Support Vector Machine - general: Basic idea: choose the boundary that separates the two classes with the largest margin. [Figure labels: Class 1, Class 2, margin, hyperplane, support vector]

17 Support Vector Machine - general: An SVM is a binary classifier (it classifies data into two classes). It represents data with predefined features. From the samples, it finds the hyperplane that separates the two classes with the largest margin. It then classifies new data according to which side of the hyperplane they fall on. [Figure axes: font size, line number; labels: hyperplane, margin] The figure shows an SVM example that classifies a line into two classes, title and not title, using two features: font size and line number (1, 2, 3, etc.). Each dot represents a line. Red dot: title; blue dot: not title.
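The largest-margin criterion can be illustrated without a full SVM solver. The sketch below is not an SVM trainer: it only compares three hand-picked candidate hyperplanes on made-up (font size, line number) data and keeps the one with the largest geometric margin. All data points, candidate weights, and names are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Tuple[float, float], int]

# (font size, line number) -> +1 for title, -1 for not title (toy data).
SAMPLES: List[Sample] = [
    ((18, 1), +1),
    ((16, 2), +1),
    ((10, 4), -1),
    ((12, 6), -1),
]

# Candidate hyperplanes w . x + b = 0, each of which separates the toy data.
CANDIDATES: Dict[str, Tuple[Tuple[float, float], float]] = {
    "A": ((1.0, 0.0), -14.0),   # font size > 14
    "B": ((1.0, 0.0), -13.0),   # font size > 13
    "C": ((1.0, -1.0), -10.0),  # font size - line number > 10
}


def decision(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of x with respect to the hyperplane (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin(w: Sequence[float], b: float, samples: List[Sample]) -> float:
    """Geometric margin: distance from the hyperplane to the closest sample.

    The value is negative if any sample lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) for x, y in samples) / norm


def best_hyperplane(candidates, samples) -> str:
    """Name of the candidate with the largest margin."""
    return max(candidates, key=lambda name: margin(*candidates[name], samples))


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> str:
    """Classify a line by the side of the hyperplane it falls on."""
    return "title" if decision(w, b, x) > 0 else "not title"
```

An actual SVM searches over all hyperplanes rather than a fixed list, and the samples that end up closest to the chosen hyperplane are its support vectors.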

18 Multi-Class SVMs: Binary SVMs can be combined into a multi-class classifier. One-vs-rest: each classifier decides "in this class" or "not in this class". Positive training samples: data in this class. Negative training samples: the rest. This needs K binary SVMs (K is the number of classes). One-vs-one: each classifier decides "in class one" or "in class two". Positive training samples: data in class one. Negative training samples: data in class two. This needs K(K-1)/2 binary SVMs.
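The two decompositions can be sketched as functions that turn one multi-class training set into several binary training sets. Only the relabeling step is shown; training the binary classifiers is left out. The function names and sample labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Tuple

Labeled = Tuple[Hashable, str]      # (sample, class label)
Binary = Tuple[Hashable, int]       # (sample, +1 or -1)


def one_vs_rest_tasks(samples: List[Labeled]) -> Dict[str, List[Binary]]:
    """One binary task per class: this class (+1) against all the rest (-1)."""
    classes = sorted({label for _, label in samples})
    return {
        c: [(x, +1 if label == c else -1) for x, label in samples]
        for c in classes
    }


def one_vs_one_tasks(samples: List[Labeled]) -> Dict[Tuple[str, str], List[Binary]]:
    """One binary task per pair of classes, using only samples of those two.

    The first class of each pair is the positive class.
    """
    classes = sorted({label for _, label in samples})
    return {
        (a, b): [
            (x, +1 if label == a else -1)
            for x, label in samples
            if label in (a, b)
        ]
        for a, b in combinations(classes, 2)
    }
```

With K = 4 classes this yields 4 one-vs-rest tasks and 4 * 3 / 2 = 6 one-vs-one tasks, matching the counts on the slide.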

19 SVM - Metadata Extraction: Basic idea: classes correspond to metadata elements, so extracting metadata from a document becomes classifying each line (or block) into the appropriate class. For example, extracting the title from a document means classifying each line as part of the title or not. Related work: "Automatic Document Metadata Extraction Using Support Vector Machine" (H. Han, 2003); an overall accuracy of 92.9% was reported.

20 Logical Structure Extraction Physical Structure

21 Structure Extraction Logical Structure

22 Digital Library and OAI: Digital Library (DL): a DL is a network-accessible and searchable collection of digital information. A DL provides a way to store, organize, preserve, and share information. Interoperability problem: DLs are usually created separately, using different technologies and different metadata schemas.

23 Open Archives Initiative (OAI): The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs. It is based on metadata harvesting: a service provider harvests metadata from a data provider. A data provider accepts OAI-PMH requests and provides metadata over the network. A service provider issues OAI-PMH requests to get metadata and builds services on it. Each data provider can support its own metadata formats, but it has to support at least the Dublin Core (DC) metadata set.
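As an illustration of what a service provider sends, the sketch below builds OAI-PMH request URLs. The six verbs and the `metadataPrefix=oai_dc` argument come from the OAI-PMH specification; the base URL `http://example.org/oai` and the function name are placeholders, and no network call is made.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Return the URL a service provider would fetch from a data provider."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Harvest all records in Dublin Core from a placeholder data provider.
LIST_RECORDS_URL = build_oai_request(
    "http://example.org/oai", "ListRecords", metadataPrefix="oai_dc"
)
```

Every data provider must answer `metadataPrefix=oai_dc`, which is why a harvester can rely on Dublin Core even when a repository also offers richer formats.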

25 Dublin Core Metadata Set: It supports 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights. All fields are optional. http://dublincore.org/documents/dces/
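A minimal sketch of mapping extracted metadata to an unqualified Dublin Core record in the `oai_dc` container used by OAI-PMH. The two namespace URIs are the standard ones; the function name and the sample values are assumptions, and schema-location attributes are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_oai_dc(metadata: Dict[str, List[str]]) -> str:
    """Serialize extracted metadata as an oai_dc XML fragment.

    Every element is optional and repeatable, so each field maps to a list.
    """
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, values in metadata.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Sample values for illustration only.
RECORD_XML = to_oai_dc({
    "title": ["Tools for Extracting Metadata and Structure from DTIC Documents"],
    "creator": ["Digital Library Group, Old Dominion University"],
    "date": ["2004"],
})
```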

26 What does it mean to make an existing digital library OAI-enabled? [Diagram: Digital Library, Storage, OAI Layer] Exposing metadata to OAI service providers: DC and parallel metadata sets. ONLY METADATA.

27 OAI Request and OAI Response: An OAI request for metadata is embedded in HTTP. The OAI response to an OAI request is encoded in XML. The XML Schema specification for OAI responses is provided in the OAI-PMH document. (RCDL 2003, St. Petersburg)

28 OAI Mechanics: The request is encoded in HTTP. The response is encoded in XML. XML Schemas for the responses are defined in the OAI-PMH document. Courtesy: Michael Nelson
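To make the response side concrete, the sketch below parses a small `ListRecords` response and pulls out each record's identifier and title. The XML is a hand-written, abbreviated sample (placeholder repository, identifier, and title), not output from a real data provider; only the namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abbreviated, hand-written sample response (placeholder values).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-12-01T00:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2004-12-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample Technical Report</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_list_records(xml_text: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (identifier, title) for every record in a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    return records
```

A real harvester would also follow the `resumptionToken` element that large responses use for paging, which this sketch ignores.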

29 Overall Approach & Architecture* *This is our overall vision and only some components of this architecture are being implemented as part of the current contract

30 Overall Approach & Architecture: Main components. Scan and OCR: commercial OCR software is used to scan the documents. Metadata Extractor: extracts metadata using rules and machine-learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to Dublin Core format. Object Digitization: converts documents into XML format for better preservation and presentation. The main tasks are extraction of complex objects such as figures, extraction of the document's logical structure, and extraction of references and reference linking. OAI layer: makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata in XML as responses. Search Engine.

31 Metadata Extraction: The challenge is how to reach the desired accuracy for a large heterogeneous collection. Manually defining a set of rules that covers all situations in advance is difficult. Machine learning requires a large number of labeled samples; for example, an HMM-based name recognizer used about 1.2 million words of training data to achieve high accuracy (according to Douglas E. Appelt). Accuracy is the ratio of the number of items tagged correctly to the total number of items.
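The accuracy definition above, written out as a small function. It assumes one predicted tag and one reference ("gold") tag per item, which is an interpretation of the slide rather than a documented evaluation protocol.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Number of items tagged correctly divided by the total number of items."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```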

32 Metadata Extraction: Feasible solution: classify the documents into classes so that documents in the same class have a similar layout. Then work on each document class separately instead of on the whole large collection.

33 Metadata Extraction (Cont.): Overall Approach for Handling a Large Collection. Manual classification: this approach assumes it is possible for people to classify the large set of documents into classes (based on time period, source organization, etc.). For each class, randomly select, say, 100 documents and develop a template. Evaluate the template by statistical sampling and refine it until the error is under a tolerance level. Then apply the refined template to the whole set. Auto-classification: this approach assumes it is not possible for people to classify the large set of documents. In this case we develop a higher-level set of rules for classification on a smaller sample. Evaluate the classification approach by statistical sampling. Then develop the template for each class, apply it, and refine it as outlined in the manual classification approach.
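The evaluate-and-refine loop can be sketched as follows. In practice a person revises the template between rounds; here that is modeled as a pre-supplied list of template revisions, which is a simplification. The function names, the `is_correct` callback, and the default sample size are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Doc = TypeVar("Doc")
Template = Callable[[Doc], object]


def estimate_error(
    documents: Sequence[Doc],
    template: Template,
    is_correct: Callable[[Doc, object], bool],
    sample_size: int,
    rng: random.Random,
) -> float:
    """Estimate a template's error rate on a random sample of the class."""
    sample = rng.sample(list(documents), min(sample_size, len(documents)))
    errors = sum(1 for doc in sample if not is_correct(doc, template(doc)))
    return errors / len(sample)


def refine_template(
    documents: Sequence[Doc],
    revisions: List[Template],
    is_correct: Callable[[Doc, object], bool],
    tolerance: float,
    sample_size: int = 100,
    seed: int = 0,
) -> Tuple[Optional[int], Optional[float]]:
    """Try successive revisions until the sampled error is within tolerance.

    Returns (index of the accepted revision, its sampled error), or
    (None, last sampled error) if no revision is good enough.
    """
    rng = random.Random(seed)
    error: Optional[float] = None
    for version, template in enumerate(revisions):
        error = estimate_error(documents, template, is_correct, sample_size, rng)
        if error <= tolerance:
            return version, error
    return None, error
```

Only the accepted revision would then be applied to the whole document class; checking correctness on the sample still needs human judgment, which `is_correct` stands in for.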

34 Metadata Extraction (Cont.)

35 Preliminary Experiments: Performance measures; SVM experiments with different data sets; pure rule-based experiment.

36 DEMO
