1 Converting Existing Corpus to an OAI Compliant Repository J. Tang, K. Maly, and M. Zubair Department of Computer Science Old Dominion University February, 2005

2 Contents Introduction Background Overall Architecture Metadata Extraction Approach Experiments Screenshots

3 Documents SCAN & OCR Online Documents ?

4 Introduction Why we need to go further: The lack of metadata for these resources hampers their discovery and dissemination over the Web. It also hampers interoperability between them and resources from other organizations. Benefits of using metadata: Metadata helps resource discovery. A company may save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching for, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop). Metadata also helps make collections interoperable through OAI-PMH.

5 Introduction (cont.) How to get this metadata: Creating metadata manually for a large collection is expensive. It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop). These enormous costs of manual metadata creation create strong demand for automated metadata extraction tools.

6 Introduction (cont.) Our main objective is to automate the task of building an interoperable digital library, starting from a legacy collection of printed documents or of documents scanned into TIFF or PDF format:
- To develop a flexible and adaptable approach for extracting metadata from physical collections, with a focus on DTIC (Defense Technical Information Center) collections.
- To develop efficient ways of integrating OCR and extraction processes with an interoperable digital library.
- To integrate the techniques and tools developed for metadata extraction into a test bed that converts the DTIC legacy collection into an interoperable digital library framework.
- To evaluate the effectiveness of the automation process.

7 Background OAI and Digital Library Metadata Extraction Rule-based approach Machine-Learning approach Hidden Markov Model Support Vector Machine

8 Digital Library and OAI Digital Library (DL) A DL is a network accessible and searchable collection of digital information. DL provides a way to store, organize, preserve and share information. Interoperability problem DLs are usually created separately by using different technologies and different metadata schemas.

9 Open Archives Initiative (OAI) The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs. It is based on metadata harvesting: a service provider can harvest metadata from a data provider. A data provider accepts OAI-PMH requests and provides metadata over the network. A service provider issues OAI-PMH requests to get metadata and builds services on it. Each data provider can support its own metadata formats, but it has to support at least the Dublin Core (DC) metadata set.
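
As a rough illustration of the harvesting step described above, the sketch below builds the URL a service provider might send to a data provider. The verb `ListRecords` and the metadata prefix `oai_dc` are defined by OAI-PMH; the base URL and the helper name `build_oai_request` are assumptions made up for this example.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a data provider; it does not come from the slides.
BASE_URL = "http://example.org/oai"


def build_oai_request(verb, **arguments):
    """Return the HTTP GET URL for an OAI-PMH request with the given verb."""
    params = {"verb": verb}
    params.update(arguments)
    return BASE_URL + "?" + urlencode(params)


# Ask the data provider for all records in Dublin Core format.
list_records_url = build_oai_request("ListRecords", metadataPrefix="oai_dc")
print(list_records_url)
```

The data provider would answer such a request with an XML document; parsing that response is left out of this sketch.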

11 Dublin Core Metadata Set It supports 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights. All fields are optional. http://dublincore.org/documents/dces/

12 Metadata Extraction: Rule-based Basic idea: Use a set of rules, based on human observation, to define how to extract metadata. For example, a rule may be “The first line is the title”. Advantages: Straightforward to implement. No need for training. Disadvantages: Lack of adaptability (works only for similar documents). Difficult to work with a large number of features. Difficult to tune the system when errors occur, because the rules are usually fixed.

13 Metadata Extraction: Rule-based Related work: Automated labeling algorithms for biomedical document images (Kim J, 2003): Extracts metadata from the first pages of biomedical journals. Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (76 articles were used for testing). Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000): Extracts metadata from the U-Wash document corpus of 979 journal pages. Good results for some elements (e.g., page number: 90% recall and 98% precision) but poor results for others (abstract: 35% recall and 90% precision; biography: 80% recall and 35% precision).

14 Metadata Extraction: Machine-Learning Approach Basic idea: Learn the relationship between input and output from samples and make predictions for new data This approach has good adaptability but it has to be trained from samples.

15 Hidden Markov Model - general Overview: The HMM was introduced by Baum in the late 1960s. It is the dominant technology for speech recognition. It is widely used in other areas such as DNA segmentation and gene recognition. HMMs have recently been used in information extraction: address parsing (Borkar 2001, etc.), name recognition (Klein 2003, etc.), reference parsing (Borkar 2001), and metadata extraction (Seymore 1999, Freitag 2000, etc.).

16 Hidden Markov Model - general “Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in discrete-time series” -- Alan B Poritz: Hidden Markov Models: A Guided Tour, ICASSP 1988. An HMM is a probabilistic finite state automaton: it transits from state to state, it emits a symbol each time it visits a state, and the states are hidden. [Figure labels: A, B, C, D]

17 HMM - Metadata Extraction A document is a sequence of words produced by some hidden states (title, author, etc.). The parameters of the HMM are learned from samples in advance. Metadata extraction is to find the most probable sequence of states (title, author, etc.) for a given sequence of words.
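
Finding the most probable state sequence is commonly done with the Viterbi algorithm. The sketch below is a minimal version of it on a toy model; every probability, the vocabulary, and the smoothing constant `UNSEEN` are invented for illustration and are not parameters from the slides.

```python
import math

STATES = ["title", "author", "date"]

# Toy parameters, assumed for this example only.
START = {"title": 0.8, "author": 0.1, "date": 0.1}
TRANS = {
    "title": {"title": 0.7, "author": 0.25, "date": 0.05},
    "author": {"title": 0.05, "author": 0.7, "date": 0.25},
    "date": {"title": 0.05, "author": 0.05, "date": 0.9},
}
EMIT = {
    "title": {"challenges": 0.3, "in": 0.2, "building": 0.3, "federation": 0.2},
    "author": {"kurt": 0.5, "maly": 0.5},
    "date": {"2003": 1.0},
}
UNSEEN = 1e-6  # small probability for words a state never emitted in training


def viterbi(words):
    """Return the most probable state sequence for a list of words."""
    # Each column maps state -> (log probability of best path, previous state).
    columns = [{
        s: (math.log(START[s]) + math.log(EMIT[s].get(words[0], UNSEEN)), None)
        for s in STATES
    }]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(EMIT[s].get(word, UNSEEN)), prev)
        columns.append(column)
    # Backtrack from the best final state to the first word.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


words = ["challenges", "in", "building", "federation", "kurt", "maly", "2003"]
labels = viterbi(words)
print(list(zip(words, labels)))
```

Log probabilities are used so that long word sequences do not underflow.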

18 [Figure: the header “Challenges in Building Federation Services over Harvested Metadata, Kurt Maly, Mohammad Zubair, 2003” shown as a word sequence (Challenges in Building Federation … Kurt Maly … 2003 …) labeled with the hidden states title, author, and date]

19 HMM - Metadata Extraction Related work K. Seymore, A. McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction. Result: overall accuracy 90.1% was reported

20 Support Vector Machine - general Overview: The SVM was introduced by Vapnik in the late 1970s. It is now receiving increasing attention. It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc. A list of SVM applications is available at http://www.clopinet.com/isabelle/Projects/SVM/applist.html It is also used in text analysis (Joachims 1998, etc.) and metadata extraction (Han 2003).

21 Support Vector Machine - general An SVM is a binary classifier (it classifies data into two classes). It represents data with predefined features. From samples, it finds the plane with the largest margin that separates the two classes. It classifies data into the two classes based on which side of the plane they are located. [Figure: axes font size and line number, with the hyperplane and margin marked] The figure shows an SVM example that classifies a line into two classes, title and not title, by two features: font size and line number (1, 2, 3, etc.). Each dot represents a line. Red dot: title; blue dot: not title.

22 Multi-Class SVMs Combining binary SVMs into a multi-class classifier. One-vs-rest: Classes: in this class or not in this class. Positive training samples: data in this class. Negative training samples: the rest. K binary SVMs (K is the number of classes). One-vs-one: Classes: in class one or in class two. Positive training samples: data in this class. Negative training samples: data in the other class. K(K-1)/2 binary SVMs.
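
To make the two counts concrete, the sketch below only enumerates the binary problems each scheme would train; it does not train any SVM. The class names and function names are assumptions chosen for the example.

```python
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


classes = ["title", "author", "affiliation", "date"]  # K = 4
ovr = one_vs_rest_tasks(classes)
ovo = one_vs_one_tasks(classes)
print(len(ovr), "one-vs-rest classifiers")  # K
print(len(ovo), "one-vs-one classifiers")   # K(K-1)/2
```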

23 SVM - Metadata Extraction Basic idea: Classes → metadata elements. Extracting metadata from a document → classifying each line (or block) into the appropriate class. For example: extracting the document title from a document → classifying each line to see whether it is part of the title or not. Related work: Automatic Document Metadata Extraction Using Support Vector Machine (H. Han, 2003). Overall accuracy of 92.9% was reported.

24 System Architecture

25 System Architecture (cont.) Main components: Scan and OCR: Commercial OCR software is used to scan the documents. Metadata Extractor: Extracts metadata by using rules and machine-learning techniques. The extracted metadata is stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to the Dublin Core format. OAI layer: Makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata in XML as responses. Search Engine.
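
The mapping and encoding steps could look roughly like the sketch below, which turns one extracted record into an `oai_dc` XML fragment. The two namespace URIs are the standard ones for OAI Dublin Core; the local field names, the mapping `FIELD_TO_DC`, and the function name are assumptions for illustration, since the slides do not show the database schema.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from locally extracted field names to Dublin Core elements.
FIELD_TO_DC = {"title": "title", "author": "creator", "date": "date"}


def to_oai_dc(record):
    """Encode a dict of field -> list of values as an oai_dc XML string."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, values in record.items():
        dc_name = FIELD_TO_DC.get(field)
        if dc_name is None:
            continue  # skip fields that have no Dublin Core mapping here
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "title": ["Challenges in Building Federation Services over Harvested Metadata"],
    "author": ["Kurt Maly", "Mohammad Zubair"],
    "date": ["2003"],
}
dc_xml = to_oai_dc(record)
print(dc_xml)
```

A full OAI-PMH response would wrap this fragment in the protocol's record and header elements, which are omitted here.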

26 Metadata Extraction (Cont.)

27 Metadata Extraction (cont.) Rule-based module: Classify documents into classes based on similarity. For each document class, create a template, i.e., a set of rules. Decoupling rules from code: a template is kept in a separate file. Benefits: Easy to extend (for a new document class, just create a template). Rules are simpler. Rules can be refined easily. [Figure: Doc1 is paired with template1, and Doc2 and Doc3 with template2; metadata extraction applies the paired template to produce metadata]
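
One possible way to keep rules out of the code is sketched below: the template is plain JSON (inlined here so the example is self-contained, but it could equally live in a separate file) and a small generic function applies it. The template format, the rule types `line` and `regex`, and all names are assumptions; the slides do not specify the actual template syntax.

```python
import json
import re

# Hypothetical template for one document class.
TEMPLATE_JSON = """
{
  "document_class": "example_layout",
  "rules": [
    {"field": "title", "type": "line", "index": 0},
    {"field": "author", "type": "line", "index": 1},
    {"field": "date", "type": "regex", "pattern": "(19|20)[0-9]{2}"}
  ]
}
"""


def apply_template(template, lines):
    """Extract metadata from the lines of a document using a template's rules."""
    metadata = {}
    for rule in template["rules"]:
        if rule["type"] == "line":
            # Rule of the form "line number N is field F".
            if rule["index"] < len(lines):
                metadata[rule["field"]] = lines[rule["index"]].strip()
        elif rule["type"] == "regex":
            # Rule of the form "the first match of this pattern is field F".
            for line in lines:
                match = re.search(rule["pattern"], line)
                if match:
                    metadata[rule["field"]] = match.group(0)
                    break
    return metadata


template = json.loads(TEMPLATE_JSON)
lines = [
    "Challenges in Building Federation Services over Harvested Metadata",
    "Kurt Maly, Mohammad Zubair",
    "2003",
]
extracted = apply_template(template, lines)
print(extracted)
```

Supporting a new document class then means writing another JSON template, with no change to `apply_template`.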

28 Metadata Extraction (cont.) Machine-learning module: SVM with HMM. SVM is good at working with a large number of features but is not good at capturing correlated features (e.g., a section before an author section is most probably a title section). HMM is good at working with events in a sequence but is expensive when handling a large number of features. Integration: SVM works with a large number of features to produce probabilistic results (e.g., title 54%, author 30%, abstract 16%). HMM works with the results from SVM and the probabilities of transiting from one metadata element to another to produce the final results.
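
The slides do not give the exact formulation of this integration. One plausible reading, sketched below, is to use the per-line class probabilities from the SVM in place of HMM emission probabilities and then run Viterbi decoding with the transition probabilities. The first line's probabilities are the ones quoted above; all other numbers and names are invented for the example.

```python
# Per-line class probabilities, as an SVM stage might produce them.
LINE_PROBS = [
    {"title": 0.54, "author": 0.30, "abstract": 0.16},
    {"title": 0.40, "author": 0.35, "abstract": 0.25},
    {"title": 0.10, "author": 0.20, "abstract": 0.70},
]

# Assumed start and transition probabilities between metadata elements.
START = {"title": 0.8, "author": 0.15, "abstract": 0.05}
TRANS = {
    "title": {"title": 0.5, "author": 0.45, "abstract": 0.05},
    "author": {"title": 0.05, "author": 0.45, "abstract": 0.5},
    "abstract": {"title": 0.02, "author": 0.03, "abstract": 0.95},
}


def decode(line_probs, start, trans):
    """Viterbi decoding with SVM probabilities used as emission scores."""
    states = list(start)
    # score[s]: probability of the best label sequence ending in state s.
    score = {s: start[s] * line_probs[0][s] for s in states}
    paths = {s: [s] for s in states}
    for probs in line_probs[1:]:
        new_score, new_paths = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] * trans[p][s])
            new_score[s] = score[prev] * trans[prev][s] * probs[s]
            new_paths[s] = paths[prev] + [s]
        score, paths = new_score, new_paths
    best = max(states, key=lambda s: score[s])
    return paths[best]


# Labeling each line independently picks "title" twice in a row.
svm_only = [max(probs, key=probs.get) for probs in LINE_PROBS]
# Adding transition probabilities favors title followed by author.
hmm_path = decode(LINE_PROBS, START, TRANS)
print(svm_only)
print(hmm_path)
```

In this toy case the second line is ambiguous for the SVM alone, and the transition probabilities tip it toward author.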

29 Metadata Extraction Approach (Cont.) Integrating the rule-based approach with the machine-learning approach: The purpose of integrating the machine-learning approach with our rule-based approach is to overcome two drawbacks of the rule-based system: lack of auto-correction ability and lack of statistical foundations. Integrate the results from the two modules directly.

30 Experiments Performance Measures SVM Experiments with different data sets Pure rule-based experiment

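31 Performance Measure For an individual metadata element: Precision = TT/(TT+FT), Recall = TT/(TT+TF), Accuracy = (TT+FF)/(TT+TF+FT+FF). Overall accuracy is the ratio of the number of data items classified correctly to the total number of data items. The counts come from a 2x2 table of original versus classified labels: TT = originally in class and classified in class; TF = originally in class but classified not in class; FT = originally not in class but classified in class; FF = originally not in class and classified not in class.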
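
The three formulas translate directly into code. In the sketch below the function name and the example counts are made up; only the formulas come from the slide.

```python
def performance(tt, tf, ft, ff):
    """Return (precision, recall, accuracy) from the four counts on the slide."""
    precision = tt / (tt + ft)
    recall = tt / (tt + tf)
    accuracy = (tt + ff) / (tt + tf + ft + ff)
    return precision, recall, accuracy


# Invented counts for one metadata element, e.g. "title", over 20 lines.
precision, recall, accuracy = performance(tt=8, tf=2, ft=1, ff=9)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```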

32 SVM Experiments with different data sets Objective: Evaluate the performance of SVM on different data sets to see how well SVM works for metadata extraction. Data Sets: Data Set 1: Seymore935. Downloaded from http://www-2.cs.cmu.edu/~kseymore/ie.html 935 manually tagged document headers. The first 500 are used for training and the rest for testing. Data Set 2: DTIC100. Selected 100 PDF files from the DTIC website based on the Z39.18 standard. OCRed the first pages and converted them to text format. Manually tagged these 100 document headers. The first 75 are used for training and the rest for testing.

33 SVM Experiments with different data sets Data Set 3: DTIC33. A subset of DTIC100. 33 tagged document headers with identical layout. The first 24 are used for training and the rest for testing. [Figure: the data sets ordered from less to more heterogeneous: DTIC33, Seymore935, DTIC100]

34 SVM Experiments with different data sets Overall accuracy of title, author, affiliation and date

35 Pure rule-based experiment Objective: Evaluate the performance of our rule-based approach (defining a template for each class). Experiment: Use data set DTIC100: 100 XML files with font size and bold information. The set is divided into 7 classes according to layout information. For each class, a template is developed after checking the first one or two documents in the class. The template is then applied to the remaining documents to get performance data.

36 Pure rule-based experiment

37 Pure rule-based experiment

38 Screenshots – OAI

39 Screenshots – Search Engine

40 Thanks

41 DTIC Samples

42 What does it mean to make an existing digital library OAI-enabled? [Figure: digital library storage with an OAI layer] Exposing metadata to OAI service providers: DC and parallel metadata sets. ONLY METADATA.

43 OAI Request and OAI Response - The OAI request for metadata is embedded in HTTP. - The OAI response to an OAI request is encoded in XML. - The XML Schema specification for the OAI response is provided in the OAI-PMH document. RCDL 2003, St. Petersburg

44 OAI Mechanics The request is encoded in HTTP. The response is encoded in XML. XML Schemas for the responses are defined in the OAI-PMH document. Courtesy: Michael Nelson

45 Hidden Markov Model - general A simple example: tossing coins. Your friend in a room is tossing three coins (three states). You are outside the room and cannot see which coin he tossed (the states are hidden). You are shown the tossing result, a sequence of heads and tails, for example HTTHHHTT… (the observation symbols). The tossing result is affected by: the probability of producing heads for each coin; the transition probabilities from coin to coin; which coin is tossed first.

46 Hidden Markov Model - general A Hidden Markov Model consists of: a set of hidden states (e.g., coin1, coin2, coin3); a set of observation symbols (e.g., H and T); transition probabilities: the probabilities of moving from one state to another; emission probabilities: the probability of emitting each symbol in each state; initial probabilities: the probability of each state being chosen as the first state.

47 Hidden Markov Model - general Uses associated with HMM: Evaluation: Consider the problem where we have a number of HMMs describing different systems, and a sequence of observations. We may want to know which HMM most probably generated the given sequence. Decoding: Finding the most probable sequence of hidden states given some observations. Learning: Generating an HMM from a sequence of observations. Information in this slide comes from http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html
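
For the evaluation problem, the usual tool is the forward algorithm, which sums over all hidden state sequences to get the probability of an observation sequence under one model. The sketch below applies it to a three-coin model; all probabilities are invented for the example.

```python
from itertools import product

STATES = ["coin1", "coin2", "coin3"]

# Assumed parameters of a three-coin HMM, for illustration only.
INITIAL = {"coin1": 0.5, "coin2": 0.3, "coin3": 0.2}
TRANSITION = {
    "coin1": {"coin1": 0.6, "coin2": 0.3, "coin3": 0.1},
    "coin2": {"coin1": 0.2, "coin2": 0.5, "coin3": 0.3},
    "coin3": {"coin1": 0.3, "coin2": 0.3, "coin3": 0.4},
}
EMISSION = {
    "coin1": {"H": 0.5, "T": 0.5},
    "coin2": {"H": 0.9, "T": 0.1},
    "coin3": {"H": 0.2, "T": 0.8},
}


def sequence_probability(observations):
    """Forward algorithm: probability that the model generates the observations."""
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        alpha = {
            s: sum(alpha[p] * TRANSITION[p][s] for p in STATES) * EMISSION[s][symbol]
            for s in STATES
        }
    return sum(alpha.values())


p_example = sequence_probability("HTTHHHTT")
# Sanity check: the probabilities of all length-3 sequences add up to 1.
total = sum(sequence_probability(seq) for seq in product("HT", repeat=3))
print(p_example, total)
```

To choose among several HMMs, the same function would be evaluated under each model's parameters and the highest probability selected.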

48 Support Vector Machine - general Many decision boundaries can separate these two classes. Which one should we choose? [Figure: Class 1 and Class 2 with several candidate decision boundaries] Courtesy: Martin Law

49 Support Vector Machine - general Basic idea: Choose the decision boundary that separates the two classes with the largest margin. [Figure: Class 1 and Class 2 separated by a hyperplane, with the margin and support vectors marked]

50 SVM Experiments with different data sets

53 SVM with different feature sets

57 HMM experiment result Data Set: Seymore935 One state per field (tag) Using the first 500 for training and the rest for test Experimental Result Overall accuracy=93.0%
