Feb 21-25, 2005, ICM 2005 Mumbai. Converting Existing Corpus to an OAI Compliant Repository. J. Tang, K. Maly, and M. Zubair, Department of Computer Science.

Presentation transcript:

Feb 21-25, 2005, ICM 2005 Mumbai
Converting Existing Corpus to an OAI Compliant Repository
J. Tang, K. Maly, and M. Zubair
Department of Computer Science, Old Dominion University
February 2005

Contents
- Introduction
- Background
- Overall Architecture
- Metadata Extraction Approach
- Experiments
- Conclusion

[Figure: Documents -> Scan & OCR -> Online Documents -> ?]

Introduction
Why go further?
- The lack of metadata for these resources hampers their discovery and dissemination over the Web.
- It also hampers interoperability between these resources and those of other organizations.
Benefits of using metadata
- Metadata helps resource discovery. Using metadata in a company intranet may save about $8,200 per employee by reducing the time spent searching for, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop).
- Metadata helps make collections interoperable through OAI-PMH.

Introduction (cont.)
How to get this metadata
- Creating metadata manually for a large collection is expensive: it would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
- These enormous costs create strong demand for automated metadata extraction tools.

Introduction (cont.)
Our main objective is to automate the task of building an interoperable digital library, starting from a legacy collection of printed documents or of documents scanned into TIFF or PDF format. Specifically:
- To develop a flexible and adaptable approach for extracting metadata from physical collections, with a focus on DTIC (Defense Technical Information Center) collections.
- To develop efficient ways of integrating OCR extraction processes with an interoperable digital library.
- To integrate the metadata extraction techniques and tools into a test bed that moves the DTIC legacy collection into an interoperable digital library framework.
- To evaluate the effectiveness of the automation process.

Background
- OAI and Digital Library
- Metadata Extraction Approaches

Digital Library and OAI
Digital Library (DL)
- A DL is a network-accessible and searchable collection of digital information.
- A DL provides a way to store, organize, preserve, and share information.
Interoperability problem
- DLs are usually created separately, using different technologies and different metadata schemas.

Open Archives Initiative (OAI)
- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs.
- It is based on metadata harvesting: a service provider harvests metadata from a data provider.
- A data provider accepts OAI-PMH requests and serves metadata over the network.
- A service provider issues OAI-PMH requests to obtain metadata and builds services on it.
- Each data provider can support its own metadata formats, but it must support at least the Dublin Core (DC) metadata set.
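As a minimal sketch of the harvesting exchange described above, the following builds the kind of request URL a service provider sends to a data provider. The verbs `ListRecords` and `ListMetadataFormats` and the `metadataPrefix` value `oai_dc` come from the OAI-PMH specification; the base URL and the helper name `build_request` are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Hypothetical data-provider endpoint; substitute a real repository's base URL.
BASE_URL = "http://example.org/oai"


def build_request(verb, **arguments):
    """Return the URL for an OAI-PMH request with the given verb and arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return BASE_URL + "?" + urlencode(params)


# Ask for all records in unqualified Dublin Core.
list_records_url = build_request("ListRecords", metadataPrefix="oai_dc")
# Ask which metadata formats the data provider supports.
list_formats_url = build_request("ListMetadataFormats")

print(list_records_url)
print(list_formats_url)
```

The data provider answers each such request with an XML document, which the service provider parses to build its own services.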


Dublin Core Metadata Set
- It defines 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.
- All fields are optional.
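To make the element set concrete, here is a small sketch that encodes a record as an `oai_dc` XML element using Python's standard library. The two namespace URIs are the ones used by OAI-PMH for unqualified Dublin Core; the function name `to_oai_dc` and the dict-of-lists record layout are assumptions made for illustration, not the authors' implementation.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_oai_dc(record):
    """Encode a dict of Dublin Core fields as an oai_dc XML element.

    Every element is optional and may repeat, so values are lists and
    missing keys are simply skipped.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    for name in DC_ELEMENTS:
        for value in record.get(name, []):
            child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
            child.text = value
    return root


# Example record built from this presentation's own title slide.
example = {
    "title": ["Converting Existing Corpus to an OAI Compliant Repository"],
    "creator": ["J. Tang", "K. Maly", "M. Zubair"],
}
xml_text = ET.tostring(to_oai_dc(example), encoding="unicode")
print(xml_text)
```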

Metadata Extraction: Rule-based
Basic idea: use a set of rules, derived from human observation, that define how to extract metadata. For example, a rule may be "The first line is the title".
Advantages
- Straightforward to implement
- No training needed
Disadvantages
- Lack of adaptability (works only for similar documents)
- Difficult to handle a large number of features
- Difficult to tune when errors occur, because the rules are usually fixed

Metadata Extraction: Rule-based (related work)
- Automated labeling algorithms for biomedical document images (Kim J, 2003)
  - Extracts metadata from the first pages of biomedical journals.
  - Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (tested on 76 articles).
- Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000)
  - Extracts metadata from the U-Wash document corpus of 979 journal pages.
  - Good results for some elements (page number: 90% recall, 98% precision) but poor results for others (abstract: 35% recall, 90% precision; biography: 80% recall, 35% precision).

Metadata Extraction: Machine-Learning Approach
- Basic idea: learn the relationship between input and output from samples, then make predictions for new data.
- This approach adapts well, but it has to be trained on samples.

System Architecture

System Architecture (cont.)
Main components:
- Scan and OCR: commercial OCR software is used to scan the documents.
- Metadata Extractor: extracts metadata using rules and machine-learning techniques. The extracted metadata is stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to the Dublin Core format.
- OAI layer: makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata as XML in its responses.
- Search Engine

Metadata Extraction (cont.)

Metadata Extraction (cont.)
Rule-based module
- Classify documents into classes based on similarity.
- For each document class, create a template, that is, a set of rules.
- Decouple rules from code: each template is kept in a separate file.
Benefits
- Easy to extend: for a new document class, just create a template.
- Rules are simpler.
- Rules can be refined easily.
[Figure: Doc1 uses template1; Doc2 and Doc3 use template2; the Metadata Extraction module applies each document's template to produce metadata.]
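The idea of decoupling rules from code can be sketched as follows. The template format (JSON with `line` and `regex` rules), the field names, and the function `extract` are all hypothetical, since the slides do not show the actual template syntax; the point is only that a generic engine reads rules as data, so a new document class needs a new template rather than new code.

```python
import json
import re

# Hypothetical template for one document class. In practice this text would
# live in its own file, so rules can be refined without touching the code.
TEMPLATE_JSON = """
{
  "title":  {"line": 0},
  "author": {"line": 1},
  "date":   {"regex": "[A-Z][a-z]+,? [0-9]{4}"}
}
"""


def extract(lines, template):
    """Apply every rule in the template to the text lines of one document."""
    metadata = {}
    for field, rule in template.items():
        if "line" in rule:
            # Positional rule: the field is the N-th line of the page.
            index = rule["line"]
            if index < len(lines):
                metadata[field] = lines[index].strip()
        elif "regex" in rule:
            # Pattern rule: the field is the first match found on any line.
            pattern = re.compile(rule["regex"])
            for line in lines:
                match = pattern.search(line)
                if match:
                    metadata[field] = match.group(0)
                    break
    return metadata


template = json.loads(TEMPLATE_JSON)
page = [
    "Converting Existing Corpus to an OAI Compliant Repository",
    "J. Tang, K. Maly, and M. Zubair",
    "Department of Computer Science",
    "Old Dominion University",
    "February, 2005",
]
result = extract(page, template)
print(result)
```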

Metadata Extraction (cont.)
Machine-learning module: SVM with HMM
- SVM is good at working with a large number of features but not at capturing correlations between neighboring sections; for example, the section before an author section is most likely a title section.
- HMM is good at working with events in a sequence but becomes expensive with a large number of features.
Integration
- SVM works with a large number of features to produce probabilistic results (for example, title 54%, author 30%, abstract 16%).
- HMM combines the SVM results with the transition probabilities from one metadata element to another to produce the final results.
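One way to picture this integration is a Viterbi decoding in which the per-section SVM probabilities play the role of emission scores. This is a simplified sketch, not the authors' implementation: only the first probability triple (54%, 30%, 16%) comes from the slide, while the transition table, the start probabilities, and the remaining SVM scores are made up for illustration.

```python
LABELS = ["title", "author", "abstract"]

# Made-up transition probabilities P(next label | current label).
TRANSITIONS = {
    "title":    {"title": 0.30, "author": 0.60, "abstract": 0.10},
    "author":   {"title": 0.05, "author": 0.45, "abstract": 0.50},
    "abstract": {"title": 0.05, "author": 0.05, "abstract": 0.90},
}
# Made-up probabilities for the label of the first section.
START = {"title": 0.8, "author": 0.1, "abstract": 0.1}


def viterbi(svm_probs):
    """Return the most probable label sequence given per-section SVM scores."""
    # best[label] = (probability, path) of the best sequence ending in label.
    best = {l: (START[l] * svm_probs[0][l], [l]) for l in LABELS}
    for scores in svm_probs[1:]:
        step = {}
        for cur in LABELS:
            # Pick the predecessor that maximizes the path probability.
            prob, path = max(
                ((best[prev][0] * TRANSITIONS[prev][cur], best[prev][1])
                 for prev in LABELS),
                key=lambda candidate: candidate[0],
            )
            step[cur] = (prob * scores[cur], path + [cur])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])[1]


svm_output = [
    {"title": 0.54, "author": 0.30, "abstract": 0.16},
    {"title": 0.45, "author": 0.40, "abstract": 0.15},  # SVM alone picks "title"
    {"title": 0.10, "author": 0.20, "abstract": 0.70},
]
labels = viterbi(svm_output)
svm_only = [max(scores, key=scores.get) for scores in svm_output]
print(svm_only)  # ['title', 'title', 'abstract']
print(labels)    # ['title', 'author', 'abstract']
```

In this toy run the sequence model overrides the SVM's second guess, because a title followed by an author section is more probable than two title sections in a row.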

Metadata Extraction Approach (cont.)
Integrating the rule-based approach with the machine-learning approach
Integrating the machine-learning approach with our rule-based approach is meant to overcome two drawbacks of the rule-based system:
- Lack of auto-correction ability
- Lack of statistical foundations: integrate the results from the two modules directly

Experiments
- Performance measures
- SVM experiments with different data sets
- Pure rule-based experiment

Performance Measure
For an individual metadata element:
- Precision = TT / (TT + FT)
- Recall = TT / (TT + TF)
- Accuracy = (TT + FF) / (TT + TF + FT + FF)
Overall accuracy is the ratio of the number of items classified correctly to the total number of items.

                          Classified: in class   Classified: not in class
Original: in class        TT                     TF
Original: not in class    FT                     FF
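The three measures can be computed directly from the four cells of the table. The sketch below uses the slide's own cell names; the counts in the example are hypothetical and are not results from the experiments.

```python
def precision(tt, tf, ft, ff):
    """Share of items classified as in class that really are in class."""
    return tt / (tt + ft) if (tt + ft) else 0.0


def recall(tt, tf, ft, ff):
    """Share of items really in class that were classified as in class."""
    return tt / (tt + tf) if (tt + tf) else 0.0


def accuracy(tt, tf, ft, ff):
    """Share of all items that were classified correctly."""
    total = tt + tf + ft + ff
    return (tt + ff) / total if total else 0.0


# Hypothetical counts for one metadata element, e.g. "title".
counts = {"tt": 40, "tf": 10, "ft": 5, "ff": 45}
print(round(precision(**counts), 3))  # 40 / 45 = 0.889
print(round(recall(**counts), 3))     # 40 / 50 = 0.8
print(round(accuracy(**counts), 3))   # 85 / 100 = 0.85
```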

SVM Experiments with different data sets
Objective: evaluate the performance of SVM on different data sets to see how well it works for metadata extraction.
Data sets
- Data Set 1: Seymore935
  - Downloaded, manually tagged document headers
  - The first 500 are used for training and the rest for testing
- Data Set 2: DTIC100
  - 100 PDF files selected from the DTIC website based on the Z39.18 standard
  - The first pages were OCRed and converted to text format
  - The 100 document headers were tagged manually
  - The first 75 are used for training and the rest for testing

SVM Experiments with different data sets (cont.)
- Data Set 3: DTIC33
  - A subset of the tagged DTIC document headers with identical layout
  - The first 24 are used for training and the rest for testing
[Figure: data sets ordered from less to more heterogeneous: DTIC33, Seymore935, DTIC100]

SVM Experiments with different data sets (cont.)
[Figure: overall accuracy for title, author, affiliation, and date]

Pure rule-based experiment
Objective: evaluate the performance of our rule-based approach, in which a template is defined for each document class.
Experiment
- Uses data set DTIC100: 100 XML files with font-size and bold information.
- The set is divided into 7 classes according to layout information.
- For each class, a template is developed after checking the first one or two documents in the class.
- The template is then applied to the remaining documents to obtain performance data.


Screenshots – OAI

Screenshots – Search Engine

Conclusion
- It is feasible to extract metadata with high accuracy from scanned documents of a homogeneous collection.
Future issues
- Heterogeneous collections
- Extracting the whole document structure, including complex objects