Avano, an OAI harvester for marine and aquatic sciences Fred Merceur What could be improved in OAI-PMH protocol and in repositories implementation?

Table of contents
- Main technical ideas of OAI-PMH
- Avano presentation
  - General information
  - Filtering aquatic and marine records
  - Demonstrations
- What could be improved in the OAI-PMH protocol and in repository implementations?

Main technical ideas of OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)

Definitions and concepts
- A protocol to share bibliographic records. The digital objects (documents, images, datasets…) stay inside the repositories
- Two groups of players, communicating over HTTP / XML:
  - Data providers, which run an OAI server (Open Archives, institutional repositories, commercial publishers; e.g. Aquatic Commons, OceanDocs, MBL/WHOI)
  - Service providers, or OAI harvesters, including Avano
- A simple protocol: OAI-PMH is based on major web standards (HTTP, XML, Dublin Core)

Harvesters query repositories with simple HTTP requests. There are 6 request types (verbs) that harvesters can issue:
- Identify: retrieve information about a repository (administrator, deleted-records strategy…)
- ListMetadataFormats: retrieve the metadata formats available from a repository (XML schema). All repositories must at least allow the sharing of their records in unqualified Dublin Core
- ListSets: get the optional list of sets suggested by the data provider for harvesting a selection of records (thematic sets, type of document, full text available…)
- ListIdentifiers: get the list of record identifiers available from a data provider
- GetRecord: get the complete record for the identifier sent as a parameter
- ListRecords: get a list of complete records available from a data provider

Some parameters for querying a repository
- from / until (optional): specify the date range of the records to harvest (this applies to the date of last modification, not to the date of publication)
- set (optional): specify the set of records to retrieve (thematic sets, type of document, full text available…)
- metadataPrefix (mandatory): specify the format (XML schema) in which the records must be returned. One example: metadataPrefix=oai_dc
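A request built from these verbs and parameters is simply a URL with a query string. The Python sketch below shows one way to assemble it. The base URL and the set name are hypothetical, and `build_request` is an illustrative helper, not part of Avano.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "GetRecord", "ListRecords",
}

def build_request(base_url, verb, **params):
    """Return the URL of an OAI-PMH request; parameters set to None are skipped."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    for name, value in params.items():
        if value is not None:
            # 'from' is a Python keyword, so callers pass it as 'from_'.
            query[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(query)}"

url = build_request(
    "https://repository.example.org/oai",  # hypothetical base URL
    "ListRecords",
    metadataPrefix="oai_dc",
    from_="2006-01-01",
    until="2006-12-31",
    set="fulltext",                        # hypothetical set name
)
print(url)
```

Keeping the verb check in one place means a typo in a verb name fails locally instead of producing an error response from the repository.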

Minimal OAI-compliant metadata consists of the 15 fields of unqualified Dublin Core: TITLE, CREATOR, SUBJECT, DESCRIPTION, PUBLISHER, CONTRIBUTOR, DATE, TYPE, FORMAT, IDENTIFIER, SOURCE, LANGUAGE, RELATION, COVERAGE, RIGHTS
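To make the 15 fields concrete, here is a minimal Python sketch that reads an oai_dc record into a dictionary. The sample record is invented for illustration; every field is optional and repeatable, so each value is kept as a list.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_FIELDS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Invented sample record, for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An invented title about plankton</dc:title>
  <dc:creator>Doe, J.</dc:creator>
  <dc:creator>Roe, R.</dc:creator>
  <dc:date>2006</dc:date>
</oai_dc:dc>"""

def parse_oai_dc(xml_text):
    """Map each Dublin Core field present in the record to its list of values."""
    root = ET.fromstring(xml_text)
    record = {}
    for field in DC_FIELDS:
        values = [
            element.text.strip()
            for element in root.iter(f"{{{DC_NS}}}{field}")
            if element.text and element.text.strip()
        ]
        if values:
            record[field] = values
    return record

record = parse_oai_dc(SAMPLE)
print(record)
```

Fields that are absent from the record (here, for example, type) simply do not appear in the dictionary, which is the situation a harvester has to cope with.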

Avano, a thematic OAI-PMH harvester implementation example

General information
- Avano was launched in September
- It is available at :
- Part of the system is based on the University of Illinois Open Archives Initiative Metadata Harvesting Project
- The publication web site and the filtering system are Ifremer in-house developments
- It handles marine resources but also freshwater resources (rivers, lakes, groundwater, drinking water treatment...)
- Avano harvests Open Archives, institutional repositories and a few commercial publishers (e.g. HighWire)
- When possible, if a subset is available, we only harvest records with full text
- Repositories are not loaded if there is no full-text subset and the repository contains mainly records with no full text
- Repositories are not loaded if they offer records with links to digital objects stored outside the repository server

Harvesting marine repositories
The full content of these 9 marine repositories is automatically loaded into Avano via OAI-PMH ( records). The 9 marine repositories harvested:
- ePic, Alfred Wegener Institute: 2679 records
- Aquatic Commons, IAMSLIC: 269 records
- ArchiMer, Ifremer: 2241 records
- DRS, National Institute Of Oceanography of India: 637 records
- IBSS, Institute of Biology of the Southern Seas: 181 records
- Marine & Ocean Science Plymouth: 1974 records
- OceanDocs, Africa and Latin America marine pub.: 1568 records
- Plankton*Net (AWI and Roscoff marine station): 7686 images
- WHOAS (Woods Hole): 1660 records

Harvesting non-marine repositories
[Diagram: 146 non-marine repositories are harvested via OAI-PMH into a temporary table ( records), passed through filters and manual checking ( records removed manually), then loaded into Avano (88000 records)]
Filters:
- Aquatic and marine terms or expressions (… fishery, fishes, fishing% …)
- Journal titles (… Ocean Dynamics, Ocean Engineering, Ocean Modelling, Ocean Navigator, Ocean Research …)
- Aquatic species scientific names (… abietinaria inconstans, abietinaria kincaidi, abietinaria labrata, abietinaria pacifica …)

Harvesting non-marine repositories
Records that contain an aquatic journal title, aquatic expressions or scientific names of aquatic species are automatically loaded into Avano. Avano currently uses:
- An aquatic journal title list from ASFA
- A list of scientific names of fishes from FishBase
- A list of scientific names of aquatic species from the FAO
- Several lists of scientific names of aquatic species from the NODC
If you have lists of scientific names for aquatic algae, fungi, plants, mollusks, gastropods, insects, birds or mammals, and they contain only aquatic species, please contact me!
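The keyword filtering described here can be sketched in a few lines of Python. This is not Avano's actual filtering code: the term list is a tiny invented sample, and reading a trailing `%` as a prefix wildcard (as in `fishing%`) is an assumption.

```python
import re

def compile_terms(terms):
    """Compile filter terms into one case-insensitive, whole-word regex.

    A trailing '%' is treated as a prefix wildcard (assumed meaning).
    """
    parts = []
    for term in terms:
        if term.endswith("%"):
            parts.append(re.escape(term[:-1]) + r"\w*")
        else:
            parts.append(re.escape(term))
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

def is_aquatic(record, pattern):
    """Return True if any field value of the record matches a filter term."""
    text = " ".join(value for values in record.values() for value in values)
    return pattern.search(text) is not None

# Tiny invented sample of terms, journal titles and species names.
PATTERN = compile_terms([
    "fishery", "fishes", "fishing%",
    "Ocean Dynamics", "abietinaria pacifica",
])

kept = is_aquatic({"title": ["Fishing-gear selectivity in coastal lagoons"]}, PATTERN)
dropped = is_aquatic({"title": ["A survey of grid computing"]}, PATTERN)
print(kept, dropped)
```

The word boundaries keep a term such as "fishes" from matching inside unrelated words, which limits the amount of manual checking needed afterwards.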

Limitations of the keyword filtering method
- It is a time-consuming method
- We may validate records (1 or 2%?) that don’t match any Avano subject
- We may also miss a few records from non-marine repositories (1 or 2%?), especially when:
  - The records are poor (no abstract)
  - The record is only available in the local language
- But this is the only way we found to get the 80% of Avano records that come from general repositories

Avano now contains more than records from 156 Open Archives and 4 commercial publishers

Publication year of documents available from Avano

The number of connections to Avano is increasing
[Chart: number of connections]

An international audience

Demonstrations
- Filtering module
- Public web site:

Review of one year of harvester management: what could be improved in the OAI-PMH protocol and in repository implementations?

OAI-PMH, what could be improved? Repository stability
Many repositories (10-20%?) are difficult to harvest because of poor reliability:
- Undocumented errors occur during harvesting
- HTTP timeout errors occur during harvesting
- The OAI-PMH protocol is not completely supported (some repositories can only be harvested via the ListRecords method, some others via the ListIdentifiers method, and some do not return the same number of records via the ListRecords method and via the ListIdentifiers method)
- The OAI-PMH server URL changes without notification
- …
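On the harvester side, transient failures such as timeouts can be softened with retries. The Python sketch below is one possible approach, not a description of how Avano works: `fetch` stands for any function that downloads a URL, and the retry count and delays are arbitrary.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with a growing delay on network errors.

    Socket timeouts and urllib's URLError are subclasses of OSError,
    so catching OSError covers the usual transient failures.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
            if attempt < retries - 1:
                sleep(delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    raise last_error

# Simulated repository that times out twice before answering.
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "<OAI-PMH/>"

response = fetch_with_retry(
    flaky_fetch,
    "https://repository.example.org/oai?verb=Identify",  # hypothetical URL
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(response, len(calls))
```

Retries only help with transient errors. A repository whose URL has changed or that implements only part of the protocol still needs manual attention.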

OAI-PMH, what could be improved? XML encoding, UTF-8 errors
- Many repositories deliver an incorrect XML stream or records that contain UTF-8 errors (character encoding errors)
- This is a problem for harvesters (e.g. Avano) that use XML parsers which cannot bypass these XML encoding or UTF-8 errors
- Records with UTF-8 errors are not loaded into Avano
- Repositories with XML encoding errors cannot be harvested via the ListRecords method by Avano (which is a problem when the ListIdentifiers method doesn’t work either)
- …
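One possible workaround, different from Avano's choice of skipping such records, is to clean the byte stream before handing it to the XML parser. The Python sketch below replaces invalid UTF-8 sequences and removes control characters that XML 1.0 forbids. The sample bytes are invented, and the approach assumes the response is meant to be UTF-8.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that are not allowed in XML 1.0 (tab, LF and CR are allowed).
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def parse_tolerant(raw_bytes):
    """Parse XML bytes, replacing bad UTF-8 and dropping illegal control characters."""
    text = raw_bytes.decode("utf-8", errors="replace")  # bad bytes become U+FFFD
    text = _ILLEGAL_XML.sub("", text)
    return ET.fromstring(text)

# Invented sample: a Latin-1 'é' (0xE9) and a vertical tab inside a UTF-8 stream.
raw = b"<title>Pens\xe9es sur la mer\x0b</title>"
element = parse_tolerant(raw)
print(element.text)
```

The cost is that damaged characters show up as U+FFFD in the loaded record, so a harvester may still prefer to flag such records for review.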

OAI-PMH, what could be improved? Harvesting big or slow repositories
- Big or slow repositories can take several days to harvest
- This is a problem for unreliable repositories. If one error occurs, the harvest must be restarted from the beginning (there is no way to resume from where it stopped)
- For some of these repositories, an intermediate solution is to divide the harvest by date range, but this cannot be applied every time
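Dividing a harvest by date range amounts to generating consecutive from / until pairs and issuing one request per pair. A minimal Python sketch, in which the 30-day window size is an arbitrary assumption:

```python
from datetime import date, timedelta

def date_windows(start, end, days=30):
    """Yield (from, until) ISO date pairs covering [start, end] without overlap."""
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        yield current.isoformat(), window_end.isoformat()
        current = window_end + timedelta(days=1)

windows = list(date_windows(date(2006, 1, 1), date(2006, 3, 15)))
for from_date, until_date in windows:
    print(from_date, until_date)
```

If a request fails, only the window it belongs to has to be harvested again. This relies on the repository reporting correct modification dates, which is one reason the approach does not always apply.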

OAI-PMH, what could be improved? Duplicated records
- Duplicates can appear when, for example, a publication is written in collaboration between several institutions: the publication may then be archived on each institution's server. The international deposit rate is so low, especially for life sciences, that this is not really a problem at present.
- Some national projects also aggregate a selection of IRs and re-expose the records in OAI-PMH. For example, HAL is a French national Open Archive. Some French scientific organizations use this platform to build their IR (IN2P3, INSERM…). All the records loaded in these IRs are exposed twice (via the national platform and via the IR).
- If harvester managers have not heard of these specific national projects, they may load these duplicated IRs (e.g. all IN2P3, INSERM… records are duplicated in OAIster)
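A harvester can catch part of this duplication with a simple heuristic. The Python sketch below groups records by a normalized title and publication year and keeps the first copy. The key is an assumption chosen for illustration, and the sample records are invented.

```python
import re
import unicodedata

def dedup_key(title, year):
    """Build a comparison key: accent-free, lower-case title plus year."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return text, year

def deduplicate(records):
    """Keep the first record seen for each (normalized title, year) key."""
    seen = {}
    for record in records:
        key = dedup_key(record["title"], record.get("year"))
        seen.setdefault(key, record)
    return list(seen.values())

# Invented sample: the same paper exposed by a national platform and by an IR.
records = [
    {"title": "Étude du plancton marin.", "year": 2005, "source": "national platform"},
    {"title": "Etude du plancton  marin", "year": 2005, "source": "institutional repository"},
    {"title": "Étude du plancton marin.", "year": 2006, "source": "institutional repository"},
]
unique = deduplicate(records)
print(len(unique))
```

A title-based key will occasionally merge distinct documents that share a title and year, so matches are better treated as candidates than as certainties.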

OAI-PMH, what could be improved? Deleted records
- Many repositories don’t support a mechanism (transient or persistent) that tells harvesters that a record has been deleted
- Harvesters then have to re-harvest the repositories completely (instead of using incremental harvests) to detect deleted records, which is a major problem for big, slow or unreliable repositories that need several days to be re-harvested
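When a repository does track deletions, the protocol marks the record header with status="deleted", and an incremental harvest can pick this up. A minimal Python sketch, using an invented ListIdentifiers fragment:

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Invented response fragment, for illustration only.
SAMPLE = """<ListIdentifiers xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1</identifier>
    <datestamp>2006-05-01</datestamp>
  </header>
  <header status="deleted">
    <identifier>oai:repository.example.org:2</identifier>
    <datestamp>2006-06-15</datestamp>
  </header>
</ListIdentifiers>"""

def split_headers(xml_text):
    """Return (live_ids, deleted_ids) from the record headers in a response."""
    root = ET.fromstring(xml_text)
    live, deleted = [], []
    for header in root.iter(f"{{{OAI_NS}}}header"):
        identifier = header.findtext(f"{{{OAI_NS}}}identifier")
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            live.append(identifier)
    return live, deleted

live_ids, deleted_ids = split_headers(SAMPLE)
print(live_ids, deleted_ids)
```

Without this support, the only fallback is the full re-harvest described above, followed by a comparison of the identifiers received with those already stored.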

OAI-PMH, what could be improved? Type field
- of the records available in Avano have no type field
- A few (>500) have a type field that is impossible to normalize: A1, Airticle, 8, Treball Final de Carrera…
- All these records are removed from the results if the end-user limits the query to a given type
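Normalizing the type field on the harvester side usually comes down to a hand-maintained lookup table. The Python sketch below uses a hypothetical table, and the target categories and the mapping of "Treball Final de Carrera" to a thesis are assumptions made for illustration.

```python
# Hypothetical mapping from raw type values to a small controlled list.
TYPE_MAP = {
    "article": "article",
    "journal article": "article",
    "airticle": "article",                  # misspelling found in harvested data
    "thesis": "thesis",
    "treball final de carrera": "thesis",   # assumed equivalence
}

def normalize_type(raw):
    """Return the controlled type for a raw value, or None if it is unknown."""
    if raw is None:
        return None
    return TYPE_MAP.get(raw.strip().lower())

print(normalize_type(" Airticle "), normalize_type("A1"), normalize_type(None))
```

Values such as "A1" or "8" carry no recoverable meaning without knowledge of the source repository, so they stay unmapped and the record remains invisible to type-limited queries.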

OAI-PMH, what could be improved? Publication date field
- of the records available in Avano have no publication date
- A few (>500) have a badly formatted date: "Montréal, 2000", "[196-?]"…
- All these records are removed from the result list if the end-user limits the query to a date range
- All these records are displayed at the end of the hit list if the end-user chooses to sort the hit list by date
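Some badly formatted dates still contain a usable year. The Python sketch below extracts the first plausible four-digit year and gives up otherwise. The accepted range (1500 to 2099) is an arbitrary assumption.

```python
import re

# A four-digit year between 1500 and 2099 that is not part of a longer number.
_YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")

def extract_year(value):
    """Return the first plausible year found in a date string, or None."""
    match = _YEAR.search(value)
    return int(match.group(1)) if match else None

for raw in ["Montréal, 2000", "[196-?]", "2006-10-08"]:
    print(repr(raw), extract_year(raw))
```

A value such as "[196-?]" has no complete year, so it stays unresolved and the record is still excluded from date-limited queries.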

OAI-PMH, what could be improved? Poor records
- Some repositories contain poor records (no abstract, no keywords, no author…). Others contain records that are only available in national languages.
- These records have poor visibility in harvester search engines, because harvesters only index the bibliographic data and often display their result lists sorted by rank.

OAI-PMH, what could be improved? Aggregating documentation and dataset records
- This can be a problem for harvesters if dataset records do not have the same granularity as documentation records.
- E.g. Pangaea is a publishing network for geological and environmental data. It contains thousands of records that are almost identical (only a few geographical references differ between them).

E.g. Pangaea contains 1389 almost identical records that contain the expression “color reflectance”. An end-user who wants to find the few documentation records that also contain this expression has no chance of finding them in this list of results:

OAI-PMH, what could be improved? Records without free access to the digital object: maybe the main problem!
- Many Open Archives and IRs now contain records without full text, records with pay-per-view full text (e.g. BePress/ProQuest) or records with restricted access to the full text.
- This would not be a problem if harvesters could give their end-users information about access to the full text (and offer, as an option, the possibility to filter on it). But this is not the case!
- We still have to convince scientists and end-users that Open Access is useful and/or necessary. Immediate and free access to the full text is maybe the main argument to convince them.
- In my opinion, hiding records with free full text among records with inaccessible full text is not helpful.

OAI-PMH, what could be improved? Thematic harvesting
- Thematic harvesting is supposed to be available via sets
- In practice, no repository offers a set that exactly matches the scope of Avano
- The OAI-PMH protocol does not allow harvesting only the records that belong to several sets at once. For example, it would not have been possible to harvest the “Full-Text” set and the “Marine and aquatic” set at the same time.
- This limitation led to the development of the keyword-spotting system that filters marine and aquatic records in general repositories
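Where a repository does expose suitable sets, a harvester can approximate a multi-set request by listing the identifiers of each set separately and intersecting them on its own side. A Python sketch, in which `list_identifiers` stands for any function returning the identifiers of one set; here it is backed by invented data rather than real requests.

```python
def harvest_intersection(list_identifiers, set_names):
    """Return the identifiers that belong to every one of the given sets."""
    id_sets = [set(list_identifiers(name)) for name in set_names]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)

# Invented set contents standing in for ListIdentifiers responses.
FAKE_SETS = {
    "fulltext": ["oai:x:1", "oai:x:2", "oai:x:3"],
    "marine": ["oai:x:2", "oai:x:3", "oai:x:4"],
}

both = harvest_intersection(FAKE_SETS.__getitem__, ["fulltext", "marine"])
print(sorted(both))
```

This costs one identifier listing per set plus the retrieval of the matching records, and it only helps when the repository's sets are close to the harvester's scope, which the slide above notes is rarely the case.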

Conclusion (1/2)
What do harvesters need to be able to find their place between Google and commercial bibliographic databases?
- A higher Open Access deposit rate (less than 3% in marine/aquatic sciences?) and/or more commercial publishers exposing their records in OAI-PMH, in order to cover the main part of the international scientific production
- A new version of OAI-PMH that would offer a more reliable way to harvest OA and more qualified mandatory information (date and type fields, information about access to the full text…), so that harvesters can offer more powerful and reliable search options

Conclusion (2/2)
- Please test and comment on Avano. Do not hesitate to suggest modifications!
- Check whether your repository is already harvested by Avano and, if not, please register it!
- Contact me if you have lists of scientific names for aquatic algae, fungi, plants, mollusks, gastropods, insects, birds or mammals, provided they contain only aquatic species!