
1 Uwe Schindler, GES 2007 – May 2-4, 2007
Data Information Service based on Open Archives Initiative Protocols and Apache Lucene
Uwe Schindler (1), Benny Bräuer (2), Michael Diepenbroek (1)
(1) PANGAEA® Group at MARUM, University of Bremen, Bremen, Germany
(2) Alfred Wegener Institute for Polar and Marine Research, Bremerhaven, Germany

2 Metadata Portals & Grid
WDC-MARE with its information system PANGAEA® currently provides data portals for several EU and international projects.
Not all data are stored centrally, so all datasets provided in the portals must be consolidated from different sources!
Features:
– Data stays at the data providers
– Metadata is harvested by the portal
– Search queries are handled by the centralized catalogue (Google-like search speed!)
– The scientist gets a link to the data at the provider
Metadata portal software is sufficient for C3-Grid, too!

3 Metadata in C3-Grid
Goal: build up an infrastructure for the earth system community in Germany
Problem: we need an architecture which makes it possible to:
– Collect metadata files from data providers
– Store them in a “central index”
– Provide fast, generic access to this data for our users
Solution: Data Information Service

4 Open Archives Protocol
– The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative.
– Almost all digital libraries support it (the most famous ones: Fedora, arXiv and the CERN Document Server)
– Portals such as Scientific Commons, OAIster and SUB use it; it is also used during web crawling (if available)
– Very simple to implement (XML over HTTP, REST style)
– Repository software for database or file system metadata providers is widely available (C3 mostly uses the DLESE jOAI software on the data provider side)
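To illustrate the "XML over HTTP" point, here is a minimal Python sketch of one harvesting step: building a `ListRecords` request and reading record identifiers and the resumption token from a response. The verb, the `metadataPrefix` and `resumptionToken` arguments and the XML namespace come from the OAI-PMH 2.0 specification; the base URL, the helper names and the sample response are made up for illustration and are not taken from the C3-Grid software.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: send it without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in
           root.iterfind(".//oai:record/oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Invented, heavily shortened response; a real one also carries
# responseDate, request and metadata elements.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would repeat the request with each returned token until the response no longer contains one.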

6 Central indexing requirements
1. Open for any XML metadata format
2. Any mappings to document fields should be done by XPath
3. Possibility to map incompatible XML schemas on the fly during harvesting by XSLT
4. On-the-fly validation of (transformed) documents during harvesting
5. No relational database, only a full-text search engine that contains everything needed for operation
6. Range queries on specific fields (date/time or numeric)
7. Web service interface / programming API for the end user interface that is accessible from any language (Java/JSP, PHP, Perl, ...)
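Requirement 2 (field mappings by XPath) can be sketched in Python as follows. This is an illustration only: the namespace, element names and field names are invented, and `xml.etree.ElementTree` supports only a subset of XPath, so a production harvester would need a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and mapping; not the C3-Grid metadata schema.
NS = {"md": "http://example.org/metadata"}

FIELD_XPATHS = {
    "title": "./md:title",
    "keyword": ".//md:keyword",
    "min_lat": "./md:extent/md:south",
}


def extract_fields(xml_text, field_xpaths=FIELD_XPATHS):
    """Map one XML metadata record to {field name: [values]} via XPath."""
    root = ET.fromstring(xml_text)
    doc = {}
    for field, xpath in field_xpaths.items():
        values = [el.text.strip() for el in root.findall(xpath, NS)
                  if el.text and el.text.strip()]
        if values:
            doc[field] = values
    return doc


SAMPLE_RECORD = """<md:dataset xmlns:md="http://example.org/metadata">
  <md:title>Sea ice thickness</md:title>
  <md:keywords><md:keyword>ice</md:keyword><md:keyword>arctic</md:keyword></md:keywords>
  <md:extent><md:south>-23.2</md:south><md:north>30.4</md:north></md:extent>
</md:dataset>"""

fields = extract_fields(SAMPLE_RECORD)
```

Keeping the mapping in a plain table of field name to XPath expression means a new metadata format only needs a new table, not new code.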

7 Apache Lucene features
– Ranked searching: best results returned first
– Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries for date/time values and numbers
– Fielded searching: all fields are searchable as a whole, each field separately (e.g. for author, parameter), or mixed
– Any combination of boolean operators between search terms (AND, OR, NOT, exact phrase)
– Sorting by any field
– Multiple-index searching with merged results
– Simultaneous searching and updates due to high-performance indexing

8 Generic Framework

10 Search Interface
Supports all standard Lucene search features
Additional support for fast range queries to enable bounding boxes, etc.:
– implemented by redundant storage of “numerical terms” in different precisions
– recursive reduction of the number of distinct terms a range query has to visit (every numerical value is a term)
– search time no longer depends on index size
Accessible via Java API or AXIS web service
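The multi-precision idea can be sketched in Python as below. This shows the general technique, not the authors' implementation: it assumes non-negative integers (real latitudes would first need an order-preserving integer encoding), and the names `index_terms`, `split_range`, `matches`, `step` and `bits` are chosen for illustration. Each value is stored once per precision as a `(shift, prefix)` term; a range query then takes full-precision terms only near its two ends and coarser terms for the middle.

```python
def index_terms(value, step=4, bits=16):
    """All (shift, prefix) terms stored for one non-negative integer."""
    return {(shift, value >> shift) for shift in range(0, bits, step)}


def split_range(lo, hi, step=4, bits=16):
    """Cover [lo, hi] with (shift, first_prefix, last_prefix) sub-ranges.

    Assumes 0 <= lo <= hi < 2**bits.
    """
    ranges = []
    shift = 0
    while True:
        diff = 1 << (shift + step)
        mask = ((1 << step) - 1) << shift
        has_lower = (lo & mask) != 0      # lower end is not aligned at this level
        has_upper = (hi & mask) != mask   # upper end is not aligned at this level
        next_lo = ((lo + diff) if has_lower else lo) & ~mask
        next_hi = ((hi - diff) if has_upper else hi) & ~mask
        if shift + step >= bits or next_lo > next_hi:
            # No coarser level helps: cover the remainder at this precision.
            ranges.append((shift, lo >> shift, hi >> shift))
            break
        if has_lower:
            ranges.append((shift, lo >> shift, (lo | mask) >> shift))
        if has_upper:
            ranges.append((shift, (hi & ~mask) >> shift, hi >> shift))
        lo, hi = next_lo, next_hi
        shift += step
    return ranges


def matches(value, ranges, step=4, bits=16):
    """True if any stored term of `value` falls into one of the sub-ranges."""
    terms = dict(index_terms(value, step, bits))  # shift -> prefix
    return any(s in terms and first <= terms[s] <= last
               for s, first, last in ranges)


sub_ranges = split_range(3, 300)
# → [(0, 3, 15), (0, 288, 300), (4, 1, 17)]
```

For the range 3..300 the query visits 13 + 13 + 17 = 43 terms instead of one term for each of the 298 values, and the number of sub-ranges is bounded by the number of precision levels rather than by the number of indexed values.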

12 C3 Implementation (Fig. by T. Langhammer, ZIB)
Figure labels:
– Portal: Google-style and range queries
– DIS: web service frontend, Apache Lucene index (full-text index, document cache), harvesting backend
– Data providers: CERA, PANGAEA®, other data providers
– OAI-PMH harvesting of Metadata1.xml, Metadata2.xml, Metadata3.xml, Metadata4.xml, ...
– Indexing of selected fields
Index contents shown in the figure:
Field        Term        Document
identifier   ABC:123     2
identifier   XYZ:223     6
identifier   MI6:007     12
abstract     region      2, 23, 112
abstract     pressure    3, 23
abstract     humid       4, 33, 215
min_lat      030.4       31
min_lat      -023.2      32
data_uri     http://...  4
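The field/term/document layout in the figure is that of an inverted index. A toy Python version, using two of the example rows from the slide, shows how a fielded AND query reduces to intersecting posting lists. The class and method names are invented for illustration; Lucene's actual index structures are far more elaborate.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: (field, term) -> set of document numbers."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Index one document given as {field name: [terms]}."""
        for field, terms in fields.items():
            for term in terms:
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Sorted document numbers containing `term` in `field`."""
        return sorted(self.postings.get((field, term), ()))

    def and_query(self, *field_terms):
        """Documents matching every (field, term) pair."""
        sets = [self.postings.get(ft, set()) for ft in field_terms]
        return sorted(set.intersection(*sets)) if sets else []


idx = TinyIndex()
for doc_id in (2, 23, 112):
    idx.add(doc_id, {"abstract": ["region"]})
for doc_id in (3, 23):
    idx.add(doc_id, {"abstract": ["pressure"]})

both = idx.and_query(("abstract", "region"), ("abstract", "pressure"))
# → [23]
```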

13 Future
Figure labels: metadata of data; metadata of workflows; workflow query; data query; assemble workflow; processing

14 Thank You!
Software will be available soon as open source on Sourceforge.net!
News: http://wiki.pangaea.de/wiki/Portal

