
The GBIF Information Architecture Technological integration at the global level Donald Hobern GBIF Program Officer for Data Access and Database Interoperability.


1 The GBIF Information Architecture: Technological integration at the global level
Donald Hobern, GBIF Program Officer for Data Access and Database Interoperability
March 2005

2 GBIF and TDWG
GBIF - Global Biodiversity Information Facility
  - Megascience activity involving 43 countries/economies and 28 international organisations
  - Secretariat based in Copenhagen, Denmark
  Mission
  - Free and universal access to the world's biodiversity data via the Internet
  - Sharing primary biodiversity data for society, science and a sustainable future
  Products
  - Registry of biodiversity data resources
  - Index of biodiversity data
  - Software tools
  - Web portals (http://www.gbif.net) and data services
TDWG - Taxonomic Databases Working Group
  - Not-for-profit scientific and educational association
  - Affiliated to the International Union of Biological Sciences
  Mission
  - To provide an international forum for biological data projects
  - To develop and promote the use of standards
  - To facilitate data exchange
  Products
  - Standards/guidelines for recording/exchanging data about organisms
  - Promotion of use of these standards
  - Forum for discussion (especially annual meeting)

3 Biodiversity data
Taxonomic Names
  - Species: Ostrinia nubilalis (Hübner, 1796)
  - Genus: Ostrinia Hübner, 1825
  - Family: Pyralidae
  - Order: Lepidoptera
  - Class: Insecta
  - Synonym: Pyralis nubilalis Hübner, 1796
  - Vernacular (EN): European Corn-borer
  - Vernacular (FR): Pyrale du maïs
  - Vernacular (ES): Piral del maíz
  - Vernacular (DE): Maiszünsler
Taxonomic Descriptions
  - Diagnosis: Wingspan 26-30mm; sexually dimorphic; male: forewings ochreous to dark brown; female: forewings pale yellow; …
Ecological Interactions
  - Foodplant: Zea mays L. 1753
  - Family: Gramineae
Specimens and Observations
  - Collection: DGH Lepidoptera
  - Record id: DGHEUR_003217
  - Country: France
  - Coordinates: 03.047˚E 48.730˚N
  - Date: 28 June 2003
  - Collector: Donald Hobern
Sequence Data
  - Locus: AAL35331
  - Definition: acyl-CoA Z/E11 desaturase
  - 1 mvpyattadg hpekdecfed...
Abiotic Data
  - Average Rainfall
  - Location: 48.82°N 2.29°E
  - Jan Feb Mar Apr...: 182.3 120.6 158.1 204.9...
Digital Literature and Web Resources
  - Pheromones of Ostrinia: http://www.nysaes.cornell.edu/fst/faculty/acree/pheronet/phlist/ostrinia.html

4 Data resources
Resources vary in interoperability and scope for dynamic integration.
Images, audio, etc.
  - Require text (metadata) for interpretation
  - Very hard for software to interpret automatically
  - (Normally) link to resource as a unit (e.g. via URL in web page)
Unstructured text documents (including e.g. PDF, Word, HTML)
  - Can be indexed for full-text search (Google)
  - Hard for software to interpret automatically
  - Normally link to document as a unit (e.g. URL in web page)
Structured XML documents
  - Can be indexed for full-text search
  - Can be parsed and interpreted by software tools (for known schemas)
  - Can link to document as a unit or to fragments of a document
  - Can use structure to develop highly-specific searches
  - Normally process whole documents (not random-access)
Databases
  - Can be exposed as structured XML documents
  - Can process any subset of the data (random-access selection of specific records)

5 Structured XML data
Observation record formatted using the Darwin Core (element values, in document order):
  2003-06-08; DGH; DGH Lepidoptera; DGHEUR_0002976; Dichomeris marginella (Fabricius, 1781); O
  Animalia; Lepidoptera; Gelechiidae; Dichomeris marginella (Fabricius, 1781)
  Donald Hobern; 2003 06 08
  Europe; Denmark; Københavns Amt; Merianvej, Hellerup; 12.538 55.737; 100
  1 1; in Skinner trap
(Calendar graphic: June 2003)
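A minimal sketch of how a record like the one on this slide could be written out as XML. The namespace URI is the Darwin Core one quoted on the DiGIR slide, and darwin:ScientificName is named on the nested-networks slide; every other element name here (InstitutionCode, CollectionCode, CatalogNumber, Country, Longitude, Latitude, Collector) and the enclosing record element are assumptions made for illustration, not taken from the slide.

```python
import xml.etree.ElementTree as ET

# Namespace URI as quoted on the DiGIR protocol slide.
DARWIN_NS = "http://digir.net/schema/conceptual/darwin/2003/1.0"


def build_record(fields):
    """Wrap a dict of field values in a <record> element (element names assumed)."""
    ET.register_namespace("darwin", DARWIN_NS)
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DARWIN_NS}}}{name}")
        child.text = str(value)
    return record


# Values copied from the slide; the element names are illustrative assumptions.
fields = {
    "InstitutionCode": "DGH",
    "CollectionCode": "DGH Lepidoptera",
    "CatalogNumber": "DGHEUR_0002976",
    "ScientificName": "Dichomeris marginella (Fabricius, 1781)",
    "Collector": "Donald Hobern",
    "Country": "Denmark",
    "Longitude": "12.538",
    "Latitude": "55.737",
}

record = build_record(fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because each value sits in a named element, software that knows the schema can select, index, or validate individual fields rather than treating the record as opaque text.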

6 Some TDWG Data Standards
Darwin Core
  - Simple XML data model to represent taxon occurrence records (only core attributes)
  - Extensions to handle e.g. curation details, microbial data, image data
ABCD Schema - Access to Biological Collection Data
  - More complex XML data model to represent collection or observation data
  - Detailed document structure including features for different communities
DiGIR - Distributed Generic Information Retrieval
  - XML protocol for searching remote data resources
  - Suitable for use with a wide range of different data models
BioCASe Protocol
  - XML protocol for searching remote data resources with more complex schema (e.g. ABCD)
  - Derived from DiGIR; new unified DiGIR/BioCASe protocol being developed ("TAPIR")
Taxon Concept Schema
  - XML data model currently under development for exchange of nomenclatural/taxonomic data
  - First version to be used for implementation in 2005
SDD Schema - Structured Descriptive Data
  - XML data model for descriptive data relating to taxa or specimens (highly generalised)
  - Suitable for representation of character tables, diagnostic keys, etc.

7 DiGIR Protocol
Request
  <request xmlns="http://digir.net/schema/protocol/2003/1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:darwin="http://digir.net/schema/conceptual/darwin/2003/1.0">
  1.0.0; 2004-07-20T09:56:00-0500; 127.0.0.1; http://any.net/digir/DiGIR.php; search
  Thomas, J.; Rubus; true
Response
  $Revision: 1.96 $; 2004-07-20T09:56:03-0500; http://any.net/digir/DiGIR.php; 127.0.0.1; search
  INS COL 27694 Rubus brasiliensis -23.6625 -46.7738
  INS COL 27907 Rubus brasiliensis -23.412 -46.7899
  INS COL 27908 Rubus urticaefolius -23.412 -46.7899
  3600; 191,1,1; 26; 3; false
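The element tags of the request did not survive in this transcript, so the following is only a rough sketch of how a DiGIR-style search request might be assembled. The two namespace URIs and the header values come from the slide. The element names (header, version, sendTime, source, destination, type, search, filter, and, equals, count), the concept names Collector and Genus, and the helper build_search_request are assumptions for illustration and should be checked against the DiGIR protocol schema before use.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as quoted on the slide.
DIGIR_NS = "http://digir.net/schema/protocol/2003/1.0"
DARWIN_NS = "http://digir.net/schema/conceptual/darwin/2003/1.0"


def qname(namespace, tag):
    """Return the {namespace}tag form that ElementTree uses for qualified names."""
    return f"{{{namespace}}}{tag}"


def build_search_request(source, destination, send_time, criteria, want_count=True):
    """Build a search request; all element names are assumed, not confirmed."""
    ET.register_namespace("digir", DIGIR_NS)
    ET.register_namespace("darwin", DARWIN_NS)
    request = ET.Element(qname(DIGIR_NS, "request"))

    # Header: protocol version, timestamp, endpoints and request type.
    header = ET.SubElement(request, qname(DIGIR_NS, "header"))
    for tag, value in (
        ("version", "1.0.0"),
        ("sendTime", send_time),
        ("source", source),
        ("destination", destination),
        ("type", "search"),
    ):
        ET.SubElement(header, qname(DIGIR_NS, tag)).text = value

    # Filter: every criterion must match (combined under one assumed <and>).
    search = ET.SubElement(request, qname(DIGIR_NS, "search"))
    filter_el = ET.SubElement(search, qname(DIGIR_NS, "filter"))
    and_el = ET.SubElement(filter_el, qname(DIGIR_NS, "and"))
    for concept, value in criteria.items():
        equals = ET.SubElement(and_el, qname(DIGIR_NS, "equals"))
        ET.SubElement(equals, qname(DARWIN_NS, concept)).text = value

    ET.SubElement(search, qname(DIGIR_NS, "count")).text = "true" if want_count else "false"
    return request


request = build_search_request(
    source="127.0.0.1",
    destination="http://any.net/digir/DiGIR.php",
    send_time="2004-07-20T09:56:00-0500",
    criteria={"Collector": "Thomas, J.", "Genus": "Rubus"},  # concept names assumed
)
print(ET.tostring(request, encoding="unicode"))
```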

8 DiGIR-BioCASe Protocol and Nested Networks
Providers and the schemas each supports:
  - MaNIS Provider: Darwin Core, Curatorial, Taxon Occurrence
  - OBIS Provider: Darwin Core, Marine, Taxon Occurrence
  - IPGRI Banana Provider: Darwin Core, IPGRI Passport, Banana Descriptor, Taxon Occurrence
  - IPGRI Soy Bean Provider: Darwin Core, IPGRI Passport, Soy Bean Descriptor, Taxon Occurrence
  - BioCASe Provider (two shown): Darwin Core, ABCD, Taxon Occurrence
User requests:
  - Get Darwin Core records where darwin:ScientificName equals Puma concolor from any provider.
  - Get standard plant genetic resource Passport data for all crop types.
  - Get full set of Soy Bean crop descriptors.
  - Get complete ABCD documents from each BioCASe provider.
  - Get DiGIR-style records each with a set of Darwin Core descriptors and a complete ABCD Unit.

9 Data navigation
Diagram of linked data items:
  - Specimen (Catalogue Number)
  - Taxon Concept (Name)
  - Publication (Title, Year)
  - Author (Name)
  - Collector (Name)
  - Sequence Data
  - Barcode Data
  - Character Data
  - Referenced material (PDF, XML)
Relationships shown: Identified as, Synonymy Relationship, Type material

10 Integration strategies
Different approaches to integrating distributed information:
Data Warehouse
  - Bring all data together
  - Data owners can still maintain control over their data
  - Single agreed central data model
  - All data in one place, so any complex query can be constructed
Federated Network
  - Data remain with original owners
  - Data owners can manage data as best suits them
  - Agreed data models for exchange
  - Queries can become expensive as number of resources increases
Indexed Network
  - Data remain with original owners
  - Data owners can manage data as best suits them
  - Agreed data models for exchange
  - Most frequently-issued queries can be handled in index
Selection may be influenced by community culture
Network may include warehouses/indices for specific areas
Key driver should be user requirements (use cases)
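The cost difference between the federated and indexed approaches can be illustrated with a toy model. This is a sketch under simplifying assumptions: providers are in-memory objects rather than remote services, and the class Provider, the functions federated_search, build_index and indexed_search, and the record fields are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    """Stand-in for a remote data provider that counts the queries it serves."""
    name: str
    records: list
    queries_served: int = 0

    def search(self, scientific_name):
        self.queries_served += 1
        return [r for r in self.records if r["scientific_name"] == scientific_name]


def federated_search(providers, scientific_name):
    """Federated network: every query is forwarded to every provider."""
    hits = []
    for provider in providers:
        for rec in provider.search(scientific_name):
            hits.append((provider.name, rec["catalog_no"]))
    return hits


def build_index(providers):
    """Indexed network: harvest once, keeping only pointers back to the owners."""
    index = {}
    for provider in providers:
        for rec in provider.records:
            index.setdefault(rec["scientific_name"], []).append(
                (provider.name, rec["catalog_no"])
            )
    return index


def indexed_search(index, scientific_name):
    """Answer a frequent query from the central index without contacting providers."""
    return index.get(scientific_name, [])
```

In the federated case the work per query grows with the number of providers; in the indexed case the harvesting cost is paid up front and the data still remain with their owners, at the price of keeping the index current.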

11 Central services
GBIF plans to offer the following central services:
  - Service Registry: discover data resources
  - Data Index and Aggregated Data Services: find distributed data
  - Schema Repository: interpret structured data
  - Feedback Mechanisms: add value by connecting users to providers
  - Data Validation Services: report potential inconsistencies within data
  - Usage Reporting Mechanisms: recognise providers for sharing data
  - Globally Unique Identifier Services: provide persistent references to data records

12 Networked data resources
GBIF Network
Museum A
  - Specimens: Flowering Plants of Africa
  - Specimens: Proteaceae of the World
  - Taxon Names: Proteaceae of the World
Observer Network B
  - Observations: Birds of Central America
  - Observations: Butterflies of Belize
  - Checklist: Birds of Belize
Museum C
  - Specimens: Mammals of North Europe
  - Taxon Names: Mammals of the World
  - Further Links: Mammals
University D
  - Specimens: Bacteria Cultures
  - Taxon Names: Bacteria
  - Further Links: Bacteria

13 Service registry
Data Node           Type of data          Taxon             Region           Records
Museum A            Specimen/Observation  Flowering Plants  Africa           327000
                    Specimen/Observation  Proteaceae        World            23000
                    Taxonomic Names       Proteaceae        World            1500
Observer Network B  Specimen/Observation  Birds             Central America  68500
                    Specimen/Observation  Butterflies       Belize           4200
                    Name List             Birds             Belize           587
Museum C            Specimen/Observation  Mammals           North Europe     1800
                    Taxonomic Names       Mammals           World            8000
                    General Resources     Mammals           World            600
University D        Specimen/Observation  Bacteria          World            1200
                    Taxonomic Names       Bacteria          World            5000
                    General Resources     Bacteria          World            400
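As an illustration of resource discovery, the registry table above can be treated as a list of entries filtered by taxon, region, or type of data. The rows are copied from the slide; the RegistryEntry structure and the find_resources function are hypothetical and say nothing about how the actual GBIF registry is implemented or queried.

```python
from typing import NamedTuple


class RegistryEntry(NamedTuple):
    node: str
    data_type: str
    taxon: str
    region: str
    records: int


# Rows copied from the service registry slide.
REGISTRY = [
    RegistryEntry("Museum A", "Specimen/Observation", "Flowering Plants", "Africa", 327000),
    RegistryEntry("Museum A", "Specimen/Observation", "Proteaceae", "World", 23000),
    RegistryEntry("Museum A", "Taxonomic Names", "Proteaceae", "World", 1500),
    RegistryEntry("Observer Network B", "Specimen/Observation", "Birds", "Central America", 68500),
    RegistryEntry("Observer Network B", "Specimen/Observation", "Butterflies", "Belize", 4200),
    RegistryEntry("Observer Network B", "Name List", "Birds", "Belize", 587),
    RegistryEntry("Museum C", "Specimen/Observation", "Mammals", "North Europe", 1800),
    RegistryEntry("Museum C", "Taxonomic Names", "Mammals", "World", 8000),
    RegistryEntry("Museum C", "General Resources", "Mammals", "World", 600),
    RegistryEntry("University D", "Specimen/Observation", "Bacteria", "World", 1200),
    RegistryEntry("University D", "Taxonomic Names", "Bacteria", "World", 5000),
    RegistryEntry("University D", "General Resources", "Bacteria", "World", 400),
]


def find_resources(registry, taxon=None, region=None, data_type=None):
    """Return the entries matching every criterion that was supplied."""
    return [
        entry
        for entry in registry
        if (taxon is None or entry.taxon == taxon)
        and (region is None or entry.region == region)
        and (data_type is None or entry.data_type == data_type)
    ]


for entry in find_resources(REGISTRY, taxon="Mammals"):
    print(entry.node, entry.data_type, entry.records)
```

A real registry would also need to understand that, for example, Belize lies within Central America; this sketch matches region names literally.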

14 Data index
Diagram elements:
  - User requests
  - Biodiversity Data Access
  - Biodiversity Data Index
  - Taxonomic Name Service (ECAT)
  - Catalogue of Life
  - GBIF Data Nodes: Specimen Data, Observation Data, Name Lists, Links to other data

15 Data index

16 Distributed data access
Diagram layers:
  - Discovery: GBIF Service Registry, Network-specific Registries
  - Presentation & Analysis: GBIF Global Portal, Regional Portal, Thematic Portal, Software Applications
  - Indexing: GBIF Data Index, Entrez
  - Access: XML (TCS XML, ABCD DwC XML, SDD XML, TaXMLit), URL
  - Resources, structured (databases): Taxonomy / Nomenclature, Specimens, Descriptive Data, Sequences / Barcodes
  - Resources, unstructured (documents): PDF, HTML web page, Structured XML, Image
Other networks shown: OBIS, CGIAR

17 Handling of unstructured data - a possible approach
Develop set of keywords to classify web pages
  - Need standard classification of biodiversity information (probably with varying levels of granularity)
  - Register pages under { taxonName, url, keywords, audience, language, ... }
  - Sufficient immediately to construct tables of links as part of taxon-specific web pages
Develop corresponding XML tags to mark up sections of web text
  - See TaXMLit from Biologia Centrali-Americana project for related approach
  - Possible to extract sections for re-use elsewhere
  - Example: a web site requires a description of common garden birds. Find suitable text by audience and language for each taxon?
Maximise benefits from effort expended in developing content
  - All information developed in any format can be indexed and discovered
  - All resources can contribute to the global data pool
  - Can develop plug-and-play taxonomic/regional/thematic portals based on the entire network of registered data elements, structured and unstructured
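The registration tuple { taxonName, url, keywords, audience, language, ... } can be sketched as a small data structure with a lookup that builds a table of links for one taxon. The field names follow the slide; the class PageRegistration, the function find_pages, the keyword vocabulary, and the example.org URLs are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRegistration:
    """One registered web page, using the fields listed on the slide."""
    taxon_name: str
    url: str
    keywords: frozenset
    audience: str
    language: str


def find_pages(pages, taxon_name, keyword=None, audience=None, language=None):
    """Return registered pages for a taxon, optionally narrowed by other fields."""
    return [
        page
        for page in pages
        if page.taxon_name == taxon_name
        and (keyword is None or keyword in page.keywords)
        and (audience is None or page.audience == audience)
        and (language is None or page.language == language)
    ]


# Hypothetical registrations; URLs and keyword values are placeholders.
PAGES = [
    PageRegistration("Turdus merula", "http://example.org/blackbird-en",
                     frozenset({"description", "image"}), "general", "en"),
    PageRegistration("Turdus merula", "http://example.org/amsel-de",
                     frozenset({"description"}), "general", "de"),
    PageRegistration("Turdus merula", "http://example.org/blackbird-moult",
                     frozenset({"description", "morphology"}), "specialist", "en"),
]

for page in find_pages(PAGES, "Turdus merula", keyword="description",
                       audience="general", language="en"):
    print(page.url)
```

This corresponds to the garden-birds example on the slide: for each taxon, select text suited to a given audience and language.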

18 Data validation
Diagram elements: Administrator, Specimen, Researcher, Data Input, Database, Web Server, Internet, Indexing Service, Validation Service
Validation service:
  - Extensible framework for checking XML data
  - Basic checks for XML validity (e.g. UTF-8)
  - Validation against schema
  - Checks that element content has expected structure or uses expected values
  - Geospatial checks: coordinates in expected countries or on land / in sea
  - Taxonomic checks: is name known? does it appear in referenced authority list? is higher taxonomy consistent?
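One way to picture an extensible validation framework is as a list of independent check functions applied to each record. The sketch below covers three of the checks named on the slide: XML well-formedness, expected values, and a coarse geospatial test. The element names (Country, Longitude, Latitude) and the bounding box for Denmark are rough assumptions for illustration; a real service would use proper country boundaries and would also validate against the schema.

```python
import xml.etree.ElementTree as ET

# Very rough bounding boxes as (min_lon, min_lat, max_lon, max_lat); illustrative only.
COUNTRY_BOXES = {"Denmark": (8.0, 54.5, 15.5, 58.0)}


def _coordinates(record):
    """Return (longitude, latitude) as floats, or None if missing or non-numeric."""
    try:
        return float(record["Longitude"]), float(record["Latitude"])
    except (KeyError, ValueError):
        return None


def check_coordinate_ranges(record):
    """Expected-value check: coordinates must be numeric and within global limits."""
    coords = _coordinates(record)
    if coords is None:
        return ["missing or non-numeric coordinates"]
    lon, lat = coords
    problems = []
    if not -180 <= lon <= 180:
        problems.append(f"longitude {lon} out of range")
    if not -90 <= lat <= 90:
        problems.append(f"latitude {lat} out of range")
    return problems


def check_country_box(record):
    """Geospatial check: coordinates should fall inside the stated country's box."""
    coords = _coordinates(record)
    box = COUNTRY_BOXES.get(record.get("Country", ""))
    if coords is None or box is None:
        return []  # nothing to compare against
    lon, lat = coords
    min_lon, min_lat, max_lon, max_lat = box
    if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
        return []
    return [f"coordinates ({lon}, {lat}) fall outside {record['Country']}"]


DEFAULT_CHECKS = [check_coordinate_ranges, check_country_box]


def validate(xml_text, checks=DEFAULT_CHECKS):
    """Parse one flat XML record and run every registered check against it."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    record = {child.tag: (child.text or "").strip() for child in root}
    problems = []
    for check in checks:
        problems.extend(check(record))
    return problems
```

New checks, such as a lookup of the scientific name against an authority list, can be added by appending another function to the list passed as checks. A record with longitude and latitude swapped passes the range check but is caught by the country check.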

19 Schema repository - version history
Schema A: versions 1.0, 1.1
Schema B: versions 1.0, 1.1, 1.2
  - Structured store of data models for each schema
  - Modelling of version history for each schema
  - Single location from which to generate new machine-readable formats
  - Should such a tool also provide an environment in which to develop new data models?
Generated formats: .xsd, .dtd, .xsd, .???, .dtd
Stored for each model: Elements, Datatypes, Cardinality, Enumerated values, Annotations

20 Schema repository - documentation
Schema A: versions 1.0, 1.1
Schema B: versions 1.0, 1.1, 1.2
  - Automated generation of standardised HTML, PDF and other human-readable documentation for each standard
  - Can easily generate documentation e.g. of inter-version differences

21 Schema repository - conceptual mappings
Schema A: versions 1.0, 1.1
Schema B: versions 1.0, 1.1, 1.2
  - Identify relationships between elements in different versions of the same schema or different schemas (need interfaces to allow these linkages to be defined)
  - Generate software products to automate transformation between different versions and schemas
  - How should we handle different content models for related concepts?
  - Should mappings simply be between different imported schemas, or should they all be made against some central set of concepts?
Generated transformation products: .xslt, .pl ?, .java ?
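A small sketch of what a generated transformation could look like once element relationships between two schema versions have been recorded. The mapping table, the element names on both sides, and the function transform are hypothetical; they do not describe any real schema version.

```python
# Hypothetical mapping from elements of "Schema A 1.0" to "Schema A 1.1".
A_1_0_TO_1_1 = {
    "Collector": "CollectorName",
    "Lat": "DecimalLatitude",
    "Long": "DecimalLongitude",
}


def transform(record, mapping):
    """Rename elements according to the mapping and report those with no target."""
    converted = {}
    unmapped = []
    for element, value in record.items():
        target = mapping.get(element)
        if target is None:
            unmapped.append(element)
        else:
            converted[target] = value
    return converted, unmapped


old_record = {"Collector": "Donald Hobern", "Lat": "55.737", "Long": "12.538", "Notes": "in Skinner trap"}
new_record, unmapped = transform(old_record, A_1_0_TO_1_1)
print(new_record)
print("No mapping defined for:", unmapped)
```

A pure renaming table like this cannot express the harder case raised on the slide, where related concepts have different content models. Mapping every schema to a central set of concepts would mean one table per schema rather than one per pair of schemas.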

22 Problems to solve - data served from multiple locations
  - Different networks of providers will serve up the same records through different routes, and software tools will not be able to detect all duplicate records
  - Aggregator nodes may in many cases themselves be the sole provider for other important records
  - The same situation will also occur e.g. if a taxonomist serves data for all specimens from many institutions for a single genus, including some that may also have been digitised by the institutions themselves
  - No reliable mechanism to relate Agg:MusA/CollX to MusA:CollX or Agg:MusA/CollX:1001 to MusA:CollX:1001

Museum A DiGIR provider
Inst  Coll   CatNo
MusA  CollX  1001
MusA  CollX  1002

Museum B DiGIR provider
Inst  Coll   CatNo
MusB  CollY  202-1
MusB  CollY  465-2

Aggregator DiGIR provider (import by aggregator)
Inst  Coll        CatNo
Agg   MusA/CollX  1001
Agg   MusA/CollX  1002
Agg   MiscOther   1000
Agg   MiscOther   1001

Global Provider Registry
Key  Service Name
1    Museum A DiGIR
2    Museum B DiGIR
3    Aggregator DiGIR

Global Data Index
Service  Inst  Coll        CatNo
1        MusA  CollX       1001
1        MusA  CollX       1002
2        MusB  CollY       202-1
2        MusB  CollY       465-2
3        Agg   MusA/CollX  1001
3        Agg   MusA/CollX  1002
3        Agg   MiscOther   1000
3        Agg   MiscOther   1001
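The duplicate problem can be reproduced with the Global Data Index rows from this slide. In the sketch below, matching on the (Inst, Coll, CatNo) triple finds no duplicates, because the aggregator re-labels the records it imports. A second function guesses at the aggregator's "Inst/Coll" naming pattern; that pattern is an assumption drawn only from the sample rows, which is exactly why the slide calls the mechanism unreliable. Function names are hypothetical.

```python
from collections import defaultdict

# (service key, Inst, Coll, CatNo) rows copied from the Global Data Index on the slide.
INDEX = [
    (1, "MusA", "CollX", "1001"),
    (1, "MusA", "CollX", "1002"),
    (2, "MusB", "CollY", "202-1"),
    (2, "MusB", "CollY", "465-2"),
    (3, "Agg", "MusA/CollX", "1001"),
    (3, "Agg", "MusA/CollX", "1002"),
    (3, "Agg", "MiscOther", "1000"),
    (3, "Agg", "MiscOther", "1001"),
]


def exact_duplicates(index):
    """Group rows by their literal (Inst, Coll, CatNo) triple."""
    groups = defaultdict(list)
    for service, inst, coll, cat_no in index:
        groups[(inst, coll, cat_no)].append(service)
    return {key: services for key, services in groups.items() if len(services) > 1}


def heuristic_duplicates(index):
    """Guess that a collection code like 'MusA/CollX' encodes the original Inst/Coll."""
    groups = defaultdict(list)
    for service, inst, coll, cat_no in index:
        if "/" in coll:
            # Assumed naming convention; nothing guarantees aggregators follow it.
            inst, coll = coll.split("/", 1)
        groups[(inst, coll, cat_no)].append(service)
    return {key: services for key, services in groups.items() if len(services) > 1}


print(exact_duplicates(INDEX))
print(heuristic_duplicates(INDEX))
```

The heuristic happens to work on these rows but breaks as soon as an aggregator uses another convention or a collection code legitimately contains a slash, which motivates shared identifiers assigned to the records themselves.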

23 Problems to solve - referring to data from outside network
  - Data records have no stable URLs on the network (even using constructed DiGIR queries)
  - Provider endpoints may change, and specimens may move to different institutions
  - There needs to be a reliable way to reference each data record from other web sites and in external media
  - The solution needs to handle situations in which data providers move to new locations
  - It should ideally also be possible to refer to a particular version of each record in case of modifications

Museum A DiGIR provider: http://www.museumA.org/digir.php
Inst  Coll   CatNo
MusA  CollX  1001
MusA  CollX  1002

Global Data Index: http://www.gbif.net/data/digir.jsp
Service  Inst  Coll   CatNo
1        MusA  CollX  1001
1        MusA  CollX  1002

External references to these records (how should they link back?):
  - Revision of the genus Aenetus. Material examined: Museum A, Collection X 1001, 1002
  - Aenetus virescens. Distribution: Auckland (see 1001), Wellington (see 1002)

24 Problems to solve - referring to taxon concepts
Taxonomy Server A
  Concept: Aus bus sensu Smith, 2003
    == Aus bus sensu Jones, 1965
    == Aus cus bus sensu Black, 1926
    > Aus bus sensu Brown, 1949
    > Aus dus sensu Brown, 1949
Taxonomy Server B
  Concept: Xus bus sensu J. Owen, 2002
    == Aus bus sensu P. Brown, 1949
  Concept: Yus dus sensu J. Owen, 2002
    == Aus dus sensu P. Brown, 1949
User: What is the relationship between the concepts from Taxonomy Server A and Taxonomy Server B?
  - No standardised way to reference taxonomic publications or taxon concepts to allow reliable machine processing of the relationships between them (is "P. Brown, 1949" the same as "Brown, 1949"?)
  - Need mechanisms to allow data providers to refer to taxon concepts in a reliable and consistent way

25 Globally unique identifiers - requirements
  - Universal reusable mechanism which can be applied to any type of biodiversity data record (specimen, collection, institution, taxon concept, taxonomic publication, image, etc.)
  - Identifiers must refer uniquely to a single data element
  - Users must be able to resolve each identifier to locate the data to which it relates
  - Identifiers must be resolvable even if data changes ownership or the server is moved to a new endpoint
  - Identifiers must be suitable for use by the wider life science community and others
  - It should be as simple as possible for researchers
      - to create new identifiers
      - to discover existing identifiers associated with data items
      - to resolve identifiers to retrieve the associated data

26 Potential model - Life Science Identifiers (LSIDs)
  - Specification for Uniform Resource Name model from EMBL, IBM, I3C and OMG
  - "A straightforward approach to naming and identifying data resources stored in multiple, distributed data stores in a manner that overcomes the limitations of naming schemes in use today"
  - All LSIDs are formed from 5 or 6 colon-delimited components: urn, lsid, the domain name of the authority for the identifier, a namespace for the identifier, the identifier for the object within the namespace, and an optional revision number
Format:
  urn:lsid:<authority>:<namespace>:<object>[:<revision>]
Example: referencing a PubMed article
  urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434
Example: referencing first version of the 1AFT protein in the Protein Data Bank
  urn:lsid:pdb.org:pdb:1AFT:1
Potential example: referencing specimen record in GBIF Network (identifiers assigned centrally)
  urn:lsid:gbif.net:Specimen:2706712
Potential example: name record from IPNI
  urn:lsid:ipni.org:TaxonName:82090-3:1.1
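The colon-delimited structure described on this slide is simple to split into its parts. The following sketch parses an LSID into authority, namespace, object identifier, and optional revision. It checks only the shape stated on the slide (5 or 6 components beginning with urn and lsid), not the full LSID specification; the names LSID and parse_lsid are hypothetical.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self):
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision is not None:
            parts.append(self.revision)
        return ":".join(parts)


def parse_lsid(text):
    """Split an LSID string into its components, raising ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-delimited components: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"identifier does not start with urn:lsid: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in {text!r}")
    return LSID(*parts[2:])


for example in (
    "urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434",
    "urn:lsid:pdb.org:pdb:1AFT:1",
    "urn:lsid:ipni.org:TaxonName:82090-3:1.1",
):
    print(parse_lsid(example))
```

Resolving an identifier to the data it names is a separate step that starts from the authority component; that step is not shown here.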

27 Web-based taxonomy?
Diagram layers:
  - Discovery: GBIF Registry
  - Presentation & Analysis: GBIF Global Portal, Thematic Portal, Taxonomic Collaboration Environment
  - Indexing: GBIF Data Index
  - Resources, structured (databases): Taxonomy / Nomenclature, Specimens, Descriptive Data, Sequences / Barcodes
  - Resources, unstructured (documents): PDF, HTML web page, Structured XML, Image
Taxonomic Collaboration Environment:
  - Content: Sequences / Barcodes, Taxonomy / Nomenclature, Specimens, Characters, Literature
  - Inputs: Network resources, Group resources, Network data, Group data, Taxonomist input
  - Output: Revision

28 What is Species Bank?
Diagram elements:
  - Resources: Taxonomy / Nomenclature, Specimens, Descriptive Data, Sequences / Barcodes, PDF, HTML web page, Structured XML, Image
  - GBIF Data Index, GBIF Data Portal, Map Service
  - Taxon Portal, Regional Portal, HTML web pages
  - User Browser / Client Tool
  - Species Bank ?

29 Links
Global Biodiversity Information Facility
  - http://www.gbif.org/ (Communications Portal)
  - http://www.gbif.net/ (Data Portal)
  - http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture (Architecture documents)
Taxonomic Databases Working Group
  - http://www.tdwg.org/ (Including access to working groups)

