Architecture and Standards for Global Biodiversity Informatics: A GBIF and TDWG Perspective
Donald Hobern, GBIF Program Officer for Data Access and Database Interoperability
November 2004
TDWG and GBIF
TDWG: Taxonomic Databases Working Group
  Not-for-profit scientific and educational association
  Affiliated to the International Union of Biological Sciences
  Mission
    To provide an international forum for biological data projects
    To develop and promote the use of standards
    To facilitate data exchange
  Products
    Standards/guidelines for recording/exchanging data about organisms
    Promotion of use of these standards
    Forum for discussion (especially annual meeting)
GBIF: Global Biodiversity Information Facility
  Megascience activity involving 42 countries/economies and 28 international organisations
  Secretariat based in Copenhagen, Denmark
  Mission
    Free and universal access to world’s biodiversity data via Internet
    Sharing primary biodiversity data for society, science and a sustainable future
  Products
    Registry of biodiversity data resources
    Index of biodiversity data
    Software tools
    Web portals (http://www.gbif.net) and data services
Primary biodiversity data
Taxonomic Names
  Species: Ostrinia nubilalis (Hübner, 1796)
  Synonym: Pyralis nubilalis Hübner, 1796
  Genus: Ostrinia Hübner, 1825
  Family: Pyralidae
  Order: Lepidoptera
  Class: Insecta
  Vernacular (EN): European Corn-borer
  Vernacular (FR): Pyrale du maïs
  Vernacular (ES): Piral del maíz
  Vernacular (DE): Maiszünsler
Specimens and Observations
  Collection: DGH Lepidoptera
  Record id: DGHEUR_
  Country: France
  Coordinates: 03.047˚E ˚N
  Date: 28 June 2003
  Collector: Donald Hobern
Ecological Interactions
  Foodplant: Zea mays L. (Family: Gramineae)
Sequence Data
  Locus: AAL35331
  Definition: acyl-CoA Z/E11 desaturase
  1 mvpyattadg hpekdecfed...
Abiotic Data
  Average Rainfall
  Location: 48.82°N 2.29°E
  Jan Feb Mar Apr
Taxonomic Descriptions
  Diagnosis: Wingspan 26-30mm; sexually dimorphic; male: forewings ochreous to dark brown; female: forewings pale yellow; …
Digital Literature and Web Resources
  Pheromones of Ostrinia
  /pheronet/phlist/ostrinia.html
Standardised structured data
Observation record formatted using the Darwin Core (XML markup not preserved; values in document order):
  DGH
  DGH Lepidoptera
  DGHEUR_
  Dichomeris marginella (Fabricius, 1781)
  O
  Animalia
  Lepidoptera
  Gelechiidae
  Dichomeris marginella (Fabricius, 1781)
  Donald Hobern
  Europe
  Denmark
  Københavns Amt
  Merianvej, Hellerup in Skinner trap
Calendar graphic: June 2003
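To make the idea of a standardised structured record concrete, here is a minimal Python sketch that wraps the values from this slide in XML elements. The element names (InstitutionCode, CollectionCode, and so on) and the enclosing "record" element are assumptions in the style of early Darwin Core versions; the slide's own markup did not survive, so this is not a reconstruction of the original document.

```python
import xml.etree.ElementTree as ET

# Values come from the slide; element names are ASSUMED (Darwin Core 1.x style).
FIELDS = [
    ("InstitutionCode", "DGH"),
    ("CollectionCode", "DGH Lepidoptera"),
    ("CatalogNumber", "DGHEUR_"),  # truncated in the source; left as-is
    ("ScientificName", "Dichomeris marginella (Fabricius, 1781)"),
    ("BasisOfRecord", "O"),
    ("Kingdom", "Animalia"),
    ("Order", "Lepidoptera"),
    ("Family", "Gelechiidae"),
    ("Collector", "Donald Hobern"),
    ("ContinentOcean", "Europe"),
    ("Country", "Denmark"),
    ("StateProvince", "Københavns Amt"),
    ("Locality", "Merianvej, Hellerup in Skinner trap"),
]


def build_record(fields):
    """Build a flat XML record: one child element per (name, value) pair."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        ET.SubElement(record, name).text = value
    return record


record = build_record(FIELDS)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because every provider emits the same flat set of named elements, a consumer can read any record with the same few lines of parsing code, whatever database sits behind the provider.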
TDWG Data Standards
Darwin Core
  Simple XML data model to represent taxon occurrence records (only core attributes)
  Extensions to handle e.g. curation details, microbial data, image data
ABCD Schema: Access to Biological Collection Data
  More complex XML data model to represent collection or observation data
  Detailed document structure including features for different communities
DiGIR: Distributed Generic Information Retrieval
  XML protocol for searching remote data resources
  Suitable for use with a wide range of different data models
BioCASe Protocol
  XML protocol for searching remote data resources with more complex schema (e.g. ABCD)
  Derived from DiGIR; a new unified DiGIR/BioCASe protocol is being developed
Taxon Concept Schema
  XML data model currently under development for exchange of nomenclatural/taxonomic data
  First version to be used for implementation in 2005
SDD Schema: Structured Descriptive Data
  XML data model for descriptive data relating to taxa or specimens (highly generalised)
  Suitable for representation of character tables, diagnostic keys, etc.
BioCASe-ABCD
Example search response (XML markup not preserved; text content in document order):
  2.3 (#46, Jul , 18:54:32) [MSC v bit (Intel)]
  T22:22:40+02:
  search
  BGBM Bridel Herbar
  Botanic Garden and Botanical Museum Berlin-Dahlem
  Andrea Hahn
  +49 (0)
  The use of the data is allowed only for non-profit scientific use and for non-profit nature conservation purpose.
  Botanic Garden and Botanical Museum Berlin-Dahlem
  BGBM
  No part of this data base may be copied or reproduced without written permission from the legal owner.
  The Intellectual Property Rights are held by the legal owner or, in case of living persons, by the collector or determinator.
  No responsibility is accepted for the accuracy of the information in this data base.
  Bridel
  Pottiaceae
  Plantae
  Leucophanes octoblepharioides Brid
  Leucophanes octoblepharioides Brid
  Allen, Noris Salazar
  Asia
  NP
  Nepal
  OK
Diagram labels: SPECIMEN; COLLECTION (INCLUDING METADATA); PROTOCOL
Note that structure of record elements is part of the content schema (ABCD), not part of the protocol
DiGIR-Darwin Core
Example search response (XML markup not preserved; text content in document order):
  $Revision: 1.14 $
  T13:48:
  search
  Botanische Staatssammlung München Infocomp P1
  Wedelia longifolia Mart. ex Baker
  Label Baker, J.G. Holotypus s.n. Martius, C.F.P. von
  South America Brazil
  '... in prov. S. Paulo, inter herbas locis irriguis ad Lorena... ' (op. cit.)
  Tribe: Heliantheae, Reference protologue: Martius, C.F.P. von: Flora Brasiliensis 6(3): , ,5,2
  1 true
Diagram labels: SPECIMEN; PROTOCOL
Note that structure of record elements is part of the protocol; the content schema (Darwin Core) only defines the attributes describing the record
BioCASe-ABCD compared to DiGIR-Darwin Core
BioCASe-ABCD model
  Document-based (response document includes metadata and records as a structured package)
  Strengths
    No problem with modelling complex nested structures and repeating elements
    Fits perfectly with UBIF proposal: ABCD DataSet elements and ABCD Metadata could readily be standardised with the DataSet/Metadata structures from other TDWG standards such as Structured Descriptive Data (SDD) and Taxon Concept Schema (TCS), with rather little work. DataSets from all three of these could be combined to form a single document with cross-references between sections.
  Possible weaknesses
    Not simple for specialist networks to extend the structure with additional elements of their own (requires well-planned open extension points to be designed into the schema), especially if a provider wishes simultaneously to be part of more than one such specialist network.
    (At present) all elements in the ABCD schema are versioned together. Handling an updated version of the schema requires significant additional effort on the part of providers and users. For example, adding new elements to support plant genetic resource data, without changing the elements for museum/herbarium specimens, requires all users to handle a new version of the schema.
DiGIR-Darwin Core model
  Record-based (response returns a set of records which may contain descriptor elements from any schema)
  Strengths
    Massively flexible and extensible model allowing different networks to use a common protocol and shared core elements alongside their own network-specific extensions.
    (In integrated protocol version) could return ABCD elements as part of response records. If ABCD is treated as a library of elements, this fits even better.
    Model maps well to supporting a flexible object-oriented data model for biodiversity informatics.
  Possible weaknesses
    (In existing version) cannot readily handle complex data structures with nested repeating elements.
    Records have no intrinsic data type; the model currently relies on an implicit understanding between user and data provider.
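The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the "darwin:CatalogNumber" and "marine:Depth" descriptor names, the unit id "U1", the owner name and the second identification are all hypothetical; only darwin:ScientificName and the specimen name appear in the slides.

```python
# Document-based (ABCD-like): one nested package of metadata plus units.
# Repeating, nested elements (several identifications per unit) fit naturally.
document = {
    "metadata": {"owner": "Example Herbarium"},  # hypothetical owner
    "units": [
        {
            "unit_id": "U1",  # hypothetical identifier
            "identifications": [
                {"name": "Leucophanes octoblepharioides Brid", "preferred": True},
                {"name": "Leucophanes sp.", "preferred": False},  # hypothetical
            ],
        }
    ],
}


def to_flat_records(doc):
    """Record-based (DiGIR-like) view: one flat descriptor set per unit.

    Only the preferred identification survives; the nested repeating
    structure has no place in a flat record.
    """
    records = []
    for unit in doc["units"]:
        preferred = next(i for i in unit["identifications"] if i["preferred"])
        records.append({
            "darwin:CatalogNumber": unit["unit_id"],
            "darwin:ScientificName": preferred["name"],
        })
    return records


def extend_record(record, extension):
    """Add network-specific descriptors without touching the core ones."""
    merged = dict(record)
    merged.update(extension)
    return merged


flat = to_flat_records(document)
extended = extend_record(flat[0], {"marine:Depth": "12"})  # hypothetical extension
print(flat)
print(extended)
```

The flat view loses the second identification, which mirrors the stated weakness of the record-based model, while the one-line extension mirrors its stated strength.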
Exchange via web services
Diagram labels: Heterogeneous Databases; Web Services; Standardised Structured Data; Internet; Users
GBIF network of biodiversity data nodes
Museum A
  Specimens: Flowering Plants of Africa
  Specimens: Proteaceae of the World
  Taxon Names: Proteaceae of the World
Observer Network B
  Observations: Birds of Central America
  Observations: Butterflies of Belize
  Checklist: Birds of Belize
Museum C
  Specimens: Mammals of North Europe
  Taxon Names: Mammals of the World
  Further Links: Mammals
University D
  Specimens: Bacteria Cultures
  Taxon Names: Bacteria
  Further Links: Bacteria
GBIF Network: DiGIR-DarwinCore; BioCASe-ABCD; Taxon Concept Schema
Central GBIF registry of data nodes
Data Node           Type of data          Taxon             Region           Records
Museum A            Specimen/Observation  Flowering Plants  Africa
                    Specimen/Observation  Proteaceae        World            23000
                    Taxonomic Names       Proteaceae        World            1500
Observer Network B  Specimen/Observation  Birds             Central America  68500
                    Specimen/Observation  Butterflies       Belize           4200
                    Name List             Birds             Belize           587
Museum C            Specimen/Observation  Mammals           North Europe     1800
                    Taxonomic Names       Mammals           World            8000
                    General Resources     Mammals           World            600
University D        Specimen/Observation  Bacteria          World            1200
                    Taxonomic Names       Bacteria          World            5000
                    General Resources     Bacteria          World            400
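A registry like this is essentially a searchable list of resource descriptions. The following Python sketch loads the rows of the registry slide and filters them; the Resource class and find_resources function are hypothetical names chosen for illustration, not part of any GBIF software, and the missing record count for the first row is kept as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    node: str
    data_type: str
    taxon: str
    region: str
    records: Optional[int]  # None where the count is missing in the source


REGISTRY = [
    Resource("Museum A", "Specimen/Observation", "Flowering Plants", "Africa", None),
    Resource("Museum A", "Specimen/Observation", "Proteaceae", "World", 23000),
    Resource("Museum A", "Taxonomic Names", "Proteaceae", "World", 1500),
    Resource("Observer Network B", "Specimen/Observation", "Birds", "Central America", 68500),
    Resource("Observer Network B", "Specimen/Observation", "Butterflies", "Belize", 4200),
    Resource("Observer Network B", "Name List", "Birds", "Belize", 587),
    Resource("Museum C", "Specimen/Observation", "Mammals", "North Europe", 1800),
    Resource("Museum C", "Taxonomic Names", "Mammals", "World", 8000),
    Resource("Museum C", "General Resources", "Mammals", "World", 600),
    Resource("University D", "Specimen/Observation", "Bacteria", "World", 1200),
    Resource("University D", "Taxonomic Names", "Bacteria", "World", 5000),
    Resource("University D", "General Resources", "Bacteria", "World", 400),
]


def find_resources(registry, taxon=None, data_type=None, region=None):
    """Return resources matching every criterion that is not None."""
    return [
        r for r in registry
        if (taxon is None or r.taxon == taxon)
        and (data_type is None or r.data_type == data_type)
        and (region is None or r.region == region)
    ]


mammals = find_resources(REGISTRY, taxon="Mammals")
bird_occurrences = find_resources(REGISTRY, taxon="Birds", data_type="Specimen/Observation")
print([r.node for r in mammals])
print(bird_occurrences[0].records)
```

A portal would use a lookup of this kind to decide which data nodes are worth querying before sending any request over the network.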
DiGIR-BioCASe Protocol and Nested Networks
User requests
  Get Darwin Core records where darwin:ScientificName equals Puma concolor from any provider.
  Get standard plant genetic resource Passport data for all crop types.
  Get full set of Soy Bean crop descriptors.
  Get complete ABCD documents from each BioCASe provider
  Get DiGIR-style records each with a set of Darwin Core descriptors and a complete ABCD Unit
Providers (each serving Taxon Occurrence records) and the schemas they support
  MaNIS Provider: Darwin Core, Curatorial
  OBIS Provider: Darwin Core, Marine
  IPGRI Banana Provider: Darwin Core, IPGRI Passport, Banana Descriptor
  IPGRI Soy Bean Provider: Darwin Core, IPGRI Passport, Soy Bean Descriptor
  BioCASe Provider: Darwin Core, ABCD
  BioCASe Provider: Darwin Core, ABCD
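The nested-network idea is that every provider shares Darwin Core, while narrower networks add their own schemas on top, so a request reaches exactly the providers that support the schemas it needs. A minimal Python sketch of that selection step follows; the function name and the "BioCASe 1"/"BioCASe 2" labels are hypothetical, and the real protocol's request format is not modelled here.

```python
# Schemas supported by each provider, as listed on the slide.
# The two BioCASe providers are unnamed there; the suffixes are placeholders.
PROVIDERS = {
    "MaNIS": {"Darwin Core", "Curatorial"},
    "OBIS": {"Darwin Core", "Marine"},
    "IPGRI Banana": {"Darwin Core", "IPGRI Passport", "Banana Descriptor"},
    "IPGRI Soy Bean": {"Darwin Core", "IPGRI Passport", "Soy Bean Descriptor"},
    "BioCASe 1": {"Darwin Core", "ABCD"},
    "BioCASe 2": {"Darwin Core", "ABCD"},
}


def providers_supporting(required_schemas, providers=PROVIDERS):
    """Return (sorted) names of providers that support all required schemas."""
    required = set(required_schemas)
    return sorted(name for name, schemas in providers.items() if required <= schemas)


# "Get Darwin Core records ... from any provider": every provider qualifies.
print(providers_supporting(["Darwin Core"]))
# "Get standard plant genetic resource Passport data for all crop types."
print(providers_supporting(["IPGRI Passport"]))
# "Get full set of Soy Bean crop descriptors."
print(providers_supporting(["Soy Bean Descriptor"]))
# "Get DiGIR-style records ... with Darwin Core descriptors and a complete ABCD Unit"
print(providers_supporting(["Darwin Core", "ABCD"]))
```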
GBIF index to biodiversity data
Diagram labels: User requests; Biodiversity Data Access; Biodiversity Data Index; Taxonomic Name Service (ECAT); Catalogue of Life; DiGIR/BioCASe; Taxon Concept
GBIF Data Nodes: Specimen Data; Links to other data; Specimen Data; Name Lists; Specimen Data; Observation Data; Specimen Data
GBIF data index
Central portal to biodiversity data
Request to GBIF Portal: Show specimen records for Erinaceus europaeus
Responses from data nodes: 6 records; 35 records; 17 records; 0 records
58 records:
  1. Museum A, Paris
  2. Museum A, Nice
  3. Museum A, Paris
  4. Museum A, Avignon
  5. Museum A, Avignon
  6. Museum A, Marseille
  7. Observer B, Norwich
  8. Observer B, Norwich
  9. Observer B, Southampton
  ...
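The portal's role on this slide is to merge the per-node responses into one result list. A minimal Python sketch of that aggregation step is below; the node names and record strings are placeholders, and only the four counts (6, 35, 17 and 0) are taken from the slide, which does not say which node returned which count.

```python
def aggregate(responses):
    """Merge per-node result lists into one list and report per-node counts."""
    per_node = {node: len(records) for node, records in responses.items()}
    merged = [record for records in responses.values() for record in records]
    return per_node, merged


# Placeholder node names and records; the counts come from the slide.
responses = {
    "node_1": [f"node_1 record {i + 1}" for i in range(6)],
    "node_2": [f"node_2 record {i + 1}" for i in range(35)],
    "node_3": [f"node_3 record {i + 1}" for i in range(17)],
    "node_4": [],  # a node with no matching records still answers
}

per_node, merged = aggregate(responses)
print(per_node)
print(f"{len(merged)} records")
```

In a real portal the responses would arrive from network calls to each data node; the merge itself stays this simple.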
GBIF Data Portal
Participant Nodes with tailored information
Request to GBIF France: Show occurrence of Hérisson d’Europe
Request to GBIF Portal: Show specimen records for Erinaceus europaeus from France
GBIF Portal, 58 GBIF records:
  1. Museum A, Paris
  2. Museum A, Nice
  3. Museum A, Paris
  4. Museum A, Avignon
  5. Museum A, Avignon
  6. Museum A, Marseille
  7. Observer B, Norwich
  8. Observer B, Norwich
  9. Observer B, Southampton
  ...
  Museum C, Toulouse
Geographic Services
GBIF France, 26 records:
  1. Museum A, Paris
  2. Museum A, Nice
  3. Museum A, Paris
  4. Museum A, Avignon
  5. Museum A, Avignon
  6. Museum A, Marseille
  ...
  23. Observer B, Calais
  ...
  29. Observer B, Paris
  ...
  Museum C, Toulouse
Flexible applications
Request to GBIF: Provide key to identify reportable Curculionidae
Inputs: List of names of reportable pest species; Descriptive data
Diagram label: WANTED
Generated key:
  1. Elytra brown: 2
     Elytra not brown: 5
  2. Thorax black
     Thorax brown: 3
  3. Hind tibia black: Non-pest
     Hind tibia brown: 4
  4. Hind femur brown
     Hind femur black: Non-pest
  ...
A customs official discovers specimens of a possible pest species of weevil (Curculionidae) on a consignment of agricultural produce at a port of entry. The GBIF Network generates an identification key to support identification of pest weevil species to allow the official to determine appropriate response. This application requires access to data from a wide range of sources, including those GBIF participants that are organisations.
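A dichotomous key of this kind is straightforward to represent and walk in software. The Python sketch below encodes the four couplets visible on the slide and follows them for a set of observed character states. The data layout and the identify function are hypothetical; outcomes that are not legible in the slide text (the leads for "Thorax black" and "Hind femur brown", and couplet 5) are left as unknown rather than guessed.

```python
# Each couplet maps to its leads: (character state, next couplet or outcome).
# None marks an outcome that is not preserved in the slide text.
KEY = {
    1: [("Elytra brown", 2), ("Elytra not brown", 5)],
    2: [("Thorax black", None), ("Thorax brown", 3)],
    3: [("Hind tibia black", "Non-pest"), ("Hind tibia brown", 4)],
    4: [("Hind femur brown", None), ("Hind femur black", "Non-pest")],
}


def identify(key, observed, start=1):
    """Walk the key using observed character states.

    Returns (path, outcome); outcome is None when the key runs out
    (unknown lead, missing couplet, or no observed state matches).
    """
    couplet = start
    path = []
    while couplet in key:
        for statement, target in key[couplet]:
            if statement in observed:
                path.append(statement)
                break
        else:
            return path, None  # no lead of this couplet was observed
        if isinstance(target, int):
            couplet = target  # continue at the next couplet
        else:
            return path, target  # terminal outcome (or unknown)
    return path, None  # couplet not present in the key (e.g. couplet 5)


path, outcome = identify(KEY, {"Elytra brown", "Thorax brown", "Hind tibia black"})
print(path, outcome)
```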
Monitoring of data usage
GBIF Portal
  Request: Show specimen records for Upupa epops
  81 records:
    1. Museum A, Paris
    2. Museum A, Nice
    3. Museum A, Paris
    ...
  Request: Show bird specimen records from Nice
  126 records:
    1. Museum A, Upupa epops
    2. Museum A, Apus apus
    3. Museum A, Athene noctua
    ...
Data Usage Logs
Data Usage Reports
  GBIF Usage: Museum A
    16 August 2003 Search: Upupa epops, 5 records returned
    18 August 2003 Search: Birds from Nice, 16 records returned
  GBIF Usage: Observer B
    16 August 2003 Search: Upupa epops, 2 records returned
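The usage reports on this slide amount to grouping a central search log by data provider. A minimal Python sketch, assuming the log is kept as simple tuples (the tuple layout and the usage_report function are hypothetical; the three entries are the ones shown on the slide):

```python
from collections import defaultdict

# (date, search, provider, records returned), as shown on the slide.
USAGE_LOG = [
    ("16 August 2003", "Upupa epops", "Museum A", 5),
    ("18 August 2003", "Birds from Nice", "Museum A", 16),
    ("16 August 2003", "Upupa epops", "Observer B", 2),
]


def usage_report(log):
    """Group log entries by provider so each provider sees only its own usage."""
    report = defaultdict(list)
    for date, search, provider, returned in log:
        report[provider].append((date, search, returned))
    return dict(report)


report = usage_report(USAGE_LOG)
for provider, entries in report.items():
    print(f"GBIF Usage: {provider}")
    for date, search, returned in entries:
        print(f"  {date} Search: {search}, {returned} records returned")
```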
Future activity
Globally unique identifiers
  TDWG-GBIF collaboration to develop models to allow data providers to attach persistent identifiers to their data records
  Allow software to detect multiple instances of the same record
  Allow users to save resolvable references to specimens, collections, taxon concepts, etc.
Schema repository
  Central library of information on data models
  Resource for discovering documentation or mappings between different schemas
  Better support for intelligent software applications
Data validation tools
  Framework for running sets of validation tests against XML data (content values, controlled vocabularies, relating georeference data to named localities, etc.)
  Support different uses (data providers to locate possible problems in data; users to assure themselves of suitability of data; GBIF to provide metadata on data completeness/coherence)
Access to a wide range of taxonomic name data
  Taxonomic/nomenclatural authorities (nomenclators, global species databases, revisions, etc.)
  Lists used by different communities/organisations (red lists, pest species, regional checklists, etc.)
Customised portals
  Organised according to taxon lists used by each user
  Notifications of new data based on user profiles (taxonomic, geographic, etc.)
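The data validation item describes a framework that runs sets of tests against XML data. One possible shape for such a framework, sketched in Python: each test is a small function that returns a list of problems, and the runner applies every test to a record. Everything here is an assumption made for illustration, including the element names (ScientificName, BasisOfRecord, Latitude, Longitude), the example vocabulary and the function names; the slides do not describe an actual design.

```python
import xml.etree.ElementTree as ET

ALLOWED_BASIS = {"O", "S"}  # hypothetical controlled vocabulary


def check_scientific_name(record):
    """Content-value test: a name must be present and non-empty."""
    name = record.findtext("ScientificName")
    return [] if name and name.strip() else ["ScientificName is missing or empty"]


def check_basis_of_record(record):
    """Controlled-vocabulary test."""
    value = record.findtext("BasisOfRecord")
    if value in ALLOWED_BASIS:
        return []
    return [f"BasisOfRecord {value!r} is not in the controlled vocabulary"]


def check_coordinates(record):
    """Georeference test: coordinates, if given, must be numeric and in range."""
    problems = []
    for tag, limit in (("Latitude", 90.0), ("Longitude", 180.0)):
        text = record.findtext(tag)
        if text is None:
            continue  # coordinates are optional in this sketch
        try:
            value = float(text)
        except ValueError:
            problems.append(f"{tag} {text!r} is not a number")
            continue
        if abs(value) > limit:
            problems.append(f"{tag} {value} is outside +/-{limit}")
    return problems


CHECKS = [check_scientific_name, check_basis_of_record, check_coordinates]


def validate(xml_text, checks=CHECKS):
    """Run every check against one XML record and collect the problems."""
    record = ET.fromstring(xml_text)
    return [problem for check in checks for problem in check(record)]


good = ("<record><ScientificName>Upupa epops</ScientificName>"
        "<BasisOfRecord>O</BasisOfRecord><Latitude>48.82</Latitude>"
        "<Longitude>2.29</Longitude></record>")
bad = "<record><BasisOfRecord>X</BasisOfRecord><Latitude>123.4</Latitude></record>"

print(validate(good))
print(validate(bad))
```

Keeping each test independent lets the same runner serve the different uses listed above: a provider can run all tests on its own data, while a user can pick only the tests that matter for a given analysis.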
Links
Taxonomic Databases Working Group
  Including access to working groups
Global Biodiversity Information Facility
  Communications Portal
  Data Portal
  Architecture documents