
Richard White Biodiversity Informatics. Part One An introduction to biodiversity data.


1 Richard White Biodiversity Informatics

2 Part One An introduction to biodiversity data

3 Outline Biodiversity: what is it? – Definitions: is biodiversity: A resource? Something which can be measured? How to measure it – Who is it for? Data providers Researchers Users Biodiversity Informatics – Research into techniques for handling data

4 Threats to the planet Human activities Legal issues Economics Conservation Management Exploitation Habitat conservation Species conservation Ecological diversity Species diversity Genetic diversity Ecology Evolution Genetics Molecular biology Genetic resources Information services

5 Biodiversity data types Kinds of biodiversity information: – Data about areas, habitats, etc. – Data about individual specimens – Data about species Biodiversity data dimensions – Species – Diverse information types Descriptive, geographical, chemical, genomic etc.

6 Data about areas, habitats, etc. Species lists, for –Conservation –Management –Legal obligations Ecological processes –Modelling ecosystems –Predicting impacts

7 Information about individual specimens Curatorial data about the management of each specimen Data describing characteristics of the specimen itself (which can also describe an entire species)

8 Curatorial information Collection event – Date and place of collection – Collector’s name Identifications (determinations) – Species name (see data about individual specimens and species) – Who identified it, date, etc. Management information – Location within the specimen collection (storage) – Treatments given to specimen, etc.
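The three groups of curatorial information above map naturally onto a small record structure. The following is a minimal sketch, not a standard schema: every class and field name, and the sample catalogue number, people and dates, are assumptions made for illustration. Only the two species names come from the LegumeWeb example later in the talk.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CollectionEvent:
    """Date and place of collection, plus the collector's name."""
    collected_on: date
    place: str
    collector: str


@dataclass
class Determination:
    """One identification of the specimen: which name, by whom, and when."""
    species_name: str
    determined_by: str
    determined_on: Optional[date] = None


@dataclass
class SpecimenRecord:
    """Curatorial record for one specimen (all field names are hypothetical)."""
    catalogue_number: str
    event: CollectionEvent
    determinations: List[Determination] = field(default_factory=list)
    storage_location: str = ""
    treatments: List[str] = field(default_factory=list)

    def current_name(self) -> Optional[str]:
        """Return the name from the most recently added determination."""
        if not self.determinations:
            return None
        return self.determinations[-1].species_name


# Invented sample data, for illustration only.
specimen = SpecimenRecord(
    catalogue_number="EX-0001",
    event=CollectionEvent(date(1998, 5, 14), "Example locality", "A. Collector"),
    storage_location="Cabinet 12, shelf 3",
)
specimen.determinations.append(Determination("Abrus cyaneus", "A. Collector"))
specimen.determinations.append(
    Determination("Abrus precatorius", "B. Botanist", date(2004, 2, 1))
)
```

Keeping determinations as a list rather than a single field preserves the identification history, which matters when a specimen is later re-identified.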

9 Data about specimens and species curatorial data nomenclatural data descriptive data geographical data, maps images bibliographic data

10 Data describing specimens and species (1) Genetic diversity –Allele and chromosome frequencies Molecular bioinformatics –Molecular data – enzyme properties, etc. –Molecular sequences – DNA, protein, polysaccharides, etc. “Traditional” data used in taxonomy etc. –See next slide

11 Data describing specimens and species (2) Nomenclature – accepted name, synonyms Taxonomy – higher taxa Geographical data – distribution (range) Chemical constituents (especially in plants) Behavioural information (animals) Descriptive data –Anatomical and morphological descriptors –Images Bibliographic data (source references, especially for species data)

12 Geographical data - storage Database may store: –Individual locations of specimens or sightings –Status in an area based on a number of specimens or sightings: (present, absent, introduced, etc.) Locations may be stored as –Area names (languages, synonyms, hierarchies, overlaps) –Grid coordinates (various systems)

13 Geographical data - use May be used to generate –summary distributions (e.g. for species distribution from specimen data) –Maps (point locations or shaded areas) May be used to allow searching by location or area – user may specify a point or an area name
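As a rough illustration of the two uses just described, the sketch below derives a summary distribution from point records and searches by a coordinate box. The records, coordinates and function names are all invented for the example. A real system would also have to handle area-name synonyms, hierarchies and different grid systems.

```python
from collections import defaultdict

# Invented specimen records: (species, area name, latitude, longitude)
RECORDS = [
    ("Abrus precatorius", "India", 12.97, 77.59),
    ("Abrus precatorius", "India", 19.07, 72.88),
    ("Abrus precatorius", "Kenya", -1.29, 36.82),
    ("Caragana arborescens", "Russia", 55.03, 82.92),
]


def summary_distribution(records):
    """Collapse point records into a sorted list of areas for each species."""
    areas = defaultdict(set)
    for species, area, _lat, _lon in records:
        areas[species].add(area)
    return {species: sorted(found) for species, found in areas.items()}


def search_by_box(records, south, north, west, east):
    """Return the species with at least one record inside a lat/lon box."""
    return sorted({
        species
        for species, _area, lat, lon in records
        if south <= lat <= north and west <= lon <= east
    })


distribution = summary_distribution(RECORDS)
in_box = search_by_box(RECORDS, south=0, north=30, west=60, east=90)
```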

14 Descriptive data Should be carefully designed, because it is complex and may be used for many purposes It should be –Structured –Consistently applied It may include data types suitable for statistical and multivariate analysis Special problems exist

15 Descriptive data Structured, for Querying Classification, phylogenetic analysis Identification Documentation and dissemination

16 Descriptive data Consistency and comparability: Consistent terminology (cf. attempts to standardise terms for indexing purposes, as in BioCASE Thesaurus) Same characters for all specimens or taxa Characters precisely defined –Discontinuous - set of character states –Continuous – units, precision

17 Descriptive data – special problems Variability –specimens within a species –repeated structures within a specimen Character dependence (inapplicable characters) Taxonomic hierarchy issues, e.g. –Is the data for a species in agreement with the data for a genus? –Can the data be stored at the appropriate taxonomic level only?
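One way to make the ideas of precisely defined characters and character dependence concrete is sketched below. It is an assumption-laden illustration rather than any published descriptive-data format: the character names, states and the `validate` helper are all hypothetical. Continuous values are stored as a (min, max) range to allow for variability.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class DiscreteCharacter:
    name: str
    states: Tuple[str, ...]
    # (controlling character, state it must have) or None if always applicable
    depends_on: Optional[Tuple[str, str]] = None


@dataclass(frozen=True)
class ContinuousCharacter:
    name: str
    unit: str
    depends_on: Optional[Tuple[str, str]] = None


Character = Union[DiscreteCharacter, ContinuousCharacter]

# Hypothetical character definitions shared by all descriptions.
CHARACTERS: Dict[str, Character] = {
    "petals": DiscreteCharacter("petals", ("present", "absent")),
    "petal colour": DiscreteCharacter(
        "petal colour", ("white", "yellow", "red"), depends_on=("petals", "present")
    ),
    "leaf length": ContinuousCharacter("leaf length", unit="mm"),
}


def validate(description, characters):
    """Return a list of problems found in one specimen or taxon description."""
    problems = []
    for name, value in description.items():
        char = characters.get(name)
        if char is None:
            problems.append(f"{name}: undefined character")
            continue
        if char.depends_on:
            controller, required = char.depends_on
            if description.get(controller) != required:
                problems.append(f"{name}: inapplicable unless {controller} = {required}")
                continue
        if isinstance(char, DiscreteCharacter):
            if value not in char.states:
                problems.append(f"{name}: unknown state {value!r}")
        else:
            low, high = value  # continuous values are stored as a (min, max) range
            if low > high:
                problems.append(f"{name}: range is reversed")
    return problems
```

For example, scoring a petal colour for a description whose petals are absent is reported as inapplicable rather than silently stored.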

18 Images Type –Bitmap files, e.g. JPEG –Vector graphics, e.g. drawings, diagrams Location –In the local database –Elsewhere in a separate image bank The Web makes the latter option easy – just store the URL in the database

19 Biodiversity organisations Database level: – ILDIS (International Legume Database and Information Service) - www.ildis.org Data portal level: – Species 2000 - www.sp2000.org – GBIF (Global Biodiversity Information Facility) - www.gbif.org International agencies: – CBD, CITES, WCMC, etc. Standards, etc. – TDWG (Taxonomic Databases Working Group) Lots more

20 Practical session In the practical session, we will – Look at what some of the various biodiversity organisations are doing – Try some of their data portals – Evaluate some of the biodiversity information systems available (introduced in Part Two of this talk, to follow), from the points of view of scientific and professional users and the general public

21 Part Two Biodiversity information systems (Some of this material appeared in the Computing for Bioinformatics module)

22 Thoughts Role of biodiversity data in bioinformatics – assisting with organising and retrieving bioinformatic (molecular) data – a separate area with different users (taxonomy, ecology, conservation, resource management …) Demand from users for taxonomic and species diversity information on the Web Pressure on the taxonomic community to deliver Demand for more sophisticated use of available data: interoperability = online analysis, not just browsing

23 Assembling biodiversity information sources Delivering species diversity information by assembling, merging & linking databases and publishing on the Web, with special emphasis on linking

24 Issues in assembling and linking biodiversity information sources Assembling a web-site (ERMS) Assembling databases by merging (ILDIS) Linking on-line databases through a gateway (Species 2000 and SPICE) Onward links to related information Checking the reliability of links (LITCHI) Intelligent linking Persistent identifiers

25 Assembling species databases First of all, before we start merging and linking databases, let’s assemble a database from scratch: ERMS (European Register of Marine Species) Now at www.marbef.org/data/erms.php

26 ERMS

27 Incoming data Approximately 100 separate lists for different taxonomic groups Mostly compiled as spreadsheets Scientific names, synonyms, geography (at least Atlantic or Mediterranean) Some optional fields Objective to create a book and a web-site, partially supported by a database

28 List conversion was carried out in several stages: Excel spreadsheets were exported to text files. Tab-delimited text files were imported into a client-server database (MySQL). Database query results were passed through templates to generate either RTF (for the printed publication) or HTML (for the Web site).
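The stages above can be sketched end to end in a few lines. This is only an illustration of the pipeline's shape, not the ERMS code: it substitutes Python's built-in `sqlite3` for MySQL so that it runs without a server, and the table layout, column names, templates and the two sample rows are assumptions.

```python
import csv
import html
import io
import sqlite3
from string import Template

# Stand-in for a tab-delimited text file exported from a spreadsheet.
EXPORT = (
    "genus\tspecies\tauthority\tregion\n"
    "Caretta\tcaretta\t(Linnaeus, 1758)\tAtlantic\n"
    "Dermochelys\tcoriacea\t(Vandelli, 1761)\tAtlantic\n"
)

# One template per output format; both are fed from the same query results.
HTML_ROW = Template("<li><i>$genus $species</i> $authority, $region</li>")
RTF_ROW = Template(r"{\i $genus $species} $authority, $region\par")


def load(conn, text):
    """Import tab-delimited text into a database table."""
    conn.execute(
        "CREATE TABLE taxon (genus TEXT, species TEXT, authority TEXT, region TEXT)"
    )
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    conn.executemany(
        "INSERT INTO taxon VALUES (:genus, :species, :authority, :region)", rows
    )


def render(conn, template, escape=lambda s: s):
    """Pass each query result row through a template, one output line per row."""
    cur = conn.execute(
        "SELECT genus, species, authority, region FROM taxon ORDER BY genus, species"
    )
    columns = [d[0] for d in cur.description]
    return [
        template.substitute({c: escape(v) for c, v in zip(columns, row)})
        for row in cur
    ]


conn = sqlite3.connect(":memory:")
load(conn, EXPORT)
html_lines = render(conn, HTML_ROW, escape=html.escape)
rtf_lines = render(conn, RTF_ROW)
```

Generating both formats from one query means the book and the web site cannot drift apart, which is presumably why the conversion was routed through a database rather than done directly from the spreadsheets.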

29 Variations on a theme Fields may be combined or separated e.g. genus species authority date Higher taxa may be: –repeated in fields of the species record –given once in separate preceding records in various different formats Synonyms may be: –in a separate field of the species record, or mixed with other remarks, with various delimiters and separators – in separate records, linked by code or by name or even abbreviated –implied, e.g. Genus1 specname (Smith as Genus2) Geographical information is often free text
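Handling these variations mostly comes down to normalising each incoming list into one shape before import. The sketch below shows two such normalisers, under the assumption that a combined field looks like "Genus species Authority, date" and that synonyms are separated by semicolons or vertical bars. Real lists were far less regular, so the patterns here are illustrative only.

```python
import re

# Assumed layout of a combined field: "Genus species Authority, date"
NAME = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+(?P<species>[a-z-]+)(?:\s+(?P<authorship>.+))?$"
)
YEAR = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def split_name(text):
    """Separate a combined name field into genus, species, authorship and year."""
    match = NAME.match(text.strip())
    if match is None:
        return None  # leave unparseable entries for manual checking
    authorship = match.group("authorship") or ""
    year = YEAR.search(authorship)
    return {
        "genus": match.group("genus"),
        "species": match.group("species"),
        "authorship": authorship,
        "year": int(year.group()) if year else None,
    }


def split_synonyms(field):
    """Split a synonym field on the assumed delimiters ';' and '|'."""
    return [part for part in re.split(r"\s*[;|]\s*", field.strip()) if part]
```

Returning `None` instead of guessing keeps doubtful entries visible, which suits data that a taxonomist will need to review anyway.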

30 ERMS book page

31 Osteichthyes: brief checklist

32 Reptilia: full details

33 Taxonomic hierarchy for Reptilia

34 Merging versus linking Merging databases to create a single larger database Linking databases to create a distributed information system

35 Merging species databases 1. The original databases are physically copied into a new combined database. 2. The user interacts with the new combined database.

36 Linking 1. The user interacts with an access system which does not itself contain data. 2. When the user requests data, it is fetched from the appropriate database.
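The practical difference between the two approaches shows up when a source database changes. The sketch below is a toy model with invented class names and an invented "status" field: the merged copy goes stale, while the linking access system fetches the current record on demand.

```python
class SpeciesDatabase:
    """A stand-in for one source database."""

    def __init__(self, name, records):
        self.name = name
        self.records = records

    def lookup(self, species):
        return self.records.get(species)


def merge(databases):
    """Merging: physically copy every record into one new combined store."""
    combined = {}
    for db in databases:
        for species, record in db.records.items():
            combined[species] = dict(record, source=db.name)
    return combined


class AccessSystem:
    """Linking: holds no species data itself, only the databases to ask."""

    def __init__(self, databases):
        self.databases = list(databases)

    def lookup(self, species):
        for db in self.databases:
            record = db.lookup(species)
            if record is not None:
                return dict(record, source=db.name)
        return None


legumes = SpeciesDatabase("legumes", {"Abrus precatorius": {"status": "provisional"}})
merged = merge([legumes])
linked = AccessSystem([legumes])

# The source database is updated after the merge was made.
legumes.records["Abrus precatorius"]["status"] = "checked"
```

After the update, `merged` still reports "provisional" while `linked.lookup(...)` reports "checked". The price of linking is that every request depends on the source database being reachable.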

37 Assembling databases by merging Now we have some databases, let’s build a bigger one by merging: ILDIS (International Legume Database and Information Service)

38 ILDIS International Legume Database and Information Service International collaborative project –10 Regional Centres –30 Taxonomic Coordinators Its goals include –building, maintaining and enhancing the ILDIS World Database of Legumes –designing and providing services from it to users, including: ILDIS LegumeWeb via Species 2000

39 ILDIS World Database of Legumes v. 7.00 Taxa: Species 15,500, Subspecies 1,600, Varieties 2,400 (19,500). Names: Accepted names 19,500, Synonyms 19,000 (39,500).

40 ILDIS’s data model: core data A core taxonomic checklist, assembled from regional data sets and nearing completion, provides a consensus taxonomy - a unified taxonomic treatment or backbone on which other data can be hung Various kinds of additional data may be attached to this backbone (see later)

41 Features of ILDIS LegumeWeb We’ll look at examples of the use of LegumeWeb, to show a couple of features: Two-stage access with “synonymic indexing” A gateway to external information - “onward links” (direct species name links) to further sources of information


44 User access to LegumeWeb: Step 1 The user types in a name, which may be incomplete (or wrong!) LegumeWeb responds by showing a list of the species names which fit the user’s specification


46 User access to LegumeWeb: Step 2 The user chooses one of the species names provided (which may be a synonym or an accepted name) –In this example, the user chooses Abrus cyaneus (a synonym for Abrus precatorius) LegumeWeb responds by showing a standard set of information about the chosen species


48 Synonymic indexing Automated synonymic indexing: synonym entered → accepted name found (name → taxon); taxon found → synonyms listed Types of synonyms –Unambiguous –Ambiguous: pro parte, homonyms, misapplied names In these cases an explanation is offered to the user
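The two-stage access and the name-to-taxon lookup can be sketched as follows. This is a guess at the general mechanism, not LegumeWeb's implementation: the data structures and function names are assumptions, and the taxa are taken from the examples used elsewhere in this talk. A name that points to more than one taxon is flagged as ambiguous so that an explanation can be shown.

```python
# Each taxon has one accepted name and any number of synonyms.
TAXA = {
    1: {"accepted": "Abrus precatorius", "synonyms": ["Abrus cyaneus"]},
    2: {"accepted": "Caesalpinia crista", "synonyms": []},
    # Here Caesalpinia crista is also a pro parte synonym of another taxon.
    3: {"accepted": "Caesalpinia bonduc", "synonyms": ["Caesalpinia crista"]},
}


def build_index(taxa):
    """Index every name, accepted or synonym, to the taxon or taxa it refers to."""
    index = {}
    for taxon_id, taxon in taxa.items():
        index.setdefault(taxon["accepted"], []).append((taxon_id, "accepted"))
        for synonym in taxon["synonyms"]:
            index.setdefault(synonym, []).append((taxon_id, "synonym"))
    return index


def step1_search(index, fragment):
    """Step 1: list every name that starts with what the user typed."""
    fragment = fragment.strip().lower()
    return sorted(name for name in index if name.lower().startswith(fragment))


def step2_resolve(index, taxa, name):
    """Step 2: name -> taxon (or taxa) -> accepted name and all synonyms."""
    matches = [
        {
            "accepted": taxa[taxon_id]["accepted"],
            "synonyms": list(taxa[taxon_id]["synonyms"]),
            "entered_as": status,
        }
        for taxon_id, status in index.get(name, [])
    ]
    return {"ambiguous": len(matches) > 1, "taxa": matches}


INDEX = build_index(TAXA)
```

Typing "abrus" in step 1 lists both Abrus names. Choosing the synonym Abrus cyaneus in step 2 leads to the taxon whose accepted name is Abrus precatorius.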

49 Assembling databases by linking Now we have some biggish databases, let’s build something even bigger by linking databases together: Species 2000 –SPICE –Species 2000 Europa

50 Linking 1. The user interacts with an access system which does not itself contain data. 2. When the user requests data, it is fetched from the appropriate database.

51 The Catalogue of Life (Species 2000) An international collaborative project to provide access to an authoritative and up-to-date checklist of all the world’s species A distributed array of Global Species Databases (GSDs) can be accessed through a Web gateway or Central Access System (CAS) The array of GSDs provide an index to a further range of information about each species, using onward links (see later) www.sp2000.org

52 Species 2000 organisation Taxonomic hierarchy (or hierarchies) Species Global species databases (GSDs) and interim checklists: the species index GSD interim checklists Species information sources (SISs): regional faunas and floras, specialist or sectoral databases, web pages etc. SIS

53 Architecture of Species 2000 User interface Data collector (CAS) Wrapper GSD Wrapper GSD Wrapper GSD

54 Species 2000’s Common Access System Species 2000 gives users a single point of access to GSDs Access involves a two-stage search process similar to that used in LegumeWeb In the second stage, the user sees a screen of “standard data” about a species

55 The “standard data” This comprises the information about a species which Species 2000 wishes to provide: – Accepted name (with references) – Synonyms (with references) – Common Names (with references) – Family or other higher taxon – Geography – Comment – Scrutiny information – URL or URLs linking to further data sources for this species

56 Need for communication Different people are building the various components of the system: –GSDs –wrappers –CAS –user interface We need to ensure they all have a common understanding of the data to avoid mistakes

57 Common Data Model We use a Common Data Model (CDM) –A definition of the information being passed to and fro –Human-readable, not machine-readable –Helps to manage complexity –Used to create specific machine-readable implementations for Corba (IDL), CGI/XML (DTD, XML Schema), Web Services, etc.

58 What does the CDM look like? It defines the input (“request”) and output (“response”) for six fundamental operations which the system needs to be able to carry out

59 Request Types 0-5 –Type 0: Get CDM version supported by a GSD’s wrapper –Type 3: Get information about a GSD –Type 1: Search for a name in a GSD –Type 2: Fetch “standard data” about a chosen species –Type 4: Move up the taxonomic hierarchy (towards the root of the tree) –Type 5: Move down the taxonomic hierarchy (towards the species level)
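To show how a human-readable model like this might be turned into one machine-readable form, here is a sketch in which the six request types become an enumeration, each GSD sits behind a wrapper, and the CAS fans a search out to every wrapper. The class names, method names, return shapes and the placeholder version string are all assumptions. The real Spice implementations used Corba and CGI/XML, which are not reproduced here.

```python
from enum import IntEnum


class RequestType(IntEnum):
    """The six request types, numbered as in the slide above."""
    CDM_VERSION = 0
    SEARCH_NAME = 1
    STANDARD_DATA = 2
    GSD_INFO = 3
    MOVE_UP = 4
    MOVE_DOWN = 5


class Wrapper:
    """Hypothetical wrapper that answers requests on behalf of one GSD."""

    def __init__(self, gsd_name, names):
        self.gsd_name = gsd_name
        self.names = list(names)

    def handle(self, request_type, argument=None):
        if request_type == RequestType.CDM_VERSION:
            return "placeholder-version"
        if request_type == RequestType.GSD_INFO:
            return {"name": self.gsd_name, "species_count": len(self.names)}
        if request_type == RequestType.SEARCH_NAME:
            wanted = argument.lower()
            return [name for name in self.names if wanted in name.lower()]
        raise NotImplementedError(f"request type {int(request_type)} is not sketched")


class CAS:
    """Access system: holds no species data, only a list of wrappers."""

    def __init__(self, wrappers):
        self.wrappers = list(wrappers)

    def search(self, text):
        """Send a Type 1 request to every wrapper and collect the answers."""
        hits = []
        for wrapper in self.wrappers:
            for name in wrapper.handle(RequestType.SEARCH_NAME, text):
                hits.append((wrapper.gsd_name, name))
        return hits


cas = CAS([
    Wrapper("GSD-A", ["Abrus precatorius", "Caragana arborescens"]),
    Wrapper("GSD-B", ["Caretta caretta"]),
])
```

Because every wrapper answers the same request types, the CAS needs no knowledge of how any individual GSD stores its data.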

60 Spice CAS in use Screen-shots of an old version of the Spice system in use:

61 Spice 1 CAS


64 Onward links to related external data Species databases such as ILDIS and federated systems such as Species 2000 envisage providing links from their data to external sources of related data, so-called “onward links” Example from ILDIS...

65 “Onward links” The user may follow a hyperlink to some other data source for further information, not managed by ILDIS –In this example, the user chooses to go to W³Tropicos at Missouri Botanical Garden to see more information In this way LegumeWeb acts as a gateway to other information about legume species

66 LegumeWeb page with onward links

67 Destination of an onward link

68 Further information obtained

69 Checking the reliability of links Whether in –merging data sets to construct a species database like ILDIS, or in –linking from one data set to another, it is necessary to ensure that the species concepts in the different databases do not conflict

70 Example 1 Database A Caragana arborescens Lam. [accepted name] Caragana sibirica Medikus [synonym] Database B Caragana sibirica Medikus [accepted name] Caragana arborescens Lam. [synonym]

71 Example 2 Database A Caesalpinia crista L. [accepted name] Database B Caesalpinia crista L. [accepted name] Caesalpinia bonduc (L.) Roxb. [accepted name] Caesalpinia crista L., p.p. [synonym]

72 LITCHI project We modelled the knowledge integrity rules in a taxonomic treatment The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon Practical uses include –helping a taxonomist to detect and resolve taxonomic conflicts when merging or linking two databases –helping a non-taxonomist user follow links from one database to another, in which the species may be differently classified
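A very reduced version of such a check can be run on Examples 1 and 2 above. The sketch below is not LITCHI's rule set. It applies a single assumed rule: a name shared by two checklists conflicts if it plays a different role in each, or is attached to a different accepted name. The pro parte marker from Example 2 is dropped for simplicity.

```python
def name_roles(checklist):
    """Map each name to the set of (role, accepted name) pairs it has.

    A checklist is a dict of accepted name -> list of synonyms.
    """
    roles = {}
    for accepted, synonyms in checklist.items():
        roles.setdefault(accepted, set()).add(("accepted", accepted))
        for synonym in synonyms:
            roles.setdefault(synonym, set()).add(("synonym", accepted))
    return roles


def conflicts(checklist_a, checklist_b):
    """Return the shared names whose roles differ between the two checklists."""
    roles_a, roles_b = name_roles(checklist_a), name_roles(checklist_b)
    shared = roles_a.keys() & roles_b.keys()
    return sorted(name for name in shared if roles_a[name] != roles_b[name])


# Example 1: the accepted name and the synonym are swapped.
EX1_A = {"Caragana arborescens Lam.": ["Caragana sibirica Medikus"]}
EX1_B = {"Caragana sibirica Medikus": ["Caragana arborescens Lam."]}

# Example 2: in database B the same name is also a synonym of another species.
EX2_A = {"Caesalpinia crista L.": []}
EX2_B = {
    "Caesalpinia crista L.": [],
    "Caesalpinia bonduc (L.) Roxb.": ["Caesalpinia crista L."],
}

example1 = conflicts(EX1_A, EX1_B)
example2 = conflicts(EX2_A, EX2_B)
```

Both names are flagged in Example 1, and Caesalpinia crista L. is flagged in Example 2. Deciding which treatment is right remains a job for a taxonomist.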

73 Conflict display

74 Outcome of LITCHI project A prototype tool for merging checklists & checking integrity of individual checklists was implemented In the Species 2000 Europa project, we are now creating a completely new second version with a view to allowing: –dynamic linking (so-called “taxonomically intelligent links”) –Presentation of “attached data” to be organised, merged and used to support conflict resolution

75 “Intelligent” linking The Catalogue of Life (Species 2000) is –not just a catalogue (which lists things) –it is an index (which points to things) GSDs, and gateways to them such as the Catalogue of Life, can serve not only as catalogues of species but also as indexes giving access, potentially, to all species information on the Internet

76 “Intelligent” linking Species 2000 plans to provide links to take a user –from a species entry (from a GSD) –to further sources of information about that particular species (Species Information Sources or SISs)

77 Species 2000 organisation Taxonomic hierarchy (or hierarchies) Species Global species databases (GSDs) and interim checklists: the species index GSD interim checklists Species information sources (SISs): regional faunas and floras, specialist or sectoral databases, web pages etc. SIS

78 “Intelligent” species links Given that it is possible to detect many cases of potential taxonomic conflict when linking species databases, how can such links be managed? There are a number of choices in the ways links may be made and handled

79 Cross-mapping So how can we make intelligent links work, especially in the difficult cases where a species in one database does not have an exact match in the other? –One way is to create and maintain “cross-maps” which describe how one or more taxa in one resource (such as the Species 2000 index) relate to one or more taxa in another resource
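A cross-map can be pictured as a list of entries, each relating a group of taxa in one resource to a group in another. The sketch below is one possible shape for such entries. The relation vocabulary ("equals", "overlaps" and so on) is an assumption, and the sample entries are an illustrative reading of Examples 1 and 2 above, not a taxonomic judgement.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CrossMapEntry:
    """How one or more taxa in resource A relate to one or more taxa in resource B."""
    source_taxa: Tuple[str, ...]
    relation: str  # assumed vocabulary: "equals", "includes", "included_in", "overlaps"
    target_taxa: Tuple[str, ...]


CROSS_MAP: List[CrossMapEntry] = [
    CrossMapEntry(
        source_taxa=("Caragana arborescens Lam.",),
        relation="equals",
        target_taxa=("Caragana sibirica Medikus",),
    ),
    CrossMapEntry(
        source_taxa=("Caesalpinia crista L.",),
        relation="overlaps",
        target_taxa=("Caesalpinia crista L.", "Caesalpinia bonduc (L.) Roxb."),
    ),
]


def follow(cross_map, taxon):
    """Return (relation, target taxa) for every entry that mentions the taxon."""
    return [
        (entry.relation, entry.target_taxa)
        for entry in cross_map
        if taxon in entry.source_taxa
    ]
```

A link that follows an "overlaps" entry can then warn the user that the destination covers more than one candidate taxon instead of pretending the match is exact.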

80 A dream A system for managing intelligent species links would maximise the potential of the plethora of species-based catalogues, indexes and rich species resources currently being assembled all over the world Perhaps on the Web, as with the current Spice/Species 2000 prototypes Or...

81 The Grid The Grid is often thought of as a new toy for particle physicists, with –very high bandwidth –distributed computational resources But it also provides opportunities for more structured and reliable access to data and information sources, using improved protocols with metadata –For example, access to such knowledge sources as these cross-maps

82 Using biodiversity information resources Helping Biodiversity Researchers to do their Work Collaborative e-Science and Virtual Organisations

83 Biodiversity analysis and modelling Scientists working with biodiversity information employ a wide variety of resources (data sources, statistical analysis and modelling tools, presentation or visualisation software), which may be available on various local and remote computer platforms.

84 Examples of biodiversity resources Data sources: Names: Species 2000 & ITIS Catalogue of Life Data: GBIF, sequence databases Geography: Gazetteers Collections and distributions: BioCASE, MaNIS Analysis tools: Statistical and multivariate analysis Modelling Visualisation

85 Use of resources together Scientists frequently need to use several of these resources in sequence to carry out their research. Much effort is currently expended in: initially acquiring resources; installing and sometimes adapting them to run on the user’s own machine; converting and transporting data sets between stages of the analysis process.

86 Biodiversity research Biologists are working to understand the adaptation of organisms to their environmental niche, and to predict their interactions with their environment, eventually by combining knowledge of all the levels of biological organisation: genome, transcription, proteome, metabolic pathways, cell, tissue, organ, individual whole organism, population, species, evolutionary pathways.

87 Workflows Resources are called into use in an appropriate sequence from an interactive workflow. The facility for scientists to be able to create their own workflows, without the need for regular assistance from computer scientists, is an essential part of the BDWorld system. Accessible tools for resource discovery and for workflow design, enactment and re-use are therefore required.

88 For example Changes in distribution in response to climate changes brought about by global warming

89 CSM: Climate-space modelling Modelling and predicting changes in distribution in response to climate changes such as those brought about by global warming An unreasonably brief explanation: Get current distribution of a species (e.g. specimen records) Get current or recent climate data for those localities Calculate a model for the climate space the species can occupy Predict the distribution the species would have in any specified climate (may be different to the climate used above) Project back on world map
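The steps above can be illustrated with the simplest possible climate-space model: a rectangular envelope given by the minimum and maximum of each climate variable at the localities where the species is found. The slides do not say which modelling method was actually used, so this envelope is a stand-in chosen for brevity, and all the climate values, grid cells and function names are invented.

```python
def fit_envelope(climate_at_localities):
    """Fit the model: the (min, max) of each climate variable at known localities."""
    variables = climate_at_localities[0].keys()
    return {
        v: (
            min(c[v] for c in climate_at_localities),
            max(c[v] for c in climate_at_localities),
        )
        for v in variables
    }


def suitable(envelope, climate):
    """True if every climate variable falls inside the fitted envelope."""
    return all(low <= climate[v] <= high for v, (low, high) in envelope.items())


def predict(envelope, climate_grid):
    """Return the grid cells whose climate lies inside the envelope."""
    return sorted(
        cell for cell, climate in climate_grid.items() if suitable(envelope, climate)
    )


# Invented climate values at the localities where the species was recorded.
observed = [
    {"temp": 18.0, "rain": 900.0},
    {"temp": 22.0, "rain": 1200.0},
    {"temp": 20.0, "rain": 1000.0},
]

# Invented climate surfaces on a tiny grid: current, and a warmer scenario.
current = {
    (0, 0): {"temp": 19.0, "rain": 1000.0},
    (0, 1): {"temp": 25.0, "rain": 1000.0},
    (1, 0): {"temp": 17.0, "rain": 950.0},
}
warmer = {cell: {"temp": c["temp"] + 3.0, "rain": c["rain"]} for cell, c in current.items()}

envelope = fit_envelope(observed)
now = predict(envelope, current)
later = predict(envelope, warmer)
```

In this toy case cell (1, 0) is too cold today but becomes suitable in the warmer scenario. The returned cells are what the final step would project back on to a base map.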

90 Example work-flow (Climate-space Modelling) SPICE: submit scientific name; retrieve accepted name & synonyms for species. Localities: retrieve distribution maps for species of interest. Climate: climate surfaces, and possibly different climate surfaces (e.g. predicted climate) for the prediction. Climate Space Model: model of climatic conditions where species is currently found. Base Maps: world or regional maps. Prediction: prediction of suitable regions for species of interest. Projection: projection of predicted distribution on to base map.

91 Triana screen-shots 1 Creation (design, editing)

92 Triana screen-shots


95 Triana screen-shots 2 Execution (enactment, run-time)

96 Triana screen-shots


100 And finally …

101 Triana screen-shots

102 Elements of the BDWorld system What did the system have to do to make that example happen?

103 Role of the work-flow engine Create and edit a workflow –locate an appropriate resource –check interoperability –arrange any necessary transformations –record provenance of generated data sets Execute a workflow, passing data sets to and fro Create a log or ‘lab book’ for user
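A toy version of these responsibilities is sketched below: steps declare what kind of data they consume and produce, the workflow refuses to chain incompatible steps, and execution writes a simple log. This is not Triana or the BDWorld engine. The class names, the type labels and the stand-in step functions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One resource in a workflow, with metadata on its input and output types."""
    name: str
    run: Callable
    consumes: str
    produces: str


@dataclass
class Workflow:
    steps: List[Step] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add(self, step):
        """Append a step after checking it can accept the previous step's output."""
        if self.steps and self.steps[-1].produces != step.consumes:
            raise TypeError(
                f"{step.name} consumes {step.consumes!r} but "
                f"{self.steps[-1].name} produces {self.steps[-1].produces!r}"
            )
        self.steps.append(step)
        return self

    def execute(self, data):
        """Run the steps in order, passing each result on and logging provenance."""
        for step in self.steps:
            data = step.run(data)
            self.log.append(f"{step.name}: {step.consumes} -> {step.produces}")
        return data


# Stand-in resources; real ones would call remote services.
resolve_name = Step(
    "resolve name",
    lambda name: {"Abrus cyaneus": "Abrus precatorius"}.get(name, name),
    consumes="name",
    produces="accepted_name",
)
fetch_localities = Step(
    "fetch localities",
    lambda name: [f"locality record for {name}"],
    consumes="accepted_name",
    produces="localities",
)

workflow = Workflow().add(resolve_name).add(fetch_localities)
result = workflow.execute("Abrus cyaneus")
```

The `consumes` and `produces` labels are the minimum metadata needed for an interoperability check. A fuller engine would use them to suggest transformations rather than simply refusing.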

104 Difficulties with resources Finding the resources Knowing how to use these heterogeneous resources –Originally constructed for various reasons, often with little attention to standards or interoperability –Have to pass data sets from one to another –Some involve user interaction

105 Role of metadata Metadata is needed to enable discovery of resources and to indicate how they are to be used. Properties to help locate appropriate resources Check interoperability, suggest transformations Provenance of data sets Log of work-flows executed

106 What is biodiversity informatics? The preceding project, among others, shows that the challenges facing biodiversity informatics include not only – Describing the diversity of life at all levels of organisation, so that biologists can understand, conserve and exploit it, But also – Inventing ways to describe the ever-increasing diversity of information resources and analysis tools available, so that users can find and use them

107 A challenge to link resources It is potentially very difficult to link all these resources together Much attention is currently being given to: –Providing unique identifiers for data objects –Which can return metadata about themselves –Which can be stitched together into a distributed collaborative information system: see the biodiversity informatics organisations TDWG and GBIF (later)

108 Part Three Biodiversity organisations Current developments relating to biodiversity informatics

109 Biodiversity organisations Examples include: Database level: ILDIS (International Legume Database and Information Service) - www.ildis.org Data portal level: – Catalogue of Life - www.sp2000.org – Encyclopedia of Life – www.eol.org – GBIF (Global Biodiversity Information Facility) - www.gbif.org Standards, etc.: TDWG (Taxonomic Databases Working Group) – www.tdwg.org International agencies: CBD, CITES, WCMC, etc. Lots more

110 Current developments Encyclopedia of Life Biodiversity Heritage Library DNA barcoding Persistent identifiers

111 Encyclopedia of Life “A new project to create an online reference source and database for every one of the 1.8 million species that are named and known on this planet” http://www.eol.org/


113 Biodiversity Heritage Library Ten major natural history museum libraries, botanical libraries, and research institutions are digitising the published literature of biodiversity held in their collections and making it available through a global “biodiversity commons”. The data can be accessed using Web Services. Now online: 1,479 titles, 4,607 volumes, 2,017,577 pages. http://biodiversitylibrary.org

114 DNA barcoding “DNA barcoding is a new technique that uses a short DNA sequence from a standardized and agreed-upon position in the genome as a molecular diagnostic for species-level identification” “DNA barcode sequences are very short relative to the entire genome and they can be obtained reasonably quickly and cheaply” The “Folmer region” at the 5' end of the cytochrome c oxidase subunit 1 mitochondrial region (COI)
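Identification with a barcode amounts to comparing a query sequence against a reference library and accepting the closest match if it is close enough. The sketch below uses the uncorrected p-distance on pre-aligned sequences. The 20-base toy sequences, the species labels and the 3% threshold are assumptions for illustration only. Real barcodes are much longer, and real systems such as BOLD use their own matching methods, which are not reproduced here.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore sites where either sequence has a gap or an ambiguous base.
    compared = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def identify(query, reference, threshold=0.03):
    """Return (species or None, distance) for the closest reference sequence."""
    best_species, best_distance = min(
        ((species, p_distance(query, seq)) for species, seq in reference.items()),
        key=lambda pair: pair[1],
    )
    if best_distance <= threshold:
        return best_species, best_distance
    return None, best_distance


# Toy reference library (invented sequences, far shorter than a real barcode).
REFERENCE = {
    "species A": "ACGTACGTACGTACGTACGT",
    "species B": "ACGTTCGTACCTACGAACGT",
}

exact = identify("ACGTACGTACGTACGTACGT", REFERENCE)
distant = identify("ACGTACGTACGTACGTACGA", REFERENCE)
```

The second query differs from species A at one site in twenty, which exceeds the assumed threshold, so no identification is returned even though species A is still the nearest match.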

115 DNA barcoding organisations Consortium for the Barcode of Life (CBOL) –http://barcoding.si.edu Barcode sequences are submitted to GenBank Barcode of Life Data Systems (BOLD) – “an online workbench that aids collection, management, analysis, and use of DNA barcodes” –http://www.barcodinglife.org – “as of 28 January 2008, there are 341,825 barcode records from 35,798 species in the Barcode of Life Database”

116 Persistent identifiers GBIF’s “LSID-GUID Task Group” Introduction (Riccardi, White & Ó Tuama, 2009): www.tdwg.org/proceedings/article/view/568 Report: “Adoption of Persistent Identifiers for Biodiversity Informatics”, Cryer, Hyam, Miller, Nicolson, Ó Tuama, Page, Rees, Riccardi, Richards, White, 2009: www2.gbif.org/Persistent-Identifiers.pdf
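As a small illustration of one identifier scheme named above, the sketch below parses a Life Science Identifier (LSID), whose general form is `urn:lsid:authority:namespace:object`, optionally followed by `:revision`. The function and field names are assumptions, and the sample identifier uses the made-up authority `example.org`. Resolving an LSID to its data or metadata is a separate step that is not shown.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID string into its parts, raising ValueError if malformed."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(not p for p in parts[2:5]):
        raise ValueError(f"empty authority, namespace or object: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Made-up identifier for illustration.
sample = parse_lsid("urn:lsid:example.org:names:12345")
```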

117 End

