Describing Bioinformatic Metadata at EBI James Malone


1 Describing Bioinformatic Metadata at EBI James Malone malone@ebi.ac.uk

2 Cross-Domain Data available from EBI
- Genomes
- DNA & RNA sequence
- Gene expression
- Protein sequence
- Protein families, motifs and domains
- Protein structure
- Protein interactions
- Chemical entities
- Pathways
- Systems
- Literature and ontologies

3 The Sorts of Data we Serve
We manage databases of biological data such as nucleic acid sequences, protein sequences and macromolecular structures:
- ENA: nucleotide sequencing information
- UniProt: protein sequence and functional information
- ArrayExpress: functional genomics data repository
- Ensembl: genome information for vertebrates and other eukaryotes
- InterPro: database of predictive protein "signatures"
- PDBe: data resource on biological macromolecular structures

4 Sorts of Metadata we need
- Low complexity – high volume (genome sequencing)
- High complexity – low volume (mouse phenotyping)
- 1000 Genomes data is within an order of magnitude of physics data volumes
- Provenance models
- Experimental variables
- Publication details
- Synonyms and domain-specific language
- Cross-domain mappings
- Metadata has existed and been captured for a while, e.g. InterPro IDs


6 Metadata: Minimum Information Standards
Minimum Information Standards specify the minimum amount of metadata (and data) required to meet a specific aim (usually reporting data or submitting to a public repository):
- MIAME: Minimum Information About a Microarray Experiment
- MIARE: Minimum Information About an RNAi Experiment
- MIAPE: Minimum Information About a Proteomic Experiment
- MIFlowCyt: Minimum Information about a Flow Cytometry Experiment
- ISA: cross-domain experiment reporting
Some public repositories require some level of conformance, e.g. ArrayExpress – MIAME scoring
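To make the idea of scoring a submission against a minimum information checklist concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not ArrayExpress's actual MIAME scoring: the field names in `REQUIRED_FIELDS`, the `ComplianceReport` type and the `check_submission` helper are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical checklist: these field names are illustrative only and are
# NOT the official MIAME requirements.
REQUIRED_FIELDS = (
    "raw_data",
    "processed_data",
    "sample_annotation",
    "experimental_design",
    "array_annotation",
    "protocols",
)


@dataclass
class ComplianceReport:
    present: list  # required fields that were supplied and non-empty
    missing: list  # required fields that were absent or empty

    @property
    def score(self) -> int:
        """Number of required fields satisfied (an assumed scoring rule)."""
        return len(self.present)


def check_submission(metadata: dict, required=REQUIRED_FIELDS) -> ComplianceReport:
    """Split the required fields into those supplied and those missing."""
    present = [field for field in required if metadata.get(field)]
    missing = [field for field in required if not metadata.get(field)]
    return ComplianceReport(present, missing)
```

A repository could use a report like this to tell a submitter which checklist items are still missing, rather than rejecting the submission outright.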

7 Ontologies
- A method of representing knowledge in which concepts are described both by their meaning and by their relationships to each other
- An increasingly important component for formalising metadata
- Thriving bio-ontology community:
  - e.g. Gene Ontology: ‘project to standardise the representation of gene and gene product attributes’
  - e.g. ChEBI: ‘ontology of molecular entities focused on small chemical compounds’
  - e.g. Ontology of Biomedical Investigations: ‘ontology to describe experimental protocols from inception to analysis’

8 Metadata that is Interoperable
- Goal: an interoperable set of community reference ontologies
- Consumed by application ontologies for specific needs, e.g. Experimental Factor Ontology @ www.ebi.ac.uk/efo
- Anatomy Reference Ontology
- Cell Type Ontology
- Chemical Entities of Biological Interest (ChEBI)
- Various Species Anatomy Ontologies
- Relation Ontology
- Disease Ontology

9 Applying Ontologies in Data Curation @ www.ebi.ac.uk/gxa
- Query for cell adhesion genes in all ‘organism parts’
- ‘View on EFO’
Ontologically Modeling Sample Variables in Gene Expression Data malone@ebi.ac.uk
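A query over all ‘organism parts’ relies on the ontology hierarchy: a sample annotated with a specific term should also match a query for any of its ancestors. The following is a minimal sketch of that expansion in Python. The toy hierarchy, the gene names and the helper functions are all hypothetical; they are not real EFO content or the Atlas implementation.

```python
# Toy is-a hierarchy (child -> parent). Illustrative only, not real EFO content.
PARENT = {
    "liver": "organism part",
    "brain": "organism part",
    "cerebellum": "brain",
    "HeLa": "cell line",
}

# Hypothetical sample annotations: (gene, ontology term the sample is annotated with).
ANNOTATIONS = [
    ("geneA", "cerebellum"),
    ("geneB", "liver"),
    ("geneC", "HeLa"),
]


def descendants(term, parent=PARENT):
    """Return the term itself plus every term that reaches it via is-a links."""
    result = {term}
    changed = True
    while changed:  # repeat until no new child can be added
        changed = False
        for child, par in parent.items():
            if par in result and child not in result:
                result.add(child)
                changed = True
    return result


def genes_for(term, annotations=ANNOTATIONS):
    """Genes whose samples are annotated with the term or any of its subclasses."""
    terms = descendants(term)
    return sorted(gene for gene, annotated in annotations if annotated in terms)
```

With this toy data, `genes_for("organism part")` picks up samples annotated with "liver" and "cerebellum" even though neither was annotated with "organism part" directly.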

10 Strategies for Integrating Multi-Domain Data
Consuming reference ontologies and mapping to multiple ontologies where overlap exists offers us maximum interoperability.
[Diagram labels: RDF triple; Query; Atlas; Swiss-Prot; Amino Acid Ontology]
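The slide's diagram suggests querying RDF triples that link Atlas, Swiss-Prot and the Amino Acid Ontology. As a rough sketch of how such a cross-resource query can work, the Python below stores triples as plain tuples and joins two triple patterns. Every identifier and predicate here (`atlas:gene1`, `sp:PROT1`, `ex:encodes`, `ex:containsResidue`, `aao:Lysine`) is an assumption made for illustration, and a real deployment would use an RDF store and SPARQL rather than this in-memory matcher.

```python
# Hypothetical triples (subject, predicate, object); all identifiers are made up.
TRIPLES = [
    ("atlas:gene1", "ex:encodes", "sp:PROT1"),
    ("sp:PROT1", "ex:containsResidue", "aao:Lysine"),
    ("atlas:gene2", "ex:encodes", "sp:PROT2"),
    ("sp:PROT2", "ex:containsResidue", "aao:Glycine"),
]


def match(pattern, triples=TRIPLES):
    """Return triples matching a (s, p, o) pattern; None acts as a wildcard."""
    s, p, o = pattern
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def genes_with_residue(residue, triples=TRIPLES):
    """Join two patterns: gene -encodes-> protein -containsResidue-> residue."""
    proteins = {s for s, _, _ in match((None, "ex:containsResidue", residue), triples)}
    return sorted(
        gene
        for gene, _, protein in match((None, "ex:encodes", None), triples)
        if protein in proteins
    )
```

The join works only because both resources refer to the same protein identifier, which is the practical payoff of mapping to shared reference ontologies and identifiers.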

11 ELIXIR Report
Data Integration & Interoperability Recommendations – Jul 2009
- ELIXIR should build a distributed data infrastructure based on a Service Oriented Architecture using web service (WS) technology
- Ontologies are needed in the areas of disease, anatomy and taxon
- Annotation systems for associating data with metadata
- Pan-domain coordination and funding for reporting standards

12 Current Challenges
- Literature – data gap
- Curation relatively slow, more advanced tooling required
- Ontologies not interoperable yet and more needed
- Bio-ontology funding
- New high-throughput methods
- Assays
- Experiments

13 Challenges: Scaling
- World-wide sequencing data production is now just an order of magnitude behind CERN
- The Large Hadron Collider produces 15 petabytes per year from a single point source
- The LHC grid is 140 computer centres in 33 countries, centred at CERN (Tier 0)
- Sequencing is producing data in hundreds of centres in dozens of countries, with Tier 0 sites (EBI & NCBI)
- More than 150 terabytes of 1000 Genomes data are in the Short Read Archive, representing more than half of all the data in the archive
Slide: Laura Clarke, EBI

14 Summary
- EBI uses a combination of metadata strategies
- Minimum Information standards are useful for reporting
- Ontologies provide a powerful method for describing domain knowledge
- Ontologies also allow community consensus to be built, as well as strategies for data integration
- ELIXIR suggests:
  - Infrastructures should be WS compatible
  - Annotation tools are required
  - Pan-domain coordination is essential

15 Acknowledgements
- Ontology creation: James Malone, Tomasz Adamusiak, Ele Holloway, Helen Parkinson, Jie Zheng (U Penn)
- Atlas GUI development: Misha Kapushesky, Pasha Kurnosov, Anna Zhukova, Nikolay Kolesinkov
- External review and anatomy: Jonathan Bard, Jie Zheng
- ArrayExpress Production Staff
- EBI Rebholz Group (Whatizit text mining tool)
- Many source ontologies for terms and definitions, esp. Disease Ontology, Cell Type Ontology, FMA, NCIT, OBI
- Funders: EC (Gen2Phen, FELICS, MUGEN, EMERALD, ENGAGE, SLING), EMBL, NIH
- Eric Neumann, Joanne Luciano and Alan Ruttenberg
- HCLS Group: Eric Prud'hommeaux and Scott Marshall
Developing an Ontology from the Application Up malone@ebi.ac.uk

