Ontology-oriented databases: Chado and OBD Chris Mungall Lawrence Berkeley Labs.

Outline
Chado
  – GMOD & Model Organism Databases
  – Genomics data in Chado using SO
OBD
  – NCBO & OBD Requirements
  – RDF and the semantic web
  – SPARQL endpoints

Chado: what is it?
A relational database schema for biological data
Part of the Generic Model Organism Database (GMOD) project
  – Interoperable tools for Model Organism Databases
Chado was originally built for MODs

A brief introduction to MODs
Some Model Organism Databases:
  – FlyBase (D melanogaster)
  – WormBase (C elegans)
  – MGD (M musculus)
  – …
What does a MOD organisation do?
  – Curate and integrate data on a specific species or taxon
  – Provide a web portal for the community
What are the database requirements for a MOD?

Must store representations of genes and genomic entities
  – Sequence data
  – Exon-intron structure
  – Noncoding genes
  – Curated and computed features
  – Entities with unusual transcriptional properties
  – And more…

Must store other data types pertinent to that organism
Including, but not limited to:
  – Expression
  – Interaction
  – Genetic and phenotypic
Priorities amongst MODs differ
  – Different MOs have different biological and experimental characteristics
  – E.g. D melanogaster and genetics

Must house rich annotation data using ontologies
  – GO (Gene Ontology)
  – Anatomical Ontologies
  – Phenotype Ontologies

Must track provenance and evidence for data
MOD data is often curated from the literature
Other sources
  – Computes
  – High throughput data
  – Imaging

Must be an integrated source of data
Must drive Web Portal
Links out to external resources
  – GO, Ensembl, UniProt, …
  – Substantial amount of records managed locally in single integrated database

Origins of Chado
Chado was originally developed for FlyBase
  – Integration of GadFly (Berkeley) and previous FlyBase database
Chado was later adopted by GMOD and some other individual MODs
  – Popular amongst ‘newer’ MODs, e.g. Paramecium
Also used outside the MOD community
  – TIGR
  – Janelia Farm Research Campus

Chado key concepts
Tightly Integrated
  – foreign key relations between entities
  – Contrast with federated model
Module System
  – New modules can be ‘slotted in’
  – Some modules are mandatory
Generic and extensible
  – uses ontologies and terminologies for typing
  – Highly normalised
Community & open source

Chado modules
Core
  – general (dbxrefs)
  – cv (ontologies)
  – pub (bibliographic)
  – audit
Domains
  – sequence (genomics)
  – phenotype
  – expression
  – RAD
  – map
  – genetic
  – phylogeny
  – organism
  – event

Identifiers: dbxrefs
All public records identified using bipartite scheme
  – Not just external cross-references
  – DB Authority must be specified
Distinct table
  – Can be associated with URIs
(db, accession, version[optional])
Records can also get secondary dbxrefs
Examples:
  – GO: , FlyBase:FBgn
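
The bipartite scheme on this slide can be sketched in a few lines of Python. This is only an illustration: the `Dbxref` class, the `parse_dbxref` helper and the `ExampleDB:X0001` identifier are hypothetical and are not part of Chado; only the (db, accession, version[optional]) shape comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dbxref:
    db: str                        # the DB authority, which must be specified
    accession: str                 # local identifier within that authority
    version: Optional[str] = None  # optional, as on the slide


def parse_dbxref(identifier: str, version: Optional[str] = None) -> Dbxref:
    """Split 'DB:accession' at the first colon; both parts are required."""
    db, sep, accession = identifier.partition(":")
    if not sep or not db or not accession:
        raise ValueError(f"not a bipartite identifier: {identifier!r}")
    return Dbxref(db, accession, version)


# "ExampleDB" and "X0001" are made-up values for illustration only.
ref = parse_dbxref("ExampleDB:X0001")
print(ref.db, ref.accession)
```

Rejecting identifiers without an authority mirrors the "DB Authority must be specified" requirement.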

Ontologies and terminologies are central to Chado
Ontology: a formal representation of some portion of biological reality
  – what kinds of things exist?
  – what are the relationships between these things?
[Figure: example ontology graph with nodes eye, ommatidium, sense organ and eye disc, linked by is_a, part_of and develops from relations]

Ontologies: cv module
Based on GO DB Schema and OBO format spec
Key concepts
  – cvterm (a term, or class in an ontology)
  – cvterm_relationship: DAGs, subject-predicate-object
  – cv (an ontology or terminology)
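
A heavily simplified sketch of these three tables follows, using Python's built-in SQLite driver so it runs anywhere. The column subset, the `add_term` helper and the cv names "relationship" and "sequence" are assumptions made for illustration; the real Chado schema carries more columns and constraints. The two relationship rows use statements that appear elsewhere in these slides (exon is_a transcript region, exon part_of transcript).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cv (cv_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    cv_id INTEGER NOT NULL REFERENCES cv(cv_id),
    name TEXT NOT NULL
);
CREATE TABLE cvterm_relationship (
    cvterm_relationship_id INTEGER PRIMARY KEY,
    type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),    -- the predicate
    subject_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
    object_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
""")


def add_term(cv_name, term_name):
    """Hypothetical helper: create the cv if needed, then insert one term."""
    conn.execute("INSERT OR IGNORE INTO cv (name) VALUES (?)", (cv_name,))
    cv_id = conn.execute(
        "SELECT cv_id FROM cv WHERE name = ?", (cv_name,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO cvterm (cv_id, name) VALUES (?, ?)", (cv_id, term_name)
    )
    return cur.lastrowid


# Predicates are themselves cvterms, so relationships are triples of term ids.
is_a = add_term("relationship", "is_a")
part_of = add_term("relationship", "part_of")
exon = add_term("sequence", "exon")
region = add_term("sequence", "transcript region")
transcript = add_term("sequence", "transcript")

conn.executemany(
    "INSERT INTO cvterm_relationship (type_id, subject_id, object_id) "
    "VALUES (?, ?, ?)",
    [(is_a, exon, region), (part_of, exon, transcript)],
)

rows = conn.execute("""
    SELECT s.name, t.name, o.name
    FROM cvterm_relationship r
    JOIN cvterm s ON s.cvterm_id = r.subject_id
    JOIN cvterm t ON t.cvterm_id = r.type_id
    JOIN cvterm o ON o.cvterm_id = r.object_id
    ORDER BY r.cvterm_relationship_id
""").fetchall()
print(rows)
```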

Subset of Sequence Ontology
Subject            Type      Object
exon               is_a      transcript region
                   part_of   transcript

Genomics: Sequence module
Some key concepts (a subset):
  – feature: a genomic entity (gene, intron, SNP, chromosome, ...)
  – featureloc: a relative location in sequence coordinates
  – feature_relationship: a pairwise relation between two features, e.g. exon to transcript
  – featureprop: tag-value data for a feature
  – feature_cvterm: ontology-based annotation

Feature table
Features have sequences
  – Sequences are not independent entities
  – Embedded in feature table
All features reside in the same table
  – Genes, exons, chromosomes, SNPs, ...
  – Typed using Sequence Ontology (SO)
Optional extra: automatically generated SQL view layer

Feature Graphs: the feature_relationship table
Feature graphs (FGs)
  – Subject-predicate-object
  – Predicates (types) are cvterms

Example: alternately spliced gene
7 features:
  – 1 gene
  – 2 transcripts
  – 4 exons
Subject          Predicate   Object
A (transcript)   part_of     G (gene)
B (transcript)   part_of     G (gene)
1 (exon)         part_of     A (transcript)
2 (exon)         part_of     B (transcript)
3 (exon)         part_of     A (transcript)
3 (exon)         part_of     B (transcript)
4 (exon)         part_of     A (transcript)
Not shown:
  – polypeptide
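
To make the feature graph concrete, the rows of this slide's table can be held as plain subject-predicate-object tuples and queried in Python. This is a sketch, not Chado code: in Chado these rows live in the feature_relationship table and the predicate is a cvterm, whereas here everything is an in-memory string.

```python
from collections import defaultdict

# (subject, predicate, object) rows copied from the slide's table.
feature_relationships = [
    ("A", "part_of", "G"),
    ("B", "part_of", "G"),
    ("1", "part_of", "A"),
    ("2", "part_of", "B"),
    ("3", "part_of", "A"),
    ("3", "part_of", "B"),
    ("4", "part_of", "A"),
]

# Index the graph by object so we can ask "what are the parts of X?".
parts = defaultdict(list)
for subject, predicate, obj in feature_relationships:
    if predicate == "part_of":
        parts[obj].append(subject)

# Transcripts are the parts of the gene; exons are the parts of each transcript.
exons_by_transcript = {t: sorted(parts[t]) for t in parts["G"]}
shared_exons = set(parts["A"]) & set(parts["B"])

print(exons_by_transcript)
print(shared_exons)
```

Exon 3 appears once as a feature but twice as a subject, which is how a single table row set represents alternative splicing without duplicating the exon.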

Feature graph configurations are constrained by SO
SO determines ontological relations between features
  – E.g. exon part_of transcript
Standard rules for is_a
  – E.g. X is_a Y, Y part_of Z => X part_of Z
  – See OBO Relation ontology
Rules must be encoded outside standard relational schema
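
The rule "X is_a Y, Y part_of Z => X part_of Z" can be sketched outside the relational schema as a small Python function. This is one possible encoding under stated assumptions, not the author's implementation: the function name is hypothetical, is_a is treated as transitive, part_of transitivity is deliberately left out, and the "coding exon" term and the "transcript region part_of transcript" assertion are illustrative inputs.

```python
def infer_part_of(is_a, part_of):
    """Return asserted plus inferred part_of pairs.

    Rule applied: X is_a Y and Y part_of Z  =>  X part_of Z.
    is_a is first closed under transitivity so the rule also fires
    for indirect subclasses.
    """
    is_a_closure = set(is_a)
    changed = True
    while changed:
        changed = False
        for x, y in list(is_a_closure):
            for y2, z in list(is_a_closure):
                if y == y2 and (x, z) not in is_a_closure:
                    is_a_closure.add((x, z))
                    changed = True

    inferred = set(part_of)
    for x, y in is_a_closure:
        for y2, z in part_of:
            if y == y2:
                inferred.add((x, z))
    return inferred


# Illustrative input only; "coding exon" is a hypothetical term.
is_a = {("coding exon", "exon"), ("exon", "transcript region")}
part_of = {("transcript region", "transcript")}

result = infer_part_of(is_a, part_of)
print(sorted(result))
```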

Declarative programming: SQL Functions
Powerful, but optional
  – PostgreSQL only (can be ported)
Separation of interface from implementation
  – Sequence operations: transcription, translation
  – Feature Graph operations: deduction of implicit features (e.g. introns)
  – Location Graph operations: projection, mereological relations
Related: Tata S, Patel JM, Friedman JS, and Swaroop A. Declarative querying for biological sequence databases. Proc of the 22nd International Conference on Data Engineering (ICDE), April 3-7, Atlanta, GA, 2006.
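
As an illustration of "deduction of implicit features", introns can be derived as the gaps between the exons of one transcript. The slide describes this as a PostgreSQL function; the Python below is a stand-in sketch with a hypothetical function name, and it assumes half-open (interbase-style) coordinates and made-up exon positions.

```python
def deduce_introns(exons):
    """Return the gaps between consecutive exons of one transcript.

    `exons` is an iterable of (fmin, fmax) pairs in half-open coordinates,
    so an exon ending at 200 and the next starting at 300 imply an
    intron spanning (200, 300).
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:  # abutting or overlapping exons leave no gap
            introns.append((prev_end, next_start))
    return introns


# Made-up coordinates for illustration.
introns = deduce_introns([(300, 400), (100, 200), (550, 700)])
print(introns)
```

Because introns are computed on demand, they never need to be stored as rows, which is the point of keeping them implicit.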

Chado: ongoing work
Chado for phenotype (EQ) data
  – With FlyBase, ZFIN, DictyBase
Chado for evolutionary science
  – In collaboration with NESCENT
Documentation!
  – Helpdesk (NESCENT)
More GMOD integration
  – Unified Architecture for GMOD?
Latest OBO format features
  – Allow for post-composition of complex terms

NCBO: OBO and OBD
OBO: Open Bio Ontologies
NCBO BioPortal; access to:
  – OBO ontologies
  – OBD annotations
Current DBPs
  – Fly & fish mutant phenotype annotation: linking to disease
  – HIV clinical trial analysis

OBD: Storing biomedical annotations
Requirements different from Chado
Domain scope
  – All of biology and biomedicine
Ontologies used for annotation
  – Not just OBO
Data integration
  – Index minimum amount of data
  – Link to external data where appropriate
  – Provide and use data services
Requirements partially met by semantic web technology

The Semantic Web Datamodel
Based on RDF triples
  – Subject-predicate-object
  – Each element is a URI
Various serialisations:
  – RDF/XML
  – N3, N-Triples
Multiple APIs, QLs and storage options
RDF Graphs constrained by ontologies
  – Expressed in RDF Schema, OWL
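
A minimal sketch of the triple model, following the slide's simplification that every element is a URI: each statement is a (subject, predicate, object) tuple, and N-Triples writes one statement per line with each URI in angle brackets and a terminating full stop. The `to_ntriples` helper and the example.org URIs are hypothetical placeholders, not real OBD or OBO identifiers.

```python
def to_ntriples(triples):
    """Serialise (subject, predicate, object) URI triples as N-Triples text."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Placeholder URIs for illustration only.
BASE = "http://example.org/"
triples = [
    (BASE + "exon", BASE + "is_a", BASE + "transcript_region"),
    (BASE + "exon", BASE + "part_of", BASE + "transcript"),
]

document = to_ntriples(triples)
print(document)
```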

OBD ‘Schema’: formal ontology of annotation Within OBO Foundry Framework - uses OBO upper ontology

Implementing OBD using SemWeb technology
OBD-Sesame
  – 3rd party triplestore
  – Relational or in-memory
  – Lacks native OWL support
  – Performance issues
OBD-SQL
  – Developed at Berkeley
  – Reuse Chado methodology, code
  – ‘Triplestore’ with extras: reduces triple overhead with common patterns

Wrapping databases as SPARQL endpoints
A lot of data in existing relational databases like Chado
  – Goal: make available as distributed resource in OBD compliant way
  – Solution: d2rq declarative mappings and SPARQL
Progress:
  – GO Database SPARQL endpoint:
  – Chado and OBD mappings coming soon
Application:
  – Integration of annotations through genome dashboard

Usage scenario: AJAX Gbrowse
[Figure: architecture diagram with components GO annotations, OBD disease/pheno annotations, genome server, MOD, D2rq, DAS and Sesame; connections labelled Annotation info, sparql and DAS/2]

Conclusions
Flexible hypernormalized schemas
  – Performance penalties
  – Too much freedom of expression? Ontologies + reasoners provide some constraints, e.g. SO
  – Open world assumption
Federation vs tight integration
  – Tight integration is required for MODs
  – As more data types become available, dynamic integration will be key: RDF and SPARQL is one solution

Thanks
LBL: Shengqiang Shu, Mark Gibson, Nicole Washington, Seth Carbon, John Day Richter, Chris Smith, Karen Eilbeck, Sima Misra, Suzanna Lewis
FlyBase: Dave Emmert, Pinglei Zhou, Peili Zhang, Aubrey de Grey, Paul Leyland, William Gelbart
HHMI: Gerry Rubin
GMOD, Nescent: Scott Cain, Sohel Merchant, Eric Just, Sierra Moxon, Andrew Uzilov, Brian Osborne, Ian Holmes, Lincoln Stein

end

Feature localisation
Interbase
  – Simplifies code
All localisations relative
  – Location Graph (LG)
  – Recursive/nested locations allowed
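
One way to see why interbase coordinates simplify code: positions count the gaps between bases rather than the bases themselves, so length is a plain subtraction and zero-length sites are representable. The helper names below are hypothetical; `fmin` and `fmax` follow the featureloc column naming, and the coordinates are made up.

```python
def to_interbase(start, end):
    """Convert a 1-based inclusive range to interbase (fmin, fmax)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based range: {start}..{end}")
    return start - 1, end


def span_length(fmin, fmax):
    """Length needs no +1 correction in interbase coordinates."""
    return fmax - fmin


def overlaps(a, b):
    """Two interbase spans overlap only if they share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


first_ten = to_interbase(1, 10)    # bases 1..10
next_ten = to_interbase(11, 20)    # bases 11..20
insertion_site = (10, 10)          # the gap between base 10 and base 11

print(first_ten, next_ten, span_length(*insertion_site))
```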

Recursive location graphs
Locations can be nested
  – Finished genomes typically flat; depth(LG)=1
  – Unfinished genomes, heterochromatin may require 2 (rarely more) levels: features located relative to contigs, contigs located relative to chromosomes
  – May be a requirement to change coordinates at each level independently

Nested LGs
Columns: Feature, Loc, Srcfeature, group
exon [+]contig [+]chrom10 exon [+]chrom11
Redundant localisations can be used to ‘flatten’ LG
Group>0 indicates denormalised/flattened LG; it must be recalculated if group=0 coordinates change
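
Flattening a nested location graph amounts to projecting a location through its parent's location. The sketch below assumes interbase coordinates and a strand of +1 or -1; the `Loc` type, the `project` function and all coordinates are hypothetical, with field names borrowed from the featureloc columns.

```python
from typing import NamedTuple


class Loc(NamedTuple):
    srcfeature: str  # the feature this location is relative to
    fmin: int
    fmax: int
    strand: int      # +1 or -1


def project(inner: Loc, outer: Loc) -> Loc:
    """Re-express `inner` in the coordinate system of `outer.srcfeature`.

    `inner` must be located on the feature whose location is `outer`,
    e.g. an exon on a contig, where the contig sits on a chromosome.
    """
    if outer.strand >= 0:
        fmin = outer.fmin + inner.fmin
        fmax = outer.fmin + inner.fmax
    else:
        # On the reverse strand the inner span is mirrored within the outer one.
        fmin = outer.fmax - inner.fmax
        fmax = outer.fmax - inner.fmin
    return Loc(outer.srcfeature, fmin, fmax, inner.strand * outer.strand)


# Made-up coordinates for illustration.
exon_on_contig = Loc("contig", 10, 20, +1)
contig_on_chrom = Loc("chrom", 1000, 1500, +1)
flattened = project(exon_on_contig, contig_on_chrom)
print(flattened)
```

A flattened location computed this way is redundant with the nested one, which is why it has to be recomputed whenever the underlying contig coordinates change.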

Relational featurelocs
A relation between two or more locations
  – Matches, sequence variants
  – Indicated using rank column
Use case: SNPs
  – Simple way to query for variants introducing premature termination of translation
  – Combine relational featurelocs and redundant featurelocs
3+ featureloc pairs:
  – Sequence of SNP on reference and variant genome (+ location on reference)
  – Same on transcripts
  – Same on polypeptides

OWL entailment genomics use case
SO defines ‘TE gene’ as:
  – A SO:gene which is part_of a SO:TE
  – In OWL: Class(TE_Gene complete Gene part_of(TE))
Result:
  – Queries for ‘SO:TE_gene’ return features not explicitly annotated as such
Compare: Chado
  – Equivalent rules to be added: PostgreSQL functions? Oboedit reasoner adapter?
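
The effect of this entailment can be imitated with a hand-written rule in Python. This is only a sketch of the idea, not an OWL reasoner: it checks direct types and direct part_of links only, and the feature identifiers and the `infer_te_genes` function are hypothetical.

```python
def infer_te_genes(feature_type, part_of):
    """Return features entailed to be TE genes.

    A feature qualifies if it is typed as a gene and is part_of a feature
    typed as a TE, even when nothing annotates it as a TE gene directly.
    """
    return {
        subject
        for subject, obj in part_of
        if feature_type.get(subject) == "gene" and feature_type.get(obj) == "TE"
    }


# Made-up features: two genes, one transposable element, one chromosome.
feature_type = {"f1": "gene", "f2": "gene", "te1": "TE", "chr1": "chromosome"}
part_of = {("f1", "te1"), ("f2", "chr1")}

te_genes = infer_te_genes(feature_type, part_of)
print(te_genes)
```

A full reasoner would additionally follow is_a subclasses of gene and TE, which this rule does not attempt.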