A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project


1 A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project http://www.geneontology.org/doc/gobo.html

2 Semantic Interpretation: What is communication? An information transmission from a source to a receiver by means of encoding-decoding processes (including language). But what is meant, what is said, what is heard, and what is understood are not always the same thing. This has a simple consequence: it is only possible to communicate to the extent that we share rules of usage and have reciprocal understanding of the meaning.

3 Working towards a shared language for the description of sequence. Hey, know what I figured out? The meaning of words isn’t a fixed thing! Any word can mean anything! By giving words new meanings, ordinary English can become an exclusionary code! Two researchers can be divided by the same language! To that end, we’re inventing new definitions for common words, so we’ll be unable to communicate. Don’t you think that is totally excellent?

4 How to best describe biology? Natural language: highly expressive, ambiguous, hard to compute on. Why would I want to compute on it? Database searching, data mining, knowledge transfer.

5 The aims of SO 1. Develop a shared set of terms and concepts to annotate biological sequences. 2. Apply these in our separate projects to provide consistent query capabilities between them. 3. Provide a software resource to assist in the application and distribution of SO. 4. Meet the GOBO criteria.

6 SO: Phase I. To provide a structured controlled vocabulary for the description of primary annotations of nucleic acid sequence. Useful for the annotations shared by a DAS server. To provide a structured representation of these annotations within genomic databases, making it possible to query, for example, all genes whose transcripts are edited, or trans-spliced, or are bound by a particular protein.

7 Simple GFF: What is a transcript?

Source    Type  Group
Sanger    exon  Sequence Em:AP000546.C22.2.mRNA
EBI       exon  dJ68O2.C22.1.mRNA
WUSTL     CDS   gene_id "001"; transcript_id "001.1";
Gadfly    exon  genegrp=CG18090; transgrp=CG18090-RA;
Wormbase  exon  Sequence "C27C7.7"

8 What is a pseudogene? Human: Sequence similar to known protein but contains frameshift(s) and/or stop codons which disrupt the ORF. Neisseria: A gene that is inactive, but may be activated by translocation (e.g. by gene conversion) to a new chromosome site. Note: such a gene would be called a “cassette” in yeast.

9 SO so far: 1280 terms. Top levels: Structural variation; Locatable features; Other sequence attributes.

10 Approach. Determine the top level orthogonal categories: domain, site, sequence type, location. Specify the specializations: homeo domain, phosphorylation site, DNA/RNA/AA. Define inter-relationships between orthogonal categories: ison, defines.

11 [Diagram of terms linked by "ison" and "defines" relationships. Sequence types: nucleic acid sequence, DNA sequence, RNA, primary transcript, processed transcript. Regions: sequence region, nucleic acid sequence region, DNA region, RNA region, gene region, transcript region, exon.]

12 GFF After

Source    Type  Group
Sanger    exon  transcript "Em:AP000546.C22.2"
EBI       exon  transcript "dJ68O2.C22.1"
WUSTL     CDS   gene "001"; transcript "001.1";
Gadfly    exon  gene "CG18090"; transcript "CG18090-RA";
Wormbase  exon  transcript "C27C7.7"
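Once every source uses the same keys, the group column can be read with one rule. A minimal sketch, assuming the `key "value";` layout shown in the "after" table; the helper names `parse_group` and `transcript_ids` are hypothetical and not part of any GFF library:

```python
import re

# Matches one key "value" pair in the harmonized group column.
PAIR = re.compile(r'(\w+)\s+"([^"]*)"')

def parse_group(group: str) -> dict[str, str]:
    """Turn 'gene "001"; transcript "001.1";' into a dictionary."""
    return {key: value for key, value in PAIR.findall(group)}

# Rows copied from the harmonized table: (source, type, group).
ROWS = [
    ("Sanger", "exon", 'transcript "Em:AP000546.C22.2"'),
    ("EBI", "exon", 'transcript "dJ68O2.C22.1"'),
    ("WUSTL", "CDS", 'gene "001"; transcript "001.1";'),
    ("Gadfly", "exon", 'gene "CG18090"; transcript "CG18090-RA";'),
    ("Wormbase", "exon", 'transcript "C27C7.7"'),
]

def transcript_ids(rows):
    """The same lookup now works for every source."""
    return [parse_group(group)["transcript"] for _, _, group in rows]
```

With the original, source-specific group columns, this single lookup would need a separate parsing rule per provider.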

13 SO long(term) 1. Formalize the current phrase-based ontology to a description logic. 2. Provide DAML+OIL/OWL representations. 3. Add declarative rules and constraints to ensure consistency of annotations and aid annotation. 4. Extend the ontology so that it can be used as a full sequence knowledge base.

14 Description logics will make the ontology easier to maintain. For example, they will enable cross-products within the ontology. Now: "tRNA alanyl", "tRNA coding gene alanyl", "tRNA primary transcript alanyl". The tRNA class has a 'slot' for "amino-acid" and a slot for "anticodon". 'Restrictions' effectively say "any instance of class tRNA that has the amino-acid slot value of alanine is of the class 'tRNA alanyl'". 'Checks' detect inconsistency between anticodon, amino-acid and class.
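The slot, restriction and check idea can be imitated in ordinary code to see what a reasoner would do. This is only a sketch in Python, not a description logic: the class and function names are hypothetical, only the alanine restriction is modelled, and anticodons are derived by strict Watson-Crick pairing, ignoring wobble pairing and modified bases.

```python
from dataclasses import dataclass

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Standard genetic code: the four alanine codons, written 5'->3'.
ALANINE_CODONS = {"GCU", "GCC", "GCA", "GCG"}

def anticodon_for(codon: str) -> str:
    """Reverse complement of a codon, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))

ALANINE_ANTICODONS = {anticodon_for(codon) for codon in ALANINE_CODONS}

@dataclass
class TRNA:
    amino_acid: str  # the "amino-acid" slot
    anticodon: str   # the "anticodon" slot, 5'->3'

def classify(trna: TRNA) -> str:
    """Restriction: amino-acid slot == alanine implies class 'tRNA alanyl'."""
    return "tRNA alanyl" if trna.amino_acid == "alanine" else "tRNA"

def is_consistent(trna: TRNA) -> bool:
    """Check: a 'tRNA alanyl' must carry an anticodon that reads alanine."""
    if classify(trna) != "tRNA alanyl":
        return True  # other amino acids are not modelled in this sketch
    return trna.anticodon in ALANINE_ANTICODONS
```

The term "tRNA alanyl" is never assigned by hand here; it follows from the slot value, which is the maintenance benefit the slide describes.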

15 Computable definitions. Human-readable text definitions are always desirable. But lengthy text definitions will always be open to interpretation. Besides, much of the data will be provided by programmers, and programmers never read the instructions. If programmers write their own code for assigning these, this opens the possibility of inconsistent interpretations of the concept. Computable definitions/constraints are essential wherever possible to provide a set of declarative rules for checking and inference.

16 A SO Knowledge Base? SO could eventually be used not just as a way of categorizing sequence features, but as the data model for storing sequence and sequence feature data. Accomplish this by adding a few slots to the top level feature class - for instance for start and end coordinates. One could then have an entire sequence database in DAML+OIL/OWL format.

17 Declarative representation for spatial definitions. Rules involving mathematical constructs usually cannot be expressed in a Description Logic. There needs to be a declarative representation of these rules, because enforcing the rules using a program written in an imperative language is difficult to sustain. Declarative languages specify *what* is to be done, rather than *how* it should be done.

18 Give me 500 bases upstream of all 5’ exons. Define 5’ exon as being the first exon on the five prime end of a transcribed region. It would be very tedious for a curator to have to specifically annotate exons as being "5’ exon" as opposed to the more general "exon". There is no need for them to do this, as this is computable from rules.
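A sketch of how the 5' exon and its upstream window could be computed rather than annotated. It is written imperatively in Python for illustration only; the slides argue for a declarative representation. The `Exon` class, the function names, and the 1-based inclusive coordinate convention are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Exon:
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive, start <= end

def five_prime_exon(exons: list[Exon], strand: str) -> Exon:
    """The first exon at the 5' end of the transcribed region."""
    if strand == "+":
        return min(exons, key=lambda exon: exon.start)
    return max(exons, key=lambda exon: exon.end)

def upstream_region(exons: list[Exon], strand: str, length: int = 500,
                    seq_length: Optional[int] = None) -> tuple[int, int]:
    """(start, end) of the bases upstream of the 5' exon.

    The window is clipped at position 1 and, if given, at seq_length,
    so it can be shorter than `length` (or empty) near a sequence end.
    """
    first = five_prime_exon(exons, strand)
    if strand == "+":
        return (max(1, first.start - length), first.start - 1)
    end = first.end + length
    if seq_length is not None:
        end = min(end, seq_length)
    return (first.end + 1, end)
```

For a reverse-strand transcript, "upstream" lies at higher coordinates, which is why the two strands are handled separately.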

19 Give me all the dicistronic genes Define a dicistronic gene in terms of the cardinality of the transcript to open-reading-frame relationship and their spatial arrangement.
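One possible reading of this definition, sketched in Python: a transcript is dicistronic when it carries exactly two open reading frames (the cardinality part) and the two do not overlap (one assumed interpretation of "spatial arrangement"). The data layout and function names are hypothetical.

```python
Interval = tuple[int, int]  # (start, end), 1-based inclusive

def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]

def is_dicistronic(orfs: list[Interval]) -> bool:
    """Exactly two ORFs on one transcript, and they do not overlap."""
    return len(orfs) == 2 and not overlaps(orfs[0], orfs[1])

def dicistronic_genes(genes: dict[str, dict[str, list[Interval]]]) -> list[str]:
    """genes maps gene name -> {transcript name -> ORF intervals}."""
    return sorted(
        gene
        for gene, transcripts in genes.items()
        if any(is_dicistronic(orfs) for orfs in transcripts.values())
    )
```

No curator has to tag a gene as dicistronic; the label falls out of the transcript and ORF annotations already present.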

20 Give me all 3’ exons that overlap 5’ untranslated regions. Define “exons with overlapping UTRs” as a spatial relationship coupled with being “parts of” different genes and being non-coding.
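The same query as a small Python sketch. It assumes all features lie on the same reference sequence, that 3' exons and 5' UTRs have already been derived and labelled, and that the `Feature` class and the `kind` strings are illustrative names rather than SO terms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    gene: str   # the gene this feature is part of
    kind: str   # "three_prime_exon" or "five_prime_UTR" in this sketch
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

def overlaps(a: Feature, b: Feature) -> bool:
    """Spatial relationship: the two features share at least one base."""
    return a.start <= b.end and b.start <= a.end

def three_prime_exons_overlapping_utrs(features: list[Feature]):
    """Pairs (exon, utr) that overlap and are parts of different genes."""
    exons = [f for f in features if f.kind == "three_prime_exon"]
    utrs = [f for f in features if f.kind == "five_prime_UTR"]
    return [
        (exon, utr)
        for exon in exons
        for utr in utrs
        if exon.gene != utr.gene and overlaps(exon, utr)
    ]
```

The all-pairs comparison is fine for a sketch; a genome-scale query would want an interval index instead.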

21 Loose and Flexible. Rules are meant purely to ensure consistency. There will always be fuzzy areas where we want to allow freedom, because normal biology is like that. Constraints are NOT meant to perform any predictive function, they just provide a consistent definition.

22 A single framework that integrates other biological ontologies. One could have a 'knowledge base' centered around the genome. This KB would be amenable to reasoning. This is a significant change from relational, OO, or XML modeling; however, it is compatible with all of these. SO could be a framework for integrating data with other ontologies. Product features would have slots for standard GO annotations; variation features would have slots into phenotypic ontologies.

23 Build in a Bayesian Belief Network. Probabilities may be assigned to annotations or used to suggest new annotations. Define a model for binding sites and regulatory regions based on weight matrices, proximity to starts of genes and so forth. A curator can interactively explore and ask questions like "OK, I have evidence for there being such and such a binding site here; what if I alter the priors, how does that affect other nodes in the network (statements in the knowledge base) pertaining to pathways?"
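The "what if I alter the priors" question can be illustrated with a single Bayes update. This is one node, not a belief network, and every number below is made up for illustration; the function name and parameters are hypothetical.

```python
def posterior(prior: float, p_evidence_given_site: float,
              p_evidence_given_no_site: float) -> float:
    """P(site | evidence) by Bayes' rule for a binary hypothesis."""
    with_site = p_evidence_given_site * prior
    without_site = p_evidence_given_no_site * (1.0 - prior)
    return with_site / (with_site + without_site)

# Illustrative likelihoods only: how often a weight-matrix hit is seen
# when a binding site is really present versus when it is not.
P_HIT_GIVEN_SITE = 0.8
P_HIT_GIVEN_NO_SITE = 0.2

def explore_priors(priors: list[float]) -> dict[float, float]:
    """Show how the belief in the site changes as the prior is altered."""
    return {
        prior: posterior(prior, P_HIT_GIVEN_SITE, P_HIT_GIVEN_NO_SITE)
        for prior in priors
    }
```

In a full network the updated belief for this node would in turn propagate to dependent statements, such as those about pathways.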

24 To paraphrase Brunelleschi on the importance of tools, circa 1425: I am accustomed to think about and construct in my mind some unheard of invention making it possible to create great and wonderful things.

25 GOBO Criteria 1. The ontologies are "open" and can be used by all without any constraint other than that their origin must be acknowledged. 2. The ontologies are in, or can be instantiated in, the GO syntax, extensions of this syntax or in DAML+OIL. 3. The ontologies are orthogonal to other ontologies already lodged with gobo. 4. The ontologies share a unique identifier space. 5. The ontologies include definitions of their terms.

26 Giving it a go. Sanger Institute: Richard Durbin, Tim Hubbard. EBI: Michael Ashburner, Ewan Birney. Mouse Genome Database: Judith Blake, Carol Bult. BDGP: Chris Mungall, Brad Marshall, John Richter, ShengQiang Shu. Wormbase: Lincoln Stein.

