Pre-SIG Genome Annotation Database Operations Suzanna Lewis FlyBase/Berkeley Drosophila Genome Project Gene Ontology Consortium.

1 Pre-SIG Genome Annotation Database Operations Suzanna Lewis FlyBase/Berkeley Drosophila Genome Project Gene Ontology Consortium

2 Having it all Complete: every occurrence is found Precise: every occurrence is accurate Comprehensive: all types of features Richly described: biological functional data

3 Having it all Complete: every occurrence is found Precise: every occurrence is accurate Comprehensive: all types of features Richly described: biological functional data

4 Contradictions and Complications Assembly errors Missed gene merges Missed gene splits Complex adjustments Dicistronic genes Overlaps and intersections

5 Assembly dependencies

6 Missed merges

7 Missed Splits

8 Splerges

9 Dicistronic Genes

10 Shared 5' UTR

11 Shared UTRs

12 Having it all Complete: every occurrence is found Precise: every occurrence is accurate Comprehensive: all types of features Richly described: biological functional data deleted: 41 new: 179 merges: 31 splits: 26 reinstated: 32

13 Broad Institute TIGR JGI Baylor College of Medicine Washington University FlyBase Ensembl GMOD contributors meeting March 2004

14 The Essentials Visualization and manual editors

15 The Essentials Visualization and manual editors Combiners

16 The Essentials Visualization and manual editors Combiners Full-length cDNA sequences

17 The Essentials Visualization and manual editors Combiners Full-length cDNA sequences High-quality assemblies

18 The Essentials Visualization and manual editors Combiners Full-length cDNA sequences High-quality assemblies Annotation standards and verification

19 The Essentials Visualization and manual editors Combiners Full-length cDNA sequences High-quality assemblies Annotation standards and verification Evidence tracking and versioning

20 The Essentials Visualization and manual editors Combiners Full-length cDNA sequences High-quality assemblies Annotation standards and verification Evidence tracking and versioning Open source software components and standards are critical to long term success

21 The Essentials Visualization and manual editors Combiners Full-length cDNA sequences High-quality assemblies Annotation standards and verification Evidence tracking and versioning Open source software components and standards are critical to long term success

22 Annotation verification Community input On-line error reporting Curation of the literature Confirmation by comparison to cDNA sequences

23 SWISSPROT Comparison Perfect match: 100% identity over 100% of lengths. Single AA substitutions: 99% identity over 100% of lengths. The above account for 75% of all genes with a SWISSPROT cognate (2,771 out of 3,687).
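The two categories and the 75% figure on this slide can be sketched as a small check. This is only an illustration: the function name `classify`, the reading of "99% identity" as "at least 99%", and the catch-all label "other" are assumptions, not part of the original comparison pipeline.

```python
def classify(identity: float, coverage: float) -> str:
    """Hypothetical classifier for the two categories named on the slide.

    identity and coverage are fractions between 0.0 and 1.0.
    """
    if coverage == 1.0 and identity == 1.0:
        return "perfect match"
    # Assumption: "99% identity" is read as "at least 99%".
    if coverage == 1.0 and identity >= 0.99:
        return "single AA substitutions"
    return "other"


# Figures taken from the slide: 2,771 of 3,687 genes with a SWISSPROT cognate.
matched, total = 2771, 3687
fraction = matched / total
print(f"{fraction:.1%} of genes fall into the first two categories")
```

The ratio rounds to the 75% quoted on the slide.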

24 SWISSPROT Comparison Significant mismatch: spans of 40 residues or 20% peptide length, with at least 97% sequence identity. No match: poor or empty matches. These remaining differences were due to lingering annotation errors or errors in the reported DNA sequence from SWISSPROT.

25 Analysis of 8687 cDNAs (full inserts)

26 Having it all Complete: every occurrence is found Precise: every occurrence is accurate Comprehensive: all types of features Richly described: biological functional data

27 Annotation of all types of features Protein-coding: 13,410; tRNA: 291; microRNA: 23; snRNA: 32; snoRNA: 29; Pseudogenes: 19; Non-coding RNA: 36; Transposons: 1,572; Promoters: (thousands); TSS: (thousands); P element insertions: (thousands); Total: 15,412
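As a quick consistency check on the table above, the counted feature classes can be summed and compared with the reported total. The dictionary name `feature_counts` is an illustrative choice; the classes listed only as "(thousands)" are left out because the slide gives no exact number for them.

```python
# Exact counts as listed on the slide; "(thousands)" entries are omitted.
feature_counts = {
    "Protein-coding": 13410,
    "tRNA": 291,
    "microRNA": 23,
    "snRNA": 32,
    "snoRNA": 29,
    "Pseudogenes": 19,
    "Non-coding RNA": 36,
    "Transposons": 1572,
}

reported_total = 15412
computed_total = sum(feature_counts.values())

# The reported total matches the sum of the exactly counted classes.
print(computed_total, computed_total == reported_total)
```

The total therefore covers only the eight exactly counted classes, not promoters, TSS, or P element insertions.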

28 Having it all Complete: every occurrence is found Precise: every occurrence is accurate Comprehensive: all types of features Richly described: biological functional data

29 How to find what you need FlyBase MGI SGD Cappuccino BNI1 Formin 2 By name? By database ID? Actin binding FBgn S MGI: By function?

30 But in 1998 there was a problem… None of the organism databases used standard terminology to describe biological function.

31 For example It will be difficult for you, and even harder for a computer, to find functionally equivalent gene products. translation Protein synthesis You want all gene products that are involved in bacterial protein synthesis, but the sequences are significantly different from those in humans.

32 How to best describe biology? Natural language: highly expressive, ambiguous in meaning, hard to compute on. Structured representation: limited in expressivity, precise, may be computed on. We needed to find a middle ground that supports and enables both.

33 The aims of GO 1. To develop comprehensive shared vocabularies. 2. Use the vocabularies to describe the gene products held in different databases. 3. To provide access to the vocabularies, the annotations, and associated data. 4. To provide software tools to assist biological researchers.

34 The early key decisions The vocabulary itself requires a serious and ongoing effort. Carefully define every concept Initially keep things as simple as possible and only use a minimally sufficient data representation. Focus initially on molecular aspects that are shared between many organisms.

35 A sequence is not equal to a gene Physically a gene is composed of sequences. DNA, RNA, and protein Different strains, ESTs, cDNAs, alleles… A fully characterized gene has multiple sequence references

36 GO is NOT a gene nomenclature system Communities decide upon the official gene name or symbol and their community databases maintain these data. Sequence repositories (e.g. GenBank/EMBL/DDBJ/SwissProt) provide sequence identifiers and protein names. Proteins may be named differently than genes, e.g. HUGO and UniProt IDs.

37 GO encompasses descriptions for all functional molecular entities A gene product may be either a functional RNA or a protein Protein tRNA miRNA snRNA rRNA …

38 The breakdown of work Task 1 Building the ontology: a computable description of the biological world Task 2 Describing your gene: annotation Protein structure Phenotype Expression data Function, process, localization…

39 Vocabulary and relationships Look up concept to accurately express biology Your gene product Refer to representative sequences Gene nomenclature decisions Sequence DB Choose approved name and synonyms Collect what is known from the literature

40 GO databases: distributed and centralized Support cross-database queries By having a mutual understanding of the definition and meaning of any word used to describe a gene product Provide database access to a common repository of annotations By submitting a summary of gene products that have been annotated

41 If we build it… What is a term? Definition of term concepts How to represent and manage the concepts Biological scope Annotation

42 What is a term? Must have a stable ID May have synonyms Have relationships to other terms Can be made obsolete Can be split or merged Must have a definition
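The requirements on this slide map naturally onto a small record type. The sketch below is an assumption-laden illustration, not the GO Consortium's actual data model: the class `Term`, its field names, the `merge` helper, and the `EX:` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Term:
    term_id: str                      # stable ID, never reused
    name: str
    definition: str                   # every term must have a definition
    synonyms: List[str] = field(default_factory=list)
    # (relation, target term ID) pairs, e.g. ("is_a", "EX:0000001")
    relationships: List[Tuple[str, str]] = field(default_factory=list)
    obsolete: bool = False
    replaced_by: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.term_id or not self.definition:
            raise ValueError("a term needs a stable ID and a definition")


def merge(kept: Term, retired: Term) -> None:
    """Merge `retired` into `kept`; the retired ID stays resolvable."""
    kept.synonyms.append(retired.name)
    kept.synonyms.extend(retired.synonyms)
    retired.obsolete = True
    retired.replaced_by.append(kept.term_id)


# Hypothetical example terms; the IDs are placeholders, not real GO IDs.
binding = Term("EX:0000001", "binding", "Selective interaction with a molecule.")
actin = Term("EX:0000002", "actin binding", "Binding to actin.",
             relationships=[("is_a", "EX:0000001")])
dup = Term("EX:0000003", "actin-binding", "Binding to actin.")
merge(actin, dup)
print(actin.synonyms, dup.obsolete, dup.replaced_by)
```

Keeping the retired record, flagged obsolete and pointing at its replacement, is one way to let old annotations resolve after a merge while the ID itself stays stable.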

43 Definitions Purpose is to remove ambiguity of interpretation and alternate meanings All definitions are supported by cross-references to the source(s)

44 Annotating gene products Expert curation accepted from any group that can provide an ID and evidence Each annotation must be supported by evidence, including a cross-reference Gene can be annotated with multiple terms Annotation is at finest possible granularity Guidelines
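The rules on this slide (every annotation carries evidence and a cross-reference, and one gene product may carry several terms) can be sketched as follows. The class `Annotation`, the helper `terms_by_gene_product`, and all identifiers and reference strings are hypothetical placeholders, not the real GO annotation file format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Annotation:
    gene_product_id: str   # ID supplied by the contributing database
    term_id: str           # ontology term being asserted
    evidence: str          # kind of evidence, e.g. "direct assay"
    reference: str         # cross-reference supporting the assertion

    def __post_init__(self) -> None:
        # Reject annotations that lack evidence or a supporting reference.
        if not self.evidence or not self.reference:
            raise ValueError("annotation needs evidence and a cross-reference")


def terms_by_gene_product(annotations: Iterable[Annotation]) -> Dict[str, Set[str]]:
    """Group term IDs by gene product; one product may have many terms."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for ann in annotations:
        index[ann.gene_product_id].add(ann.term_id)
    return dict(index)


# Placeholder data for illustration only.
annotations = [
    Annotation("GP:1", "EX:0000002", "direct assay", "REF:1"),
    Annotation("GP:1", "EX:0000005", "sequence similarity", "REF:2"),
]
print(terms_by_gene_product(annotations))
```

Validating at construction time means an unsupported annotation never enters the collection, which mirrors the slide's requirement that evidence is mandatory rather than optional.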

45 GO functional analysis Sequence similarity Literature harvesting Motif analysis Expression studies Interaction studies

46 Current work Low coverage genome sequencing Multiple species genome comparisons SO: the Sequence Ontology

47 e.g. What is a pseudogene? Human: a sequence similar to a known protein but containing frameshift(s) and/or stop codons which disrupt the ORF. Neisseria: a gene that is inactive but may be activated by translocation (e.g. by gene conversion) to a new chromosome site. Note that some would call such a gene a cassette in yeast.

48 SO is useful if you want to: Annotate sequence using consistent terminology for the same features across genomes. Enable practical querying and comparisons between sequence databases. Describe and propagate features at all levels of the sequence from genomic to mature protein.

49 Thank You to… Curators-Berkeley Sima Misra, Josh Kaminker, Simon Prochnik, Chris Smith, Jon Tupy Curators-Harvard Lynn Crosby, Bev Matthews, Kathy Campbell, Pavel Hradecky, Yanmei Huang, Leyla Bayraktaroglu Curators-Cambridge Gillian Millburn, Rachel Drysdale, Chihiro Yamada Curators-SWISSPROT Eleanor Whitfield Software-Berkeley Chris Mungall, Ben Berman, Joe Carlson, Mark Gibson, Nomi Harris, George Hartzell, Brad Marshall, John Richter, ShengQiang Shu Software-Harvard David Emmert Software-Cambridge Aubrey de Grey Software-Ensembl Michelle Clamp, Vivek Iyer, Steve Searle

50 and Thanks to… Christopher Mungall John Day-Richter Brad Marshall Karen Eilbeck Mark Yandell George Hartzell David Hill Joel Richardson GO Curators Michael Ashburner Judith Blake J. Michael Cherry

51 We want and depend on you! Corrections to the peptides Functional annotation Corrections and additions to GO

