Judith Blake Biomedical Ontologies and their role in functional genomics Judith A. Blake, Ph.D. The Jackson Laboratory Functional Genomics – February 2012.


1 Judith Blake Biomedical Ontologies and their role in functional genomics Judith A. Blake, Ph.D. The Jackson Laboratory Functional Genomics – February 2012 Func Genomics2012

2 Judith Blake Func Genomics2012 Bioinformatics-What is that? Bioinformatics is: the use of computers (and persistent data structures) in pursuit of biological research an emerging new discipline, with its own goals, research program, and practitioners the fundamental tool for 21st century biology all of the above. Robert J. Robbins

3 Judith Blake Topics: We need to coordinate the representation of information – from genetic and genomic studies, – as might be reported in the biomedical literature, and – from the output of high-throughput experiments This is done by designing databases (e.g., MGI) and bio-ontologies (e.g., GO) to support comprehensive data integration Such resources enable comparative analysis between different organisms and biological systems With the objective of helping us gain new knowledge about biological systems and particularly about genetic components of human diseases Func Genomics2012

4 Judith Blake Roxy Laybourne and others, photo by Chip Clark Managing Biological Information is Nothing New Bird Collections at the Smithsonian Natural History Museum Func Genomics2012

5 Judith Blake The trouble with facts is that there are so many of them. Samuel McChord Crothers, The Gentle Reader (1903) Func Genomics2012

6 Judith Blake The data integration problem Vast wealth of data residing in different databases – Meaning of those records must be reconciled for data to be automatically integrated Science database medical database Func Genomics2012

7 Judith Blake Accession File Func Genomics2012

8 Judith Blake TCTCTCCCCCGCCCCCCAGGCTCCCCCGGTCGCTCTCCTCCGGCGGTCGCCCGCGCTCGGTGGATGTGGC TGGCAGCTGCCGCCCCCTCCCTCGCTCGCCGCCTGCTCTTCCTCGGCCCTCCGCCTCCTCCCCTCCTCCT TCTCGTCTTCAGCCGCTCCTCTCGCCGCCGCCTCCACAGCCTGGGCCTCGCCGCGATGCCGGAGAAGAGG CCCTTCGAGCGGCTGCCTGCCGATGTCTCCCCCATCAACTACAGCCTTTGCCTCAAGCCCGACTTGCTGG ACTTCACCTTCGAGGGCAAGCTGGAGGCCGCCGCCCAGGTGAGGCAGGCGACTAATCAGATTGTGATGAA TTGTGCTGATATTGATATTATTACAGCTTCATATGCACCAGAAGGAGATGAAGAAATACATGCTACAGGA TTTAACTATCAGAATGAAGATGAAAAAGTCACCTTGTCTTTCCCTAGTACTCTGCAAACAGGTACGGGAA CCTTAAAGATAGATTTTGTTGGAGAGCTGAATGACAAAATGAAAGGTTTCTATAGAAGTAAATATACTAC CCCTTCTGGAGAGGTGCGCTATGCTGCTGTAACACAGTTTGAGGCTACTGATGCCCGAAGGGCTTTTCCT TGCTGGGATGAGCCTGCTATCAAAGCAACTTTTGATATCTCATTGGTTGTTCCTAAAGACAGAGTAGCTT TATCAAACATGAATGTAATTGACCGGAAACCATACCCTGATGATGAAAATTTAGTGGAAGTGAAGTTTGC CCGCACACCTGTTATGTCTACATATCTGGTGGCATTTGTTGTGGGTGAATATGACTTTGTAGAAACAAGG TCAAAAGATGGTGTGTGTGTCCGTGTTTACACTCCTGTTGGCAAAGCAGAGCAAGGAAAATTTGCGTTAG AGGTTGCTGCTAAAACCTTGCCTTTTTATAAGGACTACTTCAATGTTCCTTATCCTCTACCTAAAATTGA TCTCATTGCTATTGCAGACTTTGCAGCTGGTGCCATGGAGAACTGGGGCCTTGTTACTTATAGGGAGACT GCATTGCTTATTGATCCAAAAAATTCCTGTTCTTCATCCCGCCAGTGGGTTGCTCTGGTTGTGGGACATG AACTCGCCCATCAATGGTTTGGAAATCTTGTTACTATGGAATGGTGGACTCATCTTTGGTTAAATGAAGG TTTTGCATCCTGGATTGAATATCTGTGTGTAGACCACTGCTTCCCAGAGTATGATATTTGGACTCAGTTT GTTTCTGCTGATTACACCCGTGCCCAGGAGCTTGACGCCTTAGATAACAGCCATCCTATTGAAGTCAGTG TGGGCCATCCATCTGAGGTTGATGAGATATTTGATGCTATATCATATAGCAAAGGTGCATCTGTCATCCG AATGCTGCATGACTACATTGGGGATAAGGACTTTAAGAAAGGAATGAACATGTATTTAACCAAGTTCCAA CAAAAGAATGCTGCCACAGAGGATCTCTGGGAAAGTTTAGAAAATGCTAGTGGTAAACCTATAGCAGCTG GTTTCTGCTGATTACACCCGTGCCCAGGAGCTTGACGCCTTAGATAACAGCCATCCTATTGAAGTCAGTG TGGGCCATCCATCTGAGGTTGATGAGATATTTGATGCTATATCATATAGCAAAGGTGCATCTGTCATCCG AATGCTGCATGACTACATTGGGGATAAGGACTTTAAGAAAGGAATGAACATGTATTTAACCAAGTTCCAA CAAAAGAATGCTGCCACAGAGGATCTCTGGGAAAGTTTAGAAAATGCTAGTGGTAAACCTATAGCAGCTG From the birth of the field of genetics until a decade ago, it was generally assumed that the parental origin of a gene could have no 
effect on its function. In the vast majority of studies carried out during the last 90 years, this paradigm has appeared to hold true. However, with increasingly sophisticated genetic and embryological investigations in the mouse, important exceptions to this rule have been uncovered over the last decade. First, the results of nuclear transplantation experiments carried out with single-cell fertilized embryos have demonstrated an absolute requirement for both a maternally-derived and a paternally-derived pronucleus to allow full-term development (McGrath and Solter, 1983). Second, in animals that receive both homologs of certain chromosomes or subchromosomal regions from one parent and not the other (through the mating of translocation heterozygotes as described in Section 5.2.3), dramatic effects on development can be observed including enhanced or retarded growth and outright lethality (Cattanach and Kirk, 1985). Third, either of two deletions that cover a small region of mouse chromosome 17 can be transmitted normally from a father to his offspring, but these same deletions cause prenatal lethality when they are maternally transmitted (Johnson, 1974; Winking and Silver, 1984). Fourth, similar parent-of-origin effects have been observed on the phenotypes expressed by animals that carry a targeted knock-out allele at the Igf2 locus (DeChiara et al., 1991). Finally, molecular techniques have been used to directly demonstrate the expression of transcripts from one parental allele and not the other at the Igf2r locus (Barlow et al., 1991) and the H19 locus (Bartolomei et al., 1991). The accumulated data indicate that a subset of mouse genes (on the order of 0.2%) will function differently in normal embryos depending on whether they have been inherited through the male or the female gamete, such that one allele will be expressed and the other will be silent.
Genomic imprinting is the term that has been coined to describe this situation in which the phenotype expressed by a gene varies depending on its parental origin (Sapienza, 1989). Further experiments have demonstrated that, in general, the "imprint" is erased and regenerated during gametogenesis so that the function of an imprintable gene is fully determined by the sex of its progenitor alone, and not by earlier ancestors. Func Genomics2012

9 Judith Blake Crash Blossoms and other semantic ambiguities translating what we say into what we mean: data, words and knowledge Crash Blossoms Func Genomics2012 “Violinist Linked to JAL Crash Blossoms” “MacArthur Flies Back to Front” “Squad Helps Dog Bite Victim” “Red Tape Holds Up New Bridge.”

10 Judith Blake The English Language is hard to learn, even for computers. “Jessica Hahn Pooped After Long Day Testifying” Focus: creating the data structures and mining the biomedical literature to provide knowledge representations – with the objective of using logical reasoning applications and predictive approaches to ‘interrogate’ very large data sets, generating new hypotheses for further experimental investigation Func Genomics2012

11 Judith Blake What is an ontology? Func Genomics2012

12 Judith Blake A biological ontology is: a formal representation of some portion of biological reality. What kinds of things exist? What are the relationships between these things? [Figure: eye, ommatidium, sense organ, and eye disc, linked by is_a, part_of, and develops from relations] Func Genomics2012

13 Judith Blake Why do we need ontologies? Func Genomics2012

14 Judith Blake Connections are not made explicit by default Computers are not intelligent We need to spell out interconnectedness of entities – Specificity Bone mineralization vs ossification – Granularity Osteocyte vs bone – Spatial Gill membrane and branchiostegal ray – Perspective Anatomy vs physiology – Causally related entities pathways development – Evolutionary Homology and descent Func Genomics2012

15 Judith Blake Ontologies: the key to data integration Ontologies provide: – rigorous, shared computable definitions for terms – classifications and connections that can be used for database search and inference Func Genomics2012

16 Judith Blake Annotation of genes and proteins using ontologies is key to data integration Biomedical Ontologies Ontologies are human- and machine-readable classifications of biological knowledge. Ontologies have: Terms Term definitions Relationships among terms Func Genomics2012

17 Judith Blake Good ontology design is required for data integration Not any old ontology will do – Data integration served poorly by poor ontologies How do we know good ontologies? – Types and classifications should be constructed according to science and should reflect nature – Ontology constructed along lines of ontology best practices http://www.obofoundry.org Formal definitions and relations Based on distinction between types and instances Distinction between types and their labels Func Genomics2012

18 Judith Blake The Gene Ontology Mid-size – ~33,700 terms in all 3 ontologies – ~2n,nnn links (is_a, part_of, regulates) Each term represents a type – Terms also have alternate labels (synonyms) These do not represent distinct types Humans use different labels to refer to the same biological pattern – E.g: endoplasmic reticulum vs ER Func Genomics2012

19 Judith Blake Ontology is not nomenclature A type can have many labels – Preferred label (term) – Synonyms, aliases Types are not labels – Types are the underlying pattern Identified by a formal definition – Labels are important for doing science But life existed for billions of years quite happily prior to the invention of names and labels – Good ontology separates the underlying patterns in nature from the labels used to describe them Func Genomics2012

20 Judith Blake Ontologies and annotation Ontologies are of little practical use without annotation – GO has ~6 million annotations linking genes and gene products to GO terms – Mostly (but not all) MOD & Human – Same terms are shared across species All annotation statements have provenance – Source/publication – Evidence & evidence codes Func Genomics2012

21 Judith Blake Use of GO annotations Database search Database integration Automating further annotation Data mining and data analysis – Microarray analysis: 1. Extract cluster of co-expressed genes 2. Analyze annotations for enrichment of certain terms Func Genomics2012
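The two-step microarray analysis on this slide can be sketched in code. The slides do not name a statistic, so the one-sided hypergeometric test below is an assumption (it is a common choice for term enrichment), and the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, cluster: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in a cluster.

    population: number of genes on the array (hypothetical background set)
    annotated:  genes in the population annotated to the GO term
    cluster:    size of the co-expressed cluster
    hits:       genes in the cluster annotated to the GO term
    """
    upper = min(annotated, cluster)
    total = comb(population, cluster)  # all possible clusters of this size
    # Count clusters containing k annotated genes, for k = hits .. upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 genes, 5 annotated to the term,
# a cluster of 4 genes of which 3 carry the annotation.
p = enrichment_p_value(population=20, annotated=5, cluster=4, hits=3)
print(f"p = {p:.4f}")
```

In practice many terms are tested at once, so a multiple-testing correction would normally follow; that step is left out of this sketch.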

22 Judith Blake Func Genomics2012 What is a Database? an organized body of related information In computing, a database can be defined as a structured collection of records or data that is stored in a computer so that a program can consult it to answer queries. The records retrieved in answer to queries become information that can be used to make decisions.

23 Judith Blake Func Genomics2012 Mouse Genome Informatics (MGI) Database Comprehensive information resource about the laboratory mouse Provides consensus representation of the mouse genome International scientific community resource Integrated data acquisition and query capabilities MGI Database is a Relational Database: Information is stored in tables that have relationships to each other. This facilitates query and retrieval of subsets of data.
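To illustrate what "tables that have relationships to each other" means, here is a minimal sketch using Python's built-in sqlite3 module. The three-table schema, the gene identifiers, and the gene symbols are hypothetical and are not MGI's actual schema; only the two GO ids and their names come from the slides.

```python
import sqlite3

# In-memory database with a hypothetical schema (not MGI's real one).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT NOT NULL);
    CREATE TABLE term (term_id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE annotation (
        gene_id  TEXT NOT NULL REFERENCES gene(gene_id),
        term_id  TEXT NOT NULL REFERENCES term(term_id),
        evidence TEXT NOT NULL
    );
    """
)

# Placeholder genes; the GO ids and names are taken from the slides.
conn.executemany("INSERT INTO gene VALUES (?, ?)",
                 [("GENE:1", "GeneA"), ("GENE:2", "GeneB")])
conn.executemany("INSERT INTO term VALUES (?, ?)",
                 [("GO:0005739", "mitochondrion"),
                  ("GO:0006099", "tricarboxylic acid cycle")])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)",
                 [("GENE:1", "GO:0005739", "IDA"),
                  ("GENE:2", "GO:0006099", "IEA")])

# Retrieve a subset of the data by following the table relationships.
rows = conn.execute(
    """
    SELECT gene.symbol
    FROM gene
    JOIN annotation ON annotation.gene_id = gene.gene_id
    JOIN term ON term.term_id = annotation.term_id
    WHERE term.name = ?
    ORDER BY gene.symbol
    """,
    ("mitochondrion",),
).fetchall()
print(rows)
```

The annotation table holds no descriptive data of its own; it only links a gene to a term, which is what lets one query combine both.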

24 Judith Blake MGI’s primary mission is to facilitate the use of mouse as a model for human biology by providing integrated access to data on the genetics, genomics, and biology of the laboratory mouse. Hermansky-Pudlak syndrome Mouse model & human phenotype Information content spans from sequence to phenotype/disease sequence variants & polymorphisms gene function genome location mouse/human orthologs & maps strain genealogy expression tumors Database Resource: Mouse Genome Informatics (MGI) Func Genomics2012

25 Judith Blake MGI integrates genetic, genomic and phenotypic data Integrate Factor out common objects Assemble integrated objects Gather data from multiple sources Within MGI Genes Sequence Expression Literature Alleles Phenotypes Between MGI and others Via shared sequence annotations……UniProt, EntrezGene, Ensembl Via shared semantic representations ……Drosophila, Arabidopsis, etc. Func Genomics2012

26 Judith Blake Data Acquisition Object Identity Standardizations Data Associations Integration with other bioinformatics resources New Gene, Strain or Sequence? Controlled Vocabularies Evidence & Citation Co-curation of shared objects and concepts Annotation Pipeline Literature & Loads Func Genomics2012

27 Judith Blake Func Genomics2012 RPCI Automated (mostly) Data Integration (Loads) MGI db Associations Clones Non-mouse Gene models and coordinates Sequences Vocabularies SNP db GO MP Anatomy Interpro OMIM PIRSF Annotation MGC GenBank RefSeq UniProt DFCIseq DoTSseq NIAseq NCBI VEGA dbSNP EG chimp EG dog EG rat EG human EG mouse UniProt DFCI DoTS NIA Unigene TreeFam Gene traps Ensembl microRNAs UniSTS HCOP Homologene

28 Judith Blake Manual (mostly) annotation of the biomedical literature Func Genomics2012 > 12,000 / year

29 Judith Blake Func Genomics2012 Data acquisition is constant
Load Program: Summary of Data Loaded
Mouse EntrezGene: EntrezGene IDs for mouse markers, plus marker-to-sequence associations from EntrezGene not already in MGD
Human/Rat EntrezGene: Nomenclature, map position and other data regarding human and rat genes. OMIM associations for human.
GenBank Seq: Mouse sequence records from GenBank
RefSeq Seq: Mouse sequence records from RefSeq
UniProt/TrEMBL Seq: Mouse sequence records from UniProt and TrEMBL
TIGR/DoTS/NIA Seq: Mouse consensus sequence records from TIGR/DoTS/NIA clusters
TIGR/DoTS/NIA Association: Associations between TIGR/DoTS/NIA cluster sequences and markers
Ensembl Gene Model: Ensembl gene model sequences, coordinates, & associations between these & markers
NCBI Gene Model: NCBI gene model sequences, coordinates, & associations between these & markers
UniProt Association: UniProt/TrEMBL IDs and additional GenBank IDs for mouse markers, plus GO and InterPro annotations
UniGene Association: UniGene cluster IDs for mouse markers
EST cDNA Clone: Mouse IMAGE, NIA, MGC, Riken, cDNAs and EST sequence associations
MGC Association: MGC IDs and associations between MGC full length sequences and MGC cDNAs
RPCI Clone: RPCI 23/24 BAC clones and sequence associations
GO Vocabulary: Updated Gene Ontology (GO) vocabularies from the central GO site
OMIM Vocabulary: Updated OMIM disease terms
MP Vocabulary: Updated MP vocabulary (from OBO-Edit)
Anatomy: Updated adult mouse anatomy ontology (from OBO-Edit)
Mapping panel: JAX, EUCIB, Copeland-Jenkins and many others
PIRSF: Mouse PIR superfamily terms and associations to markers
SNPs: Mouse SNPs from dbSNP and associations between SNPs & markers

30 Judith Blake Func Genomics2012 Who is the authority? Mouse data for which MGI serves as the authoritative source.
Data type: Working relationship
Gene Symbol/Name: MGD makes primary assignment; coordination with HGNC, RGNC
Allele Symbol/Name: MGD makes primary assignment
Strain Designations: MGD makes primary assignment
Gene-to-nucleotide sequence association: Co-curation with NCBI
Gene-to-protein sequence association: Co-curation with UniProt
Gene Ontology (GO) annotations: MGD provides primary data set
Mammalian Phenotype Ontology: MGD develops and applies vocabulary
Gene homology data between mouse & other species: MGD curated orthology relationships
Genotype-to-phenotype data: MGD provides primary curation
Mouse model-to-human disease (OMIM): MGD provides primary curation

31 Judith Blake Func Genomics2012 Snapshot of MGI data content: MGI data statistics, March 2010
Genes (including unmapped mutants): 36,290
Genes w/ nucleotide sequence: 29,110
Genes w/ protein sequence: 26,108
Genes annotated to GO (comprehensive): 25,644
Mouse/human orthologs: 17,841
Mouse/rat orthologs: 16,767
Targeted alleles; mutant alleles in mice: 24,770; 23,866
Genes w/ phenotypic alleles; genes w/ targeted alleles: 12,350; 10,340
Human diseases w/ one or more mouse model: 999
QTL: 4,404
References: 150,341
Mouse refSNPs: 10,089,692

32 Judith Blake Func Genomics2012 Having the data, we want to ask complex questions

33 Judith Blake Func Genomics2012 Curators use controlled terms from structured vocabularies (ontologies) to annotate complex biological systems described in the literature The knowledge is in the details

34 Judith Blake Gene Nomenclature Gene/Marker Type Allele Type Assay Type – Expression – Mapping Molecular Mutation Inheritance Mode Tissue Types Cell Types Cell Lines Units – Cytogenetic – Molecular ES Cell Line Strain Nomenclature Keyword lists standardize descriptions and enable comprehensive data retrieval Keyword lists support data integration Func Genomics2012

35 Judith Blake Sheer number of terms too much to remember and sort – Need standardized, stable, carefully defined terms – Need to describe different levels of detail – So…defined terms need to be related in a hierarchy With structured vocabularies/hierarchies – Parent/child relationships exist between terms – Increased depth -> Increased resolution – Can annotate data at appropriate level – May query at appropriate level All model organisms database and genome annotation systems have same issues Organogenesis Blood vessel development Angiogenesis Vasculogenesis Process terms But, keyword lists are not enough Func Genomics2012
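The point about annotating and querying at the appropriate level can be sketched with the process terms shown on this slide. The parent/child arrangement below is an assumption read off the slide's term order (organogenesis above blood vessel development, which sits above angiogenesis and vasculogenesis), and the gene names are hypothetical.

```python
# Child term -> list of parent terms (assumed hierarchy, see lead-in).
PARENTS = {
    "organogenesis": [],
    "blood vessel development": ["organogenesis"],
    "angiogenesis": ["blood vessel development"],
    "vasculogenesis": ["blood vessel development"],
}

# Hypothetical genes, each annotated at the most specific level known.
ANNOTATIONS = {
    "geneA": "angiogenesis",
    "geneB": "vasculogenesis",
    "geneC": "organogenesis",
}


def ancestors(term):
    """Return every term above `term` in the hierarchy."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return seen


def genes_for(term):
    """Genes annotated to `term` itself or to any of its descendants."""
    return sorted(
        gene
        for gene, annotated_term in ANNOTATIONS.items()
        if annotated_term == term or term in ancestors(annotated_term)
    )


print(genes_for("blood vessel development"))
print(genes_for("organogenesis"))
```

A flat keyword list cannot answer the broader queries, because nothing in it records that angiogenesis is a kind of blood vessel development.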

36 Judith Blake And so, we started the Gene Ontology (GO) Formed to develop a shared language adequate for the annotation of molecular characteristics across organisms; a common language to share knowledge. Seeks to achieve a mutual understanding of the definition and meaning of any word used; thus we are able to support cross-database queries. Members agree to contribute gene product annotations and associated sequences to GO database; thus facilitating data analysis and semantic interoperability. Func Genomics2012

37 Judith Blake What is Ontology? Func Genomics2012 Dictionary: A branch of metaphysics concerned with the nature and relations of being. Barry Smith: The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality. 1606 1700s

38 Judith Blake Func Genomics2012 A biological ontology is: a (machine and human) interpretable representation of some aspect of biological reality. What kinds of things exist? What are the relationships between these things? [Figure: eye, sclera, sense organ, and optic placode, linked by part_of, is_a, and develops from relations] http://www.macula.org/anatomy/eyeframe.html

39 Judith Blake Gene Ontology: widely adopted AgBase Func Genomics2012

40 Judith Blake Molecular Function = elemental activity/task - the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity Biological Process = biological goal or objective – broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions Cellular Component = location or complex – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme Sequence Ontology = genome features – regions, attributes, variants; examples include exon, CpG island, and transgenic insertion Cell Ontology = cell types – Examples include photoreceptor cell and pillar cell GO represents selected molecular domains Func Genomics2012

41 Judith Blake Func Genomics2012 GO reflects biological knowledge for computers
Biological Process. GO term: tricarboxylic acid cycle. Synonyms: Krebs cycle; citric acid cycle. GO id: GO:0006099
Cellular Component. GO term: mitochondrion. GO id: GO:0005739. Definition: A semiautonomous, self replicating organelle that occurs in varying numbers, shapes, and sizes in the cytoplasm of virtually all eukaryotic cells. It is notably the site of tissue respiration.
Molecular Function. GO term: malate dehydrogenase. GO id: GO:0030060. (S)-malate + NAD(+) = oxaloacetate + NADH.
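The separation of types from labels can be made concrete with the terms on this slide: several labels resolve to one stable identifier. This is a minimal sketch; the `Term` class and the `resolve` helper are hypothetical names, and only the ids, labels, and synonyms come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    term_id: str           # stable identifier for the underlying type
    label: str             # preferred label
    synonyms: tuple = ()   # alternate labels for the same type


TERMS = [
    Term("GO:0006099", "tricarboxylic acid cycle",
         ("Krebs cycle", "citric acid cycle")),
    Term("GO:0005739", "mitochondrion"),
]


def build_label_index(terms):
    """Map every label and synonym (case-insensitive) to its term id."""
    index = {}
    for term in terms:
        for name in (term.label, *term.synonyms):
            index[name.lower()] = term.term_id
    return index


INDEX = build_label_index(TERMS)


def resolve(name):
    """Return the term id for a label or synonym, or None if unknown."""
    return INDEX.get(name.strip().lower())


print(resolve("Krebs cycle"), resolve("citric acid cycle"))
```

Because data is attached to the id rather than to a label, a search for "Krebs cycle" and one for "citric acid cycle" reach the same annotations.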

42 Judith Blake Terms are defined graphically relative to other terms Func Genomics2012

43 Judith Blake Func Genomics2012 Ontologies can be represented as graphs, where the nodes are connected by edges Nodes = terms in the ontology Edges = relationships between the concepts node edge Ontology Structure

44 Judith Blake Ontological relations Types are related Network of terms forms a graph – Terms (nodes) – The edge type (relation) is important Two common relations: – Is_a – Part_of Func Genomics2012
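Why the edge type matters can be shown by storing each edge as a (subject, relation, object) triple and restricting traversal to chosen relations. The three edges below are an assumed reading of the eye figure from the earlier slide (eye is_a sense organ, ommatidium part_of eye, eye develops_from eye disc), and `reachable` is a hypothetical helper.

```python
# Each edge is (subject, relation, object); pairings are assumed, see lead-in.
EDGES = [
    ("eye", "is_a", "sense organ"),
    ("ommatidium", "part_of", "eye"),
    ("eye", "develops_from", "eye disc"),
]


def reachable(start, allowed_relations):
    """Terms reachable from `start` using only edges of the allowed types."""
    found = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for subject, relation, obj in EDGES:
            if subject == node and relation in allowed_relations and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


# Following is_a alone finds nothing above ommatidium;
# adding part_of reaches the eye and, through it, sense organ.
print(reachable("ommatidium", {"is_a"}))
print(reachable("ommatidium", {"is_a", "part_of"}))
```

Which relations may be combined in a single traversal is a modeling decision; the sketch only shows that the answer changes with the set of relations allowed.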

45 Judith Blake eyeball cavitated organ is_a organ is_a instance_of Types (represented in the ontology) Instances (NOT represented in the ontology) Func Genomics2012

46 Judith Blake Formal definition of is_a is_a holds between types X is_a Y holds if and only if: – Given any thing that instantiates X at some time, that thing also instantiates Y at the same time Func Genomics2012
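The definition on this slide can be written symbolically. The predicate name inst(x, X, t), read as "thing x instantiates type X at time t", is notation introduced here for illustration and does not appear in the slides.

```latex
\[
X \ \mathit{is\_a}\ Y
\iff
\forall x\, \forall t\, \bigl(\mathrm{inst}(x, X, t) \rightarrow \mathrm{inst}(x, Y, t)\bigr)
\]
```

One consequence is that is_a is transitive. If every eyeball is a cavitated organ at any time it exists, and every cavitated organ is an organ at that time, then every eyeball is an organ at that time, so eyeball is_a organ.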

47 Judith Blake Func Genomics2012 GO terms are used for functional annotations. Legend: I denotes an ‘is-a’ relationship; P denotes a ‘part-of’ relationship. Brain development [GO:0007420] (141 genes, 207 annotations)

48 Judith Blake Annotations are assertions There is evidence that this gene product can be best classified using this term The source of the evidence and other information is included There is agreement on the meaning of the term Func Genomics2012

49 Judith Blake Annotating Gene Products using GO. Gene Product: P05147; GO Term: GO:0047519; Evidence: IDA; Reference: PMID:2976880 Func Genomics2012

50 Judith Blake Evidence codes describe the basis of the annotation. Categories shown on the slide: direct experiment in organism; inferred from evidence; no direct experiment. Func Genomics2012
IDA: Inferred from direct assay
IPI: Inferred from physical interaction
IMP: Inferred from mutant phenotype
IGI: Inferred from genetic interaction
IEP: Inferred from expression pattern
IEA: Inferred from electronic annotation
ISS: Inferred from sequence or structural similarity
TAS: Traceable author statement
NAS: Non-traceable author statement
IC: Inferred by curator
RCA: Reviewed Computational Analysis
ND: No data available
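An annotation with its provenance can be modeled as a small record, and the evidence code then becomes a filter. In this sketch the first record is the example from the preceding slide; the second record is a hypothetical placeholder, and so are the class and function names. Treating IDA, IPI, IMP, IGI, and IEP as the experimental group follows the usual GO grouping and is an assumption about how this slide's categories map onto the codes.

```python
from dataclasses import dataclass

# Codes treated here as direct experimental evidence (assumed grouping).
EXPERIMENTAL_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP"}


@dataclass(frozen=True)
class Annotation:
    gene_product: str  # e.g. a protein accession
    go_term: str       # GO identifier
    evidence: str      # evidence code
    reference: str     # source publication or other reference


annotations = [
    # Example from the slides.
    Annotation("P05147", "GO:0047519", "IDA", "PMID:2976880"),
    # Hypothetical electronic annotation, added for contrast.
    Annotation("HYPOTHETICAL:1", "GO:0047519", "IEA", "HYPOTHETICAL_REF:1"),
]


def experimentally_supported(records):
    """Keep only annotations backed by a direct experimental evidence code."""
    return [record for record in records if record.evidence in EXPERIMENTAL_CODES]


for record in experimentally_supported(annotations):
    print(record.gene_product, record.go_term, record.evidence, record.reference)
```

Keeping the evidence code and reference on every record is what allows a downstream analysis to decide how much weight an assertion deserves.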

51 Judith Blake DAGs Definition Synonyms GO:54321 Terms … Transcription factor DNA binding Protein binding Ligand binding or carrier Vocabulary Annotations … J:65378TAS J:62648IDA J:60000IEA Ahr Edr2 Genes Synonyms NameMGI:105043 Vocabularies in MGI: GO Example Func Genomics2012

52 Judith Blake 34,315 genes 75,933 annotations Acetyl-CoA CoA-SH Citrate synthase Function 34,517 genes 65,513 annotations Cellular Component Biological Process 34,063 genes 87,565 annotations TCA Cycle March, 2010 GO @ MGI Total Genes: 35,147 Total Annot.:145,895 Total Papers: 8,985 Func Genomics2012

53 Judith Blake Now we can query across all annotations based on shared biological activity. Func Genomics2012

54 Judith Blake Biomedical Ontologies in MGI GO: (function, process, cellular location) SO: (sequence features) PRO: (specific proteins by species/strain) MP: (phenotypes) Traits / Behavior / Anatomies / Homologies (morphology) DO: (diseases, not phenotypes; definitions not diagnoses) CL: (cells and their lineages) OBO Foundry (standards and status) Func Genomics2012

55 Judith Blake Func Genomics2012 BioOntologies (GO) enable science Ontologies as terminology / classifications Ontologies enable data aggregation Ontologies used for data mining Ontologies used for statistical analysis

56 Judith Blake Func Genomics2012

