Gene Annotation Databases, Diseases, Anatomy, Genes, Physiology: Medical Informatics, Genomics and Bioinformatics.

1 Gene Annotation Databases; Diseases, Anatomy, Genes, Physiology; Medical Informatics, Genomics and Bioinformatics: Novel relationships & Deeper insights

2 Identification and Prioritization of Novel Disease Candidate Genes: Systems Biology Based Integrative Approaches
Anil Jegga, Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center (CCHMC); Department of Pediatrics, University of Cincinnati, Cincinnati, Ohio
Bioinformatics to Systems Biology, November 16, 2007

3 Acknowledgements
Jing Chen, Eric Bardes, Bruce Aronow
Support: Cincinnati Children’s Hospital Medical Center; Computational Medical Center, Cincinnati; Mouse Models of Human Cancers Consortium; University of Cincinnati College of Medicine
All the publicly available gene annotation resources, especially NCBI, MGI and UCSC

4 Two Separate Worlds….. With Some Data Exchange…
Medical Informatics (Disease World): Patient Records; PubMed; Clinical Trials; OMIM Clinical Synopsis
Disease Database → Name → Synonyms → Related/Similar Diseases → Subtypes → Etiology → Predisposing Causes → Pathogenesis → Molecular Basis → Population Genetics → Clinical findings → System(s) involved → Lesions → Diagnosis → Prognosis → Treatment → Clinical Trials……
Bioinformatics & the “omes”: Genome, Transcriptome, Proteome, miRNAome, Interactome, Metabolome, Physiome, Regulome, Variome, Pathome, Pharmacogenome
382 “omes” so far……… and there is “UNKNOME” too: genes with no function known (as on November 15, 2007)

5 Integrative Genomics - Biomedical Informatics: the Ultimate Goal…….
Medical Informatics (Disease World): Patient Records; PubMed; Clinical Trials; OMIM
Disease Database → Name → Synonyms → Related/Similar Diseases → Subtypes → Etiology → Predisposing Causes → Pathogenesis → Molecular Basis → Population Genetics → Clinical findings → System(s) involved → Lesions → Diagnosis → Prognosis → Treatment → Clinical Trials……
Bioinformatics: Genome, Transcriptome, Proteome, miRNAome, Interactome, Metabolome, Physiome, Regulome, Variome, Pathome, Pharmacogenome
► Personalized Medicine ► Decision Support System ► Course/Outcome Predictor ► Diagnostic Test Selector ► Clinical Trials Design ► Hypothesis Generator ► Novel Gene/Drug Targets…..

6 No Integrative Genomics is Complete without Ontologies
Gene World: Gene Ontology (GO)
Biomedical World: Unified Medical Language System (UMLS)

7 The 3 Gene Ontologies
Molecular Function = elemental activity/task
–the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
–What a product ‘does’, precise activity
Biological Process = biological goal or objective
–broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
–Biological objective, accomplished via one or more ordered assemblies of functions
Cellular Component = location or complex
–subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
–‘is located in’ (‘is a subcomponent of’)

8 Example: Gene Product = hammer
Function (what) | Process (why)
Drive a nail into wood | Carpentry
Drive a stake into soil | Gardening
Smash a bug | Pest Control
A performer’s juggling object | Entertainment

9 Unified Medical Language System Knowledge Server (UMLSKS)
–The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.
–The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.
–The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.

10 Unified Medical Language System Metathesaurus
Over 1 million biomedical concepts and about 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.
The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.
Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.
Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.
Uses:
–linking between different clinical or biomedical vocabularies
–information retrieval from databases with human-assigned subject index terms and from free-text information sources
–linking patient records to related information in bibliographic, full-text, or factual databases
–natural language processing and automated indexing research

11 Open biomedical ontologies

12 Mammalian Phenotype Ontology
1. The Mammalian Phenotype (MP) Ontology enables robust annotation of mammalian phenotypes in the context of mutations, quantitative trait loci and strains that are used as models of human biology and disease.
2. Each node in MPO represents a category of phenotypes, and each MP ontology term has a unique identifier, a definition, and synonyms, and is associated with gene variants causing these phenotypes in genetically engineered or mutagenesis experiments.
3. In the current version of MPO, there are >4250 terms associated with >4300 unique Entrez mouse genes (extrapolated to ~4300 orthologous human genes).

13 Disease Gene Identification and Prioritization
Hypothesis: The majority of genes that impact or cause disease share membership in any of several functional relationships, OR functionally similar or related genes cause similar phenotypes.
Functional Similarity – Common/shared:
–Gene Ontology term
–Pathway
–Phenotype
–Chromosomal location
–Expression
–Cis regulatory elements (Transcription factor binding sites)
–miRNA regulators
–Interactions
–Other features…..
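The notion of functional similarity as shared annotations can be illustrated with a small sketch. This is not the ToppGene scoring method; it simply averages the Jaccard overlap of annotation terms across feature types. The gene names, feature types and terms below are hypothetical placeholders, not real annotations.

```python
from typing import Dict, Set

# Hypothetical toy annotations (placeholder genes and terms, not real data):
# gene -> feature type -> set of annotation terms.
ANNOTATIONS: Dict[str, Dict[str, Set[str]]] = {
    "GENE_A": {"go": {"DNA repair", "cell cycle"}, "pathway": {"P1"}},
    "GENE_B": {"go": {"DNA repair"}, "pathway": {"P1", "P2"}},
    "GENE_C": {"go": {"ion transport"}, "pathway": {"P3"}},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared terms divided by all terms; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def functional_similarity(
    g1: str, g2: str, annotations: Dict[str, Dict[str, Set[str]]] = ANNOTATIONS
) -> float:
    """Mean Jaccard overlap over every feature type annotated for either gene."""
    f1, f2 = annotations[g1], annotations[g2]
    feature_types = set(f1) | set(f2)
    if not feature_types:
        return 0.0
    total = sum(
        jaccard(f1.get(ft, set()), f2.get(ft, set())) for ft in feature_types
    )
    return total / len(feature_types)


print(functional_similarity("GENE_A", "GENE_B"))  # 0.5
print(functional_similarity("GENE_A", "GENE_C"))  # 0.0
```

Averaging across feature types treats each type equally; a real tool would likely weight feature types differently.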

14 Background, Problems & Issues
1. Most of the common diseases are multifactorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors.
2. High-throughput genome-wide studies, like linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease causal genes.

15 Background, Problems & Issues
3. Since multiple genes are associated with the same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related.
4. Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in finding novel disease genes. For example, genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconi anaemia have been shown to be caused by mutations in different interacting proteins.

16 Disease candidate gene studies
Biological experiments (expensive, time consuming): linkage, gene expression
Potential candidate genes (too many!)
Fine mapping; hand/cherry picking; prioritization approach
Example: dilated cardiomyopathy. Linkage analysis; locus region 10q25-26 (Ellinor et al. J Am Coll Cardiol); ~9.5Mb with 68 genes; 7 candidates selected by experts; ADRB1 missing

17 Background, Problems & Issues: Current candidate gene prioritization tools
Assumption: genes involved in the same complex disease will have similar functions
Approach without training: Input: multiple locus regions; enriched functions; prioritize genes based on the functions
Approach with training: Training: known disease genes (10 from OMIM); Test: 68 genes at 10q25-26; score test genes based on their similarity to the training set
Example: dilated cardiomyopathy
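The "approach with training" can be sketched as follows: build a term profile from the training genes, then score each test gene by how well its annotation terms match that profile. This is a minimal illustration under assumed scoring rules, not the scoring used by any of the tools named in this talk; all gene and term names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def build_profile(training: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of training genes annotated with each term."""
    counts = Counter(term for terms in training.values() for term in terms)
    n = len(training)
    return {term: c / n for term, c in counts.items()}


def score_gene(terms: Set[str], profile: Dict[str, float]) -> float:
    """Average profile weight of the gene's terms (0.0 if unannotated)."""
    if not terms:
        return 0.0
    return sum(profile.get(t, 0.0) for t in terms) / len(terms)


def prioritize(
    test: Dict[str, Set[str]], training: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank test genes by descending score; ties broken by gene name."""
    profile = build_profile(training)
    scored = [(gene, score_gene(terms, profile)) for gene, terms in test.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical training and test sets (placeholder names and terms).
training_set = {"T1": {"a", "b"}, "T2": {"a", "c"}}
test_set = {"X": {"a"}, "Y": {"b", "d"}, "Z": {"d"}}
ranked = prioritize(test_set, training_set)
print(ranked)
```

Here gene X shares the term carried by every training gene and ranks first, while Z shares nothing with the training set and ranks last.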

18 TOPPGene: Transcriptome Ontology Pathway based Prioritization of Genes
Chen J, Xu H, Aronow BJ, Jegga AG. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 8(1): 392 [Epub ahead of print]
Applications:
1. For functional enrichment
2. For candidate gene prioritization
Why another gene prioritization method?

19 Comparison with other related approaches
Columns: Feature type | POCUS | Prospectr | SUSPECTS | ENDEAVOUR | ToppGene
Rows: Year; Sequence Features; GO Annotations; Transcript Features; Protein Features; Literature; Phenotype Annotations; Training set
[table cell values not preserved]

20 Comparison with other related approaches: Feature Details
Columns: Feature type | POCUS | Prospectr | SUSPECTS | ENDEAVOUR | ToppGene
Year: [values not preserved]
Sequence Features & Annotations: Gene length, Homology, Base composition, Gene length, Homology, Base composition, Blast, cis-element, Cytoband, cis-element, miRNA targets, GeneSets
Gene Annotations: Gene Ontology, Mouse Phenotype
Transcript Features: Gene expression, EST expression, Gene expression
Protein Features: domains, Protein domains, domains, interactions, pathways, domains, interactions, pathways
Literature: Keywords, Co-citation
Training set: No, Yes

21 Mammalian Phenotype Ontology
We do not check whether the human orthologous gene of a mouse gene causes a similar phenotype. Rather, we assume that orthologous genes cause “orthologous phenotype” and test the potential of the extrapolated mouse phenotype terms as a similarity measure to prioritize human disease candidate genes.

22 77 human genes explicitly associated with “heart development” (GO: ) Mouse orthologs cause various types of cardiac phenotype (MPO)

23 ToppGene – General Schema

24 TOPPGene - Data Sources
1. Gene Ontology: GO and NCBI Entrez Gene
2. Mouse Phenotype: MGI (used for the first time for human disease gene prioritization)
3. Pathways: KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB
4. Domains: UniProt (Pfam, InterPro, etc.)
5. Interactions: NCBI Entrez Gene (BioGRID, Reactome, BIND, HPRD, etc.)
6. PubMed IDs: NCBI Entrez Gene
7. Expression: GEO
8. Cytoband: MSigDB
9. Cis-Elements: MSigDB
10. miRNA Targets: MSigDB
New features added

25 TOPPGene - Validation
Random-gene cross-validation
–Disease-gene relations from OMIM and GAD databases
–Training set: disease genes with one gene (“target”) removed
–Test set: 100 genes = “target” gene + 99 random genes
–Rank of “target” gene
–Control: random training sets
–AUC and Sensitivity/Specificity
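The random-gene cross-validation loop above can be sketched as below. The scoring function is a stand-in (mean Jaccard overlap with the training genes), not ToppGene's scoring; the helper names `make_scorer` and `leave_one_out_ranks`, the seed, and the toy annotations are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Set


def make_scorer(annotations: Dict[str, Set[str]]) -> Callable[[str, List[str]], float]:
    """Stand-in scorer: mean Jaccard overlap between a gene and each training gene."""
    def score(gene: str, training: List[str]) -> float:
        terms = annotations.get(gene, set())
        sims = []
        for other_gene in training:
            other = annotations.get(other_gene, set())
            union = terms | other
            sims.append(len(terms & other) / len(union) if union else 0.0)
        return sum(sims) / len(sims) if sims else 0.0
    return score


def leave_one_out_ranks(
    disease_genes: List[str],
    background: List[str],
    score_fn: Callable[[str, List[str]], float],
    n_random: int = 99,
    seed: int = 0,
) -> Dict[str, int]:
    """For each disease gene: hold it out, mix it with random genes, record its rank."""
    rng = random.Random(seed)
    pool = [g for g in background if g not in disease_genes]
    ranks = {}
    for target in disease_genes:
        training = [g for g in disease_genes if g != target]
        test = [target] + rng.sample(pool, n_random)
        scores = {g: score_fn(g, training) for g in test}
        # Rank 1 is best; ties are resolved in the target's favour (optimistic).
        ranks[target] = 1 + sum(
            s > scores[target] for g, s in scores.items() if g != target
        )
    return ranks


# Hypothetical data: three "disease" genes share a term, 150 background genes do not.
toy = {f"D{i}": {"shared", f"d{i}"} for i in range(3)}
toy.update({f"B{i}": {f"b{i}"} for i in range(150)})
ranks = leave_one_out_ranks(["D0", "D1", "D2"], list(toy), make_scorer(toy))
print(ranks)
```

The control described on the slide would repeat the same loop with randomly chosen training sets in place of the disease genes.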

26 TOPPGene - Validation. Random-gene cross-validation: breast cancer example
Disease genes: ATM, BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53
Training set: BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53
Test set (ATM + 99 random genes): KIAA1333, PQLC3, RBMY2OP, ZNF133, LOC, FBL, SLEB4, FAM32A, AACSL, ATM, NDUFB5, DENND4A, C14orf106, …, KCNJ16
Ranked list after prioritization: 1. ATM 2. KIAA1333 3. PQLC3 4. RBMY2OP 5. ZNF133 6. LOC 7. FBL 8. SLEB4 9. FAM32A 10. AACSL 11. NDUFB5 12. DENND4A 13. C14orf106 … 100. KCNJ16

27 Random-gene cross-validation result
Training: 19 diseases with 693 genes
Control: 20 random sets of 35 genes each
Sensitivity/Specificity: 77/90
AUC:
Sensitivity: frequency of “target” genes that are ranked above a particular threshold position
Specificity: the percentage of genes ranked below the threshold
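The two definitions above translate directly into code when each validation run yields the rank of the "target" gene among 100 test genes. The AUC function below is one common way to integrate the resulting curve (trapezoidal rule over rank thresholds); the slides do not say how AUC was computed, so treat it as an assumption. The ranks used are made-up illustrative values.

```python
from typing import List


def sensitivity(ranks: List[int], k: int) -> float:
    """Fraction of target genes ranked at or above threshold position k."""
    return sum(r <= k for r in ranks) / len(ranks)


def specificity(k: int, n_test: int = 100) -> float:
    """Fraction of test genes ranked below threshold position k."""
    return (n_test - k) / n_test


def auc(ranks: List[int], n_test: int = 100) -> float:
    """Area under sensitivity vs. (1 - specificity), swept over all thresholds."""
    area, prev_x, prev_y = 0.0, 0.0, 0.0
    for k in range(1, n_test + 1):
        x = k / n_test  # equals 1 - specificity(k, n_test)
        y = sensitivity(ranks, k)
        area += (x - prev_x) * (y + prev_y) / 2  # trapezoid between thresholds
        prev_x, prev_y = x, y
    return area


# Made-up target ranks from four hypothetical validation runs.
example_ranks = [1, 5, 10, 50]
print(sensitivity(example_ranks, 10), specificity(10))  # 0.75 0.9
print(auc(example_ranks))
```

Read this way, "77/90" means that 77% of target genes fall within the top 10 of 100 positions, the threshold at which specificity is 90%.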

28 Random-gene cross-validation with only one feature Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization

29 Random-gene cross-validation by leaving one feature out
Overall performance reported for: All features; All – MP; All – MP – PubMed [values not preserved]
Plot legend: All; All – MP; All – MP – PubMed
Sensitivity: true positive rate at a cutoff score
Specificity: true negative rate at the same cutoff
Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization

30 Locus-region cross-validation using different feature sets
Columns: Features | Average rank ratio of “target” genes | Number of times “target” genes were ranked top 5% | Number of times “target” genes were ranked top 10%
All | 7.39%
GO + MP + PubMed | 7.50%
MP + PubMed | 7.08%
Without GO | 6.84%
Without Pathway | 7.66%
Without Domain | 6.71%
Without Interaction | 7.17%
Without Expression | 7.28%
Without MP | 9.77%
Without Pubmed | 9.91%
Without MP & Pubmed | 22.61% | 7180
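The three columns of this table can be computed from per-run results as sketched below. The slide does not define "rank ratio"; the sketch assumes it is the target gene's rank divided by the number of genes in its test set. The input values are invented for illustration and are unrelated to the figures in the table.

```python
from typing import Dict, List, Tuple


def rank_ratio(rank: int, n_genes: int) -> float:
    """Assumed definition: target rank divided by test-set size (lower is better)."""
    return rank / n_genes


def summarize(results: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (rank, test-set size) pairs into the three table columns."""
    ratios = [rank_ratio(rank, n) for rank, n in results]
    return {
        "avg_rank_ratio": sum(ratios) / len(ratios),
        "top_5_percent": sum(r <= 0.05 for r in ratios),
        "top_10_percent": sum(r <= 0.10 for r in ratios),
    }


# Invented example: three validation runs, each with a 100-gene locus region.
summary = summarize([(1, 100), (8, 100), (30, 100)])
print(summary)
```

A lower average rank ratio means the held-out genes tend to appear nearer the top of the ranked list, which is why removing MP and PubMed together (22.61%) stands out as the largest loss.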

31 ToppGene web server (http://toppgene.cchmc.org) For functional enrichment analysis

35 PPI - Predicting Disease Genes
1. Direct protein–protein interactions (PPI) are one of the strongest manifestations of a functional relation between genes.
2. Hypothesis: Interacting proteins lead to the same or similar disease phenotypes when mutated.
3. Several genetically heterogeneous hereditary diseases have been shown to be caused by mutations in different interacting proteins, e.g. Hermansky-Pudlak syndrome and Fanconi anaemia. Hence, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.

36 Mining human interactome (HPRD, BioGrid)
Known Disease Genes; Direct Interactants of Disease Genes; Indirect Interactants of Disease Genes
Which of these interactants are potential new candidates?
Prioritize candidate genes in the interacting partners of the disease-related genes
Training sets: disease-related genes
Test sets: interacting partners of the training genes
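Collecting direct and indirect interactants of known disease genes amounts to a breadth-first walk over the interaction network, stopped after two steps. The sketch below assumes the interactome is available as a simple list of undirected gene pairs; it does not use any HPRD or BioGRID API, and the gene names are placeholders.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def interactant_levels(
    seeds: List[str], edges: Iterable[Tuple[str, str]], max_level: int = 2
) -> Dict[str, int]:
    """Map each gene within max_level interaction steps of a seed to its level.

    Level 0 = known disease genes, 1 = direct interactants, 2 = indirect interactants.
    """
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:  # interactions are treated as undirected
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    level = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        gene = queue.popleft()
        if level[gene] == max_level:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(gene, ()):
            if neighbour not in level:
                level[neighbour] = level[gene] + 1
                queue.append(neighbour)
    return level


# Placeholder network: S1 and S2 are the "disease genes" (training set).
toy_edges = [("S1", "A"), ("S2", "A"), ("A", "B"), ("B", "C")]
levels = interactant_levels(["S1", "S2"], toy_edges)
test_set = sorted(g for g, lvl in levels.items() if lvl > 0)
print(levels, test_set)
```

The genes at levels 1 and 2 form the test set that is then ranked against the level-0 training genes.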

37 Example: Breast cancer. OMIM genes (level 0); directly interacting genes (level 1); indirectly interacting genes (level 2)!

38 ToppGene web server (http://toppgene.cchmc.org) For candidate gene prioritization

41 Example: Breast cancer study. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature May 27.
rs id | Location | Gene | Training set | Test set
rs | q26 | FGFR2 | 15 OMIM genes | 83 genes in the region
Prioritization result:
Rank | Gene | Description | P-value
1 | BUB3 | budding uninhibited by benzimidazoles 3 homolog |
 | FGFR2 | fibroblast growth factor receptor |
 | BCCIP | BRCA2 and CDKN1A interacting protein |

42 Example: Breast cancer study. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature May 27.

43 ToppGene Prioritization Example: Breast cancer
Training set: 15 OMIM genes; Test set: 342 interacting genes
Ranked Interactants:
Rank | Gene | Description
1 | ATR | ataxia telangiectasia and Rad3 related
2 | FANCD2 | Fanconi anemia, complementation group D2
3 | NBN (NBS1) | Nibrin

44 Limitations
General limitations of any training-test strategy:
–Prior knowledge of disease-gene associations is required.
–Assumption that the disease genes yet to be discovered will be consistent with what is already known about a disease.
–Dependence on the accuracy and completeness of the functional annotations. Only one-fifth of the known human genes have pathway or phenotype annotations, and the functions of more than 40% of genes are still not defined!
Chen et al., 2007; BMC Bioinformatics

45 Mouse Phenotype - Limitations
1. MP is not a disease-centric ontology, and the phenotype of the same gene mutation can vary depending on specific mouse strains or their genetic backgrounds.
2. Orthologous genes need not necessarily result in orthologous phenotypes.
Possible Solutions - Future Directions
More efficient cross-species phenome extrapolation, wherein the mouse phenotype terms are mapped semantically to human phenotype concepts (from UMLS) (“orthologous phenotype”) and the resultant orthologous genes associated with an orthologous phenotype are identified.
Chen et al., 2007; BMC Bioinformatics

46 PPIs for disease gene identification: Limitations
1. Noisy interactome data
–In vitro vs. in vivo (e.g. only 5.8% of yeast two-hybrid predicted interactions were confirmed by HPRD)
–Extrapolation of interactions from one species to another
–Bias towards “well-studied” genes/proteins
2. Too many interactants! Hub proteins
3. Two interacting proteins need not lead to a similar phenotype when mutated
4. Disease proteins may lie at different points in a pathway and need not interact directly
5. Lastly, disease mutations need not always involve proteins
Oti et al., 2006; J Med Gen

47 (under presentations) Thank You! And PRIORITIZATION too!

