Lawrence Hunter, Ph.D. Professor and Director Computational Bioscience Program University of Colorado School of Medicine

Presentation transcript:

Lawrence Hunter, Ph.D. Professor and Director Computational Bioscience Program University of Colorado School of Medicine Accelerating Biomedical Discovery

How to Understand Gene Sets? There is no “gene” for any complex phenotype; gene products function together in dynamic groups. A key task is to understand why a set of gene products are grouped together in a condition, exploiting all existing knowledge about: –The genes (all of them) –Their relationships (|genes|²) –The condition(s) under study.

The amount of information relevant to the task: the 1,000 Genomes Project will create 1,400 GB next year

Yet Still Not Enough! Experimental coverage of interactions and pathways is still sparse, especially in mammals

Exponential knowledge growth: 1,170 peer-reviewed gene-related databases in the 2009 NAR database issue; 804,399 PubMed entries in 2008 (> 2,200/day). Breakdown of disciplinary boundaries makes more of it relevant to each of us. “Like drinking from a firehose” – Jim Ostell

Knowledge-based data analysis Goal: Bring all of this information (and more!) to bear on analyzing experimental results. How? 3R systems –Integrate multiple databases (using the semantic web) –Extract knowledge from the literature –Infer implicit interactions –Build knowledge networks Nodes are fiducials, like genes or ontology terms Arcs (relations) are qualified (typed) and quantified (with reliability) –Deliver a tool for analysts to use knowledge networks to understand experiments and generate hypotheses

Reading The best source of knowledge is the literature OpenDMAP is significant progress in concept recognition in biomedical text Even simple-minded approaches are powerful –Gene co-occurrence widely used –Thresholded co-occurrence fraction is better
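
A minimal sketch of the two co-occurrence approaches named on this slide. The slide does not define the "co-occurrence fraction", so the sketch assumes it is the number of abstracts mentioning both genes divided by the number mentioning either one. The function names, the toy abstracts, and the 0.5 threshold are illustrative and not taken from the talk.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_fractions(abstract_genes):
    """Return {(gene_a, gene_b): fraction} for every pair seen together.

    abstract_genes: iterable of sets, one set of gene symbols per abstract.
    ASSUMPTION: fraction = abstracts mentioning both / abstracts mentioning either.
    """
    mentions = Counter()   # gene -> number of abstracts mentioning it
    together = Counter()   # (gene_a, gene_b) -> abstracts mentioning both
    for genes in abstract_genes:
        mentions.update(genes)
        together.update(combinations(sorted(genes), 2))
    return {
        pair: both / (mentions[pair[0]] + mentions[pair[1]] - both)
        for pair, both in together.items()
    }


def thresholded_edges(fractions, threshold=0.5):
    """Keep only pairs whose co-occurrence fraction reaches the threshold."""
    return {pair: f for pair, f in fractions.items() if f >= threshold}


# Hypothetical toy corpus: the gene symbols found in four abstracts.
abstracts = [
    {"Myod1", "Myog"},
    {"Myod1", "Myog", "Ddc"},
    {"Myod1"},
    {"Ddc", "Cadps"},
]
fractions = cooccurrence_fractions(abstracts)
edges = thresholded_edges(fractions, threshold=0.5)
```

Plain co-occurrence would keep all four pairs in `fractions`; the threshold drops pairs that appear together only incidentally relative to how often each gene is mentioned overall.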

OpenDMAP extracts typed relations from the literature Concept recognition tool –Connect ontological terms to literature instances –Built on Protégé knowledge representation system –New project to hook to NCBO ontologies dynamically Language patterns associated with concepts and slots –Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, etc. –Linked to many text analysis engines via UIMA Best performance in BioCreative II IPS task >500,000 instances of three predicates (with arguments) extracted from Medline Abstracts [Hunter, et al., 2008]

Reasoning in knowledge networks [Bada & Hunter, 2006]
[Figure: ChEBI terms catechols (CHEBI:33566), catecholamines (CHEBI:33567), adrenaline (CHEBI:33568), noradrenaline (CHEBI:33569), connected to GO biological process (BP) annotations of two mouse genes.
BP annotations, first group: carboxylic acid metabolic process (GO: ), catecholamine biosynthesis process (GO: ), response to toxin (GO: ), …
BP annotations, second group: catecholamine secretion (GO: ), protein transport (GO: ), vesicle organization (GO: ), …
Genes: Ddc; MGI:94876 and Cadps; MGI:
Inferred link between the two genes: Reliability = ]
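
One way to picture the inference on this slide: two genes with no GO term in common can still be linked when their GO processes refer to the same ChEBI chemical. The sketch below is an illustration under stated assumptions, not the authors' implementation. The mapping from GO process labels to ChEBI identifiers, and the assignment of the first group of processes to Ddc and the second to Cadps, are assumptions read off the order of the slide text.

```python
# ASSUMPTION: which ChEBI chemical each GO process label refers to.
PROCESS_CHEMICAL = {
    "catecholamine biosynthesis process": "CHEBI:33567",  # catecholamines
    "catecholamine secretion": "CHEBI:33567",             # catecholamines
}

# ASSUMPTION: gene-to-process assignment inferred from the slide's ordering.
GENE_PROCESSES = {
    "Ddc": {
        "carboxylic acid metabolic process",
        "catecholamine biosynthesis process",
        "response to toxin",
    },
    "Cadps": {
        "catecholamine secretion",
        "protein transport",
        "vesicle organization",
    },
}


def chemicals_for(gene):
    """ChEBI identifiers referenced by any GO process annotated to the gene."""
    return {
        PROCESS_CHEMICAL[process]
        for process in GENE_PROCESSES[gene]
        if process in PROCESS_CHEMICAL
    }


def shared_chemicals(gene_a, gene_b):
    """Chemicals that connect two genes through their GO annotations."""
    return chemicals_for(gene_a) & chemicals_for(gene_b)


shared_terms = GENE_PROCESSES["Ddc"] & GENE_PROCESSES["Cadps"]
shared = shared_chemicals("Ddc", "Cadps")
```

Here `shared_terms` is empty, so plain co-annotation finds nothing, while `shared` contains the catecholamines identifier that supports an inferred link.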

Inferred interactions Dramatically increase coverage… But at the cost of lower reliability. We apply a new method to assess reliability without an explicit gold standard [Leach, et al., 2007; Gabow, et al., 2008] Top 1,000 Craniofacial genes (1,000,000 possible edges)

3R Knowledge Networks Combine diverse sources… –Databases of interactions –Information extracted from the literature (CF or DMAP) –Inference of interactions … Into a unified knowledge summary network: –Every link gets a reliability value –Combine multiple links for one pair into a single summary More sources → more reliable Better sources → more reliable “Noisy Or” versus “Linear Opinion Pool” Summaries allow for effective use of noisy inferences –[Leach PhD thesis 2007; Leach et al., 2007]
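
The two combination rules named on this slide can be sketched with their standard textbook definitions. The slide does not say which weights or source reliabilities were used, so the equal default weights and the example values below are illustrative assumptions.

```python
from math import prod


def noisy_or(reliabilities):
    """Probability that at least one independent source is right."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def linear_opinion_pool(reliabilities, weights=None):
    """Weighted average of source reliabilities (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(reliabilities)
    return sum(w * r for w, r in zip(weights, reliabilities)) / sum(weights)


# Hypothetical reliabilities of three links asserted for the same gene pair.
links = [0.5, 0.5, 0.2]
combined_noisy_or = noisy_or(links)         # 1 - 0.5 * 0.5 * 0.8
combined_pool = linear_opinion_pool(links)  # (0.5 + 0.5 + 0.2) / 3
```

Under noisy-OR an additional source can never lower the summary value, which fits the slide's "more sources, more reliable" point; a linear opinion pool averages instead, so a weak extra source pulls the summary down.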

Knowledge-based analysis of experimental data High-throughput studies generate their own interaction networks tied to fiducials –E.g. Gene correlation coefficients in expression data Combine with background knowledge by: –Averaging (highlights already known linkages) –Hanisch (ISMB 2002) method (emphasizes data linkages not yet well supported by the literature) Report highest scoring data + knowledge linkages, color coding for scores of average, logistic or both.

The Hanalyzer: 3R proof of concept [Leach, Tipney, et al., PLoS Comp Bio 2009] Knowledge network built for mouse –NLP only CF and DMAP for three relationships from PubMed abstracts Simple reasoning (co-annotation, including ontology cross-products) Visualization of combined knowledge / data network via Cytoscape + new plugins
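
Co-annotation reasoning, as mentioned on this slide, can be sketched as linking every pair of genes that share an annotation term. This is a minimal illustration, not the Hanalyzer code; the annotation table is hypothetical and `GeneX` is a placeholder gene.

```python
from collections import defaultdict
from itertools import combinations


def coannotation_links(annotations):
    """Infer a link between every pair of genes sharing an annotation term.

    annotations: dict mapping gene symbol -> set of ontology term labels.
    Returns {(gene_a, gene_b): set of shared terms}.
    """
    genes_by_term = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            genes_by_term[term].add(gene)

    links = defaultdict(set)
    for term, genes in genes_by_term.items():
        for pair in combinations(sorted(genes), 2):
            links[pair].add(term)
    return dict(links)


# Hypothetical annotations; GeneX is a placeholder, not a real gene symbol.
annotations = {
    "Ddc": {"catecholamine biosynthesis", "response to toxin"},
    "Cadps": {"catecholamine secretion", "protein transport"},
    "GeneX": {"protein transport", "response to toxin"},
}
links = coannotation_links(annotations)
```

Indexing genes by term first avoids comparing every gene pair directly, which matters when the number of candidate pairs grows with the square of the gene count.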

[Figure: architecture diagram. Headings: External sources, Reading methods, Reasoning methods, Reporting methods. Box labels: Gene database 1, Gene database 2, …, Gene database n; Ontology annotations; Medline abstracts; Experimental data; Parsers & Provenance tracker; Biomedical language processing; Literature co-occurrence; Semantic integration; Co-annotation inference; Co-database inference; Ontology enrichment; Reliability estimation; Network integration methods; Knowledge Network; Data Network; Visualization & Drill-down tool]

First application: Craniofacial Development NICHD-funded study (Rich Spritz; Trevor Williams) focused on cleft lip & palate Well designed gene expression array experiment: –Craniofacial development in normal mice (control) –Three tissues (Maxillary prominence, Fronto-nasal prominence, Mandible) –Five time points (every 12 hours from E10.5) –Seven biological replicates per condition (well powered) >1,000 genes differentially expressed among at least 2 of the 15 conditions (FDR<0.01)

The Whole Network Craniofacial dataset, covering all genes on the Affy mouse chip. Graph of top 1000 edges using AVE or HANISCH (1734 in total). Edges identified by both. Focus on mid-size subnetwork
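
The edge counts on this slide follow from taking the top-scoring edges under each method and merging them. The sketch below assumes that 1734 is the size of the union of the two top-1000 lists; the edge names and scores are hypothetical.

```python
def top_k_edges(scores, k):
    """Return the k highest-scoring edges as a set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def merge_top_edges(ave_scores, hanisch_scores, k):
    """Split the union of both top-k lists into AVE-only, HANISCH-only, both."""
    ave = top_k_edges(ave_scores, k)
    hanisch = top_k_edges(hanisch_scores, k)
    return {
        "ave_only": ave - hanisch,
        "hanisch_only": hanisch - ave,
        "both": ave & hanisch,
    }


# Hypothetical scores for four edges under each combination method.
ave_scores = {"e1": 0.9, "e2": 0.8, "e3": 0.3, "e4": 0.1}
hanisch_scores = {"e1": 0.7, "e2": 0.2, "e3": 0.9, "e4": 0.1}
groups = merge_top_edges(ave_scores, hanisch_scores, k=2)

# Inclusion-exclusion on the slide's numbers, ASSUMING 1734 is the union size.
shared_on_slide = 1000 + 1000 - 1734
```

Under that assumption, 266 edges would be identified by both methods, which is the group the slide colors as "both".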

Link calculations for MyoD1 → MyoG
–Co-occurrence in abstracts: PMID: … R =
–DMAP transport relation R =
–Shared GO biological processes: GO:6139… R =
–Shared GO cell component: GO:5667… R =
–Shared GO molecular functions: GO:3705… R =
–Shared knockout phenotypes: MP:5374 … R =
–Shared interpro domains: IPR:11598… R =
–Premod_M interaction: Mod R =
–Inferred link through shared GO/ChEBI: ChEBI:16991 R = 0.01
–Correlation in expression data: P_data =

Strong data and background knowledge facilitate explanations. Goal is abductive inference: why are these genes doing this? –Specifically, why the increase in mandible before the increase in maxilla, and not at all in the frontonasal prominence? [Figure legend: AVE edges; Both edges; Skeletal muscle structural components; Skeletal muscle contractile components; Proteins of no common family]

Exploring the knowledge network

Scientist + aide + literature → explanation: tongue development. Myogenic cells invade the tongue primordia at ~E11. Myoblast differentiation and proliferation continue until E15, at which point the tongue muscle is completely formed. The same group of proteins shows a delayed onset, at E12.5, during mastication muscle development. [Figure legend: AVE edges; Both edges; Skeletal muscle structural components; Skeletal muscle contractile components; Proteins of no common family]

On to Discovery: Add the strong data, weak background knowledge (Hanisch) edges to the previous network, bringing in new genes. Four of these genes were not previously implicated in facial muscle development (1 almost completely unannotated). [Figure legend: Inferred synapse signaling proteins; Inferred myogenic proteins; HANISCH edges; AVE edges; Both edges; Proteins of no common family; Proteins in the previous AVE-based sub-network]

Prediction validated! HoxA2, E12.5; ApoBEC2, E11.5; Zim1, E12.5; E43rik, E12.5

Transforming biomedical research with 3R systems? Deeper connections to the literature –NLP on full texts of journal articles & textbooks –Stay current, be aware of priority & citations Abductive QA (provide evidence, explanation) Better user experience in reporting –Integration with an analyst’s notebook –More and better sense-making approaches –Different types of data (e.g. GWAS) –Automated focus on “interesting” material

OBO for knowledge representation and reasoning What is the role of CAV3 in muscle? “In contrast to clathrin-coated and COPI- or COPII-coated vesicles, caveolae are thought to invaginate and collect cargo proteins by virtue of the lipid composition of the caveolar membrane, rather than by the assembly of a cytosolic protein coat. Caveolae pinch off from the plasma membrane and can deliver their contents either to endosome-like compartments or (in a process called transcytosis, which is discussed later) to the plasma membrane on the opposite side of a polarized cell.” etc!

KR&R poses new challenges Need many on-the-fly terms (cross-products!) –Not all cross-products are valid: caveolae of muscle cells work, but not all CCs are in all cells (e.g. axons) Need many new relationships: –has-function, is-realization-of, occurs-in, precedes, results-in-formation-of, results-in-transport-to… Need to integrate multiple ontologies: e.g. cell from CL (muscle) and cell from CC (caveolae) Non-logical inference! CAV3 annotated to caveolae; can’t logically infer caveolae of muscle cells.

To find out more… Leach, et al., (2009) “Biomedical Discovery Acceleration, with Applications to Craniofacial Development” PLoS Comp Bio 5(3):e (or just search YouTube for “hanalyzer”) Presentation at ISMB Highlights track See also our Ontology Quality Assurance talk at ISMB (Verspoor, et al.)

Preview: Univocality in GO
Univocality (Spinoza, 1677): “a shared interpretation of the nature of reality”. For GO/OBO, consistency of expression.
Transformation-based method detects failures:
–GO: induction by organism of symbiont apoptosis
–GO: induction by organism of systemic acquired resistance in symbiont
–GO: radial glial cell differentiation in the forebrain
–GO: cell proliferation in forebrain
–GO: cellular bud site selection
–GO: selection of site for barrier septum formation
–GO: telomere maintenance in response to DNA damage
–GO: DNA damage response, signal transduction
See Verspoor, et al., in ISMB proceedings…

Acknowledgements Sonia Leach (Design, first implementation) Hannah Tipney (Analyst) Bill Baumgartner (UIMA, Software engineer) Philip Ogren (Knowtator) Mike Bada (Ontologist) Helen Johnson (Linguist) Kevin Cohen (NLP guru) Lynne Fox (Librarian) Aaron Gabow (Programmer) NIH grants –R01 LM –R01 LM –R01 GM –G08 LM –T15 LM MIT Press for permission to use Being Alive for doing science

Opportunities at one of the best Computational Bioscience Programs Top faculty, great research, serious education Institutional Training Grant from NLM – Generous graduate and postdoctoral fellowships Grad school application deadline January 1 Currently open faculty positions & postdocs More info at Ask me for details Come Join Us!