Data to Biology. Shankar Subramaniam, University of California at San Diego.


“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
KNOWLEDGE EXTRACTION FROM DATA
– DEALING WITH THE COFFEE DRINKERS PROBLEM
– HOW CAN BIOLOGICAL DATA BE INTEGRATED?
– DEFINING THE GRANULARITY OF DATA
– UNBIASED STATISTICAL METHODS
– BIOLOGY-CONSTRAINED METHODS
– INFORMATION METRICS
– HOW DO WE DEAL WITH CONTEXT?

FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
NOISY DATA
– CAN WE DEFINE HOW MUCH NOISE AND WHAT TYPE OF NOISE CAN BE TOLERATED IN EXTRACTING KNOWLEDGE?
– IS MISSING DATA TANTAMOUNT TO NOISE? IF NOT, HOW DO WE DEAL WITH IT?

FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
CLASSIFICATION OF MODULARITY FROM DATA
– HOW CAN WE DEFINE MODULES (FUNCTIONAL, SPATIAL, TEMPORAL, ETC.) FROM DATA?
– WHAT IS THE INFORMATION CONTENT IN THE MODULES?
– CAN WE COMPARE MODULES QUANTITATIVELY?

FOUR “BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
DEALING WITH DYNAMICAL DATA
– HOW DO WE DEAL WITH TIME SERIES DATA?
– HOW IS INFORMATION PROCESSED IN TIME SERIES DATA?
– WHAT GRANULARITY AND CONTEXT ARE NECESSARY TO ANALYZE THESE DATA?

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem [highly skewed distributions]:
– 90% of people are coffee drinkers. What does this say about making drink predictions that are 90% accurate?
– Biology is all about highly skewed distributions, which pose significant challenges for methods, measures, and validation.
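The question on this slide can be made concrete with a small sketch. The population below is hypothetical (90 coffee drinkers, 10 tea drinkers), and the helper names are illustrative only. It shows that a predictor that ignores its input entirely still reaches 90% accuracy while never finding a single member of the minority class.

```python
from collections import Counter


def majority_baseline(labels):
    """Predict the most common label for every sample, ignoring all features."""
    most_common = Counter(labels).most_common(1)[0][0]
    return [most_common] * len(labels)


def accuracy(truth, predicted):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def recall(truth, predicted, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    hits = sum(1 for t, p in zip(truth, predicted) if t == positive and p == positive)
    total = sum(1 for t in truth if t == positive)
    return hits / total


# Hypothetical population matching the slide: 90% coffee drinkers.
truth = ["coffee"] * 90 + ["tea"] * 10
predicted = majority_baseline(truth)

acc = accuracy(truth, predicted)
tea_recall = recall(truth, predicted, "tea")
print(f"accuracy={acc:.2f}, recall on tea drinkers={tea_recall:.2f}")
```

Accuracy alone therefore says little on skewed data; per-class measures such as recall expose the failure.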

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem, examples:
– 99% of us likely do not have the disease one might be looking for
– 99% of protein interactions are accounted for by 5% of the proteins
– 99% of the known disease-implicated mutations occur in less than 5% of the people
– (all estimates, but largely realistic)
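To illustrate why the first example matters, here is a short worked calculation using Bayes' rule. The 1% prevalence comes from the slide; the 99% sensitivity and 99% specificity are assumptions chosen for illustration, not figures from the talk.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Prevalence from the slide; test characteristics are assumed values.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.99, specificity=0.99)
print(f"P(disease | positive test) = {ppv:.2f}")
```

Under these assumptions, a test that is right 99% of the time for both sick and healthy people still yields a positive call that is correct only about half the time, because the healthy group is 99 times larger.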

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem:
– Most current techniques in data analysis are rendered useless because of this.
– Statistical significance with meaningful null hypotheses is critical (information content is one of the most commonly used measures even today).
– Simulation-based methods often do not work, so analytical methods are required.
– Methods must optimize for these analytical measures of quality.
– Validation in the absence of complete data is hard.
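One way to see why simulation can fall short is resolution: an empirical p-value estimated from N random samples cannot go below roughly 1/(N + 1), whereas the events of interest on skewed data are often far rarer. The sketch below uses a hypergeometric null (drawing a gene list at random from a genome) as an assumed example of an analytical null model; the talk does not name a specific one, and all numbers are hypothetical.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are sampled without replacement
    from `population` items, of which `successes` are marked."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 20,000 genes, 100 carry an annotation,
# and 10 of the 50 genes in our list carry it.
p_analytic = hypergeom_tail(population=20_000, successes=100, draws=50, observed=10)

# Smallest nonzero p-value that 10,000 random resamples could report.
n_resamples = 10_000
simulation_floor = 1 / (n_resamples + 1)

print(f"analytic p-value: {p_analytic:.3e}")
print(f"simulation floor: {simulation_floor:.3e}")
```

Here the analytic tail probability lies many orders of magnitude below what the resampling budget can resolve, so a simulation would only ever report "less than about 1e-4".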

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The coffee drinkers problem (real examples)
– When is a module in a network significant?
– When is an observed mutation in a sequenced, phenotype-implicated genome significant?
– When is an alignment of two networks significant?
– When is correlation in time-course microarray data significant?
Conversely:
– How do we detect the most significant modules in a network?
– How do we identify all phenotype-implicated mutations from a large number of sequenced diseased and normal genomes?
– How do we align networks to obtain the most statistically significant alignments?
– How do we find the most correlated signals and the associated groups of genes?
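As a minimal sketch of the first question, the code below scores a module by asking how likely its internal edge count would be if every node pair were connected independently with the network's overall edge density. This binomial (random-graph) null is an assumption made here for simplicity. It ignores degree heterogeneity and the fact that a module is usually selected from many candidates, which is exactly why the choice of null hypothesis matters. All numbers are hypothetical.

```python
from math import comb


def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def module_p_value(num_nodes, num_edges, module_size, module_edges):
    """Probability of seeing at least `module_edges` edges among `module_size`
    nodes if each node pair were linked independently at the network density."""
    density = num_edges / comb(num_nodes, 2)
    possible_pairs = comb(module_size, 2)
    return binomial_tail(possible_pairs, density, module_edges)


# Hypothetical network: 1,000 nodes and 5,000 edges (density about 1%).
dense_module = module_p_value(1_000, 5_000, module_size=10, module_edges=20)
empty_module = module_p_value(1_000, 5_000, module_size=10, module_edges=0)

print(f"10 nodes, 20 internal edges: p = {dense_module:.3e}")
print(f"10 nodes,  0 internal edges: p = {empty_module:.3f}")
```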

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The Hidden Terminal Problem:
– Consider a phenotype as reflected in its genetic variants (e.g., the nucleotide-level variations associated with a disease).
– Often, these variations are not consistent (e.g., liver cancer manifests itself in gene mutations that are not all at the same place).
– However, these variations correspond to significantly aligned pathways in the underlying networks (i.e., they disrupt the same function, albeit by altering different genes).
– How do we go from an observable (phenotype/disease) to an abstraction (where the observable has little informative content) to other abstractions (where the observable might have significant information content)?
– More importantly, how do we go backwards (predict observables)?

“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS RESEARCHERS
The Hidden Terminal Problem: Specific Instance
– Start from observed mutations in a specific disease (liver and breast cancer have significant genomic data available).
– The mutations result from noise, other phenotypes, and the specific disease.
– A simple intersection yields no signal.
– Cross-reference against synthetic lethality data.
– Redefine intersection over pathways.
– Reassess mutations under this definition and quantify the significance of these mutations w.r.t. the observed phenotype.
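The contrast between a simple intersection and an intersection over pathways can be sketched with toy data. Every gene, pathway, and patient name below is hypothetical, and the gene-to-pathway mapping stands in for whatever pathway annotation is actually used. The sketch covers only the "redefine intersection" step; the synthetic lethality cross-reference and the significance assessment are not shown.

```python
# Hypothetical mutated genes per patient: no gene is shared by all three.
patient_mutations = {
    "patient_1": {"GENE_A", "GENE_X"},
    "patient_2": {"GENE_B", "GENE_Y"},
    "patient_3": {"GENE_C"},
}

# Hypothetical annotation mapping each gene to the pathways it belongs to.
gene_to_pathways = {
    "GENE_A": {"pathway_1"},
    "GENE_B": {"pathway_1"},
    "GENE_C": {"pathway_1", "pathway_2"},
    "GENE_X": {"pathway_3"},
    "GENE_Y": {"pathway_4"},
}


def gene_intersection(mutations):
    """Genes mutated in every patient (the 'simple intersection')."""
    return set.intersection(*mutations.values())


def pathway_intersection(mutations, mapping):
    """Pathways hit by at least one mutation in every patient."""
    per_patient = [
        set().union(*(mapping.get(gene, set()) for gene in genes))
        for genes in mutations.values()
    ]
    return set.intersection(*per_patient)


shared_genes = gene_intersection(patient_mutations)
shared_pathways = pathway_intersection(patient_mutations, gene_to_pathways)

print("shared genes:", shared_genes)
print("shared pathways:", shared_pathways)
```

In this toy case the gene-level intersection is empty while the pathway-level intersection recovers a single shared pathway, which is the kind of signal the slide describes.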