Computational Methods in Systems Biology

Slides:



Advertisements
Similar presentations
12-3 RNA and Protein Synthesis
Advertisements

Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.
1 A B C
AP STUDY SESSION 2.
1
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 4 Computing Platforms.
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
David Burdett May 11, 2004 Package Binding for WS CDL.
Introduction to Algorithms 6.046J/18.401J
1 RA I Sub-Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Casablanca, Morocco, 20 – 22 December 2005 Status of observing programmes in RA I.
Properties of Real Numbers CommutativeAssociativeDistributive Identity + × Inverse + ×
Custom Statutory Programs Chapter 3. Customary Statutory Programs and Titles 3-2 Objectives Add Local Statutory Programs Create Customer Application For.
CALENDAR.
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt BlendsDigraphsShort.
1 Click here to End Presentation Software: Installation and Updates Internet Download CD release NACIS Updates.
Photo Slideshow Instructions (delete before presenting or this page will show when slideshow loops) 1.Set PowerPoint to work in Outline. View/Normal click.
Knowledge Extraction from Technical Documents Knowledge Extraction from Technical Documents *With first class-support for Feature Modeling Rehan Rauf,
Break Time Remaining 10:00.
This module: Telling the time
The basics for simulations
Table 12.1: Cash Flows to a Cash and Carry Trading Strategy.
PP Test Review Sections 6-1 to 6-6
EIS Bridge Tool and Staging Tables September 1, 2009 Instructor: Way Poteat Slide: 1.
Bellwork Do the following problem on a ½ sheet of paper and turn in.
CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 ACM Principles and Practice of Parallel Programming, PPoPP, 2006 Panel Presentations Parallel Processing is.
Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.
Computer vision: models, learning and inference
Sample Service Screenshots Enterprise Cloud Service 11.3.
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
1 RA III - Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Buenos Aires, Argentina, 25 – 27 October 2006 Status of observing programmes in RA.
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
1..
CONTROL VISION Set-up. Step 1 Step 2 Step 3 Step 5 Step 4.
Adding Up In Chunks.
MaK_Full ahead loaded 1 Alarm Page Directory (F11)
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.
Artificial Intelligence
Overview of Genevestigator
1 Using Bayesian Network for combining classifiers Leonardo Nogueira Matos Departamento de Computação Universidade Federal de Sergipe.
Subtraction: Adding UP
: 3 00.
5 minutes.
1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.
Analyzing Genes and Genomes
DTU Informatics Introduction to Medical Image Analysis Rasmus R. Paulsen DTU Informatics TexPoint fonts.
Speak Up for Safety Dr. Susan Strauss Harassment & Bullying Consultant November 9, 2012.
Essential Cell Biology
Converting a Fraction to %
Numerical Analysis 1 EE, NCKU Tien-Hao Chang (Darby Chang)
ANSC644 Bioinformatics-Database Mining 1 ANSC644 Bioinformatics §Carl J. Schmidt §051 Townsend Hall §
Chapter 8 Estimation Understandable Statistics Ninth Edition
Clock will move after 1 minute
Intracellular Compartments and Transport
PSSA Preparation.
Copyright © 2013 Pearson Education, Inc. All rights reserved Chapter 11 Simple Linear Regression.
Essential Cell Biology
Immunobiology: The Immune System in Health & Disease Sixth Edition
Chapter 13 Web Page Design Studio
Physics for Scientists & Engineers, 3rd Edition
Energy Generation in Mitochondria and Chlorplasts
Select a time to count down from the clock above
Murach’s OS/390 and z/OS JCLChapter 16, Slide 1 © 2002, Mike Murach & Associates, Inc.
Copyright Tim Morris/St Stephen's School
3 - 1 Copyright McGraw-Hill/Irwin, 2005 Markets Demand Defined Demand Graphed Changes in Demand Supply Defined Supply Graphed Changes in Supply Equilibrium.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Bio-Medical Informatics
Presentation transcript:

Computational Methods in Systems Biology Nir Friedman Maya Schuldiner .

What is Biology? “A branch of knowledge that deals with living organisms and vital processes” The hottest scientific frontier of our times Many great processes have been figured out Much is still unknown Tremendous impact on Medicine Both diagnosis, prognosis, and treatment I didn’t add pictures to anything, but we should once we decide what the slides will look like. NF: I agree

Bakers Yeast Saccharomyces Cereviciae Used to make bread and beer The simplest cell that still resembles human cells

Biological Systems are Complex The System is NOT just a sum of its parts

What is Systems Biology? “Systems biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behavior of that system” The last decades lead to revolution on how we can examine and understand biological systems Characterized by High-throughput assays Integration of multiple forms of experiments & knowledge Mathematical modeling

The Age of Genomes 404 Complete Microbial Genomes (Thousands in progress) 31 Complete Eukaryotic Genomes (315 in progress!) 3 Complete Plant Genomes (6 in progress) Bacteria 1.6Mb 1600 genes Eukaryote 13Mb ~6000 genes Animal 100Mb ~20,000 genes Human 3Gb ~30,000 genes? Individual Genomes? Bacteria Haemophilus influenzae Eukaryote S. Cerevisiae Animal C. elegans 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10

.

Ask Not What Systems Biology Can do For you….

Why Biology for NIPS Crowd? Quantity Data-intense discipline: Too vast for manual interpretation Systematic Collection of data on all genes/proteins/… Multi-faceted Measurements of complementary aspects of cellular function, development and disease states Challenge of integration and fusion of multiple data Has the potential to be medically applicative! This is way too much writing for one slide but we can think how to make it more interesting Shitloads of interesting data, chance to make an impact. Chance to publish in good journals?

Flow of Information in Biology DNA RNA Protein Phenotype Recipe (in safe) Working copy The resulting dish The Review

The “Post-Genomic Era” Systematic is Not Just More Assays DNA Genomic sequences Variations within a population … RNA Quantity Structure Degradation rate … Protein Quantity Location Modifications Interactions … Phenotype Genetic interventions Environmental interventions …

Outline Protein DNA RNA Phenotype Stores genetically inherited information Sequence of four nucleotide types (A, C, G, T) Two complementary strands creating base pairs (bp) 105 bp in bacteria, 3x109 in humans 6 X1013 in wheat

Understanding Genome Sequences ~3,289,000,000 characters: aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc cagcactttgggagatcgaggagggaggatcacctgaggtcaggagttac agacatggagaaaccccgtctctactaaaaatacaaaattagcctggcgt ggtggcgcatgcctgtaatcccagctactcgggaggctgaggcaggagaa tcgcttgaacccgggagcggaggttgcggtgagccgagatcgcaccgttg cactccagcctgggcgacagagcgaaactgtctcaaacaaacaaacaaaa aaacctgatacatggtatgggaagtacattgtttaaacaatgcatggaga tttaggttgtttccagtttttactggcacagatacggcaatgaatataat tttatgtatacattcatacaaatatatcggtggaaaattcctagaagtgg aatggctgggtcagtgggcattcatattgagaaattggaaggatgttgtc aaactctgcaaatcagagtattttagtcttaacctctcttcttcacaccc ttttccttggaagaaagctaaatttagacttttaaacacaaaactccatt ttgagacccctgaaaatctgggttcaaagtgtttgaaaattaaagcagag gctttaatttgtacttatttaggtataatttgtactttaaagttgttcca . . . Goal: Identify components encoded in the DNA sequence

Open Reading Frame ATGCTCAGCGTGACCTCA . . . CAGCGTTAA M L S V T S . . . Q R STP Protein-encoding DNA sequence consists of a sequence of 3 letter codons Starts with the START codon (ATG) Ends with a STOP codon (TAA, TAG, or TGA)

Finding Open Reading Frames ATGCTCAGCGTGACCTCA . . . CAGCGTTAA M L S V T S . . . Q R STP Try all possible starting points 3 possible offsets 2 possible strands Simple algorithm finds all ORFs in a genome Many of these are spurious (are not real genes) How do we focus on the real ones?

Using Additional Genomes Basic premise “What is important is conserved” Evolution = Variation + Selection Variation is random Selection reflects function Idea: Instead of studying a single genome, compare related genomes A real open reading frame will be conserved

Phylogentic Tree of Yeasts S. cerevisiae S. paradoxus S. mikatae S. bayanus C. glabrata S. castellii K. lactis A. gossypii K. waltii D. hansenii C. albicans Y. lipolytica N. crassa M. graminearum M. grisea A. nidulans S. pombe ~10M years Kellis et al, Nature 2003

Evolution of Open Reading Frame S. cerevisiae S. paradoxus S. mikatae S. bayanus ATGCTCAGCGTGACCTCA . . . ATGCTCAGCGTGACATCA . . . ATGCTCAGGGTGACA--A . . . ATGCTCAGG---ACA--A . . . Conserved positions Frame shift changes interpretation of downstream seq Variable positions A deletion

Examples Spurious ORF Confirmed ORF Frame shift ATG not conserved Variable Frame shift Spurious ORF ATG not conserved Confirmed ORF Greedy algorithm to find conserved ORFs surprisingly effective (> 99% accuracy) on verified yeast data Sequencing error [Kellis et al, Nature 2003]

Defining Conservation Conserved Variable Naïve approach Consensus between all species Problem: Rough grained Ignores distances between species Ignores the tree topology Goal: More sensitive and robust methods A A C G T A C C A % conserv 100 33 55 55

Probabilistic Model of Evolution Aardvark Bison Chimp Dog Elephant Random variables – sequence at current day taxa or at ancestors Potentials/Conditional distribution – represent the probability of evolutionary changes along each branch

Parameterization of Phylogenies Assumptions: Positions (columns) are independent of each other Each branch is a reversible continuous time discrete state Markov process governed by a rate matrix Q

Conserved vs. unconserved Two hypotheses: Use 2 3 4 1 2 3 4 1 Conserved Short branches (fewer mutations) Unconserved Long branches (more mutations) [Boffelli et al, Science 2003]

Genes Are Better Conserved log Fast/Slow % conserved [Boffelli et al, Science 2003]

Evolution: Beyond Vanilla Flavor Rate variation: Different positions evolve at different rate Dependencies between rates of related positions Rate does not have to be constant (change in evolutionary selection) These variants can be naturally captured by graphical models

Challenges Other types of genomic elements Small polypeptides (peptohormones, neuropeptides) RNA coding genes rRNA, tRNA, snoRNA… miRNA Regulatory regions

Regulatory Elements *Essential Cell Biology; p.268

Transcription Factor Binding Sites Relatively short words (6-20bp) Recognition is not perfect Binding sites allow variations Often conserved

Challenges Other types of genomic elements Small polypeptides (peptohormones, neuropeptides) RNA coding genes rRNA, tRNA, snoRNA… miRNA Regulatory regions Recognition of elements without comparisons Clearly sequence contains enough information to “parse” it within the living cell

Outline Protein DNA RNA Phenotype Copied from DNA template Conveys information (mRNA) Can also perform function (tRNA, rRNA, …) Single stranded, four nucleotide types (A,C, G, U) For each expressed gene there can be as few as 1 molecule and up to 10,000 molecules per cell.

Outline Measuring dynamic process Generic analysis Protein DNA RNA Phenotype Measuring dynamic process Generic analysis Reconstructing networks

Gene Expression Same DNA content Very different phenotype Difference is in regulation of expression of genes

High Throughput Gene Expression Transcription Translation Extract Microarray RNA expression levels of 10,000s of genes in one experiment

Dynamic Measurements Time courses Conditions Time courses Different perturbations (genetic & environmental) Biopsies from different patient populations … Genes Gasch et al. Mol. Cell 2001

Expression: Supervised Approaches Labeled samples Classifier confidence P-value =< 0.027 Feature selection + Classification Potential diagnosis/prognosis tool Characterizes the disease state  insights about underlying processes Maybe switch to a red/green example from another paper Segman et al, Mol. Psych. 2005

Expression: Unsupervised PCA Cluster Eisen et al. PNAS 1998; Alter et al, PNAS 2000

Compendia From “small” directed experiments to compendia of gene expression measurements Yeast Cancer Development Change of emphasis in questions Slide about network reconstruction?

Papers Compendia 26 datasets from Whitehead and Stanford Breast cancer Fibroblast EWS/FLI infection serum Gliomas HeLa cell cycle Leukemia Liver cancer Lung cancer NCI60 Neuro tumors Prostate cancer Stimulated immune PBMC Various tumors Viral infection B lymphoma 26 datasets from Whitehead and Stanford Segal et al Nat. Gen. 2004

Cancer span wide range of phenomena Tumor type specific Cancer types & folding Translation, degradation Cell cycle Apoptosis Immune IF & keratins Cytoskeleton & ECM Cell lines Chromatin nucleotide metabolism DNA damage / MMPs growth regulation Signaling & Signaling Muscle Adhesion & signaling Synapse & signaling Metabolism Breast Cytoskeleton (IF & MT) development Protein biosynthesis Nucleotide metabolism & oxidative phos. Signaling, development immune Metabolism, detox & ECM Tissues Signaling & CNS Cancer span wide range of phenomena Tumor type specific Tissue specific Generic across many tumors Modules >0.4 CNS Immune Cell lines Hemato Leukemia Lung/AD Breast\liver AD/CNS Liver Lung / Hemati Hemao Segal et al Nat. Gen. 2004

Goal: Reconstruct Cellular Networks Biocarta. http://www.biocarta.com/

First Attempt: Bayesian networks Gene A Gene C Gene D Gene E Gene B One gene  One variable An instance: microarray sample Use standard approaches for learning networks Friedman et al, JCB 2000

Second Attempt: Module Networks MAPK of cell wall integrity pathway SLT2 RLM1 CRH1 YPS3 PTP2 One common regulation function Regulation Function 1 Regulation Function 2 Regulation Function 3 Regulation Function 4 Idea: enforce common regulatory program Statistical robustness: Regulation programs are estimated from m*k samples Organization of genes into regulatory modules: Concise biological description Segal et al, Nature Genetics 2003

Learned Network (fragment) Atp1 Atp3 Module 1 (55 genes) Atp16 Hap4 Mth1 Module 4 (42 genes) Kin82 Tpk1 Module 25 (59 genes) Nrg1 Module 2 (64 genes) Msn4 Usv1 Tpk2 Gasch et al. 2001: Yeast Response to Environmental Stress 173 Yeast arrays 2355 Genes 50 modules

Validation How do we evaluate ourselves? Statistical validation Ability to generalize (cross validation test) Test Data Log-Likelihood (gain per instance) Number of modules -150 -100 -50 50 100 150 200 300 400 500 Bayesian network performance

Validation How do we evaluate ourselves? Statistical validation Biological interpretation Annotation database Literature reports Other experiments, potentially different experiment types

Visualization & Interpretation Cis-regulatory motifs Molecular Pathways (KEGG GeneMAPP) Functional annotations GO Visualization Interpretation Hypotheses Function Dynamics Regulation Expression profiles

Hap4 Msn4 Gene set coherence (GO, MIPS, KEGG) HAP4 Motif 29/55; p<2x10-13 Hap4 STRE (Msn2/4) 32/55; p<103 Oxid. Phosphorylation (26, 5x10-35) Mitochondrion (31, 7x10-32) Aerobic Respiration (12, 2x10-13) HAP4+STRE 17/29; p<7x10-10 Msn4 -500 -400 -300 -200 -100 Gene set coherence (GO, MIPS, KEGG) Match between regulator and targets Match between regulator and cis-reg motif Match between regulator and condition/logic Remember to mention that coherence underestimated p-values using hypergeometric dist; corrected for multiple hypotheses

Segal et al, Nature Genetics 2003 Validation How do we evaluate ourselves? Statistical validation Biological interpretation Experiments Test causal predictions in the real system Lead to additional understanding beyond the prediction Experimental validation of three regulators 3/3 successful results Segal et al, Nature Genetics 2003

Challenges New methodologies for the huge amount of existing RNA profiles Meta analysis Better mechanistic models Contrasting new profiles with existing databases Visualization Other measurements Degradation rates Localization

Outline Proteins are the main executers of cellular function DNA RNA Phenotype Proteins are the main executers of cellular function Building blocks are 20 different amino-acid Synthesized from mRNA template Acquires a sequence dependent 3-D conformation Proteomics: Systematic Study of Proteins

Why Measure Proteins? RNA Level ≠ Protein level Protein quantity is not a direct function of RNA levels Protein Level ≠ Activity level Activity of proteins is regulated by many additional mechanisms Cellular localization Post-translational modifications Co-factors (protein, RNA, …)

Challenges in Proteomics Problematic recognition: No generic mechanism to detect different protein forms Thousands of different proteins in the typical cell Protein abundances vary over several orders of magnitude

Making a Protein Generic TAG Tags make a protein generic Underlying assumption is that the tag does not change the protein All proteins have the same tag Inability to pool strains Each experiment is done on a “different” strain

TAP-Tag Libraries for Abundance ~4500 Yeast strains have been TAP tagged How much is each protein expressed? What is the proteome under different conditions?

Why Study Protein Complexes? # Most proteins in the cell work in protein complexes or through protein/protein interactions # To understand how proteins function we must know: - who they interact with - when do they interact - where do they interact - what is the outcome of that interaction

Using TAP-Tag to Find Complexes .

Large Scale Pull Downs Provide Information on Protein Complexes Both labs used the same proteins as bait Each lab got slightly different results The results depended dramatically on analysis method *Gavin et al. Nature 2006 *Krogan et al. Nature 2006 .

*Gavin et al. Nature 2006 *Krogan et al. Nature 2006

We can now define a yeast “interactome” Isnt full use of data Static picture .

Making a Protein Generic Fluorescent proteins allow us to visualize the proteins within the cell. Allow us to measure individual cells and the variation/ noise within a population

Cellular Localization Using GFP Tags What can it teach us? A library of yeast GFP fusion strains has been used to localize nearly all yeast proteins A collection of cloned C. elegans promoters is being created for similar purposes Huh et al Nature 2003 Genome Research 14:2169-2175, 2004

Challenges in Fluorescence-based Approaches Better Vision processing will allow to do this in High-Throughput and answer questions like: Changes in localization in response to cellular cues Changes in localization in response to environment cues Changes in localization in various genetic backgrounds Dynamics of localization changes

THROUGHPUT THE MAJOR BOTTLENECK .

Single Cell Measurements: Flow Cytometry Cells pass through a flow cell one at a time Lasers focused on the flow cell excite fluorescent protein fusions Allows multiple measurements (cell size, shape, DNA content) Applications: Protein abundance Protein-protein interactions Single-cell measurements

High Throughput Flow Cytometer Describe software. 4000 samples per 10 hours is much more rapid than western blotting (especially when one considers that data analysis is automated for cytometry, but not for western blotting. Compares favorably with the time required for a mass spec. run. 7 seconds/sample ~50,000 counts per sample

Comparison of mRNA to Protein Levels Allows Identification of Post-transcriptional Regulation Compare Rich media Poor media Observed behaviors No change in both Coordinated change Change in protein, but not mRNA Log2 Poor/Rich protein Log2 Poor/Rich mRNA Newman et al Nature 2006

Noise in Biological Systems Measurement of 10,000 individual cells allows measurement of variation (noise) in a biological context factors that affect levels of noise in gene expression: Abundance, mode of transcriptional regulation, sub-cellular localization Nature 441, 840-846(15 June 2006)

Challenges Proteomics is in its infancy - easier to make an impact Integrating this data with other proteomic/genomic data to better predict protein function Higher Throughput methods such as flow cytometry will allow generation of varied data: Different growth conditions, Cell cycle, Stress, Mating Tagging is mammalian cells becoming more feasable - near future should bring proteomic data on human cells

Outline Protein DNA RNA Phenotype Traits that selection can apply to, the observable characteristics Mutations in the DNA can cause a change in a phenotype. Shape and size Growth rate How many years your liver can survive alcohol damage….

Single Gene KO Phenotypic Screen Giaver et ., 2002

Starting to Probe the Cellular Network Genetic Interaction The effect of a mutation in one gene on the phenotype of a mutation in a second gene Different type of interaction - not physical DIFFERENT TYPE OF INTERACTION - NOT PHYSICAL

w.t. Genetic Interactions ∆X ∆Y ∆X Y<∆X*∆Y ∆X Y=∆X*∆Y The way to do so is to measure the growth rate of the wild type strain, and then measuring the effect of knocking out X on the growth rate, and the effect of knocking out Y on the growth rate. Now, if there is no functional relation between the genes, we expect the growth rate of the double mutant to be similar to the cumulative effect of both single knockouts. However if they belong to two alternative pathways that lead to the same product we would expect that the effect of the double mutant will be much graver then the cumulative effect of both single KO. we call this an aggravating interaction. But, if the activity of the genes in the pathway depend on wach other, then after knocking out X, the KO of Y will have almost no effect - hence we expect the doulbe KO growth rate to be faster than what we would expect from the cumulative effect of both single KO. This case is called an alleviating genetic interaction. Aggravating interaction No interaction Alleviating interaction

What is a genetic interaction (Epistasis)? The effect of a mutation in one gene on the phenotype of a mutation in a second gene. Genotype Growth Rate WT 1 D geneA x (x <= 1) D geneB y (y <= 1) geneA D geneB xy (Product) DIFFERENT TYPE OF INTERACTION - NOT PHYSICAL DIFFERENT TYPE OF INTERACTION - NOT PHYSICAL

What is a genetic interaction? Genetic Interaction Growth Rate None xy Aggravating less than xy Alleviating greater than xy What is a genetic interaction?  A B C X Y A B C X Y ROY KISHONI - PRODUCT OF TWO ACTIN - LYSINE TRANSPORT ATP PRODUCTION USED EXTENSIVELY IN GENETICS TO INFER GENE FUNCTION - WE DO COMBINATION BTWEEN CLASSIC GENETICS AND HIGH THROUGH PUT LARGE SCALE NOT ONLY FASTER BUT ALSO DEEPER UNDERSTANDING 

Systematic Method of Analyzing Double Mutants X ∆X:NAT ∆Y:KAN ∆X:NAT ∆Y:KAN MENTION NEVAN ∆X:NAT ∆Y:KAN WT ∆X:NAT ∆Y:KAN Double deletion mutants are made systematically Colony sizes are measured in high throughput Tong et al., 2001

Epistasis Mini Array Profiles E-MAPS Epistasis Mini Array Profiles BY SOLVING ALL THESE PROBLEMS WE COULD GET FOR THE FIRST TIME E-MAPS….. Aggravating Alleviating Schuldiner et al., Cell 2005

Defining Protein Complexes On Off B C A Co-complex proteins have similar interaction patterns alleviating interactions Aggravating Alleviating

Challenges for the future Only a small fraction of the information has been utilized in E-MAPS made so far E-MAPS to cover all yeast cellular processes to come out until the end of 2007 Extending this to human cells is now feasible using gene silencing techniques Amount of data scales exponentially - Higher organisms - more genes .

Outline Model-free approach Model-based approach Protein DNA RNA Phenotype Combined Insights Model-free approach Model-based approach

Why Integrate Data? High-throughput assays: attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat High-throughput assays: Observations about one aspect of the system Often noisy and less reliable than traditional assays Provide partial account of the system

Model-Free Approach Kan YAR041C YAR040W YAL003W YAL002W YAL001C GCN4 HSF1 RAP1 Salt Poor Rich Mito Cyto Nuc Gene Location Expression Phenotype Binding sites Treat different observations about elements as multivariate data Clustering Statistical tests

Model-Free Approach Finding bi-clusters in large compendium of functional data Tanay et al PNAS 2004

Model-Free Approach Pros: No assumptions about data Unbiased Can be applied to many data types Can use existing tools to analyze combined data Cons: Interpretation is post-analysis No sanity check Cannot deal with data from different modalities (interactions, other types of genetic elements)

Model-Based Approach What is a model? “A description of a process that could have generated the observed data” Idealized, simplified, cartoonish Describes the system & how it generates observations attttgggccagtgaatttttttctaagctaatatagttatttggacttt tgacatgactttgtgtttaattaaaacaaaaaaagaaattgcagaagtgt tgtaagcttgtaaaaaaattcaaacaatgcagacaaatgtgtctcgcagt cttccactcagtatcatttttgtttgtaccttatcagaaatgtttctatg tacaagtctttaaaatcatttcgaacttgctttgtccactgagtatatta tggacatcttttcatggcaggacatatagatgtgttaatggcattaaaaa taaaacaaaaaactgattcggccgggtacggtggctcacgcctgtaatcc aattgtgctctgcaaattatgatagtgatctgtatttactacgtgcatat

Explaining Expression DNA binding proteins Non-coding region Gene Repressor Activator Binding sites Coding region RNA transcript Key Question: Can we explain changes in expression? General concept: Transcription factor binding sites in promoter region should “explain” changes in transcription

Explaining Expression Relevant data: Expression under environmental perturbations Expression under transcription factors KOs Predicted binding sites of transcription factors Protein-DNA interactions of transcription factors Protein levels/location of transcription factors …

A Stab at Model-Based Analysis TCGACTGC GATAC CCAAT GCAGTT Motifs ACGATGCTAGTGTAGCTGATGCTGATCGATCGTACGTGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCAGCTAGCTCGACTGCTTTGTGGGGCCTTGTGTGCTCAAACACACACAACACCAAATGTGCTTTGTGGTACTGATGATCGTAGTAACCACTGTCGATGATGCTGTGGGGGGTATCGATGCATACCACCCCCCGCTCGATCGATCGTAGCTAGCTAGCTGACTGATCAAAAACACCATACGCCCCCCGTCGCTGCTCGTAGCATGCTAGCTAGCTGATCGATCAGCTACGATCGACTGATCGTAGCTAGCTACTTTTTTTTTTTTGCTAGCACCCAACTGACTGATCGTAGTCAGTACGTACGATCGTGACTGATCGCTCGTCGTCGATGCATCGTACGTAGCTACGTAGCATGCTAGCTGCTCGCAAAAAAAAAACGTCGTCGATCGTAGCTGCTCGCCCCCCCCCCCCGACTGATCGTAGCTAGCTGATCGATCGATCGATCGTAGCTGAATTATATATATATATATACGGCG Genes Sequence TCGACTGC GATAC + CCAAT GCAGTT Motif Profiles Expression Profiles

Unified Probabilistic Model Sequence Experiment Gene Expression Sequence S4 S1 S2 S3 R2 R1 R3 Motifs Motif Profiles Expression Profiles Segal et al, RECOMB 2002, ISMB 2003

Unified Probabilistic Model Sequence Sequence S1 S2 S3 S4 Motifs R1 R2 R3 Module Experiment Motif Profiles Gene Expression Profiles Expression Segal et al, RECOMB 2002, ISMB 2003

Unified Probabilistic Model Sequence Sequence Observed S1 S2 S3 S4 Motifs R1 R2 R3 Experiment Module Motif Profiles ID Level Gene Observed Expression Profiles Expression Segal et al, RECOMB 2002, ISMB 2003

Probabilistic Model Regulatory Modules Sequence Sequence Motifs genes Motif profile Expression profile Regulatory Modules Sequence Sequence S1 S2 S3 S4 Motifs R1 R2 R3 Experiment Module Motif Profiles ID Level Gene Expression Profiles Expression Segal et al, RECOMB 2002, ISMB 2003

Model-Based Approach Pros: Incorporates biological principles Suggests mechanisms Incorporate diverse data modalities Declarative semantics -- easy to extend Cons: Reconstruction depends on the model Biological principles Bias

Physical Interactions

Physical Interactions Interaction between two proteins makes it more probable that they share a function reside in the same cellular localization their expression is coordinated have similar genetic interactions … Can we exploit this to make better inference of properties of proteins?

Relational Markov Network Probabilistic patterns hold for all groups of objects Represent local probabilistic dependencies -1 1  P2.M P1.N Protein Nucleus Cytoplasm Mitochndria Interaction Exists 2 1 -1  I.E P2.N P1.N Protein Nucleus Mitochndria Cytoplasm

Relational Markov Network Compact model Allows to infer protein attributes by combining Interaction network topology (observed) Observations about neighboring proteins

Adding Noisy Observations Add class for experimental assay View assay result as stochastic function (CPD) of underlying biology GFP image Protein Cytoplasm Nucleus Nucleus Cytoplasm Mitochndria Mitochndria Interaction Exists Directed CPD Protein Nucleus Mitochndria Cytoplasm

Uncertainty About Interactions Add interaction assays as noisy sensors for interactions GFP image Protein Cytoplasm Nucleus Nucleus Cytoplasm Mitochndria Mitochndria Interaction Exists Protein Nucleus Mitochndria Cytoplasm Assay Interact

Relational Markov Network Design Plan Relational Markov Network Simultaneous prediction Pre7 Pre9 Tbf1 Cdk8 Med17 Cln5 Taf10 Pup3 Pre5 Med5 Srb1 Med1 Taf1 Mcm1

Relational Markov Network Add potentials over interactions Protein Nucleus Protein Nucleus Interaction Exists Potential over Interaction Exists Interaction Exists Protein Nucleus

Relational Markov Models Combine (Noisy) interaction assays (Noisy) protein attribute assays Preferences over network structures To find a coherent prediction of the interaction network

Challenges

Discussion Every day papers are published with high-throughput data that is not analyzed completely or not used in all ways possible The bottlenecks right now are the time and ideas to analyze the data

The Need for Computational Methods Experiment design and facilitation: Control, vision, and signal processing; Natural Language Retrieval Low-level analysis: Informatics, Ontologies and Knowledge bases. Higher order analysis (discovering principles) Integration of multiple datasets Modeling: “in silico” simulation of biological systems (e-cell) This might not come across well. I thought about digaram here showing the experimental process and how they connect

The Need for Computational Methods Experiment Low-level analysis Modeling & Simulation High-level analysis

What are the Options? Analyze published data Abundant, easy to obtain Method oriented Don’t have to bump into biologists Two million other groups have that data too Collaborate with an experimental group Be involved in all stages of project Understand the system and the data better Have priority on the data Involved in generating & testing biological hypotheses Goal oriented Start your own experimental group…(yeah, sure)

Questions to Keep in Mind Crucial questions to ask about biological problems What quantities are measured? Which aspects of the biological systems are probed How are they measured? How this measurement represents the underlying system? Bias and noise characteristics of the data Why are these measurements interesting? Which conclusions will make the biggest impact?

Acknowledgements Slides: Special thanks: Gal Elidan, Ariel Jaimovich The Computational Bunch Yoseph Barash Ariel Jaimovich Tommy Kaplan Daphne Koller Noa Novershtern Dana Pe’er Itsik Pe’er Aviv Regev Eran Segal The Biologist Crowd David Breslow Sean Collins Jan Ihmels Nevan Krogan Jonathan Weissman