Large scale genomic data mining Curtis Huttenhower 10-23-09 Harvard School of Public Health Department of Biostatistics.

Presentation transcript:

Large scale genomic data mining Curtis Huttenhower Harvard School of Public Health Department of Biostatistics

Mining Biological Data ~100 GB More than 100GB

Mining Biological Data ~100 GB More than 100GB How can we ask and answer specific biomedical questions using thousands of genome-scale datasets?

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics

A Definition of Functional Genomics
[Diagram: Genomic data and Prior knowledge feed mappings among data, genes, and functions (labels as extracted: Data ↓ Function ↓ Function; Gene ↓ Gene ↓ Function).]

MEFIT: A Framework for Functional Genomics
Related Gene Pairs
BRCA1  BRCA2  0.9
BRCA1  RAD51  0.8
RAD51  TP …
…
Unrelated Gene Pairs
BRCA2  SOX2   0.1
RAD51  FOXP2  0.2
ACTR1  H6PD   0.15
…
[Figure: MEFIT; frequency of related and unrelated gene pairs plotted from low correlation to high correlation.]

MEFIT: A Framework for Functional Genomics
[Diagram: datasets (Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998) linked to a Functional Relationship node, together with a Biological Context node (functional area, tissue, disease, …).]
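The slides above describe comparing correlation histograms for related and unrelated gene pairs and combining several datasets into one functional-relationship prediction. A minimal sketch of that idea follows, assuming a naive Bayes combination over discretized correlation bins. The function names, the pseudocount, the toy histograms, and the prior are all hypothetical illustrations; this is not the MEFIT implementation.

```python
import math


def likelihood_ratio(related_counts, unrelated_counts, bin_index, pseudocount=1.0):
    """P(bin | related) / P(bin | unrelated), estimated from histogram counts.

    The pseudocount is an assumption added here to avoid division by zero.
    """
    n_bins = len(related_counts)
    p_rel = (related_counts[bin_index] + pseudocount) / (
        sum(related_counts) + pseudocount * n_bins)
    p_unrel = (unrelated_counts[bin_index] + pseudocount) / (
        sum(unrelated_counts) + pseudocount * n_bins)
    return p_rel / p_unrel


def posterior_related(prior, datasets, observed_bins):
    """Naive Bayes posterior that a gene pair is functionally related.

    datasets: list of (related_counts, unrelated_counts) histograms.
    observed_bins: the pair's correlation bin in each dataset, or None if missing.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for (related, unrelated), bin_index in zip(datasets, observed_bins):
        if bin_index is None:  # the pair was not measured in this dataset
            continue
        log_odds += math.log(likelihood_ratio(related, unrelated, bin_index))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Toy histograms with three bins: low, medium, high correlation.
toy_datasets = [
    ([1, 3, 16], [12, 6, 2]),
    ([2, 8, 10], [9, 8, 3]),
]
high_in_both = posterior_related(0.1, toy_datasets, [2, 2])
low_in_both = posterior_related(0.1, toy_datasets, [0, 0])
print(high_in_both, low_in_both)
```

Pairs that are highly correlated in both toy datasets end up above the prior, and pairs with low correlation end up below it; a missing dataset simply contributes no evidence.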

Functional Interaction Networks
[Figure: MEFIT produces a global interaction network; examples shown are an autophagy network, a vacuolar transport network, and a translation network.]
Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

Predicting Gene Function
[Figure: network of predicted relationships between genes, with edges ranging from low confidence to high confidence, highlighting cell cycle genes.]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.

Comprehensive Validation of Computational Predictions
With David Hess, Amy Caudy
[Diagram: a validation loop with these stages]
–Genomic data and prior knowledge
–Computational predictions of gene function: MEFIT; SPELL (Hibbs et al 2007); bioPIXIE (Myers et al 2005)
–Genes predicted to function in mitochondrion organization and biogenesis
–Laboratory experiments: petite frequency, growth curves, confocal microscopy
–New known functions for correctly predicted genes
–Retraining

Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
–Original GO Annotations
–135 Under-annotations
–82 Novel Confirmations, First Iteration
–17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months

Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
–Original GO Annotations
–95 Under-annotations
–40 Confirmed Under-annotations
–80 Novel Confirmations, First Iteration
–17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional Associations Between Contexts
[Figure: predicted relationships between genes, from low confidence to high confidence, within the cell cycle genes and between the cell cycle genes and the DNA replication genes.]
The average strength of the relationships within a gene set indicates how cohesive a process is.
The average strength of the relationships between two gene sets indicates how associated two processes are.

Functional mapping: Scoring functional associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
–Edges between their genes
–Edges within each set
–The background edges incident to each set
–The baseline of all edges in the network
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.
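The four measures above can be combined in more than one way, and the slide does not give the formula. The sketch below assumes one plausible form, (between / background) × (baseline / within), chosen only because it rises with between-set edges and falls with within-set and background edges as the slide describes. The helper names and the toy network are hypothetical, and each gene set is assumed to contain at least two genes.

```python
from itertools import combinations, product
from statistics import mean


def make_network(genes, default, edges):
    """Build a symmetric weight lookup w[a][b] (hypothetical helper)."""
    w = {a: {b: default for b in genes if b != a} for a in genes}
    for (a, b), weight in edges.items():
        w[a][b] = w[b][a] = weight
    return w


def mean_weight(w, pairs):
    """Average edge weight over gene pairs, ignoring self-pairs."""
    return mean(w[a][b] for a, b in pairs if a != b)


def functional_association(w, g1, g2):
    """Assumed score: (between / background) * (baseline / within)."""
    genes = sorted(w)
    g1, g2 = sorted(set(g1)), sorted(set(g2))
    between = mean_weight(w, product(g1, g2))
    within = mean_weight(w, list(combinations(g1, 2)) + list(combinations(g2, 2)))
    background = mean_weight(w, product(sorted(set(g1) | set(g2)), genes))
    baseline = mean_weight(w, combinations(genes, 2))
    return (between / background) * (baseline / within)


# Toy network: three tight gene pairs; the first two are strongly cross-linked.
genes = ["a", "b", "c", "d", "e", "f"]
edges = {("a", "b"): 0.9, ("c", "d"): 0.9, ("e", "f"): 0.9}
edges.update({(x, y): 0.8 for x in "ab" for y in "cd"})
w = make_network(genes, 0.1, edges)

linked = functional_association(w, ["a", "b"], ["c", "d"])
unlinked = functional_association(w, ["a", "b"], ["e", "f"])
print(linked > unlinked)
```

With equal within-set cohesion, the cross-linked pair of sets scores higher than the pair connected only by weak background edges.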

Functional mapping: Bootstrap p-values
Scoring functional associations is great…
…how do you interpret an association score?
–For gene sets of arbitrary sizes?
–In arbitrary graphs?
–Each with its own bizarre distribution of edges?
Empirically!
–For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
–Null distribution is approximately normal with mean 1.
–Standard deviation is asymptotic in the sizes of both gene sets.
–Maps FA scores to p-values for any gene sets and underlying graph.
[Figures: histograms of FAs for random sets, by # genes; null distribution σs for one graph.]
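The empirical procedure above can be sketched as follows: score many random gene-set pairs of the same sizes, fit a normal distribution to those scores, and read a p-value from its upper tail. The scoring function here is a deliberately simple stand-in (mean between-set weight over the network-wide mean), not the FA score from the talk, and all names and defaults are hypothetical.

```python
import random
from itertools import combinations, product
from statistics import NormalDist, mean, stdev


def simple_score(w, g1, g2):
    """Stand-in association score: mean between-set weight / mean of all weights."""
    genes = list(w)
    between = mean(w[a][b] for a, b in product(g1, g2) if a != b)
    baseline = mean(w[a][b] for a, b in combinations(genes, 2))
    return between / baseline


def null_distribution(w, size1, size2, score=simple_score, n_samples=500, seed=0):
    """Fit a normal null from scores of random, disjoint gene sets of the given sizes."""
    rng = random.Random(seed)
    genes = list(w)
    scores = []
    for _ in range(n_samples):
        drawn = rng.sample(genes, size1 + size2)  # disjoint by construction
        scores.append(score(w, drawn[:size1], drawn[size1:]))
    return NormalDist(mean(scores), stdev(scores))


def p_value(w, g1, g2, score=simple_score, **kwargs):
    """Upper-tail p-value of the observed score under the fitted normal null."""
    null = null_distribution(w, len(g1), len(g2), score=score, **kwargs)
    return 1.0 - null.cdf(score(w, g1, g2))
```

Because the null is refit for each pair of set sizes, the same call works for gene sets of arbitrary sizes on any weighted graph stored as a symmetric `w[a][b]` lookup.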

Functional Associations Between Processes
[Figure: network of biological processes with the following legend]
–Edges: associations between processes (moderately strong to very strong)
–Nodes: cohesiveness of processes (below baseline, i.e. genomic background, to very cohesive)
–Borders: data coverage of processes (sparsely covered to well covered)
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance
Example gene lists: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …

Functional Maps: Focused Data Summarization
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping:
–Very large collections of genomic data
–Specific predicted molecular interactions
–Pathway, process, or disease associations
–Underlying experimental results and functional activities in data

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics

HEFalMp: Predicting human gene function

HEFalMp: Predicting human genetic interactions

HEFalMp: Analyzing human genomic data

HEFalMp: Understanding human disease

Validating Human Predictions
With Erin Haley, Hilary Coller
[Figure: autophagy assay panels for Luciferase (negative control), ATG5 (positive control), LAMP2, and RAB11A; condition labels as extracted: Not Starved (Autophagic).]
Predicted novel autophagy proteins: 5½ of 7 predictions currently confirmed

Current Work: Molecular Mechanisms in a Colon Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Health Professionals Follow-Up Study and Nurses’ Health Study:
–~3,100 gastrointestinal subjects
–~3,800 tissue samples
–~1,450 colon cancer samples
–~1,150 CpG island methylation
–~1,200 LINE-1 methylation
–~700 TMA immunohistochemistry
–~2,100 cancer mutation tests
–~775 gene expression
LINE-1 Methylation
–Repetitive element making up ~20% of mammalian genomes
–Very easy to assay methylation level (%)
–Good proxy for whole-genome methylation level
DASL Gene Expression
–Gene expression analysis from paraffin blocks
–Thanks to Todd Golub, Yujin Hoshida

Colon Cancer: LINE-1 methylation levels
With Shuji Ogino, Charlie Fuchs
What is the biological mechanism linking LINE-1 methylation to colon cancer?
[Figure: ρ = 0.718, p < 0.01; Ogino et al, 2008]
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals…
…but it is highly correlated within individuals.
What does it all mean??
–This suggests a genetic effect.
–This suggests a copy number variation.
–This suggests linkage to a cancer-related pathway.
Is anything different about these outliers?

Colon Cancer: LINE-1 methylation levels
What is the biological mechanism linking LINE-1 methylation to colon cancer?
Preliminary Data
–Six genes differentially expressed even using naïve methods
  One uncharacterized, one oncogene, three malignancy, one histone
  1/3 are from a family with known variable GI expression, prognostic value
  2/3 fall in the same cytogenetic band, which is also a known CNV hotspot
–HEFalMp links to a set of transmembrane receptors/channels
–Better analysis pulls out mostly one-carbon metabolism and a few more signaling pathways (neurotransmitters??)
Check back in a couple of months!

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics

Next Steps: Microbial Communities
Data integration is off to a great start in humans
–Complex communities of distinct cell types
–Very sparse prior knowledge
  Concentrated in a few specific areas
–Variation across populations
–Critical to understand mechanisms of disease

Next Steps: Microbial Communities
What about microbial communities?
–Complex communities of distinct species/strains
–Very sparse prior knowledge
  Concentrated in a few specific species/strains
–Variation across populations
–Critical to understand mechanisms of disease

Next Steps: Functional Metagenomics
Metagenomics: data analysis from environmental samples
–Microflora: environment includes us!
Another data integration problem
–Must include datasets from multiple organisms
Another context-specificity problem
–Now “context” can also mean “species”
What questions can we answer?
–How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
–What’s shared within community X? What’s different? What’s unique?
–What’s perturbed in disease state Y? One organism, or many? Host interactions?
–Current methods annotate ~50% of synthetic data, <5% of environmental data

Next Steps: Microbial Communities
~120 available expression datasets, ~70 species
–Data integration works just as well in microbes as it does in humans
–We know an awful lot about some microorganisms and almost nothing about others
–Purely sequence-based and purely network-based tools for function transfer both fall short
–We need data integration to take advantage of both and mine out useful biology!
[Figure citations: Weskamp et al 2004; Flannick et al 2006; Kanehisa et al 2008; Tatusov et al 1997]

Functional Maps for Functional Metagenomics
[Diagram: nodes YG1 to YG17 and KO1 to KO9.]
KO1: YG1, YG2, YG3
KO2: YG4
KO3: YG6
…
Additional gene labels: ECG1, ECG2; PAG1; ECG3, PAG2; …

Validating Orthology-Based Functional Mapping
Does unweighted data integration predict functional relationships?
What is the effect of “projecting” through an orthologous space?
[Figures: log(Precision/Random) versus Recall, evaluated against KEGG and GO, comparing unsupervised integration with individual datasets.]

Validating Orthology-Based Functional Mapping
[Diagram: nodes YG1 to YG17 divided into a holdout set (uncharacterized “genome”) and random subsets (characterized “genomes”).]

Validating Orthology-Based Functional Mapping
Can subsets of the yeast genome predict a held-out subset’s functional maps?
Can subsets of the yeast genome predict a held-out subset’s interactome?
[Figures: results evaluated against KEGG and GO.]
What have we learned?
–Yeast is incredibly well-curated
–KEGG tends to be more specific than GO
–Predicting interactomes by projecting through functional maps works decently in the absolute best case

Functional Maps for Functional Metagenomics
Now, what happens if you do this for characterized microbes?
–~20 (somewhat) well-characterized species
–1-35 datasets each
–Integrate within species
–Evaluate using KEGG
–Then cross-validate by holding out species
[Figure: log(Precision/Random) versus Recall against KEGG for unsupervised integrations.]

Next Steps: Missing Methodology, Mining
Most machine learning algorithms are optimized for one of two cases:
–Small, dense data
–Large, sparse data
HEFalMp integrates ~300M records using ~1K features, relatively few of which are missing, in ~200 contexts
[Diagram labels: Feature selection; Regularization; Dimension reduction; Simple models, efficient algorithms; Slightly less]

Next Steps: Missing Methodology, Models
[Diagram: Dataset #1, Dataset #2, Dataset #3, … linked to Functional Relationship, with progressively added nodes: Biological Context, Cellular Processes, Tissue/Cell Lineage, Disease State, Developmental Stage, Cross-Species Orthology, Types of Interactions, Regulation.]
This is clearly not a sustainable system; novel large-scale hierarchical modeling is needed to capture the complex biology of metazoan and metagenomic interaction networks.

Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
Sleipnir: C++ library for computational functional genomics
–Data types for biological entities
  Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
–Network communication, parallelization
–Efficient machine learning algorithms
  Generative (Bayesian) and discriminative (SVM)
–And it’s fully documented!
It’s also speedy: improves on Bayes Net Toolbox by ~22x in memory usage and up to >100x in runtime.
[Figure: original processing time versus current processing time; values as extracted: 8 hours, 1 minute, 30 years, 2 months, 18 hours, 2-3 hours.]

Outline
1. Methodology: Algorithms for mining genome-scale datasets
–Bayesian system for genomic data integration
–Sleipnir software for efficient large scale data mining
–Functional mapping to statistically summarize large data collections
2. Applications: Human molecular data and clinical cancer cohorts
–HEFalMp system for human data analysis and integration
–Six confirmed predictions in autophagy
–Ongoing analysis of LINE-1 methylation in colon cancer
3. Next steps: Methods for microbial communities and functional metagenomics
–Data integration applied to microbial communities and functional metagenomics
–Efficient machine learning for large, dense feature spaces

Thanks!
Hilary Coller, Erin Haley, Tsheko Mutungu, Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz, Shuji Ogino, Charlie Fuchs
Interested? We’re looking for students and postdocs! Biostatistics Department

Colon Cancer: Immunohistochemistry
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
The world’s smallest, cheapest microarray!
[Table: Genes by Conditions (Tumor #1, Tumor #2, …, Tumor #700) holding Quantities; row text as extracted: AKT AURKA050 CCND ……]

Colon Cancer: Immunohistochemistry
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
The world’s smallest, cheapest microarray!
[Figure: ~700 tumor samples, divided into LINE-1 hypomethylated outliers and LINE-1 methylation “normal”.]

Colon Cancer: Mining Microarrays
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
[Figure: ~650 datasets, ~15,000 expression conditions, ~24,000 genes; 26 genes in signature; conditions ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature.]

Colon Cancer: Mining Microarrays
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.” (Subramanian et al, 2005)
Datasets shown along the ranking from most like our 26-gene LINE-1 differential methylation signature to least like the signature:
–Bleomycin effect on mutagen-sensitive lymphoblastoid cells
–Folic acid deficiency effect on colon cancer cells
–Bladder tumor stage classification
–Normal tissue of diverse types
–Muscle function and aging
–Non-diseased lung tissue
[Diagram: Dataset 1 (Condition X, Condition Y, Condition Z); Dataset 2 (Condition A, Condition B, Condition C, Condition D, Condition E)]
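The GSEA idea quoted above can be illustrated with a running-sum statistic. The sketch below is the unweighted, Kolmogorov-Smirnov style form: walk down the ranked list, step up on genes in the set, step down otherwise, and report the largest deviation from zero. The published GSEA method additionally weights steps by each gene's correlation and assesses significance by permutation; neither is done here, and the function name is hypothetical. The ranked list is assumed to contain no duplicates.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Positive values mean the set clusters toward the top of the list,
    negative values toward the bottom.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        # Step up for set members, down for everything else.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranking = list("abcdefghij")           # most to least like the signature
top_heavy = enrichment_score(ranking, {"a", "b", "c"})
bottom_heavy = enrichment_score(ranking, {"h", "i", "j"})
print(top_heavy, bottom_heavy)
```

A set concentrated at the top of the ranking gets a score near +1, and a set concentrated at the bottom gets a score near -1.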

Colon Cancer: Mining Microarrays
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.” (Subramanian et al, 2005)
[Diagram: CNV 1 (Gene X, Gene Y, Gene Z); CNV 2 (Gene A, Gene B, Gene C, Gene D, Gene E); genes ranked from most upregulated in significantly enriched datasets to most downregulated.]
CNV-linked gene groups (Iafrate et al, 2005):
–PSGs (11 genes on 19q13.3)
–PCDHs (~50 genes on 5q31.3)
–Misc. ~12 genes on 16p13.3
–?

Colon Cancer: Mining Microarrays
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
Pregnancy specific β glycoproteins (Iafrate et al, 2005)
Salahshor et al, 2005: Differential gene expression profile reveals deregulation of pregnancy specific β1 glycoprotein 9 early during colorectal carcinogenesis
“PSG9 is not found in the non-pregnant adult except in association with cancer, and it appears to be an early molecular event associated with colorectal cancer.”

Colon Cancer: Generating a Hypothesis
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
Pregnancy specific β glycoproteins

Colon Cancer: Using All the Data
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
Answers shown on the slide:
–Pregnancy specific β glycoproteins
–GI cancers and chemotherapy
–Yes (caveat investigator)
–Get back to me in a couple of months…
–Nothing definite – yet.
What’s the state of the data?
–Extremely hypomethylated colon cancer carries a significantly poor prognosis
–In our cohort, these ~20 tumors are weakly enriched for a protein activity signature based on IHC
–The expression datasets most enriched for the same signature represent mainly GI cancer and chemotherapy conditions
–The PSG gene family is upregulated in these datasets and is linked to a known CNV
–HEFalMp associates the PSGs with cancer based on correlation with known colorectal cancer genes in a variety of expression datasets

Human Regulatory Networks
Quiescence: reversible exit from the cell cycle
[Figure: expression of 6,829 genes across serum starved (hrs) and serum re-stimulated (hrs) time courses; clusters G0 and I to X; annotations: Development, Cholesterol, Protein localization, Cell cycle, RNA processing, Metabolism; motifs from FIRE (Elemento et al): Elk-1, Sp1, NF-Y, YY1.]
–Of only five regulators found, four have generic cell cycle/proliferation targets
–Just five basic regulators for ~7,000 genes?
–These motifs only appear upstream of ~half of the genes

Regulatory Modules: Expression Biclusters + Sequence Motifs
[Figure: expression matrix with gene rows CRG1, CRG2, CRG3, CRG4 and RND1, RND2, RND3, RND4, RND5, RND6, RND7, RND…]
Bicluster: Coregulated subset of genes and conditions
…any gene or condition can participate in multiple biclusters…
…any dataset can contain many overlapping biclusters…
…do all that, and simultaneously find (under)enriched sequence motifs!

COALESCE: Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction
Inputs:
–Gene Expression
–DNA Sequence: 5’ UTR, 3’ UTR, upstream flank, downstream flank
–Evolutionary Conservation
–Nucleosome Positions
Steps shown:
–Identify conditions where genes coexpress
–Identify motifs enriched in genes’ sequences
–Create a new module
–Select genes based on conditions and motifs
–Subtract mean from all data
Methods:
–Feature selection: Tests for differential expression/frequency
–Bayesian integration
Output: Regulatory modules
–Coregulated genes
–Conditions where they’re coregulated
–Putative regulating motifs

COALESCE: Selecting Coexpressed Conditions
For each gene expression condition…
–Compare distributions of values for genes in the module versus genes not in the module
–If significantly different, include the condition
Preserving data structure:
–If multiple conditions derive from the same dataset, they can be included/excluded as a unit
–For example, time course vs. deletion collection
–Test using multivariate z-test
–Precalculate covariance matrix; still very efficient
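The per-condition comparison above can be sketched with a two-sample z-test applied to one condition at a time. This is a simplification: the slide mentions a multivariate z-test with a precalculated covariance matrix for conditions from the same dataset, which is not implemented here. The data layout, function names, and the 0.05 threshold are assumptions for illustration, and each group needs at least two genes.

```python
from statistics import NormalDist, mean, variance


def z_test(in_values, out_values):
    """Two-sample z statistic and two-sided p-value (unequal variances)."""
    standard_error = (variance(in_values) / len(in_values)
                      + variance(out_values) / len(out_values)) ** 0.5
    z = (mean(in_values) - mean(out_values)) / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


def select_conditions(expression, module_genes, alpha=0.05):
    """Return indices of conditions whose values differ in vs. out of the module.

    expression: {gene: [value for each condition]} (assumed layout).
    """
    module_genes = set(module_genes)
    n_conditions = len(next(iter(expression.values())))
    selected = []
    for c in range(n_conditions):
        inside = [v[c] for g, v in expression.items() if g in module_genes]
        outside = [v[c] for g, v in expression.items() if g not in module_genes]
        _, p = z_test(inside, outside)
        if p < alpha:
            selected.append(c)
    return selected
```

Calling `select_conditions(expression, current_module)` on each iteration would give the condition set for the developing module under these assumptions.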

COALESCE: Selecting Significant Motifs
COALESCE looks for three kinds of motifs:
–K-mers (e.g. ACGACGT)
–Reverse complement pairs (e.g. ACGACAT | ATGTCGT)
–Probabilistic Suffix Trees (PSTs)
For every possible motif…
–Compare distributions of values for genes in the module versus genes not in the module
–If significantly different, include the motif
This can distinguish flanks from UTRs
Fast! Efficient enough to search coding sequence (e.g. exons/introns)
[Figure: example PST; node labels as extracted: A TC G T TG CA]
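The first two motif kinds above reduce to simple sequence counting. The sketch below counts k-mers in a sequence and merges each k-mer with its reverse complement under one key, using the lexicographically smaller string as a hypothetical canonical form. Probabilistic suffix trees are not covered, and the function names are illustrative rather than taken from COALESCE.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rc_pair_counts(seq, k):
    """Count k-mers with each reverse complement pair merged under one key."""
    merged = Counter()
    for kmer, n in kmer_counts(seq, k).items():
        merged[min(kmer, reverse_complement(kmer))] += n
    return merged


print(reverse_complement("ACGACAT"))   # ATGTCGT, the pair shown on the slide
print(rc_pair_counts("ACGCGT", 3))
```

Per-gene counts of this kind, computed separately for each sequence region, are the sort of values that could then be compared for genes in versus out of a module.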

COALESCE: Selecting Probable Genes
For each gene in the genome…
–For each significant condition…
–For each significant motif…
  What’s the probability the gene came from the module’s distribution?
  What’s the probability that it came from outside the module?
The probability of a gene being in the module given some data…
Distributions of each feature in and out of the developing module are observed from the data.
Prior is used to stabilize module convergence; genes already in the module are more likely to stay there next iteration.
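The gene-selection step above can be sketched as a naive Bayes posterior over the selected features. The sketch assumes each feature is modeled as a normal distribution inside and outside the module and that features are treated as independent; the slide's own equation is not reproduced in the transcript, so the exact model, the function names, and the example prior values are assumptions.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def membership_probability(gene_values, in_params, out_params, prior):
    """Posterior probability that a gene belongs to the module.

    gene_values: the gene's value for each selected condition or motif.
    in_params, out_params: (mean, stdev) per feature, estimated from genes
        currently in and out of the module.
    prior: prior membership probability; a higher value for genes already
        in the module makes them more likely to stay.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for x, (mu_in, sd_in), (mu_out, sd_out) in zip(gene_values, in_params, out_params):
        log_odds += log_normal_pdf(x, mu_in, sd_in) - log_normal_pdf(x, mu_out, sd_out)
    return sigmoid(log_odds)


in_params = [(2.0, 1.0), (1.0, 1.0)]
out_params = [(0.0, 1.0), (0.0, 1.0)]
print(membership_probability([2.0, 1.0], in_params, out_params, prior=0.5))
print(membership_probability([0.0, 0.0], in_params, out_params, prior=0.5))
```

A gene whose values sit exactly between the two distributions is decided by the prior alone, which is how a sticky prior keeps current members in the module from one iteration to the next.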

COALESCE: Integrating Additional Data Types
Nucleosome placement and evolutionary conservation:
–Can be included as additional datasets and feature selected just like expression conditions/motifs.
–Or can be used as a prior or weight on the values of individual motifs.
[Figure: sequence TCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATG annotated with nucleosome placement and evolutionary conservation.]

COALESCE Results: S. cerevisiae Modules
The haystack: ~2,200 conditions, ~6,000 genes
A needle: 100 genes, 80 conditions

COALESCE Results: Yeast TF/Target Accuracy

COALESCE Results: Yeast Clustering Accuracy
~2,200 yeast conditions
–Recapitulation of known biology from Gene Ontology
[Figure annotations, as extracted:]
–ASCL1 in 5’ flank, unch. sequences underenriched in 3’ UTR
–M. musculus: Up in callosal and motor neurons
–C. elegans: Up in larvae, down in adults
–GATA in 5’ flank, miR-788 seed in 3’ UTR
–AAGGGGC (zf?) and enriched in 5’ flank
–H. sapiens: Up in normal muscle, down in diabetic

COALESCE: Coregulated Quiescence Modules
–Down during quiescence entry, up during quiescence exit, down with adenoviral infection
  Specific predicted uncharacterized reverse complement motif
–Up during quiescence entry, down during quiescence exit
  Many known related (proliferation) motifs: Pax4, Staf, NFKB1, Gfi, ESR1, Runx1, Su(H)
–Down during quiescence entry, enriched for transport/trafficking
  miR-297 motif predicted in 3’ UTR (CACATAC)
–Down with let-7 exposure
  let-7 motifs predicted in 3’ UTR (UACCUC)

Summary
COALESCE algorithm for regulatory module prediction
–Biclustering + putative de novo motifs
–Optimized for complex organisms (fast!)
  Large genomes, large data collections
–High accuracy, low false positives
–Leverage prior knowledge, multiple data types