Large scale functional data mining: What can we find in the data we have? Curtis Huttenhower 08-12-10 Harvard School of Public Health Department of Biostatistics.

Greatest biological discoveries? 2 Our job is to create computational microscopes: to ask and answer specific biological questions using millions of experimental results.

A computational definition of functional genomics 3 Genomic data Prior knowledge Data ↓ Function ↓ Function Gene ↓ Gene ↓ Function

A framework for functional genomics 4 [Figure: each of 1Ks of datasets scores 100Ms of gene pairs (high vs. low correlation, high vs. low similarity, lethal vs. not lethal). Gold-standard pairs that are related (G1-G2, G4-G9: +) or unrelated (G3-G6, G7-G8: -) give the frequency of each score among related and unrelated pairs in every dataset. Combining these frequencies across datasets yields a probability for an unlabeled pair such as G2-G5: P(G2-G5|Data) = 0.85.]
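The framework slide above does not name the classifier, so the following is only a minimal sketch that assumes a naive Bayes combination of per-dataset frequency tables. The function names (`train`, `posterior`), the pseudocount, and the toy gene pairs and bins are all hypothetical and chosen for illustration.

```python
def train(datasets, gold, pseudocount=1.0):
    """Learn P(bin | related) and P(bin | unrelated) for each dataset.

    datasets: {dataset_name: {gene_pair: bin_label}}
    gold:     {gene_pair: True if functionally related, else False}
    """
    prior = sum(1 for related in gold.values() if related) / len(gold)
    tables = {}
    for name, values in datasets.items():
        bins = sorted(set(values.values()))
        # Pseudocounts keep unseen bins from producing zero probabilities.
        counts = {True: dict.fromkeys(bins, pseudocount),
                  False: dict.fromkeys(bins, pseudocount)}
        for pair, related in gold.items():
            if pair in values:
                counts[related][values[pair]] += 1
        tables[name] = {
            label: {b: c / sum(row.values()) for b, c in row.items()}
            for label, row in counts.items()
        }
    return prior, tables


def posterior(pair, datasets, prior, tables):
    """P(pair is related | data), assuming datasets are independent (naive Bayes)."""
    p_related, p_unrelated = prior, 1.0 - prior
    for name, values in datasets.items():
        if pair not in values:
            continue  # a dataset with no measurement for this pair is skipped
        p_related *= tables[name][True][values[pair]]
        p_unrelated *= tables[name][False][values[pair]]
    return p_related / (p_related + p_unrelated)


# Toy data (hypothetical): two datasets, four labeled pairs, one unlabeled pair.
datasets = {
    "coexpression": {("G1", "G2"): "high", ("G4", "G9"): "high",
                     ("G3", "G6"): "low", ("G7", "G8"): "low",
                     ("G2", "G5"): "high"},
    "similarity": {("G1", "G2"): "high", ("G4", "G9"): "low",
                   ("G3", "G6"): "low", ("G7", "G8"): "low",
                   ("G2", "G5"): "high"},
}
gold = {("G1", "G2"): True, ("G4", "G9"): True,
        ("G3", "G6"): False, ("G7", "G8"): False}

prior, tables = train(datasets, gold)
p = posterior(("G2", "G5"), datasets, prior, tables)
```

The independence assumption is what lets thousands of datasets be combined by simple multiplication; it is a modeling convenience here, not something the slide states.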

Functional network prediction and analysis 5 Global interaction network; Carbon metabolism network; Extracellular signaling network; Gut community network. Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases. HEFalMp

Functional network prediction from diverse microbial data 6 Expression: 486 bacterial expression experiments; 876 raw datasets; 310 postprocessed datasets; 304 normalized coexpression networks in 27 species. Interactions: 307 bacterial interaction experiments; 154,796 raw interactions; 114,786 postprocessed interactions. Integration: integrated functional interaction networks in 15 species. [Plot: E. coli, ← Precision ↑, Recall ↓]

Cross-species knowledge transfer using functional data 7 Pinaki Sarder TaFTan

TaFTan: Cross-species knowledge transfer using functional data 8 [Plots for E. coli, B. subtilis, P. aeruginosa, and M. tuberculosis: log(precision/random) vs. log(recall), comparing species-specific data, species' data excluded, and all species' data.] Important to take advantage of all available data for any one organism. Important to take advantage of all available data for every organism. Scalable to dozens of organisms with hundreds of functional datasets. Currently working on making this more context-specific.

Meta-analysis for unsupervised functional data integration 9 Evangelou 2007 Huttenhower 2006 Hibbs 2007 Simple regression: All datasets are equally accurate Random effects: Variation within and among datasets and interactions

Meta-analysis for unsupervised functional data integration 10 Evangelou 2007 Huttenhower 2006 Hibbs 2007 + =
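The meta-analysis slides above contrast a model in which all datasets are equally accurate with a random-effects model that allows variation within and among datasets. The slides do not give formulas, so the sketch below is an assumption: it pools one gene pair's correlations across datasets using Fisher's z-transform and the DerSimonian-Laird estimator of between-dataset variance. The function name `random_effects` and the example numbers are hypothetical.

```python
import math


def random_effects(correlations, sample_sizes):
    """Pool one gene pair's correlations across datasets.

    Assumes Fisher z-transformed correlations with within-dataset
    variance 1 / (n - 3), so every sample size must exceed 3.
    Returns (pooled_correlation, tau2), where tau2 is the estimated
    between-dataset variance.
    """
    z = [math.atanh(r) for r in correlations]
    v = [1.0 / (n - 3) for n in sample_sizes]
    w = [1.0 / vi for vi in v]

    # Fixed-effect estimate: inverse-variance weighted mean.
    fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)

    # Cochran's Q measures how much the datasets disagree.
    q = sum(wi * (zi - fixed) ** 2 for wi, zi in zip(w, z))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to every dataset's variance.
    w_star = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(ws * zi for ws, zi in zip(w_star, z)) / sum(w_star)
    return math.tanh(pooled), tau2


# Datasets that agree: no between-dataset variance is estimated.
agree_r, agree_tau2 = random_effects([0.5, 0.5, 0.5], [10, 20, 30])

# Datasets that disagree: tau2 grows and the weights become more even.
mixed_r, mixed_tau2 = random_effects([0.9, 0.1, 0.8], [50, 50, 50])
```

When tau2 is zero the result reduces to the fixed-effect estimate; as tau2 grows, large datasets lose their dominance over the pooled value.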

~2000 11 AML/ALL Temperature DNA damage Gene expression Batch effects Functional modules So what does all of this have to do with microbial communities?

2010 12 Healthy/IBD Temperature Location Taxa & Orthologs ??? Niches & Phylogeny Test for correlates Multiple hypothesis correction Feature selection p >> n Confounds/stratification/environment Cross-validate Biological story? Independent sample Intervention/perturbation

What features to test? 13 16S reads WGS reads Taxa Orthologous clusters Pathways/ modules Functional roles Pathway activity Genomic data (Reference genomes) Functional data (Experimental models) Binning Clustering Microbiome data

MetaHIT: Data → features 14 WGS reads; Pathways/modules; KO clusters; KEGG pathways; Taxa (Phymm, Brady 2009). 85 healthy, 15 IBD + 12 healthy, 12 IBD. ReBLASTed against KEGG since the published data obfuscate read counts. 10x bootstrap within training cohort, test on 12+12 as validation.

MetaHIT: Taxonomic CD biomarkers 15 Bacteroidetes, Firmicutes, Methanomicrobia, Enterobacteriaceae, Chromatiales, Desulfobacterales, Oxalobacteraceae, Rhodobacteraceae, Bradyrhizobiaceae (iTOL, Letunic 2007)

MetaHIT: Taxonomic CD biomarkers 16 Down in CD Up in CD

MetaHIT: Functional CD biomarkers 17 Growth/replication Motility Transporters Sugar metabolism Down in CD Up in CD

MetaHIT: KO IBD biomarkers 18 Transporters Growth/replication Motility Sugar metabolism Down in IBD Up in IBD LEfSe Nicola Segata

Metagenomic differential analysis: LEfSe 19
1. Is there a statistically significant difference? (t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…) LEfSe: p(ANOVA) < 0.05
2. Is the difference biologically significant? (expert supervision, specific post-hoc tests…) LEfSe: pairwise post-hoc Wilcoxon OK
3. How large is the difference? (PCA, LDA, mean difference, class or cluster distance…) LEfSe: Log(Score(LDA)) = 3.68
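The three questions on the LEfSe slide can be sketched for a single feature and two classes. This is not the LEfSe implementation: it assumes a rank-sum test with a normal approximation for both the overall test and the pairwise subclass tests, and it swaps the LDA score for a log-scaled difference of class means as a plain stand-in. The names `rank_sum` and `differential_feature`, the `alpha` threshold, and the toy abundances are hypothetical.

```python
import math
from itertools import product


def rank_sum(x, y):
    """Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (z, two_sided_p); z > 0 means x tends to be larger than y.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # ties share the mean of ranks i+1..j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - expected) / sd
    return z, math.erfc(abs(z) / math.sqrt(2.0))


def differential_feature(class_a, class_b, alpha=0.05):
    """class_a, class_b: {subclass_name: [abundances]} for one feature.

    Returns None if the feature fails step 1 or 2, else an effect size.
    """
    all_a = [v for values in class_a.values() for v in values]
    all_b = [v for values in class_b.values() for v in values]

    # Step 1: is there a statistically significant difference between classes?
    z, p = rank_sum(all_a, all_b)
    if p >= alpha:
        return None

    # Step 2: is it consistent? Every cross-class pair of subclasses must be
    # significant and point in the same direction as the overall difference.
    for sub_a, sub_b in product(class_a.values(), class_b.values()):
        z_sub, p_sub = rank_sum(sub_a, sub_b)
        if p_sub >= alpha or (z_sub > 0) != (z > 0):
            return None

    # Step 3: how large is the difference? Stand-in for the LDA score.
    difference = abs(sum(all_a) / len(all_a) - sum(all_b) / len(all_b))
    return math.log10(1.0 + difference)


# Toy abundances (hypothetical) for one feature, split by class and subclass.
disease = {"male": [10, 12, 11, 13, 14, 15], "female": [11, 13, 12, 14, 15, 16]}
healthy = {"male": [1, 2, 3, 2, 1, 3], "female": [2, 3, 1, 2, 3, 1]}
score = differential_feature(disease, healthy)

# A feature elevated in only one subclass is rejected at step 1 or 2.
inconsistent = {"male": [10, 12, 11, 13, 14, 15], "female": [2, 3, 1, 2, 3, 1]}
rejected = differential_feature(inconsistent, healthy)
```

With two classes, a Kruskal–Wallis test without tie correction is equivalent to this rank-sum test, which is why one helper serves both steps; with subclasses this small the normal approximation is rough.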

LEfSe: A non-human example: Viromes vs. bacterial metagenomes 20 Metastats (White 2009): p < 0.001. ANOVA: p < 0.05. LEfSe: DIFF! High-level functional category: Carbohydrates. High-level functional category: Transporters. High-level functional category: Nucleosides and Nucleotides. LEfSe: NO DIFF! Microbial, Viral (Dinsdale 2008)

Sleipnir: Software for scalable functional genomics 21 Massive datasets require efficient algorithms and implementations. Sleipnir C++ library for computational functional genomics: data types for biological entities (microarray data, interaction data, genes and gene sets, functional catalogs, etc.); network communication, parallelization; efficient machine learning algorithms, generative (Bayesian) and discriminative (SVM). And it’s fully documented! It’s also speedy: microbial data integration computation takes <3 hrs.

Recap 22 Network framework for scalable data integration. TaFTan: cross-species knowledge transfer from functional data. Meta-analytic integration: unsupervised system for data mining without curated prior knowledge. LEfSe: comparative microbiome analysis by taxa, orthologs, and pathways. Sleipnir: software for scalable functional genomics.

Thanks! 23 http://huttenhower.sph.harvard.edu/sleipnir Jacques Izard, Wendy Garrett, Sarah Fortune, Pinaki Sarder, Nicola Segata, Levi Waldron, Larisa Miropolsky, Willythssa Pierre-Louis

Predicting Gene Function 25 Cell cycle genes Predicted relationships between genes High Confidence Low Confidence

Predicting Gene Function 26 Predicted relationships between genes High Confidence Low Confidence Cell cycle genes

Predicting Gene Function 27 Predicted relationships between genes High Confidence Low Confidence These edges provide a measure of how likely a gene is to specifically participate in the process of interest.

Comprehensive Validation of Computational Predictions 28 [Diagram: genomic data and prior knowledge feed computational predictions of gene function (MEFIT; SPELL, Hibbs et al 2007; bioPIXIE, Myers et al 2005). Genes predicted to function in mitochondrion organization and biogenesis go to laboratory experiments (petite frequency, growth curves, confocal microscopy). New known functions for correctly predicted genes are used for retraining.] With David Hess, Amy Caudy

Evaluating the Performance of Computational Predictions 29 Genes involved in mitochondrion organization and biogenesis: 106 original GO annotations; 135 under-annotations; 82 novel confirmations, first iteration; 17 novel confirmations, second iteration. 340 total: >3x previously known genes in ~5 person-months.

Evaluating the Performance of Computational Predictions 30 Genes involved in mitochondrion organization and biogenesis: 106 original GO annotations; 95 under-annotations; 40 confirmed under-annotations; 80 novel confirmations, first iteration; 17 novel confirmations, second iteration. 340 total: >3x previously known genes in ~5 person-months. Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Validating Human Predictions 31 Autophagy Luciferase (Negative control) ATG5 (Positive control) LAMP2 RAB11A Not Starved (Autophagic) Predicted novel autophagy proteins 5½ of 7 predictions currently confirmed With Erin Haley, Hilary Coller

Functional mapping: mining integrated networks 32 Predicted relationships between genes High Confidence Low Confidence The strength of these relationships indicates how cohesive a process is. Chemotaxis

Functional mapping: mining integrated networks 33 Predicted relationships between genes High Confidence Low Confidence Chemotaxis

Functional mapping: mining integrated networks 34 Flagellar assembly The strength of these relationships indicates how associated two processes are. Predicted relationships between genes High Confidence Low Confidence Chemotaxis

Functional Mapping: Scoring Functional Associations 35 How can we formalize these relationships? Any sets of genes G1 and G2 in a network can be compared using four measures: edges between their genes; edges within each set; the background edges incident to each set; the baseline of all edges in the network. Stronger connections between the sets increase association. Stronger within self-connections or nonspecific background connections decrease association.
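The scoring slide above names the four measures but not how they are combined. The sketch below assumes one plausible ratio form, (between / background) × (baseline / within), chosen only because it rises with between-set edges, falls with within-set and background edges, and equals 1 in a uniform network. The names `mean_weight` and `functional_association` and the toy network are hypothetical.

```python
from itertools import combinations


def mean_weight(weights, pairs):
    """Average edge weight over the given gene pairs (missing edges count as 0)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(weights.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def functional_association(weights, genes, set1, set2):
    """Association score between two disjoint gene sets.

    weights: {frozenset({gene_a, gene_b}): edge weight}
    genes:   every gene in the network
    Assumes each set has at least two genes and that some genes lie outside
    both sets; otherwise a denominator below is zero.
    """
    set1, set2 = set(set1), set(set2)
    union = set1 | set2
    between = mean_weight(weights, ((a, b) for a in set1 for b in set2))
    within = mean_weight(
        weights,
        list(combinations(sorted(set1), 2)) + list(combinations(sorted(set2), 2)),
    )
    background = mean_weight(
        weights, ((a, b) for a in union for b in genes if b not in union)
    )
    baseline = mean_weight(weights, combinations(sorted(genes), 2))
    return (between / background) * (baseline / within)


genes = ["A", "B", "C", "D", "E", "F"]

# Uniform network: every edge has the same weight, so nothing stands out.
uniform = {frozenset(p): 0.5 for p in combinations(genes, 2)}
fa_uniform = functional_association(uniform, genes, ["A", "B"], ["C", "D"])

# Same network shape, but the two sets are strongly linked to each other.
linked = {frozenset(p): 0.1 for p in combinations(genes, 2)}
for a in ["A", "B"]:
    for b in ["C", "D"]:
        linked[frozenset((a, b))] = 0.8
fa_linked = functional_association(linked, genes, ["A", "B"], ["C", "D"])
```

Averaging edge weights rather than summing them keeps the score comparable across gene sets of different sizes, which is also an assumption of this sketch.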

Functional Mapping: Bootstrap p-values 36 Scoring functional associations is great… how do you interpret an association score? For gene sets of arbitrary sizes? In arbitrary graphs? Each with its own bizarre distribution of edges? Empirically! For any graph, compute FA scores for many randomly chosen gene sets of different sizes. The null distribution is approximately normal with mean 1. The standard deviation is asymptotic in the sizes of both gene sets. This maps FA scores to p-values for any gene sets and underlying graph. [Figures: histograms of FAs for random sets of 1, 5, 10, and 50 genes; null distribution σs for one graph.]
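The empirical procedure on the bootstrap slide can be sketched as follows. The slide states that the null distribution is approximately normal with mean 1, so the p-value below uses a normal tail. The toy graph, the stand-in `toy_score` function, the number of trials, and the names `null_sd` and `p_value` are assumptions made for illustration.

```python
import math
import random
import statistics


def null_sd(score, genes, size1, size2, trials=200, seed=0):
    """Standard deviation of scores for randomly chosen, disjoint gene sets."""
    rng = random.Random(seed)
    genes = sorted(genes)
    scores = []
    for _ in range(trials):
        sample = rng.sample(genes, size1 + size2)
        scores.append(score(sample[:size1], sample[size1:]))
    return statistics.stdev(scores)


def p_value(observed, sd, mean=1.0):
    """One-sided upper-tail p-value under a Normal(mean, sd) null."""
    z = (observed - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy graph (hypothetical): 40 genes with random edge weights.
rng = random.Random(1)
genes = [f"G{i}" for i in range(1, 41)]
weights = {
    frozenset((a, b)): rng.random()
    for i, a in enumerate(genes)
    for b in genes[i + 1:]
}
baseline = sum(weights.values()) / len(weights)


def toy_score(set1, set2):
    """Stand-in score: mean between-set weight relative to the graph baseline."""
    between = [weights[frozenset((a, b))] for a in set1 for b in set2]
    return (sum(between) / len(between)) / baseline


sd = null_sd(toy_score, genes, 5, 5)
p = p_value(1.0 + 3.0 * sd, sd)  # a score three null SDs above the mean
```

Because the null standard deviation depends on both set sizes, it would be estimated once per size combination and graph, then reused for every observed score of that shape.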

Functional Mapping: Functional Associations Between Processes 37 Edges Associations between processes Very Strong Moderately Strong Nodes Cohesiveness of processes Below Baseline (genomic background) Very Cohesive Borders Data coverage of processes Well Covered Sparsely Covered Hydrogen Transport Electron Transport Cellular Respiration Protein Processing Peptide Metabolism Cell Redox Homeostasis Aldehyde Metabolism Energy Reserve Metabolism Vacuolar Protein Catabolism Negative Regulation of Protein Metabolism Organelle Fusion Protein Depolymerization Organelle Inheritance

Functional Mapping: Functional Associations Between Processes 38 Edges Associations between processes Very Strong Moderately Strong Nodes Cohesiveness of processes Below Baseline (genomic background) Very Cohesive Borders Data coverage of processes Well Covered Sparsely Covered

Functional maps for cross-species knowledge transfer 39 [Figure: a network of genes G1 to G17 grouped into orthologous clusters O1 to O9 (O1: G1, G2, G3; O2: G4; O3: G6; …), with clusters containing genes from several species (ECG1, ECG2; BSG1; ECG3, BSG2; …).]

Functional maps for functional metagenomics 40 GOS 4441599.3 Hypersaline Lagoon, Ecuador KEGG Pathways Organisms Pathogens Env. Mapping genes into pathways Mapping pathways into organisms + Integrated functional interaction networks in 27 species Mapping organisms into phyla =

Functional Maps: Focused Data Summarization 41 ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization 42 ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information? Functional mapping Very large collections of genomic data Specific predicted molecular interactions Pathway, process, or disease associations Underlying experimental results and functional activities in data

Functional maps for cross-species knowledge transfer 43 ← Precision ↑, Recall ↓ Following up with unsupervised and partially anchored network alignment

LEfSe: A non-human example: Viromes vs. bacterial metagenomes 44 Metastats (White 2009): p < 0.001. ANOVA: p < 0.05. LEfSe: DIFF! High-level functional category: Carbohydrates. High-level functional category: Membrane Transport. High-level functional category: Nitrogen Metabolism. High-level functional category: Nucleosides and Nucleotides. LEfSe: NO DIFF! Microbial, Viral
