Answering biological questions using large genomic data collections Curtis Huttenhower 10-05-09 Harvard School of Public Health Department of Biostatistics.

Presentation transcript:

1 Answering biological questions using large genomic data collections
Curtis Huttenhower
10-05-09
Harvard School of Public Health, Department of Biostatistics

2 A Definition of Computational Functional Genomics
[Diagram labels: Genomic data; Prior knowledge; Data ↓ Function ↓ Function; Gene ↓ Gene ↓ Function]

3 MEFIT: A Framework for Functional Genomics
Related Gene Pairs:
BRCA1  BRCA2  0.9
BRCA1  RAD51  0.8
RAD51  TP53   0.85
…
[Figure: MEFIT; axes labeled Frequency, High Correlation, Low Correlation]

4 MEFIT: A Framework for Functional Genomics
Related Gene Pairs:
BRCA1  BRCA2  0.9
BRCA1  RAD51  0.8
RAD51  TP53   0.85
…
Unrelated Gene Pairs:
BRCA2  SOX2   0.1
RAD51  FOXP2  0.2
ACTR1  H6PD   0.15
…
[Figure: MEFIT; axes labeled Frequency, High Correlation, Low Correlation]
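The related and unrelated pair scores on this slide suggest how a correlation value can become evidence of a functional relationship: compare how often its correlation bin occurs among related versus unrelated pairs. The following is a minimal Python sketch of that idea, assuming a simple binned Bayesian update. The bin edges, pseudocount, prior, and function names are illustrative assumptions and are not taken from MEFIT itself; only the six example scores come from the slide.

```python
from bisect import bisect_right

# Hypothetical bin edges over correlation values (assumption, not from the slides).
BIN_EDGES = [0.25, 0.5, 0.75]

def bin_index(corr):
    """Map a correlation value to one of len(BIN_EDGES) + 1 bins."""
    return bisect_right(BIN_EDGES, corr)

def bin_frequencies(correlations, pseudocount=1.0):
    """Smoothed frequency of each bin among a set of gene pairs."""
    counts = [pseudocount] * (len(BIN_EDGES) + 1)
    for corr in correlations:
        counts[bin_index(corr)] += 1
    total = sum(counts)
    return [n / total for n in counts]

def posterior_related(corr, related, unrelated, prior=0.5):
    """P(related | correlation bin) by Bayes' rule; the prior is an assumption."""
    b = bin_index(corr)
    p_rel = bin_frequencies(related)[b] * prior
    p_unrel = bin_frequencies(unrelated)[b] * (1.0 - prior)
    return p_rel / (p_rel + p_unrel)

# Example scores copied from the slide.
related = [0.9, 0.8, 0.85]
unrelated = [0.1, 0.2, 0.15]

high = posterior_related(0.82, related, unrelated)  # a new, highly correlated pair
low = posterior_related(0.12, related, unrelated)   # a new, weakly correlated pair
```

With so few training pairs the pseudocount dominates the estimate; a real gold standard would contain many more pairs per bin.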

5 MEFIT: A Framework for Functional Genomics
[Figure: datasets Golub 1999, Butte 2000, Whitfield 2002, and Hansen 1998, linked to Functional Relationship]

6 MEFIT: A Framework for Functional Genomics
[Figure: datasets Golub 1999, Butte 2000, Whitfield 2002, and Hansen 1998, linked to Functional Relationship]
Biological Context: Functional area, Tissue, Disease, …
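One way to read this slide is that each dataset contributes evidence about a functional relationship, and that the weight of that evidence depends on the biological context. The Python sketch below assumes a naive Bayes combination, in which per-dataset likelihood ratios are treated as independent and multiplied into prior odds. The independence assumption, the context and dataset names, and every number are hypothetical placeholders, not values from the talk.

```python
import math

def posterior_from_ratios(likelihood_ratios, prior):
    """Combine per-dataset likelihood ratios with a prior, assuming independence."""
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Toy values only. Each ratio stands for
# P(observation | related) / P(observation | unrelated)
# for one gene pair, as judged by one dataset within one context.
ratios_by_context = {
    "context_1": {"dataset_a": 2.0, "dataset_b": 3.0},
    "context_2": {"dataset_a": 0.5, "dataset_b": 2.0},
}

# The same gene pair receives a different posterior in each context.
posteriors = {
    context: posterior_from_ratios(ratios.values(), prior=0.5)
    for context, ratios in ratios_by_context.items()
}
```

Working in log-odds keeps the product numerically stable when many datasets are combined.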

7 Functional Interaction Networks
[Figure: MEFIT; Global interaction network, Autophagy network, Vacuolar transport network, Translation network]
Currently have data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases

8 Predicting Gene Function
[Figure: predicted relationships between genes, legend from High Confidence to Low Confidence; Cell cycle genes]

9 Predicting Gene Function
[Figure: predicted relationships between genes, legend from High Confidence to Low Confidence; Cell cycle genes]

10 Predicting Gene Function
[Figure: predicted relationships between genes, legend from High Confidence to Low Confidence]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
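The statement about edges on this slide can be illustrated with a small Python sketch. It assumes, as one plausible reading, that specificity is the ratio of a gene's average edge weight to the process genes over its average edge weight to all genes. The scoring formula, the toy network, and the gene names A to D are hypothetical and do not come from the talk.

```python
def edge(network, a, b):
    """Look up an undirected edge weight; missing edges count as 0."""
    return network.get(frozenset((a, b)), 0.0)

def mean_edge(network, gene, others):
    """Average edge weight between one gene and a set of other genes."""
    others = [g for g in others if g != gene]
    if not others:
        return 0.0
    return sum(edge(network, gene, g) for g in others) / len(others)

def participation_score(network, gene, process_genes, all_genes):
    """Connectivity to the process relative to the genomic background."""
    inside = mean_edge(network, gene, process_genes)
    background = mean_edge(network, gene, all_genes)
    return inside / background if background > 0 else 0.0

# Toy network with made-up confidences.
network = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.3,
    frozenset(("B", "C")): 0.7,
    frozenset(("B", "D")): 0.1,
    frozenset(("C", "D")): 0.1,
}
process_genes = ["B", "C"]
all_genes = ["A", "B", "C", "D"]

score_a = participation_score(network, "A", process_genes, all_genes)
score_d = participation_score(network, "D", process_genes, all_genes)
```

A score above 1 means the gene is more strongly connected to the process than to the genome at large; a score below 1 means the opposite.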

11 Functional Associations Between Contexts
[Figure: predicted relationships between genes, legend from High Confidence to Low Confidence; Cell cycle genes]
The average strength of these relationships indicates how cohesive a process is.

12 Functional Associations Between Contexts
[Figure: predicted relationships between genes, legend from High Confidence to Low Confidence; Cell cycle genes]

13 Functional Associations Between Contexts
[Figure: predicted relationships between genes, legend from High Confidence to Low Confidence; Cell cycle genes and DNA replication genes]
The average strength of these relationships indicates how associated two processes are.
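The two "average strength" statements, cohesiveness within one process and association between two processes, can be sketched as plain means over edge weights. This Python example assumes unweighted averaging with missing edges counted as zero; the toy network and the gene names C1 to C3 and R1 to R2 are hypothetical stand-ins for cell cycle and DNA replication genes.

```python
from itertools import combinations, product

def edge(network, a, b):
    """Look up an undirected edge weight; missing edges count as 0."""
    return network.get(frozenset((a, b)), 0.0)

def cohesiveness(network, genes):
    """Average edge weight over all gene pairs within one process."""
    pairs = list(combinations(genes, 2))
    return sum(edge(network, a, b) for a, b in pairs) / len(pairs)

def association(network, genes_a, genes_b):
    """Average edge weight over all gene pairs spanning two processes."""
    pairs = [(a, b) for a, b in product(genes_a, genes_b) if a != b]
    return sum(edge(network, a, b) for a, b in pairs) / len(pairs)

# Toy network with made-up confidences.
network = {
    frozenset(("C1", "C2")): 0.9,
    frozenset(("C1", "C3")): 0.8,
    frozenset(("C2", "C3")): 0.7,
    frozenset(("R1", "R2")): 0.9,
    frozenset(("C1", "R1")): 0.6,
    frozenset(("C2", "R1")): 0.5,
    frozenset(("C3", "R2")): 0.4,
}
cell_cycle = ["C1", "C2", "C3"]
replication = ["R1", "R2"]

within = cohesiveness(network, cell_cycle)
between = association(network, cell_cycle, replication)
```

Comparing either value against the same average computed over random gene sets would give a baseline for the genomic background.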

14 Functional Associations Between Processes
Legend:
–Edges: associations between processes (Very Strong to Moderately Strong)
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance

15 Functional Associations Between Processes
Legend:
–Edges: associations between processes (Very Strong to Moderately Strong)
–Borders: data coverage of processes (Well Covered to Sparsely Covered)
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance

16 Functional Associations Between Processes
Legend:
–Edges: associations between processes (Very Strong to Moderately Strong)
–Nodes: cohesiveness of processes (Below Baseline (genomic background) to Very Cohesive)
–Borders: data coverage of processes (Well Covered to Sparsely Covered)
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance
Gene callouts: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …

17 HEFalMp: Predicting human gene function
[Figure: HEFalMp]

18 HEFalMp: Predicting human genetic interactions
[Figure: HEFalMp]

19 HEFalMp: Analyzing human genomic data
[Figure: HEFalMp]

20 HEFalMp: Understanding human disease
[Figure: HEFalMp]

21 Validating Human Predictions
[Figure: Autophagy; panels labeled Luciferase (Negative control), ATG5 (Positive control), LAMP2, RAB11A; condition labeled Not Starved (Autophagic)]
Predicted novel autophagy proteins
5½ of 7 predictions currently confirmed
With Erin Haley, Hilary Coller

22 Comprehensive Validation of Computational Predictions
[Diagram stages:]
–Genomic data, Prior knowledge
–Computational Predictions of Gene Function: MEFIT; SPELL (Hibbs et al 2007); bioPIXIE (Myers et al 2005)
–Genes predicted to function in mitochondrion organization and biogenesis
–Laboratory Experiments: Petite frequency, Growth curves, Confocal microscopy
–New known functions for correctly predicted genes
–Retraining
With David Hess, Amy Caudy

23 Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
–106 Original GO Annotations
–135 Under-annotations
–82 Novel Confirmations, First Iteration
–17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
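The tally on this slide can be checked with a few lines of Python. The counts are copied from slide 23; the variable names are illustrative.

```python
# Gene counts as listed on slide 23.
counts = {
    "original_go_annotations": 106,
    "under_annotations": 135,
    "novel_confirmations_first_iteration": 82,
    "novel_confirmations_second_iteration": 17,
}

total = sum(counts.values())

# Fold increase over the genes known from the original GO annotations.
fold_increase = total / counts["original_go_annotations"]
```

The sum matches the stated total of 340, and the fold increase of roughly 3.2 is consistent with the ">3x" claim.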

24 Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
–106 Original GO Annotations
–95 Under-annotations
–40 Confirmed Under-annotations
–80 Novel Confirmations, First Iteration
–17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

25 Functional Maps: Focused Data Summarization
[Figure: DNA sequence graphic]
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

26 Functional Maps: Focused Data Summarization
[Figure: DNA sequence graphic]
How can a researcher take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
–Very large collections of genomic data
–Specific predicted molecular interactions
–Pathway, process, or disease associations
–Underlying experimental results and functional activities in data

27 Thanks!
Hilary Coller, Erin Haley, Tsheko Mutungu, Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz, Shuji Ogino, Charlie Fuchs
http://function.princeton.edu/hefalmp
http://www.huttenhower.org
Interested? I’m accepting students and postdocs!


29 Next Steps: Microbial Communities
Data integration is off to a great start in humans
–Complex communities of distinct cell types
–Very sparse prior knowledge
    Concentrated in a few specific areas
–Variation across populations
–Critical to understand mechanisms of disease

30 Next Steps: Microbial Communities
What about microbial communities?
–Complex communities of distinct species/strains
–Very sparse prior knowledge
    Concentrated in a few specific species/strains
–Variation across populations
–Critical to understand mechanisms of disease

31 Next Steps: Microbial Communities
~120 available expression datasets, ~70 species
[Figure citations: Weskamp et al 2004, Flannick et al 2006, Kanehisa et al 2008, Tatusov et al 1997]
–Data integration works just as well in microbes as it does in humans
–We know an awful lot about some microorganisms and almost nothing about others
–Purely sequence-based and purely network-based tools for function transfer both fall short
–We need data integration to take advantage of both and mine out useful biology!

32 Next Steps: Functional Metagenomics
Metagenomics: data analysis from environmental samples
–Microflora: environment includes us!
Another data integration problem
–Must include datasets from multiple organisms
Another context-specificity problem
–Now “context” can also mean “species”
What questions can we answer?
–How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
–What’s shared within community X? What’s different? What’s unique?
–What’s perturbed in disease state Y? One organism, or many? Host interactions?
–Current methods annotate ~50% of synthetic data, <5% of environmental data

