Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority


1 Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority Sourav Chatterji UC Davis Genome Center schatterji@ucdavis.edu

2 Background

3 The Microbial World

4 Exploring the Microbial World Who is out there? What are they doing?

5 Exploring the Microbial World Culturing – Majority of microbes currently unculturable. – No ecological context. Molecular Surveys (e.g. 16S rRNA) – “who is out there?” – “what are they doing?”

6 Traditional Genome Sequencing

7 Environmental Shotgun Sequencing

8

9 Interpreting Metagenomic Data Nature of Metagenomic Data – Mosaic – Fragmentary New Sequencing Technologies – Enormous amount of data – Short Reads

10 Overview of Talk Metagenomic Binning PhyloMetagenomics The Big Picture/ Future Work

11 Overview of Talk Metagenomic Binning – Background – CompostBin [to appear in RECOMB 2008] PhyloMetagenomics The Big Picture

12 Metagenomic Binning Classification of sequences by taxa

13 Metagenomic Binning Problem : To classify sequences into taxon specific bins. Taxonomic level of classification depends on context. Closely related to fundamental question of genome signatures.

14 Current Binning Methods Assembly Align with Reference Genome Database Search [MEGAN, BLAST] Phylogenetic Analysis DNA Composition [TETRA, PhyloPythia]

15 Current Binning Methods Need closely related reference genomes. Poor performance on short fragments. – Sanger sequence reads are 500-1000 bp long. – Current assembly methods are unreliable. Complex communities are hard to bin.

16 Genome Signatures Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? – Yes [Karlin et al. 1990s] What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

17 Imperfect World Horizontal Gene Transfer – Recent Estimates [Ge et al. 2005] Varies between 0-6% of genes. Typically ~2%. But… – Amelioration

18 DNA-composition metrics The K-mer Frequency Metric CompostBin uses hexamers
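As a rough illustration of the K-mer frequency metric named on this slide, the sketch below turns a read into a vector of relative K-mer frequencies (K = 6 for hexamers). The function name `kmer_frequencies` is hypothetical, and the choices to skip windows containing ambiguous bases and to leave reverse complements unmerged are assumptions, not details taken from the talk.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return relative frequencies of all 4**k DNA k-mers in a fixed order."""
    # Enumerate every k-mer over ACGT in lexicographic order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: ignore windows with N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    # Normalize so that reads of different lengths are comparable.
    return [counts[w] / total if total else 0.0 for w in kmers]
```

With the default `k=6` each read becomes a 4096-dimensional vector, which is the dimensionality problem the next slide raises.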

19 DNA-composition metrics Working with K-mers for Binning. – Curse of Dimensionality: O(4^K) independent dimensions. – Statistical noise increases with decreasing fragment lengths. Project data into a lower dimensional space to decrease noise. – Principal Component Analysis.

20 The CompostBin Algorithm Measures hexamer frequencies – Computationally feasible – Captures bias in codon usage – Captures bias due to restriction enzymes Project hexamer frequencies into the first few principal components.

21 PCA separates species Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

22 Effect of Skewed Relative Abundance B. anthracis and L. monocytogenes [figure panels: Abundance 1:1; Abundance 20:1]

23 Weighted PCA PCA calculates directions with highest variance – Within-species variance of more abundant species is overwhelming. Weighted PCA – Use a weighting scheme to normalize the effects of species abundance. – Calculates weighted co-variance matrix.

24 A Weighting Scheme For each read, find overlap with other sequences

25 A Weighting Scheme Calculate the redundancy of each position. Weight is the inverse of the average redundancy.
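The weighting scheme on these two slides can be sketched as follows. This is a simplified illustration, not the talk's implementation: it assumes each read's overlaps have already been reduced to a half-open interval `(start, end)` on a shared coordinate system, whereas in practice overlaps would come from read-to-read alignment. The function name `read_weights` and the final normalization to a sum of 1 are also assumptions.

```python
def read_weights(intervals):
    """Weight each read by the inverse of its average positional redundancy.

    intervals: list of (start, end) half-open coordinates, one per read
    (hypothetical input; real overlaps would be found by alignment).
    """
    weights = []
    for start, end in intervals:
        covered = 0
        for pos in range(start, end):
            # Redundancy of a position: number of reads covering it,
            # including the read itself.
            covered += sum(1 for s, e in intervals if s <= pos < e)
        average_redundancy = covered / (end - start)
        weights.append(1.0 / average_redundancy)
    total = sum(weights)
    # Assumption: scale the weights so that they sum to 1.
    return [w / total for w in weights]
```

Reads from an abundant species overlap many other reads, so their redundancy is high and their individual weights are low, which is what offsets the abundance skew.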

26 Weighted PCA Calculate weighted mean µ_w. Calculate weighted covariance matrix M_w. Principal components are the eigenvectors corresponding to the highest eigenvalues.

27 Weighted PCA Calculate weighted mean µ_w: $\mu_w = \sum_{i=1}^{N} w_i X_i$. Calculate weighted covariance matrix M_w: $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$. Principal components are eigenvectors of M_w. – Use first three PCs for further analysis.
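A minimal NumPy sketch of the weighted PCA step: a weighted mean, a weighted covariance matrix, and projection onto the eigenvectors with the largest eigenvalues. The function name `weighted_pca`, the row-per-read layout of `X`, and the rescaling of the weights to sum to 1 are assumptions made for illustration.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the top principal components of the weighted covariance."""
    X = np.asarray(X, dtype=float)   # one row per read, one column per K-mer
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                  # assumption: weights are scaled to sum to 1
    mu_w = w @ X                     # weighted mean: sum_i w_i X_i
    centered = X - mu_w
    # Weighted covariance: sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (centered * w[:, None]).T @ centered
    # M_w is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components
```

With uniform weights this reduces to ordinary PCA, so the same function can be used to compare the unweighted and weighted projections on a skewed-abundance dataset.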

28 Weighted PCA separates species B. anthracis and L. monocytogenes: 20:1 [figure panels: PCA; Weighted PCA]

29 Un-supervised Classification

30 Semi-Supervised Classification 31 marker genes [courtesy Martin Wu] – Omnipresent – Relatively immune to lateral gene transfer. Reads containing these marker genes can be classified with high reliability.

31 Semi-supervised Classification Use a semi-supervised version of the normalized cut algorithm

32 The Semi-supervised Normalized Cut Algorithm 1. Calculate the K-nearest neighbor graph (KNN-graph) from the point set. 2. Update the KNN-graph with information from marker genes. 3. Bisect the graph using the normalized-cut algorithm.
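The three steps on this slide can be sketched as below. Several details are assumptions rather than the talk's method: the slide does not say how marker-gene information updates the graph, so the sketch links reads that carry markers from the same taxon and unlinks reads from different taxa; the bisection uses the standard spectral relaxation of the normalized cut (second-smallest eigenvector of the normalized Laplacian) and splits at zero instead of searching for the best threshold. All function names are hypothetical.

```python
import numpy as np


def knn_graph(points, k):
    """Step 1: symmetric 0/1 adjacency matrix of the K-nearest-neighbor graph."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    W = np.zeros_like(dist)
    for i in range(len(points)):
        W[i, np.argsort(dist[i])[:k]] = 1.0
    return np.maximum(W, W.T)               # keep an edge if either end chose it


def apply_marker_labels(W, labels):
    """Step 2 (assumed rule): link same-taxon marker reads, unlink different-taxon ones.

    labels: dict mapping read index -> taxon label for reads with marker genes.
    """
    W = W.copy()
    items = list(labels.items())
    for a, (i, label_i) in enumerate(items):
        for j, label_j in items[a + 1:]:
            W[i, j] = W[j, i] = 1.0 if label_i == label_j else 0.0
    return W


def normalized_cut_bisect(W):
    """Step 3: spectral relaxation of the normalized cut; returns a boolean side per node."""
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Normalized Laplacian: I - D^(-1/2) W D^(-1/2)
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
    indicator = d_inv_sqrt * vecs[:, 1]      # second-smallest eigenvector, rescaled
    return indicator > 0                     # simplification: split at zero
```

The sign split is only meaningful when the graph is connected; a disconnected KNN graph would need its components handled separately before bisecting.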

33 Generalization to multiple bins Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62] Apply algorithm recursively

34 Generalization to multiple bins Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

35 Testing Simulate Metagenomic Sequencing – Variables: number of species, relative abundance, GC content, phylogenetic diversity. Test on a “real” dataset where the answer is well-established.
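A toy version of the simulation step, to show how the number of species and their relative abundance enter as parameters. It is an illustrative sketch only: the function name `simulate_reads`, the fixed read length, the error-free reads, and sampling each genome in proportion to abundance times genome length are all assumptions.

```python
import random


def simulate_reads(genomes, abundances, n_reads, read_len=1000, seed=0):
    """Sample error-free reads from genomes according to relative abundance.

    genomes: dict name -> genome sequence (each at least read_len long)
    abundances: dict name -> relative cell abundance
    Returns a list of (source_name, read_sequence) pairs.
    """
    rng = random.Random(seed)
    names = list(genomes)
    # Assumption: the chance a read comes from a genome is proportional to
    # (relative abundance) x (genome length).
    weights = [abundances[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights)[0]
        genome = genomes[name]
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((name, genome[start:start + read_len]))
    return reads
```

Because every read keeps its source label, a binning result on the simulated reads can be scored directly against the known answer.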

36

37 Future Directions Holy Grail : Complex Communities Semi-supervised methods – More marker genes – Semi-supervised projection? Hybrid Methods – Assembly Information – Population Genetic Information

38 Overview of Talk Metagenomic Binning Phylo-Metagenomics – Applications – Incorporating Alignment Accuracy The Big Picture/ Future Work

39 Garcia Martin et al., Nat. Biotechnology (2006) Population Structure of Communities

40 Yooseph et al., PLoS Biology (2007) Gene Family Characterization

41

42 Masking and trimming Goal : Removal of poorly aligned regions in a multiple alignment Increases the signal-to-noise ratio Often improves the quality of the tree

43 Manual Masking Requires skilled and tedious manual intervention Subjective and non-reproducible Impractical for high throughput data – Frequently ignored. “Garbage-in-and-garbage-out”

44 Conservation Based Masking Calculate the conservation score for each column in the alignment. Weight the score of each column by its neighbors. Columns whose score is below a certain cutoff are removed.
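The three steps on this slide (score each column, smooth by neighbors, apply a cutoff) can be sketched as follows. The specific choices here are assumptions for illustration and are not the rules used by Gblocks or any other tool: conservation is taken as the fraction of sequences sharing the most common residue, gaps count against conservation, and neighbor weighting is a plain moving average.

```python
from collections import Counter


def conservation_mask(alignment, window=1, cutoff=0.5):
    """Return one boolean per alignment column: True means keep the column.

    alignment: list of equal-length aligned sequences, with '-' for gaps.
    """
    n_cols = len(alignment[0])
    scores = []
    for c in range(n_cols):
        column = [row[c] for row in alignment]
        residues = [ch for ch in column if ch != "-"]
        # Assumed score: share of sequences carrying the most common residue.
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    keep = []
    for c in range(n_cols):
        lo, hi = max(0, c - window), min(n_cols, c + window + 1)
        smoothed = sum(scores[lo:hi]) / (hi - lo)   # average with neighbors
        keep.append(smoothed >= cutoff)
    return keep
```

Note that the score depends only on how similar the residues in a column look, which is the weakness a later slide points out: conservation is not the same thing as homology.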

45 Gblocks

46 Conservation Based Masking Does not work very well – Support for phylogenetic trees is reduced after trimming, possibly due to excessive removal of informative sites No solid theoretical basis – Conservation is not equivalent to homology.

47 Probabilistic Masking using pair-HMMs Durbin et al., Cambridge University Press (1998)

48 Probabilistic Masking using pair-HMMs Probabilistic formulation of alignment problem. Can answer additional questions – Alignment Reliability – Sub-optimal Alignments Durbin et al., Cambridge University Press (1998)

49 Probabilistic Masking What is the probability that residues x_i and y_j are homologous? Posterior probability that the residues x_i and y_j are homologous. Can be calculated efficiently for all pairs (and gaps) in quadratic time.

50 Probabilistic Masking What is the probability that residues x_i and y_j are homologous? Posterior probability that the residues x_i and y_j are homologous (written x_i ◊ y_j): $\Pr[x_i \diamond y_j \mid x, y] = \frac{\Pr[x_i \diamond y_j,\, x,\, y]}{\Pr[x, y]}$. Can be calculated efficiently for all pairs (and gaps) in quadratic time.
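To make the quadratic-time claim concrete, here is a forward-backward sketch that computes the posterior for every residue pair. It is deliberately simpler than the three-state pair-HMM of Durbin et al.: it uses a single-state alignment model with no affine gaps, and the weights `match`, `mismatch`, and `gap` are made-up illustrative values, not trained parameters. The function name `match_posteriors` is hypothetical.

```python
def match_posteriors(x, y, match=0.8, mismatch=0.05, gap=0.1):
    """Posterior that x[i] is aligned to y[j], for all i, j, in O(len(x) * len(y))."""
    n, m = len(x), len(y)

    def emit(i, j):
        # Weight for aligning x[i-1] with y[j-1] (1-based prefix indices).
        return match if x[i - 1] == y[j - 1] else mismatch

    # Forward: F[i][j] = total weight of all alignments of x[:i] with y[:j].
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += F[i - 1][j - 1] * emit(i, j)
            if i > 0:
                total += F[i - 1][j] * gap
            if j > 0:
                total += F[i][j - 1] * gap
            F[i][j] = total

    # Backward: B[i][j] = total weight of all alignments of x[i:] with y[j:].
    B = [[0.0] * (m + 1) for _ in range(n + 1)]
    B[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:
                total += emit(i + 1, j + 1) * B[i + 1][j + 1]
            if i < n:
                total += gap * B[i + 1][j]
            if j < m:
                total += gap * B[i][j + 1]
            B[i][j] = total

    Z = F[n][m]  # plays the role of Pr[x, y] in this simplified model
    # Numerator: all alignments that pass through the pair (x[i], y[j]).
    return [[F[i][j] * emit(i + 1, j + 1) * B[i + 1][j + 1] / Z
             for j in range(m)] for i in range(n)]
```

Each table has (n+1)(m+1) cells filled in constant time, so the whole posterior matrix costs time proportional to the product of the two sequence lengths.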

51 Scoring Multiple Alignments Calculate the “posterior probability matrix” and distances d_ij between every pair of sequences. Weighted “sum of pairs” score for column r:

52 Scoring Multiple Alignments Calculate the “posterior probability matrix” and distances d_ij between every pair of sequences. Weighted “sum of pairs” score for column r: $\frac{\sum_{i,j} d_{ij} \Pr[r_i \diamond r_j]}{\sum_{i,j} d_{ij}}$
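The weighted sum-of-pairs score for one column is a distance-weighted average of the pairwise posteriors. The sketch below assumes the posteriors for the residues in that column have already been extracted into a matrix; the function name `column_score` and the return value of 0 when all distances are zero are assumptions, and the treatment of gapped positions is left open because the slides do not specify it.

```python
def column_score(posteriors, distances):
    """Distance-weighted average of pairwise posteriors for one alignment column.

    posteriors[i][j]: posterior that the residues of sequences i and j
                      in this column are homologous
    distances[i][j]:  distance d_ij between sequences i and j
    """
    n = len(distances)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):       # each unordered pair once
            numerator += distances[i][j] * posteriors[i][j]
            denominator += distances[i][j]
    return numerator / denominator if denominator else 0.0
```

Weighting by distance means that agreement between distant sequences counts for more than agreement between near-identical ones, which would otherwise dominate the score.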

53 Testing The Balibase 3.0 Benchmark Database

54 Testing Realign sequences using MSA programs like ClustalW. Sensitivity: of all correctly aligned columns, the fraction that has been masked as good. Specificity: of all incorrectly aligned columns, the fraction that has been masked as bad.
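The two definitions on this slide translate directly into code. The sketch assumes two per-column boolean lists are available: whether the column is correctly aligned according to the benchmark, and whether the masking method kept it. The function name is hypothetical.

```python
def mask_sensitivity_specificity(is_correct, kept):
    """Return (sensitivity, specificity) of a column mask.

    is_correct[c]: True if column c is correctly aligned per the reference
    kept[c]:       True if the masking method marked column c as good
    """
    correct_kept = sum(1 for c, k in zip(is_correct, kept) if c and k)
    incorrect_removed = sum(1 for c, k in zip(is_correct, kept) if not c and not k)
    n_correct = sum(1 for c in is_correct if c)
    n_incorrect = len(is_correct) - n_correct
    # Sensitivity: share of correctly aligned columns that were kept.
    sensitivity = correct_kept / n_correct if n_correct else 0.0
    # Specificity: share of incorrectly aligned columns that were removed.
    specificity = incorrect_removed / n_incorrect if n_incorrect else 0.0
    return sensitivity, specificity
```

A method that removes too many columns keeps specificity high while sensitivity drops, which is the trade-off the performance slide compares.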

55 Performance Gblocks Prob Mask Sensitivity Specificity 97% 93% 53% 94%

56 The Final Result A Phylogenetic Database/Pipeline (with Martin Wu)

57 Overview of Talk Metagenomic Binning Phylo-Metagenomics The Big Picture/ Future Work

58 Population Structure Venter et al., Science (2004) How to integrate information from multiple markers?

59 Species-species Interactions

60 Interactions in Microbial Communities

61 Time Series Data Ruan et al., Bioinformatics (2006)

62 Interaction Networks in Microbial Communities Ruan et al., Bioinformatics (2006)

63 Functional Profiling Prediction of Gene Function Prediction of Metabolic Pathway

64 Functional Profiling Binning can provide additional insights Wu et al., PLoS Biology (2006)

65 Functional Profiling (with Binning) McCutcheon and Moran, PNAS (2007)

66 Single Cell Genomics Hutchison and Venter, Nature Biotechnology (2006)

67 Single Cell Genomics Reads From Single Cell “Simulated” Contamination

68 The Big Picture Microbial Community Metagenomic Sampling Single Cell Genomics Population Structure Functional Profiling Species Interaction Network Time Series Data

69 Acknowledgements UC Davis Jonathan Eisen Martin Wu Dongying Wu Ichitaro Yamazaki Amber Hartman Marcel Huntemann UC Berkeley Lior Pachter Richard Karp Ambuj Tewari Narayanan Manikandan Princeton University Simon Levin Josh Weitz Jonathan Dushoff

