Advancing Science with DNA Sequence Natalia Ivanova MGM Workshop September 29, 2011 Metagenome analysis: use case.

1 Advancing Science with DNA Sequence Natalia Ivanova MGM Workshop September 29, 2011 Metagenome analysis: use case

2 Advancing Science with DNA Sequence Minoan eruption and metagenomics. "…it seemed as though the sea was being sucked backwards, as if it were being pushed back by the shaking of the land…Behind us were frightening dark clouds, rent by lightning twisted and hurled, opening to reveal huge figures of flame. These were like lightning, but bigger." From Pliny the Younger’s Letter

3 Advancing Science with DNA Sequence Apart from Minoan eruption… Figure from Chernicoff & Stanley, Geology, 2007; diagram by Gary Massoth/PMEL

4 Advancing Science with DNA Sequence Sampling sites: white mat, red mat. Key gradients, white vs red: temperature 60 vs 18 °C; CO2 tension >99% vs <1%

5 Advancing Science with DNA Sequence This is what it looks like

6 Advancing Science with DNA Sequence Chimney material may be of biological origin

7 Advancing Science with DNA Sequence Standard JGI metagenome pipeline. DNA sample=>DNA QC, then two tracks. SSU pyrotags (http://pyrotagger.jgi-psf.org): community composition; semi-quantitative OTU abundance. Shotgun libraries (Illumina standard, Illumina long mate pair, 454 standard, 454 long mate pair)=>Assembly=>Metagenome (contigs + unassembled reads)=>Analysis in IMG/M-ER: community composition; functional analysis

8 Advancing Science with DNA Sequence Pyrotag results – BLASTn against Greengenes database

9 Advancing Science with DNA Sequence PhyloDistribution results – BLASTp of metagenome CDSs against isolates in IMG

10 Advancing Science with DNA Sequence Pyrotags vs PhyloDistribution – white mat Big differences in abundance (an order of magnitude or more) of Bacteroidetes and Thermotogae

11 Advancing Science with DNA Sequence Possible explanations: Primer bias in pyrotags (against Proteobacteria)? Amplification artifacts in pyrotags (well known for metagenome data). Sequencing GC bias in the metagenome: low and high ( 65%) GC sequences are underrepresented in Illumina data. K-mer assembler problems: abundant populations may be underrepresented in the assembly if incorrect k-mer/coverage parameters are selected

12 Advancing Science with DNA Sequence PCR artifacts in metagenome data. 454 technology includes an emulsion PCR step, which may lead to artificial overrepresentation of certain sequences. Reason: presence of free beads during the library prep step; escaped emPCR products bind to free beads and are disproportionately amplified

13 Advancing Science with DNA Sequence What about GC bias? Examples: low GC (Brachyspira), medium GC (Arcanobacterium), high GC (Cellulomonas). Question: how do you find average/max/min GC content for a clade? Answer: IMG=>Genome Browser=>View Phylogenetically=>click on green + to select the clade, then “Add selected to Genome Cart”=>Compare Genomes=>Genome Statistics. Result: Thermotogae GC percent 41 average/47 max/31 min; Bacteroidetes GC percent 42.5 average/66 max/31 min
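
Outside IMG, the same average/max/min summary can be approximated from genome sequences directly. The following is a minimal sketch, not the IMG implementation: the function names and the toy sequences are made up for illustration, and the average is assumed to be an unweighted per-genome mean.

```python
def gc_percent(seq: str) -> float:
    """GC content of one sequence, as a percentage of its A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases such as N are ignored
    if total == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return 100.0 * gc / total


def gc_summary(genomes: dict[str, str]) -> dict[str, float]:
    """Average, max and min GC percent over the genomes of one clade."""
    if not genomes:
        raise ValueError("no genomes given")
    values = [gc_percent(seq) for seq in genomes.values()]
    return {
        "average": sum(values) / len(values),  # unweighted per-genome mean (assumption)
        "max": max(values),
        "min": min(values),
    }


# Hypothetical toy "clade" of three very short genomes.
toy_clade = {
    "genome_a": "ATATATGC",
    "genome_b": "GCGCGCAT",
    "genome_c": "ATGCATGC",
}
summary = gc_summary(toy_clade)
print(summary)
```

Real genomes would be read from FASTA files; the toy strings only keep the example self-contained.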

14 Advancing Science with DNA Sequence Are there any abundant populations that could be filtered out in assembly? Typical Pyrotagger output. There are 2 highly abundant populations: just 2 clusters account for nearly all Bacteroidetes and Thermotogae in the sample

15 Advancing Science with DNA Sequence Let’s take a closer look at the assemblies and unassembled reads

                                              White mat     Red mat
454 reads total                                 299,975   1,429,091
Illumina reads total                         49,227,146  45,337,178
Assembled contigs                               195,590      88,776
N50, bp                                             659         869
Longest contig, bp                               28,145      75,483
Illumina reads mapped to assembly, % total         42.3        12.5
454 reads mapped to assembly, % total              62.1        15.3
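
For readers unfamiliar with the N50 statistic reported for the assemblies: it is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch follows, with a hypothetical function name and made-up contig lengths (the per-contig lengths of the two assemblies are not given in the slides).

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable: the running sum always reaches half the total")


# Made-up contig lengths in bp, for illustration only.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))
```

In the toy set the total is 1500 bp; the two longest contigs (500 + 400) already cover more than 750 bp, so the N50 is 400.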

16 Advancing Science with DNA Sequence Functional analysis: metagenome as a bag of functions. Red mat is taxonomically more diverse. Is it more diverse functionally? COG clusters: 3631 (white mat) vs 3402 (red mat). Pfam clusters: 3847 (white mat) vs 3505 (red mat). Question: where do you find this information? Answer: IMG=>Taxon Details=>Metagenome Statistics; Genes with Pfam=>Display as a list=>Export. Rarefaction curves: white mat is expected to have ~4000 different Pfams; red mat ~3600
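
A rarefaction curve can be sketched from the exported per-Pfam gene counts. The example below uses the standard expected-richness formula for sampling without replacement, E[S_n] = sum over families of 1 - C(N - N_i, n) / C(N, n); it only covers the interpolation part of the curve and is not necessarily how the ~4000 and ~3600 estimates were obtained. The function name and the toy counts are assumptions for illustration.

```python
from math import comb


def expected_richness(counts: list[int], n: int) -> float:
    """Expected number of distinct families in a random subsample of n genes."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total gene count")
    denom = comb(total, n)
    # A family with c genes is missed with probability C(total - c, n) / C(total, n).
    return sum(1 - comb(total - c, n) / denom for c in counts)


# Made-up gene counts for three protein families.
toy_counts = [5, 3, 2]
curve = [expected_richness(toy_counts, n) for n in range(sum(toy_counts) + 1)]
print(curve)
```

A curve that is still climbing at the full sample size suggests that more sequencing would reveal additional families, which is the reasoning behind estimates larger than the observed counts.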

17 Advancing Science with DNA Sequence Abundance Comparisons. Motility and chemotaxis genes are overrepresented in white mat (detected by both Pfams and COG Categories). Figure panels: white mat, red mat
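
One generic way to check whether a functional category is overrepresented in one sample is a two-proportion z-test on gene counts. This is a sketch of that general technique, not necessarily the statistic used by the IMG abundance comparison tool; the function name and all counts are hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z(hits_a: int, total_a: int, hits_b: int, total_b: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for hits_a/total_a vs hits_b/total_b."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical counts: genes in a category vs all annotated genes in each sample.
z, p = two_proportion_z(300, 10_000, 100, 10_000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

Comparing proportions rather than raw counts matters because the two metagenomes differ in the number of annotated genes.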

18 Advancing Science with DNA Sequence Is motility/chemotaxis common to all organisms in white mat? Scenario 1: the function/pathway is overrepresented because it is present in all members of the community, possibly at higher copy number. Scenario 2: the function/pathway is overrepresented because it is present in one clade, which is absent from the second sample. Question: can we distinguish between the two scenarios? Answer: click on the gene count for protein family/functional category, add all genes to Gene Cart=>add scaffolds to Scaffold Cart=>PhyloDistribution of all scaffolds in the Scaffold Cart
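
The logic behind that answer can be sketched as a tally: take the clade assignment of every scaffold that carries a gene from the category and look at how the assignments are distributed. The function name and the assignments below are made up for illustration; they are not results from the white mat sample.

```python
from collections import Counter


def clade_shares(assignments: list[str]) -> dict[str, float]:
    """Fraction of scaffolds assigned to each clade, most common clade first."""
    if not assignments:
        raise ValueError("no assignments given")
    counts = Counter(assignments)
    total = sum(counts.values())
    return {clade: n / total for clade, n in counts.most_common()}


# Hypothetical clade assignments for scaffolds carrying motility genes.
toy_assignments = ["Epsilonproteobacteria"] * 8 + ["Bacteroidetes", "Thermotogae"]
shares = clade_shares(toy_assignments)
print(shares)
```

A distribution dominated by one clade points toward Scenario 2, while shares that roughly follow the overall community composition point toward Scenario 1. Where to draw the line is a judgment call that the slides do not specify.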

19 Advancing Science with DNA Sequence Are Sulfurimonas-like bacteria present in both samples? The total number of sequences in all clusters assigned to Epsilonproteobacteria is 50 in white mat and 66 in red mat. Largest cluster in white mat includes 125K+ sequences. Largest cluster in red mat includes 14K+ sequences. Question: what about the presence of Sulfurimonas-like bacteria in the metagenomes? Answer: go to Compare Genomes=>PhyloDistribution=>Genome vs Metagenomes, select the genome; the histogram shows the number of BLASTp hits from CDSs in all metagenomes to this genome

20 Advancing Science with DNA Sequence Are there any methylotrophs in the white mat?

21 Advancing Science with DNA Sequence Conclusions. The two communities differ in composition; the white mat, sampled next to the hydrothermal vent, has lower complexity. Community composition as sampled by pyrotags and by the metagenome may be quite different due to a number of biases. Some protein families/functional categories are more abundant in one sample than in the other because of different community composition, and not necessarily because they are more important in this environment

