MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res. 2007.


1 MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res. 2007

2 Early metagenomics
- Known phylogenetic markers and subsequent sequencing of clones
- Analysis of paired-end reads
- Complete sequences of environmental fosmid and BAC clones
- Rough annotation of the metabolic capacity
- Environmental assemblies
- Distinguishing between discrete species and populations of closely related biotypes
- Problem of using proven phylogenetic markers (ribosomal genes, coding sequences)
- Slow-evolving genes: distinguishing between species at large evolutionary distances

3 What is MEGAN?
- Metagenome Analyzer (MEGAN)
- Free software
- Deviates from the analytical pattern of previous approaches
- Built on the statistical analysis of comparing random sequence intervals with unspecified phylogenetic properties against databases
- Depends on the related sequences in the databases
- Provides filters to adjust the level of stringency later to an appropriate level
- Laptop analysis
  - Comparison result (BLAST) -> laptop (MEGAN)
- Graphical and statistical output

4 Pipeline
- Compare against databases: BLAST
- Compute and explore taxonomical content: NCBI taxonomy
- Lowest common ancestor (LCA) algorithm
- Data sets (Sargasso Sea, mammoth bone, short-read simulations of E. coli K12 and B. bacteriovorus HD100)
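
The LCA step in the pipeline can be illustrated with a minimal sketch. It assumes a toy, heavily abridged taxonomy stored as a child-to-parent map. The `PARENT` table and the function names are hypothetical and are not MEGAN's own code or data structures. The idea: a read that hits several taxa is placed on the deepest node that is an ancestor of all of them.

```python
# Toy taxonomy as a child -> parent map (hypothetical and heavily abridged;
# the real NCBI taxonomy has many more intermediate ranks).
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Deltaproteobacteria": "Proteobacteria",
    "Escherichia coli": "Gammaproteobacteria",
    "Salmonella enterica": "Gammaproteobacteria",
    "Bdellovibrio bacteriovorus": "Deltaproteobacteria",
}


def path_to_root(taxon):
    """Return the lineage from `taxon` up to the root, both inclusive."""
    path = [taxon]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa):
    """Return the deepest taxon that is an ancestor of (or equal to) every input."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("need at least one taxon")
    other_lineages = [set(path_to_root(t)) for t in taxa[1:]]
    # Walk the first lineage upward; the first node present in every other
    # lineage is the lowest common ancestor.
    for node in path_to_root(taxa[0]):
        if all(node in lineage for lineage in other_lineages):
            return node
    return "root"


print(lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"]))
print(lowest_common_ancestor(["Escherichia coli", "Bdellovibrio bacteriovorus"]))
```

A read whose hits are confined to one species stays at that species, while a read hitting distant taxa moves up to a higher rank, which is why conserved genes end up assigned high in the tree.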

5 What we can do with MEGAN
- Species and strain identification through species-specific genes
- Searching for species or taxa with the Find tool
- Distribution of strains of a species
- Underlying sequence alignments

6 Experiments-1
- Sargasso Sea
- Data set
  - Sanger sequencing
  - Samples 1-4 from DDBJ/EMBL/GenBank
  - 10,000 reads from Sample 1
  - Randomly selected pooled set of 10,000 reads from Samples 2-4
  - BLASTX -> NCBI-NR
  - 1% no hits from Sample 1, <3% no hits from Samples 2-4
- Filters
  - Min-score: bit-score threshold of 100
  - Top-percent: keep hits whose bit scores lie within 5% of the best score
  - Min-support: isolated assignments (hit by only one read) discarded
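
The three filters listed on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not MEGAN's implementation: hits are represented as hypothetical `(taxon, bit_score)` tuples, the function names are invented for the example, and min-support is applied to exact taxon labels only, without counting reads assigned to descendant taxa.

```python
from collections import Counter


def filter_hits(hits, min_score=100.0, top_percent=5.0):
    """Apply min-score, then top-percent, to one read's (taxon, bit_score) hits."""
    # Min-score: drop hits below the absolute bit-score threshold.
    strong = [(taxon, score) for taxon, score in hits if score >= min_score]
    if not strong:
        return []
    # Top-percent: keep hits whose score is within top_percent of the best score.
    best = max(score for _, score in strong)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [(taxon, score) for taxon, score in strong if score >= cutoff]


def apply_min_support(assignments, min_support=2):
    """Discard reads assigned to a taxon that fewer than min_support reads hit."""
    counts = Counter(assignments.values())
    return {read: taxon for read, taxon in assignments.items()
            if counts[taxon] >= min_support}


# Hypothetical hits for a single read.
hits = [("A", 200.0), ("B", 195.0), ("C", 150.0), ("D", 90.0)]
print(filter_hits(hits))

# Hypothetical read -> taxon assignments; "Y" is supported by one read only.
print(apply_min_support({"r1": "X", "r2": "X", "r3": "Y"}))
```

With the slide's settings (threshold 100, top 5%), hit D fails min-score and hit C fails top-percent, so only A and B would go on to the LCA step.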

7 Analysis-Sargasso Sea data
- 1.66M reads, avg. 818 bp, by Sanger sequencing
- Species profile of 16 taxonomical groups
- Environmental assemblies
- By analyzing six specific phylogenetic markers
  - rRNA, RecA/RadA, HSP70, RpoB, EF-Tu, and EF-G

8 Result
- Sample 1
  - ~83% of reads were assigned to taxa more specific than the kingdom level
  - The majority (8298) were assigned to the bacterial group
- Samples 2-4
  - ~59% of reads were assigned to taxa more specific than the kingdom level
  - The majority (5709) were assigned to the bacterial group
- Alphaproteobacteria and Gammaproteobacteria dominate by a factor of 2-4 over the remaining 14 taxonomic groups
- Eukaryotes & viruses: size filtering
- Archaea: possibly because there is 10 times as much bacterial sequence information in the public databases
- MEGAN vs. previous analysis (Venter et al. 2004)
  - Specific assignment information: LCA

9 Result-cont.
- Averaged weighted percentage of the six phylogenetic markers for each of the 16 taxonomic groups
- Easily detect sampling bias between Sample 1 and pooled Samples 2-4

10 Experiments-2
- Mammoth bone
- Data set
  - Roche GS20 sequencing (sequencing-by-synthesis)
  - Sample from 1 g of mammoth bone, 28,000 years old
  - ~300,000 reads, 95 bp
  - BLASTZ against genome sequences (elephant, human, dog)
  - 45.4% of the reads are mammoth DNA; the others come from environmental organisms (bacteria, fungi, amoeba, nematodes)
  - BLASTX against NCBI-NR for environmental sequences
- Filters: bit-score threshold 30, discard isolated assignments (filtered 2086 reads)

11 Result
- 19841 reads assigned to Eukaryota, of which 7969 to Gnathostomata
- 16972: Bacteria, 761: Archaea, 152: Viruses
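
As a small worked example, the read counts on this slide can be turned into relative shares. The sketch below uses only the four top-level counts given above; the percentages are relative to the sum of those four groups, not to the full data set, and the variable names are illustrative.

```python
# Read counts per top-level group, as listed on the slide.
counts = {"Eukaryota": 19841, "Bacteria": 16972, "Archaea": 761, "Viruses": 152}

total = sum(counts.values())
shares = {taxon: 100.0 * n / total for taxon, n in counts.items()}

for taxon, pct in shares.items():
    print(f"{taxon}: {counts[taxon]} reads, {pct:.1f}% of the four groups")

# Share of the Eukaryota reads that fall within Gnathostomata.
gnathostomata_share = 100.0 * 7969 / counts["Eukaryota"]
print(f"Gnathostomata within Eukaryota: {gnathostomata_share:.1f}%")
```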

12 Experiment 3
- Identifying species from various read lengths
- Short E. coli K12 & B. bacteriovorus HD100 simulation
  - 5000 random shotgun reads
  - BLASTX against NCBI-NR
- Filters
  - Bit-score threshold 35
  - Within 20% of the best hit
  - Discarded isolated assignments
- Result: no false-positive assignments; short reads can be used for metagenomic analysis, albeit at the cost of a high rate of under-prediction

13 Experiment 3-cont.
- Roche GS20 sequencing
- Data set
  - 2000 reads from random positions in the E. coli K12 genome
  - ~100 bp
  - BLASTX against NCBI-NR
- Filters
  - Bit-score threshold 35
  - Within 20% of the best hit
  - Discarded isolated assignments
- Result

14 Experiment 3-cont.
- Roche GS20 sequencing
- Data set
  - 2000 reads from random positions in the B. bacteriovorus HD100 genome
  - ~100 bp
  - BLASTX against NCBI-NR: A in figure
  - BLASTX against NCBI-NR without B. bacteriovorus HD100: B in figure
- Filters
  - Bit-score threshold 35
  - Within 20% of the best hit
  - Discarded isolated assignments
- Result
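
Drawing reads from random genome positions, as in Experiment 3, might look like the sketch below. It is an illustration under assumptions: a randomly generated toy sequence stands in for the real genome, the genome is treated as linear (bacterial chromosomes are circular), reads are error-free and of fixed length, and `sample_reads` is a hypothetical helper name.

```python
import random


def sample_reads(genome, n_reads, read_length, seed=0):
    """Draw fixed-length reads from uniformly random start positions."""
    if read_length > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    last_start = len(genome) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(genome[start:start + read_length])
    return reads


# Toy stand-in for a genome; a real run would load a FASTA sequence instead.
toy_rng = random.Random(42)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(10_000))

reads = sample_reads(toy_genome, n_reads=2000, read_length=100)
print(len(reads), len(reads[0]))
```

The resulting reads would then be searched with BLASTX and passed through the filters listed on the slide.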

15 MEGAN 3 (June 2009)
- Suitable for very large datasets
- Advances in the throughput and cost-efficiency of sequencing technology
- Interests changed
  - From 'Which species are present?' to 'What's different?'
- Features
  - Visualization technique for multiple databases
  - New statistical method for highlighting the differences in a pairwise comparison

16 MEGAN 3-cont.
- Comparing 6 mouse gut data sets with human gut
- Clickable, collapsible


