Metagenomic dataset preprocessing – data reduction

1 Metagenomic dataset preprocessing – data reduction
Konstantinos Mavrommatis

2 Complexity
Example environments along the species complexity axis: Acid Mine Drainage, Sargasso Sea, Termite Hindgut, Cow rumen, Soil.
The total metagenome is the result of a cell community. Cells belong to different organisms ranging from strains to domains.
Who is there? (Phylogenetic content)
What does it do? (Functional content)
Why is it there? (Comparative study)

3 Dataset processing
Workflow diagram labels: Sample preparation, High throughput sequencing, Assemble reads, Analysis, Feature prediction, ?, QC, Functional annotation and comparative analysis, Binning.

4 Dataset processing (v 3.0a)
Submitted files: assembled contigs, 454 reads, Illumina reads (fasta/fastq).
File QC: check character set and contig name; remove trailing Ns.
Trimming (two branches in the diagram): Q=20; Q=13. Output: fasta.
Low complexity filter: size of 80 bp.
Dereplication: prefix = 5, identity 95%.
Clustering: 100% identity.
Output: fasta file for gene calling.
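
The file QC step (character set check, removal of trailing Ns) could be sketched as below. This is only an illustration: the allowed alphabet (IUPAC nucleotide codes) and the function names are assumptions, since the slide does not say which characters the pipeline accepts.

```python
# Assumed alphabet: IUPAC nucleotide codes. The slide does not list the
# characters the real pipeline accepts.
ALLOWED = set("ACGTURYSWKMBDHVN")


def has_valid_charset(seq: str) -> bool:
    """Return True if every character of seq is in the assumed alphabet."""
    return set(seq.upper()) <= ALLOWED


def strip_trailing_ns(seq: str) -> str:
    """Remove a run of N/n characters from the 3' end of the sequence."""
    return seq.rstrip("Nn")
```

For example, `strip_trailing_ns("ACGTNNN")` leaves `"ACGT"`, while internal Ns are kept untouched.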

5 Dataset processing Feature prediction pipeline (v 3.0a)
Input: fasta file for gene calling.
CRISPR detection: crt / pilercr.
RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam).
CDS detection: isolates: prodigal; metagenomes: varies (unassembled reads + assembled contigs).
Conflict resolution.
Concatenation of all results; creation of final output file.
Output: file for IMG, loaded into IMG.

6 Dataset processing Quality trimming
Remove low-quality sequence from the ends of the reads.
454 datasets: lucy.
Illumina: longest high quality string.
(Courtesy Alex Copeland)
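
The "longest high quality string" idea for Illumina reads can be illustrated with a minimal sketch: keep the longest contiguous stretch of bases whose quality score is at or above a threshold. The function names and the default threshold of 20 are assumptions for illustration; this is not the pipeline's actual trimmer.

```python
def longest_high_quality_run(quals, min_q=20):
    """Return (start, end) of the longest run with every score >= min_q.

    end is exclusive. If no base passes, (0, 0) is returned.
    On ties, the first longest run wins.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def trim_read(seq, quals, min_q=20):
    """Trim a read and its quality scores to the longest high-quality run."""
    start, end = longest_high_quality_run(quals, min_q)
    return seq[start:end], quals[start:end]
```

With scores `[5, 30, 30, 2, 30, 30, 30, 10]` the run at positions 4 to 6 is kept, so both low-quality ends and the shorter run before the dip are discarded.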

7 Dataset processing Low complexity filter
Examples of low-complexity sequence: tatatatatatatatatat, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.
Masking uses dust (NCBI).
Remove sequences with fewer than 80 informative bases.
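
The 80-base cutoff can be sketched as a simple count over an already masked sequence. The sketch assumes the masking step has replaced low-complexity bases with N or with lowercase letters; it does not implement the dust algorithm itself, and the function names are hypothetical.

```python
def informative_bases(masked_seq: str) -> int:
    """Count unmasked bases, i.e. uppercase A, C, G, T.

    Assumption: masked positions are N or lowercase letters.
    """
    return sum(1 for base in masked_seq if base in "ACGT")


def passes_low_complexity_filter(masked_seq: str, min_informative: int = 80) -> bool:
    """Keep a sequence only if it has at least min_informative unmasked bases."""
    return informative_bases(masked_seq) >= min_informative
```

A read with exactly 80 unmasked bases is kept; one with 79 is removed.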

8 Dataset processing Dereplication

9 Dataset processing Sequence dereplication
Figure: example read groups (atcccat, atc-cat, atcccat; gctacat, gctncat, gctacat), with one case labelled "Not dereplicated".
Dereplication uses uclust at 95% identity (global alignment) and requires an identical prefix (5 nt).
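
The two criteria (identical 5 nt prefix, 95% identity) can be illustrated with a small greedy sketch. It is not uclust: identity is approximated here as 1 minus edit distance divided by the longer length, which is an assumption, and all function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Approximate global identity (assumption: 1 - distance / longer length)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def dereplicate(reads, prefix_len=5, min_identity=0.95):
    """Greedy dereplication: drop a read if an earlier kept read shares its
    prefix and is at least min_identity similar; otherwise keep it."""
    reps_by_prefix = {}
    kept = []
    for read in reads:
        reps = reps_by_prefix.setdefault(read[:prefix_len], [])
        if any(identity(read, rep) >= min_identity for rep in reps):
            continue
        reps.append(read)
        kept.append(read)
    return kept
```

Bucketing by prefix means a read is only compared against representatives that start with the same 5 nt, so a near-identical read with a different prefix is not dereplicated in this sketch.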

10 Dataset processing Evaluation of processing tools
Because of their small size, quality problems, and large number, unassembled sequences need to be processed with efficient pipelines.
Simulated datasets:
Using sequences extracted from finished genomes (perfect sequences).
Using reads that have been used to assemble finished genomes (real errors).
Evaluation and development of new tools/wrappers.

11 Dataset processing Feature prediction
Available methods:
Ab initio: Metagene, MetaGeneMark, FragGeneScan, Prodigal.
Similarity based: Blastx, USEARCH.
Figure: predictions on the metagenome compared with the isolate, labelled CORRECT, MISSED, WRONG, NEW.

12 Method performance

13 Quality effect

14 Trimming

15 454 Ti (no errors)

16 454 Ti (with errors)

17 Illumina 115 bp

18 Illumina 74 bp

19 Contigs
Figure labels: frameshift, wrong prediction.

20 Why annotate unassembled reads?
Sample: total size 102,722,384 (2x150) reads.
Assembled contigs: 1,375,950 contigs; 5060 different pfams.
Assembled reads, mapped (by bwa): 11,778,925 reads.
Genes called on unassembled reads: 64,737,444 genes; 7481 different pfams.
Similar to genes on contigs: 8,373,641 (12%) genes.
Genes with similarity to isolate genomes: 40,778,854 genes.
Additional information about functions and phylogeny.
More accurate statistics based on unassembled + assembled.
Figure labels: Assembled only; Unassembled + assembled; real metagenome.
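
The proportions behind these counts can be recomputed with a few lines. The choice of denominators is an assumption (mapped reads over total reads, and genes similar to contig genes over all genes called on unassembled reads); the slide does not state how its 12% was derived.

```python
def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts copied from the slide.
total_reads = 102_722_384
mapped_reads = 11_778_925
genes_on_unassembled = 64_737_444
genes_similar_to_contig_genes = 8_373_641
genes_similar_to_isolates = 40_778_854

mapped_pct = percent(mapped_reads, total_reads)
contig_like_pct = percent(genes_similar_to_contig_genes, genes_on_unassembled)
isolate_like_pct = percent(genes_similar_to_isolates, genes_on_unassembled)

print(f"reads mapped to contigs: {mapped_pct:.1f}%")
print(f"genes similar to genes on contigs: {contig_like_pct:.1f}%")
print(f"genes similar to isolate genomes: {isolate_like_pct:.1f}%")
```

Under these assumptions only a small share of the reads is represented in the assembly, which is the slide's argument for annotating unassembled reads as well.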

21 Processing time (metagenomes)
Notes: Highlight metrics. Show what I think should be the best metric for prediction for 2012.
Total submissions: 336
Processing time: 2.45 days (annotation), 24 days (integration)
Data size (bp): 174,719,855 (average); 58,006,992, (total)

22 Processing time (isolates)
Total submissions: 3630
Processing time: 10 hours (annotation), 12 days (integration)
Data size (bp): 1,658,242 (average); 4,114,099, (total)

23 Thank you for your attention

