Quality Control & Preprocessing of Metagenomic Data

Presentation on theme: "Quality Control & Preprocessing of Metagenomic Data"— Presentation transcript:

1 Quality Control & Preprocessing of Metagenomic Data
SDSU Robert Schmieder –

2 Need for automated approach
Metagenomic datasets contain 100,000s (454) or 1,000,000s (Illumina) of sequences.
Illumina HiSeq 2000: currently 300 GB of data, soon 2,000 GB (≈33 human genomes with 20x coverage in a single sequencing run).
You cannot just read sequence by sequence to get an idea of your data.
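The "≈33 human genomes" figure can be checked with quick arithmetic. This is only a sketch: it assumes the run output is counted in gigabases and that one human genome is about 3 gigabases, neither of which is stated on the slide.

```python
# Assumptions (not from the slide): output is measured in gigabases and
# a haploid human genome is roughly 3 gigabases long.
HUMAN_GENOME_GIGABASES = 3.0


def genomes_per_run(run_output_gigabases, coverage,
                    genome_size_gigabases=HUMAN_GENOME_GIGABASES):
    """Number of genomes one run can cover at the requested depth."""
    return run_output_gigabases / (coverage * genome_size_gigabases)


# 2,000 gigabases at 20x coverage
print(round(genomes_per_run(2000, 20)))  # prints 33
```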

3 Basic data analysis Perform similarity search New dataset Assemble

4 Bad data analysis

5 Bad data analysis

6 Bad data analysis

7 Bad data analysis

8 Bad data analysis

9 Bad data analysis

10 Good data analysis New dataset

11 Good data analysis New dataset Quality control & Preprocessing

12 Good data analysis New dataset Quality control & Preprocessing
Similarity search Assembly

13 Good data analysis New dataset Quality control & Preprocessing
Similarity search Assembly

14 3 Tools for metagenomic data

15 Quality control and data preprocessing

16 Number and Length of Sequences

17 Number/Length of sequences
Reads should be approx. the same length (same number of cycles).
Short reads are likely lower quality.
[Figure panels: "Bad" and "Good"]
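A minimal sketch of how the number and length of sequences could be summarized from a FASTA file. The function names and the "mean minus two standard deviations" cutoff for flagging short reads are hypothetical choices for illustration, not taken from the presentation.

```python
from statistics import mean, pstdev


def read_fasta_lengths(lines):
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def length_summary(lengths):
    """Basic statistics describing the read length distribution."""
    lengths = list(lengths)
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "stdev": pstdev(lengths),
    }


def flag_short_reads(lengths, num_stdev=2.0):
    """Indices of reads shorter than mean - num_stdev * stdev (hypothetical cutoff)."""
    lengths = list(lengths)
    cutoff = mean(lengths) - num_stdev * pstdev(lengths)
    return [i for i, n in enumerate(lengths) if n < cutoff]
```

For a file on disk, `length_summary(read_fasta_lengths(open(path)))` would give the summary, where `path` is whatever FASTA file is being checked.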

18 Quality of Sequences

19 Linearly degrading quality across the read
Trim low quality ends
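One simple way to trim low-quality ends is to cut bases from the 3' end until a base reaches a chosen score. The sketch below assumes Phred+33 encoded quality strings and a hypothetical threshold of 20; actual tools may use windowed or more elaborate schemes.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_3prime(seq, quals, min_quality=20):
    """Remove trailing bases whose quality is below min_quality.

    min_quality=20 is an illustrative default, not a recommendation
    from the presentation.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]
```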

20 Quality filtering
Any region with a homopolymer will tend to have a lower quality score.
Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages.
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
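A filter based on the average score can be sketched as follows. The threshold of 25 comes from the Huse et al. observation above; the read representation as `(read_id, seq, quals)` tuples is an assumption made for this example.

```python
def mean_quality(quals):
    """Arithmetic mean of a list of integer quality scores."""
    return sum(quals) / len(quals)


def filter_by_mean_quality(reads, min_mean=25):
    """Keep reads whose mean quality is at least min_mean.

    reads: iterable of (read_id, seq, quals) tuples (assumed layout).
    Reads without quality scores are dropped.
    """
    return [read for read in reads
            if read[2] and mean_quality(read[2]) >= min_mean]
```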

21 Low quality sequence issue
Most assemblers or aligners do not take quality scores into account.
Errors in reads complicate assembly, might cause misassembly, or make assembly impossible.

22 What if quality scores are not available?
Alternative: Infer quality from the percent of Ns found in the sequence.
Removes regions with a high number of Ns.
Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality.
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
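When no quality scores exist, the percentage of Ns per sequence can serve as the filter criterion. This is a minimal sketch; the 1% default cutoff is a hypothetical value, and setting it to 0 removes every read that contains an N.

```python
def percent_n(seq):
    """Percentage of ambiguous bases (N) in a sequence."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def filter_by_n_content(seqs, max_percent_n=1.0):
    """Keep sequences with at most max_percent_n percent Ns.

    max_percent_n=1.0 is an illustrative default only.
    """
    return [s for s in seqs if percent_n(s) <= max_percent_n]
```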

23 Ambiguous bases
If you can afford the loss, filter out all reads containing Ns.
Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, …) use a 2-bit encoding system for nucleotides.
Some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A).
2-bit example: 00 – A, 01 – C, 10 – G, 11 – T
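The effect of 2-bit encoding on ambiguous bases can be illustrated with a small encoder. It uses the code table from the slide; the function names and the `n_policy` parameter are made up for this sketch and do not mirror the internals of any particular assembler or aligner.

```python
import random

# Code table from the slide: 00 = A, 01 = C, 10 = G, 11 = T
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_2bit(seq, n_policy="fixed", rng=None):
    """Pack a sequence into an integer, two bits per base.

    Any character outside A/C/G/T (such as N) has no code of its own, so it
    is replaced by 'A' (n_policy="fixed") or by a random base (any other value).
    """
    rng = rng or random.Random()
    value = 0
    for base in seq.upper():
        if base not in TWO_BIT:
            base = "A" if n_policy == "fixed" else rng.choice("ACGT")
        value = (value << 2) | TWO_BIT[base]
    return value


def decode_2bit(value, length):
    """Unpack an integer produced by encode_2bit back into a sequence."""
    bases = "ACGT"
    return "".join(bases[(value >> (2 * (length - 1 - i))) & 0b11]
                   for i in range(length))


# The N is silently turned into an A and cannot be recovered afterwards.
print(decode_2bit(encode_2bit("ANGT"), 4))  # prints AAGT
```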

24 Sequence duplicates

25 Real or artificial duplicate ?
Metagenomics = random sampling of genomic material
Why do reads start at the same position?
Why do these reads have the same errors?
No specific pattern or location on sequencing plate
11-35%
Gomez-Alvarez et al.: Systematic artifacts in metagenomes from complex microbial communities. ISME Journal (2009)
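Reads that start at the same position share the same leading bases, so one rough way to find duplicate candidates is to group reads by their prefix. This is a simplified sketch: the prefix length of 20 is a hypothetical parameter, sequencing errors inside the prefix are not tolerated, and the grouping alone cannot tell real duplicates from artificial ones.

```python
def remove_duplicates(reads, prefix_length=20):
    """Keep the longest read from each group sharing the same prefix.

    reads: iterable of (read_id, seq) tuples (assumed layout).
    """
    best = {}
    for read_id, seq in reads:
        key = seq[:prefix_length].upper()
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (read_id, seq)
    return list(best.values())


def duplicate_fraction(reads, prefix_length=20):
    """Fraction of reads that would be removed as duplicates."""
    reads = list(reads)
    kept = remove_duplicates(reads, prefix_length)
    return 1 - len(kept) / len(reads)
```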

26

27 One micro-reactor – Many beads
Martine Yerle (Laboratory of Cellular Genetics, INRA, France)

28 Impacts of duplicates
False variant (SNP) calling
Require more computing resources:
Find similar database sequences for same query sequence
Assembly process takes longer
Increase in memory requirements
Abundance or expression measures can be wrong

29 Impacts of duplicates
False variant (SNP) calling
Require more computing resources:
Find similar database sequences for same query sequence
Assembly process takes longer
Increase in memory requirements
Abundance or expression measures can be wrong

30 Depends on the experiment
In contrast, for Illumina reads with high coverage, eliminating singletons is an easy way of dramatically reducing the number of error-prone reads.

31 Tag Sequences

32 No tag MID tag WTA tags

33 Detect and remove tag sequences

34 Fragment-to-fragment concatenations

35 Concatenated fragments in assembled contigs

36

37 Data upload Tag sequence definition

38 Tag sequence prediction

39 Parameter definition Download results

40 Sequence Contamination

41 Principal component analysis (PCA) of dinucleotide relative abundance
Microbial metagenomes Viral metagenomes

42 Identification and removal of sequence contamination

43 Contaminant identification
Current methods have critical limitations.
Dinucleotide relative abundance uses the information content in sequences → cannot identify single contaminant sequences.
Sequence similarity seems to be the only reliable option to identify single contaminant sequences.
BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, …).
Novel sequences in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)
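For reference, dinucleotide relative abundance is commonly defined as the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies, ρ(XY) = f(XY) / (f(X) · f(Y)). The sketch below computes it on a single strand; whether the analysis in the presentation also included the reverse complement is not stated, so that step is left out here.

```python
from collections import Counter
from itertools import product


def dinucleotide_relative_abundance(seq):
    """Return rho(XY) = f(XY) / (f(X) * f(Y)) for all 16 dinucleotides.

    Positions containing characters other than A/C/G/T are skipped.
    Dinucleotides whose expected frequency is zero get a value of 0.0.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in "ACGT")
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1)
                 if seq[i] in "ACGT" and seq[i + 1] in "ACGT")
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        rho[x + y] = observed / expected if expected > 0 else 0.0
    return rho
```

A profile of 16 values like this is stable only for large sets of sequences, which is consistent with the limitation noted above: a single short read carries too few dinucleotides to be classified this way.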

44 DeconSeq web interface
Two types of reference databases Remove Retain

45 DeconSeq web interface (cont.)

46 Human DNA contamination identified in 145 out of 202 metagenomes

47 Conclusions
Quality control and data preprocessing are very important to increase the quality of downstream analysis.
Preprocessing depends on the experiment.

48

