Quality Control Hubert DENISE
Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:
Metagenomics data analysis: quality control, diversity analysis, functional analysis
QC rationale
Why? Garbage in, garbage out.
Base call errors:
- each base call has an associated quality score
- each platform has its own specific errors
Read quality decreases with read length.
NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.
EBI Metagenomics: QC step by step
- Clipping: low-quality ends are trimmed and adapter sequences are removed using the Biopython SeqIO package
- Quality filtering: sequences with > 10% undetermined nucleotides are removed
- Read length filtering: short sequences are removed (100 nt threshold)
- Duplicate sequence removal: sequences are clustered at 99% identity (UCLUST v for 454 and Qiime prefix clustering for Illumina) and a representative sequence is chosen
- Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
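The quality and read length filters listed above can be sketched in a few lines of Python. This is a minimal illustration, not the EBI Metagenomics pipeline code: the function names are hypothetical, undetermined nucleotides are assumed to be written as N, and the 100 nt threshold is assumed to mean that reads shorter than 100 nt are dropped.

```python
def fraction_undetermined(seq: str) -> float:
    """Return the fraction of bases in seq that are 'N' (case-insensitive)."""
    if not seq:
        return 1.0  # treat an empty read as fully undetermined
    return seq.upper().count("N") / len(seq)


def passes_filters(seq: str, max_n_fraction: float = 0.10, min_length: int = 100) -> bool:
    """Keep a read only if it is long enough and has few undetermined bases."""
    if len(seq) < min_length:  # assumption: reads below the threshold are removed
        return False
    return fraction_undetermined(seq) <= max_n_fraction


# Toy reads, invented for illustration only.
reads = {
    "good": "ACGT" * 30,                    # 120 nt, no N
    "too_short": "ACGT" * 10,               # 40 nt
    "too_many_n": "ACGT" * 25 + "N" * 20,   # 120 nt, 20 of them N
}
kept = [name for name, seq in reads.items() if passes_filters(seq)]
print(kept)  # prints ['good']
```

Keeping each filter as a separate predicate makes it easy to count how many reads each step removes.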
EBI Metagenomics: QC consequences (figure panels: Roche 454, Illumina, Ion Torrent)
MG-RAST QC
- dereplication (first 50 bp)
- model organism screening (bowtie)
- length filtering (> 75 bp)
- ambiguous base filtering (< 5 bp)
- dynamic base filtering (phred score)
- analysis
EBI Metagenomics QC
- duplicate sequence filtering (first 50 bp)
- repeat masking
- clipping (10%)
- quality filtering (phred score)
- read length filtering (> 100 bp)
- analysis
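Both pipelines mention duplicate handling based on the first 50 bp. A rough sketch of that idea is to treat two reads as duplicates when their first 50 bases are identical and to keep the first one seen. This is only an illustration under that assumption; the function name is hypothetical and neither pipeline's actual implementation is reproduced here.

```python
def dereplicate_by_prefix(reads, prefix_length=50):
    """Keep the first read seen for each distinct prefix of prefix_length bases.

    reads is an iterable of (name, sequence) pairs.
    """
    seen = set()
    unique = []
    for name, seq in reads:
        key = seq[:prefix_length].upper()
        if key in seen:
            continue  # same prefix as an earlier read: treated as a duplicate
        seen.add(key)
        unique.append((name, seq))
    return unique


# Toy reads, invented for illustration only.
base = "ACGT" * 15                          # 60 nt
reads = [
    ("r1", base),
    ("r2", base[:50] + "TTTTTTTTTT"),       # same first 50 nt as r1, different tail
    ("r3", "TTGCA" * 12),                   # different prefix
]
unique_names = [name for name, _ in dereplicate_by_prefix(reads)]
print(unique_names)  # prints ['r1', 'r3']
```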
QC Tutorial Introduction to exercise Hubert Denise
QC Tutorial
Today we’ll be investigating a dataset obtained from water sampled at varying depths in the Pacific Ocean: 25 m, 75 m, 125 m and 500 m.
First we will look at the “HOT_Station_ALOHA,_25m_depth” fastq sequence file using the FastQC software.
Then we will use the Trimmomatic package to perform quality and length trimming on this file.
Performing QC steps using Trimmomatic
All instructions are provided in the manual.
Trimmomatic is written in Java, but you only need basic Unix knowledge to run it.
Trimmomatic functions:
- removal of Illumina adapters from reads
- quality filtering
- length trimming
- conversion of quality score format
In this tutorial we will only perform quality and length filtering.
More details at
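To show how a quality and length filtering run might be put together, the sketch below builds a single-end Trimmomatic command line in Python without executing it. Several things here are assumptions rather than tutorial instructions: the jar path and file names are hypothetical placeholders, single-end (SE) mode is assumed, and the MINLEN value is a placeholder because the tutorial's length threshold is not stated here. The step names LEADING, TRAILING, SLIDINGWINDOW and MINLEN are Trimmomatic's own.

```python
import shlex


def build_trimmomatic_command(jar_path, input_fastq, output_fastq, min_length):
    """Return the argument list for a single-end Trimmomatic run."""
    return [
        "java", "-jar", jar_path,   # jar_path is a hypothetical location
        "SE", "-phred33",           # assumption: single-end reads, phred 33 qualities
        input_fastq, output_fastq,
        "LEADING:8",                # trim low-quality bases from the start
        "TRAILING:8",               # trim low-quality bases from the end
        "SLIDINGWINDOW:4:15",       # window of 4 bases, required average quality 15
        f"MINLEN:{min_length}",     # drop reads shorter than min_length (placeholder)
    ]


command = build_trimmomatic_command(
    "trimmomatic.jar", "reads.fastq", "reads_trimmed.fastq", min_length=50
)
print(shlex.join(command))
```

Building the command as a list keeps each step visible and lets it be passed to `subprocess.run` unchanged once the real paths are known.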
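Trimmomatic steps used in this tutorial
A - LEADING:8 TRAILING:8
The value 8 is the quality threshold (quality scores are phred 33): bases with a quality below it are removed from the start (LEADING) and from the end (TRAILING) of the read.
@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TCGGTTTTTCATCCAATTGAGTCGTCCCGTTGATAGTGAACTGGTACGTCATCGACTGCA...
+
… trimmed sequence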
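The effect of LEADING:8 TRAILING:8 can be mimicked with a short Python sketch: decode the phred 33 quality string, then strip bases from each end while their quality is below the threshold. The function names and the toy read are invented for illustration; this is not Trimmomatic's own code.

```python
def phred33_scores(quality_string):
    """Decode a phred 33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_leading_trailing(seq, quality_string, leading=8, trailing=8):
    """Strip bases below the thresholds from the start and end of a read."""
    scores = phred33_scores(quality_string)
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1
    return seq[start:end], quality_string[start:end]


# Toy read: '#' decodes to 2, '$' to 3 and 'I' to 40.
trimmed_seq, trimmed_qual = trim_leading_trailing("ACGTACGTAC", "##IIIIII#$")
print(trimmed_seq, trimmed_qual)  # prints GTACGT IIIIII
```

Only the ends are examined: a low-quality base in the middle of the read is left for other steps such as the sliding window.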
Trimmomatic steps used in this tutorial
B - SLIDINGWINDOW:4:15 (4: window size; 15: average quality)
@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TTTTTCATCCAATTGAGTCGTCCCGTTGATAG...CGTAGCGATTGTTACCCAGAGGA
+
Work in the 5’ to 3’ direction (the whole read is scanned):
sum: 57 avg:
sum: 58 avg:
sum = 141 avg = no trimming
etc …
avg ≥ 15: no trimming
sum = 59 avg < 15 => trimming, giving the final sequence
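The sliding window idea can be sketched as follows: scan the read from the 5’ end, average the quality over each window of 4 bases, and cut the read at the first window whose average falls below 15. This is a simplified illustration with a hypothetical function name and invented scores. Trimmomatic's exact cut position and its handling of reads shorter than the window may differ.

```python
def sliding_window_keep_length(scores, window_size=4, min_average=15):
    """Return how many bases to keep from the 5' end of a read.

    Simplification: the read is cut at the start of the first window
    whose average quality is below min_average.
    """
    for start in range(len(scores) - window_size + 1):
        window = scores[start:start + window_size]
        if sum(window) / window_size < min_average:
            return start
    return len(scores)  # no failing window: keep the whole read


# Invented quality scores that decay towards the 3' end.
scores = [30, 30, 30, 30, 20, 16, 12, 11, 5, 2]
keep = sliding_window_keep_length(scores)
print(keep)  # prints 4
```

In this toy read the first four windows average 30, 27.5, 24 and 19.5, so nothing is trimmed. The window starting at index 4 sums to 59 (average 14.75), which is below 15, so the read is cut there.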