Quality Control Hubert DENISE

Metagenomics data analysis
- Quality control
- Diversity analysis
- Functional analysis

Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:

QC rationale
Why?
- Garbage in, garbage out
- Base-call errors:
  - each base call has an associated quality score
  - each platform has its own specific errors
- Read quality decreases with read length
- NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.
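As an illustration of the per-base quality score mentioned above, here is a minimal Python sketch that decodes a FASTQ quality string and converts a Phred score Q into an error probability using the standard relation P = 10^(-Q/10). The function names are hypothetical, and the offset of 33 assumes phred 33 encoding.

```python
def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Assumes phred 33 encoding (ASCII code minus 33).
    """
    return [ord(char) - offset for char in qual]


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for score in decode_quality("!5I"):
        print(score, phred_to_error_prob(score))
```

A score of 20 therefore corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.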

EBI Metagenomics: QC step by step
- Clipping: low-quality ends trimmed and adapter sequences removed using the Biopython SeqIO package
- Quality filtering: sequences with > 10% undetermined nucleotides removed
- Read length filtering: short sequences are removed (100 nt threshold)
- Duplicate sequence removal: sequences clustered at 99% identity (UCLUST v for 454 and Qiime prefix clustering for Illumina) and a representative sequence chosen
- Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
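Some of the filtering steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the EBI Metagenomics pipeline: all function names are hypothetical, reads of exactly 100 nt are assumed to be kept, and exact-duplicate removal stands in for the 99% identity clustering done with UCLUST or Qiime. Clipping and repeat masking are not covered.

```python
def fraction_undetermined(seq):
    """Fraction of undetermined nucleotides (N) in a sequence."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def passes_filters(seq, max_n_fraction=0.10, min_length=100):
    """Keep a read that is long enough and has at most 10% Ns.

    Assumption: a read of exactly min_length nucleotides is kept.
    """
    return len(seq) >= min_length and fraction_undetermined(seq) <= max_n_fraction


def drop_exact_duplicates(seqs):
    """Simplified stand-in for 99% identity clustering.

    Only exact duplicates are removed; the first copy is kept.
    """
    seen = set()
    unique = []
    for seq in seqs:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique


def run_qc(seqs):
    """Apply the quality and length filters, then remove duplicates."""
    return drop_exact_duplicates([s for s in seqs if passes_filters(s)])
```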

EBI Metagenomics: QC consequences
[Figure: panels for Roche 454, Illumina and Ion Torrent]

MG-RAST QC
- dereplication (first 50 bp)
- model organism screening (bowtie)
- length filtering (> 75 bp)
- ambiguous base filtering (< 5 bp)
- dynamic base filtering (phred score)
- analysis

EBI Metagenomics QC
- duplicate sequence filtering (first 50 bp)
- repeat masking
- clipping (10%)
- quality filtering (phred score)
- read length filtering (> 100 bp)
- analysis

QC Tutorial Introduction to exercise Hubert Denise

QC Tutorial
Today we’ll be investigating a dataset obtained from water sampled at varying depths in the Pacific Ocean: 25 m, 75 m, 125 m and 500 m.
First we will look at the “HOT_Station_ALOHA,_25m_depth” FASTQ sequence file using the FastQC software.
Then we will use the Trimmomatic package to perform quality and length trimming on this file.
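To get a feel for what a per-position quality summary looks at, here is a minimal Python sketch that reads FASTQ records and computes the mean quality at each read position. It is not FastQC itself: the function names are hypothetical, and it assumes well-formed four-line records in phred 33 encoding.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes every record has exactly four lines: header, sequence, '+', quality.
    """
    iterator = iter(lines)
    for header in iterator:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(iterator).rstrip("\n")
        next(iterator)  # the '+' separator line
        qual = next(iterator).rstrip("\n")
        yield header[1:], seq, qual


def mean_quality_per_position(records, offset=33):
    """Mean Phred score at each read position across all records."""
    totals = []
    counts = []
    for _name, _seq, qual in records:
        for i, char in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]
```

With a real file, the records could be read with `open(path)` passed to `read_fastq`, where `path` is a placeholder for the uncompressed FASTQ file.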

Performing QC steps using Trimmomatic
All instructions are provided in the manual.
Trimmomatic is written in Java, but you only need basic Unix knowledge to run it.
Trimmomatic functions:
- removal of Illumina adapters from reads
- quality filtering
- length trimming
- conversion of quality score format
In this tutorial we will only perform quality and length filtering.
More details at

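Trimmomatic steps used in this tutorial
A - LEADING:8 TRAILING:8 (quality threshold of 8; quality scores in phred 33 encoding)

@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TCGGTTTTTCATCCAATTGAGTCGTCCCGTTGATAGTGAACTGGTACGTCATCGACTGCA...
+

Bases at the start (LEADING) and at the end (TRAILING) of the read whose quality score is below the threshold are removed, leaving the trimmed sequence.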
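The LEADING and TRAILING steps can be sketched in Python as follows. This is an illustration of the idea rather than Trimmomatic's own code: the function name is hypothetical, and the quality string is assumed to be phred 33 encoded.

```python
def trim_leading_trailing(seq, qual, leading=8, trailing=8, offset=33):
    """Remove low-quality bases from both ends of a read.

    Bases are dropped from the start while their score is below `leading`,
    then from the end while their score is below `trailing`.
    """
    scores = [ord(char) - offset for char in qual]

    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1

    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1

    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    # '#' is Q2, '!' is Q0, '?' is Q30 and 'I' is Q40 in phred 33 encoding.
    print(trim_leading_trailing("ACGTACGT", "##II?I#!"))
```

Low-quality bases in the middle of the read are left untouched by this step; only the two ends are examined.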

Trimmomatic steps used in this tutorial
B - SLIDINGWINDOW:4:15 (window size of 4; required average quality of 15)

@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TTTTTCATCCAATTGAGTCGTCCCGTTGATAG...CGTAGCGATTGTTACCCAGAGGA
+

The read is scanned in the 5' to 3' direction (the whole read is scanned). For each window of 4 bases, the quality scores are summed and averaged:
- avg ≥ 15: no trimming (for example, sum = 141)
- avg < 15: trimming (for example, sum = 59), which gives the final sequence

Other window sums shown on the slide: 57 and 58.
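The sliding-window idea can be sketched in Python as follows. This is a simplified version and not Trimmomatic's exact implementation: the function name is hypothetical, the read is cut at the start of the first window whose average falls below the threshold, and reads shorter than the window are assumed to be returned unchanged.

```python
def sliding_window_trim(seq, qual, window=4, min_avg=15, offset=33):
    """Cut a read where the windowed average quality first drops too low.

    Windows are examined in the 5' to 3' direction. Comparing the window sum
    with min_avg * window avoids floating-point division.
    """
    scores = [ord(char) - offset for char in qual]
    if len(scores) < window:
        return seq, qual  # assumption: too short to evaluate, keep as-is

    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) < min_avg * window:
            return seq[:start], qual[:start]

    return seq, qual


if __name__ == "__main__":
    # 'I' is Q40 and '+' is Q10 in phred 33 encoding.
    print(sliding_window_trim("ACGTACGTA", "IIIII++++"))
```

With a window of 4 and a threshold of 15, a window passes when its scores sum to at least 60, which is why a sum of 141 leads to no trimming and a sum of 59 leads to trimming.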
