Quality Control Hubert DENISE


Metagenomics data analysis
- Quality control
- Diversity analysis
- Functional analysis

Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:

QC rationale

Why?
- Garbage in, garbage out.
- Base call errors:
  - each base call has an associated quality score
  - each platform has its own specific errors
- Read quality decreases with read length.
- NGS generates duplicate reads (false and real). Reducing duplication shortens analysis time and prevents analysis bias.
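The quality score attached to each base call can be made concrete with a short sketch. It assumes the Phred+33 encoding mentioned later in the tutorial and the standard Phred definition Q = -10 log10(P); the function names and the example quality string are illustrative only.

```python
def phred33_to_scores(quality_string):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical quality string, not taken from the tutorial data
scores = phred33_to_scores("II5+#")
for q in scores:
    print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.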

EBI Metagenomics: QC step by step
- Clipping: low-quality ends are trimmed and adapter sequences removed using the Biopython SeqIO package.
- Quality filtering: sequences with > 10% undetermined nucleotides are removed.
- Read length filtering: short sequences are removed (100 nt threshold).
- Duplicate sequence removal: sequences are clustered at 99% identity (UCLUST v for 454 and Qiime prefix clustering for Illumina) and a representative sequence is chosen.
- Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed.
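As a rough illustration of the quality and length filters listed above, here is a minimal sketch in plain Python. It is not the EBI pipeline code (which is described as using Biopython). The function names are hypothetical, and treating a read of exactly 100 nt as passing is an assumption.

```python
def passes_quality_filter(seq, max_n_fraction=0.10):
    """Keep a read unless more than 10% of its bases are undetermined (N)."""
    if not seq:
        return False
    n_count = seq.upper().count("N")
    return n_count / len(seq) <= max_n_fraction


def passes_length_filter(seq, min_length=100):
    """Keep a read if it reaches the length threshold (inclusive: an assumption)."""
    return len(seq) >= min_length


def filter_reads(seqs):
    """Apply both filters and return the surviving sequences."""
    return [s for s in seqs if passes_quality_filter(s) and passes_length_filter(s)]


# Hypothetical reads: a clean one, one with 11% N, one that is too short
reads = ["ACGT" * 25, "N" * 11 + "A" * 89, "ACGT" * 10]
print(len(filter_reads(reads)))
```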

EBI Metagenomics: QC consequences

[Figure panels: Roche 454, Illumina, Ion Torrent]

MG-RAST QC
- dereplication (first 50 bp)
- model organism screening (bowtie)
- length filtering (> 75 bp)
- ambiguous base filtering (< 5 bp)
- dynamic base filtering (phred score)
- analysis

EBI Metagenomics QC
- duplicate sequence filtering (first 50 bp)
- repeat masking
- clipping (10%)
- quality filtering (phred score)
- read length filtering (> 100 bp)
- analysis
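Both pipelines compare the first 50 bp of each read when looking for duplicates. One simple way to picture this is the sketch below, which keeps the first read seen for each distinct 50 bp prefix. This is an assumed simplification for illustration: neither pipeline's actual code is shown in the slides, and the function name is hypothetical.

```python
def dereplicate_by_prefix(reads, prefix_length=50):
    """Keep the first read seen for each distinct prefix.

    reads is a list of (name, sequence) tuples.
    """
    seen = set()
    kept = []
    for name, seq in reads:
        key = seq[:prefix_length]
        if key in seen:
            continue  # same first 50 bp as an earlier read: treated as a duplicate
        seen.add(key)
        kept.append((name, seq))
    return kept


# Hypothetical reads: r1 and r2 share their first 50 bp and differ only afterwards
reads = [
    ("r1", "A" * 50 + "CCCC"),
    ("r2", "A" * 50 + "GGGG"),
    ("r3", "T" * 50 + "CCCC"),
]
print([name for name, _ in dereplicate_by_prefix(reads)])
```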

QC Tutorial: Introduction to exercise
Hubert Denise

QC Tutorial

Today we'll be investigating a dataset obtained from water sampled at varying depths in the Pacific Ocean: 25 m, 75 m, 125 m and 500 m.

First we will look at the “HOT_Station_ALOHA,_25m_depth” fastq sequence file using the software FASTQC.

Then we will use the Trimmomatic package to perform quality and length trimming on this file.

Performing QC steps using Trimmomatic

All instructions are provided in the manual. Trimmomatic is written in Java, but you only need basic Unix knowledge to run it.

Trimmomatic functions:
- removal of Illumina adapters from reads
- quality filtering
- length trimming
- conversion of quality score format

In this tutorial we will only perform quality and length filtering.

More details at
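For orientation, a Trimmomatic run of this kind might be assembled as in the sketch below. The step settings (LEADING:8, TRAILING:8, SLIDINGWINDOW:4:15) are the ones used in this tutorial. Everything else is an assumption for illustration: the jar path and file names are hypothetical, single-end mode (SE) is assumed, and the MINLEN value of 100 is borrowed from the EBI length threshold rather than stated for the exercise.

```python
def build_trimmomatic_command(input_fastq, output_fastq,
                              jar_path="trimmomatic.jar", min_length=100):
    """Build the argument list for a single-end Trimmomatic run.

    jar_path, the file names and min_length are placeholders, not values
    given in the tutorial.
    """
    return [
        "java", "-jar", jar_path,
        "SE", "-phred33",            # single-end mode, Phred+33 qualities
        input_fastq, output_fastq,
        "LEADING:8", "TRAILING:8",   # step A of the tutorial
        "SLIDINGWINDOW:4:15",        # step B of the tutorial
        f"MINLEN:{min_length}",      # length filtering (assumed value)
    ]


# Hypothetical file names
cmd = build_trimmomatic_command("sample_25m.fastq", "sample_25m.trimmed.fastq")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run` once the paths point at real files.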

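Trimmomatic steps used in this tutorial

A - LEADING:8 TRAILING:8

LEADING and TRAILING remove bases from the start and the end of the read while their quality score is below the threshold (here 8).

@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TCGGTTTTTCATCCAATTGAGTCGTCCCGTTGATAGTGAACTGGTACGTCATCGACTGCA...
+

[Figure labels: quality threshold; quality score (phred 33); …; trimmed sequence]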
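The LEADING and TRAILING steps can be sketched in a few lines of Python. This is an illustration of the idea, not Trimmomatic's own code. The function name and the example read are hypothetical, and the quality scores are assumed to be already decoded to integers.

```python
def trim_leading_trailing(seq, quals, leading=8, trailing=8):
    """Cut bases off both ends of a read while their quality is below the threshold."""
    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < trailing:
        end -= 1
    return seq[start:end], quals[start:end]


# Hypothetical read: two poor bases at each end
seq = "ACGTACGT"
quals = [2, 5, 30, 30, 30, 30, 7, 2]
print(trim_leading_trailing(seq, quals))
```

Only the ends are examined: a low-quality base in the middle of the read is left in place by this step.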

Trimmomatic steps used in this tutorial

B - SLIDINGWINDOW:4:15 (window size : average quality)

@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TTTTTCATCCAATTGAGTCGTCCCGTTGATAG...CGTAGCGATTGTTACCCAGAGGA
+

The window works in the 5’ to 3’ direction (the whole read is scanned):
- avg ≥ 15: no trimming
- avg < 15: trimming

[Figure: example windows (sum: 57; sum: 58; sum = 141, no trimming; etc. …; sum = 59, avg < 15, trimming) and the final sequence]
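The sliding-window rule can be sketched as follows. This is a simplified illustration rather than Trimmomatic's actual implementation: it cuts the read at the start of the first window whose average falls below the threshold, and it leaves reads shorter than the window untouched (both are assumptions). The function name and example read are hypothetical.

```python
def sliding_window_trim(seq, quals, window=4, min_avg=15):
    """Scan 5' to 3' and cut the read at the first window with a low average quality."""
    for start in range(len(quals) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_avg:
            # Simplification: drop everything from the start of the failing window
            return seq[:start], quals[:start]
    return seq, quals


# Hypothetical read whose quality collapses over the last four bases
seq = "ACGTACGTAC"
quals = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
print(sliding_window_trim(seq, quals))
```

In this example the window starting at the sixth base has an average of exactly 15 and passes; the next window averages 10, so the read is cut to its first six bases.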
