National Center for Genome Analysis Support: Carrie Ganote Ram Podicheti Le-Shin Wu Tom Doak Quality Control and Assessment.

Presentation transcript:

National Center for Genome Analysis Support: Quality Control and Assessment of RNA-Seq Data
Carrie Ganote, Ram Podicheti, Le-Shin Wu, Tom Doak

What do the data look like?

6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR length=76

FASTQ is a common format for storing Next Gen Sequencing data.
- Text based
- Stores both the sequence and the quality information
- Originally developed at the Wellcome Trust Sanger Institute and later adopted by Solexa (Bennett, 2004)
- Each read is described by 4 lines

Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4).

FASTQ Format

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+

Line 1: Sequence Identifier
- Begins with the '@' symbol
- Comprises: instrument name, flowcell, lane, tile, X and Y coordinates of the cluster on the tile, member of a pair (1 or 2), index

FASTQ Format

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+

Line 2: Read Sequence (A, G, T, C, N)

FASTQ Format

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+

Line 3: '+' character
- Can be followed by the same Sequence Identifier (from Line 1)

FASTQ Format

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+

Line 4: Base Quality Scores (Phred33) for the sequence in Line 2
- Must contain the same number of characters as the sequence
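The four-line layout described above can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the identifier fields that the slide does not name (run number, filter flag, control bits) are assumed to follow the usual Illumina layout. The quality string in the usage example is a placeholder, since the original quality line did not survive in this transcript.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@'
    sequence: str    # line 2
    separator: str   # line 3, starts with '+'
    quality: str     # line 4, one symbol per base


def parse_fastq_record(lines):
    """Validate four FASTQ lines and return them as a FastqRecord."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly 4 lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must begin with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must begin with '+'")
    if len(sequence) != len(quality):
        raise ValueError("line 4 must have as many characters as line 2")
    return FastqRecord(header[1:], sequence, separator, quality)


def parse_illumina_identifier(identifier):
    """Split an identifier shaped like the slide's example into named fields.

    Assumption: 'run', 'filtered' and 'control' are not named on the slide.
    """
    location, _, pair_info = identifier.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    member, filtered, control, index = pair_info.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "member_of_pair": int(member),
        "filtered": filtered,
        "control": control,
        "index": index,
    }


if __name__ == "__main__":
    sequence = "CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA"
    record = parse_fastq_record([
        "@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG",
        sequence,
        "+",
        "I" * len(sequence),  # placeholder quality symbols
    ])
    print(parse_illumina_identifier(record.identifier))
```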

Quality Scores
Sequencers can assign a "confidence" value to each call based on how ambiguous the base call is. The sequencer estimates the probability that a given base call is NOT correct (Ewing 1998).

Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194.

Quality Scores
The PHRED score is defined as q = -10 × log10(p), where p is the probability that the call is not correct (Ewing 1998). The estimated accuracy is 1 - p.
Table columns: p; -10 × log10(p); estimated accuracy = 1 - p.

Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194.
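As a small worked illustration of the definition q = -10 × log10(p), the sketch below converts between error probability and PHRED score and prints the estimated accuracy 1 - p for a few scores. The function names are hypothetical and chosen only for this example.

```python
import math


def phred_from_error_probability(p):
    """q = -10 * log10(p), where p is the probability the call is wrong."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


def error_probability_from_phred(q):
    """Invert the definition: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = error_probability_from_phred(q)
        print(f"Q{q}: p = {p:g}, estimated accuracy = {1 - p:.4%}")
```

A score of 20, the threshold mentioned for trimming, corresponds to an error probability of 1 in 100.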

Quality Score
Why not just have numbers?

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGT… …

Writing each score as a number takes up to two digits plus a separator per base. Quality symbols to the rescue: one symbol per base keeps the quality line the same length as the sequence.

Quality Score Encodings
Letters are represented deep down in the computer as numbers. The quality score plus a constant (usually 33 or 64) gives a number, which is converted to the quality symbol using ASCII.
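The encoding can be sketched in Python with the built-in `ord` and `chr`, which map between characters and their ASCII numbers. The function names and the default offset of 33 are choices made for this illustration; an offset of 64 is handled by passing `offset=64`.

```python
def quality_symbols_to_scores(quality, offset=33):
    """Decode a quality string: score = ASCII code of the symbol - offset."""
    scores = [ord(symbol) - offset for symbol in quality]
    if any(score < 0 for score in scores):
        raise ValueError("symbol below the offset; wrong encoding assumed?")
    return scores


def scores_to_quality_symbols(scores, offset=33):
    """Encode scores: symbol = character with ASCII code score + offset."""
    return "".join(chr(score + offset) for score in scores)


if __name__ == "__main__":
    print(quality_symbols_to_scores("I5!"))        # offset 33
    print(quality_symbols_to_scores("h", 64))      # offset 64
    print(scores_to_quality_symbols([40, 20, 0]))
```

The negative-score check is a cheap guard: decoding a Phred33 string with offset 64 usually produces scores below zero.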

ASCII Table

Quality Scores
FastQC is an excellent program for visualizing the overall quality of all reads in a FASTQ file. FastQC is developed by the Babraham Bioinformatics Group.

Trimming Based on Quality
Tactics for increasing overall quality: we want to cut away the low quality bases!

Trimming Based on Quality
Wholesale cutting by base position.

Trimming Based on Quality
Start from the ends of the read and cut away until the quality is above a specified threshold (usually 20).
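A minimal sketch of this end-trimming tactic, assuming the quality symbols have already been decoded to integer scores. The function name is hypothetical, and treating a score equal to the threshold as acceptable is an assumption of this example.

```python
def trim_low_quality_ends(sequence, scores, threshold=20):
    """Cut bases from both ends until a base with score >= threshold is met.

    Low-quality bases in the interior of the read are left untouched.
    """
    start = 0
    end = len(scores)
    while start < end and scores[start] < threshold:
        start += 1
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return sequence[start:end], scores[start:end]


if __name__ == "__main__":
    read = "ACGTACGT"
    quals = [3, 6, 22, 30, 12, 35, 36, 2]
    print(trim_low_quality_ends(read, quals))
```

In the usage example the interior base with score 12 is kept, because trimming stops at the first acceptable base from each end.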

Trimming Based on Quality
Start from one end and keep bases until they fall below a specified threshold.
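This variant can be sketched the same way. Unlike trimming inward from the ends, it discards everything after the first low-quality base, even if later bases are good. The function name is hypothetical, and scanning from the left (5') end is an assumption of this example.

```python
def keep_until_low_quality(sequence, scores, threshold=20):
    """Keep bases from the start of the read up to the first score < threshold."""
    keep = 0
    for score in scores:
        if score < threshold:
            break
        keep += 1
    return sequence[:keep], scores[:keep]


if __name__ == "__main__":
    read = "ACGTACGT"
    quals = [36, 30, 22, 12, 35, 36, 2, 30]
    print(keep_until_low_quality(read, quals))
```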

Trimming Based on Quality
Sliding windows and minimum vs. average quality scores.
Each window is summarized by its average, minimum, and maximum quality.
Example settings: Window Size = 6, Step Size = 5. Target: average below 20.
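The sliding-window idea can be sketched as follows, using the slide's example settings (window size 6, step size 5, threshold 20) as defaults. This is an illustration under stated assumptions, not the algorithm of any particular trimming tool: the read is cut at the start of the first failing window, and bases beyond the last full window are not evaluated. The `summary` parameter is a hypothetical switch between the average and the minimum score of a window.

```python
def sliding_window_trim(sequence, scores, window_size=6, step_size=5,
                        threshold=20, summary="average"):
    """Cut the read at the first window whose summary score is below threshold."""
    if summary not in ("average", "minimum"):
        raise ValueError("summary must be 'average' or 'minimum'")
    last_start = len(scores) - window_size
    for start in range(0, max(last_start + 1, 0), step_size):
        window = scores[start:start + window_size]
        if summary == "minimum":
            value = min(window)
        else:
            value = sum(window) / len(window)
        if value < threshold:
            # Assumption: drop the failing window and everything after it.
            return sequence[:start], scores[:start]
    return sequence, scores


if __name__ == "__main__":
    read = "ACGTACGTACGTACGT"
    quals = [30] * 10 + [10] * 6
    print(sliding_window_trim(read, quals, summary="average")[0])
    print(sliding_window_trim(read, quals, summary="minimum")[0])
```

Using the minimum is stricter than using the average: a single poor base fails a window under the minimum rule, while the average rule tolerates it as long as its neighbors are good.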

Trimming Based on Quality
Mate pairs, orphans and minimum sequence length.
When the right read is too short to keep but the left read survives trimming, the left read becomes an orphan.
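How a trimmed pair might be sorted can be sketched as below. The labels and the default minimum length of 30 are hypothetical values picked for this example; the source does not state a specific cutoff.

```python
def classify_pair(left, right, min_length=30):
    """Decide what happens to a read pair after trimming.

    Returns 'paired' when both mates are long enough, 'left_orphan' or
    'right_orphan' when only one mate survives, and 'discarded' otherwise.
    """
    left_ok = len(left) >= min_length
    right_ok = len(right) >= min_length
    if left_ok and right_ok:
        return "paired"
    if left_ok:
        return "left_orphan"
    if right_ok:
        return "right_orphan"
    return "discarded"


if __name__ == "__main__":
    left_read = "A" * 70   # left read survives trimming
    right_read = "C" * 12  # right read too short to keep
    print(classify_pair(left_read, right_read))
```

Keeping orphans in a separate output preserves their sequence while keeping the paired files in step, which matters for tools that expect mates in matching order.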

Trimming Software
- Trimmomatic
- Trim Galore! (developed by the Babraham Bioinformatics Group)
- FASTX Toolkit
- Galaxy trimming tools

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol Aug 25;11(8):R86.
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology Jan; Chapter 19.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research Oct; 15(10).

Kmers
What's a Kmer? For a given sequence and a number, K, how many subsequences of length K are there?

Kmers
Why? (K = 5)
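To make the counting question concrete: a sequence of length L contains L - K + 1 overlapping subsequences of length K. The sketch below lists and counts them with K = 5; the function names are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """Return every overlapping subsequence of length k, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(sequence, k):
    """Count how often each distinct k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))


if __name__ == "__main__":
    read = "CGTTCAGT"
    print(kmers(read, 5))
    print(len(kmers(read, 5)) == len(read) - 5 + 1)
```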

Primers and Adapters
When fragments are shorter than the total length of the read, adapters will be sequenced on both mates of a paired-end read. For example, if we use technology that can sequence up to 100 bp, any fragment shorter than 100 bp is read through into the adapter.

Primers and Adapters
When to suspect this: patterns toward the ends of reads.
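A minimal sketch of removing adapter read-through from the 3' end of a read. It uses exact string matching only, whereas dedicated adapter trimmers also tolerate mismatches, so it should be read as an illustration of the idea rather than a substitute for those tools. The function name, the `min_overlap` parameter, and the adapter string in the usage example are assumptions of this example.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an adapter (or a prefix of it) from the 3' end of a read.

    A full copy of the adapter anywhere in the read cuts the read there.
    Otherwise the longest adapter prefix of at least min_overlap bases
    that ends the read is removed.
    """
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter supplied by the caller
    insert = "ACGTACGTAC"
    print(trim_adapter(insert + adapter, adapter))
    print(trim_adapter(insert + adapter[:5], adapter))
```

The `min_overlap` guard avoids cutting reads that merely happen to end in the first one or two bases of the adapter.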

Primers and Adapters
Software for removing adapters:
- Cutadapt
- FASTX-Toolkit
- Scythe

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. 2011, 17(1).

Poly-A Tails and Other Artifacts
Library prep: retained and sequenced poly-As/poly-Ts.
When to suspect this:

Poly-A Tails and Other Artifacts
PRINSEQ (Schmieder 2011) can be used for trimming poly-Ts: it measures the percentage of each read that consists of T's and sorts out the reads that exceed a cutoff.
- Conservatively: 60% of a read is T? Kick it out.
- Filter on % base, sequence complexity, duplicates.

Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27.
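The percentage-based filter can be illustrated in a few lines. This is not PRINSEQ's implementation, only a sketch of the rule of thumb above; the function names are hypothetical, and rejecting a read at exactly 60% is an assumption of this example.

```python
def base_fraction(sequence, base):
    """Fraction of the read made up of the given base (case-insensitive)."""
    if not sequence:
        return 0.0
    return sequence.upper().count(base.upper()) / len(sequence)


def passes_base_filter(sequence, base="T", max_fraction=0.6):
    """Keep a read only if less than max_fraction of it is the given base."""
    return base_fraction(sequence, base) < max_fraction


if __name__ == "__main__":
    reads = ["TTTTTTTTACGT", "ACGTACGTACGT", "TTTTTTACGA"]
    kept = [read for read in reads if passes_base_filter(read)]
    print(kept)
```

Passing `base="A"` applies the same rule to poly-A stretches.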

Conservative QC vs Aggressive QC: factors
How much sequence one can afford to cut out depends on the following:
- Coverage: if your sample was sequenced at very low coverage, you may not want to cut aggressively.
- Sequence length: you can afford to cut 20 bp out of a 150 bp read, but not out of a 30 bp read.
- Goals: depending on your end goal, cut more or less aggressively.

References
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4).
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology Jan; Chapter 19.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research Oct; 15(10).
Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol Aug 25;11(8):R86.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. 2011, 17(1).
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27.

Fin
Thanks for watching!
Questions and comments: