Data Formats & QC Analysis for NGS. Rosana O. Babu, 8/19/2015.

Presentation transcript:


Sequence Formats
• Most sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence (binary formats such as SFF and BAM are the exceptions)
• Formats are designed to hold sequence data and other information about the sequence

Why so many formats?
• Supply the required information for each step of analysis
• Efficient data management: moving data across the file system takes time
• Data formats vary in the information they contain
Five types of sequence file formats:
• Raw sequence files
• Co-ordinate files
• Parameter files
• Annotation files
• Metadata files

Sequencers & Sequence Analysis Packages

Read output formats
• 454
• Solexa/Illumina
• SOLiD

454 output formats
• .sff
• .fna
• .qual

Illumina output formats
• .seq.txt
• .prb.txt
• Illumina FASTQ (ASCII - 64 is Illumina score)
• Qseq (ASCII - 64 is Phred score)
• Illumina single-line format (SCARF)
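The "ASCII - 64" note above can be illustrated with a minimal sketch. It assumes that each character of a quality string encodes one score as its ASCII code minus an offset: 64 for the older Illumina formats and 33 for Sanger-style FASTQ. The function name `decode_quality` and the sample strings are invented for illustration.

```python
def decode_quality(quality_string, offset=64):
    """Convert a quality string to a list of integer scores.

    offset=64 matches the older Illumina encodings; offset=33 matches
    Sanger-style FASTQ. Both values are passed in explicitly here.
    """
    return [ord(ch) - offset for ch in quality_string]


# Hypothetical quality strings that encode the same scores
# under the two different offsets.
illumina_scores = decode_quality("hhhdbB", offset=64)
sanger_scores = decode_quality("IIIEC#", offset=33)

print(illumina_scores)
print(sanger_scores)
```

Decoding with the wrong offset shifts every score by 31, which is why the encoding should be checked before any quality-based filtering.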

SOLiD output format(s)
• CSFASTA

If reads should be deposited in a public repository:
• SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI
• ENA at EMBL-EBI

Alignment/Assembly Format
Common ("standard") format for read alignments:
• SAM
• BAM (= binary SAM)
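As a rough illustration of what a SAM alignment record holds, the sketch below splits one tab-separated line into the eleven mandatory SAM fields. The `SamRecord` type, the helper names, and the example read are hypothetical; real pipelines would normally use a dedicated library rather than hand-written parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)


def parse_sam_line(line):
    """Parse one alignment line (not a '@' header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]),
        int(fields[4]), fields[5], fields[6], int(fields[7]),
        int(fields[8]), fields[9], fields[10],
    )


def is_unmapped(record):
    """Flag bit 0x4 marks a read as unmapped."""
    return bool(record.flag & 0x4)


# Hypothetical alignment line for illustration only.
example = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.cigar, is_unmapped(rec))
```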

Formats for Genome/Gene annotation
• BED format (genome-browser tracks)
• GFF format (gene/genome features)
• BioXSD (XML) (any annotation; under development)

Deposit genome/metagenome in a public repository:
• INSDC (International Nucleotide Sequence Database Collaboration) databases: GenBank, EMBL, DDBJ
Deposit genome/metagenome metadata:
• MIGS/MIMS standard by GSC (Genomic Standards Consortium)

MIGS: Minimum Information about a Genome Sequence
MIMS: Minimum Information about a Metagenome Sequence/Sample

Points to remember on Data Formats
• Use the raw sequencing data format when possible
• For base-call data, use "standard" FASTQ (Sanger, Phred)
• For read alignments, use SAM/BAM format
• For annotation results, use a standard annotation format (e.g. GFF or BED)

QC analysis

Need for QC & Preprocessing
QC analysis of sequence data is extremely important for meaningful downstream analysis:
• To detect problems in the quality scores and statistics of the sequencing data
• To check whether further analysis of the sequences is possible
• To remove redundancy (filtering)
• To remove low-quality reads from the analysis
Highly efficient and fast processing tools are required to handle large volumes of data.
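The two filtering goals listed above (removing redundancy and removing low-quality reads) can be sketched as follows. This is a minimal illustration, not how any particular tool works: it assumes reads are held in memory as `(id, sequence, quality)` tuples, that qualities use the Phred+33 encoding, and that a mean score of 20 is the cutoff. All names and thresholds are assumptions.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of one read (Phred+33 assumed by default)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(reads, min_mean_quality=20.0, offset=33):
    """Drop low-quality reads, then drop exact duplicate sequences.

    The first occurrence of each sequence is kept; later identical
    sequences are treated as redundant.
    """
    seen = set()
    kept = []
    for read_id, seq, qual in reads:
        if mean_quality(qual, offset) < min_mean_quality:
            continue  # low-quality read
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append((read_id, seq, qual))
    return kept


# Hypothetical reads for illustration only.
reads = [
    ("r1", "ACGT", "IIII"),  # mean score 40
    ("r2", "ACGT", "IIII"),  # duplicate of r1
    ("r3", "GGCC", "####"),  # mean score 2
    ("r4", "TTAA", "5555"),  # mean score 20, exactly at the cutoff
]
kept = filter_reads(reads)
print([read_id for read_id, _, _ in kept])
```

Holding every kept sequence in a set is only practical for small inputs; for large datasets the dedicated tools named in this deck are the better choice.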

FastQC and FASTX-Toolkit
• Use FastQC for preliminary analysis
• Use the FASTX-Toolkit to optimize different datasets, and visualize the results with FastQC

FastQC output
• Basic statistics
• Quality per base position
• Per sequence quality distribution
• Nucleotide content per position
• Per sequence GC distribution
• Per base GC distribution
• Per base N content
• Length distribution
• Overrepresented/duplicated sequences
• K-mer content

FastQC (Box-Whisker plot)
• Y axis: Quality score
• X axis: Base position
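The numbers behind such a box-whisker plot can be approximated with a short sketch: for each base position, collect the scores of all reads and summarize them as minimum, quartiles, and maximum. This is an illustration under stated assumptions (Phred+33 qualities, inclusive quartile method, hypothetical function name), not a description of how FastQC computes its plot.

```python
from statistics import quantiles


def per_position_summary(quality_strings, offset=33):
    """Return one (min, q1, median, q3, max) tuple per base position."""
    columns = {}
    for qual in quality_strings:
        for position, ch in enumerate(qual):
            columns.setdefault(position, []).append(ord(ch) - offset)

    summary = []
    for position in sorted(columns):
        scores = sorted(columns[position])
        if len(scores) >= 2:
            q1, median, q3 = quantiles(scores, n=4, method="inclusive")
        else:
            # A single observation is its own quartile.
            q1 = median = q3 = float(scores[0])
        summary.append((scores[0], q1, median, q3, scores[-1]))
    return summary


# Hypothetical quality strings: position 0 has scores 40, 40, 38, 30;
# position 1 has score 40 in every read.
quals = ["II", "II", "GI", "?I"]
for position, box in enumerate(per_position_summary(quals), start=1):
    print(position, box)
```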

1. Basic Statistics
Contains information about:
• File type
• ASCII encoding of the quality values
• Total sequences, filtered sequences
• Sequence length
• Percentage GC content
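A few of these statistics can be computed directly from a FASTQ file, as the sketch below shows. It assumes simple four-line FASTQ records (no wrapped sequence lines); the function names and the in-memory example are hypothetical, and this is not FastQC's own implementation.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def basic_statistics(records):
    """Total sequences, sequence length range, and percentage GC."""
    total = 0
    total_bases = 0
    gc_bases = 0
    min_len = None
    max_len = 0
    for _, seq, _ in records:
        total += 1
        total_bases += len(seq)
        gc_bases += sum(1 for base in seq.upper() if base in "GC")
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = max(max_len, len(seq))
    percent_gc = 100.0 * gc_bases / total_bases if total_bases else 0.0
    return {
        "total_sequences": total,
        "min_length": min_len or 0,
        "max_length": max_len,
        "percent_gc": percent_gc,
    }


# Hypothetical two-read FASTQ content for illustration only.
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\nIIIIII\n"
stats = basic_statistics(read_fastq(io.StringIO(fastq_text)))
print(stats)
```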

2. Quality per base position

3. Per Sequence Quality Distribution

4. Nucleotide content per position

5. Per sequence GC distribution

6. Per base GC distribution

7. Per base N content

8. Length Distribution

9. K-mer content

10. Overrepresented/duplicate sequences
A high level of duplicate sequences usually points to a problem with sequencing or library preparation.

FASTX Toolkit
• fastx_quality_stats.txt
• fastq_quality_boxplot_graph.png
• fastx_nucleotide_distribution.png
• QC report.txt

QC Report
• Sequence Statistics
  Total No. of Sequences
  Avg. Sequence Length: 54
  Max Sequence Length: 54
  Min Sequence Length: 54
  Total Sequence Length
  Total N bases
  % N bases
  No. of Sequences with Ns
  % Sequences with Ns
• Quality Statistics
  Total HQ bases
  % HQ bases: 88.78
  Total HQ reads
  % HQ reads
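The fields of such a report can be derived from the reads themselves. The sketch below is one possible way to do so, under assumptions the slides do not state: qualities are Phred+33, a base counts as high quality (HQ) at score 20 or above, and a read counts as HQ when at least 70% of its bases are HQ. The function name and all thresholds are hypothetical.

```python
def qc_report(reads, offset=33, hq_score=20, hq_read_fraction=0.7):
    """Summarize N content and high-quality (HQ) bases and reads.

    `reads` is a list of (sequence, quality) pairs. The HQ thresholds
    are illustrative defaults, not values taken from any tool.
    """
    def percent(part, whole):
        return round(100.0 * part / whole, 2) if whole else 0.0

    total_reads = len(reads)
    total_bases = sum(len(seq) for seq, _ in reads)
    n_bases = sum(seq.upper().count("N") for seq, _ in reads)
    reads_with_n = sum(1 for seq, _ in reads if "N" in seq.upper())

    hq_bases = 0
    hq_reads = 0
    for _, qual in reads:
        hq_in_read = sum(1 for ch in qual if ord(ch) - offset >= hq_score)
        hq_bases += hq_in_read
        if qual and hq_in_read / len(qual) >= hq_read_fraction:
            hq_reads += 1

    return {
        "total_sequences": total_reads,
        "total_sequence_length": total_bases,
        "total_n_bases": n_bases,
        "percent_n_bases": percent(n_bases, total_bases),
        "sequences_with_ns": reads_with_n,
        "percent_sequences_with_ns": percent(reads_with_n, total_reads),
        "total_hq_bases": hq_bases,
        "percent_hq_bases": percent(hq_bases, total_bases),
        "total_hq_reads": hq_reads,
        "percent_hq_reads": percent(hq_reads, total_reads),
    }


# Hypothetical reads for illustration only.
example_reads = [("ACGT", "IIII"), ("ACNT", "II#I"), ("NNNN", "####")]
report = qc_report(example_reads)
for field, value in report.items():
    print(field, value)
```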

quality_boxplot_graph & nucleotide_distribution

Thank you