Quick introduction to genomic file types
Preliminary quality control (lab)


File types overview
– Text files: Fasta/fasta qual, Fastq, SAM
– Binary files: BAM, sff
– …

Fasta
Most basic file format to represent nucleotide or amino-acid sequences
Each sequence is represented by:
– A single description line (shouldn’t exceed 80 characters):
    Starts with “>”
    Followed by the sequence ID, and a space, then
    More information (description)
– The sequence, over one or several lines (the number of characters per line is generally 70 or 80, but it doesn’t matter)
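The layout described above can be read with a few lines of code. The following is a minimal sketch, not a replacement for an established parser; the function name `read_fasta` and the toy records are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (seq_id, description, sequence) for each record in a FASTA stream."""
    seq_id, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # The ID runs up to the first space; the rest is free-text description.
            seq_id, _, description = line[1:].partition(" ")
            chunks = []
        elif seq_id is not None:
            # Sequence lines can have any width, so they are simply concatenated.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


# Toy input (hypothetical records, not taken from the slides).
example = io.StringIO(">seq1 first toy record\nACGT\nACGT\n>seq2\nGGCC\n")
records = list(read_fasta(example))
for seq_id, description, sequence in records:
    print(seq_id, len(sequence), description)
```

Because sequence lines are concatenated, the line width (70, 80 or anything else) has no effect on the result.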

Qual (aka fasta qual)
Fasta-like quality format
Always paired with a fasta file (sequences with same ids, same order)
Description line as in fasta format
Qualities: a number for each base in the corresponding fasta, separated by spaces
Can be gzip-ped and used as such by some programs
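As a rough illustration of the pairing rule, the sketch below reads a qual stream and checks that each record has one score per base of the matching sequence. The names `read_qual` and `check_pairing`, and the toy data, are assumptions made for this example only.

```python
import io


def read_qual(handle):
    """Yield (seq_id, [scores]) for each record in a fasta-like qual stream."""
    seq_id, scores = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, scores
            seq_id = line[1:].partition(" ")[0]
            scores = []
        elif seq_id is not None:
            # Qualities are whitespace-separated integers, possibly over several lines.
            scores.extend(int(token) for token in line.split())
    if seq_id is not None:
        yield seq_id, scores


def check_pairing(sequences, qualities):
    """Check same ids, same order, and one score per base.

    Both arguments are lists of (seq_id, value) pairs.
    """
    if len(sequences) != len(qualities):
        return False
    for (sid, seq), (qid, scores) in zip(sequences, qualities):
        if sid != qid or len(seq) != len(scores):
            return False
    return True


# Toy data (hypothetical).
sequences = [("seq1", "ACGT"), ("seq2", "GG")]
qualities = list(read_qual(io.StringIO(">seq1\n40 40 30\n20\n>seq2\n10 12\n")))
print(check_pairing(sequences, qualities))
```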

Quality - Phred scores
Most common representation of qualities
Related to the probability of errors (P) in a particular base
Table: Phred score vs. probability of error …
Solexa pipeline versions before 1.3 use a different calculation:
– Equivalent for high quality
– Different for low quality (negative values of Q allowed)
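The relation between score and error probability can be written out explicitly. The sketch below uses the standard definitions, Q = -10 log10(P) for Phred and Q = -10 log10(P / (1 - P)) for the older Solexa scale; the function names are chosen for this example.

```python
import math


def phred_to_error(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability P."""
    return -10 * math.log10(p)


def error_to_solexa(p):
    """Solexa score, which uses the odds P / (1 - P) instead of P."""
    return -10 * math.log10(p / (1 - p))


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


for q in (10, 20, 30, 40):
    print(q, phred_to_error(q))

# The two scales nearly coincide for high quality and diverge for low quality;
# Solexa scores become negative when P is above 0.5.
print(solexa_to_phred(40), solexa_to_phred(-5), error_to_solexa(0.75))
```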

FastQ
A more compact format to store sequence and qualities
Normally on 4 lines:
– “@” followed by the sequence ID
– Sequence
– “+”
– The quality score
Quality score:
– ASCII encoding of phred scores
– Sanger has one scale, Illumina has three different ones (…)
Can be gzip-ped and used as such by some programs
Example taken from Wikipedia:
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
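A minimal sketch of reading the four-line layout and decoding the quality string follows. It assumes the Sanger convention (ASCII code minus 33) and records that are not wrapped over extra lines; `read_fastq`, `decode_quality` and the toy record are illustrative names and data.

```python
import io


def read_fastq(handle):
    """Yield (seq_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip().partition(" ")[0], sequence, quality


def decode_quality(quality, offset=33):
    """Turn a quality string into integer scores (offset 33 is the Sanger convention)."""
    return [ord(char) - offset for char in quality]


# Toy record (hypothetical).
toy = io.StringIO("@read1 toy\nACGT\n+\n!5II\n")
name, sequence, quality = next(read_fastq(toy))
scores = decode_quality(quality)
print(name, sequence, scores)
```

For data on one of the older Illumina scales, the same function would be called with `offset=64`.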

FastQ – quality values
Solexa picked different quality definitions and ranges over time, all different from Sanger values
Ask your sequence provider!
Guessing by getting the range of all values in all/many reads (not foolproof)
[Chart: the printable ASCII characters, starting at “!”, with the span used by each encoding marked S, X, I or J]
S - Sanger: Phred+33, raw reads typically (0, 40)
X - Solexa: Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, raw reads typically (3, 40)
Example taken from Wikipedia
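The range-based guess can be sketched as follows. The offsets and typical score ranges are the ones listed on this slide; the function name `compatible_encodings` and the toy quality strings are assumptions for the example. It returns every encoding whose typical range contains all observed characters, which makes the "not foolproof" caveat visible.

```python
# (offset, lowest typical score, highest typical score), as listed on the slide.
ENCODINGS = {
    "Sanger (Phred+33)": (33, 0, 40),
    "Solexa (Solexa+64)": (64, -5, 40),
    "Illumina 1.3+ (Phred+64)": (64, 0, 40),
    "Illumina 1.5+ (Phred+64)": (64, 3, 40),
}


def compatible_encodings(quality_strings):
    """Return the encodings whose typical ASCII range covers every observed character."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (offset, q_min, q_max) in ENCODINGS.items()
        if offset + q_min <= lowest and highest <= offset + q_max
    ]


print(compatible_encodings(["!5II"]))   # contains "!" (ASCII 33)
print(compatible_encodings(["hhhh;"]))  # contains ";" (ASCII 59) and "h" (ASCII 104)
print(compatible_encodings(["FFFF"]))   # a narrow range of values is ambiguous
```

A small or uniformly high-quality sample, like the last call, is compatible with several encodings at once, so the guess is best made over many reads and confirmed with the sequence provider.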

SAM/BAM
SAM (Sequence Alignment/Map) format represents the alignment of sequences (e.g. reads) to a reference sequence (e.g. genome)
– Simple to read and parse (text, tab-delimited)
– Flexible (possibility to add custom fields)
– Compact in file size
– Can store paired-end information
Reference document:
BAM is a binary (=indexable, more compact) representation of SAM

SAM/BAM (cont.)
Structure: two sections:
– Header: lines starting with “@” and two letters, then several key:value pairs. The keys are again two letters. Contains information about the reference sequence (SQ), the libraries used (“read groups”, RG), etc.
– Sequences: one line for each read, with the following fields (among others):
    Query (pair) name
    Reference name
    Position
    Mapping quality
    CIGAR string
    Seq and quality
    Tag:type:value fields
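Both sections are tab-delimited text, so they can be split with plain string operations. The sketch below assumes the eleven mandatory columns of the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional tag:type:value fields; the function names and the toy lines are made up for illustration.

```python
def parse_header_line(line):
    """Split a header line such as '@SQ<TAB>SN:chr1<TAB>LN:1000' into (type, {key: value})."""
    fields = line.rstrip("\r\n").split("\t")
    record_type = fields[0][1:]  # drop the leading "@"
    if record_type == "CO":
        # Comment lines hold free text rather than key:value pairs.
        return record_type, {"text": "\t".join(fields[1:])}
    return record_type, dict(field.split(":", 1) for field in fields[1:])


def parse_alignment_line(line):
    """Split one alignment line into a dict of named fields."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 11:
        raise ValueError("an alignment line needs at least 11 columns")
    read = {
        "qname": cols[0], "flag": int(cols[1]), "rname": cols[2],
        "pos": int(cols[3]), "mapq": int(cols[4]), "cigar": cols[5],
        "rnext": cols[6], "pnext": int(cols[7]), "tlen": int(cols[8]),
        "seq": cols[9], "qual": cols[10], "tags": {},
    }
    for field in cols[11:]:
        tag, value_type, value = field.split(":", 2)
        # Only integer tags are converted here; other types stay as strings.
        read["tags"][tag] = int(value) if value_type == "i" else value
    return read


# Toy lines (hypothetical).
header = parse_header_line("@SQ\tSN:chr1\tLN:1000")
read = parse_alignment_line("read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0")
print(header)
print(read["rname"], read["pos"], read["mapq"], read["cigar"], read["tags"])
```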

sff
Binary format provided by 454
Contains:
– A header with information on the run (name, key sequence, number of reads, etc.)
– For each read:
    Name, length of the read
    Clipping information (quality and adaptor)
    Numeric representation of the flowgrams (454 equivalent to chromatograms)
    Base sequence called from flowgrams
    Qualities

Genome assembly lingo
Read: segment of DNA (~ nt) read by a sequencer
Mate-pair, paired ends: pair of reads whose distance from each other within the genome is approximately known
Contig: contiguous segment of DNA reconstructed (unambiguously) from a set of reads
Scaffold: group of contigs that can be ordered and oriented with respect to each other (usually with the help of mate-pair data)
N50 (N90): 50% (90%) of the nucleotides are included in contigs this size or larger. The higher the better.
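The N50 and N90 definitions translate directly into a short calculation: sort the contigs from longest to shortest and accumulate lengths until the requested fraction of all nucleotides is covered. This is a sketch with a hypothetical function name (`nxx`) and made-up contig lengths.

```python
def nxx(contig_lengths, fraction=0.5):
    """Return the length L such that contigs of length >= L hold `fraction` of all bases."""
    threshold = sum(contig_lengths) * fraction
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    raise ValueError("no contigs given")


# Toy assembly (hypothetical contig lengths, 290 nt in total).
lengths = [80, 70, 50, 40, 30, 20]
n50 = nxx(lengths, 0.5)  # 80 + 70 = 150 covers at least half of 290
n90 = nxx(lengths, 0.9)  # 80 + 70 + 50 + 40 + 30 = 270 covers at least 90% of 290
print(n50, n90)
```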

Exercise: preliminary quality control of raw sequences
– number of sequences, length, average, distribution
– fasta/fastx conversion
– fastx statistics
– fasta quality chart/boxplot
– nucleotide distribution
– clipping/trimming reads
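Some of these checks can be prototyped in plain Python before turning to dedicated tools. The sketch below covers length statistics, per-position nucleotide counts and a simple 3' quality trim; the function names, the toy reads and the quality threshold of 20 are assumptions for illustration, not part of the exercise.

```python
from collections import Counter
from statistics import mean


def length_summary(sequences):
    """Number of sequences plus min, max, mean and distribution of their lengths."""
    lengths = [len(seq) for seq in sequences]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "distribution": Counter(lengths),
    }


def nucleotide_distribution(sequences):
    """One Counter per read position, counting the bases seen at that position."""
    columns = []
    for seq in sequences:
        for position, base in enumerate(seq):
            if position == len(columns):
                columns.append(Counter())
            columns[position][base] += 1
    return columns


def trim_3prime(sequence, scores, min_quality=20):
    """Remove trailing bases whose score is below min_quality (assumed threshold)."""
    end = len(sequence)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], scores[:end]


# Toy reads (hypothetical).
reads = ["ACGT", "ACG", "ACGT"]
summary = length_summary(reads)
columns = nucleotide_distribution(["ACGT", "AGG"])
trimmed = trim_3prime("ACGTAC", [30, 30, 25, 20, 10, 5])
print(summary["count"], summary["min"], summary["max"], summary["mean"])
print(columns[1], trimmed)
```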