Maximize read usage through mapping strategies

Key concepts of this session
- The settings you use MATTER.
- More reads mapping is not ALWAYS better.
- Sometimes you need to understand what you threw away.
- Mapping is yet another quality control step.

What are we doing with mapping?
- How do we align the bag of reads to the reference?
- How do we do it efficiently (memory and time)?
- How do we account for inexact matches and ambiguity?
- Traditional sequence alignment tools (BLAT, BLAST) are too slow for millions of short reads.

Short read mapping
Input:
- A reference genome
- A collection of many 25-250 bp tags/reads
- User-specified parameters
Output:
- One or more genomic coordinates for each tag
In practice, not all reads successfully map to the reference genome. Why?

What makes it hard?
- Inexact matches
- Multiple mapping
- ?

Indexing
The key to speeding up the matching of short reads is to tightly INDEX the genome (> 2000X faster than BLAST).
There are multiple strategies for building indices:
- Suffix tree
- Suffix array
- Hashing
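As an illustration of the hashing strategy, here is a minimal sketch of a k-mer hash index with a seed-and-verify lookup. The function names (`build_kmer_index`, `map_read`), the toy genome and the mismatch threshold are assumptions made for this example; real mappers use far more compact and elaborate schemes.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k, max_mismatches=1):
    """Seed with the read's first k-mer, then verify the whole read at each hit."""
    hits = []
    for pos in index.get(read[:k], []):
        window = genome[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the genome
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy data, invented for this example.
genome = "ACGTACGTTAGCCGATAGCTAGGCTA"
index = build_kmer_index(genome, k=4)

exact_hits = map_read("TAGCCGAT", genome, index, k=4)    # exact match
inexact_hits = map_read("TAGCCGTT", genome, index, k=4)  # one mismatch
print(exact_hits, inexact_hits)
```

The index is built once per genome, so each read costs one dictionary lookup plus a short verification instead of a scan of the whole reference. Seeding only on the first k-mer means a mismatch inside the seed is missed, which is one reason real tools try several seeds per read.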

Bowtie
- Indexes the reference genome using a scheme based on the Burrows-Wheeler transform (BWT), which is very space efficient.
- Uses a quality-aware backtracking algorithm that allows mismatches and favors high-quality alignments.
- Uses 'double indexing', a strategy to avoid excessive backtracking.
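To make the BWT idea concrete, the following is a small sketch of a BWT-based index with exact-match backward search. It is not Bowtie's implementation: it covers exact matching only, omits the quality-aware backtracking and double indexing, and builds the suffix array naively. All function names are assumptions for this example.

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm_index(genome):
    """Build the BWT plus the two lookup tables needed for backward search."""
    text = genome + "$"  # "$" marks the end and sorts before A, C, G, T
    sa = suffix_array(text)
    # BWT = character preceding each sorted suffix (wraps around for suffix 0).
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for c in text:
        counts[c] = counts.get(c, 0) + 1

    # first[c] = number of characters in the text that sort before c.
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, bwt, first, occ


def backward_search(pattern, sa, first, occ):
    """Return sorted genome positions where the pattern occurs exactly."""
    lo, hi = 0, len(sa)
    for c in reversed(pattern):  # extend the match one base at a time, right to left
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, bwt, first, occ = build_fm_index("ACGTACGTTAGC")
print(backward_search("ACG", sa, first, occ))
print(backward_search("TAG", sa, first, occ))
```

Each base of the read costs two table lookups regardless of genome size, which is where the speed comes from. This sketch stores the full occurrence table for simplicity; practical indexes sample it to save space.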

Bowtie caveat
“If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment.”
...unless you use the MUCH slower “best” option.

Mature RNA presents a unique problem for read mapping …

Read mapping
[Figure labels: exon mapping; exon-exon junction]
Unlike DNA-Seq, when mapping RNA-Seq reads back to the reference genome, we need to pay attention to reads that span exon-exon junctions.
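The junction problem can be shown with a toy example: a read taken from mature RNA does not occur contiguously in the genome, but its two halves do, separated by the intron. The sequences, the `find_spliced_alignment` helper and the `min_anchor` parameter below are invented for illustration. This brute-force split search is only a sketch of the idea, not how TopHat or any other spliced aligner actually works.

```python
def find_spliced_alignment(read, genome, min_anchor=4):
    """Try every split of the read into two pieces of at least min_anchor bases.

    Returns (left_start, left_length, right_start) for the first split whose
    pieces both match the genome exactly, with the right piece starting
    downstream of the left piece. Returns None if no such split exists.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            # Require at least one skipped base between the two pieces.
            right_start = genome.find(right, left_start + split + 1)
            if right_start != -1:
                return left_start, split, right_start
            left_start = genome.find(left, left_start + 1)
    return None


# Toy gene: two exons separated by an intron (sequences are made up).
exon1 = "ATGGCCATTGTA"
intron = "GTAAGTCTCTCTCTTTTCAG"
exon2 = "CGGAAGTTCCTA"
genome = exon1 + intron + exon2

# A read from the mature RNA spans the exon-exon junction.
read = exon1[-6:] + exon2[:6]

contiguous_hit = genome.find(read)  # -1: the read is not contiguous in the genome
spliced_hit = find_spliced_alignment(read, genome)
print(contiguous_hit, spliced_hit)
```

A DNA-Seq style mapper would report this read as unmapped, even though it is a perfectly good read from the transcript.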

Three mapping strategies
- De novo: assemble the reads first; the assembled products can then be aligned to the genome if a reference is available.
- Transcriptome: rely on ANNOTATION, or do isoform inference and then rely on the inferred isoforms.
- Reference genome: the TopHat/Cufflinks sort of approach; data driven.
Diagrams from: Cloonan & Grimmond, Nature Methods 2010

Many splice junctions per gene

Mapping software …

Alignment quality score
- Base quality values and mismatch positions in a candidate alignment are used to assign a probability value.
- The probability reflects the likelihood that the candidate position in the genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read's quality values.
- The score should also reflect “uniqueness”.
- The alignment score for a read is computed from the probability values of all candidate alignments.
Example: if there are two candidate alignments for a read with probability values 0.9 and 0.3:
- 0.9 / (0.9 + 0.3) = 0.75 is the chance that the highest-scoring alignment is correct.
- 1 - 0.75 = 0.25 is the chance that the highest-scoring alignment is wrong.
- Alignment score = -10 log10(0.25) ≈ 6.
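The worked example above can be written as a short function. This is a sketch of the arithmetic on the slide only; the function name is an assumption, and real aligners use their own (capped) scoring schemes.

```python
import math


def alignment_score(candidate_probs):
    """Phred-scaled chance that the best candidate alignment is wrong."""
    best = max(candidate_probs)
    p_correct = best / sum(candidate_probs)
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        # A single candidate gives p_wrong = 0; real tools cap the score instead.
        return float("inf")
    return -10.0 * math.log10(p_wrong)


score = alignment_score([0.9, 0.3])
print(round(score, 2))  # about 6.02, which the slide rounds to 6
```

A second candidate of almost equal probability drives the score toward 3 (a 50% chance of being wrong), which is how the score captures “uniqueness”.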

Sequence Alignment/Map (SAM)

Sequence Alignment/Map (SAM) format
http://samtools.sourceforge.net/
- Standard format for reporting short read alignment data.
- BAM is the compressed binary version.
- A SAM file has two parts: the header and the alignment info.

Read Information

Mapping Flags
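The FLAG column of a SAM record is a bitwise combination of the values defined in the SAM specification. A minimal decoder, with short names chosen for this example, might look like this:

```python
# Bit values from the SAM specification; the short names are our own labels.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(decode_flag(99))    # a typical properly paired first mate
print(decode_flag(4))     # an unmapped read
print(decode_flag(1040))  # a reverse-strand read marked as a duplicate
```

Checking bits 0x4 (unmapped) and 0x400 (duplicate) is enough to sort reads into "not mapped", "mapped, duplicated" and the rest.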

How is mapping a quality control step?
- Poor quality reads will not map with the correct settings.
- If too many reads are thrown away, it may indicate a problem in pre-mapping QC (protocol, trimming).
Figure legend: RED = not mapped; YELLOW = mapped, duplicated; BLUE = uniquely mapped

Mapping: from fastq to tdf
- After successful trimming, it's time to map.
- Again, there are a number of different mapping tools, all with pros and cons.
- We will use Bowtie2.

In part one of this script, we will:
- Set our Bowtie2 parameters. The mapping sensitivity setting (fast, sensitive, very sensitive) largely determines how long the run takes.
- Generate mapping stats.

In the next part of this script, we will:
- Convert .sam to .bam (binary SAM, quicker processing) and index it (see handout).
- Generate a bedGraph file, which gives information about read coverage.

Lastly, we will:
- Read count correct: adjust for read depth post-mapping using the flagstat file.
- Convert .bedGraph to .tdf for rapid loading into IGV.
To run (after adjusting rootname/project; refdir is the same for everyone):
$ bash mapping.sh
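The contents of mapping.sh are not shown here, so the following is only a sketch of what a read count correction could look like. It assumes the correction means scaling bedGraph coverage to reads per million mapped, with the mapped-read count taken from the "mapped" line of samtools flagstat output. Both function names and the sample data are made up for this example.

```python
def mapped_reads_from_flagstat(flagstat_text):
    """Pull the mapped-read count out of samtools flagstat output."""
    for line in flagstat_text.splitlines():
        parts = line.split()
        # Looks for a line of the form "<passed> + <failed> mapped (...)".
        if len(parts) >= 4 and parts[1] == "+" and parts[3] == "mapped":
            return int(parts[0])
    raise ValueError("no 'mapped' line found in flagstat output")


def normalize_bedgraph(lines, mapped_reads, per=1_000_000):
    """Scale the coverage column of bedGraph lines to reads per `per` mapped reads."""
    scale = per / mapped_reads
    normalized = []
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue  # skip blank lines and header lines
        chrom, start, end, value = line.split()[:4]
        normalized.append(f"{chrom}\t{start}\t{end}\t{float(value) * scale:.4f}")
    return normalized


# Made-up flagstat excerpt and bedGraph lines, for illustration only.
flagstat_text = """2500000 + 0 in total (QC-passed reads + QC-failed reads)
2000000 + 0 mapped (80.00% : N/A)"""
bedgraph_lines = ["chr1\t0\t100\t10", "chr1\t100\t250\t3"]

mapped = mapped_reads_from_flagstat(flagstat_text)
normalized = normalize_bedgraph(bedgraph_lines, mapped)
print(mapped, normalized)
```

Scaling by mapped reads makes tracks from samples sequenced to different depths comparable when they are viewed side by side in IGV.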

Open tdfs in IGV
Path = $ cd data/nascent-ws/Mapped/tdfs/
Open all SRRs with .tri.tdf
See dowell.colorad.edu/HackCon/pages/visualization.html for IGV tips and tricks.

Evaluating your mapping
- Poor quality reads will not map.
- Too many duplicates may indicate a problem with the library or sequencing.
- It is better to run a small “test” sample (~10M reads) first if you have the time and $$.
- Look at your sample!!

Post-mapping: additional QC
- RSeQC: assessing where reads are mapping in the genome. SCRIPT: rseqc.sh
- Preseq: determining sample complexity (how many unique reads). SCRIPT: preseq.sh