Maximize read usage through mapping strategies

Presentation transcript:

Key concepts of session The settings you use MATTER. More reads mapping is not ALWAYS better. Sometimes you need to understand what you threw away. Mapping is yet another quality control step.

What are we doing with mapping? How do we align the bag of reads to the reference efficiently (memory and time) while accounting for inexact matches and ambiguity? Traditional sequence alignment tools (BLAT, BLAST) are too slow for millions of short reads.

Short read mapping Input: A reference genome; a collection of many 25-250 bp tags/reads; user-specified parameters. Output: One or more genomic coordinates for each tag. In practice, not all reads successfully map to the reference genome. Why?

What makes it hard? Inexact matches Multiple mapping ?

Indexing The key to speeding up matching short reads is to tightly INDEX the genome (> 2000X faster than BLAST). There are multiple strategies for building indices: suffix tree, suffix array, hashing.
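
To make the hashing strategy concrete, here is a minimal Python sketch of a k-mer hash index used as a seed lookup. It is a toy illustration only: the function names, the seed length k=4, and the tiny reference string are assumptions made for this example, it handles exact matches only, and it is not how Bowtie (which uses a BWT-based index) works internally.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read_exact(read, reference, index, k):
    """Return 0-based positions where the whole read matches exactly.

    The first k bases of the read act as a seed: the index supplies
    candidate positions, and each candidate is verified against the
    full read. Inexact matching is deliberately left out of this toy.
    """
    if len(read) < k:
        return []
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Hypothetical toy reference, not real genomic data.
reference = "ACGTACGTTAGCACGTAC"
index = build_kmer_index(reference, k=4)

unique_hits = map_read_exact("TTAGC", reference, index, k=4)   # maps once
multi_hits = map_read_exact("ACGTAC", reference, index, k=4)   # multi-mapper
no_hits = map_read_exact("GGGGG", reference, index, k=4)       # unmapped
```

The index is built once and reused for every read, which is where the speedup over scanning the reference per read comes from. The second read also shows the multiple-mapping problem: one read, two equally good coordinates.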

Bowtie Indexes the reference genome using a scheme based on the Burrows-Wheeler transform (BWT), which is very space efficient. Uses a quality-aware backtracking algorithm that allows mismatches and favors high-quality alignments. Uses 'double indexing', a strategy to avoid excessive backtracking.

Bowtie caveat “If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment.” …unless you use the MUCH slower “best” option

Mature RNA presents a unique problem for read mapping …

Read mapping: exon reads vs. exon-exon junction reads. Unlike DNA-Seq, when mapping RNA-Seq reads back to the reference genome, we need to pay attention to exon-exon junction reads.

Three mapping strategies. De novo: assembled products can be aligned to the genome if a reference is available. Transcriptome: rely on ANNOTATION, or do isoform inference and then rely on the inferred isoforms. Reference genome: the TopHat/Cufflinks sort of approach; data driven. Diagrams from: Cloonan & Grimmond, Nature Methods 2010

Many splice junctions per gene

Mapping software …

Alignment quality score Base quality values and mismatch positions in a candidate alignment are used to assign a probability value. The probability reflects the likelihood that the candidate position in the genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values. It should also reflect “uniqueness”. The alignment score for a read is computed from the probability values of all candidate alignments. If there are two candidate alignments for a read with probability values 0.9 and 0.3: 0.9/(0.9+0.3) = 0.75 is the chance the highest scoring alignment is correct; 1 - 0.75 = 0.25 is the chance the highest scoring alignment is wrong. Alignment score = -10 log10(0.25) ≈ 6.
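
The worked example on this slide can be reproduced with a few lines of Python. This is a sketch of the arithmetic as stated on the slide, not the scoring code of any particular aligner; the function name and the `max_score` ceiling (used when only one candidate exists and the error probability would be zero) are assumptions added for the example.

```python
import math


def mapping_quality(candidate_probs, max_score=60.0):
    """Phred-scaled score that the best candidate alignment is correct.

    candidate_probs: probability values of all candidate alignments.
    max_score: hypothetical ceiling, returned when the chance of being
    wrong is zero (for example, a single candidate alignment).
    """
    best = max(candidate_probs)
    p_correct = best / sum(candidate_probs)
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        return max_score
    return min(max_score, -10.0 * math.log10(p_wrong))


# The two candidates from the slide: probabilities 0.9 and 0.3.
score = mapping_quality([0.9, 0.3])
rounded_score = round(score)

# A read with a single candidate alignment hits the assumed ceiling.
unique_score = mapping_quality([0.9])
```

Note how the score drops as soon as a second plausible location exists, even if the best alignment itself is good. That is the "uniqueness" component the slide refers to.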

Sequence Alignment/Map (SAM)

Sequence Alignment/Map (SAM) format (http://samtools.sourceforge.net/) Standard format for reporting short read alignment data. BAM is the compressed binary version. A file has two parts: header and alignment info.

Read Information

Mapping Flags
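
The FLAG column of a SAM record is a bitwise combination of properties, so a single integer such as 99 encodes several facts at once. The sketch below decodes a flag using the bit values defined in the SAM specification; the short labels and the function name are choices made for this example, not official names.

```python
# Bit values from the SAM specification; labels are informal.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of every bit that is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


paired_example = decode_flag(99)      # 1 + 2 + 32 + 64
unmapped_example = decode_flag(4)
duplicate_example = decode_flag(1040)  # 16 + 1024

for value in (99, 4, 1040):
    print(value, decode_flag(value))
```

For single-end data most records carry flag 0 (mapped, forward strand), 16 (mapped, reverse strand), or 4 (unmapped).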

How is mapping a quality control step? Poor quality reads will not map with the correct settings. If too many reads are thrown away, it may indicate a problem in pre-mapping QC (protocol, trimming). Figure legend: RED = not mapped; YELLOW = mapped, duplicated; BLUE = uniquely mapped.

Mapping: from fastq to tdf After successful trimming, it’s time to map. Again, there are a number of different mapping tools, all with pros and cons. We will use Bowtie2.

In part one of this script, we will: Set our Bowtie2 parameters. The mapping sensitivity setting (fast, sensitive, very sensitive) largely determines how long the run takes. Generate mapping stats.

In the next part of this script, we will: Convert .sam to .bam (binary sam, quicker processing) and index (see handout). Generate a bedGraph file, which gives information about read coverage.
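
The contents of mapping.sh are not shown in these slides, so the following is only a guess at the kind of commands such a script chains together, written as a Python helper that builds the command lines without running them. The file naming scheme, the `--very-sensitive` preset, the single-end `-U` input, and the function name are all assumptions for illustration; the workshop script may differ.

```python
def build_mapping_commands(rootname, index_prefix, preset="--very-sensitive"):
    """Build (but do not run) command lines for map -> BAM -> bedGraph.

    rootname and index_prefix are hypothetical inputs; file names are
    derived from rootname purely for this example.
    """
    fastq = f"{rootname}.fastq"
    sam = f"{rootname}.sam"
    bam = f"{rootname}.bam"
    sorted_bam = f"{rootname}.sorted.bam"
    return [
        # Map single-end reads; Bowtie2 prints its alignment summary to stderr.
        ["bowtie2", preset, "-x", index_prefix, "-U", fastq, "-S", sam],
        # Convert SAM to BAM, then sort and index the BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
        # Mapping stats; output goes to stdout and needs redirecting to a file.
        ["samtools", "flagstat", sorted_bam],
        # Coverage in bedGraph format; also written to stdout.
        ["bedtools", "genomecov", "-bg", "-ibam", sorted_bam],
    ]


commands = build_mapping_commands("sample", "genome_index")
for cmd in commands:
    print(" ".join(cmd))
```

Building the commands as lists keeps each step inspectable before anything is executed, which is useful when checking that the settings are the ones you intended.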

Lastly, we will: Read count correct: adjust for read depth post-mapping using the flagstat file. Convert .bedGraph to .tdf for rapid loading into IGV. To run (after adjusting rootname/project; refdir is the same for everyone): $ bash mapping.sh
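
One common way to do the read count correction is to scale every bedGraph value to reads per million mapped, taking the mapped read count from the samtools flagstat output. The sketch below assumes that scheme; the slides do not say which normalization mapping.sh applies, and the function names, the per-million scale, and the sample text are all illustrative.

```python
import re


def mapped_reads_from_flagstat(flagstat_text):
    """Extract the QC-passed mapped read count from flagstat text output."""
    for line in flagstat_text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) mapped \(", line)
        if match:
            return int(match.group(1))
    raise ValueError("no 'mapped' line found in flagstat output")


def normalize_bedgraph_lines(lines, mapped_reads, scale=1_000_000):
    """Scale the value column of bedGraph lines to reads per `scale` mapped."""
    factor = scale / mapped_reads
    normalized = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Pass track/browser/comment lines through unchanged.
        if line.startswith(("track", "browser", "#")):
            normalized.append(line)
            continue
        chrom, start, end, value = line.split("\t")[:4]
        normalized.append(f"{chrom}\t{start}\t{end}\t{float(value) * factor:.4f}")
    return normalized


# Made-up flagstat excerpt and bedGraph lines, for illustration only.
flagstat_text = (
    "2500000 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "0 + 0 duplicates\n"
    "2000000 + 0 mapped (80.00% : N/A)\n"
)
mapped = mapped_reads_from_flagstat(flagstat_text)
scaled = normalize_bedgraph_lines(["chr1\t100\t200\t4", "chr1\t200\t350\t10"], mapped)
```

Without this step, a sample sequenced twice as deeply would look twice as tall in IGV, so tracks from different samples would not be comparable by eye.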

Open tdfs in IGV Path: $ cd data/nascent-ws/Mapped/tdfs/ Open all SRRs ending in .tri.tdf. See dowell.colorad.edu/HackCon/pages/visualization.html for IGV tips and tricks.

Evaluating your mapping Poor quality reads will not map. Too many duplicates may indicate a problem with the library or sequencing. It is better to run a small “test” sample (~10M reads) if you have time and $$. Look at your sample!!

Post-mapping: additional QC RSeQC: assessing where reads are mapping in the genome (SCRIPT: rseqc.sh). Preseq: determining sample complexity, i.e. how many unique reads (SCRIPT: preseq.sh).