RNAseq: a Closer Look at Read Mapping and Quantitation

Slides:



Advertisements
Similar presentations
John Dorband, Yaacov Yesha, and Ashwin Ganesan Analysis of DNA Sequence Alignment Tools.
Advertisements

RNA-Seq based discovery and reconstruction of unannotated transcripts
Fast and accurate short read alignment with Burrows–Wheeler transform
Peter Tsai Bioinformatics Institute, University of Auckland
TOPHAT Next-Generation Sequencing Workshop RNA-Seq Mapping
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center.
Compressed Index for a Dynamic Collection of Texts H.W. Chan, W.K. Hon, T.W. Lam The University of Hong Kong.
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen.
Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts.
Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.
Aligning Reads Ramesh Hariharan Strand Life Sciences IISc.
TopHat Mi-kyoung Seo. Today’s paper..TopHat Cole Trapnell at the University of Washington's Department of Genome Sciences Steven Salzberg Center.
RNA-seq workshop ALIGNMENT
SIGNAL PROCESSING FOR NEXT-GEN SEQUENCING DATA RNA-seq CHIP-seq DNAse I-seq FAIRE-seq Peaks Transcripts Gene models Binding sites RIP/CLIP-seq.
BNFO 615 Usman Roshan. Short read alignment Input: – Reads: short DNA sequences (upto a few hundred base pairs (bp)) produced by a sequencing machine.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Introduction to RNAseq
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2015.
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
Lecture 15 Algorithm Analysis
Construction of Substitution matrices
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
Qq q q q q q q q q q q q q q q q q q q Background: DNA Sequencing Goal: Acquire individual’s entire DNA sequence Mechanism: Read DNA fragments and reconstruct.
Heuristic Alignment Algorithms Hongchao Li Jan
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Canadian Bioinformatics Workshops
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
Compressed Suffix Arrays for Massive Data Jouni Sirén SPIRE 2009.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
1 BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches 1Yangjun Chen, 2Yujia.
Burrows-Wheeler Transformation Review
FastHASH: A New Algorithm for Fast and Comprehensive Next-generation Sequence Mapping Hongyi Xin1, Donghyuk Lee1, Farhad Hormozdiari2, Can Alkan3, Onur.
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
RNA Quantitation from RNAseq Data
An Efficient Algorithm for Read Matching in DNA Databases
Tries 07/28/16 11:04 Text Compression
Tries 5/27/2018 3:08 AM Tries Tries.
VCF format: variants c.f. S. Brown NYU
RNA-Seq analysis in R (Bioconductor)
Homology Search Tools Kun-Mao Chao (趙坤茂)
The short-read alignment in distributed memory environment
Pairwise and NGS read alignment
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
13 Text Processing Hongfei Yan June 1, 2016.
Kallisto: near-optimal RNA seq quantification tool
Homology Search Tools Kun-Mao Chao (趙坤茂)
From: TopHat: discovering splice junctions with RNA-Seq
CSC2431 February 3rd 2010 Alecia Fowler
Next-generation sequencing - Mapping short reads
Lecture 14 Algorithm Analysis
2018, Spring Pusan National University Ki-Joune Li
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Maximize read usage through mapping strategies
Tries 2/27/2019 5:37 PM Tries Tries.
BIOINFORMATICS Fast Alignment
Next-generation sequencing - Mapping short reads
CS 6293 Advanced Topics: Translational Bioinformatics
Alignment of Next-Generation Sequencing Data
Homology Search Tools Kun-Mao Chao (趙坤茂)
Sequence Analysis - RNA-Seq 2
Presentation transcript:

RNAseq: a Closer Look at Read Mapping and Quantitation Jeremy Buhler for GEP Alumni Workshop

RNAseq Pipeline for Expression Analysis 37251 20653 9827 5121 RNAseq Read Count per Transcript Map reads to transcripts RNA Reads Extract and Sequence RNA Source RNA Abundance per Transcript Quantify

Topics for Today Read mapping and RNA quantitation from read counts are two key computational steps in RNAseq expression analysis. Mapping: how do we do it fast? Quantitation: how do we get from mapped reads to transcript abundance?

Problem: Short-Read Mapping Given a genome or transcriptome database D M reads – DNA seqs of some length s << |D| For each read, does it appear in D with at most k differences, and if so, where? M s

Typical Parameters Database D has size 107 – 1010 bases Number of reads M is 106 – 108 Length s is 75-150 (may vary among reads) # differences k is 0-3 (added, deleted, changed) to account for sequencing errors, polymorphisms

Design Pressures # reads M is huge!  minimal time per read Cannot afford to spend O(|D|) time per read as in BLAST  need to index database in order to spend less time matching each read Index should be small enough to reside in fast memory, so as to maximize mapping speed

Basic Idea: Suffix Tree Index First, sort all suffixes of database string D to create sorted suffix array A. Second, merge all common prefixes of adjacent strings in A to create a suffix tree T. Leaves of tree correspond to suffixes of D.

Construct a Suffix Array (A) 1 2 3 4 5 6 7 8 9 10 10 $ 7 aga$ 9 a$ 6 caga$ 8 ga$ 5 ccaga$ 0 acagaccaga$ 1 cagaccaga$ 2 agaccaga$ 3 gaccaga$ 4 accaga$ D = acagaccaga$ A 0 acagaccaga$ 1 cagaccaga$ 2 agaccaga$ 3 gaccaga$ 4 accaga$ 5 ccaga$ 6 caga$ 7 aga$ 8 ga$ 9 a$ 10 $ Suffix Position Suffix Sort by suffix

Suffix Tree Example D = acagaccaga$ A T Suffix 10 $ 9 a$ 0 aca… … 1 2 3 4 5 6 7 8 9 10 A T Suffix c g a 10 $ 6 1 5 8 3 a g c $ … 9 a$ c g $ 4 7 2 a c $ … 0 aca… 2 agac… 4 acc… 7 aga$ 1 cagac… 3 gac… 5 cca… 6 caga$ 8 ga$ 9

Rapid Matching vs Suffix Tree Where is cag? 9 4 7 2 6 1 5 8 3 a c g $ … Can find all matches to a read in D using time proportional to read length s.

Rapid Matching vs Suffix Tree Where is cag? 9 4 7 2 6 1 5 8 3 a c g $ … Can find all matches to a read in D using time proportional to read length s.

Rapid Matching vs Suffix Tree Where is cag? 9 4 7 2 6 1 5 8 3 a c g $ … Can find all matches to a read in D using time proportional to read length s.

Rapid Matching vs Suffix Tree Where is cag? 9 4 7 2 6 1 5 8 3 a c g $ … Can find all matches to a read in D using time proportional to read length s. D = acagaccaga$ 1 2 3 4 5 6 7 8 9 10

Extension to Inexact Matching To permit matches with k substitutions, try multiple paths, but charge for each mismatch. (Try searching for “aca” with one mismatch.) To permit matches with k differences, can do dynamic programming or heuristic search to compute edit distance of read against many paths in tree. Descent stops for a read when we hit bottom of tree or find that path requires > k differences.

Spliced read mapping (TopHat) Reads Genome Map unspliced reads S1 Initially Unmapped (IUM) reads S1 S3 S1 S2 S3 Split IUM into segments S2 part 1 S2 part 2 Junction flanking index GT AG IUM S2

Burrows-Wheeler Transform (BWT) 1 2 3 4 5 6 7 8 9 10 D = acagaccaga$ D = acagaccaga$ D = acagaccaga$ Suffix Array 10 $ Burrows-Wheeler Matrix 10 $ acagaccaga 9 a$ 9 a$acagaccag 0 acagaccaga$ 0 acagaccaga$ 1 cagaccaga$a 2 agaccaga$ac 3 gaccaga$aca 4 accaga$acag 5 ccaga$acaga 6 caga$acagac 7 aga$acagacc 8 ga$acagacca 4 accaga$ 7 aga$ 2 agaccaga$ 6 caga$ Used for compression and indexing strings Compress string using run-length encoding 1 cagaccaga$ 5 ccaga$ 8 ga$ 3 gaccaga$ BWT: aaaacccg$ga

“Virtual Tree”: FM-Index Suffix trees need 10-100 bytes memory/base indexed Theorem [Ferragina & Manzini, 2000]: we can build an O(1)-time oracle that, given “suffix tree posn” corresponding to a string z, returns posn corresponding to one-char extension z.c Hence, we can simulate suffix tree matching without explicitly storing the tree! Space cost: ~1 byte/base

Commonly Used Mapping Software Bowtie [Langmead et al. 2009, 2012] BWA [Li & Durbin 2009, 2010] SOAP2 [Li et al. 2009] All these tools use variants of FM-index approach.

On to Quantitation… (Switch to Notes)