Next-generation sequencing - Mapping short reads


CS 5263 & CS 4233 Bioinformatics: Next-generation sequencing - Mapping short reads

Short read mapping
Input: a reference genome; a collection of many 25-100 bp tags (reads); user-specified parameters.
Output: one or more genomic coordinates for each tag.
In practice, only 70-75% of tags successfully map to the reference genome. Why?

Multiple mapping A single tag may occur more than once in the reference genome. The user may choose to ignore tags that appear more than n times. As n gets large, you get more data, but also more noise in the data.

Inexact matching
An observed tag may not exactly match any position in the reference genome. Sometimes, the tag almost matches one or more positions. Such mismatches may represent a SNP (single-nucleotide polymorphism) or a bad read-out. The user can specify the maximum number of mismatches, or a Phred-style quality score threshold. As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.
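The mismatch threshold can be illustrated with a minimal sketch that scans the reference and keeps every position within a given number of mismatches. The function names `hamming` and `map_with_mismatches` are hypothetical, and the brute-force scan is only for illustration; it is not how any particular mapper works.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def map_with_mismatches(tag: str, reference: str, max_mismatches: int) -> list[int]:
    """Return 0-based reference positions where the tag matches
    with at most max_mismatches substitutions (no gaps)."""
    m = len(tag)
    return [
        i
        for i in range(len(reference) - m + 1)
        if hamming(tag, reference[i:i + m]) <= max_mismatches
    ]


if __name__ == "__main__":
    # Allowing one mismatch adds a second, possibly spurious, hit.
    print(map_with_mismatches("acg", "acaacg", 0))
    print(map_with_mismatches("acg", "acaacg", 1))
```

With the toy reference acaacg, raising the limit from 0 to 1 turns one hit into two, which is the sensitivity versus noise tradeoff described above.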

Read Length is Not As Important For Resequencing Jay Shendure

Mapping Reads Back
Hash table (lookup table): FAST, but requires perfect matches. [O(mn + N)]
Array scanning: can handle mismatches, but not gaps. [O(mN)]
Dynamic programming (Smith-Waterman): handles indels; mathematically optimal solution; slow (most programs use hash mapping as a prefilter). [O(mnN)]
Burrows-Wheeler Transform (BW Transform): FAST [O(m + N)] (without mismatch/gap); memory efficient; but for gaps/mismatches, it lacks sensitivity.
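The hash table approach can be sketched as follows, assuming the reference is indexed by every substring of the tag length. `build_index` and `lookup` are hypothetical names, and real tools use more compact encodings than Python strings.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every length-k substring of the reference to its 0-based positions."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)


def lookup(index: dict[str, list[int]], tag: str) -> list[int]:
    """Exact-match lookup: one hash probe per tag, no mismatches tolerated."""
    return index.get(tag, [])


if __name__ == "__main__":
    idx = build_index("agcagcagact", 3)
    print(lookup(idx, "gca"))  # positions of exact occurrences
    print(lookup(idx, "gct"))  # a single mismatch means no hit at all
```

The lookup is fast because it is a single dictionary probe, but a tag with even one mismatch hashes to a different key, which is why this method requires perfect matches.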

Spaced seed alignment Tags and tag-sized pieces of reference are cut into small “seeds.” Pairs of spaced seeds are stored in an index. Look up spaced seeds for each tag. For each “hit,” confirm the remaining positions. Report results to the user.
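A simplified sketch of the seed-then-verify idea is shown below. It assumes a single seed mask (a string of 1s and 0s, where 1 marks a position that must match) rather than the pairs of seeds described on the slide, and it assumes the tag has the same length as the mask. All names are hypothetical.

```python
from collections import defaultdict


def masked_key(s: str, mask: str) -> str:
    """Keep only the characters of s at positions where mask has '1'."""
    return "".join(ch for ch, bit in zip(s, mask) if bit == "1")


def build_spaced_index(reference: str, mask: str) -> dict[str, list[int]]:
    """Index every tag-sized window of the reference by its masked key."""
    w = len(mask)
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(reference) - w + 1):
        index[masked_key(reference[i:i + w], mask)].append(i)
    return dict(index)


def spaced_seed_hits(tag: str, reference: str, index: dict[str, list[int]],
                     mask: str, max_mismatches: int) -> list[int]:
    """Look up the tag's seed, then confirm the remaining positions for each hit."""
    hits = []
    for i in index.get(masked_key(tag, mask), []):
        window = reference[i:i + len(tag)]
        mismatches = sum(a != b for a, b in zip(tag, window))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


if __name__ == "__main__":
    ref, mask = "agcagcagact", "1101"
    idx = build_spaced_index(ref, mask)
    # The tag differs from the reference only at the masked-out position.
    print(spaced_seed_hits("gctg", ref, idx, mask, max_mismatches=1))
```

Because the mismatch falls on a position the mask ignores, the seed still hits and the verification step then counts the mismatch against the allowed limit.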

Burrows-Wheeler Store entire reference genome. Align tag base by base from the end. When tag is traversed, all active locations are reported. If no match is found, then back up and try a substitution.
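The base-by-base alignment with substitutions can be sketched as a depth-first search over row ranges of the Burrows-Wheeler matrix. This is a simplified illustration, not Bowtie's actual algorithm: it explores every substitution within the budget instead of trying the exact path first, it builds the index naively, and the names `bwt_index` and `search` are hypothetical.

```python
def bwt_index(text: str):
    """Naively build the suffix array, C table and Occ table for text + '$'."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = "".join(t[i - 1] for i in sa)  # t[-1] is '$' when i == 0
    alphabet = sorted(set(t))
    c, total = {}, 0
    for ch in alphabet:  # c[ch] = first row whose first character is ch
        c[ch] = total
        total += t.count(ch)
    occ = {ch: [0] for ch in alphabet}  # occ[ch][i] = count of ch in bwt[:i]
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return sa, c, occ


def search(tag: str, index, max_subs: int) -> list[int]:
    """Return 0-based text positions matching tag with at most max_subs substitutions."""
    sa, c, occ = index
    results = set()

    def extend(i: int, lo: int, hi: int, subs_left: int) -> None:
        # Rows lo..hi-1 all start with tag[i+1:] (possibly with substitutions).
        if i < 0:
            results.update(sa[lo:hi])  # tag fully traversed: report locations
            return
        for ch in c:
            if ch == "$":
                continue
            cost = 0 if ch == tag[i] else 1  # 1 means "try a substitution"
            if cost > subs_left:
                continue
            new_lo = c[ch] + occ[ch][lo]
            new_hi = c[ch] + occ[ch][hi]
            if new_lo < new_hi:
                extend(i - 1, new_lo, new_hi, subs_left - cost)

    extend(len(tag) - 1, 0, len(sa), max_subs)
    return sorted(results)


if __name__ == "__main__":
    idx = bwt_index("acaacg")
    print(search("acg", idx, 0))
    print(search("acg", idx, 1))
```

The search starts from the last base of the tag, and an empty row range is the point where a real aligner would back up and try a different base.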

Why Burrows-Wheeler?
BWT very compact: approximately ½ byte per base; as large as the original text, plus a few “extras”; can fit onto a standard computer with 2GB of memory.
Linear-time search algorithm: proportional to length of query for exact matches.

Burrows-Wheeler Transform (BWT)
Text: acaacg$
Burrows-Wheeler Matrix (BWM), the sorted rotations of the text: $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac
BWT, the last column of the BWM: gc$aaac
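The construction on this slide can be reproduced with a few lines, assuming the naive approach of sorting all rotations (practical tools build the transform through a suffix array instead). The function names are hypothetical.

```python
def bwm(text: str) -> list[str]:
    """Burrows-Wheeler matrix: all rotations of text + '$', sorted.
    '$' sorts before the letters a, c, g, t."""
    t = text + "$"
    return sorted(t[i:] + t[:i] for i in range(len(t)))


def bwt(text: str) -> str:
    """Burrows-Wheeler transform: the last column of the matrix."""
    return "".join(row[-1] for row in bwm(text))


if __name__ == "__main__":
    for row in bwm("acaacg"):
        print(row)
    print(bwt("acaacg"))
```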

Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

Burrows-Wheeler Matrix, with the 1-based text offset at which each row starts:
7 $acaacg
3 aacg$ac
1 acaacg$
4 acg$aca
2 caacg$a
5 cg$acaa
6 g$acaac
See the suffix array?
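The link between the matrix and the suffix array can be checked with a short sketch. It assumes 1-based offsets, under which the row starting with $ has offset 7, and it uses a naive sort of suffixes; the function names are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    """0-based start positions of the suffixes of text + '$', in sorted order."""
    t = text + "$"
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt_from_suffix_array(text: str) -> str:
    """The BWT character of each row is the text character just before its suffix."""
    t = text + "$"
    return "".join(t[i - 1] for i in suffix_array(text))  # t[-1] is '$'


if __name__ == "__main__":
    print([i + 1 for i in suffix_array("acaacg")])  # 1-based offsets
    print(bwt_from_suffix_array("acaacg"))
```

Each row of the matrix begins with one suffix of the text, so sorting the rotations and sorting the suffixes give the same row order.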

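Key observation
Text with each character numbered by its order of occurrence: a1 c1 a2 a3 c2 g1 $1
First and last character of each BWM row, with the same numbering:
$1 ... g1 ($acaacg)
a2 ... c1 (aacg$ac)
a1 ... $1 (acaacg$)
a3 ... a2 (acg$aca)
c1 ... a1 (caacg$a)
c2 ... a3 (cg$acaa)
g1 ... c2 (g$acaac)
“last first (LF) mapping”: The i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column.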
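The LF mapping can be computed from the BWT alone, as in the sketch below. The function name `lf_mapping` is hypothetical; the first column is obtained simply by sorting the characters of the BWT.

```python
def lf_mapping(bwt: str) -> list[int]:
    """lf[i] is the row whose first-column character is the same
    text character as the last-column character of row i."""
    first = sorted(bwt)  # the first column is the BWT's characters, sorted
    start: dict[str, int] = {}
    for row, ch in enumerate(first):
        start.setdefault(ch, row)  # first row of each character's block
    seen: dict[str, int] = {}
    lf = []
    for ch in bwt:
        # The i-th occurrence of ch in the last column maps to the
        # i-th row of ch's block in the first column.
        lf.append(start[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return lf


if __name__ == "__main__":
    print(lf_mapping("gc$aaac"))
```

For the example matrix, row 0 ($acaacg) ends in g and maps to row 6 (g$acaac), which is the same rotation shifted right by one character.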

Recover text [figure: step-by-step walk through the Burrows-Wheeler matrix using the LF mapping]
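Recovering the text from the BWT can be sketched by following the LF mapping from the row that starts with $, collecting one character per step in reverse order. This is a minimal illustration under the assumption that $ terminates the text and sorts first; `inverse_bwt` is a hypothetical name.

```python
def inverse_bwt(bwt: str) -> str:
    """Rebuild the original text (without the trailing '$') from its BWT."""
    first = sorted(bwt)
    start: dict[str, int] = {}
    for row, ch in enumerate(first):
        start.setdefault(ch, row)
    seen: dict[str, int] = {}
    lf = []
    for ch in bwt:
        lf.append(start[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    row = 0  # row 0 starts with '$', so its last character ends the text
    chars = []
    for _ in range(len(bwt) - 1):  # every character except '$'
        chars.append(bwt[row])     # character preceding the current one
        row = lf[row]              # jump to the row that starts with it
    return "".join(reversed(chars))


if __name__ == "__main__":
    print(inverse_bwt("gc$aaac"))
```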

Exact match [figure: Burrows-Wheeler matrix rows annotated with suffix array offsets 3 1 4 2 5 6]

Exact match (another example)
BWT(agcagcagact) = tgcc$ggaaaac
Search for pattern: gca
Sorted rotations of agcagcagact$:
$agcagcagact
act$agcagcag
agact$agcagc
agcagact$agc
agcagcagact$
cagact$agcag
cagcagact$ag
ct$agcagcaga
gact$agcagca
gcagact$agca
gcagcagact$a
t$agcagcagac
Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/

Auxiliary data structures
Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.
rank (row at which each character's block starts in the first column, counting the $ row as row 0): a = 1, c = 5, g = 8, t = 11
row  SA  rotation      BWT
0    12  $agcagcagact  t
1    9   act$agcagcag  g
2    7   agact$agcagc  c
3    4   agcagact$agc  c
4    1   agcagcagact$  $
5    6   cagact$agcag  g
6    3   cagcagact$ag  g
7    10  ct$agcagcaga  a
8    8   gact$agcagca  a
9    5   gcagact$agca  a
10   2   gcagcagact$a  a
11   11  t$agcagcagac  c
FM indices: per-character (a, c, g, t) occurrence counts over the BWT column.
Searching for gca, one character at a time from the end of the pattern:
a: next block from 1 + 0 = 1 to 1 + (4-1) = 4
c: next block from 5 + 0 = 5 to 5 + (2-1) = 6
g: next block from 8 + 1 = 9 to 8 + (3-1) = 10
In each step, the first number is the rank of the character. The lower bound adds the number of times the character occurs in the BWT above the current block; the upper bound adds the number of occurrences through the end of the current block, minus one.
Rows 9 and 10 have SA values 5 and 2, so gca occurs at positions 2 and 5 of the text.
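The "next block" arithmetic on these slides can be sketched as a backward search over the BWT. The sketch assumes 0-based rows with the $ row as row 0 (matching the rank values 1, 5, 8, 11), and it counts occurrences naively by scanning the BWT, where a real FM index would store sampled counts. `backward_search` is a hypothetical name.

```python
def backward_search(pattern: str, bwt: str):
    """Return the inclusive row range (lo, hi) after each step of matching
    the pattern from its last character to its first, or None if absent."""
    first = sorted(bwt)
    rank: dict[str, int] = {}
    for row, ch in enumerate(first):
        rank.setdefault(ch, row)  # first row of each character's block

    def occ(ch: str, i: int) -> int:
        """Occurrences of ch in bwt[:i] (naive scan, for illustration only)."""
        return bwt[:i].count(ch)

    lo, hi = 0, len(bwt) - 1  # start with every row
    steps = []
    for ch in reversed(pattern):
        if ch not in rank:
            return None
        lo, hi = rank[ch] + occ(ch, lo), rank[ch] + occ(ch, hi + 1) - 1
        if lo > hi:
            return None  # block is empty: no exact match
        steps.append((lo, hi))
    return steps


if __name__ == "__main__":
    print(backward_search("gca", "tgcc$ggaaaac"))
```

For the pattern gca the three ranges are rows 1 to 4, 5 to 6, and 9 to 10, the same blocks as on the slides. Each step costs one pair of count lookups, so the search time is proportional to the pattern length.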

Inexact match

Main advantage of BWT against suffix array
BWT needs less memory than suffix array.
For human genome, m = 3 × 10^9:
Suffix array: m log2(m) bits = 4m bytes = 12 GB
BWT: m/4 bytes plus extras = 1-2 GB
m/4 bytes to store BWT (2 bits per char)
Suffix array and occurrence counts array take 5m log2(m) bits = 20m bytes
In practice, SA and OCC are only partially stored; most elements are computed on demand (takes time!)
Tradeoff between time and space
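The memory figures can be checked with a few lines of arithmetic. The sketch assumes 1 GB = 10^9 bytes and rounds log2(m) up to whole bytes per entry (32 bits, or 4 bytes, for m = 3 × 10^9); `memory_estimates` is a hypothetical name.

```python
import math


def memory_estimates(m: int) -> dict[str, float]:
    """Rough memory needs, in GB, for a text of m bases."""
    bytes_per_entry = math.ceil(math.log2(m) / 8)  # about 32 bits -> 4 bytes
    gb = 1e9
    return {
        "suffix_array_gb": m * bytes_per_entry / gb,     # m log2(m) bits
        "bwt_gb": m / 4 / gb,                            # 2 bits per base
        "sa_plus_occ_gb": 5 * m * bytes_per_entry / gb,  # 5 m log2(m) bits
    }


if __name__ == "__main__":
    for name, value in memory_estimates(3_000_000_000).items():
        print(f"{name}: {value:g}")
```

Storing the full suffix array and occurrence counts would take 20m bytes, which is why only samples of them are kept alongside the 0.75 GB BWT.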

Comparison
Spaced seeds (e.g., MAQ): requires ~50 GB of memory; runs 30-fold slower; is much simpler to program.
Burrows-Wheeler (e.g., Bowtie): requires <2 GB of memory; runs 30-fold faster; is much more complicated to program.

Short-read mapping software
Software  Technique      Developer          License
Eland     Hashing reads  Illumina           ?
SOAP      Hashing refs   BGI                Academic
Maq                      Sanger (Li, Heng)  GNUPL
Bowtie    BWT            Salzberg/UMD
BWA
SOAP2     BWT & hashing
Source: http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

References
(Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Langmead et al. Genome Biology (2009) 10:R25
(SOAP) SOAP: short oligonucleotide alignment. Ruiqiang Li et al. Bioinformatics (2008) 24: 713-4
(BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform. Li Heng and Richard Durbin. Bioinformatics (2009) 25:1754–1760
(SOAP2) SOAP2: an improved ultrafast tool for short read alignment. Ruiqiang Li et al. Bioinformatics (2009) 25: 1966–1967
(MAQ) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. (2008) 18:1851-8
Sense from sequence reads: methods for alignment and assembly. Paul Flicek & Ewan Birney. Nature Methods 6, S6 - S12 (2009)
http://www.allisons.org/ll/AlgDS/Strings/BWT/