Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen.


Outline
- Short-read alignment
  - Algorithm
  - Results
- Comparisons between short-read and long-read alignment
- Long-read alignment
  - Algorithm
  - Results

Motivation
New DNA sequencing technologies call for fast and accurate read alignment programs.
MAQ:
- Pros: accurate, feature-rich, and fast enough to align short reads from a single individual.
- Cons: MAQ does NOT support gapped alignment for single-end reads, which makes it unsuitable for aligning longer reads, where indels may occur frequently.
Alignment with BWT:
- efficiently aligns short sequencing reads against a large reference sequence
- allows mismatches and gaps

Burrows-Wheeler Transform
X: actgct, W: gcc, Z = 1
Rotations of X$: actgct$, ctgct$a, tgct$ac, gct$act, ct$actg, t$actgc, $actgct
Table columns: i, S[i] (suffix array), B[i] (BWT string)
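The table on this slide can be rebuilt from the rotations listed above. The following is a minimal sketch, assuming a naive sort of all suffixes (fine for a toy string, not how BWA builds its index); the function name `bwt_table` is hypothetical.

```python
def bwt_table(x):
    """Return (S, B): the suffix array and BWT string of x + '$'."""
    t = x + "$"  # '$' sorts before every letter
    # S[i] is the start position of the i-th smallest suffix of t.
    s = sorted(range(len(t)), key=lambda i: t[i:])
    # B[i] is the character preceding that suffix; t[-1] is '$' when S[i] == 0.
    b = "".join(t[i - 1] for i in s)
    return s, b


if __name__ == "__main__":
    s, b = bwt_table("actgct")
    for i, (pos, ch) in enumerate(zip(s, b)):
        print(i, pos, ch, ("actgct$")[pos:])
```

For X = actgct this gives S = [6, 0, 4, 1, 3, 5, 2] and B = t$gatcc.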

Inexact Matching - number of differences in string W
D(i) is a lower bound on the number of differences needed to match W(0,i) against X. Take W = "gcc" as an example:
1. W(0,0) = "g" is a substring of X, so D(0) = 0;
2. W(0,1) = "gc" is a substring of X, so D(1) = 0;
3. W(0,2) = "gcc" is not a substring of X, so D(2) = 1.
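The three steps above can be sketched in a few lines. This is only an illustration: it uses Python's substring test in place of an index lookup, and the name `calculate_d` is an assumption, not taken from the slides.

```python
def calculate_d(w, x):
    """Lower bound D(i) on the number of differences in w[0..i] against x."""
    z = 0  # differences counted so far
    j = 0  # start of the piece of w currently being tested
    d = []
    for i in range(len(w)):
        if w[j:i + 1] not in x:
            # The current piece cannot occur in x without a difference:
            # count one and start a new piece after position i.
            z += 1
            j = i + 1
        d.append(z)
    return d


if __name__ == "__main__":
    print(calculate_d("gcc", "actgct"))
```

For W = gcc and X = actgct the result is [0, 0, 1], matching D(0), D(1), and D(2) on the slide.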

Inexact Matching - Searching

[Figure: search tree for the inexact matching of W = gcc against X = actgct. Each node is labeled with its SA interval; the root is [0,6].]
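The search can be sketched as a bounded recursion over SA intervals, pruned by D(i). This is a simplified illustration, assuming a naively built BWT, a root interval of [0, |X|], and unit cost for every mismatch, insertion, and deletion; the helper names are hypothetical and the real BWA implementation differs in many details.

```python
def calculate_d(w, x):
    # Lower bound on the differences needed for each prefix w[0..i].
    z, j, d = 0, 0, []
    for i in range(len(w)):
        if w[j:i + 1] not in x:
            z, j = z + 1, i + 1
        d.append(z)
    return d


def inexact_search(x, w, z):
    """Return the set of SA intervals of substrings of x matching w with <= z differences."""
    t = x + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    b = "".join(t[i - 1] for i in sa)  # BWT string
    d = calculate_d(w, x)
    alphabet = sorted(set(x))

    def c(a):  # symbols in x that are lexicographically smaller than a
        return sum(1 for ch in x if ch < a)

    def o(a, i):  # occurrences of a in b[0..i]; zero when i == -1
        return b[:i + 1].count(a)

    def recur(i, z, k, l):
        if z < (d[i] if i >= 0 else 0):
            return set()  # not enough differences left: prune
        if i < 0:
            return {(k, l)}  # all of w consumed
        found = recur(i - 1, z - 1, k, l)  # insertion: skip w[i]
        for a in alphabet:
            k2 = c(a) + o(a, k - 1) + 1
            l2 = c(a) + o(a, l)
            if k2 <= l2:
                found |= recur(i, z - 1, k2, l2)  # deletion: extra symbol in x
                cost = 0 if a == w[i] else 1      # match or mismatch
                found |= recur(i - 1, z - cost, k2, l2)
        return found

    return recur(len(w) - 1, z, 0, len(x))


if __name__ == "__main__":
    print(inexact_search("actgct", "gcc", 1))
```

With Z = 1, the only interval reported for W = gcc is (4, 4), the row of the suffix starting with "gc"; with Z = 0 nothing is found, because D(2) = 1 already exceeds the budget.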

Exact Matching
When D(i) = 0 for every i and no differences are allowed (Z = 0), the algorithm reduces to a search for exact matches.
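Exact matching on a BWT is usually done by backward search, which narrows an SA interval one query symbol at a time, starting from the last symbol. Below is a minimal sketch under the same assumptions as the slide's toy example (naive index construction, root interval [0, |X|]); the function name `exact_match` is hypothetical.

```python
def exact_match(x, w):
    """Return the sorted start positions of w in x, found by backward search."""
    t = x + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    b = "".join(t[i - 1] for i in sa)  # BWT string

    def c(a):  # symbols in x that are lexicographically smaller than a
        return sum(1 for ch in x if ch < a)

    def o(a, i):  # occurrences of a in b[0..i]; zero when i == -1
        return b[:i + 1].count(a)

    k, l = 0, len(x)  # SA interval of the empty string
    for a in reversed(w):
        k = c(a) + o(a, k - 1) + 1
        l = c(a) + o(a, l)
        if k > l:
            return []  # interval is empty: w does not occur in x
    return sorted(sa[i] for i in range(k, l + 1))


if __name__ == "__main__":
    print(exact_match("actgct", "ct"))
    print(exact_match("actgct", "gcc"))
```

For X = actgct, "ct" is found at positions 1 and 4 (SA interval [2,3]), while "gcc" yields no hit.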

Simulated data
Accuracy
- BWA is more accurate than Bowtie and SOAPv2 based on criterion 1.
Speed
- BWA is second in speed only to SOAPv2.
Memory
- MAQ's memory footprint is 1 GB, but it increases linearly with the number of reads to be aligned.
- BWA uses only 2.3 GB for single-end mapping and 3 GB for paired-end mapping (as much as Bowtie).
- SOAPv2 uses 5.4 GB.

Differences between short-read and long-read alignment
Short-read alignment
- Aligns the full-length read
- Efficient for ungapped alignment or limited gaps
Long-read alignment
- Finds local matches
- Permissive about alignment gaps

Motivations
- Many programs exist for short-read alignment
- Not many exist for reads > 200 bp: BLAT, SSAHA2
- New platforms are producing longer sequences: Roche/454 > 400 bp, Illumina > 100 bp, Pacific Biosciences > 1000 bp
Fast and accurate long-read alignment with Burrows-Wheeler transform
- New algorithm: Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW)

Before NGS: FASTA 1988, BLAST 1997, MegaBLAST 2000, SSAHA, BLAT 2002
After NGS: SOAP 2008, MAQ 2008, Bowtie 2009, BWA 2009, BWA-SW 2010

Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW)
Overview of the algorithm
(1) Build FM-indices for the reference and query sequences
(2) Represent the reference in a prefix trie
(3) Represent the query in a prefix DAWG (directed acyclic word graph), transformed from the prefix trie of the query sequence
[Figure: prefix trie and prefix DAWG of the string GOOGOL. '^' marks the start of a string; the two numbers in a node give the SA interval of the node.]
Example:
a. In the prefix trie, 3 nodes have SA interval [4,4]
b. Their parents have intervals [1,2], [1,2] and [1,1]
In the prefix DAWG, the [4,4] node has parents [1,2] and [1,1]. Node [4,4] represents the strings 'OG', 'OGO', 'OGOL'.
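The GOOGOL example can be checked by grouping substrings by SA interval, which is what collapsing the prefix trie into a prefix DAWG amounts to. The sketch below is a brute-force illustration only, assuming a naively sorted suffix list; the names `sa_interval`, `dawg_nodes`, and `dawg_parents` are hypothetical.

```python
from collections import defaultdict


def sa_interval(x, s):
    """SA interval of s: the range of sorted suffixes of x + '$' that start with s."""
    suffixes = sorted((x + "$")[i:] for i in range(len(x) + 1))
    rows = [r for r, suf in enumerate(suffixes) if suf.startswith(s)]
    return (rows[0], rows[-1])


def dawg_nodes(x):
    """Group every distinct substring of x by its SA interval (one group per DAWG node)."""
    substrings = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    nodes = defaultdict(set)
    for s in substrings:
        nodes[sa_interval(x, s)].add(s)
    return dict(nodes)


def dawg_parents(x, interval):
    """Parents of a DAWG node: intervals of its strings with the first symbol removed."""
    return {sa_interval(x, s[1:]) for s in dawg_nodes(x)[interval]}


if __name__ == "__main__":
    print(dawg_nodes("GOOGOL")[(4, 4)])
    print(dawg_parents("GOOGOL", (4, 4)))
```

For GOOGOL, node (4, 4) holds 'OG', 'OGO', and 'OGOL', and its parents are (1, 2) and (1, 1), as stated on the slide.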

Overview of the algorithm
(4) Dynamic programming with heuristics to accelerate the algorithm
Heuristic rules:
A) Restrict the dynamic programming to the neighborhood of good matches only
B) Report only alignments that are largely non-overlapping
These heuristics save computing time.

Heuristic strategies for acceleration
(1) Z-best: traverse G(W) in the outer loop and T(X) in the inner loop, and at each node u in G(W) keep only the top Z best-scoring nodes in T(X) that match u, rather than keeping all the matching nodes, where
- G(W): prefix DAWG of the query sequence W
- T(X): prefix trie of the reference sequence X
- u: a node of G(W)
(2) Take only the best few alignments covering each region of the query sequence
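The pruning step of the Z-best strategy can be illustrated in isolation. The sketch below assumes that the candidates for one query node u are available as (reference node, score) pairs; that representation and the name `keep_z_best` are hypothetical and not taken from BWA-SW itself.

```python
import heapq


def keep_z_best(cells, z):
    """Keep the z highest-scoring (reference_node, score) pairs for one query node u."""
    # nlargest returns the kept pairs ordered from best to worst score.
    return heapq.nlargest(z, cells, key=lambda cell: cell[1])


if __name__ == "__main__":
    candidates = [("node_a", 3), ("node_b", 9), ("node_c", 5)]
    print(keep_z_best(candidates, 2))
```

Dropping low-scoring candidates at each node trades a small risk of missing the true alignment for a much smaller dynamic programming table.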

Result
The implementation of BWA-SW takes a BWA index and a query FASTA or FASTQ file as inputs. Aligning typical sequencing reads requires less than 4 GB of memory. The peak memory is 6.4 GB in total on one query sequence with 1 million base pairs.

Simulated data
Speed
- BWA-SW is the fastest, and its speed is not sensitive to read length or error rate.
Memory
- BWA-SW uses about 4 GB (as much as BLAT).
- SSAHA2 uses 2.4 GB for >= 500 bp reads, and 5.3 GB for shorter reads.
- BWA-SW supports multi-threading, while SSAHA2 and BLAT do not.
Accuracy
- BWA-SW can detect chimeric reads, and produces fewer false chimeric reads when the base error rate is lower.

Conclusion
Short-read alignment cannot be used for long-read alignment because of:
- Full-length read vs. local matches.
- Ungapped or limited gaps vs. a larger number of gaps.
BWA-short is more accurate, uses less memory, and is competitively fast.
BWA-long is the best available in speed, accuracy, and memory.

Questions?