Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen.


1 Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen

2 Outline
Short-read alignment
– Algorithm
– Results
Comparisons between short-read and long-read alignment
Long-read alignment
– Algorithm
– Results

3 Motivation
New DNA sequencing technologies call for fast and accurate read alignment programs.
MAQ:
– Pros: accurate, feature-rich, and fast enough to align short reads from a single individual.
– Cons: MAQ does NOT support gapped alignment for single-end reads, which makes it unsuitable for aligning longer reads, where indels may occur frequently.
Alignment with BWT:
– efficiently aligns short sequencing reads against a large reference sequence
– allows mismatches and gaps

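4 Burrows-Wheeler Transform
X: actgct, W: gcc, Z = 1
Rotations of X$: actgct$, ctgct$a, tgct$ac, gct$act, ct$actg, t$actgc, $actgct
Sorting the rotations gives the suffix array S and the BWT string B:
i  S[i]  B[i]  sorted rotation
0  6     t     $actgct
1  0     $     actgct$
2  4     g     ct$actg
3  1     a     ctgct$a
4  3     t     gct$act
5  5     c     t$actgc
6  2     c     tgct$ac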
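The construction on this slide can be reproduced with a few lines of Python. This is a minimal sketch that sorts suffixes naively, which is fine for a toy string but is not how a real aligner builds its index. The function names `suffix_array` and `bwt` are chosen here for illustration, and the input is assumed to already end in the sentinel `$`.

```python
def suffix_array(text):
    """Start positions of the suffixes of `text`, in lexicographic order.

    Assumes `text` ends with the sentinel '$', which sorts before a, c, g, t.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """BWT string B: the character preceding each sorted suffix.

    For the suffix starting at position 0, text[-1] wraps around to '$'.
    """
    return "".join(text[i - 1] for i in suffix_array(text))


X = "actgct$"
S = suffix_array(X)
B = bwt(X)
for i, (s, b) in enumerate(zip(S, B)):
    # X[s:] + X[:s] is the rotation of X that starts at position s.
    print(i, s, b, X[s:] + X[:s])
```

Each printed row holds i, S[i], B[i], and the corresponding sorted rotation.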

5 Inexact Matching - number of differences in string W
D(i) is a lower bound on the number of differences in W(0,i). Take string W=“gcc” for example.
1. W(0,0)=“g”, “g” is a substring of X, D(0)=0;
2. W(0,1)=“gc”, “gc” is a substring of X, D(1)=0;
3. W(0,2)=“gcc”, “gcc” is not a substring of X, D(2)=1.
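The three steps above can be sketched as a short loop. This is an illustration only: it uses Python's built-in substring test as a stand-in for the index lookup a real aligner would use, and the name `calculate_d` is an assumption made here.

```python
def calculate_d(W, X):
    """Lower bound D[i] on the number of differences needed to match W[0..i] in X.

    Sketch: grow a segment W[j..i]; when it stops being a substring of X,
    at least one more difference is required, so bump the counter and
    start a new segment after position i.
    """
    z = 0   # differences counted so far
    j = 0   # start of the current segment
    D = []
    for i in range(len(W)):
        if W[j:i + 1] not in X:
            z += 1
            j = i + 1
        D.append(z)
    return D


print(calculate_d("gcc", "actgct"))
```

For W = "gcc" and X = "actgct" this yields the values D(0)=0, D(1)=0, D(2)=1 listed on the slide.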

6 Inexact Matching - Searching

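7 [Figure: search tree for W = gcc in X = actgct. Edges are labeled with the characters a, c, g, t; nodes are labeled with suffix array intervals, starting from 0,6 at the root and including 2,3, 4,4, 6,6, 3,3 and 1,1. Numbers 1 to 6 mark the steps of the search.]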
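A search of this kind can be sketched as a bounded recursion over suffix array intervals. The version below is a simplification made for illustration: it allows mismatches only (no gaps), does not use D(i) to prune, uses half-open intervals internally, and builds its index naively. The names `build_index` and `inexact_search` are assumptions made here.

```python
def build_index(X):
    """Tiny FM-index style structure for X (assumes '$' does not occur in X)."""
    text = X + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    B = "".join(text[i - 1] for i in sa)           # BWT string
    alphabet = sorted(set(X))
    # C[a]: number of characters in text (including '$') smaller than a.
    C = {a: sum(1 for ch in text if ch < a) for a in alphabet}
    # occ[a][i]: number of occurrences of a in B[0:i].
    occ = {a: [0] for a in alphabet}
    for ch in B:
        for a in alphabet:
            occ[a].append(occ[a][-1] + (ch == a))
    return sa, C, occ


def inexact_search(W, X, z):
    """Positions in X where W matches with at most z mismatches."""
    sa, C, occ = build_index(X)
    hits = set()

    def recur(i, budget, lo, hi):
        if budget < 0:
            return
        if i < 0:                       # whole read consumed: report positions
            hits.update(sa[lo:hi])
            return
        for a in C:                     # try every character at position i
            nlo = C[a] + occ[a][lo]
            nhi = C[a] + occ[a][hi]
            if nlo < nhi:               # interval non-empty: string occurs in X
                recur(i - 1, budget - (a != W[i]), nlo, nhi)

    recur(len(W) - 1, z, 0, len(X) + 1)
    return sorted(hits)


print(inexact_search("gcc", "actgct", 1))
```

With Z = 1 the only hit for "gcc" is the occurrence of "gct" in X, which differs from the read in one position.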

8 Exact Matching
Setting D(i)=0 for all i and allowing no differences (Z=0) reduces the algorithm to a search for exact matches.
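Exact matching on a BWT index is usually done by backward search, which narrows a suffix array interval one read character at a time, from the last character to the first. The sketch below is a naive illustration under that assumption; `exact_match` is a name chosen here, and the helper functions recount characters on every call instead of using precomputed tables.

```python
def exact_match(W, X):
    """Backward search for W in X.

    Returns (closed SA interval, sorted match positions), or (None, [])
    when W does not occur in X. Assumes '$' does not occur in X.
    """
    text = X + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    B = "".join(text[i - 1] for i in sa)           # BWT string

    def C(a):
        # Number of characters in text (including '$') smaller than a.
        return sum(1 for ch in text if ch < a)

    def O(a, i):
        # Number of occurrences of a in B[0:i].
        return B[:i].count(a)

    lo, hi = 0, len(text)                          # half-open interval [lo, hi)
    for a in reversed(W):
        lo = C(a) + O(a, lo)
        hi = C(a) + O(a, hi)
        if lo >= hi:
            return None, []
    return (lo, hi - 1), sorted(sa[lo:hi])


print(exact_match("ct", "actgct"))
print(exact_match("gcc", "actgct"))
```

The interval is returned in closed form so that it can be compared with the interval labels used in the slides.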

9 Simulated data
Accuracy
– BWA is more accurate than Bowtie and SOAPv2 based on criterion 1.
Speed
– BWA is the second fastest, behind only SOAPv2.
Memory
– MAQ’s memory footprint is 1 GB, but it increases linearly with the number of reads to be aligned.
– BWA uses only 2.3 GB for single-end mapping and 3 GB for paired-end mapping (as much as Bowtie).
– SOAPv2 uses 5.4 GB.

10 Differences between short-read and long-read alignment
Short-read alignment
– Align full-length read
– Efficient for ungapped alignment or limited gaps
Long-read alignment
– Find local matches
– Permissive about alignment gaps

11 Motivations
Many programs exist for short sequencing reads.
Not many exist for reads >200 bp: BLAT, SSAHA2.
New platforms are producing longer sequences: Roche/454 >400 bp, Illumina >100 bp, Pacific Biosciences >1000 bp.
Fast and accurate long-read alignment with Burrows-Wheeler transform.
New algorithm: Burrows Wheeler Aligner’s Smith-Waterman Alignment (BWA-SW).

12 Before NGS
– FASTA 1988
– BLAST 1997
– MegaBLAST 2000
– SSAHA2 2001
– BLAT 2002
After NGS
– SOAP 2008
– MAQ 2008
– Bowtie 2009
– BWA 2009
– BWA-SW 2010

13 Burrows Wheeler Aligner’s Smith-Waterman Alignment (BWA-SW)
Overview
Algorithm
(1) Build FM-indices for the reference and query sequences
(2) Represent the reference in a prefix trie
(3) Represent the query in a prefix DAWG (directed acyclic word graph) transformed from the prefix trie of the query sequence
[Figure: prefix trie and prefix DAWG of the string GOOGOL. ‘∧’ marks the start of a string. The two numbers in a node give the SA interval of the node.]
Example:
a. Three nodes have SA interval [4,4]
b. Their parents have intervals [1,2], [1,2] and [1,1]
In the prefix DAWG, the [4,4] node has parents [1,2] and [1,1].
Node [4,4] represents the strings ‘OG’, ‘OGO’, ‘OGOL’.
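The GOOGOL example can be checked by computing the SA interval of every substring and grouping substrings that share an interval, which is the grouping a prefix DAWG node stands for. This is a brute-force sketch for illustration only; `sa_interval` and `group_by_interval` are names assumed here, and a real implementation would use the FM-index instead of scanning sorted suffixes.

```python
from collections import defaultdict


def sa_interval(pattern, text):
    """Closed SA interval of `pattern` in `text` (which must end with '$')."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    rows = [k for k, s in enumerate(suffixes) if s.startswith(pattern)]
    # Suffixes sharing a prefix are contiguous in sorted order.
    return (rows[0], rows[-1]) if rows else None


def group_by_interval(text):
    """Map each SA interval to the sorted list of substrings that have it."""
    groups = defaultdict(set)
    n = len(text) - 1                    # exclude the trailing '$'
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            groups[sa_interval(sub, text)].add(sub)
    return {iv: sorted(subs) for iv, subs in groups.items()}


nodes = group_by_interval("GOOGOL$")
for interval in sorted(nodes):
    print(interval, nodes[interval])
```

The group for interval (4, 4) contains exactly the three strings named on the slide, and the intervals of ‘G’, ‘GO’ and ‘GOL’ match the parent intervals listed in the example.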

14 Burrows Wheeler Aligner’s Smith-Waterman Alignment (BWA-SW)
Overview
Algorithm
(4) Dynamic programming with heuristics to accelerate the algorithm
Heuristic rules:
A) Restrict the dynamic programming to regions around good matches only
B) Report only alignments that are largely non-overlapping
The result of these heuristics is a saving in computing time.
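For reference, the dynamic programming being accelerated is Smith-Waterman local alignment. The sketch below is the plain textbook recurrence between two strings with a linear gap penalty, not the BWA-SW procedure itself, which runs the dynamic programming between the prefix trie and the prefix DAWG. The scoring values and the name `smith_waterman_score` are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-5):
    """Best local alignment score between strings a and b.

    Textbook Smith-Waterman with a linear gap penalty; only two rows of
    the score matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The full matrix costs time proportional to the product of the two lengths, which is why restricting the computation to regions around good matches saves so much work.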

15 Burrows Wheeler Aligner’s Smith-Waterman Alignment (BWA-SW)
Heuristic strategies for acceleration
(1) Z-best: traverse G(W) in the outer loop and T(X) in the inner loop, and at each node u in G(W) keep only the top Z best-scoring nodes in T(X) that match u, rather than keeping all the matching nodes.
Where:
– G(W): prefix DAWG of query sequence W
– T(X): prefix trie for reference sequence X
– u: root of G(W)
(2) Take only the best few alignments covering each region of the query sequence.
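The Z-best rule amounts to a top-Z selection at each query node. The sketch below shows only that pruning step on made-up data; the representation of a match as a (trie node id, score) pair and the name `keep_z_best` are hypothetical and are not taken from the BWA-SW source.

```python
import heapq


def keep_z_best(matches, z):
    """Keep the z highest-scoring matches for one query node u.

    `matches` is a list of (trie_node_id, score) pairs; both the pair
    layout and the node ids are hypothetical, for illustration only.
    """
    return heapq.nlargest(z, matches, key=lambda m: m[1])


# Hypothetical candidate matches for a single node u of G(W).
candidates = [("n1", 3), ("n2", 7), ("n3", 5), ("n4", 1)]
print(keep_z_best(candidates, 2))
```

A smaller Z discards more candidates and runs faster, at the risk of dropping the node that would have led to the best alignment.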

16 Result
The implementation of BWA-SW takes a BWA index and a query FASTA or FASTQ file as inputs.
Aligning typical sequencing reads requires less than 4 GB of memory.
The peak memory is 6.4 GB in total on one query sequence with 1 million base pairs.

17 Simulated data
Speed
– BWA-SW is the fastest, and its speed is not sensitive to the read length or error rates.
Memory
– BWA-SW uses about 4 GB (as much as BLAT).
– SSAHA2 uses 2.4 GB for >=500 bp reads, and 5.3 GB for shorter reads.
– BWA-SW supports multi-threading while SSAHA2 and BLAT do not.
Accuracy
– BWA-SW can detect chimeric reads, and produces fewer false chimeric reads given lower base error rates.

18 Conclusion
Short-read alignment cannot be used for long-read alignment due to:
– Full-length read vs local matches.
– Ungapped or limited gaps vs larger number of gaps.
BWA-short is more accurate, uses less memory, and is competitively fast.
BWA-long is the best available in speed, accuracy and memory.

19 Questions?

