Group 1 (1)陳伊瑋 (2)沈國曄 (3)唐婉馨 (4)吳彥緯 (5)魏銘良

Presentation transcript:

Final Presentation: Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform. Group 1: (1)陳伊瑋 (2)沈國曄 (3)唐婉馨 (4)吳彥緯 (5)魏銘良

Outline: Introduction & Background review; Prefix trie and Burrows-Wheeler transform; Exact Matching; Inexact Matching; Result & Conclusion; Reference

Introduction (1/3) [1] Motivation: sequencing produces a very large number of reads (50~200 million reads of 32-100 bp each), and the reference sequence is already determined.

Introduction (2/3) [2] BLAST/BLAT. Suffix array: requires 12GB for the human genome. ※ A new alignment algorithm is required.

Introduction (3/3) [1] Four categories of algorithms for this problem:
Category | Representative | Pros | Cons
Hash the read sequence | MAQ | Flexible memory footprint | No multi-threading
Hash the genome | ReSEQ | Easy multi-threading | Large memory
Merge-sorting sequences | Malhis | *** | Hard for pairing
Burrows-Wheeler Transform | Bowtie | Relatively small memory footprint |

Comparison
Feature | Speed | Memory
Hash read sequence | No multi-threading | Memory footprint
Hash genome | Multi-threading | Large
Merge sorting | Fast (no pairing) |
BWT | | Smaller memory footprint
(Note: to be improved; find a figure.)
Based on BWT, an inexact matching algorithm is proposed.


Prefix trie of the string ‘GOOGOL’

2.1 Prefix trie and string matching. The symbol ^ marks the start of the string. The numbers in a node give the suffix array interval of the string that the node represents. The dashed line shows the route of the brute-force search for the query string ‘LOL’, allowing at most one mismatch.

Testing whether a query W is an exact substring of X can be done in O(|W|) time. To allow mismatches, we can exhaustively traverse the trie. We will show later how to accelerate this search by using prefix information of W.

Suffix of string ‘GOOGOL’

2.2 Burrows-Wheeler transform (BWT)

Define some variables. A string X = a_0 a_1 ... a_{n-1} always ends with the symbol $. X[i] = a_i is the i-th symbol; X[i, j] = a_i ... a_j is a substring of X; X_i = X[i, n-1] is a suffix of X. Suffix array S: S(i) is the start position of the i-th smallest suffix. BWT string B: B[i] = $ when S(i) = 0, and B[i] = X[S(i) - 1] otherwise.
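The definitions of S and B can be checked on the running example X = googol$. The following is a minimal sketch, not the construction used by BWA: it sorts the suffixes naively, and the function names `suffix_array` and `bwt` are chosen here for illustration only.

```python
def suffix_array(x):
    """S(i) = start position of the i-th smallest suffix of x (naive sort)."""
    # '$' sorts before the lowercase letters in ASCII, as the definitions require.
    return sorted(range(len(x)), key=lambda i: x[i:])


def bwt(x, S):
    """B[i] = '$' when S(i) = 0, otherwise x[S(i) - 1]."""
    return "".join("$" if s == 0 else x[s - 1] for s in S)


X = "googol$"
S = suffix_array(X)
B = bwt(X, S)
print(S)  # [6, 3, 0, 5, 2, 4, 1]
print(B)  # lo$oogg
```

The naive sort is fine for a seven-symbol string but is far too slow and memory-hungry for a genome, which is why the slides turn to space-efficient construction next.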

In practice, we usually construct the suffix array first and then generate the BWT. Most algorithms for constructing a suffix array require at least n⌈log2 n⌉ bits of working space, which amounts to 12GB for the human genome. Hon et al. (2007) gave a new algorithm that requires less than 1GB of memory at peak for constructing the BWT of the human genome. This algorithm is implemented in BWT-SW (Lam et al., 2008). The authors of the paper adapted its source code to make it work with BWA. [3][4]

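2.3 Suffix array interval and sequence alignment. The interval [\underline{R}(W), \overline{R}(W)] is called the suffix array (SA) interval of W, where \underline{R}(W) and \overline{R}(W) are the smallest and largest k such that W is a prefix of X_{S(k)}. The set of positions of all occurrences of W in X is { S(k) : \underline{R}(W) <= k <= \overline{R}(W) }.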

For example, in X = googol$ the SA interval of the string ‘go’ is [1, 2]. The suffix array values in this interval are 3 and 0, which give the positions of all the occurrences of ‘go’. Sequence alignment is equivalent to searching for the SA intervals of substrings of X that match the query. For the exact matching problem, we can find only one such interval. For the inexact matching problem, there may be many.
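The ‘go’ example can be reproduced by scanning the sorted suffixes directly. This is only an illustrative sketch: the helper name `sa_interval` is hypothetical, and the linear scan stands in for the faster search developed in the exact-matching part.

```python
def sa_interval(x, w):
    """Return ((lo, hi), positions) for query w in text x, or (None, [])."""
    S = sorted(range(len(x)), key=lambda i: x[i:])
    # Suffixes that start with w are contiguous in sorted order.
    hits = [k for k, s in enumerate(S) if x[s:].startswith(w)]
    if not hits:
        return None, []
    lo, hi = hits[0], hits[-1]
    return (lo, hi), [S[k] for k in range(lo, hi + 1)]


interval, positions = sa_interval("googol$", "go")
print(interval)   # (1, 2)
print(positions)  # [3, 0]
```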


Review X = googol$. \underline{R}(W) = min { k : W is a prefix of X_{S(k)} }; \overline{R}(W) = max { k : W is a prefix of X_{S(k)} }. Example: \underline{R}(go) = 1, \overline{R}(go) = 2.

Definition X = googol$. C(a): the number of symbols in X[0, n-2] that are lexicographically smaller than a ∈ Σ. C(g) = 0, C(l) = 2, C(o) = 3.
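A short sketch of how C could be tabulated for the example. The function name `count_smaller` is an assumption made for illustration.

```python
def count_smaller(x):
    """C(a) = number of symbols in x[0, n-2] that are smaller than a."""
    body = x[:-1]  # X[0, n-2]: the text without the trailing '$'
    return {a: sum(1 for c in body if c < a) for a in sorted(set(body))}


C = count_smaller("googol$")
print(C)  # {'g': 0, 'l': 2, 'o': 3}
```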

Definition X = googol$, B = lo$oogg. O(a, i): the number of occurrences of a in B[0, i]. O(g, i) = 0 for 0 <= i <= 4, 1 for i = 5, 2 for i = 6. O(l, i) = 1 for 0 <= i <= 6. O(o, i) = 0 for i = 0, 1 for 1 <= i <= 2, 2 for i = 3, 3 for 4 <= i <= 6.
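The occurrence counts can be tabulated by one pass over B per symbol. The sketch below assumes B = lo$oogg, which is what the definition B[i] = X[S(i) - 1] gives for X = googol$; the name `occurrence_table` is hypothetical.

```python
def occurrence_table(B):
    """O[a][i] = number of occurrences of symbol a in B[0..i]."""
    table = {}
    for a in sorted(set(B) - {"$"}):
        running, row = 0, []
        for c in B:
            if c == a:
                running += 1
            row.append(running)
        table[a] = row
    return table


O = occurrence_table("lo$oogg")
print(O["g"])  # [0, 0, 0, 0, 0, 1, 2]
print(O["l"])  # [1, 1, 1, 1, 1, 1, 1]
print(O["o"])  # [0, 1, 1, 2, 3, 3, 3]
```

Storing the full table costs one counter per symbol per position. A real index would normally sample it, but that is outside what the slides cover.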

Definition X = googol$. \underline{R}(aW) = C(a) + O(a, \underline{R}(W) − 1) + 1 and \overline{R}(aW) = C(a) + O(a, \overline{R}(W)). Running example: W = go, aW = ogo.

Meaning X = googol$. \underline{R}(aW) = C(a) + O(a, \underline{R}(W) − 1) + 1 and \overline{R}(aW) = C(a) + O(a, \overline{R}(W)), with W = go, aW = ogo and C(o) = 3. If \overline{R}(aW) − \underline{R}(aW) >= 0, then aW is a substring of X.

Example X = googol$, W = go, aW = ogo. C(o) = 3, \underline{R}(W) = 1, \overline{R}(W) = 2. Lower bound: O(o, 0) = 0, so \underline{R}(ogo) = C(o) + O(o, \underline{R}(W) − 1) + 1 = 3 + 0 + 1 = 4. Upper bound: O(o, 2) = 1, so \overline{R}(ogo) = C(o) + O(o, \overline{R}(W)) = 3 + 1 = 4. Since \overline{R}(aW) − \underline{R}(aW) = 4 − 4 = 0, ogo is a substring of X, and it occurs at position S(4) = 2.
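Repeating the update once per query symbol, from the last symbol to the first, gives the full exact-matching search. The sketch below is an illustration under stated assumptions rather than BWA's code: the name `backward_search` is hypothetical, O(a, i) is recomputed by counting instead of being looked up, and the search starts from the interval [0, n-1] of the empty string (an assumption of this sketch, chosen so that the symbol just before $ is counted correctly).

```python
def backward_search(x, w):
    """Return ((k, l), positions) for exact matches of w in x, or None."""
    n = len(x)
    S = sorted(range(n), key=lambda i: x[i:])                # suffix array
    B = "".join("$" if s == 0 else x[s - 1] for s in S)      # BWT string
    body = x[:-1]
    C = {a: sum(1 for c in body if c < a) for a in set(body)}

    def O(a, i):
        # Occurrences of a in B[0..i]; O(a, -1) is 0.
        return B[: i + 1].count(a)

    k, l = 0, n - 1  # SA interval of the empty string (assumed starting point)
    for a in reversed(w):
        if a not in C:
            return None
        k = C[a] + O(a, k - 1) + 1   # lower bound of the interval of aW
        l = C[a] + O(a, l)           # upper bound of the interval of aW
        if k > l:
            return None              # aW is not a substring of x
    return (k, l), sorted(S[k : l + 1])


print(backward_search("googol$", "go"))   # ((1, 2), [0, 3])
print(backward_search("googol$", "ogo"))  # ((4, 4), [2])
print(backward_search("googol$", "lol"))  # None
```

Each loop iteration does a constant number of table operations when O is precomputed, which is where the O(|W|) bound for exact matching comes from.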


Between Exact & Inexact Matching. Exact: find all exact substrings (get positions). Inexact: find all similar substrings (get positions), with bounded differences (insertion/deletion/mismatch). Reference string X: Bob spent all his money on a game called “monkey money”. Query string W: money.
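The difference between the two problems can be seen on the “money” sentence. The sketch below is a simplification made for illustration: it compares whole words rather than arbitrary substrings, and it uses a textbook edit-distance routine (the name `edit_distance` is chosen here), with at most one allowed difference.

```python
import re


def edit_distance(a, b):
    """Number of insertions, deletions and mismatches needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # drop ca
                           cur[j - 1] + 1,               # add cb
                           prev[j - 1] + (ca != cb)))    # match or mismatch
        prev = cur
    return prev[-1]


X = 'Bob spent all his money on a game called "monkey money"'
W = "money"

# Exact matching: start positions of every verbatim occurrence of W.
exact = [m.start() for m in re.finditer(re.escape(W), X)]

# Inexact matching (word-level simplification): words within one difference.
similar = [(m.start(), m.group()) for m in re.finditer(r"[A-Za-z]+", X)
           if edit_distance(m.group(), W) <= 1]

print(exact)
print(similar)
```

Exact matching reports only the two occurrences of “money”, while the inexact variant also reports “monkey”, which differs by a single inserted letter.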

An artificial example. Reference string X: TTAACGTTTATTACGTTTAAGTTTAACCTT. Query string W: AACG. Allowed differences: 2.

Straightforward ideas. Reference string X: TTAACGTTTAACTTGTTTAAGTTTAACCTT. Query string W: AACG. Allowed differences: 2. To follow the procedure of exact matching, we scan W from right to left. We start with a budget of 2. The budget decreases by 1 each time a difference occurs. We stop when the budget is exhausted or W is fully scanned.

Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACG Query string: W Allowed differences: 2


Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACTTG AACG Query string: W Allowed differences: 2



Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT ? AACG Query string: W Allowed differences: 2

Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAAGTTTAACCTT AACG Query string: W Allowed differences: 2

Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG Query string: W Allowed differences: 2


Before illustrating the algorithm, recall the “magic” we know from exact matching. In O(|W|) time, we can find all positions of W (X: googol$, W: go). In O(1) time, we can find all updated positions after one symbol is prepended (X: googol$, W: ogo). The magic “2 numbers” (the SA interval) represent all positions.

INEXRECUR(W,i,z,k,l) Algorithm: a recursive function. W: the query string (here AACG). This recursion handles W[i]. z: the remaining budget. (k,l): the interval from the previous step.

INEXRECUR(W,i,z,k,l) When W is fully scanned: return the acceptable interval.

INEXRECUR(W,i,z,k,l) Reference string X: TTAACGTTTAACTTGTTTAA-GTTTAACCTT. Query string W: AACG. The set I is ready to collect all similar intervals. Case: insertion to X.

Reference string X: TTAACGTTTAACTTGTTTAA-GTTTAACCTT. Query string W: AACG. Case: deletion from X.

Reference string X: TTAACGTTTAACTTGTTTAA-GTTTAACCTT. Query string W: AACG. Case: match.

Reference string X: TTAACGTTTAACTTGTTTAAGTTTAACCTT. Query string W: AACG. Case: mismatch.

Inexact matching: the call INEXRECUR(W,|W|-1,allowed_diff,1,|X|-1) returns the SA intervals of all inexact matches.
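The four cases (insertion, deletion, match, mismatch) and the budget z can be put together in a short recursive sketch. This is an illustration under stated assumptions, not BWA's implementation: the names `inexact_search`, `inex_recur` and `occ` are hypothetical, no pruning is applied, the alphabet is taken from the text itself, and the initial interval is [0, |X|-1] rather than [1, |X|-1] so that the symbol just before $ is counted correctly in this sketch.

```python
def inexact_search(x, w, z):
    """Return the set of SA intervals (k, l) matching w with at most z differences."""
    n = len(x)
    S = sorted(range(n), key=lambda i: x[i:])
    B = "".join("$" if s == 0 else x[s - 1] for s in S)
    body = x[:-1]
    C = {a: sum(1 for c in body if c < a) for a in set(body)}

    def occ(a, i):
        # O(a, i): occurrences of a in B[0..i]; O(a, -1) is 0.
        return B[: i + 1].count(a)

    def inex_recur(i, z, k, l):
        if z < 0:
            return set()          # budget exhausted
        if i < 0:
            return {(k, l)}       # W fully scanned: accept this interval
        found = set()
        # Skip W[i] without consuming a text symbol
        # (the case labelled "insertion to X" on the slides).
        found |= inex_recur(i - 1, z - 1, k, l)
        for b in sorted(C):
            nk = C[b] + occ(b, k - 1) + 1
            nl = C[b] + occ(b, l)
            if nk <= nl:
                # Consume a text symbol but keep W[i]
                # (the case labelled "deletion from X" on the slides).
                found |= inex_recur(i, z - 1, nk, nl)
                if b == w[i]:
                    found |= inex_recur(i - 1, z, nk, nl)      # match
                else:
                    found |= inex_recur(i - 1, z - 1, nk, nl)  # mismatch
        return found

    return inex_recur(len(w) - 1, z, 0, n - 1)


X = "googol$"
print(inexact_search(X, "go", 0))           # {(1, 2)}
print(sorted(inexact_search(X, "lol", 1)))  # includes (1, 1), the interval of 'gol'
```

With z = 0 the recursion reduces to exact matching, and with z = 1 the query ‘lol’ reaches the interval of ‘gol’ through a single mismatch, as in the prefix trie example.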


Implementation. BWA was implemented to do short read alignment based on the BWT of the reference genome. BWA is freely available at the MAQ website: http://maq.sourceforge.net. Output format: SAM (Sequence Alignment/Map format). SAMtools: extracts alignments in a region, merges/sorts alignments, gets SNP/indel calls and visualizes the alignment (http://samtools.sourceforge.net).

Evaluated programs: BWA; MAQ (Li et al., 2008a); SOAPv2; Bowtie 0.9.9.2 (Langmead et al., 2009).

Evaluation on simulated data. Human genome with 0.09% SNP mutation rate, 0.01% indel mutation rate and 2% uniform sequencing base error rate. Measures: CPU time in seconds on a single core of a 2.5GHz Xeon E5420 processor (Time); percent confidently mapped reads (Conf); percent erroneous alignments out of confident mappings (Err).

Bowtie-32bp: 151 sec, Err 6.4%. SOAP-2.1.7: longer than 35bp. SOAP-2.0.1: is better with 32bp. SOAPv2: 5.4GB. Bowtie, BWA: 2.3GB~3GB. MAQ: 1GB. MAQ: for 128bp.

Evaluation on real data. Human genome: 12.2 million pairs of 51bp reads from the European Read Archive (AC:ERR000589). These reads were produced by Illumina for NA12750, a male included in the 1000 Genomes Project (http://www.1000genomes.org). Measures: CPU time in hours on a single core of a 2.5GHz Xeon E5420 processor (Time); percent confidently mapped reads (Conf); percent confident mappings with the mates mapped in the correct orientation and within 300bp (Paired).

slower. BWA: 6.3 hr, 89.2%, 99.2%. Wrong with human-chicken hybrid: Bowtie: 2,640; BWA: 2,942; MAQ: 3,005; SOAPv2: 4,531. BWA: 0.06% (=2942*4/12.2M/0.889).

DISCUSSION. BWA was implemented. BWA outputs alignments in the SAM format to take advantage of the downstream analyses implemented in SAMtools. It was evaluated on simulated data and on real data. BWA is faster than MAQ, with similar alignment accuracy.


Reference
[1] Heng Li and Richard Durbin, “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform”, The Wellcome Trust Sanger Institute, 2009.
[2] Bioinformatics for High-throughput sequencing. http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/NGS_Overview_Simon_Nicolas.pdf
[3] Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., and Yiu, S.-M. (2007). A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica, 48:23–36.
[4] Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K., and Yiu, S. M. (2008). Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791–797.