Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen.



Motivation
Burrows-Wheeler Transformation (BWT) of a large text allows:
– Fast exact matching
– Compact representation (compared to suffix tree/array)
– More readily compressible (basis of bzip)
The FM Index exploits an indexed and compressed BWT to allow:
– Exact matching in time linear in the size of the pattern
– Memory footprint as much as 50% smaller than the original string
FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation, without spilling onto the hard disk

Background
Recall that the BWT is derived from the Burrows-Wheeler matrix, which is related to the suffix array
[Figure: the text acaacg$, its suffix array, the Burrows-Wheeler matrix, and the matrix's last column, which is the BWT gc$aaac]
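The relationship on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names are made up here, and both constructions use naive sorting that is only practical for tiny inputs such as the slide's example text.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_matrix(text):
    """BWT as the last column of the sorted matrix of rotations."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def bwt_from_sa(text, sa):
    """BWT[i] is the character just before suffix SA[i], wrapping at 0."""
    return "".join(text[i - 1] for i in sa)


text = "acaacg$"          # '$' is a unique terminator that sorts first
sa = suffix_array(text)
print(sa)                       # [6, 2, 0, 3, 1, 4, 5]
print(bwt_from_sa(text, sa))    # gc$aaac
print(bwt_from_matrix(text))    # gc$aaac
```

Because the terminator is unique and smallest, sorting rotations and sorting suffixes give the same row order, which is why the two routes agree.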

Problem
Memory footprint of building and storing the suffix array is much larger than the BWT itself
– Human genome: SA: ~12 GB, BWT: ~0.8 GB
– An attempt to build the BWT over the whole human genome on a 32 GB server exhausts memory and crashes (I tried)
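The two figures above are consistent with a back-of-the-envelope estimate. The sketch below assumes values the slide does not state: a genome of roughly 3.1 billion bases, one 32-bit suffix array entry per base, and a BWT packed at 2 bits per base.

```python
GENOME_LENGTH = 3_100_000_000   # assumption: about 3.1 billion bases
BYTES_PER_SA_ENTRY = 4          # assumption: 32-bit text offsets
BITS_PER_BASE = 2               # assumption: A, C, G, T packed in 2 bits

sa_bytes = GENOME_LENGTH * BYTES_PER_SA_ENTRY
bwt_bytes = GENOME_LENGTH * BITS_PER_BASE // 8

sa_gb = sa_bytes / 1e9
bwt_gb = bwt_bytes / 1e9
print(f"SA  ~ {sa_gb:.1f} GB")
print(f"BWT ~ {bwt_gb:.2f} GB")
print(f"ratio ~ {sa_bytes / bwt_bytes:.0f}x")
```

Under these assumptions the suffix array is 16 times larger than the packed BWT, before counting any working memory the sort itself needs.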

Solution
Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting”
– Theoretical Computer Science, 387 (3), pp , Sept
Observation:
– BWT[i] depends only on SA[i], not on any other element of SA
Corollary:
– No need to keep all of SA in memory at once!
Solution:
– Build SA and BWT a small “chunk” or “block” at a time
– Greatly reduces the memory overhead, by something like a factor of B, where B = # of blocks

Solution
Typical suffix sort: [Figure]

Solution
Blockwise suffix sort: [Figure]

Solution
Calculate and sort a random sample of the suffixes

Solution
Samples are used as “bookends” for “buckets”
[Figure: the sorted samples split the suffixes into buckets B1, B2, B3, B4]

Solution
In B linear-time passes over the text (B = # buckets), collect the suffixes that fall into one bucket per pass, then sort that bucket
[Figure: Pass 1 fills and sorts bucket B1; buckets B2, B3, B4 are handled in later passes]

Solution
After a bucket has been sorted and turned into a BWT segment, it is discarded
[Figure: Pass B handles the last bucket; earlier buckets have already been emitted and discarded]
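The sample, bucket, and discard steps from the preceding slides can be sketched as follows. This is a toy illustration under stated assumptions, not Kärkkäinen's algorithm as published: the name `blockwise_bwt` and its parameters are hypothetical, suffixes are compared by slicing (so a pass is not really linear-time), and each bucket is sorted with Python's built-in sort.

```python
import random


def blockwise_bwt(text, num_buckets, seed=0):
    """Build the BWT one bucket at a time; text must end in a unique '$'."""
    n = len(text)
    rng = random.Random(seed)

    # Calculate and sort a random sample of the suffixes (by start position).
    num_samples = min(max(num_buckets - 1, 0), n)
    samples = sorted(rng.sample(range(n), num_samples), key=lambda i: text[i:])

    # Samples act as bookends; None marks an open end.
    bookends = [None] + samples + [None]

    segments = []
    for lo, hi in zip(bookends, bookends[1:]):
        # One pass over the text: keep only suffixes in the range (lo, hi].
        bucket = [
            i for i in range(n)
            if (lo is None or text[lo:] < text[i:])
            and (hi is None or text[i:] <= text[hi:])
        ]
        bucket.sort(key=lambda i: text[i:])
        # Turn the sorted bucket into a BWT segment; the bucket list itself
        # is dropped when the next pass rebinds the name.
        segments.append("".join(text[i - 1] for i in bucket))
    return "".join(segments)


print(blockwise_bwt("acaacg$", num_buckets=3))  # gc$aaac
```

Only one bucket of suffix positions is alive at any time, so peak memory for the suffix array part shrinks roughly in proportion to the number of buckets, at the cost of one pass over the text per bucket.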

Solution
Good time bounds in the presence of long repeats require use of a difference cover sample
– Acts like an oracle that determines the relative lexicographical order of two suffixes that share a prefix of some length v

Project Goals
Basic goal:
– Write a correct, usable library implementing blockwise SA sort and BWT building
– Characterize performance and time/space tradeoffs
Stretch goals:
– Fine-tune for performance and memory usage
– Implement difference cover sample
   Question: is this necessary for good performance on real-life inputs?

Concluding Remarks
BWT is one application of blockwise suffix sort, but any information derived locally from SA rows (e.g. LCP information) can be computed more space-efficiently this way