Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler
Ben Langmead
Based on work by Juha Kärkkäinen

Motivation
The Burrows-Wheeler Transformation (BWT) of a large text allows:
– Fast exact matching
– Compact representation (compared to suffix tree/array)
– More readily compressible output (basis of bzip2)
The FM Index exploits an indexed and compressed BWT to allow:
– Exact matching in time linear in the size of the pattern
– Memory footprint as much as 50% smaller than the original string
The FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk

Background
Recall that the BWT is derived from the Burrows-Wheeler matrix, which is related to the suffix array
[Figure: suffix array and Burrows-Wheeler matrix for the text acaacg$; the last column of the matrix is the BWT, gc$aaac]
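The relationship on this slide can be checked with a small sketch. It uses a naive sort of whole suffixes, which is fine for a toy string but not how a real tool would do it; the function names are made up for illustration, and `$` is assumed to sort before every other character.

```python
def suffix_array(text):
    """Naive suffix array: start offsets of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] is the character just before suffix SA[i], wrapping around at 0."""
    return "".join(text[i - 1] for i in sa)


def bw_matrix(text):
    """Sorted rotations of the text; row i starts with the suffix at SA[i]."""
    return sorted(text[i:] + text[:i] for i in range(len(text)))


if __name__ == "__main__":
    text = "acaacg$"
    sa = suffix_array(text)
    for offset, row in zip(sa, bw_matrix(text)):
        print(offset, row)
    print("BWT:", bwt_from_sa(text, sa))
```

The last column of the printed matrix and the string returned by `bwt_from_sa` are the same, which is the link between the matrix and the suffix array that the slide refers to.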

Problem
The memory footprint of building and storing the suffix array is much larger than the BWT itself
– Human genome: SA: ~12 GB, BWT: ~0.8 GB
– An attempt to build the BWT over the whole human genome on a 32 GB server exhausts memory and crashes (I tried)
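A back-of-the-envelope calculation shows where figures of this size can come from. The inputs are assumptions, not taken from the slides: roughly 3.1 billion bases, one 4-byte integer per suffix array entry, and 2 bits per BWT character.

```python
# Assumed inputs (not from the slides): ~3.1 billion bases, 32-bit SA entries,
# and a 2-bit encoding per nucleotide for the BWT.
GENOME_LENGTH = 3_100_000_000


def suffix_array_bytes(n, bytes_per_entry=4):
    """Size of a suffix array holding one fixed-width integer per text position."""
    return n * bytes_per_entry


def packed_bwt_bytes(n, bits_per_char=2):
    """Size of a BWT stored with a fixed number of bits per character."""
    return (n * bits_per_char + 7) // 8


if __name__ == "__main__":
    sa = suffix_array_bytes(GENOME_LENGTH)
    bwt = packed_bwt_bytes(GENOME_LENGTH)
    print(f"SA:  {sa / 1e9:.1f} GB")
    print(f"BWT: {bwt / 1e9:.1f} GB")
```

Under these assumptions the suffix array is 16 times larger than the packed BWT, before counting any working memory the sort itself needs.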

Solution
Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting”
– Theoretical Computer Science, 387 (3), pp. 249-257, Sept. 2007
Observation:
– BWT[i] depends only on SA[i], not on any other element of SA
Corollary:
– No need to keep all of SA in memory at once!
Solution:
– Build SA and BWT a small “chunk” or “block” at a time
– Greatly reduces the memory overhead
  – By something like a factor of B, where B = # of blocks
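The observation can be illustrated in a few lines: each BWT segment is computed from its own chunk of SA alone, and the segments concatenate to the full BWT. In this sketch the chunks are cut from a precomputed suffix array purely for demonstration (the point of the blockwise method is to avoid ever holding the full array), and the helper names are hypothetical.

```python
def bwt_segment(text, sa_chunk):
    """BWT characters for one chunk of SA; needs no other SA entries."""
    return "".join(text[i - 1] for i in sa_chunk)


def chunks(seq, size):
    """Yield consecutive slices of seq with at most `size` elements each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


if __name__ == "__main__":
    text = "acaacg$"
    # Demonstration only: a real blockwise build never materializes the full SA.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    segments = [bwt_segment(text, c) for c in chunks(sa, 3)]
    print(segments, "".join(segments))
```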

Solution Typical suffix sort:

Solution Blockwise suffix sort:

Solution Calculate and sort a random sample of the suffixes

Solution
Samples are used as “bookends” for “buckets”
[Figure: buckets B1, B2, B3, B4]
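A minimal sketch of the sampling and bucketing idea, assuming the samples are drawn uniformly at random and that a suffix is assigned to a bucket by binary search against the sorted samples. Function names are hypothetical, and suffixes are compared as whole Python strings, which only makes sense for toy inputs.

```python
import bisect
import random


def choose_splitters(text, count, seed=0):
    """Pick `count` random suffix offsets and sort them by the suffix they start."""
    rng = random.Random(seed)
    sample = rng.sample(range(len(text)), min(count, len(text)))
    return sorted(sample, key=lambda i: text[i:])


def bucket_of(text, i, splitters):
    """Index of the bucket holding suffix i: 0 is before the first splitter,
    len(splitters) is after the last. A splitter falls in the bucket to its right."""
    splitter_suffixes = [text[s:] for s in splitters]
    return bisect.bisect_right(splitter_suffixes, text[i:])


if __name__ == "__main__":
    text = "acaacg$"
    splitters = choose_splitters(text, 3)
    for i in range(len(text)):
        print(i, text[i:], "-> bucket", bucket_of(text, i, splitters))
```

With k samples there are k + 1 buckets, so three samples give the four buckets B1 to B4 shown on the slide.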

Solution
In B linear-time passes over the text (B = # of buckets), collect the suffixes that belong to one bucket per pass, then sort that bucket
[Figure: Pass 1 over buckets B1, B2, B3, B4]

Solution
After a bucket has been sorted and turned into a BWT segment, it is discarded
[Figure: Pass B over buckets B1, B2, B3, B4]
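Putting the passes together, a toy version of the whole procedure might look like the following. It is a sketch under simplifying assumptions, not the implementation from the paper or the project: suffixes are compared as whole Python strings (so a pass is not truly linear time), the sample is uniform random, and all names are invented for illustration.

```python
import bisect
import random


def naive_bwt(text):
    """Reference BWT built from a full suffix array, for comparison only."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)


def blockwise_bwt_segments(text, num_buckets, seed=0):
    """Yield the BWT one bucket at a time; only one bucket's suffixes are held."""
    n = len(text)
    rng = random.Random(seed)
    sample = rng.sample(range(n), min(max(num_buckets - 1, 0), n))
    splitter_suffixes = sorted(text[s:] for s in sample)
    for b in range(len(splitter_suffixes) + 1):
        # One pass over the text: keep only the suffixes that land in bucket b.
        bucket = [i for i in range(n)
                  if bisect.bisect_right(splitter_suffixes, text[i:]) == b]
        bucket.sort(key=lambda i: text[i:])
        # Emit this bucket's BWT segment; the bucket is replaced on the next pass.
        yield "".join(text[i - 1] for i in bucket)


def blockwise_bwt(text, num_buckets, seed=0):
    """Concatenate the per-bucket segments into the full BWT."""
    return "".join(blockwise_bwt_segments(text, num_buckets, seed))


if __name__ == "__main__":
    print(blockwise_bwt("acaacg$", 4))
    print(naive_bwt("acaacg$"))
```

Because the buckets are visited in sorted order, appending each segment as it is produced (for example to a file) yields the BWT without ever holding more than one bucket's worth of suffix offsets.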

Solution
Good time bounds in the presence of long repeats require use of a difference cover sample
– Acts like an oracle that determines the relative lexicographical order of two suffixes that share a prefix of some length v
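The property that makes the oracle work can be shown on a small example. A set D is a difference cover modulo v if every residue 0..v-1 is a difference of two elements of D; then for any two suffix offsets i and j there is a shift k < v that moves both onto sampled positions, so their order is settled after at most k character comparisons plus one lookup in the sorted sample. The set {1, 2, 4} modulo 7 used below is a standard small example chosen for illustration, not a parameter from the slides, and the function names are hypothetical.

```python
def is_difference_cover(d, v):
    """True if every residue modulo v is a difference of two elements of d."""
    return {(a - b) % v for a in d for b in d} == set(range(v))


def shift_to_sample(i, j, d, v):
    """Smallest k in [0, v) such that i + k and j + k are both sampled,
    i.e. both are congruent modulo v to an element of d."""
    members = set(d)
    for k in range(v):
        if (i + k) % v in members and (j + k) % v in members:
            return k
    raise ValueError("d is not a difference cover modulo v")


if __name__ == "__main__":
    d, v = {1, 2, 4}, 7
    print(is_difference_cover(d, v))
    print(shift_to_sample(0, 3, d, v))
```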

Project Goals
Basic goals:
– Write a correct, usable library implementing blockwise SA sort and BWT building
– Characterize performance and time/space tradeoffs
Stretch goals:
– Fine-tune for performance and memory usage
– Implement difference cover sample
  – Question: is this necessary for good performance on real-life inputs?

Concluding Remarks BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way
