The Burrows-Wheeler Transform: Theory and Practice Article by: Giovanni Manzini Original Algorithm by: M. Burrows and D. J. Wheeler Lecturer: Eran Vered.


1 The Burrows-Wheeler Transform: Theory and Practice Article by: Giovanni Manzini Original Algorithm by: M. Burrows and D. J. Wheeler Lecturer: Eran Vered

2 Overview
The Burrows-Wheeler transform (bwt).
Statistical compression overview.
Compressing using bwt.
Analysis of the results of the compression.

3 General
bwt transforms the order of the symbols of a text: the output is a permutation of the input.
The bwt output can be compressed very easily.
Used by the compressor bzip2.

4 Calculating bw(s)
Append an end-of-string symbol ($) to s.
Generate the matrix of all the cyclic shifts of s$.
Sort the matrix rows in right-to-left lexicographic order.
bw(s) is the first column of the sorted matrix; the $ sign is dropped and its location saved.

5 BWT Example
s = mississippi

Cyclic shifts of s$:      Rows sorted right to left:
mississippi$              mississippi$
ississippi$m              ssissippi$mi
ssissippi$mi              $mississippi
sissippi$mis              ssippi$missi
issippi$miss              ppi$mississi
ssippi$missi              ississippi$m
sippi$missis              pi$mississip
ippi$mississ              i$mississipp
ppi$mississi              sissippi$mis
pi$mississip              sippi$missis
i$mississipp              issippi$miss
$mississippi              ippi$mississ

First column: ms$spipissii, so bw(s) = (msspipissii, 3).
Sorting the rows of the matrix is equivalent to sorting the suffixes of the reversed string s^r (ippississim).
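The construction above can be sketched directly in Python (a minimal quadratic-time sketch of the slide's convention: right-to-left row order, first column as output; real implementations use suffix sorting):

```python
def bw(s):
    """Burrows-Wheeler transform, in the slide's convention:
    right-to-left lexicographic row order, first column as output."""
    t = s + "$"                                    # end-of-string symbol
    rows = [t[i:] + t[:i] for i in range(len(t))]  # all cyclic shifts of s$
    rows.sort(key=lambda r: r[::-1])               # sort rows right to left
    first = "".join(r[0] for r in rows)            # first column
    pos = first.index("$")                         # 0-based; the slide counts from 1
    return first[:pos] + first[pos + 1:], pos
```

For s = mississippi this yields ("msspipissii", 2); the slide reports the position of $ as 3 because it counts from 1.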

6 BWT Matrix Properties
Sorted matrix rows, with first column F and last column L:

row             F  L
mississippi$    m  $
ssissippi$mi    s  i
$mississippi    $  i
ssippi$missi    s  i
ppi$mississi    p  i
ississippi$m    i  m
pi$mississip    p  p
i$mississipp    i  p
sissippi$mis    s  s
sippi$missis    s  s
issippi$miss    i  s
ippi$mississ    i  s

Sorting F gives L.
s_1 = F_1.
F_i follows L_i in s$.
Equal symbols in L are ordered the same as in F.
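For this example the properties can be checked mechanically (F is taken from the slide; comparing the multisets of (L_i, F_i) pairs expresses "F_i follows L_i in s$", read cyclically since the rows are cyclic shifts):

```python
t = "mississippi$"
F = "ms$spipissii"            # first column of the sorted matrix
L = "".join(sorted(F))        # sorting F gives L
assert L == "$iiiimppssss"
assert F[0] == t[0]           # s_1 = F_1
# F_i follows L_i in s$: the (L_i, F_i) pairs are exactly
# the cyclically adjacent pairs of characters of t
pairs_rows = sorted(zip(L, F))
pairs_text = sorted((t[i - 1], t[i]) for i in range(len(t)))
assert pairs_rows == pairs_text
```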

7 Reconstructing s
Add $ back (at its saved position) to get F: ms$spipissii
Sort F to get L: $iiiimppssss
Using: s_1 = F_1, F_i follows L_i in s$, and equal symbols have the same order of appearance in F and L.
s = m i s s i …?

8 Reconstructing s
F = ms$spipissii
L = sort(F) = $iiiimppssss

L = sort(F)
s = F_1
j = 1
for i = 2 to n {
  a = number of appearances of F_j in F_1, F_2, …, F_j
  j = index of the a-th appearance of F_j in L
  s = s + F_j
}
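The loop above transcribes to Python as follows (0-based indices; `pos` is the saved 0-based position of $ from the transform):

```python
def inverse_bw(bwt_out, pos):
    """Rebuild s from bw(s) by following the slide's reconstruction loop."""
    F = bwt_out[:pos] + "$" + bwt_out[pos:]  # reinsert $ to get column F
    L = "".join(sorted(F))                   # sorting F gives L
    s, j = F[0], 0                           # s_1 = F_1
    for _ in range(len(F) - 1):
        a = F[:j + 1].count(F[j])            # rank of F_j among equal symbols
        # j <- index of the a-th appearance of F_j in L
        j = [i for i, ch in enumerate(L) if ch == F[j]][a - 1]
        s += F[j]
    return s.rstrip("$")
```

inverse_bw("msspipissii", 2) recovers "mississippi".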

9 What’s good about bwt?
bwt(s) is locally homogeneous: for every substring w of s, all the symbols following w in s are grouped together in the output (in the sorted matrix, the rows preceded by the same context are adjacent).
These grouped symbols will usually be homogeneous.

10 What’s good about bwt?
s      = miss_mississippi_misses_miss_missouri
bwt(s) = mmmmmssssss_spiiiiiupii_ssssss_e_ioir
Runs in the output collect the symbols that follow a common context (_, m, mi, mis, …).

11 Statistical Compression
We will discuss lossless statistical compression with the following notation:
s = input string over the alphabet Σ = { a_1, a_2, a_3, …, a_h }
h = |Σ|
n = |s|
n_i = number of appearances of a_i in s
log x = log_2 x

12 Zeroth Order Encoding
Every input symbol is replaced by the same codeword in all of its appearances: a_i → c_i.
(The slide shows a binary code tree giving, e.g., e → 0, a → 10, c → 111, …)
Kraft's inequality: Σ_i 2^{-|c_i|} ≤ 1
Output size: Σ_i n_i |c_i| bits
The minimum is achieved for |c_i| = log(n / n_i).

13 Zeroth Order Encoding
Compressing a string using Huffman coding or arithmetic coding produces an output whose size is close to |s| H_0(s) bits, where
H_0(s) = - Σ_{i=1}^{h} (n_i / n) log(n_i / n)
is the (zeroth order) empirical entropy of s.

14 Zeroth Order Entropy: Example
n_1 = n_2 = … = n_h:  H_0(s) = log h
n_1 >> n_2, n_3, …, n_h:  H_0(s) is close to 0
s = mississippi:  H_0(s) ≈ 1.82
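The empirical entropy of slide 13 is a one-liner in Python, and reproduces the mississippi value:

```python
import math
from collections import Counter

def H0(s):
    """Zeroth order empirical entropy: -sum of (n_i/n) * log2(n_i/n)."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())
```

H0("mississippi") ≈ 1.82, and a string with h equally frequent symbols gives exactly log h (e.g. H0("abcd") = 2).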

15 k-th Order Encoding
The codeword that encodes an input symbol is determined by that symbol and its k preceding symbols.
w_s = a string containing all the symbols following w in s.
k-th order empirical entropy of s:
H_k(s) = (1/|s|) Σ_{w ∈ Σ^k} |w_s| H_0(w_s)
Output size is bounded by |s| H_k(s) bits.

16 k-th Order Entropy: Example
s = mississippi (k = 1)
m_s = i     →  H_0(i) = 0
i_s = ssp   →  H_0(ssp) ≈ 0.92
s_s = sisi  →  H_0(sisi) = 1
p_s = pi    →  H_0(pi) = 1
H_1(s) = (1·0 + 3·0.92 + 4·1 + 2·1) / 11 ≈ 0.80
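H_k can be computed by collecting, for every length-k context w, the string w_s of symbols following it, exactly as in the definition on slide 15 (H0 is the function from slide 13):

```python
import math
from collections import Counter, defaultdict

def H0(s):
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def Hk(s, k):
    """k-th order empirical entropy: (1/|s|) * sum over contexts w of |w_s| * H0(w_s)."""
    follow = defaultdict(str)
    for i in range(len(s) - k):
        follow[s[i:i + k]] += s[i + k]   # symbol following context s[i:i+k]
    return sum(len(ws) * H0(ws) for ws in follow.values()) / len(s)
```

Hk("mississippi", 1) ≈ 0.80, matching the slide, and Hk("ababababab", 1) = 0, matching the example on slide 18.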

17 k-th Order Encoding and bwt
Did we get an optimal k-th order compressor?
After applying bwt, for every substring w of s, all the symbols following w in s are grouped together:
mmmmmssssss_spiiiiiupii_ssssss_e_ioir
Not yet: local homogeneity instead of global homogeneity.

18 k-th Order Encoding and bwt
For example, s = abababab…ab:
bwt(s) = abbbbbbbbbbaaaaaaaaa, which splits into the context groups w_1 ($), w_2 (a), w_3 (b).
H_1(s) = 0 (a_s = bbb…, b_s = aaa…)
H_0(w_i) = 0 for each group, but H_0(w_1 w_2 w_3) = H_0(s) = 1.
A single zeroth-order encoder over the whole bwt output therefore misses the k-th order entropy.

19 Compressing bwt
bwt → MoveToFront → Arithmetic coding

20 MoveToFront Compression
Every input symbol is encoded by the number of distinct symbols that have occurred since the last appearance of that symbol.
Implemented using a list of symbols sorted by recency of use.
If the text is locally homogeneous, the output contains many small numbers: MoveToFront transforms local homogeneity into global homogeneity.

21 MoveToFront Compression
Σ = { d, e, h, l, o, r, w }
s = h e l l o w o r l d

symbol  mtf-list before          output
h       { d, e, h, l, o, r, w }  2
e       { h, d, e, l, o, r, w }  2
l       { e, h, d, l, o, r, w }  3
l       { l, e, h, d, o, r, w }  0
o       { l, e, h, d, o, r, w }  4
w       { o, l, e, h, d, r, w }  6
o       { w, o, l, e, h, d, r }  1
…

mtf(s) = 2 2 3 0 4 6 1 …
The initial list may be either ordered alphabetically, or hold the symbols in order of appearance in the string (which then needs to be added to the output).
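The encoder is a few lines of Python (alphabetically ordered initial list, as in the example):

```python
def mtf(s, alphabet):
    """Move-to-front: emit each symbol's list index, then move it to the front."""
    lst = sorted(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))   # move the symbol to the front of the list
    return out
```

mtf("helloworld", "dehlorw") begins 2 2 3 0 4 6 1 …, matching the slide.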

22 bw0 Compression
bw0(s) = arit( mtf( bw(s) ) )
Theorem 1: for any k,
|bw0(s)| ≤ 8 |s| H_k(s) + (2/25) |s| + g_k
where g_k depends only on k and the alphabet size h.

23 Notations
x' = mtf(x)
For a string w over {0, 1, 2, …, m}, define w_01: w with all the non-zeros replaced by 1.
In particular, x'_01 is x' with all the non-zeros replaced by 1.
Note: |bwt(x)| = |x| and |mtf(x)| = |x|.

24 Theorem 1 - Proof
Lemma 1: let s = s_1 s_2 … s_t and s' = mtf(s). Then
|arit(s')| ≤ 8 Σ_{i=1}^{t} |s_i| H_0(s_i) + (2/25) |s| + g,
with g depending only on t and h.

25 Theorem 1 - Proof
bw(s) can be partitioned into at most h^k substrings w_1, w_2, …, w_l such that
Σ_{i=1}^{l} |w_i| H_0(w_i) ≤ |s| H_k(s).
Let s' = mtf(bw(s)). By Lemma 1, and using the bound on the output of arit, |bw0(s)| ≤ 8 |s| H_k(s) + (2/25) |s| + g_k.

26 Lemma 1 - Proof
s = s_1 s_2 … s_t, s' = mtf(s).
Encoding of s':
For each symbol: is it 0 or not?
For the non-zeros: encode one of 1, 2, 3, …, h-1.
Note: some inter-substring problems are ignored here.

27 Encoding non-zeros of s'
Use a prefix code (i → c_i): s'' = pcnz(s')
c_1 = 10
c_2 = 11
c_i = 0 0 … 0 B(i+1) for i > 2, with |B(i+1)| - 2 leading zeros
|c_i| ≤ 2 log(i+1)   (|c_0| = 0)
m_i = number of occurrences of i in s'.
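A sketch of this prefix code in Python (the name `pcnz_code` is ours; B(·) is the usual binary representation):

```python
import math

def pcnz_code(i):
    """Codeword c_i for a non-zero mtf value i (i >= 1), as on the slide."""
    if i == 1:
        return "10"
    if i == 2:
        return "11"
    b = bin(i + 1)[2:]               # B(i+1)
    return "0" * (len(b) - 2) + b    # |B(i+1)| - 2 leading zeros, then B(i+1)
```

The leading-zero count tells the decoder the length of B(i+1), so the code is prefix-free, and |c_i| = 2|B(i+1)| - 2 ≤ 2 log(i+1).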

28 Encoding non-zeros of s' = mtf(s) - Proof
Let a be a symbol with N_a occurrences in s, at positions p_1, p_2, …, p_{N_a}. The mtf value emitted at p_j is less than p_j - p_{j-1}, so its codeword costs at most 2 log(p_j - p_{j-1} + 1) bits. By concavity of the log, the occurrences of a cost at most 2 N_a log(|s|/N_a + 1) bits in total; summing over all symbols of s bounds the total cost in terms of |s| H_0(s), for any string s.

29 Encoding non-zeros of s'
The same bound holds for every substring s_i of s = s_1 s_2 … s_t; summing over all the substrings bounds the total cost of the non-zeros by the sum Σ_i |s_i| H_0(s_i), up to constant factors.

30 Encoding of s'
For the non-zeros: encode one of 1, 2, 3, …, h-1 — no more than the bound above.
For each symbol: is it 0 or not? — encode s'_01.

31 Encoding s'_01
If for every s_i'_01 the number of 0's is at least as large as the number of 1's, the cost of encoding s'_01 stays within the claimed bound. Otherwise …

32 Encoding s'_01 (second case)
If there are more 1's than 0's in s_i'_01, then more than half of the symbols of s_i' are non-zero, and the claimed bound follows in this case as well.

33 Encoding of s'
For the non-zeros: encode one of 1, 2, 3, …, h-1 — no more than the bound above.
For each symbol: is it 0 or not? (encode s'_01) — no more than the bound above.
Total (after fixing some inaccuracies): no more than 8 Σ_i |s_i| H_0(s_i) + (2/25) |s| bits plus a constant, which proves Lemma 1.

34 Improvement
Use RLE: bw0_RL(s) = arit( rle( mtf( bw(s) ) ) )
Better performance in practice.
Better theoretical bound: |bw0_RL(s)| ≤ 5 |s| H_k*(s) + g_k, where H_k* is a modified empirical entropy.

35 Notes
Compressor implementation: use blocks of text.
Sort using one of:
Compact suffix trees (long average LCP)
Suffix arrays (medium average LCP)
A general string sorter (short average LCP)
Searching in a compressed text: extract the suffix array from bwt(s).
Empirical results…

