The Burrows-Wheeler Transform: Theory and Practice Article by: Giovanni Manzini Original Algorithm by: M. Burrows and D. J. Wheeler Lecturer: Eran Vered.


1 The Burrows-Wheeler Transform: Theory and Practice Article by: Giovanni Manzini Original Algorithm by: M. Burrows and D. J. Wheeler Lecturer: Eran Vered

2 Overview
The Burrows-Wheeler transform (bwt).
Statistical compression overview.
Compressing using bwt.
Analysis of the results of the compression.

3 General
bwt transforms the order of the symbols of a text: the output is a permutation of the input.
The bwt output can be compressed very easily.
Used by the compressor bzip2.

4 Calculating bw(s)
Append an end-of-string symbol ($) to s.
Generate the matrix of all the cyclic shifts of s$.
Sort the matrix rows in right-to-left lexicographic order.
bw(s) is the first column of the sorted matrix; the $ sign is dropped and its location saved.

5 BWT Example
s = mississippi

Cyclic shifts of s$:      Rows sorted right to left:
mississippi$              mississippi$
ississippi$m              ssissippi$mi
ssissippi$mi              $mississippi
sissippi$mis              ssippi$missi
issippi$miss              ppi$mississi
ssippi$missi              ississippi$m
sippi$missis              pi$mississip
ippi$mississ              i$mississipp
ppi$mississi              sissippi$mis
pi$mississip              sippi$missis
i$mississipp              issippi$miss
$mississippi              ippi$mississ

First column: ms$spipissii, so bw(s) = (msspipissii, 3).
Sorting the rows of the matrix is equivalent to sorting the suffixes of the reversed string s^r (ippississim).
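The construction above can be sketched directly in Python (a minimal quadratic-time sketch of the slide's convention: right-to-left row order, first column as output; real implementations use suffix sorting):

```python
def bw(s):
    """Burrows-Wheeler transform, in the slide's convention:
    right-to-left lexicographic row order, first column as output."""
    t = s + "$"                                    # end-of-string symbol
    rows = [t[i:] + t[:i] for i in range(len(t))]  # all cyclic shifts of s$
    rows.sort(key=lambda r: r[::-1])               # sort rows right to left
    first = "".join(r[0] for r in rows)            # first column
    pos = first.index("$")                         # 0-based; the slide counts from 1
    return first[:pos] + first[pos + 1:], pos
```

For s = mississippi this yields ("msspipissii", 2); the slide reports the position of $ as 3 because it counts from 1.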

6 BWT Matrix Properties
Sorted matrix rows, with first column F and last column L:

row             F  L
mississippi$    m  $
ssissippi$mi    s  i
$mississippi    $  i
ssippi$missi    s  i
ppi$mississi    p  i
ississippi$m    i  m
pi$mississip    p  p
i$mississipp    i  p
sissippi$mis    s  s
sippi$missis    s  s
issippi$miss    i  s
ippi$mississ    i  s

Sorting F gives L.
s_1 = F_1.
F_i follows L_i in s$.
Equal symbols in L are ordered the same as in F.
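For this example the properties can be checked mechanically (F is taken from the slide; comparing the multisets of (L_i, F_i) pairs expresses "F_i follows L_i in s$", read cyclically since the rows are cyclic shifts):

```python
t = "mississippi$"
F = "ms$spipissii"            # first column of the sorted matrix
L = "".join(sorted(F))        # sorting F gives L
assert L == "$iiiimppssss"
assert F[0] == t[0]           # s_1 = F_1
# F_i follows L_i in s$: the (L_i, F_i) pairs are exactly
# the cyclically adjacent pairs of characters of t
pairs_rows = sorted(zip(L, F))
pairs_text = sorted((t[i - 1], t[i]) for i in range(len(t)))
assert pairs_rows == pairs_text
```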

7 Reconstructing s
Add $ back (at its saved position) to get F: ms$spipissii
Sort F to get L: $iiiimppssss
Using: s_1 = F_1, F_i follows L_i in s$, and equal symbols have the same order of appearance in F and L.
s = m i s s i …?

8 Reconstructing s
F = ms$spipissii
L = sort(F) = $iiiimppssss

L = sort(F)
s = F_1
j = 1
for i = 2 to n {
  a = number of appearances of F_j in F_1, F_2, …, F_j
  j = index of the a-th appearance of F_j in L
  s = s + F_j
}
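The loop above transcribes to Python as follows (0-based indices; `pos` is the saved 0-based position of $ from the transform):

```python
def inverse_bw(bwt_out, pos):
    """Rebuild s from bw(s) by following the slide's reconstruction loop."""
    F = bwt_out[:pos] + "$" + bwt_out[pos:]  # reinsert $ to get column F
    L = "".join(sorted(F))                   # sorting F gives L
    s, j = F[0], 0                           # s_1 = F_1
    for _ in range(len(F) - 1):
        a = F[:j + 1].count(F[j])            # rank of F_j among equal symbols
        # j <- index of the a-th appearance of F_j in L
        j = [i for i, ch in enumerate(L) if ch == F[j]][a - 1]
        s += F[j]
    return s.rstrip("$")
```

inverse_bw("msspipissii", 2) recovers "mississippi".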

9 What’s good about bwt?
bwt(s) is locally homogeneous: for every substring w of s, all the symbols following w in s are grouped together in the output (in the sorted matrix, the rows preceded by the same context are adjacent).
These grouped symbols will usually be homogeneous.

10 What’s good about bwt?
s      = miss_mississippi_misses_miss_missouri
bwt(s) = mmmmmssssss_spiiiiiupii_ssssss_e_ioir
Runs in the output collect the symbols that follow a common context (_, m, mi, mis, …).

11 Statistical Compression
We will discuss lossless statistical compression with the following notation:
s = input string over the alphabet Σ = { a_1, a_2, a_3, …, a_h }
h = |Σ|
n = |s|
n_i = number of appearances of a_i in s
log x = log_2 x

12 Zeroth Order Encoding
Every input symbol is replaced by the same codeword in all of its appearances: a_i → c_i.
(The slide shows a binary code tree giving, e.g., e → 0, a → 10, c → 111, …)
Kraft's inequality: Σ_i 2^{-|c_i|} ≤ 1
Output size: Σ_i n_i |c_i| bits
The minimum is achieved for |c_i| = log(n / n_i).

13 Zeroth Order Encoding
Compressing a string using Huffman coding or arithmetic coding produces an output whose size is close to |s| H_0(s) bits, where
H_0(s) = - Σ_{i=1}^{h} (n_i / n) log(n_i / n)
is the (zeroth order) empirical entropy of s.

14 Zeroth Order Entropy: Example
n_1 = n_2 = … = n_h:  H_0(s) = log h
n_1 >> n_2, n_3, …, n_h:  H_0(s) is close to 0
s = mississippi:  H_0(s) ≈ 1.82
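The empirical entropy of slide 13 is a one-liner in Python, and reproduces the mississippi value:

```python
import math
from collections import Counter

def H0(s):
    """Zeroth order empirical entropy: -sum of (n_i/n) * log2(n_i/n)."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())
```

H0("mississippi") ≈ 1.82, and a string with h equally frequent symbols gives exactly log h (e.g. H0("abcd") = 2).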

15 k-th Order Encoding
The codeword that encodes an input symbol is determined by that symbol and its k preceding symbols.
w_s = a string containing all the symbols following w in s.
k-th order empirical entropy of s:
H_k(s) = (1/|s|) Σ_{w ∈ Σ^k} |w_s| H_0(w_s)
Output size is bounded by |s| H_k(s) bits.

16 k-th Order Entropy: Example
s = mississippi (k = 1)
m_s = i     →  H_0(i) = 0
i_s = ssp   →  H_0(ssp) ≈ 0.92
s_s = sisi  →  H_0(sisi) = 1
p_s = pi    →  H_0(pi) = 1
H_1(s) = (1·0 + 3·0.92 + 4·1 + 2·1) / 11 ≈ 0.80
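H_k can be computed by collecting, for every length-k context w, the string w_s of symbols following it, exactly as in the definition on slide 15 (H0 is the function from slide 13):

```python
import math
from collections import Counter, defaultdict

def H0(s):
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def Hk(s, k):
    """k-th order empirical entropy: (1/|s|) * sum over contexts w of |w_s| * H0(w_s)."""
    follow = defaultdict(str)
    for i in range(len(s) - k):
        follow[s[i:i + k]] += s[i + k]   # symbol following context s[i:i+k]
    return sum(len(ws) * H0(ws) for ws in follow.values()) / len(s)
```

Hk("mississippi", 1) ≈ 0.80, matching the slide, and Hk("ababababab", 1) = 0, matching the example on slide 18.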

17 k-th Order Encoding and bwt
Did we get an optimal k-th order compressor?
After applying bwt, for every substring w of s, all the symbols following w in s are grouped together:
mmmmmssssss_spiiiiiupii_ssssss_e_ioir
Not yet: local homogeneity instead of global homogeneity.

18 k-th Order Encoding and bwt
For example, s = abababab…ab:
bwt(s) = abbbbbbbbbbaaaaaaaaa, which splits into the context groups w_1 ($), w_2 (a), w_3 (b).
H_1(s) = 0 (a_s = bbb…, b_s = aaa…)
H_0(w_i) = 0 for each group, but H_0(w_1 w_2 w_3) = H_0(s) = 1.
A single zeroth-order encoder over the whole bwt output therefore misses the k-th order entropy.

19 Compressing bwt
bwt → MoveToFront → Arithmetic coding

20 MoveToFront Compression
Every input symbol is encoded by the number of distinct symbols that have occurred since the last appearance of that symbol.
Implemented using a list of symbols sorted by recency of use.
If the text is locally homogeneous, the output contains many small numbers: MoveToFront transforms local homogeneity into global homogeneity.

21 MoveToFront Compression
Σ = { d, e, h, l, o, r, w }
s = h e l l o w o r l d

symbol  mtf-list before          output
h       { d, e, h, l, o, r, w }  2
e       { h, d, e, l, o, r, w }  2
l       { e, h, d, l, o, r, w }  3
l       { l, e, h, d, o, r, w }  0
o       { l, e, h, d, o, r, w }  4
w       { o, l, e, h, d, r, w }  6
o       { w, o, l, e, h, d, r }  1
…

mtf(s) = 2 2 3 0 4 6 1 …
The initial list may be either ordered alphabetically, or hold the symbols in order of appearance in the string (which then needs to be added to the output).
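The encoder is a few lines of Python (alphabetically ordered initial list, as in the example):

```python
def mtf(s, alphabet):
    """Move-to-front: emit each symbol's list index, then move it to the front."""
    lst = sorted(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))   # move the symbol to the front of the list
    return out
```

mtf("helloworld", "dehlorw") begins 2 2 3 0 4 6 1 …, matching the slide.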

22 bw0 Compression
bw0(s) = arit( mtf( bw(s) ) )
Theorem 1: for any k,
|bw0(s)| ≤ 8 |s| H_k(s) + (2/25) |s| + g_k
where g_k depends only on k and the alphabet size h.

23 Notations
x' = mtf(x)
For a string w over {0, 1, 2, …, m}, define w_01: w with all the non-zeros replaced by 1.
In particular, x'_01 is x' with all the non-zeros replaced by 1.
Note: |bwt(x)| = |x| and |mtf(x)| = |x|.

24 Theorem 1 - Proof
Lemma 1: let s = s_1 s_2 … s_t and s' = mtf(s). Then
|arit(s')| ≤ 8 Σ_{i=1}^{t} |s_i| H_0(s_i) + (2/25) |s| + g,
with g depending only on t and h.

25 Theorem 1 - Proof
bw(s) can be partitioned into at most h^k substrings w_1, w_2, …, w_l such that
Σ_{i=1}^{l} |w_i| H_0(w_i) ≤ |s| H_k(s).
Let s' = mtf(bw(s)). By Lemma 1, and using the bound on the output of arit, |bw0(s)| ≤ 8 |s| H_k(s) + (2/25) |s| + g_k.

26 Lemma 1 - Proof
s = s_1 s_2 … s_t, s' = mtf(s).
Encoding of s':
For each symbol: is it 0 or not?
For the non-zeros: encode one of 1, 2, 3, …, h-1.
Note: some inter-substring problems are ignored here.

27 Encoding non-zeros of s'
Use a prefix code (i → c_i): s'' = pcnz(s')
c_1 = 10
c_2 = 11
c_i = 0 0 … 0 B(i+1) for i > 2, with |B(i+1)| - 2 leading zeros
|c_i| ≤ 2 log(i+1)   (|c_0| = 0)
m_i = number of occurrences of i in s'.
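A sketch of this prefix code in Python (the name `pcnz_code` is ours; B(·) is the usual binary representation):

```python
import math

def pcnz_code(i):
    """Codeword c_i for a non-zero mtf value i (i >= 1), as on the slide."""
    if i == 1:
        return "10"
    if i == 2:
        return "11"
    b = bin(i + 1)[2:]               # B(i+1)
    return "0" * (len(b) - 2) + b    # |B(i+1)| - 2 leading zeros, then B(i+1)
```

The leading-zero count tells the decoder the length of B(i+1), so the code is prefix-free, and |c_i| = 2|B(i+1)| - 2 ≤ 2 log(i+1).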

28 Encoding non-zeros of s' = mtf(s) - Proof
Let a be a symbol with N_a occurrences in s, at positions p_1, p_2, …, p_{N_a}. The mtf value emitted at p_j is less than p_j - p_{j-1}, so its codeword costs at most 2 log(p_j - p_{j-1} + 1) bits. By concavity of the log, the occurrences of a cost at most 2 N_a log(|s|/N_a + 1) bits in total; summing over all symbols of s bounds the total cost in terms of |s| H_0(s), for any string s.

29 Encoding non-zeros of s'
The same bound holds for every substring s_i of s = s_1 s_2 … s_t; summing over all the substrings bounds the total cost of the non-zeros by the sum Σ_i |s_i| H_0(s_i), up to constant factors.

30 Encoding of s'
For the non-zeros: encode one of 1, 2, 3, …, h-1 — no more than the bound above.
For each symbol: is it 0 or not? — encode s'_01.

31 Encoding s'_01
If for every s_i'_01 the number of 0's is at least as large as the number of 1's, the cost of encoding s'_01 stays within the claimed bound. Otherwise …

32 Encoding s'_01 (second case)
If there are more 1's than 0's in s_i'_01, then more than half of the symbols of s_i' are non-zero, and the claimed bound follows in this case as well.

33 Encoding of s'
For the non-zeros: encode one of 1, 2, 3, …, h-1 — no more than the bound above.
For each symbol: is it 0 or not? (encode s'_01) — no more than the bound above.
Total (after fixing some inaccuracies): no more than 8 Σ_i |s_i| H_0(s_i) + (2/25) |s| bits plus a constant, which proves Lemma 1.

34 Improvement
Use RLE: bw0_RL(s) = arit( rle( mtf( bw(s) ) ) )
Better performance in practice.
Better theoretical bound: |bw0_RL(s)| ≤ 5 |s| H_k*(s) + g_k, where H_k* is a modified empirical entropy.

35 Notes
Compressor implementation: use blocks of text.
Sort using one of:
Compact suffix trees (long average LCP)
Suffix arrays (medium average LCP)
A general string sorter (short average LCP)
Searching in a compressed text: extract the suffix array from bwt(s).
Empirical results…

