Presentation on theme: "BWT-Based Compression Algorithms, by Haim Kaplan and Elad Verbin, Tel-Aviv University. Presented at CPM '07, July 8, 2007."— Presentation transcript:

1 BWT-Based Compression Algorithms. Haim Kaplan and Elad Verbin, Tel-Aviv University. Presented at CPM '07, July 8, 2007.

2 Results: one cannot show a constant c < 2 such that |BW0(s)| ≤ c·nH_k(s) + lower-order terms; similarly, no c < 1.26 for BW_RL and no c < 1.3 for BW_DC. The proofs use a probabilistic technique.

3 Outline ● Part I: Definitions ● Part II: Results ● Part III: Proofs ● Part IV: Experimental Results

4 Part I: Definitions

5 BW0, the main Burrows-Wheeler compression algorithm: String S (text in English; similar contexts -> similar character) --[BWT, Burrows-Wheeler Transform]--> text with local uniformity --[MTF, move-to-front]--> integer string with many small numbers --[Order-0 encoding]--> compressed string S'.

6 The BWT. Invented by Burrows and Wheeler ('94). Analogous to the Fourier transform (smooth!): it maps a string with context-regularity (e.g. mississippi) to a string with spikes, i.e. close repetitions (e.g. ipssmpissii). [Fenwick]

7 The BWT. T = mississippi#. Form all cyclic rotations of T and sort the rows lexicographically; the first column is F, the last column is L = BWT(T):
  #mississipp i
  i#mississip p
  ippi#missis s
  issippi#mis s
  ississippi# m
  mississippi #
  pi#mississi p
  ppi#mississ i
  sippi#missi s
  sissippi#mi s
  ssippi#miss i
  ssissippi#m i
So L = BWT(T) = ipssm#pissii. BWT sorts the characters by their post-context.
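For concreteness, here is a minimal sketch of the transform as just described, built by sorting all rotations. This is the quadratic textbook construction, not the suffix-array construction real compressors use, and the function name and sentinel choice are illustrative only.

    def bwt(text, sentinel="#"):
        """Burrows-Wheeler transform via sorted rotations (quadratic; for illustration only).

        Assumes the sentinel does not occur in text and sorts before every other character.
        """
        s = text + sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))  # all cyclic rotations, sorted
        return "".join(rotation[-1] for rotation in rotations)    # last column L is BWT(s)

    print(bwt("mississippi"))  # -> ipssm#pissii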

8 BWT facts: (1) it permutes the text; (2) it is a (≤ n+1)-to-1 function.

9 Move-to-front (MTF). By Bentley, Sleator, Tarjan and Wei ('86). Maps a string with spikes (close repetitions), e.g. ipssmpissii, to an integer string with small numbers, e.g. 0,0,0,0,0,2,4,3,0,1,0.

10-17 Move to Front: a worked example on abracadabra, with initial list a,b,r,c,d. Each character is replaced by its current position in the list and then moved to the front:
  char  output so far           list after the step
  a     0                       a,b,r,c,d
  b     0,1                     b,a,r,c,d
  r     0,1,2                   r,b,a,c,d
  a     0,1,2,2                 a,r,b,c,d
  c     0,1,2,2,3               c,a,r,b,d
  a     0,1,2,2,3,1             a,c,r,b,d
  d     0,1,2,2,3,1,4           d,a,c,r,b
  a     0,1,2,2,3,1,4,1         a,d,c,r,b
  b     0,1,2,2,3,1,4,1,4       b,a,d,c,r
  r     0,1,2,2,3,1,4,1,4,4     r,b,a,d,c
  a     0,1,2,2,3,1,4,1,4,4,2   a,r,b,d,c
So MTF(abracadabra) = 0,1,2,2,3,1,4,1,4,4,2.
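A minimal sketch of the encoder traced above; the function name is illustrative, and the initial list is passed in explicitly, as on the slides.

    def mtf_encode(text, initial_list):
        """Move-to-front: output each character's current position, then move it to the front."""
        table = list(initial_list)
        output = []
        for ch in text:
            pos = table.index(ch)   # position of ch in the current list
            output.append(pos)
            table.pop(pos)          # move ch to the front of the list
            table.insert(0, ch)
        return output

    print(mtf_encode("abracadabra", "abrcd"))  # -> [0, 1, 2, 2, 3, 1, 4, 1, 4, 4, 2]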

18 After MTF. Now we have a string with small numbers: lots of 0s, many 1s, ... Skewed frequencies: run an order-0 arithmetic coder on it! [chart: character frequencies]

19 BW0, the main Burrows-Wheeler compression algorithm (shown again): String S (text in English; similar contexts -> similar character) --[BWT]--> text with local uniformity --[MTF]--> integer string with many small numbers --[Order-0 encoding]--> compressed string S'.

20 BW_RL (e.g. bzip): String S is passed through BWT (Burrows-Wheeler Transform), MTF (move-to-front), RLE (run-length encoding), and Order-0 encoding to produce the compressed string S'.
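As an illustration of the extra RLE stage, here is a minimal sketch that collapses the runs (mostly runs of zeros) in the MTF output into (value, run-length) pairs. Real implementations such as bzip use a more specialised encoding of the zero runs, so this is only a sketch of the idea.

    from itertools import groupby

    def run_length_encode(values):
        """Collapse consecutive repeats into (value, run_length) pairs."""
        return [(v, len(list(run))) for v, run in groupby(values)]

    mtf_output = [0, 0, 0, 0, 0, 2, 4, 3, 0, 1, 0]   # the MTF example from slide 9
    print(run_length_encode(mtf_output))
    # -> [(0, 5), (2, 1), (4, 1), (3, 1), (0, 1), (1, 1), (0, 1)]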

21 Many more BWT-based algorithms: BW_DC, which encodes using distance coding instead of MTF; BW with inversion-frequencies coding; the booster-based compressor [Ferragina-Giancarlo-Manzini-Sciortino]; the block-based compressor of Effros et al.

22 Order-0 entropy: a lower bound for compression without context information. S = "ACABBA": 1/2 of the characters are 'A', each represented by 1 bit; 1/3 are 'B', each represented by log(3) bits; 1/6 are 'C', each represented by log(6) bits. So 6·H_0(S) = 3·1 + 2·log(3) + 1·log(6).
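A small sketch (hypothetical helper, base-2 logarithms) that computes the empirical order-0 entropy defined above and checks the ACABBA example.

    import math
    from collections import Counter

    def order0_entropy(s):
        """Empirical order-0 entropy in bits per character: sum over characters c of (n_c/n)*log2(n/n_c)."""
        n = len(s)
        return sum((cnt / n) * math.log2(n / cnt) for cnt in Counter(s).values())

    s = "ACABBA"
    print(len(s) * order0_entropy(s))   # 3*1 + 2*log2(3) + 1*log2(6) ≈ 8.75 bits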

23 order-k entropy = Lower bound for compression with order-k contexts

24 Order-k entropy, illustrated with k = 1 on mississippi: the contexts (here, the preceding characters) of the occurrences of 'i' are "mssp", of 's' are "isis", and of 'p' are "ip".
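A sketch of the order-k empirical entropy under the convention of this example (the context of a character is the k characters preceding it). order0_entropy is repeated from the previous snippet so the block is self-contained, and the names are illustrative.

    import math
    from collections import Counter, defaultdict

    def order0_entropy(s):
        n = len(s)
        return sum((cnt / n) * math.log2(n / cnt) for cnt in Counter(s).values())

    def orderk_entropy(s, k):
        """H_k(s): weighted average of H_0 over the characters seen after each length-k context."""
        by_context = defaultdict(list)
        for i in range(k, len(s)):
            by_context[s[i - k:i]].append(s[i])   # context = the k preceding characters
        return sum(len(chars) * order0_entropy("".join(chars))
                   for chars in by_context.values()) / len(s)

    print(orderk_entropy("mississippi", 1))   # ≈ 0.80 bits per character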

25 Part II: Results

26 Measuring against H_k. When performing worst-case analysis of lossless text compressors, we usually measure against H_k. The goal is a bound of the form |A(s)| ≤ c·nH_k(s) + lower-order term. Optimal: |A(s)| ≤ nH_k(s) + lower-order term.

27-29 Bounds on the constant c (lower / upper):
  BW0:   lower 2    [KaplanVerbin07]    upper 3.33 [ManziniGagie07]
  BW_DC: lower 1.3  [KaplanVerbin07]    upper 1.7  [KaplanLandauVerbin06]
  BW_RL: lower 1.26 [KaplanVerbin07]    upper 5    [Manzini99]
  gzip:  lower 1                        upper 1
  PPM:   lower 1                        upper 1

30 Lower bounds alone:
  BW0:   2    [KaplanVerbin07]
  BW_DC: 1.3  [KaplanVerbin07]
  BW_RL: 1.26 [KaplanVerbin07]
  gzip:  1
  PPM:   1
Surprising, since BWT-based compressors work better than gzip in practice!

31 Possible explanations: (1) Asymptotics: the bounds are asymptotic, and real compressors cut the input into blocks, so the lower-order terms matter. (2) English text is not Markovian! Analyzing on a different model might show BWT's superiority.

32 Part III: Proofs

33 Lower bound. We wish to analyze BW0 = BWT + MTF + Order-0 coding, and need to exhibit a string s on which its output is much larger than the entropy. Consider the string s with 10^3 'a's and 10^6 'b's; the entropy of s is small. BWT(s) has the same character frequencies, and MTF(BWT(s)) has about 2·10^3 '1's and 10^6 - 10^3 '0's, so the compressed size is close to twice the entropy of s (see the general calculation below). For this we need BWT(s) to have many isolated 'a's.

34 Many isolated 'a's. Goal: find s such that in BWT(s), most 'a's are isolated. Solution: probabilistic. BWT is a (≤ n+1)-to-1 function, so a random string s' has chance ≥ 1/(n+1) of being a BWT-image, while a random string has chance ≥ 1 - 1/n^2 of having "many" isolated 'a's. Therefore such a string exists.

35 General calculation. s contains pn 'a's and (1-p)n 'b's, so the entropy of s is nH(p), where H(p) = p·log(1/p) + (1-p)·log(1/(1-p)). MTF(BWT(s)) contains 2p(1-p)n '1's and the rest '0's, so its compressed size is about nH(2p(1-p)). Ratio: H(2p(1-p)) / H(p).
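The formulas on this slide did not survive the transcript; under the reconstruction above, one back-of-the-envelope way to see where the constant 2 for BW0 comes from is to let p tend to 0 and use H(q) ≈ q·log(1/q) for small q:

    H(2p(1-p)) / H(p)  ≈  [2p · log(1/(2p))] / [p · log(1/p)]  →  2   as p → 0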

36 Lower bounds on BW_DC and BW_RL: similar technique, except that taking p infinitesimally small gives a highly compressible string, so instead the ratio is maximized over p. This gives weird constants, but the bounds are quite strong.

37 Experimental results (sanity check): picking texts from the above Markov models really does show this behavior in practice. Picking text from "realistic" Markov sources ("realistic" = generated from actual texts) also shows non-optimal behavior. On long Markov text, gzip works better than BWT.

38 Bottom line. BWT compressors are not optimal (vs. order-k entropy). We believe they are good because English text is not Markovian; find a theoretical justification! Also improve the constants, find BWT algorithms with better ratios, ...

39 Thank You!

40 Additional Slides (taken out for lack of time)

41 BWT - Invertibility ● Go forward, one character at a time

42 Main property: the L-to-F mapping. The i-th occurrence of character c in L corresponds to the i-th occurrence of c in F. This happens because the characters in L are sorted by their post-context, and the occurrences of character c in F are also sorted by their post-context. [figure: the sorted-rotation table of slide 7, with columns F and L shown and the middle of each row marked "unknown"]
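To make invertibility and the L-to-F correspondence concrete, here is a minimal sketch of inverse BWT. The slides recover the text going forward one character at a time; this sketch uses the same i-th-occurrence correspondence between L and F but walks backward from the row beginning with the sentinel, which is an equivalent formulation. It assumes the sentinel occurs exactly once and sorts before every other character, so row 0 of the sorted matrix is the rotation starting with it.

    from collections import Counter

    def inverse_bwt(L, sentinel="#"):
        """Invert the BWT of a sentinel-terminated string using the L-to-F mapping."""
        n = len(L)
        counts = Counter(L)
        first, total = {}, 0            # first[c] = index in F of the first occurrence of c
        for c in sorted(counts):
            first[c] = total
            total += counts[c]
        seen = Counter()
        LF = [0] * n                    # LF[i] = row in F of the same occurrence as L[i]
        for i, c in enumerate(L):
            LF[i] = first[c] + seen[c]
            seen[c] += 1
        row, chars = 0, []              # row 0 starts with the sentinel
        for _ in range(n - 1):
            chars.append(L[row])        # L[row] is the character preceding F[row] in the text
            row = LF[row]
        return "".join(reversed(chars)) + sentinel

    print(inverse_bwt("ipssm#pissii"))  # -> mississippi#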

43 BW0 vs. Lempel-Ziv. BW0 dynamically takes advantage of context-regularity; it is a robust, smooth alternative to Lempel-Ziv.

44 BW0 vs. statistical coding. Statistical coding (e.g. PPM) builds a model for each context; prediction -> compression. PPM optimally models each context and partitions explicitly, producing a model for each context; BW0 exploits similarities between similar contexts, with no explicit partitioning into contexts.

45 Compressed text indexing, an application of BWT: a compressed representation of the text that supports fast pattern matching (without decompression!) and partial decompression, so there is never a need to decompress. Space usage: |BW0(s)| + o(n). See more in [Ferragina-Manzini].

46 Musings. On one hand, BWT-based algorithms are not optimal, while Lempel-Ziv is. On the other hand, BWT compresses much better in practice. Reasons: (1) the results are asymptotic (the "EE" reason); (2) English text was not generated by a Markov source (the real reason?). Goal: get a more honest way to analyze, perhaps using a statistic different from H_k.

