BWT-Based Compression Algorithms Haim Kaplan and Elad Verbin Tel-Aviv University Presented in CPM ’07, July 8, 2007.


Results
● Cannot show a constant c < 2 s.t. |BW0(s)| ≤ c·nH_k(s) + o(n)
● Similarly, no c < 1.26 for BW RL, no c < 1.3 for BW DC
● Proofs use a probabilistic technique

Outline ● Part I: Definitions ● Part II: Results ● Part III: Proofs ● Part IV: Experimental Results

Part I: Definitions

BW0: the main Burrows-Wheeler compression algorithm

String S (English text: similar contexts -> similar characters)
  -> BWT (Burrows-Wheeler Transform): text with local uniformity
  -> MTF (move-to-front): integer string with many small numbers
  -> Order-0 Encoding
  -> Compressed String S'

The BWT
● Invented by Burrows and Wheeler ('94)
● Analogous to the Fourier Transform (smooth!): turns a string with context-regularity into a string with spikes (close repetitions)
● mississippi -> ipssmpissii [Fenwick]

The BWT
T = mississippi#. Write down all rotations of T, then sort the rows; the last column L of the sorted matrix is BWT(T).

Rotations of T      Sorted rows (F ... L)
mississippi#        # mississipp i
ississippi#m        i #mississip p
ssissippi#mi        i ppi#missis s
sissippi#mis        i ssippi#mis s
issippi#miss        i ssissippi# m
ssippi#missi        m ississippi #
sippi#missis        p i#mississi p
ippi#mississ        p pi#mississ i
ppi#mississi        s ippi#missi s
pi#mississip        s issippi#mi s
i#mississipp        s sippi#miss i
#mississippi        s sissippi#m i

L = BWT(T) = ipssm#pissii. BWT sorts the characters by their post-context.
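The construction above can be sketched in a few lines. This is a naive illustration that materializes and sorts all rotations (fine for examples, far too slow for real texts, where suffix sorting is used instead):

```python
def bwt(s, sentinel="#"):
    """Burrows-Wheeler transform: append a sentinel, sort all
    rotations of the string, and return the last column."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("mississippi"))  # -> ipssm#pissii
```

The sentinel `#` must sort before every other character, which holds for lowercase ASCII letters.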

BWT Facts
1. BWT permutes the text
2. BWT is a (≤ n+1)-to-1 function

Move To Front
● By Bentley, Sleator, Tarjan and Wei ('86)
● Turns a string with spikes (close repetitions) into an integer string with small numbers
● move-to-front: ipssmpissii -> 0,0,0,0,0,2,4,3,0,1,0

Move to Front
Encoding abracadabra with initial list a,b,r,c,d:

char   output   list after
a      0        a,b,r,c,d
b      1        b,a,r,c,d
r      2        r,b,a,c,d
a      2        a,r,b,c,d
c      3        c,a,r,b,d
a      1        a,c,r,b,d
d      4        d,a,c,r,b
a      1        a,d,c,r,b
b      4        b,a,d,c,r
r      4        r,b,a,d,c
a      2        a,r,b,d,c

MTF(abracadabra) = 0,1,2,2,3,1,4,1,4,4,2
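The trace above corresponds to this minimal move-to-front encoder (initial list a,b,r,c,d as on the slide):

```python
def mtf_encode(s, alphabet):
    """For each character, output its current position in the list,
    then move that character to the front of the list."""
    table = list(alphabet)
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.pop(i)
        table.insert(0, c)
    return out

print(mtf_encode("abracadabra", "abrcd"))
# -> [0, 1, 2, 2, 3, 1, 4, 1, 4, 4, 2]
```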

After MTF
● Now we have a string with small numbers: lots of 0s, many 1s, …
● Skewed character frequencies: run arithmetic coding!


BW RL (e.g. bzip)

String S
  -> BWT (Burrows-Wheeler Transform)
  -> MTF (move-to-front)
  -> RLE (run-length encoding)
  -> Order-0 Encoding
  -> Compressed String S'
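The extra RLE stage can be sketched as a simple (symbol, run-length) scheme. This is an illustration only; bzip2's actual run-length coding of the MTF output differs in its details:

```python
from itertools import groupby

def rle(seq):
    """Run-length encode a sequence as (symbol, run length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

# The long zero-runs produced by MTF collapse into short pairs:
print(rle([0, 0, 0, 1, 1, 0, 2]))  # -> [(0, 3), (1, 2), (0, 1), (2, 1)]
```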

Many more BWT-based algorithms
● BW DC: encodes using distance coding instead of MTF
● BW with inversion-frequencies coding
● Booster-based [Ferragina-Giancarlo-Manzini-Sciortino]
● Block-based compressor of Effros et al.

order-0 entropy
Lower bound for compression without context information.
S = "ACABBA" (n = 6):
1/2 `A's: each represented by 1 bit
1/3 `B's: each represented by log(3) bits
1/6 `C's: each represented by log(6) bits
6·H_0(S) = 3·1 + 2·log(3) + 1·log(6)
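The arithmetic on this slide can be checked directly with a small order-0 entropy function (logarithms base 2):

```python
from collections import Counter
from math import log2

def order0_entropy(s):
    """H_0(s) = sum over symbols of (freq/n) * log2(n/freq), in bits."""
    n = len(s)
    return sum(c / n * log2(n / c) for c in Counter(s).values())

s = "ACABBA"
print(6 * order0_entropy(s))             # total bits: n * H_0(S)
print(3 * 1 + 2 * log2(3) + log2(6))     # the slide's expansion
```

Both lines print the same value, confirming 6·H_0(S) = 3·1 + 2·log(3) + 1·log(6).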

order-k entropy = Lower bound for compression with order-k contexts

order-k entropy
mississippi, with k = 1 (the context of each occurrence is the character preceding it):
Context for i: "mssp"
Context for s: "isis"
Context for p: "ip"
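One common way to make this concrete: H_k averages the order-0 entropies of the symbol groups that share the same k-character context. This is a sketch; definitions in the literature vary in details such as whether the context precedes or follows the symbol:

```python
from collections import Counter, defaultdict
from math import log2

def order0_entropy(s):
    n = len(s)
    return sum(c / n * log2(n / c) for c in Counter(s).values())

def orderk_entropy(s, k):
    """H_k: group each character by its preceding k-character context,
    then average the order-0 entropies of the groups, weighted by size."""
    groups = defaultdict(list)
    for i in range(k, len(s)):
        groups[s[i - k:i]].append(s[i])
    return sum(len(g) * order0_entropy(g) for g in groups.values()) / len(s)

s = "mississippi"
print(order0_entropy(s), orderk_entropy(s, 1))
```

As expected, H_1 comes out strictly below H_0: knowing the preceding character helps.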

Part II: Results

Measuring against H_k
● When performing worst-case analysis of lossless text compressors, we usually measure against H_k
● The goal: a bound of the form |A(s)| ≤ c·nH_k(s) + lower-order term
● Optimal: |A(s)| ≤ nH_k(s) + lower-order term

Bounds (on the constant c):

        lower                    upper
BW0     2    [KaplanVerbin07]    3.33 [ManziniGagie07]
BW DC   1.3  [KaplanVerbin07]    1.7  [KaplanLandauVerbin06]
BW RL   1.26 [KaplanVerbin07]    5    [Manzini99]
gzip    1                        1
PPM     1                        1

Surprising!! Since BWT-based compressors work better than gzip in practice!

Possible Explanations
1. Asymptotics: the bounds are asymptotic, and real compressors cut the text into blocks, so the lower-order terms matter.
2. English text is not Markovian! Analyzing on a different model might show BWT's superiority.

Part III: Proofs

Lower bound
● Wish to analyze BW0 = BWT + MTF + Order-0
● Need to show a string s s.t. |BW0(s)| is about 2·nH_k(s)
● Consider string s: 10^3 `a's, 10^6 `b's; its entropy nH_0(s) is small
● BWT(s): same character frequencies
● If MTF(BWT(s)) has about 2·10^3 `1's and the rest `0's, its compressed size is about twice the entropy
● So we need BWT(s) to have many isolated `a's

many isolated `a's
● Goal: find s such that in BWT(s), most `a's are isolated
● Solution: probabilistic.
● BWT is a (≤ n+1)-to-1 function, so a random string s' has a ≥ 1/(n+1) chance of being a BWT-image
● A random string has a ≥ 1 - 1/n^2 chance of having "many" isolated `a's
● Therefore, such a string exists

General Calculation
● s contains pn `a's and (1-p)n `b's; entropy of s: nH(p), where H(p) = -p·log p - (1-p)·log(1-p)
● MTF(BWT(s)) contains 2p(1-p)n `1's, rest `0's; compressed size of MTF(BWT(s)): about nH(2p(1-p))
● Ratio: H(2p(1-p)) / H(p), which tends to 2 as p -> 0
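The limiting ratio can be checked numerically. With H the binary entropy function, H(2p(1-p))/H(p) creeps toward 2 as p shrinks (the convergence is slow, on the order of 1/log(1/p)):

```python
from math import log2

def H(p):
    """Binary entropy in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.1, 0.01, 1e-4, 1e-6, 1e-8):
    q = 2 * p * (1 - p)          # fraction of `1's after MTF(BWT(s))
    print(p, H(q) / H(p))        # ratio approaches 2 from below
```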

Lower bounds on BW DC, BW RL
● Similar technique.
● Taking p infinitesimally small gives a trivially compressible string, so instead maximize the ratio over p.
● Gives weird constants, but quite strong bounds

Experimental Results
● Sanity check: picking texts from the above Markov models really shows this behavior in practice
● Picking text from "realistic" Markov sources also shows non-optimal behavior ("realistic" = generated from actual texts)
● On long Markov text, gzip works better than BWT

Bottom Line
● BWT compressors are not optimal (vs. order-k entropy)
● We believe that they are good because English text is not Markovian.
● Find a theoretical justification!
● Also: improve the constants, find BWT algorithms with better ratios, ...

Thank You!

Additional Slides (taken out for lack of time)

BWT - Invertibility ● Go forward, one character at a time

Main Property: L -> F mapping
● The i-th occurrence of c in L corresponds to the i-th occurrence of c in F.
● This happens because the occurrences of character c in L and the occurrences of character c in F are both sorted by their post-context.
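The L -> F mapping gives a linear-pass inversion. A compact sketch, assuming the sentinel `#` sorts before all other characters (so the first sorted row begins with `#` and the walk reads the text backwards):

```python
from collections import Counter

def inverse_bwt(L):
    """Invert the BWT of a string that was terminated with `#`."""
    n = len(L)
    # rank[i]: number of copies of L[i] occurring in L[:i]
    seen = Counter()
    rank = []
    for c in L:
        rank.append(seen[c])
        seen[c] += 1
    # first[c]: index in F (the sorted L) of the first occurrence of c
    first, acc = {}, 0
    for c in sorted(seen):
        first[c] = acc
        acc += seen[c]
    # Row 0 starts with `#`; follow LF(i) = first[L[i]] + rank[i]
    # to emit the text from last character to first.
    i, out = 0, []
    for _ in range(n - 1):          # n-1 steps: skip the sentinel itself
        out.append(L[i])
        i = first[L[i]] + rank[i]
    return "".join(reversed(out))

print(inverse_bwt("ipssm#pissii"))  # -> mississippi
```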

BW0 vs. Lempel-Ziv
● BW0 dynamically takes advantage of context-regularity
● A robust, smooth alternative to Lempel-Ziv

BW0 vs. Statistical Coding
● Statistical coding (e.g. PPM): builds a model for each context; prediction -> compression
● PPM: explicit partitioning into contexts, producing a model for each context; optimally models each context
● BW0: no explicit partitioning into contexts; exploits similarities between similar contexts

Compressed Text Indexing
● An application of BWT
● A compressed representation of the text that supports:
  fast pattern matching (without decompression!)
  partial decompression
● So, no need to ever decompress! Space usage: |BW0(s)| + o(n)
● See more in [Ferragina-Manzini]

Musings
● On one hand: BWT-based algorithms are not optimal, while Lempel-Ziv is.
● On the other hand: BWT compresses much better in practice.
● Reasons: 1. The results are asymptotic (the engineering reason) 2. English text was not generated by a Markov source (the real reason?)
● Goal: get a more honest way to analyze
● Use a statistic different from H_k?