TABLE COMPRESSION AND RELATED PROBLEMS Raffaele Giancarlo Dipartimento di Matematica Università di Palermo
Improving Table Compression with Combinatorial Optimization, J. ACM 03: A. L. Buchsbaum, G. S. Fowler and R. Giancarlo. Boosting Textual Compression in Optimal Linear Time, J. ACM 05: P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino. Permutation, Partitions and Combinatorial Compression Boosting, TM 256 Unipa 04: P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino.
Table Compression: feed the table in row-major order to gzip. Example rows: aabba, aabba, aabba, aabba, …, bbbaa.
Table Compression, On-Line (no training): partition the table and compress each part separately with gzip. Example rows: aabba, aabba, aabba, aabba.
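One way to choose the partition is dynamic programming over contiguous column intervals. The sketch below is only an illustration under stated assumptions: zlib stands in for gzip, each column is a string with one character per row, and the names `cost` and `best_contiguous_partition` are hypothetical.

```python
import zlib

def cost(columns):
    """Compressed size in bytes of a column group, fed row by row to zlib."""
    rows = zip(*columns)  # one tuple of characters per table row
    data = "".join("".join(row) for row in rows).encode()
    return len(zlib.compress(data))

def best_contiguous_partition(table):
    """Return (total_cost, intervals) for the cheapest split of the columns
    into contiguous intervals [start, end), each compressed on its own."""
    n = len(table)
    best = [0] + [float("inf")] * n  # best[i]: cheapest cost of columns 0..i-1
    cut = [0] * (n + 1)              # cut[i]: start of the last interval
    for i in range(1, n + 1):
        for j in range(i):
            c = best[j] + cost(table[j:i])
            if c < best[i]:
                best[i], cut[i] = c, j
    intervals, i = [], n
    while i > 0:                     # walk the cuts back to column 0
        intervals.append((cut[i], i))
        i = cut[i]
    return best[n], intervals[::-1]

# Hypothetical toy table: 40 rows, 5 columns.
rows = ["aabba"] * 20 + ["bbbaa"] * 20
table = ["".join(col) for col in zip(*rows)]
total, intervals = best_contiguous_partition(table)
```

By construction the result is never worse than compressing the whole table as one group or compressing every column on its own, since both are candidate partitions.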
Table Compression, Off-Line (training): permute the columns, partition, then compress each part with gzip. Example: rows aabba, aabba, aabba, aabba become aaabb, aaabb, aaabb, aaabb after the column permutation.
Table Compression. On-Line: optimal solution; same speed as gzip; 40-60% gain in compression over gzip and bzip2. Off-Line: good heuristics (Traveling Salesman Problem); tolerably slower than gzip; additional 10-20% gain in compression. Applications: data warehousing; database of multiple alignments (PFAM).
Table Compression, Column Permutations via TSP: build a complete directed weighted graph G in which column T[i] is vertex i and edge (i,j) has weight min(C(T[i]) + C(T[j]), C(T[i]T[j])), where C(·) denotes compressed size. Find a good tour, and therefore a good permutation of the table columns. Then permute, partition, compress.
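The graph construction can be sketched as follows. This is an illustration, not the authors' heuristic: zlib stands in for the compressor C, C(T[i]T[j]) is assumed to mean the two-column table in row-major order, and a plain nearest-neighbor walk (a swapped-in, simple TSP heuristic) produces the tour. All function names are hypothetical.

```python
import zlib

def C(data: bytes) -> int:
    """Compressed size in bytes (zlib as a stand-in for the compressor)."""
    return len(zlib.compress(data))

def edge_weight(ti: bytes, tj: bytes) -> int:
    """min(C(T[i]) + C(T[j]), C(T[i]T[j])) for two columns of equal height."""
    pair = bytes(b for row in zip(ti, tj) for b in row)  # row-major two-column table
    return min(C(ti) + C(tj), C(pair))

def nearest_neighbor_tour(columns):
    """Greedy tour: start at column 0, always move to the cheapest unvisited column."""
    n = len(columns)
    if n == 0:
        return [], []
    w = [[edge_weight(columns[i], columns[j]) if i != j else 0
          for j in range(n)] for i in range(n)]
    tour, unvisited = [0], set(range(1, n))
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: (w[last][j], j))  # ties broken by index
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour, w
```

The returned tour is a column order; the table would then be permuted accordingly before partitioning and compressing.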
The PPC Paradigm (Permute, Partition, Compress): base compressor C, e.g., gzip, Huffman, arithmetic codes. Objects to be compressed: x_1, x_2, …, x_n. Find a suitable permutation of the objects; permute the objects and partition them; compress each piece of the partition separately via C. This boosts the performance of the base compressor C.
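The three steps can be written down generically. The sketch below assumes the permutation and the cut points are already given, that objects are byte strings without newlines, and that every piece is non-empty; `ppc_compress` and `ppc_decompress` are hypothetical names, and zlib is only a stand-in for the base compressor C.

```python
import zlib

def ppc_compress(objects, perm, cuts, C=zlib.compress):
    """Permute the objects, cut the permuted sequence at the given positions,
    and compress each piece separately with the base compressor C."""
    permuted = [objects[i] for i in perm]
    bounds = [0] + list(cuts) + [len(permuted)]
    pieces = [permuted[a:b] for a, b in zip(bounds, bounds[1:])]
    return [C(b"\n".join(piece)) for piece in pieces]

def ppc_decompress(blobs, perm, D=zlib.decompress):
    """Decompress every piece and undo the permutation."""
    permuted = [obj for blob in blobs for obj in D(blob).split(b"\n")]
    objects = [None] * len(perm)
    for pos, i in enumerate(perm):
        objects[i] = permuted[pos]  # object i was moved to position pos
    return objects

objects = [b"aabba", b"bbbaa", b"aabba", b"bbbaa"]
blobs = ppc_compress(objects, perm=[0, 2, 1, 3], cuts=[2])
```

Decompression needs the permutation as side information, which is why an invertible (and cheaply described) permutation matters in practice.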
Back to Table Compression: Binh Dao Vo and Kiem-Phong Vo, Using Column Dependency to Compress Tables, DCC04. Lex sort; PPC.
Back to Table Compression, Column Dependency for Table Compression: elegant algorithms to infer dependency and rearrange data. Theory: NP-hard. Heuristics: 5-50% improvement in compression over TSP reordering.
A Transition Exercise: specialize TSP reordering to strings. String s = x_1 x_2 … x_n; lcp(i,j) = length of the longest common prefix of x_{i+1} … x_n and x_{j+1} … x_n; symbols i and j have relation weight n - lcp(i,j). Example: s = mississippi#.
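The relation weight can be computed directly from the definition. In this minimal sketch the indices i and j are 1-based, as on the slide, so the suffix x_{i+1} … x_n is `s[i:]` in Python; the function names are hypothetical.

```python
def lcp(s: str, i: int, j: int) -> int:
    """Longest common prefix of x_{i+1}..x_n and x_{j+1}..x_n (i, j are 1-based)."""
    a, b = s[i:], s[j:]
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def relation_weight(s: str, i: int, j: int) -> int:
    """Weight n - lcp(i, j): small when symbols i and j share a long right context."""
    return len(s) - lcp(s, i, j)

s = "mississippi#"
print(lcp(s, 1, 4), relation_weight(s, 1, 4))  # prints "4 8"
```

In s = mississippi#, symbol 1 (m) and symbol 4 (s) are both followed by "issi", so lcp(1,4) = 4 and the weight is 12 - 4 = 8.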
A Transition Exercise (continued): define an undirected graph G in which node i is labeled with x_i and edge (i,j) has the weight given by the relation. An optimal tour is given by the lex sort of all cyclic shifts of s. All contexts are packed together optimally: PPC.
The Burrows and Wheeler Transform (1994), s = mississippi#
Cyclic shifts of s:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (the last column, set apart, is bwt(s)):
#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i
bwt(s) = ipssm#pissii
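The transform and its inverse can be reproduced with a naive sketch that sorts all cyclic shifts explicitly. It assumes the end marker # occurs exactly once and sorts before every other symbol (true in ASCII); it takes quadratic space and is meant only to mirror the slide, not to be efficient.

```python
def bwt(s: str) -> str:
    """Last column of the lexicographically sorted cyclic shifts of s."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last: str, end: str = "#") -> str:
    """Rebuild the sorted rotation table column by column, then pick the
    rotation that ends with the unique end marker."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return next(row for row in table if row.endswith(end))

print(bwt("mississippi#"))          # prints "ipssm#pissii"
print(inverse_bwt("ipssm#pissii"))  # prints "mississippi#"
```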
Boosting Textual Compression in Optimal Time: our technique takes a poor compressor A and turns it into a compressor A_boost with a better performance guarantee. Diagram: s → A → c; s → Booster (using A) → c’. The better A is, the better A_boost is. The more compressible s is, the better A_boost is. Qualitatively, we show that c’ is shorter than c if s is compressible. Time(A_boost) = Time(A), i.e., no slowdown. A is used as a black box.
Boosting: technically, we prove that if |c| ≤ λ |s| H_0(s) + µ |s|, then for every k ≥ 0, |c’| ≤ λ |s| H_k(s) + µ |s| + log_2 |s| + g_k, where g_k does not depend on |s|. “Poor” means H_0 bounds for A.
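The empirical entropies H_0 and H_k that the bound refers to can be computed as in the sketch below. It assumes the common definition in which H_k averages the zeroth-order entropy of the symbols that follow each length-k context; the opposite convention (contexts to the right of a symbol) gives slightly different values. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def H0(s: str) -> float:
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

def Hk(s: str, k: int) -> float:
    """k-th order empirical entropy: (1/|s|) * sum over contexts w of
    |w_s| * H0(w_s), where w_s collects the symbols following w in s."""
    if not s:
        return 0.0
    if k == 0:
        return H0(s)
    followers = defaultdict(list)
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])
    return sum(len(f) * H0("".join(f)) for f in followers.values()) / len(s)

print(H0("abababab"), Hk("abababab", 1))  # prints "1.0 0.0"
```

For abababab a memoryless model pays one bit per symbol, while a first-order model pays nothing: this is the gap between an H_0 bound and an H_k bound.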
Boosting, three key components: the Burrows-Wheeler Transform, the suffix tree, and a greedy processing of them. The technique takes a 0th-order compressor A and turns it into a compressor A_boost with a better performance guarantee. We achieve the best known compression ratio.
Boosting Outline: (1) compute the BWT; (2) find the optimal partition of the permuted string via a greedy processing of the suffix tree; (3) compress each piece of the partition separately via the base compressor A.
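To see what a partition of the permuted string looks like, the sketch below cuts bwt(s) by fixed-length contexts: all rows of the sorted rotation table that start with the same k symbols form one piece. This fixed-order partition is a simplification swapped in for illustration; it is not the optimal partition found by the greedy suffix tree processing, and `context_partition` is a hypothetical name.

```python
from itertools import groupby

def context_partition(s: str, k: int):
    """Split bwt(s) into contiguous pieces, one per length-k context,
    where a context is the first k symbols of a sorted rotation."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return [(ctx, "".join(row[-1] for row in rows))
            for ctx, rows in groupby(rotations, key=lambda row: row[:k])]

for ctx, piece in context_partition("mississippi#", 1):
    print(ctx, piece)
# prints:
# # i
# i pssm
# m #
# p pi
# s ssii
```

Each piece could then be handed to the base compressor A on its own; concatenating the pieces in order gives back bwt(s).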
Related Work: Foschini, Grossi, Gupta and Vitter, Fast Compression with a Static Model in High-Order Entropy, DCC04. It can be seen as a compression booster of run-length encoding. Ingredients: BWT; wavelet trees [GGV03]; efficient encoding of the integers [E75].
Related Work: Liefke and Suciu, compression for XML files. Group together XML strings based on similarities; this greatly improves the performance of gzip.
Related Work: Johnson et al., compression of Boolean matrices. Permute the columns so that the number of runs is minimized. NP-hard; actually MAX-SNP-hard. TSP + Hamming distance.
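The link between runs and Hamming distance can be checked on a toy matrix. The sketch assumes runs are counted separately in each non-empty row, so that the total number of runs equals the number of rows plus the sum of Hamming distances between adjacent columns; the helper names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length columns differ."""
    return sum(x != y for x, y in zip(a, b))

def count_runs(rows) -> int:
    """Total number of maximal runs of equal symbols, counted row by row."""
    return sum(1 + sum(x != y for x, y in zip(row, row[1:])) for row in rows)

def permute_columns(rows, perm):
    """Reorder the columns of every row according to perm."""
    return ["".join(row[j] for j in perm) for row in rows]

rows = ["0101", "0101", "1010"]
print(count_runs(rows))                                 # prints 12
print(count_runs(permute_columns(rows, [0, 2, 1, 3])))  # prints 6
```

Since each adjacent column pair contributes exactly its Hamming distance to the run count, minimizing runs amounts to finding a cheap path through the columns under Hamming distance.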
Related Work: Shortest Common Superstring [G97], the oldest instance of Permute, Partition and Compress.
Conclusions: permute data before compression. It is efficient and fun… in particular, if the chosen permutation is not invertible.