
TABLE COMPRESSION AND RELATED PROBLEMS Raffaele Giancarlo Dipartimento di Matematica Università di Palermo.




1 TABLE COMPRESSION AND RELATED PROBLEMS Raffaele Giancarlo Dipartimento di Matematica Università di Palermo

2 Improving Table Compression with Combinatorial Optimization, J. ACM 03: A. L. Buchsbaum, G. S. Fowler and R. Giancarlo
Boosting Textual Compression in Optimal Linear Time, J. ACM 05: P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino
Permutation, Partitions and Combinatorial Compression Boosting, TM 256 Unipa 04: P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino

3 Table Compression
Feed the table in row-major order to gzip.
[Figure: table with rows aabba, aabba, aabba, aabba, …, bbbaa passed to gzip]

4 Table Compression
On-Line (no training): partition the table and compress each part separately.
[Figure: table with rows aabba, aabba, aabba, aabba, split into parts that are each passed to gzip]
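The partition step can be sketched as a small dynamic program. This is only an illustration under stated assumptions: parts are contiguous column intervals, the cost of a part is its zlib-compressed size, and the function names `cost` and `best_partition` are hypothetical, not taken from the slides or the cited paper.

```python
import zlib

def cost(rows, lo, hi):
    """Compressed size of columns [lo, hi), read in row-major order (assumed cost model)."""
    block = "".join(row[lo:hi] for row in rows).encode("ascii")
    return len(zlib.compress(block, 9))

def best_partition(rows):
    """Cheapest split of the columns into contiguous intervals under `cost`."""
    n = len(rows[0])
    best = [0] * (n + 1)   # best[hi] = cheapest total cost for columns [0, hi)
    cut = [0] * (n + 1)    # cut[hi]  = start of the last interval in that solution
    for hi in range(1, n + 1):
        best[hi], cut[hi] = min(
            (best[lo] + cost(rows, lo, hi), lo) for lo in range(hi)
        )
    # Walk the cut points backwards to recover the intervals.
    intervals = []
    hi = n
    while hi > 0:
        intervals.append((cut[hi], hi))
        hi = cut[hi]
    return best[n], intervals[::-1]

rows = ["aabba"] * 4 + ["bbbaa"]
total, intervals = best_partition(rows)
print(total, intervals)
```

On tables this small the zlib header overhead dominates, so the chosen partition is mostly of interest on realistically sized inputs.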

5 Table Compression
Off-Line (training): permute columns, partition, compress.
[Figure: rows aabba, aabba, aabba, aabba become aaabb, aaabb, aaabb, aaabb after the column permutation, then go to gzip]

6 Table Compression
On-Line: optimal solution; same speed as gzip; 40-60% gain in compression over gzip and bzip2
Off-Line: good heuristics (Traveling Salesman Problem); tolerably slower than gzip; additional 10-20% gain in compression
Applications: data warehousing; database of multiple alignments (PFAM)

7 Table Compression: Column Permutations via TSP
Build a complete directed weighted graph G
Column T[i] is vertex i
Weight of (i,j): min(C(T[i]) + C(T[j]), C(T[i]T[j]))
Find a good tour and therefore a good permutation of the table columns
Permute, Partition, Compress
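A minimal sketch of the graph construction follows. It assumes C(x) is the zlib-compressed size of x and that T[i]T[j] means the two-column table read in row-major order. The nearest-neighbour tour is a deliberately simple stand-in, not the heuristics the authors used, and every function name here is hypothetical.

```python
import zlib

def C(data):
    """Assumed base-compressor cost: zlib-compressed size in bytes."""
    return len(zlib.compress(data.encode("ascii"), 9))

def column(rows, i):
    return "".join(row[i] for row in rows)

def pair(rows, i, j):
    """Columns i and j side by side, in row-major order."""
    return "".join(row[i] + row[j] for row in rows)

def weights(rows):
    """w[i][j] = min(C(T[i]) + C(T[j]), C(T[i]T[j])) for i != j."""
    n = len(rows[0])
    single = [C(column(rows, i)) for i in range(n)]
    return [
        [0 if i == j else min(single[i] + single[j], C(pair(rows, i, j)))
         for j in range(n)]
        for i in range(n)
    ]

def nearest_neighbour_order(w, start=0):
    """Greedy tour: always move to the cheapest unvisited column."""
    order = [start]
    left = set(range(len(w))) - {start}
    while left:
        nxt = min(left, key=lambda j: (w[order[-1]][j], j))
        order.append(nxt)
        left.remove(nxt)
    return order

def permute(rows, order):
    return ["".join(row[i] for i in order) for row in rows]

rows = ["aabba"] * 4 + ["bbbaa"]
order = nearest_neighbour_order(weights(rows))
print(order, permute(rows, order))
```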

8 The PPC Paradigm
Base compressor C, e.g., gzip, Huffman, arithmetic codes
Objects to be compressed: x_1, x_2, …, x_n
Find a suitable permutation of the objects
Permute the objects and partition
Compress each piece of the partition separately via C
Boosting the performance of base compressor C

9 Back to Table Compression
Binh Dao Vo and Kiem-Phong Vo, DCC04: Using Column Dependency to Compress Tables
[Figure: table rows 9088771 07922, 9733360 07932, 9084640 07922, 9733600 07932; column values 908/973 and 2/3 set apart; labels: Lex sort, PPC]

10 Back to Table Compression: Column Dependency for Table Compression
Elegant algorithms to infer dependency and rearrange data
Theory: NP-hard
Heuristics: 5-50% improvement in compression over TSP reordering

11 A Transition Exercise: Specialize TSP Reordering to Strings
String x_1 x_2 … x_n
lcp(i,j) = length of the longest common prefix of x_{i+1} … x_n and x_{j+1} … x_n
Symbols i and j have relation weight n - lcp(i,j)
s = mississippi#
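As a quick worked instance of the relation weight on s = mississippi#, here is a small sketch. Positions are 1-based as on the slide; the function names are hypothetical.

```python
def lcp(s, i, j):
    """Length of the longest common prefix of x_{i+1}…x_n and x_{j+1}…x_n (1-based i, j)."""
    a, b = s[i:], s[j:]   # with 1-based positions, the suffix after x_i is s[i:]
    length = 0
    while length < min(len(a), len(b)) and a[length] == b[length]:
        length += 1
    return length

def relation_weight(s, i, j):
    """Weight n - lcp(i, j): symbols with long shared right contexts are close."""
    return len(s) - lcp(s, i, j)

s = "mississippi#"
# The two i's at positions 2 and 5 are followed by "ssissippi#" and "ssippi#".
print(lcp(s, 2, 5), relation_weight(s, 2, 5))  # 3 9
```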

12 A Transition Exercise (continued)
Define an undirected graph G, where node i is labeled with x_i and edge (i,j) has the weight given by the relation
An optimal tour is given by the lex sort of all cyclic shifts of s
All contexts are packed together optimally
PPC

13 The Burrows and Wheeler Transform (1994)
s = mississippi#
Cyclic shifts of s:
  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi
Sort the rows (last column set apart):
  #mississipp i
  i#mississip p
  ippi#missis s
  issippi#mis s
  ississippi# m
  mississippi #
  pi#mississi p
  ppi#mississ i
  sippi#missi s
  sissippi#mi s
  ssippi#miss i
  ssissippi#m i
bwt(s) is the last column: ipssm#pissii
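The transform on this slide can be reproduced with a few lines of code. The sketch below sorts all cyclic shifts explicitly, which is quadratic in space and meant for illustration only. It assumes the string ends with a unique sentinel `#` that sorts before every other symbol; the function names are hypothetical.

```python
def bwt(s):
    """Last column of the lexicographically sorted cyclic shifts of s."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last, sentinel="#"):
    """Invert the transform, assuming s ends with a unique sentinel."""
    first = sorted(last)
    # Stable sort: order[k] is the row whose last symbol is the k-th symbol
    # of the first column, i.e. the row obtained by left-rotating row k.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = last.index(sentinel)   # the row that holds s itself
    out = []
    for _ in range(len(last)):
        out.append(first[row])
        row = order[row]
    return "".join(out)

s = "mississippi#"
print(bwt(s))               # ipssm#pissii
print(inverse_bwt(bwt(s)))  # mississippi#
```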

14 Boosting Textual Compression in Optimal Time
Our technique takes a poor compressor A and turns it into a compressor A_boost with a better performance guarantee
[Diagram: s → A → c; s → Booster → c']
The better A is, the better A_boost is
The more compressible s is, the better A_boost is
Qualitatively, we show that c' is shorter than c, if s is compressible
Time(A_boost) = Time(A), i.e. no slowdown
A is used as a black box

15 Boosting
Our technique takes a poor compressor A and turns it into a compressor A_boost with a better performance guarantee
[Diagram: s → A → c; s → Booster → c']
The better A is, the better A_boost is
The more compressible s is, the better A_boost is
Technically, we prove that $|c| \le \lambda |s| H_0(s) + \mu |s|$ becomes $|c'| \le \lambda |s| H_k(s) + \mu |s| + \log_2 |s| + g_k$
"Poor" means H_0 bounds for A
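For readers who want to experiment with the quantities H_0 and H_k, the sketch below computes them under the usual empirical-entropy definitions, which are assumed here rather than stated on the slide: H_0 is the entropy of the symbol frequencies, and H_k averages the H_0 of the symbols that follow each length-k context. Function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

def hk(s, k):
    """k-th order empirical entropy: weighted H_0 of the followers of each k-context."""
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)

s = "mississippi#"
print(h0(s), hk(s, 1), hk(s, 2))
```

A string such as `abababab` has H_0 equal to 1 bit per symbol but H_1 equal to 0, which is the kind of gap between order-0 and order-k guarantees that the slide refers to.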

16 Three Key Components: Burrows-Wheeler Transform, Suffix Tree and a Greedy Processing of Them
Our technique takes a 0th order compressor A and turns it into a compressor A_boost with a better performance guarantee
[Diagram: s → A → c; s → Booster → c']
The better A is, the better A_boost is
The more compressible s is, the better A_boost is
We achieve the best known compression ratio

17 Boosting Outline
BWT
Find the optimal partition of the permuted string
Greedy processing of the suffix tree
Compress each piece of the partition separately via base compressor A
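The outline can be illustrated with a toy version. The sketch below does not find the optimal partition with a suffix tree; it swaps in the much simpler rule of cutting the BWT wherever the length-k context changes, and it uses an idealized order-0 cost (|piece| · H_0(piece) bits) in place of a real base compressor A. All names are hypothetical.

```python
from collections import Counter
from itertools import groupby
from math import log2

def order0_bits(piece):
    """Idealized order-0 cost of a piece: |piece| * H_0(piece) bits."""
    n = len(piece)
    return -sum(c * log2(c / n) for c in Counter(piece).values())

def context_partition(s, k):
    """Split bwt(s) into pieces whose sorted rotations share a k-symbol prefix."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return [
        "".join(r[-1] for r in group)
        for _, group in groupby(rotations, key=lambda r: r[:k])
    ]

s = "mississippi#"
pieces = context_partition(s, 1)
print(pieces)  # ['i', 'pssm', '#', 'pi', 'ssii']
print(sum(order0_bits(p) for p in pieces), order0_bits("".join(pieces)))
```

Because entropy is concave, the summed cost of the pieces never exceeds the order-0 cost of the whole permuted string, which is the effect the partition step exploits.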

18 Related Work
Foschini, Grossi, Gupta and Vitter, DCC04: Fast Compression with a Static Model in High Order Entropy
It can be seen as a compression booster of run-length encoding
Ingredients: BWT, wavelet trees [GGV03], efficient encoding of the integers [E75]

19 Related Work Liefke and Suciu Compression for XML Files Group Together XML Strings based on similarities Greatly Improves the performance of Gzip

20 Related Work
Johnson et al. 2005: Compression of Boolean Matrices
Permute columns so that the number of runs is minimized
NP-hard; actually MAX-SNP hard
TSP + Hamming distance
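The link between runs and Hamming distance can be checked directly: in a row, each change between adjacent bits opens a new run, so the total number of runs equals the number of rows plus the sum of Hamming distances between adjacent columns. The sketch below verifies this on a tiny matrix and finds the best column order by brute force, which is feasible only for a handful of columns; the function names are hypothetical.

```python
from itertools import permutations

def hamming(matrix, i, j):
    """Number of rows in which columns i and j differ."""
    return sum(row[i] != row[j] for row in matrix)

def total_runs(matrix, order):
    """Total number of runs over all rows once columns are arranged as `order`."""
    runs = 0
    for row in matrix:
        bits = [row[c] for c in order]
        runs += 1 + sum(a != b for a, b in zip(bits, bits[1:]))
    return runs

def path_cost(matrix, order):
    """Sum of Hamming distances between consecutive columns of `order`."""
    return sum(hamming(matrix, a, b) for a, b in zip(order, order[1:]))

matrix = [[0, 1, 0, 1],
          [0, 1, 0, 1],
          [1, 0, 1, 0]]
identity = (0, 1, 2, 3)
best = min(permutations(range(4)), key=lambda o: total_runs(matrix, o))
print(total_runs(matrix, identity), total_runs(matrix, best))  # 12 6
```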

21 Related Work Shortest Common Superstring [G97] Oldest Instance of Permute, Partition and Compress

22 Conclusions
Permute data before compression
It is efficient and fun… in particular, if the chosen permutation is not invertible

