TABLE COMPRESSION AND RELATED PROBLEMS Raffaele Giancarlo Dipartimento di Matematica Università di Palermo.


Improving Table Compression with Combinatorial Optimization. A. L. Buchsbaum, G. S. Fowler and R. Giancarlo. J. ACM 03.
Boosting Textual Compression in Optimal Linear Time. P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino. J. ACM 05.
Permutation, Partitions and Combinatorial Compression Boosting. P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino. TM 256 Unipa 04.

Table Compression
Feed the table in row-major order to gzip.
[Figure: the table aabba aabba aabba aabba … bbbaa is fed to gzip]

Table Compression
On-Line (no training): partition the table and compress each part separately.
[Figure: the table aabba aabba aabba aabba is partitioned and each part is fed to gzip]

Table Compression
Off-Line (training): permute the columns, partition, compress.
[Figure: the table aabba aabba aabba aabba becomes aaabb aaabb aaabb aaabb after column permutation and is then fed to gzip]

Table Compression
On-Line: optimal solution; same speed as gzip; 40-60% gain in compression over gzip and bzip2.
Off-Line: good heuristics (Traveling Salesman Problem); tolerably slower than gzip; additional 10-20% gain in compression.
Applications: data warehousing; database of multiple alignments (PFAM).

Table Compression: Column Permutations via TSP
Build a complete directed weighted graph G: column T[i] is vertex i, and the weight of edge (i,j) is min(C(T[i]) + C(T[j]), C(T[i]T[j])).
Find a good tour and therefore a good permutation of the table columns.
Permute, Partition, Compress.
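The edge weights and a tour heuristic can be sketched as follows. This is a minimal illustration, not the method of the J. ACM 03 paper: zlib stands in for the base compressor C, C(T[i]T[j]) is assumed to mean the two-column table fed in row-major order, and the nearest-neighbour rule is just one simple way to get "a good tour". The names `csize`, `columns`, `weight` and `greedy_tour` are hypothetical.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is an assumed stand-in for C."""
    return len(zlib.compress(data, 9))


def columns(table):
    """Split a table (list of equal-length byte rows) into its columns."""
    return [bytes(row[i] for row in table) for i in range(len(table[0]))]


def weight(ci: bytes, cj: bytes) -> int:
    """min(C(T[i]) + C(T[j]), C(T[i]T[j])) for two columns."""
    # Assumption: T[i]T[j] is the two-column table read in row-major order.
    pair = bytes(b for two in zip(ci, cj) for b in two)
    return min(csize(ci) + csize(cj), csize(pair))


def greedy_tour(cols):
    """Nearest-neighbour heuristic: start at column 0, always take the
    cheapest edge to an unvisited column (ties broken by index)."""
    order, left = [0], set(range(1, len(cols)))
    while left:
        nxt = min(left, key=lambda j: (weight(cols[order[-1]], cols[j]), j))
        order.append(nxt)
        left.remove(nxt)
    return order
```

The returned list is a column permutation; the table would then be permuted, partitioned and compressed piece by piece.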

The PPC Paradigm
Base compressor C, e.g., gzip, Huffman, arithmetic codes.
Objects to be compressed: x_1, x_2, …, x_n.
Find a suitable permutation of the objects.
Permute the objects and partition them.
Compress each piece of the partition separately via C.
This boosts the performance of the base compressor C.
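The three PPC steps can be written as one small function. This is only a sketch under assumptions: the permutation and the cut points are given as inputs (finding good ones is the hard part the talk addresses), zlib plays the role of C, and the name `ppc` and its parameters are hypothetical.

```python
import zlib


def ppc(objects, order, cuts, compress=zlib.compress):
    """Permute `objects` by `order`, split the permuted sequence at the
    indices in `cuts`, and compress each piece separately."""
    permuted = [objects[i] for i in order]
    bounds = [0, *cuts, len(permuted)]
    pieces = [b"".join(permuted[a:b]) for a, b in zip(bounds, bounds[1:])]
    return [compress(p) for p in pieces]
```

For example, `ppc([b"aaaa", b"bbbb", b"aaaa", b"bbbb"], [0, 2, 1, 3], [2])` groups the two similar objects of each kind into one piece before compression.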

Back to Table Compression
Binh Dao Vo and Kiem-Phong Vo, DCC04: Using Column Dependency to Compress Tables.
[Figure labels: Lex sort, PPC]

Back to Table Compression: Column Dependency for Table Compression
Elegant algorithms to infer dependency and rearrange data.
Theory: NP-hard.
Heuristics: 5-50% improvement in compression over TSP reordering.

A Transition Exercise: Specialize TSP Reordering to Strings
String x_1 x_2 … x_n.
lcp(i,j) = length of the longest common prefix of x_{i+1} … x_n and x_{j+1} … x_n.
Symbols i and j have relation weight n - lcp(i,j).
Example: s = mississippi#
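The relation weight can be checked on the example string with a few lines of code. This is a minimal sketch; positions are 1-based as on the slide, so the suffix x_{i+1} … x_n is `s[i:]` in Python, and the function names are hypothetical.

```python
def lcp(s: str, i: int, j: int) -> int:
    """Length of the longest common prefix of x_{i+1}..x_n and x_{j+1}..x_n
    (1-based positions i and j, so the suffixes are s[i:] and s[j:])."""
    a, b = s[i:], s[j:]
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


def relation_weight(s: str, i: int, j: int) -> int:
    """Weight n - lcp(i, j) between symbols i and j."""
    return len(s) - lcp(s, i, j)


s = "mississippi#"
# Symbols 2 and 5 are both 'i'; they are followed by "ssissippi#" and
# "ssippi#", which share the prefix "ssi".
print(lcp(s, 2, 5), relation_weight(s, 2, 5))  # 3 9
```

Symbols followed by similar contexts get small weights, so a cheap tour places them next to each other.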

A Transition Exercise (continued)
Define an undirected graph G, where node i is labeled with x_i and edge (i,j) has the weight given by the relation.
An optimal tour is given by the lex sort of all cyclic shifts of s.
All contexts are packed together optimally.
[Figure label: PPC]

The Burrows and Wheeler Transform (1994)

Cyclic shifts of s = mississippi#:
    mississippi#
    ississippi#m
    ssissippi#mi
    sissippi#mis
    issippi#miss
    ssippi#missi
    sippi#missis
    ippi#mississ
    ppi#mississi
    pi#mississip
    i#mississipp
    #mississippi

Sort the rows (the last column, shown separately, is bwt(s)):
    #mississipp i
    i#mississip p
    ippi#missis s
    issippi#mis s
    ississippi# m
    mississippi #
    pi#mississi p
    ppi#mississ i
    sippi#missi s
    sissippi#mi s
    ssippi#miss i
    ssissippi#m i

bwt(s) = ipssm#pissii
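The construction on this slide translates directly into code. The sketch below sorts all cyclic shifts explicitly, which takes quadratic space and is meant only to mirror the figure; it relies on the end marker # sorting before the letters, which holds for Python's default string ordering.

```python
def bwt(s: str) -> str:
    """Burrows-Wheeler transform by explicit sorting of all cyclic shifts.
    Assumes s ends with a unique marker that sorts before every other symbol."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt("mississippi#"))  # ipssm#pissii
```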

Boosting Textual Compression in optimal time
Our technique takes a poor compressor A and turns it into a compressor A_boost with a better performance guarantee.
[Figure: s → A → c; s → Booster (using A) → c']
Qualitatively, we show that c' is shorter than c if s is compressible.
Time(A_boost) = Time(A), i.e., no slowdown.
A is used as a black box.
The better A is, the better A_boost is.
The more compressible s is, the better A_boost is.

Boosting
Technically, we prove that if $|c| \le \lambda |s| H_0(s) + \mu |s|$, then for any $k \ge 0$
$|c'| \le \lambda |s| H_k(s) + \mu |s| + \log_2 |s| + g_k$.
"Poor" means that only H_0 bounds hold for A.
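The two entropies in these bounds can be computed for a concrete string. The sketch below assumes the usual definition of the k-th order empirical entropy: H_k(s) is the sum, over every length-k context w, of |w_s| H_0(w_s), divided by |s|, where w_s collects the symbols that follow w in s. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols) -> float:
    """Zeroth-order empirical entropy, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return -sum(c / n * log2(c / n) for c in Counter(symbols).values())


def hk(s: str, k: int) -> float:
    """k-th order empirical entropy: each symbol is charged the H_0 cost
    of the group of symbols that share its preceding length-k context."""
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)
```

On a string such as mississippi, `hk(s, 1)` comes out below `h0(s)`, which is the gap the booster exploits.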

Three key components: the Burrows-Wheeler Transform, the suffix tree, and a greedy processing of them.
Our technique takes a 0th order compressor A and turns it into a compressor A_boost with a better performance guarantee.
We achieve the best known compression ratio.

Boosting Outline
Compute the BWT.
Find an optimal partition of the permuted string by greedy processing of the suffix tree.
Compress each piece of the partition separately via the base compressor A.
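The outline can be imitated in a simplified form. The sketch below is not the booster of the paper: instead of the optimal partition found by greedy suffix-tree processing, it cuts the BWT wherever the first k symbols of the sorted rows change (fixed-length contexts), and instead of running a real compressor A it charges each piece its order-0 entropy cost, ignoring all overheads. The names `order0_bits` and `boosted_cost` are hypothetical.

```python
from collections import Counter
from itertools import groupby
from math import log2


def order0_bits(piece: str) -> float:
    """Idealised order-0 cost of a piece: |piece| * H_0(piece) bits."""
    n = len(piece)
    return -sum(c * log2(c / n) for c in Counter(piece).values())


def boosted_cost(s: str, k: int) -> float:
    """Total order-0 cost of the BWT of s when it is partitioned by the
    length-k prefix of each sorted cyclic shift (k = 0: a single piece)."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    total = 0.0
    # Rows sharing their first k symbols are contiguous after sorting.
    for _, rows in groupby(rotations, key=lambda r: r[:k]):
        piece = "".join(r[-1] for r in rows)
        total += order0_bits(piece)
    return total
```

For s = mississippi#, the pieces for k = 1 are i, pssm, #, pi and ssii, with a total cost of 12 bits, against roughly 25 bits for the unpartitioned BWT.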

Related Work
Foschini, Grossi, Gupta and Vitter, DCC04: Fast Compression with a Static Model in High Order Entropy.
It can be seen as a compression booster of run-length encoding.
Ingredients: BWT; wavelet trees [GGV03]; efficient encoding of the integers [E75].

Related Work
Liefke and Suciu: Compression for XML Files.
Group together XML strings based on similarities.
Greatly improves the performance of gzip.

Related Work
Johnson et al.: Compression of Boolean Matrices.
Permute the columns so that the number of runs is minimized.
NP-hard; actually MAX-SNP-hard.
TSP + Hamming distance.
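The link between runs and Hamming distance can be made concrete. Each row contributes one run plus one more for every pair of adjacent columns that differ in that row, so the total number of runs equals the number of rows plus the sum of Hamming distances between adjacent columns. The sketch below checks this on a toy matrix and finds the best column order by brute force, which is feasible only for a handful of columns; the function names are hypothetical and this is not the algorithm of Johnson et al.

```python
from itertools import permutations


def runs(matrix, order) -> int:
    """Total number of runs over all rows, with columns taken in `order`."""
    total = 0
    for row in matrix:
        r = [row[j] for j in order]
        total += 1 + sum(a != b for a, b in zip(r, r[1:]))
    return total


def hamming(matrix, i: int, j: int) -> int:
    """Number of rows in which columns i and j differ."""
    return sum(row[i] != row[j] for row in matrix)


def best_order(matrix):
    """Brute-force column order minimising the number of runs."""
    n = len(matrix[0])
    return min(permutations(range(n)), key=lambda p: runs(matrix, p))
```

Minimising the runs is therefore a minimum-weight Hamiltonian path problem on the columns under Hamming distance, which is where TSP enters.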

Related Work
Shortest Common Superstring [G97]: the oldest instance of Permute, Partition and Compress.

Conclusions
Permute data before compression.
It is efficient and fun… In particular, if the chosen permutation is not invertible.