A simple storage scheme for strings achieving entropy bounds

Paolo Ferragina and Rossano Venturini
Dipartimento di Informatica, University of Pisa, Italy

The Problem

Given a string S[1, n] drawn from an alphabet Σ of size σ:
- encode S into a compressed data structure S' within entropy bounds;
- extract any substring of Θ(log_σ n) symbols in constant time.

Thus, S' completely replaces S under the RAM model.

Previous works

Sadakane and Grossi [SODA '06] introduced a scheme achieving nH_k(S) + o(n log σ) bits:
- it combines Ziv-Lempel string encoding with succinct dictionaries and data structures for path decoding in LZ-tries.

González and Navarro [CPM '06] simplified it:
- slightly better space complexity in the o() term, but the order k must be fixed in advance;
- it relies on a statistical encoder (namely, Arithmetic coding), succinct binary dictionaries and tables.

In both schemes the o() term depends on k, and the schemes are effective only when k = o(log_σ n).

Our work

We propose a simpler storage scheme that:
- improves the space complexity;
- drops the use of any compressor (either LZ-like or statistical);
- deploys only binary encodings and tables.

An interesting corollary: our scheme, applied to the Burrows-Wheeler transformed string bwt(S), achieves a compressed space of min(nH_k(S), nH_k(bwt(S))) + o(n log σ) bits. This is the first time such a bound is achieved; note that there are cases in which one entropy is smaller than the other.

Our storage scheme

- Set the block length b = ½ log_σ n: S is partitioned into n/b blocks, and there are at most O(σ^b) = O(n^{1/2}) distinct blocks.
- A table T stores the distinct blocks sorted by decreasing frequency of occurrence in S's partition.
- Let B = (ε, 0, 1, 00, 01, 10, 11, 000, ...) be the sequence of binary strings ordered by length and, within the same length, lexicographically. The function enc encodes the i-th block of T with the i-th element of B.
- V is the concatenation enc(S_1) enc(S_2) ... enc(S_{n/b}). The enc()s are not uniquely decodable codewords, so a pointer to the start in V of each codeword is needed: the pointer array P is stored using a two-level storage scheme.
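To fix ideas, here is a minimal Python sketch of the construction (the names T, V, P and enc follow the slide; representing V as a character string of bits is purely illustrative, a real implementation packs bits into machine words):

```python
import math
from collections import Counter

def build_scheme(S, sigma):
    """Build the (T, V, P) representation described on this slide."""
    n = len(S)
    b = max(1, int(0.5 * math.log(n, sigma)))      # block length b = 1/2 log_sigma n
    blocks = [S[i:i + b] for i in range(0, n, b)]  # the n/b blocks (last may be shorter)
    # T: distinct blocks sorted by decreasing frequency in S's partition
    T = [blk for blk, _ in Counter(blocks).most_common()]
    rank = {blk: r for r, blk in enumerate(T)}     # 0-based position of a block in T
    def enc(r):
        # the r-th element (0-based) of B = (e, 0, 1, 00, 01, 10, 11, ...):
        # write r+1 in binary and drop the leading '1'
        return bin(r + 1)[3:]
    V, P = "", [0]
    for blk in blocks:                             # concatenate the codewords
        V += enc(rank[blk])
        P.append(len(V))                           # P[j] = start in V of the j-th codeword
    return T, V, P
```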

Decode a block in constant time

Example: extract the 5th block of S.
- Access P[5] and P[6], and compute len = P[6] - P[5] = 2.
- Fetch the len-bit codeword 00 from V, starting at position P[5].
- Now the codeword is uniquely decodable: its block is found by accessing T in position 2^len + (00)_2 = 2^2 + 0 = 4.

Since len = O(log n) bits, all operations are executed in constant time.
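The same steps in code, continuing the sketch above (indices are 0-based here, whereas the slide's example counts positions from 1):

```python
def extract_block(T, V, P, j):
    """Return the j-th block of S (0-based); O(1) time on a RAM,
    since the codeword spans O(log n) bits."""
    start, end = P[j], P[j + 1]                       # two pointer lookups
    w = V[start:end]                                  # the codeword
    length = end - start                              # len = P[j+1] - P[j]
    # once its length is known, the codeword is uniquely decodable:
    # its block sits at position 2^len + (w)_2 of T (1-based, as on the slide)
    r = (1 << length) + (int(w, 2) if w else 0) - 1   # converted to a 0-based index
    return T[r]

# usage: T, V, P = build_scheme("mississippi", sigma=4)
#        extract_block(T, V, P, 0)   -> the first block of S
```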

Space analysis

Blocks table T:
- O(σ^b) = O(n^{1/2}) entries, each represented with O(log n) bits;
- hence T requires O(n^{1/2} log n) = o(n) bits.

We use a two-level storage scheme [Munro '96] for the starting positions of the encs (the array P), which takes o(n log σ) bits.

The real challenge is to bound the space of V. Let us show it by introducing an alternative encoding whose bound is simpler to evaluate.
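The slide does not spell the two-level scheme out; the following sketch shows the standard trick under the usual parameter choice (the group size g is an assumption here, typically Θ(log n)):

```python
def two_level(P, g):
    """Two-level storage of the pointer array P: keep an absolute pointer
    every g entries, and only a small relative offset in between.
    P[i] is recovered as absolute[i // g] + relative[i].
    With g = Theta(log n) and O(log n)-bit codewords, each offset fits in
    O(log log n) bits, giving o(n log sigma) bits overall."""
    absolute = P[::g]                                     # (|P|/g) full pointers
    relative = [P[i] - P[(i // g) * g] for i in range(len(P))]
    return absolute, relative
```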

Empirical entropy

The 0-th order empirical entropy of S is defined as

  H_0(S) = Σ_{c ∈ Σ} P(c) log_2 (1 / P(c)),

where P(c) is the empirical frequency of the symbol c in S.

We define w_S as the string of symbols following the occurrences of the context w in S. Let S = mississippi and w = si; then w_S = sp.

The k-th order empirical entropy of S is defined as

  H_k(S) = (1/n) Σ_{w ∈ Σ^k} |w_S| · H_0(w_S).
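These definitions are short to compute; a small Python sketch (the function names H0 and Hk are ours, the mississippi example is the slide's):

```python
import math
from collections import Counter

def H0(s):
    """0-th order empirical entropy: sum over symbols of P(c) * log2(1/P(c))."""
    n = len(s)
    return sum(m / n * math.log2(n / m) for m in Counter(s).values())

def Hk(S, k):
    """k-th order empirical entropy: (1/n) * sum over contexts w of |w_S| * H0(w_S)."""
    followers = {}
    for i in range(k, len(S)):
        w = S[i - k:i]
        followers[w] = followers.get(w, "") + S[i]   # w_S: symbols following w
    return sum(len(ws) * H0(ws) for ws in followers.values()) / len(S)

S = "mississippi"
print(round(H0(S), 3))     # 0-th order entropy of S
print(round(Hk(S, 2), 3))  # uses w_S; e.g. the context w = "si" yields w_S = "sp"
```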

Statistical encoding

For every position k < i ≤ n, let F_i denote the empirical frequency of seeing S[i] within w_S, where w = S[i-k, i-1].

Arithmetic encoding represents S within Σ_i log_2 (1/F_i) + 2 bits. Grouping all the terms referring to the same k-th order context w, we obtain a summation upper bounded by nH_k(S) + 2 bits.
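A quick numerical check of this grouping argument, reusing H0, Hk and the math import from the previous sketch:

```python
def arithmetic_bound(S, k):
    """sum_{i > k} log2(1/F_i), where F_i is the empirical frequency of S[i]
    among the symbols following its context S[i-k:i]. Arithmetic encoding
    compresses S within this quantity + 2 bits."""
    followers = {}
    for i in range(k, len(S)):
        followers.setdefault(S[i - k:i], []).append(S[i])
    bits = 0.0
    for i in range(k, len(S)):
        ws = followers[S[i - k:i]]                 # the multiset w_S
        bits += math.log2(len(ws) / ws.count(S[i]))
    return bits

# grouping the terms by context gives exactly sum_w |w_S| * H0(w_S) = n * Hk(S):
S = "mississippi"
assert abs(arithmetic_bound(S, 2) - len(S) * Hk(S, 2)) < 1e-9
```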

Blocked statistical encoding

Let us consider a compressor E that encodes each block S_i of S individually:
- the first k symbols are represented explicitly with k log σ bits;
- the remaining b - k symbols are encoded with the k-th order Arithmetic encoder above.

The codeword so assigned to S_i uniquely identifies it among the other distinct blocks. This blocking approach increases the previous bound by O((n/b) k log σ) = o(n log σ) bits, with k = o(log_σ n): this term accounts for the cost of storing the first k symbols of each of the n/b blocks.

Our bound: |V| + o(n log σ)

Let us show that |V| ≤ |E(S)| ≤ nH_k(S) + o(n log σ):
- the codewords assigned by E are a subset of B;
- the codewords assigned by enc are the shortest binary strings in B;
- hence enc is better than E, because it follows a golden rule of data compression: it assigns the shortest codewords to the most frequent blocks.

Thus, the space occupancy of our scheme is nH_k(S) + o(n log σ) bits, for k = o(log_σ n).

Summary of the main result

We presented a storage scheme with:
- O(1)-time access to any substring of length Θ(log_σ n);
- space occupancy of nH_k(S) + o(n log σ) bits;
- a better space bound in the o() term, and a much simpler approach.

It can be used to convert any succinct data structure into a compressed data structure.

Open problems:
- the o() term should be investigated more deeply, because it usually dominates the k-th order entropy term;
- experiments are needed.

Thank you!!!