A simple storage scheme for strings achieving entropy bounds Paolo Ferragina and Rossano Venturini Dipartimento di Informatica University of Pisa, Italy.


1 A simple storage scheme for strings achieving entropy bounds Paolo Ferragina and Rossano Venturini Dipartimento di Informatica University of Pisa, Italy

2 The Problem
Given a string S[1, n] drawn from an alphabet Σ of size σ:
- encode S in a compressed data structure S' within entropy bounds
- extract any substring of Θ(log_σ n) symbols in constant time
Thus, S' completely replaces S under the RAM model.

3 Previous works
Sadakane and Grossi [SODA'06] introduced a scheme:
- nH_k(S) + o(n log σ) bits
- Ziv-Lempel's string encoding, succinct dictionaries, and data structures for path decoding in LZ-tries
González and Navarro [CPM'06] simplify it:
- slightly better space complexity in the o() term, but the order k must be fixed in advance
- statistical encoder (namely, Arithmetic encoding), succinct binary dictionaries, and tables
The o() term depends on k. The scheme is effective when k = o(log_σ n).

4 Our work
We propose a simpler storage scheme that:
- improves the space complexity
- drops the use of any compressor (either LZ-like or statistical)
- deploys only binary encodings and tables
An interesting corollary:
- our scheme, applied to the Burrows-Wheeler transformed string bwt(S), achieves a compressed space of min(nH_k(S), nH_k(bwt(S))) + o(n log σ) bits
- this is the first time a bound of this kind is achieved
- there are cases in which one entropy is smaller than the other

5 Our storage scheme
- S is partitioned into n/b blocks of b = ½ log_σ n symbols each
- There are O(σ^b) = O(n^½) distinct blocks
- B = {ε, 0, 1, 00, 01, 10, 11, 000, ...}
- A table T stores the distinct blocks sorted by decreasing frequency of occurrence in S's partition
- The function enc encodes the i-th block of T with the i-th element of B
- V holds the codewords enc(·) of the blocks of S, one after the other
- The enc()s are not uniquely decodable codewords, so a pointer to the start in V of each codeword is needed
- P is stored using a two-level storage scheme
[Figure: S split into blocks of b symbols; T ordered by frequency next to B; V = 01000...; P = 1 1 2 2 3 5 6 ...]
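
The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `codeword` and `build` are hypothetical, V is kept as a Python string of '0'/'1' characters rather than a packed bit array, ties in frequency are broken lexicographically (the slide does not say how), and P is stored as a plain list instead of the two-level scheme.

```python
from collections import Counter

def codeword(rank):
    """Return the element of B = {'', '0', '1', '00', '01', ...} at 0-based rank."""
    # rank + 1 written in binary always starts with a 1; dropping that
    # leading bit yields exactly the rank-th string of B.
    return bin(rank + 1)[3:]

def build(S, b):
    """Build (T, V, P) for string S and block length b (hypothetical helper)."""
    blocks = [S[i:i + b] for i in range(0, len(S), b)]
    freq = Counter(blocks)
    # T: distinct blocks by decreasing frequency; the lexicographic
    # tie-break is an assumption made only for determinism.
    T = sorted(freq, key=lambda blk: (-freq[blk], blk))
    rank = {blk: r for r, blk in enumerate(T)}
    parts = [codeword(rank[blk]) for blk in blocks]
    # P[i] is the 1-based start in V of the codeword of block i + 1;
    # one extra entry marks the end of the last codeword.
    P = [1]
    for cw in parts:
        P.append(P[-1] + len(cw))
    return T, "".join(parts), P
```

For example, `build("abababcdab", 2)` splits the string into the blocks ab, ab, ab, cd, ab. The most frequent block ab receives the empty codeword, so V contains a single bit, contributed by cd.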

6 Decode a block in constant time
Extract the 5-th block of S:
- Access P[5] and P[6]
- len = P[6] − P[5] = 2
- Fetch the codeword 00 from V; let d be its value read as a binary number
- Now the codeword is uniquely decodable
- Access T at position 2^len + d = 2^2 + (00)_2 = 2^2 + 0 = 4
Since len = O(log n) bits, all operations are executed in constant time.
[Figure: P = 1 1 2 2 3 5 6 ... with P[5] = 3 and P[6] = 5 highlighted; V = 01000...; T ordered by frequency next to B = {ε, 0, 1, 00, 01, 10, 11, 000, ...}]
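
The worked example can be replayed in Python. The arrays P and V below are read off the slide's figure; the contents of T are placeholder names (hypothetical), since the transcript does not preserve the actual blocks. Slices on a Python string stand in for the word-level bit operations that give constant time on a RAM, so this is only a sketch of the logic.

```python
# Arrays as shown on the slide (positions are 1-based on the slide).
P = [1, 1, 2, 2, 3, 5, 6]   # P[i] = start in V of the codeword of block i
V = "01000"
# Hypothetical table contents: placeholders for the distinct blocks,
# listed by decreasing frequency.
T = ["blk1", "blk2", "blk3", "blk4", "blk5", "blk6", "blk7"]

def table_position(i):
    """1-based position in T of the i-th block of S (i is 1-based)."""
    start, end = P[i - 1], P[i]
    length = end - start                      # len on the slide
    # d is the codeword read as a binary number; the empty codeword gives 0.
    d = int(V[start - 1:end - 1], 2) if length else 0
    return 2 ** length + d

def decode_block(i):
    """Return the i-th block of S by looking it up in T."""
    return T[table_position(i) - 1]
```

Here `table_position(5)` reads P[5] = 3 and P[6] = 5, fetches the two bits 00 from V, and returns 2^2 + 0 = 4, matching the slide.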

7 Space analysis
Blocks table T:
- O(σ^b) = O(n^½) entries
- Each entry is represented with O(log n) bits
- T requires O(n^½ log n) = o(n) bits
We use a two-level storage scheme [Munro 96] for the starting positions of the encs (P):
- O((n/b) log log n) = O((n log log n) / log_σ n) bits
The real challenge is to bound the space of V.
Let us show it by introducing an alternative encoding whose bound is simpler to evaluate.
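
One common reading of a two-level pointer scheme is sketched below: blocks are gathered into groups, the first pointer of each group is stored in full, and every pointer is stored as a small offset from its group's first pointer. The group size, the function names, and the use of Python lists are all assumptions for illustration; the slide only names the technique and cites [Munro 96].

```python
def two_level(P, group):
    """Split pointer array P into absolute group starts and relative offsets."""
    absolute = P[::group]                       # one full pointer per group
    relative = [p - absolute[i // group]        # small offset inside the group
                for i, p in enumerate(P)]
    return absolute, relative

def lookup(absolute, relative, group, i):
    """Recover P[i] (0-based) with two array accesses and one addition."""
    return absolute[i // group] + relative[i]
```

Because each codeword is O(log n) bits long, offsets inside a group of about log n blocks stay within O(log² n) and fit in O(log log n) bits each, which is where the saving over storing every pointer in full comes from.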

8 Empirical entropy
The 0-th order empirical entropy of S is defined as
$H_0(S) = -\sum_{c \in \Sigma} P(c) \log_2 P(c)$
where P(c) is the relative frequency of the symbol c in S.
We define w_S as the string of symbols following the context w in S.
- Let S = mississippi and w = si, then w_S = sp
The k-th order empirical entropy of S is defined as
$H_k(S) = \frac{1}{n} \sum_{w \in \Sigma^k} |w_S| \, H_0(w_S)$
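
The definitions on this slide can be checked numerically with a short Python sketch. It uses the standard definitions of empirical entropy; the function names `followers`, `h0`, and `hk` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def followers(S, w):
    """Return w_S: the symbols that follow each occurrence of context w in S."""
    k = len(w)
    return "".join(S[i + k] for i in range(len(S) - k) if S[i:i + k] == w)

def h0(s):
    """0-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def hk(S, k):
    """k-th order empirical entropy of S, in bits per symbol."""
    ctx = defaultdict(list)
    for i in range(len(S) - k):
        ctx[S[i:i + k]].append(S[i + k])   # group symbols by preceding context
    return sum(len(f) * h0(f) for f in ctx.values()) / len(S)
```

`followers("mississippi", "si")` reproduces the slide's example and yields `"sp"`. A string such as `"abababab"` has one bit of 0-th order entropy per symbol but zero first-order entropy, since each symbol is fully determined by its predecessor.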

9 Statistical encoding
For every position k < i ≤ n, F_i denotes the frequency of seeing S[i] within w_S, where w = S[i−k, i−1].
Arithmetic encoding represents S within $\sum_{i=k+1}^{n} \log_2 (1/F_i) + 2$ bits.
Grouping all the terms referring to the same k-th order context w, we obtain a summation upper bounded by $nH_k(S) + 2$ bits.

10 Blocked statistical encoding
Let us consider a compressor E that encodes each block S_i of S individually:
- the first k symbols are represented explicitly with k log σ bits
- the remaining b − k symbols are encoded with the k-th order Arithmetic encoding
The codeword so assigned to S_i uniquely identifies it among the other distinct blocks.
This blocking approach increases the previous bound by O((n/b) k log σ) = o(n log σ) bits, with k = o(log_σ n):
- this accounts for the cost of storing the first k symbols of the n/b blocks
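
The step from O((n/b) k log σ) to o(n log σ) follows by substituting the block length b = ½ log_σ n from the storage-scheme slide. A short derivation, written out as a sketch:

```latex
\frac{n}{b}\, k \log\sigma
  = \frac{2n}{\log_\sigma n}\, k \log\sigma
  = 2\, n \log\sigma \cdot \frac{k}{\log_\sigma n}
  = o(n \log\sigma)
\quad \text{since } k = o(\log_\sigma n).
```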

11 Our bound: |V| + o(n log σ)
Let us show that |V| ≤ |E(S)| ≤ nH_k(S) + o(n log σ):
- The codewords assigned by E are a subset of B
- The codewords assigned by enc are the shortest binary strings in B
- enc is better than E because it follows a golden rule of data compression: it assigns the shortest codewords to the most frequent blocks
Thus, the space occupancy of our scheme is nH_k(S) + o(n log σ) bits, for k = o(log_σ n).

12 Summary of the main result
We presented a storage scheme that offers:
- O(1) time access to any substring of length Θ(log_σ n)
- Space occupancy of nH_k(S) + o(n log σ) bits
- A better space bound in the o() term and a much simpler approach
This can be used to convert any succinct data structure into a compressed data structure.
Open problems:
- The o() term should be investigated more deeply because it usually dominates the k-th order entropy term
- Experiments are needed

13 Thank you!!!

