1 Basics of Data Compression. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

2 Uniquely Decodable Codes. A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It can be parsed as 1·011 (ad) or as 101·1 (ca), so this code is not uniquely decodable. A uniquely decodable code is one whose encoded sequences can always be decomposed into codewords in exactly one way.

3 Prefix Codes. A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie: the symbols sit at the leaves, and the codeword of a symbol is the sequence of 0/1 edge labels on its root-to-leaf path, so decoding never needs to look ahead.
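
As a small aside, a prefix code can be decoded with a single left-to-right scan, because the first codeword that matches is always the right one. A minimal Python sketch of this idea, using the example code above (the dictionary lookup stands in for walking the binary trie):

CODE = {"a": "0", "b": "100", "c": "101", "d": "11"}   # example code from the slide
DECODE = {cw: sym for sym, cw in CODE.items()}         # inverse map, stands in for the trie

def encode(text):
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in DECODE:            # a full codeword has been read: emit it and restart
            out.append(DECODE[buf])
            buf = ""
    assert buf == "", "not a valid sequence of codewords"
    return "".join(out)

print(decode(encode("badcab")))      # -> badcab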

4 Average Length. For a code C with codeword lengths L[s], the average length is defined as L_a(C) = Σ_s p(s) · L[s]. Example: p(A) = .7 with codeword 0 (1 bit), p(B) = p(C) = p(D) = .1 with 3-bit codewords, so L_a = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits). We say that a prefix code C is optimal if L_a(C) ≤ L_a(C') for all prefix codes C'.

5 Entropy (Shannon, 1948). For a source S emitting symbol s with probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits: lower probability means higher information. Entropy is the weighted average of i(s), i.e. H(S) = Σ_s p(s) · log2(1/p(s)); applied to the symbol frequencies of a string T it gives the 0-th order empirical entropy H_0(T). It holds 0 ≤ H ≤ log2 |Σ|: H → 0 for a skewed distribution, H is maximal for the uniform distribution.

6 Performance: Compression ratio. Compression ratio = #bits in output / #bits in input. To judge compression performance we relate the empirical entropy (Shannon's bound) to the average codeword length, and hence the compression ratio, achieved in practice. Example: p(A) = .7, p(B) = p(C) = p(D) = .1, so H ≈ 1.36 bits while Huffman uses ≈ 1.5 bits per symbol. An optimal code is surely one that…
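
A quick numeric check of the figures above, as a Python sketch (the 1.5-bit code used here, A=0, B=10, C=110, D=111, is one possible Huffman code for this distribution):

import math

p = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
code_len = {"A": 1, "B": 2, "C": 3, "D": 3}              # lengths of A=0, B=10, C=110, D=111

H = sum(pr * math.log2(1 / pr) for pr in p.values())     # 0-th order entropy
avg = sum(p[s] * code_len[s] for s in p)                 # average codeword length

print(round(H, 2), round(avg, 2))                        # 1.36 1.5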

7 Index construction: Compression of postings. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading: 5.3 and a paper.

8 γ code for integer encoding. For x > 0, Length = ⌊log2 x⌋ + 1 is the number of bits of its binary representation; the γ code writes Length - 1 zeros followed by the binary representation of x. E.g., 9 = 1001 is represented as 000 1001. The γ code of x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal). Optimal for Pr(x) = 1/(2x²) and i.i.d. integers.

9 It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence: 0001000001100110000011101100111. Answer: 8, 6, 3, 59, 7.
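
A minimal Python sketch of γ encoding/decoding; decoding the bit string above reproduces the answer on the slide:

def gamma_encode(x):                 # x > 0: (length - 1) zeros, then binary(x)
    assert x > 0
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":        # count the unary prefix: length - 1 zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                     # 0001001
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]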

10 δ code for integer encoding. Use γ-coding for the length field (instead of unary), followed by the binary representation of x without its most significant bit: useful for medium-sized integers. E.g., 19 = 10011 is represented as 00101 0011. δ-coding x takes about ⌊log2 x⌋ + 2⌊log2 ⌊log2 x⌋⌋ + 2 bits. Optimal for Pr(x) = 1/(2x (log x)²) and i.i.d. integers.
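
A matching Python sketch for δ coding (γ-coding of the length is inlined so the snippet is self-contained):

def gamma_encode(x):                 # as in the γ sketch above
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def delta_encode(x):                 # gamma-code the length, then binary(x) without its msb
    assert x > 0
    b = bin(x)[2:]
    return gamma_encode(len(b)) + b[1:]

print(delta_encode(19))              # 19 = 10011 -> gamma(5) + 0011 = 001010011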

11 Variable-byte codes [10.2 bits per TREC12]. We wish to (de)compress very fast, hence we byte-align. Given the binary representation of an integer: prepend 0s to get a multiple-of-7 number of bits, form groups of 7 bits each, then append the bit 0 to the last group and the bit 1 to the other groups (tagging). E.g., v = 2^14 + 1, binary(v) = 100000000000001, encoded as 10000001 10000000 00000001. Note: we waste 1 bit per byte, and on average 4 bits in the first byte; but it is a prefix code, and it also encodes the value 0 !!
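
A Python sketch of variable-byte coding with the tagging convention of the slide (last byte of each integer tagged with 0, the others with 1; beware that some implementations use the opposite convention):

def vbyte_encode(x):
    assert x >= 0
    groups = []
    while True:
        groups.append(x & 0x7F)      # lowest 7 bits
        x >>= 7
        if x == 0:
            break
    groups.reverse()                 # most significant group first
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def vbyte_decode(data):
    out, x = [], 0
    for byte in data:
        x = (x << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:         # msb 0 marks the last byte of this integer
            out.append(x)
            x = 0
    return out

enc = vbyte_encode(2**14 + 1)
print([f"{b:08b}" for b in enc])     # ['10000001', '10000000', '00000001']
print(vbyte_decode(enc))             # [16385]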

12 PForDelta coding. Work on a block of 128 numbers at a time; with b = 2 bits per number the block fits in 256 bits = 32 bytes. Use b (e.g. 2) bits to encode each of the 128 numbers, or create exceptions; exceptions are encoded separately (via an ESC value or pointers). Choose b so that ~90% of the values fit, or trade off: a larger b wastes more bits, a smaller b creates more exceptions. Translate the data from [base, base + 2^b - 1] to [0, 2^b - 1].
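
A simplified Python sketch of the idea: pick b so that about 90% of the block fits in b bits and set the rest aside as exceptions. The helper name pfor_split and the placeholder handling of exceptions are illustrative, not the exact PForDelta bit layout:

def pfor_split(block, coverage=0.90):
    assert block, "block must be non-empty"
    base = min(block)
    shifted = [v - base for v in block]             # translate to [0, max - base]
    threshold = sorted(shifted)[int(coverage * (len(shifted) - 1))]
    b = max(1, threshold.bit_length())              # bits chosen to cover ~90% of values
    limit = (1 << b) - 1
    packed, exceptions = [], []
    for pos, v in enumerate(shifted):
        if v <= limit:
            packed.append(v)                        # fits in b bits
        else:
            packed.append(0)                        # placeholder (real PForDelta patches these)
            exceptions.append((pos, v))
    return base, b, packed, exceptions

base, b, packed, exc = pfor_split([3, 4, 3, 5, 3, 4, 90, 3])
print(base, b, len(exc))                            # 3 2 1 (the value 90 is an exception)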

13 A basic problem! Store a dictionary of strings as one text T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... plus an array of pointers to the string beginnings: log m bits per string, n log m bits overall (Space = 32 * n bits in practice), and with the pointers we could even drop the separating NULLs. This is independent of the string-length distribution and effective for few strings, but it is bad for medium/large sets of strings.

14 A basic problem! Drop the separators and concatenate the strings into X = AbacoBattleCarColdCodDefenseGoogleYahoo...., marking the first character of each string with a 1 in a binary vector B = 10000100000100100010010000001000010000..... The explicit pointers are replaced by B (equivalently, by the gaps between consecutive 1s, whose msb we could even drop). We aim at achieving ≈ n log(m/n) bits.

15 Rank/Select. Given a binary vector B = 00101001010101011111110000011010101...., define Rank_b(i) = number of occurrences of bit b in B[1, i], and Select_b(i) = position of the i-th occurrence of bit b in B. E.g., Rank_1(6) = 2 and Select_1(3) = 8. Let m = |B| and n = #1s. There exist data structures that solve this problem in O(1) query time and n log(m/n) + o(m) bits of space.
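
A naive Python sketch of the two operations via precomputed tables; the succinct structures mentioned above achieve the same queries in O(1) time within n log(m/n) + o(m) bits, which this sketch does not attempt:

class BitVector:
    def __init__(self, bits):
        self.bits = bits
        self.prefix = [0]
        for bit in bits:                     # prefix[i] = number of 1s in B[1, i]
            self.prefix.append(self.prefix[-1] + (bit == "1"))
        self.ones = [i + 1 for i, bit in enumerate(bits) if bit == "1"]

    def rank1(self, i):                      # number of 1s in B[1, i]
        return self.prefix[i]

    def select1(self, i):                    # position of the i-th 1 in B
        return self.ones[i - 1]

B = BitVector("00101001010101011111110000011010101")
print(B.rank1(6))                            # 2
print(B.select1(3))                          # 8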

16 Elias-Fano, useful for Rank/Select. Split the position of each 1 in B into its z high bits and its w low bits (in the slide's figure z = 3 and w = 2, so the high parts index the 8 buckets 0..7, written in unary). If w = log(m/n) and z = log n, where m = |B| and n = #1s, then L (the concatenation of the low parts) takes n log(m/n) bits and H (the high parts, written bucket by bucket in unary) takes 2n bits. Select_1 on B is answered using L and Select_1 on H, taking +o(n) extra space; actually you can even do binary search over B, but compressed!
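
A small Python sketch of the Elias-Fano split, encoding the sorted positions of the 1s (the sample values and m are made up for illustration; low parts go to L, high parts are written bucket-wise in unary into H):

import math

def elias_fano(values, m):                   # values: sorted positions of the 1s, in [0, m)
    n = len(values)
    w = max(1, math.ceil(math.log2(m / n)))  # low bits per value, ~log(m/n)
    L = [v & ((1 << w) - 1) for v in values] # n * w bits overall
    H, prev = "", 0
    for h in (v >> w for v in values):       # high parts, non-decreasing
        H += "0" * (h - prev) + "1"          # unary bucket encoding, at most 2n bits
        prev = h
    return w, L, H

print(elias_fano([3, 5, 8, 9, 13], m=16))    # (2, [3, 1, 0, 1, 1], '10101101')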

17 If you wish to play with Rank and Select (more in a next lecture…): practical implementations take about m/10 + n log(m/n) bits and answer Rank in 0.4 μsec and Select in < 1 μsec. For binary search, which needs only Select, compare the 2n + n log(m/n) bits of the compressed solution against the 32n bits of explicit pointers.

