Index construction: Compression of postings


1 Index construction: Compression of postings
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval", Dipartimento di Informatica, Università di Pisa

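2 Delta encoding (Sec. 3.1)
Store each posting list as the gaps between consecutive docIDs rather than the docIDs themselves. Then you compress the resulting integers with variable-length prefix-free codes, as follows…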
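As a minimal sketch of the gap transformation that precedes the codes on the next slides (the function names `to_gaps` and `from_gaps` and the sample postings are illustrative assumptions, not taken from the slides):

```python
def to_gaps(doc_ids):
    """Turn a strictly increasing list of docIDs into gaps.

    The first docID is kept as a gap from 0 (an assumption of this sketch).
    """
    gaps = []
    prev = 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    doc_ids = []
    total = 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


postings = [33, 47, 154, 159, 202]   # hypothetical posting list
gaps = to_gaps(postings)             # small integers, cheaper to encode
restored = from_gaps(gaps)
```

The gaps are what the variable-length codes compress; the smaller they are, the shorter the codewords.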

3 Variable-byte codes
Wish to get very fast (de)compression → byte-align.
Given the binary representation of an integer:
Append 0s to the front, to get a multiple-of-7 number of bits.
Form groups of 7 bits each.
Append to the last group the bit 0, and to the other groups the bit 1 (tagging).
e.g., v = 2^14 + 1 → binary(v) = 100000000000001
Note: We waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
T-nibble: We could design this code over t bits, not just t = 8.
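The steps above can be sketched as follows. This is a hedged illustration: the names `vbyte_encode` and `vbyte_decode` are hypothetical, and placing the tag in the high bit of each byte is an assumption of this sketch (the slide only says the tag bit is attached to each 7-bit group).

```python
def vbyte_encode(v):
    """Encode one non-negative integer as variable-byte.

    Assumption: the tag is the high bit of each byte,
    1 = more bytes follow, 0 = last byte of this integer.
    """
    if v < 0:
        raise ValueError("v must be non-negative")
    groups = [v & 0x7F]            # least significant 7-bit group
    v >>= 7
    while v > 0:
        groups.append(v & 0x7F)
        v >>= 7
    groups.reverse()               # most significant group first
    out = [g | 0x80 for g in groups[:-1]]   # tag 1 on all but the last
    out.append(groups[-1])                  # tag 0 on the last group
    return bytes(out)


def vbyte_decode(data):
    """Decode a concatenation of variable-byte codewords."""
    values = []
    cur = 0
    for byte in data:
        cur = (cur << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:     # tag 0 closes the current integer
            values.append(cur)
            cur = 0
    return values


code = vbyte_encode(2**14 + 1)     # three bytes: 0x81 0x80 0x01
```

Decoding needs only shifts and masks on whole bytes, which is where the speed comes from.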

4 PForDelta coding
[Figure: a block of 128 numbers encoded with b = 2 bits each = 256 bits = 32 bytes.]
Use b (e.g. 2) bits to encode each of the 128 numbers, or create exceptions.
Translate data: [base, base + 2^b - 2] → [0, 2^b - 2].
Encode exceptions with the value 2^b - 1.
Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions.
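A simplified sketch of the idea follows. It keeps the b-bit slots as a Python list instead of packing them into bits, stores exceptions in a separate list in order of appearance, and takes `base` as the block minimum; all of these choices, and the names `pfor_encode` and `pfor_decode`, are assumptions of this sketch rather than the layout used on the slide.

```python
def pfor_encode(block, b):
    """Split a block into b-bit slots plus a list of exceptions."""
    base = min(block)
    escape = (1 << b) - 1          # the value 2^b - 1 marks an exception
    slots, exceptions = [], []
    for x in block:
        d = x - base
        if d < escape:             # fits in [0, 2^b - 2]
            slots.append(d)
        else:
            slots.append(escape)
            exceptions.append(x)   # stored verbatim, outside the slots
    return base, slots, exceptions


def pfor_decode(base, b, slots, exceptions):
    """Rebuild the block, pulling exceptions in order at each escape."""
    escape = (1 << b) - 1
    exc = iter(exceptions)
    return [next(exc) if s == escape else base + s for s in slots]


block = [1, 2, 3, 1, 42, 2, 23, 1]          # hypothetical small block
base, slots, exceptions = pfor_encode(block, b=2)
```

With b = 2 only 42 and 23 become exceptions here; raising b would absorb them at the cost of more bits per slot.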

5 γ-code
The code is (Length - 1) zeros followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.
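A minimal sketch of the γ (gamma) encoder, producing the codeword as a string of '0'/'1' characters for readability (the string representation and the name `gamma_encode` are assumptions of this sketch):

```python
def gamma_encode(x):
    """Elias gamma code: (Length - 1) zeros, then x in binary."""
    if x <= 0:
        raise ValueError("gamma code is defined for x > 0")
    binary = bin(x)[2:]                    # x in binary, no leading zeros
    return "0" * (len(binary) - 1) + binary


code9 = gamma_encode(9)   # 3 zeros followed by 1001
```

The codeword length is 2 * (Length - 1) + 1, matching the bit count stated above.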

6 It is a prefix-free encoding…
Given a sequence of γ-coded integers, reconstruct the original sequence. Integers in the slide's example: 8 59 7 6 3
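Because the code is prefix-free, a concatenation of γ (gamma) codewords can be split without separators: count the leading zeros, then read that many bits plus one. A sketch under the assumption that the input is a well-formed string of '0'/'1' characters (the bit string below is built for illustration, it is not the one from the slide):

```python
def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codewords given as a '0'/'1' string."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # unary part: Length - 1 zeros
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))   # Length bits of x
        i += zeros + 1
    return values


# Codewords for 8, 59, 7, 6, 3 written back to back.
stream = "0001000" + "00000111011" + "00111" + "00110" + "011"
decoded = gamma_decode_stream(stream)
```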

7 δ-code
The code is γ(Length) followed by x in binary: use γ-coding to reduce the length of the first field.
Useful for medium-sized integers.
e.g., 19 (binary 10011, Length = 5) is represented as <00101, 10011>.
δ-coding x takes about log2 x + 2 log2(log2 x) + 2 bits.
Optimal for Pr(x) = 1/(2x (log x)^2), and i.i.d. integers.
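A sketch of the δ (delta) code built on top of the γ (gamma) code. One assumption is made loudly here: the second field keeps all Length bits of x, which agrees with the "+ 2" in the bit count on the slide; the textbook Elias delta code drops the leading 1 of x and is one bit shorter. The function names are hypothetical.

```python
def gamma_encode(x):
    """Elias gamma: (Length - 1) zeros, then x in binary (x > 0)."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x):
    """gamma(Length) followed by x in binary (leading 1 kept: an assumption)."""
    if x <= 0:
        raise ValueError("delta code is defined for x > 0")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary


def delta_decode(bits):
    """Decode a single codeword produced by delta_encode."""
    zeros = 0
    while bits[zeros] == "0":              # unary part of gamma(Length)
        zeros += 1
    length = int(bits[zeros:2 * zeros + 1], 2)
    start = 2 * zeros + 1
    return int(bits[start:start + length], 2)


code19 = delta_encode(19)   # gamma(5) = 00101, then 10011
```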

8 Elias-Fano
If w = log (m/n) and z = log n, where m = |B| and n = #1, then:
L takes n w = n log (m/n) bits.
H takes n 1s + n 0s = 2n bits (written in unary).
[Figure: example with z = 3, w = 2, using Select1 on H.]
How to get the i-th number? Take the i-th group of w bits in L, and then represent the value (Select1(H,i) - i) in z bits.
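A small sketch of the Elias-Fano split and of the access rule above. It assumes that H writes, for each of the 2^z buckets of high bits, one 1 per element followed by a closing 0, so that Select1(H,i) - i counts the zeros before the i-th element; L and H are kept as Python lists and Select1 is a linear scan, which are simplifications. The sample values and function names are illustrative.

```python
def ef_encode(values, w, z):
    """Split sorted values (each < 2^(w+z)) into low parts L and unary bits H."""
    L = [v & ((1 << w) - 1) for v in values]   # w low bits of each value
    H = []
    bucket = 0
    for v in values:
        high = v >> w
        while bucket < high:                   # close the buckets passed over
            H.append(0)
            bucket += 1
        H.append(1)                            # one 1 per element
    while bucket < (1 << z):                   # close the remaining buckets
        H.append(0)
        bucket += 1
    return L, H


def select1(H, i):
    """1-indexed position of the i-th 1 in H (naive scan)."""
    seen = 0
    for pos, bit in enumerate(H, start=1):
        seen += bit
        if bit and seen == i:
            return pos
    raise IndexError("fewer than i ones in H")


def ef_access(L, H, w, i):
    """Return the i-th value (1-indexed): high part from H, low part from L."""
    high = select1(H, i) - i
    return (high << w) | L[i - 1]


values = [1, 4, 7, 18, 24, 26, 30, 31]   # n = 8 numbers below m = 32
L, H = ef_encode(values, w=2, z=3)       # len(H) == 2n == 16
```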

9 Rank and Select data structures

10 A basic problem!
D = Abaco, Battle, Car, Cold, Cod, ....
Array of n string pointers to strings of total length m: (n log m) bits = 32 n bits. It depends on the number of strings; it is independent of string length.
[Figure: D drawn as the concatenation Abaco Battle Car Cold Cod .... with the bit vector B beneath it; spaces are introduced for simplicity.]

11 Rank/Select
Wish to index the bit vector B (possibly compressed), with m = |B| and n = #1.
Rankb(i) = number of b in B[1,i]
Selectb(i) = position of the i-th b in B
e.g., Rank1(6) = 2 on the slide's B.
Two approaches: (1) takes |B| + o(|B|) bits of space; (2) aims at achieving n log(m/n) bits.
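To pin down the two definitions, here is a naive linear-time sketch with 1-indexed positions. The bit vector `B` below is a hypothetical example chosen so that Rank1(6) = 2; it is not the vector drawn on the slide.

```python
def rank(B, b, i):
    """Number of occurrences of bit b in B[1, i] (positions are 1-indexed)."""
    return sum(1 for bit in B[:i] if bit == b)


def select(B, b, i):
    """1-indexed position of the i-th occurrence of bit b in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    raise IndexError("B contains fewer than i occurrences of b")


B = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]   # hypothetical bit vector
r = rank(B, 1, 6)                    # ones among the first 6 bits
s = select(B, 1, 3)                  # position of the third 1
```

Both functions scan B; the index structures that follow aim to answer the same queries in constant time.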

12 The Bit-Vector Index: |B| + o(|B|)
m = |B|, n = #1s.
Goal. B is read-only, and the additional index takes o(m) bits.
[Figure: Rank. B is divided into buckets of Z bits storing (absolute) Rank1 values (e.g. 8, 18), each divided into blocks of z bits storing (bucket-relative) Rank1 values; a table with columns block, pos, #1 has rows such as (0000, 1) and (1011, 2).]
Setting Z = poly(log m) and z = (1/2) log m:
Extra space is (m/Z) log m + (m/z) log Z + o(m) = O(m loglog m / log m) = o(m) bits.
Rank time is O(1).
The term o(m) is crucial in practice; B is untouched (not compressed).
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!
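The two-level layout can be sketched as below. This is an illustration under stated assumptions: Z and z are small fixed constants rather than functions of log m, the final in-block count is a short scan instead of the precomputed (block, pos) table, and the class name `RankIndex` is hypothetical.

```python
class RankIndex:
    """Two-level Rank1 index over a read-only list of 0/1 bits."""

    def __init__(self, bits, Z=16, z=4):
        if Z % z != 0:
            raise ValueError("Z must be a multiple of z")
        self.bits, self.Z, self.z = bits, Z, z
        self.absolute = []   # Rank1 before each bucket of Z bits
        self.relative = []   # Rank1 inside the bucket, before each block of z bits
        total = rel = 0
        for start in range(0, len(bits), z):
            if start % Z == 0:           # a new bucket begins here
                self.absolute.append(total)
                rel = 0
            self.relative.append(rel)
            ones = sum(bits[start:start + z])
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s in B[1, i], with 0 <= i <= len(B)."""
        if i == 0:
            return 0
        blk = (i - 1) // self.z          # block holding position i
        start = blk * self.z
        in_block = sum(self.bits[start:i])   # stands in for the lookup table
        return self.absolute[start // self.Z] + self.relative[blk] + in_block


bits = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1] * 4    # hypothetical 40-bit vector
index = RankIndex(bits)
r6 = index.rank1(6)
```

A query adds one absolute counter, one bucket-relative counter, and a count over at most z bits, so its cost does not grow with m.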

13 The Select operation
m = |B|, n = #1s.
B is split into subarrays: the size r is variable, growing until the subarray includes k 1s.
Sparse case: if r > k^2, we store explicitly the positions of the k 1s.
Dense case: k ≤ r ≤ k^2, recurse... One level is enough!! ... still need a table of size o(m).
Setting k ≈ polylog m:
Extra space is o(m), and B is not touched!
Select time is O(1).
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!
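The sketch below illustrates only the first step of this construction: cutting B into variable-size pieces that each contain k 1s and remembering where each piece starts. It deliberately replaces the sparse/dense treatment with a plain scan inside the piece, so it is not constant time; the class name `SampledSelect` and the choice of k are assumptions.

```python
class SampledSelect:
    """Select1 via the start position of every group of k ones, plus a scan."""

    def __init__(self, bits, k=4):
        self.bits, self.k = bits, k
        self.samples = []    # 0-indexed position of the 1st, (k+1)-th, ... one
        count = 0
        for pos, bit in enumerate(bits):
            if bit:
                if count % k == 0:
                    self.samples.append(pos)
                count += 1
        self.ones = count

    def select1(self, i):
        """1-indexed position of the i-th 1 in B."""
        if not 1 <= i <= self.ones:
            raise IndexError("i out of range")
        pos = self.samples[(i - 1) // self.k]   # jump to the right group
        remaining = (i - 1) % self.k            # ones still to skip
        while remaining > 0:                    # scan inside the group
            pos += 1
            if self.bits[pos]:
                remaining -= 1
        return pos + 1


bits = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]   # hypothetical bit vector
sel = SampledSelect(bits, k=2)
third = sel.select1(3)
```

The scan is exactly the part that the sparse case (explicit positions) and the dense case (a small table) remove in the full structure.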

14 Via Elias-Fano (B is not needed)
Recall that by setting w = log (m/n) and z = log n, where m = |B| and n = #1, then Space = n log (m/n) bits + 2n bits.
(Build Select1 on H, so we need extra |H| + o(|H|) bits = 2n + o(n) bits.)
[Figure: example with z = 3, w = 2.]
Select1(i) on B → uses L and (Select1(H,i) - i), in +o(n) space.
Rank1(i) on B → needs binary search over B.
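One way to read the last line: when only Select1 is available, Rank1 can be answered by binary searching for the largest i whose Select1(i) does not exceed the query position. A sketch, where `select1` is any callable returning the 1-indexed position of the i-th 1 (here a hypothetical list lookup stands in for the Elias-Fano access):

```python
def rank1_via_select(select1, n, p):
    """Number of 1s in B[1, p], using O(log n) calls to select1."""
    lo, hi = 0, n                      # answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2       # upper middle, so the loop terminates
        if select1(mid) <= p:
            lo = mid                   # the mid-th 1 is at or before p
        else:
            hi = mid - 1
    return lo


positions = [3, 5, 8, 10]                      # hypothetical positions of the 1s
select1 = lambda i: positions[i - 1]           # stand-in for Select1 on B
r = rank1_via_select(select1, len(positions), 6)
```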

15 If you wish to play with Rank and Select
m/10 + n log (m/n) bits, Rank in 0.4 msec, Select in < 1 msec, vs 32n bits of explicit pointers

