
Information Retrieval Space occupancy evaluation.


1 Information Retrieval Space occupancy evaluation

2 Storage analysis First we will consider the space for the postings (recall that access to them is sequential). Then we will do the same for the dictionary (recall that access to it is random). Finally we will analyze the storage of the documents; random access is crucial here for "snippet retrieval".

3 Information Retrieval Postings storage

4 Recall that… the: 1, 2, 3, 5, 8, 13, 21, 34; Brutus: 2, 4, 8, 16, 32, 64, 128; Calpurnia: 13, 16

5 Postings: two conflicting forces A term like Calpurnia occurs in maybe one doc out of a million; hence we would like to store each pointer using log2(#docs) bits. A term like the occurs in virtually every doc, so that many bits per occurrence is too much; we would prefer a 0/1 bit vector in this case.
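The trade-off above can be checked with a small Python sketch (the helper names and the one-million-docs figure are illustrative, not from the slides):

```python
import math

def pointer_cost_bits(occurrences, num_docs):
    """Bits to store one posting list as explicit docID pointers."""
    return occurrences * math.ceil(math.log2(num_docs))

def bitvector_cost_bits(num_docs):
    """Bits to store one posting list as a 0/1 incidence vector."""
    return num_docs

num_docs = 1_000_000
# Rare term ("Calpurnia"-like): one occurrence, pointers win by far.
print(pointer_cost_bits(1, num_docs), "vs", bitvector_cost_bits(num_docs))
# Ubiquitous term ("the"-like): ~every doc, the bit vector wins.
print(pointer_cost_bits(num_docs, num_docs), "vs", bitvector_cost_bits(num_docs))
```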

6 Gap-coding for postings Store the gaps between consecutive docIDs. Brutus: 33, 47, 154, 159, 202, … becomes 33, 14, 107, 5, 43, … Advantages: we store smaller integers, and they get smaller and smaller if the docIDs are clustered. How much is the saving obtained by γ-encoding?
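Gap-coding is a two-line transformation; a minimal sketch reproducing the Brutus example from the slide:

```python
def to_gaps(docids):
    """Turn a sorted posting list into gaps between consecutive docIDs."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    """Invert the transformation with a running prefix sum."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

print(to_gaps([33, 47, 154, 159, 202]))  # [33, 14, 107, 5, 43]
```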

7 γ code for integer encoding For x > 0, Length = ⌊log2 x⌋ + 1; the code is Length−1 zeros followed by the binary representation of x. E.g., 9 is represented as 000 1001. The γ code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal). Optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
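The two-field structure (unary length prefix, then binary value) can be sketched in a few lines of Python:

```python
def gamma_encode(x):
    """Elias gamma code: (Length-1) zeros, then the binary representation of x."""
    assert x > 0
    b = bin(x)[2:]                 # binary representation, starts with 1
    return "0" * (len(b) - 1) + b  # total length: 2*floor(log2 x) + 1 bits

print(gamma_encode(9))  # 0001001  (7 bits = 2*3 + 1)
```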

8 It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence: 0001000001100110000011101100111 → 8, 6, 3, 59, 7
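Because the code is prefix-free, greedy left-to-right decoding is unambiguous; a sketch that solves the slide's exercise:

```python
def gamma_decode(bits):
    """Decode a concatenation of gamma codes by counting leading zeros."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":      # unary part: Length-1 zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))  # binary part: Length bits
        i += zeros + 1
    return out

print(gamma_decode("0001000001100110000011101100111"))  # [8, 6, 3, 59, 7]
```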

9 A rough analysis on Zipf law and γ-coding… Zipf's law: the k-th most frequent term occurs ≈ c·n/k times, for a proper constant c depending on the data collection. γ-coding the k-th term's posting list costs: n/k gaps, using 2⌊log2 k⌋ + 1 bits for each gap; log is concave, thus the maximum cost is (2n/k) log2 k + (n/k) bits.

10 Sum over k from 1 to m = 500K Do this by breaking the values of k into groups: group i consists of 2^(i−1) ≤ k < 2^i. Group i has 2^(i−1) components in the sum, each contributing ~(2n/k) log2 k ~ (2ni)/2^(i−1). Summing over i from 1 to 19 (500K terms), we get a net estimate of 340 Mbits. Then add 1 bit per occurrence (there are 1G of them) because of the +1 in γ: 1.34 Gbits ~ 170 MB. A flat 20-bit coding would have required 20 Gbits [~2.5 GB].

12 δ code for integer encoding Use γ-coding to reduce the length of the first field; useful for medium-sized integers. E.g., 19 is represented as 00101 0011 (the γ-code of its length 5, then the binary of 19 without its leading 1). δ-coding x takes about ⌊log2 x⌋ + 2⌊log2 log2 x⌋ + 2 bits. Optimal for Pr(x) = 1/(2x (log2 x)²) and i.i.d. integers.
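A sketch of the δ code built on top of the γ code just described (gamma-code the length, then emit the binary of x minus its leading 1):

```python
def gamma_encode(x):
    """Elias gamma code: (len-1) zeros, then the binary representation of x."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def delta_encode(x):
    """Elias delta code: gamma-code of the bit length, then x's binary
    representation with its leading 1 dropped (it is implicit)."""
    b = bin(x)[2:]
    return gamma_encode(len(b)) + b[1:]

print(delta_encode(19))  # 001010011  (gamma(5) = 00101, then 0011)
```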

13 Variable-byte codes [10.2 bits per TREC12] We wish very fast (de)compression ⇒ byte-alignment. Given the binary representation of an integer: append 0s to the front, to get a multiple-of-7 number of bits; form groups of 7 bits each; append the bit 0 to the last group, and the bit 1 to the other groups (tagging). E.g., v = 2^14 + 1 ⇒ binary(v) = 100000000000001 ⇒ 10000001 10000000 00000001. Note: we waste 1 bit per byte, and on average 4 in the first byte, but it is a prefix code, and it also encodes the value 0!
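A sketch of the byte-aligned scheme with the slide's tagging convention (continuation bit 1 on all bytes except the last):

```python
def vbyte_encode(v):
    """Variable-byte code: 7 payload bits per byte; tag bit is 1 on every
    byte except the last one, which gets 0."""
    groups = []
    while True:
        groups.append(v & 0x7F)   # low 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()              # most significant group first
    return bytes(
        (0x80 | g) if i < len(groups) - 1 else g
        for i, g in enumerate(groups)
    )

print([f"{b:08b}" for b in vbyte_encode(2**14 + 1)])
# ['10000001', '10000000', '00000001']
```

Note that `vbyte_encode(0)` yields the single byte `00000000`, matching the slide's remark that the value 0 is representable.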

14 Fixed Binary Codewords [~8 bits per TREC12] Get fast (de)compression and reduce wasted space: a fixed number of bits [width] for a varying number of items [span]. Example: 38, 17, 13, 34, 6, 2, 1, 3, 1, 2, 3, 1. A flat binary code needs 6 bits per item, 72 in total; the γ-code would need 36 bits. Greedily split and use fixed-binary codes: <12; (6,4: 38,17,13,34), (3,1: 6), (2,7: 2,1,3,1,2,3,1)>, where width = #bits and span = #items. Which widths and spans are preferable? Every group is forced to fit in one machine word!

15 Golomb codes [7.54 bits on TREC12] It is a parametric code: it depends on k. Set d = 2^(⌊log2 k⌋+1) − k, hence d may be represented in ⌊log2 k⌋ bits. Quotient q = ⌊(v−1)/k⌋, written in unary as q 0s followed by a 1; the rest is r = v − k·q − 1. If r < d, use ⌊log2 k⌋ bits (enough); else encode r + d in ⌈log2 k⌉ bits. In the second case r + d ≥ 2d, so the first bits discriminate the two cases in decompression. Useful when the integers are concentrated around k. Example: k = 3, v = 9 ⇒ d = 1, q = 2, r = 2 ⇒ code 001 11. Usually k ≈ 0.69 · mean(v) [Bernoulli model]: optimal for Pr(x) = p(1−p)^(x−1), where mean(x) = 1/p, and i.i.d. integers. If k is a power of 2, we get the simpler Rice code.
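A sketch of the encoder following the slide's conventions (unary quotient as q zeros then a 1, truncated-binary remainder); the function name is illustrative:

```python
import math

def golomb_encode(v, k):
    """Golomb code with parameter k, for v >= 1.
    d = 2^(floor(log2 k)+1) - k; remainders below d use floor(log2 k) bits,
    the others are shifted by d and use ceil(log2 k) bits."""
    d = 2 ** (math.floor(math.log2(k)) + 1) - k
    q, r = (v - 1) // k, (v - 1) % k
    unary = "0" * q + "1"
    if r < d:
        rest = format(r, "0%db" % math.floor(math.log2(k))) if k > 1 else ""
    else:
        rest = format(r + d, "0%db" % math.ceil(math.log2(k)))
    return unary + rest

print(golomb_encode(9, 3))  # 00111  (q=2 -> 001, r=2 >= d=1 -> r+d=3 -> 11)
```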

16 PFor and PForDelta [plot: decompression speed; thick line = decompression R2C, thin line = decompression R2R]

17 IL compression S9 and S16 stuff integers into a 32-bit word using 9/16 possible configurations. Storing positions impacts space by a factor of 4, but even more on decompression time, due to worse cache usage (measured on 1000 queries).

18 Integer coding vs postings length S9 and S16 stuff integers into a 32-bit word using 9/16 possible configurations.

19 Information Retrieval Dictionary storage (several incremental solutions)

20 Recall the Heaps Law… An empirically validated model: V = k·N^b, where b ≈ 0.5, k ≈ 30–100, and N = #tokens. Some considerations: V is decreased by case-folding and stemming; indexing all numbers could make it extremely large (so usually we don't); spelling errors contribute a fair bit to its size.
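The law is a one-liner; a sketch evaluating it with the constants quoted above (any specific corpus would have its own k and b):

```python
def heaps_vocab(N, k=30, b=0.5):
    """Heaps' law V = k * N^b; k and b are corpus-dependent
    (typically k in 30..100, b around 0.5)."""
    return int(k * N ** b)

# Vocabulary estimate for a 1M-token collection, at both ends of the k range:
print(heaps_vocab(1_000_000))         # 30000
print(heaps_vocab(1_000_000, k=100))  # 100000
```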

21 1st solution: basic idea… An array of fixed-width entries: 500,000 terms at 28 bytes/term = 14 MB, searched by binary search. Each entry: 20 bytes for the term, plus 4 bytes each for the frequency and the postings pointer. Wasteful: the average word is 8 characters.

22 2nd solution: raw term-sequence Store the dictionary as a (long) string of characters: ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. We binary search the pointers into this string; the pointer to the next word marks the end of the current word. We hope to save up to 60% of the dictionary space.
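A minimal sketch of the dictionary-as-a-string idea (class and method names are illustrative): one concatenated string, an array of start offsets, and binary search that reads each probed term out of the string.

```python
class StringDictionary:
    """Dictionary as one long string plus start offsets; the next offset
    marks the end of the current term."""
    def __init__(self, sorted_terms):
        self.text = "".join(sorted_terms)
        self.offsets, pos = [], 0
        for t in sorted_terms:
            self.offsets.append(pos)
            pos += len(t)
        self.offsets.append(pos)  # sentinel: end of the last term

    def term(self, i):
        return self.text[self.offsets[i]:self.offsets[i + 1]]

    def lookup(self, term):
        """Binary search over the offsets; returns the term's index or -1."""
        lo, hi = 0, len(self.offsets) - 2
        while lo <= hi:
            mid = (lo + hi) // 2
            t = self.term(mid)
            if t == term:
                return mid
            if t < term:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1

d = StringDictionary(["systile", "syzygetic", "syzygial", "syzygy"])
print(d.lookup("syzygial"))  # 2
```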

23 3rd solution: blocking Store pointers to every k-th term of the string; example below with k = 4. We need to store the term lengths (1 extra byte each): ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo…. We save 12 bytes on 3 pointers and lose 4·1 bytes on term lengths: the net saving is 8 bytes every 4 dictionary words.
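The arithmetic of the trade-off can be checked with a tiny sketch (the 4-byte pointer size matches the fixed-width entry layout of the 1st solution; the function name is illustrative):

```python
def blocking_saving_bytes(num_terms, k=4, ptr_bytes=4, len_bytes=1):
    """Net space saved by blocking: each k-term block drops k-1 pointers
    but adds k one-byte term lengths."""
    blocks = num_terms // k
    saved = blocks * (k - 1) * ptr_bytes  # pointers removed
    lost = blocks * k * len_bytes         # term lengths added
    return saved - lost

print(blocking_saving_bytes(4))        # 8 bytes per 4-term block
print(blocking_saving_bytes(500_000))  # 1_000_000: ~1 MB over the dictionary
```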

24 Impact of k on search & space To search for a dictionary word: binary search down to the 4-term block, then linear search through the terms in the block. Increasing k, we would slow down the linear scan but reduce the dictionary space occupancy (down to ~8 MB).

25 4th solution: front coding Idea: sorted words commonly have a long common prefix, so store only the difference with respect to the previous term within a block of k terms. 8automata8automate9automatic10automation → 8automata (1, e) (1, ic) (1, on): the first term is stored in full; each following entry stores the suffix length to drop from the previous term and the characters to append (e.g., (1, e) encodes automate). Lucene stores blocks of k = 128.
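A sketch of front coding within one block, using the (suffix-to-drop, suffix-to-append) representation from the example (function names are illustrative):

```python
def front_encode(block_terms):
    """First term in full; each next term as (#suffix chars to drop from
    the previous term, chars to append)."""
    out = [block_terms[0]]
    for prev, cur in zip(block_terms, block_terms[1:]):
        common = 0
        while (common < min(len(prev), len(cur))
               and prev[common] == cur[common]):
            common += 1
        out.append((len(prev) - common, cur[common:]))
    return out

def front_decode(encoded):
    """Rebuild the block by trimming and extending the previous term."""
    terms = [encoded[0]]
    for drop, suffix in encoded[1:]:
        prev = terms[-1]
        terms.append(prev[:len(prev) - drop] + suffix)
    return terms

enc = front_encode(["automata", "automate", "automatic", "automation"])
print(enc)  # ['automata', (1, 'e'), (1, 'ic'), (1, 'on')]
```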

