
1 Zone indexes Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 6.1

2 Parametric and zone indexes Thus far, a doc has been a term sequence. But documents have multiple parts: Author, Title, Date of publication, Language, Format, etc. These are the metadata about a document. Sec. 6.1

3 Zone A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, … Build inverted indexes on fields AND zones to permit querying, e.g. “find docs with merchant in the title zone and matching the query gentle rain”. Sec. 6.1

4 Example zone indexes Encode zones in dictionary vs. postings. Sec. 6.1
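
A minimal sketch, in Python, of the second option (zone information stored inside the postings); all identifiers and the toy documents are illustrative, not taken from the slides.

    from collections import defaultdict

    # term -> list of postings (docID, set of zones in which the term occurs)
    index = defaultdict(list)

    def add(term, doc_id, zones):
        index[term].append((doc_id, set(zones)))

    add("merchant", 1, {"title", "body"})
    add("merchant", 2, {"body"})
    add("rain", 1, {"body"})

    def docs_with_term_in_zone(term, zone):
        return [doc for doc, zones in index[term] if zone in zones]

    print(docs_with_term_in_zone("merchant", "title"))   # [1]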

5 Tiered indexes Break postings up into a hierarchy of lists: most important, …, least important. The inverted index is thus broken up into tiers of decreasing importance. At query time use the top tier unless it fails to yield K docs; if so, drop to the lower tiers. Sec. 7.2.1
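
A toy sketch of the query-time behaviour just described (tier contents and names are illustrative): answer from the top tier, and fall back to lower tiers only when it yields fewer than K docs.

    # Each tier maps a term to its postings; tier 0 holds the most important docs.
    tiers = [
        {"rain": [7, 12]},              # tier 0
        {"rain": [3, 7, 12, 40, 55]},   # tier 1
    ]

    def retrieve(term, K):
        postings = []
        for tier in tiers:
            postings = tier.get(term, [])
            if len(postings) >= K:      # this tier yields enough docs: stop here
                break
        return postings[:K]

    print(retrieve("rain", 2))   # [7, 12]         served by tier 0
    print(retrieve("rain", 4))   # [3, 7, 12, 40]  falls back to tier 1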

6 Example tiered index Sec. 7.2.1

7 Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper

8 γ code for integer encoding x > 0, with Length = ⌊log2 x⌋ + 1. E.g., 9 is represented as 000 1001: Length−1 zeros, followed by the binary representation of x. The γ code of x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal). Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
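
A small Python sketch of γ encoding/decoding as described above; the decoder also answers the exercise on the next slide.

    def gamma_encode(x):
        assert x > 0
        b = bin(x)[2:]                  # binary representation: floor(log2 x)+1 bits
        return "0" * (len(b) - 1) + b   # (Length-1) zeros, then the binary digits

    def gamma_decode(bits):
        out, i = [], 0
        while i < len(bits):
            zeros = 0
            while bits[i] == "0":       # unary part: Length-1 zeros
                zeros += 1
                i += 1
            out.append(int(bits[i:i + zeros + 1], 2))
            i += zeros + 1
        return out

    print(gamma_encode(9))                                  # 0001001
    print(gamma_decode("0001000001100110000011101100111"))  # [8, 6, 3, 59, 7]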

9 It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence: 0001000001100110000011101100111 → 8 6 3 59 7

10 δ code for integer encoding Use γ-coding to reduce the length of the first field. Useful for medium-sized integers. E.g., 19 is represented as 00101 0011: γ(5) for the length of bin(19) = 10011, followed by 10011 without its leading 1. δ-coding x takes about ⌊log2 x⌋ + 2⌊log2(⌊log2 x⌋)⌋ + 2 bits. Optimal for Pr(x) = 1/(2x (log x)²), and i.i.d. integers.
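
A matching sketch of δ encoding: γ-encode the length of x's binary representation, then emit that representation without its leading 1.

    def gamma_encode(x):
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def delta_encode(x):
        assert x > 0
        b = bin(x)[2:]
        return gamma_encode(len(b)) + b[1:]   # gamma(length) + binary without the leading 1

    print(delta_encode(19))   # 001010011 = gamma(5) followed by 0011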

11 Variable-byte codes [10.2 bits per TREC12] Wish to get very fast (de)compression → byte-align. Given the binary representation of an integer: prepend 0s to get a multiple-of-7 number of bits; form groups of 7 bits each; prepend to the last group the bit 0, and to the other groups the bit 1 (tagging). E.g., v = 2^14 + 1 → binary(v) = 100000000000001 → 10000001 10000000 00000001. Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it encodes also the value 0 !!
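
A sketch of the byte-aligned scheme just described (returned here as a string of tagged bytes for readability).

    def vbyte_encode(v):
        b = bin(v)[2:]
        b = "0" * (-len(b) % 7) + b                          # pad to a multiple of 7 bits
        groups = [b[i:i + 7] for i in range(0, len(b), 7)]
        tagged = ["1" + g for g in groups[:-1]] + ["0" + groups[-1]]   # tag bits in front
        return " ".join(tagged)

    print(vbyte_encode(2**14 + 1))   # 10000001 10000000 00000001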

12 PForDelta coding A block of 128 numbers encoded with b bits each (e.g. b = 2) fits in 128 · 2 = 256 bits = 32 bytes. Use b bits to encode each of the 128 numbers, or create exceptions for the values that do not fit; exceptions are encoded via ESC values or pointers. Choose b so that about 90% of the values are encoded directly, or trade off: larger b → more wasted bits, smaller b → more exceptions. Translate the data: [base, base + 2^b − 1] → [0, 2^b − 1].
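
A deliberately simplified sketch of the PForDelta idea: values that fit in b bits go into the packed area, the others become (position, value) exceptions. Real implementations use the ESC/pointer layouts mentioned above and bit-pack the b-bit area.

    def pfor_encode(block, b):
        limit = 1 << b
        packed, exceptions = [], []
        for pos, v in enumerate(block):
            if v < limit:
                packed.append(v)
            else:
                packed.append(0)                 # placeholder in the b-bit area
                exceptions.append((pos, v))
        return b, packed, exceptions

    def pfor_decode(b, packed, exceptions):
        out = list(packed)
        for pos, v in exceptions:
            out[pos] = v
        return out

    block = [1, 0, 2, 3, 1, 42, 2, 1]            # toy block (real blocks hold 128 values)
    print(pfor_decode(*pfor_encode(block, 2)) == block)   # True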

13 Index construction: Compression of documents Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79

14 Uniquely Decodable Codes A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It could be parsed as 1·011 = ad or as 101·1 = ca. A uniquely decodable code can always be uniquely decomposed into its codewords.

15 Prefix Codes A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie whose leaves are the symbols and whose root-to-leaf paths spell the codewords.

16 Average Length For a code C over a source with probabilities p(s) and codeword lengths L[s], the average length is defined as L_a(C) = Σ_s p(s) · L[s]. E.g., for p(A) = .7 [codeword 0] and p(B) = p(C) = p(D) = .1 [3-bit codewords]: L_a = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits). We say that a prefix code C is optimal if for all prefix codes C’, L_a(C) ≤ L_a(C’).

17 Entropy (Shannon, 1948) For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits. Lower probability → higher information. Entropy is the weighted average of i(s): H(S) = Σ_s p(s) · log2(1/p(s)). The 0-th order empirical entropy of a string T is obtained by replacing p(s) with the observed frequency occ(s)/|T|: H_0(T) = Σ_s (occ(s)/|T|) · log2(|T|/occ(s)).

18 Performance: Compression ratio Compression ratio = #bits in output / #bits in input. Compression performance: we relate the empirical entropy (Shannon's bound) to the average codeword length achieved in practice. E.g., p(A) = .7, p(B) = p(C) = p(D) = .1: H ≈ 1.36 bits, while Huffman ≈ 1.5 bits per symbol.
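
A quick check of these numbers, comparing the entropy of the source with the 1.6-bit code of slide 16.

    from math import log2

    p = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
    L = {"A": 1, "B": 3, "C": 3, "D": 3}          # codeword lengths of slide 16

    H  = sum(p[s] * log2(1 / p[s]) for s in p)    # ~1.357 bits per symbol
    La = sum(p[s] * L[s] for s in p)              # 1.6 bits per symbol
    print(round(H, 3), round(La, 3))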

19 Statistical Coding How do we use probability p(s) to encode s? Huffman codes Arithmetic codes

20 Document Compression Huffman coding

21 Huffman Codes Invented by Huffman as a class assignment in the ’50s. Used in most compression algorithms: gzip, bzip, jpeg (as option), fax compression, … Properties: generates optimal prefix codes; cheap to encode and decode; L_a(Huff) = H if probabilities are powers of 2; otherwise L_a(Huff) < H + 1, i.e. less than 1 extra bit per symbol on average!!

22 Running Example p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5. Repeatedly merging the two least probable subtrees creates the internal nodes (.3), (.5), (1) and yields a = 000, b = 001, c = 01, d = 1. There are 2^(n−1) “equivalent” Huffman trees (flip the 0/1 labels at any internal node). What about ties (and thus, tree depth)?

23 Encoding and Decoding Encoding: emit the root-to-leaf path leading to the symbol to be encoded. Decoding: start at the root and take the branch indicated by each received bit; when a leaf is reached, output its symbol and return to the root. With the codes of the running example: abc… → 000·001·01 = 00000101, and 101001… → 1·01·001 = dcb.
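
A sketch of Huffman construction with a heap, plus encoding; it reproduces the codeword lengths of the running example, while the exact bits depend on the arbitrary 0/1 and tie choices discussed above.

    import heapq
    from itertools import count

    def huffman_codes(probs):
        tiebreak = count()                      # keeps the heap from comparing dicts
        heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)     # the two least probable subtrees
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
        return heap[0][2]

    codes = huffman_codes({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5})
    print(codes)                                # lengths 3, 3, 2, 1 as in the example
    print("".join(codes[s] for s in "abc"))     # concatenation of the codewords for a, b, c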

24 Problem with Huffman Coding Take a two-symbol alphabet Σ = {a,b}. Whatever their probabilities, Huffman uses 1 bit for each symbol and thus takes n bits to encode a message of n symbols. This is ok when the probabilities are almost the same, but what about p(a) = .999? The optimal code for a is log2(1/.999) ≈ .0014 bits, so optimal coding should use about n * .0014 bits, which is much less than the n bits taken by Huffman.

25 Document Compression Arithmetic coding

26 Introduction It uses “fractional” parts of bits!! Gets nH(T) + 2 bits vs. nH(T) + n of Huffman. Used in JPEG/MPEG (as an option) and in Bzip. More time-costly than Huffman, but an integer implementation is not too bad. Ideal performance in theory; in practice the overhead is about 0.02 · n bits.

27 Symbol interval Assign each symbol an interval within [0, 1), of length equal to its probability. E.g., with p(a) = .2, p(b) = .5, p(c) = .3: cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7, so a → [0, .2), b → [.2, .7), c → [.7, 1). The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7)).

28 Sequence interval Coding the message sequence bac with p(a) = .2, p(b) = .5, p(c) = .3: b restricts [0, 1) to [.2, .7) (size .5); a then restricts it to [.2, .2 + .5·.2) = [.2, .3) (size .1); c finally restricts it to [.2 + .1·.7, .3) = [.27, .3) (size .03). The final sequence interval is [.27, .3).

29 The algorithm To code a sequence of symbols T1…Tn with probabilities p(·), start from l_0 = 0, s_0 = 1 and for each symbol set l_i = l_{i−1} + s_{i−1} · cum[T_i] and s_i = s_{i−1} · p(T_i). With p(a) = .2, p(b) = .5, p(c) = .3, the message bac ends with the interval [0.27, 0.3).

30 The algorithm Each symbol narrows the interval by a factor p(T_i). The final interval size is s_n = ∏_{i=1..n} p(T_i). The sequence interval is [l_n, l_n + s_n); take (and encode) a number inside it.
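
A sketch of the interval-narrowing loop, using the model of the example; it returns the final interval [l_n, l_n + s_n).

    p   = {"a": 0.2, "b": 0.5, "c": 0.3}
    cum = {"a": 0.0, "b": 0.2, "c": 0.7}

    def sequence_interval(msg):
        l, s = 0.0, 1.0
        for sym in msg:
            l = l + s * cum[sym]     # l_i = l_{i-1} + s_{i-1} * cum[T_i]
            s = s * p[sym]           # s_i = s_{i-1} * p(T_i)
        return l, s

    l, s = sequence_interval("bac")
    print(round(l, 4), round(l + s, 4))   # 0.27 0.3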

31 Decoding Example Decoding the number .49, knowing the message has length 3: .49 falls in b’s interval [.2, .7); subdividing [.2, .7), .49 falls in b’s sub-interval [.3, .55); subdividing [.3, .55), .49 falls in c’s sub-interval [.475, .55). The message is bbc.
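
A sketch of the decoder using the equivalent rescaling view of the same process: find the symbol interval containing the number, output the symbol, rescale the number back to [0, 1), and repeat.

    p   = {"a": 0.2, "b": 0.5, "c": 0.3}
    cum = {"a": 0.0, "b": 0.2, "c": 0.7}

    def decode(x, n):
        out = []
        for _ in range(n):
            for sym in "abc":
                lo, hi = cum[sym], cum[sym] + p[sym]
                if lo <= x < hi:
                    out.append(sym)
                    x = (x - lo) / p[sym]   # rescale to [0, 1)
                    break
        return "".join(out)

    print(decode(0.49, 3))   # bbc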

32 How do we encode that number? If x = v/2^k (a dyadic fraction) then the encoding is simply bin(v) over k digits (possibly padded with 0s in front).

33 How do we encode that number? Binary fractional representation, generated incrementally: FractionalEncode(x): 1. x = 2 * x; 2. if x < 1 output 0; 3. else { output 1; x = x − 1 }. E.g., for x = 1/3: 2 * (1/3) = 2/3 < 1, output 0; 2 * (2/3) = 4/3 > 1, output 1 and continue with 4/3 − 1 = 1/3; so the expansion 0101… repeats.

34 Which number do we encode? Take the midpoint l_n + s_n/2 of the sequence interval and truncate its binary expansion to the first d = ⌈log2(2/s_n)⌉ bits. Truncation gets a smaller number… how much smaller? By at most 2^(−d) ≤ s_n/2, so the truncated number still lies inside [l_n, l_n + s_n). Truncation → compression.
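
A sketch of the final emission step, assuming the midpoint convention described above: the truncated value 36/128 = 0.28125 indeed lies inside [.27, .3).

    from math import ceil, log2

    def arithmetic_codeword(l, s):
        d = ceil(log2(2 / s))                 # number of bits to keep
        v = int((l + s / 2) * (1 << d))       # midpoint truncated to d fractional bits
        return format(v, "0{}b".format(d))

    print(arithmetic_codeword(0.27, 0.03))    # 0100100  (d = 7 bits, value 36)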

35 Bound on code length Theorem: For a text T of length n, the Arithmetic encoder generates at most ⌈log2(2/s_n)⌉ < 1 + log2(2/s_n) = 1 + (1 − log2 s_n) = 2 − log2(∏_{i=1..n} p(T_i)) = 2 − log2(∏_σ p(σ)^occ(σ)) = 2 − Σ_σ occ(σ) · log2 p(σ) ≈ 2 + Σ_σ (n · p(σ)) · log2(1/p(σ)) = 2 + n H(T) bits. Example: T = acabc, so s_n = p(a) · p(c) · p(a) · p(b) · p(c) = p(a)^2 · p(b) · p(c)^2.

36 Document Compression Dictionary-based compressors

37 LZ77 Algorithm’s step: output a triple ⟨dist, len, next-char⟩ for the longest match found in the dictionary, then advance by len + 1. A buffer “window” of fixed length moves over the text, and the dictionary consists of all substrings starting inside it. Example text: aacaacabcaaaaaa.

38 LZ77 Decoding The decoder keeps the same dictionary window as the encoder: it finds the referenced substring and inserts a copy of it. What if len > dist (the copy overlaps the text still being decompressed)? E.g., seen = abcd, next codeword is (2, 9, e). Simply copy byte by byte starting at the cursor: for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]. The output is correct: abcdcdcdcdcdce.
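
A sketch that wraps the copy loop above into a complete decoder; the literal-only codewords (0, 0, c) used to build the prefix abcd are just a convenience for the example.

    def lz77_decode(codewords):
        out = []
        for dist, length, ch in codewords:
            start = len(out) - dist
            for i in range(length):      # out grows while we copy, so overlaps work
                out.append(out[start + i])
            out.append(ch)
        return "".join(out)

    print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"),
                       (2, 9, "e")]))    # abcdcdcdcdcdce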

39 LZ77 Optimizations used by gzip LZSS: output one of the two formats (0, position, length) or (1, char); typically the second format is used if length < 3. Special greedy parsing: possibly use a shorter match so that the next match is better. A hash table speeds up the searches on triplets. The triples are coded with Huffman’s code.

40 You find this at: www.gzip.org/zlib/

41 Dictionary search Exact string search Paper on Cuckoo Hashing

42 Exact String Search Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them. Hashing ?

43 Hashing with chaining

44 Key issue: a good hash function Basic assumption: uniform hashing. Avg #keys per slot = n · (1/m) = n/m = α (the load factor).

45 Search cost The expected cost of a search is O(1 + α); it is O(1) when m = Θ(n).

46 In practice A trivial hash function is h(k) = k mod p, with p a prime (the table size).

47 A “provably good” hash Split the key K into chunks k_0 k_1 k_2 … k_r of ≈ log2 m bits each, with r ≈ L / log2 m (L = max string length, m = table size, a prime). Then h(K) = ( Σ_{i=0..r} a_i · k_i ) mod m, where each a_i is selected at random in [0, m). Note: it is not necessary to compute (… mod p) mod m.
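
Since the formula above is partly reconstructed, the following sketch should be read as an assumption-laden illustration of that family of hashes: chunk the key, multiply each chunk by a random coefficient, and reduce modulo a prime table size m.

    import random

    m = 2**31 - 1                        # table size, a prime
    CHUNK_BITS = 31                      # ~ log2(m) bits per chunk

    def make_hash(max_len_bytes):
        r = (8 * max_len_bytes) // CHUNK_BITS + 1
        a = [random.randrange(m) for _ in range(r)]     # random coefficients in [0, m)
        def h(key: bytes):
            x = int.from_bytes(key, "big")
            total = 0
            for ai in a:                 # one coefficient per chunk of the key
                total += ai * (x & ((1 << CHUNK_BITS) - 1))
                x >>= CHUNK_BITS
            return total % m
        return h

    h = make_hash(64)
    print(h(b"systile"), h(b"szaibelyite"))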

48 Cuckoo Hashing 2 hash tables, and 2 random choices where an item can be stored. (Figure: two tables currently holding A, B, C and E, D.)

49 A running example (Figure: the tables hold A, B, C and E, D; a new item F arrives.)

50 (Figure: F has been placed; the tables now hold A, B, F, C and E, D.)

51 (Figure: a new item G arrives; the tables hold A, B, F, C and E, D.)

52 (Figure: after the insertion of G and the resulting evictions, the tables hold E, G, B, F, C and A, D.)

53 Cuckoo Hashing Examples The structure can be viewed as a random (bipartite) graph: node = cell, edge = key. (Figure: the items A–G drawn as edges between their two candidate cells.)
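
A sketch of cuckoo insertion with two tables: every key has one candidate cell per table, and inserting into an occupied cell evicts the occupant, which is then re-inserted into its other table. The hash functions here are toy stand-ins, not the random functions assumed by the analysis.

    SIZE = 11
    T = [[None] * SIZE, [None] * SIZE]          # the two tables

    def h(which, key):
        return hash((which, key)) % SIZE        # toy stand-in for two hash functions

    def insert(key, max_evictions=50):
        which = 0
        for _ in range(max_evictions):
            pos = h(which, key)
            key, T[which][pos] = T[which][pos], key   # place key, evict the old occupant
            if key is None:
                return True
            which = 1 - which                   # re-insert the evicted key in the other table
        return False                            # a real implementation would rehash here

    def lookup(key):
        return T[0][h(0, key)] == key or T[1][h(1, key)] == key

    for k in "ABCDEFG":
        insert(k)
    print(all(lookup(k) for k in "ABCDEFG"))    # expected True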

54 Natural Extensions More than 2 hashes (choices) per key: very different analysis (hypergraphs instead of graphs); higher memory utilization (3 choices: 90+% in experiments; 4 choices: about 97%). 2 hashes + bins of size B: balanced allocation and tightly O(1)-size bins; insertion sees a tree of possible evict+insert paths, but more insert time (and random access); more memory… but more local.

55 Dictionary search Prefix-string search Reading 3.1 and 5.2

56 Prefix-string Search Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.

57 Trie: speeding-up searches (Figure: a compacted trie over the dictionary, with edge labels such as “s”, “y”, “z”, “stile”, “zyg”, “etic”, “ial”, “ygy”, “aibelyite”, “czecin”, “omo” and string offsets at the leaves.) Pro: O(p) search time. Cons: space for edge and node labels and for the tree structure.

58 Front-coding: squeezing strings
The sorted dictionary of URLs:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
Front-coded (length of the prefix shared with the previous string, then the remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...
Saves ≈ 45% of the space (gzip may be much better…). The same idea applies to any sorted dictionary, e.g. ….systile syzygetic syzygial syzygy….
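
A sketch of front coding and its inverse on the small dictionary above: each string is stored as the length of the prefix shared with its predecessor plus the remaining suffix.

    def front_encode(sorted_strings):
        out, prev = [], ""
        for s in sorted_strings:
            lcp = 0
            while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
                lcp += 1                       # longest common prefix with the predecessor
            out.append((lcp, s[lcp:]))
            prev = s
        return out

    def front_decode(pairs):
        out, prev = [], ""
        for lcp, suffix in pairs:
            prev = prev[:lcp] + suffix
            out.append(prev)
        return out

    words = ["systile", "syzygetic", "syzygial", "syzygy"]
    print(front_encode(words))   # [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
    print(front_decode(front_encode(words)) == words)   # True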

59 2-level indexing Keep in internal memory a compacted trie (CT) built on a sample of the strings, e.g. the first string of each disk bucket (systile, szaibelyite, …); the buckets on disk store the strings front-coded. 2 advantages: Search ≈ typically 1 I/O; Space ≈ front-coding over the buckets. A disadvantage: trade-off ≈ speed vs space (because of the bucket size). (Figure: the in-memory trie over the sample and the front-coded buckets on disk.)

