
1 Zone indexes Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 6.1

2 Parametric and zone indexes Thus far, a doc has been a term sequence. But documents have multiple parts: Author, Title, Date of publication, Language, Format, etc. These are the metadata about a document. Sec. 6.1

3 Zone A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, … Build inverted indexes on fields AND zones to permit querying, e.g. “find docs with merchant in the title zone and matching the query gentle rain”. Sec. 6.1

4 Example zone indexes Encode zones in dictionary vs. postings. Sec. 6.1
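
A minimal sketch, in Python, of the second option (zone information stored inside the postings); all identifiers and the toy documents are illustrative, not taken from the slides.

    from collections import defaultdict

    # term -> list of postings (docID, set of zones in which the term occurs)
    index = defaultdict(list)

    def add(term, doc_id, zones):
        index[term].append((doc_id, set(zones)))

    add("merchant", 1, {"title", "body"})
    add("merchant", 2, {"body"})
    add("rain", 1, {"body"})

    def docs_with_term_in_zone(term, zone):
        return [doc for doc, zones in index[term] if zone in zones]

    print(docs_with_term_in_zone("merchant", "title"))   # [1]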

5 Tiered indexes Break postings up into a hierarchy of lists: most important, …, least important. The inverted index is thus broken up into tiers of decreasing importance. At query time use the top tier unless it fails to yield K docs; if so, drop to the lower tiers. Sec. 7.2.1
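
A toy sketch of the query-time behaviour just described (tier contents and names are illustrative): answer from the top tier, and fall back to lower tiers only when it yields fewer than K docs.

    # Each tier maps a term to its postings; tier 0 holds the most important docs.
    tiers = [
        {"rain": [7, 12]},              # tier 0
        {"rain": [3, 7, 12, 40, 55]},   # tier 1
    ]

    def retrieve(term, K):
        postings = []
        for tier in tiers:
            postings = tier.get(term, [])
            if len(postings) >= K:      # this tier yields enough docs: stop here
                break
        return postings[:K]

    print(retrieve("rain", 2))   # [7, 12]         served by tier 0
    print(retrieve("rain", 4))   # [3, 7, 12, 40]  falls back to tier 1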

6 Example tiered index Sec. 7.2.1

7 Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper

8 γ code for integer encoding x > 0, with Length = ⌊log2 x⌋ + 1. E.g., 9 is represented as 000 1001: Length−1 zeros, followed by the binary representation of x. The γ code of x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal). Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
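
A small Python sketch of γ encoding/decoding as described above; the decoder also answers the exercise on the next slide.

    def gamma_encode(x):
        assert x > 0
        b = bin(x)[2:]                  # binary representation: floor(log2 x)+1 bits
        return "0" * (len(b) - 1) + b   # (Length-1) zeros, then the binary digits

    def gamma_decode(bits):
        out, i = [], 0
        while i < len(bits):
            zeros = 0
            while bits[i] == "0":       # unary part: Length-1 zeros
                zeros += 1
                i += 1
            out.append(int(bits[i:i + zeros + 1], 2))
            i += zeros + 1
        return out

    print(gamma_encode(9))                                  # 0001001
    print(gamma_decode("0001000001100110000011101100111"))  # [8, 6, 3, 59, 7]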

9 It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence: 0001000001100110000011101100111 → 8 6 3 59 7

10 δ code for integer encoding Use γ-coding to reduce the length of the first field. Useful for medium-sized integers. E.g., 19 is represented as 00101 0011: γ(5) for the length of bin(19) = 10011, followed by 10011 without its leading 1. δ-coding x takes about ⌊log2 x⌋ + 2⌊log2(⌊log2 x⌋)⌋ + 2 bits. Optimal for Pr(x) = 1/(2x (log x)²), and i.i.d. integers.
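
A matching sketch of δ encoding: γ-encode the length of x's binary representation, then emit that representation without its leading 1.

    def gamma_encode(x):
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def delta_encode(x):
        assert x > 0
        b = bin(x)[2:]
        return gamma_encode(len(b)) + b[1:]   # gamma(length) + binary without the leading 1

    print(delta_encode(19))   # 001010011 = gamma(5) followed by 0011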

11 Variable-byte codes [10.2 bits per TREC12] Wish to get very fast (de)compression → byte-align. Given the binary representation of an integer: prepend 0s to get a multiple-of-7 number of bits; form groups of 7 bits each; prepend to the last group the bit 0, and to the other groups the bit 1 (tagging). E.g., v = 2^14 + 1 → binary(v) = 100000000000001 → 10000001 10000000 00000001. Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it encodes also the value 0 !!
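
A sketch of the byte-aligned scheme just described (returned here as a string of tagged bytes for readability).

    def vbyte_encode(v):
        b = bin(v)[2:]
        b = "0" * (-len(b) % 7) + b                          # pad to a multiple of 7 bits
        groups = [b[i:i + 7] for i in range(0, len(b), 7)]
        tagged = ["1" + g for g in groups[:-1]] + ["0" + groups[-1]]   # tag bits in front
        return " ".join(tagged)

    print(vbyte_encode(2**14 + 1))   # 10000001 10000000 00000001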

12 PForDelta coding A block of 128 numbers encoded with b bits each (e.g. b = 2) fits in 128 · 2 = 256 bits = 32 bytes. Use b bits to encode each of the 128 numbers, or create exceptions for the values that do not fit; exceptions are encoded via ESC values or pointers. Choose b so that about 90% of the values are encoded directly, or trade off: larger b → more wasted bits, smaller b → more exceptions. Translate the data: [base, base + 2^b − 1] → [0, 2^b − 1].
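
A deliberately simplified sketch of the PForDelta idea: values that fit in b bits go into the packed area, the others become (position, value) exceptions. Real implementations use the ESC/pointer layouts mentioned above and bit-pack the b-bit area.

    def pfor_encode(block, b):
        limit = 1 << b
        packed, exceptions = [], []
        for pos, v in enumerate(block):
            if v < limit:
                packed.append(v)
            else:
                packed.append(0)                 # placeholder in the b-bit area
                exceptions.append((pos, v))
        return b, packed, exceptions

    def pfor_decode(b, packed, exceptions):
        out = list(packed)
        for pos, v in exceptions:
            out[pos] = v
        return out

    block = [1, 0, 2, 3, 1, 42, 2, 1]            # toy block (real blocks hold 128 values)
    print(pfor_decode(*pfor_encode(block, 2)) == block)   # True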

13 Index construction: Compression of documents Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79

14 Uniquely Decodable Codes A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It could be parsed as 1·011 = ad or as 101·1 = ca. A uniquely decodable code can always be uniquely decomposed into its codewords.

15 Prefix Codes A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie whose leaves are the symbols and whose root-to-leaf paths spell the codewords.

16 Average Length For a code C over a source with probabilities p(s) and codeword lengths L[s], the average length is defined as L_a(C) = Σ_s p(s) · L[s]. E.g., for p(A) = .7 [codeword 0] and p(B) = p(C) = p(D) = .1 [3-bit codewords]: L_a = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits). We say that a prefix code C is optimal if for all prefix codes C’, L_a(C) ≤ L_a(C’).

17 Entropy (Shannon, 1948) For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits. Lower probability → higher information. Entropy is the weighted average of i(s): H(S) = Σ_s p(s) · log2(1/p(s)). The 0-th order empirical entropy of a string T is obtained by replacing p(s) with the observed frequency occ(s)/|T|: H_0(T) = Σ_s (occ(s)/|T|) · log2(|T|/occ(s)).

18 Performance: Compression ratio Compression ratio = #bits in output / #bits in input. Compression performance: we relate the empirical entropy (Shannon's bound) to the average codeword length achieved in practice. E.g., p(A) = .7, p(B) = p(C) = p(D) = .1: H ≈ 1.36 bits, while Huffman ≈ 1.5 bits per symbol.
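
A quick check of these numbers, comparing the entropy of the source with the 1.6-bit code of slide 16.

    from math import log2

    p = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
    L = {"A": 1, "B": 3, "C": 3, "D": 3}          # codeword lengths of slide 16

    H  = sum(p[s] * log2(1 / p[s]) for s in p)    # ~1.357 bits per symbol
    La = sum(p[s] * L[s] for s in p)              # 1.6 bits per symbol
    print(round(H, 3), round(La, 3))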

19 Statistical Coding How do we use probability p(s) to encode s? Huffman codes Arithmetic codes

20 Document Compression Huffman coding

21 Huffman Codes Invented by Huffman as a class assignment in the ’50s. Used in most compression algorithms: gzip, bzip, jpeg (as option), fax compression, … Properties: generates optimal prefix codes; cheap to encode and decode; L_a(Huff) = H if probabilities are powers of 2; otherwise L_a(Huff) < H + 1, i.e. less than 1 extra bit per symbol on average!!

22 Running Example p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5. Repeatedly merging the two least probable subtrees creates the internal nodes (.3), (.5), (1) and yields a = 000, b = 001, c = 01, d = 1. There are 2^(n−1) “equivalent” Huffman trees (flip the 0/1 labels at any internal node). What about ties (and thus, tree depth)?

23 Encoding and Decoding Encoding: emit the root-to-leaf path leading to the symbol to be encoded. Decoding: start at the root and take the branch indicated by each received bit; when a leaf is reached, output its symbol and return to the root. With the codes of the running example: abc… → 000·001·01 = 00000101, and 101001… → 1·01·001 = dcb.
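
A sketch of Huffman construction with a heap, plus encoding; it reproduces the codeword lengths of the running example, while the exact bits depend on the arbitrary 0/1 and tie choices discussed above.

    import heapq
    from itertools import count

    def huffman_codes(probs):
        tiebreak = count()                      # keeps the heap from comparing dicts
        heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)     # the two least probable subtrees
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
        return heap[0][2]

    codes = huffman_codes({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5})
    print(codes)                                # lengths 3, 3, 2, 1 as in the example
    print("".join(codes[s] for s in "abc"))     # concatenation of the codewords for a, b, c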

24 Problem with Huffman Coding Take a two-symbol alphabet Σ = {a,b}. Whatever their probabilities, Huffman uses 1 bit for each symbol and thus takes n bits to encode a message of n symbols. This is ok when the probabilities are almost the same, but what about p(a) = .999? The optimal code for a is log2(1/.999) ≈ .0014 bits, so optimal coding should use about n * .0014 bits, which is much less than the n bits taken by Huffman.

25 Document Compression Arithmetic coding

26 Introduction It uses “fractional” parts of bits!! Gets nH(T) + 2 bits vs. nH(T) + n of Huffman. Used in JPEG/MPEG (as an option) and in Bzip. More time-costly than Huffman, but an integer implementation is not too bad. Ideal performance in theory; in practice the overhead is about 0.02 · n bits.

27 Symbol interval Assign each symbol an interval within [0, 1), of length equal to its probability. E.g., with p(a) = .2, p(b) = .5, p(c) = .3: cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7, so a → [0, .2), b → [.2, .7), c → [.7, 1). The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7)).

28 Sequence interval Coding the message sequence bac with p(a) = .2, p(b) = .5, p(c) = .3: b restricts [0, 1) to [.2, .7) (size .5); a then restricts it to [.2, .2 + .5·.2) = [.2, .3) (size .1); c finally restricts it to [.2 + .1·.7, .3) = [.27, .3) (size .03). The final sequence interval is [.27, .3).

29 The algorithm To code a sequence of symbols T1…Tn with probabilities p(·), start from l_0 = 0, s_0 = 1 and for each symbol set l_i = l_{i−1} + s_{i−1} · cum[T_i] and s_i = s_{i−1} · p(T_i). With p(a) = .2, p(b) = .5, p(c) = .3, the message bac ends with the interval [0.27, 0.3).

30 The algorithm Each symbol narrows the interval by a factor p(T_i). The final interval size is s_n = ∏_{i=1..n} p(T_i). The sequence interval is [l_n, l_n + s_n); take (and encode) a number inside it.
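
A sketch of the interval-narrowing loop, using the model of the example; it returns the final interval [l_n, l_n + s_n).

    p   = {"a": 0.2, "b": 0.5, "c": 0.3}
    cum = {"a": 0.0, "b": 0.2, "c": 0.7}

    def sequence_interval(msg):
        l, s = 0.0, 1.0
        for sym in msg:
            l = l + s * cum[sym]     # l_i = l_{i-1} + s_{i-1} * cum[T_i]
            s = s * p[sym]           # s_i = s_{i-1} * p(T_i)
        return l, s

    l, s = sequence_interval("bac")
    print(round(l, 4), round(l + s, 4))   # 0.27 0.3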

31 Decoding Example Decoding the number .49, knowing the message has length 3: .49 falls in b’s interval [.2, .7); subdividing [.2, .7), .49 falls in b’s sub-interval [.3, .55); subdividing [.3, .55), .49 falls in c’s sub-interval [.475, .55). The message is bbc.
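
A sketch of the decoder using the equivalent rescaling view of the same process: find the symbol interval containing the number, output the symbol, rescale the number back to [0, 1), and repeat.

    p   = {"a": 0.2, "b": 0.5, "c": 0.3}
    cum = {"a": 0.0, "b": 0.2, "c": 0.7}

    def decode(x, n):
        out = []
        for _ in range(n):
            for sym in "abc":
                lo, hi = cum[sym], cum[sym] + p[sym]
                if lo <= x < hi:
                    out.append(sym)
                    x = (x - lo) / p[sym]   # rescale to [0, 1)
                    break
        return "".join(out)

    print(decode(0.49, 3))   # bbc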

32 How do we encode that number? If x = v/2^k (a dyadic fraction) then the encoding is simply bin(v) over k digits (possibly padded with 0s in front).

33 How do we encode that number? Binary fractional representation, generated incrementally: FractionalEncode(x): 1. x = 2 * x; 2. if x < 1 output 0; 3. else { output 1; x = x − 1 }. E.g., for x = 1/3: 2 * (1/3) = 2/3 < 1, output 0; 2 * (2/3) = 4/3 > 1, output 1 and continue with 4/3 − 1 = 1/3; so the expansion 0101… repeats.

34 Which number do we encode? Take the midpoint l_n + s_n/2 of the sequence interval and truncate its binary expansion to the first d = ⌈log2(2/s_n)⌉ bits. Truncation gets a smaller number… how much smaller? By at most 2^(−d) ≤ s_n/2, so the truncated number still lies inside [l_n, l_n + s_n). Truncation → compression.
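
A sketch of the final emission step, assuming the midpoint convention described above: the truncated value 36/128 = 0.28125 indeed lies inside [.27, .3).

    from math import ceil, log2

    def arithmetic_codeword(l, s):
        d = ceil(log2(2 / s))                 # number of bits to keep
        v = int((l + s / 2) * (1 << d))       # midpoint truncated to d fractional bits
        return format(v, "0{}b".format(d))

    print(arithmetic_codeword(0.27, 0.03))    # 0100100  (d = 7 bits, value 36)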

35 Bound on code length Theorem: For a text T of length n, the Arithmetic encoder generates at most ⌈log2(2/s_n)⌉ < 1 + log2(2/s_n) = 1 + (1 − log2 s_n) = 2 − log2(∏_{i=1..n} p(T_i)) = 2 − log2(∏_σ p(σ)^occ(σ)) = 2 − Σ_σ occ(σ) · log2 p(σ) ≈ 2 + Σ_σ (n · p(σ)) · log2(1/p(σ)) = 2 + n H(T) bits. Example: T = acabc, so s_n = p(a) · p(c) · p(a) · p(b) · p(c) = p(a)^2 · p(b) · p(c)^2.

36 Document Compression Dictionary-based compressors

37 LZ77 Algorithm’s step: output a triple ⟨dist, len, next-char⟩ for the longest match found in the dictionary, then advance by len + 1. A buffer “window” of fixed length moves over the text, and the dictionary consists of all substrings starting inside it. Example text: aacaacabcaaaaaa.

38 LZ77 Decoding The decoder keeps the same dictionary window as the encoder: it finds the referenced substring and inserts a copy of it. What if len > dist (the copy overlaps the text still being decompressed)? E.g., seen = abcd, next codeword is (2, 9, e). Simply copy byte by byte starting at the cursor: for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]. The output is correct: abcdcdcdcdcdce.
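
A sketch that wraps the copy loop above into a complete decoder; the literal-only codewords (0, 0, c) used to build the prefix abcd are just a convenience for the example.

    def lz77_decode(codewords):
        out = []
        for dist, length, ch in codewords:
            start = len(out) - dist
            for i in range(length):      # out grows while we copy, so overlaps work
                out.append(out[start + i])
            out.append(ch)
        return "".join(out)

    print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"),
                       (2, 9, "e")]))    # abcdcdcdcdcdce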

39 LZ77 Optimizations used by gzip LZSS: output one of the two formats (0, position, length) or (1, char); typically the second format is used if length < 3. Special greedy parsing: possibly use a shorter match so that the next match is better. A hash table speeds up the searches on triplets. The triples are coded with Huffman’s code.

40 You find this at: www.gzip.org/zlib/

41 Dictionary search Exact string search Paper on Cuckoo Hashing

42 Exact String Search Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them. Hashing ?

43 Hashing with chaining

44 Key issue: a good hash function Basic assumption: uniform hashing. Avg #keys per slot = n · (1/m) = n/m = α (the load factor).

45 Search cost The expected cost of a search is O(1 + α); it is O(1) when m = Θ(n).

46 In practice A trivial hash function is h(k) = k mod p, with p a prime (the table size).

47 A “provably good” hash Split the key K into chunks k_0 k_1 k_2 … k_r of ≈ log2 m bits each, with r ≈ L / log2 m (L = max string length, m = table size, a prime). Then h(K) = ( Σ_{i=0..r} a_i · k_i ) mod m, where each a_i is selected at random in [0, m). Note: it is not necessary to compute (… mod p) mod m.
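
Since the formula above is partly reconstructed, the following sketch should be read as an assumption-laden illustration of that family of hashes: chunk the key, multiply each chunk by a random coefficient, and reduce modulo a prime table size m.

    import random

    m = 2**31 - 1                        # table size, a prime
    CHUNK_BITS = 31                      # ~ log2(m) bits per chunk

    def make_hash(max_len_bytes):
        r = (8 * max_len_bytes) // CHUNK_BITS + 1
        a = [random.randrange(m) for _ in range(r)]     # random coefficients in [0, m)
        def h(key: bytes):
            x = int.from_bytes(key, "big")
            total = 0
            for ai in a:                 # one coefficient per chunk of the key
                total += ai * (x & ((1 << CHUNK_BITS) - 1))
                x >>= CHUNK_BITS
            return total % m
        return h

    h = make_hash(64)
    print(h(b"systile"), h(b"szaibelyite"))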

48 Cuckoo Hashing 2 hash tables, and 2 random choices where an item can be stored. (Figure: two tables currently holding A, B, C and E, D.)

49 A running example (Figure: the tables hold A, B, C and E, D; a new item F arrives.)

50 (Figure: F has been placed; the tables now hold A, B, F, C and E, D.)

51 (Figure: a new item G arrives; the tables hold A, B, F, C and E, D.)

52 (Figure: after the insertion of G and the resulting evictions, the tables hold E, G, B, F, C and A, D.)

53 Cuckoo Hashing Examples The structure can be viewed as a random (bipartite) graph: node = cell, edge = key. (Figure: the items A–G drawn as edges between their two candidate cells.)
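
A sketch of cuckoo insertion with two tables: every key has one candidate cell per table, and inserting into an occupied cell evicts the occupant, which is then re-inserted into its other table. The hash functions here are toy stand-ins, not the random functions assumed by the analysis.

    SIZE = 11
    T = [[None] * SIZE, [None] * SIZE]          # the two tables

    def h(which, key):
        return hash((which, key)) % SIZE        # toy stand-in for two hash functions

    def insert(key, max_evictions=50):
        which = 0
        for _ in range(max_evictions):
            pos = h(which, key)
            key, T[which][pos] = T[which][pos], key   # place key, evict the old occupant
            if key is None:
                return True
            which = 1 - which                   # re-insert the evicted key in the other table
        return False                            # a real implementation would rehash here

    def lookup(key):
        return T[0][h(0, key)] == key or T[1][h(1, key)] == key

    for k in "ABCDEFG":
        insert(k)
    print(all(lookup(k) for k in "ABCDEFG"))    # expected True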

54 Natural Extensions More than 2 hashes (choices) per key: very different analysis (hypergraphs instead of graphs); higher memory utilization (3 choices: 90+% in experiments; 4 choices: about 97%). 2 hashes + bins of size B: balanced allocation and tightly O(1)-size bins; insertion sees a tree of possible evict+insert paths, but more insert time (and random access); more memory… but more local.

55 Dictionary search Prefix-string search Reading 3.1 and 5.2

56 Prefix-string Search Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.

57 Trie: speeding-up searches (Figure: a compacted trie over the dictionary, with edge labels such as “s”, “y”, “z”, “stile”, “zyg”, “etic”, “ial”, “ygy”, “aibelyite”, “czecin”, “omo” and string offsets at the leaves.) Pro: O(p) search time. Cons: space for edge and node labels and for the tree structure.

58 Front-coding: squeezing strings
The sorted dictionary of URLs:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
Front-coded (length of the prefix shared with the previous string, then the remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...
Saves ≈ 45% of the space (gzip may be much better…). The same idea applies to any sorted dictionary, e.g. ….systile syzygetic syzygial syzygy….
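
A sketch of front coding and its inverse on the small dictionary above: each string is stored as the length of the prefix shared with its predecessor plus the remaining suffix.

    def front_encode(sorted_strings):
        out, prev = [], ""
        for s in sorted_strings:
            lcp = 0
            while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
                lcp += 1                       # longest common prefix with the predecessor
            out.append((lcp, s[lcp:]))
            prev = s
        return out

    def front_decode(pairs):
        out, prev = [], ""
        for lcp, suffix in pairs:
            prev = prev[:lcp] + suffix
            out.append(prev)
        return out

    words = ["systile", "syzygetic", "syzygial", "syzygy"]
    print(front_encode(words))   # [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
    print(front_decode(front_encode(words)) == words)   # True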

59 2-level indexing Keep in internal memory a compacted trie (CT) built on a sample of the strings, e.g. the first string of each disk bucket (systile, szaibelyite, …); the buckets on disk store the strings front-coded. 2 advantages: Search ≈ typically 1 I/O; Space ≈ front-coding over the buckets. A disadvantage: trade-off ≈ speed vs space (because of the bucket size). (Figure: the in-memory trie over the sample and the front-coded buckets on disk.)

