
1 Inverted Index

2 The Inverted Index
The inverted index is a set of posting lists, one for each term in the lexicon
Example: Doc 1: A C F   Doc 2: B E D B   Doc 3: A B D F
If we only want to store docIDs, B’s posting list will be: 2 3
If we also want to store positions within docIDs, B’s posting list will be: (2; 1, 4), (3; 2) – or, as a flat list, 2 2 1 4 3 1 2 (docID, number of positions, then the positions, for each document)
Positions increase the size of the posting list!

3 The Inverted Index
The inverted index is a set of posting lists, one for each term in the lexicon
From now on, we will assume that posting lists are simply lists of document ids
Document ids in a posting list are sorted
–A posting list is simply an increasing list of integers
The inverted index is very large
–We discuss methods to compress the inverted index
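
To make this docID-only view concrete, here is a minimal Python sketch (not part of the slides; the docs dictionary and function name are made up for illustration):

from collections import defaultdict

def build_index(docs):
    # One sorted docID posting list per term; positions are ignored,
    # matching the assumption made on this slide.
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "A C F".split(), 2: "B E D B".split(), 3: "A B D F".split()}
# build_index(docs)["B"] -> [2, 3], the posting list of B from the example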

4 Postings compression
The postings file is much larger than the dictionary, by a factor of at least 10.
Key desideratum: store each posting compactly.
A posting for our purposes is a docID.
For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
Alternatively, we can use log₂ 800,000 ≈ 20 bits per docID.
Our goal: use a lot less than 20 bits per docID.

5 Postings: two conflicting forces
A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log₂ 1M ≈ 20 bits.
A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.
–Prefer a 0/1 bitmap vector in this case

6 Postings file entry
We store the list of docs containing a term in increasing order of docID.
–computer: 33, 47, 154, 159, 202 …
Consequence: it suffices to store gaps.
–33, 14, 107, 5, 43 …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.
–What happens if we use a fixed-length encoding?
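
A small sketch of the gap transformation (hypothetical helper names, not from the slides):

def to_gaps(doc_ids):
    # First docID kept as-is, then each entry is the difference to its predecessor.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    # Cumulative sums restore the original increasing docID list.
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

# to_gaps([33, 47, 154, 159, 202]) -> [33, 14, 107, 5, 43], as on the slide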

7 Three postings entries (figure)

8 Variable length encoding
Aim:
–For arachnocentric, we will use ~20 bits/gap entry.
–For the, we will use ~1 bit/gap entry.
If the average gap for a term is G, we want to use ~log₂ G bits/gap entry.
Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
This requires a variable length encoding
Variable length codes achieve this by using short codes for small numbers

9 Types of Compression Methods
Length
–Variable byte
–Variable bit
Encoding/decoding prior information
–Non-parameterized
–Parameterized

10 Types of Compression Methods
We will start by discussing non-parameterized methods
–Variable byte
–Variable bit
Afterwards we discuss two parameterized methods that are both variable bit

11 Variable Byte Compression
Document ids (= numbers) are stored using a varying number of bytes
Numbers are byte-aligned
Many compression methods have been developed. We discuss:
–Varint
–Length-Precoded Varint
–Group Varint

12 Varint codes
For a gap value G, we want to use close to the fewest bytes needed to hold log₂ G bits
Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c
If G ≤ 127, binary-encode it in the 7 available bits and set c = 1
Else split G into 7-bit groups and encode them from the highest-order group to the lowest-order group, one byte per group, using the same algorithm
At the end set the continuation bit of the last byte to 1 (c = 1) – and for the other bytes c = 0.
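
A runnable sketch of this variable-byte scheme (7 payload bits per byte, high-order groups first, continuation bit set only on the last byte); the function names are illustrative:

def vb_encode_number(n):
    # Split n into 7-bit groups, most significant group first.
    byte_list = []
    while True:
        byte_list.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    byte_list[-1] += 128          # continuation bit c = 1 on the last byte only
    return byte_list

def vb_decode(byte_stream):
    numbers, n = [], 0
    for byte in byte_stream:
        if byte < 128:            # c = 0: more bytes of this number follow
            n = 128 * n + byte
        else:                     # c = 1: this is the last byte of the number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

# vb_encode_number(824) -> [6, 184], i.e. 00000110 10111000 as on the next slide
# vb_decode([6, 184, 133, 13, 12, 177]) -> [824, 5, 214577]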

13 Example
docIDs:        824                  829         215406
gaps:                               5           214577
varint code:   00000110 10111000    10000101    00001101 00001100 10110001
Postings stored as the byte concatenation 000001101011100010000101000011010000110010110001
Key property: varint-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.

14 Length-Precoded Varint
Currently, we must check the first bit of each byte before deciding how to proceed.
Length-Precoded Varint aims at lowering the number of “branch-and-checks”
Store each number in 1-4 bytes. Use the first 2 bits of the first byte to indicate the number of bytes used

15 Example
Values to encode: 1, 15, 511, 131071
Varint encoding:
–7 bits per byte with continuation bit
10000001 10001111 00000011 11111111 00000111 01111111 11111111
Length-Precoded Varint encoding:
–Byte length encoded in the first 2 bits of the first byte
00000001 00001111 01000011 11111111 10000111 11111111 11111111

16 Length-Precoded Varint: Pros and Cons
Pros
–Less branching
–Fewer bit shifts
Cons
–Still requires branching/bit shifts
–What is the largest number that can be represented?

17 Group Varint Encoding
Introduced by Jeff Dean (Google)
Idea: encode groups of 4 values in 5-17 bytes
–Pull out the four 2-bit binary lengths into a single byte prefix
–Decoding uses a 256-entry table to determine the masks of the four numbers that follow

18 Example
Values to encode: 1, 15, 511, 131071
Length-Precoded Varint encoding:
–Byte length encoded in the first 2 bits of the first byte
00000001 00001111 01000011 11111111 10000111 11111111 11111111
Group Varint encoding:
00000110 00000001 00001111 00000011 11111111 00000111 11111111 11111111
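
A sketch of Group Varint along these lines. The byte order inside each value is an assumption here (Google's production format stores value bytes little-endian; this sketch uses big-endian), so only the tag byte is guaranteed to match the example above:

def group_varint_encode(values):
    # Encode exactly four non-negative ints (< 2**32): one tag byte holding
    # four 2-bit (length - 1) fields, followed by the value bytes.
    assert len(values) == 4
    tag, body = 0, bytearray()
    for v in values:
        b = v.to_bytes(max(1, (v.bit_length() + 7) // 8), "big")
        tag = (tag << 2) | (len(b) - 1)
        body += b
    return bytes([tag]) + bytes(body)

def group_varint_decode(data):
    # Decode one group of four values; returns (values, bytes consumed).
    tag, pos, values = data[0], 1, []
    for shift in (6, 4, 2, 0):
        length = ((tag >> shift) & 0b11) + 1
        values.append(int.from_bytes(data[pos:pos + length], "big"))
        pos += length
    return values, pos

# group_varint_encode([1, 15, 511, 131071])[0] == 0b00000110, the tag byte above;
# decoding needs no per-byte branching: the tag byte alone gives all four lengths.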

19 Other Variable Unit codes
Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
–When would smaller units of alignment be superior? When would larger units be superior?
Variable byte codes:
–Used by many commercial/research systems
–A good low-tech blend of variable-length coding and sensitivity to computer memory alignment (unlike the bit-level codes, which we look at next).

20 Variable bit Codes
In variable bit codes, each code word can use a different number of bits to encode
Examples:
–Unary codes
–Gamma codes
–Delta codes
Other well-known examples:
–Golomb codes, Rice codes

21 Unary code
Represent n as n-1 1s with a final 0.
Unary code for 3 is 110.
Unary code for 40 is 1111111111111111111111111111111111111110.
Unary code for 80 is: 11111111111111111111111111111111111111111111111111111111111111111111111111111110
This doesn’t look promising, but….

22 Gamma codes
We can compress better with bit-level codes
–The Gamma code is the best known of these.
Represent a gap G as a pair (length, offset)
offset is G in binary, with the leading bit cut off
–For example 13 → 1101 → 101
length is the length of the binary code
–For 13 (1101), this is 4. We encode length with unary code: 1110.
Gamma code of 13 is the concatenation of length and offset: 1110101

23 Gamma code examples
number   length        offset       γ-code
0                                   none (why is this ok for us?)
1        0                          0
2        10            0            10,0
3        10            1            10,1
4        110           00           110,00
9        1110          001          1110,001
13       1110          101          1110,101
24       11110         1000         11110,1000
511      111111110     11111111     111111110,11111111
1025     11111111110   0000000001   11111111110,0000000001
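
A sketch of gamma encoding/decoding matching the table (bit strings are used for clarity; the function names are illustrative):

def gamma_encode(g):
    # offset = binary form of g without the leading 1;
    # length = unary code for the length of the binary form
    #          (this deck's convention: n is written as n-1 ones followed by a 0).
    assert g >= 1
    binary = bin(g)[2:]                       # 13 -> '1101'
    offset = binary[1:]                       # '101'
    length = "1" * (len(binary) - 1) + "0"    # '1110'
    return length + offset                    # '1110101'

def gamma_decode(bits):
    numbers, i = [], 0
    while i < len(bits):
        run = bits.index("0", i) - i          # number of leading 1s = offset length
        i += run + 1                          # skip the unary part
        numbers.append(int("1" + bits[i:i + run], 2))
        i += run
    return numbers

# gamma_encode(13) -> '1110101' and gamma_encode(1) -> '0', as in the table
# gamma_decode('11110100110111010') splits the bit string from the next slide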

24 Try it
Encode 7
Decode 11001
Given a series of bits, divide into separate Gamma Codes: 11110100110111010

25 Gamma code properties
G is encoded using 2⌊log₂ G⌋ + 1 bits
All gamma codes have an odd number of bits
Almost within a factor of 2 of the best possible, log₂ G
Gamma code is uniquely prefix-decodable
Gamma code is parameter-free

26 Delta codes
Similar to gamma codes, except that the length is encoded in gamma code
Example: Compute the delta code of 9
Decode: 1011110100
Gamma codes = more compact for smaller numbers
Delta codes = more compact for larger numbers
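
Delta coding reuses gamma_encode from the sketch above; a minimal version (illustrative, not from the slides):

def delta_encode(g):
    # The length of g's binary form is written in gamma code,
    # followed by g's offset (the binary form without its leading 1).
    binary = bin(g)[2:]
    return gamma_encode(len(binary)) + binary[1:]

# delta_encode(13): binary '1101' has length 4, gamma(4) = '11000',
# offset = '101', so the delta code is '11000101'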

27 Disadvantages of Variable Bit Codes
Machines have word boundaries – 8, 16, 32, 64 bits
–Operations that cross word boundaries are slower
Compressing and manipulating at the granularity of bits can be slow
Variable byte encoding is aligned and thus potentially more efficient
Regardless of efficiency, variable byte is conceptually simpler at little additional space cost

28 Think About It
Question: Can we do binary search on a Gamma or Delta coded sequence of increasing numbers?
Question: Can we do binary search on a Varint coded sequence of increasing numbers?

29 Parameterized Methods
A parameterized encoding gets the probability distribution of the input symbols, and creates encodings accordingly
We will discuss two important parameterized methods for compression:
–Canonical Huffman codes
–Arithmetic encoding
These methods can also be used for compressing the dictionary!

30 Huffman Codes: Review
Surprising history of Huffman codes
Huffman codes are optimal prefix codes for symbol-by-symbol encoding, i.e., codes in which no codeword is a prefix of another
Input: a set of symbols, along with a probability for each symbol
Example distribution:
A 0.1, B 0.2, C 0.05, D 0.05, E 0.3, F 0.2, G 0.1

31 Creating a Huffman Code: Greedy Algorithm
Create a node for each symbol and assign it the probability of the symbol
While there is more than 1 node without a parent:
–choose the 2 nodes with the lowest probabilities and create a new node with both nodes as children.
–Assign the new node the sum of the probabilities of its children
The tree derived gives the code for each symbol (leftwards is 0, and rightwards is 1)
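
A compact Python sketch of this greedy construction (heap-based; the tie-breaking counter only keeps the heap comparisons well defined and is not part of the algorithm):

import heapq
from itertools import count

def huffman_code(probabilities):
    # Heap entries: (probability, tie-break id, {symbol: codeword-so-far}).
    tiebreak = count()
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)     # the two lowest-probability nodes
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}         # leftwards is 0
        merged.update({s: "1" + c for s, c in right.items()})  # rightwards is 1
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# e.g. with the probabilities from the previous slide:
# huffman_code({'A': 0.1, 'B': 0.2, 'C': 0.05, 'D': 0.05,
#               'E': 0.3, 'F': 0.2, 'G': 0.1})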

32 Problems with Huffman Codes
The tree must be stored for decoding.
–Can significantly increase memory requirements
If the tree does not fit into main memory, then traversal (for decoding) is very expensive
Solution: Canonical Huffman Codes

33 Canonical Huffman Codes
Intuition: A canonical Huffman code can be efficiently described by just giving:
–The list of symbols
–The length of the codeword of each symbol
This information is sufficient for decoding

34 Properties of Canonical Huffman Codes
1. Codewords of a given length are consecutive binary numbers
2. Given two symbols s, s’ with codewords of the same length: cw(s) < cw(s’) if and only if s < s’
3. The first (shortest) codeword is a string of 0s
4. The last (longest) codeword is a string of 1s

35 Properties of Canonical Huffman Codes (cont)
5. Suppose that
–d is the last codeword of length i
–the next codeword length appearing in the code is j
–the first codeword of length j is c
Then c = 2^(j-i) · (d+1)

36 Try it
Suppose that we have the following lengths per symbol; what is the canonical Huffman code?
A: 3, B: 2, C: 4, D: 4, E: 2, F: 3, G: 3
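
A sketch that assigns canonical codewords directly from the lengths, following properties 1-5 (function name and representation are illustrative):

def canonical_code(lengths):
    # Sort symbols by codeword length, then alphabetically (properties 1-2).
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    code, c, prev_len = {}, 0, lengths[symbols[0]]
    for i, sym in enumerate(symbols):
        if i > 0:
            c += 1                            # consecutive codewords within a length
            c <<= lengths[sym] - prev_len     # property 5: c = 2**(j-i) * (d+1)
        code[sym] = format(c, "0%db" % lengths[sym])
        prev_len = lengths[sym]
    return code

# canonical_code({'A': 3, 'B': 2, 'C': 4, 'D': 4, 'E': 2, 'F': 3, 'G': 3})
# starts with an all-zero codeword for the first (shortest) symbol (property 3).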

37 Decoding
Let l_1,…,l_n be the lengths of codewords appearing in the canonical code
The decoding process will use the following information, for each distinct length l_i:
–the codeword c_i of the first symbol with length l_i
–the number of words n_i of length l_i
–easily computed using the information about symbol lengths

38 Decoding (cont)
i := 0
repeat
  i := i + 1
  let d be the number formed by reading the first l_i bits of the input
until d ≤ c_i + n_i - 1
Return the (d - c_i + 1)-th symbol (in lexicographic order) of length l_i
Example: Decode 10001110
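
The decoding loop above, written out in Python. It reuses canonical_code from the previous sketch to derive c_i and n_i from the lengths (a convenience assumption; in practice these come directly from the stored lengths):

def canonical_decode(bits, lengths):
    code = canonical_code(lengths)                      # from the previous sketch
    distinct = sorted(set(lengths.values()))            # l_1 < l_2 < ...
    first = {l: min(int(cw, 2) for cw in code.values() if len(cw) == l)
             for l in distinct}                         # c_i
    count_ = {l: sum(1 for s in lengths if lengths[s] == l) for l in distinct}  # n_i
    by_len = {l: sorted(s for s in lengths if lengths[s] == l) for l in distinct}

    out, pos = [], 0
    while pos < len(bits):
        for l in distinct:                              # try lengths from shortest up
            d = int(bits[pos:pos + l], 2)
            if d <= first[l] + count_[l] - 1:           # the slide's stopping test
                out.append(by_len[l][d - first[l]])     # (d - c_i + 1)-th symbol
                pos += l
                break
    return "".join(out)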

39 Some More Details
How do we compute the lengths for each symbol?
How do we compute the probabilities of each symbol?
–model per posting list
–single model for all posting lists
–model for each group of posting lists (grouped by size)

40 Huffman Code Drawbacks
Each symbol is coded separately
Each symbol uses a whole number of bits
Can be very inefficient when there are extremely likely/unlikely values

41 How Much can We Compress?
Given: (1) a set of symbols, (2) each symbol s has an associated probability P(s)
Shannon’s lower bound on the average number of bits per symbol needed is: Σ_s –P(s) log₂ P(s)
–Roughly speaking, each symbol s with probability P(s) needs at least –log₂ P(s) bits to represent
–Example: the outcome of a fair coin needs –log₂ 0.5 = 1 bit to represent
Ideally, we aim to find a compression method that reaches Shannon’s bound
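
The bound is easy to evaluate; a tiny helper (illustrative, not from the slides):

import math

def entropy(probabilities):
    # Shannon's lower bound: average bits per symbol = sum of -P(s) * log2 P(s).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# entropy([0.5, 0.5]) == 1.0        -- the fair coin from this slide
# entropy([0.99, 0.01]) ≈ 0.0808    -- used in the next slide's example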

42 Example
Suppose A has probability 0.99 and B has probability 0.01. How many bits will Huffman’s code use for 10 A-s?
Shannon’s bound gives us a requirement of –log₂(0.99) ≈ 0.015 bits per symbol, i.e., only 0.15 bits in total!
The inefficiency of Huffman’s code is bounded from above by P(s_m) + 0.086, where s_m is the most likely symbol

43 Arithmetic Coding
Comes closer to Shannon’s bound by coding symbols together
Input: a set of symbols S with probabilities, and an input text s_1,…,s_n
Output: the length n of the input text and a number (written in binary) in [0,1)
In order to explain the algorithm, numbers will be shown as decimal, but obviously they are always binary

44 ArithmeticEncoding(s_1…s_n)
low := 0
high := 1
for i = 1 to n do
  (low, high) := Restrict(low, high, s_i)
return any number between low and high

45 Restrict(low, high, s_i)
low_bound := sum{ P(s) | s ∈ S and s < s_i }
high_bound := low_bound + P(s_i)
range := high - low
new_low := low + range * low_bound
new_high := low + range * high_bound
return (new_low, new_high)
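
Restrict and the encoder translate almost line for line into Python. The probability table below is the one from the next slide, and Fraction is used only to avoid floating-point drift (an implementation choice, not part of the pseudocode):

from fractions import Fraction

P = {"A": Fraction(1, 5), "B": Fraction(3, 10), "C": Fraction(1, 2)}

def restrict(low, high, symbol):
    # Narrow [low, high) to the sub-interval assigned to `symbol`.
    low_bound = sum(p for s, p in P.items() if s < symbol)
    high_bound = low_bound + P[symbol]
    rng = high - low
    return low + rng * low_bound, low + rng * high_bound

def arithmetic_encode(text):
    low, high = Fraction(0), Fraction(1)
    for symbol in text:
        low, high = restrict(low, high, symbol)
    return low, high          # any number in [low, high) encodes `text`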

46 Example
Suppose that we have symbols
–A with probability 0.2
–B with probability 0.3
–C with probability 0.5
Encode ACCB

47 ArithmeticDecoding(k, n)
low := 0
high := 1
for i = 1 to n do
  for each s ∈ S do
    (new_low, new_high) := Restrict(low, high, s)
    if new_low ≤ k < new_high then
      Output “s”
      low := new_low
      high := new_high
      break
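
The decoder in Python, reusing P and restrict from the encoding sketch above:

def arithmetic_decode(k, n):
    out, low, high = [], Fraction(0), Fraction(1)
    for _ in range(n):
        for symbol in sorted(P):                      # try symbols in order
            new_low, new_high = restrict(low, high, symbol)
            if new_low <= k < new_high:               # k falls in this sub-interval
                out.append(symbol)
                low, high = new_low, new_high
                break
    return "".join(out)

# arithmetic_decode(Fraction(34, 100), 3) decodes the string asked about
# on the next slide.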

48 Think about it
Decode the string 0.34 of length 3
In general, what is the size of the encoding of an input?
–to store a number in an interval of size high-low, we need –log₂(high-low) bits
–The size of the final interval is P(s_1)·P(s_2)·…·P(s_n), so the encoding needs –log₂(P(s_1)·…·P(s_n)) = Σ_i –log₂ P(s_i) bits

49 Adaptive Arithmetic Coding
In order to decode, the probabilities of each symbol must be known.
–These must be stored, which adds to the overhead
The probabilities may change over the course of the text
–This cannot be modeled thus far
In adaptive arithmetic coding the encoder (and decoder) compute the probabilities on the fly by counting symbol frequencies

50 An example - I
String bccb from the alphabet {a,b,c}
The zero-frequency problem is solved by initializing all character counters at 1
When the first b is to be coded, all symbols have a 33% probability (why?)
The arithmetic coder maintains two numbers, low and high, which represent a subinterval [low,high) of the range [0,1)
Initially low = 0 and high = 1

51 An example - II
The range between low and high is divided between the symbols of the alphabet, according to their probabilities.
With P[a] = P[b] = P[c] = 1/3, the interval [0, 1) is divided into:
a: [0, 0.3333)   b: [0.3333, 0.6667)   c: [0.6667, 1)

52 An example - III
Coding the first b restricts the interval to its subinterval: low = 0.3333, high = 0.6667
New probabilities after updating the counters: P[a] = 1/4, P[b] = 2/4, P[c] = 1/4

53 An example - IV
With P[a] = 1/4, P[b] = 2/4, P[c] = 1/4, the interval [0.3333, 0.6667) is divided at 0.4167 and 0.5834
Coding c gives low = 0.5834, high = 0.6667
New probabilities: P[a] = 1/5, P[b] = 2/5, P[c] = 2/5

54 An example - V
With P[a] = 1/5, P[b] = 2/5, P[c] = 2/5, the interval [0.5834, 0.6667) is divided at 0.6001 and 0.6334
Coding c gives low = 0.6334, high = 0.6667
New probabilities: P[a] = 1/6, P[b] = 2/6, P[c] = 3/6

55 An example - VI
With P[a] = 1/6, P[b] = 2/6, P[c] = 3/6, the interval [0.6334, 0.6667) is divided at 0.6390 and 0.6501
Coding the final b gives low = 0.6390, high = 0.6501
Final interval [0.6390, 0.6501): we can send 0.64

56 An example - summary
Starting from the range between 0 and 1, we restrict ourselves each time to the subinterval that encodes the given symbol
At the end the whole sequence can be encoded by any of the numbers in the final range (but mind the brackets...)

57 An example - summary
[0, 1) –b→ [0.3333, 0.6667) –c→ [0.5834, 0.6667) –c→ [0.6334, 0.6667) –b→ [0.6390, 0.6501); send 0.64
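
A sketch of the adaptive encoder that reproduces this example (counters initialized at 1, model updated after each coded symbol; exact fractions are used, so the endpoints differ slightly from the rounded values on the slides):

from fractions import Fraction

def adaptive_encode(text, alphabet=("a", "b", "c")):
    counts = {s: 1 for s in alphabet}             # zero-frequency fix: start at 1
    low, high = Fraction(0), Fraction(1)
    for symbol in text:
        total = sum(counts.values())
        cum = Fraction(0)
        for s in sorted(alphabet):                # split [low, high) by current counts
            width = (high - low) * Fraction(counts[s], total)
            if s == symbol:
                low, high = low + cum, low + cum + width
                break
            cum += width
        counts[symbol] += 1                       # update the model after coding
    return low, high

low, high = adaptive_encode("bccb")
# float(low) ≈ 0.6389 and float(high) = 0.65: the final interval of the example
# (shown as [0.6390, 0.6501) above because of per-step rounding); send e.g. 0.64.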

