
1 Storage. Some of these slides are based on the Stanford IR course slides at http://www.stanford.edu/class/cs276/

2 Basic assumptions of Information Retrieval. Collection: a set of documents (assume it is a static collection for the moment). Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task. (Sec. 1.1)

3 The classic search model (figure): a user task (get rid of mice in a politically correct way) gives rise to an info need (info about removing mice without killing them), which is formulated as a query (how trap mice alive) and sent to the search engine over the collection, producing results; misconceptions and misformulations can creep in at each step, and query refinement feeds back into the query.

4 Boolean Retrieval. Boolean retrieval is a simplified version of the actual search problem. Simplifying Assumption 1: The user accurately translates the task into a query (= a Boolean combination of keywords), e.g., (trap OR remove) AND mice AND NOT kill. Simplifying Assumption 2: A document is relevant to the user's task if and only if it satisfies the Boolean combination of keywords.

5 Boolean Retrieval Limitations. Matching of queries to documents is precise, so in real life it might miss task-relevant documents or return non-task-relevant documents. There is no ranking of the quality of results. HOWEVER: it is a good start for understanding and modeling information retrieval. We will start by assuming the Boolean model!

6 Problem at Hand Given: –Huge collection of documents –Boolean keyword query Return: –Documents satisfying the query Dimension tradeoffs: –Speed –Memory size –Types of queries to be supported 6

7 Ideas? 7

8 Option 1: Store “As Is” Pages are stored "as is" as files in the file system Can find words in files using a grep style tool –Grep is a command-line utility for searching plain-text data sets for lines matching a regular expression. –Uses Boyer-Moore algorithm for substring search To process the data, it must be transferred from disk to main memory, and then searched for substrings –For large data, disk transfer is already a bottleneck! 8
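
The following is a minimal sketch (in Python, not the actual grep implementation) of what storing "as is" implies for search: every query scans every file, and each file must be transferred from disk into memory in full before it can be searched. The directory name and query are hypothetical.

import os

def scan_for_substring(directory, query):
    """Return the paths of files whose contents contain `query`."""
    matches = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # The whole file must be transferred from disk to main memory...
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            text = f.read()
        # ...and only then searched for the substring (grep itself uses a
        # Boyer-Moore-style algorithm for this step).
        if query in text:
            matches.append(path)
    return matches

# Example call (hypothetical path):
# print(scan_for_substring("corpus/", "rain"))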

9 Typical System Parameters (2007). Average seek time: 5 ms = 5*10^-3 s. Transfer time per byte: 0.02 μs = 2*10^-8 s. Low-level processor operation: 0.01 μs = 10^-8 s. Size of main memory: several GBs. Size of disk space: 1 TB. Bottom line: seek and transfer are expensive operations! Try to avoid them as much as possible.

10 What do you think Suppose we have 10MB of text stored continuously. –How long will it take to read the data? Suppose we have 1GB of text stored in 100 continuous chunks. –How long will it take to read the data? Are queries processed quickly? Is this space efficient? 10
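
As a rough answer to these questions, here is a back-of-the-envelope sketch assuming the parameters of the previous slide: one seek per continuous chunk, 5 ms per seek, 0.02 μs per transferred byte.

SEEK_TIME = 5e-3          # seconds per seek
TRANSFER_PER_BYTE = 2e-8  # seconds per transferred byte

def read_time(total_bytes, chunks):
    # one seek per continuous chunk, plus pure sequential transfer
    return chunks * SEEK_TIME + total_bytes * TRANSFER_PER_BYTE

print(read_time(10 * 10**6, 1))   # 10 MB, continuous       -> ~0.205 s
print(read_time(10**9, 100))      # 1 GB in 100 chunks      -> ~20.5 s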

11 Option 2: Relational Database (Model A). A single table with schema (DocID, Doc): row 1 is "Rain, rain, go away..." and row 2 is "The rain in Spain falls mainly in the plain". How would we find documents containing rain? Rain and Spain? Rain and not Spain? Is this better or worse than using the file system with grep?

12 DB: Other Ways to Model the Data. Model B: a single table APPEARS(DocID, Word, ...). Model C: a table WORD_INDEX(Word, Wid, ...) together with a table APPEARS(DocId, Wid, ...). Two options. Which is better?

13 Relational Database Example (figure): two example documents, "The rain in Spain falls mainly on the plain." and "Rain, rain go away.", labeled DocID 1 and DocID 2.

14 Relational Database Example (figure): the WORD_INDEX and APPEARS tables populated from the two documents. Note the case-folding; more about this later.

15 Query Processing How are queries processed? Example query: rain –SELECT DocId –FROM WORD_INDEX W, APPEARS A –WHERE W.Wid=A.Wid and W.Word='rain' How can we answer the queries: –rain and go ? –rain and not Spain ? 15 Is Model C better than Model A?

16 Space Efficiency? Does it save more space than saving as files? –Depends on word frequency! Why? If a word appears in a thousand documents, then its wid will be repeated 1000 times. Why waste the space? If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents 16

17 Query Efficiency? Does not easily support queries that require multiple words Note: Some databases have special support for textual queries. Special purpose indices 17

18 Option 3: Bitmaps 18 There is a vector of 1s and 0s for each word. Queries are computed using bitwise operations on the vectors – efficiently implemented in the hardware.

19 Option 3: Bitmaps 19 How would you compute: Q1 = rain Q2 = rain and Spain Q3 = rain or not Spain
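
A minimal sketch of how such queries could be evaluated, using Python integers as bit vectors and the two-document example from the relational-database slides (the word-to-bitmap assignment here is illustrative):

NUM_DOCS = 2
ALL_DOCS = (1 << NUM_DOCS) - 1            # bitmask 11: every document

bitmap = {
    "rain":  0b11,    # appears in doc 1 and doc 2 (bit i-1 stands for doc i)
    "spain": 0b10,    # appears in doc 2 only
}

def docs(bits):
    return [i + 1 for i in range(NUM_DOCS) if bits & (1 << i)]

q1 = bitmap["rain"]                                   # rain
q2 = bitmap["rain"] & bitmap["spain"]                 # rain AND Spain
q3 = bitmap["rain"] | (ALL_DOCS & ~bitmap["spain"])   # rain OR NOT Spain
print(docs(q1), docs(q2), docs(q3))                   # [1, 2] [2] [1, 2]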

20 Bitmaps Tradeoffs Bitmaps can be efficiently processed However, they have high memory requirements. Example: –1M of documents, each with 1K of terms –500K distinct terms in total –What is the size of the matrix? –How many 1s will it have? Summary: A lot of wasted space for the 0s 20
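
A rough calculation for the numbers in this example (1M documents, 1K terms per document, 500K distinct terms):

docs, terms_per_doc, vocab = 10**6, 10**3, 5 * 10**5

matrix_bits = docs * vocab           # one bit per (document, term) pair
print(matrix_bits)                   # 5 * 10**11 bits, roughly 62.5 GB
print(matrix_bits / 8 / 2**30)       # ~58 GiB

max_ones = docs * terms_per_doc      # at most one 1 per token occurrence
print(max_ones / matrix_bits)        # at most 0.2% of the matrix is non-zero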

21 The Index Repository A Good Solution 21

22 Two Structures Dictionary: –list of all terms in the documents –For each term in the document, store a pointer to its list in the inverted file Inverted Index: –For each term in the dictionary, an inverted list that stores pointers to all occurrences of the term in the documents. –Usually, pointers = document numbers –Usually, pointers are sorted –Sometimes also store term locations within documents (Why?) 22

23 Example. Doc 1: A B C. Doc 2: E B D. Doc 3: A B D F. Dictionary (Lexicon) and Posting Lists (Inverted Index): A → 1, 3; B → 1, 2, 3; C → 1; D → 2, 3; E → 2; F → 3. How do you find documents with A and D? The Devil is in the Details!
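
A sketch of answering "A and D": since posting lists are sorted, the two lists can be intersected in a single parallel scan (the standard merge-style intersection), shown here on the toy index above.

postings = {
    "A": [1, 3], "B": [1, 2, 3], "C": [1],
    "D": [2, 3], "E": [2],       "F": [3],
}

def intersect(p1, p2):
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect(postings["A"], postings["D"]))   # [3]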

24 Goal Store dictionary in main memory Store inverted index on disk Use compression techniques to save space –Saves a little money on storage –Keep more stuff in memory, to increase speed –Increase speed of data transfer from disk to memory [read compressed data | decompress] can be faster than [read uncompressed data] 24

25 Coming Up… Document Statistics: –How big will the dictionary be? –How big will the inverted index be? Storing the dictionary –Space saving techniques Storing the inverted index –Compression techniques 25

26 Document Statistics Empirical Laws 26

27 Some Terminology Collection: set of documents Token: “word” appearing in at least one document, sometimes also called a term Vocabulary size: Number of different tokens appearing in the collection Collection size: Number of tokens appearing in the collection 27

28 Vocabulary vs. collection size How big is the term vocabulary? –That is, how many distinct words are there? Can we assume an upper bound? In practice, the vocabulary will keep growing with the collection size 28

29 Vocabulary vs. collection size. Heaps' law estimates the size of the vocabulary as a function of the size of the collection: M = k*T^b, where M is the size of the vocabulary, T is the number of tokens in the collection, and typically 30 ≤ k ≤ 100 and b ≈ 0.5. In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about 1/2 (an "empirical law").

30 Heaps' Law. For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 * T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49. Good empirical fit for Reuters RCV1! For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms.
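
A quick check of the quoted fit (a sketch; k is taken as 10^1.64 ≈ 44 and b = 0.49):

k, b = 44, 0.49          # k = 10**1.64 is approximately 44
T = 1_000_020
print(round(k * T**b))   # ~38,323 predicted terms (38,365 were actually observed)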

31 Collection Size. In natural language, there are a few very frequent terms and many very rare terms. Zipf's law states that the i-th most frequent term has frequency proportional to 1/i: cf_i = K/i, where cf_i is the number of occurrences of the i-th most frequent token and K is a normalizing constant.

32 Zipf consequences. If the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) occurs cf_1/2 times, the third most frequent term (and) occurs cf_1/3 times, and so on. Equivalently, cf_i = K/i where K is a normalizing factor, so log cf_i = log K - log i: a linear relationship between log cf_i and log i.

33 Zipf’s law for Reuters RCV1 33

34 The Dictionary Data Structures 34

35 Dictionary: Reminder. Doc 1: A B C. Doc 2: E B D. Doc 3: A B D F. Dictionary (Lexicon) with Posting Lists (Inverted Index): A → 1, 3; B → 1, 2, 3; C → 1; D → 2, 3; E → 2; F → 3. Want to store: the terms, their frequencies, and a pointer from each term to its list in the inverted index.

36 The Dictionary Assumptions: we are interested in simple queries: –No phrases –No wildcards Goals: –Efficient (i.e., log) access –Small size (fit in main memory) Want to store: –Word –Address of inverted index entry –Length of inverted index entry = word frequency (why?) 36

37 Why compress the dictionary? Search begins with the dictionary We want to keep it in memory Memory footprint competition with other applications Embedded/mobile devices may have very little memory Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time So, compressing the dictionary is important 37

38 Some Assumptions. To assess different storage solutions, in this part we will assume: there are 400,000 different terms; the average term length is 8 letters (why 8, when the average token length is only 4.5 letters?); each letter requires a byte of storage; term frequency can be stored in 4 bytes; pointers to the inverted index require 4 bytes. We will see a series of different storage options.

39 Dictionary storage - first cut Array of fixed-width entries, assuming maximum word length of 20 Search Complexity? Size: –400,000 terms –20 letters per word –4 bytes for frequency –4 bytes for posting list pointer –Total: 11.2 MB. 39

40 Fixed-width terms are wasteful Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms. –Avg. dictionary word length is 8 characters –On average, we wasted 12 characters per word! And we still can’t handle words longer than 20 letters, like: supercalifragilisticexpialidocious 40

41 Compressing the term list: Dictionary-as-a-String Store dictionary as a (long) string of characters: –Pointer to next word shows end of current word ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 41

42 Compressing the term list: Dictionary-as-a-String How do we know where terms end? How do we search the dictionary? –Complexity? ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 42

43 Compressing the term list: Dictionary-as-a-String String length: –400,000 terms * 8 bytes on avg. per term = 3.2MB Array size: –400,000 terms * (4 bytes for frequency + 4 bytes for posting list pointer + 3 bytes for pointer into string) = 4.4MB Total: 7.6MB ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 43 Think About it: Why 3 bytes per pointer into String?
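
A sketch of the arithmetic behind the 3-byte string pointer and the 7.6 MB total, under the running assumptions:

import math

string_len = 400_000 * 8                    # the concatenated term string: 3.2 MB
print(math.ceil(math.log2(string_len)))     # 22 bits needed -> rounded up to 3 bytes

total = 400_000 * (4 + 4 + 3) + string_len  # freq + posting ptr + string ptr + string
print(total / 10**6)                        # 7.6 (MB), as on the slide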

44 Blocking. Blocking is a method to save on storage of pointers into the string of words: instead of storing a pointer for every term, we store a pointer only to every k-th term. In order to know where words end, we also store term lengths. ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. (table with columns Term Ptr, Length, Posting Ptr, Freq: a term pointer is kept only for every k-th term, and each row stores a term's length, posting pointer, and frequency)

45 Blocking. Why are some term pointers missing in the table? Why is a length value missing? How is search performed? Complexity? ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. (same table as on the previous slide: Term Ptr, Length, Posting Ptr, Freq)

46 Blocking. How many bytes should we use to store the length? How much space does this index require, as a function of k? How much when k = 4? ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. (same table: Term Ptr, Length, Posting Ptr, Freq)
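
One possible way to answer the size question, as a sketch under the running assumptions (3.2 MB term string, 4-byte frequency and posting pointers, 3-byte string pointers, 1 byte per term length):

def blocked_size(k, n_terms=400_000, avg_len=8):
    string_bytes = n_terms * avg_len       # the concatenated term string
    per_term = 4 + 4 + 1                   # frequency + posting ptr + 1-byte length
    term_ptrs = (n_terms // k) * 3         # one 3-byte string pointer per block
    return string_bytes + n_terms * per_term + term_ptrs

print(blocked_size(4) / 10**6)    # ~7.1 MB with k = 4, vs. 7.6 MB without blocking
print(blocked_size(16) / 10**6)   # larger k saves pointers but slows the in-block scan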

47 Front Coding We now consider an alternative method of saving on space. Adjacent words tend to have common prefixes –Why? Size of the string can be reduced if we take advantage of common prefixes With front coding we –Remove common prefixes –Store the common prefix size –Store pointer into the concatenated string 47

48 Front Coding Example. Terms: jezebel, jezer, jezerit, jeziah, jeziel. Stored string of suffixes: …ebelritiahel…. (table with columns Term Ptr, Prefix size, Posting Ptr, Freq; the prefix sizes of the five terms are 3, 4, 5, 3, 4)

49 Front Coding Example. What is the search time? What is the size of the index, assuming that the common prefix is of size 3, on average? …ebelritiahel…. (table as on the previous slide: Term Ptr, Prefix size, Posting Ptr, Freq)
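
A minimal front-coding sketch over the five jez- terms; unlike the figure, the list here starts at jezebel, so the first term has no preceding term to share a prefix with:

def front_code(sorted_terms):
    entries, pieces, prev, offset = [], [], "", 0
    for term in sorted_terms:
        p = 0                                   # length of prefix shared with prev term
        while p < min(len(term), len(prev)) and term[p] == prev[p]:
            p += 1
        suffix = term[p:]
        entries.append((offset, p, len(term)))  # (ptr into string, prefix size, length)
        pieces.append(suffix)
        offset += len(suffix)
        prev = term
    return entries, "".join(pieces)

terms = ["jezebel", "jezer", "jezerit", "jeziah", "jeziel"]
entries, s = front_code(terms)
print(s)                                  # jezebelritiahel
print([(p, l) for _, p, l in entries])    # prefix sizes 0, 4, 5, 3, 4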

50 (k-1)-in-k Front Coding Front coding saves space, but binary search of the index is no longer possible To allow for binary search, “(k-1)-in-k” front coding can be used In this method, in every block of k words, the first is completely given, and all others are front-coded Binary search can be based on the complete words to find the correct block Combines ideas of blocking and front coding 50

51 3-in-4 Front Coding Example. Terms: jezebel, jezer, jezerit, jeziah, jeziel. Stored string: …jezebelritiahjeziel…: the first term of each block (jezebel, jeziel) is stored in full, the others are front-coded. (table with columns Term Ptr, Prefix size, Length, Posting Ptr, Freq) What is the search time? Why are there missing prefix values? What is the size of the index, assuming that the common prefix is of size 3, on average?

52 Inverted Index

53 Inverted Index: Reminder. Doc 1: A B C. Doc 2: E B D. Doc 3: A B D F. Want to store: document ids. Dictionary (Lexicon) with Posting Lists (Inverted Index): A → 1, 3; B → 1, 2, 3; C → 1; D → 2, 3; E → 2; F → 3.

54 The Inverted Index. The inverted index is a set of posting lists, one for each term in the lexicon. Example: Doc 1: A C F. Doc 2: B E D B. Doc 3: A B D F. If we only want to store docIDs, B's posting list will be: 2 3. If we also want to store positions within documents, B's posting list will be: (2; 1, 4), (3; 2), or flattened: 2 2 1 4 3 1 2 (docID, number of positions, then the positions themselves). Positions increase the size of the posting list!

55 The Inverted Index The inverted index is a set of posting lists, one for each term in the lexicon From now on, we will assume that posting lists are simply lists of document ids Document ids in a posting list are sorted –A posting list is simply an increasing list of integers The inverted index is very large –We discuss methods to compress the inverted index 55

56 Postings compression. The postings file is much larger than the dictionary, by a factor of at least 10. Key desideratum: store each posting compactly. A posting for our purposes is a docID. For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers. Alternatively, we can use log2 800,000 ≈ 20 bits per docID. Our goal: use a lot less than 20 bits per docID. (Sec. 5.3)

57 Postings: two conflicting forces. A term like arachnocentric occurs in maybe one doc out of a million; we would like to store this posting using log2 1M ≈ 20 bits. A term like the occurs in virtually every doc, so 20 bits/posting is too expensive; prefer a 0/1 bitmap vector in this case. (Sec. 5.3)

58 Postings file entry We store the list of docs containing a term in increasing order of docID. –computer: 33,47,154,159,202 … Consequence: it suffices to store gaps. –33,14,107,5,43 … Hope: most gaps can be encoded/stored with far fewer than 20 bits. –What happens if we use fixed length encoding? Sec. 5.3 58

59 Three postings entries Sec. 5.3 59

60 Variable length encoding. Aim: for arachnocentric, we will use ~20 bits/gap entry; for the, we will use ~1 bit/gap entry. If the average gap for a term is G, we want to use ~log2 G bits/gap entry. Key challenge: encode every integer (gap) with about as few bits as needed for that integer. This requires a variable length encoding. Variable length codes achieve this by using short codes for small numbers. (Sec. 5.3)

61 Types of Compression Methods Length –Variable Byte –Variable bit Encoding/decoding prior information –Non-parameterized –Parameterized 61

62 Types of Compression Methods. We will start by discussing non-parameterized methods: variable byte and variable bit. Afterwards we discuss two parameterized methods that are both variable bit.

63 Variable Byte Compression Document ids (=numbers) are stored using a varying number of bytes Numbers are byte-aligned Many compression methods have been developed. We discuss: –Varint –Length-Precoded Varint –Group Varint 63

64 Varint codes. For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits. Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c. If G ≤ 127, binary-encode it in the 7 available bits and set c = 1. Else encode G's higher-order 7 bits and then use additional bytes to encode the remaining groups of 7 bits using the same algorithm. At the end, set the continuation bit of the last byte to 1 (c = 1), and for the other bytes c = 0. (Sec. 5.3)

65 Example. docIDs: 824, 829, 215406. Gaps: -, 5, 214577 (the first entry encodes the docID 824 itself). Varint code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001. Postings are stored as the byte concatenation 000001101011100010000101000011010000110010110001. Key property: varint-encoded postings are uniquely prefix-decodable. For a small gap (5), VB uses a whole byte. (Sec. 5.3)
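
A sketch of this variable-byte scheme in Python, following the convention of slide 64 (7 payload bits per byte, higher-order groups first, continuation bit set on the last byte of each number); it reproduces the example above:

def vb_encode_number(n):
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)    # higher-order groups end up first
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128                # set the continuation bit on the last byte
    return bytes_out

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

def vb_decode(byte_stream):
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

# docIDs 824, 829, 215406 -> gaps 824, 5, 214577 (first entry is the docID itself)
docids = [824, 829, 215406]
gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
code = vb_encode(gaps)
print(" ".join(f"{b:08b}" for b in code))   # matches the byte strings on the slide
print(vb_decode(code) == gaps)              # True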

66 Length-Precoded Varint Currently, we must check the first bit of each byte before deciding how to proceed. Length-Precoded Varint aims at lowering the number of “branch-and-checks” Store each number in 1-4 bytes. Use the first 2 bits of the first byte to indicate the number of bytes used 66

67 Example. Numbers to encode: 1, 15, 511, 131071. Varint encoding (7 bits per byte with continuation bit): 10000001 | 10001111 | 00000011 11111111 | 00000111 01111111 11111111. Length-Precoded Varint encoding (byte count encoded in the first 2 bits of the first byte): 00000001 | 00001111 | 01000011 11111111 | 10000111 11111111 11111111.

68 Length-Precoded Varint: Pros and Cons Pros –Less branching –Less bit shifts Cons –Still requires branching/bit shifts –What is the largest number that can be represented? 68

69 Group Varint Encoding. Introduced by Jeff Dean (Google). Idea: encode groups of 4 values in 5-17 bytes. Pull out the four 2-bit binary lengths into a single prefix byte. Decoding uses a 256-entry table on the prefix byte to determine the masks of the four values that follow.

70 Example. Numbers to encode: 1, 15, 511, 131071. Length-Precoded Varint encoding (byte count in the first 2 bits): 00000001 | 00001111 | 01000011 11111111 | 10000111 11111111 11111111. Group Varint encoding: 00000110 (prefix byte holding the four lengths) | 00000001 | 00001111 | 00000011 11111111 | 00000111 11111111 11111111.

71 Other Variable Unit codes Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles). –When would smaller units of alignment be superior? When would larger units of alignment be superior? Variable byte codes: –Used by many commercial/research systems –Good low-tech blend of variable-length coding and sensitivity to computer memory alignment matches (vs. bit-level codes, which we look at next). Sec. 5.3 71

72 Variable bit Codes. In variable bit codes, each code word can use a different number of bits. Examples: unary codes, gamma codes, delta codes. Other well-known examples: Golomb codes, Rice codes.

73 Unary code Represent n as n-1 1s with a final 0. Unary code for 3 is 110. Unary code for 40 is 1111111111111111111111111111111111111110. Unary code for 80 is: 111111111111111111111111111111111111111111 11111111111111111111111111111111111110 This doesn’t look promising, but…. 73

74 Gamma codes We can compress better with bit-level codes –The Gamma code is the best known of these. Represent a gap G as a pair length and offset offset is G in binary, with the leading bit cut off –For example 13 → 1101 → 101 length is the length of binary code –For 13 (1101), this is 4. We encode length with unary code: 1110. Gamma code of 13 is the concatenation of length and offset: 1110101 Sec. 5.3 74

75 Gamma code examples (Sec. 5.3):
number  length        offset        γ-code
0       -             -             none (why is this OK for us?)
1       0                           0
2       10            0             10,0
3       10            1             10,1
4       110           00            110,00
9       1110          001           1110,001
13      1110          101           1110,101
24      11110         1000          11110,1000
511     111111110     11111111      111111110,11111111
1025    11111111110   0000000001    11111111110,0000000001
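
A sketch of gamma encoding and decoding as described above (using the n-1-ones-then-0 unary convention of slide 73):

def gamma_encode(n):
    assert n >= 1, "gamma cannot encode 0 directly"
    binary = bin(n)[2:]
    offset = binary[1:]                      # binary representation, leading 1 removed
    length = "1" * (len(binary) - 1) + "0"   # unary code of len(binary)
    return length + offset

def gamma_decode(bits):
    numbers, i = [], 0
    while i < len(bits):
        run = 0
        while bits[i] == "1":                # read the unary length part
            run += 1; i += 1
        i += 1                               # skip the terminating 0
        offset = bits[i:i + run]
        i += run
        numbers.append(int("1" + offset, 2) if run else 1)
    return numbers

print(gamma_encode(13))                                     # 1110101
print(gamma_decode(gamma_encode(13) + gamma_encode(24)))    # [13, 24]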

76 Gamma code properties. G is encoded using 2⌊log2 G⌋ + 1 bits. All gamma codes have an odd number of bits. Almost within a factor of 2 of the best possible, log2 G. Gamma code is uniquely prefix-decodable. Gamma code is parameter-free. (Sec. 5.3)

77 Delta codes Similar to gamma codes, except that length is encoded in gamma code Example: Compute the delta code of 9 Decode: 1011110100 Gamma codes = more compact for smaller numbers Delta codes = more compact for larger numbers 77

78 Disadvantages of Variable Bit Codes Machines have word boundaries – 8, 16, 32, 64 bits –Operations that cross word boundaries are slower Compressing and manipulating at the granularity of bits can be slow Variable byte encoding is aligned and thus potentially more efficient Regardless of efficiency, variable byte is conceptually simpler at little additional space cost Sec. 5.3 78

79 Think About It Question: Can we do binary search on a Gamma or Delta Coded sequence of increasing numbers? Question: Can we do binary search on a Varint Coded sequence of increasing numbers? 79

80 Parameterized Methods A parameterized encoding gets the probability distribution of the input symbols, and creates encodings accordingly We will discuss 2 important parameterized methods for compression: –Canonical Huffman codes –Arithmetic encoding These methods can also be used for compressing the dictionary! 80

81 Huffman Codes: Review. Surprising history of Huffman codes. Huffman codes are optimal prefix codes for symbol-by-symbol encoding, i.e., codes in which no codeword is a prefix of another. Input: a set of symbols, along with a probability for each symbol, e.g.: A: 0.1, B: 0.2, C: 0.05, D: 0.05, E: 0.3, F: 0.2, G: 0.1

82 Creating a Huffman Code: Greedy Algorithm. Create a node for each symbol and assign it the probability of the symbol. While there is more than one node without a parent: choose the 2 nodes with the lowest probabilities and create a new node with both nodes as children; assign the new node the sum of the probabilities of its children. The resulting tree gives the code for each symbol (leftwards is 0, rightwards is 1).
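
A sketch of this greedy construction in Python, using a heap for the repeated minimum extraction; the probabilities are those of the review slide, with D taken as 0.05 so that the probabilities sum to 1:

import heapq
from itertools import count

def huffman_code(probs):
    tie = count()                  # tiebreaker so equal probabilities never compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}       # left child gets 0
        merged.update({s: "1" + c for s, c in right.items()})  # right child gets 1
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

probs = {"A": 0.1, "B": 0.2, "C": 0.05, "D": 0.05, "E": 0.3, "F": 0.2, "G": 0.1}
code = huffman_code(probs)
print(code)
print(sum(probs[s] * len(c) for s, c in code.items()))   # expected bits per symbol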

83 Problems with Huffman Codes The tree must be stored for decoding. –Can significantly increase memory requirements If the tree does not fit into main memory, then traversal (for decoding) is very expensive Solution: Canonical Huffman Codes 83

84 Canonical Huffman Codes Intuition: A canonical Huffman code can be efficiently described by just giving: –The list of symbols –Length of codeword of each symbol This information is sufficient for decoding 84

85 Properties of Canonical Huffman Codes. 1. Codewords of a given length are consecutive binary numbers. 2. Given two symbols s, s' with codewords of the same length, cw(s) < cw(s') if and only if s < s'. 3. The first, shortest codeword is a string of 0s. 4. The last, longest codeword is a string of 1s.

86 Properties of Canonical Huffman Codes (cont). 5. Suppose that d is the last codeword of length i, the next codeword length appearing in the code is j, and the first codeword of length j is c. Then c = 2^(j-i) * (d+1).

87 Try it. Suppose that we have the following lengths per symbol; what is the canonical Huffman code? A: 3, B: 2, C: 4, D: 4, E: 2, F: 3, G: 3

88 Decoding. Let l_1, …, l_n be the distinct lengths of codewords appearing in the canonical code. The decoding process will use the following information for each distinct length l_i: the codeword c_i of the first symbol with length l_i, and the number n_i of codewords of length l_i (easily computed using the information about symbol lengths).

89 Decoding (cont). i = 0. Repeat: i = i + 1; let d be the number given by the first l_i bits of the input. Until d ≤ c_i + n_i - 1. Return the (d - c_i + 1)-th symbol (in lexicographic order) of length l_i. Example: Decode 10001110
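
A sketch of canonical code assignment driven only by the codeword lengths (property 5 gives the rule for stepping from one length to the next); the decoder below is a simple prefix-table decoder rather than the exact c_i/n_i procedure above, but it yields the same result. The lengths are those of the "Try it" slide.

lengths = {"A": 3, "B": 2, "C": 4, "D": 4, "E": 2, "F": 3, "G": 3}

def canonical_code(lengths):
    code, prev_len, cw = {}, None, 0
    for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
        if prev_len is not None:
            cw = (cw + 1) << (lengths[sym] - prev_len)   # property 5: c = 2^(j-i)*(d+1)
        code[sym] = format(cw, "0%db" % lengths[sym])
        prev_len = lengths[sym]
    return code

def decode(bits, code):
    by_word = {w: s for s, w in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in by_word:            # prefix-freeness makes this unambiguous
            out.append(by_word[cur])
            cur = ""
    return "".join(out)

code = canonical_code(lengths)
print(code)                   # B:00 E:01 A:100 F:101 G:110 C:1110 D:1111
print(decode("10001110", code))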

90 Some More Details How do we compute the lengths for each symbol? How do we compute the probabilities of each symbol? –model per posting list –single model for all posting lists –model for each group of posting lists (grouped by size) 90

91 Huffman Code Drawbacks Each symbol is coded separately Each symbol uses a whole number of bits Can be very inefficient when there are extremely likely/unlikely values 91

92 How Much can We Compress? Given: (1) a set of symbols, (2) each symbol s has an associated probability P(s). Shannon's lower bound on the average number of bits per symbol needed is -∑_s P(s) log P(s). Roughly speaking, each symbol s with probability P(s) needs at least -log P(s) bits to represent. Example: the outcome of a fair coin needs -log 0.5 = 1 bit to represent. Ideally, we aim to find a compression method that reaches Shannon's bound.

93 Example. Suppose A has probability 0.99 and B has probability 0.01. How many bits will Huffman's code use for 10 A's? Shannon's bound gives us a requirement of -log(0.99) ≈ 0.015 bits per symbol, i.e., only 0.15 bits in total! The inefficiency (redundancy) of Huffman's code is bounded from above by P(s_m) + 0.086, where s_m is the most likely symbol.

94 Arithmetic Coding Comes closer to Shannon’s bound by coding symbols together Input: Set of symbols S with probabilities, input text s 1,…,s n Output: length n of the input text and a number (written in binary) in [0,1) In order to explain the algorithm, numbers will be shown as decimal, but obviously they are always binary 94

95 ArithmeticEncoding(s_1 … s_n): low := 0; high := 1; for i = 1 to n do (low, high) := Restrict(low, high, s_i); return any number between low and high

96 Restrict(low, high, s_i): low_bound := sum{ P(s) | s ∈ S and s < s_i }; high_bound := low_bound + P(s_i); range := high - low; new_low := low + range * low_bound; new_high := low + range * high_bound; return (new_low, new_high)

97 ArithmeticDecoding(k, n): low := 0; high := 1; for i = 1 to n do: for each s ∈ S do: (new_low, new_high) := Restrict(low, high, s); if new_low ≤ k < new_high then output "s"; low := new_low; high := new_high; break
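
A runnable sketch of this encoder/decoder pair with fixed symbol probabilities; floating point is used for readability, whereas real arithmetic coders work with integer ranges to avoid precision problems.

def restrict(low, high, sym, probs):
    low_bound = sum(probs[s] for s in sorted(probs) if s < sym)
    high_bound = low_bound + probs[sym]
    rng = high - low
    return low + rng * low_bound, low + rng * high_bound

def arithmetic_encode(text, probs):
    low, high = 0.0, 1.0
    for sym in text:
        low, high = restrict(low, high, sym, probs)
    return (low + high) / 2, len(text)      # any number in [low, high) will do

def arithmetic_decode(k, n, probs):
    low, high, out = 0.0, 1.0, []
    for _ in range(n):
        for sym in sorted(probs):
            new_low, new_high = restrict(low, high, sym, probs)
            if new_low <= k < new_high:
                out.append(sym)
                low, high = new_low, new_high
                break
    return "".join(out)

probs = {"A": 0.5, "B": 0.5}
k, n = arithmetic_encode("ABA", probs)
print(k, arithmetic_decode(k, n, probs))    # round-trips back to "ABA"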

98 Think about it. Decode the string 0.34 of length 3, given an alphabet consisting of A, B, both with probability 0.5. In general, what is the size of the encoding of an input? To store a number in an interval of size high - low, we need -log(high - low) bits. The size of the final interval is P(s_1) * P(s_2) * … * P(s_n), so the encoding needs -log(P(s_1) * … * P(s_n)) = ∑_i -log P(s_i) bits.

99 Adaptive Arithmetic Coding In order to decode, the probabilities of each symbol must be known. –This must be stored, which adds to overhead The probabilities may change over the course of the text –Cannot be modeled thus far In adaptive arithmetic coding the encoder (and decoder) compute the probabilities on the fly by counting symbol frequencies 99

100 An example - I. We encode the string bccb from the alphabet {a, b, c}. The zero-frequency problem is solved by initializing all character counters to 1, so when the first b is to be coded all symbols have a 33% probability (why?). The arithmetic coder maintains two numbers, low and high, which represent a subinterval [low, high) of the range [0,1). Initially low = 0 and high = 1.

101 An example - II (figure). The range between low and high is divided between the symbols of the alphabet, according to their probabilities: a → [0, 0.3333) with P[a] = 1/3, b → [0.3333, 0.6667) with P[b] = 1/3, c → [0.6667, 1) with P[c] = 1/3.

102 An example - III (figure). Coding b restricts the interval to its subinterval: low = 0.3333, high = 0.6667. The counter of b is incremented, giving new probabilities P[a] = 1/4, P[b] = 2/4, P[c] = 1/4.

103 An example - IV (figure). The interval [0.3333, 0.6667) is subdivided using P[a] = 1/4, P[b] = 2/4, P[c] = 1/4: a → [0.3333, 0.4167), b → [0.4167, 0.5834), c → [0.5834, 0.6667). Coding c gives low = 0.5834, high = 0.6667, and the new probabilities become P[a] = 1/5, P[b] = 2/5, P[c] = 2/5.

104 An example - V (figure). The interval [0.5834, 0.6667) is subdivided using P[a] = 1/5, P[b] = 2/5, P[c] = 2/5: a → [0.5834, 0.6001), b → [0.6001, 0.6334), c → [0.6334, 0.6667). Coding c gives low = 0.6334, high = 0.6667, and the new probabilities become P[a] = 1/6, P[b] = 2/6, P[c] = 3/6.

105 An example - VI (figure). The interval [0.6334, 0.6667) is subdivided using P[a] = 1/6, P[b] = 2/6, P[c] = 3/6, and coding the final b gives low = 0.6390, high = 0.6501. Final interval [0.6390, 0.6501); we can send 0.64.

106 An example - summary. Starting from the range between 0 and 1, we restrict ourselves each time to the subinterval that encodes the given symbol. At the end, the whole sequence can be encoded by any of the numbers in the final range (but mind the brackets: the interval is closed on the left and open on the right).

107 An example - summary (figure). The successive interval restrictions for bccb: [0, 1) → [0.3333, 0.6667) → [0.5834, 0.6667) → [0.6334, 0.6667) → [0.6390, 0.6501), with the symbol probabilities updated from 1/3, 1/3, 1/3 to 1/6, 2/6, 3/6 along the way; the transmitted value is 0.64.
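
A sketch of the adaptive coder used in this example (counts initialized to 1 and incremented after each coded symbol, so a decoder can maintain exactly the same model); it reproduces the successive intervals above, up to rounding.

def adaptive_encode(text, alphabet):
    counts = {s: 1 for s in alphabet}            # zero-frequency problem: start at 1
    low, high = 0.0, 1.0
    for sym in text:
        total = sum(counts.values())
        cum = sum(counts[s] for s in sorted(alphabet) if s < sym)
        rng = high - low
        low, high = (low + rng * cum / total,
                     low + rng * (cum + counts[sym]) / total)
        print(sym, round(low, 4), round(high, 4))
        counts[sym] += 1                         # update the model after coding
    return low, high

adaptive_encode("bccb", "abc")
# prints: b 0.3333 0.6667 / c 0.5833 0.6667 / c 0.6333 0.6667 / b 0.6389 0.65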

