Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo


1. Discrete Methods in Mathematical Informatics
Kunihiko Sadakane, The University of Tokyo
http://researchmap.jp/sada/resources/

2. Frequency Counts
Input: integer a_i at time t_i (i = 1, 2, …)
Query: all elements that appear in at least a given fraction s of the stream, together with their frequencies. Namely, with N the number of elements at query time, return the elements that appear at least sN times.
Memory: O(log N) words
The problem is expressed in the cash register model.

3. Approximate Frequency Counts
It is not possible to solve the Frequency Counts problem exactly using small working space, so we find approximate solutions:
1. Output all elements whose frequency is at least sN. That is, there are no false negatives.
2. Do not output elements whose frequency is less than (s − ε)N.
3. The estimated frequency is at least (true frequency) − εN.
Required memory: O((1/ε) log(εN)) words

4. Memory Reduction by a Two-pass Algorithm
Input: integer a_i at time t_i (i = 1, 2, …, N)
Query: elements that appeared at least sN times
Memory: O(1/s) words
Two-pass algorithm
–Pass 1: find a candidate set containing all elements that appeared at least sN times
–Pass 2: output the elements that really appear at least sN times

5. Algorithm (Pass 1)
K: set of elements (initially empty)
count[]: stores the frequency of each element in K
For each a_i (i = 1, …, N):
–If a_i ∈ K, count[a_i] = count[a_i] + 1
–Otherwise, insert a_i into K and set count[a_i] = 1
–If |K| > 1/s:
  for each a ∈ K:
    count[a] = count[a] − 1
    if count[a] becomes 0, delete a from K

6. Lemma 5: When pass 1 ends, K contains every element whose frequency is at least sN.
Proof: Assume that at the end of pass 1 there exists an element a that is not contained in K. Every time a's counter is decremented, the counters of at least 1/s − 1 other elements are decremented as well. If a's counter is decremented f times in total, the total frequency stored in K decreases by at least f/s. Because the total frequency N is greater than the total number of decrements, it holds that f/s < N, i.e., f < sN. Therefore every element whose frequency is at least sN still has a positive counter, and is in K at the end of pass 1.

7. Algorithm (Pass 2)
For each entry in K, compute its exact frequency.
Output the entries with frequency at least sN.
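As a concrete illustration, the two passes can be sketched in Python (the function name and the dict-based candidate set are our own; the slides only give pseudocode):

```python
def frequent_two_pass(stream, s):
    """Two-pass frequent elements: candidates in O(1/s) counters, then exact check."""
    # Pass 1: maintain at most 1/s candidate counters.
    k = {}
    limit = int(1 / s)
    for a in stream:
        k[a] = k.get(a, 0) + 1
        if len(k) > limit:
            for key in list(k):        # decrement every counter
                k[key] -= 1
                if k[key] == 0:        # drop counters that reach zero
                    del k[key]
    # Pass 2: count the surviving candidates exactly.
    exact = {a: sum(1 for x in stream if x == a) for a in k}
    n = len(stream)
    return {a: f for a, f in exact.items() if f >= s * n}
```

By Lemma 5, pass 1 never discards a truly frequent element, so pass 2 only has to verify at most 1/s candidates.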

8. Limitation of 1-pass Algorithms
Input: integers a_i ∈ {1, …, n} (i = 1, 2, …, N)
Query: return all elements that appeared at least sN times
Prop.: Any 1-pass algorithm solving this problem requires Ω(n log(N/n)) bits of space in the worst case (assume N > 4n > 16/s).

9. Proof: Consider the situation after the first N/2 elements of the stream have arrived, and assume that each element has appeared fewer than sN times. We prove that in this case we must remember the frequencies of all elements precisely to obtain the correct answer once the whole stream has arrived. Assume that an element a appeared k < sN times among the first N/2 elements. The element a is in the answer if its frequency in the second half is at least sN − k, and not in the answer if it is less than sN − k. That is, there are two types of remaining stream: one for which the answer contains a and one for which it does not.

10. Claim: For any pair of distinct first-half streams, we must distinguish them to obtain the correct answer.
Proof: Assume a appears x_1 times in stream S_1 and x_2 times in stream S_2 (x_1 < x_2). If a appears sN − x_2 times in the second half, a is frequent for stream S_2 but not frequent for stream S_1. If the encodings of the first halves of the streams are the same, we cannot answer correctly for both streams. We now count the number of combinations such that the frequency of each letter is less than sN and the total frequency is N/2.

11. Consider an (n+1)-element set T = {b_0 = 0, b_1, …, b_{n−1}, b_n = N/2 + n}. Define the frequency of element i in the first half of the stream as b_i − b_{i−1} − 1. Any assignment of frequencies to the n elements with total N/2 can be represented by such a set. (Note that this also includes assignments where the frequency of some element is more than sN.) Consider only the sets satisfying b_i − b_{i−1} − 1 ≤ N/(2n) for each b_i. Then the frequency of each element is at most N/(2n) ≤ N/n. Because sn > 1, it holds that N/n < sN, satisfying the condition. We count the number of such T's.

12. We can select each b_i from any of 1 + N/(2n) candidates (the last one is chosen so that the total frequency is N/2). Thus there are (1 + N/(2n))^{n−1} combinations. A data structure that can distinguish any of them must use log_2 (1 + N/(2n))^{n−1} = Ω(n log(N/n)) bits (the information-theoretic lower bound).

13. Randomized 1-pass Approximation Algorithm
Find solutions satisfying the following conditions with probability at least 1 − δ:
1. Output all elements with frequency at least sN.
2. Any element with frequency less than (s − ε)N is not output.
3. The estimated frequency is at least (true frequency) − εN.
Required memory: O((1/ε) log(1/(sδ))) words (in expectation)

14. Data Structure
S: a set of entries (e, f)
–e: element number
–f: estimated frequency of e
Initial state
–S: empty
–r = 1: sampling rate

15. Algorithm
When an element e arrives
–if e is in S, increase f by one
–otherwise, insert (e, 1) into S with probability 1/r
When N = 2rt, for each entry (e, f) of S
–flip a coin until a head comes, and decrease f by the number of tails
–if f becomes 0, delete the entry from S
When N = 2rt, double r
For a query, return all elements with f ≥ (s − ε)N

16. Resampling
Consider (e, f) ∈ S. Some occurrences of e may have been discarded, each with probability 1 − 1/r.
–Once an entry is inserted into S, its later occurrences are not discarded anymore.
When N = 2rt, halve the sampling rate (double r).
–We want to change S as if the elements had been sampled with the new sampling rate all along.
–An already discarded occurrence would also be discarded under the halved sampling rate.
–We therefore discard the first several occurrences (one on average) of each element in S, each with probability 1/2.
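The insertion and resampling steps can be sketched as follows (a hedged illustration: the function signature, the dict representation of S, and the choice of t are ours; the slides only give the update rules). Note that while the sampling rate r stays at 1, every element is inserted deterministically, so the counts are exact:

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, rng=None):
    """Sticky-sampling sketch (illustrative). Returns {element: estimated f}."""
    rng = rng or random.Random(0)
    t = max(1, math.ceil((1 / eps) * math.log(1 / (s * delta))))
    S = {}   # sampled entries: e -> estimated frequency f
    r = 1    # sampling rate: new elements enter S with probability 1/r
    n = 0    # number of elements seen so far
    for e in stream:
        if n == 2 * r * t:              # resampling step when N = 2rt
            r *= 2                      # halve the sampling rate
            for key in list(S):
                # flip a coin until a head; decrement f once per tail
                while S[key] > 0 and rng.random() < 0.5:
                    S[key] -= 1
                if S[key] == 0:
                    del S[key]
        n += 1
        if e in S:
            S[e] += 1                   # sampled element: count exactly
        elif rng.random() < 1 / r:
            S[e] = 1                    # insert with probability 1/r
    return S
```

With s = eps = delta = 0.1, t is large enough that a 15-element stream never triggers resampling, so the returned counts equal the true frequencies.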

17. Theorem: This algorithm outputs an answer satisfying the conditions with probability more than 1 − δ. The expected number of entries in S is 2t.
Proof: The first 2t elements are all stored in S. If an error occurs, rt ≤ N ≤ 2rt, that is, N/r ≤ 2t. The probability that the estimated frequency of an element decreases by at least εN is at most (1 − 1/r)^{εN} ≤ e^{−εN/r} ≤ e^{−εt}.

18. The number of answers (elements whose frequency is at least sN) is at most 1/s. The probability that there exists an answer whose error is more than εN is therefore at most (1/s)·e^{−εt}. To make this probability at most δ, we set t = (1/ε) log(1/(sδ)). Because N ≤ 2rt and each element is inserted into S with probability 1/r, the expected number of entries is at most N/r ≤ 2t. (If the same element appears multiple times, the number of entries only decreases.)

19. Optimal-space 1-pass Algorithm
Find a solution deterministically satisfying:
1. Output all elements with frequency at least sN.
2. Do not output elements with frequency less than (s − ε)N.
3. The error of the estimated frequency is at most εN.
Required memory: O(1/ε) words
S: set of entries (e, f, Δ) (at most m entries)
–e: index of the element
–f: estimated frequency of e
–Δ: error in the frequency

20. Algorithm
When an element e arrives
–If e is in S, increase f by one
–If e is not in S and S is not full, insert (e, 1, 0)
–If S is full:
  Let (e_m, f_m, Δ_m) be the entry with minimum f_m
  Replace it with (e, f_m + 1, f_m)
For a query: output every (e, f, Δ) ∈ S with f > sN
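The update rule can be sketched as follows (a dict keyed by element, storing pairs (f, Δ); the function names are ours):

```python
def space_saving(stream, m):
    """Space-saving summary with at most m entries: e -> (f, delta)."""
    S = {}
    for e in stream:
        if e in S:
            f, d = S[e]
            S[e] = (f + 1, d)                    # known element: increment f
        elif len(S) < m:
            S[e] = (1, 0)                        # room left: exact entry
        else:
            em = min(S, key=lambda x: S[x][0])   # entry with minimum f_m
            fm, _ = S.pop(em)
            S[e] = (fm + 1, fm)                  # take over f_m; record it as Delta
    return S

def frequent_query(S, s, n):
    """Query: elements whose estimated frequency exceeds s*n."""
    return {e for e, (f, d) in S.items() if f > s * n}
```

Note that the f values always sum to the stream length (Lemma 1 below), which the test checks.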

21. Lemma 1: The sum of the f values over all entries in S equals N.
Proof: We show that every time an element arrives, the value of f increases by one for exactly one of the entries in S. This holds if e is already in S. It also holds if e is not in S and S is not full. If S is full, the new entry takes over the f value of the replaced one, increased by one.

22. Lemma 2: The minimum f value (= f_m) in S is at most N/m.
Proof: From Lemma 1, the summation of the m f values is N. Thus the minimum value among them is at most N/m.
Lemma 3: For each entry (e, f, Δ), 0 ≤ Δ ≤ f_m.
Proof: 0 ≤ Δ is obvious. If S is not full, Δ = 0 for every entry. If S is full, Δ is set to the then-minimum f_m when the entry is created; after that Δ does not change while f only grows. Since f_m is non-decreasing, Δ ≤ f_m holds.

23. Lemma 4: For each (e, f, Δ) ∈ S, let F be the true frequency of e. Then F ≤ f ≤ F + f_m.
Proof: First we show F ≤ f. It is obvious if e stays in S the whole time. Consider the case that e is discarded and later inserted into S again. While e is not in S, F does not change. When e is discarded, its f value equals the minimum f_m at that time, and it is recorded as the Δ value of the replacing entry. Even after further replacements, the minimum value f_m never decreases. Therefore the f value assigned when e reoccurs, f'_m + 1, is at least the original f value plus one, so F ≤ f still holds.

24. Next we show f ≤ F + f_m. For the first occurrence of e it holds because f = F = 1. Assume that f ≤ F + f_m holds at some point in time. While e stays in S, it keeps holding because f and F increase by the same amount and f_m never decreases. Consider the case that e is discarded and later inserted into S again; let (e, f', Δ') be the new entry. While e is not in S, F does not change. f' is the minimum estimated frequency f'_m at that time plus 1, so f' = f'_m + 1 ≤ F + f'_m holds because 1 ≤ F.

25. Theorem 1: Any element whose true frequency F satisfies F > f_m is in S.
Proof: Assume that an element e is not in S even though its true frequency satisfies F > f_m. When it was last removed from S, its true frequency was already F (it does not change while e is outside S). From Lemma 4, its estimated frequency satisfied f ≥ F. Because the value of f_m is non-decreasing, the minimum value f'_m at that time is at most the final f_m, so f ≥ F > f_m ≥ f'_m, i.e., f > f'_m. However, this contradicts that the algorithm removes the entry with the minimum value f'_m.

26. Correctness of the Algorithm
Set m = 1/ε ⇒ the required memory is O(1/ε).
For each entry, Δ ≤ f_m ≤ N/m = εN.
Any element whose true frequency satisfies F > εN is in S ⇒ any element with F > sN is in S.
The algorithm outputs every entry with f > sN. Because f ≥ F, any element with F > sN is output.
For every output element, F ≥ f − Δ ≥ f − εN > sN − εN = (s − ε)N, so no element with frequency less than (s − ε)N is output. The error of the estimate is f − F ≤ Δ ≤ εN.

27. Performance Guarantee
For an entry (e, f, Δ), if f − Δ > sN, its true frequency satisfies F > sN.
If this holds for every output element, the algorithm has output the exact solution.
This does not always hold (it depends on the input stream).

28. Top-k Elements
Input: integer a_i ∈ U at time t_i (i = 1, 2, …, N)
Query: output the k most frequent elements
This problem cannot be solved exactly using o(|U|) memory, so we find an approximate solution.
–F_k: frequency of the k-th most frequent element
–output only elements whose true frequency is more than (1 − ε)F_k

29. We use the algorithm for approximate frequency counts. The algorithm returns e_1, …, e_k for a query; let f_i be the estimated frequency of e_i. Let E_i denote the i-th most frequent element and F_i its true frequency. (Note: it may happen that E_i ≠ e_i.)
Theorem 2: f_i ≥ F_i
Proof: Consider four cases.
1) If E_i is not in S: From Theorem 1, F_i ≤ f_m. Then f_i ≥ f_m ≥ F_i.

30. 2) If E_i is the j-th output of the algorithm (j > i): From Lemma 4, f_j ≥ F_i. Because j > i, f_i ≥ f_j. Thus f_i ≥ F_i.
3) If E_i is the i-th output: From Lemma 4, f_i ≥ F_i.
4) If E_i is the j-th output (j < i): Then some element E_x (x < i) is the y-th output with y ≥ i. From Lemma 4, f_y ≥ F_x. Since x < i, the true frequency of E_x satisfies F_x ≥ F_i. Because y ≥ i, f_i ≥ f_y. Therefore f_i ≥ F_i.

31. Prop.: If f_i − Δ_i > f_{k+1} for an entry (e_i, f_i, Δ_i), then e_i must be an element of the Top-k.
Proof: From Theorem 2, f_{k+1} ≥ F_{k+1}. The true frequency of e_i is at least f_i − Δ_i, which is more than f_{k+1} ≥ F_{k+1}. Therefore e_i is in the Top-k.
Cor.: If f_i − Δ_i > f_{k+1} for all i = 1, …, k, then e_1, e_2, …, e_k are the Top-k elements (guaranteed).
Cor.: If f_i − Δ_i ≥ f_{i+1} for all i = 1, …, k − 1, then e_1, e_2, …, e_k are listed in the order of their true frequencies (guaranteed order).

32. Theorem 3: The algorithm outputs k elements with frequency more than (1 − ε)F_k using O(N/(εF_k)) memory.
Proof: From Theorem 2, f_i ≥ F_i ≥ F_k. The true frequencies of the output elements are at least f_i − Δ_i ≥ f_i − f_m ≥ F_k − f_m. To satisfy the condition, it is enough if f_m ≤ εF_k. Because f_m ≤ N/m, the claim holds if we set m = N/(εF_k).

33. Input Stream
(i_t, c_t) at time t (t = 1, 2, …)
–i_t ∈ U = {1, 2, …, u}
–It means a_{i_t}(t) = a_{i_t}(t − 1) + c_t, where a_i(t) is the i-th entry of the vector a at time t
Range of c_t
–c_t is non-negative (cash register model)
–c_t can be negative (turnstile model)
Range of a_i(t)
–always non-negative (strict turnstile model)
–can be negative (non-strict turnstile model)

34. Queries
Point query Q(i): approximate value of a_i
Range query Q(l, r): approximate value of Σ_{i=l}^{r} a_i
Inner product query Q(a, b): approximate value of a ∘ b = Σ_{i=1}^{u} a_i b_i
Heavy hitters / frequent items
–output all elements with a_i ≥ φ ||a||_1
φ-Quantile: output j_k (k = 0, …, 1/φ) such that Σ_{i ≤ j_k} a_i ≈ kφ ||a||_1

35. Count-Min Sketches
Data structure of the CM sketch for parameters (ε, δ)
–2D array count[1,1] … count[d,w] (initially 0), with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
–d pairwise independent hash functions h_1, …, h_d: {1, …, u} → {1, …, w}
Space: wd = O((1/ε) log(1/δ)) words
When an element (i_t, c_t) arrives
–for each 1 ≤ j ≤ d, count[j, h_j(i_t)] += c_t

36. Pairwise Independent Hash Functions
Notation: [n] = {1, 2, …, n}
Definition: A family of functions H = {h | h: [N] → [M]} is called a family of pairwise independent hash functions if for all i ≠ j ∈ [N] and k, l ∈ [M],
Pr_{h∈H} [h(i) = k and h(j) = l] = 1/M².
Example (M: prime): H = { h_{a,b}(x) = (ax + b) mod M : a, b ∈ {0, 1, …, M − 1} }.
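For prime M, the family h_{a,b}(x) = (ax + b) mod M can be checked exhaustively against the definition for small M (helper names are ours): for i ≠ j, the linear system a·i + b = k, a·j + b = l has exactly one solution (a, b) mod M, so each pair (k, l) is hit with probability exactly 1/M².

```python
def make_hash(a, b, m):
    """h_{a,b}(x) = (a*x + b) mod m, over the field Z_m (m prime)."""
    return lambda x: (a * x + b) % m

def count_ab_pairs(i, j, k, l, m):
    """Number of (a, b) pairs with h_{a,b}(i) = k and h_{a,b}(j) = l."""
    return sum(1 for a in range(m) for b in range(m)
               if make_hash(a, b, m)(i) == k and make_hash(a, b, m)(j) == l)
```

Out of the m² members of the family, exactly one maps (i, j) to any prescribed (k, l), confirming pairwise independence.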

37. Point Query (for the strict turnstile model)
Theorem 1: the approximate value by the CM sketch is â_i = min_{1≤j≤d} count[j, h_j(i)], and it satisfies
a_i ≤ â_i (always)
â_i ≤ a_i + ε ||a||_1 (with probability at least 1 − δ)
Proof: every time a_i appears, count[j, h_j(i)] is increased by c_t (1 ≤ j ≤ d). The counter may also be increased by other elements with the same hash value.

38. Define random variables I_{i,j,k} (i ≠ k) as I_{i,j,k} = 1 if h_j(i) = h_j(k), and 0 otherwise. From the pairwise independence of the hash functions, E[I_{i,j,k}] = Pr[h_j(i) = h_j(k)] ≤ 1/w = ε/e.
Let X_{i,j} = Σ_{k≠i} I_{i,j,k} a_k. Then X_{i,j} ≥ 0 and E[X_{i,j}] ≤ (ε/e) ||a||_1.

39. From Markov's inequality,
Pr[â_i > a_i + ε||a||_1] = Pr[X_{i,j} > ε||a||_1 for all j] ≤ Pr[X_{i,j} > e·E[X_{i,j}] for all j] < e^{−d} ≤ δ.
The algorithm has the same approximation ratio for any b > 1 by setting w = ⌈b/ε⌉ and d = ⌈log_b(1/δ)⌉, but the space O(wd) is minimized if b = e.
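A compact implementation of the structure described above (the class shape, seed handling, and the particular prime modulus are our choices; the parameters w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ follow the slides):

```python
import math
import random

class CountMinSketch:
    """Count-Min sketch: w = ceil(e/eps) columns, d = ceil(ln(1/delta)) rows."""
    def __init__(self, eps, delta, seed=0, p=2_147_483_647):
        rng = random.Random(seed)
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.p = p  # prime modulus for the pairwise independent hash family
        self.ab = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % self.p) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.count[j][self._h(j, i)] += c

    def point_query(self, i):
        # strict turnstile: the minimum over rows never underestimates a_i
        return min(self.count[j][self._h(j, i)] for j in range(self.d))
```

Each estimate lies between a_i and the total mass ||a||_1, and exceeds a_i + ε||a||_1 only with probability at most δ.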

40. Point Query (for the non-strict turnstile model)
Theorem 2: The approximate value by the CM sketch is â_i = median of the d values count[j, h_j(i)], and it satisfies
|â_i − a_i| ≤ 3ε ||a||_1 (with probability at least 1 − δ^{1/4})
Proof: For a single row j, |count[j, h_j(i)] − a_i| > 3ε||a||_1 holds with small constant probability, as shown below.

41. (Pr. the error of the approximate value of a_i is outside the range)
= (Pr. the median of the d values is outside the range)
= (Pr. d/2 values are to the right of the range) + (Pr. d/2 values are to the left of the range)
≤ (Pr. d/2 values are outside the range)
Let Y_j be a random variable taking 1 if |count[j, h_j(i)] − a_i| > 3ε||a||_1. Then E[Y_j] < 1/8. Let Y = Σ_{j=1}^{d} Y_j; then E[Y] < d/8. From the Chernoff bound, Pr[Y > d/2] ≤ δ^{1/4}.

42. The Chernoff Bound
Theorem: Let X_1, X_2, …, X_n be independent random variables and assume Pr[X_i = 1] = p_i (0 < p_i < 1). Let X = Σ_{i=1}^{n} X_i and μ = E[X] = Σ p_i. Then for any β > 0,
Pr[X > (1 + β)μ] < ( e^β / (1 + β)^{1+β} )^μ.
Note: If all p_i are the same, the trials are called Bernoulli trials.

43. Inner Product Query
a, b: non-negative vectors
Let (a∘b)_j = Σ_{k=1}^{w} count_a[j, k] · count_b[j, k]; the approximate value is min_j (a∘b)_j.
Theorem 3: The approximate value of a ∘ b satisfies
a ∘ b ≤ (estimate) (always)
(estimate) ≤ a ∘ b + ε ||a||_1 ||b||_1 (with probability at least 1 − δ)

44. Because the vectors are non-negative, (a∘b)_j = a ∘ b + Σ_{p≠q, h_j(p)=h_j(q)} a_p b_q ≥ a ∘ b. From the pairwise independence of the hash functions, E[(a∘b)_j − a ∘ b] ≤ ||a||_1 ||b||_1 / w ≤ (ε/e) ||a||_1 ||b||_1. From Markov's inequality, the probability that all d rows exceed a ∘ b + ε||a||_1||b||_1 is less than e^{−d} ≤ δ.
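A compact sketch of this estimator (the two tables must be built with the same hash functions, here enforced by sharing the seed; helper names are ours):

```python
import math
import random

def cm_table(stream, eps, delta, seed=0, p=2_147_483_647):
    """Build a Count-Min table for a stream of (i, c) updates."""
    rng = random.Random(seed)
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1 / delta))
    hs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
    t = [[0] * w for _ in range(d)]
    for i, c in stream:
        for j, (a, b) in enumerate(hs):
            t[j][((a * i + b) % p) % w] += c
    return t

def inner_product(ta, tb):
    """min over rows of the row-wise dot product; never underestimates a.b."""
    return min(sum(x * y for x, y in zip(ra, rb)) for ra, rb in zip(ta, tb))
```

For non-negative streams the estimate is sandwiched between the true inner product and ||a||_1 · ||b||_1.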

45. Range Query
We want to approximate Σ_{i=l}^{r} a_i (with probability 1 − δ). Instead of using a point query for every point in the range, we use dyadic ranges. For u = 16 the dyadic ranges are:
[1,8] [9,16]
[1,4] [5,8] [9,12] [13,16]
[1,2] [3,4] [5,6] [7,8] [9,10] [11,12] [13,14] [15,16]
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16]
Example: [4,11] = [4] ∪ [5,8] ∪ [9,10] ∪ [11].

46. Any range is represented by the union of at most 2 log_2 u dyadic ranges. Any point in [1…u] is contained in log_2 u dyadic ranges with distinct lengths. We say a range is of level k if its length is 2^k. We use a CM sketch with parameters (ε, δ) for each level k. Input: A_k[1…u/2^k]; each element A_k[q] is the summation of the 2^k values a_i in the q-th level-k range. When a_i arrives, update the log_2 u values A_k[⌈i/2^k⌉] containing it.
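The decomposition can be computed greedily, matching the [4,11] example above (1-based inclusive ranges; the greedy choice of the largest dyadic block starting at l is one standard way to do it):

```python
def dyadic_decompose(l, r):
    """Decompose [l, r] (1-based, inclusive) into at most 2*log2(u) dyadic
    ranges, each of the form [(q-1)*2^k + 1, q*2^k]."""
    out = []
    while l <= r:
        # largest k such that [l, l + 2^k - 1] is dyadic and fits in [l, r]
        k = 0
        while (l - 1) % (2 ** (k + 1)) == 0 and l + 2 ** (k + 1) - 1 <= r:
            k += 1
        out.append((l, l + 2 ** k - 1))
        l += 2 ** k
    return out
```

A range query then sums one point query per returned dyadic range.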

47. For a range query, we use point queries on the dyadic ranges representing the query range.
Theorem 4: The approximate value Q̂(l, r), the sum of at most 2 log_2 u point queries, satisfies
Σ_{i=l}^{r} a_i ≤ Q̂(l, r) (always)
Q̂(l, r) ≤ Σ_{i=l}^{r} a_i + 2ε log_2 u · ||a||_1 (with probability at least 1 − δ)

48. Proof: The expected error of each point query is at most (ε/e) ||a||_1. Thus the expected error of Q̂(l, r) is at most 2 log_2 u · (ε/e) ||a||_1. The probability that it exceeds e times this expectation, i.e. 2ε log_2 u · ||a||_1, is at most 1/e. By using d hash functions, the failure probability is bounded by e^{−d} ≤ δ.
Time to update the data structure: O(log u · log(1/δ))
Space: O((log u / ε) · log(1/δ)) words

49. Improving the Approximation Precision
To make the error ε||a||_1, set ε' = ε / (2 log_2 u) in each level's sketch.
–space: O((log² u / ε) · log(1/δ)) words
For levels with large k (short arrays A_k), instead of using a CM sketch, store exact values.
If a query range is represented by a small number of dyadic ranges, the precision is high.

50. Frequent Item Query
In the strict turnstile model, if c_t < 0, ||a||_1 decreases, so the algorithms described so far cannot be used. We again use dyadic ranges, with a CM sketch for each level; updating is the same as for point queries.
Query processing: parallel binary search
–Search from the highest level of dyadic ranges
–If the estimated frequency of a range is at least (φ + ε) ||a||_1, continue into its children
–Output the elements with estimated frequency at least (φ + ε) ||a||_1 at the lowest level
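The descent through the dyadic tree can be sketched as follows. To keep the example self-contained, exact per-range counts stand in for the per-level CM sketches (so `range_count` is an illustrative substitute for a sketch point query, and we descend on threshold φ·n):

```python
def heavy_hitters_dyadic(freq, u, phi):
    """Find elements with frequency >= phi * n by descending the dyadic tree.
    freq: exact frequencies {element: count}; stands in for level sketches."""
    n = sum(freq.values())

    def range_count(l, r):
        return sum(c for i, c in freq.items() if l <= i <= r)

    out = []
    ranges = [(1, u)]                  # start from the whole domain
    while ranges:
        l, r = ranges.pop()
        if range_count(l, r) >= phi * n:
            if l == r:
                out.append(l)          # frequent element found at a leaf
            else:
                mid = (l + r) // 2     # descend into both dyadic children
                ranges += [(l, mid), (mid + 1, r)]
    return sorted(out)
```

Only ranges that could still contain a heavy hitter are explored, which bounds the number of point queries per level by 2/φ as in Theorem 6.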

51. Theorem 6: Every element with frequency at least (φ + ε) ||a||_1 is output. With probability at least 1 − δ, no element with frequency less than φ ||a||_1 is output.
Space: O((log u / ε) · log(2 log u / (δφ))) words
Update time: O(log u · log(2 log u / (δφ)))
Proof: In each level, the number of dyadic ranges with frequency at least φ||a||_1 is at most 1/φ. Hence in each level the number of point queries is at most 2/φ, and in total at most (2/φ) log_2 u queries are made.

52. The probability that a single query fails is at most δφ / (2 log_2 u). The probability that some of the (2/φ) log_2 u queries fail is therefore at most δ.

53. Quantile Query
Return j_k (k = 0, …, 1/φ) with Σ_{i ≤ j_k} a_i ≈ kφ ||a||_1 (with probability 1 − δ).
We use the CM sketch for range queries. Because the left endpoint of the query ranges is always 1, we use only one dyadic range per level.

54. Data Structures
We consider an imaginary complete binary tree over U = {1, 2, …, u}. Each leaf stores the frequency of one element. We maintain the data structure so that each node v satisfies (v_p: parent of v, v_s: sibling of v; m: parameter; n: total frequency so far)
(1) count(v) ≤ n/m
(2) count(v) + count(v_p) + count(v_s) > n/m
Do not apply (1) to leaves. Do not apply (2) to the root.

55. If there is a node v not satisfying the invariant:
–count(v_p) += count(v) + count(v_s)
–delete v and v_s
(The slide illustrates this merge on an example tree with n = 15, m = 5, u = 8.)

56. When an element a_t arrives
–increase the count of the corresponding leaf by 1
–check whether the counts still satisfy (1), (2)
–if a count is small, it is merged with others
Lemma 1: The number of nodes of the tree is at most 3m.
Proof: Let Q be the set of stored nodes and n the total frequency. From invariant (2),
Σ_{v∈Q} (count(v) + count(v_p) + count(v_s)) > |Q| · n/m.

57. On the left side, the count value of each node appears at most three times (as the node itself, as a parent, and as a sibling). Therefore 3n ≥ Σ_{v∈Q} (count(v) + count(v_p) + count(v_s)) > |Q| · n/m, which gives |Q| < 3m.

58. When an element arrives, if its frequency is small, it may be accumulated in an ancestor node. If all the ancestors of a leaf have count value 0, the leaf count is correct; otherwise the leaf count is at most the real value.
Lemma 2: The error of a leaf count value is at most (n/m) log_2 u.
Proof: The error is at most the summation of the count values in the ancestors of the leaf; there are at most log_2 u ancestors, each with count at most n/m.

59. Consider all leaves a_l, …, a_r in the subtree rooted at an internal node v. If the count values of all ancestors of v are 0, the summation of the frequencies of a_l, …, a_r is equal to the summation of the count values in the subtree. The error in the sum of the count values in a subtree is at most the sum of the counts of the ancestors of v, i.e., at most (n/m) log_2 u.

60. Query Algorithm
Traverse the tree nodes in post-order, maintaining c = (sum of count values in visited nodes). Each time c first reaches kφn, output the node; if the node is not a leaf, output the rightmost value in the range corresponding to the node.
(The slide illustrates a query on the example tree with n = 15, m = 5, u = 8, φ = 0.5, k = 1.)
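The post-order accumulation can be sketched as follows (`node_counts` maps dyadic ranges of the imaginary tree to their stored counts; the example counts in the test are our own, not the slide's figure):

```python
def quantile_query(node_counts, u, phi, n):
    """Post-order traversal of the dyadic tree over [1, u] (u a power of two).
    Output the right endpoint of a node when the accumulated count c first
    reaches k * phi * n, for k = 1, ..., 1/phi."""
    outputs = []
    c = 0
    k = 1

    def visit(l, r):
        nonlocal c, k
        if l < r:                      # internal node: visit children first
            mid = (l + r) // 2
            visit(l, mid)
            visit(mid + 1, r)
        c += node_counts.get((l, r), 0)
        while k <= int(1 / phi) and c >= k * phi * n:
            outputs.append(r)          # rightmost value of the node's range
            k += 1

    visit(1, u)
    return outputs
```

Because counts pushed up into an internal node are only released when that node is visited, the post-order sum c lags the true prefix sum by at most the counts stored in ancestors, matching Theorem 1 below.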

61. Theorem 1: If s nodes are used, each output j_k satisfies |Σ_{i ≤ j_k} a_i − kφn| ≤ (n/m) log_2 u.
Proof: Assume that an element x is output at node v. Consider the nodes storing counts of elements 1, …, x. Because the nodes are visited in post-order, the counts among these that have not yet been accumulated before v is visited are stored in ancestors of v. Therefore the error is at most (n/m) log_2 u.

62. From Lemma 1, s ≤ 3m. Thus the error is at most (n/m) log_2 u ≤ (3n/s) log_2 u. The required memory to bound the error by εn is O((log u)/ε) nodes.

63. Relation between Heavy Hitters and φ-Quantiles
(φ, ε)-heavy hitters
–output all elements with a_i ≥ φ ||a||_1
–do not output any element with a_i < (φ − ε) ||a||_1
–the error of the estimated frequency is at most ε ||a||_1
(φ, ε)-quantile: output j_k (k = 0, …, 1/φ) satisfying |Σ_{i ≤ j_k} a_i − kφ ||a||_1| ≤ ε ||a||_1
Assume that these answers are obtained with probability 1.

64. Lemma: (φ + ε, 2ε)-heavy hitters can be obtained from a (φ, ε/2)-quantile.
Proof: Let j_k (k = 0, …, 1/φ) be the outputs of the (φ, ε/2)-quantile. If j_k = j_{k+1}, output j_k as a heavy hitter, with estimated frequency (number of equal j_k's − 1) · φ ||a||_1 (error at most 2ε ||a||_1). For an element with frequency less than (φ − ε) ||a||_1, it never happens that j_k = j_{k+1}, so it is not output. Any element with frequency at least (φ + ε) ||a||_1 is output.
Example: aaaabbbbbbcccdddddddddeeeeffffff

65. Lemma: A (φ, ε)-quantile can be obtained from log u instances of (φ/log u, ε/log u)-heavy hitters (u: number of distinct elements).
Proof: Partition the range [1, u] into dyadic intervals of length 2^i (i = 0, …, log u − 1). For each i, use a (φ/log u, ε/log u)-heavy hitter structure.

66. To obtain the quantiles, do a binary search using the estimated frequencies of the prefixes [1, i]. A range [1, i] is represented by at most log u dyadic intervals. For each dyadic interval, if it is output as a heavy hitter, we add its estimated frequency to the prefix sum. The error in the heavy hitters is at most ε/log u per interval ⇒ the error in the quantile is at most ε.

