
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo


1 Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo http://researchmap.jp/sada/resources/

2 Data Stream Processing
Data stream – a sequence of data
Examples
–transaction data in banks
–communication records of phones and the internet
–data obtained by sensor networks
Purposes of processing
–searching data
–computing statistics, data analysis
–decision-making

3 Problems in Data Stream Processing
Necessary to process data quickly
All data cannot be stored in memory
–in many cases, we consider only approximate solutions

4 Example of Stream Processing
Input: integer a_i at time t_i (i = 1, 2,…)
Query: the average value of a_1, a_2, ..., a_n that appeared before the query
Available memory: O(1) integers can be stored
The answer can be computed by storing only the number of integers that have appeared so far and their sum
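As a minimal illustration (not part of the slides; the class name and interface are my own), the O(1)-memory average query above can be sketched in Python:

```python
# Sketch of slide 4: answer average queries over a stream while
# storing only two integers (the count and the running sum).
class StreamAverage:
    def __init__(self):
        self.count = 0   # number of integers seen so far
        self.total = 0   # their running sum

    def add(self, a):
        self.count += 1
        self.total += a

    def average(self):
        return self.total / self.count

avg = StreamAverage()
for a in [3, 4, 1, 5, 2]:
    avg.add(a)
```

The memory used is independent of the stream length, which is the point of the slide.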

5 Finding Missing Numbers
Input: a permutation of 1 to n, but with one element missing
Example: n = 5 and the input is 3, 4, 1, 5
Output: the missing number m (m = 2 in the above example)
Available memory: O(log n) bits

6 Algorithm and Required Memory
For each input integer, add it to a variable s
After receiving n − 1 values, return m = n(n+1)/2 − s
Required memory: O(log n) bits
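The summation trick above can be sketched in a few lines of Python (the function name is my own):

```python
# Sketch of slide 6: recover the missing number from a permutation of
# 1..n with one element removed, storing only the running sum s.
def missing_number(stream, n):
    s = 0
    for a in stream:
        s += a
    return n * (n + 1) // 2 - s
```

The variable s never exceeds n(n+1)/2, so it fits in O(log n) bits.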

7 Finding Duplicates
Input: n+1 integers in the range [1, n]
–at least one duplicate value (pigeonhole principle)
–possibly multiple duplicates
Output: a duplicate value (one is enough)
Available memory: O(log n) bits

8 Multiple-pass Algorithms
The input stream can be read multiple times
In each pass, the stream can be read only sequentially
Evaluate algorithms by the number r of passes and the required memory m
Example stream: 1 3 7 2 4 6 5 9 1 8

9 A Simple Algorithm
r = n, m = O(log n) bits
–in the i-th pass, check if the value i appears more than once
r = 1, m = O(n log n) bits
–read all values into memory and sort them
–not a stream algorithm
Is r = O(log n), m = O(log n) bits possible?

10 A Clever Algorithm
In the first pass, count the number of values which are at most n/2
–counter: log n bits
If the counter value is more than n/2
–there are more than n/2 values in [1, n/2]
–at least one duplicate in [1, n/2]
If the counter value is at most n/2
–there are more than n/2 values in (n/2, n]
–at least one duplicate in (n/2, n]
In later passes, do the same for the reduced range

11 Complexities
Each pass uses 3 log n bits of memory
–two integers for representing the range
–one integer for the counter
Number of passes
–after each pass, the size of the range is halved
–log n passes
An r = O(log n), m = O(log n) bits algorithm
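The binary search over the value range described in slides 10–11 can be sketched as follows (a list stands in for re-reading the stream in each pass; the function name is my own):

```python
# Sketch of slides 10-11: find a duplicate among n+1 values in [1, n]
# using O(log n) sequential passes and O(log n) bits per pass.
def find_duplicate(stream, n):
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        # one sequential pass: count the values falling in [lo, mid]
        count = sum(1 for a in stream if lo <= a <= mid)
        if count > mid - lo + 1:   # more values than slots: duplicate here
            hi = mid
        else:                      # otherwise the other half is overfull
            lo = mid + 1
    return lo
```

Each iteration is one pass and stores only lo, hi, and count, matching the 3 log n bits of the slide.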

12 Search Tree
The passes form a binary search tree over ranges: the root [1, n] has children [1, n/2] and (n/2, n], which in turn split into [1, n/4], (n/4, n/2], (n/2, 3n/4], (3n/4, n], and so on.
Depth of the search tree = number of passes r

13 How to Reduce the Number of Passes
Use not a binary but a k-ary tree as the search tree
–k counters are necessary (m = O(k log n) bits)
Number of passes
–r = log_k n
Example (k = 4): the root [1, n] splits directly into [1, n/4], (n/4, n/2], (n/2, 3n/4], (3n/4, n]

14 Tradeoff between Number of Passes and Required Memory
Assume that each node of the search tree has k children
Conditions for the problem to be solvable in r passes using m bits of memory
–k^r ≥ n (at least n leaves at depth r)
–k ≤ m (the algorithm uses k counters; in reality more memory is necessary to store the k counters)
These give r ≥ log_k n ≥ log n / log m
If m = O(log n), then r = Ω(log n / log log n)

15 If k Is Variable
Assume that on the path from the root of the search tree to a leaf, the node at depth i has k_i children
Conditions: k_1 k_2 ⋯ k_r ≥ n and k_i ≤ m
The greater each k_i, the smaller r ⇒ r is minimized when k_i = m for all i
⇒ the same bound as the case where k is fixed for all nodes

16 Data Stream Models
Input stream: a_1, a_2, ...
Items arrive one by one in order
They represent a function A: [1...N] → R
Three models
–Time Series model
–Cash Register model
–Turnstile model (strict, non-strict)

17 Time series model
–A[i] = a_i (elements of A appear in order)
–example: amount of communication in each unit of time
Cash register model
–a_i = (j, I_i), I_i ≥ 0
–A_i[j] = A_{i−1}[j] + I_i (increase the value of A[j])
–example: frequency of IP addresses in packets
Strict turnstile model
–a_i = (j, I_i)
–A_i[j] = A_{i−1}[j] + I_i (I_i can be negative, but A stays nonnegative)
–example: number of people in a station
Non-strict turnstile model
–both I_i and A can be negative

18 Evaluation Criteria for Algorithms
Time to process one element of the stream (processing time)
Space to store the data structure for A_t at time t (storage)
Time to compute a function on A (compute or query time)
(Number of passes)
We want to bound them by polylog(N, t)

19 Query Examples on IP Communication Logs
1. amount of http communication from a range of IP addresses
2. number of IP addresses which use a link in one day, or number of IP addresses which are currently communicating over a link
3. which hosts communicate heavily (top k)
4. number of communications consisting of only one packet (denial-of-service attack)

20 5. amount of identical or similar communication between two routers
6. most related k communications in a day

21 Examples (Cont'd)
Number of IP addresses from which packets are sent
–Input: log of IP packets
Let a_1, a_2, ... be IP packets; a_i has source IP address s_i
A[0...N−1]: number of packets sent from each source IP address (initially 0)
When a packet a_i comes, increase A[s_i] by one (cash register model)
Number of distinct IP addresses = number of non-zero elements of A

22 Frequency Counts
Input: integer a_i at time t_i (i = 1, 2,…)
Query: all elements which appear in at least a fraction s of the stream, and their frequencies
–N: number of elements at query time
–return the elements which appear at least sN times
Memory: O(log N) words
The problem is expressed in the cash register model

23 Approximate Frequency Counts
It is not possible to solve the Frequency Counts problem exactly using small working space
We find approximate solutions (ε is an error parameter, ε < s)
1. Output all elements whose frequency is at least sN. That is, there are no false negatives.
2. Do not output elements whose frequency is less than (s − ε)N.
3. Estimated frequency is at least (true frequency) − εN.
Required memory: O((1/ε) log(εN)) words

24 Definitions of Terms
Partition the input stream conceptually into buckets
–each bucket contains w = ⌈1/ε⌉ elements
Number the buckets from 1
–the current bucket number is b = ⌈N/w⌉
f_e: true frequency of element e appeared so far
Note: ε, w are constants; N (number of elements), b, f_e are variables

25 Data Structure
Data structure D: a set of tuples (e, f, Δ)
–e: an element in the stream
–f: estimated frequency of element e
–Δ: maximum error of f
Initially, D is the empty set

26 Algorithm
When an element e arrives
–check if e exists in D
–if so, increase f by one
–if not, create a new entry (e, 1, b − 1)
At a bucket boundary (N ≡ 0 mod w)
–delete all entries (e, f, Δ) such that f + Δ ≤ b
For a query
–return all entries such that f ≥ (s − ε)N
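A compact Python sketch of the whole procedure from slides 24–26 (the function name and the dictionary layout are my own; entries map each element to its [f, Δ] pair):

```python
import math

# Sketch of slides 24-26 (lossy counting): maintain D over one pass,
# prune at every bucket boundary, answer with f >= (s - eps) * N.
def lossy_count(stream, s, eps):
    w = math.ceil(1 / eps)          # bucket width
    D = {}                          # element -> [f, delta]
    N = 0
    for e in stream:
        N += 1
        b = math.ceil(N / w)        # current bucket number
        if e in D:
            D[e][0] += 1
        else:
            D[e] = [1, b - 1]
        if N % w == 0:              # bucket boundary: prune low entries
            for key in [k for k, (f, d) in D.items() if f + d <= b]:
                del D[key]
    return {e: f for e, (f, d) in D.items() if f >= (s - eps) * N}

stream = ['a'] * 6 + ['b'] * 3 + ['c']
heavy = lossy_count(stream, s=0.5, eps=0.1)
```

Here 'c' is pruned at the first bucket boundary and 'b' falls below the (s − ε)N = 4 threshold, so only 'a' is reported.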

27 Lemma 1: When deletions occur, b ≤ εN
Proof: Deletions occur when N ≡ 0 mod w. So b = N/w ≤ N/(1/ε) = εN.

28 Lemma 2: If an entry (e, f, Δ) is deleted, f_e ≤ b
Proof: By induction on b. If b = 1, then when an entry (e, f, Δ) is deleted, f = 1; in this case f = f_e = b. Assume that the claim holds whenever the bucket number is at most b. Consider the case that an entry (e, f, Δ) is deleted in bucket b′ = b + 1. This entry was inserted when the bucket number was Δ + 1. Assume that the same element had once been inserted into D and then deleted before bucket Δ + 1.

29 In this case, the bucket number of the previous deletion is at most Δ. Since b′ ≥ Δ + 1, we have Δ < b′, so we can use the induction hypothesis: the true frequency of e at the previous deletion is at most Δ. On the other hand, f is the true count of e since the current entry for e was inserted into D. Therefore the true frequency f_e over buckets 1 to b′ is at most f + Δ. The condition for deletion is f + Δ ≤ b′. So if a deletion occurs, f_e ≤ b′ and the claim holds for bucket b′.

30 Lemma 3: If element e does not exist in D, f_e ≤ εN
Proof: If e was just deleted (N ≡ 0 mod w), the claim holds by Lemmas 1 and 2. Consider other N. If e does not exist in D, then e has not appeared in the current bucket. So f_e equals its value at the end of the last bucket, while N has grown since then. So the claim holds.

31 Lemma 4: If (e, f, Δ) ∈ D, then f ≤ f_e ≤ f + εN
Proof: If Δ = 0, the entry was inserted in bucket 1 and never deleted; thus f = f_e. If Δ > 0, e may have been deleted somewhere in the first Δ buckets. From Lemma 2, the true frequency of e when it was last deleted is at most Δ. Thus f_e ≤ f + Δ. Because Δ ≤ b − 1 ≤ εN, f_e ≤ f + εN. It is obvious that f ≤ f_e.

32 Correctness of the Algorithm
Conditions to be satisfied
1. Output all elements whose frequency is at least sN. That is, no false negatives.
2. Any element whose frequency is less than (s − ε)N is not output (assume s ≫ ε).
3. Estimated frequency is at least (true frequency) − εN.
4. Required memory: O((1/ε) log(εN)) words

33 Condition 1
From Lemma 3, all elements e such that f_e > εN are contained in D. Because s > ε, D contains all elements whose true frequency is at least sN. The algorithm outputs all entries such that f ≥ (s − ε)N. If f_e ≥ sN, then sN ≤ f_e ≤ f + εN, so f ≥ (s − ε)N. Therefore all elements e such that f_e ≥ sN are output.

34 Conditions 2, 3
From Lemma 4, f ≤ f_e. If f_e < (s − ε)N, then f < (s − ε)N, and the algorithm does not output such an entry. Also from Lemma 4, f ≥ f_e − εN.

35 Analysis of Space Complexity
Consider elements whose true frequency is more than εN. The number of such elements is less than 1/ε. Because the algorithm also stores other entries, we have to estimate the number of those entries.
Theorem: The number of entries the algorithm stores is at most (1/ε) log(εN).

36 Proof: Let d_i be the number of entries in D whose bucket number is b − i + 1 (1 ≤ i ≤ b). Such an element appears at least i times in buckets b − i + 1 to b (in bucket b − i + 1, f = 1 and Δ = b − i; therefore it would have been deleted unless f increased by at least one in each bucket). Because each bucket is of size w, for every j (1 ≤ j ≤ b),
Σ_{i=1}^{j} i·d_i ≤ j·w.

37 Claim: Σ_{i=1}^{j} d_i ≤ Σ_{i=1}^{j} w/i
Proof: Induction on j. If j = 1, it holds since d_1 ≤ w. Assume it holds for j = 1, 2, ..., p − 1. Adding these p − 1 inequalities to Σ_{i=1}^{p} i·d_i ≤ p·w gives p·Σ_{i=1}^{p} d_i ≤ w + p·Σ_{i=1}^{p−1} w/i, and dividing by p shows it holds for j = p.

38 The number of entries |D| is
|D| = Σ_{i=1}^{b} d_i ≤ Σ_{i=1}^{b} w/i ≤ w(1 + ln b) = O((1/ε) log(εN)).

39 Memory Reduction by a Two-pass Algorithm
Input: integer a_i at time t_i (i = 1, 2,…, N)
Query: elements appearing at least sN times
Memory: O(1/s) words
Two-pass algorithm
–Pass 1: find all candidate elements that could appear at least sN times
–Pass 2: output the elements which really appear at least sN times

40 Algorithm (Pass 1)
K: set of elements (initially empty)
count[]: stores frequencies of the elements in K
For each a_i (i = 1, ..., N)
–if a_i ∈ K, count[a_i] = count[a_i] + 1
–if not, insert a_i into K and set count[a_i] = 1
–if |K| > 1/s, then for each a ∈ K
  count[a] = count[a] − 1
  if count[a] becomes 0, delete a from K

41 Lemma 5: When pass 1 ends, K contains all elements whose frequency is at least sN.
Proof: Assume that at the end of pass 1 there exists an element a with frequency f_a that is not contained in K. Every time an occurrence of a is deleted, the counts of 1/s − 1 other elements are also decreased by one. If a is deleted f_a times, the total count decreases by at least f_a/s. Because the total count N is greater than the number of decreases, f_a/s < N, i.e., f_a < sN. Therefore every element with frequency at least sN is in K at the end of pass 1.

42 Algorithm (Pass 2)
For each entry in K, compute its exact frequency
Output the entries with frequency at least sN
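Both passes from slides 40–42 can be sketched together in Python (the function name is my own; a list again stands in for the re-readable stream):

```python
# Sketch of slides 40-42: two-pass frequent elements with O(1/s) counters.
def frequent_elements(stream, s):
    # Pass 1: keep at most 1/s candidates
    count = {}
    for a in stream:
        if a in count:
            count[a] += 1
        elif len(count) < 1 / s:
            count[a] = 1
        else:                       # decrement all counters; drop zeros
            for k in list(count):
                count[k] -= 1
                if count[k] == 0:
                    del count[k]
    # Pass 2: exact frequencies of the surviving candidates
    N = len(stream)
    freq = {a: 0 for a in count}
    for a in stream:
        if a in freq:
            freq[a] += 1
    return {a: f for a, f in freq.items() if f >= s * N}

res = frequent_elements(['a', 'b', 'a', 'c', 'a', 'b', 'a', 'd'], s=0.5)
```

Not inserting the new element when the table is full has the same net effect as the slide's insert-then-decrement step, since the new counter would immediately drop from 1 to 0.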

43 Limitation of 1-pass Algorithms
Input: integers a_i in {1, ..., n} (i = 1, 2,…, N)
Query: return all elements appearing at least sN times
Proposition: Any 1-pass algorithm solving this problem requires Ω(n log(N/n)) bits of space in the worst case (assume N > 4n > 16/s).

44 Proof: Consider the situation after the first N/2 elements of the stream have appeared, and assume that each element has appeared fewer than sN times. We prove that in this case we have to remember the frequencies of all elements precisely in order to give the correct answer once all elements have arrived. Assume that an element a appeared k < sN times among the first N/2 elements. The element a is in the answer if its frequency in the second half is at least sN − k, and not in the answer if it is less than sN − k.

45 That is, there are two types of remaining streams: one for which the answer contains a and one for which it does not. For any k, both types of streams exist. Therefore it is necessary to store the frequency of each letter exactly. We count the number of frequency combinations such that the frequency of each letter is less than sN and the total frequency is N/2.

46 Consider an (n+1)-element set T = {b_0 = 0, b_1, ..., b_{n−1}, b_n = N/2 + n} with b_0 < b_1 < ⋯ < b_n, and let the frequency of element i in the first half of the stream be b_i − b_{i−1} − 1. Any frequency assignment with total N/2 can be represented by such a set. (Note that this also includes assignments where the frequency of some element is at least sN.) Consider only the sets satisfying b_i − b_{i−1} − 1 ≤ N/n for each b_i. Then the frequency of each element is at most N/n. Because sn > 1, it holds that N/n < sN, satisfying the condition. We count the number of such T's.

47 We can select each b_i from any of 1 + N/(2n) candidates (the last one is chosen such that the total frequency is N/2). Thus there are at least (1 + N/(2n))^{n−1} combinations. A data structure which can distinguish any two of them must use at least (n − 1) log(1 + N/(2n)) = Ω(n log(N/n)) bits (the information-theoretic lower bound).

48 Randomized 1-pass Approximation Algorithm
Find solutions satisfying the following conditions with probability at least 1 − δ.
1. Output all elements with frequency at least sN.
2. Any element with frequency less than (s − ε)N is not output.
3. Estimated frequency is at least (true frequency) − εN.
Required memory: O((1/ε) log(1/(sδ))) words (in expectation)

49 Data Structure
S: a set of entries (e, f)
–e: element number
–f: estimated frequency of e
Initial state
–S: empty
–r = 1: elements are sampled with probability 1/r

50 Algorithm
When an element e arrives
–if e is in S, increase f by one
–otherwise, insert (e, 1) into S with probability 1/r
When N = 2rt, for each entry (e, f) of S
–flip a coin until a head comes, and decrease f by the number of tails
–if f becomes 0, delete the entry from S
When N = 2rt, also double r
For a query, return all elements with f ≥ (s − ε)N

51 Resampling
Consider (e, f) ∈ S. Some occurrences of e may have been discarded, each with probability 1 − 1/r.
–Once inserted into S, an occurrence is not discarded anymore.
When N = 2rt, the sampling rate is halved.
–We want to change S as if the elements had been sampled with the new sampling rate from the start.
–An already-discarded occurrence would also be discarded under the halved rate.
–So we discard the first several retained occurrences (one on average) of each element in S, each with probability ½.
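Slides 49–51 can be sketched as follows (a hedged illustration, not the slides' exact pseudocode; the function name, the placement of the boundary check, and the test stream are my own):

```python
import math, random

# Sketch of slides 49-51 (sticky sampling): sample elements with rate
# 1/r, resample and double r whenever N reaches 2rt.
def sticky_sampling(stream, s, eps, delta, rng):
    t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
    S, r, N = {}, 1, 0                      # S maps element -> f
    for e in stream:
        N += 1
        if N == 2 * r * t:                  # boundary: resample, halve rate
            for key in list(S):
                while rng.random() < 0.5:   # a tail: discard one occurrence
                    S[key] -= 1
                    if S[key] == 0:
                        del S[key]
                        break
            r *= 2
        if e in S:
            S[e] += 1
        elif rng.random() < 1 / r:
            S[e] = 1
    return {e: f for e, f in S.items() if f >= (s - eps) * N}

rng = random.Random(42)
stream = ['hot'] * 80 + ['x%d' % i for i in range(20)]
result = sticky_sampling(stream, s=0.5, eps=0.1, delta=0.1, rng=rng)
```

With these parameters the element 'hot' (frequency 80 out of 100) clears the (s − ε)N = 40 threshold, while each 'x…' element appears only once and cannot be reported.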

52 Theorem: This algorithm outputs an answer satisfying the conditions with probability more than 1 − δ. The expected number of entries in S is at most 2t.
Proof: The first 2t elements are all stored in S. At query time rt ≤ N ≤ 2rt, that is, 1/r ≥ t/N. An error occurs for an element only if at least εN of its occurrences were discarded, each with probability 1 − 1/r, so the probability that its frequency is underestimated by at least εN is at most (1 − 1/r)^{εN} ≤ (1 − t/N)^{εN} ≤ e^{−εt}.

53 The number of answers (elements whose frequency is at least sN) is at most 1/s. The probability that there exists an answer whose error is more than εN is therefore at most (1/s)·e^{−εt}. To make this probability at most δ, we set t = (1/ε) log(1/(sδ)). Because N ≤ 2rt and each element is inserted into S with probability 1/r, the expected number of entries is at most 2t. (If the same element appears multiple times, the number of entries only decreases.)

54 Finding Rare Items
Input: integer a_i at time i (i = 1, 2,…)
–a_i ∈ U = {1, 2, ..., u}
Let c_t[j] = |{a_i | a_i = j, i ≤ t}|: the number of occurrences of j until time t
Query: compute the rarity ρ[t] = |{j ∈ U : c_t[j] = 1}| / u at time t
Memory: o(u) bits

55 If Large Memory Can Be Used
For each i ∈ U = {1, 2, ..., u}, use a 2-bit counter
–0: i never appeared
–1: i appeared once
–2: i appeared more than once
Using 2u bits of memory, ρ[t] is computed exactly.
Is it possible using o(u) bits of memory?

56 Lower Bound on Required Memory
Proposition: Any deterministic algorithm using o(u) bits of working space cannot compute exact solutions.
Proof: Assume to the contrary that there exists a deterministic algorithm which computes ρ[t] exactly, and assume that every element of some set S ⊆ U has appeared exactly once until time t. S can be recovered using the following procedure: put each element i ∈ U into the stream at time t + i; then
ρ[t+i] < ρ[t+i−1] ⇔ i ∈ S.
To represent an arbitrary S we need u bits. Contradiction.

57 Approximation Algorithm
At the start, choose k elements from U uniformly at random, then count the frequencies of only those elements.
–X_1[t], ..., X_k[t]: frequency of each chosen element at time t
For a query, return ρ̂[t] = |{i : X_i[t] = 1}| / k
For each i it holds that Pr[X_i[t] = 1] = ρ[t], where the probability is over the choice of the X_i, so ρ̂[t] is a good approximation of ρ[t]. It is a good approximation only if ρ[t] is not much smaller than 1/k.
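The sampling estimator above can be sketched in Python (the function name is my own; taking k = u in the usage example makes the estimate exact for testing purposes):

```python
import random

# Sketch of slide 57: sample k elements of U uniformly at random up
# front, count only their occurrences, and estimate the rarity as the
# fraction of sampled elements seen exactly once.
def rarity_by_sampling(stream, universe_size, k, rng):
    sample = rng.sample(range(1, universe_size + 1), k)
    counts = {x: 0 for x in sample}
    for a in stream:
        if a in counts:
            counts[a] += 1
    return sum(1 for c in counts.values() if c == 1) / k

rng = random.Random(1)
est = rarity_by_sampling([1, 1, 2, 3], universe_size=10, k=10, rng=rng)
```

In the example, elements 2 and 3 appear exactly once out of a universe of size 10, so the exact rarity is 0.2.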

58 The Chernoff Bound
Theorem: Let X_1, X_2, ..., X_n be independent random variables with Pr[X_i = 1] = p_i and Pr[X_i = 0] = 1 − p_i (0 < p_i < 1). Let X = Σ_{i=1}^{n} X_i and μ = E[X] = Σ_{i=1}^{n} p_i. Then for 0 < ε < 1,
Pr[|X − μ| ≥ εμ] ≤ 2 exp(−με²/3).
Note: If all p_i are the same, the trials are called Bernoulli trials.

59 Let Y_i be a random variable for the trial of choosing the i-th sampled element of U, with Y_i = 1 if the chosen element is rare. Then Pr[Y_i = 1] = ρ[t], and if we define Y = Σ_{i=1}^{k} Y_i, then E[Y] = kρ[t] and, by the Chernoff bound,
Pr[|ρ̂[t] − ρ[t]| ≥ ερ[t]] = Pr[|Y − kρ[t]| ≥ εkρ[t]] ≤ 2 exp(−kρ[t]ε²/3).

60 The probability that ρ̂[t] is not an ε-approximation of ρ[t] decreases exponentially in k. However, if ρ[t] < 1/k, the error probability is large. In general, u is huge (IP addresses, etc.) and rare items may be a tiny fraction of U; therefore k would have to be large.

61 Change of the Definition of Rarity
Redefine rarity as the ratio of rare items to the items that actually appeared:
ρ[t] = |{j : c_t[j] = 1}| / |{j : c_t[j] ≥ 1}|
With this definition too, it is not possible to solve the problem exactly using o(u) bits of space.

62 Min-wise Independent Hash Functions
Definition: [u] = {1, ..., u}
Definition: A family of hash functions H ⊆ [u] → [u] is min-wise independent ⇔ for any X ⊆ [u] and x ∈ X,
Pr_{h∈H}[h(x) = min h(X)] = 1/|X|

63 Approximation Algorithm
Choose k min-wise independent hash functions h_1, ..., h_k at random
Let C_i(t): the number of occurrences of the element whose hash value under h_i is currently the minimum
For each input a_i, update the minima and the counters C_i
For a query, return ρ̂[t] = |{i : C_i(t) = 1}| / k

64 ρ̂[t] estimates the probability that the element among a_1, ..., a_t with the minimum hash value appeared only once. By the property of min-wise independent hash functions, each distinct element that has appeared is the minimizer with probability 1/|{j : c_t[j] ≥ 1}|. Therefore E[ρ̂[t]] = ρ[t], and ρ̂[t] is a good approximation of ρ[t].
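The min-wise estimator of slides 62–64 can be sketched as follows. This is an illustrative sketch only: the "hash functions" are random permutations simulated by lazily assigning random values, which is only approximately min-wise independent, and all names are my own.

```python
import random

# Sketch of slides 62-64: for each of k hash functions, track the
# minimum hash value seen and how many times its element has occurred;
# estimate rarity as the fraction of functions whose minimizing
# element occurred exactly once.
def rarity_minwise(stream, k, rng):
    tables = [{} for _ in range(k)]        # lazy "random permutations"
    min_val = [None] * k                   # current minimum hash value
    min_cnt = [0] * k                      # occurrences of its element
    for a in stream:
        for i in range(k):
            h = tables[i].setdefault(a, rng.random())
            if min_val[i] is None or h < min_val[i]:
                min_val[i], min_cnt[i] = h, 1
            elif h == min_val[i]:          # the same element reappeared
                min_cnt[i] += 1
    return sum(1 for c in min_cnt if c == 1) / k
```

When every distinct element appears exactly once the estimate is exactly 1, and when the only element appearing is repeated it is exactly 0, matching the definition of ρ[t] from slide 61.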

65 Example of Min-wise Independent Hash Functions
The family of all permutations of [u] = {1, ..., u} is min-wise independent.
–Θ(u lg u) bits are necessary to represent one permutation → too much memory consumption

66 ε-min-wise Independent Hash Functions
Definition: A family of hash functions H ⊆ [u] → [u] is ε-min-wise independent ⇔ for any X ⊆ [u] and x ∈ X,
Pr_{h∈H}[h(x) = min h(X)] = (1 ± ε)/|X|
cf. a polynomial of degree O(log(1/ε)) over GF(u) gives such a family
–can be represented in O(log u log(1/ε)) bits
–a function value can be computed in O(log(1/ε)) time
–note: u must be a prime

67 Performance of the Approximation Algorithm
With high probability, ρ̂[t] approximates ρ[t] within relative error O(ε).
Using a larger k increases the success probability.
Memory: O(k log u log(1/ε)) bits

