
1 Streaming Algorithms
CS6234 Advanced Algorithms, February 10, 2015

2 The stream model
- Data sequentially enters at a rapid rate from one or more inputs
- We cannot store the entire stream
- Processing is in real time
- Limited memory (usually sublinear in the size of the stream)
- Goal: compute a function of the stream, e.g., median, number of distinct elements, longest increasing sequence
- An approximate answer is usually preferable

3 Overview
- Counting bits with the DGIM algorithm
- Bloom Filter
- Count-Min Sketch
- Approximate Heavy Hitters
- AMS Sketch
- AMS Sketch Applications

4 Counting bits with DGIM algorithm
Presented by Dmitrii Kharkovskii

5 Sliding windows
A useful model: queries are about a window of length N, the N most recent elements received (or the last N time units). Interesting case: N is still so large that the window cannot be stored, or there are so many streams that the windows for all of them cannot be stored.

6 Problem description
Problem: given a stream of 0s and 1s, answer queries of the form "how many 1s are in the last k bits?" where k ≤ N.
Obvious solution: store the most recent N bits (i.e., window size = N); when a new bit arrives, discard the (N+1)-st most recent bit.
Real problem: this is slow, since we need to scan k bits to count. And what if we cannot afford to store N bits? Estimate with an approximate answer.
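The obvious solution above can be sketched as a small ring buffer; this is the O(N)-memory baseline that DGIM improves on (class and variable names are illustrative, not from the slides):

```python
from collections import deque

class ExactWindowCounter:
    """Exact count of 1s among the last N bits: the O(N)-memory baseline."""
    def __init__(self, N):
        self.N = N
        self.window = deque()   # the most recent N bits
        self.ones = 0           # running count of 1s in the window

    def add(self, bit):
        self.window.append(bit)
        self.ones += bit
        if len(self.window) > self.N:   # discard the (N+1)-st most recent bit
            self.ones -= self.window.popleft()

    def count(self):
        return self.ones
```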

7 Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
Overview:
- Approximate answer
- Uses O(log² N) bits of memory
- Performance guarantee: error no more than 50%
- Possible to decrease the error to any fraction ε > 0, still with O(log² N) memory (the constant grows with 1/ε)
- Possible to generalize to the case of a stream of positive integers

8 Main idea of the algorithm
Represent the window as a set of exponentially growing non-overlapping buckets

9 Timestamps
Each bit in the stream has a timestamp: its position in the stream from the beginning. Record timestamps modulo N (the window size), using O(log N) bits. Store the most recent timestamp to identify the position of any other bit in the window.

10 Buckets
Each bucket has two components:
- Timestamp of its most recent end: needs O(log N) bits
- Size of the bucket, the number of 1s in it: the size is always 2^j, so to store j we need O(log log N) bits
Each bucket therefore needs O(log N) bits.

11 Representing the stream by buckets
- The right end of a bucket is always a position with a 1
- Every position with a 1 is in some bucket
- Buckets do not overlap
- There are one or two buckets of any given size, up to some maximum size
- All sizes must be a power of 2
- Buckets cannot decrease in size as we move to the left (back in time)

12 Updating buckets when a new bit arrives
- Drop the oldest bucket if it has no overlap with the window
- If the current bit is 0, no changes are needed
- If the current bit is 1:
  - Create a new bucket for it: size = 1, timestamp = current time modulo N
  - If there are now 3 buckets of size 1, merge the two oldest into one of size 2
  - If there are now 3 buckets of size 2, merge the two oldest into one of size 4, and so on

13 Example of updating process

14 Query Answering
How many 1s are in the most recent k bits?
- Find all buckets overlapping with the last k bits
- Sum the sizes of all but the oldest one
- Add half of the size of the oldest one
(The slide's pictured example evaluates to Ans = 24.)
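The update and query rules above can be sketched as follows. This is a simplified illustration: timestamps are kept unreduced rather than modulo N, and "half of the oldest bucket" uses integer division; names are mine.

```python
class DGIM:
    """Sketch of the DGIM bucket structure for counting 1s in a window of N bits."""
    def __init__(self, N):
        self.N = N
        self.t = 0          # current time (position in the stream)
        self.buckets = []   # list of (end_timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # drop the oldest bucket once it falls entirely outside the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.insert(0, (self.t, 1))
        # never allow three buckets of the same size: merge the two oldest
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                ts = self.buckets[i + 1][0]          # keep the newer timestamp
                size = self.buckets[i + 1][1] * 2
                self.buckets[i + 1:i + 3] = [(ts, size)]
            else:
                i += 1

    def count(self, k):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        total, oldest = 0, 0
        for ts, size in self.buckets:
            if ts > self.t - k:       # bucket overlaps the last k bits
                total += size
                oldest = size         # last one seen is the oldest overlapping
        return total - oldest + oldest // 2
```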

15 Memory requirements
There are O(log N) distinct bucket sizes and at most two buckets of each size, so O(log N) buckets in total; at O(log N) bits per bucket, the whole structure needs O(log² N) bits.

16 Performance guarantee
Suppose the oldest overlapping bucket has size 2^r. By taking half of it, the maximum error is 2^(r−1). There is at least one bucket of every size less than 2^r, and the right end of the oldest bucket is always a 1, so the true sum is at least 1 + 2 + … + 2^(r−1) = 2^r − 1. Hence the error is at most 50%.

17 References
J. Leskovec, A. Rajaraman, J. Ullman. "Mining of Massive Datasets". Cambridge University Press.

18 Bloom Filter
Presented by Naheed Anjum Arafat

19 Motivation: The "Set Membership" Problem
Input: x (an element) and S (a finite set of elements). Output: True if x is in S, False if it is not.
Exact solution: binary search on a sorted array of size |S|; runtime complexity O(log |S|).
Streaming setting: limited space per item and limited processing time per item; give an approximate answer based on a summary/sketch of the data stream kept in memory.

20 Bloom Filter
Consists of:
- a vector of n Boolean values, initially all set to False (space O(n)); the slide's example uses n = 10
- k independent and uniform hash functions h_0, h_1, …, h_{k−1}, each outputting a value in the range {0, 1, …, n−1}

21 Bloom Filter
For each element s ∈ S, the Boolean values at positions h_0(s), h_1(s), …, h_{k−1}(s) are set to True. Complexity of insertion: O(k). (Example with k = 3: h_0(s_1) = 1, h_1(s_1) = 4, h_2(s_1) = 6.)

22 Bloom Filter
Note: a particular Boolean value may be set to True several times. (Continuing the example with k = 3: h_0(s_2) = 4, h_1(s_2) = 7, h_2(s_2) = 9, so s_2 re-sets position 4, already set by s_1.)

23 Algorithm to Approximate Set Membership Query
Input: x (may or may not be an element of S). Output: Boolean.
For all i ∈ {0, 1, …, k−1}: if the bit at position h_i(x) is False, return False.
Return True.
Runtime complexity: O(k).
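A minimal sketch of the filter with the insertion and query loops from the last few slides. The k hash functions are simulated here by salting a SHA-256 digest, which is my assumption for illustration, not the slides' construction:

```python
import hashlib

class BloomFilter:
    """Bloom filter: n bits, k salted-hash positions per item."""
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [False] * n     # vector of n Boolean values, all False

    def _positions(self, item):
        # simulate k independent hash functions by salting one digest
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def insert(self, item):         # O(k)
        for pos in self._positions(item):
            self.bits[pos] = True

    def query(self, item):
        # O(k); False is always correct, True may be a false positive
        return all(self.bits[pos] for pos in self._positions(item))
```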

24 Algorithm to Approximate Set Membership Query
False positive! In the slide's example (k = 3), a new element x hashes to h_0(x) = 9, h_1(x) = 6, h_2(x) = 1. All three positions were already set to True by s_1 (positions 1, 4, 6) and s_2 (positions 4, 7, 9), so the filter wrongly answers True.

25 Error Types
False negative: answering "is not there" for an element which is there. Never happens for a Bloom filter.
False positive: answering "is there" for an element which is not there. Might happen. How likely?

26 Probability of false positives
n = size of table, m = number of items, k = number of hash functions.
Consider a particular bit j, 0 ≤ j ≤ n−1.
Probability that h_i(x) does not set bit j after hashing one item: P[h_i(x) ≠ j] = 1 − 1/n.
Probability that h_i(x) does not set bit j after hashing m items: P[∀x ∈ {s_1, s_2, …, s_m}: h_i(x) ≠ j] = (1 − 1/n)^m.
Where does the randomness come from? From the set {s_1, s_2, …, s_m}: imagine trying m times to set the bit at position j to True using a fixed (i-th) hash function h_i (no randomness in the function itself); the last line is the probability that every trial fails.

27 Probability of false positives
n = size of table, m = number of items, k = number of hash functions.
Probability that none of the hash functions sets bit j after hashing m items:
P[∀x ∈ {s_1, …, s_m}, ∀i ∈ {1, …, k}: h_i(x) ≠ j] = (1 − 1/n)^{km}.
Since (1 − 1/n)^n ≈ 1/e = e^{−1}, we get (1 − 1/n)^{km} = ((1 − 1/n)^n)^{km/n} ≈ e^{−km/n}.
The (1 − 1/n)^m term is raised to the k-th power because using k hash functions on m items means km trials to set bit j; once again there is no randomness in h_i itself, just more trials.

28 Probability of false positives
n = size of table, m = number of items, k = number of hash functions.
Probability that bit j is not set: P[bit j = F] = e^{−km/n}.
Approximate probability of a false positive, i.e. of having all k bits of a new element already set: (1 − e^{−km/n})^k.
For fixed m and n, which value of k minimizes this bound? k_opt = ln 2 · (n/m), where n/m is the number of bits per item.
The resulting probability of a false positive is (1/2)^{k_opt} ≈ (0.6185)^{n/m}.
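The two formulas on this slide can be checked numerically; the function names below are mine:

```python
import math

def bloom_fp_rate(n, m, k):
    """Approximate false-positive probability: (1 - e^{-km/n})^k."""
    return (1.0 - math.exp(-k * m / n)) ** k

def optimal_k(n, m):
    """k_opt = ln(2) * n/m, rounded to a whole number of hash functions."""
    return max(1, round(math.log(2) * n / m))
```

For example, with 10 bits per item (n/m = 10) this gives k_opt = 7 and a false-positive bound of about 0.008, matching (0.6185)^10.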

29 Bloom Filters: cons
- Small (but nonzero) false positive probability
- Cannot handle deletions
- The size of the bit vector has to be set a priori in order to maintain a predetermined FP rate. Resolved in: Almeida, Paulo; Baquero, Carlos; Preguica, Nuno; Hutchison, David (2007), "Scalable Bloom Filters", Information Processing Letters 101 (6): 255-261.

30 References
https://en.wikipedia.org/wiki/Bloom_filter
Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research.
Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge.

31 Count-Min Sketch
Presented by Erick Purwanto

32 Motivation
Count-Min Sketch is implemented in real systems:
- AT&T: network switches analyze network traffic using limited memory
- Google: implemented on top of the MapReduce parallel processing infrastructure
It is simple, and it is used to solve other problems:
- Heavy Hitters (presented by Joseph)
- Second Moment F_2 and the AMS Sketch (presented by Manupa)
- Inner Product and Self Join (presented by Sapumal)

33 Frequency Query
Given a stream of data: a vector x of length n arrives item by item, with each item x_i ∈ [1, m], and each arrival is an update (increment) operation. At any time we want to know f_j, the frequency of item j (assume f_j ≥ 0).
Trivial if we keep a count array over [1, m]; instead we want sublinear space and a probabilistically approximately correct answer.

34 Count-Min Sketch
Assumption: a family H of pairwise-independent hash functions; sample d hash functions h_i ← H, each mapping h_i: [1, m] → [1, w].
Use: the d independent hash functions and an integer array CM[d, w].

35 Count-Min Sketch
Algorithm to update:
Inc(j): for each row i, CM[i, h_i(j)] += 1.

36 Count-Min Sketch
Algorithm to estimate a frequency query:
Count(j): f̂_j = min_i CM[i, h_i(j)].
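Update and query fit in a compact class. The hash family ((a·x + b) mod p) mod w is borrowed from the later AMS slides; using it here, and the names, are my assumptions:

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, one pairwise-independent
    hash function per row."""
    P = 2**31 - 1   # a Mersenne prime larger than the item domain

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.coeffs = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]

    def _h(self, i, x):
        a, b = self.coeffs[i]
        return ((a * x + b) % self.P) % self.w

    def inc(self, x, c=1):      # update: O(d)
        for i in range(self.d):
            self.table[i][self._h(i, x)] += c

    def count(self, x):         # query: O(d); never underestimates
        return min(self.table[i][self._h(i, x)] for i in range(self.d))
```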

37 Collision
The entry CM[i, h_i(j)] is an estimate of the frequency of item j at row i. Collisions inflate it: for example, if h_1(5) = h_1(2) = 7, items 5 and 2 share the counter at position 7 of row 1. Let f_j be the frequency of j, and let the random variable X_{i,j} be the total frequency of all k ≠ j with h_i(k) = h_i(j).

38 Count-Min Sketch Analysis
Estimated frequency of j at row i:
f̂_{i,j} = CM[i, h_i(j)] = f_j + Σ_{k≠j, h_i(k)=h_i(j)} f_k = f_j + X_{i,j}.

39 Count-Min Sketch Analysis
Let ε be the approximation error, and set w = e/ε. The expected contribution of the other items:
E[X_{i,j}] = Σ_{k≠j} f_k · Pr[h_i(k) = h_i(j)] ≤ Pr[h_i(k) = h_i(j)] · Σ_k f_k = (1/w) · F_1 = (ε/e) · F_1.

40 Count-Min Sketch Analysis
Markov inequality: Pr[X ≥ k · E[X]] ≤ 1/k.
Probability that one row's estimate is more than ε·F_1 above the true value:
Pr[f̂_{i,j} > f_j + ε·F_1] = Pr[X_{i,j} > ε·F_1] = Pr[X_{i,j} > e · E[X_{i,j}]] ≤ 1/e.

41 Count-Min Sketch Analysis
Let δ be the failure probability, and set d = ln(1/δ). Probability that the final estimate is far from the true value:
Pr[f̂_j > f_j + ε·F_1] = Pr[∀i: f̂_{i,j} > f_j + ε·F_1] = (Pr[f̂_{i,j} > f_j + ε·F_1])^d ≤ (1/e)^{ln(1/δ)} = δ.

42 Count-Min Sketch Result
- A dynamic data structure CM supporting item frequency queries
- Set w = e/ε and d = ln(1/δ)
- With probability at least 1 − δ: f̂_j ≤ f_j + ε · Σ_k f_k
- Sublinear space: depends on neither n nor m
- Running time: O(d) per update and O(d) per frequency query

43 Approximate Heavy Hitters
TaeHoon Joseph, Kim

44 Count-Min Sketch (CMS)
- Inc(j) takes O(d) time: it updates one counter in each of the d rows
- Count(j) takes O(d) time: it returns the minimum of the d values

45 Heavy Hitters Problem
Input: an array of length n with m distinct items.
Objective: find all items that occur more than n/k times in the array.
Parameter k: n is very large (millions or billions), while k is modest (10 or 1000).
There can be at most k such items in the array, and possibly none. Suppose not: with more than k such items the array would contain more than (n/k)·k = n items, a contradiction.

46 Heavy Hitters Problem: Naïve Solution
The trivial solution uses an O(m) array: store every item and its frequency, then report the (at most k) items whose frequency exceeds n/k.

47 ε-Heavy Hitters Problem (ε-HH)
A relaxation of the Heavy Hitters Problem: the exact problem cannot be solved in sub-linear space, so we relax it. Parameters: k and ε; in this presentation, ε = 1/(2k).

48 ε-Heavy Hitters Problem (ε-HH)
- Returns every item that occurs more than n/k times
- May also return some items that occur more than n/k − ε·n times
The Count-Min sketch guarantees f̂_j ≤ f_j + ε · Σ_k f_k. Because we run ε-HH on top of a CMS, every estimated frequency carries this error, so some items below the true threshold may be returned as heavy hitters.

49 Naïve Solution using CMS
(Diagram: each of the m items j = 1, …, m is hashed by h_1, …, h_d into the d × w table.)

50 Naïve Solution using CMS
Query the frequency of all m items and return those with Count(j) ≥ n/k. This takes O(md) time: too slow.

51 Better Solution
- Use a CMS to store the frequencies
- Use a baseline b as a threshold: after the i-th item, b = i/k
- Use a Min-Heap to store potential heavy hitters:
  - after the i-th item, store new items in the Min-Heap if their estimated frequency is ≥ b
  - delete old items from the Min-Heap whose stored frequency is < b
- Since the heap holds O(k) items, it supports O(log k) insertion and deletion: Find-Min in O(1) time and Extract-Min in O(log k) time

52 ε-Heavy Hitters Problem (ε-HH)
- Returns every item that occurs more than n/k times
- May return some items that occur more than n/k − ε·n times
With ε = 1/(2k), Count(x) ∈ [f_x, f_x + n/(2k)], so a heap of size 2k suffices; other items with lower frequency are deleted from the heap.

53 Algorithm: Approximate Heavy Hitters
Input: stream x, parameter k. For each item j ∈ x:
- Update the Count-Min sketch
- Compare the estimated frequency of j with b: if Count(j) ≥ b, insert or update j in the Min-Heap
- Remove any value in the Min-Heap with stored frequency < b
Return the contents of the Min-Heap as the heavy hitters.
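The loop above can be sketched with Python's heapq. This is a simplified illustration: `cms` is any object exposing inc(x) and count(x) (e.g. a Count-Min sketch), and the linear heap rebuild on re-insertion is acceptable only because the heap never exceeds ~2k entries; a production version would index heap entries by item.

```python
import heapq

def approximate_heavy_hitters(stream, k, cms):
    """Maintain a CMS plus a min-heap of (estimated count, item) candidates."""
    heap = []
    for i, x in enumerate(stream, start=1):
        cms.inc(x)
        b = i / k                              # current threshold b = i/k
        est = cms.count(x)
        if est >= b:
            # refresh x's entry (linear rebuild; heap size is at most ~2k)
            heap = [(c, item) for c, item in heap if item != x]
            heapq.heapify(heap)
            heapq.heappush(heap, (est, x))
        while heap and heap[0][0] < b:         # evict stale low-count items
            heapq.heappop(heap)
    return {item for _, item in heap}
```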

54-66 Examples
These slides animate the algorithm on a small stream with k = 5 (so the heap holds at most 2k = 10 candidates). As each item arrives, its CMS counters are incremented and the threshold b = i/k grows: b = 1/5 after the first item, 6/5 after the sixth, 79/5 = 15.8 after the 79th, 81/5 = 16.2 after the 81st. An item whose estimated count reaches b is inserted into the Min-Heap as {count : item}: after the first item the heap holds {1:4}, and by i = 79 it holds {16:4}, {20:9}, {23:6}. Items whose stored count falls below b are evicted, so by i = 81 only the surviving candidates, e.g. {21:9} and {23:6}, remain as potential heavy hitters.

67 Analysis
Because n is not known in advance, possible heavy hitters are recalculated and stored as every new item comes in. Maintaining the heap requires extra O(log k) = O(log(1/ε)) time per item.

68 AMS Sketch : Estimate Second Moment
Dissanayaka Mudiyanselage Emil Manupa Karunaratne

69 The Second Moment
Stream: x_1, x_2, …; the second moment is F_2 = Σ_j f_j², the sum of the squared item frequencies.
The trivial solution would be to maintain a histogram of size n and take the sum of squares. It is not feasible to maintain that large an array, so we look for an approximation algorithm that achieves sub-linear space with bounded error: the algorithm gives an estimate within ε relative error with δ failure probability (two parameters).

70 The Method
For the next item j in the stream:
- d pairwise-independent hash functions h_i choose the bucket in each row
- after finding the bucket, d 4-wise independent hash functions g_i, mapping onto {−1, +1}, decide whether to increment or decrement it
In summary: for each row i, CM[i, h_i(j)] += g_i(j).

71 The Method
Calculate each row estimate R_k = Σ_i CM[k, i]², then take the median of the d row estimates.
Choose w = 4/ε² and d = 8·log(1/δ); this gives an estimate with ε relative error and δ failure probability.
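The method can be sketched as follows. The polynomial hash coefficients are drawn with Python's RNG; treating them as sufficiently independent, and the class name, are assumptions of this illustration:

```python
import random

class AMSSketch:
    """AMS F2 estimator: d rows of w signed counters."""
    P = 2**31 - 1   # prime for the polynomial hash families

    def __init__(self, w, d, seed=1):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # pairwise-independent h (linear) and 4-wise independent g (cubic)
        self.hc = [(rng.randrange(1, self.P), rng.randrange(self.P))
                   for _ in range(d)]
        self.gc = [tuple(rng.randrange(self.P) for _ in range(4))
                   for _ in range(d)]

    def _h(self, i, x):
        a, b = self.hc[i]
        return ((a * x + b) % self.P) % self.w

    def _g(self, i, x):
        a, b, c, e = self.gc[i]
        return 2 * (((a * x**3 + b * x**2 + c * x + e) % self.P) % 2) - 1

    def update(self, x, count=1):
        for i in range(self.d):
            self.table[i][self._h(i, x)] += self._g(i, x) * count

    def estimate_f2(self):
        rows = sorted(sum(v * v for v in row) for row in self.table)
        return rows[len(rows) // 2]   # median of the d row estimates
```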

72 Why should this method give F2?
For the k-th row, the estimate is R_k = Σ_i CM[k, i]² = Σ_i (Σ_{j: h_k(j)=i} g_k(j)·f_j)².
Expanding, each row gives Σ_j g_k(j)²·f_j² plus cross terms Σ_{i≠j} g_k(i)·g_k(j)·f_i·f_j from collisions.
First part: g_k(j)² = 1, so it sums to exactly F_2.
Second part: g(i)·g(j) is +1 or −1 with equal probability, so its expectation is 0.

73 What guarantee can we give about the accuracy?
The variance of a row estimate R_k is caused by hashing collisions. Given the independence of the hash functions, the variance is bounded by F_2²/w. Using the Chebyshev inequality with w = 4/ε²:
Pr[|R_k − F_2| ≥ ε·F_2] ≤ Var(R_k)/(ε²·F_2²) ≤ 1/(ε²·w) = 1/4.
Still, this failure probability decreases only linearly in 1/w.

74 What guarantee can we give about the accuracy?
The d hash-function rows produce independent estimates R_1, R_2, …, R_d. The median is wrong only if at least half of the estimates are wrong. Like independent coin tosses, d independent estimates have exponentially decaying probability of failing together, which gives the stronger Chernoff bound. The expected number of good estimates is μ = d · 3/4 (d estimates, each succeeding with probability 3/4), and an error requires only d/2 good estimates, a deviation of d/4 below the mean:
Pr[median wrong] ≤ e^{−2·(d/4)²/d} = e^{−d/8} ≤ δ for d = 8·ln(1/δ).

75 Space and Time Complexity
E.g., to achieve failure probability e^{−10} with tightly bounded accuracy, only 8 × 10 = 80 rows are required.
Space complexity is O(w·d) = O((1/ε²)·log(1/δ)) counters. Time complexity will be explained later along with the application.

76 AMS Sketch and Applications
Sapumal Ahangama

77 Hash functions
h_k maps the input domain uniformly onto the buckets {1, 2, …, w}.
h_k should come from a pairwise-independent family, so that product terms cancel out. Example family: h(x) = ((a·x + b) mod p) mod w, for a and b chosen uniformly from the prime field of order p, with a ≠ 0.

78 Hash functions
g_k maps elements from the domain uniformly onto {−1, +1}.
g_k should be four-wise independent. Example: from the family of cubic polynomials a·x³ + b·x² + c·x + d mod p, take
g(x) = 2·(((a·x³ + b·x² + c·x + d) mod p) mod 2) − 1,
for a, b, c, d chosen uniformly from the prime field of order p.

79 Hash functions
These hash functions can be computed very quickly, faster even than more familiar (cryptographic) hash functions. For scenarios that require very high throughput, efficient implementations are available, based on optimizations for particular values of p and on partial precomputation.
Ref: M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In ACM-SIAM Symposium on Discrete Algorithms, 2004.

80 Time complexity - Update
The sketch is initialized by picking the hash functions to use and initializing the array of counters to all zeros. For each update operation, the item is mapped to an entry in each row by the hash function h_j, and that entry is changed by the corresponding value of g_j. Processing each update therefore takes O(d) time, since each hash function evaluates in constant time.

81 Time complexity - Query
The query takes the sum of squares of each row of the sketch in turn, and returns the median of these sums: for each row k, compute Σ_i CM[k, i]², then take the median of the d such estimates. The query time is therefore linear in the size of the sketch, O(w·d).

82 Applications - Inner product
The AMS sketch can be used to estimate the inner product between a pair of vectors. Given two frequency distributions f and f′:
f · f′ = Σ_{i=1}^{M} f(i) · f′(i).
The AMS-sketch-based estimator is an unbiased estimator of the inner product of the vectors.

83 Inner Product
Take two sketches CM and CM′ formed with the same parameters and the same hash functions (same w, d, h_k, g_k). The row estimate is the inner product of the corresponding rows:
Σ_{i=1}^{w} CM[k, i] · CM′[k, i].

84 Inner Product
Expanding Σ_{i=1}^{w} CM[k, i] · CM′[k, i] shows that the estimate gives f · f′ with additional cross-terms due to collisions of items under h_k. The expectation of these cross-terms is zero over the choice of the hash functions, since g_k is equally likely to add as to subtract any given term.

85 Inner Product - Join size estimation
The inner product has a natural interpretation as the size of the equi-join between two relations. In SQL:
SELECT COUNT(*) FROM D, D' WHERE D.id = D'.id
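Given two such sketches as d × w arrays built with identical hash functions, the median-of-rows estimate from slide 83 is a short function (the name is mine):

```python
def inner_product_estimate(cm1, cm2):
    """Median over rows k of sum_i CM[k,i] * CM'[k,i]; assumes both sketches
    share w, d and the hash functions h_k, g_k."""
    rows = sorted(sum(a * b for a, b in zip(r1, r2))
                  for r1, r2 in zip(cm1, cm2))
    return rows[len(rows) // 2]
```

With one row and no collisions this reduces to the exact inner product; with more rows, the median damps the collision cross-terms.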

86-90 Example
With d = 3 and w = 8: the update UPDATE(23, 1) hashes item 23 to h_1(23) = 3 with g_1 = −1, h_2(23) = 1 with g_2 = −1, and h_3(23) = 7 with g_3 = +1, so the three touched counters change by −1, −1, +1. The update UPDATE(99, 2) then hashes to h_1(99) = 5 with g_1 = +1, h_2(99) = 1 with g_2 = −1, and h_3(99) = 3 with g_3 = +1, changing counters by +2, −2, +2. Note that 99 collides with 23 in row 2 (both hash to position 1), whose counter becomes −1 − 2 = −3.

