CS6234 Advanced Algorithms, February 10 2015: Streaming Algorithms
The stream model: data enters sequentially at a rapid rate from one or more inputs. We cannot store the entire stream, must process in real time, and have limited memory (usually sublinear in the size of the stream). Goal: compute a function of the stream, e.g., median, number of distinct elements, longest increasing subsequence. An approximate answer is usually acceptable.
Overview Counting bits with DGIM algorithm Bloom Filter Count-Min Sketch Approximate Heavy Hitters AMS Sketch AMS Sketch Applications
Counting bits with DGIM algorithm Presented by Dmitrii Kharkovskii
Sliding windows. A useful model: queries are about a window of length N, the N most recent elements received (or the last N time units). Interesting case: N is still so large that the window cannot be stored; or there are so many streams that the windows for all of them cannot be stored.
Problem description. Problem: given a stream of 0's and 1's, answer queries of the form "how many 1's are in the last k bits?" where k ≤ N. Obvious solution: store the most recent N bits (window size N); when a new bit arrives, discard the (N+1)-st bit. Real problem: this is slow (we need to scan k bits to count), and what if we cannot afford to store N bits? Instead, estimate with an approximate answer.
Datar-Gionis-Indyk-Motwani (DGIM) algorithm overview: approximate answer using O(log² N) memory; performance guarantee: error no more than 50%; possible to decrease the error to any fraction ε > 0, still with O(log² N) memory; possible to generalize to streams of positive integers.
Main idea of the algorithm Represent the window as a set of exponentially growing non-overlapping buckets
Timestamps. Each bit in the stream has a timestamp: its position from the beginning of the stream. Record timestamps modulo N (the window size), using O(log N) bits. Store the most recent timestamp to identify the position of any other bit in the window.
Buckets. Each bucket has two components: the timestamp of its most recent end, needing O(log N) bits; and the size of the bucket, the number of ones in it. The size is always 2^j, so to store j we need O(log log N) bits. Each bucket therefore needs O(log N) bits.
Representing the stream by buckets The right end of a bucket is always a position with a 1. Every position with a 1 is in some bucket. Buckets do not overlap. There are one or two buckets of any given size, up to some maximum size. All sizes must be a power of 2. Buckets cannot decrease in size as we move to the left (back in time).
Updating buckets when a new bit arrives Drop the last bucket if it has no overlap with the window If the current bit is zero, no changes are needed If the current bit is one Create a new bucket with it. Size = 1, timestamp = current time modulo N. If there are 3 buckets of size 1, merge two oldest into one of size 2. If there are 3 buckets of size 2, merge two oldest into one of size 4. ...
Example of updating process
Query answering. How many ones are in the most recent k bits? Find all buckets overlapping the last k bits, sum the sizes of all but the oldest one, then add half the size of the oldest one. Example: Ans = 1 + 1 + 2 + 4 + 4 + 8 + 8/2 = 24.
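The bucket maintenance and query steps above can be sketched in Python. This is an illustrative implementation, not from the lecture: timestamps are kept exact rather than modulo N for clarity, and the class and method names are made up.

```python
class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0              # current timestamp
        self.buckets = []       # (timestamp of right end, size), newest first

    def add(self, bit):
        self.t += 1
        # drop buckets whose right end slid out of the window
        self.buckets = [(ts, sz) for ts, sz in self.buckets
                        if ts > self.t - self.N]
        if bit == 0:
            return
        self.buckets.insert(0, (self.t, 1))
        # allow at most two buckets of each size; on a third,
        # merge the two *oldest* into one of double the size
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                ts_new = self.buckets[i + 1][0]     # newer end of the merged pair
                self.buckets[i + 1:i + 3] = [(ts_new, self.buckets[i + 1][1] * 2)]
            else:
                i += 1

    def count(self, k):
        """Estimate the number of 1's among the last k bits."""
        total, oldest = 0, 0
        for ts, sz in self.buckets:
            if ts > self.t - k:
                total += sz
                oldest = sz
        # sum all overlapping buckets but count only half of the oldest
        return (total - oldest + oldest // 2) if oldest else 0
```

Feeding it a run of ones and querying gives an estimate within the 50% guarantee of the true count.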
Memory requirements: there are O(log N) buckets (at most two of each size 1, 2, 4, ..., up to O(N)), each needing O(log N) bits, for O(log² N) bits in total.
Performance guarantee. Suppose the oldest overlapping bucket has size 2^r. By taking half of it, the maximum error is 2^(r-1). There is at least one bucket of every size less than 2^r, and the first bit of the oldest bucket is always a 1, so the true sum is at least 1 + 2 + 4 + ... + 2^(r-1) = 2^r - 1. The error is therefore at most 50%.
References: J. Leskovec, A. Rajaraman, J. Ullman. "Mining of Massive Datasets". Cambridge University Press.
Bloom Filter. Presented by Naheed Anjum Arafat
Motivation: the "set membership" problem. Input: an element x and a finite set S. Output: True if x is in S, False otherwise. A non-streaming solution: binary search on a sorted array of size |S|, with runtime O(log |S|). A streaming algorithm has limited space and limited processing time per item, and gives an approximate answer based on a summary/sketch of the data stream kept in memory.
Bloom filter. Consists of a vector of n Boolean values, initially all false (space O(n)), and k independent, uniform hash functions h_0, h_1, ..., h_(k-1), each outputting a value in the range {0, 1, ..., n-1}. (In the running example, n = 10.)
For each element s in S, the Boolean values at positions h_0(s), h_1(s), ..., h_(k-1)(s) are set true. Complexity of insertion: O(k). (Example with k = 3: h_0(s_1) = 1, h_1(s_1) = 4, h_2(s_1) = 6.)
Note: a particular Boolean value may be set true several times by different elements. (Example with k = 3: h_0(s_2) = 4, h_1(s_2) = 7, h_2(s_2) = 9, where position 4 was already set by s_1.)
Algorithm to approximate a set-membership query. Input: x (may or may not be an element of S). Output: Boolean. For all i in {0, 1, ..., k-1}: if the bit at h_i(x) is False, return False. Otherwise return True. Runtime: O(k).
False positive! An element x not in S can hash entirely onto bits set by other elements. (Example: h_0(x) = 9, h_1(x) = 6, h_2(x) = 1 are all already set by s_1 and s_2, so x is reported as present.)
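A minimal Bloom filter along these lines. The k independent hash functions are simulated by salting SHA-1 with the function index, an implementation choice not specified in the slides:

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [False] * n          # vector of n Booleans, all false

    def _positions(self, item):
        # simulate k independent hash functions by salting with index i
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, item):                 # O(k) insertion
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):        # O(k) membership query
        return all(self.bits[p] for p in self._positions(item))
```

Inserted elements are always reported present (no false negatives); a non-inserted element is occasionally reported present (a false positive).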
Error types. False negative: answering "is not there" for an element which is there; this never happens for a Bloom filter. False positive: answering "is there" for an element which is not there; this can happen. How likely is it?
Probability of false positives (n = size of table, m = number of items, k = number of hash functions). Consider a particular bit j, 0 ≤ j ≤ n-1. The probability that h_i(x) does not set bit j after hashing one item: P[h_i(x) ≠ j] = 1 - 1/n. The probability that h_i(x) does not set bit j after hashing m items: P[for all x in {s_1, ..., s_m}: h_i(x) ≠ j] = (1 - 1/n)^m. (Where does the randomness come from? From the hash values of {s_1, s_2, ..., s_m}: with the i-th hash function fixed, hashing m items gives m trials to set bit j, and this is the probability that all m trials miss.)
The probability that none of the k hash functions sets bit j after hashing m items: (1 - 1/n)^(km). Since (1 - 1/n)^n ≈ 1/e = e^(-1), we get (1 - 1/n)^(km) = ((1 - 1/n)^n)^(km/n) ≈ e^(-km/n). (The exponent is km because hashing m items with k hash functions gives km trials to set bit j; again there is no randomness in any individual h_i, just more trials.)
Approximate probability of a false positive. The probability that bit j is not set: P[bit j = F] = e^(-km/n). The probability that all k bits of a new element are already set: (1 - e^(-km/n))^k. For fixed m and n, this bound is minimized at k_opt = ln 2 · (n/m), where n/m is the number of bits per item. The resulting false-positive probability is (1/2)^(k_opt) ≈ (0.6185)^(n/m).
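A quick numeric check of these formulas; the values n = 10000 and m = 1000 (10 bits per item) are illustrative:

```python
import math

def fp_rate(n, m, k):
    # slide's approximation: (1 - e^{-km/n})^k
    return (1 - math.exp(-k * m / n)) ** k

n, m = 10_000, 1_000
k_opt = math.log(2) * n / m          # ln 2 * (n/m), about 6.93
best = fp_rate(n, m, round(k_opt))   # about 0.6185**10, i.e. roughly 0.008
print(round(k_opt, 2), round(best, 4))
```

Evaluating neighboring k values confirms that the false-positive rate is minimized near k_opt.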
Bloom filters: cons. There is a small false-positive probability; deletions cannot be handled; and the size of the bit vector has to be set a priori to maintain a predetermined FP rate. The last issue is resolved by the scalable Bloom filter: Almeida, Paulo; Baquero, Carlos; Preguiça, Nuno; Hutchison, David (2007), "Scalable Bloom Filters", Information Processing Letters 101(6): 255-261.
References: https://en.wikipedia.org/wiki/Bloom_filter ; Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research; Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge; http://people.cs.umass.edu/~mcgregor/711S12/sketches1.pdf ; http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
Count-Min Sketch Erick Purwanto A0050717L
Motivation. Count-Min sketch is implemented in real systems: AT&T uses it in network switches to analyze network traffic with limited memory; Google implemented it on top of the MapReduce parallel-processing infrastructure. It is simple, and it is used to solve other problems: heavy hitters (presented by Joseph), the second moment F_2 and the AMS sketch (Manupa), and inner product / self-join size (Sapumal).
Frequency query. Given a stream x of length n with items x_i in [1, m] and an update (increment) operation, we want to know, at each point in time, the frequency f_j of item j (assume f_j ≥ 0). This is trivial with a count array indexed by [1, m]; we want sublinear space, and accept being probabilistically approximately correct.
Count-Min sketch. Assumption: a family H of independent hash functions; sample d of them, h_i ← H, each mapping h_i : [1, m] → [1, w]. Data structure: the d hash functions and an integer array CM of size w × d.
Count-Min sketch, algorithm to update. Inc(j): for each row i, CM[i, h_i(j)] += 1.
Count-Min sketch, algorithm to estimate a frequency query. Count(j): f̂_j = min_i CM[i, h_i(j)].
Collision. The entry CM[i, h_i(j)] is an estimate of the frequency of item j at row i, but other items may collide with it (for example, h_1(5) = h_1(2) = 7). Let f_j be the frequency of j, and let the random variable X_(i,j) be the total frequency of all items k ≠ j with h_i(k) = h_i(j).
Count-Min sketch analysis. The estimate of the frequency of j at row i is f̂_(i,j) = CM[i, h_i(j)] = f_j + Σ_{k ≠ j, h_i(k) = h_i(j)} f_k = f_j + X_(i,j).
Count-Min sketch analysis. Let ε be the approximation error, and set w = e/ε. The expected contribution of the other items is E[X_(i,j)] = Σ_{k ≠ j} f_k · Pr[h_i(k) = h_i(j)] ≤ Pr[h_i(k) = h_i(j)] · Σ_k f_k = (1/w) · F_1 = (ε/e) · F_1.
Count-Min sketch analysis. Markov's inequality: Pr[X ≥ k · E[X]] ≤ 1/k. The probability that a row estimate is more than ε · F_1 above the true value: Pr[f̂_(i,j) > f_j + ε · F_1] = Pr[X_(i,j) > ε · F_1] = Pr[X_(i,j) > e · E[X_(i,j)]] ≤ 1/e.
Count-Min sketch analysis. Let δ be the failure probability, and set d = ln(1/δ). The probability that the final estimate is far from the true value: Pr[f̂_j > f_j + ε · F_1] = Pr[for all i: f̂_(i,j) > f_j + ε · F_1] = (Pr[f̂_(i,j) > f_j + ε · F_1])^d ≤ (1/e)^(ln(1/δ)) = δ.
Count-Min sketch result: a dynamic data structure CM supporting item-frequency queries. Set w = e/ε and d = ln(1/δ); then with probability at least 1 - δ, f̂_j ≤ f_j + ε · Σ_k f_k. The space is sublinear and depends on neither n nor m; update and frequency query both run in O(d) time.
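A compact Count-Min sketch matching the result above. The hash family h(x) = ((ax + b) mod p) mod w anticipates the pairwise-independent family on the later slides; the particular prime and the use of a seeded PRNG to pick coefficients are example choices:

```python
import math
import random

class CountMin:
    P = 2_147_483_647                       # Mersenne prime 2^31 - 1 (example)

    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.w = math.ceil(math.e / eps)    # w = e / eps
        self.d = math.ceil(math.log(1 / delta))   # d = ln(1 / delta)
        self.table = [[0] * self.w for _ in range(self.d)]
        # one (a, b) coefficient pair per row, a != 0
        self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                   for _ in range(self.d)]

    def _h(self, i, j):
        a, b = self.ab[i]
        return ((a * j + b) % self.P) % self.w

    def inc(self, j):                       # update: O(d)
        for i in range(self.d):
            self.table[i][self._h(i, j)] += 1

    def count(self, j):                     # query: O(d), never underestimates
        return min(self.table[i][self._h(i, j)] for i in range(self.d))
```

With eps = 0.01 and delta = 0.01 this allocates a 5 × 272 table, independent of the stream length and the item universe.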
Approximate Heavy Hitters TaeHoon Joseph, Kim
Count-Min sketch (CMS) recap. Inc(j) takes O(d) time: it updates one counter in each of the d rows. Count(j) takes O(d) time: it returns the minimum of d values.
Heavy hitters problem. Input: an array of length n with m distinct items; n is very large (millions or billions), while the parameter k is modest (10 or 1000). Objective: find all items that occur more than n/k times in the array. There can be at most k such items: suppose not; then more than k items each occurring more than n/k times would give more than (n/k) · k = n items in total, a contradiction. It is also possible that there are none.
Heavy hitters problem: naïve solution. The trivial solution uses an O(m) array: store every item and its frequency, then find all items that have frequency ≥ n/k.
ε-heavy hitters problem (ε-HH). Relax the heavy hitters problem: with sub-linear space we cannot solve the exact problem. Parameters: k and ε; in this presentation, ε = 1/(2k).
ε-HH returns every item that occurs more than n/k times, and may also return some items that occur more than n/k - ε·n times. The Count-Min sketch guarantees f_j ≤ f̂_j ≤ f_j + ε · Σ_k f_k; when we use a CMS for ε-HH, estimated frequencies carry this error, so some items with true frequency slightly below n/k may be returned as heavy hitters.
Naïve solution using CMS (figure: every item 1, ..., m is hashed by h_1, h_2, ..., h_d into the d × w sketch and queried).
Naïve solution using CMS: query the frequency of all m items and return those with Count(j) ≥ n/k. This takes O(md) time, which is too slow.
Better solution. Use a CMS to store the frequencies, and a baseline threshold b = i/k after the i-th item. Use a min-heap to store potential heavy hitters: at the i-th item, insert new items whose estimated frequency is ≥ b, and delete old items whose stored frequency has fallen below b. The heap supports O(log k) insertion and deletion: Find-Min in O(1) time and Extract-Min in O(log k) time.
ε-HH with a CMS: every item occurring more than n/k times is returned, plus possibly some occurring more than n/k - ε·n times. With ε = 1/(2k), Count(x) ∈ [f_x, f_x + n/(2k)], and the heap size stays at most 2k; other items, whose estimated frequency is lower than the threshold, are deleted from the heap.
Algorithm: approximate heavy hitters. Input: stream x, parameter k. For each item j in x: update the Count-Min sketch; compare the estimated frequency of j with b; if Count(j) ≥ b, insert or update j in the min-heap; remove any value in the min-heap with frequency < b. Finally, return the contents of the min-heap as the heavy hitters.
Examples. A sequence of slides traces the algorithm with k = 5, showing the d × w sketch, the threshold b = i/k, and the min-heap (entries written {count: item}). At i = 1, b = 1/5 and the first item enters the heap as {1:4}; by i = 5 (b = 1) the heap holds {1:2}, {1:3}, {1:4}, {1:6}, {1:9}; at i = 6 (b = 6/5) item 4 repeats, its entry becomes {2:4}, and the count-1 entries are evicted. Much later, at i = 79 (b = 15.8) the heap holds {16:4}, {20:9}, {23:6}, and then {16:2}; as i reaches 81 (b = 16.2), the entries {16:2} and {16:4} fall below the threshold and are removed, leaving {21:9} and {23:6}.
Analysis. Because n is not known in advance, the set of possible heavy hitters is recomputed as every new item arrives. Maintaining the heap requires extra O(log k) = O(log(1/ε)) time per item.
AMS Sketch : Estimate Second Moment Dissanayaka Mudiyanselage Emil Manupa Karunaratne
The second moment. Given a stream with item frequencies f_1, ..., f_n, the second moment is F_2 = Σ_i f_i². The trivial solution maintains a histogram of size n and takes the sum of squares, but it is not feasible to maintain such a large array, so we look for an approximation algorithm with sub-linear space and bounded error. The algorithm gives an estimate within ε relative error with δ failure probability (two parameters).
The method. When j, the next item in the stream, arrives: 2-wise independent hash functions h_k, one per row, choose the bucket in each of the d rows; then 4-wise independent hash functions g_k decide whether to increment or decrement. In summary: CM[k, h_k(j)] += g_k(j) for each row k.
The method, continued. Calculate the row estimate R_k = Σ_i CM[k, i]², then take the median over the rows. Choosing w = 4/ε² and d = 8 log(1/δ) gives an estimate with ε relative error and δ failure probability.
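A runnable sketch of the method. The per-(row, item) bucket choice and sign are drawn from a string-seeded PRNG rather than the 2-wise/4-wise polynomial families described on the later slides, which is a simplification for illustration; the class name is made up:

```python
import math
import random
import statistics

class AMSSketch:
    def __init__(self, eps, delta):
        self.w = math.ceil(4 / eps ** 2)          # w = 4 / eps^2
        self.d = math.ceil(8 * math.log(1 / delta))   # d = 8 log(1/delta)
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hg(self, row, item):
        # deterministic pseudo-random bucket and +-1 sign per (row, item)
        r = random.Random(f"{row}:{item}")
        return r.randrange(self.w), r.choice((-1, 1))

    def update(self, item, c=1):
        for row in range(self.d):
            h, g = self._hg(row, item)
            self.table[row][h] += g * c           # CM[k, h_k(j)] += g_k(j)

    def estimate_f2(self):
        # row estimate R_k = sum of squares, then median over rows
        return statistics.median(sum(v * v for v in row) for row in self.table)
```

Feeding it items with known frequencies, the median of the row estimates lands close to the true F_2.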
Why should this method give F_2? The estimate from row k is R_k = Σ_i CM[k, i]² = Σ_j f_j² + Σ_{i ≠ j, h_k(i) = h_k(j)} g(i) g(j) f_i f_j. The first part is exactly F_2. In the second part, g(i)g(j) is +1 or -1 with equal probability, so its expectation is 0.
What guarantee can we give about the accuracy? The variance of a row estimate R_k is caused by hashing collisions. Given the independence of the hash functions, the variance is bounded by F_2²/w. Using Chebyshev's inequality with w = 4/ε²: Pr[|R_k - F_2| > ε F_2] ≤ Var[R_k] / (ε² F_2²) ≤ 1/(ε² w) = 1/4. Still, this failure probability decreases only linearly in 1/w.
What guarantee can we give about the accuracy? We have d hash-function rows producing estimates R_1, R_2, ..., R_d. The median is wrong only if half of the estimates are wrong. These are d independent estimates, like coin tosses, so the probability that half fall on the same (wrong) side decays exponentially, and the stronger Chernoff bounds apply: the expected number of good estimates is μ = (3/4)·d (d estimates times success probability 3/4), and the median fails only when the number of good estimates drops to d/2, which is d/4 away from the mean. With d = 8 log(1/δ), this happens with probability at most δ.
Space and time complexity. The number of rows is d = 8 log(1/δ); e.g., to achieve tightly bounded accuracy with failure probability δ = e^(-10), only 8 · 10 = 80 rows are required. The number of rows is O(log(1/δ)), for O((1/ε²) log(1/δ)) counters in total. The time complexity is explained later, along with the applications.
AMS Sketch and Applications Sapumal Ahangama
Hash functions. h_k maps the input domain uniformly onto the buckets {1, 2, ..., w}. h_k should come from a pairwise-independent family, so that product terms cancel out. Example: the family h(x) = ((a·x + b) mod p) mod w, for a and b chosen from the prime field of size p, with a ≠ 0.
Hash functions. g_k maps elements from the domain uniformly onto {-1, +1}, and should be four-wise independent. Example: the family g(x) = 2·(((a·x³ + b·x² + c·x + d) mod p) mod 2) - 1, for a, b, c, d chosen uniformly from the prime field of size p.
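The two polynomial families can be written down directly. The prime p = 2^31 - 1 and the fixed seed are example choices:

```python
import random

P = 2_147_483_647          # prime field size (2^31 - 1), an example choice

def make_h(w, rng):
    # pairwise-independent bucket hash: ((a x + b) mod p) mod w, a != 0
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % w

def make_g(rng):
    # four-wise-independent sign: random cubic mod p, reduced to {-1, +1}
    a, b, c, d = (rng.randrange(P) for _ in range(4))
    return lambda x: 2 * (((a * x**3 + b * x**2 + c * x + d) % P) % 2) - 1

rng = random.Random(42)
h = make_h(8, rng)         # buckets 0..7
g = make_g(rng)            # signs in {-1, +1}
```

Each evaluation is a constant number of arithmetic operations, which is why updates cost O(d).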
Hash functions These hash functions can be computed very quickly, faster even than more familiar (cryptographic) hash functions For scenarios which require very high throughput, efficient implementations are available for hash functions, Based on optimizations for particular values of p, and partial precomputations Ref: M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In ACM-SIAM Symposium on Discrete Algorithms, 2004
Time complexity - Update The sketch is initialized by picking the hash functions to use, and initializing the array of counters to all zeros For each update operation, the item is mapped to an entry in each row based on the hash functions ℎ 𝑗 , multiplied by the corresponding value of 𝑔 𝑗 Processing each update therefore takes time 𝑂(𝑑) since each hash function evaluation takes constant time.
Time complexity: query. The estimate is found by taking the sum of the squares of each row of the sketch in turn, then the median of these sums: for each row k, compute Σ_i CM[k, i]², and take the median of the d such estimates. The query time is therefore linear in the size of the sketch, O(wd).
Applications: inner product. The AMS sketch can be used to estimate the inner product between a pair of vectors. Given two frequency distributions f and f′, f · f′ = Σ_{i=1}^{M} f(i) · f′(i). The AMS-sketch-based estimator is an unbiased estimator of the inner product of the vectors.
Inner product. Take two sketches CM and CM′ formed with the same parameters and the same hash functions (same w, d, h_k, g_k). The row estimate is the inner product of the rows: Σ_{i=1}^{w} CM[k, i] · CM′[k, i].
Inner product. Expanding Σ_{i=1}^{w} CM[k, i] · CM′[k, i] shows that the estimate gives f · f′ plus additional cross-terms due to collisions of items under h_k. The expectation of these cross-terms, over the choice of the hash functions, is zero, since g_k is equally likely to add as to subtract any given term.
Inner product: join-size estimation. The inner product has a natural interpretation as the size of the equi-join between two relations. In SQL: SELECT COUNT(*) FROM D, D' WHERE D.id = D'.id.
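A small join-size estimation sketch using two AMS sketches that share hash functions, as the slides require. The hashes are again simulated with a string-seeded PRNG, and all names are illustrative:

```python
import random
import statistics

def make_row_fns(d, w, seed=7):
    # one shared (bucket, sign) function per row, reused by both sketches
    fns = []
    for row in range(d):
        def hg(x, row=row):
            r = random.Random(f"{seed}:{row}:{x}")
            return r.randrange(w), r.choice((-1, 1))
        fns.append(hg)
    return fns

def sketch(freqs, fns, w):
    table = [[0] * w for _ in fns]
    for row, hg in enumerate(fns):
        for item, f in freqs.items():
            h, g = hg(item)
            table[row][h] += g * f
    return table

def inner_product(cm, cm2):
    # row estimate: inner product of matching rows; combine by median
    return statistics.median(
        sum(a * b for a, b in zip(r1, r2)) for r1, r2 in zip(cm, cm2))
```

Two distributions overlapping in 10 items of weight 1 have true join size 10, and the estimator lands near it.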
Example (d = 3, w = 8). UPDATE(23, 1): h_1(23) = 3 with g_1 = -1, h_2(23) = 1 with g_2 = -1, h_3(23) = 7 with g_3 = +1, so the rows receive -1 at position 3, -1 at position 1, and +1 at position 7. Then UPDATE(99, 2): h_1(99) = 5 with g_1 = +1, h_2(99) = 1 with g_2 = -1, h_3(99) = 3 with g_3 = +1, adding +2, -2, and +2 at those positions. Afterward row 1 holds -1 (position 3) and +2 (position 5), row 2 holds -3 at position 1 (both items collide there), and row 3 holds +2 (position 3) and +1 (position 7).