CS6234 Advanced Algorithms, February 10, 2015

Similar presentations
The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden.

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
Data Stream Algorithms Frequency Moments
Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan And improvements with Kai-Min Chung.
Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Fast Algorithms For Hierarchical Range Histogram Constructions
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Mining Data Streams.
Techniques for Dealing with Hard Problems Backtrack: –Systematically enumerates all potential solutions by continually trying to extend a partial solution.
Quick Sort, Shell Sort, Counting Sort, Radix Sort AND Bucket Sort
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006
Heavy hitter computation over data stream
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
A survey on stream data mining
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
1 Mining Data Streams The Stream Model Sliding Windows Counting 1’s.
Hashing General idea: Get a large array
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
CS212: DATA STRUCTURES Lecture 10:Hashing 1. Outline 2  Map Abstract Data type  Map Abstract Data type methods  What is hash  Hash tables  Bucket.
By Graham Cormode and Marios Hadjieleftheriou Presented by Ankur Agrawal ( )
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Chapter 11 Sorting Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and Mount.
Mining Data Streams (Part 1)
Algorithms for Big Data: Streaming and Sublinear Time Algorithms
Updating SF-Tree Speaker: Ho Wai Shing.
New Characterizations in Turnstile Streams with Applications
The Stream Model Sliding Windows Counting 1’s
Analysis of Algorithms
The Variable-Increment Counting Bloom Filter
CS 332: Algorithms Hash Tables David Luebke /19/2018.
Finding Frequent Items in Data Streams
Streaming & sampling.
Hashing Alexandra Stefan.
Query-Friendly Compression of Graph Streams
Hash Table.
Lecture 11: Nearest Neighbor Search
Sublinear Algorithmic Tools 2
Counting How Many Elements Computing “Moments”
Mining Data Streams (Part 2)
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results
Lecture 4: CountSketch High Frequencies
Computer Science 2 Hashing
Lecture 7: Dynamic sampling Dimension Reduction
Turnstile Streaming Algorithms Might as Well Be Linear Sketches
Hash Tables.
Hidden Markov Models Part 2: Algorithms
Algorithm An algorithm is a finite set of steps required to solve a problem. An algorithm must have following properties: Input: An algorithm must have.
Objective of This Course
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Range-Efficient Computation of F0 over Massive Data Streams
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Pseudorandom number, Universal Hashing, Chaining and Linear-Probing
Minwise Hashing and Efficient Search
Approximation and Load Shedding Sampling Methods
CS 3343: Analysis of Algorithms
The Selection Problem.
(Learned) Frequency Estimation Algorithms
Counting Bits.
Algorithms Tutorial 27th Sept, 2019.
Presentation transcript:

CS6234 Advanced Algorithms, February 10, 2015: Streaming Algorithms

The stream model:
- Data enters sequentially, at a rapid rate, from one or more inputs.
- We cannot store the entire stream.
- Processing must happen in real time.
- Memory is limited (usually sublinear in the size of the stream).
- Goal: compute a function of the stream, e.g., the median, the number of distinct elements, the longest increasing subsequence.
- An approximate answer is usually preferable.

Overview:
- Counting bits with the DGIM algorithm
- Bloom Filter
- Count-Min Sketch
- Approximate Heavy Hitters
- AMS Sketch
- AMS Sketch Applications

Counting bits with DGIM algorithm Presented by Dmitrii Kharkovskii

Sliding windows. A useful model: queries are about a window of length N, the N most recent elements received (or the last N time units). The interesting case: N is still so large that the window cannot be stored, or there are so many streams that the windows for all of them cannot be stored.

Problem description.
Problem: given a stream of 0's and 1's, answer queries of the form "How many 1's are in the last k bits?", where k ≤ N.
Obvious solution: store the most recent N bits (i.e., window size = N); when a new bit arrives, discard the (N+1)-st oldest bit.
Real problem: this is slow, since we need to scan k bits to count, and what if we cannot afford to store N bits? Estimate with an approximate answer.

Datar-Gionis-Indyk-Motwani (DGIM) algorithm, overview:
- Gives an approximate answer.
- Uses $O(\log^2 N)$ bits of memory.
- Performance guarantee: error no more than 50%.
- The error can be decreased to any fraction $\varepsilon > 0$, still with $O(\log^2 N)$ memory.
- Generalizes to streams of positive integers.

Main idea of the algorithm Represent the window as a set of exponentially growing non-overlapping buckets

Timestamps. Each bit in the stream has a timestamp: its position in the stream from the beginning. Record timestamps modulo N (the window size), which takes $O(\log N)$ bits. Store the most recent timestamp to identify the position of any other bit in the window.

Buckets. Each bucket has two components:
- The timestamp of its most recent end; needs $O(\log N)$ bits.
- The size of the bucket, i.e., the number of ones in it. The size is always $2^j$, so to store it we only need $j$, which takes $O(\log\log N)$ bits.
Each bucket therefore needs $O(\log N)$ bits.

Representing the stream by buckets:
- The right end of a bucket is always a position with a 1.
- Every position with a 1 is in some bucket.
- Buckets do not overlap.
- There are one or two buckets of any given size, up to some maximum size.
- All sizes must be a power of 2.
- Bucket sizes cannot decrease as we move to the left (back in time).

Updating buckets when a new bit arrives:
- Drop the oldest bucket if it no longer overlaps the window.
- If the current bit is zero, no changes are needed.
- If the current bit is one: create a new bucket for it, with size = 1 and timestamp = current time modulo N. If that leaves 3 buckets of size 1, merge the two oldest into one of size 2; if there are then 3 buckets of size 2, merge the two oldest into one of size 4; and so on.

Example of updating process

Query answering: how many ones are in the most recent k bits?
- Find all buckets overlapping the last k bits.
- Sum the sizes of all but the oldest one.
- Add half of the size of the oldest one.
Example, with overlapping buckets of sizes 1, 1, 2, 4, 4, 8, 8 (oldest last): Ans = 1 + 1 + 2 + 4 + 4 + 8 + 8/2 = 24.
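The update and query rules above fit in a few lines of code. Below is a minimal sketch, assuming single-bit updates; for clarity it keeps absolute timestamps, whereas the real algorithm stores them modulo N to save space. The class layout is illustrative, not from the slides.

```python
class DGIM:
    """Minimal DGIM sketch: buckets are (timestamp, size) pairs, newest first."""

    def __init__(self, N):
        self.N = N          # window size
        self.t = 0          # position of the most recent bit
        self.buckets = []   # newest bucket first; sizes non-decreasing toward the end

    def update(self, bit):
        self.t += 1
        # drop the oldest bucket once its most recent 1 leaves the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        i = 0
        # while three buckets share a size, merge the two oldest of that size
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            newer_ts, size = self.buckets[i + 1]
            self.buckets[i + 1] = (newer_ts, 2 * size)  # merged bucket keeps the newer timestamp
            del self.buckets[i + 2]
            i += 1

    def count(self, k):
        """Estimate the number of 1s among the last k <= N bits."""
        total = oldest = 0
        for ts, size in self.buckets:
            if ts <= self.t - k:   # bucket lies entirely outside the last k bits
                break
            total += size
            oldest = size          # remembers the oldest overlapping bucket
        return total - oldest + oldest // 2
```

The `count` method implements exactly the query rule above: sum all overlapping bucket sizes except the oldest, then add half the oldest.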

Memory requirements: there are $O(\log N)$ distinct bucket sizes, at most two buckets of each size, and each bucket takes $O(\log N)$ bits, so the whole window is summarized in $O(\log^2 N)$ bits.

Performance guarantee. Suppose the oldest overlapping bucket has size $2^r$. By taking half of it, the maximum error is $2^{r-1}$. There is at least one bucket of every size less than $2^r$, and the first bit of the oldest bucket is always a 1, so the true count is at least $1 + 2 + 4 + \dots + 2^{r-1} = 2^r - 1$. The error is therefore at most 50%.

References: J. Leskovec, A. Rajaraman, J. Ullman. "Mining of Massive Datasets". Cambridge University Press.

Bloom Filter. Presented by Naheed Anjum Arafat

Motivation: the "set membership" problem.
- Input: an element x and a finite set S.
- Output: True if x is in S, False otherwise.
- Classical solution: binary search on a sorted array of size |S|; runtime complexity O(log |S|).
- Streaming setting: limited space and limited processing time per item, so we give an approximate answer based on a summary (sketch) of the data stream kept in memory.

A Bloom filter consists of:
- a vector of n Boolean values, initially all set to False (space O(n)); in the figures below, n = 10;
- k independent, uniform hash functions $h_0, h_1, \dots, h_{k-1}$, each outputting a value in the range $\{0, 1, \dots, n-1\}$.

Insertion: for each element $s \in S$, the Boolean values at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to True. Complexity of insertion: O(k). (Figure: with k = 3, inserting $s_1$ sets positions $h_0(s_1)=1$, $h_1(s_1)=4$, $h_2(s_1)=6$ to True.)

Note: a particular Boolean value may be set to True several times. (Figure: inserting $s_2$ sets positions $h_0(s_2)=4$, $h_1(s_2)=7$, $h_2(s_2)=9$; position 4 was already True.)

Algorithm to approximate a set membership query. Input: x (which may or may not be an element of S). Output: Boolean.
For all $i \in \{0, 1, \dots, k-1\}$: if the bit at position $h_i(x)$ is False, return False. Otherwise return True.
Runtime complexity: O(k).
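As a concrete illustration of the insert and query procedures, here is a minimal Python sketch. Deriving the k hash functions by salting Python's built-in hash() is an assumption made for brevity; a real implementation would use a proper independent uniform hash family.

```python
import random

class BloomFilter:
    """Minimal Bloom filter sketch."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [False] * n
        # salted built-in hashes stand in for k independent uniform hash functions
        self.salts = [random.getrandbits(32) for _ in range(k)]

    def _positions(self, x):
        return [hash((salt, x)) % self.n for salt in self.salts]

    def insert(self, x):                 # O(k)
        for pos in self._positions(x):
            self.bits[pos] = True

    def might_contain(self, x):          # O(k); false positives possible, false negatives never
        return all(self.bits[pos] for pos in self._positions(x))

bf = BloomFilter(n=10, k=3)
bf.insert("s1"); bf.insert("s2")
print(bf.might_contain("s1"))   # True
print(bf.might_contain("s3"))   # usually False, occasionally a false positive
```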

False positive! (Figure: with $s_1$ and $s_2$ inserted as above, a query for a new element x with $h_0(x)=9$, $h_1(x)=6$, $h_2(x)=1$ finds all three bits already set, so the filter wrongly answers True.)

Error types:
- False negative: answering "is not there" for an element that is there. This never happens for a Bloom filter.
- False positive: answering "is there" for an element that is not there. This might happen. How likely is it?

Probability of false positives. Let n = size of the table, m = number of items, k = number of hash functions. Consider a particular bit j, 0 ≤ j ≤ n-1.
The probability that $h_i(x)$ does not set bit j when hashing one item: $P[h_i(x) \ne j] = 1 - \frac{1}{n}$.
The probability that $h_i$ does not set bit j after hashing m items: $P[\forall x \in \{S_1, \dots, S_m\} : h_i(x) \ne j] = \left(1 - \frac{1}{n}\right)^m$.
Question: where does the randomness come from? Answer: from the set $\{S_1, S_2, \dots, S_m\}$. Think of it as making m attempts to set the bit at position j to True using one fixed (i-th) hash function $h_i$; there is no randomness in the function itself. The last expression is the probability that every attempt misses bit j.

The probability that none of the k hash functions sets bit j after hashing m items:
$P[\forall x \in \{S_1, \dots, S_m\}, \forall i \in \{1, \dots, k\} : h_i(x) \ne j] = \left(1 - \frac{1}{n}\right)^{km}$.
Since $\left(1 - \frac{1}{n}\right)^n \approx e^{-1}$, we get $\left(1 - \frac{1}{n}\right)^{km} = \left(\left(1 - \frac{1}{n}\right)^{n}\right)^{km/n} \approx e^{-km/n}$.
The exponent km arises because hashing m items with k hash functions makes km attempts to set bit j; again, no randomness lies in $h_i$ itself, we are simply taking more trials.

Approximate probability of a false positive.
The probability that bit j is not set: $P[\text{bit } j = F] = e^{-km/n}$.
The probability that all k probed bits of a new element are already set: $\left(1 - e^{-km/n}\right)^k$.
For fixed m and n, which value of k minimizes this bound? $k_{opt} = \ln 2 \cdot \frac{n}{m}$, where $n/m$ is the number of bits per item.
The resulting probability of a false positive: $\left(\frac{1}{2}\right)^{k_{opt}} = (0.6185)^{n/m}$.
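As a quick sanity check of these formulas (the numbers here are chosen purely for illustration): with m = 1000 items and n = 10000 bits, i.e. 10 bits per item, $k_{opt} = \ln 2 \cdot 10 \approx 6.9$, so one would use k = 7 hash functions, and the false positive probability is about $(0.6185)^{10} \approx 0.008$, i.e. under 1%.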

Bloom filters: cons.
- Small false positive probability.
- Cannot handle deletions.
- The size of the bit vector has to be set a priori in order to maintain a predetermined false-positive rate. This is resolved in scalable Bloom filters: Almeida, Paulo; Baquero, Carlos; Preguica, Nuno; Hutchison, David (2007), "Scalable Bloom Filters", Information Processing Letters 101 (6): 255-261.

References:
- https://en.wikipedia.org/wiki/Bloom_filter
- Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research
- Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge
- http://people.cs.umass.edu/~mcgregor/711S12/sketches1.pdf
- http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf

Count-Min Sketch Erick Purwanto A0050717L

Motivation. The Count-Min sketch is implemented in real systems:
- AT&T: in network switches, to analyze network traffic using limited memory.
- Google: implemented on top of the MapReduce parallel processing infrastructure.
It is simple, and it is used to solve other problems:
- Heavy Hitters, presented by Joseph.
- Second moment $F_2$ and the AMS sketch, presented by Manupa.
- Inner product and self-join size, presented by Sapumal.

Frequency query. Given a stream: a data vector $x$ of length $n$ with items $x_i \in [1, m]$ arriving as update (increment) operations, we want to know, at any point in time, the frequency $f_j$ of item $j$ (assume $f_j \ge 0$). This is trivial if we keep a count array over $[1, m]$; we want sublinear space and are content with answers that are probabilistically approximately correct.

Count-Min sketch. Assumption: a family $H$ of independent hash functions; sample $d$ hash functions $h_i \leftarrow H$, each mapping $h_i : [1, m] \to [1, w]$. The sketch uses these $d$ hash functions together with an integer array CM of size $w \times d$.

Algorithm to update: Inc($j$): for each row $i$, $\mathrm{CM}[i, h_i(j)]$ += 1.

Algorithm to estimate a frequency query: Count($j$): $\hat{f}_j = \min_i \mathrm{CM}[i, h_i(j)]$.
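A minimal sketch of the structure with its Inc and Count operations. The hash functions of the form $((a \cdot j + b) \bmod p) \bmod w$ follow the pairwise independent family described later in the AMS section; the prime and parameter values are assumptions of this sketch.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch for non-negative frequencies."""

    P = (1 << 61) - 1   # a large prime (assumed; any prime > m works)

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.hashes = [(random.randrange(1, self.P), random.randrange(self.P))
                       for _ in range(d)]

    def _bucket(self, i, j):
        a, b = self.hashes[i]
        return ((a * j + b) % self.P) % self.w

    def inc(self, j, c=1):      # update: O(d)
        for i in range(self.d):
            self.table[i][self._bucket(i, j)] += c

    def count(self, j):         # query: O(d); never underestimates
        return min(self.table[i][self._bucket(i, j)] for i in range(self.d))

cms = CountMinSketch(w=272, d=5)
for item in [3, 5, 5, 8, 5, 2, 5]:
    cms.inc(item)
print(cms.count(5))   # >= 4, and equal to 4 with high probability
```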

Collision. The entry $\mathrm{CM}[i, h_i(j)]$ is an estimate of the frequency of item $j$ at row $i$; for example, $h_1(5) = h_1(2) = 7$, so items 5 and 2 land in the same cell of row 1. Let $f_j$ be the frequency of $j$, and let the random variable $X_{i,j}$ be the total frequency of all items $k \ne j$ with $h_i(k) = h_i(j)$.

Count-Min sketch analysis. The estimate of the frequency of $j$ at row $i$ is
$\hat{f}_{i,j} = \mathrm{CM}[i, h_i(j)] = f_j + \sum_{k \ne j,\ h_i(k) = h_i(j)} f_k = f_j + X_{i,j}$.

Let $\varepsilon$ be the approximation error, and set $w = \frac{e}{\varepsilon}$. The expected contribution of the other items is
$E[X_{i,j}] = \sum_{k \ne j} f_k \cdot \Pr[h_i(k) = h_i(j)] \le \Pr[h_i(k) = h_i(j)] \cdot \sum_k f_k = \frac{1}{w} F_1 = \frac{\varepsilon}{e} F_1$.

Markov's inequality: $\Pr[X \ge k \cdot E[X]] \le \frac{1}{k}$. The probability that one row's estimate is more than $\varepsilon F_1$ above the true value:
$\Pr[\hat{f}_{i,j} > f_j + \varepsilon F_1] = \Pr[X_{i,j} > \varepsilon F_1] = \Pr[X_{i,j} > e \cdot E[X_{i,j}]] \le \frac{1}{e}$.

Let $\delta$ be the failure probability, and set $d = \ln\frac{1}{\delta}$. The probability that the final estimate is far from the true value:
$\Pr[\hat{f}_j > f_j + \varepsilon F_1] = \Pr[\forall i : \hat{f}_{i,j} > f_j + \varepsilon F_1] = \prod_{i=1}^{d} \Pr[\hat{f}_{i,j} > f_j + \varepsilon F_1] \le \left(\frac{1}{e}\right)^{\ln\frac{1}{\delta}} = \delta$.

Count-Min sketch, result: a dynamic data structure CM answering item frequency queries. Set $w = \frac{e}{\varepsilon}$ and $d = \ln\frac{1}{\delta}$; then with probability at least $1 - \delta$,
$\hat{f}_j \le f_j + \varepsilon \cdot \sum_k f_k$.
The space is sublinear, depending on neither $n$ nor $m$; the running time is $O(d)$ per update and $O(d)$ per frequency query.
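For concreteness (the numbers here are chosen purely for illustration): to guarantee error at most $0.01 \cdot F_1$ with failure probability 1%, set $w = \lceil e / 0.01 \rceil = 272$ and $d = \lceil \ln 100 \rceil = 5$, i.e. 1360 counters in total, regardless of the stream length or the domain size.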

Approximate Heavy Hitters TaeHoon Joseph, Kim

Count-Min sketch (CMS) recap: Inc($j$) takes $O(d)$ time, updating one counter in each of the $d$ rows; Count($j$) takes $O(d)$ time, returning the minimum of $d$ values.

Heavy hitters problem.
Input: an array of length $n$ containing $m$ distinct items.
Objective: find all items that occur more than $\frac{n}{k}$ times in the array. There can be at most $k$ such items: suppose not; then more than $k$ items each occur more than $\frac{n}{k}$ times, giving more than $\frac{n}{k} \cdot k = n$ items in total, a contradiction. It is also possible that there are no such items.
Parameter $k$: $n$ is very large (millions or billions), while $k$ is modest (10 or 1000).

Heavy hitters problem, naïve solution: use an $O(m)$ array to store every distinct item together with its frequency, then report all items with frequency $\ge \frac{n}{k}$.

$\epsilon$-heavy hitters problem ($\epsilon$-HH). A relaxation of the heavy hitters problem: in sub-linear space we cannot solve the exact problem. Parameters: $k$ and $\epsilon$; in this presentation, $\epsilon = \frac{1}{2k}$.

$\epsilon$-HH guarantees:
- every item that occurs more than $\frac{n}{k}$ times is returned;
- some items that occur more than $\frac{n}{k} - \epsilon \cdot n$ times may also be returned.
The Count-Min sketch guarantees $\hat{f}_j \le f_j + \varepsilon \cdot \sum_k f_k$. Because we use CMS for the $\epsilon$-HH problem, every frequency estimate carries this one-sided error, so some items slightly below the threshold can be reported as heavy hitters.

Naïve solution using CMS. (Figure: probing the sketch, with rows $1 \dots d$ and width $w$, for every item $j \in \{1, \dots, m\}$.)

Query the frequency of all $m$ items and return those with Count$(j) \ge \frac{n}{k}$. This takes $O(md)$ time: slow.

Better solution:
- Use CMS to store the frequencies.
- Use a baseline $b$ as a threshold: after the $i$-th item, $b = \frac{i}{k}$.
- Use a min-heap to store potential heavy hitters: after the $i$-th item, insert new items whose estimated frequency is $\ge b$ and delete old items whose estimated frequency is $< b$. The heap supports $O(\log k)$ insertion and deletion, Find-Min in $O(1)$ time, and Extract-Min in $O(\log k)$ time.

$\epsilon$-HH with CMS: with $\epsilon = \frac{1}{2k}$, Count$(x) \in [f_x, f_x + \frac{n}{2k}]$, and the heap size stays at $2k$; other items with lower frequency are deleted from the heap.

Algorithm: Approximate Heavy Hitters. Input: stream $x$, parameter $k$. For each item $j \in x$:
- update the Count-Min sketch;
- compare the estimated frequency of $j$ with $b$;
- if Count$(j) \ge b$, insert or update $j$ in the min-heap;
- remove any value in the min-heap whose frequency estimate is $< b$.
Return the contents of the min-heap as the heavy hitters.
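A minimal sketch of this loop, reusing the CountMinSketch class from the earlier insert. The lazy-deletion bookkeeping around the heap is an implementation choice of this sketch, not something prescribed by the slides.

```python
import heapq

def approximate_heavy_hitters(stream, k, cms):
    """Returns {item: estimated count} for candidate heavy hitters."""
    heap, current = [], {}          # heap entries may be stale; `current` is authoritative
    for i, j in enumerate(stream, start=1):
        cms.inc(j)
        b = i / k                   # threshold after the i-th item
        est = cms.count(j)
        if est >= b:
            current[j] = est
            heapq.heappush(heap, (est, j))
        # evict candidates whose estimate has fallen below the threshold
        while heap and heap[0][0] < b:
            est_old, item = heapq.heappop(heap)
            if current.get(item) == est_old:    # skip stale heap entries
                del current[item]
    return current

# e.g.: approximate_heavy_hitters(stream, k=5, cms=CountMinSketch(w=32, d=4))
```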

Examples (walk-through figures, $k = 5$; heap entries are written {count: item}):
- $i = 1$: item 4 arrives; $b = \frac{1}{5}$; {1:4} enters the min-heap.
- $i = 5$: five distinct items have arrived; $b = \frac{5}{5} = 1$; the heap holds {1:3}, {1:2}, {1:6}, {1:9}, {1:4}.
- $i = 6$: item 4 arrives again; its count rises to 2; $b = \frac{6}{5}$; every count-1 entry falls below the threshold and is evicted, leaving {2:4}.
- $i = 79$: $b = \frac{79}{5} = 15.8$; the heap holds {16:4}, {20:9}, {23:6}; item 2 arrives and its estimated count rises to 16, so {16:2} joins the heap.
- $i = 80$: item 1 arrives; $b = \frac{80}{5} = 16$; the heap is unchanged.
- $i = 81$: item 9 arrives; $b = \frac{81}{5} = 16.2$; item 9's count rises to 21, and the entries now below the threshold, {16:2} and {16:4}, are evicted, leaving {21:9} and {23:6}.

Analysis. Because $n$ is not known in advance, possible heavy hitters are computed and stored as each new item arrives. Maintaining the heap requires extra $O(\log k) = O(\log\frac{1}{\varepsilon})$ time per item.

AMS Sketch: Estimate Second Moment. Dissanayaka Mudiyanselage Emil Manupa Karunaratne

The second moment. For a stream with item frequencies $f_j$, the second moment is $F_2 = \sum_j f_j^2$. The trivial solution would be to maintain a histogram of size $m$ and return the sum of squares; it is not feasible to maintain that large an array, so we seek an approximation algorithm with sublinear space and bounded error. The algorithm will give an estimate within $\epsilon$ relative error with $\delta$ failure probability (two parameters).

The method. Let $j$ be the next item in the stream. Use $d$ 2-wise independent hash functions $h_i$ (one per row) to find the bucket in each row, and $d$ 4-wise independent hash functions $g_i : [1, m] \to \{-1, +1\}$ to decide whether to increment or decrement. In summary, for each row $i$: $\mathrm{CM}[i, h_i(j)]$ += $g_i(j)$.

Calculate each row estimate $R_i = \sum_{b=1}^{w} \mathrm{CM}[i, b]^2$ and return the median of $R_1, \dots, R_d$. Choose $w = \frac{4}{\epsilon^2}$ and $d = 8\log\frac{1}{\delta}$; doing so gives an estimate with $\epsilon$ relative error and $\delta$ failure probability.
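A minimal sketch of the method, using the polynomial hash families described later on the hash-function slides; the prime and class layout are assumptions of this sketch.

```python
import random
import statistics

class AMSSketch:
    """Minimal AMS sketch for estimating the second moment F2."""

    P = (1 << 61) - 1   # a large prime (assumed; any prime > m works)

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # per row: (a, b) for the bucket hash, (a, b, c, e) for the sign hash
        self.h = [(random.randrange(1, self.P), random.randrange(self.P))
                  for _ in range(d)]
        self.g = [tuple(random.randrange(self.P) for _ in range(4))
                  for _ in range(d)]

    def _bucket(self, i, j):        # pairwise independent: ((a*j + b) mod p) mod w
        a, b = self.h[i]
        return ((a * j + b) % self.P) % self.w

    def _sign(self, i, j):          # 4-wise independent: cubic polynomial mod p, then mod 2
        a, b, c, e = self.g[i]
        v = (((a * j + b) * j + c) * j + e) % self.P
        return 2 * (v % 2) - 1      # maps {0, 1} to {-1, +1}

    def update(self, j):
        for i in range(self.d):
            self.table[i][self._bucket(i, j)] += self._sign(i, j)

    def estimate_f2(self):
        # row estimate = sum of squared counters; final answer = median over rows
        return statistics.median(sum(v * v for v in row) for row in self.table)
```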

Why should this method give $F_2$? For the $k$-th row, each bucket $b$ holds $\sum_{i : h_k(i) = b} g_k(i) f_i$, so the row estimate is
$R_k = \sum_b \mathrm{CM}[k, b]^2 = \sum_i g_k(i)^2 f_i^2 + \sum_{i \ne j,\ h_k(i) = h_k(j)} g_k(i) g_k(j) f_i f_j$.
First part: $g_k(i)^2 = 1$, so it equals exactly $F_2$. Second part: $g_k(i) g_k(j)$ is $+1$ or $-1$ with equal probability, so its expectation is 0.

What guarantee can we give about the accuracy? The variance of a row estimate $R_k$ is caused by hashing collisions; given the independent nature of the hash functions, the variance is bounded by $\frac{F_2^2}{w}$. Using Chebyshev's inequality,
$\Pr[|R_k - F_2| \ge \epsilon F_2] \le \frac{\mathrm{Var}(R_k)}{\epsilon^2 F_2^2} \le \frac{1}{w \epsilon^2}$.
Assigning $w = \frac{4}{\epsilon^2}$ makes this at most $\frac{1}{4}$. Still, the failure probability is only linear in $\frac{1}{w}$.

We have $d$ hash functions producing independent row estimates $R_1, R_2, \dots, R_d$. The median is wrong only if at least half of the estimates are wrong. Like coin tosses, these $d$ independent estimates have exponentially decaying probability of a large joint deviation, so the stronger Chernoff bounds apply: the mean number of good estimates is $\mu = \frac{3}{4} d$ (each estimate succeeds with probability $\frac{3}{4}$), and the median fails only if the number of good estimates drops to $\frac{d}{2}$, which is $\frac{d}{4}$ away from the mean. The probability of this is at most $e^{-\Omega(d)}$, so $d = 8\log\frac{1}{\delta}$ gives failure probability at most $\delta$.

Space and time complexity. For example, to achieve failure probability $e^{-10}$, only $8 \times 10 = 80$ rows are required; the number of rows is $O(\log\frac{1}{\delta})$. The time complexity will be explained later, along with the applications.

AMS Sketch and Applications Sapumal Ahangama

Hash functions $h_k$: each maps the input domain uniformly onto the $w$ buckets $\{1, 2, \dots, w\}$. The $h_k$ should be pairwise independent hash functions, to cancel out product terms. Example: the family $h(x) = \left((a x + b) \bmod p\right) \bmod w$ for $a, b$ chosen from the prime field of $p$, with $a \ne 0$.

Hash functions $g_k$: each maps elements from the domain uniformly onto $\{-1, +1\}$. The $g_k$ should be four-wise independent. Example: the family $g(x) = 2\left(\left((a x^3 + b x^2 + c x + d) \bmod p\right) \bmod 2\right) - 1$ for $a, b, c, d$ chosen uniformly from the prime field of $p$.
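In code, the two families look as follows (a small sketch; the prime $p$ and width $w$ are illustrative, and the AMSSketch class above uses the same construction internally):

```python
import random

p = (1 << 61) - 1                        # an assumed large prime
w = 8
a, b = random.randrange(1, p), random.randrange(p)
c3, c2, c1, c0 = (random.randrange(p) for _ in range(4))

def h(x):
    # pairwise independent bucket hash: ((a*x + b) mod p) mod w
    return ((a * x + b) % p) % w

def g(x):
    # four-wise independent sign hash: cubic polynomial mod p, then mod 2 -> {-1, +1}
    v = (c3 * x**3 + c2 * x**2 + c1 * x + c0) % p
    return 2 * (v % 2) - 1
```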

These hash functions can be computed very quickly, faster even than more familiar (cryptographic) hash functions. For scenarios that require very high throughput, efficient implementations are available, based on optimizations for particular values of $p$ and on partial precomputation. Ref: M. Thorup and Y. Zhang, Tabulation based 4-universal hashing with applications to second moment estimation, ACM-SIAM Symposium on Discrete Algorithms, 2004.

Time complexity, update. The sketch is initialized by picking the hash functions to use and initializing the array of counters to all zeros. For each update operation, the item is mapped to an entry in each row based on the hash functions $h_j$, and the counter is incremented by the corresponding value of $g_j$. Processing each update therefore takes $O(d)$ time, since each hash function evaluation takes constant time.

Time complexity, query. The estimate is found by taking the sum of squares of each row of the sketch in turn and returning the median of these sums: for each row $k$, compute $\sum_i \mathrm{CM}[k, i]^2$, then take the median of the $d$ such estimates. The query time is therefore linear in the size of the sketch, $O(wd)$.

Applications: inner product. The AMS sketch can be used to estimate the inner product between a pair of vectors. Given two frequency distributions $f$ and $f'$,
$f \cdot f' = \sum_{i=1}^{M} f(i) \cdot f'(i)$.
The AMS-sketch-based estimator is an unbiased estimator for the inner product of the vectors.

Take two sketches $\mathrm{CM}$ and $\mathrm{CM}'$, formed with the same parameters and using the same hash functions (same $w$, $d$, $h_k$, $g_k$). The row estimate is the inner product of the rows:
$\sum_{i=1}^{w} \mathrm{CM}[k, i] \cdot \mathrm{CM}'[k, i]$.

Expanding $\sum_{i=1}^{w} \mathrm{CM}[k, i] \cdot \mathrm{CM}'[k, i]$ shows that the estimate gives $f \cdot f'$ plus additional cross-terms due to collisions of items under $h_k$. The expectation of these cross-terms is zero, over the choice of the hash functions, since $g_k$ is equally likely to add as to subtract any given term.

Inner product: join size estimation. The inner product has a natural interpretation as the size of the equi-join between two relations. In SQL: SELECT COUNT(*) FROM D, D' WHERE D.id = D'.id.
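A sketch of the estimator, assuming two AMSSketch instances (from the earlier insert) built with identical dimensions and made to share their hash parameters before any updates:

```python
import statistics

def inner_product_estimate(sk1, sk2):
    # median over rows of the row-wise inner products of the two tables
    return statistics.median(
        sum(a * b for a, b in zip(r1, r2))
        for r1, r2 in zip(sk1.table, sk2.table))

# usage: share the hash functions so both sketches hash items identically
sk1, sk2 = AMSSketch(w=8, d=3), AMSSketch(w=8, d=3)
sk2.h, sk2.g = sk1.h, sk1.g
for item in [23, 99, 99]:
    sk1.update(item)
for item in [99, 42]:
    sk2.update(item)
print(inner_product_estimate(sk1, sk2))   # estimates f . f' = 2 * 1 = 2
```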

Example (figures), with $d = 3$, $w = 8$:
UPDATE(23, 1): $h_1(23) = 3$, $g_1(23) = -1$; $h_2(23) = 1$, $g_2(23) = -1$; $h_3(23) = 7$, $g_3(23) = +1$. The sketch gets $-1$ in row 1 column 3, $-1$ in row 2 column 1, and $+1$ in row 3 column 7.
UPDATE(99, 2): $h_1(99) = 5$, $g_1(99) = +1$; $h_2(99) = 1$, $g_2(99) = -1$; $h_3(99) = 3$, $g_3(99) = +1$, adding $+2$, $-2$, and $+2$ to the corresponding cells. The two items collide in row 2 (column 1), which now holds $-3$; row 1 holds $-1$ and $+2$; row 3 holds $+2$ and $+1$.