Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 8 May 4, 2005

http://www.ee.technion.ac.il/courses/049011

Random Sampling

Outline
- The random sampling model
- Mean estimation
- Median estimation
- O(n) time median algorithm (Floyd-Rivest)
- MST weight estimation (Chazelle-Rubinfeld-Trevisan)

The Random Sampling Model
- $f: A^n \to B$
  - A, B: arbitrary sets
  - n: positive integer (think of n as large)
- Goal: given $x \in A^n$, compute $f(x)$
  - Sometimes an approximation of $f(x)$ suffices
- Oracle access to input:
  - The algorithm does not have direct access to x
  - In order to probe x, the algorithm sends queries to an “oracle”
  - Query: an index $i \in \{1,\dots,n\}$
  - Answer: $x_i$
- Objective: compute f with the minimum number of queries
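A minimal sketch of the oracle model in Python. The class name `Oracle` and its `query` method are hypothetical names chosen for illustration; they are not part of the lecture. The wrapper hides x behind a query interface and counts probes, which is the cost measure the model cares about.

```python
class Oracle:
    """Hides an input x and counts how many entries the algorithm probes."""

    def __init__(self, x):
        self._x = list(x)
        self.n = len(self._x)
        self.queries = 0  # number of queries answered so far

    def query(self, i):
        """Answer the query i in {1, ..., n} with x_i (1-indexed, as on the slide)."""
        if not 1 <= i <= self.n:
            raise IndexError("query index must be in 1..n")
        self.queries += 1
        return self._x[i - 1]
```

An algorithm in this model would receive only `oracle.query` and `oracle.n`, never the list itself.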

Motivation
- The most basic model for dealing with large data sets
  - Statistics
  - Machine learning
  - Signal processing
  - Approximation algorithms
  - ...
- The algorithm's resources are a function of the number of queries rather than of the input length
- Sometimes a constant number of queries suffices

Adaptive vs. Non-adaptive Sampling
- Non-adaptive sampling
  - The algorithm decides which indices to query a priori.
  - Queries are performed in batch at a pre-processing step.
  - The number of queries performed is the same for all inputs.
- Adaptive sampling
  - Queries are performed sequentially:
    - Query $i_1$, get answer $x_{i_1}$
    - Query $i_2$, get answer $x_{i_2}$
    - ...
    - The algorithm stops as soon as it has enough information to compute f(x)
  - To decide which index to query next, the algorithm can use the answers to previous queries.
  - The number of queries performed may vary across inputs.
- Example: OR of n bits
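The slide only names the OR example; the sketch below is one common reading of it, with `adaptive_or` and `query` as hypothetical names. A deterministic adaptive algorithm can stop at the first 1 it sees, so the number of queries depends on the input, whereas a non-adaptive algorithm that must be correct on every input would have to fix all n indices in advance.

```python
def adaptive_or(query, n):
    """Return (OR of x_1..x_n, number of queries used), stopping at the first 1."""
    used = 0
    for i in range(1, n + 1):
        used += 1
        if query(i):
            return 1, used   # enough information: the OR is 1
    return 0, used           # all n bits were 0
```

On an input with an early 1 this uses few queries; on the all-zero input it uses all n.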

Randomization vs. Determinism
- Deterministic algorithms
  - Non-adaptive: always queries the same set of indices
  - Adaptive: the choice of $i_t$ depends deterministically on the answers to the first t-1 queries
- Randomized algorithms
  - Non-adaptive: indices are chosen randomly according to some distribution (e.g., uniform)
  - Adaptive: $i_t$ is chosen randomly according to a distribution that depends on the answers to previous queries
- Our focus: randomized algorithms

$(\epsilon, \delta)$-approximation
- M: a randomized sampling algorithm
- M(x): output of M on input x
  - M(x) is a random variable
- $\epsilon > 0$: approximation error parameter
- $0 < \delta < 1$: confidence parameter
- Definition: M is said to $\epsilon$-approximate f with confidence $1 - \delta$ if, for all inputs $x \in A^n$, $\Pr[\,|M(x) - f(x)| \le \epsilon\,] \ge 1 - \delta$
- Ex: $\delta = 0.1$
  - With probability ≥ 0.9, $|M(x) - f(x)| \le \epsilon$

Query Complexity
- Definition: qcost(M) = the maximum number of queries M performs on:
  - worst choice of input x
  - worst choice of random bits
- Definition: eqcost(M) = the expected number of queries M performs on:
  - worst choice of input x
  - expectation over random bits
- Definition: the query complexity of f is $qc_{\epsilon,\delta}(f) = \min \{\, qcost(M) \mid M\ \epsilon\text{-approximates } f \text{ with confidence } 1-\delta \,\}$
  - $eqc_{\epsilon,\delta}$ is defined similarly

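Estimating the Mean
- Input: $x \in [0,1]^n$, with $f(x) = \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Want relative approximation: $(1-\epsilon)\mu \le \hat{\mu} \le (1+\epsilon)\mu$
- Naïve algorithm:
  - Choose: $i_1,\dots,i_k$ (uniformly and independently)
  - Query: $i_1,\dots,i_k$
  - Output: $\hat{\mu} = \frac{1}{k}\sum_{j=1}^{k} x_{i_j}$ (sample mean)
- How large should k be?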
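The naïve algorithm can be sketched in a few lines of Python. The function name `estimate_mean`, the `query` callable, and the `rng` parameter are hypothetical names for illustration; the sample size k is left as a parameter.

```python
import random

def estimate_mean(query, n, k, rng=random):
    """Sample k indices uniformly and independently (with replacement)
    and return the sample mean of the answers."""
    total = 0.0
    for _ in range(k):
        i = rng.randint(1, n)   # uniform index in {1, ..., n}
        total += query(i)
    return total / k
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.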

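Chernoff-Hoeffding Bound
- $X_1,\dots,X_n$: i.i.d. random variables
  - have a bounded domain [0,1]
  - $E[X_i] = \mu$ for all i
- By linearity of expectation: $E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \mu$
- Theorem [Chernoff-Hoeffding Bound]: For all $0 < \epsilon < 1$, $\Pr\left[\,\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \ge \epsilon\mu\,\right] \le 2\exp\left(-\frac{\epsilon^2 \mu n}{3}\right)$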

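Analysis of Naïve Algorithm
- Lemma: $k = O\left(\frac{1}{\epsilon^2 \mu}\log\frac{1}{\delta}\right)$ queries suffice.
- Proof:
  - For i = 1,…,k, let $X_i$ = answer to the i-th query
  - The $X_i$ are i.i.d., take values in [0,1], and have $E[X_i] = \mu$
  - Then, output of algorithm: $\hat{\mu} = \frac{1}{k}\sum_{i=1}^{k} X_i$
  - By the Chernoff-Hoeffding bound: $\Pr[\,|\hat{\mu} - \mu| \ge \epsilon\mu\,] \le 2\exp\left(-\frac{\epsilon^2 \mu k}{3}\right) \le \delta$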
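A short worked derivation of the sample size, assuming the multiplicative form of the Chernoff-Hoeffding bound with constant 3 in the exponent (one standard statement of the bound; the constant is an assumption here). The numbers in the example are illustrative only.

```latex
\[
2\exp\!\left(-\frac{\epsilon^2 \mu k}{3}\right) \le \delta
\quad\Longleftrightarrow\quad
k \ge \frac{3}{\epsilon^2 \mu}\,\ln\frac{2}{\delta}.
\]
% Illustrative numbers: eps = 0.1, delta = 0.05, and a known lower bound mu >= 0.2.
\[
k \ge \frac{3\ln 40}{0.01 \cdot 0.2} \approx 5533.3,
\qquad\text{so } k = 5534 \text{ queries suffice.}
\]
```

Note that k depends on $\mu$, which is the unknown quantity; in practice the formula would be applied with a lower bound on $\mu$.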

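Estimating the Median
- Input: n numbers $x_1,\dots,x_n$
- Want rank approximation: output an element whose rank lies in $\left[\left(\tfrac{1}{2}-\epsilon\right)n,\ \left(\tfrac{1}{2}+\epsilon\right)n\right]$
- Sampling algorithm:
  - Choose: $i_1,\dots,i_k$ (uniformly and independently)
  - Query: $i_1,\dots,i_k$
  - Output: the median of $x_{i_1},\dots,x_{i_k}$ (sample median)
- How large should k be?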
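The sampling algorithm can be sketched as follows. The names `estimate_median`, `query`, and `rng` are hypothetical; for even k the sketch returns the lower of the two middle sample elements, which is a choice made here, not something the slide specifies.

```python
import random

def estimate_median(query, n, k, rng=random):
    """Sample k indices uniformly and independently (with replacement)
    and return the (lower) median of the sampled values."""
    sample = sorted(query(rng.randint(1, n)) for _ in range(k))
    return sample[(k - 1) // 2]
```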

Analysis of Median Algorithm
- Lemma: $k = O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ queries suffice.
- Proof:
  - For j = 1,…,k, let ...
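One standard way to complete the argument is sketched below. It uses the additive Hoeffding bound and indicator variables $Y_j$; this is an assumed route, not necessarily the proof given in the lecture.

```latex
% Y_j indicates that the j-th sample falls below the allowed rank window.
\[
Y_j = \begin{cases} 1 & \text{if } \mathrm{rank}(x_{i_j}) < (\tfrac12 - \epsilon)n, \\ 0 & \text{otherwise,} \end{cases}
\qquad
E[Y_j] \le \tfrac12 - \epsilon .
\]
% The sample median is too small only if at least half of the samples are too small.
\[
\Pr\Big[\textstyle\sum_{j=1}^{k} Y_j \ge \tfrac{k}{2}\Big]
\le \Pr\Big[\tfrac1k \textstyle\sum_{j=1}^{k} Y_j - E[Y_1] \ge \epsilon\Big]
\le \exp(-2\epsilon^2 k).
\]
% The symmetric event (sample median too large) has the same bound; union bound:
\[
2\exp(-2\epsilon^2 k) \le \delta
\quad\Longleftrightarrow\quad
k \ge \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}.
\]
```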

The Selection Problem
- Input:
  - n real numbers $x_1,\dots,x_n$
  - Integer $k \in \{1,\dots,n\}$
- Output:
  - The $x_i$ whose rank is k (relative rank k/n)
- Ex:
  - k = 1: minimum
  - k = n: maximum
  - k = n/2: median
- Can be easily solved by sorting (O(n log n) time)
- Can we do it in O(n) time?

The Floyd-Rivest Algorithm
- Note: our approximate median algorithm can be generalized to any quantile $0 < q < 1$.
- Floyd-Rivest algorithm:
  1. Set $\epsilon = 1/n^{1/3}$, $\delta = 1/n$
  2. Use the approximate quantile algorithm for $q = k/n - \epsilon$; let $x_L$ be the element returned
  3. Use the approximate quantile algorithm for $q = k/n + \epsilon$; let $x_R$ be the element returned
  4. Let $k_L = \mathrm{rank}(x_L)$
  5. Keep only the elements in the interval $[x_L, x_R]$
  6. Sort these elements and output the element whose rank is $k - k_L + 1$
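The six steps can be sketched in Python as below. Several details are assumptions made for this sketch rather than statements from the lecture: the quantile targets are taken to be k/n - ε and k/n + ε, sampling is with replacement, the sample size is about n^(2/3) log n, and a failed run falls back to a full sort so that the function always returns the correct element. The names `approx_quantile` and `floyd_rivest_select` are hypothetical.

```python
import math
import random

def approx_quantile(xs, q, k, rng):
    """Return the q-quantile of k values sampled u.a.r. with replacement."""
    sample = sorted(rng.choice(xs) for _ in range(k))
    idx = min(k - 1, max(0, math.ceil(q * k) - 1))  # clamp q outside (0, 1)
    return sample[idx]

def floyd_rivest_select(xs, k, rng=None):
    """Return the element of rank k in xs (k = 1 is the minimum); needs len(xs) >= 1."""
    rng = rng or random.Random()
    n = len(xs)
    eps = n ** (-1.0 / 3.0)                                       # step 1
    samples = math.ceil(n ** (2.0 / 3.0) * math.log(n + 1)) + 1   # assumed sample size
    x_l = approx_quantile(xs, k / n - eps, samples, rng)          # step 2
    x_r = approx_quantile(xs, k / n + eps, samples, rng)          # step 3
    k_l = sum(1 for v in xs if v < x_l) + 1                       # step 4: rank of x_L
    kept = sorted(v for v in xs if x_l <= v <= x_r)               # steps 5 and 6
    pos = k - k_l              # 0-based index of rank k - k_L + 1 among kept
    if 0 <= pos < len(kept):
        return kept[pos]
    # The target fell outside [x_L, x_R] (low-probability event): fall back to sorting.
    return sorted(xs)[k - 1]
```

The fallback turns the Monte Carlo procedure of the slide into one that is always correct, at the price of an occasional O(n log n) run.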

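Analysis of Floyd-Rivest
- Theorem: With probability $1 - O(1/n)$, the Floyd-Rivest algorithm finds the element of rank k (the k-th smallest number) in the input in O(n) time.
- Proof:
  - Let $x^*$ = element of rank k (relative rank k/n)
  - Lemma 1: With probability ≥ 1-2/n, $x^* \in [x_L, x_R]$
  - Proof: Each approximate quantile computation misses its target rank by more than $\epsilon n$ with probability at most $\delta = 1/n$. If neither misses, then $\mathrm{rank}(x_L) \le k \le \mathrm{rank}(x_R)$, so $x_L \le x^* \le x_R$. By the union bound, this fails with probability at most 2/n.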

Analysis of Floyd-Rivest
- Let S = input elements that belong to $[x_L, x_R]$
- Lemma 2: With probability ≥ 1-2/n, $|S| \le O(n^{2/3})$
- Proof:
  - With probability ≥ 1-1/n, $\mathrm{rank}(x_L) \ge k - 2\epsilon n$
  - With probability ≥ 1-1/n, $\mathrm{rank}(x_R) \le k + 2\epsilon n$
  - Therefore, with probability ≥ 1-2/n, at most $4\epsilon n = O(n^{2/3})$ elements are between $x_L$ and $x_R$
- Running time analysis:
  - $O(n^{2/3} \log n)$: approximate quantile computations
  - O(n): calculation of $\mathrm{rank}(x_L)$
  - O(n): filtering elements outside $[x_L, x_R]$
  - $O(n^{2/3} \log n)$: sorting S

Minimum Spanning Tree (MST)
- G = (V,E): an undirected weighted graph
- Spanning tree: a subgraph G' = (V',E') of G s.t.:
  - V' = V
  - G' is a tree
- MST: spanning tree of minimum weight
- Algorithms:
  - Kruskal: $O(E \log V)$
  - Chazelle: $O(E\,\alpha(E,V))$
  - Karger, Klein, Tarjan: $O(E)$ expected time (randomized)

Sublinear Time MST Algorithm [Chazelle, Rubinfeld, Trevisan]
- Approximate the weight of the MST
  - Outputting the whole MST takes linear time
- Relative approximation: the output lies between $(1-\epsilon)$ and $(1+\epsilon)$ times the weight of the MST
- Running time is stated in terms of:
  - $\Delta(G)$ = maximum degree of G
  - $w(G)$ = maximum weight of an edge in G
- Main subroutine: estimate the number of connected components in G

From MST to Connected Components
- Suppose all edges have weight 1 or 2
- Let $G_1$ be the subgraph spanned by the edges of weight 1
- Let $C_1$ be the number of connected components in $G_1$
- Then, in any MST, the number of edges of weight 2 must be $C_1 - 1$.
- Therefore, the weight of the MST is $(n - 1) + (C_1 - 1) = n + C_1 - 2$
- Can be generalized to arbitrary weights
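A small exact (not sublinear) check of this reduction, assuming G is connected and weights are integers in {1, ..., w_max}. The generalization used here, MST weight = n - w_max + C_1 + ... + C_(w_max - 1), where C_i counts the components of the subgraph with edges of weight at most i, is the known formula from the Chazelle-Rubinfeld-Trevisan paper and reduces to n + C_1 - 2 for w_max = 2. The function names are hypothetical.

```python
def count_components(n, edges):
    """Count connected components of a graph on vertices 0..n-1 (union-find)."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components

def mst_weight(n, weighted_edges, w_max):
    """MST weight of a connected graph with integer weights in {1, ..., w_max}."""
    total = n - w_max
    for i in range(1, w_max):
        light = [(u, v) for u, v, w in weighted_edges if w <= i]
        total += count_components(n, light)   # C_i
    return total
```

For a 4-cycle with weights 1, 2, 1, 2, the weight-1 subgraph has C_1 = 2 components, so the MST weight is 4 + 2 - 2 = 4.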

Estimating # of Connected Components
- Let C = # of connected components in G
- For every node $u \in V$, let $n_u$ = size of the connected component to which u belongs.
- Note: for every connected component I, $\sum_{u \in I} \frac{1}{n_u} = 1$
- Therefore, $\sum_{u \in V} \frac{1}{n_u} = C$
- Thus, it would suffice to approximate $1/n_u$ for all u.
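A tiny worked instance of the identity, with an illustrative graph chosen here (one component of size 3 and one isolated vertex), followed by the sampling view that motivates estimating $1/n_u$ at random vertices.

```latex
% n = 4 vertices: a component of size 3 and an isolated vertex, so C = 2.
\[
\sum_{u \in V} \frac{1}{n_u}
= \underbrace{\tfrac13 + \tfrac13 + \tfrac13}_{\text{component of size 3}}
+ \underbrace{1}_{\text{isolated vertex}}
= 2 = C .
\]
% Equivalently, for u chosen uniformly at random from V:
\[
C = n \cdot E_{u}\!\left[\frac{1}{n_u}\right].
\]
```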

The Algorithm
1. choose $r = O(1/\epsilon^2)$ vertices $u_1,\dots,u_r$ u.a.r.
2. for each vertex $u_i$ do
3.     set $\beta_i = 0$
4.     set S = first node visited in BFS($u_i$)
5.     set t = 1
6.     while (BFS($u_i$) has not finished && $|S| < 1/\epsilon$) do
7.         set b = a random bit
8.         if (b = 0) exit while loop
9.         resume BFS($u_i$), until # of visited nodes is 2|S|
10.        add newly visited nodes to S
11.        if BFS($u_i$) completes, set $\beta_i = 2^t / |S|$
12.        set t = t+1
13. Output $\hat{C} = \frac{n}{r} \sum_{i=1}^{r} \beta_i$
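A runnable sketch of the pseudocode, under a few assumptions made here: the graph is given as adjacency lists, the output is (n/r) times the sum of the β values, "BFS has not finished" means the BFS queue is non-empty, and the BFS expands one whole vertex at a time, so a round may slightly overshoot the 2|S| target. The name `estimate_components` is hypothetical.

```python
import random
from collections import deque

def estimate_components(adj, eps, r, rng=None):
    """Estimate the number of connected components.

    adj[u] is the list of neighbours of vertex u (vertices are 0..n-1).
    """
    rng = rng or random.Random()
    n = len(adj)
    total = 0.0
    for _ in range(r):
        u = rng.randrange(n)               # uniform random start vertex
        beta = 0.0
        visited = {u}                      # the set S of the pseudocode
        frontier = deque([u])              # BFS is finished once this is empty
        t = 1
        while frontier and len(visited) < 1.0 / eps:
            if rng.getrandbits(1) == 0:    # random bit: give up with probability 1/2
                break
            target = 2 * len(visited)
            # Resume BFS; expanding a whole vertex may overshoot the target.
            while frontier and len(visited) < target:
                v = frontier.popleft()
                for w in adj[v]:
                    if w not in visited:
                        visited.add(w)
                        frontier.append(w)
            if not frontier:               # BFS completed in round t
                beta = 2.0 ** t / len(visited)
            t += 1
        total += beta
    return n * total / r
```

Because the BFS itself is deterministic, a component explored fully in round t is reached with probability 2^(-t) and contributes 2^t / n_u, so the overshoot does not bias the estimate for small components.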

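Analysis: Correctness
- Let T = vertices in components of size $< 1/\epsilon$
- If $u_i \notin T$, then $\beta_i = 0$ with probability 1.
- If $u_i \in T$, then $\beta_i = 2^t / n_{u_i}$ with probability $2^{-t}$, where t is the round in which BFS($u_i$) completes, and $\beta_i = 0$ otherwise.
- Therefore, $E[\beta_i \mid u_i] = 1/n_{u_i}$ for $u_i \in T$
- Therefore: $E[\hat{C}] = \sum_{u \in T} \frac{1}{n_u}$, and since at most $\epsilon n$ components have size $\ge 1/\epsilon$, $C - \epsilon n \le E[\hat{C}] \le C$
- A variance analysis shows that this also occurs with high probability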

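Analysis: Running time
- # of outer iterations: $O(1/\epsilon^2)$
- For each iteration of the outer loop, the expected # of nodes visited by BFS is $O(\log(1/\epsilon))$
- For each node, query at most $O(\Delta(G))$ neighbors in BFS.
- Total: $O\left(\Delta(G) \cdot \frac{1}{\epsilon^2} \log\frac{1}{\epsilon}\right)$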
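A sketch of why the expected number of visited nodes per outer iteration is logarithmic, assuming the doubling scheme of the algorithm: round t is executed only if all t random bits so far were 1, and it adds at most $2^{t-1}$ new nodes to S.

```latex
% Round t runs with probability 2^{-t} and visits at most 2^{t-1} new nodes.
% The loop condition |S| < 1/eps allows at most T = ceil(log_2(1/eps)) rounds.
\[
E[\#\text{nodes visited}]
\le 1 + \sum_{t=1}^{T} 2^{-t} \cdot 2^{t-1}
= 1 + \frac{T}{2}
= O\!\left(\log\frac{1}{\epsilon}\right),
\qquad T = \left\lceil \log_2 \frac{1}{\epsilon} \right\rceil .
\]
% Multiplying by the number of outer iterations and the per-node neighbor queries:
\[
O\!\left(\frac{1}{\epsilon^2}\right) \cdot O\!\left(\log\frac{1}{\epsilon}\right) \cdot O(\Delta(G))
= O\!\left(\frac{\Delta(G)}{\epsilon^2}\log\frac{1}{\epsilon}\right).
\]
```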

End of Lecture 8

