1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 8 May 4, 2005 http://www.ee.technion.ac.il/courses/049011
2 Random Sampling
3 Outline
The random sampling model
Mean estimation
Median estimation
O(n) time median algorithm (Floyd-Rivest)
MST weight estimation (Chazelle-Rubinfeld-Trevisan)
4 The Random Sampling Model
f: A^n → B, where A, B are arbitrary sets and n is a positive integer (think of n as large)
Goal: given x ∈ A^n, compute f(x)
Sometimes, an approximation of f(x) suffices
Oracle access to the input: the algorithm does not have direct access to x
In order to probe x, the algorithm sends queries to an "oracle"
Query: an index i ∈ {1,…,n}
Answer: x_i
Objective: compute f with the minimum number of queries
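A minimal sketch of the oracle-access model (the class and function names here are illustrative, not from the lecture): the algorithm holds only an oracle handle, every probe of x goes through an indexed query, and the cost measure is simply the number of such queries.

import random

class Oracle:
    """Wraps the input x; the algorithm can only probe it by index, and each probe is counted."""
    def __init__(self, x):
        self._x = x
        self.queries = 0

    def query(self, i):
        # one query = one index i in {0,...,n-1}, answered with x[i]
        self.queries += 1
        return self._x[i]

def sample_uniform_entries(oracle, n, k):
    """Query k uniformly random positions and return the answers."""
    return [oracle.query(random.randrange(n)) for _ in range(k)]

x = [random.random() for _ in range(10**5)]
oracle = Oracle(x)
answers = sample_uniform_entries(oracle, len(x), 100)
print(oracle.queries)   # 100, regardless of the input length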
5 Motivation
The most basic model for dealing with large data sets: statistics, machine learning, signal processing, approximation algorithms, …
The algorithm's resources are a function of the # of queries rather than of the input length
Sometimes, a constant # of queries suffices
6 Adaptive vs. Non-adaptive Sampling
Non-adaptive sampling:
The algorithm decides which indices to query a priori
Queries are performed in a batch at a pre-processing step
The number of queries performed is the same for all inputs
Adaptive sampling:
Queries are performed sequentially: query i_1, get answer x_{i_1}; query i_2, get answer x_{i_2}; …
The algorithm stops whenever it has enough information to compute f(x)
In order to decide which index to query, the algorithm can use the answers to previous queries
The number of queries performed may vary for different inputs
Example: OR of n bits
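To make the OR example concrete, here is a small hedged sketch (helper names are mine): an adaptive sampler queries uniformly chosen bits one at a time and stops as soon as it sees a 1, so its query count depends on the input, unlike a non-adaptive sampler whose query set is fixed in advance.

import random

def adaptive_or(query, n, max_queries):
    """Adaptive estimate of the OR of n bits: stop at the first 1 seen."""
    for t in range(1, max_queries + 1):
        if query(random.randrange(n)) == 1:
            return 1, t                  # a 1 was found after t queries
    return 0, max_queries                # no 1 found; may err if 1s are very rare

x = [1] * 500 + [0] * 500                # many 1s: adaptivity pays off
value, used = adaptive_or(lambda i: x[i], len(x), max_queries=1000)
print(value, used)                       # typically finds a 1 within a few queries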
7 Randomization vs. Determinism
Deterministic algorithms:
Non-adaptive: always queries the same set of indices
Adaptive: the choice of i_t deterministically depends on the answers to the first t−1 queries
Randomized algorithms:
Non-adaptive: indices are chosen randomly according to some distribution (e.g., uniform)
Adaptive: i_t is chosen randomly according to a distribution, which depends on the answers to previous queries
Our focus: randomized algorithms
8 (ε,δ)-Approximation
M: a randomized sampling algorithm
M(x): output of M on input x (M(x) is a random variable)
ε > 0: approximation error parameter
0 < δ < 1: confidence parameter
Definition: M is said to ε-approximate f with confidence 1 − δ if for all inputs x ∈ A^n,
Pr[ |M(x) − f(x)| ≤ ε·|f(x)| ] ≥ 1 − δ
Ex: for δ = 0.1, with probability ≥ 0.9 the output M(x) is within relative error ε of f(x)
9 Query Complexity
Definition: qcost(M) = the maximum number of queries M performs over the worst choice of input x and the worst choice of random bits
Definition: eqcost(M) = the expected number of queries M performs on the worst choice of input x, where the expectation is over the random bits
Definition: the query complexity of f is qc_{ε,δ}(f) = min { qcost(M) | M ε-approximates f with confidence 1 − δ }
eqc_{ε,δ}(f) is defined similarly (with eqcost in place of qcost)
10 Estimating the Mean
Input: x ∈ [0,1]^n; let μ = (1/n) Σ_i x_i be the mean
Want relative approximation: output μ̂ with |μ̂ − μ| ≤ ε·μ
Naïve algorithm:
Choose: i_1,…,i_k (uniformly and independently)
Query: i_1,…,i_k
Output: μ̂ = (1/k) Σ_j x_{i_j} (the sample mean)
How large should k be?
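A direct sketch of the naïve estimator (the function name and the concrete choice of k are mine): draw k indices uniformly and independently, query them, and output the sample mean.

import random

def estimate_mean(query, n, k):
    """Sample mean of k uniformly chosen entries (with replacement)."""
    return sum(query(random.randrange(n)) for _ in range(k)) / k

x = [random.random() for _ in range(10**6)]           # values in [0,1]
mu_hat = estimate_mean(lambda i: x[i], len(x), k=10000)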
11 Chernoff-Hoeffding Bound
X_1,…,X_n: i.i.d. random variables with a bounded domain [0,1]
E[X_i] = μ for all i
By linearity of expectation: E[(1/n) Σ_i X_i] = μ
Theorem [Chernoff-Hoeffding Bound]: For all 0 < ε < 1,
Pr[ |(1/n) Σ_i X_i − μ| ≥ ε·μ ] ≤ 2·exp(−ε²·μ·n / 3)
12 Analysis of Naïve Algorithm
Lemma: k = (3/(ε²·μ))·ln(2/δ) = O(log(1/δ) / (ε²·μ)) queries suffice.
Proof:
For i = 1,…,k, let X_i = the answer to the i-th query
Then the output of the algorithm is μ̂ = (1/k) Σ_i X_i
By the Chernoff-Hoeffding bound:
Pr[ |μ̂ − μ| ≥ ε·μ ] ≤ 2·exp(−ε²·μ·k / 3) ≤ δ for the above choice of k
13 Estimating the Median
Input: n real numbers x_1,…,x_n; the median is the element of rank n/2
Want rank approximation: output an element whose rank lies in [(1/2 − ε)·n, (1/2 + ε)·n]
Sampling algorithm:
Choose: i_1,…,i_k (uniformly and independently)
Query: i_1,…,i_k
Output: the median of x_{i_1},…,x_{i_k} (the sample median)
How large should k be?
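The same pattern gives the median estimator (the function name is mine); the analysis on the next slide shows that k = O(log(1/δ)/ε²) samples already make the sample median an ε-approximation in rank.

import random
import statistics

def estimate_median(query, n, k):
    """Median of k uniformly sampled entries; w.h.p. its rank is within eps*n of n/2."""
    return statistics.median(query(random.randrange(n)) for _ in range(k))

x = [random.gauss(0, 1) for _ in range(10**6)]
med_hat = estimate_median(lambda i: x[i], len(x), k=5000)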
14 Analysis of Median Algorithm
Lemma: k = O(log(1/δ) / ε²) queries suffice.
Proof:
For j = 1,…,k, let X_j = 1 if rank(x_{i_j}) ≤ (1/2 − ε)·n, and X_j = 0 otherwise; then E[X_j] ≤ 1/2 − ε
The sample median has rank < (1/2 − ε)·n only if Σ_j X_j > k/2, and by the Chernoff-Hoeffding bound this happens with probability ≤ exp(−2ε²k)
The symmetric argument bounds the probability that the sample median has rank > (1/2 + ε)·n
By a union bound, k = O(log(1/δ)/ε²) makes the total failure probability at most δ
15 The Selection Problem
Input: n real numbers x_1,…,x_n and an integer k ∈ {1,…,n}
Output: the x_i whose rank is k
Ex: k = 1: minimum; k = n: maximum; k = n/2: median
Can be easily solved by sorting (O(n log n) time)
Can we do it in O(n) time?
16 The Floyd-Rivest Algorithm
Note: our approximate median algorithm can be generalized to any quantile 0 < q < 1.
Floyd-Rivest algorithm:
1. Set ε = 1/n^{1/3}, δ = 1/n
2. Use the approximate quantile algorithm for q = k/n − ε; let x_L be the element returned
3. Use the approximate quantile algorithm for q = k/n + ε; let x_R be the element returned
4. Let k_L = rank(x_L)
5. Keep only the elements in the interval [x_L, x_R]
6. Sort these elements and output the element whose rank (within the sorted interval) is k − k_L + 1
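A compact Python sketch of these steps (the helper names, the sample size, and the exact padding around the target quantile are my illustrative choices following ε = 1/n^{1/3}; the lecture's version uses the approximate quantile routine above): bracket the rank-k element between two sampled pivots, then sort only the small slice in between.

import random

def floyd_rivest_select(x, k):
    """Return the element of rank k (1-based, k-th smallest) of x via sampling."""
    n = len(x)
    s = max(1, int(n ** (2 / 3)))                     # sample size ~ n^{2/3}
    sample = sorted(random.choices(x, k=s))
    q, pad = k / n, n ** (-1 / 3)                     # target quantile and eps = n^{-1/3}
    x_l = sample[max(0, int((q - pad) * s) - 1)]      # pivot a bit below the target
    x_r = sample[min(s - 1, int((q + pad) * s) + 1)]  # pivot a bit above the target
    k_l = sum(1 for v in x if v < x_l)                # elements strictly below x_l
    middle = sorted(v for v in x if x_l <= v <= x_r)
    pos = k - k_l - 1                                 # 0-based position of the answer in middle
    if 0 <= pos < len(middle):
        return middle[pos]
    return sorted(x)[k - 1]                           # rare fallback if the pivots missed

x = [random.random() for _ in range(100000)]
print(floyd_rivest_select(x, k=len(x) // 2))          # exact rank-k element, found by sampling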
17 Analysis of Floyd-Rivest
Theorem: with probability 1 − O(1/n), the Floyd-Rivest algorithm finds the element of rank k in O(n) time.
Proof:
Let x* = the element of rank k
Lemma 1: with probability ≥ 1 − 2/n, x* ∈ [x_L, x_R]
Proof: each approximate quantile computation errs with probability at most δ = 1/n; if both succeed, rank(x_L) ≤ k ≤ rank(x_R), so x_L ≤ x* ≤ x_R. A union bound over the two computations gives failure probability ≤ 2/n.
18 Analysis of Floyd-Rivest
Let S = the input elements that belong to [x_L, x_R]
Lemma 2: with probability ≥ 1 − 2/n, |S| ≤ O(n^{2/3})
Proof: if both quantile computations succeed, rank(x_L) ≥ k − 2εn and rank(x_R) ≤ k + 2εn
Therefore, with probability ≥ 1 − 2/n, at most 4εn = O(n^{2/3}) elements lie between x_L and x_R
Running time analysis:
O(n^{2/3} log n): approximate quantile computations
O(n): computing rank(x_L)
O(n): filtering out elements outside [x_L, x_R]
O(n^{2/3} log n): sorting S
19 Minimum Spanning Tree (MST)
G = (V,E): an undirected weighted graph
Spanning tree: a subgraph G' = (V',E') of G s.t. V' = V and G' is a tree
MST: a spanning tree of minimum weight
Algorithms:
Kruskal: O(E log V)
Chazelle: O(E·α(E,V)) (α = inverse Ackermann function)
Karger, Klein, Tarjan: O(E) expected time (randomized)
20 Sublinear Time MST Algorithm [Chazelle, Rubinfeld, Trevisan]
Approximates the weight of the MST (outputting the whole MST takes linear time)
Relative approximation: outputs ŵ s.t. |ŵ − w(MST)| ≤ ε·w(MST) with high probability
Running time: Õ(Δ(G)·w(G)/ε²), where
Δ(G) = maximum degree of G
w(G) = maximum weight of an edge in G
Main subroutine: estimate the number of connected components in G
21 From MST to Connected Components
Suppose all edges have weight 1 or 2
Let G_1 be the subgraph spanned by the edges of weight 1
Let C_1 be the number of connected components in G_1
Then, in any MST, the # of edges of weight 2 must be C_1 − 1 (exactly enough to connect the components of G_1 into a tree)
Therefore, the weight of the MST is (n − C_1)·1 + (C_1 − 1)·2 = n + C_1 − 2
Can be generalized to arbitrary weights (see the identity below)
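The generalization to integer weights in {1,…,w} can be written out as follows (stated here as the standard identity behind the Chazelle-Rubinfeld-Trevisan approach; the slide only asserts that a generalization exists). With G_i the subgraph spanned by the edges of weight at most i, and C_i its number of connected components:

\[
w(\mathrm{MST}) \;=\; n - w + \sum_{i=1}^{w-1} C_i
\]

For w = 2 this reads n − 2 + C_1, matching the two-weight case above.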
22 Estimating # of Connected Components
Let C = # of connected components in G
For every node u ∈ V, let n_u = the size of the connected component to which u belongs
Note: for every connected component I, Σ_{u ∈ I} 1/n_u = 1
Therefore, C = Σ_{u ∈ V} 1/n_u
Thus, it would suffice to approximate 1/n_u for all u.
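A tiny sanity check of the identity C = Σ_u 1/n_u (illustration only, not part of the lecture): compute each vertex's component size exactly and verify that the reciprocals sum to the number of components.

from collections import defaultdict

def component_sizes(n, edges):
    """Return, for every vertex, the size of its connected component."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    size, seen = [0] * n, [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack, comp = [s], [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
                    comp.append(v)
        for u in comp:
            size[u] = len(comp)
    return size

sizes = component_sizes(6, [(0, 1), (1, 2), (3, 4)])   # components {0,1,2}, {3,4}, {5}
print(sum(1 / s for s in sizes))                        # 3.0 = number of components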
23 The Algorithm
1. choose r = O(1/ε²) vertices u_1,…,u_r u.a.r.
2. for each vertex u_i do
3.   set β_i = 0
4.   set S = first node visited in BFS(u_i)
5.   set t = 1
6.   while (BFS(u_i) has not finished && |S| < 1/ε) do
7.     set $ = a random bit
8.     if ($ = 0) exit while loop
9.     resume BFS(u_i), until the # of visited nodes is 2|S|
10.    add the newly visited nodes to S
11.    if BFS(u_i) completes, set β_i = 2^t / |S|
12.    set t = t + 1
13. Output Ĉ = (n/r) · Σ_{i=1}^{r} β_i
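A Python sketch of this estimator, following my reading of the pseudocode (the variable names, the size threshold, and the constant in r are illustrative; the guarantees rely on the parameters from the analysis on the next slide):

import random
from collections import deque

def estimate_num_components(adj, n, eps):
    """Estimate the number of connected components of an n-vertex graph given as adjacency lists."""
    r = int(4 / eps ** 2)                      # number of sampled start vertices
    total = 0.0
    for _ in range(r):
        u = random.randrange(n)
        visited = {u}
        frontier = deque([u])
        t, budget, beta = 0, 1, 0.0
        while frontier and len(visited) < 1 / eps:
            if random.getrandbits(1) == 0:     # fair coin: stop exploring from this vertex
                break
            t += 1
            budget *= 2                        # let the BFS double the number of visited nodes
            while frontier and len(visited) < budget:
                v = frontier.popleft()
                for w in adj[v]:
                    if w not in visited:
                        visited.add(w)
                        frontier.append(w)
            if not frontier:                   # whole component explored: record 2^t / n_u
                beta = (2 ** t) / len(visited)
                break
        total += beta
    return total * n / r                       # (n/r) * sum of beta_i

adj = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
print(estimate_num_components(adj, 5, eps=0.2))  # true number of components is 3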
24 Analysis: Correctness
Let T = the set of vertices belonging to components of size < 1/ε
If u_i ∉ T, then β_i = 0 with probability 1
If u_i ∈ T, then β_i = 2^t / n_{u_i} with probability 2^{−t}, where t is the iteration at which BFS(u_i) completes; hence E[β_i] = 1/n_{u_i}
Therefore, C − εn ≤ E[Ĉ] ≤ C, i.e., Ĉ approximates C to within an additive εn in expectation
A variance analysis shows that this also occurs with high probability
25 Analysis: Running Time
# of outer iterations: O(1/ε²)
For each iteration of the outer loop, the expected # of nodes visited by the BFS is O(log(1/ε)): reaching doubling step t requires t successful coin flips (probability 2^{−t}), and step t visits O(2^t) new nodes, so each of the O(log(1/ε)) steps contributes O(1) in expectation
For each visited node, the BFS queries at most O(Δ(G)) neighbors
Total: O(Δ(G) · log(1/ε) / ε²) expected time
26 End of Lecture 8