1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 8 May 4, 2005

2 Random Sampling

3 Outline
- The random sampling model
- Mean estimation
- Median estimation
- O(n) time median algorithm (Floyd-Rivest)
- MST weight estimation (Chazelle-Rubinfeld-Trevisan)

4 The Random Sampling Model
f: Aⁿ → B
- A, B: arbitrary sets
- n: a positive integer (think of n as large)
Goal: given x ∈ Aⁿ, compute f(x)
- Sometimes an approximation of f(x) suffices
Oracle access to the input:
- The algorithm does not have direct access to x
- To probe x, the algorithm sends queries to an "oracle"
- Query: an index i ∈ {1,…,n}
- Answer: x_i
Objective: compute f with the minimum number of queries
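To make the oracle access model concrete, here is a minimal Python sketch; the class name and interface are illustrative choices, not anything defined in the lecture. The point is that the algorithm sees n and a query() method, and we charge it for queries rather than for reading x.

```python
class Oracle:
    """Wraps an input x in A^n; algorithms may only probe it via query()."""

    def __init__(self, x):
        self._x = x
        self.queries = 0          # number of probes made so far

    def __len__(self):
        return len(self._x)       # n is known to the algorithm

    def query(self, i):
        """Return x_i for an index i in {0, ..., n-1} and charge one query."""
        self.queries += 1
        return self._x[i]
```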

5 Motivation
The most basic model for dealing with large data sets:
- Statistics
- Machine learning
- Signal processing
- Approximation algorithms
- …
The algorithm's resources are a function of the number of queries rather than of the input length.
Sometimes a constant number of queries suffices.

6 Adaptive vs. Non-adaptive Sampling
Non-adaptive sampling:
- The algorithm decides which indices to query a priori.
- Queries are performed in a batch at a pre-processing step.
- The number of queries performed is the same for all inputs.
Adaptive sampling:
- Queries are performed sequentially: query i_1, get answer x_{i_1}; query i_2, get answer x_{i_2}; … The algorithm stops whenever it has enough information to compute f(x).
- To decide which index to query next, the algorithm can use the answers to previous queries.
- The number of queries performed may vary for different inputs.
Example: the OR of n bits; an adaptive algorithm can stop as soon as it sees a 1, so its query count varies with the input.

7 Randomization vs. Determinism
Deterministic algorithms:
- Non-adaptive: always queries the same set of indices.
- Adaptive: the choice of i_t deterministically depends on the answers to the first t-1 queries.
Randomized algorithms:
- Non-adaptive: indices are chosen randomly according to some distribution (e.g., uniform).
- Adaptive: i_t is chosen randomly according to a distribution that depends on the answers to previous queries.
Our focus: randomized algorithms.

8 (ε,δ)-approximation
M: a randomized sampling algorithm
M(x): the output of M on input x
- M(x) is a random variable
ε > 0: approximation error parameter
0 < δ < 1: confidence parameter
Definition: M is said to ε-approximate f with confidence 1 - δ if for all inputs x ∈ Aⁿ:
  Pr[ |M(x) - f(x)| ≤ ε·|f(x)| ] ≥ 1 - δ
Example: with δ = 0.1, with probability ≥ 0.9 the output M(x) is within a (1 ± ε) factor of f(x).

9 Query Complexity
Definition: qcost(M) = the maximum number of queries M performs over:
- the worst choice of input x
- the worst choice of random bits
Definition: eqcost(M) = the expected number of queries M performs over:
- the worst choice of input x
- expectation over the random bits
Definition: the query complexity of f is
  qc_{ε,δ}(f) = min { qcost(M) : M ε-approximates f with confidence 1-δ }
- eqc_{ε,δ}(f) is defined similarly, with eqcost in place of qcost.

10 Estimating the Mean
μ = (1/n)·Σ_i x_i, where x ∈ [0,1]ⁿ
Want a relative approximation: output μ̂ with |μ̂ - μ| ≤ εμ.
Naive algorithm:
- Choose: i_1,…,i_k (uniformly and independently)
- Query: i_1,…,i_k
- Output: μ̂ = (1/k)·Σ_j x_{i_j} (the sample mean)
How large should k be?

11 Chernoff-Hoeffding Bound
X_1,…,X_n: i.i.d. random variables
- each has a bounded domain [0,1]
- E[X_i] = μ for all i
By linearity of expectation: E[(1/n)·Σ_i X_i] = μ.
Theorem [Chernoff-Hoeffding bound]: for all 0 < ε < 1,
  Pr[ |(1/n)·Σ_i X_i - μ| > εμ ] ≤ 2·exp(-ε²μn/3)

12 Analysis of Naïve Algorithm
Lemma: k = O(log(1/δ) / (ε²μ)) queries suffice.
Proof:
- For i = 1,…,k, let X_i = the answer to the i-th query, i.e., X_i = x_{i_i}.
- The X_i are i.i.d. with domain [0,1] and E[X_i] = μ.
- The output of the algorithm is μ̂ = (1/k)·Σ_i X_i.
- By the Chernoff-Hoeffding bound, Pr[|μ̂ - μ| > εμ] ≤ 2·exp(-ε²μk/3), which is at most δ for k = (3/(ε²μ))·ln(2/δ). ∎
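A minimal Python sketch of the naive estimator with this sample-size bound. Since k depends on the unknown mean μ, the sketch assumes a known lower bound mu_min ≤ μ; that bound is an assumption introduced here for illustration, not part of the lecture.

```python
import math
import random

def estimate_mean(oracle_query, n, eps, delta):
    """Naive sampling estimator for the mean of an input in [0,1]^n."""
    mu_min = 0.1                                # assumed lower bound on the mean
    k = math.ceil(3 * math.log(2 / delta) / (eps ** 2 * mu_min))
    sample = [oracle_query(random.randrange(n)) for _ in range(k)]
    return sum(sample) / k                      # the sample mean

# Usage: for x uniform in [0,1], the output is within eps*mu of mu w.h.p.
x = [random.random() for _ in range(10 ** 6)]
print(estimate_mean(lambda i: x[i], len(x), eps=0.05, delta=0.1))
```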

13 Estimating the Median
The median of x_1,…,x_n is the element of rank n/2.
Want a rank approximation: output an element whose rank is in [n/2 - εn, n/2 + εn].
Sampling algorithm:
- Choose: i_1,…,i_k (uniformly and independently)
- Query: i_1,…,i_k
- Output: the median of x_{i_1},…,x_{i_k} (the sample median)
How large should k be?

14 Analysis of Median Algorithm
Lemma: k = O(log(1/δ) / ε²) queries suffice.
Proof:
- For j = 1,…,k, let X_j = 1 if x_{i_j} is among the n/2 - εn smallest input elements, and X_j = 0 otherwise; then E[X_j] = 1/2 - ε.
- The sample median has rank < n/2 - εn only if Σ_j X_j ≥ k/2, i.e., only if Σ_j X_j exceeds its expectation by εk; by the Hoeffding bound this happens with probability ≤ exp(-2ε²k).
- A symmetric argument bounds the probability that the sample median has rank > n/2 + εn; for k = ln(2/δ)/(2ε²) the total failure probability is at most δ. ∎
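A corresponding Python sketch for the sample median, with k = O(log(1/δ)/ε²) as in the lemma; the constant comes from the Hoeffding bound above.

```python
import math
import random
import statistics

def estimate_median(oracle_query, n, eps, delta):
    """Return an element whose rank is n/2 +- eps*n with probability 1-delta."""
    k = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    sample = [oracle_query(random.randrange(n)) for _ in range(k)]
    return statistics.median_low(sample)        # an actual sampled element

# Usage: x = list of values; estimate_median(lambda i: x[i], len(x), 0.05, 0.1)
```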

15 The Selection Problem
Input:
- n real numbers x_1,…,x_n
- an integer k ∈ {1,…,n}
Output:
- the x_i whose rank is k (the k-th smallest)
Examples:
- k = 1: minimum
- k = n: maximum
- k = n/2: median
Can be easily solved by sorting (O(n log n) time). Can we do it in O(n) time?

16 The Floyd-Rivest Algorithm
Note: our approximate median algorithm generalizes to any quantile 0 < q < 1.
Floyd-Rivest algorithm (see the sketch after this slide):
1. Set ε = 1/n^{1/3}, δ = 1/n.
2. Run the approximate quantile algorithm for q = k/n - ε; let x_L be the element returned.
3. Run the approximate quantile algorithm for q = k/n + ε; let x_R be the element returned.
4. Let k_L = rank(x_L).
5. Keep only the elements in the interval [x_L, x_R].
6. Sort these elements and output the element whose rank among them is k - k_L + 1.
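A runnable Python sketch of this outline, under simplifying assumptions: elements are distinct, and the two approximate quantiles are read directly off a sorted random sample rather than computed by a separate approximate-quantile subroutine. rank(x_L) is computed here as the number of elements strictly below x_L, so the final index shifts by one relative to step 6.

```python
import math
import random

def floyd_rivest_select(x, k):
    """Select the element of rank k (the k-th smallest) in expected O(n) time."""
    n = len(x)
    eps = n ** (-1 / 3)
    sample_size = min(n, math.ceil(n ** (2 / 3) * math.log(n + 1)))
    sample = sorted(random.sample(x, sample_size))
    q = k / n
    lo = max(0, int((q - eps) * sample_size) - 1)
    hi = min(sample_size - 1, int((q + eps) * sample_size) + 1)
    x_L, x_R = sample[lo], sample[hi]            # approximate quantiles
    k_L = sum(1 for v in x if v < x_L)           # elements strictly below x_L
    S = sorted(v for v in x if x_L <= v <= x_R)  # survivors, O(n^{2/3}) w.h.p.
    idx = k - k_L - 1                            # 0-indexed position inside S
    if 0 <= idx < len(S):
        return S[idx]
    return sorted(x)[k - 1]                      # rare fallback if sampling failed

# Usage: agrees with sorting-based selection w.h.p.
x = random.sample(range(10 ** 6), 10 ** 5)
print(floyd_rivest_select(x, 500) == sorted(x)[499])   # True w.h.p.
```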

17 Analysis of Floyd-Rivest
Theorem: with probability 1 - O(1/n), the Floyd-Rivest algorithm finds the element of rank k in O(n) time.
Proof:
- Let x* = the element of rank k.
- Lemma 1: with probability ≥ 1 - 2/n, x* ∈ [x_L, x_R].
- Proof: each approximate quantile computation fails with probability ≤ δ = 1/n; when both succeed, rank(x_L) ≤ k ≤ rank(x_R), so x* lies in the interval. ∎

18 Analysis of Floyd-Rivest
Let S = the input elements that belong to [x_L, x_R].
Lemma 2: with probability ≥ 1 - 2/n, |S| ≤ O(n^{2/3}).
Proof:
- With probability ≥ 1 - 2/n, rank(x_L) ≥ k - 2εn and rank(x_R) ≤ k + 2εn, since each quantile estimate is within εn of its target rank.
- Therefore, with probability ≥ 1 - 2/n, at most 4εn = O(n^{2/3}) elements lie between x_L and x_R. ∎
Running time analysis:
- O(n^{2/3} log n): approximate quantile computations
- O(n): computing rank(x_L)
- O(n): filtering out elements outside [x_L, x_R]
- O(n^{2/3} log n): sorting S

19 Minimum Spanning Tree (MST)
G = (V,E): an undirected weighted graph
Spanning tree: a subgraph G' = (V',E') of G such that:
- V' = V
- G' is a tree
MST: a spanning tree of minimum weight
Algorithms:
- Kruskal: O(E log V)
- Chazelle: O(E α(E,V))
- Karger, Klein, Tarjan: O(E) expected time, randomized

20 Sublinear Time MST Algorithm [Chazelle, Rubinfeld, Trevisan]
Approximates the weight of the MST:
- Outputting the whole MST itself takes linear time.
Relative approximation: output ŵ with |ŵ - w(MST)| ≤ ε·w(MST).
Running time: Õ(Δ(G)·w(G)/ε²), where
- Δ(G) = the maximum degree of G
- w(G) = the maximum weight of an edge in G (weights are integers in {1,…,w(G)})
Main subroutine: estimating the number of connected components of a graph.

21 From MST to Connected Components
Suppose all edges have weight 1 or 2.
- Let G_1 be the subgraph spanned by the edges of weight 1.
- Let C_1 be the number of connected components in G_1.
- Then, in any MST, the number of edges of weight 2 must be C_1 - 1.
- Therefore, the weight of the MST is (n - C_1)·1 + (C_1 - 1)·2 = n + C_1 - 2.
This generalizes to arbitrary integer weights (see the identity below).
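For integer weights in {1,…,w}, writing G_i for the subgraph spanned by the edges of weight at most i and C_i for its number of connected components, the same level-by-level argument yields the identity used by Chazelle-Rubinfeld-Trevisan:

```latex
w(\mathrm{MST}) \;=\; n - w + \sum_{i=1}^{w-1} C_i
```

With w = 2 this reads n - 2 + C_1, matching the two-weight case above.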

22 Estimating # of Connected Components
Let C = the number of connected components in G.
For every node u ∈ V, let n_u = the size of the connected component to which u belongs.
Note: for every connected component I, Σ_{u ∈ I} 1/n_u = 1.
Therefore, C = Σ_{u ∈ V} 1/n_u.
(For example, a graph with components of sizes 3 and 2 gives 3·(1/3) + 2·(1/2) = 2.)
Thus, it suffices to approximate 1/n_u for a sample of the nodes u.

23 The Algorithm
1. Choose r = O(1/ε²) vertices u_1,…,u_r u.a.r.
2. For each vertex u_i do:
3.   Set β_i = 0.
4.   Set S = {the first node visited in BFS(u_i)}.
5.   Set t = 1.
6.   While (BFS(u_i) has not finished && |S| < 1/ε) do:
7.     Let $ = a random bit.
8.     If ($ = 0) exit the while loop.
9.     Resume BFS(u_i) until the number of visited nodes is 2|S|.
10.    Add the newly visited nodes to S.
11.    If BFS(u_i) completes, set β_i = 2^t / |S|.
12.    Set t = t + 1.
13. Output Ĉ = (n/r)·Σ_{i=1}^{r} β_i.
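A runnable Python sketch of this procedure on an adjacency-list graph. The constant in r = O(1/ε²) and the exact bookkeeping of the doubling budget are illustrative choices; the coin flips, the 1/ε size cap, and the 2^t/|S| estimator follow the pseudocode above.

```python
import random
from collections import deque

def estimate_components(adj, eps, r=None):
    """Estimate the number C of connected components via doubling BFS."""
    n = len(adj)
    r = r or max(1, round(4 / eps ** 2))         # r = O(1/eps^2), constant illustrative
    total = 0.0
    for _ in range(r):
        u = random.randrange(n)                  # start vertex, chosen u.a.r.
        visited, queue = {u}, deque([u])
        budget, t, beta = 1, 0, 0.0
        while True:
            # resume BFS until `budget` nodes are visited or the component ends
            while queue and len(visited) <= budget:
                v = queue.popleft()
                for w in adj[v]:
                    if w not in visited:
                        visited.add(w)
                        queue.append(w)
            if not queue:                        # BFS completed its component
                beta = 2 ** t / len(visited)     # estimator for 1/n_u
                break
            if len(visited) >= 1 / eps:          # component too large: beta = 0
                break
            if random.random() < 0.5:            # coin flip: halt with prob 1/2
                break
            t += 1
            budget = 2 * len(visited)            # double the budget and resume
        total += beta
    return n * total / r                         # C-hat = (n/r) * sum(beta_i)

# Usage: two triangles plus an isolated vertex -> C = 3.
adj = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4], []]
print(estimate_components(adj, eps=0.2))
```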

24 Analysis: Correctness
Let T = the set of vertices in components of size < 1/ε.
- If u_i ∈ T, then E[β_i] = 1/n_{u_i}: the BFS completes at some fixed doubling round t, which is reached with probability 2^{-t}, and then β_i = 2^t/n_{u_i}.
- If u_i ∉ T, then 0 ≤ E[β_i] ≤ 1/n_{u_i} < ε, since the size cap usually stops the BFS before it completes.
Therefore, |E[Ĉ] - C| ≤ εn.
A variance analysis shows that |Ĉ - C| = O(εn) also holds with high probability.

25 Analysis: Running Time
Number of outer iterations: O(1/ε²).
For each iteration of the outer loop, the expected number of nodes visited by the BFS is O(log(1/ε)): round t visits at most 2^t nodes but is reached with probability 2^{-t}, and there are at most log(1/ε) + O(1) rounds.
For each visited node, the BFS queries at most O(Δ(G)) neighbors.
Total: O((Δ(G)/ε²)·log(1/ε)) expected queries.

26 End of Lecture 8