
1 Pass-Efficient Algorithms for Clustering. Dissertation Defense, April 26, 2006. Kevin Chang. Adviser: Ravi Kannan. Committee: Dana Angluin, Joan Feigenbaum, Petros Drineas (RPI).

2 Overview
– Massive data sets in theoretical computer science: algorithms for input that is too large to fit in the memory of a computer.
– Clustering problems: learning generative models; clustering via combinatorial optimization; both massive-data-set and traditional algorithms.

3 Theoretical Abstractions for Massive Data Sets
– The input on disk/storage is modeled as a read-only array. Elements may only be accessed through a sequential pass; input elements may be arbitrarily ordered.
– Main memory is modeled as extra space used for intermediate calculations.
– The algorithm is allowed extra time before and after each pass to perform calculations.
– Goal: minimize memory usage and the number of passes.
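A minimal sketch (my illustration, not from the thesis) of this access model in Python: the input is touched only through full sequential passes, and only the small state carried across passes counts as memory. The function names and the one-element-per-line input format are assumptions of the example.

```python
# Illustrative pass-efficient driver: the input file is read-only and is only
# ever scanned front to back; `state` is the algorithm's small working memory.
def run_passes(path, num_passes, update, between_passes, state):
    for p in range(num_passes):
        with open(path) as f:                 # one sequential pass over the input
            for line in f:
                state = update(state, line.strip(), p)
        state = between_passes(state, p)      # extra computation allowed between passes
    return state
```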

4 Models of Computation
– Streaming model: the algorithm may make a single pass over the data; space must be o(n).
– Pass-efficient model: the algorithm may make a small, constant number of passes; ideally, space is O(1).
– Other models: sublinear algorithms.
– The pass-efficient model is more flexible than streaming, but it is not suitable for truly streaming data that arrives, is processed immediately, and is then forgotten.

5 Main Question: What will multiple passes buy you?
– Is it better, in terms of other resources, to make 3 passes instead of 1 pass?
– Example: find the mean of an array of integers. One pass requires O(1) space; more passes don't help.
– Example: find the median [MP '80]. A 1-pass algorithm requires linear space; a 2-pass algorithm needs only O(n^{1/2}) space.
– We will study the trade-off between passes and memory. Other work: for graphs [FKMSZ '05].
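As a tiny illustration of the first example above (the code is mine, not the slides'), the mean really does need only one pass and O(1) extra space:

```python
# One pass, O(1) extra space: keep only a running sum and a count.
def streaming_mean(stream):
    total = 0
    count = 0
    for x in stream:        # a single left-to-right pass over the input
        total += x
        count += 1
    return total / count if count else 0.0
```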

6 Overview of Results
– A general framework for pass-efficient clustering algorithms.
– A learning problem: a generative model of clustering. Sharp trade-off between passes and space; lower bounds show it is nearly tight.
– A combinatorial optimization problem: facility location. The same sharp trade-off.
– An algorithm for a graph partitioning problem, which can be considered a clustering problem.

7 Graph Partitioning: Sum-of-Squares Partition
– Given a graph G = (V, E), remove m nodes or edges to disconnect the graph into components {C_i} so as to minimize the objective function ∑_i |C_i|^2.
– The objective function is natural because, in the optimum partition, large components will have roughly equal size.
– Joint work with Jim Aspnes and Alex Yampolskiy.

8 Graph Partitioning: Our Results
– Definition: an (α, β)-bicriterion approximation removes αm nodes or edges to obtain a partition {C_i} such that ∑_i |C_i|^2 < β·OPT, where OPT is the best sum of squares achievable by removing m nodes or edges.
– Approximation algorithm for the sum-of-squares partition. Thm: there exists an (O(log^{1.5} n), O(1))-approximation algorithm.
– Hardness of approximation. Thm: it is NP-hard to compute a (1.18, 1)-approximation, by reduction from minimum Vertex Cover.
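A small helper (my illustration, not part of the thesis) that evaluates the sum-of-squares objective for a candidate set of removed nodes; it assumes the node-removal version of the problem and uses networkx for connected components:

```python
import networkx as nx

def sum_of_squares_cost(G, removed_nodes):
    # Remove the candidate nodes and sum |C_i|^2 over the remaining components.
    H = G.copy()
    H.remove_nodes_from(removed_nodes)
    return sum(len(C) ** 2 for C in nx.connected_components(H))
```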

9 Outline of Algorithm
– Recursively partition the graph by removing sparse node or edge cuts.
– Sparsest cut is an important problem in theory: informally, find a cut that removes a small number of edges to split a graph into two relatively large pieces. There has been recent activity in this area [LR99, ARV04].
– The node cut problem reduces to the edge cut problem.
– A similar approach was taken by [KVV '00] for a different clustering problem.

10 Rough Idea of Algorithm. In the first iteration, partition the graph G into C_1 and C_2 by removing a sparse node cut. (Figure: the graph G before the first cut.)

11 Rough Idea of Algorithm. (Figure: the same graph after the first cut, now split into components C_1 and C_2.)

12 Rough Idea of Algorithm. In subsequent iterations, compute cuts in all current components and remove the cut that most effectively reduces the objective function. (Figure: C_1 and C_2 being cut further.)

13 Rough Idea of Algorithm. Continue the procedure until Θ(log^{1.5} n)·m nodes have been removed. (Figure: components C_1, C_2, C_3 after several cuts.)
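A rough sketch of the recursive strategy in slides 9-13, with assumptions made explicit: `sparsest_cut` is a hypothetical black box standing in for an approximate sparsest-cut routine in the spirit of [LR99]/[ARV04], components are plain sets of nodes, and the greedy rule for choosing which cut to apply is only indicative of the real analysis.

```python
def recursive_partition(components, budget, sparsest_cut):
    # Greedily apply, among all current components, the sparse cut that most
    # reduces sum(|C_i|^2) per removed node, until the removal budget
    # (Theta(log^1.5 n) * m in the thesis) is exhausted.
    removed = []
    while len(removed) < budget:
        best = None
        for C in components:
            if len(C) <= 1:
                continue
            cut_nodes, A, B = sparsest_cut(C)   # hypothetical black box
            gain = (len(C) ** 2 - len(A) ** 2 - len(B) ** 2) / max(len(cut_nodes), 1)
            if best is None or gain > best[0]:
                best = (gain, C, cut_nodes, A, B)
        if best is None:
            break
        _, C, cut_nodes, A, B = best
        components = [X for X in components if X is not C] + [A, B]
        removed.extend(cut_nodes)
    return components, removed
```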

14 Generative Models of Clustering
– Assume the input is generated by k different random processes: k distributions F_1, F_2, …, F_k, each with weight w_i > 0 such that ∑_i w_i = 1.
– A sample is drawn from the mixture by picking F_i with probability w_i and then drawing a point according to F_i.
– Alternatively, if each F_i is a density function, then the mixture has density F = ∑_i w_i F_i.
– There is much TCS work on learning mixtures of Gaussian distributions in high dimension: [D '99], [AK '01], [VW '02], [KSV '05].
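A minimal sketch of how a sample from such a mixture is drawn; the concrete weights and component samplers below are made up for illustration.

```python
import random

def draw_from_mixture(weights, samplers):
    # Pick component i with probability w_i, then draw a point from F_i.
    i = random.choices(range(len(samplers)), weights=weights, k=1)[0]
    return samplers[i]()

# Example: a mixture of three uniform distributions on intervals of the line.
weights = [0.5, 0.3, 0.2]
samplers = [lambda: random.uniform(0.0, 3.0),
            lambda: random.uniform(1.0, 3.0),
            lambda: random.uniform(4.0, 5.0)]
points = [draw_from_mixture(weights, samplers) for _ in range(1000)]
```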

15 Learning Mixtures
– Q: Given samples from the mixture, can we learn the parameters of the k processes? A: Not always; the question can be ill-posed, since two different mixtures can produce the same density.
– Our problem: can we learn the density F of the mixture in the pass-efficient model? What are the trade-offs between space and passes?
– Suppose F is the density of the mixture. The accuracy of our estimate G is measured by the L1 distance ∫ |F − G|.

16 Mixtures of k Uniform Distributions in R
– Assume each F_i is a uniform distribution over some interval (a_i, b_i) in R. Then the mixture density looks like a step distribution.
– (Figure: a mixture of three distributions, F_1 on (x_1, x_3), F_2 on (x_2, x_3), and F_3 on (x_4, x_5), whose density is piecewise constant.)
– The problem reduces to learning a density function that is known to be piecewise constant with at most 2k − 1 "steps".
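To make the step structure concrete, here is a throwaway evaluation of such a mixture density (the intervals and weights are invented for the example):

```python
def mixture_of_uniforms_density(x, weights, intervals):
    # Each component contributes w_i / (b_i - a_i) on its interval and 0
    # elsewhere, so the mixture density F is piecewise constant.
    return sum(w / (b - a) for w, (a, b) in zip(weights, intervals) if a <= x < b)

weights = [0.5, 0.3, 0.2]
intervals = [(0.0, 3.0), (1.0, 3.0), (4.0, 5.0)]
print(mixture_of_uniforms_density(2.0, weights, intervals))  # two components overlap at x = 2
```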

17 Algorithmic Results
– Pick any integer P > 0 and any 1 > ε > 0. We give a 2P-pass randomized algorithm to learn a mixture of k uniform distributions.
– Error at most ε^P, memory required O*(k^3/ε^2 + Pk/ε): the error drops exponentially in the number of passes while the memory used grows very slowly.
– Error at most ε, memory required O*(k^3/ε^{2/P}): with 2 passes you need roughly (1/ε)^2, with 4 passes roughly 1/ε, with 8 passes (1/ε)^{1/2}. The memory drops sharply as P increases while the error stays constant.
– The algorithm succeeds with probability 1 − δ.

18 Step Distribution: First Attempt
– Let X be the data stream of samples from F. Break the domain into bins, count the number of points that fall in each bin, and estimate F on each bin.
– To get accuracy ε^P, you need to store roughly 1/ε^P counters. Too much!
– We can do much better using a few more passes.
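For concreteness, a sketch of this naive one-pass binning approach (the uniform grid over a known range [lo, hi) is an assumption of the example); the point of the slide is that accuracy ε^P forces roughly 1/ε^P bins.

```python
def one_pass_histogram(stream, lo, hi, num_bins):
    # One sequential pass: count how many points land in each bin, then return
    # the piecewise-constant density estimate (fraction of points / bin width).
    counts = [0] * num_bins
    n = 0
    width = (hi - lo) / num_bins
    for x in stream:
        if lo <= x < hi:
            counts[int((x - lo) / width)] += 1
        n += 1
    return [c / (n * width) for c in counts] if n else counts
```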

19 General Framework
– In one pass, take a small uniform subsample of size s from the data stream and compute a coarse solution on the subsample.
– In a second pass, sharpen the coarse solution wherever it is reasonable.
– Also in the second pass, find the places where the coarse solution is very far off, and recurse on these spots: examine the trouble spots more carefully, i.e. zoom in.

20 Our Algorithm
1. In one pass, draw a sample of size s = Θ(k^2/ε^2) from X.
2. Based on the sample, partition the domain into intervals I such that ∫_I F = Θ(ε/2k). The number of intervals will be O(k/ε).
3. In one pass, for each interval I, determine whether F is very close to constant on I (call subroutine Constant), and count the number of points of X that lie in I.
4. If F is constant on I, then |X ∩ I| / (|X| · length(I)) is very close to F on I.
5. If F is not constant on I, recurse on I (zoom in on the trouble spot).
Requires |X| = Ω(k^6/ε^{6P}). Space usage: O*(k^3/ε^2 + Pk/ε).
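The sketch below is a heavily simplified rendering of this zoom-in loop, with several stand-ins of my own: `draw_pass` simulates one pass restricted to the currently active intervals, `looks_constant` is a crude placeholder for the Constant subroutine, the splitting uses sample quantiles, and the rescaling needed in later passes is deliberately omitted; the real sample sizes, tests, and guarantees are in the thesis.

```python
def looks_constant(pts, a, b, slack=0.3):
    # Crude stand-in for the Constant subroutine: compare how many points fall
    # in the left and right halves of the interval (a, b).
    left = sum(1 for x in pts if x < (a + b) / 2)
    return abs(2 * left - len(pts)) <= slack * max(len(pts), 1)

def estimate_step_density(draw_pass, k, eps, P):
    # draw_pass(intervals) simulates one pass over the stream, returning the
    # points that fall inside the given intervals (possibly subsampled).
    active = [(0.0, 1.0)]    # "trouble spots" still to be resolved
    estimate = []            # list of ((a, b), constant density value) pairs
    for _ in range(P):
        pts = sorted(draw_pass(active))
        if len(pts) < 2:
            break
        # Split the active region into ~2k/eps pieces of roughly equal sample mass.
        m = max(1, int(2 * k / eps))
        cuts = [pts[i * len(pts) // m] for i in range(1, m)]
        edges = [pts[0]] + cuts + [pts[-1] + 1e-12]
        active = []
        for a, b in zip(edges[:-1], edges[1:]):
            inside = [x for x in pts if a <= x < b]
            if looks_constant(inside, a, b):
                # NOTE: normalized within the current sample only; the real
                # algorithm rescales by the mass of the active region.
                estimate.append(((a, b), len(inside) / (len(pts) * max(b - a, 1e-12))))
            else:
                active.append((a, b))    # zoom in on this piece in the next pass
    return estimate
```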

21 The Algorithm in Action. (Diagram: sample from the data stream, mark the intervals that contain a jump, and pass the points in those intervals along as a sub-data stream; repeat this P − 2 more times.)

22 Why It Works: Bounding the Error
– It is easy to learn F where it is constant; our estimate is very accurate on these intervals.
– The weight of the bins decreases exponentially at each iteration: at the P-th iteration, bins have weight at most ε^P/4k.
– Thus we can estimate F as 0 on the 2k bins that contain a jump and incur an error of at most ε^P/2k.

23 Generalizations
Our algorithm generalizes (with roughly the same trade-off between space and passes) to learn:
1. Mixtures of uniform distributions over axis-aligned rectangles in R^2: each F_i is uniform over a rectangle (a_i, b_i) × (c_i, d_i) ⊂ R^2.
2. Mixtures of linear distributions in R: the density of F_i is linear over some interval (a_i, b_i) ⊂ R.
Heuristic: for a mixture of Gaussians, treat it as a mixture of m linear distributions for some m >> k (the theoretical error will be large unless m is very big).

24 Lower Bounds
– We define a Generalized Learning Problem (GLP), a slight generalization of the problem of learning mixtures of distributions.
– Thm: Any P-pass randomized algorithm that solves the GLP must use at least Ω(1/ε^{1/(2P−1)}) bits of memory. The proof uses r-round communication complexity.
– Thm: There exists a P-pass algorithm that solves the GLP using at most O*(1/ε^{4/P}) bits of memory (a slight generalization of the algorithm given above).
– Conclusion: the trade-off is nearly tight for the GLP.

25 General Framework for Clustering: Adaptive Sampling in 2P Passes
– Pass 1: draw a small sample S from the data and compute a solution C on the subproblem S (if S is small, this can be done in memory).
– Pass 2: determine the points R that are not clustered well by C, and recurse on R.
– If S is representative, there will not be many points in R. Since R is small, we sample it at a higher rate in subsequent iterations and get a better solution for these "outliers".

26 Facility Location
– Input: a set X of points with distances given by a function d, and a facility cost f_i for each point.
– Problem: find a set F of facilities that minimizes ∑_{i∈F} f_i + ∑_{i∈X} d(i, F), where d(i, F) = min_{j∈F} d(i, j).
– Example instance: facility costs f_1 = 2, f_2 = 40, f_3 = 8, f_4 = 25; distances d(1,2) = 5, d(2,3) = 6, d(1,3) = 2, d(1,4) = 5, d(3,4) = 3.
– The cost of the solution that builds facilities at 1 and 3 is 2 + 8 + 5 + 3 = 18.
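A tiny helper (mine, for illustration) reproducing the cost computation on this slide: facility opening costs plus each remaining point's distance to its nearest open facility.

```python
def fl_cost(open_facilities, f, d, points):
    # d is given for unordered pairs, so look up either orientation.
    def dist(i, j):
        return d[(i, j)] if (i, j) in d else d[(j, i)]
    facility_cost = sum(f[j] for j in open_facilities)
    service_cost = sum(min(dist(i, j) for j in open_facilities)
                       for i in points if i not in open_facilities)
    return facility_cost + service_cost

f = {1: 2, 2: 40, 3: 8, 4: 25}
d = {(1, 2): 5, (2, 3): 6, (1, 3): 2, (1, 4): 5, (3, 4): 3}
print(fl_cost({1, 3}, f, d, points={1, 2, 3, 4}))   # 2 + 8 + 5 + 3 = 18
```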

27 Quick Facts about Facility Location
– A natural operations research problem; NP-hard to solve optimally.
– Many approximation algorithms exist (not streaming or massive-data-set algorithms): the best approximation ratio is 1.52, and achieving a factor of 1.463 is hard.

28 Other NP-Hard Clustering Problems in Combinatorial Optimization
– k-center: find a set of centers C, |C| = k, that minimizes max_{i∈X} d(i, C). There is a one-pass approximation algorithm that requires O(k) space [CCFM '97].
– k-median: find a set of centers C, |C| = k, that minimizes ∑_{i∈X} d(i, C). There is a one-pass approximation algorithm that requires O(k log^2 n) space [COP '03].
– There is no pass-efficient algorithm for general facility location; one pass suffices only for very restricted inputs [Indyk '04].

29 Memory Issues for Facility Location
– Note that some instances of FL require Ω(n) facilities to achieve any reasonable approximation.
– Thm: Any P-pass randomized algorithm for approximately computing the optimum FL cost requires Ω(n/P) bits of memory.
– How can we give reasonable guarantees that the space usage is "small"? Parameterize the memory bounds by the number of facilities opened.

30 Algorithmic Results
– A 3P-pass algorithm for FL using at most O*(k* · n^{2/P}) bits of extra memory, where k* is the number of facilities output by the algorithm.
– The approximation ratio is O(P): if OPT is the cost of the optimum solution, the algorithm outputs a solution of cost at most γ · 36 · P · OPT.
– It requires a black-box approximation algorithm for FL with approximation ratio γ; the best so far is γ = 1.52.
– Surprising fact: the same trade-off for a very different problem!

31 Simplifying the Problem
– A large number of facilities complicates matters, and the algorithmic cure involves technical details that obscure the main point we would like to make.
– An easier problem that demonstrates our framework in action is k-Facility Location (k-FL): find a set of at most k facilities F (|F| ≤ k) that minimizes ∑_{i∈F} f_i + ∑_{i∈X} d(i, F).
– We present a 3P-pass algorithm that uses O*(k · n^{1/P}) bits of memory, with approximation ratio O(P).

32 Idea of Our Algorithm
– The algorithm makes P triples of passes, 3P passes in total.
– In the first two passes of each triple, we take a uniform sample of X and compute an FL solution on the sample. Intuitively, this solution is good for most of the input.
– In the third pass, we identify the bad points that are far away from our facilities.
– In subsequent passes, we recursively compute a solution for the bad points by restricting the algorithm to those points.

33 Algorithm, Part 1: Clustering a Sample
1. In one pass, draw a sample S_1 of s = O*(k^{(P−1)/P} · n^{1/P}) nodes from the data stream.
2. In a second pass, compute an O(1)-approximation to FL on S_1, but with facilities drawn from all of X and with distances in S_1 scaled up by n/s. Call the resulting set of facilities F_1.

34 How Good Is the Solution F_1?
– We want to show that the cost of F_1 on all of X is small.
– It is easy to show that ∑_{i∈F_1} f_i < O(1)·OPT.
– Best scenario: facilities that serve the sample S_1 well serve all points of X well, so that ∑_{j∈F_1} f_j + ∑_{j∈X} d(j, F_1) = O(1)·OPT. Unfortunately, this is not true in general.

35 Bounding the Cost (similar to [Indyk '99])
– Fix an optimum solution on X: a set of facilities F*, |F*| = k, that partitions X into k clusters C_1, …, C_k.
– If cluster C_i is "large" (i.e. |C_i| > n/s), then a sample S_1 of X of size s contains many points of C_i.
– We can prove that a good clustering of S_1 is good for the points in the large clusters C_i of X: serving the large clusters costs at most O(1)·OPT.
– We can also prove that the number of "bad" points is at most O(k^{1/P} n^{1−1/P}).

36 Algorithm, Part 2: Recursing on the Outliers
3. In a third pass, identify the O(k^{1/P} n^{1−1/P}) points that are farthest away from the facilities in F_1. This can be done in roughly O*(n^{1/P}) space using random sampling.
4. Assign all other points to facilities in F_1; this costs at most O(1)·OPT.
5. Recurse on the outliers. In subsequent iterations we draw a sample of size s from the outliers, so these points are sampled at a much higher rate.

37 The General Recursive Step
1. In the m-th iteration, let R_m be the set of points remaining to be clustered.
2. Take a sample S of size s from R_m and compute an FL solution F_m.
3. Identify the O*(k^{(m+1)/P} n^{1−(m+1)/P}) points that are farthest away from F_m; call this set R_{m+1}. We can prove that F_m serves R_m \ R_{m+1} with cost O(1)·OPT.
4. Recurse on the points in R_{m+1}.
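A schematic sketch of this adaptive-sampling loop, with the two black boxes left abstract and the parameters only loosely following the slides: `approx_fl(sample)` stands for any in-memory O(1)-approximation for facility location, `dist_to(F, x)` for the distance from x to its nearest facility in F, and the sample and outlier sizes are indicative rather than the thesis's exact values.

```python
import random

def pass_efficient_fl(X, P, k, approx_fl, dist_to):
    n = len(X)
    facilities = set()
    remaining = list(X)                                   # R_1 = X
    for m in range(1, P + 1):
        # "Passes 1-2" of the m-th triple: sample and solve FL on the sample.
        s = max(1, int(k ** ((P - 1) / P) * n ** (1 / P)))
        sample = random.sample(remaining, min(s, len(remaining)))
        F_m = approx_fl(sample)
        facilities |= set(F_m)
        if m == P or not remaining:
            break
        # "Pass 3": keep only the points served worst by F_m and recurse on them.
        num_outliers = int(k ** ((m + 1) / P) * n ** (1 - (m + 1) / P))
        remaining.sort(key=lambda x: dist_to(F_m, x), reverse=True)
        remaining = remaining[:num_outliers]
    return facilities
```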

38 Putting It All Together
– Let the final set of facilities be F = ∪_m F_m. After P iterations, every point has been assigned to a facility, and the total cost of the solution is O(P)·OPT.
– Note that possibly |F| > k. We can boil this down to k facilities that still give cost O(P)·OPT (hint: use a known k-median algorithm).
– The algorithm generalizes to k-median, but requires more passes and space than [COP '03].

39 References
– J. Aspnes, K. Chang, A. Yampolskiy. Inoculation strategies for victims of viruses and the sum-of-squares partition problem. To appear in Journal of Computer and System Sciences; preliminary version in SODA 2005.
– K. Chang, R. Kannan. Pass-Efficient Algorithms for Clustering. SODA 2006.
– K. Chang. Pass-Efficient Algorithms for Facility Location. Yale TR 1337.
– S. Arora, K. Chang. Approximation Schemes for Low Degree MST and Red-Blue Separation Problem. Algorithmica 40(3):189-210, 2004; preliminary version in ICALP 2003. (Not in thesis.)

40 Future Directions
– Consider problems from data mining, abstract (and possibly simplify) them, and see what algorithmic insights TCS can provide.
– Possible extensions of this thesis work:
– Design a pass-efficient algorithm for learning mixtures of Gaussian distributions in high dimension, with rigorous accuracy guarantees.
– Design pass-efficient algorithms for other combinatorial optimization problems. One possibility: find a set of centers C, |C| ≤ k, to minimize ∑_i d(i, C)^2 (a generalization of the k-means objective; the Euclidean case is already solved).
– Many problems are out there. Much more to be done!

41 Thanks for listening!

