1 1 Efficient Algorithms for Non-Parametric Clustering With Clutter Weng-Keen Wong Andrew Moore

2 2 Problems From the Physical Sciences: minefield detection (Dasgupta and Raftery 1998); earthquake faults (Byers and Raftery 1998)

3 3 Problems From the Physical Sciences (Pereira 2002)(Sloan Digital Sky Survey 2000)

4 4 A Simplified Example

5 5 Clustering with Traditional Algorithms: Single Linkage Clustering; Mixture of Gaussians with a Uniform Background Component

6 6 Clustering with CFF. Panels: Cuevas-Febrero-Fraiman; Original Dataset

7 7 Related Work. (Dasgupta and Raftery 98): mixture model approach, with a mixture of Gaussians for features and a Poisson process for clutter. (Byers and Raftery 98): K-nearest neighbour distances for all points are modeled as a mixture of two gamma distributions, one for clutter and one for the features; each data point is classified by the component it was most likely generated from.

8 8 Outline 1. Introduction: Clustering and Clutter 2. The Cuevas-Febrero-Fraiman Algorithm 3. Optimizing Step One of CFF 4. Optimizing Step Two of CFF 5. Results

9 9 The CFF Algorithm Step One Find the high density datapoints

10 10 The CFF Algorithm, Step Two: cluster the high density points using Single Linkage Clustering. Stop when link length > ε
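A minimal brute-force sketch of this step, assuming the high density points are already known. The function name `single_linkage_clusters` and the parameter `eps` (the link-length threshold) are illustrative and do not come from the slides. It sorts all pairwise links by length and merges components with a union-find structure until the next link is longer than `eps`, which is the quadratic baseline that the later slides try to avoid.

```python
import math
from itertools import combinations

def single_linkage_clusters(points, eps):
    """Group point indices by single linkage, stopping at link length > eps."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the component root, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All O(N^2) candidate links, shortest first.
    links = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    for length, i, j in links:
        if length > eps:
            break  # every remaining link is too long
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
clusters = single_linkage_clusters(points, eps=1.5)
print(clusters)  # [[0, 1, 2], [3, 4]]
```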

11 11 The CFF Algorithm Originally intended to estimate the number of clusters Can also be used to find clusters against a noisy background

12 12 Step One: Non-Parametric Density Estimator. A datapoint is a high density datapoint if the number of datapoints within a hypersphere of radius h centered on it exceeds a threshold c
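A naive sketch of this test, written as a direct O(N^2) count rather than the tree-based methods the talk refers to. The function name is illustrative, and two choices are assumptions not stated on the slide: the point itself is not counted, and points exactly at distance h are counted as inside.

```python
import math

def high_density_points(points, h, c):
    """Return the points with more than c other points within radius h."""
    dense = []
    for i, p in enumerate(points):
        # Assumption: the point itself is excluded from its own count.
        neighbours = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= h
        )
        if neighbours > c:
            dense.append(p)
    return dense

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense = high_density_points(points, h=0.5, c=2)
print(dense)  # the four points near the origin; (5.0, 5.0) is clutter
```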

13 13 Speeding up the Non-Parametric Density Estimator Addressed in a separate paper (Gray and Moore 2001) Two basic ideas: 1. Use a dual tree algorithm (Gray and Moore 2000) 2. Cut search off early without computing exact densities (Moore 2000)

14 14 Step Two: Euclidean Minimum Spanning Trees (EMSTs). Traditional MST algorithms assume you are given all the distances, which implies O(N²) memory usage. Want to use a Euclidean Minimum Spanning Tree algorithm instead

15 15 Optimizing Clustering Step. Exploit recent results in computational geometry for efficient EMSTs. Involves a modification to the GeoMST2 algorithm (Narasimhan et al 2000). GeoMST2 is based on Well-Separated Pairwise Decompositions (WSPDs) (Callahan 1995). Our optimizations gain an order of magnitude speedup, especially in higher dimensions

16 16 Outline for Optimizing Step Two 1. High level overview of GeoMST2 2. Example of a WSPD 3. More detailed description of GeoMST2 4. Our optimizations

17 17 Intuition behind GeoMST2

18 18 Intuition behind GeoMST2

19 19 High Level Overview of GeoMST2. WSPD pairs: (A_1, B_1), (A_2, B_2), ..., (A_m, B_m). 1. Create the Well-Separated Pairwise Decomposition

20 20 High Level Overview of GeoMST2. WSPD pairs: (A_1, B_1), (A_2, B_2), ..., (A_m, B_m). Each pair (A_i, B_i) represents a possible edge in the MST. 1. Create the Well-Separated Pairwise Decomposition

21 21 High Level Overview of GeoMST2. WSPD pairs: (A_1, B_1), (A_2, B_2), ..., (A_m, B_m). 1. Create the Well-Separated Pairwise Decomposition. 2. Take the pair (A_i, B_i) that corresponds to the shortest edge. 3. If the vertices of that edge are not in the same connected component, add the edge to the MST. Repeat Step 2.

22 22 A Well-Separated Pair (Callahan 1995). Let A and B be point sets in ℝ^d. Let R_A and R_B be their respective bounding hyper-rectangles. Define MargDistance(A,B) to be the minimum distance between R_A and R_B

23 23 A Well-Separated Pair (Cont). The point sets A and B are considered to be well-separated if: MargDistance(A,B) ≥ max{Diam(R_A), Diam(R_B)}
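A small sketch of the well-separated test for axis-aligned bounding hyper-rectangles. The helper names are illustrative, and Diam(R) is assumed here to mean the length of the box diagonal; the slides do not spell that out.

```python
import math

def bounding_box(points):
    """Axis-aligned bounding hyper-rectangle as (lower corner, upper corner)."""
    lo = tuple(min(coords) for coords in zip(*points))
    hi = tuple(max(coords) for coords in zip(*points))
    return lo, hi

def diam(box):
    # Assumption: the diameter of a box is the length of its diagonal.
    lo, hi = box
    return math.dist(lo, hi)

def marg_distance(box_a, box_b):
    """Minimum distance between two boxes (0 if they touch or overlap)."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = [
        max(la - hb, lb - ha, 0.0)  # per-dimension gap between the intervals
        for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)
    ]
    return math.hypot(*gaps)

def well_separated(a, b):
    ra, rb = bounding_box(a), bounding_box(b)
    return marg_distance(ra, rb) >= max(diam(ra), diam(rb))

a = [(0.0, 0.0), (1.0, 1.0)]
b = [(4.0, 0.0), (5.0, 1.0)]
c = [(1.5, 0.0), (2.5, 1.0)]
print(well_separated(a, b))  # True: gap 3.0, both diagonals about 1.41
print(well_separated(a, c))  # False: gap 0.5 is smaller than the diagonals
```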

24 24 A Well-Separated Pairwise Decomposition. Pair #1: ([0],[1]) Pair #2: ([0,1],[2]) Pair #3: ([0,1,2],[3,4]) Pair #4: ([3],[4]). The set of pairs {([0],[1]), ([0,1],[2]), ([0,1,2],[3,4]), ([3],[4])} forms a Well-Separated Pairwise Decomposition.

25 25 The Size of a WSPD. If there are n points, a WSPD can be constructed with O(n) pairs using a fair split tree (Callahan 1995). A WSPD: (A_1, B_1), (A_2, B_2), ..., (A_m, B_m)

26 26 High Level Overview of GeoMST2. WSPD pairs: (A_1, B_1), (A_2, B_2), ..., (A_m, B_m). 1. Create the Well-Separated Pairwise Decomposition. 2. Take the pair (A_i, B_i) that corresponds to the shortest edge. 3. If the vertices of that edge are not in the same connected component, add the edge to the MST. Repeat Step 2.

27 27 Bichromatic Closest Pair Distance. Given a pair of sets (A_i, B_i), the Bichromatic Closest Pair (BCP) Distance is the smallest distance between a point in A_i and a point in B_i

28 28 High Level Overview of GeoMST2. WSPD pairs: (A_1, B_1), (A_2, B_2), ..., (A_m, B_m). 1. Create the Well-Separated Pairwise Decomposition. 2. Take the pair (A_i, B_i) with the shortest BCP distance. 3. If A_i and B_i are not already connected, add the edge to the MST. Repeat Step 2.

29 29 GeoMST2 Example Start Current MST

30 30 GeoMST2 Example Iteration 1 Current MST

31 31 GeoMST2 Example Iteration 2 Current MST

32 32 GeoMST2 Example Iteration 3 Current MST

33 33 GeoMST2 Example Iteration 4 Current MST

34 34 High Level Overview of GeoMST2. WSPD pairs: (A_1, B_1), (A_2, B_2), ..., (A_m, B_m). 1. Create the Well-Separated Pairwise Decomposition. 2. Take the pair (A_i, B_i) with the shortest BCP distance. 3. If A_i and B_i are not already connected, add the edge to the MST. Repeat Step 2. Modification for CFF: if the BCP distance > ε, terminate
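The loop above can be sketched as follows, under simplifying assumptions. The WSPD is passed in rather than built from a fair split tree, the BCP of every pair is computed up front by brute force, and the "already connected" test is applied to the two endpoints of the BCP edge; the actual GeoMST2 implementation is more careful than this. The function names and the 1-D coordinates are hypothetical, chosen so that the four pairs from the earlier WSPD slide are well-separated.

```python
import heapq
import math

def bcp(points, a, b):
    """Brute-force bichromatic closest pair: (distance, u, v), u in a, v in b."""
    return min((math.dist(points[u], points[v]), u, v) for u in a for v in b)

def edges_from_wspd(points, pairs, eps=math.inf):
    """Kruskal-style loop over WSPD pairs, terminating when BCP distance > eps."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    heap = [bcp(points, a, b) for a, b in pairs]
    heapq.heapify(heap)  # priority queue keyed on BCP distance
    edges = []
    while heap:
        dist, u, v = heapq.heappop(heap)
        if dist > eps:
            break  # modification for CFF
        ru, rv = find(u), find(v)
        if ru != rv:  # endpoints not yet connected
            parent[ru] = rv
            edges.append((u, v))
    return edges

# Hypothetical coordinates for points 0..4 (not taken from the slides).
points = [(0.0,), (1.0,), (3.0,), (10.0,), (11.0,)]
pairs = [([0], [1]), ([0, 1], [2]), ([0, 1, 2], [3, 4]), ([3], [4])]
print(edges_from_wspd(points, pairs))           # [(0, 1), (3, 4), (1, 2), (2, 3)]
print(edges_from_wspd(points, pairs, eps=2.5))  # [(0, 1), (3, 4), (1, 2)]
```

With no threshold the loop returns the full spanning tree; with `eps=2.5` it stops before the length-7 link, leaving the clusters {0, 1, 2} and {3, 4}.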

35 35 Optimizations. We don't need the EMST; we just need to cluster all points that are within distance ε of each other. This allows two optimizations to the GeoMST2 code

36 36 High Level Overview of GeoMST2. WSPD pairs: (A_1, B_1), (A_2, B_2), ..., (A_m, B_m). 1. Create the Well-Separated Pairwise Decomposition. 2. Take the pair (A_i, B_i) with the shortest BCP distance. 3. If A_i and B_i are not already connected, add the edge to the MST. Repeat Step 2. Optimizations take place in Step 1

37 37 Optimization 1 Illustration

38 38 Optimization 1: ignore all links that are > ε. Every pair (A_i, B_i) in the WSPD becomes an edge unless it joins two already connected components. If MargDistance(A_i, B_i) > ε, then an edge of length ≤ ε cannot exist between a point in A_i and a point in B_i. Don't include such a pair in the WSPD

39 39 Optimization 2 Illustration

40 40 Optimization 2: join all elements that are within ε distance of each other. If the max distance separating the bounding hyper-rectangles of A_i and B_i is ≤ ε, then join all the points in A_i and B_i if they are not already connected. Do not add such a pair (A_i, B_i) to the WSPD
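The tests behind Optimization 1 and Optimization 2 can be sketched as one classification of a candidate pair, given only the two bounding hyper-rectangles. The function names and the labels "discard", "join", and "keep" are illustrative, not taken from the GeoMST2 code; boxes are (lower corner, upper corner) tuples.

```python
import math

def min_box_distance(box_a, box_b):
    """Smallest possible distance between a point in box_a and one in box_b."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = [
        max(la - hb, lb - ha, 0.0)
        for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)
    ]
    return math.hypot(*gaps)

def max_box_distance(box_a, box_b):
    """Largest possible distance between a point in box_a and one in box_b."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    spans = [
        max(ha - lb, hb - la)  # farthest apart the two intervals can be
        for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)
    ]
    return math.hypot(*spans)

def classify_pair(box_a, box_b, eps):
    if min_box_distance(box_a, box_b) > eps:
        return "discard"  # Optimization 1: no link of length <= eps can exist
    if max_box_distance(box_a, box_b) <= eps:
        return "join"     # Optimization 2: every cross link is <= eps
    return "keep"         # undecided: the pair stays in the decomposition

box_a = ((0.0, 0.0), (1.0, 1.0))
box_b = ((4.0, 0.0), (5.0, 1.0))
print(classify_pair(box_a, box_b, eps=2.0))  # discard (min distance is 3.0)
print(classify_pair(box_a, box_b, eps=4.0))  # keep
print(classify_pair(box_a, box_b, eps=6.0))  # join (max distance is about 5.1)
```

Only the "keep" pairs need a BCP computation, which is why both tests shrink the priority queue.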

41 41 Implications of the optimizations. Reduce the amount of time spent creating the WSPD. Reduce the number of pairs in the WSPD, thereby speeding up the GeoMST2 algorithm by reducing the size of the priority queue

42 42 Results. Ran step two algorithms on subsets of the Sloan Digital Sky Survey. Compared Kruskal, GeoMST2, and ε-clustering. 7 attributes: 4 colors, 2 sky coordinates, 1 redshift value

43 43 Results (GeoMST2 vs ε-Clustering vs Kruskal in 4D)

44 44 Results (GeoMST2 vs ε-Clustering in 3D)

45 45 Results (GeoMST2 vs ε-Clustering in 4D)

46 46 Results (Change in Time as ε changes for 4D data)

47 47 Results (Increasing Dimensions vs Time)

48 48 Conclusions. ε-clustering outperforms GeoMST2 by nearly an order of magnitude in higher dimensions. Combining the optimizations in both steps will yield an efficient algorithm for clustering against clutter on massive data sets

