1 A System for Outlier Detection and Cluster Repair Ying Liu Dr. Sprague Oct 21, 2005.

1 1 A System for Outlier Detection and Cluster Repair. Ying Liu, Dr. Sprague. Oct 21, 2005

2 2 A data set

3 3 Clustering algorithms can generate bad clusters: hMETIS (k=6)

4 4 Clustering algorithms can generate bad clusters: hMETIS (k=20)

5 5 BIRCH

6 6

7 7 Clustering algorithms can generate bad clusters: BIRCH (k=20)

8 8 Factors affecting clustering results: outliers; inappropriate parameter values; drawbacks of the clustering algorithms themselves.

9 9 Factors affecting outlier detection results: distributions; the boundary between an outlier group and a microcluster; nested outliers.

10 10 Two steps of cluster repair, applied to clusters generated by a clustering algorithm. (1) Outlier/outlier group detection for each cluster: separate points that are not supposed to be together. (2) Merge density-connected points, working from the outlier detection of the different clusters: merge similar points from different clusters that should be together.

11 11 Step 1 of Cluster Repair: Outlier Detection and Evaluation by Network Flow

12 12 Network Flow: Maximum Flow/Minimum Cut. Ford-Fulkerson (1962). The maximum flow problem is to find a flow f for which the total flow is maximum. The total flow can be measured at the sink, or it can be measured at any cut separating the source from the sink.

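13 13 Outlier detection: Maximum flow/Minimum cut. [Figure: flow network on vertices s, a, b, c, d, t with edge labels (flow/capacity) 19/19, 12/13, 7/10, 9/9, 7/7, 12/12, 28/30, 3/3, 10/11.] Flow along each path: s->a->b->t: 12; s->a->c->d->b->t: 7; s->c->b->t: 9; s->c->d->t: 3. Maximum flow = minimum cut = 12 + 3 + 9 + 7 = 31.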
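The example on this slide can be checked with a short max-flow routine. The sketch below uses the Edmonds-Karp variant (shortest augmenting paths), which is a stand-in and not necessarily the implementation used in the talk. The figure itself is lost, so the mapping of capacities to edges in `slide_network` is an assumption, chosen to be consistent with the four listed path flows; the names `max_flow` and `slide_network` are illustrative.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along shortest residual paths."""
    # Build residual capacities, adding zero-capacity reverse edges.
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Walk back from the sink to collect the path edges.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        # Push the bottleneck amount along the path.
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck

# Assumed edge-to-capacity assignment (the slide figure is not available).
slide_network = {
    ("s", "a"): 19, ("s", "c"): 12,
    ("a", "b"): 13, ("a", "c"): 10,
    ("c", "b"): 9,  ("c", "d"): 11,
    ("d", "b"): 7,  ("d", "t"): 3,
    ("b", "t"): 30,
}
print(max_flow(slide_network, "s", "t"))  # 31
```

Under this assignment the two edges leaving s have capacities 19 and 12, so the cut around the source already bounds the flow at 31, matching the value on the slide.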

14 14 Outlier detection by network flow
1. Compute the k nearest neighbors of each point in a cluster of data.
2. For the data of a cluster, set up the network.
3. Begin at a random vertex as source/sink s, and choose its farthest vertex as the sink/source t.
4. Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and take the smaller side as the candidate outlier or outlier group.
5. Remove the candidate outlier or outlier group from the graph.
6. Select the next source and go back to step 3 until the stop criterion is met.
7. Adjusting: coarsen the graph and adjust the maximum flow.
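One round of steps 4 and 5 can be sketched as follows: compute a maximum flow on an undirected network, read off the minimum cut from the residual graph, and return the smaller side as the candidate. This is a minimal illustration under assumptions: the function name `min_cut_candidate`, the toy capacities, and the choice of Edmonds-Karp are not from the talk, and source/sink selection (step 3) and coarsening (step 7) are left out.

```python
from collections import deque

def min_cut_candidate(edges, source, sink):
    """Return (max flow, smaller side of a minimum s-t cut) for an undirected graph."""
    residual = {}
    for (u, v), c in edges.items():
        # An undirected edge carries capacity c in both directions.
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v][u] = residual[v].get(u, 0) + c

    def reachable_from_source():
        # Breadth-first search over edges with leftover capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break  # no augmenting path left: the flow is maximum
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        flow += push

    # Vertices still reachable from the source form one side of the cut.
    source_side = set(parent)
    sink_side = set(residual) - source_side
    return flow, min(source_side, sink_side, key=len)

# Toy network: a tightly connected triangle 0-1-2 with vertex 3 loosely attached.
toy = {(0, 1): 10, (1, 2): 10, (0, 2): 10, (2, 3): 1}
print(min_cut_candidate(toy, 0, 3))  # (1, {3})
```

A small maximum flow here signals a weak connection, which is why the loosely attached vertex 3 comes out as the candidate outlier.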

15 15 Loosely connected clusters [figure: clusters No. 20, 19, 10, 1 and 2]

16 16 Experiments (setting up the network). Cluster No. 20 has 591 points; with 7 nearest neighbors the network has 591 points and 5028 edges.

17 17 Setting up the network: compute the k nearest neighbors of each point and make sure all vertices are connected. Compute the capacity between two vertices from the distance between them.
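A minimal sketch of this setup step follows. The slide does not give the capacity formula, so the inverse-distance capacity used here is an assumption, as are the function names `knn_network` and `is_connected`. The connectivity check shows one way to verify that k is large enough to link all vertices.

```python
import math

def knn_network(points, k):
    """Build an undirected k-nearest-neighbor graph with distance-based capacities."""
    edges = {}
    for i, p in enumerate(points):
        # Sort the other points by Euclidean distance to p and keep the k closest.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in others[:k]:
            key = (min(i, j), max(i, j))
            # Assumed capacity: inverse distance, so close pairs are strongly linked.
            # Coincident points get infinite capacity.
            edges[key] = 1.0 / d if d > 0 else float("inf")
    return edges

def is_connected(n, edges):
    """Check by depth-first search that all n vertices are in one component."""
    adjacency = {i: set() for i in range(n)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adjacency[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == n

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
print(is_connected(len(points), knn_network(points, 1)))  # False
print(is_connected(len(points), knn_network(points, 2)))  # True
```

In this toy data k = 1 leaves the two groups disconnected, so k would be raised until the check passes.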

18 18 Experiment result
Loop      Max Flow
No. 4     1267
No. 1     1269
No. 3     3256
No. 5     3937
No. 8     5939
No. 7     7717
No. 14    8962
No. 9     10148
No. 10    16194
No. 2     16533
No. 13    17793
No. 6     25378
No. 11    63797
No. 12    160515
No. 15    359560
No. 17    427908
No. 16    1307310

19 19 Experiment (adjusting): 18 vertices, 66 edges
Loop      Cut               Max Flow
No. 1     vertex 4          1267
No. 2     vertex 1          1269
No. 3     vertex 3          3256
No. 4     vertex 5          3937
No. 5     vertex 8          5939
No. 6     vertex 7, 9, 10   16531
No. 7     vertex 2          16533
No. 8     vertex 13         17793
No. 9     vertex 14         20261
No. 10    vertex 6          25378
No. 11    vertex 11         52498
No. 12    vertex 12         160515
No. 13    vertex 15         359560
No. 14    vertex 17         427908
No. 15    vertex 16         1307310

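20 20 Stop criteria: (1) users input the number of outliers or outlier groups they want; (2) use the maximum flow as the stop condition: stop based on a comparison of D_flow with D_avg, where D_avg is the average distance of the remaining data (the relation symbol between D_flow and D_avg did not survive in this transcript).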

21 21 Outlier Degree

22 22 Experiment (20 clusters) [figure: clusters labeled No. 1 through No. 20]

23 23 Step 2 of Cluster Repair: Merge Density-Connected Points

24 24 Merge density-connected microclusters by flexible parameters of DBSCAN [figure: clusters labeled No. 1 through No. 20]

25 25 Flexible parameters of DBSCAN: get the average distance d of every microcluster from each point’s k nearest neighbors. [Figures: No. 20 cluster, No. 19 cluster, No. 10 cluster]
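The per-microcluster average distance can be sketched as below. The slide does not say whether a point's k nearest neighbors are searched within its own microcluster or across the whole data set; this sketch assumes the former. The function name `microcluster_eps` and the toy data are illustrative.

```python
import math

def microcluster_eps(points, labels, k):
    """Average distance from each point to its k nearest neighbors, per microcluster."""
    totals, counts = {}, {}
    for i, p in enumerate(points):
        # Assumption: neighbors are taken from the same microcluster only.
        same = [q for j, q in enumerate(points) if j != i and labels[j] == labels[i]]
        nearest = sorted(math.dist(p, q) for q in same)[:k]
        totals[labels[i]] = totals.get(labels[i], 0.0) + sum(nearest)
        counts[labels[i]] = counts.get(labels[i], 0) + len(nearest)
    # Microclusters with a single point have no neighbors and are skipped.
    return {c: totals[c] / counts[c] for c in totals if counts[c] > 0}

points = [(0, 0), (1, 0), (2, 0), (10, 0), (14, 0)]
labels = ["A", "A", "A", "B", "B"]
print(microcluster_eps(points, labels, k=1))  # {'A': 1.0, 'B': 4.0}
```

The dense microcluster A gets a small d and the sparse microcluster B a larger one, which is the property the flexible Eps relies on.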

26 26 DBSCAN

27 27 DBSCAN

28 28 DBSCAN with flexible Eps: the original DBSCAN uses the least dense ε-neighborhood as a global Eps and sets MinPts = 4. We use the average distance of each microcluster as its Eps. When running DBSCAN, points in different microclusters use different Eps values.

29 29 Kd tree: use a kd tree to find buckets that contain more than two microclusters coming from different clusters of the original clustering result.
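A minimal sketch of DBSCAN with a per-point Eps follows. It assumes each point's neighborhood query uses that point's own Eps (taken from its microcluster), which makes the neighborhood relation asymmetric; the talk does not spell out how that case is handled, so this is one plausible reading. The names `dbscan_flexible_eps` and `eps_of_point` are illustrative, and the brute-force neighbor search stands in for an index structure.

```python
import math

NOISE = -1

def dbscan_flexible_eps(points, eps_of_point, min_pts=4):
    """DBSCAN in which each point uses its own Eps for the neighborhood query."""
    def neighbors(i):
        # Includes i itself, as in the usual DBSCAN neighborhood definition.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps_of_point[i]]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(reach)
        cluster += 1
    return labels

dense = [(0, 0), (1, 0), (0, 1), (1, 1)]
sparse = [(10, 0), (13, 0), (10, 3), (13, 3)]
points = dense + sparse
print(dbscan_flexible_eps(points, [1.5] * 4 + [4.5] * 4))  # [0, 0, 0, 0, 1, 1, 1, 1]
print(dbscan_flexible_eps(points, [1.5] * 8))              # [0, 0, 0, 0, -1, -1, -1, -1]
```

With a single global Eps sized for the dense group, the sparse group is labeled noise; with per-microcluster Eps both groups are recovered.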

30 30 No. 125 bucket

31 31 MinPts = 4 for dim = 2. Search the rectangle (x+Eps, y+Eps, x-Eps, y-Eps) around point p with an R* tree. When Eps equals the average distance between points, it is very likely that the neighborhood of p contains 3 extra points besides p itself.

32 32 No. 125 bucket (a) MinPts = 5 (b) MinPts = 5

33 33 Other controversial buckets: No. 119, No. 113, and No. 114. If x% of the points of a microcluster are merged into another microcluster, then merge these two microclusters. Since the proportion of merged points of these microclusters in these buckets exceeds 90%, microclusters 24 and 28 are merged.
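The merge rule can be sketched as a simple proportion test. The representation is an assumption: `before` holds each point's microcluster id before the DBSCAN pass and `after` holds the microcluster it was attached to afterwards. The function name and the example ids and counts are illustrative, not data from the talk.

```python
def microclusters_to_merge(before, after, threshold=0.9):
    """Return pairs (src, dst) where more than `threshold` of src's points moved to dst."""
    sizes, moved = {}, {}
    for old, new in zip(before, after):
        sizes[old] = sizes.get(old, 0) + 1
        if new != old:
            moved[(old, new)] = moved.get((old, new), 0) + 1
    return sorted(
        (old, new) for (old, new), n in moved.items()
        if n / sizes[old] > threshold
    )

# Hypothetical example: all 10 points of microcluster 24 end up in microcluster 28.
before = [24] * 10 + [28] * 5
after = [28] * 10 + [28] * 5
print(microclusters_to_merge(before, after))  # [(24, 28)]
```

The threshold plays the role of x% on the slide; 0.9 is used here only because the slide reports a proportion above 90%.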

34 34 No. 20, 19 and 10 cluster repair

35 35 After repair 20 clusters

36 36 Conclusion. Repair clusters from two aspects: remove points that are loosely connected to the clusters by outlier/outlier group detection, and merge points that are density-connected by DBSCAN with flexible Eps. Analyze microclusters of interest. Found the relationship among outliers, outlier groups, and main clusters.

37 37 Questions. MinPts in high-dimensional data: for 3-d, MinPts = 5; for 4-d, MinPts = 6? For some outlier-group microclusters, MinPts could be very high, because border points include points of neighboring dense microclusters within their Eps. How can each microcluster’s MinPts be used as a reference?

