1 Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology
Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching
Authors: Andrew McCallum, Kamal Nigam and Lyle H. Ungar (ACM 2000)
Advisor: Dr. Hsu
Graduate: Chun Kai Chen

2 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Outline
• Motivation
• Objective
• Introduction
• Efficient Clustering with Canopies
• Experimental Results
• Conclusions
• Personal Opinion

3 Motivation
• Traditional clustering algorithms become computationally expensive when the data set to be clustered is large, which can mean any of:
  ─ a large number of elements in the data set
  ─ many features
  ─ many clusters to discover

4 Objective
• Introduce a clustering technique that stays efficient when the problem is large in all three of these ways at once
• Using canopies for clustering can increase computational efficiency without losing any clustering accuracy, under reasonable assumptions about the cheap distance measure

5 Introduction
• Divide the clustering process into two stages:
  ─ Stage 1: efficiently divide the data into overlapping subsets, called canopies (increases computational efficiency)
  ─ Stage 2: complete the clustering by running a standard clustering algorithm (reduces the number of clusters)

6 Efficient Clustering with Canopies
• The key idea of the canopy algorithm
  ─ greatly reduce the number of distance computations required for clustering
  ─ by first cheaply partitioning the data into overlapping subsets
  ─ and then measuring distances only among pairs of data points that belong to a common subset
• It uses two different sources of information
  ─ a cheap, approximate similarity measure
  ─ an expensive, accurate similarity measure
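To make the two kinds of measure concrete, here is a minimal sketch of one plausible pairing for reference matching: counting shared words through an inverted index as the cheap measure, and character-level edit distance as the expensive one. The function names and the sample strings are made up for illustration and are not taken from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in set(text.lower().split()):
            index[token].add(doc_id)
    return index

def shared_token_counts(query_id, docs, index):
    """Cheap measure: number of tokens each other document shares with the query.

    Only documents that share at least one token are ever touched.
    """
    counts = defaultdict(int)
    for token in set(docs[query_id].lower().split()):
        for other_id in index[token]:
            if other_id != query_id:
                counts[other_id] += 1
    return dict(counts)

def edit_distance(a, b):
    """Expensive measure: Levenshtein distance, O(len(a) * len(b)) per pair."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Made-up strings standing in for bibliographic references.
docs = [
    "smith jones clustering 2000",
    "j smith and r jones clustering of large data",
    "neural networks tutorial",
]
index = build_inverted_index(docs)
print(shared_token_counts(0, docs, index))
print(edit_distance(docs[0], docs[1]))
```

The cheap measure never looks at document 2 when the query is document 0, because the two share no tokens; the edit distance would have to be computed in full for every pair.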

7 Divide the clustering process into two stages
• First stage
  ─ use the cheap distance measure to create some number of overlapping subsets, called "canopies"
• Second stage
  ─ execute a traditional clustering algorithm
  ─ using the accurate distance measure
  ─ but with the restriction that the distance between two points is not calculated if they never appear in the same canopy
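The second-stage restriction can be sketched as a set of candidate pairs: only pairs that share at least one canopy are ever passed to the accurate distance measure. This is an illustration only; the function name `candidate_pairs` and the hard-coded canopies are hypothetical.

```python
from itertools import combinations

def candidate_pairs(canopies):
    """Return every index pair (i, j) with i < j that shares at least one canopy."""
    pairs = set()
    for canopy in canopies:
        # A pair that appears in several canopies is stored only once.
        pairs.update(combinations(sorted(set(canopy)), 2))
    return pairs

# Hypothetical canopies over six points, given as lists of point indices.
canopies = [[0, 1, 2], [0, 1, 2, 3], [4, 5]]
pairs = candidate_pairs(canopies)

n = 6
print(len(pairs), "of", n * (n - 1) // 2, "possible pairs need the accurate distance")
```

For these canopies, 7 of the 15 possible pairs remain; a pair such as (3, 4) is never compared because the two points never share a canopy.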

8 Create canopies
• Start with a list of the data points in any order and two distance thresholds, T1 and T2, where T1 > T2
  ─ Pick a point off the list and approximately measure its distance to all other points (this is extremely cheap with an inverted index)
  ─ Put all points that are within distance threshold T1 into a canopy
  ─ Remove from the list all points that are within distance threshold T2
• Repeat until the list is empty
• Figure 1 shows some canopies that were created by this procedure
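The procedure above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it always picks the first point remaining on the list, it draws canopy members from all points (including ones already removed from the list), and it uses a plain loop where a real system would use an inverted index. The name `make_canopies` and the 1-D sample data are hypothetical.

```python
def make_canopies(points, cheap_distance, t1, t2):
    """Build overlapping canopies; returns a list of lists of point indices."""
    if t1 <= t2:
        raise ValueError("T1 must be greater than T2")
    remaining = list(range(len(points)))  # indices that may still become centers
    canopies = []
    while remaining:
        center = remaining[0]  # assumption: take the first point left on the list
        dist = [cheap_distance(points[center], p) for p in points]
        # Every point within the loose threshold T1 joins this canopy.
        canopies.append([i for i, d in enumerate(dist) if d <= t1])
        # Points within the tight threshold T2 (and the center itself)
        # leave the list and can no longer start a canopy of their own.
        remaining = [i for i in remaining if i != center and dist[i] > t2]
    return canopies

# Hypothetical 1-D data with absolute difference as the cheap distance.
points = [0.0, 1.0, 2.5, 4.0, 10.0, 11.0]
canopies = make_canopies(points, lambda a, b: abs(a - b), t1=3.0, t2=2.0)
print(canopies)  # [[0, 1, 2], [0, 1, 2, 3], [4, 5]]
```

Points 0, 1 and 2 fall into two canopies, which is the overlap the method relies on: a point near a canopy boundary can still be compared with neighbors on either side.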

9 Example

10 Canopies with Greedy Agglomerative Clustering
• Greedy agglomerative clustering (GAC) groups items together based on similarity
• A standard GAC implementation must apply the distance function O(n^2) times to calculate all pairwise distances between items
• A canopies-based implementation of GAC can drastically reduce the required number of comparisons
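A rough sketch of how a canopy restriction cuts the number of comparisons in agglomerative clustering follows. It makes several assumptions that the slides do not state: single-link merging, a fixed distance threshold as the stopping rule, and a union-find structure in place of an explicit merge loop. The name `canopy_gac` and the sample data are hypothetical.

```python
from itertools import combinations

def canopy_gac(points, canopies, distance, max_merge_distance):
    """Single-link agglomerative clustering restricted to pairs sharing a canopy.

    Returns (clusters, number_of_distance_evaluations).
    """
    # The expensive distance is evaluated only for pairs inside a common canopy.
    pairs = {pair for c in canopies for pair in combinations(sorted(set(c)), 2)}
    scored = sorted((distance(points[i], points[j]), i, j) for i, j in pairs)

    parent = list(range(len(points)))  # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Greedily merge the closest remaining pair until the threshold is exceeded.
    for d, i, j in scored:
        if d > max_merge_distance:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values()), len(pairs)

# Hypothetical data: six 1-D points and three canopies over their indices.
points = [0.0, 1.0, 2.5, 4.0, 10.0, 11.0]
canopies = [[0, 1, 2], [0, 1, 2, 3], [4, 5]]
clusters, evaluations = canopy_gac(points, canopies, lambda a, b: abs(a - b), 1.2)
print(clusters)     # [[0, 1], [2], [3], [4, 5]]
print(evaluations)  # 7
```

Without canopies, the same six points would need 6 * 5 / 2 = 15 distance evaluations; here only 7 are made. The saving grows with the data set when canopies stay small relative to n.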

11 Experimental Results

12 The error and time costs of different methods of clustering references

13 The accuracy of the clustering

14 Conclusions
• Canopies provide a principled approach to clustering large data sets efficiently
• The canopy approach is widely applicable
• The authors demonstrated the success of the canopies approach on a reference matching problem

15 Personal Opinion
• The problem of clustering high-dimensional data sets will become more and more important