Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching ANSHUL VARMA FAISAL QURESHI.


1 Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching ANSHUL VARMA FAISAL QURESHI

2 Running Times (rounded up) Image source: Algorithm Design by Jon Kleinberg and Eva Tardos, Page 34

3 Clustering Large Data Sets
• A data set can be large in three different ways:
◦It can contain a large number of elements
◦Each element can have many features (attributes)
◦There can be many clusters to discover

4 Problem Statement
• Traditional clustering algorithms are computationally expensive on large data sets.
• Examples of traditional algorithms:
◦Single linkage
◦K-means
◦Etc.

5 Single Linkage Clustering Algorithm
• Start by placing each point in its own cluster: O(n)
• Calculate and store the distance between each pair of clusters: O(n^2)
• While there are more than k clusters (O(n) iterations):
◦Let A, B be the two closest clusters: O(n^2)
◦Add cluster A ∪ B: O(n)
◦Remove clusters A and B: O(n)
◦Find the distance between A ∪ B and all other clusters: O(n^2)
• Time complexity: O(n^3), from O(n) iterations at O(n^2) each
• Space complexity: O(n^2), for the stored pairwise distances
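The steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the function name `single_linkage`, the use of Euclidean distance via `math.dist`, and the dictionary that stores the pairwise distances are all assumptions made for the example.

```python
import math
from itertools import combinations


def single_linkage(points, k):
    """Naive single-linkage clustering; returns clusters as lists of point indices."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(points)
    # Each point starts in its own cluster, keyed by a cluster id.
    clusters = {i: [i] for i in range(n)}
    # Store the distance between every pair of clusters: O(n^2) time and space.
    dist = {
        frozenset((i, j)): math.dist(points[i], points[j])
        for i, j in combinations(range(n), 2)
    }
    while len(clusters) > k:
        # Find the two closest clusters by scanning all stored distances.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        # Merge B into A (A now stands for A ∪ B) and drop B.
        clusters[a].extend(clusters.pop(b))
        del dist[pair]
        # Single linkage: the distance from A ∪ B to another cluster C is
        # the smaller of the distances from A to C and from B to C.
        for c in clusters:
            if c == a:
                continue
            d_bc = dist.pop(frozenset((b, c)))
            key = frozenset((a, c))
            dist[key] = min(dist[key], d_bc)
    return sorted(sorted(members) for members in clusters.values())
```

Calling `single_linkage(points, 3)` on a list of coordinate tuples returns three lists of point indices. The scan over all stored distances inside the loop is what makes each iteration quadratic.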

6 Single Linkage Clustering Output

7 Better Approach?
• Can we somehow cluster data points more efficiently?
• Can we somehow apply clustering without computing distances for each point?
• Can we somehow disregard data points that will never lie in the same cluster?

8 Efficient Clustering Using Canopies
• First stage: create canopies
◦Compute a quick, cheap approximate distance between points and use it to group them into overlapping subsets (canopies)
• Second stage: run a traditional clustering algorithm
◦Compute the expensive distance only between points that share at least one canopy
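A rough sketch of the two stages is given below. The slide does not say how canopies are built, so the two-threshold scheme here (a loose threshold `t1` and a tight threshold `t2`, as commonly described for canopy clustering) is an assumption, as are the names `make_canopies`, `candidate_pairs`, and `cheap_dist`.

```python
from itertools import combinations


def make_canopies(points, cheap_dist, t1, t2):
    """Stage 1: group point indices into possibly overlapping canopies.

    Assumed scheme: t1 is the loose threshold for canopy membership and
    t2 (with 0 <= t2 <= t1) is the tight threshold below which a point
    can no longer become the center of a new canopy.
    """
    if not 0 <= t2 <= t1:
        raise ValueError("thresholds must satisfy 0 <= t2 <= t1")
    candidates = list(range(len(points)))
    canopies = []
    while candidates:
        center = candidates[0]
        # Cheap distance from the chosen center to every point.
        d = [cheap_dist(points[center], p) for p in points]
        # Everything within the loose threshold joins this canopy.
        canopies.append([i for i in range(len(points)) if d[i] <= t1])
        # Points within the tight threshold stop being candidate centers.
        candidates = [i for i in candidates if d[i] > t2]
    return canopies


def candidate_pairs(canopies):
    """Stage 2 input: index pairs (i, j), i < j, that share at least one canopy."""
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(canopy, 2))
    return pairs
```

The expensive distance then needs to be evaluated only for the pairs returned by `candidate_pairs`; every other pair is treated as too far apart to ever be merged. How much work this saves depends on the thresholds and on how well the cheap distance approximates the expensive one.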

9 Canopy Based Clustering Example

10 End Goal
• Implement an efficient clustering algorithm using the concept of canopies:
◦Using agglomerative clustering (such as linkage-based methods)
◦Using expectation-maximization clustering (such as k-means)
• Measure the effective performance on a large data set:
◦Such as text data
◦Reference matching data
• Compare the performance with traditional clustering algorithms

11 Thanks! Any questions?

