Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching
Anshul Varma · Faisal Qureshi
Running Times (rounded up)
[Figure: table of asymptotic running times. Image source: Algorithm Design by Jon Kleinberg and Éva Tardos, p. 34]
Clustering Large Data Sets
There are three different ways in which a data set can be large:
◦ There can be a large number of elements in the data set
◦ Each element can have many features (many attributes)
◦ There can be many clusters to discover
Problem Statement
Traditional clustering algorithms are computationally expensive when clustering large data sets. Examples of such algorithms:
◦ Single linkage
◦ K-means
◦ Etc.
Single Linkage Clustering Algorithm
Start by placing each point in its own cluster: O(n)
Calculate and store the distance between each pair of clusters: O(n²)
While there are more than k clusters: O(n) iterations
  - Let A, B be the two closest clusters: O(n²)
  - Add cluster A ∪ B: O(n)
  - Remove clusters A and B: O(n)
  - Find the distance between A ∪ B and all other clusters: O(n²)
Time complexity: O(n³)
Space complexity: O(n²)
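The loop above can be sketched directly in code. This is a minimal, unoptimized illustration of the naive procedure (the function name `single_linkage` and the use of Euclidean distance are our own choices, not from the slides); the nested pair scan inside the while loop is what produces the O(n³) total.

```python
import math

def single_linkage(points, k):
    """Naive single-linkage clustering: merge the two closest
    clusters until only k clusters remain."""
    # Start by placing each point in its own cluster: O(n)
    clusters = [[p] for p in points]

    def cluster_dist(c1, c2):
        # Single-linkage distance: minimum distance over all
        # cross-cluster pairs of points
        return min(math.dist(p, q) for p in c1 for q in c2)

    # Each iteration scans all cluster pairs (O(n^2)), and the
    # loop runs O(n) times, giving the O(n^3) total on the slide.
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Add cluster A ∪ B, remove clusters A and B
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```

A production implementation would cache pairwise distances instead of recomputing them, which trades the O(n²) space noted above for the saved recomputation.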
Single Linkage Clustering Output
Better Approach?
Can we somehow cluster data points more efficiently?
Can we somehow apply clustering without computing distances between every pair of points?
Can we somehow disregard pairs of points that will never lie in the same cluster?
Efficient Clustering Using Canopies
First stage: create canopies
◦ Use a quick, cheap distance measure to group points into (possibly overlapping) canopies
Second stage: use a traditional clustering algorithm
◦ Compute the expensive distance measure
◦ Only for pairs of points that lie in the same canopy
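The first stage above can be sketched as follows. This is a hedged illustration assuming Euclidean distance as the cheap measure and two thresholds T1 > T2 (the function name `make_canopies` and the parameter names are ours, not from the slides): a remaining point is picked as a canopy center, every point within T1 joins the canopy, and points within the tighter T2 are removed from consideration as future centers.

```python
import math

def make_canopies(points, t1, t2):
    """Group points into possibly overlapping canopies using a
    cheap distance, with loose threshold t1 and tight threshold t2."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        # Pick any remaining point as the next canopy center
        # (the canopy method typically picks at random)
        center = remaining[0]
        # Every point within t1 of the center joins this canopy;
        # points may belong to several canopies (overlap)
        canopy = [p for p in points if math.dist(center, p) < t1]
        canopies.append((center, canopy))
        # Points within t2 of the center can no longer be centers
        remaining = [p for p in remaining
                     if math.dist(center, p) >= t2]
    return canopies
```

The expensive second-stage distance is then computed only between points that share at least one canopy, which is what avoids the all-pairs cost.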
Canopy Based Clustering Example
End Goal
Implement an efficient clustering algorithm using the concept of canopies:
◦ Using agglomerative clustering (such as linkage-based)
◦ Using expectation-maximization clustering (such as k-means)
Evaluate its performance on a large data set
◦ Such as text data
◦ Reference-matching data
Compare the performance with traditional clustering algorithms
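The expected saving behind the comparison above can be illustrated by counting distance computations: with canopies, the expensive measure is needed only for pairs that share a canopy. The toy data and the helper `expensive_pairs` below are hypothetical, for illustration only.

```python
from itertools import combinations

def expensive_pairs(canopies):
    """Return the set of point pairs that share at least one
    canopy; only these need the expensive distance measure."""
    pairs = set()
    for members in canopies:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs

# Toy example: point ids 0..5 in two overlapping canopies
canopies = [[0, 1, 2, 3], [3, 4, 5]]
n = 6
all_pairs = n * (n - 1) // 2      # naive approach: every pair
shared = expensive_pairs(canopies)  # canopy approach: fewer pairs
```

Here the naive approach computes 15 pairwise distances while the canopy approach computes 9; on large, well-separated data sets the gap grows much larger, which is the performance effect the project sets out to measure.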
Thanks! Any questions?