Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching
ANSHUL VARMA, FAISAL QURESHI

Running Times (rounded up)
Image source: Algorithm Design by Jon Kleinberg and Éva Tardos, page 34

Clustering Large Data Sets
• There are three different ways in which a data set can be large:
  ◦ There can be a large number of elements in the data set
  ◦ Each element can have many features (many attributes)
  ◦ There can be many clusters to discover

Problem Statement
• Traditional clustering algorithms are computationally expensive on large data sets.
• Examples of traditional algorithms:
  ◦ Single Linkage
  ◦ K-means
  ◦ Etc.

Single Linkage Clustering Algorithm
Start by placing each point in its own cluster: O(n)
Calculate and store the distance between each pair of clusters: O(n^2)
While there are more than k clusters: O(n)
  - Let A, B be the two closest clusters: O(n^2)
  - Add cluster A ∪ B: O(n)
  - Remove clusters A and B: O(n)
  - Find the distance between A ∪ B and all other clusters: O(n^2)
Time Complexity: O(n^3)
Space Complexity: O(n^2)
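The steps on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not the presenters' implementation: the function name `single_linkage`, the use of Euclidean distance via `math.dist`, and the dictionary-based distance store are all assumptions made for the example. The merge step uses the usual single-linkage rule, where the distance from the merged cluster to any other cluster C is min(d(A, C), d(B, C)).

```python
import math
from itertools import combinations


def single_linkage(points, k):
    """Naive single-linkage clustering down to k clusters (assumes 1 <= k <= len(points))."""
    # Each point starts in its own cluster, stored as a list of point indices.
    clusters = {i: [i] for i in range(len(points))}
    # Store the distance between every pair of clusters: O(n^2) space.
    # Keys are (smaller_id, larger_id); Euclidean distance is an assumption.
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}
    next_id = len(points)

    while len(clusters) > k:
        # Scan all stored pairs for the two closest clusters A and B.
        a, b = min(dist, key=dist.get)
        merged = clusters.pop(a) + clusters.pop(b)
        # Distance from A ∪ B to every other cluster C is the smaller of
        # the stored distances d(A, C) and d(B, C).
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = min(d_ac, d_bc)
        del dist[(a, b)]
        clusters[next_id] = merged
        next_id += 1

    # Return clusters as sorted lists of point indices.
    return sorted(sorted(members) for members in clusters.values())
```

Because every iteration rescans the full table of pairwise distances, this sketch shows the same cubic time and quadratic space behaviour listed on the slide.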

Single Linkage Clustering Output

Better Approach?
• Can we cluster data points more efficiently?
• Can we cluster without computing the distance between every pair of points?
• Can we disregard pairs of data points that will never lie in the same cluster?

Efficient Clustering Using Canopies
• First Stage: Create canopies
  ◦ Compute a quick and cheap distance matrix
• Second Stage: Use a traditional clustering algorithm
  ◦ Compute the expensive distance matrix
  ◦ Only for points that lie in the same canopy or in overlapping canopies
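A rough Python sketch of the two stages is given below. The slide does not say how canopies are formed, so the example assumes the two-threshold scheme commonly used for canopy clustering: a loose threshold `t1` decides canopy membership and a tight threshold `t2` (with `t2 <= t1`) decides which points may no longer start a new canopy. The names `make_canopies`, `candidate_pairs`, `cheap_dist`, `t1`, and `t2` are hypothetical, and picking the lowest remaining index as the next center is a simplification for reproducibility.

```python
from itertools import combinations


def make_canopies(points, cheap_dist, t1, t2):
    """Stage 1: build possibly overlapping canopies (assumes t1 >= t2 >= 0)."""
    remaining = set(range(len(points)))  # points that may still become centers
    canopies = []
    while remaining:
        center = min(remaining)  # assumption: deterministic choice of center
        # Every point within the loose threshold t1 joins this canopy.
        canopy = [i for i in range(len(points))
                  if cheap_dist(points[center], points[i]) <= t1]
        canopies.append(canopy)
        # Points within the tight threshold t2 cannot start another canopy.
        remaining -= {i for i in canopy
                      if cheap_dist(points[center], points[i]) <= t2}
    return canopies


def candidate_pairs(canopies):
    """Stage 2 input: only pairs sharing a canopy need the expensive distance."""
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs
```

For example, with six one-dimensional points and absolute difference as the cheap distance, the expensive distance would be needed for only a subset of the 15 possible pairs:

```python
points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
canopies = make_canopies(points, lambda a, b: abs(a - b), t1=2, t2=1)
pairs = candidate_pairs(canopies)
```

A traditional algorithm such as single linkage or k-means could then treat any pair outside `pairs` as infinitely far apart.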

Canopy Based Clustering Example

End Goal
• Implement an efficient clustering algorithm using the concept of canopies:
  ◦ Using Agglomerative Clustering (such as linkage based)
  ◦ Using Expectation Maximization Clustering (such as k-means)
• Measure its performance on a large data set
  ◦ Such as text data
  ◦ Reference matching data
• Compare the performance with traditional clustering algorithms

Thanks!
• Any questions?
