CSE572, CBS598: Data Mining by H. Liu


1 Clustering
Basic concepts with simple examples
Categories of clustering methods
Challenges
11/27/2018 CSE572, CBS598: Data Mining by H. Liu

2 What is clustering?
The process of grouping a set of physical or abstract objects into classes of similar objects.
It is also called unsupervised learning.
It is a common and important task with many applications.
Examples of clusters
Examples where we need clustering

3 Differences from Classification
How different?
Which one is more difficult as a learning problem?
How do we cluster?
How do we measure the results of clustering?
With/without class labels

4 Major clustering methods
Partitioning methods
  k-Means (and EM), k-Medoids
Hierarchical methods
  agglomerative, divisive, BIRCH
Similarity and dissimilarity of points in the same cluster and from different clusters
Distance measures between clusters
  minimum, maximum
  Means of clusters
  Average between clusters

5 Clustering -- Example 1
For simplicity, 1-dimensional objects and k=2.
Objects: 1, 2, 5, 6, 7
K-means:
  Randomly select 5 and 6 as centroids;
  => Two clusters {1,2,5} and {6,7}; meanC1=8/3, meanC2=6.5
  => {1,2}, {5,6,7}; meanC1=1.5, meanC2=6
  => no change.
Aggregate dissimilarity (sum of squared distances to the cluster means) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
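
The trace in Example 1 can be reproduced with a few lines of Python. This is a minimal sketch of the usual assign-then-update iteration, not the course's reference implementation; the names `kmeans_1d` and `sse` are hypothetical, and the initial centroids 5 and 6 are passed in explicitly rather than drawn at random so that the run matches the slide.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Assign/update iterations on 1-D data; returns (clusters, centroids)."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (ties go to the lower-indexed centroid).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster;
        # an empty cluster keeps its old centroid.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # no change, so the clustering has converged
        centroids = new_centroids
    return clusters, centroids


def sse(clusters, centroids):
    """Aggregate dissimilarity: sum of squared distances to the cluster means."""
    return sum((p - m) ** 2 for c, m in zip(clusters, centroids) for p in c)


clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters)                  # [[1, 2], [5, 6, 7]]
print(centroids)                 # [1.5, 6.0]
print(sse(clusters, centroids))  # 2.5
```

Starting from other centroids can lead to a different local optimum, which is one reason k-means is described as a heuristic.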

6 Issues with k-means
A heuristic method
Sensitive to outliers
  How to prove it
Determining k
Crisp clustering
  EM
Not to be confused with k-NN
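
The sensitivity to outliers can be illustrated (though not proved) with a small numeric experiment. The sketch below compares the mean used by k-means with the medoid used by k-medoids on the cluster {5, 6, 7}; the extra value 100 is a hypothetical outlier added for illustration, and the function names are assumptions.

```python
def mean(cluster):
    return sum(cluster) / len(cluster)


def medoid(cluster):
    # The medoid is the cluster member with the smallest total distance
    # to all members (the first such member wins ties).
    return min(cluster, key=lambda c: sum(abs(c - p) for p in cluster))


clean = [5, 6, 7]
noisy = clean + [100]  # 100 is a hypothetical outlier

print(mean(clean), medoid(clean))  # 6.0 6
print(mean(noisy), medoid(noisy))  # 29.5 6
```

A single extreme value drags the mean far away from the bulk of the cluster, while the medoid stays on an actual data point near the center.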

7 Clustering -- Example 2
For simplicity, we still use 1-dimensional objects.
Objects: 1, 2, 5, 6, 7
Agglomerative clustering: a very frequently used algorithm
How to cluster: find the two closest objects and merge;
  => {1,2}, so we now have {1.5, 5, 6, 7};
  => {1,2}, {5,6}, so {1.5, 5.5, 7};
  => {1,2}, {{5,6},7}.
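
The merge sequence in Example 2 can be sketched as follows. The slide represents each merged cluster by its mean (1.5, then 5.5), so this sketch measures the distance between clusters as the distance between their means; that choice, the function names, and the rule that the first closest pair wins ties are assumptions made to match the slide.

```python
from itertools import combinations


def mean(cluster):
    return sum(cluster) / len(cluster)


def agglomerate_1d(points, k=1):
    """Bottom-up merging until k clusters remain; returns every intermediate state."""
    clusters = [[p] for p in points]
    history = [[list(c) for c in clusters]]
    while len(clusters) > k:
        # Find the pair of clusters whose means are closest
        # (the first such pair wins ties).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: abs(mean(clusters[ij[0]]) - mean(clusters[ij[1]])))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union, kept at position i.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.insert(i, merged)
        history.append([list(c) for c in clusters])
    return history


for step in agglomerate_1d([1, 2, 5, 6, 7], k=2):
    print(step)
# [[1], [2], [5], [6], [7]]
# [[1, 2], [5], [6], [7]]
# [[1, 2], [5, 6], [7]]
# [[1, 2], [5, 6, 7]]
```

The lists are flattened here, so the nested structure {{5,6},7} from the slide appears as [5, 6, 7]; the recorded history plays the role of the dendrogram.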

8 Issues with dendrograms
How to find proper clusters
An alternative: divisive algorithms
  Top down
  Compared with bottom-up, which is more efficient?
  What’s the complexity?
How to divide the data
  A heuristic: Minimum Spanning Tree

9 Distance measures
Single link
  Measured by the shortest edge between the two clusters
Complete link
  Measured by the longest edge
Average link
  Measured by the average edge length
An example
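
The example itself did not survive in the transcript, so the following sketch substitutes one: it applies the three definitions to the clusters {1,2} and {5,6,7} from the earlier examples, treating every cross-cluster pair of points as an edge. The function names are hypothetical.

```python
from itertools import product


def single_link(a, b):
    # Shortest edge between the two clusters.
    return min(abs(x - y) for x, y in product(a, b))


def complete_link(a, b):
    # Longest edge between the two clusters.
    return max(abs(x - y) for x, y in product(a, b))


def average_link(a, b):
    # Average length over all cross-cluster edges.
    return sum(abs(x - y) for x, y in product(a, b)) / (len(a) * len(b))


a, b = [1, 2], [5, 6, 7]
print(single_link(a, b))    # 3
print(complete_link(a, b))  # 6
print(average_link(a, b))   # 4.5
```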

10 Other Methods
Density-based methods
  DBSCAN: a cluster is a maximal set of density-connected points
    Core points are defined using the epsilon-neighborhood and minPts
    Directly density-reachable points (e.g., P and Q, Q and M), density-reachable points (P and M, assuming so are P and N), and density-connected points (any density-reachable points, P, Q, M, N) form clusters
Grid-based methods
  STING: the lowest level is the original data
    Statistical parameters of higher-level cells are computed from the parameters of the lower-level cells (count, mean, standard deviation, min, max, distribution)
Model-based methods
  Conceptual clustering: COBWEB
    Category utility
      Intraclass similarity
      Interclass dissimilarity
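
The DBSCAN notions above (core points, density-reachability, clusters as maximal density-connected sets) can be sketched on 1-D data. This is a simplified illustration rather than a tuned implementation: the function name, the parameter values eps=1 and min_pts=2, and the extra point 20 are assumptions, and the neighborhood count here includes the point itself.

```python
def dbscan_1d(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    def neighbors(i):
        # Indices of all points in the eps-neighborhood of point i (including i).
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not a core point: noise unless claimed later
            continue
        cluster += 1        # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also core: keep expanding
                queue.extend(nbrs)
    return labels


# 20 is a hypothetical isolated point added to show the noise label.
print(dbscan_1d([1, 2, 5, 6, 7, 20], eps=1, min_pts=2))
# [0, 0, 1, 1, 1, -1]
```

Unlike k-means, the number of clusters is not an input here; it follows from eps and min_pts, which is why choosing those two parameters matters.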

11 CSE572, CBS598: Data Mining by H. Liu
Neural networks
  Self-organizing feature maps (SOMs)
Subspace clustering
  CLIQUE: if a k-dimensional unit space is dense, then so are its (k-1)-dimensional subspaces

12 Challenges
Scalability
Dealing with different types of attributes
Clusters with arbitrary shapes
Automatically determining input parameters
Dealing with noise (outliers)
Insensitivity to the order in which instances are presented to learning
High dimensionality
Interpretability and usability

