Data Mining: Practical Machine Learning Tools and Techniques. Chapter 4: Algorithms: The Basic Methods. Section 4.8: Clustering. Rodney Nielsen

1 Data Mining: Practical Machine Learning Tools and Techniques
Chapter 4: Algorithms: The Basic Methods
Section 4.8: Clustering
Rodney Nielsen
Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

2 Rodney Nielsen, Human Intelligence & Language Technologies Lab
Algorithms: The Basic Methods
● Inferring rudimentary rules
● Naïve Bayes, probabilistic model
● Constructing decision trees
● Constructing rules
● Association rule learning
● Linear models
● Instance-based learning
● Clustering

3 Clustering
● Clustering techniques apply when there is no class to be predicted
● Aim: divide instances into “natural” groups
● Clusters can be:
  ● Disjoint OR Overlapping
  ● Deterministic OR Probabilistic
  ● Flat OR Hierarchical
● Classic clustering algorithm: k-means
● k-means clusters are disjoint, deterministic, and flat

4 Unsupervised Learning
[Figure: instances a, b, c, d, e, f]

5 Hierarchical Agglomerative Clustering
[Figure: instances a, b, c, d, e, f merged into successively larger clusters: bc, de, def, bcdef, abcdef]
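The merge sequence in the figure can be reproduced with a small sketch of agglomerative clustering. The slide does not say which linkage criterion or coordinates were used, so this example assumes single linkage (distance between the closest members of two clusters) and uses made-up one-dimensional coordinates chosen only for illustration.

```python
from itertools import combinations


def single_linkage(c1, c2, points):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(points[p] - points[q]) for p in c1 for q in c2)


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in sorted(points)]  # start: one cluster per instance
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append("".join(merged))
    return merges


# Hypothetical 1-D coordinates (not from the slides), picked for illustration.
points = {"a": 0.0, "b": 5.0, "c": 6.0, "d": 10.0, "e": 11.2, "f": 13.0}
merges = agglomerate(points)
print(merges)  # ['bc', 'de', 'def', 'bcdef', 'abcdef']
```

Other linkage criteria (complete, average) only require swapping the `single_linkage` function.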

6 k-means Clustering
1. Choose the number of clusters, e.g., k = 3
2. Select random centroids (often examples)
3. Until convergence, repeat steps 4 and 5:
4. Iterate over all examples and assign each to the cluster whose centroid is closest
5. Re-compute each cluster centroid
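The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the course: the function names, the `seed` and `max_iter` parameters, the sample data, and the choice to treat "no assignment changed" as convergence are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: random examples as centroids
    assignment = None
    for _ in range(max_iter):  # step 3: until convergence
        # Step 4: assign every example to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Step 5: re-compute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, assignment


# Made-up 2-D data with three loose groups; step 1: choose k = 3.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.0, 9.5),
          (1.0, 9.0), (0.5, 8.0), (1.5, 8.5)]
centroids, assignment = k_means(points, k=3)
print(centroids)
print(assignment)
```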

9 k-means Clustering
[Figure: instances labeled a, b, c by cluster]

10 Expectation Maximization

11 Discussion
● k-means minimizes the squared distance from instances to their cluster centers
● The result can vary significantly, based on the initial choice of seeds
● Can get trapped in a local minimum
  ● Example: [Figure: instances and initial cluster centers]
● To increase the chance of finding the global optimum: restart with different random seeds
● For hierarchical clustering, k-means can be applied recursively with k = 2
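The restart strategy can be sketched as: run k-means several times with different random seeds and keep the run with the smallest total squared distance. The code below is an illustrative sketch under assumptions; the function names, the `n_restarts` parameter, and the four-point rectangle data are made up for the example. On this data, starting from the two left-hand (or the two right-hand) points yields a stable top/bottom split with cost 16, while the left/right split has cost 1.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, rng, max_iter=100):
    """Plain k-means; initial centroids are k examples drawn with rng."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


def cost(points, centroids, assignment):
    """Total squared distance of every instance to its cluster center."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))


def k_means_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times with different seeds; keep the cheapest run."""
    best = None
    for r in range(n_restarts):
        centroids, assignment = k_means(points, k, random.Random(seed + r))
        c = cost(points, centroids, assignment)
        if best is None or c < best[0]:
            best = (c, centroids, assignment)
    return best


# Hypothetical data: corners of a wide rectangle, where a bad start gets stuck.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
best_cost, best_centroids, best_assignment = k_means_restarts(points, k=2)
print(best_cost, best_centroids, best_assignment)
```

Restarts do not guarantee the global optimum; they only make a poor local minimum less likely to be the final answer.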

12 Faster Distance Calculations
● Can we use kD-trees or ball trees to speed up the process? Yes:
  ● First, build the tree, which remains static, for all the data points
  ● At each node, store the number of instances and the sum of all instances
  ● In each iteration, descend the tree and find out which cluster each node belongs to
  ● Can stop descending as soon as we find out that a node belongs entirely to a particular cluster
  ● Use the statistics stored at the nodes to compute the new cluster centers
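A rough sketch of this idea with a kD-tree follows. The slide does not specify how to decide that a node belongs entirely to one cluster; this example assumes a bounding-box test: take the center closest to the box midpoint, and for every other center check that the box corner lying furthest in that center's direction is still closer to the candidate. All names (`Node`, `owns_whole_box`, `update_centroids`, `leaf_size`) are hypothetical.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


class Node:
    """kD-tree node storing its bounding box, instance count, and instance sum."""

    def __init__(self, points, leaf_size=2):
        dims = range(len(points[0]))
        self.lo = tuple(min(p[d] for p in points) for d in dims)
        self.hi = tuple(max(p[d] for p in points) for d in dims)
        self.count = len(points)
        self.total = tuple(sum(p[d] for p in points) for d in dims)
        self.points = self.left = self.right = None
        if len(points) <= leaf_size or self.lo == self.hi:
            self.points = points  # leaf: keep the instances themselves
        else:
            d = max(dims, key=lambda i: self.hi[i] - self.lo[i])  # widest dimension
            ordered = sorted(points, key=lambda p: p[d])
            half = len(ordered) // 2
            self.left = Node(ordered[:half], leaf_size)
            self.right = Node(ordered[half:], leaf_size)


def owns_whole_box(best, other, lo, hi):
    """True if every point of the box is strictly closer to best than to other."""
    corner = tuple(h if o > b else l for l, h, b, o in zip(lo, hi, best, other))
    return sq_dist(corner, best) < sq_dist(corner, other)


def accumulate(node, centroids, sums, counts):
    k = len(centroids)
    mid = tuple((l + h) / 2 for l, h in zip(node.lo, node.hi))
    best = min(range(k), key=lambda j: sq_dist(mid, centroids[j]))
    if all(j == best or owns_whole_box(centroids[best], centroids[j], node.lo, node.hi)
           for j in range(k)):
        # Whole node belongs to one cluster: use the stored statistics and stop.
        sums[best] = [s + t for s, t in zip(sums[best], node.total)]
        counts[best] += node.count
    elif node.points is not None:
        for p in node.points:  # leaf: fall back to per-instance assignment
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            sums[j] = [s + x for s, x in zip(sums[j], p)]
            counts[j] += 1
    else:
        accumulate(node.left, centroids, sums, counts)
        accumulate(node.right, centroids, sums, counts)


def update_centroids(tree, centroids):
    """One k-means iteration computed from the static tree."""
    sums = [[0.0] * len(centroids[0]) for _ in centroids]
    counts = [0] * len(centroids)
    accumulate(tree, centroids, sums, counts)
    return [tuple(s / n for s in total) if n else c
            for total, n, c in zip(sums, counts, centroids)]


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]  # made-up data
tree = Node(points)  # built once, reused in every iteration
centroids = rng.sample(points, 3)
for _ in range(10):
    centroids = update_centroids(tree, centroids)
print(centroids)
```

The tree is built once; only the descent is repeated per iteration, and large nodes far from any cluster boundary are handled in a single step.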

13 Example

14 Dimensionality Reduction
● Principal Components Analysis
● Singular Value Decomposition
