Slide 1 EE3J2 Data Mining Lecture 18 K-means and Agglomerative Algorithms.
Slide 2 Today  Unsupervised Learning  Clustering  K-means

Slide 3 EE3J2 Data Mining Distortion  The distortion for the centroid set C = c1,…,cM is defined by: Dist(C) = Σn=1..N d(xn, c(xn)), where c(xn) denotes the centroid in C closest to the data point xn  In other words, the distortion is the sum of distances between each data point and its nearest centroid  The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
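The distortion defined above can be computed in a few lines. This is an illustrative Python sketch (the function name and the point/centroid lists are our own, not from the slides):

```python
import math

def distortion(points, centroids):
    """Sum over all data points of the distance to the nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points)
```

A centroid set with lower distortion fits the data more tightly; k-means can be viewed as a greedy attempt to minimise this quantity.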

Slide 4 The K-Means Clustering Method  Given k, the k-means algorithm is implemented in 4 steps: 1. Initialisation: define the number of clusters (k) and designate a cluster centre (a vector of the same dimensionality as the data) for each cluster. 2. Assign each data point to the closest cluster centre (centroid); that data point is now a member of that cluster. 3. Calculate the new cluster centre (the mean of all the members of that cluster). 4. Calculate the within-cluster sum-of-squares. If this value has not changed significantly over a certain number of iterations, exit the algorithm; otherwise go back to Step 2. Remember, you have converged when you have found centroids that minimise the overall distance between the data points and their centroids.
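The four steps above can be sketched as follows. This is an illustrative Python implementation, not the lecture's reference code; for simplicity the initial centres are taken to be the first k points rather than chosen at random:

```python
import math

def k_means(points, k, max_iter=100, tol=1e-6):
    # Step 1: initialise k cluster centres (here: the first k points).
    centroids = [list(p) for p in points[:k]]
    prev_sse = float("inf")
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda m: math.dist(p, centroids[m]))
            clusters[nearest].append(p)
        # Step 3: recompute each centre as the mean of its members.
        for m, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centroids[m] = [sum(p[d] for p in members) / len(members)
                                for d in range(dim)]
        # Step 4: stop once the within-cluster sum-of-squares stops changing.
        sse = sum(math.dist(p, centroids[m]) ** 2
                  for m, members in enumerate(clusters) for p in members)
        if abs(prev_sse - sse) < tol:
            break
        prev_sse = sse
    return centroids, clusters
```

For example, `k_means([(0, 0), (1, 0), (10, 0), (11, 0)], 2)` converges to centres at (0.5, 0) and (10.5, 0), one per natural group.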

Slide 5 K-Means Example (K=2)  Pick seeds  Reassign clusters  Compute centroids  Reassign clusters  Compute centroids  Reassign clusters  Converged! [Scatter-plot sequence; from Mooney]

Slide 6 So… Basically  Start with k randomly chosen data points (objects) as the initial centroids Ck0.  Find the set of data points that are closest to Ck0 (call it Yk0).  Compute the average of these points to obtain Ck1, the new centroid.  Now repeat this process: find the objects closest to Ck1, compute their average to get the new centroid Ck2, and so on…  Until convergence.

Slide 7 Comments on the K-Means Method  Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.  Weakness: Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Applicable only when the mean is defined; what about categorical data? Need to specify k, the number of clusters, in advance. Unable to handle noisy data and outliers. Not suitable for discovering clusters with non-convex shapes.

Slide 8 Hierarchical Clustering  Grouping data objects into a tree of clusters.  Agglomerative clustering Begin by assuming that every data point is a separate centroid Combine closest centroids until the desired number of clusters is reached  Divisive clustering Begin by assuming that there is just one centroid/cluster Split clusters until the desired number of clusters is reached

Slide 9 Agglomerative Clustering - Example

Students  Exam1  Exam2  Exam3
Mike      9      3      7
Tom       10     2      9
Bill      1      9      4
T Ren     6      5      5
Ali       1      10     3

Slide 10 Distances between objects  Using the Euclidean distance measure, what is the distance between Mike and Tom? Mike: 9, 3, 7 Tom: 10, 2, 9

S      E1  E2  E3
Mike   9   3   7
Tom    10  2   9
Bill   1   9   4
T Ren  6   5   5
Ali    1   10  3
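Checking this with a couple of lines of Python (`math.dist` computes the Euclidean distance between two points):

```python
import math

mike = (9, 3, 7)
tom = (10, 2, 9)

# sqrt((9-10)^2 + (3-2)^2 + (7-9)^2) = sqrt(1 + 1 + 4) = sqrt(6)
print(round(math.dist(mike, tom), 2))  # → 2.45
```

So Mike and Tom are about 2.5 apart, which is the value used in the distance matrix on the next slide.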

Slide 11 Distance Matrix (Euclidean distances, to one decimal place)

        Mike   Tom    Bill   T Ren   Ali
Mike    0      2.5    10.4   4.1     11.4
Tom     -      0      12.4   6.4     13.5
Bill    -      -      0      6.5     1.4
T Ren   -      -      -      0       7.3
Ali     -      -      -      -       0

Slide 12 The Algorithm – Step 1  Identify the entities which are most similar - this can easily be discerned from the distance table constructed.  In this example, Bill and Ali are most similar, with a distance value of 1.4. They are therefore the most 'related', and are joined first in the dendrogram.

Slide 13 The Algorithm – Step 2  The two entities that are most similar can now be merged so that they represent a single cluster (or new entity).  So Bill and Ali can now be considered to be a single entity. How do we compare this entity with others? We use average linkage: the merged entity is represented by the average of its members.  So the new average vector is [1, 9.5, 3.5] – see the first table and average the marks for Bill and Ali.  We now need to redraw the distance table, including the merger of the two entities, with new distance calculations.
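The merge and the recomputed distances can be reproduced directly (an illustrative Python sketch following the slide's approach of averaging the two mark vectors):

```python
import math

bill = (1, 9, 4)
ali = (1, 10, 3)

# Merge Bill and Ali by averaging their mark vectors component-wise.
merged = tuple((b + a) / 2 for b, a in zip(bill, ali))
print(merged)  # → (1.0, 9.5, 3.5)

# Distances from the remaining students to the merged entity.
for name, marks in [("Mike", (9, 3, 7)), ("Tom", (10, 2, 9)), ("T Ren", (6, 5, 5))]:
    print(name, round(math.dist(marks, merged), 1))
# → Mike 10.9, Tom 12.9, T Ren 6.9
```

These are the values that appear in the redrawn distance table on the next slide.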

Slide 14 The Algorithm – Step 3

              Mike   Tom    T Ren   {Bill & Ali}
Mike          0      2.5    4.1     10.9
Tom           -      0      6.4     12.9
T Ren         -      -      0       6.9
{Bill & Ali}  -      -      -       0

Slide 15 Next closest students  Mike and Tom, with a distance of 2.5!  So now we have two merged clusters: {Bill, Ali} and {Mike, Tom} (T Ren is still on its own).

Slide 16 The distance matrix now

              {Mike & Tom}   T Ren   {Bill & Ali}
{Mike & Tom}  0              5.3     11.9
T Ren         -              0       6.9
{Bill & Ali}  -              -       0

Now, T Ren is closest to {Mike & Tom} (distance 5.3), so T Ren joins them in the cluster.

Slide 17 The final dendrogram  [Dendrogram with leaves Bill, Ali, Mike, Tom, T Ren]  MANY 'SUB-CLUSTERS' WITHIN ONE CLUSTER
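The whole example can be reproduced with a short agglomerative sketch. This is illustrative code, not from the lecture; it uses average linkage over all pairwise distances, which gives almost the same numbers as the slide's averaged mark vectors:

```python
import math

students = {
    "Mike": (9, 3, 7), "Tom": (10, 2, 9), "Bill": (1, 9, 4),
    "T Ren": (6, 5, 5), "Ali": (1, 10, 3),
}

def avg_linkage(c1, c2):
    """Average of the pairwise distances between the two clusters' members."""
    dists = [math.dist(students[a], students[b]) for a in c1 for b in c2]
    return sum(dists) / len(dists)

# Start with every student in a singleton cluster, then repeatedly merge
# the closest pair of clusters until one cluster remains.
clusters = [frozenset([name]) for name in students]
merges = []
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: avg_linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    d = avg_linkage(clusters[i], clusters[j])
    merges.append((sorted(clusters[i]), sorted(clusters[j]), round(d, 2)))
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for m in merges:
    print(m)
```

Running it shows the merge order of the example: Bill and Ali join first (1.41), then Mike and Tom (2.45), then T Ren, and finally the two remaining clusters combine, exactly the nesting the dendrogram depicts.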

Slide 18 Conclusions  K-Means Algorithm – memorise the distortion equation and the algorithm.  Hierarchical Clustering: Agglomerative Clustering