Data Mining – Algorithms: K Means Clustering


Data Mining – Algorithms: K Means Clustering Chapter 4, Section 4.8

K Means Clustering
- K is the number of clusters; it must be specified in advance (an option or parameter to the algorithm)
- Develops "cluster centers" (centroids)
- Starts with random center points
- Puts each instance into the "closest" cluster, based on Euclidean distance
- Creates new centers from the instances assigned to each cluster
- Refines iteratively until no instance changes cluster
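The "closest cluster by Euclidean distance" step above can be sketched in Python (a minimal illustration; the attribute values are assumed numeric):

```python
import math

def euclidean(a, b):
    # straight-line distance between two numeric instances
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_cluster(instance, centroids):
    # index of the centroid nearest to the instance
    return min(range(len(centroids)), key=lambda i: euclidean(instance, centroids[i]))

centroids = [(0.0, 0.0), (10.0, 10.0)]
print(closest_cluster((1.0, 2.0), centroids))  # 0
print(closest_cluster((9.0, 8.0), centroids))  # 1
```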

Example
See bankrawnumericKMeansVersion2.xls

Pseudo-code for K Means Clustering

Loop through K times
    current centroid = randomly generate values for each attribute
Done = False
All instances' cluster = none
WHILE not Done
    Total distance = 0
    Done = True
    For each instance
        instance's previous cluster = instance's cluster
        measure Euclidean distance to each centroid
        find smallest distance and assign instance to that cluster
        if new cluster != previous cluster
            Done = False
        add smallest distance to total distance
    Report total distance
    For each cluster
        loop through attributes
            loop through instances assigned to cluster
                update totals
            calculate average for attribute for cluster – producing new centroid
END WHILE
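The pseudo-code above can be sketched as a small Python function (an illustration only, not Weka's implementation; attributes are assumed numeric, and the centroids are seeded from randomly chosen instances rather than randomly generated attribute values):

```python
import math
import random

def kmeans(instances, k, seed=0, max_iters=100):
    """Minimal k-means: returns (centroids, assignments, total_distance)."""
    rng = random.Random(seed)
    # initialize centroids from k randomly chosen instances
    centroids = [list(c) for c in rng.sample(instances, k)]
    clusters = [None] * len(instances)
    total_distance = 0.0
    for _ in range(max_iters):
        done = True
        total_distance = 0.0
        # assignment step: move each instance to its closest centroid
        for i, inst in enumerate(instances):
            dists = [math.dist(inst, c) for c in centroids]
            new_cluster = dists.index(min(dists))
            if new_cluster != clusters[i]:
                done = False
            clusters[i] = new_cluster
            total_distance += min(dists)
        if done:
            break
        # update step: recompute each centroid as the mean of its members
        for c in range(k):
            members = [instances[i] for i in range(len(instances)) if clusters[i] == c]
            if members:
                centroids[c] = [sum(col) / len(col) for col in zip(*members)]
    return centroids, clusters, total_distance

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
centroids, clusters, dist = kmeans(data, k=2)
print(clusters)  # the two left points share one cluster, the two right points the other
```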

K Means Clustering
- Simple and effective
- The minimum found is a local minimum; there is no guarantee that the total Euclidean distance is a global minimum
- Final clusters are quite sensitive to the initial (random) cluster centers
- This is true for all practical clustering techniques (since they are greedy hill climbers)
- Common to run several times and manually choose the best final result (the one with the smallest total Euclidean distance)
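The "best final result" is judged by the total Euclidean distance from each instance to its assigned cluster's centroid, which can be scored like this (a sketch; the clustering itself is assumed given):

```python
import math

def total_distance(instances, clusters, centroids):
    # sum of Euclidean distances from each instance to its assigned centroid
    return sum(math.dist(inst, centroids[c]) for inst, c in zip(instances, clusters))

instances = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0)]
clusters = [0, 0, 1]
centroids = [(1.5, 1.0), (9.0, 9.0)]
print(total_distance(instances, clusters, centroids))  # 1.0
```

Running k-means several times with different seeds and keeping the result with the smallest such total is the restart strategy the slide recommends.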

Let’s run WEKA on this …

WEKA - Take-Home

Number of iterations: 2
Within cluster sum of squared errors: 7.160553788978607
Cluster centroids:
Cluster 0
  Mean/Mode: 40.2  9215  1013.65  22  8.4  24002.221
  Std Devs:  10.4019  4607.5  206.0537  4.4721  6.5038  21098.1457
Cluster 1
  Mean/Mode: 27.6  2795.2167  423.89  15.2667  4.5333  4224.9115
  Std Devs:  11.5437  3493.6652  227.8601  4.8766  6.7387  5836.1117
Clustered Instances
  0    5  ( 25%)
  1   15  ( 75%)

- This was with the default – k = 2 (2 clusters)
- It only had to loop twice
- The within-cluster sum of squared errors is shown
- Means (and SDs) of each attribute for each cluster are shown
- The number of instances in each cluster is shown
- You can visualize the clusters (right click on the result list) <DO>
- You can change the number of clusters generated <DO>
- You can change the random seed to see how results differ <DO>
- Weka doesn't give you a list of which instance is in which cluster – but you can add it to the arff file using the Preprocess tab – Filters.Unsupervised.Attribute.AddCluster
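The per-cluster means and standard deviations in Weka's report can be reproduced for any assignment (a sketch using the sample standard deviation; Weka's exact formula is not shown in the output above):

```python
from statistics import mean, stdev

def cluster_summary(instances, clusters):
    """Per-cluster attribute means and (sample) standard deviations."""
    summary = {}
    for c in sorted(set(clusters)):
        members = [inst for inst, a in zip(instances, clusters) if a == c]
        cols = list(zip(*members))  # one tuple of values per attribute
        summary[c] = {
            "size": len(members),
            "means": [mean(col) for col in cols],
            "std_devs": [stdev(col) if len(col) > 1 else 0.0 for col in cols],
        }
    return summary

instances = [(40, 9000), (42, 9400), (27, 2800), (28, 2790)]
clusters = [0, 0, 1, 1]
print(cluster_summary(instances, clusters))
```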

Numeric Attributes
- Simple K Means is designed for numeric attributes
- For nominal attributes, the similarity measurement has to be all-or-nothing: the values either match or they don't
- For nominal attributes, the centroid uses the mode (most frequent value) instead of the mean
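An all-or-nothing distance for nominal attributes can be combined with numeric differences in one measure (a sketch; the attribute types are assumed known in advance, and numeric values are assumed pre-normalized so the two kinds of contribution are comparable):

```python
import math

def mixed_distance(a, b, numeric):
    """Distance over mixed attributes: squared difference for numeric
    attributes, 0/1 mismatch for nominal ones."""
    total = 0.0
    for x, y, is_num in zip(a, b, numeric):
        if is_num:
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0  # all or nothing
    return math.sqrt(total)

# attributes: (normalized age, marital-status)
print(mixed_distance((0.5, "married"), (0.5, "single"), numeric=(True, False)))  # 1.0
print(mixed_distance((0.2, "married"), (0.5, "married"), numeric=(True, False)))
```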

End Section 4.8