DATA MINING CLUSTERING K-Means.


Clustering Definition
Clustering comprises techniques that divide data objects into groups.
It is a form of classification in that it labels objects with class (cluster) labels, but the labels are derived from the data itself.
For this reason, cluster analysis is categorized as unsupervised classification.
Clustering can be useful when you have no prior idea of how to define the groups.

Types of Clustering
Hierarchical vs. partitional
    Hierarchical: nested clusters, organized as a tree
    Partitional: a division of the data into non-overlapping clusters
Exclusive vs. overlapping vs. fuzzy
    Exclusive: each object is assigned to a single cluster
    Overlapping: an object can simultaneously belong to more than one cluster
    Fuzzy: every object belongs to every cluster with a membership weight between 0 and 1
Complete vs. partial
    Complete: every object is assigned to a cluster
    Partial: not all objects are assigned

Types of Clusters
Well-separated
Prototype-based
Graph-based
Density-based
Shared-property (conceptual clusters)

K-Means
Partitional clustering
Prototype-based
One level (the clusters are not nested)

Basic K-Means
k, the number of clusters to be formed, must be decided before beginning.
Step 1: Select k data points to act as the seeds (the initial cluster centroids).
Step 2: Assign each record to the nearest centroid, thus forming clusters.
Step 3: Calculate the centroids of the new clusters, then go back to Step 2. Repeat until the centroids no longer change.

Basic K-Means (2)
Determine the cluster boundaries.
Assign each record to the nearest centroid.
Calculate the new centroids.
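
The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`euclidean`, `mean_point`, `kmeans`), the `max_iter` and `seed` parameters, the stopping test (assignments no longer change), and the sample data are all assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial centroids (seeds).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each record to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster became empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Hypothetical sample data: two visibly separated groups of three points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Which cluster receives label 0 depends on the random seeds, so only the grouping of the points is meaningful, not the label values themselves.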

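Choosing Initial Centroids
Random initial centroids: a common choice, but the resulting clusters are often poor, and a cluster can end up empty.
Working around the limits of random initialization: perform multiple runs, each with a different set of randomly chosen centroids, then select the set of clusters with the minimum SSE (sum of squared error).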
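
The multiple-runs strategy can be sketched as follows. The helper names (`kmeans_once`, `best_of_n_runs`), the default of 10 runs, and the sample data are assumptions for illustration; the k-means loop is a compact version included only so the block is self-contained.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random seeds; returns (centroids, labels, sse)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # SSE: squared distance of every point to the centroid of its cluster.
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


def best_of_n_runs(points, k, n_runs=10, seed=0):
    """Run k-means n_runs times and keep the run with the smallest SSE."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[2])


# Hypothetical sample data: three loose groups.
data = [
    (0.0, 0.0), (0.5, 1.0), (1.0, 0.5),
    (10.0, 0.0), (10.5, 1.0), (11.0, 0.5),
    (5.0, 9.0), (5.5, 10.0), (6.0, 9.5),
]
best_centroids, best_labels, best_sse = best_of_n_runs(data, k=3)
print(best_sse)
```

Restarting does not guarantee the global minimum of the SSE; it only makes a poor local minimum less likely to be the one that is kept.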

Similarity, Association, and Distance
The method just described assumes that each record can be described as a point in a metric space.
This is not easily done for many data sets (e.g., those with categorical variables and some numeric variables), so pre-processing is often necessary.
Records in a cluster should have a natural association, so a measure of similarity is required.
Euclidean distance is often used, but it is not always suitable. It treats changes in each dimension equally, whereas changes in one field may be more important than changes in another, and changes of the same "size" in different fields can have very different significance (e.g., a 1 metre difference in height vs. a $1 difference in annual income).

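Measures of Similarity
Euclidean distance between vectors X and Y: $d(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}$
Weighting: each field can be given a weight $w_i$ that reflects its importance, for example $d_w(X, Y) = \sqrt{\sum_i w_i (x_i - y_i)^2}$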
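
The height-versus-income problem can be made concrete with a short sketch. The weighted form of the distance, the choice of z-score standardization as the pre-processing step, the function names, and the three records are all assumptions chosen for illustration; the slides do not prescribe a particular weighting scheme.

```python
import math


def weighted_euclidean(x, y, weights):
    """Euclidean distance with a per-field weight on each squared difference."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def standardize(records):
    """Rescale each field to zero mean and unit (population) standard deviation."""
    columns = list(zip(*records))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0  # avoid /0
        for col, m in zip(columns, means)
    ]
    return [
        tuple((v - m) / s for v, m, s in zip(rec, means, stds))
        for rec in records
    ]


# (height in metres, annual income in dollars); made-up values.
people = [(1.60, 30000.0), (1.90, 30001.0), (1.61, 45000.0)]
equal = (1.0, 1.0)

# On the raw data, income dominates: a 30 cm height gap barely registers.
raw_01 = weighted_euclidean(people[0], people[1], equal)
raw_02 = weighted_euclidean(people[0], people[2], equal)

# After standardization, both fields contribute on a comparable scale.
scaled = standardize(people)
std_01 = weighted_euclidean(scaled[0], scaled[1], equal)
std_02 = weighted_euclidean(scaled[0], scaled[2], equal)

print(raw_01, raw_02)
print(std_01, std_02)
```

Standardizing is equivalent to choosing each weight as the inverse of that field's variance; other weights can be passed to `weighted_euclidean` when domain knowledge says one field matters more.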

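Redefine Cluster Centroids
For data in Euclidean space, the objective is the sum of the squared error (SSE), where $C_i$ is the ith cluster, $c_i$ its centroid, and $K$ the number of clusters:
$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$
The centroid (mean) of the ith cluster, which contains $m_i$ objects, is defined as:
$c_i = \frac{1}{m_i} \sum_{x \in C_i} x$
Other cases:

Proximity function          Centroid   Objective function
Manhattan (L1)              median     Minimize the sum of the L1 distances of objects to their cluster centroid
Squared Euclidean (L2^2)    mean       Minimize the sum of the squared L2 distances of objects to their cluster centroid
Cosine                      mean       Maximize the sum of the cosine similarities of objects to their cluster centroid
Bregman divergence          mean       Minimize the sum of the Bregman divergences of objects to their cluster centroid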
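
The pairing of proximity function and centroid can be checked numerically. The sketch below, with hypothetical function names and a made-up cluster containing one distant point, compares the mean and the component-wise median as centroids under the squared Euclidean and Manhattan costs.

```python
from statistics import mean, median


def sq_euclidean(p, q):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def manhattan(p, q):
    """L1 distance."""
    return sum(abs(a - b) for a, b in zip(p, q))


def mean_centroid(points):
    """Component-wise mean of the points."""
    return tuple(mean(coords) for coords in zip(*points))


def median_centroid(points):
    """Component-wise median of the points."""
    return tuple(median(coords) for coords in zip(*points))


def total_cost(points, centroid, dist):
    """Sum of the distances from every point to the given centroid."""
    return sum(dist(p, centroid) for p in points)


def sse(clusters):
    """SSE of a clustering: squared L2 cost of each cluster about its mean."""
    return sum(
        total_cost(points, mean_centroid(points), sq_euclidean)
        for points in clusters
    )


# Made-up cluster; the last point is far from the others.
cluster = [(1.0, 2.0), (2.0, 2.0), (3.0, 2.5), (10.0, 9.0)]
c_mean = mean_centroid(cluster)
c_median = median_centroid(cluster)

print("squared L2:", total_cost(cluster, c_mean, sq_euclidean),
      total_cost(cluster, c_median, sq_euclidean))
print("L1:", total_cost(cluster, c_mean, manhattan),
      total_cost(cluster, c_median, manhattan))
print("SSE:", sse([cluster]))
```

The mean gives the lower squared L2 cost and the median the lower L1 cost, which is why the centroid definition has to change together with the proximity function.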

Bisecting K-means
Basic idea: split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
Choosing the cluster to split: the cluster with the largest SSE, the cluster with the largest size, both, or some other criterion.
Bisecting K-means is less susceptible to initialization problems.
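
A minimal sketch of the bisecting procedure, using the largest-SSE criterion to choose the cluster to split. The function names, the single 2-means run per split, the assumption that all points are distinct, and the sample data are choices made for this example only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid_of(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sse_of(points):
    """SSE of one cluster about its own mean (0 for an empty cluster)."""
    if not points:
        return 0.0
    c = centroid_of(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, max_iter=50):
    """Split one cluster into two with basic 2-means; returns two lists."""
    centroids = rng.sample(points, 2)
    halves = None
    for _ in range(max_iter):
        new_halves = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            new_halves[nearer].append(p)
        if new_halves == halves:
            break  # the split is stable
        halves = new_halves
        centroids = [centroid_of(h) if h else c for h, c in zip(halves, centroids)]
    return list(halves)


def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) >= 2]
        if not splittable:
            break  # nothing left that can be split
        target = max(splittable, key=sse_of)
        clusters.remove(target)
        clusters.extend(two_means(target, rng))
    return clusters


# Hypothetical sample data: three loose groups of distinct points.
data = [
    (0.0, 0.0), (0.5, 1.0), (1.0, 0.5),
    (10.0, 0.0), (10.5, 1.0), (11.0, 0.5),
    (5.0, 9.0), (5.5, 10.0), (6.0, 9.5),
]
clusters = bisecting_kmeans(data, k=3)
for c in clusters:
    print(c)
```

Replacing `key=sse_of` with `key=len` switches the criterion to the largest cluster by size.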

Strengths and Weaknesses
Strengths
Simple, and can be used for a wide variety of data types.
Computationally efficient.
Weaknesses
Not suitable for all types of data.
Sensitive to outliers, which should be detected and removed beforehand.
Restricted to data for which there is a notion of a center (centroid).