CLUSTER ANALYSIS Introduction to Clustering Major Clustering Methods.

Introduction to Clustering Definition The process of grouping a set of physical or abstract objects into classes of similar objects

Introduction to Clustering Advantages Unlike classification, which requires the often costly collection and labeling of a large set of training tuples or patterns, clustering proceeds in the reverse direction: * Partition the set of data into groups based on data similarity * Assign labels to the relatively small number of groups

Introduction to Clustering Importance & Necessity Discover overall distribution patterns and interesting correlations among data attributes. * Used widely in numerous applications: market research, pattern recognition, data analysis, and image processing * Used for outlier detection, such as detection of credit card fraud or monitoring of criminal activities in electronic commerce * In business: characterize customer groups based on purchasing patterns * In biology: used to derive plant and animal taxonomies and to categorize genes with similar functionality

Introduction to Clustering Alternative Name Occasionally called data segmentation, because clustering partitions large data sets into groups according to their similarity.

Introduction to Clustering Statistical Application Cluster analysis tools based on k-means, k-medoids, and several other methods have been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. Clustering is a form of learning by observation (unsupervised learning), whereas classification is a form of learning by examples (supervised learning).

Major Clustering Methods * Partitioning methods * Hierarchical methods * Density-based methods * Grid-based methods * Model-based methods * Clustering high-dimensional data * Constraint-based clustering

Partitioning Methods Abstract Taxonomy

Abstract Premise Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
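
The two requirements above can be checked mechanically. The following is a minimal sketch, assuming the objects are distinct and sortable; the function name `is_valid_partition` is hypothetical and not part of the original slides.

```python
def is_valid_partition(objects, groups):
    """Check the two partitioning requirements for a list of groups.

    Assumption (not from the slides): objects are distinct and sortable.
    """
    # (1) each group must contain at least one object
    if any(len(group) == 0 for group in groups):
        return False
    # (2) each object must belong to exactly one group
    assigned = [obj for group in groups for obj in group]
    return sorted(assigned) == sorted(objects)


objects = [1, 2, 3, 4, 5]
print(is_valid_partition(objects, [[1, 2], [3], [4, 5]]))     # True
print(is_valid_partition(objects, [[1, 2], [], [3, 4, 5]]))   # False: empty group
print(is_valid_partition(objects, [[1, 2], [2, 3], [4, 5]]))  # False: 2 is in two groups
```

Note that k <= n follows automatically: k non-empty, non-overlapping groups need at least k objects.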

Abstract General Criterion Objects in the same cluster are “close” or related to each other, whereas objects of different clusters are “far apart” or very different

Taxonomy Centroid-Based Technique: k-means paradigm Representative Object-Based Technique: The k-Medoids Method

K-MEANS PARADIGM Basic K-Means Algorithm Bisecting K-Means Algorithm EM (Expectation-Maximization) Algorithm K-Means Estimation: Strength and Weakness

K-Means Clustering (Centroid-Based Technique) I. The Algorithm Define k centroids, one for each cluster. These centroids should be placed in a cunning way. Take each point belonging to a given data set and associate it with the nearest centroid. Re-calculate k new centroids. These two steps are repeated in a loop until no more changes occur.

K-Means Clustering (Centroid-Based Technique) I. The Algorithm Typically, the square-error criterion is used, defined as $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$, where $E$ is the sum of the square error for all objects in the data set, $p$ is the point in space representing a given object, and $m_i$ is the mean of cluster $C_i$.
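
As a small illustration of the square-error criterion, the sketch below sums, over every cluster, the squared Euclidean distance from each point to that cluster's mean. The function name `squared_error` and the sample data are assumptions made for the example.

```python
def squared_error(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # m_i: the mean of cluster C_i, computed per coordinate
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        # add |p - m_i|^2 for every point p in C_i
        total += sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in points)
    return total


clusters = [[(1.0, 1.0), (3.0, 1.0)], [(10.0, 10.0)]]
print(squared_error(clusters))  # 2.0
```

In the example, the first cluster has mean (2, 1), so each of its two points contributes 1; the single-point cluster contributes 0.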

K-Means Clustering (Centroid-Based Technique) I. The Algorithm The algorithm is composed of the following steps: 1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids. 2. Assign each object to the group that has the closest centroid.

K-Means Clustering (Centroid-Based Technique) I. The Algorithm 3. When all objects have been assigned, recalculate the positions of the K centroids. 4. Repeat steps 2 and 3 until the centroids no longer move.
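
The four steps can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: the function name `kmeans`, the `max_iter` safety cap, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions added for the example.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means on tuples of numbers, using Euclidean distance."""
    # Step 1: the caller supplies the K initial centroids.
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iter):  # max_iter is a safety cap (assumption)
        # Step 2: assign each object to the group with the closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate the position of each centroid as its cluster mean.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, initial_centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

With these initial centroids the first three points end up in one cluster and the last three in the other.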

K-Means Clustering (Centroid-Based Technique) I. The Algorithm This is a greedy algorithm: it does not necessarily find the optimal configuration corresponding to the global minimum of the objective function. The algorithm is also highly sensitive to the randomly selected initial cluster centres.

K-Means Clustering (Centroid-Based Technique) II. Example

Representative Object-Based Technique: The K-Medoids Method The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data.

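Representative Object-Based Technique: The K-Medoids Method Approach: Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. An absolute-error criterion is used: $E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$, where $E$ is the sum of the absolute error for all objects in the data set, $p$ is the point in space representing a given object in cluster $C_j$, and $o_j$ is the representative object of $C_j$.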
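
To illustrate why a representative object is less affected by an outlier than a mean, the sketch below computes the absolute error for a set of medoids and compares the best single medoid with the mean on data containing one extreme value. The function name `absolute_error` and the data are assumptions; Euclidean distance stands in for the dissimilarity measure.

```python
import math


def absolute_error(points, medoids):
    """Sum of distances from each object to its most similar (nearest) medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


# Three nearby objects and one outlier at x = 100.
points = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (100.0, 1.0)]

# The mean is dragged toward the outlier.
mean = tuple(sum(p[d] for p in points) / len(points) for d in range(2))
print(mean)  # (26.5, 1.0)

# The best single medoid is an actual object and stays inside the dense group.
best_medoid = min(points, key=lambda m: absolute_error(points, [m]))
print(best_medoid)                            # (2.0, 1.0)
print(absolute_error(points, [best_medoid]))  # 100.0
```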

Hierarchical Methods: Bisecting K-Means Approach: The bisecting K-means algorithm is a straightforward extension of the basic K-Means algorithm, based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.

Hierarchical Methods: Bisecting K-Means Bisecting K-Means Algorithm

Hierarchical Methods: Bisecting K-Means Different ways to choose which cluster to split: Choose the largest cluster at each step, or Choose the one with the largest SSE, or Use a criterion based on both size and SSE. Different choices result in different clusters. Advantage: Bisecting K-Means is less susceptible to initialization problems
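
The splitting procedure can be sketched as below, here using the "largest SSE" rule to choose which cluster to split. This is a sketch under stated assumptions: the points are distinct, K does not exceed the number of points, each bisection is tried a few times with seeded random initial centroids and the split with the lowest total SSE is kept. The names `two_means`, `bisecting_kmeans`, `trials`, and `seed` are hypothetical.

```python
import math
import random


def mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    m = mean(points)
    return sum(math.dist(p, m) ** 2 for p in points)


def two_means(points, rng, max_iter=50):
    """Basic K-means with K = 2, started from two randomly chosen points."""
    centroids = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            closer_to_first = math.dist(p, centroids[0]) <= math.dist(p, centroids[1])
            halves[0 if closer_to_first else 1].append(p)
        new_centroids = [mean(halves[0]), mean(halves[1])]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return halves


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Split clusters in two until k clusters exist (assumes distinct points, k <= n)."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Choose the cluster with the largest SSE to split.
        target = max(clusters, key=sse)
        clusters.remove(target)
        # Try several bisections and keep the one with the lowest total SSE.
        best = min((two_means(target, rng) for _ in range(trials)),
                   key=lambda halves: sse(halves[0]) + sse(halves[1]))
        clusters.extend(best)
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0),
          (0, 10), (0, 11), (1, 10), (10, 10), (10, 11), (11, 10)]
for cluster in bisecting_kmeans(points, k=4):
    print(sorted(cluster))
```

Swapping `max(clusters, key=sse)` for `max(clusters, key=len)` gives the "largest cluster" rule instead.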

Hierarchical Methods: Bisecting K-Means Example: Bisecting K-Means on the four clusters example.

Model-Based Clustering Methods: Expectation-Maximization Approach: Each cluster can be represented mathematically by a parametric probability distribution. Cluster the data using a finite mixture density model of k probability distributions, where each distribution represents a cluster. The problem is to estimate the parameters of the probability distributions so as to best fit the data.

Model-Based Clustering Methods: Expectation-Maximization Instead of assigning each object to a dedicated cluster, EM assigns each object to a cluster according to a weight representing the probability of membership. New means are computed based on weighted measures. EM Algorithm: 1. Make an initial guess of the parameter vector by randomly selecting k objects to represent the cluster means. 2. Iteratively refine the parameters (or clusters) based on the following two steps:

Model-Based Clustering Methods: Expectation-Maximization
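
In the usual formulation of EM for mixture models, the two steps are an expectation step, which computes each object's membership probability for every cluster under the current parameters, and a maximization step, which re-estimates the parameters from those weighted memberships. The sketch below shows this for a one-dimensional mixture of Gaussians. The Gaussian form, the unit initial standard deviations, the equal initial mixing weights, the small floor on the standard deviation, and the name `em_gmm_1d` are all assumptions made for illustration, not details taken from the slides.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def em_gmm_1d(data, initial_means, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns means, sigmas, weights, memberships."""
    k = len(initial_means)
    means = list(initial_means)
    sigmas = [1.0] * k       # assumption: unit initial standard deviations
    weights = [1.0 / k] * k  # assumption: equal initial mixing weights
    resp = []
    for _ in range(n_iter):
        # Expectation step: probability that each object belongs to each cluster.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], sigmas[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # Maximization step: re-estimate parameters from the weighted memberships.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
            weights[j] = nj / len(data)
    return means, sigmas, weights, resp


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
# Initial guess: two of the objects represent the cluster means.
means, sigmas, weights, resp = em_gmm_1d(data, initial_means=[0.8, 5.2])
print([round(m, 3) for m in means])
print([round(r[0], 3) for r in resp])  # membership probabilities for the first cluster
```

Unlike k-means, every object keeps a membership weight for every cluster; a hard assignment can be read off by taking the cluster with the largest weight.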

K-Means Estimation: Strength and Weakness Strength: K-Means is simple, can be used for a wide variety of data types, and is efficient even though multiple runs are often performed. Some variants, such as bisecting K-Means, are more efficient and less susceptible to initialization problems. Weakness: It cannot handle non-globular clusters or clusters of different sizes and densities.

Representative Object-Based Technique: The K-Medoids Method To determine whether a non-representative object, $o_{random}$, is a good replacement for a current representative object, $o_j$, the following four cases are examined for each of the non-representative objects, $p$

Representative Object-Based Technique: The K-Medoids Method PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced.
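
A swap-based search in the spirit of PAM can be sketched as follows. This is a simplified variant, not the published algorithm: it accepts the first swap of a medoid with a non-representative object that lowers the total absolute error, instead of evaluating the per-object cases individually. The names `total_cost` and `pam`, and the use of Euclidean distance, are assumptions.

```python
import math


def total_cost(points, medoids):
    """Absolute error: each object contributes the distance to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def pam(points, initial_medoids):
    """Repeatedly swap a medoid with a non-medoid while the total cost decreases."""
    medoids = list(initial_medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for j in range(len(medoids)):
            for o_random in points:
                if o_random in medoids:
                    continue  # only non-representative objects are candidates
                candidate = medoids[:j] + [o_random] + medoids[j + 1:]
                cost = total_cost(points, candidate)
                if cost < best:  # the swap reduces the absolute error, so keep it
                    medoids, best, improved = candidate, cost, True
    return medoids, best


points = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (8.0, 8.0), (9.0, 8.0), (10.0, 8.0)]
medoids, cost = pam(points, initial_medoids=[(1.0, 1.0), (2.0, 1.0)])
print(sorted(medoids))  # [(2.0, 1.0), (9.0, 8.0)]
print(cost)             # 4.0
```

The loop terminates because the cost strictly decreases with every accepted swap and there are only finitely many medoid sets.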

Representative Object-Based Technique: The K-Medoids Method The complexity of each iteration is $O(k(n-k)^2)$. The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method, whose complexity is $O(nkt)$, where n is the number of objects, k is the number of clusters, and t is the number of iterations.
