Clustering: Introduction Adriano Joaquim de O Cruz ©2002 NCE/UFRJ



Introduction

What is cluster analysis?
- The process of grouping a set of physical or abstract objects into classes of similar objects.
- The class label of each class is unknown.
- Classification, in contrast, separates objects into classes whose labels are known.

What is cluster analysis? (cont.)
- Clustering is a form of learning by observation, i.e. unsupervised learning.
- Neural networks, by contrast, learn by examples.

Applications
- In business, clustering helps discover distinct groups of customers.
- In data mining, it is used to gain insight into the distribution of data and to observe the characteristics of each cluster.
- As a pre-processing step for classification.
- Pattern recognition.

Requirements
- Scalability: work with large databases.
- Ability to deal with different types of attributes (not only interval-based data).
- Discovery of clusters of arbitrary shape, not only spherical ones.
- Minimal requirements for domain knowledge.
- Ability to deal with noisy data.

Requirements (cont.)
- Insensitivity to the order of input records.
- Ability to work with samples of high dimensionality.
- Constraint-based clustering.
- Interpretability and usability: results should be easily interpretable.

Sensitivity to Input Order
- Some algorithms are sensitive to the order of the input data.
- The Leader algorithm is an example.

Clustering Techniques

Heuristic Clustering Techniques
- Incomplete or heuristic clustering: geometrical methods or projection techniques.
- Dimension-reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions.
- Heuristic methods based on visualisation then determine the clusters.

Deterministic Crisp Clustering
- Each datum is assigned to exactly one cluster.
- Each cluster partition therefore defines an ordinary partition of the data set.

Overlapping Crisp Clustering
- Each datum is assigned to at least one cluster.
- Elements may belong to more than one cluster.

Probabilistic Clustering
- For each element, a probability distribution over the clusters is determined.
- The distribution specifies the probability with which a datum is assigned to each cluster.
- If the probabilities are interpreted as degrees of membership, these become fuzzy clustering techniques.

Possibilistic Clustering
- Degrees of membership or possibility indicate to what extent a datum belongs to each cluster.
- Possibilistic cluster analysis drops the constraint that the memberships of each datum over all clusters sum to one.

Hierarchical Clustering
- Descending (divisive) techniques divide the data into ever more fine-grained classes.
- Ascending (agglomerative) techniques combine small classes into more coarse-grained ones.

Objective Function Clustering
- An objective function assigns to each cluster partition a value that has to be optimised.
- Clustering then becomes, strictly, an optimisation problem.

Data Types

Data Types
- Interval-scaled variables are continuous measurements on a linear scale. Ex.: height, weight, temperature.
- Binary variables have only two states. Ex.: smoker, fever, client, owner.
- Nominal variables generalise binary variables to m states. Ex.: map colour, marital status.

Data Types (cont.)
- Ordinal variables are ordered nominal variables. Ex.: Olympic medals, professional ranks.
- Ratio-scaled variables have a non-linear scale. Ex.: the growth of a bacterial population.

Interval-scaled variables
- Interval-scaled variables are continuous measurements on a linear scale. Ex.: height, weight, temperature.
- Their values depend on the units used.
- Since the measurement unit can affect the analysis, standardisation should be applied.

Problems

Person   Age (yr)   Height (cm)
A        35         190
B        40         190
C        35         160
D        40         160

Standardisation
- Converts original measurements to unitless values.
- Attempts to give all variables equal weight.
- Useful when there is no prior knowledge of the data.

Standardisation algorithm
- Z-scores indicate how far, and in what direction, an item deviates from its distribution's mean, expressed in units of the distribution's standard deviation.
- The transformed scores have a mean of zero and a standard deviation of one.
- Useful when comparing the relative standings of items from distributions with different means and/or standard deviations.

Standardisation algorithm (cont.)
- Consider n values of a variable x.
- Calculate the mean value: mean = (1/n) * sum(x_i).
- Calculate the standard deviation: s = sqrt((1/n) * sum((x_i - mean)^2)).
- Calculate the z-score: z_i = (x_i - mean) / s.
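The steps above can be sketched in Python (a minimal illustration; the function name `z_scores` is ours, not from the slides):

```python
import math

def z_scores(values):
    """Standardise a list of numbers: subtract the mean and divide by the
    standard deviation, so the result has mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / std for x in values]

heights = [190, 190, 160, 160]   # heights in cm from the table above
print(z_scores(heights))         # → [1.0, 1.0, -1.0, -1.0]
```

After this transformation, ages and heights can be compared on an equal footing regardless of the original units.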

Z-scores example (chart)

Real heights and ages (charts)

Z-scores for heights and ages (charts)

Data charts

Similarities

Data Matrices
- Data matrix: represents n objects by p characteristics. Ex.: person = {age, sex, income, ...}.
- Dissimilarity matrix: holds the dissimilarities between all pairs of objects.

Dissimilarities
- A dissimilarity measures some form of distance between objects.
- Clustering algorithms use dissimilarities to cluster the data.
- How can dissimilarities be measured?

How to calculate dissimilarities?
- The most popular methods are based on the distance between pairs of objects.
- Minkowski distance: d(x_i, x_j) = (sum over f of |x_if - x_jf|^q)^(1/q)
- p is the number of characteristics and q is the distance type.
- q = 2 gives the Euclidean distance; q = 1 gives the Manhattan distance.
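A direct translation of the Minkowski formula into Python (a sketch; the function name `minkowski` is ours):

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two feature vectors.
    q=1 gives the Manhattan distance, q=2 the Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, q=2))   # Euclidean: 5.0
print(minkowski(a, b, q=1))   # Manhattan: 7.0
```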

Similarities
- It is also possible to work with similarities s(x_i, x_j):
- 0 <= s(x_i, x_j) <= 1
- s(x_i, x_i) = 1
- s(x_i, x_j) = s(x_j, x_i)
- A dissimilarity can then be defined as d(x_i, x_j) = 1 - s(x_i, x_j).

Adriano Cruz *NCE e IM - UFRJ Cluster 35Distances

Dissimilarities (cont.)
- There are other ways to obtain dissimilarities, so we no longer speak of distances.
- Basically, dissimilarities are non-negative numbers d(i, j) that are small (close to 0) when i and j are similar.

Pearson
- Pearson product-moment correlation between variables f and g:
  R(f, g) = sum_i (x_if - mean_f)(x_ig - mean_g) / sqrt(sum_i (x_if - mean_f)^2 * sum_i (x_ig - mean_g)^2)
- Coefficients lie between -1 and +1.

Pearson (cont.)
- A correlation of +1 means a perfect positive linear relationship between the variables.
- A correlation of -1 means a perfect negative linear relationship.
- A correlation of 0 means there is no linear relationship between the two variables.

Pearson example
- r_yz = ; r_yw = ; r_yr =

Correlation and dissimilarities 1
- d(f, g) = (1 - R(f, g)) / 2    (1)
- Variables with a high positive correlation (+1) receive a dissimilarity close to 0.
- Variables with a strongly negative correlation are considered very dissimilar.

Correlation and dissimilarities 2
- d(f, g) = 1 - |R(f, g)|    (2)
- Under this measure, variables with a high positive correlation (+1) and variables with a strongly negative correlation (-1) both receive a dissimilarity close to 0.
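The two correlation-based dissimilarities can be sketched as follows (an illustration with made-up data; the names `pearson`, `diss1` and `diss2` are ours):

```python
import math

def pearson(f, g):
    """Pearson product-moment correlation between two variables."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    num = sum((a - mf) * (b - mg) for a, b in zip(f, g))
    den = math.sqrt(sum((a - mf) ** 2 for a in f) * sum((b - mg) ** 2 for b in g))
    return num / den

def diss1(f, g):
    """Equation (1): negatively correlated variables are very dissimilar."""
    return (1 - pearson(f, g)) / 2

def diss2(f, g):
    """Equation (2): strong correlation of either sign means similar."""
    return 1 - abs(pearson(f, g))

y = [1, 2, 3, 4]
z = [2, 4, 6, 8]   # perfectly positively correlated with y
w = [8, 6, 4, 2]   # perfectly negatively correlated with y
print(diss1(y, z), diss1(y, w))   # → 0.0 1.0
print(diss2(y, z), diss2(y, w))   # → 0.0 0.0
```

Note how equation (2) treats the anti-correlated pair (y, w) as similar, while equation (1) treats it as maximally dissimilar.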

Numerical Example
A table of eight people (Ilan, Jack, Kim, Lieve, Leon, Peter, Talia, Tina) with Weight, Height, Month and Year attributes (values lost in the transcript).


Numerical Example (cont.)
The correlation matrix of the four variables (Weight, Height, Month, Year) and the corresponding dissimilarity matrices computed with equations (1) and (2) (values lost in the transcript).

Binary Variables
- Binary variables have only two states.
- The states can be symmetric or asymmetric.
- A binary variable is symmetric if both states are equally valuable. Ex.: gender.
- When the states are not equally important, the variable is asymmetric. Ex.: disease tests (1 = positive; 0 = negative).

Contingency tables
- Consider objects described by p binary variables:
- q variables are equal to 1 on both i and j;
- r variables are 1 on object i and 0 on object j;
- s variables are 0 on object i and 1 on object j;
- t variables are 0 on both objects.

Symmetric Variables
- Dissimilarity based on symmetric variables is invariant: the result should not change when the two states are interchanged.
- Simple matching dissimilarity coefficient: d(i, j) = (r + s) / (q + r + s + t)

Symmetric Variables (cont.)
- Dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
- Similarity: s(i, j) = (q + t) / (q + r + s + t)

Asymmetric Variables
- Similarity based on asymmetric variables is not invariant.
- Two matching ones are more important than two matching zeros, so t is dropped.
- Jaccard dissimilarity coefficient: d(i, j) = (r + s) / (q + r + s)
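Both binary coefficients can be computed from the contingency counts (a sketch; the 0/1 encodings of Jack and Mary below follow the contingency table later in this section, with the Test3 and Test4 rows assumed from the visible pattern):

```python
def binary_counts(i, j):
    """Contingency counts for two objects described by 0/1 variables:
    q: both 1; r: i=1, j=0; s: i=0, j=1; t: both 0."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def simple_matching(i, j):
    """Dissimilarity for symmetric binary variables."""
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def jaccard(i, j):
    """Dissimilarity for asymmetric binary variables: matching zeros ignored."""
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s)

# Fever, Cough, Test1..Test4 coded as 1 (Y/positive) or 0 (N/negative)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print(jaccard(jack, mary))   # → 0.333...
```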

Adriano Cruz *NCE e IM - UFRJ Cluster 50 Computing dissimilarities

Computing Dissimilarities

Variable   Jack   Mary   q (1,1)   r (1,0)   s (0,1)   t (0,0)
Fever      Y      Y      1         0         0         0
Cough      N      N      0         0         0         1
Test1      P      P      1         0         0         0
Test2      N      N      0         0         0         1
Test3      N      P      0         0         1         0
Test4      N      N      0         0         0         1

Computing dissimilarities (cont.)
Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.

Nominal Variables
- A nominal variable is a generalisation of the binary variable: it can take more than two states.
- Ex.: marital status: married, single, divorced.
- Each state can be represented by a number or a letter.
- There is no specific ordering among the states.

Computing dissimilarities for nominal variables
- Consider two objects i and j described by nominal variables.
- Each object has p characteristics; m is the number of matches.
- d(i, j) = (p - m) / p
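The matching-based formula is one line of Python (a sketch with illustrative data; the function name is ours):

```python
def nominal_dissimilarity(i, j):
    """d(i, j) = (p - m) / p, where p is the number of variables and
    m the number of states that match between the two objects."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

a = ["married", "blue", "engineer"]
b = ["single", "blue", "engineer"]
print(nominal_dissimilarity(a, b))   # 1 mismatch out of 3 → 0.333...
```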

Binarising nominal variables
- A nominal variable can be encoded by creating a new binary variable for each state.
- Example: marital status = {married, single, divorced}:
  - married: 1 = yes, 0 = no
  - single: 1 = yes, 0 = no
  - divorced: 1 = yes, 0 = no
- Ex.: marital status = married gives married = 1, single = 0, divorced = 0.
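This encoding (often called one-hot encoding) can be sketched as (the function name `binarise` is ours):

```python
def binarise(value, states):
    """Encode one nominal value as a set of binary indicator variables."""
    return {state: 1 if value == state else 0 for state in states}

states = ["married", "single", "divorced"]
print(binarise("married", states))
# → {'married': 1, 'single': 0, 'divorced': 0}
```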

Ordinal variables
- A discrete ordinal variable is similar to a nominal variable, except that its states are ordered in a meaningful sequence.
- Ex.: bronze, silver and gold medals.
- Ex.: assistant, associate and full member.

Computing dissimilarities for ordinal variables
- Consider n objects defined by a set of ordinal variables.
- f is one of these ordinal variables and has M_f states.
- These states define the ranking r_f in {1, ..., M_f}.

Steps to calculate dissimilarities
- Assume that the value of f for the i-th object is x_if. Replace each x_if by its corresponding rank r_if in {1, ..., M_f}.
- Since the number of states differs from variable to variable, it is often necessary to map the ranks onto [0.0, 1.0] using z_if = (r_if - 1) / (M_f - 1).
- Dissimilarity can then be computed using the distance measures for interval-scaled variables.
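The rank-normalisation step can be sketched as (the function name is ours; the medal ranks are an illustrative assumption):

```python
def ordinal_to_interval(rank, m_states):
    """Map a rank r in {1, ..., M} onto [0.0, 1.0] via z = (r - 1) / (M - 1)."""
    return (rank - 1) / (m_states - 1)

# Olympic medals: bronze = 1, silver = 2, gold = 3 (M = 3 states)
print([ordinal_to_interval(r, 3) for r in (1, 2, 3)])   # → [0.0, 0.5, 1.0]
```

After this mapping, the normalised ranks can be fed to any interval-scaled distance, such as the Minkowski distance above.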

Ratio-scaled variables
- Variables on a non-linear scale, such as an exponential scale.
- To compute dissimilarities there are three methods:
  - Treat them as interval-scaled (not always good).
  - Apply a transformation such as y = log(x) and treat the result as interval-scaled.
  - Treat them as ordinal data and take the ranks as interval-scaled.

Variables of mixed types
- One technique is to bring all variables onto a common scale, the interval [0.0, 1.0].
- Suppose that the data set contains p variables of mixed type.

Variables of mixed types (cont.)
- The dissimilarity between i and j is
  d(i, j) = sum_{f=1..p} delta_ij^(f) d_ij^(f) / sum_{f=1..p} delta_ij^(f)
  where delta_ij^(f) = 0 if x_if or x_jf is missing, and 1 otherwise.

Variables of mixed types (cont.)
- The contribution d_ij^(f) of each variable depends on its type:
- f binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and 1 otherwise.
- f interval-based: d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf).
- f ordinal or ratio-scaled: compute ranks and treat them as interval-based.
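A minimal sketch of this mixed-type (Gower-style) dissimilarity, assuming only nominal and interval variables for brevity (the function name, the `types`/`ranges` parameters and the example data are ours):

```python
def mixed_dissimilarity(i, j, types, ranges=None):
    """Weighted-average dissimilarity over variables of mixed type.
    types[f] is 'nominal' or 'interval'; ranges[f] = max - min for each
    interval variable; None values are treated as missing (delta = 0)."""
    num = den = 0.0
    for f, kind in enumerate(types):
        if i[f] is None or j[f] is None:
            continue                       # delta_f = 0: skip missing values
        if kind == 'nominal':
            d = 0.0 if i[f] == j[f] else 1.0
        else:                              # interval-scaled contribution
            d = abs(i[f] - j[f]) / ranges[f]
        num += d
        den += 1.0
    return num / den

types = ['nominal', 'interval']
ranges = {1: 30.0}                         # height range over the data set
a = ['smoker', 190.0]
b = ['smoker', 160.0]
print(mixed_dissimilarity(a, b, types, ranges))   # → 0.5
```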

Clustering Methods

Classification types
- Clustering is an unsupervised method.

Clustering Methods
- Partitioning
- Hierarchical
- Density-based
- Grid-based
- Model-based

Partitioning Methods
- Given n objects, k partitions are created.
- Each partition must contain at least one element.
- An iterative relocation technique is used to improve the partitioning.
- Distance is the usual criterion.

Partitioning Methods (cont.)
- They work well for finding spherical-shaped clusters.
- They are not efficient on very large databases.
- K-means: each cluster is represented by the mean value of its objects.
- K-medoids: each cluster is represented by an object near the centre of the cluster.
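The iterative relocation idea behind k-means can be sketched as follows (a minimal illustration, not a production implementation; the function name and the toy data are ours):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means sketch: repeatedly assign each point to the nearest
    centre, then move each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centres[c])))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:   # keep the old centre if a cluster went empty
                centres[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centres, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))   # → [3, 3]
```

With two well-separated groups of three points, the relocation step converges to one centre per group within a few iterations.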

Hierarchical Methods
- Create a hierarchical decomposition of the data set.
- Agglomerative approaches start with each object forming a separate group and merge objects or groups until all objects belong to one group or a termination condition occurs.
- Divisive approaches start with all objects in one cluster; each successive iteration splits a cluster until every object is in its own cluster or a termination condition occurs.

Hierarchical Clustering (cont.)
- Definition of cluster proximity:
- Min: distance between the most similar pair (sensitive to noise).
- Max: distance between the most dissimilar pair (tends to break large clusters).

Density-based methods
- A cluster keeps growing as long as the density (number of objects) in its neighbourhood exceeds some threshold.
- Able to find clusters of arbitrary shape.

Grid-based methods
- Grid methods divide the object space into a finite number of cells, forming a grid-like structure.
- Cells that contain more than a certain number of elements are treated as dense.
- Dense cells are connected to form clusters.
- Fast processing time, independent of the number of objects.
- STING and CLIQUE are examples.

Model-based methods
- Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model.
- Examples: statistical models, SOM networks.

Partition methods
- Given a database of n objects, a partitioning method organises them into k clusters (k <= n).
- The methods try to minimise an objective function, such as total distance.
- Similar objects end up "close" to each other.