CLUSTERING (Segmentation)


1 CLUSTERING (Segmentation)
Saed Sayad

2 Data Mining Steps
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment
www.ismartsoft.com

3 What is Clustering?
Given a set of records, organize the records into clusters. A cluster is a subset of records that are similar to one another.
[Figure: scatter plot of records with axes Age and Income]

4 Clustering Requirements
The ability to discover some or all of the hidden clusters.
Within-cluster similarity and between-cluster dissimilarity.
The ability to deal with various types of attributes.
The ability to deal with noise and outliers.
The ability to handle high dimensionality.
Scalability, interpretability, and usability.

5 Similarity - Distance Measure
To measure similarity or dissimilarity between objects, we need a distance measure. The usual axioms for a distance measure D are:
D(x, x) = 0
D(x, y) = D(y, x)
D(x, y) ≤ D(x, z) + D(z, y) (the triangle inequality)

6 Similarity - Distance Measure
Euclidean Manhattan Minkowski
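
The three measures named on this slide can be sketched as one function, since Manhattan and Euclidean are the Minkowski distance with order p = 1 and p = 2. This is a minimal illustration, not the author's code; the function names and the sample (age, income) records are assumptions made for the example.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """Manhattan distance: Minkowski with p = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)


# Hypothetical records: (age, income in thousands)
a = (25, 40)
b = (28, 44)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
```

In practice, attributes on very different scales (such as age and income) are usually standardized first so that one attribute does not dominate the distance.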

7 Similarity - Correlation
[Figure: two panels plotting Credit$ against Age, one of them labeled "Dissimilar"]
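
The slide does not say which correlation measure is meant. As a sketch under the assumption that it is the Pearson correlation coefficient, the similarity of two numeric profiles could be computed as follows; the function name and the data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical data: credit amounts observed at four ages
ages = [25, 35, 45, 55]
credit_rising = [1000, 2000, 3000, 4000]
credit_falling = [4000, 3000, 2000, 1000]

print(pearson(ages, credit_rising))   # close to +1: same trend
print(pearson(ages, credit_falling))  # close to -1: opposite trend
```

A value such as `1 - pearson(x, y)` is one common way to turn the correlation into a dissimilarity, although it does not satisfy all the distance axioms in general.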

8 Similarity – Hamming Distance
Gene 1: A T C G
Gene 2: [sequence not preserved in the transcript]
Hamming Distance: 1
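
The Hamming distance counts the positions at which two equal-length sequences differ. A minimal sketch follows. Because the second gene sequence is missing from the transcript, the value `ATCC` below is a hypothetical stand-in chosen only so that the distance comes out as 1.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))


gene_1 = "ATCG"
gene_2 = "ATCC"  # hypothetical: the slide's actual second sequence is not available
print(hamming_distance(gene_1, gene_2))  # 1
```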

9 Clustering Methods
Exclusive vs. Overlapping
Hierarchical vs. Partitive
Deterministic vs. Probabilistic
Incremental vs. Batch learning

10 Exclusive vs. Overlapping
[Figure: two scatter plots with axes Age and Income]

11 Hierarchical vs. Partitive
[Figure: scatter plot with axes Age and Income]

12 Hierarchical Clustering
Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on a hard disk are organized in a hierarchy. There are two types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down).

13 Hierarchical Clustering
Agglomerative Divisive

14 Hierarchical Clustering - Agglomerative
1. Assign each observation to its own cluster.
2. Compute the similarity (e.g., distance) between each pair of clusters.
3. Join the two most similar clusters.
4. Repeat steps 2 and 3 until only a single cluster is left.
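
The agglomerative steps can be sketched in plain Python. This is an illustrative, unoptimized version that assumes Euclidean distance and, by default, takes the distance between two clusters to be the smallest distance between their members. The function name, the `linkage` parameter, and the sample points are all assumptions made for the example.

```python
def agglomerative(points, linkage=min):
    """Merge clusters bottom-up; return the merges as (cluster_a, cluster_b, distance)."""

    def point_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_distance(c1, c2):
        # Combine all member-to-member distances with the linkage function.
        return linkage(point_distance(points[i], points[j]) for i in c1 for j in c2)

    # Step 1: every observation starts in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Step 2: compute the distance between every pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 3: join the two closest clusters.
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical 2-D observations
data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for left, right, dist in agglomerative(data):
    print(left, right, round(dist, 3))
```

The list of merges, read in order, is the information a dendrogram displays. Passing `linkage=max` switches the cluster distance to the largest member-to-member distance.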

15 Hierarchical Clustering - Divisive
1. Assign all of the observations to a single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively on each cluster until there is one cluster for each observation.

16 Hierarchical Clustering – Single Linkage

17 Hierarchical Clustering – Complete Linkage

18 Hierarchical Clustering – Average Linkage
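
The three linkage slides carry only titles in this transcript. Under the standard definitions (single = shortest, complete = longest, average = mean distance between members of the two clusters), they can be sketched as follows. The function names and sample clusters are assumptions, and `math.dist` requires Python 3.8 or later.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def pairwise_distances(cluster_a, cluster_b):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Shortest distance between members of the two clusters."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Longest distance between members of the two clusters."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean distance over all pairs of members of the two clusters."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Hypothetical clusters on a line: pairwise distances are 4, 6, 3 and 5
left = [(0, 0), (1, 0)]
right = [(4, 0), (6, 0)]
print(single_linkage(left, right))    # 3.0
print(complete_linkage(left, right))  # 6.0
print(average_linkage(left, right))   # 4.5
```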

19 K Means Clustering
Clusters the data into k groups, where k is predefined.
1. Select k points at random as cluster centers.
2. Assign each observation to its closest cluster center according to the Euclidean distance function.
3. Calculate the centroid, or mean, of all instances in each cluster (this is the "means" part) and use it as the new cluster center.
4. Repeat steps 2 and 3 until the same points are assigned to each cluster in consecutive rounds.
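
A minimal sketch of this procedure in plain Python is shown below. It is illustrative rather than production code: the function name, the `init` and `seed` parameters, the iteration cap, and the sample data are assumptions, and empty clusters simply keep their previous center.

```python
import random


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Select k points at random as cluster centers (or use the ones supplied).
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each observation to its closest center (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Stop when the assignments are the same in consecutive rounds.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each center as the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, labels


# Hypothetical (age, income) style data with two obvious groups
data = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2, init=[(1, 1), (10, 10)])
print(labels)
print(centers)
```

Because the starting centers are random, different runs can converge to different clusterings; running the algorithm several times and keeping the result with the smallest sum of squares is a common remedy.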

20 K Means Clustering
[Figure: scatter plot with axes Income and Age]

21 K Means Clustering
Sum of Squares function
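
The formula from this slide is not preserved in the transcript. The within-cluster sum of squares that k-means is usually described as minimizing can be written as below. The symbols are assumptions introduced here: C_j is the set of observations in cluster j, c_j is its centroid, and x_i is an observation.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^{2},
\qquad
c_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Each assignment step and each centroid update can only lower J or leave it unchanged, which is why the procedure converges, although possibly to a local minimum.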

22 Clustering Evaluation
Sarle’s Cubic Clustering Criterion
The Pseudo-F Statistic
The Pseudo-T² Statistic
Beale’s F-Type Statistic
Target-based

23 Clustering Evaluation
Target Variable
Categorical: Chi2 Test, K-S Test
Numerical: ANOVA, H Test

24 Chi2 Test
              | Actual Y | Actual N
Predicted Y   | n11      | n12
Predicted N   | n21      | n22
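
For a 2×2 table like this one, the Pearson chi-square statistic compares each observed count with the count expected if rows and columns were independent. The sketch below assumes no continuity correction and non-zero row and column totals; the function name and the counts are hypothetical.

```python
def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic for a 2x2 contingency table (no continuity correction)."""
    observed = [[n11, n12], [n21, n22]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts for n11, n12, n21, n22
print(round(chi_square_2x2(10, 20, 30, 40), 4))  # 0.7937
```

The statistic is compared against a chi-square distribution with 1 degree of freedom; a larger value indicates a stronger association between cluster membership and the target.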

25 Analysis of Variance (ANOVA)
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square   | F           | P
Between Groups      | SSB            | dfB                | MSB = SSB/dfB | F = MSB/MSW | P(F)
Within Groups       | SSW            | dfW                | MSW = SSW/dfW |             |
Total               | SST            | dfT                |               |             |
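
The entries of the ANOVA table can be computed directly from the target values grouped by cluster. The sketch below follows the table's notation, with dfB = number of groups − 1 and dfW = number of observations − number of groups (the standard one-way ANOVA values). It stops at the F statistic, because the p-value P(F) needs an F-distribution routine that the Python standard library does not provide. The function name and data are hypothetical.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA table entries for a list of numeric groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-groups and within-groups sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_b = len(groups) - 1
    df_w = n_total - len(groups)
    msb = ssb / df_b
    msw = ssw / df_w
    return {
        "SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
        "dfB": df_b, "dfW": df_w, "dfT": n_total - 1,
        "MSB": msb, "MSW": msw, "F": msb / msw,
    }


# Hypothetical numerical target values in three clusters
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table["SSB"], table["SSW"], table["F"])
```

A large F means the cluster means differ by more than the within-cluster spread would explain, which suggests the clustering separates the target well.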

26 Clustering - Applications
Marketing: finding groups of customers with similar behavior.
Insurance & Banking: identifying fraud.
Biology: classification of plants and animals given their features.
Libraries: book ordering.
City-planning: identifying groups of houses according to their house type, value and geographical location.
World Wide Web: document classification; clustering weblog data to discover groups with similar access patterns.

27 Summary
Clustering is the process of organizing objects (records or variables) into groups whose members are similar in some way.
Hierarchical and K-Means are two of the most widely used clustering techniques.
The effectiveness of a clustering method depends on the similarity function.
The result of a clustering algorithm can be interpreted and evaluated in different ways.

28 Questions?

