1 Core Techniques: Cluster Analysis
Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English).

2 Example: Customer Segmentation
• Given: a large database of customer data containing customer properties and past buying records.
• Find groups of customers with similar behavior (clusters).
• Find customers with unusual behavior (outliers).

3 Problem Definition
• Given: a set of N items in D dimensions.
• Find: a natural partitioning of the data set into a number of clusters (k) plus outliers, such that:
  ◦ items in the same cluster are similar → intra-cluster similarity is maximized
  ◦ items from different clusters are different → inter-cluster similarity is minimized
• No predefined classes: unsupervised learning.
• Used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms.

4 Clustering: Many Methods
• Partitioning methods
  ◦ k-means, k-medoids
• Hierarchical methods
  ◦ Agglomerative/divisive, BIRCH, CURE
• Linkage-based methods
• Density-based methods
  ◦ DBSCAN, DENCLUE
• Statistical methods
  ◦ IBM-IM demographic clustering, COBWEB
With different strengths and objectives.

5 Differences Among Clustering Methods
• Notion of distance between $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$: $\left(|x_1 - y_1|^q + \dots + |x_n - y_n|^q\right)^{1/q}$
  ◦ Euclidean: $q = 2$
  ◦ Manhattan: $q = 1$
• Distance from the center, or from neighbors (density-based)?
• The dimensionality curse.
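The distance formula on this slide can be sketched in a few lines of Python. The function name `minkowski_distance` and the sample vectors are illustrative choices, not part of the original slides.

```python
def minkowski_distance(x, y, q):
    """Distance between two equal-length sequences for order q >= 1.

    q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    # Sum |x_i - y_i|^q over all dimensions, then take the q-th root.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical sample vectors, chosen only for illustration.
x = [0.0, 0.0]
y = [3.0, 4.0]
print("Euclidean:", minkowski_distance(x, y, 2))
print("Manhattan:", minkowski_distance(x, y, 1))
```

The same pair of points is farther apart under the Manhattan distance than under the Euclidean one, which is one reason the choice of q can change which items end up in the same cluster.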

6 Example Data Sets
• Outliers are clear (or are they noise?)
• Should we cluster according to distance from a centroid or by the density of the neighborhood?

7 Partition to Minimize Distances from Centers
• People of similar age, income, education level…
• Cluster and partition to minimize the cost of distribution or utilities in a flat location

8 K-Means
K-means (MacQueen, 1967) is one of the simplest clustering algorithms for minimizing distance from centers.
1. Place K points into the space represented by the objects being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
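The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance, picks the initial centroids by sampling K data points at random, and keeps the old centroid if a cluster becomes empty. The names `k_means`, `squared_distance`, and `mean_point` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of points (max_iter >= 1)."""
    rng = random.Random(seed)
    # Step 1: choose k distinct data points as the initial centroids
    # (an assumption; the slide only says to place K points in the space).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 3: recalculate the position of each centroid.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

On data this well separated the result does not depend on which two points are drawn first. On harder data it can, which is the sensitivity to initialization discussed on the following slides.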

9 K-Means (cont.)
• The procedure will always terminate,
• but not always in the optimal configuration;
• the result is sensitive to the initial, randomly selected cluster centers.
• Many variations and improvements exist.

10 Clusters Example (5 pairs)
Starting with two initial centroids in one cluster of each pair of clusters.

10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.

Solutions to Initial Centroids Problem
• Multiple runs
  ◦ Helps, but probability is not on your side
• Start with more than k initial centroids and then select k centroids from the most widely separated resulting clusters
• Use hierarchical clustering to determine initial centroids on a small sample of data
• Bisecting K-means
  ◦ Not as susceptible to initialization issues
• Postprocessing
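The "multiple runs" option from the list above can be sketched as follows: run K-means several times from different random starting centroids and keep the run with the lowest sum of squared errors (SSE). Using SSE as the selection criterion is an assumption here, and the names `run_k_means` and `best_of_n_runs` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def run_k_means(points, k, rng, max_iter=100):
    """One K-means run from random initial centroids; returns (centroids, sse)."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Group the points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # SSE: squared distance of every point to its closest centroid.
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def best_of_n_runs(points, k, n_runs=10, seed=0):
    """Repeat K-means n_runs times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    runs = [run_k_means(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[1])


# Hypothetical toy data with three groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
centroids, sse = best_of_n_runs(data, k=3)
print(centroids, sse)
```

Keeping the best of several runs can only match or improve on a single run, but as the slide notes, with many clusters the chance that any one run starts with a centroid in every cluster is small.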

15 Partition to Minimize Distance from Neighbors: Density-Based Clustering
• A natural model for describing the spreading of information or diseases
• Finding frequent trajectories, e.g. from cell-phone calls or RFID data

16 DBSCAN Algorithm: Density Concepts
• Two global parameters:
  ◦ Eps: maximum radius of the neighborhood
  ◦ MinPts: minimum number of points in the Eps-neighborhood of a point
• Core object: an object with at least MinPts objects within its Eps-neighborhood (e.g. q)
• Border object: an object on the border of a cluster (e.g. p)
Figure: points p and q, with MinPts = 5 and Eps = 1 cm.

17 DBSCAN: The Algorithm
• Arbitrarily select a point p.
  ◦ Retrieve all points density-reachable from p with respect to Eps and MinPts.
  ◦ If p is a core point, a cluster is formed; repeat this process for all points density-reachable from p.
  ◦ If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
• Repeat the process until all of the points have been processed.
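The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions: neighborhoods are found by brute force over all points (no spatial index), a point counts as a member of its own Eps-neighborhood, and points that are not reachable from any core point are labeled -1 as noise. The names `region_query`, `dbscan`, and `NOISE` are hypothetical.

```python
NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    return [
        j
        for j, q in enumerate(points)
        if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps
    ]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; it may still become a border point later.
            labels[i] = NOISE
            continue
        # points[i] is a core point, so a new cluster is formed.
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reachable from a core point: border
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # points[j] is also a core point: keep expanding through it.
                seeds.extend(j_neighbors)
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))
```

Because every neighborhood lookup scans all points, this sketch runs in quadratic time; a spatial index would replace `region_query` in a practical implementation.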

18 DBSCAN Summary
• The density-based algorithm DBSCAN can discover clusters of arbitrary shape.
• An R*-tree spatial index reduces the time complexity from $O(n^2)$ to $O(n \log n)$.
• Not suitable for higher dimensions: dimensionality curse.

19 The Dimensionality Curse
• Adding a dimension stretches the points across that dimension:
  ◦ High-dimensional data is extremely sparse
  ◦ Distance measures become meaningless because all points end up nearly equidistant
• Special algorithms based on dimensionality reduction and subspace clustering are used.
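The equidistance effect can be observed with a small experiment. The sketch below assumes points drawn uniformly at random from the unit hypercube and measures the relative gap between the farthest and the nearest point from a random query point; the function name `relative_contrast` and the sample sizes are illustrative choices.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """(max distance - min distance) / min distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    return (max(distances) - min(distances)) / min(distances)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(200, dim), 3))
```

As the dimension grows, the nearest and farthest neighbors sit at almost the same distance, so "closest centroid" and "Eps-neighborhood" lose much of their discriminating power.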
