1 Core Techniques: Cluster Analysis
Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English).

Presentation transcript:

1 Core Techniques: Cluster Analysis
Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English).

2 Example: Customer Segmentation
- Given: a large database of customer data containing their properties and past buying records.
- Find groups of customers with similar behavior (clusters).
- Find customers with unusual behavior (outliers).

3 Problem Definition
- Given: a set of N items in D dimensions.
- Find: a natural partitioning of the data set into a number of clusters (k) plus outliers, such that:
  - items in the same cluster are similar: intra-cluster similarity is maximized;
  - items from different clusters are different: inter-cluster similarity is minimized.
- No predefined classes: this is unsupervised learning.
- Used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms.

4 Clustering: Many Methods
- Partitioning methods
  - k-means, k-medoids
- Hierarchical methods
  - Agglomerative/divisive, BIRCH, CURE
- Linkage-based methods
- Density-based methods
  - DBSCAN, DENCLUE
- Statistical methods
  - IBM-IM demographic clustering, COBWEB
With different strengths and objectives.

5 Differences Among Clustering Methods
- Notion of distance between $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$: $d(X, Y) = \left( |x_1 - y_1|^q + \dots + |x_n - y_n|^q \right)^{1/q}$
  - Euclidean: q = 2
  - Manhattan: q = 1
- Distance from the center, or from neighbors (density-based)?
- The dimensionality curse.
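The distance formula on this slide can be sketched in a few lines of Python. The function name `minkowski_distance` and the sample points are illustrative assumptions, not part of the original slides; only the formula itself comes from the slide.

```python
def minkowski_distance(x, y, q):
    """Distance of order q between two equal-length numeric sequences.

    Implements (|x_1 - y_1|^q + ... + |x_n - y_n|^q)^(1/q).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical sample points, chosen only for illustration.
x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(x, y, 1))  # Manhattan distance (q = 1)
print(minkowski_distance(x, y, 2))  # Euclidean distance (q = 2)
```

The same two points are farther apart under q = 1 than under q = 2, which is one reason the choice of q can change which items end up in the same cluster.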

6 Example Data Sets
- Outliers are clear (or are they noise?)
- Should we cluster according to distance from a centroid or by the density of each point's neighborhood?

7 Partition to Minimize Distances from Centers
- People of similar age, income, education level…
- Cluster and partition to minimize the cost of distribution or utilities in a flat location

8 K-Means
K-means (MacQueen, 1967) is one of the simplest clustering algorithms for minimizing distance from centers.
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
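The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: initial centroids are drawn from the data objects themselves, squared Euclidean distance is used, an empty cluster keeps its previous centroid, and the names `k_means`, `squared_distance`, `seed`, and `max_iter` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) for a list of points given as tuples."""
    rng = random.Random(seed)
    # Step 1: place K initial centroids (assumption: sample K of the objects).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once assignments are stable, so centroids no longer move.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recalculate the position of each centroid.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical toy data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(data, k=2)
print(centroids)
print(assignment)
```

The metric to be minimized can then be computed by summing `squared_distance(p, centroids[a])` over all points and their assigned clusters.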

9 K-Means (cont.)
- The procedure will always terminate, but not always in the optimal configuration: the result is sensitive to the initial randomly selected cluster centers.
- Many variations and improvements exist.

10 Clusters Example (5 pairs) Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example Starting with some pairs of clusters having three initial centroids, while others have only one.

Solutions to Initial Centroids Problem
- Multiple runs
  - Helps, but probability is not on your side
- Start with more than k initial centroids and then select k centroids from the most widely separated resulting clusters
- Use hierarchical clustering to determine initial centroids on a small sample of data
- Bisecting K-means
  - Not as susceptible to initialization issues
- Postprocessing
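The first remedy in the list, multiple runs, might look like the sketch below: run K-means several times from different random starting centroids and keep the run with the lowest sum of squared errors (SSE). The helper names (`k_means_once`, `sse`, `best_of_runs`) and the choice of SSE as the selection criterion are assumptions made for illustration; the slide does not specify them.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means_once(points, k, rng, max_iter=100):
    """One K-means run started from k randomly sampled data points."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # empty cluster keeps its centroid
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat K-means `runs` times and return (cost, centroids, labels) of the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        centroids, labels = k_means_once(points, k, rng)
        cost = sse(points, centroids, labels)
        if best is None or cost < best[0]:
            best = (cost, centroids, labels)
    return best
```

As the slide warns, repeated runs only raise the chance of a good start; with many clusters, the probability that any single run places one initial centroid in every true cluster can remain small.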

15 Partition to Minimize Distance from Neighbors: Density-Based Clustering
- A natural model for describing the spreading of information or diseases
- Finding frequent trajectories, e.g. from cell-phone calls or RFID data

16 DBSCAN Algorithm: Density Concepts
- Two global parameters:
  - Eps: maximum radius of the neighborhood
  - MinPts: minimum number of points in the Eps-neighborhood of a point
- Core object: an object with at least MinPts objects within its Eps-neighborhood (e.g. q)
- Border object: an object on the border of a cluster (e.g. p)
[Figure: core object q and border object p, with MinPts = 5 and Eps = 1 cm]
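These density concepts can be made concrete with a short Python sketch that labels every point as core, border, or noise. It rests on assumptions the slide leaves open: Euclidean distance, an Eps-neighborhood that counts the point itself, and a border object defined as a non-core point lying within Eps of some core point. The function names are hypothetical.

```python
import math


def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    neighborhoods = [eps_neighborhood(points, i, eps) for i in range(len(points))]
    # Core objects have at least min_pts points in their Eps-neighborhood.
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but next to a core object
        else:
            labels.append("noise")
    return labels


# Hypothetical data: a center with four neighbors, one straggler, one far point.
pts = [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0), (2, 0), (10, 10)]
print(classify_points(pts, eps=1, min_pts=5))
```

With MinPts = 5 and Eps = 1, only the center point has enough neighbors to be a core object; its four direct neighbors are border objects, and the remaining two points are noise.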

17 DBSCAN: The Algorithm
- Arbitrarily select a point p.
  - Retrieve all points density-reachable from p with respect to Eps and MinPts.
  - If p is a core point, a cluster is formed; repeat this process for all points density-reachable from p.
  - If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
- Repeat the process until all of the points have been processed.
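The procedure on this slide could be sketched as follows. This is an illustrative version, not the reference implementation: it uses a linear scan for every neighborhood query (so it runs in quadratic time, with no spatial index), assumes Euclidean distance, and counts a point as a member of its own Eps-neighborhood. The names `dbscan`, `region_query`, and `NOISE` are hypothetical.

```python
import math
from collections import deque

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (linear scan, includes i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet processed"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; it may be relabeled later as a border point.
            labels[i] = NOISE
            continue
        # p is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
        cluster_id += 1
    return labels


# Hypothetical data: two dense squares and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))
```

Replacing `region_query` with a lookup in a spatial index is the usual way to avoid the quadratic cost of the linear scan.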

18 DBSCAN Summary
- DBSCAN is a density-based algorithm that can discover clusters of arbitrary shape.
- An R*-tree spatial index reduces the time complexity from O(n^2) to O(n log n).
- Not suitable for higher dimensions because of the dimensionality curse.

19 The Dimensionality Curse
- Adding a dimension stretches the points across that dimension:
  - High-dimensional data is extremely sparse
  - Distance measures become meaningless because points tend to be equidistant
- Special algorithms based on dimensionality reduction and subspace clustering are used.
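The equidistance effect can be observed with a small experiment. The sketch below is an illustration under assumptions that are not in the slides: points are drawn uniformly at random from the unit hypercube, distances are Euclidean, and "relative contrast" is defined here as (farthest - nearest) / nearest distance from a random query point. The function name and parameters are hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# Print the contrast for a few dimensionalities; exact values depend on the seed.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the farthest point is many times farther away than the nearest one. As the dimension grows, the contrast shrinks toward zero, so "nearest" and "farthest" neighbors become hard to tell apart.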