1 Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects. A popular family is the Minkowski distance:

d(i, j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^q \right)^{1/q}

where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance.

2 Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:

d(i, j) = \sqrt{ \sum_{f=1}^{p} |x_{if} - x_{jf}|^2 }

Properties:
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)  (triangle inequality)
Also, one can use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
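As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the Minkowski distance; with q = 1 it reduces to the Manhattan distance and with q = 2 to the Euclidean distance:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points (q >= 1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

i = [1.0, 2.0, 3.0]
j = [4.0, 6.0, 3.0]
print(minkowski(i, j, 1))  # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(i, j, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```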

3 Nominal/Categorical Variables
Categorical variables can take more than two states, e.g., red, yellow, blue, green.
Method 1: Simple matching, d(i, j) = (p - m) / p, where m is the number of matches and p is the total number of variables. Applied to sets, as in the example below, this is the Jaccard distance (p = size of the union, m = size of the intersection):
A = {bread, apple, banana}, B = {banana, milk, cheese, bread}
d(A, B) = (5 - 2) / 5 = 0.6
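A small sketch of the set-based version used in the example above (the function name jaccard_distance is mine, not from the slides):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: (|A ∪ B| - |A ∩ B|) / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / len(union)

A = {"bread", "apple", "banana"}
B = {"banana", "milk", "cheese", "bread"}
print(jaccard_distance(A, B))  # (5 - 2) / 5 = 0.6
```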

4 Example

Stu-id | Course Grade
S1     | A
S2     | B
S3     | C
S4     | A

Distance matrix (simple matching on the single Grade variable):

     s1  s2  s3  s4
s1    0
s2    1   0
s3    1   1   0
s4    0   1   1   0

5 Major Clustering Approaches
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion; top-down or bottom-up.
- Density-based approach: based on connectivity and density functions; good for clusters of arbitrary shape.
- Grid-based approach: quantize the object space into cells.

6 Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances

E = \sum_{m=1}^{k} \sum_{p \in C_m} d(p, c_m)^2

where c_m is the representative point of cluster C_m.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
- Global optimum: exhaustively enumerate all partitions (intractable in practice).
- Heuristic methods: the k-means and k-medoids algorithms.
  - k-means (MacQueen'67): each cluster is represented by the center (mean) of the cluster.
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster.

7 The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when the assignments no longer change.
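These steps translate directly into code. Below is a minimal NumPy sketch (names and defaults are illustrative, not from the slides); it also reports the sum-of-squared-errors criterion from slide 6:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, p) float array X; returns labels, centroids, SSE."""
    rng = np.random.default_rng(seed)
    # Step 1 (variant): arbitrarily pick k objects as initial centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # Step 4: stop when nothing moves
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # sum-of-squared-errors criterion
    return labels, centroids, sse
```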

8 The K-Means Clustering Method: Example
(Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign until the clusters stabilize.)

9 Comments on the K-Means Method
Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. Compare PAM: O(k(n-k)^2) per iteration, and CLARA: O(ks^2 + k(n-k)).
Comment: often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.

10 Initial centroid selection (figure)

11 Initial centroid selection (figure)

12 What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.

13 The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters.
- PAM (Partitioning Around Medoids, 1987): starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering.
- PAM works effectively for small data sets but does not scale well to large data sets.
- CLARA (Kaufmann & Rousseeuw, 1990): sampling-based extension of PAM.
- CLARANS (Ng & Han, 1994): randomized sampling.

14 K-Medoids: Reassignment Cases
When medoid o_j is replaced by a non-medoid o_random, each object p falls into one of four cases:
- Case 1: p currently belongs to the cluster of o_j; after the swap, p is reassigned to another medoid o_i (now nearest).
- Case 2: p currently belongs to the cluster of o_j; after the swap, p is reassigned to o_random.
- Case 3: p currently belongs to the cluster of o_i; after the swap, no reassignment is needed.
- Case 4: p currently belongs to the cluster of o_i; after the swap, p is reassigned to o_random.
For each candidate replacement, calculate the change in total error; if the error is reduced, o_j is replaced with o_random.

15 A Typical K-Medoids Algorithm (PAM)
(Figure: K = 2.) The loop:
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the nearest medoid.
3. Randomly select a non-medoid object O_random.
4. Compute the total cost of swapping a medoid O with O_random.
5. If the quality is improved, perform the swap.
6. Repeat until no change.
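A naive sketch of this loop (function names are mine; a real implementation would cache distances rather than recompute the total cost for every candidate swap, which is what makes PAM O(k(n-k)^2) per iteration):

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from each object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    """Naive PAM: try every (medoid, non-medoid) swap, keep it if the cost drops."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for j in range(k):
            for o_rand in range(len(X)):
                if o_rand in medoids:
                    continue
                candidate = medoids[:j] + [o_rand] + medoids[j + 1:]
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids, improved = candidate, True
    # Final assignment: each object goes to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1)
```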

16 What Is the Problem with PAM?
PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
PAM works efficiently for small data sets but does not scale well to large data sets: O(k(n-k)^2) per iteration, where n is the number of objects and k the number of clusters.
→ Sampling-based method: CLARA (Clustering LARge Applications).

17 CLARA (Clustering Large Applications) (1990)
CLARA draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weaknesses:
- Efficiency depends on the sample size.
- A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
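A sketch of CLARA's sampling idea, reusing the pam and total_cost helpers from the PAM sketch above; the sample count and size are illustrative defaults, not values prescribed by the slides:

```python
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    """CLARA sketch: run PAM on several random samples and keep the medoids
    that score best on the WHOLE data set."""
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoids, _ = pam(X[idx], k)
        medoids = list(idx[sample_medoids])   # map sample positions back to X
        cost = total_cost(X, medoids)         # judge on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```

If a sample happens to miss an entire region of the data (a biased sample), no medoid can land there, which is exactly the weakness noted above.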

18 Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
(Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds bottom-up from Step 0 to Step 4, merging {a,b}, {d,e}, then {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) runs the same steps top-down in reverse, from Step 4 to Step 0.)

19 AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., Splus.
- Uses the dissimilarity matrix; merges the nodes that have the least dissimilarity (minimum distance).
- Continues in a non-descending fashion (the merge distance never decreases).
- Eventually all nodes belong to the same cluster.

20 Dendrogram: Shows How the Clusters Are Merged
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
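In practice one rarely codes this by hand; for example, SciPy's scipy.cluster.hierarchy module builds the merge tree and lets you cut it at a chosen level (the toy data below is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)

# Agglomerative clustering with single (minimum-distance) linkage
Z = linkage(X, method="single")

# Cutting the dendrogram at distance 3 yields the flat clusters
labels = fcluster(Z, t=3, criterion="distance")
print(labels)  # e.g. [1 1 2 2 3]: three groups well separated below the cut

# dendrogram(Z) draws the merge tree itself (requires matplotlib)
```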

21 DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., Splus.
- Inverse order of AGNES.
- Eventually each node forms a cluster on its own.

22 Distance Between Clusters
- Minimum distance (nearest-neighbor clustering, single link): d_{min}(C_i, C_j) = \min_{p \in C_i, q \in C_j} d(p, q)
- Maximum distance (complete link): d_{max}(C_i, C_j) = \max_{p \in C_i, q \in C_j} d(p, q)
- Average distance: d_{avg}(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)
- Mean distance: d_{mean}(C_i, C_j) = d(m_i, m_j), where m_i, m_j are the cluster means
- Centroids / medoids can also serve as the cluster representatives.
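All of these can be read off the pairwise distance matrix between the two clusters; a minimal sketch (the helper name is mine):

```python
import numpy as np

def cluster_distances(Ci, Cj):
    """Min, max, and average inter-cluster distances, plus the centroid distance."""
    Ci, Cj = np.asarray(Ci, float), np.asarray(Cj, float)
    d = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)  # all pairwise distances
    return {
        "min": d.min(),        # single link / nearest neighbor
        "max": d.max(),        # complete link
        "avg": d.mean(),       # average link
        "mean": np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)),  # distance of means
    }
```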

23 Example: minimum-distance (single-link) merging (figure)

24 Recent Hierarchical Clustering Methods
Major weaknesses of agglomerative clustering methods:
- Do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
- Can never undo what was done previously.
Integration of hierarchical with distance-based clustering:
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
- ROCK (1999): clusters categorical data by neighbor and link analysis.
- CHAMELEON (1999): hierarchical clustering using dynamic modeling.

25 Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as a termination condition
Several interesting studies:
- DBSCAN: Ester, et al. (KDD'96)
- OPTICS: Ankerst, et al. (SIGMOD'99)
- DENCLUE: Hinneburg & Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)

26 Density-Based Clustering: Basic Concepts
Two parameters:
- Eps: maximum radius of the neighbourhood
- MinPts: minimum number of points in an Eps-neighbourhood of a point
N_{Eps}(p) = \{ q \in D \mid dist(p, q) \le Eps \}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
- p belongs to N_{Eps}(q), and
- q satisfies the core point condition: |N_{Eps}(q)| \ge MinPts
(Figure: points p and q with MinPts = 5, Eps = 1 cm.)

27 Density-Reachable and Density-Connected
Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.
Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

28 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with noise.
(Figure: core, border, and outlier points with Eps = 1 cm, MinPts = 5.)

29 DBSCAN: The Algorithm
1. Arbitrarily select a point p.
2. Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
5. Continue the process until all of the points have been processed.
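A compact, self-contained sketch of this loop (not the original authors' code); points that never fall inside any core point's Eps-neighbourhood keep the label -1, i.e., noise:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Basic DBSCAN on an (n, p) float array; returns labels, -1 marking noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def region_query(p):               # Eps-neighbourhood of point p (includes p)
        return list(np.where(dist[p] <= eps)[0])

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        seeds = region_query(p)
        if len(seeds) < min_pts:       # p is not a core point; stays noise for now
            continue
        cluster += 1                   # p is core: grow a new cluster from it
        labels[p] = cluster
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:        # border or unclaimed point joins the cluster
                labels[q] = cluster
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:  # q is also core: expand through it
                    seeds.extend(q_neighbors)
    return labels
```

Note how the two cases from the slide appear: a core point opens a cluster and keeps expanding through other core points, while a border point is absorbed into a cluster but never extends it.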

30 DBSCAN: Sensitive to Parameters

31 Clustering Complex Objects

32 OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering Structure; Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99).
- Produces a special order of the database w.r.t. its density-based clustering structure.
- Dynamically sets/changes the Eps value.
- Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure.

33 Density-Based Clustering