Clustering Wei Wang.

Slides:



Advertisements
Similar presentations
CLUSTERING.
Advertisements

Copyright Jiawei Han, modified by Charles Ling for CS411a
Clustering.
What is Cluster Analysis?
Clustering.
Clustering Basic Concepts and Algorithms
PARTITIONAL CLUSTERING
CS690L: Clustering References:
Data Mining Techniques: Clustering
Clustering II.
Clustering.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Cluster Analysis.
What is Cluster Analysis
Segmentação (Clustering) (baseado nos slides do Han)
1 Chapter 8: Clustering. 2 Searching for groups Clustering is unsupervised or undirected. Unlike classification, in clustering, no pre- classified data.
1 Partitioning Algorithms: Basic Concepts  Partition n objects into k clusters Optimize the chosen partitioning criterion Example: minimize the Squared.
Cluster Analysis.
What is Cluster Analysis?
Cluster Analysis Part I
11/15/2012ISC471 / HCI571 Isabelle Bichindaritz 1 Clustering.
1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods.
START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
1 Clustering Sunita Sarawagi
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar BCB 713 Module Spring 2011 Wei Wang.
COMP Data Mining: Concepts, Algorithms, and Applications 1 K-means Arbitrarily choose k objects as the initial cluster centers Until no change,
Cluster Analysis Potyó László. Cluster: a collection of data objects Similar to one another within the same cluster Similar to one another within the.
K-Means Algorithm Each cluster is represented by the mean value of the objects in the cluster Input: set of objects (n), no of clusters (k) Output:
Clustering.
Compiled By: Raj Gaurang Tiwari Assistant Professor SRMGPC, Lucknow Unsupervised Learning.
Data Mining Algorithms
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Clustering Analysis CS 685: Special Topics in Data Mining Jinze Liu.
Definition Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to)
Cluster Analysis Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas.
Mr. Idrissa Y. H. Assistant Lecturer, Geography & Environment Department of Social Sciences School of Natural & Social Sciences State University of Zanzibar.
Cluster Analysis Dr. Bernard Chen Ph.D. Assistant Professor Department of Computer Science University of Central Arkansas Fall 2010.
Clustering Wei Wang. Outline What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods Model-based clustering.
1 Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects Some popular.
Data Mining Lecture 7. Course Syllabus Clustering Techniques (Week 6) –K-Means Clustering –Other Clustering Techniques.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
1 Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Density-Based.
Topic 4: Cluster Analysis Analysis of Customer Behavior and Service Modeling.
Cluster Analysis This work is created by Dr. Anamika Bhargava, Ms. Pooja Kaul, Ms. Priti Bali and Ms. Rajnipriya Dhawan and licensed under a Creative Commons.
Clustering.
Data Mining Comp. Sc. and Inf. Mgmt. Asian Institute of Technology
What Is Cluster Analysis?
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 10 —
Clustering I Data Mining Soongsil University.
Clustering CENG 514.
Slides by Eamonn Keogh (UC Riverside)
What Is the Problem of the K-Means Method?
Data Mining K-means Algorithm
Data Mining -Cluster Analysis. What is a clustering ? Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster.
Data Mining: Concepts and Techniques Clustering
Topic 3: Cluster Analysis
©Jiawei Han and Micheline Kamber Department of Computer Science
CSE 5243 Intro. to Data Mining
Data Mining: Concepts and Techniques
Selected Topics in AI: Data Clustering
CS 685: Special Topics in Data Mining Jinze Liu
Cluster Analysis What is Cluster Analysis?
Data Mining 資料探勘 分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育
Data Mining: Clustering
CSCI N317 Computation for Scientific Applications Unit Weka
Data Mining – Chapter 4 Cluster Analysis Part 2
What Is Good Clustering?
Topic 5: Cluster Analysis
What is Cluster Analysis?
Presentation transcript:

Clustering Wei Wang

Outline What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods Model-based clustering methods Outlier analysis

What Is Clustering? Group data into clusters Similar to one another within the same cluster Dissimilar to the objects in other clusters Unsupervised learning: no predefined classes Cluster 1 Cluster 2 Outliers

Application Examples A stand-alone tool: explore data distribution A preprocessing step for other algorithms Pattern recognition, spatial data analysis, image processing, market research, WWW, … Cluster documents Cluster web log data to discover groups of similar access patterns

What Is A Good Clustering? High intra-class similarity and low inter-class similarity Depending on the similarity measure The ability to discover some or all of the hidden patterns

Requirements of Clustering Scalability Ability to deal with various types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to determine input parameters

Requirements of Clustering Able to deal with noise and outliers Insensitive to order of input records High dimensionality Incorporation of user-specified constraints Interpretability and usability

Data Matrix For memory-based clustering Also called object-by-variable structure Represents n objects with p variables (attributes, measures) A relational table

Dissimilarity Matrix For memory-based clustering Also called object-by-object structure Proximities of pairs of objects d(i,j): dissimilarity between objects i and j Nonnegative Close to 0: similar

How Good Is A Clustering? Dissimilarity/similarity depends on distance function Different applications have different functions Judgment of clustering quality is typically highly subjective

Types of Data in Clustering Interval-scaled variables Binary variables Nominal, ordinal, and ratio variables Variables of mixed types

Similarity and Dissimilarity Between Objects Distances are normally used measures Minkowski distance: a generalization If q = 2, d is Euclidean distance If q = 1, d is Manhattan distance Weighed distance

Properties of Minkowski Distance Nonnegative: d(i,j)  0 The distance of an object to itself is 0 d(i,i) = 0 Symmetric: d(i,j) = d(j,i) Triangular inequality d(i,j)  d(i,k) + d(k,j)

Categories of Clustering Approaches (1) Partitioning algorithms Partition the objects into k clusters Iteratively reallocate objects to improve the clustering Hierarchy algorithms Agglomerative: each object is a cluster, merge clusters to form larger ones Divisive: all objects are in a cluster, split it up into smaller clusters

Categories of Clustering Approaches (2) Density-based methods Based on connectivity and density functions Filter out noise, find clusters of arbitrary shape Grid-based methods Quantize the object space into a grid structure Model-based Use a model to find the best fit of data

Partitioning Algorithms: Basic Concepts Partition n objects into k clusters Optimize the chosen partitioning criterion Global optimal: examine all partitions (kn-(k-1)n-…-1) possible partitions, too expensive! Heuristic methods: k-means and k-medoids K-means: a cluster is represented by the center K-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster

K-means Arbitrarily choose k objects as the initial cluster centers Until no change, do (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster Update the cluster means, i.e., calculate the mean value of the objects for each cluster

K-Means: Example Update the cluster means 1 2 3 4 5 6 7 8 9 10 10 9 8 7 6 5 Update the cluster means 4 Assign each objects to most similar center 3 2 1 1 2 3 4 5 6 7 8 9 10 reassign reassign K=2 Arbitrarily choose K object as initial cluster center Update the cluster means

Pros and Cons of K-means Relatively efficient: O(tkn) n: # objects, k: # clusters, t: # iterations; k, t << n. Often terminate at a local optimum Applicable only when mean is defined What about categorical data? Need to specify the number of clusters Unable to handle noisy data and outliers unsuitable to discover non-convex clusters

Variations of the K-means Aspects of variations Selection of the initial k means Dissimilarity calculations Strategies to calculate cluster means Handling categorical data: k-modes Use mode instead of mean Mode: the most frequent item(s) A mixture of categorical and numerical data: k-prototype method

A Problem of K-means Sensitive to outliers + Sensitive to outliers Outlier: objects with extremely large values May substantially distort the distribution of the data K-medoids: the most centrally located object in a cluster + 1 2 3 4 5 6 7 8 9 10

PAM: A K-medoids Method PAM: partitioning around Medoids Arbitrarily choose k objects as the initial medoids Until no change, do (Re)assign each object to the cluster to which the nearest medoid Randomly select a non-medoid object o’, compute the total cost, S, of swapping medoid o with o’ If S < 0 then swap o with o’ to form the new set of k medoids

Swapping Cost Measure whether o’ is better than o as a medoid Use the squared-error criterion Compute Eo’-Eo Negative: swapping brings benefit

PAM: Example K=2 Do loop Until no change Total Cost = 20 10 9 8 7 Arbitrary choose k object as initial medoids Assign each remaining object to nearest medoids 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 10 K=2 Randomly select a nonmedoid object,Oramdom Total Cost = 26 Do loop Until no change 10 10 9 Compute total cost of swapping 9 Swapping O and Oramdom If quality is improved. 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

Pros and Cons of PAM PAM is more robust than k-means in the presence of noise and outliers Medoids are less influenced by outliers PAM is efficiently for small data sets but does not scale well for large data sets O(k(n-k)2 ) for each iteration Sampling based method: CLARA

CLARA (Clustering LARge Applications) CLARA (Kaufmann and Rousseeuw in 1990) Built in statistical analysis packages, such as S+ Draw multiple samples of the data set, apply PAM on each sample, give the best clustering Perform better than PAM in larger data sets Efficiency depends on the sample size A good clustering on samples may not be a good clustering of the whole data set

CLARANS (Clustering Large Applications based upon RANdomized Search) The problem space: graph of clustering A vertex is k from n numbers, vertices in total PAM search the whole graph CLARA search some random sub-graphs CLARANS climbs mountains Randomly sample a set and select k medoids Consider neighbors of medoids as candidate for new medoids Use the sample set to verify Repeat multiple times to avoid bad samples