
1 Clustering

2 The K-Means Clustering Method
Example (figure): K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until the clusters stabilize.

3 Cluster Analysis
A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. Cluster analysis groups data according to similarity. Clustering is unsupervised classification: there are no predefined class labels. Typical applications: as a stand-alone tool to get insight into the data distribution, and as a preprocessing step for other methods.

4 Clustering Examples
Segment a customer database based on similar buying patterns. Marketing: help marketers discover distinct groups in their customer base, then use this knowledge to develop targeted marketing programs. Group houses in a town into neighborhoods based on similar features. City planning: identify groups of houses according to their house type, value, and geographical location. Identify similar Web usage patterns. Document classification. Cluster Weblog data to discover groups of similar access patterns.

5 Clustering Example

6 Clustering Houses
(Figure: the same houses clustered by geographic distance versus by size.)

7 What Is Good Clustering?
A good clustering method produces high-quality clusters with high intra-class similarity (within a cluster) and low inter-class similarity (between clusters). The quality of the result depends on the similarity measure used by the clustering method and on its implementation. The quality of a clustering method can also be measured by its ability to discover some or all of the hidden patterns.

8 Clustering Requirements/Issues
Scalability. Ability to deal with different types of attributes. Discovery of clusters with arbitrary shape. Minimal domain knowledge required to determine input parameters. Ability to handle noise and outliers. Insensitivity to the order of input records. High dimensionality. Incorporation of user-specified constraints. Interpretability and usability. Dynamic data: cluster membership may change over time.

9 Impact of Outliers on Clustering

10 Data Structures
Data matrix (two modes): n objects × p variables. Dissimilarity matrix (one mode): n × n.

11 Measuring Clustering Quality
Dissimilarity/similarity metric: expressed as a distance function d(i, j). The definition of the distance function differs by variable type: interval-scaled, boolean, categorical, ordinal, and ratio variables. Variable weights are set according to the application and the semantics of the data. It is often hard to define "similar enough" or "good enough"; the answer is subjective.

12 Similarity and Dissimilarity Between Objects
Distances measure the similarity or dissimilarity of two data objects. Properties: d(i, j) ≥ 0; d(i, i) = 0; d(i, j) = d(j, i); d(i, j) ≤ d(i, k) + d(k, j). Common examples: Manhattan distance and Euclidean distance.

13 Similarity and Dissimilarity (Cont.)
Minkowski distance: d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q), where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects and q is a positive integer. Manhattan distance: q = 1. Euclidean distance: q = 2.
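To make the definitions concrete, here is a minimal Python sketch of the Minkowski distance and its two special cases (illustrative code, not part of the original slides):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points (q is a positive integer)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    return minkowski(x, y, 1)   # q = 1

def euclidean(x, y):
    return minkowski(x, y, 2)   # q = 2

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(manhattan(i, j))   # 7.0
print(euclidean(i, j))   # 5.0
```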

14 Data Types in Cluster Analysis
Interval-scaled variables: weight, height, latitude, ... (roughly linear). Binary variables: symmetric (e.g., gender) or asymmetric (e.g., fever Y/N, test P/N). Nominal, ordinal, and ratio variables: e.g., map_color and weather; orderings; quantities on an exponential scale such as Ae^(Bt). Variables of mixed types.

15 Interval-valued variables
Standardize the data first (transform xif into zif). Compute the mean absolute deviation sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|), where mf is the mean of variable f. Then compute the standardized measurement (z-score) zif = (xif - mf) / sf. Using the mean absolute deviation is more robust than using the standard deviation, because the standard deviation squares the deviations.
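A small sketch of this standardization for a single variable, following the formulas above (the function name and sample values are illustrative):

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores

print(standardize([10.0, 20.0, 30.0, 40.0]))
# mean 25, mean absolute deviation 10 -> [-1.5, -0.5, 0.5, 1.5]
```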

16 Binary Variables
A contingency table for binary data counts the possible value combinations for objects i and j: a = both 1, b = 1 for i and 0 for j, c = 0 for i and 1 for j, d = both 0. Simple matching coefficient (invariant if the binary variable is symmetric): d(i, j) = (b + c) / (a + b + c + d). Jaccard coefficient (noninvariant if the binary variable is asymmetric): d(i, j) = (b + c) / (a + b + c).
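A sketch of both coefficients computed from the contingency counts of two 0/1 vectors (illustrative code; the sample objects are made up):

```python
def binary_dissimilarity(i, j, symmetric=True):
    """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity
    between two 0/1 vectors, using the contingency counts a, b, c, d."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    if symmetric:
        return (b + c) / (a + b + c + d)   # simple matching coefficient
    return (b + c) / (a + b + c)           # Jaccard coefficient (d is ignored)

obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(obj_i, obj_j, symmetric=False))  # 1/3
```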

17 Dissimilarity between Binary Variables
Example: gender is a symmetric attribute; the other attributes are asymmetric binary. Let the values Y and P be 1, and the value N be 0.

18 Nominal Variables A generalization of the binary variable: it can take more than two states, e.g., red, yellow, blue, green. Method 1: simple matching, d(i, j) = (p - m) / p, where m is the number of matches and p is the total number of variables. Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.

19 Ordinal Variables
Can be discrete or continuous; order is important, e.g., rank. Can be treated like interval-scaled variables: replace xif by its rank rif ∈ {1, ..., Mf}; map the range of each ordinal variable onto [0, 1] by replacing the f-th variable of the i-th object with zif = (rif - 1) / (Mf - 1); then compute dissimilarity using the methods for interval-scaled variables.

20 Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately exponential, such as Ae^(Bt) or Ae^(-Bt). Methods: treat them like interval-scaled variables (not a good choice, because the scale can be distorted); apply a logarithmic transformation yif = log(xif) (possibly log-log); or treat them as continuous ordinal data and treat their ranks as interval-scaled.

21 Variables of Mixed Types
A database may contain symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio variables. Their effects can be combined with a weighted formula: d(i, j) = Σf δij(f) dij(f) / Σf δij(f). When f is binary or nominal: dij(f) = 0 if xif = xjf, otherwise dij(f) = 1. When f is interval-based: dij(f) = |xif - xjf| / (max(xf) - min(xf)). When f is ordinal or ratio-scaled: compute the ranks rif, set zif = (rif - 1) / (Mf - 1), and treat zif as interval-scaled. δij(f) = 0 if (1) xif or xjf is missing, or (2) xif = xjf = 0 and f is asymmetric; otherwise δij(f) = 1.
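A sketch of this weighted combination, assuming a simplified per-type dispatch (the variable layout, argument names, and sample values are all illustrative):

```python
def mixed_dissimilarity(x, y, types, ranges=None, levels=None):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f over mixed-type variables.
    types[f] in {'nominal', 'binary_sym', 'binary_asym', 'interval', 'ordinal'}."""
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        xi, xj = x[f], y[f]
        if xi is None or xj is None:                    # missing value -> delta = 0
            continue
        if t == 'binary_asym' and xi == 0 and xj == 0:  # asymmetric 0/0 -> delta = 0
            continue
        if t in ('nominal', 'binary_sym', 'binary_asym'):
            d = 0.0 if xi == xj else 1.0
        elif t == 'interval':
            lo, hi = ranges[f]                          # min/max of variable f
            d = abs(xi - xj) / (hi - lo)
        else:                                           # ordinal: map ranks to [0, 1]
            m = levels[f]
            d = abs((xi - 1) / (m - 1) - (xj - 1) / (m - 1))
        num += d
        den += 1.0
    return num / den

x = ['red', 1, 30.0, 2]
y = ['blue', 0, 50.0, 3]
print(mixed_dissimilarity(x, y, ['nominal', 'binary_asym', 'interval', 'ordinal'],
                          ranges={2: (0.0, 100.0)}, levels={3: 3}))   # 0.675
```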

22 Types of Clustering Hierarchical – Nested set of clusters created.
Partitional – One set of clusters created. Incremental – Each element handled one at a time. Simultaneous – All elements handled together. Overlapping/Non-overlapping

23 Clustering Approaches
Hierarchical Partitional Categorical Large DB Agglomerative Divisive Sampling Compression

24 Major Clustering Approaches
Partitioning algorithms: construct various partitions and evaluate them by some criterion. Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion. Density-based: based on connectivity and density functions. Grid-based: based on a multiple-level granularity structure. Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model.

25 Partitional Algorithms
K-Means K-Nearest Neighbor PAM BEA GA

26 Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters. Given k, find a partition into k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: the k-means and k-medoids algorithms. k-means (MacQueen '67): each cluster is represented by the center of the cluster. k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.

27 The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps: (1) partition the objects into k nonempty subsets; (2) compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster); (3) assign each object to the cluster with the nearest seed point; (4) go back to step 2, and stop when there are no new assignments. The cluster mean is mi = (1/m)(ti1 + ... + tim).
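A minimal sketch of these four steps for one-dimensional data (illustrative; real implementations add better initialization and convergence checks):

```python
import random

def kmeans(points, k, max_iter=100):
    """Basic k-means on 1-D data: assign to nearest mean, recompute means, repeat."""
    means = random.sample(points, k)                       # arbitrary initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assign to most similar center
            idx = min(range(k), key=lambda c: abs(p - means[c]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) if c else means[i]    # update the cluster means
                     for i, c in enumerate(clusters)]
        if new_means == means:                             # stop: no more reassignment
            break
        means = new_means
    return clusters, means
```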

28 The K-Means Clustering Method
Example (figure): K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until the clusters stabilize.

29 K-Means Example Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3,m2=4 K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16 K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18 K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6 K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25 Stop as the clusters with these means are the same.
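The trace above can be reproduced with a few lines that fix the two initial means instead of choosing them randomly (a sketch; it prints the clusters and means after each pass):

```python
points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [3.0, 4.0]                                          # initial means m1, m2
while True:
    clusters = [[], []]
    for p in points:                                        # assign to the nearest mean
        clusters[0 if abs(p - means[0]) <= abs(p - means[1]) else 1].append(p)
    new_means = [sum(c) / len(c) for c in clusters]         # update the means
    print(clusters, new_means)
    if new_means == means:                                  # stop: clusters unchanged
        break
    means = new_means
```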

30 Comments on the K-Means Method
Strength: relatively efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. Compare: PAM: O(k(n-k)^2); CLARA: O(ks^2 + k(n-k)). Comment: often terminates at a local optimum. Weaknesses: applicable only when the mean is defined (so what about categorical data?); the number of clusters k must be specified in advance; unable to handle noisy data and outliers; not suitable for discovering clusters with non-convex shapes.

31 Problems with K-Means
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster. PAM (Partitioning Around Medoids, 1987).

32 K-Nearest Neighbor Items are iteratively merged into the existing clusters that are closest. Incremental: items are handled one at a time. A threshold, t, is used to determine whether an item is added to an existing cluster or a new cluster is created.

33 PAM Partitioning Around Medoids (PAM) (K-Medoids)
Handles outliers well. Ordering of input does not impact results. Does not scale well. Each cluster represented by one item, called the medoid. Initial set of k medoids randomly chosen.

34 PAM

35 PAM Cost Calculation At each step in algorithm, medoids are changed if the overall cost is improved. Cjih – cost change for an item tj associated with swapping medoid ti with non-medoid th.

36 BEA Bond Energy Algorithm Database design (physical and logical)
Vertical fragmentation Determine affinity (bond) between attributes based on common usage. Algorithm outline: Create affinity matrix Convert to BOND matrix Create regions of close bonding

37 BEA Modified from [OV99]

38 Genetic Algorithm Example
{A,B,C,D,E,F,G,H}. Randomly choose an initial solution, e.g., the clusters {A,C,E}, {B,F}, {D,G,H}. Suppose crossover at point four and choose the 1st and 3rd individuals. What should the termination criteria be?

39 Hierarchical Clustering
Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition. (Figure: objects a, b, c, d, e merged step by step; agglomerative (AGNES) proceeds left to right, divisive (DIANA) right to left.)

40 AGNES (Agglomerative Nesting)
Implemented in statistical packages, e.g., S-Plus. Uses the single-linkage method and the dissimilarity matrix: merge the nodes that have the least dissimilarity, and continue in a non-descending fashion; eventually all nodes belong to the same cluster. Single linkage: distance between the closest pair; complete linkage: distance between the most distant pair.

41 Dendrogram Shows Hierarchical Clustering
A dendrogram decomposes data objects into several levels of nested partitionings (a tree of clusters). (Figure: cutting the dendrogram at one level yields 4 clusters.)

42 Levels of Clustering

43 Agglomerative Example
(Figure: a pairwise distance table over items A, B, C, D, E and the dendrogram obtained by merging them at thresholds 1 through 5.)

44 DIANA (Divisive Analysis)
Implemented in statistical analysis packages, e.g., Splus Inverse order of AGNES Eventually each node forms a cluster on its own

45 Distance Between Clusters
Single Link: smallest distance between points Complete Link: largest distance between points Average Link: average distance between points Centroid: distance between centroids
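These linkage criteria are what standard libraries implement; a small usage sketch with SciPy's agglomerative clustering, assuming SciPy is available (the sample points are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 6.0], [9.0, 1.0]])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                      # merge tree (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, labels)
```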

46 Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points Major features: Discover clusters of arbitrary shape Handle noise One scan Need density parameters as termination condition Several interesting studies: DBSCAN: Ester, et al. (KDD’96) OPTICS: Ankerst, et al (SIGMOD’99). DENCLUE: Hinneburg & D. Keim (KDD’98) CLIQUE: Agrawal, et al. (SIGMOD’98)

47 Model-Based Clustering Methods
Attempt to optimize the fit between the data and some mathematical model. Statistical approach: conceptual clustering, e.g., COBWEB (Fisher '87). AI approach: keep a "prototype" (exemplar) for each cluster and assign each new object to the most similar exemplar. Neural network approach: Self-Organizing Feature Map (SOM), in which several units compete for the current object.

48 Model-Based Clustering Methods

49 Self-organizing feature maps (SOMs)
Clustering is performed by having several units compete for the current object. The unit whose weight vector is closest to the current object wins; the winner and its neighbors learn by having their weights adjusted. Useful for visualizing high-dimensional data in 2-D or 3-D space.
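A tiny sketch of the competitive update just described, using a one-dimensional grid of units (all constants and the function name are illustrative):

```python
import random

def train_som(data, n_units=4, dim=2, lr=0.3, radius=1, epochs=50):
    """Tiny 1-D SOM: the winning unit and its grid neighbors move toward each object."""
    units = [[random.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(epochs):
        for x in data:
            # winner: unit whose weight vector is closest to the current object
            win = min(range(n_units),
                      key=lambda u: sum((units[u][d] - x[d]) ** 2 for d in range(dim)))
            for u in range(n_units):
                if abs(u - win) <= radius:              # winner and its neighbors learn
                    for d in range(dim):
                        units[u][d] += lr * (x[d] - units[u][d])
        lr *= 0.95                                      # decay the learning rate
    return units

data = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.95)]
print(train_som(data))
```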

50 What Is Outlier Discovery?
What are outliers? Examples: Michael Jordan, a CEO's salary, age = 999; objects that are considerably dissimilar from the rest of the data. Problem: find the top k outliers among n objects. Applications: credit card / telecom fraud detection, customer segmentation, medical analysis. Approaches: statistical-based, distance-based, deviation-based.

51 Outlier Discovery: Statistical Approaches
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution). Use discordancy tests, which depend on the assumed data distribution, its parameters (e.g., mean, variance), and the number of expected outliers. Drawbacks: most tests are for a single attribute, and in many cases the data distribution may not be known.

52 Distance-Based Approach
Parameters: p (a fraction) and D. Distance-based outlier: an object O in a data set S is a DB(p, D)-outlier if at least a fraction p of the objects in S lie at a distance greater than D from O, i.e., O does not have enough neighbors. Distance-based outlier mining algorithms: index-based algorithm, nested-loop algorithm, cell-based algorithm, ...
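A sketch of the nested-loop idea for DB(p, D)-outliers (illustrative; here the fraction is taken over the other n-1 objects):

```python
def db_outliers(points, p, D, dist):
    """Nested-loop DB(p, D)-outlier detection: O(n^2) distance comparisons."""
    outliers = []
    n = len(points)
    for i, o in enumerate(points):
        # count how many other objects lie farther than D from o
        far = sum(1 for j, q in enumerate(points) if j != i and dist(o, q) > D)
        if far / (n - 1) >= p:                     # not enough close neighbors
            outliers.append(o)
    return outliers

data = [1.0, 1.2, 0.9, 1.1, 9.0]
print(db_outliers(data, p=0.9, D=2.0, dist=lambda a, b: abs(a - b)))  # [9.0]
```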

53 Constraint-Based Clustering
ATM allocation problem

54 Clustering Large Databases
Most clustering algorithms assume a large data structure which is memory resident. Clustering may be performed first on a sample of the database then applied to the entire database. Algorithms BIRCH DBSCAN CURE

55 Desired Features for Large Databases
One scan (or less) of DB Online Suspendable, stoppable, resumable Incremental Work with limited main memory Different techniques to scan (e.g. sampling) Process each tuple once

56 BIRCH Balanced Iterative Reducing and Clustering using Hierarchies
Incremental, hierarchical, one scan Save clustering information in a tree Each entry in the tree contains information about one cluster New nodes inserted in closest entry in tree

57 Clustering Feature
CF triple: (N, LS, SS). N: the number of points in the cluster. LS: the linear sum of the points in the cluster. SS: the sum of squares of the points in the cluster. CF tree: a balanced search tree; each node holds a CF triple for each child; a leaf node represents a cluster and holds a CF value for each subcluster in it; each subcluster has a maximum diameter.
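A sketch of the CF triple and why it supports incremental insertion and cheap merging (illustrative class; the radius here is derived from N, LS, and SS):

```python
import math

class CF:
    """Clustering feature: (N, LS, SS) summarizes a (sub)cluster incrementally."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim          # linear sum of the points
        self.ss = 0.0                  # sum of squared norms of the points

    def add(self, point):              # absorb one new point
        self.n += 1
        for d, v in enumerate(point):
            self.ls[d] += v
        self.ss += sum(v * v for v in point)

    def merge(self, other):            # CFs are additive, so merging is O(dim)
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):                  # average distance from points to the centroid
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

cf = CF(dim=2)
for p in [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0)]:
    cf.add(p)
print(cf.centroid(), cf.radius())      # [2.0, 2.0], ~0.816
```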

58 BIRCH Algorithm

59 Improve Clusters

60 DBSCAN Density Based Spatial Clustering of Applications with Noise
Outliers will not affect the creation of clusters. Input: MinPts, the minimum number of points in a cluster; Eps, a distance threshold: for each point in a cluster there must be another point in the cluster less than this distance away.

61 DBSCAN Density Concepts
Eps-neighborhood: the points within distance Eps of a point. Core point: a point whose Eps-neighborhood is dense enough (contains at least MinPts points). Directly density-reachable: a point p is directly density-reachable from a point q if their distance is at most Eps and q is a core point. Density-reachable: a point is density-reachable from another point if there is a chain of directly density-reachable points between them whose intermediate points are all core points.
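These concepts map directly onto library implementations; a small usage sketch with scikit-learn's DBSCAN, assuming scikit-learn is installed (eps and min_samples correspond to Eps and MinPts; the sample points are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.0],
              [5.0, 5.0], [5.1, 4.9], [5.2, 5.1],
              [9.0, 0.0]])                       # last point is isolated

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)                                    # [0 0 0 1 1 1 -1]; -1 marks noise
```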

62 Density Concepts

63 DBSCAN Algorithm

64 CURE Clustering Using Representatives
Use many points to represent a cluster instead of only one Points will be well scattered

65 CURE Approach

66 CURE Algorithm

67 CURE for Large Databases

68 Summary
Cluster analysis groups objects based on their similarity and has wide applications. Measures of similarity can be computed for various types of data. Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Outlier detection and analysis are useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches. Research issues include constraint-based clustering.

69 Comparison of Clustering Techniques

70 Clustering vs. Classification
No prior knowledge: neither the number of clusters nor the meaning of the clusters is known. Unsupervised learning: the clusters are not known a priori.

