Clustering
Dr. Jieh-Shan George YEH


k-Means Clustering

k-means clustering aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. Finding an exactly optimal partition is computationally difficult (NP-hard), so practical implementations rely on fast iterative heuristics such as Lloyd's algorithm.
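Formally (a standard statement of the objective, added here for reference), k-means seeks the partition S = {S_1, ..., S_k} minimizing the within-cluster sum of squares, in LaTeX notation:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

where \mu_i is the mean of the points in S_i.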

k-Means Clustering: Example

iris2 <- iris
# remove species from the data
iris2$Species <- NULL
# the number of clusters is set to 3
kmeans.result <- kmeans(iris2, 3)
# compare the clusters with the actual species
table(iris$Species, kmeans.result$cluster)
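Because kmeans() starts from randomly chosen centers, the numbering of the clusters, and occasionally the partition itself, can differ between runs. A common remedy (an addition beyond the original slides) is to fix the seed and restart the algorithm several times via the nstart argument:

# make the result reproducible and more stable: nstart = 25 runs
# k-means from 25 random starts and keeps the solution with the
# smallest total within-cluster sum of squares
set.seed(8953)
kmeans.result <- kmeans(iris2, 3, nstart = 25)
kmeans.result$tot.withinss  # objective value of the retained solution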

plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster) # plot cluster centers points(kmeans.result$centers[,c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex=2)

fpc: Flexible Procedures for Clustering

k-Medoids Clustering

The k-medoids algorithm is a clustering algorithm related to k-means, but it chooses actual data points as cluster centers (medoids, or exemplars). Because the medoids are real observations and any dissimilarity measure can be used, k-medoids is more robust to noise and outliers than k-means. The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm.
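For comparison with k-means (a standard formulation, added for reference), PAM selects k medoids m_1, ..., m_k from the data points themselves and minimizes the total dissimilarity, in LaTeX notation:

\arg\min_{m_1,\dots,m_k} \sum_{j=1}^{n} \min_{1 \le i \le k} d(x_j, m_i)

where d is any dissimilarity measure, not necessarily squared Euclidean distance.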

k-Medoids Clustering

# PAM (Partitioning Around Medoids)
library(fpc)
pamk.result <- pamk(iris2)
# number of clusters
pamk.result$nc
# check clustering against actual species
table(pamk.result$pamobject$clustering, iris$Species)
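By default pamk() estimates the number of clusters itself, choosing the k with the highest average silhouette width; the candidate values can be inspected afterwards (field name as documented in the fpc package):

# average silhouette width for each candidate number of clusters;
# the first entry (k = 1) is always 0
pamk.result$crit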

layout(matrix(c(1, 2), 1, 2))  # 2 graphs per page
plot(pamk.result$pamobject)
layout(matrix(1))  # change back to one graph per page
# pamk() produces two clusters: one is "setosa", and the other
# is a mixture of "versicolor" and "virginica".

cluster: Cluster Analysis Extended (Rousseeuw et al.)

Partitioning Around Medoids (PAM)

library(cluster)
pam.result <- pam(iris2, 3)
table(pam.result$clustering, iris$Species)
layout(matrix(c(1, 2), 1, 2))  # 2 graphs per page
plot(pam.result)
layout(matrix(1))  # change back to one graph per page

It is hard to say which of the two clusterings produced above, with pamk() and with pam(), is better. The answer depends on the target problem, domain knowledge, and experience.
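One quantitative aid for such a comparison (an addition beyond the slides, using the silinfo field documented for pam objects in the cluster package) is the average silhouette width, which the silhouette panel of plot() also displays:

# average silhouette width of the k = 3 solution from pam()
pam.result$silinfo$avg.width
# the same for the k = 2 solution chosen by pamk()
pamk.result$pamobject$silinfo$avg.width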

Hierarchical Clustering

# draw a sample of 40 records from the iris data,
# so that the clustering plot will not be overcrowded
idx <- sample(1:dim(iris)[1], 40)
irisSample <- iris[idx, ]
irisSample$Species <- NULL
# agglomerative clustering with average linkage
hc <- hclust(dist(irisSample), method = "ave")
plot(hc, hang = -1, labels = iris$Species[idx])
# cut the tree into 3 clusters
rect.hclust(hc, k = 3)
groups <- cutree(hc, k = 3)
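As with the earlier examples, the cluster memberships returned by cutree() can be cross-tabulated against the true species (a small addition in the same spirit as the k-means check above):

# compare the three dendrogram clusters with the species of the sample
table(groups, iris$Species[idx])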

Density-based Clustering

DBSCAN is a density-based clustering algorithm built on the notions of density reachability and density connectivity. It groups densely connected points into clusters, labels points in low-density regions as noise, and can discover clusters of arbitrary shape without specifying the number of clusters in advance.

library(fpc)
iris2 <- iris[-5]  # remove class tags
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
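DBSCAN takes two parameters: eps, the radius of the neighborhood examined around each point, and MinPts, the minimum number of points within that radius for a point to count as a core point. A common heuristic for choosing eps, sketched below with kNNdistplot() from the CRAN package dbscan (a different package from fpc; this step is not in the original slides), is to look for a knee in the sorted k-nearest-neighbor distances:

# note: loading the dbscan package masks fpc::dbscan(),
# so call fpc::dbscan() explicitly afterwards if needed
library(dbscan)
# sorted distances to the 5th nearest neighbor (k = MinPts);
# a suitable eps lies near the knee of this curve
kNNdistplot(as.matrix(iris2), k = 5)
abline(h = 0.42, lty = 2)  # the eps value used above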

# compare the clusters with the original class labels
table(ds$cluster, iris$Species)

In the first column, "1" to "3" are the three identified clusters, while "0" stands for noise points, i.e. outliers.

# display the clusters in a scatter plot matrix of all variable pairs
plot(ds, iris2)

# display the clusters using only the first and fourth variables
# (Sepal.Length and Petal.Width)
plot(ds, iris2[c(1, 4)])

# plotcluster() from fpc projects the data onto discriminant
# coordinates before plotting the clusters
plotcluster(iris2, ds$cluster)

Label New Data

# create a new dataset for labeling
set.seed(435)
idx <- sample(1:nrow(iris), 10)
newData <- iris[idx, -5]
# add small random noise to the sampled records
newData <- newData + matrix(runif(10 * 4, min = 0, max = 0.2), nrow = 10, ncol = 4)
# label the new data with the clusters learned by dbscan()
myPred <- predict(ds, iris2, newData)
# plot the result
plot(iris2[c(1, 4)], col = 1 + ds$cluster)
points(newData[c(1, 4)], pch = "*", col = 1 + myPred, cex = 3)
# check the predicted cluster labels against the species of the sampled records
table(myPred, iris$Species[idx])