

Definition
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- Inter-cluster distances are maximized
- Intra-cluster distances are minimized

Applications
- Group related documents for browsing
- Group genes and proteins that have similar functionality
- Group stocks with similar price fluctuations
- Reduce the size of large data sets
- Group users with similar buying mentalities

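Clustering is ambiguous
There is no single correct or incorrect solution for clustering. How many clusters? The same set of points can reasonably be read as two, four, or six clusters.
[Figure panels: Two Clusters; Four Clusters; Six Clusters]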

Challenges faced
- Scalability
- Ability to deal with different types of attributes
- Noise and outliers
- Complex shapes and types of data
- Incremental clustering and insensitivity to the order of input records
- High dimensionality
- Constraint-based clustering
- Interpretability and usability

Types of Data
Data matrix: n objects with p variables. The structure is a relational table, or n x p matrix.
Dissimilarity matrix: an object-by-object structure that stores the proximities available for all pairs of the n objects. d(i, j) is the dissimilarity between objects i and j, with d(i, j) = d(j, i) and d(i, i) = 0.
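The step from a data matrix to a dissimilarity matrix can be sketched in a few lines. This is an illustration only: the slide does not say which distance to use, so Euclidean distance is an assumption here, and `dissimilarity_matrix` and the sample rows are hypothetical.

```python
import math

def dissimilarity_matrix(data):
    """Return the n x n matrix of pairwise Euclidean distances between rows."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]       # d(i, i) = 0 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(data[i], data[j])
            d[i][j] = d[j][i] = dist        # symmetry: d(i, j) = d(j, i)
    return d

# Hypothetical data matrix: n = 3 objects, p = 2 variables.
data = [(1.0, 1.0), (4.0, 5.0), (1.0, 5.0)]
d = dissimilarity_matrix(data)
```

Only the upper triangle is computed; the lower triangle is filled in by symmetry, which mirrors the two properties listed on the slide.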

Types of Data
- Interval-scaled variables
- Binary variables
- Nominal
- Ordinal
- Ratio-scaled variables
- Variables of mixed types

Interval-Scaled Variables

Interval-scaled variables contd…

Binary variables
A binary variable has only two states, 0 and 1.
The dissimilarity between two objects described by binary variables is computed from a 2 x 2 contingency table:
                   Object j
                   1      0      sum
Object i    1      q      r      q+r
            0      s      t      s+t
            sum    q+s    r+t    p
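The slide gives the table but not the resulting coefficient. Assuming the usual textbook reading, where q counts the variables that are 1 for both objects, r those that are 1 for object i and 0 for object j, s those that are 0 for i and 1 for j, and t those that are 0 for both, two common choices are:

```latex
d_{\text{sym}}(i, j) = \frac{r + s}{q + r + s + t},
\qquad
d_{\text{asym}}(i, j) = \frac{r + s}{q + r + s}
```

The asymmetric form drops the t joint absences, which suits variables where a 1 (for example, a positive test) carries more information than a 0.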

Dissimilarity between binary variables
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      Y      N       N       N       N
D(Jack, Mary) = 0.33
D(Jack, Jim) = 0.67
D(Mary, Jim) = 0.75
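The three values can be reproduced with a short sketch. The slide does not state the formula; the numbers are consistent with the asymmetric coefficient (r + s) / (q + r + s) applied to the six symptom and test columns, with Y and P coded as 1, N coded as 0, and Gender left out. That coding, and the helper name `asymmetric_binary_dissimilarity`, are assumptions made for illustration.

```python
def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s): mismatches over positions where either value is 1."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a only
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 1 in b only
    return (r + s) / (q + r + s)

# Columns: Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Gender excluded).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = round(asymmetric_binary_dissimilarity(jack, mary), 2)
d_jack_jim = round(asymmetric_binary_dissimilarity(jack, jim), 2)
d_mary_jim = round(asymmetric_binary_dissimilarity(mary, jim), 2)
```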

Categorical Variables

Ordinal
Similar to nominal variables, but the values are ordered in some sequence. E.g., the rank of employees can be assistant, associate, or full.
Ratio-scaled variables
A positive measurement on a non-linear scale. E.g., growth of bacteria, radioactivity.
Variables of mixed types
Other types of data

Types of clustering
- Hierarchical clustering (BIRCH): a set of nested clusters organized as a hierarchical tree
- Partitional clustering (k-means, k-medoids): a division of data objects into non-overlapping (distinct) subsets (i.e., clusters) such that each data object is in exactly one subset
- Density-based (DBSCAN): based on density functions
- Grid-based (STING): based on a multiple-level granularity structure
- Model-based (SOM): hypothesize a model for each of the clusters and find the best fit of the data to the given model

Partitional Clustering
[Figure panels: Original Points; A Partitional Clustering]

Hierarchical Clustering
[Figure panels: Traditional Hierarchical Clustering; Non-traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Dendrogram]

Clustering Algorithms
Partitional
- K-means
- K-medoids
Hierarchical
- Agglomerative
- Divisive

K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: a set of n objects and the number of clusters k
Output: a set of k clusters
Algorithm:
1. Randomly select k samples and mark them as the initial clusters.
2. Repeat:
   a. Assign (or reassign) each sample to the cluster to which it is most similar, based on the mean of the cluster.
   b. Update each cluster's mean.
3. Until no change.

K-Means (Array)
Step 1: Randomly assign objects to k clusters.
Step 2: Find the mean of each cluster.
Step 3: Reassign objects to the cluster with the closest mean.
Step 4: Go to Step 2. Repeat until no change.

Example 1
Given: {2, 3, 6, 8, 9, 12, 15, 18, 22}. Assume k = 3.
Solution: Randomly partition the given data set:
K1 = 2, 8, 15            mean = 8.33
K2 = 3, 9, 18            mean = 10
K3 = 6, 12, 22           mean = 13.33
Reassign:
K1 = 2, 3, 6, 8, 9       mean = 5.6
K2 = (empty)             mean = 0
K3 = 12, 15, 18, 22      mean = 16.75

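Reassign:
K1 = 3, 6, 8, 9          mean = 6.5
K2 = 2                   mean = 2
K3 = 12, 15, 18, 22      mean = 16.75
Reassign:
K1 = 6, 8, 9             mean = 7.67
K2 = 2, 3                mean = 2.5
K3 = 12, 15, 18, 22      mean = 16.75
Reassign (12 is now closer to 7.67, at distance 4.33, than to 16.75, at distance 4.75, so it moves to K1):
K1 = 6, 8, 9, 12         mean = 8.75
K2 = 2, 3                mean = 2.5
K3 = 15, 18, 22          mean = 18.33
Reassign:
K1 = 6, 8, 9, 12         mean = 8.75
K2 = 2, 3                mean = 2.5
K3 = 15, 18, 22          mean = 18.33
No change: STOP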
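The array procedure can be replayed in code to check each reassignment by hand. The sketch below is an illustration, not part of the slides: `kmeans_1d` is a hypothetical helper that starts from a given partition and, following the convention used in this example, treats the mean of an empty cluster as 0.

```python
def kmeans_1d(clusters):
    """Array-style k-means on numbers, starting from an initial partition.

    Returns the final clusters and their means.
    """
    points = sorted(x for c in clusters for x in c)
    while True:
        # Step 2: mean of each cluster (an empty cluster gets mean 0,
        # mirroring the convention in the worked example).
        means = [sum(c) / len(c) if c else 0.0 for c in clusters]
        # Step 3: reassign every object to the cluster with the closest mean.
        new_clusters = [[] for _ in clusters]
        for x in points:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            new_clusters[nearest].append(x)
        # Step 4: stop once a full pass changes nothing.
        if new_clusters == clusters:
            return clusters, means
        clusters = new_clusters

# Initial random partition from Example 1.
clusters, means = kmeans_1d([[2, 8, 15], [3, 9, 18], [6, 12, 22]])
```

Printing `clusters` and `means` inside the loop shows one line per "Reassign" step, which makes it easy to compare against a hand calculation.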

Example 2
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}. Assume k = 2.
Solution:
K1 = 2, 3, 4, 10, 11, 12
K2 = 20, 25, 30

Advantages
- K-means is relatively scalable and efficient in processing large data sets.
- The computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally k << n and t << n.
Disadvantages
- It can be applied only when the mean of a cluster is defined.
- Users need to specify k.
- K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
- It is sensitive to noise and outlier data points, which can strongly influence the mean value.
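The sensitivity to outliers is easy to see on a handful of numbers. The sketch below is illustrative only: the values are made up, and the medoid (the member with the smallest total distance to the others, the representative used by k-medoids) is included purely for contrast.

```python
def mean(values):
    """Arithmetic mean, the cluster representative used by k-means."""
    return sum(values) / len(values)

def medoid(values):
    """Member of values with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

clean = [2, 3, 4, 5, 6]
noisy = [2, 3, 4, 5, 100]    # hypothetical: last value replaced by an outlier

clean_mean, noisy_mean = mean(clean), mean(noisy)
clean_medoid, noisy_medoid = medoid(clean), medoid(noisy)
```

One outlier drags the mean far from the bulk of the data, while the medoid stays on an actual data point near the middle.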

K-Means (graph)
Step 1: Form k centroids randomly.
Step 2: Calculate the distance between the centroids and each object, using the Euclidean distance: d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
Step 3: Assign each object to one of the k clusters based on minimum distance.
Step 4: Calculate the centroid of each cluster using C = \left( \frac{x_1 + x_2 + \dots + x_n}{n}, \frac{y_1 + y_2 + \dots + y_n}{n} \right)
Go to Step 2. Repeat until there is no change in the centroids.

Example 1
There are four types of medicine, and each has two attributes, as shown below. Find a way to group them into 2 groups based on their features.
Medicine  Weight  pH
A         1       1
B         2       1
C         4       3
D         5       4

Solution
Plot the values on a graph. Mark any k centroids.

Calculate the Euclidean distance of each point from the centroids.
D = [distance matrix shown on the slide]
Based on minimum distance, we assign points to clusters:
K1 = A
K2 = B, C, D
Calculate the new centroids. For K2: C = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
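The distance values did not survive in the transcript. As a worked illustration, assume the two starting centroids were placed on A and B; this choice is an assumption, but it is consistent with the assignment K1 = A, K2 = B, C, D. The columns below are A, B, C, D, the first row holds distances to c_1 and the second row distances to c_2:

```latex
c_1 = (1, 1), \qquad c_2 = (2, 1)

D =
\begin{pmatrix}
0 & 1 & \sqrt{13} & 5 \\
1 & 0 & \sqrt{8}  & \sqrt{18}
\end{pmatrix}
\approx
\begin{pmatrix}
0 & 1 & 3.61 & 5 \\
1 & 0 & 2.83 & 4.24
\end{pmatrix}
```

Taking the smaller entry in each column puts A with c_1 and B, C, D with c_2.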

Marking the new centroids
Continue the iteration until there is no change in the centroids or clusters.

Final solution
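The whole medicine example can be replayed with a small sketch of the graph-style procedure. It is an illustration under stated assumptions, not the slide's own code: `kmeans` is a hypothetical helper, and the starting centroids are assumed to sit on A and B, which is consistent with the first assignment K1 = A, K2 = B, C, D.

```python
import math

def kmeans(points, centroids):
    """Alternate assignment and centroid update until assignments repeat."""
    centroids = list(centroids)
    assignment = None
    while True:
        # Steps 2-3: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            return assignment, centroids
        assignment = new_assignment
        # Step 4: each centroid becomes the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))

medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
# Assumed starting centroids: the positions of A and B.
labels, centers = kmeans(list(medicines.values()), [(1, 1), (2, 1)])
```

Under that assumption the loop ends with A and B in one cluster and C and D in the other, with centroids (1.5, 1) and (4.5, 3.5).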

Example 2
Use the K-means algorithm to create two clusters. Given:

Example 3. Group the below points into 3 clusters

Agglomerative
Step 1: Make each object its own cluster.
Step 2: Calculate the Euclidean distance from every point to every other point, i.e., construct a distance matrix.
Step 3: Identify the two clusters with the shortest distance and merge them.
Go to Step 2. Repeat until all objects are in one cluster.
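The steps above can be sketched as a naive loop. This is a minimal illustration rather than an efficient implementation: `agglomerate`, its `linkage` parameter, and the four sample points are all hypothetical. Passing `min` measures cluster distance by the closest pair of members (single link), `max` by the farthest pair (complete link), and `statistics.mean` by the average over all pairs (average link).

```python
import math
import statistics
from itertools import combinations

def agglomerate(points, linkage=min):
    """Merge the two closest clusters until one remains; return the merges.

    points maps a label to its coordinates. Each merge is recorded as
    (cluster_a, cluster_b, distance).
    """
    clusters = [[name] for name in points]            # Step 1
    merges = []
    while len(clusters) > 1:
        # Step 2: distance between two clusters from member-to-member distances.
        def distance(pair):
            a, b = pair
            return linkage(math.dist(points[i], points[j]) for i in a for j in b)
        # Step 3: pick the closest pair (the first one found wins a tie).
        a, b = min(combinations(clusters, 2), key=distance)
        merges.append((a, b, distance((a, b))))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return merges

# Hypothetical sample points, chosen so that no two distances tie.
sample = {"P": (0, 0), "Q": (1, 0), "R": (4, 0), "S": (4, 2)}
single = agglomerate(sample, linkage=min)
complete = agglomerate(sample, linkage=max)
average = agglomerate(sample, linkage=statistics.mean)
```

The recorded merge distances are the heights at which branches join in a dendrogram. Recomputing every pairwise distance on each pass keeps the code close to the steps but is slow for large inputs.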

Example
Use the single link technique to find clusters in the given database.
X  Y

Plot given data

Construct a distance matrix

Identify two nearest clusters

Repeat the process until all objects are in the same cluster

Dendrogram

Single link
Min distance matrix

Complete link
Max distance matrix

Average link
Average distance matrix
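The slides show only the matrices. Reading min, max, and average as the rule for the distance between two clusters, which is the usual meaning of these linkage names, the three criteria can be written as follows, where C_a and C_b are clusters and d(x, y) is the distance between two objects (notation introduced here for illustration):

```latex
d_{\text{single}}(C_a, C_b)   = \min_{x \in C_a,\; y \in C_b} d(x, y)

d_{\text{complete}}(C_a, C_b) = \max_{x \in C_a,\; y \in C_b} d(x, y)

d_{\text{average}}(C_a, C_b)  = \frac{1}{|C_a|\,|C_b|} \sum_{x \in C_a} \sum_{y \in C_b} d(x, y)
```

All three agree when both clusters contain a single object, so the first merge is the same under every criterion.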

Use the data below to draw single link, complete link, and average link dendrograms.
Object  X    Y
A       2    2
B       3    2
C       1    1
D       3    1
E       1.5  0.5