CS 8751 ML & KDD Data Clustering 1
Clustering
Unsupervised learning
Generating “classes”
Distance/similarity measures
Agglomerative methods
Divisive methods

CS 8751 ML & KDD Data Clustering 2
What is Clustering?
Form of unsupervised learning - no information from a teacher
The process of partitioning a set of data into a set of meaningful (hopefully) sub-classes, called clusters
Cluster:
–a collection of data points that are “similar” to one another and collectively should be treated as a group
–as a collection, sufficiently different from other groups

CS 8751 ML & KDD Data Clustering 3
Clusters

CS 8751 ML & KDD Data Clustering 4
Characterizing Cluster Methods
Class - label applied by the clustering algorithm
–hard versus fuzzy: hard - a point either is or is not a member of a cluster; fuzzy - a point is a member of a cluster with some probability
Distance (similarity) measure - value indicating how similar data points are
Deterministic versus stochastic
–deterministic - the same clusters are produced every time
–stochastic - different clusters may result
Hierarchical - points connected into clusters using a hierarchical structure

CS 8751 ML & KDD Data Clustering 5
Basic Clustering Methodology
Two approaches:
Agglomerative: pairs of items/clusters are successively linked to produce larger clusters
Divisive (partitioning): items are initially placed in one cluster and successively divided into separate groups

CS 8751 ML & KDD Data Clustering 6
Cluster Validity
One difficult question: how good are the clusters produced by a particular algorithm?
Difficult to develop an objective measure
Some approaches:
–external assessment: compare the clustering to an a priori clustering
–internal assessment: determine whether the clustering is intrinsically appropriate for the data
–relative assessment: compare one clustering method’s results to another’s

CS 8751 ML & KDD Data Clustering 7
Basic Questions
Data preparation - getting/setting up the data for clustering
–extraction
–normalization
Similarity/distance measure - how the distance between points is defined
Use of domain knowledge (prior knowledge)
–can influence preparation and the similarity/distance measure
Efficiency - how to construct clusters in a reasonable amount of time

CS 8751 ML & KDD Data Clustering 8
Distance/Similarity Measures
Key to grouping points: distance = inverse of similarity
Often based on representing objects as feature vectors
(Tables: an employee DB; term frequencies for documents)
Which objects are more similar?

CS 8751 ML & KDD Data Clustering 9
Distance/Similarity Measures
Properties of measures: based on feature values x_{instance,feature}
–for all objects x_i, x_j: dist(x_i, x_j) ≥ 0 and dist(x_i, x_j) = dist(x_j, x_i)
–for any object x_i: dist(x_i, x_i) = 0
–triangle inequality: dist(x_i, x_j) ≤ dist(x_i, x_k) + dist(x_k, x_j)
Manhattan distance: dist(x_i, x_j) = Σ_f |x_{i,f} − x_{j,f}|
Euclidean distance: dist(x_i, x_j) = √( Σ_f (x_{i,f} − x_{j,f})² )
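To make the two measures concrete, here is a minimal sketch in Python (the function names manhattan and euclidean are ours, for illustration):

```python
import math

def manhattan(xi, xj):
    """Sum of absolute per-feature differences (L1)."""
    return sum(abs(a - b) for a, b in zip(xi, xj))

def euclidean(xi, xj):
    """Square root of the summed squared per-feature differences (L2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(manhattan((1, 2, 3), (4, 0, 3)))  # 3 + 2 + 0 = 5
print(euclidean((0, 0), (3, 4)))        # 5.0
```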

CS 8751 ML & KDD Data Clustering 10
Distance/Similarity Measures
Minkowski distance (order p): dist_p(x_i, x_j) = ( Σ_f |x_{i,f} − x_{j,f}|^p )^{1/p}
Mahalanobis distance: dist(x_i, x_j) = √( (x_i − x_j)ᵀ Σ⁻¹ (x_i − x_j) ), where Σ⁻¹ is the inverse of the covariance matrix of the patterns
More complex measures: Mutual Neighbor Distance (MND) - based on a count of the number of shared neighbors
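A similar sketch for the two parameterized measures, assuming NumPy is available; estimating the covariance matrix from the data, as done here, is one common choice rather than something the slide prescribes:

```python
import numpy as np

def minkowski(xi, xj, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def mahalanobis(xi, xj, cov):
    """Mahalanobis distance given the covariance matrix of the patterns."""
    diff = xi - xj
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 9.0]])
cov = np.cov(X, rowvar=False)        # covariance estimated from the patterns
print(minkowski(X[0], X[3], p=2))    # plain Euclidean distance
print(mahalanobis(X[0], X[3], cov))  # scale- and correlation-aware distance
```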

CS 8751 ML & KDD Data Clustering 11
Distance (Similarity) Matrix
Similarity (distance) matrix:
–based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
–the (i, j) entry in the matrix is the distance (similarity) between items i and j
Note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower triangle of the matrix. The diagonal is all 1’s (similarity) or all 0’s (distance).
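A minimal sketch of building such a matrix, exploiting the symmetry noted above (the helper name distance_matrix is ours):

```python
import numpy as np

def distance_matrix(points, dist):
    """Full symmetric matrix; only the lower triangle is actually computed."""
    n = len(points)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):                  # lower triangle only
            D[i, j] = D[j, i] = dist(points[i], points[j])
    return D                                # diagonal stays all 0 (distance)

pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(distance_matrix(pts, lambda a, b: np.linalg.norm(a - b)))
```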

CS 8751 ML & KDD Data Clustering 12
Employee Data Set
#  Age  Yrs  Salary  Sex  Group
1   …    …   …,000    M   Accnt
2   …    …   …,000    M   DBMS
3   …    …   …,000    M   Servc
4   …    …   …,000    F   DBMS
5   …    …   …,000    F   Accnt
6   …    …   …,000    M   Servc
7   …    …   …,000    F   Servc
8   …    …   …,000    F   Presd
9   …    …   …,000    M   Accnt

CS 8751 ML & KDD Data Clustering 13
Calculating Distance
Try to normalize (so values fall in a range of 0 to 1, approximately)
SexDiff is 0 if same Sex, 1 if different; GroupDiff is 0 if same Group, 1 if different
Example:
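The worked numbers from the original slide are not reproduced here; as a hedged sketch, the attribute ranges and equal weighting below are our assumptions, chosen only so that each term falls roughly in [0, 1]:

```python
def employee_distance(e1, e2, age_range=50.0, yrs_range=40.0, sal_range=100_000.0):
    """Normalized mixed-attribute distance between two employees.

    The range constants and the equal-weight sum are illustrative
    assumptions, not the exact formula from the slide.
    """
    age_diff = abs(e1["age"] - e2["age"]) / age_range
    yrs_diff = abs(e1["yrs"] - e2["yrs"]) / yrs_range
    sal_diff = abs(e1["salary"] - e2["salary"]) / sal_range
    sex_diff = 0.0 if e1["sex"] == e2["sex"] else 1.0      # SexDiff
    grp_diff = 0.0 if e1["group"] == e2["group"] else 1.0  # GroupDiff
    return age_diff + yrs_diff + sal_diff + sex_diff + grp_diff

e1 = {"age": 30, "yrs": 5, "salary": 30_000, "sex": "M", "group": "Accnt"}
e2 = {"age": 40, "yrs": 10, "salary": 50_000, "sex": "F", "group": "Accnt"}
print(employee_distance(e1, e2))  # 0.2 + 0.125 + 0.2 + 1.0 + 0.0 = 1.525
```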

CS 8751 ML & KDD Data Clustering 14
Employee Distance Matrix

CS 8751 ML & KDD Data Clustering 15
Employee Distance Matrix
Threshold: for example, keep links when distance <= 1.8

CS 8751 ML & KDD Data Clustering 16
Visualizing Distance – Threshold Graph

CS 8751 ML & KDD Data Clustering 17
Agglomerative Single-Link
Single-link: connect all points together that are within a threshold distance
Algorithm:
1. place all points in the graph
2. pick a point to start a cluster
3. for each point in the current cluster, add all points within the threshold not already in the cluster; repeat until no more items are added to the cluster
4. remove the points in the current cluster from the graph
5. repeat from step 2 until no points remain in the graph
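A minimal sketch of this procedure, assuming a precomputed distance matrix D; the resulting clusters are exactly the connected components of the threshold graph:

```python
def single_link(D, threshold):
    """Single-link clustering over a distance matrix D: points i and j end
    up in the same cluster iff a chain of links with distance <= threshold
    connects them."""
    n = len(D)
    unvisited = set(range(n))
    clusters = []
    while unvisited:
        start = unvisited.pop()          # step 2: pick a point to start a cluster
        cluster, frontier = {start}, [start]
        while frontier:                  # step 3: grow by in-threshold points
            i = frontier.pop()
            near = {j for j in unvisited if D[i][j] <= threshold}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(cluster)         # steps 4-5: remove points, repeat
    return clusters

D = [[0, 1, 5],
     [1, 0, 5],
     [5, 5, 0]]
print(single_link(D, threshold=1.8))     # two clusters: {0, 1} and {2}
```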

CS 8751 ML & KDD Data Clustering 18
Agglomerative Single-Link Example
After all points but 8 are connected

CS 8751 ML & KDD Data Clustering 19
Agglomerative Complete-Link (Clique)
Complete-link (clique): all of the points in a cluster must be within the threshold distance of one another
In the threshold distance matrix, a clique is a complete graph
Algorithms are based on finding maximal cliques (once a point is chosen, pick the largest clique it is part of)
–not an easy problem: the maximum-clique problem is NP-hard
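A sketch of the clique search, assuming the networkx library is available (the slides name no implementation); find_cliques enumerates the maximal cliques of the threshold graph, and its running time can be exponential in the worst case:

```python
import networkx as nx

def maximal_cliques(D, threshold):
    """Build the threshold graph (edge iff distance <= threshold) and
    enumerate its maximal cliques: the candidate complete-link clusters."""
    n = len(D)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    G.add_edges_from((i, j) for i in range(n) for j in range(i)
                     if D[i][j] <= threshold)
    return list(nx.find_cliques(G))

D = [[0, 1, 1, 5],
     [1, 0, 1, 5],
     [1, 1, 0, 1],
     [5, 5, 1, 0]]
print(maximal_cliques(D, threshold=1.8))  # e.g. [[0, 1, 2], [2, 3]]
```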

CS 8751 ML & KDD Data Clustering 20
Complete Link – Clique Search
Look for all maximal cliques: {1,3,9} {1,2,9} ??

CS 8751 ML & KDD Data Clustering 21
Hierarchical Clustering
Based on some method of representing a hierarchy of data points
One idea: a hierarchical dendrogram (connects points based on similarity)
(Figure: dendrogram with leaf order E8 E4 E5 E1 E9 E3 E6 E7 E2)

CS 8751 ML & KDD Data Clustering 22
Hierarchical Agglomerative
Compute the distance matrix
Put each data point in its own cluster
Find the most similar pair of clusters
–merge the pair of clusters (show the merger in the dendrogram)
–update the proximity matrix
–repeat until all patterns are in one cluster
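In practice this loop is usually delegated to a library. A sketch using SciPy (an assumption on our part; the slides name no implementation), where linkage performs the successive merging and dendrogram draws the merge tree:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [9.0, 0.0]])
D = pdist(X)                      # condensed pairwise distance matrix
Z = linkage(D, method='single')   # 'complete' gives clique-style merging
print(Z)                          # each row: the two clusters merged, their
                                  # merge distance, and the new cluster's size
# dendrogram(Z) renders the merge tree when matplotlib is available
```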

CS 8751 ML & KDD Data Clustering 23
Partitional Methods
Divide data points into a number of clusters
Difficult questions:
–how many clusters?
–how to divide the points?
–how to represent a cluster?
Representing a cluster: often done in terms of a centroid
–the centroid (the mean of the points) minimizes the sum of squared distances between itself and all points in the cluster

CS 8751 ML & KDD Data Clustering 24
k-Means Clustering
1. Choose k cluster centers (randomly pick k data points as centers, or distribute the centers randomly in the space)
2. Assign each pattern to the closest cluster center
3. Recompute the cluster centers using the current cluster memberships (moving the centers may change the memberships)
4. If a convergence criterion is not met, go to step 2
Convergence criteria:
–no reassignment of patterns
–minimal change in the cluster centers
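A compact NumPy sketch of the four steps, using the "no center movement" convergence criterion; initialization by sampling k data points matches the first option in step 1:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(max_iter):
        # step 2: assign each pattern to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute the centers from the current memberships
        new_centers = np.array([X[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        # step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
centers, labels = kmeans(X, k=2)
print(centers)   # two centers, near (0, 0) and (5, 5)
```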

CS 8751 ML & KDD Data Clustering 25
k-Means Clustering

CS 8751 ML & KDD Data Clustering 26
k-Means Variations
What if there are too many/not enough clusters? After some convergence:
–any cluster with too large a distance between its members is split
–any clusters too close together are combined
–any cluster not corresponding to any points is moved
–thresholds are decided empirically
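A hedged sketch of the detection step only, using each member's distance to its center as a proxy for the spread between members (our simplification); the thresholds, per the slide, would be set empirically, and a full variant would re-run k-means after splitting or merging:

```python
import numpy as np

def split_and_merge_candidates(centers, X, labels, split_thresh, merge_thresh):
    """Report clusters that look too spread out (split candidates) and
    pairs of centers that sit too close together (merge candidates)."""
    to_split, to_merge = [], []
    for c, center in enumerate(centers):
        members = X[labels == c]
        if len(members) and np.linalg.norm(members - center, axis=1).max() > split_thresh:
            to_split.append(c)              # spread too large: candidate split
    for a in range(len(centers)):
        for b in range(a):
            if np.linalg.norm(centers[a] - centers[b]) < merge_thresh:
                to_merge.append((b, a))     # centers too close: candidate merge
    return to_split, to_merge

centers = np.array([[0.0, 0.0], [0.2, 0.0], [9.0, 9.0]])
X = np.array([[0.1, 0.0], [0.1, 0.1], [9.0, 8.0], [9.0, 10.0]])
labels = np.array([0, 1, 2, 2])
print(split_and_merge_candidates(centers, X, labels, 0.5, 1.0))  # ([2], [(0, 1)])
```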

CS 8751 ML & KDD Data Clustering 27
An Incremental Clustering Algorithm
1. Assign the first data point to a cluster
2. Consider the next data point: either assign it to an existing cluster or create a new cluster; assignment to a cluster is based on a threshold
3. Repeat step 2 until all points are clustered
Useful for efficient clustering
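A minimal sketch of this one-pass scheme (often called leader clustering); representing each cluster by its first point is our assumption, since the slide does not fix the representative:

```python
def incremental_cluster(points, dist, threshold):
    """One pass over the data: each point joins the first cluster whose
    representative is within the threshold, else it starts a new cluster."""
    leaders, clusters = [], []
    for p in points:                        # steps 2-3: one pass over the data
        for leader, cluster in zip(leaders, clusters):
            if dist(p, leader) <= threshold:
                cluster.append(p)           # assign to an existing cluster
                break
        else:
            leaders.append(p)               # create a new cluster
            clusters.append([p])
    return clusters

pts = [(0, 0), (0.5, 0), (5, 5), (5.2, 4.9)]
euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print(incremental_cluster(pts, euclid, threshold=1.0))
# [[(0, 0), (0.5, 0)], [(5, 5), (5.2, 4.9)]]
```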