START OF DAY 8 Reading: Chap. 14

Midterm: go over questions (general issues only; for specific issues, visit with me). Regrading may make your grade go up OR down.

What is Clustering?
Unsupervised learning
Seeks to organize data into “reasonable” groups
Often based on some similarity (or distance) measure defined over data elements
Quantitative characterization may include:
– Centroid / Medoid
– Radius
– Diameter
– Compactness
– Etc.

Clustering Taxonomy
Partitional methods:
– Algorithm produces a single partition or clustering of the data elements
Hierarchical methods:
– Algorithm produces a series of nested partitions, each of which represents a possible clustering of the data elements
Symbolic methods:
– Algorithm produces a hierarchy of concepts

Distance-based Clustering

Estimating 1 Mean (I)
What is the maximum likelihood hypothesis for the mean of a single Normal distribution, given observed instances drawn from it?
From Section 6.4, it is the hypothesis that minimizes the sum of squared errors:
μ_ML = argmin_μ Σ_i (x_i − μ)²

Estimating 1 Mean (II)
And, in this case, the sum of squared errors is minimized by the sample mean:
μ_ML = (1/m) Σ_i x_i
What if there are k means to estimate?
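Before moving on to k means: a short derivation (not on the original slide) of why the sample mean minimizes the sum of squared errors, obtained by setting the derivative with respect to μ to zero.

```latex
\frac{d}{d\mu}\sum_{i=1}^{m}(x_i-\mu)^2 \;=\; -2\sum_{i=1}^{m}(x_i-\mu) \;=\; 0
\quad\Longrightarrow\quad
\mu_{ML} \;=\; \frac{1}{m}\sum_{i=1}^{m} x_i
```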

Intuition
Consider k hidden variables per instance
Each instance is thus extended from x_i to <x_i, z_i1, …, z_ik>:
– x_i is the observed instance
– z_ij are the hidden variables
– z_ij = 1 if x_i was generated by the jth Gaussian, and 0 otherwise

The (EM) k-Means Algorithm
Initialization:
– Set h = <μ_1, …, μ_k>, where the μ_i's are arbitrary values
Step 1:
– Calculate the expected value E[z_ij] for each hidden variable z_ij, assuming the current hypothesis h
Step 2:
– Calculate a new maximum likelihood hypothesis h' = <μ_1', …, μ_k'>, assuming the value taken on by each z_ij is its expected value E[z_ij] as calculated in Step 1
– Replace h by h'
– If the stopping condition is not met, go to Step 1
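A minimal numpy sketch of this EM loop for estimating the means of k one-dimensional Gaussians. It assumes equal, known variance sigma^2 and equal mixing weights (assumptions not stated on the slide); the function name and defaults are illustrative only.

```python
import numpy as np

def em_k_means(x, k, sigma=1.0, max_iter=100, rng=None):
    """EM estimation of k Gaussian means (known, equal variance sigma^2)."""
    rng = np.random.default_rng(rng)
    mu = rng.choice(x, size=k, replace=False)          # initialization: arbitrary means
    for _ in range(max_iter):
        # Step 1: E[z_ij] under the current hypothesis h = <mu_1, ..., mu_k>
        sq = (x[:, None] - mu[None, :]) ** 2           # shape (n, k)
        w = np.exp(-sq / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)              # normalize over the k Gaussians
        # Step 2: new maximum-likelihood means, weighting each x_i by E[z_ij]
        new_mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        if np.allclose(new_mu, mu):                    # stopping condition
            break
        mu = new_mu
    return mu

# Example: two well-separated Gaussians
data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_k_means(data, k=2))
```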

The k-Means Algorithm
Instead of considering actual probabilities, we simplify as follows:
– E[z_ij] is 1 if x_i is closest to μ_j and 0 otherwise (i.e., we “threshold” the expected value)
– μ_j' is the mean of the x_i's for which E[z_ij] = 1 (i.e., we compute the unweighted center of gravity of those points closest to μ_j)

K-means Overview
Algorithm builds a single k-subset partition
Works with numeric data only
Starts with k random centroids
Uses iterative re-assignment of data items to clusters, based on some distance to the centroids, until all assignments remain unchanged

K-means Algorithm
1) Pick a number, k, of cluster centers (at random; they do not have to be data items)
2) Assign every item to its nearest cluster center (e.g., using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2 and 3 until convergence (e.g., the change in cluster assignments falls below a threshold)
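A compact sketch of these four steps in Python/numpy (a hypothetical implementation, not the course's reference code); it also realizes the thresholded E/M steps from the previous slide.

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    """K-means on an (n, d) array X; returns (centers, assignments)."""
    rng = np.random.default_rng(rng)
    # 1) pick k cluster centers at random (random data items here for simplicity,
    #    although the slide notes they need not be data items)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for it in range(max_iter):
        # 2) assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # 4) stop when cluster assignments no longer change
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # 3) move each center to the mean of its assigned items
        for j in range(k):
            if np.any(assign == j):          # keep the old center if a cluster is empty
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign
```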

k-Means Example (I): Pick 3 initial cluster centers k1, k2, k3 (randomly) [scatter plot of the data on X and Y axes]

k-Means Example (II): Assign each point to the closest cluster center [scatter plot]

k-Means Example (III): Move each cluster center to the mean of its cluster [scatter plot]

k-Means Example (IV): Reassign points closest to a different new cluster center. Q: Which points are reassigned? [scatter plot]

k-Means Example (V): A: three points (shown via animation in the original slides) [scatter plot]

k-Means Example (VI): Re-compute cluster means [scatter plot]

k-Means Example (VII): Move cluster centers to cluster means; re-assign: no change [scatter plot]

k-Means Demo

K-means Discussion
Result can vary significantly depending on the initial choice of seeds
Can get trapped in a local minimum
– Example: [figure showing instances and initial cluster centers]
– Remedy: restart with different random seeds (sketched below)
Does not handle outliers well
Does not scale very well
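One common remedy for the local-minimum problem, sketched here with the hypothetical k_means function from the earlier example: run several random restarts and keep the clustering with the lowest total squared error (X is assumed to be an (n, d) numpy array).

```python
best_sse, best = float("inf"), None
for seed in range(10):                              # 10 restarts with different random seeds
    centers, assign = k_means(X, k=3, rng=seed)
    sse = ((X - centers[assign]) ** 2).sum()        # total squared distance to assigned centers
    if sse < best_sse:
        best_sse, best = sse, (centers, assign)
centers, assign = best
```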

K-means Summary
Advantages:
– Simple, understandable
– Items automatically assigned to clusters
Disadvantages:
– Must pick the number of clusters beforehand
– All items forced into a cluster
– Sensitive to outliers
– Time complexity is O(mkn), where m is the number of iterations, k the number of clusters, and n the number of items

K-medoids Overview
Also known as Partitioning Around Medoids (PAM)
Similar to K-means:
– Algorithm builds a single k-subset partition
– Works with numeric data only
– Uses iterative re-assignment of medoids as long as overall clustering quality improves
Difference with K-means:
– Starts with k random medoids and sticks to medoids (i.e., actual data points) as cluster representatives

K-medoids Quality Measures
Clustering quality:
– Sum of all distances from each non-medoid object to the medoid of the cluster it is in (an item is assigned to the cluster represented by the medoid to which it is closest)
Quality impact:
– C_jih = cost change for item j associated with swapping medoid i for non-medoid h
– Total impact on clustering quality of the medoid change (h replaces i): TC_ih = Σ_j C_jih

K-medoids Algorithm
1) Pick a number, k, of random data items as medoids
2) Calculate TC_mn for the medoid/non-medoid pairs and select the pair (m,n) with the smallest (most negative) total impact on clustering quality
3) If TC_mn < 0, replace m by n and go back to 2
4) Assign every item to its nearest medoid
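A naive sketch of this algorithm in Python/numpy (my own illustrative code, using Euclidean distance and an exhaustive search over medoid/non-medoid swaps; practical variants such as CLARA and CLARANS, mentioned later, avoid the exhaustive search).

```python
import numpy as np

def pam(X, k, rng=None):
    """K-medoids (PAM) on an (n, d) array X; returns (medoid indices, assignments)."""
    rng = np.random.default_rng(rng)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    medoids = list(rng.choice(len(X), size=k, replace=False))   # 1) k random data items

    def quality(meds):
        # clustering quality: sum of distances to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    current = quality(medoids)
    while True:
        # 2) find the medoid/non-medoid pair (m, h) with the smallest total impact TC_mh
        best_tc, best_swap = 0.0, None
        for i, m in enumerate(medoids):
            for h in range(len(X)):
                if h in medoids:
                    continue
                tc = quality(medoids[:i] + [h] + medoids[i + 1:]) - current
                if tc < best_tc:
                    best_tc, best_swap = tc, (i, h)
        if best_swap is None:            # 3) no swap with TC < 0: stop
            break
        i, h = best_swap                 # otherwise replace the medoid and repeat
        medoids[i] = h
        current += best_tc
    # 4) assign every item to its nearest medoid
    assign = D[:, medoids].argmin(axis=1)
    return [int(m) for m in medoids], assign
```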

K-medoids Example (I) Assume k=2 Select X5 and X9 as medoids Current clustering: {X1,X2,X5,X6,X7},{X3,X4,X8,X9,X10}

K-medoids Example (II)
Must try to replace X5 by X1, X2, X3, X4, X6, X7, X8, X10
Must try to replace X9 by X1, X2, X3, X4, X6, X7, X8, X10
Replace X5 by X4: -9; by X6: -5; by X7: 0; by X8: -1; by X10: 1
Replace X9 by X1: -7; by X2: -5; by X3: -7; by X4: -8; by X6: -3; by X7: 5; by X8: -1; by X10: -4
→ Replace X5 by X3

K-medoids Example (III)
X3 and X9 are the new medoids
Current clustering: {X1,X2,X3,X4}, {X5,X6,X7,X8,X9,X10}
No change in medoids yields better quality → DONE! (I think)

K-medoids Discussion
As in K-means, the user must select the value of k, but the resulting clustering is independent of the initial choice of medoids
Handles outliers well
Does not scale well
– CLARA and CLARANS improve on the time complexity of K-medoids by using sampling and neighborhoods

K-medoids Summary
Advantages:
– Simple, understandable
– Items automatically assigned to clusters
– Handles outliers
Disadvantages:
– Must pick the number of clusters beforehand
– High time complexity

Hierarchical Clustering
Two approaches:
– Agglomerative: Each instance is initially its own cluster. The most similar instances/clusters are then progressively combined until all instances are in one cluster. Each level of the hierarchy is a different set/grouping of clusters.
– Divisive: Start with all instances as one cluster and progressively divide until each instance is its own cluster.
You can then decide what level of granularity you want to output.

Hierarchical Agglomerative Clustering
1. Assign each data item to its own cluster
2. Compute pairwise distances between clusters
3. Merge the two closest clusters
4. If more than one cluster is left, go to step 2

Cluster Distances
Complete-link: maximum pairwise distance between the items of two different clusters
Single-link: minimum pairwise distance between the items of two different clusters
Average-link: average pairwise distance between the items of two different clusters
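A naive O(n^3) sketch of the HAC loop above, with the three cluster distances as selectable options (illustrative code only; library implementations are far more efficient).

```python
import numpy as np

def hac(X, linkage="single"):
    """Agglomerative clustering of an (n, d) array X.
    Returns the merge history [(cluster_a, cluster_b, distance), ...]."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # item-level distances
    clusters = {i: [i] for i in range(len(X))}                  # 1) each item its own cluster

    def cluster_dist(a, b):
        pair = D[np.ix_(clusters[a], clusters[b])]              # all pairwise item distances
        if linkage == "single":
            return pair.min()        # single-link: minimum pairwise distance
        if linkage == "complete":
            return pair.max()        # complete-link: maximum pairwise distance
        return pair.mean()           # average-link: average pairwise distance

    merges, next_id = [], len(X)
    while len(clusters) > 1:                                    # 4) until one cluster remains
        keys = list(clusters)
        # 2)-3) find and merge the two closest clusters
        d, a, b = min((cluster_dist(a, b), a, b)
                      for i, a in enumerate(keys) for b in keys[i + 1:])
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges
```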

HAC Example Assume single-link

HAC Demo

HAC Discussion
No need to specify the number of clusters
* May also be done after the dendrogram is built
Still need to know when to stop:
* Too early → clustering too fine
* Too late → clustering too coarse
Trade one parameter (K) for another (distance threshold)?
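For instance, with SciPy one can build the dendrogram once and then cut it at a distance threshold T instead of fixing K in advance (a hedged sketch; the data and threshold value here are arbitrary).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 2)                       # toy data
Z = linkage(X, method="single")                  # or "complete", "average"
T = 0.15                                         # distance threshold (problem-specific)
labels = fcluster(Z, t=T, criterion="distance")  # cut the dendrogram at height T
print("clusters at threshold", T, ":", labels.max())
```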

Picking “the” Threshold
Guessing (sub-optimal)
Looking for “jumps” in the distance function (subjective)
Human examination (expensive, unreasonable)
Semi-supervised learning:
– Must-link, cannot-link constraints
– Stated explicitly or implicitly (e.g., through labels)

Simple Solution
Select a random sample S of items
Label the items in S
Cluster S
Find the threshold value T that maximizes some clustering quality measure on S
Cluster the complete dataset up to T
|S| = 50 was shown to give reasonable results
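A sketch of that procedure, assuming a small labeled sample (X_s, y_s) and using the adjusted Rand index as the clustering quality measure (my choice; the slide does not name one). The variable names in the commented usage are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

def pick_threshold(X_s, y_s, candidates, method="single"):
    """Return the threshold T maximizing clustering quality on the labeled sample."""
    Z = linkage(X_s, method=method)
    scores = [(adjusted_rand_score(y_s, fcluster(Z, t=t, criterion="distance")), t)
              for t in candidates]
    return max(scores)[1]

# e.g. label |S| = 50 sampled items, pick T on them, then cluster the complete dataset up to T:
# T = pick_threshold(X_sample, y_sample, candidates=np.linspace(0.05, 1.0, 20))
# labels = fcluster(linkage(X_full, method="single"), t=T, criterion="distance")
```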

HAC Summary
Best implementation is O(n² log n)
All clusterings are produced in one pass
Impact of the cluster distance:
– Single link: can lead to long, chained clusters where some points are quite far from each other
– Complete link: finds more compact clusters
– Average link: used less because the average has to be re-computed each time

Symbolic Clustering

COBWEB Demo dortmund.de/kdnet/auto?self=$81d91eaae317b2bebb

END OF DAY 8 Homework: Clustering