Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler TexPoint fonts used in EMF. Read the TexPoint manual before you delete.

Slides:



Advertisements
Similar presentations
CS 478 – Tools for Machine Learning and Data Mining Clustering: Distance-based Approaches.
Advertisements

SEEM Tutorial 4 – Clustering. 2 What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar (or.
Machine Learning and Data Mining Linear regression
Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.
1 CSE 980: Data Mining Lecture 16: Hierarchical Clustering.
Albert Gatt Corpora and Statistical Methods Lecture 13.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Agglomerative Hierarchical Clustering 1. Compute a distance matrix 2. Merge the two closest clusters 3. Update the distance matrix 4. Repeat Step 2 until.
1 Machine Learning: Lecture 10 Unsupervised Learning (Based on Chapter 9 of Nilsson, N., Introduction to Machine Learning, 1996)
Machine Learning and Data Mining Clustering
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
Clustering II.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
1 Text Clustering. 2 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: –Examples within a cluster are very similar.
Unsupervised Learning: Clustering Some material adapted from slides by Andrew Moore, CMU. Visit for
Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University
Adapted by Doug Downey from Machine Learning EECS 349, Bryan Pardo Machine Learning Clustering.
Introduction to Bioinformatics - Tutorial no. 12
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Clustering Unsupervised learning Generating “classes”
Evaluating Performance for Data Mining Techniques
Unsupervised Learning. CS583, Bing Liu, UIC 2 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate.
START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
DOCUMENT CLUSTERING. Clustering  Automatically group related documents into clusters.  Example  Medical documents  Legal documents  Financial documents.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Unsupervised Learning: Clustering Some material adapted from slides by Andrew Moore, CMU. Visit for
CLUSTERING. Overview Definition of Clustering Existing clustering methods Clustering examples.
CSE5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
Artificial Intelligence 8. Supervised and unsupervised learning Japan Advanced Institute of Science and Technology (JAIST) Yoshimasa Tsuruoka.
Prepared by: Mahmoud Rafeek Al-Farra
Data Science and Big Data Analytics Chap 4: Advanced Analytical Theory and Methods: Clustering Charles Tappert Seidenberg School of CSIS, Pace University.
Lecture 6 Spring 2010 Dr. Jianjun Hu CSCE883 Machine Learning.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.8: Clustering Rodney Nielsen Many of these.
CHAPTER 1: Introduction. 2 Why “Learn”? Machine learning is programming computers to optimize a performance criterion using example data or past experience.
Clustering Instructor: Max Welling ICS 178 Machine Learning & Data Mining.
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree like diagram that.
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources:
Compiled By: Raj Gaurang Tiwari Assistant Professor SRMGPC, Lucknow Unsupervised Learning.
Text Clustering Hongning Wang
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
1 Machine Learning Lecture 9: Clustering Moshe Koppel Slides adapted from Raymond J. Mooney.
Clustering Algorithms Sunida Ratanothayanon. What is Clustering?
Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources:
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
Bayesian Hierarchical Clustering Paper by K. Heller and Z. Ghahramani ICML 2005 Presented by David Williams Paper Discussion Group ( )
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining and Text Mining. The Standard Data Mining process.
Unsupervised Learning
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
Semi-Supervised Clustering
Clustering CSC 600: Data Mining Class 21.
Machine Learning and Data Mining Clustering
Machine Learning and Data Mining Clustering
K-means and Hierarchical Clustering
CS 188: Artificial Intelligence Fall 2008
Text Categorization Berlin Chen 2003 Reference:
Machine Learning and Data Mining Clustering
Unsupervised Learning: Clustering
SEEM4630 Tutorial 3 – Clustering.
Hierarchical Clustering
Machine Learning and Data Mining Clustering
Unsupervised Learning
Presentation transcript:

Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAA +

Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering. K-Means clustering.

What is Clustering? You can say this “unsupervised learning” There is not any label for each instance of data. Clustering is alternatively called as “grouping”. Clustering algorithms rely on a distance metric between data points. –Data points close to each other are in a same cluster.

Unsupervised learning Supervised learning –Predict target value (“y”) given features (“x”) Unsupervised learning –Understand patterns of data (just “x”) –Useful for many reasons Data mining (“explain”) Missing data values (“impute”) Representation (feature generation or selection) One example: clustering X1X1 X2X2

Clustering Example Clustering has wide applications in: Economic Science –especially market research –Grouping customers with same preferences for advertising or predicting sells. Text mining –Search engines cluster documents.

Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering. K-Means clustering.

Cluster Distances There are four different measures to compute the distance between each pair of clusters: 1- Min distance 2- Max distance 3- Average distance 4- Mean distance

Cluster Distances (Min) X1X1 X2X2

Cluster Distances (Max) X1X1 X2X2

Cluster Distances (Average) Average of distance between all nodes in cluster C i and C j X1X1 X2X2

Cluster Distances (Mean) µ i = Mean of all data points in cluster i (over all features) X1X1 X2X2

Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering. K-Means clustering.

Hierarchical Agglomerative Clustering Define a distance between clusters Initialize: every example is a cluster. Iterate: –Compute distances between all clusters –Merge two closest clusters Save both clustering and sequence of cluster operations as a Dendrogram. Initially, every datum is a cluster

Iteration 1 Merge two closest clusters

Iteration 2 Cluster 2Cluster 1

Iteration 3 Builds up a sequence of clusters (“hierarchical”) Algorithm complexity O(N 2 ) (Why?) In matlab: “linkage” function (stats toolbox)

Dendrogram Two clusters Three clusters Stop the process whenever there are enough number of clusters.

Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering. K-Means clustering.

K-Means Clustering Always there are K cluster. –In contrast, in agglomerative clustering, the number of clusters reduces. iterate: (until clusters remains almost fix). –For each cluster, compute cluster means: ( X i is feature i) –For each data point, find the closest cluster.

Choosing the number of clusters Cost function for clustering is : what is the optimal value of k? (can increasing k ever increase the cost?) This is a model complexity issue –Much like choosing lots of features – they only (seem to) help –Too many clusters leads to over fitting. To reduce the number of clusters, one solution is to penalize for complexity: –Add (# parameters) * log(N) to the cost –Now more clusters can increase cost, if they don’t help “enough”

Choosing the number of clusters (2) The Cattell scree test: Dissimilarity Number of Clusters Scree is a loose accumulation of broken rock at the base of a cliff or mountain. The best K is 3.

2-means Clustering Select two points randomly.

2-means Clustering Find distance of all points to these two points.

2-means Clustering Find two clusters based on distances.

2-means Clustering Find new means. Green points are mean values for two clusters.

2-means Clustering Find new clusters.

2-means Clustering Find new means. Find distance of all points to these two mean values. Find new clusters. Nothing is changed.

Summary Definition of clustering –Difference between supervised and unsupervised learning. –Finding labels for each datum. Clustering algorithms –Agglomerative clustering Each data is a cluster. Combine closest clusters Stop when there is a sufficient number of cluster. –K-means Always K clusters exist. Find new mean value. Find new clusters. Stop when nothing changes in clusters (or changes are less than very small value).