Machine Learning Problems
Unsupervised Learning
– Clustering
– Density estimation
– Dimensionality reduction
Supervised Learning
– Classification
– Regression

Clustering
Definition: partition a given set of objects into M groups (clusters) such that the objects within each group are ‘similar’ to one another and ‘different’ from the objects of the other groups.
– A distance (or similarity) measure is required.
– Unsupervised learning: no class labels.
– Finding the optimal partition is NP-hard.
Examples: documents, images, time series, image segmentation, video analysis, gene clustering, motif discovery, web applications.
Big issues: estimating the number of clusters; solutions are difficult to evaluate.

Clustering
Cluster assignments: hard vs soft (fuzzy/probabilistic)
Clustering methods
– Hierarchical (agglomerative, divisive)
– Density-based (non-parametric)
– Parametric (k-means, mixture models, etc.)
Clustering problems (input types)
– Data vectors
– Similarity matrix

Agglomerative Clustering
The simplest approach and a good starting point.
– Starting from singleton clusters, at each step we merge the two most similar clusters.
– A similarity (or distance) measure between clusters is needed.
– Output: dendrogram.
– Drawback: merging decisions are permanent (they cannot be corrected at a later stage).
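A minimal sketch of agglomerative clustering with SciPy; the two-blob data and the Ward linkage are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="ward")                     # merge history: each row is one merge step
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would plot the tree of merges
```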

Density-Based Clustering (e.g. DBSCAN)
– Identify ‘dense regions’ in the data space and merge neighboring dense regions.
– A dense region requires many points within a neighborhood.
– Complexity: O(n²).
– Points are characterized as core, border, or outlier based on two parameters, Eps (neighborhood radius, e.g. 1cm) and MinPts (e.g. 5), which are set empirically (how?) – see the sketch below.
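A minimal usage sketch with scikit-learn's DBSCAN; the data and the eps/min_samples values are illustrative (they correspond to the Eps/MinPts parameters above and are normally tuned empirically, e.g. via a k-distance plot).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
labels = db.labels_                           # cluster index per point; -1 marks outliers (noise)
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # core points; the rest of the clustered points are border points
```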

Parametric methods
– k-means (data vectors): O(n) (n: the number of objects to be clustered)
– k-medoids (similarity matrix): O(n²)
– Mixture models (data vectors): O(n)
– Spectral clustering (similarity matrix): O(n³)
– Kernel k-means (similarity matrix): O(n²)
– Affinity Propagation (similarity matrix): O(n²)

k-means
Partition a dataset X of N vectors x_i into M subsets (clusters) C_k such that the intra-cluster variance is minimized.
– Intra-cluster variance: sum of squared distances from the cluster prototype m_k. Clustering error: E = Σ_{k=1}^{M} Σ_{x_i ∈ C_k} ||x_i − m_k||².
– In k-means the prototype is the cluster center.
– Finds local minima w.r.t. the clustering error (sum of intra-cluster variances).
– Highly dependent on the initial positions of the centers m_k (see the sketch below).
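A minimal NumPy sketch of Lloyd's k-means, alternating assignment and center-update steps; the random choice of M initial prototypes is an assumption and is exactly what makes the result depend on the starting centers.

```python
import numpy as np

def kmeans(X, M, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), M, replace=False)]                  # initial prototypes m_k
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)      # squared distances to each center
        labels = d.argmin(axis=1)                                      # assignment step
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(M)])            # update step (keep empty clusters' centers)
        if np.allclose(new, centers):
            break
        centers = new
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    error = ((X - centers[labels]) ** 2).sum()                         # clustering error E
    return labels, centers, error
```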

k-medoids
Similar to k-means, but the representative is the cluster medoid: the cluster object with the smallest average distance to the other cluster objects.
– At each iteration the medoid is computed instead of the centroid.
– Increased complexity: O(n²).
– The medoid is more robust to outliers.
– k-medoids can be used with a similarity matrix (a rough sketch follows below).
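A rough sketch of a k-medoids style alternation working directly on a pairwise distance matrix D (n×n); this is a simplified Voronoi-style iteration for illustration, not the full PAM swap procedure.

```python
import numpy as np

def kmedoids(D, M, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), M, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)          # assign each object to its nearest medoid
        new = medoids.copy()
        for k in range(M):
            members = np.where(labels == k)[0]
            if len(members) == 0:                      # keep the old medoid if the cluster emptied
                continue
            # medoid = the member with the smallest average distance to the other members
            new[k] = members[D[np.ix_(members, members)].mean(axis=1).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids
```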

k-means (1) vs k-medoids (2)

Spectral Clustering (Ng & Jordan, NIPS 2001)
Input: similarity matrix between pairs of objects and the number of clusters M.
– Example similarity: a(x,y) = exp(-||x-y||²/σ²) (RBF kernel).
– Spectral analysis of the similarity matrix: compute the top M eigenvectors and form the matrix U.
– The i-th object corresponds to a vector in R^M: the i-th row of U.
– The rows of U are clustered into M clusters using k-means (see the sketch below).
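A minimal sketch of the recipe above (RBF affinities, top-M eigenvectors, k-means on the rows); note that it omits the graph-Laplacian normalization and row normalization of the full Ng & Jordan algorithm, so it is only a simplified illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral(X, M, sigma=1.0):
    A = np.exp(-cdist(X, X, "sqeuclidean") / sigma ** 2)     # a(x,y) = exp(-||x-y||^2 / sigma^2)
    w, V = np.linalg.eigh(A)                                  # eigen-decomposition of the similarity matrix
    U = V[:, np.argsort(w)[-M:]]                              # top-M eigenvectors as columns of U
    return KMeans(n_clusters=M, n_init=10).fit_predict(U)     # cluster the rows of U with k-means
```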

Spectral Clustering example: k-means vs spectral clustering (RBF kernel, σ=1) on the two-rings dataset.

Spectral Clustering ↔ Graph Cut
Data graph
– Vertices: objects
– Edge weights: pairwise similarities
Clustering = graph partitioning

Spectral Clustering ↔ Graph Cut
– Cluster indicator vector z_i = (0,0,…,0,1,0,…,0)^T for object i.
– Indicator matrix Z = [z_1,…,z_n] (n×k, for k clusters); its normalized version Y = Z(Z^T Z)^{-1/2} satisfies Y^T Y = I.
– Graph partitioning = trace maximization w.r.t. Z: maximize tr(Y^T A Y), where A is the similarity matrix and Y keeps the indicator structure.
– The relaxed problem (Y any real matrix with Y^T Y = I) is solved optimally using the spectral algorithm: Y is formed from the top-k eigenvectors of A.
– k-means is applied on the rows of Y (entries y_ij) to obtain the discrete indicators z_ij.

Kernel-Based Clustering (non-linear cluster separation)
– Given a set of objects and the kernel matrix K=[K_ij] containing the similarities between each pair of objects.
– Goal: partition the dataset into subsets (clusters) C_k such that the intra-cluster similarity is maximized.
– Kernel trick: data points are mapped from the input space to a higher-dimensional feature space through a transformation φ(x). RBF kernel: K(x,y) = exp(-||x-y||²/σ²).

Kernel k-Means
Kernel k-means = k-means in feature space
– Minimizes the clustering error in feature space.
Differences from k-means
– Cluster centers m_k in feature space cannot be computed explicitly.
– Each cluster C_k is described explicitly by its data objects.
– Distances from the centers are computed in feature space through kernel values only: ||φ(x_i) − m_k||² = K_ii − (2/|C_k|) Σ_{x_j∈C_k} K_ij + (1/|C_k|²) Σ_{x_j,x_l∈C_k} K_jl.
– Finds local minima; strong dependence on the initial partition (see the sketch below).
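A small sketch of the feature-space distance computation used by kernel k-means: ||φ(x_i) − m_k||² is expressed purely through kernel entries, since m_k itself cannot be formed. Here K is an n×n kernel matrix and `members` holds the indices of cluster C_k; both names are illustrative.

```python
import numpy as np

def dist_to_center(K, i, members):
    """Squared feature-space distance of point i from the center of the cluster given by `members`."""
    nk = len(members)
    return (K[i, i]
            - 2.0 * K[i, members].sum() / nk
            + K[np.ix_(members, members)].sum() / nk ** 2)
```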

Spectral Relaxation of Kernel k-means
(Dhillon, I.S., Guan, Y., Kulis, B., Weighted graph cuts without eigenvectors: A multilevel approach, IEEE TPAMI, 2007)
– The kernel k-means clustering error decomposes into a constant term minus a trace term, so minimizing it is equivalent to a trace maximization of the same form as the graph-cut objective.
– Consequently, spectral methods can substitute kernel k-means and vice versa.

Exemplar-Based Methods
Cluster data by identifying representative exemplars.
– An exemplar is an actual dataset point, similar to a medoid.
– All data points are considered as possible exemplars.
– The number of clusters is decided during learning (but it depends on a user-defined parameter).
Methods
– Convex Mixture Models
– Affinity Propagation

Affinity Propagation (AP) (Frey et al., Science 2007)
Clusters data by identifying representative exemplars.
– Exemplars are identified by transmitting messages between data points.
Input to the algorithm
– A similarity matrix where s(i,k) indicates how well data point x_k is suited to be an exemplar for data point x_i.
– Self-similarities s(k,k) (‘preferences’) that control the number of identified clusters: a higher value means that x_k is more likely to become an exemplar.
– Self-similarities are set independently of the other similarities; higher values result in more clusters.

Affinity Propagation
Clustering criterion: maximize Σ_i s(i, c_i)
– s(i,c_i) is the similarity between the data point x_i and its exemplar c_i.
– The criterion is optimized by passing messages between data points, called responsibilities and availabilities.
Responsibility r(i,k)
– Sent from x_i to the candidate exemplar x_k; reflects the accumulated evidence for how well suited x_k is to serve as the exemplar of x_i, taking into account other potential exemplars for x_i.
– Update: r(i,k) ← s(i,k) − max_{k'≠k} {a(i,k') + s(i,k')}.

Affinity Propagation
Availability a(i,k)
– Sent from the candidate exemplar x_k to x_i; reflects the accumulated evidence for how appropriate it would be for x_i to choose x_k as its exemplar, taking into account the support from other points that x_k should be an exemplar.
– Update: a(i,k) ← min{0, r(k,k) + Σ_{i'∉{i,k}} max(0, r(i',k))} for i≠k, and a(k,k) ← Σ_{i'≠k} max(0, r(i',k)).
The algorithm alternates between responsibility and availability calculations until convergence.
The exemplars are the points with r(k,k) + a(k,k) > 0 (see the usage sketch below).
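A minimal usage sketch with scikit-learn's AffinityPropagation; its `preference` argument plays the role of the self-similarities s(k,k) above (higher values give more clusters), and leaving it at the default uses the median similarity. The data and damping value are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

ap = AffinityPropagation(preference=None, damping=0.9, random_state=0).fit(X)
exemplars = ap.cluster_centers_indices_   # indices of the data points chosen as exemplars
labels = ap.labels_                       # exemplar assignment for every point
```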

Affinity Propagation

Incremental Clustering: Bisecting k-means (Steinbach, Karypis & Kumar, SIGKDD 2000)
Start with k=1 (m_1 = data average). Then, assuming a solution with k clusters:
– Find the ‘best’ cluster and split it into two subclusters.
– Replace its center with the two subcluster centers.
– Run k-means with the k+1 centers (optional).
– k := k+1
Repeat until M clusters have been obtained.
How to split a cluster: use several random trials; in each trial
– Randomly initialize two centers from the cluster points.
– Run 2-means using the cluster points only.
Keep the split of the trial with the lowest clustering error (a rough sketch follows below).
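A rough sketch of bisecting k-means on top of scikit-learn's KMeans; choosing the cluster with the largest sum of squared errors as the one to split is one common notion of the ‘best’ cluster, and the sketch assumes each selected cluster has at least two points.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, M, n_trials=5):
    labels = np.zeros(len(X), dtype=int)
    for new_label in range(1, M):
        # pick the cluster with the largest intra-cluster error
        sse = [((X[labels == c] - X[labels == c].mean(0)) ** 2).sum()
               for c in range(new_label)]
        target = int(np.argmax(sse))
        idx = np.where(labels == target)[0]
        best = None
        for t in range(n_trials):                               # several random 2-means trials
            km = KMeans(n_clusters=2, n_init=1, random_state=t).fit(X[idx])
            if best is None or km.inertia_ < best.inertia_:
                best = km                                       # keep the trial with the lowest error
        labels[idx[best.labels_ == 1]] = new_label              # apply the best split
    return labels
```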

Global k-means (Likas, Vlassis & Verbeek, PR 2003)
Incremental, deterministic clustering algorithm that runs k-means several times.
– Finds near-optimal solutions w.r.t. the clustering error.
Idea: a near-optimal solution for k clusters can be obtained by running k-means from an initial state where
– the k-1 centers are initialized from a near-optimal solution of the (k-1)-clustering problem, and
– the k-th center is initialized at some data point x_n (which one?).
Consider all possible initializations (one for each x_n).

Global k-means
In order to solve the M-clustering problem:
1. Solve the 1-clustering problem (trivial).
2. Solve the k-clustering problem using the solution of the (k-1)-clustering problem:
– Execute k-means N times; at the n-th run (n=1,…,N) the initial state consists of the k-1 centers of the (k-1)-clustering solution plus the data point x_n as the k-th center.
– Keep the solution of the run with the lowest clustering error as the solution with k clusters.
– k := k+1.
3. Repeat step 2 until k=M (a compact sketch follows below).
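A compact sketch of global k-means built on scikit-learn's KMeans: for each k it tries every data point x_n as the new k-th center on top of the previous (k-1)-center solution and keeps the run with the lowest clustering error. It is these N k-means runs per value of k that fast global k-means avoids.

```python
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, M):
    centers = X.mean(axis=0, keepdims=True)              # k = 1: trivial solution (data average)
    for k in range(2, M + 1):
        best_err, best_centers = np.inf, None
        for x_n in X:                                     # one k-means run per candidate point
            init = np.vstack([centers, x_n])              # previous centers + x_n as the k-th center
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
            if km.inertia_ < best_err:
                best_err, best_centers = km.inertia_, km.cluster_centers_
        centers = best_centers
    return centers
```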

Example: best initial positions of m_2, m_3, m_4 and m_5 found at successive steps of global k-means.

Fast Global k-Means
How is the complexity reduced?
– We select the initial state with the greatest reduction in clustering error in the first iteration of k-means (this reduction can be computed analytically, see the sketch below).
– k-means is then executed only once, from this state.
– The set of candidate initial points can be further restricted (kd-tree, summarization).
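A sketch of the candidate-selection step of fast global k-means, under the assumption that the guaranteed error reduction of placing the new center at point x_n is b_n = Σ_j max(0, d_j − ||x_n − x_j||²), with d_j the squared distance of x_j to its closest current center; only the best candidate is then used to initialize a single k-means run.

```python
import numpy as np

def best_candidate(X, centers):
    # d_j: squared distance of each point to its closest current center
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    pair = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)        # ||x_n - x_j||^2 for all pairs
    reduction = np.maximum(d[None, :] - pair, 0).sum(axis=1)     # b_n for every candidate n
    return int(np.argmax(reduction))                             # index of the best new-center candidate
```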

Global Kernel k-Means (Tzortzis & Likas, IEEE TNN 2009)
In order to solve the M-clustering problem:
1. Solve the 1-clustering problem with kernel k-means (trivial solution).
2. Solve the k-clustering problem using the solution of the (k-1)-clustering problem:
a) Let {C_1,…,C_{k-1}} denote the solution to the (k-1)-clustering problem.
b) Execute kernel k-means N times, initialized during the n-th run with the clusters {C_1,…,C_{k-1}, {x_n}} (the n-th point forms a singleton k-th cluster).
c) Keep the run with the lowest clustering error as the solution with k clusters.
d) k := k+1.
3. Repeat step 2 until k=M.
Speed-ups: the fast global kernel k-means scheme can be applied, or representative candidate points can be selected using convex mixture models.

Example: best initial clusters C_2, C_3 and C_4 (blue circles mark the optimal initialization of the cluster to be added); RBF kernel K(x,y) = exp(-||x-y||²/σ²).

Clustering Methods: Summary
– Usually we assume that the number of clusters is given.
– k-means is still the most widely used method.
– Mixture models can be used when lots of data are available.
– Spectral clustering (or kernel k-means) is the most popular choice when a similarity matrix is given.
– Beware of the parameter initialization problem!
– The absence of ground truth makes evaluation difficult.
– How could we estimate the number of clusters?