Clustering

How are we doing on the pass sequence? Pretty good! We can now automatically learn the features needed to track both people. But it sucks that we need to hand-label the coordinates of both men in 30 frames and hand-label the 2 classes for the white-shirt tracker. (Same set of weights used for all hidden units.)

Unsupervised learning. Goal: Learn a machine or model that explains the data using a predefined set of assumptions about how the explanations can work. There isn't any labeled data, just input patterns (i.e., instead of (x, t) pairs, we have only x's). Examples of unsupervised learning:
- Clustering (e.g., k-means clustering)
- Dimensionality reduction (e.g., PCA)
- Data compression methods (e.g., ZIP, JPEG, MPEG)
- Generative models & Bayesian networks (tomorrow)

A simple example of unsupervised learning. Our explanation of the data is that each training case is equal to an unknown number μ plus zero-mean Gaussian noise with unknown variance σ². Unsupervised learning:
- Determine μ and σ², and
- For each training case x_n, determine the noise signal (x_n - μ).
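A minimal NumPy sketch of this setup. The slide does not name the estimator, so using the maximum-likelihood estimates of μ and σ² is an assumption, and the function name fit_gaussian is just illustrative:

```python
import numpy as np

def fit_gaussian(x):
    """Fit the model on this slide: x_n = mu + zero-mean Gaussian noise.

    Uses the maximum-likelihood estimates (an assumption; the slide does
    not say which estimator to use).
    """
    x = np.asarray(x, dtype=float)
    mu = x.mean()            # estimate of the unknown number mu
    sigma2 = x.var()         # estimate of the noise variance sigma^2
    noise = x - mu           # noise signal for each training case
    return mu, sigma2, noise
```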

A simple example of unsupervised learning. What can this model be used for? Outlier detection / novelty detection:
- Given a new input x_{N+1}, define p as follows (see figure): if x_{N+1} < μ, let p = ∫_{z=-∞}^{x_{N+1}} N(z | μ, σ²) dz, and otherwise let p = ∫_{z=x_{N+1}}^{∞} N(z | μ, σ²) dz.
- Then, we can say that the new input x_{N+1} is an outlier (unusual) if p < 0.01.
Preference prediction:
- Given a set of new inputs, we can rank them according to probability.
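A short sketch of this tail-probability test (outlier_probability is an illustrative name; the Gaussian CDF is computed with the error function so no extra libraries are needed):

```python
import math

def outlier_probability(x_new, mu, sigma2):
    """Tail probability p for a new input x_{N+1}, as defined on this slide."""
    sigma = math.sqrt(sigma2)
    cdf = 0.5 * (1.0 + math.erf((x_new - mu) / (sigma * math.sqrt(2.0))))
    if x_new < mu:
        p = cdf            # integral of N(z | mu, sigma^2) from -inf to x_new
    else:
        p = 1.0 - cdf      # integral of N(z | mu, sigma^2) from x_new to +inf
    return p

# The slide's rule: flag x_new as an outlier (unusual) if p < 0.01.
```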

Problem (figure): a Gaussian estimated from the training data can give this test case higher density than any of the training cases, even though it is quite probably an outlier.

K-means clustering (figure: data and initial random means; assign each point to its nearest mean; set each mean to the average of its data; iterations 1 to 4 and convergence)

K-means clustering. Suppose x_1, …, x_N are real-valued or vector-valued. Goal: Learn K means and assign each training case to a mean. Denote the means by μ_1, …, μ_K. Use one-of-K encoding for the assignments: r_nk = 1 if x_n is assigned to mean k, and r_nk = 0 otherwise. The goal of k-means clustering is to find the r's and μ's that minimize the following cost function:
J = Σ_n Σ_k r_nk ||x_n - μ_k||²
Generally, finding an exact solution takes time that is exponential in N.

K-means clustering. Note that given the μ's we can efficiently find the best r's, and given the r's, we can efficiently find the best μ's. Here's the algorithm:
- Pick initial means randomly or cleverly
- Loop until convergence:
  - Assign each training case to its nearest mean: r_nk = 1 if k = argmin_j ||x_n - μ_j||², and r_nk = 0 otherwise
  - Set each mean to the average of its training cases: μ_k = Σ_n r_nk x_n / Σ_n r_nk
Each step minimizes J w.r.t. the r's or the μ's, but the procedure is not guaranteed to find the minimum of J.
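A minimal NumPy sketch of these two alternating steps; names such as kmeans, n_iter, and seed are illustrative choices, not from the slides:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """K-means: alternate the assignment and update steps above.

    X: (N, D) array of training cases. Returns (means, assignments).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    means = X[rng.choice(len(X), size=K, replace=False)]   # initial means picked from the data
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: r_nk = 1 for the nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # Update step: set each mean to the average of its training cases
        new_means = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, assign
```

For example, `means, assign = kmeans(X, K=3)` clusters an (N, D) data matrix into 3 clusters.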

Example (figure: data and initial random means; assign each point to its nearest mean; set each mean to the average of its data; iterations 1 to 4 and convergence)

Example: Color quantization. Consider each image pixel as a 3-D vector (RGB) and use K-means clustering to find K prototype colors. Now, each pixel in the image can be stored using log₂(K) bits, with some loss in color quality. (figure: the same image at 24 bits per pixel and quantized at 1, 1.6, and 3.3 bits per pixel)
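A small sketch of color quantization that reuses the kmeans function from the sketch above (quantize_colors is an illustrative name):

```python
import numpy as np

def quantize_colors(image, K):
    """Cluster the RGB pixels with K-means and replace each pixel by its prototype color.

    image: (H, W, 3) array. Storing the assignments takes log2(K) bits per pixel.
    """
    H, W, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)   # each pixel as a 3-D vector
    palette, assign = kmeans(pixels, K)           # kmeans from the sketch above
    return palette[assign].reshape(H, W, 3), palette
```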

EM for a mixture of Gaussians. Initialization: Pick the π's, μ's, and Σ's randomly but validly. E step: For each training case, we need q(z) = p(z|x) = p(x|z)p(z) / Σ_z p(x|z)p(z). Defining γ(z_nk) = q(z_nk = 1), we need to actually compute γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j). M step: Do it in the log-domain! Recall: For labeled data, γ(z_nk) = z_nk.
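A minimal NumPy sketch of EM for a mixture of Gaussians. Restricting to diagonal covariances and the names em_gmm, gamma, and var are assumptions made for brevity; the E step is carried out in the log-domain as the slide advises:

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a mixture of K Gaussians with diagonal covariances.

    Returns mixing proportions pi, means mu, variances var, responsibilities gamma.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                          # mixing proportions pi_k
    mu = X[rng.choice(N, size=K, replace=False)]      # initial means picked from the data
    var = np.tile(X.var(axis=0) + 1e-6, (K, 1))       # initial per-dimension variances

    for _ in range(n_iter):
        # E step (log-domain): log pi_k + log N(x_n | mu_k, var_k), then normalize
        diff = X[:, None, :] - mu[None, :, :]
        log_p = (np.log(pi)[None, :]
                 - 0.5 * np.sum(diff ** 2 / var[None, :, :]
                                + np.log(2 * np.pi * var[None, :, :]), axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)      # avoid underflow
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)      # gamma_nk = q(z_nk = 1)

        # M step: weighted averages under the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return pi, mu, var, gamma
```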

EM for mixture of Gaussians: E step (figure, repeated over several animation frames: a two-component mixture with π₁ = 0.5, π₂ = 0.5; images z from the data set are shown with their responsibilities P(c|z) for c = 1 and c = 2)

EM for mixture of Gaussians: M step (figure: with π₁ = 0.5, π₂ = 0.5, set μ₁ to the average of z weighted by P(c=1|z), and set μ₂ to the average of z weighted by P(c=2|z))

EM for mixture of Gaussians: M step (figure: set Σ₁ to the average of diag((z - μ₁)(z - μ₁)ᵀ) weighted by P(c=1|z), and set Σ₂ to the average of diag((z - μ₂)(z - μ₂)ᵀ) weighted by P(c=2|z))

… after iterating to convergence (figure: the fitted mixture, with π₁ = 0.6, π₂ = 0.4)

Non-vector-space clustering. For K-means clustering and EM, the data must lie in a vector space, where Euclidean distance is a natural measure of similarity. (Some methods, such as mixtures of Gaussians, allow each cluster to stretch and rotate its data, but these methods are still essentially based on Euclidean distance.) There can be advantages to using kernels k(x_i, x_k) to measure similarity and then making predictions using training cases. Now, we will study this approach for clustering, using s(i,k) to denote the similarity of x_i to x_k (these are like kernels, but not necessarily formally the same).

K-centers clustering (aka K-medoids clustering and K-medians clustering)

(figure sequence: randomly choose initial exemplars (data centers); assign data points to nearest centers; for each cluster, pick the best new center; assign data points to nearest centers again; repeat until convergence: final set of exemplars (centers))

The K-centers clustering algorithm. Denote the index of the center that x_i is currently assigned to by c_i (c_i = i indicates that x_i is a center).
- Initialization: Randomly pick K points and set c_k = k
- Loop until convergence:
  - For all i s.t. c_i ≠ i, set c_i ← argmax_{k: c_k = k} s(i,k)
  - For all k s.t. c_k = k:
    - Compute k_new ← argmax_{i: c_i = k} Σ_{j: c_j = k} s(j,i)
    - For all i s.t. c_i = k, set c_i ← k_new
This is similar to K-means clustering, except that the means are always data points.
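A minimal NumPy sketch of this K-centers loop on a precomputed similarity matrix; k_centers, n_iter, and seed are illustrative names:

```python
import numpy as np

def k_centers(S, K, n_iter=100, seed=0):
    """K-centers clustering on a similarity matrix, following the algorithm above.

    S: (N, N) array with S[i, k] = s(i, k). Returns c, where c[i] indexes the center of x_i.
    """
    rng = np.random.default_rng(seed)
    N = S.shape[0]
    centers = rng.choice(N, size=K, replace=False)
    for _ in range(n_iter):
        # Assign every point to its most similar center (centers stay assigned to themselves)
        c = centers[np.argmax(S[:, centers], axis=1)]
        c[centers] = centers
        # For each cluster, pick as the new center the member with the largest total similarity
        new_centers = np.array([
            members[np.argmax(S[np.ix_(members, members)].sum(axis=0))]
            for members in (np.flatnonzero(c == k) for k in centers)
        ])
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    c = centers[np.argmax(S[:, centers], axis=1)]
    c[centers] = centers
    return c
```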

Handwritten digit clustering and recognition

Example: Clustering 3000 Buffalo digits. Similarity: s(i,k) = -||x_i - x_k||². (figure: exemplars and cluster members for K = 10, 20, 40, 80)

The effect of random initialization (figure: squared error; affinity propagation)

K-centers clustering: how good is the solution?

Solution that minimizes distances

Affinity propagation (Frey & Dueck, Science, Feb 2007)

Affinity propagation

Solution that minimizes distances

How does affinity propagation work? The sum-product or max-sum algorithm (loopy belief propagation) can be used to approximately maximize the K-centers objective function (more on this tomorrow). The result is affinity propagation: data points exchange responsibilities and availabilities. (figure: sending responsibilities: data instance i sends r(i,k) to candidate exemplar k, taking into account the availabilities a(i,k') of competing candidate exemplars k'; sending availabilities: candidate exemplar k sends a(i,k) to data instance i, taking into account the responsibilities r(i',k) of supporting data instances i')

(figure, repeated: sending responsibilities and sending availabilities, as above) Making decisions: the exemplar for data instance i is the k that maximizes a(i,k) + r(i,k).

Message damping. Unstable dynamics are avoided in practice by damping messages:
r(i,k)* = λ r(i,k) + (1 - λ) r(i,k)^old
a(i,k)* = λ a(i,k) + (1 - λ) a(i,k)^old
Default: λ = 0.5
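A minimal NumPy sketch of affinity propagation using the standard responsibility and availability updates from Frey & Dueck (2007) together with the damping above. The names affinity_propagation, damping, and n_iter are illustrative, and placing the preferences on the diagonal of S is an assumption about the input format:

```python
import numpy as np

def affinity_propagation(S, damping=0.5, n_iter=200):
    """Affinity propagation on a similarity matrix S (Frey & Dueck, 2007).

    S: (N, N) similarities; the diagonal S[k, k] holds the preferences.
    Returns, for each data point i, the index of its chosen exemplar.
    """
    N = S.shape[0]
    R = np.zeros((N, N))   # responsibilities r(i, k)
    A = np.zeros((N, N))   # availabilities a(i, k)
    rows = np.arange(N)

    for _ in range(n_iter):
        # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        k_best = np.argmax(AS, axis=1)
        first = AS[rows, k_best]
        AS[rows, k_best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, k_best] = S[rows, k_best] - second
        R = damping * R_new + (1 - damping) * R        # damping, as on the slide above

        # Availabilities: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        #                 a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())             # include r(k,k) itself, unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A_new + (1 - damping) * A

    # Decisions: each point i picks the k that maximizes a(i,k) + r(i,k)
    return np.argmax(A + R, axis=1)
```

A common default is to set every preference S[k,k] to the median of the similarities, which tends to give a moderate number of clusters.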

MATLAB implementation (from

Document summarization using affinity propagation. s(sentence i, sentence k) = - number of bits needed to encode the words in sentence i using the words in sentence k and a global dictionary. Preference(sentence k) = - number of bits needed to encode the words in sentence k using only the global dictionary.
1) Affinity propagation identifies exemplars by recursively sending real-valued messages between pairs of data points.
2) The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure.
3) The availability is set to the self-responsibility plus the sum of the positive responsibilities the candidate exemplar receives from other points.
4) For different numbers of clusters, the reconstruction errors achieved by affinity propagation and k-centers clustering are compared.
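A rough, illustrative sketch of one way such a bit-cost similarity could be computed. The coding scheme below (indexing into sentence k's words versus falling back to a fixed-size global dictionary) and the constant GLOBAL_DICT_SIZE are assumptions for illustration, not the exact scheme used in the paper:

```python
import math

GLOBAL_DICT_SIZE = 50000   # assumed size of the global dictionary (illustrative)

def sentence_similarity(sentence_i, sentence_k):
    """s(sentence i, sentence k): minus the bits to encode sentence i's words
    using sentence k's words plus a global dictionary (a rough stand-in for
    the coding scheme on the slide)."""
    words_i = sentence_i.lower().split()
    words_k = set(sentence_k.lower().split())
    bits = 0.0
    for w in words_i:
        if w in words_k:
            bits += math.log2(len(words_k) + 1)    # cheap: index into sentence k's words
        else:
            bits += math.log2(GLOBAL_DICT_SIZE)    # expensive: fall back to the global dictionary
    return -bits

def preference(sentence_k):
    """Preference(sentence k): minus the bits to encode it with the global dictionary alone."""
    return -len(sentence_k.split()) * math.log2(GLOBAL_DICT_SIZE)
```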

Questions?

How are we doing on the pass sequence? Can clustering be used to automatically learn the two modes of tracking for the man in the white shirt? Maybe, but this system is getting too complex! Is there any simple way to put the pieces together…?