Learning Representations

Maximum likelihood. (Figure: a world state s gives rise to neuronal activity r.) The generative model is a probabilistic model of neuronal firing as a function of s.

Maximum likelihood s r ss World Activity r i =f i (s)+n i Generative Model

Maximum likelihood W D WW World Observations s r ss World Activity Probabilistic model of neuronal firing as a function of s Probabilistic model of how the data are generated given W Generative Model

Maximum likelihood W WW World Observations y*=f(x,W)+n Generative Model D={x i, y i* }

Maximum likelihood learning. To learn the optimal parameters W, we seek to maximize the likelihood of the data, which can be done through gradient descent or, in some special cases, analytically.

Maximum likelihood. Linear generative model: y* = Wx + n, with data D = {x_i, y_i*}.

Maximum likelihood. Linear generative model y* = Wx + n with data D = {x_i, y_i*}. The likelihood is a product over all examples, P(D|W) = ∏_i P(y_i* | x_i, W); note that the y_i*'s are treated as corrupted (noisy) data.

Maximum likelihood learning. Minimizing the quadratic (squared-error) distance is equivalent to maximizing a Gaussian likelihood function.
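A one-line check of this equivalence, written out for the Gaussian model y* = f(x, W) + n used above (σ² is the noise variance; this is a standard identity, not taken from the slides):

```latex
-\log P(D \mid W) \;=\; \sum_i \frac{\lVert y_i^{*} - f(x_i, W)\rVert^{2}}{2\sigma^{2}} \;+\; \text{const}
```

so maximizing the Gaussian likelihood over W is exactly minimizing the summed squared error.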

Maximum likelihood learning. Analytical solution: for the linear model, ordinary least squares, W = (Σ_i y_i* x_iᵀ)(Σ_i x_i x_iᵀ)⁻¹. Gradient descent: the delta rule, ΔW = η (y_i* − W x_i) x_iᵀ.
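A minimal sketch of the two routes for the linear model y* = Wx + n; the data sizes, learning rate, and variable names are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 4, 2                                  # samples, input dim, output dim
W_true = rng.normal(size=(k, d))
X = rng.normal(size=(n, d))
Y = X @ W_true.T + 0.1 * rng.normal(size=(n, k))     # y* = Wx + Gaussian noise

# Analytical solution: ordinary least squares (the Gaussian ML estimate).
W_ols = np.linalg.lstsq(X, Y, rcond=None)[0].T

# Delta rule: dW = eta * (y* - W x) x^T, one example at a time.
W = np.zeros((k, d))
eta = 0.01
for epoch in range(20):
    for x, y_star in zip(X, Y):
        W += eta * np.outer(y_star - W @ x, x)

# Both comparisons should print True.
print(np.allclose(W_ols, W_true, atol=0.1), np.allclose(W, W_ols, atol=0.1))
```

Both estimates end up close to W_true; the closed form is exact for this model, while the delta rule gets there incrementally.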

Maximum likelihood learning. Example: training a two-layer network. Very important: you need to cross-validate.

Maximum likelihood learning. Supervised learning: the data consists of pairs of input/output vectors {x_i, y_i*}. Assume that the data were generated by a network and then corrupted by Gaussian noise. Learning: adjust the parameters of your network to increase the likelihood that the data were indeed generated by your network. Note: if your network is nothing like the system that generated the data, you could be in trouble.

Maximum likelihood learning. Unsupervised learning: the data consists of input vectors only, {x_i}. Causal models assume that the data are due to some hidden causes plus noise; this is the generative model. Goal of learning: given a set of observations, find the parameters of the generative model. As usual, we will find the parameters by maximizing the likelihood of the observations.

Maximum likelihood learning. Example: unsupervised learning in a two-layer network. (Figure: a layer of causes y generates a layer of sensory stimuli x through the generative model.) The network represents the joint distribution.

Maximum likelihood learning. Wait! The network is upside down! Aren't we doing things the wrong way around? No: the idea is that what's responsible for the sensory stimuli are high-order causes, like the presence of physical objects in the world, their identity, their location, their color, and so on. The generative model goes from the causes/objects to the sensory stimuli; recognition will go from stimuli to objects/causes.

Maximum likelihood learning The network represents the joint distribution P(x,y). Given the joint distribution, inferences are easy. We use Bayes rule to compute P(y|x) (recognition) or P(x|y) (expectations).
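In symbols (same notation as above), both inferences are just conditionals of the joint distribution; this is a standard restatement, not an equation from the slides:

```latex
P(y \mid x) \;=\; \frac{P(x, y)}{P(x)} \;=\; \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')},
\qquad
P(x \mid y) \;=\; \frac{P(x, y)}{P(y)}.
```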

Mixture of Gaussians. (Figure: data points in the (x1, x2) plane falling into two clusters.)

Maximum likelihood learning. Example: mixture of Gaussians. (Figure: a cause y, the cluster label, generates the sensory stimuli.) Generative model: 5 parameters, the mixing probabilities p(y=1) and p(y=2) plus the means and spread of the two Gaussian components.

Maximum likelihood learning. Example: mixture of Gaussians. Recognition model: given an observed pair (x1, x2), which cluster y generated it?
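A hedged numerical sketch of this recognition step for a two-Gaussian mixture in 2D; the parameter values are made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

p_y = np.array([0.5, 0.5])                        # p(y=1), p(y=2)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]

def recognize(x):
    """Posterior p(y | x) over the two clusters, via Bayes' rule."""
    lik = np.array([multivariate_normal.pdf(x, means[c], covs[c]) for c in range(2)])
    joint = lik * p_y                             # p(x | y) p(y)
    return joint / joint.sum()                    # normalize over y

print(recognize(np.array([2.5, 2.8])))            # mass should fall mostly on cluster 2
```

For a point near one cluster's mean, the posterior puts almost all its mass on that cluster.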

Maximum likelihood learning. Example: unsupervised learning in a two-layer network. (Figure: causes y generate sensory stimuli x through the generative model.)

Maximum likelihood learning. (Figure: recognition runs the other way, from sensory stimuli x back to causes y.)

Maximum likelihood learning. Learning consists in adjusting the weights to maximize the likelihood of the data. Problem: the generative model specifies P(x|y,w) (and the prior over causes), but the data set does not specify the hidden causes y_i, which we would need for a learning rule like the delta rule…

Maximum likelihood learning. Fix #1. You don't know y_i? Estimate it! Pick the MAP estimate of y_i on each trial (Olshausen and Field, as we will see later). Note that the weights are presumably incorrect (otherwise, there would be no need for learning); as a result, the y's obtained this way are also incorrect. Hopefully, they're good enough… Main problem: this breaks down if p(y_i|x_i,w) is multimodal.
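A minimal sketch of Fix #1 for a 1D mixture of two Gaussians with equal, known variances; alternating MAP assignment and refitting like this is sometimes called hard EM. The data, initialization, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])

mu = np.array([-0.5, 0.5])                  # deliberately poor initial means
for _ in range(20):
    # MAP estimate of the hidden cause (equal priors, equal variances):
    # just pick the nearer mean.
    y = np.argmin(np.abs(x[:, None] - mu[None, :]), axis=1)
    # Re-fit each mean as if those y's had actually been observed.
    mu = np.array([x[y == c].mean() for c in range(2)])

print(mu)   # should land near the true means (-2, 2)
```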

Maximum likelihood learning. Fix #2. Sample y_i from P(y|x_i,w) using Gibbs sampling. Slow, and again we're sampling from the wrong distribution… However, this is a much better idea for multimodal distributions.

Maximum likelihood learning. Fix #3. Marginalization: maximize the marginal likelihood P(x_i|w) = Σ_y P(x_i|y,w) P(y|w). Use gradient descent to adjust the parameters of the likelihood and prior (very slow).

Maximum likelihood learning. Fix #3. Marginalization: gradient descent for a mixture of two Gaussians. Even when the p(x|y)'s are Gaussian, the resulting likelihood function is not quadratic.
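A hedged sketch of this marginalization approach for a 1D mixture of two Gaussians with fixed equal mixing weights and unit variances, doing gradient ascent on the marginal log-likelihood with respect to the two means; all settings are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])

def gauss(x, mu):
    """Unit-variance Gaussian density N(x; mu, 1)."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

mu = np.array([-0.5, 0.5])
eta = 0.05
for _ in range(500):
    px_given_y = np.stack([gauss(x, m) for m in mu])    # (2, N)
    px = 0.5 * px_given_y.sum(axis=0)                   # marginal p(x_i), y summed out
    resp = 0.5 * px_given_y / px                        # p(y | x_i)
    # d log p(x_i) / d mu_c = p(y=c | x_i) (x_i - mu_c) for unit variance
    grad = (resp * (x[None, :] - mu[:, None])).sum(axis=1) / len(x)
    mu += eta * grad

print(mu)   # should approach the true means (-2, 2)
```

It typically converges much more slowly than the EM updates described next.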

Maximum likelihood learning. Fix #3. Marginalization is rarely feasible in practice: if y is a binary vector of dimension N, the sum over y contains 2^N terms…

Maximum likelihood learning Fix #4. The expectation-maximization algorithm (EM)

Maximum likelihood learning. EM: how can we optimize p(y|w)? Let π_1 be p(y=1|w). The update for π_1 involves the true posterior over the causes given the data, but we don't know this… Trick: use an approximation, p(y|x,w) under the current parameters, averaged over the data, of which we do have samples.

Maximum likelihood learning. E step: use the current parameters to approximate p_true(y|x) with p(y|x,w).

Maximum likelihood learning. EM, M step: optimize p(y) using the posteriors from the E step: π_1 = p(y=1|w) ← (1/N) Σ_i p(y=1|x_i, w).

Maximum likelihood learning. EM, M step: for the mean of p(x|y=1), use the posterior-weighted average from the E step: μ_1 ← Σ_i p(y=1|x_i, w) x_i / Σ_i p(y=1|x_i, w).

Maximum likelihood learning. EM: iterate the E and M steps. Guaranteed to converge, but possibly to a local optimum of the likelihood.
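Putting the E and M steps above together, a compact sketch of EM for a 1D mixture of two Gaussians; the data and initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 0.5, 400), rng.normal(2, 1.0, 200)])

pi = np.array([0.5, 0.5])        # p(y=1), p(y=2)
mu = np.array([-0.1, 0.1])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E step: responsibilities p(y | x_i, w) under the current parameters.
    lik = np.exp(-0.5 * (x[None, :] - mu[:, None]) ** 2 / var[:, None]) \
          / np.sqrt(2 * np.pi * var[:, None])
    resp = pi[:, None] * lik
    resp /= resp.sum(axis=0, keepdims=True)
    # M step: re-estimate the parameters from the responsibilities.
    Nk = resp.sum(axis=1)
    pi = Nk / len(x)
    mu = (resp * x[None, :]).sum(axis=1) / Nk
    var = (resp * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nk

print(pi, mu, var)   # should recover roughly (2/3, 1/3), (-2, 2), (0.25, 1.0)
```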

Maximum likelihood learning Fix #5. Model the recognition distribution and use EM for training. Wake-Sleep algorithm (Helmholtz machine).

Maximum likelihood learning. Helmholtz machine. (Figure: a generative model P(x|y,w) maps causes y to sensory stimuli x, and a recognition model Q(y|x,v) maps sensory stimuli back to causes.)

Maximum likelihood learning Fix #5. Model the recognition distribution and use EM for training. Wake-Sleep algorithm (Helmholtz machine). (Wake) M step: Use x’s to generate y according to Q(y|x,v), and adjust the w in P(x|y,w).

Maximum likelihood learning Fix #5. Model the recognition distribution and use EM for training. Wake-Sleep algorithm (Helmholtz machine). (Wake) M step: Use x’s to generate y according to Q(y|x,v), and adjust the w in P(x|y,w). (Sleep) E step: Generate y with P(y|w), and use it to generate x according to P(x|y,w). Then adjust the v in Q(y|x,v).

Maximum likelihood learning. Fix #5. Model the recognition distribution and use EM for training: the Wake-Sleep algorithm (Helmholtz machine). Advantage: after several approximations, you can get both learning rules to look like the delta rule… Usable for hierarchical architectures.
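A heavily simplified wake-sleep sketch with one binary hidden layer and one binary visible layer, just to show where the two delta-rule-like updates appear. The architecture, toy data, and learning rate are my assumptions; this is not the Helmholtz-machine code from the papers cited above:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, H, eta = 8, 3, 0.05           # visible units, hidden units, learning rate

# Generative model P(y|w) P(x|y,w): hidden biases b, weights W, visible biases c.
b = np.zeros(H)
W = 0.01 * rng.normal(size=(D, H))
c = np.zeros(D)
# Recognition model Q(y|x,v): weights V, biases a.
V = 0.01 * rng.normal(size=(H, D))
a = np.zeros(H)

# Toy data: noisy copies of two binary prototypes.
protos = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
data = (rng.random((2000, D)) < 0.9 * protos[rng.integers(0, 2, 2000)] + 0.05).astype(float)

for x in data:
    # Wake phase: recognize x, then delta-rule update of the generative
    # parameters so they reproduce x from the recognized causes.
    y = (rng.random(H) < sigmoid(V @ x + a)).astype(float)
    px = sigmoid(W @ y + c)
    W += eta * np.outer(x - px, y)
    c += eta * (x - px)
    b += eta * (y - sigmoid(b))
    # Sleep phase: dream (y, x) from the generative model, then delta-rule
    # update of the recognition parameters so they recover the dreamed causes.
    y_dream = (rng.random(H) < sigmoid(b)).astype(float)
    x_dream = (rng.random(D) < sigmoid(W @ y_dream + c)).astype(float)
    q = sigmoid(V @ x_dream + a)
    V += eta * np.outer(y_dream - q, x_dream)
    a += eta * (y_dream - q)

# A "fantasy" from the trained generative model.
y = (rng.random(H) < sigmoid(b)).astype(float)
print((sigmoid(W @ y + c) > 0.5).astype(int))
```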

Sparse representations in V1. Ex: Olshausen and Field: a natural image is generated according to a two-step process: (1) a set of coefficients {a_i} is drawn according to a sparse prior (Cauchy or related); (2) the image is the result of combining a set of basis functions weighted by the coefficients a_i and corrupted by Gaussian noise.

Sparse representations in V1. Network representation. (Figure: units with activities a_i project to the image pixels (x, y) through generative weights, the basis functions φ_i(x, y).)

Sparse representations in V1. The sparse prior favors solutions with most coefficients set to zero and a few with a high value. Why a sparse prior? Because the response of neurons to natural images is non-Gaussian and tends to be sparse.

Sparse representations in V1. The likelihood function is a Gaussian around the reconstruction: P(I | {a_i}, {φ_i}) ∝ exp( −‖I(x,y) − Σ_i a_i φ_i(x,y)‖² / 2σ² ).

Sparse representations in V1. The generative model is a model of the joint distribution P(I, {a_i} | {φ_i}) = P(I | {a_i}, {φ_i}) P({a_i}).

Sparse representations in V1. Learning: (1) Given a set of natural images, how do you learn the basis functions? Answer: find the basis functions maximizing the likelihood of the images, P({I_k} | {φ_i}). Sure, but where do you get the a's? Olshausen and Field: for each image, pick the a's maximizing the posterior over a, P(a | I_k, {φ_i}) (Fix #1).
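A hedged sketch of this learning loop (MAP inference of the coefficients, then a likelihood gradient step on the basis functions). The patch source, the log(1 + a²) sparse penalty, the step sizes, and the unit-norm constraint are simplifying assumptions on my part, not the original Olshausen and Field implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
P, K = 64, 32                                   # pixels per 8x8 patch, basis functions
patches = rng.normal(size=(500, P))             # stand-in for whitened image patches
phi = rng.normal(size=(P, K))
phi /= np.linalg.norm(phi, axis=0)              # unit-norm basis functions

lam, sigma2 = 0.1, 1.0                          # sparsity weight, noise variance

def map_coefficients(I, phi, steps=200, eta=0.1):
    """Gradient descent on -log posterior:
    ||I - phi a||^2 / (2 sigma2) + lam * sum_i log(1 + a_i^2)."""
    a = np.zeros(phi.shape[1])
    for _ in range(steps):
        grad = -phi.T @ (I - phi @ a) / sigma2 + lam * 2 * a / (1 + a ** 2)
        a -= eta * grad
    return a

eta_phi = 0.01
for I in patches:
    a = map_coefficients(I, phi)                       # Fix #1: MAP inference of a
    phi += eta_phi * np.outer(I - phi @ a, a)          # likelihood gradient for phi
    phi /= np.linalg.norm(phi, axis=0)                 # keep basis functions unit norm

print(phi.shape)
```

On real whitened natural-image patches (rather than the random stand-ins above), loops of this kind yield localized, oriented basis functions.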

Network implementation

(Figure: the same network run in the recognition direction; the image pixels (x, y) drive the coefficients a_i through recognition weights.)

Sparse representations in V1 The sparse prior favors patterns of activity for which most neurons are silent and a few are highly active.

Projective fields

Sparse representations in V1. The true receptive fields are input-dependent (because of the lateral interactions) in a way that seems somewhat consistent with experimental data. (Figure panels: receptive fields measured with dots vs. with gratings.)

Infomax idea Represent the world in a format that maximizes mutual information given the limited information capacity of neurons. Is this simply about packing bits in neuronal firing? What if the code is undecipherable?

Information theory and learning. The features extracted by infomax algorithms are often meaningful because high-level features are often good for compression. Example of scanned text: a page of text can be dramatically compressed if one treats it as a sequence of characters as opposed to pixels (e.g. this page: 800x700x8 bits vs. 200x8 bits, a compression factor in the thousands). General idea of unsupervised learning: compress the image and hope to discover a high-order description of the image.

Information theory and learning Ex: Decorrelation in the retina leads to center-surround receptive fields. Ex: ICA (factorial code) leads to oriented receptive fields. Problem: what can you do beyond ICA? How can you extract features that simplify computation? We need other constraints…

Sparse Coding. Ex: sparseness… Why sparseness? Grandmother cells: very easy to decode and very easy to use for further computation. Sparse codes are non-Gaussian, which often corresponds to high-level features (because it goes against the law of large numbers).
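A quick numerical illustration of "sparse is non-Gaussian": a code that is mostly zeros with occasional large values has large excess kurtosis, while a Gaussian code has excess kurtosis near zero. The distributions below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(6)
gaussian_code = rng.normal(size=100_000)
# Sparse code: heavy-tailed values, active only ~10% of the time.
sparse_code = rng.laplace(size=100_000) * (rng.random(100_000) < 0.1)

print(kurtosis(gaussian_code))   # near 0 (excess kurtosis of a Gaussian)
print(kurtosis(sparse_code))     # large and positive (heavy-tailed, sparse)
```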

Learning Representations. The main challenges for the future:
– Representing hierarchical structure
– Learning hierarchical structure