CSC2515: Lecture 7 (post) Independent Components Analysis, and Autoencoders Geoffrey Hinton.

Factor Analysis
The generative model for factor analysis assumes that the data was produced in three stages:
– Pick values independently for some hidden factors that have Gaussian priors.
– Linearly combine the factors using a factor loading matrix. Use more linear combinations than factors.
– Add Gaussian noise that is different for each input dimension.
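
A minimal numpy sketch of this three-stage generative process; the sizes, loading matrix, and noise variances below are made up for illustration, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_inputs, n_cases = 3, 10, 100_000

# Factor loading matrix: more linear combinations (inputs) than factors.
Lambda = rng.normal(size=(n_inputs, n_factors))
# A different Gaussian noise variance for each input dimension.
psi = rng.uniform(0.1, 0.5, size=n_inputs)

# 1. Pick factor values independently from Gaussian priors.
z = rng.normal(size=(n_cases, n_factors))
# 2. Linearly combine the factors using the loading matrix.
# 3. Add the per-dimension Gaussian noise.
x = z @ Lambda.T + rng.normal(size=(n_cases, n_inputs)) * np.sqrt(psi)

# The model implies cov(x) = Lambda Lambda^T + diag(psi); the sample estimate agrees.
model_cov = Lambda @ Lambda.T + np.diag(psi)
print(np.abs(np.cov(x, rowvar=False) - model_cov).max())  # small
```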

A degeneracy in Factor Analysis
We can always make an equivalent model by applying a rotation to the factors and then applying the inverse rotation to the factor loading matrix.
– The data does not prefer any particular orientation of the factors. This is a problem if we want to discover the true causal factors.
– Psychologists wanted to use scores on intelligence tests to find the independent factors of intelligence.
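
A quick numpy check of this degeneracy, with an arbitrary loading matrix, noise variances, and rotation: rotating the factor space while counter-rotating the loadings leaves the implied data covariance unchanged, so the two models cannot be told apart from data.

```python
import numpy as np

rng = np.random.default_rng(0)
Lambda = rng.normal(size=(10, 3))            # factor loading matrix
Psi = np.diag(rng.uniform(0.1, 0.5, 10))     # per-input noise variances

# An arbitrary rotation of the 3-D factor space (orthonormal columns via QR).
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# The original model and the rotated model imply the same data covariance.
cov_original = Lambda @ Lambda.T + Psi
cov_rotated = (Lambda @ R) @ (Lambda @ R).T + Psi
print(np.allclose(cov_original, cov_rotated))   # True
```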

What structure does FA capture?
Factor analysis only captures pairwise correlations between components of the data.
– It only depends on the covariance matrix of the data.
– It completely ignores higher-order statistics.
Consider the dataset: 111, 100, 010, 001. This has no pairwise correlations but it does have strong third-order structure.
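
A small numpy check of this example: the four binary vectors have zero off-diagonal covariances, so factor analysis sees nothing to model, yet the third-order central moment is clearly non-zero.

```python
import numpy as np

X = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
Xc = X - X.mean(axis=0)              # centre each component (every mean is 0.5)

# Second-order structure: the off-diagonal covariances are all exactly zero.
print(np.round(Xc.T @ Xc / len(X), 3))

# Third-order structure: E[(x1 - m1)(x2 - m2)(x3 - m3)] = 0.125, not zero.
print((Xc[:, 0] * Xc[:, 1] * Xc[:, 2]).mean())
```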

Using a non-Gaussian prior
If the prior distributions on the factors are not Gaussian, some orientations will be better than others.
– It is better to generate the data from factor values that have high probability under the prior.
– One big value and one small value is more likely than two medium values that have the same sum of squares.
If the prior for each hidden activity is a Laplacian, p(u) proportional to exp(-|u|), the iso-probability contours are straight lines at 45 degrees.
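
A tiny numeric illustration of the point about one big value and one small value, comparing a unit Gaussian prior with a Laplacian prior; the two factor vectors below have the same sum of squares.

```python
import numpy as np

a = np.array([np.sqrt(2.0), 0.0])   # one big value and one small value
b = np.array([1.0, 1.0])            # two medium values, same sum of squares

def log_gaussian(u):
    return -0.5 * np.sum(u ** 2)    # log prior, up to a constant

def log_laplacian(u):
    return -np.sum(np.abs(u))       # log prior, up to a constant

print(log_gaussian(a), log_gaussian(b))      # -1.0 -1.0: equally likely
print(log_laplacian(a), log_laplacian(b))    # about -1.41 vs -2.0: a is more likely
# Under the Gaussian prior no orientation of the factors is preferred;
# under the Laplacian prior the sparser vector wins, so orientation matters.
```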

The square, noise-free case
We eliminate the noise model for each data component, and we use the same number of factors as data components.
Given the weight matrix, there is now a one-to-one mapping between data vectors and hidden activity vectors.
To make the data probable we want two things:
– The hidden activity vectors that correspond to data vectors should have high prior probabilities.
– The mapping from hidden activities to data vectors should compress the hidden density to get high density in the data space, i.e. the matrix that maps hidden activities to data vectors should have a small determinant. Its inverse should have a big determinant.

The ICA density model
Assume the data is obtained by linearly mixing the sources: x = A s, where A is the mixing matrix and s is the source vector.
The filter matrix W = A^-1 is the inverse of the mixing matrix.
The sources have independent non-Gaussian priors.
The density of the data is a product of the source priors and the determinant of the filter matrix: p(x) = |det W| * product_i p_i(w_i . x).
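
A numpy sketch of this density, assuming Laplacian source priors and an arbitrary 2x2 mixing matrix: the log-density of a data vector is the sum of the source log-priors evaluated at the filter outputs, plus log |det W| to account for how the mixing stretches volume.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative side: x = A s with independent non-Gaussian (Laplacian) sources.
A = rng.normal(size=(2, 2))        # mixing matrix
s = rng.laplace(size=2)            # source vector
x = A @ s

# Density side: the filter matrix is the inverse of the mixing matrix.
W = np.linalg.inv(A)

def log_density(x, W):
    """log p(x) = sum_i log p_i(w_i . x) + log|det W|, with p_i(u) = 0.5 * exp(-|u|)."""
    u = W @ x                                      # recovered sources
    log_priors = np.sum(-np.abs(u) - np.log(2.0))  # independent Laplacian log-priors
    return log_priors + np.log(np.abs(np.linalg.det(W)))

print(log_density(x, W))
```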

The information maximization view of ICA
Filter the data linearly and then apply a non-linear "squashing" function.
The aim is to maximize the information that the outputs convey about the input.
– Since the outputs are a deterministic function of the inputs, information is maximized by maximizing the entropy of the output distribution. This involves maximizing the individual entropies of the outputs and minimizing the mutual information between outputs.
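
A rough numpy sketch of this idea using the classic Bell-Sejnowski infomax rule with a logistic squashing function, written here in its natural-gradient form; the mixing matrix, learning rate, and number of steps are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: two Laplacian sources mixed linearly (stand-ins for real signals).
S = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

W = np.eye(2)      # filter (unmixing) matrix to be learned
lr = 0.01          # placeholder learning rate

for _ in range(2000):
    batch = X[:, rng.integers(0, X.shape[1], size=100)]
    U = W @ batch                          # filtered outputs
    Y = 1.0 / (1.0 + np.exp(-U))           # squashed outputs
    # Natural-gradient ascent on the entropy of the squashed outputs:
    # delta W = (I + (1 - 2y) u^T) W, averaged over the batch.
    W += lr * (np.eye(2) + (1.0 - 2.0 * Y) @ U.T / batch.shape[1]) @ W

# If it worked, W A is close to a scaled permutation of the identity,
# i.e. each output has latched onto one independent source.
print(np.round(W @ A, 2))
```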

Overcomplete ICA
What if we have more independent sources than data components? (Independent ≠ orthogonal.)
– The data no longer specifies a unique vector of source activities. It specifies a distribution. This also happens if we have sensor noise in the square case.
– The posterior over sources is non-Gaussian because the prior is non-Gaussian.
So we need to approximate the posterior:
– MCMC samples
– MAP (plus a Gaussian around the MAP?), as sketched below
– Variational
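
A minimal sketch of the MAP option, assuming a Laplacian source prior and Gaussian sensor noise, so the MAP sources minimize ||x - A s||^2 / 2 + lam * sum_i |s_i| (the noise variance is folded into lam). The iterative soft-thresholding loop below, and all its sizes and constants, are illustrative rather than taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sources, n_inputs = 8, 4                   # more independent sources than data components
A = rng.normal(size=(n_inputs, n_sources))   # overcomplete mixing matrix
x = rng.normal(size=n_inputs)                # an observed data vector
lam = 0.1                                    # strength of the Laplacian prior

# Step size from the largest eigenvalue of A^T A (a Lipschitz constant).
L = np.linalg.eigvalsh(A.T @ A).max()

s = np.zeros(n_sources)
for _ in range(500):
    s = s + A.T @ (x - A @ s) / L                          # gradient step on the Gaussian term
    s = np.sign(s) * np.maximum(np.abs(s) - lam / L, 0.0)  # soft-threshold from the Laplacian prior

print(np.round(s, 3))               # MAP estimate: a sparse source vector
print(np.linalg.norm(x - A @ s))    # residual reconstruction error
```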

Self-supervised backpropagation
Autoencoders define the desired output to be the same as the input.
– Trivial to achieve with direct connections: the identity is easy to compute!
It is useful if we can squeeze the information through some kind of bottleneck:
– If we use a linear network this is very similar to Principal Components Analysis.
[Figure: an autoencoder with 200 logistic hidden units feeding a code layer of 20 linear units, mapping the data to a code and back to a reconstruction.]
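
A PyTorch sketch of a bottleneck autoencoder like the one in the figure: 200 logistic hidden units feeding 20 linear code units, with a mirror-image decoder assumed; the input size, optimizer, and toy data are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_inputs = 784     # placeholder input dimensionality

model = nn.Sequential(
    nn.Linear(n_inputs, 200), nn.Sigmoid(),   # 200 logistic units
    nn.Linear(200, 20),                       # 20 linear code units (the bottleneck)
    nn.Linear(20, 200), nn.Sigmoid(),         # assumed mirror-image decoder
    nn.Linear(200, n_inputs),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

data = torch.rand(256, n_inputs)              # stand-in for real input vectors
for _ in range(200):
    opt.zero_grad()
    reconstruction = model(data)
    loss = nn.functional.mse_loss(reconstruction, data)   # desired output = the input
    loss.backward()
    opt.step()
print(loss.item())
```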

Self-supervised backprop and PCA
If the hidden and output layers are linear, the network will learn hidden units that are a linear function of the data and minimize the squared reconstruction error.
The m hidden units will span the same space as the first m principal components.
– Their weight vectors may not be orthogonal.
– They will tend to have equal variances.
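
A numpy check of this claim on toy data, with made-up sizes and training details: fit a purely linear autoencoder by gradient descent and measure the principal angles between the subspace spanned by its encoder weights and the span of the first m principal components; if the claim holds, they should all be close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 10, 3

# Toy data whose variance is concentrated in an m-dimensional subspace.
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, d)) + 0.05 * rng.normal(size=(n, d))
X -= X.mean(axis=0)

# Linear autoencoder: reconstruction = (X W_enc) W_dec, squared error loss.
W_enc = 0.01 * rng.normal(size=(d, m))
W_dec = 0.01 * rng.normal(size=(m, d))
lr = 0.01 / n
for _ in range(5000):
    H = X @ W_enc                     # hidden (code) activities
    err = H @ W_dec - X               # reconstruction error
    W_dec -= lr * (H.T @ err)
    W_enc -= lr * (X.T @ (err @ W_dec.T))

# First m principal components of the data, from the SVD.
pcs = np.linalg.svd(X, full_matrices=False)[2][:m].T

# Principal angles between the two m-dimensional subspaces (in degrees).
Q1 = np.linalg.qr(W_enc)[0]
Q2 = np.linalg.qr(pcs)[0]
cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
print(np.round(np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0))), 2))
```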

Self-supervised backprop in deep autoencoders
We can put extra hidden layers between the input and the bottleneck and between the bottleneck and the output.
– This gives a non-linear generalization of PCA.
It should be very good for non-linear dimensionality reduction.
– It is very hard to train with backpropagation.
– So deep autoencoders have been a big disappointment.
But we recently found a very effective method of training them which will be described next week.

A Deep Autoencoder (Ruslan Salakhutdinov)
They always looked like a really nice way to do non-linear dimensionality reduction:
– But it is very difficult to optimize deep autoencoders using backpropagation.
– We now have a much better way to optimize them.
[Figure: the deep autoencoder architecture, taking 28x28 pixel images through progressively smaller layers of neurons (including layers of 500 and 250 neurons) down to a small layer of linear code units.]
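
A PyTorch sketch of a deep autoencoder of this shape, assuming the 784-1000-500-250-30 layer sizes used in the published version of this model; note that training it well relied on the better optimization method alluded to in these slides, whereas the plain backpropagation from random weights shown below is exactly what the slide says works poorly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.Sigmoid(),
    nn.Linear(1000, 500), nn.Sigmoid(),
    nn.Linear(500, 250), nn.Sigmoid(),
    nn.Linear(250, 30),                   # 30 linear code units
)
decoder = nn.Sequential(
    nn.Linear(30, 250), nn.Sigmoid(),
    nn.Linear(250, 500), nn.Sigmoid(),
    nn.Linear(500, 1000), nn.Sigmoid(),
    nn.Linear(1000, 784), nn.Sigmoid(),   # pixel intensities in [0, 1]
)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

images = torch.rand(64, 784)              # stand-in for a batch of 28x28 digit images
for _ in range(100):
    opt.zero_grad()
    codes = encoder(images)               # 30-D codes
    loss = nn.functional.mse_loss(decoder(codes), images)
    loss.backward()
    opt.step()
print(loss.item())
```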

A comparison of methods for compressing digit images to 30 real numbers.
[Figure: rows of digit images showing the real data and reconstructions from the 30-D deep autoencoder, 30-D logistic PCA, and 30-D PCA.]

Do the 30-D codes found by the deep autoencoder preserve the class structure of the data?
Take the 30-D activity patterns in the code layer and display them in 2-D using a new form of non-linear multi-dimensional scaling (UNI-SNE).
Will the learning find the natural classes?

[Figure: the 2-D UNI-SNE map of the 30-D codes, with points colored by class; the layout is entirely unsupervised except for the colors.]