CSC2535: Computation in Neural Networks Lecture 12: Non-linear dimensionality reduction Geoffrey Hinton

Dimensionality reduction: Some Assumptions High-dimensional data often lies on or near a much lower dimensional, curved manifold. A good way to represent data points is by their low-dimensional coordinates. The low-dimensional representation of the data should capture information about high-dimensional pairwise distances.

Dimensionality reduction methods Global methods assume that all pairwise distances are of equal importance. –Choose the low-D pairwise distances to fit the high-D ones (using magnitude or rank order). Local methods assume that only local distances are reliable in high-D. –Put more weight on getting local distances right.

Linear methods of reducing dimensionality PCA finds the directions that have the most variance. –By representing where each datapoint is along these axes, we minimize the squared reconstruction error. –Linear autoencoders are equivalent to PCA. Multi-Dimensional Scaling arranges the low-dimensional points so as to minimize the discrepancy between the pairwise distances in the original space and the pairwise distances in the low-D space. MDS is equivalent to PCA (if we normalize the data right).
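The cost on this slide survives in the transcript only as its labels "high-D distance" and "low-D distance"; the standard metric-MDS cost it refers to is

C_{\text{MDS}} = \sum_{i<j} \left( d_{ij} - \hat{d}_{ij} \right)^2,

where d_{ij} is the high-D distance between points i and j and \hat{d}_{ij} is the corresponding low-D distance.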

Other non-linear methods of reducing dimensionality Multi-Dimensional Scaling can be made non-linear by putting more importance on the small distances. An extreme version is the Sammon mapping, shown below. Non-linear MDS is also slow to optimize and also gets stuck in different local optima each time.
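The Sammon cost itself is not reproduced in the transcript; its standard form down-weights errors on large distances by dividing by the high-D distance:

C_{\text{Sammon}} = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\left( d_{ij} - \hat{d}_{ij} \right)^2}{d_{ij}},

with d_{ij} and \hat{d}_{ij} the high-D and low-D distances as before.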

Linear methods cannot interpolate properly between the leftmost and rightmost images in each row. This is because the interpolated images are NOT averages of the images at the two ends. The method shown here does not interpolate properly either because it can only use examples from the training set. It cannot create new images.

IsoMap: Local MDS without local optima Instead of only modeling local distances, we can try to measure the distances along the manifold and then model these intrinsic distances. –The main problem is to find a robust way of measuring distances along the manifold. –If we can measure manifold distances, the global optimisation is easy: It’s just global MDS (i.e. PCA). (Figure: a 1-D manifold curled up in 2-D; if we measure distances along the manifold, d(1,6) > d(1,4).)

How Isomap measures intrinsic distances Connect each datapoint to its K nearest neighbors in the high-dimensional space. Put the true Euclidean distance on each of these links. Then approximate the manifold distance between any pair of points as the shortest path in this graph.
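As a rough illustration of the procedure just described, here is a minimal sketch using scikit-learn and SciPy; the function name isomap_distances and the choice k=10 are illustrative assumptions, not from the slides:

```python
# A minimal sketch of the Isomap distance computation described above,
# assuming scikit-learn and SciPy; X is an (n_points, n_dims) array.
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_distances(X, k=10):
    # Connect each datapoint to its k nearest neighbours and weight each
    # link by the true Euclidean distance in the high-dimensional space.
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # Approximate the manifold distance between any pair of points by the
    # shortest path through this graph (Dijkstra on an undirected graph).
    return shortest_path(knn, method="D", directed=False)

# The resulting matrix of approximate geodesic distances can then be embedded
# with global MDS, e.g. sklearn.manifold.MDS(dissimilarity="precomputed").
```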

Using Isomap to discover the intrinsic manifold in a set of face images

A probabilistic version of local MDS It is more important to get local distances right than non-local ones, but getting infinitesimal distances right is not infinitely important. –All the small distances are about equally important to model correctly. –Stochastic neighbor embedding has a probabilistic way of deciding if a pairwise distance is “local”.

Stochastic Neighbor Embedding: A probabilistic local method Each point in high-D has a probability of picking each other point as its neighbor. The distribution over neighbors is based on the high-D pairwise distances.
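The defining equation on this slide did not survive the transcription; in standard SNE the probability that point i picks point j as its neighbor is a Gaussian-weighted conditional probability,

p_{j|i} = \frac{\exp(-d_{ij}^2)}{\sum_{k \neq i} \exp(-d_{ik}^2)}, \qquad d_{ij}^2 = \frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2},

where \sigma_i sets the size of point i's neighborhood in the high-D space.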

Throwing away the raw data The probabilities that each point picks other points as its neighbor contain all of the information we are going to use for finding the manifold. –Once we have the probabilities we do not need to do any more computations in the high-dimensional space. –The input could be “dissimilarities” between pairs of datapoints instead of the locations of individual datapoints in a high-dimensional space.

Evaluating an arrangement of the data in a low-dimensional space Give each datapoint a location in the low-dimensional space. –Evaluate this representation by seeing how well the low-D probabilities model the high-D ones.
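The low-dimensional probabilities (again, the slide's equation is not in the transcript; this is the standard SNE form) use the same construction without a per-point variance,

q_{j|i} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert y_i - y_k \rVert^2)},

where y_i is the low-D location of point i.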

The cost function for a low-dimensional representation For points where p_ij is large and q_ij is small we lose a lot. –Nearby points in high-D really want to be nearby in low-D. For points where q_ij is large and p_ij is small we lose a little because we waste some of the probability mass in the Q_i distribution. –Widely separated points in high-D have a mild preference for being widely separated in low-D.
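The cost function itself does not appear in the transcript; in standard SNE it is the sum of the Kullback-Leibler divergences between each point's high-D and low-D neighbor distributions,

C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}},

with p_ij and q_ij on the slide standing for these conditional probabilities.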

The forces acting on the low-dimensional points Points are pulled towards each other if the p’s are bigger than the q’s and repelled if the q’s are bigger than the p’s.
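The force on each low-dimensional point is the negative gradient of this cost; the slide's equation is not in the transcript, but in standard SNE it is

\frac{\partial C}{\partial y_i} = 2 \sum_j \left( p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j} \right) (y_i - y_j),

so point i is pulled toward point j when the p's exceed the q's and pushed away otherwise.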

Unsupervised SNE embedding of the digits 0-4. Not all the data is displayed

Using SNE to look at feature vectors of words First we learn a feature vector for each word. The feature vectors are trained by trying to predict the feature vector of one word from the feature vectors of the previous two words. The training data is a few million trigrams.

A net for learning what words mean? (Andriy Mnih) The net is given two words and asked to predict the next word. It learns to represent each word as a vector of 100 real-valued features. (Diagram: the 100-real feature vectors of the two context words, e.g. “open” and “the”, feed through a layer of binary neurons to predict the 100-real feature vector of the unknown next word.) A table that converts each word to its feature vector can be learned at the same time as learning the conditional Boltzmann machine.

Symmetric SNE In general, the probability of i picking j is not equal to the probability of j picking i. But we could symmetrize to get probabilities of picking pairs of datapoints.

Computing affinities between datapoints Each high-dimensional point, i, has a conditional probability of picking each other point, j, as its neighbor: the probability of picking j given that you start at i. The conditional distribution over neighbors is based on the high-dimensional pairwise distances.

Turning conditional probabilities into pairwise probabilities To get a symmetric affinity between i and j we sum the two conditional probabilities and divide by the number of points (points are not allowed to choose themselves). This ensures that all the pairwise affinities sum to 1 so they can be treated as probabilities: the joint probability of picking the pair i,j.
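Written out (the equation on the slide is not in the transcript), the standard symmetrization over n datapoints is

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n},

which indeed sums to 1 over all pairs.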

Evaluating an arrangement of the data in a low-dimensional space Give each data-point a location in the low-dimensional space. –Define low-dimensional probabilities symmetrically. –Evaluate the representation by seeing how well the low-D probabilities model the high-D affinities.
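The symmetric low-dimensional probabilities (again not reproduced in the transcript; this is the standard form) are normalized over all pairs at once:

q_{ij} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq l} \exp(-\lVert y_k - y_l \rVert^2)}.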

The cost function for a low-dimensional representation For points where p_ij is large and q_ij is small we lose a lot. –Nearby points in high-D really want to be nearby in low-D. For points where q_ij is large and p_ij is small we lose a little because we waste some of the probability mass in the Q distribution. –Widely separated points in high-D have a mild preference for not being too close in low-D. –But it doesn’t cost much to make the manifold bend back on itself.

The forces acting on the low-dimensional points Points are pulled towards each other if the p’s are bigger than the q’s and repelled if the q’s are bigger than the p’s. –It’s equivalent to having springs whose stiffnesses are set dynamically.
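The spring picture corresponds to the gradient of the symmetric cost C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log (p_{ij}/q_{ij}); the slide's labels "extension" and "stiffness" refer to the two factors in the standard symmetric-SNE gradient,

\frac{\partial C}{\partial y_i} = 4 \sum_j \underbrace{(p_{ij} - q_{ij})}_{\text{stiffness}} \, \underbrace{(y_i - y_j)}_{\text{extension}},

so the stiffness of the spring between points i and j is set dynamically by the mismatch between the high-D and low-D pair probabilities.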

Evaluating the codes found by a deep autoencoder Use 3000 images of handwritten digits from the USPS training set. –Each image is 16x16 and fairly binary. Use a highly non-linear autoencoder. –Use logistic output units and linear code units. (Diagram: data → 200 logistic units → 100 logistic units → 20 linear code units, decoded back to a reconstruction.)
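A minimal PyTorch sketch of this architecture is given below; only the layer sizes and unit types come from the slide, while the mirrored decoder and the cross-entropy reconstruction loss are illustrative assumptions:

```python
# Sketch of the 256-200-100-20 autoencoder described on the slide, assuming
# PyTorch. Layer sizes and unit types are from the slide; the decoder shape
# and the loss are assumptions.
import torch.nn as nn

encoder = nn.Sequential(
    nn.Linear(256, 200), nn.Sigmoid(),   # 200 logistic units
    nn.Linear(200, 100), nn.Sigmoid(),   # 100 logistic units
    nn.Linear(100, 20),                  # 20 linear code units
)
decoder = nn.Sequential(
    nn.Linear(20, 100), nn.Sigmoid(),
    nn.Linear(100, 200), nn.Sigmoid(),
    nn.Linear(200, 256), nn.Sigmoid(),   # logistic output units
)

def reconstruction_loss(x):
    # x: a batch of flattened 16x16 images with values in [0, 1].
    code = encoder(x)
    x_hat = decoder(code)
    return nn.functional.binary_cross_entropy(x_hat, x)
```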

Does code space capture the structure of the data? We would like the code space to model the underlying structure of the data. –Digits in the same class should get closer together in code space. –Digits of different classes should get further apart. We can use k nearest neighbors to see if this happens. –Hold out each image in turn and try to label it by using the labels of its k nearest neighbors.
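A minimal scikit-learn sketch of this hold-one-out check (the function name and the choice k=5 are assumptions; the slide only says "k nearest neighbors"):

```python
# Hold-one-out k-nearest-neighbour error of the codes, assuming scikit-learn.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def knn_holdout_error(codes, labels, k=5):
    # Hold out each image in turn and label it from its k nearest neighbours
    # in code space; return the fraction of images that get the wrong label.
    clf = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(clf, codes, labels, cv=LeaveOneOut())
    return 1.0 - accuracy.mean()
```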

A potential problem PCA is not powerful enough to really mangle the data. Highly non-linear auto-encoders can fracture a manifold into many different domains. –This can lead to very different codes for nearby data-points. (Diagram: nearby points A, B, C on the manifold can end up with codes in a scrambled order, e.g. A, C, B.)

How to fix it We use a regularizer that makes it costly to fracture the manifold. –There are many possible regularizers. Stochastic neighbor embedding can be used as a regularizer. –It’s like putting springs between the codes to prevent the codes for similar datapoints from being too far apart.

How the gradients are combined (Diagram: the same data → 200 logistic units → 100 logistic units → 20 linear code units → reconstruction network as before.) The gradient at the code layer combines two terms: the back-propagated derivatives of the reconstruction error, and the forces generated by springs attaching this code to the codes for all the other data-points. The stiffness of each spring is set dynamically.
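The exact stiffness expression on the slide is not in the transcript; if the regularizer is symmetric SNE with weight \lambda applied to the codes y_i, the combined gradient at a code would be

\frac{\partial}{\partial y_i} \left( E_{\text{recon}} + \lambda C_{\text{SNE}} \right) = \frac{\partial E_{\text{recon}}}{\partial y_i} + 4\lambda \sum_j (p_{ij} - q_{ij})(y_i - y_j),

so each spring's stiffness would be proportional to (p_{ij} - q_{ij}), scaled by the regularizer weight \lambda (this reading follows from the symmetric SNE gradient above and is not the slide's own equation).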

How well does it work? (Hinton, Min, Salakhutdinov) The hold-one-out KNN error of an autoencoder falls if we combine it with SNE. –SNE makes the search easier for a deep autoencoder. –The strength of the regularizer must be chosen sensibly. –The SNE regularizer alone gives higher hold-one-out KNN errors than a well-trained autoencoder. Can we visualize the codes that are produced using the regularizer?

Using SNE for visualization To get an idea of the similarities between codes, we can use SNE to map the 20-D codes down to 2-D. –The combination of the generative model of the auto-encoder and the manifold-preserving regularizer causes the codes for different classes of digit to be quite well separated.