Presentation on theme: "CSC2535: 2011 Lecture 5b Object Recognition and Information Retrieval with Deep Belief Nets Geoffrey Hinton."— Presentation transcript:
CSC2535: 2011 Lecture 5b Object Recognition and Information Retrieval with Deep Belief Nets Geoffrey Hinton
Training a deep network First train a layer of features that receive input directly from the binary pixels. Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer. It can be proved that each time we add another layer of features we get a better model of the set of training images (if we do it right). –The proof uses variational free energy. –The proof doesn’t apply if the layers get smaller. Also, it requires the weights in each RBM to be initialized properly and to be trained by properly following the gradient of the likelihood.
Modeling real-valued data For images of digits it is possible to represent intermediate intensities as if they were probabilities by using “mean-field” logistic units. –We can treat intermediate values as the probability that the pixel is inked. This will not work for real images. –In a real image, the intensity of a pixel is almost always almost exactly the average of the neighboring pixels. –Mean-field logistic units cannot represent precise intermediate values.
A standard type of real-valued visible unit We can model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower. E energy-gradient produced by the total input to a visible unit parabolic containment function Welling et. al. (2005) show how to extend RBM’s to the exponential family. See also Bengio et. al. (2007)
The two conditional distributions for a Gaussian-Bernoulli RBM
A random sample of 10,000 binary filters learned by Alex Krizhevsky on a million 32x32 color images.
3D Object Recognition: The NORB dataset Stereo-pairs of grayscale images of toy objects. - 6 lighting conditions, 162 viewpoints -Five object instances per class in the training set - A different set of five instances per class in the test set - 24,300 training cases, 24,300 test cases Animals Humans Planes Trucks Cars Normalized- uniform version of NORB
NORB Pre-processing Each image in a stereo pair is 96x96, so a training case is 18432-dimensional. We reduce the dimensionality to 8976 by blurring the borders of an image. Only the central 64x64 region has full resolution. We also make the data zero-mean and scale each dimension to have unit-variance. We then use an RBM with Gaussian visible units to learn a layer of 4000 binary features from the unlabelled data. These features are not fine-tuned.
The model contains ~116 million parameters and is trained with only 24,300 labeled images. 4000 binary units for class 2 8976 Gaussian units 4000 binary units 4000 binary units for class 1 4000 binary units for class 3 or Five competing generative models Each class- specific model is trained generatively on data from its own class All five models are also trained discriminatively to make the right model have the lowest free energy.
Results (Nair & Hinton, NIPS 2009) Support Vector Machines11.6% Convolutional Neural Networks (built in geometric knowledge) 6.0% Single top-level RBM with label variable generative + discriminative training 10.4% Competing top-level RBMs: generative + discriminative training 6.5% Competing top-level RBMs: generative + discriminative training plus extra unlabeled translated data 5.3% Error rate
An embarrassing problem The front end was a Restricted Boltzmann Machine with Gaussian visible units and binary hidden units. We tried learning the residual noise model for each Gaussian visible unit and it did not work. –So we just fixed the residual noise variance to be 1, which is much too big.
Why its hard to learn an appropriate noise level for each visible unit Suppose the model of the residual noise in a visible unit has a standard deviation of 0.1. The bottom-up weight is effectively multiplied by 10 and the top-down weight is effectively divided by 10. –The resulting bottom-up effects are much too strong. They saturate the hidden units. –The resulting top-down effects are much too weak. They cannot move the visible units far enough away from their mean. hidden i j visible
How to implement an integer-valued variable (Teh and Hinton, 2001) One way to model an integer-valued variable is to make N identical copies of a Bernoulli unit. All copies have the same probability, of being “on”. –The total number of “on” copies (the “firing rate”) has a binomial distribution with variance N p(1-p) –Provided p is small, the total number of “on” copies is Poisson-distributed with variance approximately N p It is very hard to stay in the Poisson regime because the “firing rate” grows exponentially with the input to the unit. –So learning is very unstable.
How to fix the problem with an infinite model Make infinitely many copies of each binary hidden unit. All copies have the same weights and the same adaptive bias, b, but they have different fixed offsets to the bias: Now if the bottom-up effects become 10 times stronger, 10 times as many hidden units will turn on producing a top-down effect that is 10 times as big.
A fast approximation Contrastive divergence learning works well for the sum of binary units with offset biases. It also works for rectified linear units. These are much faster to compute than the sum of many logistic units. output = max(0, x + randn*sqrt(logistic(x)) )
A harder version of the NORB dataset The objects are jittered in position, size, and intensity. A distractor object is placed off-center in the image A cluttered background is added to the image (but its at a different depth so stereo can be quite effective at ignoring it.) To make experiments quicker the 96x96 images were reduced to 32x32 images in the simplest possible way.
Results on jittered-cluttered NORB (Nair & Hinton, ICML 2010) Multinomial logistic regression on the pixels 49.9% Support Vector Machines43.3% Convolutional Neural Networks (built in geometric knowledge) 7.2% Multinomial logistic regression on second hidden layer using binary units 18.8% Multinomial logistic regression on second hidden layer using RELU units 15.2% Error rate
Equivariance Simple convolutional neural nets give translational equivariance, not translational invariance. A small amount of translational invariance is then achieved at each layer by using local averaging or maxing. representation image translated representation translated image
Intensity equivariance Consider a net with multiple layers of rectified linear units that all have a bias of zero. –Ignore the sampling noise for now. If we scale all the intensities in the input image by S, exactly the same set of hidden units will be above threshold in every hidden layer, but all of their activities will be scaled up by S. –Its a linear model. –If the ratios of intensities do not change, it will not switch to a different linear model. –So we have a very non-linear model (globally) that respects the linear property
Intensity invariance Equivariance ensures that if we can represent an image well, we can also represent it just as well when all the intensities are scaled up. Given equivariance to intensity as the basic property of the model, we can then achieve invariance by a little bit of local normalization at each layer. –Local divisive normalization does for intensity what local averaging or maxing does for translation in a convolutional net.
Test examples from the CIFAR-10 dataset plane car bird cat deer dog frog horse ship truck
The CIFAR-10 labeled subset of the MIT tiny images dataset There are 5000 32x32 carefully labelled training images and 1000 32x32 carefully labelled testing images for each of 10 different classes. (so its very like MNIST!) There are 80 million very badly labelled images that are found by trawling the web using about 50,000 search words or phrases taken from “Wordnet”. –Less than 10% of these images have a dominant object that fits the search term. –For example, about 90% of images found using “cat” do not contain a cat! The badly labelled images are very useful for unsupervised pre-training.
Some ways to make convolutional neural nets work really well (Alex Krizhevsky) Use RELU units instead of binary units in the first hidden layer. Use overlapping pools for “sub-sampling” –This retains more information about the positions of features, but codes this information in units that have large receptive fields. In addition to the usual convolutional units that have local receptive fields with weight-sharing, use a separate set of global units. –The global units learn to take care of edge effects that would otherwise mess up the local units.
Local filters learned by the 64 convolutional kernels in the first hidden layer of RELUs After unsupervised pre-training After supervised fine-tuning
Results of Alex Krizhevsky’s experiments Pre-train DBN with 2 hidden layers of fully connected binary units. Add softmax layer and fine-tune with backprop ~35% Convolutional neural net (2 hidden layers with pooling) ~29% Convolutional RELU net + global filters (1 hidden layer but no pooling) ~29% Convolutional RELU net + global filters (1 hidden layer with overlapping pooling) ~25% Convolutional RELU net + global filters (2 hidden layers with overlapping pooling) ~21% Error rate
Mixtures of linear models Mixtures of linear models are easy to fit and work surprisingly well on many simple problems. Given a set of images of digits that have been approximately normalized for size and position, we can get quite a good model by using a mixture of factor analysers. –Each factor analyser models one class or subclass. –The factors typically model small local translations of parts of the digit. This type of model looks very different from a multilayer neural net.
Some problems with mixtures of linear models For data with many different things going on at once we may need exponentially many components in the mixture –Consider modeling pairs of digits. If there are too many components in the mixture they lose statistical power. –Each linear model only gets trained on a very small fraction of the data. –A mixture cannot benefit from the fact that the top of a 2 is like the top of a 3. Its hard to stitch together the latent representations of different factor analysers because they all use different coordinate systems. –This makes it hard to associate actions with the latent representations.
An old idea for learning a mixture of exponentially many linear models (Sallans, Hinton & Ghahramani, 1997) First generate a hierarchical binary state by running a sigmoid belief net. Each unit in the SBN can switch in a corresponding linear unit. –So the SBM chooses which factors to use in a hierarchical linear model. We get combinatorially many linear models that share factors –but inference is hard.
A better version of the old idea Combine the binary switch and the linear variable into a single rectified linear unit. – Rectification is equivalent to the rule: If the total input is negative, exclude that unit from the linear model. –This has the nice property that there is no discontinuity at the point at which the unit is switched in, because its value is 0. Use an undirected model with only one hidden layer so that inference is easy. –We can infer the value of a hidden unit without knowing which other units are switched in.
Overview An efficient way to train a multilayer neural network to extract a low-dimensional representation. Document retrieval ( published work with Russ Salakhutdinov) –How to model a bag of words with an RBM –How to learn binary codes –Semantic hashing: retrieval in no time Image retrieval (unpublished work with Alex Krizhevsky) –How good are 256-bit codes for retrieval of small color images? –Ways to use the speed of semantic hashing for higher-quality image retrieval (work in progress).
Deep Autoencoders (with Ruslan Salakhutdinov) They always looked like a really nice way to do non-linear dimensionality reduction: –But it is very difficult to optimize deep autoencoders using backpropagation. We now have a much better way to optimize them: –First train a stack of 4 RBM’s –Then “unroll” them. –Then fine-tune with backprop. 1000 neurons 500 neurons 250 neurons 30 1000 neurons 28x28
A comparison of methods for compressing digit images to 30 real numbers. real data 30-D deep auto 30-D logistic PCA 30-D PCA
Compressing a document count vector to 2 numbers We train the autoencoder to reproduce its input vector as its output This forces it to compress as much information as possible into the 2 real numbers in the central bottleneck. These 2 numbers are then a good way to visualize documents. 2000 reconstructed counts 500 neurons 2000 word counts 500 neurons 250 neurons 2 output vector We need a special type of RBM to model counts real-valued units
First compress all documents to 2 numbers using a type of PCA Then use different colors for different document categories Yuk!
First compress all documents to 2 numbers. Then use different colors for different document categories
The replicated softmax model: How to modify an RBM to model word count vectors Modification 1: Keep the binary hidden units but use “softmax” visible units that represent 1-of-N Modification 2: Make each hidden unit use the same weights for all the visible softmax units. All the models in this family have 5 hidden units. This model is for 8-word documents.
The replicated softmax model: How to modify an RBM to model word count vectors Modification 3: Use as many softmax visible units as there are non-stop words in the document. –So its actually a family of different-sized RBMs that share weights. It not a single generative model. Modification 4: Multiply each hidden bias by the number of words in the document The replicated softmax model gives much higher probability to held out bags of words than LDA topic models (in NIPS 2009)
The relative negative log probs assigned to bags of words by LDA and replicated softmax RBM
Finding real-valued codes for retrieval Train an auto-encoder using 10 real-valued units in the code layer. Compare with Latent Semantic Analysis that uses PCA on the transformed count vector Non-linear codes are much better. 2000 reconstructed counts 500 neurons 2000 word counts 500 neurons 250 neurons 10
Retrieval performance on 400,000 Reuters business news stories
Finding binary codes for documents Train an auto-encoder using 30 logistic units for the code layer. During the fine-tuning stage, add noise to the inputs to the code units. –The “noise” vector for each training case is fixed. So we still get a deterministic gradient. –The noise forces their activities to become bimodal in order to resist the effects of the noise. –Then we simply threshold the activities of the 30 code units to get a binary code. 2000 reconstructed counts 500 neurons 2000 word counts 500 neurons 250 neurons 30 noise
Using a deep autoencoder as a hash-function for finding approximate matches hash function “supermarket search”
Another view of semantic hashing Fast retrieval methods typically work by intersecting stored lists that are associated with cues extracted from the query. Computers have special hardware that can intersect 32 very long lists in one instruction. –Each bit in a 32-bit binary code specifies a list of half the addresses in the memory. Semantic hashing uses machine learning to map the retrieval problem onto the type of list intersection the computer is good at.
How good is a shortlist found this way? Russ has only implemented it for a million documents with 20-bit codes --- but what could possibly go wrong? –A 20-D hypercube allows us to capture enough of the similarity structure of our document set. The shortlist found using binary codes actually improves the precision-recall curves of TF-IDF. –Locality sensitive hashing (the fastest other method) is much slower and has worse precision-recall curves.
Semantic hashing for image retrieval Currently, image retrieval is typically done by using the captions. Why not use the images too? –Pixels are not like words: individual pixels do not tell us much about the content. –Extracting object classes from images is hard. Maybe we should extract a real-valued vector that has information about the content? –Matching real-valued vectors in a big database is slow and requires a lot of storage Short binary codes are easy to store and match
A two-stage method First, use semantic hashing with 28-bit binary codes to get a long “shortlist” of promising images. Then use 256-bit binary codes to do a serial search for good matches. –This only requires a few words of storage per image and the serial search can be done using fast bit-operations. But how good are the 256-bit binary codes? –Do they find images that we think are similar?
Some depressing competition Inspired by the speed of semantic hashing, Weiss, Fergus and Torralba (NIPS 2008) used a very fast spectral method to assign binary codes to images. –This eliminates the long learning times required by deep autoencoders. They claimed that their spectral method gave better retrieval results than training a deep auto-encoder using RBM’s. –But they could not get RBM’s to work well for extracting features from RGB pixels so they started from 384 GIST features. –This is too much dimensionality reduction too soon.
A comparison of deep auto-encoders and the spectral method using 256-bit codes (Alex Krizhevsky) Train auto-encoders “properly” –Use Gaussian visible units with fixed variance. Do not add noise to the reconstructions. –Use a cluster machine or a big GPU board. –Use a lot of hidden units in the early layers. Then compare with the spectral method –The spectral method has no free parameters. Also compare with Euclidean match in pixel space
Krizhevsky’s deep autoencoder 1024 8192 4096 2048 1024 512 256-bit binary code The encoder has about 67,000,000 parameters. It takes a few GTX 285 GPU days to train on two million images. There is no theory to justify this architecture
An obvious extension Use a multimedia auto-encoder that represents captions and images in a single code. –The captions should help it extract more meaningful image features such as “contains an animal” or “indoor image” RBM’s already work much better than standard LDA topic models for modeling bags of words. –So the multimedia auto-encoder should be a win (for images) a win (for captions) a win (for the interaction during training)
A less obvious extension Semantic hashing gives incredibly fast retrieval but its hard to go much beyond 32 bits. We can afford to use semantic hashing several times with variations of the query and merge the shortlists –Its easy to enumerate the hamming ball around a query image address in ascending address order, so merging is linear time. Apply many transformations to the query image to get transformation independent retrieval. –Image translations are an obvious candidate.
A more interesting extension Computer vision uses images of uniform resolution. –Multi-resolution images still keep all the high- resolution pixels. Even on 32x32 images, people use a lot of eye movements to attend to different parts of the image. –Human vision copes with big translations by moving the fixation point. –It only samples a tiny fraction of the image at high resolution. The “post-retinal’’ image has resolution that falls off rapidly outside the fovea. –With less “neurons” intelligent sampling becomes even more important.
How we extract multiple images with fewer pixels per image This variable resolution “retina” is applied at 81 different locations in the image. Each of the 81 “post- retinal” images is used for semantic hashing. –We also use the whole image.
A more human metric for image similarity Two images are similar if fixating at point X in one image and point Y in the other image gives similar post-retinal images. So use semantic hashing on post-retinal images. –The address space is used for post-retinal images and each address points to the whole image that the post-retinal image came from. –So we can accumulate similarity over multiple fixations. The whole image addresses found after each fixation have to be sorted to allow merging
Summary Restricted Boltzmann Machines provide an efficient way to learn a layer of features without any supervision. –Many layers of representation can be learned by treating the hidden states of one RBM as the data for the next. This allows us to learn very deep nets that extract short binary codes for unlabeled images or documents. –Using 32-bit codes as addresses allows us to get approximate matches at the speed of hashing. Semantic hashing is fast enough to allow many retrieval cycles for a single query image. –So we can try multiple transformations of the query.