1
Building High-Level Features Using Large Scale Unsupervised Learning
Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng
Presenting: Din Malachi, Tomer Gafni
2
OUTLINE
Introduction
Architecture
Learning and Optimization
Results
Conclusions
3
Introduction The problem: building high-level, class-specific feature detectors from only unlabeled data. For example: is it possible to learn a face detector using only unlabeled images? Inspired by the neuroscientific conjecture that highly class-specific neurons exist in the human brain ("grandmother neurons"). The need for large labeled sets poses a significant challenge for problems where labeled data are rare.
4
Three Kinds of Learning
5
High Level and Low Level Features
Low-level features: minor details of the image, such as lines or dots, that can be picked up by, say, a convolutional filter. High-level features: built on top of low-level features to detect objects and larger shapes in the image.
6
High Level and Low Level Features
Typically, the first couple of convolutional layers learn filters for finding lines, dots, curves, etc., while the later layers learn to recognize common objects and shapes.
7
From Recent Works to the Presented Work
Contemporary computer vision methodology typically emphasizes the role of labeled data in obtaining high-level features (e.g., bounding boxes around faces). Approaches that make use of unlabeled data have worked well for building low-level features, but poorly for building high-level features.
8
From Recent Works to the Presented Work
In this work, we address the problem by scaling up the core components involved in training deep networks: Dataset: 200x200 images sampled from 10 million YouTube videos. Model: a deep autoencoder. Computational resources: 1,000 machines (16,000 cores). "Google Builds a Brain that Can Search for Cat Videos", Time, June 2012; "How Many Computers to Identify a Cat? 16,000", NYT, June 2012.
9
AutoEncoder An autoencoder neural network is an unsupervised learning algorithm. The autoencoder tries to learn (an approximation of) the identity function, $h(x) \approx x$. The main goal is to reduce the dimensionality of the problem.
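A minimal sketch of this idea (hypothetical numpy code; the layer sizes, random data, and squared-error loss are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Minimal autoencoder sketch: encode x to a lower-dimensional code h,
# decode back to a reconstruction x_hat, and compare with the input.
rng = np.random.default_rng(0)

n_input, n_hidden = 64, 16                      # illustrative sizes
W1 = rng.normal(0, 0.1, (n_hidden, n_input))    # encoder weights
W2 = rng.normal(0, 0.1, (n_input, n_hidden))    # decoder weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(W1 @ x)    # low-dimensional code
    x_hat = W2 @ h         # reconstruction of the input
    return h, x_hat

x = rng.normal(size=n_input)
h, x_hat = forward(x)
loss = np.mean((x_hat - x) ** 2)   # reconstruction error, i.e. h(x) ≈ x
print(f"code dim: {h.shape[0]}, reconstruction MSE: {loss:.4f}")
```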
10
Restricted Boltzmann Machines (RBM)
Invented by Geoff Hinton. Only two layers (visible and hidden). Fully connected between the layers, with no connections within a layer. Sigmoid activation function. Hinton, Geoffrey E. "A practical guide to training restricted Boltzmann machines." Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012.
11
Restricted Boltzmann Machines (RBM)
Reconstructions: the hidden activations become the inputs, the same weights are used, and the reconstructions are the outputs.
Reconstruction does something different from regression, which estimates a continuous value based on many inputs, and different from classification, which guesses which discrete label to apply to a given input example. Reconstruction makes guesses about the probability distribution of the original input.
Two biases: the hidden bias helps the RBM produce the activations on the forward pass (since biases impose a floor so that at least some nodes fire no matter how sparse the data), while the visible layer's biases help the RBM learn the reconstructions on the backward pass.
The question the RBM asks itself on the forward pass is: given these pixels, should my weights send a stronger signal to the elephant node or the dog node? The question it asks on the backward pass is: given an elephant, which distribution of pixels should I expect?
Hinton, Geoffrey E. "A practical guide to training restricted Boltzmann machines." Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012.
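A rough sketch of one step of contrastive divergence (CD-1), the standard way such an RBM is trained. This is background for these slides only, not the paper's training procedure (the paper trains a sparse autoencoder); sizes, learning rate, and the dummy batch are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 784, 128, 0.01   # illustrative sizes / learning rate

W   = rng.normal(0, 0.01, (n_visible, n_hidden))  # same weights used both ways
b_v = np.zeros(n_visible)                         # visible bias (backward pass)
b_h = np.zeros(n_hidden)                          # hidden bias (forward pass)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0):
    """One CD-1 update from a batch of visible vectors v0 (batch, n_visible)."""
    # Forward pass: hidden activation probabilities given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden states
    # Backward pass: reconstruct the visible units with the same weights
    p_v1 = sigmoid(h0 @ W.T + b_v)
    # Forward again from the reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Contrastive divergence: data statistics minus reconstruction statistics
    dW = (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    return dW, (v0 - p_v1).mean(0), (p_h0 - p_h1).mean(0)

v0 = (rng.random((32, n_visible)) < 0.5).astype(float)   # dummy binary batch
dW, db_v, db_h = cd1_step(v0)
W += lr * dW; b_v += lr * db_v; b_h += lr * db_h
```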
12
Multiple Layers of RBMs
First, train between the visible layer and the first hidden layer. Then hidden layer 1 becomes the new 'visible' layer, and so on. This greedy, layer-wise procedure is known as Deep Belief Network training.
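A sketch of this layer-wise stacking (hypothetical; train_rbm is a placeholder standing in for real RBM training such as the CD-1 step above, and the layer sizes are illustrative):

```python
import numpy as np

# Greedy layer-wise stacking: each RBM is trained on the hidden activations
# of the one below it, so hidden layer k becomes the "visible" layer of RBM k+1.
rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden):
    """Placeholder for RBM training (e.g., repeated CD-1 updates);
    here it just returns randomly initialized weights of the right shape."""
    W = rng.normal(0, 0.01, (data.shape[1], n_hidden))
    b_h = np.zeros(n_hidden)
    return W, b_h

layer_sizes = [784, 256, 64]             # visible, hidden1, hidden2 (illustrative)
data = rng.random((100, layer_sizes[0]))

stack = []
for n_hidden in layer_sizes[1:]:
    W, b_h = train_rbm(data, n_hidden)   # train this layer's RBM
    stack.append((W, b_h))
    data = sigmoid(data @ W + b_h)       # its activations feed the next RBM
```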
13
Back to the AutoEncoder: unrolling multiple layers of RBMs.
Hierarchy of representations with an increasing level of abstraction. Each module transforms its input representation into a higher-level one.
14
Architecture of the Presented Paper
Sparse autoencoder with three important ingredients: local receptive fields, pooling, and local contrast normalization (LCN).
15
First sublayer – Local Receptive Fields
18x18-pixel receptive field (RF) windows. 8 feature maps (channels). Each neuron connects to all input channels.
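A sketch of such a locally connected first sublayer (hypothetical numpy code): unlike a convolution, every output location has its own filter. The 18x18 receptive fields, 8 feature maps, and 200x200 input follow the slides; the single input channel and the stride are simplifying assumptions.

```python
import numpy as np

# Locally connected (non-shared-weight) filtering: one independent 18x18
# filter per (feature map, output row, output column).
rng = np.random.default_rng(0)

H = W_img = 200                      # input frame size (200x200, per the paper)
rf, stride, n_maps = 18, 10, 8       # stride chosen here only for brevity
out = (H - rf) // stride + 1         # output grid size per feature map

image = rng.random((H, W_img))       # single channel for simplicity
filters = rng.normal(0, 0.01, (n_maps, out, out, rf, rf))

features = np.empty((n_maps, out, out))
for m in range(n_maps):
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+rf, j*stride:j*stride+rf]
            features[m, i, j] = np.sum(filters[m, i, j] * patch)
```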
16
Second Sublayer - Pooling
L2 pooling: taking the square root of the sum of the squares of the activations. 5x5 overlapping windows. Pooling within a single feature map.
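A short sketch of L2 pooling over one feature map (the stride of 1 is an assumption that makes the 5x5 windows overlap):

```python
import numpy as np

# L2 pooling: for each 5x5 window of a single feature map, output the
# square root of the sum of squared activations.
def l2_pool(feature_map, window=5, stride=1):
    h, w = feature_map.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    pooled = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i*stride:i*stride+window,
                                j*stride:j*stride+window]
            pooled[i, j] = np.sqrt(np.sum(patch ** 2))
    return pooled

fmap = np.random.default_rng(0).random((19, 19))   # e.g. one 19x19 feature map
print(l2_pool(fmap).shape)                          # -> (15, 15) with stride 1
```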
17
Third Sublayer - Local Contrast Normalization
5x5 overlapping windows. Connects to all input channels. Relatively dominant activations are preferred over uniformly high activations across all features. This enforces a sort of local competition between adjacent features, and between features at the same spatial location in different feature maps.
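A rough sketch of the idea behind LCN: subtract the local mean and divide by the local standard deviation, where "local" spans a small spatial window across all feature maps. This is hypothetical code; the unweighted window, the edge padding, and the epsilon floor are simplifications of the paper's exact normalization.

```python
import numpy as np

def local_contrast_normalize(features, window=5, eps=1e-2):
    """LCN sketch over a stack of feature maps (n_maps, h, w): for every
    spatial location, center by the mean and divide by the std of a
    window x window neighborhood pooled across ALL feature maps."""
    n_maps, h, w = features.shape
    half = window // 2
    # pad spatially so every location has a full neighborhood
    padded = np.pad(features, ((0, 0), (half, half), (half, half)), mode="edge")
    out = np.empty_like(features)
    for i in range(h):
        for j in range(w):
            neigh = padded[:, i:i+window, j:j+window]   # all maps, local window
            centered = features[:, i, j] - neigh.mean()
            out[:, i, j] = centered / max(neigh.std(), eps)
    return out

fmaps = np.random.default_rng(0).random((8, 15, 15))
normed = local_contrast_normalize(fmaps)
```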
18
One Layer Summary
19
Learning and Optimization
During learning, the encoding weights W1 and decoding weights W2 of the first sublayers are adjusted by solving the optimization problem written out below.
Global reconstruction cost: ensures the representations encode important information about the data, i.e., that they can reconstruct the input data.
Group sparsity / spatial pooling cost: computed on the outputs of the second sublayer; a lower sum of activations is preferred, which encourages pooling to group similar features together and thereby achieve invariances.
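A reconstruction of that objective in the paper's notation (here $x^{(i)}$ are the training examples, $W_1$ and $W_2$ the encoding and decoding weights, $H_j$ the fixed weights of the pooling sublayer, $\lambda$ the tradeoff parameter, and $\epsilon$ a small constant):

```latex
\min_{W_1, W_2} \; \sum_{i=1}^{m} \Bigg(
  \underbrace{\left\lVert W_2 W_1^{\top} x^{(i)} - x^{(i)} \right\rVert_2^2}_{\text{global reconstruction cost}}
  \; + \; \lambda \underbrace{\sum_{j=1}^{k} \sqrt{\epsilon + H_j \left(W_1^{\top} x^{(i)}\right)^2}}_{\text{group sparsity / pooling cost}}
\Bigg)
```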
20
Training the Network
21
Experiments Analysis of the learned representation in recognizing faces ("the face detector"). The test set consists of 37,000 images, of which 13,026 are labeled as faces and the rest are distractors. After training, we use this set to measure the performance of each neuron in classifying faces against distractors.
22
Results The best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training (81.7% accuracy in detecting faces). When the LCN sublayers are removed, the accuracy of the best neuron drops to 78.5%. Histogram of activation values for face images (red) and random images (blue): even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors.
23
Visualization Is the optimal stimulus of the neuron really a face?
First method: visualizing the most responsive stimuli in the test set. Second method: performing numerical optimization to find the optimal stimulus (the problem is written out below), where f(x; W, H) is the output of the tested neuron given the learned parameters W, H and input x.
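The constrained problem solved in this second method, reconstructed from the paper, is:

```latex
x^{*} \;=\; \arg\max_{x} \; f(x; W, H)
\qquad \text{subject to} \quad \lVert x \rVert_2 = 1
```

Intuitively, the norm constraint keeps the optimizer from raising the activation simply by scaling the input up.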
24
Visualization Top 48 stimuli of the best neuron from the test set; the optimal stimulus according to numerical constrained optimization.
25
Invariance Properties The face detector is robust against common object transformations: translation, scaling, and out-of-plane rotation.
26
Cat and Human Body Detector
Is the network able to detect other high-level concepts? We construct two datasets, one for classifying human bodies and one for classifying cat faces. The high-level detectors also outperform standard baselines in terms of recognition rates, achieving 74.8% on cat faces and 76.7% on human bodies.
27
Summary of Numerical Comparisons
28
Conclusions In this work we simulated high-level, class-specific neurons using unlabeled data. The work shows that it is possible to train neurons to be selective for high-level concepts (human faces, human bodies, cat faces) using entirely unlabeled data. These neurons naturally capture complex invariances such as out-of-plane rotation and scale invariance.