Large Image Databases and Small Codes for Object Recognition Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT)


Object Recognition: Pixels → Description of scene contents [slide image: Banksy, 2006]

Large image datasets: the Internet contains billions of images. An amazing resource; maybe we can use it for recognition?

Web image dataset: 79.3 million images, collected using image search engines; list of nouns taken from Wordnet; 32x32 resolution. First 10 nouns: a-bomb, a-horizon, a._conan_doyle, a._e._burnside, a._e._housman, a._e._kennelly, a.e., a_battery, a_cappella_singing, a_horizon. See the "80 Million Tiny Images" tech report.

Noisy Output from Image Search Engines

Nearest-Neighbor methods for recognition [plot: performance as a function of # images]

Issues:
- Need lots of data
- How to find neighbors quickly?
- What distance metric to use?
- How to transfer labels from neighbors to the query image?
- What happens if labels are unreliable?

Overview:
1. Fast retrieval using compact codes
2. Recognition using neighbors with unreliable labels

Binary codes for images:
- Want images with similar content to have similar binary codes
- Use Hamming distance between codes (the number of differing bits)
- E.g. Semantic Hashing [Salakhutdinov & Hinton, 2007], applied to text documents
[slide example: two code pairs with Ham_Dist = 1 and Ham_Dist = 3]
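As a quick illustration (not from the original slides), Hamming distance between codes stored as integers is a single XOR plus a popcount; a minimal Python sketch:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions where the two codes differ."""
    return bin(code_a ^ code_b).count("1")

# Codes differing in exactly one bit have distance 1
assert hamming_distance(0b10001010, 0b10001110) == 1
```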

Binary codes permit fast lookup via hashing:
- Each code is a memory address
- Find neighbors by exploring the Hamming ball around the query address
- Lookup time depends on the radius of the ball, NOT on the # of data points
[Figure: query image → semantic hash function → query address → semantically similar images in address space; adapted from Salakhutdinov & Hinton '07]
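A hedged sketch of the lookup idea, assuming the codes are used directly as keys of a hash table (`table`, mapping code to image ids, is a hypothetical structure): enumerate every address within a small Hamming radius of the query and collect whatever is stored there. The cost grows with the radius, not with the number of images.

```python
from itertools import combinations

def hamming_ball(query: int, n_bits: int, radius: int):
    """Yield every address within `radius` bit flips of `query`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            address = query
            for p in positions:
                address ^= 1 << p  # flip bit p
            yield address

def lookup(table: dict, query_code: int, n_bits: int = 30, radius: int = 2):
    """Collect candidate image ids from all addresses in the Hamming ball."""
    candidates = []
    for address in hamming_ball(query_code, n_bits, radius):
        candidates.extend(table.get(address, []))
    return candidates
```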

Compact Binary Codes: Google has a few billion images (~10^9); a big PC has ~10 GB of RAM (~10^11 bits), and the codes must fit in memory (disk is too slow). That gives a budget of ~10^2 bits/image. For comparison, a 1-megapixel image is ~10^7 bits and a 32x32 color image is ~10^4 bits, so the semantic hash function must also reduce dimensionality.
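A back-of-the-envelope check of the slide's budget (the image count and RAM figure are the slide's; the arithmetic is ours):

```python
n_images = 1e9              # "a few billion" images, order 10^9
ram_bits = 10e9 * 8         # ~10 GB of RAM, roughly 10^11 bits
print(ram_bits / n_images)  # -> 80.0 bits/image, i.e. a budget of ~10^2
```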

Code requirements:
- Preserves the neighborhood structure of the input space
- Very compact (< 10^2 bits/image)
- Fast to compute

Three approaches:
1. Locality Sensitive Hashing (LSH)
2. Boosting
3. Restricted Boltzmann Machines (RBMs)

Image representation: GIST vectors. Pixels are not a convenient representation, so we use the GIST descriptor instead [Oliva & Torralba, IJCV 2001]: 512 dimensions/image. L2 distance between GIST vectors is not a bad substitute for human perceptual distance.

1. Locality Sensitive Hashing [Gionis, Indyk & Motwani, 1999]:
- Take random projections of the data
- Quantize each projection with a few bits
- For our N-bit code: compute the first N PCA components of the data; each random projection must be a linear combination of those N PCA components
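A minimal numpy sketch of this LSH variant, assuming one bit per projection and thresholding at zero (both assumptions, not stated on the slide):

```python
import numpy as np

def lsh_codes(X: np.ndarray, n_bits: int, seed: int = 0) -> np.ndarray:
    """N-bit LSH codes: random projections restricted to the top-N PCA subspace.

    X: (n_samples, d) data matrix, e.g. GIST vectors.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pca_coords = Xc @ Vt[:n_bits].T            # coordinates in the first N PCA components
    R = rng.standard_normal((n_bits, n_bits))  # random linear combinations of those components
    projections = pca_coords @ R
    return (projections > 0).astype(np.uint8)  # quantize each projection to 1 bit
```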

2. Boosting: a modified form of BoostSSC [Shakhnarovich, Viola & Darrell, 2003]; GentleBoost with exponential loss. Positive examples are pairs of neighbors; negative examples are pairs of unrelated images. Each binary regression stump examines a single coordinate of the input pair, comparing it to a learned threshold to decide whether the two images agree.
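The per-bit form this describes is simple; a sketch (the boosting loop that selects coordinates and thresholds is omitted, and `stumps` is a hypothetical list of learned (coordinate, threshold) pairs):

```python
def boosted_code(x, stumps):
    """Each learned stump contributes one bit: does coordinate c exceed threshold t?"""
    return [int(x[c] > t) for (c, t) in stumps]
```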

3. Restricted Boltzmann Machine (RBM) [Hinton & Salakhutdinov, Science 2006]: a network of binary stochastic units, with hidden units h, visible units v, and symmetric weights w. Parameters: weights w and biases b.

RBM architecture: the same network of binary stochastic units (hidden h, visible v, symmetric weights w; parameters are the weights w and biases b). Learn the weights and biases using Contrastive Divergence [Hinton & Salakhutdinov, Science 2006]. The conditional distributions are convenient, factorizing over units in the standard binary-RBM form: p(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i) and p(v_i = 1 | h) = sigmoid(c_i + sum_j w_ij h_j).
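A minimal numpy sketch of those conditionals for binary units (layout assumption: rows of `v` and `h` are examples, `W` is visible x hidden):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_h, rng):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij); returns probs and a binary sample."""
    p_h = sigmoid(v @ W + b_h)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_visible(h, W, b_v, rng):
    """p(v_i = 1 | h) = sigmoid(c_i + sum_j w_ij h_j); returns probs and a binary sample."""
    p_v = sigmoid(h @ W.T + b_v)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)
```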

Multi-Layer RBM: non-linear dimensionality reduction. [Architecture diagram: input GIST vector (512 dimensions) → layer 1 (weights w1) → layer 2 (weights w2) → layer 3 (weights w3) → output binary code (N dimensions)]
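Once trained, the encoder is just a deterministic forward pass through the stack; a sketch reusing `sigmoid` (and numpy) from above, where thresholding the top-layer activations at 0.5 is our assumption:

```python
def encode(gist, weights, biases):
    """512-d GIST vector -> N-bit binary code via the trained layer stack."""
    x = gist
    for W, b in zip(weights, biases):  # e.g. 512 -> 512 -> 256 -> N
        x = sigmoid(x @ W + b)
    return (x > 0.5).astype(np.uint8)  # quantize top-layer activations to bits
```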

Training RBM models has two phases:
1. Pre-training: unsupervised; use Contrastive Divergence to learn the weights and biases; gets the parameters to the right ballpark.
2. Fine-tuning: can make use of label information; units are no longer stochastic; back-propagate error to update the parameters; moves them to a local minimum.
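A sketch of one CD-1 update for the pre-training phase, reusing the sampling helpers above (learning rate and the use of probabilities vs. samples in each phase are conventional choices, not taken from the slides):

```python
def cd1_update(v0, W, b_v, b_h, lr=0.1, rng=None):
    """One Contrastive Divergence (CD-1) step for a binary RBM."""
    if rng is None:
        rng = np.random.default_rng(0)
    p_h0, h0 = sample_hidden(v0, W, b_h, rng)    # positive phase: data-driven hidden stats
    _, v1 = sample_visible(h0, W, b_v, rng)      # one Gibbs step back to the visibles
    p_h1, _ = sample_hidden(v1, W, b_h, rng)     # negative phase: reconstruction stats
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n    # approximate log-likelihood gradient
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h
```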

Greedy pre-training (unsupervised), step 1: train layer 1 (weights w1) on the input GIST vectors (512 dimensions).

Greedy pre-training (unsupervised), step 2: train layer 2 (weights w2) on the activations of the layer-1 hidden units (512 dimensions).

Greedy pre-training (unsupervised), step 3: train layer 3 (weights w3) on the activations of the layer-2 hidden units (256 dimensions), producing the N-dimensional output.

Backpropagation using the Neighborhood Components Analysis objective. [Diagram: the same 512-d GIST → layer 1 (w1) → layer 2 (w2) → layer 3 (w3) → N-bit code network, now fine-tuned end-to-end]

Neighborhood Components Analysis [Goldberger, Roweis, Salakhutdinov & Hinton, NIPS 2004]: labels are defined in the input space (high-dimensional); points live in the output space (low-dimensional).

Neighborhood Components Analysis (cont.): the low-dimensional point for an image x is the output of the RBM, f(x | W), where W are the RBM weights. The objective to maximize is O = sum_i sum_{j: c_j = c_i} p_ij, with soft-neighbor probabilities p_ij = exp(-||f(x_i|W) - f(x_j|W)||^2) / sum_{k != i} exp(-||f(x_i|W) - f(x_k|W)||^2).

Neighborhood Components Analysis (cont.): the objective pulls nearby points of the SAME class closer and pushes nearby points of a DIFFERENT class away, preserving the neighborhood structure of the original, high-dimensional space.
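A numpy sketch of the NCA objective evaluated on code-layer outputs F (rows are images), matching the formula above; it forms all pairwise distances, so it is only suitable for small batches:

```python
import numpy as np

def nca_objective(F: np.ndarray, labels: np.ndarray) -> float:
    """Sum over i of the soft-neighbour probability mass landing on same-class points."""
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                              # exclude k = i
    logits = -d2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))    # numerically stable softmax
    p /= p.sum(axis=1, keepdims=True)                         # p_ij
    same_class = labels[:, None] == labels[None, :]
    return float((p * same_class).sum())                      # maximize this
```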

Two test datasets:
1. LabelMe: 22,000 images (20,000 train | 2,000 test); ground-truth segmentations for all, so a ground-truth distance between images can be defined from the segmentations; per-pixel labels [SUPERVISED].
2. Web data: 79.3 million images collected from the Internet; no labels, so L2 distance between GIST vectors is used as the ground-truth distance [UNSUPERVISED]; noisy image labels only.

Retrieval Experiments

Examples of LabelMe retrieval 12 closest neighbors under different distance metrics

LabelMe retrieval results. [Plot: % of 50 true neighbors in the retrieval set vs. size of retrieval set, 0 to 20,000]

LabelMe retrieval results (cont.). [Plots: % of 50 true neighbors in the retrieval set vs. size of retrieval set; and % of 50 true neighbors in the first 500 retrieved vs. number of bits]

Web image retrieval results. [Plots: % of 50 true neighbors in the retrieval set vs. size of retrieval set]

Examples of Web retrieval 12 neighbors using different distance metrics

Retrieval Timings

LabelMe Recognition examples

LabelMe Recognition

Overview:
1. Fast retrieval using compact codes
2. Recognition using neighbors with unreliable labels

Label Assignment: retrieval gives a set of nearby images; how do we compute a label for the query? Issues:
- Labeling noise
- Keywords can be very specific, e.g. "yellowfin tuna"
[Example: a query image whose neighbors carry labels such as Grover Cleveland, Linnet, Birdcage, Chiefs, Casing]

Wordnet, a lexical dictionary. Synonyms/hypernyms (ordered by estimated frequency) of the noun "aardvark", sense 1:
aardvark, ant bear, anteater, Orycteropus afer
  => placental, placental mammal, eutherian, eutherian mammal
    => mammal
      => vertebrate, craniate
        => chordate
          => animal, animate being, beast, brute, creature
            => organism, being
              => living thing, animate thing
                => object, physical object
                  => entity

Wordnet Hierarchy: convert the graph structure into a tree by taking the most common meaning of each word (illustrated on the same "aardvark" hypernym chain as above).

Wordnet Voting Scheme: one image, one vote. [Diagram: neighbor votes accumulated over the Wordnet tree, shown against ground truth]

Classification at Multiple Semantic Levels. Example vote counts at one level: Animal 6, Person 33, Plant 5, Device 3, Administrative 4, Others 22. At a higher level: Living 44, Artifact 9, Land 3, Region 7, Others 10.
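A toy sketch of the voting scheme: each neighbor casts one vote at every level of its label's hypernym path, so the same retrieval supports classification at any semantic level. The `HYPERNYMS` table here is a hypothetical stand-in for a real Wordnet lookup:

```python
from collections import Counter

# Hypothetical hypernym paths (leaf -> ... -> entity); real paths come from Wordnet.
HYPERNYMS = {
    "aardvark": ["aardvark", "placental", "mammal", "vertebrate", "chordate",
                 "animal", "organism", "living thing", "object", "entity"],
    "person": ["person", "organism", "living thing", "object", "entity"],
}

def vote(neighbor_labels):
    """One image, one vote at every level of its hypernym path."""
    votes = Counter()
    for label in neighbor_labels:
        for node in HYPERNYMS.get(label, [label]):
            votes[node] += 1
    return votes

# e.g. vote(["aardvark", "person", "person"])["living thing"] == 3
```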


Person Recognition: 23% of all images in the dataset contain people, with a wide range of poses: not just frontal faces.

Person Recognition test set: 1,016 images from Altavista using the "person" query; both high-res and 32x32 versions available; disjoint from the 79 million tiny images.

Person Recognition task: is there a person in the image or not? [Plot: a 64-bit RBM code trained on web data (weak image labels only), compared against Viola-Jones]

Object Classification. [Plot: comparison against a hand-designed metric]

Distribution of objects in the world (LabelMe statistics): 10% of the classes account for 93% of the labels! There are 1,500 classes with fewer than 100 samples. [Plot: number of labeled samples vs. object rank, LabelMe dataset]

Conclusions:
- It is possible to build compact codes for retrieval: the number of bits needed seems to depend on the dataset, and there is much room for improvement (e.g. use JPEG coefficients as input to the RBM).
- Lots of data enables interesting things: what would happen with Google's ~2 billion images? Video data?