Semi-Supervised Learning in Gigantic Image Collections
Rob Fergus (NYU), Yair Weiss (Hebrew U.), Antonio Torralba (MIT)

What does the world look like?
- High-level image statistics
- Object recognition for large-scale search
- Gigantic image collections

Spectrum of Label Information
Human annotations → Noisy labels → Unlabeled

Semi-Supervised Learning using the Graph Laplacian
- Vertices V = data points
- Edges E weighted by an n x n affinity matrix W
- Graph Laplacian: L = D − W (normalized form: D^{-1/2} (D − W) D^{-1/2}), where D is the diagonal degree matrix with D_ii = Σ_j W_ij
[Zhu 03, Zhou 04]
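A minimal numpy sketch of building these quantities for a small dataset. The Gaussian affinity and the bandwidth `eps` are assumptions for illustration; the slides do not fix a particular affinity:

```python
import numpy as np

def graph_laplacian(X, eps=1.0):
    """Build a dense Gaussian affinity W and the graph Laplacian.

    X   : (n, d) array of data points
    eps : affinity bandwidth (assumed; not specified in the slides)
    """
    # Pairwise squared distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * eps ** 2))          # n x n affinity matrix
    np.fill_diagonal(W, 0.0)                  # no self-loops
    D = W.sum(axis=1)                         # degrees
    L = np.diag(D) - W                        # combinatorial Laplacian
    # Normalized Laplacian D^{-1/2} (D - W) D^{-1/2}
    Dm12 = np.diag(1.0 / np.sqrt(D + 1e-12))
    L_norm = Dm12 @ L @ Dm12
    return W, L, L_norm
```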

SSL using the Graph Laplacian
Want to find a label function f that minimizes
    J(f) = f^T L f + (f − y)^T Λ (f − y)
where y = labels and Λ = diag(λ) = weights (λ_i large if point i is labeled, λ_i = 0 if unlabeled).
- First term: smoothness of f over the graph
- Second term: agreement with the labels
Straightforward solution: f = (L + Λ)^{-1} Λ y
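A sketch of the direct small-n solve implied by the cost above. The value of the label-confidence weight `lam` is an assumption:

```python
import numpy as np

def ssl_direct(L, y, labeled_mask, lam=100.0):
    """Minimize f'Lf + (f - y)' Lambda (f - y) by solving (L + Lambda) f = Lambda y.

    L            : (n, n) graph Laplacian
    y            : (n,) target labels (values at unlabeled points are ignored)
    labeled_mask : (n,) boolean, True where a label is available
    lam          : weight on label agreement (assumed value)
    """
    n = L.shape[0]
    Lam = np.zeros(n)
    Lam[labeled_mask] = lam
    A = L + np.diag(Lam)                 # (L + Lambda)
    f = np.linalg.solve(A, Lam * y)      # f = (L + Lambda)^{-1} Lambda y
    return f
```

This is exactly the step that becomes infeasible at 80 million points, which motivates the rest of the deck.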

Eigenvectors of the Laplacian
Smooth vectors will be linear combinations of the eigenvectors U with small eigenvalues: f = U α
[Belkin & Niyogi 06, Schoelkopf & Smola 02, Zhu et al. 03, 08]

Rewrite the System
Let U = the smallest k eigenvectors of L and α = the coefficients, so f = U α.
The optimum is now the solution of a k x k system:
    (Σ + U^T Λ U) α = U^T Λ y
where Σ = diag of the k smallest eigenvalues of L.
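A sketch of the reduced solve, assuming the k x k system above. The eigendecomposition is done densely here purely for illustration; at n = 80M it is impossible, which is the point of the eigenfunction approximation later on:

```python
import numpy as np

def ssl_reduced(L, y, labeled_mask, k=10, lam=100.0):
    """Solve SSL in the span of the k smallest-eigenvalue eigenvectors of L."""
    n = L.shape[0]
    evals, evecs = np.linalg.eigh(L)        # ascending eigenvalues (dense, illustrative only)
    U, Sigma = evecs[:, :k], np.diag(evals[:k])
    Lam = np.zeros(n)
    Lam[labeled_mask] = lam
    # (Sigma + U' Lambda U) alpha = U' Lambda y   -- a k x k system
    A = Sigma + U.T @ (Lam[:, None] * U)
    alpha = np.linalg.solve(A, U.T @ (Lam * y))
    return U @ alpha                        # label function f = U alpha
```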

Computational Bottleneck
Consider a dataset of 80 million images:
- Inverting L means inverting an 80 million x 80 million matrix
- Finding the eigenvectors of L means diagonalizing an 80 million x 80 million matrix

Large-Scale SSL: Related Work
- Nystrom method: pick a small set of landmark points
  - Compute the exact solution on these
  - Interpolate the solution to the rest of the data
- Others iteratively use classifiers to label the data
  - E.g. the boosting-based method of Loeff et al., ICML '08
[see Zhu '08 survey]

Our Approach

Overview of Our Approach
- Nystrom: reduce n by picking landmark points from the data
- Ours: take the limit n → ∞ and work with the density p(x)

Consider the Limit as n → ∞
Consider x to be drawn from a 2D distribution p(x).
Let L_p(F) be a smoothness operator on p(x), for a function F(x):
    L_p(F) = ∫∫ (F(x1) − F(x2))^2 W(x1, x2) p(x1) p(x2) dx1 dx2,
where W(x1, x2) = exp(−‖x1 − x2‖^2 / 2ε^2).
Analyze the eigenfunctions of L_p(F).
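Since the double integral is just an expectation over independent draws from p, it can be estimated by Monte Carlo. A small sketch, assuming the form of L_p(F) written above; `sample_p`, `eps` and the test functions are illustrative choices, not taken from the slides:

```python
import numpy as np

def smoothness_Lp(F, sample_p, eps=0.3, n_pairs=200_000):
    """Monte Carlo estimate of L_p(F) = E_{x1,x2 ~ p}[(F(x1) - F(x2))^2 W(x1, x2)].

    F        : callable mapping an (m, d) array of points to (m,) values
    sample_p : callable; sample_p(m) draws m i.i.d. points from p(x), shape (m, d)
    eps      : bandwidth in W(x1, x2) = exp(-||x1 - x2||^2 / (2 eps^2))  (assumed)
    """
    x1, x2 = sample_p(n_pairs), sample_p(n_pairs)
    W = np.exp(-((x1 - x2) ** 2).sum(axis=1) / (2 * eps ** 2))
    return float(np.mean((F(x1) - F(x2)) ** 2 * W))

# Example on 2D Gaussian data: a slowly varying function vs. a discontinuous one
sample = lambda m: np.random.randn(m, 2)
print(smoothness_Lp(lambda X: np.tanh(X[:, 0]), sample))   # smooth function
print(smoothness_Lp(lambda X: np.sign(X[:, 1]), sample))   # non-smooth, for comparison
```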

Eigenvectors & Eigenfunctions

Key Assumption: Separability of Input Data
Claim: if p is separable, i.e. p(x1, x2) = p(x1) p(x2), then eigenfunctions of the marginals are also eigenfunctions of the joint density, with the same eigenvalues.
[Nadler et al. 06, Weiss et al. 08]

Numerical Approximations to Eigenfunctions in 1D
- 300k points drawn from a distribution p(x)
- Consider the marginal p(x1), approximated by a histogram h(x1)
(Figure: data drawn from p(x); marginal p(x1); histogram h(x1))

Numerical Approximations to Eigenfunctions in 1D
- Solve for the values g of the eigenfunction at a set of discrete locations (the histogram bin centers), and for the associated eigenvalues
- This is a B x B system (# histogram bins B = 50)
- P is diag(h(x1)); the affinity matrix holds affinities between the discrete locations
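A sketch of one reasonable discretization matching this slide: treat each bin center as a "super-node" carrying h_i points and solve a density-weighted generalized eigenproblem (D − P Ŵ P) g = σ D g with P = diag(h) and D = diag of the row sums of P Ŵ P. This is my reading of the slide, not necessarily the exact system used by the authors; the bandwidth choice and the empty-bin regularization are assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def eigenfunctions_1d(x, n_bins=50, eps=None, n_funcs=3):
    """Numerically approximate 1-D eigenfunctions from a histogram of x."""
    h, edges = np.histogram(x, bins=n_bins, density=True)
    h = h + 1e-6 * h.max()                           # regularize empty bins (assumption)
    centers = 0.5 * (edges[:-1] + edges[1:])
    if eps is None:
        eps = 2.0 * (centers[1] - centers[0])        # assumed bandwidth
    # Affinity between the B discrete locations (bin centers)
    W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    PWP = (h[:, None] * W) * h[None, :]              # P W P, with P = diag(h)
    D = np.diag(PWP.sum(axis=1))
    # Generalized symmetric eigenproblem: (D - PWP) g = sigma D g
    sigma, g = eigh(D - PWP, D)                      # ascending eigenvalues
    return centers, sigma[:n_funcs], g[:, :n_funcs]
```

The first eigenfunction (eigenvalue ≈ 0) comes out constant, as expected for a Laplacian-style operator; the subsequent ones are the smooth, low-eigenvalue functions the method actually uses.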

1D Approximate Eigenfunctions
(Figure: 1st, 2nd and 3rd eigenfunctions of h(x1))

Separability over Dimensions
- Build a histogram over dimension 2: h(x2)
- Now solve for the eigenfunctions of h(x2)
(Figure: data; 1st, 2nd and 3rd eigenfunctions of h(x2))

From Eigenfunctions to Approximate Eigenvectors
- Take each data point
- Do a 1-D interpolation in each eigenfunction → a k-dimensional vector (for k eigenfunctions)
- Very fast operation (has to be done nk times)
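A minimal sketch of this interpolation step with `np.interp`, assuming the eigenfunctions are tabulated at histogram bin centers (as returned by a routine like the 1-D sketch above); stacking is shown per dimension here, whereas the full method first sorts all eigenfunctions by eigenvalue and keeps the first k:

```python
import numpy as np

def interpolate_eigenvectors(X, centers_per_dim, funcs_per_dim):
    """Turn tabulated 1-D eigenfunctions into approximate eigenvector entries.

    X               : (n, d) data (already rotated to be ~separable)
    centers_per_dim : list of d arrays of histogram bin centers
    funcs_per_dim   : list of d arrays, each (B, k_d) of eigenfunction values
    Returns an (n, k_total) matrix whose columns approximate eigenvectors.
    """
    cols = []
    for dim, (centers, funcs) in enumerate(zip(centers_per_dim, funcs_per_dim)):
        for j in range(funcs.shape[1]):
            # 1-D linear interpolation of eigenfunction j along this dimension
            cols.append(np.interp(X[:, dim], centers, funcs[:, j]))
    return np.stack(cols, axis=1)
```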

Preprocessing
- Need to make the data separable
- Rotate using PCA
(Figure: not separable → rotate → separable)

Overall Algorithm
1. Rotate the data to maximize separability (currently using PCA)
2. For each dimension:
   - Construct a 1D histogram
   - Solve numerically for eigenfunctions / eigenvalues
3. Order the eigenfunctions from all dimensions by increasing eigenvalue and take the first k
4. Interpolate the data into the k eigenfunctions
   - Yields approximate eigenvectors of the normalized Laplacian
5. Solve a k x k least-squares system to give the label function
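Putting the pieces together, a compact end-to-end sketch of the algorithm as listed above. The bandwidth, the empty-bin regularization, the number of eigenfunctions kept per dimension, and the label weight are all assumptions; the per-dimension eigenfunction step follows the same discretization sketched earlier:

```python
import numpy as np
from scipy.linalg import eigh

def eigenfunction_ssl(X, y, labeled_mask, n_bins=50, k=64, lam=100.0):
    """End-to-end sketch: PCA rotation, per-dimension 1-D eigenfunctions,
    interpolation to approximate eigenvectors, then a k x k label solve."""
    # 1. Rotate with PCA so the dimensions are (approximately) independent
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xr = Xc @ Vt.T

    # 2. Per-dimension 1-D eigenfunctions from histograms
    cands = []                                   # (eigenvalue, dim, centers, g)
    for dim in range(Xr.shape[1]):
        h, edges = np.histogram(Xr[:, dim], bins=n_bins, density=True)
        h = h + 1e-6 * h.max()                   # regularize empty bins (assumption)
        c = 0.5 * (edges[:-1] + edges[1:])
        eps = 2.0 * (c[1] - c[0])                # assumed bandwidth
        W = np.exp(-(c[:, None] - c[None, :]) ** 2 / (2 * eps ** 2))
        PWP = (h[:, None] * W) * h[None, :]
        D = np.diag(PWP.sum(1))
        sig, g = eigh(D - PWP, D)
        for j in range(1, min(6, n_bins)):       # skip the constant eigenfunction
            cands.append((sig[j], dim, c, g[:, j]))

    # 3. Keep the k eigenfunctions with smallest eigenvalues (across all dims)
    cands.sort(key=lambda t: t[0])
    cands = cands[:k]

    # 4. Interpolate the data into the eigenfunctions -> approximate eigenvectors U
    U = np.stack([np.interp(Xr[:, d], c, g) for _, d, c, g in cands], axis=1)
    Sigma = np.diag([s for s, _, _, _ in cands])

    # 5. Solve the k x k system (Sigma + U' Lambda U) alpha = U' Lambda y
    Lam = np.where(labeled_mask, lam, 0.0)
    A = Sigma + U.T @ (Lam[:, None] * U)
    alpha = np.linalg.solve(A, U.T @ (Lam * y))
    return U @ alpha                             # label function over all points
```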

Experiments on Toy Data

Comparison of Approaches

Data

Nystrom Comparison
Using too few landmark points results in highly unstable eigenvectors

Nystrom Comparison Eigenfunctions fail when data has significant dependencies between dimensions

Experiments on Real Data

Experiments
- Images from 126 classes (e.g. "dump truck", "emu") downloaded from Internet search engines; 63,000 images in total
- Labels (correct / incorrect) provided by Geoff Hinton, Alex Krizhevsky and Vinod Nair (U. Toronto and CIFAR)

Input Image Representation
- Pixels are not a convenient representation
- Use the Gist descriptor (Oliva & Torralba, 2001), PCA'd down to 64 dimensions
- L2 distance between Gist vectors is a rough substitute for human perceptual distance
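A small sketch of this representation step, assuming the 384-dimensional Gist descriptors have already been extracted into a matrix (the Gist computation itself is not shown):

```python
import numpy as np

def pca_gist_distances(G, n_dims=64):
    """PCA-reduce Gist descriptors and return pairwise L2 distances.

    G : (n, 384) matrix of precomputed Gist descriptors (assumed given)
    """
    Gc = G - G.mean(0)
    _, _, Vt = np.linalg.svd(Gc, full_matrices=False)
    Z = Gc @ Vt[:n_dims].T                       # n x 64 PCA coordinates
    sq = (Z ** 2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.sqrt(np.maximum(d2, 0.0))          # pairwise L2 distances
```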

Are Dimensions Independent?
(Figure: joint histograms for pairs of dimensions of the raw 384-dimensional Gist, and after PCA. MI is the mutual information score; 0 = independent.)
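A sketch of how such a mutual-information score can be computed from a 2-D joint histogram of two descriptor dimensions; the bin count is an assumption:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Estimate MI (in nats) between two 1-D variables via a joint histogram."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()                      # joint probabilities
    px = p.sum(axis=1, keepdims=True)            # marginal over a
    py = p.sum(axis=0, keepdims=True)            # marginal over b
    nz = p > 0                                   # avoid log(0)
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))
```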

Real 1-D Eigenfunctions of PCA'd Gist Descriptors
(Figure: eigenfunctions 1 through 256, plotted as eigenfunction value vs. histogram bin; color = input dimension.)

Protocol
- Task is to re-rank the images of each class (63,000 images in total)
- Measure precision at 15% recall
- Vary the number of labeled examples
- Chance-level performance is 33%
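A sketch of this evaluation metric as read here (precision of a ranked list at the point where a fixed fraction of the relevant images has been retrieved); the exact protocol details are assumptions:

```python
import numpy as np

def precision_at_recall(scores, relevant, recall_level=0.15):
    """Precision at the rank where `recall_level` of the relevant items are retrieved.

    scores   : (n,) ranking scores, higher = ranked earlier
    relevant : (n,) boolean ground-truth relevance
    """
    order = np.argsort(-scores)
    rel = relevant[order]
    cum_rel = np.cumsum(rel)
    target = recall_level * rel.sum()
    cutoff = int(np.searchsorted(cum_rel, target)) + 1   # items retrieved so far
    return cum_rel[cutoff - 1] / cutoff
```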

80 Million Images

Running on 80 Million Images
- PCA to 32 dimensions, k = 48 eigenfunctions
- Precompute the approximate eigenvectors (~20 GB)
- For each class, propagate labels through all 80 million images
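A sketch of what the per-class propagation can look like once the approximate eigenvectors are precomputed: only a tiny k x k system is solved per class, and the 80M x 48 eigenvector matrix is streamed from disk. The memory-mapped layout, file names, and chunk size are assumptions for illustration:

```python
import numpy as np

N, K = 80_000_000, 48

# Approximate eigenvectors and their eigenvalues, precomputed once and stored on
# disk (~20 GB for 80M x 48 float32). File names here are hypothetical.
U = np.memmap("approx_eigenvectors.f32", dtype=np.float32, mode="r", shape=(N, K))
Sigma = np.diag(np.load("eigenvalues.npy"))              # (K, K)

def propagate_labels(labeled_idx, labels, lam=100.0):
    """Solve the K x K system for one class using only the labeled rows of U."""
    Ul = np.asarray(U[labeled_idx], dtype=np.float64)    # (l, K), l is small
    A = Sigma + lam * Ul.T @ Ul                          # Sigma + U' Lambda U
    b = lam * Ul.T @ labels                              # U' Lambda y
    alpha = np.linalg.solve(A, b)
    # Score all 80M images in chunks to avoid loading U into memory at once
    scores = np.empty(N, dtype=np.float32)
    for start in range(0, N, 1_000_000):
        stop = min(start + 1_000_000, N)
        scores[start:stop] = U[start:stop] @ alpha
    return scores
```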

Summary
- A semi-supervised scheme that can scale to very large problems
- Rather than sub-sampling the data, we take the limit of infinite unlabeled data
- Assumes the input data distribution is separable
- Can propagate labels in a graph with 80 million nodes in a fraction of a second

Future Work
- Can potentially use 2D or 3D histograms instead of 1D (requires more data)
- Consider diagonal eigenfunctions
- Sharing of labels between classes

Are Dimensions Independent?
(Figure: joint histograms for pairs of dimensions of the raw 384-dimensional Gist, and after ICA. MI is the mutual information score; 0 = independent.)

Overview of Our Approach
- Existing large-scale SSL methods try to reduce the number of points
- We consider what happens as n → ∞: eigenvectors → eigenfunctions
- Assume the input distribution is separable
- Make a crude numerical approximation to the eigenfunctions
- Interpolate the data into these approximate eigenfunctions to give approximate eigenvectors

Eigenfunctions
- Eigenfunctions are the limit of eigenvectors as n → ∞
- Analytical forms of eigenfunctions exist only in a few cases: uniform, Gaussian
- Instead, we calculate a numerical approximation to the eigenfunctions
[Nadler et al. 06, Weiss et al. 08] [Coifman et al. 05, Nadler et al. 06, Belkin & Niyogi 07]

Complexity Comparison

Nystrom (polynomial in # landmarks):
- Select m landmark points
- Get the smallest k eigenvectors of an m x m system
- Interpolate n points into the k eigenvectors
- Solve a k x k linear system

Eigenfunction (linear in # data points):
- Rotate n points
- Form d 1-D histograms
- Solve d linear systems, each b x b
- k 1-D interpolations of n points
- Solve a k x k linear system

Key: n = # data points (big, >10^6); l = # labeled points (small, <100); m = # landmark points; d = # input dims (~100); k = # eigenvectors (~100); b = # histogram bins (~50)

Can’t build accurate high dimensional histograms –Need too many points Currently just use 1-D histograms –2 or 3D ones possible with enough data This assumes distribution is separable –Assume p(x) = p(x 1 ) p(x 2 ) … p(x d ) For separable distributions, eigenfunctions are also separable Key Assumption: Separability of Input data [Nadler et al. 06,Weiss et al. 08]

Varying # Training Examples