Transfer Learning for Image Classification

Transfer Learning Approaches
Leverage data from related tasks to improve performance:
- Improve generalization.
- Reduce the run-time of evaluating a set of classifiers.
Two main approaches:
- Learning shared hidden representations.
- Sharing features.

Sharing Features: Efficient Boosting Procedures for Multiclass Object Detection
Antonio Torralba, Kevin Murphy, William Freeman

Snapshot of the idea
Goal:
- Reduce the computational cost of multiclass object recognition.
- Improve generalization performance.
Approach:
- Make boosted classifiers share weak learners.

Training a single boosted classifier
- Consider training a single boosted classifier from a pool of candidate weak learners (weighted decision stumps).
- Fit an additive model: H(x) = Σ_t h_t(x), a sum of the selected weak learners.

Training a single boosted classifier
- Greedy approach: add one weak learner at a time so as to minimize the exponential loss J = Σ_i exp(−y_i H(x_i)).
- Gentle Boosting fits each new weak learner by weighted least squares, with example weights w_i = exp(−y_i H(x_i)).
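
To make the procedure concrete, here is a minimal numpy sketch of GentleBoost with weighted regression stumps. The synthetic data, the number of rounds, and the brute-force stump search are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted least-squares regression stump: h(x) = a*[x_f > thr] + b."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            mask = X[:, f] > thr
            # weighted mean of y on each side of the split
            b = np.average(y[~mask], weights=w[~mask]) if (~mask).any() else 0.0
            ab = np.average(y[mask], weights=w[mask]) if mask.any() else 0.0
            pred = np.where(mask, ab, b)
            err = np.sum(w * (y - pred) ** 2)
            if best is None or err < best[0]:
                best = (err, f, thr, ab - b, b)
    return best[1:]  # (feature, threshold, a, b)

def gentle_boost(X, y, rounds=20):
    """y in {-1, +1}; returns the list of stumps forming the additive model H(x)."""
    n = len(y)
    w = np.ones(n) / n
    stumps = []
    for _ in range(rounds):
        f, thr, a, b = fit_stump(X, y, w)
        h = np.where(X[:, f] > thr, a + b, b)
        stumps.append((f, thr, a, b))
        w = w * np.exp(-y * h)   # reweight examples: greedy exponential-loss descent
        w /= w.sum()
    return stumps

# toy usage on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
model = gentle_boost(X, y)
```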

Standard multiclass case: no sharing
- One additive model H_k(x) per class; each class has its own set of weak learners.
- Minimize the sum of exponential losses over classes: Σ_k Σ_i exp(−z_i^k H_k(x_i)), where z_i^k = ±1 indicates membership in class k.

Multiclass case: sharing features
- Each weak learner is now associated with a subset of classes and added to the corresponding additive classifiers; H_k(x) remains the classifier for the k-th class.
- At iteration t, add one weak learner to one of the additive models, again minimizing the sum of exponential losses.
- Naive approach: evaluate every possible subset of classes at each iteration (exponential in the number of classes).
- Greedy heuristic: grow the subset one class at a time, keeping the class whose addition most reduces the loss.

Sharing features for multiclass object detection Torralba, Murphy, Freeman. CVPR 2004

Learning efficiency

Sharing features shows sub-linear scaling of the number of features with the number of object classes (for area under ROC = 0.9). Red: shared features; blue: independent features.

How the features are shared across objects Basic features: Filter responses computed at different locations of the image

Uncovering Shared Structures in Multiclass Classification
Yonatan Amit, Michael Fink, Nathan Srebro, Shimon Ullman

Structure learning framework
- Class parameters: linear classifiers (one weight vector per class).
- Structural parameters: linear transformations shared across classes.
- Find the optimal class and structural parameters jointly.

Multiclass loss function
- Hinge loss on individual class scores.
- Maximal hinge loss: the maximum hinge loss over the competing classes (reconstructed forms below).
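
The slide's equations did not survive extraction; the LaTeX below gives the standard forms these two terms usually denote, in my own notation (w_k is the weight vector of class k, i.e. a row of the parameter matrix W used on the next slides). Treat it as a reconstruction, not the authors' exact definitions.

```latex
% Hinge loss on a single real-valued score s with label y \in \{-1,+1\}:
\ell_{\mathrm{hinge}}(s, y) = \max\bigl(0,\; 1 - y\,s\bigr)

% Maximal (multiclass) hinge loss: the largest margin violation over competing classes
\ell(W; x, y) = \max_{k \neq y} \; \max\bigl(0,\; 1 + w_k^{\top} x - w_y^{\top} x\bigr)
```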

Snapshot of the idea
- Main idea: enforce sharing by finding a low-rank parameter matrix W.
- Consider the m-by-d parameter matrix W whose k-th row is the weight vector of class k.
- It can be factored as W = V Θ, where the rows of Θ form a shared basis: Θ acts as a transformation on x, and V as a transformation on the class weight vectors w.

Low-rank regularization penalty
- The rank of W is the smallest z such that W can be written as W = V Θ with V an m-by-z matrix and Θ a z-by-d matrix.
- A regularization penalty designed to minimize the rank of W would tend to produce solutions where a few basis vectors are shared by all classes.
- Minimizing the rank directly would lead to a hard combinatorial problem.
- Instead, use a trace-norm penalty: the sum of the singular values of W.
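
Written out as one objective (my own notation; λ is an unspecified trade-off constant), the trace-norm-regularized learning problem the slide describes is:

```latex
% W \in \mathbb{R}^{m \times d}: class weight vectors stacked as rows.
% \ell: the multiclass loss from the earlier slide; \lambda > 0 a trade-off constant.
\min_{W}\; \sum_{i=1}^{n} \ell\bigl(W; x_i, y_i\bigr) \;+\; \lambda\,\lVert W\rVert_{\Sigma},
\qquad
\lVert W\rVert_{\Sigma} = \sum_{r} \sigma_r(W)
\quad \text{(sum of the singular values of } W\text{)}
```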

Putting it all together
- The rank no longer appears in the objective; it is replaced by the trace-norm penalty.
- For optimization they use a gradient-based method that minimizes a smooth approximation of the objective.

Mammals Dataset

Results

Transfer Learning for Image Classification via Sparse Joint Regularization
Ariadna Quattoni, Michael Collins, Trevor Darrell

Training visual classifiers when a few examples are available
Problem:
- Image classification from a few examples can be hard.
- A good representation of images is crucial.
Solution:
- Learn a good image representation using unlabeled data + labeled data from related problems.

Snapshot of the idea
- Use an unlabeled dataset + a kernel function to compute a new representation:
- Complex features in a high-dimensional space.
- Some of them will be very discriminative (hopefully); most will be irrelevant.
- If we knew the relevant features, we could learn from fewer examples.
- Related problems may share relevant features, so we can use data from related problems to discover them.

Semi-supervised learning
- Step 1: Learn a representation — apply unsupervised learning to a large dataset of unlabeled data to obtain a visual representation, and use it to compute an h-dimensional representation of the small labeled training set.
- Step 2: Train a classifier on that representation.

Semi-supervised learning:
- Raina et al. [ICML 2007] proposed an approach that learns a sparse set of high-level features (i.e. linear combinations of the original features) from unlabeled data using a sparse coding technique.
- Balcan et al. [ML 2004] proposed a representation based on computing kernel distances to unlabeled data points.

Learning visual representations using unlabeled data only
- Unsupervised learning in the data space.
- Good: a lower-dimensional representation preserves the relevant statistics of the data sample.
- Bad: the representation might still contain irrelevant features, i.e. features that are useless for classification.

Learning visual representations from unlabeled data + labeled data from related categories
- Step 1: Learn a representation — use a large dataset of unlabeled images and a kernel function to create a new representation, then use labeled images from related categories to select the discriminative features of that representation, yielding a discriminative representation.

Our contribution
Main differences with previous approaches:
- Our choice of joint regularization norm allows us to express the joint loss minimization as a linear program (i.e. no need for greedy approximations).
- While previous approaches build joint sparse classifiers on the original feature space, our method discovers discriminative features in a space derived from the unlabeled data and uses these discriminative features to solve future problems.

Overview of the method
- Step I: Use the unlabeled data to compute a new representation space [kernel SVD].
- Step II: Use the labeled data from related problems to discover discriminative features in the new space [joint sparse regularization].
- Step III: Compute the new discriminative representation for the samples of the target problem.
- Step IV: Train the target classifier using the representation of Step III.

Step I: Compute a representation using the unlabeled data
Perform kernel SVD on the unlabeled data U:
- A) Compute the kernel matrix K of the unlabeled images.
- B) Compute a projection matrix A by taking all the eigenvectors of K.

Step I: Compute a representation using the unlabeled data
- C) Project the labeled data from the related problems into the new space. Notational shorthand: x' denotes this projected representation of an example x.
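
A minimal numpy sketch of Step I under a few assumptions: an RBF kernel (the experiments later report using an RBF kernel), no kernel centering, and random stand-ins for the unlabeled set U and the related-problem data. The names rbf_kernel, K, A and X_proj are mine.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.1):
    # squared distances: ||x||^2 + ||z||^2 - 2 x.z
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * sq)

# U: unlabeled images, X_related: labeled images from related problems (toy stand-ins)
rng = np.random.default_rng(0)
U = rng.standard_normal((300, 50))
X_related = rng.standard_normal((100, 50))

# A) kernel matrix of the unlabeled images
K = rbf_kernel(U, U)

# B) projection matrix A from the eigenvectors of K, scaled so projections
#    correspond to kernel principal directions; keep numerically positive eigenvalues
vals, vecs = np.linalg.eigh(K)
keep = vals > 1e-8
A = vecs[:, keep] / np.sqrt(vals[keep])

# C) project the labeled data from related problems into the new space
X_proj = rbf_kernel(X_related, U) @ A   # rows are the new representations x'
```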

Sidetrack
- Another possible method for learning a representation from the unlabeled data is to create a projection matrix Q from only the h eigenvectors corresponding to the h largest eigenvalues (the h leading columns of A).
- We call this approach the Low Rank Baseline (LRB).
- Our method differs significantly from the low-rank approach in that we use training data from related problems to select discriminative features in the new space.

Step II: Discover relevant features by joint sparse approximation
- A classifier is a function f: X → Y, where X is the input space (i.e. the representation learnt from the unlabeled data) and y ∈ {+1, −1} is a binary label; in our application y = 1 if an image belongs to a particular topic and −1 otherwise.
- A loss function l(f(x), y) measures the cost of predicting f(x) when the true label is y.

Step II: Discover relevant features by joint sparse approximation
- Consider learning a single sparse linear classifier, on the space learnt from the unlabeled data, of the form f(x') = w · x'.
- A sparse model will have only a few features with non-zero coefficients.
- A natural choice for parameter estimation is to minimize the classification error on the training set plus an L1 penalty ||w||_1, which penalizes non-sparse solutions.
- Donoho [2004] has proven (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
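
For a single problem, an L1-penalized linear classifier of this kind can be sketched with scikit-learn; the squared-hinge loss and the value of C are arbitrary stand-ins, and the data are synthetic.

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy data standing in for examples in the learned representation space x'
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))
y = np.where(X[:, 0] - X[:, 3] > 0, 1, -1)

# the L1 penalty drives most coefficients exactly to zero
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1)
clf.fit(X, y)
print("non-zero coefficients:", np.sum(np.abs(clf.coef_) > 1e-8))
```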

Step II: Discover relevant features by joint sparse approximation
- Goal: find a subset of features R such that each related problem can be well approximated by a sparse classifier whose non-zero coefficients correspond to features in R.
- Solution: regularized joint loss minimization — the sum over problems k of the classification error on training set k, plus a regularizer that penalizes solutions that utilize too many features.

Step II: Discover relevant features by joint sparse approximation
- How do we penalize solutions that use too many features? Arrange the parameters as a matrix W whose k-th column holds the coefficients of classifier k and whose j-th row holds the coefficients of feature j across all classifiers; the natural penalty is the number of non-zero rows.
- Problem: this is not a proper norm and would lead to a hard combinatorial problem.

Step II: Discover relevant features by joint sparse approximation
- Instead of the #non-zero-rows pseudo-norm, use a convex relaxation [Tropp 2006]: the L1,∞ norm ||W||_{1,∞} = Σ_j max_k |W_{jk}|.
- This norm combines two norms:
- An L1 norm over the per-feature maximum absolute coefficients, which promotes sparsity on the max values (use few features).
- An L∞ norm within each row, which promotes non-sparsity within a row (share features).
- The combination results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.

Step II: Discover relevant features by joint sparse approximation
- Using the L1,∞ norm, the objective becomes: minimize over W the sum over problems k of the training loss of classifier w_k, plus λ ||W||_{1,∞}.
- For any convex loss this is a convex function; in particular, with the hinge loss the optimization problem can be expressed as a linear program.

Step II: Discover relevant features by joint sparse approximation
Linear program formulation (hinge loss): minimize Σ_k Σ_i ε_i^k + λ Σ_j t_j over W, t, ε, subject to
- Max-value constraints: −t_j ≤ w_{jk} and w_{jk} ≤ t_j for every feature j and problem k (so t_j bounds max_k |w_{jk}|).
- Slack-variable constraints: y_i^k (w_k · x_i^k) ≥ 1 − ε_i^k and ε_i^k ≥ 0 for every problem k and training example i.
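
The sketch below solves the same L1,∞-regularized joint hinge-loss objective with cvxpy as a generic convex solver, rather than building the linear program by hand as on the slide. The toy tasks, the dimensions, and the value of lam are assumptions.

```python
import numpy as np
import cvxpy as cp

# toy related problems in the learned representation space
rng = np.random.default_rng(0)
d, m = 40, 5                      # features, related problems
tasks = []
for _ in range(m):
    X = rng.standard_normal((30, d))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
    tasks.append((X, y))

W = cp.Variable((d, m))           # column k = coefficients of classifier k
lam = 1.0                         # trade-off constant (arbitrary choice)

# sum of hinge losses over all related problems
hinge = sum(cp.sum(cp.pos(1 - cp.multiply(y, X @ W[:, k])))
            for k, (X, y) in enumerate(tasks))

# L1,inf norm: sum over features of the max absolute coefficient across problems
l1_inf = cp.sum(cp.max(cp.abs(W), axis=1))

cp.Problem(cp.Minimize(hinge + lam * l1_inf)).solve()
shared_rows = np.sum(np.max(np.abs(W.value), axis=1) > 1e-6)
print("features used jointly:", shared_rows)
```

A modeling layer like this reduces the hinge and max terms to linear constraints internally, which is essentially the linear program written out on the slide.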

Step III: Compute the discriminative feature representation
- Define the set of relevant features to be R = { j : max_k |w_{jk}| > 0 }, i.e. the features with a non-zero row in the joint solution.
- Create the new representation by taking all the features of x' corresponding to the indexes in R.
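
Step III then amounts to thresholding the row maxima of the joint solution; a tiny self-contained sketch, where W_hat and X_proj are made-up stand-ins for the joint solution and the projected target data:

```python
import numpy as np

rng = np.random.default_rng(0)
W_hat = rng.standard_normal((40, 5)) * (rng.random((40, 1)) < 0.2)  # stand-in for the joint solution
X_proj = rng.standard_normal((25, 40))                              # target data in the new space

R = np.where(np.max(np.abs(W_hat), axis=1) > 1e-6)[0]   # relevant features: rows with non-zero max
X_target = X_proj[:, R]                                  # discriminative representation for Step IV
print(X_target.shape)
```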

Experiments: Dataset
- Reuters dataset: news images labelled with 108 topics; we predict the 10 most frequent topics (binary prediction per topic).
Data partitions:
- 3000 unlabeled images.
- Labeled training sets of sizes 1, 5, 10, 15, …
- Held-out images used as testing data, and further images used as the source of supervised training data from related topics.
- Each training set has n positive examples and 2*n negative examples.

Dataset topics (counts in brackets): SuperBowl [341], Danish Cartoons [178], Sharon [321], Australian Open [209], Trapped Coal Miners [196], Golden Globes [167], Grammys [170], Figure Skating [146], Academy Awards [135], Iraq [125].

Baseline representation
- 'Bag of words' representation that combines color, texture, and raw local image information.
- Sample image patches on a fixed grid.
- For each image patch compute:
- Color features based on HSV color histograms.
- Texture features based on mean responses of Gabor filters at different scales and orientations.
- Raw features: normalized pixel values.
- Create a visual dictionary: for each feature type, do vector quantization to create a dictionary V of 2000 visual words.

Baseline representation
- For every feature type, map each patch (sampled over the fixed grid) to its closest visual word in the corresponding dictionary.
- The final representation is the histogram of visual-word counts, where the i-th entry is the number of times an image patch was mapped to the i-th word of the corresponding dictionary.
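
A rough sketch of the dictionary-and-histogram pipeline for one feature type, using k-means for the vector quantization step. The patch descriptors are random stand-ins (real ones would be HSV histograms, Gabor responses, or raw pixels), and the dictionary here has 50 words instead of the 2000 used on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

# pooled per-patch descriptors from many images (toy stand-in)
rng = np.random.default_rng(0)
patches_all = rng.standard_normal((5000, 32))

# visual dictionary: cluster centers are the visual words
dictionary = KMeans(n_clusters=50, n_init=4, random_state=0).fit(patches_all)

def bag_of_words(image_patches):
    """Map each patch to its closest visual word and return the count histogram."""
    words = dictionary.predict(image_patches)
    return np.bincount(words, minlength=dictionary.n_clusters)

v = bag_of_words(rng.standard_normal((200, 32)))   # representation of one image
```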

Setting
- Step 1: Learn a representation using the unlabeled dataset and the labeled datasets from 9 topics.
- Step 2: Train a classifier for the 10th, held-out topic using the learnt representation.
- As evaluation we use the equal error rate averaged over the 10 topics.

Experiments
Three models, all linear SVMs:
- Baseline model (RFB): uses the raw representation.
- Low Rank Baseline (LRB): uses as its representation the projection through Q, where Q consists of the h highest eigenvectors of the matrix A computed in the first step of the algorithm.
- Sparse Transfer model (SPT): uses the representation computed by our algorithm.
For both LRB and SPT we used an RBF kernel when computing the representation from the unlabeled data.

Results:

Mean equal error rate per topic for classifiers trained with five positive examples, for the RFB model and the SPT model. SB: SuperBowl; GG: Golden Globes; DC: Danish Cartoons; Gr: Grammys; AO: Australian Open; Sh: Sharon; FS: Figure Skating; AA: Academy Awards; Ir: Iraq.

Results

Conclusion
Summary:
- We described a method for learning discriminative sparse image representations from unlabeled images + images from related tasks.
- The method is based on learning a representation from the unlabeled data and performing joint sparse approximation on the data from related tasks to find a subset of discriminative features.
- The induced representation improves performance when learning with very few examples.

Future work
- Develop an efficient algorithm for solving the joint optimization that will scale to very large datasets.
- Combine different representations.

Joint Sparse Approximation
- Discovers image representations that improve learning with few examples.
- The LP formulation is feasible for small problems but becomes intractable for larger datasets.

Outline A Joint Sparse Approximation Model for Multi-task Learning An Efficient Algorithm Experiments

Joint Sparse Approximation as a Constrained Convex Optimization Problem
- Reformulate the problem as minimizing a convex function (the joint loss) subject to convex constraints (an L1,∞-ball constraint on W).
- We use a projected subgradient method. Main advantages: simple, scalable, guaranteed convergence rates.
- Projected subgradient methods have recently been proposed for:
- L2 regularization, i.e. SVMs [Shalev-Shwartz et al. 2007].
- L1 regularization [Duchi et al. 2008].
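
A generic projected-subgradient loop for this kind of constrained problem. The step-size schedule, the toy tasks, and the L2-ball projection used as a simple stand-in for the L1,∞ projection (developed on the next slides) are all assumptions.

```python
import numpy as np

def projected_subgradient(grad, project, W0, steps=200, eta0=0.1):
    """Subgradient step followed by projection back onto the feasible set."""
    W = W0.copy()
    for t in range(1, steps + 1):
        W = project(W - (eta0 / np.sqrt(t)) * grad(W))   # decaying step size
    return W

# toy use: joint hinge loss over tasks; the feasible set here is an L2 ball
rng = np.random.default_rng(0)
d, m = 20, 3
tasks = [(rng.standard_normal((15, d)), rng.choice([-1.0, 1.0], 15)) for _ in range(m)]

def grad(W):
    G = np.zeros_like(W)
    for k, (X, y) in enumerate(tasks):
        margin = y * (X @ W[:, k])
        # hinge-loss subgradient: -y_i x_i for examples violating the margin
        G[:, k] = -(X * y[:, None])[margin < 1].sum(axis=0)
    return G

def project_l2(W, radius=5.0):
    nrm = np.linalg.norm(W)
    return W if nrm <= radius else W * (radius / nrm)

W_star = projected_subgradient(grad, project_l2, np.zeros((d, m)))
```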

Euclidean Projection onto the L1,∞ Ball
- Projection: find the matrix inside the L1,∞ ball closest in Euclidean distance to the current iterate A.
- Inspecting the Lagrangian shows that the projection reduces to finding new maximums μ_j for each feature across tasks and using them to truncate A.
- The total mass removed from a feature across tasks should be the same for all features whose coefficients do not vanish.
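
The truncation itself is a one-liner once the per-feature maximums μ are known (here μ is just a made-up vector; computing the right μ is the subject of the next slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))          # 4 features x 6 tasks (toy matrix)
mu = np.array([0.8, 0.5, 0.0, 1.2])      # assumed per-feature maximums

# truncate each row of A so its entries have absolute value at most mu_j
B = np.sign(A) * np.minimum(np.abs(A), mu[:, None])
```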

Euclidean Projection onto the L1,∞ Ball (illustration: coefficients of Features I–III across tasks t1–t6; example with 3 features, 6 problems, Q = 14)

Euclidean Projection onto the L1,∞ Ball

An Efficient Algorithm for Finding μ
- Recall that we need a vector of maximums μ that sums to Q and a constant θ such that the mass loss is the same for every feature.
- Consider, for each feature, the mass-loss function of μ_j and its inverse: both are piecewise linear.
- Now consider the function that takes a mass loss θ and computes the corresponding sum of the μ_j (the norm of the new matrix): this is also piecewise linear.
- We can construct this function and pick the θ for which the sum equals Q.

Complexity
- The total cost of the algorithm is dominated by the initial sort of each row of A and the merge of these rows to build the piecewise-linear function.
- The total cost is on the order of O(dm log(dm)).
- Notice that we only need to consider the non-zero rows of A, so d is really the number of non-zero rows rather than the actual number of features.

Outline A Joint Sparse Approximation Model for Multi-task Learning An Efficient Algorithm Experiments

Synthetic Experiments
- We use the same projected subgradient method and compare three different types of projection steps:
- L1,∞ projection
- L2 projection
- L1 projection
- All tasks consisted of predicting functions of the same form, parameterized by a task-specific weight vector.
- To generate jointly sparse vectors, we randomly selected 10% of the features to be the relevant feature set V; then for each task we randomly selected a subset v of V and zeroed all parameters outside v.

Synthetic Experiments (plots: test error, and performance on predicting the relevant features)

Dataset: News story prediction
- Topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped Coal Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.
- 40 tasks.
- Raw image representation: bag of visual words, 3000 dimensions.
- Linear kernel.

Image Classification Experiments

Conclusion and Future Work
- Presented a simple and effective algorithm for training joint models with L1,∞ constraints.
- The algorithm scales linearly with the number of examples and as O(dm log(dm)) with the number of problems and dimensions.
- Experiments on a real image dataset show that the algorithm finds solutions that are jointly sparse, resulting in lower test error.
- We believe our approach can be extended to an online multi-task learning setting.

Future Work: Online Multitask Classification [Cavallanti et al. 2008]
- There are m binary classification tasks indexed by 1, …, m.
- At each time step t = 1, 2, …, T the learner receives a task index k and the corresponding instance vector.
- Based on this information the learner outputs a binary prediction and then observes the correct label y_k.
- We are interested in comparing the learner's mistake count to that of the optimal predictors.

Thanks!

Lagrangian