Slide 1: Transfer Learning Algorithms for Image Classification
Ariadna Quattoni, MIT CSAIL
Advisors: Michael Collins, Trevor Darrell

Slide 2: Motivation
Goal:
- We want to be able to build classifiers for thousands of visual categories.
- We want to exploit rich and complex feature representations.
Problem:
- We might only have a few labeled samples per category.
Solution:
- Transfer learning: leverage labeled data from multiple related tasks.

Slide 3: Thesis Contributions
We study efficient transfer algorithms for image classification which can exploit supervised training data from a set of related tasks:
- Learn an image representation using supervised data from auxiliary tasks automatically derived from unlabeled images + meta-data.
- A transfer learning model based on joint regularization, and an efficient optimization algorithm for training jointly sparse classifiers in high-dimensional feature spaces.

Slide 4: A method for learning image representations from unlabeled images + meta-data
Pipeline: large dataset of unlabeled images + meta-data → create auxiliary problems → structure learning [Ando & Zhang, JMLR 2005] → visual representation.
Structure learning: each auxiliary task has task-specific parameters plus parameters shared across all tasks.
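A minimal sketch of the idea on this slide, under stated assumptions: images are given as feature vectors, the meta-data is a set of keywords per image, and shared structure is learned in the Ando & Zhang style (fit one linear predictor per auxiliary task, then take an SVD of the stacked weights to obtain a shared projection). Function and variable names are illustrative, not from the thesis code.

```python
import numpy as np

def auxiliary_labels(keywords_per_image, vocab):
    """One binary auxiliary problem per meta-data keyword:
    does this image's meta-data contain the keyword?"""
    return np.array([[1.0 if w in kws else -1.0 for w in vocab]
                     for kws in keywords_per_image])          # n x m

def learn_shared_structure(X, Y, h=50, reg=1.0):
    """Ridge-regression predictors for the m auxiliary tasks, then an SVD of
    the d x m weight matrix; the top-h left singular vectors define the
    learned visual representation."""
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)   # d x m task weights
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :min(h, U.shape[1])].T                       # h x d shared projection
    return theta

# New representation for an image x: z = theta @ x, used by downstream tasks.
```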

Slide 5: A method for learning image representations from unlabeled images + meta-data (illustration)

Slide 6: Outline
- An overview of transfer learning methods.
- A joint sparse approximation model for transfer learning.
- Asymmetric transfer experiments.
- An efficient training algorithm.
- Symmetric transfer image annotation experiments.

Slide 7: Transfer Learning: A brief overview
The goal of transfer learning is to use labeled data from related tasks to make learning easier. Two settings:
- Asymmetric transfer. Resource: large amounts of supervised data for a set of related tasks. Goal: improve performance on a target task for which training data is scarce.
- Symmetric transfer. Resource: a small amount of training data for a large number of related tasks. Goal: improve average performance over all classifiers.

Slide 8: Transfer Learning: A brief overview
Three main approaches:
- Learning intermediate latent representations: [Thrun 1996, Baxter 1997, Caruana 1997, Argyriou 2006, Amit 2007]
- Learning priors over parameters: [Raina 2006, Lawrence et al.]
- Learning relevant shared features: [Torralba 2004, Obozinski 2006]

Slide 9: Feature Sharing Framework
- Work with a rich representation: complex features in a high-dimensional space. Some of them will (hopefully) be very discriminative; most will be irrelevant.
- Related problems may share relevant features.
- If we knew the relevant features, we could learn from fewer examples and build more efficient classifiers.
- We can train classifiers from related problems together using a regularization penalty designed to promote joint sparsity.

Slides 10-12: (illustrations of feature sharing across four example categories: Church, Airport, Grocery Store, Flower-Shop)

Slide 13: Related Formulations of Joint Sparse Approximation
- Obozinski et al. [2006] proposed an L1-2 joint penalty and developed a blockwise boosting scheme based on Boosted Lasso.
- Torralba et al. [2004] developed a joint boosting algorithm based on the idea of learning additive models for each class that share weak learners.

Slide 14: Our Contribution
A new model and optimization algorithm for training jointly sparse classifiers:
- Previous approaches to joint sparse approximation have relied on greedy coordinate descent methods.
- We propose a simple and efficient global optimization algorithm with guaranteed convergence rates.
- We test our model on real image classification tasks, where we observe improvements in both asymmetric and symmetric transfer settings.
- Our algorithm scales to large problems involving hundreds of tasks and thousands of examples and features.

Slide 15: Outline (next: a joint sparse approximation model for transfer learning)

Slide 16: Notation
- Collection of tasks: D = {D_1, ..., D_m}, where D_k = {(x_1^k, y_1^k), ..., (x_{n_k}^k, y_{n_k}^k)} is the training set for task k, with x in R^d and y in {+1, -1}.
- Joint sparse approximation: learn a parameter matrix W in R^{d x m} whose k-th column w_k is the linear classifier for task k; each row of W collects one feature's coefficients across all tasks.

Slide 17: Single Task Sparse Approximation
- Consider learning a single sparse linear classifier of the form f(x) = w · x.
- We want only a few features with non-zero coefficients.
- Recent work suggests using L1 regularization: minimize the classification error plus an L1 penalty that discourages non-sparse solutions.
- Donoho [2004] proved (in a regression setting) that the solution with smallest L1 norm is also the sparsest solution.
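The classifier form and objective on the original slide are not preserved in this transcript; a standard way to write them, consistent with the description above, is:

```latex
\min_{w \in \mathbb{R}^d} \;\; \sum_{i=1}^{n} \ell\bigl(f(x_i),\, y_i\bigr)
\;+\; \lambda \sum_{j=1}^{d} |w_j|,
\qquad f(x) = w \cdot x
```

The first term is the classification error (e.g. the hinge loss) and the second is the L1 penalty that drives individual coefficients to zero.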

Slide 18: Joint Sparse Approximation
Setting: minimize the average loss over the collection of tasks D plus a joint regularization term that penalizes solutions that utilize too many features across the collection.

Slide 19: Joint Regularization Penalty
How do we penalize solutions that use too many features? Arrange the classifiers as a matrix W whose rows hold the coefficients for each feature across tasks and whose columns hold the coefficients for each classifier. The ideal penalty, counting the number of non-zero rows of W, would lead to a hard combinatorial problem.

Slide 20: Joint Regularization Penalty
Instead, we use an L1-∞ norm [Tropp 2006]. This norm combines two norms:
- An L1 norm on the maximum absolute values of the coefficients across tasks promotes sparsity: use few features.
- The L∞ norm on each row promotes non-sparsity within each row: share features.
The combination results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
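Written out (the formula itself did not survive the transcript), the L1-∞ norm of the d × m coefficient matrix W is

```latex
\|W\|_{1,\infty} \;=\; \sum_{j=1}^{d} \; \max_{1 \le k \le m} |W_{jk}|
```

i.e. an L1 norm over the vector of row maxima, where row j holds feature j's coefficients across the m tasks.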

Slide 21: Joint Sparse Approximation
Using the L1-∞ norm, we can rewrite our objective function as the average loss over all tasks plus the L1-∞ penalty on W (see the reconstruction below). For any convex loss this is a convex objective; for the hinge loss, the optimization problem can be expressed as a linear program.
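A reconstruction of the rewritten objective, consistent with the notation above (the trade-off constant is written here as λ):

```latex
\min_{W \in \mathbb{R}^{d \times m}} \;\;
\sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \ell\bigl(w_k \cdot x_i^k,\; y_i^k\bigr)
\;+\; \lambda \sum_{j=1}^{d} \max_{k} |W_{jk}|
```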

Slide 22: Joint Sparse Approximation
Linear program formulation (hinge loss): introduce one slack variable per training example and one bound variable per feature.
- Max-value constraints: each coefficient is bounded in absolute value by its feature's bound variable.
- Slack-variable constraints: each example satisfies its margin up to its (non-negative) slack.
A sketch of the full LP follows.
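The constraints did not survive the transcript; a standard LP for this objective with the hinge loss, matching the two constraint groups named above, is:

```latex
\begin{aligned}
\min_{W,\;\xi,\;t} \quad & \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \xi_i^k \;+\; \lambda \sum_{j=1}^{d} t_j \\
\text{s.t.} \quad & -t_j \;\le\; W_{jk} \;\le\; t_j && \text{(max-value constraints)} \\
& y_i^k \bigl(w_k \cdot x_i^k\bigr) \;\ge\; 1 - \xi_i^k, \qquad \xi_i^k \ge 0 && \text{(slack-variable constraints)}
\end{aligned}
```

Here t_j bounds the absolute value of feature j's coefficients across all tasks, so the penalty term sums the row maxima of W.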

Slide 23: Outline (next: asymmetric transfer experiments)

Slide 24: Setting: Asymmetric Transfer
Ten news topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.
- Learn a representation using labeled data from 9 topics: learn the matrix W using our transfer algorithm, and define the set of relevant features R from W (see below).
- Train a classifier for the 10th held-out topic using the relevant features R only.
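The definition of R is missing from the transcript; a natural reading, given that W is trained to be row-sparse, is the set of features with at least one non-zero coefficient:

```latex
R \;=\; \bigl\{\, j \;:\; \max_{k} |W_{jk}| > 0 \,\bigr\}
```

(in practice a small threshold may replace the strict zero).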

Slide 25: Results (figure)

Slide 26: Outline (next: an efficient training algorithm)

Slide 27: Limitations of the LP formulation
- The LP formulation can be optimized using standard LP solvers.
- It is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions.
- We might also want a more general optimization algorithm that can handle arbitrary convex losses.

Slide 28: L1-∞ Regularization: Constrained Convex Optimization Formulation
Equivalent constrained form: minimize a convex function (the average loss) subject to a convex constraint (||W||_{1,∞} ≤ C).
- We use a projected subgradient method. Main advantages: simple, scalable, guaranteed convergence rates.
- Projected subgradient methods have recently been proposed for L2 regularization, i.e. SVMs [Shalev-Shwartz et al. 2007], and for L1 regularization [Duchi et al. 2008].
A sketch of the training loop follows.
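A minimal sketch of the projected-subgradient loop described on this slide, assuming the hinge loss and a feature-by-task weight matrix W. Names (train_joint_sparse, project) are illustrative; `project` should be a Euclidean projection onto the L1-∞ ball of radius C, e.g. the project_l1_inf sketch given after slide 33 below.

```python
import numpy as np

def train_joint_sparse(tasks, C, project, epochs=100, eta0=1.0):
    """tasks: list of (X_k, y_k) pairs, X_k of shape (n_k, d), y_k in {-1, +1}."""
    d = tasks[0][0].shape[1]
    m = len(tasks)
    W = np.zeros((d, m))
    for t in range(1, epochs + 1):
        G = np.zeros_like(W)
        for k, (X, y) in enumerate(tasks):
            margins = y * (X @ W[:, k])
            viol = margins < 1                       # examples violating the margin
            # subgradient of the average hinge loss for task k
            G[:, k] = -(X[viol].T @ y[viol]) / len(y)
        # take a subgradient step, then project back onto the L1,inf ball
        W = project(W - (eta0 / np.sqrt(t)) * G, C)
    return W
```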

Slide 29: Euclidean Projection onto the L1-∞ ball
The projection step: given a matrix A, find the closest matrix (in Euclidean distance) whose L1-∞ norm is at most C.

Slide 30: Characterization of the solution (1)

Slide 31: Characterization of the solution (continued)
(Figure over four example features: the projection clips each row of A at a row-specific maximum μ, so rows with large entries are truncated while rows with little total mass are driven toward zero.)

Slide 32: Mapping to a simpler problem
We can map the projection problem to the following reduced problem, which finds the optimal row maximums μ (see the reconstruction below):
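The reduced problem itself did not survive the transcript; from the characterization of the projection (each row of A is clipped at a maximum μ_j), it can be written as

```latex
\min_{\mu \ge 0} \;\; \sum_{j=1}^{d} \sum_{k=1}^{m} \bigl(\,|A_{jk}| - \mu_j\,\bigr)_{+}^{2}
\qquad \text{s.t.} \qquad \sum_{j=1}^{d} \mu_j = C
```

where (z)_+ = max(z, 0) and the constraint is tight whenever A lies outside the ball. The projected matrix is then B_{jk} = sign(A_{jk}) · min(|A_{jk}|, μ_j).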

Slide 33: Efficient Algorithm, in pictures
(Worked example of the projection with 4 features, 6 problems, and C = 14; the figures are not preserved in this transcript.)
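The sort-based algorithm is shown only pictorially on the slides. The sketch below computes the same projection by bisection on the Lagrange multiplier of the reduced problem above; it is simpler to state but is not the sort-based method of the thesis, and all names are illustrative. It can be passed as the `project` argument to the training-loop sketch after slide 28.

```python
import numpy as np

def project_l1_inf(A, C, tol=1e-8):
    """Approximate Euclidean projection of A (features x tasks) onto the
    L1,inf ball of radius C, via bisection on the Lagrange multiplier."""
    absA = np.abs(A)
    if absA.max(axis=1).sum() <= C:          # already inside the ball
        return A.copy()

    def mu_of_theta(theta):
        # For each row j, mu_j solves sum_k (|A_jk| - mu_j)_+ = theta,
        # or mu_j = 0 when the whole row's mass is below theta.
        mu = np.empty(A.shape[0])
        for j, row in enumerate(absA):
            if row.sum() <= theta:
                mu[j] = 0.0
                continue
            lo, hi = 0.0, row.max()          # bisection on mu_j
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if np.clip(row - mid, 0, None).sum() > theta:
                    lo = mid
                else:
                    hi = mid
            mu[j] = 0.5 * (lo + hi)
        return mu

    # Outer bisection on theta: sum_j mu_j(theta) is decreasing in theta.
    lo, hi = 0.0, absA.sum()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mu_of_theta(mid).sum() > C:
            lo = mid
        else:
            hi = mid
    mu = mu_of_theta(0.5 * (lo + hi))
    # Clip each row of A at its optimal maximum mu_j, keeping signs.
    return np.sign(A) * np.minimum(absA, mu[:, None])
```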

Slide 34: Complexity
The total cost of the algorithm is dominated by a sort of the entries of A; for a d × m matrix this is on the order of O(dm log(dm)).

Slide 35: Outline (next: symmetric transfer image annotation experiments)

Slide 36: Synthetic Experiments
- Generate a jointly sparse parameter matrix W (only a few rows are non-zero).
- For every task, generate example pairs (x, y), with y given by the sign of that task's linear function of x plus label noise (a sketch of such a generator follows).
- We compared three different types of regularization (i.e. projections): the L1-∞ projection, the L2 projection, and the L1 projection.
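One plausible way to generate data matching this description (the exact recipe from the slide is not preserved in the transcript); all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_jointly_sparse_tasks(d=200, m=20, relevant=10, n_per_task=50, flip=0.05):
    """Row-sparse d x m matrix W plus noisy linear-classifier labels per task."""
    W = np.zeros((d, m))
    rows = rng.choice(d, size=relevant, replace=False)   # shared relevant features
    W[rows, :] = rng.normal(size=(relevant, m))          # non-zero only on those rows
    tasks = []
    for k in range(m):
        X = rng.normal(size=(n_per_task, d))
        y = np.sign(X @ W[:, k])
        y[y == 0] = 1.0
        noise = rng.random(n_per_task) < flip             # flip a few labels
        y[noise] *= -1
        tasks.append((X, y))
    return W, tasks

W_true, tasks = make_jointly_sparse_tasks()
```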

Slide 37: Synthetic Experiments (figures: test error, and performance on predicting the relevant features)

Slide 38: Dataset: Image Annotation
- 40 top content words (e.g. president, actress, team).
- Raw image representation: Vocabulary Tree (Grauman and Darrell 2005, Nister and Stewenius 2006), with … dimensions.

Slide 39: Results (figure; the differences are statistically significant)

Slide 40: Results (figure)

Slide 41: Dataset: Indoor Scene Recognition
- 67 indoor scene categories (e.g. bakery, bar, train station).
- Raw image representation: similarities to a set of unlabeled images; 2000 dimensions.

Slide 42: Results (figure)

Slide 43: (figure only)

Slide 44: Summary of Thesis Contributions
- A method that learns image representations using unlabeled images + meta-data.
- A transfer model based on performing a joint loss minimization over the training sets of related tasks with a shared regularization.
- Previous approaches used greedy coordinate descent methods; we propose an efficient global optimization algorithm for learning jointly sparse models.
- We presented experiments on real image classification tasks for both an asymmetric and a symmetric transfer setting.
- A tool that makes implementing an L1-∞ penalty as easy, and almost as efficient, as implementing the standard L1 and L2 penalties.

Slide 45: Future Work
- Task clustering.
- Online optimization.
- Generalization properties of L1-∞ regularized models.
- Combining feature representations.

Slide 46: Thanks!