Discriminative Machine Learning Topic 4: Weak Supervision M. Pawan Kumar Slides available online

Computer Vision Data (annotation detail vs. dataset size, log scale): Segmentation ~2,000 images; Bounding Box ~1M; Image-Level labels ("Car", "Chair") >14M; Noisy Labels >6B

Computer Vision Data: detailed annotation is expensive, sometimes annotation is impossible, and the desired annotation keeps changing. Therefore, learn with missing information (latent variables).

Annotation Mismatch (Action Classification): input x is an image, annotation y = "jumping", latent h is e.g. the location of the person. There is a mismatch between the desired and the available annotations; the exact value of the latent variable is not "important". The desired output at test time is y.

Output Mismatch (Action Detection): input x is an image, annotation y = "jumping", latent h is e.g. the location of the person. There is a mismatch between the available annotation (an action-classification label) and the desired output; the exact value of the latent variable is important. The desired output at test time is (y, h).

Annotation Mismatch (Action Classification): we will focus on this case; output mismatch is out of scope. The desired output at test time is y.

Outline: Latent SVM, Optimization, Practice. References: Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009.

Weakly Supervised Data: input x, output y ∈ {-1, +1}, hidden h (e.g. an image x with y = +1 and h the location of the person).

Weakly Supervised Classification: feature Φ(x,h), joint feature vector Ψ(x,y,h).

Weakly Supervised Classification: Ψ(x,+1,h) = [Φ(x,h); 0].

Weakly Supervised Classification: Ψ(x,-1,h) = [0; Φ(x,h)].

Weakly Supervised Classification: score f : Ψ(x,y,h) → (-∞, +∞). Optimize the score over all possible y and h.
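
A minimal sketch in Python of how the joint feature vector above can be built for the binary case, assuming a fixed-length feature map Φ(x,h) is already available; the function and argument names are illustrative, not from the lecture code.

    import numpy as np

    def joint_feature(phi_xh, y):
        # Psi(x,+1,h) = [Phi(x,h); 0] and Psi(x,-1,h) = [0; Phi(x,h)],
        # as defined on the slides above.
        d = phi_xh.shape[0]
        psi = np.zeros(2 * d)
        if y == +1:
            psi[:d] = phi_xh
        else:
            psi[d:] = phi_xh
        return psi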

Latent SVM: parameters w. Scoring function: w^T Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h).
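
A sketch of the corresponding prediction rule, enumerating candidate labels and latent values; it reuses the joint_feature sketch above, and phi and latent_space are hypothetical callables returning Φ(x,h) and the candidate latent values for x.

    def predict(w, x, phi, latent_space, labels=(-1, +1)):
        # (y(w), h(w)) = argmax over y and h of the linear score w^T Psi(x,y,h).
        best_score, best_yh = float("-inf"), None
        for y in labels:
            for h in latent_space(x):
                score = w @ joint_feature(phi(x, h), y)
                if score > best_score:
                    best_score, best_yh = score, (y, h)
        return best_yh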

Learning Latent SVM: training data {(x_i, y_i), i = 1, 2, …, n}. Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w)). There is no restriction on the loss function (annotation mismatch).

Learning Latent SVM: min_w Σ_i Δ(y_i, y_i(w)) is non-convex, and the parameters cannot be regularized. Find a regularization-sensitive upper bound.

Learning Latent SVM: Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - w^T Ψ(x_i, y_i(w), h_i(w)).

Learning Latent SVM: upper bound Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) - max_{h_i} w^T Ψ(x_i, y_i, h_i), where (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h).

Learning Latent SVM: min_w ||w||^2 + C Σ_i ξ_i, s.t. max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. The parameters can now be regularized. Is this also convex?

Learning Latent SVM: min_w ||w||^2 + C Σ_i ξ_i, s.t. max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. Convex minus convex: a difference-of-convex (DC) program.

Recap. Scoring function: w^T Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} w^T Ψ(x,y,h). Learning: min_w ||w||^2 + C Σ_i ξ_i, s.t. for all y, h: w^T Ψ(x_i, y, h) + Δ(y_i, y) - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i.
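
Written as code, the per-example surrogate that the slack ξ_i upper-bounds looks roughly as follows; a sketch with the same hypothetical phi and latent_space helpers, where delta is the loss Δ.

    def surrogate_loss(w, x_i, y_i, phi, latent_space, delta, labels=(-1, +1)):
        # max_{y,h} [ w^T Psi(x_i,y,h) + Delta(y_i,y) ] - max_{h} w^T Psi(x_i,y_i,h)
        def score(y, h):
            return w @ joint_feature(phi(x_i, h), y)
        loss_augmented = max(score(y, h) + delta(y_i, y)
                             for y in labels for h in latent_space(x_i))
        latent_completion = max(score(y_i, h) for h in latent_space(x_i))
        return loss_augmented - latent_completion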

Outline: Latent SVM, Optimization, Practice.

Learning Latent SVM: min_w ||w||^2 + C Σ_i ξ_i, s.t. max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) ≤ ξ_i. A difference-of-convex (DC) program.

Concave-Convex Procedure: the objective max_{y,h} [w^T Ψ(x_i, y, h) + Δ(y_i, y)] - max_{h_i} w^T Ψ(x_i, y_i, h_i) is a convex term plus a concave term. Repeat until convergence: (1) linearly upper-bound the concave part; (2) optimize the resulting convex upper bound. How do we obtain the linear upper bound?

Linear Upper Bound: -w^T Ψ(x_i, y_i, h_i^*) ≥ -max_{h_i} w^T Ψ(x_i, y_i, h_i), where h_i^* = argmax_{h_i} w_t^T Ψ(x_i, y_i, h_i) and w_t is the current estimate.

CCCP for Latent SVM: start with an initial estimate w_0. Repeat until convergence: (1) update h_i^* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i.
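
A sketch of this CCCP loop under the same assumptions as the earlier snippets, with a hypothetical solver solve_convex_ssvm(imputed_data, C, eps) for the convex structural-SVM problem obtained once the latent variables h_i^* are fixed.

    import numpy as np

    def cccp_latent_svm(w0, data, phi, latent_space, C, eps, max_iter=50):
        w = w0
        for _ in range(max_iter):
            # Step 1: impute the latent variables with the current parameters.
            imputed = [(x_i, y_i,
                        max(latent_space(x_i),
                            key=lambda h: w @ joint_feature(phi(x_i, h), y_i)))
                       for (x_i, y_i) in data]
            # Step 2: solve the convex problem to eps-optimality (hypothetical solver).
            w_new = solve_convex_ssvm(imputed, C, eps)
            if np.linalg.norm(w_new - w) < 1e-6:  # crude convergence check
                break
            w = w_new
        return w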

Outline: Latent SVM, Optimization, Practice.

Action Classification: input x is an image, output y is an action class, e.g. y = "Using Computer". Dataset: PASCAL VOC, 80/20 train/test split, 5 folds. Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking. Training data: inputs x_i with outputs y_i.

Setup: 0-1 loss function; poselet-based feature vector; 4 seeds for random initialization; code + data; train/test scripts with hyperparameter settings.
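
The Δ used in these experiments is the 0-1 loss; as a trivial sketch, matching the delta argument in the earlier snippets:

    def zero_one_loss(y_true, y_pred):
        # Delta(y_i, y) = 0 if the labels agree, 1 otherwise.
        return 0.0 if y_true == y_pred else 1.0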

Results (plots in the original slides): objective value, train error, test error, and training time.

Outline: Latent SVM, Optimization, Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function).

Start with an initial estimate w_0. Repeat until convergence: (1) update h_i^* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i. Problem: overfitting in the initial iterations.

Annealing the tolerance: start with an initial estimate w_0 and a loose tolerance ε' > ε. Repeat until convergence: (1) update h_i^* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε'-optimal solution of min ||w||^2 + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i; (3) anneal ε' ← ε'/K until ε' = ε.
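
A sketch of the annealed-tolerance variant, under the same assumptions as before (hypothetical solve_convex_ssvm, phi and latent_space); the reading taken here is that ε' starts loose and is divided by K each round until it reaches ε.

    def cccp_annealed_tolerance(w0, data, phi, latent_space, C, eps, eps0, K,
                                max_iter=50):
        w, eps_t = w0, eps0              # eps0 > eps: solve only loosely at first
        for _ in range(max_iter):
            imputed = [(x, y,
                        max(latent_space(x),
                            key=lambda h: w @ joint_feature(phi(x, h), y)))
                       for (x, y) in data]
            w = solve_convex_ssvm(imputed, C, eps_t)  # hypothetical convex solver
            eps_t = max(eps_t / K, eps)               # anneal the tolerance
        return w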

Results (plots in the original slides): objective value, train error, test error, and training time.

Outline: Latent SVM, Optimization, Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function).

Start with an initial estimate w_0. Repeat until convergence: (1) update h_i^* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i. Problem: overfitting in the initial iterations.

Annealing the regularization: start with an initial estimate w_0 and a small regularization weight C' < C. Repeat until convergence: (1) update h_i^* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||^2 + C' Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i; (3) anneal C' ← C' × K until C' = C.
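
The annealed-regularization variant is symmetric; a sketch under the same assumptions, reading the slide as starting from a small C' that is multiplied by K each round until it reaches C.

    def cccp_annealed_regularization(w0, data, phi, latent_space, C, eps, C0, K,
                                     max_iter=50):
        w, C_t = w0, C0                  # C0 < C: heavy regularization at first
        for _ in range(max_iter):
            imputed = [(x, y,
                        max(latent_space(x),
                            key=lambda h: w @ joint_feature(phi(x, h), y)))
                       for (x, y) in data]
            w = solve_convex_ssvm(imputed, C_t, eps)  # hypothetical convex solver
            C_t = min(C_t * K, C)                     # anneal the regularization
        return w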

Results (plots in the original slides): objective value, train error, test error, and training time.

Outline: Latent SVM, Optimization, Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function). Reference: Kumar, Packer and Koller, NIPS 2010.

CCCP for Human Learning: teach everything at once, from 1 + 1 = 2 and 1/3 + 1/6 = 1/2 straight to e^{iπ} + 1 = 0. "Math is for losers!!" FAILURE … a bad local minimum.

Self-Paced Learning: start with 1 + 1 = 2, then 1/3 + 1/6 = 1/2, then e^{iπ} + 1 = 0. "Euler was a genius!!" SUCCESS … a good local minimum.

Self-Paced Learning: start with "easy" examples, then consider "hard" ones. Deciding what is easy vs. hard is expensive, and easy for a human ≠ easy for a machine. So simultaneously estimate the easiness and the parameters; easiness is a property of data sets, not of single instances.

CCCP for Latent SVM: start with an initial estimate w_0. Repeat: (1) update h_i^* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i.

Self-Paced Learning: min ||w||^2 + C Σ_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i.

Self-Paced Learning: min ||w||^2 + C Σ_i v_i ξ_i, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i, with v_i ∈ {0,1}. Problem: this admits a trivial solution (set all v_i = 0).

Self-Paced Learning: min ||w||^2 + C Σ_i v_i ξ_i - Σ_i v_i / K, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i, with v_i ∈ {0,1}. A large K selects only the easiest examples, a medium K selects more, and a small K selects nearly all of them.

Self-Paced Learning: relax to v_i ∈ [0,1]. min ||w||^2 + C Σ_i v_i ξ_i - Σ_i v_i / K, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i. This is a biconvex problem, solved by alternating convex search.

SPL for Latent SVM: start with an initial estimate w_0. Repeat: (1) update h_i^* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i); (2) update w_{t+1} as the ε-optimal solution of min ||w||^2 + C Σ_i v_i ξ_i - Σ_i v_i / K, s.t. w^T Ψ(x_i, y_i, h_i^*) - w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i; (3) decrease K ← K/μ.
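
A sketch of one self-paced round under the same assumptions. For binary v_i the inner minimisation has a closed form: v_i = 1 exactly when C·ξ_i < 1/K, so an example counts as "easy" when its slack is below 1/(C·K); μ = 2 below is just an illustrative annealing factor.

    def spl_select_easy(slacks, C, K):
        # Minimise C * sum_i v_i * xi_i - sum_i v_i / K over v_i in {0, 1}.
        return [1 if C * xi < 1.0 / K else 0 for xi in slacks]

    def spl_round(w, data, phi, latent_space, delta, C, K, eps):
        imputed = [(x, y,
                    max(latent_space(x),
                        key=lambda h: w @ joint_feature(phi(x, h), y)))
                   for (x, y) in data]
        slacks = [surrogate_loss(w, x, y, phi, latent_space, delta)
                  for (x, y) in data]
        easy = [ex for ex, v in zip(imputed, spl_select_easy(slacks, C, K)) if v]
        w_new = solve_convex_ssvm(easy, C, eps)  # hypothetical convex solver
        return w_new, K / 2.0                    # decrease K <- K/mu (mu = 2 here)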

Results (plots in the original slides): objective value, train error, test error, and training time.

Outline: Latent SVM, Optimization, Practice (Annealing the Tolerance, Annealing the Regularization, Self-Paced Learning, Choice of Loss Function). Reference: Behl, Mohapatra, Jawahar and Kumar, PAMI 2015.

Ranking: six images ranked 1-6; when every positive is ranked above every negative, Average Precision = 1.

Ranking: different orderings of the six images, compared by Average Precision and accuracy: AP = 1 (accuracy = 1); AP = 0.92 (accuracy = 0.67); AP = 0.81.

Ranking: during testing, AP is frequently used to evaluate performance, but during training a surrogate loss (such as the 0-1 loss) is used. This contradicts the philosophy of loss-based learning, so optimize AP directly.
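
For reference, a sketch of how average precision is computed for a ranked list of binary relevance labels (1 = positive, 0 = negative, ordered from rank 1 down); the printed example matches the 0.92 figure on the earlier slide.

    def average_precision(relevance):
        hits, precisions = 0, []
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)   # precision at each positive
        return sum(precisions) / len(precisions) if precisions else 0.0

    # Three positives ranked 1, 2 and 3 out of six give AP = 1.0;
    # ranked 1, 2 and 4 they give (1/1 + 2/2 + 3/4) / 3 ≈ 0.92.
    print(average_precision([1, 1, 0, 1, 0, 0]))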

Results: optimizing AP during training yields a statistically significant improvement.

Questions?