# Curriculum Learning for Latent Structural SVM


Curriculum Learning for Latent Structural SVM
(under submission) M. Pawan Kumar, Benjamin Packer, Daphne Koller

Aim
To learn accurate parameters for a latent structural SVM. Input x, Output y ∈ Y, Hidden variable h ∈ H. Example: “Deer”, with Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}

Aim
To learn accurate parameters for a latent structural SVM. Feature Φ(x,y,h) (e.g. HOG, BoW), parameters w.
(y*, h*) = argmax_{y∈Y, h∈H} wᵀΦ(x,y,h)
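The prediction rule above can be sketched by exhaustive search over labels and hidden states; a minimal sketch, assuming a hypothetical joint feature map `phi` (in practice HOG or bag-of-words features for the chosen y and h):

```python
import numpy as np

def predict(w, x, labels, hidden, phi):
    """Latent structural SVM prediction by exhaustive search:
    (y*, h*) = argmax over y in Y, h in H of  w . phi(x, y, h).
    `phi` is a hypothetical joint feature map returning a vector."""
    best_pair, best_score = None, -np.inf
    for y in labels:
        for h in hidden:
            s = float(w @ phi(x, y, h))
            if s > best_score:
                best_pair, best_score = (y, h), s
    return best_pair
```

Exhaustive enumeration is only feasible for small Y and H; real detectors replace the inner loop with an efficient argmax (e.g. a sliding-window search over boxes).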

Motivation
“Math is for losers!!” Real Numbers → Imaginary Numbers → e^{iπ} + 1 = 0. FAILURE … BAD LOCAL MINIMUM

Motivation
“Euler was a genius!!” Real Numbers → Imaginary Numbers → e^{iπ} + 1 = 0. SUCCESS … GOOD LOCAL MINIMUM
Curriculum Learning: Bengio et al., ICML 2009

Motivation
Start with “easy” examples, then consider “hard” ones. Simultaneously estimate easiness and parameters. Easiness is a property of data sets, not single instances. Annotating easy vs. hard is expensive, and easy for a human ≠ easy for the machine.

Outline Latent Structural SVM Concave-Convex Procedure
Curriculum Learning Experiments

Latent Structural SVM
Felzenszwalb et al., 2008; Yu and Joachims, 2009
Training samples xi, ground-truth labels yi. Loss function Δ(yi, yi(w), hi(w))

Latent Structural SVM
(yi(w), hi(w)) = argmax_{y∈Y, h∈H} wᵀΦ(xi, y, h)
min_w ||w||² + C ∑i Δ(yi, yi(w), hi(w))
Non-convex objective → minimize an upper bound

Latent Structural SVM
(yi(w), hi(w)) = argmax_{y∈Y, h∈H} wᵀΦ(xi, y, h)
min ||w||² + C ∑i ξi
s.t. max_{hi} wᵀΦ(xi, yi, hi) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi, for all y, h
Still non-convex, but a difference of convex functions → the CCCP algorithm converges to a local minimum

Outline Latent Structural SVM Concave-Convex Procedure
Curriculum Learning Experiments

Concave-Convex Procedure
Start with an initial estimate w0.
Update hi = argmax_{h∈H} wtᵀΦ(xi, yi, h)
Update wt+1 by solving the convex problem:
min ||w||² + C ∑i ξi
s.t. wᵀΦ(xi, yi, hi) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi
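The alternation above can be sketched as a short loop; `solve_convex` stands in for a structural-SVM solver for the convex sub-problem (a hypothetical callback, not part of the slides):

```python
import numpy as np

def cccp(w0, X, Y, hidden, phi, solve_convex, iters=20):
    """Concave-Convex Procedure for the latent structural SVM (sketch).
    Each iteration: (1) impute hidden variables with the current w,
    (2) solve the resulting convex structural-SVM problem for w."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        # step 1: h_i = argmax_h  w . phi(x_i, y_i, h)
        H = [max(hidden, key=lambda h: w @ phi(x, y, h))
             for x, y in zip(X, Y)]
        # step 2: convex update of w given the imputed h_i
        w_new = np.asarray(solve_convex(X, Y, H), dtype=float)
        if np.allclose(w_new, w):   # converged to a local minimum
            break
        w = w_new
    return w
```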

Concave-Convex Procedure
CCCP looks at all samples simultaneously, so “hard” samples cause confusion. Better: start with “easy” samples, then consider “hard” ones.

Outline Latent Structural SVM Concave-Convex Procedure
Curriculum Learning Experiments

Curriculum Learning REMINDER
Simultaneously estimate easiness and parameters. Easiness is a property of data sets, not single instances.

wT(xi,yi,hi) - wT(xi,y,h)
Curriculum Learning Start with an initial estimate w0 Update hi = maxhH wtT(xi,yi,h) Update wt+1 by solving a convex problem min ||w||2 + C∑i i wT(xi,yi,hi) - wT(xi,y,h) ≥ (yi, y, h) - i 16

wT(xi,yi,hi) - wT(xi,y,h)
Curriculum Learning min ||w||2 + C∑i i wT(xi,yi,hi) - wT(xi,y,h) ≥ (yi, y, h) - i 17

wT(xi,yi,hi) - wT(xi,y,h)
Curriculum Learning vi  {0,1} min ||w||2 + C∑i vii wT(xi,yi,hi) - wT(xi,y,h) ≥ (yi, y, h) - i Trivial Solution 18

Curriculum Learning
min ||w||² + C ∑i vi ξi − ∑i vi /K
s.t. wᵀΦ(xi, yi, hi) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi
Large K → only the easiest samples are selected; Medium K → more samples enter; Small K → all samples are selected.

Curriculum Learning
Relax vi ∈ [0,1]: a biconvex problem
min ||w||² + C ∑i vi ξi − ∑i vi /K
s.t. wᵀΦ(xi, yi, hi) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi

Curriculum Learning
Start with an initial estimate w0.
Update hi = argmax_{h∈H} wtᵀΦ(xi, yi, h)
Update wt+1 by solving the biconvex problem:
min ||w||² + C ∑i vi ξi − ∑i vi /K
s.t. wᵀΦ(xi, yi, hi) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi
Decrease K → K/μ

Outline Latent Structural SVM Concave-Convex Procedure
Curriculum Learning Experiments

Object Detection
Input x: image. Output y ∈ Y. Latent h: bounding box. Δ: 0/1 loss.
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}. Feature Φ(x,y,h): HOG

Object Detection Mammals Dataset 271 images, 6 classes
90/10 train/test split 5 folds

Object Detection: result plots comparing CCCP and Curriculum (objective value and test error).

Handwritten Digit Recognition
Input x: image. Output y ∈ Y. Latent h: rotation. Δ: 0/1 loss. MNIST Dataset, Y = {0, 1, …, 9}. Feature Φ(x,y,h): PCA + projection

Handwritten Digit Recognition: result plots comparing CCCP and Curriculum (significant differences marked).

Feature (x,y,h) - Ng and Cardie, ACL 2002
Motif Finding Input x - DNA Sequence Output y  Y Y = {0, 1} Latent h - Motif Location  - 0/1 Loss Feature (x,y,h) - Ng and Cardie, ACL 2002

Motif Finding UniProbe Dataset 40,000 sequences 50/50 train/test split
5 folds

Motif Finding Average Hamming Distance of Inferred Motifs

Motif Finding Objective Value

Motif Finding Test Error

Noun Phrase Coreference
Input x: nouns. Output y: clustering. Latent h: spanning forest over nouns. Feature Φ(x,y,h): Yu and Joachims, ICML 2009

Noun Phrase Coreference
MUC6 Dataset 60 documents 50/50 train/test split 1 predefined fold

Noun Phrase Coreference: result plots for the MITRE loss and pairwise loss (significant improvements and decrements marked).

Summary
Automatic curriculum learning via a concave-biconvex procedure.
Generalizes to other latent-variable models, e.g. Expectation-Maximization: the E-step remains the same; the M-step includes the indicator variables vi.
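The EM generalization can be sketched under the assumption that the SVM objective carries over directly, i.e. minimizing ∑i vi·(−log p(xi; θ)) − ∑i vi/K, which selects vi = 1 iff the sample's current log-likelihood exceeds −1/K. A minimal, hypothetical illustration with the simplest possible M-step (re-estimating a Gaussian mean):

```python
import numpy as np

def self_paced_m_step(data, loglik, K):
    """Sketch of an M-step with indicator variables v_i (assumption:
    v_i = 1 iff the sample's current log-likelihood > -1/K, mirroring
    the SVM selection rule). The usual M-step then runs on that subset."""
    data = np.asarray(data, dtype=float)
    v = (np.asarray(loglik) > -1.0 / K).astype(float)
    if v.sum() == 0:                 # nothing selected: fall back to all
        return float(data.mean())
    return float(v @ data / v.sum())  # weighted mean over easy samples
```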