
1
**Efficient Large-Scale Structured Learning**

Steve Branson (Caltech), Oscar Beijbom (UC San Diego), Serge Belongie (UC San Diego). CVPR 2013, Portland, Oregon.

2
**Overview**

- Structured prediction
- Learning from larger datasets

[Figure: motivating applications — large datasets (Tiny Images), object detection, deformable part models, and cost-sensitive learning over a class taxonomy (Mammal → Primate {Gorilla, Orangutan} / Hoofed Mammal {Odd-toed, Even-toed})]

3
Overview
- Available tools for structured learning are not as refined as tools for binary classification
- Two sources of speed improvement:
  - Faster stochastic dual optimization algorithms
  - Application-specific importance sampling routine

4
**Summary**

- Usually, train time = 1–10× test time
- Publicly available software package
- Fast algorithms for multiclass SVMs and DPMs
- API to adapt to new applications
- Supports datasets too large to fit in memory
- Network interface for online & active learning

5
**Summary**

- Deformable part models: 50–1000× faster than SVMstruct, mining hard negatives, and SGD-PEGASOS
- Cost-sensitive multiclass SVM: 10–50× faster than SVMstruct; as fast as 1-vs-all binary SVM

6
**Binary vs. Structured**

- Binary learner (SVM, Boosting, Logistic Regression, etc.): binary dataset → binary output, 𝑌 ∈ {−1, +1}
- Structured learner (object detection, pose registration, attribute prediction, etc.): structured dataset → structured output, e.g., 𝑌 = (𝑥, 𝑦, 𝑤, ℎ)

7
Binary vs. Structured
- Pros: the binary classifier is application independent
- Cons: what is lost in terms of accuracy at convergence? Computational efficiency?

8
**Binary vs. Structured**

The binary 0/1 loss Δ01 is replaced by a convex upper bound (e.g., hinge or exponential loss); the structured analogue is the prediction loss Δ(𝑔(𝑋), 𝑌𝑔𝑡). This convex relaxation is the source of computational speed.

9
**Binary vs. Structured**

Just as the binary 0/1 loss Δ01 is approximated by a convex upper bound (e.g., hinge or exponential loss), the structured prediction loss Δ(𝑔(𝑋), 𝑌) is approximated by a convex upper bound ℓ(𝑋; 𝑤).

10
Binary vs. Structured
Application-specific optimization algorithms that:
- Converge to lower test error than binary solutions
- Give lower test error for all amounts of train time


12
**Structured SVM**

SVMs with structured output, also known as max-margin MRFs. [Taskar et al. NIPS'03], [Tsochantaridis et al. ICML'04]
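For reference, SSVMs are conventionally trained with the margin-rescaling objective of Tsochantaridis et al., written here with the deck's symbols (features 𝜓, loss Δ, regularization 𝜆) — a standard form, not copied from the slides:

```latex
\min_{w}\;\; \frac{\lambda}{2}\,\lVert w \rVert^{2}
  + \frac{1}{n} \sum_{i=1}^{n} \max_{Y} \Big[\, \Delta(Y, Y_i)
      + w^{\top} \psi(X_i, Y) - w^{\top} \psi(X_i, Y_i) \,\Big]
```

The inner maximization is the convex upper bound ℓ(𝑋; 𝑤) discussed earlier; evaluating it requires loss-augmented inference, which is why the prediction time 𝑇 appears in the training complexities below.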


15
**Binary SVM Solvers**

Faster linear SVM solvers: Sequential or Stochastic Dual ≥ SGD > Cutting Plane ≫ Non-Linear Kernel
- Non-linear kernel: LIBSVM, SVMlight
- Cutting plane (SVMperf): quadratic → linear in trainset size
- SGD (PEGASOS): linear → independent of trainset size
- Sequential or stochastic dual (LIBLINEAR): faster on multiple passes, detects convergence, less sensitive to regularization/learning rate


17
**Structured SVM Solvers**

Applied to SSVMs, the ordering reverses: Cutting Plane > SGD ≥ Sequential or Stochastic Dual [Ratliff et al. AIStats'07], [Shalev-Shwartz et al. JMLR'13]. With regularization λ, approximation factor ϵ, trainset size n, and prediction time T:
- Cutting plane (SVMstruct): 𝑂(𝑇𝑛/(𝜆𝜖))
- SGD: 𝑂(𝑇/(𝜆𝜖))
- Sequential or stochastic dual: 𝑂(𝑇/(𝜆𝜖)), or 𝑂(𝑇𝑛 log(1/𝜖)/𝜆)

18
**Our Approach**

- Use faster stochastic dual algorithms
- Incorporate an application-specific importance sampling routine: reduces train time when prediction time T is large
- Incorporate tricks people use for binary methods

(Per iteration: draw a random example → importance sample → maximize the dual SSVM objective w.r.t. the samples.)

19
**Our Approach**

For t = 1, … do
  Choose a random training example (𝑋𝑖, 𝑌𝑖)
  𝑌¹, 𝑌², …, 𝑌ᴷ ← ImportanceSample(𝑋𝑖, 𝑌𝑖; 𝑤ᵗ⁻¹)
  Approximately maximize the dual SSVM objective w.r.t. example i, evaluating 1 dot product per sample 𝑌ᵏ
end

(Provably fast convergence for a simple approximate solver.)
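The loop above can be sketched on a toy multiclass problem (a minimal illustration, not the authors' implementation: this `importance_sample` simply returns the top-k loss-augmented labels, and a capped passive-aggressive-style update stands in for the approximate dual maximization):

```python
import numpy as np

def psi(x, y, n_classes):
    """Joint feature map: x copied into the block for class y."""
    f = np.zeros(n_classes * x.size)
    f[y * x.size:(y + 1) * x.size] = x
    return f

def importance_sample(x, y_true, w, n_classes, k=3):
    """Toy sampler: the k labels with highest loss-augmented score."""
    scores = [np.dot(w, psi(x, y, n_classes)) + (y != y_true)
              for y in range(n_classes)]
    return np.argsort(scores)[::-1][:k]

def train(X, Y, n_classes, epochs=20, C=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(n_classes * X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):      # random training example
            x, y = X[i], Y[i]
            for y_hat in importance_sample(x, y, w, n_classes):
                if y_hat == y:
                    continue
                # loss-augmented margin violation for this sample (0/1 loss)
                viol = 1.0 + np.dot(w, psi(x, y_hat, n_classes)
                                    - psi(x, y, n_classes))
                if viol > 0:
                    d = psi(x, y, n_classes) - psi(x, y_hat, n_classes)
                    tau = min(C, viol / np.dot(d, d))  # capped dual-style step
                    w = w + tau * d
    return w

def predict(x, w, n_classes):
    return int(np.argmax([np.dot(w, psi(x, y, n_classes))
                          for y in range(n_classes)]))
```

The cap C on the step size τ plays the role of keeping the implicit dual variables feasible; on separable data the loop reaches zero training error within a few passes.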

20
**Recent Papers w/ Similar Ideas**

- Augmenting cutting-plane SSVMs with m-best solutions: A. Guzman-Rivera, P. Kohli, D. Batra. "DivMCuts…" AISTATS'13.
- Applying stochastic dual methods to SSVMs: S. Lacoste-Julien et al. "Block-Coordinate Frank-Wolfe…" JMLR'13.

21
**Applying to New Problems**

1. Define the loss function Δ(𝑌, 𝑌𝑖)
2. Implement the feature extraction routine 𝜓(𝑋, 𝑌)
3. Implement the importance sampling routine
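The three plug-in points could be grouped as an interface along these lines (a hypothetical sketch; the class and method names are invented for illustration, not the package's actual API):

```python
from abc import ABC, abstractmethod

class StructuredProblem(ABC):
    """Hypothetical plug-in interface mirroring the three routines above."""

    @abstractmethod
    def loss(self, y, y_true):
        """1. Loss function Δ(Y, Y_i)."""

    @abstractmethod
    def psi(self, x, y):
        """2. Joint feature vector ψ(X, Y)."""

    @abstractmethod
    def importance_sample(self, x, y_true, w, k):
        """3. Return up to k candidate outputs for (x, y_true) under w."""

class MulticlassProblem(StructuredProblem):
    """Minimal concrete example: K-way classification with 0/1 loss."""

    def __init__(self, n_classes, dim):
        self.n_classes, self.dim = n_classes, dim

    def loss(self, y, y_true):
        return float(y != y_true)

    def psi(self, x, y):
        f = [0.0] * (self.n_classes * self.dim)
        f[y * self.dim:(y + 1) * self.dim] = list(x)
        return f

    def importance_sample(self, x, y_true, w, k):
        def aug(y):  # loss-augmented score of candidate label y
            score = sum(wi * fi for wi, fi in zip(w, self.psi(x, y)))
            return score + self.loss(y, y_true)
        return sorted(range(self.n_classes), key=aug, reverse=True)[:k]
```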

22
**Applying to New Problems**

3. The importance sampling routine should:
- Be fast
- Favor samples with high loss-augmented score 𝑤ᵀ𝜓(𝑋𝑖, 𝑌ᵏ) + Δ(𝑌ᵏ, 𝑌𝑖)
- Favor samples with uncorrelated features: small 𝜓(𝑋𝑖, 𝑌ʲ) ∙ 𝜓(𝑋𝑖, 𝑌ᵏ)

23
**Example: Object Detection**

1. Loss function: Δ(𝑌, 𝑌𝑔𝑡) = 1 − area(𝑌 ∩ 𝑌𝑔𝑡) / area(𝑌 ∪ 𝑌𝑔𝑡)
2. Features: 𝜓(𝑋, 𝑌)
3. Importance sampling routine: add sliding-window scores and loss into a dense score map, then apply greedy non-maximum suppression (NMS)
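This intersection-over-union loss is direct to implement for axis-aligned boxes (a minimal sketch, with boxes given as (x, y, w, h) as in the slide's 𝑌):

```python
def iou_loss(box, box_gt):
    """Δ(Y, Y_gt) = 1 − area(Y ∩ Y_gt) / area(Y ∪ Y_gt) for (x, y, w, h) boxes."""
    ax, ay, aw, ah = box
    bx, by, bw, bh = box_gt
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return 1.0 - inter / union if union > 0 else 1.0
```

The loss is 0 for a perfect detection and approaches 1 as the predicted box drifts away from the ground truth.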

24
**Example: Deformable Part Models**

1. Loss function: Δ(𝑌, 𝑌𝑖) = sum of part losses
2. Features: 𝜓(𝑋, 𝑌)
3. Importance sampling routine: dynamic programming, with NMS modified to return a diverse set of poses

25
**Cost-Sensitive Multiclass SVM**

[Figure: class-confusion cost matrix over cat, dog, ant, fly, car, bus]
1. Loss function: class-confusion cost, e.g., Δ(cat, ant) = 4
2. Features: e.g., bag-of-words
3. Importance sampling routine: return all classes — an exact solution using 1 dot product per class
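In this case loss-augmented inference is exact with one dot product per class: add the column of confusion costs for the true class to the per-class scores and take the argmax (a minimal sketch; the weight-matrix layout is an assumption for illustration):

```python
import numpy as np

def loss_augmented_argmax(W, x, y_true, delta):
    """W: (K, d) per-class weights; delta[pred, true]: confusion cost, 0 on the diagonal."""
    scores = W @ x                          # one dot product per class
    return int(np.argmax(scores + delta[:, y_true]))
```

With `delta` all zeros this reduces to ordinary multiclass prediction; nonzero entries bias the search toward expensive confusions.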

26
**Results: CUB-200-2011**

- Pose mixture model with 312 part/pose detectors
- Occlusion/visibility model
- Tree-structured DPM with exact inference

27
**Results: CUB-200-2011**

[Plots for 5794 and 400 training examples]
- ~100× faster than mining hard negatives and SVMstruct
- 10–50× faster than stochastic sub-gradient methods
- Close to convergence after 1 pass through the training set

28
**Results: ImageNet**

- Comparison to other fast linear SVM solvers: faster than LIBLINEAR and PEGASOS
- Comparison to other methods for cost-sensitive SVMs: 50× faster than SVMstruct

29
**Conclusion**

Orders of magnitude faster than SVMstruct:
- Publicly available software package
- Fast algorithms for multiclass SVMs and DPMs
- API to adapt to new applications
- Supports datasets too large to fit in memory
- Network interface for online & active learning

30
Thanks!

31
**Weaknesses**

- Less easily parallelizable than methods based on 1-vs-all (although we do offer a multithreaded version)
- Focused on SVM-based learning algorithms
