Efficient Large-Scale Structured Learning

Presentation on theme: "Efficient Large-Scale Structured Learning" - Presentation transcript:

1 Efficient Large-Scale Structured Learning
Steve Branson (Caltech), Oscar Beijbom (UC San Diego), Serge Belongie (UC San Diego). CVPR 2013, Portland, Oregon.

2 Overview
Structured prediction; learning from larger datasets (e.g. TINY IMAGES). Example applications: object detection with deformable part models, and cost-sensitive learning over a class taxonomy (mammal → primate → gorilla/orangutan; hoofed mammal → odd-toed/even-toed).

3 Overview
Available tools for structured learning are not as refined as those for binary classification. Two sources of speed improvement: (1) faster stochastic dual optimization algorithms; (2) an application-specific importance sampling routine.

4 Summary
Usually, train time = 1-10 times test time. Publicly available software package: fast algorithms for multiclass SVMs and DPMs; API to adapt to new applications; support for datasets too large to fit in memory; network interface for online and active learning.

5 Summary
Deformable part models: 50-1000 times faster than SVMstruct, mining hard negatives, and SGD-PEGASOS. Cost-sensitive multiclass SVM: 10-50 times faster than SVMstruct; as fast as a 1-vs-all binary SVM.

6 Binary vs. Structured
Binary: binary dataset → binary learner (SVM, boosting, logistic regression, etc.) → binary output, Y = -1 or Y = +1. Structured: structured dataset → structured learner → structured output, e.g. Y = (x, y, w, h) for object detection, pose registration, attribute prediction, etc.

7 Binary vs. Structured
The same pipeline, reduced to a binary learner. Pros: the binary classifier is application independent. Cons: what is lost in terms of accuracy at convergence, and of computational efficiency?

8 Binary vs. Structured
Binary loss: the 0/1 loss Δ01 is replaced by a convex upper bound (e.g. hinge or exponential loss); this bound is a source of computational speed. Structured prediction loss: Δ(g(X), Y_gt).

9 Binary vs. Structured
Binary: Δ01 ≈ convex upper bound (e.g. hinge, exponential loss). Structured: Δ(g(X), Y_gt) ≈ ℓ(X; w), a convex upper bound on the structured prediction loss.
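The convex upper bound on the structured loss above is the structured hinge, ℓ(X; w) = max_Y [w^T ψ(X, Y) + Δ(Y, Y_gt)] - w^T ψ(X, Y_gt). A minimal numeric sketch, with toy one-hot features and a 0/1 loss assumed (names illustrative, not the paper's code):

```python
def structured_hinge(w, psi, delta, candidates, y_gt):
    """Structured hinge loss: the best loss-augmented score over
    candidate outputs, minus the score of the ground truth."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    gt_score = dot(w, psi(y_gt))
    aug = max(dot(w, psi(y)) + delta(y, y_gt) for y in candidates)
    return max(0.0, aug - gt_score)

# Toy example: outputs are class labels 0/1/2 with one-hot features.
psi = lambda y: [1.0 if i == y else 0.0 for i in range(3)]
delta = lambda y, y_gt: 0.0 if y == y_gt else 1.0
w = [2.0, 0.5, 0.0]
loss = structured_hinge(w, psi, delta, [0, 1, 2], y_gt=0)  # margin satisfied: 0.0
```

When the ground truth wins every loss-augmented comparison by a margin, the bound is zero, matching the binary hinge's behavior.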

10 Binary vs. Structured
Goal: application-specific optimization algorithms that converge to lower test error than binary solutions, and that achieve lower test error for every amount of train time.

12 Structured SVM
SVMs w/ structured output; max-margin MRFs [Taskar et al. NIPS'03] [Tsochantaridis et al. ICML'04].

13 Binary SVM Solvers
Faster linear SVM solvers, ordered slowest to fastest: non-linear kernel (LIBSVM, SVMlight) ≫ cutting plane (SVMperf) > SGD (PEGASOS) ≥ sequential or stochastic dual (LIBLINEAR). Cutting plane: quadratic to linear in trainset size.

14 Binary SVM Solvers
Faster linear SVM solvers, ordered slowest to fastest: non-linear kernel (LIBSVM, SVMlight) ≫ cutting plane (SVMperf) > SGD (PEGASOS) ≥ sequential or stochastic dual (LIBLINEAR). Cutting plane: quadratic to linear in trainset size; SGD: linear to independent of trainset size.

15 Binary SVM Solvers
Faster linear SVM solvers, ordered slowest to fastest: non-linear kernel (LIBSVM, SVMlight) ≫ cutting plane (SVMperf) > SGD (PEGASOS) ≥ sequential or stochastic dual (LIBLINEAR). Dual methods are faster on multiple passes, can detect convergence, and are less sensitive to regularization/learning rate.

16 Structured SVM Solvers
The same families applied to SSVMs, where the practical ordering has been: cutting plane (SVMstruct) > SGD [Ratliff et al. AIStats'07] ≥ sequential or stochastic dual [Shalev-Shwartz et al. JMLR'13].

17 Structured SVM Solvers
Notation: regularization λ, approximation factor ε, trainset size n, prediction time T. Applied to SSVMs: cutting plane (SVMstruct) O(Tn/(λε)) > SGD O(T/(λε)) ≥ sequential or stochastic dual O(T/(λε)), with the dual rate improving to O(Tn log(1/ε)/λ) [Ratliff et al. AIStats'07] [Shalev-Shwartz et al. JMLR'13].

18 Our Approach
Use faster stochastic dual algorithms; incorporate an application-specific importance sampling routine to reduce train times when prediction time T is large; incorporate the tricks people use for binary methods. Inner loop: random example → importance sample → maximize the dual SSVM objective w.r.t. the samples.

19 Our Approach
For t = 1, 2, … do: choose a random training example (X_i, Y_i); draw (Y_1, Y_2, …, Y_K) ← ImportanceSample(X_i, Y_i; w_{t-1}); approximately maximize the dual SSVM objective w.r.t. example i, evaluating one dot product per sample Y_k. (Provably fast convergence even for a simple approximate solver.)

20 Recent Papers w/ Similar Ideas
Augmenting cutting-plane SSVMs with m-best solutions: A. Guzman-Rivera, P. Kohli, D. Batra, "DivMCuts…", AISTATS'13. Applying stochastic dual methods to SSVMs: S. Lacoste-Julien et al., "Block-Coordinate Frank-Wolfe…", JMLR'13.

21 Applying to New Problems
1. Define the loss function Δ(Y, Y_i). 2. Implement the feature extraction routine ψ(X, Y). 3. Implement the importance sampling routine.
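The three application-specific pieces might look like the following interface sketch (names are illustrative, not the released package's actual API), with a trivial multiclass instantiation for concreteness:

```python
class StructuredApp:
    """Hypothetical shape of the three callbacks the slide lists."""

    def loss(self, y, y_gt):
        """1. Task loss Delta(Y, Y_i)."""
        raise NotImplementedError

    def features(self, x, y):
        """2. Joint feature vector psi(X, Y)."""
        raise NotImplementedError

    def importance_sample(self, x, y_gt, w):
        """3. Candidate outputs, favoring high loss-augmented score
        w . psi(x, y) + Delta(y, y_gt)."""
        raise NotImplementedError

class ZeroOneMulticlass(StructuredApp):
    """Toy instantiation: labels 0..k-1, one-hot-scaled features, 0/1 loss."""
    def __init__(self, k):
        self.k = k
    def loss(self, y, y_gt):
        return 0.0 if y == y_gt else 1.0
    def features(self, x, y):
        return [x if i == y else 0.0 for i in range(self.k)]
    def importance_sample(self, x, y_gt, w):
        return list(range(self.k))   # all classes: exact, 1 dot product each
```

Plugging a new application in then means writing one such class rather than a new solver.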

22 Applying to New Problems
3. The importance sampling routine should: be fast; favor samples with high loss-augmented score w^T ψ(X_i, Y_k) + Δ(Y_k, Y_i); and return samples with uncorrelated features, i.e. small ψ(X_i, Y_j) · ψ(X_i, Y_k).

23 Example: Object Detection
1. Loss function: Δ(Y, Y_gt) = 1 - area(Y ∩ Y_gt) / area(Y ∪ Y_gt), i.e. one minus the overlap with ground truth. 2. Features: ψ(X, Y). 3. Importance sampling routine: add the sliding-window scores and the loss into a dense score map, then run greedy NMS.
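A toy rendering of this recipe, with the dense score map abstracted to a list of (score, box) pairs; the 1 - IoU loss and greedy NMS follow the slide, everything else (box format, threshold) is assumed:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_loss(y, y_gt):
    """Slide's loss: Delta = 1 - area(Y ∩ Y_gt) / area(Y ∪ Y_gt)."""
    return 1.0 - iou(y, y_gt)

def greedy_nms_sample(scored_boxes, k, overlap=0.5):
    """Greedy non-maximum suppression over loss-augmented window scores:
    repeatedly keep the best box, drop boxes overlapping it, until k kept."""
    remaining = sorted(scored_boxes, key=lambda sb: -sb[0])
    kept = []
    while remaining and len(kept) < k:
        score, box = remaining.pop(0)
        kept.append(box)
        remaining = [(s, b) for s, b in remaining if iou(b, box) < overlap]
    return kept
```

Because NMS suppresses overlapping windows, the kept boxes tend to have nearly disjoint supports, which is exactly the "uncorrelated features" property the sampling routine is asked for.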

24 Example: Deformable Part Models
1. Loss function: Δ(Y, Y_i) = sum of part losses. 2. Features: ψ(X, Y). 3. Importance sampling routine: dynamic programming, with NMS modified to return a diverse set of poses.
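Loss-augmented inference in a tree-structured part model is exact dynamic programming; a minimal max-sum sketch over a chain of parts on a discrete location grid (toy scores, not the paper's DPM):

```python
def dp_chain(unary, pairwise):
    """Max-sum dynamic programming over a chain of parts.
    unary[p][l]: score of part p at location l (a loss-augmented model
    folds the per-part loss into these, as on the slide).
    pairwise(l_prev, l): deformation score between adjacent parts.
    Returns the best location per part and the best total score."""
    P, L = len(unary), len(unary[0])
    best = list(unary[0])
    back = []
    for p in range(1, P):
        ptr, cur = [], []
        for l in range(L):
            scores = [best[lp] + pairwise(lp, l) for lp in range(L)]
            lp = max(range(L), key=lambda i: scores[i])
            ptr.append(lp)
            cur.append(scores[lp] + unary[p][l])
        back.append(ptr)
        best = cur
    # Backtrack the argmax configuration from the root.
    l = max(range(L), key=lambda i: best[i])
    path = [l]
    for ptr in reversed(back):
        l = ptr[l]
        path.append(l)
    return list(reversed(path)), max(best)
```

The same message-passing generalizes from a chain to any tree, which is why the slide's tree-structured DPM admits exact inference.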

25 Cost-Sensitive Multiclass SVM
1. Loss function: class-confusion cost over the classes (cat, dog, ant, fly, car, bus), e.g. Δ(cat, ant) = 4. 2. Features: e.g. bag-of-words. 3. Importance sampling routine: return all classes; the solution is exact, using one dot product per class.
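Because every class can be scored with a single dot product, the "sample" here is exact rather than approximate. A small sketch (hypothetical helper, toy cost matrix assumed):

```python
def multiclass_importance_sample(x, y_gt, W, cost):
    """Flat multiclass case: score every class with one dot product and
    rank by loss-augmented score w_c . x + Delta(c, y_gt)."""
    scored = [(sum(wi * xi for wi, xi in zip(W[c], x)) + cost[c][y_gt], c)
              for c in range(len(W))]
    scored.sort(reverse=True)        # hardest (highest-scoring) classes first
    return [c for _, c in scored]

# Toy confusion costs in the slide's spirit: distant classes cost more.
cost = [[0, 1, 4],
        [1, 0, 1],
        [4, 1, 0]]
ranked = multiclass_importance_sample([1.0, 0.0], 0,
                                      [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], cost)
```

Here class 2, the most expensive confusion with the true class 0, is ranked first even though its raw score is lowest.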

26 Results: CUB-200-2011
Pose mixture model with 312 part/pose detectors; occlusion/visibility model; tree-structured DPM with exact inference.

27 Results: CUB-200-2011
With 5794 training examples (also evaluated with 400): ~100X faster than mining hard negatives and SVMstruct; 10-50X faster than stochastic sub-gradient methods; close to convergence after one pass through the training set.

28 Results: ImageNet
Comparison to other fast linear SVM solvers and to other methods for cost-sensitive SVMs: faster than LIBLINEAR and PEGASOS; 50X faster than SVMstruct.

29 Conclusion
Orders of magnitude faster than SVMstruct. Publicly available software package: fast algorithms for multiclass SVMs and DPMs; API to adapt to new applications; support for datasets too large to fit in memory; network interface for online and active learning.

30 Thanks!

31 Weaknesses
Less easily parallelizable than methods based on 1-vs-all (although we do offer a multithreaded version). Focused on SVM-based learning algorithms.

