Slide 1: Efficient Large-Scale Structured Learning
Steve Branson (Caltech), Oscar Beijbom (UC San Diego), Serge Belongie (UC San Diego)
CVPR 2013, Portland, Oregon
Slide 2: Overview
- Structured prediction: object detection, deformable part models, cost-sensitive learning
- Learning from larger datasets (e.g., TINY IMAGES)
[Figure: class taxonomy illustrating cost-sensitive learning; Mammal splits into Primate (Gorilla, Orangutan) and Hoofed Mammal (Odd-toed, Even-toed)]
Slide 3: Overview
- Available tools for structured learning are not as refined as tools for binary classification
- Two sources of speed improvement:
  1. Faster stochastic dual optimization algorithms
  2. An application-specific importance sampling routine
Slide 4: Summary
- Usually, train time = 1-10 times test time
- Publicly available software package:
  - Fast algorithms for multiclass SVMs and DPMs
  - API to adapt to new applications
  - Supports datasets too large to fit in memory
  - Network interface for online & active learning
Slide 5: Summary
- Deformable part models: 50-1000 times faster than SVMstruct, mining hard negatives, and SGD-PEGASOS
- Cost-sensitive multiclass SVM: 10-50 times faster than SVMstruct; as fast as a 1-vs-all binary SVM
Slide 6: Binary vs. Structured
- Binary learner (SVM, Boosting, Logistic Regression, etc.): binary dataset -> binary output, Y = -1 or Y = +1
- Structured learner (object detection, pose registration, attribute prediction, etc.): structured dataset -> structured output, e.g. Y = (x, y, w, h) for a bounding box
Slide 7: Binary vs. Structured
- Pros: a binary classifier is application independent
- Cons: what is lost in terms of accuracy at convergence? In terms of computational efficiency?
Slide 8: Binary vs. Structured
- Binary loss: the 0/1 loss Δ_01 is replaced by a convex upper bound ℓ(X; w), e.g. the hinge or exponential loss. This substitution is the source of computational speed.
- Structured prediction loss: Δ(g(X), Y_gt)

Slide 9: Binary vs. Structured
- The same trick applies to structured prediction: replace Δ(g(X), Y_gt) with ℓ(X; w), a convex upper bound on the structured prediction loss
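For concreteness, the standard margin-rescaled convex upper bound from the SSVM literature, in the slides' notation (the talk does not spell out the variant, but this matches the loss-augmented scores on later slides):

```latex
\ell(X_i; w) \;=\; \max_{Y}\Big[\, w^\top \psi(X_i, Y) + \Delta(Y, Y_i) \,\Big] \;-\; w^\top \psi(X_i, Y_i)
\;\;\ge\;\; \Delta(g(X_i), Y_i), \qquad g(X) = \arg\max_{Y}\; w^\top \psi(X, Y)
```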
10
Binary vs. Structured Application-specific optimization algorithms that: Converge to lower test error than binary solutions Lower test error for all amounts of train time
Slide 12: Structured SVM
- SVMs with structured output; also known as max-margin MRFs [Taskar et al. NIPS'03], [Tsochantaridis et al. ICML'04]
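Written out, the training problem these solvers target is the margin-rescaled SSVM objective:

```latex
\min_{w} \;\; \frac{\lambda}{2}\,\|w\|^2 \;+\; \frac{1}{n}\sum_{i=1}^{n}\;
\max_{Y}\Big[\, w^\top \psi(X_i, Y) + \Delta(Y, Y_i) - w^\top \psi(X_i, Y_i) \,\Big]
```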
Slide 13: Binary SVM Solvers
- Training-time hierarchy for linear SVMs (slowest to fastest): non-linear kernel solvers (LIBSVM, SVMlight) ≫ cutting plane (SVMperf) > SGD (PEGASOS) ≥ sequential or stochastic dual (LIBLINEAR)
- Cutting plane methods: from quadratic to linear in trainset size

Slide 14: Binary SVM Solvers
- SGD: from linear to independent of trainset size

Slide 15: Binary SVM Solvers
- Sequential/stochastic dual methods: faster over multiple passes, can detect convergence, and are less sensitive to the regularization/learning rate
Slide 16: Structured SVM Solvers
- Applied to SSVMs, the same ordering holds: cutting plane (SVMstruct) is slower than SGD, which is no faster than sequential or stochastic dual methods [Ratliff et al. AIStats'07], [Shalev-Shwartz et al. JMLR'13]

Slide 17: Structured SVM Solvers
- Notation: regularization λ, approximation factor ϵ, trainset size n, prediction time T
- Cutting plane (SVMstruct): O(Tn / (λϵ))
- SGD: O(T / (λϵ))
- Sequential/stochastic dual: O(T / (λϵ)), or O(Tn log(1/ϵ) / λ)
Slide 18: Our Approach
- Use faster stochastic dual algorithms
- Incorporate an application-specific importance sampling routine:
  - Reduces train times when prediction time T is large
  - Incorporates the tricks people use for binary methods
- Each iteration: pick a random example, draw an importance sample, then maximize the dual SSVM objective w.r.t. the samples
Slide 19: Our Approach
For t = 1... do
  Choose a random training example (X_i, Y_i)
  Ȳ_1, Ȳ_2, ..., Ȳ_K ← ImportanceSample(X_i, Y_i; w_{t-1})
  Approximately maximize the dual SSVM objective w.r.t. example i's dual variables, evaluating 1 dot product per sample Ȳ_k
end
(Provably fast convergence even for a simple approximate solver)
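A minimal Python sketch of this loop, under stated assumptions: `psi`, `delta`, and `importance_sample` stand in for the application-supplied routines from the slides, and the single-coordinate dual step is a simplified SDCA-style update, not the package's actual solver:

```python
import numpy as np

def train(examples, psi, delta, importance_sample, lam, num_iters, K):
    """Simplified sketch of the stochastic dual loop above.

    examples:          list of (X_i, Y_i) pairs
    psi(X, Y):         joint feature map, returns a vector
    delta(Y, Y_i):     structured loss
    importance_sample: application-specific routine returning K candidates
    """
    w = np.zeros_like(psi(*examples[0]))
    n = len(examples)
    for _ in range(num_iters):
        X_i, Y_i = examples[np.random.randint(n)]     # random example
        phi_true = psi(X_i, Y_i)
        s_true = w.dot(phi_true)                      # computed once
        best_dphi, best_viol = None, 0.0
        for Y_k in importance_sample(X_i, Y_i, w, K):
            phi_k = psi(X_i, Y_k)
            # 1 dot product per sample: the margin violation of Y_k
            viol = delta(Y_k, Y_i) + w.dot(phi_k) - s_true
            if viol > best_viol:
                best_dphi, best_viol = phi_true - phi_k, viol
        if best_dphi is None:
            continue                                  # nothing violated
        # Closed-form step on one dual coordinate, clipped to the box
        # [0, 1/(lam*n)]; per-example bookkeeping of the dual variables
        # is omitted for brevity.
        step = min(best_viol / (best_dphi.dot(best_dphi) + 1e-12),
                   1.0 / (lam * n))
        w += step * best_dphi
    return w
```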
Slide 20: Recent Papers With Similar Ideas
- Augmenting cutting-plane SSVMs with m-best solutions: A. Guzman-Rivera, P. Kohli, D. Batra, "DivMCuts...", AISTATS'13
- Applying stochastic dual methods to SSVMs: S. Lacoste-Julien et al., "Block-Coordinate Frank-Wolfe...", JMLR'13
Slide 21: Applying to New Problems
1. Define the loss function Δ(Y, Y_i)
2. Implement the feature extraction routine ψ(X, Y)
3. Implement the importance sampling routine
Slide 22: Applying to New Problems
3. Implement an importance sampling routine that:
- Is fast
- Favors samples with high loss-augmented score w^T ψ(X_i, Ȳ_k) + Δ(Ȳ_k, Y_i)
- Favors sets of samples whose features are mutually uncorrelated, i.e. small ψ(X_i, Ȳ_j) · ψ(X_i, Ȳ_k)
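Put together, adapting the package to a new problem means supplying three callbacks. A hypothetical Python rendering of that interface (names are illustrative, not the package's actual API):

```python
from abc import ABC, abstractmethod

class StructuredProblem(ABC):
    """The three application-specific routines the API asks for."""

    @abstractmethod
    def loss(self, Y, Y_true):
        """1. Loss function Delta(Y, Y_i)."""

    @abstractmethod
    def features(self, X, Y):
        """2. Joint feature map psi(X, Y), returned as a vector."""

    @abstractmethod
    def importance_sample(self, X, Y_true, w, K):
        """3. Return K candidate labels, favoring high loss-augmented
        score w.psi(X, Y) + Delta(Y, Y_i) and mutually uncorrelated
        features."""
```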
Slide 23: Example: Object Detection
1. Loss function: Δ(Y, Y_gt) = 1 - area(Y ∩ Y_gt) / area(Y ∪ Y_gt), so perfect overlap gives zero loss
2. Features: ψ(X, Y)
3. Importance sampling routine: add the sliding-window scores and the loss into a dense score map, then run greedy non-maximum suppression (NMS)
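A sketch of that sampling routine under simplifying assumptions (fixed-size windows and a precomputed dense score map indexed by window anchor; all names are hypothetical):

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def sample_boxes(score_map, gt_box, box_size, K, overlap=0.5):
    """Importance-sample K boxes: dense loss-augmented map + greedy NMS."""
    H, W = score_map.shape
    h, w = box_size
    aug = score_map.astype(float).copy()
    for y in range(H):                  # add the loss into the score map
        for x in range(W):
            aug[y, x] += 1.0 - box_iou((x, y, w, h), gt_box)
    picked = []
    for _ in range(K):                  # greedy non-maximum suppression
        y, x = np.unravel_index(np.argmax(aug), aug.shape)
        if np.isneginf(aug[y, x]):
            break
        picked.append((x, y, w, h))
        # suppress nearby windows overlapping the picked one too much
        for yy in range(max(0, y - h), min(H, y + h + 1)):
            for xx in range(max(0, x - w), min(W, x + w + 1)):
                if box_iou((xx, yy, w, h), picked[-1]) > overlap:
                    aug[yy, xx] = -np.inf
    return picked
```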
Slide 24: Example: Deformable Part Models
1. Loss function: Δ(Y, Y_i) = sum of part losses
2. Features: ψ(X, Y)
3. Importance sampling routine: dynamic programming, with a modified NMS that returns a diverse set of poses
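The talk does not show the DP itself; for intuition, here is a toy max-sum dynamic program over a chain of parts (part losses assumed folded into the unary scores), the core of loss-augmented inference in a tree-structured part model:

```python
import numpy as np

def chain_dp(unary, pairwise):
    """Max-sum DP over a chain of P parts with L candidate locations each.

    unary:    (P, L) loss-augmented scores per part/location
    pairwise: (P-1, L, L) spring scores between consecutive parts
    Returns the highest-scoring location index for each part.
    """
    P, L = unary.shape
    msg = unary[0].copy()
    back = np.zeros((P - 1, L), dtype=int)
    for p in range(1, P):
        # best predecessor location for each location of part p
        scores = msg[:, None] + pairwise[p - 1]   # (L, L)
        back[p - 1] = np.argmax(scores, axis=0)
        msg = unary[p] + np.max(scores, axis=0)
    locs = [int(np.argmax(msg))]
    for p in range(P - 2, -1, -1):                # backtrack
        locs.append(int(back[p][locs[-1]]))
    return locs[::-1]
```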
Slide 25: Cost-Sensitive Multiclass SVM
1. Loss function: class confusion cost, e.g. Δ(cat, ant) = 4
2. Features: e.g., bag-of-words
3. Importance sampling routine: return all classes; exact solution using 1 dot product per class
[Figure: class confusion cost matrix over the classes cat, dog, ant, fly, car, bus]
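Because every class can be enumerated, the sampling routine here is exact and cheap; a sketch assuming linear per-class scores and a confusion-cost matrix (names hypothetical):

```python
import numpy as np

def multiclass_importance_sample(x, y_true, W, cost):
    """Exact loss-augmented scores: one dot product per class.

    x:    feature vector (e.g., bag-of-words counts)
    W:    (C, d) matrix of per-class weight vectors
    cost: (C, C) class-confusion costs, with cost[y, y] == 0
    Returns all classes ordered by loss-augmented score, most violated first.
    """
    scores = W.dot(x) + cost[y_true]   # one dot product per class + loss
    return np.argsort(-scores)
```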
Slide 26: Results: CUB-200-2011
- Pose mixture model with 312 part/pose detectors
- Occlusion/visibility model
- Tree-structured DPM with exact inference
Slide 27: Results: CUB-200-2011
[Figure: training curves, one panel for 5794 training examples and one for 400 training examples]
- ~100X faster than mining hard negatives and SVMstruct
- 10-50X faster than stochastic sub-gradient methods
- Close to convergence after 1 pass through the training set
Slide 28: Results: ImageNet
- Comparison to other fast linear SVM solvers: faster than LIBLINEAR and PEGASOS
- Comparison to other methods for cost-sensitive SVMs: 50X faster than SVMstruct
Slide 29: Conclusion
- Orders of magnitude faster than SVMstruct
- Publicly available software package:
  - Fast algorithms for multiclass SVMs and DPMs
  - API to adapt to new applications
  - Supports datasets too large to fit in memory
  - Network interface for online & active learning
Slide 30: Thanks!
Slide 31: Weaknesses
- Less easily parallelizable than methods based on 1-vs-all (although we do offer a multithreaded version)
- Focused on SVM-based learning algorithms