Slide 1: SpicyMKL: Efficient multiple kernel learning method using dual augmented Lagrangian
Taiji Suzuki, Ryota Tomioka
Graduate School of Information Science and Technology, Department of Mathematical Informatics, The University of Tokyo
23rd Oct.
Slide 13: Rank 1 decomposition
If g is monotonically increasing and the loss f_ℓ is differentiable at the optimum, the solution of MKL is rank 1: a single coefficient vector is shared across all kernels, i.e., the solution uses a convex combination of the kernels. Proof: take the derivative w.r.t. f_m at the optimum.
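For concreteness, here is the shape of the result in formulas; the notation (f_m, H_m, alpha, c_m) is reconstructed from standard MKL formulations, not read off the slide. With primal objective

    \min_{f_m \in \mathcal{H}_m,\, b}\ \sum_{i=1}^n f_\ell\Big(\sum_{m=1}^M f_m(x_i) + b,\ y_i\Big) + \lambda \sum_{m=1}^M g\big(\|f_m\|_{\mathcal{H}_m}\big),

the claim is that at the optimum every component shares one coefficient vector alpha:

    f_m(\cdot) = c_m \sum_{i=1}^n \alpha_i\, k_m(x_i, \cdot), \qquad c_m \ge 0,\quad \sum_{m} c_m = 1,

so the learned function is equivalently a kernel machine with the single combined kernel \bar{k} = \sum_m c_m k_m.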
Slide 14: Existing approach 1: constraint-based formulation
[Lanckriet et al. 2004, JMLR] [Bach, Lanckriet & Jordan 2004, ICML]
Starting from the primal, the dual is posed as a conic program over the kernel combination (see the sketch below):
Lanckriet et al. 2004: SDP
Bach, Lanckriet & Jordan 2004: SOCP
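A sketch of what the constraint-based formulation looks like, following Lanckriet et al. 2004; the hinge-loss setting and the symbols C, c are my assumptions, not from the slide:

    \min_{\mu \ge 0,\ \mathrm{tr}(\sum_m \mu_m K_m) = c}\ \omega\Big(\sum_m \mu_m K_m\Big),
    \qquad
    \omega(K) = \max_{\alpha}\Big\{ \alpha^\top \mathbf{1} - \tfrac{1}{2}\alpha^\top \mathrm{diag}(y)\,K\,\mathrm{diag}(y)\,\alpha \ :\ 0 \le \alpha \le C,\ \alpha^\top y = 0 \Big\}.

Optimizing the weights mu jointly with the inner SVM dual yields an SDP (or an SOCP in the Bach et al. reformulation), which scales poorly as the number of kernels grows.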
Slide 15: Existing approach 2: upper-bound-based formulation
[Sonnenburg et al. 2006, JMLR] [Rakotomamonjy et al. 2008, JMLR] [Chapelle & Rakotomamonjy 2008, NIPS workshop]
The primal is rewritten through a variational upper bound over kernel weights (Jensen's inequality; see the identity below):
Sonnenburg et al. 2006: cutting plane (SILP)
Rakotomamonjy et al. 2008: gradient descent (SimpleMKL)
Chapelle & Rakotomamonjy 2008: Newton method (HessianMKL)
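The bound these methods exploit can be written explicitly (a standard variational identity; the weights eta are my notation):

    \Big(\sum_{m=1}^M \|f_m\|\Big)^2 \;=\; \min_{\eta_m \ge 0,\ \sum_m \eta_m = 1}\ \sum_{m=1}^M \frac{\|f_m\|^2}{\eta_m},

which follows from Jensen's (or Cauchy-Schwarz) inequality, with equality at eta_m proportional to ||f_m||. SILP, SimpleMKL, and HessianMKL all alternate between updating the kernel weights eta and solving a single-kernel SVM.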
Slide 16: Problem of existing methods
They do not make use of sparsity during the optimization, so they do not perform efficiently when the number of kernels is large.
Slide 17: Outline
Introduction
Details of Multiple Kernel Learning
Our method: proximal minimization; skipping inactive kernels
Numerical experiments: benchmark datasets; sparse or dense
Slide 19: DAL (Tomioka & Sugiyama 09)
DAL (Dual Augmented Lagrangian) handles Lasso, group Lasso, and trace norm regularization.
SpicyMKL = DAL + MKL: a kernelization of DAL.
Slide 20: Proximal minimization
Difficulty of MKL (Lasso): the regularizer is non-differentiable at 0. We relax the problem by proximal minimization (Rockafellar, 1976).
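Written out in generic proximal-point form (w and F are my notation for the MKL variable and objective):

    w^{t+1} = \mathop{\mathrm{argmin}}_{w}\ \Big\{ F(w) + \frac{1}{2\gamma_t}\,\|w - w^t\|^2 \Big\}.

The added quadratic term makes each subproblem strongly convex and, after dualization, differentiable, even though F itself is non-differentiable at 0.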
Slide 21: Our algorithm: proximal minimization (Rockafellar, 1976)
1. Set some initial solution and choose an increasing sequence γ_t.
2. Iterate the update rule until convergence, regularizing toward the last update.
Taking the dual, the update step can be solved efficiently, and it can be shown that the iterates converge super-linearly!
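A toy one-dimensional illustration of the update rule (this is not the SpicyMKL inner solver; it assumes F(w) = |w|, for which the proximal update has a closed-form soft threshold):

    # Toy proximal minimization: minimize F(w) = |w|.
    # The update w_{t+1} = argmin_w |w| + (w - w_t)^2 / (2*gamma_t)
    # has the closed form w_{t+1} = soft_threshold(w_t, gamma_t).
    import numpy as np

    def soft_threshold(v, thresh):
        return np.sign(v) * max(abs(v) - thresh, 0.0)

    w = 5.0                                    # initial solution
    gammas = [0.5 * 2**t for t in range(8)]    # increasing sequence gamma_t
    for t, gamma in enumerate(gammas):
        w = soft_threshold(w, gamma)
        print(f"t={t}  gamma={gamma:.1f}  w={w:.4f}")
    # w reaches the minimizer w* = 0 after a few steps; the increasing
    # gamma_t is what drives the fast convergence.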
Slide 28: What we have observed
We arrived at the dual of the update rule in proximal minimization. It is differentiable, and only active kernels need to be calculated (next slide).
Slide 29: Skipping inactive kernels
Soft threshold vs. hard threshold: if a kernel is inactive, its contribution is exactly 0, and its gradient also vanishes.
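The operator behind "soft threshold" here, in generic notation (v and λ are my symbols):

    \mathrm{ST}_\lambda(v) = \max\Big(0,\ 1 - \frac{\lambda}{\|v\|}\Big)\, v,

so ST_λ(v) = 0 whenever ||v|| ≤ λ, and on the interior of that region the operator is identically zero, hence its derivative vanishes too. An inactive kernel therefore contributes neither to the objective nor to its gradient.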
Slide 30: Derivatives
We can apply the Newton method to the dual optimization. The objective function and its derivatives are computed so that only active kernels contribute, which is efficient for a large number of kernels.
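A minimal sketch of the skipping logic, assuming the dual objective contains, per kernel, a squared-soft-threshold (Moreau-envelope) term in q_m = K_m @ alpha with threshold lam*gamma; all names here (kernels, alpha, lam, gamma) are placeholders, not the paper's interface:

    import numpy as np

    def gradient_over_kernels(alpha, kernels, lam, gamma):
        # Gradient w.r.t. alpha of
        #   sum_m (1/(2*gamma)) * max(0, ||K_m @ alpha|| - lam*gamma)**2 .
        # Inactive kernels are skipped entirely.
        grad = np.zeros_like(alpha)
        n_active = 0
        for K_m in kernels:
            q = K_m @ alpha
            norm_q = np.linalg.norm(q)
            if norm_q <= lam * gamma:    # inactive: term is 0, gradient is 0
                continue                 # -> no further work for this kernel
            n_active += 1
            grad += (1.0 - lam * gamma / norm_q) / gamma * (K_m @ q)
        return grad, n_active

    # Example call with random PSD kernel matrices (placeholder data):
    rng = np.random.default_rng(0)
    Ks = [(lambda A: A @ A.T)(rng.standard_normal((20, 20))) for _ in range(5)]
    grad, n_active = gradient_over_kernels(rng.standard_normal(20), Ks, lam=1.0, gamma=1.0)

Only active kernels pay for matrix-vector products; with a sparse solution most kernels fail the norm check and are skipped.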
Slide 31: Convergence result
Thm. For any proper convex loss and regularizer, and any non-decreasing sequence γ_t, the iterates converge to an optimum.
Thm. If the loss is strongly convex and the regularizer is a sparse regularizer, then for any increasing sequence γ_t, the convergence rate is super-linear.
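One standard way to see the super-linear rate (a bound from the general proximal-point literature, stated under a σ-strong-convexity assumption, not copied from the slide): the proximal map of a σ-strongly convex objective F with minimizer w* is a contraction,

    \|w^{t+1} - w^*\| \;\le\; \frac{1}{1 + \sigma \gamma_t}\, \|w^t - w^*\|,

so an increasing sequence γ_t → ∞ drives the contraction factor to 0, which is exactly super-linear convergence.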
Slide 32: Non-differentiable loss
If the loss is non-differentiable, almost the same algorithm is available by utilizing the Moreau envelope of its dual: primal → dual → Moreau envelope of the dual.
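For reference, the Moreau envelope of a convex function h with parameter γ (standard definition; notation mine):

    h_\gamma(x) \;=\; \min_y\ \Big\{ h(y) + \frac{1}{2\gamma}\,\|x - y\|^2 \Big\},

which is differentiable everywhere with ∇h_γ(x) = (x − prox_{γh}(x)) / γ, so replacing the non-differentiable dual loss by its envelope keeps the Newton machinery applicable.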
Slide 33: Dual of elastic net
The dual of the elastic net regularizer is smooth, so naive optimization in the dual is applicable. But we applied the proximal minimization method when the quadratic part is small, to avoid numerical instability.
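One way to see the smoothness concretely, for a scalar elastic-net penalty φ(w) = λ(θ|w| + ((1−θ)/2)w²), where λ and θ are my notation, not the slide's: its convex conjugate is

    \varphi^*(\alpha) \;=\; \frac{\big(\,|\alpha| - \lambda\theta\,\big)_+^2}{2\lambda(1-\theta)},

which is differentiable everywhere. Note, however, that its curvature 1/(λ(1−θ)) blows up as θ → 1, which is the numerical instability that motivates falling back to proximal minimization.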
Slide 34: Why 'Spicy'?
SpicyMKL = DAL + MKL. "Dal" is also a major Indian cuisine → hot, spicy → SpicyMKL. (Image of dal from Wikipedia.)
Slide 35: Outline
Introduction
Details of Multiple Kernel Learning
Our method: proximal minimization; skipping inactive kernels
Numerical experiments: benchmark datasets; sparse or dense
Slide 41: Sparse or dense?
The elastic net interpolates between a sparse solution at one end of its mixing parameter and a dense solution at the other (see the sketch below). We compare the performance of sparse and dense solutions in a computer vision task.
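A sketch of the regularizer with a mixing parameter θ (my symbol; the slide's did not survive extraction):

    R(w) \;=\; \lambda \sum_{m=1}^M \Big( \theta\, \|w_m\| + \frac{1-\theta}{2}\, \|w_m\|^2 \Big),

with θ = 1 the sparse (block-L1) endpoint, θ = 0 the dense (purely quadratic) endpoint, and intermediate values giving the "medium density" solutions compared below.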
Slide 42: caltech101 dataset (Fei-Fei et al., 2004)
Object recognition task: anchor, ant, cannon, chair, cup; 10 classification problems (10 = 5 × (5 − 1)/2).
Number of kernels: 1760 = 4 (features) × 22 (regions) × 2 (kernel functions) × 10 (kernel parameters).
Features: sift features [Lowe 99] extracted with colorDescriptor [van de Sande et al. 10]: hsvsift and sift (scale = 4px, 8px, automatic scale).
Regions: whole region, 4 divisions, 16 divisions, spatial pyramid (a histogram of features is computed on each region).
Kernel functions: Gaussian kernels and a second kernel type, with 10 kernel parameters each.
Slide 43: Results
Performances are averaged over all classification problems. Medium density performs best; the performance of the sparse solution is comparable with the dense solution.
Slide 44: Conclusion and future works
Conclusion:
We proposed a new MKL algorithm that is efficient when the number of kernels is large (proximal minimization; neglecting inactive kernels).
Medium density showed the best performance, but the sparse solution also works well.
Future work: second-order update.
Slide 45: References
T. Suzuki & R. Tomioka: SpicyMKL. Technical report, arXiv.
[DAL] R. Tomioka & M. Sugiyama: Dual Augmented Lagrangian Method for Efficient Sparse Reconstruction. IEEE Signal Processing Letters, 16(12), 2009.