Taiji Suzuki Ryota Tomioka The University of Tokyo

Slides:

Advertisements

Similar presentations

Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.

Advertisements

Chapter 4 Sampling Distributions and Data Descriptions.

3.6 Support Vector Machines

AP STUDY SESSION 2.

1 Geometric Methods for Learning and Memory A thesis presented by Dimitri Nowicki To Universite Paul Sabatier in partial fulfillment for the degree of.

1 Vorlesung Informatik 2 Algorithmen und Datenstrukturen (Parallel Algorithms) Robin Pomplun.

Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.

Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.

Author: Julia Richards and R. Scott Hawley

Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.

Objectives: Generate and describe sequences. Vocabulary:

UNITED NATIONS Shipment Details Report – January 2006.

David Burdett May 11, 2004 Package Binding for WS CDL.

1 RA I Sub-Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Casablanca, Morocco, 20 – 22 December 2005 Status of observing programmes in RA I.

Properties of Real Numbers CommutativeAssociativeDistributive Identity + × Inverse + ×

Custom Statutory Programs Chapter 3. Customary Statutory Programs and Titles 3-2 Objectives Add Local Statutory Programs Create Customer Application For.

1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt BlendsDigraphsShort.

Spectral Clustering Eyal David Image Processing seminar May 2008.

1 Outline relationship among topics secrets LP with upper bounds by Simplex method basic feasible solution (BFS) by Simplex method for bounded variables.

1 Click here to End Presentation Software: Installation and Updates Internet Download CD release NACIS Updates.

Solve Multi-step Equations

Break Time Remaining 10:00.

Turing Machines.

Table 12.1: Cash Flows to a Cash and Carry Trading Strategy.

PP Test Review Sections 6-1 to 6-6

The Weighted Proportional Resource Allocation Milan Vojnović Microsoft Research Joint work with Thành Nguyen Microsoft Research Asia, Beijing, April, 2011.

Bright Futures Guidelines Priorities and Screening Tables

EIS Bridge Tool and Staging Tables September 1, 2009 Instructor: Way Poteat Slide: 1.

Bellwork Do the following problem on a ½ sheet of paper and turn in.

CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 ACM Principles and Practice of Parallel Programming, PPoPP, 2006 Panel Presentations Parallel Processing is.

The challenge ahead: Ocean Predictions in the Arctic Region Lars Petter Røed * Presented at the OPNet Workshop May 2008, Geilo, Norway * Also affiliated.

Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.

Copyright © 2013, 2009, 2006 Pearson Education, Inc. 1 Section 5.5 Dividing Polynomials Copyright © 2013, 2009, 2006 Pearson Education, Inc. 1.

Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.

1 RA III - Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Buenos Aires, Argentina, 25 – 27 October 2006 Status of observing programmes in RA.

Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.

CONTROL VISION Set-up. Step 1 Step 2 Step 3 Step 5 Step 4.

Adding Up In Chunks.

MaK_Full ahead loaded 1 Alarm Page Directory (F11)

1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.

Artificial Intelligence

A Novel Hemispherical Basis for Accurate and Efficient Rendering P. Gautron J. Křivánek S. Pattanaik K. Bouatouch Eurographics Symposium on Rendering 2004.

1 Using Bayesian Network for combining classifiers Leonardo Nogueira Matos Departamento de Computação Universidade Federal de Sergipe.

Model and Relationships 6 M 1 M M M M M M M M M M M M M M M M

1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.

Analyzing Genes and Genomes

1 Let’s Recapitulate. 2 Regular Languages DFAs NFAs Regular Expressions Regular Grammars.

©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.

Essential Cell Biology

Converting a Fraction to %

Numerical Analysis 1 EE, NCKU Tien-Hao Chang (Darby Chang)

Clock will move after 1 minute

PSSA Preparation.

Essential Cell Biology

Immunobiology: The Immune System in Health & Disease Sixth Edition

Physics for Scientists & Engineers, 3rd Edition

Energy Generation in Mitochondria and Chlorplasts

Select a time to count down from the clock above

Murach’s OS/390 and z/OS JCLChapter 16, Slide 1 © 2002, Mike Murach & Associates, Inc.

Copyright Tim Morris/St Stephen's School

9. Two Functions of Two Random Variables

1 Dr. Scott Schaefer Least Squares Curves, Rational Representations, Splines and Continuity.

How to create Magic Squares

1 Decidability continued…. 2 Theorem: For a recursively enumerable language it is undecidable to determine whether is finite Proof: We will reduce the.

1 Multiple Kernel Learning Naouel Baili MRL Seminar, Fall 2009.

Presentation transcript:

SpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian Taiji Suzuki Ryota Tomioka The University of Tokyo Graduate School of Information Science and Technology Department of Mathematical Informatics 23th Oct., 2009@ISM

Multiple Kernel Learning Fast Algorithm SpicyMKL

Kernel Learning Regression, Classification Kernel Function

Thousands of kernels Multiple Kernel Learning Gaussian, polynomial, chi-square, …. Parameters：Gaussian width, polynomial degree Features Computer Vision：color, gradient, sift (sift, hsvsift, huesift, scaling of sift), Geometric Blur, image regions, ．．． Thousands of kernels Multiple Kernel Learning (Lanckriet et al. 2004) Select important kernels and combine them

Multiple Kernel Learning (MKL) Single Kernel Multiple Kernel dm:convex combination, sparse (many 0 components)

→sparse Lasso Group Lasso Multiple Kernel Learning (MKL) Sparse learning grouping Lasso Group Lasso kernelize Multiple Kernel Learning (MKL) L1-regularization →sparse ：N samples ： M kernels ：RKHS associated with km ：Convex loss（hinge, square, logistic）

：Gram matrix of mth kernel Representer theorem Finite dimensional problem ：Gram matrix of mth kernel

SpicyMKL Scales well against # of kernels. Proposed: SpicyMKL Scales well against # of kernels. Formulated in a general framework.

Outline Introduction Details of Multiple Kernel Learning Our method Proximal minimization Skipping inactive kernels Numerical Experiments Bench-mark datasets Sparse or dense

Details of Multiple Kernel Learning(MKL)

Generalized Formulation MKL : sparse regularization

hinge loss elastic net logistic loss L1regularization hinge: elastic net: logistic: L1regularization:

Rank 1 decomposition If g is monotonically increasing and f` is diff’ble at the optimum, the solution of MKL is rank 1: such that convex combination of kernels Proof Derivative w.r.t.

Existing approach 1 Lanckriet et al. 2004: SDP Constraints based formulation: [Lanckriet et al. 2004, JMLR] [Bach, Lanckriet & Jordan 2004, ICML] primal Dual Lanckriet et al. 2004: SDP Bach, Lanckriet & Jordan 2004: SOCP

Existing approach 2 Sonnenburg et al. 2006: Cutting plane (SILP) Upper bound based formulation: [Sonnenburg et al. 2006, JMLR] [Rakotomamonjy et al. 2008, JMLR] [Chapelle & Rakotomamonjy 2008, NIPS workshop] primal (Jensen’s inequality) Sonnenburg et al. 2006: Cutting plane (SILP) Rakotomamonjy et al. 2008: Gradient descent (SimpleMKL) Chapelle & Rakotomamonjy 2008: Newton method (HessianMKL)

Problem of existing methods Do not make use of sparsity during the optimization. →　Do not perform efficiently when # of kernels is large.

Outline Introduction Details of Multiple Kernel Learning Our method Proximal minimization Skipping inactive kernels Numerical Experiments Bench-mark datasets Sparse or dense

Our method SpicyMKL

DAL (Tomioka & Sugiyama 09) Dual Augmented Lagrangian Lasso, group Lasso, trace norm regularization SpicyMKL = DAL + MKL Kernelization of DAL

Proximal Minimization Difficulty of MKL（Lasso） is non-diff’ble at 0. Relax by Proximal Minimization (Rockafellar, 1976)

Our algorithm super linearly ! SpicyMKL: Proximal Minimization (Rockafellar, 1976) 1 Set some initial solution . Choose increasing sequence . 2 Iterate the following update rule until convergence: Regularization toward the last update. It can be shown that super linearly ! Taking its dual, the update step can be solved efficiently

Fenchel’s duality theorem Split into two parts

Non-diff’ble dual dual Smooth ! Moreau envelope Non-diff’ble

Inner optimization Newton method is applicable. Twice Differentiable Twice Differentiable (a.e.) Even if is not differentiable, we can apply almost the same algorithm.

Rigorous Derivation of Convolution and duality

Moreau envelope Moreau envelope (Moreau 65) For a convex function , we define its Moreau envelope as

Moreau envelope L1-penalty dual Smooth !

What we have observed Differentiable We arrived at … Dual of update rule in proximal minimization Differentiable Only active kernels need to be calculated. (next slide)

Skipping inactive kernels Soft threshold Hard threshold If (inactive), , and its gradient also vanishes.

Derivatives We can apply Newton method for the dual optimization. The objective function: Its derivatives: where Only active kernels contributes their computations. → Efficient for large number of kernels.

Convergence result For any proper convex functions and , Thm. For any proper convex functions and , and any non-decreasing sequence , Thm. If is strongly convex and is sparse reg., for any increasing sequence , the convergence rate is super linear.

Non-differentiable loss If loss is non-differentiable, almost the same algorithm is available by utilizing Moreau envelope of . primal dual Moreau envelope of dual

Dual of elastic net Naïve optimization in the dual is applicable. primal dual Smooth ! Naïve optimization in the dual is applicable. But we applied prox. minimization method for small to avoid numerical instability.

Why ‘Spicy’ ? SpicyMKL = DAL + MKL. DAL also means a major Indian cuisine. 　→ hot, spicy 　→ SpicyMKL “Dal” from Wikipedia

Outline Introduction Details of Multiple Kernel Learning Our method Proximal minimization Skipping inactive kernels Numerical Experiments Bench-mark datasets Sparse or dense

Numerical Experiments

CPU time against # of kernels IDA data set, L1 regularization. CPU time against # of kernels SpicyMKL Splice Ringnorm Scales well against # of kernels.

CPU time against # of samples IDA data set, L1 regularization. CPU time against # of samples SpicyMKL Splice Ringnorm Scaling against # of samples is almost the same as the existing methods.

UCI dataset # of kernels 189 243 918 891 1647

Comparison in elastic net (Sparse v.s. Dense)

Sparse or Dense ? Elastic Net Sparse Dense where . 1 Sparse Dense Compare performances of sparse and dense solutions in a computer vision task.

caltech101 dataset (Fei-Fei et al., 2004) object recognition task anchor, ant, cannon, chair, cup 10 classification problems　 (10=5×(5-1)/2) # of kernels: 1760 = 4 (features)×22 (regions) × 　あああああ2 (kernel functions) × 10 (kernel parameters) sift features [Lowe99]: colorDescriptor [van de Sande et al.10] hsvsift, sift (scale = 4px, 8px, automatic scale) regions: whole region, 4 division, 16 division, spatial pyramid. (computed histogram of features on each region.) kernel functions: Gaussian kernels, kernels with 10 parameters.

Performances are averaged over all classification problems. Best Performance of sparse solution is comparable with dense solution. dense Medium density

Conclusion and Future works まとめと今後の課題 Conclusion We proposed a new MKL algorithm that is efficient when # of kernels is large. proximal minimization neglect ‘in-active’ kernels Medium density showed the best performance, but sparse solution also works well. Future work Second order update

T. Suzuki & R. Tomioka: SpicyMKL. Technical report T. Suzuki & R. Tomioka: SpicyMKL. arXiv: http://arxiv.org/abs/0909.5026 DAL R. Tomioka & M. Sugiyama: Dual Augmented Lagrangian Method for Efficient Sparse Reconstruction. IEEE Signal Proccesing Letters, 16 (12) pp. 1067--1070, 2009.

Thank you for your attention!

Some properties Moreau envelope proximal operator (Fenchel duality th.) (Fenchel duality th.) Differentiable !