Efficient and Numerically Stable Sparse Learning
Sihong Xie (1), Wei Fan (2), Olivier Verscheure (2), and Jiangtao Ren (3)
(1) University of Illinois at Chicago, USA
(2) IBM T.J. Watson Research Center, New York, USA
(3) Sun Yat-Sen University, Guangzhou, China

Sparse Linear Model
- Input: training data
- Output: a sparse linear model
- Learning formulation: loss plus a sparse regularization term (a reconstruction of the formulation follows below)
- Example setting: Large Scale (learning) Contest data
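The input, output, and formulation on this slide were rendered as images and did not survive extraction. The following is a hedged reconstruction of a generic sparse linear learning formulation; the particular loss and regularizer used in the talk are assumptions.

    % Hedged reconstruction; X in R^{n x d} is the data matrix, y in R^n the
    % label vector, w in R^d the linear model, \ell a pointwise loss.
    \min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} \ell\big(y_i, \langle w, x_i \rangle\big) + \lambda\, \Omega(w),
    \qquad \Omega(w) \in \{\, \lVert w \rVert_1,\ \lVert w \rVert_0 \,\}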

Objectives
- Sparsity
- Accuracy
- Numerical stability: friendly to limited-precision arithmetic
- Scalability: large-scale training data (many rows and columns)

Outline
- Numerical instability of two popular approaches
- Proposed sparse linear model: online, numerically stable, parallelizable, good sparsity (don't take features unless necessary)
- Experimental results

Stability in Sparse Learning
- Numerical problems of direct iterative methods
- Numerical problems of mirror descent

Stability in Sparse Learning: Iterative Hard Thresholding (IHT)
Solve the following optimization problem (L0-regularized least squares):
    min_w ||y - X w||_2^2   subject to   ||w||_0 <= s
where w is the linear model, y the label vector, X the data matrix, and s the sparsity degree.

Stability in Sparse Learning: Iterative Hard Thresholding (IHT)
Incorporate gradient descent with hard thresholding. At each iteration:
1. Move in the direction of the negative gradient: w <- w + X^T (y - X w)
2. Hard thresholding: keep only the s most significant (largest-magnitude) elements of w
(A minimal sketch of one iteration follows below.)
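A minimal sketch of one IHT iteration in Python, assuming a squared-error objective and a unit step size (the step size and stopping criterion on the slide are not recoverable and are assumptions here):

    import numpy as np

    def hard_threshold(w, s):
        # Keep the s largest-magnitude entries of w; zero out the rest.
        out = np.zeros_like(w)
        if s > 0:
            top = np.argsort(np.abs(w))[-s:]
            out[top] = w[top]
        return out

    def iht_step(w, X, y, s, step=1.0):
        # One IHT iteration: gradient step on 0.5 * ||y - X w||^2, then hard threshold.
        grad = X.T @ (X @ w - y)
        return hard_threshold(w - step * grad, s)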

Stability in Sparse Learning: Convergence of IHT
Advantages of IHT: simple and scalable. But does the iteration always converge?

Stability in Sparse Learning: Convergence of IHT
For the IHT algorithm to converge, the iteration matrix must have spectral radius less than 1 (a numerical check is sketched below).
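The condition can be checked numerically. A small sketch, assuming the iteration matrix takes the usual form I - step * X^T X for the squared-error objective (the exact matrix used in the paper's analysis is not recoverable from the slide):

    import numpy as np

    def iht_iteration_matrix_unstable(X, step=1.0):
        # Spectral radius of I - step * X^T X; if it is >= 1, the plain
        # gradient iteration underlying IHT can diverge.
        M = np.eye(X.shape[1]) - step * (X.T @ X)
        rho = np.max(np.abs(np.linalg.eigvals(M)))
        return rho >= 1.0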

Experiments: Divergence of IHT (figure)

Stability in Sparse Learning: Example of the error growth of IHT (figure)

Stability in Sparse Learning
- Numerical problems of direct iterative methods (covered above)
- Numerical problems of mirror descent (next)

Stability in Sparse Learning: Mirror Descent Algorithm (MDA)
- Solves the L1-regularized formulation: min_w L(w; X, y) + lambda * ||w||_1
- Maintains two vectors: a primal vector and a dual vector

Stability in Sparse Learning: Mirror Descent Algorithm (MDA)
MDA moves between a dual space and a primal space: the dual vector is mapped to the primal vector through a link function, and soft-thresholding yields the sparse vector. The parameter p controls the link function. (Illustration adapted from Peter Bartlett's lecture slides.)

Stability in Sparse Learning: MDA link function and floating-point precision
A floating-point number is stored as (significant digits) x base^exponent, so only a limited number of significant digits are kept. Example: on a computer with only 4 significant digits, values that differ beyond the fourth digit are indistinguishable (e.g., 0.100004 is stored as 0.1000).

Stability in Sparse Learning
The link function amplifies differences between elements: entries that are close together in the dual vector can end up orders of magnitude apart in the primal vector (illustrated in the sketch below).
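A hedged illustration of this amplification. The link function below is the standard p-norm link from the mirror-descent literature (the gradient of 0.5 * ||theta||_p^2); whether the talk uses exactly this form is an assumption. With p = 2 ln(d), a factor-of-two gap in the dual vector becomes a factor of roughly 2^(p-1) in the primal vector, so many primal entries collapse to infinitesimal values under limited precision.

    import numpy as np

    def pnorm_link(theta, p):
        # Map a dual vector theta to a primal vector via the p-norm link
        # (assumed form: gradient of 0.5 * ||theta||_p^2).
        norm = np.linalg.norm(theta, ord=p)
        if norm == 0:
            return np.zeros_like(theta)
        return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

    d = 16_000_000                    # illustrative dimensionality
    p = 2 * np.log(d)                 # roughly 33, matching the slide's p = 33
    theta = np.array([1.0, 0.5])      # dual entries differing by a factor of 2
    w = pnorm_link(theta, p)
    print(w[0] / w[1])                # about 2 ** (p - 1): the gap is amplified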

Experiments: Numerical Problem of MDA
Experimental settings:
- Train models with 40% density.
- The parameter p is set to 2 ln(d) (p = 33) and to 0.5 ln(d), respectively [ST2009].
[ST2009] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1-regularized loss minimization. In ICML, pages 929-936. ACM, 2009.

Experiments: Numerical Problem of MDA
Performance criteria:
- Percentage of elements that are truncated during prediction
- Dynamic range
(A sketch of both criteria follows below.)
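A small sketch of how these two criteria might be computed from a trained weight vector; the exact definitions used in the experiments are not stated on the slide, so both functions below are assumptions.

    import numpy as np

    def truncated_fraction(w, eps=np.finfo(np.float32).tiny):
        # Fraction of nonzero weights so small that they underflow (are
        # truncated to zero) at the precision used during prediction.
        nonzero = np.abs(w[w != 0])
        return float(np.mean(nonzero < eps)) if nonzero.size else 0.0

    def dynamic_range(w):
        # Ratio between the largest and smallest nonzero weight magnitudes.
        mags = np.abs(w[w != 0])
        return float(mags.max() / mags.min()) if mags.size else 1.0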

Objectives of a Simple Approach
- Numerically stable
- Computationally efficient: online, parallelizable
- Accurate models with higher sparsity: it is costly to obtain too many features (e.g., medical diagnostics)
For an excellent theoretical treatment of the trade-off between accuracy and sparsity, see: S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, May 2009.

The Proposed Method: Algorithm
Uses an SVM-like margin. (The algorithm listing itself was a figure; its ingredients are summarized on the following slides, with a hedged sketch after them.)

Numerical Stability and Scalability Considerations
Numerical stability:
- Fewer conditions on the data matrix: no spectral-radius requirement and no rescaling of the data
- Less precision-demanding: works under limited precision (Theorem 1)
- Under mild conditions, the proposed method converges even for a large number of iterations (the bound is stated in terms of the machine precision)

Numerical Stability and Scalability Considerations
Scalability:
- Online fashion: one example at a time
- Parallelization for intensive data access: data can be distributed across computers, each of which computes part of the inner product
- Small network communication: only the partial inner products and the signals to update the model are exchanged (a minimal sketch follows below)
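A minimal sketch of the parallelization idea: each machine stores a disjoint block of features and returns only its partial inner product. The partitioning and communication details below are illustrative assumptions, not the paper's exact protocol.

    import numpy as np

    def partial_dot(w_block, x_block):
        # Computed locally on the machine holding this block of features.
        return float(np.dot(w_block, x_block))

    def margin_score(blocks):
        # Aggregate per-machine partial inner products into the full score w . x;
        # blocks is a list of (w_block, x_block) pairs, one per machine.
        return sum(partial_dot(wb, xb) for wb, xb in blocks)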

The Proposed Method: Properties
- Soft-thresholding / L1 regularization for a sparse model
- Perceptron-style updates: no update when the current features already predict well (sparsity; don't complicate the model when unnecessary)
- Convergence under soft-thresholding and limited precision (Lemma 2 and Theorem 1): numerical stability
- Generalization error bound (Theorem 3)
(A hedged reconstruction of the update rule is sketched below.)
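The algorithm listing itself was a figure and is not recoverable, so the sketch below only combines the ingredients named above: a perceptron-style update gated by an SVM-like margin, followed by soft-thresholding. Treat it as a hedged reconstruction rather than the paper's pseudocode; the step size eta, margin gamma, and regularization weight lam are hypothetical parameters.

    import numpy as np

    def soft_threshold(w, tau):
        # Shrink every weight toward zero by tau (the L1 / soft-thresholding step).
        return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

    def train(stream, d, eta=0.1, gamma=1.0, lam=0.01):
        # stream yields (x, y) pairs with x in R^d and y in {-1, +1}.
        # Update only when the margin y * <w, x> falls below gamma, i.e. when the
        # current features do not already predict well enough; then soft-threshold
        # to keep the model sparse.
        w = np.zeros(d)
        for x, y in stream:
            if y * np.dot(w, x) < gamma:
                w = w + eta * y * x
                w = soft_threshold(w, eta * lam)
        return w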

The Proposed Method: A Toy Example
Three training examples (A, B, C) with features f1, f2, f3 and a label; the table of feature values and the weights w1, w2, w3 after the 1st, 2nd, and 3rd updates were shown as a figure. The point of the example: TG (truncated gradient) ends up with a relatively dense model, while the proposed method keeps a sparse model that is enough to predict well (the margin indicates a good-enough model, so enough features).

Experiments: Overall Comparison
The proposed algorithm is compared against 3 baseline sparse learning algorithms (all with the logistic loss function):
- SMIDAS (MDA-based [ST2009]): p = 0.5 log(d) (cannot run with a bigger p due to the numerical problem)
- TG (Truncated Gradient [LLZ2009])
- SCD (Stochastic Coordinate Descent [ST2009])
[ST2009] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1-regularized loss minimization. In ICML, pages 929-936. ACM, 2009.
[LLZ2009] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.

Experiments: Overall Comparison
Accuracy under the same model density:
- First 7 datasets: select a maximum of 40% of the features
- Webspam: select a maximum of 0.1% of the features
- Stop running the program when the maximum percentage of features has been selected

Experiments: Overall Comparison
Accuracy vs. sparsity:
- The proposed algorithm works consistently better than the other baselines.
- On 5 out of 8 tasks, it stopped updating the model before reaching the maximum density (40% of the features): sparse convergence.
- On task 1, it outperforms the others with 10% of the features.
- On task 3, it ties with the best baseline using 20% of the features.

Conclusion
Numerical stability of sparse learning:
- Gradient descent using matrix iteration may diverge when the spectral-radius assumption does not hold.
- When the dimensionality is high, MDA produces many infinitesimal elements.
Trading off sparsity and accuracy:
- Other methods (TG, SCD) are unable to train accurate models at high sparsity.
- The proposed approach is numerically stable, online, parallelizable, and convergent; sparsity is controlled by the margin, L1 regularization, and soft-thresholding.
- Experimental code is available.