CS639: Data Management for Data Science

Lecture 20: Optimization/Gradient Descent
Theodoros Rekatsinas

Today
- Optimization
- Gradient Descent

What is Optimization? Find the minimum or maximum of an objective function given a set of constraints:
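As a reference, the standard form of a constrained minimization problem is sketched below; the symbols f, g_i, and h_j are generic placeholders, not taken from the slides.

```latex
% Generic constrained optimization problem (standard minimization form)
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m \\
                        & h_j(x) = 0,  \quad j = 1, \dots, p
\end{aligned}
```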

Why Do We Care?
- Linear Classification
- Maximum Likelihood
- K-Means

Prefer Convex Problems: non-convex objectives can have local (non-global) minima and maxima:

Convex Functions and Sets

Important Convex Functions

Convex Optimization Problem

Lagrangian Dual

Optimization methods and references:
- First Order Methods (Gradient Descent, Newton's Method). Reference: Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009.
- Subgradient Descent, Stochastic Gradient Descent. Reference: Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML 2010.
- Trust Regions. Reference: Trust Region Newton Method for Large-Scale Logistic Regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008.
- Dual Coordinate Descent. Reference: Dual Coordinate Descent Methods for Logistic Regression and Maximum Entropy Models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, Machine Learning Journal, 2011.
- Linear Classification. Reference: Recent Advances of Large-Scale Linear Classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin, Proceedings of the IEEE, 100 (2012).

Gradient Descent
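The slide's update formula is not in the transcript; as a minimal sketch, plain gradient descent on a differentiable function looks like the following (the example gradient and the step size are illustrative assumptions):

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, num_iters=100):
    """Minimal gradient descent sketch: x_{k+1} = x_k - step_size * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad_f(x)
    return x

# Illustrative use: minimize f(x) = ||x - 1||^2, whose gradient is 2 * (x - 1).
x_min = gradient_descent(lambda x: 2 * (x - np.ones_like(x)), x0=np.zeros(3))
```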

Single Step Illustration

Full Gradient Descent Illustration: the ellipses are level curves (the function is quadratic). If we start on the dashed line, the gradient points straight toward the minimum (the center); if not, the iterates bounce back and forth across the level curves as they approach it.

Newton’s Method: the update step is the inverse Hessian times the gradient. The Hessian needs to be invertible (positive definite).
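A minimal sketch of the Newton update x_{k+1} = x_k - H(x_k)^{-1} ∇f(x_k), assuming the caller supplies gradient and Hessian functions (the names and iteration count are illustrative):

```python
import numpy as np

def newtons_method(grad_f, hess_f, x0, num_iters=20):
    """Newton's method sketch: solve H(x) d = grad(x), then step x <- x - d."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_f(x)
        H = hess_f(x)              # must be invertible (positive definite) for a descent step
        d = np.linalg.solve(H, g)  # more stable than forming the inverse explicitly
        x = x - d
    return x
```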

Newton’s Method Picture

Subgradient Descent Motivation

Subgradient Descent – Algorithm
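The algorithm itself is not spelled out in the transcript; as a sketch, subgradient descent looks just like gradient descent but uses any valid subgradient at non-differentiable points, typically with a decreasing step size (the L1 example and the step-size schedule below are illustrative assumptions):

```python
import numpy as np

def subgradient_descent(subgrad_f, x0, num_iters=1000):
    """Subgradient descent sketch with decreasing step sizes 1/sqrt(k+1).

    In practice one usually returns the best or averaged iterate rather than the last one.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        step = 1.0 / np.sqrt(k + 1)
        x = x - step * subgrad_f(x)
    return x

# Illustrative use: minimize f(x) = ||x||_1; sign(x) is a valid subgradient everywhere.
x_min = subgradient_descent(np.sign, x0=np.array([3.0, -2.0, 0.5]))
```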

A brief introduction to stochastic gradient descent

Online learning and optimization. The goal of machine learning is to minimize the expected loss given samples. This is stochastic optimization. Assume the loss function is convex.
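Written out (a standard formulation; the symbols w for the model, ℓ for the loss, and D for the data distribution are assumptions, not taken from the slides):

```latex
% Minimize expected loss over the data distribution D,
% approximated in practice by the average loss over n samples.
\min_{w} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(w; x, y) \big]
\;\approx\;
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w; x_i, y_i)
```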

Batch (sub)gradient descent for ML
- Process all examples together in each step (a sketch follows below)
- The entire training set is examined at each step
- Very slow when n is very large
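A minimal sketch of one batch step for a generic differentiable per-example loss; the squared-loss gradient used here is only an illustration, not the loss from the slides:

```python
import numpy as np

def batch_gradient_step(w, X, y, step_size=0.1):
    """One batch step: average the per-example gradients over the whole training set.

    Uses squared loss 0.5 * (x_i . w - y_i)^2 purely as an illustration.
    """
    residuals = X @ w - y            # shape (n,): one residual per training example
    grad = X.T @ residuals / len(y)  # average gradient over all n examples
    return w - step_size * grad
```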

Stochastic (sub)gradient descent
- “Optimize” one example at a time
- Choose examples randomly (or shuffle once and process them in order)
- Learning is representative of the example distribution

Stochastic (sub)gradient descent
- Equivalent to online learning (the weight vector w changes with every example)
- Convergence guaranteed for convex functions (for a convex function every local minimum is a global minimum); see the sketch below
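A minimal SGD sketch, again using a squared-loss gradient purely for illustration and a decaying step size (both are assumptions, not from the slides):

```python
import numpy as np

def sgd(X, y, num_epochs=10, step_size=0.1, seed=0):
    """Stochastic gradient descent sketch: one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):      # shuffle, then process examples in order
            t += 1
            grad_i = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x_i . w - y_i)^2
            w = w - (step_size / np.sqrt(t)) * grad_i
    return w
```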

Hybrid!
- Stochastic: 1 example per iteration
- Batch: all the examples!
- Sample Average Approximation (SAA): sample m examples at each step and perform a gradient step on them (see the mini-batch sketch below)
- Allows for parallelization, but the choice of m is based on heuristics
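A sketch of the hybrid (mini-batch) step: sample m examples and average their gradients. The squared loss and batch size are again only illustrative choices:

```python
import numpy as np

def minibatch_step(w, X, y, m=32, step_size=0.1, rng=None):
    """One mini-batch step: average gradients over m randomly sampled examples (m <= n)."""
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.choice(len(y), size=m, replace=False)
    residuals = X[idx] @ w - y[idx]
    grad = X[idx].T @ residuals / m   # the m per-example gradients can be computed in parallel
    return w - step_size * grad
```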

SGD - Issues
- Convergence is very sensitive to the learning rate (oscillations near the solution due to the probabilistic nature of sampling)
- The learning rate might need to decrease with time to ensure the algorithm eventually converges
- Basically: SGD is good for machine learning with large data sets!
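Two common ways to decrease the learning rate with time (the constants and function names are illustrative, not from the slides):

```python
def inverse_time_decay(eta0, k, decay=0.01):
    """eta_k = eta0 / (1 + decay * k): shrinks roughly like 1/k."""
    return eta0 / (1.0 + decay * k)

def inverse_sqrt_decay(eta0, k):
    """eta_k = eta0 / sqrt(k + 1): the schedule used in the SGD sketch above."""
    return eta0 / (k + 1) ** 0.5
```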

Statistical Learning Crash Course. A staggering amount of machine learning and statistics can be written as minimizing a sum of per-example losses, min_x (1/N) sum_{i=1..N} f(x; y_i), where N (the number of y_i's, i.e., the amount of data) is typically in the billions. Examples: classification, recommendation, deep learning. The de facto iteration for solving large-scale problems of this form is SGD: billions of tiny iterations, each selecting one term j and using it to estimate the gradient.
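The corresponding SGD iteration, written out (the step-size symbol α_k is an assumption; the slide does not give one):

```latex
% One SGD iteration: pick a random term j and step along its gradient.
x_{k+1} \;=\; x_k - \alpha_k \,\nabla_x f(x_k;\, y_j),
\qquad j \sim \mathrm{Uniform}\{1, \dots, N\}
```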

Parallel SGD (Centralized)
- Centralized parameter updates guarantee convergence
- The Parameter Server is a bottleneck
- Architecture: data shards feed model workers; workers send gradients Δw to the Parameter Server, which applies w = w - αΔw and sends the updated w back
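A toy, single-process sketch of the centralized pattern: workers compute gradients on their data shards and one parameter server applies every update w <- w - αΔw. The sequential loop stands in for real distributed communication, and the squared-loss gradient is illustrative:

```python
import numpy as np

class ParameterServer:
    """Holds the model w and applies centralized updates w <- w - alpha * delta_w."""
    def __init__(self, dim, alpha=0.1):
        self.w = np.zeros(dim)
        self.alpha = alpha

    def apply(self, delta_w):
        self.w = self.w - self.alpha * delta_w
        return self.w

def worker_gradient(w, X_shard, y_shard):
    """Gradient of squared loss on one data shard (illustrative loss choice)."""
    return X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

def train_centralized(shards, dim, num_rounds=100):
    """shards: list of (X_shard, y_shard) pairs, one per worker."""
    server = ParameterServer(dim)
    for _ in range(num_rounds):
        for X_shard, y_shard in shards:
            delta_w = worker_gradient(server.w, X_shard, y_shard)
            server.apply(delta_w)   # every update goes through the central server
    return server.w
```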

Parallel SGD (HogWild! - asynchronous)
- Data systems perspective of SGD. Insane conflicts: billions of tiny jobs (~100 instructions each) with read/write conflicts on x, and multiple workers need to communicate!
- HogWild!: for sparse convex models (e.g., logistic regression), run without locks; SGD still converges (the answer is statistically correct)
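A toy sketch of the lock-free idea using Python threads: every worker reads and writes the shared weight vector with no locking. In CPython the GIL serializes bytecode, so this only illustrates the structure, not real lock-free parallel speedups; the synthetic sparse data and squared loss are illustrative assumptions:

```python
import threading
import numpy as np

dim = 1000
w = np.zeros(dim)                      # shared model, updated by all workers without locks
rng = np.random.default_rng(0)

def make_shard(num_examples=500, nnz=10):
    """Synthetic sparse examples: (active coordinate indices, values, label)."""
    return [(rng.choice(dim, size=nnz, replace=False),
             rng.normal(size=nnz),
             rng.normal())
            for _ in range(num_examples)]

def hogwild_worker(examples, step_size=0.01):
    """Each update touches only the few coordinates its sparse example uses; no locking."""
    for idx, x_vals, y in examples:
        pred = float(w[idx] @ x_vals)
        grad = (pred - y) * x_vals     # squared-loss gradient on the active coordinates
        w[idx] -= step_size * grad     # unsynchronized read-modify-write on shared w

threads = [threading.Thread(target=hogwild_worker, args=(make_shard(),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```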