CS639: Data Management for Data Science

Slides:

Advertisements

Similar presentations

Accelerated, Parallel and PROXimal coordinate descent IPAM February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv: )

Advertisements

Neural networks Introduction Fitting neural networks

Unsupervised Learning Clustering K-Means. Recall: Key Components of Intelligent Agents Representation Language: Graph, Bayes Nets, Linear functions Inference.

Linear Regression.

Regularization David Kauchak CS 451 – Fall 2013.

Engineering Optimization

Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh.

Optimization Tutorial

Lecture 13 – Perceptrons Machine Learning March 16, 2010.

Separating Hyperplanes

Middle Term Exam 03/01 (Thursday), take home, turn in at noon time of 03/02 (Friday)

The loss function, the normal equation,

Classification and Prediction: Regression Via Gradient Descent Optimization Bamshad Mobasher DePaul University.

Artificial Intelligence Lecture 2 Dr. Bo Yuan, Professor Department of Computer Science and Engineering Shanghai Jiaotong University

1cs542g-term Notes  Extra class this Friday 1-2pm  If you want to receive s about the course (and are auditing) send me .

Announcements  Homework 4 is due on this Thursday (02/27/2004)  Project proposal is due on 03/02.

Efficient and Numerically Stable Sparse Learning Sihong Xie 1, Wei Fan 2, Olivier Verscheure 2, and Jiangtao Ren 3 1 University of Illinois at Chicago,

Linear Discriminant Functions Chapter 5 (Duda et al.)

Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh SIAM Annual Meeting, Chicago July 7, 2014.

Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization Tyler B. Johnson and Carlos Guestrin University of Washington.

Collaborative Filtering Matrix Factorization Approach

Learning with large datasets Machine Learning Large scale machine learning.

1 Hybrid methods for solving large-scale parameter estimation problems Carlos A. Quintero 1 Miguel Argáez 1 Hector Klie 2 Leticia Velázquez 1 Mary Wheeler.

Biointelligence Laboratory, Seoul National University

Adaptive CSMA under the SINR Model: Fast convergence using the Bethe Approximation Krishna Jagannathan IIT Madras (Joint work with) Peruru Subrahmanya.

Mathematical formulation XIAO LIYING. Mathematical formulation.

Efficient and Numerically Stable Sparse Learning Sihong Xie 1, Wei Fan 2, Olivier Verscheure 2, and Jiangtao Ren 3 1 University of Illinois at Chicago,

Stochastic Subgradient Approach for Solving Linear Support Vector Machines Jan Rupnik Jozef Stefan Institute.

CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.

CS Statistical Machine learning Lecture 18 Yuan (Alan) Qi Purdue CS Oct

CS Statistical Machine learning Lecture 10 Yuan (Alan) Qi Purdue CS Sept

The Group Lasso for Logistic Regression Lukas Meier, Sara van de Geer and Peter Bühlmann Presenter: Lu Ren ECE Dept., Duke University Sept. 19, 2008.

Machine Learning CUNY Graduate Center Lecture 4: Logistic Regression.

1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.

Machine learning optimization Usman Roshan. Machine learning Two components: – Modeling – Optimization Modeling – Generative: we assume a probabilistic.

Learning by Loss Minimization. Machine learning: Learn a Function from Examples Function: Examples: – Supervised: – Unsupervised: – Semisuprvised:

Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois.

Linear Discriminant Functions Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis.

Fall 2004 Backpropagation CS478 - Machine Learning.

LINEAR CLASSIFIERS The Problem: Consider a two class task with ω1, ω2.

Deep Feedforward Networks

Large Margin classifiers

Large-scale Machine Learning

Multiplicative updates for L1-regularized regression

Dan Roth Department of Computer and Information Science

Lecture 07: Soft-margin SVM

Computational Optimization

Announcements HW4 due today (11:59pm) HW5 out today (due 11/17 11:59pm)

Probabilistic Models for Linear Regression

Statistical Learning Dong Liu Dept. EEIS, USTC.

Collaborative Filtering Matrix Factorization Approach

Lecture 07: Soft-margin SVM

CSCI B609: “Foundations of Data Science”

Logistic Regression & Parallel SGD

Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.

Large Scale Support Vector Machines

Lecture 08: Soft-margin SVM

Biointelligence Laboratory, Seoul National University

Deep Learning for Non-Linear Control

The loss function, the normal equation,

Mathematical Foundations of BME Reza Shadmehr

Lecture 2 CMS 165 Optimization.

Recap: Naïve Bayes classifier

Multiple features Linear Regression with multiple variables

Multiple features Linear Regression with multiple variables

The Updated experiment based on LSTM

Linear regression with one variable

CS249: Neural Language Model

Patterson: Chap 1 A Review of Machine Learning

Presentation transcript:

CS639: Data Management for Data Science Lecture 20: Optimization/Gradient Descent Theodoros Rekatsinas

Today Optimization Gradient Descent

What is Optimization? Find the minimum or maximum of an objective function given a set of constraints:

Why Do We Care? Linear Classification Maximum Likelihood K-Means

Prefer Convex Problems Local (non global) minima and maxima:

Convex Functions and Sets

Important Convex Functions

Convex Optimization Problem

Lagrangian Dual

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012)

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012)

Gradient Descent

Single Step Illustration

Full Gradient Descent Illustration Elipses are level curves (function is quadratic). If we start on ---- line –gradient goes towards minimum (middle), if not, bounce back and forth

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012)

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012)

Newton’s Method Inverse Hessian Gradient Need Hessian to be invertible --- positive definite Inverse Hessian Gradient

Newton’s Method Picture

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012)

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012)

Subgradient Descent Motivation

Subgradient Descent – Algorithm

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012)

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012) A brief introduction to Stochastic gradient descent

Online learning and optimization Goal of machine learning : Minimize expected loss given samples This is Stochastic Optimization Assume loss function is convex

Batch (sub)gradient descent for ML Process all examples together in each step Entire training set examined at each step Very slow when n is very large

Stochastic (sub)gradient descent “Optimize” one example at a time Choose examples randomly (or reorder and choose in order) Learning representative of example distribution

Stochastic (sub)gradient descent Equivalent to online learning (the weight vector w changes with every example) Convergence guaranteed for convex functions (to local minimum)

Hybrid! Stochastic – 1 example per iteration Batch – All the examples! Sample Average Approximation (SAA): Sample m examples at each step and perform SGD on them Allows for parallelization, but choice of m based on heuristics

SGD - Issues Convergence very sensitive to learning rate ( ) (oscillations near solution due to probabilistic nature of sampling) Might need to decrease with time to ensure the algorithm converges eventually Basically – SGD good for machine learning with large data sets!

Dual Coordinate Descent First Order Methods: Gradient Descent Newton’s Method Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009 Subgradient Descent Stochastic Gradient Descent Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10 Trust Regions Trust Region Newton method for large-scale logistic regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008 Dual Coordinate Descent Dual Coordinate Descent Methods for logistic regression and maximum entropy models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, . Machine Learning Journal, 2011 Linear Classification Recent Advances of Large-scale linear classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Proceedings of the IEEE, 100(2012)

Statistical Learning Crash Course Staggering amount of machine learning/stats can be written as: N (number of yi’s, data) typically in the billions Ex: Classification, Recommendation, Deep Learning. De facto iteration to solve large-scale problems: SGD. Billions of tiny iterations Select one term, j, and estimate gradient.

Parallel SGD (Centralized) Centralized parameter updates guarantee convergence Parameter Server is a bottleneck Parameter Server: w = w - αΔw Δw w Model Workers Data Shards

Parallel SGD (HogWild! - asynchronous) Data Systems Perspective of SGD Insane conflicts: Billions of tiny jobs (~100 instructions), RW conflicts on x Multiple workers need to communicate! HogWild!: For sparse convex models (e.g., logistic regression) run without locks; SGD still converges (answer is statistically correct)