Efficient Logistic Regression with Stochastic Gradient Descent


Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

SGD for Logistic Regression

SGD for logistic regression

Start with a Rocchio-like linear classifier, then:
- replace sign(...) with something differentiable (the logistic/sigmoid function)
- also rescale the output to lie in 0 to 1 rather than -1 to +1
- decide what to optimize (the conditional log-likelihood, LCL)
- differentiate…
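The formulas on this slide were images; as a sketch, the standard choices are the sigmoid output and the conditional log-likelihood:

p(y = 1 \mid x, w) = \sigma(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}

\mathrm{LCL}(w) = \sum_i \log p(y_i \mid x_i, w)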

Magically, when we differentiate, we end up with something very simple and elegant. The update for gradient descent with rate λ is just:
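The update formula itself was an image; consistent with the per-feature update W[j] = W[j] + λ(yi - pi)xj used in the pseudocode later in the deck, it is

w_j \leftarrow w_j + \lambda\,(y_i - p_i)\,x_{ij}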

An observation: sparsity!

Key computational point: if x_j = 0, then the gradient of w_j is zero, so when processing an example you only need to update the weights for that example's non-zero features.

SGD for logistic regression

The algorithm: … (do this in random order; a sketch is given below)
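As a minimal sketch (not the slide's own code), plain unregularized SGD for logistic regression over sparse feature dicts, with labels in {0, 1}, could be written as:

```python
import math
import random
from collections import defaultdict

def sgd_logistic_regression(examples, epochs=10, lam=0.1):
    """examples: list of (features, label) pairs, where features maps
    feature name -> value and label is 0 or 1."""
    w = defaultdict(float)
    for _ in range(epochs):
        random.shuffle(examples)                  # do this in random order
        for x, y in examples:
            score = sum(w[j] * xj for j, xj in x.items())
            score = max(min(score, 30.0), -30.0)  # avoid overflow in exp()
            p = 1.0 / (1.0 + math.exp(-score))    # sigmoid prediction p_i
            for j, xj in x.items():               # only non-zero features
                w[j] += lam * (y - p) * xj        # w_j += λ (y_i - p_i) x_ij
    return dict(w)
```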

Adding regularization

Replace LCL with LCL plus a penalty for large weights, e.g. a squared-weight (L2) penalty, so that both the objective and the per-example update change.

Regularized logistic regression

Replace LCL with LCL plus a penalty for large weights, e.g. a squared-weight penalty μ‖w‖². The update for w_j then picks up an extra weight-decay term, sketched below.
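The slide's formulas were also images; consistent with the (1 - λ2μ) decay factor in the later pseudocode, an L2 penalty μ‖w‖² gives the update

w_j \leftarrow w_j + \lambda\big((y_i - p_i)\,x_{ij} - 2\mu\,w_j\big)

or, equivalently, a weight-decay step followed by the usual LCL step:

w_j \leftarrow (1 - 2\lambda\mu)\,w_j + \lambda\,(y_i - p_i)\,x_{ij}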

Naively this is not a sparse update

Time goes from O(nT) to O(mVT), where:
n = number of non-zero entries
m = number of examples
V = number of features
T = number of passes over the data

Sparse regularized logistic regression

Final algorithm:
  Initialize hashtables W, A and set k=0
  For each iteration t=1,…,T
    For each example (xi, yi)
      pi = … ; k++
      For each non-zero feature W[j]
        W[j] *= (1 - λ2μ)^(k-A[j])
        W[j] = W[j] + λ(yi - pi)xj
        A[j] = k

k = “clock” reading
A[j] = clock reading the last time feature j was “active”
We implement the “weight decay” update using a “lazy” strategy: weights are decayed in one shot when a feature is “active”.

Sparse regularized logistic regression (v2)

  Initialize hashtables W, A and set k=0
  For each iteration t=1,…,T
    For each example (xi, yi)
      k++
      For each non-zero feature W[j]
        W[j] *= (1 - λ2μ)^(k-A[j])
      pi = …
      For each non-zero feature W[j]
        W[j] = W[j] + λ(yi - pi)xj
        A[j] = k
  Finally: before you write out each W[j], do **

k = “clock” reading
A[j] = clock reading the last time feature j was “active”
We implement the “weight decay” update using a “lazy” strategy: weights are decayed in one shot when a feature is “active”
** i.e., apply the lazy weight decay W[j] *= (1 - λ2μ)^(k-A[j]) to every feature before saving
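A minimal Python sketch of this lazy-decay version, assuming the same (features, label) format as the earlier sketch (an illustration, not the original code):

```python
import math
import random
from collections import defaultdict

def sparse_regularized_sgd(examples, epochs=10, lam=0.1, mu=0.1):
    """Lazy L2 'weight decay': a weight is decayed in one shot whenever
    its feature is active, using the clock k and A[j] = last active time."""
    W = defaultdict(float)          # weights
    A = defaultdict(int)            # clock reading when feature was last active
    k = 0
    decay = 1.0 - lam * 2.0 * mu    # the (1 - λ2μ) factor
    for _ in range(epochs):
        random.shuffle(examples)
        for x, y in examples:
            k += 1
            for j in x:                              # catch up the decay...
                W[j] *= decay ** (k - A[j])
            score = sum(W[j] * xj for j, xj in x.items())
            score = max(min(score, 30.0), -30.0)
            p = 1.0 / (1.0 + math.exp(-score))       # ...then predict
            for j, xj in x.items():                  # sparse LCL update
                W[j] += lam * (y - p) * xj
                A[j] = k
    for j in W:                     # decay all features before saving
        W[j] *= decay ** (k - A[j])
    return dict(W)
```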

Summary

What’s happened here: our update involves a sparse part and a dense part.
Sparse: the empirical loss on this example.
Dense: the regularization loss, which is not affected by the example.
We remove the dense part of the update.

Old example update:
  For each feature: { do something example-independent }
  For each active feature: { do something example-dependent }

New example update:
  For each active feature:
    { simulate the prior example-independent updates }
    { do something example-dependent }

Summary

We can separate the LCL update and the weight-decay updates, but remember:
- we need to apply “weight decay” to just the active features in an example xi before we compute the prediction pi
- we need to apply “weight decay” to all features before we save the classifier

My suggestion: an abstraction for a logistic regression classifier.

A possible SGD implementation

class SGDLogisticRegression {
    /** Predict using current weights **/
    double predict(Map features);

    /** Apply weight decay to a single feature and record when in A[ ] **/
    void regularize(String feature, int currentK);

    /** Regularize all features then save to disk **/
    void save(String fileName, int currentK);

    /** Load a saved classifier **/
    static SGDClassifier load(String fileName);

    /** Train on one example **/
    void train1(Map features, double trueLabel, int k) {
        // regularize each feature
        // predict and apply update
    }
}

The main ‘train’ program assumes a stream of randomly-ordered examples and outputs the classifier to disk; the main ‘test’ program prints predictions for each test case in the input.

A possible SGD implementation

class SGDLogisticRegression { … }
The main ‘train’ program assumes a stream of randomly-ordered examples and outputs the classifier to disk; the main ‘test’ program prints predictions for each test case in the input.
< 100 lines (in Python)

Other mains:
- A “shuffler”: stream through a training file T times and output instances; the output is randomly ordered, as much as possible, given a buffer of size B (see the sketch below)
- Something to collect predictions + true labels and produce error rates, etc.
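As an illustration only (not the author's code), a buffer-of-size-B shuffler along these lines could look like:

```python
import random
import sys

def buffered_shuffle(lines, buffer_size=10000):
    """Approximately shuffle a stream: keep a buffer of B lines and, for each
    new line, emit a randomly chosen buffered line in its place."""
    buf = []
    for line in lines:
        if len(buf) < buffer_size:
            buf.append(line)
        else:
            i = random.randrange(buffer_size)
            yield buf[i]
            buf[i] = line
    random.shuffle(buf)             # flush whatever is left in the buffer
    yield from buf

if __name__ == "__main__":
    # usage: python shuffler.py < train.dat > train.shuffled.dat
    for line in buffered_shuffle(sys.stdin):
        sys.stdout.write(line)
```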

A possible SGD implementation

Parameter settings for
  W[j] *= (1 - λ2μ)^(k-A[j])
  W[j] = W[j] + λ(yi - pi)xj

I didn’t tune especially, but used μ = 0.1 and λ = η·E^(-2), where E is the “epoch” and η = ½.
Epoch: the number of times you’ve iterated over the dataset, starting at E = 1.
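For concreteness, that schedule gives λ = 0.5 on the first pass over the data (E = 1), 0.5/4 = 0.125 on the second, and 0.5/9 ≈ 0.056 on the third, so the step size shrinks as training proceeds.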

BOUNDED-MEMORY LOGISTIC REGRESSION

Outline
- Logistic regression and SGD
- Learning as optimization: a linear classifier optimizing P(y|x)
- Stochastic gradient descent: “streaming optimization” for ML problems
- Regularized logistic regression
- Sparse regularized logistic regression
- Memory-saving logistic regression

Question: in text classification, most words are
- rare
- not correlated with any class
- given low weights in the LR classifier
- unlikely to affect classification
- not very interesting

Question: in text classification, most bigrams are
- rare
- not correlated with any class
- given low weights in the LR classifier
- unlikely to affect classification
- not very interesting

Question: most of the weights in a classifier are
- important
- not important

How can we exploit this?

One idea: combine uncommon words together randomly. Examples:
- replace all occurrences of “humanitarianism” or “biopsy” with “humanitarianismOrBiopsy”
- replace all occurrences of “schizoid” or “duchy” with “schizoidOrDuchy”
- replace all occurrences of “gynecologist” or “constrictor” with “gynecologistOrConstrictor”
- …

For Naïve Bayes this breaks independence assumptions. It’s not obviously a problem for logistic regression, though: I could combine
- two low-weight words (won’t matter much)
- a low-weight and a high-weight word (won’t matter much)
- two high-weight words (not very likely to happen)

How much of this can I get away with?
- certainly a little
- is it enough to make a difference?
- how much memory does it save?

How can we exploit this?

Another observation:
- the values in my hash table are weights
- the keys in my hash table are strings for the feature names

We need the keys to avoid collisions. But maybe we don’t care about collisions? Allowing “schizoid” and “duchy” to collide is equivalent to replacing all occurrences of “schizoid” or “duchy” with “schizoidOrDuchy”.
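A tiny sketch of this idea (the “hash trick”); the function name and table size R here are illustrative, not from the slides:

```python
import zlib

def hashed_features(tokens, R=2**20):
    """Map a bag of tokens into a fixed index space of size R: the hash value
    replaces the string key, so colliding tokens simply share one weight."""
    v = {}
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8")) % R
        v[h] = v.get(h, 0.0) + 1.0
    return v        # sparse map: index in [0, R) -> feature value
```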

Learning as optimization for regularized logistic regression

Algorithm:
  Initialize hashtables W, A and set k=0
  For each iteration t=1,…,T
    For each example (xi, yi)
      pi = … ; k++
      For each feature j with xij > 0:
        W[j] *= (1 - λ2μ)^(k-A[j])
        W[j] = W[j] + λ(yi - pi)xj
        A[j] = k

Learning as optimization for regularized logistic regression

Algorithm (hashed features):
  Initialize arrays W, A of size R and set k=0
  For each iteration t=1,…,T
    For each example (xi, yi)
      Let V be a hash table so that V[h] sums the values of xi’s features that hash to h
      pi = … ; k++
      For each hash value h with V[h] > 0:
        W[h] *= (1 - λ2μ)^(k-A[h])
        W[h] = W[h] + λ(yi - pi)V[h]
        A[h] = k
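Putting the two pieces together, a hedged sketch of the hashed, lazily regularized trainer (again an illustration, not the original implementation):

```python
import math
import zlib

def train_hashed(examples, R=2**20, epochs=10, lam=0.1, mu=0.1):
    """SGD for L2-regularized logistic regression over hashed features.
    examples: list of (tokens, label) pairs with label in {0, 1}."""
    W = [0.0] * R                   # weights, indexed by hash value
    A = [0] * R                     # clock reading when each slot was last active
    k = 0
    decay = 1.0 - lam * 2.0 * mu
    for _ in range(epochs):
        for tokens, y in examples:
            k += 1
            V = {}                  # V[h] = total feature value hashing to h
            for tok in tokens:
                h = zlib.crc32(tok.encode("utf-8")) % R
                V[h] = V.get(h, 0.0) + 1.0
            for h in V:             # lazy weight decay on the active slots
                W[h] *= decay ** (k - A[h])
            score = sum(W[h] * v for h, v in V.items())
            score = max(min(score, 30.0), -30.0)
            p = 1.0 / (1.0 + math.exp(-score))
            for h, v in V.items():  # sparse update on the active slots
                W[h] += lam * (y - p) * v
                A[h] = k
    return W
```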

Learning as optimization for regularized logistic regression

Algorithm (hashed features):
  Initialize arrays W, A of size R and set k=0
  For each iteration t=1,…,T
    For each example (xi, yi)
      Let V be a hash table so that …
      pi = … ; k++
      ???

SOME EXPERIMENTS

ICML 2009

An interesting example: spam filtering for Yahoo mail

Lots of examples and lots of users. Two options:
- one filter for everyone, but users disagree
- one filter for each user, but some users are lazy and don’t label anything

Third option: classify (msg, user) pairs
- the features of message i are its words wi,1, …, wi,ki
- the feature of the user is his/her id u
- the features of the pair are: wi,1, …, wi,ki and uwi,1, …, uwi,ki
(based on an idea by Hal Daumé; a small sketch of this feature construction follows)
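As an illustration only (a hypothetical helper, not from the slides), the personalized features for a (message, user) pair could be built like this:

```python
def personalized_features(words, user_id):
    """Features of a (message, user) pair: the message's words, the user id,
    and a user-specific copy of each word (e.g. 'wcohendear')."""
    return list(words) + [user_id] + [user_id + w for w in words]

# personalized_features(["dear", "madam", "investment"], "wcohen")
# -> ['dear', 'madam', 'investment', 'wcohen',
#     'wcohendear', 'wcohenmadam', 'wcoheninvestment']
```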

An example

E.g., this email to wcohen gets the features: dear, madam, sir, …, investment, broker, …, wcohendear, wcohenmadam, wcohen, …

Idea: the learner will figure out how to personalize my spam filter by using the wcohenX features.

An example

Compute personalized features and multiple hashes on the fly: a great opportunity to use several processors and speed up I/O.

Experiments
- 3.2M emails
- 40M tokens
- 430k users
- 16T unique features (after personalization)

An example
- 2^26 entries = 1 Gb @ 8 bytes/weight
- 3.2M emails
- 400k users
- 40M tokens