Efficient Logistic Regression with Stochastic Gradient Descent


Efficient Logistic Regression with Stochastic Gradient Descent William Cohen

Outline: Logistic regression and SGD. Learning as optimization; a linear classifier optimizing P(y|x); stochastic gradient descent, “streaming optimization” for ML problems; regularized logistic regression; sparse regularized logistic regression; memory-saving logistic regression.

Summary on perceptron. We showed that if there exists a u with unit norm that has margin γ on the examples in the sequence (x1,y1),(x2,y2),…, then the perceptron algorithm makes < R^2/γ^2 mistakes on the sequence (where R >= ||xi||), independent of the dimension of the data or of the classifier (!). This doesn’t follow from M(C) <= VCDim(C). Here, a small-norm classifier and a high margin ⇒ success. Question: what if you explicitly look for a classifier with the smallest norm and the largest margin? Maximize margin subject to a constraint on the norm, or minimize norm subject to a constraint on the margin. Designing ML algorithms as optimization.

Learning as optimization: general procedure. Goal: learn the parameter θ of … Dataset: D = {x1,…,xn}, or maybe D = {(x1,y1),…,(xn,yn)}. Write down Pr(D|θ) as a function of θ (maybe easier to work with log Pr(D|θ)). Maximize by differentiating and setting to zero. Or, use gradient descent: repeatedly take a small step in the direction of the gradient.

Learning as optimization: general procedure for SGD. Goal: learn the parameter θ of … Dataset: D = {x1,…,xn}, or maybe D = {(x1,y1),…,(xn,yn)}. Write down Pr(D|θ) as a function of θ. Big-data problem: we don’t want to load all the data D into memory, and the gradient depends on all the data. Solution: pick a small subset of examples B << D and approximate the gradient using them; “on average” this is the right direction; take a step in that direction; repeat…. B = one example is a very popular choice.
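A minimal sketch of this loop in Python, assuming a per-example gradient function grad(theta, example) and a step size eta (both names are illustrative, not from the slides):

import random

def sgd(examples, grad, theta, eta=0.1, epochs=10):
    """Stochastic gradient ascent on log Pr(D|theta), using B = one example per step."""
    for _ in range(epochs):
        random.shuffle(examples)                 # process examples in random order
        for ex in examples:
            g = grad(theta, ex)                  # gradient estimated from a single example
            theta = [t + eta * gi for t, gi in zip(theta, g)]  # small step in that direction
    return theta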

Learning as optimization for logistic regression. Goal: learn the parameter w of the classifier. The probability of a single example (y,x) would be [formula on slide], or with logs: [formula on slide].

[Slide: formula for p, the predicted probability]
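The slide’s formulas are shown only as images; for reference, the standard logistic regression form (assuming y in {0,1}, consistent with the (yi - pi) updates used later) is sketched below:

import math

def p_of(x, w):
    """Predicted probability that y = 1: p = 1 / (1 + exp(-x.w))."""
    return 1.0 / (1.0 + math.exp(-sum(xi * wi for xi, wi in zip(x, w))))

def log_likelihood(y, x, w):
    """Log conditional likelihood of one example, with y in {0,1}."""
    p = p_of(x, w)
    return y * math.log(p) + (1 - y) * math.log(1 - p)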

An observation. Key computational point: if xj = 0 then the gradient of wj is zero, so when processing an example you only need to update the weights for that example’s non-zero features.

Learning as optimization for logistic regression. The algorithm [update rule on slide]: process the examples in random order.
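A sketch of the per-example update this implies, using the observation above that only non-zero features need to be touched (sparse dict examples and y in {0,1} are illustrative assumptions, not from the slides):

import math

def sgd_lr_update(W, x, y, lam):
    """One SGD step for (unregularized) logistic regression on a sparse example.
    W: dict feature -> weight; x: dict feature -> value; y in {0,1}; lam: learning rate."""
    score = sum(W.get(j, 0.0) * xj for j, xj in x.items())
    p = 1.0 / (1.0 + math.exp(-score))
    for j, xj in x.items():          # only the non-zero features of this example are touched
        W[j] = W.get(j, 0.0) + lam * (y - p) * xj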

Another observation: consider averaging the gradient over all the examples D = {(x1,y1),…,(xn,yn)}.

Learning as optimization for logistic regression. Goal: learn the parameter θ of a classifier. Which classifier? We’ve seen y = sign(x · w), but sign is not continuous… Convenient alternative: replace sign with the logistic function. Practical problem: this overfits badly with sparse features; e.g., if feature j occurs only in positive examples, its gradient is always positive!

Outline: Logistic regression and SGD. Learning as optimization; a linear classifier optimizing P(y|x); stochastic gradient descent, “streaming optimization” for ML problems; regularized logistic regression; sparse regularized logistic regression; memory-saving logistic regression.

Regularized logistic regression. Replace LCL with LCL plus a penalty for large weights, e.g. a squared (L2) penalty on the weights. So the objective becomes the penalized objective shown on the slide.

Regularized logistic regression. Replace LCL with LCL plus a penalty for large weights, e.g. a squared (L2) penalty. So the update for wj becomes wj = wj + λ((yi - pi)xij - 2μwj), or equivalently wj = wj·(1 - 2λμ) + λ(yi - pi)xij.
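A direct (dense) sketch of this update, under the same illustrative assumptions as above (sparse dict examples, y in {0,1}); note that the decay term touches every weight on every example, which is what the lazy trick on the next slides avoids:

import math

def regularized_update_dense(W, x, y, lam, mu):
    """One step of L2-regularized SGD, done naively: decay every weight, then the usual sparse update.
    W: dict feature -> weight; x: dict feature -> value; y in {0,1}."""
    score = sum(W.get(j, 0.0) * xj for j, xj in x.items())
    p = 1.0 / (1.0 + math.exp(-score))
    for j in W:                      # O(m) work per example: every weight decays
        W[j] *= (1 - 2 * lam * mu)
    for j, xj in x.items():          # sparse part of the update
        W[j] = W.get(j, 0.0) + lam * (y - p) * xj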

Learning as optimization for logistic regression. Algorithm [update rule on slide]: process the examples in random order.

Learning as optimization for regularized logistic regression. Algorithm [on slide]. Time goes from O(nT) to O(nmT), where n = number of non-zero entries, m = number of features, T = number of passes over the data.

Learning as optimization for regularized logistic regression. Final algorithm:
  Initialize hashtable W
  For each iteration t = 1,…,T
    For each example (xi, yi)
      pi = …
      For each feature W[j]
        W[j] = W[j] - λ2μW[j]
        If xij > 0 then W[j] = W[j] + λ(yi - pi)xij

Learning as optimization for regularized logistic regression. Final algorithm:
  Initialize hashtable W
  For each iteration t = 1,…,T
    For each example (xi, yi)
      pi = …
      For each feature W[j]
        W[j] *= (1 - λ2μ)
        If xij > 0 then W[j] = W[j] + λ(yi - pi)xij

Learning as optimization for regularized logistic regression. Final algorithm:
  Initialize hashtable W
  For each iteration t = 1,…,T
    For each example (xi, yi)
      pi = …
      For each feature W[j]
        If xij > 0 then
          W[j] *= (1 - λ2μ)^A
          W[j] = W[j] + λ(yi - pi)xij
  where A is the number of examples seen since the last time we did an x > 0 update on W[j]

Learning as optimization for regularized logistic regression. Final algorithm:
  Initialize hashtables W, A and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      pi = … ; k++
      For each feature W[j]
        If xij > 0 then
          W[j] *= (1 - λ2μ)^(k - A[j])
          W[j] = W[j] + λ(yi - pi)xij
          A[j] = k
  where k - A[j] is the number of examples seen since the last time we did an x > 0 update on W[j]

Learning as optimization for regularized logistic regression. Final algorithm:
  Initialize hashtables W, A and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      pi = … ; k++
      For each feature W[j]
        If xij > 0 then
          W[j] *= (1 - λ2μ)^(k - A[j])
          W[j] = W[j] + λ(yi - pi)xij
          A[j] = k
  Time goes from O(nmT) back to O(nT), where n = number of non-zero entries, m = number of features, T = number of passes over the data; memory use doubles (m ⇒ 2m).
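For concreteness, a runnable Python sketch of this final algorithm, under the same illustrative assumptions as before (sparse dict examples, y in {0,1}), reading the slide’s (1 - λ2μ) as (1 - 2λμ); following the later “fixes” slide, the postponed decay is applied to an example’s features before computing the prediction:

import math

def train_lazy_l2(examples, lam, mu, T):
    """Sparse SGD for L2-regularized logistic regression with lazy weight decay.
    examples: list of (x, y) with x a dict feature -> value and y in {0,1}."""
    W = {}   # feature -> weight
    A = {}   # feature -> value of the example counter k at the feature's last update
    k = 0
    for t in range(T):
        for x, y in examples:            # ideally streamed in random order
            k += 1
            # apply the postponed decay to this example's features, then predict
            for j in x:
                W[j] = W.get(j, 0.0) * (1 - 2 * lam * mu) ** (k - A.get(j, k))
            score = sum(W[j] * xj for j, xj in x.items())
            p = 1.0 / (1.0 + math.exp(-score))
            # usual sparse update, recording when each feature was last touched
            for j, xj in x.items():
                W[j] += lam * (y - p) * xj
                A[j] = k
    return W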

Outline [other stuff]: Logistic regression and SGD. Learning as optimization; logistic regression: a linear classifier optimizing P(y|x); stochastic gradient descent, “streaming optimization” for ML problems; regularized logistic regression; sparse regularized logistic regression; memory-saving logistic regression.

Pop quiz. In text classification, most words are: rare / not correlated with any class / given low weights in the LR classifier / unlikely to affect classification / not very interesting.

Pop quiz. In text classification, most bigrams are: rare / not correlated with any class / given low weights in the LR classifier / unlikely to affect classification / not very interesting.

Pop quiz. Most of the weights in a classifier are: important / not important.

How can we exploit this? One idea: combine uncommon words together randomly. Examples: replace all occurrences of “humanitarianism” or “biopsy” with “humanitarianismOrBiopsy”; replace all occurrences of “schizoid” or “duchy” with “schizoidOrDuchy”; replace all occurrences of “gynecologist” or “constrictor” with “gynecologistOrConstrictor”; … For Naïve Bayes this breaks independence assumptions; it’s not obviously a problem for logistic regression, though: I could combine two low-weight words (won’t matter much), a low-weight and a high-weight word (won’t matter much), or two high-weight words (not very likely to happen). How much of this can I get away with? Certainly a little. Is it enough to make a difference? How much memory does it save?

How can we exploit this? Another observation: the values in my hash table are weights; the keys in my hash table are strings for the feature names. We need them to avoid collisions. But maybe we don’t care about collisions? Allowing “schizoid” & “duchy” to collide is equivalent to replacing all occurrences of “schizoid” or “duchy” with “schizoidOrDuchy”.
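In code, “not caring about collisions” just means indexing weights by a hash of the feature name modulo a fixed table size R; a minimal sketch (R and the choice of hash are illustrative):

import zlib

def hashed_feature_vector(words, R):
    """Map a bag of words to a dense vector of size R via the hash trick.
    Colliding words simply share a slot, and hence share a weight."""
    V = [0.0] * R
    for w in words:
        h = zlib.crc32(w.encode("utf-8")) % R   # a stable hash into R buckets
        V[h] += 1.0
    return V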

Learning as optimization for regularized logistic regression. Algorithm:
  Initialize hashtables W, A and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      pi = … ; k++
      For each feature j: xij > 0:
        W[j] *= (1 - λ2μ)^(k - A[j])
        W[j] = W[j] + λ(yi - pi)xij
        A[j] = k

Learning as optimization for regularized logistic regression. Algorithm:
  Initialize arrays W, A of size R and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      Let V be a hash table so that V[h] sums the values of the features of xi that hash to h
      pi = … ; k++
      For each hash value h: V[h] > 0:
        W[h] *= (1 - λ2μ)^(k - A[h])
        W[h] = W[h] + λ(yi - pi)V[h]
        A[h] = k
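A Python sketch of this hashed variant, combining the hash trick with the lazy decay above; the bucket count R and the specific hash function are illustrative choices, not from the slides:

import math
import zlib

def train_hashed_lazy_l2(examples, R, lam, mu, T):
    """Hash-kernel SGD for L2-regularized logistic regression with lazy decay.
    examples: list of (words, y) with words a list of strings and y in {0,1}."""
    W = [0.0] * R
    A = [0] * R
    k = 0
    for t in range(T):
        for words, y in examples:
            k += 1
            # build the hashed feature vector V for this example
            V = {}
            for w in words:
                h = zlib.crc32(w.encode("utf-8")) % R
                V[h] = V.get(h, 0.0) + 1.0
            # lazy decay on the buckets this example touches, then predict
            for h in V:
                W[h] *= (1 - 2 * lam * mu) ** (k - A[h])
            p = 1.0 / (1.0 + math.exp(-sum(W[h] * V[h] for h in V)))
            # sparse update on the touched buckets
            for h, vh in V.items():
                W[h] += lam * (y - p) * vh
                A[h] = k
    return W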

Learning as optimization for regularized logistic regression. Algorithm:
  Initialize arrays W, A of size R and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      Let V be a hash table so that V[h] sums the values of the features of xi that hash to h
      pi = … ; k++
      ???

Fixes and optimizations. This is the basic idea, but: we need to apply “weight decay” to the features in an example before we compute the prediction; we need to apply “weight decay” before we save the learned classifier. My suggestion: an abstraction for a logistic regression classifier.

A possible SGD implementation:
class SGDLogisticRegression {
  /** Predict using current weights **/
  double predict(Map features);
  /** Apply weight decay to a single feature and record when in A[ ] **/
  void regularize(String feature, int currentK);
  /** Regularize all features, then save to disk **/
  void save(String fileName, int currentK);
  /** Load a saved classifier **/
  static SGDLogisticRegression load(String fileName);
  /** Train on one example **/
  void train1(Map features, double trueLabel, int k) {
    // regularize each feature
    // predict and apply update
  }
}
// main ‘train’ program assumes a stream of randomly-ordered examples and outputs the classifier to disk; main ‘test’ program prints predictions for each test case in the input.

A possible SGD implementation: class SGDLogisticRegression { … } // main ‘train’ program assumes a stream of randomly-ordered examples and outputs the classifier to disk; main ‘test’ program prints predictions for each test case in the input. <100 lines (in Python). Other mains: a “shuffler” that streams through a training file T times and outputs instances, randomly ordered as much as possible given a buffer of size B; something to collect predictions + true labels and produce error rates, etc.

A possible SGD implementation. Parameter settings for W[j] *= (1 - λ2μ)^(k-A[j]) and W[j] = W[j] + λ(yi - pi)xij: I didn’t tune especially, but used μ = 0.1 and λ = η*E^-2, where E is the “epoch” (the number of times you’ve iterated over the dataset, starting at E = 1) and η = ½.

ICML 2009

An interesting example: spam filtering for Yahoo mail. Lots of examples and lots of users. Two options: one filter for everyone (but users disagree); one filter for each user (but some users are lazy and don’t label anything). Third option: classify (msg, user) pairs. Features of message i are its words wi,1,…,wi,ki; the feature of the user is his/her id u; features of the pair are: wi,1,…,wi,ki and uwi,1,…,uwi,ki (the user id concatenated with each word). Based on an idea by Hal Daumé.

An example E.g., this email to wcohen features: dear, madam, sir,…. investment, broker,…, wcohendear, wcohenmadam, wcohen,…, idea: the learner will figure out how to personalize my spam filter by using the wcohenX features
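A minimal sketch of generating these personalized features; the exact feature-string format is an assumption, the slides simply concatenate the user id with each word:

def personalized_features(user, words):
    """Return the shared word features plus user-specific copies (e.g. 'wcohendear')."""
    feats = list(words)                    # shared features: the words themselves
    feats.append(user)                     # the user id itself is also a feature
    feats.extend(user + w for w in words)  # personalized copies: useridword
    return feats

# e.g. personalized_features("wcohen", ["dear", "madam"])
#   -> ["dear", "madam", "wcohen", "wcohendear", "wcohenmadam"]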

An example Compute personalized features and multiple hashes on-the-fly: a great opportunity to use several processors and speed up i/o

Experiments 3.2M emails 40M tokens 430k users 16T unique features – after personalization

An example. 2^26 entries = 1 Gb @ 8 bytes/weight; 3.2M emails, 400k users, 40M tokens.

Review of Kernels

The kernel perceptron. Compute: ŷi = vk · xi for instance xi. If mistake: vk+1 = vk + yi xi. Mathematically the same as before, but allows use of the kernel trick.

The kernel perceptron. Compute: ŷi = vk · xi for instance xi. If mistake: vk+1 = vk + yi xi. Mathematically the same as before, but allows use of the “kernel trick”. Other kernel methods (SVM, Gaussian processes) aren’t constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.

Some common kernels Linear kernel: Polynomial kernel: Gaussian kernel: More later….
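The kernel formulas appear on the slide only as images; the standard forms (with illustrative hyperparameters c, d, and sigma) are sketched below:

import math

def linear_kernel(x, z):
    """K(x,z) = x . z"""
    return sum(xi * zi for xi, zi in zip(x, z))

def polynomial_kernel(x, z, c=1.0, d=2):
    """K(x,z) = (x . z + c)^d"""
    return (linear_kernel(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x,z) = exp(-||x - z||^2 / (2 sigma^2))"""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))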

Stopped …

Kernels 101: duality and computational properties; Reproducing Kernel Hilbert Space (RKHS); Gram matrix; positive semi-definiteness; closure properties.

Kernels 101. Duality: two ways to look at this. Explicitly map from x to φ(x), i.e. to the point corresponding to x in the Hilbert space, or implicitly map from x to φ(x) by changing the kernel function K. Two different computational ways of getting the same behavior.

Kernels 101. Duality. Gram matrix K: kij = K(xi,xj). K(x,x’) = K(x’,x) ⇒ the Gram matrix is symmetric. K(x,x) > 0 ⇒ the diagonal of K is positive. K is “positive semi-definite” ⇒ zT K z >= 0 for all z.

Kernels 101. Duality. Gram matrix K: kij = K(xi,xj). K(x,x’) = K(x’,x) ⇒ the Gram matrix is symmetric. K(x,x) > 0 ⇒ the diagonal of K is positive. K is “positive semi-definite” ⇒ … ⇒ zT K z >= 0 for all z. Fun fact: Gram matrix positive semi-definite ⇒ K(xi,xj) = ⟨φ(xi), φ(xj)⟩ for some φ. Proof: φ(x) uses the eigenvectors of K to represent x.

Hash Kernels “The Hash Trick”

Kernels 101. Duality, Gram matrix, positive semi-definiteness. Closure properties: if K and K’ are kernels, then K + K’ is a kernel; cK is a kernel, if c > 0; aK + bK’ is a kernel, for a, b > 0; …

Pop quiz Most of the weights in a classifier are small and not important

How can we exploit this? One idea: combine uncommon words together randomly. Examples: replace all occurrences of “humanitarianism” or “biopsy” with “humanitarianismOrBiopsy”; replace all occurrences of “schizoid” or “duchy” with “schizoidOrDuchy”; replace all occurrences of “gynecologist” or “constrictor” with “gynecologistOrConstrictor”; … For Naïve Bayes this breaks independence assumptions; it’s not obviously a problem for logistic regression, though: I could combine two low-weight words (won’t matter much), a low-weight and a high-weight word (won’t matter much), or two high-weight words (not very likely to happen). How much of this can I get away with? Certainly a little. Is it enough to make a difference? How much memory does it save?

How can we exploit this? Another observation: the values in my hash table are weights; the keys in my hash table are strings for the feature names. We need them to avoid collisions. But maybe we don’t care about collisions? Allowing “schizoid” & “duchy” to collide is equivalent to replacing all occurrences of “schizoid” or “duchy” with “schizoidOrDuchy”.

Learning as optimization for regularized logistic regression. Algorithm:
  Initialize hashtables W, A and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      pi = … ; k++
      For each feature j: xij > 0:
        W[j] *= (1 - λ2μ)^(k - A[j])
        W[j] = W[j] + λ(yi - pi)xij
        A[j] = k

Learning as optimization for regularized logistic regression. Algorithm:
  Initialize arrays W, A of size R and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      Let V be a hash table so that V[h] sums the values of the features of xi that hash to h
      pi = … ; k++
      For each hash value h: V[h] > 0:
        W[h] *= (1 - λ2μ)^(k - A[h])
        W[h] = W[h] + λ(yi - pi)V[h]
        A[h] = k

Learning as optimization for regularized logistic regression. Algorithm:
  Initialize arrays W, A of size R and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      Let V be a hash table so that V[h] sums the values of the features of xi that hash to h
      pi = … ; k++
      For each hash value h: V[h] > 0:
        W[h] *= (1 - λ2μ)^(k - A[h])
        W[h] = W[h] + λ(yi - pi)V[h]
        A[h] = k
  ???

Learning as optimization for regularized logistic regression. Algorithm:
  Initialize arrays W, A of size R and set k = 0
  For each iteration t = 1,…,T
    For each example (xi, yi)
      Let V be a hash table so that V[h] sums the values of the features of xi that hash to h
      pi = … ; k++
      ???

ICML 2009

An interesting example: spam filtering for Yahoo mail. Lots of examples and lots of users. Two options: one filter for everyone (but users disagree); one filter for each user (but some users are lazy and don’t label anything). Third option: classify (msg, user) pairs. Features of message i are its words wi,1,…,wi,ki; the feature of the user is his/her id u; features of the pair are: wi,1,…,wi,ki and uwi,1,…,uwi,ki (the user id concatenated with each word). Based on an idea by Hal Daumé, used for transfer learning.

An example E.g., this email to wcohen features: dear, madam, sir,…. investment, broker,…, wcohendear, wcohenmadam, wcohen,…, idea: the learner will figure out how to personalize my spam filter by using the wcohenX features

An example Compute personalized features and multiple hashes on-the-fly: a great opportunity to use several processors and speed up i/o

Experiments 3.2M emails 40M tokens 430k users 16T unique features – after personalization

An example. 2^26 entries = 1 Gb @ 8 bytes/weight; 3.2M emails, 400k users, 40M tokens.

Some details. Slightly different hash to avoid systematic bias; m is the number of buckets you hash into (R in my discussion).

Some details. I.e., a hashed vector is probably close to the original vector.

Some details. I.e., the inner products between x and x’ are probably not changed too much by the hash function: a classifier will probably still work.

Some details. [Proof shown on slide.] Should read this proof.

The hash kernel: implementation. One problem: debugging is harder. Features are no longer meaningful, and there’s a new way to ruin a classifier: change the hash function. You can separately compute the set of all words that hash to h and guess what the features mean: build an inverted index h ⇒ w1, w2, …
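A sketch of that inverted index, assuming you can still stream over the raw vocabulary and that the hash matches the one used for training:

import zlib
from collections import defaultdict

def build_inverted_index(vocabulary, R):
    """Map each hash bucket h back to the list of words that land in it."""
    index = defaultdict(list)
    for w in vocabulary:
        h = zlib.crc32(w.encode("utf-8")) % R
        index[h].append(w)
    return index   # index[h] = [w1, w2, ...], the words a learned weight W[h] might refer to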

A variant of feature hashing. Hash each feature multiple times with different hash functions; now each w has k chances to not collide with another useful w’. An easy way to get multiple hash functions: generate some random strings s1,…,sL and let the k-th hash function for w be the ordinary hash of the concatenation of w and sk.
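A minimal sketch of this salted multi-hash trick; the salt strings and the specific base hash are illustrative:

import zlib

SALTS = ["salt0", "salt1", "salt2"]   # in practice, random strings s1,...,sL

def multi_hashes(w, R, k=2):
    """Return k hash-bucket indices for feature w, one per 'hash function' (k <= len(SALTS)).
    The i-th hash function is just the ordinary hash of w concatenated with salt s_i."""
    return [zlib.crc32((w + SALTS[i]).encode("utf-8")) % R for i in range(k)]

# Each feature then contributes to k buckets of the hashed vector,
# giving it k chances not to collide with another useful feature.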

A variant of feature hashing. Why would this work? Claim: with 100,000 features and 100,000,000 buckets: k=1 ⇒ Pr(any duplication) ≈ 1; k=2 ⇒ Pr(any duplication) ≈ 0.4; k=3 ⇒ Pr(any duplication) ≈ 0.01.

Experiments (Shi et al., JMLR 2009, http://jmlr.csail.mit.edu/papers/v10/shi09a.html). RCV1 dataset. Use hash kernels with a TF-approx-IDF representation: TF and IDF are done at the level of hashed features, not words; IDF is approximated from a subset of the data. Not sure of the value of “c” or the data subset.

Results

Results

Results