Efficient Logistic Regression with Stochastic Gradient Descent


1 Efficient Logistic Regression with Stochastic Gradient Descent
William Cohen

2 Outline
Logistic regression and SGD
Learning as optimization
Logistic regression: a linear classifier optimizing P(y|x)
Stochastic gradient descent: "streaming optimization" for ML problems
Regularized logistic regression
Sparse regularized logistic regression
Memory-saving logistic regression

3 Summary on perceptron
We showed that:
If there exists a u with unit norm that has margin γ on the examples in the sequence (x1,y1), (x2,y2), ...
Then the perceptron algorithm makes fewer than R²/γ² mistakes on the sequence (where R >= ||xi|| for all i)
This bound is independent of the dimension of the data or the classifier (!)
This doesn't follow from M(C) <= VCDim(C)
Here: a small-norm classifier and a high margin → success
Question: what if you explicitly look for a classifier with the smallest norm and the largest margin?
maximize margin subject to a constraint on the norm
minimize norm subject to a constraint on the margin
Designing ML algorithms as optimization

4 Learning as optimization: general procedure
Goal: Learn the parameter θ of ...
Dataset: D = {x1,...,xn} or maybe D = {(x1,y1),...,(xn,yn)}
Write down Pr(D|θ) as a function of θ
Maybe easier to work with log Pr(D|θ)
Maximize by differentiating and setting to zero
Or, use gradient descent: repeatedly take a small step in the direction of the gradient

5 Learning as optimization: general procedure for SGD
Goal: Learn the parameter θ of ...
Dataset: D = {x1,...,xn} or maybe D = {(x1,y1),...,(xn,yn)}
Write down Pr(D|θ) as a function of θ
Big-data problem: we don't want to load all the data D into memory, and the gradient depends on all the data
Solution: pick a small subset of examples B << D
approximate the gradient using them
"on average" this is the right direction
take a step in that direction
repeat....
B = one example is a very popular choice
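A minimal Python sketch of the stochastic gradient loop described on this slide. The loss, the gradient function, and the data format are placeholders of my own, not from the slides; the point is only the pattern of sampling a small batch B and stepping along the approximate gradient of log Pr(D|θ).

    import random

    def sgd(examples, grad_fn, theta, learning_rate=0.1, batch_size=1, n_steps=1000):
        # Generic stochastic gradient loop: estimate the gradient from a small
        # batch B << D; "on average" the batch gradient points the right way.
        # grad_fn(batch, theta) is assumed to return the average gradient of
        # log Pr(batch | theta) as a list of floats.
        for _ in range(n_steps):
            batch = random.sample(examples, batch_size)   # B = one example is a popular choice
            g = grad_fn(batch, theta)
            theta = [t + learning_rate * gi for t, gi in zip(theta, g)]  # ascend the log-likelihood
        return theta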

6 Learning as optimization for logistic regression
Goal: Learn the parameter w of the classifier
Probability of a single example (y,x) would be p = P(y=1|x,w) = 1 / (1 + exp(-x·w)), i.e. P(y|x,w) is p when y=1 and (1-p) when y=0
Or with logs: the log conditional likelihood of the example is y·log p + (1-y)·log(1-p)
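To make this concrete, a small Python sketch (my own helper names, not from the slides) that computes p = σ(x·w) for a sparse example and the per-example gradient of the log conditional likelihood, which works out to (y - p)·xj for each feature j, with y in {0,1}.

    import math

    def predict(w, x):
        # p = sigma(x . w) for a sparse example x given as {feature: value}
        score = sum(w.get(j, 0.0) * xj for j, xj in x.items())
        return 1.0 / (1.0 + math.exp(-score))

    def lcl_gradient(w, x, y):
        # Per-example gradient of the log conditional likelihood: (y - p) * x_j,
        # which is nonzero only for the features actually present in x.
        p = predict(w, x)
        return {j: (y - p) * xj for j, xj in x.items()}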

7

8 p

9

10 An observation
Key computational point: if xj = 0 then the gradient of wj is zero, so when processing an example you only need to update the weights for that example's non-zero features.

11 Learning as optimization for logistic regression
The algorithm: stream over the examples -- do this in random order -- and for each example (xi,yi) compute pi, then update wj = wj + λ(yi - pi)xij for each non-zero feature j.
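A runnable Python sketch of this plain (unregularized) SGD loop, using the sparse-update observation from slide 10. The data format (a list of (dict, label) pairs with labels in {0,1}) and the default step size are assumptions for illustration.

    import math, random

    def train_logistic_sgd(examples, lam=0.1, n_epochs=5):
        # Plain SGD for logistic regression with sparse updates.
        # examples: list of (x, y) with x a dict {feature: value} and y in {0, 1}.
        w = {}
        for _ in range(n_epochs):
            random.shuffle(examples)              # "do this in random order"
            for x, y in examples:
                score = sum(w.get(j, 0.0) * xj for j, xj in x.items())
                p = 1.0 / (1.0 + math.exp(-score))
                for j, xj in x.items():           # touch only the non-zero features
                    w[j] = w.get(j, 0.0) + lam * (y - p) * xj
        return w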

12 Another observation
Consider averaging the gradient over all the examples D = {(x1,y1),...,(xn,yn)}

13 Learning as optimization for logistic regression
Goal: Learn the parameter θ of a classifier
Which classifier? We've seen y = sign(x · w), but sign is not continuous...
Convenient alternative: replace sign with the logistic function
Practical problem: this overfits badly with sparse features
e.g., if feature j appears only in positive examples, the gradient of wj is always positive!

14 Outline
Logistic regression and SGD
Learning as optimization
Logistic regression: a linear classifier optimizing P(y|x)
Stochastic gradient descent: "streaming optimization" for ML problems
Regularized logistic regression
Sparse regularized logistic regression
Memory-saving logistic regression

15 Regularized logistic regression
Replace LCL with LCL plus a penalty for large weights: the objective becomes LCL - μ·Σj wj², i.e. maximize the log conditional likelihood minus an L2 penalty on the weights.

16 Regularized logistic regression
Replace LCL with LCL plus a penalty for large weights, as above.
So the update for wj becomes: wj = wj + λ((yi - pi)xij - 2μwj)
Or, equivalently: wj = wj - λ·2μ·wj + λ(yi - pi)xij

17 Learning as optimization for logistic regression
Algorithm: -- do this in random order

18 Learning as optimization for regularized logistic regression
Algorithm: Time goes from O(nT) to O(nmT) where n = number of non-zero entries, m = number of features, T = number of passes over data

19 Learning as optimization for regularized logistic regression
Final algorithm:
Initialize hashtable W
For each iteration t=1,...,T
  For each example (xi,yi)
    pi = ...
    For each feature W[j]
      W[j] = W[j] - λ2μW[j]
      If xij > 0 then W[j] = W[j] + λ(yi - pi)xij

20 Learning as optimization for regularized logistic regression
Final algorithm:
Initialize hashtable W
For each iteration t=1,...,T
  For each example (xi,yi)
    pi = ...
    For each feature W[j]
      W[j] *= (1 - λ2μ)
      If xij > 0 then W[j] = W[j] + λ(yi - pi)xij

21 Learning as optimization for regularized logistic regression
Final algorithm:
Initialize hashtable W
For each iteration t=1,...,T
  For each example (xi,yi)
    pi = ...
    For each feature W[j]
      If xij > 0 then
        W[j] *= (1 - λ2μ)^A
        W[j] = W[j] + λ(yi - pi)xij
A is the number of examples seen since the last time we did an x>0 update on W[j]

22 Learning as optimization for regularized logistic regression
Final algorithm:
Initialize hashtables W, A and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    pi = ... ; k++
    For each feature W[j]
      If xij > 0 then
        W[j] *= (1 - λ2μ)^(k-A[j])
        W[j] = W[j] + λ(yi - pi)xij
        A[j] = k
k - A[j] is the number of examples seen since the last time we did an x>0 update on W[j]

23 Learning as optimization for regularized logistic regression
Final algorithm:
Initialize hashtables W, A and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    pi = ... ; k++
    For each feature W[j]
      If xij > 0 then
        W[j] *= (1 - λ2μ)^(k-A[j])
        W[j] = W[j] + λ(yi - pi)xij
        A[j] = k
Time goes from O(nmT) back to O(nT), where n = number of non-zero entries, m = number of features, T = number of passes over the data
Memory use doubles (m → 2m)
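A runnable Python rendering of the final algorithm above. The data structures are my own assumptions (examples are (dict, label) pairs with labels in {0,1}); the essential point is the lazy weight decay, where a feature's weight is shrunk by (1 - λ2μ) raised to the number of examples seen since its last update, and only when the feature is next touched.

    import math, random

    def train_regularized_sgd(examples, lam=0.1, mu=0.1, n_epochs=5):
        # Sparse regularized SGD for logistic regression with lazy weight decay.
        # examples: list of (x, y), x a dict {feature: value}, y in {0, 1}.
        W, A, k = {}, {}, 0          # weights, last-update counters, example counter
        for _ in range(n_epochs):
            random.shuffle(examples)
            for x, y in examples:
                k += 1
                # settle the decay owed to this example's features *before* predicting
                for j in x:
                    if j in W:
                        W[j] *= (1.0 - 2 * lam * mu) ** (k - A.get(j, k))
                    A[j] = k
                score = sum(W.get(j, 0.0) * xj for j, xj in x.items())
                p = 1.0 / (1.0 + math.exp(-score))
                for j, xj in x.items():          # update only the non-zero features
                    W[j] = W.get(j, 0.0) + lam * (y - p) * xj
        # settle the remaining decay before using or saving the classifier (slide 33's point)
        for j in W:
            W[j] *= (1.0 - 2 * lam * mu) ** (k - A.get(j, k))
        return W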

24 Outline [other stuff]
Logistic regression and SGD
Learning as optimization
Logistic regression: a linear classifier optimizing P(y|x)
Stochastic gradient descent: "streaming optimization" for ML problems
Regularized logistic regression
Sparse regularized logistic regression
Memory-saving logistic regression

25 Pop quiz
In text classification most words are:
rare
not correlated with any class
given low weights in the LR classifier
unlikely to affect classification
not very interesting

26 Pop quiz
In text classification most bigrams are:
rare
not correlated with any class
given low weights in the LR classifier
unlikely to affect classification
not very interesting

27 Pop quiz
Most of the weights in a classifier are:
important
not important

28 How can we exploit this?
One idea: combine uncommon words together randomly
Examples:
replace all occurrences of "humanitarianism" or "biopsy" with "humanitarianismOrBiopsy"
replace all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"
replace all occurrences of "gynecologist" or "constrictor" with "gynecologistOrConstrictor"
For Naïve Bayes this breaks independence assumptions
It's not obviously a problem for logistic regression, though: I could combine
two low-weight words (won't matter much)
a low-weight and a high-weight word (won't matter much)
two high-weight words (not very likely to happen)
How much of this can I get away with?
certainly a little
is it enough to make a difference? how much memory does it save?

29 How can we exploit this?
Another observation:
the values in my hash table are weights
the keys in my hash table are strings for the feature names
We need them to avoid collisions
But maybe we don't care about collisions?
Allowing "schizoid" and "duchy" to collide is equivalent to replacing all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"

30 Learning as optimization for regularized logistic regression
Algorithm:
Initialize hashtables W, A and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    pi = ... ; k++
    For each feature j: xij > 0:
      W[j] *= (1 - λ2μ)^(k-A[j])
      W[j] = W[j] + λ(yi - pi)xij
      A[j] = k

31 Learning as optimization for regularized logistic regression
Algorithm:
Initialize arrays W, A of size R and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    Let V be a hash table so that V[h] = sum of xij over the features j that hash to h
    pi = ... ; k++
    For each hash value h: V[h] > 0:
      W[h] *= (1 - λ2μ)^(k-A[h])
      W[h] = W[h] + λ(yi - pi)V[h]
      A[h] = k
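A Python sketch of this hashed variant (my own rendering, assuming features are strings). It uses Python's built-in hash modulo R, which is salted per process and so not reproducible across runs; the paper discussed below also uses a slightly different hash to avoid systematic bias (slide 69). Those details are omitted here.

    import math, random

    def hash_features(x, R):
        # Collapse a sparse example {feature_name: value} into R hashed buckets:
        # V[h] = sum of the values of the features that hash to bucket h.
        V = {}
        for j, xj in x.items():
            h = hash(j) % R
            V[h] = V.get(h, 0.0) + xj
        return V

    def train_hashed_sgd(examples, R=2**20, lam=0.1, mu=0.1, n_epochs=5):
        # Regularized SGD for logistic regression over hashed features.
        # W and A are fixed-size arrays of length R, so memory no longer grows
        # with the number of distinct feature strings.
        W, A, k = [0.0] * R, [0] * R, 0
        for _ in range(n_epochs):
            random.shuffle(examples)
            for x, y in examples:
                V = hash_features(x, R)
                k += 1
                for h in V:                          # lazy weight decay, as before
                    W[h] *= (1.0 - 2 * lam * mu) ** (k - A[h])
                    A[h] = k
                p = 1.0 / (1.0 + math.exp(-sum(W[h] * vh for h, vh in V.items())))
                for h, vh in V.items():
                    W[h] += lam * (y - p) * vh
        return W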

32 Learning as optimization for regularized logistic regression
Algorithm:
Initialize arrays W, A of size R and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    Let V be a hash table so that ...
    pi = ... ; k++
    ???

33 Fixes and optimizations
This is the basic idea, but:
we need to apply "weight decay" to the features in an example before we compute the prediction
we need to apply "weight decay" before we save the learned classifier
my suggestion: an abstraction for a logistic regression classifier
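A small Python sketch of these two fixes (the function names and storage format are assumptions of mine): any outstanding lazy decay on a feature is settled before its weight is read, both at prediction time and before saving the classifier.

    import math

    def regularize(W, A, j, k, lam, mu):
        # Apply the weight decay owed to feature j since its last update and record when.
        if j in W:
            W[j] *= (1.0 - 2 * lam * mu) ** (k - A.get(j, k))
        A[j] = k

    def predict(W, A, x, k, lam, mu):
        # Settle the decay on the example's features, then compute p = sigma(x . w).
        for j in x:
            regularize(W, A, j, k, lam, mu)
        score = sum(W.get(j, 0.0) * xj for j, xj in x.items())
        return 1.0 / (1.0 + math.exp(-score))

    def save(W, A, k, lam, mu, path):
        # Settle the decay on every feature before writing the classifier to disk.
        for j in list(W):
            regularize(W, A, j, k, lam, mu)
        with open(path, "w") as f:
            for j, wj in W.items():
                f.write(f"{j}\t{wj}\n")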

34 A possible SGD implementation
class SGDLogisticRegression {
  /** Predict using current weights **/
  double predict(Map features);
  /** Apply weight decay to a single feature and record when in A[ ] **/
  void regularize(String feature, int currentK);
  /** Regularize all features then save to disk **/
  void save(String fileName, int currentK);
  /** Load a saved classifier **/
  static SGDClassifier load(String fileName);
  /** Train on one example **/
  void train1(Map features, double trueLabel, int k) {
    // regularize each feature
    // predict and apply update
  }
}
// main 'train' program assumes a stream of randomly-ordered examples and outputs the classifier to disk;
// main 'test' program prints predictions for each test case in the input.

35 A possible SGD implementation
class SGDLogisticRegression { ... }
// main 'train' program assumes a stream of randomly-ordered examples and outputs the classifier to disk;
// main 'test' program prints predictions for each test case in the input.
< 100 lines (in Python)
Other mains:
A "shuffler": stream through a training file T times and output the instances; the output is randomly ordered, as much as possible, given a buffer of size B
Something to collect the predictions + true labels and produce error rates, etc.

36 A possible SGD implementation
Parameter settings:
W[j] *= (1 - λ2μ)^(k-A[j])
W[j] = W[j] + λ(yi - pi)xij
I didn't tune especially, but used μ = 0.1 and λ = η·E^-2, where E is the "epoch" and η = ½
epoch: the number of times you've iterated over the dataset, starting at E = 1
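For concreteness, a literal reading of that learning-rate schedule as a one-liner; treat the exact form (λ = η·E^-2 with η = 0.5) as this deck's choice rather than a general recommendation.

    def learning_rate(epoch, eta=0.5):
        # lambda = eta * E^-2, with epochs counted from E = 1
        return eta * epoch ** -2

    # epochs 1..5 -> 0.5, 0.125, 0.0556, 0.03125, 0.02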

37 ICML 2009

38 An interesting example
Spam filtering for Yahoo mail
Lots of examples and lots of users
Two options:
one filter for everyone -- but users disagree
one filter for each user -- but some users are lazy and don't label anything
Third option:
classify (msg,user) pairs
features of message i are words wi,1,...,wi,ki
feature of user is his/her id u
features of the pair are: wi,1,...,wi,ki and uwi,1,...,uwi,ki
based on an idea by Hal Daumé
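A tiny sketch of this feature construction (illustrative only; the function name is mine, and the user/word concatenation follows the example on the next slide).

    def personalized_features(words, user_id):
        # Features of a (message, user) pair: the words themselves, plus
        # user-specific copies formed by prefixing each word with the user id,
        # plus the user id itself.
        return list(words) + [user_id + w for w in words] + [user_id]

    # personalized_features(["dear", "madam"], "wcohen")
    # -> ['dear', 'madam', 'wcohendear', 'wcohenmadam', 'wcohen']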

39 An example
E.g., this email to wcohen has features:
dear, madam, sir, ..., investment, broker, ..., wcohendear, wcohenmadam, wcohen, ...
Idea: the learner will figure out how to personalize my spam filter by using the wcohenX features

40 An example Compute personalized features and multiple hashes on-the-fly: a great opportunity to use several processors and speed up i/o

41 Experiments
3.2M emails
40M tokens
430k users
16T unique features -- after personalization

42 An example
2^26 entries = 1 Gb @ 8 bytes/weight
3.2M emails
400k users
40M tokens

43

44 Review of Kernels

45 The kernel perceptron
[diagram: an instance xi is sent to the learner, which outputs a prediction ŷi and then receives the true label yi]
Compute: ŷi = vk · xi
If mistake: vk+1 = vk + yi xi
Mathematically the same as before … but allows use of the kernel trick

46 The kernel perceptron
[same diagram as the previous slide]
Compute: ŷi = vk · xi
If mistake: vk+1 = vk + yi xi
Mathematically the same as before … but allows use of the "kernel trick"
Other kernel methods (SVM, Gaussian processes) aren't constrained to the limited set (+1/-1/0) of weights on the K(x,v) values.
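A minimal kernel perceptron sketch in Python, assuming labels in {-1,+1} and a kernel function K of your choosing. Instead of an explicit weight vector it keeps the mistake examples and their labels, which is exactly what makes the kernel trick possible.

    def kernel_perceptron_train(examples, K, n_epochs=5):
        # examples: list of (x, y) with y in {-1, +1}; K(x1, x2) is the kernel.
        # The hypothesis is the list of mistakes (x, y), since
        # v = sum over mistakes of y * phi(x).
        mistakes = []
        for _ in range(n_epochs):
            for x, y in examples:
                score = sum(yk * K(xk, x) for xk, yk in mistakes)
                if y * score <= 0:            # mistake (or no opinion yet)
                    mistakes.append((x, y))
        return mistakes

    def kernel_perceptron_predict(mistakes, K, x):
        return 1 if sum(yk * K(xk, x) for xk, yk in mistakes) > 0 else -1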

47 Some common kernels
Linear kernel: K(x,x') = x·x'
Polynomial kernel: K(x,x') = (x·x' + 1)^d
Gaussian kernel: K(x,x') = exp(-||x - x'||² / (2σ²))
More later….
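The same kernels as plain Python functions over dense lists; the polynomial offset/degree and Gaussian bandwidth defaults are arbitrary illustrative choices.

    import math

    def linear_kernel(x, z):
        return sum(xi * zi for xi, zi in zip(x, z))

    def polynomial_kernel(x, z, degree=2, c=1.0):
        return (linear_kernel(x, z) + c) ** degree

    def gaussian_kernel(x, z, sigma=1.0):
        sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
        return math.exp(-sq_dist / (2 * sigma ** 2))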

48 Stopped …

49 Kernels 101
Duality and computational properties
Reproducing Kernel Hilbert Space (RKHS)
Gram matrix
Positive semi-definite
Closure properties

50 Kernels 101
Duality: two ways to look at this
Explicitly map from x to φ(x) – i.e. to the point corresponding to x in the Hilbert space
Implicitly map from x to φ(x) by changing the kernel function K
Two different computational ways of getting the same behavior

51 Kernels 101
Duality
Gram matrix: K: kij = K(xi,xj)
K(x,x') = K(x',x) → the Gram matrix is symmetric
K(x,x) > 0 → the diagonal of K is positive
K is "positive semi-definite" → zT K z >= 0 for all z

52 Kernels 101
Duality
Gram matrix: K: kij = K(xi,xj)
K(x,x') = K(x',x) → the Gram matrix is symmetric
K(x,x) > 0 → the diagonal of K is positive → ... → K is "positive semi-definite" → zT K z >= 0 for all z
Fun fact: Gram matrix positive semi-definite → K(xi,xj) = ⟨φ(xi), φ(xj)⟩ for some φ
Proof: φ(x) uses the eigenvectors of K to represent x
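A small numeric illustration of the fun fact (not a proof), using numpy: eigendecompose the Gram matrix and use sqrt(eigenvalue)-scaled eigenvector coordinates as φ(x); the inner products of these vectors reproduce the kernel values. The toy data and kernel choice are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))                      # 5 toy points in R^3

    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

    eigvals, eigvecs = np.linalg.eigh(K)             # K is symmetric -> real eigenpairs
    assert np.all(eigvals > -1e-10)                  # positive semi-definite (numerically)

    Phi = eigvecs * np.sqrt(np.clip(eigvals, 0, None))  # row i is phi(x_i)
    assert np.allclose(Phi @ Phi.T, K)               # <phi(xi), phi(xj)> recovers K(xi, xj)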

53 Hash Kernels “The Hash Trick”

54 Kernels 101
Duality
Gram matrix
Positive semi-definite
Closure properties: if K and K' are kernels
K + K' is a kernel
cK is a kernel, if c > 0
aK + bK' is a kernel, for a,b > 0

55 Pop quiz
Most of the weights in a classifier are:
small and not important

56 How can we exploit this?
One idea: combine uncommon words together randomly
Examples:
replace all occurrences of "humanitarianism" or "biopsy" with "humanitarianismOrBiopsy"
replace all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"
replace all occurrences of "gynecologist" or "constrictor" with "gynecologistOrConstrictor"
For Naïve Bayes this breaks independence assumptions
It's not obviously a problem for logistic regression, though: I could combine
two low-weight words (won't matter much)
a low-weight and a high-weight word (won't matter much)
two high-weight words (not very likely to happen)
How much of this can I get away with?
certainly a little
is it enough to make a difference? how much memory does it save?

57 How can we exploit this?
Another observation:
the values in my hash table are weights
the keys in my hash table are strings for the feature names
We need them to avoid collisions
But maybe we don't care about collisions?
Allowing "schizoid" and "duchy" to collide is equivalent to replacing all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"

58 Learning as optimization for regularized logistic regression
Algorithm:
Initialize hashtables W, A and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    pi = ... ; k++
    For each feature j: xij > 0:
      W[j] *= (1 - λ2μ)^(k-A[j])
      W[j] = W[j] + λ(yi - pi)xij
      A[j] = k

59 Learning as optimization for regularized logistic regression
Algorithm:
Initialize arrays W, A of size R and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    Let V be a hash table so that V[h] = sum of xij over the features j that hash to h
    pi = ... ; k++
    For each hash value h: V[h] > 0:
      W[h] *= (1 - λ2μ)^(k-A[h])
      W[h] = W[h] + λ(yi - pi)V[h]
      A[h] = k

60 Learning as optimization for regularized logistic regression
Algorithm:
Initialize arrays W, A of size R and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    Let V be a hash table so that V[h] = sum of xij over the features j that hash to h
    pi = ... ; k++
    For each hash value h: V[h] > 0:
      W[h] *= (1 - λ2μ)^(k-A[h])
      W[h] = W[h] + λ(yi - pi)V[h]
      A[h] = k
???

61 Learning as optimization for regularized logistic regression
Algorithm:
Initialize arrays W, A of size R and set k=0
For each iteration t=1,...,T
  For each example (xi,yi)
    Let V be a hash table so that ...
    pi = ... ; k++
    ???

62 ICML 2009

63 An interesting example
Spam filtering for Yahoo mail
Lots of examples and lots of users
Two options:
one filter for everyone -- but users disagree
one filter for each user -- but some users are lazy and don't label anything
Third option:
classify (msg,user) pairs
features of message i are words wi,1,...,wi,ki
feature of user is his/her id u
features of the pair are: wi,1,...,wi,ki and uwi,1,...,uwi,ki
based on an idea by Hal Daumé, used for transfer learning

64 An example
E.g., this email to wcohen has features:
dear, madam, sir, ..., investment, broker, ..., wcohendear, wcohenmadam, wcohen, ...
Idea: the learner will figure out how to personalize my spam filter by using the wcohenX features

65 An example Compute personalized features and multiple hashes on-the-fly: a great opportunity to use several processors and speed up i/o

66 Experiments
3.2M emails
40M tokens
430k users
16T unique features -- after personalization

67 An example
2^26 entries = 1 Gb @ 8 bytes/weight
3.2M emails
400k users
40M tokens

68

69 Some details
Slightly different hash to avoid systematic bias
m is the number of buckets you hash into (R in my discussion)

70 Some details
Slightly different hash to avoid systematic bias
m is the number of buckets you hash into (R in my discussion)

71 Some details
I.e. – a hashed vector is probably close to the original vector

72 Some details I.e. the inner products between x and x’ are probably not changed too much by the hash function: a classifier will probably still work

73 Some details (should read this proof)

74 The hash kernel: implementation
One problem: debugging is harder
Features are no longer meaningful
There's a new way to ruin a classifier: change the hash function
You can separately compute the set of all words that hash to h and guess what the features mean
Build an inverted index h → w1,w2,...
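A tiny sketch of the suggested inverted index (the function name is mine): map each hash bucket back to the feature strings that land in it, so a surprisingly large weight W[h] can be traced to candidate features. As before, Python's built-in hash is only for illustration, since it is salted per process.

    from collections import defaultdict

    def build_inverted_index(vocabulary, R):
        # Map each bucket h to the set of feature strings that hash into it.
        index = defaultdict(set)
        for word in vocabulary:
            index[hash(word) % R].add(word)
        return index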

75 A variant of feature hashing
Hash each feature multiple times with different hash functions
Now, each w has k chances to not collide with another useful w'
An easy way to get multiple hash functions:
Generate some random strings s1,...,sL
Let the k-th hash function for w be the ordinary hash of the concatenation w + sk
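A sketch of that salted multi-hash trick (the salt length, bucket count, and function names are arbitrary illustrative choices): each feature is hashed k times, once per random salt string, so its weight mass is spread over k buckets.

    import random, string

    def make_salts(k, length=8, seed=0):
        # Generate k random salt strings s1,...,sk.
        rng = random.Random(seed)
        return ["".join(rng.choices(string.ascii_lowercase, k=length)) for _ in range(k)]

    def multi_hash(feature, salts, R):
        # The i-th hash of a feature is the ordinary hash of feature + salt_i, modulo R.
        return [hash(feature + s) % R for s in salts]

    # salts = make_salts(k=3); multi_hash("schizoid", salts, R=2**20) -> three bucket ids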

76 A variant of feature hashing
Why would this work?
Claim: with 100,000 features and 100,000,000 buckets:
k=1 → Pr(any duplication) ≈ 1
k=2 → Pr(any duplication) ≈ 0.4
k=3 → Pr(any duplication) ≈ 0.01

77 Experiments
Shi et al, JMLR 2009
RCV1 dataset
Use hash kernels with TF-approx-IDF representation
TF and IDF done at the level of hashed features, not words
IDF is approximated from a subset of the data
Not sure of the value of "c" or the data subset

78 Results

79 Results

80 Results

