1
Stochastic Gradient Descent (SGD)
2
Today's Class: Stochastic Gradient Descent (SGD)
- SGD Recap
- Regression vs Classification
- Generalization / Overfitting / Underfitting
- Regularization
- Momentum Updates / ADAM Updates
3
Our function L(w): L(w) = 3 + (w − 4)^2
4
Our function L(w): L(w) = 3 + (w − 4)^2
Easy way to find a minimum (or maximum): find where ∂L/∂w = 0.
Here ∂L/∂w = 2(w − 4), which is zero when w = 4.
5
Our function L(w): L(w) = 3 + (w − 4)^2
But this is not easy for complex functions:
L(w_1, w_2, ..., w_12) = −log softmax(g(w_1, ..., w_12, x_1))_{label_1} − log softmax(g(w_1, ..., w_12, x_2))_{label_2} − ... − log softmax(g(w_1, ..., w_12, x_n))_{label_n}
6
Our function L(w): L(w) = 3 + (w − 4)^2
Or even for simpler functions:
L(w) = e^{−w} + w^2,  so  ∂L/∂w = −e^{−w} + 2w.
Setting −e^{−w} + 2w = 0: how do you find w? (There is no closed-form solution.)
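As a concrete illustration, here is a minimal numerical sketch (not from the slides) that finds this w using one standard option, Newton's method on the derivative; the starting point and iteration count are arbitrary choices.

```python
import math

# g(w) = dL/dw = -exp(-w) + 2*w has no closed-form root, so we find it
# numerically with a few Newton iterations on g (g'(w) = exp(-w) + 2).
w = 0.0
for _ in range(10):
    g = -math.exp(-w) + 2 * w
    g_prime = math.exp(-w) + 2
    w -= g / g_prime

print(w)  # ~0.3517, where L(w) = exp(-w) + w**2 is minimized
```

Gradient descent, introduced next, is the more general answer the lecture builds toward: it only needs the gradient, not a second derivative or an explicit root-finder.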
7
Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w − lambda · (dL/dw).
[Plot: L(w) versus w, with the current point at w = 12.]
8
Gradient Descent (GD) (idea)
2. Compute the gradient of L(w) at the current point.
3. Recompute w as: w = w − lambda · (dL/dw).
[Plot: after one update, the current point has moved to w = 10.]
9
Gradient Descent (GD) (idea)
2. Compute the gradient of L(w) at the current point.
3. Recompute w as: w = w − lambda · (dL/dw).
[Plot: after another update, the current point has moved to w = 8.]
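A minimal sketch of this loop on the toy function L(w) = 3 + (w − 4)^2, starting from w = 12; the learning rate lam = 0.3 is an arbitrary illustrative value, not one prescribed by the slides.

```python
# Gradient descent on L(w) = 3 + (w - 4)**2, starting at w = 12.
lam = 0.3          # learning rate (illustrative value)
w = 12.0
for step in range(15):
    grad = 2 * (w - 4)        # dL/dw
    w = w - lam * grad        # w = w - lambda * (dL/dw)
    print(step, round(w, 4))  # w moves toward the minimizer w = 4
```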
10
Gradient Descent (GD)

    λ = 0.01
    Initialize w and b randomly
    for e = 0, num_epochs do
        Compute: dL(w,b)/dw and dL(w,b)/db
        Update w: w = w − λ · dL(w,b)/dw
        Update b: b = b − λ · dL(w,b)/db
        Print: L(w,b)   // Useful to see if this is becoming smaller or not.
    end

    where L(w,b) = Σ_{i=1}^{n} −log f_{i,label}(w,b)
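A runnable sketch of this loop, assuming (for concreteness) a binary logistic-regression model f(x) = sigmoid(w·x + b) and a tiny made-up dataset; the data, model, and num_epochs are illustrative stand-ins, not the lecture's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lam = 0.01
num_epochs = 200
X = np.array([0.5, 1.5, 2.0, 3.5])   # toy inputs
y = np.array([0, 0, 1, 1])           # toy binary labels
w, b = np.random.randn(), np.random.randn()

for e in range(num_epochs):
    p = sigmoid(w * X + b)                                  # predictions over the FULL dataset
    L = np.sum(-(y * np.log(p) + (1 - y) * np.log(1 - p)))  # negative log-likelihood loss
    dw = np.sum((p - y) * X)          # dL/dw
    db = np.sum(p - y)                # dL/db
    w -= lam * dw                     # w = w - lambda * dL/dw
    b -= lam * db                     # b = b - lambda * dL/db
    print(e, L)                       # useful to see if the loss is shrinking
```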
11
Gradient Descent (GD) is expensive: the loop is the same as above, but every single update of w and b requires computing L(w,b) = Σ_{i=1}^{n} −log f_{i,label}(w,b) and its gradients over the entire training set of n examples.
12
(mini-batch) Stochastic Gradient Descent (SGD)
    λ = 0.01
    Initialize w and b randomly
    for e = 0, num_epochs do
        for b = 0, num_batches do
            Compute: dl(w,b)/dw and dl(w,b)/db
            Update w: w = w − λ · dl(w,b)/dw
            Update b: b = b − λ · dl(w,b)/db
            Print: l(w,b)   // Useful to see if this is becoming smaller or not.
        end
    end

    where l(w,b) = Σ_{i∈B} −log f_{i,label}(w,b) is the loss over the current mini-batch B.
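The same toy logistic-regression setup as in the GD sketch above, now updated one mini-batch at a time; batch_size and the random shuffling are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lam, num_epochs, batch_size = 0.01, 200, 2
X = np.array([0.5, 1.5, 2.0, 3.5])
y = np.array([0, 0, 1, 1])
w, b = np.random.randn(), np.random.randn()

for e in range(num_epochs):
    order = np.random.permutation(len(X))        # visit examples in random order
    for start in range(0, len(X), batch_size):
        B = order[start:start + batch_size]      # indices of the current mini-batch
        p = sigmoid(w * X[B] + b)
        dw = np.sum((p - y[B]) * X[B])           # gradient of the mini-batch loss l(w, b)
        db = np.sum(p - y[B])
        w -= lam * dw                            # one cheap update per mini-batch
        b -= lam * db
```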
13
(mini-batch) Stochastic Gradient Descent (SGD)

The loop is the same as above. In the special case |B| = 1, each mini-batch contains a single example, which is the classic "stochastic" gradient descent.
14
Regression vs Classification
Regression
- Labels are continuous variables, e.g. distance.
- Losses: distance-based losses, e.g. the sum of distances to the true values.
- Evaluation: mean distances, correlation coefficients, etc.

Classification
- Labels are discrete variables (1 out of K categories).
- Losses: cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
- Evaluation: classification accuracy, etc.
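A small sketch of the two kinds of losses, using a sum of squared distances for regression and cross-entropy for classification; all numbers are made up for illustration.

```python
import numpy as np

# Regression: continuous targets, distance-based loss
# (here the sum of squared distances to the true values).
y_true = np.array([2.0, 3.5, 5.0])
y_pred = np.array([2.2, 3.0, 4.8])
regression_loss = np.sum((y_pred - y_true) ** 2)

# Classification: discrete labels (1 out of K), cross-entropy loss.
labels = np.array([0, 2])                       # true categories
logits = np.array([[2.0, 0.5, 0.1],             # unnormalized class scores
                   [0.2, 0.3, 1.9]])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
classification_loss = np.sum(-np.log(probs[np.arange(len(labels)), labels]))

print(regression_loss, classification_loss)
```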
15
Linear Regression – 1 output, 1 input
[Scatter plot of data points (x_1, y_1), ..., (x_8, y_8) in the (x, y) plane.]
16
Linear Regression – 1 output, 1 input
[The same scatter plot.]
Model: ŷ = wx + b
18
Linear Regression – 1 output, 1 input
[The same scatter plot.]
Model: ŷ = wx + b
Loss: L(w, b) = Σ_{i=1}^{8} (ŷ_i − y_i)^2
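A runnable sketch of fitting this model by gradient descent on the squared-error loss; the eight (x, y) pairs are made-up values standing in for the scatter plot, and the learning rate is an illustrative choice.

```python
import numpy as np

# Linear regression y_hat = w*x + b, loss L = sum_i (y_hat_i - y_i)^2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # made-up data
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8, 7.1, 7.9])

lam = 0.001
w, b = 0.0, 0.0
for _ in range(5000):
    y_hat = w * x + b
    dw = np.sum(2 * (y_hat - y) * x)   # dL/dw
    db = np.sum(2 * (y_hat - y))       # dL/db
    w -= lam * dw
    b -= lam * db

print(w, b)   # close to the least-squares line through these points
```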
19
Quadratic Regression
[The same scatter plot.]
Model: ŷ = w_1·x^2 + w_2·x + b
Loss: L(w, b) = Σ_{i=1}^{8} (ŷ_i − y_i)^2
20
n-polynomial Regression
[The same scatter plot.]
Model: ŷ = w_n·x^n + … + w_1·x + b
Loss: L(w, b) = Σ_{i=1}^{8} (ŷ_i − y_i)^2
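Since the model is still linear in the weights, one sketch of n-polynomial regression builds the feature matrix [x^n, …, x, 1] and solves the same squared-error problem with least squares; the degree n = 3 and the data are illustrative choices.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # made-up data
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8, 7.1, 7.9])

n = 3
A = np.vander(x, n + 1)                        # columns: x^3, x^2, x, 1
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None) # minimizes sum_i (A·w - y)^2
print(coeffs)                                  # [w_3, w_2, w_1, b]
```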
21
Overfitting
- f is linear: Loss(w) is high → underfitting (high bias).
- f is cubic: Loss(w) is low.
- f is a polynomial of degree 9: Loss(w) is zero! → overfitting (high variance).
Credit: C. Bishop, Pattern Recognition and Machine Learning.
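A small experiment in the spirit of Bishop's figure, assuming noisy samples of sin(2πx) as the underlying data: the training error falls to zero for degree 9, while the error on fresh points typically grows.

```python
import numpy as np

# Fit polynomials of degree 1, 3 and 9 to 10 noisy samples of sin(2*pi*x)
# and compare the training error with the error on fresh test points.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)
# degree 1: both errors high (underfitting); degree 9: training error
# near zero, test error typically much larger (overfitting).
```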
22
Regularization
Large weights lead to large variance, i.e. the model fits the training data too strongly.
Solution: minimize the loss, but also try to keep the weight values small, by minimizing:
    L(w, b) + Σ_i |w_i|^2
Credit: C. Bishop, Pattern Recognition and Machine Learning.
23
Regularization
Weighting the penalty: minimize
    L(w, b) + α Σ_i |w_i|^2
where α Σ_i |w_i|^2 is the regularizer term (here an L2 regularizer) and α controls how strongly small weights are preferred.
Credit: C. Bishop, Pattern Recognition and Machine Learning.
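A sketch of the effect on the degree-9 fit above: adding the penalty α·Σ_i |w_i|^2 to the squared-error objective (ridge regression) shrinks the weights as α grows; the α values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
A = np.vander(x, 10)                        # degree-9 design matrix

for alpha in (1e-8, 1e-3, 1.0):
    # minimize ||A·w - y||^2 + alpha * ||w||^2  (closed-form solution)
    w = np.linalg.solve(A.T @ A + alpha * np.eye(10), A.T @ y)
    print(alpha, np.sum(np.abs(w)))         # total weight magnitude shrinks as alpha grows
```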
24
SGD with Regularization (L-2)
    λ = 0.01
    Initialize w and b randomly
    for e = 0, num_epochs do
        for b = 0, num_batches do
            Compute: dl(w,b)/dw and dl(w,b)/db
            Update w: w = w − λ · dl(w,b)/dw − λαw
            Update b: b = b − λ · dl(w,b)/db
            Print: l(w,b)   // Useful to see if this is becoming smaller or not.
        end
    end

    where the regularized mini-batch loss is l(w,b) + α Σ_i |w_i|^2.
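A runnable sketch of the extra −λαw term ("weight decay") on the toy linear-regression example; α = 0.1 is an illustrative value, and the bias b is left unregularized, matching a regularizer Σ_i |w_i|^2 over the weights only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8, 7.1, 7.9])

lam, alpha = 0.001, 0.1
w, b = 0.0, 0.0
for _ in range(5000):
    y_hat = w * x + b
    dw = np.sum(2 * (y_hat - y) * x)
    db = np.sum(2 * (y_hat - y))
    w = w - lam * dw - lam * alpha * w   # extra -lambda*alpha*w pulls w toward 0
    b = b - lam * db                     # bias is not regularized

print(w, b)
```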
25
Revisiting Another Problem with SGD
The loop is the same as above, but note that dl(w,b)/dw and dl(w,b)/db are only approximations to the true gradient of the full loss L(w,b): they are computed from a single mini-batch.
26
Revisiting Another Problem with SGD
Because each step follows this noisy mini-batch approximation, an update can lead to "un-learning" what has been learned in some previous steps of training.
27
Solution: Momentum Updates
Idea: keep track of previous gradients in an accumulator variable, and take steps using a weighted average of the accumulated and current gradients.
28
Solution: Momentum Updates
    λ = 0.01, τ = 0.9
    Initialize w and b randomly
    global v   // accumulator for past gradients
    for e = 0, num_epochs do
        for b = 0, num_batches do
            Compute: dl(w,b)/dw
            Compute: v = τ·v + dl(w,b)/dw + αw
            Update w: w = w − λ·v
            Print: l(w,b)   // Useful to see if this is becoming smaller or not.
        end
    end

    where l(w,b) is the mini-batch loss with the L2 regularizer, l(w,b) + α Σ_i |w_i|^2.
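A runnable sketch of the momentum update on the same toy linear-regression problem; λ, τ and α are illustrative values, and a second accumulator is added for b even though the slide only shows the update for w.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8, 7.1, 7.9])

lam, tau, alpha = 0.0005, 0.9, 0.1
w, b = 0.0, 0.0
vw, vb = 0.0, 0.0                                  # gradient accumulators

for _ in range(5000):
    y_hat = w * x + b
    dw = np.sum(2 * (y_hat - y) * x) + alpha * w   # gradient incl. L2 term on w
    db = np.sum(2 * (y_hat - y))
    vw = tau * vw + dw                             # v = tau*v + dl/dw + alpha*w
    vb = tau * vb + db
    w -= lam * vw                                  # w = w - lambda*v
    b -= lam * vb

print(w, b)
```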
29
More on Momentum
30
Questions?