Softmax Classifier.


1 Softmax Classifier

2 Today's Class
Softmax Classifier
Inference / Making Predictions / Test Time
Training a Softmax Classifier
Stochastic Gradient Descent (SGD)

3 Supervised Learning - Classification
Training Data: images labeled cat, dog, cat, ..., bear. Test Data: images to classify.

4 Supervised Learning - Classification
Training Data: inputs $x_1, x_2, x_3, \ldots, x_n$ paired with labels $y_1 =$ cat, $y_2 =$ dog, $y_3 =$ cat, ..., $y_n =$ bear (each label encoded as a vector).

5 Supervised Learning - Classification
Training Data: inputs $x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, targets / labels / ground truth $y_i$, and the model's predictions $\hat{y}_i$.
We need a function that maps any input to its label: $\hat{y}_i = f(x_i; \theta)$.
How do we "learn" the parameters $\theta$ of this function? We choose the ones that make the following quantity small: $\sum_{i=1}^{n} \mathrm{Cost}(\hat{y}_i, y_i)$.

6 Supervised Learning – Linear Softmax
Training Data: inputs $x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}]$, $x_2 = [x_{21}\ x_{22}\ x_{23}\ x_{24}]$, $x_3 = [x_{31}\ x_{32}\ x_{33}\ x_{34}]$, ..., $x_n = [x_{n1}\ x_{n2}\ x_{n3}\ x_{n4}]$, with targets / labels / ground truth $y_1, y_2, y_3, \ldots, y_n$ (the class of each example).

7 Supervised Learning – Linear Softmax
Training Data: inputs $x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, targets / labels / ground truth $y_i$, and the model's predictions $\hat{y}_i$ (one score per class).

8 Supervised Learning – Linear Softmax
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, one-hot target $y_i$, prediction $\hat{y}_i = [f_c\ f_d\ f_b]$.
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
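As a rough sketch, the same forward pass in NumPy; the feature values, weight initialization, and shapes below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# One example with 4 features (placeholder values).
x_i = np.array([0.2, -1.0, 0.5, 3.1])

# One weight row and one bias per class (cat, dog, bear); values are made up.
W = np.random.randn(3, 4) * 0.01   # rows play the role of w_c, w_d, w_b
b = np.zeros(3)                    # b_c, b_d, b_b

# Linear scores g_c, g_d, g_b.
g = W @ x_i + b

# Softmax: f_k = exp(g_k) / sum_j exp(g_j).
# Subtracting max(g) does not change the result but avoids overflow.
f = np.exp(g - g.max())
f /= f.sum()

print(f, f.sum())   # predicted probabilities for (cat, dog, bear); they sum to 1
```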

9 How do we find a good w and b?
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, one-hot target $y_i$, prediction $\hat{y}_i = [f_c(w,b)\ f_d(w,b)\ f_b(w,b)]$.
We need to find $w$ and $b$ that minimize the following:
$L(w,b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j} \log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,\mathrm{label}}) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$
Why?
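A hedged sketch of this loss in NumPy, assuming `probs[i]` holds the softmax output $\hat{y}_i$ for example $i$ and `labels[i]` its ground-truth class index:

```python
import numpy as np

def cross_entropy_loss(probs, labels):
    """L(w,b) = sum_i -log( yhat_{i, label_i} ).

    probs  : (n, 3) array of softmax outputs, one row per example
    labels : (n,) array of integer class indices (0 = cat, 1 = dog, 2 = bear)
    """
    n = probs.shape[0]
    # Pick out the predicted probability of the correct class for each example.
    correct = probs[np.arange(n), labels]
    return -np.log(correct).sum()

# Example with made-up probabilities: labels cat(0), dog(1), bear(2).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
print(cross_entropy_loss(probs, np.array([0, 1, 2])))  # -(log 0.7 + log 0.8 + log 0.4)
```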

10 Gradient Descent (GD)
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: $dL(w,b)/dw$ and $dL(w,b)/db$
    Update w: $w = w - \lambda\, dL(w,b)/dw$
    Update b: $b = b - \lambda\, dL(w,b)/db$
    Print: $L(w,b)$   // Useful to see if this is becoming smaller or not.
end
where $L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$
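The same loop as a minimal Python sketch; `num_epochs`, `X_train`, `y_train`, and the helper `compute_loss_and_grads` are hypothetical placeholders, the helper standing in for whatever computes $L(w,b)$ and its gradients:

```python
import numpy as np

lam = 0.01                              # learning rate (lambda)
W = np.random.randn(3, 4) * 0.01        # initialize w randomly
b = np.zeros(3)                         # initialize b

for e in range(num_epochs):
    # Gradient of the FULL training loss: every example is touched for each update.
    L, dL_dW, dL_db = compute_loss_and_grads(W, b, X_train, y_train)
    W = W - lam * dL_dW                 # update w
    b = b - lam * dL_db                 # update b
    print(e, L)                         # useful to see if the loss is becoming smaller
```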

11 Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w - lambda * (dL/dw).
[Plot of L(w) with the current point marked at w = 12.]

12 Gradient Descent (GD) (idea)
2. Compute the gradient (derivative) of L(w) at the current point (e.g. dL/dw = 6).
3. Recompute w as: w = w - lambda * (dL/dw).
[Plot of L(w): after one update the point has moved to w = 10.]

13 Gradient Descent (GD) (idea)
2. Compute the gradient (derivative) of L(w) at the current point (e.g. dL/dw = 6).
3. Recompute w as: w = w - lambda * (dL/dw).
[Plot of L(w): after another update the point has moved to w = 8.]

14 Our function L(w)
$L(w) = 3 + (1 - w)^2$

15 Our function L(w)
Toy example: $L(w) = 3 + (1 - w)^2$.   The loss we actually care about: $L(W,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(W,b)$.

16 Our function L(w)
$L(w) = 3 + (1 - w)^2$
Each term of the real loss, $-\log \mathrm{softmax}\big(g(w_1, w_2, \ldots, w_{12};\ x_n)\big)_{\mathrm{label}_n}$, is likewise just a function of the weights.
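For the toy function the derivative is available in closed form, $dL/dw = -2(1 - w)$, so gradient descent can be traced in a few lines; the starting point and step size below are only illustrative:

```python
def L(w):
    return 3 + (1 - w) ** 2

def dL_dw(w):
    return -2 * (1 - w)        # derivative of 3 + (1 - w)^2

w, lam = 12.0, 0.1             # illustrative starting point and step size
for step in range(25):
    w = w - lam * dL_dw(w)     # w = w - lambda * (dL/dw)
print(w, L(w))                 # approaches the minimum at w = 1, where L = 3
```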

17 Gradient Descent (GD): expensive
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: $dL(w,b)/dw$ and $dL(w,b)/db$
    Update w: $w = w - \lambda\, dL(w,b)/dw$
    Update b: $b = b - \lambda\, dL(w,b)/db$
    Print: $L(w,b)$   // Useful to see if this is becoming smaller or not.
end
where $L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$ must be summed over all n training examples for every single update, which is expensive.

18 (mini-batch) Stochastic Gradient Descent (SGD)
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: $dl(w,b)/dw$ and $dl(w,b)/db$
        Update w: $w = w - \lambda\, dl(w,b)/dw$
        Update b: $b = b - \lambda\, dl(w,b)/db$
        Print: $l(w,b)$   // Useful to see if this is becoming smaller or not.
    end
end
where $l(w,b) = \sum_{i \in B} -\log f_{i,\mathrm{label}}(w,b)$ is computed on a mini-batch B rather than on the full training set.
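A NumPy sketch of this mini-batch loop. `X` (an (n, 4) feature matrix), `y` (an (n,) vector of class indices), the hyperparameter values, and the helper `loss_and_grads` are assumptions for illustration; the helper would return $l(w,b)$ on one batch together with its gradients, e.g. using the analytic formulas derived later:

```python
import numpy as np

lam, batch_size, num_epochs = 0.01, 32, 100   # illustrative hyperparameters
n = X.shape[0]
W = np.random.randn(3, 4) * 0.01              # initialize w and b randomly
b = np.zeros(3)

for e in range(num_epochs):
    order = np.random.permutation(n)          # new random order of examples each epoch
    for start in range(0, n, batch_size):
        batch = order[start:start + batch_size]
        l, dl_dW, dl_db = loss_and_grads(W, b, X[batch], y[batch])
        W = W - lam * dl_dW                   # update w using the mini-batch gradient
        b = b - lam * dl_db                   # update b
    print(e, l)                               # rough progress signal (loss of the last batch)
```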

19 Source: Andrew Ng

20 (mini-batch) Stochastic Gradient Descent (SGD)
The same pseudocode as the previous slide, now with batch size |B| = 1: each inner-loop update uses the gradient of $l(w,b) = -\log f_{i,\mathrm{label}}(w,b)$ for a single training example i.

21 Computing Analytic Gradients
This is what we have: the forward computation from slide 8 (input $x$, linear scores, softmax probabilities $\hat{y}$, and the loss $\ell$).

22 Computing Analytic Gradients
This is what we have. Reminder:
$a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$

23 Computing Analytic Gradients
This is what we have:

24 Computing Analytic Gradients
This is what we have: the forward computation above. This is what we need: $\partial \ell / \partial w_{i,j}$ for each weight and $\partial \ell / \partial b_i$ for each bias.

25 Computing Analytic Gradients
This is what we have. Step 1: Chain Rule of Calculus:
$\partial \ell / \partial w_{i,j} = (\partial \ell / \partial a_i) \cdot (\partial a_i / \partial w_{i,j})$   and   $\partial \ell / \partial b_i = (\partial \ell / \partial a_i) \cdot (\partial a_i / \partial b_i)$

26 Computing Analytic Gradients
Step 1: Chain Rule of Calculus. Let's do these first: $\partial a_i / \partial w_{i,j}$ and $\partial a_i / \partial b_i$.

27 Computing Analytic Gradients
$a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$
$\partial a_i / \partial w_{i,3} = \partial / \partial w_{i,3}\,\big[(w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i\big] = x_3$
In general: $\partial a_i / \partial w_{i,j} = x_j$

28 Computing Analytic Gradients
$\partial a_i / \partial w_{i,j} = x_j$,   $a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$
$\partial a_i / \partial b_i = \partial / \partial b_i\,\big[(w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i\big] = 1$

29 Computing Analytic Gradients
$\partial a_i / \partial w_{i,j} = x_j$   and   $\partial a_i / \partial b_i = 1$
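As a quick sanity check (not from the slides), these two derivatives can be verified numerically with finite differences; all values below are arbitrary:

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0, 0.3])          # arbitrary input
w = np.array([0.1, 0.4, -0.2, 0.7])          # one row of weights, w_{i,:}
bi, eps, j = 0.05, 1e-6, 2                   # bias, step size, which weight to check

a = lambda w, bi: w @ x + bi                 # a_i = sum_j w_{i,j} x_j + b_i

w_plus = w.copy(); w_plus[j] += eps
print((a(w_plus, bi) - a(w, bi)) / eps, x[j])    # both approximately x_j
print((a(w, bi + eps) - a(w, bi)) / eps, 1.0)    # both approximately 1
```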

30 Computing Analytic Gradients
This is what we have. Step 1: Chain Rule of Calculus. Now let's do the remaining factor, $\partial \ell / \partial a_i$ (the same term appears in both gradients!).

31 Computing Analytic Gradients
In our cat, dog, bear classification example: i = {0, 1, 2}

32 Computing Analytic Gradients
In our cat, dog, bear classification example: i = {0, 1, 2}. Let's say: label = 1.
We need: $\partial \ell / \partial a_0$, $\partial \ell / \partial a_1$, $\partial \ell / \partial a_2$

33 Computing Analytic Gradients
For the classes other than the label (here i = 0 and i = 2): $\partial \ell / \partial a_i = \hat{y}_i$

34 Remember this slide?
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, one-hot target $y_i$, prediction $\hat{y}_i = [f_c\ f_d\ f_b]$.
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$,  $f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$,  $f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$

35 Computing Analytic Gradients
For i ≠ label: $\partial \ell / \partial a_0 = \hat{y}_0$ and $\partial \ell / \partial a_2 = \hat{y}_2$, i.e. $\partial \ell / \partial a_i = \hat{y}_i$.

36 Computing Analytic Gradients
For the label class (i = 1): $\partial \ell / \partial a_1 = \hat{y}_1 - 1$

37 Computing Analytic Gradients
label = 1, so:
$\partial \ell / \partial a_0 = \hat{y}_0$,  $\partial \ell / \partial a_1 = \hat{y}_1 - 1$,  $\partial \ell / \partial a_2 = \hat{y}_2$
$\partial \ell / \partial a = \big[\partial \ell / \partial a_0,\ \partial \ell / \partial a_1,\ \partial \ell / \partial a_2\big] = [\hat{y}_0,\ \hat{y}_1 - 1,\ \hat{y}_2] = [\hat{y}_0,\ \hat{y}_1,\ \hat{y}_2] - [0,\ 1,\ 0] = \hat{y} - y$
In general: $\partial \ell / \partial a_i = \hat{y}_i - y_i$
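In code this gradient with respect to the scores is a single subtraction; the probabilities below are made-up numbers, with label = 1 (dog) as on the slide:

```python
import numpy as np

y_hat = np.array([0.2, 0.5, 0.3])   # made-up softmax output for (cat, dog, bear)
label = 1                           # ground truth: dog
y = np.zeros(3); y[label] = 1.0     # one-hot target [0, 1, 0]

dl_da = y_hat - y                   # [y_hat_0, y_hat_1 - 1, y_hat_2]
print(dl_da)                        # -> [ 0.2, -0.5,  0.3]
```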

38 Computing Analytic Gradients
$\partial a_i / \partial w_{i,j} = x_j$,  $\partial a_i / \partial b_i = 1$,  $\partial \ell / \partial a_i = \hat{y}_i - y_i$
Therefore: $\partial \ell / \partial w_{i,j} = (\hat{y}_i - y_i)\, x_j$  and  $\partial \ell / \partial b_i = \hat{y}_i - y_i$
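Putting the three pieces together gives the full gradient computation for one example. A sketch under the running 4-feature, 3-class assumptions (the function name and shapes are mine, not the slides'):

```python
import numpy as np

def softmax_grads(W, b, x, label):
    """Return loss, dl/dW and dl/db for a single example.

    Uses dl/dw_{i,j} = (y_hat_i - y_i) * x_j and dl/db_i = y_hat_i - y_i.
    """
    g = W @ x + b                           # linear scores, shape (3,)
    f = np.exp(g - g.max()); f /= f.sum()   # softmax probabilities y_hat
    y = np.zeros_like(f); y[label] = 1.0    # one-hot target
    loss = -np.log(f[label])
    dl_da = f - y                           # (y_hat - y), shape (3,)
    dl_dW = np.outer(dl_da, x)              # (y_hat_i - y_i) * x_j, shape (3, 4)
    dl_db = dl_da
    return loss, dl_dW, dl_db
```

A finite-difference check (perturb one weight by a small epsilon and re-run the forward pass) should agree with the corresponding entry of `dl_dW` to several decimal places.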

39 Supervised Learning – Softmax Classifier
Extract features: $x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$
Run features through classifier:
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$,  $f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$,  $f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
Get predictions: $\hat{y}_i = [f_c\ f_d\ f_b]$
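At test time, inference is just this pipeline followed by picking the most probable class. A small sketch under the same assumptions as the earlier snippets (hypothetical `predict` helper, 3 classes, 4 features):

```python
import numpy as np

CLASSES = ["cat", "dog", "bear"]

def predict(W, b, x):
    g = W @ x + b                          # run features through the linear classifier
    f = np.exp(g - g.max()); f /= f.sum()  # softmax -> predicted probabilities
    return CLASSES[int(np.argmax(f))], f   # most probable class and the full distribution
```

Since softmax is monotonic, taking the argmax of the raw scores g gives the same class; the softmax step only matters when the probabilities themselves are of interest.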

40 Overfitting
$f$ is linear: $\mathrm{Loss}(w)$ is high. Underfitting (high bias).
$f$ is cubic: $\mathrm{Loss}(w)$ is low.
$f$ is a polynomial of degree 9: $\mathrm{Loss}(w)$ is zero! Overfitting (high variance).
Credit: C. Bishop, Pattern Recognition and Machine Learning.

41 More …
Regularization
Momentum updates
Hinge Loss, Least Squares Loss, Logistic Regression Loss

42 Assignment 2 – Linear Margin-Classifier
Training Data: inputs $x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, targets / labels / ground truth $y_i$, and predictions $\hat{y}_i$, the same setup as before but now with a margin classifier.

43 Supervised Learning – Linear Softmax
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, one-hot target $y_i$, prediction $\hat{y}_i = [f_c\ f_d\ f_b]$, where the scores stay linear (no softmax is applied):
$f_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$f_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$f_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$

44 How do we find a good w and b?
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, one-hot target $y_i$, prediction $\hat{y}_i = [f_c(w,b)\ f_d(w,b)\ f_b(w,b)]$.
We need to find $w$ and $b$ that minimize the following:
$L(w,b) = \sum_{i=1}^{n} \sum_{j \ne \mathrm{label}} \max(0,\ \hat{y}_{ij} - \hat{y}_{i,\mathrm{label}} + \Delta)$
Why?
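A hedged NumPy sketch of this margin loss, assuming `scores` is an (n, 3) array of the raw linear scores $\hat{y}$, `labels` an (n,) array of integer class indices, and Δ = 1:

```python
import numpy as np

def margin_loss(scores, labels, delta=1.0):
    """L(w,b) = sum_i sum_{j != label_i} max(0, yhat_{ij} - yhat_{i,label_i} + delta)."""
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]      # score of the true class, (n, 1)
    margins = np.maximum(0.0, scores - correct + delta)  # hinge terms, shape (n, 3)
    margins[np.arange(n), labels] = 0.0                  # drop the j == label term
    return margins.sum()
```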

45 Questions?

