Softmax Classifier + Generalization


1 Softmax Classifier + Generalization
Various slides from previous courses by: D.A. Forsyth (Berkeley / UIUC), I. Kokkinos (Ecole Centrale / UCL), S. Lazebnik (UNC / UIUC), S. Seitz (MSR / Facebook / UW), J. Hays (Brown / Georgia Tech), A. Berg (Stony Brook / UNC), D. Samaras (Stony Brook), J. M. Frahm (UNC), V. Ordonez (UVA).

2 Last Class
Introduction to Machine Learning. Unsupervised Learning: Clustering (e.g., k-means clustering). Supervised Learning: Classification (e.g., k-nearest neighbors).

3 Today's Class
Softmax Classifier (Linear Classifiers). Generalization / Overfitting / Regularization. Global Features.

4 Supervised Learning vs Unsupervised Learning
π‘₯ β†’ 𝑦 π‘₯ cat dog bear

5 Supervised Learning vs Unsupervised Learning
π‘₯ β†’ 𝑦 π‘₯ cat dog bear

6 Supervised Learning vs Unsupervised Learning
π‘₯ β†’ 𝑦 π‘₯ cat dog bear Classification Clustering

7 Supervised Learning – k-Nearest Neighbors
[Figure: k = 3. The query's three nearest training examples are cat, cat, dog, so it is labeled cat.]

8 Supervised Learning – k-Nearest Neighbors
[Figure: k = 3. The query's three nearest training examples are bear, dog, dog, so it is labeled dog.]
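
A minimal sketch of the k-nearest-neighbors rule illustrated on these two slides, assuming plain numeric feature vectors and Euclidean distance; the function name and the toy data are illustrative, not from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Label x_query by majority vote among its k nearest training examples."""
    # Euclidean distance from the query to every training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels (e.g. cat, cat, dog -> cat)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: 2-D features, three classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [0.0, 1.0]])
y_train = np.array(["cat", "cat", "dog", "dog", "dog", "bear"])
print(knn_predict(X_train, y_train, np.array([1.0, 1.1])))  # -> dog
```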

9 Supervised Learning – k-Nearest Neighbors
How do we choose the right K? How do we choose the right features? How do we choose the right distance metric?

10 Supervised Learning – k-Nearest Neighbors
How do we choose the right K? How do we choose the right features? How do we choose the right distance metric? Answer: just choose the combination that works best! BUT not on the test data. Instead, split the training data into a "Training set" and a "Validation set" (also called a "Development set").
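
A possible way to implement the split described above, sketched in NumPy; the 80/20 ratio, the shuffling, and the seed are illustrative choices, not values from the slide:

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the training data and hold out a validation (development) set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # shuffle so the split is random
    n_val = int(len(X) * val_fraction)   # size of the held-out validation set
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Pick k, the features, and the distance metric by accuracy on the validation set;
# touch the test data only once, at the very end.
```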

11 Supervised Learning - Classification
[Figure: Training Data consists of labeled images (dog, cat, bear); Test Data consists of images to be classified.]

12 Supervised Learning - Classification
[Figure: Training Data shown as labeled images (cat, dog, ..., bear) alongside unlabeled Test Data.]

13 Supervised Learning - Classification
Training Data: each example i has a feature vector x_i = [ ... ] and a label y_i = [ ... ] (cat, dog, cat, ..., bear).

14 Supervised Learning - Classification
Training Data: inputs x_1 = [x_11 x_12 x_13 x_14], ..., x_n = [x_n1 x_n2 x_n3 x_n4]; targets / labels / ground truth y_1, ..., y_n (class indices 1, 2, 3); predictions ŷ_i = f(x_i; θ). We need to find a function f that maps x to y for any of them. How do we "learn" the parameters θ of this function? We choose the ones that make the following quantity small: Σ_{i=1}^{n} Cost(ŷ_i, y_i).

15 Supervised Learning – Softmax Classifier
Training Data: inputs x_1 = [x_11 x_12 x_13 x_14], ..., x_n = [x_n1 x_n2 x_n3 x_n4]; targets / labels / ground truth y_1, ..., y_n (class indices 1, 2, 3).

16 Supervised Learning – Softmax Classifier
Training Data: inputs x_1, ..., x_n; targets / labels / ground truth y_1, ..., y_n, now written as one-hot vectors; and the model's predictions ŷ_1, ..., ŷ_n.

17 Supervised Learning – Softmax Classifier
x_i = [x_i1 x_i2 x_i3 x_i4]    y_i = one-hot label vector    ŷ_i = [f_c f_d f_b]
g_c = w_c1 x_i1 + w_c2 x_i2 + w_c3 x_i3 + w_c4 x_i4 + b_c
g_d = w_d1 x_i1 + w_d2 x_i2 + w_d3 x_i3 + w_d4 x_i4 + b_d
g_b = w_b1 x_i1 + w_b2 x_i2 + w_b3 x_i3 + w_b4 x_i4 + b_b
f_c = e^(g_c) / (e^(g_c) + e^(g_d) + e^(g_b))
f_d = e^(g_d) / (e^(g_c) + e^(g_d) + e^(g_b))
f_b = e^(g_b) / (e^(g_c) + e^(g_d) + e^(g_b))

18 How do we find a good w and b?
π‘₯ 𝑖 =[ π‘₯ 𝑖1 π‘₯ 𝑖2 π‘₯ 𝑖3 π‘₯ 𝑖4 ] [ ] 𝑦 𝑖 = 𝑦 𝑖 = [ 𝑓 𝑐 (𝑀,𝑏) 𝑓 𝑑 (𝑀,𝑏) 𝑓 𝑏 (𝑀,𝑏)] We need to find w, and b that minimize the following function L: 𝐿 𝑀,𝑏 = 𝑖=1 𝑛 𝑗=1 3 βˆ’ 𝑦 𝑖,𝑗 log⁑( 𝑦 𝑖,𝑗 ) = 𝑖=1 𝑛 βˆ’log⁑( 𝑦 𝑖,π‘™π‘Žπ‘π‘’π‘™ ) = 𝑖=1 𝑛 βˆ’log 𝑓 𝑖,π‘™π‘Žπ‘π‘’π‘™ (𝑀,𝑏) Why?

19 How do we find a good w and b?
Problem statement: find w and b such that L(w,b) is minimal.
Solution from calculus: set ∂L(w,b)/∂w = 0 and solve for w; set ∂L(w,b)/∂b = 0 and solve for b.

20 https://courses.lumenlearning

21 How do we find a good w and b?
Problem statement: find w and b such that L(w,b) is minimal.
Solution from calculus: set ∂L(w,b)/∂w = 0 and solve for w; set ∂L(w,b)/∂b = 0 and solve for b.

22 Problems with this approach:
Some functions L(w,b) are very complicated compositions of many functions, so finding an analytical derivative is tedious. Even if the function is simple to differentiate, it might not be easy to solve for w, e.g. ∂L(w,b)/∂w = e^w + w = 0. How do you find w in that equation?
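
To make the point concrete: e^w + w = 0 has no closed-form solution for w, but a simple numerical method finds it. This bisection sketch is my illustration, not something on the slide:

```python
import math

def bisect(f, lo, hi, iters=60):
    """Find a root of f between lo and hi by repeatedly halving the interval."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0:   # sign change in [lo, mid] means the root is there
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# e^w + w is negative at w = -1 and positive at w = 0, so the root lies in between.
w_star = bisect(lambda w: math.exp(w) + w, -1.0, 0.0)
print(w_star)   # about -0.567
```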

23 Solution: Iterative Approach: Gradient Descent (GD)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w − λ (dL/dw).
[Plot: the loss curve L(w) versus w, with the current point marked at w = 12.]

24 Solution: Iterative Approach: Gradient Descent (GD)
2. Compute the gradient (derivative) of L(w) at the new point w = 10.
3. Recompute w as: w = w − λ (dL/dw).
[Plot: the loss curve L(w) versus w, with the current point now at w = 10.]
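
A toy one-dimensional version of these three steps in Python; the quadratic loss L(w) = (w − 3)^2 and the learning rate are illustrative choices, not from the slides:

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2 (w - 3).
w = 12.0     # step 1: start from an arbitrary value, as on the slide
lam = 0.1    # learning rate (lambda)

for step in range(25):
    grad = 2.0 * (w - 3.0)   # step 2: gradient of L at the current w
    w = w - lam * grad       # step 3: move w against the gradient
print(w)                     # close to 3, the minimizer of L
```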

25 Gradient Descent (GD)
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: dL(w,b)/dw and dL(w,b)/db
    Update w: w = w − λ dL(w,b)/dw
    Update b: b = b − λ dL(w,b)/db
    Print: L(w,b)   // Useful to see if this is becoming smaller or not.
end
where L(w,b) = Σ_{i=1}^{n} −log f_{i,label}(w,b)
Problem: expensive! Every update requires the gradient of the loss over all n training examples.

26 Solution: (mini-batch) Stochastic Gradient Descent (SGD)
πœ†=0.01 for e = 0, num_epochs do end Initialize w and b randomly 𝑑𝑙(𝑀,𝑏)/𝑑𝑀 𝑑𝑙(𝑀,𝑏)/𝑑𝑏 Compute: and Update w: Update b: 𝑀=𝑀 βˆ’πœ† 𝑑𝑙(𝑀,𝑏)/𝑑𝑀 𝑏=𝑏 βˆ’πœ† 𝑑𝑙(𝑀,𝑏)/𝑑𝑏 Print: 𝑙(𝑀,𝑏) // Useful to see if this is becoming smaller or not. 𝑙(𝑀,𝑏)= π‘–βˆˆπ΅ βˆ’log 𝑓 𝑖,π‘™π‘Žπ‘π‘’π‘™ (𝑀,𝑏) B is a small set of training examples. for b = 0, num_batches do end

27 Source: Andrew Ng

28 Three more things
How to compute the gradient. Regularization. Momentum updates.

29 SGD Gradient for the Softmax Function

30 SGD Gradient for the Softmax Function

31 SGD Gradient for the Softmax Function
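
The derivation itself lives in the slide images, which are not reproduced here, but the standard result it arrives at (using the notation of slides 17 and 18, with y_{i,j} the one-hot target) can be summarized as:

```latex
% Per-example loss L_i = -\log f_{label}, with f_j = e^{g_j} / \sum_k e^{g_k}
\[
\frac{\partial L_i}{\partial g_j} = f_j - y_{i,j}, \qquad
\frac{\partial L_i}{\partial w_{jk}} = (f_j - y_{i,j})\, x_{ik}, \qquad
\frac{\partial L_i}{\partial b_j} = f_j - y_{i,j}.
\]
```

This is the quantity the mini-batch SGD sketch above computes as df = f − one_hot(y).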

32 Supervised Learning – Softmax Classifier
Extract features: x_i = [x_i1 x_i2 x_i3 x_i4]
Run features through the classifier:
g_c = w_c1 x_i1 + w_c2 x_i2 + w_c3 x_i3 + w_c4 x_i4 + b_c
g_d = w_d1 x_i1 + w_d2 x_i2 + w_d3 x_i3 + w_d4 x_i4 + b_d
g_b = w_b1 x_i1 + w_b2 x_i2 + w_b3 x_i3 + w_b4 x_i4 + b_b
f_c = e^(g_c) / (e^(g_c) + e^(g_d) + e^(g_b))
f_d = e^(g_d) / (e^(g_c) + e^(g_d) + e^(g_b))
f_b = e^(g_b) / (e^(g_c) + e^(g_d) + e^(g_b))
Get predictions: ŷ_i = [f_c f_d f_b]

33 Supervised Machine Learning Steps
[Diagram: Training: Training Images → Image Features → Training (with Training Labels) → Learned model. Testing: Test Image → Image Features → Learned model → Prediction.] Slide credit: D. Hoiem

34 Generalization
Generalization refers to the ability to correctly classify never-before-seen examples. It can be controlled by turning "knobs" that affect the complexity of the model. Training set (labels known); test set (labels unknown).

35 Overfitting
f is linear: Loss(w) is high (underfitting, high bias). f is cubic: Loss(w) is low. f is a polynomial of degree 9: Loss(w) is zero! (overfitting, high variance). Credit: C. Bishop, Pattern Recognition and Machine Learning.
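
A small NumPy sketch of the polynomial example on this slide; the noisy sine data follow the setup of Bishop's figure, but the exact sample size and noise level are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)   # noisy training targets

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                 # fit a polynomial f of this degree
    train_loss = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training loss = {train_loss:.4f}")

# degree 1 (linear): high training loss  -> underfitting, high bias
# degree 3 (cubic):  low training loss
# degree 9: training loss near zero, but it fits the noise -> overfitting, high variance
```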

36 Questions?

