
1 Linear Regression
Jia-Bin Huang, Virginia Tech
ECE-5424G / CS-5824, Spring 2019

2 BRACE YOURSELVES WINTER IS COMING

3 BRACE YOURSELVES HOMEWORK IS COMING

4 Administrative
Office hour: Chen Gao, Shih-Yang Su
Feedback (thanks!): Notation? More descriptive slides? Video/audio recording? TA hours (uniformly spread over the week)?

5 Recap: Machine learning algorithms
                     Supervised Learning    Unsupervised Learning
Discrete output      Classification         Clustering
Continuous output    Regression             Dimensionality reduction

6 Recap: Nearest neighbor classifier
Training data: (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))
Learning: do nothing.
Testing: h(x) = y^(k), where k = argmin_i D(x, x^(i))
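For concreteness, a minimal NumPy sketch of this rule, assuming Euclidean distance for D; the function name predict_1nn is illustrative, not from the slides:

```python
import numpy as np

def predict_1nn(X_train, y_train, x_query):
    """Return the label of the training point closest to x_query (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # D(x, x^(i)) for every i
    k = np.argmin(dists)                               # index of the nearest neighbor
    return y_train[k]                                  # h(x) = y^(k)

# Tiny usage example with made-up data
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 1.0]])
y_train = np.array([0, 1, 1])
print(predict_1nn(X_train, y_train, np.array([2.9, 3.8])))  # -> 1
```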

7 Recap: Instance/Memory-based Learning
1. A distance metric: continuous? discrete? PDF? gene data? learn the metric?
2. How many nearby neighbors to look at? 1? 3? 5? 15?
3. A weighting function (optional): closer neighbors matter more.
4. How to fit with the local points? Kernel regression (see the sketch below).
Slide credit: Carlos Guestrin
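A minimal sketch of kernel (Nadaraya–Watson style) regression, assuming a Gaussian weighting function; the bandwidth h and the function name are illustrative choices, not from the lecture:

```python
import numpy as np

def kernel_regression(X_train, y_train, x_query, h=1.0):
    """Predict y at x_query as a weighted average of the training targets,
    with weights from a Gaussian kernel on distance (closer points matter more)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    weights = np.exp(-dists**2 / (2 * h**2))   # Gaussian weighting function
    return np.sum(weights * y_train) / np.sum(weights)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(kernel_regression(X_train, y_train, np.array([1.5]), h=0.5))
```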

8 Validation set
Splitting the training set: hold out a fake test set to tune hyper-parameters.
Slide credit: CS231 @ Stanford

9 Cross-validation
5-fold cross-validation: split the training data into 5 equal folds; use 4 of them for training and 1 for validation, rotating through the folds.
Slide credit: CS231 @ Stanford
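A short sketch of producing such a split by hand with NumPy; the helper name kfold_indices is illustrative (in practice an existing utility such as scikit-learn's KFold does the same job):

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold serves as validation exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, val_idx

for train_idx, val_idx in kfold_indices(10, n_folds=5):
    print(len(train_idx), len(val_idx))   # 8 2, for each of the 5 folds
```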

10 Things to remember
Supervised learning: training/testing data; classification vs. regression; hypothesis.
k-NN: the simplest learning algorithm; with sufficient data, a very hard "strawman" approach to beat.
Kernel regression/classification: set k to n (the number of data points) and choose the kernel width; smoother than k-NN.
Problems with k-NN: curse of dimensionality; not robust to irrelevant features; slow NN search (must remember the very large dataset for prediction).

11 Today’s plan: Linear Regression
Model representation
Cost function
Gradient descent
Features and polynomial regression
Normal equation

12 Linear Regression
Model representation · Cost function · Gradient descent · Features and polynomial regression · Normal equation

13 Training set → Learning algorithm → Hypothesis h
x (size of house) → h → y (estimated price)
Regression: real-valued output

14 House pricing prediction
[Scatter plot: price ($ in 1000s) vs. size in feet²]

15 Training set
Size in feet² (x)    Price ($) in 1000s (y)
2104                 460
1416                 232
1534                 315
852                  178
Notation:
m = number of training examples
x = input variable / feature
y = output variable / target variable
(x, y) = one training example
(x^(i), y^(i)) = i-th training example
Here m = 47. Examples: x^(1) = 2104, x^(2) = 1416, y^(1) = 460.
Slide credit: Andrew Ng

16 Model representation
y = h_θ(x) = θ_0 + θ_1·x   (shorthand: h(x))
Training set → Learning algorithm → Hypothesis h; size of house (x) → h → estimated price (y)
[Plot: price ($ in 1000s) vs. size in feet²]
Univariate (one-variable) linear regression.
Slide credit: Andrew Ng

17 Linear Regression
Model representation · Cost function · Gradient descent · Features and polynomial regression · Normal equation

18 Training set (m = 47)
Size in feet² (x)    Price ($) in 1000s (y)
2104                 460
1416                 232
1534                 315
852                  178
Hypothesis: h_θ(x) = θ_0 + θ_1·x
θ_0, θ_1: parameters/weights
How to choose the θ_i's?
Slide credit: Andrew Ng

19 Examples of the hypothesis h_θ(x) = θ_0 + θ_1·x for different parameter values:
[Three plots of h_θ(x) over x ∈ [0, 3]]
θ_0 = 1.5, θ_1 = 0
θ_0 = 0, θ_1 = 0.5
θ_0 = 1, θ_1 = 0.5
Slide credit: Andrew Ng

20 Cost function
Idea: choose θ_0, θ_1 so that h_θ(x) is close to y for our training examples (x, y).
h_θ(x^(i)) = θ_0 + θ_1·x^(i)
Cost function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²
Goal: minimize J(θ_0, θ_1) over θ_0, θ_1
[Scatter plot: price ($ in 1000s) vs. size in feet²]
Slide credit: Andrew Ng
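A minimal NumPy sketch of this cost, assuming x and y are 1-D arrays; the function name compute_cost is illustrative:

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum_i (h_theta(x^(i)) - y^(i))^2"""
    m = len(x)
    predictions = theta0 + theta1 * x          # h_theta(x^(i)) for all i
    return np.sum((predictions - y) ** 2) / (2 * m)

# The four training examples shown on the slide (size in feet^2, price in $1000s)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(compute_cost(0.0, 0.2, x, y))   # cost of the line h(x) = 0.2 * x
```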

21 Simplified hypothesis
Original problem:
  Hypothesis: h_θ(x) = θ_0 + θ_1·x
  Parameters: θ_0, θ_1
  Cost function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²
  Goal: minimize J(θ_0, θ_1) over θ_0, θ_1
Simplified problem (set θ_0 = 0):
  Hypothesis: h_θ(x) = θ_1·x
  Parameter: θ_1
  Cost function: J(θ_1) = (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²
  Goal: minimize J(θ_1) over θ_1
Slide credit: Andrew Ng

22–26 [Plots: h_θ(x) as a function of x (left) alongside the cost J(θ_1) as a function of θ_1 (right), traced out point by point for several values of θ_1]
Slide credit: Andrew Ng

27 Hypothesis: h_θ(x) = θ_0 + θ_1·x
Parameters: θ_0, θ_1
Cost function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²
Goal: minimize J(θ_0, θ_1) over θ_0, θ_1
Slide credit: Andrew Ng

28 Cost function
[Figure: visualization of J(θ_0, θ_1)]
Slide credit: Andrew Ng

29 How do we find good θ_0, θ_1 that minimize J(θ_0, θ_1)?
Slide credit: Andrew Ng

30 Linear Regression
Model representation · Cost function · Gradient descent · Features and polynomial regression · Normal equation

31 Gradient descent
Have some function J(θ_0, θ_1); want argmin_{θ_0, θ_1} J(θ_0, θ_1).
Outline:
Start with some θ_0, θ_1.
Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1), until we hopefully end up at a minimum.
Slide credit: Andrew Ng

32 [Figure-only slide]
Slide credit: Andrew Ng

33 Gradient descent
Repeat until convergence {
    θ_j := θ_j − α · ∂/∂θ_j J(θ_0, θ_1)    (for j = 0 and j = 1)
}
α: learning rate (step size)
∂/∂θ_j J(θ_0, θ_1): partial derivative (rate of change)
Slide credit: Andrew Ng

34 Gradient descent: simultaneous update
Correct (simultaneous update):
    temp0 := θ_0 − α · ∂/∂θ_0 J(θ_0, θ_1)
    temp1 := θ_1 − α · ∂/∂θ_1 J(θ_0, θ_1)
    θ_0 := temp0
    θ_1 := temp1
Incorrect (sequential update):
    temp0 := θ_0 − α · ∂/∂θ_0 J(θ_0, θ_1)
    θ_0 := temp0
    temp1 := θ_1 − α · ∂/∂θ_1 J(θ_0, θ_1)    (this uses the already-updated θ_0)
    θ_1 := temp1
Slide credit: Andrew Ng
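The same point as a small Python sketch, with grad0 and grad1 standing in for the partial derivatives ∂J/∂θ_0 and ∂J/∂θ_1 evaluated at the current parameters (all names here are illustrative):

```python
def gradient_step(theta0, theta1, grad0, grad1, alpha):
    """One simultaneous gradient-descent update.

    grad0, grad1 are the partial derivatives of J evaluated at the
    *current* (theta0, theta1); both new values are computed from the
    old parameters before either one is overwritten."""
    temp0 = theta0 - alpha * grad0
    temp1 = theta1 - alpha * grad1
    return temp0, temp1    # assign both at once: theta0, theta1 = gradient_step(...)

# Example: one step on J(t0, t1) = t0^2 + t1^2, whose gradient is (2*t0, 2*t1)
theta0, theta1 = 3.0, -2.0
theta0, theta1 = gradient_step(theta0, theta1, 2 * theta0, 2 * theta1, alpha=0.1)
print(theta0, theta1)   # -> approximately 2.4, -1.6
```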

35 Gradient descent intuition: θ_1 := θ_1 − α · d/dθ_1 J(θ_1)
[Plot of J(θ_1) vs. θ_1]
If d/dθ_1 J(θ_1) > 0 (positive slope), the update decreases θ_1.
If d/dθ_1 J(θ_1) < 0 (negative slope), the update increases θ_1.
Slide credit: Andrew Ng

36 Learning rate

37 Gradient descent for linear regression
Repeat until convergence {
    θ_j := θ_j − α · ∂/∂θ_j J(θ_0, θ_1)    (for j = 0 and j = 1)
}
Linear regression model:
    h_θ(x) = θ_0 + θ_1·x
    J(θ_0, θ_1) = (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²
Slide credit: Andrew Ng

38 Computing partial derivative
∂/∂θ_j J(θ_0, θ_1) = ∂/∂θ_j [ (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))² ]
                   = ∂/∂θ_j [ (1/2m) Σ_{i=1}^m (θ_0 + θ_1·x^(i) − y^(i))² ]
j = 0:  ∂/∂θ_0 J(θ_0, θ_1) = (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))
j = 1:  ∂/∂θ_1 J(θ_0, θ_1) = (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) · x^(i)
Slide credit: Andrew Ng
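One way to sanity-check these expressions is to compare them with finite differences; a small sketch of that idea (the helper names and the step size eps are illustrative choices, not from the lecture):

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(x)

def J(t0, t1):
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

def analytic_grad(t0, t1):
    err = t0 + t1 * x - y                        # h_theta(x^(i)) - y^(i)
    return np.sum(err) / m, np.sum(err * x) / m  # dJ/dtheta0, dJ/dtheta1

def numeric_grad(t0, t1, eps=1e-4):
    g0 = (J(t0 + eps, t1) - J(t0 - eps, t1)) / (2 * eps)
    g1 = (J(t0, t1 + eps) - J(t0, t1 - eps)) / (2 * eps)
    return g0, g1

print(analytic_grad(0.0, 0.1))   # should closely match ...
print(numeric_grad(0.0, 0.1))    # ... the central-difference estimate
```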

39 Gradient descent for linear regression
Repeat until convergence {
    θ_0 := θ_0 − α · (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α · (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) · x^(i)
}
Update θ_0 and θ_1 simultaneously.
Slide credit: Andrew Ng
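Putting the pieces together, a minimal sketch of the full loop for the univariate case on toy data; the learning rate, iteration count, and data are illustrative choices:

```python
import numpy as np

def gradient_descent_1d(x, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        err = theta0 + theta1 * x - y    # h_theta(x^(i)) - y^(i)
        grad0 = np.sum(err) / m
        grad1 = np.sum(err * x) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1

# Toy data generated from y = 2 + 0.5 x, so the fit should land near (2, 0.5)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 0.5 * x
print(gradient_descent_1d(x, y))
```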

40 Batch gradient descent
“Batch”: each step of gradient descent uses all m training examples.
Repeat until convergence {
    θ_0 := θ_0 − α · (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α · (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) · x^(i)
}
m: number of training examples
Slide credit: Andrew Ng

41 Linear Regression
Model representation · Cost function · Gradient descent · Features and polynomial regression · Normal equation

42 Training dataset
Size in feet² (x)    Price ($) in 1000s (y)
2104                 460
1416                 232
1534                 315
852                  178
h_θ(x) = θ_0 + θ_1·x
Slide credit: Andrew Ng

43 Multiple features (input variables)
Size in feet² (x_1)   Bedrooms (x_2)   Floors (x_3)   Age in years (x_4)   Price ($ in 1000s) (y)
2104                  5                1                45                 460
1416                  3                2                40                 232
1534                  3                2                30                 315
852                   2                1                36                 178
Notation:
n = number of features
x^(i) = input features of the i-th training example
x_j^(i) = value of feature j in the i-th training example
Quick check: x_3^(2) = ?   x_3^(4) = ?
Slide credit: Andrew Ng

44 Hypothesis
Previously: h_θ(x) = θ_0 + θ_1·x
Now: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + θ_4·x_4
Slide credit: Andrew Ng

45 Vector notation
h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n
For convenience of notation, define x_0 = 1 (x_0^(i) = 1 for all examples). Then
x = [x_0, x_1, x_2, …, x_n]ᵀ ∈ ℝ^(n+1)
θ = [θ_0, θ_1, θ_2, …, θ_n]ᵀ ∈ ℝ^(n+1)
h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n = θᵀx
Slide credit: Andrew Ng
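A tiny sketch of the same bookkeeping in NumPy: prepend the constant feature x_0 = 1 and the hypothesis becomes a single dot product (the parameter values below are illustrative only):

```python
import numpy as np

# One training example from the table: size, bedrooms, floors, age
x_raw = np.array([2104.0, 5.0, 1.0, 45.0])
x = np.concatenate(([1.0], x_raw))        # prepend x_0 = 1, so x lives in R^(n+1)

theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])   # illustrative parameter values only

h = theta @ x                             # h_theta(x) = theta^T x
print(h)
```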

46 Gradient descent with multiple features
Previously (n = 1):
Repeat until convergence {
    θ_0 := θ_0 − α · (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α · (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) · x^(i)
}
New algorithm (n ≥ 1):
Repeat until convergence {
    θ_j := θ_j − α · (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) · x_j^(i)
}
Simultaneously update θ_j for j = 0, 1, …, n.
Slide credit: Andrew Ng
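In vectorized form the whole update is one line per iteration; a minimal sketch, assuming X already contains the column of ones and the features are on comparable scales (see the next slide on feature scaling):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, num_iters=2000):
    """Batch gradient descent for h_theta(x) = theta^T x.

    X has shape (m, n+1) with X[:, 0] == 1; all theta_j are updated simultaneously."""
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        err = X @ theta - y                      # h_theta(x^(i)) - y^(i), one entry per example
        theta = theta - alpha * (X.T @ err) / m  # theta_j -= alpha * (1/m) * sum_i err_i * x_j^(i)
    return theta

# Tiny toy problem: two (already small-scale) features plus the bias column.
# The targets were generated from theta = [1, 2, -1], so the fit should land near that.
X = np.array([[1.0, -1.0,  0.5],
              [1.0,  0.0, -0.5],
              [1.0,  1.0,  0.5],
              [1.0,  2.0, -0.5]])
y = np.array([-1.5, 1.5, 2.5, 5.5])
print(gradient_descent(X, y))   # ~[1, 2, -1]
```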

47 Gradient descent in practice: Feature scaling
Idea: make sure the features are on a similar scale (e.g., −1 ≤ x_j ≤ 1).
Example: x_1 = size (in feet²), x_2 = number of bedrooms (1–5).
[Two plots of J over (θ_1, θ_2): without and with feature scaling]
Slide credit: Andrew Ng
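A common way to achieve this is standardization (zero mean, unit standard deviation); a short sketch, with the caveat that the statistics must be computed on the training set and reused at test time:

```python
import numpy as np

def standardize(X):
    """Rescale each feature to zero mean and unit standard deviation.
    Returns the scaled data plus (mu, sigma) so the *same* statistics
    can be applied to test examples later: (x_test - mu) / sigma."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Size in feet^2 and number of bedrooms
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
X_scaled, mu, sigma = standardize(X)
print(X_scaled.mean(axis=0))   # ~[0, 0]
print(X_scaled.std(axis=0))    # [1, 1]
```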

48 Gradient descent in practice: Learning rate
Automatic convergence test
α too small: slow convergence
α too large: may not converge
To choose α, try …, 0.001, …, 0.01, …, 0.1, …, 1

49 House prices prediction
h_θ(x) = θ_0 + θ_1 × frontage + θ_2 × depth
Alternatively, define the area x = frontage × depth and use h_θ(x) = θ_0 + θ_1·x
Slide credit: Andrew Ng

50 Polynomial regression
[Plot: price ($ in 1000s) vs. size in feet², with a nonlinear fit]
h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 = θ_0 + θ_1·(size) + θ_2·(size)² + θ_3·(size)³
with features x_1 = (size), x_2 = (size)², x_3 = (size)³
Slide credit: Andrew Ng
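A short sketch of building these polynomial features and reusing the same linear machinery; note that feature scaling matters even more here, since size, size², and size³ live on very different scales:

```python
import numpy as np

size = np.array([2104.0, 1416.0, 1534.0, 852.0])

# Polynomial features: x1 = size, x2 = size^2, x3 = size^3
X_poly = np.column_stack([size, size**2, size**3])

# Feature scaling is essentially mandatory here: the columns span ~10^3 to ~10^10
X_poly = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)

# Add the bias column x0 = 1; any linear-regression solver can now be applied
X = np.column_stack([np.ones(len(size)), X_poly])
print(X.shape)   # (4, 4): m examples, 1 + 3 features
```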

51 Linear Regression
Model representation · Cost function · Gradient descent · Features and polynomial regression · Normal equation

52 Normal equation
(x_0)   Size in feet² (x_1)   Bedrooms (x_2)   Floors (x_3)   Age in years (x_4)   Price ($ in 1000s) (y)
1       2104                  5                1              45                   460
1       1416                  3                2              40                   232
1       1534                  3                2              30                   315
1       852                   2                1              36                   178
Stack the examples into a design matrix X (one row per example) and a target vector y; then
θ = (XᵀX)⁻¹ Xᵀ y
Slide credit: Andrew Ng

53 Least squares solution
J(θ) = (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))² = (1/2m) Σ_{i=1}^m (θᵀx^(i) − y^(i))² = (1/2m) ‖Xθ − y‖₂²
Setting ∂J/∂θ = 0 gives θ = (XᵀX)⁻¹ Xᵀ y.
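A minimal sketch of this closed-form solution; in practice one solves the linear system (or uses a least-squares routine) rather than forming the inverse explicitly:

```python
import numpy as np

# Design matrix with the bias column x0 = 1 (sizes from the running example)
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# Normal equation: solve (X^T X) theta = X^T y instead of inverting X^T X
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)

# Equivalent and numerically more robust:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_lstsq)
```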

54 Justification/interpretation 1
Loss minimization:
J(θ) = (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))² = (1/m) Σ_{i=1}^m L_ls(h_θ(x^(i)), y^(i)),
where L_ls(ŷ, y) = ½(ŷ − y)² is the least squares loss.
This is an instance of empirical risk minimization (ERM): minimize the average training loss (1/m) Σ_{i=1}^m L_ls(ŷ^(i), y^(i)).

55 Justification/interpretation 2
Probabilistic model: assume a linear model with Gaussian errors,
p_θ(y^(i) | x^(i)) = 1/√(2πσ²) · exp(−(y^(i) − θᵀx^(i))² / (2σ²))
Maximum likelihood estimation:
argmax_θ Π_{i=1}^m p_θ(y^(i) | x^(i)) = argmax_θ Σ_{i=1}^m log p_θ(y^(i) | x^(i)) = argmin_θ Σ_{i=1}^m (θᵀx^(i) − y^(i))²
so maximizing the likelihood is equivalent to minimizing the least squares cost.

56 Justification/interpretation 3
Geometric interpretation: write the design matrix by rows and by columns,
X = [ x^(1)ᵀ ; x^(2)ᵀ ; … ; x^(m)ᵀ ]   (one row per example, with a leading 1)
  = [ z_0  z_1  z_2  …  z_n ]          (one column per feature)
Xθ ranges over the column space of X, i.e. span({z_0, z_1, …, z_n}).
At the least squares solution, the residual Xθ − y is orthogonal to the column space of X:
Xᵀ(Xθ − y) = 0  →  (XᵀX)θ = Xᵀy
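A quick numerical check of this orthogonality property on the toy data from the earlier sketch (any least-squares solution should make Xᵀ(Xθ − y) essentially zero):

```python
import numpy as np

X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = X @ theta - y

# The residual is orthogonal to every column of X (up to floating-point error)
print(X.T @ residual)   # ~[0, 0]
```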

57 Gradient descent vs. normal equation (m training examples, n features)
Gradient descent:
  Need to choose α
  Needs many iterations
  Works well even when n is large
Normal equation:
  No need to choose α
  No iterations
  Needs to compute (XᵀX)⁻¹; slow if n is very large
Slide credit: Andrew Ng

58 Things to remember
Model representation: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n = θᵀx
Cost function: J(θ) = (1/2m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))²
Gradient descent for linear regression:
  Repeat until convergence { θ_j := θ_j − α · (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) · x_j^(i) }
Features and polynomial regression: can combine features; can use different functions of the inputs (e.g., polynomials) to generate new features.
Normal equation: θ = (XᵀX)⁻¹ Xᵀ y

59 Next Naïve Bayes, Logistic regression, Regularization

