
1 Machine Learning – (Linear) Regression
Wilson McKerrow (Fenyo lab postdoc). Contact: Wilson.McKerrow@nyumc.org

2 Linear Regression – one independent variable
Data: $(x_i, y_i)$ for $i = 1 \ldots n$. Want: $y_i = w_1 x_i + w_0 + \epsilon_i$, where each residual $\epsilon_i$ is "small". Here $x$ is the predictor (independent variable) and $y$ is the response (dependent variable).

3 Linear Regression – one independent variable
What is "small"? Define a loss function $L(w_1, w_0)$ to measure how far off the model is. Want: $\hat{w}_1, \hat{w}_0 = \arg\min_{w_1, w_0} L(w_1, w_0)$.

4 Linear Regression – one independent variable
Standard choice: sum of squared errors (least squares): $L(w_1, w_0) = \sum_i \epsilon_i^2 = \sum_i (y_i - w_1 x_i - w_0)^2$.

5 Linear Regression – one independent variable
Advantages of least squares. Gauss-Markov theorem: least squares has the lowest variance among linear unbiased estimators, provided the residuals have mean zero ($E[\epsilon_i] = 0$), have constant variance ($\mathrm{Var}(\epsilon_i) = \sigma^2$), and are uncorrelated ($\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$).

6 Linear Regression – one independent variable
Advantages of least squares: Gauss-Markov theorem (lowest variance among linear unbiased estimators). Equivalent to maximum likelihood estimation (MLE) when the residuals are Gaussian: $\epsilon_i \sim N(0, \sigma^2)$.

7 Linear Regression – one independent variable
Advantages of least squares: Gauss-Markov theorem (lowest variance among linear unbiased estimators). Equivalent to MLE for Gaussian residuals. Easy to calculate (closed-form solution).

8 Linear Regression – one independent variable
Disadvantages of least squares: the variance of the estimates can still be large, and squaring the residuals makes the fit sensitive to outliers.

9 Linear Regression – one independent variable
Minimizing the loss function $L$ (sum of squared errors): set the partial derivatives to zero, $\partial L / \partial w_0 = 0$ and $\partial L / \partial w_1 = 0$.

10 Linear Regression – one independent variable
Minimizing the loss function $L$ (sum of squared errors) gives the closed-form solution $\hat{w}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}$.
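A minimal NumPy sketch of these closed-form estimates; the simulated x and y below are assumptions purely for illustration:

```python
import numpy as np

def simple_least_squares(x, y):
    """Closed-form least-squares fit of y = w1*x + w0."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Simulated data (assumed for illustration): true line y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)
w0_hat, w1_hat = simple_least_squares(x, y)
print(w0_hat, w1_hat)  # should be close to 1.0 and 2.0
```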

11 Linear Regression – Vector notation
Relationship: $y = w_1 x + w_0 + \epsilon$. Define $\mathbf{x} = (1, x_1)$ and $\mathbf{w} = (w_0, w_1)$; then $y = \mathbf{x} \cdot \mathbf{w} + \epsilon$.

12 Linear Regression – Multiple Independent Variables
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$ with $\mathbf{x} = (1, x_1, x_2, x_3, \ldots, x_k)$ and $\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_k)$.

13 Linear Regression – Matrix notation
$\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$. Data: $(y_j, x_{1j}, x_{2j}, \ldots, x_{kj})$ for $j = 1 \ldots n$, with
$\mathbf{y} = (y_1, y_2, y_3, \ldots, y_n)^T$, $\quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{k1} \\ 1 & x_{12} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{pmatrix}$, $\quad \mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_k)^T$.

14 Linear Regression – Matrix notation
Row by column: component $j$ of $\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$ is $y_j = w_0 + \sum_{i=1}^{k} x_{ij} w_i + \epsilon_j$.

15 Multiple Linear Regression
$\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$. Minimizing the loss function $L$ (sum of squared errors):
$\frac{dL}{d\mathbf{w}} = \frac{d}{d\mathbf{w}} \boldsymbol{\epsilon}^T \boldsymbol{\epsilon} = \frac{d}{d\mathbf{w}} (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w}) = \frac{d}{d\mathbf{w}} \left( \mathbf{y}^T\mathbf{y} - \mathbf{y}^T X\mathbf{w} - (X\mathbf{w})^T \mathbf{y} + (X\mathbf{w})^T X\mathbf{w} \right) = -2 X^T \mathbf{y} + 2 X^T X \mathbf{w} = 0$
$\Rightarrow \hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}$
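A short sketch of this normal-equation solution in NumPy (the data and variable names are assumptions for illustration); using `np.linalg.solve` avoids explicitly forming $(X^T X)^{-1}$:

```python
import numpy as np

def ols_fit(X_raw, y):
    """Least squares via the normal equations: w_hat = (X^T X)^(-1) X^T y."""
    n = X_raw.shape[0]
    X = np.column_stack([np.ones(n), X_raw])   # prepend the intercept column
    return np.linalg.solve(X.T @ X, X.T @ y)   # solve, rather than invert X^T X

# Usage with simulated data (assumed for illustration)
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 3))
true_w = np.array([0.5, 1.0, -2.0, 3.0])       # w0, w1, w2, w3
y = np.column_stack([np.ones(200), X_raw]) @ true_w + rng.normal(scale=0.1, size=200)
print(ols_fit(X_raw, y))                       # approx [0.5, 1.0, -2.0, 3.0]
```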

16 Non-constant variance
[Figure: simulated data with noise standard deviation $\sigma \propto x^2$; the true line ("Truth") vs. the least-squares fit ("Fit")]

17 Non-constant variance
[Figure: least-squares fits over 100 repetitions of the simulation]

18 Weighted least squares
$\mathrm{sd}(\epsilon_i) \approx \sigma_i$, but $\mathrm{sd}(\epsilon_i / \sigma_i) \approx 1$ is constant. New loss function: $L(w_1, w_0) = \sum_i \left( \frac{\epsilon_i}{\sigma_i} \right)^2$. Minimized when $\hat{\mathbf{w}} = (X^T S X)^{-1} X^T S \mathbf{y}$ with $S = \begin{pmatrix} 1/\sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1/\sigma_n^2 \end{pmatrix}$.
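A minimal NumPy sketch of this weighted solution, assuming the per-point standard deviations $\sigma_i$ are known; the heteroscedastic toy data (noise growing like $x^2$, as in the figure) is an assumption for illustration:

```python
import numpy as np

def wls_fit(X_raw, y, sigma):
    """Weighted least squares: w_hat = (X^T S X)^(-1) X^T S y, S = diag(1/sigma_i^2)."""
    n = X_raw.shape[0]
    X = np.column_stack([np.ones(n), X_raw])
    s = 1.0 / sigma ** 2          # diagonal of S
    XtS = X.T * s                 # equivalent to X^T @ diag(s)
    return np.linalg.solve(XtS @ X, XtS @ y)

# Heteroscedastic example: noise sd grows like x^2
rng = np.random.default_rng(2)
x = rng.uniform(1, 5, size=300)
sigma = 0.2 * x ** 2
y = 2.0 * x + 1.0 + rng.normal(scale=sigma)
print(wls_fit(x[:, None], y, sigma))   # approx [1.0, 2.0]
```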

19 Weighted least squares
[Figure: weighted least-squares fits over 100 repetitions with $\sigma \propto x^2$]

20 (Non)Linear Regression – Sum of Functions
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$ with $\mathbf{x} = (1, f_1(x), f_2(x), f_3(x), \ldots, f_l(x))$ and $\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_l)$, so $\mathbf{x} \cdot \mathbf{w} = w_0 + w_1 f_1(x) + \cdots + w_l f_l(x)$.

21 (Non)Linear Regression – Polynomial
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$ with $\mathbf{x} = (1, x, x^2, x^3, \ldots, x^l)$ and $\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_l)$, so $\mathbf{x} \cdot \mathbf{w} = w_0 + w_1 x + w_2 x^2 + \cdots + w_l x^l$.
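Polynomial regression is therefore ordinary linear regression on a polynomial design matrix. A short NumPy sketch; the degree and the noisy sine data are assumptions for illustration:

```python
import numpy as np

def poly_fit(x, y, degree):
    """Fit y = w0 + w1*x + ... + w_degree*x^degree by least squares."""
    X = np.vander(x, N=degree + 1, increasing=True)   # columns 1, x, x^2, ..., x^degree
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares on the expanded features
    return w_hat

# Usage: fit a cubic to noisy samples of a non-linear function
rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 100)
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=100)
print(poly_fit(x, y, degree=3))
```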

22 Model Capacity: Overfitting and Underfitting

23 Model Capacity: Overfitting and Underfitting

24 Model Capacity: Overfitting and Underfitting

25 Model Capacity: Overfitting and Underfitting
[Figure: error on the training set vs. degree of polynomial]

26 Model Capacity: Overfitting and Underfitting
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (John von Neumann)

27 Training and Testing
[Figure: the data set is split into a training set and a test set]

28 Training and Testing – Linear relationship
[Figure: training error and testing error vs. degree of polynomial for data with an underlying linear relationship]
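A hedged sketch of how such train/test curves can be produced in NumPy; the split sizes, range of degrees, and data-generating line are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=200)
y = 1.5 * x + 0.5 + rng.normal(scale=0.3, size=200)   # underlying linear relationship

# Hold out a test set
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

for degree in range(1, 10):
    X_train = np.vander(x_train, N=degree + 1, increasing=True)
    X_test = np.vander(x_test, N=degree + 1, increasing=True)
    w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    train_err = np.mean((y_train - X_train @ w_hat) ** 2)
    test_err = np.mean((y_test - X_test @ w_hat) ** 2)
    # Training error keeps falling with degree; test error levels off or rises (overfitting)
    print(degree, train_err, test_err)
```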

29 Testing Error and Training Set Size
[Figure: error vs. degree of polynomial, panels labeled "Low Variance" and "High Variance", with curves for training-set sizes from 10 to 3000]

30 Coefficients and Training Set Size
[Figure: absolute value of the fitted coefficients for a degree-9 polynomial, with training sets of size 10, 100, and 1000]

31 Training and Testing – Non-linear relationship
[Figure: training error and testing error vs. degree of polynomial for data with an underlying non-linear relationship]

32 Testing Error and Training Set Size
[Figure: log(error) vs. degree of polynomial, panels labeled "Low Variance" and "High Variance", with curves for training-set sizes from 10 to 3000]

33 Multiple Linear Regression
Correlation of the independent variables increases the uncertainty in parameter determination. [Figure: standard deviation of the parameter estimates vs. number of data points, for correlated (R² = 0.9) and uncorrelated predictors]

34 Regularization
No regularization: $\min_{\mathbf{w}} L(\mathbf{w})$. Regularization: $\min_{\mathbf{w}} L(\mathbf{w}) + \lambda g(\mathbf{w})$. Ridge regression: $\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2$.

35 Ridge Regression
Recall: $\frac{dL}{d\mathbf{w}} = -2 X^T \mathbf{y} + 2 X^T X \mathbf{w}$. So $\frac{d}{d\mathbf{w}} \left( L + \lambda \mathbf{w}^T \mathbf{w} \right) = -2 X^T \mathbf{y} + 2 X^T X \mathbf{w} + 2\lambda \mathbf{w}$, and setting this to zero gives $(X^T X + \lambda I)\hat{\mathbf{w}} = X^T \mathbf{y}$, yielding $\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$.
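A minimal NumPy sketch of this closed form (the data are assumptions for illustration; note that for simplicity the intercept column is penalized here too, whereas in practice it is often left unpenalized):

```python
import numpy as np

def ridge_fit(X_raw, y, lam):
    """Ridge regression: w_hat = (X^T X + lambda*I)^(-1) X^T y."""
    n = X_raw.shape[0]
    X = np.column_stack([np.ones(n), X_raw])   # intercept column included
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Usage: coefficients shrink toward zero as lambda grows
rng = np.random.default_rng(5)
X_raw = rng.normal(size=(50, 5))
y = X_raw @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=50)
for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_fit(X_raw, y, lam))
```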

36 Regularization: Coefficients and Training Set Size
[Figure: fitted coefficients for a degree-9 polynomial with regularization, with training sets of size 10, 100, and 1000]

37 Nearest Neighbor Regression – Fixed Distance

38 Nearest Neighbor Regression – Fixed Number (kNN)

39 Nearest Neighbor Regression
Estimate $\hat{y}(x)$ by taking the average of $y$ over the nearest neighbors of $x$, or use a distance-weighted average.
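A small sketch of k-nearest-neighbor regression with both the plain and the distance-weighted average; the choice of k and the toy data are assumptions for illustration:

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5, weighted=False):
    """Predict y at x_query as the (optionally distance-weighted) mean of its k nearest neighbors."""
    dists = np.abs(x_train - x_query)        # 1-D case; use a norm for multiple features
    idx = np.argsort(dists)[:k]              # indices of the k closest training points
    if not weighted:
        return y_train[idx].mean()
    w = 1.0 / (dists[idx] + 1e-12)           # closer neighbors get larger weights
    return np.sum(w * y_train[idx]) / np.sum(w)

# Usage on noisy non-linear data
rng = np.random.default_rng(6)
x_train = rng.uniform(0, 2 * np.pi, size=300)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=300)
print(knn_predict(x_train, y_train, x_query=1.0, k=10))                  # roughly sin(1.0)
print(knn_predict(x_train, y_train, x_query=1.0, k=10, weighted=True))
```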

40 Nearest Neighbor Regression
[Figure: error vs. number of neighbors for linear data (error) and non-linear data (log(error)), with curves for training-set sizes from 10 to 3000]


