
1 Machine Learning – (Linear) Regression
Wilson McKerrow (Fenyo lab postdoc). Contact: Wilson.McKerrow@nyumc.org

2 Linear Regression – one independent variable
Data: $(x_i, y_i)$ for $i = 1 \ldots n$. Want: $y_i = w_1 x_i + w_0 + \epsilon_i$, where each residual $\epsilon_i$ is "small". Here $x$ is the predictor (independent variable) and $y$ is the response (dependent variable).

3 Linear Regression – one independent variable
What is "small"? Define a loss function $L(w_1, w_0)$ to measure how far off the model is. Want: $\hat{w}_1, \hat{w}_0 = \arg\min_{w_1, w_0} L(w_1, w_0)$.

4 Linear Regression – one independent variable
Standard choice: sum of squared errors (least squares): $L(w_1, w_0) = \sum_i \epsilon_i^2 = \sum_i (y_i - w_1 x_i - w_0)^2$.

5 Linear Regression – one independent variable
Advantages of least squares. Gauss-Markov theorem: least squares has the lowest variance among linear unbiased estimators, provided the residuals have mean zero ($E[\epsilon_i] = 0$), have constant variance ($\mathrm{Var}(\epsilon_i) = \sigma^2$), and are uncorrelated ($\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$).

6 Linear Regression – one independent variable
Advantages of least squares: Gauss-Markov theorem (lowest variance among linear unbiased estimators). Equivalent to maximum likelihood estimation (MLE) when the residuals are Gaussian: $\epsilon_i \sim N(0, \sigma^2)$.

7 Linear Regression – one independent variable
Advantages of least squares: Gauss-Markov theorem (lowest variance among linear unbiased estimators). Equivalent to MLE for Gaussian residuals. Easy to calculate (closed-form solution).

8 Linear Regression – one independent variable
Disadvantages of least squares: the variance of the estimates can still be large, and squaring the residuals makes the fit sensitive to outliers.

9 Linear Regression – one independent variable
Minimizing the loss function $L$ (sum of squared errors): set the partial derivatives to zero, $\partial L / \partial w_0 = 0$ and $\partial L / \partial w_1 = 0$.

10 Linear Regression – one independent variable
Minimizing the loss function $L$ (sum of squared errors) gives the closed-form solution $\hat{w}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}$.
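A minimal NumPy sketch of these closed-form estimates; the simulated x and y below are assumptions purely for illustration:

```python
import numpy as np

def simple_least_squares(x, y):
    """Closed-form least-squares fit of y = w1*x + w0."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Simulated data (assumed for illustration): true line y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)
w0_hat, w1_hat = simple_least_squares(x, y)
print(w0_hat, w1_hat)  # should be close to 1.0 and 2.0
```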

11 Linear Regression – Vector notation
Relationship: $y = w_1 x + w_0 + \epsilon$. Define $\mathbf{x} = (1, x_1)$ and $\mathbf{w} = (w_0, w_1)$; then $y = \mathbf{x} \cdot \mathbf{w} + \epsilon$.

12 Linear Regression – Multiple Independent Variables
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$ with $\mathbf{x} = (1, x_1, x_2, x_3, \ldots, x_k)$ and $\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_k)$.

13 Linear Regression – Matrix notation
$\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$. Data: $(y_j, x_{1j}, x_{2j}, \ldots, x_{kj})$ for $j = 1 \ldots n$, with
$\mathbf{y} = (y_1, y_2, y_3, \ldots, y_n)^T$, $\quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{k1} \\ 1 & x_{12} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{pmatrix}$, $\quad \mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_k)^T$.

14 Linear Regression – Matrix notation
Row by column: component $j$ of $\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$ is $y_j = w_0 + \sum_{i=1}^{k} x_{ij} w_i + \epsilon_j$.

15 Multiple Linear Regression
$\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$. Minimizing the loss function $L$ (sum of squared errors):
$\frac{dL}{d\mathbf{w}} = \frac{d}{d\mathbf{w}} \boldsymbol{\epsilon}^T \boldsymbol{\epsilon} = \frac{d}{d\mathbf{w}} (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w}) = \frac{d}{d\mathbf{w}} \left( \mathbf{y}^T\mathbf{y} - \mathbf{y}^T X\mathbf{w} - (X\mathbf{w})^T \mathbf{y} + (X\mathbf{w})^T X\mathbf{w} \right) = -2 X^T \mathbf{y} + 2 X^T X \mathbf{w} = 0$
$\Rightarrow \hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}$
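A short sketch of this normal-equation solution in NumPy (the data and variable names are assumptions for illustration); using `np.linalg.solve` avoids explicitly forming $(X^T X)^{-1}$:

```python
import numpy as np

def ols_fit(X_raw, y):
    """Least squares via the normal equations: w_hat = (X^T X)^(-1) X^T y."""
    n = X_raw.shape[0]
    X = np.column_stack([np.ones(n), X_raw])   # prepend the intercept column
    return np.linalg.solve(X.T @ X, X.T @ y)   # solve, rather than invert X^T X

# Usage with simulated data (assumed for illustration)
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 3))
true_w = np.array([0.5, 1.0, -2.0, 3.0])       # w0, w1, w2, w3
y = np.column_stack([np.ones(200), X_raw]) @ true_w + rng.normal(scale=0.1, size=200)
print(ols_fit(X_raw, y))                       # approx [0.5, 1.0, -2.0, 3.0]
```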

16 Non-constant variance
[Figure: simulated data with noise standard deviation $\sigma \propto x^2$; the true line ("Truth") vs. the least-squares fit ("Fit")]

17 Non-constant variance
[Figure: least-squares fits over 100 repetitions of the simulation]

18 Weighted least squares
$\mathrm{sd}(\epsilon_i) \approx \sigma_i$, but $\mathrm{sd}(\epsilon_i / \sigma_i) \approx 1$ is constant. New loss function: $L(w_1, w_0) = \sum_i \left( \frac{\epsilon_i}{\sigma_i} \right)^2$. Minimized when $\hat{\mathbf{w}} = (X^T S X)^{-1} X^T S \mathbf{y}$ with $S = \begin{pmatrix} 1/\sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1/\sigma_n^2 \end{pmatrix}$.
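A minimal NumPy sketch of this weighted solution, assuming the per-point standard deviations $\sigma_i$ are known; the heteroscedastic toy data (noise growing like $x^2$, as in the figure) is an assumption for illustration:

```python
import numpy as np

def wls_fit(X_raw, y, sigma):
    """Weighted least squares: w_hat = (X^T S X)^(-1) X^T S y, S = diag(1/sigma_i^2)."""
    n = X_raw.shape[0]
    X = np.column_stack([np.ones(n), X_raw])
    s = 1.0 / sigma ** 2          # diagonal of S
    XtS = X.T * s                 # equivalent to X^T @ diag(s)
    return np.linalg.solve(XtS @ X, XtS @ y)

# Heteroscedastic example: noise sd grows like x^2
rng = np.random.default_rng(2)
x = rng.uniform(1, 5, size=300)
sigma = 0.2 * x ** 2
y = 2.0 * x + 1.0 + rng.normal(scale=sigma)
print(wls_fit(x[:, None], y, sigma))   # approx [1.0, 2.0]
```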

19 Weighted least squares
[Figure: weighted least-squares fits over 100 repetitions with $\sigma \propto x^2$]

20 (Non)Linear Regression – Sum of Functions
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$ with $\mathbf{x} = (1, f_1(x), f_2(x), f_3(x), \ldots, f_l(x))$ and $\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_l)$, so $\mathbf{x} \cdot \mathbf{w} = w_0 + w_1 f_1(x) + \cdots + w_l f_l(x)$.

21 (Non)Linear Regression – Polynomial
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$ with $\mathbf{x} = (1, x, x^2, x^3, \ldots, x^l)$ and $\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_l)$, so $\mathbf{x} \cdot \mathbf{w} = w_0 + w_1 x + w_2 x^2 + \cdots + w_l x^l$.
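Polynomial regression is therefore ordinary linear regression on a polynomial design matrix. A short NumPy sketch; the degree and the noisy sine data are assumptions for illustration:

```python
import numpy as np

def poly_fit(x, y, degree):
    """Fit y = w0 + w1*x + ... + w_degree*x^degree by least squares."""
    X = np.vander(x, N=degree + 1, increasing=True)   # columns 1, x, x^2, ..., x^degree
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares on the expanded features
    return w_hat

# Usage: fit a cubic to noisy samples of a non-linear function
rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 100)
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=100)
print(poly_fit(x, y, degree=3))
```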

22 Model Capacity: Overfitting and Underfitting

23 Model Capacity: Overfitting and Underfitting

24 Model Capacity: Overfitting and Underfitting

25 Model Capacity: Overfitting and Underfitting
[Figure: error on the training set vs. degree of polynomial]

26 Model Capacity: Overfitting and Underfitting
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (John von Neumann)

27 Training and Testing
[Figure: the data set is split into a training set and a test set]

28 Training and Testing – Linear relationship
[Figure: training error and testing error vs. degree of polynomial for data with an underlying linear relationship]
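A hedged sketch of how such train/test curves can be produced in NumPy; the split sizes, range of degrees, and data-generating line are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=200)
y = 1.5 * x + 0.5 + rng.normal(scale=0.3, size=200)   # underlying linear relationship

# Hold out a test set
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

for degree in range(1, 10):
    X_train = np.vander(x_train, N=degree + 1, increasing=True)
    X_test = np.vander(x_test, N=degree + 1, increasing=True)
    w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    train_err = np.mean((y_train - X_train @ w_hat) ** 2)
    test_err = np.mean((y_test - X_test @ w_hat) ** 2)
    # Training error keeps falling with degree; test error levels off or rises (overfitting)
    print(degree, train_err, test_err)
```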

29 Testing Error and Training Set Size
[Figure: error vs. degree of polynomial, panels labeled "Low Variance" and "High Variance", with curves for training-set sizes from 10 to 3000]

30 Coefficients and Training Set Size
[Figure: absolute value of the fitted coefficients for a degree-9 polynomial, with training sets of size 10, 100, and 1000]

31 Training and Testing – Non-linear relationship
[Figure: training error and testing error vs. degree of polynomial for data with an underlying non-linear relationship]

32 Testing Error and Training Set Size
[Figure: log(error) vs. degree of polynomial, panels labeled "Low Variance" and "High Variance", with curves for training-set sizes from 10 to 3000]

33 Multiple Linear Regression
Correlation of the independent variables increases the uncertainty in parameter determination. [Figure: standard deviation of the parameter estimates vs. number of data points, for correlated (R² = 0.9) and uncorrelated predictors]

34 Regularization
No regularization: $\min_{\mathbf{w}} L(\mathbf{w})$. Regularization: $\min_{\mathbf{w}} L(\mathbf{w}) + \lambda g(\mathbf{w})$. Ridge regression: $\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2$.

35 Ridge Regression
Recall: $\frac{dL}{d\mathbf{w}} = -2 X^T \mathbf{y} + 2 X^T X \mathbf{w}$. So $\frac{d}{d\mathbf{w}} \left( L + \lambda \mathbf{w}^T \mathbf{w} \right) = -2 X^T \mathbf{y} + 2 X^T X \mathbf{w} + 2\lambda \mathbf{w}$, and setting this to zero gives $(X^T X + \lambda I)\hat{\mathbf{w}} = X^T \mathbf{y}$, yielding $\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$.
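A minimal NumPy sketch of this closed form (the data are assumptions for illustration; note that for simplicity the intercept column is penalized here too, whereas in practice it is often left unpenalized):

```python
import numpy as np

def ridge_fit(X_raw, y, lam):
    """Ridge regression: w_hat = (X^T X + lambda*I)^(-1) X^T y."""
    n = X_raw.shape[0]
    X = np.column_stack([np.ones(n), X_raw])   # intercept column included
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Usage: coefficients shrink toward zero as lambda grows
rng = np.random.default_rng(5)
X_raw = rng.normal(size=(50, 5))
y = X_raw @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=50)
for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_fit(X_raw, y, lam))
```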

36 Regularization: Coefficients and Training Set Size
[Figure: fitted coefficients for a degree-9 polynomial with regularization, with training sets of size 10, 100, and 1000]

37 Nearest Neighbor Regression – Fixed Distance

38 Nearest Neighbor Regression – Fixed Number (kNN)

39 Nearest Neighbor Regression
Estimate $\hat{y}(x)$ by taking the average of $y$ over the nearest neighbors of $x$, or use a distance-weighted average.
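A small sketch of k-nearest-neighbor regression with both the plain and the distance-weighted average; the choice of k and the toy data are assumptions for illustration:

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5, weighted=False):
    """Predict y at x_query as the (optionally distance-weighted) mean of its k nearest neighbors."""
    dists = np.abs(x_train - x_query)        # 1-D case; use a norm for multiple features
    idx = np.argsort(dists)[:k]              # indices of the k closest training points
    if not weighted:
        return y_train[idx].mean()
    w = 1.0 / (dists[idx] + 1e-12)           # closer neighbors get larger weights
    return np.sum(w * y_train[idx]) / np.sum(w)

# Usage on noisy non-linear data
rng = np.random.default_rng(6)
x_train = rng.uniform(0, 2 * np.pi, size=300)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=300)
print(knn_predict(x_train, y_train, x_query=1.0, k=10))                  # roughly sin(1.0)
print(knn_predict(x_train, y_train, x_query=1.0, k=10, weighted=True))
```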

40 Nearest Neighbor Regression
[Figure: error vs. number of neighbors for linear data (error) and non-linear data (log(error)), with curves for training-set sizes from 10 to 3000]


