Introduction to Predictive Modeling with Examples David A. Dickey

Cool vs. Nerdy:
“Analytics” = “Statistics”
“Predictive Modeling” = “Regression”

Part 1: Simple Linear Regression

If the Life Line is long and deep, this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality and the ability to overcome health problems. However, if the line is short and shallow, your life may have the tendency to be controlled by others.

Wilson & Mather JAMA 229 (1974) X=life line length Y=age at death Result: Predicted Age at Death = – 1.367(lifeline) (Is this “real”??? Is this repeatable???) proc sgplot; scatter Y=age X=line; reg Y=age X=line; run ;

We use LEAST SQUARES: the squared residuals sum to 9609.
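In symbols, least squares picks the intercept b0 and slope b1 that minimize the sum of squared residuals:

    \min_{b_0,\,b_1}\; \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2

For the life line data, that minimized sum is the 9609 quoted above.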

The “best” line is the one that minimizes the sum of squared residuals. It is best for this sample – but is it the true relationship for everyone? SAS PROC REG will compute it. What other lines might be the true line for everyone? Probably not the purple one. The red one has slope 0 (no effect). Is the red line unreasonable? Can we reject H0: slope is 0?

Simulation: Age at Death = (true intercept) + (true slope)(life line) + e, where the error e has a normal distribution with mean 0 and variance 200. Simulate 20 replicates with n = 50 bodies each.

NOTE: Regression equations, one per replicate:
Age(rep:1) = … + (…)*line
Age(rep:2) = … + (…)*line
(and so on through)
Age(rep:20) = … + (…)*line

The fitted intercepts and slopes bounce around from replicate to replicate, so the observed fit, Predicted Age at Death = … – 1.367(life line), would NOT be unusual even if there is no true relationship.
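A minimal SAS sketch of a simulation like this one. The seed, the life-line distribution, and the true intercept and slope (67 and 0) are illustrative assumptions, not values from the slides; only the error variance of 200, the 20 replicates, and n = 50 come from the slide.

    data sim;
      call streaminit(12345);                  /* assumed seed                   */
      do rep = 1 to 20;                        /* 20 replicates                  */
        do body = 1 to 50;                     /* n = 50 bodies each             */
          line = 7 + 5*rand('uniform');        /* hypothetical life line lengths */
          age  = 67 + 0*line                   /* assumed true line (slope 0)    */
                 + sqrt(200)*rand('normal');   /* errors: mean 0, variance 200   */
          output;
        end;
      end;
    run;

    proc reg data=sim noprint outest=fits;     /* one fitted line per replicate  */
      by rep;
      model age = line;
    run;

    proc print data=fits;
      var rep intercept line;                  /* the 20 estimated slopes vary   */
    run;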

Conclusion: estimated slopes vary. The standard deviation of the estimated slopes is the (estimated) “standard error.” Compute t = (estimate – hypothesized)/standard error. The p-value is the probability of a larger |t| when the hypothesis (e.g. slope 0) is correct; it is the sum of the two tail areas. Traditionally, p < 0.05 ⇒ conclude the hypothesized value is wrong; p > 0.05 is inconclusive.

[Figure: distribution of t under H0]
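In symbols (a restatement of the recipe above, with b1 the estimated slope, se(b1) its standard error, and T a t random variable with n – 2 degrees of freedom):

    t = \frac{b_1 - \beta_{1,0}}{\mathrm{se}(b_1)}, \qquad
    p = 2\,P\!\left(T_{n-2} > |t| \mid H_0\right)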

    proc reg data=life;
      model age = line;
    run;

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1       …            …             …       <.0001
line          1    –1.367          …             …       0.3965

Conclusion: insufficient evidence against the hypothesis of no linear relationship (P = 0.3965).

H0: Innocence              ↔   H0: True slope is 0 (no association)
H1: Guilt                  ↔   H1: True slope is not 0
Beyond reasonable doubt    ↔   P < 0.05

Simulation recap: Age at Death = (true intercept) + (true slope)(life line) + e, where e is normal with mean 0 and variance 200. WHY variance 200? We want an estimate of the variability around the true line, and 200 comes from the data, via sums of squared residuals (SS):
(1) The sum of squared residuals from the mean is SS(total) = 9755; the sum of squared residuals around the line is SS(error) = 9609, so SS(total) – SS(error) = SS(model) = 146.
(2) The variance estimate is SS(error)/(degrees of freedom) = 9609/48 ≈ 200.
(3) SS(model)/SS(total) is R², i.e. the proportion of variability “explained” by the model.

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        146             146         0.73    0.3965
Error             48       9609             200
Corrected Total   49       9755

Root MSE ≈ 14.1     R-Square ≈ 0.015
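Written out, the decomposition behind the table is:

    \underbrace{\sum_i (Y_i - \bar{Y})^2}_{SS(\mathrm{total})\,=\,9755}
      = \underbrace{\sum_i (\hat{Y}_i - \bar{Y})^2}_{SS(\mathrm{model})\,=\,146}
      + \underbrace{\sum_i (Y_i - \hat{Y}_i)^2}_{SS(\mathrm{error})\,=\,9609},
    \qquad
    R^2 = \frac{146}{9755} \approx 0.015,
    \qquad
    \hat{\sigma}^2 = \frac{9609}{50-2} \approx 200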

Part 2: Multiple Regression

Issues:
(1) Testing joint importance versus individual significance
(2) Prediction versus modeling individual effects
(3) Collinearity (correlation among inputs)

Example: a hypothetical company’s sales Y depend on TV advertising X1 and radio advertising X2:
Y = β0 + β1X1 + β2X2 + e
The two inputs are jointly critical (can’t omit both!!), yet neither is critical individually – a two-engine plane can still fly if engine #1 fails, and it can still fly if engine #2 fails.
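A minimal sketch of the joint test in SAS (the TEST statement is standard PROC REG syntax; the dataset and variable names follow the slides):

    proc reg data=sales;
      model sales = TV radio;
      joint: test TV = 0, radio = 0;   /* F test: can we omit BOTH inputs? */
    run;

If this joint F test rejects while neither individual t test does, the inputs are jointly critical but individually expendable, exactly like the two engines.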

    data sales;
      length sval $8;
      length cval $8;
      input store TV radio sales;
      (more code)
    cards;
    (more data)
    ;

    proc g3d data=sales;
      scatter radio*TV=sales / shape=sval color=cval zmin=8000;
    run;

[Figure: 3D scatter of sales against TV and radio]

[Figure: view along the P2 axis]

Conclusion: we can predict well with just TV, just radio, or both!

SAS code:
    proc reg data=next;
      model sales = TV radio;
    run;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         …               …           …      <.0001   <- (can’t omit both)
Error             37         …               …
Corrected Total   39         …

Root MSE = 213     R-Square = 0.95   <- explaining 95% of the variation in sales

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …          …
TV            1      …             …             …          …    <- (can omit TV)
radio         1      …             …             …          …    <- (can omit radio)

Estimated Sales = … + (…)TV + (…)radio, with error variance … (standard deviation 213). TV is approximately equal to radio, so approximately
Estimated Sales = … + (…)TV    or    Estimated Sales = … + (…)radio

Setting TV = radio (the approximate relationship):
Estimated Sales = … + (…)TV – is this the BEST TV line?
Estimated Sales = … + (…)radio – is this the BEST radio line?

    proc reg data=stores;
      model sales = TV;
      model sales = radio;
    run;

Model: sales = TV

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         …               …           …      <.0001
Error             38         …               …
Corrected Total   39         …

Root MSE = …     R-Square = …

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …          …
TV            1      …             …             …       <.0001

*********************************************************************

Model: sales = radio

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         …               …           …      <.0001
Error             38         …               …
Corrected Total   39         …

Root MSE = …     R-Square = …

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …          …
radio         1      …             …             …       <.0001

Sums of squares capture the variation explained by each variable.
Type I: how much does the variable explain when it is added to the model in sequence?
Type II: how much does it explain when all other variables are already present (as if it had been added last)?
A sketch of how to request both follows the two tables below.

Parameter Estimates (TV entered first)
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|   Type I SS   Type II SS
Intercept     1      …             …             …          …          …            …
TV            1      …             …             …          …          …            …
radio         1      …             …             …          …          …            …

***********************************************************************************

Parameter Estimates (radio entered first)
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|   Type I SS   Type II SS
Intercept     1      …             …             …          …          …            …
radio         1      …             …             …          …          …            …
TV            1      …             …             …          …          …            …
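The sketch referenced above (SS1 and SS2 are standard MODEL-statement options in PROC REG; with correlated inputs the Type I values depend on entry order, while the Type II values do not):

    proc reg data=sales;
      model sales = TV radio / ss1 ss2;   /* TV entered first    */
      model sales = radio TV / ss1 ss2;   /* radio entered first */
    run;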

Summary: good predictions are given by
Sales = … + (…) × TV + (…) × radio, or
Sales = … + (…) × TV, or
Sales = … + (…) × radio, or lots of others.

Why the confusion? The evil multicollinearity!! (correlated X’s)

Those Mysterious “Degrees of Freedom” (DF)

The first Martian gives information about the average height but 0 information about the variation. The 2nd Martian gives the first piece of information (DF) about the error variance around the mean. n Martians ⇒ n – 1 DF for error (variation).

Martian Height vs. Martian Weight: 2 points ⇒ no information on the variation of the errors; n points ⇒ n – 2 error DF.

How Many Table Legs? (regress Y on X1, X2)

Fit a plane ⇒ n – 3 (here 37) error DF (2 “model” DF, n – 1 = 39 “total” DF). Regress Y on X1, X2, …, X7 ⇒ n – 8 error DF (7 “model” DF, n – 1 “total” DF).

Source            DF   Sum of Squares   Mean Square
Model              2         …               …
Error             37         …               …
Corrected Total   39         …

Three legs will all touch the floor; the fourth leg gives the first chance to measure error (the first error DF).

Grades vs. IQ and Study Time

    data tests;
      input IQ Study_Time Grade;
      IQ_S = IQ*Study_Time;
    cards;
    …
    ;

    proc reg data=tests;
      model Grade = IQ;
    proc reg data=tests;
      model Grade = IQ Study_Time;
    run;

Model with IQ alone:
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …          …
IQ            1      …             …             …          …

Model with IQ and Study_Time:
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …          …
IQ            1      …             …             …          …
Study_Time    1      …             …             …          …

Contrast: TV advertising loses significance when radio is added; IQ gains significance when study time is added.

Model for grades: Predicted Grade = … + (…) × IQ + 2.10 × Study Time. Question: does an extra hour of study really deliver 2.10 points for everyone, regardless of IQ? The current model only allows this.

“Interaction” model:
Predicted Grade = 72.21 – 0.13 × IQ – 4.11 × Study Time + 0.053 × IQ × Study Time
                = (72.21 – 0.13 × IQ) + (–4.11 + 0.053 × IQ) × Study Time
IQ = 102 predicts Grade = (72.21 – 13.26) + (–4.11 + 5.41) × Study Time ≈ 58.95 + 1.30 × Study Time
IQ = 122 predicts Grade = (72.21 – 15.86) + (–4.11 + 6.47) × Study Time ≈ 56.35 + 2.36 × Study Time

    proc reg data=tests;
      model Grade = IQ Study_Time IQ_S;
    run;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         …               …           …        …
Error              …         …               …
Corrected Total    …         …

Root MSE = …     R-Square = …

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1    72.21           …             …          …
IQ            1    –0.13           …             …          …
Study_Time    1    –4.11           …             …          …
IQ_S          1     0.053          …             …          …
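The study-time slope as a function of IQ, worked out from the coefficients above (the 0.053 interaction coefficient is consistent with the two slopes quoted on the next slide):

    \frac{\partial\,\widehat{\mathrm{Grade}}}{\partial\,\mathrm{Study\ Time}}
      = -4.11 + 0.053\,\mathrm{IQ}
      = \begin{cases} 1.30 & \mathrm{IQ}=102 \\ 2.36 & \mathrm{IQ}=122 \end{cases}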

(1) Adding the interaction makes everything insignificant (individually)!
(2) Do we need to omit insignificant terms until only significant ones remain?
(3) Has an acquitted defendant proved his innocence?
(4) Common sense trumps statistics!

[Figure: the two fitted study-time lines, slope = 1.30 at IQ 102 and slope = 2.36 at IQ 122]

Part 3: Diagnosing Problems in Regression

The main problems are:
Multicollinearity (correlation among inputs)
Outliers

[Figure: TV $ versus radio $ with principal component axis 1 and principal component axis 2]

    proc corr data=sales;
      var TV radio sales;
    run;

Pearson Correlation Coefficients, N = 40
Prob > |r| under H0: Rho=0
           TV        radio     sales
TV         1.00000   …         …
                     <.0001    <.0001
radio      …         1.00000   …
           <.0001              <.0001
sales      …         …         1.00000
           <.0001    <.0001

Principal Components
(1) Center and scale the variables to mean 0, variance 1.
(2) Call these X1 (TV) and X2 (radio).
(3) With n variables the total variation is n (n = 2 here).
(4) Find the most variable linear combination P1 = __X1 + __X2.
The variances are … out of 2 along the P1 axis (standard deviation …) and … out of 2 along the P2 axis (standard deviation …). The ratio of the standard deviations (27.6) is the “condition number”; a large value ⇒ an unstable regression. Rule of thumb: a ratio of 1 is perfect, >30 is problematic. Here the spread on the long axis is 27.6 times that on the short axis.

Variance Inflation Factor
(1) Regress predictor i on all the others, getting its r-square Ri².
(2) VIF for variable i is 1/(1 – Ri²) (it measures collinearity).
(3) VIF > 10 is a problem.
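One way to compute the principal components themselves (a sketch; PROC PRINCOMP standardizes to mean 0 and variance 1 by default, matching steps (1)-(2) above):

    proc princomp data=sales out=pc;
      var TV radio;   /* eigenvalues = variances along the P1 and P2 axes */
    run;
    /* condition number = sqrt(largest eigenvalue / smallest eigenvalue) */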

Variance Inflation Factor, continued. Example:

    proc reg data=sales;
      model sales = TV radio / vif collinoint;
    run;

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept     1      …             …             …          …             0
TV            1      …             …             …          …             …
radio         1      …             …             …          …             …

Collinearity Diagnostics (intercept adjusted)
                                          --Proportion of Variation--
Number   Eigenvalue   Condition Index        TV            radio
1            …              …                 …              …
2            …              …                 …              …

We have a MAJOR problem! (Note: other diagnostics besides the VIF and condition number are available.)

Another problem: outliers. Example: add one point to the TV–radio data, with TV = 1021, radio = 954, sales = 9020.

    proc reg data=sales;
      model sales = TV radio / p r;
    run;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         …               …           …      <.0001
Error             38         …               …
Corrected Total   40         …

Root MSE = …     R-Square = …

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …          …
TV            1      …             …             …          …     ???????
radio         1      …             …             …         <.0001

[Partial output listing: dependent value, predicted value, residual, studentized residual, and Cook's D for each store, including the added store 41.]

[Figure: scatter plot of TV versus radio with the added store marked.]

The ordinary residual for store 41 is not too bad (…).

PRESS residuals:
(1) Remove store i, with sales Y(i).
(2) Fit the model to the other 40 stores.
(3) Get the model prediction P(i) for store i.
(4) The PRESS residual is Y(i) – P(i).

    proc reg data=raw;
      model sales = TV radio;
      output out=out1 r=r press=press;
    run;

[Figure: regular (O) and PRESS (dot) residuals, with store number 41 standing out; view along the P2 axis (2nd principal component).]
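A standard identity (not on the slides) connects the two kinds of residuals: with h_ii the leverage of point i, the PRESS residual equals the ordinary residual inflated by 1/(1 – h_ii), so no refitting is needed:

    e_{(i)} = Y_i - \hat{Y}_{(i)} = \frac{e_i}{1 - h_{ii}}

This is why an X-space outlier like store 41 can look innocuous in the ordinary residuals yet stand out in PRESS.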

Part 4: Classification Variables (dummy variables, indicator variables)

Model for NC crashes involving deer:
Predicted Accidents = b0 + b11 X11, where X11 is 1 in November and 0 elsewhere.
Interpretation: in November we predict b0 + b11(1) = b0 + b11; in any other month we predict b0 + b11(0) = b0. So b0 is the average of the other months and b11 is the added November effect (versus the average of the others).

    proc reg data=deer;
      model deer = X11;
    run;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         …               …           …      <.0001
Error              …         …               …
Corrected Total    …         …

Root MSE = …     R-Square = …

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …       <.0001
X11           1      …             …             …       <.0001
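A sketch of how the month dummies might be built (this assumes a numeric month variable, 1 through 12, which is not shown on the slides):

    data deer;
      set deer;               /* hypothetical: assumes variable 'month' (1-12) */
      array X{12} X1-X12;
      do m = 1 to 12;
        X{m} = (month = m);   /* indicator: 1 in month m, 0 elsewhere */
      end;
      drop m;
    run;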

It looks like December and October need dummies too!

    proc reg data=deer;
      model deer = X10 X11 X12;
    run;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         …               …           …      <.0001
Error              …         …               …
Corrected Total    …         …

Root MSE = …     R-Square = …

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1     929            …             …       <.0001
X10           1    1391            …             …       <.0001
X11           1    2830            …             …       <.0001
X12           1    1377            …             …       <.0001

The average of January through September is 929 crashes per month. Add 1391 in October, 2830 in November, and 1377 in December.

What the heck – let’s do all but one (we need an “average of the rest,” so we must leave out at least one).

    proc reg data=deer;
      model deer = X1 X2 … X10 X11;
    run;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             11         …               …           …      <.0001
Error              …         …               …
Corrected Total    …         …

Root MSE = …     R-Square = …

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …       <.0001
X1            1    –886            …             …       <.0001
X2 – X9       1      …             …             …       <.0001 (each)
X10           1      …             …             …       (not significant)
X11           1    1452            …             …       <.0001

The “average of the rest” is now just the December mean. Subtract 886 in January, add 1452 in November, and so on. October (X10) is not significantly different from December.

[Figure: residual plot, with regions of negative and positive residuals]

Add date (in SAS, days since January 1, 1960) to capture the trend.

    proc reg data=deer;
      model deer = date X1 X2 … X10 X11;
    run;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             12         …               …           …      <.0001
Error              …         …               …
Corrected Total    …         …

Root MSE = …     R-Square = …

Parameter Estimates
Variable     DF   Estimate   Standard Error   t Value   Pr > |t|
Intercept     1      …             …             …          …
X1 – X9       1      …             …             …       <.0001 (each)
X10           1      …             …             …       (not significant)
X11           1      …             …             …       <.0001
date          1     0.22           …             …       <.0001

The trend is 0.22 more accidents per day (about 1 more per 5 days) and is significantly different from 0.