Published by Isaac Cunningham. Modified over 9 years ago.
1
Topic 13: Multiple Linear Regression Example
2
Outline
Description of example; descriptive summaries; investigation of various models; conclusions.
3
Study of CS students
Too many computer science majors at Purdue were dropping out of the program. The aim was to find predictors of success that could be used in the admissions process. Predictors must be available at the time of entry into the program.
4
Data available
GPA after three semesters; overall high school math grade; overall high school science grade; overall high school English grade; SAT Math; SAT Verbal; gender (of interest for other reasons).
5
Data for CS Example
Y is the student's grade point average (GPA) after 3 semesters. The 3 HS grades and 2 SAT scores are the explanatory variables (p=6, counting the intercept). There are n=224 students.
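Written out, the model with all five explanatory variables has six regression parameters, which is where p = 6 comes from. The slide does not write the model explicitly, so the β and e symbols below are assumed textbook notation.

```latex
\[
\mathrm{GPA}_i = \beta_0 + \beta_1\,\mathrm{HSM}_i + \beta_2\,\mathrm{HSS}_i + \beta_3\,\mathrm{HSE}_i
               + \beta_4\,\mathrm{SATM}_i + \beta_5\,\mathrm{SATV}_i + e_i,
\qquad i = 1,\dots,224
\]
\[
p = 6, \qquad n - p = 224 - 6 = 218 \ \text{error degrees of freedom}
\]
```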
6
Descriptive Statistics
    data a1;
      infile 'C:\...\csdata.dat';
      input id gpa hsm hss hse satm satv genderm1;
    run;

    proc means data=a1 maxdec=2;
      var gpa hsm hss hse satm satv;
    run;
7
Output from Proc Means
Variable    N     Mean   Std Dev   Minimum   Maximum
gpa       224     2.64      0.78      0.12      4.00
hsm       224     8.32      1.64      2.00     10.00
hss       224     8.09      1.70      3.00     10.00
hse       224     8.09      1.51      3.00     10.00
satm      224   595.29     86.40    300.00    800.00
satv      224   504.55     92.61    285.00    760.00
8
Descriptive Statistics
    proc univariate data=a1;
      var gpa hsm hss hse satm satv;
      histogram gpa hsm hss hse satm satv / normal;
    run;
15
Correlations
    proc corr data=a1;
      var hsm hss hse satm satv;

    proc corr data=a1;
      var hsm hss hse satm satv;
      with gpa;
    run;
16
Output from Proc Corr
Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0 (shown in parentheses)

      gpa               hsm               hss               hse               satm              satv
gpa   1.00000           0.43650 (<.0001)  0.32943 (<.0001)  0.28900 (<.0001)  0.25171 (0.0001)  0.11449 (0.0873)
hsm   0.43650 (<.0001)  1.00000           0.57569 (<.0001)  0.44689 (<.0001)  0.45351 (<.0001)  0.22112 (0.0009)
hss   0.32943 (<.0001)  0.57569 (<.0001)  1.00000           0.57937 (<.0001)  0.24048 (0.0003)  0.26170 (<.0001)
hse   0.28900 (<.0001)  0.44689 (<.0001)  0.57937 (<.0001)  1.00000           0.10828 (0.1060)  0.24371 (0.0002)
satm  0.25171 (0.0001)  0.45351 (<.0001)  0.24048 (0.0003)  0.10828 (0.1060)  1.00000           0.46394 (<.0001)
satv  0.11449 (0.0873)  0.22112 (0.0009)  0.26170 (<.0001)  0.24371 (0.0002)  0.46394 (<.0001)  1.00000
17
Output from Proc Corr
Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0 (shown in parentheses)

      hsm               hss               hse               satm              satv
gpa   0.43650 (<.0001)  0.32943 (<.0001)  0.28900 (<.0001)  0.25171 (0.0001)  0.11449 (0.0873)

All but SATV are significantly correlated with GPA.
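The significance of each correlation can be checked by hand with the usual t statistic for H0: rho = 0, which has n - 2 = 222 degrees of freedom. The sketch below is in Python rather than SAS and is not part of the original slides; the 5% critical value of about 1.971 is an assumption taken from standard t tables, and the function names are illustrative.

```python
import math

N = 224  # sample size reported by PROC CORR

# Pearson correlations with gpa, copied from the output above
R_WITH_GPA = {"hsm": 0.43650, "hss": 0.32943, "hse": 0.28900,
              "satm": 0.25171, "satv": 0.11449}

# Approximate two-sided 5% critical value of t with 222 df
# (assumed from standard tables, not from the slides)
T_CRIT = 1.971


def corr_t_stat(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def significant_at_5pct(r, n=N, t_crit=T_CRIT):
    """True when |t| exceeds the assumed two-sided 5% critical value."""
    return abs(corr_t_stat(r, n)) > t_crit


if __name__ == "__main__":
    for name, r in R_WITH_GPA.items():
        t = corr_t_stat(r, N)
        print(f"{name:5s} r={r:.5f} t={t:6.2f} significant={significant_at_5pct(r)}")
```

Only satv falls below the critical value, which agrees with its reported P value of 0.0873.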
18
Scatter Plot Matrix
    proc corr data=a1 plots=matrix;
      var gpa hsm hss hse satm satv;
    run;
Allows a visual check of pairwise relationships.
19
No “strong” linear relationships. Can see the discreteness of the high school scores.
20
Use high school grades to predict GPA (Model #1)
    proc reg data=a1;
      model gpa=hsm hss hse;
    run;
21
Results Model #1
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.58988          0.29424      2.00     0.0462
hsm          1              0.16857          0.03549      4.75     <.0001
hss          1              0.03432          0.03756      0.91     0.3619
hse          1              0.04510          0.03870      1.17     0.2451

Root MSE          0.69984    R-Square    0.2046
Dependent Mean    2.63522    Adj R-Sq    0.1937
Coeff Var        26.55711

Meaningful??
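Each t value in the table is the estimate divided by its standard error, referred to a t distribution with n - p = 224 - 4 = 220 degrees of freedom. A quick check, writing s{b_k} for the standard error (notation assumed here, not taken from the slide):

```latex
\[
t^{*} = \frac{b_k}{s\{b_k\}}:\qquad
t_{\text{hsm}} = \frac{0.16857}{0.03549} \approx 4.75,\quad
t_{\text{hss}} = \frac{0.03432}{0.03756} \approx 0.91,\quad
t_{\text{hse}} = \frac{0.04510}{0.03870} \approx 1.17
\]
```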
22
ANOVA Table #1
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         27.71233       9.23744     18.86   <.0001
Error            220        107.75046       0.48977
Corrected Total  223        135.46279

Significant F test, but not all of the individual t tests are significant.
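The fit statistics reported for Model #1 all follow from the sums of squares in this table. The sketch below is in Python and is not part of the original SAS program; it recomputes them as a consistency check, and the function and key names are illustrative assumptions.

```python
import math


def anova_summary(ss_model, ss_error, df_model, df_error, y_mean):
    """Recompute the usual fit statistics from ANOVA sums of squares."""
    ss_total = ss_model + ss_error
    df_total = df_model + df_error
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    root_mse = math.sqrt(ms_error)
    return {
        "F": ms_model / ms_error,                          # overall F test
        "R2": ss_model / ss_total,                         # proportion of variation explained
        "AdjR2": 1.0 - ms_error / (ss_total / df_total),   # penalized for model size
        "RootMSE": root_mse,
        "CoeffVar": 100.0 * root_mse / y_mean,             # Root MSE as a percent of the mean
    }


if __name__ == "__main__":
    # Model #1 values copied from the ANOVA table and fit statistics above
    stats = anova_summary(27.71233, 107.75046, 3, 220, y_mean=2.63522)
    for key, value in stats.items():
        print(f"{key:8s} {value:.4f}")
```

Up to rounding, the results agree with the SAS output: F of about 18.86, R-Square 0.2046, Adj R-Sq 0.1937, Root MSE 0.69984.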
24
Remove HSS (Model #2)
    proc reg data=a1;
      model gpa=hsm hse;
    run;
25
Results Model #2
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.62423          0.29172      2.14     0.0335
hsm          1              0.18265          0.03196      5.72     <.0001
hse          1              0.06067          0.03473      1.75     0.0820

Root MSE          0.69958    R-Square    0.2016
Dependent Mean    2.63522    Adj R-Sq    0.1943
Coeff Var        26.54718

Slightly better MSE and adjusted R-Sq.
26
ANOVA Table #2
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         27.30349      13.65175     27.89   <.0001
Error            221        108.15930       0.48941
Corrected Total  223        135.46279

Significant F test, but not all of the individual t tests are significant.
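Comparing ANOVA tables #1 and #2 shows how little is lost by dropping HSS. As a sketch, take Model #1 as the full model (F) and Model #2 as the reduced model (R); the SSE(F), SSE(R) notation is assumed here rather than taken from the slides. The resulting general linear test statistic matches the square of the HSS t value from Model #1, as it should when a single term is dropped.

```latex
\[
F^{*} = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div MSE(F)
      = \frac{108.15930 - 107.75046}{221 - 220} \div 0.48977
      = \frac{0.40884}{0.48977} \approx 0.83
\]
\[
t_{\text{hss}}^{2} = \left(\frac{0.03432}{0.03756}\right)^{2} \approx 0.83
\]
```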
27
Rerun with HSM only (Model #3)
    proc reg data=a1;
      model gpa=hsm;
    run;
28
Results Model #3
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              0.90768          0.24355      3.73     0.0002
hsm          1              0.20760          0.02872      7.23     <.0001

Root MSE          0.70280    R-Square    0.1905
Dependent Mean    2.63522    Adj R-Sq    0.1869
Coeff Var        26.66958

Slightly worse MSE and adjusted R-Sq.
29
ANOVA Table #3
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         25.80989      25.80989     52.25   <.0001
Error            222        109.65290       0.49393
Corrected Total  223        135.46279

Significant F test, and all of the individual t tests are significant.
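With a single explanatory variable, two familiar identities tie this output back to the earlier results: R-Square is the squared Pearson correlation between gpa and hsm, and the overall F statistic is the square of the hsm t statistic. A quick numerical check using the reported values:

```latex
\[
R^{2} = r_{\text{gpa},\text{hsm}}^{2} = (0.43650)^{2} \approx 0.1905
\]
\[
F = t^{2} = \left(\frac{0.20760}{0.02872}\right)^{2} \approx (7.2284)^{2} \approx 52.25
\]
```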
31
SATs (Model #4)
    proc reg data=a1;
      model gpa=satm satv;
    run;
32
Results Model #4
Root MSE          0.75770    R-Square    0.0634
Dependent Mean    2.63522    Adj R-Sq    0.0549
Coeff Var        28.75287

Much worse MSE and adjusted R-Sq.

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           1.28868          0.37604         3.43     0.0007
satm         1           0.00228          0.00066291      3.44     0.0007
satv         1          -0.00002456       0.00061847     -0.04     0.9684
33
ANOVA Table #4
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2          8.58384       4.29192      7.48   0.0007
Error            221        126.87895       0.57411
Corrected Total  223        135.46279

Significant F test, but not all of the individual t tests are significant.
34
HS and SATs (Model #5)
    proc reg data=a1;
      model gpa=satm satv hsm hss hse;
      * Does general linear test;
      sat: test satm, satv;
      hs: test hsm, hss, hse;
    run;
35
Results Model #5
Root MSE          0.70000    R-Square    0.2115
Dependent Mean    2.63522    Adj R-Sq    0.1934
Coeff Var        26.56311

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           0.32672          0.40000         0.82     0.4149
hsm          1           0.14596          0.03926         3.72     0.0003
hss          1           0.03591          0.03780         0.95     0.3432
hse          1           0.05529          0.03957         1.40     0.1637
satm         1           0.00094359       0.00068566      1.38     0.1702
satv         1          -0.00040785       0.00059189     -0.69     0.4915
36
Test sat
Test sat Results for Dependent Variable gpa
Source         DF   Mean Square   F Value   Pr > F
Numerator       2       0.46566      0.95   0.3882
Denominator   218       0.49000

Cannot reject the reduced model: no significant information is lost, so we don't need the SAT variables.
37
Test hs
Test hs Results for Dependent Variable gpa
Source         DF   Mean Square   F Value   Pr > F
Numerator       3       6.68660     13.65   <.0001
Denominator   218       0.49000

Reject the reduced model: significant information is lost, so we can't remove the HS variables from the model.
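Both TEST statements are general linear tests, so their F values can be reproduced from the error sums of squares of the earlier models. The sketch below is in Python and is not part of the original SAS program. It assumes the reduced model for the sat test is Model #1 and for the hs test is Model #4. The slides do not print the full-model SSE, so it is rebuilt from the reported MSE and error df, which introduces small rounding differences.

```python
def general_linear_test(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for H0: the extra terms in the full model are all zero.

    Returns (numerator mean square, denominator mean square, F).
    """
    ms_numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    ms_denominator = sse_full / df_full
    return ms_numerator, ms_denominator, ms_numerator / ms_denominator


# Full model (Model #5): SSE is not printed on the slides, so it is
# rebuilt here from the reported MSE (0.49000) and error df (218).
DF_FULL = 218
SSE_FULL = 0.49000 * DF_FULL

if __name__ == "__main__":
    # "sat" test: the reduced model drops satm and satv (Model #1)
    print("sat:", general_linear_test(107.75046, 220, SSE_FULL, DF_FULL))
    # "hs" test: the reduced model drops hsm, hss and hse (Model #4)
    print("hs: ", general_linear_test(126.87895, 221, SSE_FULL, DF_FULL))
```

The two F values come out near 0.95 and 13.65, in line with the SAS output above.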
38
Best Model?
Likely the one with just HSM, or the one with HSE and HSM. We'll discuss comparison methods in Chapters 7 and 8.
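One simple way to line the five models up side by side is adjusted R-Square, which can be recovered from each reported R-Square, n = 224, and the number of parameters p. The sketch below is in Python and is not part of the original slides. It is only one criterion, not the comparison methods the course covers later, and because it starts from rounded R-Square values the last digit can differ from the SAS output.

```python
N = 224  # students

# (label, R-Square as reported, number of parameters p including the intercept)
MODELS = [
    ("#1 hsm hss hse", 0.2046, 4),
    ("#2 hsm hse", 0.2016, 3),
    ("#3 hsm", 0.1905, 2),
    ("#4 satm satv", 0.0634, 3),
    ("#5 all five", 0.2115, 6),
]


def adjusted_r2(r2, n, p):
    """Adjusted R-Square: penalizes R-Square for the number of parameters p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)


def rank_models(models=MODELS, n=N):
    """Return (label, R2, adjusted R2) rows, best adjusted R2 first."""
    rows = [(label, r2, adjusted_r2(r2, n, p)) for label, r2, p in models]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for label, r2, adj in rank_models():
        print(f"{label:16s} R2={100 * r2:5.2f}%  AdjR2={adj:.4f}")
```

On this criterion Model #2 (HSM and HSE) ranks first, narrowly ahead of Models #1 and #5, with the SAT-only model far behind.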
39
Key ideas from case study
First, look at graphical and numerical summaries one variable at a time. Then, look at relationships between pairs of variables with graphical and numerical summaries. Use plots and correlations to understand relationships.
40
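Key ideas from case study
The relationship between a response variable and an explanatory variable depends on what other explanatory variables are in the model. A variable can be a significant (P < 0.05) predictor alone and not significant (P > 0.5) when other X's are in the model. Here, HSS is significantly correlated with GPA on its own (P < .0001) but has P = 0.3619 in Model #1, alongside HSM and HSE.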
41
Key ideas from case study
Regression coefficients, standard errors, and the results of significance tests depend on what other explanatory variables are in the model.
42
Key ideas from case study
Significance tests (P values) do not tell the whole story. Squared multiple correlations (which give the proportion of variation in the response variable explained by the explanatory variables) can give a different view. We often express R² as a percent.
43
Key ideas from case study
You can fully understand the theory in terms of Y = Xβ + e. However, to use this methodology effectively in practice, you need to understand how the data were collected, the nature of the variables, and how they relate to each other.
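For the five-predictor model in this case study, the matrix form has the dimensions shown below. The least squares estimator b is the standard textbook result, stated here as a reminder under the assumption that X'X is invertible; it does not appear on the slides.

```latex
\[
\underset{224 \times 1}{\mathbf{Y}}
  = \underset{224 \times 6}{\mathbf{X}}\;\underset{6 \times 1}{\boldsymbol{\beta}}
  + \underset{224 \times 1}{\mathbf{e}},
\qquad
\mathbf{b} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Y}
\]
```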
44
Background Reading
Cs2.sas contains the SAS commands used in this topic.