Shibin Liu SAS Beijing R&D


Regression

Agenda
0. Lesson Overview
1. Exploratory Data Analysis
2. Simple Linear Regression
3. Multiple Regression
4. Model Building and Interpretation
5. Summary


Lesson Overview: response variable, predictor variable, and ANOVA

Lesson Overview: correlation analysis and linear regression for continuous variables

Lesson Overview: Correlation Analysis (continuous response, continuous predictor)
- Measure linear association
- Examine the relationship
- Screen for outliers
- Interpret the correlation

Lesson Overview: Linear Regression (continuous response, continuous predictor)
- Define the linear association
- Determine the equation for the line
- Explain or predict variability

Lesson Overview: Choosing an Analysis. What do you want to examine?
- The location, spread, and shape of the data's distribution → summary statistics and distribution analysis (descriptive statistics, histograms, normal probability plots) — Lesson 1
- The difference between groups on one or more variables → t-test for two groups, analysis of variance (linear models) for two or more — Lesson 2
- The relationship between variables, continuous only → correlations and linear regression — Lessons 3 & 4
- The relationship between variables, categorical response variable → one-way frequencies and table analysis (frequency tables, chi-square test), logistic regression — Lesson 5

Agenda: 1. Exploratory Data Analysis

Exploratory Data Analysis: Introduction
Two continuous variables, such as Height and Weight, can be explored with a scatter plot, correlation analysis, and linear regression.

Exploratory Data Analysis: Objectives
- Examine the relationship between continuous variables using a scatter plot
- Quantify the degree of association between two continuous variables using correlation statistics
- Avoid potential misuses of the correlation coefficient
- Obtain Pearson correlation coefficients

Exploratory Data Analysis: Using Scatter Plots to Describe Relationships between Continuous Variables
A scatter plot shows the relationship and trend between X (the predictor variable) and Y (the response variable); each coordinate is a pair of X and Y values. Use it to examine the range of the data, screen for outliers, and communicate analysis results.

Exploratory Data Analysis: Using Scatter Plots to Describe Relationships between Continuous Variables
A curved pattern in the scatter plot suggests adding a squared (quadratic) term to the model.

Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables
Correlation analysis measures linear association, which can be negative, zero, or positive.

Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables
Pearson correlation coefficient:
- For the population: ρ = Cov(X, Y) / (σX σY)
- For a sample: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables
The Pearson correlation coefficient r ranges from −1 to +1: values near −1 indicate a strong negative linear relationship, values near 0 indicate no linear relationship, and values near +1 indicate a strong positive linear relationship.
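The course computes r with SAS tasks; as a language-neutral illustration, here is a minimal Python sketch of the sample formula above (the data values are invented for the example):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Invented height/weight pairs with a strong positive linear trend:
heights = [60, 62, 65, 68, 71]
weights = [115, 128, 139, 152, 171]
r = pearson_r(heights, weights)  # close to +1
```

A perfectly linear increasing relationship gives r = 1, and flipping the sign of the slope gives r = −1.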

Exploratory Data Analysis: Hypothesis Testing for a Correlation
Correlation coefficient test: H0: ρ = 0 versus Ha: ρ ≠ 0, where ρ is the population parameter and r is the sample statistic.
- A p-value does not measure the magnitude of the association.
- Sample size affects the p-value: a small p-value can occur (as with many statistics) simply because of a very large sample size. Even a correlation coefficient of 0.01 can be statistically significant with a large enough sample.
- Rejecting the null hypothesis only means that you can be confident the true population correlation is not 0. Therefore, it is important to also look at the value of r itself to see whether it is meaningfully large.
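To see why sample size drives the p-value, recall that the test statistic for H0: ρ = 0 is t = r·√(n − 2)/√(1 − r²) with n − 2 degrees of freedom. This illustrative Python sketch (not part of the course materials) shows the statistic for a tiny r = 0.01 growing with n, so the same negligible correlation becomes "significant" at huge sample sizes:

```python
import math

def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Same r = 0.01, very different evidence against H0:
t_small = corr_t_stat(0.01, 100)        # well below any usual critical value
t_large = corr_t_stat(0.01, 1_000_000)  # far beyond the ~2 cutoff
```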

Exploratory Data Analysis: Hypothesis Testing for a Correlation
[Example scatter plots with sample correlations r = 0.81 and r = 0.72.]

Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations — Cause and Effect
Correlation does not imply causation. A strong correlation between two variables (for example, Height and Weight) does not mean that a change in one variable causes the other variable to change, or vice versa. Besides causality, could other reasons account for a strong correlation? Sample correlation coefficients can be large because of chance, or because both variables are affected by other variables.


Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations — Cause and Effect
Example: X is the percent of students who take the SAT exam in each state, and Y is the state's SAT scores. There are many reasons for the varying participation rates: some states have lower participation because their students primarily take the rival ACT standardized test, while others have rules requiring even non-college-bound students to take the SAT. In low-participation states, often only the highest-performing students choose to take the SAT. Both variables are affected by these other factors, so the correlation says little about cause and effect.

Exploratory Data Analysis: Avoiding Common Errors — Types of Relationships
The Pearson correlation coefficient measures only linear relationships. A coefficient close to 0 indicates that there is no strong linear relationship between two variables, but it does not mean that there is no relationship of any kind: a curvilinear relationship (for example, parabolic or quadratic) can have r close to 0.

Exploratory Data Analysis: Avoiding Common Errors — Outliers
[Two data sets illustrating the effect of an outlier: r = 0.02 versus r = 0.82.]

Exploratory Data Analysis: Avoiding Common Errors — Outliers
What should you do with an outlier? First ask why it is an outlier. If it is a valid data point, compute two correlation coefficients (with and without it) and report both. If it is an error, correct it or collect the data again. You can also replicate data by collecting additional observations at a fixed value of x (in this case, x = 10) to determine whether the data point is truly unusual.

Exploratory Data Analysis: Scenario — Exploring Data Using Correlation and Scatter Plots
In exercise physiology, an objective measure of aerobic fitness is how efficiently the body can absorb and use oxygen (oxygen consumption). Subjects in the Fitness data participated in a predetermined exercise run of 1.5 miles. Oxygen consumption was measured, along with several other continuous variables such as age, pulse, and weight. The researchers are interested in determining whether any of these other variables can help predict oxygen consumption.

Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots 28

Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots What’s the Pearson correlation coefficient of Oxygen_Consumption with Run_Time? What’s the p-value for the correlation of Oxygen_Consumption with Performance? 29

Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots 30

Exploratory Data Analysis: Examining Correlations between Predictor Variables 31

Exploratory Data Analysis: Examining Correlations between Predictor Variables
What are the two highest Pearson correlation coefficients?

Exploratory Data Analysis — Question 1. The correlation between tuition and rate of graduation at U.S. colleges is 0.55. What does this mean?
a) The way to increase graduation rates at your college is to raise tuition
b) Increasing graduation rates is expensive, causing tuition to rise
c) Students who are richer tend to graduate more often than poorer students
d) None of the above
Answer: d

Agenda: 2. Simple Linear Regression

Simple Linear Regression: Introduction 35

Simple Linear Regression: Introduction
[Figure: correlations between −1 and +1 for variables A–D, illustrating linear relationships.]

Simple Linear Regression: Introduction
[Figure: data sets with the same r but different relationships.]

Simple Linear Regression: Introduction
Y is the variable of primary interest, and the regression line summarizes the linear relationship. Simple linear regression uses X, which explains variability in Y, to predict the Y values.

Simple Linear Regression: Objectives
- Explain the concepts of simple linear regression
- Fit a simple linear regression model using the Linear Regression task
- Produce predicted values and confidence intervals

Simple Linear Regression: Scenario — Performing Simple Linear Regression
In the Fitness data, use linear regression with Run_Time as the predictor of Oxygen_Consumption.

Simple Linear Regression: The Simple Linear Regression Model
Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε is the random error term.

Simple Linear Regression: The Simple Linear Regression Model — Question 2. What does epsilon (ε) represent?
a) The intercept parameter
b) The predictor variable
c) The variation of X around the line
d) The variation of Y around the line
Answer: d

Simple Linear Regression: How SAS Performs Linear Regression
The method of least squares minimizes the sum of squared errors, Σ(Yᵢ − Ŷᵢ)². The resulting estimates are Best Linear Unbiased Estimators (BLUE): they are unbiased and have minimum variance among linear unbiased estimators.
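For one predictor, the least squares estimates have the closed form b₁ = Sxy/Sxx and b₀ = ȳ − b₁x̄. A minimal Python sketch (illustrative; the data below are invented and noise-free, so the fit recovers the line exactly):

```python
def least_squares_fit(x, y):
    """Closed-form least squares estimates: intercept b0 and slope b1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx        # slope: Sxy / Sxx
    b0 = my - b1 * mx     # the fitted line passes through (x-bar, y-bar)
    return b0, b1

# Noise-free example: y = 3 + 2x is recovered exactly.
b0, b1 = least_squares_fit([1, 2, 3, 4], [5, 7, 9, 11])
```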

Simple Linear Regression: Measuring How Well a Model Fits the Data
Compare the regression model to a baseline model that predicts the mean of Y for every observation.

Simple Linear Regression: Comparing the Regression Model to a Baseline Model
- Explained (SSM): Σ(Ŷᵢ − Ȳ)²
- Unexplained (SSE): Σ(Yᵢ − Ŷᵢ)²
- Total (SST): Σ(Yᵢ − Ȳ)²
The baseline model is Ȳ; a better model explains more of the variability.
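A short Python sketch (illustrative; the data are invented) verifies the decomposition SST = SSM + SSE for a least squares fit and computes R² = SSM/SST, the proportion of variability explained by the model:

```python
def fit_and_decompose(x, y):
    """Fit y on x by least squares and return (SSM, SSE, SST)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    yhat = [b0 + b1 * xi for xi in x]
    ssm = sum((yh - my) ** 2 for yh in yhat)              # explained (SSM)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained (SSE)
    sst = sum((yi - my) ** 2 for yi in y)                 # total (SST)
    return ssm, sse, sst

ssm, sse, sst = fit_and_decompose([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
r_squared = ssm / sst  # proportion of variability explained
```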

Simple Linear Regression: Hypothesis Testing for Linear Regression
H0: β₁ = 0 (the model fits no better than the baseline) versus Ha: β₁ ≠ 0.

Simple Linear Regression: Assumptions of Simple Linear Regression
1. The mean of Y is linearly related to X.
2. Errors are normally distributed.
3. Errors have equal variances.
4. Errors are independent.

Simple Linear Regression: Performing Simple Linear Regression — Task > Regression > Linear Regression

Simple Linear Regression: Performing Simple Linear Regression — Task > Regression > Linear Regression (continued)

Simple Linear Regression: Performing Simple Linear Regression — Question 3. In the model Y = X, if the parameter estimate (slope) of X is 0, which of the following is the best guess (predicted value) for Y when X equals 13?
a) 13
b) The mean of Y
c) A random number
d) The mean of X
Answer: b

Simple Linear Regression: Confidence and Prediction Intervals
- A 95% confidence interval for the mean says that you are 95% confident your interval contains the true population mean of Y for a particular X.
- Confidence intervals become wider as you move away from the mean of the independent variable, reflecting the fact that your estimates become more variable as you move away from the means of X and Y.
- For prediction: a 95% prediction interval is one that you are 95% confident contains a new observation. Prediction intervals are wider than confidence intervals because single observations have more variability than sample means.
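Both behaviors can be sketched numerically. In this illustrative Python example, `s` stands for the root MSE and `t_crit = 2.0` is an assumed stand-in for the exact t critical value (an assumption for illustration, not the real quantile):

```python
import math

def interval_halfwidths(x, s, x0, t_crit=2.0):
    """Half-widths of the CI for the mean and the PI for a new observation at x0.

    s ~ root MSE; t_crit ~ 2.0 approximates the 97.5th t quantile (assumed).
    """
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    leverage = 1 / n + (x0 - mx) ** 2 / sxx
    ci = t_crit * s * math.sqrt(leverage)       # confidence interval for the mean
    pi = t_crit * s * math.sqrt(1 + leverage)   # prediction interval for a new Y
    return ci, pi

x = [9, 10, 11, 12, 13]                         # invented Run_Time-like values
ci_at_mean, pi_at_mean = interval_halfwidths(x, s=1.0, x0=11)
ci_far, _ = interval_halfwidths(x, s=1.0, x0=13)
```

The prediction interval is wider than the confidence interval at the same x0 (the extra 1 under the square root), and both widen as x0 moves away from x̄.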

Simple Linear Regression: Confidence and Prediction Intervals — Question 4. Suppose you have a 95% confidence interval around the mean. How do you interpret it?
a) The probability is .95 that the true population mean of Y for a particular X is within the interval.
b) You are 95% confident that a newly sampled value of Y for a particular X is within the interval.
c) You are 95% confident that your interval contains the true population mean of Y for a particular X.
Answer: c

Simple Linear Regression: Confidence and Prediction Intervals 53

Simple Linear Regression: Confidence and Prediction Intervals 54

Simple Linear Regression: Producing Predicted Values of the Response Variable

data Need_Predictions;
   input Runtime @@;
   datalines;
9 10 11 12 13
;
run;

Simple Linear Regression: Producing Predicted Values of the Response Variable (continued)

Simple Linear Regression: Producing Predicted Values of the Response Variable

Agenda: 3. Multiple Regression


Multiple Regression: Introduction
Simple linear regression relates a response variable to one predictor variable; multiple linear regression uses more than one predictor: Y = β₀ + β₁X₁ + … + β_kX_k + ε.

Multiple Regression: Introduction
Simple linear regression fits a line; multiple linear regression, Y = β₀ + β₁X₁ + … + β_kX_k + ε, fits a plane when k = 2 (and a hyperplane for larger k).

Multiple Regression: Objectives
- Explain the mathematical model for multiple regression
- Describe the main advantage of multiple regression versus simple linear regression
- Explain the standard output from the Linear Regression task
- Describe common pitfalls of multiple linear regression

Multiple Regression: Advantages and Disadvantages of Multiple Regression
Multiple linear regression can model a response with several predictors (advantage), but with 7 candidate predictors there are 127 possible models containing at least one predictor, and larger models are more complex to interpret (disadvantages).

Multiple Regression: Picturing the Model for Multiple Regression
The model Y = β₀ + β₁X₁ + … + β_kX_k + ε has k + 1 parameters, including β₀. If β₁ = β₂ = 0, the model reduces to the flat surface Y = β₀; if β₁ ≠ 0 or β₂ ≠ 0, the response surface tilts along that predictor.

Multiple Regression: Picturing the Model for Multiple Regression — Y = β₀ + β₁X₁ + … + β_kX_k + ε

Multiple Regression: Common Applications
Multiple linear regression is a powerful tool for the following tasks:
- Prediction: developing a model to predict future values of a response variable (Y) based on its relationships with other predictor variables (Xs)
- Analytical or explanatory analysis: developing an understanding of the relationships between the response variable and the predictor variables
Myers (1999) refers to four applications of regression: prediction, variable screening, model specification, and parameter estimation.

Multiple Regression: Analysis versus Prediction in Multiple Regression
For prediction, the terms in the model, the values of their coefficients, and their statistical significance are of secondary importance. The focus is on producing a model that is best at predicting future values of Y as a function of the Xs: Ŷ = β̂₀ + β̂₁X₁ + … + β̂_kX_k.

Multiple Regression: Analysis versus Prediction in Multiple Regression
For analytical or explanatory analysis, the focus is on understanding the relationship between the dependent variable and the independent variables. Consequently, the statistical significance of the coefficients is important, as well as their magnitudes and signs.

Multiple Regression: Hypothesis Testing for Multiple Regression
For Y = β₀ + β₁X₁ + … + β_kX_k + ε:
- H0: β₁ = β₂ = … = β_k = 0 — the regression model does not fit the data better than the baseline model.
- Ha: at least one βᵢ ≠ 0 — the regression model does fit the data better than the baseline model.

Multiple Regression: Hypothesis Testing for Multiple Regression — Question 4. Match each item on the left with a) Reject the null hypothesis or b) Fail to reject the null hypothesis:
- At least one slope of the regression in the population is not 0, and at least one predictor variable explains a significant amount of variability in the response → a
- No predictor variable explains a significant amount of variability in the response variable → b
- The estimated linear regression model does not fit the data better than the baseline model → b

Multiple Regression: Assumptions for Multiple Regression
1. The mean of Y is linearly related to the Xs.
2. Errors are normally distributed.
3. Errors have equal variances.
4. Errors are independent.

Multiple Regression: Scenario — Using Multiple Regression to Explain Oxygen Consumption, adding predictors such as Age and Performance.

Multiple Regression: Adjusted R²
Adjusted R² = 1 − (n − i)(1 − R²) / (n − p), where:
- i = 1 if there is an intercept and 0 otherwise
- n = the number of observations used to fit the model
- p = the number of parameters in the model
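A minimal Python sketch of this formula (the numbers below are invented) shows how adjusted R² penalizes an extra parameter that adds nothing to R², which is why it is preferred over plain R² when comparing models of different sizes:

```python
def adjusted_r_squared(r2, n, p, intercept=True):
    """Adjusted R^2 = 1 - (n - i)(1 - R^2) / (n - p),
    with i = 1 if the model has an intercept, p = number of parameters."""
    i = 1 if intercept else 0
    return 1 - (n - i) * (1 - r2) / (n - p)

# Adding a useless parameter (p: 3 -> 4) with no gain in R^2
# lowers the adjusted R^2, penalizing the larger model.
adj_small = adjusted_r_squared(0.80, n=31, p=3)
adj_large = adjusted_r_squared(0.80, n=31, p=4)
```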

Multiple Regression: Performing Multiple Linear Regression 74

Multiple Regression: Performing Multiple Linear Regression What’s the p-value of the overall model? Should we reject the null hypothesis or not? Based on our evidence, do we reject the null hypothesis that the parameter estimate is 0? 75

Multiple Regression: Performing Multiple Linear Regression
What happens when both Performance and RunTime are used together to predict Oxygen_Consumption?

Multiple Regression: Performing Multiple Linear Regression
Performance and RunTime are strongly correlated (|r| = 0.82049), a sign of collinearity between predictors.

Agenda: 4. Model Building and Interpretation

Model Building and Interpretation : Introduction Age Performance 79

Model Building and Interpretation : Introduction ? 80

Model Building and Interpretation: Introduction
- Stepwise selection methods: forward, backward, stepwise ('no selection' is the default)
- All-possible regressions, ranked by criteria: R², adjusted R², Mallows' Cp

Model Building and Interpretation: Objectives
- Explain the Linear Regression task options for model selection
- Describe model selection options and interpret output to evaluate the fit of several models

Model Building and Interpretation: Approaches to Selecting Models
Manually fitting and comparing models, starting from the full model, quickly becomes impractical.

Model Building and Interpretation: SAS and Automated Approaches to Modeling
- Stepwise selection methods: forward, backward, stepwise ('no selection' is the default)
- All-possible regressions, ranked by criteria: R², adjusted R², Mallows' Cp
Run all methods, look for commonalities, and narrow down the candidate models.

Model Building and Interpretation: The All-Possible Regressions Approach to Model Building
With the predictor variables in the Fitness data, there are 128 possible models.

Model Building and Interpretation: Evaluating Models Using Mallows' Cp Statistic
Mallows' Cp balances model bias from under-fitting against the cost of over-fitting.

Model Building and Interpretation: Evaluating Models Using Mallows' Cp Statistic
Let p be the number of parameters in the model, including the intercept.
- For parameter estimation, Hocking's criterion: Cp <= 2p − p_full + 1
- For prediction, Mallows' criterion: Cp <= p

Model Building and Interpretation: Viewing Mallows' Cp Statistic
In the Linear Regression task's partial output, p equals the number of variables in the model plus 1 (for the intercept).

Model Building and Interpretation: Viewing Mallows' Cp Statistic
Mallows' criterion: Cp <= p. In this output, how many models have a value of Cp that is less than or equal to p? Which of these models has the fewest parameters?

Model Building and Interpretation: Viewing Mallows' Cp Statistic
First of all, what is p for the full model? Here p_full = 8 (7 variables + 1 intercept). Hocking's criterion is Cp <= 2p − p_full + 1; for example, a model with p = 6 must satisfy Cp <= 12 − 8 + 1 = 5. How many models meet Hocking's criterion for parameter estimation?
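The two cutoffs can be written as simple checks. In this Python sketch the candidate model's values are hypothetical, with p_full = 8 as in the slides:

```python
def meets_mallows(cp, p):
    """Mallows' criterion (prediction): Cp <= p."""
    return cp <= p

def meets_hocking(cp, p, p_full):
    """Hocking's criterion (parameter estimation): Cp <= 2p - p_full + 1."""
    return cp <= 2 * p - p_full + 1

# Hypothetical candidate: p = 6 parameters, Cp = 4.7; full model p_full = 8.
ok_pred = meets_mallows(4.7, p=6)           # 4.7 <= 6
ok_est = meets_hocking(4.7, p=6, p_full=8)  # 4.7 <= 12 - 8 + 1 = 5
```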

Model Building and Interpretation: Viewing Mallows' Cp Statistic — Question 5. What happens when you use the all-possible regressions method? Select all that apply.
a) You compare the R-square, adjusted R-square, and Cp statistics to evaluate the models.
b) SAS computes all possible models.
c) You choose a selection method (stepwise, forward, or backward).
d) SAS ranks the results.
e) You cannot reduce the number of models in the output.
f) You can produce a plot to help identify models that satisfy criteria for the Cp statistic.
Answers: a, b, d, f

Model Building and Interpretation: Viewing Mallows' Cp Statistic — Question 6. Match the items:
- Preferred over R-square for evaluating multiple linear regression models (takes the number of terms in the model into account) → c) adjusted R-square
- Useful for parameter estimation → b) Hocking's criterion for Cp
- Useful for prediction → a) Mallows' criterion for Cp

Model Building and Interpretation : Using Automatic Model Selection 93

Model Building and Interpretation : Using Automatic Model Selection 94

Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models – Prediction model 95

Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models –Explanatory Model 96

Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models 97

Model Building and Interpretation: The Stepwise Selection Approach to Model Building
With the stepwise selection methods (forward, backward, stepwise), we do not need to create all possible models.

Model Building and Interpretation: The Stepwise Selection Approach — Forward
Forward selection starts with no variables, then repeatedly adds the most significant variable until no remaining variable is significant. A variable that has been added is never removed, even if it becomes insignificant later.

Model Building and Interpretation: The Stepwise Selection Approach — Backward
Backward selection starts with all variables in the model, then repeatedly removes the least significant variable until all remaining variables are significant. Once a variable is removed, it cannot re-enter.

Model Building and Interpretation: The Stepwise Selection Approach — Stepwise
Stepwise selection combines the ideas of forward and backward selection. It starts with no variables and adds the most significant variable, as in forward selection; however, like backward selection, it can also drop insignificant variables, one at a time. The method stops when all terms in the model are significant and no term outside the model would be significant if added.
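A rough Python sketch of the forward step follows. The real method adds variables by an entry significance level (partial F-test p-values); this simplified version adds the variable that most reduces SSE and stops when the improvement is small. All data and names are invented for illustration:

```python
def solve(a, b):
    """Solve the small square system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c and m[c][c] != 0:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sse_for(cols, y):
    """SSE of the least squares fit of y on an intercept plus the given columns."""
    X = [[1.0] + [col[i] for col in cols] for i in range(len(y))]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def forward_select(candidates, y, min_drop=1.0):
    """Greedy forward selection: add the variable that most reduces SSE,
    stopping when the best reduction falls below min_drop (a stand-in
    for the usual entry significance test)."""
    chosen, cols = [], []
    best_sse = sse_for([], y)
    while True:
        trials = [(sse_for(cols + [v], y), name, v)
                  for name, v in candidates.items() if name not in chosen]
        if not trials:
            break
        new_sse, name, v = min(trials)
        if best_sse - new_sse < min_drop:
            break
        chosen.append(name)
        cols.append(v)
        best_sse = new_sse
    return chosen

# Invented data: y is driven by x1; x2 and x3 add almost nothing.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
y = [3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 15.1, 16.8]
selected = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
```

On these data the method picks x1 and then stops, since neither x2 nor x3 reduces the residual SSE enough to enter.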

Model Building and Interpretation: The Stepwise Selection Approach — Application
Use the stepwise selection methods (forward, backward, stepwise) to identify candidate models, then use subject-matter expertise to choose among them.

Model Building and Interpretation : Performing Stepwise Regression: Forward selection 103

Model Building and Interpretation : Performing Stepwise Regression: Forward selection 104

Model Building and Interpretation : Performing Stepwise Regression: Backward selection 105

Model Building and Interpretation : Performing Stepwise Regression: Backward selection 106

Model Building and Interpretation : Performing Stepwise Regression: Stepwise selection 107

Model Building and Interpretation : Performing Stepwise Regression: Stepwise selection 108

Model Building and Interpretation: Using Alternative Significance Criteria for Stepwise Models
Compare the stepwise regression models chosen with the default significance levels against those chosen using a 0.05 significance level.

Model Building and Interpretation: Comparison of Selection Methods
- Stepwise selection methods use fewer computer resources.
- All-possible regressions generate more candidate models, which might have nearly equal R² and Cp statistics.

Agenda
0. Lesson overview
1. Exploratory Data Analysis
2. Simple Linear Regression
3. Multiple Regression
4. Model Building and Interpretation
5. Summary

Home Work: Exercise 1
1.1 Describing the Relationship between Continuous Variables
Percentage of body fat, age, weight, height, and 10 body circumference measurements (for example, abdomen) were recorded for 252 men. The data are stored in the BodyFat2 data set. Body fat, one measure of health, was accurately estimated by an underwater weighing technique. There are two measures of percentage body fat in this data set.
Case: case number
PctBodyFat1: percent body fat using Brozek's equation, 457/Density - 414.2
PctBodyFat2: percent body fat using Siri's equation, 495/Density - 450
Density: density (gm/cm^3)
Age: age (yrs)
Weight: weight (lbs)
Height: height (inches)


Home Work: Exercise 1
1.1 Describing the Relationship between Continuous Variables
Generate scatter plots and correlations for the variables Age, Weight, Height, and the circumference measures versus the variable PctBodyFat2.
Important! The Correlation task limits you to 10 variables at a time for scatter plot matrices, so for this exercise, look at the relationships with Age, Weight, and Height separately from the circumference variables (Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist). Note: Correlation tables can be created using more than 10 VAR variables at a time.
Which variable has the highest correlation with PctBodyFat2? What is the value of the coefficient? Is the correlation statistically significant at the 0.05 level? Can straight lines adequately describe the relationships? Are there any outliers that you should investigate?
Generate correlations of the variables Age, Weight, and Height with one another, and of the circumference measures with one another. Are there any notable relationships?
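In code form, one way to produce these plots and correlations is PROC CORR, splitting the predictors into batches of at most 10 VAR variables (a sketch; the exercise can equally be done through the Correlation task):

```sas
/* Scatter plot matrix and correlations of Age, Weight, Height
   versus PctBodyFat2. */
proc corr data=BodyFat2 plots=matrix;
   var Age Weight Height;
   with PctBodyFat2;
run;

/* Same for the circumference measures (10 variables, the limit
   for a scatter plot matrix). */
proc corr data=BodyFat2 plots=matrix;
   var Neck Chest Abdomen Hip Thigh Knee Ankle Biceps Forearm Wrist;
   with PctBodyFat2;
run;

/* Correlations of the predictors with one another (no WITH
   statement, so a full correlation table is produced). */
proc corr data=BodyFat2;
   var Age Weight Height;
run;
```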

Home Work: Exercise 2
2.1 Fitting a Simple Linear Regression Model
Use the BodyFat2 data set for this exercise:
Perform a simple linear regression with PctBodyFat2 as the response variable and Weight as the predictor. What is the value of the F statistic and the associated p-value? How would you interpret this with regard to the null hypothesis?
Write the predicted regression equation.
What is the value of the R2 statistic? How would you interpret this?
Produce predicted values for PctBodyFat2 when Weight is 125, 150, 175, 200, and 225 (the SAS code below creates the needed data sets). What are the predicted values? What is the predicted value of PctBodyFat2 when Weight is 150?

data BodyFat2;
   set sasuser.BodyFat;
   if Height=29.5 then Height=69.5;
run;

data BodyFatToScore;
   input Weight @@;
   datalines;
125 150 175 200 225
;
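One common way to produce these predictions (a sketch; the data set names ToScore and Scored and the variable PredPctBodyFat2 are hypothetical): append the scoring data to the analysis data, so that the rows with a missing response do not influence the fit but still receive predicted values.

```sas
/* Combine the analysis data with the weights to score;
   PctBodyFat2 is missing in BodyFatToScore. */
data ToScore;
   set BodyFat2 BodyFatToScore;
run;

/* Fit the simple linear regression and write predictions. */
proc reg data=ToScore;
   model PctBodyFat2 = Weight;
   output out=Scored p=PredPctBodyFat2;
run;
quit;

/* Show predictions for the scored rows only. */
proc print data=Scored(where=(PctBodyFat2=.));
   var Weight PredPctBodyFat2;
run;
```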

Home Work: Exercise 3
3.1 Perform a Multiple Regression
Using the BodyFat2 data set, run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.
Compare the ANOVA table with that from the model with only Weight in the previous exercise. What is different?
How do the R2 and the adjusted R2 compare with these statistics for the Weight-only regression? Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change?
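The full model in this exercise can be fit with a single MODEL statement (sketch):

```sas
/* Multiple regression of PctBodyFat2 on all 13 predictors. */
proc reg data=BodyFat2;
   model PctBodyFat2 = Age Weight Height Neck Chest Abdomen Hip
                       Thigh Knee Ankle Biceps Forearm Wrist;
run;
quit;
```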

Home Work: Exercise 3
3.2 Simplifying the Model
Rerun the model in the previous exercise, but eliminate the variable with the highest p-value. Compare the result with the previous model. Did the p-value for the model change notably? Did the R2 and adjusted R2 change notably? Did the parameter estimates and their p-values change notably?
3.3 Further Simplifying the Model
Rerun the model in the previous exercise, but again eliminate the variable with the highest p-value. How did the output change from the previous model? Did the number of parameters with a p-value less than 0.05 change?

Home Work: Exercise 4
4.1 Using Model-Building Techniques
Use the BodyFat2 data set to identify a set of "best" models.
Using the Mallows' Cp option, use an all-possible-regressions technique to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Hint: select the best 60 models based on Cp to compare.
Use a stepwise regression method to select a candidate model. Try Forward selection, Backward selection, and Stepwise selection. How many variables would result from a model using Forward selection with a significance level for entry of 0.05, instead of the default of 0.50?
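Both parts of this exercise map onto PROC REG MODEL options (sketch):

```sas
/* All-possible regressions ranked by Mallows' Cp;
   BEST=60 keeps only the 60 best candidate models. */
proc reg data=BodyFat2;
   model PctBodyFat2 = Age Weight Height Neck Chest Abdomen Hip
                       Thigh Knee Ankle Biceps Forearm Wrist
         / selection=cp best=60;
run;

/* Forward selection with a 0.05 entry criterion instead of
   the default SLENTRY=0.50. */
proc reg data=BodyFat2;
   model PctBodyFat2 = Age Weight Height Neck Chest Abdomen Hip
                       Thigh Knee Ankle Biceps Forearm Wrist
         / selection=forward slentry=0.05;
run;
quit;
```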

Thank you!