
Presentation on theme: "Regression Shibin Liu SAS Beijing R&D. Agenda 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model."— Presentation transcript:

1 Regression Shibin Liu SAS Beijing R&D

2 Agenda 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model Building and Interpretation 5. Summary 2

3 Agenda 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model Building and Interpretation 5. Summary 3

4 Lesson overview 4 ANOVA: predictor variable → response variable

5 Lesson overview 5 Continuous variables: correlation analysis, linear regression

6 Lesson overview 6 Continuous predictor, continuous response. Correlation analysis: measure linear association, examine the relationship, screen for outliers, interpret the correlation.

7 Lesson overview 7 Continuous predictor, continuous response. Linear regression: define the linear association, determine the equation for the line, explain or predict variability.

8 Lesson overview 8 What do you want to examine? The relationship between variables, the difference between groups on one or more variables, or the location, spread, and shape of the data's distribution. Depending on whether you want summary statistics or graphics, how many groups you have, and which kind of variables you have, the relevant analyses are: summary statistics (descriptive statistics); distribution analysis (descriptive statistics, histograms, normal probability plots); t test (two groups); linear models (analysis of variance, two or more groups); correlations (continuous variables only); one-way frequencies and table analysis (frequency tables, chi-square test, categorical response variable); linear regression; and logistic regression. Descriptive statistics are covered in Lesson 1, inferential statistics across Lessons 2–5.

9 Agenda 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model Building and Interpretation 5. Summary 9

10 Exploratory Data Analysis: Introduction 10 Weight and Height: continuous variables. Exploratory data analysis (scatter plot, correlation analysis) precedes linear regression.

11 Exploratory Data Analysis: Objective Examine the relationship between continuous variables using a scatter plot. Quantify the degree of association between two continuous variables using correlation statistics. Avoid potential misuses of the correlation coefficient. Obtain Pearson correlation coefficients. 11

12 Exploratory Data Analysis: Using Scatter Plots to Describe Relationships between Continuous Variables 12 Exploratory data analysis: scatter plot, correlation analysis. A scatter plot shows the relationship, trend, range, and outliers, and communicates analysis results. X: predictor variable; Y: response variable; each coordinate is a pair of X and Y values.

13 Exploratory Data Analysis: Using Scatter Plots to Describe Relationships between Continuous Variables 13 Model terms: a squared (quadratic) term.

14 Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables 14 Exploratory data analysis: scatter plot, correlation analysis. Linear association can be zero, positive, or negative.

15 Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables 15 Pearson correlation coefficient: for the population (ρ) and for a sample (r).
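The formulas on this slide did not survive extraction; for reference, the standard definitions of the population and sample Pearson coefficients are:

```latex
\rho = \frac{\operatorname{Cov}(X,Y)}{\sigma_X\,\sigma_Y}
\qquad
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
```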

16 Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables 16 Correlation analysis. The Pearson correlation coefficient r ranges from −1 to +1: values near −1 indicate a strong negative linear relationship, values near +1 a strong positive linear relationship, and values near 0 no linear relationship.
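A minimal sketch of computing a sample r by hand (illustrative Python with made-up data; the lesson itself uses the SAS Correlation task):

```python
import math

# Made-up data, roughly linear, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# r = sum((xi - mx)(yi - my)) / sqrt(sum((xi - mx)^2) * sum((yi - my)^2))
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)  # close to +1: strong positive linear relationship
```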

17 Exploratory Data Analysis: Hypothesis Testing for a Correlation 17 Correlation coefficient test: population parameter ρ, sample statistic r. A p-value does not measure the magnitude of the association. Sample size affects the p-value: rejecting the null hypothesis only means that you can be confident that the true population correlation is not 0. A small p-value can occur (as with many statistics) because of a very large sample size; even a correlation coefficient of 0.01 can be statistically significant with a large enough sample. Therefore, it is important to also look at the value of r itself to see whether it is meaningfully large.
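The sample-size effect can be made concrete with the usual t statistic for testing H0: ρ = 0 (a standard formula, assumed here rather than taken from the slide):

```python
import math

def corr_t(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), compared to a t distribution
    # with n - 2 degrees of freedom
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

t_small = corr_t(0.5, 20)          # moderate r, small sample
t_large = corr_t(0.01, 1_000_000)  # tiny r, huge sample
```

With a million observations, even r = 0.01 yields a t statistic of about 10, far beyond any conventional cutoff, which is exactly why the slide warns against reading the p-value as a measure of strength.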

18 Exploratory Data Analysis: Hypothesis Testing for a Correlation 18 [Scatter plot examples, each labeled with its value of r]

19 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect 19 Correlation does not imply causation. Besides causality, could other reasons account for a strong correlation between two variables?

20 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect 20 Weight and Height. Correlation does not imply causation: a strong correlation between two variables does not mean that a change in one variable causes the other variable to change, or vice versa.

21 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations Cause and Effect 21 Correlation does not imply causation

22 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations Cause and Effect 22 Correlation does not imply causation

23 Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect 23 X: the percentage of students in a state who take the SAT exam; Y: the state's SAT scores. Whether students are bound for college entrance affects both variables, so they can be strongly correlated without one causing the other.

24 Exploratory Data Analysis: Avoiding Common Errors: Types of Relationships 24 The Pearson correlation coefficient r can be near 0 even when a strong curvilinear (quadratic, parabolic) relationship exists: r measures only linear association.

25 Exploratory Data Analysis: Avoiding Common Errors: Outliers 25 Data one: r = 0.82. Data two: r = 0.02.

26 Exploratory Data Analysis: Avoiding Common Errors: Outliers 26 What to do with an outlier? First ask why it is an outlier. If it is an error: collect the data again or replicate the data. If it is valid: compute two correlation coefficients (with and without the outlier) and report both.

27 Exploratory Data Analysis: Scenario: Exploring Data Using Correlation and Scatter Plots 27 Fitness data: what is related to oxygen consumption?

28 Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots 28

29 Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots 29 What’s the Pearson correlation coefficient of Oxygen_Consumption with Run_Time? What’s the p-value for the correlation of Oxygen_Consumption with Performance?

30 Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots 30

31 Exploratory Data Analysis: Examining Correlations between Predictor Variables 31

32 Exploratory Data Analysis: Examining Correlations between Predictor Variables 32 What are the two highest Pearson correlation coefficients?

33 Exploratory Data Analysis 33 Question 1. The correlation between tuition and rate of graduation at U.S. colleges is ___. What does this mean? a) The way to increase graduation rates at your college is to raise tuition. b) Increasing graduation rates is expensive, causing tuition to rise. c) Students who are richer tend to graduate more often than poorer students. d) None of the above. Answer: d

34 Agenda 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model Building and Interpretation 5. Summary 34

35 Simple Linear Regression: Introduction 35

36 Simple Linear Regression: Introduction Linear relationships: Variable A, Variable D, Variable C, Variable B

37 Simple Linear Regression: Introduction 37 [Scatter plots with the same r but different relationships]

38 Simple Linear Regression: Introduction 38 Simple Linear Regression Y: variable of primary interest X: explains variability in Y Regression Line

39 Simple Linear Regression: Objective 39 Explain the concepts of Simple Linear Regression Fit a Simple Linear Regression using the Linear Regression task Produce predicted values and confidence intervals.

40 Simple Linear Regression: Scenario: Performing Simple Linear Regression 40 Fitness data: Run_Time → Oxygen_Consumption. Simple linear regression.

41 Simple Linear Regression: The Simple Linear Regression Model 41
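The model equation on this slide was an image; the standard simple linear regression model is:

```latex
Y = \beta_0 + \beta_1 X + \varepsilon
```

where β₀ is the intercept parameter, β₁ the slope parameter, and ε the random error term — the variation of Y around the regression line.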

42 Simple Linear Regression: The Simple Linear Regression Model 42 Question 2. What does epsilon represent? a)The intercept parameter b)The predictor variable c)The variation of X around the line d)The variation of Y around the line Answer: d

43 Simple Linear Regression: How SAS Performs Linear Regression 43 Method of least squares: minimize the sum of squared errors. The resulting parameter estimates are Best Linear Unbiased Estimators (BLUE): they are unbiased and have minimum variance.
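For one predictor, the least squares estimates have a closed form. A quick sketch with made-up data (illustrative Python; the lesson uses the SAS Linear Regression task):

```python
# Closed-form ordinary least squares for a single predictor:
# b1 = Sxy / Sxx minimizes sum((yi - b0 - b1*xi)^2); b0 = ybar - b1*xbar
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]  # made-up data

n = len(x)
mx = sum(x) / n
my = sum(y) / n

b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx  # fitted line: yhat = b0 + b1 * x
```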

44 Simple Linear Regression: Measuring How Well a Model Fits the Data 44 Regression model Baseline model VS.

45 Simple Linear Regression: Comparing the Regression Model to a Baseline Model 45 Baseline model: predict Y by its mean, ȳ. A better model explains more variability. Types of variability and equations: Explained (SSM) = Σ(ŷᵢ − ȳ)²; Unexplained (SSE) = Σ(yᵢ − ŷᵢ)²; Total (SST) = Σ(yᵢ − ȳ)² = SSM + SSE.
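The decomposition SST = SSM + SSE (and R² = SSM/SST) can be verified numerically. A sketch with made-up data (illustrative Python, not SAS output):

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]  # made-up data

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Least squares fit (closed form for one predictor)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
yhat = [b0 + b1 * a for a in x]

sst = sum((b - my) ** 2 for b in y)               # total variability
ssm = sum((h - my) ** 2 for h in yhat)            # explained by the model
sse = sum((b - h) ** 2 for b, h in zip(y, yhat))  # unexplained (residual)
r2 = ssm / sst                                    # proportion explained
```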

46 Simple Linear Regression: Hypothesis Testing for Linear Regression 46 Linear regression

47 Simple Linear Regression: Assumptions of Simple Linear Regression 47 Assumptions: 1. The mean of Y is linearly related to X. 2. Errors are normally distributed. 3. Errors have equal variances. 4. Errors are independent.

48 Simple Linear Regression: Performing Simple Linear Regression 48 Task >Regression>Linear Regression

49 Simple Linear Regression: Performing Simple Linear Regression 49 Task >Regression>Linear Regression

50 Simple Linear Regression: Performing Simple Linear Regression 50 Question 3. In a regression of Y on X, if the parameter estimate (slope) of X is 0, then which of the following is the best guess (predicted value) for Y when X equals 13? a) 13 b) The mean of Y c) A random number d) The mean of X e) 0 Answer: b

51 Simple Linear Regression: Confidence and Prediction Intervals 51

52 Simple Linear Regression: Confidence and Prediction Intervals 52 Question 4. Suppose you have a 95% confidence interval around the mean. How do you interpret it? a) The probability is .95 that the true population mean of Y for a particular X is within the interval. b) You are 95% confident that a newly sampled value of Y for a particular X is within the interval. c) You are 95% confident that your interval contains the true population mean of Y for a particular X. Answer: c

53 Simple Linear Regression: Confidence and Prediction Intervals 53

54 Simple Linear Regression: Confidence and Prediction Intervals 54
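The interval formulas on these slides were images; the usual textbook forms for simple linear regression (an assumption, not taken from the slides) are:

```latex
\text{CI for the mean of } Y \text{ at } x_0:\quad
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\,n-2}\; s\,\sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^2}{S_{xx}}}

\text{PI for a new } Y \text{ at } x_0:\quad
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\,n-2}\; s\,\sqrt{1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{S_{xx}}}
```

The extra 1 under the prediction interval's square root reflects the variability of an individual new observation, which is why the prediction interval is always wider than the confidence interval at the same x₀.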

55 Simple Linear Regression: Producing Predicted Values of the Response Variable 55

data Need_Predictions;
   input Runtime;
   datalines;
;
run;

56 Simple Linear Regression: Producing Predicted Values of the Response Variable 56

data Need_Predictions;
   input Runtime;
   datalines;
;
run;

57 Simple Linear Regression: Producing Predicted Values of the Response Variable 57

58 Agenda 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model Building and Interpretation 5. Summary 58

59 Multiple Regression 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model Building and Interpretation 5. Summary 59

60 Multiple Regression: Introduction 60 Simple linear regression: one predictor variable → response variable. Multiple linear regression: more than one predictor variable → response variable.

61 Multiple Regression: Introduction 61 Simple Linear Regression Multiple Linear Regression When k=2
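The model equation here was an image; the standard multiple linear regression model is:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon
```

With k = 2, the fitted model is a plane rather than a line.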

62 Multiple Regression: Objective 62 Explain the mathematical model for multiple regression Describe the main advantage of multiple regression versus simple linear regression Explain the standard output from the Linear Regression task. Describe common pitfalls of multiple linear regression

63 Multiple Regression Advantages and Disadvantages of Multiple Regression 63 Advantages / Disadvantages. Multiple linear regression: 127 possible models (all non-empty subsets of 7 candidate predictors); complex to interpret.

64 Multiple Regression Picturing the Model for Multiple Regression 64 Multiple Linear Regression

65 Multiple Regression Picturing the Model for Multiple Regression 65 Multiple Linear Regression

66 Multiple Regression Common Applications 66 Multiple linear regression is a powerful tool for the following tasks: 1. Prediction, which is used to develop a model that predicts future values of a response variable (Y) based on its relationships with other predictor variables (Xs). 2. Analytical or explanatory analysis, which is used to develop an understanding of the relationships between the response variable and the predictor variables.

67 Multiple Regression Analysis versus Prediction in Multiple Regression 67 Prediction 1.The terms in the model, the values of their coefficients, and their statistical significance are of secondary importance. 2.The focus is on producing a model that is the best at predicting future values of Y as a function of the Xs.

68 Multiple Regression Analysis versus Prediction in Multiple Regression 68 Analytical or Explanatory Analysis 1. The focus is on understanding the relationship between the dependent variable and the independent variables. 2. Consequently, the statistical significance of the coefficients is important, as are the magnitudes and signs of the coefficients.

69 Multiple Regression Hypothesis Testing for Multiple Regression 69 Multiple Linear Regression

70 Multiple Regression Hypothesis Testing for Multiple Regression 70 Question 4. Match the items on the left with those on the right. 1. At least one slope of the regression in the population is not 0, and at least one predictor variable explains a significant amount of variability in the response variable. 2. No predictor variable explains a significant amount of variability in the response variable. 3. The estimated linear regression model does not fit the data better than the baseline model. a) Reject the null hypothesis. b) Fail to reject the null hypothesis. Answers: 1-a, 2-b, 3-b

71 Multiple Regression Assumptions for Multiple Regression 71 Linear regression model assumptions: 1. The mean of Y is linearly related to the Xs. 2. Errors are normally distributed. 3. Errors have equal variances. 4. Errors are independent.

72 Multiple Regression: Scenario: Using Multiple Regression to Explain Oxygen Consumption 72 Performance Age

73 Multiple Regression: Adjusted R² 73 Adj. R² = 1 − ((n − i)(1 − R²)) / (n − p), where i = 1 if there is an intercept and 0 otherwise, n = the number of observations used to fit the model, and p = the number of parameters in the model.
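A sketch of the adjusted R² formula above, showing why it penalizes extra parameters (illustrative Python with assumed values):

```python
def adjusted_r2(r2, n, p, intercept=True):
    # Adj R^2 = 1 - ((n - i) / (n - p)) * (1 - R^2)
    i = 1 if intercept else 0
    return 1 - (n - i) / (n - p) * (1 - r2)

# Adding a useless parameter raises p but not R^2, so Adj R^2 goes down.
a2 = adjusted_r2(0.80, n=31, p=2)  # one predictor plus intercept
a3 = adjusted_r2(0.80, n=31, p=3)  # same fit with one extra parameter
```

Unlike R², which never decreases when a term is added, the adjusted version can fall, which is why it is preferred for comparing models with different numbers of terms.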

74 Multiple Regression: Performing Multiple Linear Regression 74

75 Multiple Regression: Performing Multiple Linear Regression 75 What’s the p-value of the overall model? Should we reject the null hypothesis or not? Based on our evidence, do we reject the null hypothesis that the parameter estimate is 0?

76 Multiple Regression: Performing Multiple Linear Regression 76 Which model for Oxygen_Consumption: RunTime alone, Performance alone, or RunTime and Performance together?

77 Multiple Regression: Performing Multiple Linear Regression 77 Performance and RunTime are highly correlated with each other: collinearity.

78 Agenda 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model Building and Interpretation 5. Summary 78

79 Model Building and Interpretation : Introduction 79 Performance Age

80 Model Building and Interpretation : Introduction 80 ?

81 Model Building and Interpretation: Introduction 81 Stepwise selection methods: Forward, Backward, Stepwise. All-possible regressions rank criteria: R², adjusted R², Mallows' Cp. 'No selection' is the default.

82 Model Building and Interpretation: Objectives 82 Explain the Linear Regression task options for the model selection Describe model selection options and interpret output to evaluate the fit of several models

83 Model Building and Interpretation : Approaches to Selecting Models: Manual 83 Full Model

84 Model Building and Interpretation: SAS and Automated Approaches to Modeling 84 Stepwise selection methods: Forward, Backward, Stepwise. All-possible regressions rank criteria: R², adjusted R², Mallows' Cp. 'No selection' is the default. Run all methods, look for commonalities, and narrow down the candidate models.

85 Model Building and Interpretation: The All-Possible Regressions Approach to Model Building 85 Fitness data: 7 predictor variables, 128 possible models.

86 Model Building and Interpretation: Evaluating Models Using Mallows' Cp Statistic 86 Mallows' Cp statistic measures model bias: under-fitting and over-fitting.

87 Model Building and Interpretation: Evaluating Models Using Mallows' Cp Statistic 87 Mallows' Cp and model bias. For prediction, use Mallows' criterion: Cp ≤ p. For parameter estimation, use Hocking's criterion: Cp ≤ 2p − p_full + 1.
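A sketch of Mallows' Cp under its standard definition, with made-up sums of squares (illustrative Python; SAS computes this for you in the task output):

```python
def mallows_cp(sse_p, mse_full, n, p):
    # Standard definition: Cp = SSE_p / MSE_full - (n - 2p),
    # where p counts the parameters (including intercept) in the submodel
    return sse_p / mse_full - (n - 2 * p)

# Made-up numbers: a 3-parameter submodel fit to n = 31 observations
cp = mallows_cp(sse_p=130.0, mse_full=5.0, n=31, p=3)
meets_mallows = cp <= 3  # prediction criterion: Cp <= p
```

A submodel whose Cp is close to p has little bias relative to the full model; values far above p signal under-fitting.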

88 Model Building and Interpretation: Viewing Mallows' Cp Statistic 88 Linear Regression task, partial output. p = number of variables in the model + 1 (intercept).

89 Model Building and Interpretation: Viewing Mallows' Cp Statistic 89 Partial output. Mallows' criterion: Cp ≤ p. In this output, how many models have a value for Cp that is less than or equal to p? Which of these models has the fewest parameters?

90 Model Building and Interpretation: Viewing Mallows' Cp Statistic 90 Partial output. How many models meet Hocking's criterion for Cp for parameter estimation? First, what is p for the full model? p_full = 8 (7 variables + 1 intercept). Hocking's criterion: Cp ≤ 2p − p_full + 1; for example, with p = 6, Cp ≤ 12 − 8 + 1 = 5.

91 Model Building and Interpretation: Viewing Mallows' Cp Statistic 91 Question 5. What happens when you use the all-possible regressions method? Select all that apply. 1. You compare the R-square, adjusted R-square, and Cp statistics to evaluate the models. 2. SAS computes all possible models. 3. You choose a selection method (stepwise, forward, or backward). 4. SAS ranks the results. 5. You cannot reduce the number of models in the output. 6. You can produce a plot to help identify models that satisfy criteria for the Cp statistic. Answers: 1, 2, 4, 6

92 Model Building and Interpretation: Viewing Mallows' Cp Statistic 92 Question 6. Match the items. 1. Preferred over R-square for evaluating multiple linear regression models (takes into account the number of terms in the model) → c) adjusted R-square. 2. Useful for parameter estimation → b) Hocking's criterion for Cp. 3. Useful for prediction → a) Mallows' criterion for Cp.

93 Model Building and Interpretation : Using Automatic Model Selection 93

94 Model Building and Interpretation : Using Automatic Model Selection 94

95 Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models – Prediction model 95

96 Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models – Explanatory Model 96

97 Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models 97

98 Model Building and Interpretation : The Stepwise Selection Approach to Model Building 98 Stepwise selection methods Forward Backward Stepwise

99 Model Building and Interpretation: The Stepwise Selection Approach to Model Building 99 Forward: the forward selection method starts with no variables, then repeatedly adds the most significant variable until no remaining variable is significant. A variable that has been added is not removed, even if it becomes insignificant later.

100 Model Building and Interpretation: The Stepwise Selection Approach to Model Building 100 Backward: the backward selection method starts with all variables in the model, then repeatedly removes the least significant variable until all remaining variables are significant. Once a variable is removed, it cannot re-enter.

101 Model Building and Interpretation: The Stepwise Selection Approach to Model Building 101 Stepwise: stepwise selection combines the ideas of forward and backward selection. It starts with no variables and adds the most significant variable, as in forward selection; however, like backward selection, it can also drop insignificant variables, one at a time. The method stops when all terms in the model are significant and no term outside the model would be significant.
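The forward-selection mechanics described on these slides can be sketched as a greedy loop (illustrative Python, not SAS's implementation; the `score` function here is a stand-in for a model-fit statistic such as R², and the toy gains are made up):

```python
def forward_select(candidates, score, min_gain=1e-6):
    """Greedily add the candidate that most improves the score;
    stop when no candidate improves it by more than min_gain."""
    selected = []
    best = score(selected)
    improved = True
    while improved and candidates:
        improved = False
        gains = {v: score(selected + [v]) - best for v in candidates}
        v, g = max(gains.items(), key=lambda kv: kv[1])
        if g > min_gain:
            selected.append(v)
            candidates.remove(v)
            best += g
            improved = True
    return selected

# Toy score: "x1" helps most, "x2" a little, "x3" not at all
gain = {"x1": 0.5, "x2": 0.2, "x3": 0.0}
toy_score = lambda vars_: sum(gain[v] for v in vars_)
chosen = forward_select(["x1", "x2", "x3"], toy_score)
```

Backward selection inverts the loop (start full, greedily drop the worst term), and stepwise interleaves the two, re-checking previously added terms after each addition.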

102 Model Building and Interpretation: The Stepwise Selection Approach to Model Building 102 Application of the stepwise selection methods (Forward, Backward, Stepwise): identify candidate models, then use expertise to choose among them.

103 Model Building and Interpretation : Performing Stepwise Regression: Forward selection 103

104 Model Building and Interpretation : Performing Stepwise Regression: Forward selection 104

105 Model Building and Interpretation : Performing Stepwise Regression: Backward selection 105

106 Model Building and Interpretation : Performing Stepwise Regression: Backward selection 106

107 Model Building and Interpretation : Performing Stepwise Regression: Stepwise selection 107

108 Model Building and Interpretation : Performing Stepwise Regression: Stepwise selection 108

109 Model Building and Interpretation: Using Alternative Significance Criteria for Stepwise Models 109 Stepwise regression models: with the default significance levels versus using 0.05 significance levels.

110 Model Building and Interpretation: Comparison of Selection Methods 110 Stepwise selection methods use fewer computer resources. All-possible regression generates more candidate models, which might have nearly equal R² and Cp statistics.

111 Agenda 0. Lesson overview 1. Exploratory Data Analysis 2. Simple Linear Regression 3. Multiple Regression 4. Model Building and Interpretation 5. Summary 111

112 Home Work: Exercise Describing the Relationship between Continuous Variables. Percentage of body fat, age, weight, height, and 10 body circumference measurements (for example, abdomen) were recorded for 252 men. The data are stored in the BodyFat2 data set. Body fat, one measure of health, was accurately estimated by an underwater weighing technique. There are two measures of percentage body fat in this data set. Case: case number. PctBodyFat1: percent body fat using Brozek's equation, 457/Density. PctBodyFat2: percent body fat using Siri's equation, 495/Density − 450. Density: density (gm/cm³). Age: age (yrs). Weight: weight (lbs). Height: height (inches). 112

113 Home Work: Exercise 1 113

114 Home Work: Exercise Describing the Relationship between Continuous Variables a. Generate scatter plots and correlations for the variables Age, Weight, Height, and the circumference measures versus the variable PctBodyFat2. Important! The Correlation task limits you to 10 variables at a time for scatter plot matrices, so for this exercise, look at the relationships with Age, Weight, and Height separately from the circumference variables (Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist). Note: Correlation tables can be created using more than 10 VAR variables at a time. b. What variable has the highest correlation with PctBodyFat2? c. What is the value of the coefficient? d. Is the correlation statistically significant at the 0.05 level? e. Can straight lines adequately describe the relationships? f. Are there any outliers that you should investigate? g. Generate correlations among the variables Age, Weight, and Height, and among the circumference measures. Are there any notable relationships? 114

115 Home Work: Exercise Fitting a Simple Linear Regression Model. Use the BodyFat2 data set for this exercise: a. Fit a simple linear regression model with PctBodyFat2 as the response variable and Weight as the predictor. b. What is the value of the F statistic and the associated p-value? How would you interpret this with regard to the null hypothesis? c. Write the predicted regression equation. d. What is the value of the R² statistic? How would you interpret this? e. Produce predicted values for PctBodyFat2 when Weight is 125, 150, 175, 200, and 225 (see the SAS code in the comments below). f. What are the predicted values? g. What is the predicted value of PctBodyFat2 when Weight is 150? 115

116 Home Work: Exercise Perform a Multiple Regression a. Using the BodyFat2 data set, run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Compare the ANOVA table with that from the model with only Weight in the previous exercise. What is different? b. How do the R² and the adjusted R² compare with these statistics for the Weight regression demonstration? c. Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change? 116

117 Home Work: Exercise Simplifying the Model a. Rerun the model in the previous exercise, but eliminate the variable with the highest p-value. Compare the result with the previous model. b. Did the p-value for the model change notably? c. Did the R² and adjusted R² change notably? d. Did the parameter estimates and their p-values change notably? 3.3 More Simplifying of the Model a. Rerun the model in the previous exercise, but eliminate the variable with the highest p-value. b. How did the output change from the previous model? c. Did the number of parameters with a p-value less than 0.05 change? 117

118 Home Work: Exercise Using Model Building Techniques. Use the BodyFat2 data set to identify a set of "best" models. a. Using the Mallows' Cp option, use an all-possible regressions technique to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Hint: select the best 60 models based on Cp to compare. b. Use a stepwise regression method to select a candidate model. Try Forward selection, Backward selection, and Stepwise selection. c. How many variables would result from a model using Forward selection and a significance level for the entry criterion of 0.05, instead of the default of 0.50? 118

119 Thank you!




