
Statistics: Unlocking the Power of Data Lock 5
STAT 101, Dr. Kari Lock Morgan, 11/20/12
Multiple Regression, SECTIONS 9.2, 10.1, 10.2
Multiple explanatory variables (10.1)
Partitioning variability – R², ANOVA (9.2)
Conditions – residual plot (10.2)
Transformations (not in book)

Exam 2 Grades

Exam 2 Re-grades
Re-grade requests are due in writing by class on Tuesday, 11/27/12.
Partial credit will not be altered – only submit a re-grade request if you think you have entirely the correct answer but got points off.
Grades may go up or down.
If points were added up incorrectly, just bring me your exam (no need for an official re-grade).

More than 2 Variables!
Today we'll finally learn a way to handle more than 2 variables!

Multiple Regression
Multiple regression extends simple linear regression to include multiple explanatory variables:
ŷ = b0 + b1 x1 + b2 x2 + … + bk xk

Grade on Final
We'll use your current grades to predict final exam scores, based on a model from last year's students.
Response: final exam score
Explanatory: hw average, clicker average, exam 1, exam 2

Grade on Final
What variable is the most significant predictor of final exam score?
a) Homework average
b) Clicker average
c) Exam 1
d) Exam 2
Exam 1 has the lowest p-value.

Inference for Coefficients
The p-value for explanatory variable x_i is associated with the hypotheses H0: β_i = 0 vs. Ha: β_i ≠ 0.
For intervals and p-values of coefficients in multiple regression, use a t-distribution with degrees of freedom n – k – 1, where k is the number of explanatory variables included in the model.

Grade on Final
Estimate your score on the final exam. What type of interval do you want for this estimate?
a) Confidence interval
b) Prediction interval
A confidence interval is for an average; a prediction interval is for an individual.

Grade on Final
Estimate your score on the final exam. (hw average is out of 10, clicker average is out of 2)
For a HW average of 9, a clicker average of 1.7, and exam scores of 80 on each exam (these were the averages for each category), plugging into the fitted equation gives a predicted final exam score of 83.81.

Grade on Final
Is the clicker coefficient really negative?!? Give a 95% confidence interval for the clicker coefficient (okay to use t* = 2).
-2.7 ± 2 × 4.93 = -2.7 ± 9.86 = (-12.56, 7.16)
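The interval above can be sketched as a tiny helper, assuming only the estimate, its standard error, and the rough t* = 2 from the slide (the helper name is our own):

```python
# Minimal sketch: a 95% CI for a regression coefficient, b ± t* × SE.
def coef_ci(b, se, t_star=2.0):
    """Confidence interval for a coefficient, rounded to 2 decimals."""
    margin = t_star * se
    return (round(b - margin, 2), round(b + margin, 2))

# Clicker coefficient from the slide: b = -2.7, SE = 4.93
print(coef_ci(-2.7, 4.93))  # (-12.56, 7.16)
```

Because the interval covers 0, the clicker coefficient is not significantly different from 0 in this model.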

Grade on Final
Is your score on exam 2 really not a significant predictor of your final exam score?!?

Coefficients
The coefficient (and significance) for each explanatory variable depends on the other variables in the model!
In predicting final exam scores, if you know someone's score on Exam 1, it doesn't provide much additional information to know their score on Exam 2 (these two explanatory variables are highly correlated).

Grade on Final
If you take Exam 1 out of the model… (output for the model with Exam 1 shown for comparison) now Exam 2 is significant!

Multiple Regression
The coefficient for each explanatory variable is the predicted change in y for a one-unit change in x, given the other explanatory variables in the model!
The p-value for each coefficient indicates whether it is a significant predictor of y, given the other explanatory variables in the model!
If explanatory variables are associated with each other, coefficients and p-values will change depending on what else is included in the model.
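The Exam 1 / Exam 2 phenomenon can be reproduced with made-up data (not the class data): below, y is built exactly from x1, and x2 is highly correlated with x1. Alone, x2 looks like a strong predictor; once x1 is in the model, its coefficient is essentially 0.

```python
# Sketch with invented data: a predictor's coefficient depends on what
# else is in the model. y = 1 + 2*x1 exactly; x2 is nearly collinear with x1.

def fit_simple(x, y):
    """Least-squares intercept and slope for y on a single x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def fit_two(x1, x2, y):
    """Least squares for y on x1 and x2 via the 3x3 normal equations."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for j in range(col, 3):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, 3))) / A[i][i]
    return b  # [intercept, coefficient of x1, coefficient of x2]

x1 = [0.0, 1.0, 2.0, 3.0]
x2 = [0.0, 1.0, 2.0, 4.0]          # strongly associated with x1
y = [1.0 + 2.0 * v for v in x1]    # y depends only on x1

_, slope_alone = fit_simple(x2, y)
b0, b1, b2 = fit_two(x1, x2, y)
print(round(slope_alone, 3))  # x2 looks like a strong predictor on its own
print(round(b2, 6))           # ...but its coefficient is ~0 given x1
```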

Grade on Final
If you include Project 1 in the model… (output for the model without Project 1 shown for comparison)

Grades

Evaluating a Model
How do we evaluate the success of a model?
How do we determine the overall significance of a model?
How do we choose between two competing models?

Variability
One way to evaluate a model is to partition variability. A good model "explains" a lot of the variability in Y.
Total Variability = Variability Explained by the Model + Error Variability

Exam Scores
Without knowing the explanatory variables, we can say that a person's final exam score will probably be between 60 and 98 (the range of Y).
Knowing hw average, clicker average, exam 1 and 2 grades, and project 1 grade, we can give a narrower prediction interval for final exam score.
We say that some of the variability in Y is explained by the explanatory variables. How do we quantify this?

Variability
How do we quantify variability in Y?
a) Standard deviation of Y
b) Sum of squared deviations from the mean of Y
c) (a) or (b)
d) None of the above

Sums of Squares
Total Variability = Variability Explained by the Model + Error Variability
SST = SSM + SSE

Variability
If SSM is much higher than SSE, then the model explains a lot of the variability in Y.

R²
R² is the proportion of the variability in Y that is explained by the model:
R² = Variability Explained by the Model / Total Variability = SSM / SST
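The partition SST = SSM + SSE and the resulting R² can be computed directly. This sketch uses a tiny made-up data set (not the class data) and a simple least-squares line:

```python
# Sketch: partition variability of a least-squares fit into SSM + SSE.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 7.0]  # invented data

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
         / sum((a - xbar) ** 2 for a in x))
intercept = ybar - slope * xbar
preds = [intercept + slope * a for a in x]

sst = sum((b - ybar) ** 2 for b in y)              # total variability
ssm = sum((p - ybar) ** 2 for p in preds)          # explained by the model
sse = sum((b - p) ** 2 for b, p in zip(y, preds))  # error variability

print(round(ssm + sse, 10) == round(sst, 10))  # SST = SSM + SSE
print(round(ssm / sst, 4))                     # R²
```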

R²
For simple linear regression, R² is just the squared correlation between X and Y.
For multiple regression, R² is the squared correlation between the actual values and the predicted values.
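That equivalence is easy to check numerically. This sketch (again with invented data, not the class data) computes the correlation r from its formula and compares r² with SSM/SST from the fitted line:

```python
import math

# Sketch: for simple linear regression, r² equals R² = SSM/SST.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 7.0]  # invented data

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))

r = sxy / math.sqrt(sxx * syy)      # correlation between X and Y
slope = sxy / sxx
intercept = ybar - slope * xbar
preds = [intercept + slope * a for a in x]
ssm = sum((p - ybar) ** 2 for p in preds)
r_squared = ssm / syy               # SST is syy here

print(round(r ** 2, 6) == round(r_squared, 6))  # True
```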

R²

Final Exam Grade

Is the model significant?
If we want to test whether the model is significant (whether the model helps to predict y), we can test the hypotheses H0: β1 = β2 = … = βk = 0 vs. Ha: at least one β_i ≠ 0.
We do this with ANOVA!

ANOVA for Regression
k: number of explanatory variables; n: sample size

Source   df        Sum of Squares   Mean Square         F          p-value
Model    k         SSM              MSM = SSM/k         MSM/MSE    Use F(k, n-k-1)
Error    n-k-1     SSE              MSE = SSE/(n-k-1)
Total    n-1       SST
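Filling in the rest of the table from SSM, SSE, n, and k is just arithmetic. A minimal sketch (function name and numbers are our own, not from the slides):

```python
# Sketch: compute the ANOVA-for-regression table entries.
def anova_for_regression(ssm, sse, n, k):
    """Return (MSM, MSE, F) for k predictors fit to n observations."""
    msm = ssm / k
    mse = sse / (n - k - 1)
    return msm, mse, msm / mse

# Made-up sums of squares: SSM = 12.0, SSE = 0.5, n = 4, k = 1
msm, mse, f = anova_for_regression(12.0, 0.5, 4, 1)
print(msm, mse, f)  # 12.0 0.25 48.0
```

The F statistic would then be compared to an F distribution with k and n - k - 1 degrees of freedom to get the p-value.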

ANOVA for Regression
(ANOVA table for the final exam model shown on slide; p-value ≈ 0)

Final Exam Grade

Simple Linear Regression
For simple linear regression, the following tests will all give equivalent p-values:
t-test for non-zero correlation
t-test for non-zero slope
ANOVA for regression

Mean Square Error (MSE)
Mean square error (MSE) measures the average variability in the errors (residuals).
The square root of MSE gives the standard deviation of the residuals (giving a typical distance of points from the line).
This number is also given in the R output as the residual standard error, and is known as s_ε in the textbook.

Final Exam Grade

Simple Linear Model
Residual standard error = √MSE = s_ε estimates the standard deviation of the residuals (the spread of the normal distributions around the predicted values).

Residual Standard Error
Use the residual standard error and your predicted final exam score to compute an approximate 95% prediction interval for your final exam score: ŷ ± 2 × s_ε.
NOTE: This calculation only takes into account errors around the line, not uncertainty in the line itself, so your true prediction interval will be slightly wider.
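The rough interval ŷ ± 2·s_ε is a one-liner. In this sketch the residual standard error (5.0) is a made-up stand-in, since the class value appeared only in the slide image; the predicted score 83.81 is the one computed earlier:

```python
# Sketch: rough 95% prediction interval, ignoring uncertainty in the line.
def rough_pi(y_hat, resid_se):
    """Approximate 95% prediction interval: y_hat ± 2 × residual SE."""
    return (y_hat - 2 * resid_se, y_hat + 2 * resid_se)

lo, hi = rough_pi(83.81, 5.0)  # 5.0 is a hypothetical residual SE
print(round(lo, 2), round(hi, 2))  # 73.81 93.81
```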

Revisiting Conditions
For simple linear regression, we learned that the following should hold for inferences to be valid:
Linearity
Constant variability of the residuals
Normality of the residuals
How do we assess the first two conditions in multiple regression, when we can no longer visualize with a scatterplot?

Residual Plot
A residual plot is a scatterplot of the residuals against the predicted responses. It should have:
1) No obvious pattern or trend (linearity)
2) Constant variability

Residual Plots
(Two residual plots shown: one with an obvious pattern, one with non-constant variability)

Final Exam Score
Are the conditions satisfied? (a) Yes (b) No

Transformations
If the conditions are not satisfied, there are some common transformations you can apply to the response variable.
You can take any function of y and use it as the response, but the most common are:
log(y) (natural logarithm, ln)
√y (square root)
y² (squared)
e^y (exponential)
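The four transformations listed above are just elementwise functions of the response. A sketch with made-up right-skewed responses (after transforming, you would refit the model and recheck the residual plot):

```python
import math

# Sketch: the four common response transformations, on invented data.
y = [1.0, 2.0, 4.0, 8.0, 16.0]  # made-up right-skewed responses

log_y = [math.log(v) for v in y]    # log(y): pulls in a long right tail
sqrt_y = [math.sqrt(v) for v in y]  # √y: a milder version of the same idea
sq_y = [v ** 2 for v in y]          # y²: stretches the upper end instead
exp_y = [math.exp(v) for v in y]    # e^y: the inverse direction of log

print([round(v, 3) for v in log_y])  # evenly spaced once logged
```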

log(y)
(Plots shown for the original response y and the logged response log(y))

√y
(Plots shown for the original response y and the square-root response √y)

y²
(Plots shown for the original response y and the squared response y²)

e^y
(Plots shown for the original response y and the exponentiated response e^y)

Over-fitting
It is possible to over-fit a model: to include too many explanatory variables.
The fewer the coefficients being estimated, the better they will be estimated.
Usually, a good model has pruned out explanatory variables that are not helping.

R²
Adding more explanatory variables will only make R² increase or stay the same. Adding another explanatory variable cannot make the model explain less, because the other variables all remain in the model.
Is the best model always the one with the highest proportion of variability explained, and so the highest R²? (a) Yes (b) No

Adjusted R²
Adjusted R² is like R², but takes into account the number of explanatory variables.
As the number of explanatory variables increases, adjusted R² drops farther below R².
One way to choose a model is to choose the model with the highest adjusted R².
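A minimal sketch of the penalty, using the standard formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1) with made-up numbers:

```python
# Sketch: adjusted R² penalizes extra explanatory variables.
def adjusted_r2(r2, n, k):
    """Adjusted R² for a model with k predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R² = 0.8 on n = 12 cases: more predictors, bigger penalty
print(round(adjusted_r2(0.8, 12, 3), 3))  # 0.725
print(round(adjusted_r2(0.8, 12, 6), 3))  # 0.56
```

Unlike R², adjusted R² can go down when a new variable adds too little explanatory power, which is what makes it usable for model comparison.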

Final Exam Grade

To Come…
How do we decide which explanatory variables to include in the model?
How do we use categorical explanatory variables?
What if the coefficient of one explanatory variable depends on the value of another explanatory variable?

To Do
Read 9.2, 10.1, 10.2
Do Homework 8 (due Thursday, 11/29)
Do Project 2 (poster due Monday, 12/3; paper due 12/6)