Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9.

Slide 2: Statistical Data Analysis: Introduction
Topics:
- Summarizing data
- Investigating distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression

Slide 3: Multiple linear regression (Reader: Chapter 8)
Relationship between one response variable and one or more explanatory variables.
- Statistical model: multiple linear regression model
- Parameter estimation
- Selection of explanatory variables:
  - numerical measures: determination coefficient, partial correlation coefficient
  - tests: F-tests, t-test
- Model quality:
  - global methods / diagnostics
  - several plots

Slide 4: Example
Body fat: difficult and expensive to obtain. Can it be predicted by one or more other, more easily measurable variables?
Possible explanatory variables:
- triceps skin-fold thickness
- thigh circumference
- mid-arm circumference
What kind of relationship? Try the simplest: linear.
First make plot(s) of the available data. Which one(s)?
[Figure: pairwise scatter plots; data of body fat, triceps skin-fold thickness, thigh circumference and mid-arm circumference for twenty healthy females aged 20 to 34.]
Looks promising!

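Slide 5: Statistical model
Multiple linear regression model:
$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i, \quad i = 1, \dots, n$
- $Y_i$: i-th response
- $x_{ij}$: value of j-th explanatory variable corresponding to i-th response
- $e_i$: stochastic "measurement error" for i-th response
- $\beta_0, \beta_1, \dots, \beta_p$: unknown constants; $\beta_0$ is the intercept
Or in matrix notation: $Y = X\beta + e$, where $X$ is the $n \times (p+1)$ design matrix whose first column consists of ones.
Assumption: $e_1, \dots, e_n$ independent and normally distributed with mean 0 and variance $\sigma^2$.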
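
The formulas on this slide are not part of the transcript. Written out in standard notation (assumed here, with $p$ explanatory variables and $n$ observations), the matrix form of the model would look as follows; the column of ones in the design matrix carries the intercept.

```latex
Y = X\beta + e, \qquad
\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
+
\begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}
```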

Slide 6: Statistical model
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Note: response and explanatory variables are continuous.
Other types of models?

Slide 7: Statistical model
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Issues:
1) estimate $\beta$ and $\sigma^2$
2) select explanatory variables
3) assess model quality

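Slide 8: 1) Parameter estimation - $\beta$
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Estimate $\beta$ with least squares: minimize $\|Y - X\beta\|^2 = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2$ w.r.t. $\beta$.
Solution: $\hat\beta = (X^T X)^{-1} X^T Y$ → unbiased estimator of $\beta$.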
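
The least-squares solution can be checked numerically. The following is a minimal sketch in R on synthetic data (the data, seed and variable names are assumptions for illustration, not the body fat data of the lecture); it solves the normal equations $(X^T X)\hat\beta = X^T Y$ directly and compares the result with R's built-in `lm()`.

```r
# Synthetic data for illustration only (NOT the body fat data from the lecture).
set.seed(1)
n <- 20; p <- 3
X0 <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, c("x1", "x2", "x3")))
y  <- drop(1 + X0 %*% c(2, 0, -1) + rnorm(n))    # true beta = (1, 2, 0, -1)

X <- cbind(Intercept = 1, X0)                    # n x (p+1) design matrix
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% y))  # solves (X'X) b = X'y

fit <- lm(y ~ X0)                                # the same fit via lm()
print(cbind(normal_equations = beta_hat, lm = coef(fit)))
```

Both columns should agree up to rounding error. In practice `lm()` is preferable, since it uses a QR decomposition instead of forming $X^T X$ explicitly.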

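Slide 9: 1) Parameter estimation - $\sigma^2$
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
i-th residual: $\hat e_i = Y_i - \hat Y_i$, where $\hat Y = X\hat\beta$.
Residual sum of squares: $RSS = \sum_{i=1}^n \hat e_i^2$.
Estimate $\sigma^2$ by $\hat\sigma^2 = RSS/(n-p-1)$ → unbiased estimator.
Under normality of the $e_i$: $RSS/\sigma^2$ has a chi-square distribution with $n-p-1$ degrees of freedom.
What do residuals tell us? If they are large, the model is "not so good".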
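
As an illustration of the residual-based variance estimate, here is a small R sketch on synthetic data (data and names are assumptions, not taken from the lecture). It computes the residual sum of squares and divides by $n-p-1$, then compares with the value that `summary()` reports.

```r
# Synthetic data for illustration only.
set.seed(2)
n <- 20; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 1.5)

fit <- lm(y ~ x1 + x2)
res <- resid(fit)                 # residuals: observed minus fitted responses
RSS <- sum(res^2)                 # residual sum of squares
sigma2_hat <- RSS / (n - p - 1)   # divide by the residual degrees of freedom

# summary() reports the residual standard error, i.e. sqrt(sigma2_hat)
print(c(RSS = RSS, sigma2_hat = sigma2_hat, from_summary = summary(fit)$sigma^2))
```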

Slide 10: 2) Selection of explanatory variables
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Do more variables explain the variability in the responses better?
Do we want a large model?
Want: the smallest possible model that explains as much of the variability in the responses as possible → contradictory requirements.
Need: a selection criterion/measure for how much variability is explained.

Slide 11: 2) Selection of variables – determination coefficient (1)
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Sum of squares for Y: $SS_Y = \sum_{i=1}^n (Y_i - \bar Y)^2$. What is this?
Sum of squares for regression: $SS_{reg} = SS_Y - RSS$, with $RSS$ the residual sum of squares. What is this?
Determination coefficient: $R^2 = SS_{reg}/SS_Y = 1 - RSS/SS_Y$, the fraction of variability in Y explained by design matrix X.
When is $R^2$ larger, with more or with fewer variables in the model?
What is better, a large or a small $R^2$? What is large?

Slide 12: 2) Selection of variables – determination coefficient (2)
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Determination coefficient $R^2$: fraction of variability in Y explained by design matrix X.
For simple linear regression: $R^2 = \mathrm{cor}(Y, X_1)^2$.
For multiple regression: $R = \sqrt{R^2}$ is the multiple correlation coefficient = largest correlation between Y and any linear combination of the $X_i$'s.
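
The questions about the determination coefficient can be explored numerically. The sketch below uses synthetic data (an assumption, not the lecture's data) to compute the determination coefficient from the sums of squares, to compare it with the squared correlation between the responses and the fitted values, and to show that adding a variable, even pure noise, does not lower it.

```r
# Synthetic data for illustration only.
set.seed(3)
n <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit   <- lm(y ~ x1 + x2)
SSY   <- sum((y - mean(y))^2)     # total sum of squares for Y
RSS   <- sum(resid(fit)^2)        # residual sum of squares
SSreg <- SSY - RSS                # sum of squares for regression
R2    <- SSreg / SSY              # determination coefficient

# For a model with intercept, R2 equals cor(Y, fitted values)^2
print(c(R2 = R2, from_summary = summary(fit)$r.squared,
        cor_squared = cor(y, fitted(fit))^2))

# Adding a variable never lowers R2, even if the variable is pure noise
noise     <- rnorm(n)
R2_bigger <- summary(lm(y ~ x1 + x2 + noise))$r.squared
print(c(R2 = R2, R2_with_noise = R2_bigger))
```

This is why the determination coefficient alone cannot serve as a selection criterion: it always favours the larger model.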

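Slide 13: 2) Selection of variables – overall F-test
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Another scaling of $SS_{reg}$ yields a test statistic for $H_0: \beta_1 = \dots = \beta_p = 0$.
Test statistic: $F = \dfrac{SS_{reg}/p}{RSS/(n-p-1)} \sim F_{p,\,n-p-1}$ under $H_0$ (an F-distribution).
If F is large, it makes sense to include all p variables that are considered.
This is the overall F-test.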
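
A minimal R sketch of the overall F-test on synthetic data (data and names are assumptions for illustration). It assumes the standard form of the statistic, the regression sum of squares divided by $p$ over the residual sum of squares divided by $n-p-1$, and compares the hand-computed value with the one stored in `summary()`.

```r
# Synthetic data for illustration only.
set.seed(4)
n <- 20; p <- 3
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 1.5 * x1 + rnorm(n)

fit    <- lm(y ~ x1 + x2 + x3)
RSS    <- sum(resid(fit)^2)
SSreg  <- sum((y - mean(y))^2) - RSS
F_stat <- (SSreg / p) / (RSS / (n - p - 1))
p_val  <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(c(F = F_stat, p_value = p_val))
print(summary(fit)$fstatistic)    # value, numdf, dendf as computed by R
```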

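Slide 14: 2) Selection of variables – partial F-test
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Should $X_{q+1}, \dots, X_p$ be in the model next to $X_1, \dots, X_q$? That is, test $H_0: \beta_{q+1} = \dots = \beta_p = 0$.
Which sums of squares give an indication? The residual sums of squares $RSS_q$ (model with $X_1, \dots, X_q$ only) and $RSS_p$ (model with all p variables).
Test statistic: $F = \dfrac{(RSS_q - RSS_p)/(p-q)}{RSS_p/(n-p-1)} \sim F_{p-q,\,n-p-1}$ under $H_0$.
This is the partial F-test.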
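
The partial F-test compares a smaller model with a larger one that contains it. The R sketch below, on synthetic data (an assumption), computes the statistic from the two residual sums of squares in its standard form and checks it against `anova()`, which performs the same comparison for nested `lm` fits. The names `q`, `RSS_q` and `RSS_p` are illustrative.

```r
# Synthetic data for illustration only.
set.seed(5)
n <- 20; p <- 3; q <- 1
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.7 * x2 + rnorm(n)

small <- lm(y ~ x1)               # model with the first q variables
full  <- lm(y ~ x1 + x2 + x3)     # model with all p variables
RSS_q <- sum(resid(small)^2)
RSS_p <- sum(resid(full)^2)

F_part <- ((RSS_q - RSS_p) / (p - q)) / (RSS_p / (n - p - 1))
p_val  <- pf(F_part, df1 = p - q, df2 = n - p - 1, lower.tail = FALSE)
print(c(F = F_part, p_value = p_val))

tab <- anova(small, full)         # same test, done by R
print(tab)
```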

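Slide 15: 2) Selection of variables – t-test
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
For testing whether or not one variable $X_k$ should be included: $H_0: \beta_k = 0$.
Test statistic: $T_k = \hat\beta_k / \widehat{se}(\hat\beta_k) \sim t_{n-p-1}$ under $H_0$, where $\widehat{se}(\hat\beta_k)^2$ is $\hat\sigma^2$ times the diagonal element of $(X^T X)^{-1}$ corresponding to $\beta_k$.
Relationship between t and partial F: $T_k^2 = F$, the partial F statistic for adding $X_k$ to the other $p-1$ variables.
Very often used.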
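
The relationship between the t-test and the partial F-test can be verified numerically. This is a sketch on synthetic data (an assumption): the t-statistics are computed from the estimated covariance matrix of the coefficient estimates, and the squared t-statistic of one variable is compared with the partial F statistic for adding that variable last.

```r
# Synthetic data for illustration only.
set.seed(6)
n <- 20; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.8 * x2 + rnorm(n)

fit    <- lm(y ~ x1 + x2)
se     <- sqrt(diag(vcov(fit)))   # vcov(fit) = sigma2_hat * (X'X)^{-1}
t_stat <- coef(fit) / se
p_val  <- 2 * pt(abs(t_stat), df = n - p - 1, lower.tail = FALSE)  # two-sided
print(cbind(t = t_stat, p_value = p_val))

# Partial F for adding x2 to the model that already contains x1
F_x2 <- anova(lm(y ~ x1), fit)$F[2]
print(c(t_squared = unname(t_stat["x2"])^2, partial_F = F_x2))
```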

Slide 16: 2) Selection of variables – partial correlation coefficient
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Linear relationship of Y and $X_k$, corrected for the other $p-1$ variables in the model.
Partial correlation coefficient = $\mathrm{cor}(R_Y, R_{X_k})$, where
- $R_Y$: vector of residuals from regression of Y on all $X_j$ except $X_k$
- $R_{X_k}$: vector of residuals from regression of $X_k$ on all other $X_j$
If large in absolute value: indication that $X_k$ should be included next to the $p-1$ other variables.
Equivalent to the t-test.
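
The equivalence with the t-test can be made concrete. In the R sketch below (synthetic data, an assumption), the partial correlation is computed as the correlation of two residual vectors and then recovered from the t-statistic via the standard identity $r = t / \sqrt{t^2 + (n-p-1)}$.

```r
# Synthetic data for illustration only.
set.seed(7)
n <- 20; p <- 3
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- 0.5 * x1 + rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + 0.8 * x3 + rnorm(n)

# Partial correlation of y and x3, corrected for x1 and x2
res_y  <- resid(lm(y  ~ x1 + x2))   # y regressed on all variables except x3
res_x3 <- resid(lm(x3 ~ x1 + x2))   # x3 regressed on the other variables
pcor   <- cor(res_y, res_x3)

# The same number recovered from the t-statistic of x3 in the full model
t3        <- coef(summary(lm(y ~ x1 + x2 + x3)))["x3", "t value"]
pcor_from_t <- t3 / sqrt(t3^2 + (n - p - 1))

print(c(partial_cor = pcor, from_t = pcor_from_t))
```

Because the partial correlation is a monotone function of the t-statistic, ranking or testing by one is the same as by the other.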

Slide 17: 2) Selection of variables – practice
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
How to select systematically in practice? Two ways:
- build up step by step: determination coefficient, then t-test for the last step
- break down step by step: t-tests, then determination coefficient for the last step
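
One step of each procedure can be sketched with R's `add1()` and `drop1()` on synthetic data (an assumption; the course's own `globalregression` function is not used here). For building up, the candidate with the smallest residual sum of squares is the one with the largest determination coefficient; for breaking down, the F value of each single variable is the square of its t-statistic.

```r
# Synthetic data for illustration only.
set.seed(8)
n   <- 20
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

# Build up: which single variable reduces the RSS most when added?
null <- lm(y ~ 1, data = dat)
fwd  <- add1(null, scope = ~ x1 + x2 + x3, test = "F")
print(fwd)

# Break down: which variable in the full model is least significant?
full <- lm(y ~ x1 + x2 + x3, data = dat)
bwd  <- drop1(full, test = "F")
print(bwd)
```

Repeating such steps until no variable is added or removed gives a build-up or break-down procedure; the stopping rule itself is a modelling choice.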

Slide 18: Example - bodyfat
Build up a model:
Determination coefficients of the univariate regressions of Fat on Triceps, Thigh and Midarm (table values not preserved in the transcript).
First: regression of Fat on Thigh.

Slide 19: Example - bodyfat
R: (data in matrix bf; the numerical output is not preserved in the transcript)
> zglob3 = globalregression(bf[,3],bf[,1])
> zglob3
$RSS
[1]
$detcoef
[1]
$beta       # estimate
Intercept X
$covbeta    # estimate of cov matrix of beta-hat
x x
$sigmakw    # estimate
[1]
$t          # value t-statistics
Intercept X
$pt         # one-sided p-values
[1] e e-07  # Thigh significant at 0.05 (two-sided test)
$F
[1]
$pF
[1] e-07

Slide 20: Example - bodyfat
R: (data in matrix bf)
> zfit3 = lsfit(bf[,3],bf[,1])
> zfit3
$coefficients
Intercept X
$residuals
[1]
[7]
… and some more

Slide 21: Example - bodyfat
Adding one of the other variables:
> zglob32 = globalregression(bf[,c(3,2)],bf[,1])
> zglob34 = globalregression(bf[,c(3,4)],bf[,1])
yields almost the same value for the det.coef: 0.78. Moreover, the coefficient of the additional variable is not significantly different from 0.
So we stop adding variables: building up leads to the univariate model with explanatory variable Thigh.

Slide 22: Example - bodyfat
Breaking down a model shows problems:
Starting with all variables yields determination coefficient = 0.80. But: none of the betas is significantly different from 0!
Breaking down based on the highest p-value first takes out Thigh (!) (det.coef = 0.78).
We leave the remaining variables in: both their coefficients are now significantly different from 0, and taking one of them out lowers the det.coef to 0.71 or 0.02.
Breaking down leads to the bivariate model with explanatory variables Triceps and Midarm.

Slide 23: Example - bodyfat
Building up and breaking down lead to different models. Which is the final model of our choice?
Breaking down leads to a model with one more variable that has only a slightly larger det.coef than the model obtained with the building-up procedure.
So the smaller, univariate model with only Thigh as explanatory variable for the response variable Body fat seems best.
Estimates of its coefficients: intercept (value missing in the transcript), 0.86 (Thigh).
Estimate of its error variance: 6.30.
Thigh explains 77% of the variation in Body fat.

Slide 24: 3) Assessment of model quality
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Is the linear regression model adequate for these data sets?
But: these data sets give the same values of the summary quantities (the symbols listed on the slide are missing from the transcript) if a (simple) linear regression model is fitted.
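
The data sets shown on this slide are not part of the transcript. A well-known example of the same phenomenon is Anscombe's quartet, which ships with R as the data frame `anscombe`; using it here is an assumption, not necessarily the lecture's example. The sketch fits a simple linear regression to each of the four data sets and compares the fitted quantities.

```r
# Anscombe's quartet: four (x, y) data sets stored as columns x1..x4, y1..y4
fits <- lapply(1:4, function(k) {
  lm(anscombe[[paste0("y", k)]] ~ anscombe[[paste0("x", k)]])
})

# Intercept, slope and determination coefficient of each fit
tab <- t(sapply(fits, function(f) {
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    R2        = summary(f)$r.squared)
}))
print(round(tab, 2))

# The scatter plots tell four very different stories
op <- par(mfrow = c(2, 2))
for (k in 1:4) {
  plot(anscombe[[paste0("x", k)]], anscombe[[paste0("y", k)]],
       xlab = paste0("x", k), ylab = paste0("y", k))
  abline(fits[[k]])
}
par(op)
```

The four fits are practically identical in intercept, slope and determination coefficient, yet only one of the scatter plots looks like a linear relationship with random scatter. Global quantities alone cannot reveal this.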

Slide 25: 3) Assessment of model quality - diagnostics
Multiple linear regression model: $Y = X\beta + e$, with $e_1, \dots, e_n$ independent and normally distributed.
Until now: the model, incl. assumptions, was assumed correct.
Now: assessment of model quality, incl. appropriateness of the assumptions.
Globally: with global quantities like $R^2$ and tests; not sufficient.
Diagnostics: investigation with quantities that have a different value for each observation point (= a response combined with its values of the explanatory variables).
First: make suitable plots and investigate deviating points further.

Slide 26: 3) Assessment of model quality - plots
Types of plots:
i) Scatter plot of Y against each explanatory variable.
   Gives overall picture + deviating values.
ii) Added variable plot: scatter plot of residuals from regression of Y on all $X_j$ except $X_k$ against residuals from regression of $X_k$ on all other $X_j$.
   Gives picture of relation between Y and $X_k$ corrected for the other $X_j$ + deviating values (cf. partial correlation coefficient).
iii) Plots based on residuals.

Slide 27: 3) Assessment of model quality - plots
iii) Plots based on residuals:
- Scatter plot of residuals against each explanatory variable.
  If there is a pattern: linear model perhaps not correct.
  Curvature: include a higher-order term of the variable.
  Systematic change in spread: linear model not correct or non-equal variance.
- Scatter plot of residuals against a new explanatory variable.
  If linear relationship: include this variable.
- Scatter plot of residuals against predicted responses.
  If spread increases/decreases: non-equal variance.
- Normal QQ-plot of residuals: checks the assumption of normality of the measurement errors.
Plus: all these plots show deviating individual values.
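
The residual plots listed above can be produced with a few lines of base R. The following is a minimal sketch on synthetic data (an assumption, not the body fat data); the layout and labels are illustrative choices.

```r
# Synthetic data for illustration only.
set.seed(9)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
res <- resid(fit)

op <- par(mfrow = c(2, 2))
plot(x1, res, ylab = "residuals", main = "Residuals vs x1")
abline(h = 0, lty = 2)
plot(x2, res, ylab = "residuals", main = "Residuals vs x2")
abline(h = 0, lty = 2)
plot(fitted(fit), res, xlab = "predicted responses", ylab = "residuals",
     main = "Residuals vs predicted")
abline(h = 0, lty = 2)
qqnorm(res, main = "Normal QQ-plot of residuals")
qqline(res)
par(op)
```

With an intercept in the model, the residuals sum to zero and are uncorrelated with the fitted values by construction, so what matters in these plots is curvature, changing spread and isolated points, not an overall linear trend.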

Slide 28: Example - bodyfat
Model of choice: Body fat = $\beta_0 + \beta_1 \cdot$ Thigh + measurement error.
Some diagnostic checks for this model:
- scatter plot of pairs (above) showed no outliers
- scatter plot of residuals against explanatory variable (below, left)
- scatter plot of residuals against predicted responses (below, middle)
- normality check with normal QQ-plot of residuals (below, right)
None shows a particular pattern or outliers; QQ-plot OK.
Conclusion: we stay with this model.

Slide 29: 3) Assessment of model quality – further diagnostics
Next week: further investigation of
- deviating observation points, with numerical measures and tests
- explanatory variables that are themselves linearly related