Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9.

1 Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2 Statistical Data Analysis: Introduction
Topics:
- Summarizing data
- Investigating distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression

3 Multiple linear regression (Reader: Chapter 8)
Relationship between one response variable and one or more explanatory variables.
- Statistical model: multiple linear regression model
- Parameter estimation
- Selection of explanatory variables:
  - numerical measures: determination coefficient, partial correlation coefficient
  - tests: F-tests, t-test
- Model quality:
  - global methods/diagnostics
  - several plots

4 Example
Body fat: difficult and expensive to obtain. Can it be predicted by one or more other, more easily measurable variables?
Possible explanatory variables:
- triceps skin-fold thickness
- thigh circumference
- mid-arm circumference
What kind of relationship? Try the simplest: linear.
First make plot(s) of the available data. Which one(s)?
[Figure: pairwise scatter plots; data of body fat, triceps skin-fold thickness, thigh circumference and mid-arm circumference for twenty healthy females aged 20 to 34.] Looks promising!

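5 Statistical model
Multiple linear regression model:
$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i$, $i = 1, \dots, n$
- $Y_i$: i-th response
- $x_{ij}$: value of j-th explanatory variable corresponding to i-th response
- $e_i$: stochastic “measurement error” for i-th response
- $\beta_0, \beta_1, \dots, \beta_p$: unknown constants; $\beta_0$ is the intercept
Or in matrix notation: $Y = X\beta + e$, with $X$ the $n \times (p+1)$ design matrix.
Assumption: $e_1, \dots, e_n$ independent and normally distributed with mean 0 and common variance $\sigma^2$.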

6 Statistical model
Multiple linear regression model: $Y = X\beta + e$, with the $e_i$ independent and normally distributed.
Note: response and explanatory variables are continuous.
Other types of models?

7 Statistical model
Multiple linear regression model: $Y = X\beta + e$, with the $e_i$ independent and normally distributed.
Issues:
1) estimate $\beta$ and $\sigma^2$
2) select explanatory variables
3) assess model quality

8 1) Parameter estimation - $\beta$
Estimate $\beta$ with least squares: minimize $\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2 = \|Y - X\beta\|^2$ w.r.t. $\beta$.
Solution: $\hat\beta = (X^T X)^{-1} X^T Y$ → unbiased estimator of $\beta$.
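A minimal R sketch of the least-squares estimate, computed once from the normal equations and once with `lm()`. The data are simulated for illustration only (the names `x1`, `x2`, `y` and the coefficients used to generate them are assumptions, not the body fat data of the lecture).

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n  <- 20
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 5)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

# Design matrix with a column of ones for the intercept
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Least-squares solution of the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The same estimates from R's built-in fitting function
fit <- lm(y ~ x1 + x2)
print(cbind(normal_eq = drop(beta_hat), lm = coef(fit)))
```

Solving the normal equations directly mirrors the formula; `lm()` uses a QR decomposition internally, which is numerically more stable and is the better choice in practice.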

9 1) Parameter estimation - $\sigma^2$
i-th residual: $\hat e_i = Y_i - \hat Y_i$, with $\hat Y = X\hat\beta$.
Residual sum of squares: $RSS = \sum_{i=1}^{n} \hat e_i^2$.
Estimate $\sigma^2$ by $\hat\sigma^2 = RSS/(n-p-1)$ → unbiased estimator.
Under normality of the $e_i$: $RSS/\sigma^2$ has a chi-square distribution with $n-p-1$ degrees of freedom.
What do residuals tell us? If large, the model is “not so good”.
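As an illustration, the residual sum of squares and the variance estimate RSS/(n-p-1) can be computed by hand and compared with what `summary()` reports. This is a sketch on simulated data; all variable names and generating values are assumptions.

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n  <- 20
p  <- 2                      # number of explanatory variables
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 5)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

res        <- residuals(fit)        # observed minus fitted responses
RSS        <- sum(res^2)            # residual sum of squares
sigma2_hat <- RSS / (n - p - 1)     # estimate of the error variance

# summary(fit)$sigma is the square root of the same estimate
print(c(RSS = RSS, sigma2_hat = sigma2_hat,
        from_summary = summary(fit)$sigma^2))
```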

10 2) Selection of explanatory variables
Do more variables explain the variability in the responses better?
Do we want a large model?
Want: smallest possible model that explains the variability in the responses as much as possible → contradictory requirements.
Need: selection criterion/measure for how much variability is explained.

11 2) Selection of variables – determination coefficient (1)
Sum of squares for Y: $SS_Y = \sum_{i=1}^{n} (Y_i - \bar Y)^2$. What is this?
Sum of squares for regression: $SS_{reg} = SS_Y - RSS$. What is this?
Determination coefficient: $R^2 = SS_{reg}/SS_Y = 1 - RSS/SS_Y$, the fraction of variability in Y explained by design matrix X.
When is $R^2$ larger, with more or with fewer variables in the model?
What is better, large or small $R^2$? What is large?

12 2) Selection of variables – determination coefficient (2)
Determination coefficient $R^2$: fraction of variability in Y explained by design matrix X.
For simple linear regression: $R^2 = \mathrm{cor}(Y, X_1)^2$.
For multiple regression: $R = \sqrt{R^2}$ is the multiple correlation coefficient = largest correlation between Y and any linear combination of the $X_i$s.
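The properties of the determination coefficient can be checked numerically. The sketch below (simulated data, hypothetical variable names) computes it from the sums of squares, compares its square root with the correlation between the responses and the fitted values, and shows that dropping a variable cannot increase it.

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n  <- 20
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 5)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

SSY <- sum((y - mean(y))^2)       # total sum of squares of the responses
RSS <- sum(residuals(fit)^2)      # residual sum of squares
R2  <- (SSY - RSS) / SSY          # determination coefficient

# Multiple correlation coefficient: the fitted values are the linear
# combination of the explanatory variables most correlated with y
R <- cor(y, fitted(fit))

# Determination coefficient of the smaller model with x1 only
R2_small <- summary(lm(y ~ x1))$r.squared

print(c(R2 = R2, R_squared = R^2, R2_small = R2_small))
```

Because the determination coefficient never decreases when a variable is added, it cannot be the only criterion for choosing between models of different size.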

13 2) Selection of variables – overall F-test
Another scaling of $SS_{reg}$ yields a test statistic for $H_0: \beta_1 = \dots = \beta_p = 0$.
Test statistic: $F = \frac{SS_{reg}/p}{RSS/(n-p-1)} \sim F_{p,\,n-p-1}$ under $H_0$ (an F-distribution).
If $F$ is large, it makes sense to include all p variables that are considered.
Overall F-test.
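A short sketch of the overall F-test on simulated data (variable names and generating values are assumptions): the statistic is built from the regression and residual sums of squares and compared with the value stored by `summary()`.

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n  <- 20
p  <- 2
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 5)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

SSY   <- sum((y - mean(y))^2)
RSS   <- sum(residuals(fit)^2)
SSreg <- SSY - RSS

# Overall F statistic and its p-value under H0: all slopes are zero
F_stat  <- (SSreg / p) / (RSS / (n - p - 1))
p_value <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# summary() stores the same statistic as c(value, numdf, dendf)
fs <- summary(fit)$fstatistic
print(c(F_manual = F_stat, F_lm = unname(fs["value"]), p_value = p_value))
```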

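14 2) Selection of variables – partial F-test
Next to $X_1, \dots, X_q$ also $X_{q+1}, \dots, X_p$ in the model? That is, test $H_0: \beta_{q+1} = \dots = \beta_p = 0$.
Which sums of squares give an indication? The residual sums of squares $RSS_q$ (model with $X_1, \dots, X_q$ only) and $RSS_p$ (model with all p variables).
Test statistic: $F = \frac{(RSS_q - RSS_p)/(p-q)}{RSS_p/(n-p-1)} \sim F_{p-q,\,n-p-1}$ under $H_0$.
Partial F-test.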
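A sketch of a partial F-test in R, comparing a smaller model with a larger one that contains it. The data are simulated and the names `small`, `full`, `q` and `p` are illustrative assumptions; `anova()` on two nested `lm` fits performs the same comparison.

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n  <- 20
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 5)
x3 <- rnorm(n, mean = 28, sd = 3)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

small <- lm(y ~ x1)              # q = 1 explanatory variable
full  <- lm(y ~ x1 + x2 + x3)    # p = 3 explanatory variables
q <- 1
p <- 3

RSS_small <- deviance(small)     # residual sum of squares, small model
RSS_full  <- deviance(full)      # residual sum of squares, full model

# Partial F statistic for adding x2 and x3 next to x1
F_partial <- ((RSS_small - RSS_full) / (p - q)) / (RSS_full / (n - p - 1))
p_value   <- pf(F_partial, p - q, n - p - 1, lower.tail = FALSE)

# Built-in comparison of the two nested models
tab <- anova(small, full)
print(tab)
print(c(F_partial = F_partial, p_value = p_value))
```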

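15 2) Selection of variables – t-test
For testing whether or not one variable $X_k$ should be included: $H_0: \beta_k = 0$.
Test statistic: $T_k = \hat\beta_k / \widehat{\mathrm{se}}(\hat\beta_k) \sim t_{n-p-1}$ under $H_0$, where $\widehat{\mathrm{se}}(\hat\beta_k)^2$ is the diagonal element of $\hat\sigma^2 (X^T X)^{-1}$ that corresponds to $\beta_k$.
Relationship t and partial F: $T_k^2$ equals the partial F statistic for adding $X_k$ to the other p-1 variables.
Very often used.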
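The t statistics and their relation to the partial F-test can be reproduced as follows. This is a sketch on simulated data with hypothetical variable names; the estimated covariance matrix of the coefficient estimates is computed explicitly rather than taken from `summary()`.

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n  <- 20
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 5)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)                      # design matrix incl. intercept

sigma2_hat <- deviance(fit) / df.residual(fit)
cov_beta   <- sigma2_hat * solve(crossprod(X))   # estimated cov matrix of beta-hat

# t statistic per coefficient and two-sided p-values
t_stat      <- coef(fit) / sqrt(diag(cov_beta))
p_two_sided <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# Partial F statistic for adding x2 to the model that contains x1
F_x2 <- anova(lm(y ~ x1), fit)$F[2]

print(rbind(t = t_stat, p = p_two_sided))
print(c(t_x2_squared = unname(t_stat["x2"]^2), F_x2 = F_x2))
```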

16 2) Selection of variables – partial correlation coefficient
Linear relationship of Y and $X_k$, corrected for the other p-1 variables in the model.
Partial correlation coefficient = cor($R_Y$, $R_{X_k}$), where
- $R_Y$: vector of residuals from regression of Y on all $X_j$ except $X_k$
- $R_{X_k}$: vector of residuals from regression of $X_k$ on all other $X_j$
If large (in absolute value): indication that $X_k$ should be included next to the p-1 other variables.
Equivalent to t-test.
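A sketch of the partial correlation coefficient as the correlation between two residual vectors, on simulated data with hypothetical names. The equivalence with the t-test is made concrete through the identity r = t / sqrt(t^2 + n - p - 1), which holds for the t statistic of the same variable in the full model.

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n  <- 20
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 5)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

# Residuals of y and of x2 after regressing each on the other variable x1
r_y  <- residuals(lm(y ~ x1))
r_x2 <- residuals(lm(x2 ~ x1))

# Partial correlation of y and x2, corrected for x1
pcor <- cor(r_y, r_x2)

# t statistic of x2 in the full model and its residual degrees of freedom
fit  <- lm(y ~ x1 + x2)
t_x2 <- summary(fit)$coefficients["x2", "t value"]
df   <- df.residual(fit)          # n - p - 1

print(c(partial_cor = pcor, from_t = t_x2 / sqrt(t_x2^2 + df)))
```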

17 2) Selection of variables – practice
How to select systematically in practice? Two ways:
- build up step by step: determination coefficient, then t-test for last step
- break down step by step: t-tests, then determination coefficient for last step
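The first two steps of a build-up procedure could be sketched as below. This is one possible reading of "determination coefficient, then t-test", on simulated data; the data frame `dat`, the candidate names and the layout of the result are assumptions for illustration, not the procedure used in the lecture.

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n   <- 20
dat <- data.frame(x1 = rnorm(n, 25, 5),
                  x2 = rnorm(n, 50, 5),
                  x3 = rnorm(n, 28, 3))
dat$y <- 2 + 0.5 * dat$x1 + 0.3 * dat$x2 + rnorm(n)

candidates <- c("x1", "x2", "x3")

# Step 1: determination coefficient of each univariate regression
r2 <- sapply(candidates, function(v) {
  summary(lm(reformulate(v, response = "y"), data = dat))$r.squared
})
first <- names(which.max(r2))     # start with the best single variable

# Step 2: add each remaining variable in turn; record the new determination
# coefficient and the p-value of the t-test for the added coefficient
rest  <- setdiff(candidates, first)
step2 <- t(sapply(rest, function(v) {
  s <- summary(lm(reformulate(c(first, v), response = "y"), data = dat))
  c(R2 = s$r.squared, p_added = s$coefficients[v, "Pr(>|t|)"])
}))

print(round(r2, 3))
print(first)
print(round(step2, 4))
```

A variable would be added only if it raises the determination coefficient noticeably and its coefficient differs significantly from 0; otherwise the procedure stops.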

18 Example - bodyfat
Build up a model.
Determination coefficients of the univariate regressions of Fat on:
  Triceps   Thigh   Midarm
  0.71      0.77    0.02
First: regression of Fat on Thigh.

19 Example - bodyfat
R: (data in matrix bf)
> zglob3 = globalregression(bf[,3],bf[,1])
> zglob3
$RSS
[1] 113.4237
$detcoef
[1] 0.7710414
$beta   #estimate
  Intercept           X
-23.6344891   0.8565466
$covbeta   #estimate of cov matrix of beta-hat
                      x
  32.0063293 -0.61933288
x -0.6193329  0.01210344
$sigmakw   #estimate
[1] 6.301316...
$t   #value t-statistics
Intercept         X
-4.177614  7.785681
$pt_   #onesided p-values
[1] 5.656662e-04 3.599996e-07
# Thigh significant at 0.05
# (two-sided test)
$F
[1] 60.61684
$pF
[1] 3.599996e-07
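The printed quantities are linked by the formulas of the previous slides, which can be checked with a few lines of arithmetic. The sketch below only uses numbers copied from the output above, together with n = 20 (the twenty subjects of the example) and p = 1; it does not call `globalregression`, which is a course-specific function.

```r
# Numbers copied from the printed output for the regression of Fat on Thigh
RSS      <- 113.4237
detcoef  <- 0.7710414
beta_x   <- 0.8565466
var_beta <- 0.01210344    # lower-right entry of $covbeta
n <- 20                   # twenty subjects in the body fat example
p <- 1                    # one explanatory variable (Thigh)

sigma2_hat <- RSS / (n - p - 1)                              # compare with $sigmakw
t_x        <- beta_x / sqrt(var_beta)                        # compare with $t
F_stat     <- (detcoef / p) / ((1 - detcoef) / (n - p - 1))  # compare with $F
p_F        <- pf(F_stat, p, n - p - 1, lower.tail = FALSE)   # compare with $pF

print(c(sigma2_hat = sigma2_hat, t_x = t_x,
        t_x_squared = t_x^2, F_stat = F_stat, p_F = p_F))
```

With a single explanatory variable the squared t statistic equals the overall F statistic, which is consistent with the p-value listed for X under the t statistics coinciding with `$pF`.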

20 Example - bodyfat
R: (data in matrix bf)
> zfit3= lsfit(bf[,3],bf[,1])
> zfit3
$coefficients
  Intercept           X
-23.6344891   0.8565466
$residuals
 [1] -1.3826690  3.7784688 -2.1202790 -2.7759908  0.3882229 -0.8333722
 [7]  0.6265135  4.4084117  2.1928142 -2.8907536  0.5539520  2.2682973
… and some more

21 Example - bodyfat
Adding one of the other variables:
> zglob32 = globalregression(bf[,c(3,2)],bf[,1])
> zglob34 = globalregression(bf[,c(3,4)],bf[,1])
yields almost the same value for the det.coef: 0.78; moreover, the coefficient of the additional variable is not significantly different from 0.
So we stop adding variables.
Building up leads to a univariate model with explanatory variable Thigh.

22 Example - bodyfat
Breaking down a model shows problems:
- Starting with all variables yields determination coefficient = 0.80. But: none of the betas is significantly different from 0!
- Breaking down based on highest p-value first takes out Thigh (!) (det.coef = 0.78).
- We leave the remaining variables in: both their coefficients are now significantly different from 0, and taking them out lowers the det.coef to 0.71 or 0.02.
Breaking down leads to a bivariate model with explanatory variables Triceps and Midarm.

23 Example - bodyfat
Building up and breaking down lead to different models. Which is the final model of our choice?
Breaking down leads to a model with one variable more that has only a slightly larger det.coef than the model obtained with the building-up procedure.
So the smaller, univariate model with only Thigh as explanatory variable for response variable Body fat seems best.
- Estimates of its coefficients: -23.63 (intercept), 0.86 (Thigh)
- Estimate of its error variance: 6.30
- Thigh explains 77% of the variation in Body fat

24 3) Assessment of model quality
Is the linear regression model adequate for these data sets?
But: these data sets have the same values of global quantities (such as $\hat\beta$, $\hat\sigma^2$ and $R^2$) if a (simple) linear regression model is fitted.

25 3) Assessment of model quality - diagnostics
Until now: the model, incl. its assumptions, was taken to be correct.
Now: assessment of model quality, incl. appropriateness of the assumptions.
Globally: with global quantities such as $R^2$ and with tests; not sufficient.
Diagnostics: investigation with quantities that have a different value for each observation point (= response $Y_i$ combined with the corresponding values $x_{i1}, \dots, x_{ip}$ of the explanatory variables).
First: make suitable plots and investigate deviating points further.

26 3) Assessment of model quality - plots
Types of plots:
i) Scatter plot of Y against each explanatory variable. Gives overall picture + deviating values.
ii) Added variable plot: scatter plot of residuals from regression of Y on all $X_j$ except $X_k$ against residuals from regression of $X_k$ on all other $X_j$. Gives picture of relation between Y and $X_k$ corrected for the other $X_j$ + deviating values (cf. partial correlation coefficient).
iii) Plots based on residuals.

27 3) Assessment of model quality - plots
iii) Plots based on residuals:
- Scatter plot of residuals against each explanatory variable. If pattern: linear model perhaps not correct. Curvature: include higher order of variable. Systematic spread: linear model not correct or non-equal variance.
- Scatter plot of residuals against a new explanatory variable. If linear relationship: include this variable.
- Scatter plot of residuals against predicted responses. If spread increases/decreases: non-equal variance.
- Normal QQ-plot of residuals. Checks the assumption of normality of the measurement errors.
Plus: all these plots show deviating individual values.
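The plots listed above can be produced with base R graphics. The following is a sketch on simulated data (all variable names are assumptions); it also draws an added variable plot for `x2` and fits a line through it, whose slope equals the coefficient of `x2` in the full model.

```r
# Simulated data (hypothetical; not the body fat data from the lecture)
set.seed(1)
n  <- 20
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 5)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# When run as a script, use a null device so no Rplots.pdf is written
if (!interactive()) pdf(NULL)
op <- par(mfrow = c(2, 2))

# Residuals against an explanatory variable: look for curvature or fanning
plot(x1, res, xlab = "x1", ylab = "residuals")
abline(h = 0, lty = 2)

# Residuals against predicted responses: look for changing spread
plot(fitted(fit), res, xlab = "predicted responses", ylab = "residuals")
abline(h = 0, lty = 2)

# Normal QQ-plot of the residuals: check normality of the errors
qqnorm(res)
qqline(res)

# Added variable plot for x2, corrected for x1
r_y  <- residuals(lm(y ~ x1))
r_x2 <- residuals(lm(x2 ~ x1))
plot(r_x2, r_y, xlab = "x2 corrected for x1", ylab = "y corrected for x1")
av <- lm(r_y ~ r_x2)
abline(av)

par(op)
if (!interactive()) dev.off()

print(c(slope_added_variable = unname(coef(av)[2]),
        beta_x2 = unname(coef(fit)["x2"])))
```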

28 Example - bodyfat
Model of choice: Bodyfat = -23.63 + 0.86 Thigh + measurement error
Some diagnostic checks for this model:
- scatter plot of pairs (above) showed no outliers
- scatter plot of residuals against explanatory variable (below, left)
- scatter plot of residuals against predicted responses (below, middle)
- normality check with normal QQ-plot of residuals (below, right)
None shows a particular pattern or outliers; QQ-plot OK.
Conclusion: we stay with this model.

29 3) Assessment of model quality – further diagnostics
Next week: further investigation of
- deviating observation points, with numerical measures and tests
- explanatory variables that are themselves linearly related

