Statistical Data Analysis - Lecture 17 - 29/04/03


1 Statistical Data Analysis - Lecture 17 - 29/04/03
Regression
We have looked at ANOVA models for a single response with one or two factors. Sometimes these factors are known as predictors, although that term is usually reserved for continuous variables. What do we do when we have a single response and one (or more) variables we believe might be related to the response? One solution is linear regression and least squares.

2 Simple linear regression
When we have a single response and a single predictor (or explanatory) variable, depending on what we're interested in, we might fit a simple linear regression model. Why would we do such a thing?
- If we plotted the (possibly transformed) data and saw that they were related in some linear fashion
- If we're interested in making some prediction of the response based on experimental or observational data
- If we think that a line would be an adequate summary of the data

3 Statistical Data Analysis - Lecture 17 - 29/04/03
What does linear mean?
Linear = line (we will see later on that this concept must be extended to higher dimensions). Technically, linear refers to the coefficients in the regression model. What is this regression model we keep referring to? In school (hopefully) we learned a number of ways to describe a straight line. A straight line can be described by two points on a graph – the line passes through the points (x1, y1) and (x2, y2).

4 Statistical Data Analysis - Lecture 17 - 29/04/03
[Figure: a straight line through the points (x1, y1) and (x2, y2), with labelled x- and y-axes]
We can also describe the line in terms of a slope and an intercept. The slope is the change in the y-value for a unit change in the x-value; in this simple situation we can think of it as the change in the height of the line as we progress along the x-axis. The intercept is the height of the line when x = 0.
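
As a small sketch in R (with hypothetical points, not taken from the lecture), the slope and intercept can be recovered from any two points on the line:

# Hypothetical points on a line (not from the lecture data)
x1 <- 1; y1 <- 3
x2 <- 4; y2 <- 9
slope <- (y2 - y1) / (x2 - x1)   # change in y per unit change in x: here 2
intercept <- y1 - slope * x1     # height of the line at x = 0: here 1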

5 Simple linear regression
Perhaps now we have the tools to begin to write down a model. Generally we have more than two points to work with; ideally we wouldn't fit a regression model to a data set with fewer than thirty points. We have a number of responses, yi, and an associated measurement xi (which we assume is taken without measurement error) which we think explains our response. However, the points (usually) don't lie exactly on a straight line – there is a bit of "noise" associated with each measurement – does this sound familiar?

6 A probability model for simple linear regression
The response variable, Y, is described by an intercept and a coefficient for the predictor X:
yi = beta0 + beta1*xi + ei
If we subtract the model from the data we expect to find residuals which are independently and identically normally distributed with mean zero and standard deviation sigma. How do we find the intercept and the coefficient for the predictor X?
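
A minimal sketch of this probability model in R, with made-up parameter values (beta0 = 3, beta1 = 5, sigma = 2 are illustrative, not from the lecture):

# Simulate data from yi = beta0 + beta1*xi + ei, ei ~ N(0, sigma^2)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)                 # predictor, assumed measured without error
beta0 <- 3; beta1 <- 5; sigma <- 2   # hypothetical true values
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)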

7 The method of least squares
We have a "theoretical" model which we believe describes the data. How do we "fit" the model to our data? That is, how do we find the slope and intercept for our model? These ideas are best illustrated with some data. Our data set has 50 observations, with a response Y and a predictor X. Given this is bivariate data, the first thing we do is plot it.
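
Continuing the simulated sketch above (standing in for the lecture's 50-observation data set):

plot(x, y)   # first look at the bivariate data: do the points cluster around a line?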

8 Statistical Data Analysis - Lecture 17 - 29/04/03
We can see from our plot that the data points seem to cluster around a straight line. Maybe a regression model is appropriate.

9 Statistical Data Analysis - Lecture 17 - 29/04/03
We could try fitting a line "by eye", but everyone's best guess would probably be different. We want consistency.

10 Statistical Data Analysis - Lecture 17 - 29/04/03
Least squares
The least squares procedure is a method for fitting regression lines. It attempts to find the intercept and the slope such that the residual sum of squares is minimised. I.e. find beta0-hat and beta1-hat such that
sum from i = 1 to n of (yi - beta0 - beta1*xi)^2
is minimised. The minimum value of this function is zero, but this is hardly ever achieved. The least squares fitted values are denoted yi-hat.
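
The minimising values have a well-known closed form; a sketch using the simulated data above, checked against R's built-in lm():

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept
fit <- lm(y ~ x)   # R's least squares fit
coef(fit)          # should agree with c(b0, b1)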

11 Statistical Data Analysis - Lecture 17 - 29/04/03
We try to find a slope and intercept such that the red and green lines (the residual distances shown on the plot) are as small as possible. I.e. we try and minimise the vertical distances from the points to the fitted values – note that least squares minimises vertical distances, not perpendicular distances to the line.

12 Fitted values and residuals
After we fit the regression line, we get fitted coefficient values for the slope and the intercept. If we apply these to the data, we get the fitted values, i.e. yi-hat = beta0-hat + beta1-hat*xi. Corresponding to the fitted values are the residuals, our estimates of the errors in the model, i.e. ei = yi - yi-hat. We use the residuals to assess model fit, and to check the model assumptions.
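
In R these come straight from the fitted model object (continuing the sketch):

y_hat <- fitted(fit)   # yi-hat = b0 + b1*xi
res <- resid(fit)      # ei = yi - yi-hat
head(cbind(y, y_hat, res))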

13 Checking model assumptions
Recall that when we proposed the regression model we made an assumption: namely, that the errors are normally distributed with mean zero and variance sigma squared. This means, as in ANOVA, we need to check whether the residuals are normally distributed and whether the variances are equal amongst residuals. We've seen how to check for normality – a norplot (normal probability plot).
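
A quick check in R, continuing the sketch:

qqnorm(res)   # norplot: sample quantiles of the residuals vs normal quantiles
qqline(res)   # points near a straight line support the normality assumption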

14 Statistical Data Analysis - Lecture 17 - 29/04/03
Our norplot follows a straight line reasonably well, therefore we might think our assumption of normality is satisfied. The intercept of a fitted line on this plot is zero – what does that mean? The slope is 5.62 – what does that mean? (On a norplot the intercept estimates the mean of the residuals, so zero is consistent with mean-zero errors, and the slope estimates their standard deviation.)

15 Statistical Data Analysis - Lecture 17 - 29/04/03
Equality of variance
It is possible to have normality without having equality of variance, i.e. in some situations we fit the model yi = beta0 + beta1*xi + ei with ei ~ N(0, sigma_i^2), where each error has its own variance. However, we did not fit this model to this set of data; we assumed that we had equal variances for every error, i.e. ei ~ N(0, sigma^2) for all i. We check this assumption, as before, with a pred-res plot. This time, however, we will generally see less patterning, because the data are not grouped.
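
The pred-res plot for the sketch data:

# Pred-res plot: residuals against predicted (fitted) values
plot(y_hat, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # look for funnels, trends, and asymmetry about zero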

16 Statistical Data Analysis - Lecture 17 - 29/04/03
[Figure: "Residuals vs Fitted" plot for lm(formula = y ~ x); fitted values run from about 20 to 100, residuals from about -10 to 15, with a few extreme observations labelled.]

17 Statistical Data Analysis - Lecture 17 - 29/04/03
What do we look for in a pred-res plot?
- Extreme residuals – our estimated standard deviation of the residuals is 5.6
- More negative residuals than positive, or vice versa
- Strong patterns or trends in the residuals
What do these features mean? If we have extreme residuals then there are a number of possible reasons:
- An outlier or a data entry error
- Poor model fit
- Possible high leverage points elsewhere
If there is a disproportionate ratio of positive to negative residuals then we may have a skewed response variable.

18 Interpreting pred-res plots
If we have strong patterns in the pred-res plot then this can mean a number of things:
- The equality of variance assumption has been violated – this is usually shown by a funnel shape in the plot
- The simple linear model did not explain the trend in the data, i.e. there is some trend that still exists in the data which might require the addition of extra model terms – this is more likely in multiple regression
- The data require transformation before a linear model is appropriate

19 Statistical Data Analysis - Lecture 17 - 29/04/03
This funnel effect is evidence of "non-homogeneity of variance". Usually we can get around it by transforming the data or fitting a different model. It is never valid to proceed from this point to the interpretation of the regression coefficients.
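
A made-up example showing how non-constant variance produces the funnel (not the lecture's data):

# Heteroscedastic sketch: the noise grows with x
set.seed(2)
x2 <- runif(100, 1, 10)
y2 <- 3 + 5 * x2 + rnorm(100, sd = x2)   # sd of the error increases with x
f2 <- lm(y2 ~ x2)
plot(fitted(f2), resid(f2))              # residual spread widens: a funnel shape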

20 Statistical Data Analysis - Lecture 17 - 29/04/03
We usually see this type of effect when the true trend is non-linear. The actual model here is y = exp(5x). This is definitely a non-linear model. Taking logs would cure this problem – why? Because log(y) = 5x, which is linear in x.
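
A sketch of the cure, assuming multiplicative noise on the exponential trend (an illustrative choice, not stated in the lecture):

# Exponential trend straightens out on the log scale
set.seed(3)
x3 <- runif(100, 0, 1)
y3 <- exp(5 * x3 + rnorm(100, sd = 0.2))   # y = exp(5x) with multiplicative noise
lm(log(y3) ~ x3)                           # linear fit on the log scale; slope near 5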

21 Statistical Data Analysis - Lecture 17 - 29/04/03
Here we have more negative residuals than positive, and the extreme residuals are all positive as well. This says the errors are skewed, which violates our assumption of normality. The real model here was y = 5x + 3 + e, e ~ exp(50).
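
A sketch reproducing this pattern with right-skewed errors (reading e ~ exp(50) as exponentially distributed errors with mean 50; the exact rate/mean convention is an assumption here):

# Right-skewed (exponential) errors: many small negatives, a few large positives
set.seed(4)
x4 <- runif(100, 0, 10)
y4 <- 5 * x4 + 3 + rexp(100, rate = 1/50)   # exponential errors, mean 50 under this reading
f4 <- lm(y4 ~ x4)
plot(fitted(f4), resid(f4))                 # the fit absorbs the error mean; residuals skew right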

