# The Simple Regression Model

The Simple Regression Model
Interval Estimation, Section 15.3 (also read confidence and prediction intervals); Correlation, Section 15.4; Estimation and Tests, Sections 15.5, 15.6, and 15.7

The Standard Error of the Estimate - Se or Sy.x
The least squares method fits a line that minimizes the sum of squared distances between the observed y and the predicted y, the SSE. We would like a statistic that measures the variability of the observed y values around the sample regression line. This statistic, the standard error of the estimate, is also our estimate of the scatter of the y values in the population around the population regression line; that is, it is an estimate of σy|x. PP 9

Standard Error of the Estimate
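Se is an estimate of σy|x. [Figure: Y plotted against X at X = 20, 50, and 80, showing the conditional means E(Y|X=20), E(Y|X=50), and E(Y|X=80) and the spread σy|x around them; panels A (sample), B (sample), and C (population).]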

Standard Error of the Estimate
The standard error has the units of the dependent variable, y. The formula requires us to find, first, the predicted value for each observation in the data set and, second, the error term for that observation. The error, or residual, for an observation is the observed value minus the predicted value: $e_i = y_i - \hat{y}_i$.

Calculating a Residual
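For example, the observation with xi = 54 and yi = 85 has predicted value 125.2, so ei = 85 − 125.2 = −40.2; the observation with yi = 9 has predicted value 37.5, so ei = 9 − 37.5 = −28.5. (The slide also lists an observation with the entries 40 and 165, for which the remaining entries are missing.) To calculate the standard error of the estimate for a sample, use $S_e = \sqrt{\mathrm{SSE}/(n-2)} = \sqrt{\sum (y_i - \hat{y}_i)^2 / (n-2)}$. For a given x value, you should be able to calculate the error term. However, to calculate the standard error of the estimate, you will want to use the faster computational formula.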

Calculating the Standard Error of the Estimate
Substituting the sample values into the formula gives Se in units of deaths per 1,000 live births. Because we chose b0 and b1 to minimize the SSE, we were implicitly minimizing the standard error of the estimate.
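
The calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five (x, y) pairs are hypothetical, not the immunization data from the slides, and the function names `fit_line` and `standard_error_of_estimate` are made up for this example.

```python
import math

def fit_line(x, y):
    """Return the least squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def standard_error_of_estimate(x, y):
    """Se = sqrt(SSE / (n - 2)), in the units of y."""
    b0, b1 = fit_line(x, y)
    # Residual e_i = observed y_i minus predicted y_i.
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (len(x) - 2))

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
se = standard_error_of_estimate(x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, Se = {se:.4f}")
```

The divisor is n − 2 rather than n because two parameters, b0 and b1, were estimated from the same sample.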

The Coefficient of Determination
We want a measure of how well the independent variable predicts the dependent variable. It should answer the following question: of the total variation among the y's, how much can be attributed to the relationship between X and Y, and how much can be attributed to chance?

The Coefficient of Determination
By total variation among the y's, we mean the changes in Y from one sample observation to another. Why do the values of Y differ from observation to observation? According to our hypothesized regression model, the variation in Y is partly due to changes in X, which lead to changes in the expected value of Y, and partly due to chance, that is, the effect of the random error term.

The Coefficient of Determination
We ask how much of the observed variation in Y can be attributed to the variation in X and how much is due to other factors (error). To do so, we first define the "sample variation of Y". If there were no variation in Y, all the values of Y, when plotted against X, would lie on a horizontal straight line at the average value of Y.

[Figure: no variation in Y; plotted against X, every observation lies on a horizontal line at the average value of Y.]

The Coefficient of Determination
In reality, the observed values of Y are scattered around this line. Variation in Y can be measured as the distance of each observed yi from the average of Y. [Figure: scatter of points (xi, yi) against X around the horizontal line at the average of Y.]

SST = SSR + SSE. Total variation (SST) can be decomposed into explained variation (SSR) and unexplained variation (SSE). [Figure: decomposition of the variation for an observation yi at Xi.]

Coefficient of Determination or R²
R² is the proportion of the variation of Y that can be attributed to the variation of X: R² = SSR/SST, or equivalently R² = 1 − SSE/SST. To see this, divide SST = SSR + SSE by SST: SST/SST = SSR/SST + SSE/SST, so 1 = R² + SSE/SST.
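
The decomposition can be checked numerically. The sketch below uses hypothetical data (not the mortality and immunization sample) and an illustrative helper name, `sums_of_squares`; it computes SST, SSR, and SSE separately and confirms that both expressions for R² agree.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained variation
    return sst, ssr, sse

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sst, ssr, sse = sums_of_squares(x, y)
r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {r2_from_ssr:.2f} (SSR/SST) = {r2_from_sse:.2f} (1 - SSE/SST)")
```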

Coefficient of Determination or R²
R² describes how well the sample regression line fits the observed data. It tells us the proportion of the total variation in the dependent variable explained by variation in the explanatory variable. R² is an index with no units, and 0 ≤ R² ≤ 1.

Interpreting R²
R² = 1 indicates a perfect fit. An R² close to zero indicates a very poor fit of the regression line to the data, with R² = 0 as the extreme case.

Computational Formulas
SSR = SST − SSE. R² = SSR/SST = 0.6874. Interpreting R² in terms of our problem: 68.74% of the variation in mortality rates is explained by variation in immunization rates. Here is an interesting question: if we have the R² value and want to know the correlation coefficient (the R value), can we determine it? Not completely: R² alone does not tell us the sign of the correlation coefficient. We can determine the sign of the relationship between the two variables by looking at the sign of the estimated slope of the regression equation.

Interpretation of R2 as a Descriptive Statistic
Suppose we find a very low R² for a given sample. This implies that the sample regression line fits the observations poorly. One possible explanation is that X is a poor explanatory variable. This is a statement about the population regression line: it says the population regression line is horizontal. We can test this with reference to the sample data. The null hypothesis is H0: β1 = 0. If we do not reject this null hypothesis, we find that Y is influenced only by the random error term. Another explanation of a low R² is that X is a relevant explanatory variable, but its influence on Y is weak compared to the influence of the error term.

Pearson’s Correlation Coefficient
Correlation is used to measure the strength of the linear association between two variables. The correlation coefficient is an index: it has no units of measurement, and it carries a positive or negative sign. Its boundaries are −1 ≤ r ≤ 1. The values r = 1 and r = −1 occur when there is an exact linear relationship between x and y.

Pearson’s Correlation Coefficient
[Figure: three scatter plots. X and Y are perfectly negatively correlated (r = −1); X and Y are perfectly positively correlated (r = 1); X and Y are uncorrelated (r = 0).]

Pearson’s Correlation Coefficient
As the relationship between x and y deviates from perfect linearity, |r| moves away from 1 toward 0. If y tends to decrease as x increases, the correlation is negative. If y tends to increase as x increases, the correlation is positive. If r = 0, we say x and y are uncorrelated: there is no linear relationship between the two variables. However, a non-linear relationship may exist. [Figure: plot of Y against X for data to which the correlation model should not be applied.]

Computational Formula for r
Based on this sample, there appears to be a fairly strong linear relationship between the percentage of children immunized in a specified country and its under-5 mortality rate: the correlation coefficient is fairly close to 1 in absolute value. In addition, the relationship is negative: mortality decreases as the percentage immunized increases.
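
The slide's own formula is not reproduced in the transcript, so the sketch below assumes one common computational form, r = (Σxy − n·x̄·ȳ) / √((Σx² − n·x̄²)(Σy² − n·ȳ²)). The data are hypothetical and the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from raw sums (an assumed computational form)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = sum_xy - n * x_bar * y_bar
    denominator = math.sqrt((sum_x2 - n * x_bar ** 2) * (sum_y2 - n * y_bar ** 2))
    return numerator / denominator

# Hypothetical data in which y tends to fall as x rises.
x = [1, 2, 3, 4, 5]
y = [5, 4, 5, 4, 2]

r = pearson_r(x, y)
print(f"r = {r:.4f}")  # negative sign: y tends to decrease as x increases
```

The result is unit free: rescaling x or y by a positive constant leaves r unchanged.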

Pearson’s Correlation Coefficient
Limitations of the correlation model: The correlation model does not specify the nature of the relationship, so do not infer causality. An effective immunization program might be the primary reason for the decrease in mortality, but it is possible that the immunization program is a small part of an overall health care system that is responsible for the decrease in mortality. The model measures linear relationships only. The Y values for a given X are assumed to be normally distributed, and the X values for a given Y are also assumed to be normally distributed; that is, we are sampling from a "bivariate normal distribution". The model is very sensitive to outliers: if there are pairs of data points far outside the range of the other data points, they can alter the value of the correlation coefficient and give misleading results. Do not extrapolate the correlation coefficient outside the range of the data points, because the relationship between X and Y may change outside the range of the sample points.

Testing Hypotheses about the Population Correlation Coefficient
We test whether there is a significant correlation, ρ, in the population between X and Y. H0: ρ = 0 (there is no linear association). H1: ρ ≠ 0 (there is a significant linear association). The sample correlation coefficient, r, is an approximately unbiased estimator of the population correlation coefficient, which we designate as ρ; that is, E(r) ≈ ρ, with equality when ρ = 0. The sampling distribution of the statistic r is approximately normal.

Testing Hypotheses about the Population Correlation Coefficient
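The standard error of the sample correlation coefficient is $S_r = \sqrt{(1 - r^2)/(n - 2)}$. The test statistic is $t = (r - 0)/S_r = r / \sqrt{(1 - r^2)/(n - 2)}$, which has n − 2 degrees of freedom.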

Testing Hypotheses about the Population Correlation Coefficient
Critical value at α = 0.05: t18, .05/2 = t18, .025 = 2.101, with degrees of freedom df = n − 2 = 18. Decision rule: if −2.101 ≤ t ≤ 2.101, do not reject H0. The test statistic falls outside this range; therefore, reject H0. Comparing the test statistic with the critical value, we reject the null hypothesis and conclude that there is a significant linear association between immunization rates and mortality rates.
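
As a rough check of this test, the statistic can be recomputed from figures that appear elsewhere in the slides: R² = 0.6874, n = 20 countries, and a negative relationship. Taking r as the negative square root of R² is an assumption based on those statements, and the function name `correlation_t_statistic` is illustrative.

```python
import math

def correlation_t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    standard_error = math.sqrt((1 - r ** 2) / (n - 2))
    return r / standard_error

n = 20                         # sample size reported in the slides
r_squared = 0.6874             # coefficient of determination from the slides
r = -math.sqrt(r_squared)      # assumed negative: mortality falls as immunization rises

t_stat = correlation_t_statistic(r, n)
t_crit = 2.101                 # t(18, .025), as given in the slides

reject_h0 = abs(t_stat) > t_crit
print(f"r = {r:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Under these assumptions the statistic comes out near −6.29, far outside the interval from −2.101 to 2.101.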

Relationship between Correlation, R, and Coefficient of Determination, R²
R = correlation coefficient; R² = coefficient of determination. In simple regression, the magnitude of r = R is the square root of the coefficient of determination, R², and its sign is the sign of the estimated slope b1: r = ±√R².

Computer Presentation of Correlation Matrix
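[Correlation matrix for MORTRATE and IMMUNRATE: each variable has correlation 1 with itself; the off-diagonal entry is the sample correlation r between the two.]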

We want to use statistical inference to draw conclusions about the population parameters. For example, we might want to create a confidence interval for the slope (or intercept), or test whether the population slope, β1, (or intercept) equals zero. We considered earlier the properties of the OLS estimators, which describe their sampling distributions: E(b0) = β0 and E(b1) = β1. The estimators b0 and b1 are linear combinations of the yi, so their distributions follow the distribution of the yi (or of the error term). If the error terms are normal, the distributions of b0 and b1 are normal. If the sample is large, the distributions of b0 and b1 are approximately normal even if the error terms are not normal. [Figure: normal sampling distributions of b0 and b1.]

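Among all linear unbiased estimators, the OLS estimators have the smallest variance. The standard errors of b0 and b1 are $\sigma_{b_0} = \sigma_{y|x}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$ and $\sigma_{b_1} = \frac{\sigma_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}}$.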

Since σy|x is unknown, we substitute the standard error of the estimate, Se, and use the t distribution. In order to use the t distribution, we now have to assume the yi's are normal. In large samples, the t distribution provides a good approximation even if the yi's are not normal.

Confidence Intervals Population Slope and Intercept
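We use information about the sampling distributions to construct confidence intervals for the population slope and intercept. If the conditional probability distribution of Y|X follows a normal distribution, the intervals are $b_1 \pm t_{n-2,\,\alpha/2}\, S_{b_1}$ for the slope and $b_0 \pm t_{n-2,\,\alpha/2}\, S_{b_0}$ for the intercept.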

Confidence Intervals Population Slope and Intercept
For the slope, with t18, .05/2 = t18, .025 = 2.101: −3.77 ≤ β1 ≤ …, with a degree of confidence of .95. For the intercept: … ≤ β0 ≤ …, with a degree of confidence of .95. If we constructed such intervals from repeated samples, about 95 out of 100 of them would contain the population parameter.
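
The interval construction can be sketched as follows. The data are hypothetical (five points, so df = 3, with the table value t(3, .025) = 3.182 passed in by the caller), and the function name `slope_intercept_intervals` is made up for this example; it is not the slides' Excel procedure.

```python
import math

def slope_intercept_intervals(x, y, t_crit):
    """Return confidence intervals (low, high) for the slope and the intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Standard error of the estimate, Se = sqrt(SSE / (n - 2)).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2))

    # Estimated standard errors of the coefficients.
    s_b1 = se / math.sqrt(sxx)
    s_b0 = se * math.sqrt(1 / n + x_bar ** 2 / sxx)

    slope_ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
    intercept_ci = (b0 - t_crit * s_b0, b0 + t_crit * s_b0)
    return slope_ci, intercept_ci

# Hypothetical data; t_crit is t(3, .025) from a standard t table.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope_ci, intercept_ci = slope_intercept_intervals(x, y, t_crit=3.182)

print(f"slope:     {slope_ci[0]:.3f} to {slope_ci[1]:.3f}")
print(f"intercept: {intercept_ci[0]:.3f} to {intercept_ci[1]:.3f}")
```

With so few observations both intervals are wide, and the slope interval includes zero.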

Confidence Intervals Population Slope and Intercept
The interval estimates appear wide. The reasons are the small sample size and the large variation in mortality for given immunization rates; that is, Se is large.

Tests of Hypotheses
The most common type of hypothesis tested with the regression model is that there is no relationship between the explanatory variable X and the dependent variable Y. The relationship between X and Y is given by the linear dependence of the mean value of Y on X, that is, E(Y|X) = β0 + β1x. To say there is no relationship means E(Y|X) is not linearly dependent on X, which is to say β1 equals zero. H0: β1 = 0 (there is no relationship between X and Y). H1: β1 ≠ 0 (there is a significant relationship between X and Y). If we have a theory that suggests the direction of the relationship, then we will want a one-tail test. The test statistic is $t = (b_1 - 0)/S_{b_1}$.

Sampling Distribution under the null hypothesis
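Tests of Hypotheses. The test statistic is $t = b_1/S_{b_1}$. Set the level of significance and find the critical value in the t table with df = n − 2. Decision rule: if −tcv ≤ t ≤ tcv, do not reject H0; otherwise, reject H0. [Figure: sampling distribution under the null hypothesis, shown on the b1 scale (normal) and on the t scale (t with n − 2 degrees of freedom), with "reject" regions in both tails and a "do not reject" region between −t and t.]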

Sampling Distribution under the null hypothesis
Tests of Hypotheses. For our problem: H0: β1 ≥ 0 (no relationship between X and Y); H1: β1 < 0 (an inverse relationship between X and Y). Let α = 0.05. Critical value: −t18, 0.05 = −1.734. Decision rule: if −tcv ≤ t, do not reject H0. The test statistic is t = −6.291; since −1.734 > −6.291, reject H0. [Figure: sampling distribution under the null hypothesis with n − 2 degrees of freedom; the t axis marks the critical value −1.734 and the test statistic −6.291 in the rejection region, and the b1 axis marks −2.831.]
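
A lower-tail test of this kind can be sketched as below. The first part uses hypothetical data (five points, df = 3, table value t(3, .05) = 2.353); the last line applies the same decision rule to the two figures from the slide, t = −6.291 and critical value 1.734. The function names are illustrative.

```python
import math

def slope_t_statistic(x, y):
    """t = b1 / S_b1 for testing a zero population slope, with n - 2 df."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2))
    s_b1 = se / math.sqrt(sxx)
    return b1 / s_b1

def lower_tail_decision(t_stat, t_crit):
    """Reject H0: beta1 >= 0 only when t falls below -t_crit."""
    return "reject" if t_stat < -t_crit else "do not reject"

# Hypothetical data with a downward trend.
x = [1, 2, 3, 4, 5]
y = [5, 4, 5, 4, 2]
t_stat = slope_t_statistic(x, y)
hypothetical_decision = lower_tail_decision(t_stat, t_crit=2.353)  # t(3, .05)

# Same rule applied to the values shown on the slide.
slide_decision = lower_tail_decision(-6.291, t_crit=1.734)         # t(18, .05)

print(f"hypothetical data: t = {t_stat:.3f}, {hypothetical_decision}")
print(f"slide values: t = -6.291, {slide_decision}")
```

The hypothetical sample has a negative slope but too few points to reject the null hypothesis, which illustrates why the sample size matters for this test.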

Tests of Hypotheses
We conclude that the immunization rate is significantly and inversely related to the mortality rate. Remember: you want to reject the null hypothesis, because rejecting it means you have found that your independent variable is related to the dependent variable.

Computer Output of the Problem
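| | MORTALITY, Y | IMMUNIZED, X |
|---|---|---|
| Mean | 62.2 | 76.3 |
| Standard Error | | |
| Median | 31 | 83 |
| Mode | 9 | |
| Standard Deviation | | |
| Sample Variance | 4700.8 | |
| Range | 220 | 72 |
| Minimum | 6 | 26 |
| Maximum | 226 | 98 |
| Sum | 1244 | 1526 |
| Count | 20 | 20 |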

[Excel regression output, annotated to identify Se, b0, b1, Sb0, and Sb1.]

Online Homework - Chapter 15 Overview Simple Regression
CengageNOW fourteenth assignment