1 Regression & Correlation (1) 1. A relationship between 2 variables X and Y 2. The relationship seen as a straight line 3. Two problems 4. How can we tell if our regression line is useful? 5. Test of hypothesis about the slope, $\beta_1$ 6. Correlation 7. Useful features of r 8. Test of hypothesis about $\rho$ 9. Examples
2 A relationship between two variables X & Y We often have pairs of scores for a given set of cases. For example, we might have: * # of years of education and annual income * IQ and GPA * income and # of books in the household More generally, we have any X and Y, and our question is: does knowing something about X tell us anything about Y?
3 A relationship between two variables X & Y Does knowing something about X tell us anything about Y? For example, knowing how many years of education a person has, could you usefully estimate their annual income, or the number of cigarettes they smoke in a year?
4 A relationship between two variables X & Y Often, the answer to that question is, Yes – there is a relationship between the X and Y scores you have measured. * On average, as number of years of education goes up (across a set of people), number of cigarettes smoked per year goes down.
5 A relationship between two variables X & Y In the graph on the next slide, we see two things: 1. Y goes down as X goes up. 2. At each value of X, there is some variability in Y, but substantially less than there is in Y overall.
6 [Figure: scatterplot with X = years of education and Y = cigarettes per year.] Note that the range of the Y values at a single value of X is small compared to the whole range of Y in the data set.
7 The relationship seen as a straight line The relationship between an X and a Y can be described using the equation for a straight line: $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\beta_0$ is the Y-intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error. Note: this is the (theoretical) population equation relating Y to X.
8 Two problems $Y = \beta_0 + \beta_1 X + \varepsilon$ In principle, this equation would let us predict the value of Y for a given X without error IF A. X were the only variable that influenced Y * Usually, it isn’t B. We knew the population values of $\beta_0$ and $\beta_1$ * Usually, we don’t
9 Two problems Be sure to distinguish between A. Actual values of Y in the population. B. Values of Y we would predict using $Y = \beta_0 + \beta_1 X + \varepsilon$ if we had the population values for $\beta_0$ and $\beta_1$. C. Values of Y we predict on the basis of the X-Y relationship in our sample data: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$. Why no $\varepsilon$ here?
10 Two problems When we predict Y on the basis of X for a given case, two things can cause the predicted values to be different from the values we would find if we actually measured Y for that case: 1. We don’t know the population values of $\beta_0$ and $\beta_1$, only the sample values $\hat{\beta}_0$ and $\hat{\beta}_1$. Note that if we did know $\beta_0$ and $\beta_1$, this source of error would disappear.
11 Two problems 2. In the population, Y is not uniquely determined by X. As a result, for each value of X, there is a distribution of Y values. * Relative to our predicted Y for a given value of X, the observed values of Y will sometimes be higher and sometimes be lower. * These “errors” are random: over the long term, they will cancel each other out. * But even if we knew $\beta_0$ and $\beta_1$, this source of error would still exist.
12 Two problems In other words: 1. We don’t have population values for the slope and the intercept of the line relating X to Y. That’s one problem. 2. Even if we had population values for the slope and the intercept, the equation relating X to Y would still not perfectly predict Y. That’s the other problem.
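The two problems above can be seen in a small simulation. This is only a sketch and is not part of the original slides: the population values `BETA0`, `BETA1`, the error standard deviation `SIGMA`, the range of X, and the sample size are all arbitrary assumptions chosen for illustration.

```python
import random

# Hypothetical population values, chosen only for illustration.
BETA0, BETA1, SIGMA = 10.0, -0.5, 1.0

random.seed(1)  # fixed seed so the sketch is repeatable

# Draw a sample from the population model Y = BETA0 + BETA1*X + error.
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [BETA0 + BETA1 * x + random.gauss(0, SIGMA) for x in xs]

# Least-squares estimates computed from the sample.
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
b1_hat = ss_xy / ss_xx
b0_hat = y_bar - b1_hat * x_bar


def sse(b0, b1):
    """Sum of squared differences between observed Y and the line b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Problem 1: the sample estimates differ from the population values.
print("estimated:", round(b0_hat, 3), round(b1_hat, 3))
print("population:", BETA0, BETA1)

# Problem 2: even the true population line leaves prediction error.
print("SSE using the population line:", round(sse(BETA0, BETA1), 2))
print("SSE using the fitted line:", round(sse(b0_hat, b1_hat), 2))
```

The fitted line always has an SSE no larger than the population line on the same sample, because it is chosen to minimize SSE for that sample. That does not make it the better description of the population.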
13 How can we tell if our regression line is useful? The line is useful if the predicted values of Y are close to the observed values of Y (in the sample). We use our sample X and Y values to compute the regression line, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$. We then use this line to predict the same Y values, and compare our predicted values with the observed values in the sample data. If the prediction is good, we can then use the regression line to predict Y for values of X not in our sample.
14 How can we tell if our regression line is useful? $(Y_i - \hat{Y}_i) = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$ (since $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$). Therefore, the sum of the squared deviations of predicted Y values from actual Y values is: $SSE = \sum [Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)]^2$ Now $\hat{\beta}_0$ and $\hat{\beta}_1$ are the “least squares estimators” of $\beta_0$ and $\beta_1$: they give a smaller SSE than any other values of $\hat{\beta}_0$ and $\hat{\beta}_1$ would.
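The least-squares property can be checked numerically. The sketch below uses five made-up data points (not from the slides) and hypothetical helper names `least_squares` and `sse`. It fits the line, then nudges the intercept and slope and confirms that SSE gets worse.

```python
# Made-up data for illustration only (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def least_squares(xs, ys):
    """Return (b0_hat, b1_hat), the least-squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


def sse(xs, ys, b0, b1):
    """Sum of squared deviations of observed Y from the line b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


b0_hat, b1_hat = least_squares(xs, ys)
best = sse(xs, ys, b0_hat, b1_hat)

# Nudge both estimates in every direction; SSE should only increase.
worse = []
for d0 in (-0.1, 0.1):
    for d1 in (-0.1, 0.1):
        worse.append(sse(xs, ys, b0_hat + d0, b1_hat + d1))

print("least-squares SSE:", round(best, 3))
print("SSE after nudging:", [round(v, 3) for v in worse])
```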
15 [Figure: plot of Y against X.] When there is no relation between X and Y, the best estimator of the Y value for any case is the mean, $\bar{Y}$. Notice that the slope of this line is zero!
16 How can we tell if our regression line is useful? If X is completely unrelated to Y, the best estimate we could make of Y would be the mean, $\bar{Y}$, for any value of X. We find out whether our regression line is useful by asking whether its slope is different from 0. $H_0: \beta_1 = 0$ [Why not $\hat{\beta}_1$?]
17 How can we tell if our regression line is useful? To test that null hypothesis, we use the fact that $\hat{\beta}_1$ is one slope taken from the sampling distribution of $\hat{\beta}_1$. $\hat{\beta}_1 = \frac{SS_{XY}}{SS_{XX}}$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$, where $SS_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}$
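18 How can we tell if our regression line is useful? $SS_{XX} = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$ (n = sample size) For the sampling distribution of $\hat{\beta}_1$: the mean is $\beta_1$ and the standard deviation is $\sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{SS_{XX}}}$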
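Each sum of squares has a definitional form (deviations from the means) and a computational form (raw sums only). As a quick check that the two agree, here is a sketch on made-up data, not taken from the slides.

```python
import math

# Made-up data for illustration only (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Definitional forms: sums of deviations from the means.
ss_xy_def = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx_def = sum((x - x_bar) ** 2 for x in xs)

# Computational forms: built from the raw sums only.
ss_xy_short = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
ss_xx_short = sum(x * x for x in xs) - sum(xs) ** 2 / n

print("SS_XY:", ss_xy_def, ss_xy_short)
print("SS_XX:", ss_xx_def, ss_xx_short)
print("forms agree:",
      math.isclose(ss_xy_def, ss_xy_short),
      math.isclose(ss_xx_def, ss_xx_short))

# Slope and intercept follow directly from the sums of squares.
b1_hat = ss_xy_def / ss_xx_def
b0_hat = y_bar - b1_hat * x_bar
print("slope:", b1_hat, "intercept:", round(b0_hat, 3))
```

The computational forms are convenient when only the column totals are available, as in the worked examples later in the deck.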
19 How can we tell if our regression line is useful? We estimate $\sigma_{\hat{\beta}_1}$ by $s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{XX}}}$, where $s = \sqrt{\frac{SSE}{n-2}}$
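20 Test of hypothesis about the slope, $\beta_1$ Since $\sigma$ is unknown, we use t to test $H_0$. One-tailed test: $H_0: \beta_1 = 0$, $H_A: \beta_1 < 0$ (or $\beta_1 > 0$). Two-tailed test: $H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$. Test statistic: $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}}$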
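21 Test of hypothesis about the slope, $\beta_1$ Rejection region: for the one-tailed test, $t_{obt} < -t_{\alpha}$ (or $t_{obt} > t_{\alpha}$); for the two-tailed test, $|t_{obt}| > t_{\alpha/2}$. $t_{crit}$ is based on n-2 degrees of freedom.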
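The whole slope test can be sketched end to end. The data below are made up for illustration (not from the slides). The critical value 3.182 is the standard t-table entry for 3 degrees of freedom at α/2 = .025, entered by hand rather than computed.

```python
import math

# Made-up data for illustration only (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
b1_hat = ss_xy / ss_xx
b0_hat = y_bar - b1_hat * x_bar

# SSE: squared deviations of observed Y from the fitted line.
sse = sum((y - (b0_hat + b1_hat * x)) ** 2 for x, y in zip(xs, ys))

s = math.sqrt(sse / (n - 2))   # estimate of sigma
s_b1 = s / math.sqrt(ss_xx)    # estimated standard error of the slope
t_obt = (b1_hat - 0) / s_b1    # test statistic under H0: beta_1 = 0

# Two-tailed critical value, alpha = .05, n - 2 = 3 df (from a t table).
T_CRIT = 3.182
reject = abs(t_obt) > T_CRIT
print(f"slope = {b1_hat:.3f}, s_b1 = {s_b1:.4f}, t_obt = {t_obt:.3f}")
print("reject H0:", reject)
```

With these five points the slope is 0.6 and t is about 2.12, which falls short of 3.182, so the two-tailed test does not reject $H_0$. With so few cases, even a fairly steep sample slope is not significant.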
22 Correlation The Pearson correlation coefficient r is a numerical, descriptive measure of the strength and direction of relationship between two variables X and Y. $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}}$ r gives much the same information as $\hat{\beta}_1$. However, r is “scale-less” and $-1 \le r \le 1$.
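To see what “scale-less” means in practice, the sketch below measures the same made-up X values in years and then in months. The function names `pearson_r` and `slope` are hypothetical helpers, and the data are not from the slides. The slope changes with the units; r does not.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: SS_XY / sqrt(SS_XX * SS_YY)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def slope(xs, ys):
    """Least-squares slope: SS_XY / SS_XX."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    return ss_xy / ss_xx


# Made-up data: the same X measured in years and in months.
years = [1, 2, 3, 4, 5]
months = [12 * x for x in years]
ys = [2, 4, 5, 4, 5]

print("slope per year: ", round(slope(years, ys), 4))
print("slope per month:", round(slope(months, ys), 4))
print("r with years:   ", round(pearson_r(years, ys), 4))
print("r with months:  ", round(pearson_r(months, ys), 4))
```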
23 Useful features of r r indexes the X-Y relationship: r > 0 means Y tends to increase as X increases r < 0 means Y tends to decrease as X increases r = 0 means there is no linear relationship between X & Y r is the sample correlation coefficient. We can use it to estimate rho ($\rho$), the population correlation coefficient, and use r to test $H_0: \rho = 0$
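24 Test of hypothesis about $\rho$ One-tailed test: $H_0: \rho = 0$, $H_A: \rho < 0$ (or $\rho > 0$). Two-tailed test: $H_0: \rho = 0$, $H_A: \rho \neq 0$. Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$ $t_{crit}$ has n-2 degrees of freedom.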
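Testing $\rho = 0$ and testing $\beta_1 = 0$ on the same data give the same t, which is one sense in which r carries much the same information as the slope. The sketch below checks this on made-up data (not from the slides). It uses the identity $SSE = SS_{YY} - \hat{\beta}_1 SS_{XY}$, which holds for the least-squares line.

```python
import math

# Made-up data for illustration only (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_yy = sum((y - y_bar) ** 2 for y in ys)

# t computed from the correlation coefficient, with rho = 0 under H0.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
t_from_r = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))

# t computed from the slope and its estimated standard error.
b1_hat = ss_xy / ss_xx
sse = ss_yy - b1_hat * ss_xy   # equals the sum of squared residuals
s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(ss_xx)
t_from_slope = (b1_hat - 0) / s_b1

print(f"r = {r:.4f}")
print(f"t from r = {t_from_r:.4f}, t from slope = {t_from_slope:.4f}")
```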
25 Example 1 $H_0: \rho = 0$ $H_A: \rho \neq 0$ Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$ $t_{crit} = t(5, \alpha/2 = .025) = 2.571$
26 Example 1 – Sum formulas First, calculations involving X: $\sum X = 74$, $(\sum X)^2 = 5476$, $\sum X^2 = 922$. Then, analogous calculations involving Y: $\sum Y = 82$, $(\sum Y)^2 = 6724$, $\sum Y^2 = 1076$. Then, calculations involving X and Y: $\sum XY = 976$.
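27 Example 1 – Sums of squares formulas $SS_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}$ $SS_{XX} = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$ $SS_{YY} = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$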
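28 Example 1 – calculate r $SS_{XY} = 976 - \frac{(74)(82)}{7} = 109.143$ $SS_{XX} = 922 - \frac{5476}{7} = 139.714$ $SS_{YY} = 1076 - \frac{6724}{7} = 115.429$ $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}} = \frac{109.143}{\sqrt{(139.714)(115.429)}} = .859$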
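29 Example 1 – do t-test $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{.859 - 0}{\sqrt{\frac{1 - .859^2}{5}}} \approx \frac{.859}{.2290} \approx 3.75$ Since t exceeds $t_{crit}$, reject $H_0$: a significant correlation exists.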
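Example 1 can be checked directly from the reported sums. In this sketch, n = 7 is an assumption inferred from the 5 degrees of freedom on the hypothesis slide. The critical value 2.571 is the standard t-table entry for 5 df at α/2 = .025, entered by hand.

```python
import math

# Sums reported on the "Sum formulas" slide for Example 1.
sum_x, sum_x2 = 74, 922
sum_y, sum_y2 = 82, 1076
sum_xy = 976
n = 7  # assumption: inferred from n - 2 = 5 degrees of freedom

# Computational forms of the sums of squares.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)
t_obt = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))

T_CRIT = 2.571  # two-tailed, alpha = .05, 5 df (from a t table)
print(f"SS_XY = {ss_xy:.3f}, SS_XX = {ss_xx:.3f}, SS_YY = {ss_yy:.3f}")
print(f"r = {r:.3f}, t = {t_obt:.2f}, reject H0: {abs(t_obt) > T_CRIT}")
```

Carrying r at full precision gives t of about 3.76. Rounding r to .859 before computing t gives about 3.75. Either way t is well above the critical value.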
30 Example 2 $H_0: \rho = 0$ $H_A: \rho > 0$ Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$ $t_{crit} = t(7 - 2 = 5, \alpha = .05) = 2.015$ Note: these are the Greek letter rho ($\rho$), NOT the English letter P.
31 Example 2 – Sum formulas First, calculations involving X: $\sum X = 4.2$, $(\sum X)^2 = 17.64$, $\sum X^2 = 2.86$. Then, analogous calculations involving Y: $\sum Y = 32$, $(\sum Y)^2 = 1024$, $\sum Y^2 =$ Then, calculations involving X and Y: $\sum XY = 21.35$.
32 Example 2 – calculate r $SS_{XY} = 21.35 - \frac{(4.2)(32)}{7} = 2.15$ $SS_{XX} = 2.86 - \frac{17.64}{7} = .34$
33 Example 2 – calculate r $SS_{YY} = \sum Y^2 - \frac{1024}{7}$ $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}}$ $r = .945$
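34 Example 2 – do t-test $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{.945 - 0}{\sqrt{\frac{1 - .945^2}{5}}} \approx \frac{.945}{.1463} \approx 6.46$ Since t exceeds $t_{crit}$, reject $H_0$: a significant correlation exists.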
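The recoverable parts of Example 2 can be checked the same way. The value of $\sum Y^2$ is not available in this text, so the sketch does not recompute r. It takes r = .945 as reported and verifies only $SS_{XY}$, $SS_{XX}$, and the t statistic. The critical value 2.015 is the standard one-tailed t-table entry for 5 df at α = .05, entered by hand.

```python
import math

# Sums reported for Example 2 (sum of Y squared is missing from the text).
sum_x, sum_x2 = 4.2, 2.86
sum_y = 32
sum_xy = 21.35
n = 7  # from the slide: t_crit uses 7 - 2 = 5 degrees of freedom

ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n

r = 0.945  # taken as reported on the slide, not recomputed here
t_obt = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))

T_CRIT = 2.015  # one-tailed, alpha = .05, 5 df (from a t table)
print(f"SS_XY = {ss_xy:.2f}, SS_XX = {ss_xx:.2f}")
print(f"t = {t_obt:.2f}, reject H0: {t_obt > T_CRIT}")
```

Because $H_A$ is $\rho > 0$, the comparison is one-sided: only a large positive t leads to rejection.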