1 Chapter 8 – Regression 2 Basic review, estimating the standard error of the estimate and short cut problems and solutions.


Presentation transcript:

1 1 Chapter 8 – Regression 2 Basic review, estimating the standard error of the estimate and short cut problems and solutions.

2 2 You can use the regression equation when:
1. the relationship between X and Y is linear,
2. r falls outside the CI.95 around 0.000 and is therefore a statistically significant correlation, and
3. X is within the range of X scores observed in your sample.

3 3 Simple problems using the regression equation
t Y' = r * t X
t Y' = .150 * 0.40 = 0.06
t Y' = .40 * -1.70 = -0.68
t Y' = .40 * 1.70 = 0.68

4 4 Predictions from Raw Data 1. Calculate the t score for X. 2. Solve the regression equation. 3. Transform the estimated t score for Y into a raw score.

5 5 Predicting from and to raw scores
Problem: Estimate the midterm point total given a study time of 400 minutes.
- It is given that the estimated mean of the study time is 560 minutes and the estimated standard deviation is 216.02. (Range = 260-860)
- It is given that the estimated mean of midterm points is 76 and their estimated standard deviation is 7.98.
- There were 10 pairs of t X, t Y scores.
- The estimated correlation coefficient is .851.

6 6 Can you use the regression equation?

7 The r table
df      nonsignificant    .05     .01
1       -.996 to .996     .997    .9999
2       -.949 to .949     .950    .990
3       -.877 to .877     .878    .959
4       -.810 to .810     .811    .917
5       -.753 to .753     .754    .874
6       -.706 to .706     .707    .834
7       -.665 to .665     .666    .798
8       -.631 to .631     .632    .765
9       -.601 to .601     .602    .735
10      -.575 to .575     .576    .708
11      -.552 to .552     .553    .684
12      -.531 to .531     .532    .661
...
100     -.194 to .194     .195    .254
200     -.137 to .137     .138    .181
300     -.112 to .112     .113    .148
500     -.087 to .087     .088    .115
1000    -.061 to .061     .062    .081
2000    -.043 to .043     .044    .058
10000   -.019 to .019     .020    .026

8 8 YES!
- r(8) = .851, p < .01
- 400 minutes is inside the range of X scores seen in the random sample (260-860 minutes)

9 9 Predicting from and to raw scores
1. Translate raw X to a t X score.
X = 400, X-bar = 560, s X = 216.02
t X = (X - X-bar) / s X = (400 - 560) / 216.02 = -0.74

10 10 Use regression equation
2. Find the value of t Y'.
r = .851
t Y' = r * t X = .851 * -0.74 = -0.63

11 11 Translate t Y' to raw Y'
Y-bar = 76.00, s Y = 7.98
Y' = Y-bar + (t Y' * s Y) = 76.00 + (-0.63 * 7.98) = 70.97
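The three steps on slides 9 to 11 can be sketched in a few lines of Python. This is only an illustration: the function name `predict_raw_y` and its parameter names are hypothetical, and the range check is an assumption added here to reflect the rule that X must lie within the observed range.

```python
def predict_raw_y(x, mean_x, sd_x, mean_y, sd_y, r, x_range):
    """Predict a raw Y score from a raw X score by way of t scores."""
    low, high = x_range
    if not low <= x <= high:
        # The slides warn against predicting outside the sampled X range.
        raise ValueError("X is outside the range observed in the sample")
    t_x = (x - mean_x) / sd_x         # step 1: raw X to t score
    t_y_pred = r * t_x                # step 2: regression equation
    return mean_y + t_y_pred * sd_y   # step 3: t score back to raw Y'


# Values from the study-time problem on slide 5.
midterm = predict_raw_y(400, 560, 216.02, 76.00, 7.98, 0.851, (260, 860))
print(round(midterm, 2))
```

Because the code does not round the intermediate t scores, its result can differ from a hand calculation in the last decimal place; here it agrees with 70.97 to two decimals.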

12 12 A Caution
- Never assume that a correlation will stay linear outside of the range you originally observed.
- Therefore, never use the regression equation to make predictions from X values outside of the range you found in your sample.
- Example: Predicting the height of a 50-year-old adult from a study examining the correlation of age and height in a sample composed only of children age 14 or less.

13 13 Correlation Characteristics: Which line best shows the relationship between age (X) and height (Y)? Linear vs. curvilinear

14 14 Reviewing the r table and reporting the results of calculating r from a random sample

15 15 How the r table is laid out: the important columns
- Column 1 of the r table shows degrees of freedom for correlation and regression (df REG).
- df REG = n P - 2
- Column 2 shows the CI.95 for varying degrees of freedom.
- Column 3 shows the absolute value of the r that falls just outside the CI.95. Any r this far or further from 0.000 falsifies the hypothesis that rho = 0.000 and can be used in the regression equation to make predictions of Y scores for people who were not in the original sample but who were part of the population from which the sample is drawn.

16 The r table (same table as slide 7): columns are df, the nonsignificant range (CI.95), and the critical values at .05 and .01.

17 The r table (same table as slide 7), annotated:
- Find your degrees of freedom (n P - 2) in column 1.
- If r falls within the 95% CI around 0.000 (column 2), then the result is not significant. You cannot reject the null hypothesis. You must assume that rho = 0.000.
- Does the absolute value of r equal or exceed the value in column 3? If so, r is significant with alpha = .05.
- If r is significant, you can consider it an unbiased, least squares estimate of rho. You can use it in the regression equation to estimate Y scores.

18 18 Can we generalize to the population from the correlation in the sample?
- A Type 1 error involves saying that there is a correlation in the population as a whole, when the correlation is actually 0.000 (and the null is true).
- We carefully guard against Type 1 error by using significance tests to try to falsify the null hypothesis.

19 19 Example: Anchovy pizza and horror films, rho = 0.000
H 1: People who enjoy food with strong flavors also enjoy other strong sensations.
H 0: There is no relationship between enjoying food with strong flavors and enjoying other strong sensations.
Ratings (scale 0-9):
anchovies: 7 3 0 8 4 1
horror films: 7 9 8 6 9 6 5 2 1 6
Can we reject the null hypothesis?

20 20 Can we reject the null hypothesis? [Scatterplot of the pizza and horror film ratings; both axes run from 0 to 8]

21 21 Can we reject the null hypothesis? We do the math and we find that: r = .352, df = 8

22-26 The r table (same table as slide 7), repeated across five slides. In the row for df = 8, the nonsignificant range is -.631 to .631, the critical value at .05 is .632, and the critical value at .01 is .765.

27 27 This finding falls within the CI.95 around 0.000
- We call such findings “nonsignificant”.
- Nonsignificant is abbreviated n.s.
- We would report this finding as follows:
- r(8) = 0.352, n.s.
- Given that it fell inside the CI.95, we must assume that rho actually equals zero and that our sample r is .352 instead of 0.000 solely because of sampling fluctuation.
- We go back to predicting that everyone will score at the mean of Y.

28 28 In fact, the null hypothesis was correct; rho = 0.000
- I made up that example using numbers randomly selected from a random number table.
- So there really was no relationship between the two sets of scores: rho really equaled 0.000.
- But samples don’t give you an r of exactly zero; they fluctuate around 0.000.
- Significance testing is your protection against mistaking sampling fluctuation for a real correlation.
- Significance testing protects against Type 1 error.

29 29 We use significance testing to protect us from Type 1 error.
- Our sample gave us an r of .352.
- Without the r table, we could have thought that was far enough from zero to represent a true correlation in the population.
- But .352 was the product only of sampling fluctuation.
- Significance testing is your protection against mistaking sampling fluctuation for a real correlation.
- Significance testing protects against Type 1 error.
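The point about sampling fluctuation can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not the author's procedure: it draws two unrelated sets of ten random ratings on a 0-9 scale (so rho = 0.000 by construction) and counts how often the sample r lands at or beyond .632, the critical value at .05 for df = 8 in the r table. The function names, the number of trials, and the seed are all hypothetical choices.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        return 0.0  # a constant sample has no defined r; treat it as zero
    return sp / (ss_x * ss_y) ** 0.5


def type1_rate(trials=5000, n_pairs=10, critical_r=0.632, seed=1):
    """Share of random samples (rho = 0) whose r reaches the critical value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.randint(0, 9) for _ in range(n_pairs)]
        ys = [rng.randint(0, 9) for _ in range(n_pairs)]
        if abs(pearson_r(xs, ys)) >= critical_r:
            hits += 1
    return hits / trials


rate = type1_rate()
print(f"Share of samples with |r| >= .632: {rate:.3f}")
```

If the table's critical value does its job, the printed share should come out in the neighborhood of .05, which is the Type 1 error risk the slides describe.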

30 30 How to report a significant r
- For example, let’s say that you had a sample (n P = 30) and r = -.400.
- Looking under n P - 2 = 28 df REG, we find the interval consistent with the null is between -.360 and +.360.
- So we are outside the CI.95 for rho = 0.000.
- We would write that result as r(28) = -.400, p < .05.
- That tells you the df REG, the value of r, and that you can expect an r that far from 0.000 five or fewer times in 100 when rho = 0.000.

31 31 Then there is Column 4
- Column 4 shows the values that lie outside a CI.99.
- (The CI.99 itself isn’t shown like the CI.95 in Column 2 because it isn’t important enough.)
- However, Column 4 gives you bragging rights.
- If your r is as far or further from 0.000 as the number in Column 4, you can say there is 1 or fewer chance in 100 of an r being this far from zero (p < .01).
- For example, let’s say that you had a sample (n P = 30) and r = -.525.
- The critical value at .01 is .463. You are further from 0.000 than that. So you can brag.
- You write that result as r(28) = -.525, p < .01.

32 32 To summarize
- If r falls inside the CI.95 around 0.000, it is nonsignificant (n.s.) and you can’t use the regression equation (e.g., r(28) = .300, n.s.).
- If r falls outside the CI.95, but not as far from 0.000 as the number in Column 4, you have a significant finding and can use the regression equation (e.g., r(28) = -.400, p < .05).
- If r is as far or further from zero as the number in Column 4, you can use the regression equation and brag while doing it (e.g., r(28) = -.525, p < .01).
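The three-way decision in this summary can be written as a short lookup. This is a minimal sketch: the dictionary `CRITICAL_R` holds only three rows copied from the r tables shown in the slides (df = 5, 8, and 17), and the function name `report_r` is hypothetical. Any other df would need its own row from a full table.

```python
# df -> (critical r at .05, critical r at .01), copied from the r table rows
# shown in the slides. Only three rows are included for illustration.
CRITICAL_R = {5: (0.754, 0.874), 8: (0.632, 0.765), 17: (0.456, 0.575)}


def report_r(r, df):
    """Return a report string such as 'r(8)=0.352, n.s.'."""
    crit_05, crit_01 = CRITICAL_R[df]
    if abs(r) >= crit_01:
        verdict = "p<.01"   # as far or further from zero than Column 4
    elif abs(r) >= crit_05:
        verdict = "p<.05"   # outside the CI.95 but short of Column 4
    else:
        verdict = "n.s."    # inside the CI.95: cannot reject the null
    return f"r({df})={r:.3f}, {verdict}"


print(report_r(0.352, 8))
print(report_r(-0.790, 5))
print(report_r(0.851, 8))
```

The comparison uses the absolute value of r, so negative correlations are judged by their distance from zero, as the slides specify.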

33 33 Can you reject H 0 ?
r = .386, n P = 19, df REG = 17
df      nonsignificant    .05     .01
10      -.575 to .575     .576    .708
11      -.552 to .552     .553    .684
12      -.531 to .531     .532    .661
13      -.513 to .513     .514    .641
14      -.496 to .496     .497    .623
15      -.481 to .481     .482    .606
16      -.467 to .467     .468    .590
17      -.455 to .455     .456    .575
18      -.443 to .443     .444    .561
19      -.432 to .432     .433    .549
...
40      -.303 to .303     .304    .393
50      -.272 to .272     .273    .354
60      -.249 to .249     .250    .325

34 34 Can you reject H 0 ?
r = -.386, n P = 47, df REG = 45
(Same r table as slide 33; the rows nearest df = 45 are shown.)
df      nonsignificant    .05     .01
40      -.303 to .303     .304    .393
50      -.272 to .272     .273    .354
60      -.249 to .249     .250    .325

35 35 How much better than the mean can we guess?

36 36 Improved prediction
- If we can use the regression equation rather than the mean to make individualized estimates of Y scores, how much better are our estimates?
- We are making predictions about scores on the Y variable from our knowledge of the statistically significant correlation between X and Y and the fact that we know someone’s X score.
- The average unsquared error when we predict that everyone will score at the mean of Y equals s Y, the ordinary standard deviation of Y.
- How much better than that can we do?

37 37 Estimating the standard error of the estimate the (very) long way.
- Calculate the correlation (which includes calculating s for Y).
- If the correlation is significant, you can use the regression equation to make individualized predictions of scores on the Y variable.
- The average unsquared error of prediction when you do that is called the estimated standard error of the estimate.

38 38 Example for Prediction Error
- A study was performed to investigate whether the quality of an image affects reading time.
- The experimental hypothesis was that reduced quality would slow down reading time.
- Quality was measured on a scale of 1 to 10. Reading time was in seconds.

39 39 Quality vs Reading Time data: Compute the correlation
Quality (scale 1-10): 4.30 4.55 5.55 5.65 6.30 6.45 6.45
Reading time (seconds): 8.1 8.5 7.8 7.3 7.5 7.3 6.0
Is there a relationship? Check for linearity. Compute r.

40 40 Calculate t scores for X
X       X - X-bar   (X - X-bar)^2   t X = (X - X-bar) / s X
4.30    -1.31       1.71            -1.48
4.55    -1.06       1.12            -1.19
5.55    -0.06       0.00            -0.07
5.65     0.04       0.00             0.05
6.30     0.69       0.48             0.78
6.45     0.84       0.71             0.95
6.45     0.84       0.71             0.95
ΣX = 39.25, n = 7, X-bar = 5.61
SS W = 4.73
MS W = 4.73/(7-1) = 0.79
s X = 0.89

41 41 Calculate t scores for Y
Y      Y - Y-bar   (Y - Y-bar)^2   t Y = (Y - Y-bar) / s Y
8.1     0.60       0.36             0.76
8.5     1.00       1.00             1.26
7.8     0.30       0.09             0.38
7.3    -0.20       0.04            -0.25
7.5     0.00       0.00             0.00
7.3    -0.20       0.04            -0.25
6.0    -1.50       2.25            -1.89
ΣY = 52.5, n = 7, Y-bar = 7.50
SS W = 3.78
MS W = 3.78/(7-1) = 0.63
s Y = 0.794

42 42 Plot t scores
t X      t Y
-1.48     0.76
-1.19     1.28
-0.07     0.39
 0.05    -0.25
 0.78     0.00
 0.95    -0.25
 0.95    -1.89

43 43 t score plot with best fitting line: linear? YES!

44 44 Calculate r
t X      t Y      t X - t Y   (t X - t Y)^2
-1.48     0.76    -2.24       5.02
-1.19     1.28    -2.47       6.10
-0.07     0.39    -0.46       0.21
 0.05    -0.25     0.30       0.09
 0.78     0.00     0.78       0.61
 0.95    -0.25     1.20       1.44
 0.95    -1.88     2.83       8.01
Σ(t X - t Y)^2 = 21.48
Σ(t X - t Y)^2 / (n P - 1) = 21.48/6 = 3.580
r = 1 - (1/2 * 3.580) = 1 - 1.79 = -0.790
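The t-score formula for r used on this slide can be checked with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, and the quality list assumes the seventh score is 6.45, which is the value implied by ΣX = 39.25 with n = 7. The script uses `statistics.stdev`, which divides by n - 1 as the slides do.

```python
from statistics import mean, stdev


def t_scores(values):
    """Convert raw scores to t scores using the n - 1 standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def r_from_t_scores(xs, ys):
    """r = 1 - 1/2 * sum((tX - tY)^2) / (nP - 1)."""
    tx, ty = t_scores(xs), t_scores(ys)
    sum_sq_diff = sum((a - b) ** 2 for a, b in zip(tx, ty))
    return 1 - 0.5 * sum_sq_diff / (len(xs) - 1)


quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]  # seventh value assumed
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

r = r_from_t_scores(quality, reading_time)
print(round(r, 3))
```

Without intermediate rounding the result is about -0.78, close to the hand-calculated -0.790; the small gap comes from rounding the t scores to two decimals on the slides.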

45 45 Check whether r is significant
r = -0.790
df = n P - 2 = 5
alpha is .05
Look in the r table: With 5 df REG, the CI.95 goes from -.753 to +.753.
r(5) = -.790, p < .05
r is significant!

46 46 We can calculate the Y' for every raw X
X       Y'
4.30    8.42
4.55    8.23
5.55    7.54
5.65    7.47
6.30    7.01
6.45    6.91
6.45    6.91

47 47 Can we show mathematically that regression estimates are better than mean estimates?
Y: 8.1 8.5 7.8 7.3 7.5 7.3 6.0 (mean of Y = 7.5)
Y': 8.42 8.23 7.54 7.47 7.01 6.91 6.91
To calculate the standard deviation, we take deviations of Y from the mean of Y, square them, add them up, divide by degrees of freedom, and then take the square root.
To calculate the standard error of the estimate, s EST, we will take the deviations of each raw Y score from its regression equation estimate, square them, add them up, divide by degrees of freedom, and take the square root.
We expect, of course, that there will be less error if we use regression.

48 48 Estimated standard error of the estimate
Y      Y'      Y - Y'    (Y - Y')^2
8.1    8.42    -0.32     0.10
8.5    8.23     0.27     0.07
7.8    7.54     0.26     0.07
7.3    7.47    -0.17     0.03
7.5    7.01     0.49     0.24
7.3    6.91     0.39     0.15
6.0    6.91    -0.91     0.83
SS RES = 1.49
MS RES = 1.49/(7-2) = 0.298
s EST = 0.546
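The long-way calculation on this slide can be sketched as follows. The function name is hypothetical, and the list of Y' values assumes the seventh estimate is 6.91, matching the seven residuals shown on the slide.

```python
def standard_error_of_estimate(actual, predicted):
    """Square root of SS RES / (nP - 2), computed from raw and estimated Y."""
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ms_res = ss_res / (len(actual) - 2)   # df REG = nP - 2
    return ms_res ** 0.5


y = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]
y_pred = [8.42, 8.23, 7.54, 7.47, 7.01, 6.91, 6.91]  # seventh value assumed

s_est = standard_error_of_estimate(y, y_pred)
print(round(s_est, 3))
```

The only differences from an ordinary standard deviation are that deviations are taken from each Y' rather than from the mean of Y, and that the divisor is n P - 2 rather than n P - 1.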

49 49 How much better?
MS RES = 0.30, MS Y = 0.64
52% less squared error when we use the regression equation instead of the mean to predict Y scores.

50 50 How much better is the estimated standard error of the estimate than the estimated standard deviation?
s EST = 0.546, s Y = 0.80
31% less error of prediction (using unsquared units) when we use the regression equation instead of the mean to predict.

51 51 Mathematical magic There is usually an alternative formula for calculating statistics that is easier to perform. We went through a lot of extra steps to calculate S EST = 0.546. It is not necessary to calculate all of the estimated Y scores, find the difference between each actual Y score and Y ', then square, sum, and divide by df REG.

52 52 Another way to phrase it: How much error did we get rid of?
- Treat it as a weight loss problem.
- If Jack is 30 pounds overweight and he loses 40% of it, how much is he still overweight?
- He lost .400 x 30 pounds = 12 pounds.
- He has 30 - 12 = 18 pounds left to lose.

53 53 How did we solve that problem?
- First we found how much weight Jack had gotten rid of.
- That equaled the percent he lost (expressed as a proportion) times the amount overweight he started with.
- He was 30 pounds overweight and lost 40% of it.
- 30 * .400 = 12.00.
- He lost 12 pounds.

54 54 Then we found how much he was still overweight.
- He started off 30 pounds overweight.
- He lost 12.00 pounds.
- So he had 30 - 12 = 18 pounds of overweight left.
- To find what is left, subtract what you got rid of from what you started with.

55 55 So to compute how much of something is left after some is lost, you need to know how much there was to start with and what percentage was gotten rid of. Percentage gotten rid of times original quantity = amount gotten rid of. Original quantity minus amount gotten rid of = what’s left.

56 56 SS Y = error to start; r^2 = percent of error lost
- SS Y is the total amount of error we start with when predicting scores on Y. It is the amount of error when everyone is predicted to score at the mean.
- The proportion of error you get rid of using the regression equation as your predictor equals Pearson’s correlation coefficient squared (r^2)!

57 57 To get the total error left, find how much you got rid of, then subtract from what you started with
- Amount you got rid of: SS Y * r^2
- Amount left: SS RES = SS Y - (SS Y * r^2)

58 58 Now you want to estimate the average amount of unsquared error you will have left if you use the regression equation to make predictions for the whole population.

59 59
- SS RES = sum of squared error left when using the regression equation in your sample.
- AS USUAL, to estimate the average squared error of prediction in the population when you use the regression equation to predict Y scores, divide the sum of squares by degrees of freedom.

60 60
- The only change is that you used the regression equation to get SS RES.
- So you divide SS RES by df REG = n P - 2 to get MS RES, the average amount of squared error you will have left when you use the regression equation.
- Then take the square root of MS RES to get s EST.
- s EST is your best estimate of the average unsquared error of prediction when you properly use the regression equation to predict Y scores.
- Remember: to properly use the regression equation, r must be significant and X must be within the range of X scores observed in your random sample.

61 61 Here are the formulae:
- Residual sum of squared error if the regression equation is used:
  SS RES = SS Y - (SS Y * r^2)
- Estimated average amount of squared error left:
  MS RES = SS RES / df REG = SS RES / (n P - 2)
- Estimated average amount of unsquared error left:
  s EST = square root of MS RES

62 62 Computing s EST the easier way!
In the problem for which we computed s EST the long way, we already knew that SS Y = 3.78 and r = -0.790. Thus, r^2 = (-0.790)^2 = 0.624.
Here is the computation:
SS RES = SS Y - (SS Y * r^2) = 3.78 - (3.78 * 0.624) = 1.42
MS RES = 1.42/(7-2) = 0.284
s EST = 0.533
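The shortcut needs only SS Y, r, and the number of pairs. A minimal sketch, with a hypothetical function name, using the values from this slide:

```python
def s_est_shortcut(ss_y, r, n_pairs):
    """Estimated standard error of the estimate from SS Y, r, and nP."""
    ss_res = ss_y - ss_y * r ** 2     # error left after removing r^2 of SS Y
    ms_res = ss_res / (n_pairs - 2)   # divide by df REG = nP - 2
    return ms_res ** 0.5              # back to unsquared units


s_est = s_est_shortcut(ss_y=3.78, r=-0.790, n_pairs=7)
print(round(s_est, 3))
```

Note that r is squared exactly once: the amount of error removed is SS Y times r^2, so with r = -0.790 the multiplier is 0.624, not 0.624 squared.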

63 63 How much better?
s EST = 0.533, s Y = 0.80
33% less error when we use the regression equation instead of the mean to predict.
Note: the difference between 33% and the 31% we got when calculating the long way is mostly due to rounding error in the long calculation. 33% is more accurate.

64 64 Stating the obvious:
- The estimated standard deviation (s) was the estimated average unsquared distance of scores in the population from mu. If we are looking at Y scores, it is the average unsquared difference of Y scores from the mean of Y.
- When using the regression equation we are predicting Y scores.
- The estimated standard error of the estimate (s EST) is the estimated average unsquared distance of Y scores in the population from the regression equation based predicted Y scores.
- Both reflect the error of prediction. Using the regression equation individualizes prediction. If r is significant, and prediction is restricted to values of X within the range of X scores seen in the random sample, using the regression equation leads to less error.

65 65 Do one yourself.
- Assume the original sum of squares for error is 420.00, n P = 22, and the sum of the squared differences between the t X and t Y scores is 12.60.
- What is r?
- Is r statistically significant? Write the results as you would in a report.
- What is the estimated average unsquared distance of Y scores from the regression line?
- What percent improvement is obtained when s is compared to s EST ?

66 66 Answers: What is r? Is it significant?
- Compute r:
  Σ(t X - t Y)^2 = 12.60
  Σ(t X - t Y)^2 / (n P - 1) = 12.60/21 = .600
  r = 1.000 - 1/2(.600) = .700
- Is r significant? r(20) = .700, p < .01

67 67 What is the estimated average unsquared distance of scores in the population from the regression line?
- That is the same as asking “What is the estimated standard error of the estimate?”
SS RES = SS Y - (SS Y * r^2) = 420.00 - [420.00 * (0.70)^2] = 214.20
MS RES = 214.20/20 = 10.71
s EST = 3.27

68 68 What percent improvement is obtained when s is compared to s EST ?
- MS W = SS W /df = 420.00/21 = 20.00
- s = square root of MS W = 4.47

69 69 Last and (perhaps) least:
- Proportion improvement = (s - s EST)/s
- (4.47 - 3.27)/4.47 = .268
- Percent improvement = proportion improvement * 100
- In this case there was about a 26.8% improvement in unsquared error when you use the regression equation rather than the mean as your basis for predicting Y scores.
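The whole practice problem from slides 65 to 69 can be run end to end. This is an illustrative sketch with a hypothetical function name; it simply chains the formulas already given in the slides.

```python
def practice_problem(ss_y, n_pairs, sum_sq_t_diff):
    """Return r, s, s EST, and the proportion improvement."""
    r = 1 - 0.5 * sum_sq_t_diff / (n_pairs - 1)
    s = (ss_y / (n_pairs - 1)) ** 0.5                         # error using the mean
    s_est = ((ss_y - ss_y * r ** 2) / (n_pairs - 2)) ** 0.5   # error using regression
    improvement = (s - s_est) / s
    return r, s, s_est, improvement


r, s, s_est, improvement = practice_problem(420.00, 22, 12.60)
print(f"r = {r:.3f}, s = {s:.2f}, s EST = {s_est:.2f}, "
      f"improvement = {improvement:.1%}")
```

The significance check is left out of the function because it depends on the r table row for df = 20, which is not among the rows reproduced in the slides.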

70 70 Some final notes on types of error and alpha

71 71 Error Types: Type 1 Error
- Type 1 error occurs when you accidentally get a random sample with an r outside the range predicted by the null hypothesis even though rho = 0.000. This forces you to reject the null hypothesis when there really is no relationship between X and Y in the population as a whole.
- Scientists are conservative and set up conditions to avoid Type 1 errors.

72 72 Error Types: Type 2 Error
- A Type 2 error can only occur when there really is a correlation between X and Y in the population, but you accidentally get a sample r that falls within the range predicted by the null hypothesis. You must then fail to reject the null and assume rho = 0.000.
- This is incorrect and results in Type 2 error.

73 73 Alpha levels
- Any result can be found by chance.
- However, some results are so strong that they are very unlikely to occur by chance.
- Unlikely is defined as occurring by chance 5 (or fewer) times in 100.
- The risk of getting a weird sample that causes a Type 1 error is called alpha.
- Almost universally in the biomedical sciences, alpha = .05.

