
Slide 1: Experimental Statistics, Week 10. Chapter 11: Linear Regression and Correlation

Slide 2: Example Probability Plots for Various Data Shapes (figure)

Slide 3: December 2000 Unemployment Rates in the 50 States (figure)

Slide 4: Distribution of Monthly Returns for All U.S. Common Stocks, 1951–2000 (figure)

Slide 5: Distribution of Individual Salaries of Cincinnati Reds Players on Opening Day of the 2000 Season (figure)

Slide 6: Back to Correlation and Regression

Slide 7: Association Between Two Variables
- Regression analysis: we want to predict the dependent variable using the independent variable.
- Correlation analysis: measures the strength of the linear association between two quantitative variables.

Slide 8: Calculating the Correlation Coefficient

Slide 9: Notation: S_xx = Σ(x − x̄)², S_yy = Σ(y − ȳ)², S_xy = Σ(x − x̄)(y − ȳ). So r = S_xy / √(S_xx · S_yy).

Slide 10: The data below are the study times and the test scores on an exam given over the material covered during the two weeks. Find r.

Study Time (hours, X):  10  15  12  20   8  16  14  22
Exam Score (Y):         92  81  84  74  85  80  84  80
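Worked solution (the arithmetic is not on the slide; it reproduces the r in the SAS output on slide 15): S_xx = 1869 − 117²/8 = 157.875, S_yy = 54638 − 660²/8 = 188, S_xy = 9519 − (117)(660)/8 = −133.5, so r = −133.5 / √(157.875 × 188) ≈ −0.775.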

Slide 11: r calculated from a set of data is an estimate of a theoretical parameter ρ (also written ρ_yx), in the same way that the sample average ȳ is an estimate of the population mean μ. Population parameter ρ: if ρ = 0, there is no linear relationship between the two variables.

Slide 12: Testing Statistical Significance of the Correlation Coefficient. Test statistic: t = r√(n − 2) / √(1 − r²), with df = n − 2. Rejection region: t > t_{α/2} or t < −t_{α/2}.

Slide 13: For the study-time data of slide 10, test H₀: ρ = 0 vs. Hₐ: ρ ≠ 0.

Slide 14: Correlation Between Study Time and Score. Test H₀: ρ = 0 vs. Hₐ: ρ ≠ 0. (Fill in:) Rejection region: ___  Test statistic: ___  P-value: ___  Conclusion: ___
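Worked answers (not on the slide; they agree with the SAS output on slide 15): rejection region |t| > t_{0.025, 6} = 2.447; test statistic t = r√(n − 2)/√(1 − r²) = −0.7749·√6 / √(1 − 0.6005) ≈ −3.00; p-value = 0.0239; since −3.00 < −2.447, reject H₀ and conclude there is a significant (negative) linear association.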

Slide 15: Study Time by Score

The CORR Procedure
2 Variables: score time

Simple Statistics
Variable   N   Mean      Std Dev   Sum        Minimum   Maximum
score      8   82.50000  5.18239   660.00000  74.00000  92.00000
time       8   14.62500  4.74906   117.00000   8.00000  22.00000

Pearson Correlation Coefficients, N = 8
Prob > |r| under H0: Rho=0
           score      time
score    1.00000  -0.77490
                    0.0239
time    -0.77490   1.00000
          0.0239
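The slide shows only the output; a minimal SAS sketch that would produce it (the data-set name study is illustrative, not from the slides):

DATA study;
   INPUT time score;   * study time (hours) and exam score;
   DATALINES;
10 92
15 81
12 84
20 74
8 85
16 80
14 84
22 80
;

PROC CORR DATA=study;
   VAR score time;     * Pearson correlation of score with time;
RUN;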

Slide 16: (figure only; no text)

Slide 17: Sections 11.1–11.5, Regression Analysis

Slide 18: Notation. Theoretical model: y = β₀ + β₁x + ε. Regression line: ŷ = b₀ + b₁x, where b₀ and b₁ are evaluated from the data.

Slide 19: For data (x₁, y₁), …, (xₙ, yₙ) we write yᵢ = β₀ + β₁xᵢ + εᵢ.

Slide 20: (figure only; no text)

Slide 21: Least Squares Estimates: b₁ = S_xy / S_xx and b₀ = ȳ − b₁x̄. Computation formulas: S_xy = Σxy − (Σx)(Σy)/n and S_xx = Σx² − (Σx)²/n.

Slide 22: For the study-time data of slide 10, find the equation of the regression line for predicting exam score from study time.

Slide 23: Calculations, Study Time Data. Equation of the regression line:
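Worked calculations (not spelled out in the transcript; they reproduce the estimates in the output on slide 24): S_xy = 9519 − (117)(660)/8 = −133.5 and S_xx = 1869 − 117²/8 = 157.875, so b₁ = −133.5/157.875 ≈ −0.8456 and b₀ = 82.5 − (−0.8456)(14.625) ≈ 94.87. Regression line: ŷ = 94.87 − 0.846x.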

Slide 24:

The GLM Procedure
Dependent Variable: score
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   112.8883610      112.8883610   9.02      0.0239
Error              6    75.1116390       12.5186065
Corrected Total    7   188.0000000

R-Square   Coeff Var   Root MSE   score Mean
0.600470   4.288684    3.538164   82.50000

Source   DF   Type I SS     Mean Square   F Value   Pr > F
time      1   112.8883610   112.8883610   9.02      0.0239

Source   DF   Type III SS   Mean Square   F Value   Pr > F
time      1   112.8883610   112.8883610   9.02      0.0239

Parameter   Estimate      Standard Error   t Value   Pr > |t|
Intercept   94.86698337   4.30408629       22.04     <.0001
time        -0.84560570   0.28159265       -3.00     0.0239

PROC REG;
   MODEL score=time;
RUN;

Slide 25: To predict Y for a given x, plug x into the regression equation and solve for ŷ. Example: if a student studied 10 hours, the predicted score would be ŷ = 94.867 − 0.8456(10) ≈ 86.4.

Slide 26: Notes: SSE = Σ(yᵢ − ŷᵢ)² is called the sum of squared residuals, SS(Residuals). s² = SSE/(n − 2) is the estimate of the error variance σ².

Slide 27: Testing for Significance of the Regression. If knowing x is of absolutely no help in predicting Y, then it seems reasonable that the regression line for predicting Y from x should have slope 0. That is, to test for a "significant regression" we test H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0. Test statistic: t = b₁ / (s/√S_xx), where t has n − 2 df. Rejection region: |t| > t_{α/2}.

Slide 28: Study Time Data (carrying out the slope test).
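Worked test (not spelled out in the transcript; it matches the output on slide 29): t = −0.84561/0.28159 ≈ −3.00 with df = 6. Since |−3.00| > t_{0.025, 6} = 2.447, reject H₀: β₁ = 0 (p = 0.0239): study time is a significant predictor of score.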

Slide 29:

PROC GLM;
   MODEL score=time;
RUN;

Output identical to slide 24 (same ANOVA table, R-Square = 0.600470, and parameter estimates).

Slide 30: Study Time by Score: the PROC CORR output of slide 15 repeated (r = −0.77490, p = 0.0239).

Slide 31: Note: the t value for testing H₀: β₁ = 0 and the t value for testing H₀: ρ = 0 are the same (here t = −3.00, p = 0.0239). Both tests depend on the normality assumption.

Slide 32: Recall the one-sample test about a mean. In general: t = (ȳ − μ₀)/(s/√n), df = n − 1.

Slide 33: A (1 − α)100% confidence interval for μ: ȳ ± t_{α/2}·s/√n, df = n − 1.

Slide 34: Similarly, a (1 − α)100% confidence interval for the slope β₁ is b₁ ± t_{α/2}·SE(b₁).

Slide 35: Here df = n − 2. One can also find a confidence interval for β₀, but it is not as useful. Alternative form: b₁ ± t_{α/2}·s/√S_xx, writing SE(b₁) out explicitly.
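Worked illustration (not on the slides; computed from the study-time fit on slide 24): a 95% CI for β₁ is −0.8456 ± 2.447(0.28159) = −0.8456 ± 0.689, i.e. (−1.53, −0.16). The interval excludes 0, consistent with rejecting H₀: β₁ = 0.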

Slide 36: Prediction setting: given a new value x_{n+1} of the predictor, we may want to estimate the mean response μ_{Y|x_{n+1}} or to predict the individual response y_{n+1}.

Slide 37: Two intervals. 1. Confidence interval on μ_{Y|x_{n+1}}: ŷ_{n+1} ± t_{α/2}·s·√(1/n + (x_{n+1} − x̄)²/S_xx), df = n − 2.

Slide 38: (figure only; no text)

Slide 39: (figure only; no text)

Slide 40: 2. Prediction interval for y_{n+1}: ŷ_{n+1} ± t_{α/2}·s·√(1 + 1/n + (x_{n+1} − x̄)²/S_xx). Notes: the prediction interval is always wider than the confidence interval on the mean, and both are narrowest at x_{n+1} = x̄.

Slide 41: (figure only; no text)
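Worked illustration (not on the slides; computed from the study-time fit): at x = 10, ŷ = 86.41 with s = 3.538, n = 8, x̄ = 14.625, S_xx = 157.875. The 95% CI on the mean score of all students who study 10 hours is 86.41 ± 2.447(3.538)√(1/8 + (10 − 14.625)²/157.875) = 86.41 ± 4.42, i.e. (81.99, 90.83); the 95% prediction interval for one such student is 86.41 ± 2.447(3.538)√(1 + 1/8 + (10 − 14.625)²/157.875) = 86.41 ± 9.72, i.e. (76.69, 96.13).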

Slide 42: Extrapolation: predicting beyond the range of the predictor variable.

Slide 43: (figure) Predict the price of a car that weighs 3500 lbs; extrapolation would say it's about $16,000.

Slide 44: (figure, now showing the full data) Predict the price of a car that weighs 3500 lbs; extrapolation would say it's about $16,000. Oops!

Slide 45: Extrapolation: predicting beyond the range of the predictor variable is NOT a good idea.

Slide 46: Analysis of Variance Approach. Mathematical fact (p. 649): SS(Total) = SS(Regression) + SS(Residuals), where SS(Total) = S_yy, SS(Regression) is the SS "explained" by the model, and SS(Residuals) is the SS "unexplained" by the model.
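For the study-time data this identity can be read directly off the output on slide 24: 188.000 = 112.888 + 75.112.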

Slide 47: Plot of Production vs. Cost (figure)

Slide 48: SS(???) (figure)

Slide 49: SS(???) (figure)

Slide 50: SS(???) (figure)

Slide 51: r² = SS(Regression)/SS(Total) measures the proportion of the variability in Y that is explained by the regression on X.

Slide 52: Data with one response Y and two candidate predictors (column header "Y X X"; the table layout was lost in transcription): 12 8 7 12 4 15 11 10 15 12 20 8 17 14 24 7 8 12 4 12 11 15

Slide 53: Two PROC REG fits of the same response y on the two predictors:

The REG Procedure, Dependent Variable: y
Source                          DF   Sum of Squares
Model (= SS(Reg))                1    19.575
Error (= SS(Res))                6   174.425
Corrected Total (= SS(Total))    7   194.000

The REG Procedure, Dependent Variable: y
Source                          DF   Sum of Squares
Model (= SS(Reg))                1   170.492
Error (= SS(Res))                6    23.508
Corrected Total (= SS(Total))    7   194.000
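So r² = 19.575/194 ≈ 0.10 for the first predictor but r² = 170.492/194 ≈ 0.88 for the second: the second regression explains far more of the variability in y even though SS(Total) is the same.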

Slide 54: Recall: theoretical model y = β₀ + β₁x + ε; regression line ŷ = b₀ + b₁x; residuals eᵢ = yᵢ − ŷᵢ.

Slide 55: Residual Analysis. Examine the residuals to help determine whether:
- the assumptions are met
- the regression model is appropriate
Residual plot: plot of x vs. the residuals.

Slide 56: (figure only; no text)

Slide 57: (figure only; no text)

Slide 58: Study Time Data

PROC REG;
   MODEL score=time;
   OUTPUT OUT=new R=resid;   * save residuals to data set NEW;
RUN;

PROC GPLOT;
   TITLE 'Plot of Residuals';
   PLOT resid*time;          * residuals vs. the predictor;
RUN;

Slide 59: Average Height of Girls by Age (figure)

Slide 60: Average Height of Girls by Age (figure)

Slide 61: Residual Plot (figure)

Slide 62: Residual Analysis. Examine the residuals to help determine whether:
- the assumptions are met
- the regression model is appropriate
Residual plot: plot of x vs. the residuals.
Normality of residuals: check with a probability plot or a histogram (a SAS sketch follows).
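The slides do not show code for these checks; a minimal sketch, assuming the residuals were saved in data set NEW by the PROC REG step on slide 58:

PROC UNIVARIATE DATA=new;
   VAR resid;
   HISTOGRAM resid / NORMAL;   * histogram of residuals with a normal curve overlay;
   PROBPLOT resid;             * normal probability plot of the residuals;
RUN;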

