Introduction to Biostatistics, Harvard Extension School, Spring, 2007 © Scott Evans, Ph.D. and Lynne Peeples, M.S.
Correlation and Simple Linear Regression

Presentation transcript:

1 Correlation and Simple Linear Regression

2 Where are we going now?
Variables of interest?
- One (continuous) variable: methods from before the midterm
- Two variables, both continuous:
  - Interested in predicting one from another: Simple Linear Regression
  - Interested in presence of association:
    - Both variables normal: Pearson Correlation
    - Not normal: Spearman (Rank) Correlation
*Note: If a categorical variable is ordinal, Spearman rank correlation methods are applicable.

3 Relationships Between Variables
Predictor (independent variable) | Predictee (dependent variable) | Methods
Continuous | Continuous | Correlation & Simple Linear Regression
Categorical | Continuous | ANOVA (F-test, t-test)
Continuous or Categorical | Categorical (usually binary) | Logistic Regression
Categorical | Categorical | Contingency Table
Binary | Binary | OR/RR
Continuous or Categorical | Time to Event | Proportional Hazards Regression
Categorical | Time to Event | Log Rank Test

4 Scatterplots
- Scatterplots are useful graphical tools to visualize the relationship between two continuous variables (i.e., y vs. x)
- Informative
- Easy to understand

5 Results: NPZ-3 score vs. log10(HCV RNA VL) (restricted to HCV VL+ subjects)
[Figure: scatterplot with log10(HCV RNA VL) on the x-axis and NPZ-3 score on the y-axis]

6 Correlation
- Used to assess the relationship between two continuous variables
- Example: weight and number of hours of exercise per week

7 Correlation
- Exercise and Weight example…
- How do we determine if this represents correlation?
[Figure: scatterplot of weight vs. exercise (hrs/wk)]

8 Correlation
- Range: -1 (perfect negative correlation) to +1 (perfect positive correlation)
[Figure: scatterplot of y vs. x]

9 Correlation
- A correlation of -1 means that all of the data lie on a straight line with some negative slope
- A correlation of +1 means that all of the data lie on a straight line with some positive slope
- A correlation of 0 means that the variables are not linearly correlated
  - A plot of the data would typically reveal a cloud with no discernible linear pattern

10 Correlation
- One way to visualize…
[Figure: scatterplot of y vs. x divided into quadrants I to IV]

11 Correlation
- Which is more correlated?
[Figure: two scatterplots of y vs. x, Example 1 and Example 2, each divided into quadrants I to IV]

12 Correlation
- Which is more correlated?
[Figure: two scatterplots of y vs. x, Example 1 and Example 2, each divided into quadrants I to IV]

13 Correlation
- Exercise and Weight example: we can see more data in quadrants II and IV, which indicates negative correlation
[Figure: scatterplot of weight (y) vs. exercise in hrs/wk (x), divided into quadrants I to IV]

14 Correlation Coefficient
- Measures the strength of the linear relationship
- $\rho$ = the population correlation coefficient (parameter)
- $r$ = the sample correlation coefficient (statistic)

15 Correlation Coefficient
- Can conduct hypothesis tests for $\rho$ using the t distribution
- $H_0: \rho = 0$ is often tested
- Confidence intervals for $\rho$ may also be obtained

16 Correlation Coefficient
- Two major correlation coefficients:
  1. Pearson correlation: parametric
     - Sensitive to extreme observations
  2. Spearman correlation: nonparametric
     - Based on ranks
     - Robust to extreme observations, and thus recommended particularly when “outliers” are present

17 Pearson Correlation
- Population parameter: $\rho = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_x \sigma_y}$
- Sample statistic: $r = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$

18 Spearman Correlation
- Nonparametric version
- Plug in ranks rather than raw values into the sample statistic: $r_s = \dfrac{\sum_{i=1}^{n}(x_{ri} - \bar{x}_r)(y_{ri} - \bar{y}_r)}{\sqrt{\sum_{i=1}^{n}(x_{ri} - \bar{x}_r)^2 \sum_{i=1}^{n}(y_{ri} - \bar{y}_r)^2}}$, where $x_{ri}$ and $y_{ri}$ are the ranks of the i-th observation
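As a rough illustration of the two coefficients, the sketch below computes the Pearson statistic directly and obtains the Spearman statistic by feeding average ranks into the same function. The exercise and weight values are hypothetical (they are not from the slides), and the function names are invented for this example.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: exercise (hrs/wk) and weight (lbs); the last point is an outlier in x.
exercise = [0, 1, 2, 3, 4, 5, 6, 20]
weight = [200, 195, 185, 180, 170, 165, 160, 158]

r_pearson = pearson_r(exercise, weight)
r_spearman = spearman_r(exercise, weight)
print(round(r_pearson, 3), round(r_spearman, 3))
```

In this made-up data the ranks are perfectly inversely ordered, so the rank-based coefficient is unaffected by the outlying exercise value while the Pearson coefficient is pulled toward zero.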

19 Correlation
- Two different questions:
  1. Is the correlation strong, moderate or weak?
  2. Is the correlation statistically significant? (P-value)
[Figure: number line with ticks at -1, -0.8, -0.5, -0.2, 0, 0.2, 0.5, 0.8, 1.0, labeled from left to right: Perfect (-), Strong, Moderate, Weak, None, Weak, Moderate, Strong, Perfect (+)]

20 Correlation
- Also note:
  - Never extrapolate correlation beyond the observed range of values
  - Correlation does NOT necessarily imply causation
  - Correlation does NOT measure the magnitude of the regression slope
  - No correlation does NOT mean no relationship (there could be a non-linear relationship between x and y)
[Figure: scatterplot of y vs. x showing a non-linear relationship]
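The last point can be checked numerically. In the sketch below (hypothetical data, not from the slides), y is completely determined by x through y = x², yet the sample Pearson correlation is zero because the relationship is not linear.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect quadratic relationship, symmetric about x = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_quadratic = pearson_r(x, y)
print(r_quadratic)
```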

21 Correlation: DPT Example
- Example: Diphtheria, pertussis, and tetanus (DPT) immunization rates and mortality
- Is there any association between the proportion of infants immunized and the level of mortality of children less than 5 years of age?
- Let:
  - X = Percent of infants immunized against DPT
  - Y = Under-5 mortality (number of children under 5 dying per 1,000 live births)

22 Correlation: DPT Example
- Is there a pattern?
- n = 20 countries
- e.g., Bolivia (77, 118)
  - Immunization = 77%
  - Mortality rate = 118 per 1,000 live births
[Figure: scatterplot of mortality rate vs. percent immunized]

23 Correlation: DPT Example
- X = Percent of infants immunized against DPT
- Y = Under-5 mortality (number of children under 5 dying per 1,000 live births)
- Calculate the pieces we need for the sample correlation coefficient:

24 Correlation: DPT Example
- And the correlation coefficient is: $r \approx -0.79$
- Thus there appears to be a negative association between immunization levels for DPT and under-5 mortality

25 Correlation: DPT Example
- Hypothesis test: testing for zero correlation is based on the standard error $se(r) = \sqrt{\dfrac{1 - r^2}{n - 2}}$ and the test statistic $t = \dfrac{r}{se(r)} = r\sqrt{\dfrac{n - 2}{1 - r^2}}$, which has a t distribution with n-2 d.f. under the null hypothesis

26 Correlation: DPT Example
- The test of the null hypothesis of zero correlation is constructed as follows:
  1. $H_0: \rho = 0$
  2. a. $H_a: \rho > 0$
     b. $H_a: \rho < 0$
     c. $H_a: \rho \neq 0$
  3. The level of significance is $\alpha$ (i.e., $100(1-\alpha)\%$ confidence)
  4. Rejection rule:
     a. Reject $H_0$ if $t > t_{n-2;1-\alpha}$
     b. Reject $H_0$ if $t < t_{n-2;\alpha}$
     c. Reject $H_0$ if $t > t_{n-2;1-\alpha/2}$ or $t < t_{n-2;\alpha/2}$

27 Correlation: DPT Example
- Setting $\alpha$ at 5%, we have: $t = r\sqrt{\dfrac{n-2}{1-r^2}} = -0.79\sqrt{\dfrac{18}{1-(-0.79)^2}} = -5.47$
- Since $-5.47 < t_{18;0.025} = -2.101$, we reject the null hypothesis at the 5% level of significance
- Significant negative correlation between immunization levels and under-5 mortality
- Points to an inverse relationship (as immunization levels rise, mortality decreases)
- Note that, from the correlation alone, we cannot estimate how much mortality would decrease if a country were to increase its immunization levels by 10%
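The test statistic quoted on this slide can be reproduced with a few lines of Python. This is only a sketch: it assumes the usual statistic t = r·sqrt((n−2)/(1−r²)), uses r rounded to −0.79 with n = 20, and takes the two-sided 5% critical value for 18 d.f. (2.101) from a standard t table rather than computing it.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = -0.79, 20                 # rounded sample correlation and number of countries
t_stat = correlation_t_statistic(r, n)

t_critical = 2.101               # upper 2.5% point of t with 18 d.f. (from a t table)
reject_h0 = abs(t_stat) > t_critical

print(round(t_stat, 2), reject_h0)
```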

28 Correlation: DPT Example
- STATA Output:

               | immunize   under5
     ----------+------------------
      immunize |   1.0000
               |
        under5 |  -0.7911   1.0000
               |   0.0000

  (the value 0.0000 under the correlation is the P-value for the test)

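29 Correlation: DPT Example
- Confidence intervals: based on Fisher's transformation $z_r = \dfrac{1}{2}\ln\dfrac{1+r}{1-r}$, which is approximately normal with standard error $\dfrac{1}{\sqrt{n-3}}$; the limits $z_r \pm 1.96/\sqrt{n-3}$ are then transformed back to the correlation scale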

30 Correlation: DPT Example

31 Correlation: DPT Example
- The interval (-0.913, -0.534) does not contain zero (the hypothesized value under the null hypothesis in the previous test)
- Therefore, we reject the hypothesis of zero correlation between immunization levels and under-5 mortality
- Note that the confidence interval for the correlation and the hypothesis test are based on different statistics, so these may not always match up
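The interval above can be reproduced under the assumption that it was built with Fisher's z transformation (the slide's own formula did not survive in the transcript). The sketch uses r rounded to −0.79, n = 20, and the normal critical value 1.96.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for rho via Fisher's z transformation."""
    z = math.atanh(r)              # equals 0.5 * ln((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)      # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

ci_lower, ci_upper = fisher_ci(-0.79, 20)
print(round(ci_lower, 3), round(ci_upper, 3))
```

Because this interval and the t test use different statistics, agreement between them is typical but not guaranteed.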

32 Correlation: DPT Example
- What if we had used Spearman rank correlation instead?
- Rank the immunization percentages and the mortality rates across all 20 countries
- If all ranks match up perfectly (i.e., each country's immunization % and mortality rate have the same rank), then $r_s = 1$
- If all ranks are perfectly inversely related, then $r_s = -1$
- $r_s = -0.54$ (weaker than the Pearson correlation, likely due to some non-normality/outliers in the data)
- $t_s = -2.72$, so we still reject at the 5% level of significance

33 Correlation: Assay Comparison
- Two different assays to measure LDL cholesterol
- Correlated?
- Agreement?
- One step beyond correlation…

34 Simple Linear Regression
- Correlation gives us a measure of the strength of the association between two variables, but NOT the magnitude of the change in one variable per unit change in the other
- With simple linear regression, we can now put a value on this magnitude
  - e.g., what is the average decrease in weight for every additional workout hour per week?
- Hypothesis tests and CIs may be obtained for the magnitude of association (i.e., the slope of the line)
- One may make predictions of the effect of changes in x on y

35 Simple Linear Regression
- Again, we plot the data first (scatterplot):
  - This enables us to see the relationship between the 2 variables
- Designated “Response” variable and “Predictor” (or “Explanatory”) variable
  - Unlike correlation, where there is no distinction between the 2 variables
- If the relationship is non-linear (as evidenced by the plot), then a simple linear regression model is the wrong approach

36 Simple Linear Regression
- Already determined to be correlated
- Looks like a linear trend
[Figure: scatterplot of weight (y) vs. exercise in hrs/wk (x)]

37 Simple Linear Regression
- Slope = Rise / Run = $\Delta y / \Delta x \approx$ 10 lbs / 1 hour
[Figure: scatterplot of weight vs. exercise (hrs/wk) with a fitted line, marking a rise of about 10 lbs over a run of about 1 hour]

38 Simple Linear Regression
- Used to describe and estimate the relationship between two continuous variables
- We can attempt to characterize this relationship with two parameters:
  1. an intercept, and
  2. a slope
- Several assumptions are made

39 Simple Linear Regression
- The model: $y_i = \beta_0 + \beta_1 x_i$
  - $y_i$ is the dependent variable (e.g., weight)
  - $x_i$ is the independent variable (predictor, e.g., exercise)
  - $\beta_0$ is the y-intercept (also written as $\alpha$)
  - $\beta_1$ is the slope of the line (also written as just $\beta$)
    - It describes how steep the line is and which way it leans
    - It is also the “effect” (e.g., a treatment effect) on y of a 1-unit change in x
    - Note that if there is no association between x and y, then $\beta_1$ will be close to 0 (i.e., x has no effect on y)

40 Simple Linear Regression

41 Simple Linear Regression
- Constant slope: for a fixed increase in x, there will be a fixed change in y regardless of the value of x
  - e.g., there is a fixed increase if the slope is positive and a fixed decrease if the slope is negative
- In contrast to a non-linear relationship, such as a quadratic or other polynomial, where for some values of x, y will be increasing, and for other values y will be decreasing (or vice versa)
[Figure: straight line on which equal steps $\Delta x$ produce equal changes $\Delta y$]

42 Simple Linear Regression
- For example, even though it seems that y may be increasing for increasing x, the relationship is not a perfect line (many possible “best” lines)

43 Method of Least Squares
- Use the method of least squares to find the best-fitting line (minimize the squared vertical distances from the data points to the line)
- The best guess without knowing the x's is the mean of y, $\bar{y}$
[Figure: scatterplot of y vs. x with a horizontal line at $\bar{y}$]

44 Least Squares Line: Finding “Y-hat”
- Does knowing the x's allow us a better line? Try one…
[Figure: scatterplot of y vs. x with a candidate line]

45 Least Squares Line: Finding “Y-hat”
- For each data point $(x_i, y_i)$…
[Figure: scatterplot of y vs. x with one point $(x_i, y_i)$ highlighted]

46 Least Squares Line: Finding “Y-hat”
- Find the distance to the new line, $\hat{y}$, and to the “naïve” guess, $\bar{y}$
[Figure: scatterplot of y vs. x showing, for the point $(x_i, y_i)$, its distances to the fitted line and to $\bar{y}$]

47 Least Squares Line: Finding “Y-hat”
- The regression line (whichever it is) will not pass through all data points $y_i$
- Thus, in most cases, for each point $x_i$ the line will produce an estimated point $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, and most probably $\hat{y}_i \neq y_i$
- As we see in the previous figure, $y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$
- For each choice of $(\hat{\beta}_0, \hat{\beta}_1)$ (note that each pair completely defines the line) we get a new line, and a whole new set of deviation terms $y_i - \hat{y}_i$

48 Least Squares Line: Finding “Y-hat”
- The “best-fitting line” according to the least-squares method is the one that minimizes the sum of squared deviations: $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
- We desire to capture all of the structure in the data with the model (finding the systematic component), leaving only random errors
- Thus, if we plotted the errors, we would hope that no discernible pattern exists
- If a pattern exists, then we have not captured all of the systematic structure in the data, and we should look for a better model… More on this later!
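A minimal sketch of the least-squares idea, using the standard closed-form solution (slope = Sxy/Sxx, intercept = ȳ − slope·x̄). The data are hypothetical and the function names are invented for this illustration; the comparison line is an arbitrary alternative chosen only to show that it has a larger sum of squared deviations.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_deviations(x, y, intercept, slope):
    """Sum over all points of (y_i - y_hat_i)^2 for the given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
sse_best = sum_squared_deviations(x, y, b0, b1)
sse_other = sum_squared_deviations(x, y, 2.0, 0.7)   # an arbitrary competing line

print(b0, b1, sse_best, sse_other)
```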

49 Least Squares Line: Finding “Y-hat”
[Figure: scatterplot with the point $(x_i, y_i)$ and the fitted line]

50 Variability & “Sum of Squares”
- Total variability: $SSY = \sum_{i=1}^{n}(y_i - \bar{y})^2 = SSR + SSE$
- $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$: the part of the total variability that can be explained by the regression
- $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$: the part left unexplained, because the model cannot account for differences between the estimated points and the data (this is called the error sum of squares)

51 Degrees of Freedom & “Mean Squares”
- SSY is made up of n terms
- Once the mean has been estimated, only n-1 terms are needed to compute SSY (i.e., the degrees of freedom of SSY are n-1)
- SSR has only one degree of freedom, since it is computed from a single function of the y values (the estimated slope $\hat{\beta}_1$)
- SSE has the remaining n-2 degrees of freedom

52 Degrees of Freedom & “Mean Squares”
  1. $S_y^2 = SSY/(n-1)$ is the variability of Y without consideration of X
  2. $S_{y|x}^2 = SSE/(n-2)$ is the residual variability after the relationship of Y and X has been accounted for
     - Unless $\rho = 0$, this will be less than $S_y^2$ (in the population, $\sigma_{y|x}^2 = (1-\rho^2)\sigma_y^2$)

53 Analysis of Variance Table
- Note that F equals $t^2$ when the numerator d.f. = 1
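The analysis-of-variance quantities for a simple linear regression can be sketched as follows. The data are hypothetical, the function name is invented, and the degrees of freedom follow the slides (1 for the regression, n−2 for error). The sketch shows that SSY splits into SSR + SSE and that F = MSR/MSE.

```python
def regression_anova(x, y):
    """Sums of squares, mean squares and F for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]

    ssy = sum((yi - mean_y) ** 2 for yi in y)               # total, n - 1 d.f.
    ssr = sum((fi - mean_y) ** 2 for fi in fitted)          # regression, 1 d.f.
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error, n - 2 d.f.
    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SSY": ssy, "SSR": ssr, "SSE": sse, "MSR": msr, "MSE": mse, "F": msr / mse}

# Hypothetical data
table = regression_anova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in table.items():
    print(name, round(value, 4))
```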

54 Assumptions
Assumptions of the linear regression model:
  1. For each value of x, the y values are distributed according to a normal distribution with mean $\mu_{y|x}$ and variance $\sigma^2$ that is unknown
  2. The relationship between X and Y is given by the formula $\mu_{y|x} = \beta_0 + \beta_1 x$
  3. The y values are independent
  4. For every value of x, the standard deviation of the outcomes y is constant and equal to $\sigma$ (homoscedasticity)

55 Hypothesis Testing
- In these models, y is our target (the dependent variable, i.e., the outcome that we cannot control but want to explain) and x is the explanatory (or independent) variable
- Within each regression, the primary interest is the assessment of the existence of a linear relationship between x and y
- If such an association exists, then x provides information about y

56 Hypothesis Testing
- Inference on the existence of the linear association is accomplished via tests of hypotheses and confidence intervals
- Both of these center on the estimate of the slope, since if the slope is zero, then changing x will have no impact on y (thus there is no linear association between x and y)

57 Hypothesis Testing
- The test of the hypothesis of no linear association is defined as follows:
  1. $H_0$: No linear association exists between x and y
  2. $H_a$: A linear association exists between x and y
  3. Tests are carried out at the $\alpha$ level of significance (i.e., $100(1-\alpha)\%$ confidence)
  4. The test statistic is $F = MSR / MSE$
  5. Rejection rule: Reject $H_0$ if $F > F_{1, n-2; \alpha}$, the upper-$\alpha$ critical value of the F distribution with 1 and n-2 d.f.
- This will happen if F is much larger than one
- Note that when MSR = MSE, F = 1 and we gain no information from the regression line beyond just the mean of y

58 Hypothesis Testing
- The F test of linear association tests whether a line, other than the horizontal one going through the sample mean of the Y's, is useful in explaining some of the variability in the data
- This is based on the fact that if the population regression slope $\beta_1 \approx 0$, that is, if the regression does not add anything new to our understanding of the data (does not explain a substantial part of the variability), then the two mean squares MSR and MSE are estimating a common quantity (the population variance $\sigma^2$)

59 Hypothesis Testing
- Thus the ratio should be close to 1 if the hypothesis of no linear association between X and Y is true
- On the other hand, if a linear relationship exists ($\beta_1$ is far from zero), then MSR will be much larger than MSE and the ratio will deviate significantly from 1

60 Hypothesis Testing
- The test of the hypothesis of zero slope is defined as follows (analogous to the F test for linear association):
  1. $H_0$: No linear association between x and y: $\beta_1 = 0$
  2. $H_a$: A linear association exists between x and y:
     a. $\beta_1 \neq 0$ (two-sided test)
     b. $\beta_1 > 0$ (one-sided test)
     c. $\beta_1 < 0$ (one-sided test)
  3. Tests are carried out at the $\alpha$ level of significance (i.e., $100(1-\alpha)\%$ confidence)

61 Hypothesis Testing
  4. The test statistic is $t = \hat{\beta}_1 / se(\hat{\beta}_1)$, which under $H_0$ is distributed as a t distribution with n-2 d.f.
  5. Rejection rule: Reject $H_0$ in favor of the three alternatives, respectively, if
     a. $t < t_{n-2;\alpha/2}$ or $t > t_{n-2;1-\alpha/2}$
     b. $t > t_{n-2;1-\alpha}$
     c. $t < t_{n-2;\alpha}$

62 Confidence Intervals for $\beta_1$
- Constructed as usual, and based on $\hat{\beta}_1$, the standard error of $\hat{\beta}_1$, and the t distribution
- A $100(1-\alpha)\%$ confidence interval is as follows: $\hat{\beta}_1 \pm t_{n-2;1-\alpha/2}\, se(\hat{\beta}_1)$
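The slope test and interval can be sketched in a few lines. This example assumes the standard formula se(β̂₁) = sqrt(MSE/Sxx), which is not shown in the transcript, uses hypothetical data, and takes the critical value for 3 d.f. (3.182) from a standard t table.

```python
import math

def slope_inference(x, y, t_crit):
    """Estimated slope, its standard error, the t statistic and a CI."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    se_slope = math.sqrt(mse / sxx)     # assumed standard formula for se of the slope
    t_stat = slope / se_slope           # tests H0: beta_1 = 0, n - 2 d.f.
    ci = (slope - t_crit * se_slope, slope + t_crit * se_slope)
    return slope, se_slope, t_stat, ci

# Hypothetical data with n = 5, so n - 2 = 3 d.f.; 3.182 is the upper 2.5% point of t(3).
slope, se_slope, t_stat, ci = slope_inference([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182)
print(slope, se_slope, t_stat, ci)
```

With so few hypothetical points the interval includes zero and |t| is below the critical value, so the test and the interval agree in not rejecting a zero slope. Squaring this t statistic gives the F statistic of the same regression.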

63 Simple Linear Regression: Newborn Infant Length Example
- Consider the problem described in section 18.4 of the textbook: Is there a relationship between the length and the gestational age of newborn infants?
- Scatterplot of length vs. gestational age…

64 Simple Linear Regression: Newborn Infant Length Example
- STATA Output:

65 Simple Linear Regression: Newborn Infant Length Example
- Degrees of freedom: There is one degree of freedom associated with the model, and n-2 = 98 degrees of freedom comprising the residual

66 Simple Linear Regression: Newborn Infant Length Example
- F test: This is the overall test of linear association (it measures deviations of the ratio MSR/MSE from one)
- Note that the numerator degrees of freedom is the model degrees of freedom, while the denominator degrees of freedom is the residual (error) degrees of freedom

67 Simple Linear Regression: Newborn Infant Length Example
- Rejection rule of the F test: Since the p-value (reported as 0.000) is less than $\alpha$ = 0.05, we reject the null hypothesis
- There is evidence to suggest a linear association between gestational age and length of the newborn

68 Simple Linear Regression: Newborn Infant Length Example
- Root MSE: This is the square root of the mean square for error and, as before, can be used as an estimate of the population standard deviation of y about the regression line

69 Simple Linear Regression: Newborn Infant Length Example
- gestage: Estimate of the slope, $\hat{\beta}_1$ = 0.9516035
- This means that, on average, newborn length is about 0.95 inches greater for each additional week of gestation

70 Simple Linear Regression: Newborn Infant Length Example
- p-value of the t-test: Since the p-value (reported as 0.000) is less than 0.05, we reject the null hypothesis
- There is a statistically significant positive linear relationship between gestational age and length of the newborn

71 Simple Linear Regression: Newborn Infant Length Example
- The 95% confidence interval for $\beta_1$ is [0.743, 1.160]
- Since it excludes 0, we reject the null hypothesis and conclude that there is a statistically significant positive linear relationship between gestational age and length of the newborn

72 Simple Linear Regression: Newborn Infant Length Example
- Taken literally, the estimate of the intercept, _cons = 9.328, implies that the fetus is 9.3 inches long at conception (week zero), which is clearly implausible
- Note, however, that our model is only valid over the range of gestational ages in which our data were collected

73 Correlation & Regression Final Notes
  1. Correlation does not measure the magnitude of the slope
     - The size of the slope also depends on the variability in X and Y: $\hat{\beta}_1 = r \cdot s_y / s_x$
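A small numerical check of this point, using hypothetical data: multiplying every y value by 10 (for example, a change of units) multiplies the fitted slope by 10 but leaves the correlation coefficient unchanged. The function name is invented for this illustration.

```python
import math

def slope_and_r(x, y):
    """Least-squares slope and sample Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_rescaled = [10 * yi for yi in y]      # same data, y expressed in different units

slope_1, r_1 = slope_and_r(x, y)
slope_2, r_2 = slope_and_r(x, y_rescaled)
print(slope_1, r_1)
print(slope_2, r_2)
```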

74 Correlation & Regression Final Notes
  2. Correlation is not a measure of the appropriateness of the straight-line model
     - Consider the case of a quadratic model: the correlation coefficient may be large, but a straight-line model may still be inappropriate (figure)

75 Correlation & Regression Final Notes
- Beyond the limitations of the correlation coefficient just mentioned, there are additional advantages in using a regression model
- Consider again the DPT immunization example:
  - The estimates of the slope and intercept are $\hat{\beta}_1$ = -2.136 and $\hat{\beta}_0$ = 224.32, respectively

76 Correlation & Regression Final Notes
- If an official wanted to estimate the decline in mortality for a 10 percentage point increase in immunization from current levels, they would only have to multiply 10 by the estimate of the slope, -2.136 (taking advantage of the fact that for a constant increase in x, y decreases by a constant amount)
- If DPT immunization rates were increased by 10 percentage points, the mortality rate would change by (10)(-2.136) = -21.36, a drop of about 21 deaths per 1,000 live births
- This would account for one-third of the mortality rate of Egypt, or two-thirds of the mortality rate in Russia
- Taking advantage of the properties of a straight line and the results of the regression model, a concrete public health policy could be constructed
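The arithmetic on this slide can be sketched as below, taking -2.136 as the slope (as this slide does) and 224.32 as the intercept from the DPT example. The function name is hypothetical, and 77 is used only because it is Bolivia's immunization percentage quoted earlier in the deck.

```python
INTERCEPT = 224.32   # estimated intercept from the DPT example
SLOPE = -2.136       # estimated slope: change in deaths per 1,000 per percentage point

def predicted_mortality(percent_immunized):
    """Predicted under-5 mortality (per 1,000 live births) from the fitted line."""
    return INTERCEPT + SLOPE * percent_immunized

# Change in predicted mortality for a 10 percentage point increase in immunization.
change_for_10_points = SLOPE * 10

# Predicted mortality at 77% immunized, and at 87%, within the same fitted line.
at_77 = predicted_mortality(77)
at_87 = predicted_mortality(87)

print(round(change_for_10_points, 2), round(at_77, 3), round(at_87, 3))
```

Because the model is a straight line, the difference between the two predictions equals the slope times 10 regardless of the starting immunization level, although such predictions should stay within the observed range of the data.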

77 Review
Variables of interest?
- One (continuous) variable: methods from before the midterm
- Two variables, both continuous:
  - Interested in predicting one from another: Simple Linear Regression
  - Interested in presence of association:
    - Both variables normal: Pearson Correlation
    - Not normal: Spearman (Rank) Correlation

78 Next Week… More Regression
- Predicted values
- Evaluation of the linear regression model
  - e.g., homoscedasticity, $R^2$
- Extend to Multiple Linear Regression
- Parallels with Analysis of Variance (ANOVA)

