Correlation & Simple Linear Regression. Chung-Yi Li, PhD, Dept. of Public Health, College of Med., NCKU


1 Correlation & Simple Linear Regression. Chung-Yi Li, PhD, Dept. of Public Health, College of Med., NCKU

2 Bi-variate Analyses
- Association between one continuous variable and one categorical variable: two-independent-sample t-test, one-way ANOVA, and their non-parametric counterparts
- Association between two categorical variables: chi-square test and Fisher’s exact test
- Association between two continuous variables: Pearson’s correlation coefficient, Spearman’s rho correlation coefficient, and simple linear regression

3 Correlation

4 Correlations
- Bi-variate correlation (ρ) is an indicator of the strength and direction of the relationship between two variables
  - Related to the regression slope for the two variables through their variance terms
- A correlation “matrix” is frequently used to screen for important relationships
- Correlation or association does not necessarily reflect “causality”

5 Correlation Matrix
- Correlation between “immunization rate” and “under-5 mortality rate”

6 Correlation Analysis
- Draw a scatter plot before calculating the correlation coefficient
- Only “linear” correlation is discussed here

7 Immunization Rate (%) & Under-5 Mortality Rate (per 1,000 Live Births) in 20 Countries

9 No Correlation

10 No Correlation?

12 Notes, cont.
- Linear relations only: correlation applies only to linear relationships. This figure shows a strong non-linear relationship, yet r = 0.00.
- Correlation does not necessarily mean causation. Beware lurking variables (next slide).
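
The point that a strong non-linear relationship can still give r = 0.00 can be checked numerically. The sketch below uses invented data (a symmetric parabola), not the data behind the slide's figure, and a hand-written helper `pearson_r` that is an illustrative name rather than anything from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x ** 2 for x in xs]        # symmetric parabola
ys_linear = [2 * x + 1 for x in xs]     # exact straight line

r_curved = pearson_r(xs, ys_curved)
r_linear = pearson_r(xs, ys_linear)

print(f"parabola: r = {r_curved:.2f}")
print(f"line:     r = {r_linear:.2f}")
```

Because the parabola is symmetric about the mean of x, the positive and negative cross-products cancel, so r carries no trace of the relationship even though y is a deterministic function of x.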

13 Confounded Correlation
A near-perfect negative correlation (r = −0.987) was seen between cholera mortality and elevation above sea level during a 19th-century epidemic. We now know that cholera is transmitted by water. The observed relationship between cholera and elevation was confounded by the lurking variable “proximity to polluted water.”

14 Direction of Association
- Direction of association can be determined from both scatter plots and correlation coefficients
- The shape of the scatter plot may reflect the direction of association
- The correlation coefficient ranges from −1 (perfect inverse relationship) to +1 (perfect positive relationship)

15 Strength of Association
- Can be judged from scatter plots
- A “flat” (narrow, elongated) scatter plot indicates a strong association, while a “round” scatter plot represents “no” association

16 The strength of association between two continuous variables can also be determined by correlation coefficients. The magnitude of a correlation coefficient indicates the strength of association: the larger the magnitude, the stronger the association.

17 Examples of correlations

18 Pearson’s Correlation Coefficient
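
For reference, the standard sample definition of Pearson’s r is given below. It is stated here as an assumption about what this slide displayed, since only the slide title is available; x̄ and ȳ denote the sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```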

20 Hypothesis Test
- We conduct the hypothesis test to guard against identifying too many random correlations.
- Random selection from a random scatter can result in an apparent correlation.
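
How often a random scatter yields an apparent correlation can be illustrated by simulation. This is a minimal sketch under stated assumptions: the two variables are drawn independently from a standard normal distribution, the cutoff |r| > 0.5 and the sample sizes are arbitrary choices, and the helper names are hypothetical. It also includes the usual test statistic for H0: ρ = 0, t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, referred to a t distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def share_of_large_r(n, trials=2000, cutoff=0.5, seed=1):
    """Fraction of samples of size n, drawn from two independent variables,
    whose sample correlation exceeds the cutoff in absolute value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(xs, ys)) > cutoff:
            hits += 1
    return hits / trials

share_small = share_of_large_r(n=5)    # tiny samples
share_large = share_of_large_r(n=50)   # moderate samples

print(f"n = 5:  share with |r| > 0.5 = {share_small:.3f}")
print(f"n = 50: share with |r| > 0.5 = {share_large:.3f}")
```

With very small samples a sizeable share of purely random scatters shows |r| > 0.5, while with n = 50 this almost never happens, which is why the observed r is judged against its sampling distribution rather than taken at face value.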

21 Correlation Analysis: Statistical Inference for ρ
Determining a 95% CI for ρ is complicated by the fact that r can be considered to come from an approximately normal distribution only when ρ = 0. For values of ρ other than 0, Fisher’s Z transformation, z = ½ ln[(1 + r)/(1 − r)], must be employed.
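
A confidence interval based on Fisher’s Z transformation can be sketched as follows. The function name `fisher_ci` and the example values r = 0.52 and n = 25 are illustrative assumptions; the sketch relies on the standard results that z = atanh(r) is approximately normal with standard error 1/sqrt(n − 3), and that the interval is transformed back to the r scale with tanh.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for rho via Fisher's Z transformation.

    z_crit = 1.96 is the two-sided 95% critical value of the standard normal.
    """
    z = math.atanh(r)                  # same as 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)        # standard error on the z scale
    z_low, z_high = z - z_crit * se, z + z_crit * se
    return math.tanh(z_low), math.tanh(z_high)   # back-transform to the r scale

# Illustrative values only.
ci_low, ci_high = fisher_ci(r=0.52, n=25)
print(f"95% CI for rho: ({ci_low:.3f}, {ci_high:.3f})")
```

The resulting interval is not symmetric around r, which reflects the skewed sampling distribution of r when ρ is not 0.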

23 Assumptions for Pearson’s Correlation Coefficient
- Variables must be continuous
- Variables must be normally distributed
What if the above assumptions do not hold? Answer: use a non-parametric approach.

24 Spearman’s Rank Correlation Coefficient

25 Spearman Rank-order Correlation Coefficient
Two professors assessed 12 students. The following table shows the results. What is the correlation between the two sets of scores assigned by the two professors?

26
Student id   Prof. A (xi)   Prof. B (yi)   di = xi-yi   di² = (xi-yi)²
1            2.5*           5              -2.5         6.25
2            2.5*           2**            0.5          0.25
3            9              8              1.0          1.00
4            5.5*           7              -1.5         2.25
5            12             12             0            0
6            7.5*           11             -3.5         12.25
7            1              2**            -1.0         1.00
8            10             6              4.0          16.00
9            4              2**            2.0          4.00
10           5.5*           4              1.5          2.25
11           7.5*           10             -2.5         6.25
12           11             9              2.0          4.00
Sum                                                     55.50
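
The question on slide 25 can be answered from the slide 26 table with the usual shortcut formula r_s = 1 − 6Σd²/(n(n² − 1)). The sketch below assumes the tabulated values are ranks with ties already averaged; because the table contains ties, the shortcut gives an approximation to the rank correlation rather than the exact tie-corrected value.

```python
# Ranks taken from the slide 26 table (students 1 to 12).
rank_a = [2.5, 2.5, 9, 5.5, 12, 7.5, 1, 10, 4, 5.5, 7.5, 11]   # Prof. A
rank_b = [5, 2, 8, 7, 12, 11, 2, 6, 2, 4, 10, 9]               # Prof. B

n = len(rank_a)

# Sum of squared rank differences; the table reports 55.50.
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))

# Shortcut formula for Spearman's rank correlation.
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"sum of d^2 = {sum_d2:.2f}")
print(f"Spearman r_s = {r_s:.3f}")
```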

27 Simple Linear Regression

28 Simple Linear Regression
- ANOVA is an extension of the t-test to multiple group means
- Linear regression extends ANOVA to continuous predictor variables
  - Systolic blood pressure predicted by body mass index
  - Body mass index predicted by caloric intake
  - Caloric intake predicted by a measure of stress
- Important to specify a biologically plausible model
  - Systolic blood pressure predicted by eye color
  - Body mass index predicted by visual acuity (however, may relate to inactivity due to lack of visual acuity)

29 Regression describes the relationship in the data with a line that predicts the average change in Y per unit of X. The best-fitting line is found by minimizing the sum of squared residuals, as shown in this figure.

30 Assumptions for Linear Regression Model
- Assumptions are important in linear regression, but are not absolute (LINE)
  - Predictor variables are “fixed”; i.e., they have the same meaning among individuals
  - Predictor variables are measured “without error”
  - For each value of the predictor variable, there is a normal (N) distribution of outcomes (subpopulations), and the variances of these distributions are equal (E)

31 Assumptions (continued)
  - The means of the outcome subpopulations lie on a straight line related to the predictors; i.e., the predictors and the outcomes are linearly related (L): μ_{Y|x} = α + βx
  - The outcomes are independent of each other (I)
- Regression model: y = α + βx + ε

32 Graphic Presentation of Model Assumptions

33 Interpretation of Regression Model
- Two parameters
  - α is the intercept (Y value) when the predictor is zero
    - May not be really plausible
  - β is the “slope” of the regression line and represents the change in Y for a unit change in X
    - i.e., a slope of 0.58 would indicate that for a one-unit change in X, there is a 0.58-unit change in Y
- ε is the error term for each individual and is the residual for that individual
  - The residual is the difference between the fitted line (predicted value) and the observed value

34 Approach to Developing a Regression Model
- Determine outcome and plausible predictors
- Plot outcome vs. each predictor to check for linearity
- Fit the regression model and review parameters and tests
- If the model has a significant fit and parameters are significantly different from 0, look at residuals to better evaluate fit

35 Maternal average body weight during pregnancy (MW, kg) and infant birth weight (BW, g)
ID         MW         BW
1          49.4       3515
2          63.5       3742
3          68         3629
4          52.2       2680
5          54.4       3006
6          70.3       4068
7          50.8       3373
8          73.9       4124
9          65.8       3572
10         54.4       3359
11         73.5       3230
12         59         3572
13         61.2       3062
14         52.2       3374
15         63.1       2722
16         65.8       3345
17         61.2       3714
18         55.8       2991
19         61.2       4026
20         56.7       2920
21         63.5       4152
22         59         2977
23         49.9       2764
24         65.8       2920
25         43.1       2693
Mean       59.748     3341.2
SD         7.860403   464.3844
Variance   61.78593   215652.8

36 Scatter plot of mother’s pregnancy weight (kg) (X) versus infant’s birth weight (g) (Y)

37 Fitting a linear regression line

39 Regression Line
The regression line equation is ŷ = a + bx, where ŷ ≡ predicted value of Y, a ≡ the intercept of the line, and b ≡ the slope of the line.
Equations to calculate a and b:
- SLOPE: b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
- INTERCEPT: a = ȳ − b·x̄
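
As a worked illustration, the least-squares slope and intercept can be computed for the maternal weight and birth weight data listed on slide 35. This is a sketch using the standard least-squares formulas (slope = Sxy/Sxx, intercept = mean of y minus slope times mean of x); the variable names are illustrative, and the fitted values are not quoted from the lecture.

```python
# Data from slide 35: maternal weight (kg) and infant birth weight (g), IDs 1 to 25.
mw = [49.4, 63.5, 68.0, 52.2, 54.4, 70.3, 50.8, 73.9, 65.8, 54.4,
      73.5, 59.0, 61.2, 52.2, 63.1, 65.8, 61.2, 55.8, 61.2, 56.7,
      63.5, 59.0, 49.9, 65.8, 43.1]
bw = [3515, 3742, 3629, 2680, 3006, 4068, 3373, 4124, 3572, 3359,
      3230, 3572, 3062, 3374, 2722, 3345, 3714, 2991, 4026, 2920,
      4152, 2977, 2764, 2920, 2693]

n = len(mw)
mean_x = sum(mw) / n
mean_y = sum(bw) / n

# Sums of cross-products and squares about the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mw, bw))
sxx = sum((x - mean_x) ** 2 for x in mw)

slope = sxy / sxx                      # change in birth weight (g) per kg of maternal weight
intercept = mean_y - slope * mean_x    # fitted birth weight at a maternal weight of 0 kg

print(f"mean MW = {mean_x:.3f} kg, mean BW = {mean_y:.1f} g")
print(f"fitted line: BW = {intercept:.1f} + {slope:.2f} * MW")
```

The fitted line always passes through the point of means, and the intercept corresponds to a maternal weight of 0 kg, which lies far outside the data and has no practical interpretation.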

40 Regression Line
Slope b is the key statistic produced by the regression.

42 Coefficient of Determination (R²)

43 R² = 1406178/5175668 = 0.27; F = t² = 2.929² = 8.580
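
The figures on this slide can be reproduced from the two sums of squares alone. The sketch below assumes that 1406178 is the regression sum of squares, that 5175668 is the total sum of squares, and that n = 25 (the number of mother-infant pairs on slide 35); in simple linear regression the F statistic has 1 and n − 2 degrees of freedom and equals the square of the slope’s t statistic.

```python
import math

n = 25                        # number of observations (slide 35)
ss_regression = 1406178.0     # sum of squares explained by the regression
ss_total = 5175668.0          # total sum of squares of birth weight
ss_residual = ss_total - ss_regression

# Coefficient of determination: share of total variation explained by the line.
r_squared = ss_regression / ss_total

# F = (regression mean square) / (residual mean square), with 1 and n - 2 df.
f_stat = (ss_regression / 1) / (ss_residual / (n - 2))

# With a single predictor, t for the slope is the square root of F.
t_stat = math.sqrt(f_stat)

print(f"R^2 = {r_squared:.2f}")
print(f"F   = {f_stat:.3f}")
print(f"t   = {t_stat:.3f}")
```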

44 Testing Assumptions by Analyzing Residuals
(N) Normal Distribution
If the relationships are linear and the dependent variable is normally distributed for each value of the independent variable, the distribution of the residuals should be approximately normal. This can be assessed by using a histogram of the standardized residuals.

46 Testing Assumptions by Analyzing Residuals
(E) Homoscedasticity
To check this assumption, the residuals can be plotted against the predicted values and against the independent variables.
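
The quantities that go into such residual plots can be computed as sketched below. The data are invented for illustration, the helper `fit_line` is a hypothetical name, and "standardized" here simply means each residual divided by the residual standard deviation (statistical packages may use a leverage-adjusted version). Comparing the residual spread between the lower and upper halves of the fitted values is only a crude numerical stand-in for inspecting the plot.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9, 10.2, 10.7]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Residual standard deviation uses n - 2 degrees of freedom.
resid_sd = math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))
standardized = [e / resid_sd for e in residuals]

# Crude equal-variance check: residual spread in the lower vs. upper half
# of the fitted values; similar spreads are consistent with homoscedasticity.
pairs = sorted(zip(fitted, residuals))
half = len(pairs) // 2
spread_low = math.sqrt(sum(e ** 2 for _, e in pairs[:half]) / half)
spread_high = math.sqrt(sum(e ** 2 for _, e in pairs[half:]) / (len(pairs) - half))

for f, z in zip(fitted, standardized):
    print(f"fitted = {f:6.2f}   standardized residual = {z:+.2f}")
print(f"residual spread: lower half {spread_low:.3f}, upper half {spread_high:.3f}")
```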

48 Residual Plots
With a little experience, you can get good at reading residual plots. Here’s an example of linearity with equal variance.

49 Residual Plots
Example of linearity with unequal variance

50 Testing Assumptions by Analyzing Residuals
(L) Linearity
When standardized predicted values are plotted against the observed values, the data will form a straight line from the lower-left corner to the upper-right corner if the model fits the data exactly.

52 Example of Residual Plots
Example of non-linearity with equal variance

