BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression.


1 BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression

2 Thus far, we have considered whether means of a response variable differ among groups. Sometimes it is of interest to know whether a variable covaries with another variable, or whether the value of one variable can predict the value of another. With bivariate data, two values are measured on each population (or experimental) unit. We denote the data as ordered pairs (x_i, y_i). The data can be both qualitative, one qualitative and one quantitative, or both quantitative. In some examples, x_i is the independent (predictor) variable and y_i is the dependent (response) variable. Although it might not be readily apparent, we have been working all along with qualitative (nominal) independent variables (e.g., grouping variables). Now we are going to shift gears and look at continuous quantitative independent variables. BIOL 582 Considering Multiple Variables

3 Bivariate quantitative variables are displayed with a scatter plot. [Figure: scatter plot]

4 [Figure: annotated scatter plot] Axis labels for the variables include units. Points are ordered pairs (x_i, y_i), for example (21.56, 0.32) and (36.77, 1.36). The first value in each pair is the independent (predictor) variable and the second is the dependent (response) variable.

5 Is there a linear relationship for the data?

6 [Figure: scatter plots of y against x illustrating a positive linear relationship, a negative linear relationship, no relationship, and non-linear relationships]

7 Correlation
The Linear Correlation Coefficient, or Pearson Product-Moment Correlation Coefficient, is a measure of the strength of the linear relation between two quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient.
Sample Correlation Coefficient:
$$r = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$
where $\bar{x}$ is the sample mean of the predictor variable, $s_x$ is the sample standard deviation of the predictor variable, $\bar{y}$ is the sample mean of the response variable, $s_y$ is the sample standard deviation of the response variable, and $n$ is the number of individual units in the sample.
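
As a minimal sketch of the sample correlation coefficient as a sum of products of standardized values divided by n - 1, the snippet below uses the pupfish length and weight values that appear later in the lecture. The helper names (`mean`, `sample_sd`, `pearson_r`) are illustrative choices, not part of the lecture.

```python
import math

# Pupfish data from the lecture: x = length, y = weight.
x = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
y = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    """Sum of products of standardized x and y values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = ((a - x_bar) / s_x * (b - y_bar) / s_y for a, b in zip(xs, ys))
    return sum(products) / (n - 1)


r = pearson_r(x, y)
print(round(r, 2))  # 0.94, the value the lecture reports for length vs. weight
```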

8 Correlation
Here is a computationally easier way to calculate r:
$$r = \frac{\sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{\sqrt{\left(\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right)\left(\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}\right)}}$$

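9 BIOL 582 Scatter Diagrams; Correlation
Consider the pupfish example:

| i | x_i | y_i |
|---|-------|------|
| 1 | 21.56 | 0.32 |
| 2 | 28.87 | 0.81 |
| 3 | 28.50 | 0.63 |
| 4 | 28.96 | 0.70 |
| 5 | 27.00 | 0.55 |
| 6 | 32.50 | 0.92 |
| 7 | 30.39 | 0.67 |
| 8 | 36.77 | 1.36 |
| 9 | 29.39 | 0.61 |

Add 3 more columns.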

10 Consider the pupfish example. The three added columns are x_i², y_i², and x_i y_i (still empty on this slide).

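11 Consider the pupfish example:

| i | x_i | y_i | x_i² | y_i² | x_i y_i |
|-----|--------|------|---------|------|--------|
| 1 | 21.56 | 0.32 | 464.83 | 0.10 | 6.90 |
| 2 | 28.87 | 0.81 | 833.48 | 0.66 | 23.38 |
| 3 | 28.50 | 0.63 | 812.25 | 0.40 | 17.96 |
| 4 | 28.96 | 0.70 | 838.68 | 0.49 | 20.27 |
| 5 | 27.00 | 0.55 | 729.00 | 0.30 | 14.85 |
| 6 | 32.50 | 0.92 | 1056.25 | 0.85 | 29.90 |
| 7 | 30.39 | 0.67 | 923.55 | 0.45 | 20.36 |
| 8 | 36.77 | 1.36 | 1352.03 | 1.85 | 50.01 |
| 9 | 29.39 | 0.61 | 863.77 | 0.37 | 17.93 |
| sum | 263.94 | 6.57 | 7873.85 | 5.46 | 201.56 |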
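
The column sums in this table are exactly what the sums-based shortcut for r needs. The following sketch recomputes the five sums from the raw pupfish values and plugs them into that shortcut; it assumes the standard form of the formula (cross-product sum corrected by the product of the totals over n, divided by the square root of the two corrected sums of squares).

```python
import math

# Pupfish data from the table: x = length, y = weight.
x = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
y = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]
n = len(x)

# The five column sums from the table.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Sums-based formula for r.
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

print(round(sum_x, 2), round(sum_y, 2), round(sum_x2, 2),
      round(sum_y2, 2), round(sum_xy, 2))
print(round(r, 2))
```

The printed sums should agree with the "sum" row of the table to two decimals, and r should agree with the 0.94 quoted later for length against weight.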

12 More on correlation coefficients

| r | Meaning |
|------|---------------------------------|
| 1.0 | Perfectly positively correlated |
| 0.8 | Strongly positively correlated |
| 0.6 | |
| 0.4 | Weakly positively correlated |
| 0.2 | |
| 0 | Not correlated |
| -0.2 | |
| -0.4 | Weakly negatively correlated |
| -0.6 | |
| -0.8 | Strongly negatively correlated |
| -1.0 | Perfectly negatively correlated |

Match each scatter plot to its correlation: r = 0.1, r = 0.3, r = 0.9, r = 0.7. [Figure: four scatter plots of y against x]

13 More on correlation coefficients
The same scale of r values applies (1.0 = perfectly positively correlated, 0 = not correlated, -1.0 = perfectly negatively correlated).
Match each scatter plot to its correlation: r = -0.1, r = -0.3, r = -0.9, r = -0.7. [Figure: four scatter plots of y against x]

14 More on correlation coefficients
WARNINGS
Question: Does a correlation coefficient of 0 mean no association or no relationship?

| i | x_i | y_i | x_i² | y_i² | x_i y_i |
|---|-----|-----|------|------|---------|
| 1 | -2 | 4 | 4 | 16 | -8 |
| 2 | -1 | 1 | 1 | 1 | -1 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 | 1 |
| 5 | 2 | 4 | 4 | 16 | 8 |

Here r = 0, yet y_i = x_i². Thus, r = 0 could mean no association or a non-linear relationship.
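
A short check of this warning: the sketch below computes r for the five points with y = x² using the sums-based formula. The variable names are illustrative.

```python
import math

# Five points lying exactly on the parabola y = x^2.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
sum_xy = sum(a * b for a, b in zip(x, y))

numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

# y is completely determined by x, yet the linear correlation is zero.
print(r)
```

The cross-products cancel because the x values are symmetric about zero, so the numerator is zero even though the relationship is exact.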

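15 More on correlation coefficients
WARNINGS
Question: How do extreme points affect correlation?

Data set A:

| i | x_i | y_i |
|---|-----|-----|
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 0 |

Data set B:

| i | x_i | y_i |
|---|-----|-----|
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 2 | 1 |
| 4 | 2 | 2 |
| 5 | 14 | 14 |

With the extreme fifth point included: r = 0 for data set A and r > 0.99 for data set B.

16 More on correlation coefficients
WARNINGS
Question: How do extreme points affect correlation?
With the extreme fifth point left out of the same two data sets: r = 1 for data set A and r = 0 for data set B.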
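
The effect of a single extreme point can be checked directly. This is a sketch under one assumption: the final row of the second data set is read as the point (14, 14), which is consistent with the reported r > 0.99 but is not certain from the transcript. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(points):
    """Pearson correlation from a list of (x, y) pairs, using the sums-based formula."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


line = [(1, 1), (2, 2), (3, 3), (4, 4)]     # perfectly linear without the extreme point
cluster = [(1, 1), (1, 2), (2, 1), (2, 2)]  # no linear trend without the extreme point

r_line = pearson_r(line)
r_line_outlier = pearson_r(line + [(5, 0)])
r_cluster = pearson_r(cluster)
r_cluster_outlier = pearson_r(cluster + [(14, 14)])  # (14, 14) is an assumed reading

print(r_line, r_line_outlier)        # one point takes r from 1 to 0
print(r_cluster, r_cluster_outlier)  # one point takes r from 0 to above 0.99
```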

17 More on correlation coefficients
WARNINGS
Question: Does correlation mean causation?
Pupfish data (MR = metabolic rate, mgO2/hr):

| i | Length (x_i) | Weight (y_i) | MR (z_i) |
|---|-------|------|------|
| 1 | 21.56 | 0.32 | 0.18 |
| 2 | 28.87 | 0.81 | 0.44 |
| 3 | 28.50 | 0.63 | 0.54 |
| 4 | 28.96 | 0.70 | 0.53 |
| 5 | 27.00 | 0.55 | 0.46 |
| 6 | 32.50 | 0.92 | 0.53 |
| 7 | 30.39 | 0.67 | 0.43 |
| 8 | 36.77 | 1.36 | 1.20 |
| 9 | 29.39 | 0.61 | 0.32 |

Length and weight: r = 0.94. Weight and MR: r = 0.92. But the correlation between length and MR is also strong: r = 0.84.
Neither length nor weight “causes” an increase in MR. MR happens to be positively associated with weight for biological reasons. Weight also happens to have a positive association with length. Thus, it appears that length and MR are related when they are not really directly related. Remember, causation can only be inferred from an experimental approach.
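
The three pairwise correlations quoted on this slide can be reproduced from the table. A minimal sketch, with `pearson_r` as an illustrative helper based on deviations from the means:

```python
import math

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]  # metabolic rate


def pearson_r(xs, ys):
    """Pearson correlation from sums of squared deviations and cross-products."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    ss_xx = sum((a - x_bar) ** 2 for a in xs)
    ss_yy = sum((b - y_bar) ** 2 for b in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


r_lw = pearson_r(length, weight)
r_wm = pearson_r(weight, mr)
r_lm = pearson_r(length, mr)

print(round(r_lw, 2), round(r_wm, 2), round(r_lm, 2))  # 0.94 0.92 0.84
```

All three pairs are strongly correlated, which is why the correlations alone cannot say which variable, if any, drives metabolic rate.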

18 We have considered whether or not there is a linear relationship between two variables; now let’s consider how to describe the relationship. [Figure: scatter plots of the pupfish length, weight, and MR data from the previous slide (r = 0.94 and r = 0.92), with a fitted line and its equation] This is a line of “best fit” for the linear relationship. It is usually found by Least-Squares Regression. This is the equation of the line. BIOL 582 Least-Squares Regression

19 Least-Squares Regression Criterion
The least-squares regression line is the one that minimizes the sum of squared errors. It is the line that minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line, $\hat{y}$ (“y-hat”). We represent this as: Minimize $\sum \text{residuals}^2$.
[Figure: MR vs. Weight in pupfish, with fitted line y = 0.90x - 0.14; x-axis: Weight (g); y-axis: MR (mgO2/hr)]

20 [Figure: fitted line with the observed value, the predicted value, and the residual marked for one point] Note: Some residuals are positive, some are negative. Therefore, we try to minimize $\sum \text{residuals}^2$. This will (1) make every term positive, so that positive and negative residuals cannot cancel, and (2) be analogous to calculating variance.

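21 [Figure: a different line through the same data, with observed value, predicted value, and residual marked] Why is this not a better line? Although not readily apparent, $\sum \text{residuals}^2$ for this line > $\sum \text{residuals}^2$ for the least-squares line.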
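
To make the comparison concrete, the sketch below fits the least-squares line for MR against weight and compares its sum of squared residuals with that of another line. The alternative slope and intercept (1.10 and -0.30) are hypothetical values chosen only for illustration; they do not come from the lecture.

```python
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    ss_xx = sum((a - x_bar) ** 2 for a in xs)
    slope = ss_xy / ss_xx
    return y_bar - slope * x_bar, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between observed and predicted y."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))


b0, b1 = fit_line(weight, mr)
sse_best = sum_squared_residuals(weight, mr, b0, b1)
sse_other = sum_squared_residuals(weight, mr, -0.30, 1.10)  # hypothetical line

print(round(b1, 2), round(b0, 2))  # 0.9 -0.14, the line shown for MR vs. weight
print(sse_best < sse_other)        # True
```

Any other choice of slope and intercept gives the same verdict, because the least-squares line is defined as the minimizer of this sum.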

22 So how do we find the “best fit” line to describe our linear relationship? [Figure: a line through the points (x_1, y_1) and (x_2, y_2)]

23 So how do we find the “best fit” line to describe our linear relationship? [Figure: the same line through (x_1, y_1) and (x_2, y_2), with the y-intercept marked]

24 So how do we find the “best fit” line to describe our linear relationship?
Any line can be described as y = b_0 + b_1 x, where b_0 is the y-intercept and b_1 is the slope of the line. In Least-Squares Regression, we define the linear relationship as:
$$\hat{y} = b_0 + b_1 x$$
What this equation means is that for any value of x, we can predict a value of y (called y-hat), if we know the y-intercept, $b_0$, and the slope, $b_1$. We can find the slope and intercept (in succession) with the following formulae:
$$b_1 = r \cdot \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
The resulting equation minimizes the sum of squared residuals!

25 So how do we find the “best fit” line to describe our linear relationship? Let’s consider the pupfish example, with x_i = length and y_i = weight (the nine pairs tabulated on slide 9). We need to calculate: [equation not preserved in the transcript] -or- [equation not preserved in the transcript]

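26 So how do we find the “best fit” line to describe our linear relationship? Let’s consider the pupfish example (x_i = length, y_i = weight):

| i | x_i | x_i² | y_i | y_i² | x_i y_i |
|---|--------|---------|------|------|--------|
| 1 | 21.56 | 464.83 | 0.32 | 0.10 | 6.90 |
| 2 | 28.87 | 833.48 | 0.81 | 0.66 | 23.38 |
| 3 | 28.50 | 812.25 | 0.63 | 0.40 | 17.96 |
| 4 | 28.96 | 838.68 | 0.70 | 0.49 | 20.27 |
| 5 | 27.00 | 729.00 | 0.55 | 0.30 | 14.85 |
| 6 | 32.50 | 1056.25 | 0.92 | 0.85 | 29.90 |
| 7 | 30.39 | 923.55 | 0.67 | 0.45 | 20.36 |
| 8 | 36.77 | 1352.03 | 1.36 | 1.85 | 50.01 |
| 9 | 29.39 | 863.77 | 0.61 | 0.37 | 17.93 |
| Σ | 263.94 | 7873.85 | 6.57 | 5.46 | 201.56 |

Here is something to think about: [equation not preserved in the transcript] The numerator is the “Sum of Squares”.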

27 So how do we find the “best fit” line to describe our linear relationship? Let’s consider the pupfish example (x_i = length, y_i = weight; table of x_i, x_i², y_i, y_i², x_i y_i and their sums as on slide 26). Thus, it should be straightforward that [equation not preserved in the transcript]. And each is easy to calculate with our data.

28 Thus, it should be straightforward that [equation not preserved in the transcript]. And each is easy to calculate with our data (same pupfish table as on slide 26).
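
A minimal sketch of the slope and intercept calculation for weight against length, assuming the standard least-squares formulas: the slope as the sum of cross-products over the sum of squares of x, which is algebraically the same as r times s_y / s_x, and the intercept as the mean of y minus the slope times the mean of x. The numerical results are computed here and are not quoted from the slides.

```python
import math

length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]  # x
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]           # y
n = len(length)

# Sums of squares and cross-products from the column sums in the table.
ss_xx = sum(a * a for a in length) - sum(length) ** 2 / n
ss_yy = sum(b * b for b in weight) - sum(weight) ** 2 / n
ss_xy = sum(a * b for a, b in zip(length, weight)) - sum(length) * sum(weight) / n

# Route 1: slope directly from the sums of squares.
b1_ss = ss_xy / ss_xx

# Route 2: slope from the correlation coefficient and the standard deviations.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
s_x = math.sqrt(ss_xx / (n - 1))
s_y = math.sqrt(ss_yy / (n - 1))
b1_r = r * s_y / s_x

# Intercept follows once the slope is known.
b0 = sum(weight) / n - b1_ss * sum(length) / n

print(round(b1_ss, 4), round(b1_r, 4), round(b0, 2))
```

Both routes give the same slope, so either the sums of squares or the standard deviations with r are enough to define the line.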

29 Review: The steps of Least-Squares Regression:
1. Plot bivariate data.
2. Calculate means for x_i and y_i.
3. Calculate SS, standard deviations (or both), and correlation coefficient.
4. Calculate slope.
5. Calculate y-intercept.
6. Describe linear equation.
7. Calculate the Coefficient of Determination.
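
The review steps can be sketched as one function. This is an illustrative outline rather than the course's own code: the function name `least_squares_summary` and the returned keys are assumptions, plotting (step 1) is left out, and the example fits MR against weight so the result can be compared with the line y = 0.90x - 0.14 shown earlier.

```python
import math


def least_squares_summary(xs, ys):
    """Steps 2-7 of the review for simple linear regression."""
    n = len(xs)
    # Step 2: means.
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Step 3: sums of squares, standard deviations, correlation coefficient.
    ss_xx = sum((a - x_bar) ** 2 for a in xs)
    ss_yy = sum((b - y_bar) ** 2 for b in ys)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_x = math.sqrt(ss_xx / (n - 1))
    s_y = math.sqrt(ss_yy / (n - 1))
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    # Step 4: slope.
    slope = r * s_y / s_x
    # Step 5: y-intercept.
    intercept = y_bar - slope * x_bar
    # Step 6: the linear equation is y-hat = intercept + slope * x.
    # Step 7: coefficient of determination (equals r squared in this setting).
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    r_squared = 1 - sse / ss_yy
    return {"slope": slope, "intercept": intercept, "r": r, "r_squared": r_squared}


weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]
mr = [0.18, 0.44, 0.54, 0.53, 0.46, 0.53, 0.43, 1.20, 0.32]

result = least_squares_summary(weight, mr)
print({key: round(value, 2) for key, value in result.items()})
```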

30 The Coefficient of Determination, R², measures the proportion of total variation in the response variable that is explained by the least-squares regression line. Recall the least-squares regression criterion: the least-squares regression line minimizes the sum of squared errors ($\sum \text{residuals}^2$). R² is a value between 0 and 1, AND FOR SIMPLE LINEAR REGRESSION, it is the same as r². (It is not the same as r² for multiple or non-linear regression.) An R² of 0 means that none of the total variation is explained by the regression line (plot A) and an R² of 1 means all of the variation is explained by the regression line (plot B). A value in between describes the proportion of explained variation. [Figure: plot A, R² = 0; plot B, R² = 1] BIOL 582 The Coefficient of Determination

31 So what is meant by “explained” and “unexplained” variation? Consider this example, with observed points (1, 2), (2, 2.2), (3, 6), (4, 9.8), (5, 10). [Figure: scatter plot of y against x showing observed and predicted values, with fitted line $\hat{y} = 2.36x - 1.08$ and R² = 0.9148]

32 [Figure: the same example, points (1, 2), (2, 2.2), (3, 6), (4, 9.8), (5, 10) with fitted line $\hat{y} = 2.36x - 1.08$ and R² = 0.9148]

33 [Figure: the total deviation of one point split into the explained deviation and the residual]
Total deviation = Residual (unexplained deviation) + Explained deviation
$$(y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
Analogously, but algebraically too difficult to worry about,
Total variation = Unexplained variation + Explained variation
SS (Total) = SS (error) + SS (R)
where R stands for “regression” (Note: sometimes M is used for “model”).
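
The partition of the total sum of squares can be verified numerically on the five-point example from the preceding slides. A minimal sketch, with illustrative variable names:

```python
# Five-point example from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 2.2, 6, 9.8, 10]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

# The three sums of squares.
sst = sum((b - y_bar) ** 2 for b in y)               # total variation
sse = sum((b - p) ** 2 for b, p in zip(y, y_hat))    # unexplained (error)
ssr = sum((p - y_bar) ** 2 for p in y_hat)           # explained (regression)

r2 = 1 - sse / sst

print(round(b1, 2), round(b0, 2))                    # 2.36 -1.08
print(round(sst, 3), round(sse, 3), round(ssr, 3))   # 60.88 5.184 55.696
print(round(r2, 4))                                  # 0.9148
```

The error and regression sums of squares add up to the total, and R² matches the value printed on the example plot.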

34 The pupfish data (x_i = length, y_i = weight; x_i, x_i², y_i², and x_i y_i columns as on slide 26):

| i | y_i | y_i - ŷ_i | (y_i - ŷ_i)² | y_i - ȳ | (y_i - ȳ)² |
|---|------|-------|------|-------|------|
| 1 | 0.32 | 0.12 | 0.01 | -0.41 | 0.17 |
| 2 | 0.81 | 0.13 | 0.02 | 0.08 | 0.01 |
| 3 | 0.63 | -0.03 | 0.00 | -0.10 | 0.01 |
| 4 | 0.70 | 0.01 | 0.00 | -0.03 | 0.00 |
| 5 | 0.55 | -0.01 | 0.00 | -0.18 | 0.03 |
| 6 | 0.92 | 0.00 | 0.00 | 0.19 | 0.04 |
| 7 | 0.67 | -0.11 | 0.01 | -0.06 | 0.00 |
| 8 | 1.36 | 0.16 | 0.02 | 0.63 | 0.40 |
| 9 | 0.61 | -0.11 | 0.01 | -0.12 | 0.01 |
| Σ | 6.57 | 0.16 | 0.08 (SSE) | 0.00 | 0.67 (SST) |

R² = 1 - SSE/SST = 1 - 0.08/0.67 = 0.88
Note: This is the same as r² = 0.94² = 0.88.

35 BIOL 582 Final Comments
One can only square the correlation coefficient to get the coefficient of determination in the case of simple linear regression.
If one does multiple regression, or ANCOVA (a combination of regression and factorial ANOVA), then the full or partial coefficient of determination is for the SS of all effects or one of the effects, respectively, with respect to the total SS. Values will not be the same as squaring correlation coefficients.
ANOVA on regression models is pretty much the same as before.
For simple linear regression, randomization can be used. Simply randomize values of y and hold x constant. This will be demonstrated next time.
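
The randomization idea can be sketched ahead of that demonstration. This is one possible version, not necessarily the procedure the course uses: it assumes the absolute value of r as the test statistic, 9999 random shuffles of y with x held fixed, and a fixed seed so the result is repeatable. The function names are illustrative.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    ss_xx = sum((a - x_bar) ** 2 for a in xs)
    ss_yy = sum((b - y_bar) ** 2 for b in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def randomization_p_value(xs, ys, n_perm=9999, seed=1):
    """Share of arrangements of y whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 1  # the observed arrangement counts as one of the arrangements
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomize y, hold x constant
        if abs(pearson_r(xs, shuffled)) >= observed:
            extreme += 1
    return extreme / (n_perm + 1)


length = [21.56, 28.87, 28.50, 28.96, 27.00, 32.50, 30.39, 36.77, 29.39]
weight = [0.32, 0.81, 0.63, 0.70, 0.55, 0.92, 0.67, 1.36, 0.61]

p_value = randomization_p_value(length, weight)
print(p_value)
```

Because shuffling y breaks any real pairing with x, the shuffled correlations show what values of r arise by chance alone; a small p-value means the observed r is unusual against that background.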

