BIOL 582 Lecture Set 11 Bivariate Data Correlation Regression.



BIOL 582 Considering Multiple Variables

Thus far, we have considered whether means of a response variable differ among groups. Sometimes it is of interest to know whether a variable covaries with another variable, or whether the value of one variable can predict the value of another. With bivariate data, two values are measured on each population (or experimental) unit. We denote the data as ordered pairs $(x_i, y_i)$. The data can be both qualitative, one qualitative and one quantitative, or both quantitative. In some examples, $x_i$ is the independent (predictor) variable and $y_i$ is the dependent (response) variable. Although it might not be readily apparent, we have been working all along with qualitative (nominal) independent variables (e.g., grouping variables). Now we are going to shift gears and look at continuous quantitative independent variables.

For bivariate quantitative variables, the data can be shown in a scatter plot. [Figure: scatter plot.]

[Figure: annotated scatter plot. Axis labels for the variables include units. Points are ordered pairs $(x_i, y_i)$, e.g., (21.56, 0.32) and (36.77, 1.36). The axes are labeled as the independent (predictor) variable and the dependent (response) variable.]

Is there a linear relationship for the data?

[Figure: five scatter plot panels of y against x, showing a positive linear relationship, a negative linear relationship, no relationship, and non-linear relationships.]

Correlation

The Linear Correlation Coefficient, or Pearson Product-Moment Correlation Coefficient, is a measure of the strength of the linear relation between two quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation coefficient and $r$ to represent the sample correlation coefficient.

Sample Correlation Coefficient:

$$r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1}$$

where $\bar{x}$ is the sample mean of the predictor variable, $s_x$ is the sample standard deviation of the predictor variable, $\bar{y}$ is the sample mean of the response variable, $s_y$ is the sample standard deviation of the response variable, and $n$ is the number of individual units in the sample.

Here is a computationally easier way to calculate $r$:

$$r = \frac{\sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{\sqrt{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}} \, \sqrt{\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}}}$$
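As a rough check that the definitional and computational forms of $r$ agree, here is a minimal Python sketch. The function names and the five data pairs are hypothetical (the pupfish measurements are not reproduced in this transcript), and the computational form used is the standard sums-of-products identity.

```python
import math


def r_definitional(x, y):
    """Sample r as the sum of products of z-scores, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    z_products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                  for xi, yi in zip(x, y))
    return sum(z_products) / (n - 1)


def r_computational(x, y):
    """Sample r from the column sums of x, y, x^2, y^2, and x*y."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    numerator = sum_xy - sum_x * sum_y / n
    denominator = (math.sqrt(sum_x2 - sum_x ** 2 / n)
                   * math.sqrt(sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


# Hypothetical data, used only to illustrate the arithmetic.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_def = r_definitional(x, y)
r_comp = r_computational(x, y)
print(f"definitional r = {r_def:.4f}, computational r = {r_comp:.4f}")
```

The computational form needs only the five column sums, which is why the worked example adds columns for $x_i^2$, $y_i^2$, and $x_i y_i$.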

BIOL 582 Scatter Diagrams; Correlation

Consider the pupfish example. [Table: columns $i$, $x_i$, $y_i$; values not recoverable.] Add 3 more columns.

Consider the pupfish example with the three added columns and their sums. [Table: columns $i$, $x_i$, $y_i$, $x_i^2$, $y_i^2$, $x_i y_i$, with a sum row; values not recoverable.]

More on correlation coefficients

r      Meaning
1.0    Perfectly positively correlated
0.8    Strongly positively correlated
0.2    Weakly positively correlated
0      Not correlated
-0.2   Weakly negatively correlated
-0.8   Strongly negatively correlated
-1.0   Perfectly negatively correlated

Match: [Figure: four scatter plots of y against x.] r = 0.1, r = 0.3, r = 0.9, r = 0.7

More on correlation coefficients (negative values)

Match: [Figure: four scatter plots of y against x.] r = -0.1, r = -0.3, r = -0.9, r = -0.7

More on correlation coefficients: WARNINGS

Question: Does a correlation coefficient of 0 mean no association or no relationship? [Table: columns $i$, $x_i$, $y_i$, $x_i^2$, $y_i^2$, $x_i y_i$; values not recoverable.] In this example, $y_i = x_i^2$ and $r = 0$. Thus, $r = 0$ could mean no association or a non-linear relationship.
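A small Python sketch of this warning follows. The x values are hypothetical (chosen symmetric about zero, since the slide's table is not available), and `pearson_r` is an illustrative helper rather than anything defined in the lecture.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hypothetical x values, symmetric about zero; y is an exact function of x.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(f"r = {r:.4f}")  # y depends perfectly on x, yet r is 0
```

Here $y$ is completely determined by $x$, but the relationship is not linear, so the linear correlation coefficient does not detect it.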

More on correlation coefficients: WARNINGS

Question: How do extreme points affect correlation? [Two pairs of data tables with columns $i$, $x_i$, $y_i$; values not recoverable.] In the first pair, $r = 0$ for one data set and $r > 0.99$ for the other. In the second pair, $r = 1$ for one data set and $r = 0$ for the other.
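The slide's data tables did not survive, so the following Python sketch uses hypothetical points chosen to reproduce the same pattern: a single extreme point can create a near-perfect correlation where there was none, or erase a perfect one. The helper `pearson_r` and all data values are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Case 1 (hypothetical): four points on the corners of a square, r = 0.
square_x = [1.0, 1.0, 2.0, 2.0]
square_y = [1.0, 2.0, 1.0, 2.0]
r_square = pearson_r(square_x, square_y)
# Adding one extreme point far from the cluster pushes r above 0.99.
r_square_plus_outlier = pearson_r(square_x + [100.0], square_y + [100.0])

# Case 2 (hypothetical): four points exactly on a line, r = 1.
line_x = [1.0, 2.0, 3.0, 4.0]
line_y = [1.0, 2.0, 3.0, 4.0]
r_line = pearson_r(line_x, line_y)
# Adding one extreme point off the line brings r down to 0.
r_line_plus_outlier = pearson_r(line_x + [15.0], line_y + [2.0])

print(f"square: r = {r_square:.3f}; with extreme point: r = {r_square_plus_outlier:.4f}")
print(f"line:   r = {r_line:.3f}; with extreme point: r = {r_line_plus_outlier:.3f}")
```

In both cases one observation out of five determines the value of $r$, which is why a scatter plot should be inspected before a correlation coefficient is interpreted.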

More on correlation coefficients: WARNINGS

Question: Does correlation mean causation? Pupfish data (MR = metabolic rate, mg O2/hr): [Table: columns $i$, $x_i$ (length), $y_i$ (weight), $z_i$ (MR); values not recoverable.] The correlations are $r = 0.94$ (length and weight) and $r = 0.92$ (weight and MR). But the correlation between length and MR is also strong: $r = 0.84$. Neither length nor weight “causes” an increase in MR. MR happens to be biologically, positively associated with weight. Weight also happens to have a positive association with length. Thus, it appears that length and MR are related when they are not really directly related. Remember, causation can only be inferred from an experimental approach.

BIOL 582 Least-Squares Regression

We have considered whether or not there is a linear relationship between two variables; now let’s consider how to describe the relationship. [Figure: scatter plot of the pupfish data with a fitted line and its equation.] The fitted line is a line of “best fit” for the linear relationship. It is usually found by Least-Squares Regression.

Least-Squares Regression Criterion: The least-squares regression line is the one that minimizes the sum of squared errors. It is the line that minimizes the sum of squared vertical distances between the observed values of $y$ and those predicted by the line, $\hat{y}$ (“y-hat”). We represent this as: Minimize $\sum \text{residuals}^2$. [Figure: MR (mg O2/hr) vs. weight (g) in pupfish, with a fitted line labeled y = 0.90x; any further terms of the label are not recoverable.]

[Figure: fitted line with observed values, predicted values, and residuals marked.] Note: Some residuals are positive, some are negative. Therefore, we try to minimize $\sum \text{residuals}^2$. This will (1) minimize the sum of positive values and (2) be analogous to calculating variance.

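[Figure: a different line drawn through the same data, with observed values, predicted values, and residuals marked.] Why is this not a better line? Although not readily apparent, $\sum \text{residuals}^2$ for this line is greater than $\sum \text{residuals}^2$ for the least-squares line.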
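The comparison can be checked numerically. The Python sketch below uses five illustrative points and a hypothetical competitor line ($\hat{y} = 2x$, drawn through the first and last points); neither is taken from the slide's figure, whose data are not available. The least-squares coefficients are computed with the standard closed form, slope $= \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$.

```python
def least_squares_coefficients(x, y):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between observed y and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Illustrative data (an assumption; not the data behind the slide's figure).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.2, 6.0, 9.8, 10.0]

b0, b1 = least_squares_coefficients(x, y)
sse_least_squares = sum_squared_residuals(x, y, b0, b1)

# Hypothetical competitor: the line through the first and last points.
sse_alternative = sum_squared_residuals(x, y, 0.0, 2.0)

print(f"least-squares line: SSE = {sse_least_squares:.3f}")
print(f"alternative line:   SSE = {sse_alternative:.3f}")
```

The alternative line passes exactly through three of the five points, yet its sum of squared residuals (6.48) is larger than that of the least-squares line (5.184).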

So how do we find the “best fit” line to describe our linear relationship? [Figure: a line through the points $(x_1, y_1)$ and $(x_2, y_2)$, with the $y$-intercept marked.]

Any line can be described as $y = b_0 + b_1 x$, where $b_0$ is the $y$-intercept and $b_1$ is the slope of the line. In Least-Squares Regression, we define the linear relationship as:

$$\hat{y} = b_0 + b_1 x$$

What this equation means is that for any value of $x$, we can predict a value of $y$ (called y-hat), if we know the $y$-intercept, $b_0$, and the slope, $b_1$. We can find the slope and intercept (in succession) with the following formulae:

$$b_1 = r \cdot \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

The resulting equation minimizes the sum of squared residuals.
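One way to carry out the "slope first, then intercept" calculation is sketched below in Python. It assumes the standard forms $b_1 = r \, s_y / s_x$ and $b_0 = \bar{y} - b_1 \bar{x}$; the function name and the data are hypothetical and are not the pupfish measurements.

```python
import math


def least_squares_line(x, y):
    """Return (b0, b1, r): intercept, slope, and sample correlation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # Sample correlation coefficient.
    r = sum((xi - x_bar) * (yi - y_bar)
            for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x          # slope, found first
    b0 = y_bar - b1 * x_bar     # intercept, found from the slope
    return b0, b1, r


# Hypothetical data, used only to illustrate the arithmetic.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1, r = least_squares_line(x, y)
y_hat_at_6 = b0 + b1 * 6.0  # predicted y for a new x value

print(f"y-hat = {b0:.2f} + {b1:.2f} x   (r = {r:.4f})")
print(f"predicted y at x = 6: {y_hat_at_6:.2f}")
```

Because $b_0 = \bar{y} - b_1 \bar{x}$, the fitted line always passes through the point $(\bar{x}, \bar{y})$.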

Let’s consider the pupfish example, with length as $x_i$ and weight as $y_i$. [Table: columns $i$, $x_i$, $x_i^2$, $y_i$, $y_i^2$, $x_i y_i$, with a Σ row; values not recoverable.] We need to calculate: [equation not recoverable] -or- [equation not recoverable]. Here is something to think about: the numerator is the “Sum of Squares”. Thus, it should be straightforward that [equation not recoverable]. And each is easy to calculate with our data.

Review: The steps of Least-Squares Regression:
1. Plot bivariate data.
2. Calculate means for $x_i$ and $y_i$.
3. Calculate SS, standard deviations (or both), and correlation coefficient.
4. Calculate slope.
5. Calculate $y$-intercept.
6. Describe linear equation.
7. Calculate the Coefficient of Determination.

BIOL 582 The Coefficient of Determination

The Coefficient of Determination, $R^2$, measures the percentage of total variation in the response variable that is explained by the least-squares regression line. Recall the least-squares regression criterion: the least-squares regression line minimizes the sum of squared errors ($\sum \text{residuals}^2$). $R^2$ is a value between 0 and 1, AND FOR SIMPLE LINEAR REGRESSION, it is the same as $r^2$. (It is not the same as $r^2$ for multiple or non-linear regression.) An $R^2$ of 0 means that none of the total variation is explained by the regression line (plot A) and an $R^2$ of 1 means all of the variation is explained by the regression line (plot B). A value in between describes the proportion of explained variation. [Figure: plot A, $R^2 = 0$; plot B, $R^2 = 1$.]

So what is meant by “explained” and “unexplained” variation? Consider this example, with observed points (1, 2), (2, 2.2), (3, 6), (4, 9.8), (5, 10). [Figure: scatter plot of y against x showing observed and predicted values and the fitted line. The equation label survives only as “= 2.36x” and the $R^2$ value is missing.]

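The fitted equation and $R^2$ for this example are only partly legible in the transcript, but both can be recomputed from the five listed points. The Python sketch below does so with the standard least-squares closed form; the variable names are illustrative, and the recomputed slope of 2.36 agrees with the surviving part of the label.

```python
import math

# The five observed points listed in the example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.2, 6.0, 9.8, 10.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * xi for xi in x]

# Total and unexplained (error) sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, predicted))

r_squared = 1 - sse / sst

# For simple linear regression, R^2 equals the squared correlation.
r = ss_xy / math.sqrt(ss_xx * sst)

print(f"y-hat = {slope:.2f} x + ({intercept:.2f})")
print(f"R^2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```

For these points the unexplained variation is a little under 9% of the total, so about 91% of the variation in $y$ is explained by the line.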

[Figure: for one observation, the total deviation is split into the explained deviation and the residual (unexplained deviation).] That is, total deviation = unexplained deviation + explained deviation: $y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$. Analogously, but algebraically too difficult to worry about,

Total variation = Unexplained variation + Explained variation
SS (Total) = SS (error) + SS (R)

where R stands for “regression”. (Note: sometimes M is used for “model”.)
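For readers who do want the algebra, a short sketch follows. It assumes the least-squares line $\hat{y}_i = b_0 + b_1 x_i$ and writes $e_i = y_i - \hat{y}_i$ for the residuals; the only facts used are the two least-squares conditions $\sum e_i = 0$ and $\sum e_i x_i = 0$.

```latex
\begin{aligned}
\sum (y_i - \bar{y})^2
  &= \sum \bigl[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\bigr]^2 \\
  &= \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2
     + 2 \sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\[4pt]
\text{Since } \hat{y}_i - \bar{y} &= b_1 (x_i - \bar{x}): \\
\sum e_i (\hat{y}_i - \bar{y})
  &= b_1 \Bigl( \sum e_i x_i - \bar{x} \sum e_i \Bigr) = 0 \\[4pt]
\text{so}\quad
\underbrace{\sum (y_i - \bar{y})^2}_{SS(\text{Total})}
  &= \underbrace{\sum (y_i - \hat{y}_i)^2}_{SS(\text{error})}
   + \underbrace{\sum (\hat{y}_i - \bar{y})^2}_{SS(R)}
\end{aligned}
```

The cross term vanishes only for the least-squares line, which is why the decomposition does not hold for an arbitrary line drawn through the data.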

The pupfish data (length and weight): [Table: columns $i$, $x_i$, $x_i^2$, $y_i$, $y_i^2$, $x_i y_i$, with a Σ row; values not recoverable.]

$$SSE = \sum (y_i - \hat{y}_i)^2, \qquad SST = \sum (y_i - \bar{y})^2$$

$R^2 = 1 - SSE/SST = 1 - 0.08/0.67 = 0.88$

Note: This is the same as $r^2 = 0.88$.

BIOL 582 Final Comments

One can only square the correlation coefficient to get the coefficient of determination for the case of simple linear regression. If one does multiple regression, or ANCOVA (a combination of regression and factorial ANOVA), then the full or partial coefficient of determination is for the SS of all effects or one of the effects, respectively, with respect to the total SS. Values will not be the same as squaring correlation coefficients.

ANOVA on regression models is pretty much the same as before.

For simple linear regression, randomization can be used. Simply randomize values of $y$ and hold $x$ constant. This will be demonstrated next time.
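As a preview of the idea, here is a minimal Python sketch of a randomization test for simple linear regression. It is not the procedure from the next lecture: the choice of $|r|$ as the test statistic, the number of permutations, the fixed seed, and the function names are all assumptions made for illustration. The data are the five points from the coefficient of determination example.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def randomization_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value by shuffling y while holding x fixed."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(x, y))
    shuffled_y = list(y)
    count = 1  # the observed arrangement counts as one permutation
    for _ in range(n_permutations):
        rng.shuffle(shuffled_y)
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(x, shuffled_y)) >= observed - 1e-12:
            count += 1
    return count / (n_permutations + 1)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.2, 6.0, 9.8, 10.0]

p_value = randomization_p_value(x, y)
print(f"observed r = {pearson_r(x, y):.3f}, randomization p-value = {p_value:.3f}")
```

Shuffling $y$ breaks any association with $x$ while keeping both sets of values unchanged, so the shuffled correlations show what values of $r$ arise by chance alone. With only five points there are just 120 possible orderings, so the estimated p-value cannot be very small.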