Regression and correlation methods

Chapter 11: Regression and correlation methods

Goals To relate (associate) a continuous random variable, preferably normally distributed, to other variables Abdus Wahed BIOST 2041

Terminology
Dependent variable (Y): the variable that is supposed to depend on the others, e.g., birthweight.
Independent variable, explanatory variable, or predictor (x): a variable used to predict the dependent variable, or to explain the variation in it, e.g., estriol level.

Assumptions
Dependent variable: continuous, preferably normally distributed; has a linear association with the predictors.
Independent variable: fixed (not random).

Simple Linear Regression Model
Let Y be the dependent variable and x the lone covariate. A linear regression then assumes that the true relationship between Y and x is given by
E(Y|x) = α + βx    (1)

Simple Linear Regression Model
Equation (1) can be written as
Y = α + βx + e,    (2)
where e is an error term with mean 0 and variance σ².

Implication
If there were a perfect linear relationship, every subject with the same value of x would have a common value of Y (a deterministic relationship).
The error term takes into account the inter-patient variability.
For a given x, σ² = Var(Y) = Var(e).

Parameters
α is the intercept of the line.
β is the slope of the line, referred to as the regression coefficient. It is the change in the expected value of Y for a unit change in x.
β < 0 indicates a negative linear association (the higher the x, the smaller the Y).
β = 0 indicates no linear relationship.
β > 0 indicates a positive linear association (the higher the x, the larger the Y).

Data
Estriol (mg/24hr)   Birthweight (g/100)
x1 = 7              y1 = 25
x2 = 9              y2 = 25
x3 = 9              .

Goal
How to estimate α, β, and σ²? (Fitting regression lines)
How to draw inference? Is the relationship we see just due to chance? (Inference about regression parameters)

Fitting the Regression Line: Least Squares Method

Least squares method
Idea: estimate α and β so that the observations are "closest" to the line. Making every observation closest to the line at once is impossible.
Implementation: estimate α and β so that the sum of squared deviations is minimized.

Least squares method
Minimize Σ(yi − α − βxi)² over α and β.
Least squares estimate of β: b = [Σxiyi − (Σxi)(Σyi)/n] / [Σxi² − (Σxi)²/n]
Least squares estimate of α: a = (Σyi − bΣxi)/n
Estimated regression line: ŷ = a + bx
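
For readers who want the intermediate step, the two estimates follow from setting the partial derivatives of the sum of squares to zero. The symbol S is introduced here for the sum of squared deviations and does not appear on the slides; this is a sketch of the standard argument, not a reproduction of the author's derivation.

```latex
\begin{align*}
S(\alpha,\beta) &= \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \\
\frac{\partial S}{\partial \alpha} = 0
  &\;\Rightarrow\; \sum y_i = n a + b \sum x_i
  \;\Rightarrow\; a = \frac{\sum y_i - b \sum x_i}{n} \\
\frac{\partial S}{\partial \beta} = 0
  &\;\Rightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2
  \;\Rightarrow\; b = \frac{\sum x_i y_i - \sum x_i \sum y_i / n}
                          {\sum x_i^2 - \left(\sum x_i\right)^2 / n}
\end{align*}
```

The expression for b is obtained by substituting the expression for a into the second equation.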

Example 11.3
Estimate the regression line for the birthweight data in Table 11.1, i.e., estimate the intercept a and the slope b. We do the following calculations (see the corresponding Excel file).

Regression analysis for the data in Table 11.1
(1) Sum of products: 17500
(2) Sum of X: 534
(3) Sum of Y: 992
(4) Sum of squared X: 9876
(5) Corrected sum of products: (1) − (2)*(3)/n, Lxy = 412
(6) Corrected sum of squares of X: (4) − (2)*(2)/n, Lxx = 677.4194
(7) Regression coefficient: (5)/(6), b = Lxy/Lxx = 0.60819
Intercept: [(3) − (7)*(2)]/n, a = 21.52343
Estimated regression line: Birthweight (g/100) = 21.52 + 0.61 * Estriol (mg/24hr)
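
The arithmetic above can be checked with a short Python sketch. The sums are taken from the slide; the sample size n = 31 is not stated on this slide and is an assumption here, inferred from the F test's degrees of freedom (n − 2 = 29) in Example 11.12. Variable names are illustrative.

```python
# Summary sums from the slide. n = 31 is an ASSUMPTION inferred from the
# F(1, 29) degrees of freedom reported later in the deck.
n = 31
sum_xy = 17500.0   # (1) sum of products
sum_x = 534.0      # (2) sum of X
sum_y = 992.0      # (3) sum of Y
sum_x2 = 9876.0    # (4) sum of squared X

l_xy = sum_xy - sum_x * sum_y / n   # (5) corrected sum of products
l_xx = sum_x2 - sum_x ** 2 / n      # (6) corrected sum of squares of X
b = l_xy / l_xx                     # (7) slope
a = (sum_y - b * sum_x) / n         # intercept

print(f"Lxy = {l_xy:.4f}, Lxx = {l_xx:.4f}")
print(f"b = {b:.5f}, a = {a:.5f}")
```

With n = 31 the sketch reproduces the slide's values for Lxy, Lxx, b, and a, which supports the inferred sample size.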

Regression Analysis: Interpretation
There is a positive association (statistically significant or not, we will test later) between birthweight and estriol level. For each 1 mg/24hr increase in estriol level, the predicted birthweight of the newborn increases by about 61 g (b = 0.61 in units of g/100).

Prediction
The predicted value of Y for a given value of x is ŷ = a + bx.

Prediction
What is the estimated (predicted) birthweight if a pregnant woman has an estriol level of 15 mg/24hr?
ŷ = 21.52343 + 0.60819 × 15 = 30.65 (g/100) = 3065 g

Calibration
If low birthweight is defined as <= 2500 g, for what estriol level would the newborn be low birthweight? That is, to what estriol level does a predicted birthweight of 2500 g correspond?

Calibration
Solving 25 = a + bx for x gives x = (25 − 21.52343)/0.60819 = 5.72. Women with an estriol level of 5.72 mg/24hr or lower are expected to have low birthweight newborns.
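
Prediction and calibration are the same fitted line read in two directions, which a minimal Python sketch makes explicit. The coefficients are the unrounded values from Example 11.3; the function names are hypothetical and chosen only for this illustration.

```python
A = 21.52343   # intercept from Example 11.3, in g/100
B = 0.60819    # slope, in (g/100) per (mg/24hr)

def predict_birthweight(estriol):
    """Predicted birthweight (g/100) for an estriol level (mg/24hr)."""
    return A + B * estriol

def calibrate_estriol(birthweight):
    """Estriol level (mg/24hr) whose predicted birthweight (g/100) is given."""
    return (birthweight - A) / B

print(round(predict_birthweight(15), 2))   # 30.65
print(round(calibrate_estriol(25), 2))     # 5.72
```

Using the rounded coefficients 21.52 and 0.61 instead gives slightly different answers (about 30.67 and 5.70), so the slide values appear to come from the unrounded fit.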

Goodness of fit of a regression line
How good is x in predicting Y?
Estriol (mg/24hr)   Birthweight (g/100)   Predicted   Residual
x1 = 7              y1 = 25               25.78       r1 = -0.78
x2 = 9              y2 = 25               26.99       r2 = -1.99
x3 = 9              y3 = 25               26.99       r3 = -1.99
x4 = 12             y4 = 27               28.82       r4 = -1.82
.

Goodness of fit of a regression line
Residual sum of squares: Res SS = Σ(yi − ŷi)²
A summary measure of the distance between the observed and predicted values. The smaller the Res SS, the better the regression line is at predicting Y.

Total variation in observed Y
Total sum of squares: Total SS = Σ(yi − ȳ)²
A summary measure of the variation in Y.

Total variation in predicted Y
Regression sum of squares: Reg SS = Σ(ŷi − ȳ)²
A summary measure of the variation in predicted Y.

Goodness of fit of a regression line
It can be shown that Total SS = Reg SS + Res SS.
The smaller the residual SS, the closer the total and regression sums of squares are, and the better the regression is.

Coefficient of determination R²
R² = Reg SS / Total SS is the proportion of the total variation in Y explained by the regression on x.
R² lies between 0 and 1.
R² = 1 implies a perfect fit (all the points are on the line).

F-test
Another way of formally assessing how good the regression of Y on x is uses the F-test. The F-test compares Reg SS to Res SS; a larger F indicates a better regression fit.

F-test
Test: H0: β = 0 vs. Ha: β ≠ 0
Test statistic: F = (Reg SS / 1) / (Res SS / (n − 2))
Reject H0 if F > F_{1, n−2, 1−α}.

Summary of goodness of regression fit
We need to compute three quantities: Total SS, Reg SS, and Res SS.
Total SS = Lyy
Reg SS = b * Lxy
Res SS = Total SS − Reg SS

Example 11.12
Total SS: 674
Reg SS: 250.57
R²: 0.37 => 37% of the variation in birthweight is explained by the regression on estriol level
F: 17.16
p-value: P(F_{1,29} > 17.16) = 0.0003
H0 is rejected => The slope of the regression line is significantly different from zero, implying a statistically significant linear relationship between estriol level and birthweight.
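
The quantities in this example can be reproduced from the summary recipe (Total SS = Lyy, Reg SS = b * Lxy). The sketch below assumes n = 31, inferred from the F(1, 29) degrees of freedom, and reuses Lxy = 412 and b = 0.60819 from Example 11.3. It stops at the F statistic; the p-value would need an F distribution function, which the Python standard library does not provide.

```python
n = 31           # ASSUMPTION: inferred from the F(1, 29) degrees of freedom
l_yy = 674.0     # Total SS, from the slide
l_xy = 412.0     # corrected sum of products, from Example 11.3
b = 0.60819      # fitted slope, from Example 11.3

total_ss = l_yy
reg_ss = b * l_xy
res_ss = total_ss - reg_ss

r_squared = reg_ss / total_ss
f_stat = (reg_ss / 1) / (res_ss / (n - 2))

print(f"Reg SS = {reg_ss:.2f}, Res SS = {res_ss:.2f}")
print(f"R^2 = {r_squared:.2f}, F = {f_stat:.2f}")
```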

T-test
The same hypothesis (H0: β = 0 vs. Ha: β ≠ 0) can be tested using a t-test.
Test statistic: t = b / SE(b), where SE(b) = s / √Lxx and s² = Res SS / (n − 2).
p-value = 2 × Pr(t_{n−2} > |t|)
100(1 − α)% CI for β: b ± t_{n−2, 1−α/2} × SE(b)

Example 11.12
Is the regression coefficient (slope) for the estriol level significantly different from zero?
s² = 14.6
s = 3.82
SE(b) = 0.15
t = 4.14
p = 0.00027
95% CI for the regression coefficient: (0.31, 0.91)
H0: β = 0 is rejected => The slope of the regression line is significantly different from zero, implying a statistically significant linear relationship between estriol level and birthweight.
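
A minimal Python sketch of this calculation follows. It assumes n = 31 (inferred from the 29 degrees of freedom) and takes the critical value t_{29, 0.975} ≈ 2.045 from a standard t table, since the standard library has no t distribution; both are assumptions of this sketch rather than values stated on the slide.

```python
import math

n = 31                    # ASSUMPTION: inferred from the 29 degrees of freedom
res_ss = 674.0 - 250.57   # Total SS - Reg SS, from Example 11.12
l_xx = 677.4194           # corrected sum of squares of X, from Example 11.3
b = 0.60819               # fitted slope
t_crit = 2.045            # ASSUMPTION: t table value for 29 d.f., upper 2.5%

s2 = res_ss / (n - 2)             # residual variance estimate
s = math.sqrt(s2)
se_b = s / math.sqrt(l_xx)        # standard error of the slope
t_stat = b / se_b
ci = (b - t_crit * se_b, b + t_crit * se_b)

print(f"s^2 = {s2:.1f}, s = {s:.2f}, SE(b) = {se_b:.2f}")
print(f"t = {t_stat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Note that t² is approximately 17.16, the F statistic from the F-test, as expected for a regression with one predictor.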

Correlation
Correlation refers to a quantitative measure of the strength of the linear relationship between two variables.
Regression, on the other hand, is used for prediction.
No distinction between dependent and independent variables is made when assessing the correlation.

Correlation coefficient
Population correlation coefficient: ρ = Cov(X, Y) / (σX σY) (see section 5.4.2 in my notes).
If X and Y could be measured on everyone in the population, we could calculate ρ.

Interpretation of ρ
ρ lies between −1 and 1.
ρ = 0 implies no linear relationship.
ρ = −1 implies a perfect negative linear relationship.
ρ = +1 implies a perfect positive linear relationship.

Sample correlation coefficient
Unfortunately, we cannot measure X and Y on everyone in the population. We estimate ρ from the sample data as follows:
r = Lxy / √(Lxx × Lyy)

Interpretation of r
r lies between −1 and 1.
r = 0 implies no linear relationship.
r = −1 implies a perfect negative linear relationship.
r = +1 implies a perfect positive linear relationship.
The closer |r| is to 1, the stronger the relationship is.

Correlation: Example 11.14
(1) Sum of products: 5156.2
(2) Sum of X: 1872
(3) Sum of Y: 32.3
(4) Sum of squared X: 294320
(5) Sum of squared Y: 93.11
(6) Corrected sum of products: (1) − (2)*(3)/n, Lxy = 117.4
(7) Corrected sum of squares of X: (4) − (2)*(2)/n, Lxx = 2288
(8) Corrected sum of squares of Y: (5) − (3)*(3)/n, Lyy = 6.17
Sample correlation coefficient: (6)/sqrt[(7)*(8)], r = 0.988
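
The same steps can be sketched in Python. The sample size is not given on the slide; n = 12 is an assumption here, chosen because it is the value that reproduces the slide's Lxy, Lxx, and Lyy from the raw sums.

```python
import math

n = 12            # ASSUMPTION: the value that reproduces Lxy, Lxx, Lyy below
sum_xy = 5156.2   # (1) sum of products
sum_x = 1872.0    # (2) sum of X
sum_y = 32.3      # (3) sum of Y
sum_x2 = 294320.0 # (4) sum of squared X
sum_y2 = 93.11    # (5) sum of squared Y

l_xy = sum_xy - sum_x * sum_y / n   # (6)
l_xx = sum_x2 - sum_x ** 2 / n      # (7)
l_yy = sum_y2 - sum_y ** 2 / n      # (8)
r = l_xy / math.sqrt(l_xx * l_yy)   # sample correlation coefficient

print(f"Lxy = {l_xy:.1f}, Lxx = {l_xx:.0f}, Lyy = {l_yy:.2f}, r = {r:.3f}")
```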

Correlation: Example 11.14
Since r = 0.988, there is a nearly perfect positive correlation between mean FEV and height: the taller a person is, the higher the FEV. Had we done a regression of one of the variables (FEV or height) on the other, the R² would have been R² = r² = 0.976 ≈ 98%. This implies that 98% of the variation in one variable is explained by the other.

Correlation: Example 11.24
The sample correlation coefficient between estriol levels and birth weights is calculated as r = 0.61, implying a moderately strong positive linear relationship: the higher the estriol level, the higher the birth weight. Remember, R² = 0.37 (slide # 33), which is equal to r² = (0.61)².

Statistical Significance of Correlation
If |r| is close to 1, such as 0.988, one would believe that there is a strong linear relationship between the two variables. That is, there is no reason to believe that this strong association just happened by chance (sampling/observation).

Statistical Significance of Correlation
But if |r| = 0.23, what conclusion would you draw about the relationship? Is it possible that in truth there is no correlation (ρ = 0), and the sample shows some correlation between the two variables only by chance?

Significance test for correlation coefficient
Test the hypothesis H0: ρ = 0 vs. Ha: ρ ≠ 0.
Under the assumption that both variables are normally distributed, the test statistic is
t = r √(n − 2) / √(1 − r²)
Calculate the two-sided p-value from a t distribution with (n − 2) d.f.

Correlation: Example 11.24
The sample correlation coefficient between estriol levels and birth weights is calculated as r = 0.61. Is the correlation significant? (Is the correlation coefficient significantly different from zero?)

Correlation: Example 11.24
Since the p-value is very small (p ≈ 0.0003), we reject the null hypothesis: we have enough evidence to conclude that the correlation coefficient is significantly different from zero. Did you notice that the t-statistic (t = 4.14) and p-value (0.00027) for testing H0: ρ = 0 are exactly the same as those calculated for H0: β = 0 in slide 37?
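
The agreement between the two t-statistics is algebraic, not a coincidence of this data set. The Python sketch below computes both from the same corrected sums for the birthweight data (Lxy = 412, Lxx = 677.4194, Lyy = 674), assuming n = 31 as inferred from the 29 degrees of freedom.

```python
import math

n = 31                                   # ASSUMPTION: inferred from 29 d.f.
l_xy, l_xx, l_yy = 412.0, 677.4194, 674.0

# t-statistic for H0: rho = 0
r = l_xy / math.sqrt(l_xx * l_yy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# t-statistic for H0: beta = 0
b = l_xy / l_xx
s = math.sqrt((l_yy - b * l_xy) / (n - 2))   # residual standard deviation
t_slope = b / (s / math.sqrt(l_xx))

print(f"r = {r:.2f}, t (correlation) = {t_corr:.2f}, t (slope) = {t_slope:.2f}")
```

Both statistics reduce to Lxy √(n − 2) / √(Lxx Lyy − Lxy²), which is why they match for any data set.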

Significance test for correlation coefficient
Test the hypothesis H0: ρ = ρ0 vs. Ha: ρ ≠ ρ0.
Let z = ½ ln[(1 + r)/(1 − r)] (Fisher's Z transformation), and let z0 = ½ ln[(1 + ρ0)/(1 − ρ0)].

Significance test for correlation coefficient
Then, under H0, z is approximately normally distributed with mean z0 and variance 1/(n − 3), so (z − z0) √(n − 3) is approximately standard normal. The p-value for the test can then be calculated from a standard normal distribution. We will mainly use this result to find confidence intervals for ρ.

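Confidence Interval for ρ
Compute z = ½ ln[(1 + r)/(1 − r)], form the interval z ± z_{1−α/2} / √(n − 3) on the transformed scale, and transform each endpoint back with ρ = (e^{2z} − 1)/(e^{2z} + 1).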
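
A confidence interval for ρ based on Fisher's Z transformation can be sketched in Python as follows. The function name `fisher_ci` is hypothetical. The sketch uses the standard identities z = atanh(r) and its inverse tanh, together with the approximate variance 1/(n − 3), and applies them to the estriol example with r = 0.61 and an assumed n = 31.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate CI for rho via Fisher's Z transformation."""
    z = math.atanh(r)   # equals 0.5 * ln((1 + r) / (1 - r))
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit / math.sqrt(n - 3)
    # Back-transform both endpoints to the correlation scale.
    return math.tanh(z - half_width), math.tanh(z + half_width)

# r from Example 11.24; n = 31 is an ASSUMPTION inferred from the 29 d.f.
low, high = fisher_ci(0.61, 31)
print(f"95% CI for rho: ({low:.2f}, {high:.2f})")
```

The interval is not symmetric around r, because the back-transformation compresses values near ±1.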