Correlation and Regression By Walden University Statsupport Team March 2011

Correlation and Regression
Introduction
Linear Correlation
Assumptions
Linear Regression
Assumptions

Introduction

Correlation measures the strength and direction of the relationship between two variables. It is used as a measure of association, based on assumptions such as linearity of the relationship, the same level of relationship throughout the range of the independent variable (homoscedasticity), and interval or near-interval data. Homoscedasticity refers to constant conditional variance. Regression deals with a functional relationship between a dependent variable and an independent variable. Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, logistic regression should be used.

Linear Correlation

The most commonly used measure of linear correlation is the product-moment correlation (Pearson's r). Pearson's r is a measure of association that varies from -1 to +1, with 0 indicating no relationship (random pairing of values) and +1 indicating a perfect positive relationship, taking the form: the more the x, the more the y, and vice versa. A value of -1 is a perfect negative relationship, taking the form: the more the x, the less the y, and vice versa. Since it is a measure of association, the presence of a significant linear correlation between two variables does not imply causation.
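
Although the slides use SPSS throughout, the computation behind Pearson's r can be sketched in a few lines of Python. The arrays below are hypothetical and purely illustrative, not the bicycle data:

```python
import numpy as np

# Hypothetical paired observations (illustrative values only)
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([52.0, 44.0, 37.0, 30.0, 21.0])

# Pearson's r: the covariance of x and y divided by the product
# of their standard deviations
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(f"Pearson's r = {r:.3f}")  # near -1 here: strong inverse relationship
```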

Assumptions in Using Linear Correlation

In situations where the assumptions of linear correlation are violated, correlation becomes inadequate to explain a given relationship. The three crucial assumptions in linear correlation are:
1. Normality
2. Linearity
3. Homoscedasticity
The assumption of normality requires that the distribution of each variable approximates the normal distribution and is not skewed in either the positive or the negative direction. The linearity assumption requires that the relationship between the two variables is linear and proportional. The homoscedasticity assumption requires that the variance remain constant for each variable studied; in other words, it calls for constancy of the variance of a measure over the levels of the factor under study.
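
The slides assess these assumptions graphically in SPSS. As an illustrative aside that is not part of the original workflow, a quick numerical check of the normality assumption could use the Shapiro-Wilk test; the sample below is hypothetical:

```python
from scipy import stats

# Hypothetical sample for one of the two variables
x = [15, 25, 30, 40, 50, 55, 60, 70, 75, 80, 85, 90, 95]

stat, p = stats.shapiro(x)  # Shapiro-Wilk test of normality
# A small p-value (e.g., < 0.05) would suggest the normality
# assumption is questionable for this variable
print(f"W = {stat:.3f}, p = {p:.3f}")
```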

Let us look at the linear relationship between the percent of students receiving reduced-fee lunch and the percent of students wearing bicycle helmets. Here the X variable is socioeconomic status, measured as the percentage of children in a neighborhood receiving free or reduced-fee lunches at school. The Y variable is bicycle helmet use, measured as the percentage of bicycle riders in the neighborhood wearing helmets. The bicycle data are shown in the next slide. The first step in conducting a linear correlation analysis is to use a scatter plot to visually inspect the pattern of relationship between the two variables. To generate a scatter plot in SPSS, do the following: Graphs > Legacy Dialogs > Scatter/Dot…, then click on Simple Scatter. Click the Define button, move percent receiving reduced-fee lunch to the X-axis and percent wearing helmets to the Y-axis, and finally click OK.
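
For readers working outside SPSS, roughly the same scatter plot could be produced in Python with matplotlib. The values below are hypothetical stand-ins for the bicycle data, not the actual dataset:

```python
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the bicycle data
# (percent receiving reduced-fee lunch vs. percent wearing helmets)
lunch = [15, 25, 30, 40, 50, 55, 60, 70, 75, 80, 85, 90, 95]
helmet = [45, 40, 38, 33, 30, 28, 25, 22, 20, 18, 15, 12, 10]

plt.scatter(lunch, helmet)
plt.xlabel("Percent receiving reduced-fee lunch")  # X-axis variable
plt.ylabel("Percent wearing helmets")              # Y-axis variable
plt.title("Helmet use vs. socioeconomic status")
plt.show()
```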

Data on the relationship between percent receiving reduced-fee lunch and percent wearing helmets.

Simple Scatter Plot Selected

X-axis and Y-axis variables selected for scatter plot

Scatter plot of percent receiving reduced-fee lunch and percent wearing helmets

The scatter plot looks fairly linear. The direction of the relationship is such that the two variables are inversely related. We also observe some outliers in the scatter plot. An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. To obtain the linear correlation coefficient, do the following in SPSS: Analyze > Correlate > Bivariate, then move both variables to the Variables box and click OK. You will obtain the output shown in the Correlations table.
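
Outside SPSS, the same coefficient and its two-tailed p-value could be obtained with scipy; the arrays are the hypothetical stand-ins from the scatter-plot sketch above:

```python
from scipy import stats

# Hypothetical stand-ins for the bicycle data
lunch = [15, 25, 30, 40, 50, 55, 60, 70, 75, 80, 85, 90, 95]
helmet = [45, 40, 38, 33, 30, 28, 25, 22, 20, 18, 15, 12, 10]

r, p_value = stats.pearsonr(lunch, helmet)  # r and its two-tailed p-value
print(f"r = {r:.3f}, p = {p_value:.3f}")
```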

A demonstration of how to run the linear correlation coefficient calculation in SPSS.

A demonstration of how to select the variables for which the correlation is to be computed.

Correlations

                                              Percent receiving        Percent wearing
                                              reduced or free meals    helmets
Percent receiving      Pearson Correlation    1                        -.581*
reduced or free meals  Sig. (2-tailed)                                 .037
                       N                      13                       13
Percent wearing        Pearson Correlation    -.581*                   1
helmets                Sig. (2-tailed)        .037
                       N                      13                       13
*. Correlation is significant at the 0.05 level (2-tailed).

As can be seen in the Correlations table above, the Pearson correlation is -.581 and its p-value is .037, which indicates a statistically significant linear relationship between percent receiving reduced-fee lunch and percent wearing bicycle helmets. Note that the negative sign indicates that the relationship is an inverse one. That means neighborhoods with a lower percentage of students receiving reduced-fee lunch have a higher percentage of students wearing helmets, and vice versa.

Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).

Regression is better suited than correlation for studying samples in which the investigator fixes the distribution of X. That means the investigator can control changes in the level of X so as to examine corresponding changes in Y.

The most common method for fitting a regression line is the method of least squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the straight line.
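
A minimal sketch of the least-squares computation itself, using hypothetical data, might look like this in Python:

```python
import numpy as np

# Hypothetical data; in practice x and y come from the sample
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
y = np.array([52.0, 44.0, 37.0, 30.0, 21.0])

# Least-squares estimates: b minimizes the sum of squared vertical
# deviations; a then places the line through the point of means
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"Fitted line: Y = {a:.2f} + {b:.2f}X")
```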

Assumptions of Linear Regression

There are four principal assumptions that justify the use of linear regression models:
1. Linearity of the relationship between the dependent and independent variables
2. Independence of the errors (no serial correlation)
3. Homoscedasticity (constant variance) of the errors, (a) versus time and (b) versus the predictions (or versus any independent variable)
4. Normality of the error distribution

Nonlinearity can be detected by plotting the observed versus predicted values or by plotting residuals versus predicted values, both of which are part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot or around a horizontal line in the latter plot. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.

Assumptions of Linear Regression (continued)

The best method of checking the independence assumption is to examine the autocorrelation plot of the residuals. Most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at roughly plus or minus 2/√n, where n is the sample size. The Durbin-Watson statistic can also help to test for significant residual autocorrelation.

Violations of the homoscedasticity assumption can be detected by looking at plots of residuals versus predicted values; residuals that grow larger (i.e., more spread out) either as a function of time or as a function of the predicted value suggest the presence of heteroscedasticity. A plot of residuals versus some of the independent variables might also help to discern heteroscedasticity.

A check for violation of the normality assumption can be done with a normal probability plot of the residuals. The normal probability plot shows the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line.
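
As an illustration of these diagnostics in Python (using randomly generated placeholder residuals rather than output from the bicycle model):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Residuals and fitted values would come from a fitted model;
# random values are used here purely as placeholders
rng = np.random.default_rng(0)
fitted = np.linspace(10, 45, 13)
residuals = rng.normal(0, 3, 13)

# Homoscedasticity check: residuals vs. fitted values should show
# constant spread around a horizontal line at zero
plt.scatter(fitted, residuals)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Independence check: a Durbin-Watson value near 2 suggests
# no serial correlation in the residuals
print(f"Durbin-Watson = {durbin_watson(residuals):.2f}")

# Normality check: points should fall near the diagonal reference line
stats.probplot(residuals, plot=plt)
plt.show()
```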

An illustration of regression techniques follows, using the bicycle data. The regression model and its parameter estimates can be generated in SPSS by clicking Analyze > Regression > Linear, then moving percent receiving reduced-fee lunch to the Independent(s) box and percent wearing helmets to the Dependent box, and then clicking OK. This produces the important outputs: the Model Summary table, the ANOVA table, and the Coefficients table.
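
The equivalent fit outside SPSS could be obtained, for example, with the statsmodels library. The data values below are hypothetical stand-ins, so the printed summary will not match the SPSS output that follows:

```python
import statsmodels.api as sm

# Hypothetical stand-ins for the bicycle data
lunch = [15, 25, 30, 40, 50, 55, 60, 70, 75, 80, 85, 90, 95]
helmet = [45, 40, 38, 33, 30, 28, 25, 22, 20, 18, 15, 12, 10]

X = sm.add_constant(lunch)       # adds the intercept term to the design matrix
model = sm.OLS(helmet, X).fit()  # ordinary least-squares fit
print(model.summary())           # R-square, ANOVA F test, coefficient table
```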

A screen demonstrating the steps for running linear regression.

Demonstration of how to pick the dependent and independent variables for fitting the linear regression model.

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .581a   .338
a. Predictors: (Constant), Percent receiving reduced or free meals

ANOVAb
Model           Sum of Squares   df   Mean Square   F   Sig.
1  Regression                     1                     .037a
   Residual                      11
   Total                         12
a. Predictors: (Constant), Percent receiving reduced or free meals
b. Dependent Variable: Percent wearing helmets

Coefficientsa
                                      Unstandardized    Standardized
Model                                 Coefficients      Coefficients
                                      B    Std. Error   Beta            t    Sig.
1  (Constant)
   Percent receiving reduced
   or free meals                                        -.581                .037
a. Dependent Variable: Percent wearing helmets

Linear regression model output in which percent wearing helmets is estimated as a function of percent receiving reduced-fee lunch.

Interpretation of the fitted regression model output:

1. The Model Summary table indicates that the R Square value is .338. This can be viewed as a poor model fit, since it means that only about 34% of the variability in percent wearing helmets is explained by percent receiving reduced-fee lunch.
2. The ANOVA table indicates that the fitted regression model is statistically significant, since the p-value (.037) is less than 0.05.
3. The Coefficients table gives the intercept and slope estimates. The p-value for the slope (.037) is less than 0.05; therefore, percent receiving reduced-fee lunch is a significant predictor of percent wearing helmets. The slope of the regression model is interpreted as the average change in Y per unit change in X. In this case, the negative slope predicts fewer helmet users per 100 bicycle riders for each additional percentage of children receiving reduced-fee meals.
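
These same quantities can be read off a fitted statsmodels model. As before, the data are hypothetical stand-ins, so the numbers will differ from the SPSS output above:

```python
import statsmodels.api as sm

# Hypothetical stand-ins for the bicycle data (as in the earlier sketch)
lunch = [15, 25, 30, 40, 50, 55, 60, 70, 75, 80, 85, 90, 95]
helmet = [45, 40, 38, 33, 30, 28, 25, 22, 20, 18, 15, 12, 10]

model = sm.OLS(helmet, sm.add_constant(lunch)).fit()
print(f"R Square      = {model.rsquared:.3f}")  # proportion of variance explained
print(f"ANOVA p-value = {model.f_pvalue:.3f}")  # overall model significance
a, b = model.params                             # intercept and slope estimates
print(f"Intercept = {a:.2f}, Slope = {b:.2f}")
print(f"Slope p-value = {model.pvalues[1]:.3f}")
```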

Final Remarks

In regression analysis, residual analysis and the tasks of identifying the influence of outliers and influential points are crucial. For instance, in this dataset observation 13 was found to be an outlier in the scatter plot made earlier. If we remove this observation and refit the regression model, the model parameter estimates change significantly. A thorough analysis of the effects of outliers and influential points will be covered under multiple regression in Week 12. It is also important to note that statistical associations are not always causal. The distinction between causal and non-causal associations in health and disease has considerable practical relevance.