# Correlation and Regression By Walden University Statsupport Team March 2011.


Correlation and Regression

- Introduction
- Linear Correlation
- Assumptions
- Linear Regression
- Assumptions

Introduction

- Correlation measures the strength and direction of the relationship between two variables. It is used as a measure of association based on assumptions such as linearity of the relationship, the same level of relationship throughout the range of the independent variable (homoscedasticity), and interval or near-interval data.
- Homoscedasticity refers to constant conditional variance, for example over time or across the range of the independent variable.
- Regression deals with a functional relationship between a dependent variable and an independent variable.
- Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic regression should be used.

Linear Correlation

- The most commonly used measure of linear correlation is the product-moment correlation (Pearson's r).
- Pearson's r is a measure of association that varies from -1 to +1, with 0 indicating no linear relationship (as in a random pairing of values) and +1 indicating a perfect positive relationship, taking the form: the more the x, the more the y, and vice versa.
- A value of -1 is a perfect negative relationship, taking the form: the more the x, the less the y, and vice versa.
- Since it is a measure of association, the presence of a significant linear correlation between two variables does not imply causation.
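To make the -1 to +1 range concrete, here is a minimal sketch of the product-moment formula written from scratch. The function name `pearson_r` and the small data lists are assumptions made for illustration; they are not part of the original slides and are not the bicycle data.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, not the bicycle data.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises exactly in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly in step with x
print(r_up, r_down)
```

The two calls illustrate the extremes described above: a perfectly increasing line gives a value at +1 and a perfectly decreasing line gives a value at -1. The sketch does not guard against a constant variable, for which the correlation is undefined.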

Assumptions in Using Linear Correlation

- In situations where the assumptions of linear correlation are violated, correlation becomes inadequate to explain a given relationship. The three crucial assumptions in linear correlation are:
  1. Normality
  2. Linearity
  3. Homoscedasticity
- The assumption of normality requires that the distribution of both variables approximates the normal distribution and is not skewed in either the positive or the negative direction.
- The linearity assumption requires that the relationship between the two variables is linear, meaning a change in one variable is accompanied by a proportional change in the other.
- The homoscedasticity assumption requires the variance to remain constant over time for each variable studied. In other words, it calls for constancy of the variance of a measure over the levels of the factor under study.

Let us look at the linear relationship between the percent of students receiving reduced-fee lunch and the percent of students wearing bicycle helmets. Here the X variable is socioeconomic status, measured as the percentage of children in a neighborhood receiving free or reduced-fee lunches at school. The Y variable is bicycle helmet use, measured as the percentage of bicycle riders in the neighborhood wearing helmets. The bicycle data are shown in the next slide. The first step in conducting a linear correlation analysis is to use a scatter plot to visually inspect the pattern of the relationship between the two variables. To generate a scatter plot in SPSS, do the following: Graphs > Legacy Dialogs > Scatter/Dot… and then click on Simple Scatter. Then click on the Define button and move percent receiving reduced-fee lunch to the X-axis and percent wearing helmets to the Y-axis. Finally, click OK.

Data on the relationship between percent receiving reduced-fee lunch and percent wearing helmets.

Simple Scatter Plot Selected

X-axis and Y-axis variables selected for scatter plot

Scatter plot of percent receiving reduced-fee lunch and percent wearing helmets

The scatter plot looks fairly linear. The direction of the relationship is such that the two variables are inversely related. We also observe some outliers in the scatter plot. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. To obtain the linear correlation coefficient, do the following in SPSS: Analyze > Correlate > Bivariate, then move both variables to the Variables box and click OK. You will obtain the output shown in the Correlations table.

A demonstration on how to execute the linear correlation coefficient calculation in SPSS

Demonstration on how to select the variables for which correlation is to be determined.

Correlations

| | | Percent receiving reduced or free meals | Percent wearing helmets |
|---|---|---|---|
| Percent receiving reduced or free meals | Pearson Correlation | 1 | -.581 * |
| | Sig. (2-tailed) | | .037 |
| | N | 13 | 13 |
| Percent wearing helmets | Pearson Correlation | -.581 * | 1 |
| | Sig. (2-tailed) | .037 | |
| | N | 13 | 13 |

*. Correlation is significant at the 0.05 level (2-tailed).

As can be seen in the Correlations table above, the Pearson correlation is -0.581 and its p-value is 0.037, which indicates that there is a statistically significant linear relationship between percent receiving reduced-fee lunch and percent wearing bicycle helmets. Note that the negative sign indicates that the relationship is an inverse one. That means that neighborhoods with a lower percentage of students receiving reduced-fee lunch have a higher percentage of students wearing helmets, and vice versa.
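The significance reported for r can be cross-checked by hand with the usual t test for a correlation coefficient, t = r·sqrt(n − 2) / sqrt(1 − r²) on n − 2 degrees of freedom. The sketch below uses only r and N from the Correlations table; the critical value 2.201 is an assumption taken from a standard t table (two-tailed, 0.05 level, 11 degrees of freedom) and does not appear in the slides.

```python
import math

r = -0.581   # Pearson correlation from the Correlations table
n = 13       # number of paired observations (N in the table)

# t statistic for testing H0: rho = 0, with n - 2 degrees of freedom.
df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Assumed two-tailed 5% critical value of Student's t with 11 degrees
# of freedom, taken from a standard t table (not from the slides).
t_crit = 2.201

significant = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, df = {df}, significant at 0.05: {significant}")
```

The absolute t value exceeds the assumed critical value, which is consistent with the reported p-value of 0.037 being below 0.05.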

Linear Regression

- Linear regression models the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.
- A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
- Regression is better suited than correlation for studying samples in which the investigator fixes the distribution of X. That means the investigator can control changes in the level of X so as to examine corresponding changes in Y.
- The most common method for fitting a regression line is the method of least squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the straight line.
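The least-squares idea described above can be sketched in a few lines. The closed-form estimates used here are the standard ones for a single predictor: b is the sum of cross-products divided by the sum of squared deviations of X, and a is chosen so the line passes through the two means. The function name `fit_line` and the data are hypothetical and are not the bicycle data.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b

def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical deviations of the points from Y = a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical values, not the bicycle data.
x = [0, 1, 2, 3, 4]
y = [1, 3, 4, 8, 9]
a, b = fit_line(x, y)
sse = sum_squared_errors(x, y, a, b)
print(f"a = {a:.2f}, b = {b:.2f}, SSE = {sse:.2f}")
```

Any other choice of intercept or slope gives a larger sum of squared vertical deviations for the same points, which is what "best-fitting" means in this method.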

Assumptions of Linear Regression

There are four principal assumptions which justify the use of linear regression models:

1. linearity of the relationship between dependent and independent variables
2. independence of the errors (no serial correlation)
3. homoscedasticity (constant variance) of the errors (a) versus time and (b) versus the predictions (or versus any independent variable)
4. normality of the error distribution.

Nonlinearity can be detected by plotting the observed versus predicted values or by plotting the residuals versus predicted values, both of which are part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot or a horizontal line in the latter plot. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.

Assumptions of Linear Regression Continued…

- The best way to check the independence assumption is to examine the autocorrelation plot of the residuals. Most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at roughly plus-or-minus 2-over-the-square-root-of-n, where n is the sample size. The Durbin-Watson statistic can also help to test for significant residual autocorrelation.
- Violations of the homoscedasticity assumption can be detected by looking at plots of residuals versus predicted values. Residuals that get larger (i.e., more spread out) either as a function of time or as a function of the predicted value suggest the presence of heteroscedasticity. A plot of residuals versus some of the independent variables might also help to discern the presence of heteroscedasticity.
- A check for violation of the normality assumption can be done with a normal probability plot of the residuals. The normal probability plot is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line.
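The independence checks mentioned above can be sketched without any statistics package. The code computes the rough 95% band of ±2/√n, a lag-1 sample autocorrelation, and the Durbin-Watson statistic in its usual form (sum of squared successive differences of the residuals divided by the sum of squared residuals). The function names and the residuals are hypothetical; the residuals are deliberately alternating in sign so that the checks flag a problem.

```python
import math

def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def lag_autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    cross = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return cross / sum(d ** 2 for d in dev)

# Hypothetical residuals in time order; alternating signs mimic
# negative serial correlation.
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

band = 2 / math.sqrt(len(residuals))  # rough 95% band around zero
r1 = lag_autocorrelation(residuals, 1)
dw = durbin_watson(residuals)
outside_band = abs(r1) > band
print(f"band = +/-{band:.3f}, lag-1 autocorrelation = {r1:.3f}, DW = {dw:.2f}")
```

A Durbin-Watson value near 2 is generally read as little or no first-order autocorrelation, values toward 0 as positive autocorrelation, and values toward 4 as negative autocorrelation, so the alternating residuals here land well above 2.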

Regression techniques are illustrated below using the bicycle data. The regression model and its parameter estimates can be generated in SPSS by clicking Analyze > Regression > Linear and then moving percent receiving reduced-fee lunch to the Independent(s) box and percent wearing helmets to the Dependent box. Then click OK. This gives us the important outputs: the Model Summary table, the ANOVA table, and the Coefficients table.

A screen demonstrating the steps for running linear regression.

Demonstration of how to pick the dependent and independent variables for fitting the linear regression model.

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .581 a | .338 | .277 | 14.2824 |

a. Predictors: (Constant), Percent receiving reduced or free meals

ANOVA b

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 1143.847 | 1 | 1143.847 | 5.607 | .037 a |
| | Residual | 2243.842 | 11 | 203.986 | | |
| | Total | 3387.689 | 12 | | | |

a. Predictors: (Constant), Percent receiving reduced or free meals
b. Dependent Variable: percent wearing helmets

Coefficients a

| Model | | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 43.638 | 6.282 | | 6.946 | .000 |
| | Percent receiving reduced or free meals | -.331 | .140 | -.581 | -2.368 | .037 |

a. Dependent Variable: percent wearing helmets

Linear regression model output in which percent wearing helmets is estimated as a function of percent receiving reduced-fee lunch.
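Several entries in the output above are simple functions of the ANOVA sums of squares, which makes the tables easy to cross-check. The sketch below recomputes R Square, Adjusted R Square, F, and the standard error of the estimate from the regression and residual sums of squares and their degrees of freedom; the variable names are illustrative, and small differences in the last digit can arise from rounding in the printed output.

```python
import math

# Values read from the ANOVA output above.
ss_regression = 1143.847
ss_residual = 2243.842
df_regression = 1
df_residual = 11
n = 13  # number of observations (total df + 1)

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_square = ss_regression / ss_total                         # share of variability explained
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_residual   # penalized for model size
f_stat = ms_regression / ms_residual                        # overall F test
std_error = math.sqrt(ms_residual)                          # std. error of the estimate

print(f"R^2 = {r_square:.3f}, adjusted R^2 = {adj_r_square:.3f}")
print(f"F = {f_stat:.3f}, std. error of the estimate = {std_error:.4f}")
```

With one predictor, the square root of R Square equals the absolute value of the Pearson correlation, which is why R in the Model Summary matches the .581 found in the correlation analysis.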

Interpretation of the fitted regression model output:

1. The Model Summary table indicates that the R square value is 0.338 ≈ 0.34. This can be viewed as a poor model fit, since only about 34% of the variability in percent wearing helmets is explained by percent receiving reduced-fee lunch.
2. The ANOVA table indicates that the fitted regression model is statistically significant, since the p-value is 0.037, which is less than 0.05.
3. The Coefficients table shows that the intercept is 43.638 and the slope is -0.331. The p-value for the slope is 0.037, which is less than 0.05. Therefore, percent receiving reduced-fee lunch is a significant predictor of percent wearing helmets. The slope of the regression model is interpreted as the average change in Y per unit change in X. In this case, the slope of -0.331 predicts about 0.33 fewer helmet users per 100 bicycle riders for each additional percentage point of children receiving reduced-fee meals.
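The fitted equation, Y = 43.638 − 0.331X, can be used to turn the interpretation of the slope into concrete predictions. In the sketch below, the function name `predict_helmet_percent` and the input percentages are chosen for illustration only; they are not observations from the data set.

```python
INTERCEPT = 43.638  # (Constant) from the Coefficients table
SLOPE = -0.331      # coefficient for percent receiving reduced-fee lunch

def predict_helmet_percent(lunch_percent):
    """Predicted percent wearing helmets for a given percent receiving reduced-fee lunch."""
    return INTERCEPT + SLOPE * lunch_percent

# Illustrative inputs chosen for this sketch, not observations from the data set.
low = predict_helmet_percent(10)
high = predict_helmet_percent(50)
change_per_point = predict_helmet_percent(11) - predict_helmet_percent(10)

print(f"X = 10: {low:.2f}% predicted to wear helmets")
print(f"X = 50: {high:.2f}% predicted to wear helmets")
print(f"Change per additional percentage point of X: {change_per_point:.3f}")
```

Predictions are only as trustworthy as the model: with an R square of about 0.34 the individual predictions carry considerable uncertainty, and values of X outside the range observed in the data should be treated with extra caution.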

Final Remarks

In regression analysis, residual analysis and the identification of outliers and influential points are crucial. For instance, in this dataset, observation 13 was found to be an outlier in the scatter plot made earlier. If we remove this observation and refit the regression model, the model parameter estimates change considerably. A thorough analysis of the effects of outliers and influential points will be covered under multiple regression in Week 12. It is also important to note that statistical associations are not always causal. The distinction between causal and non-causal associations in health and disease has considerable practical relevance.
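The effect of removing a single outlying observation and refitting can be illustrated with a small sketch. The data below are hypothetical, constructed so that the last point is an obvious outlier; they are not the bicycle data, and the numbers say nothing about how the estimates for observation 13 actually change. The helper `fit_line` uses the standard single-predictor least-squares formulas.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: the first five points lie on Y = 2X,
# the sixth is an outlier far above that line.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 30]

a_all, b_all = fit_line(x, y)             # fit with the outlier included
a_trim, b_trim = fit_line(x[:-1], y[:-1])  # refit with the outlier removed

print(f"with outlier:    a = {a_all:.3f}, b = {b_all:.3f}")
print(f"without outlier: a = {a_trim:.3f}, b = {b_trim:.3f}")
```

Comparing the two fits side by side is a simple first look at influence; it does not replace a fuller residual and influence analysis.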
