Correlation and Regression

1 Correlation and Regression
CHAPTER 10 Correlation and Regression © Copyright McGraw-Hill 2004

2 Objectives
Draw a scatter plot for a set of ordered pairs. Compute the correlation coefficient. Test the hypothesis H0: ρ = 0. Compute the equation of the regression line.

3 Objectives (cont’d.)
Compute the coefficient of determination. Compute the standard error of estimate. Find a prediction interval. Be familiar with the concept of multiple regression.

4 Introduction
In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship between two or more numerical or quantitative variables exists.

5 Statistical Methods
Correlation is a statistical method used to determine whether a linear relationship between variables exists. Regression is a statistical method used to describe the nature of the relationship between variables—that is, positive or negative, linear or nonlinear.

6 Statistical Questions
Are two or more variables related? If so, what is the strength of the relationship? What type of relationship exists? What kind of predictions can be made from the relationship?

7 Vocabulary
A correlation coefficient is a measure of how variables are related. In a simple relationship, there are only two variables under study. In multiple relationships, many variables are under study.

8 Scatter Plots
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable, x, and the dependent variable, y. A scatter plot is a visual way to describe the nature of the relationship between the independent and dependent variables.

9 Scatter Plot Example
(scatter plot figure omitted)

10 Correlation Coefficient
The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables. The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ.

11 Correlation Coefficient (cont’d.)
The range of the correlation coefficient is from −1 to +1. If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to −1.

12 Correlation Coefficient (cont’d.)
When there is no linear relationship between the variables or only a weak relationship, the value of r will be close to 0. (Scale: r near −1, strong negative linear relationship; r near 0, no linear relationship; r near +1, strong positive linear relationship.)

13 Formula for the Correlation Coefficient r
r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
where n is the number of data pairs.
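As a quick sketch, the formula above can be evaluated directly in Python; the data pairs below are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r:
    r = [n(Sxy) - (Sx)(Sy)] / sqrt{[n(Sxx) - Sx^2][n(Syy) - Sy^2]}"""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Hypothetical data pairs (n = 5)
r = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # about 0.775
```

Working from the column sums rather than the raw deviations mirrors the table procedure described later in these slides.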

14 Population Correlation Coefficient
Formally defined, the population correlation coefficient, ρ, is the correlation computed by using all possible pairs of data values (x, y) taken from a population.

15 Hypothesis Testing
In hypothesis testing, one of the following is true: H0: ρ = 0 This null hypothesis means that there is no correlation between the x and y variables in the population. H1: ρ ≠ 0 This alternative hypothesis means that there is a significant correlation between the variables in the population.

16 t Test for the Correlation Coefficient
Formula for the t test for the correlation coefficient:
t = r √[(n − 2) / (1 − r²)]
with degrees of freedom equal to n − 2.
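A minimal sketch of the test statistic in Python; the r value and sample size are hypothetical, and significance would then be judged against a t table with n − 2 degrees of freedom.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with d.f. = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical: r = 0.6 computed from n = 7 data pairs
t = t_statistic(0.6, 7)  # about 1.677; compare with the critical t at d.f. = 5
```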

17 Possible Relationships Between Variables
There is a direct cause-and-effect relationship between the variables: that is, x causes y. There is a reverse cause-and-effect relationship between the variables: that is, y causes x. The relationship between the variables may be caused by a third variable: that is, y may appear to cause x but in reality z causes x.

18 Possible Relationships Between Variables (cont’d.)
There may be a complexity of interrelationships among many variables; that is, x may cause y but w, t, and z fit into the picture as well. The relationship may be coincidental: although a researcher may find a relationship between x and y, common sense may prove otherwise.

19 Interpretation of Relationships
When the null hypothesis is rejected, the researcher must consider all possibilities and select the appropriate relationship between the variables as determined by the study. Remember, correlation does not necessarily imply causation.

20 Regression Line
If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data’s line of best fit. Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.

21 Scatter Plot with Three Lines
(figure omitted)

22 A Linear Relation
(figure omitted)

23 Equation of a Line
In algebra, the equation of a line is usually given as y = mx + b, where m is the slope of the line and b is the y intercept. In statistics, the equation of the regression line is written as y' = a + bx, where b is the slope of the line and a is the y' intercept.

24 Regression Line
Formulas for the regression line y' = a + bx:
a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
where a is the y' intercept and b is the slope of the line.
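These two formulas can be sketched in Python from the same column sums used for r; the data pairs are hypothetical.

```python
def regression_line(xs, ys):
    """Intercept a and slope b of the least-squares line y' = a + bx."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    den = n * sxx - sx ** 2  # common denominator of both formulas
    a = (sy * sxx - sx * sxy) / den
    b = (n * sxy - sx * sy) / den
    return a, b

# Hypothetical data pairs (n = 5)
a, b = regression_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
# a = 2.2 and b = 0.6, so the fitted line is y' = 2.2 + 0.6x
```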

25 Rounding Rule
When calculating the values of a and b, round to three decimal places.

26 Assumptions for Valid Predictions
For any specific value of the independent variable x, the value of the dependent variable y must be normally distributed about the regression line. The standard deviation of each of the dependent variables must be the same for each value of the independent variable.

27 Limits of Predictions
Remember that when assumptions are made, they are based on present conditions or on the premise that present trends will continue. The assumptions may not prove true in the future.

28 Procedure
Finding the correlation coefficient and the regression line equation:
Step 1 Make a table with columns for subject, x, y, xy, x², and y².
Step 2 Find the values of xy, x², and y². Place them in the appropriate columns.
Step 3 Substitute in the formula to find the value of r.

29 Procedure (cont’d.)
Step 4 When r is significant, substitute in the formulas to find the values of a and b for the regression line equation.

30 Total Variation
The total variation, Σ(y − ȳ)², is the sum of the squares of the vertical distance each point is from the mean. The total variation can be divided into two parts: that which is attributed to the relationship of x and y, and that which is due to chance.

31 Two Parts of Total Variation
The variation obtained from the relationship (i.e., from the predicted y' values) is Σ(y' − ȳ)² and is called the explained variation. Variation due to chance, found by Σ(y − y')², is called the unexplained variation. This variation cannot be attributed to the relationship.
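The split can be sketched in Python; the data and the fitted line y' = 2.2 + 0.6x below are hypothetical, chosen only so the two parts can be checked against the total.

```python
def variation_parts(xs, ys, a, b):
    """Explained variation sum((y' - ybar)^2) and unexplained
    variation sum((y - y')^2) about the fitted line y' = a + bx."""
    ybar = sum(ys) / len(ys)
    pred = [a + b * x for x in xs]
    explained = sum((yp - ybar) ** 2 for yp in pred)
    unexplained = sum((y - yp) ** 2 for y, yp in zip(ys, pred))
    return explained, unexplained

# Hypothetical data with fitted line y' = 2.2 + 0.6x
exp_var, unexp_var = variation_parts([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 2.2, 0.6)
# exp_var is about 3.6 and unexp_var about 2.4; their sum is the total variation 6.0
```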

32 Total Variation (cont’d.)
Hence, the total variation is equal to the sum of the explained variation and the unexplained variation. For a single point, the differences are called deviations.

33 Coefficient of Determination
The coefficient of determination is a measure of the variation of the dependent variable that is explained by the regression line and the independent variable. The symbol for the coefficient of determination is r².

34 Coefficient of Nondetermination
The coefficient of nondetermination is a measure of the unexplained variation. The formula for the coefficient of nondetermination is 1 − r².

35 Standard Error of Estimate
The standard error of estimate, denoted by sest, is the standard deviation of the observed y values about the predicted y' values. The formula for the standard error of estimate is:
sest = √[Σ(y − y')² / (n − 2)]
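A minimal sketch of the formula in Python, reusing the same hypothetical data and fitted line as before:

```python
import math

def std_error_estimate(xs, ys, a, b):
    """sest = sqrt(sum((y - y')^2) / (n - 2)) about the line y' = a + bx."""
    resid_sq = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(resid_sq / (len(xs) - 2))

# Hypothetical data with fitted line y' = 2.2 + 0.6x
s_est = std_error_estimate([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 2.2, 0.6)
# about 0.894
```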

36 Prediction Interval
The standard error of estimate can be used for constructing a prediction interval about a y' value. The formula for the prediction interval is:
y' ± tα/2 · sest · √[1 + 1/n + n(x − x̄)² / (n(Σx²) − (Σx)²)]
The d.f. = n − 2.
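As a sketch, the interval can be computed once tα/2 has been looked up in a t table (no table lookup is coded here); the data, the fitted line, and the 95% level are hypothetical.

```python
import math

def prediction_interval(xs, ys, a, b, x0, t_crit):
    """Prediction interval y' +/- t*sest*sqrt(1 + 1/n + n(x0 - xbar)^2 /
    (n*Sxx - Sx^2)). t_crit is t(alpha/2) with n - 2 d.f. from a t table."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    resid_sq = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_est = math.sqrt(resid_sq / (n - 2))
    y0 = a + b * x0
    margin = t_crit * s_est * math.sqrt(
        1 + 1 / n + n * (x0 - sx / n) ** 2 / (n * sxx - sx ** 2))
    return y0 - margin, y0 + margin

# Hypothetical 95% interval at x = 3; t(0.025) with 3 d.f. is 3.182
low, high = prediction_interval([1, 2, 3, 4, 5], [2, 4, 5, 4, 5],
                                2.2, 0.6, 3, 3.182)
```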

37 Multiple Regression
In multiple regression, there are several independent variables and one dependent variable, and the equation is:
y' = a + b1x1 + b2x2 + … + bkxk
where x1, x2, …, xk are the independent variables.

38 Multiple Regression (cont’d.)
Multiple regression analysis is used when a statistician thinks there are several independent variables contributing to the variation of the dependent variable. This analysis then can be used to increase the accuracy of predictions for the dependent variable over one independent variable alone.

39 Assumptions for Multiple Regression
Normality assumption—for any specific value of the independent variable, the values of the y variable are normally distributed. Equal variance assumption—the variances for the y variable are the same for each value of the independent variable. Linearity assumption—there is a linear relationship between the dependent variable and the independent variables.

40 Assumptions (cont’d.)
Nonmulticollinearity assumption—the independent variables are not correlated. Independence assumption—the values for the y variable are independent.

41 Multiple Correlation Coefficient
In multiple regression, as in simple regression, the strength of the relationship between the independent variables and the dependent variable is measured by a correlation coefficient. This multiple correlation coefficient is symbolized by R.

42 Multiple Correlation Coefficient Formula
The formula for R is:
R = √[(ryx1² + ryx2² − 2 · ryx1 · ryx2 · rx1x2) / (1 − rx1x2²)]
where ryx1 is the correlation coefficient for the variables y and x1; ryx2 is the correlation coefficient for the variables y and x2; and rx1x2 is the correlation coefficient for the variables x1 and x2.
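The two-predictor formula can be sketched directly; the pairwise correlations below are hypothetical.

```python
import math

def multiple_r(r_yx1, r_yx2, r_x1x2):
    """Multiple correlation coefficient R for two independent variables."""
    num = r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2
    return math.sqrt(num / (1 - r_x1x2 ** 2))

# Hypothetical pairwise correlations
R = multiple_r(0.8, 0.7, 0.4)  # about 0.901
```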

43 Coefficient of Multiple Determination
As with simple regression, R² is the coefficient of multiple determination, and it is the amount of variation explained by the regression model. The expression 1 − R² represents the amount of unexplained variation, called the error or residual variation.

44 F Test for Significance of R
The formula for the F test is:
F = (R²/k) / [(1 − R²)/(n − k − 1)]
where n is the number of data groups (x1, x2, …, y) and k is the number of independent variables. The degrees of freedom are d.f.N = n − k and d.f.D = n − k − 1.
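A minimal sketch of the statistic; the R², n, and k values are hypothetical, and significance would be judged against an F table.

```python
def f_statistic(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for the significance of R."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Hypothetical: R^2 = 0.8119 from n = 10 data groups and k = 2 predictors
F = f_statistic(0.8119, 10, 2)  # about 15.1
```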

45 Adjusted R²
Since the value of R² is dependent on n (the number of data pairs) and k (the number of variables), statisticians also calculate what is called an adjusted R², denoted by R²adj. This is based on the number of degrees of freedom.

46 Adjusted R² (cont’d.)
The formula for the adjusted R² is:
R²adj = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
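The adjustment is a one-line computation; the R², n, and k values below are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2: 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: R^2 = 0.8119 with n = 10 observations and k = 2 predictors
r2_adj = adjusted_r2(0.8119, 10, 2)  # about 0.758, slightly below R^2
```

Because the adjustment penalizes extra predictors, r2_adj is always at most R² and shrinks as k grows relative to n.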

47 Summary
The strength and direction of the linear relationship between variables is measured by the value of the correlation coefficient r. r can assume values between and including −1 and +1. The closer the value of the correlation coefficient is to +1 or −1, the stronger the linear relationship is between the variables. A value of +1 or −1 indicates a perfect linear relationship.

48 Summary (cont’d.)
Relationships can be linear or curvilinear. To determine the shape, one draws a scatter plot of the variables. If the relationship is linear, the data can be approximated by a straight line, called the regression line or the line of best fit.

49 Summary (cont’d.)
In addition, relationships can be multiple. That is, there can be two or more independent variables and one dependent variable. A coefficient of correlation and a regression equation can be found for multiple relationships, just as they can be found for simple relationships.

50 Summary (cont’d.)
The coefficient of determination is a better indicator of the strength of a linear relationship than the correlation coefficient. It is better because it identifies the percentage of variation of the dependent variable that is directly attributable to the variation of the independent variable. The coefficient of determination is obtained by squaring the correlation coefficient and converting the result to a percentage.

51 Summary (cont’d.)
Another statistic used in correlation and regression is the standard error of estimate, which is an estimate of the standard deviation of the y values about the predicted y' values. The standard error of estimate can be used to construct a prediction interval about a specific point estimate y' of the mean of the y values for a given x.

52 Conclusion
Many relationships among variables exist in the real world. One way to determine whether a relationship exists is to use the statistical techniques known as correlation and regression.

