Correlation and Regression

CHAPTER 10 Correlation and Regression © Copyright McGraw-Hill 2004

Objectives
Draw a scatter plot for a set of ordered pairs. Compute the correlation coefficient. Test the hypothesis H0: ρ = 0. Compute the equation of the regression line.

Objectives (cont’d.)
Compute the coefficient of determination. Compute the standard error of estimate. Find a prediction interval. Be familiar with the concept of multiple regression.

Introduction
In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship between two or more numerical or quantitative variables exists.

Statistical Methods
Correlation is a statistical method used to determine whether a linear relationship between variables exists. Regression is a statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.

Statistical Questions
Are two or more variables related? If so, what is the strength of the relationship? What type of relationship exists? What kind of predictions can be made from the relationship?

Vocabulary
A correlation coefficient is a measure of how variables are related. In a simple relationship, there are only two types of variables under study. In multiple relationships, many variables are under study.

Scatter Plots
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable, x, and the dependent variable, y. A scatter plot is a visual way to describe the nature of the relationship between the independent and dependent variables.

Scatter Plot Example (figure omitted)

Correlation Coefficient
The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables. The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ.

Correlation Coefficient (cont’d.)
The range of the correlation coefficient is from −1 to +1. If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to −1.

Correlation Coefficient (cont’d.)
When there is no linear relationship between the variables or only a weak relationship, the value of r will be close to 0.

−1 (strong negative linear relationship) … 0 (no linear relationship) … +1 (strong positive linear relationship)

Formula for the Correlation Coefficient r
r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
where n is the number of data pairs.
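This computational formula translates directly into code. A minimal Python sketch (the function name is illustrative, not from the slides):

```python
import math

def correlation_coefficient(xs, ys):
    # r = [n(Σxy) − (Σx)(Σy)] / sqrt{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator

# Perfectly linear data: r is exactly 1.
print(correlation_coefficient([1, 2, 3, 4], [2, 4, 6, 8]))  # → 1.0
```

Note that the sums here are exactly the column totals (Σx, Σy, Σxy, Σx², Σy²) built in the table procedure later in the chapter.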

Population Correlation Coefficient
Formally defined, the population correlation coefficient, ρ, is the correlation computed by using all possible pairs of data values (x, y) taken from a population.

Hypothesis Testing
In hypothesis testing, one of the following is true:
H0: ρ = 0 This null hypothesis means that there is no correlation between the x and y variables in the population.
H1: ρ ≠ 0 This alternative hypothesis means that there is a significant correlation between the variables in the population.

t Test for the Correlation Coefficient
Formula for the t test for the correlation coefficient:
t = r · √[(n − 2) / (1 − r²)]
with degrees of freedom equal to n − 2.
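In code, the test statistic is a one-liner; the result is then compared with a critical value from a t table at n − 2 degrees of freedom. A sketch with hypothetical sample values:

```python
import math

def t_statistic(r, n):
    # t = r · sqrt((n − 2) / (1 − r²)), with d.f. = n − 2
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical sample: r = 0.9 computed from n = 11 data pairs.
print(round(t_statistic(0.9, 11), 3))  # → 6.194
```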

Possible Relationships Between Variables
There is a direct cause-and-effect relationship between the variables: that is, x causes y. There is a reverse cause-and-effect relationship between the variables: that is, y causes x. The relationship between the variables may be caused by a third variable: that is, y may appear to cause x but in reality z causes x.

Possible Relationships Between Variables (cont’d.)
There may be a complexity of interrelationships among many variables: that is, x may cause y, but w, t, and z fit into the picture as well. The relationship may be coincidental: although a researcher may find a relationship between x and y, common sense may prove otherwise.

Interpretation of Relationships
When the null hypothesis is rejected, the researcher must consider all possibilities and select the appropriate relationship between the variables as determined by the study. Remember, correlation does not necessarily imply causation.

Regression Line
If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data’s line of best fit. Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.

Scatter Plot with Three Lines (figure omitted)

A Linear Relation (figure omitted)

Equation of a Line
In algebra, the equation of a line is usually given as y = mx + b, where m is the slope of the line and b is the y intercept. In statistics, the equation of the regression line is written as y' = a + bx, where b is the slope of the line and a is the y' intercept.

Regression Line
Formulas for the regression line y' = a + bx:
a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
where a is the y' intercept and b is the slope of the line.
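The two formulas share the same denominator, which a short Python sketch makes explicit (function name illustrative):

```python
def regression_line(xs, ys):
    # a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
    # b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    denom = n * sxx - sx * sx
    a = (sy * sxx - sx * sxy) / denom
    b = (n * sxy - sx * sy) / denom
    return a, b

# Data lying exactly on y = 1 + 2x recovers a = 1, b = 2.
print(regression_line([1, 2, 3, 4], [3, 5, 7, 9]))  # → (1.0, 2.0)
```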

Rounding Rule
When calculating the values of a and b, round to three decimal places.

Assumptions for Valid Predictions
For any specific value of the independent variable x, the value of the dependent variable y must be normally distributed about the regression line. The standard deviation of each of the dependent variables must be the same for each value of the independent variable.

Limits of Predictions
Remember that when assumptions are made, they are based on present conditions or on the premise that present trends will continue. The assumptions may not prove true in the future.

Procedure: Finding the correlation coefficient and the regression line equation
Step 1 Make a table with columns for subject, x, y, xy, x², and y².
Step 2 Find the values of xy, x², and y². Place them in the appropriate columns.
Step 3 Substitute in the formula to find the value of r.

Procedure (cont’d.)
Step 4 When r is significant, substitute in the formulas to find the values of a and b for the regression line equation.

Total Variation
The total variation, Σ(y − ȳ)², is the sum of the squares of the vertical distance each point is from the mean. The total variation can be divided into two parts: that which is attributed to the relationship of x and y, and that which is due to chance.

Two Parts of Total Variation
The variation obtained from the relationship (i.e., from the predicted y' values) is Σ(y' − ȳ)² and is called the explained variation. Variation due to chance, found by Σ(y − y')², is called the unexplained variation. This variation cannot be attributed to the relationship.

Total Variation (cont’d.)
Hence, the total variation is equal to the sum of the explained variation and the unexplained variation:
Σ(y − ȳ)² = Σ(y' − ȳ)² + Σ(y − y')²
For a single point, the differences are called deviations.

Coefficient of Determination
The coefficient of determination is a measure of the variation of the dependent variable that is explained by the regression line and the independent variable. The symbol for the coefficient of determination is r².

Coefficient of Nondetermination
The coefficient of nondetermination is a measure of the unexplained variation. The formula for the coefficient of nondetermination is 1 − r².
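Both coefficients follow immediately from r, as a quick sketch shows:

```python
def determination_coefficients(r):
    # coefficient of determination r², coefficient of nondetermination 1 − r²
    return r * r, 1 - r * r

r2, nondet = determination_coefficients(0.9)
print(round(r2, 2), round(nondet, 2))  # → 0.81 0.19
```

With r = 0.9, about 81% of the variation in y is explained by the regression line and 19% is unexplained.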

Standard Error of Estimate
The standard error of estimate, denoted by s_est, is the standard deviation of the observed y values about the predicted y' values. The formula for the standard error of estimate is:
s_est = √[Σ(y − y')² / (n − 2)]
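Given a fitted line y' = a + bx, the standard error of estimate can be computed as in this sketch (a and b are assumed to have been found beforehand):

```python
import math

def standard_error_of_estimate(xs, ys, a, b):
    # s_est = sqrt( Σ(y − y')² / (n − 2) ), with y' = a + b·x
    n = len(xs)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

# A perfect fit leaves no residual variation, so s_est = 0.
print(standard_error_of_estimate([1, 2, 3, 4], [3, 5, 7, 9], 1.0, 2.0))  # → 0.0
```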

Prediction Interval
The standard error of estimate can be used for constructing a prediction interval about a y' value. The formula for the prediction interval is:
y' ± t_{α/2} · s_est · √[1 + 1/n + n(x − x̄)² / (n(Σx²) − (Σx)²)]
with d.f. = n − 2.
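A sketch of the interval computation; the critical value t_crit must be looked up in a t table for α/2 with n − 2 degrees of freedom, so it is passed in rather than computed:

```python
import math

def prediction_interval(x, xs, a, b, s_est, t_crit):
    # y' ± t_crit · s_est · sqrt(1 + 1/n + n(x − x̄)² / (n(Σx²) − (Σx)²))
    n = len(xs)
    y_prime = a + b * x
    sx = sum(xs)
    sxx = sum(v * v for v in xs)
    x_bar = sx / n
    margin = t_crit * s_est * math.sqrt(
        1 + 1 / n + n * (x - x_bar) ** 2 / (n * sxx - sx * sx)
    )
    return y_prime - margin, y_prime + margin
```

Note that the interval is symmetric about y' and widens as x moves away from x̄; with s_est = 0 (a perfect fit) it collapses to the point estimate y'.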

Multiple Regression
In multiple regression, there are several independent variables and one dependent variable, and the equation is:
y' = a + b₁x₁ + b₂x₂ + … + bₖxₖ
where x₁, x₂, …, xₖ are the independent variables.
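Prediction with an already-fitted multiple regression equation is just the weighted sum above. A sketch, where the coefficients are hypothetical (as if produced by statistical software):

```python
def predict(a, bs, xs):
    # y' = a + b1·x1 + b2·x2 + … + bk·xk
    return a + sum(b * x for b, x in zip(bs, xs))

# Hypothetical fitted model y' = 1 + 2·x1 + 3·x2, evaluated at (x1, x2) = (4, 5).
print(predict(1.0, [2.0, 3.0], [4.0, 5.0]))  # → 24.0
```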

Multiple Regression (cont’d.)
Multiple regression analysis is used when a statistician thinks there are several independent variables contributing to the variation of the dependent variable. This analysis then can be used to increase the accuracy of predictions for the dependent variable over those made with one independent variable alone.

Assumptions for Multiple Regression
Normality assumption: for any specific value of the independent variable, the values of the y variable are normally distributed.
Equal variance assumption: the variances for the y variable are the same for each value of the independent variable.
Linearity assumption: there is a linear relationship between the dependent variable and the independent variables.

Assumptions (cont’d.)
Nonmulticollinearity assumption: the independent variables are not correlated.
Independence assumption: the values for the y variable are independent.

Multiple Correlation Coefficient
In multiple regression, as in simple regression, the strength of the relationship between the independent variables and the dependent variable is measured by a correlation coefficient. This multiple correlation coefficient is symbolized by R.

Multiple Correlation Coefficient Formula
The formula for R (with two independent variables) is:
R = √[(r²_yx₁ + r²_yx₂ − 2 · r_yx₁ · r_yx₂ · r_x₁x₂) / (1 − r²_x₁x₂)]
where r_yx₁ is the correlation coefficient for the variables y and x₁; r_yx₂ is the correlation coefficient for the variables y and x₂; and r_x₁x₂ is the correlation coefficient for the variables x₁ and x₂.
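The two-predictor formula in code (a sketch; each of the three pairwise correlations would come from the simple correlation formula):

```python
import math

def multiple_R(r_yx1, r_yx2, r_x1x2):
    # R = sqrt( (r_yx1² + r_yx2² − 2·r_yx1·r_yx2·r_x1x2) / (1 − r_x1x2²) )
    numerator = r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2
    return math.sqrt(numerator / (1 - r_x1x2 ** 2))

# Uncorrelated predictors: R reduces to sqrt(r_yx1² + r_yx2²).
print(round(multiple_R(0.6, 0.8, 0.0), 6))  # → 1.0
```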

Coefficient of Multiple Determination
As with simple regression, R² is the coefficient of multiple determination, and it is the amount of variation explained by the regression model. The expression 1 − R² represents the amount of unexplained variation, called the error or residual variation.

F Test for Significance of R
The formula for the F test is:
F = (R²/k) / [(1 − R²)/(n − k − 1)]
where n is the number of data groups (x₁, x₂, …, y) and k is the number of independent variables. The degrees of freedom are d.f.N = k and d.f.D = n − k − 1.
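The statistic itself is simple to compute once R² is known (the values below are hypothetical):

```python
def f_statistic(R2, n, k):
    # F = (R²/k) / ((1 − R²)/(n − k − 1))
    return (R2 / k) / ((1 - R2) / (n - k - 1))

# Hypothetical study: R² = 0.70 with n = 10 subjects and k = 2 predictors.
print(round(f_statistic(0.70, 10, 2), 2))  # → 8.17
```

The result is compared with a critical value from an F table at the stated degrees of freedom.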

Adjusted R²
Since the value of R² is dependent on n (the number of data pairs) and k (the number of variables), statisticians also calculate what is called an adjusted R², denoted by R²_adj. This is based on the number of degrees of freedom.

Adjusted R² (cont’d.)
The formula for the adjusted R² is:
R²_adj = 1 − [(1 − R²)(n − 1)] / (n − k − 1)
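A direct transcription of the formula (the example values are hypothetical):

```python
def adjusted_R2(R2, n, k):
    # R²_adj = 1 − (1 − R²)(n − 1) / (n − k − 1)
    return 1 - (1 - R2) * (n - 1) / (n - k - 1)

# With R² = 0.70, n = 10, k = 2, the adjustment pulls R² down noticeably.
print(round(adjusted_R2(0.70, 10, 2), 3))  # → 0.614
```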

Summary
The strength and direction of the linear relationship between variables are measured by the value of the correlation coefficient r. r can assume values between and including −1 and +1. The closer the value of the correlation coefficient is to +1 or −1, the stronger the linear relationship is between the variables. A value of +1 or −1 indicates a perfect linear relationship.

Summary (cont’d.)
Relationships can be linear or curvilinear. To determine the shape, one draws a scatter plot of the variables. If the relationship is linear, the data can be approximated by a straight line, called the regression line or the line of best fit.

Summary (cont’d.)
In addition, relationships can be multiple. That is, there can be two or more independent variables and one dependent variable. A coefficient of correlation and a regression equation can be found for multiple relationships, just as they can be found for simple relationships.

Summary (cont’d.)
The coefficient of determination is a better indicator of the strength of a linear relationship than the correlation coefficient, because it identifies the percentage of variation of the dependent variable that is directly attributable to the variation of the independent variable. The coefficient of determination is obtained by squaring the correlation coefficient and converting the result to a percentage.

Summary (cont’d.)
Another statistic used in correlation and regression is the standard error of estimate, which is an estimate of the standard deviation of the y values about the predicted y' values. The standard error of estimate can be used to construct a prediction interval about a specific point estimate y' of the mean of the y values for a given x.

Conclusion
Many relationships among variables exist in the real world. One way to determine whether a relationship exists is to use the statistical techniques known as correlation and regression.