Presentation on theme: "CORRELATION & REGRESSION". Presentation transcript:
1 CORRELATION & REGRESSION Correlation and regression are concerned with the investigation of relationships between two or more variables.
2 We consider just two associated variables. We might want to know:
- whether a relationship exists between those variables
- if so, how strong that relationship is
- what form that relationship takes
- whether we can make use of that relationship for predictive purposes, i.e. forecasting
3 General method for investigating the relationship between 2 variables:
- Correlation is used to find the strength of the relationship.
- Regression describes the relationship itself in the form of an equation which best fits the data.
4 For an initial insight into the relationship between two variables, plot a scatter diagram. If there appears to be a linear relationship, quantify it by calculating the correlation coefficient. This is a measure of the strength of the linear relationship. Its symbol is 'r' and its value lies between -1 and +1.
5 If the relationship is found to be significantly strong, find the equation of the ‘line of best fit’ through the data, using linear regression. The 'goodness of fit' statistic can be calculated to see how useful the regression equation is likely to be. Once defined by an equation, the relationship can be used for predictive purposes.
6 Example
The data represents a sample of advertising expenditures and sales for ten randomly selected months. See slide 12 for complete data.
Table columns: Month; Advertising expenditure (£0,000's), x; Sales (£0,000's), y; etc.
Plot a scatter diagram of the data.
7 Note: the scales do not start at zero. The graph suggests a linear relationship between sales and advertising expenditure. In general, the larger the amount spent on advertising, the higher the sales.
8 If there is a relationship, we need to be able to measure the strength of that relationship, i.e. calculate the value of the correlation coefficient.
9 Pearson's Product Moment Correlation Coefficient (r)
- is a measure of how close a linear relationship there is between x and y
- can be produced directly from a calculator in LR (linear regression) mode
For the sales and advertising data the correlation coefficient: r =
The value of r is always between +1 and -1.
10 Scatter diagram panels: r = -1, perfect negative correlation; r = 0, no correlation; r = +0.8; r = +1, perfect positive correlation.
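11 Formula for correlation coefficient, r
$r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$
where
$S_{xx} = \sum x^2 - \dfrac{\sum x \sum x}{n}$
$S_{yy} = \sum y^2 - \dfrac{\sum y \sum y}{n}$
$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n}$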
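The formula for r can be sketched in a few lines of Python. This is a minimal illustration only: the function name `pearson_r` and the small data set are hypothetical, and the values are not the advertising and sales figures from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the sums of squares Sxx, Syy and Sxy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x * sum_x / n
    syy = sum(y * y for y in ys) - sum_y * sum_y / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxy / sqrt(sxx * syy)

# Hypothetical illustration data (NOT the data set from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(round(r, 3))
```

For this made-up data Sxx = 10, Syy = 6 and Sxy = 6, so r is 6 divided by the square root of 60, about 0.775.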
12 Longhand calculations for correlation coefficient r. Step 1
13 Step 2
$S_{xx} = \sum x^2 - \dfrac{\sum x \sum x}{n} = 9.28 - \dfrac{9.4 \times 9.4}{10} = 0.444$
$S_{yy} = \sum y^2 - \dfrac{\sum y \sum y}{n}$
$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n}$
Step 3
Therefore: $r = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$
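The one calculation whose figures survive in the transcript, Sxx, can be checked directly. The sketch below uses only the quoted sums (n = 10, sum of x = 9.4, sum of x squared = 9.28); the variable names are illustrative.

```python
# Sums quoted on the slide for the advertising data.
n = 10
sum_x = 9.4
sum_x2 = 9.28

# Sxx = (sum of x^2) - (sum of x)(sum of x) / n
sxx = sum_x2 - sum_x * sum_x / n
print(round(sxx, 3))
```

The result rounds to 0.444, in agreement with the slide.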
14 Hypothesis test for the value of r
We shall not go into the details here!
Null hypothesis (H0): A linear relationship does not exist between sales and advertising.
Alternative hypothesis (H1): A linear relationship does exist between sales and advertising.
If we calculate a test statistic and critical value, we discover that test statistic > critical value, so we reject H0.
Conclude that a linear relationship exists between sales and amount spent on advertising.
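The slides do not say which test is used. One common choice, assumed here, is the t statistic t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. In this sketch r is not taken from this slide: it is recovered, as an assumption, from the goodness of fit figure of 76.6% quoted later in the presentation and taken as positive. The critical value 2.306 is the standard two-tailed 5% value of t for 8 degrees of freedom.

```python
from math import sqrt

n = 10
# Assumption: r recovered from R^2 = 76.6%, positive because sales rise with advertising.
r = sqrt(0.766)

# Assumed test: t statistic for a correlation coefficient, n - 2 degrees of freedom.
t_stat = r * sqrt(n - 2) / sqrt(1 - r * r)

# Two-tailed 5% critical value of t with 8 degrees of freedom (standard tables).
t_crit = 2.306

reject_h0 = abs(t_stat) > t_crit
print(round(t_stat, 2), reject_h0)
```

Under these assumptions the test statistic is a little above 5, well beyond the critical value, which is consistent with the slide's decision to reject H0.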
15 The Goodness of Fit Statistic ($R^2$)
This also measures the closeness of the relationship between x and y.
$R^2 = 100\,r^2$
$R^2$ tells us what percentage of the total variation in y (here sales) is explained by the variation in x (here advertising expenditure).
16 Interpretation:
If r = +1 or -1, then $R^2$ = 100%, so 100% of the variation in y is explained by the variation in x.
If r = 0, then $R^2$ = 0%, so none of the variation in y is explained by the variation in x.
For the data above the goodness of fit statistic $R^2 = 100\,r^2$ = 76.6%
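A short sketch of the goodness of fit calculation, including the two boundary cases from the interpretation. The function name `goodness_of_fit` is hypothetical. The last lines work backwards from the quoted 76.6% to the size of r; the sign of r cannot be recovered from the goodness of fit alone.

```python
from math import sqrt

def goodness_of_fit(r):
    """Goodness of fit as a percentage: 100 * r^2."""
    return 100 * r * r

print(goodness_of_fit(1.0))   # perfect positive correlation
print(goodness_of_fit(-1.0))  # perfect negative correlation
print(goodness_of_fit(0.0))   # no correlation

# Working backwards from the quoted 76.6% gives the magnitude of r only.
r_magnitude = sqrt(76.6 / 100)
print(round(r_magnitude, 3))
```

The recovered magnitude is about 0.875.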
17 76.6% of the variation in sales is explained by the variation in the amount spent on advertising. The remaining 23.4% of the variation is explained by other factors, e.g. price, competitors' prices, etc.
18 Regression equation
Since we know, for the sample data, that there is a significant relationship between the two variables, the next obvious step is to find its equation. We can then add the regression line to the scatter diagram and use it to predict future sales, given advertising expenditure for a particular month. The regression equation can be produced directly from a calculator in LR mode.
19 The regression line has the equation: y = a + bx
- x is the independent variable
- y is the dependent variable
- a is the intercept on the y-axis
- b is the gradient or slope of the line
20 For the sales and advertising data, the values of a and b are 46.5 and 52.6. So the regression equation is:
y = 46.5 + 52.6x
Sales = 46.5 + 52.6 × advertising
(a and b can be found using LR mode on your calculator or by calculation.)
21 Formula for a and b
The line of best fit is found by calculating the squares of the differences between actual and expected values. We choose a and b so that the total of these squared differences is minimized:
$b = \dfrac{S_{xy}}{S_{xx}}$, $a = \bar{y} - b\,\bar{x}$
where $\bar{x}$, $\bar{y}$ are the means of the x and y data and the S's are defined as previously. $(\bar{x}, \bar{y})$ is called the centroid.
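The formulas for a and b translate directly into code. The sketch below is illustrative only: `fit_line` is a hypothetical name and the data are made up, not the slide's data set. It also checks that the fitted line passes through the centroid.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x * sum_x / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx                 # slope
    a = sum_y / n - b * sum_x / n  # intercept: y_bar - b * x_bar
    return a, b

# Hypothetical illustration data (NOT the data set from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = fit_line(xs, ys)

# The fitted line always passes through the centroid (x_bar, y_bar).
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
print(round(a, 3), round(b, 3), round(a + b * x_bar, 3), y_bar)
```

For this made-up data the slope is 0.6 and the intercept is 2.2, and the line gives y = 4 at the mean x = 3, which is the mean of y.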
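22 Calculations for the regression equation
In the regression equation y = a + bx:
$b = \dfrac{S_{xy}}{S_{xx}} = 52.6$
$a = \bar{y} - b\,\bar{x} = 95.9 - 52.6 \times 0.94 = 46.5$
(as $\bar{y} = \dfrac{\sum y}{n} = 95.9$ and $\bar{x} = \dfrac{\sum x}{n} = \dfrac{9.4}{10} = 0.94$)
Therefore the regression equation is y = 46.5 + 52.6x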
23 Plotting the regression equation on the scatter diagram
The line y = a + bx can be plotted on the scatter diagram by plotting three points: the centroid $(\bar{x}, \bar{y})$ and any other two points which satisfy the regression equation.
From the data, $(\bar{x}, \bar{y})$ = (0.94, 95.9)
When x = 0.6, y = 46.5 + (52.6 × 0.6) = 78.1
When x = 1.2, y = 46.5 + (52.6 × 1.2) = 109.6
Plot (0.94, 95.9)
Plot (0.6, 78.1)
Plot (1.2, 109.6)
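The plotting points can be checked by evaluating the regression equation with the quoted coefficients a = 46.5 and b = 52.6. The function name `regression_y` is illustrative.

```python
# Coefficients quoted on the slides for the sales and advertising data.
A, B = 46.5, 52.6

def regression_y(x):
    """Estimated sales (in units of 10,000 pounds) from the fitted line."""
    return A + B * x

# Evaluate at the two chosen x values and at the mean of x.
points = [(x, round(regression_y(x), 2)) for x in (0.6, 0.94, 1.2)]
print(points)
```

This gives roughly 78.06 at x = 0.6 and 109.62 at x = 1.2. At the mean x = 0.94 the line gives about 95.94 rather than exactly 95.9, because a and b are themselves rounded.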
25 Note
- The regression equation y = a + bx can only be used to calculate an estimate for y given the value of x.
- The linear relationship y = a + bx can only be assumed to exist between y and x for the range of values within the sample.
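One way to respect this restriction in code is to refuse to predict outside the sampled range. The sketch below is hypothetical: the transcript does not give the smallest and largest advertising values in the sample, so the bounds `x_min = 0.6` and `x_max = 1.2` are placeholder assumptions, and `predict_sales` is an illustrative name.

```python
def predict_sales(x, a=46.5, b=52.6, x_min=0.6, x_max=1.2):
    """Estimate sales from advertising, both in units of 10,000 pounds.

    x_min and x_max are ASSUMED bounds of the sample, not values from the slides.
    """
    if not (x_min <= x <= x_max):
        raise ValueError(f"x = {x} is outside the sampled range [{x_min}, {x_max}]")
    return a + b * x

print(round(predict_sales(1.0), 1))

try:
    predict_sales(5)
except ValueError as err:
    print("refused:", err)
```

Raising an error rather than returning a number makes extrapolation an explicit decision for the caller.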
26 Interpreting the coefficients in the regression equation: first the a value
The intercept (a) is the estimate of y when x = 0, but care is needed if using this. Why?
y = 46.5 + 52.6x
Sales = 46.5 + 52.6 × advertising
When x = 0, y = 46.5, i.e. when nothing is spent on advertising, sales would be expected on average to be 46.5 units = 46.5 × £10,000 = £465,000.
27 Interpreting the coefficients in the regression equation: the b value
y = 46.5 + 52.6x
If x = 0, y = 46.5, but care is needed here!
If x = 0.6, y = 46.5 + (52.6)(0.6) = 78.1
If x = 0.8, y = 46.5 + (52.6)(0.8) = 88.6
If x = 1, y = 46.5 + 52.6 = 99.1
If x = 1.2, y = 46.5 + (52.6)(1.2) = 109.6
If x = 2, y = 46.5 + 52.6 × 2, but care is needed here also!
etc.
So if advertising expenditure is increased by 1 unit, sales will be increased by 52.6 units on average.
28 For each additional £10,000 spent on advertising, sales will increase by 52.6 × £10,000 = £526,000 on average. But we cannot estimate sales outside the range of the sample, e.g. we should not try to estimate sales for x = 5 using this method.
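Because both variables are recorded in units of £10,000, the coefficients have to be scaled to read them in pounds. A small sketch of that conversion, with illustrative variable names:

```python
UNIT = 10_000  # pounds per unit: both x and y are recorded in units of 10,000 pounds

a, b = 46.5, 52.6  # intercept and slope quoted on the slides

# Expected sales when nothing is spent on advertising (the intercept), in pounds.
baseline_sales = round(a * UNIT)

# Average increase in sales for each additional 10,000 pounds of advertising.
extra_sales_per_unit = round(b * UNIT)

print(baseline_sales, extra_sales_per_unit)
```

This reproduces the two figures quoted in the slides, £465,000 and £526,000.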