Regression Analysis.

In the study of correlation we examined the problem of measuring the closeness of the relationship between two variables. The strength of that relationship was measured by calculating a correlation coefficient.

Where a significant correlation exists between two variables, it is generally possible to predict the value of one of the variables from the other. If the correlation coefficient is zero, knowledge of x is no help in predicting y; however, if the correlation coefficient is +1 or -1, the value of y may be predicted with perfect accuracy.

The magnitude of the correlation is a guide to the likely accuracy of prediction, although the value of the coefficient does not indicate how the prediction should be made.

An equation is required to predict y in terms of x:
y = some algebraic expression involving x
The prediction will not be totally accurate, but it provides the best prediction that can reasonably be attained overall; that is, the prediction will sometimes be too high and at other times too low. A predictive equation of this kind is called a regression equation or regression line.

Linear regression
Linear regression is the case where the relationship between variables is represented by a straight line. When ordered pairs are plotted on a scatter diagram, a straight line can be drawn through the points as centrally as practicable.

A car manufacturer believes the quarterly repair bill for parts for a popular car make is strongly influenced by the car's age. To investigate this, a random sample of 8 cars of varying ages is chosen and their average quarterly repair bills for the previous 12 months are recorded.

Age (years)   1    3    5    8    7    12   8    4
Repair ($)    70   140  230  350  300  570  380  200

Example from the bottom of page S272, Croucher. This chart is from the 5th ed., p. 536.

Scatter diagram and regression line

Prediction
To predict the value of the repair bill (y) for a 6-year-old car (x), find the point on the regression line where x = 6. The corresponding value for y is almost $280. Some relationships are not linear, and fitting a straight line to such data would be of little use.
Flick back to the previous slide to view the value for y where x = 6.

It can be seen that some actual y values are higher than they would be predicted to be on the basis of their corresponding x values, and that others are lower, but on average the regression line seems to give a reasonable prediction. In many cases the relationship between two variables may well be described as linear, and so linear regression will be appropriate.

Check first
A simple way to check whether a straight line may be appropriate is to draw a scatter diagram of the data and judge whether the points fall roughly in a straight line or not. The chart on the next slide shows the value of stock on hand by a company at 30 June for the years 1994 – 2002. Does the data lie in a straight line?

Data must be examined first before attempting to fit a straight line; a visual display of the data using a scatter diagram is an essential beginning. These points clearly do not lie in a straight line and regression analysis would not help to describe a trend or predict future values.

Calculation of the regression equation
Recall the convention previously applied for the two variables: x is the independent variable and y the dependent variable. We are trying to find the line that will predict the value of the y variable from knowledge of the x variable. This is known as the least squares regression line of y on x.

Calculations
To calculate the least squares regression line in any problem, calculations are based on data from random samples of values. The resulting equation will not be exactly the same as it would be if the entire population were used. In fact, a different random sample would generate different data points for x and y and therefore a different regression line.
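To illustrate how the fitted line depends on which data points happen to be in the sample, here is a minimal Python sketch that fits a line separately to two halves of the age and repair bill data from these slides. Splitting the eight cars into two groups of four is an assumption made purely for illustration (neither half is a random sample), and the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Age (years) and quarterly repair bill ($) for the eight sampled cars.
ages = [1, 3, 5, 8, 7, 12, 8, 4]
bills = [70, 140, 230, 350, 300, 570, 380, 200]

# Illustrative split only: treat each half as if it were a separate sample.
first_half = fit_line(ages[:4], bills[:4])
second_half = fit_line(ages[4:], bills[4:])

print("first four cars: a = %.2f, b = %.2f" % first_half)
print("last four cars:  a = %.2f, b = %.2f" % second_half)
```

The two halves give noticeably different intercepts and slopes, which is the point made above: the regression line is an estimate that changes with the sample.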

Population equation
The population equation for the least squares regression line of y on x is
Ŷ = α + βx
where α and β are constants and Ŷ is the predicted value of y. In practice, the true values of α and β are not known unless the entire population is surveyed. Instead, we use whatever sample values we have for x and y to make an estimate of α and β.

For samples
When analysing sample data we denote the estimate of α by 'a' and the estimate of β by 'b', where 'a' and 'b' are constants. We also use Ŷ to denote the best prediction of y that we can make for a given value of x:
Ŷ = a + bx

Geometric interpretation
Looking at the scatter diagram on the next slide, the geometric interpretation of a and b is:
a is the value of y where the line intersects the y axis, that is, where x = 0.
b denotes the slope of the line, representing the increase in the value of y when the value of x is increased by 1.
The problem is how to find the 'best' values of a and b from a set of data, that is, the values of a and b that best approximate the values of α and β respectively.

Find the line that will minimise the errors
This data is from page 120 of the text. Get students to make calculations for a and b.
The aim is to find the line that will minimise the value of the expression Σ(y − Ŷ)², which is also called the sum of the squares of the residuals.

Minimise the errors
There are 5 points on the scatter diagram on the previous slide and a line has been drawn through the centre of them. The vertical distance from each point to the line has also been drawn. These distances represent the errors that the line would make in predicting the value of y for the x value of each data point. The closer the data points are to the line drawn, the smaller the errors will be. The aim is to construct a line that will minimise these errors for our set of data points.

From a mathematical view, these errors (or residuals) are the differences between the actual y values and the predicted y values. These differences can be denoted by
y − Ŷ
and we need to find an overall minimum for their value. The most commonly used technique is least squares. This simply states that the sum of the squares of the errors should be made as small as possible.
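As a rough illustration of the least-squares criterion, the Python sketch below computes the sum of squared residuals for a few candidate lines through the age and repair bill data. The line a = 10, b = 45 is the least squares line these slides arrive at for that data; the other three candidates are arbitrary lines chosen here only for comparison, and the function name is hypothetical.

```python
# Age (years) and quarterly repair bill ($) for the eight sampled cars.
ages = [1, 3, 5, 8, 7, 12, 8, 4]
bills = [70, 140, 230, 350, 300, 570, 380, 200]

def sum_squared_residuals(a, b, xs, ys):
    """Sum of (actual y - predicted y)^2 for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# The first entry is the least squares line; the rest are arbitrary comparisons.
candidates = [(10, 45), (10, 44), (0, 46), (20, 45)]
for a, b in candidates:
    sse = sum_squared_residuals(a, b, ages, bills)
    print(f"a = {a:>2}, b = {b}: sum of squared residuals = {sse}")
```

Each of the comparison lines gives a larger total than the least squares line, which is what "as small as possible" means in practice.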

Ŷ = a + bx
a = (Σy − bΣx) / n
b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
Calculate b first. Write these on the board.

Table for Quarterly repair bills
  x       y      xy     x²
  1      70      70      1
  3     140     420      9
  5     230    1150     25
  8     350    2800     64
  7     300    2100     49
 12     570    6840    144
  8     380    3040     64
  4     200     800     16
 48    2240   17220    372   (totals)
Distribute handout for calculations.

a = (Σy − bΣx) / n and b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
b = (8 × 17220 − 48 × 2240) / (8 × 372 − 48²) = 30240 / 672 = 45
a = (2240 − 45 × 48) / 8 = 10
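The same calculation can be checked with a few lines of Python. This is only a sketch of the hand calculation above, using the eight age and repair bill pairs from the table; the variable names are chosen here for readability.

```python
# Age (years) and quarterly repair bill ($) for the eight sampled cars.
ages = [1, 3, 5, 8, 7, 12, 8, 4]
bills = [70, 140, 230, 350, 300, 570, 380, 200]

n = len(ages)
sum_x = sum(ages)                                  # Σx  = 48
sum_y = sum(bills)                                 # Σy  = 2240
sum_xy = sum(x * y for x, y in zip(ages, bills))   # Σxy = 17220
sum_x2 = sum(x * x for x in ages)                  # Σx² = 372

# Calculate b first, then a, exactly as in the formulas above.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"Y = {a:g} + {b:g}x")  # prints: Y = 10 + 45x
```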

Scatter diagram and regression line for Quarterly repair bills

Example
An accountant wishes to undertake a cost analysis of electricity consumption of office heating for various maximum daily temperatures. The data below was recorded over 8 randomly chosen days. Use the least-squares method to find a line of best fit.

Day                             1    2    3    4    5    6    7    8
Max temp (°C)                   26   31   25   26   14   20   16   34
Electricity used ('000 units)   35   20   37   24   42   41   40   17

Maximum temperature x   Electricity used y     xy      x²
        26                      35             910     676
        31                      20             620     961
        25                      37             925     625
        26                      24             624     676
        14                      42             588     196
        20                      41             820     400
        16                      40             640     256
        34                      17             578    1156
       192                     256            5705    4946   (totals)
Croucher p. S279, example S9.4. Enter data into Fin Calc: Stat Mode 1,1 …. 26, (x,y), 35, ENT, 31, (x,y), 20, ENT …… RCL 1, RCL . RCL a, and RCL b

a = (Σy − bΣx) / n and b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
b = (8 × 5705 − 192 × 256) / (8 × 4946 − 192 × 192) = −3512 / 2704 = −1.30
a = (256 − (−1.30 × 192)) / 8 = 63.2
Ŷ = 63.2 − 1.30x
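As a cross-check on the worked table, the following Python sketch rebuilds the xy and x² columns, totals them, and then applies the formulas for b and a. The eight temperature and electricity pairs are those in the calculation table; the variable names are assumptions made for this example.

```python
# Maximum temperature (°C) and electricity used ('000 units) over 8 days.
temps = [26, 31, 25, 26, 14, 20, 16, 34]
units = [35, 20, 37, 24, 42, 41, 40, 17]

# One row per day: x, y, xy, x², mirroring the calculation table.
rows = [(x, y, x * y, x * x) for x, y in zip(temps, units)]
totals = [sum(column) for column in zip(*rows)]  # [Σx, Σy, Σxy, Σx²]

n = len(rows)
sum_x, sum_y, sum_xy, sum_x2 = totals
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print("totals:", totals)
print(f"Y = {a:.1f} + ({b:.2f})x")
```

The slope is negative, so predicted electricity use falls as the maximum temperature rises.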

More practice
For the temperature and energy consumption data, predict the electricity consumption of office heating when the maximum temperature is 30°.
Ŷ = 63.2 − (1.3 × 30) = 24.2
That is, the regression line predicts that the electricity consumption will be 24,200 units.
Using the calculator: 30, 2nd F, ( gives y', the predicted value of Y.

Electricity used at Accountant’s office

How well does a regression line fit the data?
The criterion is to use the line that has the smallest sum of squared errors. To determine whether the line is a good fit we can use the coefficient of determination, denoted by r². Recall from last week the formula for the coefficient of correlation, r.

R² or r² is the square of the correlation coefficient r. Since the value of r always lies between -1 and +1, the value of r² must always lie between 0 and 1.

If the value of r² is close to 1, a straight line fits the data well.
If the value of r² is closer to 0, the straight line fits the data poorly. Calculate the coefficient of determination for the age and repair bill data.

Determine the correlation coefficient for the car manufacturer analysing quarterly repair bills.
r² = 3,780² / (84 × 172,000) = 14,288,400 / 14,448,000 ≈ 0.989
As r² is close to 1, the line appears to be a very good fit to the data.
Handout spread sheet with relevant calculations already made. Ask students to calculate the coefficient using the formula and/or calculator. r =
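The coefficient of determination for the age and repair bill data can be checked with the short Python sketch below. It assumes the deviation-sum form of the correlation formula (Sxy, Sxx, Syy), which is the form the figures on this slide appear to follow; the variable names are hypothetical.

```python
import math

# Age (years) and quarterly repair bill ($) for the eight sampled cars.
ages = [1, 3, 5, 8, 7, 12, 8, 4]
bills = [70, 140, 230, 350, 300, 570, 380, 200]
n = len(ages)

sum_x, sum_y = sum(ages), sum(bills)
# Corrected sums of products and squares.
s_xy = sum(x * y for x, y in zip(ages, bills)) - sum_x * sum_y / n
s_xx = sum(x * x for x in ages) - sum_x ** 2 / n
s_yy = sum(y * y for y in bills) - sum_y ** 2 / n

r_squared = s_xy ** 2 / (s_xx * s_yy)
# r takes the sign of s_xy (positive here, since bills rise with age).
r = math.copysign(math.sqrt(r_squared), s_xy)

print(f"Sxy = {s_xy:g}, Sxx = {s_xx:g}, Syy = {s_yy:g}")
print(f"r squared = {r_squared:.3f}, r = {r:.3f}")
```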

Using the regression line for prediction
Having found the least squares regression line of y on x, a predicted value for y can be made for a specific value of x. If the regression line is a poor fit to the data, the prediction will be of little use. Even if the regression line is a good fit to the data, it is always dangerous to make a prediction of y for an x value outside the limits (smallest and largest) of the x values used in finding the equation of the line.

The reason for this is that we have no guarantee that the trend of points outside these limits will be linear, and indeed, the equation may well provide ridiculous answers. However, to make a simple point prediction, just substitute the x value in question into the regression equation.
Don't show next slide. Ask students to refer back to the Age and Repair bill data sheet and calculate the predicted repair bill for a 10-year-old car.

Practice
For the Age and Repair Bill data, use the least squares regression line to estimate the quarterly repair bill of a ten-year-old car. Substitute x = 10 in the regression equation:
Ŷ = 10 + (45 × 10) = $460
That is, the predicted quarterly repair bill for a ten-year-old car is $460.
Don't show next slide. Ask students to refer back to the Temperature and energy consumption data sheet and predict the energy used when the temperature is 30 degrees.
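A point prediction, together with the warning about x values outside the sampled range, might be sketched in Python as follows. The coefficients a = 10 and b = 45 are taken from the repair bill calculation in these slides; the function name and the returned flag are assumptions made for this example.

```python
# Ages (years) of the eight sampled cars; these set the safe prediction range.
ages = [1, 3, 5, 8, 7, 12, 8, 4]
A, B = 10, 45  # least squares intercept and slope for the repair bill data

def predict_repair_bill(age):
    """Return (predicted bill in $, True if age lies within the sampled range)."""
    within_range = min(ages) <= age <= max(ages)
    return A + B * age, within_range

print(predict_repair_bill(10))  # (460, True)
print(predict_repair_bill(20))  # (910, False): extrapolation, treat with caution
```

The second call still returns a number, but the flag shows that a 20-year-old car lies beyond the oldest car sampled (12 years), so there is no guarantee the linear trend holds there.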

Example: predict sales for an expenditure of $13,000 on advertising
Advertising x ($'000)   Sales y ($'000)     xy      x²
        10                   200           2000     100
        11                   230           2530     121
        12                   250           3000     144
        14                   270           3780     196
        15                   280           4200     225
        16                   300           4800     256
        78                  1530          20310    1042   (totals)
This data is from page 120 of the text. Get students to make calculations for a and b.
a = 60,000, b = 15
Estimate sales for an advertising expenditure of $13,000:
Ŷ = 60,000 + (15 × 13,000) = 255,000 predicted sales

See notes in textbook page. On the calculator: Mode, 1, 1, then 10, (x,y), 200, ENT, 11, (x,y), 230, ENT, 12, (x,y), 250, ENT, 14, (x,y), 270, ENT, 15, (x,y), 280, ENT, 16, (x,y), 300, ENT. Then, if we plug in an x value such as 13, "2nd F (" gives us the corresponding y value = 255.

And yes, our calculator will find "a" and "b" for us. See if you can find out which keys to press to find "a" and "b". The answer is …………….. To find "a" (notice it's green), press RCL , = 60, and for "b" press RCL, DEL = 15.
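The advertising example can be reproduced in Python as a check on both the calculator result and the dollar figures. This sketch assumes, as the table values suggest, that both advertising and sales are recorded in thousands of dollars; the variable names are hypothetical.

```python
# Advertising spend and sales, both assumed to be in $'000.
advertising = [10, 11, 12, 14, 15, 16]
sales = [200, 230, 250, 270, 280, 300]

n = len(advertising)
sum_x, sum_y = sum(advertising), sum(sales)
sum_xy = sum(x * y for x, y in zip(advertising, sales))
sum_x2 = sum(x * x for x in advertising)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n  # intercept in $'000, as on the calculator

# Same prediction in the two unit systems used on the slides.
predicted_thousands = a + b * 13             # x = 13 ($'000)
predicted_dollars = 1000 * a + b * 13_000    # x = $13,000

print(f"a = {a:g}, b = {b:g}")
print(predicted_thousands, predicted_dollars)
```

Working in thousands gives a = 60 and a prediction of 255, matching the calculator; converting to dollars scales the intercept to 60,000 and leaves the slope at 15, giving 255,000.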

This is what it looks like if you put Demand on the x axis and Price on the y axis. It shows a moderate to strong negative correlation: as price drops, demand increases, which is what you would expect. Perhaps the slide on the next page is better, with Price on the x axis. Check this at $1.30.

Check this using your calculator
This question asks students to estimate visually. If they do calculate a = and b =

Line of best fit using (1) scattergram and (2) Least Squares method

Use calculator to determine Product-Moment Correlation Coefficient

Long hand calculation of Product-Moment Correlation Coefficient

Or Y = -18.18 + 0.11 x if using 4 decimal places in calcs
Using 2 decimal places it is y = x

Suggested Questions from Textbook……
Select a range of questions from the Problems in this chapter – enough so that you feel comfortable with this topic
