Linear Regression Dr. Richard Jackson

Dr. Richard Jackson jackson_r@mercer.edu
This module covers a topic that you probably encountered back in high school: linear regression. You might remember it as constructing a line of best fit through a set of data points. © Mercer University 2005 All Rights Reserved

Linear Regression Used to Predict One Variable from Another
Example: Predict Blood Level From Dose
Linear regression is used to predict one variable from another, for example, to predict the blood level of a drug from a given dose. Once data have been collected on both variables, that information can be used to predict future values of one variable from the other.

Linear Regression
Takes Observed Data From Group of Subjects on Two Variables (X and Y). Calculate Formula for Predicting Y from X. Construct Graph with Regression Line (Line of Best Fit): Straight Line
Linear regression takes observed data from a group of subjects on two variables. It then uses those data to calculate a formula for predicting y, the dependent variable, from x, the independent variable. It also involves constructing a graph with a regression line, or line of best fit: a straight line through a set of data points that enables one to predict one variable from the other using the graph. Let's go back and review a few fundamentals about the formula for a straight line on a graph.

Formula for Straight Line
Y = A + bX
Y = Dependent Variable; A = Y Intercept; b = Slope of Line (Δ in Y with Unit Δ in X); X = Independent Variable
Example: Y = 4 + 2X
Let's take a look at the formula for a straight line: y equals A plus bx. You may recall that y is the dependent variable, A is the y intercept, and b is the slope of the line. The slope is the number of units that y changes with every unit change (a change of 1) in x, and x is the independent variable. If we consider the equation y equals 4 plus 2x, we can identify the straight line that corresponds to that formula. By substituting two values for x in the equation, we can determine the corresponding y values.

Example of Straight Line
If x equals 0 then y equals 4, and if x equals 1 then y equals 6. Plotting those two points, one at x = 0, y = 4 and the other at x = 1, y = 6, identifies a particular straight line; recall that it takes only two points to identify a specific straight line. On your handout you can see the line identified by the equation y equals 4 plus 2x. It crosses the y axis at 4 and the slope of the line is 2. In other words, for every unit change in x, for example going from 0 to 1, there is a change of 2 units on the y axis.
If X =    Then Y =
0         4
1         6
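As a quick check of the two points above, here is a minimal Python sketch that substitutes x values into y equals 4 plus 2x. The function name `line` and its default arguments are illustrative choices, not part of the lecture.

```python
# Evaluate the example straight line Y = 4 + 2X at a few X values.
# `line` is a hypothetical helper name used only for this illustration.
def line(x, intercept=4.0, slope=2.0):
    """Return Y = intercept + slope * x."""
    return intercept + slope * x


points = [(x, line(x)) for x in (0, 1)]
for x, y in points:
    print(f"x = {x}, y = {y}")

# The slope is the change in Y for a one-unit change in X.
change_in_y = line(1) - line(0)
print(f"change in y per unit x = {change_in_y}")
```

Any two x values would do; 0 and 1 are convenient because they show the intercept and the slope directly.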

Clinical Example (See Scatter Diagram See Table I)
Independent Variable: Plasma Atenolol. Dependent Variable: Maximum HR
Let's take a look at a clinical example. Refer to Table I. The scatter diagram represents the relationship between plasma atenolol and maximum exercise heart rate. As the plasma atenolol blood level increases, the maximum exercise heart rate decreases. This is an example of negative correlation: the points slope downward to the right, so as the x variable increases, the y variable decreases.

Clinical Example (See Scatter Diagram See Table I)
Patient            1      2      3      4
Plasma Atenolol    500    400    800    1000
Max. HR            80     75     70     65
The data represent the x and y measures for 4 patients in this study; from the diagram it looks like there are about 30 patients in total. Patient number 1 has a plasma atenolol level of 500 and a maximum heart rate of 80. Patient number 2 has 400 and 75, patient 3 has 800 and 70, and patient number 4 has 1000 and 65.

Formula For Regression Line (Line of Best Fit)
Y’ = 100 + (-0.03) X
It is possible, through various formulas, to draw a line of best fit through these data. We will not concern ourselves with those formulas; we will let the computer do that. It is possible to take the observed data on the x and y variables and determine the y intercept and slope of a line of best fit through these data. This has been done: the y intercept, or A, is 100 and the slope, or b, is minus 0.03. It is minus because there is a negative change in the y variable as x increases. So this line of best fit crosses the y axis at 100, and for every unit change on the x variable we would see a change of minus 0.03 on the y axis. This is the formula that represents the line of best fit through these points.
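The lecture leaves the fitting formulas to the computer. For readers who want to see one common approach, the sketch below uses ordinary least squares, which is the usual way a line of best fit is computed; that is an assumption, since the lecture does not say which formulas the software uses. It is run only on the four patients listed in the table, so its intercept and slope will not match the 100 and -0.03 obtained from the full data set. The name `least_squares` is hypothetical.

```python
# Ordinary least-squares fit of Y = A + bX (assumed method; the lecture
# does not name the formulas its software uses).
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Only the four patients shown in the lecture's table, not the full data set.
plasma_atenolol = [500, 400, 800, 1000]
max_hr = [80, 75, 70, 65]

a, b = least_squares(plasma_atenolol, max_hr)
print(f"A = {a:.2f}, b = {b:.4f}")
```

Even with four patients the fitted slope comes out negative, which agrees with the downward trend in the scatter diagram.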

Using Y’ = 100 + (-0.03) X
X       Y’
400     88
1000    70
If we simply pick two values for x, we can determine the corresponding values for y with the formula and then plot those two points. Those two points identify the line of best fit through these data. So if we arbitrarily choose x values of, let's say, 400 and 1000 and solve for the corresponding y values, we end up with 88 and 70 respectively. If we mark those two points on the graph, as we have on the table with the asterisks, then connecting them identifies the line of best fit through the data. It is then possible to predict the y variable from a given x variable using the graph.
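The substitution just described can be sketched in a few lines of Python. The intercept of 100 and slope of -0.03 come from the lecture; the function name `predict_max_hr` is an illustrative assumption.

```python
# Predicted maximum heart rate from plasma atenolol using the lecture's
# regression line Y' = 100 + (-0.03) X. The function name is hypothetical.
def predict_max_hr(plasma_atenolol, intercept=100.0, slope=-0.03):
    """Return the predicted Y (Y prime) for a given X."""
    return intercept + slope * plasma_atenolol


# The two arbitrarily chosen X values from the lecture.
predictions = {x: round(predict_max_hr(x), 2) for x in (400, 1000)}
for x, y_prime in predictions.items():
    print(f"X = {x}: Y' = {y_prime}")
```

Plotting these two (X, Y’) pairs and joining them with a ruler gives the same regression line the lecture draws on the scatter diagram.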

Predicted Y’s (Y’) Not Same As Observed Y’s
Y’ (or Y prime) is the Predicted Y
Patient    PA (X)    MHR (Y)    Y’
2          400       75         88
4          1000      65         70
We can also use the formula to calculate a y value by plugging in a value of x, as we have just done with 400, and then calculating the value for y. Note that our y in this case is designated y prime. It is also sometimes called y hat, and it is the predicted y value, which differs from the observed y value. The predicted y values all fall along the straight line of best fit, but as you can see, the observed y values vary about that line. The predicted y values, designated y prime, are not the same as the observed y values except in a circumstance that I will describe momentarily. For example, for patient number 2 the plasma atenolol, or x variable, is 400 and the y variable, or maximum heart rate, is 75, but the predicted y value, or y prime, is 88. For patient number 4 the x variable is 1000, the y variable is 65, and the predicted y value, or y prime, is 70. So except in one particular circumstance, the predicted y values are going to differ from the observed y values, because all of the predicted y values fall along the regression line, or line of best fit.
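To make the gap between observed and predicted values concrete, the following sketch computes observed y minus y prime for patients 2 and 4, using the numbers quoted in the lecture. The names `predict` and `differences` are assumptions made for this illustration.

```python
# Difference between observed Y and predicted Y' for the two patients
# discussed in the lecture, using Y' = 100 + (-0.03) X.
def predict(x, intercept=100.0, slope=-0.03):
    """Return the predicted Y (Y prime) on the regression line."""
    return intercept + slope * x


# patient number -> (plasma atenolol X, observed maximum heart rate Y)
observed = {2: (400, 75), 4: (1000, 65)}

differences = {}
for patient, (x, y) in observed.items():
    y_prime = predict(x)
    differences[patient] = round(y - y_prime, 2)
    print(f"patient {patient}: observed {y}, predicted {round(y_prime, 2)}")

print(differences)
```

A negative difference means the observed heart rate lies below the regression line at that plasma level.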

Predicted Y’s (Y’) Will Be Same as Observed Y’s when r = +1.00 or -1.00
The circumstance in which the predicted y’s are the same as the observed y’s occurs when the Pearson r between the two variables is either plus 1 or minus 1. Recall from our previous discussion of the Pearson r that the closer the points fall along a straight line, the closer the Pearson r will be to either plus 1 or minus 1. If the correlation between the two variables is plus 1 or minus 1, then all of the observed values already fall along a straight line, so the predicted y’s are the same as the observed y’s.
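A small demonstration of this special case, offered as a sketch: the data below are made up so that every point lies exactly on y equals 4 plus 2x, and the fit uses ordinary least squares, which is an assumed method since the lecture does not name its formulas. The function name `fit_and_correlate` is hypothetical.

```python
import math


# Least-squares line and Pearson r computed from the same sums
# (assumed method; illustrative data, not from the lecture's study).
def fit_and_correlate(xs, ys):
    """Return (intercept, slope, pearson_r) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


# Made-up points that all lie exactly on Y = 4 + 2X.
xs = [0, 1, 2, 3, 4]
ys = [4 + 2 * x for x in xs]

intercept, slope, r = fit_and_correlate(xs, ys)
predicted = [intercept + slope * x for x in xs]

print(f"r = {r}")
print("predicted matches observed:",
      all(math.isclose(p, y) for p, y in zip(predicted, ys)))
```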

Accuracy of Prediction
The closer the points on the scatter diagram fall on a straight line (the closer the Pearson r is to +1.00 or -1.00), the more accurate is the prediction.
There are two ways to predict a y value from an x value using linear regression. One involves substituting an x value into the equation for the straight line and solving for y. The other involves reading up from a value on the x axis to the regression line and then reading across to the corresponding value on the y axis. Once a prediction is made, one may ask how accurate it is. The closer the points fall on the straight line, the more accurate the prediction. In other words, the closer the Pearson r is to plus 1 or minus 1, the more accurate the prediction. The more the points are spread out, the more the Pearson r approaches zero and the less accurate the prediction.

Accuracy of Prediction
Quantified by Standard Error of Estimate, Syx. Formula: $S_{yx} = S_y \sqrt{1 - r^2}$
The accuracy of prediction with linear regression is quantified with a statistic known as the standard error of estimate. The symbol is the letter S with a subscript yx. It is a standard deviation, as you might guess from the symbol being the letter s, and the subscript yx means it is the standard error of estimate in predicting y from x. Most of the time one is predicting the y variable from x, though in certain circumstances one might want to predict x from y; that, however, would involve a different equation for a straight line. The formula for the standard error of estimate is s sub yx equals s sub y, which is simply the standard deviation of the y values, multiplied by the square root of 1 minus r squared, where r is the Pearson r between the two variables.

Clinical Example
Sy = 5, r = +0.8
$S_{yx} = S_y \sqrt{1 - r^2} = 5 \sqrt{1 - (0.8)^2} = 5 \sqrt{0.36} = 5 (0.6) = 3$
Assume that our standard deviation of y values is 5 and the Pearson r is 0.8. Plugging those numbers into the equation gives us a standard error of estimate of 3.
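The same arithmetic can be checked with a short Python sketch. The formula is the one given in the lecture; the function name `standard_error_of_estimate` is an illustrative assumption.

```python
import math


# Standard error of estimate: Syx = Sy * sqrt(1 - r^2), as given in the lecture.
# The function name is a hypothetical choice for this illustration.
def standard_error_of_estimate(s_y, r):
    """Return Syx from the standard deviation of Y and the Pearson r."""
    return s_y * math.sqrt(1.0 - r ** 2)


# Values assumed in the lecture's clinical example.
s_yx = standard_error_of_estimate(s_y=5.0, r=0.8)
print(round(s_yx, 2))
```

Because r is squared, a Pearson r of minus 0.8 would give the same standard error of estimate as plus 0.8.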

Interpretation of Syx
At any point on the Regression Line, 95% of observed Y’s Fall within plus or minus 2 Syx
Example Y’ for Plasma Level of 800: Y’ = 100 + (-0.03)(800) = 76
Interpretation of the standard error of estimate is as follows. At any point on the regression line, 95% of the observed y’s, that is, the actual values of the dots on the scattergram, fall within plus or minus 2 standard errors of estimate of the regression line. For example, the predicted y for a plasma level of 800 is 76.

Standard Error of Estimate
Predicted Y = 76. Syx = 3. 95% of Y’s fall within plus or minus 2 times 3, or 70-82
From the plasma atenolol data, substituting 800 as the x value in the prediction equation gives us a predicted y value of 76. The standard error of estimate is 3. This means that 95% of the observed y values fall within plus or minus 2 times the standard error of estimate, that is, 2 times 3, of the regression line. So where the predicted y on the regression line is 76, 95% of our observed y’s fall within a range of 70 to 82. That range gives us an indication of how well we are predicting our y values from a given x value.
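The range of 70 to 82 can be reproduced with the sketch below, which simply applies the plus or minus 2 Syx rule from the lecture. The function name `prediction_band` and its `multiplier` parameter are assumptions made for illustration.

```python
# Range within which about 95% of observed Y's are expected to fall,
# using the lecture's rule of plus or minus 2 standard errors of estimate.
def prediction_band(predicted_y, s_yx, multiplier=2.0):
    """Return (low, high) = predicted_y -/+ multiplier * s_yx."""
    half_width = multiplier * s_yx
    return predicted_y - half_width, predicted_y + half_width


# Predicted Y for a plasma level of 800, from Y' = 100 + (-0.03) X.
predicted_y = round(100.0 + (-0.03) * 800, 2)

low, high = prediction_band(predicted_y, s_yx=3.0)
print(f"predicted Y = {predicted_y}, range = {low} to {high}")
```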

Other Observations About Syx
When r = +1 or -1, Syx = 0 (See formula). When r = 0, Syx is at its maximum and = Sy
Other observations about the standard error of estimate concern the extreme values of the Pearson r. When the Pearson r between the two variables is either plus 1 or minus 1, the standard error of estimate is zero. If you take a look at the formula, you will see that when r is plus 1 or minus 1 and you square it, the term under the square root radical becomes zero, so the standard error of estimate is zero. That follows from what we said earlier: if the Pearson r is plus 1 or minus 1, then all the observed y’s already fall along the straight line, so there is no error in predicting one variable from the other. Further, if there is no relationship between the two variables, in other words if the Pearson r is equal to 0, then the standard error of estimate is at its maximum value, which is equal to the standard deviation of the y values. Again, take a look at the formula and substitute a zero for r: you are left with the square root of 1, which is 1, multiplied by the standard deviation of the y values.
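Written out, the two substitutions into the lecture's formula look like this; nothing is assumed beyond the formula for the standard error of estimate given earlier.

```latex
% r = +1 or r = -1: the term under the radical vanishes
S_{yx} = S_y \sqrt{1 - (\pm 1)^2} = S_y \sqrt{1 - 1} = S_y \sqrt{0} = 0

% r = 0: the radical equals 1, so S_{yx} reaches its maximum
S_{yx} = S_y \sqrt{1 - 0^2} = S_y \sqrt{1} = S_y
```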

Also, When r = 0 Regression Line Parallel to X axis
Crosses Y axis at the mean of the Y values. Slope = 0, therefore the Y’ for any X value = A or the mean of the Y values
Also, when the Pearson r is zero the regression line can still be drawn, but it will be parallel to the x axis and it will cross the y axis at the mean of the y values. The slope of that regression line is zero. Therefore, the predicted y for any x value will be equal to A, the y intercept, which is equal to the mean of the y values.
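One way to see why this happens is through the standard least-squares expressions for the slope and intercept. The lecture does not state these formulas, so treat them as a supplementary assumption; here S_x and S_y are the standard deviations of the x and y values, and the barred symbols are their means.

```latex
% Standard least-squares identities (not given in the lecture)
b = r \, \frac{S_y}{S_x}, \qquad A = \bar{Y} - b \, \bar{X}

% Setting r = 0
b = 0 \cdot \frac{S_y}{S_x} = 0, \qquad A = \bar{Y} - 0 \cdot \bar{X} = \bar{Y}

% So the prediction is the same for every X
Y' = A + bX = \bar{Y}
```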

Summary of Linear Regression
Provides Formula and Graphic Device for Predicting One Variable from Another. Accuracy of Prediction Indicated by Standard Error of Estimate. Close Association With Pearson r
To summarize, linear regression provides a formula and a graphic device for predicting one variable from another. The accuracy of the prediction is indicated by a statistic known as the standard error of estimate, and there is a close association between linear regression, the Pearson r, and the accuracy of prediction.

How to Perform Linear Regression Using the Statistix Software
1. Enter Data for Two Variables
2. Select Statistics, Linear Models, Linear Regression
3. Highlight and Move Dependent and Independent Variable Names to Appropriate Boxes, then OK
4. Check on Results, Select Plots
5. Read Prediction Equation at Bottom of Graph
