Pg 337-345: 3b, 6b (form and strength)
Pg 350-359: 10b, 12a, 16c, 16e
Regression line: a straight line that describes how a response variable y changes as an explanatory variable x changes. It is often used to predict the value of y for a given value of x. Also known as a line of best fit.
You can predict the humerus length when the femur length is 50 by following the dotted line to the regression line and then moving horizontally to the humerus values.
Since different people may draw regression lines differently, an equation may give a more accurate prediction. Because we want to predict y from x, we want a line that is close to the points in the vertical (y) direction. We need a way to find, from the data, the equation of the line that comes closest to the points in the vertical direction. There are many ways to make the collection of vertical distances “as small as possible”…
The least-squares regression line (LSRL) of y on x is the line that makes the sum of the squared vertical distances of the data points from the line as small as possible.
To find the proper placement of the line, look at the vertical distances of the points from the line you have drawn. Each distance d is the difference between the observed y-value of a point and the predicted y-value (where the line crosses that x-value). These differences can be positive, negative, or zero: positive when the point is above the line, negative when it is below, and zero when it is on the line. Square each difference and add the squares together. The regression line is placed where this sum of squared differences is smallest.
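The squaring-and-summing step described above can be sketched in a few lines of Python. This is only an illustration: the data values and the function name `sum_squared_residuals` are made up for this example and do not come from the textbook.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = a + b * x   # y-value where the line crosses this x
        d = y - predicted       # positive above the line, negative below, zero on it
        total += d ** 2
    return total

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# A line drawn by eye versus the least-squares line for these points.
sse_guess = sum_squared_residuals(xs, ys, a=1.0, b=1.0)
sse_lsrl = sum_squared_residuals(xs, ys, a=2.2, b=0.6)

print(sse_guess, sse_lsrl)  # the least-squares line gives the smaller sum
```

For these points the hand-drawn line gives a sum of 4, while the least-squares line gives about 2.4. No other straight line gives a smaller sum for this data.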
ŷ = a + bx, where ŷ is the predicted value of y
◦ b = the slope of the line (the amount by which the predicted y changes when x increases by 1 unit)
◦ a = the intercept (the predicted value of y when x = 0)
The equation can be used to predict y for a given value of x. We will be doing a calculator activity on Tuesday to see how to get these equations from scatterplots.
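The calculator will find a and b for you, but it may help to see one standard way they can be computed. The sketch below assumes the usual formulas b = r·(s_y / s_x) and a = ȳ − b·x̄, which are not stated in these notes. The data values and the function names `least_squares_line` and `predict` are made up for illustration.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x, using b = r * s_y / s_x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    # Correlation r: average product of standardized x and y values (dividing by n - 1).
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x                 # slope
    a = y_bar - b * x_bar             # intercept: the line passes through (x-bar, y-bar)
    return a, b

def predict(a, b, x):
    """Predicted y for a given x."""
    return a + b * x

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
print(a, b, predict(a, b, 6))
```

For this data the slope comes out to about 0.6 and the intercept to about 2.2, so each 1-unit increase in x raises the predicted y by about 0.6.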
We often use several explanatory variables to predict a response. The basic properties of predicting responses with a least-squares regression line are:
◦ Prediction is based on fitting some “model” to a set of data. (Here the model is the regression line.)
◦ Prediction works best when the model fits the data closely. (Predictions are more trustworthy when the points lie close to the line; if the pattern is not strong, prediction may be very inaccurate.)
◦ Prediction outside the range of the available data is risky. (Predicting within the range is fine, but the pattern may not continue outside it. For example, a child who kept growing at the same rate might be predicted to be 10 feet tall at age 20.)
Correlation and regression are closely connected; however, correlation does not require you to choose an explanatory variable, and regression does. Both correlation and regression are strongly affected by outliers… What do you think Hawaii is known for that is definitely an outlier compared to the other 49 states?
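The risk of predicting outside the range of the data can be sketched as follows. The ages and heights below are hypothetical numbers invented for this illustration, not real measurements, and the helper names `fit_line` and `predict_with_range_check` are likewise made up.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

def predict_with_range_check(xs, a, b, x_new):
    """Return (prediction, is_extrapolation) for a new x-value."""
    is_extrapolation = x_new < min(xs) or x_new > max(xs)
    return a + b * x_new, is_extrapolation

# HYPOTHETICAL data: a child's age in years and height in inches.
ages = [3, 4, 5, 6, 7, 8]
heights_in = [37, 40, 43, 45, 48, 50]

a, b = fit_line(ages, heights_in)
inside = predict_with_range_check(ages, a, b, 5)    # age 5 is within the data
outside = predict_with_range_check(ages, a, b, 20)  # age 20 is far outside it

print(inside, outside)
```

The line keeps rising at the same rate forever because it knows nothing about growth slowing down, so the flag on the second prediction is a reminder not to trust it.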
If Hawaii is included, r = 0.408; if Hawaii is not included, r = 0.195. If Hawaii is included, the LSRL is the solid line; if Hawaii is not included, the LSRL is the dotted line.
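To see how a single point can move r this much, here is a small sketch. It does not use the data for the 50 states (which is not reproduced in these notes). The values below are made up, and `correlation` is a helper written for this example.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r between two lists of equal length."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_without = correlation(xs, ys)
# Add one point far from the rest, in both x and y.
r_with = correlation(xs + [10], ys + [12])

print(r_without, r_with)
```

In this made-up example the one extra point raises r from about 0.77 to about 0.96, even though the other five points have not changed.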
The usefulness of the regression line for prediction depends on the strength of the correlation between the variables. The square of the correlation, r², is the right measure to use. r² is a number between 0 and 1: it is the fraction of the variation in y that is explained by the least-squares regression of y on x. The higher the number, the more of the variation the line accounts for (you want a high number). Example: r² = 0.972 means the regression line explains 97.2% of the variation in y.
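One way to see what "fraction of the variation explained" means is to compare the squared distances from the regression line with the squared distances from the mean of y. The sketch below assumes the standard identity r² = 1 − SSE/SST, which is not stated in these notes. The data values and the function name `r_squared` are made up for illustration.

```python
def r_squared(xs, ys):
    """Fraction of the variation in y explained by the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    # SSE: squared vertical distances from the regression line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # SST: squared vertical distances from the horizontal line y = y-bar.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(r_squared(xs, ys))
```

For this data the result is about 0.6, so the line explains roughly 60% of the variation in y. That matches squaring the correlation, which is about 0.77 for the same points.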
A strong relationship between 2 variables does not always mean that changes in one variable cause changes in the other. The relationship between two variables is often influenced by other variables lurking in the background. The best evidence for causation comes from randomized comparative experiments. The observed relationship between 2 variables may be due to direct causation, common response, or confounding. An observed relationship can be used for prediction without worrying about causation as long as the patterns found in the past data continue to hold true.
There is a strong relationship between cigarette smoking and death rate from lung cancer. Does smoking cigarettes cause lung cancer? There is a strong association between the availability of handguns in a nation and that nation’s homicide rate from guns. Does easy access to handguns cause more murders?
Does watching television extend your lifespan?
◦ Countries that are rich enough to have televisions are probably also fortunate enough to have better nutrition, clean water, better health care, etc. than poorer nations.
◦ This has been called a “nonsense correlation”. The correlation is real, but the conclusion is nonsense.
Common response: a lurking variable influences both x and y, creating a high correlation even though there is no direct connection between x and y. Example, obesity in children: TV viewing time can be a lurking variable, while the explanatory variables may be inheritance from parents, overeating, or lack of physical activity.
Confounding: the effects of two variables on a response are mixed together and cannot be told apart. Example: a child may be overweight not because of the child's own poor eating habits but because the parents provide poor choices (the parents have bad eating habits themselves).
If an experiment is not possible, the case for causation is strongest when the following criteria are met:
◦ The association between the variables is strong.
◦ The association between the variables is consistent.
◦ Higher doses are associated with stronger responses.
◦ The alleged cause precedes the effect in time.
◦ The alleged cause is plausible.