2 Bivariate data
x-variable: the independent or explanatory variable
y-variable: the dependent or response variable
Use x to predict y.
3 The equation of the line: $\hat{y} = a + bx$
$\hat{y}$ (y-hat) means the predicted y. Be sure to put the hat on the y.
b is the slope: the amount by which y increases when x increases by 1 unit.
a is the y-intercept: the height of the line when x = 0. In some situations, the y-intercept has no meaning.
4 Least Squares Regression Line (LSRL)
The line that gives the best fit to the data set.
The line that minimizes the sum of the squares of the deviations from the line.
6 What is the sum of the deviations from the line? Will it always be zero?
Use a calculator to find the line of best fit for the points (0,0), (3,10), (6,2).
Find y - y-hat for each point: the deviations are -3, 6, and -3, so their sum is 0.
Sum of the squares = 54
The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
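The calculator step can be checked by hand. The sketch below fits the least-squares line to the three points on the slide using the usual closed-form slope and intercept formulas (the helper name `lsrl` is made up for this illustration), then lists the deviations, their sum, and the sum of their squares.

```python
# The three points from the slide.
points = [(0, 0), (3, 10), (6, 2)]

def lsrl(pts):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(pts)
    x_bar = sum(x for x, _ in pts) / n
    y_bar = sum(y for _, y in pts) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pts)
    sxx = sum((x - x_bar) ** 2 for x, _ in pts)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

a, b = lsrl(points)
residuals = [y - (a + b * x) for x, y in points]   # observed - predicted
sse = sum(e ** 2 for e in residuals)               # sum of squared deviations

print(f"y-hat = {a:.3f} + {b:.3f}x")
print("deviations:", [round(e, 6) for e in residuals])
print("sum of deviations:", round(sum(residuals), 6))
print("sum of squares:", round(sse, 6))
```

The sum of squares agrees with the 54 shown on the slide, and the deviations cancel to zero.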
7 Interpretations
Slope: For each unit increase in x, there is an approximate increase/decrease of b in y.
Correlation coefficient: There is a (direction), (strength), (type) association between x and y.
8 The ages (in months) and heights (in inches) of seven children are given.
x: age (months); y: height (inches)
Find the LSRL.
Interpret the slope and correlation coefficient in the context of the problem.
9 Correlation coefficient: There is a strong, positive, linear association between the age and height of children.
Slope: For an increase in age of one month, there is an approximate increase of 0.34 inches in the heights of children.
10 Predict the height of a child who is 4.5 years old.
The ages (in months) and heights (in inches) of seven children are given (x = age, y = height).
Convert to months first: 4.5 years = 54 months.
Predict the height of someone who is 20 years old.
Graph the data, find the LSRL, and also examine the means of x & y.
11 Extrapolation
The LSRL should not be used to predict y for values of x outside the data set.
It is unknown whether the pattern observed in the scatterplot continues outside this range.
12 Will this point always be on the LSRL?
The ages (in months) and heights (in inches) of seven children are given (x = age, y = height).
Calculate x-bar & y-bar.
Plot the point (x-bar, y-bar) on the LSRL.
Graph the data, find the LSRL, and also examine the means of x & y.
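A short derivation suggests the answer is yes. It assumes the standard least-squares formula for the intercept, $a = \bar{y} - b\bar{x}$, which is not stated on the slide. Substituting $x = \bar{x}$ into the line gives:

```latex
\hat{y}(\bar{x}) = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}
```

So the point of means (x-bar, y-bar) lies on the LSRL for any data set, whatever the slope turns out to be.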
13 The correlation coefficient and the LSRL are both non-resistant measures.
17 Suppose we found the age and weight of a sample of 10 adults.
Create a scatterplot of the data below.
Is there any relationship between the age and weight of these adults?
Age   24   30   41   28   50   46   49   35   20   39
Wt   256  124  320  185  158  129  103  196  110  130
18 Suppose we found the height and weight of a sample of 10 adults.
Create a scatterplot of the data below.
Is there any relationship between the height and weight of these adults?
Is it positive or negative? Weak or strong?
Ht    74   65   77   72   68   60   62   73   61   64
Wt   256  124  320  185  158  129  103  196  110  130
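As a numeric companion to the two scatterplots, the sketch below computes the correlation coefficient r (introduced a few slides later) for age vs. weight and for height vs. weight, using the ten adults listed on the slides. The helper `pearson_r` is written out here for illustration and is not part of the original material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient for two paired lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data from the two slides (same ten adults, same weights).
age    = [24, 30, 41, 28, 50, 46, 49, 35, 20, 39]
height = [74, 65, 77, 72, 68, 60, 62, 73, 61, 64]
weight = [256, 124, 320, 185, 158, 129, 103, 196, 110, 130]

r_age_wt = pearson_r(age, weight)
r_ht_wt = pearson_r(height, weight)

print(f"age vs. weight:    r = {r_age_wt:.3f}")
print(f"height vs. weight: r = {r_ht_wt:.3f}")
```

The age and weight pairs give an r near zero, while height and weight give an r close to 1, which matches what the two scatterplots show: little pattern in the first, a strong positive one in the second.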
19 The closer the points in a scatterplot are to a straight line, the stronger the relationship.
The farther away from a straight line, the weaker the relationship.
20 Identify as having a positive association, a negative association, or no association.
Heights of mothers & heights of their adult daughters: +
Age of a car in years and its current value: -
Weight of a person and calories consumed: +
Height of a person and the person’s birth month: NO
Number of hours spent in safety training and the number of accidents that occur: -
21 Correlation Coefficient (r)
A quantitative assessment of the strength & direction of the linear relationship between bivariate, quantitative data.
Pearson’s sample correlation is used most.
parameter: ρ (rho)
statistic: r
22 Calculate r. Interpret r in context.
Speed Limit (mph)              55  50  45  40  30  20
Avg. # of accidents (weekly)   28  25  21  17  11   6
There is a strong, positive, linear relationship between speed limit and average number of accidents per week.
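For readers without a calculator at hand, here is a minimal sketch of the computation of r for the speed-limit data, written from the definition of Pearson's correlation. The variable names are chosen for this example only.

```python
from math import sqrt

speed_limit = [55, 50, 45, 40, 30, 20]   # mph
accidents   = [28, 25, 21, 17, 11, 6]    # average number per week

n = len(speed_limit)
x_bar = sum(speed_limit) / n
y_bar = sum(accidents) / n

# Sums of cross-products and squares of deviations from the means.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed_limit, accidents))
sxx = sum((x - x_bar) ** 2 for x in speed_limit)
syy = sum((y - y_bar) ** 2 for y in accidents)

r = sxy / sqrt(sxx * syy)
print(f"r = {r:.4f}")
```

The result is positive and very close to 1, consistent with the "strong, positive, linear" interpretation on the slide.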
23 Properties of r (correlation coefficient)
Legitimate values of r are in [-1, 1].
Figure labels: Strong correlation, Moderate correlation, Weak correlation, No correlation.
24 The value of r does not depend on the unit of measurement for either variable.
Find r for x (in mm) and y. Change x to cm & find r again.
The correlations are the same.
25 The value of r does not depend on which of the two variables is labeled x.
Switch x & y & find r.
The correlations are the same.
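Both properties (units and labeling) can be checked numerically. The sketch below reuses the speed-limit data from the earlier slide as a stand-in, since the mm/cm data for these slides is not reproduced here. The conversion from mph to km/h is only an example of a unit change.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient for two paired lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

speed_mph = [55, 50, 45, 40, 30, 20]
accidents = [28, 25, 21, 17, 11, 6]

# Change of units: mph -> km/h (multiply by a positive constant).
speed_kmh = [1.609344 * s for s in speed_mph]

r_original = pearson_r(speed_mph, accidents)
r_rescaled = pearson_r(speed_kmh, accidents)
r_swapped  = pearson_r(accidents, speed_mph)   # x and y trade places

print(r_original, r_rescaled, r_swapped)
```

All three values agree up to floating-point rounding, because r is built from standardized deviations and treats x and y symmetrically.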
26 The value of r is non-resistant.
Find r.
Outliers affect the correlation coefficient.
27 The value of r is a measure of the extent to which x & y are linearly related.
A value of r close to zero does not rule out a strong (nonlinear) relationship between x and y.
r = 0, but has a definite relationship!
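A small made-up data set illustrates the point. In the sketch below y is determined exactly by x through y = x^2 (hypothetical data, not from the slides), yet the correlation coefficient comes out as zero because the relationship is not linear.

```python
from math import sqrt

# Hypothetical data: a perfect curved relationship, symmetric about x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r = sxy / sqrt(sxx * syy)
print("r =", r)
```

The positive and negative cross-products cancel, so r carries no trace of the curve. This is why a scatterplot should always accompany r.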
28 Minister data: (Data on Elmo)
r = 0.9999
So does an increase in ministers cause an increase in consumption of rum?
31 Residuals (error)
The vertical deviation between the observations & the LSRL.
The sum of the residuals is always zero.
error = observed - expected (residual = y - y-hat)
32 Residual plot
A scatterplot of the (x, residual) pairs.
Residuals can be graphed against other statistics besides x.
The purpose is to tell whether a linear association exists between the x & y variables.
If no pattern exists between the points in the residual plot, then the association is linear.
34 One measure of the success of knee surgery is post-surgical range of motion for the knee joint following a knee dislocation. Is there a linear relationship between age & range of motion?
Age   Range of Motion
35    154
24    142
40    137
31    133
28    122
Sketch a residual plot (residuals against x).
Since there is no pattern in the residual plot, there is a linear relationship between age and range of motion.
35 Plot the residuals against the y-hats. How does this residual plot compare to the previous one?
36 Residual plots show the same pattern whether plotted against x or y-hat. (y-hat is a linear function of x, so only the horizontal scale changes; the plot is mirrored left to right when the slope is negative.)
37 Coefficient of determination (r^2)
Gives the proportion of variation in y that can be attributed to an approximate linear relationship between x & y.
Remains the same no matter which variable is labeled x.
38 Sum of the squared residuals (errors) using the mean of y.
Let’s examine r^2.
Suppose you were going to predict a future y but you didn’t know the x-value. Your best guess would be the overall mean of the existing y’s.
Now, find the sum of the squared residuals (errors). L3 = (L2 - y-bar)^2. Do 1VARSTAT on L3 to find the sum.
SSEy =
39 Sum of the squared residuals (errors) using the LSRL.
Now suppose you were going to predict a future y but you DO know the x-value. Your best guess would be the point on the LSRL for that x-value (y-hat).
Find the LSRL & store in Y1. In L3 = Y1(L1) to calculate the predicted y for each x-value.
Now, find the sum of the squared residuals (errors). In L4 = (L2-L3)^2. Do 1VARSTAT on L4 to find the sum.
SSEy-hat =
40 SSEy =
SSEy-hat =
By what percent did the sum of the squared error go down when you went from just an “overall mean” model to the “regression on x” model?
This is r^2: the proportion of the variation in the y-values that is explained by the x-values.
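The "percent reduction in squared error" idea can be sketched in a few lines. Because the full knee-surgery data is not reproduced in these notes, the example below uses the three points (0,0), (3,10), (6,2) from the earlier LSRL slide. It compares the squared error around the mean of y with the squared error around the LSRL, then checks that the proportional drop equals r^2.

```python
from math import sqrt

# Stand-in data: the three points from the earlier LSRL slide.
xs = [0, 3, 6]
ys = [0, 10, 2]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx
a = y_bar - b * x_bar

# "Overall mean" model: predict y-bar for every x.
sse_mean = sum((y - y_bar) ** 2 for y in ys)
# "Regression on x" model: predict y-hat = a + b*x.
sse_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

reduction = (sse_mean - sse_line) / sse_mean   # proportional drop in squared error
r = sxy / sqrt(sxx * syy)

print(f"SSE about the mean: {sse_mean:.1f}")
print(f"SSE about the LSRL: {sse_line:.1f}")
print(f"proportional reduction: {reduction:.4f}")
print(f"r^2: {r ** 2:.4f}")
```

For these three points the line barely improves on the mean, so r^2 is small. With data that follow a line more closely, the drop, and therefore r^2, would be larger.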
41 How well does age predict the range of motion after knee surgery?
Approximately 30.6% of the variation in range of motion after knee surgery can be explained by the linear regression of range of motion on age.
42 Interpretation of r^2
Approximately r^2 (written as a percent) of the variation in y can be explained by the LSRL of x & y.
43 Computer-generated regression analysis of knee surgery data:
Predictor   Coef   Stdev   T   P
Constant
Age
s =         R-sq = 30.6%   R-sq(adj) = 23.7%
What is the equation of the LSRL? Find the slope & y-intercept.
What are the correlation coefficient and the coefficient of determination?
Be sure to convert r^2 to a decimal before taking the square root!
NEVER use adjusted r^2!
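The warning about converting to a decimal is easy to demonstrate. This sketch takes the R-sq value from the printout and shows both the correct computation and what goes wrong when the conversion is skipped. Only the magnitude of r is computed; the sign of r matches the sign of the slope in the printout.

```python
import math

r_sq_percent = 30.6               # R-sq as printed (a percent)
r_sq = r_sq_percent / 100         # convert to a decimal first

r_magnitude = math.sqrt(r_sq)     # |r|
wrong = math.sqrt(r_sq_percent)   # skipping the conversion

print(f"|r| = {r_magnitude:.3f}")
print(f"without converting: {wrong:.3f}  (impossible, since |r| cannot exceed 1)")
```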
44 Outlier: In a regression setting, an outlier is a data point with a large residual.
45 Influential point: A point that influences where the LSRL is located. If removed, it will significantly change the slope of the LSRL.
46 One factor in the development of tennis elbow is the impact-induced vibration of the racket and arm at ball contact.
Data columns: Racket Resonance (Hz) and Acceleration (m/sec/sec).
Sketch a scatterplot of these data.
Calculate the LSRL & correlation coefficient.
Does there appear to be an influential point? If so, remove it and then calculate the new LSRL & correlation coefficient.
47 Which of these measures are resistant?
LSRL
Correlation coefficient
Coefficient of determination
NONE: all are affected by outliers.
49 How much would an adult female weigh if she were 5 feet tall?
She could weigh varying amounts. In other words, there is a distribution of weights for adult females who are 5 feet tall.
This distribution is normally distributed (we hope).
What would you expect for other heights?
Where would you expect the TRUE LSRL to be?
What about the standard deviations of all these normal distributions?
We want the standard deviations of all these normal distributions to be the same.
50 Regression Model
The mean response $\mu_y$ has a straight-line relationship with x: $\mu_y = \alpha + \beta x$
where slope β and intercept α are unknown parameters.
For any fixed value of x, the response y varies according to a normal distribution. Repeated responses of y are independent of each other.
The standard deviation of y ($\sigma_y$) is the same for all values of x. ($\sigma_y$ is also an unknown parameter.)
51 We use $\hat{y} = a + bx$ to estimate $\mu_y = \alpha + \beta x$.
The slope b of the LSRL is an unbiased estimator of the true slope β.
The intercept a of the LSRL is an unbiased estimator of the true intercept α.
The standard error s is an unbiased estimator of the true standard deviation of y ($\sigma_y$): $s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$
Note: df = n - 2
53 Suppose you took many samples of the same size from this population & calculated the LSRL for each.
Using the slope from each of these LSRLs, we can create a sampling distribution for the slope of the true LSRL.
What shape will this distribution have?
What does the mean of the sampling distribution equal? $\mu_b = \beta$
What is the standard deviation of the sampling distribution?
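The thought experiment on this slide can be simulated. The sketch below assumes a hypothetical population in which the regression model holds with α = 1, β = 2, and σ = 1 (made-up values, not from the slides). It draws 2000 samples at the same ten x-values, fits the LSRL to each, and summarizes the resulting slopes. The theoretical standard deviation used for comparison, σ divided by the square root of the sum of squared x-deviations, is the standard result for this model and is not stated on the slide.

```python
import random
import statistics

rng = random.Random(0)                         # fixed seed so the run is repeatable
TRUE_A, TRUE_B, SIGMA = 1.0, 2.0, 1.0          # hypothetical alpha, beta, sigma
xs = list(range(10))                           # same x-values in every sample

def lsrl_slope(xs, ys):
    """Slope b of the least-squares line through the (x, y) pairs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

slopes = []
for _ in range(2000):
    # Each response varies normally about the true line with the same sigma.
    ys = [TRUE_A + TRUE_B * x + rng.gauss(0, SIGMA) for x in xs]
    slopes.append(lsrl_slope(xs, ys))

mean_b = statistics.mean(slopes)
sd_b = statistics.stdev(slopes)

x_bar = sum(xs) / len(xs)
sxx = sum((x - x_bar) ** 2 for x in xs)
theory_sd = SIGMA / sxx ** 0.5

print(f"mean of sample slopes: {mean_b:.3f} (true slope {TRUE_B})")
print(f"sd of sample slopes:   {sd_b:.3f} (theory {theory_sd:.3f})")
```

The simulated slopes center on the true slope, which is what it means for b to be an unbiased estimator of β. A histogram of `slopes` would look approximately normal.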
54 Assumptions for inference on slope
The observations are independent. Check that you have an SRS.
The true relationship is linear. Check the scatter plot & residual plot.
The standard deviation of the response is constant.
The responses vary normally about the true regression line. Check a histogram or boxplot of residuals.
55 Formulas:
Confidence interval: $b \pm t^* s_b$
Hypothesis test: $t = \frac{b - \beta}{s_b}$, where β is the slope stated in H0
df = n - 2, because there are two unknowns, α & β.
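As a worked illustration of these formulas, the sketch below applies them to the speed-limit data from the earlier correlation slide (n = 6, so df = 4). It is only a demonstration of the arithmetic: the inference conditions have not been checked for that data. The standard error of the slope is computed as s divided by the square root of the sum of squared x-deviations, and the critical value t* = 2.776 is the 95% table value for df = 4. Neither number appears in the slides.

```python
from math import sqrt

# Speed-limit data from the earlier slide, reused here as an example.
xs = [55, 50, 45, 40, 30, 20]   # speed limit (mph)
ys = [28, 25, 21, 17, 11, 6]    # average weekly accidents

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

b = sxy / sxx                 # slope of the LSRL
a = y_bar - b * x_bar         # intercept of the LSRL

df = n - 2                    # two parameters (alpha and beta) are estimated
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = sqrt(sse / df)            # standard error about the line
se_b = s / sqrt(sxx)          # standard error of the slope

t_stat = (b - 0) / se_b       # test statistic for H0: beta = 0

T_STAR = 2.776                # t table, 95% confidence, df = 4
ci_low = b - T_STAR * se_b
ci_high = b + T_STAR * se_b

print(f"b = {b:.4f}, s = {s:.4f}, SE_b = {se_b:.4f}")
print(f"t = {t_stat:.2f} with df = {df}")
print(f"95% CI for the slope: ({ci_low:.3f}, {ci_high:.3f})")
```

The p-value would then come from a t distribution with 4 degrees of freedom, using a table or calculator.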
56 Hypotheses
H0: β = 0
This implies that there is no relationship between x & y, or that x should not be used to predict y.
Ha: β > 0
Ha: β < 0
Ha: β ≠ 0
Be sure to define β!
What would the slope equal if there were a perfect relationship between x & y? 1 (when both variables are standardized, the slope equals r)
57 Example: It is difficult to accurately determine a person’s body fat percentage without immersing him or her in water. Researchers hoping to find ways to make a good estimate immersed 20 male subjects, and then measured their weights.
a) Find the LSRL, correlation coefficient, and coefficient of determination.
Body fat = -27.376 + 0.250 weight
r = 0.697
r^2 = 0.485
58 b) Explain the meaning of slope in the context of the problem.
There is approximately a 0.25 percentage point increase in body fat for every one-pound increase in weight.
c) Explain the meaning of the coefficient of determination in context.
Approximately 48.5% of the variation in body fat can be explained by the regression of body fat on weight.
59 d) Estimate α, β, and σ.
a = -27.376
b = 0.25
s = 7.049
e) Create a scatter plot and residual plot for the data.
Plots: body fat vs. weight; residuals vs. weight.
60 f) Is there sufficient evidence that weight can be used to predict body fat?
Assumptions:
Have an SRS of male subjects.
Since the residual plot is randomly scattered, weight & body fat are linear.
Since the points are evenly spaced across the LSRL on the scatterplot, $\sigma_y$ is approximately equal for all values of weight.
Since the boxplot of residuals is approximately symmetrical, the responses are approximately normally distributed.
H0: β = 0
Ha: β ≠ 0
where β is the true slope of the LSRL of weight & body fat.
Since the p-value < α, I reject H0. There is sufficient evidence to suggest that weight can be used to predict body fat.
61 g) Give a 95% confidence interval for the true slope of the LSRL.
Assumptions:
Have an SRS of male subjects.
Since the residual plot is randomly scattered, weight & body fat are linear.
Since the points are evenly spaced across the LSRL on the scatterplot, $\sigma_y$ is approximately equal for all values of weight.
Since the boxplot of residuals is approximately symmetrical, the responses are approximately normally distributed.
We are 95% confident that the true slope of the LSRL of weight & body fat is between 0.12 and 0.38.
Be sure to show all graphs!
62 h) Here is the computer-generated result from the data:
Sample size: 20
R-square = 43.83%
s =
Parameter   Estimate   Std. Err.
Intercept
Weight
What does “s” represent (in context)?
df?
Correlation coefficient? Be sure to write as decimal first!
What does this number represent?
What do these numbers represent?