
1 Stat 1510: Statistical Thinking and Concepts REGRESSION

2 Agenda
• Regression Lines
• Least-Squares Regression Line
• Facts about Regression
• Residuals
• Influential Observations
• Cautions about Correlation and Regression
• Correlation Does Not Imply Causation

3 Objectives
• Quantify the linear relationship between an explanatory variable (x) and a response variable (y).
• Use a regression line to predict values of y for given values of x.
• Calculate and interpret residuals.
• Describe cautions about correlation and regression.

4 Regression Line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We can use a regression line to predict the value of y for a given value of x.
• Example: Predict the number of new adult birds that join the colony based on the percent of adult birds that return to the colony from the previous year.
• If 60% of adults return, how many new birds are predicted?

5 Regression Line
Regression equation: ŷ = a + bx
• x is the value of the explanatory variable.
• ŷ ("y-hat") is the predicted value of the response variable for a given value of x.
• b is the slope, the amount by which ŷ changes for each one-unit increase in x.
• a is the intercept, the predicted value of y when x = 0. If the range of x values does not include zero, the intercept has no practical interpretation.
• We cannot use the fitted model to predict for any value of x outside the range of x values used to fit the model.

6 Least-Squares Regression Line
• Since we are trying to predict y, we want the regression line to be as close as possible to the data points in the vertical (y) direction.
• Least-Squares Regression Line (LSRL): the line that minimizes the sum of the squares of the vertical distances of the data points from the line.

7 Equation of the LSRL
Regression equation: ŷ = a + bx, with slope
b = r (s_y / s_x)
and intercept
a = ȳ − b x̄,
where s_x and s_y are the standard deviations of the two variables, and r is their correlation.
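
Later slides use R output, so here is a minimal R sketch of these formulas; the function name lsrl is mine, not from the slides:

lsrl <- function(x, y) {
  b <- cor(x, y) * sd(y) / sd(x)   # slope: b = r * s_y / s_x
  a <- mean(y) - b * mean(x)       # intercept: a = ybar - b * xbar
  c(intercept = a, slope = b)
}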

8 Prediction via Regression Line
• For the returning-birds example, the LSRL is ŷ = 31.9343 − 0.3040x, where ŷ is the predicted number of new birds for colonies with x percent of adults returning.
• Suppose we know that an individual colony has 60% returning. What would we predict the number of new birds to be for just that colony?
• For colonies with 60% returning, we predict the average number of new birds to be 31.9343 − (0.3040)(60) = 13.69 birds.
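
A minimal R sketch of this prediction, with the coefficients copied from the slide:

a <- 31.9343    # intercept of the birds LSRL
b <- -0.3040    # slope: new birds drop as the percent returning rises
a + b * 60      # 13.6943, about 13.69 new birds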

9 Facts about Least-Squares Regression
• The distinction between explanatory and response variables is essential.
• The slope b and correlation r always have the same sign.
• The LSRL always passes through (x̄, ȳ).
• The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
• r² is called the coefficient of determination.

10 Regression Calculation: Case Study
Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe

11 Regression Calculation: Case Study

Country          Per Capita GDP (x)   Life Expectancy (y)
Austria          21.4                 77.48
Belgium          23.2                 77.53
Finland          20.0                 77.32
France           22.7                 78.63
Germany          20.8                 77.17
Ireland          18.6                 76.39
Italy            21.5                 78.51
Netherlands      22.0                 78.15
Switzerland      23.8                 78.99
United Kingdom   21.2                 77.37

12 Regression Calculation: Case Study
Linear regression equation: ŷ = 68.716 + 0.420x

13 Simplified Formula
To avoid rounding error, compute the slope b directly from the raw sums (the standard computational form of the least-squares slope):

b = [ Σ xᵢyᵢ − (Σ xᵢ)(Σ yᵢ)/n ] / [ Σ xᵢ² − (Σ xᵢ)²/n ]

and then the intercept a = ȳ − b x̄.
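
A sketch checking the formula against the case-study data (values typed in from slide 11):

gdp <- c(21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2)
le  <- c(77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37)
n <- length(gdp)
b <- (sum(gdp * le) - sum(gdp) * sum(le) / n) /
     (sum(gdp^2) - sum(gdp)^2 / n)   # simplified formula for the slope
a <- mean(le) - b * mean(gdp)        # intercept
round(c(a = a, b = b), 3)            # about 68.716 and 0.420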

14 R Output (using the lm function)

lm(formula = LE ~ GDP)

Residuals:
     Min       1Q   Median       3Q      Max
-0.92964 -0.24309  0.02843  0.25987  0.76440

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  68.7151     2.3238  29.571 1.85e-09 ***
GDP           0.4200     0.1077   3.899  0.00455 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4951 on 8 degrees of freedom
Multiple R-squared: 0.6552, Adjusted R-squared: 0.6121
F-statistic: 15.2 on 1 and 8 DF, p-value: 0.004554

The fitted model is LE = 68.7151 + 0.4200 GDP
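
This output comes from a call like the following (a sketch assuming the gdp and le vectors defined after slide 13; LE and GDP are the names in the slide's output):

GDP <- gdp; LE <- le                   # rename to match the slide's output
model <- lm(LE ~ GDP)                  # fit the least-squares line
summary(model)                         # coefficients, R-squared, F-statistic
predict(model, data.frame(GDP = 22))   # predicted life expectancy at GDP = 22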

15 Coefficient of Determination (R²)
• Measures the usefulness of the regression for prediction.
• R² (or r², the square of the correlation) measures the fraction of the variation in the values of the response variable (y) that is explained by the regression line.
• r = 1: R² = 1; the regression line explains all (100%) of the variation in y.
• r = 0.7: R² = 0.49; the regression line explains about half (49%) of the variation in y.

16 Residuals
• A residual is the difference between an observed value of the response variable and the value predicted by the regression line: residual = y − ŷ.
• A residual plot is a scatterplot of the regression residuals against the explanatory variable.
• It is used to assess the fit of a regression line.
• Look for a "random" scatter around zero.
(Figure: Gesell Adaptive Score and Age at First Word)
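
A minimal residual-plot sketch, reusing the model fitted after slide 14:

plot(GDP, resid(model),
     xlab = "Per capita GDP", ylab = "Residual")   # residuals vs. explanatory variable
abline(h = 0, lty = 2)    # look for random scatter around this reference line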

17 Outliers and Influential Points
• An outlier is an observation that lies far away from the other observations.
• Outliers in the y direction have large residuals.
• Outliers in the x direction are often influential for the least-squares regression line, meaning that removing such points would markedly change the equation of the line.

18 Outliers and Influential Points
(Figure: Gesell Adaptive Score and Age at First Word)
• From all of the data: r² = 41%.
• After removing child 18: r² = 11%.
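
The slides do not reproduce the Gesell data, so here is a sketch of the same idea on simulated data with a single x-outlier; the numbers are illustrative only:

set.seed(1)
x <- c(rnorm(20, mean = 10), 25)   # last point lies far out in the x direction
y <- c(2 + 0.5 * rnorm(20), 8)     # and pulls the fitted line toward itself
coef(lm(y ~ x))                    # slope with the influential point included
coef(lm(y[-21] ~ x[-21]))          # slope changes markedly once it is removed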

19 Cautions about Correlation and Regression
• Both describe linear relationships.
• Both are affected by outliers.
• Always plot the data before interpreting.
• Beware of extrapolation: predicting outside the range of x.
• Beware of lurking variables: variables that have an important effect on the relationship among the variables in a study but are not included in the study.
• Correlation does not imply causation!

20 Caution: Beware of Extrapolation
• Sarah's height was plotted against her age.
• Can you predict her height at age 42 months?
• Can you predict her height at age 30 years (360 months)?

21 Caution: Beware of Extrapolation
• Regression line: ŷ = 71.95 + 0.383x (height in centimeters, age in months).
• Height at age 42 months? ŷ = 88.0 cm.
• Height at age 30 years? ŷ = 209.8 cm.
• She is predicted to be 6'10.5" tall at age 30!
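
A sketch of the calculation with the slide's fitted line (heights in centimeters):

height_hat <- function(age_months) 71.95 + 0.383 * age_months
height_hat(42)    # 88.0 cm: within the observed age range, reasonable
height_hat(360)   # 209.8 cm (about 6'10.5"): far outside the data, absurd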

22 Caution: Beware of Lurking Variables
Example: Meditation and Aging (Noetic Sciences Review, Summer 1993, p. 28)
• Explanatory variable: observed meditation practice (yes/no).
• Response variable: level of an age-related enzyme.
• A general concern for one's well-being may also be affecting the response (and the decision to try meditation).

23 Correlation Does Not Imply Causation
• Even very strong correlations may not correspond to a real causal relationship (changes in x actually causing changes in y).
• Correlation may be explained by a lurking variable.
Example: Social Relationships and Health (House, J., Landis, K., and Umberson, D., "Social Relationships and Health," Science, Vol. 241 (1988), pp. 540-545)
• Does lack of social relationships cause people to become ill? (There was a strong correlation.)
• Or are unhealthy people less likely to establish and maintain social relationships? (A reversed relationship.)
• Or is there some other factor that predisposes people both to have lower social activity and to become ill?

24 Evidence of Causation
• A properly conducted experiment may establish causation.
• Other considerations when we cannot do an experiment:
  • The association is strong.
  • The association is consistent: the connection happens in repeated trials and under varying conditions.
  • Higher doses are associated with stronger responses.
  • The alleged cause precedes the effect in time.
  • The alleged cause is plausible (there is a reasonable explanation).

25 Other Regression Models
• Multiple Linear Regression: more than one explanatory (independent) variable.
• Logistic Regression: the response of interest (y) belongs to one of two categories (good/bad, 0/1, etc.).
• Poisson Regression: the response of interest is a count (0, 1, 2, 3, etc.).
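
Minimal R sketches of these models on simulated data for illustration; lm and glm are the standard fitting functions:

set.seed(2)
x1 <- rnorm(100); x2 <- rnorm(100)
y_lin <- 1 + 2 * x1 - x2 + rnorm(100)
fit_mlr <- lm(y_lin ~ x1 + x2)                   # multiple linear regression
y_bin <- rbinom(100, 1, plogis(x1))
fit_log <- glm(y_bin ~ x1, family = binomial)    # logistic regression (0/1 response)
y_cnt <- rpois(100, exp(0.5 + 0.3 * x1))
fit_pois <- glm(y_cnt ~ x1, family = poisson)    # Poisson regression (count response)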

