1 Chapter 12: Analyzing Association Between Quantitative Variables: Regression Analysis Section 12.1: How Can We Model How Two Variables Are Related?

1 1 Chapter 12: Analyzing Association Between Quantitative Variables: Regression Analysis Section 12.1: How Can We Model How Two Variables Are Related?

2 2 Learning Objectives 1. Regression Analysis 2. The Scatterplot 3. The Regression Line Equation 4. Outliers 5. Influential Points 6. Residuals are Prediction Errors 7. Regression Model: A Line Describes How the Mean of y Depends on x 8. The Population Regression Equation 9. Variability about the Line 10. A Statistical Model

3 3 Learning Objective 1: Regression Analysis The first step of a regression analysis is to identify the response and explanatory variables We use y to denote the response variable We use x to denote the explanatory variable

4 4 Learning Objective 2: The Scatterplot The first step in answering the question of association is to look at the data A scatterplot is a graphical display of the relationship between the response variable (y-axis) and the explanatory variable (x-axis)

5 5 Learning Objective 2: Example: What Do We Learn from a Scatterplot in the Strength Study? An experiment was designed to measure the strength of female athletes The goal of the experiment was to find the maximum number of pounds that each individual athlete could bench press

6 6 Learning Objective 2: Example: What Do We Learn from a Scatterplot in the Strength Study? 57 high school female athletes participated in the study. The data consisted of the following variables: x: the number of 60-pound bench presses an athlete could do; y: maximum bench press

7 7 Learning Objective 2: Example: What Do We Learn from a Scatterplot in the Strength Study? For the 57 girls in this study, these variables are summarized by: x: mean = 11.0, st. dev. = 7.1; y: mean = 79.9 lbs, st. dev. = 13.3 lbs

8 8

9 9 Learning Objective 3: The Regression Line Equation When the scatterplot shows a linear trend, a straight line can be fitted through the data points to describe that trend. The regression line is: $\hat{y} = a + bx$, where $\hat{y}$ is the predicted value of the response variable y, $a$ is the y-intercept, and $b$ is the slope

10 10 Learning Objective 3: Example: What Do We Learn from a Scatterplot in the Strength Study?

11 11 The MINITAB output shows the following regression equation: BP = 63.5 + 1.49 (BP_60) The y-intercept is 63.5 and the slope is 1.49 The slope of 1.49 tells us that predicted maximum bench press increases by about 1.5 pounds for every additional 60-pound bench press an athlete can do Learning Objective 3: Example: What Do We Learn from a Scatterplot in the Strength Study?
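
As a small illustration of how the fitted equation is used for prediction, the Python sketch below plugs a value of x into BP = 63.5 + 1.49 (BP_60). The function name is an assumption of this sketch; the intercept and slope are the values quoted from the MINITAB output.

```python
def predict_max_bench_press(bp_60: float) -> float:
    """Predicted maximum bench press (lbs) from the number of 60-pound presses."""
    intercept = 63.5  # y-intercept reported in the MINITAB output
    slope = 1.49      # slope reported in the MINITAB output
    return intercept + slope * bp_60


# An athlete who can do 11 sixty-pound bench presses (the sample mean of x)
print(round(predict_max_bench_press(11), 2))
```

The prediction at x = 11 comes out close to the sample mean of y (79.9 lbs), which is what one would expect, since a least squares line passes through the point of means.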

12 12 Learning Objective 4: Outliers Check for outliers by plotting the data The regression line can be pulled toward an outlier and away from the general trend of points

13 13 Learning Objective 5: Influential Points An observation can be influential in affecting the regression line when two things happen: Its x value is low or high compared to the rest of the data, and it does not fall in the straight-line pattern that the rest of the data have

14 14 Learning Objective 6: Residuals are Prediction Errors The regression equation is often called a prediction equation The difference between an observed outcome and its predicted value is the prediction error, called a residual

15 15 Learning Objective 6: Residuals Each observation has a residual A residual is the vertical distance between the data point and the regression line

16 16 Learning Objective 6: Residuals We can summarize how near the regression line the data points fall by the sum of squared residuals: $\sum (\text{residual})^2 = \sum (y - \hat{y})^2$. The regression line has the smallest sum of squared residuals and is called the least squares line
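
The least squares idea can be made concrete with a short Python sketch. The data values are hypothetical, invented purely for illustration (they are not the strength study data), and the function names are assumptions of this sketch. It fits the line with the usual closed-form least squares formulas and then checks that changing the slope or intercept increases the sum of squared residuals.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x with the smallest sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only
xs = [2, 5, 8, 11, 14, 20]
ys = [66, 72, 75, 80, 83, 95]

a, b = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)
steeper = sum_squared_residuals(xs, ys, a, b + 0.1)   # same intercept, different slope
shifted = sum_squared_residuals(xs, ys, a + 1.0, b)   # same slope, different intercept
print(best, steeper, shifted)
```

Both perturbed lines give a larger sum of squared residuals than the fitted line, which is the defining property of the least squares line.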

17 17 Learning Objective 7: Regression Model: A Line Describes How the Mean of y Depends on x At a given value of x, the equation $\hat{y} = a + bx$ predicts a single value of the response variable. But we should not expect all subjects at that value of x to have the same value of y: variability occurs in the y values

18 18 Learning Objective 7: The Regression Line The regression line connects the estimated means of y at the various x values. In summary, $\hat{y} = a + bx$ describes the relationship between x and the estimated means of y at the various values of x

19 19 Learning Objective 8: The Population Regression Equation The population regression equation describes the relationship in the population between x and the means of y. The equation is: $\mu_y = \alpha + \beta x$

20 20 Learning Objective 8: The Population Regression Equation In the population regression equation, α is a population y-intercept and β is a population slope These are parameters In practice we estimate the population regression equation using the prediction equation for the sample data

21 21 Learning Objective 8: The Population Regression Equation The population regression equation merely approximates the actual relationship between x and the population means of y It is a model A model is a simple approximation for how variables relate in the population

22 22 Learning Objective 8: The Regression Model

23 23 Learning Objective 8: The Regression Model If the true relationship is far from a straight line, this regression model may be a poor one

24 24 Learning Objective 9: Variability about the Line At each fixed value of x, variability occurs in the y values around their mean, $\mu_y$. The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. An additional parameter σ describes the standard deviation of each conditional distribution

25 25 Learning Objective 10: A Statistical Model A statistical model never holds exactly in practice. It is merely an approximation for reality Even though it does not describe reality exactly, a model is useful if the true relationship is close to what the model predicts

26 26 Chapter 12: Analyzing Association Between Quantitative Variables: Regression Analysis Section 12.2: How Can We Describe Strength of Association?

27 27 Learning Objectives 1. Correlation and Slope 2. Example: What’s the Correlation for Predicting Strength? 3. The Squared Correlation

28 28 Learning Objective 1: Correlation The correlation, denoted by r, describes linear association The correlation ‘r’ has the same sign as the slope ‘b’ The correlation ‘r’ always falls between -1 and +1 The larger the absolute value of r, the stronger the linear association

29 29 Learning Objective 1: Correlation and Slope We can’t use the slope to describe the strength of the association between two variables because the slope’s numerical value depends on the units of measurement The correlation is a standardized version of the slope The correlation does not depend on units of measurement.

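30 30 Learning Objective 1: Correlation and Slope The correlation and the slope are related in the following way: $r = b \left( \frac{s_x}{s_y} \right)$, where $s_x$ and $s_y$ are the sample standard deviations of x and y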

31 31 Learning Objective 2: Example: What’s the Correlation for Predicting Strength? For the female athlete strength study: x: number of 60-pound bench presses; y: maximum bench press. x: mean = 11.0, st. dev. = 7.1; y: mean = 79.9 lbs., st. dev. = 13.3 lbs. Regression equation: $\hat{y} = 63.5 + 1.49x$

32 32 Learning Objective 2: Example: What’s the Correlation for Predicting Strength? $r = b(s_x / s_y) = 1.49(7.1/13.3) = 0.80$. The variables have a strong, positive association
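
The link between the slope and the correlation, r = b(s_x / s_y), can be checked against the summary statistics quoted for the strength study. This is only a sketch; the function name is an assumption, and the inputs are the rounded values from the slides, so the result is approximate.

```python
def correlation_from_slope(b: float, s_x: float, s_y: float) -> float:
    """Correlation as the slope rescaled by the ratio of standard deviations."""
    return b * (s_x / s_y)


# Slope and standard deviations as reported for the strength study
r = correlation_from_slope(b=1.49, s_x=7.1, s_y=13.3)
print(round(r, 2))
```

Because the slope is multiplied by s_x / s_y, changing the units of x or y changes b but leaves r unchanged.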

33 33 Learning Objective 3: The Squared Correlation Another way to describe the strength of association refers to how close predictions for y tend to be to observed y values The variables are strongly associated if you can predict y much better by substituting x values into the prediction equation than by merely using the sample mean and ignoring x

34 34 Learning Objective 3: The Squared Correlation Consider the prediction error: the difference between the observed and predicted values of y. Using the regression line to make a prediction, each error is $y - \hat{y}$. Using only the sample mean, $\bar{y}$, to make a prediction, each error is $y - \bar{y}$

35 35 Learning Objective 3: The Squared Correlation When we predict y using $\bar{y}$ (that is, ignoring x), the error summary equals $\sum (y - \bar{y})^2$. This is called the total sum of squares

36 36 Learning Objective 3: The Squared Correlation When we predict y using x with the regression equation, the error summary is $\sum (y - \hat{y})^2$. This is called the residual sum of squares

37 37 Learning Objective 3: The Squared Correlation When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using $\bar{y}$. We measure the proportional reduction in error and call it $r^2$: $r^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$

38 38 Learning Objective 3: The Squared Correlation We use the notation $r^2$ for this measure because it equals the square of the correlation r
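
The claim that the proportional reduction in error equals the square of the correlation can be checked numerically. The sketch below uses hypothetical data invented for illustration; the function names are assumptions of this sketch. It computes the total and residual sums of squares, forms the proportional reduction in error, and compares it with r computed directly.

```python
import math


def proportional_reduction_in_error(xs, ys):
    """(total SS - residual SS) / total SS for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    total_ss = sum((y - y_bar) ** 2 for y in ys)                       # errors using y-bar
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # errors using the line
    return (total_ss - residual_ss) / total_ss


def correlation(xs, ys):
    """Sample correlation r computed directly from the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, invented for illustration only
xs = [2, 5, 8, 11, 14, 20]
ys = [66, 72, 75, 80, 83, 95]

pre = proportional_reduction_in_error(xs, ys)
r = correlation(xs, ys)
print(pre, r ** 2)
```

The two printed values agree up to floating-point rounding.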

39 39 Learning Objective 3: The Squared Correlation Example: What Does $r^2$ Tell Us in the Strength Study? For the female athlete strength study: x: number of 60-pound bench presses; y: maximum bench press. The correlation value was found to be r = 0.80. We can calculate $r^2$ from r: $(0.80)^2 = 0.64$. For predicting maximum bench press, the regression equation has 64% less error than $\bar{y}$ has

40 40 Learning Objective 3: The Squared Correlation Properties: $r^2$ falls between 0 and 1. $r^2 = 1$ when $\sum (y - \hat{y})^2 = 0$. This happens only when all the data points fall exactly on the regression line. $r^2 = 0$ when $\sum (y - \hat{y})^2 = \sum (y - \bar{y})^2$. This happens when the slope b = 0, in which case each $\hat{y} = \bar{y}$. The closer $r^2$ is to 1, the stronger the linear association: the more effective the regression equation is compared to $\bar{y}$ in predicting y

41 41 Learning Objective 3: Correlation r and Its Square $r^2$ Both r and $r^2$ describe the strength of association. r falls between -1 and +1. It represents the slope of the regression line when x and y have been standardized. $r^2$ falls between 0 and 1. It summarizes the reduction in sum of squared errors in predicting y using the regression line instead of using $\bar{y}$

42 42 Chapter 12: Analyzing Association Between Quantitative Variables: Regression Analysis Section 12.3: How Can We Make Inferences About the Association?

43 43 Learning Objectives 1. Descriptive and Inferential Parts of Regression 2. Assumptions for Regression Analysis 3. Testing Independence between Quantitative Variables 4. A Confidence Interval for β

44 44 Learning Objective 1: Descriptive and Inferential Parts of Regression The sample regression equation, r, and $r^2$ are descriptive parts of a regression analysis. The inferential parts of regression use the tools of confidence intervals and significance tests to provide inference about the regression equation, the correlation and r-squared in the population of interest

45 45 Learning Objective 2: Assumptions for Regression Analysis Basic assumption for using regression line for description: The population means of y at different values of x have a straight-line relationship with x, that is: $\mu_y = \alpha + \beta x$. This assumption states that a straight-line regression model is valid. This can be checked with a scatterplot.

46 46 Learning Objective 2: Assumptions for Regression Analysis Extra assumptions for using regression to make statistical inference: The data were gathered using randomization The population values of y at each value of x follow a normal distribution, with the same standard deviation at each x value

47 47 Learning Objective 2: Assumptions for Regression Analysis Models, such as the regression model, merely approximate the true relationship between the variables A relationship will not be exactly linear, with exactly normal distributions for y at each x and with exactly the same standard deviation of y values at each x value

48 48 Learning Objective 3: Testing Independence between Quantitative Variables Suppose that the slope β of the regression line equals 0 Then… The mean of y is identical at each x value The two variables, x and y, are statistically independent: The outcome for y does not depend on the value of x It does not help us to know the value of x if we want to predict the value of y

49 49 Learning Objective 3: Testing Independence between Quantitative Variables

50 50 Learning Objective 3: Testing Independence between Quantitative Variables Steps of Two-Sided Significance Test about a Population Slope β: 1. Assumptions: The population satisfies the regression line $\mu_y = \alpha + \beta x$. Randomization. The population values of y at each value of x follow a normal distribution, with the same standard deviation at each x value

51 51 Learning Objective 3: Testing Independence between Quantitative Variables Steps of Two-Sided Significance Test about a Population Slope β: 2. Hypotheses: $H_0: \beta = 0$, $H_a: \beta \neq 0$. 3. Test statistic: $t = (b - 0)/se$. Software supplies the sample slope b and its se

52 52 Learning Objective 3: Testing Independence between Quantitative Variables Steps of Two-Sided Significance Test about a Population Slope β: 4. P-value: Two-tail probability of a t test statistic value more extreme than observed; use the t distribution with df = n - 2. 5. Conclusions: Interpret the P-value in context. If a decision is needed, reject $H_0$ if P-value ≤ significance level
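
The steps of the test can be sketched in plain Python. In practice regression software supplies b, se, and the P-value; this sketch is only meant to show what that software is doing. The data are hypothetical and invented for illustration, the function names are assumptions, the standard error uses the standard simple-regression formula se = s / sqrt(sum of (x - x-bar)^2), and the two-tail probability is approximated by numerically integrating the t density with Simpson's rule rather than by calling a statistics library.

```python
import math


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Two-tail probability beyond |t_stat|, using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # P(-|t| < T < |t|), by symmetry
    return max(0.0, 1.0 - central_area)


def slope_t_test(xs, ys):
    """Return (b, se, t, df, P-value) for H0: beta = 0 against Ha: beta != 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    a = y_bar - b * x_bar
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(residual_ss / (n - 2))  # residual standard deviation
    se = s / math.sqrt(s_xx)              # standard error of the slope
    t_stat = (b - 0) / se
    df = n - 2
    return b, se, t_stat, df, two_sided_p_value(t_stat, df)


# Hypothetical data, invented for illustration only
xs = [2, 5, 8, 11, 14, 20]
ys = [66, 72, 75, 80, 83, 95]

b, se, t_stat, df, p_value = slope_t_test(xs, ys)
print(b, se, t_stat, df, p_value)
```

With these made-up points the t statistic is large and the P-value is far below common significance levels, so the null hypothesis of independence would be rejected.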

53 53 Learning Objective 3: Example: Is Strength Associated with 60-Pound Bench Press?

54 54 Learning Objective 3: Example: Is Strength Associated with 60-Pound Bench Press? Conduct a two-sided significance test of the null hypothesis of independence. Assumptions: A scatterplot of the data revealed a linear trend, so the straight-line regression model seems appropriate. The scatter of points has a similar spread at different x values. The sample was a convenience sample, not a random sample, so this is a concern

55 55 Learning Objective 3: Example: Is Strength Associated with 60-Pound Bench Press? Hypotheses: $H_0: \beta = 0$, $H_a: \beta \neq 0$. Test statistic: $t = b/se$, with b and se taken from the software output. P-value: 0.000 (as reported, rounded to three decimals). Conclusion: An association exists between the number of 60-pound bench presses and maximum bench press

56 56 Learning Objective 4: A Confidence Interval for β A small P-value in the significance test of $H_0: \beta = 0$ suggests that the population regression line has a nonzero slope. To learn how far the slope β falls from 0, we construct a confidence interval. A 95% confidence interval has the form $b \pm t_{.025}(se)$, where the t-score has df = n - 2

57 57 Learning Objective 4: Example: Estimating the Slope for Predicting Maximum Bench Press Construct a 95% confidence interval for β. Based on the 95% CI, we can conclude that, on average, the maximum bench press increases by between 1.2 and 1.8 pounds for each additional 60-pound bench press that an athlete can do

58 58 Let’s estimate the effect of a 10-unit increase in x: Since the 95% CI for β is (1.2, 1.8), the 95% CI for 10β is (12, 18) On the average, we infer that the maximum bench press increases by at least 12 pounds and at most 18 pounds, for an increase of 10 in the number of 60-pound bench presses Learning Objective 4: Example: Estimating the Slope for Predicting Maximum Bench Press

59 59 Chapter 12: Analyzing Association Between Quantitative Variables: Regression Analysis Section 12.4: What Do We Learn from How the Data Vary Around the Regression Line?

60 60 Learning Objectives 1. Residuals and Standardized Residuals 2. Analyzing Large Standardized Residuals 3. The Residual Standard Deviation 4. Confidence Interval for µ y 5. Prediction Interval for y 6. Prediction Interval for y vs Confidence Interval for µ y

61 61 Learning Objective 1: Residuals and Standardized Residuals A residual is a prediction error – the difference between an observed outcome and its predicted value The magnitude of these residuals depends on the units of measurement for y A standardized version of the residual does not depend on the units

62 62 Learning Objective 1: Standardized Residuals Standardized residual: $\frac{y - \hat{y}}{se(y - \hat{y})}$. The se formula is complex, so we rely on software to find it. A standardized residual indicates how many standard errors a residual falls from 0. If the relationship is truly linear and the standardized residuals have approximately a bell-shaped distribution, observations with standardized residuals larger than 3 in absolute value often represent outliers
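
To give a feel for what the software computes, here is a sketch of standardized residuals for a simple linear regression. It assumes the standard textbook form of the standard error, se = s * sqrt(1 - h), where h = 1/n + (x - x-bar)^2 / sum of (x - x-bar)^2 is the leverage of the observation; this formula is not given in the slides. The data are hypothetical and the function name is an assumption of this sketch.

```python
import math


def standardized_residuals(xs, ys):
    """Residuals divided by their standard errors for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # residual standard deviation
    result = []
    for x, e in zip(xs, residuals):
        leverage = 1 / n + (x - x_bar) ** 2 / s_xx
        result.append(e / (s * math.sqrt(1 - leverage)))
    return result


# Hypothetical data: roughly y = 1 + 2x, except the sixth point, which is far too low
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 5.0, 15.1, 16.8, 19.2, 20.9]

z = standardized_residuals(xs, ys)
# Indices of observations whose standardized residual exceeds 2 in absolute value
flagged = [i for i, value in enumerate(z) if abs(value) > 2]
print(flagged)
```

Only the deliberately misplaced sixth observation (index 5) is flagged; its standardized residual is negative because the observed y falls well below the fitted line.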

63 63 Learning Objective 1: Example: Detecting an Underachieving College Student Data was collected on a sample of 59 students at the University of Georgia Two of the variables were: CGPA: College Grade Point Average HSGPA: High School Grade Point Average

64 64 A regression equation was created from the data: x: HSGPA y: CGPA Equation: Learning Objective 1: Example: Detecting an Underachieving College Student

65 65 MINITAB highlights observations that have standardized residuals with absolute value larger than 2: Learning Objective 1: Example: Detecting an Underachieving College Student

66 66 Consider the reported standardized residual of -3.14 This indicates that the residual is 3.14 standard errors below 0 This student’s actual college GPA is quite far below what the regression line predicts Learning Objective 1: Example: Detecting an Underachieving College Student

67 67 Learning Objective 2: Analyzing Large Standardized Residuals For an observation with a large standardized residual, ask: Does it fall well away from the linear trend that the other points follow? Does it have too much influence on the results? Note: Some large standardized residuals may occur just because of ordinary random variability. Even if the model is perfect, we’d expect about 5% of the standardized residuals to have absolute values > 2 by chance.

68 68 Learning Objective 2: Histogram of Residuals A histogram of residuals or standardized residuals is a good way of detecting unusual observations A histogram is also a good way of checking the assumption that the conditional distribution of y at each x value is normal Look for a bell-shaped histogram

69 69 Learning Objective 2: Histogram of Residuals Suppose the histogram is not bell-shaped: The distribution of the residuals is not normal However…. Two-sided inferences about the slope parameter still work quite well The t- inferences are robust

70 70 Learning Objective 3: The Residual Standard Deviation For statistical inference, the regression model assumes that the conditional distribution of y at a fixed value of x is normal, with the same standard deviation at each x This standard deviation, denoted by σ, refers to the variability of y values for all subjects with the same x value

71 71 Learning Objective 3: The Residual Standard Deviation The estimate of σ, obtained from the data, is: $s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$
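
The residual standard deviation, s = sqrt(residual sum of squares / (n - 2)), can be computed directly once the line has been fitted. A minimal sketch with four hypothetical data points, chosen so that the arithmetic is easy to check by hand; the function name is an assumption.

```python
import math


def residual_standard_deviation(xs, ys):
    """Estimate of sigma: sqrt(residual sum of squares / (n - 2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    a = y_bar - b * x_bar
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(residual_ss / (n - 2))


# Hypothetical data: fitted line is y-hat = -0.5 + 2.2x, residual sum of squares is 1.8
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]

s = residual_standard_deviation(xs, ys)
print(s)
```

The divisor is n - 2 rather than n because two parameters, the intercept and the slope, are estimated from the data.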

72 72 Learning Objective 3: Example: How Variable are the Athletes’ Strengths? From MINITAB output, we obtain s, the residual standard deviation of y: s = 8.0. For any given x value, we estimate the mean y value using the regression equation and we estimate the standard deviation using s = 8.0

73 73 Learning Objective 4: Confidence Interval for $\mu_y$ We estimate $\mu_y$, the population mean of y at a given value of x, by $\hat{y} = a + bx$. We can construct a 95% confidence interval for $\mu_y$ using $\hat{y} \pm t_{.025}(se \text{ of } \hat{y})$, where the t-score has df = n - 2

74 74 Learning Objective 5: Prediction Interval for y The estimate for the mean of y at a fixed value of x is also a prediction for an individual outcome y at the fixed value of x Most regression software will form this interval within which an outcome y is likely to fall

75 75 Learning Objective 6: Prediction Interval for y vs Confidence Interval for µ y The prediction interval for y is an inference about where individual observations fall Use a prediction interval for y if you want to predict where a single observation on y will fall for a particular x value

76 76 The confidence interval for µ y is an inference about where a population mean falls Use a confidence interval for µ y if you want to estimate the mean of y for all individuals having a particular x value Learning Objective 6: Prediction Interval for y vs Confidence Interval for µ y

77 77 Learning Objective 6: Prediction Interval for y vs Confidence Interval for µ y Note that the prediction interval is wider than the confidence interval - you can estimate a population mean more precisely than you can predict a single observation Caution: in order for these intervals to be valid, the true relationship must be close to linear with about the same variability of y-values at each fixed x-value

78 78 Learning Objective 6: Example: Predicting Maximum Bench Press and Estimating its Mean

79 79 Learning Objective 6: Example: Predicting Maximum Bench Press and Estimating its Mean Use the MINITAB output to find and interpret a 95% CI for the population mean of the maximum bench press values for all female high school athletes who can do x = 11 sixty-pound bench presses. For all female high school athletes who can do 11 sixty-pound bench presses, we estimate that the mean of their maximum bench press values falls between 78 and 82 pounds

80 80 Learning Objective 6: Example: Predicting Maximum Bench Press and Estimating its Mean Use the MINITAB output to find and interpret a 95% prediction interval for a single new observation on the maximum bench press for a randomly chosen female high school athlete who can do x = 11 sixty-pound bench presses. For all female high school athletes who can do 11 sixty-pound bench presses, we predict that 95% of them have maximum bench press values between 64 and 96 pounds
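
The two intervals at x = 11 can be roughly reproduced from the summary statistics quoted earlier (n = 57, mean of x = 11.0, st. dev. of x = 7.1, s = 8.0, and the line 63.5 + 1.49x). The sketch below is not the MINITAB computation: it assumes the standard simple-regression standard errors, which the slides do not show, and it uses 2.0 as an approximation to the t-score with df = 55. The function and constant names are assumptions of this sketch.

```python
import math

# Summary statistics quoted for the strength study
N = 57               # number of athletes
X_BAR = 11.0         # mean number of 60-pound bench presses
S_X = 7.1            # standard deviation of x
S = 8.0              # residual standard deviation
A, B = 63.5, 1.49    # intercept and slope of the regression line

T_SCORE = 2.0        # assumption: approximates t_.025 with df = N - 2 = 55


def intervals_at(x0):
    """Approximate 95% CI for the mean of y and 95% PI for a single y at x = x0."""
    y_hat = A + B * x0
    s_xx = (N - 1) * S_X ** 2  # sum of squared deviations of x
    se_mean = S * math.sqrt(1 / N + (x0 - X_BAR) ** 2 / s_xx)
    se_pred = S * math.sqrt(1 + 1 / N + (x0 - X_BAR) ** 2 / s_xx)
    ci = (y_hat - T_SCORE * se_mean, y_hat + T_SCORE * se_mean)
    pi = (y_hat - T_SCORE * se_pred, y_hat + T_SCORE * se_pred)
    return ci, pi


ci, pi = intervals_at(11)
print([round(v) for v in ci], [round(v) for v in pi])
```

Rounded to whole pounds, the results agree with the intervals reported in the example. The prediction interval is much wider because it has to allow for the variability of an individual athlete around the mean, not just the uncertainty in estimating the mean.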

81 81 Chapter 12: Analyzing Association Between Quantitative Variables: Regression Analysis Section 12.5: Exponential Regression: A Model for Nonlinearity

82 82 Learning Objectives 1. Nonlinear Regression Models 2. Exponential Regression Model 3. Interpreting Exponential Regression Models

83 83 Learning Objective 1: Nonlinear Regression Models If a scatterplot indicates substantial curvature in a relationship, then equations that provide curvature are needed Occasionally a scatterplot has a parabolic appearance: as x increases, y increases then it goes back down More often, y tends to continually increase or continually decrease but the trend shows curvature

84 84 Learning Objective 1: Example: Exponential Growth in Population Size Since 2000, the population of the U.S. has been growing at a rate of 2% a year. The population size in 2000 was 280 million. The population size in 2001 was $280 \times 1.02$. The population size in 2002 was $280 \times (1.02)^2$, and so on. The population size in 2010 is estimated to be $280 \times (1.02)^{10}$. This is called exponential growth
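
The year-by-year multiplication in this example is easy to express in code. A minimal sketch, using the starting size (280 million) and growth rate (2% a year) stated in the example; the function and parameter names are assumptions.

```python
def population_millions(years_since_2000, initial=280.0, growth_rate=0.02):
    """Population size in millions after repeated growth at a fixed yearly rate."""
    return initial * (1 + growth_rate) ** years_since_2000


for year in (2000, 2001, 2002, 2010):
    print(year, round(population_millions(year - 2000), 1))
```

Each additional year multiplies the previous value by 1.02, so the yearly increase itself keeps growing, which is what distinguishes exponential growth from straight-line growth.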

85 85 Learning Objective 2: Exponential Regression Model An exponential regression model has the formula $\mu_y = \alpha \beta^x$ for the mean $\mu_y$ of y at a given value of x, where α and β are parameters

86 86 Learning Objective 2: Exponential Regression Model In the exponential regression equation, the explanatory variable x appears as the exponent of a parameter. The mean $\mu_y$ and the parameter β can take only positive values

87 87 Learning Objective 2: Exponential Regression Model As x increases, the mean $\mu_y$ continually increases when β > 1. It continually decreases when 0 < β < 1

88 88 Learning Objective 2: Exponential Regression Model For exponential regression, the logarithm of the mean is a linear function of x When the exponential regression model holds, a plot of the log of the y values versus x should show an approximate straight-line relation with x
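
Taking logarithms of an exponential model gives log(mean) = log(α) + x log(β), a straight line in x. One common way to exploit this, sketched below, is to fit an ordinary least squares line to (x, log y) and transform the intercept and slope back. This is an illustration under assumptions, not necessarily how any particular software package fits the model; the data are hypothetical values generated exactly from y = 5 × 1.5^x, and the function name is an assumption.

```python
import math


def fit_exponential(xs, ys):
    """Estimate (alpha, beta) in y = alpha * beta**x by least squares on log(y)."""
    log_ys = [math.log(y) for y in ys]
    n = len(xs)
    x_bar = sum(xs) / n
    ly_bar = sum(log_ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys)) / s_xx
    intercept = ly_bar - slope * x_bar
    # Back-transform: intercept estimates log(alpha), slope estimates log(beta)
    return math.exp(intercept), math.exp(slope)


# Hypothetical data lying exactly on y = 5 * 1.5**x
xs = [0, 1, 2, 3, 4]
ys = [5.0 * 1.5 ** x for x in xs]

alpha_hat, beta_hat = fit_exponential(xs, ys)
print(alpha_hat, beta_hat)
```

Because the made-up points follow the exponential curve exactly, the fit recovers α = 5 and β = 1.5 up to rounding; with real data the log-scale plot would only be approximately linear.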

89 89 Learning Objective 2: Example: Explosion in Number of People Using the Internet

90 90 Learning Objective 2: Example: Explosion in Number of People Using the Internet Plot of Number of People Using Internet between 1995 and 2001

91 91 Learning Objective 2: Example: Explosion in Number of People Using the Internet Plot of Log Number of People Using Internet between 1995 and 2001

92 92 Learning Objective 2: Example: Explosion in Number of People Using the Internet Using regression software, we can create the exponential regression equation: x: the number of years since 1995. Start with x = 0 for 1995, then x = 1 for 1996, etc. y: number of Internet users (in millions). Equation: $\hat{y} = 20.38 \times 1.7708^x$

93 93 Learning Objective 3: Interpreting Exponential Regression Models In the exponential regression model, the parameter α represents the mean value of y when x = 0; The parameter β represents the multiplicative effect on the mean of y for a one-unit increase in x

94 94 Learning Objective 3: Example: Explosion in Number of People Using the Internet In this model: The predicted number of Internet users in 1995 (for which x = 0) is 20.38 million The predicted number of Internet users in 1996 is 20.38 times 1.7708
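
The interpretation of the two estimates can be checked by evaluating the fitted model, 20.38 × 1.7708^x with x counted in years since 1995. A minimal sketch; the function name is an assumption, and the two constants are the estimates quoted in the example.

```python
def predicted_internet_users_millions(year):
    """Predicted number of Internet users (millions) from the fitted exponential model."""
    x = year - 1995  # years since 1995
    return 20.38 * 1.7708 ** x


for year in (1995, 1996, 1997):
    print(year, round(predicted_internet_users_millions(year), 2))
```

Each one-year step multiplies the prediction by 1.7708, that is, a predicted increase of about 77% per year over the period covered by the data.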

