Chapter 5: Summarizing Bivariate Data
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Slide 2: Terms. Multivariate data set; bivariate data set; scatterplot; response variable (usually "y"); explanatory variable (usually "x").

Slide 3: Example. Calculators ready?! A sample of one-way Greyhound bus fares from Rochester, NY to cities less than 750 miles away was taken by going to Greyhound's website. The table gives the destination city, the distance, and the one-way fare. Distance is plotted on the x axis and fare on the y axis.

Slide 4: Scatterplot of fare versus distance for the Greyhound data.

Slide 5: Comments. The axes need not intersect at (0, 0). For each axis, the scale should be chosen so that the minimum and maximum values are convenient and the plotted values fit nicely between them without a lot of extra room. The calculator does this automatically with ZOOM-Stat. Notice that for this example:
1. The x axis (distance) runs from 50 to 650 miles; the data points lie between 69 and … miles.
2. The y axis (fare) runs from $10 to $100; the data points lie between $17 and $96.
3. What window did the calculator choose?

Slide 6: Further comments. It is possible for two points to have the same x value with different y values. In this example, the y value tends to increase as x increases. It appears that the y value (fare) could be predicted reasonably well from the x value (distance).

Slide 7: Association. Positive association: two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values likewise tend to occur together (i.e., generally speaking, the y values tend to increase as the x values increase). Negative association: two variables are negatively associated when above-average values of one accompany below-average values of the other, and vice versa (i.e., generally speaking, the y values tend to decrease as the x values increase).

Slide 8: The Pearson correlation coefficient. A measure of the strength of the linear relationship between two variables is called the Pearson correlation coefficient. The Pearson sample correlation coefficient is defined by

$$r = \frac{\sum z_x z_y}{n-1} = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n-1)\, s_x s_y}$$
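The definition translates directly into a few lines of code. Below is a minimal sketch of the z-score form of r; the distances and fares are hypothetical stand-ins for the Greyhound data, not the textbook's actual table.

```python
# Sketch: Pearson sample correlation r = sum(z_x * z_y) / (n - 1).
# The numbers below are made-up illustrative distances and fares.
import math

x = [69, 136, 178, 247, 345, 459]   # hypothetical distances (miles)
y = [17, 25, 31, 40, 57, 71]        # hypothetical one-way fares ($)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
# Sample standard deviations (divide by n - 1).
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# r is the sum of the products of the z-scores, divided by n - 1.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)
print(round(r, 3))
```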

Slide 9: Example calculation of r for the Greyhound fare data.

Slides 10-15: Some correlation pictures (a series of scatterplots illustrating different strengths and signs of r).

Slide 16: Properties of r. The value of r does not depend on the unit of measurement for either variable. The value of r does not depend on which of the two variables is labeled x.

Slide 17: Properties of r. 1. The value of r is between -1 and +1. 2. The correlation coefficient is (a) -1 only when all the points lie perfectly on a downward-sloping line, and (b) +1 only when all the points lie perfectly on an upward-sloping line.

Slide 18: Properties of r. The value of r is a measure of the extent to which x and y are linearly related. The value of r is NOT the slope of the line of best fit. (I'm growing tired of this bird!)

Slide 19: Remember! Association does not imply causation. Just because x and y are correlated doesn't mean x causes y. Values of r that are close to 1 or -1 mean only that x and y are strongly linearly associated.

Slide 20: Why did the pirate love correlation and regression? 'Cause me likes findin' rrrrrrr!

Slide 21: An interesting example. Consider the following bivariate data set (table of (x, y) pairs shown on the slide).

Slide 22: An interesting example. Computing the Pearson correlation coefficient, we find that r = 0.001.

Slide 23: An interesting example. With a sample Pearson correlation coefficient of r = 0.001, one would note that there seems to be little or no linearity to the relationship between x and y. Be careful not to infer that there is no relationship between x and y.

Slide 24: An interesting example. Note that there appears to be an almost perfect quadratic relationship between x and y when the scatterplot is drawn.
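This phenomenon is easy to reproduce. The sketch below builds a perfect quadratic relationship on symmetric x values (illustrative data, not the textbook's table) and computes r with the usual formula.

```python
# Sketch: a perfect quadratic y = x**2 on symmetric x values gives r = 0,
# even though y is completely determined by x.
import statistics as stats

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]          # perfect quadratic relationship

n = len(x)
x_bar, y_bar = stats.mean(x), stats.mean(y)
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
den = (n - 1) * stats.stdev(x) * stats.stdev(y)
print(num / den)   # 0.0 -- no linear association, but a perfect curve
```

Even though y is completely determined by x here, r comes out 0: the correlation coefficient measures only linear association.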

Slide 25: End of Lesson 1. Homework: pg. 207: 1, 5, 9, 11, 15, …

Slide 26: Linear relations. The relationship y = a + bx is the equation of a straight line. The value b, called the slope of the line, is the amount by which y increases when x increases by 1 unit. The value a, called the intercept (or sometimes the vertical intercept) of the line, is the height of the line above the value x = 0.

Slide 27: Example. The line y = 7 + 3x has intercept a = 7; each time x increases by 1, y increases by b = 3.

Slide 28: Example. The line y = 17 - 4x has intercept a = 17; each time x increases by 1, y changes by b = -4 (i.e., decreases by 4).

Slide 29: Least squares line. Many scatterplots show a linear pattern but not a perfect line. How do you find the "trend line" in these cases? Is there one best trend line? Terms: predicted y value (y-hat), actual y value, residual.

Slide 30: Least squares line. The trend line that gives the best fit to the data is the one that minimizes the sum of the squared residuals; it is called the least squares line or sample regression line.

Slide 31: Least squares line (nasty formulas). The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) is the sum of the squared deviations about the line:

$$\sum_{i=1}^{n} \left[ y_i - (a + b x_i) \right]^2$$

The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line.

Slide 32: Coefficients a and b (these are on the formula sheet). The slope of the least squares line is

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

and the y intercept is

$$a = \bar{y} - b\bar{x}$$

We write the equation of the least squares line as $\hat{y} = a + bx$, where the ^ above y emphasizes that $\hat{y}$ (read as "y-hat") is a prediction of y resulting from substituting a particular x value into the equation.
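As a sketch of how these formulas are used, the code below computes b and a and then makes a prediction. The distance/fare numbers are made up for illustration; they are not the textbook's Greyhound table.

```python
# Sketch of the least squares formulas on this slide:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)**2),  a = y_bar - b*x_bar
x = [69, 136, 178, 247, 345, 459]   # hypothetical distances (miles)
y = [17, 25, 31, 40, 57, 71]        # hypothetical fares ($)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

y_hat_300 = a + b * 300             # prediction from y-hat = a + b x
print(f"y-hat = {a:.2f} + {b:.4f} x")
print(f"predicted fare for a 300-mile trip: ${y_hat_300:.2f}")
```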

Slide 33: Interpreting the slope and y-intercept of the LSRL. The slope, b, of the least squares line is interpreted as the average change in y associated with a one-unit increase in x. The y-intercept, a, does not always have a meaningful interpretation, but when it does, it is the predicted value of y when x is zero. The ^ above y emphasizes that ŷ (read as "y-hat") is a prediction of y resulting from substituting a particular x value into the equation.

Slide 34: So what are x-bar and y-bar? Notice that the messy slope equation involves some weird-looking variables: $\bar{x}$ and $\bar{y}$ are simply the sample means of the x values and the y values, $\bar{x} = \frac{\sum x}{n}$ and $\bar{y} = \frac{\sum y}{n}$.

Slide 35: Greyhound example continued (data table and summary quantities on the slide).

Slide 36: Calculations. From the previous slide, we have the quantities needed to compute the slope b and intercept a.

Slide 37: Minitab graph. The following graph is a copy of the output from a Minitab command to graph the regression line. So, interpret the slope and y-intercept, and predict the fare for a trip of 300 miles.

Slide 38: Why is it called a "regression" line? It's called a regression line because the predicted y values always have z-scores that are less than or equal to the z-scores of their corresponding x values in magnitude. In fact the relationship yields the equation

$$z_{\hat{y}} = r \cdot z_x$$

This in turn gives a new, simpler formula for the slope after a few calculations:

$$b = r \cdot \frac{s_y}{s_x}$$
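The identity b = r·(s_y/s_x) is easy to verify numerically. Below is a quick sketch on arbitrary illustrative data; both expressions reduce to the same ratio, so the two printed values agree.

```python
# Numerical check of b = r * (s_y / s_x) for the least squares slope.
import statistics as stats

x = [1.0, 2.0, 4.0, 5.0, 7.0]   # arbitrary illustrative data
y = [2.1, 2.9, 5.2, 5.8, 8.3]

n = len(x)
x_bar, y_bar = stats.mean(x), stats.mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sum((xi - x_bar) ** 2 for xi in x)            # least squares slope
r = sxy / ((n - 1) * stats.stdev(x) * stats.stdev(y))   # Pearson r

print(b, r * stats.stdev(y) / stats.stdev(x))           # the two values agree
```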

Slide 39: End of Lesson 2. Homework: pg. 217: 19, 21, 25, 31, …

Slide 40: Three important questions. To examine how useful or effective the line summarizing the relationship between x and y is, we consider the following three questions. 1. Is a line an appropriate way to summarize the relationship between the two variables? 2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the regression line to make predictions? 3. If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?

Slide 41: Review of terminology.

Slide 42: Greyhound example continued.

Slide 43: Residual plot. A residual plot is a scatterplot of the data pairs (x, residual). The following plot was produced by Minitab from the Greyhound example.

Slide 44: Residual plot - what to look for. Isolated points or patterns indicate potential problems. Ideally, the points should be randomly scattered above and below zero. This residual plot indicates no systematic bias when using the least squares line to predict the y value; generally this is the kind of pattern you would like to see. Note: 1. Values below 0 indicate overprediction. 2. Values above 0 indicate underprediction.
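A residual plot is straightforward to produce by hand. The sketch below assumes matplotlib is installed and uses the same hypothetical distance/fare numbers as the earlier sketches, not the textbook's Greyhound data.

```python
# Sketch: residual plot (x, actual - predicted) for a least squares fit.
import matplotlib.pyplot as plt

x = [69, 136, 178, 247, 345, 459]   # hypothetical distances (miles)
y = [17, 25, 31, 40, 57, 71]        # hypothetical fares ($)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # actual - predicted

plt.scatter(x, residuals)
plt.axhline(0)   # points above 0: underprediction; below 0: overprediction
plt.xlabel("Distance (miles)")
plt.ylabel("Residual ($)")
plt.show()
```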

Slide 45: The Greyhound example continued. For the Greyhound example, it appears that the line systematically predicts fares that are too high for cities close to Rochester and fares that are too low for most cities between 200 and 500 miles away.

Slide 46: More residual plots.

Slide 47: Definition formulas. The total sum of squares is $\text{SSTo} = \sum (y - \bar{y})^2$ and the residual sum of squares is $\text{SSResid} = \sum (y - \hat{y})^2$.

Slide 48: Computational formulas. SSTo and SSResid are generally found as part of the standard output from most statistical packages, or they can be obtained using the following computational formulas:

$$\text{SSTo} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \qquad \text{SSResid} = \sum y^2 - a \sum y - b \sum xy$$
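The sketch below checks these computational shortcuts against the definitions from the previous slide, on arbitrary illustrative data; each pair of printed values matches (up to floating-point rounding).

```python
# Sketch: computational vs. definitional forms of SSTo and SSResid.
x = [1.0, 2.0, 4.0, 5.0, 7.0]   # arbitrary illustrative data
y = [2.1, 2.9, 5.2, 5.8, 8.3]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# Definitions.
ssto_def = sum((yi - y_bar) ** 2 for yi in y)
ssresid_def = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Computational (shortcut) forms.
ssto_comp = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
ssresid_comp = (sum(yi ** 2 for yi in y) - a * sum(y)
                - b * sum(xi * yi for xi, yi in zip(x, y)))

print(ssto_def, ssto_comp)        # equal up to rounding
print(ssresid_def, ssresid_comp)  # equal up to rounding
```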

Slide 49: Coefficient of determination. The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y:

$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}}$$

Note that the coefficient of determination is the square of the Pearson correlation coefficient.

Slide 50: Greyhound example revisited.

Slide 51: Greyhound example revisited. We can say that 93.5% of the variation in fare (y) can be attributed to the least squares linear relationship between distance (x) and fare.

Slide 52: More on variability. The standard deviation about the least squares line is denoted s_e and given by

$$s_e = \sqrt{\frac{\text{SSResid}}{n - 2}}$$

s_e is interpreted as the "typical" amount by which an observation deviates from the least squares line.
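Putting the last few slides together, here is a minimal sketch that computes r² and s_e from SSTo and SSResid, again on arbitrary illustrative data.

```python
# Sketch: r**2 = 1 - SSResid/SSTo and s_e = sqrt(SSResid / (n - 2)).
import math

x = [1.0, 2.0, 4.0, 5.0, 7.0]   # arbitrary illustrative data
y = [2.1, 2.9, 5.2, 5.8, 8.3]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

ssto = sum((yi - y_bar) ** 2 for yi in y)
ssresid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_sq = 1 - ssresid / ssto           # proportion of variation explained
s_e = math.sqrt(ssresid / (n - 2))  # typical deviation from the line
print(f"r^2 = {r_sq:.3f}, s_e = {s_e:.3f}")
```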

Slide 53: Greyhound example revisited. The "typical" deviation of the actual fare from the prediction is $6.80.

Slide 54: Minitab output for the regression.

Regression Analysis: Standard Fare versus Distance

The regression equation is
Standard Fare = … + … Distance

Predictor   Coef   SE Coef   T   P
Constant    …
Distance    …

S = …   R-Sq = 93.5%   R-Sq(adj) = 92.9%

Analysis of Variance
Source           DF   SS   MS   F   P
Regression       …
Residual Error   …
Total            …

The slide annotates where the key quantities appear in this output: a and b are the Constant and Distance coefficients (giving the least squares regression line), S is s_e, R-Sq is r², the SS entry on the Residual Error row is SSResid, and the SS entry on the Total row is SSTo.

Slide 55: Transformations. They're not hard "atoll"!

Slide 56: Island areas.

Island               Area   | Island              Area
Anticosti            3066   | Iceland             39769
Ascension              34   | Long Island          1396
Azores                902   | Madeira               307
Bahamas              5380   | Marajo              15528
Bermuda                20   | Martha's Vineyard      91
Bioko                 785   | Mount Desert          108
Block                  10   | Nantucket              46
Canary               2808   | Newfoundland        42030
Cape Breton          3981   | Prince Edward        2184
Cape Verde           1750   | St. Helena             47
Faeroe                540   | South Georgia        1450
Falkland             4700   | Tierra del Fuego    18800
Fernando de Noronha     7   | Tristan de Cunha       40
Greenland               …   |

Slides 57-59: Fathom dotplots of the island areas, of log2(area), and of ln(area).

Slide 60: How to "fix" skew. Left skew: raise the data to positive powers greater than 1 (x², x³). Right skew: take the square root or cube root of the data; take the log or ln of the data; take the reciprocal of the data.

Slide 61: The Greyhound problem with additional data. The sample of fares and mileages from Rochester was extended to cover a total of 20 cities throughout the country. The resulting data and a scatterplot are given on the next few slides.

Slides 62-63: Extended Greyhound fare example (data table and scatterplot).

Slide 64: Extended Greyhound fare example. Minitab reports the correlation coefficient r = 0.921, R² = 0.849, s_e = $17.42, and the regression line Standard Fare = … + … Distance. Notice that even though the correlation coefficient is reasonably high and 84.9% of the variation in fare is explained, the linear model is not very usable.

Slide 65: Histograms of the x's and y's. A quick examination of the histograms for the x values (distance) and y values (fare) shows a prominent right skew in the distribution of distances.

Slide 66: How to "fix" skew (repeated). Left skew: raise the data to positive powers greater than 1 (x², x³). Right skew: square root or cube root the data; log or ln the data; take the reciprocal of the data. So we should try a right-skew "fix" for the x values! We'll try log x.
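As a sketch of this transformation step, the code below fits fare on distance and on log10(distance) and compares r². The long-haul fares are made up to mimic the curved shape described here; they are not the textbook's extended data set.

```python
# Sketch: compare a linear fit of fare on distance with a fit of fare
# on log10(distance). Data are hypothetical, chosen to curve like fares.
import math

dist = [69, 136, 247, 459, 750, 1100, 1800, 2600]   # miles, hypothetical
fare = [17, 25, 40, 71, 90, 105, 125, 140]          # dollars, hypothetical

def fit_r_squared(x, y):
    """Least squares fit of y on x; returns (a, b, r_squared)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    ssresid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, 1 - ssresid / ssto

_, _, r2_raw = fit_r_squared(dist, fare)
_, _, r2_log = fit_r_squared([math.log10(d) for d in dist], fare)
print(f"r^2 with distance: {r2_raw:.3f}; with log10(distance): {r2_log:.3f}")
```

On this illustrative data the log-transformed fit explains noticeably more of the variation, which is exactly the improvement the slides describe for the real fares.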

Slide 67: Nonlinear regression example.

Slide 68: Nonlinear regression example. When we look at the prediction curve on a graph with Standard Fare and log10(Distance) axes, the result looks reasonably linear.

Slide 69: Nonlinear regression example. From the first graph we can see that the plot did not look linear; it appears to have a curved shape. We sometimes replace one or both of the variables with a transformation of that variable and then perform a linear regression on the transformed variables. This can sometimes lead to a useful prediction equation. For this particular data, the shape of the curve is almost logarithmic, so we replaced the distance with log10(distance) (the logarithm to the base 10 of the distance).

Slide 70: Nonlinear regression example. Minitab provides the following output.

Regression Analysis: Standard Fare versus Log10(Distance)

The regression equation is
Standard Fare = … + … Log10(Distance)

Predictor   Coef   SE Coef   T   P
Constant    …
Log10(Di    …

S = …   R-Sq = 96.9%   R-Sq(adj) = 96.7%

The slide's annotations: high r² (96.9% of the variation attributed to the model), typical error = $7.87, reasonably good.

Slide 71: Nonlinear regression example. The rest of the Minitab output follows: an Analysis of Variance table and a list of unusual observations, where "R denotes an observation with a large standardized residual." The only outlier is Orlando, and as you'll see from the next two slides, it is not too bad.

Slide 72: Nonlinear regression example. Looking at the plot of the residuals against distance, we see some problems. The model overestimates fares for middle distances (1000 to 2000 miles) and underestimates for longer distances (more than 2000 miles).

Slide 73: Nonlinear regression example. When we look at the prediction curve on a graph with Standard Fare and Distance axes, the result appears to work fairly well. By and large, this prediction model for the fares appears to work reasonably well.

Slide 74: Textbook diagram: the transformation ladder. From top to bottom: exponential, power, no change, roots, logs, reciprocal. Moving up or down the ladder can be applied to either y or x.