Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Chapter Describing the Relation between Two Variables 4.


1 Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Chapter Describing the Relation between Two Variables 4

2 Chap 22

3 Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Section Scatter Diagrams and Correlation 4.1

4 4-4 The response (dependent) or “output” variable is the variable whose value can be “predicted” or explained by the value of the explanatory/predictor (independent) or “input” variable.

5 4-5 A scatter diagram is a graph that shows the relationship between two quantitative variables. The predictor (independent / “x”) variable is plotted on the horizontal axis, and the response (dependent / “y”) variable is plotted on the vertical axis.

6 4-6 EXAMPLE Drawing and Interpreting a Scatter Diagram The data to the right are based on a study for drilling thru rock. The researchers wanted to determine whether the time it takes to drill thru 5 feet of rock increases with the depth at which the drilling begins. Depth at which drilling begins is the predictor variable, “x”, and time (min) to drill five feet is the response variable “y”. Draw a scatter diagram of the data.

7 4-7

8 4-8 Various Types of Relations in a Scatter Diagram

9 4-9 Two variables that are linearly related are positively correlated when higher values of one variable are associated with higher values of the other (positive slope), and lower values of one variable are associated with lower values of the other. That is, two variables are “positively correlated” if, as one variable increases, the other variable also increases.

10 4-10

11 4-11 Two variables that are linearly related are negatively correlated when higher values of one variable are associated with lower values of the other (negative slope), and lower values of one variable are associated with higher values of the other. That is, two variables are “negatively correlated” if, as one variable increases, the other variable decreases.

12 4-12 The linear correlation coefficient or Pearson Correlation Coefficient is a measure of the strength and direction of the linear relation between two quantitative variables. The Greek letter “ρ” (rho) represents the population correlation coefficient, and “r” represents the sample correlation coefficient.

13 Larson & Farber, Elementary Statistics: Picturing the World, 3e 13 Linear Correlation Coefficient A measure of the strength and direction of a linear relationship between two variables The range of r is from –1 to +1 If r is close to 1 there is a strong positive correlation. If r is close to –1 there is a strong negative correlation. If r is close to 0 there is no linear correlation. –1 0 1

14 4-14 Properties of the Pearson Correlation Coefficient 1. –1 ≤ r ≤ 1. 2. If r = +1, then a perfect positive correlation exists between the two variables. 3. If r = –1, then a perfect negative correlation exists between the two variables. 4. The closer r is to +1, the stronger is the positive correlation between the two variables. 5. The closer r is to –1, the stronger is the negative correlation between the two variables.

15 4-15 6. If r is close to 0, then little or no evidence exists of a linear correlation between the two variables. So r close to 0 does not imply no relation, just no linear relation. 7. The “r” coefficient is dimensionless. 8. The Pearson correlation coefficient is not resistant. Therefore, just one observation that does not follow the overall data pattern (think outlier) could substantially affect the value of r.
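A minimal sketch of properties 1, 7 and 8, assuming a small data set invented purely for illustration (it is not the drilling data, and the function name `pearson_r` is hypothetical). It computes r from the standard sample formula, then rescales x and adds one off-pattern point to see what changes.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: sum((x - x_bar)(y - y_bar)) / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]

r = pearson_r(x, y)

# Property 7: r is dimensionless, so a change of units (e.g. feet to inches) leaves it unchanged.
r_rescaled = pearson_r([12 * v for v in x], y)

# Property 8: r is not resistant; one point far off the pattern pulls it down sharply.
r_with_outlier = pearson_r(x + [7], y + [0.5])

print(round(r, 4), round(r_rescaled, 4), round(r_with_outlier, 4))
```

The rescaled value should match the original to rounding error, while the single added point moves r a long way toward 0.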

16 4-16

17 4-17 EXAMPLE Determining the Pearson Correlation Coefficient Determine the Pearson correlation coefficient “r” of the drilling data: 1.By algebra (boo!) 2.By calculator (yeah!)

18 4-18

19 4-19

20 Better way … 1. Enter all the “x” data in List 1, and all the “y” data in List 2. Make sure you keep the related pairs of x,y data in the same order. 2. Set your Calc to “Diagnostics On”. 3. Go to Stat: Calc: 4: LinReg: L1, L2. 4. Look for the value of Pearson “r”.

21 TI-84 Line of Regression (LOR) X data (horiz) to (L1), Y (vert) data to (L2). STAT PLOT: Plot 1: Scatter Plot. Zoom:9:Stat. STAT:Calc:4: LinReg(ax+b): L1, L2, Y1. This will generate the LOR on the Stat Plot thru the points, and will show the equation at Y1. To predict “y” value when x=9, find Y1(9). Y1 is found at VARS, Y-VARS, Func, Y1.

22 4-22 Testing for a Linear Relation 1.Determine the absolute value of the Pearson correlation coefficient: |r|. 2.Find the critical value in Table II from Appendix A (or handout) for the given sample size. 3. If |r| is greater than the critical value, then a usable (make predictions) linear relation exists between the two variables. Otherwise, no linear relation exists.

23 4-23 EXAMPLE Does a Linear Relation Exist? Determine whether a linear relation exists between time and depth of the drilling. What type of relation appears to exist between time to drill five feet and depth at which drilling begins? The Pearson |r| value for the two variables (time/depth) is 0.773. The critical value for n = 12 observations is 0.576. Since 0.773 > 0.576, there is a positive linear correlation between time to drill five feet and depth at which drilling begins. We can use this correlation to make predictions.
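The decision rule from the two slides above can be sketched as a one-line comparison. The function name is hypothetical; the numbers are the ones quoted in the example (r = 0.773, critical value 0.576 for n = 12). The critical value still has to be looked up in the table by hand.

```python
def linear_relation_exists(r, critical_value):
    """Apply the rule: a linear relation exists when |r| exceeds the critical value."""
    return abs(r) > critical_value

r = 0.773               # Pearson r for the drilling data, as quoted in the example
critical_value = 0.576  # table value for n = 12, as quoted in the example

print(linear_relation_exists(r, critical_value))
```

Because the rule uses |r|, a strongly negative r passes the test just as a strongly positive one does.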

24 4-24 Another way that two variables can be related even though there is not a causal relation is through a “lurking variable”. A lurking variable is related to both the explanatory and response variable. For example, ice cream sales and crime rates have a very high positive correlation. Does this mean that sales of ice cream cause crime rates to go up? No. The lurking variable is temperature. As temperatures rise, both ice cream sales and crime rates rise.

25 Something to remember… Correlation between variables does not imply “causation” (the independent causes the dependent) unless the results come from a controlled experiment. Correlation of variables in an observational study only implies “association” between the variables and not “causation” of one by the other.

26 Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Section Least-squares Regression 4.2

27 4-27 EXAMPLE Finding an Equation that Describes Linearly Correlated Data Find a linear equation that relates x (predictor variable) and y (response variable) by selecting any two points and finding the equation of the line between those points. Use points: (2, 5.7) and (6, 1.9)

28 4-28 Graph the equation on the scatter diagram. Use the equation to predict y if x = 3 Note: (3, 5.2) is actual data point

29 4-29 The difference between the observed value of y and the predicted value of y is the error, or residual. Using the line and the predicted value at x = 3, for the data point (3, 5.2): residual = observed y – predicted y = 5.2 – 4.75 = 0.45 (error)
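The arithmetic of this example can be retraced in a few lines. This is a sketch only: the helper name `line_through` is hypothetical, while the two points, the data point (3, 5.2) and the resulting values 4.75 and 0.45 come from the slides.

```python
def line_through(p1, p2):
    """Slope and intercept of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# The two points chosen in the example.
slope, intercept = line_through((2, 5.7), (6, 1.9))

# Predict y at x = 3 and compare with the observed data point (3, 5.2).
predicted = slope * 3 + intercept
residual = 5.2 - predicted  # residual = observed y - predicted y

print(round(slope, 2), round(intercept, 2), round(predicted, 2), round(residual, 2))
```

The line works out to y = –0.95x + 7.6, which gives the predicted value 4.75 and the residual 0.45 shown on the slide.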

30 4-30 Least-Squares Regression Criterion The least-squares regression line (LOR or COBF) is the line that minimizes the sum of the squared errors (residuals). This LOR line minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line (“y-hat”). In other words: minimize Σ residuals²
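To make the criterion concrete, here is a minimal sketch that fits a line with the standard closed-form least-squares formulas and then checks that nudging the slope or intercept only increases Σ residuals². The data are hypothetical, invented for illustration, and the function names are assumptions rather than anything from the slides.

```python
def fit_least_squares(xs, ys):
    """Closed-form least-squares slope and intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed y - predicted y)^2 for a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

b1, b0 = fit_least_squares(x, y)
best = sum_squared_residuals(x, y, b1, b0)

# Try four nearby lines; each one should have a larger sum of squared residuals.
nudged = [
    sum_squared_residuals(x, y, b1 + d1, b0 + d0)
    for d1 in (-0.1, 0.1)
    for d0 in (-0.2, 0.2)
]

print(best < min(nudged))
```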

31 Key Concepts LOR stands for “Line of Regression”. COBF stands for “Curve of Best Fit”. Both terms refer to the Least-Squares Regression Line and are used interchangeably.

32 4-32 EXAMPLE Finding the Least-squares Regression Line Find the LOR line. Predict the drilling time if drilling starts at 130 feet. Is the observed drilling time at 130 feet above, or below predicted? (a)Draw the LOR on the scatter diagram of the data.

33 4-33 We agree to round the estimates of the slope and intercept to four decimal places. (b) (c) The observed drilling time is 6.93 minutes. The predicted drilling time is 7.035 minutes. The LOR-predicted drilling time is 1.52% above observed.

34 4-34

35 4-35 Interpretation of Slope of a line: The slope of the LOR regression line is 0.0116. Therefore, for each additional one foot of depth we start the drilling, the time to drill five feet increases by 0.0116 min (~ 0.7 sec), on average.

36 4-36 If the LOR is used to make predictions based on values of the predictor (independent) variable that are significantly outside the observed values, then the researcher is working outside the scope of the model. Never use an LOR to make predictions outside the scope of the model because the linear relation may not still exist.

37 Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Section Diagnostics on the Least-squares Regression (LOR) Line 4.3

38 4-38 The coefficient of determination, R², measures the proportion of total variation in the response variable that is explained by the LOR line. The coefficient of determination is a number between 0 and 1, inclusive: 0 ≤ R² ≤ 1. If R² = 0, the LOR has no predictive value. If R² = 1, 100% of the variation in the response variable is explained by the LOR (explained by, not necessarily caused by, the predictor variable).

39 4-39 Depth at which drilling begins is the predictor variable, “x” Time (min) to drill five feet is the response variable, y.

40 4-40

41 4-41 Regression Analysis The regression equation (LOR) is: y (time) = 0.0116x (depth) + 5.53 (min). Sample Statistics: Depth: mean 126.2, standard deviation 52.2. Time: mean 6.99, standard deviation 0.781. Correlation Between Depth and Time: 0.773
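As a cross-check, the slope and intercept can be recovered from these summary statistics alone using the standard identities slope = r · (s_y / s_x) and intercept = ȳ – slope · x̄. The slide itself does not state these identities, and the reading of the statistics as depth mean 126.2, SD 52.2 and time mean 6.99, SD 0.781 is an assumption about how the table was laid out.

```python
# Summary statistics from the slide (assumed reading of the table).
r = 0.773
mean_depth, sd_depth = 126.2, 52.2
mean_time, sd_time = 6.99, 0.781

# Standard least-squares identities (not stated on the slide).
slope = r * sd_time / sd_depth
intercept = mean_time - slope * mean_depth

print(round(slope, 4), round(intercept, 2))
```

Rounded, this reproduces the 0.0116 and 5.53 of the regression equation, which supports that reading of the table.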

42 4-42 Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best “guess”? ANSWER: The mean time to drill additional 5 feet: 6.99 minutes (see Sample Statistics)

43 4-43 Now suppose that we are asked to predict the time to drill an additional 5 feet if we know that the current depth of the drill is 160 feet? ANSWER: Our “guess” increased from 6.99 minutes to 7.39 minutes because we knew the drill depth and the LOR equation.
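The two “guesses” can be compared directly. This sketch uses the regression equation and mean time given in the Regression Analysis slide; the function name `predict_time` is hypothetical.

```python
def predict_time(depth_ft, slope=0.0116, intercept=5.53):
    """Predicted minutes to drill 5 feet, using the LOR from the Regression Analysis slide."""
    return slope * depth_ft + intercept

mean_time = 6.99               # best guess when depth is unknown
predicted = predict_time(160)  # best guess when depth is known to be 160 feet

print(round(predicted, 2), round(predicted - mean_time, 2))
```

Knowing the depth moves the estimate from 6.99 to about 7.39 minutes, the figure quoted in the answer above.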

44 4-44

45 Definitions The “observed” value of the response (dependent) variable: y. The “predicted” value (by the LOR) of the response (dependent) variable: ŷ (“y-hat”). The “mean” value (of all the “y” values) of the response (dependent) variable: ȳ (“y-bar”).

46 4-46 Total Deviation = Unexplained Deviation + Explained Deviation, that is: (observed y – mean y) = (observed y – predicted y) + (predicted y – mean y)

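47 4-47 Total Variation = Unexplained Variation + Explained Variation. Dividing through by Total Variation: 1 = (Unexplained Variation / Total Variation) + (Explained Variation / Total Variation). R² = Explained Variation / Total Variation = 1 – (Unexplained Variation / Total Variation)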

48 4-48 To determine R² for the linear regression model, simply square the value of the Pearson correlation coefficient “r”. The TI-84 gives you both “r” and R² when you use the “LinReg” subroutine (remember to set “Diagnostics On”).

49 4-49 EXAMPLE Determining the Coefficient of Determination Find and interpret the coefficient of determination for the drilling data. The Pearson correlation coefficient is “r” = 0.773, so R² = 0.773² = 0.5975 = 59.75%. So, 59.75% of the variation in drilling time is explained by the LOR, that is, by its linear relation with drilling depth.
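A short numerical check of the ideas on the last few slides. It uses hypothetical data, invented for illustration, and hypothetical variable names. For a least-squares line, total variation splits into explained plus unexplained variation, and explained / total equals r². The last line repeats the drilling-data arithmetic with the r = 0.773 quoted above.

```python
import math

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 3.1, 4.8, 5.2, 7.1, 7.4]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares line and its predictions (y-hat).
slope = sxy / sxx
intercept = y_bar - slope * x_bar
y_hat = [slope * xi + intercept for xi in x]

total = sum((yi - y_bar) ** 2 for yi in y)                    # total variation
explained = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) # unexplained variation

r = sxy / math.sqrt(sxx * syy)
r_squared = explained / total

print(round(total, 6), round(explained + unexplained, 6))
print(round(r_squared, 6), round(r ** 2, 6))
print(round(0.773 ** 2, 4))  # drilling data: R^2 from the quoted r
```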

50 4-50 Data Set A Data Set B Data Set C A: 99.99% of the variation in y is explained by the variation in x (LOR) B: 94.7% of the variation in y is explained by the variation in x (LOR) C: 9.4% of the variation in y is explained by the variation in x (LOR)

51 4-51 Residuals play an important role in determining the adequacy of the linear model. If a plot of the residuals against the predictor (indep) variable shows a discernible pattern, such as a curve, then the response (dep) and predictor variable may not be linearly related.
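A minimal sketch of what such a pattern looks like numerically, assuming hypothetical data that follow y = x² exactly (invented for illustration). A straight line is fitted anyway and the residuals are listed in order of x.

```python
# Hypothetical curved data, invented for illustration: y = x squared.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

# residual = observed y - predicted y, in order of increasing x
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals run positive, negative, positive: a U-shaped curve rather than a random scatter around zero, which is the warning sign the slide describes.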

52 4-52

53 4-53 If a plot of the residuals versus the predictor (x) variable shows the spread of the residuals increasing or decreasing as the “x” variable increases, then a requirement for a linear model is violated. This requirement is called “constant error variance”

54 4-54

55 4-55 A plot of residuals against the predictor (indep) variable may also reveal outliers. These values will be easy to identify because: the residual will lie far from others in the plot.

56 4-56

57 4-57 EXAMPLE Residual Analysis Draw a residual plot of the drilling time data. Comment on the appropriateness (validity) of the LOR least-squares model.

58 4-58

59 4-59 An influential observation is: an observation (data pair) that significantly affects either: 1.the LOR’s slope and/or y-intercept, or 2. the value of the Pearson linear correlation coefficient “r”.
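One way to check for influence is to fit the line with and without the suspect point and compare the results. The sketch below does this for hypothetical data invented for illustration (not the drilling data); the function name and the added point (15, 5.0) are assumptions.

```python
def fit_least_squares(xs, ys):
    """Closed-form least-squares slope and intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# A hypothetical extra observation far from the rest in both x and pattern.
x_extra, y_extra = 15, 5.0

slope_without, intercept_without = fit_least_squares(x, y)
slope_with, intercept_with = fit_least_squares(x + [x_extra], y + [y_extra])

print(round(slope_without, 3), round(slope_with, 3))
print(round(intercept_without, 3), round(intercept_with, 3))
```

A large shift in slope or intercept when one point is added marks that point as influential in the sense defined above.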

60 4-60 Influential observations typically exist when the point is an outlier relative to the LOR. So, Case 3 is likely to be influential.

61 4-61 Suppose an additional data point is added to the drilling data. At a depth of 300 feet, it took 12.49 minutes to drill 5 feet. Is this point influential? EXAMPLE Influential Observations

62 4-62

63 4-63 LOR with influential LOR without influential

64 Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Section Contingency Tables and Association 4.4

65 4-65 A college professor conducted a study to assess the effectiveness of teaching a statistics course via (1) traditional lecture method, (2) online delivery (no classroom meetings), and (3) hybrid instruction (online course + weekly meetings). The grades (A – F) that students received in each of the courses were tallied. The table is referred to as a contingency table. The row (response) variable is “grade” and the column (predictor) variable is “delivery method”. Each position inside the table is referred to as a cell.

66 4-66 A marginal distribution of a variable is either a freq or rel freq distribution of the row or column variable from the contingency table. (it gets its name from the fact that the freq’s are displayed in either the bottom or right margins of the table)
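A minimal sketch of both kinds of marginal distribution. The counts below are hypothetical, invented for illustration (the slide's actual table is not reproduced in this transcript); only the row and column labels follow the study description.

```python
methods = ["traditional", "online", "hybrid"]

# Hypothetical cell counts, invented for illustration: grade -> count per delivery method.
counts = {
    "A": [10, 8, 12],
    "B": [15, 10, 18],
    "C": [20, 15, 14],
    "D": [5, 10, 4],
    "F": [5, 12, 2],
}

# Frequency marginal distributions: row totals (right margin) and column totals (bottom margin).
row_totals = {grade: sum(row) for grade, row in counts.items()}
col_totals = {m: sum(row[j] for row in counts.values()) for j, m in enumerate(methods)}
grand_total = sum(row_totals.values())

# Relative frequency marginal distributions: each total divided by the grand total.
row_rel = {grade: t / grand_total for grade, t in row_totals.items()}
col_rel = {m: t / grand_total for m, t in col_totals.items()}

print(row_totals, col_totals, grand_total)
print({g: round(p, 3) for g, p in row_rel.items()})
print({m: round(p, 3) for m, p in col_rel.items()})
```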

67 4-67 EXAMPLE Frequency Marginal Distributions Find the frequency marginal distributions for course grade (rows) and delivery method (cols).

68 4-68 EXAMPLE Relative Frequency Marginal Distributions Determine the relative frequency marginal distribution for course grade and delivery method.

69 4-69 A conditional distribution lists the relative frequency of each category of the response variable “y” for a given value of the predictor variable “x” in the contingency table. In other words, each cell contains a rel freq value.
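A conditional distribution can be sketched by dividing each cell by its column total, so that each delivery method gets its own distribution over grades. The counts are hypothetical, invented for illustration, and are not the slide's actual table.

```python
methods = ["traditional", "online", "hybrid"]

# Hypothetical cell counts, invented for illustration: grade -> count per delivery method.
counts = {
    "A": [10, 8, 12],
    "B": [15, 10, 18],
    "C": [20, 15, 14],
    "D": [5, 10, 4],
    "F": [5, 12, 2],
}

# For each value of the predictor (method), the relative frequency of each response (grade).
conditional = {}
for j, method in enumerate(methods):
    column_total = sum(row[j] for row in counts.values())
    conditional[method] = {grade: row[j] / column_total for grade, row in counts.items()}

for method, dist in conditional.items():
    print(method, {g: round(p, 3) for g, p in dist.items()})
```

Each method's relative frequencies sum to 1, which is what makes the columns directly comparable even when the class sizes differ.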

70 4-70 EXAMPLE Determining a Conditional Distribution Comment on any association that may exist between course grade and delivery method. “It appears that students in the hybrid course are more likely to pass than in the other two methods.”

71 4-71 EXAMPLE Drawing a Bar Graph of a Conditional Distribution Using the results of the previous example, draw a bar graph that represents the conditional distribution of grade earned (y) by method of delivery (x).

72 4-72 The following contingency table shows the survival status by category of passenger on the RMS Titanic on 15 April 1912. The actual total death toll was 1502/2224, or 67.5%. Draw a conditional bar graph of survival status (y) by passenger category (x).

73 4-73 Simpson’s Paradox represents a situation in which an association between two variables inverts or disappears when the effect of a third (“lurking”) variable is introduced to the analysis. For example, UC Berkeley was sued for favoring males over females (“gender bias”) in its acceptance rates because the acceptance rate for males was 0.460 and for females was 0.304. However, when the variable “Program of Study” was included, the acceptance rate for females in most programs was actually higher.
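The reversal is easy to reproduce with a toy example. The counts below are hypothetical, invented for illustration, and are not the Berkeley figures; the two program names are placeholders. Women have the higher acceptance rate in each program, yet the lower rate overall, because most women applied to the program that is harder to get into.

```python
# Hypothetical (admitted, applied) counts, invented for illustration only.
applications = {
    "Program X": {"men": (80, 100), "women": (18, 20)},
    "Program Y": {"men": (4, 20), "women": (30, 100)},
}

def rate(admitted, applied):
    """Acceptance rate as a proportion."""
    return admitted / applied

# Acceptance rates within each program.
per_program = {
    program: {sex: rate(*pair) for sex, pair in groups.items()}
    for program, groups in applications.items()
}

# Overall acceptance rates, ignoring the program variable.
overall = {}
for sex in ("men", "women"):
    admitted = sum(groups[sex][0] for groups in applications.values())
    applied = sum(groups[sex][1] for groups in applications.values())
    overall[sex] = rate(admitted, applied)

print(per_program)
print(overall)
```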

74 Chap 274

