
1 Chapter 3: Scatterplots, Correlation, and Least Squares Regression (Copyright © 2010 Pearson Education, Inc.)

2 Intro Vocabulary. Response variable: measures an outcome of a study. Explanatory variable: attempts to explain the observed outcome. The explanatory variable is also called the independent variable (X); the response variable is also called the dependent variable (Y).

3 Looking at Scatterplots. Scatterplots may be the most common and most effective display for data. In a scatterplot, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others. Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables. Always plot the explanatory variable on the horizontal axis.

4 Looking at Scatterplots (cont.). When looking at scatterplots, we will look for direction, form, strength, and unusual features. Direction: a pattern that runs from the upper left to the lower right is said to have a negative direction. A trend running the other way has a positive direction.

5 Looking at Scatterplots (cont.). Form: if there is a straight-line (linear) relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form.

6 Looking at Scatterplots (cont.). Form: if the relationship isn't straight but curves gently, while still increasing or decreasing steadily, we can often find ways to make it more nearly straight.

7 Looking at Scatterplots (cont.). Form: if the relationship curves sharply, these methods cannot really help us.

8 Looking at Scatterplots (cont.). Strength: at one extreme, the points appear to follow a single stream (whether straight, curved, or bending all over the place).

9 Looking at Scatterplots (cont.). Strength: at the other extreme, the points appear as a vague cloud with no discernible trend or pattern. Note: we will quantify the amount of scatter soon.

10 Looking at Scatterplots (cont.). Unusual features: look for the unexpected. Often the most interesting thing to see in a scatterplot is the thing you never thought to look for. One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot. Clusters or subgroups should also raise questions.

11 Roles for Variables. It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis. This determination is made based on the roles played by the variables. When the roles are clear, the explanatory or predictor (independent) variable goes on the x-axis, and the response (dependent) variable, the variable of interest, goes on the y-axis.

12 Roles for Variables (cont.). The roles that we choose for variables are more about how we think about them than about the variables themselves. Just placing a variable on the x-axis doesn't necessarily mean that it explains or predicts anything, and the variable on the y-axis may not respond to it in any way.

13 TI-Tips: Scatterplots. First, name your lists (later slides call these lists YR and TUIT). Enter the years 1990 to 2000 as 0, 1, 2, ..., 10 in the first list (L1). In the next list (L2), enter the following values: 6546, 6996, 6996, 7350, 7500, 7978, 8377, 8710, 9110, 9411, 9800. Go to STAT PLOT, choose the scatterplot icon, identify which Xlist and Ylist to use (L1 and L2), and choose a symbol for displaying the points. Then ZoomStat. (If you ever get ERR:DIM MISMATCH, it means your x and y lists don't have the same number of entries, or another STAT PLOT is still turned on.) TRACE will show you the value of each point.
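For readers following along in Python instead of on a TI calculator, here is a minimal sketch of the same scatterplot. The data values come from the slide above; treating the second list as annual tuition is an assumption based on the TUIT list name used later.

```python
import matplotlib.pyplot as plt

# Years 1990-2000 coded as 0..10 (the slide's L1 / "YR" list)
year = list(range(11))
# Values from the slide's second list (assumed to be tuition, "TUIT")
tuition = [6546, 6996, 6996, 7350, 7500, 7978, 8377, 8710, 9110, 9411, 9800]

assert len(year) == len(tuition)  # the TI's ERR:DIM MISMATCH check, done by hand

plt.scatter(year, tuition)        # explanatory (x) vs. response (y)
plt.xlabel("Years since 1990")
plt.ylabel("Tuition")
plt.show()
```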

14 Correlation. Data collected from students in Statistics classes included their heights (in inches) and weights (in pounds). Here we see a positive association and a fairly straight form, although there seems to be a high outlier.

15 Correlation (cont.). How strong is the association between weight and height of Statistics students? If we had to put a number on the strength, we would not want it to depend on the units we used. A scatterplot of heights (in centimeters) and weights (in kilograms) doesn't change the shape of the pattern.

16 Correlation (cont.). Since the units don't matter, why not remove them altogether? We could standardize both variables and write the coordinates of a point as (z_x, z_y). Here is a scatterplot of the standardized weights and heights:
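Not part of the original slides: a minimal Python sketch of the standardization step. The height/weight pairs here are made up purely for illustration.

```python
import numpy as np

def standardize(values):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)  # sample SD

# Hypothetical height (inches) / weight (pounds) pairs, just to show the idea
heights = np.array([61, 64, 66, 68, 70, 72, 75], dtype=float)
weights = np.array([120, 135, 140, 155, 160, 175, 210], dtype=float)

zx, zy = standardize(heights), standardize(weights)
print(list(zip(zx.round(2), zy.round(2))))  # the (z_x, z_y) coordinates
```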

17 Correlation (cont.). Note that the underlying linear pattern seems steeper in the standardized plot than in the original scatterplot. That's because we made the scales of the axes the same. Equal scaling gives a neutral way of drawing the scatterplot and a fairer impression of the strength of the association.

18 Correlation (cont.). Some points (those in green) strengthen the impression of a positive association between height and weight. Other points (those in red) tend to weaken the positive association. Points with z-scores of zero (those in blue) don't vote either way.

19 Correlation (cont.). The correlation coefficient, r, gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables. In terms of the standardized values, r = Σ(z_x z_y) / (n - 1).

20 Correlation Conditions. Correlation measures the strength of the linear association between two quantitative variables. Before you use correlation, you must check several conditions: the Quantitative Variables Condition, the Straight Enough Condition, and the Outlier Condition.

21 Correlation Conditions (cont.). Quantitative Variables Condition: correlation applies only to quantitative variables. Don't apply correlation to categorical data masquerading as quantitative. Check that you know the variables' units and what they measure.

22 Correlation Conditions (cont.). Straight Enough Condition: you can calculate a correlation coefficient for any pair of variables, but correlation measures the strength only of the linear association and will be misleading if the relationship is not linear.

23 Correlation Conditions (cont.). Outlier Condition: outliers can distort the correlation dramatically. An outlier can make an otherwise small correlation look big or hide a large correlation. It can even give an otherwise positive association a negative correlation coefficient (and vice versa). When you see an outlier, it's often a good idea to report the correlations with and without the point.

24 TI Tips: Finding Correlations. You must first tell your calculator to report correlations. Do this once and it stays set until you change your batteries: press 2nd CATALOG, scroll down until you find DiagnosticOn, and hit ENTER; it should say DONE. Check the conditions first by looking at a scatterplot: does the association look linear, and are there outliers? Then STAT, CALC, select 8:LinReg(a+bx), and ENTER. Add the names of your x and y lists (2nd STAT), separated by a comma, then hit ENTER. You will see lots of numbers; for now we will use only r. What does it mean? Let's find out!
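Not from the slides: a short Python check of the same calculation, reusing the year/tuition lists entered earlier. Both the built-in routine and the z-score formula give the same r.

```python
import numpy as np

# The year/tuition lists from the earlier scatterplot example
year = np.arange(11, dtype=float)
tuition = np.array([6546, 6996, 6996, 7350, 7500, 7978, 8377, 8710,
                    9110, 9411, 9800], dtype=float)

# Built-in computation
r = np.corrcoef(year, tuition)[0, 1]

# Same thing from z-scores: r = sum(z_x * z_y) / (n - 1)
zx = (year - year.mean()) / year.std(ddof=1)
zy = (tuition - tuition.mean()) / tuition.std(ddof=1)
r_manual = (zx * zy).sum() / (len(year) - 1)

print(round(r, 4), round(r_manual, 4))  # both give the same correlation
```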

25 Correlation Properties. The sign of a correlation coefficient gives the direction of the association. Correlation is always between -1 and +1. Correlation can be exactly equal to -1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line. A correlation near zero corresponds to a weak linear association.

26 Correlation Properties (cont.). Correlation treats x and y symmetrically: the correlation of x with y is the same as the correlation of y with x. Correlation has no units. Correlation is not affected by changes in the center or scale of either variable; it depends only on the z-scores, and they are unaffected by changes in center or scale.
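A quick sketch (with hypothetical numbers) confirming two of these properties: changing units and swapping the roles of x and y leave r unchanged.

```python
import numpy as np

x = np.array([61, 64, 66, 68, 70, 72, 75], dtype=float)        # inches (hypothetical)
y = np.array([120, 135, 140, 155, 160, 175, 210], dtype=float)  # pounds (hypothetical)

r_original = np.corrcoef(x, y)[0, 1]
# Change units: inches -> centimeters, pounds -> kilograms
r_rescaled = np.corrcoef(x * 2.54, y * 0.4536)[0, 1]
# Swap the roles of x and y
r_swapped = np.corrcoef(y, x)[0, 1]

print(r_original, r_rescaled, r_swapped)  # all three are identical
```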

27 Correlation Properties (cont.). Correlation measures the strength of the linear association between the two variables. Variables can have a strong association but still have a small correlation if the association isn't linear. Correlation is sensitive to outliers: a single outlying value can make a small correlation large or make a large one small.

28 Correlation ≠ Causation. Whenever we have a strong correlation, it is tempting to explain it by imagining that the predictor variable has caused the response to change. Scatterplots and correlation coefficients never prove causation. A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable.

29 Straightening Scatterplots. Straight-line relationships are the ones that we can measure with correlation. When a scatterplot shows a bent form that consistently increases or decreases, we can often straighten the form of the plot by re-expressing one or both variables.

30 What Can Go Wrong? Don't say "correlation" when you mean "association." More often than not, people say correlation when they mean association. The word "correlation" should be reserved for measuring the strength and direction of the linear relationship between two quantitative variables.

31 What Can Go Wrong? Don't correlate categorical variables; be sure to check the Quantitative Variables Condition. Don't confuse "correlation" with "causation": scatterplots and correlations never demonstrate causation. These statistical tools can only demonstrate an association between variables.

32 What Can Go Wrong? (cont.) Be sure the association is linear. Two variables may have a strong association even when that association is not linear, and correlation will not capture it.

33 What Can Go Wrong? (cont.) Don't assume the relationship is linear just because the correlation coefficient is high. In the example shown on the slide, the correlation is 0.979, but the relationship is actually bent.

34 What Can Go Wrong? (cont.) Beware of outliers. Even a single outlier can dominate the correlation value. Make sure to check the Outlier Condition.

35 The Linear Model. The linear model (line of best fit, least squares line, regression line) is just an equation of a straight line through the data that shows us how the values are associated. Using this line we will be able to predict values. Predicted values are denoted ŷ (called y-hat); the hat tells you they are predicted values. The difference between the observed value and the predicted value is called the residual: residual = observed - predicted = y - ŷ.

36 A negative residual means the predicted value is too big (an overestimate). A positive residual means the predicted value is too small (an underestimate). In the figure, the estimated fat content of the BK Broiler chicken sandwich is 36 g, while the true value is 25 g. What is the residual?

37 "Best Fit" Means Least Squares. Some residuals are positive, others are negative, and, on average, they cancel each other out. To measure how well the line fits the data, we square the residuals (to eliminate the negatives) and then find the sum of the squares. The smaller the sum, the better the fit. That is why another name for the line is the least squares line.
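A minimal sketch of that bookkeeping, using made-up observed and predicted values just to show the calculation:

```python
# Hypothetical observed values and the values a candidate line predicts for them
observed_y = [25, 31, 19, 40]
predicted_y = [27, 30, 22, 38]

residuals = [obs - pred for obs, pred in zip(observed_y, predicted_y)]
sse = sum(res ** 2 for res in residuals)   # sum of squared residuals

print(residuals, sse)  # the least squares line is the one that minimizes this sum
```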

38 If the variables are standardized (expressed as z-scores), the equation of the line of best fit is ẑ_y = r · z_x. Because both variables are on the same standardized scale, the slope of this line is simply the correlation r: a one-standard-deviation change in x predicts a change of r standard deviations in y.

39 Sometimes we are given the regression line in REAL UNITS!!! The regression line for the Burger King data fits the data well; the equation is predicted fat = 6.8 + 0.97(protein), with fat and protein measured in grams. Example: what is the predicted fat content for a BK Broiler chicken sandwich (with 30 g of protein)?
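Assuming the equation above, the prediction would be 6.8 + 0.97(30) = 35.9, or about 36 g of fat, which matches the estimate quoted on the earlier residual slide.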

40 To find the regression line (in real units): you may be given the standard deviations, correlation, and means, OR you may be given raw data.

41 First make sure a regression is appropriate. Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation: the Quantitative Variables Condition, the Straight Enough Condition (look at a scatterplot), and the Outlier Condition (look at a scatterplot).

42 To create the regression line in real units given the standard deviations, correlation (r), and means: you know the equation of a line, y = mx + b. In Statistics we use a slightly different notation: ŷ = b0 + b1x, where b1 is the slope and b0 is the intercept. The slope is b1 = r(sy/sx), which is always in units of y per unit of x, and the intercept is b0 = ȳ - b1·x̄, so the line always passes through the point (x̄, ȳ).
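A small sketch of those two formulas in code, with hypothetical summary statistics standing in for whatever a problem supplies:

```python
# Hypothetical summary statistics; the formulas are the point
r = 0.83                  # correlation
x_bar, s_x = 17.2, 4.0    # mean and SD of the explanatory variable
y_bar, s_y = 23.9, 9.8    # mean and SD of the response variable

b1 = r * s_y / s_x        # slope: r * (SD of y) / (SD of x)
b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)

print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```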

43 To find a regression line (linear model) from raw data: use your calculator! First, be sure to check that the variables are quantitative, that the relationship is linear (scatterplot), and that there are no outliers (scatterplot). If the variables are not quantitative, the relationship is not linear, or there are outliers, you will NOT be able to model the data with a linear model.

44 TI Tips: Equation of the Regression Line. STAT, CALC, choose LinReg(a+bx) (not the first one, the second one: scroll down!). Specify that x and y are YR and TUIT (we put these in our calculator before).

45 TI Tips: Equation of the Regression Line Graphed on the Scatterplot. STAT, CALC, choose LinReg(a+bx) (not the first one, the second one: scroll down!). Specify that x and y are YR and TUIT (we put these in our calculator before). We want the screen to say LinReg(a+bx) YR, TUIT, Y1 (this sends the equation to Y1 so we will see it on our graph). To add Y1 to the end: VARS, Y-VARS, 1:Function, choose Y1, and hit ENTER. You will see the equation; it has also been placed in Y1. Hit GRAPH.
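Not from the slides: the same fit-and-graph step done in Python with the year/tuition data, as a sanity check on what the calculator reports.

```python
import numpy as np
import matplotlib.pyplot as plt

year = np.arange(11, dtype=float)          # the "YR" list
tuition = np.array([6546, 6996, 6996, 7350, 7500, 7978, 8377, 8710,
                    9110, 9411, 9800], dtype=float)   # the "TUIT" list

slope, intercept = np.polyfit(year, tuition, deg=1)   # least-squares fit
print(f"TUIT-hat = {intercept:.1f} + {slope:.1f} * YR")

plt.scatter(year, tuition)
plt.plot(year, intercept + slope * year)   # regression line over the scatterplot
plt.xlabel("Years since 1990")
plt.ylabel("Tuition")
plt.show()
```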

46 Example: using the relationship between house price (in thousands of dollars) and house size (in thousands of square feet), the regression model is predicted price = -3.117 + 94.454(size). a. What is the slope, and what does it mean? b. What are the units of the slope? c. Your house is 2000 square feet bigger than your neighbor's house; how much more do you expect it to be worth? d. Is the y-intercept of -3.117 meaningful? Explain.

47 More about Residuals. A scatterplot of all the residuals should look completely random: it should show no bends and should have no outliers.

48 TI Tips: Residual Plots. You look at the scatterplot to make sure the relationship is linear, but sometimes it is hard to tell, so after you run a regression, make a residual plot. If the residual plot is completely random, you know your scatterplot was linear. The calculator automatically stores the residuals in a list named RESID after you run a regression. To look at them: STAT, EDIT, cursor over to RESID. To create the residual plot: STAT PLOT, Plot2, Xlist: YR and Ylist: RESID. Y= may still have your regression line in it; you can either turn it off or remove it. Then ZoomStat. Do you see a curve?
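The same residual plot built in Python (not from the slides), reusing the year/tuition data and the least-squares fit:

```python
import numpy as np
import matplotlib.pyplot as plt

year = np.arange(11, dtype=float)
tuition = np.array([6546, 6996, 6996, 7350, 7500, 7978, 8377, 8710,
                    9110, 9411, 9800], dtype=float)

slope, intercept = np.polyfit(year, tuition, deg=1)
predicted = intercept + slope * year
residuals = tuition - predicted            # observed minus predicted

plt.scatter(year, residuals)               # the residual plot
plt.axhline(0)                             # reference line at residual = 0
plt.xlabel("Years since 1990")
plt.ylabel("Residual")
plt.show()                                 # look for curvature or other patterns
```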

49 Example: our linear model for homes uses the model predicted price = -3.117 + (94.454)(size). a. Would you prefer to find a home with a negative or a positive residual? Explain. b. You plan to look for a home of about 3000 square feet; how much should you expect to have to pay? c. You find a nice home that size selling for $300,000. What's the residual?
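A short sketch working parts (b) and (c) from the model on the slide; remember the units (price in thousands of dollars, size in thousands of square feet).

```python
# Evaluate the slide's model for a 3000-square-foot home
def predicted_price(size_thousand_sqft):
    return -3.117 + 94.454 * size_thousand_sqft

pred = predicted_price(3.0)   # 3000 square feet = 3.0 thousand square feet
print(pred)                   # about 280.2, i.e. roughly $280,000

residual = 300.0 - pred       # asking price of $300,000, expressed in thousands
print(residual)               # about +19.8 thousand dollars (a positive residual)
```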

50 The Residual Standard Deviation. The standard deviation of the residuals, s_e, measures how much the points spread around the regression line. We estimate it using s_e = sqrt( Σ(residual²) / (n - 2) ). Interpretation: "Errors in predictions based on this model have a standard deviation of s_e" (in the units of y).
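A tiny sketch of that estimate, with hypothetical residuals:

```python
import numpy as np

residuals = np.array([2.1, -0.8, -3.4, 1.6, 0.5])   # hypothetical residuals
n = len(residuals)

s_e = np.sqrt((residuals ** 2).sum() / (n - 2))      # residual standard deviation
print(s_e)   # the typical size of a prediction error, in y's units
```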

51 R²: The Variation Accounted For. All regression analyses include this statistic, although by tradition it is written R² (pronounced "R-squared"). An R² of 0 means that none of the variability in the data is accounted for by the model; all of it is still in the residuals. When interpreting a regression model you need to tell what R² means: "The percentage of variability in y that is explained by x is R²." R² is always between 0% and 100%. What makes a "good" R² value depends on the kind of data you are analyzing and on what you want to do with it. Always report the slope and intercept for a regression along with R², so that readers can judge for themselves how successful the regression is at fitting the data.
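A minimal sketch of the definition. The observed and predicted values here are made up just to show the arithmetic; for a genuine least-squares fit, this quantity also equals the square of the correlation r.

```python
import numpy as np

observed = np.array([25.0, 31.0, 19.0, 40.0, 33.0])
predicted = np.array([27.0, 30.0, 22.0, 38.0, 32.0])   # from some fitted line

ss_residual = ((observed - predicted) ** 2).sum()       # variation left in residuals
ss_total = ((observed - observed.mean()) ** 2).sum()    # total variation in y

r_squared = 1 - ss_residual / ss_total   # fraction of variability accounted for
print(round(r_squared, 3))
```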

52 Assumptions and Conditions. Quantitative Variables Condition: regression can only be done on two quantitative variables (not on categorical variables). Straight Enough Condition: the linear model assumes that the relationship between the variables is linear (check with a scatterplot).

53 Assumptions and Conditions (cont.). If the scatterplot is not straight enough, stop here. You can only use a linear model on two variables that are related linearly! Some nonlinear relationships can be saved by re-expressing the data to make the scatterplot more linear.

54 Assumptions and Conditions (cont.). It's a good idea to check linearity again after computing the regression, when we can examine the residuals. Does the Plot Thicken? Condition: residual plots should show consistent, patternless scatter. Don't confuse this with the Normal probability plots from unit one (used to check whether a distribution is Normal), which should look like a straight line.

55 Assumptions and Conditions (cont.). Outlier Condition: watch out for outliers. Outlying points can dramatically change a regression model. Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables.

56 What Can Go Wrong? Don't fit a straight line to a nonlinear relationship. Beware extraordinary points (y-values that stand off from the linear pattern, or extreme x-values). Don't extrapolate beyond the data: the linear model may no longer hold outside the range of the data. Don't infer that x causes y just because there is a good linear model for their relationship; association is not causation. Don't choose a model based on R² alone.

57 A few IMPORTANT things to remember: "The percentage of variability in y that is explained by x is R²" (an example of this will be homework problem #7). Correlation: r = ±√(R²); you need to decide whether it is + or - depending on whether the association is positive or negative. residual = observed - predicted = y - ŷ. R² tells you how well the actual data fit the model (1 is a perfect linear fit; 0 means the model accounts for none of the variability). 1 - R² is the fraction of the original variation left in the residuals. Be careful not to use a regression to extrapolate (predict values beyond the scope/time frame of the data).

