Summarizing Bivariate Data

Chapter 5 Summarizing Bivariate Data

Create a scatterplot of the data below.
Suppose we found the age and weight for each person in a sample of 10 adults. Is there any relationship between the age and weight of these adults? Create a scatterplot of the data below.

Age   24   30   41   28   50   46   49   35   20   39
Wt    256  124  320  185  158  129  103  196  110  130

Do you think there is a relationship? If so, what kind? If not, why not?

There does not appear to be a relationship between age and weight in adults.

Note: Students should doubt that any relationship exists between age and weight. Ask students what feature of the scatterplot indicates that there is little to no relationship between age and weight in adults.

Create a scatterplot of the data below.
Suppose we found the height and weight for each person in a sample of 10 adults. Is there any relationship between the height and weight of these adults? Create a scatterplot of the data below.

Ht    74   65   77   72   68   60   62   73   61   64
Wt    256  124  320  185  158  129  103  196  110  130

Do you think there is a relationship? If so, what kind? If not, why not? Is it positive or negative? Weak or strong?

Note: Have students explain what features of the graph indicate that there is a relationship between height and weight in adults.

What does it mean if the relationship is positive?
Correlation

The relationship between bivariate numerical variables
    may be positive or negative
    may be weak or strong

What does it mean if the relationship is positive? Negative?
A positive relationship is one where, as x increases, y increases. A negative relationship is one where, as x increases, y decreases.

What feature(s) of the graph would indicate a weak or strong relationship?

Identify the strength and direction of the following data sets.
[Figure: four scatterplots labeled Set A, Set B, Set C, and Set D]

Direction should be positive or negative (or none). Strength should be weak or strong.

Set A shows a strong, positive linear relationship.
Set B shows little or no relationship.
Set C shows a weaker (moderate), negative linear relationship.
Set D shows a strong, positive curved relationship.

Note: Be sure to ask what feature(s) of the graph indicate the relationship.

Identify as having a positive relationship, a negative relationship, or no relationship.
(+)   Heights of mothers and heights of their adult daughters
(-)   Age of a car in years and its current value
(+)   Weight of a person and calories consumed
(no)  Height of a person and the person’s birth month
(-)   Number of hours spent in safety training and the number of accidents that occur

Note: Students need to think about the relationships between two variables.

Correlation Coefficient (r)
A quantitative assessment of the strength and direction of the linear relationship in bivariate, quantitative data.

Pearson’s sample correlation is used the most.
Population correlation coefficient: ρ (rho)
Sample (statistic) correlation coefficient: r

Equation:

$$r = \frac{\sum z_x z_y}{n-1} = \frac{1}{n-1}\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)$$

What are these values called? The quantities in parentheses in the formula are the z-scores for x and y.

Example 5.1
For the six primarily undergraduate universities in California with enrollments between 10,000 and 20,000, six-year graduation rates (y) and student-related expenditures per full-time student (x) for 2003 were reported as follows:

Expenditures       8011   7323   8735   7548   7071   8248
Graduation rates   64.6   53.0   46.3   42.5   38.5   33.9

Create a scatterplot and calculate r.

Note: Use the formula to compute the correlation coefficient by using the lists in the graphing calculator. Show students how to calculate r by using the linear regression command in the calculator.

Example 5.1 Continued

Expenditures       8011   7323   8735   7548   7071   8248
Graduation rates   64.6   53.0   46.3   42.5   38.5   33.9

[Figure: scatterplot of Graduation Rates against Expenditures]

r = 0.05

In order to interpret what this number tells us, let’s investigate the properties of the correlation coefficient.
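As a cross-check on the calculator result, the z-score formula can be sketched in Python. This is a minimal illustration rather than the method used in the slides (which rely on a graphing calculator); the function name `correlation` is an illustrative choice, not something from the source.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's sample correlation: sum of z-score products divided by (n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Product of the z-scores for each (x, y) pair
    z_products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Data from Example 5.1
expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
graduation_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r = correlation(expenditures, graduation_rates)
print(round(r, 2))
```

Rounded to two decimal places, the result agrees with the r = 0.05 reported on the slide.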

Properties of r (correlation coefficient)
1) legitimate values are -1 ≤ r ≤ 1
[Figure: scale of r values with regions labeled strong correlation, moderate correlation, weak correlation, and no correlation]

2) value of r is not changed by any linear transformation
Expenditures       8011   7323   8735   7548   7071   8248
Graduation rates   64.6   53.0   46.3   42.5   38.5   33.9

Suppose that the graduation rates were changed from percents to decimals (divide by 100). Transform the graduation rates and calculate r.

Do the following transformations and calculate r:
    1) x’ = 5(x + 14)
    2) y’ = (y + 30) ÷ 4

r = 0.05. It is the same! Why?

Note: Investigate how transformations affect the correlation coefficient.

3) value of r does not depend on which of the two variables is labeled x

Expenditures       8011   7323   8735   7548   7071   8248
Graduation rates   64.6   53.0   46.3   42.5   38.5   33.9

Suppose we wanted to estimate the expenditures per student for given graduation rates. Switch x and y, then calculate r.

r = 0.05. It is the same!

Note: Investigate how switching x and y affects the correlation coefficient.
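Properties 2 and 3 can be checked numerically with a short Python sketch. The helper `correlation` below is an illustrative name; it uses the sums-of-squares form of r, which is algebraically equivalent to the z-score formula. The transformations are the ones given on the slides.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r written with sums of squares (equivalent to the z-score form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
graduation_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r_original = correlation(expenditures, graduation_rates)

# Property 2: percents to decimals, then x' = 5(x + 14) and y' = (y + 30) / 4
rates_as_decimals = [y / 100 for y in graduation_rates]
x_prime = [5 * (x + 14) for x in expenditures]
y_prime = [(y + 30) / 4 for y in graduation_rates]
r_decimals = correlation(expenditures, rates_as_decimals)
r_transformed = correlation(x_prime, y_prime)

# Property 3: switch the roles of x and y
r_swapped = correlation(graduation_rates, expenditures)

print(round(r_original, 4), round(r_decimals, 4),
      round(r_transformed, 4), round(r_swapped, 4))
```

All four values agree up to floating-point rounding, because these transformations rescale and shift the data without changing the z-scores, and the formula treats x and y symmetrically.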

4) value of r is affected by extreme values.
Expenditures       8011   7323   8735   7548   7071   8248
Graduation rates   64.6   53.0   46.3   42.5   38.5   33.9

Suppose the 33.9 was REALLY 63.9. What do you think would happen to the value of the correlation coefficient?

Plot a revised scatterplot and find r.

[Figure: original and revised scatterplots of Graduation Rates against Expenditures]

r = 0.42

Extreme values affect the correlation coefficient.

Note: Investigate how extreme values affect the correlation coefficient.
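The effect of changing a single value can be reproduced with a small Python sketch. It assumes, as the slide data suggest, that the last graduation rate of 33.9 is replaced by 63.9; the helper name `correlation` is illustrative.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's sample correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
original_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]
# Same data with only the last value changed from 33.9 to 63.9
revised_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 63.9]

r_original = correlation(expenditures, original_rates)
r_revised = correlation(expenditures, revised_rates)
print(round(r_original, 2), round(r_revised, 2))
```

Changing one of six observations moves r from about 0.05 to about 0.42, which is the sensitivity that property 4 describes.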

Does this mean that there is NO relationship between these points?
Find the correlation for these points:

[Table: x and y values]

Sketch the scatterplot and compute the correlation coefficient.

r = 0

Does this mean that there is NO relationship between these points? r = 0, but the data set has a definite relationship!

5) value of r is a measure of the extent to which x and y are linearly related

Note: Students should discover that the value of the correlation coefficient describes only the linear relationship of the data points.
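The data table for this slide did not survive in the transcript, so the following Python sketch uses hypothetical points on the parabola y = x² (an assumption made purely for illustration). It shows how a data set can have an exact, non-linear relationship and still give r = 0.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's sample correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data (not from the slides): every point lies exactly on y = x^2
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_parabola = correlation(xs, ys)
print(abs(round(r_parabola, 6)))
```

The positive and negative halves of the parabola cancel in the sum of products, so r is zero even though y is completely determined by x.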

Recap the Properties of r:
1) legitimate values of r are -1 ≤ r ≤ 1
2) value of r is not changed by any linear transformation
3) value of r does not depend on which of the two variables is labeled x
4) value of r is affected by extreme values
5) value of r is a measure of the extent to which x and y are linearly related

Example 5.1 Continued

Expenditures       8011   7323   8735   7548   7071   8248
Graduation rates   64.6   53.0   46.3   42.5   38.5   33.9

[Figure: scatterplot of Graduation Rates against Expenditures]

Interpret r = 0.05.

In order to interpret r, recall the definition of the correlation coefficient: a quantitative assessment of the strength and direction of the linear relationship between bivariate, quantitative data. Use the definition to write an interpretation.

There is a weak, positive, linear relationship between expenditures and graduation rates.

Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable? Consider the following examples:

The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive. So does this mean I should feed children more candy to increase their vocabulary? These variables are both strongly related to the age of the child.

Consumption of hot chocolate is negatively correlated with crime rate. Should we all drink more hot chocolate to lower the crime rate? Both are responses to cold weather.

Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.

Note: Discuss why these variables are related. See page 207.

Correlation does not imply causation

What is the objective of regression analysis?
The objective of regression analysis is to use information about one variable, x, to draw some sort of a conclusion about a second variable, y.

x-variable: the independent or explanatory variable
y-variable: the dependent or response variable

We will use values of x to predict values of y.

Suppose that we have two variables:
    x = the amount spent on advertising
    y = the amount of sales for the product during a given period
What question might I want to answer using this data?

Be sure to put the hat on the y
Scatterplots frequently exhibit a linear pattern. When this is the case, it makes sense to summarize the relationship between the variables by finding a line that is as close as possible to the points in the plot. This is done by calculating the line of best fit, or least-squares regression line (LSRL).

The LSRL is the line that minimizes the sum of the squares of the deviations from the line. Let’s explore what this means . . .

The LSRL is

$$\hat{y} = a + bx$$

ŷ (y-hat) is the predicted y. Be sure to put the hat on the y.
b is the slope: the approximate amount by which y increases when x increases by 1 unit.
a is the y-intercept: the approximate height of the line when x = 0. In some situations, the y-intercept has no meaning.

The slope of the LSRL is

$$b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}$$

The intercept of the LSRL is

$$a = \bar{y} - b\bar{x}$$

Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2).
Let’s just fit a line to the data by drawing a line through what appears to be the middle of the points. Now find the vertical distance from each point to the line. Find the sum of the squares of these deviations.

Point     Predicted y           Deviation
(0,0)     y = .5(0) + 4 = 4     0 – 4 = -4
(3,10)    y = .5(3) + 4 = 5.5   10 – 5.5 = 4.5
(6,2)     y = .5(6) + 4 = 7     2 – 7 = -5

Sum of the squares = 61.25

What is the sum of the deviations from the line?
Will it always be zero?

Use a calculator to find the line of best fit for (0,0), (3,10) and (6,2). Find the vertical deviations from the line, then find the sum of the squares of the deviations from the line.

Point     Deviation
(0,0)     -3
(3,10)    6
(6,2)     -3

Sum of the squares = 54

The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
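The comparison between the eyeballed line and the LSRL for the three points can be sketched in Python. The function names are illustrative; the slope and intercept use the standard least-squares formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄.

```python
points = [(0, 0), (3, 10), (6, 2)]

def sum_squared_deviations(intercept, slope, pts):
    """Sum of squared vertical deviations of the points from y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in pts)

def least_squares_line(pts):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Eyeballed line from the earlier slide: y = 0.5x + 4
guess_ss = sum_squared_deviations(4, 0.5, points)

# Least-squares regression line
a, b = least_squares_line(points)
lsrl_ss = sum_squared_deviations(a, b, points)
lsrl_residual_sum = sum(y - (a + b * x) for x, y in points)

print(guess_ss, round(lsrl_ss, 6), round(a, 6), round(b, 6))
```

The eyeballed line gives a sum of squares of 61.25, while the LSRL gives 54, matching the slides. The deviations from the LSRL (-3, 6, -3) also sum to zero.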

Sketch a scatterplot for this data set.
Researchers are studying pomegranate’s antioxidant properties to see if it might be helpful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with .1% pomegranate fruit extract (PFE), and water supplemented with .2% PFE. The average tumor volume for mice in each group was recorded for several points in time.

x = number of days after injection of cancer cells in mice assigned to plain water
y = average tumor volume (in mm³)

[Table: x and y values]

Sketch a scatterplot for this data set.

Calculate the LSRL and the correlation coefficient.
Pomegranate study continued
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

[Table: x and y values]

Calculate the LSRL and the correlation coefficient. Interpret the slope and the correlation coefficient in context. Remember that an interpretation is stating the definition in context.

Slope: The average volume of the tumor increases by approximately b mm³ (the slope of the LSRL) for each one-day increase in the number of days after injection.

Correlation coefficient: There is a strong, positive, linear relationship between the average tumor volume and the number of days since injection.

Does the intercept have meaning in this context? Why or why not?

This is the danger of extrapolation
Pomegranate study continued
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

[Table: x and y values]

Predict the average volume of the tumor for 20 days after injection.

Predict the average volume of the tumor for 5 days after injection. Can volume be negative?

This is the danger of extrapolation. The least-squares line should not be used to make predictions for y using x-values outside the range in the data set. Why? It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.

Is this the appropriate regression line to answer this question?
Pomegranate study continued
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

[Table: x and y values]

Suppose we want to know how many days after injection of cancer cells the average tumor size would be 500 mm³. Is this the appropriate regression line to answer this question?

No. The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction. The line for predicting x from y has a different slope, and the intercepts are almost always different.

Will the point of averages always be on the regression line?
Pomegranate study continued
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

[Table: x and y values]

Find the mean of the x-values (x̄) and the mean of the y-values (ȳ).

x̄ = 19 and ȳ = 438

Plot the point of averages (x̄, ȳ) on the scatterplot. Will the point of averages always be on the regression line?

Let’s investigate how the LSRL and correlation coefficient change when different points are added to the data set.

Suppose we have the following data set.

[Table: x and y values]

Sketch a scatterplot. Calculate the LSRL and the correlation coefficient.

Suppose we add the point (5,8) to the data set. What happens to the regression line and the correlation coefficient? What happened?

Suppose we add the point (12,12) to the data set. What happens to the regression line and the correlation coefficient? What happened?

Suppose we add the point (12,0) to the data set. What happens to the regression line and the correlation coefficient? What happened?

The correlation coefficient and the LSRL are both measures that are affected by extreme values.

We will discuss what these numbers mean in Chapter 13.
Pomegranate study revisited
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

[Table: x and y values]

Minitab, a statistical software package, was used to fit the least-squares regression line. Part of the resulting output is shown below, with the slope and intercept marked. We will discuss what the other numbers mean in Chapter 13.

The regression equation is Predicted volume = days Predictor Coef SE Coef T P Constant 0.0014 Days 37.25 0.000

Assessing the fit of the LSRL
Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y. Important questions are:

    Is the line an appropriate way to summarize the relationship between x and y?
    Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
    If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?

We will look at graphical and numerical methods to answer these questions.

Plot the data, including the regression line.
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

x   6.94   5.23   5.21    7.10    8.16    5.50    9.19    9.05    9.36
y   0.00   6.13   11.29   14.35   12.03   22.72   20.11   26.16   30.65

Minitab was used to fit the least-squares regression line. From the partial output, identify the regression line.

Predictor            Coef    SE Coef   T       P
Constant             -7.69   13.33     -0.58   0.582
Distance to debris   3.234   1.782     1.82    0.112

S =          R-Sq = 32.0%     R-Sq(adj) = 22.3%

Plot the data, including the regression line.

If the point is above the line, the residual will be positive.
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

x   6.94   5.23   5.21    7.10    8.16    5.50    9.19    9.05    9.36
y   0.00   6.13   11.29   14.35   12.03   22.72   20.11   26.16   30.65

[Figure: scatterplot of Distance traveled against Distance to debris, with the LSRL]

The vertical deviation between the point and the LSRL is called the residual. Residuals are calculated by subtracting the predicted y from the observed y.

If the point is above the line, the residual will be positive.
If the point is below the line, the residual will be negative.

Use the LSRL to calculate the predicted distance traveled.
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Use the LSRL to calculate the predicted distance traveled. Subtract to find the residuals.

Distance from debris (x)   Distance traveled (y)   Predicted distance traveled   Residual
6.94                       0.00                    14.76                         -14.76
5.23                       6.13                    9.23                          -3.10
5.21                       11.29                   9.16                          2.13
7.10                       14.35                   15.28                         -0.93
8.16                       12.03                   18.70                         -6.67
5.50                       22.72                   10.10                         12.62
9.19                       20.11                   22.04                         -1.93
9.05                       26.16                   21.58                         4.58
9.36                       30.65                   22.59                         8.06

What does the sum of the residuals equal? Will the sum of the residuals always equal zero? What does this remind you of?

Note: This should remind students of the fact that the sum of the deviations from the mean is always zero.
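The predicted values and residuals for the deer mouse data can be recomputed with a short Python sketch. It fits the LSRL directly from the nine (x, y) pairs rather than using the rounded Minitab coefficients, so the numbers may differ from the slide in the last decimal place; the variable names are illustrative.

```python
# Deer mouse data: x = distance to debris (m), y = distance traveled (m)
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Residual = observed y minus predicted y
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"{xi:5.2f} {yi:6.2f} {pi:6.2f} {ri:7.2f}")
print("sum of residuals:", round(sum(residuals), 6))
```

The fitted slope and intercept agree with the Minitab output (3.234 and -7.69) after rounding, and the residuals sum to zero up to floating-point error, as they do for any least-squares line with an intercept.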

Residual plot: a scatterplot of the (x, residual) pairs.
Residuals can also be graphed against the predicted y-values. The purpose is to determine whether a linear model is an appropriate way to describe the relationship between the x and y variables. If no pattern exists between the points in the residual plot, then the linear model is appropriate.

This residual plot shows no pattern, so it indicates that the linear model is appropriate.
This residual plot shows a curved pattern, so it indicates that the linear model is not appropriate.

Plot the residuals against the distance from debris (x)
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Use the values in this table to create a residual plot for this data set.

Distance from debris (x)   Distance traveled (y)   Predicted distance traveled   Residual
6.94                       0.00                    14.76                         -14.76
5.23                       6.13                    9.23                          -3.10
5.21                       11.29                   9.16                          2.13
7.10                       14.35                   15.28                         -0.93
8.16                       12.03                   18.70                         -6.67
5.50                       22.72                   10.10                         12.62
9.19                       20.11                   22.04                         -1.93
9.05                       26.16                   21.58                         4.58
9.36                       30.65                   22.59                         8.06

Plot the residuals against the distance from debris (x).

Is a linear model appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food?

Now plot the residuals against the predicted distance traveled.
Since the residual plot displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food. Now plot the residuals against the predicted distance traveled.

What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values? Residual plots can be plotted against either the x-values or the predicted y-values.

Let’s examine the following data set:
The following data is for 12 black bears from the Boreal Forest. x = age (in years) and y = weight (in kg)

x 10.5 6.5 28.5 7.5 5.5 11.5 9.5
y 54 40 62 51 55 56 42 59 50

Sketch a scatterplot with the fitted regression line. Do you notice anything unusual about this data set?

Influential observation: This point is considered an influential point because it affects the placement of the least-squares regression line. What would happen to the regression line if this point is removed?

Let’s examine the following data set:
The following data is for 12 black bears from the Boreal Forest. x = age (in years) and y = weight (in kg)

x 10.5 6.5 28.5 7.5 5.5 11.5 9.5
y 54 40 62 51 55 56 42 59 50

An observation is an outlier if it has a large residual. Notice that this observation has a large residual.

Coefficient of determination
Denoted by r². It gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.

SS stands for “sum of squares,” so this is the total sum of squares.
Let’s explore the meaning of r² by revisiting the deer mouse data set.
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x   6.94   5.23   5.21    7.10    8.16    5.50    9.19    9.05    9.36
y   0.00   6.13   11.29   14.35   12.03   22.72   20.11   26.16   30.65

[Figure: scatterplot of Distance traveled against Distance to Debris]

Suppose you didn’t know any x-values. What distance would you expect deer mice to travel?

What is the total amount of variation in the distance traveled (y-values)? Hint: Find the sum of the squared deviations from the mean of y. Why do we square the deviations?

SS stands for “sum of squares,” so this is the total sum of squares (SSTo). The total amount of variation in the distance traveled is approximately 773.95 m².

The points vary from the LSRL by 526.27 m².
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x   6.94   5.23   5.21    7.10    8.16    5.50    9.19    9.05    9.36
y   0.00   6.13   11.29   14.35   12.03   22.72   20.11   26.16   30.65

[Figure: scatterplot of Distance traveled against Distance to debris, with the LSRL]

Now suppose you DO know the x-values. Your best guess would be the predicted distance traveled (the point on the LSRL). By how much do the observed points vary from the LSRL? Hint: Find the sum of the squared residuals.

The points vary from the LSRL by 526.27 m².

x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x   6.94   5.23   5.21    7.10    8.16    5.50    9.19    9.05    9.36
y   0.00   6.13   11.29   14.35   12.03   22.72   20.11   26.16   30.65

The total amount of variation in the distance traveled is approximately 773.95 m². The points vary from the LSRL by 526.27 m².

Approximately what percent of the variation in distance traveled can be explained by the regression line?

$$r^2 = 1 - \frac{526.27}{773.95} \approx 0.32$$

Or approximately 32%.
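The two sums of squares and r² can be recomputed from the deer mouse data with a minimal Python sketch. The names `ss_total` and `ss_resid` are illustrative labels for the total sum of squares and the sum of squared residuals; the LSRL is fit from the data rather than taken from the rounded Minitab output.

```python
# Deer mouse data: x = distance to debris (m), y = distance traveled (m)
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line fit from the data
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Variation about the mean of y (when x is ignored)
ss_total = sum((yi - mean_y) ** 2 for yi in y)
# Variation about the regression line (when x is used)
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - ss_resid / ss_total
print(round(ss_total, 2), round(ss_resid, 2), round(r_squared, 3))
```

The sum of squared residuals comes out close to the 526.27 m² quoted on the slide, and r² is about 0.32, in line with the R-Sq = 32.0% in the Minitab output.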

Let’s review the values from this output and their meanings.
Partial output from the regression analysis of deer mouse data:

Predictor            Coef    SE Coef   T       P
Constant             -7.69   13.33     -0.58   0.582
Distance to debris   3.234   1.782     1.82    0.112

S =          R-sq = 32.0%     R-sq(adj) = 22.3%

Let’s review the values from this output and their meanings. What does each number represent?

The y-intercept (a): This value has no meaning in context, since it doesn’t make sense to have a negative distance.

The slope (b): The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter in the distance to the nearest debris pile.

The standard deviation (s): This is the typical amount by which an observation deviates from the least-squares regression line. It’s found by:

$$s_e = \sqrt{\frac{\text{SSResid}}{n-2}}$$

The coefficient of determination (r²): Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.

The least-squares quadratic regression is
Let’s examine this data set:
x = representative age
y = average marathon finish time

Age    15       25       35       45       55       65
Time   302.38   193.63   185.46   198.49   224.30   288.71

Create a scatterplot for this data set.

[Figure: scatterplot of Average Finish Time against Representative Age]

Because of the curved pattern, a straight line would not accurately describe the relationship between average finish time and age. Since this curve resembles a parabola, a quadratic function can be used to describe this relationship.

Using Minitab, a least-squares quadratic regression curve of the form $\hat{y} = a + b_1 x + b_2 x^2$ is fit to the data. This curve minimizes the sum of the squares of the residuals (similar to least-squares linear regression).

Let’s examine this data set:
x = representative age
y = average marathon finish time

Age    15       25       35       45       55       65
Time   302.38   193.63   185.46   198.49   224.30   288.71

Notice the residuals from the quadratic regression. Here is the residual plot.

[Figure: scatterplot of Average Finish Time against Representative Age with the fitted curve, and a plot of Residuals against Age]

Since there is no pattern in the residual plot, the quadratic regression is an appropriate model for this data set.

Let’s examine this data set:
x = representative age
y = average marathon finish time

Age    15       25       35       45       55       65
Time   302.38   193.63   185.46   198.49   224.30   288.71

The measure R² is useful for assessing the fit of the quadratic regression.

R² = .921

92.1% of the variation in average marathon finish times can be explained by the approximate quadratic relationship between average finish time and age.
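The quadratic fit and its R² can be sketched in Python with NumPy. The slides use Minitab, so this is a stand-in rather than the original method; `np.polyfit` with `deg=2` performs an ordinary least-squares fit of a second-degree polynomial, and the variable names are illustrative.

```python
import numpy as np

# Marathon data from the slides
age = np.array([15, 25, 35, 45, 55, 65], dtype=float)
time = np.array([302.38, 193.63, 185.46, 198.49, 224.30, 288.71])

# Least-squares quadratic fit; coefficients are returned highest power first
coeffs = np.polyfit(age, time, deg=2)
fitted = np.polyval(coeffs, age)

# R^2 = 1 - SSResid / SSTo, the same definition used for the linear case
residuals = time - fitted
ss_resid = float(np.sum(residuals ** 2))
ss_total = float(np.sum((time - time.mean()) ** 2))
r_squared = 1 - ss_resid / ss_total

print("coefficients (x^2, x, constant):", coeffs)
print("R^2:", round(r_squared, 3))
```

The computed R² rounds to the .921 shown on the slide, and the positive coefficient on x² reflects the upward-opening curve seen in the scatterplot.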

Depending on the data set, other regression models, such as cubic regression, may be used. Statistical software (like Minitab) is commonly used to calculate these regression models. Another method for fitting regression models to non-linear data sets is to transform the data, making it linear. Then a least-squares regression line can be fit to the transformed data.

Commonly Used Transformations
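Transformation                                 Equation
No transformation                              $\hat{y} = a + bx$
Square root of x                               $\hat{y} = a + b\sqrt{x}$
Log of x *                                     $\hat{y} = a + b\log(x)$
Reciprocal of x                                $\hat{y} = a + b\left(\frac{1}{x}\right)$
Log of y * (exponential growth or decay)       $\widehat{\log(y)} = a + bx$

* Natural log may also be used.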

Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

x   11   15   19   23    27    31    35    39
y   40   75   90   210   230   330   450   600

Sketch a scatterplot for this data set.

[Figure: scatterplot of Average tumor volume against Number of days]

There appears to be a curve in the data points. Let’s use a transformation to linearize the data. Since the data appears to show exponential growth, let’s try the “log of y” transformation.

Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

x        11     15     19     23     27     31     35     39
Log(y)   1.60   1.88   1.95   2.32   2.36   2.52   2.65   2.78

Sketch a scatterplot of the log(y) and x.

[Figure: scatterplot of Log of Average tumor volume against Number of days]

Notice that the relationship now appears linear. Let’s fit an LSRL to the transformed data. The LSRL has the form $\widehat{\log(y)} = a + bx$.

Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

x        11     15     19     23     27     31     35     39
Log(y)   1.60   1.88   1.95   2.32   2.36   2.52   2.65   2.78

[Figure: scatterplot of Log of Average tumor volume against Number of days, with the LSRL]

What would the predicted average tumor size be 30 days after injection of cancer cells? Use the LSRL to predict log(y), then undo the logarithm to get the predicted volume.
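The fit on the transformed data and the back-transformed prediction can be sketched in Python. The sketch assumes the logarithm is base 10, which is consistent with the tabulated values (for example, log(40) ≈ 1.60), and it uses the rounded log(y) values exactly as listed; the variable names are illustrative.

```python
# Transformed pomegranate data (.2% PFE group): x = days, log10 of average tumor volume
days = [11, 15, 19, 23, 27, 31, 35, 39]
log_volume = [1.60, 1.88, 1.95, 2.32, 2.36, 2.52, 2.65, 2.78]

n = len(days)
mean_x = sum(days) / n
mean_log_y = sum(log_volume) / n

# Least-squares line for log(y) on x
slope = (sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(days, log_volume))
         / sum((x - mean_x) ** 2 for x in days))
intercept = mean_log_y - slope * mean_x

# Predict log(y) at 30 days, then undo the base-10 log to return to mm^3
predicted_log = intercept + slope * 30
predicted_volume = 10 ** predicted_log

print(round(intercept, 4), round(slope, 4), round(predicted_volume, 1))
```

Because 30 days lies inside the observed range of 11 to 39 days, this prediction is an interpolation rather than an extrapolation.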

Power Transformation Ladder
Another useful transformation is the power transformation. The power transformation ladder and the scatterplot (both below) can be used to help determine what type of transformation is appropriate.

Power   Transformed value     Name
3       (Original value)³     Cube
2       (Original value)²     Square
1       (Original value)      No transformation
1/2     √(Original value)     Square root
1/3     ∛(Original value)     Cube root
0       Log(Original value)   Logarithm
-1      1/(Original value)    Reciprocal

[Figure: scatterplot showing curves labeled 1 and 2]

Suppose that the scatterplot looks like the curve labeled 1. Then we would use a power that is up the ladder from the no transformation row for both the x and y variables.

Suppose that the scatterplot looks like the curve labeled 2. Then we would use a power that is up the ladder from the no transformation row for the x variable and a power down the ladder for the y variable.

Logistic Regression (Optional)
Can be used if the dependent variable is categorical with just two possible values.
Used to describe how the probability of “success” changes as a numerical predictor variable, x, changes.

With p denoting the probability of success, the logistic regression equation is

$$p = \frac{e^{a+bx}}{1+e^{a+bx}}$$

where a and b are constants.

The graph of this equation has an “S” shape. For any value of x, the value of p is always between 0 and 1.

Note: Comment on the ±b.
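The behavior of the logistic equation can be sketched in Python. The constants below are hypothetical values chosen only for illustration; they are not the fitted values from any study in these slides, and the function name `logistic_probability` is likewise an assumption.

```python
from math import exp

def logistic_probability(x, a, b):
    """Probability of success p = e^(a + b*x) / (1 + e^(a + b*x))."""
    t = exp(a + b * x)
    return t / (1 + t)

# Hypothetical constants (not from the slides), used only to show the S shape
a_example, b_example = -1.0, 2.0

xs = [-2, -1, 0, 1, 2]
probabilities = [logistic_probability(x, a_example, b_example) for x in xs]
for x, p in zip(xs, probabilities):
    print(f"x = {x:2d}  p = {p:.3f}")
```

With a positive b the probabilities rise with x, and with a negative b they would fall; in both cases every value stays strictly between 0 and 1.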

In a study on wolf spiders, researchers were interested in what variables might be related to a female wolf spider’s decision to kill and consume her partner during courtship or mating. Data was collected for 53 pairs of courting wolf spiders. (Data listed on page 287)

x = the difference in body width (female – male)
y = cannibalism; coded 0 for no cannibalism and 1 for cannibalism

Minitab was used to construct a scatterplot and to fit a logistic regression to the data. Note that the plot was constructed so that if two points fell in the exact same location, they would be offset a little bit so that all points would be visible (called jittering).

This equation can be used to predict the probability of the male spider being cannibalized based on the difference in size. What is the probability of cannibalism if the male and female spiders are the same width (difference of 0)?
