Bivariate Data Explores relationships between two quantitative variables.
The explanatory variable attempts to explain the observed outcomes. (In algebra this is your independent variable – “x”)
The response variable measures an outcome of a study. (In algebra this is your dependent variable – “y”)
When we gather data, we usually have in mind which variables are which. Beware! – this explanatory/response relationship suggests a cause and effect relationship that may not exist in all data sets. Use common sense!!
Displaying the Variables We always graph our data, right? Use a scatterplot to graph the relationship between two quantitative variables. Each point represents one individual.
Remember that not all relationships are linear!!! We will talk about non-linear in the next unit.
Interpret a Scatterplot Here is what we look for: 1) Direction (positive, negative) 2) Shape, or form (linear or not linear) 3) Strength (correlation, r) 4) Unusual features, or deviations from the pattern (outliers). Remember the initials as SUDS!!
Remember, an outlier is an individual observation that falls outside the overall pattern of the graph. There is no outlier test for bivariate data. It’s a judgment call.
Categorical variables can be added to scatterplots by changing the symbols in the plot. (See P. 199 for examples) Visual inspection is often not a good judge of how strong a linear relationship is. Changing the plotting scales or the amount of white space around a cloud of points can be deceptive. So….
Facts about Correlation: 1) positive r – positive association (positive slope); negative r – negative association (negative slope). 2) r must fall between –1 and 1 inclusive. 3) r values close to –1 or 1 indicate that the points lie close to a straight line. 4) r values close to 0 indicate a weak linear relationship. 5) r values of –1 or 1 indicate a perfect linear relationship. 6) correlation only measures the strength of linear relationships (not curves). 7) correlation can be strongly affected by extreme values (outliers).
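A minimal Python sketch of facts 5, 6 and 7. The helper name `correlation` and all of the data are made up for illustration; only the standard library is used.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r for two paired lists of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]

# Fact 5: points exactly on a line with positive slope give r = 1.
r_line = correlation(x, [2 * v + 1 for v in x])

# Fact 6: a perfect curve (y = x^2, symmetric about 0) can still give r = 0.
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

# Fact 7: adding one extreme point (20, 0) to the perfect line changes r a lot.
r_outlier = correlation(x + [20], [3, 5, 7, 9, 11, 0])

print(r_line, r_curve, r_outlier)
```

In this made-up example the single extreme point is enough to turn a perfect positive correlation into a negative one, which is why r should always be read alongside the scatterplot.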
Least-Squares Regression Line The least-squares regression line (LSRL) is a mathematical model for the data. This line is also known as the line of best fit or the regression line.
Formal definition… The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
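The definition can be checked numerically. The sketch below uses the usual closed-form slope and intercept and compares the sum of squared vertical distances for the least-squares line against two nearby lines. The helper names `fit_lsrl` and `sum_sq_vertical` are hypothetical, and the hours/score data are invented so that the fit comes out as 52 + 1.5x.

```python
def fit_lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_sq_vertical(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Invented data for illustration only.
hours = [0, 2, 4, 6]
scores = [53, 53, 59, 61]

a, b = fit_lsrl(hours, scores)
best = sum_sq_vertical(hours, scores, a, b)

# Two other candidate lines: one steeper, one shifted up. Both do worse.
steeper = sum_sq_vertical(hours, scores, a, b + 0.5)
shifted = sum_sq_vertical(hours, scores, a + 1, b)

print(f"Predicted score = {a:g} + {b:g}(hours studied)")
print(best, steeper, shifted)
```

Trying two alternative lines is not a proof, only a spot check that nudging the slope or the intercept makes the sum of squares larger for these data.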
Why do we do regression? The purpose of regression is to determine a model that we can use for making predictions.
Communication is always the goal!!! When we write the equation for a LSRL we do not use x & y, we use the variable names themselves… For example: Predicted score = 52 + 1.5(hours studied)
Another measure of strength… The coefficient of determination, r², is the fraction of the variation in the value of y that is explained by the linear model. When we explain r² then we say… ___% of the variability in ___(y) can be explained by this linear model.
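A short Python sketch of the coefficient of determination, computed two ways: as the square of r, and as 1 minus (unexplained variation / total variation). The hours/score data are invented for illustration, and the variable names are assumptions, not anything from the course materials.

```python
import math

# Invented data for illustration only.
hours = [0, 2, 4, 6]
scores = [53, 53, 59, 61]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)   # total variation in y
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))

# Way 1: square the correlation.
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Way 2: fit the least-squares line and compare leftover to total variation.
b = sxy / sxx
a = mean_y - b * mean_x
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, scores))
explained_fraction = 1 - sse / syy

print(f"{100 * r_squared:.1f}% of the variability in score "
      "can be explained by this linear model.")
```

Both routes give the same number for a least-squares line, which is why r² can be read as "fraction of variation explained".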
Deviations for single points A residual is the vertical difference between an actual point and the LSRL at one specific value of x. That is, Residual = observed y – predicted y, or Residual = y – ŷ. The mean of the residuals is always zero.
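A small Python sketch of residuals for a least-squares line, confirming that they average to zero. The helper `fit_lsrl` is a hypothetical name and the data are invented for illustration.

```python
def fit_lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Invented data for illustration only.
hours = [0, 2, 4, 6]
scores = [53, 53, 59, 61]

a, b = fit_lsrl(hours, scores)
predicted = [a + b * x for x in hours]

# Residual = observed y - predicted y, one per data point.
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]
mean_residual = sum(residuals) / len(residuals)

print(residuals, mean_residual)
```

A positive residual means the point sits above the line (the model under-predicted); a negative residual means it sits below.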
A new plot… A residual plot plots the residuals on the vertical axis against the explanatory variable on the horizontal axis. Such a plot magnifies the residuals and makes patterns easier to see.
Why do I need a residual plot? Remember that not all data are linear in shape!!! The residual plot clearly shows whether a linear model is appropriate. A residual plot shows a good linear fit when the points are randomly scattered about y = 0 with no obvious pattern.
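To see what a "pattern" looks like, here is a Python sketch that fits a line to deliberately curved, made-up data (y = x²) and lists the residuals in order of x. The helper name `fit_lsrl` is an assumption, and no plotting library is used; the residuals are simply printed.

```python
def fit_lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up curved data: y = x^2, so a straight line is the wrong model.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

a, b = fit_lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Residuals listed against x: positive at both ends, negative in the middle.
for x, res in zip(xs, residuals):
    print(f"x = {x}: residual = {res:+.1f}")
```

The residuals run positive, then negative, then positive again. That U shape is the kind of obvious pattern that says a line is not appropriate, even though the residuals still average to zero.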
To create a residual plot on the calculator: 1) You must have done a linear regression with the data you wish to use. 2) From the Stat-Plot, Plot # menu, choose scatterplot and leave the x-list with the x values. 3) Change the y-list to “RESID”, chosen from the list menu. 4) Zoom – 9
In scatterplots we can have points that are outliers or influential points or both. An observation can be an outlier in the x direction, the y direction, or in both directions. An observation is influential if removing it (or adding it) would markedly change the position of the regression line.
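One way to judge influence is to fit the line with and without the point and compare. The Python sketch below does this for two made-up extra points: one unusual only in y, one unusual in x. The helper name `fit_lsrl` and all data are assumptions for illustration.

```python
def fit_lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up base data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
base_a, base_b = fit_lsrl(xs, ys)

# Add a point that is an outlier in y only (x = 3 is the middle of the data).
y_out_a, y_out_b = fit_lsrl(xs + [3], ys + [20])

# Add a point that is an outlier in x (x = 15 is far from the other x values).
x_out_a, x_out_b = fit_lsrl(xs + [15], ys + [2])

print("base slope:", base_b)
print("slope with y-outlier:", y_out_b, "intercept:", y_out_a)
print("slope with x-outlier:", x_out_b, "intercept:", x_out_a)
```

With these numbers the y-outlier lifts the line but leaves its slope alone, while the x-outlier flips the slope from positive to negative. Points far out in the x direction tend to be the influential ones, though whether a change counts as "marked" is still a judgment call.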
Extrapolation is the use of a regression model for prediction outside the domain of values of the explanatory variable x. Such predictions cannot be trusted.
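A hedged Python sketch of a prediction helper that flags extrapolation. It uses the example line Predicted score = 52 + 1.5(hours studied); the observed range of 0 to 6 hours and the helper name `predict_score` are assumptions made up for illustration.

```python
# Example line: Predicted score = 52 + 1.5(hours studied).
INTERCEPT = 52.0
SLOPE = 1.5

# Assumption for illustration: the data behind the line covered 0 to 6 hours.
OBSERVED_MIN_HOURS = 0.0
OBSERVED_MAX_HOURS = 6.0

def predict_score(hours):
    """Return (predicted score, True if the prediction is an extrapolation)."""
    predicted = INTERCEPT + SLOPE * hours
    is_extrapolation = not (OBSERVED_MIN_HOURS <= hours <= OBSERVED_MAX_HOURS)
    return predicted, is_extrapolation

inside = predict_score(4)    # within the observed range of hours
outside = predict_score(40)  # far beyond the observed range of hours

print(inside)
print(outside)
```

The model happily returns 112 for 40 hours of study. If the test were scored out of 100 (another assumption), that value would be impossible, which is the kind of nonsense extrapolation can produce.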
Association vs. Causation A strong association between two variables is NOT enough to draw conclusions about cause & effect.
Association vs Causation A strong association between two variables x and y can reflect: A) Causation – a change in x causes a change in y. B) Common response – both x and y are responding to some other unobserved factor. C) Confounding – the effect on y of the explanatory variable x is hopelessly mixed up with the effects on y of other variables.
A Lurking Variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables being studied. Lurking variables can suggest a relationship when there isn’t one or can hide a relationship that exists.
Association vs Causation Cause and effect can only be determined from a well-designed experiment.
Data with no apparent linear relationship can also be examined in two ways to see if a relationship still exists: 1) Check to see if breaking the data down into subsets or groups makes a difference. 2) If the data is curved in some way and not linear, a relationship still exists. We will explore that in the next chapter.