# Chapter 4 The Relation between Two Variables


Chapter 4 The Relation between Two Variables
Prof. Felix Apfaltrer Office:N518 Phone: X 7421 Office hours: Tue/Thu 1:30-3pm

A mathematical model is a mathematical expression
that represents some phenomenon. It can be a deterministic model or a probabilistic model. Models often describe the relationship between two variables.

Learning objectives
1. Draw and interpret scatter diagrams
2. Understand the properties of the linear correlation coefficient
3. Compute and interpret the linear correlation coefficient

4.1. Scatter Diagrams and Correlation
When dealing with two variables, we try to see the relationship between them. Some examples are:
- Rainfall amounts and plant growth (possible lurking variable: sunlight)
- Exercise and cholesterol levels for a group of people (possible lurking variable: diet)
- Height and weight for a group of people
- Height and the fastest speed you have ever driven a car

Sometimes there is a third variable that is not considered but that affects the results (a lurking variable). For example, shoe size does not cause height to change; age affects both variables. Therefore, a relationship between A and B does not by itself let us conclude that variable A causes B.

When we have two variables, they could be related in one of several different ways:
- They could be unrelated
- One variable (the explanatory or predictor variable) could be used to explain the other (the response or dependent variable)
- One variable could be thought of as causing the other variable to change

Scatter Diagrams
A scatter diagram is a graph that visually shows the relationship between two quantitative variables. The explanatory variable is plotted on the horizontal axis (x-axis) and the response variable on the vertical axis (y-axis). The response variable is the variable whose value can be explained by the value of the explanatory variable.

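Linear Correlation
The linear correlation coefficient is a measure of the strength and direction of the linear relation between two quantitative variables. The sample correlation coefficient r is

$$r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

where $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations, and $n$ is the number of observations. This should be computed with software (and not by hand) whenever possible.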

Answer ‘How Strong Is the Linear Relationship Between 2 Variables?’
- The coefficient of correlation is used
- The population correlation coefficient is denoted ρ (rho)
- Values range from -1 to +1
- It measures the degree of association
- The sign of r indicates the direction of the relationship:
  - Positive: the two variables tend to increase together
  - Negative: as one variable increases, the other is likely to decrease
- Used mainly for understanding

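[Figure: correlation scale running from -1.0 (perfect negative correlation) through 0 (no correlation) to +1.0 (perfect positive correlation), with marks at -0.5 and +0.5. The degree of negative correlation increases toward -1.0 and the degree of positive correlation increases toward +1.0.]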

Examples of positive correlation
- Strong positive: r = 0.8
- Moderate positive: r = 0.5
- Very weak: r = 0.1

Examples of negative correlation
- Strong negative: r = -0.8
- Moderate negative: r = -0.5
- Very weak: r = -0.1

In general, if the correlation is visible to the eye, then it is likely to be strong.

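Shortcut formula for r:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$

Data:

| x | y |
|---|---|
| 1 | 2 |
| 1 | 8 |
| 3 | 6 |
| 5 | 4 |

With $n = 4$, $\sum x = 10$, $\sum y = 20$, $\sum x^2 = 36$, $\sum y^2 = 120$, $\sum xy = 48$:

$$r = \frac{4(48) - (10)(20)}{\sqrt{4(36) - (10)^2}\,\sqrt{4(120) - (20)^2}} = \frac{-8}{59.329} = -0.135$$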
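Since the slides recommend software over hand computation, here is a minimal Python sketch of the shortcut formula applied to the four data points of this example. The function name `correlation` is only illustrative, and the (x, y) pairing is an assumption chosen because it reproduces the sums quoted in the worked computation.

```python
from math import sqrt

# Assumed (x, y) pairing for the example; it reproduces the quoted sums
# n = 4, sum x = 10, sum y = 20, sum x^2 = 36, sum y^2 = 120, sum xy = 48.
xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]


def correlation(xs, ys):
    """Linear correlation coefficient r via the shortcut formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r = correlation(xs, ys)
print(round(r, 3))  # -0.135
```

The negative sign and small magnitude agree with the hand computation: a very weak negative linear relation.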

Correlation is not causation!
Just because two variables are correlated does not mean that one causes the other to change. For example, there is a strong correlation between shoe sizes and vocabulary sizes for grade school children:
- Clearly, larger shoe sizes do not cause larger vocabularies
- Clearly, larger vocabularies do not cause larger shoe sizes

Often lurking variables result in confounding.
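The shoe size and vocabulary example can be imitated with a small simulation. The sketch below uses entirely made-up, synthetic numbers (the coefficients, noise levels, and sample size are assumptions, not data from the course): age drives both variables, and neither variable influences the other. Correlation is computed once across all ages and once with age held fixed.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Linear correlation coefficient r via the shortcut formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / (sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic children: age (the lurking variable) drives both measurements.
ages = [rng.uniform(6, 12) for _ in range(500)]
shoe_sizes = [0.5 * age + rng.gauss(0, 0.5) for age in ages]
vocab_sizes = [1000 * age + rng.gauss(0, 1000) for age in ages]
r_all = pearson_r(shoe_sizes, vocab_sizes)

# Same made-up model, but every child is exactly 9 years old.
shoe_fixed = [0.5 * 9 + rng.gauss(0, 0.5) for _ in range(500)]
vocab_fixed = [1000 * 9 + rng.gauss(0, 1000) for _ in range(500)]
r_fixed_age = pearson_r(shoe_fixed, vocab_fixed)

print(f"all ages: r = {r_all:.2f}; age held fixed: r = {r_fixed_age:.2f}")
```

Under these assumptions the correlation across all ages should be clearly positive, while the correlation at a single age should be close to zero, which is what a lurking variable would produce.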

Summary: Chapter 4 – Section 1
Correlation between two variables can be described with both visual and numeric methods
- Visual methods: scatter diagrams (analogous to histograms for single variables)
- Numeric methods: linear correlation coefficient (analogous to mean and variance for single variables)
- Care should be taken in the interpretation of linear correlation (nonlinearity and causation)

Chapter 4 – Section 2
Learning objectives
1. Find the least-squares regression line and use the line to make predictions and estimations
2. Interpret the slope and the y-intercept of the least-squares regression line
3. Compute the sum of squared residuals

If we have two variables X and Y, we often would like to model the relation as a line
Draw a line through the scatter diagram. We want to find the line that “best” describes the linear relationship: the regression line.

Linear Equations We want to use a linear model
Linear models can be written in several different (equivalent) ways:
- $y = mx + b$
- $y - y_1 = m(x - x_1)$
- $y = b_1 x + b_0$

Because the slope and the intercept are important to analyze, we will use $y = b_1 x + b_0$.


The formula for the residual is always Residual = Observed – Predicted
One difference between math and statistics is that statistics assumes that the measurements are not exact, that there is an error or residual.

[Figure: what the residual is on the scatter diagram, showing the model line, the x value of interest, the observed value $y$, the predicted value $\hat{y}$, and the residual between them.]

The equation for the least-squares regression line is given by $\hat{y} = b_1 x + b_0$
- $b_1$ is the slope of the least-squares regression line (marginal change)
- $b_0$ is the y-intercept of the least-squares regression line

Least-Squares Property
A straight line satisfies this property if the sum of the squares of the residuals is the smallest sum possible.

Example: $\hat{y} = 5 + 4x$ for the data

| x | y |
|---|---|
| 1 | 4 |
| 2 | 24 |
| 4 | 8 |
| 5 | 32 |
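As a rough check of the least-squares property, the sketch below computes the sum of squared residuals for the line $\hat{y} = 5 + 4x$ on the four points (1, 4), (2, 24), (4, 8), (5, 32), and compares it with two arbitrarily chosen alternative lines. The alternatives and the helper name `sum_squared_residuals` are illustrative assumptions; comparing against two lines does not prove minimality, it only illustrates it.

```python
xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]


def sum_squared_residuals(b0, b1, xs, ys):
    """Sum of (observed - predicted)^2 for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


best = sum_squared_residuals(5, 4, xs, ys)       # the regression line
shifted = sum_squared_residuals(6, 4, xs, ys)    # same slope, intercept moved up
tilted = sum_squared_residuals(5, 3.5, xs, ys)   # same intercept, flatter slope

print(best, shifted, tilted)  # 364 368 375.5
```

Both alternative lines give a larger sum than 364, as the property requires.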

Slope and y-intercept of the least-squares regression line (shortcut formulas):

$$b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad \text{(slope)}$$

$$b_0 = \bar{y} - b_1 \bar{x} \quad \text{(y-intercept)}$$

Calculators or computers can compute these values.

Finding the values of b1 and b0, by hand, is a very tedious process
You should use software for this. Finding the coefficients b1 and b0 is only the first step of a regression analysis:
- We need to interpret the slope b1
- We need to interpret the y-intercept b0
- We need to do quite a bit more statistical analysis; this is covered in Section 4.3 and also in Chapter 14

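The regression line is:
Data:

| x | y |
|---|---|
| 1 | 2 |
| 1 | 8 |
| 3 | 6 |
| 5 | 4 |

$n = 4$, $\sum x = 10$, $\sum y = 20$, $\sum x^2 = 36$, $\sum y^2 = 120$, $\sum xy = 48$

$$b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{4(48) - (10)(20)}{4(36) - (10)^2} = \frac{-8}{44} \approx -0.182$$

$$b_0 = \bar{y} - b_1 \bar{x} = 5 - (-0.182)(2.5) \approx 5.45$$

The estimated equation of the regression line is: $\hat{y} = 5.45 - 0.182x$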
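The same slope and intercept can be reproduced in a few lines of Python. This is a minimal sketch of the shortcut formulas, not a full regression routine; the function name `least_squares_line` is illustrative, and the (x, y) pairing is assumed from the sums given in the example.

```python
# Assumed (x, y) pairing; it reproduces n = 4, sum x = 10, sum y = 20,
# sum x^2 = 36, and sum xy = 48 from the worked example.
xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]


def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b1 * x + b0."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * (sum_x / n)  # y-bar minus b1 times x-bar
    return b0, b1


b0, b1 = least_squares_line(xs, ys)
print(round(b0, 2), round(b1, 3))  # 5.45 -0.182
```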

Guidelines for Using The Regression Equation
1. If there is no significant linear correlation, don’t use the regression equation to make predictions.
2. When using the regression equation for predictions, stay within the scope of the available sample data.
3. A regression equation based on old data is not necessarily valid now.
4. Don’t make predictions about a population that is different from the population from which the sample data was drawn.

Chapter 4 – Section 3
Learning objectives
1. Compute and interpret the coefficient of determination
2. Perform residual analysis on a regression model
3. Identify influential observations

Total Deviation = Explained + Unexplained
The relationship is $y - \bar{y} = (\hat{y} - \bar{y}) + (y - \hat{y})$, that is, Total Deviation = Explained Deviation + Unexplained Deviation
- The larger the explained deviation, the better the model is at prediction / explanation
- The larger the unexplained deviation, the worse the model is at prediction / explanation

We began with $y - \bar{y}$, or the total deviation. Our regression model reduces this to $y - \hat{y}$, or the unexplained deviation. The amount of reduction, $\hat{y} - \bar{y}$, is the explained deviation.

Instead of straight deviations, we use variations
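Variation = Deviation². It is also true that Total Variation = Explained Variation + Unexplained Variation:

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$

A measure of the explanatory power of the model is the proportion of variation that is explained, that is, Explained Variation divided by Total Variation.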

[Figure: scatter diagram with the regression line, marking the unexplained sum of squares $\sum (Y - \hat{Y})^2$ and the total sum of squares $\sum (Y - \bar{Y})^2$.]

Proportion of Variation ‘Explained’ by Relationship Between X & Y
- Simply square the correlation coefficient r
- $r^2$ is called the coefficient of determination
- $0 \le r^2 \le 1$, often stated as a percentage (the percentage of variation explained by X)
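To see the decomposition of variation numerically, the following sketch reuses the four-point example with $r = -0.135$ (assumed pairing (1, 2), (1, 8), (3, 6), (5, 4)). It fits the least-squares line, splits the total variation into explained and unexplained parts, and reports the explained proportion. All variable names are illustrative.

```python
xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]
n = len(xs)

# Least-squares slope and intercept from the shortcut formulas.
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)

y_bar = sum_y / n
predicted = [b0 + b1 * x for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)                         # total variation
explained = sum((p - y_bar) ** 2 for p in predicted)              # explained variation
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))    # unexplained variation

r_squared = explained / total
print(round(total, 3), round(explained, 3), round(unexplained, 3))
print(round(r_squared, 4))  # 0.0182
```

The explained proportion, about 1.8%, equals the square of $r = -0.135$ up to rounding, so this line explains very little of the variation in y.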

How can we tell how good our model is?
To check whether a linear model is appropriate, plot the residuals (error) on the vertical axis against the explanatory variable (x) on the horizontal axis. If the plot shows a pattern (such as a curve), then the response (y) and explanatory (x) variables may have a nonlinear relationship. If there is no obvious pattern, the linear model may be acceptable.
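The residual check can be tried on a deliberately curved data set. The sketch below uses made-up data with $y = x^2$ (an assumption chosen purely for illustration), fits a least-squares line, and lists the residuals in order of x; a plotting library could draw the same values as a residual plot.

```python
# Hypothetical, deliberately nonlinear data: y = x^2.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]
n = len(xs)

# Least-squares slope and intercept from the shortcut formulas.
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)

# Residual = observed - predicted, listed in order of increasing x.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. That U-shaped pattern is the kind of curve that suggests a nonlinear relationship.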

Two example residual plots
The least-squares regression model assumes that the variance of the residuals is constant across values of the explanatory variable. To check whether the variance of the residuals is constant, plot the residuals (error) on the vertical axis against the explanatory variable (x) on the horizontal axis. This is the same plot as the one used to check linearity.

[Figure: two example residual plots, one with no spread and one with spread (marked by the dotted blue line).]

If the spread of the residuals changes across the values of x (the dotted blue line), then the linear model is not very reliable.

Definition
Outliers for a least-squares regression are those observations that are unusually far away from the model line. There are several ways to identify outliers:
- The scatter diagram may show the outlier as a point away from the main pattern of points
- The residual plot may show the outlier as an unusually high or unusually low residual
- The boxplot of residuals may identify the outlier as a value outside the upper or lower fence

Three ways to identify outliers
- From a scatter diagram
- From a residual plot
- From a boxplot
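The boxplot approach can be sketched without drawing anything, by computing the fences directly. The example below assumes the common convention of fences at 1.5 times the interquartile range beyond the quartiles. The residual values are hypothetical, and the quartile method (`method="inclusive"`) is one of several in use; other software may give slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical residuals from some fitted line (made up for illustration).
residuals = [-2, -1, -1, 0, 0, 1, 1, 2, 12]

# Quartiles; the "inclusive" method is an assumption, conventions differ.
q1, _median, q3 = quantiles(residuals, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [e for e in residuals if e < lower_fence or e > upper_fence]
print(lower_fence, upper_fence, outliers)  # -4.0 4.0 [12]
```

Here the residual of 12 lies above the upper fence, so the corresponding observation would be flagged as an outlier.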

Influential Points: An influential point strongly affects the graph of the regression line
Usually influential observations are those with unusually high or unusually low values of the predictor (x) variable. An influential observation has:
- A significant effect on the value of the slope, or
- A significant effect on the value of the intercept

[Figure: three example points, labeled outlier, definitely influential, and influential, annotated as follows.]
- The x value is large compared to the others, and it is not along the general linear pattern of the data
- It is not along the general linear pattern of the data; however, it is likely not to be influential
- The x and y values are large compared to the others; however, it is along the general linear pattern of the data
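One simple way to see influence is to fit the line with and without the suspect observation and compare the coefficients. The sketch below does this for made-up data: four points that lie exactly on $y = x$, plus one hypothetical point with a large x value that is off the pattern. The data and the function name are assumptions for illustration only.

```python
def least_squares_line(points):
    """Return (b0, b1) for the least-squares line through (x, y) points."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1


# Hypothetical data: four points on y = x, plus one far-out point off the pattern.
main_points = [(1, 1), (2, 2), (3, 3), (4, 4)]
suspect = (10, 2)

b0_without, b1_without = least_squares_line(main_points)
b0_with, b1_with = least_squares_line(main_points + [suspect])

print(round(b1_without, 2), round(b0_without, 2))  # 1.0 0.0
print(round(b1_with, 2), round(b0_with, 2))        # 0.04 2.24
```

Adding the single point changes the slope from 1 to 0.04 and the intercept from 0 to 2.24, which is the kind of large effect on the slope and intercept that marks an influential observation.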

If a particular observation is influential, we should investigate that observation
If the observation is a valid observation, we have a variety of options:
- We could collect additional points near the influential observation
- We could collect additional points between the main part of our data and the influential point (to check whether the data is nonlinear, for example)
- We could use techniques that are resistant to influential observations

Summary: Chapter 4 – Section 3
Diagnostics are very important in assessing the quality of a least-squares regression model
- The coefficient of determination measures the percent of total variation explained by the model
- The plot of residuals can detect nonlinear patterns, error variances that are not constant, and outliers
- We must be careful when there are influential observations because they have an unusually large effect on the computation of our model parameters