Presentation on theme: "Chapter 4 The Relation between Two Variables"— Presentation transcript:
1 Chapter 4: The Relation between Two Variables. Prof. Felix Apfaltrer. Office: N518. Phone: X 7421. Office hours: Tue/Thu 1:30-3pm.
2 A mathematical model is a mathematical expression that represents some phenomenon. It can be a deterministic model or a probabilistic model. Such models often describe the relationship between two variables.
3 Learning objectives: (1) Draw and interpret scatter diagrams. (2) Understand the properties of the linear correlation coefficient. (3) Compute and interpret the linear correlation coefficient.
4 4.1. Scatter Diagrams and Correlation. When dealing with two variables, we try to see the relationship between them. Sometimes there is a third variable that is not considered but affects the results (a lurking variable). For example, shoe size does not cause height to change; age affects both variables. Therefore, we can’t conclude that variable A causes B. Some examples are: rainfall amounts and plant growth (possible lurking variable: sunlight); exercise and cholesterol levels for a group of people (possible lurking variable: diet); height and weight for a group of people; height and the fastest speed at which you have ever driven a car. When we have two variables, they could be related in one of several different ways: they could be unrelated; one variable (the explanatory or predictor variable) could be used to explain the other (the response or dependent variable); or one variable could be thought of as causing the other variable to change.
5 Scatter Diagrams. A scatter diagram is a graph that shows the relationship between two quantitative variables visually. The explanatory variable is plotted on the horizontal axis and the response variable on the vertical axis. The response variable (y-axis) is the variable whose value can be explained by the value of the explanatory variable (x-axis).
6 Linear Correlation. The linear correlation coefficient is a measure of the strength and direction of the linear relation between two quantitative variables. The sample correlation coefficient is denoted r. It should be computed with software (and not by hand) whenever possible.
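One common textbook definition of r, assumed here because the text of the slide does not carry the formula, is the sum of the products of the z-scores of x and y divided by n - 1. The sketch below implements that definition in plain Python. The function name `sample_correlation` and the data are hypothetical illustrations, not taken from the slides.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Sample linear correlation coefficient r for paired data (assumed z-score form)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when a variable has zero spread")
    # Sum of the products of the z-scores, divided by n - 1.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical illustration data (not from the slides).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = sample_correlation(hours, score)
print(round(r, 3))
```

The sign of the result gives the direction of the relation and its magnitude gives the strength; points lying exactly on a rising line give a value of 1 up to rounding error.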
7 Coefficient of Correlation. It answers the question “How strong is the linear relationship between two variables?” The population correlation coefficient is denoted ρ (rho); the sample correlation coefficient is denoted r. Values range from -1 to +1. It measures the degree of association. The sign of r indicates the direction of the relationship: positive means the two variables tend to increase together; negative means that as one variable increases, the other is likely to decrease. It is used mainly for understanding.
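8 [Figure: correlation scale running from -1.0 (perfect negative correlation) through no correlation at the midpoint to +1.0 (perfect positive correlation), with -.5 and +.5 marked. The degree of negative correlation increases toward -1.0 and the degree of positive correlation increases toward +1.0.]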
9 Examples of positive correlation: strong positive, r = .8; moderate positive, r = .5; very weak, r = .1. Examples of negative correlation: strong negative, r = -.8; moderate negative, r = -.5; very weak, r = -.1. In general, if the correlation is visible to the eye, then it is likely to be strong.
11 Correlation is not causation! Just because two variables are correlated does not mean that one causes the other to change. For example, there is a strong correlation between shoe sizes and vocabulary sizes for grade school children. Clearly, larger shoe sizes do not cause larger vocabularies, and larger vocabularies do not cause larger shoe sizes. Often lurking variables result in confounding.
12 Summary: Chapter 4 – Section 1. Visual methods: scatter diagrams, analogous to histograms for single variables. Numeric methods: the linear correlation coefficient, analogous to the mean and variance for single variables. Care should be taken in the interpretation of linear correlation (nonlinearity and causation). Correlation between two variables can be described with both visual and numeric methods.
13 Chapter 4 – Section 2. Learning objectives: (1) Find the least-squares regression line and use the line to make predictions and estimations. (2) Interpret the slope and the y-intercept of the least-squares regression line. (3) Compute the sum of squared residuals.
15 If we have two variables X and Y, we often would like to model the relation as a line. Draw a line through the scatter diagram. We want to find the line that “best” describes the linear relationship: the regression line.
16 Linear Equations. We want to use a linear model. Linear models can be written in several different (equivalent) ways: $y = mx + b$; $y - y_1 = m(x - x_1)$; $y = b_1 x + b_0$. Because the slope and the intercept are important to analyze, we will use $y = b_1 x + b_0$.
18 One difference between math and stat is that statistics assumes that the measurements are not exact, that there is an error or residual. The formula for the residual is always Residual = Observed – Predicted. [Figure: what the residual is on the scatter diagram, labeling the model line, the x value of interest, the observed value $y$, the predicted value $\hat{y}$, and the residual between them.] The equation for the least-squares regression line is given by $\hat{y} = b_1 x + b_0$, where $b_1$ is the slope of the least-squares regression line (marginal change) and $b_0$ is the y-intercept of the least-squares regression line.
19 Least-Squares Property. A straight line satisfies this property if the sum of the squares of the residuals is the smallest sum possible. Example: for the data x: 1, 2, 4, 5 and y: 4, 24, 8, 32, the least-squares line is $\hat{y} = 5 + 4x$.
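The least-squares property can be checked numerically on the slide's four data points. The sketch below computes the residuals (observed minus predicted) for the line y-hat = 5 + 4x and compares its sum of squared residuals with a few nearby lines. The helper name `sum_squared_residuals` and the alternative lines are arbitrary choices made for illustration.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of (observed - predicted)^2 for the line y_hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Data and fitted line from the slide: y_hat = 5 + 4x.
xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]

# Residual = Observed - Predicted, one per data point.
residuals = [y - (5 + 4 * x) for x, y in zip(xs, ys)]
best = sum_squared_residuals(xs, ys, 5, 4)

# Nudging the intercept or the slope (arbitrary alternatives) gives a larger sum.
alternatives = [(5, 3.5), (5, 4.5), (4, 4), (6, 4)]
others = [sum_squared_residuals(xs, ys, b0, b1) for b0, b1 in alternatives]

print(residuals)                       # [-5, 11, -13, 7]
print(best)                            # 364
print(all(s > best for s in others))   # True
```

Checking a handful of alternatives does not prove minimality; it only illustrates what "smallest sum possible" means for this data set.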
20 Shortcut formulas for the least-squares regression line: slope $b_1 = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$; y-intercept $b_0 = \bar{y} - b_1 \bar{x}$. Calculators or computers can compute these values.
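The shortcut formulas translate directly into a few sums. The following is a minimal sketch in plain Python, applied to the four data points from slide 19. The function name `least_squares_line` is a hypothetical choice, and dedicated statistical software would normally be used instead.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for y_hat = b1 * x + b0 using the shortcut formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b1 = (n * sum_xy - sum_x * sum_y) / denominator   # slope
    b0 = sum_y / n - b1 * (sum_x / n)                 # y-intercept: y_bar - b1 * x_bar
    return b0, b1

# Data from slide 19.
xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]
b0, b1 = least_squares_line(xs, ys)
print(b0, b1)  # 5.0 4.0
```

For this data the sums are n = 4, Σx = 12, Σy = 68, Σxy = 244 and Σx² = 46, so b1 = 160 / 40 = 4 and b0 = 17 - 4 · 3 = 5, matching the line shown on slide 19.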
21 Finding the values of $b_1$ and $b_0$ by hand is a very tedious process; you should use software for this. Finding the coefficients $b_1$ and $b_0$ is only the first step of a regression analysis. We need to interpret the slope $b_1$. We need to interpret the y-intercept $b_0$. We need to do quite a bit more statistical analysis; this is covered in Section 4.3 and also in Chapter 14.
23 Guidelines for Using the Regression Equation. 1. If there is no significant linear correlation, don’t use the regression equation to make predictions. 2. When using the regression equation for predictions, stay within the scope of the available sample data. 3. A regression equation based on old data is not necessarily valid now. 4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.
24 Chapter 4 – Section 3. Learning objectives: (1) Compute and interpret the coefficient of determination. (2) Perform residual analysis on a regression model. (3) Identify influential observations. The relationship is Total Deviation = Explained + Unexplained. The larger the explained deviation, the better the model is at prediction / explanation. The larger the unexplained deviation, the worse the model is at prediction / explanation.
25 We began with $y - \bar{y}$, or the total deviation. Our regression model reduces this to $y - \hat{y}$, or the unexplained deviation. The amount of reduction, $\hat{y} - \bar{y}$, is the explained deviation.
26 Instead of straight deviations, we use variations: Variation = Deviation². It is also true that Total Variation = Explained + Unexplained. A measure of the explanatory power of the model is the proportion of variation that is explained: the explained variation divided by the total variation.
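The decomposition can be verified on the slide-19 data, using the line y-hat = 5 + 4x from that slide. The sketch below sums the squared total, explained, and unexplained deviations and forms the explained proportion. The variable names are illustrative, and the additivity shown holds for the least-squares line, not for an arbitrary line.

```python
# Data and least-squares line from slide 19: y_hat = 5 + 4x.
xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]
predicted = [5 + 4 * x for x in xs]
y_bar = sum(ys) / len(ys)

# Variation = sum of squared deviations.
total = sum((y - y_bar) ** 2 for y in ys)                           # (y - y_bar)^2
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))      # (y - y_hat)^2
explained = sum((p - y_bar) ** 2 for p in predicted)                # (y_hat - y_bar)^2

proportion_explained = explained / total

print(total, explained, unexplained)    # 524.0 160.0 364
print(round(proportion_explained, 3))   # 0.305
```

Here 160 + 364 = 524, so about 30.5% of the variation in y is explained by the line and the remainder is left in the residuals.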
27 [Figure: unexplained sum of squares, $\sum (Y - \hat{Y})^2$, and total sum of squares, $\sum (Y - \bar{Y})^2$, shown against the Y axis.]
28 Proportion of variation “explained” by the relationship between X and Y: simply square the correlation r. $r^2$ is called the coefficient of determination. $0 \le r^2 \le 1$; expressed as a percentage, it is the percentage of variation explained by X.
29 How can we tell how good our model is? To check whether a linear model is appropriate, plot the residuals (errors) on the vertical axis against the explanatory variable (x) on the horizontal axis. If the plot shows a pattern (such as a curve), then the response (y) and explanatory (x) variables may have a nonlinear relationship. If there is no obvious pattern, we could be ok.
30 Two example residual plots. The least-squares regression model assumes that the variance of the residuals is constant across values of the explanatory variable. To check whether the variance of the residuals is constant, plot the residuals (errors) on the vertical axis against the explanatory variable (x) on the horizontal axis. This is the same plot as the one used to check linearity. If the spread of the residuals changes across x (the dotted blue line in the figure), then the linear model is not very reliable. [Figure: two residual plots, labeled “No spread” and “Spread”.]
31 Definition: outliers for a least-squares regression are those observations that are unusually far away from the model line. There are several ways to identify outliers. The scatter diagram may show the outlier as a point away from the main pattern of points. The residual plot may show the outlier as an unusually high or unusually low residual. The boxplot of residuals may identify the outlier as a value outside the upper or lower fence.
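The boxplot check can be sketched numerically. The example below assumes the usual fence convention (Q1 - 1.5 · IQR and Q3 + 1.5 · IQR), which the slide does not spell out, and uses Python's `statistics.quantiles` for the quartiles. The residual values are hypothetical, and other quartile conventions can move the fences slightly.

```python
from statistics import quantiles

def fence_outliers(residuals):
    """Return the residuals lying outside the lower or upper boxplot fence."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(residuals, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # assumed convention: 1.5 * IQR below Q1
    upper_fence = q3 + 1.5 * iqr   # assumed convention: 1.5 * IQR above Q3
    return [e for e in residuals if e < lower_fence or e > upper_fence]

# Hypothetical residuals from some fitted line (not from the slides).
residuals = [-2.1, -1.4, -0.8, -0.3, 0.2, 0.5, 0.9, 1.3, 1.8, 9.5]
outliers = fence_outliers(residuals)
print(outliers)  # [9.5]
```

A value flagged this way is only a candidate outlier; it still needs to be inspected on the scatter diagram and the residual plot.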
32 Three ways to identify outliers: from a scatter diagram, from a residual plot, and from a boxplot.
33 Influential Points. An influential point strongly affects the graph of the regression line. Usually influential observations are those with unusually high or unusually low values of the predictor (x) variable. An influential observation has a significant effect on the value of the slope, or a significant effect on the value of the intercept. [Figure: three example points, labeled “outlier”, “definitely influential”, and “influential”. One point is not along the general linear pattern of the data; however, it is likely not to be influential. One point has an x value that is large compared to the others, and it is not along the general linear pattern of the data. One point has x and y values that are large compared to the others; however, it is along the general linear pattern of the data.]
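One informal way to see whether a point is influential is to fit the line with and without it and compare the coefficients. The sketch below does this with the shortcut formulas from slide 20. The data set, the candidate point, and the function name `least_squares_line` are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for y_hat = b1 * x + b0 using the shortcut formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1

# Hypothetical data: five points with a roughly linear, increasing pattern ...
main_x = [1, 2, 3, 4, 5]
main_y = [2, 4, 5, 4, 5]
# ... plus one candidate point with a much larger x that is off the pattern.
candidate = (20, 2)

b0_without, b1_without = least_squares_line(main_x, main_y)
b0_with, b1_with = least_squares_line(main_x + [candidate[0]],
                                      main_y + [candidate[1]])

print(f"slope without the point: {b1_without:.3f}")
print(f"slope with the point:    {b1_with:.3f}")
```

In this made-up example the single extra point turns a positive slope into a slightly negative one, which is the kind of change that marks an observation as influential.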
34 If a particular observation is influential, we should investigate that observation. If the observation is a valid observation, we have a variety of options. We could collect additional points near the influential observation. We could collect additional points between the main part of our data and the influential point (to check whether the data are nonlinear, for example). We could use techniques that are resistant to influential observations.
35 Summary: Chapter 4 – Section 3. Diagnostics are very important in assessing the quality of a least-squares regression model. The coefficient of determination measures the percent of total variation explained by the model. The plot of residuals can detect nonlinear patterns, error variances that are not constant, and outliers. We must be careful when there are influential observations because they have an unusually large effect on the computation of our model parameters.