2 Chapter 4: Descriptive Methods in Regression and Correlation
3 C4S1 – Linear Equation with One Independent Variable
A linear equation with one independent variable can be written as $y = b_0 + b_1 x$, where $b_0$ and $b_1$ are constants (fixed numbers), $x$ is the independent variable, and $y$ is the dependent variable.
The graph of a linear equation is a straight line. In the familiar form $y = mx + b$, the slope $m$ corresponds to $b_1$ and the y-intercept $b$ corresponds to $b_0$.
Insert Definition 14.1 from pg 696
5 Figure 4.6
Positive slope: the line rises from left to right.
Negative slope: the line falls from left to right.
Horizontal line: the slope is 0.
6 C4S2 – The Regression Equation
Plotting the data in a scatterplot helps us visualize any apparent relationship between x and y.
Generally speaking, a scatterplot (or scatter diagram) is a graph of data from two quantitative variables of a population. To construct a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other. Each pair of observations is then plotted as a point.
7 Because we could draw many different lines through the cluster of data points, we need a method to choose the “best” line.
The method, called the least-squares criterion, is based on an analysis of the errors made in using a line to fit the data points.
8 To avoid confusion, we use $\hat{y}$ to denote the y-value predicted for a value of x.
To measure quantitatively how well a line fits the data, we first consider the errors, e, made in using the line to predict the y-values of the data points.
In general, an error, e, is the signed vertical distance from the line to a data point. The error made in using the line to predict the y-value is $e = y - \hat{y}$.
To decide which line best fits the data, we compute the sum of the squared errors, $\sum e^2$.
The line with the smaller sum of squared errors is the one that fits the data better.
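The comparison of two candidate lines by their sum of squared errors can be sketched in a few lines of Python. The data points and the two candidate lines below are invented for illustration and do not come from the slides; the function name `sum_squared_errors` is likewise hypothetical.

```python
# Hypothetical data points (x, y), invented for illustration only.
points = [(1, 1), (1, 2), (2, 2), (4, 6)]


def sum_squared_errors(b0, b1, data):
    """Return the sum of e^2, where e = y - y_hat and y_hat = b0 + b1*x."""
    total = 0.0
    for x, y in data:
        y_hat = b0 + b1 * x  # y-value predicted by the line
        e = y - y_hat        # signed vertical error for this point
        total += e ** 2
    return total


# Two candidate lines, each given as (b0, b1).
sse_a = sum_squared_errors(0.75, 0.75, points)   # line A: y_hat = 0.75 + 0.75x
sse_b = sum_squared_errors(-0.25, 1.50, points)  # line B: y_hat = -0.25 + 1.50x

# The line with the smaller sum of squared errors fits better.
better = "A" if sse_a < sse_b else "B"
print(f"SSE for line A: {sse_a:.4f}")
print(f"SSE for line B: {sse_b:.4f}")
print(f"Line {better} fits these points better.")
```

For these made-up points, line B has the smaller sum of squared errors, so by the criterion above it is the better fit of the two.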
9 Insert Key Fact 14.2 and Definition 14.2 from page 701
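10 Regression Equation
The regression equation for a set of n data points is $\hat{y} = b_0 + b_1 x$, where $b_0 = \bar{y} - b_1 \bar{x}$.
Here $\bar{x}$ is the mean of the x-values and $\bar{y}$ is the mean of the y-values.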
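As a rough illustration, the regression equation for a small data set can be computed directly. This sketch assumes the standard least-squares formulas $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$, which may be written differently in the textbook definition; the data and the function name `regression_coefficients` are hypothetical.

```python
def regression_coefficients(xs, ys):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1*x.

    Assumes the standard formulas b1 = Sxy / Sxx and b0 = y_bar - b1 * x_bar.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, invented for illustration only.
xs = [1, 1, 2, 4]
ys = [1, 2, 2, 6]

b0, b1 = regression_coefficients(xs, ys)
print(f"y_hat = {b0:.2f} + {b1:.2f}x")  # prints: y_hat = -0.25 + 1.50x
```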
11 Extrapolation
Suppose that a scatterplot indicates a linear relationship between two variables.
Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable.
However, to do so outside that range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there.
Grossly incorrect predictions can result from extrapolation.
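One simple way to guard against accidental extrapolation is to flag any prediction whose x-value lies outside the observed range. The sketch below is only an illustration: the function `predict`, the coefficients, and the observed values are all hypothetical.

```python
def predict(b0, b1, x, observed_xs):
    """Return (y_hat, is_extrapolation) for the line y_hat = b0 + b1*x.

    is_extrapolation is True when x lies outside the range of the
    observed predictor values.
    """
    lowest, highest = min(observed_xs), max(observed_xs)
    is_extrapolation = not (lowest <= x <= highest)
    return b0 + b1 * x, is_extrapolation


observed_xs = [1, 1, 2, 4]  # hypothetical observed predictor values
b0, b1 = -0.25, 1.5         # hypothetical regression coefficients

inside = predict(b0, b1, 3, observed_xs)    # x = 3 is within [1, 4]
outside = predict(b0, b1, 10, observed_xs)  # x = 10 is outside [1, 4]

print(inside)   # (4.25, False)
print(outside)  # (14.75, True)
```

The flag does not say the prediction is wrong, only that it relies on the linear relationship holding beyond the data actually observed.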
12 Outliers and Influential Observations
An outlier is an observation that lies outside the overall pattern of the data.
In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points.
An outlier can sometimes have a significant effect on a regression analysis.
We must also watch for influential observations.
In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably.
A data point separated in the x-direction from the other data points is often an influential observation because the regression line is “pulled” toward such a data point without counteraction by other data points.
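The effect of an influential observation can be seen by fitting the line with and without the suspect point. The data below are invented for illustration: four points lie exactly on $y = 2x$, and a fifth point is separated from them in the x-direction. The helper `fit_line` is hypothetical and assumes the standard least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line, using b1 = Sxy / Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: the last point (10, 2) sits far to the right of the rest.
xs = [1, 2, 3, 4, 10]
ys = [2, 4, 6, 8, 2]

b0_all, b1_all = fit_line(xs, ys)
b0_trimmed, b1_trimmed = fit_line(xs[:-1], ys[:-1])  # drop the last point

print(f"With the point:    y_hat = {b0_all:.2f} + {b1_all:.2f}x")
print(f"Without the point: y_hat = {b0_trimmed:.2f} + {b1_trimmed:.2f}x")
```

In this made-up example, removing the single point changes the slope from negative to positive, which is the kind of considerable change that marks an influential observation.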
13 Correlation between x and y
Regression analysis is used when you want to show whether and how one variable can be used to predict changes in another variable.
The slope of the best-fit line is $b_1 = r \cdot \frac{s_y}{s_x}$, where $r$ is the correlation between x and y, and $s_x$ and $s_y$ are the standard deviations of x and y.
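The link between the slope of the best-fit line, the correlation, and the two standard deviations can be checked numerically. This sketch assumes the relationship $b_1 = r \cdot s_y / s_x$ with sample standard deviations, and compares it with the slope from the standard least-squares formula $S_{xy}/S_{xx}$; the data are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration only.
xs = [1, 1, 2, 4]
ys = [1, 2, 2, 6]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)

# Sum of cross-products and sum of squares about the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

# Correlation between x and y, computed from the standardized cross-products.
r = s_xy / ((n - 1) * s_x * s_y)

slope_from_r = r * s_y / s_x        # assumed relationship: b1 = r * s_y / s_x
slope_least_squares = s_xy / s_xx   # standard least-squares slope

print(f"slope from r:       {slope_from_r:.4f}")
print(f"least-squares slope: {slope_least_squares:.4f}")
```

Both routes give the same slope up to rounding, so the sign of the slope always agrees with the sign of $r$.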
14 C4S3 – The Coefficient of Determination
Insert Definition 14.4 from page 719
15 The coefficient of determination, $r^2$, always lies between 0 and 1.
A value of $r^2$ near 0 suggests that the regression equation is not very useful for making predictions.
A value of $r^2$ near 1 suggests that the regression equation is quite useful for making predictions.
It shows whether we gain anything by using the regression equation instead of the mean of the observed y-values to make predictions.
It is the proportion (percentage) of the variation in the observed y-values that is explained by the regression.
16 Regression Identity
The total sum of squares equals the regression sum of squares plus the error sum of squares: $SST = SSR + SSE$.
This equation is always true.
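The regression identity can be verified numerically on a small data set. The sketch below uses hypothetical data and assumes the usual definitions: $SST = \sum (y - \bar{y})^2$, $SSR = \sum (\hat{y} - \bar{y})^2$, $SSE = \sum (y - \hat{y})^2$, and $r^2 = SSR/SST$, with $\hat{y}$ taken from the least-squares line.

```python
# Hypothetical data, invented for illustration only.
xs = [1, 1, 2, 4]
ys = [1, 2, 2, 6]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares line (standard formulas b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hats = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                   # total sum of squares
ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)       # regression sum of squares
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # error sum of squares

r_squared = ssr / sst  # coefficient of determination

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
print(f"r^2 = {r_squared:.4f}")
```

Note that the identity relies on the line being the least-squares line; for an arbitrary line the three sums generally do not add up this way.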
17 C4S4 – Linear Correlation
We hear things like “there is a positive correlation between x and y” and “x and y are uncorrelated”; these statements are explained in this section.
The linear correlation coefficient, r, measures the strength of the linear relationship between two variables.
Insert Definition 14.6 from page 724 (the defining formula reveals the meaning and basic properties of r; the computing formula is used for hand calculations).
18 Understanding the Linear Correlation Coefficient
r is independent of the choice of units and always lies between -1 and 1.
If r is close to ±1, there is a strong linear relationship and the regression equation is extremely useful for making predictions. The data points are clustered closely about the regression line.
If r is near 0, the linear relationship is weak and the regression equation is a poor predictor. The data points are essentially scattered about a horizontal line.
Keep in mind that r measures the strength of the linear relationship between two variables and that the following properties of r are meaningful only when the data points are scattered about a line.
r reflects the slope of the scatterplot.
The magnitude of r indicates the strength of the linear relationship.
The sign of r suggests the type of linear relationship.
The sign of r and the sign of the slope of the regression line are identical.
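The claim that r does not depend on the choice of units can be checked with a short sketch. It assumes the common computing formula $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, which may differ in form from the textbook definition; the data, the inch-to-centimeter conversion, and the function name `correlation` are hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Return the linear correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data: x measured in inches, then converted to centimeters.
xs_inches = [1, 1, 2, 4]
ys = [1, 2, 2, 6]
xs_cm = [2.54 * x for x in xs_inches]

r_inches = correlation(xs_inches, ys)
r_cm = correlation(xs_cm, ys)

print(f"r with x in inches:      {r_inches:.4f}")
print(f"r with x in centimeters: {r_cm:.4f}")
```

Changing the units rescales $S_{xy}$ and $\sqrt{S_{xx}}$ by the same factor, so the ratio, and therefore r, stays the same.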
19 Understanding the Linear Correlation Coefficient
To graphically portray the meaning of the linear correlation coefficient, we present various degrees of linear correlation in Figure 4.17.
20 Relationship Between the Correlation Coefficient and the Coefficient of Determination
The coefficient of determination, $r^2$, is a descriptive measure of the utility of the regression equation for making predictions.
The coefficient of determination, $r^2$, equals the square of the linear correlation coefficient, r.
The linear correlation coefficient, r, is a descriptive measure of the strength of the linear relationship between two variables.
Because the linear correlation coefficient describes the strength of the linear relationship between two variables, it should be used as a descriptive measure only when a scatterplot indicates that the data points are scattered about a line.
21 Relationship Between the Correlation Coefficient and the Coefficient of Determination
When using the linear correlation coefficient, you must also watch for outliers and influential observations, because sample means and sample standard deviations are not resistant to outliers and other extreme values.
We cannot say that a value of r near 0 implies that there is no relationship, and we cannot say that a value of r near ±1 implies that a linear relationship exists. Such statements are meaningful only when the scatterplot indicates that the data points are scattered about a line.
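A small example shows why r near 0 does not rule out a relationship. In the hypothetical data below, y is determined exactly by x through $y = x^2$, yet the linear correlation coefficient is 0 because the relationship is not linear. The sketch assumes the common computing formula $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$.

```python
from math import sqrt

# Hypothetical data: y depends perfectly on x, but through a curve, not a line.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # [4, 1, 0, 1, 4]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / sqrt(s_xx * s_yy)
print(f"r = {r:.4f}")
```

A scatterplot of these points would show a clear curve, which is why the scatterplot should be inspected before r is interpreted.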