 # Descriptive Methods in Regression and Correlation


Chapter 4 Descriptive Methods in Regression and Correlation

C4S1 – Linear Equation with One Independent Variable
Linear Equations
Linear equations with one independent variable can be written as $y = b_0 + b_1 x$, where $b_0$ and $b_1$ are constants (fixed numbers), $x$ is the independent variable, and $y$ is the dependent variable. The graph of a linear equation is a straight line. This is the same form as the familiar $y = mx + b$, with slope $m = b_1$ and y-intercept $b = b_0$.


Figure 4.6
Positive slope: the line rises from left to right. Negative slope: the line falls from left to right. Horizontal line: the slope is 0.
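The link between the sign of the slope and the direction of the line can be sketched in a few lines of Python. The function names `slope_intercept` and `describe_slope` and the two sample points are illustrative assumptions, not part of the slides.

```python
def slope_intercept(p1, p2):
    """Return (b0, b1) for the line y = b0 + b1*x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    b1 = (y2 - y1) / (x2 - x1)   # slope: change in y per unit change in x
    b0 = y1 - b1 * x1            # y-intercept: value of y when x = 0
    return b0, b1


def describe_slope(b1):
    """Describe the direction of a line from the sign of its slope."""
    if b1 > 0:
        return "rises from left to right"
    if b1 < 0:
        return "falls from left to right"
    return "horizontal"


# Illustrative points (made up for this sketch).
b0, b1 = slope_intercept((0, 1), (2, 5))
print(f"y = {b0} + {b1}x, line {describe_slope(b1)}")
```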

C4S2 – The Regression Equation
Plotting the data in a scatterplot helps us visualize any apparent relationship between x and y. Generally speaking, a scatterplot (or scatter diagram) is a graph of data from two quantitative variables of a population. To construct a scatterplot, we use a horizontal axis for the observations of one variable and a vertical axis for the observations of the other. Each pair of observations is then plotted as a point.

Because we could draw many different lines through the cluster of data points, we need a method to choose the “best” line. The method, called the least-squares criterion, is based on an analysis of the errors made in using a line to fit the data points.

To avoid confusion, we use $\hat{y}$ to denote the y-value predicted for a value of x.
To measure quantitatively how well a line fits the data, we first consider the errors, $e$, made in using the line to predict the y-values of the data points. In general, an error, $e$, is the signed vertical distance from the line to a data point. The error made in using the line to predict the y-value is $e = y - \hat{y}$. To decide which line best fits the data, we compute the sum of the squared errors, $\sum e_i^2$. The line with the smaller sum of squared errors is the one that fits the data better.
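Comparing two candidate lines by their sum of squared errors can be sketched as follows. The data set and both candidate lines are made up for illustration, and `sum_squared_errors` is a hypothetical helper name.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared errors e = y - y_hat for the line y_hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data (not from the slides).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

sse_a = sum_squared_errors(xs, ys, b0=1, b1=1)      # Line A: y_hat = 1 + x
sse_b = sum_squared_errors(xs, ys, b0=0.5, b1=1.4)  # Line B: y_hat = 0.5 + 1.4x

better = "A" if sse_a < sse_b else "B"
print(f"SSE for A = {sse_a:.2f}, SSE for B = {sse_b:.2f}, line {better} fits better")
```

By this criterion, the line with the smaller sum is preferred; for these numbers that is Line B.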


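The regression equation for a set of n data points is $\hat{y} = b_0 + b_1 x$, where

$$b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}.$$

Here $\bar{x}$ and $\bar{y}$ are the means of x and y, $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$, and $S_{xx} = \sum (x_i - \bar{x})^2$.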
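A minimal sketch of computing the regression equation $\hat{y} = b_0 + b_1 x$, assuming the standard least-squares formulas $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$. The data and the name `least_squares_line` are illustrative assumptions.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) of the least-squares line y_hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n                      # mean of x
    y_bar = sum(ys) / n                      # mean of y
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx                         # slope
    b0 = y_bar - b1 * x_bar                  # intercept
    return b0, b1


# Illustrative data (not from the slides).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
b0, b1 = least_squares_line(xs, ys)
print(f"y_hat = {b0:.2f} + {b1:.2f}x")
```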

Extrapolation
Suppose that a scatterplot indicates a linear relationship between two variables. Then, within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable. However, to do so outside that range, which is called extrapolation, may not be reasonable because the linear relationship between the predictor and response variables may not hold there. Grossly incorrect predictions can result from extrapolation.
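One way to guard against accidental extrapolation is to flag any prediction whose x-value lies outside the observed range. The sketch below assumes a hypothetical fitted line $\hat{y} = 0.5 + 1.4x$ with observed x-values between 1 and 4; the function name `predict` is also an assumption.

```python
def predict(x, b0, b1, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y_hat = b0 + b1*x."""
    y_hat = b0 + b1 * x
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation


# Hypothetical line and observed x-range, chosen for illustration only.
for x in (2.5, 10):
    y_hat, flagged = predict(x, b0=0.5, b1=1.4, x_min=1, x_max=4)
    note = "extrapolation, treat with caution" if flagged else "within observed range"
    print(f"x = {x}: y_hat = {y_hat:.2f} ({note})")
```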

Outliers and Influential Observations
An outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points. An outlier can sometimes have a significant effect on a regression analysis. We must also watch for influential observations. In regression analysis, an influential observation is a data point whose removal causes the regression equation (and line) to change considerably. A data point separated in the x-direction from the other data points is often an influential observation because the regression line is “pulled” toward such a data point without counteraction by other data points.
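The effect of an influential observation can be illustrated by fitting the line with and without a point that is separated from the rest in the x-direction. The data are made up for this sketch, and `slope_of_fit` is a hypothetical helper that uses the standard least-squares slope formula.

```python
def slope_of_fit(xs, ys):
    """Least-squares slope b1 = S_xy / S_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


# Illustrative data: the last point (10, 2) sits far from the others in x.
xs = [1, 2, 3, 4, 10]
ys = [2, 3, 5, 6, 2]

slope_all = slope_of_fit(xs, ys)
slope_without = slope_of_fit(xs[:-1], ys[:-1])   # drop the separated point
print(f"slope with all points: {slope_all:.2f}")
print(f"slope without (10, 2): {slope_without:.2f}")
```

In this example the single separated point is enough to change the sign of the slope, which is the kind of considerable change that marks an influential observation.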

Correlation between x and y
Regression analysis is used when you want to show whether and how one variable can predict changes in another variable. The slope of the best-fit line can be written in terms of the correlation:

$$b_1 = r \, \frac{s_y}{s_x}$$

where $r$ is the correlation between x and y, $s_x$ and $s_y$ are the standard deviations of x and y, and $b_1$ is the slope of the best-fit line.
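As a quick check of the relationship between the correlation, the standard deviations, and the slope, the sketch below computes the slope as $r \, s_y / s_x$ and compares it with the direct least-squares slope $S_{xy}/S_{xx}$. The data are illustrative, and the sample standard deviations come from Python's `statistics.stdev`.

```python
import math
import statistics

# Illustrative data (not from the slides).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / math.sqrt(s_xx * s_yy)   # linear correlation coefficient
s_x = statistics.stdev(xs)          # sample standard deviation of x
s_y = statistics.stdev(ys)          # sample standard deviation of y

slope_from_r = r * s_y / s_x        # slope via correlation
slope_direct = s_xy / s_xx          # slope via least-squares formula
print(f"r * s_y / s_x = {slope_from_r:.4f}, S_xy / S_xx = {slope_direct:.4f}")
```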

C4S3 – The Coefficient of Determination

The coefficient of determination, $r^2$, always lies between 0 and 1.
A value of $r^2$ near 0 suggests that the regression equation is not very useful for making predictions; a value of $r^2$ near 1 suggests that the regression equation is quite useful for making predictions. It shows whether the regression equation predicts better than simply using the mean of y: $r^2$ is the proportion (percentage) of variation in the observed y-values that is explained by the regression.

Regression Identity
The total sum of squares equals the regression sum of squares plus the error sum of squares: SST = SSR + SSE. This equation is always true.
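The regression identity can be verified numerically. The sketch below assumes the usual definitions SST $= \sum (y_i - \bar{y})^2$, SSR $= \sum (\hat{y}_i - \bar{y})^2$, and SSE $= \sum (y_i - \hat{y}_i)^2$, with $r^2 =$ SSR/SST, and uses made-up data.

```python
# Illustrative data (not from the slides).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares line y_hat = b0 + b1*x.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # error sum of squares
r_squared = ssr / sst                                   # coefficient of determination

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}, r^2 = {r_squared:.2f}")
```

The identity holds for the least-squares line specifically; for an arbitrary line through the data, SSR + SSE need not equal SST.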

C4S4 – Linear Correlation
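We hear statements such as “there is a positive correlation between x and y” and “x and y are uncorrelated”; these are explained in this section. The linear correlation coefficient, $r$, measures the strength of the linear relationship between two variables. The defining formula reveals the meaning and basic properties of $r$:

$$r = \frac{\frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$$

The computing formula is used for hand calculations:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

Here $s_x$ and $s_y$ are the sample standard deviations of x and y, $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum (x_i - \bar{x})^2$, and $S_{yy} = \sum (y_i - \bar{y})^2$.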
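The two common ways of writing the linear correlation coefficient give the same value, which the sketch below checks on made-up data. It assumes the defining formula $r = \frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y}) / (s_x s_y)$ and the computing formula $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$; the function names are illustrative.

```python
import math
import statistics


def r_defining(xs, ys):
    """r from the defining formula, using sample standard deviations."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


def r_computing(xs, ys):
    """r from the computing formula S_xy / sqrt(S_xx * S_yy)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative data (not from the slides).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
print(f"defining: {r_defining(xs, ys):.4f}, computing: {r_computing(xs, ys):.4f}")
```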

Understanding the Linear Correlation Coefficient
r is independent of the choice of units and always lies between -1 and 1.
When r is close to ±1, there is a strong linear relationship and the regression equation is extremely useful for making predictions; the data points are clustered closely about the regression line. When r is near 0, the linear relationship is weak and the regression equation is a poor predictor; the data points are essentially scattered about a horizontal line.
Keep in mind that r measures the strength of the linear relationship between two variables and that the following properties of r are meaningful only when the data points are scattered about a line.
r reflects the slope of the scatterplot. The magnitude of r indicates the strength of the linear relationship. The sign of r suggests the type of linear relationship. The sign of r and the sign of the slope of the regression line are identical.
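Two of these properties, unit independence and the sign of r, can be checked numerically. In the sketch below the data are made up, the inches-to-centimetres conversion is only an example of a change of units, and `correlation` is a hypothetical helper.

```python
import math


def correlation(xs, ys):
    """Linear correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative data (not from the slides); treat x as a length in inches.
xs_inches = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
xs_cm = [2.54 * x for x in xs_inches]   # same lengths in centimetres
ys_reversed = [-y for y in ys]          # flips the direction of the relationship

r_inches = correlation(xs_inches, ys)
r_cm = correlation(xs_cm, ys)
r_reversed = correlation(xs_inches, ys_reversed)
print(f"r (inches) = {r_inches:.4f}, r (cm) = {r_cm:.4f}, r (reversed y) = {r_reversed:.4f}")
```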

Figure 4.17 Understanding the Linear Correlation Coefficient
To graphically portray the meaning of the linear correlation coefficient, we present various degrees of linear correlation in Figure 4.17.

Relationship Between the Correlation Coefficient and the Coefficient of Determination
The coefficient of determination, $r^2$, is a descriptive measure of the utility of the regression equation for making predictions. The coefficient of determination, $r^2$, equals the square of the linear correlation coefficient, $r$. The linear correlation coefficient, $r$, is a descriptive measure of the strength of the linear relationship between two variables. Because the linear correlation coefficient describes the strength of the linear relationship between two variables, it should be used as a descriptive measure only when a scatterplot indicates that the data points are scattered about a line.

When using the linear correlation coefficient, you must also watch for outliers and influential observations, because sample means and sample standard deviations are not resistant to outliers and other extreme values. We cannot say that a value of r near 0 implies that there is no relationship, and we cannot say that a value of r near ±1 implies that a linear relationship exists. Such statements are meaningful only when the scatterplot indicates that the data points are scattered about a line.
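A small example shows why r near 0 does not mean "no relationship". In the sketch below, y is completely determined by x through $y = x^2$, yet the linear correlation coefficient is 0 because the relationship is not linear. The data and the helper `correlation` are illustrative assumptions.

```python
import math


def correlation(xs, ys):
    """Linear correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative data: a perfect but curved (quadratic) relationship.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(f"r = {r:.4f} even though y = x^2 exactly")
```

A scatterplot of these points would show the curve immediately, which is why the plot should be inspected before r is interpreted.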