Regression Analysis Once a linear relationship is defined, the independent variable can be used to forecast the dependent variable: Ŷ = b0 + b1X. b0 is called the Y intercept - it represents the value of Ŷ when X = 0. But be cautious - this interpretation may be misleading and difficult to estimate, since many data sets do not include X = 0. Think of this value as representing the influences of the many other independent variables that are not included in the equation. b1 is called the slope - it represents the amount of change in Y when X increases by one unit.
Regression Analysis Regression line - the line that best fits a collection of X-Y data points. The regression line minimizes the sum of the squared distances from the points to the line. The regression equation is found by the Method of Least Squares, which solves for b0 and b1. Other model-building approaches: stepwise, forward, and backward stepwise selection.
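The least-squares computation can be sketched directly. This is a minimal illustration using a small invented advertising/sales data set (the numbers are not from the slides):

```python
# Least-squares slope and intercept for a tiny hypothetical data set
# (X = advertising, Y = sales; numbers invented for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)

print(round(b0, 4), round(b1, 4))  # 2.2 0.6, i.e. Y_hat = 2.2 + 0.6 X
```

Excel's regression tool (and any statistics package) performs exactly this calculation, plus the diagnostics discussed below.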
Regression Assumptions Y values are normally distributed about the regression line. Variance remains constant as X values increase and decrease - a violation is called heteroscedasticity. Error terms (residuals) are independent of one another - random (no autocorrelation). A linear relationship exists between X and Y - nonlinear techniques are discussed later.
Excel’s Regression Tool Tools, Data Analysis, Regression. Hint: include labels in the input ranges to help with the interpretation! The tool can also produce plots (not shown here).
[Figure: scatter plot of Sales vs. Advertising comparing a forecasted value Ŷ to the actual value Y and the average Ȳ. Each point's total deviation splits into an explained part and an unexplained part.] Total Variance = Explained Variance + Unexplained Variance.
Data Analysis R², or Coefficient of Determination, equals the proportion of the variance in the dependent variable Y that is explained through the relationship with the independent variable X. Explained Variance = Total Variance - Unexplained Variance; we state this as a proportion. Adjusted R² is adjusted for complexity by the degrees of freedom. Unadjusted R² becomes larger as more variables are added to the equation (each added variable decreases the sum of squared errors), so using an unadjusted R² may lead you to believe that additional variables are useful when they are not. Unadjusted: R² = 1 - SSE/SST. Adjusted: R²(adj) = 1 - (SSE/(n - k)) / (SST/(n - 1)).
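Both versions can be computed from the sums of squares. A sketch, using the same invented advertising/sales data and the hypothetical fit Ŷ = 2.2 + 0.6X (k = 2 estimated parameters):

```python
# Unadjusted and adjusted R^2 for the hypothetical fit Y_hat = 2.2 + 0.6 X
# (invented data; k = 2 because b0 and b1 are both estimated).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n, k = len(xs), 2

y_bar = sum(ys) / n
y_hat = [2.2 + 0.6 * x for x in xs]

sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)               # total variation

r2 = 1 - sse / sst                              # unadjusted
adj_r2 = 1 - (sse / (n - k)) / (sst / (n - 1))  # penalized for complexity

print(round(r2, 3), round(adj_r2, 3))  # 0.6 0.467
```

Note how the adjusted value is noticeably lower than the unadjusted one here: with only five observations, the degrees-of-freedom penalty is large.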
More on R² If R² = 1, there is a perfect linear relationship: all the variance in Y is explained by X, and all of the data points lie on the regression line. If R² = 0, there is no linear relationship between X and Y (if this is the case, we should not have run a linear model - and we should have realized this from a correlation coefficient and a graph BEFORE running the model!). There are several ways to calculate it. From the ANOVA table: SSR/SST (this is an UNADJUSTED R²). Adjusted R² from the ANOVA table = 1 - MSE/(SST/(n - 1)). The square root of R² is R, the correlation coefficient; its sign identifies positive and negative relationships. R² is useful for making model comparisons.
Data Analysis S(yx), or Standard Error of the Estimate - a measure of goodness of fit. It measures the actual values Y against the regression line: S(yx) = sqrt( Σ(Y - Ŷ)² / (n - k) ). A lower S(yx) is a better fit. k refers to the number of population parameters being estimated - in this case, we have 2: b0 and b1. The standard error can also be calculated by taking the square root of the MSE in the ANOVA table!
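A sketch of that formula on the same invented data, confirming that the result equals the square root of the MSE:

```python
import math

# Standard error of the estimate, S_yx = sqrt(SSE / (n - k)), for the
# invented advertising/sales data; k = 2 (b0 and b1 are both estimated).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n, k = len(xs), 2

residuals = [y - (2.2 + 0.6 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)
s_yx = math.sqrt(sse / (n - k))  # equals sqrt(MSE) from the ANOVA table

print(round(s_yx, 4))  # 0.8944
```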
Residuals Excel will provide the residuals in the output. This table also includes another column that I added - the squared residuals, which are used to determine the standard error of the estimate, S(yx).
Confidence Intervals Prior to relating Y to X, confidence intervals about future values are based on the standard error of Y. In the regression setting, however, the standard error of the forecast, S(f), gives tighter confidence intervals and greater accuracy. Confidence interval for Y: Ȳ ± t(α/2) · S(y). Confidence interval for Ŷ: Ŷ ± t(α/2) · S(f). Use t(α/2) rather than Z for small sample sizes!
Making Predictions Identifying a forecasted point from the regression equation does not, by itself, give us an idea of the accuracy of the prediction. We use the prediction interval to determine accuracy. For example, a prediction of 8.44 appears to be precise - but not if the 95% confidence level allows the forecast to fall anywhere between 1.75 and 15.15! Also be careful about making a prediction outside the data. For example, if the X values range between 5 and 15, you should be cautious about using an X value of 20 - it is outside the range of the data and possibly outside of the linear relationship.
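A sketch of a prediction interval using the standard formula for the standard error of the forecast, S(f) = S(yx) · sqrt(1 + 1/n + (X0 - X̄)²/Σ(X - X̄)²), on the same invented data. The forecast point X0 = 4 is inside the observed range, as the slide advises; the critical t of 3.182 (two-tailed 5%, 3 df) is taken from a t table:

```python
import math

# Prediction interval at X0 = 4 for the invented advertising/sales fit
# Y_hat = 2.2 + 0.6 X.  t_crit = 3.182 is the two-tailed 5% value, 3 df.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n, k = len(xs), 2
x0, t_crit = 4, 3.182

x_bar = sum(xs) / n
sse = sum((y - (2.2 + 0.6 * x)) ** 2 for x, y in zip(xs, ys))
s_yx = math.sqrt(sse / (n - k))
sxx = sum((x - x_bar) ** 2 for x in xs)

# Standard error of the forecast at x0 (wider than S_yx, and wider
# still the farther x0 lies from x_bar):
s_f = s_yx * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

y_hat = 2.2 + 0.6 * x0
lo, hi = y_hat - t_crit * s_f, y_hat + t_crit * s_f
print(round(lo, 2), round(hi, 2))  # roughly 1.35 to 7.85
```

The point forecast of 4.6 looks precise on its own; the interval shows how uncertain it really is with only five observations.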
Is the Independent Variable Significant? H0: The regression coefficient is not significantly different from zero (β = 0). HA: The regression coefficient is significantly different from zero (β ≠ 0). Here β is the true slope of the regression line.
Is the Independent Variable Significant? The Standard Error of the Estimate is S(yx); the Standard Error of the Regression Coefficient is S(b). The test statistic is t = b1/S(b). We will use Excel’s p-value for the independent variable to determine significance: if the p-value is less than .05, we reject the null hypothesis and conclude that the independent variable is related to the dependent variable. However, it is important to have an understanding of how the formulas are developed - which is why the formulas and definitions are provided.
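The t statistic can be sketched by hand on the same invented data. The critical value 3.182 (two-tailed 5%, 3 df) comes from a t table:

```python
import math

# t test for the slope: t = b1 / S_b, where S_b = S_yx / sqrt(sum (x - x_bar)^2).
# Invented data; 3.182 is the two-tailed 5% critical t for n - k = 3 df.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n, k = len(xs), 2
b0, b1 = 2.2, 0.6

x_bar = sum(xs) / n
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_yx = math.sqrt(sse / (n - k))
s_b = s_yx / math.sqrt(sum((x - x_bar) ** 2 for x in xs))

t = b1 / s_b
print(round(t, 3), t > 3.182)  # 2.121 False
```

Despite a respectable unadjusted R² of 0.6, the slope here is not significant: with only five points the critical t is large. This is exactly the small-sample situation the next slide asks about.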
Analyzing It All at Once What happens if you have a large sample size, a small R² (such as .10), and you have determined that the independent variable is significant? What happens with a small sample, a large R², and an independent variable that is NOT significant? To test the model as a whole, we use the F statistic from the ANOVA table.
ANOVA Analysis

Source       df     SS     MS                 F
Regression   k-1    SSR    MSR = SSR/(k-1)    F = MSR/MSE
Error        n-k    SSE    MSE = SSE/(n-k)
Total        n-1    SST
F-Test H0: The model is NOT valid - there is NOT a statistical relationship between the dependent and independent variables. HA: The model is valid - there is a statistical relationship between the dependent and independent variables. If the F from the ANOVA table is greater than the critical F from the F-table, reject H0: the model is valid. We can also look at the p-value: if the p-value is less than our chosen significance level, we REJECT H0.
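A sketch of the F test built from the ANOVA sums of squares on the same invented data. With one independent variable the degrees of freedom are (k - 1, n - k) = (1, 3); the critical value 10.13 is the 5% F(1, 3) value from an F table:

```python
# F test for overall model validity (invented advertising/sales data,
# fitted line Y_hat = 2.2 + 0.6 X; k = 2 estimated parameters).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n, k = len(xs), 2

y_bar = sum(ys) / n
y_hat = [2.2 + 0.6 * x for x in xs]

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained

msr = ssr / (k - 1)
mse = sse / (n - k)
f = msr / mse
print(round(f, 2), f > 10.13)  # 4.5 False -> fail to reject H0
```

With a single independent variable, F is simply the square of the slope's t statistic (4.5 = 2.121²), so the two tests agree.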
Durbin-Watson Statistic Minitab will provide a DW statistic. It detects autocorrelation between successive residuals, e(t) and e(t-1). The value of DW varies between 0 and 4: a value of 2 indicates no autocorrelation, a value near 0 indicates positive autocorrelation, and a value near 4 indicates negative autocorrelation.
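The statistic itself is DW = Σ(e(t) - e(t-1))² / Σe(t)². A sketch on the residuals of the same invented fit:

```python
# Durbin-Watson statistic on the residuals of the invented fit
# Y_hat = 2.2 + 0.6 X:  DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

e = [y - (2.2 + 0.6 * x) for x, y in zip(xs, ys)]
dw = (sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
      / sum(et ** 2 for et in e))
print(round(dw, 2))  # 2.02 -> close to 2, so no sign of autocorrelation
```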
Data Transformations Curvilinear relationships - fit the data with a curved line, or transform the X variable (independent) so the resulting relationship with Y is linear. The log of X, the square root of X, X squared, and the reciprocal of X (1/X) are common choices. The hope is that one of these transformations will produce a linear relationship.
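A sketch of the idea, using invented data in which Y is linear in sqrt(X) rather than in X itself: the correlation coefficient jumps to 1 once the transformation is applied.

```python
import math

# Invented curvilinear data: Y is linear in sqrt(X), not in X.
xs = [1, 4, 9, 16, 25]
ys = [1, 2, 3, 4, 5]

def pearson_r(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    num = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    den = math.sqrt(sum((x - a_bar) ** 2 for x in a)
                    * sum((y - b_bar) ** 2 for y in b))
    return num / den

print(round(pearson_r(xs, ys), 3))                          # 0.981 (curved)
print(round(pearson_r([math.sqrt(x) for x in xs], ys), 3))  # 1.0 (linear)
```

This is why the checklist below says to compute a correlation coefficient and scatter plot before fitting: the untransformed r of 0.981 looks strong, but the transformed fit is exactly linear.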
Ok, 18 pages of notes, so where do we start?
- Determine the dependent and independent variables.
- Develop scatter plots and determine whether linear or nonlinear relationships exist. Calculate a correlation coefficient.
- Transform nonlinear data.
- Run an autocorrelation and interpret the results - it is helpful to see if any patterns exist.
- Compute the regression equation. Interpret it.
- Understand the difference between the standard error of the estimate, the standard error of the forecast (regression), and the standard error of the regression coefficient.
- Evaluate and interpret the adjusted R².
- Test the independent variables for significance.
- Evaluate the ANOVA and test the model for significance (F and DW).
- Plot the error terms.
- Calculate a prediction and a prediction interval.
- State final conclusions about the model (if running different models, compare using MSE, MAD, MAPE, MPE).