Presentation on theme: "BA 555 Practical Business Analysis"— Presentation transcript:
1 BA 555 Practical Business Analysis Agenda: Linear Regression Analysis; Case Study: Cost of Manufacturing Computers; Multiple Regression Analysis; Dummy Variables.
2 Regression Analysis: A technique to examine the relationship between an outcome variable (dependent variable, Y) and a group of explanatory variables (independent variables, X1, X2, …, Xk). The model allows us to understand (quantify) the effect of each X on Y. It also allows us to predict Y based on X1, X2, …, Xk.
3 Types of Relationship: Linear vs. nonlinear. Simple linear relationship: Y = b0 + b1 X + e. Multiple linear relationship: Y = b0 + b1 X1 + b2 X2 + … + bk Xk + e. Nonlinear relationships: Y = a0 exp(b1 X + e), Y = b0 + b1 X1 + b2 X1² + e, etc. We will focus only on linear relationships.
4 Simple Linear Regression Model: In the population, the true effect of X on Y; from the sample, the estimated effect of X on Y. Key questions: 1. Does X have any effect on Y? 2. If yes, how large is the effect? 3. Given X, what is the estimated Y? ASSOCIATION ≠ CAUSALITY.
5 Least Squares Method: A statistical procedure for finding the "best-fitting" straight line. It minimizes the sum of squared deviations of the observed values of Y from those predicted by the line. (Figure: a badly fitting line vs. the least squares line, for which the deviations are minimized.)
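The least squares computation on the slide above can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers, not the case-study data, and `least_squares` is a hypothetical helper name:

```python
def least_squares(x, y):
    """Return (b0, b1) that minimize the sum of squared deviations
    of the observed y values from the fitted line b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx           # slope
    b0 = my - b1 * mx        # intercept; the line passes through (x-bar, y-bar)
    return b0, b1

# Illustrative data only (not from the case study).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
```

Any other line through the same points would produce a larger sum of squared deviations; that is what "best-fitting" means here.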
6 Case: Cost of Manufacturing Computers (pp. 13–45) A manufacturer produces computers. The goal is to quantify cost drivers and to understand the variation in production costs from week to week. The following production variables were recorded: COST: the total weekly production cost (in $ millions). UNITS: the total number of units (in 000s) produced during the week. LABOR: the total weekly direct labor cost (in $10K). SWITCH: the total number of times that the production process was re-configured for different types of computers. FACTA: = 1 if the observation is from factory A; = 0 if from factory B.
7 Raw Data (p. 14) How many possible regression models can we build?
8 Simple Linear Regression Model (pp. 17–26) Research questions: Is Labor a significant cost driver? How accurately can Labor predict Cost?
9 Initial Analysis (pp. 15–16) Summary statistics + plots (e.g., histograms + scatter plots) + correlations. Things to look for: Features of the data (e.g., data range, outliers) via summary statistics and graphs; we do not want to extrapolate outside the data range because the relationship there is unknown (or un-established). Is the assumption of linearity appropriate? Inter-dependence among variables? Any potential problem? Check scatter plots and correlations.
10 Correlation (p. 15) Is the assumption of linearity appropriate? ρ (rho): population correlation (its value is most likely unknown). r: sample correlation (its value can be calculated from the sample). Correlation is a measure of the strength of linear relationship. Correlation falls between –1 and 1. There is no linear relationship if the correlation is close to 0. But, … (Figure: example scatterplots for r = –1, –1 < r < 0, r = 0, 0 < r < 1, and r = 1.)
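The sample correlation r described above can be computed directly from its definition; a minimal stdlib sketch with illustrative data (not the case-study data):

```python
import math

def sample_correlation(x, y):
    """Sample correlation r: a measure of the strength of the
    linear relationship; it always falls between -1 and 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_pos = sample_correlation([1, 2, 3, 4], [2, 4, 6, 8])  # points on a rising line
r_neg = sample_correlation([1, 2, 3, 4], [8, 6, 4, 2])  # points on a falling line
```

A perfectly linear rising relationship gives r = 1 and a perfectly linear falling one gives r = –1, matching the endpoints of the range on the slide.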
11 Scatterplot (p. 16) and Correlation (p. 15) Checking the linearity assumption. Note the sample size and the p-value for H0: ρ = 0 vs. Ha: ρ ≠ 0. Is the reported value ρ or r?
12 Hypothesis Testing for β (pp. 18–19) Key Q1: Does X have any effect on Y? Which is which: β0 or b0? β1 or b1? Standard errors Sb0 and Sb1. Test H0: β1 = 0 vs. Ha: β1 ≠ 0. ** Divide the p-value by 2 for a one-sided test; make sure there is at least weak evidence before doing this step. Degrees of freedom = n – k – 1, where n = sample size and k = # of Xs.
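The test statistic behind this slide is t = b1 / Sb1 with df = n – k – 1 (= n – 2 in simple regression). A minimal sketch with illustrative data, where `slope_t_stat` is a hypothetical helper name:

```python
import math

def slope_t_stat(x, y):
    """t statistic for H0: beta1 = 0 in simple regression,
    computed as b1 / S_b1 with df = n - 2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))      # residual standard error
    sb1 = s / math.sqrt(sxx)          # standard error of the slope
    return b1 / sb1

# Illustrative data only; a large |t| is strong evidence against H0.
t = slope_t_stat([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

The p-value then comes from a t distribution with n – 2 degrees of freedom (Statgraphics reports it directly; the Python standard library has no t distribution).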
13 Confidence Interval Estimation for β (pp. 19–20) Key Q2: How large is the effect? Q1: Does Labor have any impact on Cost? → Hypothesis testing. Q2: If so, how large is the impact? → Confidence interval estimation, using b0, b1 and their standard errors Sb0, Sb1. Degrees of freedom = n – k – 1, where k = # of independent variables.
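The interval itself is b1 ± t* × Sb1. A minimal sketch; since the standard library has no t distribution, the critical value t* is supplied by hand from a t table with df = n – k – 1 (all numbers below are illustrative):

```python
def ci_for_slope(b1, sb1, t_crit):
    """Confidence interval for the slope: b1 plus/minus t* times S_b1,
    where t* is a t-table critical value with df = n - k - 1."""
    margin = t_crit * sb1
    return (b1 - margin, b1 + margin)

# Illustrative numbers: b1 = 2.0, S_b1 = 0.5, t* = 2.045 (95%, df = 29).
low, high = ci_for_slope(2.0, 0.5, 2.045)
```

If the interval excludes 0, it agrees with rejecting H0: β1 = 0 at the corresponding two-sided significance level.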
14 Prediction (pp. 25–26) Key Q3: What is the Y-prediction? What is the predicted production cost of a given week, say, Week 21 of the year, when Labor = 5 (i.e., $50,000)? Point estimate: predicted cost = b0 + b1 (5) (million dollars). Margin of error? → Prediction interval. What is the average production cost of a typical week with Labor = 5? Point estimate: estimated cost = b0 + b1 (5) (million dollars). Margin of error? → Confidence interval.
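Both intervals share the same point estimate but different margins of error. The standard-error formulas below (a stdlib sketch with illustrative data) show why: the prediction interval carries an extra "1 +" term for the variability of a single new observation, and both widen as x0 moves away from x-bar:

```python
import math

def interval_standard_errors(x, y, x0):
    """Standard errors at x0 for (a) predicting one new Y (prediction
    interval) and (b) estimating the mean of Y (confidence interval)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    s = math.sqrt(sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y)) / (n - 2))
    leverage = 1 / n + (x0 - mx) ** 2 / sxx
    se_pred = s * math.sqrt(1 + leverage)   # one new observation at x0
    se_mean = s * math.sqrt(leverage)       # average Y at x0
    return se_pred, se_mean

xs, ys = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
se_pred_end, se_mean_end = interval_standard_errors(xs, ys, 5)  # at the edge
se_pred_mid, se_mean_mid = interval_standard_errors(xs, ys, 3)  # at x-bar
```

The prediction SE always exceeds the confidence SE, and both grow toward the ends of the data range.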
15 Prediction vs. Confidence Intervals (pp. 25–26) (Figure: prediction and confidence bands plotted around the fitted line.) The variation (margin of error) on both ends seems larger. Implication?
16 Analysis of Variance (p. 21) Not very useful in simple regression; useful in multiple regression.
17 Sum of Squares (p. 22) SSE = the remaining variation that cannot be explained by the model. Syy = the total variation in Y. SSR = Syy – SSE = the variation in Y that has been explained by the model.
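The decomposition Syy = SSR + SSE can be verified numerically; a stdlib sketch with illustrative data:

```python
def sums_of_squares(x, y):
    """Decompose total variation in Y: Syy = SSR + SSE."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    syy = sum((b - my) ** 2 for b in y)                        # total variation
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))  # unexplained
    ssr = syy - sse                                            # explained
    return syy, ssr, sse

syy, ssr, sse = sums_of_squares([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

The ratio SSR/Syy is R-squared, the fraction of the variation in Y explained by the model.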
19 Another Simple Regression Model: Cost = b0 + b1 Units + e (p. 27) A better model?Why?
20 Multiple Regression Model Cost = b0 + b1 Units + b2 Labor + e (p. 29) Test of Global Fit (p. 29)Marginal effect (p. 30)Adjusted R-sq (p. 30)
21 R-sq vs. Adjusted R-sq

Independent variables      R-sq     Adjusted R-sq
Labor                      20.43%   18.84%
Units                      86.44%   86.17%
Switch                      0.05%   –1.95%
Labor, Units               86.51%   85.96%
Units, Switch              88.20%   87.72%
Labor, Switch              21.32%   18.11%
Labor, Units, Switch       88.21%   87.48%

Remember! There are still many more models to try.
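The adjusted figures in the table follow from adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). A minimal sketch; the sample size n = 52 (one year of weekly data) is an inference from the table's numbers, not stated on the slide:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-sq penalizes adding variables: it can fall, or even
    go negative (as in the Switch row), when an X explains little."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Reproducing two table rows under the assumption n = 52.
adj_units = adjusted_r2(0.8644, 52, 1)        # Units alone
adj_labor_units = adjusted_r2(0.8651, 52, 2)  # Labor, Units
```

Note that adding Labor to Units raises R-sq (86.44% → 86.51%) but lowers adjusted R-sq (86.17% → 85.96%): the small gain in fit does not justify the extra variable.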
22 Test of Global Fit (p. 29) SSR: the variation explained by the model that consists of 2 Xs. MSR: the variation explained, on average, by each independent variable. H0: the model is useless. Ha: the model is not completely useless. If the F-ratio is large → H0 or Ha? If the F-ratio is small → H0 or Ha? (Please read pp. 39–41, 47 for finding the cutoff.)
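The global F-ratio is F = (SSR/k) / (SSE/(n − k − 1)) = MSR/MSE, which can be written in terms of R-sq. A minimal sketch using the Labor + Units row of the model table, again assuming n = 52 (an inference, not stated on the slide):

```python
def f_ratio(r2, n, k):
    """Global F test: F = MSR / MSE, written in terms of R-sq.
    A large F favors Ha (the model is not completely useless);
    a small F favors H0 (the model is useless)."""
    msr = r2 / k                   # explained variation, on average, per X
    mse = (1 - r2) / (n - k - 1)   # unexplained variation per residual df
    return msr / mse

# Labor + Units model: R-sq = 86.51%, n = 52 (assumed), k = 2.
f = f_ratio(0.8651, 52, 2)
```

The cutoff comes from an F table with (k, n − k − 1) degrees of freedom; an F this large is far beyond any conventional cutoff.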
23 Residual Analysis (pp. 33–34) The three conditions required for the validity of the regression analysis are: the error variable is normally distributed with mean = 0; the error variance is constant for all values of x; the errors are independent of each other. How can we identify any violation?
24 Residual Analysis (pp. 33–34) We do not have e (the random error), but we can calculate residuals from the sample. Residual = actual Y – estimated Y. Examining the residuals (or standardized residuals) helps detect violations of the required conditions.
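Residuals are computed directly from the fitted line. A stdlib sketch with illustrative data; the standardized form below simply divides by s, the residual standard error (a common simple version; the studentized residuals on the next slide also adjust for leverage):

```python
import math

def residuals_and_standardized(x, y):
    """Residual = actual Y - estimated Y; standardized residuals
    divide each residual by s, the residual standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    res = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in res) / (n - 2))
    return res, [e / s for e in res]

res, std_res = residuals_and_standardized([1, 2, 3, 4, 5],
                                          [2.1, 3.9, 6.2, 7.8, 10.1])
```

By construction, least squares residuals sum to zero; what matters for diagnostics is their pattern against the predicted values, each X, and time.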
25 Residuals, Standardized Residuals, and Studentized Residuals (p.33)
26 The random error e is normally distributed with mean = 0 (p.34)
27 The error variance σe is constant for all values of X and estimated Y (p. 34) Constant spread!
28 Constant Variance: When the requirement of a constant variance is violated, we have a condition of heteroscedasticity. Diagnose heteroscedasticity by plotting the residuals against the predicted y, the actual y, and each independent variable X. (Figure: a residual plot in which the spread increases with ŷ.)
29 The errors are independent of each other (p. 34) We do NOT want to see any pattern.
30 Non-Independence of Error Variables (Figure: two residual-vs-time plots.) Note the runs of positive residuals replaced by runs of negative residuals in one plot, and the oscillating behavior of the residuals around zero in the other.
31 Residual Plots with FACTA (p.34) Which factory is more efficient?
32 Dummy/Indicator Variables (p. 36) Qualitative variables are handled in a regression analysis by the use of 0-1 variables. These qualitative variables are also referred to as "dummy" variables; they indicate which category the corresponding observation belongs to. Use k – 1 dummy variables for a qualitative variable with k categories. Gender = "M" or "F" → needs one dummy variable. Training Level = "A", "B", or "C" → needs 2 dummy variables.
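The k − 1 encoding rule can be sketched as a small helper (a hypothetical illustration; statistics packages such as Statgraphics handle this automatically). The first category serves as the baseline and gets all zeros:

```python
def make_dummies(values, categories):
    """Encode a qualitative variable with k categories as k - 1
    0-1 dummy variables; the first category is the baseline."""
    baseline, *rest = categories
    return [[1 if v == c else 0 for c in rest] for v in values]

# Training Level with k = 3 categories -> 2 dummies per observation.
encoded = make_dummies(["A", "B", "C", "B"], ["A", "B", "C"])
```

Using all k dummies would make the columns sum to the intercept column, so one category must be dropped; its effect is absorbed into the intercept.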
36 Statgraphics: Prediction/Confidence Intervals for Y. Simple Regression Analysis: Relate / Simple Regression; X = independent variable, Y = dependent variable. For prediction, click on the Tabular option icon and check Forecasts; right click to change X values. Multiple Regression Analysis: Relate / Multiple Regression. For prediction, enter values of the Xs in the Data Window and leave the corresponding Y blank; click on the Tabular option icon and check Reports. Saving intermediate results (e.g., studentized residuals): click the icon and check the results to save. Removing outliers: highlight the point to remove on the plot and click the Exclude icon.