Statistical Modeling - 1 US Army Logistics Management College Part 1: Regression Analysis Estimating Relationships
Statistical Modeling - 2 US Army Logistics Management College Preparing to Use StatTools Pharmex Drug Stores StatTools is part of the DecisionTools Suite. Open both Excel and StatTools. Select StatTools + Data Set Manager. Select New. Highlight the portion of the spreadsheet that includes the data and select OK. Pharmex.xls
Statistical Modeling - 3 US Army Logistics Management College Scatterplots: Graphing Relationships Pharmex Drug Stores Pharmex is a chain of drugstores that operates around the country. The company has collected data from 50 randomly selected metropolitan regions. In each region it has collected data on its promotional expenditures and sales in the region over the past year. There are two variables, each of which is an index, not a dollar amount. Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor. Sales: Pharmex’s sales as a percentage of those of the leading competitor. The company expects that there is a positive relationship between the two variables, so that regions with relatively more expenditures have relatively more sales. However, it is not clear what the nature of this relationship is. Pharmex.xls
Statistical Modeling - 4 US Army Logistics Management College Creating the Scatterplot Pharmex Drug Stores Select any data cell. Select StatTools + Summary Graphs + Scatterplot… The tricky part is to decide which variable should be on the horizontal axis. In regression analysis, we always put the explanatory variable on the horizontal axis and the response variable on the vertical axis. In this example the company tends to believe that larger promotional expenditures “cause” larger values of sales, so select “Sales” as the Y variable (the vertical axis) and “Promote” as the X variable (the horizontal axis). Pharmex.xls
Statistical Modeling - 5 US Army Logistics Management College Interpretation Pharmex Drug Stores The scatterplot indicates that there is a positive relationship between Promote and Sales - the points tend to rise from bottom left to top right - but the relationship is not perfect. The correlation is shown automatically on the plot. The important things to note about the correlation are that it is positive and that its magnitude is moderately large. Causation - we can never make definitive statements about causation based on regression analysis. Regression identifies only a statistical relationship, not a causal relationship.
Statistical Modeling - 6 US Army Logistics Management College Simple Linear Regression Pharmex Drug Stores The Pharmex scatterplot hints at a linear relationship between Promote and Sales. We want to draw the “best fitting” straight line through the points to quantify that linear relationship. Since the relationship is not perfect, not all points lie exactly on the line. The vertical distances from the points to the line are the residuals. They show how much the observed values differ from the fitted values. The fitted value is the vertical distance from the horizontal axis to the line. We define the “best fitting” line through the points in the scatterplot to be the one with the smallest sum of squared residuals. This line is called the least squares line. We now want to find the least squares line for the Pharmex drugstore data, using Sales as the response variable and Promote as the explanatory variable.
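The least squares definition above can be illustrated outside StatTools with a short Python sketch. The index values below are made up for illustration and are not the Pharmex data; the function name `least_squares_line` is likewise an assumption of this sketch.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Made-up index values, NOT the Pharmex data.
promote = np.array([80.0, 90.0, 100.0, 110.0, 120.0])
sales = np.array([85.0, 95.0, 98.0, 108.0, 114.0])

intercept, slope = least_squares_line(promote, sales)
fitted = intercept + slope * promote   # height of the line at each Promote value
residuals = sales - fitted             # observed minus fitted

print(f"Predicted Sales = {intercept:.2f} + {slope:.3f} x Promote")
print("Sum of squared residuals:", round(float(np.sum(residuals ** 2)), 3))
```

With an intercept in the equation, the residuals of a least squares line always sum to zero, which is a quick sanity check on any fit.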
Statistical Modeling - 7 US Army Logistics Management College Least Squares Line with StatTools Pharmex Drug Stores Select any data cell. From the Menu bar, select: StatTools + Regression & Classification + Regression… Specify that “Sales” is the response (dependent) variable. Specify that “Promote” is the explanatory (independent) variable. Select the graph option “Residuals vs Fitted values”.
Statistical Modeling - 8 US Army Logistics Management College Regression Output Table Pharmex Drug Stores The “Constant” and “Promote” coefficients B18:C18 imply that the equation for the least squares line is: Predicted Sales = a + (b x Promote), where a is the “Constant” coefficient and b is the “Promote” coefficient from the output table.
Statistical Modeling - 9 US Army Logistics Management College Least Squares Line Equation Pharmex Drug Stores We can interpret this equation as follows: The slope indicates that the sales index tends to increase by about 0.76 for each unit increase in the promotional expenses index. The interpretation of the intercept is less important. It is literally the predicted sales index for a region that does no promotions. The Scatterplot: A useful graph in almost any regression analysis is a scatterplot of residuals (on the vertical axis) versus fitted values. We typically examine the scatterplot for striking patterns. A “good” fit not only has small residuals, but it has residuals scattered randomly around 0 with no apparent pattern. This is the case here.
Statistical Modeling - 10 US Army Logistics Management College The Scatterplot of Residuals vs Fitted Values
Statistical Modeling - 11 US Army Logistics Management College Multiple Regression Bendrix Automotive Parts Company The Bendrix Company manufactures various types of parts for automobiles. The factory manager wants to get a better understanding of overhead costs, including supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses. Some of the overhead costs are “fixed” in the sense that they do not vary appreciably with the volume of work being done, whereas others are “variable” and do vary directly with the volume of work being done. It is not easy to draw a clear line between the fixed and variable overhead components. The Bendrix manager has tracked total overhead costs for 36 months.
Statistical Modeling - 12 US Army Logistics Management College Explanatory Variables Bendrix Automotive Parts Company The factory manager collected data on two variables he believes might be responsible for variations in overhead costs: MachHrs: the number of machine hours used during the month. ProdRuns: the number of separate production runs during the month (Bendrix manufactures parts in fairly large batches called production runs. Between each run there is a downtime). Each observation (row) corresponds to a single month. We need to estimate and interpret the equation for Overhead when both explanatory variables, MachHrs and ProdRuns, are included in the regression equation, but because these are time series variables we should also look out for relationships between these variables and the Month variable. Bendrix.xls
Statistical Modeling - 13 US Army Logistics Management College Multiple Regression with StatTools Bendrix Automotive Parts Company Select StatTools + Regression & Classification + Regression… Check “Overhead” as the response (dependent) variable. Check “MachHrs” and “ProdRuns” as the explanatory (independent) variables. Select the Graph options in the dialog box as shown here.
Statistical Modeling - 14 US Army Logistics Management College Multiple Regression Output Table Bendrix Automotive Parts Company The coefficients in B18:B20 indicate that the estimated regression equation is Predicted Overhead = 3997 + (43.54 x MachHrs) + (ProdRuns coefficient x ProdRuns)
Statistical Modeling - 15 US Army Logistics Management College Interpretation of Equation Bendrix Automotive Parts Company If the number of production runs is held constant, then the overhead cost is expected to increase by $43.54 for each extra machine hour. If the number of machine hours is held constant, the overhead is expected to increase by the ProdRuns coefficient (in dollars) for each extra production run. $3997 is the fixed component of overhead. The slope terms involving MachHrs and ProdRuns are the variable components of overhead.
Statistical Modeling - 16 US Army Logistics Management College Equation Comparison Bendrix Automotive Parts Company It is interesting to compare the separate single-variable equations, Predicted Overhead = 48, (MachHrs) and Predicted Overhead = 75, (ProdRuns), with the two-variable equation, Predicted Overhead = 3, MachHrs ProdRuns. Note that both slope coefficients have increased. Also, the intercept is now lower than either intercept in the single-variable equations. It is difficult to guess the changes that more explanatory variables will cause, but it is likely that changes will occur. The reasoning is that when MachHrs is the only variable in the equation, we are obviously not holding ProdRuns constant - we are ignoring it - so in effect the coefficient 34.7 of MachHrs indicates the combined effect of MachHrs and the omitted ProdRuns on Overhead. But when we include both variables, the coefficient 43.5 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant. Since the coefficients have different meanings, it is not surprising that we obtain different estimates.
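The way a coefficient changes when an omitted variable is added can be sketched in Python. Everything below is simulated under stated assumptions (the generating coefficients, the noise levels, and a negative relationship between the two explanatory variables); it is not the Bendrix data, and `fit_ols` is a helper name introduced only for this illustration.

```python
import numpy as np

def fit_ols(columns, y):
    """Least squares fit with an intercept; returns [intercept, slope_1, ...]."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

rng = np.random.default_rng(0)
n = 36  # 36 months, as in the example; the values themselves are simulated
mach_hrs = rng.uniform(1000, 2000, n)
# ASSUMPTION: months with more machine hours tend to have fewer production runs.
prod_runs = 60 - 0.015 * mach_hrs + rng.normal(0, 5, n)
overhead = 4000 + 40 * mach_hrs + 900 * prod_runs + rng.normal(0, 4000, n)

both = fit_ols([mach_hrs, prod_runs], overhead)   # holds ProdRuns constant
mach_only = fit_ols([mach_hrs], overhead)         # ignores ProdRuns

print("Both variables [const, MachHrs, ProdRuns]:", np.round(both, 2))
print("MachHrs only   [const, MachHrs]:          ", np.round(mach_only, 2))
```

In this simulation the MachHrs-only slope absorbs part of the ProdRuns effect, so it comes out lower than the slope estimated with both variables in the equation. The direction of the change depends on how the explanatory variables are related to each other.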
Statistical Modeling - 17 US Army Logistics Management College Modeling Possibilities Fifth National Bank Gender-Discrimination Suit The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank’s employee database is listed in this file. Here is a partial list of the data. Bank.xls
Statistical Modeling - 18 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Variables Fifth National Bank Gender-Discrimination Suit For each of the 208 employees, the variables in the data set are: EducLev: education level with categories 1 (high school grad), 2 (some college), 3 (bachelor’s degree), 4 (some graduate courses) & 5 (graduate degree) JobGrade: current job level, the possible levels being from 1-6 (6 is highest) YrHired: year employee was hired Salary: current annual salary in thousands of dollars YrBorn: year employee was born Gender: a categorical variable with values “Female” and “Male” YrsPrior: number of years of work experience at another bank prior to working at Fifth National PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise Do the data provide evidence that females are discriminated against in terms of salary?
Statistical Modeling - 19 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Naïve Approach Fifth National Bank Gender-Discrimination Suit A naïve approach to the problem is to compare the average salaries of the males and females. The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505. The difference between the averages is statistically significant. The females are definitely earning less, but perhaps there is a reason. The question is whether the difference between the average salaries is still evident after taking other attributes into account. This is a perfect task for regression.
Statistical Modeling - 20 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Dummy Variables Fifth National Bank Gender-Discrimination Suit Some potential explanatory variables are categorical and cannot be measured on a quantitative scale. However, we often need to use these variables because they are related to the response variable. The trick is to create dummy variables, also called indicator or 0-1 variables, that indicate the category a given observation is in. To create dummy variables we can use an IF statement or we can use StatTools’ Dummy variable procedure, which is usually easier particularly when there are multiple categories. Once the dummy variables are created, we can combine the variables if we like by simply adding the columns to get the dummy for the new category.
Statistical Modeling - 21 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Regression Analysis w/Dummy Variables Fifth National Bank Gender-Discrimination Suit In this example we create dummy variables for Gender and JobGrade. We also create another variable: YrsExper = 95 - YrHired (since this is 1995 data). We must follow two rules: We shouldn’t use any of the original categorical variables that the dummies are based on. We should use one less dummy than the number of categories for any categorical variable. Then we can run a regression analysis with Salary as the response variable, using any combination of numerical and dummy explanatory variables.
Statistical Modeling - 22 US Army Logistics Management College Creating Dummy Variables Gender Categorical Variable To create a dummy variable called Female for Gender: Select any data cell. From the Menu bar, select StatTools + Data Utilities + Dummy… Select “Gender” as the variable. Select “Create One Dummy Variable for Each Distinct Category”. Answer “Yes” to warnings. Repeat the procedure for JobGrade.
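The two rules for dummy variables can be sketched in plain Python. The rows below are made up and are not the bank's data; `make_dummies` and its `reference` argument are hypothetical names for this illustration and do not correspond to anything in StatTools.

```python
def make_dummies(values, prefix, reference):
    """Return {column_name: list of 0/1} for every category except `reference`.

    Leaving out the reference category gives one less dummy than the
    number of categories, as the second rule requires.
    """
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != reference
    }

# A few made-up rows, NOT the bank's data.
gender = ["Female", "Male", "Female", "Male"]
job_grade = [1, 3, 2, 3]
yr_hired = [88, 75, 92, 81]

female = make_dummies(gender, "Gender", reference="Male")
job = make_dummies(job_grade, "Job", reference=1)
yrs_exper = [95 - y for y in yr_hired]  # YrsExper = 95 - YrHired

# Combining categories: adding two dummy columns gives the dummy for
# the merged category "job grade 2 or 3".
job_2_or_3 = [a + b for a, b in zip(job["Job_2"], job["Job_3"])]

print(female)
print(job)
print(yrs_exper, job_2_or_3)
```

The regression would then use `Gender_Female`, the `Job_*` columns, and `YrsExper`, but not the original `Gender` or `JobGrade` columns, in line with the first rule.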
Statistical Modeling - 23 US Army Logistics Management College Regression Analysis Gender Only We first estimate a regression equation with Female as the only variable. The resulting equation has the form: Predicted Salary = Constant + (Female coefficient x Female). To interpret this equation, recall that Female has only two possible values, 0 and 1. If we substitute 1, the predicted salary equals the average female salary ($37,210), and if we substitute 0, the predicted salary is the average male salary ($45,505). Therefore the interpretation of the coefficient of the Female dummy variable is straightforward: it is the difference between the two averages. The above equation tells only part of the story; it ignores all information except for gender.
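A small Python check makes the point concrete: regressing on a single 0-1 dummy reproduces the two group averages. The six salaries below are made up (in thousands of dollars) and are not the bank's data.

```python
import numpy as np

# Made-up salaries in $1000s, NOT the bank's data.
salary = np.array([30.0, 34.0, 38.0, 40.0, 46.0, 52.0])
female = np.array([1, 1, 1, 0, 0, 0])

# Least squares fit of Salary on an intercept and the Female dummy.
X = np.column_stack([np.ones(len(salary)), female])
(intercept, coef_female), *_ = np.linalg.lstsq(X, salary, rcond=None)

male_mean = salary[female == 0].mean()
female_mean = salary[female == 1].mean()

print("Intercept:", round(float(intercept), 3), "| male average:", male_mean)
print("Intercept + coefficient:", round(float(intercept + coef_female), 3),
      "| female average:", female_mean)
```

The intercept matches the average for the reference category (males), and the dummy's coefficient is the female average minus the male average.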
Statistical Modeling - 24 US Army Logistics Management College Regression Analysis Gender + YrsExper + YrsPrior We expand this equation by adding YrsExper and YrsPrior. The corresponding equation has the form: Predicted Salary = a + (b1 x YrsExper) + (b2 x YrsPrior) + (b3 x Female), where a, b1, b2, and b3 stand for the estimated coefficients in the output. It is useful to write two separate equations, one for females: Predicted Salary = (a + b3) + (b1 x YrsExper) + (b2 x YrsPrior), and one for males: Predicted Salary = a + (b1 x YrsExper) + (b2 x YrsPrior). We interpret the coefficient of the Female dummy variable as the average salary disadvantage for females relative to males after controlling for job experience. But there is still more story to tell.
Statistical Modeling - 25 US Army Logistics Management College Regression Analysis Gender + YrsExper + YrsPrior + JobGrade We next add job grade to the equation by including five of the six job grade dummies. Although any five can be used, we use Job_2 - Job_6. The estimated regression equation now has the form: Predicted Salary = a + (b1 x YrsExper) + (b2 x YrsPrior) + (b3 x Female) + (c2 x Job_2) + (c3 x Job_3) + (c4 x Job_4) + (c5 x Job_5) + (c6 x Job_6), where the letters stand for the estimated coefficients in the output. There are now two categorical variables involved, gender and job grade. However, we can still write a separate equation for any combination of categories by setting the dummies to the appropriate values.
Statistical Modeling - 26 US Army Logistics Management College Interpretation Gender + YrsExper + YrsPrior + JobGrade The equation for females at the fifth job grade is found by setting Female=1, Job_5=1, & the other job dummies equal to 0: Predicted Salary = (Constant + Female coefficient + Job_5 coefficient) + (YrsExper coefficient x YrsExper) + (YrsPrior coefficient x YrsPrior). The expected salary increase for one extra year of experience is $408; the expected salary increase for one year of experience with another bank is $149 (either gender and any job grade). The coefficients of the job dummies indicate the average increase in salary an employee can expect relative to the reference (lowest) job grade. The key coefficient, the negative $1962 for females, indicates the average salary disadvantage for females relative to males, given that they have the same experience levels and are in the same job grade. The “penalty” is less than a fourth of the penalty we saw before. It appears that females might be getting paid less on average partly because they are in the lower job categories.
Statistical Modeling - 27 US Army Logistics Management College Pivot Table Concentration of Females in Lower Paid Jobs We can use a pivot table to check whether females are disproportionately in the lower job categories (set JobGrade in the row area, Gender in the column area and the count (expressed as a percentage) of any variable in the data area). Clearly, females tend to be concentrated at the lower job grades. This helps explain why females get lower salaries on average, but doesn’t explain why females are at the lower job grades in the first place. We won’t be able to provide a thorough analysis of this issue.
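The pivot-table check can be approximated in plain Python with a counter. The rows are made up and are not the bank's data. This sketch assumes the percentages are taken within each gender (column percentages), which is one reasonable reading of "count expressed as a percentage"; `percent_by_gender` is a hypothetical helper name.

```python
from collections import Counter

def percent_by_gender(job_grades, genders):
    """For each (job grade, gender) cell, the share of that gender's employees, in percent."""
    counts = Counter(zip(job_grades, genders))
    totals = Counter(genders)
    return {
        (grade, g): 100.0 * c / totals[g]
        for (grade, g), c in sorted(counts.items())
    }

# Made-up rows, NOT the bank's data.
job_grades = [1, 1, 2, 2, 3, 1]
genders = ["Female", "Female", "Female", "Male", "Male", "Male"]

table = percent_by_gender(job_grades, genders)
for (grade, g), pct in table.items():
    print(f"JobGrade {grade}, {g}: {pct:.1f}%")
```

Comparing the Female and Male percentages grade by grade shows whether one group is concentrated at the lower grades.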
Statistical Modeling - 28 US Army Logistics Management College Conclusion The main conclusion we can draw from the output is that there is still a plausible case to be made for discrimination against females, even after including information on all the variables in the database in the regression equation.
Statistical Modeling - 29 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Interaction Terms Fifth National Bank Gender-Discrimination Suit An interaction variable algebraically is the product of two variables. Its effect is to allow the effect of one of the variables on Y to depend on the value of the other variable. The interaction term allows the slope of the regression line to differ between the two categories. Earlier we estimated an equation for Salary using the numerical explanatory variables YrsExper and YrsPrior and the dummy variable Female. If we drop the YrsPrior variable from the equation (for simplicity) and rerun the regression, we obtain an equation of the form Predicted Salary = a + (b1 x YrsExper) + (b2 x Female), where a, b1, and b2 stand for the estimated coefficients. The R² value for this equation is 49.1%. If we decide to include an interaction variable between YrsExper and Female in this equation, what is the effect?
Statistical Modeling - 30 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Solution with Interaction Terms Fifth National Bank Gender-Discrimination Suit We first need to form an interaction variable that is the product of YrsExper and Female. This can be done two ways in Excel. Do it manually by introducing a new variable that contains the product of the two variables involved, or Use: StatTools + Data Utilities + Interaction… Using the latter way we must select Female and YrsExper as the variables. Once the interaction variable has been created, we include it in the regression equation in addition to the other variables.
Statistical Modeling - 31 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Interpretation w/ Interaction Terms Fifth National Bank Gender-Discrimination Suit The estimated regression equation has the form Predicted Salary = a + (b1 x YrsExper) + (b2 x Female) + (b3 x YrsExper_Female), where a, b1, b2, and b3 stand for the estimated coefficients in the output. The female equation is: Pred Salary = (a + b2) + ((b1 + b3) x YrsExper) & the male equation is: Pred Salary = a + (b1 x YrsExper). Graphically - Nonparallel Female and Male Salary Lines
Statistical Modeling - 32 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Conclusion w/Interaction Terms Fifth National Bank Gender-Discrimination Suit The Y-intercept for the female line is slightly higher - females with no experience at Fifth National Bank tend to start out slightly higher than males - but the slope of the female line is much lower. That is, males tend to move up the salary ladder much more quickly than females. Again, this provides another argument, although a somewhat different one, for gender discrimination against females. The R² value increased from 49.1% to 63.9%. The interaction variable has definitely added to the explanatory power of the equation.
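How an interaction variable produces nonparallel lines can be sketched in Python. The salaries below are simulated under assumed coefficients (males gain 1.5 per year of experience, females 0.5, in thousands of dollars); none of the numbers come from the bank's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
yrs_exper = rng.uniform(0, 30, n)
female = rng.integers(0, 2, n)  # 0 = male, 1 = female
# ASSUMED generating equation, in $1000s: female intercept is 1.0 higher,
# female slope is 1.0 lower.
salary = (30 + 1.5 * yrs_exper + 1.0 * female
          - 1.0 * yrs_exper * female + rng.normal(0, 3, n))

interaction = yrs_exper * female  # the product of the two variables
X = np.column_stack([np.ones(n), yrs_exper, female, interaction])
b0, b1, b2, b3 = np.linalg.lstsq(X, salary, rcond=None)[0]

male_line = (b0, b1)              # intercept and slope when Female = 0
female_line = (b0 + b2, b1 + b3)  # intercept and slope when Female = 1

print("Male line:   intercept %.2f, slope %.3f" % male_line)
print("Female line: intercept %.2f, slope %.3f" % female_line)
```

The coefficient of the interaction variable is the difference between the female and male slopes, so a negative estimate means the female line rises more slowly with experience.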
Statistical Modeling - 33 US Army Logistics Management College Part 2: Regression Analysis Statistical Inference
Statistical Modeling - 34 US Army Logistics Management College Inference About Regression Coefficients Bendrix Automotive Parts Company As before, the response variable is Overhead and the explanatory variables are MachHrs and ProdRuns. What inferences can we make about the regression coefficients? We obtain the output using StatTools. Bendrix1.xls
Statistical Modeling - 35 US Army Logistics Management College Multiple Regression Output Bendrix Automotive Parts Company Regression coefficients estimate the true, but unobservable, population coefficients. The standard error of b_i indicates the accuracy of these point estimates. For example, the estimated effect on Overhead of a one-unit increase in MachHrs is 43.54. We are 95% confident that the true coefficient lies within the confidence interval reported in the output. Similar statements can be made for the coefficient of ProdRuns and the intercept term. Predicted Overhead = 3997 + (43.54 x MachHrs) + (ProdRuns coefficient x ProdRuns)
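The standard errors and confidence intervals in the output can be reproduced in outline with Python. The data below are simulated, not the Bendrix data; `ols_with_se` is a helper name introduced here, and the multiplier 2.03 is an approximate 97.5th percentile of the t distribution with 33 degrees of freedom, entered by hand as an assumption rather than computed.

```python
import numpy as np

def ols_with_se(X, y):
    """Return coefficients, their standard errors, and residual degrees of freedom."""
    X = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    dof = len(y) - X.shape[1]
    s_e = np.sqrt(resid @ resid / dof)                      # standard error of estimate
    se_b = s_e * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))   # standard errors s(b_i)
    return coefs, se_b, dof

# Simulated months, NOT the Bendrix data.
rng = np.random.default_rng(7)
n = 36
mach_hrs = rng.uniform(1000, 2000, n)
prod_runs = rng.uniform(20, 60, n)
overhead = 4000 + 40 * mach_hrs + 900 * prod_runs + rng.normal(0, 4000, n)

coefs, se_b, dof = ols_with_se(np.column_stack([mach_hrs, prod_runs]), overhead)
t_mult = 2.03  # ASSUMPTION: approximate t multiplier for 95% confidence, 33 dof
lower = coefs - t_mult * se_b
upper = coefs + t_mult * se_b

for name, b, lo, hi in zip(["Constant", "MachHrs", "ProdRuns"], coefs, lower, upper):
    print(f"{name}: {b:.2f}  (95% CI {lo:.2f} to {hi:.2f})")
```

Each interval is the point estimate plus or minus a t multiplier times the coefficient's standard error, so a smaller standard error gives a tighter interval.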
Statistical Modeling - 36 US Army Logistics Management College A Test for the Overall Fit: The ANOVA Table Bendrix Automotive Parts Company Does the ANOVA table for the Bendrix manufacturing data indicate that the combination MachHrs and ProdRuns has at least some ability to explain variation in Overhead? The F-ratio is “off the charts” and the p-value is practically 0.
Statistical Modeling - 37 US Army Logistics Management College Interpretation of the ANOVA Table Bendrix Automotive Parts Company This information wouldn’t be much comfort for the Bendrix manager who is trying to understand the causes of variation in overhead costs. This manager already knows that machine hours and production runs are related positively to overhead costs - everyone in the company knows that! What he really wants is a set of explanatory variables that yields a high R² and a low s_e. The low p-value in the ANOVA table does not guarantee these. All it guarantees is that MachHrs and ProdRuns are of “some help” in explaining variation in Overhead.
Statistical Modeling - 38 US Army Logistics Management College Violations of Regression Assumptions Bendrix Automotive Parts Company Is there evidence of nonconstant variance? Is there any evidence of lag 1 autocorrelation in the Bendrix data when Overhead is regressed on MachHrs and ProdRuns? Is there evidence of nonnormality?
Statistical Modeling - 39 US Army Logistics Management College Do the Residuals Have Constant Variance? Bendrix Automotive Parts Company If the residual variance is not constant, the standard error of the regression coefficient, s(b_i), is incorrect. Note: when we ran the regression we selected the “Residuals vs Fitted Values” graph.
Statistical Modeling - 40 US Army Logistics Management College Plot of Residuals vs Fitted Values Bendrix Automotive Parts Company Residuals appear to have equal variances (homoscedasticity)
Statistical Modeling - 41 US Army Logistics Management College Autocorrelated Residuals Bendrix Automotive Parts Company The residuals of time series data are often autocorrelated. The most frequent type of autocorrelation is positive autocorrelation. For example, if residuals separated by 1 month are autocorrelated, this is called lag 1 autocorrelation. We use the fitted values (col C) and residual values (col D) in the “Regression” tab. The residuals represent how much the regression over-predicts (if negative) or under-predicts (if positive) the overhead cost for that month.
Statistical Modeling - 42 US Army Logistics Management College Durbin-Watson Test Bendrix Automotive Parts Company We can check for lag 1 autocorrelation in two ways: with the Durbin-Watson (DW) statistic and by examining the time series graph of the residuals. The DW statistic is scaled between 0 and 4: a value near 2 indicates little lag 1 autocorrelation, a value below 2 indicates positive autocorrelation, and a value above 2 indicates negative autocorrelation. (As a guideline, if n = 30 and there are 1 to 5 explanatory variables, a DW value below 1.2 is a problem.) We calculate the DW statistic in cell E45 with the formula: =StatDurbinWatson(D45:D80) Based on our guidelines, the DW value suggests positive autocorrelation - it is less than 2 - but not enough to cause concern.
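For readers without StatTools, the standard Durbin-Watson formula (sum of squared successive differences of the residuals divided by the sum of squared residuals) can be sketched in Python. The two short residual series are made up to show the two extremes; whether `StatDurbinWatson` uses exactly this computation is an assumption.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Made-up residual series, NOT the Bendrix residuals.
persistent = [1.0, 1.0, -1.0, -1.0]    # positives follow positives: DW below 2
alternating = [1.0, -1.0, 1.0, -1.0]   # signs flip every period: DW above 2

print("Persistent residuals: ", durbin_watson(persistent))
print("Alternating residuals:", durbin_watson(alternating))
```

Residuals that stay on the same side of zero for several periods push the statistic below 2, while residuals that keep switching sign push it above 2.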
Statistical Modeling - 43 US Army Logistics Management College Time Series Graph of Residuals Bendrix Automotive Parts Company This general conclusion is supported by the time series graph. Serious autocorrelation of lag 1 would tend to show long runs of residuals alternating above and below the horizontal axis - positives would tend to follow positives and negatives would tend to follow negatives. There is some indication of this in the graph but not an excessive amount. To create the graph: add the range A44:D80 as a data set; select StatTools + Time Series & Forecasting + Time Series Graph; select Residuals as the variable.
Statistical Modeling - 44 US Army Logistics Management College Are the Residuals Normally Distributed? Bendrix Automotive Parts Company The inferences we want to make assume the residuals are normally distributed. Using Data Set #2, select StatTools + Normality Tests + Q-Q Normal Plot. Select “Residuals” as the variable. Check “Plot Using Standardized Q-Values” and “Include Reference Line”.
Statistical Modeling - 45 US Army Logistics Management College Normal Probability Plot Bendrix Automotive Parts Company Error terms appear to be Normally Distributed
Statistical Modeling - 46 US Army Logistics Management College Multicollinearity Height vs Left & Right Feet The relationship between the explanatory variable X and the response variable Y is not always accurately reflected in the coefficient of X; it depends on which other X’s are included or not included in the equation (especially when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity). Multicollinearity is the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult. We want to explain a person’s height by means of foot length. The response variable is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively. It is likely that there is a large correlation between height and foot size, so we would expect this regression equation to do a good job. The R² value will probably be large. But what about the coefficients of Right and Left?
Statistical Modeling - 47 US Army Logistics Management College Correlation of Left & Right Height vs Left & Right Feet To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths in this file (Height.xls). We did this so that, except for random error, height is approximately 32 plus 3.2 times foot length (in inches). The correlations between Height and either Right or Left in our data set are quite large, and the correlation between Right and Left is very close to 1. To compute the correlations, select StatTools + Summary Statistics + Correlation & Covariance.
Statistical Modeling - 48 US Army Logistics Management College Multiple Regression Height vs Left & Right Feet The regression output tells a somewhat confusing story. The multiple R and the corresponding R² are about what we would expect, given the correlations between Height and either Right or Left. In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the s_e value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches. However, the coefficients of Right and Left are not at all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length. In fact, the coefficient of Left has the wrong sign - it is negative! Besides this wrong sign, the tip-off that there is a problem is that the t-value of Left is quite small and the corresponding p-value is quite large.
Statistical Modeling - 49 US Army Logistics Management College Solution Judging by this, we might conclude that Height and Left are either not related or are related negatively. But we know from the table of correlations that both of these are false. In contrast, the coefficient of Right has the “correct” sign, and its t-value and associated p-value do imply statistical significance, at least at the 5% level. However, this happened mostly by chance; slight changes in the data could change the results completely.
Statistical Modeling - 50 US Army Logistics Management College Solution Although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects. Note that the sum of the two coefficients is close to the coefficient of 3.2 we used to generate the data. Therefore, the estimated equation will work well for predicting heights, but it does not provide reliable estimates of the individual coefficients of Right and Left. When Right is the only explanatory variable, the equation has the form Predicted Height = a + (b x Right); R² = 81.6%, s_e = 2.005, and the t-value and p-value for the coefficient of Right show it to be very significant. When Left is the only explanatory variable, the equation has the form Predicted Height = a + (b x Left); R² = 81.1%, s_e = 2.033, and the t-value of 20.99 for the coefficient of Left is again very significant. Clearly, both of these equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.
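The multicollinearity effect can be reproduced with a small simulation in Python. It follows the generating rule stated earlier (height is roughly 32 plus 3.2 times foot length), but the sample size, the foot-length distribution, and the noise levels are assumptions of this sketch, and the data are not those in Height.xls.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
right = rng.normal(10.5, 1.0, n)          # right foot length in inches (simulated)
left = right + rng.normal(0, 0.1, n)      # left foot is nearly identical to the right
height = 32 + 3.2 * right + rng.normal(0, 2, n)

def fit(columns):
    """Least squares fit of height on an intercept plus the given columns."""
    X = np.column_stack([np.ones(n)] + columns)
    return np.linalg.lstsq(X, height, rcond=None)[0]

both = fit([right, left])
right_only = fit([right])
left_only = fit([left])
corr = np.corrcoef(right, left)[0, 1]

print("Correlation(Right, Left):", round(float(corr), 4))
print("Both:  Right %.2f, Left %.2f, sum %.2f" % (both[1], both[2], both[1] + both[2]))
print("Right only: %.2f   Left only: %.2f" % (right_only[1], left_only[1]))
```

With both feet in the equation the individual coefficients can land far from 3.2, and one may even be negative, while their sum and the single-variable slopes stay close to 3.2. Changing the seed changes how the total is split between Right and Left, which is the instability the slides describe.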
Statistical Modeling - 51 US Army Logistics Management College Stepwise Regression HyTex Catalogs HyTex is a direct marketer of stereo equipment, personal computers, and other electronic products. HyTex advertises entirely by mailing catalogs to its customers, and all of its orders are taken over the telephone. The company spends a great deal of money on its catalog mailings, and it wants to be sure that this is paying off in sales. Data on 250 customers who purchased mail-order products from the HyTex Company in 1998 are available. Stepwise regression will be used to produce a regression equation for the amount spent in 1998.
Statistical Modeling - 52 US Army Logistics Management College The Data HyTex Catalogs For each customer there are data on the following variables: Age: (1 = 30 or younger, 2 = 31 to 55, 3 = 56 and older) Gender: (1 = male, 0 = female) OwnHome: (1 = customer owns home, 0 otherwise) Married: (1 = customer is currently married, 0 otherwise) Close: (1 = customer lives reasonably close to shopping area that sells similar merchandise, 2 otherwise) Salary: combined annual salary of customer and spouse (if any) Children: number of children living with customer Customer97: (1 = customer purchased from HyTex during 1997, 0 otherwise) Spent97: total amount of purchases in 1997 from HyTex Catalogs: number of catalogs sent to the customer in 1998 Spent98: total amount of purchases in 1998 from HyTex
Statistical Modeling - 53 US Army Logistics Management College Stepwise Regression Many statistical packages provide some assistance by including automatic equation-building options. These options estimate a series of regression equations by successively adding (or deleting) variables according to prescribed rules. Generically, these methods are referred to as stepwise regression. There are three types: forward, backward and stepwise. Forward - begins with no explanatory variables in the equation and successively adds one at a time until no explanatory variables make a significant contribution. Backward - begins with all potential explanatory variables in the equation and deletes them one at a time until further deletion would do more harm than good. Stepwise - much like a forward procedure, except that it also considers possible deletions along the way.
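The forward procedure described above can be sketched in Python. This is a minimal illustration on simulated data, not the HyTex data and not StatTools' algorithm: the entry rule used here (an F-to-enter threshold of 4.0) and the names `sse` and `forward_select` are assumptions made for the sketch.

```python
import numpy as np

def sse(columns, y):
    """Sum of squared residuals for a least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return float(resid @ resid)

def forward_select(candidates, y, f_to_enter=4.0):
    """Add one variable at a time while the best candidate's F statistic is large enough.

    candidates: dict mapping variable name -> 1-D array. Returns names in entry order.
    """
    chosen = []
    remaining = dict(candidates)
    current_sse = sse([], y)  # intercept-only equation
    while remaining:
        best_name, best_sse = None, current_sse
        for name, col in remaining.items():
            trial = sse([candidates[c] for c in chosen] + [col], y)
            if trial < best_sse:
                best_name, best_sse = name, trial
        if best_name is None:
            break
        dof = len(y) - (len(chosen) + 2)  # n minus coefficients, including intercept
        f_stat = (current_sse - best_sse) / (best_sse / dof)
        if f_stat < f_to_enter:
            break  # no remaining variable makes a significant contribution
        chosen.append(best_name)
        del remaining[best_name]
        current_sse = best_sse
    return chosen

# Simulated data: x1 and x2 matter, x3 is pure noise.
rng = np.random.default_rng(3)
n = 250
data = {name: rng.normal(size=n) for name in ["x1", "x2", "x3"]}
y = 10 + 5 * data["x1"] + 2 * data["x2"] + rng.normal(size=n)

order = forward_select(data, y)
print("Variables in order of entry:", order)
```

A backward procedure would start from the full equation and drop the weakest variable at each step. A stepwise procedure would add a deletion check after each entry.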
Statistical Modeling - 54 US Army Logistics Management College Stepwise Regression in StatTools HyTex Catalogs Select StatTools + Regression & Classification + Regression. Select Regression Type: Stepwise. Specify Spent98 as the response variable and select all of the other variables (besides Customer) as potential explanatory variables. Choose p-values or F-values as the appropriate criterion.
Statistical Modeling - 55 US Army Logistics Management College Interpretation of Final Regression Equation The coefficient of Catalogs implies that $42.00 more was spent for each catalog sent. The coefficient of Married gives the additional amount spent by a married customer, and the coefficient of OwnHome gives the additional amount spent by a customer who owns a home. The coefficients for Spent97 and Customer97 are somewhat more difficult to interpret. First, both variables are 0 for customers who didn’t purchase the previous year. For those who did, the terms become -1, Spent97.
Statistical Modeling - 56 US Army Logistics Management College The Partial F Test
Statistical Modeling - 57 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit The Partial F Test The Fifth National Bank is facing a gender-discrimination suit charging that its female employees receive substantially smaller salaries than its male employees. Previously we ran several regressions for Salary to see whether there is convincing evidence of salary discrimination against females. Now, we will perform the following analysis: We will regress Salary on Gender_Female, Yrs_Exper, and Yrs_Exper*Gender_Female_1. This will be the reduced equation. Then we'll see whether the variables JobGrade_2 through JobGrade_6 add anything significant to the reduced equation. Next, we'll see whether the variables Gender_Female_1*JobGrade_2_1 through Gender_Female_1*JobGrade_6_1 add anything significant to what we already have. Continuing on, we'll see whether EducLev_1 through EducLev_5 add anything significant to what we already have.
Statistical Modeling - 58 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit First Solution First, note that we created all of the dummies and interaction variables with StatTools' Data Utilities procedures. Also, note that we have used three sets of dummies, for gender, job grade and education level. When we use these in a regression equation, the dummy for one category of each should always be excluded; it is the reference category. The reference categories we have used are "male", job grade 1 and education level 1. The "smallest" equation uses Gender_Female, Yrs_Exper, and Yrs_Exper*Gender_Female_1 as explanatory variables. We're off to a good start. These three variables already explain 63.9% of the variation of Salary. The next equation adds the explanatory variables JobGrade_2 through JobGrade_6.
Statistical Modeling - 59 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Second Solution This equation appears much better ($R^2$ increased to 81.1%). Check whether it is significantly better with the partial F test. Calculate the F-ratio. Given $SSE_R$ (from the reduced equation), $SSE_C$ and $MSE_C$ (from the complete equation), and k – j = 8 – 3 = 5 (the number of extra variables), the F-ratio is $F = \frac{(SSE_R - SSE_C)/(k - j)}{MSE_C}$. Calculate the corresponding p-value. Using Excel, the formula is "=FDIST(x, dof1, dof2)", where x is the result of the partial F test (above), dof1 is the number of additional variables (k – j), and dof2 is the residual (unexplained) degrees of freedom of the complete equation, n – k – 1. Since FDIST(36.28,5,199) is essentially 0, there is no doubt the added variables contribute to the explanatory power of the equation.
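The partial F calculation on this slide can be written as a small Python helper. This is only a sketch: the function name `partial_f` is hypothetical, and the SSE values in the usage line are made-up placeholders, not the values from the bank regression outputs. The sample size of 208 employees is the one quoted later in the outlier discussion.

```python
def partial_f(sse_reduced, sse_complete, n, k, j):
    """Partial F-ratio for adding k - j variables to a reduced equation.

    n: number of observations
    k: number of explanatory variables in the complete equation
    j: number of explanatory variables in the reduced equation
    Returns (F-ratio, dof1, dof2), ready for Excel's FDIST(x, dof1, dof2).
    """
    dof1 = k - j                       # number of extra variables
    dof2 = n - k - 1                   # residual df of the complete equation
    mse_complete = sse_complete / dof2
    f_ratio = ((sse_reduced - sse_complete) / dof1) / mse_complete
    return f_ratio, dof1, dof2


# Placeholder SSE values, for illustration only.
f_ratio, dof1, dof2 = partial_f(300.0, 200.0, n=208, k=8, j=3)
print(f_ratio, dof1, dof2)
```

With n = 208, k = 8 and j = 3, the helper returns dof1 = 5 and dof2 = 199, matching the arguments passed to FDIST on the slide.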
Statistical Modeling - 60 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Third Solution This equation appears better ($R^2$ increased to 84%). Check whether it is significantly better with the partial F test. Calculate the F-ratio. Given $SSE_R$ (from the reduced equation), $SSE_C$ and $MSE_C$ (from the complete equation), and k – j = 13 – 8 = 5, the F-ratio is $F = \frac{(SSE_R - SSE_C)/(k - j)}{MSE_C}$. Calculate the corresponding p-value. Using Excel, the formula is "=FDIST(x, dof1, dof2)", where x is the result of the partial F test (above), dof1 is the number of additional variables (k – j), and dof2 is the residual (unexplained) degrees of freedom of the complete equation, n – k – 1. Since FDIST(6.9368,5,194) is essentially 0, there is no doubt the added variables contribute to the explanatory power of the equation.
Statistical Modeling - 61 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Fourth Solution This equation seems very slightly better ($R^2$ increased to 84.7%). Check whether it is significantly better with the partial F test. Calculate the F-ratio. Given $SSE_R$ (from the reduced equation), $SSE_C$ and $MSE_C$ (from the complete equation), and k – j = 17 – 13 = 4, the F-ratio is $F = \frac{(SSE_R - SSE_C)/(k - j)}{MSE_C}$. Calculate the corresponding p-value. Using Excel, the formula is "=FDIST(x, dof1, dof2)", where x is the result of the partial F test (above), dof1 is the number of additional variables (k – j), and dof2 is the residual (unexplained) degrees of freedom of the complete equation, n – k – 1. Since the p-value FDIST(2.383,4,190) exceeds 0.05, we cannot be 95% confident the added variables contribute to the explanatory power of the equation. We therefore choose not to include them in the model.
Statistical Modeling - 62 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Solution According to the partial F test, the variables added to the fourth equation do not improve the solution enough to qualify for statistical significance at the 5% level. Based on this evidence, there is not much to gain from including the education dummies in the equation, so we would probably elect to exclude them. As a result, the third solution is considered the complete solution.
Statistical Modeling - 63 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Concluding Comments The partial F test is the formal test of significance for an extra set of variables. Many users look only at the $R^2$ and/or $s_e$ values to check whether extra variables are doing a "good job". If the partial F test shows that a block of variables is significant, it does not imply that each variable in this block is significant. Some added variables can have low t-values.
Statistical Modeling - 64 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Concluding Comments Producing all of these outputs and completing the partial F test is a lot of work. StatTools includes a routine called "Block" that simplifies the process. Select StatTools + Regression & Classification + Regression. Select Regression Type: Block. Choose 4 blocks and identify which additional variables enter each block.
Statistical Modeling - 65 US Army Logistics Management College Concluding Comments While we have concentrated on the partial F test and statistical significance in this example, don’t lose sight of the bigger picture. Once we have decided on a “final” regression equation we need to analyze its implications for the problem at hand. In this case the bank is interested in possible salary discrimination against females, so we should interpret this final equation in these terms. Don’t get so caught up in the details of statistical significance that you lose sight of the original purpose of the analysis!
Statistical Modeling - 66 US Army Logistics Management College Fifth National Bank Gender-Discrimination Suit Outliers Are there any obvious outliers among the 208 employees? In what sense are they outliers? Does it matter to the regression results, particularly those concerning gender discrimination, whether the outliers are removed? There are several places we could look for outliers. An obvious place is the Salary variable. The boxplot shown here indicates that there are several employees making substantially more in salary than most of the employees.
Statistical Modeling - 67 US Army Logistics Management College Solution We could consider these outliers and remove them, arguing perhaps that these are senior managers who shouldn’t be included in the discrimination analysis. We leave it to you to check whether the regression results are any different with these high salary employees than without them. Another place to look is at the scatterplot of the residuals versus the fitted values. This type of plot shows points with abnormally large residuals. For example, we ran the Regression with Female, YrsExper, Fem-YrsExper and the five job grade dummies, and we obtained the output and scatterplot shown here.
Statistical Modeling - 68 US Army Logistics Management College Solution This scatterplot has several points that could be considered outliers, but we focus on the point identified in the figure. The residual for this point is approximately -21. Given that $s_e$ for this regression is approximately 5, this residual is over four standard errors below 0, which is quite a lot. This person is found to be unusual, and special circumstances can explain this. We delete this employee and rerun the regression with the same variables.
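The "number of standard errors from 0" check used on this slide is easy to automate. The sketch below is illustrative only: the function name `flag_large_residuals` and the cutoff of 3 standard errors are assumptions, and the residual list is made up apart from the -21 value and the standard error of 5 quoted above.

```python
def flag_large_residuals(residuals, s_e, cutoff=3.0):
    """Return the positions of residuals more than `cutoff` standard errors from 0.

    `cutoff` is an assumed rule of thumb, not a value taken from the slides.
    """
    return [i for i, r in enumerate(residuals) if abs(r / s_e) > cutoff]


# Made-up residuals; only the -21 and s_e = 5 come from the slide.
residuals = [2.0, -21.0, 4.5]
print(flag_large_residuals(residuals, s_e=5.0))
print(-21.0 / 5.0)  # standardized residual of the highlighted point
```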
Statistical Modeling - 69 US Army Logistics Management College Solution Recalling that gender discrimination is the key issue in this example, we compare the coefficients of Female and Fem_YrsExper in the two outputs. The coefficient of Female has dropped. In words, the Y-intercept for the female regression line used to be about $6000 higher than for the male line; now it is only about $4350 higher. More importantly, the coefficient of Fem_YrsExper has also changed. This coefficient indicates how much less steep the female line for Salary versus Yrs_Exper is than the male line, and the change indicates less discrimination against females now than before. This unusual female employee accounts for a good bit of the discrimination argument, although a strong argument still exists even without her.
Statistical Modeling - 70 US Army Logistics Management College Part 3: Analysis of Variance and Experimental Design
Statistical Modeling - 71 US Army Logistics Management College One-Way ANOVA
Statistical Modeling - 72 US Army Logistics Management College One-way ANOVA A one-way analysis of variance, or one-way ANOVA, is the procedure for analyzing the differences among more than two population means. A one-way ANOVA is also used in randomized experiments where a single population is treated in one of several ways. The data analysis in these two situations is identical; only the interpretation of the results differs.
Statistical Modeling - 73 US Army Logistics Management College One-way ANOVA Process The one-way ANOVA procedure is usually run in two stages. The first stage tests the null hypothesis that all of the population means are equal. If the p-value is not sufficiently small, then there is not enough evidence to reject the equal-means hypothesis, and the analysis stops. If the p-value is sufficiently small, we can conclude with some assurance that the means are not all equal. If the means are not all equal, the second stage determines which of the groups differ significantly from the others.
Statistical Modeling - 74 US Army Logistics Management College Background Information Effects of Shelf Height on Cereal Sales at Midway Midway Company selects 125 stores in its chain of supermarkets to conduct an experiment on cereal sales. These stores are similar in terms of store size, customer traffic, customer types, and other characteristics. Each store stocks cereal in a similar location in the store on five-shelf displays. The 125 stores are randomly assigned to one of five groups, where each group stocks Brand X cereal in a specific shelf location (highest, next highest, middle, next lowest, lowest). The number of Brand X boxes sold at each store is recorded for the last two weeks of the experiment (the first two weeks allow customers to get used to the shelving positions). Objective: does shelf height make any difference in mean sales of Brand X cereal, and if so, which shelf heights outperform the others?
Statistical Modeling - 75 US Army Logistics Management College One-way ANOVA Solution Cereal Sales at Midway Note that this is a designed experiment: the initial stores were chosen in an attempt to control for extraneous factors, and the stores were randomly assigned to treatment levels (shelf heights). Select Statistical Inference + One Way ANOVA. The output consists of three basic parts: summary statistics, the ANOVA table, and confidence intervals. The next slide contains this output.
Statistical Modeling - 76 US Army Logistics Management College One-way ANOVA Solution Cereal Sales at Midway
Statistical Modeling - 77 US Army Logistics Management College Summary Statistics Cereal Sales at Midway The summary statistics show that the next to highest shelf position has the largest mean store sales (426.28), and the lowest shelf has the smallest mean store sales (334.92), with the others in between. The sample standard deviations (or variances) vary somewhat across the shelf positions, but not enough to invalidate the procedure (we assume equal variance). The side-by-side boxplots in the figure on the next slide illustrate these summary measures graphically. However, there is too much overlap to tell whether the differences are statistically significant.
Statistical Modeling - 78 US Army Logistics Management College Boxplot of Mean Results by Region Cereal Sales at Midway
Statistical Modeling - 79 US Army Logistics Management College ANOVA Table Results Cereal Sales at Midway The total variation in the ANOVA table is based on the variation of all observations around the grand mean in the summary section, and is used mainly to aid in calculations. The grand mean is the sample mean of all observations. The between variation is the sum of squared differences between the treatment level means and the grand mean, weighted by the treatment sample sizes (df = number of groups – 1). The within variation is the variation due to differences within individual treatment groups (df = total sample size – number of groups). The F-ratio for the test and its corresponding p-value appear in the ANOVA table (since the p-value is < .05, we reject the null hypothesis that all means are equal). Since the means are not all equal, we proceed to a comparison test to determine which means differ.
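The between and within sums of squares described on this slide can be computed directly. The following Python sketch uses a hypothetical function name `one_way_anova` and made-up toy groups (not the Midway data); the p-value would still come from the F distribution, for example via Excel's FDIST.

```python
def one_way_anova(groups):
    """Build the pieces of a one-way ANOVA table from a list of samples."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between variation: squared distance of each group mean from the
    # grand mean, weighted by the group's sample size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within variation: squared distance of each observation from its group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between, "df_between": df_between,
        "ss_within": ss_within, "df_within": df_within,
        "f_ratio": f_ratio,
    }


# Toy data: three groups of three observations each.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(table)
```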
Statistical Modeling - 80 US Army Logistics Management College Results Cereal Sales at Midway The final section of output lists a set of multiple comparisons, each between two treatment levels (shelf heights). The difference column shows which two shelf heights are being compared, and the mean difference shows how much difference there is between the mean sales for the two shelf heights. The lower and upper limits give the confidence interval for that difference in means; if the lower limit is negative and the upper limit is positive, then 0 is contained in the interval and we cannot conclude that there is a statistically significant difference in sales between those two heights. The only statistically significant difference we can discern is between the next to highest shelf and the lowest shelf (largest and smallest mean sales). The company needs to decide whether that difference is practically significant, and whether any external factors confounded the experiment.
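The rule for reading the confidence intervals (a difference is significant only when the interval excludes 0) can be expressed as a short Python check. The function name `significant_pairs` and the interval values below are hypothetical and are not taken from the Midway output.

```python
def significant_pairs(intervals):
    """Return the comparisons whose confidence interval does not contain 0.

    `intervals` maps a comparison label to its (lower, upper) limits.
    """
    return [pair for pair, (lower, upper) in intervals.items()
            if not (lower <= 0 <= upper)]


# Made-up confidence limits for three pairwise comparisons.
example = {
    "shelf 1 - shelf 2": (-5.0, 10.0),   # contains 0: no significant difference
    "shelf 1 - shelf 3": (2.0, 30.0),    # entirely positive: significant
    "shelf 2 - shelf 3": (-40.0, -3.0),  # entirely negative: significant
}
print(significant_pairs(example))
```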
Statistical Modeling - 81 US Army Logistics Management College Two-Way ANOVA
Statistical Modeling - 82 US Army Logistics Management College Background Information Golf Ball Testing Many golf ball manufacturers claim to have the “longest ball,” that is, the ball that goes the farthest on drives. This example illustrates how these claims might be tested by testing five major brands (Brand A through E) A consumer testing service runs an experiment where 60 balls of each brand are driven under three temperature conditions. The first 20 are driven in cool weather, the next 20 are driven in mild weather, and the last 20 are driven in warm weather. The goal is to see whether some brands differ significantly, on average, from other brands and what effect temperature has on the mean differences between brands.
Statistical Modeling - 83 US Army Logistics Management College Experimental Design Golf Ball Testing Unlike the last example, this example represents a controlled experiment (20 golf balls of each brand are randomly assigned to each of three temperature levels). In general terminology, the experimental units are the individual golf balls and the response variable is the length (in yards) of each drive. There are two factors (brand and temperature), each with different treatment levels (brand has levels A through E, and temperature has three levels: cool, mild, and warm). The design is balanced because the same number of balls, 20, is used at each of the 5 x 3 = 15 treatment level combinations. There is one further piece of terminology. We call this a full factorial two-way design because we test golf balls at each of the 15 possible treatment level combinations.
Statistical Modeling - 84 US Army Logistics Management College Conducting the Experiment Golf Ball Testing How should the consumer testing service carry out the experiment? One possibility is to have 15 golfers, each of approximately the same skill level, hit 20 balls each. The downside of this design could be that the golfers assigned to a certain brand could be having a good day. Golfers could be spread out (each golfer could hit 2 balls). This, however, introduces an unwanted source of variation: the different abilities of the golfers. You could use the same golfer for 300 balls. Unfortunately, the golfer might get tired in the process of hitting this many balls. These are the type of things designers of experiments must consider.
Statistical Modeling - 85 US Army Logistics Management College Conducting the Experiment Golf Ball Testing The design should attempt to eliminate as many unwanted sources of variation as possible, so that any difference across the factor levels of interest can be attributed to these factors and not to extraneous factors. In this example, we suspect the best solution is to employ a “mechanical” golf ball driver to hit all 300 balls. This should reduce the inevitable random variation that would occur by using human golfers.
Statistical Modeling - 86 US Army Logistics Management College Coding the data Golf Ball Testing Although many rows in the figure are hidden, there are actually 300 rows of data, 20 for each of the 15 combinations of Brand and Temp. There must be two "code" variables that represent the levels of the two factors and a measurement variable that represents the response variable. Again this is a balanced design, which is what StatTools expects for its two-way ANOVA procedure.
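As an illustration of this layout, the sketch below stores each observation as a (Brand, Temp, Yards) row and checks that the design is balanced and fully crossed. The function name `is_balanced` and the yardage values are made up for the example; only the idea of two code variables plus one measurement variable comes from the slide.

```python
from collections import Counter


def is_balanced(rows):
    """True if every Brand x Temp combination occurs, and equally often."""
    counts = Counter((brand, temp) for brand, temp, _ in rows)
    brands = {brand for brand, _, _ in rows}
    temps = {temp for _, temp, _ in rows}
    every_cell_present = len(counts) == len(brands) * len(temps)
    equal_cell_sizes = len(set(counts.values())) == 1
    return every_cell_present and equal_cell_sizes


# Tiny made-up data set: 2 brands x 2 temperatures x 2 balls per cell.
rows = [
    ("A", "Cool", 221.0), ("A", "Cool", 223.5),
    ("A", "Warm", 240.1), ("A", "Warm", 238.7),
    ("B", "Cool", 225.2), ("B", "Cool", 226.0),
    ("B", "Warm", 244.9), ("B", "Warm", 243.3),
]
print(is_balanced(rows))
```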
Statistical Modeling - 87 US Army Logistics Management College Analysis of Results Golf Ball Testing Prompted by the table, here are some questions we might ask: 1.Look at column I. Do any brands average significantly more yards than any others (where these averages are averages over all temperatures)? 2.Look at the bottom row. Do average yardages differ significantly across temperatures (where these averages are across all brands)? 3.Look at the middle of the table. Do differences among averages of brands depend on temperature? For example, does one brand dominate in cool weather and another in warm weather? Also, do differences among averages of temperatures depend on brand? For example, are some brands very sensitive to changes in temperature while others are not?
Statistical Modeling - 88 US Army Logistics Management College Analysis of Results Golf Ball Testing It is useful to characterize the type of information these questions are seeking. Question 1 is asking about the main effect of the brand factor. If we ignore the temperature, do some brands tend to go farther than some others? Question 2 is also asking about a main effect, the main effect of the temperature factor. If we ignore the brand, do balls tend to go farther in some temperatures than others? (This answer is obvious to golfers: balls compress better and go farther in warm temperatures.) Therefore this is not a key question, although we would expect the study to confirm what common sense tells us. Question 3 is asking about interactions between the two factors. These interactions are often the most interesting results of a two-way study. In this example interactions are patterns of the averages that could not be guessed by looking only at the "main effect" averages.
Statistical Modeling - 89 US Army Logistics Management College Interaction Effects Golf Ball Testing Specifically, the order of brands in column F, from largest to smallest average yardages, is E, C, B, A, D. If there were not interactions at all, this ordering would hold at each temperature. For these data it is close. At cool temperatures the ordering is C, E, B, A, D; for mild, it is E, B, C, D, A; for warm, it is E, C, A, B, D. Actually, having no interaction implies even more than the preservation of these rankings.
Statistical Modeling - 90 US Army Logistics Management College Interaction Effects Golf Ball Testing It implies that the difference between any two brand averages is the same at any of the three temperature levels. For example, the difference between brands E and D at the three temperature levels are: = = = 14.8 If there were no interactions at all, these three differences would be equal.
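The "equal differences" condition can be checked directly from a table of cell means. In the Python sketch below, the function name `brand_differences` and all of the mean yardages are hypothetical; the table is deliberately constructed with no interaction so that the difference between the two brands is the same at every temperature.

```python
def brand_differences(cell_means, brand_a, brand_b):
    """Difference in mean yardage between two brands at each temperature."""
    return {temp: cell_means[brand_a][temp] - cell_means[brand_b][temp]
            for temp in cell_means[brand_a]}


# Made-up cell means with no interaction: brand E is always 10 yards ahead.
no_interaction = {
    "E": {"Cool": 230, "Mild": 240, "Warm": 250},
    "D": {"Cool": 220, "Mild": 230, "Warm": 240},
}
# Made-up cell means with interaction: the gap widens as it gets warmer.
with_interaction = {
    "E": {"Cool": 230, "Mild": 242, "Warm": 255},
    "D": {"Cool": 220, "Mild": 230, "Warm": 240},
}
print(brand_differences(no_interaction, "E", "D"))
print(brand_differences(with_interaction, "E", "D"))
```

When the differences are all equal, the corresponding lines in an interaction graph are parallel; unequal differences show up as nonparallel lines.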
Statistical Modeling - 91 US Army Logistics Management College Interaction Graphically Golf Ball Testing The concept of interaction is much easier to understand by looking at graphs. The following graphs, which are both outputs from StatTools’ two-way ANOVA procedure, represent two ways of looking at the pattern of averages for different combinations of brand and temperature. The first graph shows a line for each brand, where each point on the line corresponds to a different temperature. The second shows the same information with the roles of brand and temperature reversed.
Statistical Modeling - 92 US Army Logistics Management College Interaction Graphically Golf Ball Testing
Statistical Modeling - 93 US Army Logistics Management College Interaction Graphically Golf Ball Testing
Statistical Modeling - 94 US Army Logistics Management College Interaction Graphically Golf Ball Testing Neither graph is better than the other; they simply show the same data from different perspectives. The key to either is whether the lines are parallel. If they are, then there are no interactions: the effect of one factor on average yardage is the same regardless of the level of the other factor. The more nonparallel they are, however, the stronger the interactions are. The lines in either of these graphs are not exactly parallel but they are nearly so. This implies that there is very little interaction between brand and temperature.
Statistical Modeling - 95 US Army Logistics Management College Type of Interactions In general, interactions can be of several types. Shown here are two contrasting types. These graphs focus on two types and on different data than in GOLFBALLS.XLS. In the first graph brand A dominates at all temperatures. However, there is an interaction because the difference between brands increases as temperatures increase.
Statistical Modeling - 96 US Army Logistics Management College Type of Interactions In this situation the interaction effect is interesting, but the main effect of brand (brand A is better when averaged over all temperatures) is also interesting. The situation is quite different in the next graph, where there is a crossover.
Statistical Modeling - 97 US Army Logistics Management College Type of Interactions Brand A is somewhat better at cool temperatures, but brand B is better at mild and warm temperatures. In this case the interaction is the most interesting finding, and the main effect of brand is much less interesting.
Statistical Modeling - 98 US Army Logistics Management College Type of Interactions In simple terms, if you are a golfer, you’d buy brand A in cool temperatures and brand B otherwise, and you wouldn’t care very much which brand is better when averaged over all temperatures. For these reasons, we check first for interactions in a two-way design. If there are significant interactions, then the main effects might not be as interesting. However, if there are no significant interactions, then main effects generally become more important.
Statistical Modeling - 99 US Army Logistics Management College Main Effects versus Interactions Main effects are differences in averages across the levels of one factor, where these averages are averages over all levels of the other factor. In a table of sample means, we can check for main effects by looking at the averages in the "Grand Total" column and row. In contrast, the interactions are patterns of averages in the main body of the table and are best shown graphically. They indicate whether the effect of one factor depends on the level of the other factor.
Statistical Modeling - 100 US Army Logistics Management College Two Way ANOVA Table The next question is whether the main effects and interactions we see in the table of sample means are statistically significant. As in a one-way ANOVA, this is answered by an ANOVA table. However, instead of having just two sources of variation, within and between, as in a one-way ANOVA, there are now four sources of variation: one for the main effect of each factor, one for interactions, and one for the variation within treatment level combinations.
Statistical Modeling - 101 US Army Logistics Management College Analysis of Results Golf Ball Testing For the golf ball data, two-way ANOVA separates the total variation across all 300 observations into four sources. 1. There is variation due to different brands producing different average yardages. 2. There is variation due to different average yardages at different temperatures. 3. There is variation due to the interactions we saw in the interaction graphs. 4. There is the same type of "within" variation as in one-way ANOVA. This is the variation that occurs because yardages for the 20 balls of the same brand hit at the same temperature are not all identical.
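For a balanced design, the four sources of variation listed on this slide can be computed with a few sums. The Python sketch below is an illustration under assumptions: the function name `two_way_anova` is hypothetical, the toy data are made up (2 levels of each factor, 2 replicates per cell), and the code relies on every cell holding the same number of observations.

```python
def mean(values):
    return sum(values) / len(values)


def two_way_anova(cells):
    """Sums of squares for a balanced two-way design.

    cells[i][j] is the list of replicates for level i of factor A
    and level j of factor B; every cell must have the same length.
    """
    a, b, r = len(cells), len(cells[0]), len(cells[0][0])
    cell_mean = [[mean(cell) for cell in row] for row in cells]
    a_mean = [mean(row) for row in cell_mean]
    b_mean = [mean([cell_mean[i][j] for i in range(a)]) for j in range(b)]
    grand = mean(a_mean)
    ss_a = r * b * sum((m - grand) ** 2 for m in a_mean)      # main effect of A
    ss_b = r * a * sum((m - grand) ** 2 for m in b_mean)      # main effect of B
    ss_ab = r * sum((cell_mean[i][j] - a_mean[i] - b_mean[j] + grand) ** 2
                    for i in range(a) for j in range(b))      # interactions
    ss_within = sum((x - cell_mean[i][j]) ** 2
                    for i in range(a) for j in range(b) for x in cells[i][j])
    df_ab = (a - 1) * (b - 1)
    df_within = a * b * (r - 1)
    f_interaction = (ss_ab / df_ab) / (ss_within / df_within)
    return {"ss_a": ss_a, "ss_b": ss_b, "ss_ab": ss_ab,
            "ss_within": ss_within, "f_interaction": f_interaction}


# Toy data: 2 x 2 design with 2 replicates per cell.
toy_cells = [[[1, 3], [3, 5]],
             [[5, 7], [11, 13]]]
print(two_way_anova(toy_cells))
```

The four sums of squares add up to the total variation of all observations around the grand mean, which is a useful check on any hand calculation.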
Statistical Modeling - 102 US Army Logistics Management College Output of Results Golf Ball Testing Select the StatTools/Statistical Inference/Two-way ANOVA menu item, selecting Brand and Temp as the "code" variables and Yards as the "measurement" variable. Output includes tables of sample sizes, sample means, and sample standard deviations, as well as the ANOVA table.
Statistical Modeling - 103 US Army Logistics Management College Analysis of Results Golf Ball Testing We test whether main effects or interactions are statistically significant in the usual way, by examining p-values. Looking first at the interactions, the p-value is about 0.03, which says that the lines in the interaction graphs are significantly nonparallel, at least at the 5% significance level. There is at least some interaction between brand and temperature (although the practical significance could be disputed). The two p-values for the main effects in cells G32 and G33 are practically 0, meaning that there are differences across brands and across temperatures. Of course, the main effect of temperature was a foregone conclusion, since we already know that balls do not go as far in cold temperatures, but the main effect of brand is more interesting. According to the evidence, some brands definitely go farther, on average, than some others.