Part 1: Regression Analysis Estimating Relationships


1 Part 1: Regression Analysis Estimating Relationships
Regression analysis is a statistical modeling tool with two main uses: forecasting (based on the statistical association between a dependent variable and one or more independent variables) and estimating the impact of the independent variable(s) on the dependent variable. Remember that the model does not prove causality. A relationship could be due to chance (especially with a small data set) or to the influence of a third variable. We could even have the independent and dependent variables interchanged. Causality can only be established through a series of designed experiments, along with historical observation and experience that validate the regression model.

2 Preparing to Use Stat Tools Pharmex Drug Stores
Pharmex.xls StatTools is part of the DecisionTools Suite. Open both Excel and StatTools. Select StatTools + Data Set Manager, then select New. Highlight the portion of the spreadsheet that includes the data and select OK.

3 Scatterplots: Graphing Relationships Pharmex Drug Stores
Pharmex.xls Pharmex is a chain of drugstores that operates around the country. The company has collected data from 50 randomly selected metropolitan regions. In each region it has collected data on its promotional expenditures and sales over the past year. There are two variables, each of which is an index, not a dollar amount. Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor. Sales: Pharmex’s sales as a percentage of those of the leading competitor. The company expects a positive relationship between the two variables, so that regions with relatively more expenditures have relatively more sales. However, the nature of this relationship is not clear. Remember that both variables are indexes: a value over 100 means that our store is spending more on promotions, or has greater sales, than the leading competitor. We are building a model based on the hypothesis that as regions increase promotional spending relative to their leading competitor, sales should also increase relative to that competitor. Before we build a formal regression model, we can inspect the relationship visually by building a scatterplot, with Promote on the x-axis and Sales on the y-axis.

4 Creating the Scatterplot Pharmex Drug Stores
Pharmex.xls The tricky part is to decide which variable should be on the horizontal axis. In regression analysis, we always put the explanatory variable on the horizontal axis and the response variable on the vertical axis. Select any data cell. Select StatTools + Summary Graphs + Scatterplot… In this example the store tends to believe that larger promotional expenditures “cause” larger values of sales, so select “Sales” as the Y variable (the vertical axis) and “Promote” as the X variable (the horizontal axis). You can also create the scatterplot directly in Excel; StatTools provides a user-friendly menu for creating the plot once you have defined the data set.

5 Interpretation Pharmex Drug Stores
The scatterplot indicates that there is a positive relationship between Promote and Sales - the points tend to rise from bottom left to top right - but the relationship is not perfect. The correlation is shown automatically on the plot. The important things to note about the correlation are that it is positive and that its magnitude is moderately large. Correlation is a measure of association, with a minimum of -1 and a maximum of 1. R-squared = 0.453, that is, 45.3 percent of the variation is explained. Causation: we can never make definitive statements about causation based on regression analysis. Regression identifies only a statistical relationship, not a causal relationship.

6 Simple Linear Regression Pharmex Drug Stores
The Pharmex scatterplot hints at a linear relationship between Promote and Sales. We want to draw the “best fitting” straight line through the points to quantify that linear relationship. Since the relationship is not perfect, not all points lie exactly on the line. The differences are the residuals: they show how much the observed values differ from the fitted values. The fitted value is the vertical distance from the horizontal axis to the line. We define the “best fitting” line through the points in the scatterplot to be the one with the smallest sum of squared residuals. This line is called the least squares line. We now want to find the least squares line for the Pharmex drugstore data, using Sales as the response variable and Promote as the explanatory variable. This is a good place to show a graph with a fitted line through a series of points and to show the residual visually (the vertical difference between a point and the regression line). We use squared residuals to keep positive and negative residuals from cancelling each other out (in fact, the residuals from the least squares line sum to zero). We could use an absolute value measure instead, but absolute values do not work well with calculus, which makes it difficult to derive a formula for the “best fitting” line. With squared residuals, calculus gives the coefficients that minimize the sum of squared residuals, and these coefficients describe the regression line/model. We are further assuming that the relationship between the variables is linear. We will not cover non-linear regression models in this block.
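
The least squares idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the StatTools procedure: the index values are made up (they are not the Pharmex data), and `least_squares_line` is a hypothetical helper name. It uses the standard closed-form solution for one explanatory variable.

```python
# Made-up index values for illustration only; these are not the Pharmex data.
promote = [77.0, 110.0, 110.0, 93.0, 90.0, 95.0, 100.0, 85.0]
sales = [85.0, 103.0, 102.0, 109.0, 85.0, 103.0, 110.0, 86.0]


def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = least_squares_line(promote, sales)
fitted = [intercept + slope * xi for xi in promote]
residuals = [yi - fi for yi, fi in zip(sales, fitted)]  # observed minus fitted
sse = sum(r ** 2 for r in residuals)

print(f"Predicted Sales = {intercept:.2f} + {slope:.4f} * Promote")
print(f"Sum of residuals: {sum(residuals):.10f}")
print(f"Sum of squared residuals: {sse:.2f}")
```

The residuals sum to zero up to rounding, and moving the slope away from the fitted value can only increase the sum of squared residuals.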

7 Least Squares Line with StatTools Pharmex Drug Stores
Select any data cell. From the menu bar, select StatTools + Regression & Classification + Regression… Specify that “Sales” is the response (dependent) variable and that “Promote” is the explanatory (independent) variable. Select the graph option “Residuals vs Fitted Values”. Notice that there is no “single” option under regression type: always use “multiple” unless you are using a special regression procedure. Several other graph options are also available. If you would like predictions (along with associated confidence intervals) for specific values of the independent variable(s), put those values in your spreadsheet, define them as a data set, and select that data set for prediction.

8 Regression Output Table Pharmex Drug Stores
R-Square is the percentage of total variability explained by the regression equation. The output reports both the sample variance of the Y’s and the remaining unexplained variance (MSE); using Promote we have explained 45.29% of the variability of the Y’s. Multiple R is the positive square root of R-Square. StErr of Estimate is the positive square root of MSE. The p-value for the slope is extremely small, which indicates that the slope of the regression line is statistically significant regardless of the level of significance at which you test. The “Constant” and “Promote” coefficients (B18:C18) give the equation for the least squares line in the form: Predicted Sales = Constant + (Promote coefficient × Promote).
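
The relationships among the output quantities can be checked by hand. The sketch below is a minimal illustration on made-up index values (not the Pharmex data, so the numbers will not match the StatTools table). It assumes the usual simple-regression definitions: MSE is the sum of squared residuals divided by n - 2, and the sample variance of Y is the total sum of squares divided by n - 1.

```python
import math

# Made-up index values for illustration only; these are not the Pharmex data.
promote = [77.0, 110.0, 110.0, 93.0, 90.0, 95.0, 100.0, 85.0]
sales = [85.0, 103.0, 102.0, 109.0, 85.0, 103.0, 110.0, 86.0]

n = len(promote)
x_bar = sum(promote) / n
y_bar = sum(sales) / n
sxx = sum((x - x_bar) ** 2 for x in promote)
syy = sum((y - y_bar) ** 2 for y in sales)  # total sum of squares
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(promote, sales))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(promote, sales))

r_squared = 1 - sse / syy               # share of variability explained
multiple_r = math.sqrt(r_squared)       # positive square root of R-Square
sample_var_y = syy / (n - 1)            # variance of Y before using Promote
mse = sse / (n - 2)                     # remaining unexplained variance
std_err_estimate = math.sqrt(mse)       # StErr of Estimate

print(f"R-Square = {r_squared:.4f}, Multiple R = {multiple_r:.4f}")
print(f"Sample variance of Y = {sample_var_y:.2f}, MSE = {mse:.2f}")
print(f"StErr of Estimate = {std_err_estimate:.2f}")
```

With one explanatory variable, Multiple R equals the absolute value of the correlation between the two variables.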

9 Least Square Line Equation Pharmex Drug Stores
We can interpret this equation as follows: The slope indicates that the sales index tends to increase by about 0.76 for each unit increase in the promotional expenses index. The interpretation of the intercept is less important: it is literally the predicted sales index for a region that does no promotions. The scatterplot of residuals: a useful graph in almost any regression analysis is a scatterplot of residuals (on the vertical axis) versus fitted values. We typically examine the scatterplot for striking patterns. A “good” fit not only has small residuals, but it has residuals scattered randomly around 0 with no apparent pattern. This is the case here.

10 The Scatterplot of Residuals vs Fitted Values
Simple linear regression ends here. Instructor: go back to the file, delete the blank column, and in StatPro run the test for normality: select Q-Q, then select Residuals. Explain that because the points are nearly linear, the residuals meet the normality assumption. Mention that other packages also have tests of normality.

11 Multiple Regression Bendrix Automotive Parts Company
The Bendrix Company manufactures various types of parts for automobiles. The factory manager wants to get a better understanding of overhead costs, including supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses. Some of the overhead costs are “fixed” in the sense that they do not vary appreciably with the volume of work being done, whereas others are “variable” and do vary directly with the volume of work being done. It is not easy to draw a clear line between the fixed and variable overhead components. The Bendrix manager has tracked total overhead costs for 36 months. In this multiple regression example, we are attempting to develop a model that shows how overhead costs relate to the number of machine hours worked in a month and the number of production runs in a month. The thought here is that more machine hours will increase certain types of overhead costs, while the number of runs will affect other types of overhead costs.

12 Explanatory Variables Bendrix Automotive Parts Company
Bendrix.xls The factory manager collected data on two variables he believes might be responsible for variations in overhead costs: MachHrs: number of machine hours used during the month. ProdRuns: the number of separate production runs during the month (Bendrix manufactures parts in fairly large batches called production runs. Between each run there is a downtime.). Each observation (row) corresponds to a single month. We need to estimate and interpret the equation for Overhead when both explanatory variables, MachHrs and ProdRuns, are included in the regression equation, but because these are time series variables we should also look out for relationships between these variables and the Month variable.

13 Multiple Regression with StatTools Bendrix Automotive Parts Company
Select StatTools + Regression & Classification + Regression… Check “Overhead” as the response (dependent) variable. Check “MachHrs” and “ProdRuns” as the explanatory (independent) variables. Select the Graph options in the dialog box as shown here. For right now, we will ignore the Month variable and the issues associated with data presented in time sequence order.

14 Multiple Regression Output Table Bendrix Automotive Parts Company
The coefficients in B18:B20 indicate that the estimated regression equation is Predicted Overhead = 3,997 + (43.54 × MachHrs) + (ProdRuns coefficient × ProdRuns). The overall model is significant based on the F-test from the ANOVA table (the p-value is extremely small). 93% of the variation in overhead cost is explained by the model. Each independent variable is statistically significant based on the p-value of its t-test. We will do some additional interpretation of the output later on. Let’s look at the specific model developed and its interpretation.

15 Interpretation of Equation Bendrix Automotive Parts Company
If the number of production runs is held constant, then the overhead cost is expected to increase by $43.54 for each extra machine hour. If the number of machine hours is held constant, the overhead is expected to increase by the ProdRuns coefficient (in dollars) for each extra production run. $3,997 is the fixed component of overhead. The slope terms involving MachHrs and ProdRuns are the variable components of overhead. In a multiple regression model, the interpretation of the coefficient of each independent variable assumes that the values of the other independent variables are held constant. We can take the y-intercept of $3,997 (the prediction for 0 production runs and 0 machine hours) as a measure of the fixed overhead costs. However, we have to be very careful when interpreting the y-intercept, since we typically do not have data values near that point and are extrapolating outside the range of the historical data.

16 Equation Comparison Bendrix Automotive Parts Company
It is interesting to compare this equation with the separate single-variable equations: one with MachHrs only (intercept in the 48,000s, MachHrs coefficient 34.7) and one with ProdRuns only (intercept in the 75,000s). In the two-variable equation the intercept is about 3,997 and the MachHrs coefficient is 43.5. Note that both coefficients have increased. Also, the intercept is now lower than either intercept in the single-variable equations. It is difficult to guess the changes that more explanatory variables will cause, but it is likely that changes will occur. The reasoning is that when MachHrs is the only variable in the equation, we are not holding ProdRuns constant - we are ignoring it - so in effect the coefficient 34.7 of MachHrs indicates the effect of MachHrs and the omitted ProdRuns on Overhead. But when we include both variables, the coefficient 43.5 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant. Since the coefficients have different meanings, it is not surprising that we obtain different estimates. Note: the instructor should pull up StatPro and generate the other regressions. We would have to look at the regression statistics (particularly the adjusted R2 values when comparing across models with different numbers of independent variables) to see which model would be preferred. This is not a trivial question, and it may involve issues such as how easy it is to get predicted values for machine hours and production runs in a future month. We will look at ways to compare models (and even automate that process) later on in the block.
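
The change in coefficients when a variable is added can be reproduced on a toy example. The sketch below is only an illustration under stated assumptions: the data are synthetic (built so that Overhead = 4000 + 40 × MachHrs + 900 × ProdRuns exactly, with MachHrs and ProdRuns negatively correlated) and are not the Bendrix data; `simple_fit` and `two_variable_fit` are hypothetical helper names. The two-variable fit solves the normal equations on deviations from the means.

```python
# Synthetic data, NOT the Bendrix data: overhead is an exact linear function
# of both variables, and the two explanatory variables are negatively correlated.
mach_hrs = [1500, 1400, 1600, 1300, 1700, 1200, 1550, 1450]
prod_runs = [32, 36, 30, 44, 28, 40, 38, 26]
overhead = [4000 + 40 * m + 900 * p for m, p in zip(mach_hrs, prod_runs)]


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of products of deviations from the means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def simple_fit(x, y):
    """Least squares (intercept, slope) with a single explanatory variable."""
    slope = cross(x, y) / cross(x, x)
    return mean(y) - slope * mean(x), slope


def two_variable_fit(x1, x2, y):
    """Least squares (b0, b1, b2) with two explanatory variables."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


_, slope_mach_only = simple_fit(mach_hrs, overhead)
_, slope_runs_only = simple_fit(prod_runs, overhead)
b0, b1, b2 = two_variable_fit(mach_hrs, prod_runs, overhead)

print(f"MachHrs only:  slope = {slope_mach_only:.2f}")
print(f"ProdRuns only: slope = {slope_runs_only:.2f}")
print(f"Both: {b0:.1f} + {b1:.2f} MachHrs + {b2:.2f} ProdRuns")
```

In this synthetic setup each single-variable slope absorbs part of the omitted variable's effect, so both slopes are smaller than the coefficients in the two-variable equation.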

17 Modeling Possibilities Fifth National Bank Gender-Discrimination Suit
Bank.xls The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank’s employee database is listed in this file. Here is a partial list of the data. Remember that the key question we want to answer is whether female employees receive substantially smaller salaries than male employees and, if so, whether we can determine why. Along the way, we may generate other questions to answer and may find other insights in this problem, but we must remember to stay on target with the original problem.

18 Variables Fifth National Bank Gender-Discrimination Suit
For each of the 208 employees, the variables in the data set are:
- EducLev: education level with categories 1 (high school grad), 2 (some college), 3 (bachelor’s degree), 4 (some graduate courses) and 5 (graduate degree)
- JobGrade: current job level, the possible levels being from 1-6 (6 is highest)
- YrHired: year employee was hired
- Salary: current annual salary in thousands of dollars
- YrBorn: year employee was born
- Gender: a categorical variable with values “Female” and “Male”
- YrsPrior: number of years of work experience at another bank prior to working at Fifth National
- PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise
Notice that some of the variables are categorical data:
- education level from 1 to 5 (a higher number indicates more education)
- job grade from 1 to 6 (a higher number means a higher graded job)
- gender (female and male values)
- PCJob with 0/1 coding
Before we run any models, what factors do you think influence salary, and do we have all the variables to measure those influences? Do the data provide evidence that females are discriminated against in terms of salary?

19 Naïve Approach Fifth National Bank Gender-Discrimination Suit
A naïve approach to the problem is to compare the average salaries of the males and females. The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505. The difference between the averages is statistically significant. The females are definitely earning less, but perhaps there is a reason. The question is whether the difference between the average salaries is still evident after taking other attributes into account. This is a perfect task for regression. A straight statistical hypothesis test or confidence interval comparing female salaries with male salaries will conclude that there is a statistically significant difference between them. To do the statistical test in StatTools, select Statistical Inference, then Hypothesis Test. Use two-sample analysis, click the format button next to the data set and choose stacked, check Gender for the category column and Salary for the value column, and do a difference-of-means test with null hypothesis value 0 and alternative hypothesis type “not equal to null value”.

20 Dummy Variables Fifth National Bank Gender-Discrimination Suit
Some potential explanatory variables are categorical and cannot be measured on a quantitative scale. However, we often need to use these variables because they are related to the response variable. The trick is to create dummy variables, also called indicator or 0-1 variables, that indicate the category a given observation is in. To create dummy variables we can use an IF statement or we can use StatTools’ Dummy variable procedure, which is usually easier particularly when there are multiple categories. Once the dummy variables are created, we can combine the variables if we like by simply adding the columns to get the dummy for the new category. Actual StatTools steps for doing this are on slide 21

21 Regression Analysis w/Dummy Variables Fifth National Bank Gender-Discrimination Suit
In this example we create dummy variables for Gender, and JobGrade. We also create another variable: YrsExper = 95 – YrHired (since this is 1995 data) We must follow two rules: We shouldn’t use any of the original categorical variables that the dummies are based on. We should use one less dummy than the number of categories for any categorical variable. Then we can run a regression analysis with Salary as the response variable, using any combination of numerical and dummy explanatory variables. Create the dummy variables for JobGrade and calculate Yrs Experienced using the formula above (don’t forget to give that column a header name). Follow the two rules listed above if you are going to include categorical variables in a regression model.

22 Creating Dummy Variables Gender Categorical Variable
To create a dummy variable called Female for Gender: Select any data cell. From the menu bar, select StatTools + Data Utilities + Dummy… Select “Gender” as the variable. Select “Create One Dummy Variable for Each Distinct Category”. Answer “Yes” to the warnings. Repeat the procedure for JobGrade. StatTools provides an easy way to create dummy variables: - Select Data Utilities, then Dummy. - Click the variable you want to code as a dummy (you can only do one at a time) and set the option to create one dummy variable for each distinct category. - Click Yes when it mentions shifting rows and columns. StatTools will insert the new variables to the right of your other variables. It will put a label at the top and add these variables to your data set. If you are not using StatTools, it is still fairly easy to use an IF function to create dummy variables.
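
Outside a spreadsheet, the same 0-1 coding can be sketched as follows. This is a minimal illustration on a handful of made-up rows; `make_dummies` is a hypothetical helper, not a StatTools function, and the `Job_` column names simply mimic the naming used on these slides.

```python
def make_dummies(values, prefix):
    """Return {column_name: list of 0/1}, one column per distinct category."""
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in sorted(set(values))
    }


# A few made-up rows, not the bank's data.
gender = ["Female", "Male", "Female", "Female", "Male"]
job_grade = [1, 3, 2, 1, 6]

# Equivalent of a spreadsheet formula such as =IF(cell="Female",1,0).
female = [1 if g == "Female" else 0 for g in gender]

# One dummy per distinct job grade present in the data.
job_dummies = make_dummies(job_grade, "Job")

# Combining categories: add the columns to get the dummy for "grade 1 or 2".
low_grade = [a + b for a, b in zip(job_dummies["Job_1"], job_dummies["Job_2"])]

print(female)
print(job_dummies)
print(low_grade)
```

Because each row belongs to exactly one category, the dummies for a variable always sum to 1 across a row, which is why a regression should use one fewer dummy than the number of categories.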

23 Regression Analysis Gender Only
We first estimate a regression equation with Female as the only explanatory variable. The resulting equation has the form Predicted Salary = Constant + (Female coefficient × Female). To interpret this equation, recall that Female has only two possible values, 0 and 1. If we substitute 1, the predicted salary is the average female salary ($37,210); if we substitute 0, the predicted salary is the average male salary ($45,505). The coefficient of the Female dummy variable is therefore simply the difference between the two averages. This equation tells only part of the story, because it ignores all information except gender. To reproduce it, run a regression model with the dummy variable Female as the independent variable and Salary as the dependent variable. This model shows why you want to use one less dummy variable than the number of categories (male and female are 2 categories, so use only 1 dummy variable): substituting 0 means the person is male, and substituting 1 means the person is female. Since we are using only a gender variable, we get the same result as the hypothesis test. Note that the overall model, while statistically significant, is not very powerful (gender explains only 12% of the variation in salaries).
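
The claim that a dummy-only regression reproduces the two group averages can be verified directly. The sketch below uses a few made-up salaries (in thousands of dollars, not the bank's data) and the closed-form least squares solution for one explanatory variable.

```python
# Made-up salaries in $1000s for illustration; these are not the bank's data.
salary = [32.0, 39.0, 35.0, 46.0, 50.0, 41.0, 30.0, 44.0]
female = [1, 1, 1, 0, 0, 0, 1, 0]

n = len(salary)
x_bar = sum(female) / n
y_bar = sum(salary) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(female, salary))
sxx = sum((x - x_bar) ** 2 for x in female)

female_coef = sxy / sxx                   # coefficient of the Female dummy
constant = y_bar - female_coef * x_bar    # intercept

male_salaries = [y for x, y in zip(female, salary) if x == 0]
female_salaries = [y for x, y in zip(female, salary) if x == 1]
male_mean = sum(male_salaries) / len(male_salaries)
female_mean = sum(female_salaries) / len(female_salaries)

print(f"Predicted Salary = {constant:.2f} + ({female_coef:.2f}) * Female")
print(f"Male average = {male_mean:.2f}, female average = {female_mean:.2f}")
```

The constant equals the male average, and the constant plus the Female coefficient equals the female average.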

24 Regression Analysis Gender + YrsExper + YrsPrior
We expand this equation by adding YrsExper and YrsPrior. The corresponding equation has the form Predicted Salary = b0 + (b1 × YrsExper) + (b2 × YrsPrior) + (b3 × Female), with the estimates taken from the StatTools output. It is useful to write two separate equations, one for females: Predicted Salary = (b0 + b3) + (b1 × YrsExper) + (b2 × YrsPrior), and one for males: Predicted Salary = b0 + (b1 × YrsExper) + (b2 × YrsPrior). We interpret the coefficient of the Female dummy variable as the average salary disadvantage for females relative to males after controlling for job experience. But there is still more story to tell. To reproduce this, build a second model with independent variables YrsExper (experience working at this bank), YrsPrior (prior banking experience before coming to this bank), and Female. This model is basically trying to measure gender differences in salary while controlling for work experience. To get the two equations, plug in Female = 1 for the female equation and Female = 0 for the male equation. The model is still significant, although YrsPrior is not statistically significant (you may want to exclude that variable and run just YrsExper and Female). Explanatory power has greatly increased, and there is still a difference in salary (although not quite as large as before). If you continue to build models in the same spreadsheet, you may want to rename the individual worksheets (StatTools just labels them Regression, Regression(2), and so on).

25 Regression Analysis Gender + YrsExper + YrsPrior + JobGrade
We next add job grade to the equation by including five of the six job grade dummies. Although any five can be used, we use Job_2 through Job_6. The estimated regression equation now has the form Predicted Salary = b0 + (b1 × YrsExper) + (b2 × YrsPrior) + (b3 × Female) + (b4 × Job_2) + (b5 × Job_3) + (b6 × Job_4) + (b7 × Job_5) + (b8 × Job_6). There are now two categorical variables involved, gender and job grade. However, we can still write a separate equation for any combination of categories by setting the dummies to the appropriate values. Instructor: generate the dummy variables for job grade, then run the multiple regression using Job_2 through Job_6. To replicate the model above, include Female, YrsPrior, YrsExper, and Job_2, Job_3, Job_4, Job_5 and Job_6. Note that we did not include the Job_1 dummy variable, so if all of the other JobGrade dummy variables are set to 0, the job grade is 1. Notice that the Job_2 predicted salary increases by $2,575 over Job_1, the Job_3 salary increases by $6,295 over Job_1, and so on up to an average salary increase of $27,647 for Job_6 over Job_1. Assuming that JobGrade, YrsExper and YrsPrior are held constant, there is still a $1,962 discrepancy between male and female salaries. However, the Female variable does not pass the significance test at the .05 level (it just barely misses).

26 Interpretation Gender + YrsExper + YrsPrior + JobGrade
The equation for females at the fifth job grade is found by setting Female = 1, Job_5 = 1, and the other job dummies equal to 0: Predicted Salary = (Constant + Female coefficient + Job_5 coefficient) + (YrsExper coefficient × YrsExper) + (YrsPrior coefficient × YrsPrior). The expected salary increase for one extra year of experience is $408; the expected salary increase for one year of experience with another bank is $149 (for either gender and any job grade). The coefficients of the job dummies indicate the average increase in salary an employee can expect relative to the reference (lowest) job grade. The key coefficient, the negative $1,962 for females, indicates the average salary disadvantage for females relative to males, given that they have the same experience levels and are in the same job grade. This “penalty” is less than a fourth of the penalty we saw before. It appears that females might be getting paid less on average partly because they are in the lower job categories. This analysis leads to a potentially different question: could female salaries be lower because a larger percentage of females are in lower job categories?

27 Pivot Table Concentration of Females in Lower Paid Jobs
We can use a pivot table to check whether females are disproportionately in the lower job categories (set JobGrade in the row area, Gender in the column area and the count (expressed as a percentage) of any variable in the data area). Clearly, females tend to be concentrated at the lower job grades. This helps explain why females get lower salaries on average, but doesn’t explain why females are at the lower job grades in the first place. We won’t be able to provide a thorough analysis of this issue. You can either recreate the pivot table or just reference this one. Notice the limits of analysis. To answer this new question would consist of an entirely new analysis and modeling effort.
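
The pivot-table layout described here can be sketched with a counter. The rows below are made up and are not the bank's data; the sketch assumes that "count expressed as a percentage" means the percentage of each gender (column) falling in each job grade (row).

```python
from collections import Counter

# Made-up (JobGrade, Gender) rows for illustration; not the bank's data.
records = [
    (1, "Female"), (1, "Female"), (2, "Female"), (1, "Male"),
    (3, "Male"), (5, "Male"), (2, "Female"), (6, "Male"),
]

cell_counts = Counter(records)                       # count per (grade, gender)
gender_totals = Counter(g for _, g in records)       # column totals
grades = sorted({grade for grade, _ in records})
genders = sorted(gender_totals)

# Rows are job grades, columns are genders, cells are percent of the column.
pivot = {
    grade: {g: 100.0 * cell_counts[(grade, g)] / gender_totals[g] for g in genders}
    for grade in grades
}

print("JobGrade " + " ".join(f"{g:>8}" for g in genders))
for grade in grades:
    print(f"{grade:>8} " + " ".join(f"{pivot[grade][g]:7.1f}%" for g in genders))
```

Each column sums to 100%, so the table shows how each gender is distributed across job grades.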

28 Conclusion
The main conclusion we can draw from the output is that there is still a plausible case to be made for discrimination against females, even after including information on all the variables in the database in the regression equation. You may want to have students come up with their own model, since there are several variables that were not included in the few example models we did.

29 Interaction Terms Fifth National Bank Gender-Discrimination Suit
An interaction variable is, algebraically, the product of two variables. Its effect is to allow the effect of one of the variables on Y to depend on the value of the other variable. With a dummy variable, the interaction term allows the slope of the regression line to differ between the two categories. Earlier we estimated an equation for Salary using the numerical explanatory variables YrsExper and YrsPrior and the dummy variable Female. If we drop the YrsPrior variable from the equation (for simplicity) and rerun the regression, we obtain an equation of the form Predicted Salary = b0 + (b1 × YrsExper) + (b2 × Female). The R2 value for this equation is 49.1%. If we decide to include an interaction variable between YrsExper and Female in this equation, what is the effect? Creating interaction terms is one more modeling option: it allows the slope of the model to change based on the value of a categorical variable. We will use a simple example to demonstrate. We start with a model containing Female and YrsExper. This model is significant and all of the independent variables are significant, but the model is not very powerful (R2 = .49). Let’s create a model that includes an interaction term between YrsExper and Female.

30 Solution with Interaction Terms Fifth National Bank Gender-Discrimination Suit
We first need to form an interaction variable that is the product of YrsExper and Female. This can be done two ways in Excel. Do it manually by introducing a new variable that contains the product of the two variables involved, or Use: StatTools + Data Utilities + Interaction… Using the latter way we must select Female and YrsExper as the variables. Once the interaction variable has been created, we include it in the regression equation in addition to the other variables. Just like with Dummy variables, the interaction variable can be created using an EXCEL formula or with StatTools.

31 Interpretation w/ Interaction Terms Fifth National Bank Gender-Discrimination Suit
The estimated regression equation has the form Predicted Salary = b0 + (b1 × YrsExper) + (b2 × Female) + (b3 × YrsExper_Female). The female equation is Predicted Salary = (b0 + b2) + ((b1 + b3) × YrsExper), and the male equation is Predicted Salary = b0 + (b1 × YrsExper). Graphically, the female and male salary lines are nonparallel. When you create the regression model containing YrsExper, Female and the interaction term, you get the model above. Notice that when you plug in 0 for Female (the male equation), the Female and interaction terms drop out. This tells us that each additional year of experience for a male equates to a $1,528 increase in salary on average. When you plug in 1 for Female, the interaction coefficient, which is negative, is added to the YrsExper coefficient and so reduces the slope, and the Female coefficient is added to the intercept term. This tells us that females with 0 years of experience at the bank start at a slightly higher salary on average, but each additional year of experience equates to only a $280 increase on average. Be careful about accepting these equations without checking the range of the historical data. Every person in the database has at least 2 years of experience, and the observations become very few once experience goes above 20 years. It might be interesting to pull out the “outliers” (employees at the highest salary levels) to see whether they strongly influence the slope of the regression line (are they representative of the employee population as a whole?).
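
The substitution of Female = 0 and Female = 1 can be sketched in code. The two slopes ($1,528 per year for males and $280 per year for females) come from this slide; the constant and the Female coefficient below are hypothetical placeholders chosen only so the example runs, because the fitted values are not reproduced in this transcript. `group_lines` and `predict` are hypothetical helper names.

```python
def group_lines(b0, b_exper, b_female, b_inter):
    """Return ((male_intercept, male_slope), (female_intercept, female_slope))."""
    male = (b0, b_exper)                            # Female = 0: extra terms drop out
    female = (b0 + b_female, b_exper + b_inter)     # Female = 1
    return male, female


def predict(b0, b_exper, b_female, b_inter, yrs_exper, female):
    """Prediction from the full equation, forming the interaction as a product."""
    yrs_exper_female = yrs_exper * female           # the interaction variable
    return b0 + b_exper * yrs_exper + b_female * female + b_inter * yrs_exper_female


B0 = 30000.0                 # HYPOTHETICAL constant, not from the slides
B_FEMALE = 4000.0            # HYPOTHETICAL Female coefficient, not from the slides
B_EXPER = 1528.0             # male slope stated on the slide
B_INTER = 280.0 - 1528.0     # implied by the stated female slope of 280

male_line, female_line = group_lines(B0, B_EXPER, B_FEMALE, B_INTER)
print("Male:   intercept %.0f, slope %.0f" % male_line)
print("Female: intercept %.0f, slope %.0f" % female_line)
print("Female with 10 years:", predict(B0, B_EXPER, B_FEMALE, B_INTER, 10, 1))
```

With a negative interaction coefficient the female line starts higher only if the Female coefficient is positive, and it always rises more slowly than the male line.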

32 Conclusion w/Interaction Terms Fifth National Bank Gender-Discrimination Suit
The Y-intercept for the female line is slightly higher - females with no experience at Fifth National Bank tend to start out slightly higher than males - but the slope of the female line is much lower. That is, males tend to move up the salary ladder much more quickly than females. Again, this provides another argument, although a somewhat different one, for gender discrimination against females. The R2 value increased from 49.1% to 63.9%. The interaction variable has definitely added to the explanatory power of the equation. Instructor: this is possibly the end of the Day 1 material, depending on how fast you cover it.

33 Part 2: Regression Analysis Statistical Inference

34 Inference About Regression Coefficients Bendrix Automotive Parts Company
Bendrix1.xls As before, the response variable is Overhead and the explanatory variables are MachHrs and ProdRuns. What inferences can we make about the regression coefficients? We obtain the output from using StatTools

35 Multiple Regression Output Bendrix Automotive Parts Company
The estimated equation again has the form Predicted Overhead = Constant + (b1 × MachHrs) + (b2 × ProdRuns). Regression coefficients estimate the true, but unobservable, population coefficients. The standard error of bi indicates the accuracy of these point estimates. For example, b1 is the point estimate of the effect on Overhead of a one-unit increase in MachHrs, and the output reports a 95% confidence interval for the true coefficient. Similar statements can be made for the coefficient of ProdRuns and the intercept term. Instructor: these observations are meaningless if the F statistic in the ANOVA table is not significant.

36 A Test for the Overall Fit: The ANOVA Table Bendrix Automotive Parts Company
Does the ANOVA table for the Bendrix manufacturing data indicate that the combination MachHrs and ProdRuns has at least some ability to explain variation in Overhead? The F-ratio is “off the charts” and the p-value is practically 0.

37 Interpretation of the ANOVA Table Bendrix Automotive Parts Company
This information wouldn’t be much comfort for the Bendrix manager who is trying to understand the causes of variation in overhead costs. This manager already knows that machine hours and production runs are related positively to overhead costs - everyone in the company knows that! What he really wants to know is a set of explanatory variables that yields a high R2 and a low se. The low p-value in the ANOVA table does not guarantee these. All it guarantees is that MachHrs and ProdRuns are of “some help” in explaining variation in Overhead.

38 Violations of Regression Assumptions Bendrix Automotive Parts Company
Is there evidence of non constant variance? Is there any evidence of lag 1 autocorrelation in the Bendrix data when Overhead is regressed on MachHrs and ProdRuns? Is there evidence of non Normality?

39 Do the Residuals Have Constant Variance? Bendrix Automotive Parts Company
If the residual variance is not constant, the standard error of the regression coefficient, s(bi), is incorrect. Note: when we ran the regression we selected the “Residuals vs Fitted Values” graph.

40 Plot of Residuals vs Fitted Values Bendrix Automotive Parts Company
The residuals appear to have equal variances (homoscedasticity).

41 Autocorrelated Residuals Bendrix Automotive Parts Company
The residuals of time series data are often autocorrelated. The most frequent type of autocorrelation is positive autocorrelation. For example, if residuals separated by 1 month are autocorrelated, this is called lag 1 autocorrelation. We use the fitted values (column C) and the residuals (column D) in the “Regression” tab. The residuals represent how much the regression over-predicts (if negative) or under-predicts (if positive) the overhead cost for that month. The Durbin-Watson (DW) statistic is a test that measures autocorrelation. DW is scaled between 0 and 4. Values close to 2 indicate very little lag 1 autocorrelation, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation. (See Managerial Statistics.)

42 Durbin-Watson Test Bendrix Automotive Parts Company
We can check for lag 1 autocorrelation in two ways: with the Durbin-Watson (DW) statistic and by examining the time series graph of the residuals. The DW statistic is scaled between 0 and 4. A value near 2 indicates little lag 1 autocorrelation, a value below 2 indicates positive autocorrelation, and a value above 2 indicates negative autocorrelation. (As a rough guide, with n = 30 and 1 to 5 explanatory variables, a DW value below 1.2 is a problem.) We calculate the DW statistic in cell E45 with the formula: =StatDurbinWatson(D45:D80) Based on our guidelines, the DW value suggests positive autocorrelation - it is less than 2 - but not enough to cause concern.
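The DW calculation itself is simple enough to sketch outside of StatTools. The Python below is a minimal illustration, not the StatTools implementation: the function name and both residual series are hypothetical, and the residuals are assumed to be listed in time order.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals (not the Bendrix data).
# Long runs on one side of zero: positive lag 1 autocorrelation, DW below 2.
runs = [1.0, 1.2, 0.8, 0.9, -0.7, -1.1, -0.9, -1.2]
# Sign flips every period: negative lag 1 autocorrelation, DW above 2.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_runs = durbin_watson(runs)
dw_alternating = durbin_watson(alternating)
print(round(dw_runs, 2), round(dw_alternating, 2))
```

The first series gives a value well below 2 and the second a value well above 2, matching the guidelines above.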

43 Time Series Graph of Residuals Bendrix Automotive Parts Company
This general conclusion is supported by the time series graph. To create it, add the range A44:D80 as a data set, select StatTools + Time Series & Forecasting + Time Series Graph, and select Residuals as the variable. (Note: refer to the slide but do not attempt to demonstrate.) Serious lag 1 autocorrelation would tend to show long runs of residuals on the same side of the horizontal axis - positives would tend to follow positives and negatives would tend to follow negatives. There is some indication of this in the graph but not an excessive amount.

44 Are the Residuals Normally Distributed? Bendrix Automotive Parts Company
The inferences we want to make assume the residuals are normally distributed. Using Data Set #2, select StatTools + Normality Tests + Q-Q Normal Plot. Select “Residuals” as the variable. Check “Plot Using Standardized Q-Values” and “Include Reference Line”.

45 Normal Probability Plot Bendrix Automotive Parts Company
The error terms appear to be normally distributed.

46 Multicollinearity Height vs Left & Right Feet
The relationship between the explanatory variable X and the response variable Y is not always accurately reflected in the coefficient of X; it depends on which other X’s are included or not included in the equation (especially when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity). Multicollinearity is the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult. We want to explain a person’s height by means of foot length. The response variable is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively. It is likely that there is a large correlation between height and foot size, so we would expect this regression equation to do a good job. The R2 value will probably be large. But what about the coefficients of Right and Left?

47 Correlation of Left & Right Height vs Left & Right Feet
Height.xls To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths in this file. We did this so that, except for random error, height is approximately 32 plus 3.2 times foot length (in inches). StatTools + Summary Statistics + Correlation & Covariance The correlations between Height and either Right or Left in our data set are quite large, and the correlation between Right and Left is very close to 1.

48 Multiple Regression Height vs Left & Right Feet
The Regression output tells a somewhat confusing story. The multiple R and the corresponding R2 are about what we would expect, given the correlations between Height and either Right or Left. In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the se value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches. However, the coefficients of Right and Left are not at all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length. In fact, the coefficient of Left has the wrong sign - it is negative! Besides this wrong sign, the tip-off that there is a problem is that the t-value of Left is quite small and the corresponding p-value is quite large.

49 Solution Judging by this, we might conclude that Height and Left are either not related or are related negatively. But we know from the table of correlations that both of these are false. In contrast, the coefficient of Right has the “correct” sign, and its t-value and associated p-value do imply statistical significance, at least at the 5% level. However, this happened mostly by chance; slight changes in the data could change the results completely.

50 Solution Although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects. Note that the sum of the coefficients is which is close to the coefficient of 3.2 we used to generate the data. Therefore, the estimated equation will work well for predicting heights, but does not provide reliable estimates of the coefficients of Right and Left. When Right is the only variable: Predicted Height = Right The R2 = 81.6%, se = 2.005, the t-value = and p-value = for the coefficient of Right - very significant. When Left is the only variable: Predicted Height = Left The R2 = 81.1%, and se = 2.033, the t-value = 20.99, and the p-value = for the coefficient of Left - again very significant. Clearly, both of these equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.
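The same effect can be reproduced with simulated data. The Python sketch below is an illustration only: it does not use Height.xls, and the sample size, noise levels, random seed, and helper functions (solve, ols) are all assumptions made for the example. Heights are generated as 32 plus 3.2 times foot length plus random error, and the two foot lengths differ only by small measurement noise.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least squares coefficients [intercept, b1, b2, ...] via normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data-generating process (assumed values, in inches).
random.seed(1)
foot = [random.gauss(10.5, 1.0) for _ in range(100)]
right = [f + random.gauss(0, 0.1) for f in foot]   # nearly identical
left = [f + random.gauss(0, 0.1) for f in foot]    # to each other
height = [32 + 3.2 * f + random.gauss(0, 2.0) for f in foot]

b0, b_right, b_left = ols([right, left], height)
only_right = ols([right], height)

print("both feet:", round(b_right, 2), round(b_left, 2),
      "sum =", round(b_right + b_left, 2))
print("right only:", round(only_right[1], 2))
```

With both feet in the equation the individual coefficients are unstable and can change markedly with the seed, while their sum and the single-variable coefficient stay close to 3.2.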

51 Stepwise Regression HyTex Catalogs
HyTex is a direct marketer of stereo equipment, personal computers, and other electronic products. HyTex advertises entirely by mailing catalogs to its customers, and all of its orders are taken over the telephone. The company spends a great deal of money on its catalog mailings, and it wants to be sure that this is paying off in sales. Data on 250 customers who purchased mail-order products from the HyTex Company in 1998 is available. Stepwise regression will be used to produce a regression equation for the amount spent in 1998.

52 The Data HyTex Catalogs
For each customer there are data on the following variables: Age (1 = 30 or younger, 2 = 31 to 55, 3 = 56 and older); Gender (1 = male, 0 = female); OwnHome (1 = customer owns home, 0 otherwise); Married (1 = customer is currently married, 0 otherwise); Close (1 = customer lives reasonably close to a shopping area that sells similar merchandise, 2 otherwise); Salary: combined annual salary of customer and spouse (if any); Children: number of children living with customer; Customer97 (1 = customer purchased from HyTex during 1997, 0 otherwise); Spent97: total amount of purchases in 1997 from HyTex; Catalogs: number of catalogs sent to the customer in 1998; Spent98: total amount of purchases in 1998 from HyTex.

53 Stepwise Regression Many statistical packages provide some assistance by including automatic equation-building options. These options estimate a series of regression equations by successively adding (or deleting) variables according to prescribed rules. Generically, these methods are referred to as stepwise regression. There are three types: forward, backward and stepwise. Forward - begins with no explanatory variables in the equation and successively adds one at a time until no remaining explanatory variable makes a significant contribution. Backward - begins with all potential explanatory variables in the equation and deletes them one at a time until further deletion would do more harm than good. Stepwise - much like a forward procedure, except that it also considers possible deletions along the way. The instructor should demonstrate the Stepwise option. In general, the Stepwise technique is preferred over the Forward or Backward techniques since it can both add and remove variables along the way, whereas Forward only adds variables and Backward only deletes them.

54 Stepwise Regression in StatTools HyTex Catalogs
Select StatTools + Regression & Classification + Regression Select Regression Type: Stepwise. Specify Spent98 as the response variable and select all of the other variables (besides Customer) as potential explanatory variables. Choose p-values or F-values as the appropriate criterion.

55 Interpretation of Final Regression Equation
The coefficient of Catalogs implies that $42.00 more was spent for each catalog sent. The coefficient of Married implies that $ more was spent for every married person. The coefficient of Own Home implies that $ more was spent for every person owning their own home. The coefficients for Spent97 and Customer97 are somewhat more difficult to interpret. First, both are 0 for customers who didn’t purchase the previous year. For those who did, the terms become -1, Spent97.

56 The Partial F Test This constitutes the general procedure for determining the significance of variables in Regression Analysis.

57 The Partial F Test Fifth National Bank Gender-Discrimination Suit
The Fifth National Bank is facing a gender-discrimination suit charging that its female employees receive substantially smaller salaries than its male employees. Previously we ran several regressions for Salary to see whether there is convincing evidence of salary discrimination against females. Now, we will perform the following analysis: We will regress Salary versus the Gender_Female, Yrs_Exper, and Yrs_Exper*Gender_Female_1. This will be the reduced equation. Then we’ll see whether the variables JobGrade_2 through JobGrade_6 add anything significant to the reduced equation. Next see if the variables Gender_Female_1*JobGrade_2_1 through Gender_Female_1*JobGrade_6_1 add anything significant to what we already have. Continuing on, see if EducLev_1 through EducLev_5 add anything significant to what we already have. The Bank1.xls file is the original Bank.xls file with the dummy variables already added.

58 First Solution Fifth National Bank Gender-Discrimination Suit
First, note that we created all of the dummies and interaction variables with StatTools’ Data Utilities procedures. Also, note that we have used three sets of dummies, for gender, job grade and education level. When we use these in a regression equation, the dummy for one category of each should always be excluded; it is the reference category. The reference categories we have used are “male”, job grade 1 and education level 1. The “smallest” equation uses Gender_Female, Yrs_Exper, and Yrs_Exper*Gender_Female_1 as explanatory variables. We’re off to a good start. These three variables already explain 63.9% of the variation of Salary. The next equation adds the explanatory variables JobGrade_2 through JobGrade_6.

59 Second Solution Fifth National Bank Gender-Discrimination Suit
This equation appears much better. ( R2 increased to 81.1%). Check whether it is significantly better with the partial F test. Calculate the F–ratio. Given SSER = , SSEC = , MSEC = , k – j = 8 – 3 = 5 (represents the number of extra variables) the F–ratio is 36.28 Calculate the corresponding p-value. Using Excel, the formula is: “=FDIST(x, dof1, dof2)” where x is the result of the partial F Test (above), dof1 is the number of additional variables (k – j), and dof2 is the degrees of freedom for the unexplained complete equation. Since FDIST(36.28,5,199) = 0, there is no doubt the added variables contribute to the explanatory power of the equation.
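The F-ratio arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name is hypothetical and the two SSE values in the example are made-up round numbers, not the bank output. Only n = 208 employees and the variable counts j = 3 and k = 8 come from the example.

```python
def partial_f(sse_reduced, sse_complete, n, j, k):
    """Partial F-ratio for the k - j variables added to a reduced equation.

    n is the number of observations; j and k are the numbers of explanatory
    variables in the reduced and complete equations.
    """
    extra = k - j                      # dof1: number of extra variables
    df_complete = n - k - 1            # dof2: error df of the complete equation
    mse_complete = sse_complete / df_complete
    f_ratio = ((sse_reduced - sse_complete) / extra) / mse_complete
    return f_ratio, extra, df_complete


# Hypothetical SSE values; n, j and k match the second solution.
f_ratio, dof1, dof2 = partial_f(
    sse_reduced=1000.0, sse_complete=600.0, n=208, j=3, k=8
)
print(round(f_ratio, 3), dof1, dof2)
```

The returned degrees of freedom (5 and 199) are the ones passed to FDIST above; the p-value itself still comes from Excel's FDIST or an equivalent F-distribution routine.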

60 Third Solution Fifth National Bank Gender-Discrimination Suit
This equation appears better. ( R2 increased to 84%). Check whether it is significantly better with the partial F test. Calculate the F–ratio. Given SSER = , SSEC = , MSEC = , k – j = 13 – 8 = 5 the F–ratio is 6.9368. Calculate the corresponding p-value. Using Excel, the formula is: “=FDIST(x, dof1, dof2)” where x is the result of the partial F Test (above), dof1 is the number of additional variables (k – j), and dof2 is the degrees of freedom for the unexplained complete equation. Since FDIST(6.9368,5,194) is practically 0, there is no doubt the added variables contribute to the explanatory power of the equation.

61 Fourth Solution Fifth National Bank Gender-Discrimination Suit
This equation seems very slightly better. ( R2 increased to 84.7%). Check whether it is significantly better with the partial F test. Calculate the F–ratio. Given SSER = , SSEC = , MSEC = , k – j = 17 – 13 = 4 the F–ratio is 2.383. Calculate the corresponding p-value. Using Excel, the formula is: “=FDIST(x, dof1, dof2)” where x is the result of the partial F Test (above), dof1 is the number of additional variables (k – j), and dof2 is the degrees of freedom for the unexplained complete equation. Since FDIST(2.383,4,190) = , we cannot be 95% confident the added variables contribute to the explanatory power of the equation. We therefore choose not to include them in the model.

62 Solution Fifth National Bank Gender-Discrimination Suit
According to the partial F test, the variables added in the fourth equation do not improve the solution enough to qualify for statistical significance at the 5% level. Based on this evidence, there is not much to gain from including the education dummies in the equation, so we would probably elect to exclude them. As a result, the third solution is considered the complete solution.

63 Concluding Comments Fifth National Bank Gender-Discrimination Suit
The partial F test is the formal test of significance for an extra set of variables. Many users look only at the R2 and/or se values to check whether extra variables are doing a “good job”. If the partial F test shows that a block of variables is significant, it does not imply that each variable in this block is significant. Some added variables can have low t-values.

64 Concluding Comments Fifth National Bank Gender-Discrimination Suit
Producing all of these outputs and completing the partial F Test is a lot of work. StatTools includes a routine called “Block” that simplifies the process. Select StatTools + Regression & Classification + Regression Select Regression Type: Block. Choose 4 blocks and identify which additional variables enter each block.

65 Concluding Comments While we have concentrated on the partial F test and statistical significance in this example, don’t lose sight of the bigger picture. Once we have decided on a “final” regression equation we need to analyze its implications for the problem at hand. In this case the bank is interested in possible salary discrimination against females, so we should interpret this final equation in these terms. Don’t get so caught up in the details of statistical significance that you lose sight of the original purpose of the analysis!

66 Outliers Fifth National Bank Gender-Discrimination Suit
Are there any obvious outliers from the 208 employees? In what sense are they outliers? Does it matter to the regression results, particularly those concerning gender discrimination, whether the outliers are removed? There are several places we could look for outliers. An obvious place is the Salary variable. The boxplot shown here shows that there are several employees making substantially more in salary than most of the employees. In the boxplot, the rectangle spans the 1st and 3rd quartiles, the middle line is the median, the red point is the mean, and the lines extend no more than 1.5 IQRs from the box.
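The 1.5 IQR rule behind the boxplot can be sketched as follows. This is a minimal illustration with hypothetical salaries (in $1000s), not the bank data. It uses the default quartile method of Python's statistics.quantiles, which may differ slightly from the convention StatTools uses, so borderline points could be classified differently.

```python
import statistics


def boxplot_outliers(values):
    """Return the values lying more than 1.5 IQRs beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical salaries in $1000s; one employee earns far more than the rest.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 97]
outliers = boxplot_outliers(salaries)
print(outliers)
```

Only the 97 falls beyond the upper fence here, so it would be drawn as a separate point on the boxplot.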

67 Solution We could consider these outliers and remove them, arguing perhaps that these are senior managers who shouldn’t be included in the discrimination analysis. We leave it to you to check whether the regression results are any different with these high salary employees than without them. Another place to look is at the scatterplot of the residuals versus the fitted values. This type of plot shows points with abnormally large residuals. For example, we ran the Regression with Female, YrsExper, Fem-YrsExper and the five job grade dummies, and we obtained the output and scatterplot shown here.

68 Solution This scatterplot has several points that could be considered outliers, but we focus on the point identified in the figure. The residual for this point is approximately -21. Given the se for this regression is approximately 5, this residual is over four standard errors below 0 - quite a lot. This person is found to be unusual and special circumstances can explain this. We delete this employee and rerun the regression with the same variables

69 Solution Recalling that gender discrimination is the key issue in this example we compare the coefficients of Female and Fem_YrsExper in the two outputs. The coefficient of Female has dropped from to In words, the Y-intercept for the female regression line used to be about $6000 higher than for the male line, now it’s only about $4350. More importantly, the coefficient of Fem_YrsExper has changed from to This indicates how much less steep the female line for Salary versus Yrs_Exper is than the male line. A change from to indicates less discrimination against females now than before. This unusual female employee accounts for a good bit of the discrimination argument - although a strong argument still exists even without her.

70 Part 3: Analysis of Variance and Experimental Design

71 One-Way ANOVA

72 One-way ANOVA A one-way analysis of variance, or one-way ANOVA, is the procedure for analyzing the differences between more than two population means. A one-way ANOVA is also used in randomized experiments where a single population is treated in one of several ways. The data analysis in these two situations is identical; only the interpretation of the results differs.

73 One-way ANOVA Process The one-way ANOVA procedure is usually run in two stages. The first stage tests the null hypothesis that all of the means are equal. If the p-value is not sufficiently small, then there is not enough evidence to reject the equal-means hypothesis, and the analysis stops. If the p-value is sufficiently small, we can conclude with some assurance that the means are not all equal. If the means are not all equal, the second stage determines which of the groups differ significantly from the others.

74 Background Information Effects of Shelf Height on Cereal Sales at Midway
Midway Company selects 125 stores in its chain of supermarkets to conduct an experiment on cereal sales. These stores are similar in terms of store size, customer traffic, customer types, and other characteristics. Each store stocks cereal in a similar location in the store on five-shelf displays. The 125 stores are randomly assigned to one of five groups, where each group stocks Brand X cereal in a specific shelf location (highest, next highest, middle, next lowest, lowest). The number of Brand X boxes sold at each store is recorded for the last two weeks of the experiment (the first two weeks allow customers to get used to the shelving positions). Objective: does shelf height make any difference in mean sales of Brand X cereal, and if so, which shelf heights outperform the others?

75 One-way ANOVA Solution Cereal Sales at Midway
Note that this is a designed experiment: the initial stores were chosen in an attempt to control for extraneous factors, and stores were randomly assigned to treatment levels (shelf heights). Select StatTools + Statistical Inference + One-Way ANOVA. The output consists of three basic parts: summary statistics, the ANOVA table, and confidence intervals. The next slide contains this output. For the confidence intervals, use one of the three correction methods: Bonferroni, Tukey, or Scheffe (Tukey is selected here).

76 One-way ANOVA Solution Cereal Sales at Midway

77 Summary Statistics Cereal Sales at Midway
The summary statistics show that the next to highest shelf position has the largest mean store sales (426.28), and the lowest shelf has the smallest mean store sales (334.92), with the others in between. The sample standard deviations (or variances) vary somewhat across the shelf positions, but not enough to invalidate the procedure (we assume equal variance). The side-by-side boxplots in the figure on the next slide illustrate these summary measures graphically. However, there is too much overlap to tell whether the differences are statistically significant.

78 Boxplot of Mean Results by Region Cereal Sales at Midway
Box plots: the left and right edges of the box are the 1st and 3rd quartiles, the red point is the mean, and the vertical line in the box is the median. The lines extend no further than 1.5 interquartile ranges from the box. Points outside the lines are outliers by this designation.

79 ANOVA Table Results Cereal Sales at Midway
The total variation in the ANOVA table is based on the variation of all observations around the grand mean in the summary section, and is used mainly to aid in calculations. The grand mean is the sample mean of all observations. The between variation is the sum of squared differences between the treatment level means and the grand mean, weighted by the treatment sample sizes (df = number of groups – 1). The within variation is variation due to differences within individual treatment groups (df = total sample size – number of groups). The F-ratio for the test is with a corresponding p-value of (since < .05, we reject the null hypothesis that all means are equal). Since the means are not all equal, we proceed to a comparison test to determine which means differ.
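The between and within calculations described above can be sketched as follows. This is a minimal illustration: the function name is hypothetical, and the three small groups are made-up numbers rather than the Midway sales data.

```python
def one_way_anova(groups):
    """Between/within sums of squares, degrees of freedom and F-ratio."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]

    # Between: squared distance of each group mean from the grand mean,
    # weighted by the group's sample size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within: squared distance of each observation from its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = n - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f_ratio": f_ratio,
    }


# Hypothetical sales for three shelf positions, three stores each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

For these numbers the between and within sums of squares are 26 and 6, which add up to the total variation of 32 around the grand mean.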

80 Results Cereal Sales at Midway
The final section of output lists a set of multiple comparisons of pairs of treatment levels (shelf heights). The difference column shows which two shelf heights are being compared, and the mean difference shows how much difference there is between the mean sales for the two shelf heights. The lower and upper limits give the confidence interval for the difference between the two shelf heights – if the lower value is negative and the upper value is positive, then 0 is contained in the interval and we can conclude that there is no statistically significant difference in sales between those two heights. The only statistically significant difference we can discern is between the next to highest shelf and the lowest shelf (largest and smallest mean sales). The company needs to discern whether that difference is practically significant, or whether any external factors confounded the experiment.

81 Two-Way ANOVA

82 Background Information Golf Ball Testing
Many golf ball manufacturers claim to have the “longest ball,” that is, the ball that goes the farthest on drives. This example illustrates how these claims might be tested by testing five major brands (Brand A through E). A consumer testing service runs an experiment where 60 balls of each brand are driven under three temperature conditions. The first 20 are driven in cool weather, the next 20 are driven in mild weather, and the last 20 are driven in warm weather. The goal is to see whether some brands differ significantly, on average, from other brands and what effect temperature has on the mean differences between brands.

83 Experimental Design Golf Ball Testing
Unlike the last example, this example represents a controlled experiment (20 golf balls of each brand are randomly assigned to each of three temperature levels). In general terminology, the experimental units are the individual golf balls and the response variable is the length (in yards) of each drive. There are two factors (brand and temperature), each with different treatment levels (brand has levels A through E, and temperature has three levels: cool, mild, and warm). The design is balanced because the same number of balls, 20, is used at each of the 5 x 3 = 15 treatment level combinations. There is one further piece of terminology. We call this a full factorial two-way design because we test golf balls at each of the 15 possible treatment level combinations.

84 Conducting the Experiment Golf Ball Testing
How should the consumer testing service carry out the experiment? One possibility is to have 15 golfers, each of approximately the same skill level, hit 20 balls each. The downside of this design could be that the golfers assigned to a certain brand could be having a good day. Golfers could be spread out (each golfer could hit 2 balls). This, however, introduces an unwanted source of variation: the different abilities of the golfers. You could use the same golfer for 300 balls. Unfortunately, the golfer might get tired in the process of hitting this many balls. These are the type of things designers of experiments must consider.

85 Conducting the Experiment Golf Ball Testing
The design should attempt to eliminate as many unwanted sources of variation as possible, so that any difference across the factor levels of interest can be attributed to these factors and not to extraneous factors. In this example, we suspect the best solution is to employ a “mechanical” golf ball driver to hit all 300 balls. This should reduce the inevitable random variation that would occur by using human golfers.

86 Coding the data Golf Ball Testing
Although many rows in the figure are hidden, there are actually 300 rows of data, 20 for each of the 15 combinations of Brand and Temp. There must be two “code” variables that represent the levels of the two factors and a measurement variable that represents the response variable. Again this is a balanced design, which is what StatTools expects for its two-way ANOVA procedure. Note to the instructor: an unbalanced design could be used, but you would have to use dummy variables and the Regression procedure – see Neter, Wasserman, and Kutner, Linear Statistical Models.

87 Analysis of Results Golf Ball Testing
Prompted by the table, here are some questions we might ask: (1) Look at column I. Do any brands average significantly more yards than any others (where these averages are averages over all temperatures)? (2) Look at the bottom row. Do average yardages differ significantly across temperatures (where these averages are across all brands)? (3) Look at the middle of the table. Do differences among averages of brands depend on temperature? For example, does one brand dominate in cool weather and another in warm weather? Also, do differences among averages of temperatures depend on brand? For example, are some brands very sensitive to changes in temperature while others are not?

88 Analysis of Results Golf Ball Testing
It is useful to characterize the type of information these questions are seeking. Question 1 is asking about the main effect of the brand factor. If we ignore the temperature, do some brands tend to go farther than some others? Question 2 is also asking about a main effect, the main effect of the temperature factor. If we ignore the brand, do balls tend to go farther in some temperatures than others? (This answer is obvious to golfers: balls compress better and go farther in warm temperatures.) Therefore this is not a key question, although we would expect the study to confirm what common sense tells us. Question 3 is asking about interactions between the two factors. These interactions are often the most interesting results of a two-way study. In this example interactions are patterns of the averages that could not be guessed by looking only at the “main effect” averages.

89 Interaction Effects Golf Ball Testing
Specifically, the order of brands in column F, from largest to smallest average yardages, is E, C, B, A, D. If there were no interactions at all, this ordering would hold at each temperature. For these data it is close. At cool temperatures the ordering is C, E, B, A, D; for mild, it is E, B, C, D, A; for warm, it is E, C, A, B, D. Actually, having no interaction implies even more than the preservation of these rankings.

90 Interaction Effects Golf Ball Testing
It implies that the difference between any two brand averages is the same at any of the three temperature levels. For example, the differences between brands E and D at the three temperature levels are 9.8, 18.1, and 14.8 yards. If there were no interactions at all, these three differences would be equal.
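This check is plain subtraction on the table of cell means. The Python sketch below uses hypothetical mean yardages (not the GOLFBALLS.XLS values), and the function name is an assumption made for the example.

```python
# Hypothetical mean yardages by brand and temperature (illustrative only).
cell_means = {
    "E": {"cool": 230.0, "mild": 245.0, "warm": 258.0},
    "D": {"cool": 222.0, "mild": 230.0, "warm": 246.0},
}


def brand_gaps(means, brand_a, brand_b):
    """Difference in mean yardage between two brands at each temperature."""
    return {
        temp: means[brand_a][temp] - means[brand_b][temp]
        for temp in means[brand_a]
    }


gaps = brand_gaps(cell_means, "E", "D")
# With no interaction at all, every gap would be equal and the spread would be 0.
spread = max(gaps.values()) - min(gaps.values())
print(gaps, spread)
```

A nonzero spread only shows that the sample means are not perfectly additive; whether the interaction is statistically significant is a question for the two-way ANOVA table.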

91 Interaction Graphically Golf Ball Testing
The concept of interaction is much easier to understand by looking at graphs. The following graphs, which are both outputs from StatTools’ two-way ANOVA procedure, represent two ways of looking at the pattern of averages for different combinations of brand and temperature. The first graph shows a line for each brand, where each point on the line corresponds to a different temperature. The second shows the same information with the roles of brand and temperature reversed.

92 Interaction Graphically Golf Ball Testing

93 Interaction Graphically Golf Ball Testing

94 Interaction Graphically Golf Ball Testing
Neither graph is better than the other; they simply show the same data from different perspectives. The key to either is whether the lines are parallel. If they are, then there are no interactions - the effect of one factor on average yardage is the same regardless of the level of the other factor. The more nonparallel they are, however, the stronger the interactions are. The lines in either of these graphs are not exactly parallel but they are nearly so. This implies that there is very little interaction between brand and temperature.

95 Type of Interactions In general, interactions can be of several types.
Shown here are two contrasting types. These graphs use different data than GOLFBALLS.XLS. In the first graph brand A dominates at all temperatures. However, there is an interaction because the difference between brands increases as temperatures increase.

96 Type of Interactions In this situation the interaction effect is interesting, but the main effect of brand - brand A is better when averaged over all temperatures - is also interesting. This type of interaction, with no crossover, is sometimes referred to as ordinal. The situation is quite different in the next graph, where there is a crossover.

97 Type of Interactions Brand A is somewhat better at cool temperatures, but brand B is better at mild and warm temperatures. In this case the interaction is the most interesting finding, and the main effect of brand is much less interesting. This is sometimes referred to as disordinal.

98 Type of Interactions In simple terms, if you are a golfer, you’d buy brand A in cool temperatures and brand B otherwise, and you wouldn’t care very much which brand is better when averaged over all temperatures. For these reasons, we check first for interactions in a two-way design. If there are significant interactions, then the main effects might not be as interesting. However, if there are no significant interactions, then main effects generally become more important.

99 Main Effects versus Interactions
Main effects are differences in averages across the levels of one factor, where these averages are averages over all levels of the other factor. In a table of sample means, we can check for main effects by looking at the averages in the “Grand Total” column and row. In contrast, the interactions are patterns of averages in the main body of the table and are best shown graphically. They indicate whether the effect of one factor depends on the level of the other factor.

100 Two Way ANOVA Table The next question is whether the main effects and interactions we see in the table of sample means are statistically significant. As in a one-way ANOVA, this is answered by an ANOVA table. However, instead of having just two sources of variation, within and between, as in a one-way ANOVA, there are now four sources of variation: one for the main effect of each factor, one for interactions, and one for the variation within treatment level combinations.
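For reference, the four sources can be written out for a balanced design. The notation is an assumption introduced here, not taken from the slides: factor A has I levels, factor B has J levels, and each treatment level combination has n observations.

```latex
\[
SS_{\text{total}} = SS_{A} + SS_{B} + SS_{AB} + SS_{\text{within}}
\]
\[
\underbrace{IJn - 1}_{\text{total df}}
  = \underbrace{(I-1)}_{A}
  + \underbrace{(J-1)}_{B}
  + \underbrace{(I-1)(J-1)}_{AB}
  + \underbrace{IJ(n-1)}_{\text{within}}
\]
\[
F_{A} = \frac{SS_{A}/(I-1)}{MS_{\text{within}}}, \qquad
F_{B} = \frac{SS_{B}/(J-1)}{MS_{\text{within}}}, \qquad
F_{AB} = \frac{SS_{AB}/\bigl((I-1)(J-1)\bigr)}{MS_{\text{within}}}, \qquad
MS_{\text{within}} = \frac{SS_{\text{within}}}{IJ(n-1)}
\]
```

Each F ratio compares one source of variation with the within variation, and its p-value comes from the F distribution with the corresponding degrees of freedom.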

101 Analysis of Results Golf Ball Testing
For the golf ball data, two-way ANOVA separates the total variation across all 300 observations into four sources. There is variation due to different brands producing different average yardages. There is variation due to different average yardages at different temperatures. There is variation due to the interactions we saw in the interaction graphs. There is the same type of “within” variation as in one-way ANOVA. This is the variation that occurs because yardages for the 20 balls of the same brand hit at the same temperature are not all identical.
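The separation into four sources can be sketched for a balanced design as follows. This is not StatTools' implementation; the function name `two_way_anova_ss` and the tiny data set (2 brands, 2 temperatures, 2 balls per combination) are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def two_way_anova_ss(data):
    """data[i][j] is the list of observations for level i of factor A and
    level j of factor B. Assumes a balanced design (same count per cell)."""
    I, J, n = len(data), len(data[0]), len(data[0][0])
    all_obs = [y for row in data for cell in row for y in cell]
    grand = sum(all_obs) / len(all_obs)
    cell_mean = [[sum(cell) / n for cell in row] for row in data]
    a_mean = [sum(cell_mean[i]) / J for i in range(I)]
    b_mean = [sum(cell_mean[i][j] for i in range(I)) / I for j in range(J)]

    ss = {
        # Variation due to different averages across levels of factor A.
        "A": J * n * sum((m - grand) ** 2 for m in a_mean),
        # Variation due to different averages across levels of factor B.
        "B": I * n * sum((m - grand) ** 2 for m in b_mean),
        # Variation due to interactions (non-parallel lines).
        "AB": n * sum((cell_mean[i][j] - a_mean[i] - b_mean[j] + grand) ** 2
                      for i in range(I) for j in range(J)),
        # "Within" variation: observations around their own cell mean.
        "within": sum((y - cell_mean[i][j]) ** 2
                      for i in range(I) for j in range(J) for y in data[i][j]),
        "total": sum((y - grand) ** 2 for y in all_obs),
    }
    df = {"A": I - 1, "B": J - 1, "AB": (I - 1) * (J - 1), "within": I * J * (n - 1)}
    ms_within = ss["within"] / df["within"]
    # F ratio for each effect: its mean square over the within mean square.
    f_ratio = {k: (ss[k] / df[k]) / ms_within for k in ("A", "B", "AB")}
    return ss, df, f_ratio


# Hypothetical yardages: data[brand][temperature] = [ball 1, ball 2].
toy = [[[200, 204], [210, 214]],
       [[196, 200], [212, 216]]]
ss, df, f_ratio = two_way_anova_ss(toy)
print(ss)       # A: 2.0, B: 338.0, AB: 18.0, within: 32.0, total: 390.0
print(f_ratio)  # A: 0.25, B: 42.25, AB: 2.25
```

The four sums of squares add up to the total, which is the decomposition the ANOVA table reports. Turning each F ratio into a p-value requires the F distribution, which this sketch leaves out.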

102 Output of Results Golf Ball Testing
Select the StatTools/Statistical Inference/Two-way ANOVA menu item, selecting Brand and Temp as the “code” variables and Yards as the “measurement” variable. Output includes tables of sample sizes, sample means, and sample standard deviations, as well as the ANOVA table.

103 Analysis of Results Golf Ball Testing
We test whether main effects or interactions are statistically significant in the usual way - by examining p-values. Looking first at the interactions, the p-value is about 0.03, which says that the lines in the interaction graphs are significantly non-parallel, at least at the 5% significance level. There is at least some interaction between brand and temperature (although the practical significance could be disputed). The two p-values for the main effects in cells G32 and G33 are practically 0, meaning that there are differences across brands and across temperatures. Of course, the main effect of temperature was a foregone conclusion - we already know that balls do not go as far in cold temperatures - but the main effect of brand is more interesting. According to the evidence, some brands definitely go farther, on average, than some others.

