 # Copyright ©2006 Brooks/Cole, a division of Thomson Learning, Inc. Analysis of Variance Chapter 16.

ANOVA. Analysis of variance is a tool for analyzing how the mean value of a quantitative response variable is affected by one or more categorical explanatory factors. With one categorical variable it is a one-way ANOVA; with two categorical variables it is a two-way ANOVA.

16.1 Comparing Means with an ANOVA F-Test. The F-statistic compares the variation among the sample means with the natural variation within groups. The hypotheses are $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ and $H_a$: the means are not all equal.

Variation among sample means is 0 if all k sample means are equal and gets larger the more spread out they are. If F is large enough, there is evidence that at least one population mean differs from the others, so the null hypothesis is rejected. The p-value is found using an F-distribution (more later).
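As a rough illustration of this behavior, the sketch below computes an F-statistic as (mean square for groups) / (mean square error) for two small data sets. The function name `f_statistic` and both data sets are hypothetical and do not come from the slides.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F = (mean square for groups) / (mean square error)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation among the sample means, weighted by group size.
    ss_groups = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Natural variation of observations around their own group mean.
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_groups / (k - 1)) / (ss_error / (n_total - k))

# Hypothetical data, invented for illustration only.
equal_means = [[2.0, 3.0, 4.0], [2.0, 3.0, 4.0], [2.0, 3.0, 4.0]]
spread_means = [[2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]

print(f_statistic(equal_means))   # 0 when all sample means are equal
print(f_statistic(spread_means))  # larger when the sample means are spread out
```

Both data sets have the same within-group spread, so the difference in F comes only from how far apart the sample means are.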

Example 16.1 Seat Location and GPA. Q: Do the best students sit in the front of a classroom? Data on seat location and GPA for n = 384 students: 88 sit in front, 218 in the middle, 78 in back. Students sitting in the front generally have slightly higher GPAs than others.

Example 16.1 Seat Location and GPA (cont). $H_0: \mu_1 = \mu_2 = \mu_3$; $H_a$: the means are not all equal. The F-statistic is 6.69 and the p-value is 0.001. The p-value is so small that we reject $H_0$ and conclude there are differences among the means.

Example 16.1 Seat Location and GPA (cont). 95% confidence intervals for the 3 population means: the interval for “front” does not overlap with the other two intervals => significant difference between mean GPA for front-row sitters and mean GPA for other students.

Notation for Summary Statistics. k = number of groups. $\bar{x}_i$, $s_i$, and $n_i$ are the mean, standard deviation, and sample size for the i-th sample group. N = total sample size ($N = n_1 + n_2 + \dots + n_k$). Example 16.1 Seat Location and GPA (cont): three seat locations => k = 3; $n_1 = 88$, $n_2 = 218$, $n_3 = 78$; N = 88 + 218 + 78 = 384.

Assumptions for the F-Test. Samples are independent random samples. The distribution of the response variable is a normal curve within each population. Different populations may have different means. All populations have the same standard deviation, $\sigma$. (Figure omitted: how k = 3 populations might look.)

Conditions for Using the F-Test. The F-statistic can be used if data are not extremely skewed, there are no extreme outliers, and group standard deviations are not markedly different. Tests based on the F-statistic are valid for data with skewness or outliers if sample sizes are large. A rough criterion for standard deviations is that the largest of the sample standard deviations should not be more than twice as large as the smallest of the sample standard deviations.
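The rough criterion for standard deviations is easy to check mechanically. The sketch below is one way to do it; the helper name `sd_ratio_ok` is hypothetical, the first pair of values is the largest and smallest sample standard deviation quoted in Example 16.4, and the second pair is made up.

```python
def sd_ratio_ok(sample_sds, max_ratio=2.0):
    """Return True if the largest sample SD is at most max_ratio times the smallest."""
    largest, smallest = max(sample_sds), min(sample_sds)
    return largest <= max_ratio * smallest

# Largest and smallest sample SDs from Example 16.4 (Viper 5.02, Acura 2.92).
print(sd_ratio_ok([5.02, 2.92]))  # True

# Hypothetical values that violate the rough criterion.
print(sd_ratio_ok([4.0, 1.5]))    # False
```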

Example 16.1 Seat Location and GPA (cont). The boxplot showed two outliers in the group of students who typically sit in the middle of a classroom, but there are 218 students in that group, so these outliers don’t have much influence on the results. The standard deviations for the three groups are nearly the same. Data do not appear to be skewed. Necessary conditions for the F-test seem satisfied.

The Family of F-Distributions. Skewed distributions with minimum value of 0. A specific F-distribution is indicated by two parameters called degrees of freedom: numerator degrees of freedom and denominator degrees of freedom. In one-way ANOVA, numerator df = k – 1, and denominator df = N – k.
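A minimal sketch of the degrees-of-freedom rule, applied to the group counts and sample sizes stated in Examples 16.1 and 16.2. The function name `anova_df` is a hypothetical helper.

```python
def anova_df(k, n_total):
    """Return (numerator df, denominator df) for a one-way ANOVA F-test."""
    return k - 1, n_total - k

print(anova_df(3, 384))  # Example 16.1: (2, 381)
print(anova_df(7, 66))   # Example 16.2: (6, 59)
```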

Determining the p-Value. Statistical software reports the p-value in output. Table A.4 provides critical values for 1% and 5% significance levels. If the F-statistic is greater than the 5% critical value, the p-value < 0.05. If the F-statistic is greater than the 1% critical value, the p-value < 0.01. If the F-statistic is between the 5% and 1% critical values, the p-value is between 0.01 and 0.05.

Example 16.2 Testosterone and Occupation. Study: compare mean testosterone levels for k = 7 occupational groups. Reported F-statistic was F = 2.5 and p-value < 0.05. N = 66 men: num df = k – 1 = 7 – 1 = 6, den df = N – k = 66 – 7 = 59. Using Table A.4 with df of (6, 60), the closest table entry: the 5% critical value is 2.25 and the F-statistic was larger, so the p-value < 0.05.
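The table-lookup logic can be sketched as a small function. The name `bracket_p_value` is hypothetical. The 5% critical value of 2.25 comes from the example above; the 1% critical value of 3.12 for df (6, 60) is not given in the slides and is assumed here from a standard F table, so treat it as an assumption.

```python
def bracket_p_value(f_stat, crit_05, crit_01):
    """Bracket the p-value using the 5% and 1% critical values of the F-distribution."""
    if f_stat > crit_01:
        return "p < 0.01"
    if f_stat > crit_05:
        return "0.01 < p < 0.05"
    return "p > 0.05"

# Example 16.2: F = 2.5, 5% critical value 2.25 (from the slides).
# crit_01 = 3.12 is an assumed table value, not stated in the slides.
print(bracket_p_value(2.5, crit_05=2.25, crit_01=3.12))  # 0.01 < p < 0.05
```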

Multiple Comparisons. Multiple comparisons: two or more comparisons are made to examine a specific pattern of differences among means. Most common: all pairwise comparisons. Ways to make inferences about each pair of means: (1) a significance test to assess whether two means significantly differ; (2) a confidence interval for the difference, where, if 0 is not in the interval, there is a statistically significant difference.

Multiple Comparisons (cont). Many statistical tests done => increased risk of making at least one type I error (erroneously rejecting a null hypothesis). Several procedures exist to control the overall family type I error rate or the overall family confidence level. The family error rate for a set of significance tests is the probability of making one or more type I errors when more than one significance test is done. The family confidence level for a procedure used to create a set of confidence intervals is the proportion of times all intervals in the set capture their true parameter values.
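To see how quickly the family error rate can grow, the sketch below counts the pairwise comparisons among k groups and computes the chance of at least one type I error. It assumes the tests are independent, which pairwise comparisons sharing the same groups are not, so the result is only a rough guide. The function names are hypothetical.

```python
from math import comb

def pairwise_count(k):
    """Number of distinct pairs among k group means."""
    return comb(k, 2)

def family_error_rate(alpha, m):
    """P(at least one type I error) in m tests at level alpha, ASSUMING independence."""
    return 1 - (1 - alpha) ** m

m = pairwise_count(3)                        # three seat locations
print(m)                                     # 3
print(round(family_error_rate(0.05, m), 4))  # 0.1426
print(round(min(1.0, m * 0.05), 4))          # 0.15, an upper bound that needs no independence
```

With three comparisons at the 0.05 level each, the chance of at least one false rejection is already close to 0.15, which is why procedures that fix the family rate are used.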

Example 16.1 Seat Location and GPA (cont). Pairwise comparison output: Tukey uses a family confidence level of 0.95; Fisher uses a 0.95 level for each individual interval. Here, both give the same conclusions: only one interval covers 0, the one for $\mu_{\text{Middle}} - \mu_{\text{Back}}$. It appears that population mean GPAs differ for front and middle students and for front and back students.

16.2 Details of One-Way Analysis of Variance. Fundamental concept: the variation among the data values in the overall sample can be separated into (1) differences between group means and (2) natural variation among observations within a group. Total variation = Variation between groups + Variation within groups. The ANOVA table displays this information.

Measuring Variation Between Groups. Sum of squares for groups = SS Groups. Numerator of F-statistic = mean square for groups (MS Groups).
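The slide's formulas did not survive extraction. The standard definitions, written with the summary-statistic notation above and with $\bar{x}$ denoting the mean of all N observations (a symbol assumed here), are:

$$
\text{SS Groups} = \sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2,
\qquad
\text{MS Groups} = \frac{\text{SS Groups}}{k - 1}
$$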

Measuring Variation within Groups. Sum of squared errors = SS Error. Denominator of F-statistic = mean square error (MSE). Pooled standard deviation: the square root of the mean square error.
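The formulas on this slide were lost as well. In the standard formulation, with $x_{ij}$ denoting the j-th observation in group i (a symbol assumed here), they read:

$$
\text{SS Error} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 = \sum_{i=1}^{k} (n_i - 1)\,s_i^2,
\qquad
\text{MSE} = \frac{\text{SS Error}}{N - k},
\qquad
s_p = \sqrt{\text{MSE}}
$$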

Measuring Total Variation. Total sum of squares = SS Total = SSTO. SS Total = SS Groups + SS Error.
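In symbols, assuming $x_{ij}$ is the j-th observation in group i and $\bar{x}$ is the mean of all N observations, the usual definition of the total sum of squares is:

$$
\text{SS Total} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
$$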

General Format of a One-Way ANOVA Table
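The table itself was an image and is not reproduced here. As a sketch of the usual layout (rows for Groups, Error, and Total; columns for df, SS, MS, and F), the code below builds the entries from raw data. The function name `anova_table` and the data are hypothetical; this is an illustration under those assumptions, not the output shown in the slides.

```python
from statistics import mean

def anova_table(groups):
    """Build the rows of a one-way ANOVA table as nested dictionaries."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    ss_groups = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)
    ms_groups = ss_groups / (k - 1)
    ms_error = ss_error / (n_total - k)
    return {
        "Groups": {"df": k - 1, "SS": ss_groups, "MS": ms_groups, "F": ms_groups / ms_error},
        "Error": {"df": n_total - k, "SS": ss_error, "MS": ms_error},
        "Total": {"df": n_total - 1, "SS": ss_total},
    }

# Hypothetical measurements for three groups, invented for illustration.
data = [[2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]
table = anova_table(data)
for source, row in table.items():
    print(source, {name: round(value, 3) for name, value in row.items()})
```

The Groups and Error rows add up to the Total row in both the df and SS columns, which is the decomposition Total variation = Variation between groups + Variation within groups.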

Example 16.3 Comparison of Weight Loss Programs. Program 3 appears to have the highest weight loss overall.

Example 16.3 Comparison of Weight Loss Programs (cont)

Example 16.3 Comparison of Weight Loss Programs (cont). “Factor” is used instead of Groups because the groups (weight-loss programs) form an explanatory factor for the response. Note: Pooled StDev is the square root of MS Error.

Example 16.4 Top Speeds of Supercars. Data: top speeds for six runs on each of five supercars. Source: Kitchens (1998, p. 783).

Example 16.4 Top Speeds (cont)

Example 16.4 Top Speeds (cont). F = 25.15 and the p-value is reported as 0.000 => reject the null hypothesis that population mean speeds are the same for all five cars. Conditions are satisfied: data are not skewed and there are no extreme outliers. The largest sample standard deviation (5.02, Viper) is not more than twice as large as the smallest (2.92, Acura). MS Error = 14.5 is an estimate of the variance of top speed for the hypothetical distribution of all possible runs with one car. The estimated standard deviation for each car is its square root, 3.81. Based on sample means and CIs, Porsche and Ferrari seem to be significantly faster than the other cars.

95% Confidence Intervals for the Population Means. In one-way analysis of variance, a confidence interval for a population mean $\mu_i$ is $\bar{x}_i \pm t^* \dfrac{s_p}{\sqrt{n_i}}$, where $s_p = \sqrt{\text{MSE}}$ and $t^*$ is such that the confidence level is the probability between $-t^*$ and $t^*$ in a t-distribution with df = N – k.
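A worked sketch of this interval, using MS Error = 14.5 and six runs per car from Example 16.4, so df = N – k = 30 – 5 = 25. The multiplier t* = 2.06 is assumed from a standard t table for 95% confidence with 25 degrees of freedom, and the sample mean of 180.0 is hypothetical because the slides do not list the individual car means. The function name `mean_ci` is also hypothetical.

```python
from math import sqrt

def mean_ci(sample_mean, ms_error, n_i, t_star):
    """Confidence interval for one population mean, using the pooled standard deviation."""
    s_p = sqrt(ms_error)                     # pooled standard deviation
    half_width = t_star * s_p / sqrt(n_i)    # t* times the standard error of the mean
    return sample_mean - half_width, sample_mean + half_width

# MS Error = 14.5 and n_i = 6 are from Example 16.4.
# t_star = 2.06 (df = 25) is an assumed table value; 180.0 is a hypothetical sample mean.
low, high = mean_ci(180.0, ms_error=14.5, n_i=6, t_star=2.06)
print(round(sqrt(14.5), 2))          # 3.81, the estimated standard deviation per car
print(round(low, 1), round(high, 1))
```

Because every group shares the same pooled standard deviation and here the same sample size, each car's interval has the same half-width, roughly 3.2 under these assumptions.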

16.3 Other Methods. When data are skewed or extreme outliers are present, it is better to analyze the median instead of the mean. Two such tests are: 1. Kruskal-Wallis Test 2. Mood’s Median Test. These are also called nonparametric tests. $H_0$: population medians are equal. $H_a$: population medians are not all equal.
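The slides do not show how either test is computed. As a rough sketch, the Kruskal-Wallis statistic replaces the observations with their ranks in the combined sample and compares rank sums across groups: H = 12 / (N(N + 1)) × Σ R_i² / n_i – 3(N + 1), where R_i is the rank sum of group i. The code below implements that textbook form without the correction for ties; the function names and the data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # average of the 1-based ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, offset = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        total += rank_sum ** 2 / len(g)
        offset += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical drinks-per-week values for three seat locations.
front, middle, back = [0, 1, 2], [3, 4, 6], [5, 8, 12]
print(round(kruskal_wallis_h([front, middle, back]), 3))  # 6.489
```

Statistical software typically converts H to a p-value using a chi-square approximation with k – 1 degrees of freedom; that step is omitted here.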

Example 16.5 Drinks and Seat Location. Data: seat location and number of alcoholic drinks per week. Students sitting in the back report drinking more. Data appear skewed, and sample standard deviations differ.

Example 16.5 Drinks and Seat (cont). P = 0.000 => strong evidence that the population median numbers of drinks per week are not all equal.

Example 16.5 Drinks and Seat (cont). P = 0.000 => the null hypothesis of equal population medians can be rejected.

16.4 Two-Way ANOVA (CD Topic S4). Two-way analysis of variance examines how two categorical explanatory variables affect the mean of a quantitative response variable. Main effect: the overall effect of a single explanatory variable. Interaction: the effect of one explanatory variable on the response variable depends upon the specific value or level of the other explanatory variable.

Example 16.6 Happy Faces and Tips. Q: Does drawing a happy face on the restaurant bill increase the average tip to the server? The effect of drawing a happy face depended on gender. It was speculated that customers felt a happy face was not gender appropriate for males.

Example 16.7 You’ve Got to Have Heart. Response: weight gain in infants. Explanatory: heartbeat status (yes or no) and initial weight (low, med, high). Weight gain is generally greater for the heartbeat group, so there is a main effect for heartbeat status. Approximately parallel lines => little or no interaction.

Example 16.6 Faces and Tips (cont). Two-way ANOVA: three F-statistics are computed, one for each main effect and one for the interaction. Since the interaction effect is significant, the main effects are difficult to interpret.
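A small numerical sketch can show why a significant interaction makes a main effect hard to read. The tip percentages below are invented for illustration and are not the study's data; the layout (two genders, happy face or none) loosely follows Example 16.6, and all names are hypothetical.

```python
from statistics import mean

# Hypothetical tip percentages per cell, invented for illustration only.
tips = {
    ("female", "face"): [30.0, 34.0, 32.0],
    ("female", "none"): [26.0, 28.0, 30.0],
    ("male", "face"): [17.0, 19.0, 18.0],
    ("male", "none"): [20.0, 22.0, 21.0],
}
cell_mean = {cell: mean(values) for cell, values in tips.items()}

def face_effect(gender):
    """Difference in mean tip (happy face minus no happy face) for one gender."""
    return cell_mean[(gender, "face")] - cell_mean[(gender, "none")]

# Interaction: the happy-face effect differs between the two genders.
interaction = face_effect("female") - face_effect("male")
# Main effect of the happy face: its effect averaged over both genders.
main_effect = (face_effect("female") + face_effect("male")) / 2

print(face_effect("female"), face_effect("male"))  # 4.0 -3.0
print(interaction, main_effect)                    # 7.0 0.5
```

In this made-up data the happy face raises tips for one gender and lowers them for the other, so the averaged main effect of 0.5 describes neither group well.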