Comparing k Populations (presentation transcript)

1 Comparing k Populations
Means – One way Analysis of Variance (ANOVA)

2 The F test – for comparing k means
Situation: We have k normal populations. Let μi and σi denote the mean and standard deviation of population i, i = 1, 2, 3, …, k. Note: we assume that the standard deviation is the same for each population: σ1 = σ2 = … = σk = σ.

3 We want to test H0: μ1 = μ2 = μ3 = … = μk (all population means are equal) against HA: at least one pair of means differ.

4 The data Assume we have collected data from each of the k populations.
Let xi1, xi2, xi3, … denote the ni observations from population i, i = 1, 2, 3, …, k. Let x̄i and si denote the sample mean and sample standard deviation of the observations from population i.

5 One possible solution (incorrect)
Choose the populations two at a time, then perform a two-sample t test of H0: μi = μj against HA: μi ≠ μj. Repeat this for every possible pair of populations.

6 The flaw with this procedure is that you are performing a collection of tests rather than a single test. If each test is performed with α = 0.05, then the probability that each individual test makes a type I error is 5%, but the probability that the group of tests makes a type I error could be considerably higher than 5%. That is, suppose there is no difference in the means of the populations; the chance that this procedure declares a significant difference somewhere could be considerably higher than 5%.

7 The Bonferroni inequality
If N tests are performed, each with significance level α, then P[group of N tests makes a type I error] ≤ 1 − (1 − α)^N. Example: suppose α = 0.05 and N = 10; then P[group of N tests makes a type I error] ≤ 1 − (0.95)^10 ≈ 0.40.
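As a quick numerical check of the calculation above (a minimal sketch in plain Python; only the numbers already given on this slide are used):

    # Error-rate expression for N tests, each at level alpha
    alpha, N = 0.05, 10
    fwe = 1 - (1 - alpha) ** N
    print(f"{fwe:.2f}")   # 0.40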

8 For this reason we are going to consider a single test for testing:
H0: μ1 = μ2 = … = μk against HA: at least one pair of means differ. Note: if k = 10, the number of pairs of means (and hence the number of tests that would have to be performed) is C(10, 2) = 45.
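A small Python sketch confirming the count of pairwise tests, and how large the error-rate expression from the previous slide becomes for that many tests:

    from math import comb

    # Number of pairwise comparisons among k = 10 means, and the error-rate
    # expression from the previous slide evaluated at that many tests
    k, alpha = 10, 0.05
    n_pairs = comb(k, 2)
    print(n_pairs)                               # 45
    print(f"{1 - (1 - alpha) ** n_pairs:.2f}")   # about 0.90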

9 The F test

10 To test H0: μ1 = μ2 = … = μk against HA: at least one pair of means differ, use the test statistic F = MSBetween / MSWithin, where the two mean squares are defined on the following slides.

11 The statistic SSBetween = Σi ni(x̄i − x̄)² is called the Between Sum of Squares; it measures the variability between samples. k − 1 is known as the Between degrees of freedom, and MSBetween = SSBetween / (k − 1) is called the Between Mean Square.

12 The statistic SSWithin = Σi (ni − 1)si² = Σi Σj (xij − x̄i)² is called the Within Sum of Squares; it measures the variability within samples. N − k is known as the Within degrees of freedom, and MSWithin = SSWithin / (N − k) is called the Within Mean Square.

13 Then F = MSBetween / MSWithin.

14 The Computing formula for F:
Compute 1) the sample totals Ti = Σj xij; 2) the grand total G = Σi Ti; 3) N = n1 + n2 + … + nk; 4) Σi Σj xij²; 5) Σi Ti²/ni.

15 Then 1) SSBetween = Σi Ti²/ni − G²/N; 2) SSWithin = Σi Σj xij² − Σi Ti²/ni; 3) F = MSBetween / MSWithin = [SSBetween/(k − 1)] / [SSWithin/(N − k)].
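These computing formulas translate directly into code. A minimal Python sketch, using small made-up samples rather than the diet data of the example that follows:

    # One-way ANOVA via the computing formulas above.
    # The three samples here are illustrative only (they are not the diet data).
    samples = [
        [3.1, 4.2, 3.8, 4.0],   # population 1
        [5.0, 5.4, 4.9],        # population 2
        [2.8, 3.0, 3.3, 2.9],   # population 3
    ]

    k = len(samples)
    n = [len(s) for s in samples]                      # n_i
    T = [sum(s) for s in samples]                      # T_i = total of sample i
    G = sum(T)                                         # grand total
    N = sum(n)                                         # total number of observations
    sum_sq = sum(x * x for s in samples for x in s)    # sum of all x_ij^2
    sum_T2_over_n = sum(t * t / ni for t, ni in zip(T, n))

    ss_between = sum_T2_over_n - G * G / N
    ss_within = sum_sq - sum_T2_over_n
    F = (ss_between / (k - 1)) / (ss_within / (N - k))
    print(round(F, 3))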

16 The critical region for the F test
We reject H0 if F ≥ Fα, where Fα is the critical point under the F distribution with ν1 = k − 1 degrees of freedom in the numerator and ν2 = N − k degrees of freedom in the denominator.
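The critical point Fα is usually read from a table; it can also be obtained from software, for example (assuming scipy is available; k = 6 and N = 60 anticipate the diet example below):

    # Looking up the critical point F_alpha (assumes scipy is available)
    from scipy.stats import f

    alpha, k, N = 0.05, 6, 60
    F_crit = f.ppf(1 - alpha, k - 1, N - k)
    print(round(F_crit, 3))   # about 2.386 for (5, 54) degrees of freedom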

17 Example: In the following example we are comparing weight gains resulting from the following six diets:
Diet 1 - High Protein, Beef
Diet 2 - High Protein, Cereal
Diet 3 - High Protein, Pork
Diet 4 - Low Protein, Beef
Diet 5 - Low Protein, Cereal
Diet 6 - Low Protein, Pork

18 (Table of weight-gain data for the six diets; shown only as an image in the original slides.)

19 Hence, computing these quantities from the weight-gain data gives the sums of squares and F-ratio summarized in the ANOVA table below.

20 Thus, since F = 4.3 > F0.05 = 2.386, we reject H0.

21 The ANOVA Table: a convenient method for displaying the calculations for the F-test.

22 The ANOVA Table

Source     d.f.    Sum of Squares   Mean Square   F-ratio
Between    k - 1   SSBetween        MSBetween     MSBetween/MSWithin
Within     N - k   SSWithin         MSWithin
Total      N - 1   SSTotal

23 The Diet Example

Source     d.f.   Sum of Squares   Mean Square   F-ratio
Between    5      4612.933         922.587       4.3   (p = )
Within     54
Total      59
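A table like this can also be produced directly from the raw samples in software; a sketch using scipy's one-way ANOVA routine (the sample values here are placeholders, not the actual diet data, which are not reproduced in this transcript):

    # Computing the F-ratio directly from raw samples with scipy's f_oneway
    from scipy.stats import f_oneway

    diet_a = [100, 110, 95, 105, 98]   # illustrative data only
    diet_b = [80, 85, 90, 88, 84]
    diet_c = [102, 98, 107, 99, 101]

    F, p = f_oneway(diet_a, diet_b, diet_c)
    print(round(F, 3), round(p, 4))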

24 Equivalence of the F-test and the t-test when k = 2

25 the F-test

26

27 Hence F = t², so when k = 2 the F-test and the two-sample t-test are equivalent: they reject H0 for exactly the same samples.
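A quick numerical illustration of this equivalence (assuming scipy; the two samples are arbitrary):

    # Numerical check that F = t^2 when k = 2
    from scipy.stats import f_oneway, ttest_ind

    a = [4.1, 5.0, 4.7, 5.3, 4.9]
    b = [3.2, 3.9, 4.1, 3.5, 3.8]

    t_stat, _ = ttest_ind(a, b)     # pooled-variance two-sample t test
    F_stat, _ = f_oneway(a, b)      # one-way ANOVA with k = 2
    print(round(t_stat ** 2, 6), round(F_stat, 6))   # the two values agree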

28 Using SPSS Note: The use of another statistical package such as Minitab is similar to using SPSS

29 Assume the data is contained in an Excel file

Each variable is in a column:
Weight gain (wtgn)
diet
Source of protein (Source)
Level of Protein (Level)

31 After starting the SPSS program the following dialogue box appears:

32 If you select Opening an existing file and press OK the following dialogue box appears

33 The following dialogue box appears:

34 If the variable names are in the file, ask it to read the names. If you do not specify the Range, the program will identify the Range. Once you click OK, two windows will appear:

35 One that will contain the output:

36 The other containing the data:

37 To perform ANOVA select Analyze -> General Linear Model -> Univariate

38 The following dialog box appears

39 Select the dependent variable and the fixed factors
Press OK to perform the Analysis

40 The Output

41 Comments: The F-test tests H0: μ1 = μ2 = μ3 = … = μk against HA: at least one pair of means is different. If H0 is accepted we conclude that all means are equal (not significantly different). If H0 is rejected we conclude that at least one pair of means is significantly different. The F-test gives no information as to which pairs of means are different. One can now use two-sample t tests to determine which pairs of means are significantly different.

42 Fisher's LSD (least significant difference) procedure:
Test H0: μ1 = μ2 = μ3 = … = μk against HA: at least one pair of means is different, using the ANOVA F-test. If H0 is accepted we conclude that all means are equal (not significantly different) and stop. If H0 is rejected we conclude that at least one pair of means is significantly different; follow this up with two-sample t tests to determine which pairs of means are significantly different (a sketch of this follow-up step is given below).
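A minimal Python sketch of the follow-up step, using the pooled MSWithin as the common variance estimate in each pairwise t test (the three samples are illustrative placeholders; in the diet example there would be six):

    # Follow-up pairwise t tests in Fisher's LSD procedure (illustrative data)
    from itertools import combinations
    from math import sqrt
    from scipy.stats import t as t_dist

    samples = [[60, 65, 70, 62], [50, 52, 55, 49], [58, 61, 66, 64]]

    k = len(samples)
    n = [len(s) for s in samples]
    N = sum(n)
    means = [sum(s) / len(s) for s in samples]
    ss_within = sum((x - means[i]) ** 2 for i, s in enumerate(samples) for x in s)
    ms_within = ss_within / (N - k)              # pooled variance estimate
    t_crit = t_dist.ppf(0.975, N - k)            # two-sided 5% critical value, N - k d.f.

    for i, j in combinations(range(k), 2):
        t_stat = (means[i] - means[j]) / sqrt(ms_within * (1 / n[i] + 1 / n[j]))
        verdict = "significant" if abs(t_stat) > t_crit else "not significant"
        print(f"sample {i + 1} vs sample {j + 1}: t = {t_stat:.2f} ({verdict})")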

43 Example: In the following example we are comparing weight gains resulting from the following six diets:
Diet 1 - High Protein, Beef
Diet 2 - High Protein, Cereal
Diet 3 - High Protein, Pork
Diet 4 - Low Protein, Beef
Diet 5 - Low Protein, Cereal
Diet 6 - Low Protein, Pork

44 (Weight-gain data for the six diets; shown only as an image in the original slides.)

45 Hence

46 Thus

47 The ANOVA Table

Source     d.f.   Sum of Squares   Mean Square   F-ratio
Between    5      4612.933         922.587       4.3   (p = )
Within     54
Total      59

Thus, since F = 4.3 > F0.05 = 2.386, we reject H0. Conclusion: there are significant differences amongst the k = 6 means.

48 Now we want to perform t tests to compare the k = 6 means
using the test statistic t = (x̄i − x̄j) / √(MSWithin (1/ni + 1/nj)), with critical value t0.025 ≈ 2.005 for 54 d.f. (the Within degrees of freedom).
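This critical value can be verified with software (assuming scipy):

    # Two-sided 5% critical value of the t distribution with 54 d.f.
    from scipy.stats import t
    print(round(t.ppf(0.975, 54), 3))   # about 2.005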

49 Table of sample means and t test results (critical value t0.025 ≈ 2.005 for 54 d.f.). t values that are significant are indicated in bold.

50 Conclusions: There is no significant difference between diet 1 (high protein, beef) and diet 3 (high protein, pork). There are no significant differences amongst diets 2, 4, 5 and 6 (i.e. high protein, cereal (diet 2) and the low protein diets (diets 4, 5 and 6)). There are significant differences between diets 1 and 3 (high protein, meat) and the other diets (2, 4, 5, and 6). Major conclusion: High protein diets result in a higher weight gain, but only if the source of protein is a meat source.

51 These are similar conclusions to those made using exploratory techniques (examining box-plots).

52 (Box-plots of weight gain by diet: High Protein - Beef, Cereal, Pork; Low Protein - Cereal, Pork, Beef.)

53 Conclusions: Weight gain is higher for the high protein meat diets.
Increasing the level of protein increases weight gain, but only if the source of protein is a meat source. Carrying out the F-test and Fisher's LSD ensures the significance of the conclusions; differences observed with exploratory methods alone could have occurred by chance.

54 Comparing k Populations
Proportions: the χ² test for independence

55 The two sample test for proportions
The data can be displayed in the following table:

           Population 1   Population 2   Total
Success    x1             x2             x1 + x2
Failure    n1 - x1        n2 - x2        n1 + n2 - (x1 + x2)
Total      n1             n2             n1 + n2

56 This problem can be extended in two ways:
Increasing the number of populations (columns) from 2 to k (or c). Increasing the number of categories (rows) from 2 to r.

          1     2     …     c     Total
 1        x11   x12   …     x1c   R1
 2        x21   x22   …     x2c   R2
 …        …     …     …     …     …
 r        xr1   xr2   …     xrc   Rr
 Total    C1    C2    …     Cc    N

57 The χ² test for independence

58 Situation We have two categorical variables R and C.
The number of categories of R is r. The number of categories of C is c. We observe n subjects from the population and count xij = the number of subjects for which R = i and C = j. R = rows, C = columns

59 Example: Both Systolic Blood Pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects. The categories for Blood Pressure are: < … ; the categories for Cholesterol are: < …

60 Table: two-way frequency

61 The c2 test for independence
Define Eij = (Ri Cj) / n = Expected frequency in the (i,j)th cell in the case of independence, where Ri is the total of row i, Cj is the total of column j, and n is the overall total.

62 Justification for Eij = (RiCj)/n in the case of independence
Let pij = P[R = i, C = j] = P[R = i] P[C = j] = ρi γj in the case of independence. Then the expected frequency in the (i,j)th cell is n pij = n ρi γj; estimating ρi by Ri/n and γj by Cj/n gives Eij = n (Ri/n)(Cj/n) = (RiCj)/n.

63 To test H0: R and C are independent against HA: R and C are not independent, use the test statistic χ² = Σi Σj (xij − Eij)² / Eij, where Eij = expected frequency in the (i,j)th cell in the case of independence and xij = observed frequency in the (i,j)th cell.
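A short sketch of the whole test in Python using scipy's chi2_contingency (the table below is an illustrative placeholder, not the blood pressure / cholesterol data, which is not reproduced here):

    # Chi-square test for independence on a small illustrative table
    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([[20, 30, 25],
                         [15, 45, 40]])

    chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
    print(expected)                # E_ij = R_i * C_j / n
    print(chi2_stat, dof, p_value)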

64 Sampling distribution of test statistic when H0 is true
χ² distribution with degrees of freedom ν = (r - 1)(c - 1). Critical and acceptance region: reject H0 if χ² ≥ χ²α; accept H0 if χ² < χ²α.
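The critical point χ²α can be read from a table or computed, for example (assuming scipy; r = c = 4 matches the 9 degrees of freedom used on the next slide):

    # Critical point of the chi-square distribution
    from scipy.stats import chi2

    alpha = 0.05
    nu = (4 - 1) * (4 - 1)
    print(round(chi2.ppf(1 - alpha, nu), 2))   # about 16.92 for 9 d.f.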

65

66 Standardized residuals
The standardized residual for cell (i,j) is rij = (xij − Eij) / √Eij. Test statistic χ² = Σi Σj (xij − Eij)²/Eij with degrees of freedom ν = (r - 1)(c - 1) = 9; reject H0 using α = 0.05.
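A sketch of how the standardized residuals are computed (same kind of placeholder table as above):

    # Standardized residuals (x_ij - E_ij) / sqrt(E_ij) for an illustrative table
    import numpy as np

    observed = np.array([[20, 30, 25],
                         [15, 45, 40]], dtype=float)

    row_totals = observed.sum(axis=1, keepdims=True)   # R_i
    col_totals = observed.sum(axis=0, keepdims=True)   # C_j
    n_total = observed.sum()
    expected = row_totals * col_totals / n_total       # E_ij = R_i * C_j / n
    residuals = (observed - expected) / np.sqrt(expected)
    print(residuals)   # large |values| point to cells departing from independence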

67 Another Example This data comes from a Globe and Mail study examining the attitudes of the baby boomers. Data was collected on various age groups

68 One question with responses
Are there differences in weekly consumption of alcohol related to age?

69 Table: Expected frequencies

70 Table: Residuals Conclusion: There is a significant relationship between age group and weekly alcohol use

71 Examining the Residuals allows one to identify the cells that indicate a departure from independence
Large positive residuals indicate cells where the observed frequencies were larger than expected under independence. Large negative residuals indicate cells where the observed frequencies were smaller than expected under independence.

72 Another question with responses
In an average week, how many times would you surf the internet? Are there differences in weekly internet use related to age?

73 Table: Expected frequencies

74 Table: Residuals Conclusion: There is a significant relationship between age group and weekly internet use

75 Echo (Age 20 – 29)

76 Gen X (Age 30 – 39)

77 Younger Boomers (Age 40 – 49)

78 Older Boomers (Age 50 – 59)

79 Pre Boomers (Age 60+)

80 Regression and Correlation
Estimation by confidence intervals; hypothesis testing

