Presentation on theme: "Analysis of frequency counts with Chi square"— Presentation transcript:
1 Analysis of frequency counts with Chi square (χ²)
Dr David Field
The autumn term lectures defined a level of measurement called categorical, but they did not cover the kinds of statistics that are possible when a variable is measured at the categorical level.
2 Summary
Categorical data.
Frequency counts.
One variable chi-square: testing the null hypothesis that frequencies in the sample are equally divided among the categories; varying the null hypothesis.
Two variable chi-square: testing the null hypothesis that status on one categorical variable is independent of status on another categorical variable.
Limitations and assumptions of chi-square.
Andy Field chapter 18 covers chi-square. There is also a guide online, where chi-square is topic 16 in the list.
3 Categorical data (see section 18.2)
Each participant is a member of a single category, and the categories cannot be meaningfully placed in order, e.g., nationality = French, German, Italian.
Sometimes chi-square is used with ordered categories, e.g. age bands.
To perform statistical tests with categorical data, each participant must be a member of only one category: category membership must be mutually exclusive. You can’t be a smoker and a non-smoker.
This allows frequency counts in each category to be calculated.
4 Chi square
If you can express the data as frequency counts in several categories, then chi square can be used to test for differences between the categories.
You will also see chi square written as the Greek letter chi accompanied by the mathematical symbol indicating that a number should be squared: χ².
5 Chi square with a single categorical variable
Suppose we are interested in which drink is most popular.
We ask a sample of 100 people if they prefer to drink coffee, tea, or water. Each respondent is only allowed to select one answer. This is important: if each person can have membership of more than one category, you can’t use chi square.
By default, the null hypothesis for chi-square is that each of the categories is equally frequent in the underlying population. It is possible to modify this (see later).
6 One variable chi-square example
Let’s say that the preferences expressed by the sample of 100 people result in the following observed frequency counts:

Tea      39
Coffee   30
Water    31
SUM      100

The null hypothesis assumes that each category is equally frequent, and thus provides a model that the data can be used to test.
Based on the null hypothesis, the expected frequency counts would be 100 / 3 = 33.3 per category.
The chi-square test works out the probability that observed frequencies at least this far from the expected ones could be obtained by random sampling from a population where the null hypothesis is true.
7 One variable chi-square example

         Observed   Expected   Difference   Difference squared   Divide by expected
Tea      39         33.3
Coffee   30         33.3
Water    31         33.3
SUM      100

Here is a table of the expected and observed frequencies. Each row is one of the categories from the previous slide, i.e. tea, coffee and water.
8 One variable chi-square example

         Observed   Expected   Difference   Difference squared   Divide by expected
Tea      39         33.3       5.7
Coffee   30         33.3       -3.3
Water    31         33.3       -2.3
SUM      100

Begin by quantifying the difference between the observed frequencies and the expected frequencies (null hypothesis).
9 One variable chi-square example

         Observed   Expected   Difference   Difference squared   Divide by expected
Tea      39         33.3       5.7          32.49
Coffee   30         33.3       -3.3         10.89
Water    31         33.3       -2.3         5.29
SUM      100

The second step is to square the difference scores. Squaring has two important effects. The first effect is to remove any minus signs, so that we deal only with the magnitude of the difference, not its direction. This tells us that chi-square is insensitive to the direction of differences. The second effect is to increase (exaggerate) the contribution to the eventual statistic that is made by the larger differences, and to reduce the emphasis on small differences between observed and expected. This is intuitively useful because small differences are more likely to occur due to sampling error than large differences. On the slide, the red ellipse compared to the other ellipse illustrates this latter point.
10 One variable chi-square example

         Observed   Expected   Difference   Difference squared   Divide by expected
Tea      39         33.3       5.7          32.49                0.98
Coffee   30         33.3       -3.3         10.89                0.33
Water    31         33.3       -2.3         5.29                 0.16
SUM      100

The next step is to divide each squared difference score by the expected score. This is a bit like dividing by an estimate of the sample variance in the t-test formula. Consider a difference score of 5. If the expected score was 180, then an observed score of 175 or 185, either of which would produce a difference of 5, intuitively does not seem very different from the null hypothesis (expected) value of 180. On the other hand, if the expected score was 15, the observed scores that could produce a difference of 5 are 10 or 20, and the same difference now seems big.
By dividing the squared difference by the expected value, we express the squared difference as a proportion of the expected score. This puts all squared difference scores onto a common scale, regardless of what the initial sample sizes and expected scores were.
11 One variable chi-square example

         Observed   Expected   Difference   Difference squared   Divide by expected
Tea      39         33.3       5.7          32.49                0.98
Coffee   30         33.3       -3.3         10.89                0.33
Water    31         33.3       -2.3         5.29                 0.16
SUM      100                                                     1.47

Finally, one sums the figures in the last column to arrive at the value of chi-square. Chi square is the sum, over all categories, of the squared differences between the observed and expected frequencies, each divided by the expected frequency: χ² = Σ (O − E)² / E. It is therefore a measure of how far the observed frequencies depart from the expected ones. A bigger value of chi square indicates a greater difference from the null hypothesis.
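The column-by-column calculation in the table can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the lecture; the function name `chi_square_statistic` is made up for this example. Because the code uses the unrounded expected value (33.33...) rather than 33.3, it gives roughly 1.46 instead of the 1.47 obtained by summing the rounded figures in the table.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [39, 30, 31]                  # tea, coffee, water
n = sum(observed)                        # 100 respondents
k = len(observed)                        # 3 categories
expected = [n / k] * k                   # equal split under the null hypothesis

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 2))                    # 1.46
```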
12 Converting Chi square to a p value
SPSS will do this for you.
Chi square has degrees of freedom equal to the number of categories minus 1. There are 2 in the example; this is because if you knew the frequencies of preference for tea and coffee and the sample size, the frequency of preference for water would not be free to vary.
“The chi square value of 1.47, df = 2 had an associated p value of 0.48, so the null hypothesis that preferences for drinking tea, coffee and water in the population are equal cannot be rejected.”
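The lecture leaves the conversion to SPSS, but for the special case of 2 degrees of freedom the upper-tail probability of the chi-square distribution has a simple closed form, exp(−χ²/2), so the reported p value can be checked by hand. The sketch below assumes that closed form and is valid only for df = 2; the function name `p_value_df2` is hypothetical.

```python
import math


def p_value_df2(chi2):
    """Upper-tail probability of a chi-square distribution with exactly 2 df."""
    return math.exp(-chi2 / 2)


n_categories = 3                  # tea, coffee, water
df = n_categories - 1             # 2 degrees of freedom
chi2 = 1.47                       # value reported on the slide

p = p_value_df2(chi2)
print(df, round(p, 2))            # 2 0.48
```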
13 One variable chi square with unequal expected frequencies
By default, the expected frequencies are just the sample size divided equally among the number of categories. But sometimes this is inappropriate.
For example, we know that the percentage of the UK population that smokes is less than 50%. Let’s assume for purposes of illustration that 25% of the UK population are smokers.
We might hypothesise that the smoking rate is higher in Glasgow than the UK average rate. The null hypothesis is that it is the same.
14 One variable chi square with unequal expected frequencies
We ask 200 adults in Glasgow if they smoke: 80 say yes and 120 say no.
We know that the UK average rate is 25%, and 80 is rather more than 25% of 200.
Chi square can be used to assess the probability of frequencies like these being obtained by random sampling if the real smoking rate in Glasgow was actually 25%.
15 One variable chi-square example with unequal expected frequencies

              Observed   Expected   Difference   Difference squared   Divide by expected
Non-smokers   120        150        -30          900                  6
Smokers       80         50         30           900                  18
SUM           200                                                     24

The difference between this table and the previous example is that the expected frequencies have been entered based on a null hypothesis of a 25/75 split, rather than an equal split between categories.
Notice that although the squared differences are the same in both cases, the contribution to the overall chi-square statistic is unequal because of the division by the two different expected values.
16 One variable chi square with unequal expected frequencies “80 of the sample of 200 people from Glasgow classified themselves as smokers. This resulted in a chi square value of 24.0, df = 1 with an associated p value of < 0.001, so the null hypothesis that smoking rates in Glasgow are equal to the UK average of 25% can be rejected.”
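As a check on the Glasgow example, here is a minimal Python sketch that builds the expected counts from the 25/75 null split and sums the per-category contributions. For 1 degree of freedom the upper-tail probability can be written as erfc(√(χ²/2)); that identity is an assumption brought in for this example and holds only for df = 1. Variable names are illustrative.

```python
import math

observed = {"non-smokers": 120, "smokers": 80}
null_proportions = {"non-smokers": 0.75, "smokers": 0.25}   # UK rate assumed in the lecture

n = sum(observed.values())                                   # 200 adults
expected = {k: n * share for k, share in null_proportions.items()}

# Contribution of each category: (observed - expected)^2 / expected
contributions = {
    k: (observed[k] - expected[k]) ** 2 / expected[k] for k in observed
}
chi2 = sum(contributions.values())

df = len(observed) - 1                                       # 1 degree of freedom
p = math.erfc(math.sqrt(chi2 / 2))                           # closed form, df = 1 only

print(contributions)   # {'non-smokers': 6.0, 'smokers': 18.0}
print(chi2, p < 0.001) # 24.0 True
```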
17 Chi square with two variables (see section 18.3)
Usually, it is more interesting to use chi square to ask about the relationship between 2 categorical variables.
For example, what is the relationship between gender and smoking? Gender can be male or female; smoking can be smoker or non-smoker.
If you have smoking data from just men, you can only use chi-square to ask if the proportion of smokers and non-smokers is different.
If you have smoking data from men and women, you can use chi-square to ask if the proportion of men who smoke differs from the proportion of women who smoke.
18 What 2*2 chi square does not do
It is important to realise that in the 2*2 chi square, having a big imbalance between the number of men and the number of women will not increase the value of the chi-square statistic.
Also, having a big imbalance between the number of smokers and non-smokers will not increase the value of the chi-square statistic.
This contrasts with the one variable chi-square, where an imbalance in the numbers of men vs. women, or smokers vs. non-smokers, does increase the value of chi-square.
The value of chi-square for two variables is high if smoking frequency is contingent on gender, and low if smoking frequency is independent of gender.
19 The key to understanding χ²
The key to understanding 2*2 chi square is how the expected frequencies are calculated.
The expected frequencies provide the null hypothesis, or null model, that the chi square statistic tests.
If there are 200 participants, the simplest null model would be to expect 50 female smokers, 50 male smokers, 50 female non-smokers, and 50 male non-smokers. But we already know that it is implausible to expect an equal split of smokers and non-smokers. The expected frequencies will have to allow for the imbalance of smokers vs. non-smokers and a possible imbalance of men vs. women in the sample.
A sample with 20 male smokers, 10 female smokers, 80 male non-smokers and 40 female non-smokers has an imbalance of gender and smoking status, but smoking status does not depend on gender and there is no deviation from the null model.
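The last claim on this slide can be verified directly. The sketch below, which is illustrative only, first compares the smoking rate within each gender and then computes each expected count as (row total × column total) / grand total, the rule the following slides develop. Both rates come out at 0.2, every expected count equals its observed count, and chi-square is exactly zero.

```python
# Counts from the slide, keyed by (smoking status, gender).
observed = {
    ("smoker", "male"): 20, ("smoker", "female"): 10,
    ("non-smoker", "male"): 80, ("non-smoker", "female"): 40,
}
n = sum(observed.values())                       # 150 participants


def smoking_rate(gender):
    """Proportion of smokers within one gender."""
    smokers = observed[("smoker", gender)]
    return smokers / (smokers + observed[("non-smoker", gender)])


def expected(status, gender):
    """Expected count under independence: row total * column total / n."""
    row_total = sum(v for (s, _), v in observed.items() if s == status)
    col_total = sum(v for (_, g), v in observed.items() if g == gender)
    return row_total * col_total / n


chi2 = sum(
    (count - expected(s, g)) ** 2 / expected(s, g)
    for (s, g), count in observed.items()
)

print(smoking_rate("male"), smoking_rate("female"))  # 0.2 0.2
print(chi2)                                          # 0.0
```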
20 The contingency table of observed frequencies

                Men   Women   Row totals
Smoke           13    31      44
Don’t smoke     29    86      115
Column totals   42    117     159

The first step in calculating the value of chi square for a two variable design is to set up a contingency table. In the example, this tabulates the number of female smokers, male smokers, male non-smokers, and female non-smokers in the sample.
Each entry in the contingency table is called a “cell”. On the slide, the green box highlights a single cell, and there are four cells in this table. In more complicated chi-square designs you can have more than four cells, for example in a 2 * 4 chi square, which has 8 cells.
The next step is to produce the row and column totals, highlighted in blue on the slide. These tell us that there are 42 men and 117 women in our sample, and 44 smokers and 115 non-smokers. Notice that there are very uneven numbers of men and women in the sample, but in a 2*2 chi square this inequality won’t contribute to the value of chi-square. The value of the chi-square statistic will be determined by the relative proportions of men and women that smoke.
21 Calculating the expected frequencies
The key step in the calculation of chi-square is to estimate the frequency counts that would occur in each cell if the null hypothesis, that the row frequencies and column frequencies do not depend upon each other, were true.
To calculate the expected frequency of the male smokers cell, we first need to calculate the proportion of participants that are male, without considering if they smoke or not.
This proportion is 42 males out of 159 (the total number of participants): 42 / 159 = 0.26.
22 Calculating the expected frequencies
If the null hypothesis is true, and the proportion of female smokers and male smokers is equal, then the proportion of the smokers in the sample that are male should be equal to the overall proportion of the sample that is male.
The expected number of male smokers is therefore the total number of smokers in the sample (44) multiplied by the proportion of the sample that is male (42 / 159, or about 0.26):
44 * (42 / 159) = 11.62
23 Calculating the expected frequencies

                       Men     Women   Row totals
Smoke                  13      31      44
Expected smokers       11.62
Don’t smoke            29      86      115
Expected non-smokers
Column totals          42      117     159

The expected number of male smokers is (42/159) * 44 = 11.62.
24 Calculating the expected frequencies

                       Men     Women   Row totals
Smoke                  13      31      44
Expected smokers       11.62   32.37
Don’t smoke            29      86      115
Expected non-smokers
Column totals          42      117     159

The expected number of female smokers is (117/159) * 44 = 32.37, where 117/159 = 0.74 is the proportion of the sample that is female.
25 Calculating the expected frequencies

                       Men     Women   Row totals
Smoke                  13      31      44
Expected smokers       11.62   32.37
Don’t smoke            29      86      115
Expected non-smokers   30.37
Column totals          42      117     159

The expected number of male non-smokers is (42/159) * 115 = 30.37.
26 Calculating the expected frequencies

                       Men     Women   Row totals
Smoke                  13      31      44
Expected smokers       11.62   32.37
Don’t smoke            29      86      115
Expected non-smokers   30.37   84.62
Column totals          42      117     159

The expected number of female non-smokers is (117/159) * 115 = 84.62.
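The four expected frequencies all follow one rule: (column total / grand total) × row total. A short Python sketch of that rule is given here for illustration; the function name `expected_counts` is hypothetical. The results agree with the slides up to rounding in the second decimal place (the unrounded values are 32.377... and 30.377..., which the slides show as 32.37 and 30.37).

```python
def expected_counts(table):
    """Expected cell counts under independence for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[(c / n) * r for c in col_totals] for r in row_totals]


#            Men  Women
observed = [[13, 31],    # smoke
            [29, 86]]    # don't smoke

for label, row in zip(["Expected smokers", "Expected non-smokers"],
                      expected_counts(observed)):
    print(label, [round(cell, 2) for cell in row])
```

Each row of expected counts adds up to the corresponding observed row total (44 and 115), which is a quick sanity check on the calculation.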
27 Calculating the value of chi square
Each cell in the contingency table makes a contribution to the total chi-square.
For each cell you calculate (Observed – Expected) and square it. You then divide by the Expected.
Do this for each cell individually and add up the results.
28 Calculating chi square

                       Men     Women   Row totals
Smoke                  13      31      44
Expected smokers       11.62   32.37
Don’t smoke            29      86      115
Expected non-smokers   30.37   84.62
Column totals          42      117     159

(13 – 11.62)² = 1.90
1.90 / 11.62 = 0.16

The first thing to note about this contingency table is that the observed and expected values are all very close to each other, and so intuitively, the value of chi-square is going to be low.
The worked example shows the contribution to the overall chi-square from the male smokers cell. To calculate the overall chi square for this table you need to repeat the process for all four cells and add up the results.
29 Converting chi-square to a p value
The degrees of freedom for a two way chi square depend upon the number of categories in the contingency table: (num columns - 1) * (num rows - 1).
SPSS will calculate the DF and p value for you.
“The chi square value of 0.31, df = 1 had an associated p value of 0.58, so the null hypothesis that the proportion of men and women that smoke is equal cannot be rejected.”
Also see section 18.5.7.
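Putting the steps together for the gender and smoking table, the following sketch computes the statistic, the degrees of freedom, and the p value. It is an illustration rather than a reproduction of what SPSS does internally: it applies no continuity correction, and it relies on the closed form erfc(√(χ²/2)) for the upper tail, which is valid only when df = 1. The function name `chi_square_two_way` is made up for this example.

```python
import math


def chi_square_two_way(table):
    """Return (chi-square, degrees of freedom) for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for r, row in zip(row_totals, table):
        for c, observed in zip(col_totals, row):
            expected = r * c / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(col_totals) - 1) * (len(row_totals) - 1)
    return chi2, df


#         Men  Women
table = [[13, 31],    # smoke
         [29, 86]]    # don't smoke

chi2, df = chi_square_two_way(table)
assert df == 1, "the erfc shortcut below only applies to 1 degree of freedom"
p = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), df, round(p, 2))   # 0.31 1 0.58
```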
30 Larger contingency tables
You can perform chi-square on larger contingency tables.
For example, we might be interested in whether the proportion of smokers vs. non-smokers differs according to age, where age is a 3 level categorical variable: 20-29 years old, 30-39 years old, 40-49 years old.
This results in a 2 * 3 contingency table.
However, there is some uncertainty as to what a significant chi-square means in this case.
31 Partitioning chi-square
A statistically significant 2 * 3 chi-square might have occurred for one of these 3 reasons:
The proportion of 20-29 year olds that smoke differs from the proportion of 30-39 year olds that smoke.
The proportion of 30-39 year olds that smoke differs from the proportion of 40-49 year olds that smoke.
The proportion of 20-29 year olds that smoke differs from the proportion of 40-49 year olds that smoke.
Or all 3 of the above might be true, or 2 of the above might be true.
As a researcher, you will want to distinguish between these possibilities.
32 Partitioning chi-square
The solution is to break the 2 * 3 contingency table into smaller 2 * 2 contingency tables to test each of the comparisons in the list:
The proportion of 20-29 year olds that smoke differs from the proportion of 30-39 year olds that smoke.
The proportion of 30-39 year olds that smoke differs from the proportion of 40-49 year olds that smoke.
The proportion of 20-29 year olds that smoke differs from the proportion of 40-49 year olds that smoke.
Run 3 separate 2 * 2 chi-square tests.
33 Partitioning chi-square
However, running 3 tests results in 3 chances of a type 1 error occurring.
To maintain the probability of a type 1 error at the conventional level of 5%, you divide the alpha level by the number of chi-square tests you run. Effectively, you share the 5% risk of rejecting the null hypothesis due to sampling error equally among the tests you perform.
For a single chi-square, it is significant if SPSS reports that p is less than 0.05.
For two chi-square tests, they are significant at the 0.05 level individually if SPSS reports that p is less than 0.025.
For three chi-square tests, they are significant at the 0.05 level individually if SPSS reports that p is less than 0.0167 (0.05 / 3).
You will encounter this procedure in year 2 ANOVA as “correction for multiple comparisons”. It is common to all situations where you have to perform multiple null hypothesis tests.
It might not be necessary to run all the 2*2 chi-square tests that are possible based on a larger contingency table. You could argue that you only need to run 2 or 3 based on inspection of the differences between expected and observed cell values. This will make it more likely that you will get a significant result. If you have to divide your alpha level amongst 20 different chi-square tests, the threshold for each test becomes so strict that a result will almost never be significant.
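The sharing-out of the 5% risk described here is simple arithmetic, sketched below in Python. The age bands are those from the lecture; the function name `is_significant` and the example p value of 0.02 are hypothetical, chosen only to show how the same result can pass with two tests and fail with three.

```python
from itertools import combinations

age_bands = ["20-29", "30-39", "40-49"]

# Every pair of age bands gives one 2 * 2 follow-up comparison.
comparisons = list(combinations(age_bands, 2))

overall_alpha = 0.05
per_test_alpha = overall_alpha / len(comparisons)   # threshold for each test


def is_significant(p_value, n_tests, alpha=0.05):
    """Compare a p value against alpha shared equally among n_tests."""
    return p_value < alpha / n_tests


print(comparisons)
print(round(per_test_alpha, 4))          # 0.0167
print(is_significant(0.02, n_tests=2))   # True:  0.02 < 0.025
print(is_significant(0.02, n_tests=3))   # False: 0.02 > 0.0167
```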
34 Warnings about chi-square
The expected frequency count in any cell must not be less than 5. If this occurs, then chi-square is not reliable.
If the contingency table is 2 * 2 or 2 * 3, you can use the Fisher exact probability test instead. SPSS will report this.
For bigger contingency tables, the only solution is to “collapse” across categories, but only where this is meaningful. If you began with age categories 0-4, 5-10, 11-15 and 16-20, you could collapse to 0-10 and 11-20, which would increase the expected frequencies in each cell.
Finally, remember that the total of the frequencies is equal to the number of participants you have: each person must only be a member of one cell in the table.
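To illustrate the expected-frequency rule and the effect of collapsing categories, here is a small Python sketch. The counts are invented purely for illustration (they are not data from the lecture), and the helper names `expected_counts`, `smallest_expected` and `collapse_columns` are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def smallest_expected(table):
    """Smallest expected count in the table; should be at least 5."""
    return min(min(row) for row in expected_counts(table))


def collapse_columns(table, groups):
    """Merge columns: each group lists the column indices to add together."""
    return [[sum(row[i] for i in group) for group in groups] for row in table]


# Hypothetical counts for four age bands (columns) and two outcomes (rows).
table = [[2, 3, 9, 10],
         [6, 9, 11, 10]]

print(smallest_expected(table) < 5)        # True: too small for chi-square

# Collapse the first two and the last two age bands into wider bands.
collapsed = collapse_columns(table, [(0, 1), (2, 3)])
print(collapsed)                           # [[5, 19], [15, 21]]
print(smallest_expected(collapsed) >= 5)   # True: the rule is now satisfied
```

Note that collapsing leaves the total number of participants unchanged, so each person is still counted in exactly one cell.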