Presentation is loading. Please wait. # Analysis of frequency counts with Chi square

## Presentation on theme: "Analysis of frequency counts with Chi square"— Presentation transcript:

Analysis of frequency counts with Chi square
2 The autumn term lectures defined a level of measurement called categorical, but they did not cover the kind of statistics that are possible when the level of measurement of a variable is categorical. Dr David Field

Summary Categorical data Frequency counts One variable chi-square
testing the null hypothesis that frequencies in the sample are equally divided among the catgegories varying the null hypothesis Two variable chi-square testing the null hypothesis that status on one categorical variable is independent from status on another categorical variable Limitations and assumptions of chi-square Andy Field chapter 18 covers chi-square There is also a guide online at Chi-square is topic 16 in the list

Categorical data 18.2 Each participant is a member of a single category, and the categories cannot be meaningfully placed in order e.g., nationality = French, German, Italian Sometimes chi-square is used with ordered categories, e.g. age bands To perform statistical tests with categorical data each participant must be a member of only one category Category membership must be mutually exclusive You can’t be a smoker and a non-smoker This allows frequency counts in each category to be calculated

Chi square If you can express the data as frequency counts in several categories, then chi square can be used to test for differences between the categories You will also see chi square written as a Greek letter accompanied by the mathematical symbol indicating that a number should be squared 2

Chi square with a single categorical variable
Suppose we are interested in which drink is most popular We ask a sample of 100 people if they prefer to drink coffee, tea, or water each respondent is only allowed to select one answer this is important: if each person can have membership of more than one category you can’t use Chi square By default, the null hypothesis for chi-square is that each of the categories is equally frequent in the underlying population it is possible to modify this (see later)

One variable chi-square example
Let’s say that the preferences expressed by the sample of 100 people result in the following observed frequency counts: tea 39 coffee 30 Water 31 SUM 100 The null hypothesis assumes that each category is equally frequent, and thus provides a model that the data can be used to test Based on the null hypothesis, the expected frequency counts would 100 / 3 = 33.3 per category The Chi square statistic works out the probability that the observed frequencies could be obtained by random sampling from a population where the null hyp is true

One variable chi-square example
Observed Expected Difference Difference squared Divide by expected 39 33.3 30 31 100 Here is a table of the expected and observed frequencies. Each cell is one of the categories from the previous slide, i.e. tea, coffee and water

One variable chi-square example
Observed Expected Difference Difference squared Divide by expected 39 33.3 5.7 30 -3.3 31 -2.3 100 Begin by quantifying out the difference between the expected frequencies (null hypothesis) and the observed frequencies.

One variable chi-square example
Observed Expected Difference Difference squared Divide by expected 39 33.3 5.7 32.49 30 -3.3 10.89 31 -2.3 5.29 100 The second step is to square the difference scores. Squaring has two important effects. The first effect is to remove and minus signs, so that we deal only with the magnitude of the difference, not its direction. This tells us that chi-square is insensitive the direction of differences. The second thing that squaring does is to increase (exaggerate) the contribution to the eventual statistic that is made by the larger differences, and reduce the emphasis on small differences between observed and expected. This is intuitively useful because small differences are more likely to occur due to sampling error than large differences. The red ellipse compared to the other ellipse illustrates this latter point.

One variable chi-square example
Observed Expected Difference Difference squared Divide by expected 39 33.3 5.7 32.49 0.98 30 -3.3 10.89 0.33 31 -2.3 5.29 0.16 100 The next step is to divide each squared dif score by the expected score. This is a bit like dividing by an estimate of the sample variance in the t-test formula. Consider a difference score of 5. If the expected score was 180, then intuitively an observed of 175, or 185, which would produce a difference of 5 does not seem to be very different from the null hypothesis (expected) value of 180. On the other hand, if the difference score is 5, and the expected score was 15, then 10 or 20 (which would be the observed scores that could produce such a difference score) then the difference of 5 now seems like a big difference. By dividing the squared dif by the expected value we express the squared dif as a proportion of the expected score. This puts all squared dif scores into a common scale, regardless of what the initial sample sizes and expected scores were

One variable chi-square example
Observed Expected Difference Difference squared Divide by expected 39 33.3 5.7 32.49 0.98 30 -3.3 10.89 0.33 31 -2.3 5.29 0.16 100 SUM 1.47 Finally, one sums the figures to arrive at the value of chi-square. Chi square is the sum of all the differences between the expected and obtained frequencies. Therefore, it is a measure of how similar the observed and expected frequencies are. A bigger value of chi square indicates a greater difference from the null hyp.

Converting Chi square to a p value
SPSS will do this for you Chi square has degrees of freedom equal to the number of categories minus 1 2 in the example this is because if you knew the frequencies of preference for tea and coffee and the sample size, the frequency of preference for water would not be free to vary “The chi square value of 1.47, df = 2 had an associated p value of 0.48, so the null hypothesis that preferences for drinking tea, coffee and water in the population are equal cannot be rejected.”

One variable chi square with unequal expected frequencies
By default, the expected frequencies are just the sample size divided equally among the number of categories. But, sometimes this is inappropriate For example, we know that the % of the population of the UK that smokes is less than 50% Let’s assume for purposes of illustration that 25% of the UK population are smokers We might hypothesise that the smoking rate is higher in Glasgow than the UK average rate The null hypothesis is that it is the same

One variable chi square with unequal expected frequencies
We ask 200 adults in Glasgow if they smoke. 80 say yes 120 say no We know that the UK average rate is 25%, and 80 is rather more than 25% of 200 Chi square can be used to assess the probability of the above frequencies being obtained by random sampling if the real smoking rate in Glasgow was actually 25%

One variable chi-square example with unequal expected frequencies
Observed Expected Difference Difference squared Divide by expected 120 150 -30 900 6 80 50 30 18 200 SUM 24 The difference between this table and the previous example is that the expected frequencies have been entered based on a null hypothesis of a 25/75 split, rather than an equal split between categories Notice that although the squared differences are the same in both cases, the contribution to the overall chi-square statistic is unequal because of the division by the two different expected values

One variable chi square with unequal expected frequencies
“80 of the sample of 200 people from Glasgow classified themselves as smokers. This resulted in a chi square value of 24.0, df = 1 with an associated p value of < 0.001, so the null hypothesis that smoking rates in Glasgow are equal to the UK average of 25% can be rejected.”

Chi square with two variables
18.3 Chi square with two variables Usually, it is more interesting to use Chi square to ask about the relationship between 2 categorical variables. For example, what is the relationship between gender and smoking? gender can be male or female smoking can be smoker or non-smoker If you have smoking data from just men, you can only use chi-square to ask if the proportion of smokers and non-smokers is different If you have smoking data from men and women you can use chi-square to ask if the proportion of men who smoke differs from the proportion of women who smoke

What 2*2 chi square does not do
It is important to realise that in the 2*2 chi square, having a big imbalance between the number of men and the number of women will not increase the value of the chi-square statistic Also, having a big imbalance between the number of smokers and non-smokers will not increase the value of the chi-square statistic This contrasts with the one variable chi-square, where an imbalance in the numbers of men vs women, or smokers vs. non-smokers does increase the value of chi-square. The value of chi-square for two variables is high if smoking frequency is contingent on gender, and low if smoking frequency is independent of gender

The key to understanding 2
The key to understanding 2*2 chi square is how the expected frequencies are calculated The expected frequencies provide the null hypothesis, or null model, that the chi square statistic tests If there are 200 participants, the simplest null model would be to expect 50 female smokers, 50 male smokers, 50 female non smokers, and 50 male non smokers but we already know that it is implausible to expect an equal split of smokers and non-smokers the expected frequencies will have to allow for the imbalance of smokers vs non smokers and a possible imbalance of men vs women in the sample A sample with 20 male smokers, 10 female smokers, 80 male non-smokers and 40 female non-smokers has an imbalance of gender and smoking status, but smoking status does not depend on gender and there is no deviation from the null model

The contingency table of observed frequencies
Men Women Row totals Smoke 13 31 44 Don’t smoke 29 86 115 Column totals 42 117 159 The first step in calculating the value of Chi square for a two variable design is to set up a contingency table. In the example, this tabulates the number of female smokers, male smokers, male non-smokers, and female non-smokers in the sample. Each entry in the contingency table is called a “cell” . The green box highlights a single cell, and there are four cells in this table. In more complicated chi-square designs you can have more than four cells, for example in a 2 * 4 Chi square, which has 8 cells. The first step in calculating chi square is to produce the row and column totals, highlighted in blue. This tells us that there are 42 men and 117 women in our sample. There are 44 smokers and 115 non smokers. Notice that there are very uneven numbers of men and women in the sample, but in a 2*2 chi square this inequality won’t contribute to the value of chi-square. The value of chi-square statistic will be determined by the relative proportions of men and women that smoke.

Calculating the expected frequencies
The key step in the calculation of chi-square is to estimate the frequency counts that would occur in each cell if the null hypothesis that the row frequencies and column frequencies do not depend upon each other were true To calculate the expected frequency of the male smokers cell, we first need to calculate the proportion of participants that are male, without considering if they smoke or not This proportion is 42 males out of 159 (the total number of participants) 42 / 159 = 0.26

Calculating the expected frequencies
If the null hyp is true, and the proportion of female smokers and male smokers is equal, then the proportion of the smokers in the sample that are male should be equal to the overall proportion of the sample that is male Total number of smokers in sample (44) * proportion of sample that is male (0.26) 44 * = 11.62

Calculating the expected frequencies
Men Women Row totals Smoke 13 31 44 Expected smokers 11.62 Don’t smoke 29 86 115 Expected non smoke Column totals 42 117 159 The expected number of male smokers is (42/159) * 44 = 11.62

Calculating the expected frequencies
Men Women Row totals Smoke 13 31 44 Expected smokers 11.62 32.37 Don’t smoke 29 86 115 Expected non smoke Column totals 42 117 159 The expected number of female smokers is (117/159) * 44 = 32.37 0.74

Calculating the expected frequencies
Men Women Row totals Smoke 13 31 44 Expected smokers 11.62 32.37 Don’t smoke 29 86 115 Expected non smoke 30.37 Column totals 42 117 159 The expected number of male non-smokers is (42/159) * 115 = 30.37

Calculating the expected frequencies
Men Women Row totals Smoke 13 31 44 Expected smokers 11.62 32.37 Don’t smoke 29 86 115 Expected non smoke 30.37 84.62 Column totals 42 117 159 The expected number of female non-smokers is (117/159) * 115 = 84.62

Calculating the value of chi square
Each cell in the contingency table makes a contribution to the total chi-square For each cell you calculate (Observed – Expected) and square it You then divide by the Expected Do this for each cell individually and add up the results

Calculating chi square
Men Women Row totals Smoke 13 31 44 Expected smokers 11.62 32.37 Don’t smoke 29 86 115 Expected non smoke 30.37 84.62 Column totals 42 117 159 ( )2 = 1.90 1.90 / = 0.16 The first thing to note about this contingency table is that the observed and expected values are all very close to each other, and so intuitively, the value of chi-square is going to be low. The worked example shows the contribution to the overall chi-square from the male smokers cell. To calculated the overall chi square for this table you need to repeat the process for all four cells and add up the results.

Converting chi-square to a p value
The degrees of freedom for a two way Chi square depends upon the number of categories in the contingency table (num columns -1) * (num rows -1) SPSS will calculate the DF and p value for you “The chi square value of 0.31, df = 1 had an associated p value of 0.58, so the null hypothesis that the proportion of men and women that smoke is equal cannot be rejected.” Also see 18.5.7

Larger contingency tables
You can perform chi-square on larger contingency tables For example, we might be interested in whether the proportion of smokers vs. non smokers differs according to age, where age is a 3 level categorical variable 20-29 years old 30-39 years old 40-49 years old This results in a 2 * 3 contingency table However, there is some uncertainty as to what a significant chi-square means in this case

Partitioning chi-square
A statistically significant 2 * 3 chi-square might have occurred for one of these 3 reasons The proportion of year olds who smoke differs from the proportion of year olds that smoke The proportion of year olds that smoke differs from the proportion of year olds that smoke The proportion of year olds that smoke differs from the proportion of year olds that smoke Or all 3 of the above might be true Or 2 of the above might be true As a researcher, you will want to distinguish between these possibilities

Partitioning chi-square
The solution is to break the 2 * 3 contingency table into smaller 2 * 2 contingency tables to test each of the comparisons in the list The proportion of year olds who smoke differs from the proportion of year olds that smoke The proportion of year olds that smoke differs from the proportion of year olds that smoke The proportion of year olds that smoke differs from the proportion of year olds that smoke Run 3 separate 2 * 2 chi-square tests

Partitioning chi-square
However, running 3 tests results in 3 chances of a type 1 error occurring To maintain the probability of a type 1 error at the conventional level of 5% you divide the alpha level by the number of chi-square tests you run Effectively, you share the 5% risk of rejecting the null hypothesis due to sampling error equally among the tests you perform For a single chi-square, it is significant if SPSS reports that p is less than 0.05 For two chi-square tests, they are significant at the 0.05 level individually if SPSS reports that p is less than 0.025 For three chi-square tests, they are significant at the 0.05 level individually if SPSS reports that p is less than You will encounter this procedure in year 2 ANOVA as “correction for multiple comparisons”. It is common to all situations where you have to perform multiple null hypothesis tests. It might not be necessary to run all the 2*2 chi-square tests that are possible based on a larger contingency table. You could argue that you only need to run 2 or 3 based on inspection of the differences between expected and observed cell values. This will make it more likely that you will get a significant result. If you have to divide your p value amongst 20 different chi-square tests it will never be significant….

Warnings about chi-square
The expected frequency count in any cell must not be less than 5 If this occurs then chi-square is not reliable If the contingency table is 2 * 2 or 2 * 3 you can use the Fisher exact probability test instead SPSS will report this For bigger contingency tables the only solution is to “collapse” across categories, but only where this is meaningful If you began with age categories 0-4, 5-10, 11-15, you could collapse to 0-10 and 11-20, which would increase the expected frequencies in each cell Finally, remember that the total of frequencies is equal to the number of participants you have each person must only be a member of one cell in the table

Download ppt "Analysis of frequency counts with Chi square"

Similar presentations

Ads by Google