SADC Course in Statistics Goodness-of-fit tests (and further issues) (Session 16)
Learning Objectives
By the end of this session, you will be able to:
- conduct and interpret the results of a chi-square test for the goodness of fit of data to a particular distribution
- understand how two-way contingency tables can be examined further by looking at their residuals
- present results from a standard chi-square test, paying attention to the table's summary features
Goodness-of-fit tests
In previous sessions, we have seen that many tests are based on the assumption of normality. On some occasions, it is also important to ascertain whether the data follow other distributions, e.g. the binomial or Poisson distributions. We shall now look at how the chi-square test can be applied to examine the extent to which assumptions concerning the distribution of a given variable hold.
Goodness-of-fit tests
The basic idea is first to calculate the probability of each possible value occurring. For example:
- the number of cows getting a disease on a farm that has 6 cows may be assumed to follow a binomial distribution
- the number of visits made by a pregnant woman in a region to the region's single antenatal clinic may be assumed to follow a Poisson distribution
Can we check these assumptions before subjecting the data to tests based on them?
Goodness-of-fit test: Normal distribution
Because the Normal distribution applies to a continuous random variable, it is necessary to group the data and obtain the observed frequency in each group. The next step is to determine the probability of an observation falling in each group, and hence the expected frequency for that group. The chi-square test can then be applied in the usual way, with d.f. = number of groups − 1 − number of parameters estimated in computing the expected values.
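The grouping step can be sketched in Python as follows. This is a minimal illustration, not the course's own software: the function name `expected_normal_counts` is hypothetical, and the mean and standard deviation used are assumed values chosen for the sketch rather than figures given on the slides. The group boundaries are those of the rainfall example that follows.

```python
from statistics import NormalDist

def expected_normal_counts(edges, mean, sd, n):
    """Expected frequency in each group under Normal(mean, sd).

    `edges` are the interior group boundaries in increasing order; the first
    group is everything up to edges[0], the last is everything above edges[-1].
    """
    dist = NormalDist(mu=mean, sigma=sd)
    # Cumulative probabilities at each boundary, padded with 0 and 1.
    cum = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    # Probability of each group times the sample size.
    return [n * (hi - lo) for lo, hi in zip(cum, cum[1:])]

# ASSUMED values for illustration only (not stated on the slides).
edges = [100, 125, 150, 175, 200, 225, 250]
expected = expected_normal_counts(edges, mean=157.5, sd=49.5, n=56)

groups = len(expected)
params_estimated = 2  # the mean and the standard deviation
df = groups - 1 - params_estimated

print([round(e, 2) for e in expected])
print("d.f. =", df)
```

In practice the mean and standard deviation would be estimated from the raw data, which is why two degrees of freedom are subtracted.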
An example: Normal distribution
Consider the total rainfall in June at a particular site from 1928 to 1983. Suppose we wish to test the assumption that these data follow a normal distribution.
[Figure: histogram of the June rainfall totals]
An example: Normal distribution
Expected values are now calculated for each group, assuming a normal distribution. The table shows observed and expected frequencies.

Rain total    Observed   Expected
<= 100            4        6.86
100 to 125       11        7.45
125 to 150       12       10.31
150 to 175        9       11.12
175 to 200        9        9.33
200 to 225        6        6.10
225 to 250        3        3.11
> 250             2        1.72
Totals           56

The chi-square value is 3.6 with d.f. = 5 (8 groups − 1 − 2 estimated parameters). P-value = 0.6083. Conclusions?
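The figures quoted on this slide can be checked with a short Python sketch using only the standard library. The helper `chi2_sf` is a hypothetical name introduced here; it uses the standard closed-form series for the chi-square upper tail with integer degrees of freedom, and is offered as an illustration rather than as the method used to produce the slide.

```python
import math

observed = [4, 11, 12, 9, 9, 6, 3, 2]
expected = [6.86, 7.45, 10.31, 11.12, 9.33, 6.10, 3.11, 1.72]

# Pearson chi-square statistic: sum of (O - E)^2 / E over the groups.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 2  # two parameters (mean, sd) estimated

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite sum of half-integer powers of x.
    total, term = 0.0, math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total)

print(round(chi_sq, 1), df)          # statistic and degrees of freedom
print(round(chi2_sf(3.6, df), 4))    # p-value for the rounded statistic 3.6
```

A p-value this large gives no evidence against the assumption of normality for these data.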
An example: Binomial distribution
First recall (from Module H1) the form of the probability function for the binomial random variable with parameters n and p, where p is the probability of a success in each of a sequence of n trials, each trial having just 2 possible outcomes. The number of successes (X) in n trials has a binomial distribution:
$P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n$
This formula gives the binomial probabilities, which can also be obtained from Excel's function Binomdist(x,n,p,false).
An example: Binomial distribution
Suppose we have a binomial variable with observed values as shown (n = 7, p = 0.222). Expected values can be derived using [P(X=k)]*404.

k        Observed   Expected
0           81        69.7
1          130       139.2
2          129       119.2
3           37        56.7
4           14        16.2
5,6,7       23         3.0
Totals     404

The chi-square value is 141.3 with d.f. = 4, since p has been estimated from the data. P-value = 0.000. What are your conclusions?
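As a rough check on the Expected column, the binomial probabilities can be computed directly and multiplied by 404. The sketch below uses Python's standard library; the helper name `binom_pmf` is introduced here for illustration and plays the same role as Excel's Binomdist with the cumulative flag set to false.

```python
from math import comb

n, p, total = 7, 0.222, 404

def binom_pmf(k, n, p):
    """P(X = k) for a binomial variable with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

probs = [binom_pmf(k, n, p) for k in range(n + 1)]

# Pool k = 5, 6, 7 into a single class, as in the table.
grouped = probs[:5] + [sum(probs[5:])]
expected = [total * q for q in grouped]

# Classes - 1 - one estimated parameter (p).
df = len(expected) - 1 - 1

print([round(e, 1) for e in expected])
print("d.f. =", df)
```

The pooled class has an expected frequency of only about 3, which is worth bearing in mind when interpreting the size of the chi-square statistic.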
Other issues
There are two more issues to discuss concerning chi-square tests of association between two categorical variables. These relate to:
- further examination of the table of frequencies when a significant result is found
- how to present the results
Example from Session 15
For the data below, we found a significant chi-square value, with p = 0.0024, i.e. evidence that the proportion of diseased animals is not the same for all vaccines.

Vaccine   Diseased   Healthy   Total
A             43       237      280
B             52       198      250
C             25       245      270
D             48       212      260
E             57       233      290
Total        225      1125     1350

Question: But what contributes most to the chi-square statistic, i.e. which cells depart most from Pr(diseased) = 0.167?
Cell contributions to chi-square

Vaccine   Diseased   Healthy
A           0.288     0.057
B           2.563     0.512
C           8.889     1.778
D           0.502     0.100
E           1.554     0.311

The table gives the contribution of each cell to the chi-square statistic, i.e. the values $(O - E)^2 / E$. Rule of thumb: focus on cells with values > 4 and, in larger tables, on those with values > 9.
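The cell contributions can be reproduced from the Session 15 frequencies with a few lines of Python. This is a sketch under the usual independence assumption (expected = row total × column total / grand total); the p-value line relies on a closed form that holds only for 4 degrees of freedom.

```python
import math

# (diseased, healthy) counts for each vaccine, from the Session 15 table.
counts = {"A": (43, 237), "B": (52, 198), "C": (25, 245),
          "D": (48, 212), "E": (57, 233)}

col_totals = [sum(row[j] for row in counts.values()) for j in range(2)]
grand_total = sum(col_totals)

contrib = {}
for vaccine, row in counts.items():
    row_total = sum(row)
    expected = [row_total * c / grand_total for c in col_totals]
    contrib[vaccine] = [(o - e) ** 2 / e for o, e in zip(row, expected)]

chi_sq = sum(sum(cells) for cells in contrib.values())
df = (len(counts) - 1) * (2 - 1)

# Upper-tail probability; this closed form is valid only when df == 4.
p_value = math.exp(-chi_sq / 2) * (1 + chi_sq / 2)

for vaccine, cells in contrib.items():
    print(vaccine, [round(c, 3) for c in cells])
print(round(chi_sq, 2), df, round(p_value, 4))
```

The contributions sum to the overall chi-square statistic, so the largest cells show directly where the significant result comes from.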
Standardised residuals

Vaccine   Diseased   Healthy
A          -0.54      0.24
B           1.60     -0.72
C          -2.98      1.33
D           0.71     -0.33
E           1.25     -0.56

Better still, use standardised residuals so that signs are also included, i.e. use $SR = (O - E)/\sqrt{E}$. Rule of thumb: focus on cells with |SR| > 2 or, in larger tables, on those with |SR| > 3. Conclusion: Vaccine C gives the greatest discrepancy from H0.
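A possible way to compute the standardised residuals and apply the rule of thumb is sketched below in Python, again taking expected values as row total × column total / grand total. The variable names are illustrative only.

```python
import math

# (diseased, healthy) counts for each vaccine, from the Session 15 table.
observed = {"A": (43, 237), "B": (52, 198), "C": (25, 245),
            "D": (48, 212), "E": (57, 233)}
labels = ("diseased", "healthy")

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

std_resid = {}
for vaccine, row in observed.items():
    row_total = sum(row)
    expected = [row_total * c / grand_total for c in col_totals]
    # Signed residual scaled by the square root of the expected count.
    std_resid[vaccine] = [(o - e) / math.sqrt(e) for o, e in zip(row, expected)]

# Rule of thumb: flag cells whose standardised residual exceeds 2 in size.
flagged = [
    (vaccine, labels[j], round(sr, 2))
    for vaccine, row in std_resid.items()
    for j, sr in enumerate(row)
    if abs(sr) > 2
]
print(flagged)
```

The negative sign for the flagged cell shows that Vaccine C has fewer diseased animals than expected under H0, information that the squared contributions alone do not give.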
Presentation of results

Vaccine   % diseased
C            9.3%
A           15.4%
D           18.5%
E           19.7%
B           20.8%

In this example, it would be appropriate to present a table of the percentage of animals diseased under each vaccine. Sorting the table so that the most useful vaccine comes first makes the results easier to see. Note that there are more advanced methods, e.g. modelling, for making specific comparisons between the above percentages.
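Producing such a sorted summary is straightforward; a minimal Python sketch, using the diseased counts and totals from the Session 15 table, might look like this.

```python
# Diseased counts and total animals per vaccine, from the Session 15 table.
diseased = {"A": 43, "B": 52, "C": 25, "D": 48, "E": 57}
totals = {"A": 280, "B": 250, "C": 270, "D": 260, "E": 290}

pct_diseased = {v: 100 * diseased[v] / totals[v] for v in diseased}

# Sort from lowest to highest percentage diseased (most useful vaccine first).
ordered = sorted(pct_diseased.items(), key=lambda item: item[1])

for vaccine, pct in ordered:
    print(f"{vaccine}  {pct:.1f}%")
```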
Presentation: Example from Session 14
Recall the results below from before. The test of association gave p = 0.000.

                      Usually sleep under a mosquito net?
Suffered malaria?     Yes              No               Total
Yes                   649 (62.5%)      3849 (55.8%)     4498 (56.6%)
No                    390 (37.5%)      3055 (44.2%)     3445 (43.4%)
Total                 1039 (100.0%)    6904 (100.0%)    7943 (100.0%)
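The column percentages and the test result for this table can be checked with the Python sketch below. It computes the ordinary Pearson chi-square statistic without a continuity correction, which is an assumption here since the slides do not say which version was used; the p-value uses the exact relation between the chi-square distribution with 1 d.f. and the normal distribution.

```python
import math

# Rows: suffered malaria (yes, no). Columns: sleeps under a net (yes, no).
table = [[649, 3849],
         [390, 3055]]

row_totals = [sum(row) for row in table]
col_totals = [sum(table[i][j] for i in range(2)) for j in range(2)]
grand_total = sum(row_totals)

chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (table[i][j] - expected) ** 2 / expected

# For 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

# Malaria incidence (%) among net users and non-users.
incidence = [100 * table[0][j] / col_totals[j] for j in range(2)]

print(round(chi_sq, 2), p_value < 0.001)
print([round(x, 1) for x in incidence])
```

A p-value reported as 0.000 simply means it is below 0.0005 when rounded to three decimal places.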
Presentation and conclusions
Test results indicate that there is an association between use of a mosquito net and incidence of malaria. However, the direction of the association is unexpected: malaria incidence is 62.5% among those using a net and 55.8% among those not using one. This emphasises the danger of ignoring other factors that may affect malaria incidence, e.g. altitude, housing conditions, etc. Further, could it be that those who had malaria then started using mosquito nets?
Some final remarks
Performing a chi-square analysis is simple, but it does not take account of other factors that may affect the results. More advanced procedures (e.g. log-linear modelling) do exist for exploring factors affecting a categorical response, here use of a bednet. Recall that the chi-square test is an approximation. This approximation is poor if the expected frequencies are very small (e.g. < 5). Try collapsing some rows or columns if this happens.
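One simple way to collapse groups is sketched below in Python, using the rainfall frequencies from the Normal example, where the two highest groups have expected frequencies below 5. The function `collapse_upper_tail` is a hypothetical helper that merges only from the upper end; other collapsing schemes are equally reasonable.

```python
def collapse_upper_tail(observed, expected, minimum=5.0):
    """Merge the last group into its neighbour until its expected count >= minimum."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        last_exp = exp.pop()
        last_obs = obs.pop()
        exp[-1] += last_exp
        obs[-1] += last_obs
    return obs, exp

observed = [4, 11, 12, 9, 9, 6, 3, 2]
expected = [6.86, 7.45, 10.31, 11.12, 9.33, 6.10, 3.11, 1.72]

obs_c, exp_c = collapse_upper_tail(observed, expected)

# The d.f. must be recomputed from the collapsed number of groups
# (two parameters were estimated for the Normal fit).
df = len(obs_c) - 1 - 2

print(obs_c)
print([round(e, 2) for e in exp_c], "d.f. =", df)
```

Collapsing reduces the number of groups, so the degrees of freedom fall accordingly and the test statistic has to be recalculated from the merged frequencies.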
Some practical work follows…