Using Statistics To Make Inferences 8


1 Using Statistics To Make Inferences 8
Summary: Contingency tables. Goodness of fit test. Mike Cox, Newcastle University, me fecit 10/04/2015

2 Goals
To assess contingency tables for independence. To perform and interpret a goodness of fit test.
Practical: Construct and analyse contingency tables.

3 Recall
To compare a population and sample variance we employed the χ2 statistic.

4 Today
The probability approach from last week is employed to tell if "observed" data conforms to the pattern "expected" under a given model.

5 Categorical Data - Example
Assessed intelligence of athletic and non-athletic schoolboys.

            bright   stupid   Total
athletic       581      567    1148
lazy           209      351     560
Total          790      918    1708

K. Pearson, "On The Relationship Of Intelligence To Size And Shape Of Head, And To Other Physical And Mental Characters", Biometrika, 1906, 5, data on page 144.

6 Procedure
Formulate a null hypothesis. Typically the null hypothesis is that there is no association between the factors.
Calculate expected frequencies for the cells in the table on the assumption that the null hypothesis is true.
Calculate the chi-squared statistic,

χ2 = Σi Σj (Oij − Eij)2 / Eij

for an r x c table with observed count Oij and expected count Eij in row i and column j.

7 Procedure
Compare the calculated statistic with tabulated values of the chi-squared distribution with ν degrees of freedom, where

ν = (rows − 1)(columns − 1) = (r − 1)(c − 1)
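As a sketch of the procedure above, here is a minimal Python version using the schoolboy data from slide 5 (illustrative only, not the SPSS route the slides take later):

```python
# Chi-squared test of independence for the 2x2 table of slide 5.
observed = [[581, 567],   # athletic: bright, stupid
            [209, 351]]   # lazy:     bright, stupid

row_tot = [sum(r) for r in observed]        # 1148, 560
col_tot = [sum(c) for c in zip(*observed)]  # 790, 918
n = sum(row_tot)                            # 1708

# Expected counts under independence: row total x column total / grand total
expected = [[r * c / n for c in col_tot] for r in row_tot]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

df = (2 - 1) * (2 - 1)      # nu = (r - 1)(c - 1) = 1
print(round(chi2, 1))       # about 26.7, cf. slide 21
```

Compare the result against the tabulated χ2 value for ν = 1 (3.841 at the 5% level), exactly as the procedure describes.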

8 Key Assumptions
1. Independence of the observations. The data found in each cell of the contingency table used in the chi-squared test must be independent, uncorrelated observations.
2. Large enough expected cell counts. As described by Yates et al., "No more than 20% of the expected counts are less than 5 and all individual expected counts are 1 or greater" (Yates, Moore & McCabe, 1999, The Practice of Statistics, New York: W.H. Freeman, p. 734).

9 Key Assumptions
3. Randomness of data. The data in the table should be randomly selected.
4. Sufficient sample size. It is also generally assumed that the sample size for the entire contingency table is sufficiently large, to avoid falsely accepting the null hypothesis when it is in fact false.

10 Example
Assessed intelligence of athletic and non-athletic schoolboys. Observed:

            bright   stupid   Total
athletic       581      567    1148
lazy           209      351     560
Total          790      918    1708

11 Probabilities
The probability a random boy is athletic is 1148/1708. The probability a random boy is bright is 790/1708. Assuming independence, the probability a random boy is both athletic and bright is (1148/1708) × (790/1708). For 1708 respondents the expected number of athletic bright boys is 1708 × (1148/1708) × (790/1708) = 1148 × 790 / 1708 = 530.98.

            bright   stupid   Total
athletic       581      567    1148
lazy           209      351     560
Total          790      918    1708

12 Expected
The expected number of athletic bright boys is 530.98.

            bright   stupid   Total
athletic    530.98             1148
lazy                            560
Total          790      918    1708

13 Expected
The expected number of athletic stupid boys is?

            bright   stupid   Total
athletic    530.98        ?    1148
lazy                            560
Total          790      918    1708

14 Expected
The expected number of athletic stupid boys is 1148 − 530.98 = 617.02.

            bright   stupid   Total
athletic    530.98   617.02    1148
lazy                            560
Total          790      918    1708

15 Expected
The expected number of lazy bright boys is?

            bright   stupid   Total
athletic    530.98   617.02    1148
lazy             ?              560
Total          790      918    1708

16 Expected
The expected number of lazy bright boys is 790 − 530.98 = 259.02. The expected number of stupid lazy boys is?

            bright   stupid   Total
athletic    530.98   617.02    1148
lazy        259.02        ?     560
Total          790      918    1708

17 Expected
The expected number of stupid lazy boys is 918 − 617.02 = 300.98.

            bright   stupid   Total
athletic    530.98   617.02    1148
lazy        259.02   300.98     560
Total          790      918    1708

18 Expected

            bright   stupid   Total
athletic    530.98   617.02    1148
lazy        259.02   300.98     560
Total          790      918    1708

19 χ2
Observed vs. Expected. Only one cell is free: ν = (2 − 1)(2 − 1) = 1.

χ2 = Σ (O − E)2 / E = 26.73

20 χ2
As a general rule, to employ this statistic all expected frequencies should exceed 5. If this is not the case, categories are pooled (merged) to achieve this goal. See the Prussian data later.

21 Conclusion

ν     p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
1     2.706    3.841    5.024    6.635    7.879    9.550

The result is significant (26.73 > 3.84) at the 5% level. So we reject the hypothesis of independence between athletic prowess and intelligence.

22 SPSS
Raw data. Note: v1 are the row labels, v2 are the column labels, v3 is the frequency for each cell.

23 SPSS
Data > Weight Cases. Since frequency data has been input, it is necessary to weight. This is essential; do not use percentages.

24 SPSS
Analyze > Descriptive Statistics > Crosstabs. Set row and column variables. Frequencies already set.

25 SPSS
Select chi-square.

26 SPSS
Select Observed – input data; Expected – output data, under the model.

27 SPSS
Expected cell frequencies: expected under the model.

28 SPSS
Pearson Chi-Square is the required statistic. Do not report p = .000; rather report p < .001. Note Fisher's exact test, only available in SPSS for 2x2 tables (see next slide).
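The p-value that SPSS truncates to .000 can be recovered by hand when ν = 1, because a chi-squared variable with one degree of freedom is a squared standard normal. A small sketch using the 26.73 statistic from slide 21:

```python
from math import erfc, sqrt

chi2 = 26.73    # Pearson statistic for the schoolboy table (nu = 1)

# For 1 degree of freedom, chi-squared is a squared standard normal, so
# P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
p = erfc(sqrt(chi2 / 2))

# Report p < .001 rather than p = .000
print(f"p = {p:.2e}")
```

This confirms the reporting rule: the exact p is tiny but never literally zero.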

29 What If We Have Small Cell Counts?
Fisher's exact test is used when you want to conduct a chi-squared test but one or more of your cells has an expected frequency of five or less. Remember that the chi-squared test assumes that each cell has an expected frequency of five or more, but Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is. In SPSS, unless you have the SPSS Exact Test Module, you can only perform a Fisher's exact test on a 2x2 table, and these results are presented by default.
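Fisher's exact test for a 2x2 table can also be computed directly from the hypergeometric distribution. A minimal stdlib sketch (the function name and the example table are illustrative; with SciPy available one would normally call scipy.stats.fisher_exact instead):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Sums hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def p_table(x):
        # probability of the table whose top-left cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Hypothetical sparse table: [[1, 9], [11, 3]]
print(round(fisher_exact_2x2(1, 9, 11, 3), 5))   # about 0.00276
```

Because every table probability is enumerated exactly, no minimum expected cell count is needed.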

30 Aside
Two dials were compared. A subject was asked to read each dial many times, and the experimenter recorded his errors. Altogether 7 subjects were tested. The data show how many errors each subject produced. Do the two conditions differ at the 0.05 significance level (give the appropriate p value)? What key word describes this data?

31 Aside
What tests are available for paired data? One sample t test. Sign test. Wilcoxon Signed Ranks Test.

32 Aside
What tests are available for paired data? What assumptions are made?
One sample t test – normality.
Sign test – no assumption of normality.
Wilcoxon Signed Ranks Test – resembles the Sign Test in scope, but it is much more sensitive. In fact, for large numbers it is almost as sensitive as the Student t-test.

33 Aside
What tests are available for paired data? The Sign test answers the question "How often?", whereas the other tests answer the question "How much?"
One sample t test – mean.
Wilcoxon Signed Ranks Test – median.

34 Example
The table is based on case-records of women employees in Royal Ordnance factories, the same test being carried out on the left eye (columns) and right eye (rows). Stuart, "The estimation and comparison of strengths of association in contingency tables", Biometrika, 1953, 40.

35 Observed

           Highest   Second    Third    Lowest   Total
Highest       1520      266      124       66     1976
Second         234     1512      432       78     2256
Third          117      362     1772      205     2456
Lowest          36       82      179      492      789
Total         1907     2222     2507      841     7477

Is there any obvious structure?

36 Expected
In general, to find the expected frequency in a particular cell the equation is

Row total × Column total / Grand total
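The rule above translates directly into a few lines of code. A sketch (the helper name expected_counts is ours), checked against the eye-grade data of slide 35:

```python
def expected_counts(observed):
    """Expected counts under independence:
    E[i][j] = row total i x column total j / grand total."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]

# Eye-grade table from slide 35 (rows: right eye, cols: left eye)
eyes = [[1520,  266,  124,  66],
        [ 234, 1512,  432,  78],
        [ 117,  362, 1772, 205],
        [  36,   82,  179, 492]]

exp_eyes = expected_counts(eyes)
print(round(exp_eyes[0][0], 2))   # 1976 x 1907 / 7477, cf. 503.98 on slide 37
```

Note that the expected table always reproduces the observed margins, which is why the "missing values by subtraction" trick on the later slides works.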

37 Expected
In general, to find the expected frequency in a particular cell the equation is

Row total × Column total / Grand total

So for the top left cell (highest right eye, highest left eye) the equation becomes 1976 × 1907 / 7477 = 503.98.

38 Expected

           Highest   Second    Third    Lowest   Total
Highest     503.98        ?                       1976
Second                                            2256
Third                                             2456
Lowest                                             789
Total         1907     2222     2507      841     7477

Row total × Column total / Grand total: 1976 × 1907 / 7477 = 503.98

39 Expected

           Highest   Second    Third    Lowest   Total
Highest     503.98   587.22   662.54        ?     1976
Second      575.39   670.43   756.43              2256
Third       626.40   729.87   823.48              2456
Lowest                                             789
Total         1907     2222     2507      841     7477

Row total × Column total / Grand total

40 Expected
The missing values are simply found by subtraction.

           Highest   Second    Third    Lowest   Total
Highest     503.98   587.22   662.54        ?     1976
Second      575.39   670.43   756.43              2256
Third       626.40   729.87   823.48              2456
Lowest                                             789
Total         1907     2222     2507      841     7477

41 Expected

           Highest   Second    Third    Lowest   Total
Highest     503.98   587.22   662.54        ?     1976
Second      575.39   670.43   756.43              2256
Third       626.40   729.87   823.48              2456
Lowest                                             789
Total         1907     2222     2507      841     7477

1976 − 503.98 − 587.22 − 662.54 = ?

42 Expected

           Highest   Second    Third    Lowest   Total
Highest     503.98   587.22   662.54   222.26     1976
Second      575.39   670.43   756.43              2256
Third       626.40   729.87   823.48              2456
Lowest                                             789
Total         1907     2222     2507      841     7477

1976 − 503.98 − 587.22 − 662.54 = 222.26

43 Expected
Similarly for the remaining cells.

           Highest   Second    Third    Lowest   Total
Highest     503.98   587.22   662.54   222.26     1976
Second      575.39   670.43   756.43        ?     2256
Third       626.40   729.87   823.48              2456
Lowest                                             789
Total         1907     2222     2507      841     7477

44 Expected

           Highest   Second    Third    Lowest   Total
Highest     503.98   587.22   662.54   222.26     1976
Second      575.39   670.43   756.43   253.75     2256
Third       626.40   729.87   823.48   276.25     2456
Lowest      201.23   234.47   264.55    88.75      789
Total         1907     2222     2507      841     7477

45 Short Cut
Contributions to the χ2 statistic: for the top left cell the contribution is (O − E)2 / E = (1520 − 503.98)2 / 503.98 = 2048.32.

46 Conclusion
Nine cells are free: ν = (4 − 1)(4 − 1) = 9.

ν     p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
9     14.684   16.919   19.023   21.666   23.589   26.056

The above statistic makes it very clear that there is some relationship between the quality of the right and left eyes. For the top left cell only, the contribution 2048.32 already exceeds the tabulated values.

47 Total χ2

           Highest    Second     Third     Lowest
Highest    2048.32    175.72     437.75    109.86
Second      202.55   1056.38     139.14    121.73
Third       414.25    185.41    1092.53     18.38
Lowest      135.67     99.15      27.66   1832.37

Total χ2 = 8097
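The total on this slide can be verified by recomputing every contribution from unrounded expected counts. A short sketch:

```python
# Observed eye-grade table (slide 35); chi-squared total, cf. slide 47
obs = [[1520,  266,  124,  66],
       [ 234, 1512,  432,  78],
       [ 117,  362, 1772, 205],
       [  36,   82,  179, 492]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
n = sum(row_tot)

# Sum (O - E)^2 / E over all 16 cells, with E = row x col / n
chi2 = sum((obs[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(4) for j in range(4))

print(round(chi2))   # roughly 8097, with nu = (4 - 1)(4 - 1) = 9
```

Because the expected counts here are not rounded to two decimals, the total may differ from the slide's figure in the last digit.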

48 Conclusion
Nine cells are free: ν = 9.

ν     p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
9     14.684   16.919   19.023   21.666   23.589   26.056

The above statistic makes it very clear that there is some relationship between the quality of the right and left eyes. For all cells, the total χ2 = 8097 far exceeds the tabulated values.

49 SPSS
Raw data.

50 SPSS
Expected cell frequencies.

51 SPSS
Pearson Chi-Square is the required statistic.

52 Poisson Distribution
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume. Typical applications are to queues/arrivals:
The number of phone calls received per day.
The occurrence of accidents/industrial injuries.
More exotically, birth defects and the number of genetic mutations.
The occurrence of rare diseases.

53 Poisson Distribution
Discrete events which are independent. Events occur at a fixed rate λ (lambda) per unit continuum.

54 Poisson Distribution
P(X = x) = e^(−λ) λ^x / x!  for x successes.
e is approximately equal to 2.718. λ is the rate per unit continuum; the mean is λ; the variance is λ.
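The Poisson probability formula in code, as a quick sketch (λ = 0.7 anticipates the horse-kick example that follows):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = e**(-lam) * lam**x / x!  Mean and variance both equal lam."""
    return exp(-lam) * lam ** x / factorial(x)

print(round(poisson_pmf(0, 0.7), 4))   # 0.4966, cf. slide 65
```

The probabilities over all x sum to 1, which justifies finding the final "5 or more" class by subtraction later.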

55 Casio 83ES
exp(1) = 2.718…, exp(2) = 7.389… Its inverse, on the same key, is ln, so ln(2.718…) = 1 and ln(7.389…) = 2. exp or "e".

56 Alternate applications
A similar approach may be employed to test if simple models are plausible.

57 χ2 Goodness of Fit Test
The degrees of freedom are ν = m – n – 1, where there are m frequencies left in the problem, after pooling, and n parameters have been fitted from the raw data. For example…

58 Example
The number of Prussian army corps in which soldiers died from the kicks of a horse in a year. Typical "industrial injury" data.

59 Which distribution is appropriate?
Is the data discrete or continuous? Discrete, since a simple count.

60 Check list of distributions

Discrete    Continuous
Binomial    Normal
Poisson     Exponential

61 Check list of distribution parameters

Discrete          Continuous
Binomial: n, p    Normal: μ, σ2
Poisson: λ        Exponential: λ

Discrete, no "n" implies Poisson.

62 Observed Data

Number deaths in a corps   Observed frequency (Oi)
0                          144
1                           91
2                           32
3                           11
4                            2
5 or more                    0
Total                      280

We need to estimate the Poisson parameter λ, which is the mean of the distribution.

63 Observed Data

Number deaths in a corps   Observed frequency (Oi)
0                          144
1                           91
2                           32
3                           11
4                            2
5 or more                    0
Total                      280

64 Mean
λ = (0 × 144 + 1 × 91 + 2 × 32 + 3 × 11 + 4 × 2) / 280 = 196 / 280 = 0.7

65 Expected
λ = 0.7 and "e" is a constant on your calculator.

Number deaths in a corps   Poisson model expected probability
0                          0.4966
1                          0.3476
2                          0.1217
3                          0.0284
4                          0.0050
5 or more                  by subtraction: ?
Total                      1

66 Expected

Number deaths in a corps   Poisson model expected probability
0                          0.4966
1                          0.3476
2                          0.1217
3                          0.0284
4                          0.0050
5 or more                  by subtraction: 0.0008
Total                      1

67 Expected
Expected frequency for no deaths: 280 × 0.4966 = 139.04.

Number deaths in a corps   Expected probability   Expected frequency (Ei)
0                          0.4966                 139.04
1                          0.3476
2                          0.1217
3                          0.0284
4                          0.0050
5 or more                  0.0008
Total

68 Expected
Expected frequency for the remaining rows: 280 × probability = frequency.

Number deaths in a corps   Expected probability   Expected frequency (Ei)
0                          0.4966                 139.04
1                          0.3476                  97.33
2                          0.1217                  34.07
3                          0.0284                   7.95
4                          0.0050                   1.39
5 or more                  0.0008                   0.22
Total                                              280

Note the two expected frequencies less than 5!

69 χ2 Calculation

Number deaths in a corps   Observed (Oi)   Expected (Ei)   (Oi − Ei)2/Ei
0                          144             139.04          0.18
1                           91              97.33          0.41
2                           32              34.07          0.13
3 or more                   13               9.56          1.24
Total                      280                             1.95

Pool to ensure all expected frequencies exceed 5.

70 Conclusion
Here m (frequencies) = 4, n (fitted parameters) = 1, then ν = m − n − 1 = 4 − 1 − 1 = 2.

ν     p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
2     4.605    5.991    7.378    9.210    10.597   12.429

The hypothesis that the data comes from a Poisson distribution would be accepted (5.991 > 1.95).
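The whole goodness-of-fit calculation of slides 62-70 in one short script (a sketch; the "3 or more" pooling follows slide 69, and the counts 2 and 0 for the "4" and "5 or more" classes are those implied by the totals):

```python
from math import exp, factorial

# Observed horse-kick deaths per corps-year (slide 62)
observed = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}    # 5+ observed zero times
n = sum(observed.values())                         # 280
lam = sum(k * f for k, f in observed.items()) / n  # mean = 196/280 = 0.7

def pois(k):
    return exp(-lam) * lam ** k / factorial(k)

# Pool '3 or more' so that every expected frequency exceeds 5 (slide 69)
obs_pooled = [144, 91, 32, 11 + 2]
exp_pooled = [n * pois(0), n * pois(1), n * pois(2),
              n * (1 - pois(0) - pois(1) - pois(2))]

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
df = 4 - 1 - 1     # nu = m - n - 1: 4 pooled classes, 1 fitted parameter

print(round(chi2, 2))   # about 1.95: accept the Poisson model (< 5.991)
```

Since the unrounded tail probability is used here, the last expected frequency differs very slightly from the hand calculation on slide 68.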

71 Bring your calculators next week

72 Read
Read Howitt and Cramer pages 134-152. Read Howitt and Cramer (e-text). Read Russo (e-text). Read Davis and Smith.

73 Practical 8
This material is available from the module web page.

74 Instructions for the practical
The material for the practical is available: Instructions for the practical (Practical 8); Material for the practical (Practical 8).

75 Assignment 2
You will find submission details on the module web site. Note the dialers lower down the page give access to your individual assignment. It is necessary to enter your student number exactly as it appears on your smart card.

76 Assignment 2
As a general rule make sure you can perform the calculations manually. It does no harm to check your calculations using a software package. Some packages employ non-standard definitions and should be used with caution.

77 Assignment 2
All submissions must be typed.

78 Whoops!
Researchers at Cardiff University School of Social Science claim errors made by the Hawk-Eye line-calling technology can be greater than 3.6mm, the average error quoted by the manufacturers. Teletext, p388, 12 June 2008.

79 Whoops!
Kate Middleton 'marries Prince Harry' on souvenir mug. The Telegraph, Thursday 17 March 2011.

80 Whoops!
Poldark, BBC, 8 March 2015.

