Analysis of Categorical Data

Types of Tests
o Data in 2 X 2 Tables (covered previously)
  - Comparing two population proportions using independent samples (Fisher’s Exact Test)
  - Comparing two population proportions using dependent samples (McNemar’s Test)
  - Relative Risk (RR), Odds Ratios (OR), Risk Difference, Attributable Risk (AR), & NNT/NNH
o Data in r X c Tables
  - Tests of Independence/Association and Homogeneity

Cervical Cancer and Age at First Pregnancy: 2 X 2 Data Table
These data come from a case-control study examining the potential relationship between age at first pregnancy and cervical cancer. In this study we compare the proportion of women who had their first pregnancy at or before the age of 25 between cases and controls, because researchers suspected that an early age at first pregnancy leads to an increased risk of developing cervical cancer.

2 X 2 Example: Case-Control Study
Cervical Cancer and Age at 1st Pregnancy

                          Age at 1st Pregnancy
Disease Status            Age ≤ 25    Age > 25    Row Totals
Cervical Cancer (Case)        42           7           49
Healthy (Control)            203         114          317
Column Totals                245         121          366

Previously
o We compared the proportions of women with the risk factor in the two groups (p1 vs. p2) using the z-test, a CI for (p1 - p2), and Fisher’s Exact Test.
o We computed the Odds Ratio (OR) and found a CI for the population OR.

Development of a Test Statistic to Measure Lack of Independence
One way to generalize the question of interest to the researchers is to think of it as follows:
Q: Is there an association between cervical cancer status and whether or not a woman had her 1st pregnancy at or before the age of 25?

Development of a Test Statistic to Measure Lack of Independence
If there is not an association, we say that the variables are independent. In the probability notes we saw that two events A and B are said to be independent if P(A|B) = P(A).

Development of a Test Statistic to Measure Lack of Independence
In the context of our study this would mean
P(Age ≤ 25 | Cancer Status) = P(Age ≤ 25),
i.e. knowing something about disease status tells you nothing about the presence of the risk factor of having a first pregnancy at or before age 25.

Development of a Test Statistic to Measure Lack of Independence
When we condition this proportion on disease status, we see that the relationship required for independence does not hold for these data.
P(Age ≤ 25 | Cervical Cancer) = 42/49 = .8571
P(Age ≤ 25 | Healthy Control) = 203/317 = .6404
P(Age ≤ 25) = 245/366 = .6694
In this study 66.94% of the women sampled had their first pregnancy at or before the age of 25. Under independence, both conditional proportions should be equal to .6694.
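The comparison of conditional and unconditional proportions can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and the counts are taken from the 2 X 2 table (row totals 49 and 317).

```python
# Observed counts from the case-control table (rows = disease status).
cases = {"early": 42, "late": 7}         # cervical cancer (cases)
controls = {"early": 203, "late": 114}   # healthy (controls)

n_cases = sum(cases.values())            # row total for cases
n_controls = sum(controls.values())      # row total for controls
n_total = n_cases + n_controls           # overall sample size

# Conditional proportions of early first pregnancy, given disease status.
p_early_given_case = cases["early"] / n_cases
p_early_given_control = controls["early"] / n_controls

# Unconditional proportion of early first pregnancy.
p_early = (cases["early"] + controls["early"]) / n_total

print(round(p_early_given_case, 4),
      round(p_early_given_control, 4),
      round(p_early, 4))
```

If the two variables were independent, the first two printed values would both be close to the third.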

Development of a Test Statistic to Measure Lack of Independence
o Of course the observed differences could be due to random variation, and in truth it may be the case that disease and risk factor status are independent.
o Therefore we need a means of assessing how different the observed results are from what we would expect to see if these two factors were independent.

2 X 2 Example: Case-Control Study
Cervical Cancer and Age at 1st Pregnancy

                          Age at 1st Pregnancy
Disease Status            Age ≤ 25     Age > 25     Row Totals
Cervical Cancer (Case)     a = 42       b = 7        R1 = 49
Healthy (Control)          c = 203      d = 114      R2 = 317
Column Totals              C1 = 245     C2 = 121     n = 366

Development of a Test Statistic to Measure Lack of Independence
From this table we can calculate the conditional probability of having the risk factor of early pregnancy given the disease status of the subject as follows:
$$P(\text{Age} \le 25 \mid \text{Cervical Cancer}) = \frac{a}{R_1}$$
The unconditional probability of risk presence for these data is given by:
$$P(\text{Age} \le 25) = \frac{C_1}{n}$$
and setting these equal we have
$$\frac{a}{R_1} = \frac{C_1}{n}$$

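Development of a Test Statistic to Measure Lack of Independence
Thus, under independence, we expect the frequency in the a cell to be equal to:
$$E_a = \frac{R_1 C_1}{n}$$
Similarly we find the following expected frequencies for the other cells making up the 2 X 2 table:
$$E_b = \frac{R_1 C_2}{n}, \qquad E_c = \frac{R_2 C_1}{n}, \qquad E_d = \frac{R_2 C_2}{n}$$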
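The "row total times column total divided by n" rule for expected frequencies can be sketched in Python as follows. The helper name `expected_counts` is hypothetical and chosen only for this illustration; the counts are those of the cervical cancer table.

```python
def expected_counts(observed):
    """Expected cell counts under independence: (row total * column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: cases, controls. Columns: first pregnancy at or before 25, after 25.
observed = [[42, 7], [203, 114]]
expected = expected_counts(observed)

for row in expected:
    print([round(e, 2) for e in row])
```

The same helper works unchanged for a general r X c table, since it only relies on the row and column totals.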

Development of a Test Statistic to Measure Lack of Independence
In general we denote the observed frequency in the i-th row and j-th column as $O_{ij}$, or just O for short. We denote the expected frequency for the i-th row and j-th column as $E_{ij}$, or just E for short.

Development of a Test Statistic to Measure Lack of Independence
o To measure how different our observed results are from what we expected to see if the two variables in question were independent, we intuitively should look at the difference between the observed (O) and expected (E) frequencies, i.e. O - E, or more specifically $O_{ij} - E_{ij}$.
o However, this will give too much weight to differences where these frequencies are both large in size.

Development of a Test Statistic to Measure Lack of Independence
o One test statistic that addresses the “size” of the frequencies issue is Pearson’s Chi-Square:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad df = (r - 1)(c - 1)$$
Notice this test statistic still uses (O - E) as the basic building block. This statistic will be large when the observed frequencies do NOT match the expected values for independence.
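A minimal Python sketch of Pearson's statistic, assuming the usual form (sum over all cells of (O - E)^2 / E with (r - 1)(c - 1) degrees of freedom). The function name `pearson_chi_square` is an assumption made for this example, not something taken from the slides.

```python
def pearson_chi_square(observed):
    """Return (statistic, df) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            statistic += (o - e) ** 2 / e           # scaled squared difference

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Cervical cancer case-control table.
stat, df = pearson_chi_square([[42, 7], [203, 114]])
print(round(stat, 2), df)
```

Dividing each squared difference by E is what keeps cells with large frequencies from dominating the sum.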

Chi-square Distribution
[Figure: graph of the chi-square distribution with 4 degrees of freedom, with the p-value shaded in the right tail.]
The area to the right of Pearson’s chi-square statistic gives the p-value. The p-value is always the area to the right!

2 X 2 Example: Case-Control Study
Cervical Cancer and Age at 1st Pregnancy

                          Age at 1st Pregnancy
Disease Status            Age ≤ 25       Age > 25       Row Totals
Cervical Cancer (Case)    O11 = 42       O12 = 7        R1 = 49
Healthy (Control)         O21 = 203      O22 = 114      R2 = 317
Column Totals             C1 = 245       C2 = 121       n = 366

Calculating Expected Frequencies
Cervical Cancer and Age at 1st Pregnancy

                          Age at 1st Pregnancy
Disease Status            Age ≤ 25         Age > 25         Row Totals
Cervical Cancer (Case)     42 (32.80)        7 (16.20)       R1 = 49
Healthy (Control)         203 (212.20)     114 (104.80)      R2 = 317
Column Totals             C1 = 245         C2 = 121          n = 366

Expected frequencies are shown in parentheses.

Calculating the Pearson Chi-square http://www.stat.tamu.edu/~west/applets/chisqdemo.html
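The transcript does not preserve the arithmetic from this slide, so the following worked calculation is a reconstruction from the observed counts and the rounded expected frequencies in the table (32.80, 16.20, 212.20, 104.80). Small differences in the last digit are possible depending on rounding.

```latex
\begin{aligned}
\chi^2 &= \frac{(42 - 32.80)^2}{32.80} + \frac{(7 - 16.20)^2}{16.20}
        + \frac{(203 - 212.20)^2}{212.20} + \frac{(114 - 104.80)^2}{104.80} \\
       &\approx 2.58 + 5.22 + 0.40 + 0.81 \\
       &\approx 9.01, \qquad df = (2 - 1)(2 - 1) = 1
\end{aligned}
```

Every cell has the same numerator here, $(\pm 9.20)^2$, which is always the case in a 2 X 2 table because the row and column totals are fixed.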

Chi-square Probability Calculator in JMP
Enter the test statistic value and df and the p-value is calculated automatically.
p-value = $P(\chi^2_{df} > \text{observed test statistic})$
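For readers without JMP, the right-tail area for 1 degree of freedom can be approximated with the Python standard library. This sketch relies on the fact that a chi-square variable with 1 df is the square of a standard normal; the helper name is hypothetical, and the input 9.01 is the statistic computed from the cervical cancer table rather than a value quoted in the transcript.

```python
import math

def chi_square_p_value_df1(statistic):
    """Right-tail area P(chi-square with 1 df > statistic).

    With 1 df, chi-square = Z^2, so the tail area equals
    P(|Z| > sqrt(statistic)) = erfc(sqrt(statistic / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))

p_value = chi_square_p_value_df1(9.01)
print(round(p_value, 4))
```

This shortcut is valid only for df = 1; other degrees of freedom need a general chi-square tail function.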

2 X 2 Example: Case-Control Study
Cervical Cancer and Age at 1st Pregnancy
Conclusion: We have strong evidence to suggest that age at first pregnancy and cervical cancer status are NOT independent, i.e. that they are associated or related (p = .0027). In particular, we found that the proportion of women having their first pregnancy at or before the age of 25 was higher among women with cervical cancer than among those without.

Other things we could do…
o Odds Ratio (OR) and CI for OR
  - a case-control study means no RR.
o Fisher’s Exact Test
  - Pearson’s chi-square is an approximation that requires “large” sample sizes
    * typically we would like all $E_{ij}$ > 5
    * or at least 80% of cells should have $E_{ij}$ > 5
    * thus the approximation should be good here, as both of these conditions are met for this study.
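As a sketch of the first option, the odds ratio for the cervical cancer table can be computed together with an approximate 95% confidence interval. The slides do not say which interval method was used; the log-based (Woolf) interval below is an assumption, as are the variable names.

```python
import math

# Cell counts: a, b = cases (early, late); c, d = controls (early, late).
a, b, c, d = 42, 7, 203, 114

odds_ratio = (a * d) / (b * c)

# Assumed method: large-sample interval on the log-odds-ratio scale.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = 1.96  # standard normal quantile for a 95% interval
lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes 1 points the same way as the chi-square test: the odds of an early first pregnancy are higher among cases than among controls.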

Example 2: Response to Treatment and Histological Type of Hodgkin’s Disease
In this study a random sample of 538 patients diagnosed with some form of Hodgkin’s Disease was taken. The histological type, nodular sclerosis (NS), mixed cellularity (MC), lymphocyte predominance (LP), or lymphocyte depletion (LD), was recorded along with the outcome from standard treatment, which was recorded as none, partial, or complete remission.
Q: Is there an association between type of Hodgkin’s and response to treatment? If so, what is the nature of the relationship?

Example 2: Response to Treatment and Histological Type of Hodgkin’s Disease

Type             None    Partial    Positive    Row Totals
LD                 44         10          18            72
LP                 12         18          74           104
MC                 58         54         154           266
NS                 12         16          68            96
Column Totals     126         98         314       n = 538

Some Probabilities of Potential Interest
Probability of Positive Response to Treatment
P(positive) = 314/538 = .5836
Probability of Positive Response to Treatment Given Disease Type
P(positive|LD) = 18/72 = .2500
P(positive|LP) = 74/104 = .7115
P(positive|MC) = 154/266 = .5789
P(positive|NS) = 68/96 = .7083
Notice the conditional probabilities are not equal to the unconditional!
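The probabilities listed above can be reproduced directly from the table. The dictionary layout and names below are illustrative choices, not part of the original slides.

```python
# Counts per histological type: (none, partial, positive response).
counts = {
    "LD": (44, 10, 18),
    "LP": (12, 18, 74),
    "MC": (58, 54, 154),
    "NS": (12, 16, 68),
}

n = sum(sum(row) for row in counts.values())

# Unconditional probability of a positive response.
p_positive = sum(row[2] for row in counts.values()) / n

# Probability of a positive response conditional on disease type.
p_positive_given_type = {t: row[2] / sum(row) for t, row in counts.items()}

print(round(p_positive, 4))
for disease_type, p in p_positive_given_type.items():
    print(disease_type, round(p, 4))
```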

Mosaic plot of the results: Response to Treatment vs. Histological Type
Clearly we see that LP and NS respond most favorably to treatment, with over 70% of those sampled experiencing complete remission, whereas lymphocyte depletion has a majority (61.1%) of patients showing no response to treatment. A statistical test at this point seems unnecessary, as it seems clear that there is an association between the type of Hodgkin’s disease and the response to treatment; nonetheless we will proceed…

Example 2: Response to Treatment and Histological Type of Hodgkin’s Disease

Type             None           Partial        Positive         Row Totals
LD               44 (16.86)     10 (13.11)      18 (42.02)              72
LP               12 (24.36)     18 (18.94)      74 (60.69)             104
MC               58 (62.30)     54 (48.45)     154 (155.25)            266
NS               12 (22.48)     16 (17.49)      68 (56.03)              96
Column Totals    126            98             314                 n = 538

Expected frequencies are shown in parentheses.

Example 2: Response to Treatment and Histological Type of Hodgkin’s Disease
We have strong evidence of an association between the type of Hodgkin’s and response to treatment (p < .0001).
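The reported p-value can be checked with a short standard-library sketch. The function names are hypothetical, and the tail-area helper uses a closed form that holds only when the degrees of freedom are even, which is the case here since df = (4 - 1)(3 - 1) = 6.

```python
import math

def pearson_chi_square(observed):
    """Return (statistic, df) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = sum(
        (o - row_totals[i] * col_totals[j] / n) ** 2 / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(observed)
        for j, o in enumerate(row)
    )
    return statistic, (len(row_totals) - 1) * (len(col_totals) - 1)

def chi_square_tail_even_df(x, df):
    """Right-tail area for even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Rows: LD, LP, MC, NS. Columns: none, partial, positive.
hodgkins = [[44, 10, 18], [12, 18, 74], [58, 54, 154], [12, 16, 68]]
stat, df = pearson_chi_square(hodgkins)
p_value = chi_square_tail_even_df(stat, df)
print(round(stat, 2), df, p_value < 0.0001)
```

The largest contributions to the statistic come from the LD row, which matches the impression from the conditional probabilities.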

Measures of Association Between Two Categorical Variables
This can be applied to the cervical cancer case-control study.

Measures of Association Between Two Categorical Variables This can be used for general r x c tables. This can be used for the Hodgkin’s example:

Measures of Association Between Two Categorical Variables For the Hodgkin’s study

Measures of Association Between Two Categorical Variables
There are lots of other measures of association. When both variables are nominal the previous measures are fine, and there are certainly many more. For cases where both variables are ordinal, common measures include Kendall’s tau and Somers’ D. In some cases we wish to measure the degree of exact agreement between two nominal or ordinal variables measured using the same levels or scales, in which case we generally use Cohen’s Kappa (κ).

Measures of Association Between Two Categorical Variables
Cohen’s Kappa (κ) measures the degree of agreement between two variables on the same scales.
Example 3: Medicare Study. General health at baseline and at 2-yr. follow-up: how well do they agree?
κ > .75: excellent agreement
.40 ≤ κ ≤ .75: good agreement
0 < κ < .40: marginal agreement
There is fairly good agreement between the general assessment of overall health at baseline and at follow-up. However, there appears to be some general trend toward improvement as well.
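The Medicare table itself is not preserved in this transcript, so the sketch below uses a small HYPOTHETICAL 3 x 3 agreement table purely to illustrate how kappa is computed: observed agreement on the diagonal is compared with the agreement expected by chance from the margins. The function name and the data are assumptions for this example.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table: (p_obs - p_exp) / (1 - p_exp)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Proportion of subjects on the diagonal (exact agreement).
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    # Agreement expected by chance if the two ratings were independent.
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# HYPOTHETICAL data (not the Medicare study): rows = first rating, columns = second.
hypothetical = [
    [20, 5, 0],
    [4, 30, 6],
    [1, 5, 29],
]
kappa = cohens_kappa(hypothetical)
print(round(kappa, 4))
```

Kappa equals 1 only when every subject falls on the diagonal, and it is 0 when the observed agreement is no better than chance.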

Testing for Lack of Symmetry
o Bowker’s Test of Symmetry is a generalization of McNemar’s Test to r x r tables where the row and column variables are on the same scale.
o The general health of the subjects in the Medicare study is an example of where this test could be used, as both the health at baseline and at follow-up are recorded using the same 5-point ordinal scale.

Bowker’s Test of Symmetry
(rows: X, columns: Y)

X \ Y            1       2       …       r       Row Totals
1               O11     O12      …      O1r
2               O21     O22      …      O2r
…                …       …       …       …
r               Or1     Or2      …      Orr
Column Totals

The test looks for the frequencies to be generally larger on one side of the diagonal than the other.

Bowker’s Test of Symmetry
When will this test statistic be “large”?
If there were a general trend or tendency for X > Y or for X < Y, then we would expect the off-diagonal cells of the table to be larger on one side than the other. For example, if Y tended to be larger than X, perhaps indicating an improvement in health, then we would expect the frequencies above the diagonal to be larger than those below.
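The formula for the statistic is not preserved in this transcript. The sketch below assumes the usual form of Bowker's statistic, the sum over pairs i < j of (O_ij - O_ji)^2 / (O_ij + O_ji), referred to a chi-square distribution with one degree of freedom per contributing pair. The function name and the 3 x 3 table are HYPOTHETICAL and only illustrate the mechanics.

```python
def bowker_statistic(table):
    """Return (statistic, df) for Bowker's test of symmetry on a square table."""
    r = len(table)
    statistic = 0.0
    df = 0
    for i in range(r):
        for j in range(i + 1, r):
            above, below = table[i][j], table[j][i]
            total = above + below
            if total == 0:
                continue  # an empty pair contributes nothing and no df
            statistic += (above - below) ** 2 / total
            df += 1
    return statistic, df

# HYPOTHETICAL data (not the Medicare study): rows = X, columns = Y.
hypothetical = [
    [20, 5, 0],
    [4, 30, 6],
    [1, 5, 29],
]
stat, df = bowker_statistic(hypothetical)
print(round(stat, 4), df)
```

Each term compares a cell above the diagonal with its mirror image below, so the statistic grows when one side is systematically heavier. For a 2 x 2 table it reduces to McNemar's statistic without continuity correction.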

Bowker’s Test of Symmetry
Symmetry of Disagreement
Bowker’s test suggests the differences are asymmetric (p < .0001). Examining the percentages suggests that, within each baseline score group, a majority of patients either stayed the same or improved. Therefore it is reasonable to state that we have evidence that, in general, subjects’ health stayed the same or, if it did change, changed for the better (p < .0001).

Other Approaches
o Wilcoxon Signed-Rank Test for the paired differences in the ordinal health score (p < .0001).
o Direct examination of the distribution of the changes in general health score (Follow-up - Baseline).
There is a slight advantage for improvement vs. decline in health. The plot on the right shows the change in general health vs. baseline health. With the exception of those with the lowest health at baseline, a majority (50%+) of patients stayed the same. The shading for improvement is larger than the shading for health decline.

Other Tests for Categorical Data
o Chi-square Test for Trend in Binomial Proportions: tests whether or not p1 < p2 < p3 < … < pk, where 1, 2, …, k are levels of an ordinal variable, i.e. a 2 X k table.
o Chi-square Goodness-of-Fit Tests: used to test whether observations come from some hypothesized distribution.
o Cochran-Mantel-Haenszel Test: looks at whether or not there is a relationship in a 2 X 2 table situation, adjusting for the level of a third factor. For example, is there a relationship between heavy drinking (Y or N) and lung cancer (Y or N), adjusting for smoking status?
