# The association between two categorical variables

## Presentation on theme: "The association between two categorical variables"— Presentation transcript:

The association between two categorical variables
Categorical data II The association between two categorical variables دکتر سید ابراهیم جباری فر تاریخ : 1388 / 2010 دانشیار دانشگاه علوم پزشکی اصفهان بخش دندانپزشکی جامعه نگر

By the end of the session you should be able to
Construct 2-way table to examine association between two categorical variatles Conduct Chi-squared test to assess the association between two categorical variables Conduct Chi-square test for trend when the exposure is ordered categorical

Example Random sample of individuals were questioned about their occupatim and their BP was measured. Based on SBP and DBP measures they were classified as hypertensive or non­hypertensive. Among 300 people in non-manual jobs, there were 72 hypertensive individuals. Among 240 people in manual jobs, there were 96 hypertensive indwiduals.

Constructing 2- way table
As a first step we need to organize our data in a formal way ­we construct 2-way table Hypertension Total No Yes 240 144 96 Manual 300 228 72 Non- manual 540 372 168

What does it mean when we speak about an association between two categorical variables?

What does it mean when we speak about an association between two categorical variables?
It means that knowing the value of one variable tells us sanething about the value of the other variable. Two variables are therefore saidto be associated if the distribution of one variable varies according to the value of the other varialde.

What does it mean when we speak about an association between two categorical variables?
In our example, the two variables, occupation and hypertension, are associated if the distribution of hypertension varies between occupatioml groups. And, if distribution of hypertension is same in both occupational groups, we can say that there is no association between hypertension and occupational category - because knowing a occupational category of indi\idual wil! not tell us anything about hypertension.

What does it mean when we speak about an association between two categorical variables?
• Having constructed a two-way table, the next step is to look whether the distribution of one variable differs according to the value of the other variable. We need to calculate either row or column percentages. Often, one variable can be regarded as the response variable, while the other is the explanatory variable, and this should help to decide what percentages are shown • If the columns represent the explanatory variable, then column percentages are more appropriate, and vice versa.

Constructing 2-way table
As a second step we calculate proportion of hypertensive individuals among manual workers. non-manual workers and in the whole sample Hypertension Total No Yes 240 144(60.0%) 96(40.0%) Manual 300 228(76.0%) 72(24.0%) Non-manual 450 372(68.9%) 168(31.1%)

The numbers in the four categories in the 2­way table in the previous slide all called
OBSERVED NUMBERS I

The data seem to suggest sane association between hypertension and occLPation (40% of manual workers with hypertension compared to 24% of non-manual workers with hypertension) The calculation and examination of such percentages is an essential step in the analysis of a two-way table, and should always be done before starting formal significance tests.

Significance test the association
Although it seemsthat there is an association in the table, the question is whether this may be attributable to samping variability Each of the percentages i1 the table is subject to sampling error, and we need to assess whether the differences between them may be due to chance This is done by conducting a sig1ificance test The null hypothesis is "there is no association between the two variables'

Expected numbers The significance test is Chi-squared test This test canpares the observed numbers in each of four categories of contingency table with the numbers to be expected if there was no difference in proportion of hypertensive indviduals in two occupational groups

Hypertension Total No Yes 240 74.64 Manual 300 Non-manual 450 372(68.9%) 168(31.1%) From the table above, the overall proportion of hypertensive individuals is 168/372 (31.1 %). If the null hypothesis were true, the expected number of manual subjects with hypertension is 31.1 % of 240, which is 74.64

Expected numbers in the other cells of the table can be calculated similarly, using the general formula: Row×Culumn total Expected number= Overall total Hypertension Total No Yes 240 74.64 Manual 300 Non-manual 450 372(68.9%) 168(31.1%)

Next step – compare observed and expected numbers
Hypertension OBSERVED Total No Yes 240 144 96 Manual 300 228 72 Non-manual 450 372(68.9%) 168(31.1%) Hypertension EXPECTED Total No Yes 240 165.36 74.64 Manual 300 206.64 93.36 Non-manual 450 372(68.9%) 168(31.1%)

Chi-squared test (X2test)
Calculate (O-E)/E for each cell and sum over all cells In our example X2 = [( ) 2/ ( )2 / ( )2 / ( )2 / ] = 15.97

If X2 value is large then (O-E) is, in general, large and data do not support H0 = association
If X2 value is small then (O-E) is, in general, Small and data do support H0= no association Large values of X2 suggest that the data are inconsistent with the null hypothesis, and therefore that there is an associaion between the two variables.

Obtaining p-value Under Ho: X2 distribution

Obtaining p-value The P-value is obtained by referring the calculated value of X2 to tables of the en-squared distribution . The P-value in this case corresp:mds to the value shown as a in the tables . The degrees of freedom are given by the formula: d.f=(r-1)x(c-1) R= number of rows , c=number of columns

Back to our example: X2=15.97 Table 2x2 d.f.=1 and from the table P<O.001

Relation with normal test for the difference between 2 proportions:
Pl-P2= 40%-24% = 16% =4.01 Z=(P1-P2)/SE=16/4.01=3.99 For 2 × 2 table, X2 = Z2, i.e = 3.992 Pvalues for both tests are identical

There is a quick formula calculating chi-square test for 2×2 table
If we use following notation: Quicker formula for calculating chi-squared test is Total e b a f d c N h g Quicker formula for calculating chi- squared test is d.f=1

Continuity correction
The chi-squared test for 2×2 table can be improved by using a continuity correction, called Yates' continuity correction When we compare two p-oportions using the Z test (or the X2 test), we are making the assumption that the sampling distributions of each of the proportions, and of the difference between them, are approximately normal.

Continuity correction - cont.
In fact, the distribution et a proportion is not exactly normal, because it is a dscrete variable, whereas the normal distribution elates to continuous variables.

Continuity correction
For large enough samples, the binomial distribution approximates very closely tothe normal distribution This is why we use of the Z and x2 tests for the comparison of proportions However, the approximation can be improved by making a continuity correction, which adjusts the value of z to allow for the fact thatwe are modelling a discrete variable with a continuous distribution

Yates’c ontinuity correction - formula
The effect of the continuity correction is to reduce the value of x2; with smaller samples, the effect is more substantial - which is exactly what we want given that the normal distribution is better approximation of binomial distribution when sample is larger

Exact test for 2×2 tables The exact test to compare 2 proportions is needed when the numbers in the 2 × 2 table are very small (the consensus: if the total number of individuals is less than 40 and one of the cells (or more) is lesslhan 5) It is based on calcLJating exact probabilities of the observed table and of more 'extreme' tables with the same row and column totas Formula and more details - Kirkwood and Sterne, p

Larger tables (r x c tables)
No continuity correction Valid if less than 20% of expected numbers are under 5 and none is less than 1 If low expected numbers- combine either rows or columns to overcome this problem

How to calculate expected number in particular cell
Row total Column total Expected number= Overall total

Ordered exposures 1X2 test for trend
A special case = there are two rows but more than two columns (or it can be two columns araJ more than two rows) AND an ordered categorical variable is involved (as the variable with more than 2 categories)

Education M1 Total University Secondary Primary 132 20(22.2%) 64(32%) 48(40%) Yes 278 70 136 72 No 410 90 200 120 3 2 1 Score for trend test

Example In this 2 x 3 table, the aim is to examine if there is any association between educaton and MI. MI is the outcome of interest Education is the exposure of nterest We want to compare the proportions of subjects with MI in the three education groups.

Standard chi-squared test
X2=7.45 Table 2×3-d.f=2 P=0.024 So there is statistical evidence of an association between the two variables. However, when the column variable is an ordered categorical variable, a more sensitive procedure for detecting an association is to use a x2 test-for-trend

X2 test – for trend The null hypothesis is the same as for an ordinary 2 × c X2 test, that there is no association between the two variables. But the test-for-trend is particularly sensitive to an upward or downward trend in proportions across the columns of the table The test is conducted by assigning the numerical scores 1, 2, to the columns of the table. We then calculate the mean scores X1 and X2 observed in those with MI and in those without MI.

• Mean score among those withoutMI
X1=[48*1+64*2+20*3]/132=236/132=1.79 Mean score among those without MI X2= [72* * 2+ 70* 3]/278= 554/278= 1.99 If there is no difference between the three proportions, two mean scores will be equal. If there is an upward (or downward) trend in the proportions, the mean score in the MJ group will be greater (or less) than that in the no MI group.

We now need to calculate stancard deviation s.
Let's assume that we have a sampe of 41 0 points made up of 120 ones, 200 twos and 90 threes. From this sample we find the standard deviation of the whole sample in the usual way. S= Mean is overall mean of ones, twos and threes

Example mean = (120 * *2 +90 *3)/410= 1.93 And S= S=0.71

Chi-squared test for trend
obtained by comparing the mean scores for the 2 groups: d.f.=1 p = 0.006 Where n, and n2 are total numbers of individuals with and without MI You can see that square root of the formula is formula for Z test for comparing means of 2 samples Z is square root of X2

You can see that p value is smaller than with ordinary chi-squared test
There is even stronger evidence of an association between MI and education The alternative formula for chi-squared test for trend is described by Kirkwood cnd Sterne, p.173­176

Chi- squared tests in STATA
Let's use data from practical 10 (last session) We try to evaluate whether there is an association between current smoking and age We have age grouped into 4 groups (30-39, 40-49, 50-59, 60­69) Smoking (variable smok) was coded 1 =current smokers, 2=non-smokers Firstly, we recode smoking Recode smok 2=0 (we need the outcome to be coded as 0.1)

Let’s check proportion of smokers in each age category
Tab smok agegroup col 30-39,40-49,50-59,60-69 Total 60 50 40 30 1=Yes 0=no 1.675 65.69 491 78.81 490 72.38 357 56.31 337 54.71 875 34.31 132 21.19 187 27.62 277 43.69 279 45.29 1 2.550 100.00 623 677 634 616

Degrees of freedom chi-squared test value P<0.001
Now , standard chi- squared test Tab smok agegroup , col chi 30-39,40-49,50-59,60-69 Total 60 50 40 30 1=Yes 0=no 1.675 65.69 491 78.81 490 72.38 357 56.31 337 54.71 875 34.31 132 21.19 187 27.62 277 43.69 279 45.29 1 2.550 100.00 623 677 634 616 Pearson chi2(3)= pr=0.000 Degrees of freedom chi-squared test value P<0.001

Chi – squared test for trend
COMMAND OUTCOME EXPOSURE Tabodds smok ageroup interval 95% conf odds controls cases agegroup 337 357 490 491 279 277 187 132 30 40 50 60 Test of homogeneity (equal odds): chi2(3)=118.70 pr>chi2=0.0000 Score test for trend of odds: chi2(1)=108.83 This is part that is interesting at themomenti

What have we learnt in the session?
Construct 2-way table to examine association between two categorical variatles Conduct Chi-squared test to assess the association between two categorical variables Conduct Chi-squared test for trend for ordered exposures