Presentation on theme: "Categorical data II The association between two categorical variables دکتر سید ابراهیم جباری فر تاریخ : 1388 / 2010 دانشیار دانشگاه علوم پزشکی اصفهان بخش."— Presentation transcript:
Categorical data II The association between two categorical variables دکتر سید ابراهیم جباری فر تاریخ : 1388 / 2010 دانشیار دانشگاه علوم پزشکی اصفهان بخش دندانپزشکی جامعه نگر
By the end of the session you should be able to Construct 2-way table to examine association between two categorical variatles Conduct Chi-squared test to assess the association between two categorical variables Conduct Chi-square test for trend when the exposure is ordered categorical
Example Random sample of individuals were questioned about their occupatim and their BP was measured. Based on SBP and DBP measures they were classified as hypertensive or non hypertensive. Among 300 people in non-manual jobs, there were 72 hypertensive individuals. Among 240 people in manual jobs, there were 96 hypertensive indwiduals.
Constructing 2- way table As a first step we need to organize our data in a formal way we construct 2-way table Hypertension TotalNoYes 24014496Manual 30022872Non- manual 540372168Total
What does it mean when we speak about an association between two categorical variables?
It means that knowing the value of one variable tells us sanething about the value of the other variable. Two variables are therefore saidto be associated if the distribution of one variable varies according to the value of the other varialde.
What does it mean when we speak about an association between two categorical variables? In our example, the two variables, occupation and hypertension, are associated if the distribution of hypertension varies between occupatioml groups. And, if distribution of hypertension is same in both occupational groups, we can say that there is no association between hypertension and occupational category - because knowing a occupational category of indi\idual wil! not tell us anything about hypertension.
What does it mean when we speak about an association between two categorical variables? Having constructed a two-way table, the next step is to look whether the distribution of one variable differs according to the value of the other variable. We need to calculate either row or column percentages. Often, one variable can be regarded as the response variable, while the other is the explanatory variable, and this should help to decide what percentages are shown If the columns represent the explanatory variable, then column percentages are more appropriate, and vice versa.
Constructing 2-way table As a second step we calculate proportion of hypertensive individuals among manual workers. non-manual workers and in the whole sample Hypertension TotalNoYes 240144(60.0%)96(40.0%)Manual 300228(76.0%)72(24.0%)Non-manual 450372(68.9%)168(31.1%)Total
The numbers in the four categories in the 2way table in the previous slide all called OBSERVED NUMBERS I
The data seem to suggest sane association between hypertension and occLPation (40% of manual workers with hypertension compared to 24% of non-manual workers with hypertension) The calculation and examination of such percentages is an essential step in the analysis of a two-way table, and should always be done before starting formal significance tests.
Significance test the association Although it seemsthat there is an association in the table, the question is whether this may be attributable to samping variability Each of the percentages i1 the table is subject to sampling error, and we need to assess whether the differences between them may be due to chance This is done by conducting a sig1ificance test The null hypothesis is "there is no association between the two variables'
Expected numbers The significance test is Chi-squared test This test canpares the observed numbers in each of four categories of contingency table with the numbers to be expected if there was no difference in proportion of hypertensive indviduals in two occupational groups
Hypertension TotalNoYes 24074.64Manual 300Non-manual 450372(68.9%)168(31.1%)Total From the table above, the overall proportion of hypertensive individuals is 168/372 (31.1 %). If the null hypothesis were true, the expected number of manual subjects with hypertension is 31.1 % of 240, which is 74.64
Hypertension TotalNoYes 24074.64Manual 300Non-manual 450372(68.9%)168(31.1%)Total Expected numbers in the other cells of the table can be calculated similarly, using the general formula: Row×Culumn total Expected number= Overall total
HypertensionOBSERVED TotalNoYes 24014496Manual 30022872Non-manual 450372(68.9%)168(31.1%)Total Next step – compare observed and expected numbers HypertensionEXPECTED TotalNoYes 240165.3674.64Manual 300206.6493.36Non-manual 450372(68.9%)168(31.1%)Total
Chi-squared test (X2test) Calculate (O-E)/E for each cell and sum over all cells In our example X 2 = [(96-74.64) 2 /74.64 + (144-165.36) 2 / 165.36 + (72- 93.36) 2 / 93.36 + (228-206.64) 2 / 206.64] = 15.97
If X 2 value is large then (O-E) is, in general, large and data do not support H 0 = association If X 2 value is small then (O-E) is, in general, Small and data do support H 0 = no association Large values of X 2 suggest that the data are inconsistent with the null hypothesis, and therefore that there is an associaion between the two variables.
Obtaining p-value The P-value is obtained by referring the calculated value of X 2 to tables of the en-squared distribution. The P-value in this case corresp:mds to the value shown as a in the tables. The degrees of freedom are given by the formula: d.f=(r-1)x(c-1) R= number of rows, c=number of columns
Back to our example: X 2 =15.97 Table 2x2 d.f.=1 and from the table P
"name": "Back to our example: X 2 =15.97 Table 2x2 d.f.=1 and from the table P
Relation with normal test for the difference between 2 proportions: P l -P 2 = 40%-24% = 16% =4.01 Z=(P 1 -P 2 )/SE=16/4.01=3.99 For 2 × 2 table, X 2 = Z 2, i.e. 15.97 = 3.99 2 Pvalues for both tests are identical
Quick formula There is a quick formula calculating chi-square test for 2×2 table If we use following notation: Quicker formula for calculating chi-squared test is Total eba fdc Nhg Quicker formula for calculating chi- squared test is d.f=1
Continuity correction The chi-squared test for 2×2 table can be improved by using a continuity correction, called Yates' continuity correction When we compare two p-oportions using the Z test (or the X 2 test), we are making the assumption that the sampling distributions of each of the proportions, and of the difference between them, are approximately normal.
Continuity correction - cont. In fact, the distribution et a proportion is not exactly normal, because it is a dscrete variable, whereas the normal distribution elates to continuous variables.
Continuity correction For large enough samples, the binomial distribution approximates very closely tothe normal distribution This is why we use of the Z and x 2 tests for the comparison of proportions However, the approximation can be improved by making a continuity correction, which adjusts the value of z to allow for the fact thatwe are modelling a discrete variable with a continuous distribution
Yatesc ontinuity correction - formula The effect of the continuity correction is to reduce the value of x 2 ; with smaller samples, the effect is more substantial - which is exactly what we want given that the normal distribution is better approximation of binomial distribution when sample is larger
Exact test for 2×2 tables The exact test to compare 2 proportions is needed when the numbers in the 2 × 2 table are very small (the consensus: if the total number of individuals is less than 40 and one of the cells (or more) is lesslhan 5) It is based on calcLJating exact probabilities of the observed table and of more 'extreme' tables with the same row and column totas Formula and more details - Kirkwood and Sterne, p.169-171
Larger tables (r x c tables) No continuity correction Valid if less than 20% of expected numbers are under 5 and none is less than 1 If low expected numbers- combine either rows or columns to overcome this problem
How to calculate expected number in particular cell Row total Column total Expected number= Overall total
Ordered exposures 1X 2 test for trend A special case = there are two rows but more than two columns (or it can be two columns araJ more than two rows) AND an ordered categorical variable is involved (as the variable with more than 2 categories)
EducationM1 TotalUniversitySecondaryPrimary 13220(22.2%)64(32%)48(40%)Yes 2787013672No 41090200120Total 321Score for trend test
Example In this 2 x 3 table, the aim is to examine if there is any association between educaton and MI. MI is the outcome of interest Education is the exposure of nterest We want to compare the proportions of subjects with MI in the three education groups.
Standard chi-squared test X2=7.45 Table 2×3-d.f=2 P=0.024 So there is statistical evidence of an association between the two variables. However, when the column variable is an ordered categorical variable, a more sensitive procedure for detecting an association is to use a x 2 test- for-trend
X 2 test – for trend The null hypothesis is the same as for an ordinary 2 × c X 2 test, that there is no association between the two variables. But the test-for-trend is particularly sensitive to an upward or downward trend in proportions across the columns of the table The test is conducted by assigning the numerical scores 1, 2, 3... to the columns of the table. We then calculate the mean scores X 1 and X 2 observed in those with MI and in those without MI.
Mean score among those withoutMI X 1 =[48*1+64*2+20*3]/132=236/132=1.79 Mean score among those without MI X 2 = [72*1 + 136* 2+ 70* 3]/278= 554/278= 1.99 If there is no difference between the three proportions, two mean scores will be equal. If there is an upward (or downward) trend in the proportions, the mean score in the MJ group will be greater (or less) than that in the no MI group.
We now need to calculate stancard deviation s. Let's assume that we have a sampe of 41 0 points made up of 120 ones, 200 twos and 90 threes. From this sample we find the standard deviation of the whole sample in the usual way. S= Mean is overall mean of ones, twos and threes
Example mean = (120 *1 + 200 *2 +90 *3)/410= 1.93 And S= S=0.71
Chi-squared test for trend obtained by comparing the mean scores for the 2 groups: d.f.=1 p = 0.006 Where n, and n 2 are total numbers of individuals with and without MI You can see that square root of the formula is formula for Z test for comparing means of 2 samples Z is square root of X 2
You can see that p value is smaller than with ordinary chi- squared test There is even stronger evidence of an association between MI and education The alternative formula for chi-squared test for trend is described by Kirkwood cnd Sterne, p.173176
Chi- squared tests in STATA Let's use data from practical 10 (last session) We try to evaluate whether there is an association between current smoking and age We have age grouped into 4 groups (30-39, 40-49, 50-59, 6069) Smoking (variable smok) was coded 1 =current smokers, 2=non- smokers Firstly, we recode smoking Recode smok 2=0 (we need the outcome to be coded as 0.1)
Lets check proportion of smokers in each age category Tab smok agegroup col 30-39,40-49,50-59,60-69 Total605040301=Yes 0=no 1.675 65.69 491 78.81 490 72.38 357 56.31 337 54.71 0 875 34.31 132 21.19 187 27.62 277 43.69 279 45.29 1 2.550 100.00 623 100.00 677 100.00 634 100.00 616 100.00 Total
Now, standard chi- squared test Tab smok agegroup, col chi 30-39,40-49,50-59,60-69 Total605040301=Yes 0=no 1.675 65.69 491 78.81 490 72.38 357 56.31 337 54.71 0 875 34.31 132 21.19 187 27.62 277 43.69 279 45.29 1 2.550 100.00 623 100.00 677 100.00 634 100.00 616 100.00 Total Pearson chi2(3)=118.7458 pr=0.000 Degrees of freedom chi-squared test value P<0.001
Chi – squared test for trend COMMAND OUTCOME EXPOSURE Tabodds smok ageroup interval95% confoddscontrolscasesagegroup 0.97022 0.90775 0.45166 0.32580 0.70644 0.66322 0.32246 0.22184 0.82789 0.77591 0.38163 0.26884 337 357 490 491 279 277 187 132 30 40 50 60 Test of homogeneity (equal odds): chi2(3)=118.70 pr>chi2=0.0000 Score test for trend of odds: chi2(1)=108.83 pr>chi2=0.0000 This is part that is interesting at themomenti
What have we learnt in the session? 1.Construct 2-way table to examine association between two categorical variatles 2.Conduct Chi-squared test to assess the association between two categorical variables 3.Conduct Chi-squared test for trend for ordered exposures