Presentation on theme: "The association between two categorical variables"— Presentation transcript:
1 The association between two categorical variables Categorical data IIThe association between two categorical variablesدکتر سید ابراهیم جباری فر تاریخ : 1388 / 2010دانشیار دانشگاه علوم پزشکی اصفهان بخش دندانپزشکی جامعه نگر
2 By the end of the session you should be able to Construct 2-way table to examine association between two categorical variatlesConduct Chi-squared test to assess the association between two categorical variablesConduct Chi-square test for trend when the exposure is ordered categorical
3 ExampleRandom sample of individuals were questioned about their occupatim and their BP was measured. Based on SBP and DBP measures they were classified as hypertensive or nonhypertensive. Among 300 people in non-manual jobs, there were 72 hypertensive individuals. Among 240 people in manual jobs, there were 96 hypertensive indwiduals.
4 Constructing 2- way table As a first step we need to organize our data in a formal way we construct 2-way tableHypertensionTotalNoYes24014496Manual30022872Non- manual540372168
5 What does it mean when we speak about an association between two categorical variables?
6 What does it mean when we speak about an association between two categorical variables? It means that knowing the value of one variable tells us sanething about the value of the other variable.Two variables are therefore saidto be associated if the distribution of one variable varies according to the value of the other varialde.
7 What does it mean when we speak about an association between two categorical variables? In our example, the two variables, occupation and hypertension, are associated if the distribution of hypertension varies between occupatioml groups.And, if distribution of hypertension is same in both occupational groups, we can say that there is no association between hypertension and occupational category - because knowing a occupational category of indi\idual wil! not tell us anything about hypertension.
8 What does it mean when we speak about an association between two categorical variables? • Having constructed a two-way table, the next step is to look whether the distribution of one variable differs according to the value of the other variable.We need to calculate either row or column percentages.Often, one variable can be regarded as the response variable, while the other is the explanatory variable, and this should help to decide what percentages are shown• If the columns represent the explanatory variable, then column percentages are more appropriate, and vice versa.
9 Constructing 2-way table As a second step we calculate proportion of hypertensive individuals among manual workers. non-manual workers and in the whole sampleHypertensionTotalNoYes240144(60.0%)96(40.0%)Manual300228(76.0%)72(24.0%)Non-manual450372(68.9%)168(31.1%)
10 The numbers in the four categories in the 2way table in the previous slide all called OBSERVED NUMBERS I
11 The data seem to suggest sane association between hypertension and occLPation (40% of manual workers with hypertension compared to 24% of non-manual workers with hypertension)The calculation and examination of such percentages is an essential step in the analysis of a two-way table, and should always be done before starting formal significance tests.
12 Significance test the association Although it seemsthat there is an association in the table, the question is whether this may be attributable to samping variabilityEach of the percentages i1 the table is subject to sampling error, and we need to assess whether the differences between them may be due to chanceThis is done by conducting a sig1ificance testThe null hypothesis is "there is no association between the two variables'
13 Expected numbersThe significance test is Chi-squared testThis test canpares the observed numbers in each of four categories of contingency table with the numbers to be expected if there was no difference in proportion of hypertensive indviduals in two occupational groups
14 HypertensionTotalNoYes24074.64Manual300Non-manual450372(68.9%)168(31.1%)From the table above, the overall proportion of hypertensive individuals is 168/372 (31.1 %).If the null hypothesis were true, the expected number of manual subjects with hypertension is 31.1 % of 240, which is 74.64
15 Expected numbers in the other cells of the table can be calculated similarly, using the general formula:Row×Culumn totalExpected number=Overall totalHypertensionTotalNoYes24074.64Manual300Non-manual450372(68.9%)168(31.1%)
16 Next step – compare observed and expected numbers HypertensionOBSERVEDTotalNoYes24014496Manual30022872Non-manual450372(68.9%)168(31.1%)HypertensionEXPECTEDTotalNoYes240165.3674.64Manual300206.6493.36Non-manual450372(68.9%)168(31.1%)
17 Chi-squared test (X2test) Calculate (O-E)/E for each cell and sum over all cellsIn our exampleX2 = [( ) 2/ ( )2 / ( )2 / ( )2 / ] = 15.97
18 If X2 value is large then (O-E) is, in general, large and data do not support H0 = association If X2 value is small then (O-E) is, in general, Small and data do supportH0= no associationLarge values of X2 suggest that the data are inconsistent with the null hypothesis, and therefore that there is an associaion between the two variables.
20 Obtaining p-valueThe P-value is obtained by referring the calculated value of X2 to tables of the en-squareddistribution .The P-value in this case corresp:mds to the value shown as a in the tables .The degrees of freedom are given by the formula:d.f=(r-1)x(c-1)R= number of rows , c=number of columns
21 Back to our example:X2=15.97Table 2x2d.f.=1and from the tableP<O.001
22 Relation with normal test for the difference between 2 proportions: Pl-P2= 40%-24% = 16%=4.01Z=(P1-P2)/SE=16/4.01=3.99For 2 × 2 table, X2 = Z2, i.e = 3.992Pvalues for both tests are identical
23 There is a quick formula calculating chi-square test for 2×2 table If we use following notation:Quicker formula for calculating chi-squared test isTotalebafdcNhgQuicker formula for calculating chi- squared test isd.f=1
24 Continuity correction The chi-squared test for 2×2 table can be improved by using a continuity correction, called Yates' continuity correctionWhen we compare two p-oportions using the Z test (or the X2 test), we are making the assumption that the sampling distributions of each of the proportions, and of the difference between them, are approximately normal.
25 Continuity correction - cont. In fact, the distribution et a proportion is not exactly normal, because it is a dscrete variable, whereas the normal distribution elates to continuous variables.
26 Continuity correction For large enough samples, the binomial distribution approximates very closely tothe normal distributionThis is why we use of the Z and x2 tests for the comparison of proportionsHowever, the approximation can be improved by making a continuity correction, which adjusts the value of z to allow for the fact thatwe are modelling a discrete variable with a continuous distribution
27 Yates’c ontinuity correction - formula The effect of the continuity correction is to reduce the value of x2; with smaller samples, the effect is more substantial - which is exactly what we want given that the normal distribution is better approximation of binomial distribution when sample is larger
28 Exact test for 2×2 tablesThe exact test to compare 2 proportions is needed when the numbers in the 2 × 2 table are very small (the consensus: if the total number of individualsis less than 40 and one of the cells (or more) is lesslhan 5)It is based on calcLJating exact probabilities of the observed table and of more 'extreme' tables with the same row and column totasFormula and more details - Kirkwood and Sterne, p
29 Larger tables (r x c tables) No continuity correctionValid if less than 20% of expected numbers are under 5 and none is less than 1If low expected numbers- combine either rows or columns to overcome this problem
30 How to calculate expected number in particular cell Row total Column totalExpected number=Overall total
31 Ordered exposures 1X2 test for trend A special case = there are two rows but more than two columns (or it can be two columns araJ more than two rows)AND an ordered categorical variable is involved (as the variable with more than 2 categories)
32 EducationM1TotalUniversitySecondaryPrimary13220(22.2%)64(32%)48(40%)Yes2787013672No41090200120321Score for trend test
33 ExampleIn this 2 x 3 table, the aim is to examine if there is any association between educaton and MI.MI is the outcome of interestEducation is the exposure of nterestWe want to compare the proportions of subjects with MI in the three education groups.
34 Standard chi-squared test X2=7.45Table 2×3-d.f=2P=0.024So there is statistical evidence of an association between the two variables.However, when the column variable is an ordered categorical variable, a moresensitive procedure for detecting an association is to use a x2 test-for-trend
35 X2 test – for trendThe null hypothesis is the same as for an ordinary 2 × c X2 test, that there is no association between the two variables. But the test-for-trend is particularly sensitive to an upward or downward trend in proportions across the columns of the tableThe test is conducted by assigning the numerical scores 1, 2, to the columns of the table. We then calculate the mean scores X1 and X2 observed in those with MI and in those without MI.
36 • Mean score among those withoutMI X1=[48*1+64*2+20*3]/132=236/132=1.79Mean score among those without MIX2= [72* * 2+ 70* 3]/278= 554/278= 1.99If there is no difference between the three proportions, two mean scores will be equal.If there is an upward (or downward) trend in the proportions, the mean score in the MJ group will be greater (or less) than that in the no MI group.
37 We now need to calculate stancard deviation s. Let's assume that we have a sampe of 41 0 points made up of 120 ones, 200 twos and 90 threes. From this sample we find the standard deviation of the whole sample in the usual way.S=Mean is overall mean of ones, twos and threes
39 Chi-squared test for trend obtained by comparing the mean scores for the 2 groups:d.f.=1p = 0.006Where n, and n2 are total numbers of individuals with and without MIYou can see that square root of the formula is formula for Z test for comparing means of 2 samplesZ is square root of X2
40 You can see that p value is smaller than with ordinary chi-squared test There is even stronger evidence of an association between MI and educationThe alternative formula for chi-squared test for trend is described by Kirkwood cnd Sterne, p.173176
41 Chi- squared tests in STATA Let's use data from practical 10 (last session)We try to evaluate whether there is an association between current smoking and ageWe have age grouped into 4 groups (30-39, 40-49, 50-59, 6069)Smoking (variable smok) was coded 1 =current smokers, 2=non-smokersFirstly, we recode smokingRecode smok 2=0 (we need the outcome to be coded as 0.1)
42 Let’s check proportion of smokers in each age category Tab smok agegroup col30-39,40-49,50-59,60-69Total605040301=Yes 0=no1.67565.6949178.8149072.3835756.3133754.7187534.3113221.1918727.6227743.6927945.2912.550100.00623677634616
43 Degrees of freedom chi-squared test value P<0.001 Now , standard chi- squared testTab smok agegroup , col chi30-39,40-49,50-59,60-69Total605040301=Yes 0=no1.67565.6949178.8149072.3835756.3133754.7187534.3113221.1918727.6227743.6927945.2912.550100.00623677634616Pearson chi2(3)= pr=0.000Degrees of freedom chi-squared test value P<0.001
44 Chi – squared test for trend COMMAND OUTCOME EXPOSURETabodds smok ageroupinterval95% confoddscontrolscasesagegroup33735749049127927718713230405060Test of homogeneity (equal odds): chi2(3)=118.70pr>chi2=0.0000Score test for trend of odds: chi2(1)=108.83This is part that is interesting at themomenti
45 What have we learnt in the session? Construct 2-way table to examine association between two categorical variatlesConduct Chi-squared test to assess the association between two categorical variablesConduct Chi-squared test for trend for ordered exposures