Presentation on theme: "Categorical data II The association between two categorical variables دکتر سید ابراهیم جباری فر تاریخ : 1388 / 2010 دانشیار دانشگاه علوم پزشکی اصفهان بخش."— Presentation transcript:
Categorical data II The association between two categorical variables دکتر سید ابراهیم جباری فر تاریخ : 1388 / 2010 دانشیار دانشگاه علوم پزشکی اصفهان بخش دندانپزشکی جامعه نگر
By the end of the session you should be able to Construct 2-way table to examine association between two categorical variatles Conduct Chi-squared test to assess the association between two categorical variables Conduct Chi-square test for trend when the exposure is ordered categorical
Example Random sample of individuals were questioned about their occupatim and their BP was measured. Based on SBP and DBP measures they were classified as hypertensive or non hypertensive. Among 300 people in non-manual jobs, there were 72 hypertensive individuals. Among 240 people in manual jobs, there were 96 hypertensive indwiduals.
Constructing 2- way table As a first step we need to organize our data in a formal way we construct 2-way table Hypertension TotalNoYes Manual Non- manual Total
What does it mean when we speak about an association between two categorical variables?
It means that knowing the value of one variable tells us sanething about the value of the other variable. Two variables are therefore saidto be associated if the distribution of one variable varies according to the value of the other varialde.
What does it mean when we speak about an association between two categorical variables? In our example, the two variables, occupation and hypertension, are associated if the distribution of hypertension varies between occupatioml groups. And, if distribution of hypertension is same in both occupational groups, we can say that there is no association between hypertension and occupational category - because knowing a occupational category of indi\idual wil! not tell us anything about hypertension.
What does it mean when we speak about an association between two categorical variables? Having constructed a two-way table, the next step is to look whether the distribution of one variable differs according to the value of the other variable. We need to calculate either row or column percentages. Often, one variable can be regarded as the response variable, while the other is the explanatory variable, and this should help to decide what percentages are shown If the columns represent the explanatory variable, then column percentages are more appropriate, and vice versa.
Constructing 2-way table As a second step we calculate proportion of hypertensive individuals among manual workers. non-manual workers and in the whole sample Hypertension TotalNoYes (60.0%)96(40.0%)Manual (76.0%)72(24.0%)Non-manual (68.9%)168(31.1%)Total
The numbers in the four categories in the 2way table in the previous slide all called OBSERVED NUMBERS I
The data seem to suggest sane association between hypertension and occLPation (40% of manual workers with hypertension compared to 24% of non-manual workers with hypertension) The calculation and examination of such percentages is an essential step in the analysis of a two-way table, and should always be done before starting formal significance tests.
Significance test the association Although it seemsthat there is an association in the table, the question is whether this may be attributable to samping variability Each of the percentages i1 the table is subject to sampling error, and we need to assess whether the differences between them may be due to chance This is done by conducting a sig1ificance test The null hypothesis is "there is no association between the two variables'
Expected numbers The significance test is Chi-squared test This test canpares the observed numbers in each of four categories of contingency table with the numbers to be expected if there was no difference in proportion of hypertensive indviduals in two occupational groups
Hypertension TotalNoYes Manual 300Non-manual (68.9%)168(31.1%)Total From the table above, the overall proportion of hypertensive individuals is 168/372 (31.1 %). If the null hypothesis were true, the expected number of manual subjects with hypertension is 31.1 % of 240, which is 74.64
Hypertension TotalNoYes Manual 300Non-manual (68.9%)168(31.1%)Total Expected numbers in the other cells of the table can be calculated similarly, using the general formula: Row×Culumn total Expected number= Overall total
HypertensionOBSERVED TotalNoYes Manual Non-manual (68.9%)168(31.1%)Total Next step – compare observed and expected numbers HypertensionEXPECTED TotalNoYes Manual Non-manual (68.9%)168(31.1%)Total
Chi-squared test (X2test) Calculate (O-E)/E for each cell and sum over all cells In our example X 2 = [( ) 2 / ( ) 2 / ( ) 2 / ( ) 2 / ] = 15.97
If X 2 value is large then (O-E) is, in general, large and data do not support H 0 = association If X 2 value is small then (O-E) is, in general, Small and data do support H 0 = no association Large values of X 2 suggest that the data are inconsistent with the null hypothesis, and therefore that there is an associaion between the two variables.
Obtaining p-value Under H o : X 2 distribution
Obtaining p-value The P-value is obtained by referring the calculated value of X 2 to tables of the en-squared distribution. The P-value in this case corresp:mds to the value shown as a in the tables. The degrees of freedom are given by the formula: d.f=(r-1)x(c-1) R= number of rows, c=number of columns
Back to our example: X 2 =15.97 Table 2x2 d.f.=1 and from the table P
Relation with normal test for the difference between 2 proportions: P l -P 2 = 40%-24% = 16% =4.01 Z=(P 1 -P 2 )/SE=16/4.01=3.99 For 2 × 2 table, X 2 = Z 2, i.e = Pvalues for both tests are identical
Quick formula There is a quick formula calculating chi-square test for 2×2 table If we use following notation: Quicker formula for calculating chi-squared test is Total eba fdc Nhg Quicker formula for calculating chi- squared test is d.f=1
Continuity correction The chi-squared test for 2×2 table can be improved by using a continuity correction, called Yates' continuity correction When we compare two p-oportions using the Z test (or the X 2 test), we are making the assumption that the sampling distributions of each of the proportions, and of the difference between them, are approximately normal.
Continuity correction - cont. In fact, the distribution et a proportion is not exactly normal, because it is a dscrete variable, whereas the normal distribution elates to continuous variables.
Continuity correction For large enough samples, the binomial distribution approximates very closely tothe normal distribution This is why we use of the Z and x 2 tests for the comparison of proportions However, the approximation can be improved by making a continuity correction, which adjusts the value of z to allow for the fact thatwe are modelling a discrete variable with a continuous distribution
Yatesc ontinuity correction - formula The effect of the continuity correction is to reduce the value of x 2 ; with smaller samples, the effect is more substantial - which is exactly what we want given that the normal distribution is better approximation of binomial distribution when sample is larger
Exact test for 2×2 tables The exact test to compare 2 proportions is needed when the numbers in the 2 × 2 table are very small (the consensus: if the total number of individuals is less than 40 and one of the cells (or more) is lesslhan 5) It is based on calcLJating exact probabilities of the observed table and of more 'extreme' tables with the same row and column totas Formula and more details - Kirkwood and Sterne, p
Larger tables (r x c tables) No continuity correction Valid if less than 20% of expected numbers are under 5 and none is less than 1 If low expected numbers- combine either rows or columns to overcome this problem
How to calculate expected number in particular cell Row total Column total Expected number= Overall total
Ordered exposures 1X 2 test for trend A special case = there are two rows but more than two columns (or it can be two columns araJ more than two rows) AND an ordered categorical variable is involved (as the variable with more than 2 categories)
EducationM1 TotalUniversitySecondaryPrimary 13220(22.2%)64(32%)48(40%)Yes No Total 321Score for trend test
Example In this 2 x 3 table, the aim is to examine if there is any association between educaton and MI. MI is the outcome of interest Education is the exposure of nterest We want to compare the proportions of subjects with MI in the three education groups.
Standard chi-squared test X2=7.45 Table 2×3-d.f=2 P=0.024 So there is statistical evidence of an association between the two variables. However, when the column variable is an ordered categorical variable, a more sensitive procedure for detecting an association is to use a x 2 test- for-trend
X 2 test – for trend The null hypothesis is the same as for an ordinary 2 × c X 2 test, that there is no association between the two variables. But the test-for-trend is particularly sensitive to an upward or downward trend in proportions across the columns of the table The test is conducted by assigning the numerical scores 1, 2, 3... to the columns of the table. We then calculate the mean scores X 1 and X 2 observed in those with MI and in those without MI.
Mean score among those withoutMI X 1 =[48*1+64*2+20*3]/132=236/132=1.79 Mean score among those without MI X 2 = [72* * 2+ 70* 3]/278= 554/278= 1.99 If there is no difference between the three proportions, two mean scores will be equal. If there is an upward (or downward) trend in the proportions, the mean score in the MJ group will be greater (or less) than that in the no MI group.
We now need to calculate stancard deviation s. Let's assume that we have a sampe of 41 0 points made up of 120 ones, 200 twos and 90 threes. From this sample we find the standard deviation of the whole sample in the usual way. S= Mean is overall mean of ones, twos and threes
Example mean = (120 * *2 +90 *3)/410= 1.93 And S= S=0.71
Chi-squared test for trend obtained by comparing the mean scores for the 2 groups: d.f.=1 p = Where n, and n 2 are total numbers of individuals with and without MI You can see that square root of the formula is formula for Z test for comparing means of 2 samples Z is square root of X 2
You can see that p value is smaller than with ordinary chi- squared test There is even stronger evidence of an association between MI and education The alternative formula for chi-squared test for trend is described by Kirkwood cnd Sterne, p.173176
Chi- squared tests in STATA Let's use data from practical 10 (last session) We try to evaluate whether there is an association between current smoking and age We have age grouped into 4 groups (30-39, 40-49, 50-59, 6069) Smoking (variable smok) was coded 1 =current smokers, 2=non- smokers Firstly, we recode smoking Recode smok 2=0 (we need the outcome to be coded as 0.1)
Lets check proportion of smokers in each age category Tab smok agegroup col 30-39,40-49,50-59,60-69 Total =Yes 0=no Total
Now, standard chi- squared test Tab smok agegroup, col chi 30-39,40-49,50-59,60-69 Total =Yes 0=no Total Pearson chi2(3)= pr=0.000 Degrees of freedom chi-squared test value P<0.001
Chi – squared test for trend COMMAND OUTCOME EXPOSURE Tabodds smok ageroup interval95% confoddscontrolscasesagegroup Test of homogeneity (equal odds): chi2(3)= pr>chi2= Score test for trend of odds: chi2(1)= pr>chi2= This is part that is interesting at themomenti
What have we learnt in the session? 1.Construct 2-way table to examine association between two categorical variatles 2.Conduct Chi-squared test to assess the association between two categorical variables 3.Conduct Chi-squared test for trend for ordered exposures