# Testing for Marginal Independence Between Two Categorical Variables with Multiple Responses

Robert Jeutong

Outline:

- Introduction
  - Kansas Farmer Data
  - Notation
- Modified Pearson Based Statistic
  - Nonparametric Bootstrap
  - Bootstrap p-Value Methods
- Simulation Study
- Conclusion

Introduction: "Pick any" (or "pick any/c") variables are also called multiple-response categorical variables. Survey data arising from multiple-response categorical variable questions present a unique challenge for analysis because of the dependence among the responses provided by individual subjects. Testing for independence between two categorical variables is often of interest. When at least one of the categorical variables can have multiple responses, traditional Pearson chi-square tests for independence should not be used because of the within-subject dependence among responses.

Introduction (contd.): A special kind of independence, called marginal independence, becomes of interest in the presence of multiple-response categorical variables. The purpose of this article is to develop new approaches to the testing of marginal independence between two multiple-response categorical variables. Agresti and Liu (1999) call this a test for simultaneous pairwise marginal independence (SPMI). The proposed tests are extensions of the traditional Pearson chi-square tests for independence between single-response categorical variables.

Kansas Farmer Data: The data come from Loughin (1998) and Agresti and Liu (1999). The survey was conducted by the Department of Animal Sciences at Kansas State University. Two questions in the survey asked Kansas farmers about their sources of veterinary information and their swine waste storage methods. Farmers were permitted to select as many responses as applied from a list of items.

Data (contd.): Interest lies in determining whether sources of veterinary information are independent of waste storage methods, in a similar manner as would be done in a traditional Pearson chi-square test applied to a contingency table with single-response categorical variables. A test for SPMI can be performed to determine whether each source of veterinary information is simultaneously independent of each swine waste storage method.

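Data (contd.): 4 × 5 = 20 different 2 × 2 tables can be formed to marginally summarize all possible responses to item pairs. Independence is tested in each of the 20 2 × 2 tables simultaneously for a test of SPMI. For example, the table for the lagoon and professional consultant items is:

| | Professional consultant = 1 | Professional consultant = 0 |
|---|---|---|
| Lagoon = 1 | 34 | 109 |
| Lagoon = 0 | 10 | 126 |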
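As a small illustration of what is computed in each of the 20 tables, the sketch below evaluates the ordinary Pearson chi-square statistic (no continuity correction) for a single 2 × 2 table, using the lagoon by professional consultant counts read from the slide. The function name `pearson_chi2_2x2` is hypothetical, and the snippet is only a sketch of one table's contribution, not the full SPMI test.

```python
def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut formula for 2x2 tables: n (ad - bc)^2 / (product of the four margins)
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Counts as read from the slide: rows are Lagoon = 1, 0;
# columns are Professional consultant = 1, 0.
lagoon_consultant = (34, 109, 10, 126)
x2 = pearson_chi2_2x2(*lagoon_consultant)
print(round(x2, 3))
```

Within a single 2 × 2 table each farmer falls in exactly one cell, so the per-table statistic is well defined; the difficulty addressed by the SPMI test is that the 20 tables are built from the same farmers and are therefore dependent.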

Data (contd.): The test is marginal because responses are summed over the other item choices for each of the multiple-response categorical variables. If SPMI is rejected, examination of the individual 2 × 2 tables can follow to determine why the rejection occurs.

Notation: Let $W$ and $Y$ denote the multiple-response categorical variables that form the row and column variables, respectively, of an $r \times c$ table. Sources of veterinary information are denoted by $Y$ and waste storage methods are denoted by $W$. The categories of each multiple-response categorical variable are called items (Agresti and Liu, 1999); for example, lagoon is one of the items for waste storage method. Suppose $W$ has $r$ items and $Y$ has $c$ items. Also, suppose $n$ subjects are sampled at random.

Notation (contd.): Let $W_{si} = 1$ if a positive response is given for item $i$ by subject $s$, for $i = 1, \ldots, r$ and $s = 1, \ldots, n$; $W_{si} = 0$ for a negative response. Let $Y_{sj}$, for $j = 1, \ldots, c$ and $s = 1, \ldots, n$, be similarly defined. The abbreviated notation, $W_i$ and $Y_j$, refers generally to the binary response random variable for item $i$ and $j$, respectively. The sets of correlated binary item responses for subject $s$ are $\mathbf{Y}_s = (Y_{s1}, Y_{s2}, \ldots, Y_{sc})$ and $\mathbf{W}_s = (W_{s1}, W_{s2}, \ldots, W_{sr})$.

Notation (contd.): Cell counts in the joint table are denoted by $n_{gh}$ for the $g$th possible $(W_1, \ldots, W_r)$ and $h$th possible $(Y_1, \ldots, Y_c)$. The corresponding probability is denoted by $\tau_{gh}$. Multinomial sampling is assumed to occur within the entire joint table; thus, $\sum_{g,h} \tau_{gh} = 1$. Let $m_{ij}$ denote the number of observed positive responses to both $W_i$ and $Y_j$. The marginal probability of a positive response to both $W_i$ and $Y_j$ is denoted by $\pi_{ij}$, and its maximum likelihood estimate (MLE) is $m_{ij}/n$.
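To make the notation concrete, here is a minimal sketch that computes the pairwise counts m_ij and the estimates m_ij / n from per-subject response vectors. The data are a hypothetical toy example (not the Kansas farmer data), and `pairwise_positive_counts` is an illustrative name.

```python
def pairwise_positive_counts(W, Y):
    """Return m, where m[i][j] counts subjects with W_si = 1 and Y_sj = 1."""
    r, c = len(W[0]), len(Y[0])
    m = [[0] * c for _ in range(r)]
    for w_s, y_s in zip(W, Y):          # one pass per subject s
        for i in range(r):
            for j in range(c):
                m[i][j] += w_s[i] * y_s[j]
    return m

# Hypothetical toy data: n = 4 subjects, r = 2 items for W, c = 3 items for Y.
W = [(1, 0), (1, 1), (0, 1), (1, 0)]
Y = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 0)]

n = len(W)
m = pairwise_positive_counts(W, Y)
pi_hat = [[m_ij / n for m_ij in row] for row in m]  # estimates m_ij / n
print(m)
print(pi_hat)
```

Note that a subject who selects several items contributes to several entries of `m`, which is why the entries of `m` do not sum to n.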

Joint Table

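SPMI Defined in Hypothesis:

- $H_0$: $\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$ for $i = 1, \ldots, r$ and $j = 1, \ldots, c$
- $H_a$: At least one equality does not hold

where $\pi_{ij} = P(W_i = 1, Y_j = 1)$, $\pi_{i\cdot} = P(W_i = 1)$, and $\pi_{\cdot j} = P(Y_j = 1)$. This specifies marginal independence between each $W_i$ and $Y_j$ pair. The cell probabilities for each pair are:

- $P(W_i = 1, Y_j = 1) = \pi_{ij}$
- $P(W_i = 1, Y_j = 0) = \pi_{i\cdot} - \pi_{ij}$
- $P(W_i = 0, Y_j = 1) = \pi_{\cdot j} - \pi_{ij}$
- $P(W_i = 0, Y_j = 0) = 1 - \pi_{i\cdot} - \pi_{\cdot j} + \pi_{ij}$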

Hypothesis: SPMI can be written as $OR_{WY,ij} = 1$ for $i = 1, \ldots, r$ and $j = 1, \ldots, c$, where OR is the abbreviation for odds ratio and

$$OR_{WY,ij} = \frac{\pi_{ij}\,(1 - \pi_{i\cdot} - \pi_{\cdot j} + \pi_{ij})}{(\pi_{i\cdot} - \pi_{ij})(\pi_{\cdot j} - \pi_{ij})}$$

with $\pi_{i\cdot} = P(W_i = 1)$ and $\pi_{\cdot j} = P(Y_j = 1)$. Therefore, SPMI represents simultaneous independence in the $rc$ 2 × 2 pairwise item response tables formed for each $W_i$ and $Y_j$ pair. Joint independence implies SPMI, but the reverse is not true.
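The odds ratio form of the hypothesis can be checked numerically. In the sketch below, `pi_w` and `pi_y` stand for the marginal probabilities P(W_i = 1) and P(Y_j = 1), and the probability values are made up for illustration.

```python
def odds_ratio(pi_ij, pi_w, pi_y):
    """Odds ratio of the 2x2 table for one (W_i, Y_j) pair."""
    p11 = pi_ij                      # P(W_i = 1, Y_j = 1)
    p10 = pi_w - pi_ij               # P(W_i = 1, Y_j = 0)
    p01 = pi_y - pi_ij               # P(W_i = 0, Y_j = 1)
    p00 = 1 - pi_w - pi_y + pi_ij    # P(W_i = 0, Y_j = 0)
    return (p11 * p00) / (p10 * p01)

# Under marginal independence pi_ij is the product of the marginals,
# so the odds ratio equals 1 (up to floating-point error).
or_null = odds_ratio(0.4 * 0.3, 0.4, 0.3)

# A hypothetical departure from independence: pi_ij larger than 0.4 * 0.3.
or_alt = odds_ratio(0.2, 0.4, 0.3)
print(or_null, or_alt)
```

The function assumes both off-diagonal cell probabilities are positive; a table with an empty off-diagonal cell has an undefined (infinite) odds ratio.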

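Modified Pearson Statistic Under the Null: Each $(W_i, Y_j)$ pair takes one of the values (1,1), (1,0), (0,1), (0,0), with probabilities:

| | $Y_j = 1$ | $Y_j = 0$ | Total |
|---|---|---|---|
| $W_i = 1$ | $\pi_{ij}$ | $\pi_{i\cdot} - \pi_{ij}$ | $\pi_{i\cdot}$ |
| $W_i = 0$ | $\pi_{\cdot j} - \pi_{ij}$ | $1 - \pi_{i\cdot} - \pi_{\cdot j} + \pi_{ij}$ | $1 - \pi_{i\cdot}$ |
| Total | $\pi_{\cdot j}$ | $1 - \pi_{\cdot j}$ | |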

The Statistic
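A minimal sketch of the statistic, taking X²_S to be the sum of the rc Pearson chi-square statistics from the pairwise 2 × 2 tables (as the interpretation slide describes it). Each term uses the standard closed form of the Pearson statistic for a 2 × 2 table written in terms of estimated proportions. The function name and the handling of degenerate tables are assumptions made for illustration.

```python
def modified_pearson(W, Y):
    """Return (X2_S, parts), where parts[i][j] is the Pearson statistic
    for the 2x2 table of item W_i versus item Y_j."""
    n, r, c = len(W), len(W[0]), len(Y[0])
    p_w = [sum(w[i] for w in W) / n for i in range(r)]   # estimates of P(W_i = 1)
    p_y = [sum(y[j] for y in Y) / n for j in range(c)]   # estimates of P(Y_j = 1)
    parts = [[0.0] * c for _ in range(r)]
    for i in range(r):
        for j in range(c):
            p_wy = sum(w[i] * y[j] for w, y in zip(W, Y)) / n  # m_ij / n
            denom = p_w[i] * (1 - p_w[i]) * p_y[j] * (1 - p_y[j])
            # Assumption: an item chosen by everyone or by no one gives a
            # degenerate table, and its term is set to 0 here.
            if denom > 0:
                parts[i][j] = n * (p_wy - p_w[i] * p_y[j]) ** 2 / denom
    return sum(map(sum, parts)), parts

# Hypothetical toy data with r = c = 1, equivalent to the 2x2 table [[3, 1], [1, 3]].
W = [(1,)] * 4 + [(0,)] * 4
Y = [(1,)] * 3 + [(0,)] + [(1,)] + [(0,)] * 3
x2_s, parts = modified_pearson(W, Y)
print(x2_s)
```

Because the rc terms are computed from the same subjects, their sum does not follow a simple chi-square distribution, which is what motivates the bootstrap approaches.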

Nonparametric Bootstrap: To resample under independence of $W$ and $Y$, $\mathbf{W}_s$ and $\mathbf{Y}_s$ are independently resampled with replacement from the data set. The test statistic calculated for the $b$th resample of size $n$ is denoted by $X^2_{S,b}$. The p-value is calculated as

$$B^{-1} \sum_{b} I\left(X^2_{S,b} \ge X^2_S\right)$$

where $B$ is the number of resamples taken and $I(\cdot)$ is the indicator function.
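The resampling scheme can be sketched as follows. The W vectors and Y vectors are drawn separately, which breaks any association between W and Y while keeping the within-subject dependence among the items of each variable. The helper `x2_s`, the toy data, and the treatment of degenerate resampled tables are assumptions for illustration only.

```python
import random

def x2_s(W, Y):
    """Sum of the r*c pairwise Pearson statistics (sketch of X^2_S)."""
    n, r, c = len(W), len(W[0]), len(Y[0])
    total = 0.0
    for i in range(r):
        p_w = sum(w[i] for w in W) / n
        for j in range(c):
            p_y = sum(y[j] for y in Y) / n
            p_wy = sum(w[i] * y[j] for w, y in zip(W, Y)) / n
            denom = p_w * (1 - p_w) * p_y * (1 - p_y)
            if denom > 0:  # assumption: degenerate tables contribute 0
                total += n * (p_wy - p_w * p_y) ** 2 / denom
    return total

def bootstrap_p_value(W, Y, B=1000, seed=0):
    """Proportion of resampled statistics at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(W)
    observed = x2_s(W, Y)
    exceed = 0
    for _ in range(B):
        # Independent draws of whole response vectors, with replacement
        W_star = [W[rng.randrange(n)] for _ in range(n)]
        Y_star = [Y[rng.randrange(n)] for _ in range(n)]
        exceed += x2_s(W_star, Y_star) >= observed
    return exceed / B

# Hypothetical toy data: Y copies W, so W_1/Y_1 and W_2/Y_2 are perfectly associated.
W = [(s % 2, (s // 2) % 2) for s in range(40)]
Y = list(W)
p_value = bootstrap_p_value(W, Y, B=200)
print(p_value)
```

Resampling whole vectors rather than individual item responses is the design choice that preserves the dependence among a subject's answers.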

Bootstrap p-Value Combination Methods: Each $X^2_{S,i,j}$ gives a test for independence between each $W_i$ and $Y_j$ pair for $i = 1, \ldots, r$ and $j = 1, \ldots, c$. The p-values from each of these tests (using a $\chi^2_1$ approximation) can be combined to form a new statistic $\tilde{p}$. The product of the $r \times c$ p-values or the minimum of the $r \times c$ p-values could be used as $\tilde{p}$. The p-value is calculated as

$$B^{-1} \sum_{b} I\left(\tilde{p}^{*}_{b} \le \tilde{p}\right)$$

where $\tilde{p}^{*}_{b}$ is the combined p-value from the $b$th resample.
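A sketch of the combination approach, under the same resampling scheme. Each pairwise statistic is converted to a p-value with the one-degree-of-freedom chi-square tail probability, which equals `erfc(sqrt(x / 2))`, and the p-values are combined by product or minimum. Function names, the toy data, and the handling of degenerate tables are assumptions for illustration.

```python
import math
import random

def pairwise_p_values(W, Y):
    """Chi-square (1 df) p-values for all r*c pairwise 2x2 tables."""
    n, r, c = len(W), len(W[0]), len(Y[0])
    p_values = []
    for i in range(r):
        p_w = sum(w[i] for w in W) / n
        for j in range(c):
            p_y = sum(y[j] for y in Y) / n
            p_wy = sum(w[i] * y[j] for w, y in zip(W, Y)) / n
            denom = p_w * (1 - p_w) * p_y * (1 - p_y)
            # Assumption: a degenerate table gets statistic 0, i.e. p-value 1.
            x2 = n * (p_wy - p_w * p_y) ** 2 / denom if denom > 0 else 0.0
            # P(chi-square with 1 df > x2) = erfc(sqrt(x2 / 2))
            p_values.append(math.erfc(math.sqrt(x2 / 2)))
    return p_values

def combined_bootstrap_p(W, Y, combine=math.prod, B=1000, seed=0):
    """Bootstrap p-value for a combined statistic (math.prod or min)."""
    rng = random.Random(seed)
    n = len(W)
    p_tilde = combine(pairwise_p_values(W, Y))
    hits = 0
    for _ in range(B):
        W_star = [W[rng.randrange(n)] for _ in range(n)]
        Y_star = [Y[rng.randrange(n)] for _ in range(n)]
        # Small combined p-values are evidence against SPMI, hence "<=".
        hits += combine(pairwise_p_values(W_star, Y_star)) <= p_tilde
    return hits / B

# Hypothetical toy data: Y copies W, so two of the four item pairs are strongly associated.
W = [(s % 2, (s // 2) % 2) for s in range(40)]
Y = list(W)
p_product = combined_bootstrap_p(W, Y, combine=math.prod, B=200)
p_minimum = combined_bootstrap_p(W, Y, combine=min, B=200)
print(p_product, p_minimum)
```

The product pools evidence across all pairs, while the minimum reacts to the single strongest pairwise association.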

Results from the Farmer Data (my p-value, followed by the authors' p-value where both appear):

- Bootstrap $X^2_S$: 0.0001, <0.0001
- Bootstrap product of p-values: 0.0001
- Bootstrap minimum p-value: 0.0047, 0.0034

Interpretation and Follow-Up: The p-values show strong evidence against SPMI. Since $X^2_S$ is the sum of $rc$ different Pearson chi-square test statistics, each $X^2_{S,i,j}$ can be used to measure why SPMI is rejected. The individual tests can be done using an asymptotic $\chi^2_1$ approximation or the estimated sampling distribution of the individual statistics calculated in the proposed bootstrap procedures. When this is done, the significant combinations are:

- (Lagoon, Professional consultant)
- (Lagoon, Veterinarian)
- (Pit, Veterinarian)
- (Pit, Feed companies & representatives)
- (Natural drainage, Professional consultant)
- (Natural drainage, Magazines)

Simulation Study: The simulation study examines which testing procedures hold the correct size under a range of different situations and have power to detect various alternative hypotheses. 500 data sets are generated for each simulation setting investigated. The SPMI testing methods are applied (B = 1000), and for each method the proportion of data sets for which SPMI is rejected at the 0.05 nominal level is recorded.
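When judging whether a method "holds the correct size", it helps to know how much an estimated rejection rate can vary by chance with 500 data sets. The sketch below uses the binomial standard error and a band of plus or minus two standard errors around the nominal level; this rule of thumb is an assumption added here, not something stated on the slides.

```python
import math

def size_tolerance(alpha=0.05, n_datasets=500, z=2.0):
    """Band alpha +/- z * SE for an estimated rejection rate under the null."""
    se = math.sqrt(alpha * (1 - alpha) / n_datasets)  # binomial standard error
    return alpha - z * se, alpha + z * se

def rejection_rate(p_values, alpha=0.05):
    """Proportion of simulated data sets for which the test rejects at level alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

low, high = size_tolerance()
print(round(low, 4), round(high, 4))

# Hypothetical p-values from four simulated data sets, for illustration only.
rate = rejection_rate([0.01, 0.20, 0.73, 0.04])
print(rate)
```

An estimated rejection rate inside the band is compatible with a true size of 0.05, given the Monte Carlo error from 500 data sets.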

My Results: n = 100, 2 × 2 marginal table, OR = 25

| Method | My rejection rate | Authors' rejection rate |
|---|---|---|
| Bootstrap $X^2_S$ | 0.04 | 0.056 |
| Bootstrap product of p-values | 0.042 | 0.056 |
| Bootstrap minimum p-value | 0.036 | 0.044 |

Conclusion: The bootstrap methods generally hold the correct size.
