# Contingency Table Analysis


contingency tables show frequencies produced by cross-classifying observations

e.g., pottery described simultaneously according to vessel form & surface decoration:

| | polished | burnished | matte |
|---|---|---|---|
| bowl | 47 | 28 | 3 |
| jar | 30 | 42 | 8 |
| olla | 6 | 45 | 25 |

- most statistical tests for tables are designed for analyzing 2 dimensions
  - only examine the interaction of two variables at one time…
- most efficient when used with nominal data
  - using ratio data means recoding data to a lower scale of measurement (e.g., ordinal)
  - means ignoring some of the information originally available…

- still, you might do this, particularly if you are interested in association between metric and non-metric variables
- e.g.: variation in pot size vs. surface decoration…
  - may decide to divide pot size into ordinal classes…

pot size classes: large / medium / small, or large / small

rim diameter (two size classes) by slip, counts per class: specular 4, 13; non-specular 15, 18

- other options may let you retain more of the original information content
- e.g., comparing rim diameters for non-specular slip vs. specular slip:
  - could use a t-test to test the equality of the means
  - makes full use of ratio data…

why do we work with contingency tables??

- because we think there may be some kind of interaction between the variables…
- basic question: can the state of one variable be predicted from the state of another variable?
- if not, they are independent

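expected counts

- a baseline to which observed counts can be compared
- counts that would occur by random chance if the variables are independent, over the long run
- for any cell: E = (col total * row total) / table total

observed counts, with expected counts in parentheses:

| | M | F | Total | |
|---|---|---|---|---|
| PP | 4 (2.3) | 1 (2.7) | 5 | 45% |
| Pot | 1 (2.7) | 5 (3.3) | 6 | 55% |
| Total | 5 | 6 | 11 | |
| | 45% | 55% | | |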
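The expected-count rule can be sketched in R for the 2 x 2 table above. This is a minimal illustration, not part of the original slides; the object names `obs` and `expected` are assumptions chosen for the example.

```r
# Observed counts from the table above (rows: PP, Pot; columns: M, F)
obs <- matrix(c(4, 1,
                1, 5),
              nrow = 2, byrow = TRUE,
              dimnames = list(goods = c("PP", "Pot"), sex = c("M", "F")))

# E = (row total * column total) / table total, computed for every cell at once
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
print(round(expected, 2))
```

`chisq.test(obs)$expected` should return the same matrix; the explicit `outer()` call just makes the formula visible.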

- significance = probability of getting, by chance, a table as or more deviant than the observed table, if the variables are independent
  - "deviant" defined in terms of expected table
- no causality is necessarily implied by the outcome
  - but, causality may well be the reason for observed association…
  - e.g.: grave goods and sex

Fisher's Exact Test

- just for 2 x 2 tables
- useful where chi-square test is inappropriate
- gives the exact probability of all tables with the same marginal totals as or more deviant than the observed table…

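for a 2 x 2 table with cells a, b (top row) and c, d (bottom row), and N = a + b + c + d:

$$P = \frac{(a+b)!\,(a+c)!\,(b+d)!\,(c+d)!}{N!\,a!\,b!\,c!\,d!}$$

for the observed table (a = 4, b = 1, c = 1, d = 5):

$$P = \frac{5!\,5!\,6!\,6!}{11!\,4!\,1!\,1!\,5!} = \frac{5 \cdot 6!\,6!}{11!} = \frac{5 \cdot 6!}{11 \cdot 10 \cdot 9 \cdot 8 \cdot 7} = \frac{3600}{55440} = 0.065$$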

use R (or Excel) if the counts aren't too large; with `x` holding the 2 x 2 table (a = 4, b = 1, c = 1, d = 5):

`> fisher.test(x)`

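all possible tables with the observed marginal totals (row totals 5 and 6, column totals 5 and 6, N = 11), and their probabilities:

| a | b | c | d | P |
|---|---|---|---|---|
| 0 | 5 | 5 | 1 | 0.013 |
| 1 | 4 | 4 | 2 | 0.162 |
| 2 | 3 | 3 | 3 | 0.433 |
| 3 | 2 | 2 | 4 | 0.325 |
| 4 | 1 | 1 | 5 | 0.065 (observed) |
| 5 | 0 | 0 | 6 | 0.002 |

expected counts: 2.3, 2.7 / 2.7, 3.3

P = 0.065 + 0.002 = 0.067, or P = 0.067 + 0.013 = 0.080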
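The six table probabilities can be reproduced with the hypergeometric distribution. The following is a sketch, assuming the upper-left cell `a` is treated as the number of "successes" drawn; the variable names are illustrative only.

```r
# Margins of the observed table: row totals 5 and 6, column totals 5 and 6
a <- 0:5                                  # possible values of the upper-left cell
# dhyper(x, m, n, k): x successes in k draws from m successes and n failures
p <- dhyper(a, m = 5, n = 6, k = 5)
names(p) <- paste0("a=", a)
print(round(p, 3))

p_obs <- p[a == 4]                        # probability of the observed table
one_tailed <- sum(p[a >= 4])              # observed table plus the more extreme one
# two-tailed: every table no more probable than the observed one
two_tailed <- sum(p[p <= p_obs * (1 + 1e-7)])
print(c(one_tailed = one_tailed, two_tailed = two_tailed))
```

The small tolerance factor in the two-tailed sum guards against floating-point ties; it does not change which tables are included here.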

- 2-tailed test = 0.067 + 0.013 = 0.080
- 1-tailed test = 0.065 + 0.002 = 0.067

in R:

`> fisher.test(x, alt = "two.sided")`

`> fisher.test(x, alt = "greater")` [i.e.: H1: odds ratio > 1]
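A runnable version of the two calls, as a sketch: the matrix `x` is built here from the observed counts (4, 1 / 1, 5), and the dimnames are labels taken from the slide table.

```r
# Observed 2 x 2 table (rows: PP, Pot; columns: M, F)
x <- matrix(c(4, 1,
              1, 5),
            nrow = 2, byrow = TRUE,
            dimnames = list(goods = c("PP", "Pot"), sex = c("M", "F")))

two_sided <- fisher.test(x, alternative = "two.sided")
greater   <- fisher.test(x, alternative = "greater")   # H1: odds ratio > 1

print(round(two_sided$p.value, 3))   # 2-tailed probability
print(round(greater$p.value, 3))     # 1-tailed probability
```

Note that the alternative must be passed as a quoted string; `alt` works only because R matches partial argument names.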

CHI-SQUARE

Chi-square Statistic

- an aggregate measure (i.e., based on the entire table): the sum over all cells of (o-e)²/e
- the greater the deviation from expected values, the larger (quadratically!) the chi-square statistic…
- one could devise others that would place less emphasis on large deviations, e.g., |o-e|/e

- X² is distributed approximately in accord with the χ² probability distribution
- X² probabilities are traditionally found in a table showing threshold values from a CPD
  - need degrees of freedom
  - df = total number of cells that can be varied without changing marginal totals
  - df = (r-1)*(c-1)
- just use R…
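In R the table lookup reduces to two calls. This is a sketch: the statistic 4.84 is the value from the survey example later in this transcript, and the 2 x 2 dimensions are assumed for illustration.

```r
r <- 2; c_ <- 2                       # table dimensions (assumed 2 x 2 here)
df <- (r - 1) * (c_ - 1)              # degrees of freedom

crit <- qchisq(0.95, df)              # threshold value for alpha = .05
p    <- pchisq(4.84, df, lower.tail = FALSE)   # upper-tail probability of X^2 = 4.84

print(round(crit, 2))
print(round(p, 3))
```

`qchisq()` replaces the printed table of threshold values; `pchisq(..., lower.tail = FALSE)` gives the probability directly.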

worked example for one cell: expected count e = (43*24)/91; cell contribution (7-11.8)²/11.8 = 2.025

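X² assumptions & problems

- must be based on counts:
  - not percentages, ratios or weighted data
- fails to give reliable results if expected counts are too low:

observed counts, with expected counts in parentheses:

| | | | Total |
|---|---|---|---|
| | 2 (2.27) | 3 (2.73) | 5 |
| | 3 (2.73) | 3 (3.27) | 6 |
| Total | 5 | 6 | 11 |

P(X²) = 0.74 (X² is about 0.11); P(Fisher's) = 1.0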

rules of thumb

1. no expected counts less than 5
   - almost certainly too stringent
2. no expected counts less than 2, and 80% of expected counts > 5
   - more relaxed (but more realistic)

collapsing tables

- can often combine columns/rows to increase expected counts that are too low
  - may increase or reduce interpretability
  - may create or destroy structure in the table
- no clear guidelines
  - avoid simply trying to identify the combination of cells that produces a significant result

[tables: observed counts and expected counts, shown for two tables]

- chi-square is basically a measure of significance
- it is not a good measure of strength of association
- can help you decide if a relationship exists, but not how strong it is

[two tables compared: X² = 1.07, P = .30; X² = 2.13, P = .14]

- also, chi-square is a global statistic
- says nothing (directly) about which parts of a table may be driving a large chi-square statistic
- chi-square contributions from individual cells can help:
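Per-cell contributions are easy to extract in R. The sketch below uses the rich/poor burial counts from the mortuary example in this transcript (Shennan); the object names are illustrative, and `suppressWarnings()` is used only because one expected count is small.

```r
# Burial counts by age class and wealth (from the mortuary example)
burials <- matrix(c( 6, 23,
                     8, 21,
                    11, 25,
                    29, 36,
                    19, 27,
                     3,  4),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(age = c("Infans I", "Infans II", "Juvenilis",
                                          "Adultus", "Maturus", "Senilis"),
                                  wealth = c("Rich", "Poor")))

res <- suppressWarnings(chisq.test(burials))

# Contribution of each cell: (o - e)^2 / e
contrib <- (res$observed - res$expected)^2 / res$expected
print(round(contrib, 2))

# The contributions add up to the overall statistic
print(c(sum_of_cells = sum(contrib), X2 = unname(res$statistic)))
```

The largest entries in `contrib` point to the cells that drive the overall statistic; `res$residuals` gives the signed square roots of the same quantities.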

Monte Carlo test of X² significance

- based on simulated generation of cell-counts under imposed conditions of independence
- randomly assign counts to cells
- significance is simply the proportion of outcomes that produced a X² statistic >= observed
- not based on any assumptions about the distribution of the X² statistic
- overcomes the problems associated with small expected frequencies
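A minimal sketch of such a test in R, using the burial counts from the mortuary example in this transcript. It assumes one particular simulation model (random tables with both sets of marginal totals fixed, via `r2dtable()`); other models are possible, and the number of simulations (2000) is an arbitrary choice.

```r
set.seed(1)   # for reproducibility of the simulation

burials <- matrix(c(6, 23, 8, 21, 11, 25, 29, 36, 19, 27, 3, 4),
                  ncol = 2, byrow = TRUE)

# Expected counts and observed statistic
expd <- outer(rowSums(burials), colSums(burials)) / sum(burials)
obs_stat <- sum((burials - expd)^2 / expd)

# Random tables with the observed row and column totals (assumed null model)
sims <- r2dtable(2000, rowSums(burials), colSums(burials))
sim_stat <- sapply(sims, function(tab) sum((tab - expd)^2 / expd))

# Significance = proportion of simulated statistics >= observed
p_mc <- mean(sim_stat >= obs_stat)

# Built-in equivalent (uses the same fixed-margins model)
p_builtin <- chisq.test(burials, simulate.p.value = TRUE, B = 2000)$p.value
print(c(manual = p_mc, builtin = p_builtin))
```

The two estimates will differ slightly because each is based on its own random sample of tables.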


TWOWAY

G Test

- a measure of significance for any r x c table
- look up critical values of G² in an ordinary chi-square table; figure out degrees of freedom the same way
- conforms to chi-square distribution better than the chi-square statistic

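an R function for G²:

```r
# G-test (likelihood-ratio chi-square) for an r x c table of counts
gsq.test <- function(obs) {
  df  <- (nrow(obs) - 1) * (ncol(obs) - 1)
  exp <- chisq.test(obs)$expected       # expected counts under independence
  G   <- 2 * sum(obs * log(obs / exp))  # assumes no zero cells
  pchisq(G, df, lower.tail = FALSE)     # upper-tail probability of G
}
```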

ASSOCIATION

Measures of Association

Phi-Square (φ²)

- an attempt to remove the effects of sample size that makes chi-square inappropriate for measuring association
- divide chi-square by n: φ² = X²/n
- limits:
  - 0: variables are independent
  - 1: perfect association in a 2x2 table; can exceed 1 in larger tables (the maximum is min(r-1, c-1))

e.g., φ² = 0.18

Cramér's V

- also a measure of strength of association
- an attempt to standardize phi-square (i.e., control the lack of a fixed upper boundary in tables larger than 2x2 cells)
- V = √(φ²/m), where m = min(r-1, c-1) (i.e., the smaller of rows-1 or columns-1)
- limits: 0-1 for any size table; 1 = highest possible association
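Both measures follow directly from the uncorrected chi-square statistic. The sketch below assumes the burial-position counts implied by the odds quoted later in this transcript (29/11 and 14/33), and it assumes the usual square-root form of Cramér's V; names such as `phi2` are illustrative.

```r
# Burial side by sex; counts inferred from the odds 29/11 and 14/33 (assumption)
burial <- matrix(c(29, 14,
                   11, 33),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(side = c("RHS", "LHS"), sex = c("M", "F")))

# Uncorrected chi-square (no continuity correction)
X2 <- unname(chisq.test(burial, correct = FALSE)$statistic)
n  <- sum(burial)

phi2 <- X2 / n                                   # phi-square
m    <- min(nrow(burial) - 1, ncol(burial) - 1)
V    <- sqrt(phi2 / m)                           # Cramer's V (square-root form)

print(round(c(phi2 = phi2, V = V), 2))
```

For a 2 x 2 table m = 1, so V is simply the square root of φ².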

Yule's Q

- often used to assess the strength of presence / absence association
- for a 2 x 2 table with cells a, b / c, d: Q = (ad - bc) / (ad + bc)
- range is -1 (perfect negative association) to 1 (perfect positive association); values near 0 indicate a lack of association
- e.g., Q = -.72
- not sensitive to marginal changes (unlike φ²)
  - multiply a row or column by a constant; cancels out… (Q = .65 for both tables)
- can't distinguish between different degrees of complete association
- can't distinguish between complete and absolute association
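A small sketch of Yule's Q in R, using the standard formula Q = (ad - bc)/(ad + bc). The function name `yules_q` is hypothetical, and the counts are the burial-position counts implied by the odds quoted later in this transcript (an assumption). The second call illustrates the insensitivity to multiplying a row by a constant.

```r
# Yule's Q for a 2 x 2 table of counts
yules_q <- function(tab) {
  n11 <- tab[1, 1]; n12 <- tab[1, 2]
  n21 <- tab[2, 1]; n22 <- tab[2, 2]
  (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)
}

burial <- matrix(c(29, 14,
                   11, 33),
                 nrow = 2, byrow = TRUE)

q_original <- yules_q(burial)

# Multiply the first row by a constant: the constant cancels out of Q
scaled <- burial
scaled[1, ] <- scaled[1, ] * 10
q_scaled <- yules_q(scaled)

print(round(c(original = q_original, scaled = q_scaled), 2))
```

The sign of Q depends on how the rows and columns are ordered; swapping two rows flips it.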

percent association

- works best for 2 x 2 tables
- 0 = no association
- 1 or -1 = perfect association
- simple to calculate

| | M | F | Total |
|---|---|---|---|
| RHS | 0.67 | 0.33 | 1.00 |
| LHS | 0.25 | 0.75 | 1.00 |

difference: 0.67 - 0.25 = 0.42; asymmetric

odds ratio

- easiest with 2 x 2 tables
- what are the odds of a man being buried on his right side, compared to those of a woman??
- if there is a strong level of association between sex and burial position, the odds should be quite different…

for a table with cells a, b (top row) and c, d (bottom row):

$$\text{odds ratio} = \frac{a/c}{b/d}$$

- odds for men: 29/11 = 2.64
- odds for women: 14/33 = 0.42
- odds ratio: 2.64/0.42 = 6.21 (using the unrounded odds)
- if there is no association, the odds ratio = 1
- departures from 1 range between 0 and infinity
  - >1 = positive association
  - <1 = negative association
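The same arithmetic in R, as a sketch. The table layout (rows RHS/LHS, columns M/F) is inferred from the odds quoted above, and the object names are illustrative.

```r
# Burial side by sex, built from the counts used in the odds above
burial <- matrix(c(29, 14,
                   11, 33),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(side = c("RHS", "LHS"), sex = c("M", "F")))

odds_m <- burial["RHS", "M"] / burial["LHS", "M"]   # 29/11
odds_f <- burial["RHS", "F"] / burial["LHS", "F"]   # 14/33
odds_ratio <- odds_m / odds_f                       # (a/c) / (b/d)

print(round(c(odds_m = odds_m, odds_f = odds_f, odds_ratio = odds_ratio), 2))
```

`fisher.test(burial)$estimate` also reports an odds ratio, but it is a conditional maximum-likelihood estimate and will not match this simple cross-product value exactly.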

Goodman and Kruskal's Tau (τ)

- proportional reduction of error
- how are the probabilities of correctly assigning cases to one set of categories improved by the knowledge of another set of categories??
- limits are 0-1; 1 = perfect association
- same results as φ² w/ 2x2 table
- sensitive to margin differences
- asymmetric
  - get different results predicting row assignments based on columns than from column assignments based on rows

$$\tau = \frac{P(\text{error} \mid \text{rule 1}) - P(\text{error} \mid \text{rule 2})}{P(\text{error} \mid \text{rule 1})}$$

- rule 1: random assignments to one variable are made with no knowledge of 2nd variable
- rule 2: random assignments to one variable are made with knowledge of 2nd variable

rule 1 (only the totals for B are known):

| | B1 | B2 | Total |
|---|---|---|---|
| A1 | | | |
| A2 | | | |
| Total | 6 | 14 | 20 |

rule 2 (A is known):

| | B1 | B2 | Total |
|---|---|---|---|
| A1 | 6 | 0 | 6 |
| A2 | 0 | 14 | 14 |
| Total | 6 | 14 | 20 |
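The two error rates can be computed from a table of counts. This is a sketch under the assumption that "random assignment" means assigning cases in proportion to the known totals; the function name `gk_tau` is hypothetical, and it predicts the column variable (B) from the row variable (A).

```r
# Goodman and Kruskal's tau, predicting column category from row category
gk_tau <- function(tab) {
  n <- sum(tab)
  # rule 1: assign cases to columns using only the column totals
  err1 <- 1 - sum((colSums(tab) / n)^2)
  # rule 2: assign cases to columns within each row, using the row's own counts
  err2 <- 1 - sum(tab^2 / rowSums(tab)) / n
  (err1 - err2) / err1
}

# Perfect association: knowing A removes all error in predicting B
perfect <- matrix(c(6,  0,
                    0, 14),
                  nrow = 2, byrow = TRUE)
print(gk_tau(perfect))

# For a 2 x 2 table tau equals phi-square (burial counts assumed from the odds example)
burial <- matrix(c(29, 14,
                   11, 33),
                 nrow = 2, byrow = TRUE)
print(round(gk_tau(burial), 2))
```

Calling `gk_tau(t(tab))` predicts rows from columns instead; in tables larger than 2 x 2 the two directions generally give different values.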

TABLE STANDARDIZATION

Table Standardization

- even very large and highly significant X² (or G²) statistics don't necessarily mean that all parts of the table are equally deviant (and therefore interesting)
- usually need to do other things to highlight loci of association or interaction
  - which cells diverge the most from expected values?
  - very difficult to decide when both row and column totals are unequal…

Percent standardization

- highly intuitive approach, easy to interpret
- often used to control the effects of sample-size variation
- have to decide if it makes better sense to standardize based on rows, or on columns
- usually, you want to standardize whatever it is you want to compare
  - i.e., if you want to compare columns, base percents on column totals
- you may decide to make two tables, one standardized on rows, the other on columns…
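Row and column standardization are one call each in R. A sketch, using the burial-position counts behind the percent-association table earlier in this transcript (counts inferred from the quoted odds, so treat them as an assumption):

```r
burial <- matrix(c(29, 14,
                   11, 33),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(side = c("RHS", "LHS"), sex = c("M", "F")))

by_row <- prop.table(burial, margin = 1)   # each row sums to 1
by_col <- prop.table(burial, margin = 2)   # each column sums to 1

print(round(by_row, 2))
print(round(by_col, 2))

# Percent difference from the row-standardized table (men, RHS vs. LHS)
diff_m <- by_row["RHS", "M"] - by_row["LHS", "M"]
print(round(diff_m, 2))
```

The two standardized tables answer different questions, which is why the percent difference is asymmetric.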

MNIs

CELL-BASED MEASURES

Binomial Probabilities

- P(n,k,p): probability of k successes in n trials, with p probability of success in any one trial
- e.g., n = 13, k = 5, p = 3.7/13
- in R: `> pbinom(k, n, p)` gives the cumulative probability of k or fewer successes; `> dbinom(k, n, p)` gives the probability of exactly k
- easy to build into a function…
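A short sketch of the R calls with the slide's values. Which tail is relevant depends on the question being asked, so all three common forms are shown; the variable names are illustrative.

```r
n <- 13          # trials
k <- 5           # successes
p <- 3.7 / 13    # probability of success in any one trial

exactly_k  <- dbinom(k, n, p)                          # P(X = k)
at_most_k  <- pbinom(k, n, p)                          # P(X <= k)
at_least_k <- pbinom(k - 1, n, p, lower.tail = FALSE)  # P(X >= k)

print(round(c(exactly = exactly_k, at_most = at_most_k, at_least = at_least_k), 3))
```

Wrapping these lines in a function of `n`, `k` and `p` would allow the same calculation to be applied to each cell of a table.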

K-S TEST

Cumulative Percent Graph

[figure: percent and cumulative percent plots, axes scaled 10 to 100]

- some useful statistical measures (ordinal or ratio scale)
- can be misleading when used with nominal data
- good for comparing data sets
- K-S test for cumulative percents

K-S test

- find Dmax:
  - maximum difference between 2 cumulative proportion distributions
- compare to critical value for chosen sig. level:

$$C\sqrt{\frac{n_1+n_2}{n_1 n_2}}$$

- alpha = .05, C = 1.36
- alpha = .01, C = 1.63
- alpha = .001, C = 1.95

example 2

- mortuary data (Shennan, p. 56+)
- burials characterized according to 2 wealth (poor vs. wealthy) and 6 age categories (infant to old age)

| | Rich | Poor |
|---|---|---|
| Infans I | 6 | 23 |
| Infans II | 8 | 21 |
| Juvenilis | 11 | 25 |
| Adultus | 29 | 36 |
| Maturus | 19 | 27 |
| Senilis | 3 | 4 |
| Total | 76 | 136 |

- burials for younger age-classes appear to be more numerous among the poor
- can this be explained away as an example of random chance?
- or do poor burials constitute a different population, with respect to age-classes, than rich burials?

we can get a visual sense of the problem using a cumulative frequency plot:

K-S test (Kolmogorov-Smirnov test)

- assesses the significance of the maximum divergence between two cumulative frequency curves
- H0: dist1 = dist2
- an equation based on the theoretical distribution of differences between cumulative frequency curves provides a critical value for a specific alpha level
- observed differences beyond this value can be regarded as significant at that alpha level

if alpha = .05, the critical value is

$$1.36\sqrt{\frac{n_1+n_2}{n_1 n_2}} = 1.36\sqrt{\frac{76+136}{76 \times 136}} = 0.195$$

- the observed value: Dmax = 0.178
- 0.178 < 0.195; don't reject H0
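The whole calculation can be reproduced from the burial counts. This is a sketch using the cumulative-proportion approach described above (not R's `ks.test()`, which expects raw unbinned observations); the variable names are illustrative.

```r
# Burial counts by age class, youngest to oldest (from the mortuary table)
rich <- c(6, 8, 11, 29, 19, 3)
poor <- c(23, 21, 25, 36, 27, 4)

# Cumulative proportions for each group
cum_rich <- cumsum(rich) / sum(rich)
cum_poor <- cumsum(poor) / sum(poor)

# Dmax: largest absolute difference between the two curves
d_max <- max(abs(cum_rich - cum_poor))

# Critical value for alpha = .05 (C = 1.36)
n1 <- sum(rich); n2 <- sum(poor)
crit <- 1.36 * sqrt((n1 + n2) / (n1 * n2))

print(round(c(d_max = d_max, critical = crit), 3))
print(d_max > crit)   # TRUE would mean: reject H0 at alpha = .05
```

Plotting `cum_rich` and `cum_poor` against the age classes gives the cumulative frequency plot; Dmax is the largest vertical gap between the two curves.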

example 2

statement/question: Oil exploration should be allowed in coastal California…

| | age <= 30 | age > 30 |
|---|---|---|
| strongly disagree | 8 | 7 |
| mildly disagree | 5 | 9 |
| disagree | 6 | 6 |
| no opinion | 0 | 1 |
| agree | 2 | 2 |
| mildly agree | 1 | 3 |
| strongly agree | 2 | 3 |

OTHER HYP. TESTS

example 3

- survey data
- 100 sites
- broken down by location and time:

| | early | late | Total |
|---|---|---|---|
| piedmont | 31 | 19 | 50 |
| plain | 19 | 31 | 50 |
| Total | 50 | 50 | 100 |

- we can do a chi-square test of independence of the two variables time and location
- H0: time & location are independent
- alpha = .05

- χ² values reflect accumulated differences between observed and expected cell-counts
- expected cell counts are based on the assumptions inherent in the null hypothesis
- if the H0 is correct, cell values should reflect an even distribution of marginal totals

| | early | late | Total |
|---|---|---|---|
| piedmont | 25 | 25 | 50 |
| plain | 25 | 25 | 50 |
| Total | 50 | 50 | 100 |

- chi-square = Σ((o-e)²/e)
- observed chi-square = 4.84 (this value includes the continuity correction for 2 x 2 tables; the uncorrected sum is 5.76)
- we need to compare it to the critical value in a chi-square table:
  - critical value (alpha = .05, 1 df) is 3.84
  - observed chi-square (4.84) > 3.84
- we can reject H0
- H1: time & location are not independent
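The test can be run in R as follows. This is a sketch; the object name `sites` and the dimnames are labels chosen for the example. By default `chisq.test()` applies Yates' continuity correction to 2 x 2 tables, which is what reproduces the value 4.84.

```r
# Survey data: 100 sites by location and period
sites <- matrix(c(31, 19,
                  19, 31),
                nrow = 2, byrow = TRUE,
                dimnames = list(location = c("piedmont", "plain"),
                                period = c("early", "late")))

corrected   <- chisq.test(sites)                   # with continuity correction
uncorrected <- chisq.test(sites, correct = FALSE)  # plain sum of (o - e)^2 / e

print(corrected$expected)                          # 25 in every cell
print(c(corrected = unname(corrected$statistic),
        uncorrected = unname(uncorrected$statistic)))
print(round(corrected$p.value, 3))                 # compare with alpha = .05
```

Either version exceeds the critical value of 3.84, so the decision to reject H0 does not depend on the correction here.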

what does this mean?
