Count Data. HT Cleopatra VII & Marcus Antony C c Aa.


[Figure: roulette betting layout showing EVEN, ODD, 1st 12, 2nd 12, 3rd 12]

Gregor Mendel, 1822-1884

                          RY     Ry     rY     ry     Total
Obs.                      950    250    350    50     1600
Expect (under 9:3:3:1)    900    300    300    100    1600

Which statement is right, H0 or H1?

           RY      Ry      rY      ry      Total
Obs.       950     250     350     50      1600
Expect     900     300     300     100     1600
O-E        50      -50     50      -50     0
(O-E)^2    2500    2500    2500    2500    10000

             1       2       3       4       Total
Obs.         950     250     350     50      1600
Expect       900     300     300     100     1600
O-E          50      -50     50      -50     0
(O-E)^2/E    25/9    25/3    25/3    25      25*16/9 = 44.44
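The arithmetic in the table above can be checked directly. The following is a minimal R sketch, assuming the 9:3:3:1 ratio used elsewhere in these slides; the variable names (`obs`, `expected`, `contrib`, `x2`) are illustrative and not from the slides.

```r
# Observed counts for the four phenotype classes (from the slide)
obs <- c(RY = 950, Ry = 250, rY = 350, ry = 50)

# Expected counts under the 9:3:3:1 ratio
expected <- sum(obs) * c(9, 3, 3, 1) / 16

# Contribution of each class to the Pearson statistic: (O - E)^2 / E
contrib <- (obs - expected)^2 / expected

# Pearson chi-squared statistic and its upper-tail p-value on 4 - 1 = 3 df
x2 <- sum(contrib)
p_value <- pchisq(x2, df = length(obs) - 1, lower.tail = FALSE)

print(round(contrib, 3))
print(x2)
print(p_value)
```

The statistic equals 25*16/9 = 400/9, the value that `chisq.test(x, p=p)` reports as X-squared = 44.4444.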

Chi-square critical values (rows: upper-tail probability; columns: degrees of freedom)

         1        3        5        8        15       24       ∞
0.975    0.001    0.216    0.831    2.180    6.262    12.401   ∞
0.95     0.004    0.352    1.145    2.733    7.261    13.848   ∞
0.05     3.841    7.815    11.071   15.507   24.996   36.415   ∞
0.025    5.024    9.348    12.833   17.535   27.488   39.364   ∞
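The finite-df entries of this table can be reproduced with `qchisq()`. This is a short sketch; the names `df`, `upper_tail` and `crit` are chosen here for illustration.

```r
# Degrees of freedom and upper-tail probabilities shown in the table
df <- c(1, 3, 5, 8, 15, 24)
upper_tail <- c(0.975, 0.95, 0.05, 0.025)

# qchisq() with lower.tail = FALSE returns the value exceeded with the
# given probability; sapply() builds one column per df
crit <- sapply(df, function(d) qchisq(upper_tail, df = d, lower.tail = FALSE))
dimnames(crit) <- list(as.character(upper_tail), as.character(df))

print(round(crit, 3))
```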

                          RY     Ry     rY     ry     Total
Obs.                      950    250    350    50     1600
Expect (under 9:3:3:1)    900    300    300    100    1600

> x <- c(950,250,350,50)
> p <- c(9,3,3,1)/16
> chisq.test(x, p=p)

    Chi-squared test for given probabilities

data:  x
X-squared = 44.4444, df = 3, p-value = 1.214e-09

         Y       y      Total
R        950     250    1200
r        350     50     400
Total    1300    300    1600

[Table: cell probabilities for R, r by Y, y, summing to 1]

Chi-square test for independence

[Two tables of cell probabilities for R, r by Y, y, each summing to 1]

                               RY     Ry     rY     ry     Total
Obs.                           950    250    350    50     1600
Expect (under independence)    975    225    325    75     1600

Marginal totals used for the expected counts:

         Y       y      Total
R                       1200
r                       400
Total    1300    300    1600
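Under independence, each expected count is (row total × column total) / grand total. A minimal R sketch of that calculation for the 2 × 2 table above follows; the object names `tab`, `expected` and `x2` are illustrative only.

```r
# Observed 2 x 2 table: rows R, r; columns Y, y (filled column by column)
tab <- matrix(c(950, 350, 250, 50), nrow = 2,
              dimnames = list(c("R", "r"), c("Y", "y")))

# Expected counts under independence: outer product of the margins / n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic on (2 - 1) * (2 - 1) = 1 degree of freedom
x2 <- sum((tab - expected)^2 / expected)
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

print(expected)
print(x2)
print(p_value)
```

The statistic agrees with the value `chisq.test(mx, correct=F)` reports for these counts (X-squared = 13.6752).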

                               RY      Ry      rY      ry      Total
Obs.                           950     250     350     50      1600
Expect (under independence)    975     225     325     75      1600
(O-E)^2/E                      0.64    2.78    1.92    8.33    13.68

> mx <- matrix(c(950,250,350,50),2,)
> chisq.test(mx,correct=F)

    Pearson's Chi-squared test

data:  mx
X-squared = 13.6752, df = 1, p-value = 0.0002173

> mx
     [,1] [,2]
[1,]  950  350
[2,]  250   50

[Table: cell probabilities for R, r by Y, y, summing to 1]

General k × m table: rows r1, …, rk; columns y1, …, ym; row and column totals (Tot); the cell probabilities sum to 1.

             1      2      3      4      5      6      Total
Obs.         8      12     7      14     9      10     60
Expect       10     10     10     10     10     10     60
(O-E)^2/E    0.4    0.4    0.9    1.6    0.1    0      3.4

> x <- c(8,12,7,14,9,10)
> p <- rep(1,6)/6
> chisq.test(x,p=p)

    Chi-squared test for given probabilities

data:  x
X-squared = 3.4, df = 5, p-value = 0.6386

             H     T     Total
Obs.         60    40    100
Expect       50    50    100
(O-E)^2/E    2     2     4

> chisq.test(c(60,40),p=c(1,1)/2)

    Chi-squared test for given probabilities

data:  c(60, 40)
X-squared = 4, df = 1, p-value = 0.0455

Chi-square test for homogeneity of distributions

Do the two coins have the same proportion of heads? Heads : tails = 560 : 440 versus 640 : 360.

        Caesar    Ptolemy
Head    560       640
Tail    440       360

> head2 <- c( 560, 640)
> toss2 <- c( 1000, 1000)
> prop.test(head2, toss2)

    2-sample test for equality of proportions ….

data:  head2 out of toss2
X-squared = 13.0021, df = 1, p-value = 0.0003111
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.12379728 -0.03620272
sample estimates:
prop 1 prop 2
  0.56   0.64

> mx <- matrix(c(560,440,640,360),2,)
> mx
     [,1] [,2]
[1,]  560  640
[2,]  440  360
> chisq.test(mx,correct=FALSE)

    Pearson's Chi-squared test

data:  mx
X-squared = 13.3333, df = 1, p-value = 0.0002607

> chisq.test(mx)

    Pearson's Chi-squared test with Yates' continuity correction

data:  mx
X-squared = 13.0021, df = 1, p-value = 0.0003111

         Coin 1    Coin 2    Coin 3    Coin 4
Head     83        90        129       70
Tail     3         3         7         12
Total    86        93        136       82

The same layout applies to survival counts: Alive, Dead and Total for Hospital 1 to Hospital 4.

> # H0 : all four coins have the same proportion of heads
> # H1 : at least one coin has a different proportion from the others
>
> head4 <- c( 83, 90, 129, 70 )
> toss4 <- c( 86, 93, 136, 82 )
> prop.test(head4, toss4)

    4-sample test for equality of proportions without continuity correction

data:  head4 out of toss4
X-squared = 12.6004, df = 3, p-value = 0.005585
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4
0.9651163 0.9677419 0.9485294 0.8536585

> mx <- matrix(c(83,3,90,3,129,7,70,12),2,)
> chisq.test(mx)

    Pearson's Chi-squared test

data:  mx
X-squared = 12.6004, df = 3, p-value = 0.005585

Australia rare plants data

      D     W      WD
CC    37    190    94
CR    23    59     23
RC    10    141    28
RR    15    58     16

Rows: Common (C) or Rare (R) in (South Australia, Victoria) and in (Tasmania).
Columns: the number of plants in Dry (D), Wet (W) and Wet or Dry (WD) regions.

Question (null hypothesis): Is the distribution of plants over (D, W, WD) the same for all of CC, CR, RC and RR?

Australia rare plants data

> rareplants<-matrix(c(37,23,10,15,190,59,141,58,94,23,28,16),4,)
> dimnames(rareplants)<-list(c("CC","CR","RC","RR"),c("D","W","WD"))
> rareplants
> (sout<- chisq.test(rareplants) )

    Pearson's Chi-squared test

data:  rareplants
X-squared = 34.9863, df = 6, p-value = 4.336e-06

> round( sout$expected,1 )
      D     W   WD
CC 39.3 207.2 74.5
CR 12.9  67.8 24.4
RC 21.9 115.6 41.5
RR 10.9  57.5 20.6
> round( sout$resid,3 )
        D      W     WD
CC -0.369 -1.196  2.263
CR  2.828 -1.067 -0.275
RC -2.547  2.368 -2.099
RR  1.242  0.072 -1.023

The lady tasting tea http://www.youtube.com/watch?v=lgs7d5saFFc http://en.wikipedia.org/wiki/Fisher's_exact_test

Fisher's exact test for 2 × 2 tables with small n (n < 25)

Guess\Making    Milk 1st    Tea 1st    Sum
Milk 1st        7           1          8
Tea 1st         2           5          7
Sum             9           6          15

> chisq.test(matrix(c(7,2,1,5),2,))

    Pearson's Chi-squared test with Yates' continuity correction

X-squared = 3.2254, df = 1, p-value = 0.0725

Warning message:
Chi-squared approximation may be incorrect

> fisher.test(matrix(c(7,2,1,5),2,))

    Fisher's Exact Test for Count Data

p-value = 0.04056
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.8646648 934.0087368
sample estimates:
odds ratio
  13.59412

> fisher.test(matrix(c(7,2,1,5),2,),alter="greater")

    Fisher's Exact Test for Count Data

p-value = 0.03497
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 1.179718      Inf
sample estimates:
odds ratio
  13.59412

There are 7 possible tables for the given marginal counts (row sums 8 and 7, column sums 9 and 6, n = 15). Rows are the guess (G), columns are how the cup was made (M).

Table    Guess M 1st (Made M, Made T)    Guess T 1st (Made M, Made T)
1        8, 0                            1, 6
2        7, 1                            2, 5    (observed)
3        6, 2                            3, 4
4        5, 3                            4, 3
5        4, 4                            5, 2
6        3, 5                            6, 1
7        2, 6                            7, 0

What is the probability that each table will occur in the experiment?

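Counts:

G\M      M 1st    T 1st    Sum
M 1st    a        b        a+b
T 1st    c        d        c+d
Sum      a+c      b+d      n

Probabilities (each column sums to 1):

G\M      M 1st    T 1st    Sum
M 1st    r        q        v
T 1st    1-r      1-q      1-v
Sum      1        1        1

r = q means no discernible ability.

Odds ratio: [r/(1-r)] / [q/(1-q)], estimated by ad/(bc), with some correction.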

Probability of each table when there is no discernible ability, given the margins:

Table    Guess M 1st (Made M, Made T)    Guess T 1st (Made M, Made T)    Probability
1        8, 0                            1, 6                            0.00140
2        7, 1                            2, 5    (observed)              0.03357
3        6, 2                            3, 4                            0.19580
4        5, 3                            4, 3                            0.39161
5        4, 4                            5, 2                            0.29371
6        3, 5                            6, 1                            0.07832
7        2, 6                            7, 0                            0.00559

0.00140 + 0.03357 + 0.00559 = 0.04056 (the p-value of Fisher's exact test; two-sided test)
0.00140 + 0.03357 = 0.03497 (one-sided test)
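With the margins fixed, the top-left cell follows a hypergeometric distribution, so the table probabilities and both p-values can be recomputed with `dhyper()`. The sketch below assumes the margins from the slides (8 cups guessed milk-first, 9 cups made milk-first, 15 cups in total); the variable names are illustrative.

```r
# Possible values of the (guess M 1st, made M 1st) cell, tables 1 to 7
a <- 8:2

# Hypergeometric probabilities: 9 cups made milk-first, 6 made tea-first,
# and 8 cups are guessed to be milk-first
probs <- dhyper(a, m = 9, n = 6, k = 8)
names(probs) <- a

# Observed table has a = 7
p_obs <- probs[a == 7]

# Two-sided p-value: all tables no more probable than the observed one
# (the small tolerance guards against floating-point ties)
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])

# One-sided p-value (alternative "greater"): tables with a >= 7
p_one_sided <- sum(probs[a >= 7])

print(round(probs, 5))
print(p_two_sided)
print(p_one_sided)
```

The two sums agree with the p-values printed by `fisher.test()` for this table (0.04056 and 0.03497).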

100% correct answers:

G\M      M 1st    T 1st    Sum
M 1st    9        0        9
T 1st    0        6        6
Sum      9        6        15

Some are misclassified:

G\M      M 1st    T 1st    Sum
M 1st    4        4        8
T 1st    5        2        7
Sum      9        6        15

Fisher's exact test considers only tables with the same fixed margins as the observed table. The probabilities of tables with different margins are ignored entirely. This is sometimes referred to as data-respecting (?) inference.

Use Fisher's exact test only for small n (less than 25).

Guess\Making    Milk 1st    Tea 1st    Sum
Milk 1st        14          2          16
Tea 1st         4           10         14
Sum             18          12         30

    Pearson's Chi-squared test

X-squared = 10.8036, df = 1, p-value = 0.001013

> chisq.test(matrix(c(14,4,2,10),2,))

    Pearson's Chi-squared test with Yates' continuity correction

X-squared = 8.4877, df = 1, p-value = 0.003576

> fisher.test(matrix(c(14,4,2,10),2,))

    Fisher's Exact Test for Count Data

p-value = 0.002185
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   2.123319 202.143800
sample estimates:
odds ratio
  15.40804

No big difference when n is large!

Yates' continuity correction

G\M      M 1st    T 1st    Sum
M 1st    a        b        a+b
T 1st    c        d        c+d
Sum      a+c      b+d      n
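For a 2 × 2 table in the a, b, c, d notation above, the Yates-corrected statistic is usually written as follows. This is the standard textbook form, given here as a sketch, and the worked line applies it to the tea-tasting counts (a = 7, b = 1, c = 2, d = 5, n = 15).

```latex
\chi^2_{\text{Yates}}
  = \frac{n\,\bigl(\lvert ad - bc\rvert - n/2\bigr)^2}
         {(a+b)(c+d)(a+c)(b+d)},
\qquad
\frac{15\,(\lvert 35 - 2\rvert - 7.5)^2}{8 \cdot 7 \cdot 9 \cdot 6}
  = \frac{9753.75}{3024} \approx 3.2254
```

The result matches the X-squared value that `chisq.test()` prints for the tea-tasting table with the correction applied.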

Odds ratio :
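As a rough illustration of the odds ratio for the tea-tasting table, the sketch below computes the simple sample estimate ad/(bc) and compares it with the estimate reported by `fisher.test()`, which is a conditional maximum likelihood estimate and therefore differs. The names `n11`, `n12`, `n21`, `n22`, `or_sample` and `or_conditional` are illustrative, not from the slides.

```r
# Tea-tasting counts: rows = guess, columns = how the cup was made
tab <- matrix(c(7, 2, 1, 5), nrow = 2)
n11 <- tab[1, 1]; n12 <- tab[1, 2]
n21 <- tab[2, 1]; n22 <- tab[2, 2]

# Simple sample odds ratio ad / (bc)
or_sample <- (n11 * n22) / (n12 * n21)

# Conditional maximum likelihood estimate, as reported by fisher.test()
or_conditional <- unname(fisher.test(tab)$estimate)

print(or_sample)
print(log(or_sample))
print(or_conditional)
```

The log of the sample odds ratio, about 2.862, is the magnitude of the `makingT` coefficient in the logistic regression output for the same data.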

Linear Model (LM): Regression, ANOVA
Generalized Linear Model (GLM)

- Regression
- ANOVA
Generalized Linear Model (GLM)
- Poisson Regression
- Binomial Regression (Logistic Regression)

Logistic regression

Guess\Making    Milk 1st    Tea 1st    Sum
Milk 1st        7           1          8
Tea 1st         2           5          7
Sum             9           6          15

These counts are observed!

Logistic regression with the lady tasting tea data

> tm<-data.frame(gm=c(7,1),gt=c(2,5), making=c("M","T"))
> summary( glm(cbind(gm,gt)~making,family=binomial, data=tm) )

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2528     0.8018   1.562    0.118
makingT      -2.8622     1.3575  -2.108    0.035 *

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5.7863e+00  on 1  degrees of freedom
Residual deviance: 8.8818e-16  on 0  degrees of freedom
AIC: 8.1909

Number of Fisher Scoring iterations: 4
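The coefficients above are on the logit scale. A small sketch of how they map back to probabilities follows, using the rounded estimates copied from the output; `b0`, `b1`, `p_made_M` and `p_made_T` are names chosen here.

```r
# Rounded estimates copied from the summary() output
b0 <- 1.2528    # (Intercept): log-odds of guessing "milk first" when made M
b1 <- -2.8622   # makingT: change in log-odds when the cup was made T

# Inverse logit (plogis) turns log-odds into probabilities
p_made_M <- plogis(b0)        # close to 7/9, the observed proportion
p_made_T <- plogis(b0 + b1)   # close to 1/6, the observed proportion

# exp(-b1) recovers the sample odds ratio ad/(bc) = 17.5 (up to rounding)
odds_ratio <- exp(-b1)

print(c(p_made_M, p_made_T))
print(odds_ratio)
```

Because the model has as many parameters as observations (residual deviance on 0 degrees of freedom), the fitted probabilities reproduce the observed proportions 7/9 and 1/6.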

A     B     C     D     E     F
10    11    0     3     3     11
7     17    1     5     5     9
20    21    7     12    3     15
14    11    2     6     5     22
14    16    3     4     3     15
12    14    1     3     6     16
10    17    2     5     1     13
23    17    1     5     1     10
17    19    3     5     3     26
20    21    0     5     2     26
14    7     1     2     6     24
13    13    4     4     4     13

> sx<-rep(LETTERS[1:6],each=12)
> dx<-c(10,7,20,14,14,12,10,23,17,20,14,13,11,17,21,11,16,14,17,17,19,21,7,13,
+       0,1,7,2,3,1,2,1,3,0,1,4,3,5,12,6,4,3,5,5,5,5,2,4,3,5,3,5,3,6,1,1,3,2,6,
+       4,11,9,15,22,15,16,13,10,26,26,24,13)
> ax<- 30-dx
> insect<-data.frame(dead=dx,alive=ax,spray=sx)
> gout<-glm(cbind(dead,alive)~spray,family=binomial, data=insect)
> summary( gout )

Call:
glm(formula = cbind(dead, alive) ~ spray, family = binomial, data = insect)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.06669    0.10547  -0.632   0.5272
sprayB       0.11114    0.14913   0.745   0.4561
sprayC      -2.52856    0.23259 -10.871   <2e-16 ***
sprayD      -1.56288    0.17719  -8.821   <2e-16 ***
sprayE      -1.95769    0.19513 -10.033   <2e-16 ***
sprayF       0.28983    0.14958   1.938   0.0527 .

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 614.07  on 71  degrees of freedom
Residual deviance: 171.24  on 66  degrees of freedom
AIC: 416.16

Number of Fisher Scoring iterations: 4

> gres<-rbind(unique(fitted(gout)),unique(predict(gout)))
> dimnames(gres)[[2]]<-LETTERS[1:6]
> gres
               A          B           C          D          E         F
[1,]  0.48333333 0.51111111  0.06944445  0.1638889  0.1166667 0.5555556
[2,] -0.06669137 0.04445176 -2.59525468 -1.6295728 -2.0243818 0.2231436

> anova(gout)
Analysis of Deviance Table

Model: binomial, link: logit

Response: cbind(dead, alive)

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL                     71     614.07
spray  5   442.83        66     171.24

Correlation and causality

The more Starbucks (STBK) stores, the higher the apartment (APT) price rises?
The more Starbucks, the higher the APT price!

APT prices in Seoul

District       STBK    APT price
Gangnam-gu     45      1030
Gangdong-gu    2       530
Jung-gu        24      520
Jungnang-gu    0       330

STBK: the number of Starbucks stores
APT price: average APT price per 1 m²

# y: number of Starbucks stores in each district
y<-c(45, 2,1,4,4,6,4,2,1,0,2,3,10,8,21,3,5,5,3,12,7,1,20,24,0)
# x: average APT price per pyeong in each district
x<-c(3373,1907,1115,1413,1286,1861,1218,1018,1250,1135,1240,1528,
     1675,1220,2854,1644,1247,2427,2034,1723,2594,1138,1634,1729,1101)
xm<- x/(3.3)   # convert price per pyeong to price per m2 (1 pyeong is about 3.3 m2)
( res<- glm(y~xm, family=poisson) )
anova(res)
summary(res)
plot(xm,y,ylab="Starbucks",xlab="APT price/m2")
points(xm,fitted(res),col="red",pch=16)   # exp(predict(res)) = fitted(res)

> summary(res)

Call:
glm(formula = y ~ xm, family = poisson)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6923  -1.7239  -0.6041   0.5783   5.3036

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0072064  0.2128074  -0.034    0.973
xm           0.0035630  0.0003009  11.841   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 235.19  on 24  degrees of freedom
Residual deviance: 111.52  on 23  degrees of freedom
AIC: 195.4

Number of Fisher Scoring iterations: 5

> anova(res)
Analysis of Deviance Table

Model: poisson, link: log

Response: y

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev
NULL                    24     235.19
xm    1   123.67        23     111.52
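Two quantities can be read off the Poisson fit without refitting it. The sketch below uses numbers copied from the `summary(res)` and `anova(res)` output; the step of 100 units of `xm` is an arbitrary choice made here for illustration, and the variable names are not from the slides.

```r
# Values copied from the output in the slides
slope <- 0.0035630   # coefficient of xm on the log scale
dev_drop <- 123.67   # deviance explained by xm, on 1 degree of freedom

# With a log link, adding 100 units to xm multiplies the expected
# number of stores by exp(100 * slope)
rate_ratio_100 <- exp(100 * slope)

# Likelihood-ratio (deviance) test of xm against the chi-square
# distribution with 1 degree of freedom
p_value <- pchisq(dev_drop, df = 1, lower.tail = FALSE)

print(rate_ratio_100)
print(p_value)
```

A strong association of this kind says nothing about direction or cause, which is the point of the "Correlation and causality" slide.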

distribution & likelihood

A    0.75    0.05
B    0.1    0.5    0.1
C    0.05    0.75

What is ? is observed.

distribution & likelihood

[Figure: a probability distribution over the values 0 to 6, and the likelihood plotted against p from 0 to 1; marked values 0.133, 0.587 and 0.15]

link function (for Poisson family) the number of parameters linear modeling for the link function
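For a Poisson response, the usual (canonical) choice is the log link, which is also what `glm(..., family=poisson)` uses by default. A sketch in notation assumed here (mean \mu_i, coefficients \beta_j, predictors x_{ij}; none of these symbols come from the slide):

```latex
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}
```

The right-hand side is an ordinary linear model; only the scale on which it applies, the log of the mean, differs from linear regression.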

link function (for binomial family) linear modeling for the link function
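For a binomial response, the usual (canonical) choice is the logit link, the default for `glm(..., family=binomial)`. A sketch in notation assumed here (success probability \pi_i, number of trials n_i, coefficients \beta_j, predictors x_{ij}; none of these symbols come from the slide):

```latex
y_i \sim \mathrm{Binomial}(n_i, \pi_i), \qquad
\operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i}
  = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}
```

Exponentiating a coefficient therefore gives an odds ratio, which is how the logistic regression output connects to the 2 × 2 table analyses.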

Independence test in GLM for Australia rare plants data

      D     W      WD
CC    37    190    94
CR    23    59     23
RC    10    141    28
RR    15    58     16

> rareplants<-matrix(c(37,23,10,15,190,59,141,58,94,23,28,16),4,)
> dimnames(rareplants)<-list(c("CC","CR","RC","RR"),c("D","W","WD"))
> (sout<- chisq.test(rareplants) )

    Pearson's Chi-squared test

data:  rareplants
X-squared = 34.9863, df = 6, p-value = 4.336e-06

> wdx<-rep(c("D","W","WD"),each=4)
> crx<-rep(c("CC","CR","RC","RR"),3)
> rplants<-data.frame(wd=wdx,cr=crx,r=c(rareplants))
> anova( glm(r~wd*cr,family=poisson,data=rplants) )
Analysis of Deviance Table

Model: poisson, link: log, Response: r

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL                     11     522.11
wd     2   305.28         9     216.83
cr     3   181.88         6      34.95
wd:cr  6    34.95         0  -9.77e-15

> 1-pchisq(34.95,6)
[1] 4.406699e-06
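The deviance attributed to the `wd:cr` interaction is the likelihood-ratio statistic for independence, G² = 2 Σ O log(O/E), where E are the usual expected counts under independence. A minimal sketch computing it by hand follows; the names `expected` and `g2` are illustrative.

```r
# Rare plants table, as in the slides
rareplants <- matrix(c(37, 23, 10, 15, 190, 59, 141, 58, 94, 23, 28, 16), nrow = 4)
dimnames(rareplants) <- list(c("CC", "CR", "RC", "RR"), c("D", "W", "WD"))

# Expected counts under independence of rows and columns
expected <- outer(rowSums(rareplants), colSums(rareplants)) / sum(rareplants)

# Likelihood-ratio statistic G^2 on (4 - 1) * (3 - 1) = 6 degrees of freedom
g2 <- 2 * sum(rareplants * log(rareplants / expected))
p_value <- pchisq(g2, df = 6, lower.tail = FALSE)

print(g2)
print(p_value)
```

G² is close to, but not identical with, the Pearson statistic (34.9863) reported by `chisq.test()`; both are compared with the chi-square distribution on 6 degrees of freedom.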

Homogeneity test in GLM for coin tossing example

         Coin 1    Coin 2    Coin 3    Coin 4
Head     83        90        129       70
Tail     3         3         7         12
Total    86        93        136       82

The same layout applies to survival counts: Alive, Dead and Total for Hospital 1 to Hospital 4.

> # H0 : all four coins have the same proportion of heads
> # H1 : at least one coin has a different proportion from the others
>
> head4 <- c( 83, 90, 129, 70 )
> toss4 <- c( 86, 93, 136, 82 )
> prop.test(head4, toss4)

    4-sample test for equality of proportions without continuity correction

X-squared = 12.6004, df = 3, p-value = 0.005585
alternative hypothesis: two.sided

> coins<-factor(LETTERS[1:4])
> anova(glm(cbind(head4,toss4-head4)~coins,family=binomial))
Analysis of Deviance Table

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL                      3     10.667
coins  3   10.667         0  1.132e-14

> 1-pchisq(10.667,3)
[1] 0.01366980

Thank you !!
