Published by Cameron Whitehead. Modified over 4 years ago.
1
Categorical Data Analysis & Logistic Regression
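Jinheum Kim, Department of Statistics and Information Science, University of Suwon
Eunkyung Lee, Senior Researcher, Marketing Lab Partners Co., Ltd.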
2
Outline
Two-way contingency tables: RR, odds ratio, chi-square tests
Three-way contingency tables: conditional independence, homogeneous association, common odds ratio
Logistic regression: dichotomous response
Logistic regression: polytomous response
3
First example: Aspirin & heart attacks
Clinical trial: 2×2 table of aspirin use and myocardial infarction (MI)
Goal: test whether regular intake of aspirin reduces mortality from cardiovascular disease
Prospective sampling design: cohort studies, clinical trials
Data set:
Group      MI: Yes   MI: No    Total
Placebo    189       10,845    11,034
Aspirin    104       10,933    11,037
4
Second example: Smoking & heart attacks
Case-control study: 2×2 table of smoking status and MI
Goal: compare ever-smokers with nonsmokers in terms of the proportion who suffered MI
Retrospective sampling design: case-control study, cross-sectional design
Remark: observational studies vs. experimental studies
Data set:
Ever-smoker   MI cases   Controls
Yes           172        173
No            90         346
Total         262        519
5
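Comparing proportions in a 2×2 table
Let $\pi_i$ be the probability of the response ("success") in row (group) $i$
Difference of proportions: $\pi_1 - \pi_2$
Relative risk: $RR = \pi_1 / \pi_2$
When both proportions are close to 0 or 1, RR is more informative than the difference
$\pi_1 = \pi_2$ (difference $= 0$, $RR = 1$): the response is independent of group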
6
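Example (revisited)
1st example
$\hat\pi_1 - \hat\pi_2 = 0.0171 - 0.0094 = 0.0077$, 95% CI = (0.005, 0.011): taking aspirin appears to reduce the risk of heart attack
$\widehat{RR} = 0.0171 / 0.0094 = 1.82$, 95% CI = (1.43, 2.30): the risk of MI is at least 43% higher for the placebo group
2nd example
$\hat\pi_1 - \hat\pi_2$ and $\widehat{RR}$: not estimable; they can be computed but are meaningless, because the numbers of cases and controls were fixed by the design
Estimate proportions in the reverse direction instead
Proportion of ever-smokers given MI status: 172/262 = 0.656 (suffered MI), 173/519 = 0.333 (did not suffer MI)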
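The interval estimates for the aspirin example can be reproduced from the four counts in the table. The following is a minimal sketch, assuming the usual large-sample (Wald) standard errors for the difference of proportions and for the log relative risk; the function name `diff_and_rr` is illustrative and not part of the slides.

```python
import math

def diff_and_rr(y1, n1, y2, n2, z=1.96):
    """Difference of proportions and relative risk with large-sample CIs.

    y1/n1 and y2/n2 are the success counts and sample sizes of the two groups.
    """
    p1, p2 = y1 / n1, y2 / n2
    diff = p1 - p2
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff_ci = (diff - z * se_diff, diff + z * se_diff)

    rr = p1 / p2
    # Standard error of log(RR); the interval is built on the log scale.
    se_log_rr = math.sqrt((1 - p1) / y1 + (1 - p2) / y2)
    rr_ci = (math.exp(math.log(rr) - z * se_log_rr),
             math.exp(math.log(rr) + z * se_log_rr))
    return diff, diff_ci, rr, rr_ci

# Placebo: 189 MI out of 11,034; aspirin: 104 MI out of 11,037.
diff, diff_ci, rr, rr_ci = diff_and_rr(189, 11034, 104, 11037)
print(round(diff, 4), [round(v, 3) for v in diff_ci])
print(round(rr, 2), [round(v, 2) for v in rr_ci])
```

Building the relative-risk interval on the log scale keeps both limits positive, which is why the interval is not symmetric around the point estimate.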
7
Association measure: odds ratio
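Definition: $\theta = \dfrac{\text{odds}_1}{\text{odds}_2} = \dfrac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}$
Meaning
$\theta = 1$: the two variables are independent, i.e., $\pi_1 = \pi_2$
$\theta > 1$: odds of success in row 1 > odds of success in row 2
$\theta < 1$: odds of success in row 1 < odds of success in row 2
Remark: when both variables are responses, $\theta = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}$ (called the cross-product ratio), using joint probabilities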
8
Properties of odds ratio
Values of $\theta$ farther from 1 in a given direction represent stronger association
When one value is the inverse of the other (e.g., $\theta = 4$ and $\theta = 1/4$), the two values of $\theta$ represent the same strength of association, but in opposite directions
$\theta$ does not change when the table orientation is reversed (rows and columns interchanged)
It is unnecessary to identify one classification as the response variable
9
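Example (revisited)
1st example
$\hat\theta = \dfrac{189 \times 10{,}933}{104 \times 10{,}845} = 1.83$, 95% CI = (1.44, 2.33)
The estimated odds of MI are 83% higher for the placebo group
2nd example
$\hat\theta = \dfrac{172 \times 346}{173 \times 90} = 3.82$
Because MI is rare, this serves as a rough estimate of the relative risk: RR ≈ 3.8
Women who had ever smoked were about four times as likely to suffer MI as women who had never smoked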
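Both sample odds ratios follow directly from the cell counts of the two example tables. A minimal sketch, assuming the standard large-sample standard error of the log odds ratio (square root of the sum of reciprocal cell counts); the helper name `odds_ratio_ci` is illustrative.

```python
import math

def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio (cross-product ratio) with a Wald CI on the log scale."""
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, lower, upper

# Aspirin study: rows = placebo, aspirin; columns = MI yes, MI no.
aspirin = odds_ratio_ci(189, 10845, 104, 10933)
# Smoking study: rows = ever-smoker yes, no; columns = MI cases, controls.
smoking = odds_ratio_ci(172, 173, 90, 346)

print([round(v, 2) for v in aspirin])
print(round(smoking[0], 2))
```

The odds ratio is usable for the case-control table because it takes the same value whether rows or columns are treated as the response.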
10
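Independence tests
Hypothesis: $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i, j$
Two chi-square tests
Under $H_0$, estimated expected frequency: $\hat\mu_{ij} = \dfrac{n_{i+} n_{+j}}{n}$
Pearson's statistic: $X^2 = \sum_{i,j} \dfrac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$
Likelihood ratio (LR) statistic: $G^2 = 2 \sum_{i,j} n_{ij} \log\dfrac{n_{ij}}{\hat\mu_{ij}}$
For a large sample, $X^2$ and $G^2$ follow a chi-squared null distribution with $df = (I-1)(J-1)$
Remark: when $\hat\mu_{ij} \ge 5$, the chi-squared approximation is good. If not, apply Fisher's exact test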
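As an illustration, both independence statistics can be computed for the aspirin table from the first example. This is a minimal sketch for a general I×J table of counts; the function name `independence_tests` is an assumption made for the example.

```python
import math

def independence_tests(table):
    """Pearson X^2, likelihood-ratio G^2, and df for an I x J table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Estimated expected frequency under independence.
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G^2
                g2 += 2 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df

# Rows: placebo, aspirin; columns: MI yes, MI no.
x2, g2, df = independence_tests([[189, 10845], [104, 10933]])
print(round(x2, 2), round(g2, 2), df)
```

Both statistics are far above the usual chi-squared critical values for 1 degree of freedom, in line with the confidence intervals that exclude independence.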
11
Example: AZT use & AIDS
2×2×2 table: development of AIDS symptoms by AZT use and race
Study on the effects of AZT in slowing the development of AIDS symptoms
Data set:
Race    AZT use   Symptoms: Yes   Symptoms: No   Total
White   Yes       14              93             107
White   No        32              81             113
Black   Yes       11              52             63
Black   No        12              43             55
12
Three interests in a 2×2×K table
Conditional independence?
  Controlling for race, are AZT treatment and development of AIDS symptoms independent?
  Use the Cochran-Mantel-Haenszel (CMH) test, which summarizes the information from the partial tables
Homogeneous association?
  Are the odds ratios between AZT treatment and development of AIDS symptoms the same for each race?
  Use the Breslow-Day test
Common odds ratio?
  Use the Mantel-Haenszel estimate
13
Example (AZT use & AIDS revisited)
CMH = 6.8 (df = 1) with p-value = 0.0091: AZT use and symptoms are not conditionally independent
Breslow-Day = 1.39 (df = 1) with p-value = 0.2384: no evidence against homogeneous association
Common odds ratio = 0.49: for each race, the estimated odds of developing symptoms are about half as high for those who took AZT
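The CMH statistic and the Mantel-Haenszel common odds ratio can be computed by hand from the two partial tables (one per race). The sketch below assumes the CMH form without a continuity correction, so it may differ in the last digit from the software output quoted on the slide; the function name `cmh_and_mh` is illustrative.

```python
import math

def cmh_and_mh(strata):
    """CMH statistic (no continuity correction) and Mantel-Haenszel odds ratio.

    strata: list of 2x2 tables ((n11, n12), (n21, n22)), one per stratum.
    """
    diff = var = num = den = 0.0
    for (n11, n12), (n21, n22) in strata:
        n = n11 + n12 + n21 + n22
        row1, row2 = n11 + n12, n21 + n22
        col1, col2 = n11 + n21, n12 + n22
        diff += n11 - row1 * col1 / n            # observed minus expected n11
        var += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
        num += n11 * n22 / n
        den += n12 * n21 / n
    return diff ** 2 / var, num / den

# Rows: AZT yes, AZT no; columns: symptoms yes, symptoms no.
azt_tables = [((14, 93), (32, 81)),   # white
              ((11, 52), (12, 43))]   # black
cmh, common_or = cmh_and_mh(azt_tables)
# Upper-tail probability of a chi-squared variable with 1 df.
p_value = math.erfc(math.sqrt(cmh / 2))
print(round(cmh, 2), round(common_or, 2))
```

The identity used for the p-value holds only for 1 degree of freedom, which is the case for the CMH test on 2×2×K tables.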
14
Overview of types of generalized linear models(GLMs)
Three components: random component (response variable), linear predictor (linear combination of covariates), link function
Types of GLMs:
Random component   Link                Systematic component   Model
Normal             Identity            Continuous             Regression
Normal             Identity            Categorical            Analysis of variance
Normal             Identity            Mixed                  Analysis of covariance
Binomial           Logit               Mixed                  Logistic regression
Poisson            Log                 Mixed                  Loglinear
Multinomial        Generalized logit   Mixed                  Multinomial response
15
Logistic regression with a quantitative covariate
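Model: $\text{logit}[\pi(x)] = \log\dfrac{\pi(x)}{1-\pi(x)} = \alpha + \beta x$
Another representation: $\pi(x) = \dfrac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
Odds $= \dfrac{\pi(x)}{1-\pi(x)} = \exp(\alpha + \beta x) = e^{\alpha}(e^{\beta})^{x}$
The odds at level $x + 1$ equal the odds at $x$ multiplied by $e^{\beta}$
The curve ascends ($\beta > 0$) or descends ($\beta < 0$)
The rate of change increases as $|\beta|$ increases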
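The multiplicative effect on the odds can be checked numerically. In this minimal sketch the values of `alpha` and `beta` are hypothetical, chosen only for illustration; they are not estimates from the slides.

```python
import math

def prob(x, alpha, beta):
    """Success probability pi(x) under logit[pi(x)] = alpha + beta * x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(x, alpha, beta):
    """Odds pi(x) / (1 - pi(x)), which equal exp(alpha + beta * x)."""
    p = prob(x, alpha, beta)
    return p / (1.0 - p)

alpha, beta = -3.0, 0.5   # hypothetical parameter values

# A one-unit increase in x multiplies the odds by exp(beta), whatever x is.
ratio = odds(4.0, alpha, beta) / odds(3.0, alpha, beta)
print(ratio, math.exp(beta))

# The curve crosses 0.5 at x = -alpha / beta, where it is steepest.
midpoint = prob(-alpha / beta, alpha, beta)
```

The odds ratio per unit of `x` is constant, while the change in probability per unit of `x` depends on where on the curve `x` falls.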
16
Example: Horseshoe crabs
Binary response: $Y = 1$ if a female crab has at least one satellite; $Y = 0$ otherwise
Covariate: female crab's width (cm)
Data set:
Width (cm)     Number of cases   Number having satellites
< 23.25        14                5
23.25-24.25    14                4
24.25-25.25    28                17
25.25-26.25    39                21
26.25-27.25    22                15
27.25-28.25    24                20
28.25-29.25    18                15
> 29.25        14                14
17
Example: Horseshoe crabs
18
Goodness-of-fit tests
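Working model: $M$
$N$: number of settings of the explanatory variables; $p$: number of parameters in $M$
Hypothesis: $H_0$: $M$ fits the data
Pearson's statistic: $X^2 = \sum \dfrac{(\text{observed} - \text{fitted})^2}{\text{fitted}}$
Deviance statistic: $G^2 = 2 \sum \text{observed} \times \log\dfrac{\text{observed}}{\text{fitted}}$
Both sums run over the success and failure counts at each setting; $X^2$ and $G^2$ approximately follow a chi-squared null distribution with $df = N - p$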
19
Inference for parameters
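Interval estimation: $\hat\beta \pm z_{\alpha/2} \cdot ASE$
Two significance tests of $H_0: \beta = 0$
Wald test: use $z = \hat\beta / ASE$ (or $z^2$)
Likelihood ratio test: use $-2(L_0 - L_1)$, where $L_0$ and $L_1$ are the maximized values of the log-likelihood function under $H_0$ and under the fitted model
Both $z^2$ and the likelihood ratio statistic have a large-sample chi-squared null distribution with $df = 1$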
20
Example (Horseshoe crabs revisited)
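Fitted model: $\text{logit}[\hat\pi(x)] = \hat\alpha + \hat\beta x$ with $\hat\beta = 0.497$
$\hat\pi(x)$ is larger at larger width ($\hat\beta > 0$)
There is a 64% increase in the estimated odds of a satellite for each centimeter increase in width ($e^{0.497} = 1.64$)
Goodness of fit: Pearson's $X^2$ with p-value = 0.506; deviance $G^2$ with p-value = 0.4012
95% CI for $\beta$ = (0.298, 0.697)
Significance tests: Wald = 23.9 (df = 1) with p-value < 0.0001; LRT = 31.3 (df = 1) with p-value < 0.0001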
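The reported numbers for the width effect are tied together by the Wald formulas. The sketch below starts only from the reported 95% interval (0.298, 0.697) and assumes it is a Wald interval with $z = 1.96$, so the slope is taken as the midpoint and the ASE is backed out from the half-width; both are reconstructions, not values printed on the slide.

```python
import math

lower, upper = 0.298, 0.697          # reported 95% CI for beta

# Assumption: the interval is beta_hat +/- 1.96 * ASE.
beta_hat = (lower + upper) / 2
ase = (upper - lower) / (2 * 1.96)

# Wald chi-squared statistic for H0: beta = 0.
wald = (beta_hat / ase) ** 2

# Multiplicative effect on the odds per 1 cm of width, with its interval.
odds_multiplier = math.exp(beta_hat)
odds_ci = (math.exp(lower), math.exp(upper))

print(round(beta_hat, 3), round(ase, 3), round(wald, 1))
print(round(odds_multiplier, 2), [round(v, 2) for v in odds_ci])
```

Exponentiating the interval endpoints gives an interval for the odds multiplier itself, which is often easier to communicate than the log-odds slope.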
21
Logistic regression with qualitative predictors: AIDS symptoms data
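Use indicator variables to represent the categories of the predictors
Model: $\text{logit}(\pi) = \alpha + \beta_1 x + \beta_2 z$, with $x = 1$ for AZT use ($0$ otherwise) and $z = 1$ for white subjects ($0$ for black subjects)
Logits implied by the indicator variables:
x   z   Logit
0   0   $\alpha$
1   0   $\alpha + \beta_1$
0   1   $\alpha + \beta_2$
1   1   $\alpha + \beta_1 + \beta_2$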
22
Logistic regression with qualitative predictors: AIDS symptoms data
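In the model $\text{logit}(\pi) = \alpha + \beta_1 x + \beta_2 z$, $\beta_1$ = difference between the two logits (i.e., the log odds ratio) for $x = 1$ versus $x = 0$ at a fixed category of $z$
Homogeneous association model: the odds ratio $e^{\beta_1}$ between $x$ and the response is the same at each category of $z$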
23
Equivalence of contingency table & logistic regression
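Here $\beta_1$ denotes the coefficient of the AZT indicator in the main-effects logistic model
Conditional independence: CMH test vs. test of $H_0: \beta_1 = 0$
Homogeneous association: Breslow-Day test vs. goodness-of-fit test of the main-effects model
Common odds ratio estimate: Mantel-Haenszel estimate vs. $\exp(\hat\beta_1)$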
24
Computer Output for Model with AIDS Symptoms Data
Log Likelihood  -

Analysis of Maximum Likelihood Estimates
Parameter   Estimate   Std Error   Wald Chi-Square   Pr > ChiSq
Intercept   -1.0736    0.2629                        <.0001
azt         -0.7195    0.2790      6.6507            0.0099
race        0.0555     0.2886      0.0370            0.8476

LR Statistics
Source   DF   Chi-Square   Pr > ChiSq
azt      1    6.87         0.0088
race     1    0.04         0.8473

Obs   race   azt   y    n     pi_hat   lower   upper
1     1      1     14   107
2     1      0     32   113
3     0      1     11   63
4     0      0     12   55
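The parameter estimates in this output can be reproduced without statistical software. The following is a minimal sketch of Newton-Raphson for grouped binomial data, assuming the 0/1 coding race = 1 for white and azt = 1 for AZT use (the coding is not shown in the output and is an assumption here); `solve` and `fit_logistic` are illustrative helper names.

```python
import math

# Grouped AZT data: (race, azt, y, n); race = 1 white, azt = 1 AZT use (assumed coding).
data = [(1, 1, 14, 107), (1, 0, 32, 113), (0, 1, 11, 63), (0, 0, 12, 55)]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def fit_logistic(rows, iterations=25):
    """Newton-Raphson ML fit; returns [intercept, azt effect, race effect]."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        score = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for race, azt, y, n in rows:
            x = (1.0, azt, race)
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(3):
                score[i] += x[i] * (y - n * p)                 # gradient of log-likelihood
                for j in range(3):
                    info[i][j] += x[i] * x[j] * n * p * (1 - p)  # information matrix
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

intercept, azt_effect, race_effect = fit_logistic(data)
print(round(intercept, 4), round(azt_effect, 4), round(race_effect, 4))
print(round(math.exp(azt_effect), 2))  # model-based common odds ratio for AZT
```

The exponentiated AZT coefficient plays the same role as the Mantel-Haenszel common odds ratio from the contingency-table analysis.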
25
Logistic regression with mixed predictors: Horseshoe crabs data
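Model: $\text{logit}(\pi) = \alpha + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x$, where $c_1, c_2, c_3$ are indicators for the colors medium light, medium, and medium dark, and $x$ is width
For color = medium light: $\text{logit}(\pi) = (\alpha + \beta_1) + \beta_4 x$
For color = medium: $\text{logit}(\pi) = (\alpha + \beta_2) + \beta_4 x$
For color = medium dark: $\text{logit}(\pi) = (\alpha + \beta_3) + \beta_4 x$
Controlling for width, each color coefficient compares that color with the baseline color (the one without an indicator)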
26
Computer Output for Model for Horseshoe Crabs Data
Parameter   Estimate   Std. Error   Likelihood Ratio 95% Confidence Limits   Chi-Square   Pr > ChiSq
intercept   -          2.7618       -           -7.5788                      21.20        <.0001
c1          1.3299     0.8525       -0.2738     3.1354                       2.43         0.1188
c2          1.4023     0.5484       0.3527      2.5260                       6.54         0.0106
c3          1.1061     0.5921       -0.0279     2.3138                       3.49         0.0617
width       0.4680     0.1055       0.2713      0.6870                       19.66        <.0001

LR Statistics
Source   DF   Chi-Square   Pr > ChiSq
width    1    26.40        <.0001
color    3    7.00         0.0720
27
Estimated probabilities for primary food choice
28
Logistic regression: polytomous response
Models categorical responses with more than two categories
Two approaches
  Use generalized logits for a nominal response
  Use cumulative logits for an ordinal response
Notation
  $J$: number of response categories
  $\{\pi_1, \ldots, \pi_J\}$: response probabilities, with $\sum_{j} \pi_j = 1$
29
Generalized logit model: nominal response
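Baseline-category logit: $\log(\pi_j / \pi_J)$, $j = 1, \ldots, J-1$
Pairs each category with a baseline category; here category $J$ is the baseline
Model with a predictor $x$: $\log(\pi_j / \pi_J) = \alpha_j + \beta_j x$, $j = 1, \ldots, J-1$
The effects $\beta_j$ vary according to the category paired with the baseline
These $J-1$ equations determine the equations for all other pairs of categories
E.g., for a pair of categories $(a, b)$: $\log(\pi_a / \pi_b) = \log(\pi_a / \pi_J) - \log(\pi_b / \pi_J) = (\alpha_a - \alpha_b) + (\beta_a - \beta_b)x$
Remark: the estimates for a given pair of categories are the same no matter which category is used as the baseline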
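The baseline-category equations determine all response probabilities, and from them the logit for any pair of categories. A minimal sketch with three categories follows; the parameter values are hypothetical and serve only to show the mechanics.

```python
import math

def baseline_category_probs(x, alphas, betas):
    """Response probabilities for J categories, with the last one as baseline.

    alphas, betas: parameters for categories 1..J-1; the baseline has both set to 0.
    """
    terms = [math.exp(a + b * x) for a, b in zip(alphas, betas)] + [1.0]
    total = sum(terms)
    return [t / total for t in terms]

alphas = [1.0, 4.0]     # hypothetical alpha_1, alpha_2
betas = [-0.2, -2.0]    # hypothetical beta_1, beta_2
x = 1.5

p = baseline_category_probs(x, alphas, betas)

# Logit for the non-baseline pair (1, 2), computed from the probabilities ...
logit_1_vs_2 = math.log(p[0] / p[1])
# ... equals the difference of the two baseline-category equations.
implied = (alphas[0] - alphas[1]) + (betas[0] - betas[1]) * x
print(logit_1_vs_2, implied)
```

Because every pairwise logit is a difference of baseline-category logits, the choice of baseline changes the parameterization but not the fitted probabilities.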
30
Example: Alligator food choice
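59 alligators sampled in Lake George, Florida
Response: primary food type found in the alligator's stomach: Fish (1), Invertebrate (2), Other (3, baseline category)
Predictor: alligator length $x$, which ranges from 1.24 to 3.89 m
ML prediction equations: $\log(\hat\pi_1/\hat\pi_3) = \hat\alpha_1 + \hat\beta_1 x$ and $\log(\hat\pi_2/\hat\pi_3) = \hat\alpha_2 + \hat\beta_2 x$
Larger alligators seem more likely to select fish than invertebrates
Independence test of food choice and length ($H_0: \beta_1 = \beta_2 = 0$): LRT (df = 2) with p-value = 0.0002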
31
Cumulative logit model: ordinal response
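Logit of a cumulative probability: $\text{logit}[P(Y \le j)] = \log\dfrac{P(Y \le j)}{1 - P(Y \le j)}$, $j = 1, \ldots, J-1$
Categories 1 to $j$ are combined, and categories $j+1$ to $J$ are combined
Cumulative logit (proportional odds) model with a predictor $x$: $\text{logit}[P(Y \le j)] = \alpha_j + \beta x$
The effect $\beta$ of $x$ is identical for all $J-1$ cumulative logits
Any one curve for $P(Y \le j)$ is identical to any of the others, shifted to the right or to the left
For two values $x_1$ and $x_2$, the log odds ratio $\text{logit}[P(Y \le j \mid x_1)] - \text{logit}[P(Y \le j \mid x_2)] = \beta(x_1 - x_2)$ is
  proportional to the difference between the $x$ values
  the same for each cumulative probability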
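The proportional odds property can be verified numerically: with a common slope, the log odds ratio between two predictor values is the same at every cutpoint. The cutpoint parameters and slope in this sketch are hypothetical.

```python
import math

def cumulative_probs(x, alphas, beta):
    """P(Y <= j) for j = 1..J-1 under logit[P(Y <= j)] = alpha_j + beta * x."""
    return [1.0 / (1.0 + math.exp(-(a + beta * x))) for a in alphas]

def category_probs(x, alphas, beta):
    """P(Y = j) for j = 1..J, obtained by differencing the cumulative probabilities."""
    cum = cumulative_probs(x, alphas, beta) + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def logit(p):
    return math.log(p / (1.0 - p))

alphas = [-1.0, 0.0, 1.5]   # hypothetical, increasing cutpoints (J = 4 categories)
beta = 0.8                  # hypothetical common slope
x1, x2 = 1.0, 0.0

# Log cumulative odds ratio at each cutpoint: all equal beta * (x1 - x2).
log_odds_ratios = [
    logit(c1) - logit(c2)
    for c1, c2 in zip(cumulative_probs(x1, alphas, beta),
                      cumulative_probs(x2, alphas, beta))
]
probs = category_probs(x1, alphas, beta)
print(log_odds_ratios, probs)
```

The cutpoints must be increasing so that the differenced category probabilities are all positive.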
32
Example: Political ideology & party affiliation
Response: political ideology on a five-point ordinal scale
Predictor: political party (Democratic, Republican)
Data set:
Political party   Very liberal   Slightly liberal   Moderate   Slightly conservative   Very conservative
Democratic        80             81                 171        41                      55
Republican        30             46                 148        84                      99
33
Example: Political ideology & party affiliation
Parameter inference
$\hat\beta = 0.975 > 0$: Democrats tend to be more liberal than Republicans
Wald = 57.1 (df = 1) with p-value < 0.0001: strong evidence of an association
95% CI for $\beta$ = (0.72, 1.23), or for $e^{\beta}$ = (2.1, 3.4): the odds of a response toward the liberal end of the scale are at least twice as high for Democrats as for Republicans
Goodness-of-fit test: the model fits adequately
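A quick descriptive check of the proportional odds assumption is to compute the sample cumulative odds ratio at each of the four cutpoints of the ideology scale. The sketch below uses the counts from the table (Democratic: 80, 81, 171, 41, 55; Republican: 30, 46, 148, 84, 99); the function name is illustrative, and these are raw sample ratios rather than model estimates.

```python
import math

democratic = [80, 81, 171, 41, 55]
republican = [30, 46, 148, 84, 99]

def sample_cumulative_odds_ratios(group1, group2):
    """Odds of Y <= j in group1 divided by the same odds in group2, for each cutpoint j."""
    ratios = []
    for j in range(1, len(group1)):
        odds1 = sum(group1[:j]) / sum(group1[j:])
        odds2 = sum(group2[:j]) / sum(group2[j:])
        ratios.append(odds1 / odds2)
    return ratios

ratios = sample_cumulative_odds_ratios(democratic, republican)
print([round(r, 2) for r in ratios])

# Exponentiating the reported 95% CI for beta gives the CI for the common odds ratio.
ci_for_odds_ratio = (math.exp(0.72), math.exp(1.23))
print([round(v, 1) for v in ci_for_odds_ratio])
```

The four sample ratios are of similar size, which is what a single common effect across the cumulative logits would lead one to expect.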
34
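Other logit forms for ordinal response categories
Adjacent-categories logit: $\log(\pi_{j+1} / \pi_j)$, $j = 1, \ldots, J-1$
  Adjacent-categories logits determine the logits for all pairs of response categories
Continuation-ratio logit
  Form 1: contrast each category with the grouping of categories from lower levels of the response scale: $\log\dfrac{\pi_1 + \cdots + \pi_j}{\pi_{j+1}}$, $j = 1, \ldots, J-1$
  Form 2: contrast each category with the grouping of categories from higher levels of the response scale: $\log\dfrac{\pi_j}{\pi_{j+1} + \cdots + \pi_J}$, $j = 1, \ldots, J-1$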
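The alternative logits are easy to compute from any vector of response probabilities. As a minimal sketch, the code below uses the sample proportions for Democrats from the political ideology table (80, 81, 171, 41, 55) and assumes the conventions $\log(\pi_{j+1}/\pi_j)$ for adjacent categories, lower grouping in the numerator for form 1, and higher grouping in the denominator for form 2; the direction of each ratio is a convention and may differ between texts.

```python
import math

counts = [80, 81, 171, 41, 55]
total = sum(counts)
p = [c / total for c in counts]          # sample proportions pi_1..pi_J

def adjacent_logits(p):
    """log(pi_{j+1} / pi_j) for j = 1..J-1."""
    return [math.log(p[j + 1] / p[j]) for j in range(len(p) - 1)]

def continuation_ratio_lower(p):
    """Form 1: log((pi_1 + ... + pi_j) / pi_{j+1}) for j = 1..J-1."""
    return [math.log(sum(p[:j]) / p[j]) for j in range(1, len(p))]

def continuation_ratio_higher(p):
    """Form 2: log(pi_j / (pi_{j+1} + ... + pi_J)) for j = 1..J-1."""
    return [math.log(p[j] / sum(p[j + 1:])) for j in range(len(p) - 1)]

adj = adjacent_logits(p)
cr_lower = continuation_ratio_lower(p)
cr_higher = continuation_ratio_higher(p)

# Adjacent-categories logits add up to the logit for any pair, e.g. (J, 1).
print(sum(adj), math.log(p[-1] / p[0]))
```

Summing consecutive adjacent-categories logits gives the logit for any pair of categories, which is the sense in which they determine all pairwise logits.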
35
Summary
Two-way contingency tables: RR, odds ratio, chi-square tests
Three-way contingency tables: conditional independence, homogeneous association, common odds ratio
Logistic regression: dichotomous response
Logistic regression: polytomous response