Published by Gilbert Bunts. Modified over 3 years ago.

1
Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013

2
Overview
Data Types
Contingency Tables
Logit Models
◦ Binomial
◦ Ordinal
◦ Nominal

3
Things not covered (but still fit into the topic)
Matched pairs/repeated measures
◦ McNemar’s Chi-Square
Reliability
◦ Cohen’s Kappa
◦ ROC
Poisson (Count) models
Categorical SEM
◦ Tetrachoric Correlation
Bernoulli Trials

4
Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative vs. Continuous/Quantitative
Nominal/Multinomial:
◦ Properties: Values arbitrary (no magnitude); no direction (no ordering)
◦ Example: Race: 1=AA, 2=Ca, 3=As
◦ Measures: Mode, relative frequency
Rank Order/Ordinal:
◦ Properties: Values semi-arbitrary (no magnitude?); have direction (ordering)
◦ Example: Likert Scales (LICK-URT): 1-5, Strongly Disagree to Strongly Agree
◦ Measures: Mode, relative frequency, median. Mean?
Binary/Dichotomous/Binomial:
◦ Properties: 2 levels; special case of Ordinal or Multinomial
◦ Examples: Gender (Multinomial), Disease (Y/N)
◦ Measures: Mode, relative frequency. Mean?

5
Contingency Tables
Often called Two-way tables or Cross-Tabs
Have dimensions I x J
Can be used to test hypotheses of association between categorical variables

2 x 3 Table: Gender by Age Group
Gender    <40 Years   40-50 Years   >50 Years
Female    25          68            63
Male      240         223           201

Code 1.1

6
Contingency Tables: Test of Independence
Chi-Square Test of Independence (χ²)
◦ Calculate χ²
◦ Determine DF: (I-1) * (J-1)
◦ Compare to the χ² critical value for the given DF.

2 x 3 Table: Gender by Age Group
Gender    <40 Years   40-50 Years   >50 Years   Total
Female    25          68            63          R1=156
Male      240         223           201         R2=664
Total     C1=265      C2=291        C3=264      N=820

$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
Where: O_i = Observed Freq, E_i = Expected Freq (row total * column total / N), n = number of cells in table
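The three steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the presentation's own code (the "Code 1.x" handouts are not reproduced here); the variable names are made up, and the p-value uses the closed form that holds only for DF = 2.

```python
import math

# Observed counts from the 2 x 3 table (rows: Female, Male; columns: <40, 40-50, >50).
observed = [
    [25, 68, 63],
    [240, 223, 201],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n_total for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over all cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (I - 1) * (J - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumption: the shortcut below is valid only when df == 2,
# where the chi-square upper-tail probability equals exp(-x / 2).
assert df == 2
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.2g}")
```

For tables with other dimensions, a statistics library's chi-square distribution would replace the DF = 2 shortcut.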

7
Contingency Tables: Test of Independence
Pearson Chi-Square Test of Independence (χ²)
◦ H0: No Association
◦ HA: Association….where, how?
Not appropriate when the expected cell frequency (E_i) is < 5
◦ Use Fisher’s Exact Test

2 x 3 Table: Gender by Age Group
Gender    <40 Years   40-50 Years   >50 Years   Total
Female    25          68            63          R1=156
Male      240         223           201         R2=664
Total     C1=265      C2=291        C3=264      N=820

Code 1.2

8
Contingency Tables 2x2
                         Disorder (Outcome)
Risk Factor/Exposure     Yes     No      Total
Yes                      a       b       a+b
No                       c       d       c+d
Total                    a+c     b+d     a+b+c+d

9
Contingency Tables: Measures of Association
                Depression
Alcohol Use     Yes     No      Total
Yes             a=25    b=10    35
No              c=20    d=45    65
Total           45      55      100

Probability: a/(a+b) = 25/35 = 0.714 for alcohol users; c/(c+d) = 20/65 = 0.308 for nonusers
Odds: a/b = 25/10 = 2.50 for alcohol users; c/d = 20/45 = 0.444 for nonusers
Contrasting Probability: Individuals who used alcohol were 2.32 times as likely to have depression as those who did not use alcohol.
Contrasting Odds: The odds of depression were 5.62 times greater in alcohol users compared to nonusers.
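The two contrasts on this slide follow directly from the four cell counts. A minimal sketch, with illustrative variable names that are not from the presentation:

```python
# Cell counts from the 2 x 2 table (rows: alcohol use yes/no; columns: depression yes/no).
a, b, c, d = 25, 10, 20, 45

# Probability of depression within each exposure group.
p_exposed = a / (a + b)
p_unexposed = c / (c + d)
relative_risk = p_exposed / p_unexposed

# Odds of depression within each exposure group.
odds_exposed = a / b
odds_unexposed = c / d
odds_ratio = odds_exposed / odds_unexposed  # same as (a * d) / (b * c)

print(f"P(dep | alcohol) = {p_exposed:.3f}, P(dep | no alcohol) = {p_unexposed:.3f}")
print(f"relative risk = {relative_risk:.2f}")
print(f"odds ratio    = {odds_ratio:.3f}")
```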

10
Why Odds Ratios?
                Depression
Alcohol Use     Yes     No        Total
Yes             a=25    b=10*i    25 + 10*i
No              c=20    d=45*i    20 + 45*i
Total           45      55*i      45 + 55*i

i = 1 to 45
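The slide's figure is not preserved, so the following is a sketch of one plausible reading of this table: as the non-depressed column is scaled by i, the odds ratio stays fixed while the ratio of probabilities drifts. The function name and the values of i printed are illustrative choices.

```python
def contrasts(i):
    """Return (relative risk, odds ratio) when the 'No depression' column is scaled by i."""
    a, b, c, d = 25, 10 * i, 20, 45 * i
    relative_risk = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return relative_risk, odds_ratio

results = {i: contrasts(i) for i in range(1, 46)}

for i in (1, 5, 15, 45):
    rr, odds_ratio = results[i]
    print(f"i = {i:2d}: RR = {rr:.3f}, OR = {odds_ratio:.3f}")
```

Under this reading, the odds ratio does not depend on how many non-cases are sampled, whereas the probability contrast does; as depression becomes rarer in the table, the relative risk moves toward the odds ratio.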

11
The Generalized Linear Model
General Linear Model (LM)
◦ Continuous Outcomes (DV)
◦ Linear Regression, t-test, Pearson correlation, ANOVA, ANCOVA
Generalized Linear Model (GLM)
◦ John Nelder and Robert Wedderburn
◦ Maximum Likelihood Estimation
◦ Continuous, Categorical, and Count outcomes.
◦ Distribution Family and Link Functions: error distributions that are not normal

12
Logistic Regression
“This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.)
Binary Response
Predicting Probability (related to the Probit model)
Assume (the usual):
◦ Independence
◦ NOT Homoscedasticity or Normal Errors
◦ Linearity (in the Log Odds)
◦ Also….adequate cell sizes.

13
Logistic Regression
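For reference, the standard textbook form of the binary logistic model is sketched below, written with the α (constant) and β (slope) symbols that the example tables use. This is the usual formulation, not a transcription of the slide's own equations.

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \alpha + \beta x,
\qquad
p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
```

Here p is the probability of the outcome (for example, Depressed = 1) and x is a single predictor; exponentiating β gives the odds ratio for a 1 unit increase in x.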

14
Logistic Regression: Example
The Output as Logits
◦ Logits: H0: β=0

Y=Depressed      Coef     SE      Z       P        CI
α (_constant)    -1.51    0.091   -16.7   <0.001   -1.69, -1.34

                 Freq.    Percent
Not Depressed    672      81.95
Depressed        148      18.05

Code 2.1

15
Logistic Regression: Example

Y=Depressed      OR       SE      Z       P        CI
α (_constant)    0.220    0.020   -16.7   <0.001   0.184, 0.263

                 Freq.    Percent
Not Depressed    672      81.95
Depressed        148      18.05

Code 2.2
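With no predictors, the constant is just the log of the sample odds, so the two intercept-only tables can be checked by hand from the frequency counts. A minimal sketch, assuming the usual Wald standard error and a 1.96 multiplier for the 95% interval (variable names are illustrative):

```python
import math

n_depressed, n_not_depressed = 148, 672

# Sample odds of depression and their log (the intercept on the logit scale).
odds = n_depressed / n_not_depressed
logit = math.log(odds)

# Wald standard error of a log-odds estimated from two counts.
se = math.sqrt(1 / n_depressed + 1 / n_not_depressed)
z = logit / se

# 95% confidence interval on the logit scale, then exponentiated to the odds scale.
ci_logit = (logit - 1.96 * se, logit + 1.96 * se)
ci_odds = tuple(math.exp(bound) for bound in ci_logit)

print(f"logit = {logit:.2f}, SE = {se:.3f}, Z = {z:.1f}")
print(f"CI (logit) = {ci_logit[0]:.2f}, {ci_logit[1]:.2f}")
print(f"odds = {odds:.3f}, CI (odds) = {ci_odds[0]:.3f}, {ci_odds[1]:.3f}")
```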

16
Logistic Regression: Example

Y=Depressed      Coef     SE      Z       P        CI
α (_constant)    -2.24    0.489   -4.58   <0.001   -3.20, -1.28
β (age)          0.013    0.009   1.52    0.127    -0.004, 0.030

AS LOGITS:
Interpretation: A 1 unit increase in age results in a 0.013 increase in the log-odds of depression.
Hmmmm….I have no concept of what a log-odds is. Interpret as something else.
Logit > 0, so as age increases the risk of depression increases.
OR = e^0.013 = 1.013
For a 1 unit increase in age, the odds of depression are multiplied by 1.013.
We could also say: For a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR-1)*100 = % change].

Code 2.3
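The conversions on this slide, plus the step from log-odds to a predicted probability, can be sketched as follows. The coefficients come from the table; the ages in the loop are hypothetical values chosen only for illustration.

```python
import math

alpha, beta_age = -2.24, 0.013  # coefficients from the table, on the logit scale

# Logit -> odds ratio -> percent change in the odds per 1 unit increase in age.
odds_ratio = math.exp(beta_age)
pct_change = (odds_ratio - 1) * 100

def predicted_probability(age):
    """Predicted probability of depression at a given age (inverse logit)."""
    log_odds = alpha + beta_age * age
    return 1 / (1 + math.exp(-log_odds))

print(f"OR = {odds_ratio:.3f}, change in odds = {pct_change:.1f}%")
for age in (30, 50, 70):  # hypothetical ages
    print(f"age {age}: p = {predicted_probability(age):.3f}")
```

Probabilities are often easier to communicate than log-odds, though here the age coefficient is not statistically significant (p = 0.127), so the differences across ages should be read with that in mind.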

17
Logistic Regression: GOF
Overall Model Likelihood-Ratio Chi-Square
◦ Omnibus test for the model
◦ Overall model fit? Relative to other models
◦ Compares specified model with Null model (no predictors)
◦ χ² = -2*(LL0 - LL1), where LL0 and LL1 are the log-likelihoods of the null and specified models; DF = K, the number of predictor parameters estimated

18
Logistic Regression: GOF (Summary Measures)
Pseudo-R²
◦ Not the same meaning as linear regression.
◦ There are many of them (Cox and Snell/McFadden)
◦ Only comparable within nested models of the same outcome.
Hosmer-Lemeshow
◦ Models with Continuous Predictors
◦ Is the model a better fit than the NULL model? χ²
◦ H0: Good Fit for Data, so we want p>0.05
◦ Order the predicted probabilities, group them (g=10) by quantiles, then run a Chi-Square of Group * Outcome. DF=g-2
◦ Conservative (rarely rejects the null)
Pearson Chi-Square
◦ Models with categorical predictors
◦ Similar to Hosmer-Lemeshow
ROC-Area Under the Curve
◦ Predictive accuracy/Classification

Code 2.4

19
Logistic Regression: GOF (Diagnostic Measures)
Outliers in Y (Outcome)
◦ Pearson Residuals: square root of the contribution to the Pearson χ²
◦ Deviance Residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated model vs fitted model.
Outliers in X (Predictors)
◦ Leverage (Hat Matrix/Projection Matrix): maps the influence of observed on fitted values
Influential Observations
◦ Pregibon’s Delta-Beta influence statistic
◦ Similar to Cook’s-D in linear regression
Detecting Problems
◦ Residuals vs Predictors
◦ Leverage vs Residuals
◦ Boxplot of Delta-Beta

Code 2.5

20
Logistic Regression: GOF

Y=Depressed      Coef     SE      Z       P        CI
α (_constant)    -2.24    0.489   -4.58   <0.001   -3.20, -1.28
β (age)          0.013    0.009   1.52    0.127    -0.004, 0.030

L-R χ² (df=1): 2.47, p=0.1162
H-L GOF: Number of Groups: 10; H-L Chi²: 7.12; DF: 8; P: 0.5233
McFadden’s R²: 0.0030
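The two p-values reported here can be reproduced from their chi-square statistics. The sketch below avoids external libraries by using closed forms that exist only for DF = 1 and for even DF, which happens to cover both tests on this slide; `chi2_sf` is an illustrative helper, not a library function. Small differences from the slide are expected because the statistics are rounded to two decimals.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic (df = 1 or even df only)."""
    if df == 1:
        # A chi-square with 1 df is a squared standard normal.
        return math.erfc(math.sqrt(x / 2))
    if df % 2 == 0:
        # For even df the tail is a finite Poisson-type sum.
        half = x / 2
        return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))
    raise ValueError("this sketch only handles df = 1 or even df")

p_lr = chi2_sf(2.47, 1)  # likelihood-ratio test, df = 1
p_hl = chi2_sf(7.12, 8)  # Hosmer-Lemeshow, df = g - 2 = 8

print(f"L-R p = {p_lr:.4f}")
print(f"H-L p = {p_hl:.4f}")
```

Read together: the L-R test does not show that age improves on the null model, and the H-L test does not reject good fit.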

21
Logistic Regression: Diagnostics
Linearity in the Log-Odds
◦ Use a lowess (loess) plot
◦ Depressed vs Age

Code 2.6

22
Logistic Regression: Example

Y=Depressed      OR       SE      Z       P        CI
α (_constant)    0.545    0.091   -3.63   <0.001   0.392, 0.756
β (male)         0.299    0.060   -5.99   <0.001   0.202, 0.444

AS OR:
Interpretation: The odds of depression for males are 0.299 times the odds for females.
We could also say: The odds of depression are (1-0.299=.701) 70.1% lower in males compared to females.
Or…why not just make males the reference so the OR is greater than 1.
Or we could just take the inverse and accomplish the same thing. 1/0.299 = 3.34.

Code 2.7
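Switching the reference category amounts to inverting the odds ratio, and the confidence limits invert and swap places along with it. A small sketch using the values from the table (variable names are illustrative):

```python
or_male_vs_female = 0.299
ci_male_vs_female = (0.202, 0.444)

# Percent reduction in the odds for males relative to females.
pct_lower = (1 - or_male_vs_female) * 100

# Inverting the OR makes males the reference; the CI bounds invert and swap.
or_female_vs_male = 1 / or_male_vs_female
ci_female_vs_male = (1 / ci_male_vs_female[1], 1 / ci_male_vs_female[0])

print(f"odds are {pct_lower:.1f}% lower in males")
print(f"OR (female vs male) = {or_female_vs_male:.2f}, "
      f"CI = {ci_female_vs_male[0]:.2f}, {ci_female_vs_male[1]:.2f}")
```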

23
Ordinal Logistic Regression
Also called Ordered Logistic or Proportional Odds Model
Extension of Binary Logistic Model
>2 Ordered responses
New Assumption!
◦ Proportional Odds
◦ BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
◦ The predictor’s effect on the outcome is the same across levels of the outcome.
◦ Bmi3grp (1 vs 2,3) = B(age)
◦ Bmi3grp (1,2 vs 3) = B(age)

24
Ordinal Logistic Regression
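For reference, one common way to write the proportional odds model is sketched below. The sign convention is an assumption: it is the one in which cut points (thresholds) are subtracted from, and a positive β pushes toward higher categories, which matches how the example output is interpreted. The symbols κ_j for the cut points are introduced here for illustration.

```latex
\operatorname{logit} P(Y \le j \mid x)
  = \ln\!\left(\frac{P(Y \le j \mid x)}{P(Y > j \mid x)}\right)
  = \kappa_j - \beta x,
\qquad j = 1, \dots, J-1
```

The same β appears for every j, which is the proportional odds assumption; only the cut points κ_j differ across the J-1 cumulative splits.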

25
Ordinal Logistic Regression Example

AS LOGITS:
Y=bmi3grp           Coef     SE       Z       P        CI
β1 (age)            -0.026   0.006    -4.15   <0.001   -0.038, -0.014
β2 (blood_press)    0.012    0.005    2.48    0.013    0.002, 0.021
Threshold1/cut1     -0.696   0.6678                    -2.004, 0.613
Threshold2/cut2     0.773    0.6680                    -0.536, 2.082
For a 1 unit increase in Blood Pressure there is a 0.012 increase in the log-odds of being in a higher bmi category.

AS OR:
Y=bmi3grp           OR       SE       Z       P        CI
β1 (age)            0.974    0.006    -4.15   <0.001   0.962, 0.986
β2 (blood_press)    1.012    0.005    2.48    0.013    1.002, 1.022
Threshold1/cut1     -0.696   0.6678                    -2.004, 0.613
Threshold2/cut2     0.773    0.6680                    -0.536, 2.082
For a 1 unit increase in Blood Pressure the odds of being in a higher bmi category are 1.012 times greater.

Code 3.1
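To see how the coefficients and cut points combine, the sketch below turns them into predicted probabilities for the three BMI groups. It assumes the parameterization logit P(Y ≤ j) = cut_j − xb, which the "cut1/cut2" labels suggest but the slides do not state; the age and blood pressure values are hypothetical.

```python
import math

def logistic(z):
    """Inverse logit."""
    return 1 / (1 + math.exp(-z))

# Estimates from the logit-scale table.
beta_age, beta_bp = -0.026, 0.012
cut1, cut2 = -0.696, 0.773

def category_probabilities(age, blood_press):
    """Return P(normal), P(overweight), P(obese) under the assumed parameterization."""
    xb = beta_age * age + beta_bp * blood_press
    p_le_1 = logistic(cut1 - xb)  # P(Y <= 1)
    p_le_2 = logistic(cut2 - xb)  # P(Y <= 2)
    return p_le_1, p_le_2 - p_le_1, 1 - p_le_2

# Hypothetical person: age 50, blood pressure 120.
probs = category_probabilities(50, 120)
print("normal = {:.3f}, overweight = {:.3f}, obese = {:.3f}".format(*probs))
```

Because β2 is positive, raising blood pressure with age held fixed shifts probability toward the higher BMI categories.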

26
Ordinal Logistic Regression: GOF
Assessing Proportional Odds Assumptions
◦ Brant Test of Parallel Regression: H0: Proportional Odds, thus want p>0.05. Tests each predictor separately and overall.
◦ Score Test of Parallel Regression: H0: Proportional Odds, thus want p>0.05
◦ Approx Likelihood-ratio test: H0: Proportional Odds, thus want p>0.05

Code 3.2

27
Ordinal Logistic Regression: GOF
Pseudo R²
Diagnostic Measures
◦ Performed on the j-1 binomial logistic regressions

Code 3.3

28
Multinomial Logistic Regression
Also called multinomial logit/polytomous logistic regression.
Same assumptions as the binary logistic model
>2 non-ordered responses
◦ Or you’ve failed to meet the proportional odds (parallel regression) assumption of the Ordinal Logistic model

29
Multinomial Logistic Regression

30
Multinomial Logistic Regression Example
Does degree of supernatural belief indicate a religious preference?

Y=religion (ref=Catholic(1))   OR       SE      Z       P        CI
Protestant (2)
  β (supernatural)             1.126    0.090   1.47    0.141    0.961, 1.317
  α (_constant)                1.219    0.097   2.49    0.013    1.043, 1.425
Evangelical (3)
  β (supernatural)             1.218    0.117   2.06    0.039    1.010, 1.469
  α (_constant)                0.619    0.059   -5.02   <0.001   0.512, 0.746

AS OR:
For a 1 unit increase in supernatural belief, there is a 21.8% increase [(OR-1)*100 = % change] in the odds of being an Evangelical compared to Catholic.

Code 4.1
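The odds ratios in this table can be turned into predicted probabilities for all three groups. The sketch below relies on the fact that exp(α + βx) equals OR_constant * OR_β^x and fixes the reference category (Catholic) at odds of 1. The supernatural-belief scores used are hypothetical, since the scale is not described on the slide.

```python
# Odds ratios from the table; Catholic is the reference category.
estimates = {
    "Protestant": {"constant": 1.219, "supernatural": 1.126},
    "Evangelical": {"constant": 0.619, "supernatural": 1.218},
}

def religion_probabilities(supernatural):
    """Predicted probability of each category at a given supernatural-belief score."""
    odds = {"Catholic": 1.0}  # reference category
    for name, est in estimates.items():
        # exp(alpha + beta * x) written in terms of the reported odds ratios.
        odds[name] = est["constant"] * est["supernatural"] ** supernatural
    total = sum(odds.values())
    return {name: value / total for name, value in odds.items()}

for score in (0, 1, 2):  # hypothetical scores
    probs = religion_probabilities(score)
    print(score, {name: round(p, 3) for name, p in probs.items()})
```

The ratio of Evangelical to Catholic probability grows by a factor of 1.218 for each 1 unit increase in the score, which is the 21.8% increase in the odds described above.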

31
Multinomial Logistic Regression GOF
Limited GOF tests.
◦ Look at LR Chi-square and compare nested models.
◦ “Essentially, all models are wrong, but some are useful” –George E.P. Box
Pseudo R²
Similar to Ordinal
◦ Perform tests on the j-1 binomial logistic regressions

32
Resources
“Categorical Data Analysis” by Alan Agresti
UCLA Stat Computing: http://www.ats.ucla.edu/stat/
