1
Logistic Regression and Discriminant Function Analysis
2
Logistic Regression vs. Discriminant Function Analysis
Similarities: Both predict group membership for each observation (classification); both involve a dichotomous DV; both require an estimation sample and a validation sample to assess predictive accuracy; and if the split between groups is not more extreme than 80/20, the two yield similar results in practice.
3
Logistic Reg vs. Discrim: Differences
Discriminant Analysis: Assumes MV normality; assumes equality of VCV matrices; a large number of predictors tends to violate MV normality and cannot easily be accommodated; predictors must be continuous, interval level; more powerful when its assumptions are met, but the many assumptions are rarely met in practice; categorical IVs create problems. Logistic Regression: No assumption of MV normality; no assumption of equality of VCV matrices; can accommodate large numbers of predictors more easily; categorical predictors are OK (e.g., dummy codes); less powerful when the discriminant-analysis assumptions are met, but its few assumptions are typically met in practice; categorical IVs can be dummy coded.
4
Logistic Regression Outline:
Categorical Outcomes: Why not OLS Regression? General Logistic Regression Model Maximum Likelihood Estimation Model Fit Simple Logistic Regression
5
Categorical Outcomes: Why not OLS Regression?
Dichotomous outcomes: Passed / Failed CHD / No CHD Selected / Not Selected Quit/ Did Not Quit Graduated / Did Not Graduate
6
Categorical Outcomes: Why not OLS Regression?
Example: Relationship b/w performance and turnover Line of best fit?! Errors (Y-Y’) across values of performance (X)?
7
Problems with Dichotomous Outcomes/DVs
The regression surface is intrinsically non-linear; errors can assume only one of two possible values, violating the assumption of normally distributed errors; the assumption of homoscedasticity is violated; predicted values of Y greater than 1 and smaller than 0 can be obtained; and the true magnitude of the effects of the IVs may be greatly underestimated. Solution: Model the data using logistic regression, NOT OLS regression.
8
Logistic Regression vs. Regression
Logistic regression predicts the probability that an event will occur; the range of possible responses is between 0 and 1, so an s-shaped curve must be used to fit the data. OLS regression assumes linear relationships and cannot fit an s-shaped curve; applying it here violates the normality assumption and creates heteroscedasticity.
9
Example: Relationship b/w Age and CHD (1 = Has CHD)
10
General Logistic Regression Model
Y' (the outcome variable) is the probability of having one outcome or another, based on a nonlinear function of the best linear combination of predictors. Where: Y' = probability of an event. The linear portion of the equation (a + b1X1) is used to predict the probability of the event (0, 1); it is not an end in itself.
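The equation this slide refers to did not survive in the text; presumably it is the standard logistic model. A reconstruction, writing a for the intercept and b1 ... bk for the predictor weights:

Y' = P(Y = 1) = e^{a + b_1X_1 + ... + b_kX_k} / ( 1 + e^{a + b_1X_1 + ... + b_kX_k} )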
11
The logistic (logit) transformation
The DV is dichotomous, and the purpose is to estimate the probability of occurrence (0, 1). Thus, the DV is transformed into a likelihood; the logit/logistic transformation accomplishes this (the linear regression equation predicts the log of the odds), as reconstructed below.
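A reconstruction of the logit transformation the slide describes, writing P for the probability of the event:

logit(P) = ln[ P / (1 - P) ] = a + b_1X_1 + ... + b_kX_k

The log odds (logit) is a linear function of the predictors, which is what allows a regression-style equation to be estimated.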
12
Probability Calculation
Where: the relation b/w logit(P) and X is intrinsically linear; b = expected change in logit(P) given a one-unit change in X; a = intercept; e = the exponential constant (the base of the natural logarithm).
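The probability equation this "Where:" list refers to is presumably the single-predictor logistic function; solving the logit equation for P gives:

P = e^{a + bX} / ( 1 + e^{a + bX} ) = 1 / ( 1 + e^{-(a + bX)} )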
13
Ordinary Least Squares (OLS) Estimation
The purpose is to obtain the estimates that best minimize the sum of squared errors, sum(Y - Y')^2. The estimates chosen best describe the relationships among the observed variables (IVs and DV). The estimates chosen maximize the probability of obtaining the observed data (i.e., these are the population values most likely to have produced the data at hand).
14
Maximum Likelihood (ML) estimation
OLS can't be used in logistic regression because of the non-linear nature of the relationships. In ML, the purpose is to obtain the parameter estimates most likely to have produced the data; ML estimators are those with the greatest joint likelihood of reproducing the data. In logistic regression, each model yields an ML joint probability (likelihood) value. Because this value tends to be very small, it is transformed by taking -2 times its natural log (-2LL). The -2 log transformation also yields a statistic with a known distribution (the chi-square distribution).
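A reconstruction of the quantities involved, assuming n independent cases with observed outcomes y_i (0 or 1) and model-implied probabilities p_i:

L = prod_i  p_i^{y_i} (1 - p_i)^{1 - y_i}
-2LL = -2 sum_i [ y_i ln(p_i) + (1 - y_i) ln(1 - p_i) ]

Differences between -2LL values for nested models follow a chi-square distribution, which is what makes the model comparisons on the following slides possible.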
15
Model Fit In Logistic Regression, R & R2 don’t make sense
Evaluate model fit using the -2 log likelihood (-2LL) value obtained for each model (through ML estimation). The -2LL value reflects the fit of the model and is used to compare the fit of nested models. -2LL measures lack of fit, i.e., the extent to which the model fits the data poorly; when the model fits the data perfectly, -2LL = 0. Ideally, the -2LL value for the null model (i.e., the model with no predictors, or "intercept-only" model) will be larger than that for the model with predictors.
16
Comparing Model Fit: The fit of the null model can be tested against the fit of the model with predictors using a chi-square test (reconstructed below). Where: chi-square = the improvement in model fit, with df equal to the difference in the number of parameters between the two models; -2LL_M0 = the -2 log likelihood value for the null model (intercept-only model); -2LL_M1 = the -2 log likelihood value for the hypothesized model. The same test can be used to compare a nested model with k predictor(s) to a model with k+1 predictors, etc. The logic is the same as in OLS regression, but the models are compared using a different fit index (-2LL).
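A reconstruction of the test statistic the slide describes:

chi^2 = ( -2LL_M0 ) - ( -2LL_M1 )

with df equal to the number of additional parameters estimated in the hypothesized model.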
17
Pseudo R2 Assessment of overall model fit Calculation
Two primary pseudo R2 statistics: Nagelkerke, which is less conservative and preferred by some because its maximum = 1, and Cox & Snell, which is more conservative. Interpret them like R2 in OLS regression.
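A reconstruction of the usual formulas, where L(0) is the likelihood of the null model, L(M) the likelihood of the fitted model, and n the sample size:

Cox & Snell R^2 = 1 - [ L(0) / L(M) ]^{2/n}
Nagelkerke R^2 = Cox & Snell R^2 / ( 1 - L(0)^{2/n} )

Nagelkerke rescales the Cox & Snell value so that a perfectly fitting model can reach 1.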
18
Unique Prediction: In OLS regression, the significance tests for the beta weights indicate whether each IV is a unique predictor. In logistic regression, the Wald test is used for the same purpose.
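A reconstruction of the Wald statistic as usually computed, where B is a logistic regression coefficient and SE_B its standard error:

Wald = ( B / SE_B )^2

evaluated against the chi-square distribution with 1 df (SPSS reports this squared form; some programs report the equivalent z = B / SE_B).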
19
Similarities to Regression
You can use all of the following procedures you learned about in OLS regression in logistic regression as well: dummy coding for categorical IVs; hierarchical entry of variables (compare changes in % classification and the significance of the Wald tests); stepwise entry (but don't use it, it's atheoretical); and moderation tests.
20
Simple Logistic Regression Example
Data collected from 50 employees Y = success in training program (1 = pass; 0 = fail) X1 = Job aptitude score (5 = very high; 1= very low) X2 = Work-related experience (months)
21
Syntax in SPSS (PASS is the DV; APT and EXPER are the IVs):
LOGISTIC REGRESSION PASS
  /METHOD = ENTER APT EXPER
  /SAVE = PRED PGROUP
  /CLASSPLOT
  /PRINT = GOODFIT
  /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
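For readers who prefer to check the same model outside SPSS, here is a minimal sketch in Python using statsmodels. It is not part of the original slides; the file name training_data.csv is hypothetical, and the column names simply mirror the variables in the SPSS syntax above.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical file holding the 50 cases described above:
# PASS (1 = pass, 0 = fail), APT (aptitude, 1-5), EXPER (months of experience)
df = pd.read_csv("training_data.csv")

X = sm.add_constant(df[["APT", "EXPER"]])   # predictors plus an intercept column
fit = sm.Logit(df["PASS"], X).fit()         # maximum likelihood estimation
print(fit.summary())                        # B weights and Wald-type z tests
print("-2LL:", -2 * fit.llf)                # the -2 log likelihood fit index
print("Odds ratios:", np.exp(fit.params))   # Exp(B)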
22
Results: Block 0 (The Null Model) and Block 1 (Method = Enter)
Block 0 is the null model; you can't do any worse than this. Block 1 (Method = Enter) tests the model of interest; interpret the results from here. The omnibus chi-square tests whether the model is significantly better than the null model; a significant chi-square means yes. Step, Block & Model yield the same results because all IVs were entered in the same block.
23
Results Continued: -2 Log Likelihood is an index of fit; a smaller number means better fit (perfect fit = 0). Pseudo R2: interpret like R2 in regression. Nagelkerke is preferred by some because its maximum = 1; Cox & Snell is the uniformly more conservative estimate.
24
Classification: Null Model vs. Model Tested
Null model: 52% correct classification. Model tested: 72% correct classification.
25
Variables in the Equation: B = the effect of a one-unit change in the IV on the log odds (hard to interpret). The Odds Ratio (OR), Exp(B) in SPSS, is more interpretable; a one-unit change in aptitude multiplies the odds of passing by 1.7. Wald: like a t test, but uses the chi-square distribution. Sig. indicates whether the Wald test is significant.
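The odds ratio follows directly from the model; with the odds defined as P / (1 - P):

OR = Exp(B) = e^{B}

so each one-unit increase in the predictor multiplies the odds of the event by e^{B} (about 1.7 for aptitude in this example, per the output the slide describes).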
26
Histogram of Predicted Probabilities
27
To Flag Misclassified Cases
SPSS syntax: COMPUTE PRED_ERR=0. IF (LOW NE PGR_1) PRED_ERR=1. (PGR_1 is the predicted group membership saved by the /SAVE subcommand above.) You can use this flag in additional analyses to explore the causes of misclassification.
28
Results Continued: An index of model fit. The chi-square compares the observed events with the events predicted by the model. A non-significant result means that the observed and expected values are similar, which is good.
29
Hierarchical Logistic Regression
Question: Which of the following variables predict whether a woman is hired to be a Hooters girl? Age IQ Weight
30
Simultaneous v. Hierarchical
Hierarchical: Block 1 = IQ (Cox & Snell .002; Nagelkerke .003); Block 2 adds Age (Cox & Snell .264; Nagelkerke .353); Block 3 adds Weight (Cox & Snell .296; Nagelkerke .395). Simultaneous: Block 1 = IQ, Age, Weight entered together.
31
Simultaneous v. Hierarchical
Block 1. IQ Block 1. IQ, Age, Weight Block 2. Age Block 3. Weight
32
Simultaneous v. Hierarchical
Block 1. IQ Block 1. IQ, Age, Weight Block 2. Age Block 3. Weight
33
Multinomial Logistic Regression
A form of logistic regression that allows prediction of the probability of membership in more than 2 groups. It is based on a multinomial distribution and is sometimes called polytomous logistic regression. It conducts an omnibus test first for each predictor across 3+ groups (like ANOVA), then conducts pairwise comparisons (like post hoc tests in ANOVA).
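A reconstruction of the underlying (baseline-category) model: with one group chosen as the reference category, a separate logistic equation is estimated for each other group j, and the probabilities are

P(Y = j) = e^{a_j + b_jX} / ( 1 + sum_k e^{a_k + b_kX} )

where the sum runs over the non-reference groups; the reference group receives the remaining probability.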
34
Objectives of Discriminant Analysis
Determining whether significant differences exist between average scores on a set of variables for 2+ a priori defined groups Determining which IVs account for most of the differences in average score profiles for 2+ groups Establishing procedures for classifying objects into groups based on scores on a set of IVs Establishing the number and composition of the dimensions of discrimination between groups formed from the set of IVs
35
Discriminant Analysis
Discriminant analysis develops a linear combination that can best separate groups; in a sense, it is the opposite of MANOVA. In MANOVA, groups are usually constructed by the researcher and have clear structure (e.g., a 2 x 2 factorial design): groups = IVs. In discriminant analysis, the groups usually have no particular structure and their formation is not under experimental control: groups = DVs.
36
How Discrim Works Linear combinations (discriminant functions) are formed that maximize the ratio of between-groups variance to within-groups variance for a linear combination of predictors. Total # discriminant functions = # groups – 1 OR # of predictors (whichever is smaller) If more than one discriminant function is formed, subsequent discriminant functions are independent of prior combinations and account for as much remaining group variation as possible.
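A standard way to write this criterion: the weights w of each discriminant function are chosen to maximize

lambda = ( w' B w ) / ( w' W w )

where B and W are the between-groups and within-groups sums-of-squares-and-cross-products matrices; lambda is the eigenvalue reported for that function in the output.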
37
Assumptions in Discrim
Multivariate normality of the IVs (violation is more problematic if the groups overlap); homogeneity of VCV matrices; linear relationships; IVs continuous (interval scale); nominal IVs can be accommodated but violate MV normality; a single categorical DV. Results are influenced by outliers (classification may be wrong) and multicollinearity (interpretation of the coefficients becomes difficult).
38
Sample Size Considerations
Observations per predictor: suggested, 20 observations per predictor; minimum required, 5 observations per predictor. Observations per group (in the DV): minimum, the smallest group size must exceed the # of IVs; practical guide, each group should have 20+ observations. Wide variation in group size affects the results (i.e., classification can be incorrect).
39
Example In this hypothetical example, data from 500 graduate students seeking jobs were examined. Available for each student were three predictors: GRE(V+Q), Years to Finish the Degree, and Number of Publications. The outcome measure was categorical: “Got a job” versus “Did not get a job.” Half of the sample was used to determine the best linear combination for discriminating the job categories. The second half of the sample was used for cross-validation.
40
DISCRIMINANT
  /GROUPS=job(1 2)
  /VARIABLES=gre pubs years
  /SELECT=sample(1)
  /ANALYSIS ALL
  /SAVE=CLASS SCORES PROBS
  /PRIORS SIZE
  /STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW CORR COV GCOV TCOV TABLE CROSSVALID
  /PLOT=COMBINED SEPARATE MAP
  /PLOT=CASES
  /CLASSIFY=NONMISSING POOLED .
47
Interpreting Output: Box's M, Eigenvalues, Wilks' Lambda, Discriminant Weights, Discriminant Loadings
49
Violates Assumption of Homogeneity of VCV matrices
Violates Assumption of Homogeneity of VCV matrices. But this test is sensitive in general and sensitive to violations of multivariate normality too. Tests of significance in discriminant analysis are robust to moderate violations of the homogeneity assumption.
51
Discriminant Weights and Discriminant Loadings: data from both of these outputs indicate that one of the predictors best discriminates who did/did not get a job. Which one is it?
52
This is the raw canonical discriminant function.
The means for the groups on the raw canonical discriminant function can be used to establish cut-off points for classification.
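A common rule for setting the cut-off (a sketch; the slides do not show the exact formula used): for two groups with n_A and n_B cases and centroids z_A and z_B on the discriminant function, the optimal cutting score weights each centroid by the other group's size:

Z_cut = ( n_A z_B + n_B z_A ) / ( n_A + n_B )

which reduces to the simple midpoint ( z_A + z_B ) / 2 when the groups are of equal size.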
53
Classification can be based on distance from the group centroids and take into account information about prior probability of group membership.
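One standard way priors enter the rule, assuming equal covariance matrices (a sketch, not necessarily the exact computation used here): a case with score vector x is assigned to the group k that maximizes

ln(pi_k) - (1/2) D_k^2

where pi_k is the prior probability of group k (set by /PRIORS SIZE to the observed group proportions) and D_k^2 is the Mahalanobis distance from x to the centroid of group k.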
55
Two modes?
57
Violation of the homogeneity assumption can affect the classification
Violation of the homogeneity assumption can affect the classification. To check, the analysis can be conducted using separate group covariance matrices.
58
No noticeable change in the accuracy of classification.
59
Discriminant Analysis: Three Groups
The group that did not get a job was actually composed of two subgroups—those that got interviews but did not land a job and those that were never interviewed. This accounts for the bimodality in the discriminant function scores. The discriminant analysis of the three groups allows for the derivation of one more discriminant function, perhaps indicating the characteristics that separate those who get interviews from those who don’t, or, those who have successful interviews from those whose interviews do not produce a job offer.
60
Remember this? Two modes?
63
DISCRIMINANT
  /GROUPS=group(1 3)
  /VARIABLES=gre pubs years
  /SELECT=sample(1)
  /ANALYSIS ALL
  /SAVE=CLASS SCORES PROBS
  /PRIORS SIZE
  /STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW CORR COV GCOV TCOV TABLE CROSSVALID
  /PLOT=COMBINED SEPARATE MAP
  /PLOT=CASES
  /CLASSIFY=NONMISSING POOLED .
65
Separating the three groups produces better homogeneity of VCV matrices.
Still significant, but just barely. Not enough to worry about.
66
Two significant linear combinations can be derived, but they are not of equal importance.
67
Weights and Loadings: what do the linear combinations mean now?
69
[Figure: plot of the three group centroids (unemployed, interview only, got a job) in the space defined by discriminant function DF1 (horizontal axis, roughly -4 to +4) and DF2 (vertical axis).]
70
[Figure: the same DF1 x DF2 plot of the unemployed, interview only, and got a job centroids, shown alongside the weights and loadings.]
Weights:
                 DF1      DF2
No. Pubs       -1.246     .521
Yrs to finish   1.032     .602
GRE              .734     .194
Loadings:
                 DF1      DF2
No. Pubs        -.466     .867
Yrs to finish    .401     .796
GRE              .008     .354
71
[Figure: DF1 x DF2 plot of the unemployed, interview only, and got a job centroids.] This figure shows that discriminant function #1, which is made up of number of publications and years to finish, reliably differentiates between those who got jobs, had interviews only, and had no job or interview. Specifically, a high value on DF1 was associated with not getting a job, suggesting that having few publications (loading = -.466) and taking a long time to finish (loading = .401) was associated with not getting a job.
74
Territorial Map
[Figure: Canonical Discriminant Function 1 (horizontal) by Canonical Discriminant Function 2 (vertical), axes running roughly -6 to +6, showing the classification regions for the Unemployed, Got a Job, and Interview Only groups; * indicates a group centroid.]
76
Classification A classification function is derived for each group. The original data are used to estimate a classification score for each person, for each group. The person is then assigned to the group that produces the largest classification score.
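In SPSS output these appear as Fisher's linear classification function coefficients; a sketch of their general form for group k, with predictors X_1 ... X_p:

C_k = c_{k0} + c_{k1}X_1 + c_{k2}X_2 + ... + c_{kp}X_p

Each case's predictor values are plugged into every group's function, and the case is assigned to the group with the largest C_k.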
78
Is the classification better than would be expected by chance
Is the classification better than would be expected by chance? Observed classification (actual group by predicted group):
Actual \ Predicted   Unemployed   Got a Job   Interview Only   Total
Unemployed               51            0            3            54
Got a Job                 0           51           20            71
Interview Only            0           13          112           125
Total                    51           64          135           250
79
Expected classification by chance E = (Row x Column)/Total N
Actual \ Predicted   Unemployed        Got a Job         Interview Only     Total
Unemployed           (51 x 54)/250     (64 x 54)/250     (135 x 54)/250      54
Got a Job            (51 x 71)/250     (64 x 71)/250     (135 x 71)/250      71
Interview Only       (51 x 125)/250    (64 x 125)/250    (135 x 125)/250    125
Total                     51                64               135            250
80
Correct classification that would occur by chance:
Actual \ Predicted   Unemployed   Got a Job   Interview Only   Total
Unemployed             11.016       13.824        29.16          54
Got a Job              14.484       18.176        38.34          71
Interview Only         25.5         32            67.5          125
Total                  51           64           135            250
81
The difference between the chance-expected and the actual classification can be tested with a chi-square as well, with degrees of freedom = (# groups - 1)^2, so here df = 4.
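A reconstruction of the test, summing over the cells of the classification table (O = observed count, E = chance-expected count from the table above):

chi^2 = sum ( O - E )^2 / E,   df = (number of groups - 1)^2 = 4

With these counts, the large diagonal excesses (e.g., 51 observed vs. 11.016 expected correctly classified unemployed cases) make the chi-square far exceed its critical value, so classification is clearly better than chance.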