# Logistic regression.


Logistic regression

Regression
Regression is a set of techniques for exploiting the presence of statistical ASSOCIATIONS among variables to make predictions of values of one variable (the DV, TARGET or CRITERION) from knowledge of the values of other variables (the IVs or REGRESSORS).

Simple and multiple regression
In SIMPLE regression, there is just one IV. In MULTIPLE regression, there are two or more IVs.

Simple regression

An example
In a study of the effects of media violence, children were measured on their Actual violence and their Exposure to screen violence. Here is a scatterplot of Actual violence against Exposure. Each point represents the scores of an individual child.

The regression line is drawn through the points

The ‘best-fitting’ line
The regression line of Actual violence upon Exposure is the ‘best-fitting’ line according to what is known as the LEAST SQUARES criterion.

General form of the simple regression equation
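In the usual textbook notation, the simple regression equation can be written as follows. The symbols b_0 (intercept) and b_1 (slope) are assumed here, chosen to agree with the b1 used for the slope later in these slides; Y′ is the predicted value of Y.

```latex
Y' = b_0 + b_1 X
```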

Estimates
For a given value of the independent variable X, the corresponding point on the regression line, Y′, serves as an ESTIMATE of the true value of the target variable Y. We know the true value of Y in this particular data set; but we are really interested in the question of how well the Actual violence of children IN GENERAL can be predicted from knowledge of their exposure to media violence.

Parameters
A question about children IN GENERAL is a question about the characteristics of a POPULATION, not those of one particular data set. Such questions are about PARAMETERS, not STATISTICS.

Regression parameters
To be in a position to predict values of Y from X in the future, we assume that, in the population, there is a linear relationship between Actual violence and Exposure. The value of the slope and intercept we have calculated from our own data are ESTIMATES of the corresponding population parameters.

John’s scores
John scored 9 on Exposure and 8 on Actual. John’s predicted score from regression, Y′, is the point on the line above the value 9 on the x-axis. The error in prediction is Y – Y′, a quantity known as the RESIDUAL score e. John’s residual score is shown.
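A minimal sketch of how a least-squares line, a predicted score and a residual could be computed. The data are made up for illustration (they are not the scores from the study), and the function name `fit_line` is hypothetical; only the case scoring 9 on Exposure and 8 on Actual mirrors John.

```python
# Hypothetical (Exposure, Actual violence) scores; NOT the data from the study.
exposure = [2, 4, 5, 7, 9, 10]
actual = [1, 3, 6, 5, 8, 9]


def fit_line(x, y):
    """Return (intercept b0, slope b1) of the least-squares line of y upon x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = fit_line(exposure, actual)
predicted = [b0 + b1 * xi for xi in exposure]
residuals = [yi - pi for yi, pi in zip(actual, predicted)]

# The fifth case scored 9 on Exposure and 8 on Actual, like John.
john_predicted = b0 + b1 * exposure[4]
john_residual = actual[4] - john_predicted
print(f"Y' = {b0:.2f} + {b1:.2f} X; John's residual e = {john_residual:.2f}")
```

The residuals from a least-squares line with an intercept always sum to zero, and any other slope or intercept gives a larger sum of squared residuals.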

Goodness-of-fit: The LEAST-SQUARES criterion

Ordinary least-squares (OLS) regression
This approach to regression is known as ORDINARY LEAST SQUARES (OLS) regression. There are other kinds of regression (such as LOGISTIC REGRESSION, today’s topic) that do not work in this way.

Regression and correlation
Regression and correlation are two sides of the same associative coin. The stronger the association, the narrower will be the elliptical scatterplot, the higher will be the value of the correlation coefficient and the smaller will be the residuals from regression. THE CORRELATION AND THE REGRESSION COEFFICIENT ALWAYS HAVE THE SAME SIGN. For fixed values of the variances of X and Y, the greater the value of r, the steeper will be the slope of the regression line, i.e., the greater will be the value of b₁. The slope of the regression line b₁ and r are related according to …
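The usual textbook form of this relation is b₁ = r × (s_Y / s_X), where s_X and s_Y are the standard deviations of X and Y; that the slide wrote it in exactly this form is an assumption. A quick numerical check on made-up data:

```python
import math
import statistics

# Hypothetical data, for illustration only.
x = [2, 4, 5, 7, 9, 10]
y = [1, 3, 6, 5, 8, 9]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)        # Pearson correlation
slope = sxy / sxx                     # least-squares slope b1
slope_from_r = r * statistics.stdev(y) / statistics.stdev(x)

print(f"r = {r:.3f}, b1 = {slope:.3f}, r * sY/sX = {slope_from_r:.3f}")
```

Because both standard deviations are positive, the slope and r necessarily share the same sign.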

The coefficient of determination (r²)
The square of the Pearson correlation is known as the COEFFICIENT OF DETERMINATION. It is so-called because r² is the proportion of the variance of Y that is accounted for by regression upon X.

Coefficient of determination

Prediction without regression
Suppose you know nothing of the association between X and Y. But you are told that the mean of the target variable Y has a certain value M_Y. You are asked to predict values of Y for various values of X. It can be shown that your best strategy is to guess the value of M_Y, irrespective of the value of X. This is termed INTERCEPT-ONLY prediction.
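A small sketch of why the mean is the best single guess, taking "best" to mean the smallest sum of squared errors (the least-squares sense used earlier). The scores and the helper name `sse` are hypothetical.

```python
# Hypothetical scores on the target variable Y.
y = [1, 3, 6, 5, 8, 9]
mean_y = sum(y) / len(y)


def sse(guess):
    """Sum of squared errors when the same value is guessed for every case."""
    return sum((yi - guess) ** 2 for yi in y)


# Compare the mean with a few constant guesses on either side of it.
candidates = [mean_y - 1, mean_y - 0.5, mean_y, mean_y + 0.5, mean_y + 1]
best = min(candidates, key=sse)

for guess in candidates:
    print(f"guess {guess:.2f}: SSE = {sse(guess):.2f}")
```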

A baseline model
In multiple regression and several other related techniques, the first step is to formulate a baseline model, which takes no account of any association among the variables. The baseline model is the equivalent of guessing the mean every time. This is ‘Step 0’ in several SPSS regression and modelling routines. Step 0 provides a comparison for the evaluation of later models that include one or more of the IVs.

Two or more IVs: multiple regression
We could try to predict a person’s actual violence not only from exposure to screen violence, but also from additional variables, such as number of years of education and other characteristics of the parents. We should then have to determine the relative importance of the various IVs and whether we needed to include all of them in the regression model. These are problems in MULTIPLE REGRESSION.

Multiple regression

Partial regression coefficients
In multiple regression, a PARTIAL REGRESSION COEFFICIENT is the estimated average change in the DV resulting from an increase of one unit in one particular IV with ALL THE OTHER IVs HELD CONSTANT.

The multiple correlation coefficient R
The MULTIPLE CORRELATION R is the correlation between the target variable Y and the corresponding predictions of Y from regression, Y′.

R can never be negative
The ABSOLUTE value of the correlation r is unchanged by linearly transforming either or both of the variables involved. So if the correlation between Y and X is +0.6, so is the correlation between Y and 3X. The correlation between Y and –3X + 4 is –0.6. If, on the other hand, the correlation between Y and X is –0.6, the correlation between Y and –3X + 4 is +0.6. IF THE TRANSFORMATION HAS A NEGATIVE SLOPE, THE SIGN OF THE ORIGINAL CORRELATION CHANGES. If a correlation is negative, so also is the slope of the regression line, with the result that the correlation between Y and Y′ is positive.

Coefficient of determination in multiple regression
In multiple regression, the COEFFICIENT OF DETERMINATION is the square of the multiple correlation coefficient.

The case of one IV
The multiple correlation coefficient is defined even in simple regression, where there is only one IV. Here, remembering that R can never be negative, it takes the absolute value of the Pearson correlation between X and Y, even when that has a negative value.
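These points can be checked numerically. The sketch below uses made-up, negatively related data and a hypothetical helper `pearson`. It shows that a negative-slope transformation flips the sign of r, that R (the correlation between Y and the predictions Y′) equals the absolute value of r, and that R² equals the proportion of variance accounted for.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical data with a negative association.
x = [1, 2, 3, 4, 5, 6]
y = [9, 7, 8, 4, 3, 2]

r = pearson(x, y)
r_flipped = pearson([-3 * xi + 4 for xi in x], y)   # negative-slope transformation

# Least-squares line of y upon x and its predictions Y'.
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
b0 = mean_y - b1 * mean_x
y_pred = [b0 + b1 * xi for xi in x]

R = pearson(y, y_pred)                               # multiple correlation
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot                      # proportion of variance explained

print(f"r = {r:.3f}, flipped = {r_flipped:.3f}, R = {R:.3f}, R squared = {r_squared:.3f}")
```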

The coefficient of multiple determination R²
In multiple regression, the coefficient of determination, the proportion of variance of the target variable Y that is accounted for by regression, is R², the square of the multiple correlation coefficient.

What if the DV is a set of categories?
Simple and multiple OLS regression assume that the DV and IVs consist of measures on an independent scale with units. The term CONTINUOUS VARIABLE is used for this sort of DV. But suppose we want to predict whether a person will suffer from a heart attack or contract a certain illness with known risk factors. Here, we are not predicting a VALUE, but membership of a CATEGORY.

Category prediction: the OLS approach
You are trying to predict the presence or absence of a blood condition, which is thought to be made more likely by smoking and alcohol consumption. Why not let 0 = Condition Absent, let 1 = Condition Present, and calculate the usual OLS multiple regression equation?

Problems
There are serious problems with running OLS regression when the DV is a set of categories. None of the proposed solutions is entirely satisfactory. There are better approaches.

Regression with a categorical DV
The 2 most commonly used techniques are logistic regression and discriminant analysis.

Discriminant analysis
If all (or most) IVs are continuous, you might consider using DISCRIMINANT ANALYSIS (DA). But the DA model makes assumptions about the distributions of the IVs (such as multivariate normality) which data sets often fail to satisfy. Moreover, DA doesn’t like qualitative IVs, such as sex or nationality.

Logistic regression
Logistic regression makes fewer assumptions than discriminant analysis. Logistic regression, moreover, is happy with qualitative IVs; in fact, logistic regression is happy even if ALL the IVs are qualitative.

Logistic regression…
It is suspected that smoking and drinking are risk factors in the incidence of a pre-morbid blood condition, characterised by the presence of an antibody. Records of the incidence of the condition in 100 patients are available, together with estimates of the amount they smoke and drink.

First, explore your data. Let’s find out how many of the patients have the condition.

Forty-four patients have the condition

Step 0 in logistic regression
We know that 44/100 people have the condition. Armed only with this fact, and with no knowledge of any associations there might be among the variables, we shall maximise our hit rate if we predict ABSENCE of the condition for ANY person selected at random. This, in logistic regression, is the equivalent of no-regression prediction in OLS regression: you just guess M_Y, whatever the value of X.
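The Step 0 hit rate follows directly from the counts: always predicting the more frequent category is correct for every case in that category. A minimal sketch (variable names are illustrative):

```python
n_total = 100
n_present = 44
n_absent = n_total - n_present

# With no information about the IVs, predict the more frequent category for everyone.
baseline_prediction = "Absent" if n_absent >= n_present else "Present"
baseline_hit_rate = max(n_present, n_absent) / n_total

print(f"Always predict {baseline_prediction}: hit rate = {baseline_hit_rate:.0%}")
```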

Here is the logistic regression output for Step 0

The model proper assumes …
Either you have the disease or you don’t. As smoking and alcohol increase, however, we assume that the probability of developing the condition increases CONTINUOUSLY as a function of the IVs. In logistic regression, we estimate the probability of the condition with the LOGISTIC REGRESSION FUNCTION. If the estimated probability exceeds a cut-off (usually 0.5), the case is classified by the program as a Yes, rather than a No.
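A minimal sketch of the idea: a weighted sum of the IVs is passed through the logistic function to give a probability between 0 and 1, which is then compared with the cut-off. The coefficient values and function names below are hypothetical, chosen only for illustration; they are not estimates from the blood-condition data.

```python
import math


def logistic(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, cutoff=0.5):
    """Classify a case as Yes when the estimated probability exceeds the cut-off."""
    return "Yes" if probability > cutoff else "No"


# Hypothetical intercept and coefficients (assumptions, not fitted values).
B0, B_SMOKING, B_ALCOHOL = -2.0, 1.1, 0.2


def predicted_probability(smoking, alcohol):
    return logistic(B0 + B_SMOKING * smoking + B_ALCOHOL * alcohol)


for smoking, alcohol in [(0, 0), (1, 1), (3, 1)]:
    p = predicted_probability(smoking, alcohol)
    print(f"smoking={smoking}, alcohol={alcohol}: p = {p:.3f} -> {classify(p)}")
```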

A logistic regression function

The odds
Last week, I discussed the ODDS. In an EXPERIMENT OF CHANCE (tossing a coin, rolling a die) the ODDS in favour of an event is the number of ways in which the event could occur, divided by the number of ways in which it could fail to occur.

The odds …
Suppose we know that out of 100 people, 44 have a certain antibody in their blood. We select a person at random from this group. The ODDS in favour of the person having the antibody are 44 to 56 or 44/56.

The log odds (logit)
The odds measure suffers from ASYMMETRY OF RANGE. Unlikely events have odds between 0 and 1; likely events can have huge odds. The LOG ODDS (LOGIT) is the natural logarithm of the odds. Logit = ln(odds) = log_e(odds).

When the logit is zero
Suppose the odds were 50 to 50 (50/50 = 1). Since the log of 1 is zero, a logit of zero means that the odds for are equal to the odds against.

Range of the logit
The logit has a symmetrical range: a positive sign means the odds are in favour; a negative sign means the odds are against. The logit has no upper or lower limit: it has an unlimited range of values.

Example
The odds in favour of a case having the antibody are 44/56 = 11/14. Logit = ln(11/14) = –.24. The event is less likely than not. If the odds in favour were 56/44, the logit would be ln(56/44) = +.24. Notice the symmetry of the scale of magnitude around the neutral point (odds of 1, where the logit is zero).
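The arithmetic of this example can be reproduced in a few lines (variable names are illustrative):

```python
import math

odds_for = 44 / 56          # odds in favour of having the antibody
odds_against = 56 / 44      # the same odds turned round

logit_for = math.log(odds_for)          # natural log of the odds
logit_against = math.log(odds_against)

print(f"odds = {odds_for:.3f}, logit = {logit_for:+.2f}")
print(f"odds = {odds_against:.3f}, logit = {logit_against:+.2f}")
```

The two logits have the same magnitude and opposite signs, whereas the two odds (about 0.79 and 1.27) are not equally far from 1.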

Probability
A probability is a measure of likelihood ranging from 0 (an impossibility) to 1 (a certainty). When all outcomes are equally likely, the probability p of an event is the number of ways it can happen divided by the total number of outcomes. The probability of a six when a die is rolled is 1/6.

Relationship between p and odds
A probability and the odds are both measures of likelihood. They are related according to the equations odds = p/(1 – p) and p = odds/(1 + odds).
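The standard conversions between the two measures are odds = p/(1 – p) and p = odds/(1 + odds). A short sketch, with hypothetical function names, applied to the antibody example:

```python
def odds_from_probability(p):
    """Odds in favour of an event that has probability p (p must be below 1)."""
    return p / (1 - p)


def probability_from_odds(odds):
    """Probability of an event whose odds in favour are `odds`."""
    return odds / (1 + odds)


p_antibody = 44 / 100
odds_antibody = odds_from_probability(p_antibody)   # equals 44/56

print(f"p = {p_antibody:.2f} -> odds = {odds_antibody:.4f}")
print(f"odds = {odds_antibody:.4f} -> p = {probability_from_odds(odds_antibody):.2f}")
```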

Antilogs
We can write any finite real number as an ANTILOG, that is, as the BASE raised to the power of the LOG.

The antilog

Now, by substituting, we have the logistic regression function.

Logistic regression function

The logit equation

The logistic regression equation
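In the usual notation, the chain from the logit equation to the logistic regression function runs as follows for two IVs. The symbols b_0, b_1, b_2 and X_1 (Smoking), X_2 (Alcohol) are assumed here for illustration: the logit is modelled as a linear function of the IVs, taking antilogs gives the odds, and converting odds to probability gives the logistic function.

```latex
\begin{aligned}
\text{logit} = \ln\!\left(\frac{p}{1-p}\right) &= b_0 + b_1 X_1 + b_2 X_2 \\[4pt]
\text{odds} = \frac{p}{1-p} &= e^{\,b_0 + b_1 X_1 + b_2 X_2} \\[4pt]
p &= \frac{e^{\,b_0 + b_1 X_1 + b_2 X_2}}{1 + e^{\,b_0 + b_1 X_1 + b_2 X_2}}
\end{aligned}
```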

The problem
In the logit equation, we must find values for the intercept and the regression coefficients that maximise the likelihood of the observed category memberships, so that cases are assigned to the categories of the dependent variable as accurately as possible.

No mathematical solution
In logistic regression, there is no equivalent of the formulae for the intercept and coefficients in OLS regression. A ‘brute force’ computing algorithm is used whereby, starting at arbitrary values of the coefficients, the values are progressively adjusted to try to arrive at a set which maximises the likelihood of obtaining the observed frequencies. In a process known as ITERATION, estimates of the parameters are calculated again and again in the hope that they will ‘converge’ to stable values. IT DOESN’T ALWAYS HAPPEN! We must therefore check that this ‘convergence’ really has been achieved by examining the ITERATION HISTORY in the SPSS output.
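To make the idea of iteration concrete, here is a minimal sketch of one common iterative scheme, Newton-Raphson, for a single IV. This is an illustration under stated assumptions, not a description of what SPSS does internally: the data are made up, and the names `fit` and `log_likelihood` are hypothetical. The returned history plays the role of an iteration history, and the final flag reports whether the estimates converged.

```python
import math

# Hypothetical data: x is a risk-factor score, y is 1 (condition present) or 0.
x = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]


def log_likelihood(b0, b1):
    """Log likelihood of the observed categories given the coefficients."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += math.log(p) if yi == 1 else math.log(1.0 - p)
    return ll


def fit(max_iter=25, tol=1e-8):
    """Newton-Raphson estimation; returns (b0, b1, history, converged)."""
    b0, b1 = 0.0, 0.0                                  # arbitrary starting values
    history = [(0, b0, b1, log_likelihood(b0, b1))]
    for step in range(1, max_iter + 1):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p                               # gradient, intercept
            g1 += (yi - p) * xi                        # gradient, slope
            h00 += w                                   # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det               # solve the 2 x 2 system
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        history.append((step, b0, b1, log_likelihood(b0, b1)))
        if max(abs(d0), abs(d1)) < tol:                # estimates have stabilised
            return b0, b1, history, True
    return b0, b1, history, False


b0_hat, b1_hat, history, converged = fit()
for step, b0, b1, ll in history:
    print(f"step {step}: b0 = {b0:.4f}, b1 = {b1:.4f}, -2LL = {-2 * ll:.4f}")
print("converged" if converged else "FAILED TO CONVERGE")
```

If the two categories could be separated perfectly by the IV, the estimates would keep growing from one step to the next instead of settling down, which is one way in which convergence can fail.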

Potential difficulties
The algorithm will not run successfully if the IVs are too highly correlated. This is the familiar MULTICOLLINEARITY PROBLEM we encountered in OLS regression.

Centring
As with OLS multiple regression, it is a good idea to CENTRE variables, by subtracting the mean from each score. This move makes the algorithm more robust to substantial correlations among the variables.

Attributing causality
The IVs are likely to be correlated. As with any multiple regression, when the IVs are correlated, it can be difficult to attribute the DV (category membership in this case) unequivocally to any one IV. Moreover, should the battery of IVs be changed, the whole picture may change.

The meaning of a logistic regression coefficient
The regression coefficient is the increase in the logit in favour of an individual having the condition produced by an increment of one unit in the IV. Suppose that for Smoking, b = 1.1. An increase of one smoking unit (e.g. 10 cigarettes) increases the logit (the log odds) by 1.1.

Regression coefficients …
In terms of the ODDS, an increase of one unit in the IV MULTIPLIES the original odds by the ANTILOG of b, that is, by e^b, or exp(b). Exp(1.1) = 3.0. So an increase of one smoking unit results in the odds being MULTIPLIED by 3, that is, the ODDS in favour of the event are THREE times as great. (This is not the same as the probability being three times as great.)
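A short numerical sketch of the multiplication, using b = 1.1 and, purely as an illustrative starting point, the odds of 44/56 from the earlier example. It also shows that multiplying the odds by 3 does not multiply the probability by 3.

```python
import math

b = 1.1
odds_ratio = math.exp(b)                 # antilog of b, about 3.0

start_odds = 44 / 56                     # illustrative starting odds
new_odds = start_odds * odds_ratio       # odds after an increase of one unit


def probability_from_odds(odds):
    return odds / (1 + odds)


p_before = probability_from_odds(start_odds)
p_after = probability_from_odds(new_odds)

print(f"exp({b}) = {odds_ratio:.2f}")
print(f"odds: {start_odds:.3f} -> {new_odds:.3f}")
print(f"probability: {p_before:.3f} -> {p_after:.3f}")
```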

Are the IVs in our data set closely associated?

Explore the data first.

Observations
There’s a substantial correlation between at least one of the IVs and the DV. Good. There’s little association between the IVs. Very good. On the other hand, there is little association between Alcohol and the DV, which is bad. A logistic regression is feasible.

Covariates
In SPSS logistic regression dialogs, IVs that are continuous variables are known as COVARIATES.

Always ask for the ITERATION HISTORY, so that you can check whether the algorithm was able to arrive at a stable estimate.

Dire warning!
Should the iteration history show failure to converge, the results of the analysis can be ridiculous! The effects of failure to converge are not limited to the IV concerned: they can mess up the whole analysis!

Step 0: the ‘no-regression’, intercept-only model

The iteration history

Fitting a model
The goodness-of-fit of a model is measured by a LOG LIKELIHOOD statistic LL. Its value is multiplied by –2 to obtain a chi-square statistic (–2LL).
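As a sketch of how such a statistic arises, the log likelihood of the intercept-only (Step 0) model can be computed from the counts alone (44 with the condition, 56 without), and multiplying by –2 gives the –2LL value. The log likelihood assumed for the fitted model below is hypothetical, included only to show how a model chi-square is formed as the drop in –2LL.

```python
import math

n_present, n_absent = 44, 56
p0 = n_present / (n_present + n_absent)     # Step 0 estimate of the probability

# Log likelihood of the intercept-only model: every case gets probability p0.
ll_baseline = n_present * math.log(p0) + n_absent * math.log(1 - p0)
minus_2ll_baseline = -2 * ll_baseline

# HYPOTHETICAL log likelihood for a model that includes the IVs (not a real result).
ll_model = -40.0
minus_2ll_model = -2 * ll_model

# Model chi-square: the improvement in -2LL over the baseline.
model_chi_square = minus_2ll_baseline - minus_2ll_model

print(f"-2LL (Step 0) = {minus_2ll_baseline:.3f}")
print(f"model chi-square (hypothetical model) = {model_chi_square:.3f}")
```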

The Nagelkerke R² statistic
The Nagelkerke statistic is the counterpart of the coefficient of determination R² in OLS multiple regression. It is a measure of the proportion of the total variation in incidence of the blood condition accounted for by regression.

A regression model is now applied.
Hit rate using the regression model. This is obviously much better than the ‘no-regression’ hit rate of 56%.

The Wald statistic
The WALD STATISTIC tests a regression coefficient for significance. The null hypothesis is that, in the population, the coefficient is zero. The Wald statistic is B²/SE² (not B/SE as Andy Field says on page 224) and is distributed as chi-square.
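A minimal sketch of the calculation, assuming the statistic is referred to a chi-square distribution with one degree of freedom (the usual case for a single coefficient). The coefficient and standard error below are hypothetical values, and `wald_test` is an illustrative name. For one degree of freedom the upper-tail chi-square probability can be obtained from the complementary error function in the standard library.

```python
import math


def wald_test(b, se):
    """Return (Wald statistic, p-value) for one coefficient, chi-square with 1 df."""
    wald = (b / se) ** 2                        # B squared over SE squared
    p_value = math.erfc(math.sqrt(wald / 2))    # P(chi-square with 1 df > wald)
    return wald, p_value


# Hypothetical coefficient and standard error, for illustration only.
wald, p_value = wald_test(b=1.1, se=0.4)
print(f"Wald = {wald:.3f}, p = {p_value:.4f}")
```

Note that B/SE on its own is a z statistic; squaring it gives the chi-square form reported as Wald.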

This is the antilog of the coefficient of Smoking in the logit equation. Increasing Smoking by one unit MULTIPLIES the odds in favour of occurrence by about 10.

Summary
The incidence of the blood condition is indeed predictable from regression, which raises the hit rate from 56% to 85%. Smoking contributes significantly to the model. Alcohol does not contribute significantly to the model.

The next step
This session has been merely an introduction to the technique of logistic regression. The next step is to do some further reading.

Getting started
There’s an elementary section on logistic regression in Kinnear, P., & Gray, C. (2007). SPSS 14 made simple. Hove: Psychology Press. Chapter 14. This is mainly a practical, get-started guide; but there is an outline of the rationale of the technique as well.

An excellent textbook
Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Belmont, CA: Thomson/Wadsworth. There’s a helpful introduction to logistic regression in Chapter 15, the multiple regression chapter.

Sage paperbacks
Menard, S. (2002). Applied logistic regression analysis (2nd ed.). London: Sage.
Jaccard, J. (2001). Interaction effects in logistic regression. London: Sage.

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon. Chapter 10.
Field, A. (2005). Discovering statistics using SPSS for Windows: Advanced techniques for the beginner (2nd ed.). London: Sage. Chapter 6.