1 MRes 3rd March 2010 Logistic regression
2 Programme 2pm – 3:15pm. A talk. A break for coffee. 3:45pm – 4:30pm. A short exercise.
3 Background Logistic regression is a special kind of regression designed for a specific type of situation. To understand logistic regression, however, you must be clear about some of the fundamentals of ORDINARY LEAST SQUARES (OLS) regression. I'll review those first, before I talk about logistic regression itself.
4 A study Does watching screened violence promote violent behaviour in children? In a study of the effects of media violence, some children were measured on their Actual violence and on their Exposure to screened violence. Here is a scatterplot of Actual violence against Exposure.
5 The scatterplot Each point in the plot represents one child. The coordinates of the point are the child's scores on Exposure to and Actual violence. A strong statistical ASSOCIATION between Exposure to and Actual violence is evident from the elliptical shape of the cloud of points.
6 A basically linear association
7 The Pearson correlation In this situation, the strength of an association is measured by the Pearson correlation, the formula for which is:
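The formula on this slide did not survive in the transcript. As a hedged sketch, the standard definition of the Pearson correlation is the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squares. The scores below are made up for illustration; they are not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical Exposure and Actual violence scores (NOT the study's data)
exposure = [2, 3, 5, 6, 8, 9]
actual = [3, 2, 6, 5, 9, 8]
r = pearson_r(exposure, actual)
print(round(r, 3))
```

A value of r near +1 corresponds to a narrow, upward-sloping elliptical cloud of points.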
8 Regression Regression is a set of statistical techniques enabling the researcher to exploit an association among variables to PREDICT the values of one variable from those of others. From regression, you can also ascertain the extent to which the variance of a target variable can be EXPLAINED or accounted for in terms of the other variables.
9 Some key terms The variable we are trying to predict or account for is known variously as the DEPENDENT VARIABLE (DV), the CRITERION, or the TARGET VARIABLE. The predictors are known as the INDEPENDENT VARIABLES (IVs) or REGRESSORS. In our current example, the DV is Actual violence and the IV is Exposure to screen violence.
10 Simple regression and multiple regression In SIMPLE REGRESSION, there is just ONE IV or regressor. In MULTIPLE REGRESSION, there are TWO OR MORE IVs or regressors.
11 The regression line In simple regression, a line called the REGRESSION LINE is drawn through the points. The regression line is the line that fits the points most closely, according to the LEAST SQUARES CRITERION. The traditional approach to regression is therefore referred to as ORDINARY LEAST SQUARES (OLS) REGRESSION.
12 Equation of a straight line
13 In general …
14 The regression line
15 The regression equation
16 On the graph …
17 Interpretation of slope or regression coefficient The slope or REGRESSION COEFFICIENT is the average number of units of change in the DV that result from a change of one unit on the IV. In our example, slope = .74. So, an increase of one unit in Exposure produces, on average, an increase of .74 (rather less than one unit) in Actual violence.
18 Residuals Joe scored 8 on Exposure and 9 on Actual. Joe's predicted score from regression, Y′, is the point on the line above the value 8 on the x-axis. This predicted score is 8. The error in prediction (e) is (Y – Y′), a quantity known as the RESIDUAL score. Joe's residual score is (9 – 8 = 1), as shown in the following diagram.
19 The residual (e)
20 Least squares criterion for goodness-of-fit The values of the slope and intercept of the regression line are such that the sum of the squares of the residuals (SS_error) is a minimum.
21 A unique solution The values of b and c needed to achieve the least squares criterion are given by the formula below. Clearly, the regression coefficient b is closely related to the Pearson correlation.
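The slide's formula is missing from the transcript. A minimal sketch, assuming the standard least-squares solution b = r × (s_Y / s_X) and c = M_Y – b × M_X, which makes the link between b and the Pearson correlation explicit. The function names and the data are hypothetical.

```python
import math

def ols_fit(x, y):
    """Return slope b, intercept c and Pearson r for simple regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    r = sp / math.sqrt(ss_x * ss_y)
    s_x = math.sqrt(ss_x / (n - 1))  # sample standard deviation of X
    s_y = math.sqrt(ss_y / (n - 1))  # sample standard deviation of Y
    b = r * s_y / s_x                # slope (regression coefficient)
    c = mean_y - b * mean_x          # intercept (constant)
    return b, c, r

def ss_error(x, y, b, c):
    """Sum of squared residuals (Y - Y') for the line Y' = c + b*X."""
    return sum((yi - (c + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical scores (NOT the study's data)
x = [2, 3, 5, 6, 8, 9]
y = [3, 2, 6, 5, 9, 8]
b, c, r = ols_fit(x, y)
best = ss_error(x, y, b, c)
# Any other line should give a larger sum of squared residuals
worse = ss_error(x, y, b + 0.1, c)
print(round(b, 3), round(c, 3), round(best, 3), round(worse, 3))
```

Nudging the slope (or the intercept) away from the fitted values increases SS_error, which is what the least squares criterion requires.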
22 Other kinds of regression In ordinary least-squares (OLS) regression, a regression line is fitted, such that the sum of the squares of the residuals is a minimum. There are other kinds of regression (such as LOGISTIC REGRESSION, today's topic) that do not work in this way.
23 Regression and correlation Regression and correlation are two sides of the same associative coin. The higher the correlation, the narrower the elliptical cloud of points in the scatterplot. For fixed values of the variances of X and Y, the higher the correlation, the greater will be the value of the regression coefficient.
24 The violence data
25 A negative correlation So far, I have considered only positive correlations. Here's a negative one. Does the number of complaints made against GPs vary inversely with the average length of their appointments? The following scatterplot supports this hypothesis.
26 A strong negative correlation
27 Relation between the regression coefficient and the correlation coefficient For fixed standard deviations of X and Y, the value of the regression coefficient is directly proportional to the value of the correlation coefficient.
28 The signs of b and r The regression coefficient and the correlation always have the same sign. For the violence data, both are positive. For the data on GPs' appointments, both are negative.
29 Complete independence I take two random samples, each of size 10,000 from a normal population with mean 100 and SD 25. (The syntax for doing this is in an appendix.) Since there should be no association between the two samples, the correlation between them should be zero. The scatterplot will be CIRCULAR. The regression line will be HORIZONTAL, that is, with zero slope.
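The appendix syntax is not reproduced in this transcript, so here is a rough Python stand-in rather than the original syntax. It draws two independent samples of 10,000 from a normal population with mean 100 and SD 25, as described on this slide, then checks that the correlation and the slope are close to zero. The seed is an arbitrary choice.

```python
import math
import random

random.seed(2010)  # arbitrary seed so the run is repeatable
n = 10_000
x = [random.gauss(100, 25) for _ in range(n)]
y = [random.gauss(100, 25) for _ in range(n)]

mean_x = sum(x) / n
mean_y = sum(y) / n
sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)

r = sp / math.sqrt(ss_x * ss_y)  # correlation: should be near zero
b = sp / ss_x                    # slope: should be near zero
c = mean_y - b * mean_x          # intercept: should be near the mean of Y
print(round(r, 3), round(b, 3), round(c, 1), round(mean_y, 1))
```

With a near-zero slope, the fitted line is almost horizontal and the prediction for any X is close to the mean of Y.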
30 No association
31 Intercept-only regression The regression line is horizontal and passes through the value 100 on the y-axis. This is the mean value of the distribution of the dependent variable. Here the intercept of the regression line is equal to the mean value of Y and its slope is zero. When X and Y are independent, you can only predict the mean value of Y whatever the value of X. This is known as INTERCEPT-ONLY REGRESSION.
32 Model-building When testing the goodness-of-fit of regression models to the data, a useful baseline is provided by the INDEPENDENCE MODEL, which makes intercept-only predictions of the dependent variable by predicting the mean value of the DV whatever the value of the IV. In several computing procedures, this is labelled as STEP 0 in the analysis. A good regression model should be a big improvement upon the independence model.
33 The coefficient of determination (r²) The square of the Pearson correlation is known as the COEFFICIENT OF DETERMINATION. It is so-called because r² is the proportion of the variance of Y that is accounted for by regression upon X.
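A small numerical check of this claim, as a sketch with made-up scores: the squared correlation should equal 1 – SS_error / SS_total, the proportion of the variability of Y accounted for by the regression line. The function name is hypothetical.

```python
def coefficient_of_determination(x, y):
    """Return (r squared, 1 - SS_error / SS_total) for simple regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    b = sp / ss_x                 # least-squares slope
    c = mean_y - b * mean_x       # least-squares intercept
    ss_err = sum((yi - (c + b * xi)) ** 2 for xi, yi in zip(x, y))
    r = sp / (ss_x * ss_total) ** 0.5
    return r ** 2, 1 - ss_err / ss_total

# Hypothetical scores (NOT the study's data)
r_sq, explained = coefficient_of_determination([2, 3, 5, 6, 8, 9],
                                               [3, 2, 6, 5, 9, 8])
print(round(r_sq, 4), round(explained, 4))
```

The two printed values agree, apart from floating-point rounding.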
34 Coefficient of determination
35 Two or more IVs: multiple regression We could try to predict a child's actual violence not only from level of exposure to screen violence, but also from additional variables, such as level of parental violence and parental education. We should then have to determine the relative importance of the various IVs and whether we needed to include all of them in the regression model. These are problems in MULTIPLE REGRESSION.
36 Equations for simple and multiple regression In the multiple regression equation, c is the CONSTANT and b_1, b_2, …, b_p are the PARTIAL REGRESSION COEFFICIENTS.
37 Partial regression coefficients In multiple regression, a PARTIAL REGRESSION COEFFICIENT is the estimated average change in the DV resulting from an increase of one unit in one particular IV with ALL THE OTHER IVs HELD CONSTANT.
38 The multiple correlation coefficient R The MULTIPLE CORRELATION COEFFICIENT (R) is the correlation between the target variable Y and the corresponding predictions Y′ of Y from regression upon the IVs.
39 Notation When it is necessary to specify which variables are involved in a multiple regression, a subscript notation is used. The multiple correlation between Y and X_1, X_2, …, X_p is
40 Properties of R R can never take a negative value, because the sign of the slope of the regression line is always the same as that of the correlation. Recall that the Pearson correlation can only vary within the range from –1 to +1, inclusive. In contradistinction, R can only take values between zero and +1, inclusive.
41 The case of one IV The multiple correlation coefficient is defined even in simple regression, where there is only one IV. Here, remembering that R can never be negative, it takes the ABSOLUTE VALUE of the Pearson correlation (r) between X and Y, even when r has a negative value. So in SPSS, R is included in the output for simple regression.
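To see this numerically, the sketch below uses made-up, negatively correlated scores. It computes R as the correlation between Y and the predictions Y′, then compares it with the Pearson r between X and Y. The helper name is hypothetical.

```python
import math

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sp = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return sp / math.sqrt(ss_a * ss_b)

# Hypothetical data with a negative association (NOT the GP data)
x = [1, 2, 3, 4, 5, 6]
y = [9, 7, 8, 4, 5, 2]

# Fit the least-squares line and compute the predictions Y'
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
c = mean_y - b * mean_x
y_pred = [c + b * xi for xi in x]

r = corr(x, y)            # Pearson r between X and Y (negative here)
big_r = corr(y, y_pred)   # R: correlation between Y and Y'
print(round(r, 4), round(big_r, 4))
```

R comes out positive and equal in size to r, because the predictions fall as X rises exactly when Y tends to fall as X rises.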
42 The coefficient of multiple determination R² In multiple regression, THE COEFFICIENT OF MULTIPLE DETERMINATION R² is the proportion of variance of the dependent variable Y that is accounted for by regression upon the IVs.
43 A spatial representation of the coefficient of multiple determination
44 What if the DV is a set of categories? Simple and multiple OLS regression assume that all the variables are CONTINUOUS, that is, measures on an independent scale with units. But suppose we want to predict whether a person will suffer from a heart attack or contract a certain illness with known risk factors. Here, we are predicting not a VALUE, but CATEGORY MEMBERSHIP.
45 Regression with a categorical DV The two most commonly used techniques are: 1.Logistic regression 2.Discriminant analysis
46 Discriminant analysis If all (or most) IVs are continuous, you might consider using DISCRIMINANT ANALYSIS (DA). But the DA model makes assumptions about the distributions of the IVs (such as multivariate normality) which research data often fail to satisfy. Moreover, DA doesn't like qualitative IVs, such as sex or nationality. For these reasons, logistic regression is increasingly preferred to DA when the DV is categorical.
47 Categorical IVs Unlike DA, logistic regression is happy with qualitative IVs; in fact, logistic regression is happy even if ALL the IVs are qualitative.
48 A research question It is suspected that smoking and drinking are risk factors in the incidence of a pre-morbid blood condition, characterised by the presence of an antibody. Records of the incidence of the antibody in 100 patients are available, together with estimates of the amounts that they smoke and drink.
49 The data
50 How many of the patients have the antibody?
51 Use Frequencies
52 Frequencies dialog
53 Forty-four of the hundred patients have the antibody
54 The odds In an EXPERIMENT OF CHANCE (tossing a coin, rolling a die) the ODDS in favour of an event is the number of ways in which the event could occur, divided by the number of ways in which it could fail to occur. If a die is rolled, there is one way of getting a six and there are five ways of not getting a six. The odds in favour of a six are therefore 1/5.
55 Odds in favour of having the antibody We know that out of 100 patients, 44 have the antibody. We select a person at random from this group. There are 44 ways of selecting a person with the antibody; and 56 ways of selecting someone without it. The odds in favour of the person having the antibody are 44/56 = .79.
56 Probability A probability is a measure of likelihood ranging from 0 (an impossibility) to 1 (a certainty). The classical definition of probability, like that of the odds, also arises in the context of an experiment of chance. The probability p of an event is the number of ways it can happen divided by the TOTAL number of possible outcomes. When a die is rolled, there are six possible outcomes. There is one way of getting a six. The probability of a six when a die is rolled is therefore 1/6.
57 Relationship between probability and the odds Probability and the odds are both measures of likelihood and have been defined in the same context – an experiment of chance. They are related according to the equations odds = p/(1 – p) and, equivalently, p = odds/(1 + odds).
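The relationship follows from the two definitions: odds = p / (1 – p) and p = odds / (1 + odds). A short sketch, using the die and antibody figures from the earlier slides (the function names are hypothetical):

```python
def odds_from_probability(p):
    """Odds in favour of an event that has probability p (p < 1)."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Probability of an event whose odds in favour are `odds`."""
    return odds / (1 + odds)

# Rolling a six: probability 1/6, odds 1/5
die_odds = odds_from_probability(1 / 6)
# Having the antibody: 44 of 100 patients, odds 44/56
antibody_odds = odds_from_probability(44 / 100)
print(round(die_odds, 2), round(antibody_odds, 2))
print(round(probability_from_odds(44 / 56), 2))
```

Converting the antibody odds of 44/56 back gives the probability 44/100.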
58 Logarithms In a logarithmic system, numbers are expressed as powers (logs) of a constant called the BASE of the system. In COMMON LOGS, the base is 10. In NATURAL LOGS, the base is the mathematical constant e, where e is approximately 2.72.
59 Logs and antilogs Before the IT revolution, calculations involving large numbers were done by converting the numbers to logs, working with the logs (which are much smaller numbers), then reversing the log function with the ANTILOG FUNCTION to get back to the original number scale.
60 The antilog function
61 Log notation (base 10)
62 Log notation (base e)
63 An asymmetrical measure The odds measure suffers from ASYMMETRY OF RANGE. Extremely unlikely events have odds confined between 0 and 1; whereas very likely events can have huge odds running into millions. Two very likely events could be separated by millions in terms of odds; two very unlikely events will be separated by minute fractions.
64 The log odds or logit The LOG ODDS (LOGIT) is the natural logarithm (log to the base e) of the odds. Logit = ln(Odds) = log_e(Odds).
65 Even Steven Suppose the odds were 50 to 50 (50/50 = 1). The natural log of 1 is zero (e^0 = 1). So for raw odds of 50 to 50, the logit (log odds) is zero.
66 Range of the logit The logit has a symmetrical range: a positive sign means the odds are in favour; a negative sign means the odds are against. Unlike the odds, which has a lower limit of zero, the logit has neither an upper nor a lower limit.
67 Example In our current example, the odds in favour of a case having the antibody are 44/56 = 11/14 = .79. Logit = ln(.79) = –.24. The event is less likely than not, hence the negative sign. If the odds in favour were 56/44, the logit would have been ln(56/44) = ln(1.27) = +.24. Notice the symmetry of the scale of magnitude around the neutral point at 0.
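The arithmetic in this example can be checked in a few lines. The function name `logit` is just a label chosen for this sketch.

```python
import math

def logit(odds):
    """Log odds: the natural logarithm of the odds."""
    return math.log(odds)

against = logit(44 / 56)   # odds in favour are less than 1
in_favour = logit(56 / 44)  # the same odds turned round
even = logit(50 / 50)      # even odds

print(round(against, 2))    # -0.24
print(round(in_favour, 2))  # 0.24
print(even)                 # 0.0
```

Reversing the odds changes only the sign of the logit, which is the symmetry the slide describes.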
68 Odds as antilogs A number such as the odds can be written as an ANTILOG, that is, the base e to the power of the natural log of the odds (the logit):
69 Probability and the logit We can therefore express the probability in terms of the logit, rather than the odds. We shall use the symbol Z for the logit.
70 The logistic regression function We have arrived at the LOGISTIC REGRESSION FUNCTION, in which Z is the logit or log odds.
71 Assumptions of logistic regression Either you have the antibody or you don't. As smoking and alcohol increase, however, the probability of having the antibody is assumed to increase CONTINUOUSLY as a function of the IVs. In logistic regression, we estimate the probability of having the antibody with the LOGISTIC REGRESSION FUNCTION. If the estimated probability exceeds a cut-off (usually set at 0.5), the case is classified by the program as a Yes, rather than a No.
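As a sketch of how the classification step works: since odds = e^Z and p = odds / (1 + odds), the logistic regression function can be written p = e^Z / (1 + e^Z), which is the same as 1 / (1 + e^(–Z)). The coefficients and scores below are invented for illustration; they are not the values estimated in the SPSS output.

```python
import math

def logistic(z):
    """Estimated probability for a logit (log odds) of z."""
    return 1 / (1 + math.exp(-z))

def classify(z, cutoff=0.5):
    """Classify a case as 'Yes' if its estimated probability exceeds the cut-off."""
    return "Yes" if logistic(z) > cutoff else "No"

# HYPOTHETICAL constant and coefficients for the logit equation
c, b_smoking, b_alcohol = -3.0, 1.1, 0.2

def logit_z(smoking, alcohol):
    """Linear logit equation Z = c + b1*Smoking + b2*Alcohol."""
    return c + b_smoking * smoking + b_alcohol * alcohol

for smoking, alcohol in [(1, 1), (3, 1), (5, 2)]:
    z = logit_z(smoking, alcohol)
    print(smoking, alcohol, round(logistic(z), 3), classify(z))
```

A cut-off of 0.5 on the probability corresponds to a cut-off of zero on the logit Z.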
72 A logistic regression function
73 Logistic regression and logit functions We have seen that the logistic regression function is non-linear. The logit function (Z), however, is assumed to be linear.
74 The logit equation The logit is assumed to be a linear function Z of the independent variables. Z looks like an OLS linear regression equation, with a constant and partial regression coefficients.
75 Typical graph of the logit function Z
76 The decision rule
77 The log of the product is the sum of the logs Taking antilogs of both sides of the equation shows that the antilog of a sum of logs is the product of the antilogs, that is, the product of the original numbers.
78 Interpretation of a logistic regression coefficient The partial regression coefficient is the increase in the LOG ODDS or LOGIT (Z) arising from an increase of one unit in the independent variable. The antilog of the partial regression coefficient is the factor by which the original odds must be MULTIPLIED to give the new odds when the IV increases by a unit.
79 Interpretation of b A unit increase in Smoking increases Z to Z + b.
80 Example In terms of the ODDS, an increase of one unit in the IV MULTIPLIES the original odds by the ANTILOG of b, that is, by e^b, or exp(b). If b = 1.1, exp(1.1) = 3.0. So an increase of one smoking unit results in the odds being MULTIPLIED by 3, that is, the ODDS of the antibody being present in the blood are THREE times as great for those who smoke a unit more.
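A quick check of the arithmetic, taking b = 1.1 from this slide and, purely as an assumed starting point, the overall odds of 44/56. It also shows that multiplying the odds by 3 does not multiply the probability by 3.

```python
import math

b = 1.1
odds_multiplier = math.exp(b)   # antilog of b, about 3.0

old_odds = 44 / 56              # assumed starting odds, for illustration
new_odds = old_odds * odds_multiplier

# Convert both sets of odds to probabilities: p = odds / (1 + odds)
old_p = old_odds / (1 + old_odds)
new_p = new_odds / (1 + new_odds)

print(round(odds_multiplier, 2))
print(round(old_odds, 2), round(new_odds, 2))
print(round(old_p, 2), round(new_p, 2))
```

Adding b to the logit and then taking the antilog gives the same new odds, because the antilog of a sum is the product of the antilogs.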
81 The regression problem In the logit equation, we must find values of the constant and partial regression coefficients such that the likelihood of the observed category memberships under the logistic regression function is maximised.
82 No mathematical solution In logistic regression, there is no closed-form equivalent of the formulae for the intercept and coefficients in OLS regression. Instead, an iterative computing algorithm is used whereby, starting at arbitrary values of the coefficients, the values are progressively adjusted to try to arrive at a set which maximises the likelihood of obtaining the observed frequencies.
83 Iteration and convergence In a process known as ITERATION, estimates of the parameters are calculated again and again in the hope that they will converge to stable values. IT DOESN'T ALWAYS HAPPEN! We must therefore check that this convergence really has been achieved by examining the ITERATION HISTORY in the SPSS output.
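To make the idea of iteration concrete, here is a minimal sketch for a single IV. It assumes Newton-Raphson updates, a common choice for maximum likelihood; the talk does not say which algorithm SPSS uses. The data, function names, starting values and tolerance are all invented for illustration.

```python
import math

def neg2_log_likelihood(c, b, x, y):
    """-2 x log likelihood of the 0/1 outcomes y under logit Z = c + b*x."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(c + b * xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -2 * total

def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Newton-Raphson estimates of c and b, with an iteration history."""
    c, b = 0.0, 0.0  # arbitrary starting values
    history = []
    for step in range(1, max_iter + 1):
        p = [1 / (1 + math.exp(-(c + b * xi))) for xi in x]
        w = [pi * (1 - pi) for pi in p]
        # Gradient of the log likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Elements of the 2 x 2 information matrix
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system for the adjustments
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        c += d0
        b += d1
        history.append((step, neg2_log_likelihood(c, b, x, y), c, b))
        if max(abs(d0), abs(d1)) < tol:
            return c, b, history, True   # estimates have stabilised
    return c, b, history, False          # failed to converge

# HYPOTHETICAL data: x is an IV, y is 1 (present) or 0 (absent)
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
c, b, history, converged = fit_logistic(x, y)
for step, neg2ll, c_est, b_est in history:
    print(step, round(neg2ll, 4), round(c_est, 4), round(b_est, 4))
print("converged:", converged)
```

The printed rows play the role of an iteration history: the estimates change less and less from one step to the next until the adjustments fall below the tolerance. If the final flag were False, the estimates should not be trusted.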
84 Potential difficulties The algorithm may fail to converge, or may give unstable estimates, if the IVs are too highly correlated. This is the familiar MULTICOLLINEARITY PROBLEM sometimes encountered in OLS regression.
85 Centring As with OLS multiple regression, it is a good idea to CENTRE variables, by subtracting the mean from each score, so that the mean of the transformed scores is zero. Centring leaves the correlations among the variables unchanged. But centring makes the algorithm more robust to substantial correlations among the variables.
86 Finding binary logistic regression
87 Covariates In SPSS logistic regression dialogs, IVs that are continuous variables are known as COVARIATES.
88 Always ask for the ITERATION HISTORY, so that you can check whether the algorithm was able to arrive at a stable estimate.
89 Dire warning! Should the iteration history show failure to converge, the results of the analysis can be ridiculous! The effects of failure to converge are not limited to the IV concerned: they can mess up the whole analysis!
90 The logistic regression dialog
91 The Options dialog
92 Fitting a model The goodness-of-fit of a model is measured by a log likelihood chi-square statistic. The SMALLER the value of chi-square, the BETTER the fit. The LARGER the p-value the better.
93 Step 0 in logistic regression We know that 44/100 people have the condition. Armed only with this fact, and with no knowledge of any associations there might be among the variables, we shall maximise our hit rate if we predict ABSENCE of the condition for ANY person selected at random. This is the equivalent, in logistic regression, of intercept-only (no-regression) prediction in OLS regression: you just guess M_Y (the mean of Y), whatever the value of X.
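The Step 0 reasoning can be sketched from the frequencies alone (44 patients with the antibody, 56 without). The variable names are hypothetical.

```python
n_present = 44   # patients with the antibody
n_absent = 56    # patients without it
n_total = n_present + n_absent

# With no predictors, the best single guess is the more frequent category
prediction = "absent" if n_absent >= n_present else "present"

# Every case in the more frequent category is then classified correctly
hit_rate = max(n_present, n_absent) / n_total

print(prediction, hit_rate)
```

Any useful regression model should classify clearly more cases correctly than this baseline.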
94 Here is the logistic regression output for Step 0
95 Classification table at Step 0
96 The iteration history
97 The Nagelkerke R² statistic The Nagelkerke statistic is the counterpart of the coefficient of determination R² in OLS multiple regression. It is a measure of the proportion of the total variation in incidence of the antibody accounted for by regression.
98 The Nagelkerke R² statistic
99 Cohen's guidelines
100 Hosmer and Lemeshow contingency table
101 Goodness-of-fit test
102 Classification table at Step 1 (after the regression model has been applied)
103 The Wald statistic The WALD STATISTIC tests a regression coefficient for significance. The null hypothesis is that, in the population, the coefficient is zero. The Wald statistic is distributed approximately as chi-square on one degree of freedom.
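A minimal sketch of the test, assuming the usual form of the statistic: the squared ratio of the coefficient to its standard error. The coefficient and standard error below are invented; they are not taken from the SPSS output. For one degree of freedom, the chi-square tail probability can be obtained from the complementary error function in the standard library.

```python
import math

def wald_test(b, se):
    """Return the Wald statistic and its p-value on 1 degree of freedom."""
    wald = (b / se) ** 2
    # For chi-square on 1 df, P(X > w) = erfc(sqrt(w / 2))
    p_value = math.erfc(math.sqrt(wald / 2))
    return wald, p_value

# HYPOTHETICAL coefficient and standard error
wald, p_value = wald_test(b=1.1, se=0.3)
print(round(wald, 2), round(p_value, 5))
```

A small p-value is evidence against the null hypothesis that the coefficient is zero in the population.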
104 Some regression statistics The Wald statistic confirms that Smoking has an effect (p-value is very small) but Alcohol does not (the p-value is large).
105 The regression coefficient
106 The logit equation
107 The logistic regression function
108 Graph of accuracy of prediction
109 Conclusion The incidence of the blood condition is indeed predictable from regression, which raises the hit rate from 56% to 85%. Smoking contributes significantly to the model. Alcohol does not contribute significantly to the model.
110 The next step This session has been merely an introduction to the technique of logistic regression. The next step is to do some further reading.
111 Getting started There's an elementary section on logistic regression in Kinnear, P., & Gray, C. (2010). PASW Statistics 17 made simple. Hove: Psychology Press. Chapter 14. This is mainly a practical, get-started guide; but there is an outline of the rationale of the technique as well.
112 Next stop Dugard, P., Todman, J., & Staines, H. (2010). Approaching multivariate analysis: a practical introduction (2nd ed.). London & New York: Routledge.
113 Sage paperbacks Menard, S. (2002). Applied logistic regression analysis (2nd ed.). London: Sage. Jaccard, J. (2001). Interaction effects in logistic regression. London: Sage.
114 Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon. Chapter 10. Field, A. (2005). Discovering statistics using SPSS for Windows: Advanced techniques for the beginner (2nd ed.). London: Sage. Chapter 6.
115 Appendix Using syntax to draw random samples from specified populations
116 Drawing two samples from a normal distribution with mean 100 and SD 15