# Introduction to Data Analysis

Introduction to Logistic Regression

This week’s lecture
Categorical dependent variables in more complicated models. Logistic regression (for binary categorical dependent variables). Why can’t we just use OLS? How does logistic regression work? How do we compare logistic models? Reading: A & F chapter 15.

But, first an experiment
I’m going to show you a short video of some students playing basketball. There are 6 people; 3 dressed in black shirts and 3 in white shirts. I’d like you to count the number of times that the white shirted students pass the ball to each other in two different ways. An ‘aerial’ pass (without touching the ground on the way). A ‘bounce’ pass (touching the ground on the way). Thus after the video has ended you should have two totals, one for aerial passes by white shirts and one for bounce passes by white shirts.

“Gorillas in our midst” (1)

“Gorillas in our midst” (2)
This is a real bit of psychology research by Simons and Chabris (1999) at Harvard. They find that the harder the task, the more likely it is that people don’t spot the gorilla. Only 50% of their subjects spotted the gorilla… How is this relevant to us? Imagine we wanted to predict whether someone saw the gorilla or not. This is a binary dependent variable. We might have independent variables like concentration span, difficulty of the task, time of day and so on.

Predicting gorilla sightings (1)
Our dependent variable is just like the variables we were using earlier, e.g. vote choice in 1950s Britain was Labour or Conservative. But let’s say with this example we want to predict whether the gorilla will be spotted by a person with a particular set of characteristics. In this case, let’s say with a particular concentration span (measured on a scale). Since our independent variable is interval level data we can’t use cross-tabs.

Predicting gorilla sightings (2)
So, what we want to know is the probability that any person will be a gorilla spotter or not for any value of concentration span. Remember if we know this, we will know the proportion of people that will spot the gorilla at each level of concentration span on average. We could use simple linear regression (SLR) here, with the dependent variable coded as 0 (no gorilla spot) or 1 (gorilla spotted). Well, why can’t we…?

What’s wrong with SLR?
We want to predict a probability, and this can only vary between zero and 1. But our SLR may predict values that are below zero or above 1… Let’s quickly fit an SLR to our example. Our sample here is the 108 subjects that Simons and Chabris used. I’ve added some extra data on their concentration spans. A scatter-plot isn’t all that much use here.

Scatter-plot (1) More low concentration people spot the gorilla.
More high concentration people DON’T spot the gorilla.

Scatter-plot (2)
We could add a linear regression line to the scatter-plot. But then people with CS below 21 have a predicted probability of being a spotter that is greater than 1, and people with CS above 92 have a predicted probability of being a spotter that is less than 0…
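To see the problem in numbers, here is a minimal sketch. It does not use the original regression output: the slope and intercept are backed out from the two crossing points quoted on the slide (the line reaches 1 at CS = 21 and 0 at CS = 92), so they are an approximation for illustration only.

```python
# Crossing points read off the slide: the fitted line reaches 1 at CS = 21
# and 0 at CS = 92. The slope and intercept below are derived from those
# two points; they are NOT the original regression estimates.
CS_AT_ONE = 21.0
CS_AT_ZERO = 92.0

slope = (0.0 - 1.0) / (CS_AT_ZERO - CS_AT_ONE)
intercept = 1.0 - slope * CS_AT_ONE


def linear_prediction(cs):
    """Predicted 'probability' of spotting the gorilla from a straight line."""
    return intercept + slope * cs


for cs in (10, 21, 50, 92, 100):
    print(f"CS={cs:>3}: predicted 'probability' = {linear_prediction(cs):.2f}")
```

Anyone with a concentration span of 10 gets a prediction above 1, and anyone at 100 gets a prediction below 0, neither of which can be a probability.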

Other problems If you think about it, that’s just one problem.
For linear regression we assumed that the population distribution was normally distributed around the mean, for each value of the X variable. That’s not going to be the case if we’ve got a binary response. The distribution around the mean is going to be quite different. Looking at our data, when CS=50 we’ll have about 60% of cases scoring 1 (being spotters) and 40% of cases scoring 0 (not being spotters). That doesn’t sound much like a normal distribution…

What to do? (1)
Instead of linear OLS regression we use something called logistic regression. This is a very widely used method, and it’s important to understand how it works. It is probably more widely used (especially if we include variants) than linear OLS, as interesting dependent variables are often categorical. A randomly selected academic (by the name of Tilley) has used logistic regression in 55.5% of all his sociology and politics articles.

What to do? (2) Somehow we need to dump the linear OLS bit of our model for this binary categorical variable. So what we want to do is assume a different kind of relationship between the probability of seeing gorillas (or whatever) and concentration span. Maybe something like this…

What to do? (3) Here’s a more realistic representation of the relationship between the probability of gorilla spotting and CS

The logistic transformation (1)
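This type of relationship is described by a special formula. Remember, if the relationship was linear then the equation is just:

$$P = \alpha + \beta X$$

But the relationship on the graph is actually described by:

$$\log\left(\frac{P}{1 - P}\right) = \alpha + \beta X$$

Here P is the probability of being a gorilla spotter and X is concentration span.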

The logistic transformation (2)
The term inside the log, P/(1 − P), is just the odds. As the probability increases (from zero to 1), the odds increase from 0 to infinity. The log of the odds then increases from –infinity to +infinity. So if β is ‘large’ then as X increases the log of the odds will increase steeply. The steepness of the curve will therefore increase as β gets bigger.
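The three scales (probability, odds, log odds) can be tabulated with a few lines of Python. This is only an illustrative sketch; the function names are made up for this example.

```python
import math


def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)


def log_odds(p):
    """Log of the odds (the 'logit') for probability p."""
    return math.log(odds(p))


def probability(log_odds_value):
    """Invert the logit: turn log odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds_value))


# Probabilities near 0 give odds near 0 and large negative log odds;
# probabilities near 1 give very large odds and large positive log odds.
for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(f"p={p:.2f}  odds={odds(p):8.3f}  log odds={log_odds(p):7.3f}")
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0, which is the midpoint of the S-shaped curve.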

Fitting this model (1)
So that’s what we want to do, but how do we do it? With SLR we tried to minimize the squares of the residuals, to get the best fitting line. This doesn’t really make sense here (remember the errors won’t be normally distributed as there are only two values). We use something called maximum likelihood to estimate what the β and α are.

Fitting this model (2) Maximum likelihood is an iterative process that estimates the best fitted equation. The iterative bit just means that we try lots of models until we get to a situation where tweaking the equation any further doesn’t improve the fit. The maximum likelihood bit is kind of complicated, although the underlying assumptions are simple to understand, and very intuitive. The basic idea is that we find the coefficient value that makes the observed data most likely.
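For the curious, here is a rough sketch of what the iteration can look like. It assumes NumPy is available and uses Newton-Raphson updates, which is one common way of maximising the likelihood (the lecture does not say which algorithm STATA or SPSS use). The nine data points are made up for illustration and are not the gorilla data.

```python
import numpy as np


def log_likelihood(params, x, y):
    """Log-likelihood of a binary logistic model with log odds = alpha + beta * x."""
    alpha, beta = params
    eta = alpha + beta * x
    # log L = sum( y * eta - log(1 + exp(eta)) )
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Estimate (alpha, beta) by repeatedly tweaking them until nothing improves."""
    X = np.column_stack([np.ones(len(x)), x])
    params = np.zeros(2)                              # start from alpha = beta = 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ params)))       # current predicted probabilities
        gradient = X.T @ (y - p)                      # direction of improvement
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)     # Newton-Raphson update
        params = params + step
        if np.max(np.abs(step)) < tol:                # further tweaking changes nothing
            break
    return params


# Made-up data: higher x goes with fewer 1s, but the two groups overlap.
x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=float)
y = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0], dtype=float)

estimates = fit_logistic(x, y)
print("alpha, beta:", estimates)
print("log-likelihood at start:", log_likelihood(np.zeros(2), x, y))
print("log-likelihood at estimates:", log_likelihood(estimates, x, y))
```

The log-likelihood at the final estimates is higher than at the starting values, which is all that "makes the observed data most likely" means here.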

Back to the gorillas So pressing the appropriate buttons in STATA or SPSS, allows us to fit a logistic regression to our gorilla spotting data. The numbers that we get out are not immediately interpretable however. Remember for OLS linear regression, a change of one unit on the X variable meant that the Y variable would increase by the coefficient for X. That’s not what the coefficient associated with X in our logistic regression means.

Gorilla results

| Variable | Coefficient value | Standard error | p-value |
|---|---|---|---|
| Concentration | -0.07 | 0.01 | 0.00 |
| Intercept | 3.69 | 0.72 | |

This is how logistic regression results are often reported in articles. It’s clear that concentration span has a negative (and statistically significant) effect on gorilla sightings. But what does this actually mean?

Interpreting the coefficients (1)
What we need to do is think about the equation again, and what an increase in X means.

$$\log(\widehat{\text{odds}}) = 3.69 - 0.07X$$

Remember the ‘hat’ sign means the predicted value. So an increase in X of 1 unit will decrease our log (odds) by 0.07. If we antilog both sides then we could see how the odds change…

Interpreting the coefficients (2)
Antilog both sides and we get the odds on the LH side.

$$\widehat{\text{odds}} = e^{3.69 - 0.07X}$$

If we enter a value of X we can work out what the predicted odds will be. For X = 30:

$$\widehat{\text{odds}} = e^{3.69 - 0.07 \times 30} = e^{1.59} \approx 4.90$$

Thus the odds of spotting the gorilla (as opposed to not spotting the gorilla) are nearly 5. For every 5 spotters there should be one non-spotter.

Interpreting the coefficients (3)
We can also think about what happens to the odds when we increase X by a certain amount. Another way of writing $e^{a+bX}$ is $e^{a}(e^{b})^{X}$. That means that a one unit increase in X multiplies the odds by $e^{b}$ (as it’s to the power of 1). In our case therefore a one unit increase in X multiplies the odds by $e^{-0.07}$, or 0.93. When X increases from 30 to 31, the odds are 4.90 × 0.93, or 4.56. When X increases from 30 to 40, the odds are 4.90 × (0.93)^10, or 2.37.
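The arithmetic can be checked with a short script. It plugs in the rounded coefficients from the results table (intercept 3.69, concentration -0.07); the variable names are just for this sketch. Because the slide multiplies by the rounded value 0.93, exact calculations differ slightly in the second decimal place.

```python
import math

INTERCEPT = 3.69        # a, from the results table
B_CONCENTRATION = -0.07  # b, from the results table


def predicted_odds(cs):
    """Predicted odds of spotting the gorilla at concentration span cs."""
    return math.exp(INTERCEPT + B_CONCENTRATION * cs)


odds_multiplier = math.exp(B_CONCENTRATION)  # effect of a one unit increase in X

print(f"odds at CS=30: {predicted_odds(30):.2f}")
print(f"multiplier per unit of CS: {odds_multiplier:.2f}")
print(f"odds at CS=31: {predicted_odds(31):.2f}")
print(f"odds at CS=40: {predicted_odds(40):.2f}")
```

Whatever the starting value of X, moving up one unit multiplies the odds by the same factor, which is why logistic results are often reported as odds ratios.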

Yet more coefficient interpretation (1)
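The other way of thinking about things is in terms of probabilities. If we rearrange the ‘antilogged’ equation then we work out what the probability (for a particular value of X) would be.

$$\hat{P} = \frac{\widehat{\text{odds}}}{1 + \widehat{\text{odds}}} = \frac{e^{a+bX}}{1 + e^{a+bX}}$$

For CS=30 this is 4.90 / 5.90 ≈ 0.83. The probability of a person with CS=30 of gorilla spotting is thus 83%.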
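A small sketch of the same conversion in Python, again using the rounded coefficients from the results table (the function name is made up for this example):

```python
import math


def predicted_probability(cs, intercept=3.69, slope=-0.07):
    """Predicted probability of spotting the gorilla at concentration span cs."""
    odds = math.exp(intercept + slope * cs)
    return odds / (1.0 + odds)


for cs in (10, 30, 50, 70, 90):
    print(f"CS={cs}: P(spot gorilla) = {predicted_probability(cs):.2f}")
```

Unlike the straight line from SLR, these predictions always stay between 0 and 1 however large or small CS gets.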

Yet more coefficient interpretation (2)
When CS=30, the probability of spotting the gorilla is 83%. Perhaps the most useful thing to do is to plot the predicted probabilities (it is easiest to do this in STATA).

Including other interval level independent variables and categorical independent variables is as easy as in multiple linear regression. The logic is the same as before, we are examining the effects of one independent variable when the other is held constant. The important bit is to understand what the coefficients from extra independent variables actually mean. Since this is less clear cut than in multiple linear regression we need to be careful in interpretation.

Let’s say we think that people that own monkeys are more adept at spotting the gorilla. We could include a dummy variable for monkey owner (1 if you are a monkey owner, and 0 if not).

| Variable | Coefficient value | Standard error | p-value |
|---|---|---|---|
| Concentration | -0.09 | 0.02 | 0.000 |
| Monkey owner | 3.15 | 0.96 | 0.001 |
| Intercept | 4.01 | 0.83 | |

Interpreting extra variables (1)
So owning a monkey (holding concentration span constant), multiplies the odds by $e^{3.15}$, or 23.3 times. The odds of monkey owners spotting the gorilla are 23 times the odds of non-monkey owners spotting the gorilla. The probability of a person with a CS of 50 that owns a monkey being a gorilla spotter is 93%, and the probability of a person with a CS of 50 that does not own a monkey being a gorilla spotter is only 40%. With such a simple model we can still display it graphically. A linear model would have two parallel lines for each type of person (monkey or none) by CS. Our lines are NOT parallel.
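Here is a sketch of those calculations using the rounded coefficients from the table (4.01, -0.09 and 3.15). Because the published coefficients are rounded, the probability for non-owners comes out a couple of points away from the figure quoted on the slide; the function name is just for illustration.

```python
import math


def spot_probability(cs, monkey_owner):
    """Predicted probability of spotting the gorilla (rounded table coefficients)."""
    log_odds = 4.01 - 0.09 * cs + 3.15 * (1 if monkey_owner else 0)
    return 1.0 / (1.0 + math.exp(-log_odds))


odds_ratio = math.exp(3.15)  # same multiplier at every value of CS
print(f"odds ratio for monkey owners: {odds_ratio:.1f}")

for cs in (20, 50, 80):
    owner = spot_probability(cs, True)
    non_owner = spot_probability(cs, False)
    print(f"CS={cs}: owner={owner:.2f}  non-owner={non_owner:.2f}  gap={owner - non_owner:.2f}")
```

The odds ratio is constant, but the gap in probabilities changes with CS, which is exactly why the two curves are not parallel.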

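Monkeys and no monkeys
[Figure: predicted probability of spotting the gorilla by concentration span, with separate curves for monkey owners and non-monkey owners.]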

Interpreting extra variables (2)
Generally, we want to present information from a logistic regression in the form of probabilities as these are easiest to understand. If we have lots of variables, then we normally set them to a particular value and then examine how the predicted probability of the dependent outcome varies. e.g. if I had more independent variables (age, sex, eyesight), I would produce the first graph before for men of average age with average eyesight not owning a monkey. Then I could see how concentration alone affected the predicted probability of a gorilla sighting.

Interactive monkeys (1)
We can also include interaction effects. Again though we need to be careful interpreting these.

| Variable | Coefficient value | Standard error | p-value |
|---|---|---|---|
| Concentration | -0.12 | 0.02 | 0.00 |
| Monkey owner | -1.92 | 2.00 | 0.34 |
| Monkey*concentration | 0.08 | 0.04 | |
| Intercept | 5.07 | 1.14 | |
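One way to read an interaction model is to work out the log-odds equation separately for each group. The sketch below does this with the coefficients from the table; the helper names are made up for this example.

```python
import math

INTERCEPT = 5.07
B_CONCENTRATION = -0.12
B_MONKEY = -1.92
B_INTERACTION = 0.08


def log_odds(cs, monkey_owner):
    """Predicted log odds of spotting the gorilla, including the interaction term."""
    m = 1 if monkey_owner else 0
    return INTERCEPT + B_CONCENTRATION * cs + B_MONKEY * m + B_INTERACTION * m * cs


# Effect of one extra unit of concentration span within each group.
slope_non_owner = B_CONCENTRATION
slope_owner = B_CONCENTRATION + B_INTERACTION


def monkey_odds_ratio(cs):
    """Odds ratio (owner vs. non-owner) now depends on concentration span."""
    return math.exp(B_MONKEY + B_INTERACTION * cs)


print(f"slope for non-owners: {slope_non_owner:.2f}")
print(f"slope for owners:     {slope_owner:.2f}")
for cs in (10, 50, 90):
    print(f"CS={cs}: odds ratio for monkey owners = {monkey_odds_ratio(cs):.2f}")
```

With an interaction there is no single "effect of owning a monkey": at low concentration spans owners have lower predicted odds than non-owners, and at high concentration spans they have higher odds.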

Interactive monkeys (2)
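[Figure: predicted probability of spotting the gorilla by concentration span from the interaction model, with separate curves for monkey owners and non-monkey owners.]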

Comparing models (1)
One of the most important differences between logistic regression and linear regression is in how we compare models. Remember for linear regression we looked at how the adjusted $R^2$ changed. If there was a significant increase when we added another variable (or interaction) then we thought the model had improved. For logistic regression there are a variety of ways of looking at model improvement.

Comparing models (2)
The best way of comparing models is to use something called the likelihood-ratio test. When we were using OLS regression, we were trying to minimize the sum of squares. For logistic regression we are trying to maximize something called the likelihood function (normally called L). To see whether our model has improved by adding a variable (or interaction, or squared term), we can compare the maximum of the likelihood function for each model (just like we compared the $R^2$ before for OLS regressions).

Comparing models (3)
In fact, just to complicate matters we actually compare the maximised values of -2*log L. By logging the Ls and multiplying them by -2, the difference between the two models’ values conveniently ends up with a chi-square distribution. This means we test whether there is a statistically significant improvement with reference to the χ² distribution.

$$\chi^2 = \underbrace{(-2\log L_0)}_{\text{first model's maximised value}} - \underbrace{(-2\log L_1)}_{\text{second model's maximised value}}$$

Comparing models (4)

| Model | -2logL |
|---|---|
| Concentration | 85.5 |
| Concentration + monkey | 65.2 |
| Concentration + monkey + concent*monkey | 60.0 |

Each addition to the model here improves the model fit. We can test each improvement with a χ² test using the appropriate DF. Each model uses an extra degree of freedom, as we’re adding an extra parameter. From earlier weeks, we know that a χ² value of 20 with 1 df is highly statistically significant, so the model clearly fits better with the extra terms. Each of these improvements is statistically significant at the 0.05 level.
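The two tests can be reproduced with the standard library alone. The sketch relies on the fact that, for 1 degree of freedom, the upper tail of the χ² distribution can be written with the complementary error function; the function and variable names are made up for this example.

```python
import math


def chi2_p_value_1df(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# -2logL values from the table above.
minus2logL = {
    "concentration": 85.5,
    "concentration + monkey": 65.2,
    "concentration + monkey + interaction": 60.0,
}

lr_monkey = minus2logL["concentration"] - minus2logL["concentration + monkey"]
lr_interaction = (minus2logL["concentration + monkey"]
                  - minus2logL["concentration + monkey + interaction"])

print(f"adding monkey:      chi2 = {lr_monkey:.1f}, p = {chi2_p_value_1df(lr_monkey):.6f}")
print(f"adding interaction: chi2 = {lr_interaction:.1f}, p = {chi2_p_value_1df(lr_interaction):.4f}")
```

Both p-values fall below 0.05, though the improvement from adding the monkey dummy is far more decisive than the improvement from adding the interaction.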

Non-binary variables? A lot of categorical variables are not binary though, what can we do with these? Often we can recode them to a binary response. You often see vote choice in Britain coded to Conservative or not (with the not category including Labour, the Liberal Democrats and everyone else). We could use something called multinomial logistic regression. This allows the dependent categorical variable to have more than two categories. More on this next in POLS 7050.

Some warnings This course is only an introduction (and a very brief one at that) to statistical methods. Hopefully you can now pick up a journal and understand the results of a linear regression or logistic regression. Hopefully you can run models yourself and interpret the results. But, be careful on both counts. I really haven’t covered very much of the underlying math to the concepts I’ve talked about. Plus, there are specific things that are worth looking out for in your own and other people’s analysis. The three problems I’ve picked out are all things that will crop up next term in Intermediate Stats.

Warning (1)
Are you planning on using time as an independent variable with aggregate data? e.g. predicting presidential approval for every month between 1970 and 2000 in the US (dependent variable), using economic growth (independent variable). STOP. You need to use time-series analysis. When we measure things over time, we need to take into account autocorrelation. e.g. the errors we make in predicting presidential approval ratings in May are going to be highly correlated with the errors we make in predicting approval ratings in June.

Time-series analysis
Time-series models for this kind of data normally have a lagged dependent variable as an independent predictor. We include $Y_{t-1}$ as a predictor of $Y_t$. The assumption is normally that the effect of $X_t$ persists over time, and the coefficient $\beta_0$ is just the immediate impact. Broadly speaking the $\alpha_1$ coefficient tells us how long the effect persists for.
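To get a feel for what "persists" means, here is a toy sketch. It assumes the simplest version of such a model, $Y_t = \alpha_1 Y_{t-1} + \beta_0 X_t$, with no intercept and no error term, and the values 0.8 and 2.0 are invented for illustration. It traces what happens to Y after a one-off, one unit bump in X.

```python
def impulse_response(beta0, alpha1, periods):
    """Path of Y after a one unit bump in X at t = 0 (X is zero afterwards)."""
    y_prev = 0.0
    path = []
    for t in range(periods):
        x_t = 1.0 if t == 0 else 0.0
        y_t = alpha1 * y_prev + beta0 * x_t   # assumed model: no intercept, no noise
        path.append(y_t)
        y_prev = y_t
    return path


# Hypothetical coefficients, chosen only to show the pattern.
path = impulse_response(beta0=2.0, alpha1=0.8, periods=6)
for t, y in enumerate(path):
    print(f"t={t}: effect on Y = {y:.3f}")
```

The immediate impact is β0, and each later period keeps a fraction α1 of the previous period’s effect, so the closer α1 is to 1, the longer the effect lingers.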

Warning (2) Are you planning on using items that are highly correlated, or not well measured, in a regression? e.g. predicting whether women work full-time or part-time (dependent variable) using ten different attitudes to feminism (independent variables). STOP. You need to use factor analysis and create a scale, or use structural equation modelling. Factor analysis tells you how a collection of characteristics are linked together, and whether one can create a scale from what appear to be similar items. SEM is similar to factor analysis and allows one to create latent variables.

Factor analysis This is really useful for attitudinal variables (though can be used in numerous other contexts as well). Imagine that I have questions about what people think is important about being British e.g. speaking English, having British ancestry, having British citizenship, feeling British, being a Christian, being born in Britain, living one’s whole life in Britain. We can use factor analysis to tease out whether there are groups of questions that ‘fit together well’. In this case we might expect to find two factors, one representing ‘civic’ items and the other representing ‘ethnic’ items. People that answer positively to one ‘civic’ question also answer positively to other ‘civic’ questions.

Warning (3) Do you have count data (with ‘smallish’ counts)?
The dependent variable is thus measured as the number of occurrences of a certain event in a given period of time. e.g. number of presidential vetoes in any year, number of strikes in any month, etc. STOP. You need to use a different kind of model that uses the Poisson distribution. There is no upper limit to this kind of data (or a very high limit), and we can’t measure it as a proportion. There is a special distribution (and therefore special kind of model) that describes this data.
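As a quick illustration of what the Poisson distribution looks like for smallish counts, the sketch below computes the probability of each count when the average is 2 events per period. The average of 2 is an invented number, not an estimate from any real data.

```python
import math


def poisson_probability(count, mean_rate):
    """Probability of observing exactly `count` events when the average is mean_rate."""
    return math.exp(-mean_rate) * mean_rate ** count / math.factorial(count)


MEAN_RATE = 2.0  # hypothetical average number of events per period

for count in range(7):
    print(f"P({count} events) = {poisson_probability(count, MEAN_RATE):.3f}")
```

The probabilities are bunched up near zero with a long tail to the right and no fixed upper limit, which is nothing like the symmetric normal errors that linear regression assumes.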

Some parting words of “wisdom”
Quantitative data is easily found and can add a lot to your thesis. You don’t have to use fancy statistical methods to find interesting things. Take as many quantitative classes as you can! Quant work is certainly not the only way to analyze data, but a strong background makes you more marketable and well-rounded.
