Statistical Analysis SC504/HS927 Spring Term 2008

Statistical Analysis SC504/HS927 Spring Term 2008 Introduction to Logistic Regression Dr. Daniel Nehring

Outline
- Preliminaries: the SPSS syntax
- Linear regression and logistic regression
- OLS with a binary dependent variable
- Principles of logistic regression
- Interpreting logistic regression coefficients
- Advanced principles of logistic regression (for self-study)
Source: http://privatewww.essex.ac.uk/~dfnehr

PRELIMINARIES

The SPSS syntax
- A simple programming language giving access to all SPSS operations
- Provides access to operations not covered in the main interface
- Accessible through syntax windows
- Accessible through the 'Paste' button in every window of the main interface
- Documentation available in the 'Help' menu

Using SPSS syntax files
- Saved in a separate file format through the syntax window
- Run commands by highlighting them and pressing the arrow button
- Comments can be entered into the syntax
- Copy-paste operations make the syntax easy to learn
- The syntax is always preferable to the main interface: it keeps a log of your work and makes it possible to identify and correct mistakes

PART I

Simple linear regression
- Models the relation between two continuous variables, y and x
- Regression coefficient b1:
  - measures the association between y and x
  - is the amount by which y changes on average when x changes by one unit
- Estimated by the least squares method
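
The least squares procedure can be sketched in a few lines of Python (a minimal illustration with made-up data, not SPSS output):

```python
def least_squares(xs, ys):
    """Fit y = b0 + b1*x by ordinary least squares."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    # Intercept: the fitted line passes through the means
    b0 = my - b1 * mx
    return b0, b1

# The exact line y = 1 + 2x is recovered from noiseless data
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # -> 1.0 2.0
```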

Multiple linear regression
- Models the relation between a continuous variable y and a set of i continuous variables
- Partial regression coefficient bi:
  - is the amount by which y changes on average when xi changes by one unit and all the other x's remain constant
  - measures the association between xi and y adjusted for all other x's

Multiple linear regression: terminology
- y: the predicted, response, or dependent variable
- xi: the predictor, explanatory, or independent variables

OLS with a binary dependent variable
- Binary variables can take only two possible values:
  - yes/no (e.g. educated to degree level, smoker/non-smoker)
  - success/failure (e.g. of a medical treatment)
- Coded 1 or 0 (by convention, 1 = yes/success)
- If OLS is used with a binary dependent variable, the predicted values can be interpreted as probabilities, so they are expected to lie between 0 and 1
- But nothing constrains the regression model to predict values between 0 and 1; predictions below 0 or above 1 are possible and have no logical interpretation
- An approach that ensures the predicted values lie between 0 and 1, such as logistic regression, is therefore required
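
A quick numerical sketch of the problem, using hypothetical 0/1 data: a straight line fitted by least squares happily predicts "probabilities" outside [0, 1] at extreme values of x:

```python
# Hypothetical data: x might be age, y = 1 if the event occurred
xs = [20, 30, 40, 50, 60, 70]
ys = [0, 0, 0, 1, 1, 1]

# Ordinary least squares fit of the line y = b0 + b1*x
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
b0 = my - b1 * mx

# Predictions fall outside [0, 1] at extreme x, which makes no
# sense as a probability
print(round(b0 + b1 * 90, 2))  # 1.66, greater than 1
print(round(b0 + b1 * 10, 2))  # -0.4, less than 0
```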

Fitting the equation to the data
- Linear regression: least squares
- Logistic regression: maximum likelihood
- The likelihood function estimates the parameters with the property that the likelihood (probability) of the observed data is higher than for any other parameter values
- In practice it is easier to work with the log-likelihood

Maximum Likelihood Estimation (MLE)
- OLS cannot be used for logistic regression because the relationship between the dependent and independent variables is non-linear
- MLE is used instead to estimate the coefficients on the independent variables (the parameters)
- Of all possible values of these parameters, MLE chooses those under which the model would have been most likely to generate the observed sample
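
As a rough illustration of the idea (invented toy data and a crude grid search, rather than the iterative algorithms real packages use), MLE simply picks the parameter values that maximise the log-likelihood of the observed sample:

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of a logistic model ln(p/(1-p)) = b0 + b1*x."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# Toy data: the outcome becomes more likely as x grows
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

# Crude grid search over candidate parameter values: the "estimate"
# is the pair with the highest log-likelihood
best = max(((b0 / 10, b1 / 10)
            for b0 in range(-30, 31) for b1 in range(-30, 31)),
           key=lambda b: log_likelihood(b[0], b[1], xs, ys))
print(best)  # the slope should be positive for this data
```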

Logistic regression
- Models the relationship between a set of variables xi:
  - dichotomous (yes/no)
  - categorical (social class, ...)
  - continuous (age, ...)
- and a dichotomous (binary) variable y

PART II

Logistic regression (1)
- 'Logistic regression' is also called 'logit'
- p is the probability of an event occurring; 1-p is the probability of the event not occurring
- p can take any value from 0 to 1
- the odds of the event occurring = p/(1-p)
- the dependent variable in a logistic regression is the natural log of the odds: ln(p/(1-p))

Logistic regression (2)
- While ln(p/(1-p)) can take any value, p will always range from 0 to 1
- the equation to be estimated is: ln(p/(1-p)) = b0 + b1x1 + b2x2 + ... + bixi
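
The key property of the log odds is easy to check numerically (a minimal sketch): probabilities between 0 and 1 map onto the whole real line, with 0.5 mapping to 0.

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p)). Maps p in (0, 1) to any real value."""
    return math.log(p / (1 - p))

for p in (0.1, 0.25, 0.5, 0.9):
    print(p, logit(p))
# logit(0.5) is 0; probabilities below 0.5 give negative log odds,
# probabilities above 0.5 give positive log odds
```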

Logistic regression (3)
- The logistic transformation: the logit of P(y|x) is ln[P(y|x) / (1 - P(y|x))] = b0 + b1x1 + ... + bixi

Predicting p
- let z = b0 + b1x1 + b2x2 + ... + bixi
- then, to predict p for individual i: p = e^z / (1 + e^z)
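
A minimal sketch of the prediction formula, the inverse of the logit:

```python
import math

def predict_p(z):
    """p = e^z / (1 + e^z): always strictly between 0 and 1."""
    return math.exp(z) / (1 + math.exp(z))

print(predict_p(0))   # 0.5
print(predict_p(5))   # close to 1
print(predict_p(-5))  # close to 0
```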

Logistic function (1)
[Figure: the S-shaped logistic curve, probability of event y plotted against x]

PART III

Interpreting logistic regression coefficients
- the intercept is the value of the log odds when all independent variables are zero
- each slope coefficient is the change in the log odds from a 1-unit increase in that independent variable, controlling for the effects of the other variables
- two problems:
  - log odds are not easy to interpret
  - the change in the odds from a 1-unit increase in one independent variable depends on the values of the other independent variables
- but the exponent of b (e^b) does not depend on the values of the other independent variables, and is the odds ratio

Odds ratio
- for a coefficient b on a dummy variable, e.g. female = 1 for women, 0 for men:
  - the odds ratio e^b is the ratio of the odds of the event occurring for women to the odds of it occurring for men
  - i.e. the odds for women are e^b times the odds for men

General rules for interpreting logistic regression coefficients
- if b1 > 0, X1 increases p; if b1 < 0, X1 decreases p
- if the odds ratio > 1, X1 increases p; if the odds ratio < 1, X1 decreases p
- if the CI for b1 includes 0, X1 does not have a statistically significant effect on p
- if the CI for the odds ratio includes 1, X1 does not have a statistically significant effect on p
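
These rules can be collected into a small helper (an illustrative sketch; the function name and return strings are our own, and the coefficient values below are invented):

```python
def interpret(b, ci_low, ci_high):
    """Apply the general rules to coefficient b with CI (ci_low, ci_high)."""
    if ci_low <= 0 <= ci_high:
        return "no statistically significant effect on p"
    return "increases p" if b > 0 else "decreases p"

print(interpret(0.69, 0.2, 1.2))    # increases p
print(interpret(-0.5, -0.9, -0.1))  # decreases p
print(interpret(0.1, -0.2, 0.4))    # no statistically significant effect on p
```

Note that the two pairs of rules are equivalent: e^b > 1 exactly when b > 0, and the CI for the odds ratio includes 1 exactly when the CI for b includes 0.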

An example: modelling the relationship between disability, age and income in the 65+ population
- dependent variable: presence of disability (1 = yes, 0 = no)
- independent variables:
  - X1: age in years in excess of 65 (i.e. 65 -> 0, 70 -> 5)
  - X2: whether the respondent has a low income (in the lowest third of the income distribution)
- data: Health Survey for England, 2000

Example: logistic regression estimate for the probability of being disabled, people aged 65+
[Table of coefficient estimates not reproduced in the transcript; the worked example later uses intercept -0.912, age coefficient 0.078 and low-income coefficient -0.27]

PART IV

Odds, log odds, odds ratios and probabilities

Odds, odds ratios and probabilities
- pj = 0.2, i.e. a 20% probability; oddsj = 0.2/(1-0.2) = 0.2/0.8 = 0.25
- pk = 0.4; oddsk = 0.4/0.6 = 0.67
- relative probability (risk): pj/pk = 0.2/0.4 = 0.5
- odds ratio: oddsj/oddsk = 0.25/0.67 = 0.37
- the odds ratio is not equal to the relative probability/risk, except approximately when pj and pk are both small
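
The arithmetic above can be checked directly (note that the exact odds ratio is 0.375; the slide's 0.37 comes from dividing by the rounded value 0.67):

```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

p_j, p_k = 0.2, 0.4
print(round(odds(p_j), 2))              # 0.25
print(round(odds(p_k), 2))              # 0.67
print(p_j / p_k)                        # relative risk: 0.5
print(round(odds(p_j) / odds(p_k), 3))  # odds ratio: 0.375

# With small probabilities the two measures nearly coincide
p_j, p_k = 0.02, 0.04
print(p_j / p_k)                        # 0.5
print(round(odds(p_j) / odds(p_k), 3))  # 0.49
```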

Points to note from logit example.xls
- if you see an odds ratio of e.g. 1.5 on a dummy variable indicating female, beware of saying 'women have a probability 50% higher than men': you can only say this if both probabilities are small
- it is better to calculate predicted probabilities for example cases and compare those

Predicting p (recap)
- let z = b0 + b1x1 + b2x2 + ... + bixi
- then, to predict p for individual i: p = e^z / (1 + e^z)

E.g.: predicting a probability from our model
Predict disability for someone on a low income aged 75:
- add up the linear equation: a (= -0.912) + 10*0.078 (age over 65) + 1*(-0.27) = -0.402
- take the exponent to get the odds of being disabled: e^-0.402 = 0.669
- divide the odds by 1 + the odds to give the probability: 0.669/1.669 = c.0.4, i.e. a 40 per cent chance of being disabled
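
The three steps can be reproduced directly, using the coefficient values quoted above:

```python
import math

# Coefficients from the example model: intercept a, age over 65, low income
a, b_age, b_low_income = -0.912, 0.078, -0.27

# Someone on a low income aged 75: age over 65 = 10, low income = 1
z = a + 10 * b_age + 1 * b_low_income  # the linear equation
odds = math.exp(z)                     # exponentiate to get the odds
p = odds / (1 + odds)                  # convert odds to a probability

print(round(z, 3))     # -0.402
print(round(odds, 3))  # 0.669
print(round(p, 2))     # 0.4
```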

Goodness of fit in logistic regressions
- based on improvements in the likelihood of observing the sample
- use a chi-square test with the test statistic -2(ln L_R - ln L_U), where R and U indicate the restricted and unrestricted models
- unrestricted: all independent variables in the model
- restricted: all or a subset of the variables excluded from the model (their coefficients restricted to be 0)
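
The test statistic is simple to compute once the two log-likelihoods are known (the values below are invented for illustration):

```python
# Hypothetical log-likelihoods from fitting the two models
ln_L_restricted = -120.5    # some coefficients restricted to 0
ln_L_unrestricted = -112.3  # all independent variables included

# The restricted model can never fit better, so the statistic is >= 0
lr_statistic = -2 * (ln_L_restricted - ln_L_unrestricted)
print(round(lr_statistic, 1))  # 16.4
# Compare against a chi-square critical value with degrees of freedom
# equal to the number of restricted coefficients
```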

Statistical significance of coefficient estimates in logistic regressions
- calculated using standard errors, as in OLS
- for large n, |t| > 1.96 means that there is a 5% or lower probability that the true value of the coefficient is 0, i.e. p <= 0.05

95% confidence intervals for logistic regression coefficient estimates
- for CIs of odds ratios, calculate the CIs for the coefficients and take the exponents of their endpoints
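
For example, with a hypothetical coefficient estimate and standard error:

```python
import math

# Hypothetical coefficient estimate and its standard error
b, se = 0.078, 0.012

# 95% CI for the coefficient, then exponentiate the endpoints
# to get the CI for the odds ratio
ci_low, ci_high = b - 1.96 * se, b + 1.96 * se
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(round(ci_low, 4), round(ci_high, 4))  # 0.0545 0.1015
print(round(or_low, 3), round(or_high, 3))  # 1.056 1.107
```

Since neither CI includes its null value (0 for the coefficient, 1 for the odds ratio), this hypothetical coefficient would be statistically significant at the 5% level.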