1 OLS & Logistic Regression Analysis – A Recap Cristina Penaloza & Eoin Maloney Health Economics Unit.


1 OLS & Logistic Regression Analysis – A Recap Cristina Penaloza & Eoin Maloney Health Economics Unit

2 Outline What is regression analysis? Relevance of regression analysis Regression modelling process –OLS regression –Logistic regression Exercise

3 What is Regression Analysis? “Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, … with a view to estimating and/or predicting the (population) mean or average value of the dependent variable in terms of known or fixed (in repeated sampling) values of the explanatory variables.” Gujarati (1995: 16)

4 Terminology Dependent variable, explained variable, outcome variable, outcome, response variable, regressand, output variable, predicted value, predictand, endogenous Explanatory variable, Independent variable, predictor variable, predictor, regressor, stimulus/control variable, exogenous Disturbance (random error) term, residual, residual error

5 Causation / correlation Regression vs causation –“A statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics” Gujarati (1995: 20) Regression vs correlation –Correlation analysis: seeks to measure the strength of linear association between two variables –Regression analysis: seeks to estimate or predict the average value of one variable on the basis of fixed values of other variables

6 Why study regression? Adjusting for baseline characteristics in Economic Evaluation (Nathwani et al. 2004; Manca et al. 2005; Hoch et al. 2002). Predicting/mapping utility-based outcome measures for use in Economic Evaluation (Gray et al. 2006; Kaambwa et al. 2011; Sengupta et al. 2004). Predicting costs for use in Economic Evaluation (Smith et al. 2007; Bonizzato et al. 2000; Baumeister et al. 2009). Constructing CEACs (Hoch et al. 2006). Regression imputation for missing data (Billingham et al. 2002; Engels & Diehr 2003; Blazer et al. 1995). Explaining factors which cause variation in outcome and cost data (Barber & Thompson 2004; Kaambwa et al. 2008; Raine et al. 2010).

7 The regression modelling process 1. Statement of hypothesis (theory) 2. Specification of the model 3. Obtaining the data 4. Estimation of the regression model 5. Diagnostic analysis 6. Hypothesis testing 7. Prediction/forecasting

8 1. Statement of hypothesis Example: High Blood Pressure and older people “Amongst those over the age of 65, the incidence of high diastolic blood pressure (DIBP) increases with age. Therefore, DIBP is, in part, explained by age.”

9 2. Specification of the model In functional form: mean diastolic blood pressure, DIBP, is some function of age, A: $\text{DIBP} = f(A)$ (1)

10 2. Specification of the model (cntd) In mathematical (linear) form: $Y = \beta_1 + \beta_2 X$ (2) where $Y$ = mean DIBP, $X$ = age, and $\beta_1$, $\beta_2$ = parameters

11 [Figure: Linear relationship, showing E(Y|X) plotted against X]

12 2. Specification of the model (cntd) Econometric (regression) model: $Y = \beta_1 + \beta_2 X + u$ (3) where $Y$ = mean DIBP (the dependent variable), $X$ = age (the explanatory variable), $u$ = disturbance (random error) term, and $\beta_1$, $\beta_2$ = parameters

13 The error term (u) Omitted explanatory variables Measurement error Wrong functional form Unavailability of data Inherent randomness etc….

14 3 & 4. Data / estimation of parameters Obtaining the data –observed values of Y and X Estimation of the parameters –Y and X are the variables (“known”) –$\beta_1$, $\beta_2$ and $u$ are “unknown”: $\beta_1$ and $\beta_2$ are the parameters to be estimated, and $u$ is the unobserved disturbance

15 5. Diagnostic analysis Is the model correctly specified? Have all assumptions been met? Are there any unusual observations or outliers that may unduly influence results? More of this later this morning…

16 6. Hypothesis Testing Is estimate statistically close to a postulated value? Or are estimates in accord with expectations from theory? Only after model has been shown to be adequate

17 7. Forecasting or Prediction If hypothesis or theory being tested is confirmed, then future values of the dependent variable can be predicted or forecast Policy recommendations

18 The practice of regression modelling: Hypothesis / theory → Model specification → Data → Estimation → Specification testing and diagnostic testing → Is the model adequate? (No / Yes) → if Yes: Hypothesis testing → Policy: prediction and forecasting

19 Sample regression In practice we will never observe the population regression line. Instead we take a random sample of observations in order to estimate the $\beta$s. We distinguish the sample regression from the population regression as follows:

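20 Sample regression Mathematical model: $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$ Econometric model: $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$ where $\hat{Y}_i$ = estimator of $E(Y \mid X_i)$, $\hat{\beta}_1$ = estimator of $\beta_1$, $\hat{\beta}_2$ = estimator of $\beta_2$, $\hat{u}_i$ = estimate of $u_i$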

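21 Population regression Mathematical model: $E(Y \mid X_i) = \beta_1 + \beta_2 X_i$ Econometric model: $Y_i = \beta_1 + \beta_2 X_i + u_i$ where $E(Y \mid X_i)$ = mean of $Y$ given $X_i$, $\beta_1$ = constant/Y intercept, $\beta_2$ = coefficient for $X_i$, $u_i$ = error term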
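The difference between the population regression and a sample regression can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population parameters `BETA_1` and `BETA_2`, the error standard deviation and the age range are made-up values chosen for illustration, not figures from the slides. Each random sample gives slightly different estimates of the same fixed population parameters.

```python
import numpy as np

# Hypothetical population parameters (assumed for illustration only).
BETA_1 = 60.0  # intercept of the population regression line
BETA_2 = 0.3   # slope on age


def draw_sample(n, rng):
    """Draw n (age, DIBP) pairs from the assumed population regression."""
    age = rng.uniform(65, 90, size=n)   # explanatory variable X
    u = rng.normal(0.0, 5.0, size=n)    # disturbance term u
    dibp = BETA_1 + BETA_2 * age + u    # Y = beta_1 + beta_2 * X + u
    return age, dibp


rng = np.random.default_rng(0)
for k in range(3):
    age, dibp = draw_sample(50, rng)
    # np.polyfit returns the slope first, then the intercept, for deg=1.
    b2_hat, b1_hat = np.polyfit(age, dibp, deg=1)
    print(f"sample {k}: beta1_hat={b1_hat:.2f}, beta2_hat={b2_hat:.3f}")
```

With larger samples the estimates settle closer to the assumed population values, which is the sense in which the sample regression estimates the population regression.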

22 [Figure: plot of Y against X, with X1 to X4 and Y1 to Y4 marked on the axes]

23 [Figure: plot of Y against X, with X1 to X4 and Y1 to Y4 marked on the axes]

24 The Ordinary Least Squares (OLS) Model The dependent variable is modelled as a linear function of predictor or independent variables. The dependent variable is continuous, e.g. blood pressure, cholesterol level or weight.

25 OLS What factors cause variation in an individual’s diastolic blood pressure? What variables explain movement in men’s cholesterol level? What variables are predictive of high birth weight in a population of mothers from Birmingham? The dependent variable can take on any numerical value within the limits of the range of that variable.

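26 OLS The OLS method seeks to minimise the residual sum of squares: $\sum_i \hat{u}_i^2 = \sum_i (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2$, where $\hat{u}_i$ is the residual for observation $i$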
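For a single explanatory variable, the values that minimise the residual sum of squares have a well-known closed form. The sketch below computes them with NumPy; the function names `ols_simple` and `rss` and the small age/DIBP dataset are hypothetical and made up purely for illustration.

```python
import numpy as np


def ols_simple(x, y):
    """Closed-form OLS estimates (intercept, slope) for one regressor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rss(x, y, intercept, slope):
    """Residual sum of squares for a candidate line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))


# Made-up data for illustration only (not from the slides).
age = [66, 70, 74, 78, 82, 86]
dibp = [80, 83, 82, 86, 88, 89]

b1_hat, b2_hat = ols_simple(age, dibp)
print(f"intercept={b1_hat:.3f}, slope={b2_hat:.3f}")
print("RSS at OLS estimates:", round(rss(age, dibp, b1_hat, b2_hat), 3))
# Any other line gives a larger RSS, e.g. after nudging the slope:
print("RSS with slope + 0.1:", round(rss(age, dibp, b1_hat, b2_hat + 0.1), 3))
```

Comparing the two RSS values shows the defining property: moving either coefficient away from the OLS estimates can only increase the residual sum of squares.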

27 Minimising the residual… [Figure: plot of Y against X, with X1 to X4 marked on the X axis]

28 Describing the overall fit of the estimated model The coefficient of determination, or $R^2$, is a measure of the ‘goodness of fit’ of a regression, i.e. the proportion of the variation in $Y_i$ which is explained by the regression. $0 \le R^2 \le 1$. But focusing solely on maximising $R^2$ is not a good idea! (other measures will be considered this afternoon…)
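As a rough illustration of "proportion of variation explained", $R^2$ can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes an OLS fit with an intercept; the helper name `r_squared` and the data are made up for illustration.

```python
import numpy as np


def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS for observed y and fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)        # unexplained variation
    tss = np.sum((y - y.mean()) ** 2)     # total variation around the mean
    return float(1.0 - rss / tss)


# Made-up data for illustration only (not from the slides).
age = np.array([66, 70, 74, 78, 82, 86], dtype=float)
dibp = np.array([80, 83, 82, 86, 88, 89], dtype=float)

slope, intercept = np.polyfit(age, dibp, deg=1)   # OLS line
fitted = intercept + slope * age
print("R^2 =", round(r_squared(dibp, fitted), 4))
```

In the one-regressor case this equals the squared correlation between X and Y, which is one way to sanity-check the calculation.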

29 Models for Categorical Dependent Variables For use on dependent variables that are either dichotomous (individual has CVD or not), or polytomous (Low, Medium or High cholesterol level) which are quite common in Health-related datasets

30 Models for Categorical Dependent Variables Focus: binary response variable. Independent variables are used to predict whether or not some event will occur, based on certain described characteristics: Will an individual get cancer or not? Will a patient survive or die? Will an individual develop CVD or not?

31 Coding of outcomes: Usually coded 1 if the attribute of interest is present and 0 otherwise. Approach to be used: Logistic regression - best for dichotomous dependent variable, and continuous and categorical independent variables. Other commonly used approaches: Probit & Nested Logit

32 Major difference from Ordinary Linear Regression Uses a link function for the relationship between the dependent and independent variables. Maximum likelihood estimation (MLE) of a link function of the dependent variable is substituted for regression's least squares estimation of the dependent variable itself. MLE: a method of estimating unknown parameters in such a way that the probability of observing the given values of the dependent variable is as high as possible
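To make the idea of maximum likelihood concrete, the sketch below fits a logistic regression by maximising the log-likelihood with Newton-Raphson steps. This is one common way to do it, not necessarily what any particular statistics package does internally. The function names `log_likelihood` and `fit_logit` and the ten-person dataset are hypothetical, made up for illustration; age is centred only to keep the arithmetic well behaved.

```python
import numpy as np


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood with a logit link."""
    eta = X @ beta
    # log L = sum[ y * eta - log(1 + exp(eta)) ]
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))


def fit_logit(X, y, n_iter=25):
    """Maximise the log-likelihood with Newton-Raphson updates."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))       # fitted probabilities
        gradient = X.T @ (y - p)                    # score vector
        weights = p * (1.0 - p)
        hessian = -(X * weights[:, None]).T @ X     # second derivatives
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta


# Made-up data for illustration only: age and CVD status (1 = has CVD).
age = np.array([66, 68, 70, 72, 74, 76, 78, 80, 82, 84], dtype=float)
cvd = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1], dtype=float)

age_c = age - age.mean()                            # centred age
X = np.column_stack([np.ones_like(age_c), age_c])   # constant + age
beta_hat = fit_logit(X, cvd)

print("estimates (constant, age):", np.round(beta_hat, 4))
print("log-likelihood at zero:   ", round(log_likelihood(np.zeros(2), X, cvd), 4))
print("log-likelihood at the MLE:", round(log_likelihood(beta_hat, X, cvd), 4))
```

The log-likelihood at the fitted coefficients is higher than at the starting values, which is exactly what "as high as possible" means in the MLE definition above.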

33 Issues to consider… Why are OLS models not suitable for dichotomous data? Logit transformation – Link Function Marginal & Conditional Odds and Probability

34 Suppose we want to model $Y_i = \beta_0 + \beta_1 X_1 + \varepsilon$, where $\beta_0$ is the coefficient on the constant term, $\beta_1$ is the coefficient on the independent variable, $X_1$ is the independent variable (e.g. Age), and $\varepsilon$ is the error term.

35 Let $Y_i = 1$ if the $i$th individual has CVD, and 0 otherwise. Let also $Y_i$ take the values 1 and 0 with probabilities $p_i$ and $1 - p_i$, respectively, i.e. $P(Y_1 = 1) = P(\text{CVD} = 1) = p_1$ and $P(Y_1 = 0) = P(\text{CVD} = 0) = 1 - p_1$

36 Why not just use Simple Linear (OLS) regression? Consider a simple OLS regression model $\text{CVD} = \beta_0 + \beta_1 \text{Age} + \varepsilon$. Assumptions: a) $\varepsilon \sim N(0, \sigma^2)$ b) $\text{var}(\varepsilon)$ is constant, i.e. homoscedasticity. Binary outcome variables violate these assumptions…

37 Why not just use Simple Linear (OLS) regression? CVD is binary, as it takes on only two values. Consequently, for any given Age, ε can also take only two values, and therefore the ‘normality of residuals’ assumption is violated. The error terms are heteroscedastic (their variance, p(1 − p), changes with the probability p), so the regression assumption that the variance of the error term is constant is violated. The predicted probabilities can be greater than 1 or less than 0, which can be a problem if the predicted values are used in a subsequent analysis!
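The out-of-range problem is easy to reproduce. The sketch below fits a straight line by OLS to a binary outcome (sometimes called a linear probability model) and predicts over a wider age range. The data are made up for illustration and are not from the slides.

```python
import numpy as np

# Made-up data for illustration only: age and CVD status (1 = has CVD).
age = np.array([66, 68, 70, 72, 74, 76, 78, 80, 82, 84], dtype=float)
cvd = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1], dtype=float)

# OLS fit of the binary outcome on age.
slope, intercept = np.polyfit(age, cvd, deg=1)

# Predict "probabilities" for ages inside and outside the observed range.
new_age = np.arange(60, 96, 5, dtype=float)
predicted = intercept + slope * new_age

for a, p in zip(new_age, predicted):
    flag = "" if 0.0 <= p <= 1.0 else "  <- outside [0, 1]"
    print(f"age {a:.0f}: predicted probability {p:.3f}{flag}")
```

Because a straight line is unbounded, nothing stops the fitted values from leaving the interval [0, 1] once age moves far enough from the centre of the data.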

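38 Logit transformation 1. Move from probabilities to odds: $\frac{p_i}{1 - p_i} = e^{\beta_0 + \beta_1 X_1}$ 2. Take logs of both sides, to get log-odds or logit: $\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_1$, or equivalently, $\text{logit}(p_i) = \beta_0 + \beta_1 X_1$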

39 The Logit transformation removes the floor restriction

40 Logistic Regression Output Part of this output is in form of Odds, Odds ratios and probability. An understanding of these concepts (both marginal and conditional) is therefore cardinal to interpreting Logistic Regression output Key Question to be explored: What factors determine the probability that an individual will or will not develop CVD?

41 Marginal & Conditional odds The odds of having CVD are 115/85 = 1.35. This is the marginal or unconditional odds of having CVD. The conditional odds of having CVD, given “smokers”, are 75:25, or 3. A smoker is 3.0 times as likely to have CVD as not to have it. The conditional odds of having CVD, given the category “non-smokers”, are 40:60, or 0.67. A non-smoker is 0.67 times as likely to have CVD as not to have it.

42 Probability The probability of having CVD is 115/200 = 0.575. The probability of having CVD given that one is a smoker is 75/100 = 0.75. The probability of having CVD given that one is a non-smoker is 40/100 = 0.40.

43 Odds Ratio The odds ratio of smokers (numerator) to non-smokers (denominator) for CVD is 3/0.67 = 4.478. (This means that the odds of smokers having CVD are 4.478 times as high as those of non-smokers having CVD.) The odds ratio is the cross-product ratio, i.e. (75 × 60)/(25 × 40) = 4.5; the 4.478 above comes from rounding the non-smoker odds to 0.67. When one moves from being a non-smoker to a smoker, the odds of having CVD increase by 347.8% (i.e. from 0.67 odds for non-smokers to 3 for smokers).
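The odds, probabilities and odds ratio on these slides can be reproduced from the four cell counts they imply (75 smokers with CVD and 25 without; 40 non-smokers with CVD and 60 without). The sketch below is illustrative and the variable names are assumptions. It also computes the ratio of probabilities, to show that it is a different quantity from the odds ratio.

```python
# Cell counts implied by the ratios quoted on the slides.
cvd_smoker, no_cvd_smoker = 75, 25
cvd_nonsmoker, no_cvd_nonsmoker = 40, 60

total_cvd = cvd_smoker + cvd_nonsmoker              # 115
total_no_cvd = no_cvd_smoker + no_cvd_nonsmoker     # 85
total = total_cvd + total_no_cvd                    # 200

# Marginal and conditional odds.
marginal_odds = total_cvd / total_no_cvd
odds_smoker = cvd_smoker / no_cvd_smoker
odds_nonsmoker = cvd_nonsmoker / no_cvd_nonsmoker

# Marginal and conditional probabilities.
p_cvd = total_cvd / total
p_smoker = cvd_smoker / (cvd_smoker + no_cvd_smoker)
p_nonsmoker = cvd_nonsmoker / (cvd_nonsmoker + no_cvd_nonsmoker)

# Odds ratio, two equivalent ways, and the ratio of probabilities.
odds_ratio = odds_smoker / odds_nonsmoker
cross_product = (cvd_smoker * no_cvd_nonsmoker) / (no_cvd_smoker * cvd_nonsmoker)
probability_ratio = p_smoker / p_nonsmoker

print(f"marginal odds      = {marginal_odds:.3f}")
print(f"odds (smokers)     = {odds_smoker:.3f}")
print(f"odds (non-smokers) = {odds_nonsmoker:.3f}")
print(f"P(CVD)             = {p_cvd:.3f}")
print(f"odds ratio         = {odds_ratio:.3f} (cross-product: {cross_product:.3f})")
print(f"probability ratio  = {probability_ratio:.3f}")
```

Without intermediate rounding the odds ratio is exactly 4.5; dividing 3 by the rounded value 0.67 gives about 4.478, which is where the 347.8% figure comes from.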

44 Alternative interpretation of Odds Ratio Smokers are 4.478 times as likely to have CVD as non-smokers. The risk of having CVD is 4.478 times greater for smokers than non-smokers. (Strictly, these two phrasings describe odds rather than probabilities: the ratio of the probabilities is 0.75/0.40 = 1.875.) The odds of CVD for smokers are 347.8% higher than the odds of CVD for non-smokers ((4.478 − 1) × 100%). The predicted odds for smokers are 4.478 times the odds for non-smokers. A one unit change in the independent variable Smokers (non-smokers to smokers) increases the odds of having CVD by a factor of 4.478.

45 References Altman D.G. Practical Statistics For Medical Research (London: Chapman & Hall/CRC). Gujarati D.N. 1995. Basic Econometrics (New York: McGraw-Hill, Inc). Johnston J. and J. DiNardo. Econometric Methods (London: The McGraw-Hill Companies, Inc). Long J.S. Regression Models for Categorical and Limited Dependent Variables. A Volume in the Sage Series for Advanced Quantitative Techniques (Thousand Oaks, CA: Sage Publications). Wang, MinQi, James M. Eddy, Eugene C. Fitzhugh. "Application of Odds Ratio and Logistic Models in Epidemiology and Health Research," Health Values 19.

46 References Nathwani et al. 2004, “An economic evaluation of a European cohort from a multinational trial of linezolid versus teicoplanin in serious Gram-positive bacterial infections: the importance of treatment setting in evaluating treatment effects”, International Journal of Antimicrobial Agents 23: 315–324. Manca A, Hawkins N, Sculpher M 2005, “Estimating mean QALYs in trial-based cost-effectiveness analysis: the importance of controlling for baseline utility”, Health Economics 14. Hoch et al. 2002, “Something old, something new, something blue: a framework for the marriage of health econometrics and cost-effectiveness analysis”, Health Econ 11: 415–430. Gray et al. 2006, “Estimating the association between SF-12 responses and EQ-5D utility values by response mapping”, Med Decis Making, vol. 26, no. 1.

47 References Kaambwa et al. 2011, “Mapping utility scores from the Barthel index”, Eur. Journal of Health Economics, DOI: /s. Sengupta et al. 2004, “Mapping the SF-12 to the HUI3 and VAS in a managed care population”, Med Care 42, 9. Smith et al. 2007, “Predicting Costs Of Care In Chronic Kidney Disease: The Role Of Comorbid Conditions”, The Internet Journal of Nephrology 4, 1. Bonizzato et al. 2000, “Community-based mental health care: to what extent are service costs associated with clinical, social and service history variables?”, Psychological Medicine 30. Baumeister et al. 2009, “Predictive modeling of health care costs: do cardiovascular risk markers improve prediction?”, European Journal of Cardiovascular Prevention & Rehabilitation.

48 References Hoch et al. 2006, “Using the net benefit regression framework to construct cost-effectiveness acceptability curves: an example using data from a trial of external loop recorders versus Holter monitoring for ambulatory monitoring of "community acquired" syncope”, BMC Health Services Research 6: 68. Billingham LJ et al. 2002, “Patterns, costs and cost-effectiveness of care in a trial of chemotherapy for advanced non-small cell lung cancer: evidence from a randomised trial”, Lung Cancer 37. Engels, J.M. & Diehr, P. 2003, “Imputation of missing longitudinal data: a comparison of methods”, Journal of Clinical Epidemiology 56: 968–976. Blazer et al. 1995, “Health Services Access and Use among Older Adults in North Carolina: Urban vs Rural Residents”, American Journal of Public Health 85, 10.

49 References Barber, J. & Thompson, S. 2004, “Multiple regression of cost data: use of generalised linear models”, J Health Serv Res Policy 9. Kaambwa, B., Bryan, S., Barton, P., Parker, H., Martin, G., Hewitt, G., Parker, S., & Wilson, A. 2008, “Costs and health outcomes of intermediate care: results from five UK case study sites”, Health Soc. Care Community 16. Raine et al. 2010, “Social variations in access to hospital care for patients with colorectal, breast, and lung cancer between 1999 and 2006: retrospective analysis of hospital episode statistics”, BMJ 340: b5479.

50 Exercises OLS regression Logistic Regression