# OLS & Logistic Regression Analysis – A Recap. Cristina Penaloza & Eoin Maloney, Health Economics Unit

1 OLS & Logistic Regression Analysis – A Recap Cristina Penaloza & Eoin Maloney Health Economics Unit

2 Outline
- What is regression analysis?
- Relevance of regression analysis
- Regression modelling process
  - OLS regression
  - Logistic regression
- Exercise

3 What is Regression Analysis? “Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, … with a view to estimating and/or predicting the (population) mean or average value of the dependent variable in terms of known or fixed (in repeated sampling) values of the explanatory variables.” Gujarati (1995: 16)

4 Terminology
- Dependent variable, explained variable, outcome variable, outcome, response variable, regressand, output variable, predicted value, predictand, endogenous
- Explanatory variable, independent variable, predictor variable, predictor, regressor, stimulus/control variable, exogenous
- Disturbance (random error) term, residual, residual error

5 Causation / correlation
- Regression vs causation: “A statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics” Gujarati (1995: 20)
- Regression vs correlation:
  - Correlation analysis seeks to measure the strength of linear association between two variables
  - Regression analysis seeks to estimate or predict the average value of one variable on the basis of fixed values of other variables

6 Why study regression?
- Adjusting for baseline characteristics in Economic Evaluation (Nathwani et al. 2004; Manca et al. 2005; Hoch et al. 2002)
- Predicting/mapping utility-based outcome measures for use in Economic Evaluation (Gray et al. 2006; Kaambwa et al. 2011; Sengupta et al. 2004)
- Predicting costs for use in Economic Evaluation (Smith et al. 2007; Bonizzato et al. 2000; Baumeister et al. 2009)
- Constructing CEACs (Hoch et al. 2006)
- Regression imputation for missing data (Billingham et al. 2002; Engels & Diehr, 2003; Blazer et al. 1995)
- Explaining factors which cause variation in outcome and cost data (Barber & Thompson, 2004; Kaambwa et al. 2008; Raine et al. 2010)

7 The regression modelling process
1. Statement of hypothesis (theory)
2. Specification of the model
3. Obtaining the data
4. Estimation of the regression model
5. Diagnostic analysis
6. Hypothesis testing
7. Prediction/forecasting

8 1. Statement of hypothesis. Example: high blood pressure and older people. “Amongst those over the age of 65, the incidence of high diastolic blood pressure (DIBP) increases with age. Therefore, DIBP is, in part, explained by age.”

9 2. Specification of the model. In functional form: mean diastolic blood pressure, DIBP, is some function of age, A: $\text{DIBP} = f(A)$ (1)

10 2. Specification of the model (contd). In mathematical (linear) form: $Y = \beta_1 + \beta_2 X$ (2), where $Y$ = mean DIBP, $X$ = age, and $\beta_1$ and $\beta_2$ are parameters.

11 [Figure: linear relationship between $E(Y|X)$ and $X$, with points marked at $x_1$, $x_3$, $x_4$ and $x_6$]

12 2. Specification of the model (contd). Econometric (regression) model: $Y = \beta_1 + \beta_2 X + u$ (3), where $Y$ = mean DIBP (the dependent variable), $X$ = age (the explanatory variable), $u$ = disturbance (random error) term, and $\beta_1$ and $\beta_2$ are parameters.
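To make equation (3) concrete, the following minimal Python sketch simulates observations from it. The parameter values (`beta1 = 60`, `beta2 = 0.3`, `sigma = 5`), the age range and the function name `simulate_dibp` are illustrative assumptions, not figures from the presentation.

```python
import random

def simulate_dibp(ages, beta1, beta2, sigma, seed=0):
    """Draw Y = beta1 + beta2 * X + u, with u ~ N(0, sigma^2), for each age X."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [beta1 + beta2 * age + rng.gauss(0.0, sigma) for age in ages]

# Hypothetical values chosen only for illustration.
ages = list(range(65, 91))
dibp = simulate_dibp(ages, beta1=60.0, beta2=0.3, sigma=5.0)

# With sigma = 0 the disturbance vanishes and equation (3) collapses to equation (2).
deterministic = simulate_dibp([70], beta1=60.0, beta2=0.3, sigma=0.0)
```

Setting `sigma` to zero shows the role of the disturbance term: without it every observation lies exactly on the line.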

13 The error term (u)
- Omitted explanatory variables
- Measurement error
- Wrong functional form
- Unavailability of data
- Inherent randomness
- etc.

14 3 & 4. Data / estimation of parameters
- Obtaining the data: observed values of $Y$ and $X$
- Estimation of the parameters:
  - $Y$ and $X$ are the variables (“known”)
  - $\beta_1$ and $\beta_2$ are the parameters and $u$ is the disturbance (“unknown”)

15 5. Diagnostic analysis Is the model correctly specified? Have all assumptions been met? Are there any unusual observations or outliers that may unduly influence results? More of this later this morning…

16 6. Hypothesis Testing
- Is the estimate statistically close to a postulated value?
- Or are the estimates in accord with expectations from theory?
- Only after the model has been shown to be adequate
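As a rough illustration of testing whether an estimate is statistically close to a postulated value, the sketch below computes the t statistic for the slope of a simple regression. The data, the function name `slope_t_statistic` and the postulated slope of zero are assumptions made for the example; the critical value 3.182 is the standard two-sided 5% value of the t distribution with 3 degrees of freedom.

```python
import math

# Hypothetical sample (not from the presentation): age and mean DIBP.
age = [65, 70, 75, 80, 85]
dibp = [80, 83, 84, 88, 90]

def slope_t_statistic(x, y, postulated_slope=0.0):
    """t statistic for H0: slope == postulated_slope in a simple OLS regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)           # unbiased estimate of the error variance
    se_slope = math.sqrt(sigma2_hat / sxx)
    return (slope - postulated_slope) / se_slope

t_stat = slope_t_statistic(age, dibp)
T_CRITICAL_5PCT_DF3 = 3.182              # two-sided 5% critical value, 3 d.f.
reject_null = abs(t_stat) > T_CRITICAL_5PCT_DF3
```

In this made-up sample the statistic is far above the critical value, so the hypothesis of no age effect would be rejected at the 5% level. With only five observations this is purely a demonstration of the mechanics.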

17 7. Forecasting or Prediction If hypothesis or theory being tested is confirmed, then future values of the dependent variable can be predicted or forecast Policy recommendations

18 The practice of regression modelling [Flowchart: Hypothesis / theory → Model specification → Data → Estimation → Specification testing and diagnostic testing → Is the model adequate? If yes: Hypothesis testing → Policy: prediction and forecasting. If no: the model is revisited.]

19 Sample regression. In practice we will never observe the population regression line. Instead we take a random sample of observations in order to estimate the $\beta$s. We distinguish the sample regression from the population regression as follows:

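20 Sample regression
- Mathematical model: $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$
- Econometric model: $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$
- where $\hat{Y}_i$ = estimator of $E(Y|X_i)$, $\hat{\beta}_1$ = estimator of $\beta_1$, $\hat{\beta}_2$ = estimator of $\beta_2$, and $\hat{u}_i$ = estimate of $u_i$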

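21 Population regression
- Mathematical model: $E(Y|X_i) = \beta_1 + \beta_2 X_i$
- Econometric model: $Y_i = \beta_1 + \beta_2 X_i + u_i$
- where $E(Y|X_i)$ = conditional mean of $Y$ given $X_i$, $\beta_1$ = constant/Y intercept, $\beta_2$ = coefficient for $X_i$, and $u_i$ = error term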

22 [Figure: plot of $Y$ against $X$ showing the observations $(X_1, Y_1)$ to $(X_4, Y_4)$]

23 [Figure: plot of $Y$ against $X$ showing the observations $(X_1, Y_1)$ to $(X_4, Y_4)$]

24 The Ordinary Least Squares (OLS) Model. The dependent variable is modelled as a linear function of predictor or independent variables. The dependent variable is continuous, e.g. blood pressure, cholesterol level or weight.

25 OLS
- What factors cause variation in an individual’s diastolic blood pressure?
- What variables explain movement in men’s cholesterol level?
- What variables are predictive of high birth weight in a population of mothers from Birmingham?
- The dependent variable can take on any numerical value within the limits of the range of that variable.

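26 OLS. The OLS method seeks to minimise the residual sum of squares: $\min_{\hat{\beta}_1, \hat{\beta}_2} \sum_i \hat{u}_i^2 = \sum_i (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2$, where $\hat{u}_i$ is the residual for observation $i$.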
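For a single explanatory variable the minimisation has a well-known closed-form solution, sketched below in plain Python. The data and the function names `ols_fit` and `residual_sum_of_squares` are hypothetical and chosen only for illustration.

```python
def ols_fit(x, y):
    """Closed-form OLS estimates (intercept, slope) for a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared vertical distances between observations and the line."""
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical sample (not from the presentation): age and mean DIBP.
age = [65, 70, 75, 80, 85]
dibp = [80, 83, 84, 88, 90]

b1_hat, b2_hat = ols_fit(age, dibp)
rss_at_ols = residual_sum_of_squares(age, dibp, b1_hat, b2_hat)

# Any other line gives a larger residual sum of squares.
rss_other_line = residual_sum_of_squares(age, dibp, b1_hat, b2_hat + 0.05)
```

Comparing `rss_at_ols` with `rss_other_line` shows what "least squares" means in practice: moving the slope away from the OLS estimate can only increase the sum of squared residuals.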

27 Minimising the residual… [Figure: plot of $Y$ against $X$ at $X_1$ to $X_4$, with braces marking the residuals]

28 Describing the overall fit of the estimated model
- The coefficient of determination, or $R^2$, is a measure of the ‘goodness of fit’ of a regression, i.e. the proportion of the variation in $Y_i$ which is explained by the regression
- $0 \le R^2 \le 1$
- But focusing solely on maximising $R^2$ is not a good idea! (Other measures will be considered this afternoon…)
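A short sketch of how $R^2$ can be computed from observed and fitted values, using the identity $R^2 = 1 - \text{RSS}/\text{TSS}$ that holds for an OLS regression with an intercept. The observed values and fitted values below are hypothetical, and `r_squared` is an illustrative name.

```python
def r_squared(y, y_hat):
    """Share of the variation in y explained by the fitted values y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    return 1 - rss / tss

# Hypothetical observed DIBP values and OLS fitted values for ages 65 to 85.
dibp = [80, 83, 84, 88, 90]
fitted = [80.0, 82.5, 85.0, 87.5, 90.0]

r2 = r_squared(dibp, fitted)
```

A value this close to 1 mostly reflects how the toy data were chosen; it says nothing about whether the model is correctly specified.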

29 Models for Categorical Dependent Variables For use on dependent variables that are either dichotomous (individual has CVD or not), or polytomous (Low, Medium or High cholesterol level) which are quite common in Health-related datasets

30 Models for Categorical Dependent Variables. Focus: binary response variable. Independent variables are used to predict whether or not some event will occur. Based on certain described characteristics:
- Will an individual get cancer or not?
- Will a patient survive or die?
- Will an individual develop CVD or not?

31 Models for Categorical Dependent Variables
- Coding of outcomes: usually coded 1 if the attribute of interest is present and 0 otherwise.
- Approach to be used: logistic regression, which is best for a dichotomous dependent variable with continuous and categorical independent variables.
- Other commonly used approaches: Probit & Nested Logit

32 Major difference from Ordinary Linear Regression
- Uses a link function for the relationship between the dependent and independent variables
- Substitutes maximum likelihood estimation (MLE) of a link function of the dependent variable for regression's use of least squares estimation of the dependent variable itself
- MLE: a method of estimating unknown parameters in such a way that the probability of observing the given values of the dependent variable is as high (or maximum) as possible
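The idea behind MLE can be seen in the simplest possible case: a binary outcome with a single unknown probability and no explanatory variables. The sketch below searches a grid of candidate probabilities for the one that makes the observed data most likely. The ten outcomes and the grid are assumptions made for the example; real logistic regression software maximises the same kind of likelihood over the regression coefficients using iterative methods rather than a grid.

```python
import math

def log_likelihood(p, outcomes):
    """Log of the probability of observing the 0/1 outcomes if P(Y = 1) = p."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y in outcomes)

# Hypothetical sample: 7 of 10 individuals have the attribute of interest.
outcomes = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Candidate probabilities 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]
p_mle = max(candidates, key=lambda p: log_likelihood(p, outcomes))
```

The maximising value coincides with the sample proportion of ones, which is the maximum likelihood estimate for this model.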

33 Issues to consider…
- Why are OLS models not suitable for dichotomous data?
- Logit transformation: the link function
- Marginal & conditional odds and probability

34 Suppose we want to model $Y_i = \beta_0 + \beta_1 X_1 + \varepsilon$, but $Y_i$ is dichotomous, where $\beta_0$ is the coefficient on the constant term, $\beta_1$ is the coefficient on the independent variable, $X_1$ is the independent variable (e.g. age), and $\varepsilon$ is the error term.

35 Let $Y_i = 1$ if the $i$th individual has CVD, and 0 otherwise. Let $Y_i$ also take the values 1 and 0 with probabilities $p_i$ and $1 - p_i$, respectively, i.e. $P(Y_i = 1) = P(\text{CVD} = 1) = p_i$ and $P(Y_i = 0) = P(\text{CVD} = 0) = 1 - p_i$.

36 Why not just use Simple Linear (OLS) regression? Consider a simple OLS regression model, $\text{CVD} = \beta_0 + \beta_1 \text{Age} + \varepsilon$. Assumptions: a) $\varepsilon \sim N(0, \sigma^2)$; b) $\text{var}(\varepsilon)$ is constant, i.e. homoscedasticity. Binary outcome variables violate these assumptions…

37 Why not just use Simple Linear (OLS) regression?
- CVD is binary: it takes on only two values. Consequently, for any given value of Age, $\varepsilon$ can also take only two values, and therefore the ‘normality of residuals’ assumption is violated.
- The error terms are heteroscedastic, so the regression assumption that the variance of the error term is constant is violated.
- The predicted probabilities can be greater than 1 or less than 0, which can be a problem if the predicted values are used in a subsequent analysis!
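The last point is easy to reproduce. The sketch below fits an OLS line to a binary CVD indicator and lists the fitted "probabilities" that fall outside the interval from 0 to 1. The ten observations are invented for the example and `ols_fit` is an illustrative helper, not something defined in the presentation.

```python
def ols_fit(x, y):
    """Closed-form OLS estimates (intercept, slope) for a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: age and a 0/1 indicator for CVD.
ages = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
cvd = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = ols_fit(ages, cvd)
predicted = [b0 + b1 * age for age in ages]

# Fitted values that cannot be interpreted as probabilities.
out_of_range = [(age, round(p, 3)) for age, p in zip(ages, predicted) if p < 0 or p > 1]
```

Even within the observed age range, the youngest individual receives a negative fitted value and the oldest a fitted value above 1.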

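38 Logit transformation
1. Move from probabilities to odds: $\dfrac{p_i}{1 - p_i} = e^{\beta_0 + \beta_1 X_1}$
2. Take logs of both sides, to get the log-odds or logit: $\ln\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_1$, or equivalently, $p_i = \dfrac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}$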

39 The Logit transformation removes the floor restriction

40 Logistic Regression Output Part of this output is in form of Odds, Odds ratios and probability. An understanding of these concepts (both marginal and conditional) is therefore cardinal to interpreting Logistic Regression output Key Question to be explored: What factors determine the probability that an individual will or will not develop CVD?

41 Marginal & Conditional odds
- The odds of having CVD are 115/85 = 1.353. This is the marginal or unconditional odds of having CVD.
- The conditional odds of having CVD, given the category “smokers”, are 75:25, or 3. A smoker is 3.0 times as likely to have CVD as not to have it.
- The conditional odds of having CVD, given the category “non-smokers”, are 40:60, or 0.67. A non-smoker is 0.67 times as likely to have CVD as not to have it.

42 Probability The probability of having CVD is 115/200 = 0.575 The probability of having CVD given that one is a smoker is 75/100 = 0.75 The probability of having CVD given that one is a non-smoker is 40/100 = 0.40
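The odds and probabilities quoted above all follow from four counts: 75 smokers with CVD, 25 smokers without, 40 non-smokers with CVD and 60 without. The sketch below recomputes them; the dictionary layout and function names are illustrative.

```python
# Counts implied by the figures above (200 individuals in total).
counts = {
    "smoker": {"cvd": 75, "no_cvd": 25},
    "non_smoker": {"cvd": 40, "no_cvd": 60},
}

def odds(cases, non_cases):
    """Odds: cases relative to non-cases."""
    return cases / non_cases

def probability(cases, non_cases):
    """Probability: cases relative to everyone."""
    return cases / (cases + non_cases)

total_cvd = sum(group["cvd"] for group in counts.values())        # 115
total_no_cvd = sum(group["no_cvd"] for group in counts.values())  # 85

marginal_odds = odds(total_cvd, total_no_cvd)
marginal_probability = probability(total_cvd, total_no_cvd)

conditional_odds = {g: odds(c["cvd"], c["no_cvd"]) for g, c in counts.items()}
conditional_probability = {g: probability(c["cvd"], c["no_cvd"]) for g, c in counts.items()}
```

Odds and probability carry the same information: for smokers, a probability of 0.75 corresponds to odds of 0.75/0.25 = 3.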

43 Odds Ratio
- The odds ratio of smokers (numerator) to non-smokers (denominator) for CVD is 3/0.667 = 4.5 (dividing by the rounded odds of 0.67 instead gives 4.478). This means that the odds of smokers having CVD are 4.5 times as high as those of non-smokers having CVD.
- The odds ratio is the cross-product ratio, i.e. $(75 \times 60)/(25 \times 40) = 4.5$.
- When one moves from being a non-smoker to a smoker, the odds of having CVD increase by 350% (i.e. from 0.667 for non-smokers to 3 for smokers).

44 Alternative interpretation of Odds Ratio
- Smokers are 4.5 times more likely to have CVD than non-smokers.
- The risk of having CVD is 4.5 times greater for smokers than non-smokers.
- The odds of CVD for smokers are 350% higher than the odds of CVD for non-smokers (4.5 - 1.00).
- The predicted odds for smokers are 4.5 times the odds for non-smokers.
- A one unit change in the independent variable Smokers (non-smokers to smokers) increases the odds of having CVD by a factor of 4.5.
- Strictly, the first two phrasings hold for odds rather than probabilities: the ratio of the probabilities of CVD is 0.75/0.40 = 1.875.
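The link between the odds ratio and logistic regression output can be checked numerically. The sketch below computes the cross-product ratio from the four counts (75, 25, 40, 60) and then fits a logistic regression of CVD on a smoker indicator by Newton-Raphson maximum likelihood. Computed from the unrounded counts the ratio is 4.5; dividing 3 by the rounded odds 0.67 gives 4.478. The function name `fit_logit`, the grouped-data layout and the fixed number of iterations are assumptions made for this example, not the routine used by any particular statistics package.

```python
import math

# (smoker indicator, CVD cases, group size), from the counts quoted above.
groups = [(1, 75, 100), (0, 40, 100)]

# Cross-product ratio from the 2x2 table.
odds_ratio = (75 * 60) / (25 * 40)

def fit_logit(groups, iterations=25):
    """Newton-Raphson MLE for logit(p) = b0 + b1 * x on grouped binary data."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for x, cases, n in groups:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            residual = cases - n * p
            weight = n * p * (1 - p)
            g0 += residual
            g1 += residual * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0_hat, b1_hat = fit_logit(groups)
odds_ratio_from_model = math.exp(b1_hat)   # exponentiated coefficient on smoker
baseline_odds = math.exp(b0_hat)           # odds of CVD for non-smokers
```

Exponentiating the coefficient on the smoker indicator reproduces the cross-product ratio, and exponentiating the intercept reproduces the odds for non-smokers. This is why logistic regression output is usually reported as odds ratios.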

45 References
- Altman, D.G. 1991. Practical Statistics for Medical Research (London: Chapman & Hall/CRC)
- Gujarati, D.N. 1995. Basic Econometrics (New York: McGraw-Hill, Inc)
- Johnston, J. and J. DiNardo. 1997. Econometric Methods (London: The McGraw-Hill Companies, Inc)
- Long, J.S. 1997. Regression Models for Categorical and Limited Dependent Variables. A Volume in the Sage Series for Advanced Quantitative Techniques (Thousand Oaks, CA: Sage Publications)
- Wang, MinQi, James M. Eddy, Eugene C. Fitzhugh. 1995. "Application of Odds Ratio and Logistic Models in Epidemiology and Health Research," Health Values 19: 59-62.

46 References
- Nathwani et al. 2004. “An economic evaluation of a European cohort from a multinational trial of linezolid versus teicoplanin in serious Gram-positive bacterial infections: the importance of treatment setting in evaluating treatment effects” International Journal of Antimicrobial Agents 23: 315–324
- Manca A, Hawkins N, Sculpher M. 2005. “Estimating mean QALYs in trial-based cost-effectiveness analysis: the importance of controlling for baseline utility” Health Economics 14: 487-496
- Hoch et al. 2002. “Something old, something new, something blue: a framework for the marriage of health econometrics and cost-effectiveness analysis” Health Econ 11: 415–430
- Gray et al. 2006. “Estimating the association between SF-12 responses and EQ-5D utility values by response mapping” Med Decis Making 26, 1: 18-29

47 References
- Kaambwa et al. 2011. “Mapping utility scores from the Barthel index” Eur. Journal of Health Economics, DOI: 10.1007/s10198-011-0364-5
- Sengupta et al. 2004. “Mapping the SF-12 to the HUI3 and VAS in a managed care population” Med Care 42, 9: 927-937
- Smith et al. 2007. “Predicting Costs Of Care In Chronic Kidney Disease: The Role Of Comorbid Conditions” The Internet Journal of Nephrology 4, 1
- Bonizzato et al. 2000. “Community-based mental health care: to what extent are service costs associated with clinical, social and service history variables?” Psychological Medicine 30: 1205-1215
- Baumeister et al. 2009. “Predictive modeling of health care costs: do cardiovascular risk markers improve prediction?” European Journal of Cardiovascular Prevention & Rehabilitation

48 References
- Hoch et al. 2006. “Using the net benefit regression framework to construct cost-effectiveness acceptability curves: an example using data from a trial of external loop recorders versus Holter monitoring for ambulatory monitoring of "community acquired" syncope” BMC Health Services Research 6: 68
- Billingham LJ et al. 2002. “Patterns, costs and cost-effectiveness of care in a trial of chemotherapy for advanced non-small cell lung cancer: evidence from a randomised trial” Lung Cancer 37: 219-225
- Engels, J.M. & Diehr, P. 2003. “Imputation of missing longitudinal data: a comparison of methods” Journal of Clinical Epidemiology 56: 968–976
- Blazer et al. 1995. “Health Services Access and Use among Older Adults in North Carolina: Urban vs Rural Residents” American Journal of Public Health 85, 10: 1384-1390

49 References
- Barber, J. & Thompson, S. 2004. “Multiple regression of cost data: use of generalised linear models” J Health Serv Res Policy 9: 197-204
- Kaambwa, B., Bryan, S., Barton, P., Parker, H., Martin, G., Hewitt, G., Parker, S., & Wilson, A. 2008. “Costs and health outcomes of intermediate care: results from five UK case study sites” Health Soc. Care Community 16: 573-581
- Raine et al. 2010. “Social variations in access to hospital care for patients with colorectal, breast, and lung cancer between 1999 and 2006: retrospective analysis of hospital episode statistics” BMJ 340: b5479

50 Exercises
- OLS regression
- Logistic regression
