Advanced Statistics for Interventional Cardiologists
What you will learn (1st and 2nd day)
- Introduction
- Basics of multivariable statistical modeling
- Advanced linear regression methods
- Hands-on session: linear regression
- Bayesian methods
- Logistic regression and generalized linear model
- Resampling methods
- Meta-analysis
- Hands-on session: logistic regression and meta-analysis
- Multifactor analysis of variance
- Cox proportional hazards analysis
- Hands-on session: Cox proportional hazards analysis
- Propensity analysis
- Most popular statistical packages
- Conclusions and take-home messages
What you will learn: Logistic regression and the Generalized Linear Model
- Logistic regression model
- Parameter estimates
- Odds ratio interpretation
- Parameter significance testing
- Model checking
- Qualitative predictor
- Multiple logistic regression
- Examples
- Generalized Linear Model
Logistic regression How can I predict the impact of left ventricular ejection fraction (LVEF) on the 12-month risk of ARC-defined stent thrombosis?
Logistic regression How can I predict the impact of left ventricular ejection fraction (LVEF) on the 12-month risk of ARC-defined stent thrombosis? In other words, how can I predict the impact of a given (independent) variable on a dichotomous, binary (dependent) variable?
Logistic regression Simple example
The variable “Rained” has two categories: “Rainy” (if precipitation > 0.02) and “Dry” (otherwise). Out of the 30 days in April, 9 were rainy. Therefore, with no other information, you would predict a 30% chance of rain for every day. Let’s investigate whether the quantitative variables temperature and barometric pressure can yield a more informative prediction of the probability of rain using logistic regression.
Logistic regression The goal of logistic regression is to model the probability of a certain response (e.g., “dry”) with explanatory variables (e.g., “temperature”, “pressure”). For a binary response, p denotes the probability of the first response level (e.g., “dry”); then (1 − p) is the probability of the second response level (“rainy”). We could simply model the probability by means of ordinary regression: P(X) = β0 + β1X (the linear probability model). This model has a major structural defect: probabilities fall between 0 and 1, whereas linear functions take values over the entire real line. So we need a more suitable model.
Logistic regression We model ln(p/(1−p)) instead of p itself, and the linear model is written: logit(p) = ln(p/(1−p)) = ln(p) − ln(1−p) = β0 + β1X. Logistic regression is based on the logit, which maps the probability of the dichotomous dependent variable onto a continuous, unbounded scale.
Logistic regression An alternative formula gives the probability of the first response directly: p = exp(β0 + β1x) / (1 + exp(β0 + β1x)). Plotted against x, this is an S-shaped (sigmoid) curve.
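A minimal numeric sketch of these two formulas in Python; the coefficients b0 and b1 are made-up values for illustration, not estimates fitted to the weather data:

```python
import numpy as np

def logit(p):
    """Log-odds: maps a probability in (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """Logistic (inverse logit): maps any real value back into (0, 1)."""
    return 1 / (1 + np.exp(-x))

# Hypothetical coefficients, for illustration only
b0, b1 = -4.0, 0.05
x = np.array([35.0, 55.0, 75.0])
print(inv_logit(b0 + b1 * x))  # probabilities follow an S-shaped curve in x
```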
Logistic regression Quantitative predictors
The cumulative probability plot on the left shows that the relationship with temperature is very weak: as the temperature changes from 35 to 75, the fitted probability of dry weather moves only slightly from 0.73. A much stronger relationship with pressure is shown. When pressure is 29 inches, the fitted probability of rain is near 100% (probability of “Dry” near 0 at the left of the graph); at 29.8 inches, the probability of rain drops to near zero (near 1 for “Dry”).
Logistic regression Parameter estimates
The parameter β1 determines the rate of increase of the S-shaped curve, and its sign indicates whether the curve ascends or descends.
Logistic regression Odds ratio interpretation
A very popular interpretation of the logistic regression model uses the odds and the odds ratio. For the logit model, the odds of the response (e.g., “dry”) are p/(1−p) = exp(β0 + β1x). This exponential relationship provides an interpretation for β1: the odds increase multiplicatively by exp(β1) for every unit increase in x. That is, the odds at level x+1 equal the odds at x multiplied by exp(β1). When β1 = 0, exp(β1) = 1 and the odds do not change as x changes. The odds ratio associated with a unit change of x is therefore exp(β1).
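A small worked example of this multiplicative interpretation; the slope b1 = 0.7 and the starting odds are hypothetical numbers, not estimates from the weather data:

```python
import numpy as np

b1 = 0.7                     # hypothetical slope on the logit scale
odds_ratio = np.exp(b1)      # multiplicative change in the odds per unit of x

odds_at_x = 0.5              # assumed odds at some level x
odds_at_x_plus_1 = odds_at_x * odds_ratio
print(odds_ratio)            # about 2.01: the odds roughly double per unit of x
print(odds_at_x_plus_1)      # about 1.01
```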
Logistic regression Odds ratio interpretation
Odds ratio interpretation of the β1 parameter. Odds ratio associated with a one-unit increase in temperature: exp(−0.01) ≈ 0.99 (each one-degree increase in temperature results in a 1% decrease in the odds of a “dry” day). Odds ratio associated with a one-unit increase in pressure: exp(13.82) ≈ 10^6 (a huge increase in the odds of a “dry” day per inch of pressure).
Logistic regression Significance testing
Is the effect of X on the binary response significant? Is the probability of the response independent of X? This leads to the hypothesis test H0: β1 = 0 versus H1: β1 ≠ 0. The Wald statistic, (β̂1/SE(β̂1))², has a chi-squared distribution with df = 1 for large samples. It is also possible to construct confidence intervals to evaluate the significance of the effects. The likelihood-ratio test compares the maximized log-likelihood for the simple model with β1 = 0 (L0) to the maximized log-likelihood for the full model with unrestricted β1 (L1); the test statistic −2(L0 − L1) has a chi-squared distribution with df = 1. The likelihood-ratio test is more reliable for small sample sizes.
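A sketch of both tests in Python; the estimate, its standard error, and the two log-likelihoods are placeholder numbers standing in for the output of a fitted model:

```python
from scipy.stats import chi2

beta_hat, se_beta = 1.41, 0.29   # placeholder estimate and standard error
L0, L1 = -147.6, -135.2          # placeholder log-likelihoods: null vs. full model

wald = (beta_hat / se_beta) ** 2     # ~ chi-squared with df = 1 under H0
lrt = -2 * (L0 - L1)                 # likelihood-ratio statistic, df = 1

print("Wald p-value:", chi2.sf(wald, df=1))
print("LRT p-value:", chi2.sf(lrt, df=1))
```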
Logistic regression Significance testing
What are your conclusions about the significance of the effect of temperature and pressure on the probability of a “dry” day? Justify them with the Wald statistic and with the whole-model likelihood-ratio test, and compare with the ANOVA table used for continuous responses.
Logistic Regression Model checking
Let’s find out if a particular model provides a good fit to the observed outcomes. Fitted logistic regression models provide predicted probabilities that Y=1. At each setting of the explanatory variables, one can multiply this predicted probability by the number of subjects to obtain a fitted count. The test of the null hypothesis that the model holds compares the fitted and observed counts using a Pearson χ2 or likelihood-ratio G2 test statistic.
Logistic Regression Model checking
For a fixed number of settings, when most fitted counts equal at least 5, χ2 and G2 have approximate chi-squared distributions with df equal to the number of settings of explanatory variables minus the number of model parameters. Large χ2 and G2 provide evidence of lack of fit and the p-value is the right-tailed probability above the observed value. When the fit is poor, residuals and other diagnostic measures are used to describe the influence of individual observations on the model fit.
Logistic Regression Model checking
So for grouped observed counts and fitted values we can calculate lack-of-fit statistics with the following formulas: Pearson X² = Σ (observed − fitted)² / fitted and likelihood-ratio G² = 2 Σ observed · ln(observed / fitted), where the sums run over all cells (both response levels at every setting of the explanatory variables).
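A sketch of both statistics for hypothetical grouped data: four settings of one explanatory variable with assumed fitted probabilities, summing over both outcome cells at every setting:

```python
import numpy as np
from scipy.stats import chi2

n = np.array([20, 25, 22, 30])              # trials per setting
observed = np.array([3, 8, 12, 24])         # observed successes
p_hat = np.array([0.15, 0.33, 0.55, 0.80])  # assumed model-fitted probabilities

fit_yes, fit_no = n * p_hat, n * (1 - p_hat)
obs_no = n - observed

X2 = np.sum((observed - fit_yes) ** 2 / fit_yes
            + (obs_no - fit_no) ** 2 / fit_no)
G2 = 2 * np.sum(observed * np.log(observed / fit_yes)
                + obs_no * np.log(obs_no / fit_no))

df = 4 - 2  # settings minus model parameters (intercept + slope)
print(X2, chi2.sf(X2, df))
print(G2, chi2.sf(G2, df))
```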
Logistic Regression Model checking
We can also detect lack of fit by using the likelihood-ratio test to compare a working model with more complex ones. This approach is more useful from a scientific perspective: a large goodness-of-fit statistic merely indicates that there is some lack of fit, whereas comparing a model with a more complex one indicates whether lack of fit of a particular type exists.
Logistic Regression Model checking
The deviance is the likelihood-ratio statistic for comparing model M with the saturated model (the most complex model, with a separate parameter at each explanatory setting, i.e., a perfect fit): Deviance = −2(LM − LS). The deviance, which has the same form as the G² likelihood-ratio goodness-of-fit statistic, follows a chi-squared distribution and is used to test the model fit. In testing whether M fits, we test whether all parameters that are in the saturated model but not in M equal zero. The difference in the deviances of two nested models M0 and M1 can likewise be used to compare their fit; this statistic is large when M0 fits poorly compared with M1.
Logistic regression Model checking
R-square, or uncertainty coefficient U: the ratio of the reduction in the negative log-likelihood of the working model to the negative log-likelihood of the reduced model (the proportion of uncertainty explained by the model). The lack-of-fit test is the complement of the whole-model test: where the whole-model test asks whether anything in your model is significant, the lack-of-fit test asks whether anything you left out of your model is significant. Lack of fit compares the fitted model with the saturated model using the same terms. If the lack-of-fit test is significant, add more effects to the model, for example higher orders of terms already in it.
Logistic Regression Model checking
What are your conclusions about the goodness of fit of the model?
Logistic Regression Residuals
Goodness-of-fit statistics such as χ2 and G2 are indicators of overall quality of fit. Additional diagnostics are necessary to describe the nature of any lack of fit. Pearson residuals comparing observed and fitted counts are useful for this purpose. Each residual divides the difference between an observed count and its fitted value by the estimated binomial standard deviation of the observed count.
Logistic Regression Residuals
The Pearson residual at setting i is e_i = (y_i − n_i·π̂_i) / √(n_i·π̂_i(1 − π̂_i)), and the Pearson statistic for testing the model fit is X² = Σ e_i². The Pearson residual has an approximate normal distribution around zero when the binomial index n_i is large, so Pearson residuals are treated like standard normal deviates, with absolute values larger than 2 indicating possible lack of fit. Residuals have limited meaning when the fitted values are very small.
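The same hypothetical grouped data as in the earlier sketch can illustrate the residuals; values beyond about ±2 would flag cells that fit poorly:

```python
import numpy as np

n = np.array([20, 25, 22, 30])
observed = np.array([3, 8, 12, 24])
p_hat = np.array([0.15, 0.33, 0.55, 0.80])  # assumed fitted probabilities

# (observed count - fitted count) / binomial SD of the observed count
pearson_res = (observed - n * p_hat) / np.sqrt(n * p_hat * (1 - p_hat))
print(pearson_res)
```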
Logistic Regression Diagnostic Measures of Influence
An influential point is an observation that substantially changes the estimated parameters when it is removed from the sample. The influence measures are algebraically related to an observation’s leverage h: the greater an observation’s leverage, the greater its potential influence. The adjusted Pearson residual divides the ordinary Pearson residual by √(1 − h_i), i.e., e_i/√(1 − h_i). Formulas for the other measures are rather complex and not reproduced here.
What you will learn: Logistic regression and the Generalized Linear Model
- Logistic regression model
- Parameter estimates
- Odds ratio interpretation
- Parameter significance testing
- Model checking
- Qualitative predictor
- Multiple logistic regression
- Examples
- Generalized Linear Model
Logistic Regression Qualitative Predictors
Like ordinary regression, logistic regression extends to models incorporating multiple explanatory variables, some of which may be qualitative. We use dummy variables to include qualitative predictors, called factors, in the model. Let us look at a simple example using the fitness data: we evaluate the effect of gender on the binary response variable Oxy_H_L, which has two levels, High (>= 50) and Low (< 50).
Logistic Regression Qualitative Predictors – Fitness Example
To evaluate the relation between Sex and Oxy_H_L we can use cross-table analysis methods. Females have 43.8% High and males only 6.67% High: the risk for a female of having high oxygen uptake is 6.56 times that of a male (relative risk), suggesting an effect of sex. We can also calculate the ratio of the odds: dividing the odds for females by the odds for males gives an odds ratio of 10.88. The Pearson X² (df = 1) of 5.56 confirms the dependence (p = 0.0184).
Logistic Regression Qualitative Predictors
We can use the logits and the logistic regression model to evaluate the effect of sex on the binary response. Sex is incorporated in the model as a dummy variable (“F” = 1 and “M” = 0). Predicted logit(Oxy_H_L) = −2.638 + 2.387 × Sex. Interpretation: the odds for males are exp(−2.638) = 0.0715, and the odds ratio for females versus males is exp(2.387) = 10.88.
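A short check of these numbers in Python, using the coefficients read off the fitted model above:

```python
import numpy as np

b0, b1 = -2.638, 2.387           # intercept and Sex coefficient (F = 1, M = 0)

odds_m = np.exp(b0)              # 0.0715: odds of high uptake for males
odds_f = np.exp(b0 + b1)         # odds of high uptake for females
print(np.exp(b1))                # 10.88: odds ratio, females vs. males
print(odds_m / (1 + odds_m))     # about 0.067: matches 6.67% High for males
print(odds_f / (1 + odds_f))     # about 0.438: matches 43.8% High for females
```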
Multiple Logistic Regression
The logistic regression model, like ordinary regression models, generalizes to allow for several explanatory variables. The predictors can be quantitative, qualitative, or of both types. The model equation is: logit(π) = α + β1x1 + β2x2 + … + βkxk. βi refers to the effect of xi on the log odds that Y = 1, controlling for the other x’s. For instance, exp(βi) is the multiplicative effect on the odds of a one-unit increase in xi at fixed levels of the other x’s.
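A sketch of a multiple logistic fit in Python with statsmodels, on simulated data; the variable names and coefficients are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(60, 10, 200),
    "sex": rng.integers(0, 2, 200),   # dummy-coded qualitative predictor
    "trt": rng.integers(0, 2, 200),
})
true_logit = -3 + 0.03 * df["age"] + 1.2 * df["trt"]   # assumed true model
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(df[["age", "sex", "trt"]])
fit = sm.Logit(df["y"], X).fit(disp=0)
print(np.exp(fit.params))   # exp(beta_i): adjusted odds ratios
```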
Multiple Logistic Regression example: fictitious trial
- 12 months duration, monthly visits; baseline at month 0, final evaluation at month 12
- Placebo versus drug
- Primary objective: to show that there are 20% more responders on drug compared to placebo after 12 months of treatment
- Primary efficacy variable: Disease Activity Scale (DAS), range 0–140; DAS at study entry at least 50
- Responder defined as a 20-point decrease from baseline DAS to month 12
Logistic regression fictitious trial: key variables
- trt: randomized treatment (0 = placebo, 1 = drug)
- sex: patient’s sex (0 = female, 1 = male)
- age: patient’s age at baseline (years)
- grade: disease grade (1 = good to 4 = very bad)
- duration: disease duration at baseline (years)
- surgery: prior surgery for disease
- DAS.bl: DAS at baseline
- DAS.12: DAS at month 12
- DAS.12.cfb: change from baseline in DAS at month 12
- DAS.wd: DAS at withdrawal
- DAS.wd.cfb: change from baseline in DAS at withdrawal
- time.wd: time to withdrawal (days)
- res: responder at month 12 (0 = No, 1 = Yes)
- res20.wd: responder at withdrawal (0 = No, 1 = Yes)
- time.res: time to first response (days)
Logistic regression fictitious trial: primary efficacy analysis
Goal: to show that there are 20% more responders on drug compared to placebo after 12 months of treatment, with a responder defined as a 20-point decrease from baseline DAS to month 12. Possible analyses:
- Contingency table analysis (X² test)
- Logistic regression adjusting for prognostic factors
Logistic regression fictitious trial: contingency table
Responder?   Placebo      Drug         Total
No           83 (74.8%)   45 (42.1%)   128 (58.7%)
Yes          28 (25.2%)   62 (57.9%)   90 (41.3%)
Total        111 (100%)   107 (100%)   218 (100%)
X² test: p < 0.0001, so treatment and response are dependent.
Logistic regression fictitious trial: odds ratio
The odds of being a responder on drug are 62/45 (1.378); the odds of being a responder on placebo are 28/83 (0.337). The odds ratio is 1.378 divided by 0.337, which equals 4.08: patients are about four times more likely to respond on drug than on placebo. The 95% confidence interval for the odds ratio is (2.319, 7.342).
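These numbers can be reproduced directly from the 2x2 table; the sketch below uses the standard (Woolf) standard error of the log odds ratio, so its confidence interval comes out close to, but not exactly equal to, the model-based interval quoted above:

```python
import numpy as np
from scipy.stats import norm

a, b = 62, 45    # drug: responders, non-responders
c, d = 28, 83    # placebo: responders, non-responders

or_hat = (a * d) / (b * c)                   # 4.08
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(odds ratio)
z = norm.ppf(0.975)
ci = np.exp(np.log(or_hat) + np.array([-z, z]) * se_log_or)
print(or_hat, ci)    # roughly (2.30, 7.26)
```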
Logistic regression Model 1: logit(res20.12) = trt
Parameter   Estimate   Odds ratio   95% conf. int.
Intercept   −1.09      0.337        (0.216, 0.511)
trt         1.41       4.08         (2.318, 7.342)
The odds ratio is the exponential of the estimate; the results are identical to the previous slide.
Logistic regression Model 2: logit(res20.12) = trt + duration
Parameter   95% conf. int. for odds ratio
Intercept   (0.200, 1.211)
trt         (2.317, 7.361)
duration    (0.708, 1.123)
Adjustment for duration does not seem to have much influence on the treatment effect. The odds ratio for duration is close to 1 and its confidence interval includes 1, so duration has no real influence on the response.
Logistic regression Model 3: logit(res20.12) = trt + duration + age + sex
Parameter   95% conf. int. for odds ratio
Intercept   (0.063, 3.469)
trt         (2.321, 7.405)
duration    (0.702, 1.119)
age         (0.961, 1.039)
sex         (0.694, 2.223)
Duration, age, and sex do not seem to have an influence on the treatment effect.
Logistic regression Model 4: logit(res20.12) = trt + surgery
Parameter   95% conf. int. for odds ratio
Intercept   (0.740, 2.308)
trt         (1.925, 7.849)
surgery     (0.033, 1.137)
Surgery does seem to have an influence on the treatment effect: we can conclude that some of the difference in response at month 12 can be attributed to prior surgery.
Multiple logistic regression Model comparison
One can use the likelihood-ratio method to test hypotheses about parameters in logistic regression models: compare the maximized log-likelihood L1 for the full model with the maximized log-likelihood L0 for the simpler model in which those parameters are set to 0, using the test statistic −2(L0 − L1). More generally, one can compare maximized log-likelihoods for any pair of nested models and select the most parsimonious adequate model.
Multiple logistic regression Backward elimination
Backward elimination of predictors, starting with a complex model and successively removing terms, is often used to find a good model. At each stage, we eliminate the term in the model that has the largest p-value when we test that its parameters equal zero, testing the highest-order terms for each variable first. We do not remove a main-effect term if the model contains higher-order interactions involving it (a simplified sketch follows).
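The sketch below, in Python with statsmodels, drops the predictor with the largest Wald p-value at each step; it deliberately ignores the interaction/main-effect hierarchy described above, so it is an illustration rather than the full procedure:

```python
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    """Repeatedly refit, dropping the least significant predictor,
    until every remaining term has p <= alpha (or one term is left)."""
    cols = list(X.columns)
    while True:
        fit = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha or len(cols) == 1:
            return fit, cols
        cols.remove(worst)
```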
What you will learn: Logistic regression and the Generalized Linear Model
- Logistic regression model
- Parameter estimates
- Odds ratio interpretation
- Parameter significance testing
- Model checking
- Qualitative predictor
- Multiple logistic regression
- Examples
- Generalized Linear Model
Logistic regression Sangiorgi et al, AHJ 2008
Other examples: Fajadet et al, Circulation 2006
Other examples: Moses et al, NEJM 2003
Other examples: Corbett et al, EHJ 2006
What you will learn: Logistic regression and the Generalized Linear Model
- Logistic regression model
- Parameter estimates
- Odds ratio interpretation
- Parameter significance testing
- Model checking
- Qualitative predictor
- Multiple logistic regression
- Examples
- Generalized Linear Model
Generalized Linear Models
The generalized linear model (GLM) is a flexible generalization of ordinary regression. The GLM is a broad class of models that includes ordinary regression and ANOVA models for continuous response variables as well as models for categorical responses. Logistic regression is one type of GLM.
Generalized Linear Models
All generalized linear models have three components:
- Random component: identifies the response variable and assumes a probability distribution for it.
- Systematic component: specifies the explanatory variables used as predictors in the model (the linear predictor).
- Link: describes the functional relationship between the systematic component and the expected value (mean) of the random component.
The GLM relates a function of that mean to the explanatory variables through a prediction equation of linear form: g(µ) = α + β1x1 + … + βkxk.
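A minimal sketch in Python with statsmodels making the three components explicit; the data are simulated:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
p = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))    # assumed true probabilities
y = rng.binomial(1, p)                    # random component: binary response

X = sm.add_constant(x)                    # systematic component: a + b*x
model = sm.GLM(y, X, family=sm.families.Binomial())  # Binomial family; canonical link = logit
print(model.fit().summary())
```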
Generalized Linear Models
Through differing link functions, the GLM corresponds to other well known models:
Distribution       Link name         Link function g(µ)
Normal             Identity          µ
Exponential        Inverse           1/µ
Gamma              Inverse           1/µ
Inverse Gaussian   Inverse squared   1/µ²
Poisson            Log               ln(µ)
Binomial           Logit             ln(µ/(1−µ))
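The same fitting machinery handles the other rows of the table by swapping the family (and hence the canonical link); a brief simulated sketch:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
X = sm.add_constant(x)

# Poisson row: count response with log link (canonical)
y_counts = rng.poisson(np.exp(0.2 + 0.6 * x))
pois_fit = sm.GLM(y_counts, X, family=sm.families.Poisson()).fit()

# Normal row: continuous response with identity link; equivalent to OLS
y_cont = 1.0 + 2.0 * x + rng.normal(size=200)
gauss_fit = sm.GLM(y_cont, X, family=sm.families.Gaussian()).fit()

print(pois_fit.params)
print(gauss_fit.params)
```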
Generalized linear Models Normal GLM
Ordinary regression and ANOVA models are special cases of GLMs, assuming a normal distribution for the random component and modelling the mean directly. A GLM generalizes ordinary regression in two ways: it allows the random component to have a distribution other than the normal, and it allows modelling some function of the mean. A traditional way of analyzing non-normal data transforms the response so that it becomes normal with constant variance; with GLMs such transformation is unnecessary for normal-theory methods to apply, because the GLM fitting process uses maximum likelihood methods.
Logistic Regression with SPSS
Questions?
Take-home messages
- Logistic regression is used to analyze the effect of explanatory variables on the probability of an outcome of a binary response variable.
- Logistic regression is based on the logit, ln(p/(1−p)), which transforms the probability of the dichotomous dependent variable into a continuous quantity.
- The exponential of a parameter estimate, exp(β), is the odds ratio associated with a unit increase of the independent X variable.
- The likelihood-ratio statistic is used to compare models.
- Examine Pearson residuals to evaluate lack of fit.
- The Generalized Linear Model is a flexible extension of the regression and ANOVA models.
And now a brief break…
For further slides on these topics please feel free to visit the metcardio.org website.