Modelling continuous variables with a spike at zero – on issues of a fractional polynomial based procedure Willi Sauerbrei Institut of Medical Biometry.

Presentation transcript:

Modelling continuous variables with a spike at zero – on issues of a fractional polynomial based procedure
Willi Sauerbrei, Institut of Medical Biometry and Informatics, University Medical Center Freiburg, Germany
Patrick Royston, MRC Clinical Trials Unit, London, UK

2 1. Motivation
Problem: A variable X has value 0 for a proportion of individuals (“spike at zero”) and a quantitative value for the others.
Examples: cigarette consumption, occupational exposure.
How to model this?
Setting here: case-control study.

3 Example: Distribution of smoking in a lung cancer case-control study.
Table columns: No. of cigarettes/day (0 = non-smokers); Controls (n, %); Cases (n, %).

4 Ad hoc solution: add a binary variable (smoker yes/no). Is this appropriate?

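5 2. Theoretical results
The odds ratio can be expressed as
$$\mathrm{OR}_{X=x \text{ vs } X=x_0} = \frac{f_1(x)\, f_0(x_0)}{f_0(x)\, f_1(x_0)},$$
where $f_1$ and $f_0$ are the probability density functions of X in cases and controls, respectively.
Simplest case: X is normally distributed with expectation $\mu_i$, where i = 0 for controls and i = 1 for cases, and equal variance $\sigma^2$. We get
$$\mathrm{OR}_{X=x \text{ vs } X=x_0} = \exp\bigl(\beta (x - x_0)\bigr) \quad \text{with} \quad \beta = \frac{\mu_1 - \mu_0}{\sigma^2}.$$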
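As a sketch of why the normal case gives a log-linear odds ratio, assume the odds ratio is the ratio of the case and control density ratios at x and x_0, and use the notation of the slide (normal densities with means $\mu_0$, $\mu_1$ and common variance $\sigma^2$). The steps below are a reconstruction under these assumptions, not a copy of the authors' derivation.

```latex
\begin{align*}
\ln\frac{f_1(x)}{f_0(x)}
  &= \frac{(x-\mu_0)^2 - (x-\mu_1)^2}{2\sigma^2}
   = \frac{(\mu_1-\mu_0)\,x}{\sigma^2} - \frac{\mu_1^2-\mu_0^2}{2\sigma^2},\\
\ln \mathrm{OR}_{X=x \text{ vs } X=x_0}
  &= \ln\frac{f_1(x)}{f_0(x)} - \ln\frac{f_1(x_0)}{f_0(x_0)}
   = \frac{\mu_1-\mu_0}{\sigma^2}\,(x-x_0).
\end{align*}
```

The constant term cancels in the difference, so only the slope $\beta = (\mu_1-\mu_0)/\sigma^2$ remains and a linear term in X is sufficient.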

6 2. Theoretical results
Next case (spike at zero):

7

8 2. Theoretical results
So we have shown theoretically that the above situation requires the binary indicator for the correct model.
Some other distributions also have simple solutions.
In reality, we rarely have simple distributions → procedures are more complicated.
New proposal: extension of the fractional polynomial procedure.
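One way to see why a binary indicator appears is the following sketch. It rests on assumptions that are not spelled out in the transcript: in group i (0 = controls, 1 = cases) a proportion $\pi_i$ has X = 0, and the positive values follow a density $g_i$, taken here to be the $N(\mu_i, \sigma^2)$ density. The symbols $\pi_i$, $g_i$ and $\beta_0$ are introduced for illustration only.

```latex
\begin{align*}
f_i(x) &= \begin{cases} \pi_i, & x = 0,\\ (1-\pi_i)\, g_i(x), & x > 0, \end{cases}
\qquad i = 0, 1,\\
\ln \mathrm{OR}_{X=x \text{ vs } X=0}
 &= \ln\frac{(1-\pi_1)\,\pi_0}{(1-\pi_0)\,\pi_1} + \ln\frac{g_1(x)}{g_0(x)}
  = \beta_0 + \beta x, \qquad x > 0,\\
\beta &= \frac{\mu_1-\mu_0}{\sigma^2}, \qquad
\beta_0 = \ln\frac{(1-\pi_1)\,\pi_0}{(1-\pi_0)\,\pi_1} - \frac{\mu_1^2-\mu_0^2}{2\sigma^2}.
\end{align*}
```

Under these assumptions the log odds ratio for exposed versus unexposed individuals has an intercept shift $\beta_0$, which is the coefficient of the binary indicator, plus a linear dose-response term. If the positive part were a truncated normal, the normalising constants would not depend on x and would only change $\beta_0$.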

9 3. Fractional polynomial models
Standard procedure (FP degree 2, FP2, for one covariate X)
A fractional polynomial of degree 2 for X with powers $p_1$, $p_2$ is given by $\mathrm{FP2}(X) = \beta_1 X^{p_1} + \beta_2 X^{p_2}$.
Powers $p_1$, $p_2$ are taken from a special set $\{-2, -1, -0.5, 0, 0.5, 1, 2, 3\}$ (0 = log).
Repeated powers ($p_1 = p_2$): $\beta_1 X^{p_1} + \beta_2 X^{p_1} \log X$.
This gives 36 FP2 models and 8 FP1 models.
Linear pre-transformation of X such that values are positive.
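The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function names `fp_term` and `fp2_terms` are hypothetical, and x is assumed to be positive already (that is, after any pre-transformation).

```python
import math
from itertools import combinations_with_replacement

# Power set from the slide; power 0 stands for log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """Return x**p for a positive scalar x, with p == 0 meaning log(x)."""
    return math.log(x) if p == 0 else x ** p


def fp2_terms(x, p1, p2):
    """Return the two basis terms of an FP2 with powers (p1, p2).

    For repeated powers (p1 == p2) the second term is x**p1 * log(x).
    """
    t1 = fp_term(x, p1)
    if p1 == p2:
        return t1, t1 * math.log(x)
    return t1, fp_term(x, p2)


# Enumerate the candidate models: 8 FP1 powers, 36 unordered FP2 pairs.
fp1_models = [(p,) for p in FP_POWERS]
fp2_models = list(combinations_with_replacement(FP_POWERS, 2))
```

The count of 36 FP2 models follows from 28 pairs of distinct powers plus 8 repeated-power pairs.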

10 3. Fractional polynomial models
Standard procedure for one variable: test the best FP2 against
1. Null model: not significant → no effect
2. Straight line: not significant → X linear
3. Best FP1: not significant → FP1; significant → FP2
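The three tests above can be sketched as a function of the four model deviances. This is a hedged illustration: the function names are hypothetical, the deviances in the usage lines are made up, and the degrees of freedom (4, 3 and 2) are the ones commonly used for this procedure when each estimated power counts as one parameter, which the slide itself does not state.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    t = x / 2.0
    # Start from df = 2 (even) or df = 1 (odd), then add the recursion terms.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))
    while a < df / 2.0 - 1e-9:
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q


def select_fp(dev_null, dev_linear, dev_fp1, dev_fp2, alpha=0.05):
    """Test the best FP2 against null, linear and best FP1, in that order."""
    if chi2_sf(dev_null - dev_fp2, 4) >= alpha:
        return "null"      # no effect of X
    if chi2_sf(dev_linear - dev_fp2, 3) >= alpha:
        return "linear"    # straight line is adequate
    if chi2_sf(dev_fp1 - dev_fp2, 2) >= alpha:
        return "FP1"
    return "FP2"


# Hypothetical deviances, for illustration only.
print(select_fp(120.0, 110.0, 99.0, 98.0))
```

With these made-up deviances the null and linear models are rejected, but the FP2 does not improve significantly on the best FP1, so the FP1 is selected.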

11 3. Fractional polynomial models
Extended procedure for a variable with a spike at zero
1. Generate a binary indicator z for exposure.
2. Fit the most complex model (binary indicator z + second-degree FP).
3. If significant, follow the same FP function selection procedure WITH z included (first stage).
4. Test both z and the remaining FP (or the linear component) for removal (second stage).
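Steps 1 and 4 of the extended procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `spike_design` and `second_stage` are hypothetical, the FP terms are assumed to be set to 0 for unexposed individuals (x = 0), and the p-values are assumed to come from likelihood ratio tests fitted elsewhere. The slides do not say what to do when neither component is significant on its own, so that case is returned as undetermined.

```python
import math


def spike_design(x, powers):
    """Return (z, FP terms...) for one non-negative exposure value x.

    z is the binary exposure indicator. For x == 0 all FP terms are set
    to 0 (an assumption), so the FP part only describes exposed subjects.
    `powers` is a sorted tuple such as (0.5,) or (-2, -1).
    """
    if x == 0:
        return (0,) + (0.0,) * len(powers)
    terms = []
    prev = None
    for p in powers:
        base = math.log(x) if p == 0 else x ** p
        # A repeated power gets an extra log(x) factor.
        terms.append(base * math.log(x) if p == prev else base)
        prev = p
    return (1,) + tuple(terms)


def second_stage(p_drop_z, p_drop_fp, alpha=0.05):
    """Decide which components stay, given p-values for dropping each one."""
    keep_z = p_drop_z < alpha
    keep_fp = p_drop_fp < alpha
    if keep_z and keep_fp:
        return "FP + z"
    if keep_z:
        return "z only"
    if keep_fp:
        return "FP only"
    return "undetermined"


print(spike_design(0, (0.5,)), spike_design(4, (0.5,)))
print(second_stage(0.001, 0.2))
```

The last call mirrors a situation where dropping z is highly significant but dropping the dose-response part is not, in which case only the binary indicator is kept.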

12 4. Examples 4.1 Cigarette consumption and lung cancer
Case-control study, 600 cases, 1343 controls.
X: average number of cigarettes smoked per day.
FP2 model with added binary variable:

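13 4. Examples 4.1 Cigarette consumption and lung cancer
Table columns: Model, Deviance, Dev. diff., d.f., P, Power. Most numeric entries are missing; surviving entries are listed per row.
First stage:
Null (P < …)
Linear + z (P < …)
FP1+ + z
FP2+ + z (powers …, -1)
Second stage:
FP1+ + z
FP1+ [dropping z] (P < …)
z [dropping FP1] (P < …)
Standard FP analysis (as alternative): powers …, -1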

14 4. Examples 4.1 Cigarette consumption and lung cancer
Result:
First step: selects the FP1 transformation.
Second step: both the binary and the FP1 term are required.
FP2 without the binary term gives a similar result.

15 4. Examples 4.1 Cigarette consumption and lung cancer

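16 4. Examples 4.2 Gleason Score and prostate cancer (predictors of PSA level)
Table columns: Model, Deviance, Dev. diff., d.f., P, Power. Most numeric entries are missing; surviving entries are listed per row.
First stage:
Null
Linear + z
FP1+ + z (power -0.5)
FP2+ + z (deviance 272.3; powers -1, 3)
Second stage:
Linear + z (deviance 273.7; remaining entries illegible)
Linear [dropping z]
z [dropping linear]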

17 4. Examples 4.2 Gleason Score and prostate cancer
Result:
The selected model from the first stage is Linear + z.
Dropping the linear term does not worsen the fit.
Dropping the binary term is highly significant.
→ The selected model comprises only the binary variable.

18 4. Examples 4.3 Alcohol consumption and breast cancer (case-control study, 706 cases, 1381 controls)
Table columns: Model, Deviance, Dev. diff., d.f., P, Power. Most numeric entries are missing; surviving entries are listed per row.
First stage:
Null
Linear + z
FP1+ + z
FP2+ + z (powers …, 0.5)
Second stage:
FP2+ + z (powers …, 0.5)
FP2+ [dropping z] (powers …, 0.5)
z [dropping FP2]
Standard FP analysis (as alternative): powers …, 0.5

19 4. Examples 4.3 Alcohol consumption and breast cancer
Result:
First step: FP2 is the best transformation.
Second step: dropping the FP2 or the binary variable worsens the fit → FP2+ + z is the best model.
Standard FP (other powers!) has a similar fit.

20 4. Examples 4.3 Alcohol consumption and breast cancer

21 5. Summary
The procedure of adding a binary indicator is supported by theoretical results.
Subject matter knowledge (SMK) is an important criterion for deciding whether inclusion of the indicator is required.
SMK: indicator required – the procedure is useful to determine the dose-response part.
SMK: indicator not required – nevertheless, the indicator may improve model fit.
The suggested 2-step FP procedure with an added binary indicator appears to be useful in practical applications.
