RandomForest as a Variable Selection Tool for Biomarker Data
Katja Remlinger, GlaxoSmithKline, RTP, NA
ICSA Applied Statistics Symposium, June 6th, 2007


Outline
• Introduction
• Example: Liver Fibrosis
• RandomForest Algorithm
• Challenges with Variable Selection
• External Cross Validation
• Summary and Discussion

Introduction
• With high-dimensional data we want to reduce the number of variables
  – Remove "noise" variables
  – Ease model interpretation
  – Reduce cost by measuring a subset of variables
• Biomarker data are typically high dimensional => excellent candidates for variable reduction

What is a Biomarker?
• Characteristic that is objectively measured and evaluated as an indicator of
  – normal biological processes
  – pathogenic processes
  – pharmacologic responses to a therapeutic intervention
• Types of biomarkers
  – Genes
  – Proteins
  – Lipids
  – Metabolites
  – …

Use of Biomarkers
• Select patients thought more likely to benefit from new drug
• Screen drug candidates for likely efficacy
• Identify criteria for dose selection for phase II & III trials
• Predict safety problems by identifying toxicity markers that give early warnings
• Make diagnosis
• Indicate disease status and stage
• Predict/monitor clinical response to an intervention
• …

Example – Liver Fibrosis
• 8th leading cause of death in the US
• Scar formation that occurs as the liver tries to repair damaged tissue
• Current approach: liver biopsy to determine fibrosis stage
• Goal:
  – Identify small panel of biomarkers that can predict fibrosis stage of patient (mild or severe)
=> Prediction problem with variable selection

Example – Liver Fibrosis
• 384 Hepatitis C infected patients of various fibrosis stages
  – 61% Mild
  – 39% Severe
• Collected 46 serum biomarkers
• Select 5-10 biomarkers

Prediction & Variable Selection Tools
• Stepwise Regression
• PLS, PLS-DA
• LARS/LASSO
• Elastic Net
• RandomForest

A Single Tree
[Figure: a classification tree with two splits. Node purity is measured by the Gini index.]
• Candidate node: 10 Mild, 10 Severe, Gini index = 0.5; split on Biomarker 4 (>= 14.45 vs. < 14.45)
  – Daughter node: 5 Mild, 0 Severe, Gini index = 0 (predicted Mild)
  – Daughter node: 5 Mild, 10 Severe, Gini index = 0.44; split on Biomarker 32 (= 1 vs. = 0)
    – Daughter node: 1 Mild, 9 Severe, Gini index = 0.18
    – Daughter node: 4 Mild, 1 Severe, Gini index = 0.32
    – Leaf predictions: Mild, Severe
• New sample: Biomarker 4 = 28.65, Biomarker 32 = 0
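The Gini values quoted for the nodes on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the standard two-class Gini impurity 1 - sum(p_k^2); the function name `gini_index` is illustrative and not taken from the talk.

```python
def gini_index(class_counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# (Mild, Severe) counts for the nodes shown on the slide.
nodes = {
    "candidate node": (10, 10),
    "daughter 5/0": (5, 0),
    "daughter 5/10": (5, 10),
    "daughter 1/9": (1, 9),
    "daughter 4/1": (4, 1),
}

for name, counts in nodes.items():
    print(f"{name}: Gini index = {gini_index(counts):.2f}")
```

A pure node (all samples in one class) gives 0, and an evenly mixed two-class node gives the maximum of 0.5.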

RandomForest
[Figure: growing the forest]
• Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10
• Draw bootstrap samples
  – S1, S2, S2, S3, S5, S6, S7, S8, S9, S9
  – S2, S3, S4, S4, S4, S5, S7, S8, S8, S10
  – S1, S1, S2, S3, S3, S4, S7, S8, S9, S10
  – …
  – S1, S6, S6, S6, S8, S8, S9, S9, S9, S9
• Grow trees: Tree 1, Tree 2, Tree 3, …, Tree 5000
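Drawing one bootstrap sample per tree can be sketched as follows. This is an illustration only, assuming sampling with replacement to the original sample size; the helper name `bootstrap_sample` and the seed are hypothetical.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement; also return the items never drawn."""
    drawn = rng.choices(samples, k=len(samples))
    left_out = [s for s in samples if s not in drawn]
    return drawn, left_out


samples = [f"S{i}" for i in range(1, 11)]  # S1 ... S10, as on the slide
rng = random.Random(2007)                  # arbitrary seed for reproducibility

for tree in range(1, 4):
    drawn, left_out = bootstrap_sample(samples, rng)
    print(f"Tree {tree}: drawn = {sorted(drawn, key=samples.index)}, left out = {left_out}")
```

Each tree is grown on its own draw, so some samples appear several times and others not at all.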

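Variable Importance
[Figure: computing variable importance from the samples left out of each bootstrap sample]
• Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10
• Draw bootstrap samples; the samples not drawn for each tree are
  – Tree 1 (S1, S2, S2, S3, S5, S6, S7, S8, S9, S9): S4, S10
  – Tree 2 (S2, S3, S4, S4, S4, S5, S7, S8, S8, S10): S1, S6, S9
  – Tree 3 (S1, S1, S2, S3, S3, S4, S7, S8, S9, S10): S5, S6
  – …
  – Tree 5000 (S1, S6, S6, S6, S8, S8, S9, S9, S9, S9): S2, S3, S4, S5, S7, S10
• Drop the left-out samples down the trees => prediction accuracy
• Permute (P) a variable in the left-out samples and drop them down the trees => permuted prediction accuracy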
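The comparison of prediction accuracy before and after permuting one variable can be sketched like this. It is a simplified illustration for a single tree, assuming importance is measured as the drop in accuracy; `toy_tree`, its threshold direction, and the data are hypothetical stand-ins, not the model from the talk.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the predictor returns the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def permutation_importance(predict, rows, labels, var_index, rng):
    """Accuracy on the original rows minus accuracy after shuffling one variable."""
    baseline = accuracy(predict, rows, labels)
    column = [row[var_index] for row in rows]
    rng.shuffle(column)
    permuted_rows = [
        row[:var_index] + [value] + row[var_index + 1:]
        for row, value in zip(rows, column)
    ]
    return baseline - accuracy(predict, permuted_rows, labels)


def toy_tree(row):
    """Hypothetical one-split tree that only looks at the first variable."""
    return "Severe" if row[0] >= 14.45 else "Mild"


rows = [[20.0, 1], [10.0, 0], [30.0, 0], [5.0, 1]]   # made-up left-out samples
labels = ["Severe", "Mild", "Severe", "Mild"]
rng = random.Random(0)

for var_index in range(2):
    drop = permutation_importance(toy_tree, rows, labels, var_index, rng)
    print(f"variable {var_index}: accuracy drop = {drop:.2f}")
```

A variable the tree never uses shows no accuracy drop when permuted, which is why the drop can serve as an importance measure.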

Making Prediction with RandomForest
• New sample with markers M1, M2, …, Mp is dropped down Tree 1, Tree 2, Tree 3, …, Tree 5000
• Each tree predicts Mild or Severe
• Results from all trees: Mild 70%, Severe 30%
• Majority vote: Mild
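The majority vote over the trees amounts to counting class predictions. A minimal sketch, assuming each tree returns a single class label; the vote counts below simply reproduce the 70%/30% split from the slide for 5000 trees.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent class and its share of the votes."""
    counts = Counter(votes)
    winner, n_votes = counts.most_common(1)[0]
    return winner, n_votes / len(votes)


# 5000 trees: 70% vote Mild, 30% vote Severe.
votes = ["Mild"] * 3500 + ["Severe"] * 1500
prediction, share = majority_vote(votes)
print(prediction, share)
```

The vote share can also be kept as a score for the sample, which is what makes ROC curves possible for a forest.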

Challenges with Variable Selection
• How many variables are important?
• Which variables are important?
• How do we validate the model?
  – Correct way of validating model? => External cross validation
  – Is prediction accuracy significant? => Permutation test

A Common Variable Selection Approach is …
• Use all data to select variables
• Obtain prediction accuracy on reduced data
• Introduces selection bias
• Used in many publications
[Figure: response Y with full data X, and Y with reduced data X*]

A Better Variable Selection Approach is …
• Separate training and test set
• External cross validation (ECV)
• Avoid selection bias

External Cross Validation (Svetnik et al. 2004)
1. Partition data for 5-fold cross-validation.
[Figure: response Y (n x 1) and predictors X (n x p), partitioned so that each fold serves once as the test set while the remaining folds form the training set]
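Step 1 can be sketched as a random partition of sample indices into five folds. This is an illustration under assumptions: the function name `kfold_partition` and the seed are hypothetical, and n = 384 is taken from the liver fibrosis example.

```python
import random


def kfold_partition(n_samples, n_folds=5, seed=0):
    """Shuffle indices and return a list of (training indices, test indices) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


for fold, (train, test) in enumerate(kfold_partition(384), start=1):
    print(f"fold {fold}: {len(train)} training samples, {len(test)} test samples")
```

Variable ranking and selection then happen inside each training set only, so the test fold never influences which variables are kept.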

External Cross Validation (cont.)
2. Build RandomForest for each training set.
3. Use importance measure to rank variables.
4. Record test set predictions.
[Figure: RandomForest (RF) built on the training set (Y, X) yields test set predictions and a variable importance ranking: 1. Marker_6, 2. Marker_509, 3. Marker_906, …, 98. Marker_57, 99. Marker_2, 1000. Marker_49]

External Cross Validation (cont.)
5. Remove fraction of least important variables and rebuild RandomForest.
6. Record test set predictions.
7. Do not re-rank variables.
Repeat 5-7 until a small number of variables is left.
[Figure: the variables at the bottom of the importance ranking (1. Marker_6, 2. Marker_509, 3. Marker_906, …, 98. Marker_57, 99. Marker_2, 1000. Marker_49) are removed, RF is rebuilt on the training set with the remaining variables, and test set predictions are recorded again]
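Steps 5-7 imply a sequence of nested variable subsets taken from a single, fixed ranking. The sketch below shows only that bookkeeping; the drop fraction of 0.5, the stopping size of 2, and the marker names are assumptions for illustration, since the slides do not state the values used.

```python
def elimination_schedule(ranking, drop_fraction=0.5, min_vars=2):
    """Nested subsets of a fixed ranking (most important first), without re-ranking."""
    current = list(ranking)
    subsets = [current]
    while len(current) > min_vars:
        n_drop = max(1, int(len(current) * drop_fraction))
        n_keep = max(min_vars, len(current) - n_drop)
        current = current[:n_keep]  # keep the top of the original ranking
        subsets.append(current)
    return subsets


# Hypothetical ranking of 46 markers, most important first.
ranking = [f"Marker_{i}" for i in range(1, 47)]
for subset in elimination_schedule(ranking):
    print(len(subset), "variables, least important kept:", subset[-1])
```

At each subset size a forest would be rebuilt on the training set and the test set predictions recorded; that model-fitting part is omitted here.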

External Cross Validation (cont.)
8. Compute optimization criterion at each step of variable removal.
9. Replicate to "smooth" out variability.
10. Select p' = number of variables in the model, based on optimization criterion.
[Figure: optimization criterion vs. number of variables, with mtry = sqrt(p). Annotations: 3 variables are very important; no additional gains by including more variables.]

External Cross Validation (cont.)
11. Pick p' most important variables.
12. Repeat 1-11 with permuted Y.
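Step 12 is the basis of the permutation test: the criterion obtained with the real Y is compared with the criteria obtained after shuffling Y. A minimal sketch, assuming larger criterion values are better and using the common (1 + count) / (1 + permutations) convention, which the slides do not specify; all numbers below are made up.

```python
import random


def permute_labels(labels, rng):
    """Return a shuffled copy of Y, breaking any link between Y and X."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def permutation_p_value(observed, permuted_values):
    """Share of runs (including the observed one) at least as good as the observed."""
    at_least_as_good = sum(1 for value in permuted_values if value >= observed)
    return (1 + at_least_as_good) / (1 + len(permuted_values))


labels = ["Mild"] * 6 + ["Severe"] * 4
print(permute_labels(labels, random.Random(0)))

# Hypothetical criterion values: one real run and 99 runs with permuted Y.
observed_accuracy = 0.80
permuted_accuracies = [0.50] * 99
print(permutation_p_value(observed_accuracy, permuted_accuracies))
```

With 99 permutations, the smallest p-value this convention can return is 0.01.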

We Discussed How to …
• Use RandomForest to do variable selection
• Use external cross validation to select variables in proper way and to validate model
• Return to example
  – Identify small set of biomarkers that can predict mild or severe fibrosis stage

Liver Fibrosis – Comparison to Commercial Tests
[Figure: ROC curves, true positive rate vs. false positive rate, for RF w. 11 markers, RF w. 3 markers, FibroTest, and ActiTest. AUC values as listed on the slide: 0.73, 0.74, 0.70, 0.65]
• TPR = Sensitivity = P(Pred. severe | Actual severe)
• FPR = 1 - Specificity = P(Pred. severe | Actual mild)
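The two rates that define one point on a ROC curve follow directly from the definitions above. A minimal sketch with made-up labels (4 severe and 6 mild patients); `tpr_fpr` is an illustrative name.

```python
def tpr_fpr(actual, predicted, positive="severe"):
    """True positive rate and false positive rate for one set of predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp / (tp + fn), fp / (fp + tn)


actual = ["severe"] * 4 + ["mild"] * 6
predicted = ["severe", "severe", "severe", "mild"] + ["severe"] + ["mild"] * 5

tpr, fpr = tpr_fpr(actual, predicted)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Repeating the calculation for every cut-off on the prediction score traces out the full ROC curve.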

Summary and Discussion
• Approach has found sets of biomarkers that GSK can use to
  – Predict fibrosis stage
  – Monitor progression of patients in non-invasive manner
  – Save money
• Avoid selection bias

Acknowledgements
• Kwan Lee, GSK
• Mandy Bergquist, GSK
• Lei Zhu, GSK
• Jack Liu, GSK
• Terry Walker, GSK
• Peter Leitner, GSK
• Andy Liaw, Merck
• Christopher Tong, Merck
• Vladimir Svetnik, Merck
• Duke University

References
• DeLong, E., DeLong, D., and Clarke-Pearson, D. (1988), "Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach," Biometrics, 44, 837-845.
• Svetnik, V., Liaw, A., Tong, C., and Wang, T. (2004), "Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules," Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9-11 June 2004, Cagliari, Italy. F. Roli, J. Kittler, and T. Windeatt (eds.). Lecture Notes in Computer Science, vol. 3077. Berlin: Springer, pp. 334-343.

Backup

ROC Curves: Separation of Metavir F0/F1 from F2-F4
• Random Forest with all biomarkers: hyaluronic acid, alpha-2 macroglobulin, VCAM-1, GGT, RBP, ALT
• FibroTest biomarkers: alpha-2 macroglobulin, haptoglobin, ApoA1, total bilirubin, GGT
• AUC values as listed on the slide: 0.70, 0.75, 0.70, 0.65

FibroTest and Random Forest Comparison
The "cut-off" is the algorithm score for predicting that a subject is Metavir stage F2-F4.

Summary and Discussion (Cont.)
• Assessment of performance of learning machine with variable selection requires great care
  – Avoid selection bias
  – For small sample size (< 100), ECV with replication
  – For large sample size (> 100), train/test-set approach with multiple partitions + ECV for training set
• Prediction models
  – Large variety of prediction models & optimization criteria available
  – Choice of prediction model and optimization criterion may depend on situation

Research and Drug Discovery Process
[Figure: pipeline diagram labeled Targets, Drugs, Products, and Biomarker]
• Stages: Gene to function to target; Target to lead; Lead to candidate selection; Candidate selection to FTIH; FTIH to proof of concept; Proof of concept to Phase III; Phase III; File & launch
• Milestones: Disease selection; Target family selection; Target selection; Lead (CEDD entry); Candidate selected; Commit to FTIH; Proof of concept; Commit to phase III; Commit to file and launch; Commit to product type

A closer look……

Random Input Variables
Candidate node: 10 Mild, 10 Severe, Gini index = 0.5

Variable   Counts at X=0   Counts at X=1   Δ Gini
X1         6/6             4/4             0.00
X2         7/6             3/4             0.05
X3         9/1             1/9             0.32
X4         6/5             4/5             0.01
X5         4/6             6/4             0.02
X6         9/4             1/6             0.17

• Usual tree algorithm chooses the best among all variables: X3
• RandomForest chooses the best among a random subset of variables: X6
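The difference between the two split rules can be sketched with the Δ Gini values quoted on this slide. This is an illustration only: the subset {X1, X4, X6} is a hypothetical draw chosen to reproduce the slide's outcome, and the function names are not from the talk.

```python
import random

# Δ Gini for each candidate variable, as quoted on the slide.
delta_gini = {"X1": 0.00, "X2": 0.05, "X3": 0.32, "X4": 0.01, "X5": 0.02, "X6": 0.17}


def best_split(candidates, scores):
    """Pick the candidate variable with the largest decrease in Gini index."""
    return max(candidates, key=scores.get)


def random_subset_split(scores, mtry, rng):
    """Draw mtry variables at random, then pick the best split among them."""
    subset = rng.sample(sorted(scores), mtry)
    return best_split(subset, scores), subset


# Usual tree algorithm: search all variables.
print(best_split(list(delta_gini), delta_gini))

# RandomForest: search only a random subset, here a hypothetical draw without X3.
print(best_split(["X1", "X4", "X6"], delta_gini))

# The same with an actual random draw of mtry = 2 variables.
print(random_subset_split(delta_gini, 2, random.Random(0)))
```

Restricting each split to a random subset lets weaker variables such as X6 be used in some trees, which decorrelates the trees in the forest.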

Example – Breast Cancer Study
• Study details
  – Breast cancer patients; stages II-IV
  – Control subjects; matched by age, race, and smoking status
  – 42 serum biomarkers
• Goal: identify panel of biomarkers to
  – Monitor patient response in a non-invasive and longitudinal manner
  – Provide more information on underlying biology and mechanisms of drug action

Examples of Biomarkers
• Laboratory tests
  – Routine, non-routine and novel tests, novel applications, genes, proteins, metabolites, lipids, …
• Electrophysiological measures
  – ECG, EEG, …
• Imaging
  – fMRI, PET, X-ray, BMD, Ultrasound, CT, …
• Histological analyses
  – Immunohistochemistry, electron microscopy, …
• Physiological measures
  – Heart rate, blood pressure, pupil size, …
• Behavioral tests
  – Cognitive function, motor performance, …

Biomarkers in the Research and Drug Discovery Process
[Figure: pipeline diagram labeled Targets, Drugs, Products, and Biomarker]
• Stages: Gene to function to target; Target to lead; Lead to candidate selection; Candidate selection to FTIH; FTIH to proof of concept; Proof of concept to Phase III; Phase III; File & launch
• Milestones: Disease selection; Target family selection; Target selection; Lead (CEDD entry); Candidate selected; Commit to FTIH; Proof of concept; Commit to phase III; Commit to file and launch; Commit to product type

Prognostic & Predictive Biomarkers
• Prognostic biomarkers
  – Inform you about clinical outcome independent of therapeutic intervention
  – Stable during treatment course
  – Patient enrichment strategies
• Predictive biomarkers
  – Indicate that effect of new drug relative to control is related to biomarker
  – Change over course of treatment
  – High importance for successful drug discovery

RandomForest on Weight Loss Data: Protein Marker Model Based on Baseline Markers and Baseline Weight
• The following markers were selected as having the highest median and mean importance ranking in the protein marker model:
  – Weight Week 0, IGFBP-3, CRP, TNF-α, CD40L, and MMP-9
• Lipid model not as good as protein model, but still better than weight model.

RandomForest on Weight Loss Data: Models Based on Early Changes in Markers, Baseline Weight, and Early Change in Weight
[Table of results not preserved in the transcript]
• Error rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers
• Note: All results in the table are median numbers based on 50 replicates
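The error rate defined on this slide is computed only over subjects the model calls fast weight losers. A minimal sketch with made-up labels; the function name and the label strings are illustrative.

```python
def fast_loser_error_rate(actual, predicted, fast_label="fast"):
    """Misclassified-as-fast subjects divided by all subjects classified as fast."""
    called_fast = [a for a, p in zip(actual, predicted) if p == fast_label]
    if not called_fast:
        return None  # undefined when nobody is classified as a fast weight loser
    wrong = sum(1 for a in called_fast if a != fast_label)
    return wrong / len(called_fast)


actual = ["fast", "fast", "normal", "slow", "normal", "fast"]
predicted = ["fast", "normal", "fast", "slow", "normal", "fast"]
print(fast_loser_error_rate(actual, predicted))
```

Subjects who are truly fast weight losers but are predicted otherwise do not enter this rate, so it should be read alongside other measures.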

RandomForest on Weight Loss Data: Lipid Marker Model Based on Early Change in Markers, Baseline Weight, and Early Change in Weight
• The following markers were selected as having the highest median and mean importance ranking in the lipid marker model:
  – Weight Week 0 – Week 3, Weight Week 0, and 2 lipid markers
• Very similar results are obtained if Weight Week 0 is excluded. The following markers now have highest importance:
  – Weight Week 0 – Week 3, and three lipid markers

Weight Change Distribution Based on 252 Obese Subjects, Day 1 to Week 6
[Figure: distribution of weight change Week 0 to Week 6 in lbs, with Q1 and Q3 separating fast, normal, and slow weight losers]
• 50 Ottawa subjects: fast weight losers 9 (18%), normal weight losers 30 (60%), slow weight losers 11 (22%)

Obesity - Models Based on Baseline Markers and Baseline Weight
[Table of results not preserved in the transcript]
• Error rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers
• Permutation p-value for protein markers: 0.01
• Note: All results in the table are median numbers based on 50 replicates

Example – Obesity
• 50 obese patients
• Several hundred protein and lipid biomarkers at different time points
• Weight at different time points
[Figure: study timeline. Start 1200 calorie liquid diet; start 900 calorie liquid diet; Week 1, Week 3, Week 6, Week 26, Week 52; subjects reside in clinic, then return home and regain diet control; sample collection for biomarkers. Average weight = 266 lbs; average BMI = 43]

Example – Obesity
• 50 obese patients
• Several hundred protein and lipid biomarkers at different time points
• Weight at different time points
[Figure: % weight change from baseline at Week 3, Week 6, and Week 26]

Example – Obesity
• 50 obese patients
  – 266 lbs at baseline
  – BMI of 43 at baseline
• Several hundred protein and lipid biomarkers at baseline
• Weight at different time points
