RandomForest as a Variable Selection Tool for Biomarker Data Katja Remlinger GlaxoSmithKline, RTP, NC ICSA Applied Statistics Symposium June 6th, 2007
Introduction With high dimensional data we want to reduce the number of variables –Remove “noise” variables –Ease model interpretation –Reduce cost by measuring a subset of variables Biomarker data are typically high dimensional => excellent candidates for variable reduction
What is a Biomarker? Characteristic that is objectively measured and evaluated as an indicator of –normal biological processes –pathogenic processes –pharmacologic responses to a therapeutic intervention Types of biomarkers –Genes –Proteins –Lipids –Metabolites….
Use of Biomarkers Select patients thought more likely to benefit from new drug Screen drug candidates for likely efficacy Identify criteria for dose selection for phase II & III trials Predict safety problems by identifying toxicity markers that give early warnings Make diagnosis Indicate disease status and stage Predict/monitor clinical response to an intervention ….
Example – Liver Fibrosis 8th leading cause of death in the US Scar formation that occurs as the liver tries to repair damaged tissue Current approach: Liver biopsy to determine fibrosis stage Goal: –Identify small panel of biomarkers that can predict fibrosis stage of patient (mild or severe) => Prediction Problem with Variable Selection
Example – Liver Fibrosis 384 Hepatitis C infected patients of various fibrosis stages –61% Mild –39% Severe Collected 46 serum biomarkers Select 5-10 biomarkers
Data
[Schematic: building a RandomForest]
–Original data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10
–Draw bootstrap samples, e.g., S1,S2,S2,S3,S5,S6,S7,S8,S9,S9; S2,S3,S4,S4,S4,S5,S7,S8,S8,S10; …
–Drop each bootstrap sample down its own tree (Tree 1, Tree 2, Tree 3, …, Tree 5000)
–Record the prediction accuracy, then permute a variable and record the permuted prediction accuracy; the difference gives that variable's importance
Making Predictions with RandomForest
–Drop a new sample (M1, M2, …, Mp) down every tree (Tree 1, Tree 2, Tree 3, …, Tree 5000); each tree votes Mild or Severe
–Tally the results from all trees, e.g., 70% Mild, 30% Severe
–Majority vote: predict Mild
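The talk used R's randomForest; the voting scheme above can be sketched in scikit-learn, where the per-class vote fractions come from `predict_proba` and the majority vote from `predict`. All data and marker counts here are hypothetical.

```python
# Illustrative sketch (not the GSK pipeline): a forest of trees classifies a
# new sample by majority vote across the trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 samples, 10 hypothetical markers
y = np.where(X[:, 0] + rng.normal(size=100) > 0, "Severe", "Mild")

# the talk grows 5000 trees; fewer here to keep the sketch fast
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

new_sample = rng.normal(size=(1, 10))             # M1, ..., Mp for a new subject
vote_fractions = rf.predict_proba(new_sample)[0]  # fraction of trees voting each class
prediction = rf.predict(new_sample)[0]            # majority vote: "Mild" or "Severe"
```

With fully grown trees, each tree casts an effectively hard vote, so the averaged probabilities are exactly the vote fractions on the slide.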
Challenges with Variable Selection How many variables are important? Which variables are important? How do we validate the model? –Correct way of validating model? –Is prediction accuracy significant? External cross validation Permutation test
A Common Variable Selection Approach is…
–Use all of the data to select variables: (Y, X) → (Y, X*)
–Obtain prediction accuracy on the reduced data
–Introduces selection bias
–Used in many publications
A Better Variable Selection Approach is … Separate training and test set External cross validation (ECV) Avoid selection bias
External Cross Validation (Svetnik et al. 2004)
1. Partition the data (Y: n × 1, X: n × p) for 5-fold cross-validation into training and test sets.
2. Build RandomForest for each training set.
3. Use the importance measure to rank the variables (e.g., 1. Marker_6, 2. Marker_509, 3. Marker_906, …, 1000. Marker_49).
4. Record the test set predictions.
5. Remove a fraction of the least important variables and rebuild RandomForest on the remaining variables.
6. Record the test set predictions.
7. Do not re-rank the variables. Repeat 5–7 until a small number of variables is left.
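Steps 1–7 can be sketched as follows in scikit-learn (the talk used R's randomForest). This is a minimal illustration on simulated data: within each fold the variables are ranked once by importance, and the least important half is repeatedly dropped without re-ranking. Sample sizes, the drop fraction, and tree counts are all assumptions chosen for speed.

```python
# External cross-validation with one-time importance ranking per fold.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n, p = 120, 30
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n) > 0  # only 2 truly informative variables

drop_fraction = 0.5
accuracies = {}  # number of variables kept -> list of per-fold test accuracies

for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # steps 2-3: build a forest on the training set and rank variables ONCE
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[train], y[train])
    kept = np.argsort(rf.feature_importances_)[::-1]  # most to least important
    while len(kept) >= 2:
        # steps 4-6: rebuild on the kept variables and record test predictions
        rf = RandomForestClassifier(n_estimators=200, random_state=0)
        rf.fit(X[train][:, kept], y[train])
        accuracies.setdefault(len(kept), []).append(rf.score(X[test][:, kept], y[test]))
        # step 5/7: drop the least important fraction WITHOUT re-ranking
        kept = kept[: max(1, int(len(kept) * (1 - drop_fraction)))]
```

Ranking only once per fold is what keeps the test sets untouched by the selection step, which is the point of doing the cross-validation "externally".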
8. Compute the optimization criterion at each step of variable removal (forests grown with mtry = sqrt(p)).
9. Replicate to "smooth out" the variability.
10. Select p′ = the number of variables in the model, based on the optimization criterion.
[Plot: optimization criterion vs. number of variables — 3 variables are very important; no additional gain from including more]
11. Pick the p′ most important variables.
12. Repeat 1–11 with permuted Y.
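Step 12's permutation test asks whether the observed accuracy could have arisen by chance. A minimal sketch, assuming simulated data and using a plain cross-validated accuracy as a stand-in for the full ECV procedure of steps 1–11:

```python
# Permutation test: rerun the procedure with Y permuted to build a null
# distribution for the accuracy, then compute a p-value.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] + rng.normal(size=100) > 0

def cv_accuracy(X, y):
    """Stand-in for the full ECV procedure of steps 1-11; plain 5-fold CV here."""
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(rf, X, y, cv=5).mean()

observed = cv_accuracy(X, y)
# null distribution: repeat the whole procedure with the labels permuted
# (only 10 permutations here for speed; use many more in practice)
null = [cv_accuracy(X, rng.permutation(y)) for _ in range(10)]
p_value = (1 + sum(a >= observed for a in null)) / (1 + len(null))
```

Crucially, the permutation must wrap the *entire* selection-plus-validation procedure, otherwise the selection bias the talk warns about reappears in the null distribution.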
We Discussed How to… Use RandomForest to do variable selection Use external cross validation to select variables in proper way and to validate model Return to example – Identify small set of biomarkers that can predict mild or severe fibrosis stage
Liver Fibrosis – Comparison to Commercial Tests
[ROC curves for RF with 11 markers, RF with 3 markers, FibroTest, and ActiTest; AUC = 0.73, 0.74, 0.70, and 0.65, respectively]
TPR = Sensitivity = P(Pred. severe | Actual severe)
FPR = 1 − Specificity = P(Pred. severe | Actual mild)
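ROC curves and AUCs like those on this slide can be computed from a forest's vote fractions for the "severe" class; a scikit-learn sketch on simulated data (not the fibrosis cohort):

```python
# ROC curve and AUC from random-forest vote fractions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)  # 1 = "severe"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
scores = rf.predict_proba(X_te)[:, 1]           # vote fraction for "severe"
fpr, tpr, thresholds = roc_curve(y_te, scores)  # FPR = 1 - specificity, TPR = sensitivity
auc = roc_auc_score(y_te, scores)
```

Sweeping the vote-fraction threshold from 1 down to 0 traces the curve from (0, 0) to (1, 1); the AUC summarizes it in one number, as in the table above.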
Summary and Discussion Approach has found sets of biomarkers that GSK can use to –Predict fibrosis stage –Monitor progression of patients in non-invasive manner –Save money Avoids selection bias
Acknowledgements Kwan Lee, GSK Mandy Bergquist, GSK Lei Zhu, GSK Jack Liu, GSK Terry Walker, GSK Peter Leitner, GSK Andy Liaw, Merck Christopher Tong, Merck Vladimir Svetnik, Merck Duke University
References DeLong, E., DeLong, D., and Clarke-Pearson, D. (1988), "Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach," Biometrics, 44, 837–845. Svetnik, V., Liaw, A., Tong, C., and Wang, T. (2004), "Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules," Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9–11 June 2004, Cagliari, Italy. F. Roli, J. Kittler, and T. Windeatt (eds.). Lecture Notes in Computer Science, vol. 3077. Berlin: Springer, pp. 334–343.
[ROC curves: separation of Metavir F0/F1 from F2–F4; AUC values 0.70, 0.75, 0.70, 0.65]
Random Forest with all biomarkers – top markers: hyaluronic acid, alpha-2 macroglobulin, VCAM-1, GGT, RBP, ALT
FibroTest biomarkers: alpha-2 macroglobulin, haptoglobin, ApoA1, total bilirubin, GGT
FibroTest and Random Forest Comparison The “cut-off” is the algorithm score for predicting a subject is Metavir stage F2-F4
Summary and Discussion (Cont.) Assessment of performance of learning machine with Variable Selection requires great care –Avoid Selection Bias –For small sample size (< 100), ECV with replication –For large sample size (>100), Train/Test-set approach with multiple partitions + ECV for Training set Prediction Models –Large variety of prediction models & optimization criteria available –Choice of prediction model and optimization criterion may depend on situation
Research and Drug Discovery Process
Stages: gene to function to target → target to lead → lead to candidate selection → candidate selection to FTIH → FTIH to proof of concept → proof of concept to Phase III → Phase III → file & launch
Milestones: disease selection, target family selection, target selection, lead (CEDD entry), candidate selected, commit to FTIH, proof of concept, commit to Phase III, commit to file and launch, commit to product type
Targets → Drugs → Products, with biomarkers used throughout
Random Input Variables
Candidate node: 10 Mild / 10 Severe, Gini Index = 0.5

Split | Left child (Mild/Severe) | Right child (Mild/Severe) | ΔGini
X1    | X1=0: 6/6                | X1=1: 4/4                 | 0.00
X2    | X2=0: 7/6                | X2=1: 3/4                 | 0.05
X3    | X3=0: 9/1                | X3=1: 1/9                 | 0.32
X4    | X4=0: 6/5                | X4=1: 4/5                 | 0.01
X5    | X5=0: 4/6                | X5=1: 6/4                 | 0.02
X6    | X6=0: 9/4                | X6=1: 1/6                 | 0.17

The usual tree algorithm chooses the best split among all variables: X3. RandomForest chooses the best among a random subset of variables: X6.
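The ΔGini values on this slide come from the weighted decrease in Gini impurity. A small sketch that reproduces several of the slide's numbers from the mild/severe counts (child counts taken from the slide):

```python
# Gini impurity of a node and the weighted Gini decrease of a binary split.

def gini(mild, severe):
    """Gini impurity of a node with the given class counts."""
    n = mild + severe
    return 1 - (mild / n) ** 2 - (severe / n) ** 2

def gini_decrease(left, right):
    """left/right are (mild, severe) counts of the two child nodes."""
    n_l, n_r = sum(left), sum(right)
    n = n_l + n_r
    parent = gini(left[0] + right[0], left[1] + right[1])
    return parent - (n_l / n) * gini(*left) - (n_r / n) * gini(*right)

print(round(gini_decrease((6, 6), (4, 4)), 2))  # X1: 0.0  (both children still 50/50)
print(round(gini_decrease((9, 1), (1, 9)), 2))  # X3: 0.32 (nearly pure children)
print(round(gini_decrease((4, 6), (6, 4)), 2))  # X5: 0.02
```

The splits shown for X1, X3, and X5 reproduce the slide's values exactly; the parent node with 10 Mild / 10 Severe has Gini 0.5, and purer children mean a larger decrease.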
Example – Breast Cancer Study Study Details –Breast cancer patients; stages II-IV –Control subjects; matched by Age Race Smoking status –42 serum biomarkers Goal: Identify panel of biomarkers to –Monitor patient response in a non-invasive and longitudinal manner –Provide more information on underlying biology and mechanisms of drug action
Biomarkers in the Research and Drug Discovery Process
Stages: gene to function to target → target to lead → lead to candidate selection → candidate selection to FTIH → FTIH to proof of concept → proof of concept to Phase III → Phase III → file & launch
Milestones: disease selection, target family selection, target selection, lead (CEDD entry), candidate selected, commit to FTIH, proof of concept, commit to Phase III, commit to file and launch, commit to product type
Targets → Drugs → Products, with biomarkers used throughout
Prognostic & Predictive Biomarkers Prognostic Biomarkers –Inform you about clinical outcome independent of therapeutic intervention –Stable during treatment course –Patient enrichment strategies Predictive Biomarkers –Indicate that effect of new drug relative to control is related to biomarker –Change over course of treatment –High importance for successful drug discovery
RandomForest on Weight Loss Data: Protein Marker Model Based on Baseline Markers and Baseline Weight The following markers were selected as having the highest median and mean importance ranking in the protein marker model: –Weight Week 0, IGFBP-3, CRP, TNF-α, CD40L, and MMP-9 Lipid model not as good as Protein model, but still better than Weight model.
RandomForest on Weight Loss Data: Models Based on Early Changes in Markers, Baseline Weight, and Early Change in Weight Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers Note: All results in the table are median numbers based on 50 replicates
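The slide's error rate is the fraction of "fast weight loser" calls that are wrong (a false-discovery-style rate, not overall misclassification). A minimal sketch with hypothetical labels:

```python
# Error rate = # misclassified as fast weight losers / # classified as fast
# weight losers, computed from true and predicted group labels.

def fast_loser_error_rate(y_true, y_pred, positive="fast"):
    """Fraction of subjects called `positive` whose true label differs."""
    called = [t for t, p in zip(y_true, y_pred) if p == positive]
    return sum(t != positive for t in called) / len(called)

# hypothetical example: 3 subjects called "fast", of whom 1 is actually slow
y_true = ["fast", "slow", "fast", "normal", "fast", "slow"]
y_pred = ["fast", "fast", "fast", "normal", "slow", "slow"]
print(fast_loser_error_rate(y_true, y_pred))  # 1 of 3 "fast" calls is wrong
```

This choice of denominator penalizes falsely flagging subjects as fast losers, which matters if the panel is used to enrich or stratify patients.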
RandomForest on Weight Loss Data: Lipid Marker Model Based on Early Change in Markers, Baseline Weight, and Early Change in Weight The following markers were selected as having the highest median and mean importance ranking in the lipid marker model: –Weight Week 0 – Week 3, Weight Week 0, and two lipid markers Very similar results are obtained if Weight Week 0 is excluded. The following markers now have highest importance: –Weight Week 0 – Week 3, and three lipid markers
Weight Change Distribution Based on 252 Obese Subjects
[Histogram: Day 1–Week 6 weight change (Week 0–Week 6, in lbs), with quartile cutoffs Q1 and Q3 defining Fast, Normal, and Slow Weight Losers]
Among the 50 Ottawa subjects: 9 Fast (18%), 30 Normal (60%), 11 Slow (22%)
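The quartile-based grouping implied by this slide can be sketched as follows. The simulated weight changes, the sign convention (more loss = "fast"), and the strict-inequality cutoffs are all assumptions for illustration:

```python
# Label subjects fast / normal / slow weight losers by the quartiles of their
# Week 0 -> Week 6 weight change.
import numpy as np

rng = np.random.default_rng(0)
weight_change = rng.normal(loc=-10, scale=4, size=50)  # lbs, negative = loss

q1, q3 = np.quantile(weight_change, [0.25, 0.75])
labels = np.where(weight_change < q1, "fast",            # largest losses
                  np.where(weight_change > q3, "slow",   # smallest losses
                           "normal"))                    # middle 50%
```

By construction roughly a quarter of subjects land in each tail, consistent with the 18% / 60% / 22% split reported for the 50 Ottawa subjects.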
Obesity - Models Based on Baseline Markers and Baseline Weight Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers Permutation p-value for Protein Markers: 0.01 Note: All results in the table are median numbers based on 50 replicates
Example – Obesity
–50 obese patients (average weight 266 lbs, average BMI 43)
–Several hundred protein and lipid biomarkers at different time points
–Weight at different time points
[Study timeline: 1200-calorie liquid diet started, then 900-calorie liquid diet; subjects reside in the clinic through Week 6 (visits at Weeks 1, 3, and 6), then return home and regain diet control (Weeks 26 and 52); samples for biomarkers collected at these time points]
Example – Obesity
–50 obese patients
–Several hundred protein and lipid biomarkers at different time points
–Weight at different time points
[Plot: % weight change from baseline at Weeks 3, 6, and 26]
Example – Obesity 50 obese patients –266 lbs at baseline –BMI of 43 at baseline Several hundred protein and lipid biomarkers at baseline Weight at different time points