Personalized Medicine: Analytics for Cancer Survival Curves
Ran Qi, Shujia Zhou, Yelena Yesha
June 13, 2013
IAB Meeting Research Report
Introduction: Cancer Staging (1)
Cancer stage is an anatomic description of the character and extent of cancer spread (usually stage I to IV).
– Prognostic factors
  Tumor (T): size, location, local extent
  Nodes (N): number and location of nodal metastases
  Metastasis (M): presence of distant organ spread
Lung cancer staging (bin model)
Stage   T        N          M
I       T1       N0         M0
IIA     T1       N1         M0
        T2       N0         M0
IIB     T2       N1         M0
        T3       N0         M0
IIIA    T1, 2    N2         M0
        T3       N1, 2      M0
IIIB    T4       N0, 1, 2   M0
IIIC    Any T    N3         M0
IV      Any T    Any N      M1
Each T, N, M combination in the table is a bin.
Lung cancer survival curves
A Bin Model
Breast cancer: 5 T's, 4 N's, 2 M's give 40 bins (5x4x2)
Adding grade (3 levels): 120 bins (5x4x2x3)
Adding ER (hormonal status, 2 levels): 240 bins (5x4x2x3x2)
Each additional variable multiplies the number of bins, so the number of bins that would have to be assigned to stages becomes enormous and collapsing them into stages becomes impractical.
A "bin" is also called a "combination".
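The bin arithmetic above can be checked with a short sketch. The level labels (T0 to T4, G1 to G3, and so on) are illustrative placeholders, not taken from the slides; only the number of levels per factor comes from the slide.

```python
from itertools import product

# Number of levels per factor follows the slide; the labels are hypothetical.
factors = {
    "T": ["T0", "T1", "T2", "T3", "T4"],  # 5 tumor categories
    "N": ["N0", "N1", "N2", "N3"],        # 4 node categories
    "M": ["M0", "M1"],                    # 2 metastasis categories
}

def count_bins(factors):
    """Number of bins is the product of the number of levels of each factor."""
    n = 1
    for levels in factors.values():
        n *= len(levels)
    return n

tnm_bins = count_bins(factors)

factors["grade"] = ["G1", "G2", "G3"]     # 3 grade levels
with_grade = count_bins(factors)

factors["ER"] = ["ER+", "ER-"]            # 2 hormonal-status levels
with_er = count_bins(factors)

# Enumerating the bins explicitly gives the same count.
all_bins = list(product(*factors.values()))

print(tnm_bins, with_grade, with_er)  # 40 120 240
```

Each new factor multiplies the count by its number of levels, which is why the bin model does not scale to dozens of prognostic factors.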
Problems
How to combine the growing number of prognostic factors into a small number of stages
– Since the TNM staging system was announced in the 1950s, many new prognostic factors have been identified.
– By 1995, 76 predictive factors had been identified for breast cancer.
– By 2002, 150 factors had been identified for lung cancer.
Different prognostic factors affect the survival curves to different degrees.
Objectives
– Reduce the number of bins by grouping similar patients
– Find the relationship between prognostic factors and survival curves
Approaches
Grouping cancer patients according to their similarity
– Ensemble Algorithm for Clustering Cancer Data (EACCD)
– Grouping Algorithm for Cancer Data (GACD)
The GACD workflow (diagram)
Input: cancer patient dataset (200,000 patients); groups of patients (combinations) are initialized with a cutoff.
Steps 1 to 4, as labeled in the diagram:
– Log-rank test, producing a dissimilarity matrix
– Partitioning clustering + statistical calculations, producing a learnt dissimilarity matrix
– Hierarchical clustering with dendrogram, producing new groups of patients
– Kaplan-Meier estimator, producing survival curves
Annotations: MCMC (jump over local minimum); weight (increase efficiency)
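As an illustration of the log-rank step, here is a minimal sketch of a two-group log-rank statistic and a pairwise dissimilarity matrix built from it. Using the raw chi-square statistic directly as the dissimilarity is an assumption made for this example; the slides do not say how GACD converts the test result into a dissimilarity. The function names and the toy data are hypothetical.

```python
def logrank_statistic(times1, events1, times2, events2):
    """Two-group log-rank chi-square statistic (1 degree of freedom).

    times*: follow-up times; events*: 1 if death observed, 0 if censored.
    """
    event_times = sorted(
        {t for t, e in zip(times1, events1) if e}
        | {t for t, e in zip(times2, events2) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n1 = sum(1 for x in times1 if x >= t)  # at risk in group 1
        n2 = sum(1 for x in times2 if x >= t)  # at risk in group 2
        d1 = sum(1 for x, e in zip(times1, events1) if x == t and e)
        d2 = sum(1 for x, e in zip(times2, events2) if x == t and e)
        n, d = n1 + n2, d1 + d2
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += n1 * n2 * d * (n - d) / (n * n * (n - 1))
    if variance == 0:
        return 0.0
    return observed_minus_expected ** 2 / variance

def dissimilarity_matrix(groups):
    """groups: list of (times, events), one entry per combination."""
    k = len(groups)
    mat = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            s = logrank_statistic(*groups[i], *groups[j])
            mat[i][j] = mat[j][i] = s  # assumption: statistic used as dissimilarity
    return mat

# Toy data: three combinations, all deaths observed (no censoring).
toy_groups = [
    ([1, 2], [1, 1]),
    ([3, 4], [1, 1]),
    ([1, 2], [1, 1]),
]
toy_matrix = dissimilarity_matrix(toy_groups)
```

Combinations with identical survival experience get dissimilarity 0, and combinations whose deaths occur at clearly different times get a larger value.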
GACD
Features
– A deterministic grouping method
– Uses weighted dissimilarity to improve the grouping efficiency
– Uses MCMC to avoid local minima
Results
– Found that grouping results are sensitive to the partitioning algorithm (e.g., PAM and Fuzzy)
– Found that grouping results differ between local-minimum and global-minimum partitioning algorithms
– Implemented weighted dissimilarity
Prognostic factors: size, node, age, race
Number of combinations: 59
Reduce 59 curves to 3
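Reducing many combinations to a few groups can be sketched as agglomerative clustering on a dissimilarity matrix, stopping when the desired number of groups remains. This is a toy illustration only: average linkage is an assumption (the slides do not name the linkage), and the six-item dissimilarity matrix is made up for the example rather than derived from the 59 combinations.

```python
def agglomerate(dissim, k):
    """Merge clusters by average linkage until k clusters remain.

    Returns (clusters, merges), where merges records each merge and its
    linkage distance in the order it happened (the dendrogram order).
    """
    clusters = [[i] for i in range(len(dissim))]
    merges = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dissim[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(clusters), merges

# Toy dissimilarities built from made-up 1-D positions.
positions = [0, 1, 10, 11, 20, 22]
toy_dissim = [[abs(p - q) for q in positions] for p in positions]

groups, merge_history = agglomerate(toy_dissim, 3)
print(groups)  # [[0, 1], [2, 3], [4, 5]]
```

The recorded merge order is the same information a dendrogram displays, so cutting the dendrogram at three branches corresponds to calling the function with k=3.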
Evaluation Metric for Grouping Results
– The area enclosed by two Kaplan-Meier curves
– Linear correlation coefficient between the merging order of the dendrogram and the area between Kaplan-Meier curves
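A minimal sketch of both metrics follows, assuming the area is the integral of the absolute difference between two Kaplan-Meier step curves up to a chosen time horizon. The horizon parameter, the function names, and the toy data are assumptions for illustration; the slides do not specify how the area is bounded.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as [(t, S(t))] at each distinct death time.

    times: follow-up times; events: 1 if death observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

def step_value(curve, t):
    """S(t) for a right-continuous step curve; S = 1 before the first death."""
    s = 1.0
    for time, surv in curve:
        if time <= t:
            s = surv
        else:
            break
    return s

def area_between(curve_a, curve_b, horizon):
    """Integral of |S_a(t) - S_b(t)| over [0, horizon]."""
    knots = sorted(
        {0.0, float(horizon)}
        | {float(t) for t, _ in curve_a + curve_b if t < horizon}
    )
    area = 0.0
    for left, right in zip(knots, knots[1:]):
        gap = abs(step_value(curve_a, left) - step_value(curve_b, left))
        area += gap * (right - left)
    return area

def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data: two groups with all deaths observed.
curve_a = kaplan_meier([1, 2], [1, 1])
curve_b = kaplan_meier([3, 4], [1, 1])
toy_area = area_between(curve_a, curve_b, horizon=4)
```

For the second metric, one would pair the merge order of the dendrogram with the area between the curves of the two groups merged at each step and pass the two sequences to `pearson`; how the slides pair these values exactly is not stated.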
Conclusion
– The expanded TNM system (e.g., EACCD and GACD) can analyze cancer survival with more prognostic factors.
– GACD improves the efficiency of the grouping algorithm by using weights.
– The area enclosed by two Kaplan-Meier curves appears to be useful for evaluating grouping results.
Acknowledgement
This project is sponsored by NIST through NSF CHMPR. We would like to thank D. Chen, D. Henson, A. Schwartz, A. Dima, and M. Brady for the helpful discussions.