Predicting Risk of Re-hospitalization for Congestive Heart Failure Patients


Predicting Risk of Re-hospitalization for Congestive Heart Failure Patients
Jayshree Agarwal, Senjuti Basu Roy, Ankur Teredesai, Si-Chi Chin, David Hazel, Kiyana, Mehrdad (UWT)
Paul Amoroso, Yoshi Williams, Dr. Lester Reed, Sheila, Eric Johnson (MHS)

Motivation
• Congestive Heart Failure (CHF) leads to many hospitalizations and readmissions
• 19.6% of patients are readmitted within 30 days [Jencks et al. 2009]
• 31.1% of patients are readmitted within 60 days [Jencks et al. 2009]
• A low readmission rate is taken to indicate high quality of care by the hospital
• No reimbursement for readmission within 30 days
• Cost of unplanned readmissions = $17.4 billion [Jencks et al. 2009]

MHS - UWT Web and Data Science collaboration objectives
• Predict the risk of readmission for CHF patients
• Reduce the readmission rate and cost
• Improve patient satisfaction and quality of care
• Appropriate pre-discharge and post-discharge planning
• Proper resource utilization

Benefits of Predicting Risk of Readmission
• Proper resource utilization
• Improvement in quality of care
• Targeted interventions can be planned
• Proper pre-discharge and post-discharge plans can be made
• Reduction in cost
• Medicare expenditure on potentially preventable re-hospitalization is around $12 billion [Jencks et al. 2009]

Problem
• Develop models that can predict the risk of readmission for CHF patients within
  – 30 days after discharge
  – 60 days after discharge
• The readmission may happen for reasons other than CHF

Overall Approach
• How to solve the problem?
  – Apply predictive data mining techniques such as classification
• What do these predictive mining techniques require?
  – Data in a homogeneous format: information extraction, integration, and data preparation
  – A labeled dataset to train the model, used later on for testing

Our Challenges
• Building domain knowledge
  – Which variables to consider?
  – How to merge and unify them in a homogeneous format (information extraction and integration)?
  – How to understand the relative importance of the variables in the prediction task?
• How to prepare data?
  – Class label generation
  – Noisy real-world data (missing values, inconsistencies, etc.)
  – Serious skew in the dataset

Solution

Building Predictive Classification Models
Data Understanding → Data Preprocessing → Modeling → Evaluation

Data Understanding
• Collect initial data
• Acquire domain knowledge
• Describe and explore the dataset
• Create data visualizations

Data Preprocessing
• Define class label
• Attribute selection
• Data integration
• Removal of incomplete data
• Finding eligible CHF admissions

Eligible CHF Admissions and Generating Class Labels
• All CHF admissions → in-hospital deaths removed → eligible CHF admissions
• Is there any readmission within X days of discharge?
  – YES: the class label is assigned as 1
  – NO: the class label is assigned as 0
• X = 30 or X = 60
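The labeling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline: the record layout (`patient_id`, `admit`, `discharge`, `is_chf`, `died_in_hospital`) and the function name `label_admissions` are hypothetical, and the sketch assumes that a readmission for any cause counts, as the Problem slide suggests.

```python
from datetime import date

def label_admissions(admissions, window_days):
    """Assign a 0/1 readmission label to each eligible CHF admission.

    `admissions` is a list of dicts with hypothetical keys:
    patient_id, admit (date), discharge (date), is_chf, died_in_hospital.
    """
    # Group every admission (CHF or not) by patient.
    by_patient = {}
    for adm in admissions:
        by_patient.setdefault(adm["patient_id"], []).append(adm)

    labeled = []
    for stays in by_patient.values():
        stays.sort(key=lambda a: a["admit"])
        for i, stay in enumerate(stays):
            # Only CHF admissions without an in-hospital death are eligible.
            if not stay["is_chf"] or stay["died_in_hospital"]:
                continue
            label = 0
            for later in stays[i + 1:]:
                gap = (later["admit"] - stay["discharge"]).days
                if 0 <= gap <= window_days:
                    label = 1  # readmitted within the window, any cause
                    break
            labeled.append({**stay, "label": label})
    return labeled
```

Calling the function once with `window_days=30` and once with `window_days=60` would give the two labeled datasets for the two time frames.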

Attribute Selection
• "Baseline": Yale model [Krumholz et al.]
  – Socio-demographic variables (2)
  – Comorbidities (35)
• "New": additional predictor variables identified by us (14)
• "Correlated" and "All" attribute sets
• Chi-square correlation test
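To illustrate the chi-square correlation test named on this slide, here is a minimal sketch for binary attributes against the binary class label. The slides do not say which tool or significance threshold was used; the function names and the 0.05 level below are assumptions made for illustration only.

```python
def chi_square_2x2(feature, label):
    """Pearson chi-square statistic for two binary (0/1) sequences."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Critical value of the chi-square distribution with 1 degree of freedom
# at the 0.05 level (assumed threshold, not stated in the slides).
CRITICAL_1DF_005 = 3.841

def select_correlated(features, label, critical=CRITICAL_1DF_005):
    """Return names of attributes whose statistic exceeds the critical value."""
    return [name for name, values in features.items()
            if chi_square_2x2(values, label) > critical]
```

An attribute that perfectly tracks the label gets a large statistic and is kept, while an attribute independent of the label scores 0 and is dropped.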

Data Extraction
• Source data: patient details, primary and secondary diagnosis, lab measurements, administrative data
• Table joins combine the source data with the labeled data
• Incomplete data removed
• Result: data used for training the models

Data Distribution
• 30 days time frame
• 60 days time frame

Modeling
• Selecting a modeling technique for binary classification
  – Logistic regression
  – Naïve Bayes classifier
  – Support Vector Machine
• Balancing imbalanced data by under-sampling and over-sampling
• Building prediction models
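The balancing step can be sketched with simple random sampling. The slides do not specify which under-sampling or over-sampling method was used, so the random variants below (and the function names) are assumptions chosen for brevity.

```python
import random

def _split_by_class(labels):
    """Return (minority_indices, majority_indices) for 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(labels)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]

def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    kept = majority + minority + extra
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

In practice, balancing is usually applied to the training portion only, so that evaluation still reflects the original class skew.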

Logistic Regression Model
Figure: P (probability of Y) plotted against Z
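The curve on this slide is the standard logistic function, which maps a linear score Z to a probability between 0 and 1. A minimal sketch, where the coefficient values passed in are placeholders rather than fitted values from the study:

```python
import math

def sigmoid(z):
    """Logistic function P = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(intercept, coefficients, x):
    """P(Y = 1 | x), where z = intercept + sum of coefficient * attribute."""
    z = intercept + sum(b * v for b, v in zip(coefficients, x))
    return sigmoid(z)
```

A score of Z = 0 corresponds to a probability of 0.5, and the probability rises toward 1 as Z increases.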

Naïve Bayesian Classification
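The body of this slide did not survive in the transcript. As a reminder of the general technique only, here is a minimal sketch of a naïve Bayes classifier for binary (0/1) attributes, with Laplace smoothing. The function names and the restriction to binary attributes are assumptions for illustration and may differ from the models built in the study.

```python
import math

def train_naive_bayes(rows, labels):
    """Estimate class priors and per-attribute P(x_j = 1 | class).

    Assumes both classes occur at least once in `labels`.
    """
    n_features = len(rows[0])
    model = {}
    for c in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == c]
        prior = len(members) / len(rows)
        # Laplace smoothing keeps every estimate strictly between 0 and 1.
        likelihood = [(sum(r[j] for r in members) + 1) / (len(members) + 2)
                      for j in range(n_features)]
        model[c] = (prior, likelihood)
    return model

def nb_predict_proba(model, row):
    """Posterior P(class = 1 | row), assuming conditionally independent attributes."""
    log_scores = {}
    for c, (prior, likelihood) in model.items():
        score = math.log(prior)
        for x, p in zip(row, likelihood):
            score += math.log(p if x == 1 else 1.0 - p)
        log_scores[c] = score
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    return weights[1] / (weights[0] + weights[1])
```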

Support Vector Machine
• A classification method for both linear and non-linear data
• Searches for the optimal hyperplane separating the two classes

Performance Evaluation Metrics
• Precision: percentage of tuples labeled as positive that are actually positive = TP/(TP+FP)
• Recall: percentage of positive tuples that are labeled as positive = TP/(TP+FN)
• Accuracy: percentage of tuples correctly classified = (TP+TN)/(P+N)
• ROC curve and area under the curve (AUC): show the trade-off between true positive rate and false positive rate
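The metrics above can be computed directly from true labels and predictions. The sketch below is a plain-Python illustration with hypothetical function names. It also includes the F1 score, which appears in the results tables, and computes AUC through its rank interpretation: the probability that a randomly chosen positive tuple receives a higher score than a randomly chosen negative one.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of tuples, which is fine for a sketch; a sort-based version would be preferable for large datasets.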

Baseline Model
• Hospital 30-day Heart Failure Readmission Measure submitted by Yale University [Krumholz et al.]
• Used a hierarchical logistic regression model
• The Area under the Curve (AUC) is

Evaluation
• Predictive models are assessed using 10-fold cross-validation
• Performance is compared using the evaluation metrics mentioned previously
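A minimal sketch of how 10-fold cross-validation splits a dataset: each tuple is used for testing exactly once and for training in the other nine folds. The function name is hypothetical, and the split below is a plain shuffled split; the slides do not say whether stratification was used.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A model would be trained on each `train` split and scored on the matching `test` split, and the ten scores averaged to give the reported metric.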

RESULTS

Logistic Regression for 30 days (charts: area under the curve (AUC), recall)

Logistic Regression for 60 days (charts: area under the curve (AUC), recall)

Naïve Bayes Classifier for 30 days (chart: area under the curve (AUC))

Support Vector Machine for 30 days (chart: area under the curve (AUC))

Results of Logistic Regression
Columns: Attribute, Precision, Recall, Accuracy, F1 score, AUC
Rows: 30 days time frame, 60 days time frame

Results of Naïve Bayes
Columns: Attribute, Precision, Recall, Accuracy, F1 score, AUC
Rows: 30 days time frame, 60 days time frame

Results of Support Vector Machine
Columns: Attribute, Precision, Recall, Accuracy, F1 score, AUC
Rows: 30 days time frame, 60 days time frame

Comparison of AUC of Different Models
Columns: Baseline model, Logistic regression, Naïve Bayes, Support Vector Machine
Rows: 30 days time frame, 60 days time frame

Conclusion and Discussion
• Predicting readmission risk is a difficult problem to solve
• Feature selection gives the best results
• Data balancing improves the recall of the model

Future Work
• Investigate other classification techniques such as ensemble methods and neural networks
• Explore additional features and study their relevance
• Employ other feature selection techniques
• Devise a method to impute missing values
• Deploy the predictive models

Acknowledgement
• MultiCare Health System (MHS) and Dr. Lester Reed for giving us this opportunity
• Data architects and domain experts in MHS for their inputs
• Professors Dr. Ankur Teredesai and Dr. Senjuti Basu Roy for their guidance
• Other team members in UWT for their support

References
• S. F. Jencks, M. V. Williams, and E. A. Coleman, “Rehospitalizations among Patients in the Medicare Fee-for-Service Program,” New England Journal of Medicine, vol. 360, no. 14, pp. 1418–1428, 2009.
• J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
• H. M. Krumholz, S. L. T. Normand, P. S. Keenan, Z. Q. Lin, E. E. Drye, K. R. Bhat, Y. F. Wang, J. S. Ross, J. D. Schuur, and B. D. Stauffer, Hospital 30-day heart failure readmission measure methodology. Report prepared for the Centers for Medicare & Medicaid Services.

Questions