Who Am I
Yin Aphinyanaphongs (yinformatics.com)
MD, PhD from Vanderbilt University in Nashville, TN. Assistant Professor in the Center for Health Informatics and Bioinformatics.
Primary Expertise: Machine Learning, Predictive Modeling, Text Classification, Data Mining, Social Media, Large Medical Datasets
Secondary Expertise: Search Engine Design/Information Retrieval, Natural Language Processing
What I Teach Introduction to Biomedical Informatics. Introduction to Medicine for Computer Scientists. Data Analytics in R for physicians.
Machine Learning Examples
- Given an email, classify it as spam or not spam.
- Given a handwritten digit, assign it the right number.
- Given descriptions of passengers on the Titanic, predict who will or will not survive.
- Given a gene expression microarray of a cancer, predict whether the cancer will or will not metastasize.
Email Spam Text Classification http://blog.cyren.com/uploads/blog/google-docs-spam-sample.jpg
Predicting Titanic Survival
Variables: Passenger class, Name, Sex, Age, Number of siblings/spouses aboard, Number of parents/children aboard, Ticket number, Passenger fare, Cabin, Port of embarkation
https://www.kaggle.com/c/titanic-gettingStarted
Molecular Signatures A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest. [Figure: heatmap from Golub et al. (1999)]
Goal Construct algorithms that learn from data such that a model built from training data will generalize to unseen data.
General Framework
1. Obtain Seq
2. Sample Seq (Optional)
3. Label Seq
4. Clean Seq
5. Encode Seq
6. Build a Model
7. Performance Evaluation (Internal)
8. Model Application and Validation (External)
Basic Framework
[Figure: labeled examples (ALL, AML) are given to a classification algorithm, which then labels unseen examples as ALL or AML]
Classification algorithms: Random Forests, Regularized Logistic Regression, Support Vector Machines, etc.
+ Key Concept – Supervised Learning From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” Statnikov, Aliferis, Hardin, Guyon
Principles and geometric representation for supervised learning (1/7) Want to classify objects as boats and houses.
Principles and geometric representation for supervised learning (2/7) All objects before the coastline are boats and all objects after the coastline are houses. The coastline serves as a decision surface that separates the two classes.
Principles and geometric representation for supervised learning (3/7) These boats will be misclassified as houses. This house will be misclassified as a boat.
Principles and geometric representation for supervised learning (4/7) The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example. First, all objects are represented geometrically. [Figure: boats and houses plotted by Longitude and Latitude]
Principles and geometric representation for supervised learning (5/7) Then the algorithm seeks to find a decision surface that separates the classes of objects. [Figure: boats and houses plotted by Longitude and Latitude]
Principles and geometric representation for supervised learning (6/7) Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if they fall above it. [Figure: unseen objects, marked “?”, plotted by Longitude and Latitude on either side of the decision surface]
Principles and geometric representation for supervised learning (7/7) [Figure: Object #1, Object #2, and Object #3 plotted by Longitude and Latitude]
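The boats-and-houses picture can be sketched in a few lines of code. This is only an illustration: the weights `w`, the offset `b`, and the three unseen points are made-up values, not taken from the slides. A real classification algorithm would learn the decision surface from labeled examples.

```python
# Hypothetical 2-D example: each object is a (longitude, latitude) point.
# The decision surface is the line w[0]*lon + w[1]*lat + b = 0.
# The weights below are invented for illustration only.

def classify(point, w=(-1.0, 1.0), b=0.0):
    """Label a point by which side of the decision line it falls on."""
    lon, lat = point
    score = w[0] * lon + w[1] * lat + b
    # Points above the line (positive score) are houses; the rest are boats.
    return "house" if score > 0 else "boat"

# Unseen (new) objects, also invented for illustration.
unseen = [(2.0, 1.0), (1.0, 3.0), (4.0, 3.5)]
labels = [classify(p) for p in unseen]
print(labels)
```

With these assumed weights the surface is the line latitude = longitude, so an object is called a house only when its latitude exceeds its longitude.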
+ Key Concept – Overfitting, Underfitting From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” Statnikov, Aliferis, Hardin, Guyon
Two problems: Over-fitting & Under-fitting
- Over-fitting (a model to your data) = building a model that is good in original data but fails to generalize well to new/unseen data
- Under-fitting (a model to your data) = building a model that is poor in both original data and new/unseen data
Over/under-fitting are related to the complexity of the decision surface and how well the training data is fit.
Scenario 1 [Figure: Outcome of Interest Y plotted against Predictor X, showing training data and future data. Two candidate fits are labeled “This line is good!” and “This line overfits!”]
Scenario 2 [Figure: Outcome of Interest Y plotted against Predictor X, showing training data and future data. Two candidate fits are labeled “This line is good!” and “This line underfits!”]
Very important concept… Successful data analysis methods balance training data fit with complexity.
- A signature that is too complex (in order to fit the training data well) leads to overfitting (i.e., the signature does not generalize).
- A signature that is too simplistic (in order to avoid overfitting) leads to underfitting (it will generalize, but its fit to both the training and future data will be poor and its predictive performance low).
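The trade-off can be seen on a small synthetic dataset. The sketch below is an assumption-laden toy, not the data behind the scenario figures: the outcome follows y = 2x plus a fixed ±1 disturbance, and three hypothetical models of increasing complexity are compared on training and future data.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Too simple: always predict the average training outcome."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Too complex: return the outcome of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Synthetic data (invented): y = 2x with an alternating +1/-1 disturbance.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
future_x = [i + 0.25 for i in range(10)]
future_y = [2 * x + (-1.0 if i % 2 == 0 else 1.0) for i, x in enumerate(future_x)]

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, future_x, future_y))
    print(f"{name:10s} train MSE={results[name][0]:.2f} future MSE={results[name][1]:.2f}")
```

In this toy setup the memorizer fits the training data perfectly but does worse than the line on future data (overfitting), while the mean predictor is poor on both (underfitting).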
+ Key Concept – Performance Estimation From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” Statnikov, Aliferis, Hardin, Guyon
On estimation of classifier accuracy
- Large sample case: use hold-out validation (split the data once into a train set and a test set).
- Small sample case: use N-fold cross-validation (split the data into N folds and let each fold serve once as the test set while the rest is used for training).
Other versions of this general notion…
- Leave-one-out cross-validation
- Leave-pair-out cross-validation
- Bootstrap
- Single holdout
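A minimal sketch of how N-fold cross-validation partitions the data, assuming samples are simply indexed 0 to n-1. The function name `k_fold_indices` is hypothetical, and the sketch deliberately omits shuffling and class stratification, which practical implementations usually add.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at offset `fold`, is held out.
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

folds = list(k_fold_indices(n_samples=72, n_folds=4))
for train, test in folds:
    print(len(train), "train samples,", len(test), "test samples")
```

Setting `n_folds` equal to `n_samples` gives leave-one-out cross-validation, and using only the first fold corresponds to a single holdout.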
+ Key Concept – The Support Vector Machine From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” Statnikov, Aliferis, Hardin, Guyon
The Support Vector Machine (SVM) approach for building molecular signatures
The support vector machine (SVM) is a binary classification algorithm. SVMs are important because of:
(a) theoretical reasons:
 - Robust to a very large number of variables and small samples
 - Can learn both simple and highly complex classification models
 - Employ sophisticated mathematical principles to avoid overfitting
(b) superior empirical results.
Main ideas of SVMs (1/3) Consider an example dataset described by 2 genes, gene X and gene Y. Represent patients geometrically (by “vectors”). [Figure: cancer patients and normal patients plotted by Gene X and Gene Y]
Main ideas of SVMs (2/3) Find a linear decision surface (“hyperplane”) that can separate the patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”). [Figure: cancer patients and normal patients plotted by Gene X and Gene Y, with the gap marked]
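To make the “gap” concrete, the sketch below measures how far the nearest patient lies from a candidate hyperplane w·x + b = 0, using the standard point-to-hyperplane distance |w·x + b| / ||w||. The patient coordinates and both candidate hyperplanes are invented for illustration; an actual SVM solver searches for the best hyperplane rather than comparing two guesses.

```python
import math

def margin(points, labels, w, b):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. A positive result means every point is on its
    correct side, and the value is the distance to the closest point.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical (gene X, gene Y) expression values.
patients = [(3, 3), (4, 3), (3, 4), (1, 1), (0, 1), (1, 0)]
classes = [+1, +1, +1, -1, -1, -1]  # +1 = cancer, -1 = normal

margin_a = margin(patients, classes, w=(1, 1), b=-4)  # diagonal line x + y = 4
margin_b = margin(patients, classes, w=(1, 0), b=-2)  # vertical line x = 2
print(margin_a, margin_b)
```

Both lines separate the two classes, but the diagonal one keeps the nearest patients farther away, so it is the one a maximum-margin method would prefer among these two.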
Main ideas of SVMs (3/3) If such a linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found. The feature space is constructed via a very clever mathematical projection (the “kernel trick”).
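One way to see what the kernel trick buys is a degree-2 polynomial kernel in two dimensions, chosen here purely as an example (the slides do not name a specific kernel). The kernel value computed in the original space equals the dot product of the explicitly mapped vectors, so the feature space never has to be built.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the kernel (x.z)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Same quantity computed without ever constructing phi(x) or phi(z)."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)  # arbitrary illustrative points
explicit = dot(phi(x), phi(z))   # dot product in the 3-D feature space
implicit = poly2_kernel(x, z)    # kernel evaluated in the 2-D input space
print(explicit, implicit)
```

For higher-degree or Gaussian kernels the explicit feature space becomes very large or infinite, which is why working only through the kernel function matters.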
+ Key Concept - Curse of Dimensionality Thanks to Dr. Gutierrez-Osuna - http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf.
The range of features in higher-dimensional data includes:
- 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
- >500,000 (exon arrays/tiled microarrays/SNP arrays)
- 10,000-300,000 (MS proteomics)
- >10,000,000 (LC-MS proteomics)
- >100,000,000 (next-generation sequencing)
High dimensionality in small samples causes:
- Some methods do not run at all (classical regression)
- Some methods give bad results (KNN, decision trees)
- Very slow analysis
- Very expensive/cumbersome clinical application
- A tendency to “overfit”
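A rough way to see why a small sample becomes sparse as features are added: divide each feature into a few bins and count how many samples are available per cell of the resulting grid. The bin count of 3 and the sample size of 72 are illustrative assumptions, and `samples_per_cell` is a hypothetical helper.

```python
def samples_per_cell(n_samples, bins_per_feature, n_features):
    """Average number of samples per grid cell when each feature is binned."""
    cells = bins_per_feature ** n_features  # grid size grows exponentially
    return n_samples / cells

for n_features in (1, 2, 3, 10):
    density = samples_per_cell(n_samples=72, bins_per_feature=3, n_features=n_features)
    print(f"{n_features:2d} features: {density:.4f} samples per cell")
```

With only ten features there is already far less than one sample per cell, and the datasets listed above have thousands to millions of features.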
+ Cancer Classification Case Study From Golub et al. (1999)
Case Study
Classify the values of a gene microarray according to leukemia type (AML or ALL).
Task meta-data:
- 72 samples (47 ALL, 25 AML)
- 5,327 genes
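Encode Microarray Within each train fold, normalize the values of each column to the range 0 to 1, and scale the matching test fold with the values learned from that train fold. Notice that we don’t normalize the entire dataset and then run our classification algorithms: that would let information from the test folds leak into training, a form of overfitting that makes the performance estimate too optimistic.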
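A minimal sketch of fold-wise normalization, assuming each sample is a list of numeric column values. The helper names `fit_min_max` and `apply_min_max` and the tiny dataset are hypothetical. The key point is that the minimum and maximum come from the train fold only and are then reused on the test fold.

```python
def fit_min_max(train_rows):
    """Learn the (min, max) of each column from the train fold only."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, ranges):
    """Scale rows with previously learned ranges; constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]

# Invented values standing in for two microarray columns.
train_fold = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
test_fold = [[5.0, 400.0]]

ranges = fit_min_max(train_fold)              # uses train data only
train_scaled = apply_min_max(train_fold, ranges)
test_scaled = apply_min_max(test_fold, ranges)
print(train_scaled, test_scaled)
```

Test values may land outside the 0 to 1 range because the test fold played no part in choosing the scaling, which is the intended behavior.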
Build a Model - Support Vector Machine [Figure: scatter plot of points in two dimensions] This example illustrates a 2-dimensional space. The x and y axes represent one word each. A full text categorization example could contain upwards of 50,000 words and thus 50,000 dimensions.
Build a Model – K nearest neighbors http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/knn.html
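As a rough illustration of the K nearest neighbors idea named on this slide, the sketch below labels a new sample by majority vote among its k closest training samples, using Euclidean distance. The coordinates, the choice k=3, and the function name `knn_predict` are assumptions for the example, not details of the case study.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Invented two-gene expression values for six labeled samples.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

print(knn_predict(points, labels, (1.5, 1.5)))
print(knn_predict(points, labels, (5.5, 5.0)))
```

There is no training step beyond storing the labeled samples, which is part of why the method struggles when the number of features is very large.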
Build a Model – Neural Network http://en.wikipedia.org/wiki/Artificial_neural_network
Estimate Performance Small sample case: use N-fold cross-validation. [Figure: the data split repeatedly into train and test folds]
Results
Proportion of correct classifications:
- Baseline (all in one class): 65.0%
- Support Vector Machine: 91.7%
- K Nearest Neighbors: 87.9%
- Neural Network: 84.7%
Conclusions
- Machine Learning Examples
- Key Concepts: Supervised Learning, Overfitting/Underfitting, Support Vector Machines, Cross Validation, Curse of Dimensionality
- Case Study – Cancer Classification
Thanks. Dr. Gutierrez-Osuna Dr. Alexander Statnikov
+ Molecular Signatures Slides from Dr Alexander Statnikov PhD.
Definition of a molecular signature A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.
Example of a molecular signature [Figure: patient with lung cancer, biopsy, gene expression profile, and molecular signature, with the outcomes Primary Lung Cancer and Metastatic Lung Cancer]
Main uses of molecular signatures
1. Direct benefits: models of disease phenotype/clinical outcome
 - Diagnosis
 - Prognosis, long-term disease management
 - Personalized treatment (drug selection, titration)
2. Ancillary benefits 1: biomarkers for diagnosis or outcome prediction
 - Make the above tasks resource efficient and easy to use in clinical practice
 - Helps next-generation molecular imaging
 - Leads for potential new drug candidates
3. Ancillary benefits 2: discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)
 - Leads for potential new drug candidates
Recent molecular signatures available for patient care
OvaSure, Agendia, Clarient, Prediction Sciences, Veridex, LabCorp, University Genomics, Genomic Health, BioTheranostics, Applied Genomics, Power3, Correlogic Systems
MammaPrint
- Developed by Agendia (www.agendia.com)
- 70-gene signature to stratify women with breast cancer that hasn’t spread into “low risk” and “high risk” for recurrence of the disease
- Independently validated in >1,000 patients
- So far performed >10,000 tests
- Cost of the test is ~$3,000
- In February 2007, the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm
- TIME Magazine’s 2007 “medical invention of the year”
Oncotype DX Breast Cancer Assay (Launched in 2004)
- Developed by Genomic Health (www.genomichealth.com)
- 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse
- Independently validated in thousands of patients
- So far performed >200,000 tests
- Price of the test is $4,175
- Not FDA approved but covered by most insurances, including Medicare
- Its sales in 2012 reached $199M
Economic validity
In a 2005 economic analysis of the Recurrence Score result in LN-, ER+ patients receiving tamoxifen, Hornberger et al. performed a cost-utility analysis using a decision analytic model. Using the model, the Recurrence Score result was predicted on average to increase quality-adjusted survival by 16.3 years and reduce overall costs by $155,128.
Instead of using the model, economic benefits can now be assessed from the published clinical utility of the test and actual health plan costs for adjuvant chemotherapy. For example, in a 2 million member plan, approximately 773 women are eligible for the test. If half receive the test, given the high and increasing cost of adjuvant chemotherapy, supportive care and management of adverse events, the use of the Oncotype DX assay is estimated to save approximately $1,930 per woman tested (given an aggregate 34% reduction in chemotherapy use).
References about health benefits and cost-effectiveness:
- “Economic Analysis of Targeting Chemotherapy Using a 21-Gene RT-PCR Assay in Lymph Node-Negative, Estrogen Receptor-Positive, Early-Stage Breast Cancer” Am J Manag Care. 2005; 11(5):313-324.
- “Impact of a 21-Gene RT-PCR Assay on Treatment Decisions in Early-Stage Breast Cancer, An Economic Analysis Based on Prognostic and Predictive Validation Studies” Cancer. 2007; 109(6):1011-1018.
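The health plan example on this slide can be turned into a back-of-the-envelope total. The inputs are the figures quoted on the slide; the plan-level total is a simple multiplication added here for illustration and is not a number reported in the cited analyses.

```python
# Figures quoted on the slide.
eligible_women = 773               # eligible in a 2 million member plan
fraction_tested = 0.5              # "if half receive the test"
savings_per_woman_tested = 1930    # estimated savings in dollars

# Derived quantities (illustrative arithmetic only).
women_tested = eligible_women * fraction_tested
total_savings = women_tested * savings_per_woman_tested
print(f"{women_tested:.1f} women tested, about ${total_savings:,.0f} saved per plan")
```

Under these assumptions the estimate comes to roughly three quarters of a million dollars for the plan.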
Oncotype DX Colon Cancer Assay (Launched in 2010)
- Developed by Genomic Health (www.genomichealth.com)
- Multigene signature to predict risk of recurrence in patients with stage II colon cancer
- Independently validated in thousands of patients
- Price of the test is $3,280
- Not FDA approved but covered by most insurances, including Medicare
Oncotype DX Prostate Cancer Assay (Launched in 2013)
- Developed by Genomic Health (www.genomichealth.com)
- Multigene signature to distinguish aggressive prostate cancer from less threatening forms
- Independently validated
- Price of the test is $3,820
- Not FDA approved but covered by most insurances, including Medicare
Oncotype DX Business Metrics Data from http://investor.genomichealth.com/
Conclusions
- Machine Learning Examples
- Key Concepts: Supervised Learning, Overfitting/Underfitting, Support Vector Machines, Cross Validation, Curse of Dimensionality
- Case Study – Cancer Classification
- Case Study – Molecular Signatures