Biomarker and Classifier Selection in Diverse Genetic Datasets
  James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2)
  University of Connecticut
  (1) Department of Computer Science and Engineering
  (2) Department of Molecular and Cell Biology
Motivation 1: Cell-type Identification
  The question: what is the smallest number of genes needed to identify each cluster?
    B: Bone
    C: Myeloid
    D: Endothelial
  Available data: literature-annotated present/absent calls for 50 cell types and 600 genes in the mesoderm lineage.
  In collaboration with Dr. Hector Leonardo Aguila, UCHC
Motivation 2: Clinical Diagnostics
  Source: "Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis", PLoS One 2012
  Study          # genes   Sensitivity (%)   Specificity (%)
  Lequerre
  Stuhlmuller
  Stuhlmuller
  Lequerre       8         71                28
  Sekiguchi
  Julia          8         92                17
  Stuhlmuller    3         71                17
  Tanino         8         67                33
Multi-class Classification Problem
  Multi-class classification
    There are 2 or more classes
    Supervised learning
  Key problems:
    1. Feature selection: what are the most predictive biomarkers?
    2. Classification: what is the best classification algorithm?
Challenges
  Different types of data
    Gene expression
    Epigenetic data
      Methylation
      Histone modification
    Proteomics
    Metabolomics
    Phenotypes
  Different platforms
    Microarray
    Sequencing
    In-situ hybridization
  Different resolutions
    Discrete vs. continuous
    Sparse vs. complete
Minimal Unique Marker Panel Selection (Mumps) Pipeline
  Input: # of biomarkers
  Nested cross-validation:
    1. Parameterize each combination of feature selection and classification algorithms
    2. Inner cross-validation: rank models by AUC
    3. Outer cross-validation
  Output: the best features and classifier
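The nested cross-validation loop on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Mumps implementation: the two selectors (`select_by_spread`, `select_first`), the two classifiers (`nearest_centroid`, `one_nn`), and the toy data are hypothetical stand-ins for the algorithms in the deck, and accuracy replaces AUC as the ranking score to keep the example short.

```python
import random
from itertools import product

def stratified_folds(labels, k, seed=0):
    """Split sample positions into k folds, spreading each class across folds."""
    rng = random.Random(seed)
    by_class = {}
    for pos, label in enumerate(labels):
        by_class.setdefault(label, []).append(pos)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for j, pos in enumerate(members):
            folds[j % k].append(pos)
    return folds

def column_mean(rows, j):
    return sum(r[j] for r in rows) / len(rows)

def select_by_spread(X, y, n):
    """Toy selector: keep the n features whose per-class means differ most."""
    classes = sorted(set(y))
    def spread(j):
        means = [column_mean([x for x, t in zip(X, y) if t == c], j) for c in classes]
        return max(means) - min(means)
    return sorted(range(len(X[0])), key=spread, reverse=True)[:n]

def select_first(X, y, n):
    """Toy baseline selector: keep the first n features regardless of labels."""
    return list(range(n))

def nearest_centroid(X, y):
    """Toy classifier: assign the class with the closest mean profile."""
    classes = sorted(set(y))
    cents = {c: [column_mean([x for x, t in zip(X, y) if t == c], j)
                 for j in range(len(X[0]))] for c in classes}
    return lambda x: min(classes, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def one_nn(X, y):
    """Toy classifier: copy the label of the closest training sample."""
    return lambda x: y[min(range(len(X)), key=lambda i: sum((a - b) ** 2 for a, b in zip(x, X[i])))]

def evaluate(select, make_clf, X, y, train, test, n):
    """Fit selector and classifier on train indices; return accuracy on test indices."""
    Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
    feats = select(Xtr, ytr, n)
    predict = make_clf([[x[j] for j in feats] for x in Xtr], ytr)
    return sum(predict([X[i][j] for j in feats]) == y[i] for i in test) / len(test)

def nested_cv(X, y, selectors, classifiers, n, k=3):
    chosen, outer_scores = [], []
    for test in stratified_folds(y, k):                      # outer loop: honest estimate
        train = [i for i in range(len(y)) if i not in test]
        inner = stratified_folds([y[i] for i in train], k, seed=1)
        def inner_score(combo):                              # inner loop: model ranking
            total = 0.0
            for val in inner:
                val_idx = [train[p] for p in val]
                fit_idx = [i for i in train if i not in val_idx]
                total += evaluate(selectors[combo[0]], classifiers[combo[1]],
                                  X, y, fit_idx, val_idx, n)
            return total / k
        best = max(product(selectors, classifiers), key=inner_score)
        chosen.append(best)
        outer_scores.append(evaluate(selectors[best[0]], classifiers[best[1]],
                                     X, y, train, test, n))
    return chosen, sum(outer_scores) / len(outer_scores)

# Hypothetical toy data: 3 classes x 6 samples; features 0-3 are noise, 4-5 carry signal.
rng = random.Random(42)
y = [c for c in ("B", "C", "D") for _ in range(6)]
shift = {"B": 0.0, "C": 5.0, "D": 10.0}
X = [[rng.gauss(0, 1) for _ in range(4)]
     + [shift[c] + rng.gauss(0, 0.5), -shift[c] + rng.gauss(0, 0.5)] for c in y]
selectors = {"spread": select_by_spread, "first": select_first}
classifiers = {"centroid": nearest_centroid, "1nn": one_nn}
chosen, score = nested_cv(X, y, selectors, classifiers, n=2)
print(chosen, score)
```

The point of the nesting is that the outer test fold never influences which selector/classifier pair is picked, so the outer score is not inflated by model selection.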
Algorithms
  Feature selection
    Support vector machine recursive feature elimination (SVM-RFE)
    ANOVA F-value
    Random Forests
    Extra Trees
  Classification
    Correlation
    Cosine
    K-Nearest Neighbors (KNN)
    Support Vector Machine (SVM)
    Decision Tree
    Random Forests
    Extra Trees
    Gradient Boosting
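The deck does not define the "Correlation" and "Cosine" classifiers. One plausible reading, assumed here, is a nearest-centroid rule that assigns a sample to the cell type whose mean profile is most similar under Pearson correlation or cosine similarity. The sketch below follows that assumption; the function names and the toy expression values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def fit_centroids(X, y):
    """Average the training profiles of each class into one centroid per class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(col) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(centroids, x, similarity=cosine):
    """Return the class whose centroid is most similar to profile x."""
    return max(centroids, key=lambda label: similarity(centroids[label], x))

# Hypothetical toy profiles over three markers.
X = [[9, 1, 0], [8, 2, 1], [1, 9, 0], [0, 8, 2]]
y = ["bone", "bone", "myeloid", "myeloid"]
centroids = fit_centroids(X, y)
by_cosine = predict(centroids, [7, 1, 1], cosine)
by_pearson = predict(centroids, [1, 7, 0], pearson)
print(by_cosine, by_pearson)
```

Both similarities ignore overall signal magnitude, which may be one reason such classifiers transfer reasonably between platforms with different intensity scales.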
Datasets
  From Broad Institute
    Affymetrix gene expression microarray
    15 hematopoietic cell types
    82 samples, 4-7 samples per cell type
  Multiple sources
    70 samples, approximately 3-7 samples per cell type
    Affymetrix and Illumina Bead Array
    Different labs
Experiments
  Complete
    Complete gene expression profile from microarray datasets
  Simulated sparse
    70% and 50% missing data
    Coverage of a marker (the fraction of cell types with a known expression status for that marker) followed a Beta distribution
    Fifteen simulations
  Cross-validation
    3-fold, stratified
    # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
    Best set of features and classifier for each # features
  External validation
    Use Broad data as training
    Test against external datasets
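The sparse simulation can be illustrated as follows. This is a sketch under assumptions, since the slide gives neither the Beta parameters nor the masking procedure: here each marker draws its coverage from a Beta distribution whose mean equals one minus the target missing rate, and that fraction of cell types is kept at random. The `concentration` parameter, the function name, and the toy profile are all hypothetical.

```python
import random

def simulate_sparse(profile, missing=0.7, concentration=10.0, seed=0):
    """Hide expression calls marker by marker.

    profile: dict marker -> dict cell_type -> value.
    Returns the same structure with hidden entries set to None.
    """
    rng = random.Random(seed)
    # Beta(alpha, beta) has mean alpha / (alpha + beta) = 1 - missing.
    alpha = (1.0 - missing) * concentration
    beta = missing * concentration
    sparse = {}
    for marker, values in profile.items():
        coverage = rng.betavariate(alpha, beta)      # fraction of cell types kept
        cell_types = list(values)
        n_known = round(coverage * len(cell_types))
        known = set(rng.sample(cell_types, n_known))
        sparse[marker] = {c: (v if c in known else None) for c, v in values.items()}
    return sparse

# Hypothetical complete profile: 200 markers x 15 cell types, all "present".
cell_types = [f"type{i}" for i in range(15)]
profile = {f"gene{g}": {c: 1 for c in cell_types} for g in range(200)}
sparse = simulate_sparse(profile, missing=0.7)
hidden = sum(v is None for row in sparse.values() for v in row.values())
missing_fraction = hidden / (200 * 15)
print(round(missing_fraction, 2))
```

Repeating the call with fifteen different `seed` values would give fifteen independent sparse datasets, matching the number of simulations listed on the slide.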
Performance: Complete Data
By Algorithm: Complete Data
Performance: 70% Missing
Summary: Best Algorithms
  (FS = feature selection, CL = classifier)
                  Complete               70% missing
  # of markers    FS     CL              FS     CL
  2               RFE    KNN             RFE    Extra Trees
  8               RFE    Cosine          RFE    Cosine
  16              RFE    Cosine          RFE    Cosine
  32              RFE    Cosine          RFE    Cosine
  64              RFE    Cosine          RFE    Cosine
  96              RFE    Cosine          RFE    Correlation
  128             RFE    Cosine          RFE    Correlation
  256             RFE    Cosine          RFE    Correlation
  384             RFE    Correlation     RFE    Correlation
Why the Big Gap?
  Cross-platform normalization
  Similarities in cell types
  Over-fitting
  Correlation: Broad vs. External
Results on the Motivating Problems
  Mesoderm Cell-type Identification
    # genes   AUC
    8         73%
  Anti-TNF Responsiveness
    Study          # genes   Sensitivity (%)   Specificity (%)
    Lequerre
    Stuhlmuller
    Stuhlmuller
    Lequerre       8         71                28
    Sekiguchi
    Julia          8         92                17
    Stuhlmuller    3         71                17
    Tanino         8         67                33
    UCONN883
    UCONN
Future Work
  Broader data types
    NCI-60: microarray mRNA, microarray microRNA, copy number variation, protein array, SNPs, …
  Minimizing overfitting
    Cross-platform normalization
  Different data types
    Integrate multiple data types simultaneously
Conclusion and Thanks
  Thanks to: Ed Hemphill, Chih Lee, Ion Mandoiu, Craig Nelson
  Smpl Bio: a commercial service coming in late 2013
Don't Go Beyond, Tis a Silly Place
  Extra Slides
Experiment Overview
  Input: # of biomarkers
  Broad data, nested cross-validation:
    1. Parameterize each combination of feature selection and classification algorithms
    2. Inner cross-validation: rank models by AUC
    3. Outer cross-validation
    Output the best features and classifier
  External testing:
    Test best model
    Output: AUC of best features / classifier
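The slides report AUC for a multi-class problem without saying how it is averaged. A common choice, assumed here and not confirmed by the deck, is the macro-averaged one-vs-rest AUC: compute a binary AUC for each cell type against all others, then take the mean. The sketch below uses the rank-based definition of AUC; the function names, class labels, and scores are hypothetical.

```python
def binary_auc(scores, is_positive):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(score_rows, labels):
    """Average the one-vs-rest AUC over classes.

    score_rows[i][c] is the classifier's score for class c on sample i.
    """
    classes = sorted(set(labels))
    aucs = [binary_auc([row[c] for row in score_rows], [y == c for y in labels])
            for c in classes]
    return sum(aucs) / len(aucs)

# Hypothetical scores for four samples over three cell types.
labels = ["B", "C", "D", "B"]
score_rows = [
    {"B": 0.80, "C": 0.10, "D": 0.10},
    {"B": 0.20, "C": 0.70, "D": 0.10},
    {"B": 0.30, "C": 0.30, "D": 0.40},
    {"B": 0.25, "C": 0.50, "D": 0.25},
]
auc = macro_ovr_auc(score_rows, labels)
print(auc)
```

In this toy case class B scores 0.75 (the second B sample is outranked by one non-B sample) while C and D are ranked perfectly, so the macro average is 11/12.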
Performance: 50% Missing