Biomarker and Classifier Selection in Diverse Genetic Datasets. James Lindsay, Ed Hemphill, Chih Lee, Ion Mandoiu, Craig Nelson. University of Connecticut.

Presentation transcript:

Biomarker and Classifier Selection in Diverse Genetic Datasets
James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2)
University of Connecticut
(1) Department of Computer Science and Engineering
(2) Department of Molecular and Cell Biology

Motivation 1: Cell-type Identification
The Question: What is the smallest # of genes needed to identify each cluster?
- B: Bone
- C: Myeloid
- D: Endothelial
Available Data: literature-annotated present/absent calls for 50 cell types and 600 genes in the mesoderm lineage.
In collaboration with: Dr. Hector Leonardo Aguila, UCHC
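
To make the question concrete, here is a minimal sketch of one way to pick a small marker panel from present/absent annotations. It is a greedy heuristic chosen for illustration, not the method used in the talk, and the gene names and annotations are made up.

```python
from itertools import combinations


def greedy_marker_panel(present):
    """Pick genes until every pair of cell types differs on at least one gene.

    present: dict mapping cell type -> set of genes annotated as 'present'.
    Returns (panel, unresolved), where unresolved holds pairs of cell types
    whose annotations are identical and therefore cannot be separated.
    """
    genes = sorted(set().union(*present.values()))
    unresolved = set(combinations(sorted(present), 2))
    panel = []

    def separated(gene):
        # Pairs that this gene tells apart (present in one type, absent in the other).
        return {(a, b) for a, b in unresolved
                if (gene in present[a]) != (gene in present[b])}

    while unresolved:
        best = max(genes, key=lambda g: len(separated(g)))
        gained = separated(best)
        if not gained:
            break  # the remaining pairs have identical profiles
        panel.append(best)
        unresolved -= gained
    return panel, unresolved


# Hypothetical toy annotations (not data from the talk).
toy = {
    "bone": {"g1", "g2"},
    "myeloid": {"g1", "g3"},
    "endothelial": {"g4"},
}
panel, unresolved = greedy_marker_panel(toy)
print(panel, unresolved)
```

A greedy choice does not guarantee the smallest possible panel; it only gives a quick upper bound on the panel size.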

Motivation 2: Clinical Diagnostics
Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS One 2012
Study          # genes   Sensitivity (%)   Specificity (%)
Lequerre
Stuhlmuller
Stuhlmuller
Lequerre       8         71                28
Sekiguchi
Julia          8         92                17
Stuhlmuller    3         71                17
Tanio          8         67                33

Multi-class Classification Problem
Multi-class Classification:
- There are 2 or more classes
- Supervised learning
Key Problems:
1. Feature Selection: What are the most predictive biomarkers?
2. Classification: What is the best classification algorithm?

Challenges
Different types of data:
- Gene expression
- Epigenetic data: Methylation, Histone modification
- Proteomics
- Metabolomics
- Phenotypes
Different Platforms:
- Microarray
- Sequencing
- In-situ hybridization
Different Resolutions:
- Discrete vs Continuous
- Sparse vs Complete

Minimal Unique Marker Panel Selection (Mumps) Pipeline
Input: # of biomarkers
Nested Cross Validation:
- Outer Cross-validation
- Inner Cross-validation: parameterize each combination of feature selection and classification algorithms (Feature Selection, then Classification), and rank models by AUC
Output: the best features and classifier
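
The nested cross-validation idea can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the Mumps implementation: the data are synthetic, only two feature selectors and two classifiers are tried, and the multi-class AUC is taken as one-vs-rest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix (samples x genes), 3 classes.
X, y = make_classification(n_samples=90, n_features=40, n_informative=6,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

n_markers = 8  # input: number of biomarkers to select

# Placeholder steps; the grid below swaps in each selector/classifier combination.
pipe = Pipeline([("select", "passthrough"), ("clf", KNeighborsClassifier())])
param_grid = {
    "select": [
        SelectKBest(f_classif, k=n_markers),                        # ANOVA F-value
        RFE(SVC(kernel="linear"), n_features_to_select=n_markers),  # SVM-RFE
    ],
    "clf": [
        KNeighborsClassifier(n_neighbors=3),
        ExtraTreesClassifier(n_estimators=50, random_state=0),
    ],
}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: rank every combination by AUC and keep the best one.
search = GridSearchCV(pipe, param_grid, scoring="roc_auc_ovr", cv=inner)

# Outer loop: estimate how well the whole selection procedure generalizes.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc_ovr", cv=outer)
print("outer AUC per fold:", outer_auc)

# Refit on all data to report the chosen features and classifier.
search.fit(X, y)
print("best combination:", search.best_params_)
```

Feature selection sits inside the pipeline so that it is refit on each training fold; selecting features on the full dataset first would leak information into the test folds.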

Algorithms
Feature Selection:
- SVM-recursive feature elimination (RFE)
- ANOVA F-value
- Random Forests
- Extra Trees
Classification:
- Correlation
- Cosine
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Decision Tree
- Random Forests
- Extra Trees
- Gradient Boosting
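
The slides do not define the "Correlation" and "Cosine" classifiers. One plausible reading, assumed here purely for illustration, is a nearest-centroid rule: average the training profiles of each cell type, then assign a new sample to the cell type whose average profile is most similar under Pearson correlation or cosine similarity. The toy values below are made up.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(u, v):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])


def class_centroids(X, y):
    """Average the training profiles of each class, feature by feature."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(x, centroids, similarity=cosine):
    """Return the label whose centroid is most similar to x."""
    return max(centroids, key=lambda label: similarity(x, centroids[label]))


# Hypothetical expression values for three markers in two cell types.
X = [[9.0, 1.0, 1.0], [8.0, 2.0, 1.0], [1.0, 9.0, 2.0], [2.0, 8.0, 1.0]]
y = ["B", "B", "C", "C"]
cents = class_centroids(X, y)
print(predict([7.0, 1.0, 2.0], cents, cosine))
print(predict([1.0, 7.0, 2.0], cents, pearson))
```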

Datasets
From Broad Institute:
- Affymetrix gene expression microarray
- 15 hematopoietic cell types
- 82 samples
- 4-7 samples per cell type
Multiple Sources:
- 70 samples
- Approximately 3-7 samples per cell type
- Affymetrix & Illumina Bead Array
- Different labs

Experiments
Complete:
- Complete gene expression profile from microarray datasets
Simulated Sparse:
- 70% and 50% missing data
- Coverage of a marker followed a Beta distribution (coverage: the fraction of cell types having known expression statuses for a marker)
- Fifteen simulations
Cross-validation:
- 3-fold, stratified
- # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
- Best set of features and classifier for each # features
External validation:
- Use Broad data as training
- Test against external datasets
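
The sparse simulation can be sketched as below. The slides give only the target missing rates and say that per-marker coverage follows a Beta distribution; the Beta parameters are not stated, so this sketch assumes a Beta whose mean equals the target coverage and uses a hypothetical `concentration` knob to control its spread.

```python
import random


def simulate_sparse(profile, missing=0.7, concentration=4.0, seed=0):
    """Mask entries so that each marker keeps a Beta-distributed coverage.

    profile: dict marker -> dict cell type -> expression status.
    missing: target overall fraction of unknown entries.
    concentration: assumed alpha + beta of the Beta distribution (not from the talk).
    Returns a copy of profile with masked entries set to None.
    """
    rng = random.Random(seed)
    alpha = (1.0 - missing) * concentration  # Beta mean = 1 - missing
    beta = missing * concentration
    sparse = {}
    for marker, by_type in profile.items():
        coverage = rng.betavariate(alpha, beta)  # fraction of cell types kept
        cell_types = list(by_type)
        n_known = round(coverage * len(cell_types))
        known = set(rng.sample(cell_types, n_known))
        sparse[marker] = {ct: (value if ct in known else None)
                          for ct, value in by_type.items()}
    return sparse


# Hypothetical complete profile: 300 markers x 15 cell types, all 'present'.
profile = {f"gene{i}": {f"type{j}": 1 for j in range(15)} for i in range(300)}
sparse = simulate_sparse(profile, missing=0.7)

n_total = sum(len(row) for row in sparse.values())
n_missing = sum(value is None for row in sparse.values() for value in row.values())
missing_fraction = n_missing / n_total
print(round(missing_fraction, 2))
```

Repeating the call with different seeds would give independent simulations, in the spirit of the fifteen runs mentioned above.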

Performance: Complete Data

By Algorithm: Complete Data

Performance: 70% Missing

Summary: Best Algorithms
                 Complete               70% missing
# of markers     FS     CL              FS     CL
2                RFE    KNN             RFE    Extra Trees
8                RFE    Cosine          RFE    Cosine
16               RFE    Cosine          RFE    Cosine
32               RFE    Cosine          RFE    Cosine
64               RFE    Cosine          RFE    Cosine
96               RFE    Cosine          RFE    Correlation
128              RFE    Cosine          RFE    Correlation
256              RFE    Cosine          RFE    Correlation
384              RFE    Correlation     RFE    Correlation
(FS: feature selection, CL: classifier)

Why the Big Gap?
- Cross-platform normalization
- Similarities in cell types
- Over-fitting
Figure: Correlation: Broad vs External
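
As an illustration of the cross-platform issue, the sketch below replaces raw intensities with within-sample ranks, so that samples measured on different scales become comparable. Rank normalization is only one common option and is an assumption here; the talk does not say which normalization, if any, was applied. All values are made up.

```python
def rank_normalize(sample):
    """Map each gene's intensity to its within-sample rank, scaled to [0, 1].

    sample: dict gene -> intensity. Tied values receive their average rank.
    """
    genes = sorted(sample, key=sample.get)
    n = len(genes)
    ranks = {}
    i = 0
    while i < n:
        # Extend j over the run of genes tied with genes[i].
        j = i
        while j + 1 < n and sample[genes[j + 1]] == sample[genes[i]]:
            j += 1
        average_rank = (i + j) / 2
        for gene in genes[i:j + 1]:
            ranks[gene] = average_rank / (n - 1) if n > 1 else 0.0
        i = j + 1
    return ranks


# Hypothetical intensities for the same sample on two platforms and scales.
affymetrix = {"g1": 100.0, "g2": 2500.0, "g3": 800.0}
illumina = {"g1": 7.1, "g2": 12.9, "g3": 9.4}

print(rank_normalize(affymetrix))
print(rank_normalize(illumina))
```

Ranks discard the magnitude of expression differences, so this trades some signal for robustness to platform-specific scaling.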

Motivation / Results
Mesoderm Cell-type Identification
# genes   AUC
8         73 %
Anti-TNF Responsiveness
Study          # genes   Sensitivity (%)   Specificity (%)
Lequerre
Stuhlmuller
Stuhlmuller
Lequerre       8         71                28
Sekiguchi
Julia          8         92                17
Stuhlmuller    3         71                17
Tanio          8         67                33
UCONN883
UCONN

Future Work
Broader Data-types:
- NCI-60: microarray mRNA, microarray microRNA, copy number variation, protein array, SNPs, …
Minimizing over-fitting:
- Cross-platform normalization
Different Data types:
- Integrate multiple data types simultaneously

Conclusion and Thanks
Thanks to: Ed Hemphill, Chih Lee, Ion Mandoiu, Craig Nelson
Smpl Bio: a commercial service coming in late 2013

Don’t Go Beyond, ’Tis a Silly Place
Extra Slides

Experiment Overview
Input: # of biomarkers
Nested Cross Validation (Broad Data):
- Outer Cross-validation
- Inner Cross-validation: parameterize each combination of feature selection and classification algorithms (Feature Selection, then Classification), and rank models by AUC
- Output the best features and classifier
External Testing:
- Test Best Model
Output: AUC of best features / classifier

Performance: 50% Missing