SVM-based techniques for biomarker discovery in proteomic pattern data
Elena Marchiori, Department of Computer Science, Vrije Universiteit Amsterdam



Overview
 Variable selection
 SVM-based techniques
 Application to proteomic pattern data
 Results
 Conclusion

Variable Selection
Select a small subset of the input variables (for example, genes in gene expression data, or m/z values in proteomic pattern data) to be used for building the classifier.
Advantages:
 it is cheaper to measure fewer variables
 the resulting classifier is simpler and potentially faster
 prediction accuracy may improve by discarding irrelevant variables
 identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)

Support Vector Machines
Advantages:
 maximize the margin between the two classes in the feature space characterized by a kernel function
 are robust with respect to high input dimension
Disadvantages:
 difficult to incorporate background knowledge
 sensitive to outliers

Binary classification
The decision function is f(x) = sign(w^T x + b): the hyperplane w^T x + b = 0 separates the points with w^T x + b > 0 from those with w^T x + b < 0.
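The decision function can be written in a few lines of NumPy (a minimal sketch; the weight vector and bias below are illustrative values, not taken from the slides):

```python
import numpy as np

# Illustrative weights and bias for a 2-dimensional problem (hypothetical values).
w = np.array([1.0, -2.0])
b = 0.5

def f(x):
    """Linear SVM decision function: sign(w^T x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(f(np.array([3.0, 1.0])))  # 3*1 + 1*(-2) + 0.5 = 1.5 > 0 -> 1
print(f(np.array([0.0, 1.0])))  # 0 - 2 + 0.5 = -1.5 < 0 -> -1
```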

Linear Separators

SVM: separable classes
[Figure: the optimal hyperplane with margin ρ and its support vectors.]
The support vectors uniquely characterize the optimal hyperplane.

SVM and outliers
[Figure: effect of a single outlier on the separating hyperplane.]

SVM-RFE
Recursive Feature Elimination (SVM-RFE) scores each variable by the weight it receives in the linear binary classifier decision function, and at each iteration:
1) eliminates the threshold% of variables with the lowest scores
2) recomputes the scores of the remaining variables
SVM-RFE based algorithms:
 run SVM-RFE with different thresholds
 JOIN: select the variables occurring more than cutoff times
 ENSEMBLE: take the majority vote of the resulting classifiers

SVM-RFE
I. Guyon et al., Machine Learning, 46, 389–422, 2002

SVM-RFE variant
Input: train set, threshold T, number N of variables to be selected
Output: subset of variables of size N
RFE:
 Train: run a linear SVM on the train set
 Score: order the variables by the absolute value of their weight
 Eliminate: remove T% of the variables from the ordered sequence
 Repeat (train, score, eliminate) on the train set restricted to the remaining variables until only N variables are left
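The variant above can be sketched with scikit-learn's LinearSVC (a minimal illustration on synthetic data, not the experiments from the slides; the function name and defaults are mine):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, threshold=0.4, n_select=2):
    """Train a linear SVM, score variables by |weight|, drop the lowest-scoring
    threshold-fraction, and repeat until only n_select variables remain."""
    remaining = np.arange(X.shape[1])
    while remaining.size > n_select:
        w = LinearSVC(dual=False).fit(X[:, remaining], y).coef_.ravel()
        order = np.argsort(np.abs(w))                       # ascending scores
        n_drop = max(1, int(threshold * remaining.size))
        n_drop = min(n_drop, remaining.size - n_select)     # never drop below n_select
        remaining = np.sort(remaining[order[n_drop:]])
    return remaining

# Illustrative data: only variables 0 and 1 carry class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
sel = svm_rfe(X, y, threshold=0.4, n_select=2)
print(sel)  # the informative variables should survive the elimination
```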

JOIN and ENSEMBLE SVM-RFE

Case Study: proteomic pattern data
Petricoin et al. papers
 Commercial analysis software (Proteome Quest)
 Data sets available at:

Data generation: SELDI-TOF MS
Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry: a method for profiling a population of proteins in a sample according to the size and net charge of the individual proteins.
The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its "time of flight", because small proteins fly faster than heavy ones.
1) Serum is placed on a protein-binding plate
2) The plate is inserted in a vacuum chamber
3) The plate is irradiated with a laser
4) This "launches" the proteins / peptides
5) The "time of flight" (TOF) of the ions is measured; it is related to the molecular weight of the proteins
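The time-of-flight relation can be made concrete: an ion of mass m and charge z accelerated through potential U acquires kinetic energy zeU = ½mv², so its flight time over a drift tube of length L is t = L·√(m / (2zeU)), proportional to the square root of the mass-to-charge ratio. A small sketch with illustrative instrument values (L and U below are assumptions, not from the slides):

```python
import math

def time_of_flight(m_over_z, L=1.0, U=20e3):
    """Flight time (s) of an ion with mass-to-charge ratio m/z (kg per unit
    charge), drift length L (m), accelerating potential U (V)."""
    e = 1.602e-19  # elementary charge (C)
    return L * math.sqrt(m_over_z / (2 * e * U))

# A peptide twice as heavy (same charge) flies sqrt(2) times slower.
t1 = time_of_flight(1000 * 1.66e-27)  # ~1 kDa peptide
t2 = time_of_flight(2000 * 1.66e-27)  # ~2 kDa peptide
print(t2 / t1)  # ~1.414
```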

Example of proteomic pattern profile from one blood sample
[Figure: spectrum with time of flight on the x-axis and abundance on the y-axis.]
Heavier peptides move slower, so time of flight corresponds to weight, and weight corresponds to peptides: the profile is a measurement of the relative abundance of the detected peptides in serum.

How to use such data?
Diagnostic tool:
 design a classifier for discriminating healthy from disease samples
Biomarker identification:
 variable subset selection (VSS): select a subset of the input variables (m/z values) that best discriminates the two classes (potential biomarkers)

Commercial Tools
 Proteome Quest (Correlogic): GA + clustering, no pre-selection (Petricoin et al., The Lancet 2002)
 Propeak (3Z Informatics): separability analysis + bootstrap
 Biomarker AMplification Filter, BAMF (Eclipse Diagnostics): ?

Non-commercial Techniques
 Pre-processing + ranking + kNN (Zhu et al., PNAS 2003)
 Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002)
 Filter FS + classifier (Liu et al., Genome Informatics 2002)
 GA + SVM, SVM-RFE ensemble (Jong et al., EvoBIO 2004; Jong et al., CIBCB 2004)
 Many others: any ML method for classification/FS (see, e.g., the special issue on FS, JMLR 2003)

Goal and Methods
Goal: analyze the performance of SVM-based techniques for classification and variable selection on proteomic pattern data
 SVM
 SVM-RFE
 Ensemble SVM-RFE: majority vote of the classifiers obtained from SVM-RFE with different cutoff values
 Join SVM-RFE: SVM trained on the N variables that were selected most often by SVM-RFE with different threshold values
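The ensemble and join ideas can be sketched with scikit-learn's RFE wrapper (a minimal illustration on hypothetical synthetic data; the elimination steps, feature counts, and cutoff below are illustrative choices, not the paper's settings):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Hypothetical stand-in for proteomic profiles: only variables 3 and 7 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 30))
y = (X[:, 3] - X[:, 7] > 0).astype(int)
X_val = rng.normal(size=(40, 30))
y_val = (X_val[:, 3] - X_val[:, 7] > 0).astype(int)

# Run SVM-RFE with several elimination fractions.
steps = [0.2, 0.3, 0.4, 0.5]
selected, votes = [], []
for step in steps:
    rfe = RFE(LinearSVC(dual=False), n_features_to_select=5, step=step).fit(X, y)
    selected.append(np.flatnonzero(rfe.support_))  # variables kept by this run
    votes.append(rfe.predict(X_val))               # this run's predictions

# ENSEMBLE: majority vote of the classifiers (ties broken toward class 0).
ensemble_pred = (np.mean(votes, axis=0) > 0.5).astype(int)

# JOIN: variables selected more than `cutoff` times across the runs.
cutoff = 2
counts = np.bincount(np.concatenate(selected), minlength=X.shape[1])
join_vars = np.flatnonzero(counts > cutoff)
print(join_vars)  # the informative variables should appear
```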

DataSets
Two proteomic pattern datasets, from prostate and ovarian cancer, from the NCI/CCR and FDA/CBER Clinical Proteomics Program Databank. Data sets available at:
[Table: number of m/z values and of healthy, cancer (15 benign), and total samples for the Ovarian (4/03/) and Prostate datasets; the values are not recoverable from the transcript.]

Experimental Setup
10 random partitions of the dataset: T (50%), H (25%), V (25%)
Algorithms:
 SVM trained on the union of T and H
 SVM-RFE(threshold) with thresholds = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7; choose the threshold giving the best classifier sensitivity on H
 JOIN(cutoff; 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) with cutoffs = 1, 2, 3, 4, 5; choose the cutoff giving the best classifier sensitivity on H
Performance: average over the 10 V's
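The train / hold-out / validation protocol can be sketched as follows (a minimal NumPy version; the dataset size of 100 is illustrative):

```python
import numpy as np

def partition(n, rng):
    """One random split: T = train (50%), H = hold-out (25%), V = validation (25%)."""
    idx = rng.permutation(n)
    return idx[: n // 2], idx[n // 2 : 3 * n // 4], idx[3 * n // 4 :]

rng = np.random.default_rng(0)
splits = [partition(100, rng) for _ in range(10)]  # the 10 random partitions
T, H, V = splits[0]
print(len(T), len(H), len(V))  # 50 25 25
```

Hyper-parameters (threshold, cutoff) are tuned by sensitivity on H, and the reported performance is the average over the 10 validation sets V.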

Results: Prostate Dataset

Results: Ovarian Dataset

Controversy
Noise, bias, and the reliability and reproducibility of results in serum proteomics:
 Sorace, Zhan, BMC Bioinformatics, 2004
 Petricoin, BMC Bioinformatics, 2004
 Baggerly, Journal of the National Cancer Institute, vol. 97, no. 4, 2005
 Liotta, Journal of the National Cancer Institute, vol. 97, no. 4, 2005
 Ransohoff, Journal of the National Cancer Institute, vol. 97, no. 4, 2005

Conclusion
Many machine learning techniques can be used for potential biomarker detection with proteomic pattern data.
SVM-based techniques are a possibly effective choice because of the high input dimension of such data.
Computational analysis of proteomic pattern data has to follow a correct methodology that accounts for the biases induced by the selection and classification algorithms and by the data splitting.
Problems related to the reliability and reproducibility of the data are inherent to the laboratory technology and are currently being addressed by researchers and practitioners.

Acknowledgments
 Connie Jimenez (Biology, VUMC)
 Aad van der Vaart (Statistics, VUA)