A Comparative Study on Variable Selection for Nonlinear Classifiers C. Lu 1, T. Van Gestel 1, J. A. K. Suykens 1, S. Van Huffel 1, I. Vergote 2, D. Timmerman 2.

A Comparative Study on Variable Selection for Nonlinear Classifiers
C. Lu 1, T. Van Gestel 1, J. A. K. Suykens 1, S. Van Huffel 1, I. Vergote 2, D. Timmerman 2
1 Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
2 Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium

1. Introduction

Variable selection refers to the problem of selecting input variables that are relevant for a given task. In pattern recognition, variable selection can have an impact on the economics of data acquisition and on the accuracy and complexity of the classifiers. This study addresses input variable selection for nonlinear black-box classifiers, in particular multi-layer perceptrons (MLPs) and least squares support vector machines (LS-SVMs).

2. Feature extraction

In pattern recognition, feature extraction precedes classification. It can take two forms:
- Feature selection (variable selection): choosing a subset of the original variables.
- Feature transformation, e.g. PCA: less desirable here, since the original variables are not retained, the transformed features are difficult to interpret, and the result is not immune to distortion under the transformation.

Variable selection combines a variable (feature) measure with a heuristic search strategy (forward, backward, stepwise, hill-climbing, branch and bound, ...). Two families of approaches exist:
- Filter approaches: filter out irrelevant attributes before induction occurs.
- Wrapper approaches: focus on finding attributes that are useful for the performance of a specific type of model, rather than necessarily finding the relevant ones.

Common variable measures (a mutual-information ranking sketch is given after this list):
- Correlation
- Mutual information (MI)
- Evidence (or Bayes factor) in a Bayesian framework
- Classification performance
- Sensitivity analysis: the change DJ(i) in the objective function J caused by removing variable i
- Statistical partial F-test (chi-square value)
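As a concrete illustration of the filter idea, the sketch below ranks candidate variables by their estimated mutual information with the class label. This is a minimal example and not the study's code: it assumes scikit-learn is available, the toy data and names are illustrative, and mutual_info_classif uses a nearest-neighbour MI estimator rather than the simple discretization employed in this study.

```python
# Minimal filter-style ranking sketch (illustrative data and names; not the authors' code).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                  # 200 samples, 16 candidate variables
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # label driven by the first two variables

mi = mutual_info_classif(X, y, random_state=0)  # one MI score per variable
ranking = np.argsort(mi)[::-1]                  # most to least relevant variable indices
print("Top 5 variables by mutual information:", ranking[:5])
```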

3. Considered nonlinear classifiers: MLPs and LS-SVMs

The two nonlinear black-box classifiers considered are MLP classifiers and the LS-SVM classifier, which is solved in the dual space (the standard formulation is sketched below). By integrating the MLP (MacKay, 1992) or the LS-SVM (Van Gestel, Suykens et al., 2002) with the Bayesian evidence framework, in which inference is divided into distinct levels, the tuning of the hyperparameters and the computation of the posterior class probabilities can be done in a unified way. Variable selection can also be performed based on the model evidence.
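For reference, here is a sketch of the standard LS-SVM classifier formulation (Suykens and Vandewalle, 1999 [3]), which the slide's equation boxes presumably showed. In the primal, with feature map \(\varphi\), regularization constant \(\gamma\) and error variables \(e_k\):

```latex
\min_{w,b,e}\; \mathcal{J}(w,e) = \tfrac{1}{2}\, w^\top w + \tfrac{\gamma}{2} \sum_{k=1}^{N} e_k^2
\quad \text{s.t.}\quad y_k\,[\, w^\top \varphi(x_k) + b \,] = 1 - e_k, \quad k = 1,\dots,N,
```

which is solved in the dual space from the linear system

```latex
\begin{bmatrix} 0 & y^\top \\ y & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ 1_N \end{bmatrix},
\qquad \Omega_{kl} = y_k\, y_l\, K(x_k, x_l),
```

giving the classifier \(y(x) = \mathrm{sign}\big( \sum_k \alpha_k y_k K(x, x_k) + b \big)\).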

4. Considered variable selection methods

For each method we list the variable measure, the search strategy, the parameters that must be predefined, and the main (dis)advantages. A sketch of linear RFE follows this list.

Mutual information feature selection under uniform information distribution (MIFS-U) [8]
- Variable measure: mutual information I(X;Y)
- Search: greedy forward search; start from no variables and repeatedly add the best feature until a predefined number k of variables is selected
- Predefined parameters: density function estimation method (parametric or nonparametric); here the simple discretization method is used
- (Dis)advantages: handles linear and nonlinear relations and is easy to compute; the computational burden grows with k for very high dimensional data, and information is lost due to discretization

Bayesian LS-SVM variable forward selection (LSSVMB-FFS) [1]
- Variable measure: model evidence P(D|H)
- Search: greedy forward search; at each step add the variable that gives the highest increase in model evidence, until no further increase
- Predefined parameters: kernel type
- (Dis)advantages: linear or nonlinear; automatically selects the number of variables that maximizes the evidence; relies on a Gaussian assumption and is computationally expensive for high dimensional data

LS-SVM recursive feature elimination (LSSVM-RFE) [7]
- Variable measure: for a linear kernel, (w_i)^2
- Search: recursively remove the variable(s) with the smallest DJ(i)
- Predefined parameters: kernel type, regularization and kernel parameters
- (Dis)advantages: suitable for very high dimensional data; computationally expensive for large sample sizes and for nonlinear kernels

Stepwise logistic regression (SLR)
- Variable measure: chi-square value (statistical partial F-test)
- Search: stepwise; at each step a variable is added or removed
- Predefined parameters: p-values for determining addition or removal of variables
- (Dis)advantages: linear and easy to compute; runs into trouble in case of multicollinearity
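To make the LSSVM-RFE entry concrete, the sketch below implements the linear-kernel variant of the recursive elimination loop, ranking variables by (w_i)^2 and discarding the weakest one per iteration. It is a hedged illustration, not the study's implementation: scikit-learn's LinearSVC stands in for the LS-SVM classifier, and the function name and parameters are my own.

```python
# Hedged sketch of linear-kernel RFE: rank variables by (w_i)^2, drop the weakest, refit.
# LinearSVC is used here as a stand-in for the LS-SVM classifier of the study.
import numpy as np
from sklearn.svm import LinearSVC

def linear_rfe(X, y, n_keep):
    remaining = list(range(X.shape[1]))           # candidate variable indices
    while len(remaining) > n_keep:
        clf = LinearSVC(C=1.0, max_iter=10000).fit(X[:, remaining], y)
        w2 = clf.coef_.ravel() ** 2               # DJ(i) proxy for a linear kernel
        worst = int(np.argmin(w2))                # variable with the smallest weight
        del remaining[worst]                      # eliminate it and refit on the rest
    return remaining                              # indices of the surviving variables

# Example usage (with binary labels y): selected = linear_rfe(X, y, n_keep=2)
```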

5. Experimental results on benchmark data sets

I. Synthetic data: noisy XOR problem (linearly inseparable); a data-generation sketch follows this section.
- 50 randomly generated training examples.
- X1, X2: random binary values in {0,1}; Y = XOR(X1, X2).
- X3, X4: X1 and X2 with added noise ~ N(0, 0.3).
- X5, X6: X1 and X2 with added noise ~ N(0, 0.5).
- X7-X16: pure noise ~ N(0, 2).

Table 1 reports the performance averaged over 30 random trials on a test set of 100 examples. LSSVM-RFE (using a polynomial kernel of degree 2) correctly selected the top two variables in 25 of the 30 random trials based on the 50 noisy training examples.

Notes:
- A linear classifier or linear selection method cannot solve the XOR problem, which is nonlinear.
- MIFS-U: the entropy of the first two (binary) variables is smaller than that of the other, continuous variables.
- Bayesian LSSVM-FFS: the evidence for the first two binary variables is smaller than for the other continuous variables; backward Bayesian LS-SVM selection, however, can always remove the other noisy variables.
- Linear kernels are used for LSSVM-RFE and LSSVMB-FFS.
- The MLP has one hidden layer with two hidden neurons; the Bayesian MLP framework is used to determine the regularization parameter.
- The LS-SVM classifier uses a polynomial kernel of degree 2.

II. Biomedical real-life data sets
(1) Gene selection for leukemia classification: 7129 variables; classes: ALL vs. AML; 38 training examples and 34 test examples. Table 2 reports the accuracy on the test set with different numbers of selected variables.
(2) Ovarian tumor classification: 27 variables; classes: benign vs. malignant; 265 training examples and 160 test examples. Table 3 reports the accuracy on the test set with different numbers of selected variables.
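The noisy XOR data set described above can be generated along the following lines. This is an illustrative sketch, not the authors' code; it assumes the reported N(0, s) values are standard deviations, that the noisy variables are noisy copies of X1 and X2, and the function name and seed are hypothetical.

```python
# Sketch of the noisy XOR data set described above (illustrative names and seed).
import numpy as np

def make_noisy_xor(n=50, seed=0):
    rng = np.random.default_rng(seed)
    x12 = rng.integers(0, 2, size=(n, 2)).astype(float)    # X1, X2 in {0, 1}
    y = np.logical_xor(x12[:, 0], x12[:, 1]).astype(int)   # Y = XOR(X1, X2)
    x34 = x12 + rng.normal(0.0, 0.3, size=(n, 2))           # X3, X4: X1, X2 + N(0, 0.3)
    x56 = x12 + rng.normal(0.0, 0.5, size=(n, 2))           # X5, X6: X1, X2 + N(0, 0.5)
    x7_16 = rng.normal(0.0, 2.0, size=(n, 10))              # X7..X16: pure noise N(0, 2)
    X = np.hstack([x12, x34, x56, x7_16])                   # 16 input variables in total
    return X, y

X_train, y_train = make_noisy_xor(n=50)   # 50 noisy training examples, as in the study
X_test, y_test = make_noisy_xor(n=100)    # 100-example test set for evaluation
```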

6. Conclusions

- Good variable selection can improve the performance of the classifiers, both in accuracy and in computation.
- LSSVM-RFE is suitable for both linear and nonlinear classification problems and can deal with very high dimensional data.
- Bayesian LS-SVM forward selection can identify the important variables in some cases, but should be used with more care as to whether its assumptions are satisfied.
- A strategy that combines variable ranking and wrapper methods should give more confidence in the selected variables.

References
1. C. Lu, T. Van Gestel, et al., Preoperative prediction of malignancy of ovarian tumors using least squares support vector machines (2002), submitted.
2. D. Timmerman, H. Verrelst, et al., Artificial neural network models for the preoperative discrimination between malignant and benign adnexal masses, Ultrasound Obstet Gynecol (1999).
3. J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters (1999), 9(3).
4. T. Van Gestel, J.A.K. Suykens, et al., Bayesian framework for least squares support vector machine classifiers, Neural Computation (2002), 15(5).
5. D.J.C. MacKay, The evidence framework applied to classification networks, Neural Computation (1992), 4(5).
6. R. Kohavi, G. John, Wrappers for feature subset selection, Artificial Intelligence, special issue on relevance (1997), 97(1-2).
7. I. Guyon, J. Weston, et al., Gene selection for cancer classification using support vector machines, Machine Learning (2000).
8. N. Kwak, C.H. Choi, Input feature selection for classification problems, IEEE Transactions on Neural Networks (2002), 13(1).