Guest lecture: Feature Selection
Alan Qi, Dec 2, 2004

Outline
Problems
Overview of feature selection (FS)
– Filtering: correlation & information criteria
– Wrapper approach: greedy FS & regularization
Classical Bayesian feature selection
New Bayesian approach: predictive ARD

Feature selection
Gene expression: thousands of gene measurements
Documents: "bag of words" model with more than 10,000 words
Images: histograms, colors, wavelet coefficients, etc.
Task: find a small subset of features for prediction

Gene Expression Classification
Task: classify gene expression datasets into different categories, e.g., normal vs. cancer.
Challenge: thousands of genes are measured in the microarray data, but only a small subset of them is likely to be relevant to the classification task.

Filtering approach
Feature ranking based on sensible criteria:
– Correlation between features and labels
– Mutual information between features and labels
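As an illustrative sketch (not code from the lecture), a filtering pass can rank features by the absolute correlation between each feature column and the labels; the function names and toy data below are invented for the demo:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(X, t):
    """Rank feature indices by |correlation| with the labels, best first."""
    scores = [abs(pearson([row[j] for row in X], t)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# Toy data: feature 0 tracks the label, feature 1 is noise.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
t = [0.0, 0.0, 1.0, 1.0]
ranking = rank_features(X, t)  # feature 0 should rank first
```

A mutual-information criterion would follow the same pattern, with only the per-feature score function changing.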

Wrapper Approach
Assess subsets of variables according to their usefulness to a given predictor.
A combinatorial problem: 2^K combinations given K features.
– Sequentially adding/removing features: Sequential Forward Selection (SFS), Sequential Backward Selection (SBS).
– Alternately adding and removing features: Sequential Forward Floating Selection (SFFS). (When to stop? Overfitting?)
– Regularization: use a sparse prior to enhance the sparsity of the trained predictor (classifier).
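A generic sketch of the SFS greedy step (illustrative, not the lecture's code): the `score` callback stands in for whatever predictor-based usefulness measure the wrapper uses, and the toy "signal coverage" score is invented for the demo:

```python
def forward_select(n_features, score, k):
    """Sequential Forward Selection: greedily add, one at a time,
    the feature whose inclusion gives the best subset score."""
    selected = []
    while len(selected) < k:
        best_j, best_score = None, None
        for j in range(n_features):
            if j in selected:
                continue
            s = score(selected + [j])
            if best_score is None or s > best_score:
                best_j, best_score = j, s
        selected.append(best_j)
    return selected

# Toy score: how many distinct "signals" a subset covers.
# Feature 1 is redundant given feature 0, so SFS should skip it.
SIGNALS = {0: {"a", "b"}, 1: {"b"}, 2: {"c"}}

def coverage(subset):
    return len(set().union(*(SIGNALS[j] for j in subset)))
```

Note how the redundant feature is never picked: this greedy behavior is what makes wrappers cheaper than exhaustive search over all 2^K subsets, at the cost of possibly missing the optimum.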

Regularization
Regularization combines the fit to the data with a penalty for complexity: minimize
E(w) = −log p(t | X, w) + λ Ω(w),  e.g., Ω(w) = ‖w‖₁ for a sparse solution
Labels: t = [t_1, t_2, …, t_N]
Inputs: X = [x_1, x_2, …, x_N]
Parameters: w
Likelihood for the data set (for classification):
p(t | X, w) = ∏_i p(t_i | x_i, w)
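To see why an L1-style penalty yields sparsity, consider the one-dimensional problem min_w ½(w − z)² + λ|w|. Its solution is the soft-thresholding operator, which snaps small weights exactly to zero; this is a standard illustration, not code from the lecture:

```python
def soft_threshold(z, lam):
    """argmin over w of 0.5*(w - z)**2 + lam*abs(w).
    Weights with |z| <= lam are set exactly to zero (sparsity)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

With a quadratic (L2) penalty the minimizer would be z/(1 + lam), which shrinks weights but never makes them exactly zero; the L1 penalty is what prunes features outright.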

Bayesian feature selection
Background
– Bayesian classification model
– Automatic relevance determination (ARD)
Risk of overfitting by optimizing hyperparameters
Predictive ARD by expectation propagation (EP):
– Approximate prediction error
– EP approximation
Experiments
Conclusions

Motivation
Task 1: classify high-dimensional datasets with many irrelevant features, e.g., normal vs. cancer microarray data.
Task 2: sparse Bayesian kernel classifiers for fast test-time performance.

Bayesian Classification Model
Prior of the classifier w: independent Gaussian priors on the weights,
p(w) = ∏_j N(w_j | 0, α_j⁻¹)
Labels: t, inputs: X, parameters: w
Likelihood for the data set (with labels t_i ∈ {−1, +1}):
p(t | X, w) = ∏_i Φ(t_i wᵀx_i)
where Φ(·) is the cumulative distribution function of a standard Gaussian.
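A minimal numeric sketch of the probit likelihood above, using the standard identity Φ(z) = ½(1 + erf(z/√2)); the function names are illustrative, not from the lecture:

```python
import math

def gauss_cdf(z):
    """Standard Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_likelihood(w, x, t):
    """P(t | x, w) = Phi(t * w.x) for a label t in {-1, +1}."""
    return gauss_cdf(t * sum(wi * xi for wi, xi in zip(w, x)))
```

Since Φ(z) + Φ(−z) = 1, the probabilities of the two labels sum to one, as a likelihood must.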

Evidence and Predictive Distribution
The evidence, i.e., the marginal likelihood of the hyperparameters α:
p(t | X, α) = ∫ p(t | X, w) p(w | α) dw
The predictive posterior distribution of the label t* for a new input x*:
p(t* | x*, t, X, α) = ∫ p(t* | x*, w) p(w | t, X, α) dw

Automatic Relevance Determination (ARD)
Give the classifier weights independent Gaussian priors whose variance, 1/α_j, controls how far away from zero each weight is allowed to go:
p(w | α) = ∏_j N(w_j | 0, α_j⁻¹)
Maximize p(t | X, α), the marginal likelihood of the model, with respect to α.
Outcome: many elements of α go to infinity, which naturally prunes irrelevant features in the data.
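A toy sketch of the pruning effect for a single weight in a linear-Gaussian model y = w·x + noise, using an EM-style update α ← 1/E[w²]. The noise precision `beta`, the update rule, and the data are invented for illustration; they are not taken from the lecture:

```python
def ard_alpha(x, y, beta=25.0, iters=100):
    """Iterate the prior precision alpha of a single weight w in
    y = w*x + Gaussian noise (noise precision beta).
    A large final alpha means the weight is pruned (irrelevant)."""
    alpha = 1.0
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    for _ in range(iters):
        s = alpha + beta * sxx           # posterior precision of w
        m = beta * sxy / s               # posterior mean of w
        alpha = 1.0 / (m * m + 1.0 / s)  # EM update: alpha = 1 / E[w^2]
    return alpha
```

Run on data where y really does track x, alpha settles near a finite value; run on data uncorrelated with x, alpha grows without bound, which is exactly the "elements of α go to infinity" pruning behavior described above.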

Two Types of Overfitting
Classical maximum likelihood:
– Optimizing the classifier weights w can directly fit noise in the data, resulting in a complicated model.
Type II maximum likelihood (ARD):
– Optimizing the hyperparameters α corresponds to choosing which variables are irrelevant. Choosing one out of exponentially many models can also overfit if we maximize the model marginal likelihood.

Risk of Optimizing the Hyperparameters
[Figure: a 2-D toy classification problem (X: class 1, O: class 2) comparing two evidence-ARD solutions (Evd-ARD-1, Evd-ARD-2) with the Bayes point classifier.]

Outline
Background
– Bayesian classification model
– Automatic relevance determination (ARD)
Risk of overfitting by optimizing hyperparameters
Predictive ARD by expectation propagation (EP):
– Approximate prediction error
– EP approximation
Experiments
Conclusions

Predictive-ARD
Choose the model with the best estimated predictive performance instead of the most probable model.
Expectation propagation (EP) estimates the leave-one-out predictive performance without performing any expensive cross-validation.

Estimate Predictive Performance
Predictive posterior given a test data point x*:
p(t* | x*, t) ≈ ∫ p(t* | x*, w) q(w | t) dw
EP can estimate the predictive leave-one-out error probability
(1/N) ∑_i [ 1 − p(t_i | x_i, t\i) ]
where q(w | t\i) is the approximate posterior obtained by leaving out the i-th label. EP can also estimate the predictive leave-one-out error count: the number of points whose leave-one-out predictive probability of the true label falls below 1/2.

Expectation Propagation in a Nutshell
Approximate a probability distribution by simpler parametric terms:
p(w) ∝ ∏_i f_i(w)  ≈  q(w) ∝ ∏_i f̃_i(w)
Each approximation term f̃_i(w) lives in an exponential family (e.g., Gaussian).

EP in a Nutshell
Three key steps:
Deletion step: approximate the "leave-one-out" predictive posterior for the i-th point by dividing out its term:
q\i(w) ∝ q(w) / f̃_i(w)
Moment matching: minimize the KL divergence KL( f_i(w) q\i(w) ‖ q(w) ) by matching moments.
Inclusion: set f̃_i(w) ∝ q(w) / q\i(w).
Key observation: the approximate leave-one-out predictive posterior from the deletion step can be reused for model selection, with no extra computation.
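The moment-matching step has a simple interpretation: among Gaussians q, KL(p‖q) is minimized by the Gaussian whose mean and variance equal those of p. A small numeric illustration of that fact (not code from the slides), using a discrete p:

```python
def match_moments(points, probs):
    """Return the (mean, variance) of a discrete distribution p.
    The Gaussian q minimizing KL(p || q) has exactly these moments,
    which is why EP's projection step reduces to moment matching."""
    mean = sum(w * x for x, w in zip(points, probs))
    var = sum(w * (x - mean) ** 2 for x, w in zip(points, probs))
    return mean, var
```

For example, a distribution with half its mass at 0 and half at 2 is best approximated (in this KL sense) by a Gaussian with mean 1 and variance 1.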

Comparison of different model selection criteria for ARD training
1st row: test error
2nd row: estimated leave-one-out error probability
3rd row: estimated leave-one-out error counts
4th row: evidence (model marginal likelihood)
5th row: fraction of selected features
The estimated leave-one-out error probabilities and counts are better correlated with the test error than the evidence and the sparsity level.

Gene Expression Classification
Task: classify gene expression datasets into different categories, e.g., normal vs. cancer.
Challenge: thousands of genes are measured in the microarray data, but only a small subset of them is likely to be relevant to the classification task.

Classifying Leukemia Data
The task: distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
The dataset: 47 and 25 samples of type ALL and AML, respectively, with 7129 features per sample.
The dataset was randomly split 100 times into 36 training and 36 testing samples.

Classifying Colon Cancer Data
The task: distinguish normal and cancer samples.
The dataset: 22 normal and 40 cancer samples with 2000 features per sample.
The dataset was randomly split 100 times into 50 training and 12 testing samples.
SVM results are from Li et al. 2002.

Bayesian Sparse Kernel Classifiers
Use feature/kernel expansions defined on the training data points:
φ(x) = [k(x, x_1), …, k(x, x_N)]ᵀ
Predictive-ARD-EP trains a classifier that depends on only a small subset of the training set, giving fast test performance.
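A sketch of such a basis expansion with a Gaussian kernel (the width value of 5 echoes the experiments in this lecture; the function names and data are illustrative):

```python
import math

def gaussian_kernel(a, b, width=5.0):
    """Gaussian (RBF) kernel k(a, b) = exp(-||a - b||^2 / (2 * width^2))."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * width ** 2))

def kernel_features(x, train):
    """Basis expansion on the training points: phi(x)_i = k(x, x_i).
    A sparse weight vector over these features keeps only a few
    'relevance vectors', so test-time prediction is fast."""
    return [gaussian_kernel(x, xi) for xi in train]
```

The classifier then scores a point as wᵀφ(x); when ARD drives most entries of w to zero, only the surviving training points (the relevance vectors) are needed at test time.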

Test error rates and numbers of relevance or support vectors on the breast cancer dataset.
50 partitionings of the data were used.
All methods use the same Gaussian kernel with kernel width = 5. The trade-off parameter C in the SVM is chosen via 10-fold cross-validation for each partition.

Test error rates on the diabetes data.
100 partitionings of the data were used.
Evidence-ARD-EP and Predictive-ARD-EP use the Gaussian kernel with kernel width = 5.

Summary
Two kinds of feature selection methods:
– Filtering and wrapper methods
Classical Bayesian feature selection:
– An excellent classical approach: tuning the prior to prune features.
– However, maximizing the marginal likelihood can lead to overfitting in model space when there are many features.
New Bayesian approach: Predictive-ARD, which focuses on predictive performance:
– feature selection
– sparse kernel learning