Predictive Automatic Relevance Determination by Expectation Propagation. Y. Qi, T.P. Minka, R.W. Picard, Z. Ghahramani.

Presentation transcript:

Predictive Automatic Relevance Determination by Expectation Propagation. Y. Qi, T.P. Minka, R.W. Picard, Z. Ghahramani

Motivation Task 1: Classify high-dimensional datasets with many irrelevant features, e.g., normal vs. cancer microarray data. Task 2: Train sparse Bayesian kernel classifiers for fast test-time performance.

Automatic Relevance Determination (ARD) Give the feature weights independent Gaussian priors whose variances, 1/α_i, control how far from zero each weight is allowed to go. Maximize the marginal likelihood of the model with respect to the hyperparameters α. Outcome: many elements of α go to infinity, which drives the corresponding weights to zero and naturally prunes irrelevant features from the data.
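To make the mechanism concrete, here is a minimal sketch of classical ARD for Bayesian linear regression via evidence maximization (MacKay-style fixed-point updates for α). It illustrates the generic ARD idea only; it is not the EP-based classifier proposed in this work, and all names are illustrative.

```python
import numpy as np

def ard_linear_regression(X, y, beta=1.0, n_iter=100, prune_at=1e6):
    """Sketch of ARD by type-II maximum likelihood: each weight w_i has a
    N(0, 1/alpha_i) prior, and alpha is updated to maximize the evidence."""
    n, d = X.shape
    alpha = np.ones(d)                       # per-feature prior precisions
    for _ in range(n_iter):
        # Posterior over weights given the current alpha: N(m, S)
        S = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)
        m = beta * S @ X.T @ y
        # MacKay fixed-point update: gamma_i measures how well the data
        # determine w_i; alpha_i grows when w_i is not needed.
        gamma = 1.0 - alpha * np.diag(S)
        alpha = np.minimum(gamma / (m ** 2 + 1e-12), prune_at)
    return m, alpha, alpha < prune_at        # weights, precisions, kept features
```

As α_i grows, the prior variance 1/α_i shrinks toward zero and the corresponding weight is pinned at zero, which is the pruning behavior described on this slide.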

Risk of Optimizing α Choosing a simple model can also overfit if we select it by maximizing the model marginal likelihood. In particular, when the marginal likelihood is maximized over α and the dimension of α (the number of features) is large, there is a real risk of overfitting.

Predictive-ARD Choosing the model with the best estimated predictive performance instead of the most probable model. Expectation propagation (EP) estimates the leave-one-out predictive performance without performing any expensive cross-validation.
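A schematic sketch of the difference between the two selection criteria, assuming a hypothetical ep_fit function that runs EP/ARD training for one candidate setting and returns the model, its log evidence, and its EP-estimated leave-one-out error (a stand-in interface, not the authors' code):

```python
def select_model(candidates, X, y, ep_fit, criterion="predictive"):
    """Pick among candidate models (e.g., snapshots along the ARD optimization)
    by evidence (standard ARD) or by estimated LOO error (Predictive-ARD)."""
    fits = [ep_fit(X, y, c) for c in candidates]   # (model, log_evidence, loo_error)
    if criterion == "evidence":
        best = max(range(len(fits)), key=lambda i: fits[i][1])  # most probable model
    else:
        best = min(range(len(fits)), key=lambda i: fits[i][2])  # best predicted performance
    return fits[best][0], candidates[best]
```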

Estimate Predictive Performance Predictive posterior given a test data point. EP estimate of the predictive leave-one-out error probability. EP estimate of the predictive leave-one-out error count.
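The expressions that accompanied these bullets did not survive extraction. The following LaTeX is a hedged reconstruction of their general form, writing q(w) for the EP posterior and q_{\i}(w) for the cavity (leave-one-out) posterior with the i-th approximate factor removed; the exact expressions in the paper may differ in detail.

```latex
% Predictive posterior for a test point x_*
p(y_* \mid x_*, \mathcal{D}) \;\approx\; \int p(y_* \mid x_*, \mathbf{w})\, q(\mathbf{w})\, d\mathbf{w}

% EP estimate of the leave-one-out error probability of training point i
\hat{p}_i \;\approx\; 1 - \int p(y_i \mid x_i, \mathbf{w})\, q_{\setminus i}(\mathbf{w})\, d\mathbf{w}

% EP estimate of the leave-one-out error count
\widehat{\mathrm{err}} \;=\; \sum_{i=1}^{n} \mathbb{I}\big[\hat{p}_i > \tfrac{1}{2}\big]
```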

Comparison of different model selection criteria for ARD training. 1st row: Test error. 2nd row: Estimated leave-one-out error probability. 3rd row: Estimated leave-one-out error counts. 4th row: Evidence (model marginal likelihood). 5th row: Fraction of selected features.

Gene Expression Classification Task: Classify gene expression datasets into different categories, e.g., normal vs. cancer. Challenge: Thousands of genes are measured in the microarray data, but only a small subset of them is likely to be relevant to the classification task.

Classifying Leukemia Data The task: distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL). The dataset: 47 and 25 samples of type ALL and AML, respectively, with 7129 features per sample. The dataset was randomly split 100 times into 36 training and 36 testing samples.
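A minimal sketch of the evaluation protocol used here (100 random splits into 36 training and 36 test samples, averaging the test error), assuming a hypothetical fit_and_predict function standing in for the Predictive-ARD classifier:

```python
import numpy as np

def evaluate_random_splits(X, y, fit_and_predict, n_splits=100, n_train=36, seed=0):
    """Average test error over repeated random train/test splits.
    fit_and_predict(X_tr, y_tr, X_te) is a hypothetical stand-in that trains
    a classifier and returns predicted labels for X_te."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        train, test = perm[:n_train], perm[n_train:]
        preds = fit_and_predict(X[train], y[train], X[test])
        errors.append(np.mean(preds != y[test]))
    return float(np.mean(errors)), float(np.std(errors))
```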

Classifying Colon Cancer Data The task: distinguish normal from cancer samples. The dataset: 22 normal and 40 cancer samples with 2000 features per sample. The dataset was randomly split 100 times into 50 training and 12 testing samples. SVM results are from Li et al. (2002).

Summary ARD is an excellent Bayesian method for feature selection and sparse learning. However, maximizing the marginal likelihood can lead to overfitting when there are many features. We propose Predictive-ARD based on EP, and in practice it works very well.

Sequential Update EP approximates true observations by simpler virtual observations. Based on virtual observations, we can achieve efficient sequential updates without maintaining and updating a full covariance matrix.
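To illustrate what a "virtual observation" is, here is a sketch of one EP moment-matching step for a probit likelihood, in the style of EP for Gaussian-process classification; the specific likelihood and update schedule in this work may differ, and the names are illustrative. The non-Gaussian likelihood term for one training point is replaced by a Gaussian site whose parameters are chosen so that the approximation matches the moments of the exact (tilted) distribution.

```python
import numpy as np
from scipy.stats import norm

def gaussian_site_from_probit(y, cav_mean, cav_var):
    """One EP moment-matching step for a probit likelihood: given the cavity
    distribution N(f; cav_mean, cav_var) for the latent f = w.x and a label
    y in {-1, +1}, return the mean and variance of the Gaussian 'virtual
    observation' (site) that matches the moments of
    Phi(y * f) * N(f; cav_mean, cav_var)."""
    z = y * cav_mean / np.sqrt(1.0 + cav_var)
    ratio = norm.pdf(z) / norm.cdf(z)
    # Moments of the tilted (exact) distribution
    tilted_mean = cav_mean + y * cav_var * ratio / np.sqrt(1.0 + cav_var)
    tilted_var = cav_var - cav_var**2 * ratio * (z + ratio) / (1.0 + cav_var)
    # Convert back to site (virtual observation) parameters
    site_var = 1.0 / (1.0 / tilted_var - 1.0 / cav_var)
    site_mean = site_var * (tilted_mean / tilted_var - cav_mean / cav_var)
    return site_mean, site_var
```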

Bayesian Sparse Classifiers The trained classifier is defined by a small subset of the training set. Fast test performance.

Test error rates and numbers of relevance or support vectors on the breast cancer dataset. 50 partitionings of the data were used. All these methods use the same Gaussian kernel with kernel width = 5. The trade-off parameter C in the SVM is chosen via 10-fold cross-validation for each partition.
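For reference, a minimal sketch of how the SVM baseline described above could be reproduced with scikit-learn, assuming "kernel width = 5" means the length scale sigma in exp(-||x - x'||^2 / (2 sigma^2)), i.e., gamma = 1/(2 sigma^2); the original experiments may parameterize the kernel differently:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def svm_baseline(X_train, y_train, X_test, width=5.0):
    """RBF-kernel SVM with the trade-off parameter C chosen by 10-fold
    cross-validation on the training partition, mirroring the baseline setup."""
    gamma = 1.0 / (2.0 * width ** 2)          # assumes width = length scale sigma
    grid = GridSearchCV(SVC(kernel="rbf", gamma=gamma),
                        param_grid={"C": np.logspace(-2, 3, 12)},
                        cv=10)
    grid.fit(X_train, y_train)
    return grid.best_estimator_.predict(X_test), grid.best_params_["C"]
```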

Test error rates on the diabetes data. 100 partitionings of the data were used. The Evidence-ARD-EP and Predictive-ARD-EP classifiers use the Gaussian kernel with kernel width = 5.