1
A Bayesian Approach to Joint Feature Selection and Classifier Design
Balaji Krishnapuram, Alexander J. Hartemink, Lawrence Carin, Fellow, IEEE, and Mario A.T. Figueiredo, Senior Member, IEEE
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 26, NO. 9, SEPTEMBER 2004

2
Outline
– Introduction to MAP
– Introduction to Expectation-Maximization
– Introduction to generalized linear models
– Introduction
– Sparsity-promoting priors
– MAP parameter estimation via EM
– Experimental Results
– Conclusion

3
Introduction to MAP
Maximum-likelihood estimation
– $\theta$ is considered as an unknown but fixed parameter vector
MAP estimation
– $\theta$ is considered as a random vector described by a pdf $p(\theta)$
– Assume $p(\theta)$ is known
– Given a set of training samples $D = \{x_1, x_2, \dots, x_n\}$

4
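Introduction to MAP
Find the maximum of $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$
– If $p(\theta)$ is uniform or flat enough, the MAP estimate is approximately equal to the maximum-likelihood estimate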
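The relation between the two estimates can be illustrated with a minimal sketch. It assumes a scalar Gaussian mean with known noise variance and a Gaussian prior; the function names and the toy samples are hypothetical and not taken from the slides.

```python
def ml_mean(samples):
    """Maximum-likelihood estimate of a Gaussian mean: the sample average."""
    return sum(samples) / len(samples)


def map_mean(samples, prior_mean, prior_var, noise_var=1.0):
    """MAP estimate of a Gaussian mean under a Gaussian prior N(prior_mean, prior_var).

    The posterior is Gaussian, so its mode is the precision-weighted
    average of the data term and the prior term.
    """
    n = len(samples)
    precision = n / noise_var + 1.0 / prior_var
    return (sum(samples) / noise_var + prior_mean / prior_var) / precision


samples = [2.1, 1.9, 2.4, 2.0]          # toy data (assumption)
theta_ml = ml_mean(samples)
theta_map_tight = map_mean(samples, prior_mean=0.0, prior_var=0.5)
theta_map_flat = map_mean(samples, prior_mean=0.0, prior_var=1e6)
```

With the tight prior the estimate is pulled toward the prior mean, while with the nearly flat prior it is practically the sample average.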

5
Introduction to Expectation-Maximization

9
Introduction to generalized linear models
Supervised learning can be formalized as the problem of inferring a function $y = f(x)$, based on a training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$.
When $y$ is continuous (e.g. $y \in \mathbb{R}$), we are in the context of regression, whereas in classification problems $y$ is categorical (e.g. binary, $y \in \{-1, 1\}$).
The function $f$ is assumed to have a fixed structure and to depend on a set of parameters $\beta$. We write $y = f(x, \beta)$, and the goal becomes to estimate $\beta$ from the training data.

10
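Introduction to generalized linear models
Regression functions: $f(x, \beta) = \sum_{i=1}^{k} \beta_i h_i(x) = \beta^T h(x)$, where $h(x) = [h_1(x), \dots, h_k(x)]^T$ is a vector of $k$ fixed functions of the input, often called features.
– Linear regression: $h(x) = [1, x_1, \dots, x_d]^T$, $k = d + 1$
– Nonlinear regression via a set of $k$ fixed basis functions: $h(x) = [\phi_1(x), \dots, \phi_k(x)]^T$
– Kernel regression: $h(x) = [1, K(x, x_1), \dots, K(x, x_n)]^T$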

11
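Introduction to generalized linear models
– Assume that the output variables in the training set were contaminated by additive white Gaussian noise: $y_i = f(x_i, \beta) + w_i$, where $[w_1, \dots, w_n]$ is a set of independent zero-mean Gaussian samples with variance $\sigma^2$.
– With $y = [y_1, \dots, y_n]^T$, the likelihood function is $p(y \mid \beta) = N(y \mid H\beta, \sigma^2 I)$.
– $H$ is the so-called design matrix. The $(i, j)$ element of $H$, denoted $H_{ij}$, is given by $H_{ij} = h_j(x_i)$.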

12
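Introduction to generalized linear models
– With a zero-mean Gaussian prior for $\beta$, with covariance $A$: $p(\beta \mid A) = N(\beta \mid 0, A)$
– The posterior is still Gaussian: $p(\beta \mid y) = N(\beta \mid \hat{\beta}, \Sigma)$, with mean $\hat{\beta} = (\sigma^2 A^{-1} + H^T H)^{-1} H^T y$ and covariance $\Sigma = \sigma^2 (\sigma^2 A^{-1} + H^T H)^{-1}$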
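A minimal sketch of the Gaussian posterior mean, using the standard result $(\sigma^2 A^{-1} + H^T H)^{-1} H^T y$ for a linear-Gaussian model. It assumes an isotropic prior covariance $A = a I$ and the linear features $h(x) = [1, x]$; the helper names and the toy data are hypothetical.

```python
def solve(M, b):
    """Solve the linear system M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def posterior_mean(H, y, sigma2, prior_var):
    """Posterior mean (sigma^2 A^{-1} + H^T H)^{-1} H^T y, assuming A = prior_var * I."""
    k = len(H[0])
    HtH = [[sum(row[i] * row[j] for row in H) for j in range(k)] for i in range(k)]
    Hty = [sum(row[i] * yi for row, yi in zip(H, y)) for i in range(k)]
    for i in range(k):
        HtH[i][i] += sigma2 / prior_var
    return solve(HtH, Hty)


xs = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]                 # noiseless toy targets y = 1 + 2x (assumption)
H = [[1.0, x] for x in xs]               # design matrix with h(x) = [1, x]
beta_flat = posterior_mean(H, y, sigma2=1.0, prior_var=1e9)
beta_tight = posterior_mean(H, y, sigma2=1.0, prior_var=0.1)
```

A nearly flat prior recovers the least-squares solution, whereas a tight zero-mean prior shrinks both coefficients toward zero without making them exactly zero.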

13
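Introduction to generalized linear models
– In logistic regression, the link function is $P(y = 1 \mid x) = \left(1 + \exp(-\beta^T h(x))\right)^{-1}$
– Probit link or probit model: $P(y = 1 \mid x) = \Phi(\beta^T h(x))$, where $\Phi(z) = \int_{-\infty}^{z} N(t \mid 0, 1)\, dt$ is the standard Gaussian cumulative distribution function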
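The two link functions can be compared numerically with a short sketch. The function names are hypothetical; the probit link is written with the error function from the Python standard library.

```python
import math


def logistic_link(a):
    """Logistic link: maps a real activation to a probability."""
    return 1.0 / (1.0 + math.exp(-a))


def probit_link(a):
    """Probit link: standard Gaussian cdf evaluated at the activation."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


activations = [-2.0, -1.0, 0.0, 1.0, 2.0]
logistic_values = [logistic_link(a) for a in activations]
probit_values = [probit_link(a) for a in activations]
```

Both links are monotone and equal 0.5 at zero activation; the probit curve approaches 0 and 1 faster in the tails.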

14
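Introduction to generalized linear models
Hidden variable: $z(x, \beta) = \beta^T h(x) + w$, where $w$ is zero-mean unit-variance Gaussian noise.
If the classification rule is $y = 1$ when $z(x, \beta) \ge 0$ and $y = -1$ when $z(x, \beta) < 0$, we obtain the probit model $P(y = 1 \mid x) = \Phi(\beta^T h(x))$.
Given training data $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, consider the hidden variables $z = [z_1, \dots, z_n]^T$, where $z_i = \beta^T h(x_i) + w_i$.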

15
Introduction to generalized linear models
If we had $z$, we would have a simple linear likelihood with unit noise variance: $p(z \mid \beta) = N(z \mid H\beta, I)$.
This suggests using the EM algorithm to estimate $\beta$, treating $z$ as missing data.
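The idea of treating the hidden variables as missing data can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of the paper: it uses a single input with features $h(x) = [1, x]$, labels in $\{-1, +1\}$, and a plain zero-mean Gaussian prior on the weights instead of a sparsity-promoting one. All names and the toy data are hypothetical. The E-step replaces each hidden variable by the mean of a Gaussian truncated to the side indicated by its label; the M-step is then a regularized least-squares fit.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def em_probit(xs, ys, prior_var=10.0, iters=200):
    """EM for probit regression with h(x) = [1, x] and labels in {-1, +1}.

    Assumes a zero-mean Gaussian prior with variance prior_var on both weights.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    # Entries of the fixed 2x2 matrix H^T H + I / prior_var.
    a = n + 1.0 / prior_var
    b = sum(xs)
    d = sum(x * x for x in xs) + 1.0 / prior_var
    det = a * d - b * b
    for _ in range(iters):
        # E-step: expected hidden variable given its sign (truncated Gaussian mean).
        v = []
        for x, y in zip(xs, ys):
            mu = b0 + b1 * x
            v.append(mu + y * normal_pdf(mu) / normal_cdf(y * mu))
        # M-step: solve (H^T H + I / prior_var) beta = H^T v in closed form.
        r0 = sum(v)
        r1 = sum(x * vi for x, vi in zip(xs, v))
        b0 = (d * r0 - b * r1) / det
        b1 = (a * r1 - b * r0) / det
    return b0, b1


# Toy data with overlapping classes (assumption).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 0.3, -0.3]
ys = [-1, -1, -1, -1, 1, 1, 1, 1, -1, 1]
b0, b1 = em_probit(xs, ys)
p_right = normal_cdf(b0 + b1 * 2.0)      # estimated P(y = 1 | x = 2)
p_left = normal_cdf(b0 + b1 * -2.0)      # estimated P(y = 1 | x = -2)
```

Because each M-step is an ordinary linear solve, no step size or line search is needed, which is the practical appeal of the latent-variable view.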

16
Introduction
Given the training set $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$, there are two standard tasks:
– Classifier design: to learn a function that most accurately predicts the class of a new example
– Feature selection: to identify a subset of the features that is most informative about the class distinction

17
Introduction
This paper proposes joint classifier and feature optimization (JCFO):
– A nonnegative scaling factor is associated with each feature.
– These scaling factors are then estimated from the data, under an a priori preference for values that are either significantly large or exactly zero.

18
Introduction
The paper focuses on probabilistic kernel classifiers of the form: $P(y = 1 \mid x) = \Phi\left(\beta_0 + \sum_{i=1}^{N} \beta_i K_\theta(x, x_i)\right)$
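A probabilistic kernel classifier of this kind can be sketched as a probit link applied to a weighted sum of kernel evaluations against the training points. The sketch below assumes that form; the function names, the placeholder linear kernel, and the toy weights are hypothetical.

```python
import math


def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def predict_proba(x, train_points, beta0, betas, kernel):
    """P(y = 1 | x) for a probit kernel classifier (assumed form)."""
    activation = beta0
    for beta_i, x_i in zip(betas, train_points):
        if beta_i != 0.0:                # zero weights drop their basis function
            activation += beta_i * kernel(x, x_i)
    return std_normal_cdf(activation)


def linear_kernel(u, v):
    """Placeholder kernel used only for this illustration."""
    return sum(a * b for a, b in zip(u, v))


train_points = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
betas = [0.8, 0.0, -0.8]                 # second basis function is unused
p_pos = predict_proba([2.0, 0.0], train_points, 0.0, betas, linear_kernel)
p_neg = predict_proba([-2.0, 0.0], train_points, 0.0, betas, linear_kernel)
```

Any training point whose weight is exactly zero never enters the sum, so a sparse weight vector means fewer kernel evaluations at prediction time.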

19
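Introduction
Two of the most popular kernels, with a scaling factor $\theta_k$ on each feature:
– $r$th-order polynomial: $K_\theta(x, x') = \left(1 + \sum_{k} \theta_k x_k x'_k\right)^r$
– Gaussian radial basis function (RBF): $K_\theta(x, x') = \exp\left(-\sum_{k} \theta_k (x_k - x'_k)^2\right)$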
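The effect of per-feature scaling inside a kernel can be sketched as below. The exact kernel expressions are assumptions here: a polynomial kernel of the form (1 + sum of scaled products)^r and a Gaussian kernel with a scaled squared distance. The function names and toy vectors are hypothetical.

```python
import math


def poly_kernel(x, z, theta, r=2):
    """Polynomial kernel with a nonnegative scaling factor per feature (assumed form)."""
    return (1.0 + sum(t * a * b for t, a, b in zip(theta, x, z))) ** r


def rbf_kernel(x, z, theta):
    """Gaussian RBF kernel with a nonnegative scaling factor per feature (assumed form)."""
    return math.exp(-sum(t * (a - b) ** 2 for t, a, b in zip(theta, x, z)))


x = [1.0, 5.0]
z = [1.0, -3.0]
theta = [1.0, 0.0]                       # second feature is switched off
k_rbf = rbf_kernel(x, z, theta)
k_poly = poly_kernel(x, z, theta)
```

Setting a scaling factor to zero makes the kernel ignore that feature entirely: the two points differ only in the second coordinate, so the RBF kernel treats them as identical.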

20
Introduction
– JCFO seeks sparsity in its use of both basis functions (sparsity in $\beta$) and features (sparsity in $\theta$).
– For kernel classification, sparsity in the use of basis functions is known to impact the capacity of the classifier, which controls its generalization performance.
– Sparsity in feature utilization is another important factor for increased robustness.

21
Sparsity-promoting priors
To encourage sparsity in the estimates of the parameter vectors $\beta$ and $\theta$, we adopt a Laplacian prior for each.
For small $\delta$, the difference between $p(\delta)$ and $p(0)$ is much larger for a Laplacian than for a Gaussian.
As a result, using a Laplacian prior in a learning procedure that seeks to maximize the posterior density strongly favors values of $\beta_i$ that are exactly 0 over values that are simply close to 0.
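The behavior near the origin can be checked numerically. This sketch compares how much density is lost when moving from 0 to a small value under a Laplacian and under a Gaussian, both with unit scale (an assumption made only for the illustration); the function names are hypothetical.

```python
import math


def laplace_pdf(b, scale=1.0):
    return math.exp(-abs(b) / scale) / (2.0 * scale)


def gauss_pdf(b, var=1.0):
    return math.exp(-0.5 * b * b / var) / math.sqrt(2.0 * math.pi * var)


delta = 0.01
laplace_drop = laplace_pdf(0.0) - laplace_pdf(delta)   # roughly linear in delta
gauss_drop = gauss_pdf(0.0) - gauss_pdf(delta)         # roughly quadratic in delta
```

The Laplacian loses density at a rate proportional to delta, the Gaussian only at a rate proportional to delta squared, which is why a MAP criterion with a Laplacian prior rewards moving a small coefficient all the way to zero.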

22
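Sparsity-promoting priors
To avoid nondifferentiability at the origin, we use an alternative hierarchical formulation which is equivalent to the Laplacian prior:
– Let each $\beta_i$ have a zero-mean Gaussian prior: $p(\beta_i \mid \tau_i) = N(\beta_i \mid 0, \tau_i)$
– Let all the variances $\tau_i$ be independently distributed according to a common exponential distribution (the hyperprior): $p(\tau_i \mid \gamma_1) = \frac{\gamma_1}{2} \exp\left(-\frac{\gamma_1 \tau_i}{2}\right)$, for $\tau_i \ge 0$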

23
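Sparsity-promoting priors
– The effective prior can be obtained by marginalizing with respect to each $\tau_i$: $p(\beta_i \mid \gamma_1) = \int_0^\infty p(\beta_i \mid \tau_i)\, p(\tau_i \mid \gamma_1)\, d\tau_i = \frac{\sqrt{\gamma_1}}{2} \exp\left(-\sqrt{\gamma_1}\, |\beta_i|\right)$
For each scaling coefficient $\theta_k$, we adopt a similar sparsity-promoting prior, but we must ensure that any estimate of $\theta_k$ is nonnegative.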
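The equivalence between the hierarchical model and a Laplacian prior can be checked by simulation. The sketch below assumes an exponential hyperprior with rate gamma/2 on the variance, in which case the marginal is a Laplacian with scale 1/sqrt(gamma), whose mean absolute value is 1/sqrt(gamma) and whose variance is 2/gamma. The value of gamma, the sample size, and the seed are arbitrary choices.

```python
import math
import random

random.seed(0)
gamma = 4.0
n_samples = 200_000

abs_sum = 0.0
sq_sum = 0.0
for _ in range(n_samples):
    tau = random.expovariate(gamma / 2.0)        # variance drawn from the hyperprior
    beta = random.gauss(0.0, math.sqrt(tau))     # weight drawn from N(0, tau)
    abs_sum += abs(beta)
    sq_sum += beta * beta

mean_abs = abs_sum / n_samples                   # Laplacian predicts 1 / sqrt(gamma)
second_moment = sq_sum / n_samples               # Laplacian predicts 2 / gamma
```

Both empirical moments should land close to the values predicted by the Laplacian, up to Monte Carlo error.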

24
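Sparsity-promoting priors
– A hierarchical model for $\theta_k$ similar to the one described above, but with the Gaussian prior replaced by a truncated Gaussian prior that explicitly forbids negative values: $p(\theta_k \mid \rho_k) = 2\, N(\theta_k \mid 0, \rho_k)$ for $\theta_k \ge 0$, and $0$ for $\theta_k < 0$
– An exponential hyperprior: $p(\rho_k \mid \gamma_2) = \frac{\gamma_2}{2} \exp\left(-\frac{\gamma_2 \rho_k}{2}\right)$, for $\rho_k \ge 0$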

25
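Sparsity-promoting priors
– The effective prior can be obtained by marginalizing with respect to $\rho_k$: $p(\theta_k \mid \gamma_2) = \sqrt{\gamma_2} \exp\left(-\sqrt{\gamma_2}\, \theta_k\right)$ for $\theta_k \ge 0$, and $0$ for $\theta_k < 0$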

26
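MAP parameter estimation via EM
Given the priors described above, our goal is to find the maximum a posteriori (MAP) estimate: $(\hat{\beta}, \hat{\theta}) = \arg\max_{\beta, \theta}\; p(D \mid \beta, \theta)\, p(\beta \mid \gamma_1)\, p(\theta \mid \gamma_2)$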

27
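MAP parameter estimation via EM
We use an EM algorithm that finds MAP estimates using the hierarchical prior models and the latent variable interpretation of the probit model.
– $(N+1)$-dimensional vector function: $h_\theta(x) = [1, K_\theta(x, x_1), \dots, K_\theta(x, x_N)]^T$
– Random function: $z(x, \beta, \theta) = h_\theta^T(x)\, \beta + \epsilon$, with $\epsilon \sim N(0, 1)$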

28
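MAP parameter estimation via EM
If a classifier were to assign the label $y = 1$ to an example $x$ whenever $z(x, \beta, \theta) \ge 0$ and $y = 0$ whenever $z(x, \beta, \theta) < 0$, we would recover the probit model: $P(y = 1 \mid x) = \Phi\left(h_\theta^T(x)\, \beta\right)$
Consider the vector of missing variables: $z = [z_1, \dots, z_N]^T$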

29
MAP parameter estimation via EM
The EM algorithm produces a sequence of estimates $\hat{\beta}^{(t)}$ and $\hat{\theta}^{(t)}$ by alternating between the E-step and the M-step.
E-step
– Compute the expected value of the complete log-posterior, conditioned on the data $D$ and the current estimates of the parameters, $\hat{\beta}^{(t)}$ and $\hat{\theta}^{(t)}$.

30
MAP parameter estimation via EM –

31
M-step

32
Experimental Results
Each data set is initially normalized to have zero mean and unit variance.
The regularization parameters $\gamma_1$ and $\gamma_2$ (controlling the degrees of sparsity enforced on $\beta$ and $\theta$, respectively) are selected by cross-validation on the training data.
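The normalization step can be sketched for a single feature column as follows. The slides do not say whether the population or the sample variance is used; this sketch assumes the population variance, and the function name and toy values are hypothetical.

```python
import math


def standardize(column):
    """Rescale one feature column to zero mean and unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / n
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0] * n                 # constant feature: nothing to rescale
    return [(v - mean) / std for v in column]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # toy feature values (assumption)
scaled = standardize(raw)
```

In a cross-validation setting the mean and standard deviation would normally be computed on the training portion only and then applied to the held-out samples.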

33
Experimental Results Effect of irrelevant predictor variables – Generate synthetic data from one of two normal distributions with unit variance

34
Experimental Results
Results with high-dimensional gene expression data sets
– The strategy for learning a classifier is likely to be less relevant here than the choice of feature selection method.
– Two commonly analyzed data sets:
– The first contains expression levels of 7129 genes from 47 patients with acute myeloid leukemia (AML) and 25 patients with acute lymphoblastic leukemia (ALL).
– The second contains expression levels of 2000 genes from 40 tumor and 22 normal colon tissues.

35
Experimental Results – Full leave-one-out cross-validation procedure: Train on n-1 samples and test the obtained classifier on the remaining sample, and repeat this procedure for every sample.
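The leave-one-out procedure can be sketched generically. The nearest-centroid classifier below is only a stand-in so that the loop is runnable; it is not the classifier evaluated in the slides, and all names and the toy data are hypothetical.

```python
def loo_error(samples, labels, train, predict):
    """Leave-one-out error rate: train on n-1 samples, test on the held-out one."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


def train_centroids(xs, ys):
    """Stand-in learner: the mean of the one-dimensional inputs of each class."""
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict_centroid(model, x):
    """Assign the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


xs = [0.0, 0.2, 0.4, 2.0, 2.2, 2.4]      # toy one-dimensional inputs (assumption)
ys = [0, 0, 0, 1, 1, 1]
error_rate = loo_error(xs, ys, train_centroids, predict_centroid)
```

Any step that depends on the labels, including feature selection and the choice of regularization parameters, belongs inside the `train` callback so that the held-out sample never influences it.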

36
Experimental Results Results with low-dimensional benchmark data sets

37
Experimental Results
– Our algorithm did not perform any feature selection in the Ripley data set, selected five out of eight variables in the Pima data set, and selected three out of five variables in the Crabs data set.
– JCFO is also very sparse in its utilization of kernel functions: it chooses an average of 4 out of 100, 5 out of 200, 5 out of 80, and 6 out of 300 kernels for the Ripley, Pima, Crabs, and WBC data sets, respectively.

38
Conclusion
– JCFO uses sparsity-promoting priors that encourage many of the $\beta_i$ and $\theta_k$ to be exactly zero.
– Experimental results indicate that, for high-dimensional data with many irrelevant features, the classification accuracy of JCFO is likely to be statistically superior to that of other methods.
