# Technical University of Lisbon

Feature Selection for Supervised and Unsupervised Learning. Mário A. T. Figueiredo, Institute of Telecommunications and Instituto Superior Técnico, Technical University of Lisbon, Portugal. The work reported here was done in collaboration with L. Carin, B. Krishnapuram, and A. Hartemink (Duke University), and A. K. Jain and M. Law (Michigan State University).

Part I Supervised Learning
Outline

Part I, Supervised Learning: Introduction; Review of LASSO Regression; The LASSO Penalty for Multinomial Logistic Regression; Bound Optimization Algorithms (Parallel and Sequential); Non-linear Feature Weighting/Selection; Experimental Results.

Part II, Performance Bounds.

Part III, Unsupervised Learning: Introduction; Review of Model-Based Clustering with Finite Mixtures; Feature Saliency; An EM Algorithm to Estimate Feature Saliency; Model Selection; Experimental Results.

Supervised Learning. Goal: to learn a functional dependency $y = f(x, \beta)$, where $\beta$ is a set of parameters, from a set of examples (the training data) $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Discriminative (non-generative) approach: no attempt is made to model the joint density $p(x, y)$.

Complexity Control Via Bayes Point Estimation
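Good generalization requires complexity control. Bayesian (point estimation) approach: a prior $p(\beta)$ controls the “complexity” of $f$. Combined with the likelihood function $p(D \mid \beta)$, it gives the maximum a posteriori (MAP) point estimate $\hat\beta = \arg\max_\beta\, p(D \mid \beta)\, p(\beta)$. Prediction for a new “input” $x_*$: $\hat y_* = f(x_*, \hat\beta)$.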

Bayes Point Estimate Versus Fully Bayesian Approach
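Point prediction for a new “input” $x_*$ plugs in the point estimate $\hat\beta$. A fully Bayesian prediction would instead average the prediction over the posterior of $\beta$. We will not consider fully Bayesian approaches here.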

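Linear (w.r.t. β) Regression
We consider functions that are linear w.r.t. $\beta$: $f(x, \beta) = \sum_{i=1}^{k} \beta_i h_i(x) = h(x)^T \beta$, where $h(x) = [h_1(x), \ldots, h_k(x)]^T$ is some dictionary of functions, e.g., radial basis functions, splines, wavelets, polynomials. Notable particular cases: linear regression, where $h(x)$ contains the components of $x$; kernel regression, where $h_i(x) = K(x, x_i)$, as in the SVM, RVM, etc.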

Likelihood Function for Regression
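Likelihood function for a Gaussian observation model: $p(y \mid \beta) = \mathcal{N}(y \mid H\beta, \sigma^2 I)$, where $H$ is the design matrix, whose $j$-th row is $h(x_j)^T$. Assuming that $y$ and the columns of $H$ are centered, we drop the constant (intercept) term w.l.o.g. The maximum likelihood / ordinary least squares estimate is $\hat\beta = (H^T H)^{-1} H^T y$, which is undetermined if $H^T H$ is singular, e.g., when there are more basis functions than examples.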

Bayesian Point Estimates: Ridge and the LASSO
With a Gaussian prior on $\beta$, the MAP estimate is called ridge regression, or weight decay (in neural-net parlance). With a Laplacian prior, the MAP estimate promotes sparseness of $\hat\beta$, i.e., its components are either significantly large or zero, which amounts to feature selection. Related work: LASSO regression [Tibshirani, 1996], pruning priors for NNs [Williams, 1995], basis pursuit [Chen, Donoho, Saunders, 1995].
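For reference, the two point estimates can be written in their standard textbook forms. The scaling of the penalty weight $\lambda$ (which absorbs the noise variance and the prior parameter) is an assumption made here, not necessarily the one used on the original slide.

```latex
% Ridge (Gaussian prior): closed-form solution
\hat\beta_{\mathrm{ridge}}
  = \arg\min_{\beta}\; \tfrac{1}{2}\|H\beta - y\|_2^2 + \tfrac{\lambda}{2}\|\beta\|_2^2
  = (\lambda I + H^T H)^{-1} H^T y

% LASSO (Laplacian prior): no closed form for general H
\hat\beta_{\mathrm{LASSO}}
  = \arg\min_{\beta}\; \tfrac{1}{2}\|H\beta - y\|_2^2 + \lambda\|\beta\|_1
```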

Algorithms to Compute the LASSO
Special-purpose algorithms: [Tibshirani, 1996], [Fu, 1998], [Osborne, Presnell, Turlach, 2000]. Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]: currently the best approach. For orthogonal $H$, there is a closed-form solution: the “soft threshold”. For more insight on the LASSO, see [Tibshirani, 1996].
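The closed-form solution for orthogonal $H$ can be sketched as follows. This is a minimal illustration that assumes the objective $\tfrac12\|y - H\beta\|^2 + \lambda\|\beta\|_1$ with $H^T H = I$; the function names are hypothetical.

```python
import numpy as np

def soft_threshold(b, lam):
    """Shrink each entry of b toward zero by lam; entries with |b| <= lam become zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_orthogonal(H, y, lam):
    """LASSO estimate when H has orthonormal columns (H.T @ H = I, an assumption)."""
    b_ols = H.T @ y          # OLS estimate reduces to H^T y in the orthonormal case
    return soft_threshold(b_ols, lam)
```

Coefficients whose OLS magnitude is below the threshold are set exactly to zero, which is the sparseness mechanism described above.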

EM Algorithm for the LASSO
The LASSO can be computed by EM using a hierarchical formulation: the $\beta_i$ are independent zero-mean Gaussians with variances $\tau_i$, and the $\tau_i$ have an exponential hyperprior. Treat the $\tau_i$ as missing data and apply standard EM. This leads to an update which can be called an iteratively reweighted ridge regression (IRRR). This possibility is mentioned in [Tibshirani, 1996]; it is not very efficient, but…
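A minimal sketch of such an iteratively reweighted ridge regression is given below. It assumes the objective $\tfrac12\|y - H\beta\|^2 + \lambda\|\beta\|_1$ (so $\lambda$ absorbs the noise variance and the prior parameter) and uses the rescaled form $\beta \leftarrow U (U H^T H U + \lambda I)^{-1} U H^T y$ with $U = \mathrm{diag}(|\beta_i|^{1/2})$, which avoids dividing by zero coefficients. The function name and the starting point are illustrative choices.

```python
import numpy as np

def lasso_irrr(H, y, lam, n_iter=200):
    """Iteratively reweighted ridge regression for
    min_beta 0.5*||y - H beta||^2 + lam*||beta||_1 (assumed scaling)."""
    k = H.shape[1]
    HtH, Hty = H.T @ H, H.T @ y
    # Start near the OLS estimate; the tiny ridge term only keeps the solve well-posed.
    beta = np.linalg.solve(HtH + 1e-8 * np.eye(k), Hty)
    for _ in range(n_iter):
        u = np.sqrt(np.abs(beta))                      # diagonal of U
        A = (u[:, None] * HtH) * u[None, :] + lam * np.eye(k)
        beta = u * np.linalg.solve(A, u * Hty)         # U (U H'H U + lam I)^{-1} U H'y
    return beta
```

Each iteration is a ridge regression whose per-coefficient weights come from the previous estimate. Coefficients shrink toward zero geometrically rather than hitting it in one step, which is one reason the scheme is not very efficient.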

About γ. The previous derivation opens the door to other priors, for example a Jeffreys prior on the variances $\tau_i$, which removes the hyperparameter. The EM algorithm is still of the IRRR type. Interestingly, it is similar to the FOCUSS algorithm for regression with an $\ell_p$ penalty [Kreutz-Delgado & Rao, 1998]. Strong sparseness! Problem: the objective is non-convex, so results depend on initialization. Possibility: initialize with the OLS estimate.
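As a hedged sketch of what the Jeffreys-prior update looks like: assuming $p(\tau_i) \propto 1/\tau_i$, the E-step gives $E[\tau_i^{-1} \mid \hat\beta_i] = 1/\hat\beta_i^2$, and the M-step is a ridge regression with those weights. The notation ($U$, $\sigma^2$) is chosen here for illustration.

```latex
U^{(t)} = \mathrm{diag}\big(|\hat\beta^{(t)}_1|, \ldots, |\hat\beta^{(t)}_k|\big)

\hat\beta^{(t+1)}
  = \big(H^T H + \sigma^2 (U^{(t)})^{-2}\big)^{-1} H^T y
  = U^{(t)} \big(\sigma^2 I + U^{(t)} H^T H\, U^{(t)}\big)^{-1} U^{(t)} H^T y
```

The second form avoids inverting $U^{(t)}$, so coefficients that have reached zero simply stay there.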

Some Results. Same $\beta$ vectors, design matrices, and experimental procedure as in [Tibshirani, 1996]. Model error (ME) improvement w.r.t. the OLS estimate: close to the best in each case, without any cross-validation; more results in [Figueiredo, NIPS’2001] and [Figueiredo, PAMI’2003].

Classification via Logistic Regression
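Recall that $h(x)$ may denote the components of $x$, or other (nonlinear) functions of $x$, such as kernels. Binary classification: a single weight vector defines the class probability through the logistic function. Multi-class, with “1 of m” encoding: $y$ is a binary vector whose $i$-th entry is 1 when the example belongs to class $i$, and there is one weight vector per class.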

Classification via Logistic Regression
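Since the class probabilities sum to one, we can set the weight vector of one class (say, the last) to zero w.l.o.g. Parameters to estimate: the weight vectors of the remaining $m - 1$ classes. Maximum log-likelihood estimate: if the training data is separable, the estimate is unbounded, thus undefined.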
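The multinomial logistic model with the last class used as reference can be sketched as below. The array layout (`B` holding one row of weights per non-reference class) and the function name are assumptions made for illustration.

```python
import numpy as np

def class_probabilities(B, h):
    """Multinomial logistic probabilities for one input.

    B: array of shape (m-1, k), weights of classes 1..m-1
       (class m has its weights fixed to zero, w.l.o.g.).
    h: array of shape (k,), the dictionary evaluated at the input.
    """
    scores = np.append(B @ h, 0.0)   # score of the reference class is 0
    scores = scores - scores.max()   # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```

With all weights at zero the model returns the uniform distribution over the $m$ classes, which is a convenient sanity check.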

Penalized (point Bayes) Logistic Regression
Penalized (or point Bayes MAP) estimate: maximize the log-likelihood plus the log-prior. A Gaussian prior gives penalized (ridge-type) logistic regression. A Laplacian prior (LASSO prior) favors sparseness, i.e., feature selection. For linear regression it does; what about for logistic regression?

Laplacian Prior for Logistic Regression
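Simple test with 2 training points, one from class 1 and one from class -1. As … decreases, … becomes less relevant. [Figure: decision boundaries of linear logistic regression with a Laplacian prior and with a Gaussian prior, for two configurations of the class 1 and class -1 points.]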

Algorithms for Logistic Regression
The estimate maximizes the log-likelihood plus the log-prior. Standard algorithm: Newton-Raphson, a.k.a. iteratively reweighted least squares (IRLS). IRLS is easily applied without any prior or with a Gaussian prior. IRLS is not applicable with a Laplacian prior: the log-prior is not differentiable at the origin. Alternative: bound optimization algorithms. For ML logistic regression: [Böhning & Lindsay, 1988], [Böhning, 1992]. More general formulations: [de Leeuw & Michailides, 1993], [Lange, Hunter, & Yang, 2000].

Bound Optimization Algorithms (BOA)
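Optimization problem: $\hat\beta = \arg\max_\beta L(\beta)$. A bound function $Q(\beta \mid \beta')$ is such that $L(\beta) - Q(\beta \mid \beta') \ge 0$, with equality if and only if $\beta = \beta'$. Bound optimization algorithm: $\hat\beta^{(t+1)} = \arg\max_\beta Q(\beta \mid \hat\beta^{(t)})$. This is sufficient (in fact more than sufficient) to prove monotonicity: $L(\hat\beta^{(t+1)}) \ge L(\hat\beta^{(t)})$. Notes: $Q$ should be easy to maximize; EM is a BOA.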
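The monotonicity argument is a three-step chain. The symbols $L$ (objective), $Q$ (bound function), and $\hat\beta^{(t)}$ (current iterate) are the notation assumed here.

```latex
L\big(\hat\beta^{(t+1)}\big)
  \;\ge\; Q\big(\hat\beta^{(t+1)} \mid \hat\beta^{(t)}\big)   % Q is a lower bound on L
  \;\ge\; Q\big(\hat\beta^{(t)} \mid \hat\beta^{(t)}\big)     % \hat\beta^{(t+1)} maximizes Q(. | \hat\beta^{(t)})
  \;=\;   L\big(\hat\beta^{(t)}\big)                          % the bound is tight at \hat\beta^{(t)}
```

The middle step only needs $Q$ to increase, not to be maximized, which is why exact maximization is more than sufficient.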

Deriving Bound Functions
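There are many ways to obtain bound functions. For example, it is well known that Jensen’s inequality underlies EM. Via a Hessian bound: suppose $L(\beta)$ is concave, with Hessian bounded below, $\nabla^2 L(\beta) \succeq -B$ for all $\beta$, where $B$ is a positive definite matrix. Then $L(\beta) \ge L(\beta') + (\beta - \beta')^T g(\beta') - \tfrac12 (\beta - \beta')^T B\, (\beta - \beta')$, where $g$ is the gradient of $L$. We can use the r.h.s. as $Q(\beta \mid \beta')$.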

Quasi-Newton Monotonic Algorithm
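The update equation is simple to solve and leads to $\hat\beta^{(t+1)} = \hat\beta^{(t)} + B^{-1} g(\hat\beta^{(t)})$, where $g$ is the gradient of the objective and $-B$ is the lower bound on its Hessian. This is a quasi-Newton algorithm, with $-B$ replacing the Hessian. Unlike the Newton algorithm, it is monotonic.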
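As an illustration, here is a sketch of this monotonic quasi-Newton iteration for binary ML logistic regression. It assumes labels in {0, 1}, a full-column-rank design matrix `H`, and the standard binary-case bound $B = \tfrac14 H^T H$ (since $p(1-p) \le \tfrac14$). The names are hypothetical and the multinomial case treated in the talk is not covered.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, H, y):
    z = H @ beta
    return np.sum(y * z - np.logaddexp(0.0, z))

def bound_opt_logistic(H, y, n_iter=100):
    """Bound optimization for binary ML logistic regression.

    The Hessian -H' diag(p(1-p)) H is bounded below by -B, with B = 0.25 * H'H.
    """
    B_inv = np.linalg.inv(0.25 * H.T @ H)   # fixed matrix, inverted once
    beta = np.zeros(H.shape[1])
    history = [log_likelihood(beta, H, y)]
    for _ in range(n_iter):
        grad = H.T @ (y - sigmoid(H @ beta))
        beta = beta + B_inv @ grad          # quasi-Newton step with -B in place of the Hessian
        history.append(log_likelihood(beta, H, y))
    return beta, history
```

Because `B_inv` does not depend on the current estimate, each iteration costs only matrix-vector products, and `history` never decreases.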

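Application to ML Logistic Regression
For logistic regression, it can be shown [Böhning, 1992] that the Hessian is bounded below by a fixed matrix built with a Kronecker product. The gradient is also easy to compute, and both are finally plugged into the update equation. Under a ridge-type Gaussian prior, the matrix inverse required by the update can be computed off-line.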

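Application to LASSO Logistic Regression
For LASSO logistic regression, the log-likelihood is already bounded via the Hessian bound; we still need a bound for the log-prior, and a quadratic bound does the job. It is easy to show that, for any $\beta'_i \ne 0$, $-|\beta_i| \ge -\tfrac12 \left( \beta_i^2 / |\beta'_i| + |\beta'_i| \right)$, with equality iff $|\beta_i| = |\beta'_i|$.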
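The quadratic bound on the Laplacian log-prior follows from a one-line argument. The scalar symbols $\beta_i$ (variable) and $\beta'_i \neq 0$ (current value) are the notation assumed here.

```latex
\big(|\beta_i| - |\beta'_i|\big)^2 \ge 0
\;\Longrightarrow\;
\beta_i^2 + \beta_i'^2 \ge 2\,|\beta_i|\,|\beta'_i|
\;\Longrightarrow\;
-|\beta_i| \;\ge\; -\frac{1}{2}\left(\frac{\beta_i^2}{|\beta'_i|} + |\beta'_i|\right)
```

The right-hand side is quadratic in $\beta_i$, so adding it to the quadratic Hessian bound on the log-likelihood gives a ridge-type surrogate.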

Application to LASSO Logistic Regression
After dropping additive terms, we obtain a quadratic bound function. The update equation is an IRRR, which can be rewritten in a numerically more convenient form.

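Application to LASSO Logistic Regression
The update equation has a computational cost that is cubic in the number of parameters, because a linear system must be solved at each iteration. This may not be OK for kernel classification with a large training set; it is OK for linear classification if the number of features is not too large. This is the cost of standard IRLS for ML logistic regression, but now with a Laplacian prior.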

Sequential Update Algorithm for LASSO Logistic Regression
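Recall that the objective is the log-likelihood plus the log-prior. Let’s bound only the log-likelihood via the Hessian bound, leaving the log-prior as it is, and maximize only w.r.t. one component of $\beta$ at a time.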

Sequential Update Algorithm for LASSO Logistic Regression
The update equation has a simple closed-form expression (a soft threshold). It can be shown that updating all components has a cost that may be much less than that of the parallel update. It usually also uses fewer iterations, since we do not bound the prior and the algorithm is “incremental”.
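A sketch of the component-wise update for the binary case is shown below. It assumes labels in {0, 1}, the objective $\ell(\beta) - \lambda\|\beta\|_1$, the diagonal Hessian-bound entries $b_l = \tfrac14 \sum_j H_{jl}^2$, and a simple cyclic schedule. The names are hypothetical and the multinomial version in the talk differs in bookkeeping.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft(x, t):
    """Scalar soft threshold."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def sequential_lasso_logistic(H, y, lam, n_sweeps=50):
    """Cyclic component-wise bound optimization for binary LASSO logistic regression."""
    n, k = H.shape
    beta = np.zeros(k)
    b = 0.25 * np.sum(H ** 2, axis=0)     # curvature bound for each component
    z = H @ beta                          # linear predictor, kept up to date
    for _ in range(n_sweeps):
        for l in range(k):
            g_l = H[:, l] @ (y - sigmoid(z))           # l-th gradient entry
            new = soft(beta[l] + g_l / b[l], lam / b[l])
            z += H[:, l] * (new - beta[l])             # O(n) refresh of the predictor
            beta[l] = new
    return beta
```

Each component update costs O(n), so a full sweep costs O(nk), with no matrix inversion. Components whose gradient never exceeds the penalty weight stay exactly at zero.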

Sequential Update Algorithm for Ridge Logistic Regression
With a Gaussian prior, the update equation also has a simple closed-form expression. With a zero penalty weight, we get the update rule for ML logistic regression. Important issue: how to choose the order of update, i.e., which component to update next? In all the results presented, we use a simple cyclic schedule. We’re currently investigating other alternatives (good, but cheap).

Related Work.

Sequential update for the relevance vector machine (RVM) [Tipping & Faul, 2003]. Comments: the objective function of the RVM is not concave; results may depend critically on initialization and order of update.

Kernel logistic regression; the import vector machine (IVM) [Zhu & Hastie, 2001]. Comments: sparseness is not encouraged in the objective function (Gaussian prior) but by early stopping of a greedy algorithm.

Efficient algorithm for the SVM with $\ell_1$ penalty [Zhu, Rosset, Hastie, & Tibshirani, 2003]. Comments: efficient, though not simple, algorithm; the SVM objective is different from the logistic regression objective.

Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]. Comments: as far as we know, not yet (or not easily) applied to logistic regression.

Experimental Results. Three standard benchmark datasets: Crabs, Iris, and Forensic Glass. Three well-known gene expression datasets: AML/ALL, Colon, and Yeast. Penalty weight adjusted by cross-validation. Comparison with state-of-the-art classifiers: RVM and SVM (SVMlight). [Table: summary of datasets.] Table notes: no CV, fixed standard split; not exactly CV, but 30 different 50/12 splits.

Experimental Results. All kernel classifiers (RBF Gaussian and linear). For RBF, the width was tuned by CV with the SVM and then used for all other methods. [Tables: number of errors and number of kernels for each method.] BMSLR: Bayesian multinomial sparse logistic regression. BMGLR: Bayesian multinomial Gaussian logistic regression. Linear classification of AML/ALL (no kernel): 1 error, … (of 7129) features (genes) selected. [Krishnapuram, Carin, Hartemink, and Figueiredo, 2004 (submitted)]

Non-linear Feature Selection
We have considered a fixed dictionary of functions, i.e., feature selection is done on parameters appearing linearly or “generalized linearly”. Let us now look at “non-linear parameters”, i.e., parameters inside the dictionary.

Non-linear Feature Selection
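For logistic regression, we need to further constrain the problem. Consider, for kernels, parameterizations of the type: polynomial, $K_\theta(x, x') = (1 + \sum_l \theta_l x_l x'_l)^r$; Gaussian, $K_\theta(x, x') = \exp(-\sum_l \theta_l (x_l - x'_l)^2)$, with $\theta_l \ge 0$.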

Feature Scaling/Selection
This corresponds to a different scaling of each original feature. Sparseness of $\theta$ means feature selection. We can also adopt a Laplacian prior for $\theta$. Estimation criterion: logistic log-likelihood plus the log-priors on $\beta$ and $\theta$.

EM Algorithm for Feature Scaling/Selection
Optimization problem: joint maximization w.r.t. $\beta$ and $\theta$. We again use a bound optimization algorithm (BOA), with the Hessian bound for the log-likelihood and the quadratic bound for the log-priors. Maximizing the bound jointly can’t be done in closed form. It is easy to maximize w.r.t. $\beta$, with $\theta$ fixed. Maximization w.r.t. $\theta$ is done by conjugate gradient; the necessary gradients are easy to derive. JCFO: joint classifier and feature optimization [Krishnapuram, Carin, Hartemink, and Figueiredo, IEEE-TPAMI, 2004].

Experimental Results on Gene Expression Data (Full LOOCV)
Accuracy (%) by method, listed as AML/ALL / Colon. Rows with a single number show the only value present in this transcript.

- Boosting: 95.8 / 72.6
- SVM (linear kernel): 94.4 / 77.4
- SVM (quadratic kernel): 74.2
- RVM (no kernel): 97.2 / 88.7
- Logistic (no kernel): 71.0
- Sparse probit (quadr. kernel): 84.6
- Sparse probit (linear kernel): 91.9
- JCFO (quadratic kernel): 98.6
- JCFO (linear kernel): 100 / 96.8

[Ben-Dor et al, 2000] [Krishnapuram et al, 2002]

Typically around 25 to 30 genes are selected, i.e., have a non-zero scaling parameter.

Top 12 Genes for AML/ALL (sorted by mean |θi|)
[Table: top 12 genes; five entries are marked with *.] Starred genes agree with [Golub et al, 1999]; many others are in the top 25. Antibodies to MPO are used in clinical diagnosis of AML.

Top 12 Genes for Colon (sorted by mean θi)
[Table: top 12 genes; five entries are marked with *.] Starred genes are known to be implicated in colon cancer.

PART II Non-trivial Bounds for Sparse Classifiers

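Introduction. Training data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ (here, we consider only binary problems), assumed to be i.i.d. from an underlying distribution $P(x, y)$. Given a classifier $f$: the true generalization error is $P(f(x) \ne y)$ (not computable, since $P$ is unknown); the sample error is the fraction of training examples that $f$ misclassifies. Key question: how are the two related?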

PAC Performance Bounds
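PAC (probably approximately correct) bounds are of the form: with probability at least $1 - \delta$ over the training sample, the true error does not exceed some bound. They hold independently of the underlying distribution. Usually, bounds have the form: true error ≤ sample error plus a complexity term, uniformly over the class of classifiers.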

PAC Performance Bounds
There are several ways to derive PAC bounds:
- Vapnik-Chervonenkis (VC) theory (see, e.g., [Vapnik, 1998]). VC usually leads to trivial bounds (>1, unless n is huge).
- Compression arguments [Graepel, Herbrich, & Shawe-Taylor, 2000]. Compression bounds are not applicable to point sparse classifiers of the type presented here, or of the RVM type [Herbrich, 2000].

We apply PAC-Bayesian bounds [McAllester, 1999], [Seeger, 2002].

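Some Definitions. $\hat\beta$: some point “estimate” of $\beta$. Let $q(\beta \mid \hat\beta)$ be a Laplacian centered at $\hat\beta$; we’ll call this the “posterior”, although not in the usual sense. Point classifier (PC) at $\hat\beta$: the one we’re interested in. Gibbs classifier (GC) at $\hat\beta$: classifies using a sample from $q$. Bayes voting classifier (BVC) at $\hat\beta$: classifies by majority vote under $q$.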

Key Lemmas. Lemma 1: for any $\hat\beta$, the decision of the PC at $\hat\beta$ is the same as that of a BVC based on any symmetric posterior centered on $\hat\beta$. Proof: a simple pairing argument (see, e.g., [Herbrich, 2002]). Lemma 2: for any “posterior” $q$, the generalization error of the BVC is less than twice that of the GC. Proof: see [Herbrich, 2002]. Conclusion: we can use PAC-Bayesian bounds for the GC for our PC.

PAC-Bayesian Theorem. Let $p(\beta)$ be our prior (meaning it is independent of $D$). Let $q(\beta)$ be our posterior (meaning it may depend on $D$). We consider the generalization error of a Gibbs classifier and the expected sample/empirical error of a Gibbs classifier, both averaged over $q$. McAllester’s PAC-Bayesian theorem relates these two errors.

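PAC-Bayesian Theorem. Theorem: with the quantities defined above, an inequality holds with probability at least $1 - \delta$ over random training samples of size $n$. Its left-hand side is the Kullback-Leibler divergence between two Bernoulli distributions (with parameters given by the expected empirical error and the generalization error of the Gibbs classifier), and its right-hand side involves the Kullback-Leibler divergence between posterior and prior.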

Tightening the PAC-Bayesian Theorem
With our Laplacian prior and posterior, the KLD between them has a closed-form expression. Due to the convexity of the KLD, it is easy to (numerically) find its minimizer, and we can choose the free parameter accordingly.

Using the PAC-Bayesian Theorem
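Set a prior parameter and choose a confidence level $\delta$. Using this prior and the training data, find a point estimate $\hat\beta$. Find … such that …, and evaluate the corresponding expected empirical error of the Gibbs classifier by Monte Carlo. With this …, define the “posterior” as above. From these, we know that the inequality of the theorem holds with probability at least $1 - \delta$.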

Using the PAC-Bayesian Theorem
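To obtain an explicit bound on the generalization error of the Gibbs classifier, the largest error value compatible with the inequality can easily be found numerically. This bound is always non-trivial, i.e., strictly less than 1.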
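The numerical inversion can be sketched with a bisection on the Bernoulli KL divergence. The right-hand side used in `pac_bayes_rhs` is a commonly stated version of the theorem, $[\mathrm{KL}(q\|p) + \ln((n+1)/\delta)]/n$, and is an assumption here; the exact constant on the original slide is not available. All function names are hypothetical.

```python
import math

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    eps = 1e-12
    b = min(max(b, eps), 1.0 - eps)          # keep the logs finite
    t1 = 0.0 if a == 0.0 else a * math.log(a / b)
    t2 = 0.0 if a == 1.0 else (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return t1 + t2

def pac_bayes_rhs(kl_post_prior, n, delta):
    """Assumed right-hand side of the PAC-Bayesian inequality."""
    return (kl_post_prior + math.log((n + 1) / delta)) / n

def kl_inverse_upper(emp_err, rhs, tol=1e-10):
    """Largest e >= emp_err with kl_bernoulli(emp_err, e) <= rhs, found by bisection."""
    lo, hi = emp_err, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_err, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi   # upper end of the bracket, so the returned bound is conservative
```

Because the Bernoulli KL divergence grows without limit as its second argument approaches 1, the value returned is strictly below 1 for any finite right-hand side, which is the sense in which the bound is never trivial.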

Using the PAC-Bayesian Theorem
Finally, notice that the PAC-Bayesian bound applies to the Gibbs classifier, but recall Lemma 2. Lemma 2: for any “posterior” , the generalization error of the BVC is less than twice that of the GC. In practice, we have observed that the BVC usually generalizes as well as (often much better than) the GC. We believe the factor 2 can be reduced, but we have not yet been able to show it.

Example of PAC-Bayesian Bound
“Mines” dataset Maybe tight enough to guide selection of

Conclusions for Part II
PAC-Bayesian bound for sparse classifiers. The bound (unlike VC bounds) is always non-trivial. Tightness still requires large sample sizes. Future goals: tightening the bounds, as always.

PART III Feature Selection in Unsupervised Learning

Feature Selection in Unsupervised Learning
FS is a widely studied problem in supervised learning. FS is not widely studied in unsupervised learning (clustering). A good reason: in the absence of labels, how do you assess the usefulness of each feature? We address this problem in the context of model-based clustering using finite mixtures. Approach: how relevant is each feature (component of x) for the mixture nature of the data?

Example of Relevant and Irrelevant Features
x2 is irrelevant for the mixture nature of this data. x1 is relevant for the mixture nature of this data. Any PCA-type analysis of this data would not be useful here.

Interplay Between Number of Clusters and Features
Example: using only x1, we find 2 components. Using x1 and x2, we find 7 components (needed to fit the non-Gaussian density of x2). [Figure: the data and its marginals.]

Approaches to Feature Selection
Most classical FS methods for supervised learning require combinatorial searches. For d features, there are $2^d$ possible feature subsets. Alternative: assign real-valued feature weights and encourage sparseness (as seen above for supervised learning). [Law, Figueiredo and Jain, TPAMI, 2004 (to appear)]

Maximum Likelihood Estimation and Missing Data
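Training data: $Y = \{y_1, \ldots, y_n\}$. Maximum likelihood estimate of the mixture parameters $\theta$ (mixing probabilities $\alpha_i$ and component parameters). Missing data: the component labels, in one-of-k encoding (entry $i$ equals 1 when the sample comes from component $i$). The “complete” log-likelihood would be easy to maximize, if we had the missing labels.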

EM Algorithm for Maximum Likelihood Estimation
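E-step: compute the conditional expectation of the missing labels given the data and the current parameters. Because the labels are binary, this expectation is the current estimate of the probability that $y_j$ was produced by component $i$. M-step: update the parameters using these probabilities as weights.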
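The E-step and M-step can be sketched for a one-dimensional Gaussian mixture as follows. The univariate setting, the quantile-based initialization, and the small variance floor are illustrative assumptions, not details taken from the talk.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100):
    """Standard EM for a k-component univariate Gaussian mixture."""
    n = len(x)
    alpha = np.full(k, 1.0 / k)                        # mixing probabilities
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)      # spread the initial means
    var = np.full(k, np.var(x))
    for _ in range(n_iter):
        # E-step: w[j, i] = probability that x[j] was produced by component i
        log_w = (np.log(alpha) - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_w -= log_w.max(axis=1, keepdims=True)      # stabilize before exponentiating
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        nk = w.sum(axis=0)
        alpha = nk / n
        mu = (w * x[:, None]).sum(axis=0) / nk
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-8
    return alpha, mu, var
```

The responsibilities `w` play the role of the expected one-of-k labels, and each M-step quantity is a weighted sample statistic.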

Model Selection for Mixtures
Important issue: how to select k (the number of components)? [Figueiredo & Jain, 2002]: an MML-based approach; roughly, an MDL/BIC criterion with a careful definition of the amount of data from which each parameter is estimated. The resulting criterion leads to a simple modification of EM: the original M-step expression for the mixing probabilities is replaced by a new expression that involves the number of parameters of each component. This update “kills” weak components.
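A sketch of the modified mixing-probability update, in the form given by [Figueiredo & Jain, 2002]: each component's total responsibility is reduced by half the number of parameters per component and clipped at zero before normalizing. The function and argument names are hypothetical, and the sketch assumes at least one component survives.

```python
import numpy as np

def mml_mixing_weights(w, n_params):
    """Modified M-step for the mixing probabilities.

    w: (n, k) responsibilities from the E-step.
    n_params: number of parameters of each component.
    """
    support = np.maximum(0.0, w.sum(axis=0) - n_params / 2.0)
    return support / support.sum()     # assumes support.sum() > 0
```

A component supported by fewer than `n_params / 2` effective samples gets a mixing probability of exactly zero and drops out of the mixture.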

Feature Selection for Mixtures
Simplifying assumption: in each component, the features are independent; $\theta_{il}$ denotes the parameters of the density of the $l$-th feature in component $i$. Let some features have a common density in all components; these features are “irrelevant”. A binary indicator $\phi_l$ equals 1 if feature $l$ is relevant and 0 if feature $l$ is irrelevant. $q(y_l \mid \lambda_l)$: common (w.r.t. $i$) densities of the irrelevant features.

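Feature Saliency. To apply EM, we see the indicators $\phi_l$ as missing data and define $\rho_l = P(\phi_l = 1)$; we call it “feature saliency”. It can be shown that the resulting likelihood (marginal w.r.t. $\phi$) is again a mixture, in which the density of each feature is itself a two-term mixture of its component-specific and common versions.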
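In the notation of [Law, Figueiredo and Jain, TPAMI, 2004], which is assumed here ($\alpha_i$ mixing probabilities, $\rho_l$ saliencies, $p(\cdot \mid \theta_{il})$ component-specific densities, $q(\cdot \mid \lambda_l)$ common densities), the marginal likelihood of one sample $y = (y_1, \ldots, y_d)$ reads:

```latex
p(y \mid \theta)
  = \sum_{i=1}^{k} \alpha_i \prod_{l=1}^{d}
    \Big[ \rho_l \, p(y_l \mid \theta_{il}) + (1 - \rho_l)\, q(y_l \mid \lambda_l) \Big]
```

With $\rho_l = 1$ for every feature this reduces to the usual mixture with independent features, and with $\rho_l = 0$ feature $l$ contributes the same factor to every component.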

Applying EM. We address the estimation problem by EM, using the component labels and the feature indicators as missing data. In addition to the variables defined above, the E-step now also involves two further sets of variables, both easily computed in closed form.

Applying EM. Assume that the component-specific and the common feature densities are both univariate Gaussian with arbitrary mean and variance. M-step (omitting the variances): weighted-average updates for the mean in each density.

Model Selection. To perform feature selection, we want to encourage some of the saliencies to become either 0 or 1. This can be achieved with the same MML-type criterion used above to select k. The modified M-step involves the number of parameters in each of the two feature densities.

Synthetic Example. Mixture of 4 Gaussians (with identity covariance) with d = 10: 2 relevant features and 8 irrelevant features. [Figure: samples, projected on the first two dimensions.]

[Figure: initial and final mixture estimates, and the common density.]

Synthetic Example. Feature saliency values (mean ± s.d.) over 10 runs. [Figure: saliencies of the relevant features and of the irrelevant features.]

Real Data. Several standard benchmark data sets:

| Name | n | d | k |
|---|---|---|---|
| Wine | 178 | 13 | 3 |
| Wisconsin breast cancer | 569 | 30 | 2 |
| Image segmentation | 2320 | 18 | 7 |
| Texture classification | 4000 | 19 | 4 |

- These are standard data sets for supervised classification.
- We fit mixtures, ignoring the labels.
- We classify the points and compare to the labels.

Real Data: Results

| Name | % error (sd), with FS | % error (sd), without FS |
|---|---|---|
| Wine | 6.61 (3.91) | 8.06 (3.73) |
| Wisconsin breast cancer | 9.55 (1.99) | (2.70) |
| Image segmentation | (1.54) | (5.10) |
| Texture classification | 4.04 (0.76) | 4.85 (0.98) |

For these data sets, our approach is able to improve the performance of mixture-based unsupervised classification.

Research Directions.

In supervised learning: more efficient algorithms for logistic regression with LASSO prior; investigating the performance of generalized Gaussian priors with exponents other than 1 (LASSO) or 2 (ridge); deriving performance bounds for this type of approach.

In unsupervised learning: more efficient algorithms; removing the conditional independence assumption; extension to other mixtures (e.g., multinomial for categorical data).