Feature Selection for Supervised and Unsupervised Learning Mário A. T. Figueiredo Institute of Telecommunications, and Instituto Superior Técnico Technical University of Lisbon PORTUGAL Work herein reported was done in collaboration with: L. Carin, B. Krishnapuram, and A. Hartemink, Duke University; A. K. Jain and M. Law, Michigan State University.
UFL, January 2004 M. Figueiredo, IST Outline
Part I, Supervised Learning: 1. Introduction 2. Review of LASSO Regression 3. The LASSO Penalty for Multinomial Logistic Regression 4. Bound Optimization Algorithms (Parallel and Sequential) 5. Non-linear Feature Weighting/Selection 6. Experimental Results
Part II, Performance Bounds
Part III, Unsupervised Learning: 1. Introduction 2. Review of Model-Based Clustering with Finite Mixtures 3. Feature Saliency 4. An EM Algorithm to Estimate Feature Saliency 5. Model Selection 6. Experimental Results
Supervised Learning. Goal: to learn a functional dependency, governed by a set of parameters, from a set of examples (the training data). Discriminative (non-generative) approach: no attempt to model the joint density of inputs and outputs.
Complexity Control Via Bayes Point Estimation. Good generalization requires complexity control. Bayesian (point estimation) approach: the prior controls the "complexity" of the learned dependency; combined with the likelihood function, it yields the maximum a posteriori (MAP) point estimate of the parameters, which is then used for prediction for a new "input".
Bayes Point Estimate Versus Fully Bayesian Approach. Point prediction for a new "input" plugs in the point estimate; fully Bayesian prediction instead averages over the posterior of the parameters. We will not consider fully Bayesian approaches here.
Linear (w.r.t. the parameters) Regression. We consider functions which are linear w.r.t. the parameters, built from some dictionary of functions, e.g., radial basis functions, splines, wavelets, polynomials. Notable particular cases: linear regression; kernel regression, as in SVM, RVM, etc.
Likelihood Function for Regression. Likelihood function for a Gaussian observation model, where H is the design matrix. Assuming that y and the columns of H are centered, we drop the constant term w.l.o.g. The maximum likelihood / ordinary least squares estimate is undetermined if H^T H is singular.
Bayesian Point Estimates: Ridge and the LASSO. With a Gaussian prior, the MAP estimate is called ridge regression, or weight decay (in neural nets parlance). With a Laplacian prior, it is LASSO regression [Tibshirani, 1996], related to pruning priors for NNs [Williams, 1995] and basis pursuit [Chen, Donoho, Saunders, 1995]. The Laplacian prior promotes sparseness of the estimate, i.e., its components are either significantly large or zero, which amounts to feature selection.
Algorithms to Compute the LASSO. Special purpose algorithms: [Tibshirani, 1996], [Fu, 1998], [Osborne, Presnell, Turlach, 2000]. For orthogonal H there is a closed-form solution: the "soft threshold". For more insight on the LASSO, see [Tibshirani, 1996]. Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]; currently the best approach.
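The closed-form case can be illustrated with a small sketch. It assumes an orthonormal design (H^T H = I) and the scalings 0.5*||y - H b||^2 + lam*||b||_1 for the LASSO and 0.5*||y - H b||^2 + 0.5*lam*||b||^2 for ridge; the slide's own scaling was not preserved, so the threshold value and the function names here are illustrative only.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t and clip at zero (LASSO, orthonormal design)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_shrink(z, lam):
    """Ridge solution for an orthonormal design: uniform rescaling, never exactly zero."""
    return z / (1.0 + lam)


# Hypothetical OLS coefficients z = H^T y under the orthonormality assumption.
ols = [3.0, -0.4, 1.2, 0.05]
lam = 0.5
lasso = [soft_threshold(z, lam) for z in ols]
ridge = [ridge_shrink(z, lam) for z in ols]
```

Under these assumptions the LASSO sets the two small coefficients exactly to zero, while ridge only rescales all four, which is the sparseness contrast the slides draw between the Laplacian and Gaussian priors.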
EM Algorithm for the LASSO. The LASSO can be computed by EM using a hierarchical formulation of the Laplacian prior with independent hidden variables. This possibility is mentioned in [Tibshirani, 1996]; not very efficient, but… Treat the hidden variables as missing data and apply standard EM. This leads to an update which can be called an iteratively reweighted ridge regression (IRRR).
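A minimal sketch of an iteratively reweighted ridge regression follows. It is not taken from the slides (their update equation was lost); it assumes the objective 0.5*||y - H b||^2 + lam*||b||_1 and uses the reweighting b = U (lam*I + U H^T H U)^{-1} U H^T y with U = diag(|b_i|^(1/2)), which follows from the quadratic bound |b| <= b^2/(2|b_old|) + |b_old|/2. The function name and the OLS initialization are choices made for this example.

```python
import numpy as np


def lasso_irrr(H, y, lam, n_iter=200):
    """IRRR sketch for min_b 0.5*||y - H b||^2 + lam*||b||_1 (assumed scaling)."""
    beta = np.linalg.lstsq(H, y, rcond=None)[0]   # initialize with the OLS estimate
    k = H.shape[1]
    HtH, Hty = H.T @ H, H.T @ y
    for _ in range(n_iter):
        u = np.sqrt(np.abs(beta))                 # diagonal of U = diag(|beta_i|^(1/2))
        # A = lam*I + U H^T H U stays well conditioned even when some beta_i -> 0.
        A = lam * np.eye(k) + (u[:, None] * HtH) * u[None, :]
        beta = u * np.linalg.solve(A, u * Hty)    # reweighted ridge step
    return beta


# Orthonormal toy design, so the limit should match soft thresholding at lam.
H = np.eye(3)
y = np.array([3.0, 0.5, -2.0])
beta = lasso_irrr(H, y, lam=1.0)
```

Writing the update in terms of U rather than diag(1/|b_i|) avoids dividing by coefficients that are shrinking to zero.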
About. The previous derivation opens the door to other priors, for example a Jeffreys prior. The EM algorithm becomes (still IRRR-type) interestingly similar to the FOCUSS algorithm for penalized regression [Kreutz-Delgado & Rao, 1998]. Strong sparseness! Problem: non-convex objective, so results depend on initialization. Possibility: initialize with the OLS estimate.
Some Results. Same vectors, design matrices, and experimental procedure as in [Tibshirani, 1996]. Model error (ME) improvement w.r.t. the OLS estimate: close to best in each case, without any cross-validation. More results in [Figueiredo, NIPS'2001] and [Figueiredo, PAMI'2003].
Classification via Logistic Regression. Binary classification, and multi-class classification with "1 of m" encoding of the class. Recall that the features may denote the components of the input, or other (nonlinear) functions of it, such as kernels.
Classification via Logistic Regression. Since the class probabilities must sum to one, the parameters of one class can be fixed w.l.o.g. The remaining parameters are estimated by maximum log-likelihood. If the training data are separable, the maximizer is unbounded, thus the ML estimate is undefined.
Penalized (point Bayes) Logistic Regression. Penalized (or point Bayes MAP) estimate: log-likelihood plus log-prior. Gaussian prior: penalized logistic regression. Laplacian prior (LASSO prior): favors sparseness, hence feature selection. For linear regression it does; what about for logistic regression?
Laplacian Prior for Logistic Regression. Simple test with 2 training points (class 1 and class -1). [Figure: linear logistic regression w/ Gaussian prior and w/ Laplacian prior.] As decreases, becomes less relevant
Algorithms for Logistic Regression. Standard algorithm: Newton-Raphson, a.k.a. iteratively reweighted least squares (IRLS). IRLS is easily applied without any prior or with a Gaussian prior. IRLS is not applicable with a Laplacian prior: the log-prior is not differentiable. Alternative: bound optimization algorithms. For ML logistic regression: [Böhning & Lindsay, 1988], [Böhning, 1992]. More general formulations: [de Leeuw & Michailides, 1993], [Lange, Hunter, & Yang, 2000].
Bound Optimization Algorithms (BOA). Optimization problem: maximize an objective function. Bound optimization algorithm: at each iteration, maximize a bound function that lies below the objective, with equality if and only if the argument equals the current estimate. This is sufficient (in fact more than sufficient) to prove monotonicity. Notes: the bound function should be easy to maximize; EM is a BOA.
Deriving Bound Functions. There are many ways to obtain bound functions; for example, it is well known that Jensen's inequality underlies EM. Via a Hessian bound: suppose the objective is concave, with Hessian bounded below in terms of a positive definite matrix B. The right-hand side of the resulting quadratic inequality can be used as the bound function.
Quasi-Newton Monotonic Algorithm. The update equation is simple to solve and leads to a quasi-Newton algorithm, with B replacing the Hessian. Unlike the Newton algorithm, it is monotonic.
Application to ML Logistic Regression. For logistic regression, it can be shown [Böhning, 1992] that the Hessian is bounded by a fixed matrix built with a Kronecker product. It is also easy to compute the gradient and finally plug both into the update equation. Under a ridge-type Gaussian prior, the matrix inverse in the update can be computed off-line.
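The monotonic quasi-Newton iteration can be sketched for the binary special case, which is a simplification of the multinomial setting in the slides. The sketch assumes labels in {0, 1} and uses the known bound that the log-likelihood Hessian is never below -0.25 * H^T H, so that matrix replaces the Hessian and is inverted once. The toy data and function names are hypothetical.

```python
import numpy as np


def loglik(H, y, w):
    """Binary logistic log-likelihood, labels y in {0, 1}."""
    a = H @ w
    return float(np.sum(y * a - np.logaddexp(0.0, a)))


def bound_opt_logreg(H, y, n_iter=100):
    """Bound optimization: Newton-like steps with a fixed curvature matrix."""
    w = np.zeros(H.shape[1])
    # 0.25 * H^T H bounds the negative Hessian everywhere; inverted once, off-line.
    B_inv = np.linalg.inv(0.25 * H.T @ H)
    trace = [loglik(H, y, w)]
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(H @ w)))   # current class-1 probabilities
        g = H.T @ (y - p)                    # gradient of the log-likelihood
        w = w + B_inv @ g                    # maximizer of the quadratic lower bound
        trace.append(loglik(H, y, w))
    return w, trace


# Small non-separable toy problem with an intercept column.
x = np.array([-2.0, -1.0, -0.5, 0.3, 0.8, 1.5, 2.0, -0.2])
y = np.array([0, 0, 1, 0, 1, 1, 1, 0], dtype=float)
H = np.column_stack([np.ones_like(x), x])
w, trace = bound_opt_logreg(H, y)
```

Each step maximizes a quadratic that touches the log-likelihood at the current point and lies below it elsewhere, so `trace` never decreases; no step-size control is needed.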
Application to LASSO Logistic Regression. For LASSO logistic regression, the log-likelihood is already bounded via the Hessian bound; we still need a bound for the log-prior. It is easy to show a quadratic bound that holds for any current estimate and is tight at it.
Application to LASSO Logistic Regression. After dropping additive terms, the update equation is an IRRR, which can be rewritten in an equivalent form.
Application to LASSO Logistic Regression. The computational cost of the update equation is OK for linear classification if the number of features is not too large, but may not be OK for kernel classification with large training sets. This is the cost of standard IRLS for ML logistic regression, but now with a Laplacian prior.
Sequential Update Algorithm for LASSO Logistic Regression. Recall the penalized objective. Let's bound only the log-likelihood via the Hessian bound, leaving the log-prior as it is, and maximize only w.r.t. one component of the parameter vector at a time.
Sequential Update Algorithm for LASSO Logistic Regression. The update equation has a simple closed-form expression. It can be shown that updating all components has a cost that may be much less than that of the parallel update. It usually also uses fewer iterations, since we do not bound the prior and the update is "incremental".
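A component-wise update of this kind can be sketched as follows, again for the binary case rather than the multinomial one in the slides. The assumptions are labels in {0, 1}, the objective loglik(w) - lam*||w||_1, and the curvature bound 0.25 * sum_i H[i, j]^2 for component j; with only the likelihood bounded, the one-dimensional maximizer is a soft threshold. All names and the toy data are illustrative.

```python
import numpy as np


def soft(z, t):
    """Scalar soft threshold."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def objective(H, y, w, lam):
    a = H @ w
    return float(np.sum(y * a - np.logaddexp(0.0, a)) - lam * np.sum(np.abs(w)))


def sequential_lasso_logreg(H, y, lam, n_sweeps=50):
    n, k = H.shape
    w = np.zeros(k)
    a = H @ w                                  # running linear predictor H w
    curv = 0.25 * np.sum(H ** 2, axis=0)       # per-component curvature bound
    trace = [objective(H, y, w, lam)]
    for _ in range(n_sweeps):
        for j in range(k):                     # simple cyclic schedule
            p = 1.0 / (1.0 + np.exp(-a))
            g_j = H[:, j] @ (y - p)            # j-th gradient component
            w_new = soft(w[j] + g_j / curv[j], lam / curv[j])
            a += H[:, j] * (w_new - w[j])      # incremental update of H w
            w[j] = w_new
        trace.append(objective(H, y, w, lam))
    return w, trace


x1 = np.array([-2.0, -1.0, -0.5, 0.3, 0.8, 1.5, 2.0, -0.2])
x2 = np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
y = np.array([0, 0, 1, 0, 1, 1, 1, 0], dtype=float)
H = np.column_stack([x1, x2])
w, trace = sequential_lasso_logreg(H, y, lam=0.5)
w_big, _ = sequential_lasso_logreg(H, y, lam=100.0)
```

Because the prior is handled exactly, a component whose gradient stays below `lam` in magnitude is set to exactly zero, which is how a large penalty prunes every weight in `w_big`.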
Sequential Update Algorithm for Ridge Logistic Regression. With a Gaussian prior, the update equation also has a simple closed-form expression. With the prior removed (zero penalty), we get the update rule for ML logistic regression. Important issue: how to choose the order of update, i.e., which component to update next? In all the results presented, we use a simple cyclic schedule. We're currently investigating other alternatives (good, but cheap).
Related Work. Sequential update for the relevance vector machine (RVM) [Tipping & Faul, 2003]. Comments: the objective function of the RVM is not concave, so results may depend critically on initialization and order of update. Kernel logistic regression; the import vector machine (IVM) [Zhu & Hastie, 2001]. Comments: sparseness is not encouraged in the objective function (Gaussian prior) but by early stopping of a greedy algorithm. Efficient algorithm for SVM with a 1-norm penalty [Zhu, Rosset, Hastie, & Tibshirani, 2003]. Comments: efficient, though not simple, algorithm; the SVM objective is different from the logistic regression objective. Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]. Comments: as far as we know, not yet/easily applied to logistic regression.
Experimental Results. Three standard benchmark datasets: Crabs, Iris, and Forensic Glass. Three well-known gene expression datasets: AML/ALL, Colon, and Yeast. Penalty weight adjusted by cross-validation. Comparison with state-of-the-art classifiers: RVM and SVM (SVMlight). [Table: summary of datasets.] Table notes: not exactly CV, but 30 different 50/12 splits; no CV, fixed standard split.
Experimental Results. All kernel classifiers (RBF Gaussian and linear). For RBF, the width was tuned by CV with the SVM and then used for all other methods. [Table: number of errors and number of kernels.] BMSLR: Bayesian multinomial sparse logistic regression. BMGLR: Bayesian multinomial Gaussian logistic regression. Linear classification of AML/ALL (no kernel): 1 error, 81 (of 7129) features (genes) selected. [Krishnapuram, Carin, Hartemink, and Figueiredo, 2004 (submitted)]
Non-linear Feature Selection. We have considered a fixed dictionary of functions, i.e., feature selection is done on parameters appearing linearly or "generalized linearly". Let us now look at "non-linear parameters", i.e., parameters inside the dictionary.
Non-linear Feature Selection. For logistic regression, we need to further constrain the problem: consider parameterizations of the following type for kernels: polynomial and Gaussian.
Feature Scaling/Selection. This corresponds to a different scaling of each original feature. Sparseness of the scaling parameters amounts to feature selection. We can also adopt a Laplacian prior for them. Estimation criterion: the logistic log-likelihood penalized by the priors.
EM Algorithm for Feature Scaling/Selection. We again use a bound optimization algorithm (BOA), with the Hessian bound for the log-likelihood and the quadratic bound for the log-priors. Maximizing the resulting bound can't be done in closed form. It is easy to maximize w.r.t. the classifier weights with the scaling parameters fixed. Maximization w.r.t. the scaling parameters is done by conjugate gradient; the necessary gradients are easy to derive. JCFO: joint classifier and feature optimization [Krishnapuram, Carin, Hartemink, and Figueiredo, IEEE-TPAMI, 2004].
Experimental Results on Gene Expression Data (Full LOOCV). [Table: accuracy (%) on AML/ALL and Colon for the methods Boosting, SVM (linear kernel), SVM (quadratic kernel), RVM (no kernel), Logistic (no kernel), Sparse probit (quadr. kernel), Sparse probit (linear kernel), JCFO (quadratic kernel), JCFO (linear kernel); sources [Ben-Dor et al, 2000], [Krishnapuram et al, 2002].] Typically around 25~30 genes selected, i.e., with non-zero scaling parameters.
Top 12 Genes for AML/ALL (sorted by mean |θ_i|). Genes marked * agree with [Golub et al, 1999]; many others in the top 25. Antibodies to MPO are used in clinical diagnosis of AML.
Top 12 Genes for Colon (sorted by mean θ_i). Genes marked * are known to be implicated in colon cancer.
PART II Non-trivial Bounds for Sparse Classifiers
Introduction. Training data (here, we consider only binary problems) are assumed to be i.i.d. from an underlying distribution. Given a classifier, consider its true generalization error (not computable, since the underlying distribution is unknown) and its sample error. Key question: how are the two related?
PAC Performance Bounds. PAC (probably approximately correct) bounds relate the two errors with high probability and hold independently of the underlying distribution. Usually, the bounds hold uniformly over the class of classifiers.
PAC Performance Bounds. There are several ways to derive them. Vapnik-Chervonenkis (VC) theory (see, e.g., [Vapnik, 1998]): VC usually leads to trivial bounds (>1, unless n is huge). Compression arguments [Graepel, Herbrich, & Shawe-Taylor, 2000]: compression bounds are not applicable to point sparse classifiers of the type herein presented, or of the RVM type [Herbrich, 2000]. We apply PAC-Bayesian bounds [McAllester, 1999], [Seeger, 2002].
Some Definitions. Consider some point "estimate" of the parameters, and let the "posterior" be a Laplacian centered at it; we'll call this the "posterior", although not in the usual sense. Point classifier (PC) at the estimate: the one we're interested in. Gibbs classifier (GC): classifies with a sample from the "posterior". Bayes voting classifier (BVC): votes according to the "posterior".
Key Lemmas. Lemma 1: for any input, the decision of the PC at the point estimate is the same as that of a BVC based on any symmetric posterior centered on that estimate. Proof: a simple pairing argument (see, e.g., [Herbrich, 2002]). Lemma 2: for any "posterior", the generalization error of the BVC is less than twice that of the GC. Proof: see [Herbrich, 2002]. Conclusion: we can use PAC-Bayesian bounds for the GC for our PC.
PAC-Bayesian Theorem. Let there be a prior (meaning it is independent of the training data) and a posterior (meaning it may depend on the training data). Consider the generalization error of a Gibbs classifier and its expected sample/empirical error. McAllester's PAC-Bayesian theorem relates these two errors.
PAC-Bayesian Theorem. Theorem: with the prior and posterior as defined above, an inequality holds with probability at least the chosen confidence level over random training samples of size n. It involves the Kullback-Leibler divergence between two Bernoullis (with the empirical and generalization errors as parameters) and the Kullback-Leibler divergence between posterior and prior.
Tightening the PAC-Bayesian Theorem. With our Laplacian prior and posterior, the KLD between them can be written explicitly. Due to the convexity of the KLD, it is easy to (numerically) find its minimizer, and since one parameter can be chosen freely, we choose it to tighten the bound.
Using the PAC-Bayesian Theorem. Set a prior parameter and choose a confidence level. Find the parameter value that satisfies the required condition. Using this prior and the training data, find a point estimate. With this, define the "posterior" as above, and evaluate the corresponding expected empirical error by Monte Carlo. From these, we know that the bound holds with probability at least the chosen confidence level.
Using the PAC-Bayesian Theorem. To obtain an explicit bound on the generalization error, the Bernoulli KLD in the theorem must be inverted. The resulting bound can easily be found numerically and is always non-trivial, i.e., less than 1.
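The numerical inversion can be sketched with a bisection on the Bernoulli KL divergence. The right-hand side used in the example, (KL(posterior || prior) + ln((n + 1)/delta)) / n, is one commonly quoted form of the PAC-Bayesian theorem and is an assumption here, since the slide's exact expression was not preserved; the input numbers are made up for illustration.

```python
import math


def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p), for 0 < p < 1."""
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out


def kl_inverse_upper(q, c, tol=1e-12):
    """Largest p in [q, 1) with KL(q || p) <= c, found by bisection."""
    lo, hi = q, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= c:
            lo = mid      # constraint still satisfied, move right
        else:
            hi = mid
    return lo


# Hypothetical inputs: Gibbs empirical error, KL(posterior || prior), n, delta.
emp_err, kl_post_prior, n, delta = 0.1, 5.0, 200, 0.05
c = (kl_post_prior + math.log((n + 1) / delta)) / n   # assumed form of the r.h.s.
bound = kl_inverse_upper(emp_err, c)
```

Since KL(q || p) grows without limit as p approaches 1, the returned value always stays below 1, which is why a bound obtained this way cannot be trivial.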
Using the PAC-Bayesian Theorem. Finally, notice that the PAC-Bayesian bound applies to the Gibbs classifier, but recall Lemma 2: for any "posterior", the generalization error of the BVC is less than twice that of the GC. In practice, we have observed that the BVC usually generalizes as well as (often much better than) the GC. We believe the factor 2 can be reduced, but we have not yet been able to show it.
Example of PAC-Bayesian Bound. "Mines" dataset. [Figure: bound for the "Mines" dataset.] Maybe tight enough to guide selection of the prior parameter.
Conclusions for Part II. - PAC-Bayesian bound for sparse classifiers. - The bound (unlike VC bounds) is always non-trivial. - Tightness still requires large sample sizes. - Future goals: tightening the bounds, as always.
PART III Feature Selection in Unsupervised Learning
Feature Selection in Unsupervised Learning. FS is a widely studied problem in supervised learning. FS is not widely studied in unsupervised learning (clustering). A good reason: in the absence of labels, how do you assess the usefulness of each feature? Approach: how relevant is each feature (component of x) for the mixture nature of the data? We address this problem in the context of model-based clustering using finite mixtures.
Example of Relevant and Irrelevant Features. Example: x1 is relevant for the mixture nature of this data; x2 is irrelevant for the mixture nature of this data. Any PCA-type analysis of this data would not be useful here.
Interplay Between Number of Clusters and Features. Example: using only x1, we find 2 components. Using x1 and x2, we find 7 components (needed to fit the non-Gaussian density of x2). [Figure: marginals.]
Approaches to Feature Selection. Most classical FS methods for supervised learning require combinatorial searches: for d features, there are 2^d possible feature subsets. Alternative: assign real-valued feature weights and encourage sparseness (as seen above for supervised learning). [Law, Figueiredo and Jain, TPAMI, 2004 (to appear)]
Maximum Likelihood Estimation and Missing Data. Given the training data, we seek the maximum likelihood estimate of the mixture parameters. Missing data: the component labels, in one-of-k encoding (indicating component i). The "complete" log-likelihood would be easy to maximize, if we had the missing data.
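EM Algorithm for Maximum Likelihood Estimation. E-step: compute the current estimate of the probability that each sample was produced by component i; this suffices because the complete log-likelihood is linear in the missing data. M-step: re-estimate the parameters using these probabilities.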
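The two steps can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under assumptions not in the slides: univariate Gaussian components, quantile-based initialization, a fixed number of iterations, and synthetic data from a seeded generator.

```python
import numpy as np


def em_gmm_1d(y, k, n_iter=50):
    """Plain EM for a k-component univariate Gaussian mixture."""
    n = len(y)
    alpha = np.full(k, 1.0 / k)                       # mixing probabilities
    mu = np.quantile(y, (np.arange(k) + 0.5) / k)     # spread-out initial means
    var = np.full(k, np.var(y))
    trace = []                                        # log-likelihood per iteration
    for _ in range(n_iter):
        # E-step: w[i, j] = current probability that y[i] came from component j.
        dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        joint = alpha * dens
        trace.append(float(np.sum(np.log(joint.sum(axis=1)))))
        w = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of all parameters.
        nj = w.sum(axis=0)
        alpha = nj / n
        mu = (w * y[:, None]).sum(axis=0) / nj
        var = (w * (y[:, None] - mu) ** 2).sum(axis=0) / nj
    return alpha, mu, var, trace


rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])
alpha, mu, var, trace = em_gmm_1d(y, k=2)
```

The recorded log-likelihood never decreases from one iteration to the next, consistent with EM being a bound optimization algorithm.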
Model Selection for Mixtures. Important issue: how to select k (number of components)? MML-based approach [Figueiredo & Jain, 2002]; roughly, an MDL/BIC with careful definition of the amount of data from which each parameter is estimated. The resulting criterion leads to a simple modification of EM: the original M-step expression for the mixing probabilities is replaced by a new expression involving the number of parameters of each component. This update "kills" weak components.
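The component-killing update can be sketched as a small function. It follows the form given in [Figueiredo & Jain, 2002], where each mixing probability is proportional to max(0, sum_i w_ij - N/2), with N the number of parameters of each component; the function name and the example numbers are hypothetical.

```python
def mml_mixing_weights(resp_sums, n_params):
    """Modified M-step: alpha_j proportional to max(0, sum_i w_ij - N/2)."""
    shrunk = [max(0.0, s - n_params / 2.0) for s in resp_sums]
    total = sum(shrunk)
    if total == 0.0:
        raise ValueError("all components were annihilated")
    return [s / total for s in shrunk]


# Hypothetical sums of E-step probabilities for three components, N = 4.
alpha = mml_mixing_weights([50.0, 45.0, 1.5], n_params=4)
```

A component supported by fewer than N/2 effective samples receives a mixing probability of exactly zero and drops out of the mixture, while the remaining probabilities are renormalized.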
Feature Selection for Mixtures. Simplifying assumption: in each component, the features are independent, with separate parameters for the density of the l-th feature in component i. Let some features have a common density in all components; these features are "irrelevant". A binary indicator tells whether feature l is relevant or irrelevant; the irrelevant features follow common (w.r.t. i) densities.
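Feature Saliency. To apply EM, we see the relevance indicators as missing data and define the probability that feature l is relevant; we call it "feature saliency". It can be shown that the resulting likelihood (marginal w.r.t. the indicators) is a mixture in which each feature's density is itself a two-term mixture of the component-specific and common densities, weighted by the saliency.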
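The marginal likelihood with feature saliencies can be sketched as a short function, following the form in [Law, Figueiredo and Jain, 2004]: sum over components j of alpha_j times the product over features l of rho_l * p(y_l | component j) + (1 - rho_l) * q(y_l). Univariate Gaussian densities are assumed for both terms, and all variable names and toy values are illustrative.

```python
import numpy as np


def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def saliency_loglik(Y, alpha, mu, var, rho, mu0, var0):
    """Y: (n, d); alpha: (k,); mu, var: (k, d); rho, mu0, var0: (d,)."""
    rel = normal_pdf(Y[:, None, :], mu, var)          # (n, k, d) component-specific
    com = normal_pdf(Y, mu0, var0)[:, None, :]        # (n, 1, d) common density
    per_feature = rho * rel + (1.0 - rho) * com       # saliency-weighted blend
    per_comp = alpha * np.prod(per_feature, axis=2)   # (n, k)
    return float(np.sum(np.log(per_comp.sum(axis=1))))


Y = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 0.5]])
alpha = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [2.0, 0.0]])
var = np.ones((2, 2))
mu0, var0 = np.zeros(2), np.array([1.0, 4.0])
# Feature 1 fully relevant, feature 2 fully irrelevant.
ll_mixed = saliency_loglik(Y, alpha, mu, var, np.array([1.0, 0.0]), mu0, var0)
```

With a saliency of 0 the feature contributes the same factor to every component, so it separates out of the mixture and has no influence on the clustering; with a saliency of 1 the usual mixture term is recovered.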
Applying EM. We address the estimation by EM, using the component labels and the relevance indicators as missing data. In addition to the variables defined above, the E-step now also involves two further variables, both easily computed in closed form.
Applying EM. M-step (omitting the variances), assuming that the component-specific and common densities are both univariate Gaussian with arbitrary mean and variance: the M-step updates the mean in each of the two densities.
Model Selection. To perform feature selection, we want to encourage some of the saliencies to become either 0 or 1. This can be achieved with the same MML-type criterion used above to select k. The modified M-step involves the number of parameters in each of the two densities.
Synthetic Example. Mixture of 4 Gaussians (with identity covariance) with d = 10: 2 relevant features and 8 irrelevant features. 800 samples, projected on the first two dimensions.
[Figure: initial and final estimates; common density.]
Synthetic Example. Feature saliency values (mean ± 1 s.d.) over 10 runs, for the relevant features and the irrelevant features.
Real Data. - Several standard benchmark data sets (table columns: Name, n, d, k): Wine, Wisconsin breast cancer, Image segmentation, Texture classification. These are standard data-sets for supervised classification. - We fit mixtures, ignoring the labels. - We classify the points and compare to the labels.
Real Data: Results. % error (sd), with FS versus without FS:
Wine: 6.61 (3.91) with FS, 8.06 (3.73) without FS
Wisconsin breast cancer: 9.55 (1.99) with FS, 10.09 (2.70) without FS
Image segmentation: 20.19 (1.54) with FS, 32.84 (5.10) without FS
Texture classification: 4.04 (0.76) with FS, 4.85 (0.98) without FS
- For these data-sets, our approach is able to improve the performance of mixture-based unsupervised classification.
Research Directions. In supervised learning: - More efficient algorithms for logistic regression with LASSO prior. - Investigating the performance of generalized Gaussian priors with exponents other than 1 (LASSO) or 2 (Ridge). - Deriving performance bounds for this type of approach. In unsupervised learning: - More efficient algorithms. - Removing the conditional independence assumption. - Extension to other mixtures (e.g., multinomial for categorical data).