1 Feature Selection for Supervised and Unsupervised Learning
Mário A. T. Figueiredo, Institute of Telecommunications and Instituto Superior Técnico, Technical University of Lisbon, PORTUGAL.
Work herein reported was done in collaboration with: L. Carin, B. Krishnapuram, and A. Hartemink, Duke University; A. K. Jain and M. Law, Michigan State University.
2 Outline
Part I: Supervised Learning
- Introduction
- Review of LASSO Regression
- The LASSO Penalty for Multinomial Logistic Regression
- Bound Optimization Algorithms (Parallel and Sequential)
- Non-linear Feature Weighting/Selection
- Experimental Results
Part II: Performance Bounds
Part III: Unsupervised Learning
- Introduction
- Review of Model-Based Clustering with Finite Mixtures
- Feature Saliency
- An EM Algorithm to Estimate Feature Saliency
- Model Selection
- Experimental Results
3 Supervised Learning
Goal: to learn a functional dependency (with some set of parameters) from a set of examples, the training data.
Discriminative (non-generative) approach: no attempt to model the joint density of inputs and outputs.
4 Complexity Control via Bayes Point Estimation
Good generalization requires complexity control.
Bayesian (point estimation) approach: a prior controls the "complexity" of the parameters; combined with the likelihood function, it yields a maximum a posteriori (MAP) point estimate, which is then used for the prediction on a new "input".
5 Bayes Point Estimate Versus Fully Bayesian Approach
Point prediction for a new "input" uses the MAP estimate; a fully Bayesian prediction would instead average the predictions over the posterior on the parameters.
We will not consider fully Bayesian approaches here.
6 Linear (w.r.t. β) Regression
We consider functions which are linear w.r.t. the parameter vector β, built from some dictionary of functions, e.g., radial basis functions, splines, wavelets, polynomials, ...
Notable particular cases: linear regression (the dictionary is the original features) and kernel regression (as in the SVM, RVM, etc.).
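The slide's own equations are not reproduced in this transcript; the following is the standard form of the model it describes (the symbols β, φ_j, and K are assumed notation):

```latex
f(x;\beta) \;=\; \beta_0 \;+\; \sum_{j=1}^{k} \beta_j\, \phi_j(x),
\qquad \text{kernel case: }\ \phi_j(x) = K(x, x_j),\ \ j = 1,\dots,n .
```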
7 Likelihood Function for Regression
Likelihood function for a Gaussian observation model, where H is the design matrix (its columns are the dictionary functions evaluated at the training inputs).
Assuming that y and the columns of H are centered, we drop the intercept term w.l.o.g.
The maximum likelihood / ordinary least squares (OLS) estimate is undetermined if H does not have full column rank (e.g., fewer observations than dictionary functions).
8 Bayesian Point Estimates: Ridge and the LASSO
With a Gaussian prior, the MAP estimate is called ridge regression, or weight decay (in neural-nets parlance).
With a Laplacian prior, the MAP estimate promotes sparseness of the coefficient vector, i.e., its components are either significantly large or exactly zero: this performs feature selection.
LASSO regression [Tibshirani, 1996], pruning priors for NNs [Williams, 1995], basis pursuit [Chen, Donoho, Saunders, 1995].
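In the standard formulation (notation assumed here, since the slide's formulas are not in the transcript), the two MAP estimates correspond to L2 and L1 penalties on the coefficients:

```latex
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta}\; \|y - H\beta\|_2^2 + \lambda \|\beta\|_2^2,
\qquad
\hat{\beta}_{\text{LASSO}} = \arg\min_{\beta}\; \|y - H\beta\|_2^2 + \lambda \|\beta\|_1 .
```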
9 Algorithms to Compute the LASSO
Special-purpose algorithms: [Tibshirani, 1996], [Fu, 1998], [Osborne, Presnell, Turlach, 2000].
Least angle regression (LAR), currently the best approach: [Efron, Hastie, Johnstone, & Tibshirani, 2002].
For orthogonal H, there is a closed-form solution: the "soft threshold".
For more insight on the LASSO, see [Tibshirani, 1996].
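A minimal sketch of the orthonormal-design case (illustrative helper; it assumes the criterion ||y - Hβ||² + λ||β||₁ with orthonormal columns of H, for which the threshold is λ/2):

```python
import numpy as np

def lasso_orthonormal(H, y, lam):
    """LASSO with orthonormal design: soft-threshold the OLS coefficients.
    Assumes the criterion ||y - H b||^2 + lam * ||b||_1 and H^T H = I."""
    b_ols = H.T @ y                                       # OLS estimate when H^T H = I
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)
```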
10 EM Algorithm for the LASSO
The LASSO can be computed by EM using a hierarchical formulation of the Laplacian prior: each coefficient is Gaussian given its own variance, and the variances are independent with exponential densities.
Treat the variances as missing data and apply standard EM. This leads to what can be called an iteratively reweighted ridge regression (IRRR).
This possibility was mentioned in [Tibshirani, 1996]; it is not very efficient, but...
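A rough sketch of the IRRR iteration under the usual Gaussian-times-exponential hierarchy for the Laplacian prior (the function name, the exact constant in the reweighting, and the fixed iteration count are illustrative assumptions, not the talk's exact formulas):

```python
import numpy as np

def lasso_em_irrr(H, y, alpha, sigma2=1.0, n_iter=100, eps=1e-10):
    """EM / iteratively reweighted ridge regression (IRRR) view of the LASSO.
    E-step: expected inverse variances of the hierarchical prior, which grow
    as |beta_j| shrinks.  M-step: ridge regression with a diagonal, reweighted
    penalty.  Constants depend on how the Laplacian prior is parameterized."""
    beta = np.linalg.lstsq(H, y, rcond=None)[0]            # start from (min-norm) least squares
    for _ in range(n_iter):
        w = alpha / np.maximum(np.abs(beta), eps)           # E-step weights
        beta = np.linalg.solve(H.T @ H + sigma2 * np.diag(w), H.T @ y)  # M-step (weighted ridge)
    return beta
```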
11 About γ
The previous derivation opens the door to other priors. For example, a Jeffreys (non-informative) hyper-prior removes the hyperparameter; the EM algorithm is still of the IRRR type.
Interestingly, this is similar to the FOCUSS algorithm for regression with an ℓp penalty [Kreutz-Delgado & Rao, 1998]. Strong sparseness!
Problem: non-convex objective; results depend on initialization. Possibility: initialize with the OLS estimate.
12 Some Results
Same coefficient vectors, design matrices, and experimental procedure as in [Tibshirani, 1996].
Model error (ME) improvement w.r.t. the OLS estimate: close to the best in each case, without any cross-validation. More results in [Figueiredo, NIPS 2001] and [Figueiredo, PAMI 2003].
13 Classification via Logistic Regression
Recall that the dictionary functions may be the components of the input itself, or other (nonlinear) functions of it, such as kernels.
Binary classification: two classes. Multi-class: m classes, with "1-of-m" encoding of the class labels.
14 Classification via Logistic Regression
Since the class probabilities must sum to one, we can set one of the class parameter vectors to zero w.l.o.g.; the parameters to estimate are the remaining m-1 vectors.
Maximum log-likelihood estimate: if the training data is separable, the log-likelihood is unbounded, and the ML estimate is thus undefined.
15 Penalized (Point Bayes) Logistic Regression
Penalized (or point Bayes MAP) estimate: log-likelihood plus a log-prior penalty.
Gaussian prior: standard penalized logistic regression. Laplacian prior (LASSO prior): favors sparseness, hence feature selection.
For linear regression it does; what about for logistic regression?
16 Laplacian Prior for Logistic Regression
Simple test with 2 training points (one from class 1, one from class -1). As one of the features becomes less relevant, the Laplacian prior drives the corresponding weight to zero, unlike the Gaussian prior. (Figure: linear logistic regression boundaries with Laplacian vs. Gaussian prior.)
17 Algorithms for Logistic Regression
Standard algorithm: Newton-Raphson, a.k.a. iteratively reweighted least squares (IRLS).
IRLS is easily applied without any prior or with a Gaussian prior; it is not applicable with the Laplacian prior, because the penalty is not differentiable at zero.
Alternative: bound optimization algorithms. For ML logistic regression: [Böhning & Lindsay, 1988], [Böhning, 1992]. More general formulations: [de Leeuw & Michailides, 1993], [Lange, Hunter, & Yang, 2000].
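For reference, a minimal binary IRLS sketch (labels in {0,1}; the function is an illustrative assumption, not the talk's multinomial implementation):

```python
import numpy as np

def logistic_irls(H, y, n_iter=25):
    """Newton-Raphson / IRLS for binary maximum-likelihood logistic regression."""
    beta = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-H @ beta))       # current P(y = 1 | x)
        W = p * (1.0 - p)                          # IRLS weights
        grad = H.T @ (y - p)                       # gradient of the log-likelihood
        hess = H.T @ (H * W[:, None])              # negated Hessian
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    return beta
```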
18 Bound Optimization Algorithms (BOA)
Optimization problem: maximize an objective using a surrogate function that lower-bounds it, with equality if and only if the two arguments coincide.
Bound optimization algorithm: at each iteration, maximize the surrogate built at the current estimate.
This is sufficient (in fact more than sufficient) to prove monotonicity of the objective values.
Notes: the surrogate should be easy to maximize; EM is a BOA.
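The monotonicity argument alluded to here is the usual two-inequality chain; in the notation assumed below, Q(θ|θ') ≤ l(θ) with equality at θ = θ', and θ_{t+1} maximizes Q(·|θ_t):

```latex
\ell(\theta_{t+1}) \;\ge\; Q(\theta_{t+1}\mid\theta_t)
\;\ge\; Q(\theta_t\mid\theta_t) \;=\; \ell(\theta_t).
```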
19 Deriving Bound Functions
There are many ways to obtain bound functions; for example, it is well known that Jensen's inequality underlies EM.
Via a Hessian bound: suppose the objective is concave, with its Hessian bounded below by the negative of a positive definite matrix B. The second-order expansion, with that matrix in place of the Hessian, can then be used as the surrogate.
20 Quasi-Newton Monotonic Algorithm
The update equation (maximizing the quadratic surrogate) is simple to solve and leads to a Newton-type step with B replacing the Hessian.
This is a quasi-Newton algorithm; unlike the Newton algorithm, it is monotonic.
21 Application to ML Logistic Regression
For logistic regression, it can be shown [Böhning, 1992] that the Hessian is bounded by a fixed matrix B built from a Kronecker product involving the design matrix.
The gradient is also easy to compute, and both are plugged into the quasi-Newton update. Under a ridge-type Gaussian prior, the matrix to be inverted can be computed off-line.
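The bound in question (notation assumed here; see [Böhning, 1992]) states that the negated Hessian of the multinomial log-likelihood is dominated by a fixed matrix, which then replaces it in the quasi-Newton step:

```latex
-\nabla^2 \ell(\beta) \;\preceq\; B
\;=\; \tfrac{1}{2}\Big[\,I_{m-1} - \tfrac{\mathbf{1}\mathbf{1}^{T}}{m}\,\Big] \otimes H^{T}H,
\qquad
\beta_{t+1} \;=\; \beta_t + B^{-1}\, g(\beta_t),
```

where g is the gradient of the log-likelihood and m the number of classes; since B does not depend on β, its inverse (or a factorization) can be computed once.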
22 Application to LASSO Logistic Regression
For LASSO logistic regression, the log-likelihood is already bounded via the Hessian bound; we also need a bound for the log-prior.
A quadratic bound works: it is easy to show that, for any value of the current estimate, the absolute value is bounded by a quadratic function, with equality iff the two coincide.
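The quadratic bound on the absolute value referred to here is presumably the standard one, which follows from the concavity of the square root:

```latex
|\beta| \;\le\; \frac{\beta^{2}}{2\,|\beta_t|} \;+\; \frac{|\beta_t|}{2},
\qquad \text{with equality iff } |\beta| = |\beta_t| .
```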
23 Application to LASSO Logistic Regression
After dropping additive terms, the surrogate is a quadratic with a diagonal penalty that depends on the current |β|; the update equation is an IRRR (iteratively reweighted ridge regression), which can be rewritten in an equivalent, more convenient form.
24 Application to LASSO Logistic Regression
The update equation has a computational cost that is cubic in the number of parameters: this may not be acceptable for kernel classification with a large training set, but it is fine for linear classification if the number of features is not too large.
This is the same cost as standard IRLS for ML logistic regression, but now with a Laplacian prior.
25 Sequential Update Algorithm for LASSO Logistic Regression
Recall the objective. Let us bound only the log-likelihood via the Hessian bound, leaving the log-prior untouched, and maximize only w.r.t. one component of the parameter vector at a time.
26 Sequential Update Algorithm for LASSO Logistic Regression
The resulting one-component update equation has a simple closed-form expression.
It can be shown that updating all components has a cost that may be much less than that of the parallel (full) update. It usually also needs fewer iterations, since we do not bound the prior and the algorithm is "incremental".
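A small illustrative sketch of a cyclic, coordinate-wise bound-optimization update for the binary case (assumptions: labels in {0,1}, the binary Hessian bound (1/4)HᵀH, and a plain non-incremental recomputation of the probabilities, so the constants and the cost differ from the multinomial algorithm in the talk); the closed-form single-coordinate update is a soft threshold:

```python
import numpy as np

def sparse_logistic_cd(H, y, lam, n_sweeps=50):
    """Cyclic coordinate-wise bound optimization for binary LASSO logistic
    regression (illustrative sketch, not the talk's exact multinomial update)."""
    n, d = H.shape
    beta = np.zeros(d)
    B_diag = 0.25 * np.sum(H ** 2, axis=0)          # diagonal of the Hessian bound (1/4) H^T H
    for _ in range(n_sweeps):
        for j in range(d):                           # simple cyclic schedule
            p = 1.0 / (1.0 + np.exp(-H @ beta))
            g_j = H[:, j] @ (y - p)                  # gradient along coordinate j
            z = beta[j] + g_j / B_diag[j]
            beta[j] = np.sign(z) * max(abs(z) - lam / B_diag[j], 0.0)  # soft threshold
    return beta
```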
27 Sequential Update Algorithm for Ridge Logistic Regression
With a Gaussian prior, the update equation also has a simple closed-form expression; letting the prior variance go to infinity, we get the update rule for ML logistic regression.
Important issue: how to choose the order of update, i.e., which component to update next? In all the results presented, we use a simple cyclic schedule; we are currently investigating other alternatives (good, but cheap).
28 Related Work
- Sequential update for the relevance vector machine (RVM) [Tipping & Faul, 2003]. Comments: the objective function of the RVM is not concave; results may depend critically on initialization and order of update.
- Kernel logistic regression; the import vector machine (IVM) [Zhu & Hastie, 2001]. Comments: sparseness is not encouraged in the objective function (Gaussian prior) but by early stopping of a greedy algorithm.
- Efficient algorithm for the SVM with an ℓ1 penalty [Zhu, Rosset, Hastie, & Tibshirani, 2003]. Comments: efficient, though not simple, algorithm; the SVM objective is different from the logistic regression objective.
- Least angle regression (LAR) [Efron, Hastie, Johnstone, & Tibshirani, 2002]. Comments: as far as we know, not yet/easily applied to logistic regression.
29 Experimental Results
Three standard benchmark datasets: Crabs, Iris, and Forensic Glass. Three well-known gene expression datasets: AML/ALL, Colon, and Yeast.
Penalty weight adjusted by cross-validation. Comparison with state-of-the-art classifiers: RVM and SVM (SVMlight).
(Table: summary of the datasets and evaluation protocols; some use no CV with a fixed standard split, another uses 30 different 50/12 splits rather than exact CV.)
30 Experimental Results
All kernel classifiers (RBF Gaussian and linear). For RBF, the kernel width was tuned by CV with the SVM and then used for all other methods.
(Table: number of errors and number of kernels for each method.) BMSLR: Bayesian multinomial sparse logistic regression; BMGLR: Bayesian multinomial Gaussian logistic regression.
Linear classification of AML/ALL (no kernel): 1 error, with a subset of the 7129 features (genes) selected.
[Krishnapuram, Carin, Hartemink, and Figueiredo, 2004 (submitted)]
31 Non-linear Feature Selection
So far we have considered a fixed dictionary of functions, i.e., feature selection is done on parameters appearing linearly or "generalized linearly".
Let us now look at "non-linear parameters", i.e., parameters inside the dictionary functions.
32 Non-linear Feature Selection
For logistic regression, we need to further constrain the problem: consider kernel parameterizations of the polynomial and Gaussian (RBF) type, with a scaling parameter for each original feature.
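The parameterizations referred to are, presumably, the feature-scaled kernels used in JCFO (notation assumed: θ_l ≥ 0 is the scale of original feature l, and θ_l = 0 removes that feature from the kernel):

```latex
k_{\theta}(x, x_j) \;=\; \Big(1 + \sum_{l} \theta_l\, x_l\, x_{jl}\Big)^{r}
\quad\text{(polynomial)},
\qquad
k_{\theta}(x, x_j) \;=\; \exp\Big(-\sum_{l} \theta_l\, (x_l - x_{jl})^{2}\Big)
\quad\text{(Gaussian)} .
```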
33 Feature Scaling/Selection
This corresponds to a different scaling of each original feature; sparseness of the scaling vector means feature selection.
We can also adopt a Laplacian prior for the scaling parameters. Estimation criterion: logistic log-likelihood plus the log-priors.
34 EM Algorithm for Feature Scaling/Selection
We again use a bound optimization algorithm (BOA), with the Hessian bound for the log-likelihood and the quadratic bound for the log-prior.
Maximizing the surrogate cannot be done in closed form: it is easy to maximize w.r.t. the linear weights with the scaling parameters fixed, while the maximization w.r.t. the scaling parameters is done by conjugate gradient (the necessary gradients are easy to derive).
JCFO: Joint Classifier and Feature Optimization [Krishnapuram, Carin, Hartemink, and Figueiredo, IEEE-TPAMI, 2004].
35 Experimental Results on Gene Expression Data (Full LOOCV)
Accuracy (%):
Method                             AML/ALL   Colon
Boosting                           95.8      72.6
SVM (linear kernel)                94.4      77.4
SVM (quadratic kernel)             ...       74.2
RVM (no kernel)                    97.2      88.7
Logistic (no kernel)               ...       71.0
Sparse probit (quadratic kernel)   ...       84.6
Sparse probit (linear kernel)      ...       91.9
JCFO (quadratic kernel)            98.6      ...
JCFO (linear kernel)               100       96.8
Competing results from [Ben-Dor et al., 2000] and [Krishnapuram et al., 2002].
Typically around 25~30 genes are selected (i.e., have non-zero weights).
36 Top 12 Genes for AML/ALL (sorted by mean |q_i|)
(Table of the top 12 genes; those marked * agree with [Golub et al., 1999], and many others appear in the top 25.)
Antibodies to MPO are used in the clinical diagnosis of AML.
37 Top 12 Genes for Colon (sorted by mean q_i)
(Table of the top 12 genes; those marked * are known to be implicated in colon cancer.)
38 PART II: Non-trivial Bounds for Sparse Classifiers
39 Introduction
Training data (here, we consider only binary problems), assumed to be i.i.d. from an underlying (unknown) distribution.
Given a classifier: the true generalization error is not computable, since the distribution is unknown; the sample (empirical) error is computable from the training data.
Key question: how are the two related?
40 PAC Performance Bounds
PAC (probably approximately correct) bounds state that, with high probability over the training sample, the generalization error does not exceed the sample error plus a gap term, and they hold independently of the underlying distribution.
Usually, the bounds hold uniformly over the class of classifiers considered.
41 PAC Performance Bounds
There are several ways to derive such bounds:
- Vapnik-Chervonenkis (VC) theory (see, e.g., [Vapnik, 1998]); VC usually leads to trivial bounds (>1, unless n is huge).
- Compression arguments [Graepel, Herbrich, & Shawe-Taylor, 2000]; compression bounds are not applicable to point sparse classifiers of the type herein presented, or of the RVM type [Herbrich, 2000].
We apply PAC-Bayesian bounds [McAllester, 1999], [Seeger, 2002].
42 Some Definitions
Let a Laplacian density centered at some point "estimate" of the weights be called the "posterior", although not in the usual sense.
Point classifier (PC): uses the point estimate itself; this is the one we are interested in.
Gibbs classifier (GC): uses a sample drawn from the "posterior".
Bayes voting classifier (BVC): votes over the "posterior".
43 Key Lemmas
Lemma 1: for any input, the decision of the PC is the same as that of a BVC based on any symmetric posterior centered on the point estimate. Proof: a simple pairing argument (see, e.g., [Herbrich, 2002]).
Lemma 2: for any "posterior", the generalization error of the BVC is less than twice that of the GC. Proof: see [Herbrich, 2002].
Conclusion: we can use PAC-Bayesian bounds for the GC for our PC.
44 PAC-Bayesian Theorem
Let our prior be a distribution that is independent of the training data, and our posterior a distribution that may depend on the training data.
The generalization error of the Gibbs classifier and its expected sample/empirical error are both defined as averages over the posterior.
McAllester's PAC-Bayesian theorem relates these two errors.
45 PAC-Bayesian Theorem
Theorem: with the quantities defined above, the following inequality holds with high probability over random training samples of a given size: the Kullback-Leibler divergence between the two Bernoulli distributions (empirical and true Gibbs errors) is bounded by a term involving the Kullback-Leibler divergence between posterior and prior, divided by the sample size.
46 Tightening the PAC-Bayesian Theorem
With our Laplacian prior and posterior, the KL divergence between them can be written explicitly and then bounded.
Due to the convexity of the KLD, the minimizing value is easy to find numerically; since the corresponding parameter can be chosen freely, we choose it so as to tighten the bound.
47 Using the PAC-Bayesian Theorem
Set a prior parameter and choose a confidence level. Using this prior and the training data, find a point estimate.
Find the posterior scale satisfying the tightening condition, and evaluate the corresponding expected empirical Gibbs error by Monte Carlo.
With this scale, define the "posterior" as above. From these quantities, the theorem gives an inequality that holds with probability at least the chosen confidence level.
48 Using the PAC-Bayesian Theorem
To obtain an explicit bound on the generalization error, the largest error value consistent with the KL inequality can easily be found numerically.
The resulting bound is always non-trivial, i.e., strictly less than 1.
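The numerical step can be done by bisection, since the binary KL divergence is increasing in its second argument above the first. A small sketch (function names, and the form of the right-hand side rhs, roughly the KL between posterior and prior plus a log term, all divided by the sample size, are assumptions for illustration):

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    if q == 0.0:
        return -math.log(1.0 - p)
    if q == 1.0:
        return -math.log(p)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def pac_bayes_error_bound(emp_err, rhs, tol=1e-10):
    """Largest p >= emp_err with KL(emp_err || p) <= rhs, found by bisection;
    this is the explicit PAC-Bayesian bound on the Gibbs error."""
    lo, hi = emp_err, 1.0 - tol
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_err, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo
```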
49 Using the PAC-Bayesian Theorem
Finally, notice that the PAC-Bayesian bound applies to the Gibbs classifier, but recall Lemma 2: for any "posterior", the generalization error of the BVC is less than twice that of the GC.
In practice, we have observed that the BVC usually generalizes as well as (often much better than) the GC. We believe the factor of 2 can be reduced, but we have not yet been able to show it.
50 Example of PAC-Bayesian Bound
"Mines" dataset. (Figure: the computed bound on this dataset.)
Maybe tight enough to guide the selection of the prior parameter.
51 Conclusions for Part II
PAC-Bayesian bound for sparse classifiers. The bound (unlike VC bounds) is always non-trivial, although tightness still requires large sample sizes.
Future goals: tightening the bounds... as always.
52 PART III: Feature Selection in Unsupervised Learning
53 Feature Selection in Unsupervised Learning
FS is a widely studied problem in supervised learning, but it is not widely studied in unsupervised learning (clustering).
A good reason: in the absence of labels, how do you assess the usefulness of each feature?
We address this problem in the context of model-based clustering using finite mixtures. Approach: how relevant is each feature (component of x) for the mixture nature of the data?
54 Example of Relevant and Irrelevant Features
(Figure: 2-D data.) x1 is relevant for the mixture nature of this data; x2 is irrelevant.
Any PCA-type analysis of this data would not be useful here.
55 Interplay Between Number of Clusters and Features
Example: using only x1, we find 2 components; using x1 and x2, we find 7 components (needed to fit the non-Gaussian density of x2). (Figure: the marginals of x1 and x2.)
56 Approaches to Feature Selection
Most classical FS methods for supervised learning require combinatorial searches: for d features, there are 2^d possible feature subsets.
Alternative: assign real-valued feature weights and encourage sparseness (as seen above for supervised learning).
[Law, Figueiredo, and Jain, TPAMI, 2004 (to appear)]
57 Maximum Likelihood Estimation and Missing Data
Training data: unlabeled points. Maximum likelihood estimate of the mixture parameters.
Missing data: the component labels, in one-of-k encoding (which component produced each point).
The "complete" log-likelihood would be easy to maximize, if we had the missing labels.
58 EM Algorithm for Maximum Likelihood Estimation
E-step: compute, for each point, the current estimate of the probability that it was produced by component i.
M-step: re-estimate the parameters using these probabilities as weights.
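A compact EM sketch for a diagonal-covariance Gaussian mixture (purely illustrative; initialization details, priors, and the model-selection modifications discussed next are not included):

```python
import numpy as np

def gmm_em(X, k, n_iter=100, seed=0):
    """EM for a Gaussian mixture with diagonal covariances (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]
    variances = np.tile(X.var(axis=0), (k, 1))
    alphas = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities w[j, i] = P(component i | x_j)
        logp = np.stack([
            np.log(alphas[i])
            - 0.5 * np.sum(np.log(2 * np.pi * variances[i])
                           + (X - means[i]) ** 2 / variances[i], axis=1)
            for i in range(k)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        w = np.exp(logp)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted updates of mixing weights, means, and variances
        Nk = w.sum(axis=0)
        alphas = Nk / n
        means = (w.T @ X) / Nk[:, None]
        variances = (w.T @ X ** 2) / Nk[:, None] - means ** 2 + 1e-8
    return alphas, means, variances, w
```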
59 Model Selection for Mixtures
Important issue: how to select k (the number of components)?
[Figueiredo & Jain, 2002]: an MML-based approach; roughly, an MDL/BIC with a careful definition of the amount of data from which each parameter is estimated.
The resulting criterion leads to a simple modification of EM: the original M-step expression for the mixing probabilities is replaced by a new expression involving the number of parameters of each component. This update "kills" weak components.
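The modified mixing-weight update from [Figueiredo & Jain, 2002], in assumed notation (w_ij is the E-step posterior for point j and component i, N the number of parameters per component), is

```latex
\hat{\alpha}_i \;=\;
\frac{\max\!\Big\{0,\;\sum_{j=1}^{n} w_{ij} - \tfrac{N}{2}\Big\}}
     {\sum_{i'=1}^{k}\max\!\Big\{0,\;\sum_{j=1}^{n} w_{i'j} - \tfrac{N}{2}\Big\}},
```

so a component whose effective support falls below N/2 points gets a zero weight and is removed.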
60 Feature Selection for Mixtures
Simplifying assumption: in each component, the features are independent; each feature l has its own density parameters in each component i.
Let some features have a common density in all components: these features are "irrelevant". If feature l is relevant, its density depends on the component; if it is irrelevant, it follows the common (w.r.t. i) density.
61 Feature Saliency
To apply EM, we treat the relevant/irrelevant indicators as missing data and define, for each feature, the probability that it is relevant; we call it the "feature saliency".
It can be shown that the resulting likelihood (marginal w.r.t. the indicators) mixes, for each feature, its component-specific density and the common density, weighted by the saliency.
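In the notation of [Law, Figueiredo, and Jain, 2004] (symbols assumed here: α_i are the mixing weights, ρ_l the saliency of feature l, p(·|θ_il) its component-i density, and q(·|λ_l) the common density), the marginal likelihood takes the form

```latex
p(x \mid \theta) \;=\; \sum_{i=1}^{k} \alpha_i
\prod_{l=1}^{d} \Big[\, \rho_l\, p(x_l \mid \theta_{il})
\;+\; (1-\rho_l)\, q(x_l \mid \lambda_l) \,\Big].
```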
62 Applying EM
We address the maximization by EM, using both the component labels and the relevance indicators as missing data.
In addition to the variables defined above, the E-step now also involves the posterior probabilities of each feature being relevant (or irrelevant) within each component; both are easily computed in closed form.
63 Applying EM
Assume that the component-specific and the common feature densities are both univariate Gaussian with arbitrary mean and variance.
M-step (omitting the variance updates): the means are weighted averages, with weights given by the E-step probabilities.
64 Model Selection
To perform feature selection, we want to encourage some of the saliencies to become either 0 or 1. This can be achieved with the same MML-type criterion used above to select k.
The modified M-step for the saliencies involves the number of parameters in the component-specific and common densities.
65 Synthetic Example
Mixture of 4 Gaussians (with identity covariance) with d = 10: 2 relevant features and 8 irrelevant features.
(Figure: the samples, projected on the first two dimensions.)
67 Synthetic Example
(Figure: feature saliency values (mean ± 1 s.d.) over 10 runs, for the relevant and the irrelevant features.)
68 Real Data
Several standard benchmark data sets:
Name                       n      d    k
Wine                       178    13   3
Wisconsin breast cancer    569    30   2
Image segmentation         2320   18   7
Texture classification     4000   19   4
- These are standard data sets for supervised classification.
- We fit mixtures, ignoring the labels.
- We classify the points and compare to the labels.
69 Real Data: Results
% error (s.d.), with and without FS:
Name                       with FS        without FS
Wine                       6.61 (3.91)    8.06 (3.73)
Wisconsin breast cancer    9.55 (1.99)    ... (2.70)
Image segmentation         ... (1.54)     ... (5.10)
Texture classification     4.04 (0.76)    4.85 (0.98)
For these data sets, our approach is able to improve the performance of mixture-based unsupervised classification.
70 Research Directions
In supervised learning:
- More efficient algorithms for logistic regression with the LASSO prior.
- Investigating the performance of generalized Gaussian priors with exponents other than 1 (LASSO) or 2 (ridge).
- Deriving performance bounds for this type of approach.
In unsupervised learning:
- More efficient algorithms.
- Removing the conditional independence assumption.
- Extension to other mixtures (e.g., multinomial for categorical data).