Feature Selection, Projection and Extraction

Presentation on theme: "Feature Selection, Projection and Extraction"— Presentation transcript:

1 Feature Selection, Projection and Extraction
Carla Brodley
TO DO 4/14/12: understand what L1-norm regularization does vs. L2-norm; finish PCA slides.

2 What is the concept?
− banana, + grapefruit, − firetruck, + airplane, duck, fire, + graduation, + yellow, − basket, + garden, − class

3 What is Similarity?

4 Classifier Construction
Objective: construct a classifier for the data such that predictive accuracy is maximized.
1. Prepare/collect training instances: generate/select descriptive features; collect and label instances.
2. Construct a classifier.

5 Creating Features
“Good” features are the key to accurate generalization.
Domain knowledge can be used to generate a feature set.
Medical example: results of blood tests, age, smoking history.
Game-playing example: number of pieces on the board, control of the center of the board. (In game playing, some features may take a significant amount of computation to evaluate.)
Data might not be in vector form. Example: spam classification.
“Bag of words”: throw out order, keep a count of how many times each word appears.
Sequence: one feature for the first letter in the e-mail, one for the second letter, etc. (A poor feature set for most purposes, because similar documents are not close to one another in this representation.)
N-grams: one feature for every unique string of n tokens.
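As an illustration of these text representations, here is a minimal sketch (my own, not from the course materials; plain Python with no external libraries) that builds a bag-of-words count vector and character n-grams for a short message. The whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(text):
    """Bag of words: ignore order, count how many times each word appears."""
    return Counter(text.lower().split())

def char_ngrams(text, n=3):
    """Character n-grams: one feature per unique string of n characters."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

msg = "free money free prizes"
print(bag_of_words(msg))      # Counter({'free': 2, 'money': 1, 'prizes': 1})
print(char_ngrams(msg, n=3))  # counts of 'fre', 'ree', 'ee ', ...
```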

6 What is feature selection?
Reducing the feature space by throwing out some of the features.
Example: a social security number is known to be poorly correlated with disease.

7 Reasons for Feature Selection
Want to find which features are relevant. A domain specialist may not be sure which factors are predictive of disease. Common practice: throw in every feature you can think of and let feature selection get rid of the useless ones.
Want to maximize accuracy by removing irrelevant and noisy features. For spam, create a feature for each of ~10^5 English words; training with all features is computationally expensive, and irrelevant features hurt generalization.
Features have associated costs; we want to optimize accuracy with the least expensive features. Examples: embedded systems with limited resources, voice recognition on a cell phone, branch prediction in a CPU (4K code limit).

8 Terminology
Univariate method: considers one variable (feature) at a time.
Multivariate method: considers subsets of variables (features) together.
Filter method: ranks features or feature subsets independently of the predictor (classifier).
Wrapper method: uses a classifier to assess features or feature subsets.
Methods can also be linear or non-linear.

9 Filtering
Basic idea: assign a score to each feature x indicating how “related” x and the class y are.
Intuition: if x = y for all instances, then x is great no matter what our model is; x contains all the information needed to predict y.
Pick the n highest-scoring features to keep.
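A minimal sketch of this idea (my own code, not from the course; assumes a numeric data matrix X and binary labels y): score each feature by the absolute value of its correlation with the label and keep the n best.

```python
import numpy as np

def filter_select(X, y, n):
    """Rank each feature by |correlation with the class| and keep the top n."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:n]  # indices of the n highest-scoring features
    return keep, scores

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 2] += 2.0 * y                       # make feature 2 informative
keep, scores = filter_select(X, y, n=2)
print(keep, scores.round(2))
```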

10 Individual Feature Irrelevance
Figure: the class-conditional densities of feature x_i for Y = 1 and Y = −1, plotted against x_i; the two curves look nearly identical.
A feature X_i is irrelevant when P(X_i, Y) = P(X_i) P(Y), or equivalently P(X_i | Y) = P(X_i), i.e. P(X_i | Y = 1) = P(X_i | Y = −1).
Think of disease diagnosis and imagine this feature is age; we want to discriminate between sick and healthy patients. The two class-conditional distributions look very similar, so the feature is of no use, because we cannot discriminate the classes with it. If the joint distribution of X and Y equals the product of the marginals, then X and Y are independent; equivalently, by Bayes' rule, the probability of the variable given the target equals the probability of the variable, which means we can ignore it.
Slide from I. Guyon, PASCAL Bootcamp in ML

11 Individual Feature Relevance
Figure: the two class-conditional distributions of x_i (for Y = −1 and Y = 1), with means m− and m+ and standard deviations s− and s+.
One simple measure of relevance is the displacement between the means of the two distributions. This new feature does carry some information. How do we define an algorithm to determine which features are predictive? Simple idea: use the difference between the means, normalized to account for the different ranges of the features.
Figure from I. Guyon, PASCAL Bootcamp in ML
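A sketch of that simple idea (my own illustration, not code from the course): score each feature by the gap between class means, normalized by the within-class spread, essentially a Fisher-style score.

```python
import numpy as np

def mean_diff_score(X, y):
    """|m+ - m-| / (s+ + s-) for each feature; larger means more relevant."""
    pos, neg = X[y == 1], X[y == -1]
    m_plus, m_minus = pos.mean(axis=0), neg.mean(axis=0)
    s_plus, s_minus = pos.std(axis=0), neg.std(axis=0)
    return np.abs(m_plus - m_minus) / (s_plus + s_minus + 1e-12)

rng = np.random.default_rng(1)
y = rng.choice([-1, 1], size=300)
X = rng.normal(size=(300, 4))
X[:, 0] += 1.5 * (y == 1)              # shift the positive-class mean of feature 0
print(mean_diff_score(X, y).round(2))  # feature 0 should score highest
```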

12 Univariate Dependence
Independence: P(X, Y) = P(X) P(Y).
Measures of dependence: mutual information and correlation (see notes from the board).
How do we measure dependence? If independence is defined by the equality above, then we need a way to measure how far P(X, Y) is from P(X) P(Y). This is not just a single number, because it depends on the density functions of X and Y (i.e., how often we observe the various values), so we take an expected value over the joint distribution of X and Y. One such measure is mutual information: the expected value, over the joint distribution of X and Y, of log [ P(X, Y) / (P(X) P(Y)) ]. You can use it to rank features, but the difficulty with it as a general measure is that estimating the densities is hard unless the variable is binary; a discrete variable is still fine, but for a real-valued variable we need to discretize the data or use a density estimate that is not trivial, and how to do this properly is not always clear. Mutual information is also the Kullback–Leibler (KL) divergence between P(X, Y) and P(X) P(Y).
Slide from I. Guyon, PASCAL Bootcamp in ML
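A minimal sketch of estimating mutual information between one real-valued feature and a binary class from histogram counts (my own illustration; as the notes warn, the binning choice matters and better estimators exist).

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram-based MI estimate (in nats) between a real feature x and labels y in {-1, +1}."""
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    joint = np.zeros((bins + 2, 2))            # joint counts over (feature bin, class)
    for xb, yb in zip(x_binned, (y == 1).astype(int)):
        joint[xb, yb] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)      # marginal of the binned feature
    py = joint.sum(axis=0, keepdims=True)      # marginal of the class
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(2)
y = rng.choice([-1, 1], size=1000)
relevant = rng.normal(size=1000) + 1.0 * (y == 1)
irrelevant = rng.normal(size=1000)
print(round(mutual_information(relevant, y), 3), round(mutual_information(irrelevant, y), 3))
```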

13 Correlation and MI
Figure: the marginal distributions P(X) and P(Y) shown alongside two made-up joint densities of X and Y; one example has R = 0.02 and MI = 1.03 nat, the other has R = (value missing) and MI = 1.65 nat.
We next address the relationship between correlation and MI. Correlation measures linear dependence; MI does not require the dependence to be linear, but the price is that it can be expensive to estimate. The upper-left and lower-right panels show the marginal distributions of X and Y, two variables (ignore that we have been using Y for the class elsewhere). The lower-left and upper-right panels are two made-up joint density functions of X and Y. The lower-left one has structure; the upper-right one does not. In terms of correlation both look terrible, since no line explains Y from X or vice versa, but MI is able to find that there is a relationship in the lower-left one.
Slide from I. Guyon, PASCAL Bootcamp in ML

14 Gaussian Distribution
Figure: a bivariate Gaussian example, showing the joint density of X and Y and the marginals P(X) and P(Y).
For the Gaussian distribution there is an exact relationship between the correlation R and MI, given by MI(X, Y) = −(1/2) log(1 − R²). In general the relationship is not this straightforward.
Slide from I. Guyon, PASCAL Bootcamp in ML
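A quick worked instance of that formula (my own numbers, using the natural log so the result is in nats): for a bivariate Gaussian with correlation R = 0.9,

$$\mathrm{MI}(X, Y) = -\tfrac{1}{2}\ln\left(1 - 0.9^{2}\right) = -\tfrac{1}{2}\ln(0.19) \approx 0.83 \ \text{nat},$$

while R = 0 gives MI = 0, consistent with uncorrelated jointly Gaussian variables being independent.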

15 T-test
Figure: the class-conditional densities P(x_i | Y = 1) and P(x_i | Y = −1), with means m+ and m− and standard deviations s+ and s−.
Assume normally distributed classes with equal variance s², unknown and estimated from the data as s²_within (the pooled within-class variance). Null hypothesis H0: m+ = m−. T statistic: if H0 is true, t = (m+ − m−) / (s_within · sqrt(1/n+ + 1/n−)) follows a Student t distribution with n+ + n− − 2 degrees of freedom, where n+ and n− are the numbers of positive and negative examples.
In statistics, tests like this are used to determine relevance: the t-test measures whether there is a significant difference between two distributions assumed to be normal. If the null hypothesis H0 is true, the difference between the two means, rescaled by its standard error, follows a t distribution. You can run the test and accept a feature as relevant at a chosen risk level, or use the p-value to rank the features and cut off at a particular value. Often people just assume a normal distribution.
Slide from I. Guyon, PASCAL Bootcamp in ML
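A sketch of using the two-sample t statistic to score features (my own illustration; scipy.stats.ttest_ind would also work, this version just spells out the pooled-variance formula above).

```python
import numpy as np

def t_statistic(x, y):
    """Two-sample t statistic (pooled variance) for one feature x and labels y in {-1, +1}."""
    pos, neg = x[y == 1], x[y == -1]
    n_pos, n_neg = len(pos), len(neg)
    # pooled ("within") variance estimate with n+ + n- - 2 degrees of freedom
    s2_within = ((n_pos - 1) * pos.var(ddof=1) + (n_neg - 1) * neg.var(ddof=1)) / (n_pos + n_neg - 2)
    return (pos.mean() - neg.mean()) / np.sqrt(s2_within * (1 / n_pos + 1 / n_neg))

rng = np.random.default_rng(3)
y = rng.choice([-1, 1], size=200)
x_relevant = rng.normal(size=200) + 0.8 * (y == 1)
x_noise = rng.normal(size=200)
print(round(t_statistic(x_relevant, y), 2), round(t_statistic(x_noise, y), 2))
```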

16 Other ideas for Univariate Feature Selection?
Run an LTU (linear threshold unit) and select the features with large weights.
Run a decision tree and select the features chosen in the tree.
A sketch of both ideas follows below.
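A hedged sketch of both ideas using scikit-learn (assuming it is available; the slides do not prescribe a library, and logistic regression stands in for the LTU here): rank features by the absolute weights of a linear model and by a decision tree's impurity-based importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Idea 1: linear model -- keep features with large |weight|
linear = LogisticRegression(max_iter=1000).fit(X, y)
weight_rank = np.argsort(np.abs(linear.coef_[0]))[::-1]

# Idea 2: decision tree -- keep features the tree actually splits on
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_rank = np.argsort(tree.feature_importances_)[::-1]

print("by |weight|:", weight_rank[:3], "by tree importance:", tree_rank[:3])
```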

17 Considering each feature alone may fail
Why consider multiple variables (subsets) together?
Look at the two scatter plots; each shows two class distributions together with their projections onto the individual axes.
In the first, imagine x1 is age and x2 is weight. The projection onto x2 alone shows no separation, so univariate feature selection would throw x2 away; yet x2, irrelevant on its own, is useful together with x1.
In the second, each variable alone gives no separation, but together they separate the classes (even though the boundary is not linear). A tiny constructed example follows below.
Guyon-Elisseeff, JMLR 2004; Springer 2006
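A tiny illustration of the second case (my own construction, in the spirit of the slide): two features whose univariate correlations with the class are near zero, but which determine the class exactly when taken together.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.choice([-1.0, 1.0], size=500)
x2 = rng.choice([-1.0, 1.0], size=500)
y = np.sign(x1 * x2)   # XOR-like concept: the class depends on both features jointly

# Univariate view: each feature is (nearly) uncorrelated with the class ...
print(round(np.corrcoef(x1, y)[0, 1], 3), round(np.corrcoef(x2, y)[0, 1], 3))
# ... but the pair determines it exactly (by construction).
print(np.all(np.sign(x1 * x2) == y))
```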

18 Multivariate Filter Methods?
In practice this is very hard: approximating the joint distribution over subsets of features requires an enormous amount of data.

19 Filtering
Advantages: fast, simple to apply.
Disadvantages: does not take into account which learning algorithm will be used; does not take into account correlations between features, only the correlation of each feature with the class label.
The last point can be an advantage if we are only interested in ranking the relevance of features rather than performing prediction.

20 Feature Selection Methods
Diagram of the two approaches:
Filter: All Features → Filter → Selected Features → Supervised Learning Algorithm → Classifier.
Wrapper: All Features → Search → Feature Subset → Feature Evaluation Criterion (Supervised Learning Algorithm → Classifier) → Criterion Value, fed back to the Search; the Selected Features then go to a Supervised Learning Algorithm to build the final Classifier.

21 Wrapper Method
Wrapper diagram (as on the previous slide): All Features → Search → Feature Subset → Feature Evaluation Criterion → Criterion Value → back to Search → Selected Features.
What can we do here? How should we evaluate each subset?

22 Search Method: Sequential Forward Search
Wrapper loop (as above), with the search component instantiated as sequential forward search.
Example search path: evaluate the single features A, B, C, D and keep the best (here B); then evaluate A,B / B,C / B,D and keep the best (here B,C); then evaluate A,B,C / B,C,D; and so on. A minimal code sketch follows below.
We applied sequential forward search in our experiments. For other applications we have applied other search methods, and we have also developed an interactive framework for visually exploring clusters (appeared in KDD 2000). This is a greedy method that may find a suboptimal solution: take the case where two features alone tell you nothing, but together they tell you a lot.
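A minimal sketch of the wrapper loop (my own code, not the course's; scikit-learn assumed available): the evaluation criterion is cross-validated accuracy of whatever classifier you pass in, and the search greedily adds the single feature that improves that criterion most.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate_subset(model, X, y, subset):
    """Wrapper criterion: cross-validated accuracy using only the chosen features."""
    return cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

def sequential_forward_search(model, X, y, k):
    """Greedily add the feature that most improves the criterion, until k features are chosen."""
    selected, remaining = [], set(range(X.shape[1]))
    while len(selected) < k:
        best = max(remaining, key=lambda f: evaluate_subset(model, X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
print(sequential_forward_search(DecisionTreeClassifier(random_state=0), X, y, k=3))
```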

23 Search Method: Sequential Backward Elimination
Wrapper loop (as above), with the search component instantiated as sequential backward elimination.
Example search path: starting from all features ABCD, evaluate the subsets obtained by dropping one feature at a time: ABC, ABD, ACD, BCD; keep the best (here ABD); then evaluate AB, AD, BD; keep the best (here AD); then evaluate A and D; and so on. A matching code sketch follows below.
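A matching sketch of backward elimination, reusing the hypothetical evaluate_subset criterion, data, and classifier from the forward-search example above: start from all features and greedily drop the feature whose removal hurts the criterion least.

```python
def sequential_backward_elimination(model, X, y, k):
    """Greedily drop one feature at a time until only k remain."""
    selected = list(range(X.shape[1]))
    while len(selected) > k:
        # remove the feature whose removal leaves the best-scoring subset
        worst = max(selected,
                    key=lambda f: evaluate_subset(model, X, y, [g for g in selected if g != f]))
        selected.remove(worst)
    return selected

print(sequential_backward_elimination(DecisionTreeClassifier(random_state=0), X, y, k=3))
```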

24 Forward or Backward Selection?
Of the three variables in this example, the third one separates the two classes best by itself (bottom-right histogram). It is therefore the best candidate in a forward selection process. Still, the two other variables are better taken together than any subset of two that includes it. A backward selection method may perform better in this case.

25 Model Search
More sophisticated search strategies exist: best-first search, stochastic search. See “Wrappers for Feature Subset Selection”, Kohavi and John, 1997.
Other objective functions exist which add a model-complexity penalty to the training error, e.g. AIC and BIC.
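For reference, the standard definitions of these criteria (not spelled out on the slide) are, with k the number of model parameters, n the number of training examples, and the maximized likelihood written as L-hat:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}.$$

Both reward fit and penalize complexity; BIC penalizes extra parameters more heavily as n grows.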

26 Regularization
In certain cases, we can move model selection into the induction algorithm, so we only have to fit one model, which is more efficient. This is sometimes called an embedded feature selection algorithm.
In mathematics and statistics, particularly in machine learning and inverse problems, regularization involves introducing additional information in order to solve an ill-posed problem or to prevent overfitting. This information usually takes the form of a penalty for complexity, such as restrictions on smoothness or bounds on the vector-space norm. A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
The same idea arose in many fields of science. For example, the least-squares method can be viewed as a very simple form of regularization. A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total-variation regularization, have become popular.
In statistics and machine learning, regularization is used to prevent overfitting. Typical examples include ridge regression, the lasso, and the L2 norm in support vector machines. Regularization methods are also used for model selection, where they work by implicitly or explicitly penalizing models based on the number of their parameters. For example, Bayesian learning methods use a prior probability that usually gives lower probability to more complex models. Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods of controlling overfitting that do not involve regularization include cross-validation.

27 Regularization
Regularization: add a model-complexity penalty to the training error, i.e. minimize the training error plus C times a penalty on the weight vector, for some constant C.
Regularization forces the weights to be small, but does it force any weights to be exactly zero? Setting a feature's weight w_f to exactly zero is equivalent to removing feature f from the model.
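A hedged sketch answering that question empirically with scikit-learn (assuming it is available; the data set and alpha values are arbitrary choices of mine): an L1 (lasso) penalty drives many weights exactly to zero, while an L2 (ridge) penalty only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

l1 = Lasso(alpha=1.0).fit(X, y)   # L1-norm penalty
l2 = Ridge(alpha=1.0).fit(X, y)   # L2-norm penalty

print("L1: zero weights =", int(np.sum(l1.coef_ == 0)), "of", X.shape[1])
print("L2: zero weights =", int(np.sum(l2.coef_ == 0)), "of", X.shape[1])
```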

28 Kernel Methods (Quick Review)
Expanding the feature space gives us new, potentially useful features.
Kernel methods let us work implicitly in a high-dimensional feature space, with all calculations performed quickly in the low-dimensional space.

29 Feature Engineering
Linear models are convenient and fairly broad, but limited. We can increase the expressiveness of linear models by expanding the feature space, e.g. by mapping (x1, x2) to the six monomials of degree at most two, so the feature space is R^6 rather than R^2, and then learning a linear predictor in the expanded features. A sketch of this expansion, and the matching polynomial kernel, follows below.
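A small sketch of the expansion and of the kernel trick from the previous slide (my own choice of scaling, assuming the standard degree-2 polynomial map): the explicit 6-dimensional map and the kernel (1 + x·z)² give the same inner product.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map R^2 -> R^6, scaled to match the kernel below."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return (1.0 + np.dot(x, z)) ** 2

x, z = np.array([0.5, -1.0]), np.array([2.0, 3.0])
print(np.dot(phi(x), phi(z)), poly_kernel(x, z))  # same value, computed two different ways
```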

30 END OF MATERIAL
The follow-up slides require knowledge of ROC curves and the L1/L2 norms, which have not yet been covered in Fall 2012.

31 Individual Feature Relevance
Figure: an ROC curve (sensitivity vs. 1 − specificity) with its AUC, shown alongside the two class-conditional distributions of x_i with means m− and m+ and standard deviations s− and s+.
One simple measure of relevance is the displacement between the means of the two distributions; the new feature on the right does carry some information. How do we define an algorithm to determine which features are predictive? Idea 1: use the difference between the means, normalized to account for different ranges. Idea 2: use the AUC of each single feature. We treat each feature as a “classifier” by thresholding its value to predict the positive or negative class, then measure the AUC, which shows the trade-off between sensitivity (the true-positive rate, i.e. accuracy on positive instances) and specificity (the true-negative rate, i.e. accuracy on negative instances).
Slide from I. Guyon, PASCAL Bootcamp in ML
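A sketch of Idea 2 (my own illustration, using scikit-learn's roc_auc_score for the AUC computation): score each feature by how well its raw value ranks positives above negatives; values near 0.5 mean irrelevant, values near 0 or 1 mean informative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def feature_auc(X, y):
    """AUC of each single feature used directly as a 'classifier' score."""
    return np.array([roc_auc_score(y == 1, X[:, j]) for j in range(X.shape[1])])

rng = np.random.default_rng(5)
y = rng.choice([-1, 1], size=400)
X = rng.normal(size=(400, 3))
X[:, 1] += 1.2 * (y == 1)          # make feature 1 informative
print(feature_auc(X, y).round(2))  # feature 1 should be well above 0.5
```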

32 L1 versus L2 Regularization
NEED TO UNDERSTAND THIS STILL

