Feature Selection Ioannis Tsamardinos Machine Learning Course, 2006

Feature Selection Ioannis Tsamardinos Machine Learning Course, 2006
Computer Science Dept. University of Crete (Some slides borrowed from Aliferis,Tsamardinos, 2004 Medinfo Tutorial)

Outline What is Variable/Feature Selection Filters & Wrappers
What is Relevancy Connecting Wrappers, Filters, and Relevancy SVM-Based Variable Selection Markov-Blanket-Based Variable Selection

Back to the Fundamentals: The feature selection problem
Journal of Machine Learning Research special issue: “Variable Selection refers to the problem of selecting input variables that are most predictive of a given outcome” Kohavi and John [1997]: “[variable selection is the problem of selecting]… the subset of features such that the accuracy of the induced classifier … is maximal” Problem: according to which classifier is predictive power measured by? A specific one? All possible classifiers? What about different cost features?

Why Feature Selection To reduce cost or risk associated with observing the variables To increase predictive power To reduce the size of the models, so they are easier to understand and trust To understand the domain

Definition of Feature Selection
Let M be a metric, scoring a model and a feature subset according to predictions and features used Let A be a learning algorithm used to build the model Feature Selection Problem: Select a feature subset s, that maximizes the score that M gives to the model learned by A using features s Feature Selection Problem 2: Select a feature subset s and a learner A’, that maximizes the score M gives to the model learned by A’ using features s

Examples M is the accuracy + a preference for smaller models, A is a SVM Find the minimal feature subset that maximizes the accuracy of a SVM Other possibilities for M: calibrated accuracy, AUC, trade-off between accuracy and cost of features

Filters and Wrappers

Wrappers An algorithm for solving the feature selection problem that is allowed to evaluate (has access to) learner A on different feature subsets Typical wrapper: Search (greedily or otherwise) the feature subset space Evaluate using M each subset s during search Report the one that maximizes M

Wrappers Say we have predictors A, B, C and classifier M. We want to predict T given the smallest possible subset of {A,B,C}, while achieving maximal performance (accuracy) FEATURE SET CLASSIFIER PERFORMANCE {A,B,C} M 98% {A,B} M 98% {A,C} M 77% {B,C} M 56% {A} M 89% {B} M 90% {C} M 91% {.} M 85%

An Example of a Greedy Wrapper
Since the search space is exponential, we have to use heuristic search Solution {A,B} 98 start {A} 89 {A,C} 77 {A,B}98 {A,B,C}98 {} 85 {B} 90 {B,C} 56 {A,C}77 {C} 91 end Subset returned by greedy search {B,C}56

Wrappers A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (“forward greedy wrapping”) Alternatively we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (“backward greedy wrapping”) A third alternative is to interleave the two phases (adding and removing) either in forward or backward wrapping (“forward-backward wrapping”). Of course other forms of search can be used; most notably: Exhaustive search Genetic Algorithms Branch-and-Bound (e.g., cost=# of features, goal is to reach performance th or better)

Example Feature Selection Methods in Bioinformatics: GA/KNN
Wrapper approach whereby: heuristic search=Genetic Algorithm, and classifier=KNN

Filters An algorithm for solving the feature selection problem that is not allowed to evaluate (does not have access to) learner A Typical filters select the feature subset according to certain statistical properties

Filter Example: Univariate Association Filtering
Rank features according to their association with the target (univariately) Select the first k features FEATURE ASSOCIATION WITH TARGET No Threshold gives optimal solution {C} % {B} % {A} %

Order all predictors according to strength of association with target
Example Feature Selection Methods in Biomedicine: Univariate Association Filtering Order all predictors according to strength of association with target Choose the first k predictors and feed them to the classifier Various measures of association may be used: X2, G2, Pearson r, Fisher Criterion Scoring, etc. How to choose k? What if we have too many variables?

Filter algorithm where feature selection is done as follows:
Example Feature Selection Methods in Biomedicine: Recursive Feature Elimination Filter algorithm where feature selection is done as follows: build linear Support Vector Machine classifiers using V features compute weights of all features and choose the best V/2 repeat until 1 feature is left choose the feature subset that gives the best performance (using cross-validation) give best feature set to the classifier of choice.

What is Relevancy

Relevant and Irrelevant Features
Large effort and debate to define relevant (irrelevant) features (AI Journal vol. 97) Why? Intuition: For classification (presumably) we only need relevant features We can throw away the irrelevant features The set of relevant features must be the solution to the feature selection problem! What is Relevant must be independent of the classifier A used to build the final model! Relevant Features teach us something about the domain

Relevancy and Filters Consider a definition of relevancy
Construct an algorithm that attempts to identify the relevant features It is a filtering algorithm (independent of the classifier used) Relevancy  a family of filtering algorithms

The Argument of Kohavi and John [1997]
Take a handicapped perceptron: sgn(wx) instead of sgn(wx + w0) Add an irrelevant variable to the data with value always 1 For some problems, the irrelevant variables is necessary Filtering (presumably) returns only “relevant” features Thus, filtering is suboptimal, wrapping is not sgn(wx) 1 1 x4 x3 x2 x1 x0

KJ Definitions of Relevancy
KJ-Strongly Relevant Variable (for target T) X is KJ-Strongly Relevant if it is necessary for optimal density estimation V set of all variables, S=V \ {X, T} P(T | X, S)  P(T | S)

KJ Definitions of Relevancy
KJ-Weakly Relevant Variable (for target T) X is Weakly Relevant if it is not necessary for optimal density estimation, but still informative (i.e., there is some subset that makes conditioned on which it becomes informative) V set of all variables X not strongly relevant and There exists U V \ {X, T} P(T | X, U)  P(T | X)

KJ Definition of Irrelevancy
A variable X is KJ-irrelevant to T if it is not weakly or strongly relevant to T Intuitively X provides no information for T conditioned on any subset of other variables

Connecting Wrappers, Filters, and Relevancy

Negative Results on Relevancy and Filters
Kohavi and John argument: “filtering returns only relevant variables and sometimes KJ-irrelevant or KJ-weakly relevant variables maybe needed” True: There is no definition of relevancy independent of the classifier A used to build the final model, or independent of the metric M that evaluates the model of A, such that the relevant features are the solution to the feature selection problem [Tsamardinos, Aliferis, AI&Stats 2003] Have to assume a (family) of algorithm(s) and metric(s) to define what is relevant

Negative Results on Wrappers
Wrappers are subject to the No Free Lunch theorem for black-box optimization if the choice of metric or the classifier is unconstrained [Tsamardinos, Aliferis, AI&Stats 2003] => Averaged out on all possible problems, each wrapper is the same as the random search Requires an exponential search to provably find the optimal feature subset

Connecting with Bayesian Networks
Faithful Bayesian Network A Irrelevant Features KJ Strongly Relevant Features B C D KJ-irrelevant Features (anything without a path to T) K F T H E I Markov Blanket of T Weakly Relevant Features (anything with a path to T)

Markov Blanket in Faithful Bayesian Networks
Markov Blanket of T KJ-Strongly Relevant Features Smallest set of variables, conditioned on which all other variables become independent of T Set of parents, children, and spouses of T

OPTIMAL Solutions to a class of Feature Selection Problems
MB(T) is smallest subset of variables, conditioned on which all other variables become independent of T The Markov Blanket of T should be all we need True when: Classifier can utilize the information in those variables (e.g. is a universal approximator) The metric prefers the smallest models with optimal calibrated accuracy (otherwise the Markov Blanket may include unnecessary variables)

SVM Based Variable Selection

Linear SVMs Identify Irrelevant Features
Theorem: Both the hard and soft margin linear SVM will assign a weight of zero to irrelevant features (in the sample limit) Set up the sample limit SVM Prove there is a unique and w, b in the sample limit Prove in this w, b the weight of the irrelevant features is zero

Linear SVMs may not Identify KJ-Strongly Relevant Variables
Consider an exclusive OR The soft margin linear SVM has a zero weight vector But, both features are KJ-strongly relevant Similar result expected for non-linear SVMs x2 1 1 -1 -1 1 1

Linear SVMs may Retain KJ-Weakly-Relevant Features
X2 1 1 X1 1+

Feature Selection with (Linear) SVMs
The SVM will correct remove irrelevant features The SVM may incorrect also remove strongly relevant features The SVM may incorrectly retain weakly relevant features

Markov Blanket Based Feature Selection

Optimal Feature Selection with Markov Blankets
MMMB and Hiton algorithms [KDD 2003, AMIA 2003] Can identify the MB(T) among thousands of variables Provably correct in the sample limit and in faithful distributions Provably the MB(T) is the solution under the conditions specified Excellent results in real datasets from biomedicine The different between the two methods is conditioning vs maximizing the margin

Causal Discovery Recall: feature selection to understand the domain
Markov Blanket of T Causal interpretation: direct causes, direct effects, direct causes of direct effects … When Faithfulness Causal Sufficiency Acyclicity

(999 variables, consists of 37 tiles of Alarm network)
Network: Alarm-1k (999 variables, consists of 37 tiles of Alarm network) Classification Algorithm: RBF SVM Training sample size = 1000 Testing sample size = 1000

Target = 46 Target Member of MB

Feature Selection Method = HITON_MB
Classification performance = 87.7% 3 True positives 2 False positives 0 False negatives Target False negative True positive False positive

Feature Selection Method = MMMB

Feature Selection Method = RFE Linear
Classification performance = 74.6% 2 True positives 189 False positives 1 False negative Target False negative True positive False positive

Feature Selection Method = RFE Polynomial
Classification performance = 85.6% 2 True positives 33 False positives 1 False negative Target False negative True positive False positive

Feature Selection Method = BFW

Classification Algorithm: See 5.0 Decision Trees
Network: Gene (801 variables) Classification Algorithm: See 5.0 Decision Trees Training sample size = 1000 Testing sample size = 1000

Target = 220 Target Member of MB

Feature Selection Method = MMMB
Classification performance with DT = 96.4% 9 True positives 4 False positives 0 False negatives Target False negative True positive False positive

Feature Selection Method = HITON_MB

Feature Selection Method = BFW

Is This A General Phenomenon Or A Contrived Example?

Random Targets in Tiled ALARM

Random Targets in GENE

Conclusions A formal definition of the feature selection problems allows to draw connections between relevant/irrelevant variables, the Markov Blanket, and solutions to the feature selection problem Need to specify the algorithm and metric to design algorithms that provably solve the feature selection problem

Conclusions Linear SVMs (current formulations) correctly identify the irrelevant variables, but do not solve the feature selection problem (under the conditions specified) Markov Blanket based algorithms exists that are probably correct in the sample limit for faithful distributions Questions: SVM formulations that provably return the solution Extend the results to the non-linear case

Feature Selection Ioannis Tsamardinos Machine Learning Course, 2006

Similar presentations

Presentation on theme: "Feature Selection Ioannis Tsamardinos Machine Learning Course, 2006"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Feature Selection Ioannis Tsamardinos Machine Learning Course, 2006

Similar presentations

Presentation on theme: "Feature Selection Ioannis Tsamardinos Machine Learning Course, 2006"— Presentation transcript:

Similar presentations

About project

Feedback