315 Feature Selection

316 Goals
– What is feature selection for classification?
– Why is feature selection important?
– What are the filter and the wrapper approaches to feature selection?
– Examples

317 What is Feature Selection for classification?
– Given: a set of predictors ("features") V and a target variable T
– Find: the minimum set F ⊆ V that achieves maximum classification performance for T (for a given set of classifiers and classification performance metrics)

318 Why is feature selection important?
– May improve the performance of the classification algorithm
– The classification algorithm may not scale up to the full feature set, either in sample size or in running time
– Allows us to better understand the domain
– Cheaper to collect a reduced set of predictors
– Safer to collect a reduced set of predictors

319 Filters vs Wrappers: Wrappers
Say we have predictors A, B, C and classifier M. We want to predict T given the smallest possible subset of {A,B,C}, while achieving maximal performance (accuracy).

FEATURE SET   CLASSIFIER   PERFORMANCE
{A,B,C}       M            98%
{A,B}         M            98%
{A,C}         M            77%
{B,C}         M            56%
{A}           M            89%
{B}           M            90%
{C}           M            91%
{}            M            85%

320 Filters vs Wrappers: Wrappers
The set of all subsets is the power set, and its size is 2^|V|. Hence for large V we cannot run this procedure exhaustively; instead we rely on heuristic search of the space of all possible feature subsets.
[Diagram: the lattice of all feature subsets, each annotated with classifier performance ({} 85, {A} 89, {B} 90, {C} 91, {A,B} 98, {A,C} 77, {B,C} 56, {A,B,C} 98), with search proceeding from a start node to an end node.]
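To make the 2^|V| blow-up concrete, here is a minimal sketch of exhaustive wrapper search; it is feasible only for a handful of features. The dataset, classifier, and cross-validated accuracy metric are illustrative assumptions, not part of the slides.

```python
# Exhaustive wrapper search: evaluate every subset of a small feature set.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X = X[:, :4]                          # keep |V| = 4 so the 2^4 = 16 subsets stay enumerable
clf = LogisticRegression(max_iter=5000)

best_score, best_subset = -np.inf, ()
for k in range(X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        if not subset:                # empty set: fall back to majority-class accuracy
            score = max(np.mean(y), 1 - np.mean(y))
        else:
            score = cross_val_score(clf, X[:, subset], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(f"best subset {best_subset} with CV accuracy {best_score:.3f}")
```

With tens or hundreds of features the outer loops become astronomically large, which is exactly why the heuristic searches described next are used instead.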

321 Filters vs Wrappers: Wrappers
A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved.
[Diagram: the same subset lattice, with the hill-climbing path from the start node to the end node highlighted.]

322 Filters vs Wrappers: Wrappers
A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved ("forward greedy wrapping").
Alternatively, we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved ("backward greedy wrapping").
A third alternative is to interleave the two phases (adding and removing), either in forward or backward wrapping ("forward-backward wrapping").
Of course, other forms of search can also be used, most notably:
– Exhaustive search
– Genetic algorithms
– Branch-and-bound (e.g., cost = number of features, goal is to reach a performance threshold th or better)
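A hedged sketch of forward greedy wrapping (hill climbing) follows; the classifier and the cross-validated scoring are assumptions left to the caller.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_greedy_wrapper(clf, X, y, cv=5):
    """Hill climbing over feature subsets: add one feature at a time."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        # score every single-feature extension of the current subset
        scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:   # no improvement: stop at a local optimum
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score
```

Backward greedy wrapping is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts least, stopping once every removal degrades performance.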

323 Filters vs Wrappers: Filters
In the filter approach we do not rely on running a particular classifier and searching the space of feature subsets; instead we select features on the basis of statistical properties. A classic example is univariate association with the target:

FEATURE   ASSOCIATION WITH TARGET
{A}       91%
{B}       90%
{C}       89%

Depending on where the selection threshold is placed in this ranking, the result may be suboptimal or optimal (e.g., keeping only the top-ranked feature {A} is suboptimal, while keeping the top two recovers the optimal subset {A,B} from the wrapper example).

324 Example Feature Selection Methods in Biomedicine: Univariate Association Filtering
– Order all predictors according to the strength of their association with the target
– Choose the first k predictors and feed them to the classifier
– Various measures of association may be used: X², G², Pearson r, Fisher criterion scoring, etc.
– How do we choose k?
– What if we have too many variables?
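A minimal sketch of univariate association filtering; Pearson r is used as the association measure here purely for illustration (X², G², or the Fisher criterion could be swapped in), and the choice of k is left to the caller, which is exactly the open question above.

```python
import numpy as np

def univariate_filter(X, y, k):
    """Rank features by |Pearson r| with the target and keep the top k."""
    assoc = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    order = np.argsort(assoc)[::-1]          # strongest association first
    return order[:k], assoc[order[:k]]
```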

325 Example Feature Selection Methods in Biomedicine: Recursive Feature Elimination
– Filter algorithm where feature selection is done as follows:
1. Build a linear Support Vector Machine classifier using all V features
2. Compute the weights of all features and keep the best V/2
3. Repeat until 1 feature is left
4. Choose the feature subset that gives the best performance (using cross-validation)
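A hedged sketch of the halving procedure above using scikit-learn's LinearSVC; the halving schedule follows the slide, while cross-validated accuracy as the final selection criterion is an assumption. (scikit-learn's RFE/RFECV classes implement a production-quality version of the same idea.)

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def rfe_halving(X, y, cv=5):
    """Repeatedly train a linear SVM and keep the half of the features with largest |weights|."""
    features = list(range(X.shape[1]))
    visited = []                                   # (subset, CV accuracy) at each halving step
    while True:
        clf = LinearSVC(max_iter=10000)
        score = cross_val_score(clf, X[:, features], y, cv=cv).mean()
        visited.append((list(features), score))
        if len(features) == 1:
            break
        clf.fit(X[:, features], y)
        weights = np.abs(clf.coef_).sum(axis=0)    # per-feature importance from SVM weights
        keep = np.argsort(weights)[::-1][: len(features) // 2]
        features = [features[i] for i in sorted(keep)]
    return max(visited, key=lambda t: t[1])        # best (subset, score) seen
```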

326 Example Feature Selection Methods in Bioinformatics: GA/KNN
A wrapper approach in which the heuristic search is a Genetic Algorithm and the classifier is KNN.
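A compact, hedged sketch of the GA/KNN idea: each chromosome is a bit mask over the features, and its fitness is the cross-validated accuracy of a KNN classifier restricted to the selected features. The population size, selection scheme, crossover, and mutation settings below are illustrative assumptions, not the published GA/KNN configuration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def ga_knn(X, y, pop_size=20, generations=30, mutation_rate=0.05, cv=3, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_features))       # random bit masks

    def fitness(mask):
        if mask.sum() == 0:                                      # empty mask is useless
            return 0.0
        knn = KNeighborsClassifier(n_neighbors=3)
        return cross_val_score(knn, X[:, mask.astype(bool)], y, cv=cv).mean()

    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)                     # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flips = rng.random(n_features) < mutation_rate        # bit-flip mutation
            child[flips] = 1 - child[flips]
            children.append(child)
        pop = np.vstack([parents, children])

    scores = np.array([fitness(m) for m in pop])
    best = pop[scores.argmax()]
    return np.flatnonzero(best), scores.max()
```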

327 How do we approach the feature selection problem in our research?
– Find the Markov Blanket
– Why?

328 A fundamental property of the Markov Blanket
– MB(T) is the minimal set of predictor variables needed for classification (diagnosis, prognosis, etc.) of the target variable T (given a powerful enough classifier and calibrated classification)
[Diagram: a large network of variables V with the target T and its Markov blanket highlighted.]

329 HITON: An algorithm for feature selection that combines MB induction with wrapping
C.F. Aliferis M.D., Ph.D., I. Tsamardinos Ph.D., A. Statnikov M.S.
Department of Biomedical Informatics, Vanderbilt University
AMIA Fall Conference, November 2003

330 HITON: An algorithm for feature selection that combines MB induction with wrapping

ALGORITHM                  SOUND   SCALABLE   SAMPLE EXPONENTIAL TO |MB|   COMMENTS
Cheng and Greiner          YES     NO                                      Post-processing on learning BN
Cooper et al.              NO                                              Uses full BN learning
Margaritis and Thrun       YES                                             Intended to facilitate BN learning
Koller and Sahami          NO                                              Most widely-cited MB induction algorithm
Tsamardinos and Aliferis   YES                                             Some use BN learning as sub-routine
HITON                      YES                NO

331 HITON: An algorithm for feature selection that combines MB induction with wrapping
Step #1: Find the parents and children of T; call this set PC(T)
Step #2: Find the PC(.) set of each member of PC(T); take the union of all these sets to be PCunion
Step #3: Run a special test to filter out from PCunion the non-members of MB(T) that can be identified as such (not all can); call the resultant set TMB (tentative MB)
Step #4: Apply heuristic search with a desired classifier/loss function and cross-validation to identify variables that can be dropped from TMB without loss of accuracy

332 HITON(Data D; Target T; Classifier A)
"Returns a minimal set of variables required for optimal classification of T using algorithm A"
  MB(T) = HITON-MB(D, T)            // identify the Markov blanket
  Vars = Wrapper(MB(T), T, A)       // use wrapping to remove unnecessary variables
  Return Vars

HITON-MB(Data D, Target T)
"Returns the Markov blanket of T"
  PC = parents and children of T returned by HITON-PC(D, T)
  PCPC = parents and children of the parents and children of T
  CurrentMB = PC ∪ PCPC
  // Retain only parents of common children and remove false positives
  ∀ potential spouse X in CurrentMB and ∀ Y in PC:
    if not ∃ S ⊆ {Y} ∪ V − {T, X} such that I(T ; X | S)     // T independent of X given S
      then retain X in CurrentMB
      else remove it
  Return CurrentMB

HITON-PC(Data D, Target T)
"Returns parents and children of T" (next slide)

Wrapper(Vars, T, A)
"Returns a minimal set among variables Vars for predicting T using algorithm A and a wrapping approach"
  Select and remove a variable. If the internally cross-validated performance of A remains the same, permanently remove the variable. Continue until all variables are considered.
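A hedged sketch of the Wrapper(Vars, T, A) step: each variable is considered once and dropped only if the internally cross-validated performance of the classifier does not decrease. The tolerance parameter and the "consider variables in their given order" policy are assumptions; the slide does not specify them.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def wrapper_reduce(clf, X, y, vars_, cv=5, tol=0.0):
    """Backward wrapping: drop a variable only if CV performance is preserved."""
    current = list(vars_)
    baseline = cross_val_score(clf, X[:, current], y, cv=cv).mean()
    for v in list(current):
        trial = [w for w in current if w != v]
        if not trial:                       # never drop the last remaining variable
            continue
        score = cross_val_score(clf, X[:, trial], y, cv=cv).mean()
        if score >= baseline - tol:         # performance unchanged: remove v permanently
            current, baseline = trial, max(score, baseline)
    return current
```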

333 HITON-PC(Data D, Target T)
"Returns parents and children of T"
  CurrentPC = {}
  Repeat
    Find the variable V_i not in CurrentPC that maximizes association(V_i, T) and admit V_i into CurrentPC
    If there is a variable X and a subset S of CurrentPC s.t. I(X ; T | S)    // X independent of T given S
      remove X from CurrentPC; mark X and do not consider it again
  Until no more variables are left to consider
  Return CurrentPC
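For concreteness, a rough sketch of HITON-PC for continuous data, using a Fisher-z partial-correlation test as the conditional-independence oracle I(X ; T | S). The choice of test, the significance level, and the cap on conditioning-set size are assumptions; the algorithm itself only requires some sound test of conditional independence and a univariate association measure.

```python
import itertools
import numpy as np
from scipy import stats

def independent(data, x, y, cond, alpha=0.05):
    """Fisher-z test: True if columns x and y look independent given the columns in cond."""
    cols = [x, y] + list(cond)
    prec = np.linalg.pinv(np.corrcoef(data[:, cols], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])       # partial correlation
    r = np.clip(r, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(data.shape[0] - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z))) > alpha

def hiton_pc(data, target, candidates, max_cond=3, alpha=0.05):
    # admit candidates in order of decreasing unconditional association with the target
    order = sorted(candidates,
                   key=lambda v: -abs(np.corrcoef(data[:, v], data[:, target])[0, 1]))
    current_pc = []
    for v in order:
        current_pc.append(v)
        # eliminate any admitted variable made independent of T by some subset of CurrentPC
        for x in list(current_pc):
            others = [w for w in current_pc if w != x]
            for k in range(min(len(others), max_cond) + 1):
                if any(independent(data, x, target, s, alpha)
                       for s in itertools.combinations(others, k)):
                    current_pc.remove(x)
                    break
    return current_pc
```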

337 Filters vs Wrappers: Which Is Best?
Neither is best over all possible classification tasks! We can only prove that a specific filter (or wrapper) algorithm, for a specific classifier (or class of classifiers) and a specific class of distributions, yields optimal or sub-optimal solutions. Unless we provide such proofs, we are operating on faith and hope…

338 A final note: What is the biological significance of the selected features?
– In MB-based feature selection under CPN-faithful distributions: the causal neighborhood of the target (i.e., its direct causes, its direct effects, and the direct causes of its direct effects).
– In other methods: ???