Feature Selection Lecture 5

Lecture 5: Feature Selection (adapted from Elena Marchiori's slides) Bioinformatics Data Analysis and Tools bvhoute@few.vu.nl

What is feature selection? Reducing the feature space by removing some of the (non-relevant) features. Also known as: variable selection, feature reduction, attribute selection, variable subset selection.

Why select features? It is cheaper to measure fewer variables. The resulting classifier is simpler and potentially faster. Prediction accuracy may improve by discarding irrelevant variables. Identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection). Feature selection also alleviates the "curse of dimensionality".

Why select features? [Figure: correlation plots (scale -1 to +1) of the 3-class leukemia data, comparing no feature selection with selection of the top 100 features by variance.]

The curse of dimensionality Term introduced by Richard Bellman [1]. Problems caused by the exponential increase in volume associated with adding extra dimensions to a (mathematical) space. So the 'problem space' grows exponentially with the number of variables/features. [1] Bellman, R.E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.

The curse of dimensionality A high-dimensional feature space leads to problems in, for example: Machine learning: danger of overfitting with too many variables. Optimization: finding the global optimum is (virtually) infeasible in a high-dimensional space. Microarray analysis: the number of features (genes) is much larger than the number of objects (samples), so a huge number of observations would be needed to obtain a good estimate of the function of a gene.

Approaches Wrapper: feature selection takes into account the contribution to the performance of a given type of classifier. Filter: feature selection is based on an evaluation criterion for quantifying how well features (or feature subsets) discriminate the two classes. Embedded: feature selection is part of the training procedure of a classifier (e.g. decision trees).

Embedded methods Attempt to jointly or simultaneously train both a classifier and a feature subset. Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features. Intuitively appealing. Example: tree-building algorithms Adapted from J. Fridlyand

Approaches to Feature Selection [Diagram: Filter approach: input features → feature selection by distance-metric score → train model → model. Wrapper approach: input features → feature-selection search over candidate feature sets → train model → model, with the importance of features given by the model fed back into the search.] Adapted from Shin and Jasso

Filter methods [Diagram: p features → feature selection → S features → classifier design, with S << p.] Features are scored independently and the top S are used by the classifier. Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc. Easy to interpret, and can provide some insight into the disease markers. Adapted from J. Fridlyand
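
To make the filter recipe concrete, here is a minimal Python sketch (NumPy only; the function names and the choice of a two-sample t-statistic as the score are ours, not prescribed by the lecture): each feature is scored independently and the indices of the top S features are returned.

import numpy as np

def t_scores(X, y):
    """Two-sample t-statistic per feature (columns of X), for binary labels y."""
    a, b = X[y == 0], X[y == 1]
    m1, m2 = a.mean(axis=0), b.mean(axis=0)
    v1, v2 = a.var(axis=0, ddof=1), b.var(axis=0, ddof=1)
    return np.abs(m1 - m2) / np.sqrt(v1 / len(a) + v2 / len(b) + 1e-12)

def filter_select(X, y, S=100):
    """Keep the S highest-scoring features; returns their column indices."""
    scores = t_scores(X, y)
    return np.argsort(scores)[::-1][:S]

# usage (hypothetical data, rows = samples, columns = genes):
# top = filter_select(X_train, y_train, S=100)
# X_reduced = X_train[:, top]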

Problems with filter methods Redundancy in selected features: features are considered independently and not measured on the basis of whether they contribute new information. Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others). The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than others. Adapted from J. Fridlyand

Wrapper methods [Diagram: p features → feature selection → S features → classifier design, with S << p.] Iterative approach: many feature subsets are scored based on classification performance and the best one is used. Adapted from J. Fridlyand
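
As an illustration of the wrapper idea, a minimal greedy forward-selection sketch in Python (using scikit-learn for cross-validation and a kNN classifier; the function name, classifier choice, and stopping rule are illustrative assumptions): each candidate subset is scored by the cross-validated accuracy of the classifier it wraps.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, max_features=10, cv=5):
    """Greedy wrapper: repeatedly add the feature that most improves CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            subset = selected + [f]
            acc = cross_val_score(KNeighborsClassifier(), X[:, subset], y, cv=cv).mean()
            scores.append((acc, f))
        acc, f = max(scores)
        if acc <= best_score:       # stop when no candidate improves the score
            break
        best_score = acc
        selected.append(f)
        remaining.remove(f)
    return selected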

Problems with wrapper methods Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated. No exhaustive search is possible (2^p subsets to consider for p features): generally only greedy algorithms are used. Easy to overfit. Adapted from J. Fridlyand

Example: Microarray Analysis [Diagram: 38 labeled bone marrow samples (27 ALL, 11 AML), each with 7129 gene expression values → train a model (neural networks, support vector machines, Bayesian nets, etc.) on key genes → model → classify 34 new unlabeled bone marrow samples as AML or ALL.]

Microarray Data Challenges to Machine Learning Algorithms: Few samples for analysis (38 labeled). Extremely high-dimensional data (7129 gene expression values per sample). Noisy data. Complex underlying mechanisms, not fully understood.

Some genes are more useful than others for building classification models. Example: genes 36569_at and 36495_at are useful. [Figure: expression values of these two genes, AML vs. ALL samples.]

By contrast, genes 37176_at and 36563_at are not useful. [Figure: expression values of these two genes, AML vs. ALL samples.]

Importance of feature (gene) selection The majority of genes are not directly related to leukemia. Having a large number of features enhances the model's flexibility, but makes it prone to overfitting. Noise and the small number of training samples make this even more likely. Some types of models, like kNN, do not scale well with many features.

How do we choose the most relevant of the 7129 genes? Use distance metrics that capture class separation. Rank the genes according to their distance-metric score. Choose the top n ranked genes (from HIGH score down to LOW score).

Distance metrics: Tamayo's relative class separation, the t-statistic, and the Bhattacharyya distance (standard forms sketched below).
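
For reference, the standard two-class forms of these metrics (assumed here; the slide's own formulas are not in the transcript) can be computed per gene as in this Python sketch, with class means m1, m2, standard deviations s1, s2, and sample sizes n1, n2.

import numpy as np

def per_gene_metrics(x1, x2):
    """Standard two-class separation scores for one gene.
    x1, x2: expression values of the gene in class 1 and class 2."""
    m1, m2 = x1.mean(), x2.mean()
    s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
    n1, n2 = len(x1), len(x2)

    # Tamayo's relative class separation (signal-to-noise ratio)
    tamayo = abs(m1 - m2) / (s1 + s2)

    # Two-sample t-statistic (Welch form)
    t_stat = (m1 - m2) / np.sqrt(s1**2 / n1 + s2**2 / n2)

    # Bhattacharyya distance between two univariate Gaussians
    bhatt = 0.25 * (m1 - m2)**2 / (s1**2 + s2**2) + 0.5 * np.log((s1**2 + s2**2) / (2 * s1 * s2))
    return tamayo, t_stat, bhatt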

SVM-RFE: wrapper Recursive Feature Elimination: Train a linear SVM, which yields a linear decision function. Use the absolute values of the variable weights to rank the variables. Remove the half of the variables with the lowest rank. Repeat the above steps (train, rank, remove) on the data restricted to the variables not yet removed. Output: a subset of variables.

SVM-RFE A linear binary classifier has decision function f(x) = w·x + b; the score of variable i is the absolute value of its weight, |w_i|. Recursive Feature Elimination (SVM-RFE): at each iteration, eliminate the threshold% of variables with the lowest scores, then recompute the scores of the remaining variables.
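
A minimal sketch of the elimination loop in Python (using scikit-learn's LinearSVC; the n_keep and drop_fraction parameters and the function name are illustrative, not from the lecture):

import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=16, drop_fraction=0.5):
    """Recursively train a linear SVM and eliminate the lowest-ranked features."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        svm = LinearSVC(max_iter=10000).fit(X[:, remaining], y)
        ranks = np.abs(svm.coef_).ravel()            # |w_i| as the ranking criterion
        n_drop = min(max(1, int(drop_fraction * len(remaining))),
                     len(remaining) - n_keep)
        drop = np.argsort(ranks)[:n_drop]            # positions (within 'remaining') of the lowest scores
        remaining = np.delete(remaining, drop)
    return remaining                                 # column indices of the surviving features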

SVM-RFE: I. Guyon et al., Machine Learning, 46, 389-422, 2002.

RELIEF Idea: relevant features make (1) nearest examples of the same class closer and (2) nearest examples of opposite classes farther apart. Outline: set the weights of all features to zero; for each example in the training set, find the nearest example from the same class (hit) and from the opposite class (miss), and update the weight of each feature by adding abs(example - miss) - abs(example - hit). Kira K, Rendell L, 10th Int. Conf. on AI, 129-134, 1992.

RELIEF Algorithm RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class.

RELIEF
%input:  X (two classes)
%output: W (weights assigned to variables)
nr_var  = total number of variables;
weights = zero vector of size nr_var;
for all x in X do
    hit(x)  = nnb of x from the same class;
    miss(x) = nnb of x from the opposite class;
    weights += abs(x - miss(x)) - abs(x - hit(x));
end;
nr_ex = number of examples in X;
return W = weights / nr_ex;

Note: variables have to be normalized first (e.g., divide each variable by its (max - min) value).
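
A runnable sketch of the pseudocode above in Python (NumPy only; the function name is ours, and plain Euclidean distance is used here, whereas the worked example on the next slides uses 1 - Pearson correlation):

import numpy as np

def relief(X, y):
    """RELIEF weights for a two-class data set (X assumed already normalized).
    X: (n_examples, n_features) array; y: binary class labels."""
    n, p = X.shape
    weights = np.zeros(p)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)       # distances to all examples
        dists[i] = np.inf                              # never pick the example itself
        same = np.where(y == y[i])[0]
        opposite = np.where(y != y[i])[0]
        hit = same[np.argmin(dists[same])]             # nearest neighbour, same class
        miss = opposite[np.argmin(dists[opposite])]    # nearest neighbour, opposite class
        weights += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return weights / n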

RELIEF: example Gene expression data for two types of leukemia: 3 patients with AML (Acute Myeloid Leukemia) and 3 patients with ALL (Acute Lymphoblastic Leukemia). [Table: expression values of genes 1-5 for the six patients.] What weights does RELIEF assign to genes 1-5?

RELIEF: normalization First, apply (max - min) normalization: identify the max and min value of each feature (gene), then divide all values of each feature by the corresponding (max - min). Example: a value of 3 in a gene with max 6 and min 1 becomes 3 / (6 - 1) = 0.6.
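
A one-function sketch of this normalization step (NumPy; columns are genes, and the function name is ours):

import numpy as np

def range_normalize(X):
    """Divide each column (gene) by its (max - min), as in the lecture's normalization step."""
    return X / (X.max(axis=0) - X.min(axis=0))

# e.g. a value of 3 in a column with max 6 and min 1 becomes 3 / (6 - 1) = 0.6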

RELIEF: distance matrix [Table: data after normalization.] Then calculate the distance matrix between patients, using distance = 1 - Pearson correlation.
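
A minimal sketch of that distance-matrix computation (NumPy; rows of X are patients, and the function name is ours):

import numpy as np

def correlation_distance_matrix(X):
    """Pairwise distances between rows of X, defined as 1 - Pearson correlation."""
    return 1.0 - np.corrcoef(X)   # np.corrcoef correlates each row (patient profile) with the others

# e.g. D = correlation_distance_matrix(X_normalized); D[i, j] is the distance between patients i and j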

RELIEF: 1st iteration Iteration 1 takes AML1 and finds its nearest hit (same class) and nearest miss (opposite class). For gene 1: AML1 = 0.600, hit = 0.400, miss = 0.800. Update the weights:
weight_gene1 += abs(0.600 - 0.800) - abs(0.600 - 0.400)
weight_gene2 += abs(0.417 - 1.083) - abs(0.417 - 0.250)
... (and so on for the remaining genes)

RELIEF: 2nd iteration Iteration 2 takes AML2 and finds its nearest hit and miss. For gene 1: AML2 = 0.400, hit = 1.200, miss = 0.200. Update the weights:
weight_gene1 += abs(0.400 - 0.200) - abs(0.400 - 1.200)
weight_gene2 += abs(0.250 - 1.167) - abs(0.250 - 0.333)
... (and so on for the remaining genes)

RELIEF: results (after the 6th iteration) [Table: weights after the last iteration, unsorted and sorted.] The last step is to sort the features by their weights and select the features with the highest ranks.

RELIEF Advantages: fast; easy to implement. Disadvantages: does not filter out redundant features, so features with very similar values could be selected; not robust to outliers; classic RELIEF can only handle data sets with two classes.

Extension of RELIEF: RELIEF-F Extension for multi-class problems. Instead of finding one near miss, the algorithm finds one near miss for each different class and averages their contributions to the weight update.

RELIEF-F
%input:  X (two or more classes C)
%output: W (weights assigned to variables)
nr_var  = total number of variables;
weights = zero vector of size nr_var;
for all x in X do
    hit(x) = nnb of x from the same class;
    sum_miss = 0;
    for all c in C, c != class(x) do
        miss(x, c) = nnb of x from class c;
        sum_miss  += abs(x - miss(x, c)) / nr_examples(c);
    end;
    weights += sum_miss - abs(x - hit(x));
end;
nr_ex = number of examples in X;
return W = weights / nr_ex;
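
And a runnable sketch of RELIEF-F mirroring the pseudocode above (NumPy only; Euclidean distances and the function name are our choices, not part of the lecture):

import numpy as np

def relieff(X, y):
    """RELIEF-F weights for a data set with two or more classes (X assumed normalized)."""
    n, p = X.shape
    classes = np.unique(y)
    weights = np.zeros(p)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                              # never pick the example itself
        same = np.where(y == y[i])[0]
        hit = same[np.argmin(dists[same])]             # nearest neighbour from the same class
        sum_miss = np.zeros(p)
        for c in classes:
            if c == y[i]:
                continue
            other = np.where(y == c)[0]
            miss = other[np.argmin(dists[other])]      # nearest neighbour from class c
            sum_miss += np.abs(X[i] - X[miss]) / len(other)
        weights += sum_miss - np.abs(X[i] - X[hit])
    return weights / n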