
1 Stable Feature Selection for Biomarker Discovery. Name: Goutham Reddy Bakaram; Student Id: 20299604; Instructor: Dr. Dongchul Kim. Review article by Zengyou He and Weichuan Yu, School of Software, Dalian University of Technology, China

2 Contents
1. Abstract
2. Introduction
3. Causes of Instability
4. Existing Methods
5. Stability Measure
6. Analysis
7. Conclusions
8. Future Work

3 Abstract
1. The article reviews existing stable feature selection methods for biomarker discovery using a generic hierarchical framework.
2. Two objectives:
- Provide an overview of this young but fast-growing topic for convenient reference.
- Categorize existing methods within the framework to guide future research and development.
Keywords: Feature selection; Biomarker discovery; Stability; Machine learning

4 Introduction
What is feature selection? In machine learning, feature selection is the process of selecting a subset of relevant features for use in model construction. It is used for three reasons:
1. Simplification of models
2. Shorter training times
3. Enhanced generalization by reducing overfitting
The discovery of biomarkers from high-throughput "omics" data is typically modeled as selecting the most discriminative features (or variables) for classification (e.g., discriminating healthy versus diseased samples, or different tumor stages).
The stability issue in feature selection has received much attention recently. In this article, we review existing methods for stable feature selection in biomarker discovery applications, summarize them within a unified framework, and provide a convenient reference for future research and development.

5 Causes of Instability
There are mainly three sources of instability in biomarker discovery:
1. Algorithm design without considering stability: classic feature selection methods aim at selecting a minimum subset of features that yields a classifier with the best predictive accuracy, without regard for the stability of the selected subset.
2. The existence of multiple sets of true markers: real data may contain multiple sets of potential true markers. On the one hand, when there are many highly correlated features, different ones may be selected under different settings. On the other hand, even if there are no redundant features, multiple non-correlated sets of real markers may still exist.
3. Small number of samples in high-dimensional data: in the analysis of gene expression and proteomics data, there are typically only hundreds of samples but thousands of features.

6 Existing Methods
1. The ensemble feature selection method and the method using prior feature relevance incorporate stability considerations into the algorithm design stage.
2. To handle data with highly correlated features, the group feature selection approach treats a feature cluster as the basic unit of the selection process to increase robustness.
3. The sample injection method tries to increase the sample size to address the small-sample-size versus large-feature-size issue.
Fig. 1: A hierarchical framework for stable feature selection methods.

7 Ensemble Feature Selection
1. Ensemble learning methods combine multiple learned models under the assumption that "two (or more) heads are better than one".
2. Ensemble feature selection techniques use a two-step procedure similar to ensemble classification and regression:
- Create a set of different feature selectors.
- Aggregate the results of the component selectors to generate the ensemble output.
3. The second step is typically modeled as a rank aggregation problem. Rank aggregation combines multiple rankings into a consensus ranking and has been widely studied in the context of web search engines.
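The rank aggregation step can be illustrated with a minimal Borda-count sketch. This is only one of several possible aggregation methods, and the feature names (`g1`, `g2`, `g3`) are invented for illustration:

```python
# Hypothetical rank aggregation via Borda count: each component selector's
# ranking awards (m - position) points to every feature, and features are
# re-ranked by their total score across all component selectors.
def borda_aggregate(rankings):
    """rankings: list of lists, each a full ranking of feature names
    (best first). Returns the consensus ranking (best first)."""
    m = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            scores[feature] = scores.get(feature, 0) + (m - position)
    # Sort by descending total score; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))

consensus = borda_aggregate([
    ["g1", "g2", "g3"],
    ["g2", "g1", "g3"],
    ["g1", "g3", "g2"],
])
# g1 scores 3+2+3 = 8, g2 scores 2+3+1 = 6, g3 scores 1+1+2 = 4,
# so the consensus ranking is ["g1", "g2", "g3"].
```

More elaborate aggregators (weighted voting, Markov-chain methods) follow the same interface: many rankings in, one consensus ranking out.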

8 Data Perturbation
In ensemble feature selection with data perturbation, different samplings of the original data are generated to construct different feature selectors, as described in Fig. 2. These methods can be further distinguished by the sampling method, the component feature selection algorithm, and the rank aggregation method (Table 1).
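A minimal sketch of data perturbation, assuming bootstrap resampling and a deliberately simple univariate component selector (absolute difference of class means); real pipelines would plug in t-statistics, SVM-RFE, or similar, and any of the aggregation methods surveyed in the article:

```python
import random

def component_rank(X, y):
    """Rank feature indices by |mean(class 1) - mean(class 0)|, best first.
    A stand-in component selector for illustration only."""
    m = len(X[0])
    scores = []
    for j in range(m):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(m), key=lambda j: -scores[j])

def ensemble_select(X, y, n_bootstraps=20, seed=0):
    """Run the component selector on bootstrap samples of (X, y) and
    aggregate the resulting rankings by mean rank position."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    rank_sums = [0] * m
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]
        # Guard against a bootstrap sample missing one class entirely.
        if len({y[i] for i in idx}) < 2:
            continue
        ranking = component_rank([X[i] for i in idx], [y[i] for i in idx])
        for position, j in enumerate(ranking):
            rank_sums[j] += position
    return sorted(range(m), key=lambda j: rank_sums[j])
```

The stabilizing effect comes from averaging: a feature that ranks highly only on one particular sampling of the data is pushed down in the aggregated ranking.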

10 Function Perturbation
1. Here we use function perturbation to refer to those ensemble feature selection methods in which the component learners differ from one another. The basic idea is to capitalize on the strengths of different algorithms to obtain robust feature subsets.
2. Function perturbation differs from data perturbation in two respects:
- It uses different feature selection algorithms rather than a single one.
- It typically conducts feature selection on the original data (without sampling).

11 Feature Selection with Prior Feature Relevance
In most biomarker discovery applications, we typically assume that all features are equally relevant before the selection procedure. In practice, some prior knowledge may be available to bias the selection towards features assumed to be more relevant. Fig. 3 shows several methods for obtaining such prior knowledge; one feasible method is to seek advice from domain experts or relevant publications.

12 Group Feature Selection
1. One motivation for group feature selection is that groups of correlated features commonly exist in high-dimensional data, and such groups are resistant to variations in the training samples.
2. Group formation is the process of identifying groups of associated features. There are typically two classes of methods for this purpose: knowledge-driven methods and data-driven methods.
3. Group transformation generates a single coherent representation for each feature group. The transformation method can range from simple approaches, such as taking the mean of the feature values, to complicated methods such as principal component analysis.
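The two steps above can be sketched with a toy data-driven example, assuming a simple greedy grouping of features whose pairwise Pearson correlation exceeds a threshold, followed by the feature-value mean as the group representative (the article also covers knowledge-driven grouping and PCA-style transformations):

```python
def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def form_groups(columns, threshold=0.9):
    """Group formation: greedily merge a feature into the first group that
    contains a member correlated with it above the threshold."""
    groups = []
    for j, col in enumerate(columns):
        for group in groups:
            if any(abs(pearson(col, columns[k])) > threshold for k in group):
                group.append(j)
                break
        else:
            groups.append([j])
    return groups

def transform_groups(columns, groups):
    """Group transformation: replace each group by the per-sample mean
    of its member features."""
    n = len(columns[0])
    return [[sum(columns[j][i] for j in group) / len(group) for i in range(n)]
            for group in groups]
```

For example, with `cols = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]` the first two (perfectly correlated) features are merged into one group and the third stands alone; the selection algorithm then operates on group representatives, which vary less across training samples than individual features do.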

13 Sample Injection
1. In biomarker discovery applications, the number of features is typically larger than the sample size. This is one of the main sources of instability in feature selection. To increase the reproducibility of feature selection, one natural idea is to generate more samples.
2. However, collecting real sample data from patients and healthy people is usually expensive and time-consuming. With this practical limitation in mind, researchers have begun to seek alternative methods for the same purpose.

14 Stability Measure
1. How do we measure the "stability" of feature selection algorithms? Two approaches are reviewed, motivated by two roles: on the one hand, a stability measure is indispensable for comparing the performance of different algorithms; on the other hand, it can be used for internal validation within feature selection algorithms that take stability into account.
2. Measuring stability requires a similarity measure for feature selection results, which depends on the representation used by the algorithm. Formally, let the training examples be described by a feature vector F = (f1, f2, …, fm); then there are three types of representations:
- A subset of features: S = {s1, s2, …, sk}, si ∈ {f1, f2, …, fm}.
- A ranking vector: R = (r1, r2, …, rm), 1 ≤ ri ≤ m.
- A weighting-score vector: W = (w1, w2, …, wm), wi ∈ R+.
3. For the sake of notational simplicity, only stability measures that compare two selection results are considered.

15 Feature Subset
Most subset-based measures are defined using the set-theoretic properties of only two subsets:
MS1: The relative Hamming distance between the binary masks corresponding to the two subsets.
MS2: The Tanimoto distance, which measures the amount of overlap between two sets of arbitrary cardinality.
MS3: The Dice–Sorensen index, the harmonic mean of |S ∩ S′|/|S| and |S ∩ S′|/|S′|.
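The three subset measures can be sketched directly on Python sets, assuming S and S′ are given as sets of feature indices drawn from m total features (m is needed only for the Hamming-based MS1, whose distance is converted to a similarity here):

```python
def ms1_hamming(s1, s2, m):
    """MS1 as a similarity: 1 minus the relative Hamming distance between
    the length-m binary masks of the two subsets."""
    return 1 - len(s1 ^ s2) / m  # s1 ^ s2 is the symmetric difference

def ms2_tanimoto(s1, s2):
    """MS2, the Tanimoto (Jaccard) overlap: |S ∩ S′| / |S ∪ S′|."""
    return len(s1 & s2) / len(s1 | s2)

def ms3_dice(s1, s2):
    """MS3, the Dice–Sorensen index: 2|S ∩ S′| / (|S| + |S′|),
    i.e. the harmonic mean of the two per-set overlap fractions."""
    return 2 * len(s1 & s2) / (len(s1) + len(s2))
```

For S = {1, 2, 3} and S′ = {2, 3, 4} with m = 5, MS1 gives 0.6, MS2 gives 0.5, and MS3 gives 2/3; all three reach 1 exactly when the subsets coincide.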

17 Ranking List
1. The problem of comparing ranking lists is widely studied in contexts such as voting theory and web document retrieval.
2. One typical example is MR2, in which the distance is weighted to place more emphasis on top-ranked features, since these matter more than irrelevant features in the stability evaluation.
MR1: Spearman's rank correlation coefficient takes values in [−1, 1], with 1 meaning that the two ranking lists are identical and −1 meaning that they are exactly inverse orders.
MR2: The Canberra distance, a weighted version of Spearman's footrule distance.
MR3: The overlap score, originally proposed in the literature and reformulated here under the assumption that only the top-ranked features are important.
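MR1 can be sketched from its closed form. Here the inputs are full rank vectors (the rank assigned to each of the m features, not ordered lists of feature names):

```python
def spearman(r1, r2):
    """Spearman's rank correlation between two full rank vectors over the
    same m features: 1 - 6 * sum(d_i^2) / (m * (m^2 - 1))."""
    m = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (m * (m ** 2 - 1))

identical = spearman([1, 2, 3, 4], [1, 2, 3, 4])   # 1.0: same ranking
inverse = spearman([1, 2, 3, 4], [4, 3, 2, 1])     # -1.0: exactly reversed
```

MR2 and MR3 modify this unweighted picture by down-weighting disagreements among low-ranked (irrelevant) features, which is what matters in stability evaluation.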

19 Weighting-score Vector
The computational burden of combinatorial search over feature subsets can to some extent be alleviated by a feature weighting strategy: allowing feature weights to take real values instead of binary ones lets us use well-established optimization techniques in algorithm development. The weighting-score vector is seldom used in defining stability measures; Table 6 lists one such measure, MW1.
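Since Table 6 is not reproduced in this transcript, the following is only an assumed sketch of a weighting-based measure: Pearson's correlation between two weighting-score vectors, the usual choice for comparing real-valued weights (whether this is exactly the article's MW1 cannot be confirmed from the slides alone):

```python
def mw1_pearson(w1, w2):
    """Assumed MW1: Pearson's correlation between two equal-length
    weighting-score vectors; 1 means proportional weights, -1 anti-correlated."""
    n = len(w1)
    m1, m2 = sum(w1) / n, sum(w2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(w1, w2))
    v1 = sum((a - m1) ** 2 for a in w1) ** 0.5
    v2 = sum((b - m2) ** 2 for b in w2) ** 0.5
    return cov / (v1 * v2)
```

Note that correlation is scale-invariant: two runs that assign proportional weights to every feature score a perfect 1 even if the absolute magnitudes differ.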

20 Analysis
1. Of the three sources of instability in feature selection, the small number of samples in a high-dimensional feature space is the most difficult to address in biomarker discovery.
2. Group feature selection is the most widely used among existing stable feature selection approaches, because high-dimensional spaces contain many correlated features. However, it has a limitation, discussed under Future Work: it helps only when the multiple sets of true markers arise from redundant (correlated) features.
3. A promising solution is a hybrid strategy that combines group feature selection with ensemble feature selection.

21 Future Work
The group feature selection strategy is only helpful when multiple sets of true markers arise from the existence of redundant features. However, it is also possible that multiple sets of true markers share no correlated features. The feature selection problem in this case is much harder than finding a minimal optimal feature set for classification. To our knowledge, there is still no method or measure that aims at handling stability issues in this context; the general problem is open and needs more research effort. Open questions include:
- How can we directly measure the stability of features without sampling the training data?
- Can we propose new methods that explicitly control the stability of the reported feature subset?
- Are there other special requirements for biomarker discovery beyond stability?

22 Conclusion
This paper summarizes existing stable feature selection methods and stability measures. From both theoretical and practical perspectives, stable feature selection is an important problem, and more effort should be devoted to this topic.


