1 Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution
Presented by Jingting Zeng, 11/26/2007

2 Outline
Introduction to Feature Selection
Feature Selection Models
Fast Correlation-Based Filter (FCBF) Algorithm
Experiment
Discussion
Reference

3 Introduction to Feature Selection
Definition: a process that chooses an optimal subset of features according to an objective function
Objectives:
To reduce dimensionality and remove noise
To improve mining performance: speed of learning, predictive accuracy, and simplicity and comprehensibility of mined results

4 An Example of an Optimal Subset
Data set (whole set): five Boolean features, with
C = F1 ∨ F2, F3 = ¬F2, F5 = ¬F4
Optimal subset: {F1, F2} or {F1, F3}
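The claim on this slide can be checked by brute force. Here is a minimal Python sketch (the feature names F1–F5 follow the slide; the helper name `determines` is mine) that enumerates the truth table and verifies which subsets determine C:

```python
from itertools import product

# Enumerate all assignments of the slide's five Boolean features:
# C = F1 or F2, with F3 = not F2 and F5 = not F4 derived.
rows = []
for f1, f2, f4 in product([0, 1], repeat=3):
    f3, f5 = 1 - f2, 1 - f4
    rows.append(((f1, f2, f3, f4, f5), f1 | f2))

def determines(indices):
    # A subset of feature indices determines C if no two rows agree
    # on those features while disagreeing on the class.
    seen = {}
    return all(seen.setdefault(tuple(feats[i] for i in indices), c) == c
               for feats, c in rows)

print(determines((0, 1)))  # {F1, F2}: True
print(determines((0, 2)))  # {F1, F3}: True, since F3 = not F2
print(determines((0,)))    # {F1} alone: False
```

Note that {F1, F3} works only because F3 carries the same information as F2, which is exactly the redundancy FCBF is designed to detect.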

5 Models of Feature Selection
Filter model:
Separates feature selection from classifier learning
Relies on general characteristics of the data (information, distance, dependence, consistency)
No bias toward any learning algorithm; fast
Wrapper model:
Relies on a predetermined classification algorithm
Uses predictive accuracy as the goodness measure
High accuracy, but computationally expensive

6 Filter Model

7 Wrapper Model

8 Two Aspects of Feature Selection
How to decide whether a feature is relevant to the class
How to decide whether a relevant feature is redundant given the other features

9 Linear Correlation Coefficient
For a pair of variables (x, y), the sample correlation is
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
However, it cannot capture non-linear correlations
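To make this limitation concrete, here is a small Python sketch of the coefficient above (the data values are illustrative). A perfectly quadratic dependence over a symmetric range yields r = 0 despite complete determinism:

```python
import math

def pearson(xs, ys):
    # Sample linear correlation coefficient r for paired observations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-2, -1, 0, 1, 2]
print(pearson(xs, [2 * x + 1 for x in xs]))  # exactly linear: r = 1
print(pearson(xs, [x * x for x in xs]))      # y = x^2: r = 0, yet y is fully determined by x
```

This is why the talk moves to information-theoretic measures next.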

10 Information Measures
Entropy of variable X: H(X) = -\sum_i P(x_i) \log_2 P(x_i)
Entropy of X after observing Y: H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)
Information gain: IG(X|Y) = H(X) - H(X|Y)
Symmetrical uncertainty: SU(X, Y) = 2\,\frac{IG(X|Y)}{H(X) + H(Y)}
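The measures above translate directly into code for discrete variables. A minimal Python sketch (the function names are mine, not from the paper; probabilities are estimated from value frequencies):

```python
import math
from collections import Counter

def entropy(xs):
    # H(X) = -sum_i P(x_i) log2 P(x_i)
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def cond_entropy(xs, ys):
    # H(X|Y) = sum_j P(y_j) * H(X | Y = y_j)
    n = len(ys)
    return sum((cy / n) * entropy([x for x, yy in zip(xs, ys) if yy == y])
               for y, cy in Counter(ys).items())

def symmetrical_uncertainty(xs, ys):
    # SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), normalized to [0, 1].
    ig = entropy(xs) - cond_entropy(xs, ys)
    denom = entropy(xs) + entropy(ys)
    return 2 * ig / denom if denom else 0.0

x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, x))             # 1.0: identical variables
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # 0.0: no shared information here
```

SU compensates for information gain's bias toward features with more values, which is why FCBF uses it as its correlation measure.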

11 Fast Correlation-Based Filter (FCBF) Algorithm
How to decide whether a feature is relevant to the class C: find a subset S' such that every feature F_i in S' satisfies SU_{i,c} \ge \delta, for a relevance threshold \delta
How to decide whether such a relevant feature is redundant: use the correlation between each feature and the class as a reference

12 Definitions
Predominant correlation: the correlation SU_{i,c} between a feature F_i and the class C is predominant iff SU_{i,c} \ge \delta and there is no F_j (j \ne i) such that SU_{j,i} \ge SU_{i,c}
Redundant peer (RP): if SU_{j,i} \ge SU_{i,c}, then F_j is an RP of F_i
Use S_{P_i} to denote the set of RPs for F_i

13 [Figure: correlations between a feature F_i, its peers, and the class C]

14 Three Heuristics
(Here S_{P_i}^+ and S_{P_i}^- denote the RPs of F_i whose class correlation SU_{j,c} is greater, respectively not greater, than SU_{i,c}.)
If S_{P_i}^+ = \emptyset, treat F_i as a predominant feature, remove all features in S_{P_i}^-, and skip identifying redundant peers for them
If S_{P_i}^+ \ne \emptyset, process all features in S_{P_i}^+ first; if none of them turns out to be predominant, follow the first heuristic
The feature with the largest SU_{i,c} value is always a predominant feature and can be a starting point for removing other features
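Putting the definitions and heuristics together, here is an illustrative Python sketch of the FCBF selection loop (simplified relative to the paper's pseudocode: it recomputes SU values instead of maintaining an ordered list, and `su`/`fcbf` are my names, not the paper's):

```python
import math
from collections import Counter

def su(a, b):
    # Symmetrical uncertainty between two discrete value sequences.
    def h(xs):
        n = len(xs)
        return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())
    def h_cond(xs, ys):
        n = len(ys)
        return sum((cy / n) * h([x for x, yy in zip(xs, ys) if yy == y])
                   for y, cy in Counter(ys).items())
    denom = h(a) + h(b)
    return 2 * (h(a) - h_cond(a, b)) / denom if denom else 0.0

def fcbf(features, target, delta=0.0):
    # features: dict name -> list of discrete values; target: class labels.
    # Relevance step: keep features with SU(F, C) > delta, most relevant first.
    relevant = sorted(((su(f, target), name) for name, f in features.items()
                       if su(f, target) > delta), reverse=True)
    # Redundancy step: starting from the most relevant feature, drop any
    # remaining F_j whose correlation with a kept F_i is at least SU(F_j, C).
    selected, remaining = [], [name for _, name in relevant]
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        remaining = [n for n in remaining
                     if su(features[n], features[best]) < su(features[n], target)]
    return selected
```

On the Boolean example from slide 4, this sketch keeps one of the two optimal subsets ({F1, F3} with this tie-breaking): F4 and F5 are filtered as irrelevant, and F2/F3 eliminate each other as redundant peers.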

15 [Figure: identifying the predominant feature among F_i's peers with respect to the class C]

16 FCBF Algorithm
Time complexity: O(N)

17 FCBF Algorithm (cont.)
Time complexity: O(N log N)

18 Experiments
FCBF is compared to ReliefF, CorrSF, and ConsSF
[Table: summary of the 10 data sets]

19 Results

20 Results (cont.)

21 Pros and Cons
Advantages:
Very fast
Selects fewer features with higher accuracy
Disadvantage:
Cannot detect some features: on data with 4 features generated by 4 Gaussian functions plus 4 additional redundant features, FCBF selected only 3 features

22 Discussion
FCBF compares only individual features with each other
A possible extension: use PCA to capture groups of features, then apply FCBF to the result

23 References
L. Yu and H. Liu. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856–863, 2003.
J. Biesiada and W. Duch. Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter Solution. In Proc. CORES'05, Advances in Soft Computing, Springer, 2005.
www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt

24 Thank you! Q and A

