Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University.


1 Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University

2 Outline
- Introduction to feature selection
  - Motivation
  - Problem statement
  - Key research issues
- Application in genomic data analysis
  - Overview of data mining for microarray data
  - Gene selection
  - A case study
- Current research directions

3 Motivation
An active field in:
- Pattern recognition
- Machine learning
- Data mining
- Statistics

Data are M instances over N features F1..FN with class C:

        F1   F2   ...  FN  | C
  I1    f11  f12  ...  f1N | c1
  I2    f21  f22  ...  f2N | c2
  ...
  IM    fM1  fM2  ...  fMN | cM

Goodness of feature selection:
- Reducing dimensionality
- Improving learning efficiency
- Increasing predictive accuracy
- Reducing complexity of learned results

4 Problem Statement
- A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
- Selecting a minimum subset G such that P(C | G) is equal or as close as possible to P(C | F) (Koller and Sahami, 1996)

5 An Example for the Problem
Data set: five Boolean features, with
  C = F1 ∨ F2,  F3 = ¬F2,  F5 = ¬F4

  F1  F2  F3  F4  F5 | C
   0   0   1   0   1 | 0
   0   1   0   0   1 | 1
   1   0   1   0   1 | 1
   1   1   0   0   1 | 1
   0   0   1   1   0 | 0
   0   1   0   1   0 | 1
   1   0   1   1   0 | 1
   1   1   0   1   0 | 1

Optimal subset: {F1, F2} or {F1, F3}
Illustrates the combinatorial nature of searching for an optimal subset.
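The combinatorial search can be made concrete by exhaustively checking subsets of the slide's truth table for consistency, in the sense of Almuallim and Dietterich's definition on the previous slide. A minimal sketch:

```python
from itertools import combinations

# The slide's data set: columns F1..F5 and class C,
# with C = F1 or F2, F3 = not F2, F5 = not F4.
ROWS = [
    (0, 0, 1, 0, 1, 0),
    (0, 1, 0, 0, 1, 1),
    (1, 0, 1, 0, 1, 1),
    (1, 1, 0, 0, 1, 1),
    (0, 0, 1, 1, 0, 0),
    (0, 1, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 1),
    (1, 1, 0, 1, 0, 1),
]

def consistent(subset):
    """True if no two rows agree on `subset`'s features but differ in class."""
    seen = {}
    for row in ROWS:
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, row[5]) != row[5]:
            return False
    return True

def minimal_subsets():
    """Smallest consistent subsets, found by exhaustive search over all sizes."""
    for k in range(1, 6):
        found = [s for s in combinations(range(5), k) if consistent(s)]
        if found:
            return found
    return []

print(minimal_subsets())  # [(0, 1), (0, 2)], i.e. {F1, F2} or {F1, F3}
```

The exhaustive loop over `combinations` is exactly what makes optimal subset selection expensive: with N features there are 2^N candidate subsets.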

6 Subset Search
An example of the search space (Kohavi and John, 1997). (Figure omitted.)

7 Evaluation Measures
Wrapper model:
- Relies on a predetermined classification algorithm
- Uses predictive accuracy as the goodness measure
- High accuracy, but computationally expensive
Filter model:
- Separates feature selection from classifier learning
- Relies on general characteristics of the data (distance, correlation, consistency)
- No bias toward any learning algorithm; fast

8 A Framework for Algorithms
Original set → Subset Generation → candidate subset → Subset Evaluation → current best subset → Stopping Criterion.
If the criterion is not met (No), generation continues from the current best subset; once it is met (Yes), the current best subset is returned as the selected subset.
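One common instance of this generate/evaluate/stop loop is sequential forward selection. A minimal sketch with a pluggable goodness function: plugging in predictive accuracy of a classifier gives a wrapper, plugging in a data-characteristic measure gives a filter. The `toy_goodness` measure below is invented purely for illustration.

```python
def forward_select(features, evaluate):
    """The slide's loop: generate candidate subsets from the current best,
    evaluate each, and stop when no candidate improves the goodness score."""
    best, best_score = [], evaluate([])
    while True:
        candidates = [best + [f] for f in features if f not in best]
        if not candidates:
            break                      # original set exhausted
        cand = max(candidates, key=evaluate)
        cand_score = evaluate(cand)
        if cand_score <= best_score:
            break                      # stopping criterion: no improvement
        best, best_score = cand, cand_score
    return best                        # selected subset

# Toy goodness measure (invented): reward covering {"F1", "F2"}, penalize size.
toy_goodness = lambda s: len(set(s) & {"F1", "F2"}) - 0.1 * len(s)
print(forward_select(["F1", "F2", "F3"], toy_goodness))  # ['F1', 'F2']
```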

9 Feature Ranking
Weights and ranks individual features, then selects the top-ranked ones.
Advantages:
- Efficient: O(N) in terms of dimensionality N
- Easy to implement
Disadvantages:
- Hard to determine the threshold (how many features to keep)
- Unable to consider correlation between features
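Ranking scores each feature independently and keeps the top ones, which is what makes it linear in the number of features. A minimal sketch, with a made-up mean-difference score standing in for a real relevance measure:

```python
def rank_features(data, labels, score):
    """Feature ranking: score each feature individually (one pass per
    feature, hence linear in the number of features), then sort by score."""
    n = len(data[0])
    scored = [(score([row[j] for row in data], labels), j) for j in range(n)]
    return [j for s, j in sorted(scored, reverse=True)]

# Made-up relevance score: absolute difference of the feature's class means.
def mean_diff(col, labels):
    pos = [v for v, c in zip(col, labels) if c == 1]
    neg = [v for v, c in zip(col, labels) if c == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

data = [[1, 0, 5], [1, 1, 6], [0, 0, 5], [0, 1, 6]]   # rows are samples
labels = [1, 1, 0, 0]
print(rank_features(data, labels, mean_diff))  # feature 0 ranks first
```

Note that features 1 and 2 each get score 0 here even though no pair is considered jointly: that is exactly the "unable to consider correlation between features" limitation from the slide.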

10 Applications of Feature Selection
- Text categorization: Yang and Pedersen, 1997 (CMU); Forman, 2003 (HP Labs)
- Image retrieval: Swets and Weng, 1995 (MSU); Dy et al., 2003 (Purdue University)
- Gene expression microarray data analysis: Xing et al., 2001 (UC Berkeley); Lee et al., 2003 (Texas A&M)
- Customer relationship management: Ng and Liu, 2000 (NUS)
- Intrusion detection: Lee et al., 2000 (Columbia University)

11 Microarray Technology
- Enables measuring the expression levels of thousands or tens of thousands of genes simultaneously in a single experiment
- Provides new opportunities and challenges for data mining

  Gene        Value
  M23197_at     261
  U66497_at      88
  M92287_at    4778
  ...           ...

12 Two Ways to View Microarray Data
Gene view: rows are genes, columns are samples.
Sample view: rows are samples, columns are genes, plus a class label (ALL or AML):

  Sample     M23197_at  U66497_at  M92287_at  ...  Class
  Sample 1      261         88        4778    ...  ...
  Sample 2      101         74        2700    ...  ...
  Sample 3      450         41        4983    ...  ...

13 Data Mining Tasks
Data points can be either samples or genes:
- Classification (data points are samples): building a classifier to predict the classes of new samples
- Clustering (data points are samples): grouping similar samples together to find classes or subclasses
- Clustering (data points are genes): grouping similar genes together to find co-regulated genes

14 Gene Selection
Data characteristics in sample classification:
- High dimensionality (thousands of genes)
- Small sample size (often fewer than 100 samples)
Problems:
- Curse of dimensionality
- Overfitting the training data
Traditional gene selection methods:
- Within the filter model
- Gene ranking

15 A Case Study (Golub et al., 1999)
Leukemia data: 7129 genes, 72 samples
- Training set: 38 samples (27 ALL, 11 AML)
- Test set: 34 samples (20 ALL, 14 AML)
Preprocessing and selection:
- Normalization of each gene to mean 0 and standard deviation 1
- Genes ranked by a correlation measure with the class
(Figure: heatmap of normalized expression on a -3 to 3 scale for ALL vs. AML samples.)
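The correlation measure on this slide is commonly described, for Golub et al.'s study, as a signal-to-noise statistic, (mu_ALL - mu_AML) / (sigma_ALL + sigma_AML), computed per gene. A sketch with the slide's mean-0 / std-1 normalization; the expression values below are made up for illustration:

```python
import math

def normalize(values):
    """Rescale one gene's expression values to mean 0, standard deviation 1."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

def signal_to_noise(values, labels):
    """Signal-to-noise correlation between one gene and the class:
    (mu_ALL - mu_AML) / (sigma_ALL + sigma_AML)."""
    a = [v for v, c in zip(values, labels) if c == "ALL"]
    b = [v for v, c in zip(values, labels) if c == "AML"]
    mu_a, mu_b = sum(a) / len(a), sum(b) / len(b)
    sd = lambda xs, mu: math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return (mu_a - mu_b) / (sd(a, mu_a) + sd(b, mu_b))

# Made-up expression values for a single gene across four samples.
score = signal_to_noise([2.0, 1.8, 0.2, 0.1], ["ALL", "ALL", "AML", "AML"])
print(round(score, 2))  # a large positive score: strongly ALL-correlated
```

Ranking all genes by |score| and keeping the top k is the gene-ranking scheme from slide 14; note that k still has to be chosen by hand, which is the first limitation listed on the next slide.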

16 Case Study (continued)
Performance of selected genes:
- Accuracy on the training set: 36 out of 38 (94.74%) correctly classified
- Accuracy on the test set: 29 out of 34 (85.29%) correctly classified
Limitations:
- Domain knowledge is required to determine the number of genes to select
- Unable to remove redundant genes

17 Feature/Gene Redundancy
- Examining redundant genes: two heads are not necessarily better than one
- The effects of redundant genes, and how to handle redundancy, remain a challenge
Recent work:
- MRMR (Minimum Redundancy Maximum Relevance) (Ding and Peng, CSB-2003)
- FCBF (Fast Correlation-Based Filter) (Yu and Liu, ICML-2003)
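A rough sketch of the FCBF idea: rank features by relevance to the class, then drop any feature that correlates more strongly with an already-kept feature than with the class. This simplification uses Pearson correlation and made-up data; the actual FCBF algorithm of Yu and Liu uses symmetrical uncertainty and an approximate Markov-blanket criterion.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def remove_redundant(features, labels):
    """Simplified FCBF-style filtering: rank features by |correlation with
    the class|, then drop a feature if some already-kept feature correlates
    with it more strongly than it correlates with the class."""
    relevance = {f: abs(pearson(v, labels)) for f, v in features.items()}
    kept = []
    for f in sorted(features, key=relevance.get, reverse=True):
        if all(abs(pearson(features[f], features[g])) < relevance[f] for g in kept):
            kept.append(f)
    return kept

# Made-up data: g2 is nearly a copy of g1; g3 carries different information.
features = {"g1": [1, 2, 3, 4], "g2": [1.1, 2.0, 3.1, 4.0], "g3": [1, 0, 1, 1]}
labels = [0, 0, 1, 1]
print(remove_redundant(features, labels))  # one of the near-duplicates is dropped
```

Unlike pure ranking (slide 9), the pairwise check lets one of two near-duplicate features be discarded, which is exactly the redundancy that gene ranking cannot remove.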

18 Research Directions
- Feature selection for unlabeled data: shares the core issues of the labeled case, but feature goodness must be evaluated without class labels
- Dealing with different data types: nominal, discrete, continuous; discretization
- Dealing with large-size data
- Comparative study and intelligent selection of feature selection methods

19 References
- G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. ICML, 1994.
- L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML, 2003.
- T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999.
- C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. CSB, 2003.
- J. Shavlik and D. Page. Machine learning and genetic microarrays. ICML 2003 tutorial. http://www.cs.wisc.edu/~dpage/ICML-2003-Tutorial-Shavlik-Page.ppt

