
1 Effective Feature Selection Framework for Cluster Analysis of Microarray Data
Gouchol Pok, Computer Science Dept., Yanbian University, China
Keun Ho Ryu, DB/Bioinformatics Lab, Chungbuk Nat'l University, Korea

2 Outline
- Background
- Motivation
- Proposed Method
- Experiments
- Conclusion

3 Feature Selection
- Definition: the process of selecting a subset of relevant features for building robust learning models
- Objectives:
  - Alleviating the effect of the curse of dimensionality
  - Enhancing generalization capability
  - Speeding up the learning process
  - Improving model interpretability
(from Wikipedia: http://en.wikipedia.org/wiki/Feature_selection)

4 Issues in Feature Selection
- How to compute the degree to which a feature is relevant to the class (discrimination), i.e., removal of irrelevancy
- How to decide whether a selected feature is redundant with other features (strongly correlated), i.e., removal of redundancy
- How to select features so that classifying power is not diminished (or is even increased), i.e., maintaining class-discriminating power

5 Selection Modes
- Univariate method: considers one feature at a time, ranked by a score; typical measures are correlation, information measures, the K-S statistic, etc. (a minimal scoring sketch follows this slide)
- Multivariate method: considers subsets of features together, e.g., Bayesian and PCA-based selection; in principle more powerful than the univariate method, but not always in practice (Guyon 2008)
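To make the univariate mode concrete, here is a minimal sketch (not from the slides) that ranks genes by absolute Pearson correlation with a binary class label; the function name, the choice of correlation as the score, and the default k are illustrative assumptions.

import numpy as np

def rank_genes_by_correlation(X, y, k=50):
    # X: (n_genes, n_samples) expression matrix; y: (n_samples,) binary labels.
    # Returns the indices of the k genes with the highest |Pearson correlation|.
    # Correlation is just one of the univariate scores the slide lists.
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    scores = np.abs(Xc @ yc) / np.maximum(denom, 1e-12)  # guard against zero variance
    return np.argsort(scores)[::-1][:k]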

6 Hard Case for the Univariate Method (Guyon 2008*)
*Adapted from Guyon's tutorial at the IPAM summer school

7 Proposed Method: Motivation
- A method that fits 2-D microarray data in its typical form: thousands of genes (rows) and hundreds of samples (columns)
- A multivariate approach: feature relevancy and redundancy are addressed simultaneously

8 System Flow
[figure: N x M data matrix, genes as rows and samples as columns]

9 System Flow (cont.)

10 Methods: Step 1
- Perform a column-based difference operation: D_i(N, M) = C(N, M) - C_i(N, 1), i = 1, 2, …, M, where each column of C is compared against the reference column C_i (sketched below)
- The difference operator may depend on the application, e.g., Euclidean or Manhattan distance
- D_i(N, M) contains class-specific information w.r.t. each gene
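A minimal NumPy sketch of Step 1, assuming the difference operator is an elementwise absolute (Manhattan-style) difference; the slide leaves the operator application-dependent, so that choice and the function name are assumptions.

import numpy as np

def column_differences(C):
    # C: (N, M) matrix, N genes x M samples.
    # Returns a list of M matrices D_i = |C - C_i|; elementwise absolute
    # difference is an assumption, as the slide allows other operators.
    N, M = C.shape
    return [np.abs(C - C[:, [i]]) for i in range(M)]  # C[:, [i]] keeps shape (N, 1)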

11 Methods: Step 2
- Apply thresholds to find a kind of "emerging pattern" that contrasts the two classes (one possible implementation is sketched below)
- Suppose samples 1, 2, …, j ∈ C1 and samples j+1, j+2, …, M ∈ C2
- Sort the values in each column of D_i(N, M): apply a 25%-threshold to the same-class differences and a 75%-threshold to the different-class differences
[figure: per-column sorted differences for C1 and C2 with the 25% and 75% cutoffs marked]
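The slide leaves the exact rule implicit; the sketch below encodes one plausible reading: within each column of D_i, same-class differences at or below that column's 25th percentile and different-class differences at or above its 75th percentile are marked 1. The names and the percentile convention are assumptions, not the authors' stated rule.

import numpy as np

def binarize(D_i, same_class_mask):
    # D_i: (N, M) differences w.r.t. reference sample i.
    # same_class_mask: (M,) bool, True where column j shares sample i's class.
    # One plausible reading of the slide's 25%/75% thresholding, not the exact rule.
    B = np.zeros_like(D_i, dtype=int)
    q25 = np.percentile(D_i, 25, axis=0)  # per-column 25% threshold
    q75 = np.percentile(D_i, 75, axis=0)  # per-column 75% threshold
    B[:, same_class_mask] = (D_i[:, same_class_mask] <= q25[same_class_mask]).astype(int)
    B[:, ~same_class_mask] = (D_i[:, ~same_class_mask] >= q75[~same_class_mask]).astype(int)
    return B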

12 Methods: Step 3
- Extract class-specific features: within-class summation of the binary values (count the 1's), as sketched below
[figure: per-class summation over the binarized columns for C1 and C2]
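Continuing the sketch above, Step 3 reduces the binarized matrix to a per-gene count of 1's within each class; the function name is an assumption.

import numpy as np

def class_counts(B, labels):
    # B: (N, M) binary matrix from Step 2; labels: (M,) array of class ids.
    # Returns {class id: (N,) per-gene count of 1's within that class}.
    return {c: B[:, labels == c].sum(axis=1) for c in np.unique(labels)}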

13 Methods: Step 4
- Gene selection: apply a different threshold value to each class's counts (see the sketch below)
- Gene selection completes the row-wise reduction
[figure: class-specific count profiles with the selection threshold]
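A sketch of Step 4; the slides say only that the thresholds differ per class, so keeping a gene that clears the threshold for at least one class is an assumption, as are the names and threshold values.

import numpy as np

def select_genes(counts, thresholds):
    # counts: {class: (N,) per-gene counts} from Step 3.
    # thresholds: {class: scalar}, one class-specific cutoff each.
    # Keeping a gene that passes for at least one class is an assumption.
    keep = None
    for c, cnt in counts.items():
        passed = cnt >= thresholds[c]
        keep = passed if keep is None else (keep | passed)
    return np.where(keep)[0]  # row indices of the selected genes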

14 Methods: Step 5
- Column-wise reduction by clustering, i.e., classification of samples
- The NMF method is applied (described on the next slide)

15 Nonnegative Matrix Factorization (NMF)
- Matrix factorization: A ≈ VH (a library-based sketch follows this slide)
- A: n × m matrix of n genes and m samples
- V (n × k): the k columns of V are called basis vectors
- H (k × m): describes how strongly each building block is present in the measurement vectors
[figure: A (n × m) ≈ V (n × k) H (k × m)]
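A minimal sketch using scikit-learn's NMF; the slides do not name an implementation (Brunet et al. used a multiplicative-update algorithm), so the library choice, the random stand-in data, and the parameter values are assumptions.

import numpy as np
from sklearn.decomposition import NMF

A = np.random.rand(500, 38)  # stand-in for a nonnegative n-genes x m-samples matrix
model = NMF(n_components=3, init='random', random_state=0, max_iter=500)
V = model.fit_transform(A)   # (n, k): basis vectors, i.e., metagenes
H = model.components_        # (k, m): how strongly each metagene is present per sample
print(V.shape, H.shape)      # (500, 3) (3, 38)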

16 NMF: Parts-Based Clustering (Brunet 2004)
- Brunet et al. introduced the metagene concept

17 Experiments: Datasets
- Leukemia data: 5000 genes, 38 samples of two classes; 19 samples of ALL-B type, 8 samples of ALL-T type, and 11 samples of AML type
- Medulloblastoma data: 5893 genes, 34 samples of two classes; 25 classic type and 9 desmoplastic medulloblastoma type
- Central Nervous System tumors data: 7129 genes, 34 samples of four classes; 10 classic medulloblastomas, 10 malignant gliomas, 10 rhabdoids, and 4 normals

18 Classification
- Given a target sample, its class is predicted by the highest value in the corresponding k-dimensional column vector of H (sketched below)
[figure: A (n × m) ≈ V (n × k) H (k × m), as on slide 15]
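A sketch of this rule, continuing the NMF example from slide 15: each sample is assigned to the metagene with the largest coefficient in its column of H.

import numpy as np

def assign_clusters(H):
    # H: (k, m) NMF encoding matrix.
    # Returns (m,) cluster index per sample: the row of the largest
    # coefficient in each column, per the rule stated on this slide.
    return np.argmax(H, axis=0)

For example, with H from the NMF sketch above, assign_clusters(H) yields one cluster label per sample.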

19 Results: Leukemia Data (ALL-T vs. ALL-B vs. AML)

20 Results: Medulloblastoma Data (Classic vs. Desmoplastic)

21 Results: Central Nervous System Tumors Data (4 classes)

22 Conclusions & Future Work
- Our approach captures groups of features, but in contrast to holistic methods such as PCA and ICA, the intrinsic structure of the data distribution is preserved in the reduced space.
- Still, PCA and ICA can be used as aids for inspecting the structure of the data distribution and can provide useful information for further processing by other methods. Our ongoing research is on how to combine PCA and ICA with the proposed work.

23 References
- Wikipedia, http://en.wikipedia.org/wiki/Feature_selection
- J.-P. Brunet, P. Tamayo, T. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101(12):4164-4169, 2004.
- L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proc. 20th Int. Conf. on Machine Learning (ICML-03), pages 856-863, 2003.
- J. Biesiada and W. Duch. Feature selection for high-dimensional data: a Kolmogorov-Smirnov correlation-based filter solution. In Proc. CORES'05, Advances in Soft Computing, Springer Verlag, pages 95-104, 2005.
- D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.

24 Questions?

