
Presentation transcript:

1 Effective Feature Selection Framework for Cluster Analysis of Microarray Data. Gouchol Pok, Computer Science Dept., Yanbian University, China; Keun Ho Ryu, DB/Bioinformatics Lab, Chungbuk Nat’l University, Korea

2 Outline  Background  Motivation  Proposed Method  Experiments  Conclusion

3 Feature Selection  Definition: the process of selecting a subset of relevant features for building robust learning models  Objectives: Alleviating the effect of the curse of dimensionality Enhancing generalization capability Speeding up the learning process Improving model interpretability (from Wikipedia)

4 Issues in Feature Selection  How to compute the degree to which a feature is relevant to the class (discrimination)  How to decide whether a selected feature is redundant with other features (strongly correlated)  How to select features so that classifying power is not diminished (and ideally increased)  Removal of irrelevancy  Removal of redundancy  Maintain class-discriminating power

5 Selection Modes  Univariate method: considers one feature at a time, ranked by a score; typical measures are correlation, information measures, the K-S statistic, etc.  Multivariate method: considers subsets of features together, e.g. Bayesian and PCA-based selection; in principle more powerful than the univariate method, but not always in practice (Guyon2008)
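The univariate mode above can be sketched in a few lines: score each feature independently against the label and keep the top-ranked ones. This is a toy illustration with synthetic data (the matrix `X`, the label rule, and the choice of absolute Pearson correlation as the score are my assumptions, not the talk's).

```python
import numpy as np

# Univariate ranking sketch: score each feature by its absolute Pearson
# correlation with the class label, then keep the best-scoring ones.
# Toy data: the label is driven by feature 2, so it should rank highest.
rng = np.random.default_rng(1)
X = rng.random((30, 5))                  # 30 samples x 5 features
y = (X[:, 2] > 0.5).astype(float)        # synthetic class label
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(5)])
top2 = np.argsort(scores)[::-1][:2]      # indices of the 2 best features
```

The limitation the next slide illustrates is exactly what this sketch cannot see: two features that are individually uninformative can be jointly discriminative, and a per-feature score will discard them.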

6 Hard Case in Univariate Method ( Guyon2008* ) *Adapted from Guyon’s tutorial at the IPAM summer school

7 Proposed method: Motivation  A method that fits 2-D microarray data typical form: thousands of genes (rows) and hundreds of samples (columns)  Multivariate approach Feature relevancy and redundancy are addressed simultaneously

8 System Flow (diagram: genes as rows, samples as columns)

9 System Flow (cont.)

10 Methods: Step 1  Perform a column-based difference operation  D_i(N,M) = C(N,M) − C_i(N,1), i = 1,2,…,M The difference operator may depend on the application, e.g. Euclidean or Manhattan distance D_i(N,M) contains class-specific info w.r.t. each gene
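Step 1 amounts to broadcasting: subtract sample i's expression column from every column of the gene-by-sample matrix C. A minimal sketch (the toy matrix and sizes are made up; a Manhattan or Euclidean distance could replace the plain subtraction, as the slide notes):

```python
import numpy as np

# Step 1 sketch: for each reference sample i, subtract its column from
# every column of C, giving D_i(N, M) = C(N, M) - C_i(N, 1).
rng = np.random.default_rng(0)
N, M = 5, 4                              # N genes, M samples (toy sizes)
C = rng.random((N, M))
D = [C - C[:, [i]] for i in range(M)]    # D[i] is D_i(N, M)
```

Each D_i compares every sample against sample i gene by gene, which is the class-specific information the later steps threshold and count.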

11 Methods: Step 2  Apply thresholds Find a kind of “emerging pattern” that contrasts the 2 classes Suppose samples 1, 2,…, j ∈ C1 and j+1, j+2,…, M ∈ C2 Sort the values in each column of D_i(N,M) Apply a 25% threshold to the same-class differences and a 75% threshold to the different-class differences
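One way to read the 25%/75% rule is as quantile thresholds producing a binary pattern: same-class differences should be small (keep the lowest quartile), different-class differences should be large (keep the highest quartile). This sketch encodes that reading; the exact comparison directions and the use of absolute differences are my assumptions.

```python
import numpy as np

# Step 2 sketch: binarize one column of D_i, split into the entries
# whose sample shares sample i's class (d_same) and those that do not
# (d_diff).  Quantile directions are an assumed interpretation.
def binarize(d_same, d_diff):
    b_same = (np.abs(d_same) <= np.quantile(np.abs(d_same), 0.25)).astype(int)
    b_diff = (np.abs(d_diff) >= np.quantile(np.abs(d_diff), 0.75)).astype(int)
    return b_same, b_diff

b_same, b_diff = binarize(np.array([0.1, 0.2, 0.9, 0.8]),
                          np.array([0.05, 0.7, 0.95, 1.2]))
```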

12 Methods: Step 3  Extract class-specific features Within-class summation of binary values (count the 1’s)
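The within-class summation is a per-gene count of 1's over each class's columns of the binary pattern. A tiny sketch (the binary matrix `B` and the class split are invented for illustration):

```python
import numpy as np

# Step 3 sketch: B is an assumed binary pattern matrix (genes x samples)
# with the first 3 columns in class C1 and the last 2 in class C2.
B = np.array([[1, 1, 0, 0, 1],
              [0, 1, 1, 1, 1]])
score_C1 = B[:, :3].sum(axis=1)   # count of 1's within class C1, per gene
score_C2 = B[:, 3:].sum(axis=1)   # count of 1's within class C2, per gene
```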

13 Methods: Step 4  Gene selection Apply a different threshold value for each class Gene selection: this completes the row-wise reduction
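Step 4 keeps a gene when its within-class count clears a class-specific threshold. The slide does not say how the per-class decisions are combined; taking the union (a gene survives if either class selects it) is my assumption, as are the counts and threshold values below.

```python
import numpy as np

# Step 4 sketch: class-wise counts per gene (assumed values) and a
# different threshold per class; a gene is kept (row-wise reduction)
# if it passes either class's threshold.
score_C1 = np.array([3, 0, 2, 1])
score_C2 = np.array([0, 4, 1, 1])
t1, t2 = 3, 3                            # hypothetical thresholds
selected = (score_C1 >= t1) | (score_C2 >= t2)
genes_kept = np.where(selected)[0]
```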

14 Methods: Step5  Column-wise reduction by clustering Classification of samples Applied NMF method

15 Nonnegative Matrix Factorization (NMF)  Matrix factorization: A ≈ VH A: n × m matrix of n genes and m samples V (n × k): the k columns of V are called basis vectors H (k × m): describes how strongly each building block is present in the measurement vectors
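The factorization A ≈ VH can be computed with the classic multiplicative updates of Lee and Seung (cited in the references). This is a minimal generic sketch on a toy matrix, not the talk's implementation; sizes, rank k, iteration count, and the small epsilon guarding division by zero are all assumptions.

```python
import numpy as np

# Minimal multiplicative-update NMF sketch: A (n x m) ~ V (n x k) H (k x m),
# all entries nonnegative.  Updates preserve nonnegativity by construction.
rng = np.random.default_rng(0)
n, m, k = 20, 6, 2
A = rng.random((n, m))
V = rng.random((n, k)) + 0.1
H = rng.random((k, m)) + 0.1
for _ in range(200):
    H *= (V.T @ A) / (V.T @ V @ H + 1e-9)   # update mixture coefficients
    V *= (A @ H.T) / (V @ H @ H.T + 1e-9)   # update basis vectors
err = np.linalg.norm(A - V @ H) / np.linalg.norm(A)
```

After the column-wise reduction, each column of H gives the strength of each of the k metagenes in one sample, which is what the classification slide below uses.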

16 NMF: Parts-based Clustering (Brunet2004)  Brunet et al. introduced the metagene concept

17 Experiments: Datasets  Leukemia Data 5000 genes 38 samples of two classes  19 samples of ALL-B and 8 samples of ALL-T type,  11 samples of AML type.  Medulloblastoma Data 5893 genes 34 samples of two classes  25 classic type and 9 desmoplastic medulloblastoma type  Central Nervous System Tumors Data 7129 genes 34 samples of four classes  10 classic medulloblastomas, 10 malignant gliomas, 10 rhabdoids, and 4 normals

18 Classification  Given a target sample, its class is predicted by the largest value in the k-dim column vector of H
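The prediction rule above is a column-wise argmax over H. A small sketch with an assumed 2 × 3 coefficient matrix:

```python
import numpy as np

# Classification sketch: with H (k x m) from the factorization, each
# sample is assigned to the metagene whose coefficient is largest.
H = np.array([[0.9, 0.1, 0.4],
              [0.2, 0.8, 0.7]])   # assumed k=2, m=3 coefficient matrix
labels = H.argmax(axis=0)         # one cluster label per sample
```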

19 Results  Leukemia Data (ALL-T vs. ALL-B vs. AML)

20 Results  Medulloblastoma Data (Classic vs. Desmoplastic)

21 Results  Central Nervous System Tumors Data (4 classes)

22 Conclusions & Future work  Our approach tries to capture a group of features, but in contrast to holistic methods such as PCA and ICA, the intrinsic structure of the data distribution is preserved in the reduced space.  Still, PCA and ICA can be used as aids to examine the data-distribution structure and provide useful information for further processing by other methods. Our on-going research is on how to combine PCA and ICA with the proposed work.

23 References  Wikipedia, “Feature selection.”  J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101(12):4164–4169, 2004.  L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856–863, 2003.  J. Biesiada and W. Duch. Feature selection for high-dimensional data: A Kolmogorov–Smirnov correlation-based filter solution. (CORES’05) Advances in Soft Computing, Springer Verlag, 2005.  D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

24 Questions?