Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University.



2 Outline
Introduction to feature selection
  - Motivation
  - Problem statement
  - Key research issues
Application in genomic data analysis
  - Overview of data mining for microarray data
  - Gene selection
  - A case study
Current research directions

3 Motivation
An active field in
  - Pattern recognition
  - Machine learning
  - Data mining
  - Statistics

Data layout (M instances, N features, class label C):

        F1    F2    ...   FN    C
  I1    f11   f12   ...   f1N   c1
  I2    f21   f22   ...   f2N   c2
  ...   ...   ...   ...   ...   ...
  IM    fM1   fM2   ...   fMN   cM

Benefits
  - Reducing dimensionality
  - Improving learning efficiency
  - Increasing predictive accuracy
  - Reducing complexity of learned results

4 Problem Statement
  - A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
  - Selecting a minimum subset G such that P(C | G) is equal or as close as possible to P(C | F) (Koller and Sahami, 1996)
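The second definition leaves "as close as possible" open. As an illustration only (the slide gives no formula), one way to quantify the gap is the expected Kullback-Leibler divergence between the two conditional class distributions. Here f is a full feature-value vector, f_G is its projection onto the subset G, and the symbol \Delta_G is introduced for this sketch rather than taken from the slides:

```latex
\Delta_G \;=\; \sum_{f} P(F = f) \sum_{c} P(C = c \mid F = f)\,
  \log \frac{P(C = c \mid F = f)}{P(C = c \mid G = f_G)}
```

Under this reading, \Delta_G = 0 exactly when P(C | G) matches P(C | F) for every f with nonzero probability, and the goal is a minimum-size G whose \Delta_G is zero or acceptably small.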

5 An Example for the Problem
Data set
  - Five Boolean features
  - C = F1 ∨ F2
  - F3 = ¬F2, F5 = ¬F4
Optimal subset: {F1, F2} or {F1, F3}
Combinatorial nature of searching for an optimal subset
[Table: columns F1, F2, F3, F4, F5, C; rows not preserved in the transcript]
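The example can be checked by brute force. The sketch below assumes the data set contains every instance allowed by the stated constraints (the slide's actual table is not preserved), and the function names are illustrative. It tries subsets in order of increasing size, which also makes the combinatorial cost visible: up to 2^5 = 32 subsets for five features.

```python
from itertools import combinations, product

FEATURES = ["F1", "F2", "F3", "F4", "F5"]


def build_dataset():
    """Enumerate every instance allowed by the slide's constraints (an assumption)."""
    rows = []
    for f1, f2, f4 in product([0, 1], repeat=3):
        row = {"F1": f1, "F2": f2, "F3": 1 - f2, "F4": f4, "F5": 1 - f4}
        rows.append((row, f1 | f2))  # class C = F1 OR F2
    return rows


def is_consistent(subset, rows):
    """True if no two instances agree on `subset` but differ in class."""
    seen = {}
    for row, label in rows:
        key = tuple(row[name] for name in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True


def minimal_consistent_subsets(rows):
    """Return all consistent subsets of the smallest size that has any."""
    for size in range(len(FEATURES) + 1):
        hits = [s for s in combinations(FEATURES, size) if is_consistent(s, rows)]
        if hits:
            return hits
    return []


if __name__ == "__main__":
    print(minimal_consistent_subsets(build_dataset()))
```

On this enumerated data set the search returns the two subsets named on the slide, {F1, F2} and {F1, F3}.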

6 Subset Search
An example of search space (Kohavi and John, 1997)
[Figure: search-space diagram; not preserved in the transcript]

7 Evaluation Measures
Wrapper model
  - Relying on a predetermined classification algorithm
  - Using predictive accuracy as goodness measure
  - High accuracy, computationally expensive
Filter model
  - Separating feature selection from classifier learning
  - Relying on general characteristics of data (distance, correlation, consistency)
  - No bias toward any learning algorithm, fast
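To make the contrast concrete, here is a minimal sketch of one scorer of each kind. The choices are assumptions for illustration, not taken from the slides: the filter scorer uses mean absolute Pearson correlation with the class, the wrapper scorer uses leave-one-out accuracy of a 1-nearest-neighbour classifier, and the toy data are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def filter_score(X, y, subset):
    """Filter model: uses only data characteristics, no classifier."""
    cols = [[row[j] for row in X] for j in subset]
    return sum(abs(pearson(col, y)) for col in cols) / len(cols)


def wrapper_score(X, y, subset):
    """Wrapper model: leave-one-out accuracy of 1-NN on the chosen features."""
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in subset)

    correct = 0
    for i, row in enumerate(X):
        others = [k for k in range(len(X)) if k != i]
        nearest = min(others, key=lambda k: dist(row, X[k]))
        correct += y[nearest] == y[i]
    return correct / len(X)


# Made-up toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 5.5], [3.2, 3.5], [2.9, 4.5]]
y = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    for subset in ([0], [1]):
        print(subset, filter_score(X, y, subset), wrapper_score(X, y, subset))
```

The wrapper scorer retrains and tests a classifier for every candidate subset, which is where its extra cost comes from; the filter scorer needs only one pass over the data per feature.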

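8 A Framework for Algorithms
[Flowchart]
  Original Set -> Subset Generation -> (Candidate Subset) -> Subset Evaluation -> (Current Best Subset) -> Stopping Criterion
  Stopping Criterion: No -> back to Subset Generation
  Stopping Criterion: Yes -> Selected Subset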
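The loop in this framework can be sketched as follows. This is one possible instantiation under stated assumptions, not the presenter's algorithm: generation is greedy forward selection, evaluation is a consistency rate (the fraction of instances covered by the majority class of their group), and the search stops when the score reaches a target or no features remain. The data reuse the five-Boolean-feature example with C = F1 ∨ F2, enumerated in full.

```python
from collections import Counter, defaultdict
from itertools import product

FEATURES = ["F1", "F2", "F3", "F4", "F5"]

# All instances allowed by F3 = not F2, F5 = not F4, with class C = F1 or F2.
ROWS = [
    ({"F1": a, "F2": b, "F3": 1 - b, "F4": d, "F5": 1 - d}, a | b)
    for a, b, d in product([0, 1], repeat=3)
]


def consistency_rate(subset):
    """Subset evaluation: share of instances matching their group's majority class."""
    groups = defaultdict(Counter)
    for row, label in ROWS:
        groups[tuple(row[f] for f in subset)][label] += 1
    return sum(max(c.values()) for c in groups.values()) / len(ROWS)


def select_features(features, evaluate, target=1.0):
    """Generate, evaluate, and test the stopping criterion until done."""
    current, best_score = [], evaluate([])
    remaining = list(features)
    while remaining and best_score < target:  # stopping criterion
        # Subset generation: extend the current best subset by one feature.
        candidates = [current + [f] for f in remaining]
        # Subset evaluation: keep the best-scoring candidate (first wins ties).
        best_score, current = max(
            ((evaluate(c), c) for c in candidates), key=lambda pair: pair[0]
        )
        remaining.remove(current[-1])
    return current, best_score


if __name__ == "__main__":
    print(select_features(FEATURES, consistency_rate))
```

On this data no single feature improves on the empty subset (all score 0.75), so a criterion of "stop when the score stops improving" would halt too early; that is why the sketch stops on a target score instead.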

9 Feature Ranking
Weighting and ranking individual features
Selecting top-ranked ones for feature selection
Advantages
  - Efficient: O(N) in terms of dimensionality N
  - Easy to implement
Disadvantages
  - Hard to determine the threshold
  - Unable to consider correlation between features
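A minimal ranking sketch, assuming absolute Pearson correlation with the class as the per-feature weight (the slide does not fix a particular measure) and made-up toy data. Feature 1 is an exact rescaling of feature 0, so the example also shows the second disadvantage: both copies land at the top of the ranking.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def rank_features(columns, y):
    """Score each feature on its own, then sort indices by descending weight."""
    weights = [abs(pearson(col, y)) for col in columns]  # one pass per feature
    return sorted(range(len(columns)), key=lambda j: (-weights[j], j))


def top_k(columns, y, k):
    """Keep the k top-ranked features; k is the hard-to-choose threshold."""
    return rank_features(columns, y)[:k]


# Made-up toy data, one list per feature.
y = [0, 0, 0, 1, 1, 1]
informative = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
columns = [
    informative,
    [2 * v for v in informative],    # redundant copy of feature 0
    [5.0, 3.0, 4.0, 5.5, 3.5, 4.5],  # weakly related to the class
    [1.0, 2.0, 3.0, 2.5, 3.5, 4.0],  # moderately related to the class
]

if __name__ == "__main__":
    print(rank_features(columns, y), top_k(columns, y, 2))
```

With k = 2 the selection consists of the two redundant features, and the moderately informative feature 3 is left out.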

10 Applications of Feature Selection
Text categorization
  - Yang and Pedersen, 1997 (CMU)
  - Forman, 2003 (HP Labs)
Image retrieval
  - Swets and Weng, 1995 (MSU)
  - Dy et al., 2003 (Purdue University)
Gene expression microarray data analysis
  - Xing et al., 2001 (UC Berkeley)
  - Lee et al., 2003 (Texas A&M)
Customer relationship management
  - Ng and Liu, 2000 (NUS)
Intrusion detection
  - Lee et al., 2000 (Columbia University)

11 Microarray Technology
  - Enables simultaneous measurement of the expression levels of thousands or tens of thousands of genes in a single experiment
  - Provides new opportunities and challenges for data mining
[Table: columns Gene and Value, with rows for M23197_at, U66497_at, and M92287_at; values not preserved in the transcript]

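12 Two Ways to View Microarray Data
[Figure: two tabular views of the same data, one indexed by gene (M23197_at, U66497_at, M92287_at) and one indexed by sample with a class label (ALL or AML); values not preserved in the transcript]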

13 Data Mining Tasks
Data points are genes
  - Clustering: grouping similar genes together to find co-regulated genes
Data points are samples
  - Clustering: grouping similar samples together to find classes or subclasses
  - Classification: building a classifier to predict the classes of new samples

14 Gene Selection
Data characteristics in sample classification
  - High dimensionality (thousands of genes)
  - Small sample size (often fewer than 100 samples)
Problems
  - Curse of dimensionality
  - Overfitting the training data
Traditional gene selection methods
  - Within the filter model
  - Gene ranking

15 A Case Study (Golub et al., 1999)
Leukemia data
  - 7129 genes, 72 samples
  - Training: 38 (27 ALL, 11 AML)
  - Test: 34 (20 ALL, 14 AML)
Normalization
  - Mean: 0
  - Standard deviation: 1
Correlation measure
[Figure: normalized expression for ALL and AML samples; not preserved in the transcript]
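The slide names a correlation measure without showing its formula. In Golub et al. (1999) the gene-class correlation is the signal-to-noise ratio P(g, c) = (μ1 − μ2) / (σ1 + σ2), computed from the class-wise means and standard deviations of a gene's expression. The sketch below assumes that measure, assumes normalization is applied per gene, uses the population standard deviation, and runs on made-up values with hypothetical gene names; none of those details are stated on the slide.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale one gene's expression values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def signal_to_noise(values, labels, pos="ALL", neg="AML"):
    """(mean_pos - mean_neg) / (std_pos + std_neg) for a single gene."""
    a = [v for v, lab in zip(values, labels) if lab == pos]
    b = [v for v, lab in zip(values, labels) if lab == neg]
    denom = pstdev(a) + pstdev(b)
    return (mean(a) - mean(b)) / denom if denom else 0.0


def rank_genes(expression, labels):
    """Order gene names by the absolute value of their score, strongest first."""
    scores = {
        gene: signal_to_noise(standardize(values), labels)
        for gene, values in expression.items()
    }
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


# Made-up expression values for three hypothetical genes across six samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_a": [1, 2, 3, 7, 8, 9],  # higher in AML
    "gene_b": [5, 1, 9, 4, 8, 2],  # no clear class pattern
    "gene_c": [9, 8, 9, 2, 1, 2],  # higher in ALL
}

if __name__ == "__main__":
    print(rank_genes(expression, labels))
```

The sign of the score indicates which class a gene is more highly expressed in, so ranking by absolute value keeps markers for both classes.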

16 Case Study (continued)
Performance of selected genes
  - Accuracy on training set: 36 out of 38 (94.74%) correctly classified
  - Accuracy on test set: 29 out of 34 (85.29%) correctly classified
Limitations
  - Domain knowledge required to determine the number of genes selected
  - Unable to remove redundant genes

17 Feature/Gene Redundancy
Examining redundant genes
  - Two heads are not necessarily better than one
  - Effects of redundant genes
How to handle redundancy
  - A challenge
  - Some recent work
      MRMR (Maximum Relevance Minimum Redundancy) (Ding and Peng, CSB-2003)
      FCBF (Fast Correlation Based Filter) (Yu and Liu, ICML-2003)
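Handling redundancy needs a feature-to-feature correlation measure. FCBF, as published by Yu and Liu (2003), uses symmetrical uncertainty, SU(X, Y) = 2 · IG(X | Y) / (H(X) + H(Y)), for this purpose. The sketch below computes only that measure for discrete features; it is not an implementation of FCBF itself, and the toy vectors are made up to mirror the earlier Boolean example (F3 = ¬F2).

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def symmetrical_uncertainty(xs, ys):
    """2 * I(X; Y) / (H(X) + H(Y)); 0 means independent, 1 means fully redundant."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # both features are constant
    joint = entropy(list(zip(xs, ys)))
    mutual_information = hx + hy - joint
    return 2.0 * mutual_information / (hx + hy)


# Made-up Boolean features: f3 is the negation of f2, f4 is unrelated to f2.
f2 = [0, 0, 1, 1, 0, 0, 1, 1]
f3 = [1 - v for v in f2]
f4 = [0, 1, 0, 1, 0, 1, 0, 1]

if __name__ == "__main__":
    print(symmetrical_uncertainty(f2, f3), symmetrical_uncertainty(f2, f4))
```

Because F3 carries exactly the same information as F2, their symmetrical uncertainty is 1, which flags one of the pair as removable; F2 and F4 are independent here and score 0.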

18 Research Directions
Feature selection for unlabeled data
  - Commonalities with labeled data
  - Differences
Dealing with different data types
  - Nominal, discrete, continuous
  - Discretization
Dealing with large data sets
Comparative study and intelligent selection of feature selection methods

19 References
  - G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. ICML.
  - L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML-2003.
  - T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999.
  - C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. CSB-2003.
  - J. Shavlik and D. Page. Machine learning and genetic microarrays. ICML-2003 tutorial. Page.ppt