Correlation Aware Feature Selection
Annalisa Barla, Cesare Furlanello, Giuseppe Jurman, Stefano Merler, Silvano Paoli
Berlin, 8/10/2005

Presentation transcript:

Correlation Aware Feature Selection. Annalisa Barla, Cesare Furlanello, Giuseppe Jurman, Stefano Merler, Silvano Paoli. Berlin, 8/10/2005

Overview
- On Feature Selection
- Correlation Aware Ranking
- Synthetic Example

Feature Selection
Step-wise variable selection: n* < N effective variables modeling the classification function.
[Diagram: N features, N steps; from one feature at Step 1 up to N features at Step N.]

Feature Selection
Step-wise selection of the features.
[Diagram: at each step, the feature set is split into ranked features and discarded features.]

Ranking
- Classifier-independent filters (ignoring labelling). Prefiltering is risky: you might discard features that turn out to be important.
- Ranking induced by a classifier.

Support Vector Machines
Classification function: f(x) = sign(w · x + b), the Optimal Separating Hyperplane.
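As an illustration of the weights that the ranking below builds on, here is a minimal sketch (not from the original slides) that fits a linear SVM with scikit-learn and reads off the weight vector w; the data and parameters are made up for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: 20 samples, 5 features; feature 0 carries the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.array([1] * 10 + [-1] * 10)
X[:10, 0] += 2.0

# Linear SVM: the classification function is f(x) = sign(w . x + b).
svm = SVC(kernel="linear", C=1.0).fit(X, y)
w = svm.coef_.ravel()
print("weights w:", w)
print("ranking criterion w_i^2:", w ** 2)
```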

The classification/ranking machine
The RFE idea: given N features (genes),
1. Train an SVM.
2. Compute a cost function J from the weight coefficients of the SVM.
3. Rank the features in terms of their contribution to J.
4. Discard the feature contributing least to J.
5. Reapply the procedure on the remaining N-1 features.
This is called Recursive Feature Elimination (RFE; Guyon et al. 2002). Features are ranked according to their contribution to the classification, given the training data. It is time and data consuming, and at risk of selection bias.
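A compact sketch of the RFE loop just described, assuming a linear SVM and one feature removed per iteration; this is an illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep=1):
    """Recursive Feature Elimination with a linear SVM.

    Returns feature indices ordered from most important (last eliminated
    or kept) to least important (first eliminated).
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        svm = SVC(kernel="linear", C=1.0).fit(X[:, remaining], y)
        w = svm.coef_.ravel()
        # Contribution to the cost function J is taken as w_i^2; drop the smallest.
        worst = int(np.argmin(w ** 2))
        eliminated.append(remaining.pop(worst))
    # Survivors first, then eliminated features in reverse order of removal.
    return remaining + eliminated[::-1]
```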

RFE-based Methods
Discarding chunks of features at a time:
- Parametric: Sqrt(N)-RFE, Bisection-RFE
- Non-parametric: E-RFE (adapting to the weight distribution), thresholding weights at a value w*
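A sketch of the non-parametric elimination step: discard all features whose weight falls below a threshold w*. The rule used here to pick w* (a low quantile of |w|) is only an assumption for illustration; E-RFE adapts the threshold to the actual weight distribution.

```python
import numpy as np

def erfe_discard(weights, quantile=0.25):
    """One E-RFE-style elimination step (illustrative only).

    w* is taken as a low quantile of |w|; this is an assumption,
    not the published E-RFE criterion.
    """
    w = np.abs(weights)
    w_star = np.quantile(w, quantile)
    discard = np.where(w < w_star)[0]
    keep = np.where(w >= w_star)[0]
    return keep, discard, w_star
```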

Variable Elimination
Given F = {x_1, x_2, …, x_H} such that, for a given threshold w*:
w(x_1) ≈ w(x_2) ≈ … ≈ ε < w*  (each single weight is negligible)
BUT
w(x_1) + w(x_2) + … ≫ w*  (the genes are correlated, and their combined weight is not negligible)
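A toy illustration of this situation (made-up data, not from the slides): 50 nearly identical copies of one informative gene each receive a negligible weight from a linear SVM, while their summed weight is large.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 100
y = np.array([1] * 50 + [-1] * 50)

# One informative signal copied (with tiny noise) into 50 correlated genes,
# plus 50 irrelevant genes.
signal = np.where(y == 1, 1.0, -1.0) + rng.normal(0, 1, n)
correlated = signal[:, None] + rng.normal(0, 0.01, (n, 50))
noise = rng.uniform(-4, 4, (n, 50))
X = np.hstack([correlated, noise])

w = np.abs(SVC(kernel="linear", C=1.0).fit(X, y).coef_.ravel())
print("single correlated gene, mean |w|:", w[:50].mean())  # each copy gets ~1/50 of the block
print("correlated block, summed |w|   :", w[:50].sum())    # the block as a whole is heavy
print("single noise gene, mean |w|    :", w[50:].mean())
```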

Correlated Genes (1)

Correlated Genes (2)

Synthetic Data
Binary problem: 100 (50 + 50) samples of 1000 genes.
- genes 1-50: randomly extracted from N(1,1) and N(-1,1) for the two classes, respectively
- genes 51-100: randomly extracted from N(1,1) and N(-1,1) respectively (one feature repeated 50 times)
- genes 101-1000: extracted from Unif(-4,4)
Genes 1-100 are the significant features.
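A sketch of how such a dataset could be generated (illustrative; the original experiments were run within BioDCV through R and C code):

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_class, n_genes = 50, 1000
y = np.array([1] * n_per_class + [2] * n_per_class)

X = np.empty((2 * n_per_class, n_genes))

# Genes 1-50: N(1,1) for class 1, N(-1,1) for class 2.
X[y == 1, :50] = rng.normal(1, 1, (n_per_class, 50))
X[y == 2, :50] = rng.normal(-1, 1, (n_per_class, 50))

# Genes 51-100: a single N(1,1)/N(-1,1) feature repeated 50 times
# (a block of perfectly correlated genes).
base = np.where(y == 1, rng.normal(1, 1, 2 * n_per_class),
                        rng.normal(-1, 1, 2 * n_per_class))
X[:, 50:100] = np.repeat(base[:, None], 50, axis=1)

# Genes 101-1000: uninformative Unif(-4,4) noise.
X[:, 100:] = rng.uniform(-4, 4, (2 * n_per_class, 900))
```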

Our algorithm step j
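The slide above is a diagram of one step of the procedure. Based on the preceding slides, a rough sketch of what step j could look like is given below: candidate features with weight below w* are grouped into correlated families, and a family whose summed weight exceeds w* is saved rather than discarded. The grouping rule and the correlation cutoff are assumptions made for illustration, not the authors' exact algorithm.

```python
import numpy as np

def correlation_aware_step(X, weights, w_star, corr_threshold=0.9):
    """One elimination step that rescues correlated low-weight families.

    Features with |w| < w_star are candidates for elimination; candidates
    forming a highly correlated family whose summed weight exceeds w_star
    are saved instead of discarded. Illustrative only.
    """
    w = np.abs(weights)
    candidates = np.where(w < w_star)[0]
    corr = np.corrcoef(X, rowvar=False)

    saved, discarded, visited = [], [], set()
    for i in candidates:
        if i in visited:
            continue
        # Family of candidate features strongly correlated with feature i.
        family = [j for j in candidates
                  if j not in visited and abs(corr[i, j]) >= corr_threshold]
        visited.update(family)
        if w[family].sum() > w_star:
            saved.extend(family)       # negligible alone, relevant together
        else:
            discarded.extend(family)
    return sorted(saved), sorted(discarded)
```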

Methodology
- Implemented within the BioDCV system (50 replicates)
- Realized through R and C code interaction

Synthetic Data
Gene 100 is consistently ranked as 2nd across the elimination steps.

Work in Progress
- Preservation of highly correlated genes with low initial weights on microarray datasets
- Robust correlation measures
- Different techniques to detect the F_l families (clustering, gene functions)
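For the "robust correlation measures" item, one natural candidate is a rank-based coefficient such as Spearman's, which is less sensitive to outlying expression values than Pearson's (a small illustration, not from the slides):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(0, 0.1, 100)
y[0] = 50.0  # a single outlying expression value

print("Pearson :", pearsonr(x, y)[0])   # pulled down by the outlier
print("Spearman:", spearmanr(x, y)[0])  # nearly unaffected
```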

Synthetic Data
[Table: for each elimination step, the counts of remaining features among genes 1-50, genes 51-100, and genes > 100, and the number of SAVED features.]

Synthetic Data
Features discarded at step 9 by the E-RFE procedure. The correlation correction saves feature 100.

Challenges for predictive profiling

INFRASTRUCTURE
- MPACluster: available for batch jobs
- Connecting with IFOM: 2005
- Running at IFOM: 2005/2006
- Production on GRID resources (spring 2005)

ALGORITHMS II
1. Gene list fusion: suite of algebraic/statistical methods
2. Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large scale semi-supervised analysis
3. New SVM kernels for prediction on spectrometry data within complete validation

Prefiltering is risky: you might discard features that turn out to be important. Nevertheless, wrapper methods are quite costly. Moreover, with gene expression data you also have to deal with particular situations, such as clones or highly correlated features, that may be a pitfall for several selection methods.
A classic alternative is to map the data into linear combinations of features, and then select:
- Principal Component Analysis
- Metagenes (a simplified model for pathways, but biological suggestions require caution)
- eigen-craters for unexploded bomb risk maps
But we are then no longer working with the original features.
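A minimal sketch of this "map into linear combinations, then select" alternative, using PCA components as metagene-like features (illustrative data and parameters):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = np.array([1] * 50 + [-1] * 50)
X[y == 1, :50] += 1.0  # weak class signal in the first 50 genes

# Map the genes into principal components ...
pca = PCA(n_components=10).fit(X)
Z = pca.transform(X)

# ... and rank/select the components with a linear SVM as before.
w = np.abs(SVC(kernel="linear", C=1.0).fit(Z, y).coef_.ravel())
print("component ranking:", np.argsort(w)[::-1])
# Note: each component is a combination of all genes, so we are
# no longer working with the original features.
```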

A few issues in feature selection, with a particular interest in the classification of genomic data.
WHY?
- To ease the computational burden: discard the (apparently) less significant features and train in a simplified space, alleviating the curse of dimensionality.
- To enhance information: highlight (and rank) the most important features and improve the knowledge of the underlying process.
HOW?
- As a pre-processing step: employ a statistical filter (t-test, S2N).
- As a learning step: link the feature ranking to the classification task (wrapper methods, …).
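As an illustration of the pre-processing route, a minimal signal-to-noise (S2N) filter might look as follows (a sketch, assuming binary labels coded as 1/-1):

```python
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise ratio per feature for a binary problem:
    |mean_1 - mean_2| / (std_1 + std_2)."""
    X1, X2 = X[y == 1], X[y == -1]
    return np.abs(X1.mean(0) - X2.mean(0)) / (X1.std(0) + X2.std(0) + 1e-12)

def s2n_filter(X, y, k=100):
    """Keep the top-k features by S2N, ignoring any interaction between genes."""
    return np.argsort(s2n_scores(X, y))[::-1][:k]
```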

Feature Selection within Complete Validation Experimental Setups
Complete validation is needed to decouple model tuning from (ensemble) model accuracy estimation: otherwise selection bias effects arise.
Accumulating relative importance from Random Forest models for the identification of sensory drivers (with P. Granitto, IASMA).
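A sketch of how relative importances from Random Forest models could be accumulated inside a resampling loop, so that the ranking stays within the validation protocol; function and parameter names are illustrative, not from the original setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def accumulated_rf_importance(X, y, n_splits=5, n_replicates=10):
    """Accumulate Random Forest feature importances across resampling
    replicates, training only on the inner training folds so that the
    ranking does not leak information from held-out data."""
    total = np.zeros(X.shape[1])
    for rep in range(n_replicates):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=rep)
        for train_idx, _ in cv.split(X, y):
            rf = RandomForestClassifier(n_estimators=200, random_state=rep)
            rf.fit(X[train_idx], y[train_idx])
            total += rf.feature_importances_
    return total / (n_replicates * n_splits)
```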