INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org BioDCV: a grid-enabled complete validation setup for functional profiling EGEE User Forum.


BioDCV: a grid-enabled complete validation setup for functional profiling
EGEE User Forum, CERN
S. Paoli, D. Albanese, G. Jurman, A. Barla, S. Merler, R. Flor, S. Cozzini, J. Reid, C. Furlanello

Summary
1. Predictive profiling from microarray data.
2. A complete validation environment in grid: BioDCV.
3. Test: Cluster vs Grid.

Predictive Profiling
Questions for a discriminating molecular signature:
- predict disease state
- identify patterns regarding subclasses of patients
[Figure: gene expression array (Affy), genes by samples, with the samples split into Group A and Group B. Marked regions: over-expression in group A, over-expression in group B, under-expression in group B. Caption: A panel of discriminating genes?]

The BioDCV system
A software setup for predictive molecular profiling of gene expression data:
- A setup based on the E-RFE algorithm for Support Vector Machines (SVM).
- Control of selection bias, outlier detection, subtype identification.
- C language coupled with SQLite database libraries.
- It implements a complete validation procedure on distributed systems: MPI or openMosix clusters.
- Since March 2005: ported as a grid application with MPI execution through LCG middleware and data storage in SE.

The BioDCV setup (E-RFE SVM)
To avoid selection bias (p >> n): a COMPLETE VALIDATION SCHEME*
- externally, a stratified random partitioning;
- internally, a model selection based on K-fold cross-validation;
- yielding 3 x 10^5 SVM models (with random labels: 2 x 10^6)**
Pipeline stages (B = 400):
- OFS-M: model tuning and feature ranking
- ONF: optimal gene panel estimator
- ATE: Average Test Error
* Ambroise & McLachlan, 2002; Simon et al., 2003; Furlanello et al., 2003
** Binary classification, on a genes x 45 cDNA array, 400 runs
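The two nested levels of the scheme can be illustrated with a minimal Python sketch. It is not the BioDCV implementation (which is written in C): the function names, the `fit_and_score` callback, and the defaults `k=3` and `test_fraction=0.25` are assumptions made for illustration; only B = 400 comes from the slide. The point it shows is that model selection sees the training portion only, so the test error is not contaminated by selection bias.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Randomly split sample indices into train/test, keeping class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # shuffle a copy, not the grouped list itself
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices, k, rng):
    """Partition the given indices into k disjoint folds of near-equal size."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [sorted(shuffled[i::k]) for i in range(k)]


def complete_validation(labels, fit_and_score, b_runs=400, k=3,
                        test_fraction=0.25, seed=0):
    """Average test error (ATE) over b_runs external stratified splits.

    fit_and_score(train_idx, inner_folds, test_idx) is a hypothetical,
    caller-supplied callback: it tunes and ranks features using ONLY the
    training indices (via the inner folds) and returns the error measured
    on test_idx.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(b_runs):
        train, test = stratified_split(labels, test_fraction, rng)
        inner_folds = k_folds(train, k, rng)
        errors.append(fit_and_score(train, inner_folds, test))
    return sum(errors) / len(errors)
```

A caller would plug an SVM-based routine into `fit_and_score`; the sketch only fixes which indices each stage is allowed to see.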

Implementation
[Figure: the BioDCV system on the Egrid infrastructure, showing the UI, CE, SE and WNs, with data-size labels in MB (one reads 2-50 MB).]

Experiments
We present two experiments designed to measure the performance of BioDCV.
Resources: a Linux cluster of 8 Xeon 3.0 GHz CPUs, and the Egrid infrastructure (within the Italian Grid-it) ranging from 1 to 64 Xeon 3.0 GHz CPUs.
Data: a set of 6 different microarray datasets.
Tests:
- Benchmark 1: footprint
- Benchmark 2: scalability

Datasets
Table columns: Dataset name | Samples | Genes | DB (MB) | dN / 10^6 | T_tns (s)
Table rows: BRCA, Sarcoma, Liver Cancer, Pediatric Leukemia, Wang, Chang
Row groups: Benchmark 1, Benchmark 2
Sources: IFOM-INT, Milan (Italy); ATAC-PCR: Sese et al., Bioinformatics; Yeoh et al., NCBI; Wang et al., Lancet; Chang et al., PNAS 2005
Footprint: dN = Samples x Genes

Benchmark 1
We characterize the BioDCV application with respect to different datasets for a fixed number of CPUs in grid. This benchmark looks for the factor, called the footprint, that relates the execution time of the application to its input data. It is applied to the set of 6 microarray datasets with a fixed number of 32 CPUs in grid.
Evaluation metric: T_tns = L_i + U + E_g + D + S
- T_tns: effective execution time, i.e. total execution time without time spent in queue
- L_i: experiment setup
- U: time for uploading data and application to the grid, including delivery on CE
- E_g: computing time without latency time
- D: time for data retrieval and download; this includes copying all results from the WNs to the starting SE, and their transfer to the local site
- S: semisupervised analysis time
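As a rough illustration of the metric, the Python sketch below sums the five timing components and computes the footprint dN = samples x genes. The class and function names are hypothetical, and the least-squares line fit is one plausible way (an assumption, not taken from the slides) to relate T_tns to dN across datasets.

```python
from dataclasses import dataclass


@dataclass
class RunTimes:
    """Timing components of one grid run, in seconds (names are illustrative)."""
    setup: float      # L_i: experiment setup
    upload: float     # U: upload of data and application to the grid
    compute: float    # E_g: computing time without latency
    download: float   # D: data retrieval and download
    semisup: float    # S: semisupervised analysis

    def effective_time(self) -> float:
        """T_tns = L_i + U + E_g + D + S (queue time is excluded by definition)."""
        return (self.setup + self.upload + self.compute
                + self.download + self.semisup)


def footprint(samples: int, genes: int) -> int:
    """Dataset footprint dN = samples x genes."""
    return samples * genes


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Given one `RunTimes` per dataset, calling `fit_line` on the footprints and the `effective_time()` values gives a slope in seconds per unit of dN, which is the kind of dependence the benchmark looks for.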

Benchmark 1 - Footprint
[Figure: time (s) versus dataset footprint dN / 10^6, where dN = #genes x #samples, for a fixed 32 CPUs in grid. Curves: T_tns, E_g, 10 x L_i, 10 x U, 10 x S. Datasets: BRCA, Chang, Morishita, PL, Sarcoma, Wang.]
Legend: T_tns: effective execution time; E_g: computing time; S: semisupervised analysis; L_i: experiment setup; U: upload of data to the grid.

Benchmark 2
We study the scalability of our application as a function of the number of CPUs through a speed-up measure on different computational environments.
Resources: Linux cluster (ranging from 1 to 8 CPUs) and grid (from 1 to 32 CPUs).
Speed-up metric. Def: if E_g[N] is the user time of a program from the shell command "time" for N CPUs, then Speed-up(N) = E_g[1] / E_g[N].
Data: table with columns Dataset name | Samples | Genes | DB (MB) | dN x 10e-7; rows Liver Cancer and Pediatric Leukemia.
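The speed-up definition translates directly into a few lines of Python. This is a minimal sketch: the function names are hypothetical, the timings in the usage note are placeholders rather than measured values, and the parallel efficiency (speed-up divided by N) is an extra quantity not mentioned on the slide, added only because it makes the distance from linear speed-up explicit.

```python
def speed_up(user_time_by_cpus):
    """Return {N: E_g[1] / E_g[N]} from a mapping {N: user time on N CPUs}."""
    if 1 not in user_time_by_cpus:
        raise ValueError("the single-CPU time E_g[1] is required as baseline")
    base = user_time_by_cpus[1]
    return {n: base / t for n, t in sorted(user_time_by_cpus.items())}


def efficiency(user_time_by_cpus):
    """Return {N: Speed-up(N) / N}; a value of 1.0 means linear speed-up."""
    return {n: s / n for n, s in speed_up(user_time_by_cpus).items()}
```

For example, with placeholder timings `{1: 800.0, 2: 400.0, 4: 200.0, 8: 125.0}`, the run on 8 CPUs falls below the linear speed-up line, and `efficiency` reports by how much.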

Benchmark 2: Cluster
[Figure: two panels, "LiverCanc: cluster" and "PedLeuk: cluster". Each plots speed-up versus number of CPUs, comparing experimental data with linear speed-up.]

Benchmark 2: Grid
[Figure: two panels, "PedLeuk: Grid" and "LiverCanc: Grid". Each plots speed-up versus number of CPUs, comparing experimental data with linear speed-up.]

Discussion
- Two experiments for 139 CPU days on the Egrid infrastructure.
- In Benchmark 1, effective execution time increases linearly with the dataset footprint, i.e. the product of the number of genes and the number of samples.
- In Benchmark 2, the speed-up curve is very close to linear.
- The BioDCV system on the LCG/EGEE computational grid can be used in practical large-scale experiments.
- The BioDCV system will soon be executed on proteomic data in grid.
- The next step is porting our system to EGEE's Biomed VO.

BioDCV SubVersion Homepage
C. Furlanello, M. Serafini, S. Merler and G. Jurman. Semi-supervised learning for molecular profiling. IEEE Trans. Comp. Biology and Bioinformatics, 2(2): , More on

Acknowledgments
ICTP E-GRID Project, Trieste: Angelo Leto, Riccardo Murri, Ezio Corso, Alessio Terpin, Antonio Messina, Riccardo Di Meo
INFN GRID: Roberto Barbera, Mirco Mazzuccato
IFOM-FIRC and INT, Milano: Manuela Gariboldi, Marco A. Pierotti
Grants: BICG (AIRC), Democritos
Data: IFOM-FIRC, Cardiogenomics PGA