INFSO-RI Enabling Grids for E-sciencE BioDCV: a grid-enabled complete validation setup for functional profiling EGEE User Forum CERN, S. Paoli, D. Albanese, G. Jurman, A. Barla, S. Merler, R. Flor, S. Cozzini, J. Reid, C. Furlanello
Summary
1. Predictive profiling from microarray data.
2. A complete validation environment in grid: BioDCV.
3. Test: Cluster vs Grid.
Predictive Profiling
Questions for a discriminating molecular signature:
- predict disease state
- identify patterns regarding subclasses of patients
[Figure: gene expression array (Affy), genes x samples, with samples in Group A and Group B; regions mark over-expression in group A, over-expression in group B, and under-expression in group B.]
A panel of discriminating genes?
The BioDCV system
A software setup for predictive molecular profiling of gene expression data:
- A setup based on the E-RFE algorithm for Support Vector Machines (SVM).
- Control of selection bias, outlier detection, subtype identification.
- C language coupled with SQLite database libraries.
- It implements a complete validation procedure on distributed systems: MPI or OpenMosix clusters.
- Since March 2005: ported as a grid application, with MPI execution through LCG middleware and data storage in SE.
The BioDCV setup (E-RFE SVM)
To avoid selection bias (p >> n): a COMPLETE VALIDATION SCHEME*
- externally, a stratified random partitioning;
- internally, a model selection based on a K-fold cross-validation.
3 x 10^5 SVM models (+ random labels: 2 x 10^6)**
OFS-M: model tuning and feature ranking
ONF: optimal gene panel estimator
ATE: average test error
B = 400
* Ambroise & McLachlan, 2002; Simon et al., 2003; Furlanello et al., 2003
** Binary classification, on a genes x 45 cDNA array, 400 runs
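The complete validation scheme above can be sketched in a few lines. This is a minimal Python illustration, not the BioDCV code (which is written in C): a nearest-centroid classifier stands in for the SVM, a ranking by difference of class means stands in for E-RFE, the only tuned parameter is the size of the gene panel, and every function name, the toy data, and the values of B and K are assumptions made for the example. The point it shows is that feature ranking and model selection only ever see the training part of each outer split.

```python
import random
from statistics import mean

def stratified_split(labels, test_frac, rng):
    """Outer loop: random train/test split that keeps class proportions."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(test_frac * len(idx)))
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test

def rank_features(X, y, rows):
    """Rank features by |difference of class means| (stand-in for E-RFE)."""
    def score(j):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        return abs(mean(a) - mean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)

def centroid_error(X, y, train, test, feats):
    """Test error of a nearest-centroid classifier (stand-in for the SVM)."""
    cent = {c: [mean(X[i][j] for i in train if y[i] == c) for j in feats]
            for c in (0, 1)}
    wrong = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cent[c]))
                for c in (0, 1)}
        wrong += min(dist, key=dist.get) != y[i]
    return wrong / len(test)

def complete_validation(X, y, B=20, K=3, panel_sizes=(1, 2, 5), seed=0):
    """Return the average test error (ATE) and the panel size chosen per run."""
    rng = random.Random(seed)
    test_errors, chosen = [], []
    for _ in range(B):
        train, test = stratified_split(y, 0.25, rng)
        # Inner loop: K-fold cross-validation on the training part only.
        shuffled = train[:]
        rng.shuffle(shuffled)
        folds = [shuffled[k::K] for k in range(K)]
        cv_err = {n: [] for n in panel_sizes}
        for k in range(K):
            inner_train = [i for f in folds[:k] + folds[k + 1:] for i in f]
            ranking = rank_features(X, y, inner_train)
            for n in panel_sizes:
                cv_err[n].append(
                    centroid_error(X, y, inner_train, folds[k], ranking[:n]))
        best_n = min(panel_sizes, key=lambda n: mean(cv_err[n]))
        # Refit on the whole training part, then score once on the held-out test part.
        ranking = rank_features(X, y, train)
        test_errors.append(centroid_error(X, y, train, test, ranking[:best_n]))
        chosen.append(best_n)
    return mean(test_errors), chosen

def toy_data(n_per_class=20, n_feat=10, seed=1):
    """Hypothetical data: only feature 0 separates the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for cls in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_feat)]
            row[0] += 3.0 * cls
            X.append(row)
            y.append(cls)
    return X, y

X, y = toy_data()
ate, chosen = complete_validation(X, y)
print(f"ATE = {ate:.3f}, panel sizes chosen: {sorted(set(chosen))}")
```

Each of the B outer runs is independent of the others, which is what makes this kind of scheme a natural fit for distribution over many CPUs.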
Implementation
[Figure: the BioDCV system on the Egrid infrastructure, showing the UI, CE, SE and WNs; data transfers are labelled in MB (one label reads 2-50 MB).]
Experiments
We present two experiments designed to measure the performance of BioDCV.
Resources: a Linux cluster of 8 Xeon 3.0 GHz CPUs, and the Egrid infrastructure (within the Italian Grid-it), ranging from 1 to 64 Xeon 3.0 GHz CPUs.
Data: a set of 6 different microarray datasets.
Tests:
- Benchmark 1: footprint
- Benchmark 2: scalability
Datasets
[Table: columns Dataset name, Samples, Genes, DB (MB), dN / 10^6, T_tns (s); rows BRCA, Sarcoma, Liver Cancer, Pediatric Leukemia, Wang, Chang; rows are marked as used in Benchmark 1 and Benchmark 2.]
Footprint: dN = Samples x Genes
Sources: IFOM-INT, Milan (Italy); ATAC-PCR: Sese et al., Bioinformatics; Yeoh et al.; NCBI; Wang et al., Lancet; Chang et al., PNAS 2005
Benchmark 1
We characterize the BioDCV application across different datasets for a fixed number of CPUs in grid. This benchmark looks for the factor, called the footprint, that links the execution time of the application to its input data. It is applied to the set of 6 microarray datasets with a fixed number of 32 CPUs in grid.
Evaluation metrics: T_tns = L_i + U + E_g + D + S
- T_tns: effective execution time, i.e. total execution time without the time spent in queue.
- L_i: experiment setup.
- U: time for uploading data and application to the grid, including delivery on CE.
- E_g: computing time without latency time.
- D: time for data retrieval and download. This includes copying all results from the WNs to the starting SE, and their transfer to the local site.
- S: semisupervised analysis time.
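As a small worked example of the metric above, the following Python sketch sums the five components into T_tns. The class name and the timing values are hypothetical and chosen only for illustration; they are not measurements from the experiments.

```python
from dataclasses import dataclass

@dataclass
class RunTimes:
    """Timing components of one grid run, in seconds (symbols as on the slide)."""
    L_i: float  # experiment setup
    U: float    # upload of data and application to the grid
    E_g: float  # computing time without latency time
    D: float    # data retrieval and download
    S: float    # semisupervised analysis

    def t_tns(self) -> float:
        """Effective execution time: queue time is deliberately not included."""
        return self.L_i + self.U + self.E_g + self.D + self.S

# Hypothetical values, for illustration only.
run = RunTimes(L_i=30.0, U=45.0, E_g=3600.0, D=120.0, S=60.0)
print(run.t_tns())  # 3855.0
```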
Benchmark 1 - Footprint
[Figure: time (s) versus dataset footprint dN / 10^6, where dN = #genes x #samples, for the datasets BRCA, Chang, Morishita, PL, Sarcoma and Wang, with a fixed 32 CPUs in grid. Plotted series: T_tns, E_g, 10 x L_i, 10 x U, 10 x S.]
Legend:
- T_tns: effective execution time
- E_g: computing time
- S: semisupervised analysis
- L_i: experiment setup
- U: upload of data to the grid
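One way to examine the relation between footprint and execution time is an ordinary least-squares line through the (dN, T_tns) points. The Python sketch below is only an illustration of that idea: the dataset names, sizes and times are hypothetical placeholders constructed to lie exactly on a line, and the helper names are assumptions, not part of BioDCV.

```python
def footprint(samples, genes):
    """Dataset footprint dN = samples x genes."""
    return samples * genes

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical (name, samples, genes, T_tns in seconds); not data from the talk.
datasets = [
    ("toy-1", 50, 10000, 1000.0),
    ("toy-2", 100, 20000, 2500.0),
    ("toy-3", 200, 20000, 4500.0),
]

xs = [footprint(s, g) for _, s, g, _ in datasets]
ys = [t for _, _, _, t in datasets]
slope, intercept = fit_line(xs, ys)
print(f"T_tns ~ {slope:.4g} * dN + {intercept:.4g} s")
```

With real measurements, the residuals of such a fit indicate how closely the effective execution time follows the footprint.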
Benchmark 2
We study the scalability of our application as a function of the number of CPUs, through a speed-up measure on different computational environments.
Resources: Linux cluster (from 1 to 8 CPUs) and grid (from 1 to 32 CPUs).
Data:
[Table: columns Dataset name, Samples, Genes, DB (MB), dN x 10e-7; rows Liver Cancer, Pediatric Leukemia.]
Speed-up metric. Def: if E_g[N] is the user time of a program, as reported by the shell command “time”, for N CPUs, then Speed-up(N) = E_g[1] / E_g[N].
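The speed-up definition above translates directly into code. In this Python sketch the function name and the timings are hypothetical, used only to show the calculation and the comparison against linear speed-up (where Speed-up(N) = N).

```python
def speed_up(e_g):
    """Speed-up(N) = E_g[1] / E_g[N] for every measured CPU count N."""
    base = e_g[1]
    return {n: base / t for n, t in sorted(e_g.items())}

# Hypothetical user times in seconds, keyed by number of CPUs.
e_g = {1: 6400.0, 2: 3250.0, 4: 1680.0, 8: 880.0}

s = speed_up(e_g)
for n, value in s.items():
    print(f"N={n:2d}  speed-up={value:5.2f}  linear={n}")
```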
Benchmark 2 - Cluster
[Figure: two panels, "LiverCanc: cluster" and "PedLeuk: cluster", each plotting speed-up versus number of CPUs; experimental data are shown against the linear speed-up.]
Benchmark 2 - Grid
[Figure: two panels, "PedLeuk: Grid" and "LiverCanc: Grid", each plotting speed-up versus number of CPUs; experimental data are shown against the linear speed-up.]
Discussion
- Two experiments, for a total of 139 CPU days on the Egrid infrastructure.
- In Benchmark 1, the effective execution time increases linearly with the dataset footprint, i.e. the product of the number of genes and the number of samples.
- In Benchmark 2, the speed-up curve is very close to linear.
- The BioDCV system on the LCG/EGEE computational grid can be used in practical large-scale experiments.
- The BioDCV system will soon be run on proteomic data in grid.
- The next step is porting our system to EGEE’s Biomed VO.
More on
BioDCV SubVersion
Homepage
C. Furlanello, M. Serafini, S. Merler and G. Jurman. Semi-supervised learning for molecular profiling. IEEE Trans. Comp. Biology and Bioinformatics, 2(2): ,
Acknowledgments
ICTP E-GRID Project, Trieste: Angelo Leto, Riccardo Murri, Ezio Corso, Alessio Terpin, Antonio Messina, Riccardo Di Meo
INFN GRID: Roberto Barbera, Mirco Mazzuccato
IFOM-FIRC and INT, Milano: Manuela Gariboldi, Marco A. Pierotti
Grants: BICG (AIRC), Democritos
Data: IFOM-FIRC, Cardiogenomics PGA