INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org BioDCV: a grid-enabled complete validation setup for functional profiling EGEE User Forum.


1 INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org BioDCV: a grid-enabled complete validation setup for functional profiling EGEE User Forum CERN, 01-03.03.2006 S. Paoli, D. Albanese, G. Jurman, A. Barla, S. Merler, R. Flor, S. Cozzini, J. Reid, C. Furlanello http://mpa.itc.it

2 Summary
1. Predictive profiling from microarray data.
2. A complete validation environment in grid: BioDCV.
3. Test: Cluster vs Grid.

3 Predictive Profiling
QUESTIONS for a discriminating molecular signature:
predict disease state
identify patterns regarding subclasses of patients
A PANEL OF DISCRIMINATING GENES?
[Figure: gene expression array (Affy), genes by samples, for Group A and Group B, with regions marked as over-expression in group A, over-expression in group B, and under-expression in group B.]

4 The BioDCV system
A software setup for predictive molecular profiling of gene expression data:
A set-up based on the E-RFE algorithm for Support Vector Machines (SVM).
Control of selection bias, outlier detection, subtype identification.
C language coupled with SQLite database libraries.
It implements a complete validation procedure on distributed systems: MPI or OpenMosix clusters.
Since March 2005: ported as a grid application with MPI execution through LCG middleware and data storage in SE.

5 The BioDCV setup (E-RFE SVM)
To avoid selection bias (p >> n): a COMPLETE VALIDATION SCHEME*
externally, a stratified random partitioning;
internally, a model selection based on a K-fold cross-validation.
3 x 10^5 SVM models (+ random labels: 2 x 10^6)**
[Diagram labels: OFS-M: model tuning and feature ranking; ONF: optimal gene panel estimator; ATE: average test error; B=400]
* Ambroise & McLachlan, 2002; Simon et al., 2003; Furlanello et al., 2003
** Binary classification, on a 20000 genes x 45 cDNA array, 400 runs
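The two-level structure of the complete validation scheme can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the BioDCV implementation: the function names, the `error_of` callback, the 25% test fraction and the toy error function are all hypothetical, and the E-RFE feature ranking and SVM training are replaced by a stub. What it shows is only the control flow: B stratified random splits on the outside, K-fold model selection restricted to the training part on the inside, and the average test error (ATE) over the B runs.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Split sample indices into (train, test), keeping class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices, k, rng):
    """Return a list of (fit, held_out) index pairs for K-fold cross-validation."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [(sorted(set(shuffled) - set(f)), sorted(f)) for f in folds if f]


def complete_validation(labels, candidates, error_of, b=400, k=10,
                        test_fraction=0.25, seed=0):
    """Average test error over b external splits with internal model selection.

    error_of(candidate, fit_idx, eval_idx) is a hypothetical callback that
    trains a model on fit_idx and returns its error rate on eval_idx.
    """
    rng = random.Random(seed)
    test_errors = []
    for _ in range(b):
        train, test = stratified_split(labels, test_fraction, rng)
        folds = k_folds(train, k, rng)

        def cv_error(candidate):
            # Model selection sees only the training part of this split.
            return sum(error_of(candidate, fit, held) for fit, held in folds) / len(folds)

        best = min(candidates, key=cv_error)
        # The test part is touched once, after the model has been chosen.
        test_errors.append(error_of(best, train, test))
    return sum(test_errors) / b


# Toy usage: 45 samples in two classes, and a stub error function (assumption).
labels = [0] * 30 + [1] * 15


def toy_error(candidate, fit_idx, eval_idx):
    return abs(candidate - 2) * 0.1


ate = complete_validation(labels, candidates=[1, 2, 4], error_of=toy_error, b=5)
print(ate)
```

Keeping the test indices out of `cv_error` is the point of the design: feature ranking and tuning never see the samples used to estimate the error, which is how the scheme controls selection bias.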

6 Implementation
[Diagram: the BioDCV system on the Egrid infrastructure, showing UI, CE, SE and WNs, with data transfers labeled 50-400 MB and 2-50 MB.]

7 Experiments
We present two experiments designed to measure the performance of BioDCV.
Resources: a Linux cluster of 8 Xeon 3.0 GHz CPUs, and the Egrid infrastructure (integrated into the Italian Grid-it) ranging from 1 to 64 Xeon 3.0 GHz CPUs.
Data: a set of 6 different microarray datasets.
Tests:
–Benchmark 1: footprint
–Benchmark 2: scalability

8 Datasets
#  Dataset name        Samples  Genes  DB (MB)  dN / 10^6  T_tns (s)
1  BRCA                     62   4057      2.2        0.2       2534
2  Sarcoma                  35   7143      2.2        0.3       2887
3  Liver Cancer            213   1993      3.7        0.4       6894
4  Pediatric Leukemia      327  12625       32        4.1      27831
5  Wang                    286  17816       40        5.0     138335
6  Chang                   295  24481       57        7.2     114546
Footprint: dN = Samples x Genes
Sources: 1-2 IFOM-INT, Milan (Italy), 2005; 3 ATAC-PCR: Sese et al., Bioinformatics 2000; 4 Yeoh et al., NCBI 2002; 5 Wang et al., Lancet 2005; 6 Chang et al., PNAS 2005
Slide annotations: Benchmark1, Benchmark2
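As a quick check of the footprint definition dN = Samples x Genes, the sketch below recomputes the dN / 10^6 column for the two datasets whose sample and gene counts also appear on the Benchmark 2 slide. The function and variable names are illustrative only.

```python
def footprint(samples, genes):
    """Dataset footprint dN: number of samples times number of genes."""
    return samples * genes


# (samples, genes) as listed for the two Benchmark 2 datasets.
datasets = {
    "Liver Cancer": (213, 1993),
    "Pediatric Leukemia": (327, 12625),
}

for name, (samples, genes) in datasets.items():
    dn = footprint(samples, genes)
    # Report dN in units of 10^6, rounded to one decimal as on the slide.
    print(f"{name}: dN = {dn}, dN / 10^6 = {round(dn / 1e6, 1)}")
```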

9 Benchmark 1
We characterize the BioDCV application with respect to different datasets for a fixed number of CPUs in grid.
This benchmark looks for the factor, called footprint, that relates the execution time of the application to its input data.
Applied to the set of 6 microarray datasets with a fixed number of 32 CPUs in grid.
Evaluation metrics: T_tns = L_i + U + E_g + D + S
T_tns: effective execution time, i.e. total execution time without time spent in queue
L_i: experiment setup
U: time for uploading data and application to the grid, including delivery on CE
E_g: computing time without latency time
D: time for data retrieval and download; this includes copying all results from the WNs to the starting SE, and their transfer to the local site
S: semisupervised analysis time
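The decomposition of the effective execution time can be written down directly. The sketch below is only an illustration of the sum T_tns = L_i + U + E_g + D + S: the class name, field names and the timing values are invented for the example and are not measurements from the benchmark.

```python
from dataclasses import dataclass, asdict


@dataclass
class RunTimes:
    """Time components of one grid run, in seconds (queue time excluded)."""
    setup: float      # L_i: experiment setup
    upload: float     # U: upload of data and application to the grid
    compute: float    # E_g: computing time without latency
    download: float   # D: data retrieval and download
    semisup: float    # S: semisupervised analysis

    def effective(self):
        """T_tns: sum of all five components."""
        return sum(asdict(self).values())

    def shares(self):
        """Fraction of T_tns spent in each component."""
        total = self.effective()
        return {name: value / total for name, value in asdict(self).items()}


# Made-up values, for illustration only.
run = RunTimes(setup=30.0, upload=120.0, compute=6000.0, download=200.0, semisup=150.0)
print(run.effective())
print(run.shares()["compute"])
```

Looking at the shares rather than the raw sum makes it easy to see whether a run is dominated by computation or by data movement.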

10 Benchmark 1 - Footprint
[Plot: time (s) versus dataset footprint dN / 10^6, where dN = #genes x #samples, for a fixed 32 CPUs in grid. Curves: T_tns, E_g, 10 x L_i, 10 x U, 10 x S. Points labeled BRCA, Sarcoma, Morishita, PL, Wang, Chang.]
T_tns: effective execution time
E_g: computing time
S: semisupervised analysis
L_i: experiment setup
U: upload data to grid

11 Benchmark 2
We study the scalability of our application as a function of the number of CPUs through a speed-up measure on different computational environments.
Resources: Linux cluster (ranging from 1 to 8 CPUs) and grid (from 1 to 32 CPUs).
Speed-up metric. Def: if E_g[N] is the user time of a program, as reported by the shell command "time", for N CPUs: Speed-up(N) = E_g[1] / E_g[N]
Data:
Dataset name        Samples  Genes  DB (MB)  dN / 10^6
Liver Cancer            213   1993      3.7        0.4
Pediatric Leukemia      327  12625       32        4.1
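The speed-up definition above can be sketched as follows. The timings are made up for the example, and the parallel efficiency Speed-up(N) / N is a standard companion measure added here for illustration; it is not part of the slide.

```python
def speed_up(user_time_by_cpus):
    """Speed-up(N) = E_g[1] / E_g[N] for every measured CPU count N."""
    base = user_time_by_cpus[1]
    return {n: base / t for n, t in sorted(user_time_by_cpus.items())}


def efficiency(speedups):
    """Parallel efficiency Speed-up(N) / N; 1.0 means linear speed-up."""
    return {n: s / n for n, s in speedups.items()}


# Hypothetical user times in seconds, keyed by number of CPUs.
timings = {1: 6400.0, 2: 3200.0, 4: 1600.0, 8: 1000.0}

speedups = speed_up(timings)
for n, s in speedups.items():
    print(f"N={n}: speed-up {s:.2f}, efficiency {efficiency(speedups)[n]:.2f}")
```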

12 Benchmark 2: Cluster
[Plots: speed-up versus number of CPUs (1, 2, 4, 8) on the cluster, one panel for LiverCanc and one for PedLeuk; each shows experimental data against linear speed-up.]

13 Benchmark 2: Grid
[Plots: speed-up versus number of CPUs in grid, one panel for PedLeuk (4, 8, 16, 32 CPUs) and one for LiverCanc (1, 2, 4, 8, 16, 32 CPUs); each shows experimental data against linear speed-up.]

14 Discussion
Two experiments for 139 CPU days in the Egrid infrastructure.
In Benchmark 1, effective execution time increases linearly with the dataset footprint, i.e. the product of number of genes and number of samples.
In Benchmark 2, the speed-up curve is very close to linear.
The BioDCV system on the LCG/EGEE computational grid can be used in practical large-scale experiments.
The BioDCV system will soon be executed on proteomic data in grid.
The next step is porting our system under EGEE's Biomed VO.

15 BioDCV SubVersion Homepage: http://biodcv.itc.it
C. Furlanello, M. Serafini, S. Merler and G. Jurman. Semi-supervised learning for molecular profiling. IEEE Trans. Comp. Biology and Bioinformatics, 2(2):110-118, 2005.
More on http://mpa.itc.it

16 Acknowledgments
ICTP E-GRID Project, Trieste: Angelo Leto, Riccardo Murri, Ezio Corso, Alessio Terpin, Antonio Messina, Riccardo Di Meo
INFN GRID: Roberto Barbera, Mirco Mazzuccato
IFOM-FIRC and INT, Milano: Manuela Gariboldi, Marco A. Pierotti
Grants: BICG (AIRC), Democritos
Data: IFOM-FIRC, Cardiogenomics PGA

