Presentation is loading. Please wait.

Presentation is loading. Please wait.

BioDCV: a grid-enabled complete validation setup for functional profiling Trieste, Feb 2006 Silvano Paoli, Davide Albanese, Giuseppe.

Similar presentations


Presentation on theme: "BioDCV: a grid-enabled complete validation setup for functional profiling Trieste, Feb 2006 Silvano Paoli, Davide Albanese, Giuseppe."— Presentation transcript:

1 BioDCV: a grid-enabled complete validation setup for functional profiling Trieste, Feb 2006 http://mpa.itc.it Silvano Paoli, Davide Albanese, Giuseppe Jurman, Annalisa Barla, Stefano Merler, Roberto Flor, Stefano Cozzini, James Reid, Cesare Furlanello

2

3 Predictive Profiling QUESTIONS for a discriminating molecular signature: predict disease state predict disease state identify patterns regarding subclasses of patients identify patterns regarding subclasses of patients Group A Group B Array (gene expression Affy) B Over-expression in group B Over-expression in group A B genes Under-expression in group B samples A PANEL OF DISCRIMINATING GENES?

4 Algorithms and software systems for 1. 1.Predictive classification, feature selection, discovery Algorithms and software systems for 1. 1.Predictive classification, feature selection, discovery Our BioDCV system: a set-up based on the E-RFE algorithm for Support Vector Machines (SVM) Control of selection bias, a serious experimental design issue in the use of prognostic molecular signatures Subtype identification for studies of disease evolution and response to treatment Our BioDCV system: a set-up based on the E-RFE algorithm for Support Vector Machines (SVM) Control of selection bias, a serious experimental design issue in the use of prognostic molecular signatures Subtype identification for studies of disease evolution and response to treatment Predictive classification and functional profiling

5 “In conclusion, the list of genes included in a molecular signature (based on one training set and the proportion of misclassifications seen in one validation set) depends greatly on the selection of the patients in training sets.” “Five of the seven largest published studies addressing cancer prognosis did not classify patients better than chance. This result suggests that these publications were overoptimistic.” John P A Ioannidis February 5, 2005 Selection bias

6 To avoid selection bias (p>>n): a COMPLETE VALIDATION SCHEME* externally a stratified random partitioning, internally a model selection based on a K-fold cross-validation  3 x 10 5 SVM models (+ random labels  2 x 10 6 ) ** ** Binary classification, on a 20 000 genes x 45 cDNA array, 400 loops * Ambroise & McLachlan, 2002, Simon et. al 2003, Furlanello et. al 2003 OFS-M: Model tuning and Feature ranking ONF: Optimal gene panel estimator ATE: Average Test Error The BioDCV setup (E-RFE SVM)

7 Starting from a suite of C modules and Perl/shell scripts running on a local HPC resource … 1. Optimize modules and scripts:* database management of data, of model structures, of system outputs, scripts for OpenMosix Linux Clusters database management of data, of model structures, of system outputs, scripts for OpenMosix Linux Clusters 2. Wrap BioDCV into a grid application Learn about grid computing Learn about grid computing Port the serial version on a computational grid testbed Port the serial version on a computational grid testbed Analyze/verify results: identify needs/problems Analyze/verify results: identify needs/problems 3. Wrap with C MPI scripts Build the MPI mechanism Build the MPI mechanism Experiment on the testbed Experiment on the testbed Submit on production grid Submit on production grid Test scalability Test scalability 4. Production 5. Egee Biomed/NA4 Roadmap for a new grid application February 2005 March 05: Up and Running! Nov 04 - January 2005 Sept-Dec 2004 Sept 05: 1500 jobs, 500+ computing days on production grid Feb 2006: Egee

8 Rewrite shell/Perl scripts in C language Rewrite shell/Perl scripts in C language control I/O costs, control I/O costs, a process granularity optimal for temporary data allocation without tmp files a process granularity optimal for temporary data allocation without tmp files convenient for migrations convenient for migrations SQLite interface (Database engine library) SQLite interface (Database engine library) SQLite is small, self-contained, embeddable SQLite is small, self-contained, embeddable It provides a relational access to model and data structures (inputs, outputs, diagnostics) It provides a relational access to model and data structures (inputs, outputs, diagnostics) It supports transactions and multiple connections, databases up to 2 terabytes in size It supports transactions and multiple connections, databases up to 2 terabytes in size local copy (db file): + model definitions + a copy of of data + indexes defining the partition of the replicate sample(s) 1. Optimize modules and scripts

9 BioDCV (1)exp : experiment design through configuration of the setup database (2)scheduler : script submitting jobs (run) on each available processor. Platform dependent. (3)run : performs fractions of the complete validation procedures on several data splits. Local db is created (4)unify : the local datasets are merged with setup after completing the validations tasks. A complete dataset collecting all the relevant parameters is created.

10 Why porting into the grid? Why porting into the grid? Because we do not have “enough” computational resources… Because we do not have “enough” computational resources… How to port the BioDCV in grid? How to port the BioDCV in grid? PRELIMINARY PRELIMINARY Identify a collaborator with experience in grid computing (e.g. the Egrid Project hosted at ICTP http://www.egrid.it ) Identify a collaborator with experience in grid computing (e.g. the Egrid Project hosted at ICTP http://www.egrid.it ) Train human resources (SP  Trieste) Train human resources (SP  Trieste) Join the Egrid testbed (installing a grid site in Trento) Join the Egrid testbed (installing a grid site in Trento) HANDS-ON HANDS-ON Porting of the serial application on the testbed Porting of the serial application on the testbed patch code as needed: code portability is mandatory to make life easier patch code as needed: code portability is mandatory to make life easier Identify requirements/problems Identify requirements/problems 2. Wrapping into a grid application

11 A few EDG/LCG definitions Storage Element (SE): stores the user data in the grid and makes it available for subsequent elaboration Computing Element (CE): where the grid user programs are delivered for elaboration: this is usually a front-end to several elementary Worker Node machines Worker Node (WN): machines where the user programs are actually executed, possibly with multiple CPUs User Interface (UI): machine to access the GRID CE SE m TByte WNs N CPUs site

12 The local testbed in Trieste ( now gridats) The local testbed in Trieste ( now gridats) Small computational grid based on EDG middleware + Egrid add-ons Small computational grid based on EDG middleware + Egrid add-ons Designed for testing/training/porting of applications Designed for testing/training/porting of applications Full compatibility with Grid.it middleware Full compatibility with Grid.it middleware The production infrastructure: The production infrastructure: A Virtual Organization within Grid.it, with its own services A Virtual Organization within Grid.it, with its own services Star topology with central node in Padova ( only in version 1) Star topology with central node in Padova ( only in version 1) CE SE 2.8 TByte WNs 100 cpus Padova CE+SE+WN Trento CE+SE+WN Roma CE+SE+WN Trieste CE+SE+WN Firenze CE+SE+WN Palermo The ICTP Egrid project infrastructures

13 Porting the serial application Porting the serial application Easy task due to portability (no actual work needed) Easy task due to portability (no actual work needed) No software/library dependencies No software/library dependencies Testing/Evaluation Testing/Evaluation Problems identified: Problems identified: Job submission overhead due to EDG mechanisms Job submission overhead due to EDG mechanisms Managing multiple (~hundreds/thousands) jobs is difficult and cumbersome Managing multiple (~hundreds/thousands) jobs is difficult and cumbersome Answer: parallellize jobs on the GRID via MPI Answer: parallellize jobs on the GRID via MPI Single submission Single submission Multiple executions Multiple executions Hands on

14 How can we use C MPI? How can we use C MPI? Prepare two wrappers, and an unifier Prepare two wrappers, and an unifier one shell script to submit jobs (BioDCV.sh) one shell script to submit jobs (BioDCV.sh) one C MPI program (Mpba-mpi) one C MPI program (Mpba-mpi) one shell script to integrate results (BioDCV-union.sh) one shell script to integrate results (BioDCV-union.sh) BioDCV.sh in action: BioDCV.sh in action: copies file from and to Storage Element (SE) and distributes the microarray dataset to all WNs. copies file from and to Storage Element (SE) and distributes the microarray dataset to all WNs. It then starts the C MPI wrapper which spawns several runs of the BioDCV program (optimize for resources) It then starts the C MPI wrapper which spawns several runs of the BioDCV program (optimize for resources) When all BioDCV runs are completed, the wrapper copies all the results (SQLite files) from the WNs to the starting SE. When all BioDCV runs are completed, the wrapper copies all the results (SQLite files) from the WNs to the starting SE. MPBA-MPI executes the BioDCV runs in parallel MPBA-MPI executes the BioDCV runs in parallel BioDCV-union.sh collates results in one SQLite file (  R) BioDCV-union.sh collates results in one SQLite file (  R) C MPI 3. Wrap with C MPI scripts

15 Using BioDCV in Egrid UI Egrid Live CD* Resource broker (PD-TN) CE SE 2.8 TByte WNs 100 cpus CE+SE+WN Padova Trieste Palermo Trento.. “Edg-job-submit bioDCV.jdl” site a bootable Linux live-cd distribution with a complete suit of GRID tools by Egrid (ICTP Trieste)

16 [ Type = "Job"; JobType = "MPICH"; NodeNumber = 64; Executable = “BioDCV.sh"; Arguments = “Mpba-mpi 64 lfn:/utenti/spaoli/sarcoma.db 400"; StdOutput = "test.out"; StdError = "test.err"; InputSandbox = {“BioDCV.sh",“Mpba-mpi","run", "run.sh"}; OutputSandbox = {"test.err","test.out","executable.out"}; Requirements = other.GlueCEInfoLRMSType == "PBS" || other.GlueCEInfoLRMSType == "LSF"; ] BioDCV.jdl A Job Description

17 Second step:... WN 1 WN 2 WN 3 WN n Mpba-mpi and Sarcoma.db are distributed to all the involved WNs BioDCV Sarcoma.db WN 1 BioDCV.sh runs on Request file sarcoma.db SE First step: BioDCV.sh copies data from SE to the WN Using BioDCV in Egrid (II)

18 ... SE Fourth step: Output WN 1 BioDCV.sh copies all results (SQLite files) from the WNs to the starting SE BioDCV.sh runs on... Third step: WN 1 BioDCV is executed on all involved WNs by MPI Mpba-mpi runs on WN2 BioDCV runs on Mpba-mpi on WN3 Mpba-mpi on WN n Job completed Using BioDCV in Egrid (III)

19 RUNNING ON THE TESTBED (EGRID.IT) MARCH 2005 CPU no.Computing (sec) Copying files (sec) Total time (secondi) 1658832566018 2225664622749 32507,146171454 64554,3712632389 CPUs: Intel Xeon @ 2.80 GHz SCALING UP TESTS a. b. INT-IFOM Sarcoma dataset 7143 genes 35 samples a. Colon cancer dataset 2000 genes 62 samples b.

20 The pros: MPI execution on the GRID in a few days.. MPI execution on the GRID in a few days.. The tests showed scalable behavior of our grid application for increasing numbers of CPUs The tests showed scalable behavior of our grid application for increasing numbers of CPUs Grid computing reduces significantly production times and allows to tackle larger problems (see next slide) Grid computing reduces significantly production times and allows to tackle larger problems (see next slide) The cons: Data movements limit the scalability for a large number of CPU’s Data movements limit the scalability for a large number of CPU’s Note: this is a GRID.it limitation: there is no shared Filesystem between the WNs, so each file needs to be copied everywhere! Note: this is a GRID.it limitation: there is no shared Filesystem between the WNs, so each file needs to be copied everywhere! To hide the latency (ideas): To hide the latency (ideas): Smart data distribution from MWN to WN’s: Smart data distribution from MWN to WN’s: Reduce the amount of data to be moved Reduce the amount of data to be moved Proportionate BioDCV subtasks to local cache Proportionate BioDCV subtasks to local cache Data transferred via MPI communication Data transferred via MPI communication Requires MPI coding and some MPI adaptation of the code) Requires MPI coding and some MPI adaptation of the code) Results of Phase 1

21 Phase 2: Improving the system Reduce the amount of data to be moved 1.Redesign “per run”: SVM models (about 200) and SVM models (about 200) and results, evaluation results, evaluation Variables for semisupervised analysis Variables for semisupervised analysis all managed within one data structure 2.A large part of the sampletracking semisupervised analysis, is now managed within BioDCV (about 2000 files, 300MB) i.e. stored through SQLite. 3.Randomization of labels is fully automated 4.The SVM library is now an external library: Modular use of machine learning methods Modular use of machine learning methods Now being added: a PDA module Now being added: a PDA module 5.BioDCV now under GPL (code curation …) 6.Distributed at BioDCV with a SubVersion server since September 2005 PDA: Penalized Discriminant Analysis

22 1.At work on several clusters: MPBA-old: 50 P3 CPUs, 1GHz MPBA-old: 50 P3 CPUs, 1GHz MPBA-new: 6 Xeon CPUs, 2,8 GHz MPBA-new: 6 Xeon CPUs, 2,8 GHz ECT* (BEN): up to 32 (of 100) CPU Xeon, 2.8GHz ECT* (BEN): up to 32 (of 100) CPU Xeon, 2.8GHz ICTP cluster: up to 8 (of 60) P4, 2GHz, Myrinet ICTP cluster: up to 8 (of 60) P4, 2GHz, Myrinet 2.GRID experiences A.Egrid “production grid” (INFN Padua): up to 64 (of 100) Cpu Xeon, 2-3GHz Microarray data: Sarcoma, HS random, Morishita, Wang, … B.LESSONS LEARNED: i.the latest version reduces latencies (system times) due to file copying and management  CPU saturation ii.Life quality (and more studies): huge reduction of file installing and retrieving from facilities and WITHIN facilities iii.Forgetting the severe limitation of file system (AFS, …) iv.Now installing 2.6 LCG2 (CERN release September 2005) CLUSTER AND GRID ISSUES

23 In this section we present two experiments designed to measure the performance of the BioDCV parallel application in two different computing available environments: a standard linux cluster and computational grid. In this section we present two experiments designed to measure the performance of the BioDCV parallel application in two different computing available environments: a standard linux cluster and computational grid. SPEED-UP SPEED-UP In Benchmark 1, we study the scalability of our application as function of the number of CPUs (from 1 to 64). FOOTPRINT FOOTPRINT In Benchmark 2, we characterize the BioDCV appllication with respect to different dataset, i.e. for different number of feature (d) and number of samples (N) for the complete validation experiment (for a fixed number of 32 CPUs). NOVEMBER/DECEMBER 2005 – TEST CLUSTER AND GRID

24 Tasks for BioDCV (E-RFE SVM) - DICEMBER 2005 Liver cancer: 213 samples, 107 tumors from liver cancer +106 non tumoral/normal, 1993 genes, ATAC-PCR (Sese et. al, 2000) Breast cancer: Wang et al. 2005: 238 samples (286 lymph-node-negative), Affimetrix, 17819 genes Chang et al. 2005: 295 samples (151 lymph-node-negative, 144 pos), cDNA 25000 genes IFOM: 62 BRCA (4 subclasses) Pediatric Leukemia: 327 samples, 12626 genes (7 classes, binary: 284+43), Yeoh et al. 2002 Sarcoma: 37 samples, 7143 genes (4 classes) Benchmark1 Benchmark2

25 DICEMBER 2005 - SPEED-UP (Cluster + and GRID)

26 (DICEMBER 2005) FOOTPRINT GRID dN x 10e-7 Time (s) 12345678 1500 10000 50000 100000 T_tns E_g 10 x L_i 10 x U 10 x S BRCAChang Morishita PL Sarcoma Wang FOOTPRINT dN: #genes x #samples

27 Two experiments for 139 CPU day in Egrid infrastructure Two experiments for 139 CPU day in Egrid infrastructure In benchmark 1, we obtain a speed-up curve very close to linear In benchmark 1, we obtain a speed-up curve very close to linear In Benchmark 2, effective execution time increases linearly with the dataset footprint, i.e. the product of number of genes and number of samples In Benchmark 2, effective execution time increases linearly with the dataset footprint, i.e. the product of number of genes and number of samples Performance penalty payed for data transfer from WNs and queuing policy (best effort) respect on a local linux cluster Performance penalty payed for data transfer from WNs and queuing policy (best effort) respect on a local linux cluster We will investigate if substituting an MPI approach with a model that submit N different single CPU jobs We will investigate if substituting an MPI approach with a model that submit N different single CPU jobs BioDCV system on LCG/EGEE computational grid can be used in pratical large scale experiments BioDCV system on LCG/EGEE computational grid can be used in pratical large scale experiments Next is porting our system under EGEE’s Biomed VO. Next is porting our system under EGEE’s Biomed VO. NOVEMBER/DECEMBER 2005 –DISCUSSION

28 BASIC CLASSIFICATION: MODELS, lists, additional tools Tools for researchers: subtype discovery, outlier detection Connection to data (DB–MIAME) BASIC CLASSIFICATION: MODELS, lists, additional tools Tools for researchers: subtype discovery, outlier detection Connection to data (DB–MIAME) Challenges (AIRC-BICG) HPC-Interaction: access through web portal to GRID HPC www A p a c h e

29

30 Acknowledgments ITC-irst, Trento Alessandro Soraruf ICTP E-GRID Project, Trieste Angelo Leto Cristian Zoicas Riccardo Murri Ezio Corso Alessio Terpin ITC-irst, Trento Alessandro Soraruf ICTP E-GRID Project, Trieste Angelo Leto Cristian Zoicas Riccardo Murri Ezio Corso Alessio Terpin IFOM-FIRC and INT, Milano Manuela Gariboldi Marco A. Pierotti Grants: BICG (AIRC) Democritos Data: IFOM-FIRCC Cardiogenomics PGA IFOM-FIRC and INT, Milano Manuela Gariboldi Marco A. Pierotti Grants: BICG (AIRC) Democritos Data: IFOM-FIRCC Cardiogenomics PGA


Download ppt "BioDCV: a grid-enabled complete validation setup for functional profiling Trieste, Feb 2006 Silvano Paoli, Davide Albanese, Giuseppe."

Similar presentations


Ads by Google