BioDCV: a grid-enabled complete validation setup for functional profiling. Wannsee Retreat, October 2005. Cesare Furlanello with Silvano Paoli et al.


BioDCV: a grid-enabled complete validation setup for functional profiling. Wannsee Retreat, October 2005. Cesare Furlanello with Silvano Paoli, Davide Albanese, Giuseppe Jurman, Annalisa Barla, Stefano Merler, Roberto Flor.

Algorithms and software systems for:
1. Predictive classification, feature selection, discovery
- Our BioDCV system: a set-up based on the E-RFE algorithm for Support Vector Machines (SVM)
- Control of selection bias, a serious experimental design issue in the use of prognostic molecular signatures
- Subtype identification for studies of disease evolution and response to treatment
Predictive classification and functional profiling
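The E-RFE component mentioned above can be pictured as a recursive feature elimination loop. A minimal Python sketch, assuming a caller-supplied fit function that returns one weight per feature; the real E-RFE sizes the eliminated chunk adaptively from the SVM weight distribution, while a fixed fraction is used here for brevity, and the function names are ours, not BioDCV's:

```python
def rfe(X, y, fit, n_keep, chunk=0.2):
    """Generic recursive feature elimination: repeatedly fit a model,
    rank the active features by |weight|, and drop the lowest-ranked
    chunk until only n_keep features remain."""
    active = list(range(len(X[0])))          # indices of surviving features
    while len(active) > n_keep:
        # Fit on the currently active columns only.
        w = fit([[row[j] for j in active] for row in X], y)
        order = sorted(range(len(active)), key=lambda k: abs(w[k]))
        # Drop a fraction of the active set, but never below n_keep.
        n_drop = max(1, min(int(chunk * len(active)), len(active) - n_keep))
        dropped = set(order[:n_drop])
        active = [j for k, j in enumerate(active) if k not in dropped]
    return active
```

The adaptive chunk size is what makes E-RFE faster than plain RFE on microarray data: uninformative genes are eliminated many at a time early on, and one at a time near the end.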

“In conclusion, the list of genes included in a molecular signature (based on one training set and the proportion of misclassifications seen in one validation set) depends greatly on the selection of the patients in training sets.” “Five of the seven largest published studies addressing cancer prognosis did not classify patients better than chance. This result suggests that these publications were overoptimistic.” John P. A. Ioannidis, February 5, 2005. Selection bias

“the 95% CI for the proportion of misclassifications fell to below 50% for some training-set sizes in only two of the studies” “We noted unstable molecular signatures and misclassification rates (with minimum rates between 31% and 49%).” Michiels et al., Lancet 2005

The authors present a novel algorithm for classification, preprocessing, feature selection, … A description is available in XXX and the algorithm is publicly available as a Windows program / website / R package YYY.
BUT, HAVE THEY ANSWERED THE FOLLOWING QUESTIONS?
1. Which classification result could be achieved using standard algorithms, and is there a difference in classification quality between a standard algorithm and the proposed one?
2. If there is a substantial difference, what is the reason?
M. Ruschhaupt et al (2004) "A Compendium to Ensure Computational Reproducibility in High-Dimensional Classification Tasks", Statistical Applications in Genetics and Molecular Biology 3 (1), Article 37.
A NEW PAPER ON PREDICTIVE GENE PROFILING … INGREDIENTS: DATA + METHODS (- EXPERIMENTAL SETUP?)

REANALYSIS OF DATASET: Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM, West M, Nevins JR, Huang AT. Gene expression predictors of breast cancer outcomes. The Lancet 361:1590–1596 (2003).
ERRORS FOR DIFFERENT CLASSIFIERS AND WITH/WITHOUT METAGENES IN COMPLETE VALIDATION
Table: errors (no recurrence / recurrence / all) for RF-M, PAM-M, PLR-M, SVM-M, BBT-M (with metagenes) and RF, PAM, PLR, SVM, BBT (without metagenes), with selected panel size.
RF: random forest; PAM: class prediction by nearest shrunken centroids; PLR: penalized logistic regression; SVM: Support Vector Machines; BBT: Bayesian Binary Prediction Tree Models; metagenes: new variables from linear combinations.
M. Ruschhaupt et al (2004) "A Compendium to Ensure Computational Reproducibility in High-Dimensional Classification Tasks", Statistical Applications in Genetics and Molecular Biology 3 (1), Article 37.

RESULTS
- Misclassification rates of around 25% with all eight methods.
- The use of metagenes did not seem to make a big difference either way.
- Most of the misclassified samples come from the group of patients with recurrence, which is the smaller group. Possibly, this could be explained by a preference of the classification algorithms to favour the larger group.
M. Ruschhaupt et al (2004) "A Compendium to Ensure Computational Reproducibility in High-Dimensional Classification Tasks", Statistical Applications in Genetics and Molecular Biology 3 (1), Article 37.
THEN: HOW TO EVALUATE ACCURACY IN PREDICTION?

To avoid selection bias (p >> n): a COMPLETE VALIDATION SCHEME* with, externally, a stratified random partitioning and, internally, a model selection based on a K-fold cross-validation → 3 × 10^5 SVM models (+ random labels → 2 × 10^6)**
** Binary classification, on a genes x 45 cDNA array, 400 loops
* Ambroise & McLachlan 2002; Simon et al. 2003; Furlanello et al. 2003
OFS-M: model tuning and feature ranking; ONF: optimal gene panel estimator; ATE: average test error
The BioDCV setup (E-RFE SVM)
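The two nesting levels of the complete validation scheme can be sketched as plain index bookkeeping. A hedged Python illustration (function names are ours, not BioDCV's): externally a stratified random partition per replicate, internally a K-fold split of the training indices used for model selection only:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac, rng):
    """External level: stratified random partition into train/test indices,
    preserving class proportions in the test set."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        idx = idx[:]
        rng.shuffle(idx)
        n_test = max(1, round(test_frac * len(idx)))
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)

def kfold(indices, k, rng):
    """Internal level: K disjoint folds of the training indices,
    used for feature ranking and model tuning only."""
    idx = list(indices)
    rng.shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]
```

Each external replicate touches its test set exactly once, after feature selection and tuning have finished on the internal folds; this separation is what keeps the error estimate free of selection bias.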

Tasks for BioDCV (E-RFE SVM)
- Lymphoma: 96 samples (74 tumoral + 24 control) described by 4096 genes (Alizadeh et al., 2000)
- Tumor vs. Metastases: 76 samples (64 primary tumoral + 12 metastatic) described by genes (Ramaswamy et al., 2001)
- High Sezary CTCL: 30 samples (18 disease + 12 control) described by 6660 genes (Kari et al., 2003); coll. Wistar Inst.
- Glioma: 50 samples (28 glioblastoma + 22 oligodendroglioma) described by genes (Nutt et al., 2003)
- Breast cancer: 37 samples (18 high risk + 19 low risk) described by genes (Huang et al., 2003)
- Mouse Model of Myocardial Infarction: 36 samples (18 infarcted + 18 control) described by genes (Cardiogenomics PGA)
- Colon cancer: 62 samples (40 tumoral + 22 control) described by 2000 genes (Alon et al., 1999)

Tasks for BioDCV (E-RFE SVM)
- Liver cancer: 213 samples (107 tumors from liver cancer + 106 non tumoral/normal), 1993 genes, ATAC-PCR (Sese et al., 2000)
- Breast cancer: Wang et al. 2005: 238 samples (286 lymph-node-negative), Affymetrix, genes; Chang et al. 2005: 295 samples (151 lymph-node-negative, 144 pos), cDNA genes
- IFOM: 62 BRCA (4 subclasses)
- Pediatric Leukemia: 327 samples, genes (7 classes, binary: ), Yeoh et al.
- Tumor vs. Metastases: 76 samples (64 primary tumoral + 12 metastatic) described by genes (Ramaswamy et al., 2001)
- High Sezary CTCL: 30 samples (18 disease + 12 control) described by 6660 genes (Kari et al., 2003); coll. Wistar Inst.
- Glioma: 50 samples (28 glioblastoma + 22 oligodendroglioma) described by genes (Nutt et al., 2003)
- Breast cancer: 37 samples (18 high risk + 19 low risk) described by genes (Huang et al., 2003)
- Mouse Model of Myocardial Infarction: 36 samples (18 infarcted + 18 control) described by genes (Cardiogenomics PGA)

1. With a Linux OpenMosix HPC facility
Table: dataset, dimension, job no., SVM count, total (h), cluster (h), for Breast Cancer (37 x genes), MM NILV–LV (41 x genes), Glioma (50 x genes), High Sezary Survival (49 x genes), High Sezary Without C.
Our HPC resource, MpaCluster: 6 Xeon + 40 Pentium CPUs, OpenMosix, 3 TB central storage. Upgraded before production on GRID; P3 biprocessor nodes.

Starting from a suite of C modules and Perl/shell scripts running on a local HPC resource …
1. Optimize modules and scripts (Sept–Dec 04): database management of data, of model structures, of system outputs; scripts for OpenMosix Linux clusters
2. Wrap BioDCV into a grid application (November 04 – January 2005): learn about grid computing; port the serial version on a computational grid testbed; analyze/verify results and identify needs/problems
3. Wrap with C MPI scripts (February 2005): build the MPI mechanism; experiment on the testbed; submit on production grid; test scalability
4. Production. March 05: up and running! Sept 05: 1500 jobs, 500+ computing days on production grid
Roadmap for a new grid application

Rewrite shell/Perl scripts in the C language:
- control I/O costs
- a process granularity optimal for temporary data allocation without tmp files
- convenient for migrations
SQLite interface (database engine library):
- SQLite is small, self-contained, embeddable
- It provides relational access to model and data structures (inputs, outputs, diagnostics)
- It supports transactions and multiple connections, and databases up to 2 terabytes in size
Local copy (db file): model definitions + a copy of the data + indexes defining the partition of the replicate sample(s)
1. Optimize modules and scripts
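The local db-file idea can be illustrated with Python's built-in sqlite3 module (the schema and names below are illustrative, not BioDCV's actual schema): one self-contained file per replicate holding model definitions, a copy of the data, and the split indexes.

```python
import sqlite3

def make_run_db(path, data_rows, split_rows):
    """Create a per-replicate SQLite database: model definitions, a copy
    of the expression data, and the indexes defining the data partition.
    Everything lives in one file, so a run can migrate between nodes
    without temporary files."""
    con = sqlite3.connect(path)
    con.executescript("""
        CREATE TABLE model (run INTEGER, param TEXT, value REAL);
        CREATE TABLE data  (sample INTEGER, gene INTEGER, expr REAL);
        CREATE TABLE split (run INTEGER, sample INTEGER, role TEXT);
    """)
    con.executemany("INSERT INTO data VALUES (?, ?, ?)", data_rows)
    con.executemany("INSERT INTO split VALUES (?, ?, ?)", split_rows)
    con.commit()
    return con
```

Because the file is self-contained, copying it to a worker node and copying it back is the whole data-staging protocol; no shared filesystem is assumed.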

BioDCV
(1) exp: experiment design through configuration of the setup database
(2) scheduler: script submitting jobs (run) on each available processor; platform dependent
(3) run: performs fractions of the complete validation procedure on several data splits; a local db is created
(4) unify: the local databases are merged with the setup database after completing the validation tasks; a complete dataset collecting all the relevant parameters is created
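Step (4) amounts to merging many single-run SQLite files back into the setup database. A sketch using SQLite's ATTACH mechanism (table and column names are invented for illustration):

```python
import sqlite3

def unify(setup_path, local_paths):
    """Merge per-run local databases into the setup database. Each local
    file is attached, its result rows copied over, then detached."""
    con = sqlite3.connect(setup_path)
    con.execute("CREATE TABLE IF NOT EXISTS results (run INTEGER, err REAL)")
    for path in local_paths:
        con.execute("ATTACH DATABASE ? AS loc", (path,))
        con.execute("INSERT INTO results SELECT run, err FROM loc.results")
        con.commit()  # commit before detaching the attached database
        con.execute("DETACH DATABASE loc")
    return con
```

The merged file is then a single relational store that downstream tools (e.g. R post-processing) can query directly.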

Why porting into the grid? Because we need "enough" computational resources…
How to port BioDCV to the grid?
PRELIMINARY
- Identify a collaborator with experience in grid computing (e.g. the Egrid Project hosted at ICTP)
- Train human resources (SP → Trieste)
- Join the Egrid testbed (installing a supernode in Trento)
HANDS-ON
- Porting of the serial application on the testbed
- Patch code as needed: code portability is mandatory to make life easier
- Identify requirements/problems
2. Wrapping into a grid application

A few EDG definitions
- Storage Element (SE): stores the user data in the grid and makes it available for subsequent elaboration
- Computing Element (CE): where the grid user programs are delivered for elaboration; this is usually a front-end to several elementary Worker Node machines
- Worker Node (WN): machines where the user programs are actually executed, possibly with multiple CPUs
- User Interface (UI): machine to access the GRID
(Site diagram: CE + SE (m TByte) + WNs (N CPUs))

The ICTP Egrid project infrastructures
The local testbed in Trieste: a small computational grid based on EDG middleware + Egrid add-ons; designed for testing/training/porting of applications; full compatibility with Grid.it middleware.
The production infrastructure: a Virtual Organization within Grid.it, with its own services; star topology with central node in Padova (CE, SE 2.8 TByte, WNs, 100 CPUs) and CE+SE+WN sites in Trento, Roma, Trieste, Firenze, Palermo.

Porting the serial application: an easy task due to portability (no actual work needed); no software/library dependencies.
Testing/evaluation; problems identified:
- Job submission overhead due to EDG mechanisms
- Managing multiple (hundreds/thousands of) jobs is difficult and cumbersome
Answer: parallelize jobs on the GRID via MPI: single submission, multiple executions.
Hands on

How can we use C MPI? Prepare two wrappers and a unifier:
- one shell script to submit jobs (BioDCV.sh)
- one C MPI program (Mpba-mpi)
- one shell script to integrate results (BioDCV-union.sh)
BioDCV.sh in action: it copies files from and to the Storage Element (SE) and distributes the microarray dataset to all WNs. It then starts the C MPI wrapper, which spawns several runs of the BioDCV program (optimized for resources). When all BioDCV runs are completed, the wrapper copies all the results (SQLite files) from the WNs to the starting SE.
Mpba-mpi executes the BioDCV runs in parallel.
BioDCV-union.sh collates results in one SQLite file (→ R).
3. Wrap with C MPI scripts
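How one MPI submission turns into many BioDCV runs can be sketched with the usual rank-strided work assignment. This mirrors, but is not, the actual Mpba-mpi code (which is C): each of N ranks takes runs r, r+N, r+2N, and so on.

```python
def runs_for_rank(rank, n_ranks, n_runs):
    """Rank-strided assignment: which replicate runs this MPI rank executes."""
    return list(range(rank, n_runs, n_ranks))

def schedule(n_ranks, n_runs):
    """Full schedule: one list of run indices per rank. Every run is
    assigned exactly once, and per-rank loads differ by at most one run."""
    return [runs_for_rank(r, n_ranks, n_runs) for r in range(n_ranks)]
```

Under this sketch, a submission requesting 64 nodes for 400 external loops would give each rank 6 or 7 BioDCV runs to execute.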

Using BioDCV in Egrid
From the UI (Egrid Live CD*): "edg-job-submit BioDCV.jdl" → resource broker (PD-TN) → site (CE, SE 2.8 TByte, WNs 100 CPUs; CE+SE+WN at Padova, Trieste, Palermo, Trento, …)
* a bootable Linux live-CD distribution with a complete suite of GRID tools by Egrid (ICTP Trieste)

[ Type = "Job"; JobType = "MPICH"; NodeNumber = 64; Executable = “BioDCV.sh"; Arguments = “Mpba-mpi 64 lfn:/utenti/spaoli/sarcoma.db 400"; StdOutput = "test.out"; StdError = "test.err"; InputSandbox = {“BioDCV.sh",“Mpba-mpi","run", "run.sh"}; OutputSandbox = {"test.err","test.out","executable.out"}; Requirements = other.GlueCEInfoLRMSType == "PBS" || other.GlueCEInfoLRMSType == "LSF"; ] BioDCV.jdl A Job Description

Using BioDCV in Egrid (II)
First step: BioDCV.sh runs on WN 1 and copies data from the SE to the WN (requesting file sarcoma.db).
Second step: Mpba-mpi and sarcoma.db are distributed to all the involved WNs (WN 1, WN 2, WN 3, …, WN n).

Using BioDCV in Egrid (III)
Third step: BioDCV is executed on all involved WNs by MPI (Mpba-mpi runs on WN 2, WN 3, …, WN n).
Fourth step: BioDCV.sh copies all results (SQLite files) from the WNs to the starting SE. Job completed.

SCALING-UP TESTS: RUNNING ON THE TESTBED (EGRID.IT)
Table: CPU no., computing (sec), file copying (sec), total time (sec). CPUs: Intel 2.80 GHz.
Datasets: (a) INT-IFOM sarcoma dataset, 7143 genes, 35 samples; (b) colon cancer dataset, 2000 genes, 62 samples.

BioDCV Usage (examples)
BioDCV on the complete dataset (213 samples) → outlier detection → shaved dataset (198 samples) → compare subgroups and pathological features.

Example of Semisupervised analysis (Sese)

The pros:
- MPI execution on the GRID in a few days
- The tests showed scalable behavior of our grid application for increasing numbers of CPUs
- Grid computing significantly reduces production times and allows us to tackle larger problems (see next slide)
The cons:
- Data movements limit the scalability for a large number of CPUs. Note: this is a GRID.it limitation: there is no shared filesystem between the WNs, so each file needs to be copied everywhere!
To hide the latency (ideas):
- Smart data distribution from MWN to WNs: reduce the amount of data to be moved; proportion BioDCV subtasks to the local cache
- Data transferred via MPI communication (requires MPI coding and some MPI adaptation of the code)
Results

MOVE no. 2: Improving the system
Reduce the amount of data to be moved:
1. Redesign "per run": SVM models (about 200), results and evaluation, and variables for semisupervised analysis are all managed within one data structure
2. A large part of the sample-tracking semisupervised analysis (about 2000 files, 300 MB) is now managed within BioDCV, i.e. stored through SQLite
3. Randomization of labels is fully automated
4. The SVM library is now an external library: modular use of machine learning methods; now adding a PDA module
5. BioDCV is now under GPL (code curation …)
6. Distributed via a Subversion server since September 2005

1. At work on several clusters:
- MPBA-old: 50 P3 CPUs, 1 GHz
- MPBA-new: 6 Xeon CPUs, 2.8 GHz
- ECT* (BEN): up to 32 (of 100) Xeon CPUs, 2.8 GHz
- SISSA (Cozzini): up to 32 (of 60) P4, 2 GHz, Myrinet
2. GRID experiences
A. Egrid "production grid" (INFN Padua): up to 64 (of 100) Xeon CPUs, 2-3 GHz. Microarray data: Sarcoma, HS random, Morishita, Wang, …
B. LESSONS LEARNED:
i. the latest version reduces latencies (system times) due to file copying and management → CPU saturation
ii. life quality (and more studies): huge reduction of file installing and retrieving from facilities and WITHIN facilities
iii. forgetting the severe limitations of the file system (AFS, …)
iv. now installing LCG2 2.6 (CERN release, September 2005)
CLUSTER AND GRID ISSUES (OCTOBER 2005)

INFRASTRUCTURE
- MPACluster → available for batch jobs
- Connecting with IFOM → 2005
- Running at IFOM → 2005/2006
- Production on GRID resources (spring 2005)
ALGORITHMS II
1. Gene list fusion: suite of algebraic/statistical methods
2. Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis
3. New SVM kernels for prediction on spectrometry data within complete validation
Challenges for predictive profiling

Challenges (AIRC-BICG)
- BASIC CLASSIFICATION: models, lists, additional tools
- Tools for researchers: subtype discovery, outlier detection
- Connection to data (DB-MIAME)
- HPC interaction: access through web front-ends (Apache) to GRID/HPC

Acknowledgments
ITC-irst, Trento: Davide Albanese, Giuseppe Jurman, Stefano Merler, Roberto Flor, Alessandro Soraruf
ICTP E-GRID Project, Trieste: Angelo Leto, Cristian Zoicas, Riccardo Murri, Ezio Corso, Alessio Terpin
IFOM-FIRC and INT, Milano: James Reid, Manuela Gariboldi, Marco A. Pierotti
Grants: BICG (AIRC), Democritos
Data: ShoweLab (Wistar), Cardiogenomics PGA

DTW-based clustering
Figure: positive height curves. Curves were clustered with Dynamic Time Warping as distance, with weight configuration (1, 2, 1).
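For reference, a weighted DTW distance of the kind used for this clustering can be written down compactly. A sketch assuming the (1, 2, 1) weights apply to the horizontal, diagonal, and vertical steps of the usual recurrence (the slide does not spell out the exact step pattern):

```python
def dtw(a, b, w=(1, 2, 1)):
    """Dynamic Time Warping distance between two sequences, with step
    weights (w_horizontal, w_diagonal, w_vertical), e.g. (1, 2, 1)."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = best accumulated cost aligning a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            D[i][j] = min(D[i][j - 1] + w[0] * d,      # horizontal step
                          D[i - 1][j - 1] + w[1] * d,  # diagonal step
                          D[i - 1][j] + w[2] * d)      # vertical step
    return D[n][m]
```

With symmetric horizontal/vertical weights the distance is symmetric in its arguments, so it can feed any standard hierarchical clustering of the sample-tracking profiles.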

Subgroup and pathological features
DTW-based clustering of sample-tracking profiles of the cluster subgroup. All samples have virus type B (V-B). Bottom lines: incidence (of virus type B vs. type C) in the cluster subgroup and in all the positive samples.

Predictive error
Figure: average test error (ATE) vs. number of features, for the complete dataset and the shaved dataset, with confidence intervals (.95 level).