SPRINT: A Simple Parallel R INTerface
Savvas Petrou
EPCC, The University of Edinburgh

March 2010

Overview
– What is SPRINT
– How SPRINT differs from other parallel R packages
– Biological example: post-genomic data analysis
– Code comparison

SPRINT
Simple Parallel R INTerface
“SPRINT: A new parallel framework for R”, J Hill et al, BMC Bioinformatics, Dec

Issues of existing parallel R packages
– Difficult to program: they require the scientist to also be a parallel programmer!
– Require substantial changes to existing scripts
– Can’t be used to solve some problems: no data dependencies allowed

Biological example
Data: a matrix of expression measurements with genes in rows and samples in columns

Biological example
Problem: using all or many genes will either crash or be very slow (R memory allocation limits, number of computations).

Data limitations (correlations)
Input array dimensions and size | Final array size in memory
11,000 x … (… MB) | … MB (0.9 GB)
22,000 x … (… MB) | 3,… MB (3.6 GB)
35,000 x … (… MB) | 9,346 MB (9.12 GB)
45,000 x … (… MB) | 15,… MB (15.08 GB)

Work load limitations (permutations)
Input array dimensions | Permutation count | Estimated total run time
36,612 x 76 | 500,000 | 20,750 seconds (about 6 hours)
36,612 x 76 | 1,000,000 | 41,500 seconds (about 12 hours)
73,224 x 76 | 500,000 | 35,000 seconds (about 10 hours)
73,224 x 76 | 1,000,000 | 70,000 seconds (about 20 hours)
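
The final array sizes quoted for the correlation case are close to what a dense genes x genes matrix of 8-byte doubles would need. The sketch below reproduces that arithmetic; the storage layout (dense, square, double precision) and the function name `cor_matrix_bytes` are assumptions made for illustration, not details taken from the slides.

```r
# Assumption: the correlation result is a dense n_genes x n_genes matrix
# of double-precision values (8 bytes each).
cor_matrix_bytes <- function(n_genes, bytes_per_value = 8) {
  n_genes^2 * bytes_per_value
}

genes   <- c(11000, 22000, 35000, 45000)
size_gb <- cor_matrix_bytes(genes) / 1024^3   # bytes -> GB (binary units)

print(data.frame(genes = genes, size_gb = round(size_gb, 2)))
```

Because the size grows with the square of the gene count, doubling the number of genes roughly quadruples the memory needed for the result.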

Workarounds and solution
Workarounds:
– Remove as many genes as possible before applying the algorithm. This can be an arbitrary process and may remove relevant data.
– Perform multiple executions and post-process the data. This can become a very painful procedure.
Solution: parallelisation of R code can be made accessible to bioinformaticians and statisticians through a library of solutions that experts code once and that everyone can then use easily at the end point.
[Diagram: SPRINT, R, Biological Results, HPC, Big Post Genomic Data]

Benchmarks (256 processes)

Data limitations (correlations)
Input array dimensions and size | Final array size in memory | Total run time in serial (seconds) | Total run time in parallel (seconds)
11,000 x … (… MB) | … MB (0.9 GB) | … | …
22,000 x … (… MB) | 3,… MB (3.6 GB) | “Error: cannot allocate vector of size 3.6 Gb” | …
35,000 x … (… MB) | 9,346 MB (9.12 GB) | CRASHED | …
45,000 x … (… MB) | 15,… MB (15.08 GB) | CRASHED | 42.18

Work load limitations (permutations)
Input array dimensions | Permutation count | Estimated total run time in serial | Total run time in parallel (seconds)
36,612 x 76 | 500,000 | 20,750 seconds (about 6 hours) | …
36,612 x 76 | 1,000,000 | 41,500 seconds (about 12 hours) | …
73,224 x 76 | 500,000 | 35,000 seconds (about 10 hours) | …
73,224 x 76 | 1,000,000 | 70,000 seconds (about 20 hours) | …

Correlation code comparison

Serial:
    edata <- read.table("largedata.dat")
    pearsonpairwise <- cor(edata)
    write.table(pearsonpairwise, "Correlations.txt")
    quit(save="no")

Parallel (SPRINT):
    library("sprint")
    edata <- read.table("largedata.dat")
    ff_handle <- pcor(edata)
    pterminate()
    quit(save="no")
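
One reason pairwise correlation lends itself to parallel execution is that each block of rows of the result can be computed independently of the others. The base R sketch below illustrates that idea on toy data. It is not SPRINT's `pcor` implementation; the block layout, the toy matrix and the variable names are assumptions chosen for illustration.

```r
set.seed(42)

# Toy data (assumption): 50 observations of 8 variables.
# cor() correlates the columns of its input.
x <- matrix(rnorm(50 * 8), nrow = 50, ncol = 8)

# Split the columns into two blocks; each block could go to a different worker.
blocks <- list(1:4, 5:8)

# Each worker correlates its block of columns against all columns,
# producing a horizontal slice of the full correlation matrix.
pieces <- lapply(blocks, function(cols) cor(x[, cols, drop = FALSE], x))

# Stacking the slices recovers the full matrix.
blockwise <- do.call(rbind, pieces)

stopifnot(isTRUE(all.equal(blockwise, cor(x), check.attributes = FALSE)))
```

Here `lapply` runs the blocks one after another; in a parallel setting each slice would be computed by a separate process, so no single process has to hold the whole result in memory.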

Permutation testing code comparison

Serial:
    library("multtest")   # provides the golub data set and mt.maxT
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
    quit(save="no")

Parallel (SPRINT):
    library("multtest")   # provides the golub data set
    library("sprint")
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
    pterminate()
    quit(save="no")
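
Permutation testing spreads well over many processes because every permutation of the class labels can be evaluated independently. The base R sketch below shows a single-step maxT adjustment on toy data. It is a simplified stand-in for the step-down procedure that `mt.maxT` performs, not the multtest or SPRINT implementation; the helper `row_welch_t`, the toy matrix and the permutation count are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical helper: Welch t-statistic for every row (gene) of x,
# comparing the samples labelled 1 against the samples labelled 0.
row_welch_t <- function(x, labels) {
  a <- x[, labels == 0, drop = FALSE]
  b <- x[, labels == 1, drop = FALSE]
  se <- sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
  (rowMeans(b) - rowMeans(a)) / se
}

# Toy data (assumption): 20 genes in rows, 12 samples in columns, two classes.
x <- matrix(rnorm(20 * 12), nrow = 20)
labels <- rep(c(0, 1), each = 6)
x[1, labels == 1] <- x[1, labels == 1] + 5   # make gene 1 differ between classes

observed <- abs(row_welch_t(x, labels))

# Each permutation shuffles the labels and records the largest |t| over all genes.
# The permutations do not depend on each other, so they could be split across workers.
n_perm <- 500
max_t <- replicate(n_perm, max(abs(row_welch_t(x, sample(labels)))))

# Single-step maxT adjusted p-value: the fraction of permutations whose
# maximum statistic is at least as large as the observed statistic.
adj_p <- sapply(observed, function(t_obs) mean(max_t >= t_obs))
```

In a parallel run each worker would generate its own share of the `n_perm` permutations and the per-gene counts would be combined at the end.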

SPRINT
Website: source code can be downloaded from the website; soon also in the CRAN repository
Mailing list:
Contact:

Acknowledgements
DPM Team: Peter Ghazal, Thorsten Forster, Muriel Mewissen
EPCC Team: Terry Sloan, Michal Piotrowski, Savvas Petrou, Bartek Dobrzelecki, Jon Hill, Florian Scharinger
This work is supported by the Wellcome Trust and the NAG (Numerical Algorithms Group) dCSE Support service.

SPRINT - Demo
Executing the same code in serial and in parallel:
    R --vanilla --slave -f maxT_serial.R
    mpiexec -n 2 R --vanilla --slave -f maxT_parallel.R