Savvas Petrou, EPCC, The University of Edinburgh. SPRINT: A Simple Parallel R INTerface.


1 SPRINT: A Simple Parallel R INTerface
Savvas Petrou (spetrou@epcc.ed.ac.uk), EPCC, The University of Edinburgh

2 Overview (March 2010)
– What is SPRINT
– How is SPRINT different from other parallel R packages
– Biological example: post-genomic data analysis
– Code comparison

3 SPRINT: Simple Parallel R INTerface (www.r-sprint.org)
“SPRINT: A new parallel framework for R”, J Hill et al., BMC Bioinformatics, Dec 2008.

4 Issues of existing parallel R packages
– Difficult to program
– Require the scientist to also be a parallel programmer!
– Require substantial changes to existing scripts
– Can’t be used to solve some problems
– No data dependencies allowed

5 Biological example
Data: A matrix of expression measurements with genes in rows and samples in columns.

6 Biological example
Problem: Using all or many genes will either crash or be very slow (R memory allocation limits, number of computations).
Data limitations (correlations)
Input array dimensions | Input array size | Final array size in memory
11,000 x 320 | 26.85 MB | 923.15 MB (0.9 GB)
22,000 x 320 | 53.7 MB | 3,692.62 MB (3.6 GB)
35,000 x 320 | 85.44 MB | 9,346 MB (9.12 GB)
45,000 x 320 | 109.86 MB | 15,449.52 MB (15.08 GB)
Work load limitations (permutations)
Input array dimensions | Permutation count | Estimated total run time
36,612 x 76 | 500,000 | 20,750 seconds (6 hours)
36,612 x 76 | 1,000,000 | 41,500 seconds (12 hours)
73,224 x 76 | 500,000 | 35,000 seconds (10 hours)
73,224 x 76 | 1,000,000 | 70,000 seconds (20 hours)
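The final array sizes in the correlation table follow from simple arithmetic: a gene-by-gene correlation matrix holds one value per pair of genes. The R sketch below reproduces the figures to within rounding, assuming dense double-precision storage (8 bytes per element) and 1 MB = 1024^2 bytes. The helper name `matrix_mb` is illustrative and not part of SPRINT.

```r
# Size in MB of a dense double-precision matrix (8 bytes per element).
matrix_mb <- function(rows, cols) rows * cols * 8 / 1024^2

genes   <- c(11000, 22000, 35000, 45000)
samples <- 320

sizes <- data.frame(
  genes     = genes,
  input_mb  = matrix_mb(genes, samples),  # expression matrix: genes x samples
  result_mb = matrix_mb(genes, genes)     # correlation matrix: genes x genes
)
print(round(sizes, 2))
```

The input grows linearly with the number of genes while the result grows quadratically, so doubling the gene count roughly quadruples the memory needed for the result.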

7 Workarounds and solution
Workaround:
– Remove as many genes as possible before applying the algorithm. This can be an arbitrary process and may remove relevant data.
– Perform multiple executions and post-process the data. This can become a very painful procedure.
Solution: Parallelisation of R code can be made accessible to bioinformaticians and statisticians: a library whose solutions are coded once by experts and are then easy for everyone to use.
[Diagram: Big Post Genomic Data, R, SPRINT, HPC, Biological Results]

8 Benchmarks (256 processes)
Data limitations (correlations)
Input array dimensions | Input array size | Final array size in memory | Total run time in serial (seconds) | Total run time in parallel (seconds)
11,000 x 320 | 26.85 MB | 923.15 MB (0.9 GB) | 63.18 | 4.76
22,000 x 320 | 53.7 MB | 3,692.62 MB (3.6 GB) | “Error: cannot allocate vector of size 3.6 Gb” | 13.87
35,000 x 320 | 85.44 MB | 9,346 MB (9.12 GB) | CRASHED | 36.64
45,000 x 320 | 109.86 MB | 15,449.52 MB (15.08 GB) | CRASHED | 42.18
Work load limitations (permutations)
Input array dimensions | Permutation count | Estimated total run time in serial | Total run time in parallel (seconds)
36,612 x 76 | 500,000 | 20,750 seconds (6 hours) | 73.18
36,612 x 76 | 1,000,000 | 41,500 seconds (12 hours) | 146.64
73,224 x 76 | 500,000 | 35,000 seconds (10 hours) | 148.46
73,224 x 76 | 1,000,000 | 70,000 seconds (20 hours) | 294.61
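One way to read the benchmark figures is as speedups, that is, serial time divided by parallel time. The sketch below computes them for the rows that have both timings. The vector names are illustrative, and because the serial permutation timings are estimates in the slides, those ratios are indicative only.

```r
# Timings in seconds, copied from the benchmark tables (256 processes).
# Only rows with both a serial and a parallel figure are included.
serial <- c(cor_11000 = 63.18,
            perm_36612_500k = 20750, perm_36612_1m = 41500,
            perm_73224_500k = 35000, perm_73224_1m = 70000)
parallel <- c(cor_11000 = 4.76,
              perm_36612_500k = 73.18, perm_36612_1m = 146.64,
              perm_73224_500k = 148.46, perm_73224_1m = 294.61)

speedup <- serial / parallel
print(round(speedup, 1))
```

The three larger correlation inputs have no serial timing at all, so for them the benefit is that the computation completes rather than a measurable speedup.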

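9 Correlation code comparison
Serial:
    edata <- read.table("largedata.dat")
    # Pearson correlation matrix, computed in a single R process
    pearsonpairwise <- cor(edata)
    write.table(pearsonpairwise, "Correlations.txt")
    quit(save="no")
Parallel (SPRINT):
    library("sprint")
    edata <- read.table("largedata.dat")
    # pcor() replaces cor(); the result is kept in ff_handle
    ff_handle <- pcor(edata)
    # shut down SPRINT before leaving R
    pterminate()
    quit(save="no")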
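To illustrate why a correlation matrix lends itself to parallel execution, the sketch below splits the rows of a small synthetic matrix into blocks and correlates each block against all rows on a separate worker. It uses base R's `parallel` package rather than SPRINT, and it is not a description of how `pcor` is implemented. The matrix size, block count and worker count are arbitrary assumptions.

```r
library(parallel)

set.seed(1)
genes <- matrix(rnorm(200 * 10), nrow = 200)  # synthetic: 200 genes x 10 samples

# Split the row indices into 4 contiguous blocks.
rows   <- seq_len(nrow(genes))
blocks <- split(rows, cut(rows, 4, labels = FALSE))

cl <- makeCluster(2)  # 2 local worker processes
# Each task correlates one block of genes against all genes.
parts <- parLapply(cl, blocks, function(idx, m) {
  cor(t(m[idx, , drop = FALSE]), t(m))
}, m = genes)
stopCluster(cl)

result <- do.call(rbind, parts)  # reassemble the 200 x 200 matrix
```

Each block can be computed without reference to the others, so the work divides cleanly. In this sketch every worker still receives the whole input matrix, and the full result is assembled in one process, so it does not address the memory limits shown in the slides.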

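10 Permutation testing code comparison
Serial:
    library("multtest")  # provides mt.maxT() and the golub data set
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
    quit(save="no")
Parallel (SPRINT):
    library("sprint")
    library("multtest")  # provides the golub data set
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    # pmaxT() replaces mt.maxT() and takes the same arguments here
    resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
    pterminate()
    quit(save="no")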
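For readers unfamiliar with permutation testing, the sketch below shows the general idea on synthetic data: shuffle the class labels many times, record the largest absolute t-statistic each time, and compare the observed statistics against that null distribution. It is a simplified single-step max-T procedure written in base R, not the step-down algorithm of `mt.maxT` and not SPRINT's `pmaxT`. The function name `row_t`, the data sizes and the permutation count are assumptions made for illustration.

```r
set.seed(42)
x      <- matrix(rnorm(50 * 12), nrow = 50)  # synthetic: 50 genes x 12 samples
labels <- rep(c(0, 1), each = 6)             # two classes of 6 samples

# Welch two-sample t-statistic for every row (gene).
row_t <- function(m, cl) {
  a <- m[, cl == 0, drop = FALSE]
  b <- m[, cl == 1, drop = FALSE]
  (rowMeans(b) - rowMeans(a)) /
    sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
}

observed <- abs(row_t(x, labels))

# Null distribution: maximum |t| over all genes for each label shuffle.
B        <- 200
max_null <- replicate(B, max(abs(row_t(x, sample(labels)))))

# Single-step max-T adjusted p-value per gene.
adj_p <- sapply(observed, function(t) mean(max_null >= t))
```

Every shuffle is independent of the others, so the permutations can in principle be divided among processes, and the run time scales with the permutation count as the work load table suggests.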

11 SPRINT
– Website: http://www.r-sprint.org/
– Source code can be downloaded from the website
– Soon also in the CRAN repository
– Mailing list: sprint@lists.ed.ac.uk
– Contact email: sprint@ed.ac.uk

12 Acknowledgements
DPM Team: Peter Ghazal, Thorsten Forster, Muriel Mewissen
EPCC Team: Terry Sloan, Michal Piotrowski, Savvas Petrou, Bartek Dobrzelecki, Jon Hill, Florian Scharinger
This work is supported by the Wellcome Trust and the NAG (Numerical Algorithms Group) dCSE Support service.

13 SPRINT - Demo
Executing the same code in serial and in parallel:
    R --vanilla --slave -f maxT_serial.R
    mpiexec -n 2 R --vanilla --slave -f maxT_parallel.R

