# Isaac Lyngaas John Paige Advised by: Srinath Vadlamani & Doug Nychka SIParCS,

## Presentation on theme: "Isaac Lyngaas John Paige Advised by: Srinath Vadlamani & Doug Nychka SIParCS,"— Presentation transcript:

Isaac Lyngaas (irlyngaas@gmail.com) John Paige (paigejo@gmail.com) Advised by: Srinath Vadlamani (srinathv@ucar.edu) & Doug Nychka (nychka@ucar.edu) SIParCS, July 31, 2014

 Why use HPC with R?  Accelerating mKrig & Krig  Parallel Cholesky ◦ Software Packages  Parallel Eigen Decomposition  Conclusions & Future Works

 Accelerate the ‘Fields’ Krig and mKrig functions  Survey of parallel linear algebra software ◦ Multicore (Shared Memory) ◦ GPU ◦ Xeon Phi

 Many developers & users in the field of Statistics ◦ Readily available code base  Problem: R is slow for large size problems

 Bottleneck in Linear Algebra operations ◦ mKrig – Cholesky Decomposition ◦ Krig – Eigen Decomposition  R uses sequential algorithms  Strategy: Use C interoperable libraries to parallelize linear algebra ◦ C functions callable through R environment

 Symmetric positive definite ->Triangular ◦ A = LL^T ◦ Nice properties for determinant calculation

 PLASMA (Multicore Shared Memory) ◦ http://icl.cs.utk.edu/plasma/ http://icl.cs.utk.edu/plasma/  MAGMA (GPU & Xeon Phi) ◦ http://icl.cs.utk.edu/magma/ http://icl.cs.utk.edu/magma/  CULA (GPU) ◦ http://www.culatools.com/ http://www.culatools.com/

 Multicore (Shared Memory)  Block Scheduling ◦ Determines what operations should be done on which core  Block Size optimization ◦ Dependent on Cache Memory

0 5 Speedup v s. 1 Core 10 15 Plasma using 1 Node (# of Observations = 25000) 8 # of Cores 1241216 Speedup Optimal Speedup

6 7 50040 Mb1000 Block Size 1500 3 4 Time(sec) 5 PLASMA on Dual Socket Sandy Bridge (# of Observations=15000, Core=16) 256 Kb

01000020000 # of Observations 3000040000 100 200 300 400 500 600 PLASMA Optimal Block Sizes (Cores=16) Optimal Block si z e

 Utilizes GPUs or Xeon Phi for parallelization ◦ Multiple GPU & Multiple Xeon Phi implementations available ◦ 1 CPU drives one 1GPU  Block Scheduling ◦ Similar to PLASMA  Block Size dependent on Accelerator Architecture

 CUDA Proprietary linear algebra package  Capable of doing Lapack operations using 1 GPU  API written in C  Dense & Spare operations available

 1 Node of Caldera or Pronghorn ◦ 2 x 8 core Intel Xeon E5-2670 (Sandy Bridge) processors per Node  64 GB RAM (~59 GB available)  Cache Per Core: L1=32Kb, L2 =256Kb  Cache Per Socket: L3=20Mb ◦ 2 x Nvidia Tesla M270Q GPU (Caldera)  ~5.2 GB RAM per device  1 core drives 1 GPU ◦ 2 x Xeon Phi 5110P (Pronghorn)  ~7.4 GB RAM per device

Serial R: ~3 GFLOP/sec Theoretical Peak Performance 16 core Xeon SandyBridge: ~333 GFLOP/sec 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec 1 Xeon Phi 5110P: ~1,011 GFLOP/sec 01000020000 # of Observations 3000040000 0 100 GFLOP/sec 200 300 400 Accelerated Hardware has Room for Improvement Plasma (16 cores) Magma 1 GPU Magma 2 GPUs Magma 1 MIC Magma 2 MICs CULA

All Parallel Cholesky Implementations are Faster than Serial R 20000 # of Observations Time(sec) 0100003000040000 0.01 0.1 1 10 100 1000 Serial R Plasma (16 Cores) CULA Magma 1 GPU Magma 2 GPUs Magma 1 Xeon Phi Magma 2 Xeon Phis >100 Times Speedup over serial R when # of Observations = 10k

~6 Times Speedup over serial R when # of Observations = 10k 0200040006000800010000 0 50 100 Time(sec) 150 200 250 300 Eigendecomposition also Faster on Accelerated Hardware # of Observations Serial R CULA Magma 1 GPU Magma 2 GPUs

Both times taken using MAGMA w/ 2 GPUs 0200040006000800010000 0 5 10 15 20 25 30 Can Run ~30 Cholesky Decompositions per Eigen Decomposition # of Observations Time Eigendecomposition / Time Cholesky

If we want to do 16Cholesky decompositions in parallel, we are guaranteed better performance when speedup >16 05000 0 5 10 15 20 25 10000 # of Observations 1500020000 Parallel Cholesky Beats Parallel R for Moderate to Large Matrices Speedup v s. P a r alle l R Plasma Magma 2 GPUs

 Using Caldera ◦ Single Cholesky Decomposition ◦ Matrix Size < 20k use PLASMA (16 cores w/ optimal block size) ◦ Matrix Size 20k – 35k use MAGMA w/ 2 GPUs ◦ Matrix Size > 35k use PLASMA (16 cores w/ optimal block size)  Dependent on computing resources available

 Explored Implementation on accelerated hardware ◦ GPUs ◦ Multicore (Shared Memory) ◦ Xeon Phis  Installed third party linear algebra packages & programmed wrappers that call these packages from R ◦ Installation instructions and programs available through bitbucket repo for access contact Srinath Vadlamani  Future Work ◦ Multicore Distributed Memory ◦ Single Precision

Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014b. URL: http://CRAN.R-project.org/package=fields. R package version 7.1.http://CRAN.R-project.org/package=fields. Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009. Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. Of VECPAR’10, Berkeley, CA, June22-25, 2010. Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September, 2013.

xPOTRF xTRSMxTRSM xTRSMxTRSMxTRSM xTRSMxTRSMxTRSMxTRSM xTRSMxTRSM xSYRK xGEMM xPOTRFxTRSMxSYRKxGEMM 0123001230 12301230 123123 0101 2 FINAL http://www.netlib.org/lapack/lawnspdf/lawn223.pdf http://www.netlib.org/lapack/lawnspdf/lawn223.pdf

Download ppt "Isaac Lyngaas John Paige Advised by: Srinath Vadlamani & Doug Nychka SIParCS,"

Similar presentations