
1 Isaac Lyngaas, John Paige. Advised by: Srinath Vadlamani & Doug Nychka. SIParCS, July 31, 2014

2  Why use HPC with R?  Accelerating mKrig & Krig  Parallel Cholesky ◦ Software Packages  Parallel Eigen Decomposition  Conclusions & Future Work

3  Accelerate the 'fields' package's Krig and mKrig functions  Survey parallel linear algebra software ◦ Multicore (Shared Memory) ◦ GPU ◦ Xeon Phi

4  Many developers & users in the field of statistics ◦ Readily available code base  Problem: R is slow for large problem sizes

5  Bottleneck is in linear algebra operations ◦ mKrig – Cholesky decomposition ◦ Krig – eigendecomposition  R uses sequential algorithms  Strategy: use C-interoperable libraries to parallelize the linear algebra ◦ C functions callable from the R environment (see the sketch below)
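A minimal sketch of that calling pattern, assuming a compiled shared library parchol.so that exports a C routine par_chol (both names are hypothetical stand-ins, not the authors' actual wrapper):

```r
# Load a compiled C library and call it through R's .C interface.
# "parchol.so" and "par_chol" are hypothetical names for a wrapper
# around a parallel Cholesky routine that factors A in place.
dyn.load("parchol.so")

parChol <- function(A) {
  n <- nrow(A)
  out <- .C("par_chol",
            A = as.double(A),    # matrix passed as a length n*n double vector
            n = as.integer(n))
  matrix(out$A, n, n)            # reshape the factored data back into a matrix
}
```

The same pattern (dyn.load plus .C or .Call) works for any of the libraries surveyed below once a thin C wrapper is compiled against them.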

6  Symmetric positive definite -> triangular ◦ A = LL^T ◦ Nice properties for determinant calculation (see the example below)
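The determinant property is why Cholesky pays off inside mKrig's likelihood: since A = LL^T, det(A) = (prod of diag(L))^2, so the log-determinant is just a sum of logs of the factor's diagonal. A small base-R illustration:

```r
# Log-determinant of an SPD matrix read off its Cholesky factor.
# R's chol() returns the upper-triangular factor U with A = t(U) %*% U.
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
A <- crossprod(X) + diag(5)       # a small SPD test matrix

U <- chol(A)
logdetA <- 2 * sum(log(diag(U)))  # log det(A) = 2 * sum(log(diag(U)))

# Agrees with R's general-purpose determinant():
all.equal(logdetA, as.numeric(determinant(A, logarithm = TRUE)$modulus))
```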

7  PLASMA (Multicore, Shared Memory)  MAGMA (GPU & Xeon Phi)  CULA (GPU)

8  Multicore (Shared Memory)  Block scheduling ◦ Determines which operations run on which core  Block size optimization ◦ Dependent on cache memory (see the sweep sketch below)
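If the wrapper exposes the block (tile) size, the cache-driven tuning shown on the next slides reduces to a simple sweep; plasmaChol() below is a hypothetical R wrapper around PLASMA's Cholesky, not a real function:

```r
# Hypothetical sweep over PLASMA block sizes for a fixed problem size,
# mirroring the block-size experiment on the following slides.
n   <- 15000
A   <- crossprod(matrix(rnorm(n * n), n)) + diag(n)  # SPD test matrix
nbs <- c(64, 128, 192, 256, 384, 512)                # candidate block sizes

times <- sapply(nbs, function(nb) {
  system.time(plasmaChol(A, cores = 16, nb = nb))["elapsed"]
})
nbs[which.min(times)]   # block size with the best wall-clock time
```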

9 [Figure: PLASMA speedup vs. 1 core, using 1 node (# of observations = 25,000); x-axis: # of cores, y-axis: speedup, showing measured speedup against optimal (linear) speedup]

10 [Figure: PLASMA on dual-socket Sandy Bridge (# of observations = 15,000, cores = 16); x-axis: block size, y-axis: time (sec), with L2 (256 KB) and L3 cache boundaries marked]

11 [Figure: PLASMA optimal block sizes (cores = 16); x-axis: # of observations, y-axis: optimal block size]

12  Utilizes GPUs or Xeon Phis for parallelization ◦ Multiple-GPU & multiple-Xeon Phi implementations available ◦ 1 CPU core drives 1 GPU  Block scheduling ◦ Similar to PLASMA  Block size dependent on accelerator architecture

13  Proprietary CUDA-based linear algebra package  Capable of performing LAPACK operations using 1 GPU  API written in C  Dense & sparse operations available

14  1 node of Caldera or Pronghorn ◦ 2 x 8-core Intel Xeon E (Sandy Bridge) processors per node  64 GB RAM (~59 GB available)  Cache per core: L1 = 32 KB, L2 = 256 KB  Cache per socket: L3 = 20 MB ◦ 2 x Nvidia Tesla M2070Q GPUs (Caldera)  ~5.2 GB RAM per device  1 core drives 1 GPU ◦ 2 x Xeon Phi 5110P (Pronghorn)  ~7.4 GB RAM per device

15 Accelerated Hardware Has Room for Improvement  Serial R: ~3 GFLOP/sec  Theoretical peak performance: ◦ 16-core Xeon Sandy Bridge: ~333 GFLOP/sec ◦ 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec ◦ 1 Xeon Phi 5110P: ~1,011 GFLOP/sec [Figure: achieved GFLOP/sec vs. # of observations for PLASMA (16 cores), MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 MIC, MAGMA 2 MICs, and CULA]
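The achieved GFLOP/sec numbers follow from wall-clock time and the standard ~n^3/3 flop count for a Cholesky factorization; a sketch of how one such data point is measured (the size here is illustrative):

```r
# Achieved GFLOP/sec for one Cholesky factorization, using the
# standard n^3/3 flop count for dpotrf.
n <- 2000
A <- crossprod(matrix(rnorm(n * n), n)) + diag(n)  # SPD test matrix

elapsed <- system.time(chol(A))["elapsed"]
gflops  <- (n^3 / 3) / elapsed / 1e9
gflops   # compare against the ~3 GFLOP/sec quoted above for serial R
```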

16 All Parallel Cholesky Implementations Are Faster than Serial R [Figure: time (sec) vs. # of observations for serial R, PLASMA (16 cores), CULA, MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 Xeon Phi, and MAGMA 2 Xeon Phis] >100x speedup over serial R when # of observations = 10k

17 Eigendecomposition Is Also Faster on Accelerated Hardware [Figure: time (sec) vs. # of observations for serial R, CULA, MAGMA 1 GPU, and MAGMA 2 GPUs] ~6x speedup over serial R when # of observations = 10k

18 Can Run ~30 Cholesky Decompositions per Eigendecomposition [Figure: ratio of eigendecomposition time to Cholesky time vs. # of observations; both times taken using MAGMA w/ 2 GPUs]

19 Parallel Cholesky Beats Parallel R for Moderate to Large Matrices  If we instead ran 16 Cholesky decompositions in parallel (one serial R factorization per core), a single accelerated factorization is guaranteed to give better throughput once its speedup over serial R exceeds 16 (see the sketch below) [Figure: speedup vs. parallel R as a function of # of observations, for PLASMA and MAGMA 2 GPUs]
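The "parallel R" baseline can be reproduced with the parallel package: run 16 independent serial factorizations simultaneously, one per core, and compare against pushing the same 16 factorizations back to back through an accelerated routine. acceleratedChol() below is a hypothetical wrapper for the PLASMA/MAGMA calls:

```r
library(parallel)

n  <- 5000
As <- replicate(16, crossprod(matrix(rnorm(n * n), n)) + diag(n),
                simplify = FALSE)          # 16 independent SPD matrices

# "Parallel R": 16 serial chol() calls running at once, one per core.
tR <- system.time(mclapply(As, chol, mc.cores = 16))["elapsed"]

# Accelerated library: factorizations run one after another, but each
# uses all 16 cores (or the GPUs). acceleratedChol() is hypothetical.
tAcc <- system.time(lapply(As, acceleratedChol))["elapsed"]

tR / tAcc   # > 1 once the accelerated speedup over serial R exceeds 16
```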

20  Using Caldera ◦ Single Cholesky decomposition ◦ Matrix size < 20k: use PLASMA (16 cores w/ optimal block size) ◦ Matrix size 20k – 35k: use MAGMA w/ 2 GPUs ◦ Matrix size > 35k: use PLASMA (16 cores w/ optimal block size)  The best choice depends on the computing resources available (see the dispatch sketch below)
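Wired into mKrig, the recommendation amounts to a size-based dispatch; plasmaChol() and magmaChol2gpu() below are hypothetical wrapper names, not real functions:

```r
# Size-based backend dispatch following the Caldera recommendation above.
# plasmaChol() and magmaChol2gpu() stand in for R wrappers around the
# PLASMA and MAGMA Cholesky routines.
fastChol <- function(A) {
  n <- nrow(A)
  if (n >= 20000 && n <= 35000) {
    magmaChol2gpu(A)            # fits in the two GPUs' memory; fastest here
  } else {
    plasmaChol(A, cores = 16)   # small problems, and ones too big for GPU RAM
  }
}
```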

21  Explored implementations on accelerated hardware ◦ GPUs ◦ Multicore (shared memory) ◦ Xeon Phis  Installed third-party linear algebra packages & programmed wrappers that call them from R ◦ Installation instructions and programs are available through a Bitbucket repo; for access, contact Srinath Vadlamani  Future work ◦ Multicore distributed memory ◦ Single precision

22 Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data. R package version 7.1, 2014. URL: http://CRAN.R-project.org/package=fields.
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180. IOP Publishing.
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR'10, Berkeley, CA, June 22-25, 2010.
Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September 2013.

23 [Diagram: tiled Cholesky task DAG — xPOTRF factors the diagonal tile, xTRSM solves the panel tiles below it, xSYRK and xGEMM update the trailing submatrix, and the (xPOTRF, xTRSM, xSYRK, xGEMM) pattern repeats on the trailing submatrix until the final tile]
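The diagram above is the tiled algorithm those kernel names spell out. A plain-R blocked (right-looking) Cholesky makes the four kernels explicit; this is an illustrative sketch of the algorithm, not the PLASMA scheduler:

```r
# Blocked right-looking Cholesky, A = L %*% t(L), annotated with the
# LAPACK kernel each step corresponds to. Illustrative only.
blockChol <- function(A, nb = 256) {
  n <- nrow(A)
  L <- matrix(0, n, n)
  for (k in seq(1, n, by = nb)) {
    kk <- k:min(k + nb - 1, n)                     # current block column
    Lkk <- t(chol(A[kk, kk]))                      # xPOTRF: factor diagonal block
    L[kk, kk] <- Lkk
    if (max(kk) < n) {
      rest <- (max(kk) + 1):n
      Lik <- t(forwardsolve(Lkk, t(A[rest, kk])))  # xTRSM: solve the panel
      L[rest, kk] <- Lik
      # Trailing update: xSYRK on diagonal blocks, xGEMM off the diagonal.
      # Tiled libraries split this into independent per-tile tasks (the DAG).
      A[rest, rest] <- A[rest, rest] - tcrossprod(Lik)
    }
  }
  L
}

# Check against base R on a small SPD matrix:
A <- crossprod(matrix(rnorm(600 * 600), 600)) + diag(600)
all.equal(blockChol(A), t(chol(A)))   # TRUE
```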

