Published by Janet Simeon. Modified over 2 years ago.

1
Isaac Lyngaas (irlyngaas@gmail.com)
John Paige (paigejo@gmail.com)
Advised by: Srinath Vadlamani (srinathv@ucar.edu) & Doug Nychka (nychka@ucar.edu)
SIParCS, July 31, 2014

2
Why use HPC with R?
Accelerating mKrig & Krig
Parallel Cholesky
◦ Software Packages
Parallel Eigen Decomposition
Conclusions & Future Work

3
Accelerate the ‘Fields’ Krig and mKrig functions
Survey of parallel linear algebra software
◦ Multicore (Shared Memory)
◦ GPU
◦ Xeon Phi

4
Many developers & users in the field of Statistics
◦ Readily available code base
Problem: R is slow for large problems

5
Bottleneck in linear algebra operations
◦ mKrig – Cholesky Decomposition
◦ Krig – Eigen Decomposition
R uses sequential algorithms
Strategy: Use C-interoperable libraries to parallelize linear algebra
◦ C functions callable from the R environment

6
Symmetric positive definite -> Triangular
◦ A = LL^T
◦ Nice properties for determinant calculation
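
The determinant property follows from A = LL^T: det(A) = det(L)^2, and det(L) is the product of the diagonal entries of L, so log det(A) = 2 * sum(log L_ii). The following is a minimal pure-Python sketch of that idea, not the code used in the talk. The function names and the 3 x 3 test matrix are made up for illustration.

```python
import math

def cholesky(a):
    """Return lower-triangular L with a = L L^T.

    `a` is a symmetric positive definite matrix given as a list of lists.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what is left of a[j][j] after earlier columns.
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            dot = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - dot) / L[j][j]
    return L

def log_det_from_cholesky(L):
    """log det(A) = 2 * sum(log L_ii), since det(A) = det(L)^2."""
    return 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))

# Hypothetical example matrix (not from the slides); det(A) = 64.
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)
log_det = log_det_from_cholesky(L)
```

Working with the log of the determinant avoids overflow when the matrix is large, which is the usual reason to compute it from the Cholesky factor.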

7
PLASMA (Multicore Shared Memory)
◦ http://icl.cs.utk.edu/plasma/
MAGMA (GPU & Xeon Phi)
◦ http://icl.cs.utk.edu/magma/
CULA (GPU)
◦ http://www.culatools.com/

8
Multicore (Shared Memory)
Block Scheduling
◦ Determines what operations should be done on which core
Block Size optimization
◦ Dependent on Cache Memory
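
To make the block idea concrete, the sketch below lists the kernel calls of a right-looking tile Cholesky on an nt x nt grid of blocks, using the kernel names POTRF, TRSM, SYRK, and GEMM. It only enumerates the work that a scheduler would have to place on cores; it does not perform any scheduling or arithmetic and is not PLASMA's implementation. The function names are hypothetical.

```python
def tile_cholesky_tasks(nt):
    """List kernel calls (name, tile row, tile column) for an nt x nt tile grid."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", k, k))         # factor the diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("TRSM", i, k))      # solve tiles below the diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("SYRK", i, i))      # update a trailing diagonal tile
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j))  # update a trailing off-diagonal tile
    return tasks

def count_by_kernel(tasks):
    """Count how many times each kernel appears in the task list."""
    counts = {}
    for name, _, _ in tasks:
        counts[name] = counts.get(name, 0) + 1
    return counts

counts = count_by_kernel(tile_cholesky_tasks(4))
print(counts)  # {'POTRF': 4, 'TRSM': 6, 'SYRK': 6, 'GEMM': 4}
```

A smaller block size gives more tiles and therefore more independent tasks to spread over cores, while a larger block size gives each kernel call more work per byte loaded. That trade-off is one reason the best block size depends on the cache.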

9
[Figure: PLASMA using 1 node (# of observations = 25000). x-axis: # of cores (1, 2, 4, 8, 12, 16); y-axis: speedup vs. 1 core (0 to 15); curves: Speedup, Optimal Speedup.]

10
[Figure: PLASMA on dual-socket Sandy Bridge (# of observations = 15000, cores = 16). x-axis: block size (500 to 1500); y-axis: time (sec), 3 to 7; markers at 256 KB and 40 MB.]

11
[Figure: PLASMA optimal block sizes (cores = 16). x-axis: # of observations (0 to 40000); y-axis: optimal block size (100 to 600).]

12
Utilizes GPUs or Xeon Phi for parallelization
◦ Multiple GPU & Multiple Xeon Phi implementations available
◦ 1 CPU core drives 1 GPU
Block Scheduling
◦ Similar to PLASMA
Block Size dependent on Accelerator Architecture

13
CUDA-based proprietary linear algebra package
Capable of doing LAPACK operations using 1 GPU
API written in C
Dense & Sparse operations available

14
1 Node of Caldera or Pronghorn
◦ 2 x 8-core Intel Xeon E5-2670 (Sandy Bridge) processors per node
  64 GB RAM (~59 GB available)
  Cache per core: L1 = 32 KB, L2 = 256 KB
  Cache per socket: L3 = 20 MB
◦ 2 x Nvidia Tesla M2070Q GPU (Caldera)
  ~5.2 GB RAM per device
  1 core drives 1 GPU
◦ 2 x Xeon Phi 5110P (Pronghorn)
  ~7.4 GB RAM per device

15
Serial R: ~3 GFLOP/sec
Theoretical Peak Performance
16-core Xeon Sandy Bridge: ~333 GFLOP/sec
1 Nvidia Tesla M2070Q: ~512 GFLOP/sec
1 Xeon Phi 5110P: ~1,011 GFLOP/sec
[Figure: Accelerated Hardware has Room for Improvement. x-axis: # of observations (0 to 40000); y-axis: GFLOP/sec (0 to 400); series: Plasma (16 cores), Magma 1 GPU, Magma 2 GPUs, Magma 1 MIC, Magma 2 MICs, CULA.]
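
As a rough check on these numbers, the sketch below shows how a theoretical peak and an achieved rate can be computed. The 2.6 GHz clock and 8 double-precision operations per cycle per core are assumptions about the E5-2670 that do not appear on the slides, and the timing in the example is hypothetical. The operation count n^3 / 3 is the usual leading-order estimate for a Cholesky factorization.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/sec: cores x clock (GHz) x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle

def cholesky_gflops(n, seconds):
    """Achieved GFLOP/sec for an n x n Cholesky, counting about n^3 / 3 flops."""
    return (n ** 3 / 3.0) / seconds / 1e9

# Assumed hardware parameters (not stated on the slides).
sandy_bridge_peak = peak_gflops(cores=16, clock_ghz=2.6, flops_per_cycle=8)

# Hypothetical timing: a 10000 x 10000 factorization that takes 2 seconds.
example_rate = cholesky_gflops(n=10000, seconds=2.0)
fraction_of_peak = example_rate / sandy_bridge_peak
```

Under these assumptions the peak comes out at about 333 GFLOP/sec, in line with the figure quoted for the 16-core node, and the hypothetical run would reach roughly half of it.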

16
All Parallel Cholesky Implementations are Faster than Serial R
>100 Times Speedup over serial R when # of Observations = 10k
[Figure: x-axis: # of observations (0 to 40000); y-axis: time (sec), log scale from 0.01 to 1000; series: Serial R, Plasma (16 cores), CULA, Magma 1 GPU, Magma 2 GPUs, Magma 1 Xeon Phi, Magma 2 Xeon Phis.]

17
Eigendecomposition also Faster on Accelerated Hardware
~6 Times Speedup over serial R when # of Observations = 10k
[Figure: x-axis: # of observations (0 to 10000); y-axis: time (sec), 0 to 300; series: Serial R, CULA, Magma 1 GPU, Magma 2 GPUs.]

18
Can Run ~30 Cholesky Decompositions per Eigen Decomposition
Both times taken using MAGMA w/ 2 GPUs
[Figure: x-axis: # of observations (0 to 10000); y-axis: Time Eigendecomposition / Time Cholesky (0 to 30).]

19
Parallel Cholesky Beats Parallel R for Moderate to Large Matrices
If we want to do 16 Cholesky decompositions in parallel, we are guaranteed better performance when speedup > 16
[Figure: x-axis: # of observations (0 to 20000); y-axis: speedup vs. parallel R (0 to 25); series: Plasma, Magma 2 GPUs.]
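
The break-even point of 16 can be seen with a small calculation. The sketch assumes that "parallel R" means 16 serial factorizations running at once, one per core, with perfect scaling, and that the accelerated library runs the 16 factorizations one after another. Both are simplifying assumptions for illustration; the function name and timings are hypothetical.

```python
def batch_times(t_serial, speedup, n_jobs=16):
    """Return (time with parallel R, time with the accelerated library).

    t_serial: seconds for one serial factorization
    speedup:  accelerated library's speedup over one serial factorization
    """
    # Parallel R: all jobs run concurrently, so the batch takes one serial time.
    t_parallel_r = t_serial
    # Accelerated: each job is faster, but the jobs run in sequence.
    t_accelerated = n_jobs * t_serial / speedup
    return t_parallel_r, t_accelerated

# Hypothetical timings: 32 seconds per serial factorization.
even = batch_times(t_serial=32.0, speedup=16.0)    # both batches take 32 s
faster = batch_times(t_serial=32.0, speedup=20.0)  # accelerated batch takes 25.6 s
```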

20
Using Caldera
◦ Single Cholesky Decomposition
◦ Matrix Size < 20k: use PLASMA (16 cores w/ optimal block size)
◦ Matrix Size 20k – 35k: use MAGMA w/ 2 GPUs
◦ Matrix Size > 35k: use PLASMA (16 cores w/ optimal block size)
Dependent on computing resources available
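
The size thresholds above could be wrapped in a small dispatch helper. This is an illustrative sketch only: the function name and return strings are hypothetical, and the thresholds apply to a single Cholesky decomposition on one Caldera node as reported here, not to other hardware.

```python
def choose_cholesky_backend(n):
    """Pick a library for one n x n Cholesky on a Caldera node (slide thresholds)."""
    if n < 20_000:
        return "PLASMA (16 cores)"
    if n <= 35_000:
        return "MAGMA (2 GPUs)"
    return "PLASMA (16 cores)"

choices = [choose_cholesky_backend(n) for n in (10_000, 25_000, 40_000)]
```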

21
Explored implementation on accelerated hardware
◦ GPUs
◦ Multicore (Shared Memory)
◦ Xeon Phis
Installed third-party linear algebra packages & programmed wrappers that call these packages from R
◦ Installation instructions and programs available through a Bitbucket repo; for access, contact Srinath Vadlamani
Future Work
◦ Multicore Distributed Memory
◦ Single Precision

22
Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014b. URL: http://CRAN.R-project.org/package=fields. R package version 7.1.
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR’10, Berkeley, CA, June 22-25, 2010.
Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September, 2013.

23
[Figure: tile Cholesky kernels xPOTRF, xTRSM, xSYRK, and xGEMM; numeric labels not recoverable. Source: http://www.netlib.org/lapack/lawnspdf/lawn223.pdf]
