
1 Thinking Outside of the Tera-Scale Box Piotr Luszczek

2 Brief History of Tera-flop: 1997
–1997: ASCI Red

3 Brief History of Tera-flop: 2007
–1997: ASCI Red
–2007: Intel Polaris

4 Brief History of Tera-flop: GPGPU
–1997: ASCI Red
–2007: Intel Polaris
–2008: GPGPU

5 Tera-flop: a Dictionary Perspective
–Belongs to a supercomputer expert: consumes 500 kW, costs over $1M, requires distributed memory programming, memory 1 GiB or more
–Belongs to a gaming geek: consumes under 500 W, costs under $200, requires only a CUDA kernel, memory 1 GiB or more

6 Double-Dip Recession? [Chart: productivity over time, with dips around 2005 and 2007]
–Multicore: Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, …
–Hybrid Multicore: HMPP, PGI Accelerator, PathScale?

7 Locality Space in HPC Challenge [Diagram: HPC Challenge benchmarks arranged by locality]
–HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess

8 Locality Space in HPC Challenge: Bounds [Diagram: the same locality space annotated with performance bounds]
–Benchmarks: HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess
–Bounds: latency-bound, bandwidth-bound, compute-bound
–Also shown: AORSA2D, PCIe

9 CUDA vs. OpenCL: Similar Software Stacks

10 Matrix-Matrix Multiply: the Simple Form
–for i = 1:M
–  for j = 1:N
–    for k = 1:T
–      C(i, j) += A(i, k) * B(k, j)
–CUBLAS, Volkov et al., MAGMA, Nakasato
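A minimal CUDA sketch of this simple form is shown below, for reference only; it is not the tuned CUBLAS/Volkov/MAGMA code the slide compares against, and the kernel name, block size, and row-major layout are assumptions.

// Naive matrix-matrix multiply: one thread computes one element of C.
// C is M x N, A is M x T, B is T x N, all row-major (assumed layout).
__global__ void sgemm_simple(int M, int N, int T,
                             const float *A, const float *B, float *C)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (i < M && j < N) {
        float sum = 0.0f;
        for (int k = 0; k < T; ++k)
            sum += A[i * T + k] * B[k * N + j];
        C[i * N + j] += sum;                        // matches C(i,j) += ...
    }
}

// Host-side launch sketch (device buffers dA, dB, dC assumed allocated):
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// sgemm_simple<<<grid, block>>>(M, N, T, dA, dB, dC);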

11 Fermi C2050: from CUDA to OpenCL

12 Fermi C2050: Simple OpenCL Porting
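To illustrate how mechanical a "simple OpenCL porting" of that kernel can be, here is a hedged sketch of the same naive kernel written in OpenCL C and embedded as a host-side string; the kernel name and index mapping mirror the CUDA sketch above and are assumptions, not the code actually benchmarked on the C2050.

// OpenCL C source for the same naive kernel; the blockIdx/threadIdx
// arithmetic from the CUDA version collapses into get_global_id().
const char *sgemm_simple_cl =
"__kernel void sgemm_simple(int M, int N, int T,              \n"
"                           __global const float *A,          \n"
"                           __global const float *B,          \n"
"                           __global float *C)                \n"
"{                                                            \n"
"    int i = get_global_id(1);   /* row of C    */            \n"
"    int j = get_global_id(0);   /* column of C */            \n"
"    if (i < M && j < N) {                                    \n"
"        float sum = 0.0f;                                    \n"
"        for (int k = 0; k < T; ++k)                          \n"
"            sum += A[i * T + k] * B[k * N + j];              \n"
"        C[i * N + j] += sum;                                 \n"
"    }                                                        \n"
"}                                                            \n";

Most of the porting effort lands on the host side, where clCreateProgramWithSource, clBuildProgram, clSetKernelArg, and clEnqueueNDRangeKernel replace the CUDA <<<grid, block>>> launch.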

13 ATI 5870: Simple OpenCL Porting

14 Making a Case for Autotuning: Use of Texture

15 Making a Case for Autotuning: Blocking Factors Provided by NVIDIA

16 Mat-Mul Portability Summary
Achieved performance (single precision / double precision):
–ATI OpenCL: 52% / 57%
–ATI IL: 74% / 87%
–NVIDIA CUDA: 63% / 58%
–NVIDIA OpenCL: 57% / 49%
–Multicore (vendor): 79%
–Multicore (autotuned): 46% / 40%

17 Now, let’s think outside of the box

18 Recipe for Performance Before Multicore
–Wider superscalar instruction issue
–Increase clock frequency
–Increase number of transistors
–Load/store, instruction scheduling

19 Recipe for Performance After Multicore
–More cores
–Parallel software

20 Recipe for Future Performance
–More cores
–Better sequential software
–DAG runtime

21 From Assembly to DAG: PLASMA
Typical loop in assembly (a sequential instruction stream, scheduled instruction by instruction):
–load R0, #1000
–divf F1, F2, F3
–load R1, #1000
–divf F2, F4, F5
–dec R1
–jump …
LU factorization (a sequential task stream of GETRF2, TRSM, and GEMM tasks, scheduled as a DAG):
–for i = 1:N
–  GETRF2(A[i, i])
–  for j = i+1:N
–    TRSM(A[i, i], A[i, j])
–    for k = i+1:N
–      GEMM(A[k, i], A[i, j], A[k, j])
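A minimal sketch of what such a sequential task stream might look like when handed to a DAG runtime is given below; insert_task, the access tags, the Tile type, and runtime_wait_all are hypothetical placeholders for illustration, not the actual PLASMA/QUARK interface.

// Hypothetical task-insertion sketch (placeholder API, not PLASMA/QUARK):
// the loop runs sequentially and only *describes* tile tasks and their data
// accesses; a DAG runtime executes them in parallel once dependencies allow.

typedef struct { float *data; } Tile;           /* placeholder tile type     */
enum access { ACCESS_IN, ACCESS_OUT, ACCESS_INOUT };

/* Placeholder: a real runtime would record the task and its accesses
   in the DAG instead of discarding them. */
static void insert_task(const char *kernel, ...) { (void)kernel; }
static void runtime_wait_all(void) {}           /* placeholder: drain DAG    */

void tile_lu(Tile **A, int N)                   /* A: N x N grid of tiles    */
{
    for (int i = 0; i < N; ++i) {
        insert_task("GETRF2", &A[i][i], ACCESS_INOUT);
        for (int j = i + 1; j < N; ++j) {
            insert_task("TRSM", &A[i][i], ACCESS_IN, &A[i][j], ACCESS_INOUT);
            for (int k = i + 1; k < N; ++k)
                insert_task("GEMM", &A[k][i], ACCESS_IN,
                                    &A[i][j], ACCESS_IN,
                                    &A[k][j], ACCESS_INOUT);
        }
    }
    runtime_wait_all();
}

The loop itself stays sequential and easy to read; the parallelism comes from the runtime executing tasks out of order as soon as their tile dependencies are satisfied.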

22 Scalability of LU Factorization on Multicore [Chart: PLASMA vs. MKL 11.0 vs. LAPACK]
–AMD Istanbul 2.8 GHz; 6 cores = 1 socket, 24 cores = 4 sockets, 48 cores = 8 sockets

23 Combining Multicore and GPU for QR Factorization
–AMD Istanbul 2.8 GHz, 24 cores, and Fermi GTX 480

24 Performance on Distributed Memory [Charts: QR, LU, and Cholesky; DPLASMA vs. Cray LibSci vs. peak]
–ORNL Kraken Cray XT5: AMD Istanbul 2.6 GHz, dual-socket 6-core; interconnect: SeaStar, 3D torus

25 People and Projects
–Jack Dongarra, Mathieu Faverge, Jakub Kurzak, Hatem Ltaief, Stan Tomov, Asim YarKhan, Peng Du, Rick Weber
–Projects: MAGMA, PLASMA
–DPLASMA and DAGuE: George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier

