August 2013. Accelerated Computing: CPU Optimized for Serial Tasks, GPU Accelerator Optimized for Parallel Tasks. 10x Performance, 5x Energy Efficiency.

10x Accelerator Performance Advantage
  SPECFEM3D (Wave Propagation): 8.8x
  Chroma (Lattice QCD): 10.2x
  AMBER (Molecular Dynamics): 7.1x
  WS-LMS (Material Science): 3.2x
  2x CPU = 2x Sandy Bridge E5-2687, 3.10 GHz
  1x Tesla K20X + 1x CPU = 1x Tesla K20 GPU; 1x Sandy Bridge E5-2687, 3.10 GHz

Unprecedented Value to Scientific Computing
  AMBER Molecular Dynamics Simulation, JAC NVE Benchmark
  1 Tesla K20X GPU: 90 ns/day
  12 Sandy Bridge CPUs: 41 ns/day

GPU Accelerated Computing Growing Fast
  “Intel is not taking share away from NVIDIA but rather both are expanding the use of accelerators.”
  Intersect360 Research, HPC User Site Census, July 2013
  [Chart: Systems with Accelerators]

How GPU Acceleration Works
  Application code is split between the GPU and the CPU: the compute-intensive functions (5% of the code) run on the GPU, and the rest of the sequential code runs on the CPU.
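
The slide above moves a small share of the source (5% of the code) to the GPU. How much the whole application gains depends on what share of the runtime those functions account for, which the slide does not state. One common way to reason about this is Amdahl's law, which the deck does not name; the sketch below is only an illustration. The function name `overall_speedup` and any fractions passed to it are hypothetical.

```c
/* Amdahl's law (an illustration, not taken from the slides):
   overall speedup when a fraction f of the runtime (0 <= f <= 1)
   is accelerated by a factor s (s > 0) and the rest is unchanged. */
double overall_speedup(double f, double s)
{
    double unaccelerated = 1.0 - f;  /* part that stays on the CPU */
    double accelerated   = f / s;    /* part offloaded to the GPU  */
    return 1.0 / (unaccelerated + accelerated);
}
```

For instance, if the offloaded functions made up 90% of the runtime and ran 10x faster, `overall_speedup(0.9, 10.0)` would give roughly 5.3x for the whole application, because the remaining 10% of sequential CPU work is untouched.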

3 Ways to Program GPUs
  Applications
  Libraries: “Drop-in” Acceleration
  OpenACC Directives: Easily Accelerate Applications
  Programming Languages: Maximum Flexibility

Kepler: Fastest, Most Efficient HPC Architecture Ever
  SMX: 3x Performance per Watt
  Hyper-Q: Easy Speed-up for Legacy MPI Apps
  Dynamic Parallelism: Parallel Programming Made Easier than Ever

Tesla Kepler Family: World’s Fastest and Most Efficient HPC Accelerators
  Columns: GPU; Single Precision Peak (SGEMM); Double Precision Peak (DGEMM); Memory Size; Memory Bandwidth (ECC off); System Solution
  Weather & Climate, Physics, BioChemistry, CAE, Material Science:
    K20X; 3.95 TF (2.90 TF); 1.32 TF (1.22 TF); 6 GB; 250 GB/s; Server only
    K20; 3.52 TF (2.61 TF); 1.17 TF (1.10 TF); 5 GB; 208 GB/s; Server + Workstation
  Image, Signal, Video, Seismic:
    K10; 4.58 TF; 0.19 TF; 8 GB; 320 GB/s; Server only

Strong CUDA GPU Roadmap
  [Chart: DP GFLOPS per Watt (log scale, 0.5 to 32) by year, 2008 to 2014]
  Architectures: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (Unified Virtual Memory), Volta (Stacked DRAM)

GPU Accelerated Applications

Top Scientific Apps
  Computational Chemistry: AMBER, CHARMM, GROMACS, LAMMPS, NAMD, DL_POLY
  Material Science: QMCPACK, Quantum Espresso, GAMESS-US, Gaussian, NWChem, VASP
  Climate & Weather: COSMO, GEOS-5, CAM-SE, NIM, WRF
  Physics: Chroma, Denovo, GTC, GTS, ENZO, MILC
  CAE: ANSYS Mechanical, MSC Nastran, SIMULIA Abaqus, ANSYS Fluent, OpenFOAM, LS-DYNA
Explosive Growth of GPU Accelerated Apps
  [Chart: # of Apps; series: Accelerated, In Development; labels: 40% Increase, 61% Increase]

207 GPU-Accelerated Applications www.nvidia.com/appscatalog

Performance on Leading Scientific Applications
  [Chart: K20X relative performance vs. dual-socket Sandy Bridge (baseline 1x)]
  2x CPU = 2x Sandy Bridge E5-2687, 3.10 GHz
  1x Tesla K20X + 1x CPU = 1x Tesla K20 GPU; 1x Sandy Bridge E5-2687, 3.10 GHz

Applications Scale to 1000s of GPUs
  [Charts: Weak Scaling; Strong Scaling]

More on Kepler

Kepler Features Make GPU Coding Easier
  Hyper-Q: Speed Up Legacy MPI Apps
    Fermi: 1 Work Queue (CPU to Fermi GPU)
    Kepler: 32 Concurrent Work Queues (CPU to Kepler GPU)
  Dynamic Parallelism: Less Back-Forth, Simpler Code

GPU Coding Made Easier & More Efficient
  Hyper-Q: 32 MPI jobs per GPU. Easy Speed-up for Legacy MPI Apps
  Dynamic Parallelism: GPU Generates Own Work. Less Effort, Higher Performance
  [Chart labels: 3x, 2x]

Tesla Kepler Family of Accelerators: World’s Fastest, Most Energy Efficient Accelerator
  [Bubble chart: size of the bubble = Double Precision Perf. (TFlops)]
  Sandy Bridge E5-2690: 0.17
  TESLA K10: 0.19
  Intel Xeon Phi: 0.83
  TESLA K20: 1.1
  TESLA K20X: 1.22
  Tesla K20X vs Xeon CPU: 8x Faster SGEMM, 6x Faster DGEMM
  Tesla K20X vs Xeon Phi: 70% Faster SGEMM, 50% Faster DGEMM

Fastest, Most Energy Efficient Supercomputers
  World’s Fastest Open Science Supercomputer: 18,688 Tesla K20X GPU Accelerators; 27 Petaflops Peak; 90% of Performance from GPUs
  World’s Most Energy Efficient Supercomputer: 128 Tesla K20 GPU Accelerators; 3150 MFLOPS/Watt; $100k Energy & 300 Tons of CO2 Saving Per Year

GPU Programming

GPU Accelerated Libraries: “Drop-in” Acceleration for your Applications
  Linear Algebra: FFT, BLAS, SPARSE, Matrix
  Numerical & Math: RAND, Statistics
  Data Struct. & AI: Sort, Scan, Zero Sum
  Visual Processing: Image & Video
  Libraries shown: NVIDIA cuFFT, cuBLAS, cuSPARSE; NVIDIA Math Lib; NVIDIA cuRAND; NVIDIA NPP; NVIDIA Video Encode; GPU AI – Board Games; GPU AI – Path Finding

OpenACC: Open, Simple, Portable
  Open Standard
  Easy, Compiler-Driven Approach
  Portable on GPUs and Xeon Phi
    main() {
      …
      …
      #pragma acc kernels
      {
      }
      …
    }
  Compiler Hint: the #pragma acc kernels line
  CAM-SE Climate: 6x Faster on GPU (Top Kernel: 50% of Runtime)
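
To make the directive approach concrete, here is a minimal sketch that applies `#pragma acc kernels` to the SAXPY loop used elsewhere in this deck. The function name `saxpy_acc`, the `restrict` qualifiers, and the data clauses are assumptions added for illustration; the slide itself shows only an empty `kernels` region.

```c
/* SAXPY (y[i] = a*x[i] + y[i]) with an OpenACC compiler hint.
   Hypothetical example: name and clauses are not from the slide. */
void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    /* copyin: x is only read on the accelerator.
       copy:   y is read on the accelerator and copied back afterwards. */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```

A compiler without OpenACC support ignores the pragma and runs the loop serially, so the same source stays valid C either way; this is the sense in which the directive is a hint rather than a rewrite.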

GPUs: Programmable in C, C++, Fortran, Python
  Standard C Code:
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);
  Parallel C Code:
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        // One thread per element: compute this thread's global index
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel w/ 256 threads/blk
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
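
The index arithmetic in the parallel version can be checked on a CPU. The sketch below is not CUDA: it emulates the launch with two ordinary loops, passing the built-ins (`blockIdx.x`, `blockDim.x`, `threadIdx.x`) as plain arguments. The names `saxpy_thread` and `saxpy_emulated_launch` are hypothetical, and real GPU threads run concurrently rather than in loop order.

```c
/* Body of the slide's saxpy_parallel kernel, with the CUDA built-ins
   replaced by ordinary parameters (CPU emulation, an assumption). */
static void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                         int n, float a, const float *x, float *y)
{
    int i = block_idx * block_dim + thread_idx;  /* global element index */
    if (i < n)                 /* guard: the last block may run past n */
        y[i] = a * x[i] + y[i];
}

/* Serial stand-in for a launch of nblocks blocks of block_dim threads.
   Returns the number of blocks that were "launched". */
int saxpy_emulated_launch(int n, float a, const float *x, float *y,
                          int block_dim)
{
    /* Ceiling division, the general form of (n + 255) / 256 */
    int nblocks = (n + block_dim - 1) / block_dim;
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < block_dim; ++t)
            saxpy_thread(b, block_dim, t, n, a, x, y);
    return nblocks;
}
```

With n = 1000 and 256 threads per block, the ceiling division gives 4 blocks (1,024 threads), so the `if (i < n)` guard keeps the last 24 threads from writing past the end of the arrays.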

Get Started with GPU Programming
  Get CUDA, Access Tools, Learn with Tutorials, Join the Community
  Watch, Explore: bit.ly/GPUGetStarted, developer.nvidia.com/get-started-parallel-computing

Develop on GeForce, Deploy on Tesla
  GeForce GTX Titan: Designed for Gamers & Developers
    1+ Teraflop Double Precision Performance; Dynamic Parallelism; Hyper-Q for CUDA Streams; Available Everywhere!
  Tesla K20X/K20: Designed for Cluster Deployment
    ECC; 24x7 Runtime; GPU Monitoring; Cluster Management; GPUDirect-RDMA; Hyper-Q for MPI; 3 Year Warranty; Integrated OEM Systems, Professional Support

Competition

If it’s so easy, why aren’t there 1000s of applications already?

Developers: Not as Easy as Intel Says
  “We have also discovered the necessity in making use of the long vector units available on Xeon Phi… cannot be easily optimized automatically by the compiler.”
  “Frankly, I am not that enamored of having to deal with three levels of parallelism.” (referring to MPI + OpenMP + SIMD vectorization)
  “If you have CPU-centric code, run it on a CPU as the Phi will always lose. If you have a GPU-centric code (data-parallelism and such), run it on a GPU as the Phi will always lose. Phi: What is it good for?”
  Quoted sources:
    Ian Wainwright, Developer at HPC Sweden Consulting
    Professor Steven Gottlieb, Indiana University (NERSC HEP Requirements Review, Nov 2012)
    Meng, et al, University of Utah (Preliminary Experiences with the Uintah Framework on Intel Xeon Phi and Stampede)

GPUs: Always Faster than Xeon Phi
  [Chart: Tokyo Tech CFD Code Benchmark]
  Source: Tokyo Tech Presentation: Application Performances on Many-core Processors, Xeon Phi versus Kepler GPU

Independent Tests from Xeon Phi Users
GPUs: 2-20x Faster Than Optimized Xeon Phi Apps
  Tokyo Tech: 20X
  Univ. of Warwick: 4X
  CGGVeritas: 2X
  Georgia Tech: 2X
  C’t Mag: 1.6X
  NREL: 1.5X

Backup

Developer Momentum Continues to Grow
  2008: 100M CUDA-Capable GPUs; 150K CUDA Downloads; 1 Supercomputer; 60 University Courses; 4,000 Academic Papers
  2013: 430M CUDA-Capable GPUs; 1.6M CUDA Downloads; 50 Supercomputers; 640 University Courses; 37,000 Academic Papers

Leadership HPC Sites Now GPU Accelerated
  Germany: Juelich, HLRS, Max Planck, TU Dresden
  UK: Cambridge, EPCC, Oxford, STFC
  Japan: Tokyo Tech, RIKEN, Tsukuba
  Rest of Europe: BSC (Spain), CINECA (Italy), CEA (France), CSCS (Switzerland)
  China: NSC Shenzhen, NSC Tianjin, CAS IPE
  Rest of World: MSU (Russia), RAS (Russia), IITs (India)
  United States: Lawrence Livermore National Labs, Oak Ridge National Labs, Sandia National Labs, NOAA, NCSA BlueWaters

CUDA Accelerating 19% of FLOPS from GPU Systems
  [Chart: Total Performance (PFLOPS); series: NVIDIA Kepler, NVIDIA Fermi, Intel Xeon Phi, IBM Cell, Other]

Top Applications Now with Built-in GPU Support Application Market Share by Segment 207 GPU-Accelerated Applications www.nvidia.com/appscatalog

Tesla Kepler Family of Accelerators: World’s Fastest, Most Energy Efficient Accelerator
  [Chart: Sandy Bridge E5-2690, Intel Xeon Phi, TESLA K20, TESLA K20X, TESLA K10]
  Tesla K20X vs Xeon CPU: 8x Faster SGEMM, 6x Faster DGEMM
  Tesla K20X vs Xeon Phi: 70% Faster SGEMM, 50% Faster DGEMM
