
1 Graphics Processors and the Exascale: Parallel Mappings, Scalability and Application Lifespan Rob Farber, Senior Scientist, PNNL

2 Questions 1 and 2
Question 1: Looking forward in the 2-5 year timeframe, will we continue to need new languages, compiler directives, or language extensions to use accelerators?
Absolutely, as will be discussed in the next few slides.
Will compiler technology advance sufficiently to use accelerators seamlessly, as when the 8087 was added to the 8086 in the early days of the x86 architecture, or when instruction sets were extended to include SSE or AltiVec and compilers eventually generated code for them?
Oh, I wish! However, there is hope for data-parallel problems.
Question 2: What is your vision of what a unified heterogeneous HPC ecosystem should encompass? What languages, libraries, and frameworks? Should debuggers and profiling tools be integrated across heterogeneous architectures?
Humans are the weak link.
A scalable, globally unified file system is essential.
Yes to a unified set of debugger and profiling tools.
I'd like to say any language, but many semantics and assumptions will not scale!

3 A perfect storm of opportunities and technology (summary of Farber, Scientific Computing, "Realizing the Benefits of Affordable Teraflop-capable Hardware")
Multi-threaded software is a must-have because manufacturers were forced to move to multi-core CPUs. The failure of Dennard scaling meant processor manufacturers had to add cores to increase performance and entice customers.
This is a new model for a huge body of legacy code! Multi-core is disruptive to single-threaded and poorly scaling legacy apps.
GPGPUs, the Cray XMT, and Blue Waters have changed the numbers. Commodity systems are catching up. Massive threading is the future.
Research efforts will not benefit from new hardware unless they invest in scalable, multi-threaded software. Lack of investment risks stagnation and losing to the competition. Competition is fierce, and the new technology is readily available and inexpensive!
Which software and models? Look to the successes that are widely adopted and have withstood the test of time. Briefly examine CUDA, OpenCL, and data-parallel extensions.

4 GPGPUs: an existing capability
Market forces evolved GPUs into massively parallel GPGPUs (General-Purpose Graphics Processing Units). NVIDIA quotes a 100+ million installed base of CUDA-enabled GPUs.
GPUs put supercomputing in the hands of the masses. In December 1996, ASCI Red became the first teraflop supercomputer. Today, kids buy GPUs with flop rates comparable to the systems available to scientists with supercomputer access in the mid-to-late 1990s. Remember that Finnish kid who wrote some software to understand operating systems?
Inexpensive commodity hardware enables new thinking and a large educated base of developers.

GPU                 Peak 32-bit (TF/s)   Peak 64-bit (GF/s)   Cost
GeForce GTX 480     1.35                 168                  < $500
AMD Radeon HD 5870  2.72                 544                  < $380

5 Meeting the need: CUDA was adopted quickly!
February 2007: the initial CUDA SDK was made public.
Now: CUDA-based GPU computing is part of the curriculum at more than 200 universities, including MIT, Harvard, Cambridge, Oxford, the Indian Institutes of Technology, National Taiwan University, and the Chinese Academy of Sciences.
Application speed tells the story. Among the fastest 100 apps in the NVIDIA showcase (Sept. 8, 2010):
Fastest: 2600x
Median: 253x
Slowest: 98x
URL: http://www.nvidia.com/object/cuda_apps_flash_new.html (click on "Sort by Speed Up")

6 GPGPUs are not a one-trick pony
Used on a wide range of computational, data-driven, and real-time applications.
Exhibit knife-edge performance; balance ratios can help map problems.
Can really be worth the effort:
10x can make computational workflows more interactive (even poorly performing GPU apps are useful).
100x is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers.
1000x and greater can be achieved through the use of optimized transcendental functions and/or multiple GPUs.
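
Balance ratios reward a quick back-of-the-envelope check. As a hypothetical worked example (the numbers are illustrative, chosen to sit in the bandwidth range quoted on the next slide): a device with a 10^12 flop/s peak and 150 GiB/s of global-memory bandwidth can stream roughly 4 x 10^10 single-precision values per second, so it can afford about 10^12 / (4 x 10^10) = 25 flops per value loaded. A kernel that performs fewer flops than that per load will be bandwidth-bound no matter how well it is otherwise tuned, which is one source of the knife-edge behavior above.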

7 Three rules for fast GPU codes
1. Get the data on the GPU (and keep it there!)
PCIe x16 v2.0 bus: 8 GiB/s in a single direction.
20-series GPUs: 140-200 GiB/s of on-board memory bandwidth.
2. Give the GPU enough work to do.
Assume 10 μs of launch latency on a 1 TF/s device: each launch can waste (10^-5 s x 10^12 flop/s) = 10M operations.
3. Reuse and locate data to avoid global memory bandwidth bottlenecks.
10^12 flop/s hardware delivers only about 10^10 flop/s when limited by global memory, which can cause a 100x slowdown!
These rules are tough for people to follow. Tools need heuristics that can work on incomplete data and adjust for bad decisions. It's even worse in a distributed and non-failsafe environment.
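
To make rule 1 concrete, here is a minimal CUDA sketch; the kernel, names, and sizes are illustrative assumptions, not taken from the slides. The vector crosses the PCIe bus once, a long sequence of kernels then operates on it in device memory, and only the final result comes back.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative kernel: scales a vector in place in device memory.
__global__ void scale(float *d_data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *h_data = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Rule 1, part one: pay the 8 GiB/s PCIe transfer cost once...
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Rule 1, part two: ...then keep the data resident on the GPU across
    // many kernel launches instead of shuttling it over the bus each time.
    for (int iter = 0; iter < 1000; ++iter)
        scale<<<(n + 255) / 256, 256>>>(d_data, n, 1.000001f);

    // Copy back only when the final answer is needed.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}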

8 Application lifespan. SIMD: a key from the past
Farber: general SIMD mapping from the 1980s.
This mapping for neural networks on the Connection Machine was the "most efficient implementation to date" (Singer 1990), (Thearling 1995).
Results presented at SC09 (courtesy TACC):
60,000 cores: 363 TF/s measured
62,796 cores: 386 TF/s (projected)
Acknowledgements: work performed at or funded by the Santa Fe Institute, the theoretical division at Los Alamos National Laboratory, and various NSF, DOE, and other funding sources, including the Texas Advanced Computing Center.

9 The Parallel Mapping: energy = objFunc(p1, p2, ..., pn)
Step 1: Broadcast the parameters p1, p2, ..., pn to every GPU.
Step 2: Each GPU calculates a partial energy over its own block of examples (GPU 1: examples 0 to N-1; GPU 2: N to 2N-1; GPU 3: 2N to 3N-1; GPU 4: 3N to 4N-1).
Step 3: Sum the partials to get the energy, which feeds back into the optimization method (Powell, conjugate gradient, other).
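
As a concrete illustration, below is a minimal single-GPU sketch of this mapping written with Thrust, the library shown on slide 13. The least-squares objective and the two-parameter linear model are assumptions made for the example, not part of the original mapping; a multi-GPU version would run one such partial evaluation per device (step 2) and sum the returned partials on the host (step 3).

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/functional.h>

// Step 1 analogue: the broadcast parameters travel inside the functor.
struct squaredError
{
    float p1, p2;  // hypothetical model parameters
    squaredError(float a, float b) : p1(a), p2(b) {}

    // Per-example contribution to the energy.
    __host__ __device__
    float operator()(const thrust::tuple<float, float> &xy) const
    {
        float pred = p1 * thrust::get<0>(xy) + p2;  // assumed linear model
        float err = pred - thrust::get<1>(xy);
        return err * err;
    }
};

// Steps 2 and 3 on one device: a fused transform-reduce pass over
// this GPU's block of (x, y) examples yields its partial energy.
float objFunc(const thrust::device_vector<float> &x,
              const thrust::device_vector<float> &y,
              float p1, float p2)
{
    return thrust::transform_reduce(
        thrust::make_zip_iterator(thrust::make_tuple(x.begin(), y.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(x.end(), y.end())),
        squaredError(p1, p2),
        0.0f,
        thrust::plus<float>());
}

The optimization method stays on the host and simply calls objFunc once per energy evaluation: only a handful of parameters go down and a single float comes back, which is what keeps the mapping scalable.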

10 Results = The Connection Machine x C_NVIDIA (where C_NVIDIA >> 1)

Nonlinear PCA (average of 100 iterations, sec):
8x core*: 0.877923
C2050**: 0.021667
Speedup: 40x vs. 8 cores; 295x vs. 1 core (measured)

Linear PCA (average of 100 iterations, sec):
8x core*: 0.164605
C2050**: 0.020173
Speedup: 8x vs. 8 cores; 57x vs. 1 core (measured)

* 2x Intel quad-core E5540 @ 2.53 GHz, OpenMP, SSE enabled via g++
** includes all data-transfer overhead ("effective flops")

What is C_NVIDIA for modern x86_64 machines?

11 Scalability across GPU/CPU cluster nodes (big hybrid supercomputers are coming)
Oak Ridge National Laboratory looks to the NVIDIA "Fermi" architecture for a new supercomputer.
NERSC experimental GPU cluster: Dirac.
EMSL experimental GPU cluster: Barracuda.

12 Looking into my crystal ball
I predict long life for GPGPU applications. Why?
SIMD/SPMD/MIMD mappings translate well to new architectures.
CUDA/OpenCL provide an excellent way to create these codes.
Will these applications always be written in these languages? Data-parallel extensions are hot!

13 Data-parallel extensions
URL: http://code.google.com/p/thrust/
Example from the website:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(100);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to the device and compute the sum
    thrust::device_vector<int> d_vec = h_vec;
    int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());

    return 0;
}
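
Thrust is a header-only template library, so assuming a CUDA toolkit is installed, the example can typically be built with nothing more than a command along the lines of nvcc -O2 sum.cu -o sum (the file name is illustrative); no separate library needs to be linked.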

14 OpenCL has potential (but is still very new)
x86: the dominant architecture; more cores with greater memory bandwidth and lower power.
POWER7: Blue Waters, with over 1 million concurrent threads of execution in a petabyte of shared memory, and an innovative design to avoid SMP scaling bottlenecks.
Hybrid architectures: CPU/GPU clusters.
Problems dominated by irregular access in large data: the Cray XMT, specialized for large graph problems.

15 Question 3
Will we need a whole new computational execution model for Exascale systems, e.g. something like LSU's ParalleX? It certainly sounds wonderful!
A new model of parallel computation:
Semantics for state objects, functions, parallel flow control, and distributed interactions.
Unbounded policies for implementation technology, structure, and mechanism.
Intrinsic system-wide latency hiding.
Near fine-grain global parallelism.
Global unified parallel programming.
Even so, the answers from question 2 still apply:
Humans are the weak link.
A scalable, globally unified file system is essential.
Yes to a unified set of debugger and profiling tools.
Many language semantics and assumptions will not scale!

