Liberating GPGPU Programming


Liberating GPGPU Programming: The Future is Heterogeneous
Benedict R. Gaster, AMD Research

Last year I talked about the similarities between the GPU and the CPU. AMD is pushing this story further:
The first Graphics Core Next-based GPUs launch this year
The HSA devices, software infrastructure and associated specifications are progressing
Today I want to get people thinking about reality:
Get away from the mythology of GPU execution
Consider how we can streamline programming to make good use of the hardware
Improve composability of the programming model and architecture

First of all: GPU myths
Most people outside of marketing view GPU architecture claims as slightly exaggerated.
We all know that talking about hundreds of GPU cores and comparing that to a few CPU cores is not entirely accurate.
The same applies to threads: the way we often use threads is an artifact of how we map programs to hardware rather than of the hardware itself.
Claimed 100x improvements on GPUs are largely due to poor CPU code, new algorithms, or relaxed descriptive writing.
But what about "facts" about how good the GPU is at executing certain code? Are GPUs good at branching?

So are GPUs good at branching?
"GPUs are bad at executing branchy code" is a claim we hear repeatedly. What does it really mean, and given a definition, is it actually true?
The usual clarification is: GPUs are bad at executing divergent branches.
Let's think about the architecture slightly more abstractly: how does the GPU fit in?
How easy is it to compile pre-vectorized code like OpenCL, CUDA, or other functions mapped over data, to the CPU? Easy, you think... but is it really? How well are you using those AVX units? And how easy is it to compile that code to the GPU? So why the difference?
GPUs are *better* at divergent branch management than the CPU. The only reason people claim otherwise is that they don't care very much that they are using only one lane of the CPU's vector unit, because per-thread performance there is higher than on throughput processors such as GPUs.

The SIMD core
The SIMD unit on the AMD Radeon™ HD 6970 architecture (and related designs) had branch control, but full scalar execution was performed globally.

The SIMD core
On the AMD Radeon™ HD 7970 (and other chips based on the architecture better known as "Graphics Core Next") we have a full scalar processor, and the L1 cache and LDS have been doubled in size.
Then let us consider the VLIW ALUs.
Notice that this already doesn't seem so different from a CPU core with vector units.

The SIMD core
Remember that we could view the architecture in two ways:
An array of VLIW units
A VLIW cluster of vector units

The SIMD core
Now that we have a scalar processor we can dynamically schedule instructions rather than relying on the compiler. No VLIW!
The heart of Graphics Core Next: a scalar processor with four 16-wide vector units.
Each lane of the vector, and hence each IL work item, is now scalar.

Branching and the new ISA
The source:

    float fn0(float a, float b)
    {
        if (a > b)
            return (a - b) * a;
        else
            return (b - a) * b;
    }

The generated ISA (register r0 contains "a", r1 contains "b", the value is returned in r2):

    v_cmp_gt_f32    r0, r1           //a > b, establish VCC
    s_mov_b64       s0, exec         //Save current exec mask
    s_and_b64       exec, vcc, exec  //Do "if"
    s_cbranch_vccz  label0           //Branch if all lanes fail
    v_sub_f32       r2, r0, r1       //result = a - b
    v_mul_f32       r2, r2, r0       //result = result * a
label0:
    s_andn2_b64     exec, s0, exec   //Do "else" (s0 & !exec)
    s_cbranch_execz label1           //Branch if all lanes fail
    v_sub_f32       r2, r1, r0       //result = b - a
    v_mul_f32       r2, r2, r1       //result = result * b
label1:
    s_mov_b64       exec, s0         //Restore exec mask

The branches are optional; whether to use them can be decided from the number of instructions in the conditional section, and they execute in the branch unit.
First we prepare the vector condition register (VCC) with a comparison instruction. Using this value we manipulate the execution mask, having stored the previous mask. We can then optionally branch depending on the number of lanes that pass or fail the comparison. At the end of the if block we manipulate the mask again based on the value we stored originally.
We see here that the execution mask is manipulated explicitly in generated code.

Familiar?
If we add the frontend of the core... (the slide shows the masked-branch ISA sequence from the previous slide alongside block diagrams of a "Graphics Core Next" core and a "Barcelona" core.)
How would an implicitly vectorized program look mapped onto the CPU using SSE instructions? So different?
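To make the comparison concrete, here is a rough sketch (added for illustration, not compiler output from the talk) of fn0 mapped across eight lanes with AVX intrinsics. The CPU side has to do exactly the same kind of mask management, just with a compare-and-blend instead of an execution mask register:

    #include <immintrin.h>

    // Eight lanes of fn0: if (a > b) return (a-b)*a; else return (b-a)*b;
    // Both sides are computed and a per-lane mask selects the result, which is
    // the CPU analogue of the GPU's explicit exec-mask manipulation.
    __m256 fn0_avx(__m256 a, __m256 b)
    {
        __m256 mask      = _mm256_cmp_ps(a, b, _CMP_GT_OQ);        // per-lane a > b
        __m256 then_side = _mm256_mul_ps(_mm256_sub_ps(a, b), a);  // (a - b) * a
        __m256 else_side = _mm256_mul_ps(_mm256_sub_ps(b, a), b);  // (b - a) * b
        return _mm256_blendv_ps(else_side, then_side, mask);       // select per lane
    }

Note that the GCN sequence above at least has the option of branching over a whole conditional when every lane agrees; the straight-line AVX version always pays for both sides.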

So returning to the topic of branches
Are GPUs bad at executing branchy code? Aren't CPUs just as bad at this? Try running a divergent branch on an AVX unit. Actually, CPUs are worse at this, because they don't have the same flexibility in their vector units.
Which is easier to compile a pre-vectorized input like OpenCL to: the CPU or the GPU?
So, what did we really mean? GPUs are bad at executing branchy code IF we assume that the input was mapped in SPMD-on-SIMD fashion AND we do NOT assume the same about the CPU. Somehow that's less exciting...
Instead, how SHOULD we be thinking about programming these devices?

The CPU programmatically: a trivial example
What's the fastest way to perform an associative reduction across an array on a CPU?
Take an input array.
Block it based on the number of threads (one per core; usually 4 or 8 cores).
Iterate to produce a sum in each block:

    float sum( 0 )
    for( i = n to n + b )
        sum += input[i]

Reduce across threads:

    float reductionValue( 0 )
    for( t in threadCount )
        reductionValue += t.sum

Vectorize:

    float4 sum( 0, 0, 0, 0 )
    for( i = n/4 to (n + b)/4 )
        sum += input[i]
    float scalarSum = sum.x + sum.y + sum.z + sum.w
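A minimal concrete version of those steps (a sketch added for illustration, not from the slides), using std::thread for the blocking and leaving the inner loop in a form the compiler is free to vectorize:

    #include <numeric>
    #include <thread>
    #include <vector>

    // Blocked associative reduction: one thread per block, then a final
    // reduction across the per-thread partial sums. threadCount > 0 assumed.
    float reduce(const std::vector<float>& input, unsigned threadCount)
    {
        std::vector<float> partial(threadCount, 0.0f);
        std::vector<std::thread> workers;
        const size_t block = input.size() / threadCount;

        for (unsigned t = 0; t < threadCount; ++t) {
            workers.emplace_back([&, t] {
                const size_t begin = t * block;
                const size_t end = (t + 1 == threadCount) ? input.size()
                                                          : begin + block;
                float sum = 0.0f;                 // simple loop over this block
                for (size_t i = begin; i < end; ++i)
                    sum += input[i];
                partial[t] = sum;
            });
        }
        for (auto& w : workers) w.join();

        // Reduce across threads.
        return std::accumulate(partial.begin(), partial.end(), 0.0f);
    }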

The GPU programmatically: the same trivial example
What's the fastest way to perform an associative reduction across an array on a GPU?
Take an input array.
Block it based on the number of threads (8 or so per core, up to 32 or so cores).
Iterate to produce a sum in each block:

    float sum( 0 )
    for( i = n to n + b )
        sum += input[i]

Reduce across threads (this bit may be a different kernel dispatch given current models):

    float reductionValue( 0 )
    for( t in threadCount )
        reductionValue += t.sum

Vectorize:

    float64 sum( 0, ..., 0 )
    for( i = n/64 to (n + b)/64 )
        sum += input[i]
    float scalarSum = waveReduce(sum)

Current models ease programming by viewing the vector as a set of scalar ALUs, apparently, though not really, independent, with a varying degree of hardware assistance (and hence overhead):

    float sum( 0 )
    for( i = n/64 to (n + b)/64; i += 64 )
        sum += input[i]
    float scalarSum = waveReduceViaLocalMemory(sum)
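For reference, here is a sketch of what something like waveReduceViaLocalMemory amounts to in today's OpenCL C (added for illustration; the function names in the pseudocode above are the slide's shorthand, not a real API). Each work item accumulates a partial sum, then the work-group reduces through local memory:

    // One work-group produces one partial result in 'out'; a second dispatch
    // (or the host) reduces across work-groups. Power-of-two local size assumed.
    kernel void block_reduce(global const float* input,
                             global float* out,
                             local float* scratch,
                             uint count)
    {
        uint gid = get_global_id(0);
        uint lid = get_local_id(0);
        uint gsz = get_global_size(0);

        float sum = 0.0f;
        for (uint i = gid; i < count; i += gsz)    // strided partial sum
            sum += input[i];
        scratch[lid] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction within the work-group via local memory.
        for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }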

They don't seem so different!
For simple cases this is true.
Vector issues for GPUs are not a programming problem.
The differences are more about lazy CPU programming than about difficulties of GPU programming.

Maybe we are going about it wrong?
Do we really want to be doing these big data-parallel clustered dispatches? What happens if we step back a little and rethink? What do people really want to do?
What about a motion vector search:

    foreach block:
        while( not found closest match ):
            foreach element in comparison

There is serial code mixed in with two levels of parallel code. So what if, instead of trying to do a single blocked dispatch, we explicitly layer the launch for the programmer?

Using OpenMP as an example

    #pragma omp parallel for
    for( int y = 0; y < YMax; ++y ) {            // for each macroblock in Y
        for( int x = 0; x < XMax; ++x ) {        // for each macroblock in X
            while( not found optimal match ) {
                // Do something to choose the next test region - serial heuristic
                // Compute tx, ty as coordinates of the match region
                for( int y2 = 0; y2 < 16; ++y2 ) {
                    for( int x2 = 0; x2 < 16; ++x2 ) {
                        diff = block(x2 + 16*x, y2 + 16*y) - target(x2 + tx, y2 + ty);
                        // use diff... maybe sum of squared differences
                    }
                }
            }
        }
    }

Where would we parallelise this for the CPU? OpenMP pragmas?
The two outer macroblock loops make sense: there's a lot of parallelism to move into threads.
The while loop is a serial search for the best match. It is best written as a serial loop and is clean scalar code; extracting it from the parallel context can feel messy.
And the inner 16x16 loops? Not really... we'd just want to vectorize those. They're too fine-grained for OpenMP threads. What we want there is a guarantee of parallelism for the compiler.

Flattening nested data-parallel code can be messy
What do we do when we write that manually in OpenCL today?
We launch a work-group for each block, with some number of work items that may or may not match the number of iterations we need to do.
Each work item contains the same serial loop. Will the compiler realize that it is the same loop across the whole group?
We've flattened a clean nested algorithm, from a 2D loop + a serial loop + a 2D loop, into a 4D blocked iteration space with the serial loop spread around the blocks.
It's a messy projection... surely not the best way?
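A rough sketch of that flattening (added for illustration, not code from the talk): one work-group per macroblock, a 16x16 work-group standing in for the inner 2D loop, and the serial search replicated in every work item. The real heuristic is replaced by a trivial fixed candidate walk so that the kernel is self-contained; the point is the shape of the mapping, not the search.

    // Launch assumption: 16x16 local size, one work-group per macroblock of a
    // W-pixel-wide frame, with W a multiple of 16.
    kernel void motion_search(global const uchar* frame,
                              global const uchar* target,
                              global int* best_sad,
                              int W)
    {
        int mbx = get_group_id(0);   // macroblock coordinates: the outer 2D loops
        int mby = get_group_id(1);
        int x2  = get_local_id(0);   // position in the 16x16 comparison: the inner 2D loops
        int y2  = get_local_id(1);

        local int sad;
        int best = INT_MAX;

        // The "serial" search loop, replicated in every work item of the group.
        for (int step = 0; step < 8; ++step) {
            int tx = min(mbx * 16 + step, W - 16);   // stand-in for the real heuristic
            int ty = mby * 16;

            if (x2 == 0 && y2 == 0) sad = 0;
            barrier(CLK_LOCAL_MEM_FENCE);

            int a = frame [(mby * 16 + y2) * W + (mbx * 16 + x2)];
            int b = target[(ty + y2) * W + (tx + x2)];
            atomic_add(&sad, (a - b) * (a - b));     // one element per work item
            barrier(CLK_LOCAL_MEM_FENCE);

            if (x2 == 0 && y2 == 0)
                best = min(best, sad);
        }
        if (x2 == 0 && y2 == 0)
            best_sad[mby * get_num_groups(0) + mbx] = best;
    }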

Even so...
Even if our programming model isn't quite right yet, GPUs really have been hard to program for.
So let's look at the real problems and how we can address them. If we do that right, then a good programming model can be placed on top.

Improving the programming model
Fixing the composition problem

Function composition: algorithm = f ∘ g
The ability to decompose a larger problem into sub-components.
The ability to replace one sub-component with another sub-component that has the same external semantics.
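As a minimal illustration of that property (a plain C++ sketch added here, nothing GPU-specific): as long as g keeps the same external semantics, any implementation of it can be swapped in under f without touching the composed algorithm.

    #include <cassert>
    #include <functional>

    // algorithm = f . g
    std::function<int(int)> compose(std::function<int(int)> f,
                                    std::function<int(int)> g)
    {
        return [f, g](int x) { return f(g(x)); };
    }

    int main()
    {
        auto f  = [](int x) { return x + 1; };
        auto g1 = [](int x) { return x * 2; };   // one implementation of g
        auto g2 = [](int x) { return x + x; };   // another, same external semantics

        auto algorithm1 = compose(f, g1);
        auto algorithm2 = compose(f, g2);        // g replaced; f and the composition untouched
        assert(algorithm1(3) == algorithm2(3));
        return 0;
    }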

Composability in current models
Current GPU programming models suffer from composability limitations. The data-parallel model works in simple cases, but its fixed, limited nature breaks down when:
We need to use long-running threads to perform reductions more efficiently
We want to synchronize inside and outside library calls
We want to pass memory spaces into libraries
We need to pass data structures cleanly between threads

Address spaces and communication

Memory address spaces are not COMPOSABLE

    void foo(global int *) { ... }

    void bar(global int * x)
    {
        foo(x);        // works fine

        local int a[1];
        a[0] = *x;
        foo(a);        // will now not work: foo only accepts global pointers
    }
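For illustration, the usual workaround today (a sketch added here, not from the slides): because OpenCL C has neither address-space genericity nor user-defined overloading, a library routine ends up duplicated once per address space it might be handed, and every caller has to pick the right copy. The names sum_into_global and sum_into_local are hypothetical stand-ins for any library routine.

    // One copy of the routine per address space the caller might use.
    void sum_into_global(global int* p, int v) { *p += v; }
    void sum_into_local (local  int* p, int v) { *p += v; }

    kernel void bar(global int* x)
    {
        sum_into_global(x, 1);      // fine

        local int a[1];
        a[0] = *x;
        sum_into_local(a, 1);       // needs a second, otherwise identical, routine
    }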

OpenCL™ C++ addresses memory spaces to a degree
Develop device code using full C++.
Extend OpenCL™ address spaces into the C++ type system: automatic inference of address spaces, including "this" pointer address space deduction.
This works particularly well with the Khronos C++ API:

    #include <CL/cl.hpp>

    // Kernel functors:
    std::function<Event (const cl::EnqueueArgs&, cl::Pointer<int>)> plus =
        make_kernel<cl::Pointer<int>, int>(
            "kernel void plus(global Pointer<int> io) { int i = get_global_id(0); *(io+i) = *(io+i) * 2; }");

OpenCL™ kernel language
Abstract over address-space qualifiers: methods can be annotated with address spaces, controlling the "this" pointer's location, and this extends to overloading. The default address space for "this" is deduced automatically. Default constructors are supported.

    template<address-space aspace_>
    struct Shape {
        int foo(aspace_ Colour&) global + local;
        int foo(aspace_ Colour&) private;
        int bar(void);
    };

C++11 features are used to handle the cases where the type of "this" needs to be written down by the developer:

    auto operator=(const decltype(this)& rhs) -> decltype(this)&
    {
        if (this == &rhs) { return *this; }
        ...
        return *this;
    }

Memory consistency
We can also add a "flat" memory address space to improve the situation: memory regions are resolved based on the address, not the instruction.
Even in global memory, communication is problematic. We need to guarantee correct memory consistency, and "at end of kernel execution" is simply not fine-grained enough for a lot of workloads.
Solution: define a clear memory consistency model.

Kernels that take real pointers
We can launch kernels with real pointers as parameters:
No pre-construction of buffers
Fewer concerns about memory movement (and no *correctness* concerns on a single machine)
Allow for system-wide architected queues:
Write kernel dispatch packets directly into virtual memory; the HSA device reads those packets and performs the work
No kernel-mode transitions
Allow for direct work on data structures from other devices:
Less memory traffic
Less code restructuring
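To make the contrast concrete, a host-side sketch (added for illustration): the first half is the standard OpenCL 1.x buffer dance, using real API calls; the second half is hypothetical pseudo-code for the pointer-and-packet flow described above, not an actual API.

    #include <CL/cl.h>

    // Today: the data must be wrapped in a buffer object and bound by index.
    void dispatch_with_buffer(cl_context ctx, cl_command_queue q, cl_kernel k,
                              float* host_data, size_t n)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), host_data, &err);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        size_t global = n;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clFinish(q);
        clReleaseMemObject(buf);
    }

    // HSA-style flow (hypothetical pseudo-code): the kernel argument is simply
    // the host pointer, and the dispatch packet is written into a user-visible
    // architected queue with no kernel-mode transition.
    //
    //   packet.kernel_object = k;
    //   packet.kernarg       = host_data;   // a real pointer, no buffer object
    //   user_queue.push(packet);            // written directly into virtual memory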

HSA memory model
Key concepts:
Simplified view of memory: sharing pointers across devices is possible, which makes it possible to run a task on any device and to use pointers and data structures that require pointer chasing correctly across device boundaries
Relaxed-consistency memory model: acquire/release, barriers
Designed to be compatible with the C++11, Java and .NET memory models
Platform-level visibility control
The memory model for a new architecture is key.
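For readers less familiar with the acquire/release style referred to above, here is a minimal C++11 host-side sketch (added for illustration, nothing HSA-specific): the release store on the flag makes the preceding payload write visible to the thread that observes the flag with an acquire load. The HSA model aims to make this kind of pattern meaningful across device boundaries as well.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;
    std::atomic<bool> ready(false);

    void producer()
    {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish: prior writes become visible
    }

    void consumer()
    {
        while (!ready.load(std::memory_order_acquire)) // pairs with the release store
            ;                                          // spin until published
        assert(payload == 42);                         // guaranteed to observe the payload
    }

    int main()
    {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
        return 0;
    }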

Synchronization

Traditional barrier synchronization is NOT COMPOSABLE
Producer/consumer patterns in the SPMD-on-SIMD model must take SIMD execution into account:

    foreach work item i in buffer range
        for( some loop ) {
            if( work item is producer ) {
                // communicate
                barrier();
            } else {
                // ...
                barrier();
            }
        }

These are different barrier instances, and we must hit both. Even if the barriers could be treated correctly, the SIMD width matters: simple OpenCL barriers do not work in control flow.
If the producer and consumer work items are in different wavefronts (threads), the dependency is satisfied: each wavefront hits its own barrier instance.
If they are in the same wavefront, there is only one program counter, so we cannot hit both barrier instances.

Traditional barrier synchronization is NOT COMPOSABLE
What if we use libraries?

    foreach work item i in buffer range
        for( some loop ) {
            if( work item is producer ) {
                // communicate
                send();
            } else {
                receive();
            }
        }

Do send and receive contain barriers? In the presence of libraries, wavefront-structured barriers are very hard to enforce. Composability is at risk.

Making barriers first class
Barrier objects introduce barriers as first-class values with a set of well-defined operations:
Construction: initialize a barrier for some subset of work-items within a work-group or across work-groups
Arrive: the work-item marks the barrier as satisfied
Skip: the work-item marks the barrier as satisfied and notes that it will no longer take part in barrier operations; this allows early exit from a loop, or divergent control where the work-item never intends to take part in the barrier
Wait: the work-item marks the barrier as satisfied and waits for all other work-items to arrive, wait, or skip

Barrier object example: simple divergent control flow

    barrier b(8);
    parallelFor(Range<1>(8), [&b] (Index<1> i) {
        int val = i.getX();
        scratch[i] = val;
        if( i < 4 ) {
            b.wait();
            x[i] = scratch[i+1];
        } else {
            b.skip();
            x[i] = 17;
        }});

If i >= 4 we can arrive or skip here, meaning that we acknowledge the barrier but do not wait on it.

Barrier object example: loop flow

    barrier b(8);
    parallelFor(Range<1>(8), [&b] (Index<1> i) {
        scratch[i.getX()] = i.getX();
        if( i.getX() < 4 ) {
            for( int j = 0; j < i; ++j ) {
                b.wait();
                x[i] += scratch[i+1];
            }
            b.skip();
        } else {
            x[i.getX()] = 17;
        }});

Inside the loop we need to synchronize repeatedly. Early exit, or skipping the loop entirely, should never again affect synchronization within the loop.

Barrier object example: calling a function from within control flow

    barrier b(8);
    parallelFor(Range<1>(8), [&b] (Index<1> i) {
        scratch[i] = i.getX();
        if( i.getX() < 4 ) {
            someOpaqueLibraryFunction(i.getX(), scratch, b);
        } else {
            b.skip();
            x[i.getX()] = 17;
        }});

    void someOpaqueLibraryFunction(
        const int i, int *scratch, barrier &b)
    {
        for( int j = 0; j < i; ++j ) {
            b.wait();
            x[i] += scratch[i+1];
        }
        b.skip();
    }

The barrier is created outside the library function and passed in; the same barrier is then used from within the library function.

Barriers and threads
Think about what this means for our definition of a thread.
Without first-class barriers: a single work item can only be treated as a thread in the absence of synchronization; with synchronization, the SIMD nature of execution must be accounted for.
With first-class barriers: a single work item can be treated as a thread; it will behave, from a correctness point of view, like a thread, and only performance suffers from the SPMD-on-SIMD execution.

Back to basics

Research directions
There is a need to look at more source languages targeting heterogeneous and vector architectures. C++ AMP and similar models fill a certain niche, but they target simplification of the tool chain rather than expanding the scope of the programming model.
HSA will help:
Multi-vendor
Architected low-level APIs
A well-defined low-level intermediate language; vitally, with largely preserved optimizations, a finalizer that will not interfere too much with code generation, and fast linking

Conclusions
HSA takes some steps towards improving the execution model and giving the programming model a target. HSA does not directly improve the programming model, but it offers an excellent low-level framework.
It is time to start looking enthusiastically at the programming model:
Let's program the system in a single clean manner
Let's forget the nonsense about GPUs being vastly different beasts from CPUs; it's just not true
Let's get away from 400x speedups and magical efficiency gains and look at the system realistically
But let's not forget: performance portability without abstraction is a pipe dream
Please use HSA as a starting point. Go wild and use your imaginations.

Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, AMD Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Inc. used with permission by Khronos. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. © 2012 Advanced Micro Devices, Inc.