HSA FOR THE COMMON MAN
Vinod Tipparaju, Heterogeneous System Software, AMD
Lee Howes, Heterogeneous System Software, AMD

Slide 3: THE HETEROGENEOUS SYSTEM ARCHITECTURE
Taking the platform to programmers

Slide 4: OPENCL™ AND HSA
- HSA is an optimized platform architecture for OpenCL™
  – Not an alternative to OpenCL™
- OpenCL™ on HSA will benefit from
  – Avoidance of wasteful copies
  – Low-latency dispatch
  – An improved memory model
  – Pointers shared between CPU and GPU
- HSA also exposes a lower-level programming interface, for those who want the ultimate in control and performance
  – Optimized libraries may choose the lower-level interface

Slide 5: HSA TAKING THE PLATFORM TO PROGRAMMERS
- Balance between CPU and GPU for performance and power efficiency
- Make GPUs accessible to a wider audience of programmers
  – Programming models close to today's CPU programming models
  – Enabling more advanced language features on the GPU
  – Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc.), and hence more applications, on the GPU
  – A kernel can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU), enabling task-graph-style algorithms, ray tracing, etc.
- A clearly defined HSA memory model enables effective reasoning about parallel programs
- HSA provides a compatible architecture across a wide range of programming models and HW implementations

Slide 6: HSA SOFTWARE STACK
- How do we deliver the HSA value proposition?
- Overall vision:
  – Make the GPU easily accessible
    - Support mainstream languages
    - Expandable to domain-specific languages
  – Make compute offload efficient
    - Direct path to the GPU (avoid graphics overhead)
    - Eliminate memory copies
    - Low-latency dispatch
  – Make it ubiquitous
    - Drive HSA as a standard through the HSA Foundation
    - Open-source key components
[Stack diagram: Applications; application and system languages, plus domain-specific languages (e.g. OpenCL™, C++ AMP, Python, R, JS); HSA Runtime; LLVM IR; HSAIL; HSA Hardware]

Slide 7: HSA EXECUTION MODEL VIA HSA RUNTIME
- HSA Runtime user-mode work queues
  – Uniform abstraction across devices, simple insertion mechanism
  – Multi-level parallelism: within a queue and across queues
- Simple parallelism specifier
  – Range/grid, and group
  – HW specifics have a simple abstraction, analogous to programming based on cache-line size
- Implicit preemption: launch and execute multiple tasks simultaneously
[Figure: several user-mode queues, each written by the user and read by a device]

Slide 8: HSA MEMORY MODEL VIA HSA RUNTIME
- Key concepts
  – Simplified view of memory
  – Sharing pointers across devices is possible
    - Makes it possible to run a task on any device
    - Possible to use pointers, and data structures that require pointer chasing, correctly across device boundaries
  – Relaxed-consistency memory model
    - Acquire/release
    - Barriers
- The HSA Runtime exposes allocation interfaces with control over memory attributes
  – Types of memories can be mixed and matched based on usage needs
- Simplified launches: dispatch(task, arg1, arg2, …) (modeled in the sketch below)
  – Run device tasks with stack memory
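
A minimal, runnable model of what shared virtual memory buys the dispatch(task, arg1, arg2, …) style of launch. A std::thread stands in for the GPU task; the deck does not specify a concrete C++ dispatch API, so nothing here is the real HSA interface:

    #include <cstdio>
    #include <thread>

    struct Node {
        int   value;
        Node *next;   // ordinary pointer; under HSA the GPU can chase it directly
    };

    int main() {
        // Host builds a pointer-chasing structure with plain allocation.
        Node c{3, nullptr}, b{2, &c}, a{1, &b};

        int sum = 0;
        // "dispatch(task, arg1, arg2, ...)": the task receives raw pointers,
        // with no copy or translation step, exactly because memory is shared.
        std::thread device_task([&sum](Node *head) {
            for (Node *n = head; n != nullptr; n = n->next) sum += n->value;
        }, &a);
        device_task.join();   // analogous to waiting on the dispatch's completion event

        std::printf("sum = %d\n", sum);   // prints: sum = 6
        return 0;
    }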

Slide 9: ARCHITECTED QUEUES AND DEVICE-TO-DEVICE ENQUEUE
- A kernel can enqueue work to any other queue in the system (e.g. GPU->GPU, GPU->CPU), enabling task-graph-style algorithms, ray tracing, etc.
- The queue is an architected feature (see the sketch below)
  – The format of what represents a queue is architected
  – Methods to enqueue follow from that format
  – Decoupled from the HSAIL language
- A unique way of dynamically specifying where enqueues go
  – Resolution at execution time permits many load-balancing solutions
[Figure: a queue written by the user or another device and read by a device; slides 10 and 11 animate the same figure]
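
A sketch of the architected user-mode queue concept: a fixed-format ring buffer of packets plus an atomic write index. The packet layout and names below are invented for illustration; only the shape of the mechanism ("bump an index, fill a slot, no kernel-mode call") reflects the slide:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    struct Packet {                 // stand-in for an architected packet format
        void (*kernel)(void *);     // what to run
        void  *args;                // argument block
    };

    struct UserQueue {
        static constexpr uint32_t SIZE = 256;   // power of two
        Packet                slots[SIZE];
        std::atomic<uint64_t> write_index{0};
        std::atomic<uint64_t> read_index{0};

        // Any agent that can see this memory -- CPU or a running GPU kernel --
        // can enqueue with the same few instructions; that is what makes
        // device-to-device enqueue possible.
        void enqueue(Packet p) {
            uint64_t slot = write_index.fetch_add(1);   // claim a slot
            slots[slot % SIZE] = p;                     // publish the packet
            // real hardware would also ring a doorbell here
        }
    };

    static void hello(void *) { std::puts("packet executed"); }

    int main() {
        UserQueue q;
        q.enqueue({hello, nullptr});
        // A consumer (the device's packet processor) drains the queue:
        while (q.read_index.load() < q.write_index.load()) {
            Packet &p = q.slots[q.read_index.fetch_add(1) % UserQueue::SIZE];
            p.kernel(p.args);
        }
        return 0;
    }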

Slide 12: ABSTRACTING ARCHITECTED FEATURES
- Illustrative runtime idioms, as they might appear to a programmer:
  – CreateQueue(ptr, size, …);
  – For(i=1,n) queue.dispatch(kernel, args, dep n+i, n-i);
  – queue.dispatch(1minuteKernel, args);
  – Event.wait(); event.getExceptionDetails();
  – Call a CPU function from the GPU: *fptr(…);
  – queue.dispatch(kernel, iptr); *iptr = 2;
  – Queue.dispatch(kernel_set_i_value_1, iptr); While(i==1);
  – HSAAllocate(1): LDS/GDS as virtual memory
  – Access any address from host or kernel
  – Do atomics on the queue, on the host and in a kernel
  – Channels
- Architected features these idioms abstract:
  – User-mode queuing
  – Context switching
  – Process reset (to avoid TDRs)
  – HW exceptions
  – Function calls
  – Virtual functions
  – Memory coherence
  – Unpinned memory access (for DMA and compute shaders)
  – Flat address space
  – Unaligned addressing / memory access
  – Platform atomic operations
  – Memory watchpoints
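
The coherence and platform-atomic idioms above (dispatch a kernel against a shared location, then have one side spin on it) can be modeled on a CPU with C++ atomics. A minimal sketch, with std::thread standing in for the dispatched kernel; the exact direction of the handshake on the slide is ambiguous, so this shows one plausible reading:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main() {
        // Shared, coherent location visible to both "host" and "kernel".
        std::atomic<int> i{1};

        // Model of queue.dispatch(kernel, iptr); *iptr = 2;
        // the dispatched side release-publishes a new value...
        std::thread kernel([&] {
            i.store(2, std::memory_order_release);
        });

        // ...while the other agent spins, as in While(i==1); on the slide.
        while (i.load(std::memory_order_acquire) == 1) { /* spin */ }

        kernel.join();
        std::printf("observed i = %d\n", i.load());   // prints: observed i = 2
        return 0;
    }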

Slide 13: ENABLING DIFFERENT KINDS OF PROBLEM DOMAINS
- Memory model
- HSAIL language
- Execution model
- Architected features
- Utilize a combination of these characteristics per application requirements
[Figure: three example programming models, each composed from a different subset of architected features (Model 1 = features 1, 2, 3; Model 2 = features 1, 3, 4; Model 3 = features 1, 4), layered over the software stack from slide 6]

Slide 14: EXPOSING DATAFLOW THROUGH DEVICE-SIDE ENQUEUE

Slide 15: CHANNELS - PERSISTENT CONTROL PROCESSOR THREADING MODEL
- Add dataflow support to GPGPU
- We are not primarily framing this as producer/consumer kernel bodies
  – That is, we are not promoting a model where one kernel loops producing values while another loops consuming them
  – That approach has the negative behavior of promoting long-running kernels
  – We have tried to avoid this elsewhere by basing in-kernel launches on continuations rather than waiting on children
- Instead we assume that kernel entities produce and consume, but consumer work-items are launched on demand
- An alternative to point-to-point dataflow using persistent threads, avoiding the uber-kernel

Slides 16-24: OPERATIONAL FLOW
[Figure-only animation sequence illustrating the channel operational flow; no transcript text survives beyond the title]

Slide 25: CHANNEL EXAMPLE

    // Predicate: fire the consumer when a whole packet's worth of items is queued.
    // (The template arguments were lost in transcription; Channel<int>, Range<1>,
    // and Index<1> are assumed reconstructions.)
    std::function<bool (opp::Channel<int> *)> predicate =
        [] (opp::Channel<int> *c) -> bool __device(fql) {
            return c->size() % PACKET_SIZE == 0;
        };

    // Channel b counts the values routed to it.
    opp::Channel<int> b(N);
    b.executeWith(
        predicate,
        opp::Range<1>(CHANNEL_SIZE),
        [&sumB] (opp::Index<1>) __device(opp) { sumB++; });

    // Channel c accumulates the values routed to it.
    opp::Channel<int> c(N);
    c.executeWith(
        predicate,
        opp::Range<1>(CHANNEL_SIZE),
        [&sumC] (opp::Index<1>, const int v) __device(opp) { sumC += v; });

    // Producer: a parallel loop that routes each input value to one channel.
    opp::parallelFor(
        opp::Range<1>(N),
        [a, &b, &c] (opp::Index<1> index) __device(opp) {
            unsigned int n = *(a + index.getX());
            if (n > 5) { b.write(n); } else { c.write(n); }
        });
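
The parallelFor at the end is the producer: each work-item routes its value into channel b or c, while the executeWith consumers are launched on demand whenever the predicate sees a full packet accumulated. Neither consumer is a persistent, looping kernel, which is exactly the long-running-kernel pattern slide 15 argues against.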

Slide 26: EXAMPLE PROBLEMS: Rigid body/cloth collision

Slide 27: CLOTH SIMULATION AND COLLISION DETECTION
- Physics simulation has a range of properties
- Rigid body simulation is often
  – Not highly parallel
  – Very dynamic
  – Not necessarily a good match for wide SIMD architectures
- Cloth simulation is
  – Highly parallel
  – While meshes are complicated, connectivity is largely static

Slide 28: EFFICIENT GPU CLOTH SIMULATION: TWO-LEVEL BATCHING
- Offline static batching of the mesh
- Create independent subsets of links through graph coloring (see the sketch below)
- Synchronize between batches
[Figure: a cloth mesh colored into 10 batches]
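
A minimal sketch of the batching step: greedy graph coloring over cloth links, where two links conflict if they share a particle, so links of one color form an independent batch that can be solved in parallel. The data layout and helper names are illustrative, not from the deck:

    #include <cstdio>
    #include <vector>

    struct Link { int p0, p1; };   // a cloth constraint between two particles

    // Assign each link the smallest color unused by any link sharing a particle.
    // Links with the same color are independent and form one parallel batch.
    std::vector<int> color_links(const std::vector<Link> &links, int num_particles) {
        std::vector<std::vector<int>> links_at(num_particles);   // particle -> links
        for (int i = 0; i < (int)links.size(); ++i) {
            links_at[links[i].p0].push_back(i);
            links_at[links[i].p1].push_back(i);
        }
        std::vector<int> color(links.size(), -1);
        for (int i = 0; i < (int)links.size(); ++i) {
            std::vector<bool> used(links.size(), false);
            for (int p : {links[i].p0, links[i].p1})
                for (int j : links_at[p])
                    if (color[j] >= 0) used[color[j]] = true;
            int c = 0;
            while (used[c]) ++c;   // first free color
            color[i] = c;
        }
        return color;
    }

    int main() {
        // A tiny chain 0-1-2-3: adjacent links share a particle, so colors alternate.
        std::vector<Link> links = {{0, 1}, {1, 2}, {2, 3}};
        for (int c : color_links(links, 4)) std::printf("%d ", c);   // prints: 0 1 0
        return 0;
    }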

Slide 29: EFFICIENT GPU CLOTH SIMULATION: BATCHING
- Chunk the mesh into larger groups of links
- Batch those chunks
  – 4 global dispatches
- Iterate within the workgroups
  – 8 secondary batches
[Figure: the mesh split into 4 batches, each with 8 secondary batches]

Slide 30: COLLISION WITH RIGID BODY
- Small set of rigid bodies
- Rigid bodies best computed on the CPU
- Cloth on the GPU

Slide 31: OPTIONS
- Either:
  – Small GPU launches of rigid body/cloth collisions against cloth, or
  – Process rigid body/cloth collisions on the CPU
- On the GPU:
  – Small launches suffer dispatch overhead
  – Must update rigid body data structures from the GPU
- On the CPU:
  – Must continuously move cloth mesh data to and from the GPU
[Figure: RB solve, cloth solve, and cloth/RB collide stages split between CPU and GPU; slide 32 repeats this content with the figure animated]

Slide 33: WHY HSA?
- Colliding rigid bodies are likely to be very sparse in memory
  – Do not want to copy the rigid body array to the GPU "just in case"
  – Do not even want to incur OS page-lock overhead
  – Accessing targeted virtual addresses as necessary reduces the overhead
  – The HSA runtime exposes this architected feature and permits use of any memory, either as an argument or otherwise, in a task
- Operations are in a tight loop
  – Overhead from dispatch grows quickly
  – User-mode queuing reduces this significantly
  – Architected queues are exposed via a simple API
- Do not want to transform the rigid body code
  – It is a common problem that vast, confusing transformations of host code are needed to enable wide vector processing
  – The shared pointer model enables access to those structures directly rather than restructuring them

Slide 34: HOW DO YOU TRIGGER CLOTH/RB COLLIDE?
- Without device-side dependencies (CPU orchestrates):

    CS = dispatch(cloth_solve, x, y, …)
    CPU does RB_solve(p, q, …)
    wait for CS
    dispatch(cloth_RB_collide, x, p, …)

- With dependent dispatch (both solves in flight, collide gated on both):

    RB = dispatch(RB_solve, p, q, …)
    CS = dispatch(cloth_solve, x, y, …)
    Collide = dispatch(cloth_RB_collide, x, p, …, RB & CS)

[Figure: RB solve, cloth solve, and cloth/RB collide stages split between CPU and GPU]
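
A runnable model of the second scheme, using C++ futures in place of the slide's dispatch/event handles. The dispatch(…, RB & CS) notation is the deck's shorthand for a launch that waits on both events; nothing below is the real HSA runtime API, and the solver bodies are trivial stand-ins:

    #include <cstdio>
    #include <future>

    static int rb_solve(int p)    { return p * 2; }   // stand-in rigid body solve
    static int cloth_solve(int x) { return x + 1; }   // stand-in cloth solve

    int main() {
        // Both solves are "dispatched" immediately and run concurrently.
        std::future<int> RB = std::async(std::launch::async, rb_solve, 3);
        std::future<int> CS = std::async(std::launch::async, cloth_solve, 10);

        // The collide task is gated on both results -- the analogue of
        // dispatch(cloth_RB_collide, ..., RB & CS).
        std::future<int> collide = std::async(std::launch::async,
            [](std::future<int> rb, std::future<int> cs) {
                return rb.get() + cs.get();   // waits on both dependencies
            }, std::move(RB), std::move(CS));

        std::printf("collide result = %d\n", collide.get());   // prints: 17
        return 0;
    }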

Slide 35: IN PSEUDOCODE

    CPU control:
        batch rigid bodies and RB/cloth pairs
        dispatch cloth solver to GPU
        dispatch cloth/rigid-body collision solver to GPU, pending event
        foreach rigid body batch:
            foreach rigid body pair in batch:
                compute force
                update position and velocity of rigid body
        signal GPU event
        foreach rigid body not involved in cloth collision:
            update positions
        return to next iteration

    Cloth solver (GPU):
        for iteration towards convergence:
            foreach cloth link batch:
                foreach cloth link subbatch:
                    update positions and velocities

    Cloth/RB solver (GPU):
        foreach batch of RB/cloth pairs:
            read rigid body data directly from the data structures used by the CPU
            test cloth against RB and update cloth
            read/write updates to global data (relies on memory visibility rules guaranteed by HSA)

Slide 36: EXAMPLE PROBLEMS: Tree search

Slide 37: NESTED DATA PARALLELISM AND EFFICIENT EXECUTION OF UNSTRUCTURED DATA
- Perfectly balanced trees are easy:
  – If the tree is regularly rebalanced and stored contiguously, data can be moved where needed
- Large, poorly balanced trees are harder:
  – Layout is ambiguous, so copying data is challenging
  – The amount of parallelism is unpredictable
- One approach to deal with this, on a single node:
  – Fine-grained tasking
  – Shared memory infrastructure
  – Picture breadth-first search through FIFO queues
- Example: UTS (Unbalanced Tree Search):
  – Count the number of nodes in an implicitly constructed tree
  – The tree is parameterized in shape, depth, size, and imbalance

Slides 38-43: SEARCHING A TREE
- As we move through the tree:
  – Unpredictable amount of parallelism
  – Unpredictable dependence structure
- Launch tasks as tree space is available (see the sketch below):
  – Perform BFS, queuing into a buffer
  – When the buffer reaches a certain size, launch the processing code
  – Slowly increase the launch batch size to improve efficiency
[Figure: animation across slides 38-43 of the tree being explored and batches launched as the buffer fills]
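
A minimal sketch of the buffer-and-launch pattern: BFS expands nodes into a staging buffer, and whenever the buffer reaches the current threshold a batch is "launched" (a plain function call stands in for a device dispatch), with the threshold growing to amortize launch overhead. All names are illustrative:

    #include <cstdio>
    #include <queue>
    #include <vector>

    struct Node { std::vector<Node*> children; };

    // Stand-in for dispatching one batch of nodes to a device queue.
    static void launch_batch(const std::vector<Node*> &batch) {
        std::printf("launched batch of %zu nodes\n", batch.size());
    }

    static void search(Node *root) {
        std::queue<Node*> frontier;          // BFS frontier
        std::vector<Node*> buffer;           // staging buffer for the next launch
        size_t threshold = 2;                // starts small, grows over time
        frontier.push(root);
        while (!frontier.empty()) {
            Node *n = frontier.front(); frontier.pop();
            buffer.push_back(n);
            for (Node *c : n->children) frontier.push(c);
            if (buffer.size() >= threshold) {        // enough work: launch it
                launch_batch(buffer);
                buffer.clear();
                threshold *= 2;              // bigger batches as the tree widens
            }
        }
        if (!buffer.empty()) launch_batch(buffer);   // flush the tail
    }

    int main() {
        // A small tree: root with three children, each with two children.
        std::vector<Node> leaves(6);
        Node a{{&leaves[0], &leaves[1]}}, b{{&leaves[2], &leaves[3]}}, c{{&leaves[4], &leaves[5]}};
        Node root{{&a, &b, &c}};
        search(&root);
        return 0;
    }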


Slide 44: EXTENDING THE TREE
- Many tree search algorithms expand the tree as time goes on
  – A lot of overhead in the absence of shared memory
  – With shared memory we can be searching parts of the tree while adding to others
- Example: Multiresolution Analysis (MRA), a mathematical technique for approximating a continuous function as a hierarchy of coefficients over a set of basis functions
  – Characterized by dynamic adaptivity to guarantee the accuracy of the approximation
  – Challenges include:
    - The coefficient trees are unbalanced on account of the adaptive multiresolution properties, leading to different scales of information granularity
    - The tree structure may be refined in an uncoordinated fashion: different parts of the tree may be refined independently, and the intervals of such refinement are not preset

Slide 45: WHY HSA?
- The dynamically adaptive nature means an unpredictable amount of parallelism and an unpredictable dependence structure
  – Do not want to copy sections of the array to the GPU "just in case"
  – Accessing targeted virtual addresses as necessary reduces the overhead
  – The HSA runtime exposes this architected feature and permits use of any memory, either as an argument or otherwise, in a task
- Tremendous nesting of parallelism is inherent in the problem
  – A single search node can lead to a very large number of additional searches
  – The ability to nest efficiently is key, and triggering searches based on grouping is important
  – HSA allows device-to-device enqueue, which permits nested parallelism
- Significant load imbalance is possible
  – Need to group searches and trigger them when they reach a certain size; support for dataflow via channels makes this possible
  – Need to rebalance what is already launched, due to the unpredictable amount of parallelism; queues are in user mode, and balancing is enabled by the architected features that allow user-level access to a queue

Slide 46: HOW DO YOU BALANCE?
- Several user-mode queues
  – The number of nodes each queue starts with does not represent the real load (imbalance)

Slide 47: IN PSEUDOCODE

    Channel-based variant:
        Control process:
            dispatch(unbalanced_tree_kernel, root, …)
        GPU/CPU unbalanced_tree_kernel:
            for the next n-1 levels:
                count
            for each child at level n:
                insert child into parse channel

    Recursive-dispatch variant:
        Control process:
            dispatch(unbalanced_tree_kernel, root, …)
            balance and terminate
        GPU/CPU unbalanced_tree_kernel:
            for the next n-1 levels:
                count
            for each child at level n:
                dispatch(unbalanced_tree_kernel, child, …)
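
A runnable single-threaded model of the channel-based variant: a std::vector stands in for the channel and the loop body for the on-demand consumer launches; the real opp::Channel machinery from slide 25 would trigger the consumer automatically once a packet accumulates. The node type and PACKET_SIZE value are illustrative:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Node { std::vector<Node*> children; };

    static const size_t PACKET_SIZE = 2;

    int main() {
        // A small tree: root with two children, each with two children (7 nodes).
        std::vector<Node> leaves(4);
        Node a{{&leaves[0], &leaves[1]}}, b{{&leaves[2], &leaves[3]}};
        Node root{{&a, &b}};

        std::vector<Node*> channel{&root};   // pending subtree roots
        unsigned nodeCount = 0;

        while (!channel.empty()) {
            // "Launch" a consumer over one packet (or the tail) of queued nodes.
            size_t batch = std::min(PACKET_SIZE, channel.size());
            std::vector<Node*> packet(channel.end() - batch, channel.end());
            channel.resize(channel.size() - batch);
            for (Node *n : packet) {
                ++nodeCount;                          // the "count" step
                for (Node *c : n->children)
                    channel.push_back(c);             // children back into the channel
            }
        }
        std::printf("nodes = %u\n", nodeCount);       // prints: nodes = 7
        return 0;
    }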

Slide 48: CONCLUSIONS
- HSA has several architected features that can improve programmability, and an ecosystem that exposes these features to users effectively
- The HSA runtime is how these features are exposed to higher-level programming models
  – Composability is possible: a new higher-level model can be composed from multiple architected features
- Channels are a unique technology made possible by HSA
  – Channels enable many applications that need a dataflow model or dataflow features
- Cloth simulation and collision detection is an example that shows how several HSA features both simplify the solution and avoid unnecessary costs typically involved in using GPUs for this problem
- Unbalanced tree search is a domain with an unpredictable amount of parallelism, a major load-balancing problem, and a need to adjust task granularity
  – HSA features significantly simplify this and allow a natural solution to the problem
  – Channels address adjusting the granularity of launches by allowing a dataflow pattern that launches a task when data-dependent criteria are met

Slide 49: DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. © 2012 Advanced Micro Devices, Inc.

