
2 HSA for the common man
Vinod Tipparaju and Lee Howes
Heterogeneous System Software, AMD

3 The heterogeneous system architecture: taking it to programmers

4 OpenCL™ and HSA
HSA is an optimized platform architecture for OpenCL™
Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from:
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
HSA also exposes a lower level programming interface, for those that want the ultimate in control and performance
Optimized libraries may choose the lower level interface
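For readers who want to see what pointer sharing looks like from the OpenCL™ side, here is a minimal host-code sketch using the shared virtual memory calls that OpenCL 2.0 standardized (clSVMAlloc, clSetKernelArgSVMPointer). It is not from the presentation: error checking is omitted, and an OpenCL 2.0 capable platform with a GPU device is assumed.

#include <CL/cl.h>
#include <stdio.h>

int main() {
    // Boilerplate setup: first platform, first GPU device, a context and a queue.
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, NULL, &err);

    // A trivial kernel that works on a plain pointer.
    const char *src =
        "__kernel void inc(__global int *data) { data[get_global_id(0)] += 1; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "inc", &err);

    // One allocation, visible to host and device through the same pointer.
    size_t n = 1024;
    int *data = (int *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(int), 0);

    // Coarse-grained SVM: map before the host touches it, unmap before the device does.
    clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(int), 0, NULL, NULL);
    for (size_t i = 0; i < n; ++i) data[i] = (int)i;
    clEnqueueSVMUnmap(q, data, 0, NULL, NULL);

    // The pointer itself is the kernel argument: no cl_mem buffer, no copy.
    clSetKernelArgSVMPointer(k, 0, data);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(q);

    clEnqueueSVMMap(q, CL_TRUE, CL_MAP_READ, data, n * sizeof(int), 0, NULL, NULL);
    printf("data[10] = %d\n", data[10]);   // expect 11
    clEnqueueSVMUnmap(q, data, 0, NULL, NULL);
    clSVMFree(ctx, data);
    return 0;
}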

5 HSA taking platform to programmers
Balance between CPU and GPU for performance and power efficiency
Make GPUs accessible to a wider audience of programmers
Programming models close to today's CPU programming models
Enabling more advanced language features on GPU
Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc.) and hence more applications on GPU
Kernels can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU), enabling task-graph style algorithms, ray tracing, etc.
A clearly defined HSA memory model enables effective reasoning for parallel programming
HSA provides a compatible architecture across a wide range of programming models and HW implementations.

6 HSA Software Stack
How do we deliver the HSA value proposition? Overall vision:
Make GPU easily accessible: support mainstream languages, expandable to domain specific languages
Make compute offload efficient: direct path to GPU (avoid graphics overhead), eliminate memory copy, low-latency dispatch
Make it ubiquitous: drive HSA as a standard through the HSA Foundation, open source key components
Stack diagram (top to bottom): Applications; application and system languages, domain specific languages, etc. (e.g. OpenCL™, C++ AMP, Python, R, JS, etc.); LLVM IR; HSAIL; HSA Runtime; HSA Hardware

7 HSA execution model via HSA runtime
Diagram: several user-mode work queues ("user write, device read") feeding the HSA Runtime.
User-mode work queues: uniform abstraction across devices, simple insertion mechanism
Multi-level parallelism -- within a queue and across queues
Simple parallelism specifier: Range/Grid, and group
HW specifics have a simple abstraction, analogous to programming based on cache-line size
Implicit preemption -- launch and execute multiple tasks simultaneously
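To make the "user write, device read" idea concrete, here is a toy host-only model of a user-mode queue in plain C++ (nothing here is the HSA runtime API; the packet layout and names are illustrative): a ring of dispatch packets plus a write index that the application bumps, with a thread standing in for the device's packet processor.

#include <array>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

struct DispatchPacket {            // simplified stand-in for an architected packet
    void (*kernel)(int);           // what to run
    int arg;                       // its argument
};

struct UserModeQueue {
    std::array<DispatchPacket, 64> ring{};
    std::atomic<std::uint64_t> write_index{0};   // bumped by the application ("user write")
    std::atomic<std::uint64_t> read_index{0};    // bumped by the device ("device read")

    void dispatch(void (*kernel)(int), int arg) {
        std::uint64_t slot = write_index.load(std::memory_order_relaxed);
        ring[slot % ring.size()] = {kernel, arg};                // fill the packet...
        write_index.store(slot + 1, std::memory_order_release);  // ...then publish it
    }
};

void say(int x) { std::printf("packet %d\n", x); }

int main() {
    UserModeQueue q;
    std::thread device([&] {                                     // stands in for the packet processor
        while (q.read_index.load() < 4) {
            std::uint64_t r = q.read_index.load(std::memory_order_relaxed);
            if (r < q.write_index.load(std::memory_order_acquire)) {
                DispatchPacket p = q.ring[r % q.ring.size()];
                p.kernel(p.arg);
                q.read_index.store(r + 1, std::memory_order_release);
            }
        }
    });
    for (int i = 0; i < 4; ++i) q.dispatch(say, i);              // simple insertion, no OS call
    device.join();
}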

8 HSA memory model via HSA runtime
Key concepts:
Simplified view of memory
Sharing pointers across devices makes it possible to run a task on any device
Possible to use pointers and data structures that require pointer chasing correctly across device boundaries
Relaxed consistency memory model: acquire/release, barriers
HSA Runtime exposes allocation interfaces with control over memory attributes
Types of memories can be mixed and matched based on usage needs
Simplified launches – dispatch(task, arg1, arg2, …)
Run device tasks with stack memory
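The acquire/release vocabulary above is the same discipline that C++11 atomics expose, so a small host-only analogy (standard C++, not HSAIL) can show how a producer publishes a pointer-containing structure with a release store and a consumer safely chases the pointer after an acquire load.

#include <atomic>
#include <cassert>
#include <thread>

struct Node { int value; Node* next; };

std::atomic<Node*> published{nullptr};

void producer() {
    Node* n = new Node{42, nullptr};                 // ordinary writes build the structure...
    published.store(n, std::memory_order_release);   // ...and the release makes them visible
}

void consumer() {
    Node* n = nullptr;
    while ((n = published.load(std::memory_order_acquire)) == nullptr) { }
    assert(n->value == 42);   // the acquire guarantees the producer's writes are seen
    delete n;
}

int main() {
    std::thread a(producer), b(consumer);
    a.join();
    b.join();
}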

9 Architected queues and device-to-device enqueue
Kernel can enqueue work to any other queue in the system (e.g. GPU->GPU, GPU->CPU)
Enabling task-graph style algorithms, Ray-Tracing, etc
Queue is an architected feature: the format of what represents a queue is architected, and methods to enqueue follow
Decoupled from the HSAIL language
Unique way of dynamically specifying where enqueues go
Resolution at the time of execution permits many load-balancing solutions
Diagram: user write, device read.
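A rough host-side model of device-to-device enqueue, written in standard C++ with two worker threads standing in for two devices (the queue type here is illustrative, not the architected HSA queue format): a task running on one queue pushes follow-up work onto another queue chosen at run time, so a task graph can hop between devices without returning to the host.

#include <atomic>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

struct Queue {
    std::mutex m;
    std::queue<std::function<void()>> work;
    void enqueue(std::function<void()> f) {
        std::lock_guard<std::mutex> g(m);
        work.push(std::move(f));
    }
    bool try_run() {
        std::function<void()> f;
        {
            std::lock_guard<std::mutex> g(m);
            if (work.empty()) return false;
            f = std::move(work.front());
            work.pop();
        }
        f();
        return true;
    }
};

int main() {
    Queue gpu, cpu;
    std::atomic<bool> done{false};

    // A "GPU kernel" that, from inside its own execution, enqueues to the CPU queue.
    gpu.enqueue([&] {
        std::printf("GPU task: producing follow-up work\n");
        cpu.enqueue([&] { std::printf("CPU task: consuming it\n"); done = true; });
    });

    std::thread gpu_device([&] { while (!done) gpu.try_run(); });
    std::thread cpu_device([&] { while (!done) cpu.try_run(); });
    gpu_device.join();
    cpu_device.join();
}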

10-11 Architected queues and device-to-device enqueue (repeat of slide 9; the diagram progresses from user write/device read to device write/device read)

12 Abstracting architected features
Example snippets:
CreateQueue(ptr, size, …)
for(i=1,n) queue.dispatch(kernel, args, dep n+i, n-i)
queue.dispatch(1minuteKernel, args); event.wait(); event.getExceptionDetails()
call CPU_FUNCTION_FROM_GPU *fptr(….)
queue.dispatch(kernel, iptr); *iptr = 2
queue.dispatch(kernel_set_i_value_1, iptr); while(i==1)
HSAAllocate(1)
LDS/GDS as virtual memory; access any address from host/kernel; do atomics on the queue, in host and in kernel; channels
Architected features exposed: User Mode Queuing, Context Switching, Process Reset (to avoid TDRs), HW Exceptions, Function calls, Virtual functions, Memory Coherence, Unpinned Memory Access (for DMA and Compute Shader), Flat Address Space, Unaligned Addressing / Memory Access, Platform Atomic Operations, Memory Watchpoints

13 Enabling different kinds of problem domains
Diagram: the stack (Applications; application and system languages, domain specific languages, e.g. OpenCL™, C++ AMP, Python, R, JS, etc.; LLVM IR; HSAIL; HSA Runtime; HSA Hardware) alongside the HSA building blocks (memory model, HSAIL language, execution model, architected features). Programming models utilize a combination of these characteristics per application requirements: for example, one model composes architected features 1, 2, and 3; another composes features 1, 3, and 4; a third composes features 1 and 4.

14 Exposing dataflow through device-side enqueue

15 Channels - Persistent Control Processor Threading Model
Add data-flow support to GPGPU
We are not primarily notating this as producer/consumer kernel bodies; that is, we are not promoting a method where one kernel loops producing values and another loops to consume them
That has the negative behavior of promoting long-running kernels; we've tried to avoid this elsewhere by basing in-kernel launches around continuations rather than waiting on children
Instead we assume that kernel entities produce/consume, but consumer work-items are launched on demand
An alternative to point-to-point data flow using persistent threads, avoiding the uber-kernel

16-24 Operational Flow (a sequence of diagram-only slides)

25 Channel example
std::function<bool (opp::Channel<int>*)> predicate =
    [] (opp::Channel<int>* c) -> bool __device(fql) {
        // Fire the consumer whenever a whole packet's worth of data is queued.
        return c->size() % PACKET_SIZE == 0;
    };

opp::Channel<int> b(N);
b.executeWith(
    predicate,
    opp::Range<1>(CHANNEL_SIZE),
    [&sumB] (opp::Index<1>) __device(opp) { sumB++; });

opp::Channel<int> c(N);
c.executeWith(
    predicate,
    opp::Range<1>(CHANNEL_SIZE),
    [&sumC] (opp::Index<1>, const int v) __device(opp) { sumC += v; });

opp::parallelFor(
    opp::Range<1>(N),
    [a, &b, &c] (opp::Index<1> index) __device(opp) {
        unsigned int n = *(a + index.getX());
        if (n > 5) {
            b.write(n);   // route large values to channel b
        } else {
            c.write(n);   // route the rest to channel c
        }
    });

26 Example problems: rigid body/cloth collision

27 Cloth simulation and collision detection
Physics simulation has a range of properties
Rigid body simulation is often: not highly parallel, very dynamic, and not necessarily a good match for wide SIMD architectures
Cloth simulation is: highly parallel; while meshes are complicated, connectivity is largely static

28 Efficient GPU cloth simulation: two-level batching
Offline static batching of the mesh
Create independent subsets of links through graph coloring
Synchronize between batches
10 batches: we need no fewer than 10 batches to fully cover the mesh, i.e. 10 sets of link processing, each internally parallel but processed in series with synchronization points in between
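A minimal sketch of the graph-coloring step, assuming links are simply pairs of particle indices (the five-link mesh below is made up, not the presenter's data): greedily place each link in the first batch where it shares no particle with an already placed link, so every batch can be solved in parallel and batches run in series.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    using Link = std::pair<int, int>;                    // two particle indices
    std::vector<Link> links = {{0, 1}, {1, 2}, {2, 3}, {0, 2}, {1, 3}};

    std::vector<int> batch_of(links.size(), -1);
    int num_batches = 0;
    for (std::size_t i = 0; i < links.size(); ++i) {
        for (int b = 0; ; ++b) {                         // first batch with no shared particle
            bool conflict = false;
            for (std::size_t j = 0; j < i; ++j) {
                if (batch_of[j] != b) continue;
                if (links[j].first == links[i].first || links[j].first == links[i].second ||
                    links[j].second == links[i].first || links[j].second == links[i].second) {
                    conflict = true;
                    break;
                }
            }
            if (!conflict) {
                batch_of[i] = b;
                num_batches = std::max(num_batches, b + 1);
                break;
            }
        }
    }
    // Each batch is one parallel solve; a synchronization point sits between batches.
    std::printf("links fall into %d independent batches\n", num_batches);
}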

29 Efficient GPU cloth simulation: Batching
Chunk the mesh into larger groups of links
Batch those chunks: 4 global dispatches
Iterate within the workgroups: 8 secondary batches

30 Collision with rigid body
Small set of rigid bodies
Rigid bodies best computed on the CPU
Cloth on GPU

31 Options
Diagram: RB solve, Cloth solve, Cloth/RB collide, each assignable to CPU, GPU, or either
Small launches of rigid body/cloth collisions against cloth, or process rigid body/cloth collisions on the CPU
On GPU: small launches suffer dispatch overhead, and rigid body data structures must be updated from the GPU
On CPU: cloth mesh data must continuously move to and from the GPU

33 Why HSA?
Colliding rigid bodies are likely to be very sparse in memory
Do not want to copy the rigid body array to the GPU "just in case", or even incur OS page-lock overhead
Accessing targeted virtual addresses as necessary reduces the overhead; the HSA runtime exposes the architected feature and permits use of any memory either as an argument or otherwise in a task
Operations are in a tight loop, so overhead from dispatch grows quickly; user-mode queuing reduces this significantly, and architected queues are exposed via a simple API
Do not want to transform rigid body code: it is a common problem to make vast, confusing transformations to host code to enable wide vector processing; the shared pointer model enables access to those structures directly rather than restructuring them

34 How do you trigger Cloth/RB collide?
Diagram: RB solve, Cloth solve, Cloth/RB collide across CPU and GPU
Option 1 (host-driven wait):
CS = dispatch(cloth_solve, x, y, …)
CPU does RB_solve(p, q, …)
Wait for CS
dispatch(cloth_RB_collide, x, p, …)
Option 2 (dependency-driven dispatch):
RB = dispatch(RB_solve, p, q, …)
CS = dispatch(cloth_solve, x, y, …)
Collide = dispatch(cloth_RB_collide, x, p, …, RB & CS)
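As a host-side analogy of the dependency-driven form (std::async and futures stand in for dispatch and for the "RB & CS" dependency; the solver bodies are placeholders), the collide step runs only once both solves have completed:

#include <cstdio>
#include <future>

void RB_solve()         { std::printf("rigid bodies solved\n"); }
void cloth_solve()      { std::printf("cloth solved\n"); }
void cloth_RB_collide() { std::printf("cloth/RB collision resolved\n"); }

int main() {
    auto RB = std::async(std::launch::async, RB_solve);
    auto CS = std::async(std::launch::async, cloth_solve);
    RB.wait();   // "RB & CS": the collide dispatch depends on both
    CS.wait();
    auto collide = std::async(std::launch::async, cloth_RB_collide);
    collide.wait();
}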

35 In pseudocode
batch rigid bodies and RB/cloth pairs
dispatch cloth solver to GPU
dispatch cloth/rigid body collision solver to GPU, pending event
foreach rigid body batch:
    for rigid body pair in batch:
        compute force
        update position and velocity of rigid body
signal GPU event
foreach rigid body not involved in cloth collision:
    update positions
return to next iteration

Cloth solver:
for iteration towards convergence:
    foreach cloth link batch:
        foreach cloth link subbatch:
            update positions and velocities

Cloth/RB solver:
for each batch of RB/Cloth pairs:
    read rigid body data directly from data structures used by the CPU
    test cloth against RB and update cloth
    read/write update to global data (relies on memory visibility rules guaranteed by HSA)

36 Example problems: tree search

37 Nested data parallelism and efficient execution of unstructured data
Perfectly balanced trees are easy: if the tree is regularly rebalanced and stored contiguously then data may be moved around where needed
Large, poorly balanced trees are harder: layout is ambiguous so copying data is challenging, and the amount of parallelism is unpredictable
One approach on a single node: fine-grained tasking over a shared memory infrastructure; picture breadth-first search through FIFO queues
Example: UTS (Unbalanced Tree Search): count the number of nodes in an implicitly constructed tree that is parameterized in shape, depth, size, and imbalance

38 Searching a tree
As we move through the tree:
Unpredictable amount of parallelism
Unpredictable dependence structure
Launch tasks as tree space is available:
Perform BFS, queuing into a buffer
When the buffer reaches a certain size, launch processing code

40 Searching a tree
As we move through the tree:
Unpredictable amount of parallelism
Unpredictable dependence structure
Launch tasks as tree space is available:
Perform BFS, queuing into a buffer
When the buffer reaches a certain size, launch processing code
Slowly increase the launch batch size to improve efficiency
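A small sketch of this launch strategy in plain C++ (the branching rule and the process_batch stub, which stands in for a GPU dispatch, are made up): breadth-first expansion fills a buffer, and whenever the buffer reaches the current threshold a launch is issued and the threshold is doubled to improve efficiency.

#include <cstdio>
#include <deque>
#include <vector>

struct Node { int depth; };

std::vector<Node> children_of(const Node& n) {         // made-up branching rule
    if (n.depth >= 6) return {};
    return std::vector<Node>(1 + n.depth % 3, Node{n.depth + 1});
}

void process_batch(const std::vector<Node>& batch) {   // stands in for a GPU launch
    std::printf("launch over %zu nodes\n", batch.size());
}

int main() {
    std::deque<Node> frontier = {Node{0}};
    std::vector<Node> buffer;
    std::size_t threshold = 4;                          // small launches at first

    while (!frontier.empty()) {
        Node n = frontier.front();
        frontier.pop_front();
        for (const Node& c : children_of(n)) frontier.push_back(c);   // BFS expansion
        buffer.push_back(n);
        if (buffer.size() >= threshold) {               // enough work queued: dispatch it
            process_batch(buffer);
            buffer.clear();
            threshold *= 2;                             // grow batch size to amortize overhead
        }
    }
    if (!buffer.empty()) process_batch(buffer);         // leftover work
}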

44 Extending the tree
Many tree search algorithms expand the tree as time goes on
A lot of overhead in the absence of shared memory; with shared memory we can be searching parts of the tree while adding to others
Example: Multiresolution Analysis (MRA), a mathematical technique for approximating a continuous function as a hierarchy of coefficients over a set of basis functions, characterized by dynamic adaptivity to guarantee the accuracy of approximation
Challenges include:
The coefficient trees are unbalanced on account of the adaptive multiresolution properties, leading to different scales of information granularity
The tree structure may be refined in an uncoordinated fashion: different parts of the tree may be refined independently, and the intervals of such refinement are not preset

45 Why HSA?
The dynamic, adaptive nature means an unpredictable amount of parallelism and an unpredictable dependence structure
Do not want to copy the sections array to the GPU "just in case": accessing targeted virtual addresses as necessary reduces the overhead, and the HSA runtime exposes the architected feature and permits use of any memory either as an argument or otherwise in a task
Tremendous nesting of parallelism is inherent in the problem: a single search node can lead to a very large number of additional searches, so the ability to nest efficiently is key and triggering searches based on grouping is important; HSA allows device-to-device enqueue, which permits nested parallelism
Significant load imbalance is possible: searches need to be grouped and triggered when they reach a certain size; support for dataflow via channels makes this possible
Need to balance what is already launched due to the unpredictable amount of parallelism: queues are in user mode, and balancing is enabled by the architected features that allow user-level access to a queue

46 How do you balance?
Several user-mode queues
The number of nodes to start with doesn't represent the real load (imbalance)

47 In pseudocode
Channel variant:
Control process:
    dispatch(unbalanced_tree_kernel, root, ….)
GPU/CPU unbalanced_tree_kernel:
    for the next n-1 levels:
        count
    for each child at level n:
        insert child into parse channel

Recursive-dispatch variant:
Control process:
    dispatch(unbalanced_tree_kernel, root, ….)
    balance and terminate
GPU/CPU unbalanced_tree_kernel:
    for the next n-1 levels:
        count
    for each child at level n:
        dispatch(unbalanced_tree_kernel, child, ….)

48 Conclusions
HSA has several architected features that can improve programmability, and an ecosystem that exposes these features to users effectively
The HSA runtime is how these features are exposed to higher-level programming models
Composability is possible: a new higher-level model can be composed of multiple architected features
Channels are a unique technology made possible with HSA; they enable many applications that need a dataflow model or dataflow features
Cloth simulation and collision detection is an example that shows how several HSA features both simplify the solution and avoid unnecessary costs typically involved in using GPUs for this problem
Unbalanced tree search is a domain with an unpredictable amount of parallelism, a major load-balancing problem, and a need to adjust task granularity; HSA features significantly simplify and allow a natural solution to this problem
Channels address adjusting the granularity of launches by allowing a dataflow pattern that launches a task when certain data-dependent criteria are met

49 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. © 2012 Advanced Micro Devices, Inc.

