
1 Fine-grain Task Aggregation and Coordination on GPUs
Marc S. Orr †§, Bradford M. Beckmann §, Steven K. Reinhardt §, David A. Wood †§
ISCA, June 16, 2014

2 | Fine-grain Task Aggregation and Coordination on GPUs | ISCA, June 16, 2014

Executive Summary
- SIMT languages (e.g., CUDA and OpenCL) restrict GPU programmers to regular parallelism
  - Compare to Pthreads, Cilk, MapReduce, TBB, etc.
- Goal: enable irregular parallelism on GPUs
  - Why? More GPU applications
  - How? Fine-grain task aggregation
  - What? Cilk on GPUs

3 Outline
- Background
  - GPUs
  - Cilk
  - Channel abstraction
- Our work
  - Cilk on channels
  - Channel design
- Results/conclusion

4 GPUs Today
- GPU tasks are scheduled by the control processor (CP), a small, in-order programmable core
- Today's GPU abstractions are coarse-grain
  + Maps well to SIMD hardware
  - Limits fine-grain scheduling
[Diagram: CP and SIMD units on the GPU, connected to system memory]

5 Cilk Background
- Cilk extends C for divide-and-conquer parallelism
- Adds keywords:
  - spawn: schedule a thread to execute a function
  - sync: wait for prior spawns to complete

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
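A standard property of Cilk worth noting here is serial elision: deleting spawn and sync from a Cilk routine leaves a valid sequential C program that computes the same result. A minimal sketch of the elided fib from the slide:

```c
/* Serial elision of the slide's fib: with `spawn` and `sync`
 * removed, this is plain C and computes the same values the
 * parallel version would. */
int fib(int n) {
    if (n <= 2) return 1;
    int x = fib(n - 1);   /* was: spawn fib(n - 1) */
    int y = fib(n - 2);   /* was: spawn fib(n - 2) */
    /* was: sync */
    return x + y;
}
```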

6 Prior Work on Channels
- The CP, or aggregator (agg), manages channels
- Channels are finite task queues, except:
  1. User-defined scheduling
  2. Dynamic aggregation
  3. One consumption function
- Dynamic aggregation enables "CPU-like" scheduling abstractions on GPUs
[Diagram: GPU SIMD units and the aggregator sharing channels in system memory]
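As a rough illustration of the abstraction (the names and layout here are ours, not the paper's API), a channel can be pictured as a finite, array-backed task queue whose pending tasks are dispatched in one batch through the channel's single consumption function:

```c
#include <stddef.h>

#define CHAN_CAP 1024

typedef struct { int arg; } task_t;
typedef void (*consume_fn)(task_t *tasks, size_t n);  /* one per channel */

typedef struct {
    task_t     buf[CHAN_CAP];  /* finite, array-backed storage */
    size_t     count;          /* tasks waiting to be dispatched */
    consume_fn consume;        /* the channel's single consumption function */
} channel_t;

/* Producer side: returns 0 when the finite queue is full. */
int channel_push(channel_t *c, task_t t) {
    if (c->count == CHAN_CAP) return 0;
    c->buf[c->count++] = t;
    return 1;
}

/* Aggregator side: dispatch everything pending as one batch --
 * this batching is what "dynamic aggregation" refers to. */
void channel_drain(channel_t *c) {
    if (c->count > 0) {
        c->consume(c->buf, c->count);
        c->count = 0;
    }
}

static int g_sum = 0;
static void sum_tasks(task_t *tasks, size_t n) {
    for (size_t i = 0; i < n; i++) g_sum += tasks[i].arg;
}

/* Tiny demo: aggregate three tasks, dispatch them once. */
int channel_demo(void) {
    channel_t c = { .count = 0, .consume = sum_tasks };
    for (int i = 1; i <= 3; i++) channel_push(&c, (task_t){ i });
    channel_drain(&c);
    return g_sum;  /* 1 + 2 + 3 */
}
```

A real channel is shared by thousands of SIMT lanes, so the paper's version cannot use this single-threaded counter; the lock-free variant is sketched under "Channel Implementation" later in the deck.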

7 Outline
- Background
  - GPUs
  - Cilk
  - Channel abstraction
- Our work
  - Cilk on channels
  - Channel design
- Results/conclusion

8 Enable Cilk on GPUs via Channels (Step 1)
- Cilk routines are split by sync into sub-routines

  Original:
    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }

  "pre-sync":
    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
    }

  "continuation":
    int fib_cont(int x, int y) {
        return (x + y);
    }

9 Enable Cilk on GPUs via Channels (Step 2)
- Channels are instantiated for breadth-first traversal
  - Quickly populates the GPU's tens of thousands of lanes
  - Facilitates coarse-grain dependency management
[Diagram: a stack of fib channels above a fib_cont channel; a spawned task B depends on its parent task A, and a continuation becomes ready once its pre-sync tasks are done]
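One conventional way to realize the dependency rule on this slide is a join counter on each continuation record: the continuation becomes ready when the last of its pre-sync spawns completes. The naming below is hypothetical, and a real GPU version would decrement the counter atomically rather than in this single-threaded sketch:

```c
/* Illustrative continuation record (our naming, not the paper's):
 * `pending` counts outstanding pre-sync spawns; the child that
 * drops it to zero marks the continuation ready for its channel. */
typedef struct {
    int pending;  /* outstanding pre-sync spawns (2 for fib) */
    int args[2];  /* result slots the children fill in (x and y) */
    int ready;    /* set when the continuation can be enqueued */
} continuation_t;

/* Called once per completed child; returns 1 if the continuation
 * just became ready. */
int child_done(continuation_t *k, int slot, int value) {
    k->args[slot] = value;
    if (--k->pending == 0) { k->ready = 1; return 1; }
    return 0;
}
```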

10 Bound Cilk's Memory Footprint
- Bound memory to the depth of the Cilk tree by draining channels closer to the base case
  - The amount of work generated dynamically is not known a priori
- We propose that GPUs allow SIMT threads to yield
  - Facilitates resolving conflicts on shared resources like memory
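The draining policy can be sketched as follows, assuming for illustration one channel per recursion depth: the scheduler always picks the deepest non-empty channel, so execution works toward the base case and the number of live tasks stays bounded by the depth of the Cilk tree.

```c
/* Illustrative scheduling policy (an assumption of this sketch,
 * not the paper's exact structure): counts[d] is the number of
 * tasks pending at recursion depth d. Return the depth to drain
 * next -- the deepest non-empty channel -- or -1 if no work remains. */
int pick_channel(const int counts[], int n_depths) {
    for (int d = n_depths - 1; d >= 0; d--)
        if (counts[d] > 0) return d;
    return -1;
}
```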

11 Channel Implementation
- Our design accommodates SIMT access patterns
  + Array-based
  + Lock-free
  + Non-blocking
- See paper for details
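The full design is in the paper; as a hedged illustration of the array-based, lock-free flavor, the producer side of such a structure can reserve distinct slots with a single atomic fetch-add, so concurrent SIMT lanes never block one another:

```c
#include <stdatomic.h>

/* Sketch only: producer-side slot reservation for an array-based,
 * lock-free queue. All names are illustrative, not the paper's. */
#define SLOTS 4096

typedef struct {
    _Atomic unsigned next;  /* next free slot index */
    int data[SLOTS];
} lf_channel_t;

/* Each caller gets a distinct slot in one atomic step.
 * Returns the reserved slot, or -1 if the finite array is full
 * (a real version would have the lane back off or yield). */
int lf_reserve(lf_channel_t *c) {
    unsigned slot = atomic_fetch_add(&c->next, 1u);
    if (slot >= SLOTS) return -1;
    return (int)slot;
}
```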

12 Outline
- Background
  - GPUs
  - Cilk
  - Channel abstraction
- Our work
  - Cilk on channels
  - Channel design
- Results/conclusion

13 Methodology
- Implemented Cilk on channels on a simulated APU
  - Caches are sequentially consistent
  - The aggregator schedules Cilk tasks

14 Cilk Scales with the GPU Architecture
- More compute units → faster execution
[Figure: execution time vs. number of compute units]

15 Conclusion
- We observed that dynamic aggregation enables new GPU programming languages and abstractions
- We enabled dynamic aggregation by extending the GPU's control processor to manage channels
- We found that breadth-first scheduling works well for Cilk on GPUs
- We proposed that GPUs allow SIMT threads to yield, in support of breadth-first scheduling
- Future work: how the control processor can enable more GPU applications

16 Backup

17 Divergence and Channels
- Branch divergence
- Memory divergence
  + Data in channels: good
  - Pointers to data in channels: bad

18 GPU Not Blocked on Aggregator

19 GPU Cilk vs. Standard GPU Workloads
- Cilk is more succinct than SIMT languages
- Channels trigger more GPU dispatches

  Benchmark | LOC reduction | Dispatch rate | Speedup
  Strassen  | 42%           | 13x           | 1.06
  Queens    | 36%           | 12.5x         | 0.98

- Same performance, easier to program

20 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION © 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

