Presentation is loading. Please wait.

Presentation is loading. Please wait.

On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology.

Similar presentations


Presentation on theme: "On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology."— Presentation transcript:

1 On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology

2 Overview Motivation Methods Experimental evaluation Conclusion

3 The problem setting Work Task Offline Online

4 Static Load Balancing Processor

5 Static Load Balancing Processor Task

6 Static Load Balancing Processor Task

7 Static Load Balancing Processor Task Subtask

8 Static Load Balancing Processor Task Subtask

9 Dynamic Load Balancing Processor Task Subtask

10 Task sharing Work done? Try to get task New tasks? Perform task Got task? Add task Task Set No, retry Check condition Acquire Task Add Task No, continue Task Done

11 System Model CUDA Global Memory Gather and scatter Compare-And-Swap Fetch-And-Inc Multiprocessors Maximum number of concurrent thread blocks Multi- processor Thread Block Multi- processor Thread Block Multi- processor Thread Block Global Memory

12 Synchronization Blocking Uses mutual exclusion to only allow one process at a time to access the object. Lockfree Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps. Waitfree Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.

13 Load Balancing Methods Blocking Task Queue Non-blocking Task Queue Task Stealing Static Task List

14 Blocking queue TB 1 TB 2 TB n Free Head Tail

15 Blocking queue TB 1 TB 2 TB n Free Head Tail

16 Blocking queue T1T1 TB 1 TB 2 TB n Free Head Tail

17 Blocking queue T1T1 TB 1 TB 2 TB n Free Head Tail

18 Blocking queue T1T1 TB 1 TB 2 TB n Free Head Tail

19 Non-blocking Queue T1T1 T2T2 T3T3 T4T4 TB 1 TB 2 TB 1 TB 2 TB n Head Tail Reference P. Tsigas and Y. Zhang, A simple, fast and scalable non- blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]

20 Non-blocking Queue T1T1 T2T2 T3T3 T4T4 TB 1 TB 2 TB 1 TB 2 TB n Head Tail

21 Non-blocking Queue T1T1 T2T2 T3T3 T4T4 TB 1 TB 2 TB 1 TB 2 TB n Head Tail

22 Non-blocking Queue T1T1 T2T2 T3T3 T4T4 TB 1 TB 2 TB 1 TB 2 TB n Head Tail

23 Non-blocking Queue T1T1 T2T2 T3T3 T4T4 T5T5 TB 1 TB 2 TB 1 TB 2 TB n Head Tail

24 Non-blocking Queue T1T1 T2T2 T3T3 T4T4 T5T5 TB 1 TB 2 TB 1 TB 2 TB n Head Tail

25 Task stealing T1T1 T3T3 T2T2 TB 1 TB 2 TB n Reference Arora N. S., Blumofe R. D., Plaxton C. G., Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]

26 Task stealing T1T1 T4T4 T3T3 T2T2 TB 1 TB 2 TB n

27 Task stealing T1T1 T4T4 T5T5 T3T3 T2T2 TB 1 TB 2 TB n

28 Task stealing T1T1 T4T4 T3T3 T2T2 TB 1 TB 2 TB n

29 Task stealing T1T1 T3T3 T2T2 TB 1 TB 2 TB n

30 Task stealing T3T3 T2T2 TB 1 TB 2 TB n

31 Task stealing T2T2 TB 1 TB 2 TB n

32 Static Task List T1T1 T2T2 T3T3 T4T4 In

33 Static Task List T1T1 T2T2 T3T3 T4T4 In TB 1 TB 2 TB 3 TB 4

34 Static Task List T1T1 T2T2 T3T3 T4T4 In Out TB 1 TB 2 TB 3 TB 4

35 Static Task List T1T1 T2T2 T3T3 T4T4 T5T5 In Out TB 1 TB 2 TB 3 TB 4

36 Static Task List T1T1 T2T2 T3T3 T4T4 T5T5 T6T6 In Out TB 1 TB 2 TB 3 TB 4

37 Static Task List T1T1 T2T2 T3T3 T4T4 T5T5 T6T6 T7T7 In Out TB 1 TB 2 TB 3 TB 4

38 Octree Partitioning Bandwidth bound

39 Octree Partitioning Bandwidth bound

40 Octree Partitioning Bandwidth bound

41 Octree Partitioning Bandwidth bound

42 Four-in-a-row Computation intensive

43 Graphics Processors 8800GT 14 Multiprocessors 57 GB/sec bandwidth 9600GT 8 Multiprocessors 57 GB/sec bandwidth

44 Blocking Queue – Octree/9600GT

45 Blocking Queue – Octree/8800GT

46 Blocking Queue – Four-in-a-row

47 Non-blocking Queue – Octree/9600GT

48 Non-blocking Queue – Octree/8800GT

49 Non-blocking Queue - Four-in-a-row

50 Task stealing – Octree/9600GT

51 Task stealing – Octree/8800GT

52 Task stealing – Four-in-a-row

53 Static List

54 Octree Comparison

55 Previous work Korch M., Raubert T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003 Heirich A., Arvo J., A competetive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998 Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005

56 Conclusion Synchronization plays a significant role in dynamic load- balancing Lock-free data structures/synchronization scales well and looks promising also in the GPU general purpose programming Locks perform poorly It is good that operations such as CAS and FAA have been introduced in the new GPUs Work stealing could outperform static load balancing

57 Thank you! http://www.cs.chalmers.se/~dcs


Download ppt "On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology."

Similar presentations


Ads by Google