
1 Compiling Several Classes of Communication Patterns on a Multithreaded Architecture
Gagan Agrawal, Department of Computer and Information Sciences, Ohio State University
(Joint work with Rishi Kumar and Guang Gao at the University of Delaware)

2 Outline
- Motivation
- Problems considered: near-neighbor communication, global reduction, producer-consumer pattern
- Compiler analysis and cost model
- Code generation for combinations of patterns
- Results
- Conclusion

3 Motivation
- Communication volume and efficiency have a major impact on the performance of parallel code
- Multithreading can hide communication and synchronization costs by switching to a different ready thread
- Compiler-generated multithreaded code can provide ease of programming and good performance

4 Issues and Approach
- What multithreaded programming styles could be used for different types of programs?
- How can the compiler automatically generate these patterns?
- What kind of performance can be achieved?
- Focus on three types of communication patterns: near-neighbor, global reduction, and producer-consumer

5 Near Neighbor Communication
- Widely studied for distributed-memory compilation
- Typically uses collective communication
- Communication volume and frequency can limit parallel performance

    for (m = ...)
        for (n = ...)
            A(m,n) = f(B(m-1,n), B(m+1,n), B(m,n-1), B(m,n+1))
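As a concrete (hypothetical) instance of the loop nest above, the sketch below takes f to be the 4-point average of a Jacobi sweep; the array names, sizes, and input data are illustrative, not from the talk:

    #include <stdio.h>

    #define M 8
    #define N 8

    double a[M][N], b[M][N];   /* A and B from the slide */

    int main(void) {
        for (int m = 0; m < M; m++)
            for (int n = 0; n < N; n++)
                b[m][n] = m + n;           /* arbitrary input data */
        /* each interior A(m,n) reads its four near neighbors in B */
        for (int m = 1; m < M - 1; m++)
            for (int n = 1; n < N - 1; n++)
                a[m][n] = 0.25 * (b[m-1][n] + b[m+1][n] +
                                  b[m][n-1] + b[m][n+1]);
        printf("a[1][1] = %f\n", a[1][1]); /* prints 2.000000 */
        return 0;
    }

If the m dimension is partitioned across processors, only the first and last rows of each block need remote B values, which is exactly what the next slide exploits.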

6 Multithreaded Programming Style
[Diagram: processors 1 through n each run threads P_i^1, P_i^2, P_i^3, with neighboring threads chained within and across processors.]
Only the first and last threads on each processor need to wait for data.

7 Compilation and Performance Issues
- How can the compiler automatically generate the code?
- What thread granularity to choose?
- A larger number of threads means more work can proceed without waiting for data from other processors
- A smaller number of threads means less threading overhead

8 Global Reduction
Main issue: avoiding collective communication phases by using multithreading.
[Diagram: building block of the reduction tree -- DATA and SYNC arrive at a Reduce node, then DATA and SYNC leave through a Broadcast node.]

Example of sum reduction:
    real A(n);
    real sum = 0;
    for (j = 0; j < n; j++)
        S1: sum = sum + A(j);
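To make the building block concrete, here is a minimal sketch in C with POSIX threads (not the paper's generated code): each parent in the reduction tree waits only for its own children through point-to-point semaphores, rather than joining a global collective phase. T, N, and all names are illustrative, and T is assumed to be a power of two.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define T 4        /* threads  */
    #define N 16       /* elements */

    double a[N], partial[T];
    sem_t ready[T];    /* ready[i] posted once partial[i] is final */

    void *worker(void *arg) {
        int id = (int)(long)arg;
        int chunk = N / T;
        double s = 0.0;
        for (int j = id * chunk; j < (id + 1) * chunk; j++)
            s += a[j];                   /* local partial sum */
        partial[id] = s;

        /* combine up the tree: in round r, thread i absorbs i + 2^r */
        for (int step = 1; step < T; step *= 2) {
            if (id % (2 * step) != 0) {  /* this thread is a sender */
                sem_post(&ready[id]);
                return NULL;
            }
            sem_wait(&ready[id + step]); /* wait for one child only */
            partial[id] += partial[id + step];
        }
        printf("sum = %g\n", partial[0]); /* root holds the total */
        return NULL;
    }

    int main(void) {
        pthread_t th[T];
        for (int j = 0; j < N; j++) a[j] = 1.0;
        for (int i = 0; i < T; i++) sem_init(&ready[i], 0, 0);
        for (int i = 0; i < T; i++)
            pthread_create(&th[i], NULL, worker, (void *)(long)i);
        for (int i = 0; i < T; i++) pthread_join(th[i], NULL);
        return 0;
    }

The per-pair sem_post/sem_wait plays the role of the DATA + SYNC edge in the diagram: no thread blocks on data it does not directly consume.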

9 Producer Consumer Pattern
Example: sparse matrix-vector multiplication, A.v = q, where A is an NxN sparse matrix, v is an Nx1 vector, and q is the result.
Reduction implementations on p processors:
- Logarithmic tree approach: O(N log p)
- Pipelined reduction in 2p distinct phases: O(N)
Pipelined reduction eliminates the need for global barriers or synchronization.
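The slide does not spell out the pipelined schedule. One standard way to get roughly 2p phases and O(N) traffic per processor is a ring reduce-scatter (p - 1 phases) followed by a ring allgather (another p - 1 phases); whether this matches the paper's exact schedule is an assumption. The sketch below simulates the reduce-scatter half sequentially, with P, N, and all names purely illustrative:

    #include <stdio.h>

    #define P 4            /* processors on the ring        */
    #define N 8            /* length of the result vector q */
    #define CHUNK (N / P)  /* entries owned per processor   */

    double q[P][N];        /* q[i] = processor i's partial result */
    double msg[P][CHUNK];  /* chunks "in flight" during one phase */

    int main(void) {
        /* every processor contributes 1.0 to every entry */
        for (int i = 0; i < P; i++)
            for (int j = 0; j < N; j++)
                q[i][j] = 1.0;

        /* reduce-scatter: P-1 phases around the ring */
        for (int t = 0; t < P - 1; t++) {
            /* each processor forwards one chunk (pre-phase values) */
            for (int i = 0; i < P; i++) {
                int c = ((i - t) % P + P) % P;
                for (int j = 0; j < CHUNK; j++)
                    msg[i][j] = q[i][c * CHUNK + j];
            }
            /* each processor folds in its left neighbor's chunk */
            for (int i = 0; i < P; i++) {
                int src = (i - 1 + P) % P;
                int c = ((src - t) % P + P) % P;
                for (int j = 0; j < CHUNK; j++)
                    q[i][c * CHUNK + j] += msg[src][j];
            }
        }
        /* processor i now owns the fully reduced chunk (i+1) mod P;
           P-1 more phases (an allgather) would circulate the results */
        for (int i = 0; i < P; i++)
            printf("proc %d, chunk %d: %g (expected %d)\n",
                   i, (i + 1) % P, q[i][((i + 1) % P) * CHUNK], P);
        return 0;
    }

Each phase moves one CHUNK per processor and synchronizes only neighboring ring positions, so no global barrier is needed, matching the claim on the slide.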

10 Pipelining of MVM Reduction
[Diagram: pipelined stages of the sparse matrix-vector multiplication reduction.]

11 Cost Model for Thread Granularity
Total threading costs depend on:
- Threading overheads
- Time spent waiting for data over the network
For problem size NxN, p processors, and T threads per processor:
- Granularity G = N/(pT)
- Overhead of one thread creation = β
- Amount of computation in one row = C
Minimizing the total threading cost with respect to the number of threads T gives T = sqrt(2CN/(pβ)).
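The slide reports only the optimum. A plausible reconstruction of the minimization, assuming (this is not stated on the slide) that the waiting term is the work of the two boundary threads on each processor, i.e. 2CG = 2CN/(pT):

    f(T) = \beta T + \frac{2CN}{pT},
    \qquad
    \frac{df}{dT} = \beta - \frac{2CN}{pT^{2}} = 0
    \quad\Longrightarrow\quad
    T = \sqrt{\frac{2CN}{p\beta}}

The first term grows and the second shrinks as T increases, which is exactly the tradeoff described on slide 7.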

12 Code Generation for Combination of Patterns
Point-to-point communication: the program must be set up so that each thread has the addresses of the threads it communicates with (see the sketch after this slide).
The three patterns are:
- Binary tree pattern: global reductions
- Linear chain pattern: near-neighbor communication
- Circular ring pattern: producer-consumer
Issues in combination:
- Communication setup
- Thread renumbering
- Synchronization between different patterns
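Purely as an illustration of the communication setup, a thread with logical id 0..T-1 could compute its partners for each of the three patterns as below; the program and all names are hypothetical, not the compiler's actual generated code:

    #include <stdio.h>

    #define T 8  /* logical threads in the pattern (hypothetical) */

    int main(void) {
        for (int id = 0; id < T; id++) {
            int left      = (id == 0)     ? -1 : id - 1;   /* linear chain  */
            int right     = (id == T - 1) ? -1 : id + 1;
            int ring_next = (id + 1) % T;                  /* circular ring */
            int parent    = (id == 0) ? -1 : (id - 1) / 2; /* binary tree   */
            printf("id %d: chain(%d,%d) ring(%d) tree parent(%d)\n",
                   id, left, right, ring_next, parent);
        }
        return 0;
    }

When patterns are combined, thread renumbering presumably amounts to mapping a physical thread's id into each pattern's logical numbering before formulas like these apply.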

13 Experimental Results
[Chart: speedup for Conjugate Gradient, Class A (14,000 rows).]

14 Experimental Results
[Chart: speedup for Tomcatv, problem size 256x256 elements.]

15 Comparison of Compiler-Generated and Hand-Written Code
[Chart: CG, Class A.]

16 Comparison of Compiler-Generated and Hand-Written Code
[Chart: Tomcatv, problem size 256x256 elements.]

17 Comparison of Compiler-Generated and Hand-Written Code
[Chart: Jacobi.]

18 Conclusion
- Compiler-generated codes for CG and Tomcatv provide good speedup and scalability.
- Compiler-generated codes exhibit good performance even for small problem sizes.
- In accordance with the cost model, Tomcatv, which has a higher amount of computation per row, performs better with a larger number of threads.
- Similarly, the Jacobi benchmark performs well with a smaller number of threads, due to the lower amount of computation in the problem.

