Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links Ernie Chan.

1 Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links Ernie Chan

2 Authors  Ernie Chan, Robert van de Geijn (Department of Computer Sciences, The University of Texas at Austin)  William Gropp, Rajeev Thakur (Mathematics and Computer Science Division, Argonne National Laboratory)

3 Outline  Testbed Architecture; Model of Parallel Computation; Sending Simultaneously; Collective Communication; Generalized Algorithms; Performance Results; Conclusion

4 Testbed Architecture IBM Blue Gene/L  3D torus point-to-point interconnect network  One rack: 1024 dual-processor nodes in two 8 x 8 x 8 midplanes  Special feature to send simultaneously: use multiple calls to MPI_Isend

6 Model of Parallel Computation Target Architectures  Distributed-memory parallel architectures Indexing  p computational nodes  Indexed 0 … p - 1 Logically Fully Connected  A node can send directly to any other node

7 Model of Parallel Computation Topology  N-dimensional torus [Figure: torus of 16 nodes labeled 0 to 15]

8 Model of Parallel Computation Old Model of Communicating Between Nodes  Unidirectional sending or receiving

9 Model of Parallel Computation Old Model of Communicating Between Nodes  Simultaneous sending and receiving

10 Model of Parallel Computation Old Model of Communicating Between Nodes  Bidirectional exchange

11 Model of Parallel Computation Communicating Between Nodes  A node can send or receive with 2N other nodes simultaneously along its 2N different links
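
To make the 2N-link claim concrete, here is a minimal sketch that lists the neighbors of a node on an N-dimensional torus. The row-major rank numbering and the function names are assumptions chosen for illustration; the slides do not specify how ranks map to torus coordinates.

```python
def rank_to_coords(rank, dims):
    """Convert a linear rank to torus coordinates (last dimension varies fastest)."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def coords_to_rank(coords, dims):
    """Inverse of rank_to_coords under the same assumed row-major numbering."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + c
    return rank


def torus_neighbors(rank, dims):
    """Return the 2N neighbors of `rank`: one step back and forward per dimension.

    Wraparound links are handled with modular arithmetic. If a dimension has
    extent 2, its two neighbors coincide, so the list may contain repeats.
    """
    coords = rank_to_coords(rank, dims)
    neighbors = []
    for axis, extent in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coords)
            shifted[axis] = (coords[axis] + step) % extent
            neighbors.append(coords_to_rank(shifted, dims))
    return neighbors


# One 8 x 8 x 8 midplane: every node has 2 * 3 = 6 neighbors.
print(torus_neighbors(0, (8, 8, 8)))
```

On an 8 x 8 x 8 midplane this yields six distinct neighbors per node, matching the "1 – 6 neighbors" range used in the benchmarks.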

12 Model of Parallel Computation Communicating Between Nodes  Cannot perform bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes

13 Model of Parallel Computation Cost of Communication: α + nβ  α: startup time (latency)  n: number of bytes to communicate  β: transmission time per byte (inverse of bandwidth)

15 Sending Simultaneously Old Cost of Communication with Sends to Multiple Nodes  Cost to send to m separate nodes (α + nβ) m

16 Sending Simultaneously New Cost of Communication with Simultaneous Sends (α + nβ) m can be replaced with (α + nβ) + (α + nβ) (m - 1)

17 Sending Simultaneously New Cost of Communication with Simultaneous Sends  (α + nβ) m can be replaced with (α + nβ) + (α + nβ) (m - 1) τ, where the first term is the cost of one send and the second term is the cost of the extra sends

18 Sending Simultaneously New Cost of Communication with Simultaneous Sends  (α + nβ) m can be replaced with (α + nβ) + (α + nβ) (m - 1) τ, where the first term is the cost of one send, the second term is the cost of the extra sends, and 0 ≤ τ ≤ 1

19 Sending Simultaneously Benchmarking Sending Simultaneously  Logarithmic-Logarithmic timing graphs  Midplane – 512 nodes  Sending simultaneously with 1 – 6 neighbors  8 bytes – 4 MB

20 Sending Simultaneously

22 Cost of Communication with Simultaneous Sends (α + nβ) (1 + (m - 1) τ)
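
The cost model on this slide can be written as a small function. This is a sketch only: the function and parameter names are chosen here for illustration, and the numeric values in the example are made up, not measurements from Blue Gene/L.

```python
def send_cost(n, alpha, beta):
    """Cost of one point-to-point send of n bytes: alpha + n * beta."""
    return alpha + n * beta


def simultaneous_send_cost(n, m, alpha, beta, tau):
    """Cost of sending n bytes to m nodes simultaneously.

    Implements (alpha + n * beta) * (1 + (m - 1) * tau) with 0 <= tau <= 1.
    tau = 1 recovers the old serial cost m * (alpha + n * beta);
    tau = 0 means the extra sends are free.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return send_cost(n, alpha, beta) * (1 + (m - 1) * tau)


# Illustrative values only (not measured): alpha = 2.0, beta = 0.5, n = 4 bytes.
for tau in (0.0, 0.5, 1.0):
    print(tau, simultaneous_send_cost(4, 6, 2.0, 0.5, tau))
```

The two extremes of τ bracket the behavior: the benchmarks are what determine where a given machine falls between them.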

23 Sending Simultaneously

26 Collective Communication Broadcast (Bcast)  Motivating example Before After

27 Collective Communication Scatter Before After

28 Collective Communication Allgather Before After

29 Collective Communication Broadcast  Can be implemented as a Scatter followed by an Allgather

30 Scatter Allgather
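
The decomposition of a broadcast into a scatter followed by an allgather can be checked with a small data-movement sketch. It models only which node ends up holding which data, not the communication pattern or its cost; the function names and the near-equal block splitting are assumptions for illustration.

```python
def scatter(root_data, p):
    """Split the root's vector into p nearly equal blocks, one per node."""
    n = len(root_data)
    bounds = [(i * n) // p for i in range(p + 1)]
    return [root_data[bounds[i]:bounds[i + 1]] for i in range(p)]


def allgather(blocks):
    """Every node ends up with the concatenation of all nodes' blocks."""
    full = [x for block in blocks for x in block]
    return [list(full) for _ in blocks]


def broadcast(root_data, p):
    """Broadcast expressed as a scatter followed by an allgather."""
    return allgather(scatter(root_data, p))


result = broadcast(list(range(10)), 4)
print(result[0])                                   # data held by node 0
print(all(node == result[0] for node in result))   # every node holds the same vector
```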

31 Collective Communication Lower Bounds: Latency  Broadcast: $\log_{2N+1}(p)\,\alpha$  Scatter: $\log_{2N+1}(p)\,\alpha$  Allgather: $\log_{2N+1}(p)\,\alpha$

32 Collective Communication Lower Bounds: Bandwidth  Broadcast: $\frac{n\beta}{2N}$  Scatter: $\frac{p-1}{p}\,\frac{n\beta}{2N}$  Allgather: $\frac{p-1}{p}\,\frac{n\beta}{2N}$
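
One way to read the latency bound: if every node that holds data can reach 2N others in a step, the number of nodes holding data grows by at most a factor of 2N + 1 per step. The sketch below counts those steps and evaluates the bandwidth bounds from the slide. Rounding the step count up to an integer is an assumption made here; the slides state the bound as a plain logarithm. Function names are illustrative.

```python
def min_steps(p, N):
    """Fewest steps for data to reach p nodes when coverage grows by 2N + 1 per step."""
    steps, reached = 0, 1
    while reached < p:
        reached *= 2 * N + 1
        steps += 1
    return steps


def latency_lower_bound(p, N, alpha):
    """Latency term of the lower bound, with the logarithm rounded up to whole steps."""
    return min_steps(p, N) * alpha


def bandwidth_lower_bound(op, n, p, N, beta):
    """Bandwidth term of the lower bound for the three collectives on the slide."""
    if op == "broadcast":
        return n * beta / (2 * N)
    if op in ("scatter", "allgather"):
        return (p - 1) / p * n * beta / (2 * N)
    raise ValueError(f"unknown operation: {op}")


# A 512-node midplane with N = 3: compare with log2(512) = 9 steps for one link.
print(min_steps(512, 3))
```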

34 Generalized Algorithms  Short-Vector Algorithms: Minimum-Spanning Tree (MST)  Long-Vector Algorithms: Bucket Algorithm

35 Generalized Algorithms Minimum-Spanning Tree

36 Generalized Algorithms Minimum-Spanning Tree  Recursively divide network of nodes in half  Cost of MST Bcast: $\log_2(p)\,(\alpha + n\beta)$  What if a node can send to N nodes simultaneously?

37 Generalized Algorithms Minimum-Spanning Tree  Divide p nodes into N+1 partitions

38 Generalized Algorithms Minimum-Spanning Tree  Disjointed partitions on N-dimensional mesh [Figure: 16 nodes labeled 0 to 15]

39 Generalized Algorithms Minimum-Spanning Tree  Divide dimensions by a decrementing counter from N+1 [Figure: 16 nodes labeled 0 to 15]

40 Generalized Algorithms Minimum-Spanning Tree  Now divide into 2N+1 partitions [Figure: 16 nodes labeled 0 to 15]

41 Generalized Algorithms Minimum-Spanning Tree  Cost of new generalized MST Bcast: $\log_{2N+1}(p)\,(\alpha + n\beta)$  Attains lower bound for latency!
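
The step count of the generalized MST broadcast can be illustrated by counting recursion depth. The sketch below treats the nodes as a logically ordered range and splits it into as many near-equal partitions as allowed per step; it ignores how partitions map onto torus dimensions, which the slides handle separately. The function name and the `parts_per_step` parameter are hypothetical.

```python
def mst_bcast_steps(p, parts_per_step):
    """Number of steps for an MST-style broadcast over p logically ordered nodes.

    At each step the current range is split into up to `parts_per_step`
    near-equal partitions; the root sends to one node in every other
    partition, and all partitions then recurse in parallel.
    """
    def recurse(lo, hi):
        size = hi - lo
        if size <= 1:
            return 0  # a single node already holds the data
        parts = min(parts_per_step, size)
        bounds = [lo + (i * size) // parts for i in range(parts + 1)]
        # Partitions proceed in parallel, so the slowest one sets the depth.
        return 1 + max(recurse(bounds[i], bounds[i + 1]) for i in range(parts))

    return recurse(0, p)


# 512 nodes: halving (one link) versus 2N + 1 = 7 partitions (N = 3).
print(mst_bcast_steps(512, 2), mst_bcast_steps(512, 7))
```

Each step costs α + nβ under the model, so fewer steps translate directly into the reduced latency and bandwidth terms in the cost above.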

42 Generalized Algorithms Minimum-Spanning Tree  MST Scatter: only send data that must reside in that partition at each step  Cost of new generalized MST Scatter: $\log_{2N+1}(p)\,\alpha + \frac{p-1}{p}\,\frac{n\beta}{2N}$  Attains lower bound for latency and bandwidth!

43 Generalized Algorithms Bucket Algorithm

44 Generalized Algorithms Bucket Algorithm  Send n/p sized data messages at each step  Cost of Bucket Allgather: $(p - 1)\,\alpha + \frac{p-1}{p}\,n\beta$  What if a node can send to N nodes simultaneously?
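
The one-link bucket allgather can be simulated on a logical ring to see where the p - 1 steps come from. This is a sketch under the assumption that each node forwards to its right-hand neighbor the block it received most recently; the function name and data layout are chosen here for illustration.

```python
def bucket_allgather(blocks):
    """Simulate a bucket (ring) allgather; blocks[i] is the block owned by node i.

    Returns (per-node gathered data, number of steps taken).
    """
    p = len(blocks)
    # have[i] maps block index -> block for the blocks node i currently holds.
    have = [{i: blocks[i]} for i in range(p)]
    steps = 0
    for step in range(p - 1):
        # Node i forwards block (i - step) mod p, the one it received in the
        # previous step (or its own block in step 0), to node (i + 1) mod p.
        sends = [((i + 1) % p, (i - step) % p, have[i][(i - step) % p])
                 for i in range(p)]
        for dest, index, block in sends:
            have[dest][index] = block
        steps += 1
    gathered = [[node[j] for j in range(p)] for node in have]
    return gathered, steps


gathered, steps = bucket_allgather(["a", "b", "c", "d"])
print(gathered[0], steps)
```

Every step moves one block of size n/p per node, which gives the (p - 1)α latency term and the (p - 1)/p · nβ bandwidth term.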

45 Generalized Algorithms Bucket Algorithm  Collect data around N buckets simultaneously [Figure: 16 nodes labeled 0 to 15]

46 Generalized Algorithms Bucket Algorithm  Cannot send to N neighbors at each step [Figure: 16 nodes labeled 0 to 15]

47 Generalized Algorithms Bucket Algorithm  Assume collecting data in buckets is free in all but one dimension  D is an N-ordered tuple representing the number of nodes in each dimension of the torus: $|D| = N$, $\prod_{i=1}^{N} D_i = p$

48 Generalized Algorithms Bucket Algorithm  Cost of the new generalized Bucket Allgather: $(d - N)\,\alpha + \frac{D_j - 1}{D_j}\,n\beta$, where $d = \sum_{i=1}^{N} D_i$ and $D_j \ge D_i \;\; \forall\, i \ne j$

49 Generalized Algorithms Bucket Algorithm  New generalized Bcast derived from MST Scatter followed by Bucket Allgather  Cost of new long-vector Bcast: $(\log_{2N+1}(p) + d - N)\,\alpha + \left(\frac{p-1}{2Np} + \frac{D_j - 1}{D_j}\right) n\beta$
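
The long-vector Bcast cost is the sum of the generalized MST Scatter cost and the generalized Bucket Allgather cost. The sketch below evaluates the three expressions and checks that they add up. Function names are illustrative, `dims` stands for the tuple D, and the α, β, and n values in the example are placeholders rather than measured machine parameters.

```python
import math


def mst_scatter_cost(n, dims, alpha, beta):
    """log_{2N+1}(p) * alpha + ((p - 1) / p) * n * beta / (2N)."""
    N, p = len(dims), math.prod(dims)
    return math.log(p, 2 * N + 1) * alpha + (p - 1) / p * n * beta / (2 * N)


def bucket_allgather_cost(n, dims, alpha, beta):
    """(d - N) * alpha + ((D_j - 1) / D_j) * n * beta, D_j the largest dimension."""
    N, d, d_j = len(dims), sum(dims), max(dims)
    return (d - N) * alpha + (d_j - 1) / d_j * n * beta


def long_vector_bcast_cost(n, dims, alpha, beta):
    """Combined expression for the long-vector Bcast, written out term by term."""
    N, p, d, d_j = len(dims), math.prod(dims), sum(dims), max(dims)
    latency = (math.log(p, 2 * N + 1) + d - N) * alpha
    bandwidth = ((p - 1) / (2 * N * p) + (d_j - 1) / d_j) * n * beta
    return latency + bandwidth


dims = (8, 8, 8)                     # one midplane
n, alpha, beta = 4096, 1.0, 0.001    # placeholder values, not measurements
total = long_vector_bcast_cost(n, dims, alpha, beta)
parts = mst_scatter_cost(n, dims, alpha, beta) + bucket_allgather_cost(n, dims, alpha, beta)
print(math.isclose(total, parts))
```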

51 Performance Results Logarithmic-Logarithmic Timing Graphs  Collective Communication Operations: Broadcast, Scatter, Allgather  Algorithms: MST, Bucket  8 bytes – 4 MB

52 Performance Results Single point-to-point communication

53 Performance Results my-bcast-MST

54 Performance Results

60 Conclusion IBM Blue Gene/L supports sending simultaneously over multiple links  Benchmarking, along with checking against the model, verifies this claim  New generalized algorithms show clear performance gains

61 Conclusion Future Directions  Room for optimization to reduce implementation overhead  What if not using MPI_COMM_WORLD?  Possible new algorithm for Bucket Algorithm  Questions? echan@cs.utexas.edu

