1
Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links Ernie Chan
2
Authors
Ernie Chan and Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin
William Gropp and Rajeev Thakur, Mathematics and Computer Science Division, Argonne National Laboratory
3
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
4
Testbed Architecture
IBM Blue Gene/L: 3D torus point-to-point interconnect network
One rack: 1024 dual-processor nodes, two 8 x 8 x 8 midplanes
Special feature to send simultaneously: use multiple calls to MPI_Isend
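The "send simultaneously" feature is driven from MPI by posting several nonblocking sends at once. Below is a minimal sketch of what such a call sequence might look like; the neighbor list, buffer, and message size are illustrative placeholders, not the authors' actual benchmark code.

```c
#include <mpi.h>

/* Sketch: post nonblocking sends to several neighbors at once and wait for
 * completion. On the 3D torus a node has at most 2N = 6 neighbors. */
void send_to_neighbors(const void *buf, int nbytes,
                       const int *neighbors, int num_neighbors,
                       MPI_Comm comm)
{
    MPI_Request reqs[6];                      /* illustrative upper bound */
    for (int i = 0; i < num_neighbors; i++)
        MPI_Isend(buf, nbytes, MPI_BYTE, neighbors[i], 0, comm, &reqs[i]);
    MPI_Waitall(num_neighbors, reqs, MPI_STATUSES_IGNORE);
}
```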
5
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
6
Model of Parallel Computation
Target Architectures: distributed-memory parallel architectures
Indexing: p computational nodes, indexed 0 … p - 1
Logically Fully Connected: a node can send directly to any other node
7
Model of Parallel Computation
Topology: N-dimensional torus
[Diagram: 16-node torus example, nodes labeled 0-15]
8
Model of Parallel Computation Old Model of Communicating Between Nodes Unidirectional sending or receiving
9
Model of Parallel Computation Old Model of Communicating Between Nodes Simultaneous sending and receiving
10
Model of Parallel Computation Old Model of Communicating Between Nodes Bidirectional exchange
11
Model of Parallel Computation Communicating Between Nodes A node can send or receive with 2N other nodes simultaneously along its 2N different links
12
Model of Parallel Computation Communicating Between Nodes Cannot perform bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes
13
Model of Parallel Computation
Cost of Communication: α + nβ
α: startup time (latency)
n: number of bytes to communicate
β: per-byte transmission time (the bandwidth term)
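To see the two regimes this model separates, a small illustrative calculation with hypothetical values α = 3 μs and β = 0.006 μs/byte (chosen only for the arithmetic, not measured on the testbed):

```latex
T(8\ \text{bytes}) = \alpha + 8\beta \approx 3\ \mu\text{s}, \qquad
T(4\ \text{MB}) = \alpha + (4 \times 2^{20})\beta \approx 2.5 \times 10^{4}\ \mu\text{s}.
```

Short messages are dominated by the α (latency) term and long messages by the nβ (bandwidth) term, which is why the short-vector and long-vector algorithms later optimize different terms.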
14
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
15
Sending Simultaneously
Old Cost of Communication with Sends to Multiple Nodes
Cost to send to m separate nodes: (α + nβ) m
16
Sending Simultaneously
New Cost of Communication with Simultaneous Sends
(α + nβ) m can be replaced with (α + nβ) + (α + nβ)(m - 1) τ
17
Sending Simultaneously
New Cost of Communication with Simultaneous Sends
(α + nβ) m can be replaced with (α + nβ) + (α + nβ)(m - 1) τ
Cost of one send + cost of the extra sends
18
Sending Simultaneously
New Cost of Communication with Simultaneous Sends
(α + nβ) m can be replaced with (α + nβ) + (α + nβ)(m - 1) τ
Cost of one send + cost of the extra sends, where 0 ≤ τ ≤ 1
19
Sending Simultaneously
Benchmarking Sending Simultaneously: logarithmic-logarithmic timing graphs
Midplane: 512 nodes
Sending simultaneously with 1 – 6 neighbors
Message sizes: 8 bytes – 4 MB
20
Sending Simultaneously
22
Cost of Communication with Simultaneous Sends (α + nβ) (1 + (m - 1) τ)
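The expression above is simply the factored form of the sum on the previous slides; writing out the step makes the limiting cases explicit:

```latex
(\alpha + n\beta) + (\alpha + n\beta)(m-1)\tau
  = (\alpha + n\beta)\bigl(1 + (m-1)\tau\bigr), \qquad 0 \le \tau \le 1.
```

With τ = 1 this reduces to the old cost (α + nβ) m (the m sends are fully serialized); with τ = 0 sending to m nodes costs no more than a single send.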
23
Sending Simultaneously
25
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
26
Collective Communication
Broadcast (Bcast): motivating example
[Before/after diagram]
27
Collective Communication
Scatter
[Before/after diagram]
28
Collective Communication
Allgather
[Before/after diagram]
29
Collective Communication Broadcast Can be implemented as a Scatter followed by an Allgather
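A minimal MPI sketch of this composition, written directly with the library's built-in collectives rather than the generalized algorithms developed later; it assumes the message length n is divisible by the number of processes p, and the names are illustrative:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch: broadcast n bytes from root as a Scatter followed by an Allgather.
 * Assumes n is divisible by p. */
void bcast_scatter_allgather(char *buf, int n, int root, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int piece = n / p;
    char *mine = malloc(piece);

    /* Each rank receives its 1/p-th piece of the root's buffer... */
    MPI_Scatter(buf, piece, MPI_CHAR, mine, piece, MPI_CHAR, root, comm);
    /* ...and the pieces are reassembled on every rank in rank order. */
    MPI_Allgather(mine, piece, MPI_CHAR, buf, piece, MPI_CHAR, comm);

    free(mine);
}
```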
30
Scatter, then Allgather [diagram]
31
Collective Communication
Lower Bounds (latency):
Broadcast: log_{2N+1}(p) α
Scatter: log_{2N+1}(p) α
Allgather: log_{2N+1}(p) α
32
Collective Communication
Lower Bounds (bandwidth):
Broadcast: nβ / (2N)
Scatter: ((p - 1)/p) nβ / (2N)
Allgather: ((p - 1)/p) nβ / (2N)
33
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
34
Generalized Algorithms
Short-vector algorithms: Minimum-Spanning Tree
Long-vector algorithms: Bucket Algorithm
35
Generalized Algorithms Minimum-Spanning Tree
36
Generalized Algorithms
Minimum-Spanning Tree: recursively divide the network of nodes in half
Cost of MST Bcast: log_2(p) (α + nβ)
What if a node can send to N other nodes simultaneously?
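For reference, a minimal sketch of the classic MST (recursive halving) broadcast from rank 0 that the log_2(p)(α + nβ) cost above describes; the data always lives at the lowest rank of the current block, which sends it across the block's midpoint, and then both halves recurse. Buffer names are illustrative.

```c
#include <mpi.h>

/* Sketch: classic MST broadcast of n bytes from rank 0 over comm. */
void mst_bcast(char *buf, int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int left = 0, right = p;                  /* current block [left, right) */
    while (right - left > 1) {
        int mid = left + (right - left) / 2;  /* split the block in half */
        if (rank < mid) {
            if (rank == left)                 /* holder sends across the split */
                MPI_Send(buf, n, MPI_CHAR, mid, 0, comm);
            right = mid;                      /* recurse on the lower half */
        } else {
            if (rank == mid)                  /* new holder of the upper half */
                MPI_Recv(buf, n, MPI_CHAR, left, 0, comm, MPI_STATUS_IGNORE);
            left = mid;                       /* recurse on the upper half */
        }
    }
}
```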
37
Generalized Algorithms Minimum-Spanning Tree Divide p nodes into N+1 partitions
38
Generalized Algorithms
Minimum-Spanning Tree: disjoint partitions on the N-dimensional mesh
[Diagram: 16-node torus example, nodes labeled 0-15]
39
Generalized Algorithms
Minimum-Spanning Tree: divide the dimensions by a decrementing counter from N + 1
[Diagram: 16-node torus example, nodes labeled 0-15]
40
Generalized Algorithms
Minimum-Spanning Tree: now divide into 2N + 1 partitions
[Diagram: 16-node torus example, nodes labeled 0-15]
41
Generalized Algorithms
Minimum-Spanning Tree
Cost of the new generalized MST Bcast: log_{2N+1}(p) (α + nβ)
Attains the lower bound for latency!
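An illustrative count using the testbed numbers from earlier (the 512-node midplane is an 8 x 8 x 8 torus, so N = 3 and 2N + 1 = 7):

```latex
\lceil \log_{7} 512 \rceil = 4 \ \text{startup steps}
\quad\text{vs.}\quad
\lceil \log_{2} 512 \rceil = 9 \ \text{for the standard MST Bcast}.
```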
42
Generalized Algorithms
MST Scatter: only send the data that must reside in that partition at each step
Cost of the new generalized MST Scatter: log_{2N+1}(p) α + ((p - 1)/p) nβ / (2N)
Attains the lower bounds for latency and bandwidth!
43
Generalized Algorithms Bucket Algorithm
44
Generalized Algorithms
Bucket Algorithm: send n/p-sized data messages at each step
Cost of Bucket Allgather: (p - 1) α + ((p - 1)/p) nβ
What if a node can send to N other nodes simultaneously?
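A minimal sketch of the one-dimensional bucket (ring) Allgather described by this cost: in each of the p - 1 steps every node forwards the piece it received in the previous step to its neighbor on the ring. Buffer layout and names are illustrative.

```c
#include <mpi.h>

/* Sketch: 1-D bucket (ring) Allgather. buf holds p pieces of `piece` bytes;
 * this rank's own contribution starts at buf + rank * piece. */
void bucket_allgather(char *buf, int piece, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int right = (rank + 1) % p;
    int left  = (rank - 1 + p) % p;

    for (int step = 0; step < p - 1; step++) {
        int send_idx = (rank - step + p) % p;          /* piece passed on now */
        int recv_idx = (rank - step - 1 + 2 * p) % p;  /* piece arriving now  */
        MPI_Sendrecv(buf + send_idx * piece, piece, MPI_CHAR, right, 0,
                     buf + recv_idx * piece, piece, MPI_CHAR, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```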
45
Generalized Algorithms
Bucket Algorithm: collect data around N buckets simultaneously
[Diagram: 16-node torus example, nodes labeled 0-15]
46
Generalized Algorithms
Bucket Algorithm: cannot send to N neighbors at each step
[Diagram: 16-node torus example, nodes labeled 0-15]
47
Generalized Algorithms
Bucket Algorithm: assume collecting data in the buckets is free in all but one dimension
D is an N-tuple representing the number of nodes in each dimension of the torus: |D| = N and ∏_{i=1}^{N} D_i = p
48
Generalized Algorithms
Bucket Algorithm
Cost of the new generalized Bucket Allgather: (d - N) α + ((D_j - 1)/D_j) nβ
where d = Σ_{i=1}^{N} D_i and D_j ≥ D_i for all i ≠ j
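An illustrative evaluation on the 8 x 8 x 8 midplane from the testbed slide, so D = (8, 8, 8) and N = 3:

```latex
d = \sum_{i=1}^{N} D_i = 8 + 8 + 8 = 24, \qquad
(d - N)\,\alpha + \frac{D_j - 1}{D_j}\,n\beta = 21\,\alpha + \tfrac{7}{8}\,n\beta,
```

compared with a startup term of (p - 1) α = 511 α for the one-dimensional Bucket Allgather on the same 512 nodes.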
49
Generalized Algorithms
Bucket Algorithm
New generalized Bcast derived from MST Scatter followed by Bucket Allgather
Cost of the new long-vector Bcast: (log_{2N+1}(p) + d - N) α + ((p - 1)/(2Np) + (D_j - 1)/D_j) nβ
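Continuing the illustrative 8 x 8 x 8 midplane numbers (p = 512, N = 3, d = 24, D_j = 8):

```latex
\bigl(\log_{7} 512 + 24 - 3\bigr)\alpha
+ \Bigl(\tfrac{511}{2 \cdot 3 \cdot 512} + \tfrac{7}{8}\Bigr) n\beta
\approx 24.2\,\alpha + 1.04\,n\beta .
```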
50
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
51
Performance Results
Logarithmic-logarithmic timing graphs
Collective communication operations: Broadcast, Scatter, Allgather
Algorithms: MST, Bucket
Message sizes: 8 bytes – 4 MB
52
Performance Results Single point-to-point communication
53
Performance Results my-bcast-MST
54
Performance Results
59
Outline Testbed Architecture Model of Parallel Computation Sending Simultaneously Collective Communication Generalized Algorithms Performance Results Conclusion
60
Conclusion
The IBM Blue Gene/L supports the functionality of sending simultaneously
Benchmarking, along with checking against the model, verifies this claim
The new generalized algorithms show clear performance gains
61
Conclusion
Future directions: room for optimization to reduce implementation overhead; what if not using MPI_COMM_WORLD?; a possible new algorithm for the Bucket Algorithm
Questions? echan@cs.utexas.edu