
1 A Framework for Collective Personalized Communication Laxmikant V. Kale, Sameer Kumar, Krishnan Varadarajan

2 PPL@UIUC Collective Communication
Collective communication is a performance impediment
Collective personalized communication
–All-to-all personalized communication (AAPC)
–Many-to-many personalized communication (MMPC)

3 PPL@UIUC Issues
Communication latencies are not scaling with bandwidth and processor speeds
–High software overhead (α)
–Message combining
Synchronous operations (MPI_Alltoall) do not utilize the communication co-processor effectively
Performance metrics
–Completion time vs. compute overhead

4 PPL@UIUC AAPC
Each processor sends a distinct message to every other processor
High software overhead for small messages
Direct AAPC
–Cost = (P – 1) × (α + mβ)
α is the total software overhead of sending a message
β is the per-byte network overhead
m is the size of the message
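As a hedged illustration of the cost model above (not code from the paper), a small C helper that evaluates T_Direct from P, α, β and m; the parameter values in main are made-up examples.

    #include <stdio.h>

    /* Direct AAPC cost model from the slide: each processor sends P-1
     * distinct messages, each paying the software overhead alpha plus
     * m bytes at beta seconds per byte.                                */
    double direct_aapc_time(int P, double alpha, double beta, double m)
    {
        return (P - 1) * (alpha + m * beta);
    }

    int main(void)
    {
        /* Illustrative (made-up) parameters: 1024 processors,
         * alpha = 10 us, beta = 3 ns/byte, 2 KB messages.      */
        printf("T_Direct = %g s\n", direct_aapc_time(1024, 10e-6, 3e-9, 2048.0));
        return 0;
    }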

5 PPL@UIUC Optimizing AAPC
Direct AAPC is α-dominated
Message combining for small messages
–Reduce the total number of messages
–Messages sent along a virtual topology
–Multistage algorithm to send messages
–Groups of messages are combined and sent to intermediate processors, which then forward them to the final destinations

6 PPL@UIUC Virtual Topology: Mesh
Organize processors in a 2D (virtual) mesh
Phase 1: Processors send messages to row neighbors
Message from (x1,y1) to (x2,y2) goes via (x1,y2)
Phase 2: Processors send messages to column neighbors
2×√P messages instead of P – 1
Compared with T_Direct = (P – 1) × (α + mβ)
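A minimal sketch (my own naming, assuming P is a perfect square with side s = √P) of the two pieces the slide describes: the intermediate hop (x1, y2) for a message from (x1, y1) to (x2, y2), and the resulting mesh cost model, in which each of the 2(√P – 1) messages carries √P combined payloads of m bytes.

    #include <math.h>

    /* Intermediate processor for the mesh strategy: a message from
     * (x1, y1) to (x2, y2) is first sent to the row neighbour (x1, y2),
     * which forwards it to (x2, y2) in the column phase.
     * Ranks are laid out row-major on an s x s mesh, s = sqrt(P).      */
    int mesh_intermediate(int src, int dst, int s)
    {
        int x1 = src / s;        /* row of the source         */
        int y2 = dst % s;        /* column of the destination */
        return x1 * s + y2;
    }

    /* Mesh cost model: 2*(sqrt(P)-1) messages, each of size sqrt(P)*m,
     * versus (P-1)*(alpha + m*beta) for direct AAPC.                   */
    double mesh_aapc_time(int P, double alpha, double beta, double m)
    {
        double s = sqrt((double)P);
        return 2.0 * (s - 1.0) * (alpha + s * m * beta);
    }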

7 PPL@UIUC Virtual Topology: Hypercube
Dimensional exchange (figure: example exchange on a hypercube)
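A hedged sketch of the dimensional-exchange schedule in C, assuming P is a power of two: there are log2(P) phases, and in phase d each processor exchanges combined data with the partner whose rank differs in bit d (the function name is mine).

    /* Dimensional exchange on a hypercube of P = 2^k processors: phase d
     * pairs each rank with rank ^ (1 << d).  Each exchanged message
     * carries roughly (P/2) combined payloads of m bytes, giving about
     * log2(P) * (alpha + (P/2)*m*beta) total cost.                     */
    int hypercube_partners(int rank, int P, int *partners /* >= log2(P) */)
    {
        int phases = 0;
        for (int bit = 1; bit < P; bit <<= 1)
            partners[phases++] = rank ^ bit;
        return phases;           /* number of phases, log2(P) */
    }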

8 PPL@UIUC Virtual Topology: 3D Grid
Messages exchanged along a 3D grid
(chart: AAPC overhead)
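For reference, the per-processor completion-time models that follow from the combining argument: T_Direct is from slide 4; the mesh, 3D-grid and hypercube forms follow from the message counts and combined message sizes and are my reconstruction (assuming P is a perfect square, a perfect cube, and a power of two, respectively), not copied from the slides.

    \begin{align*}
    T_{\mathrm{Direct}}    &= (P-1)\,(\alpha + m\beta)\\
    T_{\mathrm{Mesh}}      &= 2\,(\sqrt{P}-1)\,(\alpha + \sqrt{P}\,m\beta)\\
    T_{\mathrm{3D\;Grid}}  &= 3\,(P^{1/3}-1)\,(\alpha + P^{2/3}\,m\beta)\\
    T_{\mathrm{Hypercube}} &= (\log_2 P)\,(\alpha + \tfrac{P}{2}\,m\beta)
    \end{align*}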

9 PPL@UIUC Imperfect Mesh
When P is not a perfect square the mesh has holes; the holes are evenly distributed among the members of that column

10 PPL@UIUC Experiments
Benchmarks run on PSC Lemieux
–750 quad Alpha nodes
–Connected by the QsNet Elan network
The Elan NIC has a communication co-processor capable of asynchronous remote DMA
The Elan network also performs better when fewer messages are sent and received

11 PPL@UIUC AAPC Scalability

12 PPL@UIUC AAPC on 1024 Processors of Lemieux

13 PPL@UIUC Case Study 1: Radix Sort
AAPC Time (ms)
Size    Direct    Mesh
2KB     333       221
4KB     256       416
8KB     484       766

14 PPL@UIUC AAPC Processor Overhead

15 PPL@UIUC Compute Overhead: A New Metric
Strategies should also be evaluated on compute overhead
Asynchronous, non-blocking primitives are needed
–The compute overhead of the Mesh strategy is a small fraction of the total AAPC completion time
–A data-driven system like Charm++ will automatically support this

16 PPL@UIUC AMPI
Provides virtualization and other features of Charm++ to MPI programs
AMPI AAPC interface
–Split-phase interface

    MPI_Ialltoall(sndbuf, msg_size, MPI_CHAR, recvbuf, msg_size,
                  MPI_CHAR, MPI_COMM_WORLD, &req);
    // User code
    while (!MPI_Test(&req, ...)) {
        // Do computation
    }

Also recommended for other MPI implementations!
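A self-contained version of the split-phase pattern above, written against the standard MPI-3 MPI_Ialltoall / MPI_Test signatures (AMPI exposes the same shape of call); MSG_SIZE and do_computation are placeholders of mine, not from the slide.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSG_SIZE 2048        /* illustrative per-destination payload size */

    static void do_computation(void) { /* useful work overlapped with the AAPC */ }

    int main(int argc, char **argv)
    {
        int rank, P, flag = 0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        char *sndbuf  = malloc((size_t)P * MSG_SIZE);
        char *recvbuf = malloc((size_t)P * MSG_SIZE);
        memset(sndbuf, rank & 0xff, (size_t)P * MSG_SIZE);

        /* Split phase: start the collective, then compute while polling. */
        MPI_Ialltoall(sndbuf, MSG_SIZE, MPI_CHAR,
                      recvbuf, MSG_SIZE, MPI_CHAR, MPI_COMM_WORLD, &req);
        while (!flag) {
            do_computation();
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        }

        free(sndbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }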

17 PPL@UIUC MMPC
Many (but not all) processors send data to many other processors
–New metric δ to evaluate performance
δ: the number of messages a processor sends or receives
–Uniform MMPC: small variance in δ
–Non-uniform MMPC: large variance in δ
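A small sketch of how δ could be tabulated from a communication matrix; the slide does not spell out whether δ counts sends, receives, or both, so this version (with names of my own) counts both directions.

    /* comm is a P x P row-major matrix: comm[i*P + j] != 0 means
     * processor i sends a message to processor j in this MMPC step.
     * delta[i] is then the number of messages i sends or receives. */
    void mmpc_delta(int P, const int *comm, int *delta)
    {
        for (int i = 0; i < P; i++) {
            delta[i] = 0;
            for (int j = 0; j < P; j++) {
                if (j == i) continue;
                if (comm[i * P + j]) delta[i]++;   /* i sends to j      */
                if (comm[j * P + i]) delta[i]++;   /* i receives from j */
            }
        }
    }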

18 PPL@UIUC Uniform MMPC δ is the degree of the communication graph

19 PPL@UIUC Case Study 2: Neighbor Send
Synthetic benchmark where processor i sends messages to processors {i+1, i+2, …, (i+δ)%P}
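A minimal sketch (names mine) of the destination set used by this benchmark: processor i sends to the δ processors that follow it cyclically.

    /* Fill dests[0..delta-1] with (i+1)%P, (i+2)%P, ..., (i+delta)%P. */
    void neighbor_destinations(int i, int delta, int P, int *dests)
    {
        for (int k = 1; k <= delta; k++)
            dests[k - 1] = (i + k) % P;
    }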

20 PPL@UIUC Case Study 3: NAMD
Performance of NAMD with the ATPase molecule
The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages

21 PPL@UIUC Related Work
The topologies described in this paper were also presented in Krishnan's Master's thesis, 1999
The Mesh and 3D Grid strategies are also presented in C. Christara, X. Ding, and K. Jackson, "An efficient transposition algorithm for distributed memory clusters," in 13th Annual International Symposium on High Performance Computing Systems and Applications, 1999

22 PPL@UIUC Summary
We present a non-blocking framework for collective personalized communication
–New performance metric: AAPC compute time
MPI programs can make use of it through a split-phase interface

23 PPL@UIUC Future Work
The optimal strategy depends on (δ, P, m)
–Develop a learning framework using the principle of persistence
Physical topologies
–BlueGene!
Non-uniform MMPC
–Analysis and new strategies
Smart strategies for multiple simultaneous AAPCs over sections of processors
–Needed by ab initio molecular dynamics
Software available at http://charm.cs.uiuc.edu

