HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters Teng Ma, George Bosilca, Aurelien.

1 HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters. Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra. Dec. 2, 2011 @ ICL Lunch Talk

2 Agenda: Introduction; Related work; Kernel-assisted approach; HierKNEM; Experiments; Conclusion

3 Introduction. Multi-core clusters introduce hardware hierarchies. Message passing is still the dominant programming model. Programming libraries want to handle these hierarchies internally. Collective communication is critical to application performance.

4 Problem: Tuned Collective. The Tuned collective module cannot see the boundaries introduced by the hierarchies of multi-core clusters. It builds a logical topology without runtime hardware topology information.

5 Topology-Unaware: Mismatch problem*. Figure: Open MPI Tuned Allgather Ring algorithm under two process-core binding cases (--bycore and --bynode), with processes P0 to P3 placed on Core0 to Core3 of Node 0 and Node 1. * T. Ma, T. Herault, G. Bosilca and J. J. Dongarra, Process Distance-aware Adaptive MPI Collective Communications, Cluster 2011
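
The mismatch can be illustrated with a small counting sketch. It assumes, as the figure labels suggest, 2 nodes with 2 cores each, and it models --bycore as filling one node before the next and --bynode as dealing ranks round-robin across nodes. The function names are hypothetical and nothing here calls into Open MPI; it only counts how many edges of a rank-ordered ring cross a node boundary.

```python
def node_of(rank, nodes, cores_per_node, mapping):
    """Return the node hosting `rank` under a given placement policy."""
    if mapping == "bycore":   # fill one node's cores before moving to the next node
        return rank // cores_per_node
    if mapping == "bynode":   # deal ranks round-robin across the nodes
        return rank % nodes
    raise ValueError(f"unknown mapping: {mapping}")


def inter_node_ring_links(nodes, cores_per_node, mapping):
    """Count ring edges (rank -> rank + 1, wrapping) that cross a node boundary."""
    size = nodes * cores_per_node
    crossing = 0
    for rank in range(size):
        right = (rank + 1) % size
        if node_of(rank, nodes, cores_per_node, mapping) != node_of(
            right, nodes, cores_per_node, mapping
        ):
            crossing += 1
    return crossing


if __name__ == "__main__":
    for mapping in ("bycore", "bynode"):
        print(mapping, inter_node_ring_links(2, 2, mapping))
```

Under these assumptions the rank-ordered ring crosses the network on 2 of its 4 edges with the by-core placement and on all 4 edges with the by-node placement, which is the kind of mismatch a topology-unaware ring cannot avoid.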

6 Agenda: Introduction; Related work; Kernel-assisted approach; HierKNEM; Experiments; Conclusion

7 Related work. Cheetah: R. Graham et al., Cheetah: A Framework for Scalable Hierarchical Collective Operations, CCGRID 2011. Distance-aware framework: T. Ma et al., Process Distance-Aware Adaptive MPI Collective Communications, CLUSTER 2011. Figure labels: SBGP, BCOL, IB links, NUMA links, intra-socket links.

8 Agenda: Introduction; Related work; Kernel-assisted approach; HierKNEM; Experiments; Conclusion

9 Status of kernel-assisted one-sided single-copy inter-process communication: KNEM (0.9.7) and LIMIC (0.5.5); XPMEM (Cross-Process Memory Mapping); CMA (Cross Memory Attach).

10 Development of the kernel-assisted approach in MPI stacks. Intra-node p2p comm.: MPICH2-LMT (KNEM), Open MPI (SM/KNEM BTL, vader BTL), MVAPICH2 (LIMIC). Intra-node collective comm.: KNEM Coll. T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra: Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs. ICPP 2011. Inter- and intra-node collective comm.: HierKNEM Coll. T. Ma, G. Bosilca, A. Bouteiller, J. J. Dongarra: HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, submitted to IPDPS 2012.

11 Agenda: Introduction; Related work; Kernel-assisted approach; HierKNEM; Experiments; Conclusion

12 Framework of HierKNEM Subgroup: Intra-node Comm. Inter-node Comm.

13 Broadcast. Figure labels: inter-node forward, KNEM read, leader processes, non-leader processes.
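
The data flow named on the Broadcast slide can be sketched as a transfer plan. This is a pure-Python model, not the HierKNEM implementation: the leader election rule (the root on its own node, the lowest rank elsewhere), the chain used for the inter-node forward, and the name `hier_bcast_plan` are assumptions made for illustration, and the "knem read" entries only stand for the single-copy read a non-leader would issue against its leader's buffer.

```python
from collections import defaultdict


def hier_bcast_plan(node_of_rank, root=0):
    """Return (src, dst, kind) transfers for a two-level broadcast.

    node_of_rank[r] is the id of the node hosting rank r.
    """
    ranks_on = defaultdict(list)
    for rank, node in enumerate(node_of_rank):
        ranks_on[node].append(rank)

    root_node = node_of_rank[root]
    # One leader per node: the root on its own node, the lowest rank elsewhere.
    leader = {
        node: (root if node == root_node else min(ranks))
        for node, ranks in ranks_on.items()
    }

    plan = []
    # Phase 1 (assumed chain): leaders forward the message from node to node.
    chain = [root_node] + sorted(n for n in ranks_on if n != root_node)
    for src_node, dst_node in zip(chain, chain[1:]):
        plan.append((leader[src_node], leader[dst_node], "inter-node forward"))
    # Phase 2: each non-leader pulls the data from its node leader's buffer.
    for node in chain:
        for rank in ranks_on[node]:
            if rank != leader[node]:
                plan.append((leader[node], rank, "knem read"))
    return plan


if __name__ == "__main__":
    for transfer in hier_bcast_plan([0, 0, 1, 1]):
        print(transfer)
```

The plan lists which transfers happen, not when. In a real run the two phases would be expected to proceed concurrently on different segments of the message rather than one after the other.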

14 Figure: Bcast with 64 processes on Dancer's 8 nodes (8 cores/node), 256KB message size. Legend: SendRecv, KNEM Copy.

15 Reduce. Figure labels: intra-node comm., inter-node comm., New_Comm., inter-node forward, KNEM read/write.
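
A two-level reduction in the spirit of this slide can be sketched as follows. It is an illustrative model only: the operation is fixed to a sum, the inter-node step is a flat gather of per-node partial results at the root's node, and the name `hier_reduce` is hypothetical. The slide does not state which inter-node pattern HierKNEM uses among the leaders.

```python
def hier_reduce(values, node_of_rank, root=0):
    """Sum `values` (one per rank) in two levels.

    Returns (result, inter_node_messages).
    """
    root_node = node_of_rank[root]
    # Intra-node phase: each node combines its local contributions at its leader.
    partial = {}
    for rank, node in enumerate(node_of_rank):
        partial[node] = partial.get(node, 0) + values[rank]
    # Inter-node phase (assumed flat): every other node sends one partial
    # result towards the root's node.
    result = partial[root_node]
    messages = 0
    for node, value in partial.items():
        if node != root_node:
            result += value
            messages += 1
    return result, messages


if __name__ == "__main__":
    print(hier_reduce(list(range(8)), [0, 0, 0, 0, 1, 1, 1, 1]))
```

In this model the number of messages crossing the network depends only on the number of nodes, not on how ranks are bound to cores.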

16 Allgather: Topology-aware Ring
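
One way to build a ring that respects node boundaries is to order the ranks by the node that hosts them, so that ring neighbours share a node wherever possible. The sketch below assumes the placement is available as a plain list (`node_of_rank`, a hypothetical name) and is not taken from the HierKNEM code; the slide gives only the name of the technique.

```python
def topology_aware_ring(node_of_rank):
    """Order ranks so that ranks sharing a node are consecutive in the ring."""
    return sorted(range(len(node_of_rank)), key=lambda r: (node_of_rank[r], r))


def inter_node_links(ring, node_of_rank):
    """Count ring edges (including the wrap-around edge) that cross nodes."""
    return sum(
        node_of_rank[ring[i]] != node_of_rank[ring[(i + 1) % len(ring)]]
        for i in range(len(ring))
    )


if __name__ == "__main__":
    bynode = [0, 1, 0, 1]            # ranks dealt round-robin over 2 nodes
    naive = list(range(len(bynode)))  # ring in plain rank order
    aware = topology_aware_ring(bynode)
    print("naive:", naive, inter_node_links(naive, bynode))
    print("aware:", aware, inter_node_links(aware, bynode))
```

With the by-node placement of 4 ranks over 2 nodes, the rank-ordered ring crosses nodes on every edge, while the reordered ring crosses only twice, the same count a by-core placement would give.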

17 Agenda: Introduction; Related work; Kernel-assisted approach; HierKNEM; Experiments; Conclusion

18 Hardware Environment. Stremi cluster: 32 nodes; node: AMD 24-core; Gigabit Ethernet. Parapluie cluster: 32 nodes; node: AMD 24-core; 20G InfiniBand.

19 Software Environment. Open MPI 1.5.3, MPICH2-1.4 and MVAPICH2-1.7. KNEM version 0.9.6, LIMIC 0.5.5. IMB-3.2 (cache on). Unless otherwise noted, the same mapping between cores and processes is always used (--bycore).

20 Broadcast Performance. Figure: Aggregate Broadcast bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes). Figure annotations: "More than 30 times!!" and "More than twice".

21 Reduce Performance Figure: Aggregate Reduce bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).

22 Allgather Performance Figure: Aggregate Allgather bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node).

23 Topology-aware Operations Figure: Impact of process mapping: aggregate Broadcast and Allgather bandwidth of the collective modules for two different process-core bindings: by core and by node (Parapluie cluster, IB20G, 768 processes, 24 cores/node).

24 Core per Node Scalability Figure: Core per node scalability: aggregate bandwidth of Broadcast for 2MB messages on multicore clusters (32 nodes).

25 Conclusion. HierKNEM achieves a large speedup by overlapping inter- and intra-node communication. HierKNEM is immune to changes in the underlying process-core binding (topology-aware). HierKNEM provides a linear speedup as the number of cores per node increases.

