HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters
Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra
Dec 2 Lunch Talk

Agenda: Introduction, Related work, Kernel-assisted approach, HierKNEM, Experiments, Conclusion

Introduction
- Multi-core clusters introduce hardware hierarchies.
- Message passing is still the dominant programming model.
- Programming libraries want to handle the hierarchies internally.
- Collective communication is critical to application performance.

Problem: Tuned Collective
- It cannot see the boundaries introduced by the hierarchies of multi-core clusters.
- It builds a logical topology without runtime hardware topology information.

Topology-Unaware: Mismatch Problem*
[Figure: Open MPI Tuned Allgather ring algorithm under two process-core binding cases (--bycore and --bynode); processes P0-P3 on two nodes with four cores each.]
* T. Ma, T. Herault, G. Bosilca and J. J. Dongarra, "Process Distance-Aware Adaptive MPI Collective Communications," Cluster 2011.
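As an illustration of the mismatch (not taken from the paper), the sketch below builds the purely rank-based ring that a topology-unaware algorithm would use; whether each ring neighbor lives on the same node or across the network is decided entirely by the launch-time binding (--bycore vs. --bynode).

```c
/* Illustration only: a logical ring built from MPI ranks ignores the
 * hardware topology. Whether a "neighbor" shares the node or sits
 * across the network depends on the process-core binding chosen at
 * launch time, not on the algorithm itself. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank-based ring neighbors: topology-unaware. */
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    /* With --bycore, consecutive ranks tend to share a node, so most
     * ring hops stay inside a node; with --bynode, consecutive ranks
     * land on different nodes, so every hop crosses the network. */
    char node[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(node, &len);
    printf("rank %d on %s: ring neighbors %d (left), %d (right)\n",
           rank, node, left, right);

    MPI_Finalize();
    return 0;
}
```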

Agenda: Introduction, Related work, Kernel-assisted approach, HierKNEM, Experiments, Conclusion

Related Work
- Cheetah: R. Graham et al., "Cheetah: A Framework for Scalable Hierarchical Collective Operations," CCGRID 2011.
- Distance-aware framework: T. Ma et al., "Process Distance-Aware Adaptive MPI Collective Communications," CLUSTER 2011.
[Figure: SBGP/BCOL subgrouping over IB links, NUMA links, and intra-socket links.]

Agenda: Introduction, Related work, Kernel-assisted approach, HierKNEM, Experiments, Conclusion

Status of Kernel-Assisted One-Sided Single-Copy Inter-Process Communication
- KNEM (0.9.7) and LIMIC (0.5.5)
- XPMEM (Cross-Process Memory Mapping)
- CMA (Cross Memory Attach)
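Of these mechanisms, CMA is exposed directly as Linux system calls, which makes it easy to show the single-copy idea in isolation. The sketch below uses process_vm_readv; the remote PID and buffer address are placeholders that a real MPI library would exchange out of band.

```c
/* Minimal sketch of a Cross Memory Attach (CMA) single-copy read
 * (Linux >= 3.2). remote_pid and remote_addr are placeholders; an MPI
 * library would exchange them out of band before the copy. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    pid_t  remote_pid  = 12345;           /* placeholder: peer process id  */
    void  *remote_addr = (void *)0x1000;  /* placeholder: peer buffer addr */
    size_t len = 4096;

    char *local_buf = malloc(len);

    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* One syscall moves the data straight from the remote address space
     * into local_buf: a single copy, no intermediate shared buffer. */
    ssize_t n = process_vm_readv(remote_pid, &local, 1, &remote, 1, 0);
    if (n < 0)
        perror("process_vm_readv");
    else
        printf("copied %zd bytes with a single copy\n", n);

    free(local_buf);
    return 0;
}
```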

Development of the kernel-assisted approach in MPI stacks
- Intra-node point-to-point communication: MPICH2-LMT (KNEM), Open MPI (SM/KNEM BTL, vader BTL), MVAPICH2 (LIMIC)
- Intra-node collective communication (KNEM Coll): T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra, "Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs," ICPP 2011
- Inter- and intra-node collective communication (HierKNEM Coll): T. Ma, G. Bosilca, A. Bouteiller, J. J. Dongarra, "HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters," submitted to IPDPS 2012

Agenda: Introduction, Related work, Kernel-assisted approach, HierKNEM, Experiments, Conclusion

Framework of HierKNEM: processes are grouped into subgroups that separate intra-node communication from inter-node communication.
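A minimal sketch, not the HierKNEM implementation itself, of how such a two-level subgrouping can be expressed with standard MPI-3 calls: one shared-memory communicator per node, plus a leader communicator built from each node's local rank 0. The helper name build_hierarchy is hypothetical.

```c
/* Sketch of the two-level subgrouping: one intra-node communicator per
 * node plus an inter-node "leader" communicator. Hypothetical helper,
 * not the HierKNEM code. */
#include <mpi.h>

void build_hierarchy(MPI_Comm comm, MPI_Comm *intra, MPI_Comm *leaders)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* All processes that share a node end up in the same communicator. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, intra);

    int local_rank;
    MPI_Comm_rank(*intra, &local_rank);

    /* Local rank 0 on each node acts as the leader; leaders form their
     * own communicator for inter-node traffic, all other processes get
     * MPI_COMM_NULL. */
    int color = (local_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(comm, color, rank, leaders);
}
```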

Broadcast
- Leader processes handle the inter-node forwarding.
- Non-leader processes obtain the data through KNEM reads.
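A simplified sketch of that structure on top of the communicators from the previous sketch. The hypothetical hier_bcast0 assumes the root is global rank 0 (a leader), and an intra-node MPI_Bcast stands in for the KNEM reads that HierKNEM overlaps with the inter-node forwarding.

```c
/* Simplified hierarchical broadcast: leaders forward the buffer between
 * nodes, then each leader serves its node-local peers. HierKNEM
 * overlaps the two phases and uses KNEM reads for the intra-node part;
 * the second MPI_Bcast below is only a stand-in for that step. */
#include <mpi.h>

int hier_bcast0(void *buf, int count, MPI_Datatype dtype,
                MPI_Comm intra, MPI_Comm leaders)
{
    /* Root is assumed to be global rank 0, i.e. leader 0. */

    /* Phase 1: inter-node forwarding among the per-node leaders. */
    if (leaders != MPI_COMM_NULL)
        MPI_Bcast(buf, count, dtype, 0, leaders);

    /* Phase 2: intra-node distribution from each leader. */
    MPI_Bcast(buf, count, dtype, 0, intra);
    return MPI_SUCCESS;
}
```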

[Figure: Bcast with 64 processes on 8 nodes of the Dancer cluster (8 cores/node), 256 KB message size, showing SendRecv and KNEM copy operations.]

Reduce
- Intra-node contributions are collected through KNEM read/write.
- Inter-node forwarding happens over a new communicator (New_Comm) of leader processes.
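The same pattern mirrored for Reduce, again as a sketch with a hypothetical hier_reduce0: an intra-node reduction onto each leader followed by an inter-node reduction among the leaders. HierKNEM performs the intra-node step with KNEM read/write and overlaps it with the inter-node phase instead of running the two back to back.

```c
/* Simplified hierarchical reduce: combine contributions inside each
 * node first, then reduce the per-node partial results across nodes
 * over the leader communicator. The final result lands on global
 * rank 0, assumed to be a leader. */
#include <mpi.h>
#include <stdlib.h>

int hier_reduce0(const void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype dtype, MPI_Op op,
                 MPI_Comm intra, MPI_Comm leaders)
{
    int type_size;
    MPI_Type_size(dtype, &type_size);
    void *partial = malloc((size_t)count * type_size);

    /* Phase 1: intra-node reduction onto the node leader (local rank 0). */
    MPI_Reduce(sendbuf, partial, count, dtype, op, 0, intra);

    /* Phase 2: inter-node reduction among the leaders only. */
    if (leaders != MPI_COMM_NULL)
        MPI_Reduce(partial, recvbuf, count, dtype, op, 0, leaders);

    free(partial);
    return MPI_SUCCESS;
}
```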

Allgather: Topology-aware Ring
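A sketch of the same hierarchical idea for Allgather, with a hypothetical hier_allgather that assumes equal process counts per node and node-contiguous ranks. HierKNEM's topology-aware ring additionally pipelines the inter-node exchange with KNEM copies instead of calling a plain MPI_Allgather among leaders.

```c
/* Simplified topology-aware allgather: gather each node's contributions
 * onto its leader, exchange whole node blocks between leaders, then
 * broadcast the assembled result inside each node. Assumes every node
 * hosts the same number of processes and ranks are node-contiguous. */
#include <mpi.h>
#include <stdlib.h>

int hier_allgather(const void *sendbuf, int count, MPI_Datatype dtype,
                   void *recvbuf, MPI_Comm comm,
                   MPI_Comm intra, MPI_Comm leaders)
{
    int world_size, local_size, type_size;
    MPI_Comm_size(comm, &world_size);
    MPI_Comm_size(intra, &local_size);
    MPI_Type_size(dtype, &type_size);

    /* Staging buffer holding this node's local_size contributions. */
    void *node_block = malloc((size_t)local_size * count * type_size);

    /* Phase 1: gather the node's contributions onto its leader. */
    MPI_Gather(sendbuf, count, dtype, node_block, count, dtype, 0, intra);

    /* Phase 2: leaders exchange whole node blocks (HierKNEM uses a
     * pipelined ring here; a plain allgather keeps the sketch short). */
    if (leaders != MPI_COMM_NULL)
        MPI_Allgather(node_block, local_size * count, dtype,
                      recvbuf, local_size * count, dtype, leaders);

    /* Phase 3: each leader hands the assembled result to its peers. */
    MPI_Bcast(recvbuf, world_size * count, dtype, 0, intra);

    free(node_block);
    return MPI_SUCCESS;
}
```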

Agenda: Introduction, Related work, Kernel-assisted approach, HierKNEM, Experiments, Conclusion

Hardware Environment
- Stremi cluster: 32 nodes, 24-core AMD per node, Gigabit Ethernet.
- Parapluie cluster: 32 nodes, 24-core AMD per node, 20G InfiniBand.

Software Environment
- Open MPI 1.5.3, MPICH2-1.4 and MVAPICH2-1.7
- KNEM 0.9.6, LIMIC
- IMB-3.2 (cache on)
- Unless otherwise noted, the same process-to-core mapping (--bycore) is used throughout.

Broadcast Performance
Figure: Aggregate Broadcast bandwidth of the collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes). Slide annotations: "More than 30 times!!", "More than twice".

Reduce Performance Figure: Aggregate Reduce bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).

Allgather Performance Figure: Aggregate Allgather bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node).

Topology-aware Operations Figure: Impact of process mapping: aggregate Broadcast and Allgather bandwidth of the collective modules for two different process-core bindings: by core and by node (Parapluie cluster, IB20G, 768 processes, 24 cores/node).

Core per Node Scalability Figure: Core per node scalability: aggregate bandwidth of Broadcast for 2MB messages on multicore clusters (32 nodes).

Conclusion
- HierKNEM achieves large speedups by overlapping inter- and intra-node communication.
- HierKNEM is immune to changes in the underlying process-core binding (topology-aware).
- HierKNEM provides linear speedup as the number of cores per node increases.