Multi-core and Network Aware MPI Topology Functions
Mohammad J. Rashti, Jonathan Green, Pavan Balaji, Ahmad Afsahi, and William D. Gropp
Department of Electrical and Computer Engineering, Queen’s University
Mathematics and Computer Science, Argonne National Laboratory
Department of Computer Science, University of Illinois at Urbana-Champaign

Presentation Outline
- Introduction
- Background and Motivation
- MPI Graph and Cartesian Topology Functions
- Related Work
- Design and Implementation of Topology Functions
- Experimental Framework and Performance Results
  - Micro-benchmark Results
  - Applications Results
- Concluding Remarks and Future Work

Introduction
- MPI is the main standard for communication in HPC clusters.
- Scalability is the major concern for MPI on large-scale hierarchical systems.
- System topology awareness is essential for MPI scalability:
  - Being aware of the performance implications at each architectural hierarchy of the machine
  - Efficiently mapping processes to processor cores, based on the application’s communication pattern
- Such functionality should be embedded in the MPI topology interface.

Background and Motivation
- MPI topology functions:
  - Define the communication topology of the application (logical process arrangement, or virtual topology)
  - May reorder the processes to map them efficiently onto the system architecture (physical topology) for better performance
- Virtual topology models:
  - Cartesian topology: multi-dimensional Cartesian arrangement
  - Graph topology: non-specific (arbitrary) graph arrangement
- Graph topology representation:
  - Non-distributed: easier to manage, less scalable
  - Distributed: new to the standard, more scalable

Background and Motivation (II)
- However, topology functions are mostly used only to construct the process arrangement (i.e., the virtual topology):
  - Most MPI applications do not use them for performance improvement.
- In addition, MPI implementations offer only trivial functionality for these functions:
  - Mainly constructing the virtual topology
  - No reordering of the ranks, and thus no performance improvement
- This work designs topology functions with reordering capability:
  - Designing the non-distributed API functions
  - Supporting multi-hierarchy nodes and networks

MPI Graph and Cartesian Topology Functions
- MPI defines a set of virtual topology definition functions for graph and Cartesian structures.
- The non-distributed functions MPI_Graph_create and MPI_Cart_create:
  - Are collective calls that accept a virtual topology
  - Return a new MPI communicator enclosing the desired topology
- The input topology is in a non-distributed form:
  - All processes have a full view of the entire structure and pass the whole description to the function.
- If the user opts for reordering, the function may reorder the ranks for an efficient process-to-core mapping.

MPI Graph and Cartesian Topology Functions (II)
- MPI_Cart_create(comm_old, ndims, dims, periods, reorder, comm_cart)
  - comm_old [in] input communicator without topology (handle)
  - ndims [in] number of dimensions of the Cartesian grid (integer)
  - dims [in] integer array of size ndims specifying the number of processes in each dimension
  - periods [in] logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each dimension
  - reorder [in] ranking may be reordered (true) or not (false) (logical)
  - comm_cart [out] communicator with Cartesian topology (handle)
- Example (4x2 2D-Torus): ndims = 2, dims = {4, 2} (4 processes in the first dimension, 2 in the second), periods = {1, 0}
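To make the call above concrete, here is a minimal C sketch (not taken from the slides) that builds the 4x2 example with reordering enabled; it assumes it is run with at least 8 processes (any extra processes receive MPI_COMM_NULL).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int dims[2]    = {4, 2};   /* 4x2 process grid, as in the slide's example */
        int periods[2] = {1, 0};   /* periodic in the first dimension only */
        MPI_Comm cart_comm;

        /* reorder = 1 allows the library to assign new ranks for a better
         * process-to-core mapping, which is the capability this work adds. */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);

        if (cart_comm != MPI_COMM_NULL) {
            int new_rank, coords[2];
            MPI_Comm_rank(cart_comm, &new_rank);
            MPI_Cart_coords(cart_comm, new_rank, 2, coords);
            printf("rank %d -> coords (%d,%d)\n", new_rank, coords[0], coords[1]);
            MPI_Comm_free(&cart_comm);
        }
        MPI_Finalize();
        return 0;
    }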

MPI Graph and Cartesian Topology Functions (III)
- MPI_Graph_create(comm_old, nnodes, index, edges, reorder, comm_graph)
  - comm_old [in] input communicator without topology (handle)
  - nnodes [in] number of nodes in the graph (integer)
  - index [in] array of integers describing the node degrees (cumulative)
  - edges [in] array of integers describing the graph edges
  - reorder [in] ranking may be reordered (true) or not (false) (logical)
  - comm_graph [out] communicator with graph topology added (handle)
- Example: nnodes = 4, index = {2, 3, 4, 6}, edges = {1, 3, 0, 3, 0, 2}
  - Process 0: neighbors 1, 3
  - Process 1: neighbor 0
  - Process 2: neighbor 3
  - Process 3: neighbors 0, 2
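As a complement to the parameter list above, the following C sketch (not from the slides) creates the example graph with reordering enabled; it assumes the communicator has at least 4 processes, so the extra processes receive MPI_COMM_NULL.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* The slide's example: 4 nodes, cumulative degrees, flat edge list.
         * Process 0: neighbors 1,3; process 1: 0; process 2: 3; process 3: 0,2. */
        int nnodes  = 4;
        int index[] = {2, 3, 4, 6};
        int edges[] = {1, 3, 0, 3, 0, 2};
        MPI_Comm graph_comm;

        /* reorder = 1 lets the library permute ranks to match the machine. */
        MPI_Graph_create(MPI_COMM_WORLD, nnodes, index, edges, 1, &graph_comm);

        if (graph_comm != MPI_COMM_NULL) {
            int rank, nneighbors;
            MPI_Comm_rank(graph_comm, &rank);
            MPI_Graph_neighbors_count(graph_comm, rank, &nneighbors);
            printf("rank %d has %d graph neighbors\n", rank, nneighbors);
            MPI_Comm_free(&graph_comm);
        }
        MPI_Finalize();
        return 0;
    }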

Related Work (I)
- Hatazaki and Träff worked on topology mapping using graph-embedding algorithms (Euro PVM/MPI 1998, SC 2002).
- Träff et al. proposed extending the MPI-1 topology interface (HIPS 2003, Euro PVM/MPI 2006):
  - To support weighted-edge topologies and dynamic process reordering
  - To provide architectural clues to the applications for a better mapping
- The MPI Forum introduced distributed topology functionality in MPI-2.2 (2009).
- Hoefler et al. proposed guidelines for efficient implementation of the distributed topology functionality (CCPE 2010).

Related Work (II)
- Mercier et al. studied efficient process-to-core mapping (Euro PVM/MPI 2009, EuroPar 2010):
  - Using external libraries for node architecture discovery and graph mapping
  - Using weighted graphs and/or trees, outside the MPI topology interface
- How is our work different from the related work?
  - It supports a physical topology spanning both the nodes and the network.
  - It uses edge replication to support weighted edges in virtual topology graphs.
  - It integrates the above functionality into the MPI non-distributed topology interface.

Design of MPI Topology Functions (I)
- Both the Cartesian and graph interfaces are treated as graphs at the underlying layers:
  - A Cartesian topology is internally copied to a graph topology.
- Virtual topology graph:
  - Vertices: MPI processes
  - Edges: existence, or significance, of communication between any two processes
  - Significance of communication: the normalized total communication volume between a pair of processes, used as the edge weight
  - Edge replication is used to represent graph edge weights (see the sketch below)
    o Recall: the MPI non-distributed interface does not support weighted edges.
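A minimal sketch of the edge-replication idea, assuming it means listing a neighbor multiple times in the flat index/edges arrays accepted by MPI_Graph_create; the function and variable names are illustrative, not the paper's actual code.

    #include <stdlib.h>

    /* Build index[]/edges[] for MPI_Graph_create, where an edge of integer
     * weight w between processes u and v is listed w times in u's adjacency.
     * weight is an illustrative nnodes x nnodes matrix (0 = no communication). */
    void build_replicated_graph(int nnodes, int **weight,
                                int **index_out, int **edges_out)
    {
        int total = 0;
        for (int u = 0; u < nnodes; u++)
            for (int v = 0; v < nnodes; v++)
                total += weight[u][v];

        int *index = malloc(nnodes * sizeof(int));
        int *edges = malloc(total * sizeof(int));

        int pos = 0;
        for (int u = 0; u < nnodes; u++) {
            for (int v = 0; v < nnodes; v++)
                for (int w = 0; w < weight[u][v]; w++)
                    edges[pos++] = v;      /* repeat neighbor v "weight" times */
            index[u] = pos;                /* cumulative degree, as MPI expects */
        }
        *index_out = index;
        *edges_out = edges;
    }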

Design of MPI Topology Functions (II)
- Physical topology graph:
  - Integrated node and network architecture
  - Vertices: architectural components, such as
    o Network nodes
    o Cores
    o Caches
  - Edges: communication links between the components
  - Edge weights: communication performance between the components
    o Processor cores: closer cores have a higher edge weight
    o Network nodes: closer nodes have a higher edge weight
    o The farthest on-node cores get a higher weight than the closest network nodes

Physical Topology Distance Example
- d1 will have the highest load value in the graph.
- The path between N2 and N3 (d4) will have the lowest load value, indicating the lowest-performance path.
- d1 > d2 > d3 > d4 = 1

Tools for Implementation of Topology Functions
- HWLOC library for extracting the node architecture:
  - A tree architecture, with nodes at the top level and cores at the leaves
  - Cores with lower-level common parents (such as caches) are considered to have higher communication performance.
- IB subnet manager (ibtracert) for extracting network distances:
  - Do the discovery offline, before the application runs
  - Produce a pre-discovered network distance file
- Scotch library for mapping virtual to physical topologies:
  - Source and target graphs are weighted and undirected
  - Uses recursive bi-partitioning for graph mapping
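As an illustration of the node-architecture discovery described above, the following C sketch uses the hwloc API to treat two cores as closer when their deepest common ancestor (for example, a shared cache) sits lower in the topology tree. This is an assumed usage example, not the paper's implementation.

    #include <hwloc.h>
    #include <stdio.h>

    /* A deeper common ancestor means the two cores share a lower-level
     * resource (e.g., an L2/L3 cache) and are assumed to communicate faster. */
    static int core_closeness(hwloc_topology_t topo, unsigned i, unsigned j)
    {
        hwloc_obj_t a = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        hwloc_obj_t b = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, j);
        hwloc_obj_t anc = hwloc_get_common_ancestor_obj(topo, a, b);
        return (int) anc->depth;   /* larger depth = closer cores */
    }

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        printf("cores on this node: %d\n", ncores);
        if (ncores >= 2)
            printf("closeness(core 0, core 1) = %d\n",
                   core_closeness(topo, 0, 1));

        hwloc_topology_destroy(topo);
        return 0;
    }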

Implementation of Topology Functions
- Communication pattern profiling:
  - Probes are placed inside the MPI library to profile the application’s communication pattern.
  - Pairwise communication volume is normalized, with 0 meaning no edge between the two vertices.
- All processes perform node architecture discovery.
- One process performs network discovery for all.
- The physical architecture view is made uniform across the processes (using Allgather).
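The sketch below illustrates the general idea of the profiling and exchange step: count the bytes each process sends to every peer and share the counts with MPI_Allgather so that every process sees the full communication matrix. The function and variable names are illustrative; this is not the MVAPICH2 instrumentation itself.

    #include <mpi.h>
    #include <stdlib.h>

    /* Illustrative per-peer byte counters, allocated at startup with one
     * slot per process and updated by profiling probes placed around the
     * point-to-point send calls. */
    static long *bytes_to_peer;

    void profile_send(int dest, int count, MPI_Datatype type)
    {
        int size;
        MPI_Type_size(type, &size);
        bytes_to_peer[dest] += (long) count * size;
    }

    /* Gather every process's row so that all processes hold the full
     * nprocs x nprocs communication-volume matrix (to be normalized into
     * edge weights afterwards). */
    long *gather_comm_matrix(MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);
        long *matrix = malloc((size_t) nprocs * nprocs * sizeof(long));
        MPI_Allgather(bytes_to_peer, nprocs, MPI_LONG,
                      matrix,        nprocs, MPI_LONG, comm);
        return matrix;
    }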

Ahmad Afsahi Parallel Processing Research Laboratory 17 Existing MPICH function Graph topology Graph topology initialization Creating physical topology: by extracting and merging node and network architectures. 1. Initialize Scotch architecture. 2. Extract network topology (if required). 3. Extract node topology. 4. Merge node and network topology. 5. Distribute the merged topology among processes (using allgather). 6. Build Scotch physical topology. Constructing a new reordered communicator: using Scotch mapping of the previous step. SCOTCH HWLOC Cartesian topology Trivial graph topology creation Trivial Cartesian topology creation Cartesian topology initialization No Reorder Reorder SCOTCH Graph mapping: by constructing Scotch weighted virtual topology from the input graph and mapping it to the extracted physical topology. 1. Initialize and build the Scotch virtual topology graph. 2. Initialize the mapping algorithms’ strategy in Scotch. 3. Map the virtual topology graph to the extracted physical topology. Creating the new MPI communicator IB Subnet manager Flow of Functionalities Creating equivalent graph topology Application profiling Input virtual topology graph New function added to MPICH External library utilized Calling a function Following a function in the code

Experimental Framework
- Cluster A (4 servers, 32 cores total)
  - Hosts: 2-way quad-core AMD Opteron 2350 servers, with a 2MB shared L3 cache per processor and 8GB RAM
  - Network: QDR InfiniBand, 3 switches at 2 levels
  - Software: Fedora 12, Kernel, MVAPICH2 1.5, OFED
- Cluster B (16 servers, 192 cores total)
  - Hosts: 2-way hexa-core Intel Xeon X5670 servers, with a 12MB multi-level cache per processor and 24GB RAM
  - Network: QDR InfiniBand, 4 switches at 2 levels
  - Software: RHEL 5, Kernel, MVAPICH2 1.5, OFED 1.5.2

MPI Applications – Some Statistics
- NPB CG
  - MPI Send/Irecv: ~100% of the calls
  - MPI Barrier: ~0% of the calls
- NPB MG
  - MPI Send/Irecv: 98.5% of the calls, ~100% of the volume
  - MPI Allreduce, Reduce, Barrier, Bcast: 1.5% of the calls, ~0.002% of the volume
- LAMMPS
  - MPI Send/Recv/Irecv/Sendrecv: 95% of the calls, 99% of the volume
  - MPI Allreduce, Reduce, Barrier, Bcast, Scatter, Allgather, Allgatherv: 5% of the calls, 1% of the volume

Exchange Micro-benchmark: Topology-aware Mapping Improvement over Block Mapping (%)

Exchange Micro-benchmark: Topology-aware Mapping Improvement over Block Mapping (%)

Collective Micro-benchmark: Topology-aware Mapping Improvement over Block Mapping (%)

Applications: Topology-aware Mapping Improvement over Cyclic Mapping (%), 32-core Cluster A

Applications: Topology-aware Mapping Improvement over Block Mapping (%), 32-core Cluster A

Applications: Topology-aware Mapping Improvement over Cyclic Mapping (%), 128-core Cluster B

Applications: Topology-aware Mapping Improvement over Block Mapping (%), 128-core Cluster B

Communicator Creation Time in MPI_Graph_create for LAMMPS
- Table columns: System, # Processes, Trivial (ms), Non-weighted Graph (ms), Weighted Graph (ms), Network-aware Graph (ms)
- Table rows: Cluster A, Cluster B

Concluding Remarks
- We presented the design and implementation of the MPI non-distributed graph and Cartesian topology functions in MVAPICH2, for multi-core nodes connected through multi-level InfiniBand networks.
- The micro-benchmarks showed that the effect of reordering process ranks can be significant, and that when the communication is heavier along one dimension, the benefits of using weighted and network-aware graphs (instead of non-weighted graphs) are considerable.
- We also modified MPI applications to use MPI_Graph_create. The evaluation results showed that MPI applications can benefit from a topology-aware MPI_Graph_create.

Future Work
- We intend to evaluate the effect of topology awareness on other MPI applications.
- We would also like to run our applications on a larger testbed.
- We would like to design a more general communication cost/weight model for graph mapping, and to use other mapping libraries.
- We also intend to design and implement the MPI distributed topology functions in a more distributed, scalable fashion.

Acknowledgment

Thank you!
Contacts:
- Mohammad Javad Rashti:
- Jonathan Green:
- Pavan Balaji:
- Ahmad Afsahi:
- William D. Gropp:

Backup Slides

Flow of function calls in the MVAPICH code (diagram)
- Entry points: MPI_Cart_create and MPI_Graph_create
- Functions involved: MPIR_Cart_create_reorder, MPIR_Graph_create_reorder, MPIR_Topo_create, MPIR_Graph_create, MPIU_Get_scotch_arch, SCOTCH_Graph_build/map (Scotch mapping), MPIR_Comm_copy, MPIR_Comm_copy_reorder
- Paths: No Reorder / Reorder
- External components: SCOTCH, HWLOC, IB subnet manager
- Legend: existing MPICH function, new function added to MPICH, external library utilized, calling a function, following a function in the code