Natively Supporting True One-sided Communication in MPI on Multi-core Systems with InfiniBand
G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, D. K. Panda

Presentation transcript:

Natively Supporting True One-sided Communication in MPI on Multi-core Systems with InfiniBand

G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, D. K. Panda

Dept. of Computer Science, Ohio State University
Mathematics and Computer Science, Argonne National Laboratory
Dept. of Computer Science, University of Illinois at Urbana-Champaign

CCGrid (05/21/2009), Pavan Balaji, Argonne National Laboratory

Massive Systems & Scalable Applications Massive High-end Computing (HEC) Systems available –Systems with few thousand cores are common –Systems with 10s to 100s of thousands of cores available Scalable application communication patterns –Clique-based communication Nearest neighbor: Ocean/Climate modeling, PDE solvers Cartesian grids: 3DFFT –Unsynchronized communication Minimize need to synchronize with other processes Non-blocking communication is heavily used One-sided communication is getting popular CCGrid (05/21/2009) Pavan Balaji, Argonne National Laboratory

One-sided Communication (RMA)

Popular in many modern programming models:
– MPI, UPC, Global Arrays

The idea is to have an easily accessible global address space alongside each process's private virtual address space.

[Figure: processes 0-3 each keep a private virtual address space and contribute a shared virtual address region; together these regions form the global address space.]

RMA Benefits for Applications

Access to additional address space:
– Not limited by the memory available per core
– Two-sided communication requires data to be explicitly moved between virtual address spaces

Tolerance to load imbalance:
– Work stealing is simple and efficient (unsynchronized), as sketched below:
  Lock the "global address region"
  Modify the required regions
  Unlock the "global address region"
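The following is a minimal sketch, in C against the standard MPI-2 interface, of that lock/modify/unlock pattern. The window layout (one integer work counter per process at displacement 0), the victim rank, and the CHUNK size are illustrative assumptions, not details taken from the paper.

    #include <mpi.h>

    #define CHUNK 16   /* number of work units to steal per attempt (assumed) */

    /* Try to steal up to CHUNK work units from 'victim'. 'win' is assumed to
     * expose one int (the remaining-work counter) at displacement 0. */
    static int try_steal(MPI_Win win, int victim)
    {
        int remaining, take;

        /* Exclusive epoch 1: read the victim's counter. In MPI-2 the result
         * of MPI_Get is only guaranteed once the epoch closes at unlock. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, victim, 0, win);
        MPI_Get(&remaining, 1, MPI_INT, victim, 0, 1, MPI_INT, win);
        MPI_Win_unlock(victim, win);

        if (remaining <= 0)
            return 0;

        /* Exclusive epoch 2: write back the reduced counter. */
        take = remaining < CHUNK ? remaining : CHUNK;
        remaining -= take;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, victim, 0, win);
        MPI_Put(&remaining, 1, MPI_INT, victim, 0, 1, MPI_INT, win);
        MPI_Win_unlock(victim, win);
        return take;
    }

Because MPI_LOCK_EXCLUSIVE covers only a single access epoch, the read and the write-back above are not atomic with respect to each other; a production work-stealing queue would add an application-level lock held in the window itself (or, in MPI-3, use MPI_Fetch_and_op) to close that gap.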

Presentation Layout
– Need for RMA Communication
– State of the Art and Prior Work
– Proposed Design
– Experimental Evaluation
– Concluding Remarks and Future Work

RMA in MPI

MPI is the dominant programming model for HEC systems:
– Extremely portable
– MPI-1 traditionally relied on two-sided communication:
  Each process sends data from its virtual address space
  Each process receives data into its virtual address space
  There was no notion of a "global address space"

MPI-2 introduced RMA capability (see the sketch below):
– Each process can expose a part of its address space to form a "global address space" window
– Processes can perform one-sided operations on this window
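As a concrete illustration of the window abstraction, the minimal sketch below has every process contribute a region of its own virtual address space to an MPI-2 window. The buffer size and the use of malloc are assumptions for the example, not values from the paper.

    #include <mpi.h>
    #include <stdlib.h>

    #define WIN_BYTES (1 << 20)   /* 1 MB contributed per process (assumed) */

    int main(int argc, char **argv)
    {
        MPI_Win win;
        void *base;

        MPI_Init(&argc, &argv);
        base = malloc(WIN_BYTES);

        /* Each process exposes 'base' to all processes in the communicator;
         * together these regions form the "global address space" window. */
        MPI_Win_create(base, WIN_BYTES, 1 /* disp_unit */, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        /* ... one-sided MPI_Put / MPI_Get / MPI_Accumulate on 'win',
         * bracketed by synchronization such as MPI_Win_lock/unlock ... */

        MPI_Win_free(&win);
        free(base);
        MPI_Finalize();
        return 0;
    }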

Prior Work: Hardware-Supported RMA

Initial implementations internally relied on two-sided communication:
– This loses all the benefits of a "one-sided" programming model

Our prior work utilized InfiniBand RDMA and atomic operations:
– One-sided communication becomes truly one-sided

[Figure: MPI_Win_lock issues a compare-and-swap on the remote lock word; if the word reads "unset" the lock is acquired, MPI_Get/MPI_Put proceed as RDMA operations, and MPI_Win_unlock releases the lock with another compare-and-swap.]
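The sketch below captures the shape of that truly one-sided lock acquisition. ib_cas64() is a hypothetical helper standing in for an InfiniBand atomic compare-and-swap (a real implementation would build it on ibv_post_send with an IBV_WR_ATOMIC_CMP_AND_SWP work request); its signature, the lock-word offset, and the constants are assumptions for illustration, not the paper's code.

    #include <stdint.h>

    #define LOCK_UNSET 0ULL
    #define LOCK_SET   1ULL

    /* Hypothetical helper: perform an InfiniBand atomic compare-and-swap on a
     * 64-bit word in the target's registered window memory and return the
     * value the word held before the operation. */
    uint64_t ib_cas64(int target_rank, uint64_t lock_word_offset,
                      uint64_t compare, uint64_t swap);

    /* One-sided MPI_Win_lock: spin on the remote lock word without involving
     * the target process's CPU. */
    static void win_lock_remote(int target_rank)
    {
        while (ib_cas64(target_rank, 0, LOCK_UNSET, LOCK_SET) != LOCK_UNSET) {
            /* Lock was already set; retry (a real design would back off). */
        }
        /* Lock acquired: MPI_Put / MPI_Get map directly onto RDMA write/read.
         * MPI_Win_unlock releases by swapping the word back to LOCK_UNSET. */
    }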

Multi-core Systems: Issues and Pitfalls

The number of cores per node keeps increasing, and network-based lock/unlock operations are not ideal:
– Network locks go through the network adapter (several microseconds)

CPU locks are useful when the origin is on the same node:
– They use CPU atomic instructions, which take a few hundred nanoseconds (see the sketch below)

Problem:
– Network locks and CPU locks are unaware of each other
– There is no atomicity guarantee between the two
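For the intra-node case, the corresponding primitive is a compare-and-swap executed directly by the CPU on a lock word in node-local shared memory, as in the minimal C11 sketch below. How that word is mapped into each process (for example via a mmap'd shared segment) is an assumption left out of the example.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Lock word assumed to live in a shared-memory segment mapped by all
     * processes on the node: 0 = unset, 1 = set. */
    typedef struct {
        atomic_uint lock;
    } shm_lock_t;

    static bool shm_trylock(shm_lock_t *l)
    {
        unsigned expected = 0;
        /* A CPU compare-and-swap typically costs a few hundred nanoseconds,
         * versus several microseconds for one that crosses the adapter. */
        return atomic_compare_exchange_strong(&l->lock, &expected, 1);
    }

    static void shm_unlock(shm_lock_t *l)
    {
        atomic_store(&l->lock, 0);
    }

The pitfall on the slide is exactly that this CPU-side word and the word targeted by the network atomic are updated through different hardware paths, so nothing guarantees atomicity between a CPU compare-and-swap and a network compare-and-swap on the same logical lock.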

Our Focus in this Paper

To provide true one-sided communication capabilities within the node as well as outside it.

To propose a design that achieves two seemingly contradictory goals:
– Take advantage of network hardware atomic operations for efficient inter-node RMA
– Take advantage of CPU atomic operations for efficient intra-node RMA

Presentation Layout
– Need for RMA Communication
– State of the Art and Prior Work
– Proposed Design
– Experimental Evaluation
– Concluding Remarks and Future Work

Towards a Hybrid Lock Design

Goal:
– Use network locks for inter-node RMA
– Use CPU locks for intra-node RMA

Caveat:
– No atomicity guarantee between network and CPU locks

Hybrid approach (a minimal sketch follows this slide):
– At any point the lock is either CPU-based or network-based
– Network-mode locks always go through the network: loopback if needed (inefficient!)
– CPU-mode locks always go through the CPU: using two-sided communication if needed (inefficient!)
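A minimal sketch of how that hybrid state might be represented and dispatched on is shown below. The type and helper names are illustrative assumptions, not identifiers from MVAPICH2.

    /* Per-window lock state: at any instant the lock is in exactly one mode,
     * and every acquire is routed through the matching primitive. */
    typedef enum { LOCK_MODE_CPU, LOCK_MODE_NETWORK } lock_mode_t;

    typedef struct {
        lock_mode_t mode;   /* which mechanism currently owns the lock */
        /* The CPU lock word (shared memory) and the network lock word
         * (memory registered for IB atomics) would also live here. */
    } hybrid_lock_t;

    /* Hypothetical helpers for the two acquisition paths. */
    void acquire_cpu_lock(hybrid_lock_t *l);      /* CPU atomics; two-sided
                                                     fallback for remote origins */
    void acquire_network_lock(hybrid_lock_t *l);  /* IB atomics; loopback for
                                                     local origins */

    static void hybrid_lock_acquire(hybrid_lock_t *l)
    {
        if (l->mode == LOCK_MODE_CPU)
            acquire_cpu_lock(l);
        else
            acquire_network_lock(l);
    }

The cost of preserving this invariant is that an acquire through the "wrong" path for a given origin is inefficient, which is what motivates migrating the lock mode, as the next two slides describe.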

Migrating Locks

[Figure: a window on node A, protected by both a CPU lock and a network lock, is shared by local processes P0-P2; a remote process P3 requests the lock while it is in CPU mode, and the lock migrates to network mode.]

(1) The remote compare-and-swap returns "CPU lock mode"
(2) The remote process sends a lock request to the window owner
(3) The owner acquires both the CPU lock and the network lock
(4) The lock mechanism is modified (lock mode: CPU -> Network)
(5) The lock is granted to the requester

Synchronization is required only while the lock "type" migrates; all other operations remain truly one-sided. One plausible realization of steps (3)-(5) is sketched after this slide.
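The migration itself can be pictured with the minimal sketch below. All helper names and the lock descriptor are hypothetical stand-ins introduced for illustration; the actual implementation inside MVAPICH2 may organize this differently.

    typedef enum { LOCK_MODE_CPU, LOCK_MODE_NETWORK } lock_mode_t;

    typedef struct { lock_mode_t mode; } hybrid_lock_t;

    void acquire_cpu_lock(hybrid_lock_t *l);                        /* hypothetical */
    void acquire_network_lock(hybrid_lock_t *l);                    /* hypothetical */
    void release_cpu_lock(hybrid_lock_t *l);                        /* hypothetical */
    void grant_network_lock(hybrid_lock_t *l, int requester_rank);  /* hypothetical */

    /* Performed at the window owner once a conflicting request arrives. */
    static void migrate_to_network_mode(hybrid_lock_t *l, int requester_rank)
    {
        acquire_cpu_lock(l);          /* (3) hold the CPU lock ...            */
        acquire_network_lock(l);      /*     ... and the network lock, so no
                                             acquire through either path can
                                             interleave with the switch      */

        l->mode = LOCK_MODE_NETWORK;  /* (4) modify the lock mechanism        */

        release_cpu_lock(l);          /*     the CPU path no longer owns it   */
        grant_network_lock(l, requester_rank);  /* (5) hand over the lock     */
    }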

Migration Policy

Lock migration policy: when should we use a network lock vs. a CPU lock?

Different policies are possible:
– Communication-pattern history
– User-specified priority
– Native hardware capabilities (performance of the network lock compared to the CPU lock)

We use a simple approach:
– Migrate the lock on the first conflicting request (see the sketch below)
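A sketch of that policy decision follows, under the assumption that the implementation can tell whether the requesting origin runs on the same node as the window; the enum and function are illustrative, not the paper's code.

    #include <stdbool.h>

    typedef enum { LOCK_MODE_CPU, LOCK_MODE_NETWORK } lock_mode_t;

    /* Migrate on the first conflicting request: an intra-node origin prefers
     * CPU atomics, an inter-node origin prefers network (IB) atomics, and the
     * lock mode switches as soon as a request arrives that the current mode
     * cannot serve efficiently. */
    static bool should_migrate(lock_mode_t current_mode, bool origin_on_same_node)
    {
        lock_mode_t preferred = origin_on_same_node ? LOCK_MODE_CPU
                                                    : LOCK_MODE_NETWORK;
        return preferred != current_mode;
    }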

Presentation Layout
– Need for RMA Communication
– State of the Art and Prior Work
– Proposed Design
– Experimental Evaluation
– Concluding Remarks and Future Work

Experimental Testbed

8-node AMD Barcelona cluster:
– Each node has 4 quad-core processors
– 64 KB dedicated L1 cache per core
– 512 KB dedicated L2 cache per core
– 2 MB dedicated L3 cache per processor
– 16 GB RAM per node

Network configuration:
– InfiniBand ConnectX DDR (MT25418) adapters
– PCIe Gen1 x8
– Mellanox InfiniScale III 24-port fully non-blocking switch

Overview of MVAPICH and MVAPICH2

High-performance MPI for InfiniBand and iWARP clusters:
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2), available since 2002
– Derived from the MPICH2 implementation from Argonne; MVAPICH2 shares the upper-level code and adds IB/iWARP-specific enhancements
– Used by more than 900 organizations in 48 countries (registered with OSU); more than 29,000 downloads from the OSU web site
– Powering many TOP500 production clusters (Nov '08 listing):
  62,976-core cluster (Ranger) at TACC (ranked 6th)
  18,176-core cluster (Chinook) at PNNL (ranked 20th)
– Available with the software stacks of many server vendors, including the OpenFabrics Enterprise Distribution (OFED)

Intra-node Performance

[Figure: intra-node performance results]

Contention Measurements

[Figure: lock contention results]

Migration Overhead

[Figure: lock-migration overhead results]

Emulated mpiBLAST Communication Kernel

[Figure: emulated mpiBLAST communication kernel results]

Presentation Layout
– Need for RMA Communication
– State of the Art and Prior Work
– Proposed Design
– Experimental Evaluation
– Concluding Remarks and Future Work

Concluding Remarks and Future Work

We proposed a new design for MPI RMA optimized for high-speed networks and multi-core architectures:
– It uses InfiniBand RDMA and atomics for inter-node RMA
– It uses CPU atomic primitives for intra-node RMA

The design significantly improves performance in benchmarks as well as in real application kernels.

Future work:
– Study other architectures such as Blue Gene/P
– Explore different migration policies
– Evaluate various other applications

Thank You!

Contacts: G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, D. K. Panda
Project Websites:

Backup Slides

Inter-node Performance

[Figure: inter-node performance results]