Implementing MPI on Windows: Comparison with Common Approaches on Unix
Jayesh Krishna,¹ Pavan Balaji,¹ Ewing Lusk,¹ Rajeev Thakur,¹ Fabian Tillier²
¹ Argonne National Laboratory, Argonne, IL, USA
² Microsoft Corporation, Redmond, WA, USA

2 Windows and HPC
• Popular in the desktop world
• Increasing presence in the HPC world
• Windows Server 2008 HPC Edition
• Dawning 5000A cluster with 30,720 cores at the Shanghai Supercomputer Center, running Windows HPC Server 2008, achieved more than 200 TF/s on LINPACK and ranked 10th in the November 2008 Top500 list
• Commercial applications in areas such as CFD, structural analysis, seismic modeling, finance, etc.
• Allows scientists to use a familiar desktop interface for HPC

3 Motivation
• Historically, Unix has been the OS of choice for HPC
• More research and development experience in HPC in the Unix world
• An MPI library suite on Windows allows organizations to tap into the idle-time computing power of their Windows desktops
• Users can use a familiar desktop interface for HPC
• Implementing a high-performance, portable MPI library is not trivial
• Performance vs. maintainability
  – Good performance involves using OS-specific features, which usually requires careful design for maintainability
• Portability vs. maintainability
  – A portable MPI library involves several OS-specific components
  – More code typically results in poorer maintainability due to complicated dependencies among components

4 Background (MPICH2)
• A high-performance, widely portable implementation of the MPI standard
• MPI 2.2 compliant
• Supports Windows and Unix
• Modular design supports a largely common code base
• Framework is extensible by plugging in custom network devices, process managers, and other tools
• The MPICH2 suite includes a high-performance, multi-network communication subsystem called Nemesis

5 Background (Nemesis)
• A scalable, multi-network communication subsystem available as a CH3 channel
• An efficient implementation is available on Windows (MPICH2 1.1.x series onward)
• Default channel in the MPICH2 1.3.x series on Windows; earlier on Unix
• Uses efficient shared-memory operations, lock-free algorithms, and optimized memory-copy routines
• Low shared-memory latency (240-275 ns)
• Modular design allows developers to plug in custom communication modules, e.g., Large Message Transfer (LMT) modules, internode communication modules

6 Implementing MPI
• An MPI library suite usually includes
  – a library that implements the MPI standard
  – the runtime components that support it
• Typically, the following OS services are required to implement an MPI library suite
  – Thread services: creating threads, mutex locks, etc.
  – Intranode communication services: e.g., shared-memory services
  – Internode communication services: e.g., TCP/IP sockets
  – Process-management services: creating/suspending/killing processes, interprocess communication mechanisms (e.g., pipes)

7 Implementing MPI on Windows vs. Unix
• MPICH2's modular architecture maintains a largely common code base
  – MPI layer (algorithms for collectives, etc.)
  – CH3 device (implementing the ADI3 interface)
• Portability across Windows and Unix is achieved using
  – an OS abstraction layer (e.g., for shared-memory services, TCP/IP services; see the sketch below)
  – portable libraries (e.g., the Open Portable Atomics library)
  – OS-specific plug-in modules (e.g., the Large Message Transfer module in Nemesis)
• A high-performance MPI implementation should incorporate the OS-specific features provided. These features differ between Unix and Windows with respect to
  – asynchronous progress support
  – process-management services
  – intranode communication services
  – user thread support
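
As a concrete illustration of the OS abstraction layer mentioned above, the sketch below wraps a mutex behind a single portable interface. The names (osal_mutex_t, osal_mutex_*) are hypothetical and do not reflect MPICH2's actual internal API.

    /* Hypothetical OS-abstraction sketch: one mutex interface, two OS
     * implementations. Not MPICH2's real internal API. */
    #ifdef _WIN32
      #include <windows.h>
      typedef CRITICAL_SECTION osal_mutex_t;
      #define osal_mutex_init(m)    InitializeCriticalSection(m)
      #define osal_mutex_lock(m)    EnterCriticalSection(m)
      #define osal_mutex_unlock(m)  LeaveCriticalSection(m)
      #define osal_mutex_destroy(m) DeleteCriticalSection(m)
    #else
      #include <pthread.h>
      typedef pthread_mutex_t osal_mutex_t;
      #define osal_mutex_init(m)    pthread_mutex_init((m), NULL)
      #define osal_mutex_lock(m)    pthread_mutex_lock(m)
      #define osal_mutex_unlock(m)  pthread_mutex_unlock(m)
      #define osal_mutex_destroy(m) pthread_mutex_destroy(m)
    #endif

Code that uses only osal_mutex_* compiles unchanged on both platforms; only this small header differs per OS.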

8 Asynchronous Progress
• Asynchronous progress
  – user initiates an operation
  – OS performs progress on the operation
  – OS notifies the user when the operation completes
• Allows other user operations while the OS performs progress on the user-initiated operation
• Contrast with non-blocking progress
  – user polls for availability of a resource
  – user performs a non-blocking operation
  – user polls again for availability of the resource, if needed
• No portable support for asynchronous progress for network communication on Unix systems
• Windows supports asynchronous progress using I/O completion ports

9 Asynchronous Progress (I/O Completion Ports)
• Supports asynchronous progress of user-initiated OS requests
• Provides an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system
• Nemesis uses I/O completion ports for internode communication (an older version used a non-blocking progress engine based on select()); see the sketch below
• Asynchronous progress in Nemesis
  – Nemesis posts a request to the kernel (e.g., a read on a TCP/IP socket)
  – when the request completes, the kernel queues a completion packet on the completion port
  – Nemesis processes this packet when it waits (MPI_Wait()) on the request
• The MPI process can proceed with user computation while the kernel processes the request
• Available on some Unix systems (e.g., Solaris), but not portable across Unix systems
• Several advantages over select()/poll()
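
The sketch below shows the bare I/O completion port pattern that this design builds on: associate a socket with a completion port, post an overlapped receive, and later drain the completion queue. It is a simplified, hypothetical illustration (error handling omitted), not Nemesis's actual code.

    /* Minimal IOCP sketch: post an overlapped receive, let the kernel make
     * progress, and later drain the completion queue. Assumes WSAStartup()
     * has been called and 'sock' is a connected socket. */
    #include <winsock2.h>
    #include <windows.h>

    void iocp_example(SOCKET sock, char *buf, int len)
    {
        /* Associate the socket with a (new) completion port. */
        HANDLE iocp = CreateIoCompletionPort((HANDLE)sock, NULL,
                                             (ULONG_PTR)sock, 0);

        /* Post an asynchronous receive; the call returns immediately and
         * the kernel makes progress on the request in the background. */
        WSAOVERLAPPED ov = {0};
        WSABUF wbuf = { (ULONG)len, buf };
        DWORD flags = 0;
        WSARecv(sock, &wbuf, 1, NULL, &flags, &ov, NULL);

        /* ... user computation can run here while the OS progresses the I/O ... */

        /* Later (e.g., inside MPI_Test/MPI_Wait), poll the completion queue. */
        DWORD nbytes;
        ULONG_PTR key;
        LPOVERLAPPED pov;
        if (GetQueuedCompletionStatus(iocp, &nbytes, &key, &pov, 0 /* poll */)) {
            /* A completion packet was dequeued: nbytes arrived on socket 'key'. */
        }
    }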

10 Asynchronous progress (contd)
[Figure: timeline across the MPI application, MPICH2, and the operating system, showing how an MPI_Isend is delegated to the OS, how the completion packet is queued on the I/O completion object's completion queue, and how a later MPI_Test pokes the IOCP; the numbered steps (1-8) are described on the next slide]

11 Asynchronous progress (contd)
The figure illustrates a possible scenario where user computation and MPI communication overlap with asynchronous progress:
1. The MPI application invokes MPI_Isend()
2. MPICH2 initiates a TCP send()
3.1 The MPI_Isend() call returns
3.2 At the same time, the OS makes progress on the TCP send request
4.1 The MPI application performs user computation
4.2 At the same time, the OS completes the TCP send request and queues a completion packet into the I/O completion port (IOCP)
5. The MPI application invokes MPI_Test() to check the status of the MPI request
6. MPICH2 pokes the IOCP
7. MPICH2 processes completed requests
8. The MPI_Test() call returns with the updated status of the MPI send request

12 Experimental Evaluation
• National Center for Supercomputing Applications (NCSA) Intel 64 Cluster (Abe)
• Dell PowerEdge 1955
• 1200 nodes / 9600 cores
• 2.33 GHz dual quad-core Intel 64 (Clovertown) processors
• 4x32 KB L1 cache, 2x4 MB L2 cache, 8 GB DDR2 RAM
• Gigabit Ethernet, InfiniBand
• Windows and Linux nodes available
• Lustre, NTFS

13 NCSA Intel 64 Cluster Abe (contd)
• Windows nodes
  – Windows Server 2008 HPC Edition SP2
  – Visual Studio 2008 C/C++ compiler suite
• Unix nodes
  – Red Hat Enterprise Linux 4
  – Intel C/C++ compiler
• Job scheduler
  – Unix nodes: Torque
  – Windows nodes: Windows Server 2008 HPC Job Scheduler
• MPICH2 compiled with aggressive optimization and with error checking of user arguments disabled

14 Experimental Evaluation (Asynchronous Progress)
• Asynchronous progress (I/O completion ports) vs. non-blocking progress (select()) for the Nemesis netmod on Windows
• Compared overlap of MPI communication with user computation
• Nemesis uses asynchronous progress by default on Windows
• Modified NetMPI (NetPIPE benchmark); a runnable sketch of the modified loop follows below

Original ping-pong loop:
    { MPI_Irecv(); MPI_Send(); MPI_Wait(); }

Modified loop (overlapping user computation):
    {
      MPI_Irecv(); …
      MPI_Isend(); …
      do {
        MPI_Testall();
        User_computation(≈250 ns);
      } while (!complete);
    }
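
For reference, a runnable sketch of the modified loop is given below; the message size, tag, and the busy loop standing in for the ~250 ns unit of user computation are illustrative, not the exact NetMPI modification.

    /* Hedged sketch of the modified benchmark loop: post non-blocking
     * communication, then interleave short bursts of user computation with
     * MPI_Testall until both requests complete. */
    #include <mpi.h>

    void overlap_loop(int peer, char *rbuf, char *sbuf, int nbytes)
    {
        MPI_Request req[2];
        int complete = 0;
        volatile double x = 0.0;

        MPI_Irecv(rbuf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);

        do {
            MPI_Testall(2, req, &complete, MPI_STATUSES_IGNORE);
            /* stand-in for the ~250 ns unit of user computation */
            for (int i = 0; i < 50; i++)
                x += i * 0.5;
        } while (!complete);
    }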

15 Experimental Evaluation (Asynchronous Progress)
Time spent in the MPI library (MPI time) vs. time spent in user computation (compute time). Asynchronous progress (using IOCP, I/O completion ports) is compared with non-blocking progress (select()).

16 Experimental Evaluation (Asynchronous Progress)
• Less time is spent inside the MPI library with asynchronous progress
• More user computation is done with asynchronous progress
• With asynchronous progress, the MPI library delegates requests to the OS
• With non-blocking progress, the MPI library performs requests on behalf of the user (after checking for availability)
• Asynchronous progress performs better than non-blocking progress

17 Process Management
• Provides a mechanism to launch MPI processes and set up the runtime environment for the processes to communicate with each other
• Launching processes locally on a node: OS services such as fork()
• Launching processes remotely without 3rd-party job schedulers
  – on Unix: ssh (sshd typically runs on Unix nodes)
  – on Windows: no such mechanism available
• The MPICH2 suite includes SMPD (Simple MultiPurpose Daemon); a launch example follows below
  – server daemons are launched on each node
  – installed as a Windows service on Windows nodes when the user installs MPICH2
• SMPD also supports the Windows Server 2008 HPC Job Scheduler for launching MPI processes
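
Once the SMPD daemons are running on the nodes, a job is launched through MPICH2's mpiexec in the same way on both platforms, for example (a generic illustration; 'myapp' is a hypothetical executable):

    mpiexec -n 8 myapp

Here -n gives the number of MPI processes; SMPD starts them on the participating nodes and sets up the environment they need to communicate.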

18 Process Management (contd)
• User authentication
  – on Unix: the ssh protocol manages user credentials
  – on Windows: no such mechanism available
• SMPD manages user credentials for Windows users
• SMPD also supports user authentication using Active Directory or the Windows Server 2008 HPC Pack
• The SMPD communication protocol is node-architecture agnostic
  – allows users to run MPI jobs across Unix and Windows nodes

19 Intranode communication
• Typically done using shared memory
• Shared-memory performance degrades when CPUs don't share a data cache
• Windows allows a process to directly read from the virtual memory of another process
• Direct remote memory access performance is not good for small messages
• Nemesis on Windows uses an OS-specific plug-in module for large message transfers (Read Remote Virtual Memory module, RRVM)
• Nemesis on Windows uses lock-free shared-memory queues for small messages (low latency) and direct remote memory access for large messages (high bandwidth)

20 Default LMT vs. RRVM LMT module
[Figure: message flow between Rank 0 and Rank 1 for the shared-memory LMT (RTS, then CTS + cookie once the Recv is posted, then data) and for the RRVM LMT (RTS + cookie, then the receiver reads the data directly and replies DONE)]

21 RRVM LMT module (contd)
• Rank 0 sends data larger than the LMT threshold to Rank 1
• LMT shared-memory protocol
  – Rank 0 sends an RTS (Request To Send)
  – a matching Recv() is posted on Rank 1
  – Rank 1 allocates a buffer and sends a CTS (Clear To Send) with a cookie containing information about the buffer
  – Rank 0 transfers data to the buffer
  – Rank 1 transfers data from the buffer
• LMT RRVM protocol (see the sketch below)
  – Rank 0 sends an RTS with a cookie containing information about the user buffer
  – a matching Recv() is posted on Rank 1
  – Rank 1 reads data directly from the user buffer
  – Rank 1 sends a DONE indicating the end of the transfer
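
A hypothetical sketch of the receiver side of the RRVM protocol is shown below: the RTS cookie describes the sender's user buffer, and the receiver pulls the data with the Windows ReadProcessMemory() call (which requires a handle to the sender's process with read access). The structure and function names are illustrative, not the actual Nemesis RRVM module.

    /* Hypothetical RRVM-style handshake, receiver side. */
    #include <windows.h>

    typedef struct {
        void  *src_addr;   /* sender's user buffer (in sender's address space) */
        SIZE_T len;        /* number of bytes to transfer */
    } rrvm_cookie_t;

    /* Called once the matching Recv is posted and the RTS (with its cookie)
     * has arrived. */
    int rrvm_pull(HANDLE sender_process, const rrvm_cookie_t *cookie,
                  void *recv_buf)
    {
        SIZE_T nread = 0;
        if (!ReadProcessMemory(sender_process, cookie->src_addr,
                               recv_buf, cookie->len, &nread))
            return -1;                       /* transfer failed */
        /* ... send a DONE message back to the sender here ... */
        return (nread == cookie->len) ? 0 : -1;
    }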

22 Experimental Evaluation (Intranode communication)
• Nemesis intranode communication performance using lock-free shared-memory queues (shm) vs. direct remote memory access on Windows
• Nemesis intranode communication performance on Windows vs. Linux
• Direct intranode remote memory access is implemented by a large-message-transfer plug-in Nemesis module (the RRVM module) on Windows
• Nemesis performs large message transfer for messages >16 KB with the RRVM module on Windows and >64 KB on Linux
• OSU micro-benchmarks used for measuring latency and bandwidth
• Two cases considered
  – communicating processes share a 4 MB CPU data cache
  – communicating processes don't share a data cache

23 RRVM vs. SHM Bandwidth on Windows
Intranode communication bandwidth using the RRVM module vs. lock-free shared-memory queues on Windows, with and without shared CPU data caches.

24 Intranode latency: Unix vs. Windows
Intranode communication latency on Windows and Unix with and without shared CPU data caches.

25 Intranode bandwidth: Unix vs. Windows
Intranode communication bandwidth on Windows and Unix with and without shared CPU data caches.

26 Experimental Evaluation (Intranode communication)
• The RRVM module outperforms shared-memory queues when processes don't share a data cache
• Shared-memory queues outperform RRVM for some message sizes when the communicating processes share a data cache
• RRVM provides good overall performance
• Jagged graph
  – at 16 KB the double-buffering communication algorithm runs out of L1 cache
  – above 64 KB Nemesis switches to the LMT protocol
  – at 2 MB the double-buffering algorithm runs out of L2 cache
  – more tuning required
• 0-byte latency of 240 ns on Unix and 275 ns on Windows
• Nemesis on Windows performs significantly better than on Unix for large message sizes when processes don't share data caches

27 Internode communication
• MPI communication between processes on different nodes
• Nemesis network modules (netmods) perform internode communication
• Network modules are easily pluggable into the Nemesis framework
  – separate network modules are available to support TCP/IP sockets, IB, etc.
• A Windows-specific TCP/IP netmod is available for Nemesis
  – uses an asynchronous progress engine for managing TCP/IP socket operations
  – performance is much better than the older netmod using non-blocking progress
  – tuned for good performance on Windows

28 Internode communication performance
• Nemesis internode communication on Unix vs. Windows using TCP
• Windows has an OS-specific network module for Nemesis
• OSU micro-benchmarks used for measuring latency and bandwidth

29 Internode bandwidth
Internode MPI communication bandwidth for Nemesis.

30 Thread Support
• The user requests a particular level of thread support; the library notifies the user of the level it provides (see the sketch below)
• MPICH2 supports MPI_THREAD_MULTIPLE
• Since multiple threads can interact with the MPI library at the same time, thread-locking mechanisms are required
  – on Unix: a POSIX threads library
  – on Windows: an OS-specific implementation
• The choice of intraprocess locks vs. interprocess locks (costly) affects performance
• Shared-memory communication typically involves interprocess locks, which are costly
• MPICH2 (Nemesis) uses lock-free shared-memory operations and hence only requires intraprocess locks
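
The request/notify handshake for thread support looks like the following minimal sketch (standard MPI, not MPICH2-specific):

    /* Request full thread support; the library reports what it provides. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            printf("MPI_THREAD_MULTIPLE not available (got level %d)\n", provided);
        /* ... multiple user threads may call MPI concurrently only if
         *     provided == MPI_THREAD_MULTIPLE ... */
        MPI_Finalize();
        return 0;
    }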

31 Cost of supporting threads
• Cost of supporting MPI_THREAD_MULTIPLE vs. MPI_THREAD_SINGLE for MPICH2 on Windows vs. Unix
• OSU micro-benchmark for measuring latency
  – modified to use MPI_Init_thread() instead of MPI_Init()
  – in both cases only one user thread
  – latency overhead = percentage increase in latency over MPI_THREAD_SINGLE when MPI_THREAD_MULTIPLE is enabled
• Supporting thread safety
  – on Unix: Pthread process-private mutex locks
  – on Windows: Critical Sections
• Interprocess locks, typically used for shared-memory communication, are costly (see the sketch below)
  – on Unix: pthread_mutexattr_setpshared(PTHREAD_PROCESS_SHARED / PTHREAD_PROCESS_PRIVATE)
  – on Windows: mutexes vs. Critical Sections
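
The sketch below illustrates the two lock flavours being compared: process-private locks (Critical Sections on Windows, default pthread mutexes on Unix) versus interprocess locks (Windows mutex objects, process-shared pthread mutexes). It is illustrative only; for a process-shared pthread mutex to actually be used across processes, the mutex itself must be placed in shared memory.

    /* Intraprocess vs. interprocess locks; error handling omitted. */
    #ifdef _WIN32
    #include <windows.h>

    CRITICAL_SECTION cs;          /* intraprocess: cheap, user-mode fast path */
    HANDLE ipc_mutex;             /* interprocess: kernel object, costlier */

    void init_locks(void)
    {
        InitializeCriticalSection(&cs);
        ipc_mutex = CreateMutex(NULL, FALSE, NULL);
    }
    #else
    #include <pthread.h>

    pthread_mutex_t private_mtx = PTHREAD_MUTEX_INITIALIZER;  /* intraprocess */
    pthread_mutex_t shared_mtx;   /* interprocess; must live in shared memory */

    void init_locks(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&shared_mtx, &attr);
    }
    #endif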

32 Cost of supporting threads (contd)
Latency overhead (%) = ((Latency_MPI_THREAD_MULTIPLE − Latency_MPI_THREAD_SINGLE) / Latency_MPI_THREAD_SINGLE) × 100

33 Cost of supporting threads (contd)
• The overhead for supporting threads is lower on Windows than on Unix
• Older versions used interprocess locks on Windows (SHM channel)
• Nemesis uses intraprocess locks on both Unix and Windows to guarantee thread safety
• Nemesis uses lock-free shared-memory queues
  – i.e., no interprocess locks are required for MPI communication
  – intraprocess locks are sufficient for guaranteeing thread safety

34 Related Work
• Microsoft MPI (MSMPI)
• Intel MPI
• Open MPI
• Deino MPI
• MPI.NET
  – provides C# bindings for the Microsoft .NET environment

35 Conclusions
• Good intranode communication performance (0-byte latency of 240-275 ns)
• Asynchronous progress improves performance on Windows
• An internode communication performance gap remains; needs tuning
• The cost of supporting user threads is lower on Windows than with the Pthread library on Linux
• A modular architecture with clearly defined interfaces between modules makes life easy
• Implementing a high-performance and portable MPI library requires careful design and leveraging OS-specific features

36 Future work
• Topology-aware intranode communication
  – MPICH2 can use node topology information to select between lock-free shared-memory queues and direct remote memory access
• Fix the internode communication performance gap
  – more tuning is required
  – allow multiple user threads to work concurrently with the asynchronous progress engine
• MPI collectives performance
• MPI application scalability and performance
• Support for the NetworkDirect API
  – a Nemesis network module for NetworkDirect
  – the NetworkDirect service provider interface exposes advanced capabilities of modern high-speed networks (e.g., InfiniBand)