Non-Data-Communication Overheads in MPI: Analysis on Blue Gene/P
P. Balaji, A. Chan, W. Gropp, R. Thakur, E. Lusk
Argonne National Laboratory, University of Chicago, University of Illinois at Urbana-Champaign
Presented by Pavan Balaji at EuroPVM/MPI (09/08/2008)

Ultra-scale High-end Computing
Processor speeds are no longer doubling every 18-24 months
– High-end computing (HEC) systems are instead growing in parallelism
Energy usage and heat dissipation are now major issues
– Energy usage is proportional to V^2 F
– Many slow cores use less energy than one fast core
Consequence:
– HEC systems rely less on the performance of a single core
– Instead, they extract parallelism out of a massive number of low-frequency/low-power cores
– E.g., IBM Blue Gene/L, IBM Blue Gene/P, SiCortex

IBM Blue Gene/P System
Second generation of the Blue Gene supercomputers
Extremely energy-efficient design using low-power chips
– Four 850 MHz PowerPC 450 cores on each compute node
Nodes are connected by five specialized networks
– Two of them (10G and 1G Ethernet) are used for file I/O and system management
– The remaining three (3D torus, global collective network, global interrupt network) are used for MPI communication
– Point-to-point communication goes through the torus network
– Each node has six bidirectional torus links of 425 MBps each (6 links x 425 MBps x 2 directions = 5.1 GBps aggregate)

Blue Gene/P Software Stack
Three software stack layers:
– System Programming Interface (SPI)
  Sits directly above the hardware
  Most efficient, but very difficult to program and not portable!
– Deep Computing Messaging Framework (DCMF)
  Portability layer built on top of SPI
  Generalized message-passing framework
  Allows different stacks to be built on top of it
– MPI
  Built on top of DCMF
  Most portable of the three layers
  Based on MPICH2 (integrated into MPICH2 as of 1.1a1)

Issues with Scaling MPI on the BG/P
Large-scale systems such as BG/P provide the capacity needed for achieving a petaflop or higher performance
This system capacity has to be translated into capability for end users, which depends on MPI's ability to scale to a large number of cores
– Pre- and post-data-communication processing in MPI: even simple computations can be expensive on the modestly fast 850 MHz CPUs
– Algorithmic issues: consider an O(N) algorithm with a small proportionality constant; "acceptable" on 100 processors, brutal on 100,000 processors

MPI Internal Processing Overheads
[Figure: two applications exchanging messages through MPI, with pre- and post-data-communication processing overheads inside the MPI layer on each side]

Presentation Outline
Introduction
Issues with Scaling MPI on Blue Gene/P
Experimental Evaluation
– MPI Stack Computation Overhead
– Algorithmic Inefficiencies
Concluding Remarks

Basic MPI Stack Overhead
[Figure: two benchmark configurations, an application running over MPI layered on DCMF versus the same application running directly over DCMF]

Basic MPI Stack Overhead (Results)

Request Allocation and Queuing
Blocking vs. non-blocking point-to-point communication
– Blocking: MPI_Send() and MPI_Recv()
– Non-blocking: MPI_Isend(), MPI_Irecv() and MPI_Waitall()
Non-blocking communication potentially allows for better overlap of computation with communication, but...
– ...it requires allocation, initialization and queuing/de-queuing of MPI_Request handles
What are we measuring? (a sketch of both tests follows this slide)
– Latency test using MPI_Send() and MPI_Recv()
– Latency test using MPI_Irecv(), MPI_Isend() and MPI_Waitall()
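For concreteness, the following is a minimal sketch of the two latency tests described above; the message size, iteration count, and timing details are assumptions for illustration, not the authors' exact benchmark. Run it with two MPI processes, e.g. mpiexec -n 2 ./latency.

/* Sketch: blocking vs. non-blocking zero-byte message exchange.
 * Test 2 adds allocation, initialization and queuing/de-queuing of
 * two MPI_Request handles per iteration. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, size, peer;
    char sbuf[1] = {0}, rbuf[1];
    double t_block = 0.0, t_nonblock = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    peer = (rank == 0) ? 1 : 0;

    if (size >= 2 && rank < 2) {
        /* Test 1: blocking ping-pong with MPI_Send()/MPI_Recv() */
        t_block = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(sbuf, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(rbuf, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(rbuf, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(sbuf, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
        }
        t_block = (MPI_Wtime() - t_block) / ITERS;

        /* Test 2: each iteration exchanges a pair of 0-byte messages
         * using MPI_Irecv()/MPI_Isend() and MPI_Waitall() */
        t_nonblock = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            MPI_Request req[2];
            MPI_Irecv(rbuf, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sbuf, 0, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
        t_nonblock = (MPI_Wtime() - t_nonblock) / ITERS;
    }

    if (rank == 0)
        printf("per-iteration time: blocking %.3f us, non-blocking %.3f us\n",
               t_block * 1e6, t_nonblock * 1e6);

    MPI_Finalize();
    return 0;
}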

Request Allocation and Queuing Overhead

Derived Datatype Processing
[Figure: non-contiguous application data being packed into and unpacked from MPI's internal communication buffers]
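As a minimal sketch of what such a test looks like (the vector layout, element count and stride here are assumptions, not the slide's exact datatypes), the following sends strided integers with a derived datatype, which forces the MPI implementation to pack and unpack the elements internally.

/* Sketch: send every other integer of an array using MPI_Type_vector. */
#include <mpi.h>
#include <stdio.h>

#define COUNT  1024             /* number of non-contiguous elements */
#define STRIDE 2                /* pick every other element */

int main(int argc, char **argv)
{
    int rank;
    int data[COUNT * STRIDE];
    MPI_Datatype vec;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Vector type: COUNT blocks of one int, separated by STRIDE ints */
    MPI_Type_vector(COUNT, 1, STRIDE, MPI_INT, &vec);
    MPI_Type_commit(&vec);

    if (rank == 0) {
        for (int i = 0; i < COUNT * STRIDE; i++) data[i] = i;
        MPI_Send(data, 1, vec, 1, 0, MPI_COMM_WORLD);  /* packed by MPI */
    } else if (rank == 1) {
        MPI_Recv(data, 1, vec, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received strided data; data[2] = %d\n", data[2]);
    }

    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}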

Overheads in Derived Datatype Processing

Copies with Unaligned Buffers
For 4-byte integer copies:
– Buffer alignments of 0-4 mean that the entire integer lies within a single double word, so accessing an integer requires fetching only one double word
– Buffer alignments of 5-7 mean that the integer spans a double-word boundary, so accessing an integer requires fetching two double words
[Figure: a 4-byte integer contained within versus straddling an 8-byte double word]
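The sketch below illustrates the kind of test this implies: the payload is made to start at byte offsets 0 through 7 inside a double-word-aligned allocation, so MPI's internal copies run from deliberately misaligned buffers. The buffer size, iteration count and use of MPI_CHAR are assumptions for illustration, not the slide's exact benchmark.

/* Sketch: ping-pong a large buffer whose starting address is shifted by
 * 0..7 bytes from an 8-byte boundary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define NBYTES (1024 * 1024)
#define ITERS  100

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Over-allocate and round up to an 8-byte (double-word) boundary so
     * that "offset" below is the exact alignment within a double word. */
    char *pool = malloc(NBYTES + 16);
    char *base = (char *)(((uintptr_t)pool + 7) & ~(uintptr_t)7);
    memset(base, 0, NBYTES + 8);

    if (rank < 2) {
        for (int offset = 0; offset < 8; offset++) {
            char *buf = base + offset;          /* alignment 0..7 */
            double t = MPI_Wtime();
            for (int i = 0; i < ITERS; i++) {
                if (rank == 0) {
                    MPI_Send(buf, NBYTES, MPI_CHAR, 1, offset, MPI_COMM_WORLD);
                    MPI_Recv(buf, NBYTES, MPI_CHAR, 1, offset, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else {
                    MPI_Recv(buf, NBYTES, MPI_CHAR, 0, offset, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, NBYTES, MPI_CHAR, 0, offset, MPI_COMM_WORLD);
                }
            }
            t = MPI_Wtime() - t;
            if (rank == 0)
                printf("offset %d: %.3f ms per round trip\n",
                       offset, t * 1e3 / ITERS);
        }
    }

    free(pool);
    MPI_Finalize();
    return 0;
}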

Buffer Alignment Overhead

Thread Communication
Multiple threads calling MPI concurrently can corrupt the MPI stack's internal state
MPI therefore uses locks to serialize access to the stack
– The current locks are coarse-grained: they protect the entire MPI call
– This implies that the locks serialize communication for all threads
[Figure: four single-threaded MPI processes versus four threads within one MPI process, all communicating through MPI]
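A minimal sketch of the "four threads in one MPI process" case follows; the thread count, message size and tag scheme are assumptions, not the slide's exact benchmark. Each thread runs its own ping-pong, so all of them contend for the MPI stack's coarse-grained lock.

/* Sketch: multi-threaded ping-pong with MPI_THREAD_MULTIPLE. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000

static int rank;

static void *worker(void *arg)
{
    int tid = *(int *)arg;
    char buf[64];
    int peer = (rank == 0) ? 1 : 0;

    /* Thread i on rank 0 talks to thread i on rank 1, matched by tag. */
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, peer, tid, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, peer, tid, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, peer, tid, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, peer, tid, MPI_COMM_WORLD);
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, ids[NTHREADS];
    pthread_t th[NTHREADS];

    /* MPI_THREAD_MULTIPLE lets every thread call MPI concurrently. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_MULTIPLE) {
        if (rank == 0) printf("MPI_THREAD_MULTIPLE not available\n");
        MPI_Finalize();
        return 1;
    }

    if (rank < 2) {
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&th[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(th[i], NULL);
    }

    if (rank == 0) printf("done: %d threads per process\n", NTHREADS);
    MPI_Finalize();
    return 0;
}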

Overhead of Thread Communication

Presentation Outline
Introduction
Issues with Scaling MPI on Blue Gene/P
Experimental Evaluation
– MPI Stack Computation Overhead
– Algorithmic Inefficiencies
Concluding Remarks

Tag and Source Matching
Search time in most implementations is linear with respect to the number of requests posted
[Figure: a posted-receive queue searched linearly for a matching (source, tag) pair, e.g. entries with (source=1, tag=1), (source=1, tag=2), (source=2, tag=1), (source=0, tag=0)]
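The sketch below stresses this linear search (the request count and tag ordering are assumptions, not the slide's exact benchmark): rank 1 pre-posts many receives with distinct tags, and rank 0 then sends in reverse tag order, so each incoming message is matched only after a long walk of the posted-receive queue.

/* Sketch: worst-case posted-receive matching. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NREQS 1000

int main(int argc, char **argv)
{
    int rank;
    char ready = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        char (*bufs)[8] = malloc((size_t)NREQS * sizeof *bufs);
        MPI_Request *reqs = malloc((size_t)NREQS * sizeof *reqs);

        for (int i = 0; i < NREQS; i++)        /* tags 0..NREQS-1, in order */
            MPI_Irecv(bufs[i], 8, MPI_CHAR, 0, i, MPI_COMM_WORLD, &reqs[i]);

        /* Tell rank 0 that every receive is posted. */
        MPI_Send(&ready, 0, MPI_CHAR, 0, NREQS, MPI_COMM_WORLD);

        double t = MPI_Wtime();
        MPI_Waitall(NREQS, reqs, MPI_STATUSES_IGNORE);
        t = MPI_Wtime() - t;
        printf("matched %d messages in %.3f ms\n", NREQS, t * 1e3);

        free(bufs);
        free(reqs);
    } else if (rank == 0) {
        char buf[8] = {0};
        MPI_Recv(&ready, 0, MPI_CHAR, 1, NREQS, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int i = NREQS - 1; i >= 0; i--)   /* reverse order: worst case */
            MPI_Send(buf, 8, MPI_CHAR, 1, i, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}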

Overheads in Tag and Source Matching

Unexpected Message Overhead
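Messages that arrive before a matching receive is posted are held in an unexpected-message queue, and every later receive has to search that queue. The following is a rough sketch of the kind of test that stresses this overhead (assumed setup, not necessarily the slide's exact benchmark): rank 0 sends many small messages before rank 1 posts any receive, and rank 1 then drains them in reverse tag order.

/* Sketch: build up and then search the unexpected-message queue. */
#include <mpi.h>
#include <stdio.h>

#define NMSGS 1000

int main(int argc, char **argv)
{
    int rank;
    char buf[8] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < NMSGS; i++)        /* sent before any recv is posted */
            MPI_Send(buf, 8, MPI_CHAR, 1, i, MPI_COMM_WORLD);
        MPI_Send(buf, 8, MPI_CHAR, 1, NMSGS, MPI_COMM_WORLD);  /* "done" marker */
    } else if (rank == 1) {
        /* Receive the "done" marker first; this sketch assumes small messages
         * are delivered eagerly and in order, so the earlier NMSGS messages
         * are already sitting in the unexpected queue by now. */
        MPI_Recv(buf, 8, MPI_CHAR, 0, NMSGS, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double t = MPI_Wtime();
        for (int i = NMSGS - 1; i >= 0; i--)   /* reverse order: long searches */
            MPI_Recv(buf, 8, MPI_CHAR, 0, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        t = MPI_Wtime() - t;
        printf("drained %d unexpected messages in %.3f ms\n", NMSGS, t * 1e3);
    }

    MPI_Finalize();
    return 0;
}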

Multi-Request Operations
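Operations such as MPI_Waitany must inspect every outstanding request they are handed. The following rough sketch illustrates that cost under assumed parameters (request count, message size, and the choice of MPI_Waitany are illustrative, not necessarily what the slide measured): rank 1 posts many receives and times a single MPI_Waitany whose completed request is the last one posted.

/* Sketch: MPI_Waitany over a large array of outstanding requests. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NREQS 1000

int main(int argc, char **argv)
{
    int rank;
    char dummy = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        char (*bufs)[8] = malloc((size_t)NREQS * sizeof *bufs);
        MPI_Request reqs[NREQS];
        int idx;

        /* Post NREQS receives, then tell rank 0 we are ready. */
        for (int i = 0; i < NREQS; i++)
            MPI_Irecv(bufs[i], 8, MPI_CHAR, 0, i, MPI_COMM_WORLD, &reqs[i]);
        MPI_Send(&dummy, 0, MPI_CHAR, 0, NREQS, MPI_COMM_WORLD);

        /* Rank 0 satisfies the last-posted request first, so MPI_Waitany
         * has to examine (nearly) all NREQS outstanding requests. */
        double t = MPI_Wtime();
        MPI_Waitany(NREQS, reqs, &idx, MPI_STATUS_IGNORE);
        t = MPI_Wtime() - t;
        printf("MPI_Waitany over %d requests returned index %d in %.3f us\n",
               NREQS, idx, t * 1e6);

        /* Drain the remaining requests so we can finalize cleanly. */
        MPI_Waitall(NREQS, reqs, MPI_STATUSES_IGNORE);
        free(bufs);
    } else if (rank == 0) {
        char buf[8] = {0};
        MPI_Recv(&dummy, 0, MPI_CHAR, 1, NREQS, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, 8, MPI_CHAR, 1, NREQS - 1, MPI_COMM_WORLD); /* last tag first */
        for (int i = 0; i < NREQS - 1; i++)
            MPI_Send(buf, 8, MPI_CHAR, 1, i, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}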

Presentation Outline
Introduction
Issues with Scaling MPI on Blue Gene/P
Experimental Evaluation
– MPI Stack Computation Overhead
– Algorithmic Inefficiencies
Concluding Remarks

Concluding Remarks
Systems such as BG/P provide the capacity needed for achieving a petaflop or higher performance
This system capacity has to be translated into end-user capability
– This depends on MPI's ability to scale to a large number of cores
We studied the non-data-communication overheads in MPI on BG/P
– Identified several potential bottlenecks within MPI
– Stressed these bottlenecks with benchmarks
– Analyzed the reasons behind the observed overheads

Thank You!
Contacts:
– Pavan Balaji:
– Anthony Chan:
– William Gropp:
– Rajeev Thakur:
– Rusty Lusk:
Project website: