Profile Guided MPI Protocol Selection for Point-to-Point Communication Calls. Aniruddha Marathe, David K. Lowenthal, Department of Computer Science, The University of Arizona.

Presentation transcript:

Profile Guided MPI Protocol Selection for Point-to-Point Communication Calls
Aniruddha Marathe, David K. Lowenthal, Department of Computer Science, The University of Arizona, Tucson, AZ
Zheng Gu, Matthew Small, Xin Yuan, Department of Computer Science, Florida State University, Tallahassee, FL

Motivation: the need for an on-line protocol selection scheme
- The optimal protocol for a communication routine is application and architecture specific.
- Existing approaches:
  - Off-line: protocol selection at program compilation time
  - Static: one protocol per application
- Both are difficult to adapt to a program's runtime characteristics.

Contributions
- On-line protocol selection algorithm
- Protocol cost model: employed by the on-line protocol selection algorithm to estimate the total execution time per protocol
- Sender-initiated Post-copy protocol: a novel protocol that complements the existing set of protocols

On-line Protocol Selection Algorithm
- Selects the optimal communication protocol for a communication phase dynamically.
- The protocol selection algorithm is split into two phases:
  - Phase 1: execution time estimation per protocol
  - Phase 2 (optimization): buffer usage profiling
- The system works with four protocols.

On-line Protocol Selection Algorithm: Phase 1 (Estimating Execution Times)
[Diagram: n ranks (Rank 1 through Rank n) execute phase 1 of a sample application, with m MPI calls per task. Between the start and the end of the phase, each rank steps through MPI Call 1, MPI Call 2, ..., MPI Call m, accumulating an estimated time t_protocol for each candidate protocol.]
- At the end of the phase, the protocol selection step picks the optimal protocol = min(t) over the accumulated per-protocol times.
- The execution time of the estimation is linear in the number of MPI calls per phase.
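The selection step amounts to a small amount of bookkeeping per MPI call. The sketch below is an assumption about the mechanics, not the paper's code: it accumulates a modeled time per candidate protocol for every call in the phase and then takes the minimum. The protocol names, phase_profile, and estimate_call_time() (a trivial stub standing in for the cost model) are illustrative.

    #include <float.h>
    #include <stddef.h>

    enum protocol { PROTO_PRECOPY, PROTO_POSTCOPY, PROTO_SEND_RNDV, PROTO_RECV_RNDV, NUM_PROTOCOLS };

    typedef struct {
        double t[NUM_PROTOCOLS];   /* estimated execution time per protocol */
    } phase_profile;

    /* Stub: in the real system this value would come from the protocol cost model. */
    static double estimate_call_time(int proto, int call_id, size_t msg_size)
    {
        (void)call_id;
        return (proto + 1) * 1e-6 * (double)msg_size;   /* placeholder */
    }

    /* Called once per MPI call in the phase: add the modeled cost of this call
     * under every candidate protocol. */
    void profile_mpi_call(phase_profile *p, int call_id, size_t msg_size)
    {
        for (int proto = 0; proto < NUM_PROTOCOLS; proto++)
            p->t[proto] += estimate_call_time(proto, call_id, msg_size);
    }

    /* Called at the end of the phase: optimal protocol = min(t). */
    enum protocol select_protocol(const phase_profile *p)
    {
        enum protocol best = PROTO_PRECOPY;
        double best_t = DBL_MAX;
        for (int proto = 0; proto < NUM_PROTOCOLS; proto++) {
            if (p->t[proto] < best_t) {
                best_t = p->t[proto];
                best = (enum protocol)proto;
            }
        }
        return best;
    }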

Point-to-Point Protocols
- Our system uses the following protocols:
  - Existing protocols (Yuan et al. 2009): Pre-copy, Sender-initiated Rendezvous, Receiver-initiated Rendezvous
  - New protocol: Post-copy
- Protocols are categorized based on message size and the arrival patterns of the communicating tasks.

Pre-copy Protocol
[Timeline diagram: sender and receiver over time; boxes mark MPI calls and data operations.]
1. The sender calls MPI_Send, copies the message into a local buffer, and issues an RDMA Write of the request to the receiver; the sender then moves on (here into MPI_Barrier).
2. When the receiver calls MPI_Recv, it performs an RDMA Read of the data from the sender's buffer and then an RDMA Write of an ACK back to the sender.
3. The sender is idle during this interval (here, waiting in MPI_Barrier).
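A minimal C sketch of the two sides of Pre-copy, under the assumption that the request and ACK are themselves small RDMA Writes; bounce_buffer, rdma_write(), and rdma_read() are illustrative placeholders, not a real verbs API or the authors' implementation.

    #include <string.h>
    #include <stddef.h>

    static char bounce_buffer[1 << 20];   /* sender-side staging buffer   */
    static int  request_flag, ack_flag;   /* stand-ins for control messages */

    /* Placeholders standing in for RDMA primitives on the interconnect. */
    static void rdma_write(int peer, const void *buf, size_t len) { (void)peer; (void)buf; (void)len; }
    static void rdma_read (int peer, void *buf, size_t len)       { (void)peer; (void)buf; (void)len; }

    /* Sender side: copy the message into the local buffer, RDMA Write the
     * request, and return; the sender idles later until the ACK arrives. */
    void precopy_send(int receiver, const void *user_buf, size_t len)
    {
        memcpy(bounce_buffer, user_buf, len);                      /* local buffer copy   */
        rdma_write(receiver, &request_flag, sizeof request_flag);  /* RDMA Write: request */
    }

    /* Receiver side: on MPI_Recv, pull the data from the sender's buffer, then ACK. */
    void precopy_recv(int sender, void *user_buf, size_t len)
    {
        rdma_read(sender, user_buf, len);                          /* RDMA Read: data     */
        rdma_write(sender, &ack_flag, sizeof ack_flag);            /* RDMA Write: ACK     */
    }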

Post-copy Protocol
[Timeline diagram: sender and receiver over time; boxes mark MPI calls and data operations.]
1. The sender calls MPI_Send and immediately issues a single RDMA Write carrying both the request and the data to the receiver; the sender then moves on (here into MPI_Barrier).
2. When the receiver calls MPI_Recv, it copies the message out of its local buffer and RDMA Writes an ACK back to the sender.
3. The sender spends significantly less idle time than under Pre-copy.
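A matching sketch for Post-copy in the same placeholder style; eager_buffer and rdma_write() are again illustrative, not the authors' code. The contrast with Pre-copy is that the sender performs no local copy, and the copy happens on the receiver side instead.

    #include <string.h>
    #include <stddef.h>

    static char eager_buffer[1 << 20];    /* receiver-side staging buffer */
    static int  ack_flag;

    /* Same placeholder RDMA primitive as in the Pre-copy sketch. */
    static void rdma_write(int peer, const void *buf, size_t len) { (void)peer; (void)buf; (void)len; }

    /* Sender side: one RDMA Write carries the request and the data, with no
     * local copy; the sender idles only until the receiver's ACK. */
    void postcopy_send(int receiver, const void *user_buf, size_t len)
    {
        rdma_write(receiver, user_buf, len);             /* RDMA Write: request + data  */
    }

    /* Receiver side: on MPI_Recv, copy the message out of the staging buffer
     * and ACK the sender. */
    void postcopy_recv(int sender, void *user_buf, size_t len)
    {
        memcpy(user_buf, eager_buffer, len);             /* local buffer copy (receiver) */
        rdma_write(sender, &ack_flag, sizeof ack_flag);  /* RDMA Write: ACK              */
    }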

Protocol Cost Model
- Supports five basic MPI operations: MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Wait
- Important terms:
  - t_memreg: buffer registration time
  - t_memcopy: buffer memory copy time
  - t_rdma_read: buffer RDMA Read time
  - t_rdma_write: buffer RDMA Write time
  - t_func_delay: constant book-keeping time

Post-copy Protocol Cost Model
[Timeline diagrams: sender (MPI_Isend, MPI_Wait) and receiver (MPI_Irecv, MPI_Wait), annotated with t_memreg, t_rdma_write, t_memcopy, t_func_delay, and t_wait_delay.]
- Sender early:
  - Sender total time = t_memreg + t_rdma_write + 2 x t_func_delay
  - Receiver total time = t_memcopy + 2 x t_func_delay
- Receiver early:
  - Sender total time = t_memreg + t_rdma_write + 2 x t_func_delay
  - Receiver total time = t_wait_delay + t_memcopy + 2 x t_func_delay
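To make the arithmetic concrete, here is a small C sketch that encodes exactly the totals above; the cost_terms struct and function names are illustrative, not the paper's interface, but the formulas are taken directly from the slides.

    /* Post-copy cost model: the timing terms are measured constants supplied
     * by the runtime; these functions just encode the formulas. */
    typedef struct {
        double t_memreg;      /* buffer registration time       */
        double t_memcopy;     /* buffer memory copy time        */
        double t_rdma_read;   /* buffer RDMA Read time          */
        double t_rdma_write;  /* buffer RDMA Write time         */
        double t_func_delay;  /* constant book-keeping time     */
        double t_wait_delay;  /* time spent waiting in MPI_Wait */
    } cost_terms;

    /* Sender cost is the same whether the sender or the receiver arrives first. */
    double postcopy_sender_cost(const cost_terms *c)
    {
        return c->t_memreg + c->t_rdma_write + 2.0 * c->t_func_delay;
    }

    /* Receiver cost: add t_wait_delay only when the receiver arrives early and
     * must wait for the sender's RDMA Write. */
    double postcopy_receiver_cost(const cost_terms *c, int receiver_early)
    {
        double t = c->t_memcopy + 2.0 * c->t_func_delay;
        if (receiver_early)
            t += c->t_wait_delay;
        return t;
    }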

Optimization: Buffer Usage Profiling
- Example code snippet:

    ...
    MPI_Send(buff1,...);
    MPI_Recv(buff2,...);
    MPI_Send(buff3,...);
    MPI_Recv(buff1,...);
    ...

Optimization: Buffer Usage Profiling (Phase 2, buffer usage profiling)
[Diagram: n ranks (Rank 1 through Rank n); after the start of the phase, each rank records MPI_Send(Buff 1), MPI_Recv(Buff 2), MPI_Send(Buff 3), MPI_Recv(Buff 1).]
[Table: the resulting buffer usage profile maps each buffer (Buff 1, Buff 2, Buff 3) to the MPI calls that use it.]
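The profile itself can be as simple as a map from buffer address to the MPI calls that touch it; the gap between a buffer's send and its next use is where an MPI_Isend/MPI_Wait pair can be placed. The sketch below is assumed bookkeeping for illustration, not the authors' data structure.

    /* Record, for each user buffer seen in a phase, the indices of the MPI
     * calls that touch it (illustrative, fixed-size bookkeeping). */
    #define MAX_BUFFERS 64
    #define MAX_USES    32

    typedef struct {
        const void *buf;        /* user buffer address          */
        int uses[MAX_USES];     /* MPI call indices touching it */
        int n_uses;
    } buffer_record;

    static buffer_record table[MAX_BUFFERS];
    static int n_buffers;

    void record_buffer_use(const void *buf, int call_index)
    {
        for (int i = 0; i < n_buffers; i++) {
            if (table[i].buf == buf) {
                if (table[i].n_uses < MAX_USES)
                    table[i].uses[table[i].n_uses++] = call_index;
                return;
            }
        }
        if (n_buffers < MAX_BUFFERS) {
            table[n_buffers].buf = buf;
            table[n_buffers].uses[0] = call_index;
            table[n_buffers].n_uses = 1;
            n_buffers++;
        }
    }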

Optimization: Buffer Usage Profiling
- Conversion of synchronous calls to asynchronous calls. Original code:

    ...
    MPI_Send(buff1,...);
    MPI_Recv(buff2,...);
    MPI_Send(buff3,...);
    MPI_Recv(buff1,...);
    ...

Guided by the buffer usage profile, the blocking send of buff1 becomes asynchronous, and its completion is deferred until just before buff1 is reused:

    ...
    MPI_Isend(buff1,..., req1);
    MPI_Recv(buff2,...);
    MPI_Send(buff3,...);
    MPI_Wait(req1,...);
    MPI_Recv(buff1,...);
    ...
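For reference, the transformed pattern as a self-contained two-rank MPI program; the message sizes, tags, and partner-rank logic are illustrative and not taken from the paper.

    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank, peer;
        double buff1[N], buff2[N], buff3[N];
        MPI_Request req1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;                      /* assumes exactly two ranks */
        for (int i = 0; i < N; i++)
            buff1[i] = buff2[i] = buff3[i] = rank;

        if (rank == 0) {
            MPI_Isend(buff1, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req1);
            MPI_Recv (buff2, N, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (buff3, N, MPI_DOUBLE, peer, 2, MPI_COMM_WORLD);
            MPI_Wait (&req1, MPI_STATUS_IGNORE);   /* buff1 is about to be reused */
            MPI_Recv (buff1, N, MPI_DOUBLE, peer, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv (buff1, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (buff2, N, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD);
            MPI_Recv (buff3, N, MPI_DOUBLE, peer, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send (buff1, N, MPI_DOUBLE, peer, 3, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }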

Performance Evaluation
- Test cluster: 16 nodes, each with 8-core, 2.33 GHz, 64-bit Intel Xeon processors and 8 GB of system memory, connected by an InfiniBand interconnect
- Software: MVAPICH2
- Benchmarks: Sparse Matrix, CG, Sweep3D, microbenchmarks

Performance Evaluation
- Single communication phase per application.

Performance Evaluation
- The system chose the optimal protocol for each phase dynamically.

[Chart: real and modeled execution times for the Sparse Matrix application.]
- Modeling accuracy: 95% to 99%
- Modeling overhead: less than 1% of total execution time

Summary
- Our system for on-line protocol selection was successfully tested on real applications and microbenchmarks.
- The protocol cost model delivers high accuracy with negligible overhead.
- The sender-initiated Post-copy protocol was successfully implemented.

Questions?

Thank You!