Profile-Guided MPI Protocol Selection for Point-to-Point Communication Calls
May 9, 2011

Aniruddha Marathe, David K. Lowenthal
Department of Computer Science, The University of Arizona, Tucson, AZ

Zheng Gu, Matthew Small, Xin Yuan
Department of Computer Science, Florida State University, Tallahassee, FL
Motivation
- Need for an on-line protocol selection scheme:
  - The optimal protocol for a communication routine is application- and architecture-specific
  - Existing approaches:
    - Off-line: protocol selection at program compilation time
    - Static: one protocol per application
  - These are difficult to adapt to a program's runtime characteristics
Contributions
- On-line protocol selection algorithm
- Protocol cost model: employed by the on-line protocol selection algorithm to estimate the total execution time per protocol
- Sender-initiated Post-copy protocol: a novel protocol that complements the existing set of protocols
On-line Protocol Selection Algorithm
- Dynamically selects the optimal communication protocol for each communication phase
- The algorithm is split into two phases:
  - Phase 1: execution time estimation per protocol
  - Phase 2 (optimization): buffer usage profiling
- The system works with four protocols
On-line Protocol Selection Algorithm
Phase 1 (Estimating Execution Times)
- Example: execution of phase 1 of a sample application with n tasks (Rank 1 through Rank n) and m MPI calls per task
- Starting at the beginning of the phase, the system records the modeled time t_protocol for MPI Call 1, MPI Call 2, ..., MPI Call m, through the end of the phase
- Protocol selection: each candidate protocol accumulates a total time t; the optimal protocol is the one with the minimum total, min(t)
- The algorithm's execution time is linear in the number of MPI calls per phase
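The selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the per-call timings below are made-up placeholder values, and in the real system they come from the protocol cost model evaluated at each of the m MPI calls in the phase.

```python
# Sketch of Phase 1: estimate the total execution time of a communication
# phase under each candidate protocol, then pick the minimum.

PROTOCOLS = ["pre-copy", "post-copy",
             "sender-initiated rendezvous", "receiver-initiated rendezvous"]

def estimate_phase_time(per_call_times):
    """Sum the modeled time of every MPI call in the phase (linear in m)."""
    return sum(per_call_times)

def select_protocol(phase_profile):
    """phase_profile maps protocol name -> list of modeled per-call times.
    Returns (optimal protocol, totals per protocol)."""
    totals = {p: estimate_phase_time(ts) for p, ts in phase_profile.items()}
    # Optimal protocol = min(t) over all candidate protocols.
    return min(totals, key=totals.get), totals

# Hypothetical modeled per-call times (seconds) for a 3-call phase.
profile = {
    "pre-copy":                      [5e-6, 7e-6, 6e-6],
    "post-copy":                     [3e-6, 4e-6, 4e-6],
    "sender-initiated rendezvous":   [6e-6, 6e-6, 7e-6],
    "receiver-initiated rendezvous": [5e-6, 6e-6, 6e-6],
}
best, totals = select_protocol(profile)
```

With these example numbers, Post-copy has the smallest accumulated total and would be chosen for the phase.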
Point-to-Point Protocols
- Our system uses the following protocols:
  - Existing protocols (Yuan et al. 2009): Pre-copy, Sender-initiated Rendezvous, Receiver-initiated Rendezvous
  - New protocol: Post-copy
- Protocols are categorized based on:
  - Message size
  - Arrival patterns of the communicating tasks
Pre-copy Protocol
- Timeline of MPI calls and data operations between sender and receiver:
  1. Sender: MPI_Send copies the message into a local buffer (local buffer copy)
  2. Sender: RDMA Write of the request to the receiver; the sender then proceeds to its next call (e.g., MPI_Barrier)
  3. Receiver: on MPI_Recv, RDMA Read of the data from the sender's buffer
  4. Receiver: RDMA Write of the ACK back to the sender
- The sender sits idle between posting the request and receiving the ACK
Post-copy Protocol
- Timeline of MPI calls and data operations between sender and receiver:
  1. Sender: MPI_Send issues a single RDMA Write carrying the request plus the data, then proceeds to its next call (e.g., MPI_Barrier)
  2. Receiver: on MPI_Recv, copies the data out of the intermediate buffer (local buffer copy)
  3. Receiver: RDMA Write of the ACK back to the sender
- The sender spends significantly less idle time than under Pre-copy
Protocol Cost Model
- Supports five basic MPI operations: MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Wait
- Important terms:
  - t_memreg: buffer registration time
  - t_memcopy: buffer memory-copy time
  - t_rdma_read: buffer RDMA Read time
  - t_rdma_write: buffer RDMA Write time
  - t_func_delay: constant bookkeeping time
Post-copy Protocol Cost Model: Sender Early
- Sender (MPI_Isend followed by MPI_Wait): total time = t_memreg + t_rdma_write + 2 × t_func_delay
- Receiver (MPI_Irecv followed by MPI_Wait): total time = t_memcopy + 2 × t_func_delay
Post-copy Protocol Cost Model: Receiver Early
- Sender (MPI_Isend followed by MPI_Wait): total time = t_memreg + t_rdma_write + 2 × t_func_delay
- Receiver (MPI_Irecv followed by MPI_Wait): total time = t_wait_delay + t_memcopy + 2 × t_func_delay, where t_wait_delay is the time the early receiver blocks in MPI_Wait until the sender's data arrives
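The two arrival cases can be checked with a small calculator. Only the formulas come from the slides; the component times below are hypothetical example values chosen for illustration.

```python
# Sketch of the Post-copy cost model for both arrival orders.
# Component times are made-up example values (microseconds).
t_memreg     = 10.0  # buffer registration time
t_memcopy    = 4.0   # buffer memory-copy time
t_rdma_write = 6.0   # buffer RDMA Write time
t_func_delay = 1.0   # constant bookkeeping time per call
t_wait_delay = 5.0   # time the early receiver waits for the sender's data

def postcopy_sender_total():
    # Identical in both cases: register, RDMA Write request + data, proceed.
    return t_memreg + t_rdma_write + 2 * t_func_delay

def postcopy_receiver_total(receiver_early):
    base = t_memcopy + 2 * t_func_delay
    # When the receiver arrives first, it additionally blocks in MPI_Wait.
    return base + t_wait_delay if receiver_early else base

sender_time          = postcopy_sender_total()          # 10 + 6 + 2*1 = 18
recv_sender_early    = postcopy_receiver_total(False)   # 4 + 2*1 = 6
recv_receiver_early  = postcopy_receiver_total(True)    # 5 + 4 + 2*1 = 11
```

Note that the sender-side cost is the same in both cases; only the receiver pays the extra t_wait_delay when it arrives early.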
Optimization: Buffer Usage Profiling
- Example code snippet:
    MPI_Send(buff1, ...);
    MPI_Recv(buff2, ...);
    MPI_Send(buff3, ...);
    MPI_Recv(buff1, ...);
    ...
Optimization: Buffer Usage Profiling
Phase 2 (Buffer Usage Profiling)
- From the start of the phase, each rank records the buffer used by every call: MPI_Send (Buff 1), MPI_Recv (Buff 2), MPI_Send (Buff 3), MPI_Recv (Buff 1)
- The resulting profile maps each buffer to the MPI calls that touch it: Buff 1 is used by MPI Call 1 and reused later, with MPI Call 2 and MPI Call 3 (on Buff 2 and Buff 3) in between
Optimization: Buffer Usage Profiling
- Conversion of synchronous calls to asynchronous calls, guided by the buffer usage profile:
    Before:
        MPI_Send(buff1, ...);
        MPI_Recv(buff2, ...);
        MPI_Send(buff3, ...);
        MPI_Recv(buff1, ...);
        ...
    After:
        MPI_Isend(buff1, ..., req1);
        MPI_Recv(buff2, ...);
        MPI_Send(buff3, ...);
        MPI_Wait(req1, ...);
        MPI_Recv(buff1, ...);
        ...
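A minimal sketch of the profiling idea, assuming the profile is simply an ordered list of (operation, buffer) pairs: a blocking send whose buffer is not touched again until some later call can become MPI_Isend, with the matching MPI_Wait placed immediately before that later reuse. The trace format and function name are illustrative, not the paper's implementation.

```python
# Sketch: find sends that can safely become non-blocking, given a buffer
# usage profile recorded in Phase 2. A send on buffer b may be converted
# to MPI_Isend if at least one unrelated call occurs before b is reused;
# MPI_Wait(req) is then inserted just before the reusing call.

def convertible_sends(trace):
    """trace: ordered list of (op, buffer) pairs, e.g. ("MPI_Send", "buff1").
    Returns a list of (send_index, insert_wait_before_index) pairs."""
    results = []
    for i, (op, buf) in enumerate(trace):
        if op != "MPI_Send":
            continue
        for j in range(i + 1, len(trace)):
            if trace[j][1] == buf:          # first reuse of this buffer
                if j > i + 1:               # something to overlap with
                    results.append((i, j))
                break
    return results

trace = [("MPI_Send", "buff1"),   # call 0
         ("MPI_Recv", "buff2"),   # call 1
         ("MPI_Send", "buff3"),   # call 2
         ("MPI_Recv", "buff1")]   # call 3: first reuse of buff1
# The send at index 0 can become MPI_Isend, with MPI_Wait before call 3,
# matching the Before/After transformation on this slide.
```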
Performance Evaluation
- Test cluster: 16 nodes; 8-core 2.33 GHz Intel Xeon (64-bit) processors; 8 GB system memory; InfiniBand interconnect
- Software: MVAPICH2
- Benchmarks: Sparse Matrix, CG, Sweep3D, and microbenchmarks
Performance Evaluation
- Single communication phase per application
Performance Evaluation
- The system chose the optimal protocol for each phase dynamically
Performance Evaluation
- Real and modeled execution times for the Sparse Matrix application
- Modeling accuracy: 95% to 99%
- Modeling overhead: less than 1% of total execution time
Summary
- Our on-line protocol selection system was successfully tested on real benchmarks and microbenchmarks
- The protocol cost model achieves high accuracy with negligible overhead
- The sender-initiated Post-copy protocol was successfully implemented
Questions?
Thank You!