Optimizing MPI collectives for SMP clusters

1 Optimizing MPI collectives for SMP clusters
Nizhni Novgorod State University, Faculty of Computational Mathematics and Cybernetics

2 Project
Optimizing the performance of open-source MPI implementations for Linux on POWER processor clusters. The goal is to increase the efficiency of parallel applications that run on POWER clusters under Linux and are developed with open-source MPI implementations.

3 Analyzing MPI implementations
The main targets are the collective operations, because they are the most time-consuming procedures in MPI [Rolf Rabenseifner. Automatic MPI Counter Profiling. 42nd CUG Conference].

4 Performance evaluation model

5 Alltoall Ring Algorithm
[Figure: alltoall ring algorithm on several processes, showing the transfers performed at steps 1, 2, and 3.]
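For reference, here is a minimal C sketch of one common formulation of the ring alltoall, in which at step i each process sends its block to the rank i positions ahead and receives from the rank i positions behind; the function and buffer names are illustrative, not taken from the original implementation.

```c
#include <mpi.h>
#include <string.h>

/* Ring alltoall sketch: p-1 steps; at step i, send my block destined
 * for rank (r+i) mod p and receive the block from rank (r-i+p) mod p. */
void ring_alltoall(const char *sendbuf, char *recvbuf,
                   int blocksize, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* own block needs no communication */
    memcpy(recvbuf + rank * blocksize,
           sendbuf + rank * blocksize, blocksize);

    for (int i = 1; i < p; i++) {
        int dst = (rank + i) % p;       /* receiver of my block at this step */
        int src = (rank - i + p) % p;   /* sender of the block meant for me  */
        MPI_Sendrecv(sendbuf + dst * blocksize, blocksize, MPI_CHAR, dst, 0,
                     recvbuf + src * blocksize, blocksize, MPI_CHAR, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```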

6 Hockney model
The Hockney model estimates the cost of a message transfer using the following parameters: α, the latency (time to prepare data for transfer); β, the time to transfer one byte of data between two processors (so 1/β is the network bandwidth); and n, the message size in bytes:
T_transfer = α + β·n
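As a worked example, the following small C program evaluates the Hockney formula with the POWER5 parameters reported on a later slide; the helper name is illustrative.

```c
#include <stdio.h>

/* Hockney estimate: T = alpha + beta * n */
static double hockney(double alpha, double beta, double n)
{
    return alpha + beta * n;
}

int main(void)
{
    double n = 1 << 20;  /* 1 MiB message */
    /* POWER5 shared memory vs. Myrinet network (measured values, slide 11) */
    printf("shared memory: %g s\n", hockney(7e-6, 8.4e-10, n));
    printf("network:       %g s\n", hockney(4e-5, 2.6e-8,  n));
    return 0;
}
```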

7 Cost of the alltoall ring algorithm
T = (p-1)·α + ((p-1)/p)·n·β + γ_n + γ_l + γ_s
α – latency (or startup time) per message, independent of message size; β – transfer time per byte; n – the number of bytes transferred; γ_n – node contention overhead (when more than one node tries to send large messages to the same node); γ_l – link contention overhead (when more than one communication uses the same links in the network); γ_s – switch contention overhead (when the amount of data passing through the switch exceeds the switch's capacity to handle it).
This model works well for clusters with single-processor nodes.
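A direct transcription of this cost formula into C, assuming the contention overheads are supplied by the caller (how they are measured is outside this sketch):

```c
/* T = (p-1)*alpha + ((p-1)/p)*n*beta + gamma_n + gamma_l + gamma_s */
double ring_alltoall_cost(int p, double n, double alpha, double beta,
                          double gamma_node, double gamma_link,
                          double gamma_switch)
{
    return (p - 1) * alpha
         + ((double)(p - 1) / p) * n * beta
         + gamma_node + gamma_link + gamma_switch;
}
```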

8 SMP cluster
Currently only two levels of the cluster architecture are considered: data transfer inside an SMP node over shared memory, and data transfer between SMP nodes over the network.
[Diagram: two SMP nodes, each with four CPUs and local RAM, connected by a network.]

9 Challenges
Variable cost of point-to-point operations; increased influence of the placement of processes on network hosts; ineffective use of shared memory in some implementations.

10 Point-to-point operations

11 Applying the Hockney model. Shared memory vs. network…
POWER5 shared memory: α_shmem ≈ 7·10⁻⁶ s, β_shmem ≈ 8.4·10⁻¹⁰ s/byte
Myrinet network: α_network ≈ 4·10⁻⁵ s, β_network ≈ 2.6·10⁻⁸ s/byte

12 Applying the Hockney model. Shared memory vs. network
P-III Xeon shared memory: α_shmem ≈ 1.3·10⁻⁵ s, β_shmem ≈ 8.3·10⁻⁹ s/byte
Gigabit Ethernet network: α_network ≈ 5.9·10⁻⁵ s, β_network ≈ 1.9·10⁻⁸ s/byte

13 Applying the Hockney model. Simultaneous transfers over the network
Gigabit Ethernet:
              1 pair       2 pairs      3 pairs      4 pairs
α_network     5.88·10⁻⁵    7.18·10⁻⁵    8.94·10⁻⁵    10.3·10⁻⁵
β_network     1.93·10⁻⁸    3.30·10⁻⁸    4.52·10⁻⁸    5.74·10⁻⁸

14 Applying the Hockney model. Simultaneous transfers over shared memory
P-III Xeon shared memory:
              2 flows      4 flows      6 flows      8 flows
α_shmem       1.3·10⁻⁵     1.4·10⁻⁵     –            –
β_shmem       0.83·10⁻⁸    1.28·10⁻⁸    1.96·10⁻⁸    2.56·10⁻⁸

15 Collective operations and process placement

16 Bcast operation. Binomial tree algorithm
[Figure: binomial tree broadcast from the root; the 1st step performs one transfer, the 2nd step two transfers, the 3rd step the remaining transfers, so the set of informed processes doubles at each step.]
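A minimal C sketch of one standard formulation of the binomial tree broadcast with root 0 (the function name is illustrative, and real MPI implementations order the tree differently in detail):

```c
#include <mpi.h>

/* Binomial tree broadcast, root 0: at the step with mask = 2^s, the
 * ranks that already hold the data (ranks < mask) each send it to
 * rank + mask; ceil(log2 p) steps in total. */
void binomial_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int mask = 1; mask < p; mask <<= 1) {
        if (rank < mask) {
            int dst = rank + mask;          /* my child at this step */
            if (dst < p)
                MPI_Send(buf, count, type, dst, 0, comm);
        } else if (rank < 2 * mask) {
            MPI_Recv(buf, count, type, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);    /* my parent at this step */
        }
    }
}
```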

17 Bcast binomial tree algorithm. Two-level cluster architecture…
Processes within an SMP node can interact through shared memory; processes running on different nodes must use the network for data transfer.
[Diagram: three SMP nodes connected by a network.]

18 Bcast binomial tree algorithm. Two-level cluster architecture…
With the standard process numbering, data is sent over the network on every step.
[Figure: binomial tree over the processes placed across the nodes in rank order; the 1st-, 2nd-, and 3rd-step transfers all cross node boundaries.]

19 Bcast binomial tree algorithm. Two-level cluster architecture…
With a more efficient process numbering, data is transferred only over shared memory on the 3rd step.
[Figure: the same binomial tree with processes renumbered so that the 3rd-step transfers stay inside the SMP nodes.]

20 Bcast binomial tree algorithm. Two-level cluster architecture…
The optimized algorithm uses a binomial tree to deliver the message to one process on every node over the network (1st stage), and then a binomial tree inside each SMP node to deliver the message to all local processes over shared memory (2nd stage); see the sketch below.
[Figure: stage 1 – transfers over the network between node leaders; stage 2 – transfers over shared memory inside each node.]
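The two-stage scheme can be sketched with MPI-3 communicator splitting; the original 2005 work predates MPI-3, so the splitting calls below are a modern convenience assumption, not the authors' code. The sketch also assumes the root is global rank 0.

```c
#include <mpi.h>

/* Two-stage broadcast: stage 1 among node leaders over the network,
 * stage 2 inside each SMP node over shared memory. */
void smp_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Comm node;      /* processes sharing one SMP node */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);

    int noderank;
    MPI_Comm_rank(node, &noderank);

    MPI_Comm leaders;   /* one process (local rank 0) per node */
    MPI_Comm_split(comm, noderank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    if (noderank == 0)                       /* stage 1: over the network */
        MPI_Bcast(buf, count, type, 0, leaders);
    MPI_Bcast(buf, count, type, 0, node);    /* stage 2: shared memory */

    if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
}
```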

21 Bcast binomial tree algorithm. Two-level cluster architecture
Test results: 26% acceleration.

22 Bcast operation
Existing algorithms: binomial tree algorithm, scatter-gather algorithm, scatter-ring algorithm.

23 Bcast operation. Different process placements…
[Figure: two different placements (topologies) of the same processes on the network hosts.]

24 Bcast operation. Performance of the different algorithms

25 Estimating the cost of collective operations

26 Estimating the cost of a collective communication algorithm…
Assumptions: all cluster hosts are identical; network connections between cluster hosts are symmetric.

27 Estimating the cost of a collective communication algorithm…
Input data: the costs of point-to-point operations over the network as a function of the number of simultaneous transfers, and the costs of point-to-point operations over shared memory as a function of the number of simultaneous transfers.

28 Estimating the cost of a collective communication algorithm…
Calculate the number of steps, then determine for each step: which processes take part in transfers, which resources are used in each transfer, and the cost of each transfer. The cost of the algorithm is taken as the sum, over all steps, of the maximum transfer cost within each step; a sketch follows.
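A hedged C sketch of this per-step estimate, assuming the α/β tables measured on slides 13-14 are supplied as arrays indexed by resource and by the number of simultaneous transfers (all type and function names are illustrative):

```c
/* alpha[r][k] and beta[r][k]: measured point-to-point parameters for
 * resource r when k+1 transfers run simultaneously (cf. slides 13-14). */
enum { SHMEM = 0, NET = 1 };

typedef struct {
    int resource;   /* SHMEM or NET */
    double bytes;   /* message size n */
} Transfer;

/* Cost of one step = maximum cost among its simultaneous transfers. */
double step_cost(const Transfer *t, int ntrans,
                 const double alpha[2][8], const double beta[2][8])
{
    int count[2] = { 0, 0 };
    for (int i = 0; i < ntrans; i++)
        count[t[i].resource]++;       /* simultaneous transfers per resource */

    double max = 0.0;
    for (int i = 0; i < ntrans; i++) {
        int r = t[i].resource;
        int k = count[r] - 1;
        if (k > 7) k = 7;             /* tables cover up to 8 transfers */
        double c = alpha[r][k] + beta[r][k] * t[i].bytes;
        if (c > max) max = c;
    }
    return max;
}

/* Cost of the whole algorithm = sum of the per-step maxima. */
double algorithm_cost(const Transfer *const *steps, const int *ntrans,
                      int nsteps,
                      const double alpha[2][8], const double beta[2][8])
{
    double total = 0.0;
    for (int s = 0; s < nsteps; s++)
        total += step_cost(steps[s], ntrans[s], alpha, beta);
    return total;
}
```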

29 Estimating the cost of a collective communication algorithm…

30 Estimating the cost of a collective communication algorithm

31 Effective use of shared memory

32 Using shared memory. Standard algorithms
When the same data is transferred from one process to several others, the data is sent to each receiver successively, by separate operations. A separate shared "memory window" is used for each pair of communicating processes.
[Diagram: one SMP node with four CPUs and RAM.]

33 Using shared memory. Binomial tree Bcast algorithm
Operation cost (as the number of shared memory transfers):
T_Bcast = (p-1)·(α_shmem + β_shmem·n)
where p is the number of processes and n is the message size.
[Figure: step 1 and step 2 transfers inside one node.]

34 Using shared memory. Optimized algorithms
When the same data is transferred from one process to several others, the data is delivered to every receiver using only one operation. A single shared memory window, or a set of windows (one window per process), is used for the data transfer.
[Diagram: the two variants – a single shared window vs. one window per process.]

35 Using shared memory. Optimized Bcast algorithm
Operation cost (as the number of shared memory transfers):
T_Bcast = p/2·(α_shmem + β_shmem·n)
where p is the number of processes and n is the message size.
[Figure: step 1 transfers inside one node.]
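The single-window, one-writer/many-readers idea can be sketched with the MPI-3 shared-memory window API; this is a modern stand-in for the authors' own shared-memory windows, under the assumption that all processes of the given communicator live on one SMP node.

```c
#include <mpi.h>
#include <string.h>

/* Rank 0 copies the message into one shared segment; all other
 * processes on the node then read it concurrently. */
void shmem_bcast(char *buf, int n, MPI_Comm node /* one SMP node */)
{
    int rank;
    MPI_Comm_rank(node, &rank);

    char *base;
    MPI_Win win;
    /* rank 0 owns the single shared segment; others attach to it */
    MPI_Win_allocate_shared((MPI_Aint)(rank == 0 ? n : 0), 1,
                            MPI_INFO_NULL, node, &base, &win);
    if (rank != 0) {
        MPI_Aint size; int disp;
        MPI_Win_shared_query(win, 0, &size, &disp, &base);
    }

    MPI_Win_fence(0, win);
    if (rank == 0) memcpy(base, buf, n);   /* single write ...       */
    MPI_Win_fence(0, win);
    if (rank != 0) memcpy(buf, base, n);   /* ... concurrent reads   */

    MPI_Win_free(&win);
}
```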

36 Using shared memory. Comparing algorithm performance…
Theoretical estimate – 33% faster
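The 33% figure follows from the two cost formulas at p = 4, where the α and β terms cancel; a trivial check:

```c
#include <stdio.h>

/* Ratio of the two transfer counts for p = 4 processes:
 * standard (p-1) = 3 vs. optimized p/2 = 2 -> 33% fewer transfers. */
int main(void)
{
    double p = 4.0;
    double standard  = p - 1;
    double optimized = p / 2;
    printf("speedup: %.0f%%\n", 100.0 * (standard - optimized) / standard);
    return 0;
}
```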

37 Using shared memory. Comparing algorithm performance…
Test results – 31% faster

38 Summary
An effective implementation should take into account: the variable cost of point-to-point operations, the placement of processes on network hosts, and the relative costs of the existing algorithms. An effective implementation should also use the hardware resources as fully as possible.

39 Optimized bcast algorithm. Estimated performance

40 Optimized bcast algorithm. Experimental data

41 Publications
SCICOMP 11, Edinburgh, Scotland, 2005; European Power.org Community Conference, Barcelona, 2005; JSCC Power.org technical seminar, Moscow, 2005; Microsoft technologies in programming theory and practice, UNN, 2005.

42 Research group
Gergel V.P., professor; Grishagin V.A., associate professor; Belov S.A., associate professor; Linev A.V.; Gergel A.V.; Grishagin A.V.; Kurylev A.L.; Senin A.V.
This work is partly supported by the IBM Faculty Awards for Innovation program.

43 Contacts
603950, Nizhni Novgorod, Gagarina av. 23, Nizhni Novgorod State University, Faculty of Computational Mathematics and Cybernetics. Tel: +7 (8312)

44 Thank you for your attention
Questions, remarks, comments

