Optimizing MPI collectives for SMP clusters

1 Optimizing MPI collectives for SMP clusters
Nizhni Novgorod State University, Faculty of Computational Mathematics and Cybernetics

2 Project
Optimizing the performance of open-source MPI implementations for Linux on POWER processor clusters. The goal is to increase the efficiency of parallel applications that run on POWER clusters under Linux and are developed with open-source MPI implementations.

3 Analyzing MPI implementations
The main targets are the collective operations, because they are the most time-consuming procedures in MPI [Rolf Rabenseifner. Automatic MPI Counter Profiling. 42nd CUG Conference].

4 Performance evaluation model

5 Alltoall Ring Algorithm
[Figure: alltoall ring algorithm on several processes, showing the transfers performed at steps 1, 2, and 3.]
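For reference, here is a minimal C sketch of one common formulation of the ring alltoall, in which at step i each process sends its block to the rank i positions ahead and receives from the rank i positions behind; the function and buffer names are illustrative, not taken from the original implementation.

```c
#include <mpi.h>
#include <string.h>

/* Ring alltoall sketch: p-1 steps; at step i, send my block destined
 * for rank (r+i) mod p and receive the block from rank (r-i+p) mod p. */
void ring_alltoall(const char *sendbuf, char *recvbuf,
                   int blocksize, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* own block needs no communication */
    memcpy(recvbuf + rank * blocksize,
           sendbuf + rank * blocksize, blocksize);

    for (int i = 1; i < p; i++) {
        int dst = (rank + i) % p;       /* receiver of my block at this step */
        int src = (rank - i + p) % p;   /* sender of the block meant for me  */
        MPI_Sendrecv(sendbuf + dst * blocksize, blocksize, MPI_CHAR, dst, 0,
                     recvbuf + src * blocksize, blocksize, MPI_CHAR, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```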

6 Hockney model
The Hockney model estimates the cost of a message transfer using the following parameters: α, the latency (time to prepare data for transfer); β, the time to transfer one byte of data between two processors (so 1/β is the network bandwidth); and n, the message size in bytes:
T_transfer = α + β·n
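As a worked example, the following small C program evaluates the Hockney formula with the POWER5 parameters reported on a later slide; the helper name is illustrative.

```c
#include <stdio.h>

/* Hockney estimate: T = alpha + beta * n */
static double hockney(double alpha, double beta, double n)
{
    return alpha + beta * n;
}

int main(void)
{
    double n = 1 << 20;  /* 1 MiB message */
    /* POWER5 shared memory vs. Myrinet network (measured values, slide 11) */
    printf("shared memory: %g s\n", hockney(7e-6, 8.4e-10, n));
    printf("network:       %g s\n", hockney(4e-5, 2.6e-8,  n));
    return 0;
}
```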

7 Cost of the alltoall ring algorithm
T = (p-1)·α + ((p-1)/p)·n·β + γ_n + γ_l + γ_s
α – latency (or startup time) per message, independent of message size; β – transfer time per byte; n – the number of bytes transferred; γ_n – node contention overhead (when more than one node tries to send large messages to the same node); γ_l – link contention overhead (when more than one communication uses the same links in the network); γ_s – switch contention overhead (when the amount of data passing through the switch exceeds the switch's capacity to handle it).
This model works well for clusters with single-processor nodes.
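A direct transcription of this cost formula into C, assuming the contention overheads are supplied by the caller (how they are measured is outside this sketch):

```c
/* T = (p-1)*alpha + ((p-1)/p)*n*beta + gamma_n + gamma_l + gamma_s */
double ring_alltoall_cost(int p, double n, double alpha, double beta,
                          double gamma_node, double gamma_link,
                          double gamma_switch)
{
    return (p - 1) * alpha
         + ((double)(p - 1) / p) * n * beta
         + gamma_node + gamma_link + gamma_switch;
}
```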

8 SMP cluster
Currently only two levels of the cluster architecture are considered: data transfer inside an SMP node over shared memory, and data transfer between SMP nodes over the network.
[Diagram: two SMP nodes, each with four CPUs and local RAM, connected by a network.]

9 Challenges
Variable cost of point-to-point operations; increased influence of the placement of processes on network hosts; ineffective use of shared memory in some implementations.

10 Point-to-point operations

11 Applying the Hockney model. Shared memory vs. network…
POWER5 shared memory: α_shmem ≈ 7·10⁻⁶ s, β_shmem ≈ 8.4·10⁻¹⁰ s/byte
Myrinet network: α_network ≈ 4·10⁻⁵ s, β_network ≈ 2.6·10⁻⁸ s/byte

12 Applying the Hockney model. Shared memory vs. network
P-III Xeon shared memory: α_shmem ≈ 1.3·10⁻⁵ s, β_shmem ≈ 8.3·10⁻⁹ s/byte
Gigabit Ethernet network: α_network ≈ 5.9·10⁻⁵ s, β_network ≈ 1.9·10⁻⁸ s/byte

13 Applying the Hockney model. Simultaneous transfers over the network
Gigabit Ethernet:
              1 pair       2 pairs      3 pairs      4 pairs
α_network     5.88·10⁻⁵    7.18·10⁻⁵    8.94·10⁻⁵    10.3·10⁻⁵
β_network     1.93·10⁻⁸    3.30·10⁻⁸    4.52·10⁻⁸    5.74·10⁻⁸

14 Applying the Hockney model. Simultaneous transfers over shared memory
P-III Xeon shared memory:
              2 flows      4 flows      6 flows      8 flows
α_shmem       1.3·10⁻⁵     1.4·10⁻⁵     –            –
β_shmem       0.83·10⁻⁸    1.28·10⁻⁸    1.96·10⁻⁸    2.56·10⁻⁸

15 Collective operations and process placement

16 Bcast operation. Binomial tree algorithm
[Figure: binomial tree broadcast from the root; the 1st step performs one transfer, the 2nd step two transfers, the 3rd step the remaining transfers, so the set of informed processes doubles at each step.]
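A minimal C sketch of one standard formulation of the binomial tree broadcast with root 0 (the function name is illustrative, and real MPI implementations order the tree differently in detail):

```c
#include <mpi.h>

/* Binomial tree broadcast, root 0: at the step with mask = 2^s, the
 * ranks that already hold the data (ranks < mask) each send it to
 * rank + mask; ceil(log2 p) steps in total. */
void binomial_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int mask = 1; mask < p; mask <<= 1) {
        if (rank < mask) {
            int dst = rank + mask;          /* my child at this step */
            if (dst < p)
                MPI_Send(buf, count, type, dst, 0, comm);
        } else if (rank < 2 * mask) {
            MPI_Recv(buf, count, type, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);    /* my parent at this step */
        }
    }
}
```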

17 Bcast binomial tree algorithm. Two-level cluster architecture…
Processes within an SMP node can interact through shared memory; processes running on different nodes must use the network for data transfer.
[Diagram: three SMP nodes connected by a network.]

18 Bcast binomial tree algorithm. Two-level cluster architecture…
With the standard process numbering, data is sent over the network on every step.
[Figure: binomial tree over the processes placed across the nodes in rank order; the 1st-, 2nd-, and 3rd-step transfers all cross node boundaries.]

19 Bcast binomial tree algorithm. Two-level cluster architecture…
With a more efficient process numbering, data is transferred only over shared memory on the 3rd step.
[Figure: the same binomial tree with processes renumbered so that the 3rd-step transfers stay inside the SMP nodes.]

20 Bcast binomial tree algorithm. Two-level cluster architecture…
The optimized algorithm uses a binomial tree to deliver the message to one process on every node over the network (1st stage), and then a binomial tree inside each SMP node to deliver the message to all local processes over shared memory (2nd stage); see the sketch below.
[Figure: stage 1 – transfers over the network between node leaders; stage 2 – transfers over shared memory inside each node.]
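The two-stage scheme can be sketched with MPI-3 communicator splitting; the original 2005 work predates MPI-3, so the splitting calls below are a modern convenience assumption, not the authors' code. The sketch also assumes the root is global rank 0.

```c
#include <mpi.h>

/* Two-stage broadcast: stage 1 among node leaders over the network,
 * stage 2 inside each SMP node over shared memory. */
void smp_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Comm node;      /* processes sharing one SMP node */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);

    int noderank;
    MPI_Comm_rank(node, &noderank);

    MPI_Comm leaders;   /* one process (local rank 0) per node */
    MPI_Comm_split(comm, noderank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    if (noderank == 0)                       /* stage 1: over the network */
        MPI_Bcast(buf, count, type, 0, leaders);
    MPI_Bcast(buf, count, type, 0, node);    /* stage 2: shared memory */

    if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
}
```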

21 Bcast binomial tree algorithm. Two-level cluster architecture
Test results: 26% acceleration.

22 Bcast operation
Existing algorithms: binomial tree algorithm, scatter-gather algorithm, scatter-ring algorithm.

23 Bcast operation. Different process placements…
[Figure: two different placements (topologies) of the same processes on the network hosts.]

24 Bcast operation. Performance of the different algorithms

25 Estimating the cost of collective operations

26 Estimating the cost of a collective communication algorithm…
Assumptions: all cluster hosts are identical; network connections between cluster hosts are symmetric.

27 Estimating the cost of a collective communication algorithm…
Input data: the costs of point-to-point operations over the network as a function of the number of simultaneous transfers, and the costs of point-to-point operations over shared memory as a function of the number of simultaneous transfers.

28 Estimating the cost of a collective communication algorithm…
Calculate the number of steps, then determine for each step: which processes take part in transfers, which resources are used in each transfer, and the cost of each transfer. The cost of the algorithm is taken as the sum, over all steps, of the maximum transfer cost within each step; a sketch follows.
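A hedged C sketch of this per-step estimate, assuming the α/β tables measured on slides 13-14 are supplied as arrays indexed by resource and by the number of simultaneous transfers (all type and function names are illustrative):

```c
/* alpha[r][k] and beta[r][k]: measured point-to-point parameters for
 * resource r when k+1 transfers run simultaneously (cf. slides 13-14). */
enum { SHMEM = 0, NET = 1 };

typedef struct {
    int resource;   /* SHMEM or NET */
    double bytes;   /* message size n */
} Transfer;

/* Cost of one step = maximum cost among its simultaneous transfers. */
double step_cost(const Transfer *t, int ntrans,
                 const double alpha[2][8], const double beta[2][8])
{
    int count[2] = { 0, 0 };
    for (int i = 0; i < ntrans; i++)
        count[t[i].resource]++;       /* simultaneous transfers per resource */

    double max = 0.0;
    for (int i = 0; i < ntrans; i++) {
        int r = t[i].resource;
        int k = count[r] - 1;
        if (k > 7) k = 7;             /* tables cover up to 8 transfers */
        double c = alpha[r][k] + beta[r][k] * t[i].bytes;
        if (c > max) max = c;
    }
    return max;
}

/* Cost of the whole algorithm = sum of the per-step maxima. */
double algorithm_cost(const Transfer *const *steps, const int *ntrans,
                      int nsteps,
                      const double alpha[2][8], const double beta[2][8])
{
    double total = 0.0;
    for (int s = 0; s < nsteps; s++)
        total += step_cost(steps[s], ntrans[s], alpha, beta);
    return total;
}
```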

29 Estimating the cost of a collective communication algorithm…

30 Estimating the cost of a collective communication algorithm

31 Effective use of shared memory

32 Using shared memory. Standard algorithms
When the same data is transferred from one process to several others, the data is sent to each receiver successively, by separate operations. A separate shared "memory window" is used for each pair of communicating processes.
[Diagram: one SMP node with four CPUs and RAM.]

33 Using shared memory. Binomial tree Bcast algorithm
Operation cost (as the number of shared memory transfers):
T_Bcast = (p-1)·(α_shmem + β_shmem·n)
where p is the number of processes and n is the message size.
[Figure: step 1 and step 2 transfers inside one node.]

34 Using shared memory. Optimized algorithms
When the same data is transferred from one process to several others, the data is delivered to every receiver using only one operation. A single shared memory window, or a set of windows (one window per process), is used for the data transfer.
[Diagram: the two variants – a single shared window vs. one window per process.]

35 Using shared memory. Optimized Bcast algorithm
Operation cost (as the number of shared memory transfers):
T_Bcast = p/2·(α_shmem + β_shmem·n)
where p is the number of processes and n is the message size.
[Figure: step 1 transfers inside one node.]
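The single-window, one-writer/many-readers idea can be sketched with the MPI-3 shared-memory window API; this is a modern stand-in for the authors' own shared-memory windows, under the assumption that all processes of the given communicator live on one SMP node.

```c
#include <mpi.h>
#include <string.h>

/* Rank 0 copies the message into one shared segment; all other
 * processes on the node then read it concurrently. */
void shmem_bcast(char *buf, int n, MPI_Comm node /* one SMP node */)
{
    int rank;
    MPI_Comm_rank(node, &rank);

    char *base;
    MPI_Win win;
    /* rank 0 owns the single shared segment; others attach to it */
    MPI_Win_allocate_shared((MPI_Aint)(rank == 0 ? n : 0), 1,
                            MPI_INFO_NULL, node, &base, &win);
    if (rank != 0) {
        MPI_Aint size; int disp;
        MPI_Win_shared_query(win, 0, &size, &disp, &base);
    }

    MPI_Win_fence(0, win);
    if (rank == 0) memcpy(base, buf, n);   /* single write ...       */
    MPI_Win_fence(0, win);
    if (rank != 0) memcpy(buf, base, n);   /* ... concurrent reads   */

    MPI_Win_free(&win);
}
```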

36 Using shared memory. Comparing algorithm performance…
Theoretical estimate – 33% faster
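The 33% figure follows from the two cost formulas at p = 4, where the α and β terms cancel; a trivial check:

```c
#include <stdio.h>

/* Ratio of the two transfer counts for p = 4 processes:
 * standard (p-1) = 3 vs. optimized p/2 = 2 -> 33% fewer transfers. */
int main(void)
{
    double p = 4.0;
    double standard  = p - 1;
    double optimized = p / 2;
    printf("speedup: %.0f%%\n", 100.0 * (standard - optimized) / standard);
    return 0;
}
```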

37 Using shared memory. Comparing algorithm performance…
Test results – 31% faster

38 Summary
An effective implementation should take into account: the variable cost of point-to-point operations, the placement of processes on network hosts, and the relative costs of the existing algorithms. An effective implementation should also use the hardware resources as fully as possible.

39 Optimized bcast algorithm. Estimated performance

40 Optimized bcast algorithm. Experimental data

41 Publications
SCICOMP 11, Edinburgh, Scotland, 2005; European Power.org Community Conference, Barcelona, 2005; JSCC Power.org technical seminar, Moscow, 2005; Microsoft technologies in programming theory and practice, UNN, 2005.

42 Research group
Gergel V.P., professor; Grishagin V.A., associate professor; Belov S.A., associate professor; Linev A.V.; Gergel A.V.; Grishagin A.V.; Kurylev A.L.; Senin A.V.
This work is partly supported by the IBM Faculty Awards for Innovation program.

43 Contacts
603950, Nizhni Novgorod, Gagarina av. 23, Nizhni Novgorod State University, Faculty of Computational Mathematics and Cybernetics. Tel: +7 (8312)

44 Thank you for your attention
Questions, remarks, comments

