Published by Ryan Viel. Modified over 4 years ago.

1
GWDG
Matrix Transpose Results with Hybrid OpenMP / MPI
O. Haan
Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen, Germany (GWDG)
SCICOMP 2000, SDSC, La Jolla

2
GWDG O. Haan, Matrix Transpose Results, SCICOMP 2000
Overview
    Hybrid Programming Model
    Distributed Matrix Transpose
    Performance Measurements
    Summary of Results

3
Architecture of Scalable Parallel Computers
Two-level hierarchy:
    cluster of SMP nodes
        distributed memory
        high-speed interconnect
    SMP nodes with multiple processors
        shared memory
        bus or switch connected

4
Programming Models
message passing over all processors
    MPI implementation for shared memory
    multiple access to switch adapters
    SP: 4-way Winterhawk2 +, 8-way Nighthawk -
shared memory over all processors
    virtual global address space
    SP: -
hybrid message passing - shared memory
    message passing between nodes
    shared memory within nodes
    SP: +

5
Hybrid Programming Model
    SPMD program with MPI tasks
    OpenMP threads within each task
    communication between MPI tasks

6
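Example of Hybrid Program

    program hybrid_example
      use omp_lib                    ! OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
      implicit none
      include 'mpif.h'
      integer :: com, ierr, nk, my_task, kp, my_thread, i
      real :: node_res, glob_res
      real, allocatable :: thread_res(:)

      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(com, nk, ierr)        ! nk: number of MPI tasks
      call MPI_COMM_RANK(com, my_task, ierr)   ! my_task: rank of this task
      kp = OMP_GET_NUM_PROCS()                 ! kp: processors available to this task
      allocate(thread_res(0:kp-1))
      thread_res = 0.0

      ! each thread is expected to store its result in thread_res(my_thread)
    !$OMP PARALLEL NUM_THREADS(kp) PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work(my_thread, kp, my_task, nk, thread_res)   ! user-supplied routine
    !$OMP END PARALLEL

      ! sum over threads within the task, then over all MPI tasks
      node_res = 0.0
      do i = 0, kp-1
        node_res = node_res + thread_res(i)
      end do
      call MPI_REDUCE(node_res, glob_res, 1, MPI_REAL, MPI_SUM, 0, com, ierr)
      call MPI_FINALIZE(ierr)
    end program hybrid_example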

7
Hybrid Programming vs. Pure Message Passing
+   works on all SP configurations
    coarser internode communication granularity
    faster intranode communication
-   larger programming effort
    additional synchronization steps
    reduced reuse of cached data
The net score depends on the problem.

8
Distributed Matrix Transpose

9
3-step Transpose
n1 x n2 matrix A( i1, i2 ) --> n2 x n1 matrix B( i2, i1 )
decompose n1, n2 into local and global parts:
    n1 = n1l * np
    n2 = n2l * np
write matrices A, B as 4-dim arrays:
    A( i1l, i1g, i2l ; i2g ), B( i2l, i2g, i1l ; i1g )
step 1 : local reorder
    A( i1l, i1g, i2l ; i2g ) -> a1( i1l, i2l, i1g ; i2g )
step 2 : global reorder
    a1( i1l, i2l, i1g ; i2g ) -> a2( i1l, i2l, i2g ; i1g )
step 3 : local transpose
    a2( i1l, i2l, i2g ; i1g ) -> B( i2l, i2g, i1l ; i1g )
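The three steps can be checked with a small serial emulation. The sketch below is not the Fortran/MPI code of the talk: it is plain Python running in one process, and it assumes that the index after the semicolon names the owning processor and that i1 = i1l + n1l*i1g, i2 = i2l + n2l*i2g. The function name three_step_transpose and the dictionary storage are illustrative choices.

```python
def three_step_transpose(A, nprocs):
    """Serially emulate the 3-step distributed transpose on nprocs 'processors'."""
    n1, n2 = len(A), len(A[0])
    assert n1 % nprocs == 0 and n2 % nprocs == 0
    n1l, n2l = n1 // nprocs, n2 // nprocs

    # Distribute A: processor i2g holds A(i1l, i1g, i2l ; i2g)
    # (assumed layout: i1 = i1l + n1l*i1g, i2 = i2l + n2l*i2g).
    local_A = [
        {(i1l, i1g, i2l): A[i1l + n1l * i1g][i2l + n2l * i2g]
         for i1l in range(n1l) for i1g in range(nprocs) for i2l in range(n2l)}
        for i2g in range(nprocs)
    ]

    # Step 1, local reorder: A(i1l, i1g, i2l ; i2g) -> a1(i1l, i2l, i1g ; i2g)
    a1 = [{(i1l, i2l, i1g): v for (i1l, i1g, i2l), v in blk.items()}
          for blk in local_A]

    # Step 2, global reorder: processor i2g passes its block i1g to processor i1g
    a2 = [dict() for _ in range(nprocs)]
    for i2g, blk in enumerate(a1):
        for (i1l, i2l, i1g), v in blk.items():
            a2[i1g][(i1l, i2l, i2g)] = v

    # Step 3, local transpose: a2(i1l, i2l, i2g ; i1g) -> B(i2l, i2g, i1l ; i1g)
    local_B = [{(i2l, i2g, i1l): v for (i1l, i2l, i2g), v in blk.items()}
               for blk in a2]

    # Gather the distributed result into a full n2 x n1 matrix for checking
    B = [[None] * n1 for _ in range(n2)]
    for i1g, blk in enumerate(local_B):
        for (i2l, i2g, i1l), v in blk.items():
            B[i2l + n2l * i2g][i1l + n1l * i1g] = v
    return B


A = [[10 * i + j for j in range(6)] for i in range(4)]   # 4 x 6 test matrix
B = three_step_transpose(A, nprocs=2)
print(B == [list(col) for col in zip(*A)])
```

Only step 2 moves data between processors. Steps 1 and 3 are local copies with reordered indices, which is why their speed depends on memory and cache behaviour.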

10
Local Steps: Copy with Reorder
data in memory: speed limited by performance of bus and memory subsystems
    Winterhawk2 : all processors share the same bus
    bandwidth : 1.6 GB/s
data in cache: speed limited by processor performance
    Winterhawk2 : one load plus one store per cycle
    bandwidth : 8 MB / (1/375) s = 3 GB/s (8 bytes copied per cycle at 375 MHz)

11
Copy : Data in Memory

12
Copy : Prefetch

13
Copy : Data in Cache

14
Global Reorder
a1( *, *, i1g ; i2g ) -> a2( *, *, i2g ; i1g )
global reorder on np processors in np steps
[diagram: processors p0, p1, p2 exchanging blocks in step0, step1, step2]
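The exact pairing used in each step is not recoverable from this transcript. One common choice, shown here purely as an assumption, is a cyclic schedule in which processor p sends its block to processor (p + s) mod np in step s. The helper exchange_schedule is a hypothetical name.

```python
def exchange_schedule(nprocs):
    """Hypothetical cyclic schedule: in step s, processor p sends to (p + s) % nprocs."""
    return [[(p, (p + s) % nprocs) for p in range(nprocs)]
            for s in range(nprocs)]


# List the (sender, receiver) pairs for the three processors of the slide
for s, pairs in enumerate(exchange_schedule(3)):
    print("step", s, ["p%d->p%d" % pair for pair in pairs])
```

With this schedule every processor sends and receives exactly one block per step, and after np steps each block has reached its destination. Step 0 is the local block that stays on its processor.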

15
Performance Modelling
Hardware model:
    nk nodes with kp procs each
    np = nk * kp is total processor count
Switch model:
    nk concurrent links between nodes
    latency tlat, bandwidth c
execution model for Hybrid:
    reorder on nk nodes: nk steps with n1*n2 / nk**2 data per node
execution model for MPI:
    reorder on np processors: np steps with n1*n2 / np**2 data per node
    switch links shared between kp procs
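To get a feeling for the granularity difference, the step counts and per-step data volumes of the two execution models can be tabulated. This is a minimal sketch: the function name reorder_messages is illustrative, and the sample configuration (1000 x 1000 matrix, 16 nodes, 4 processors per node) is taken from the summary slides of this talk.

```python
def reorder_messages(n1, n2, nk, kp, model):
    """Return (number of steps, matrix elements moved per step by one participant)."""
    # Hybrid: the nk nodes are the participants; MPI: all nk*kp processors are.
    participants = nk if model == "hybrid" else nk * kp
    return participants, n1 * n2 / participants**2


for model in ("hybrid", "mpi"):
    steps, size = reorder_messages(1000, 1000, nk=16, kp=4, model=model)
    print(model, steps, size)
```

For this configuration the hybrid version needs kp = 4 times fewer steps, and each of its messages is kp**2 = 16 times larger.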

16
Performance Modelling
Hybrid timing model: [formula not preserved]
MPI timing model: [formula not preserved]
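The formulas of this slide are not preserved in the transcript. A plausible reading of the execution models of the previous slide, given here only as a sketch, assumes that the steps run one after another, that each message costs latency plus size divided by bandwidth, and that the kp processes sharing a link each see a bandwidth of c / kp:

```latex
\begin{align*}
t_{\mathrm{hybrid}} &= n_k \left( t_{\mathrm{lat}} + \frac{n_1 n_2}{n_k^2\, c} \right)
                     = n_k\, t_{\mathrm{lat}} + \frac{n_1 n_2}{n_k\, c} \\
t_{\mathrm{MPI}}    &= n_p \left( t_{\mathrm{lat}} + \frac{k_p\, n_1 n_2}{n_p^2\, c} \right)
                     = k_p\, n_k\, t_{\mathrm{lat}} + \frac{n_1 n_2}{n_k\, c}
\end{align*}
```

Under these assumptions the bandwidth terms coincide and the latency term of the MPI version is kp times larger, so the models differ most for small matrices.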

17
Timing of Global Reorder (internode part)

18
Timing of Global Reorder (internode part)

19
Timing of Global Reorder

20
Timing of Transpose

21
Scaling of Transpose

22
Timing of Transpose Steps

23
Summary of Results: Hardware
Memory access in Winterhawk2 is not adequate:
    copy rate of 400 MB/s = 50 Mwords/s
    peak CPU rate of 6000 Mflop/s
    a factor of 100 between computational speed and memory speed
Sharing of switch link by 4 processors degrades communication speed:
    bandwidth smaller by more than a factor of 4 ( factor of 4 expected )
    latency larger by nearly a factor of 4 ( factor of 1 expected )
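The memory figures can be rechecked with a few lines of arithmetic. The sketch assumes 8-byte words, which is what the conversion from 400 MB/s to 50 Mwords/s implies. The variable names are illustrative.

```python
copy_rate_mb_s = 400.0     # measured copy rate from the slide, MB/s
bytes_per_word = 8         # assumption: 8-byte (double precision) words
peak_mflops = 6000.0       # peak CPU rate from the slide, Mflop/s

# Convert the copy rate to words and compare it with the peak flop rate
copy_rate_mwords = copy_rate_mb_s / bytes_per_word
flops_per_copied_word = peak_mflops / copy_rate_mwords

print(copy_rate_mwords, flops_per_copied_word)
```

The ratio comes out at 120 floating-point operations per copied word, which the slide rounds to a factor of 100.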

24
Summary of Results: Hybrid vs. MPI
hybrid OpenMP / MPI programming is profitable for distributed matrix transpose:
    1000 x 1000 matrix on 16 nodes : 2.3 times faster
    10000 x 10000 matrix on 16 nodes : 1.1 times faster
Competing influences:
    MPI programming enhances use of cached data
    Hybrid programming has lower communication latency and coarser communication granularity

25
Summary of Results: Use of Transpose in FFT
2-dim complex array of size [formula not preserved]
Execution time on nk nodes : [formula not preserved]
where
    r : computational speed per node
    c : transpose speed per node
effective execution speed per node : [formula not preserved]
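The formulas of this slide are lost in the transcript. One form that reproduces the example values on the following slide to within about one Mflop/s is sketched below. It assumes an array of n elements in total, the usual count of 5 n log2 n floating-point operations for a complex FFT, and a transpose volume of 2n words. The factor 2 is inferred from the numbers and is not stated in the source.

```latex
\begin{align*}
T &= \frac{5\, n \log_2 n}{n_k\, r} + \frac{2\, n}{n_k\, c} \\
r_{\mathrm{eff}} &= \frac{5\, n \log_2 n}{n_k\, T}
                  = \left( \frac{1}{r} + \frac{2}{5\, c \log_2 n} \right)^{-1}
\end{align*}
```

In this reading the transpose speed c matters less as n grows, because the computational work per element increases with log2 n while the transpose cost per element stays constant.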

26
Summary of Results: Use of Transpose in FFT - Example SP
r = 4 * 200 Mflop/s = 800 Mflop/s
c depends on n, nk and programming model
nk = 16
                                   n = 10**6    n = 10**9
transpose speed c per node (Mword/s)
    hybrid                            5.6          7.8
    MPI                               2.5          7.0
effective execution speed per node (Mflop/s)
    hybrid                            208          338
    MPI                               108          317
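The effective speeds can be approximately reproduced from r and c. The sketch below assumes the form r_eff = 1 / (1/r + 2 / (5 c log2 n)), that is, 5 n log2 n operations for the FFT and 2n words moved by the transpose. This form is an assumption chosen to match the numbers and is not taken from the slide. The function name effective_speed is illustrative.

```python
import math


def effective_speed(r, c, n):
    """Effective Mflop/s per node for compute speed r (Mflop/s) and transpose speed c (Mword/s)."""
    # Assumed model: 5*n*log2(n) flops of FFT work, 2*n words of transpose traffic
    return 1.0 / (1.0 / r + 2.0 / (5.0 * c * math.log2(n)))


r = 800.0  # Mflop/s per node, from the slide
for label, n, c in [("hybrid", 10**6, 5.6), ("MPI", 10**6, 2.5),
                    ("hybrid", 10**9, 7.8), ("MPI", 10**9, 7.0)]:
    print(label, n, round(effective_speed(r, c, n)))
```

The results agree with the slide to within about one Mflop/s. The remaining difference is what one would expect if the slide used unrounded values of c or a rounded log2 n.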
