
1 High Performance LU Factorization for Non-dedicated Clusters Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo) and the future Grid

2 Background. Computing nodes on clusters and the Grid are shared by multiple applications. To obtain good performance, HPC applications must cope with:
- background processes
- dynamically changing sets of available nodes
- large latencies on the Grid

3 Performance limiting factor: background processes. Other processes (network daemons, interactive shells, etc.) may run in the background. Many typical applications are written in a synchronous style; in such applications, the delay of a single node degrades the overall performance.

4 Performance limiting factor: large latencies on the Grid. In future Grid environments, bandwidth will be sufficient for HPC applications, but large latencies (>100 ms) will remain an obstacle. Synchronous applications suffer from such latencies.

5 Available nodes change dynamically. Many HPC applications assume that the set of computing nodes is fixed. If applications support dynamically changing nodes, we can harness computing resources more efficiently!

6 Goal of this work: an LU factorization algorithm that
- tolerates background processes & large latencies, by overlapping multiple iterations
- supports dynamically changing nodes, via a suitable data mapping and the Phoenix model
The result: a fast HPC application on non-dedicated clusters and the Grid.

7 Outline of this talk:
- The Phoenix model
- Our LU algorithm: overlapping multiple iterations; data mapping for dynamically changing nodes
- Performance of our LU and HPL
- Related work
- Summary

8 Phoenix model [Taura et al. 03]: a message passing model for dynamically changing environments. It introduces the concept of virtual nodes, which serve as the destinations of messages; each physical node is responsible for a set of virtual nodes.
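A minimal sketch of the virtual-node idea (the names owner and send_to_virtual and the table-based routing are illustrative assumptions, not the Phoenix API): senders address messages to virtual-node IDs, and whichever physical node currently holds a virtual node receives its messages, so the physical node set can change without changing addresses.

#include <stdio.h>

#define NUM_VIRTUAL 1024   /* fixed space of virtual-node IDs */

/* owner[v] records which physical node currently holds virtual node v.
   When nodes join or leave, only this table changes; senders keep
   addressing messages to virtual-node IDs. */
static int owner[NUM_VIRTUAL];

static void assign_initial_owners(int nphysical) {
    for (int v = 0; v < NUM_VIRTUAL; v++)
        owner[v] = v % nphysical;   /* simple cyclic assignment */
}

/* Illustrative send: route a message for virtual node v to its owner. */
static void send_to_virtual(int v, const char *msg) {
    printf("deliver \"%s\" to physical node %d (virtual node %d)\n",
           msg, owner[v], v);
}

int main(void) {
    assign_initial_owners(4);
    send_to_virtual(42, "update block");
    owner[42] = 3;                        /* virtual node 42 migrates */
    send_to_virtual(42, "update block");  /* same address, new owner  */
    return 0;
}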

9 Overview of our LU. Like typical implementations, it is based on message passing; the matrix is decomposed into small blocks, and each block is updated by its owner node. Unlike typical implementations, it uses an asynchronous, data-driven style to overlap multiple iterations, and a cyclic-like data mapping that works for any (and a dynamically changing) number of nodes. (Currently, pivoting is not performed.)

10 LU factorization (blocked; A[i][j] denotes a block, B the number of blocks per dimension):

for (k = 0; k < B; k++) {
  A[k][k] = fact(A[k][k]);                    /* diagonal block */
  for (i = k+1; i < B; i++)
    A[i][k] = update_L(A[i][k], A[k][k]);     /* L part */
  for (j = k+1; j < B; j++)
    A[k][j] = update_U(A[k][j], A[k][k]);     /* U part */
  for (i = k+1; i < B; i++)
    for (j = k+1; j < B; j++)
      A[i][j] = A[i][j] - A[i][k] * A[k][j];  /* trailing part */
}

11 Naive implementation and its problem. Iterations are separated by synchronization, so the implementation is not tolerant to latencies or background processes! [Figure: timeline of the k-th, (k+1)-th, and (k+2)-th iterations, plotting the number of executable tasks for the diagonal, U, L, and trailing parts.]

12 Latency hiding techniques. Overlapping iterations hides latencies: computation of the diagonal/L/U parts is advanced ahead of the trailing update. If the computations of trailing parts are kept separate, only two adjacent iterations are overlapped, so there is room for further improvement.

13 Overlapping multiple iterations for more tolerance. We overlap multiple iterations by computing all blocks, including trailing parts, asynchronously. A data-driven style and prioritized task scheduling are used.

14 Prioritized task scheduling. We assign a priority to the updating task of each block: the k-th update of block A[i][j] has priority min(i-S, j-S, k), where a smaller number means a higher priority and S is the desired overlap depth. We can control the overlapping by changing the value of S.
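A minimal sketch of this priority rule (the function name task_priority is illustrative; only the formula min(i-S, j-S, k) comes from the slide):

#include <stdio.h>

/* Priority of the k-th update of block A[i][j]; smaller = higher.
   S is the desired overlap depth: raising S lets updates of upper-left
   blocks from later iterations run ahead of lower-right trailing updates. */
static int task_priority(int i, int j, int k, int S) {
    int m = (i - S < j - S) ? i - S : j - S;
    return (m < k) ? m : k;   /* min(i-S, j-S, k) */
}

int main(void) {
    int S = 5;
    /* An upper-left block's early update outranks a lower-right one: */
    printf("%d vs %d\n",
           task_priority(3, 7, 0, S),   /* min(-2, 2, 0) = -2 (runs first) */
           task_priority(9, 9, 2, S));  /* min( 4, 4, 2) =  2              */
    return 0;
}

A task queue ordered by this key realizes the scheduling: with S=0 it degenerates to strict iteration order, while a larger S deepens the overlap.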

15 Typical data mapping and its problem. The two-dimensional block-cyclic distribution (e.g., over nodes P0..P5) gives good load balance and small communication volume, but the number of nodes must be fixed and must factor into two small numbers. How can we support dynamically changing nodes?
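For reference, a minimal sketch of the standard 2D block-cyclic owner rule (owner_2d_cyclic is an illustrative name); it makes the limitation concrete: the node count must factor into a Pr x Pc grid.

#include <stdio.h>

/* Standard 2D block-cyclic rule over a Pr x Pc process grid,
   e.g., 6 nodes arranged 2 x 3 (P0..P5 as on the slide). */
static int owner_2d_cyclic(int i, int j, int Pr, int Pc) {
    return (i % Pr) * Pc + (j % Pc);
}

int main(void) {
    int Pr = 2, Pc = 3;   /* works because 6 = 2 * 3; a prime node
                             count (say 7) admits no such grid, which
                             is exactly the problem raised above */
    printf("block (4,5) -> node P%d\n", owner_2d_cyclic(4, 5, Pr, Pc));
    return 0;
}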

16 Our data mapping for dynamically changing nodes. The block rows of the matrix are randomly permuted, and the permutation is common among all nodes; blocks of the permuted matrix are then assigned to nodes. [Figure: an 8x8 block matrix A[0][0]..A[7][7], its permuted form after a random row permutation, and the resulting distribution among nodes.]
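A minimal sketch of one way such a permuted mapping can be realized (the cyclic owner rule below is an assumption for illustration; the talk only specifies that the permutation is shared): every node computes the same random permutation from a common seed, and block ownership is derived from permuted indices, so it is well defined for any node count.

#include <stdio.h>
#include <stdlib.h>

#define B 8   /* blocks per dimension, as in the 8x8 example above */

/* Fisher-Yates shuffle from a fixed seed, so all nodes compute the
   same permutation without communicating (this assumes an identical
   rand() implementation everywhere; a real system would ship its own
   portable generator). */
static void shared_permutation(int perm[B], unsigned seed) {
    srand(seed);
    for (int i = 0; i < B; i++) perm[i] = i;
    for (int i = B - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
}

/* Illustrative owner rule: cyclic distribution over permuted indices;
   when the node count changes, only nnodes changes here. */
static int owner(const int perm[B], int i, int j, int nnodes) {
    return (perm[i] * B + perm[j]) % nnodes;
}

int main(void) {
    int perm[B];
    shared_permutation(perm, 42u);   /* same seed on every node */
    printf("A[3][5] -> node %d of 7\n", owner(perm, 3, 5, 7));
    return 0;
}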

17 Dynamically joining nodes. A new node sends a steal message to one of the existing nodes. The receiver gives up some of its virtual nodes and sends the corresponding blocks to the new node, which takes over those virtual nodes and blocks. For better load balance, the stealing process is repeated. [Figure: original and permuted block distributions before and after stealing.]
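A rough sketch of the joining node's side of this protocol, under the strong assumption that the helpers below (send_steal, take_over_virtual_nodes_and_blocks) are stand-ins for real Phoenix communication:

#include <stdio.h>
#include <stdlib.h>

static int pick_random_node(int nnodes) { return rand() % nnodes; }

static void send_steal(int victim) {
    printf("steal request -> node %d\n", victim);    /* stub */
}

static void take_over_virtual_nodes_and_blocks(int victim) {
    /* In the real system: receive the abandoned virtual-node IDs and
       the matrix blocks they own, then start serving them locally. */
    printf("took over work from node %d\n", victim); /* stub */
}

int main(void) {
    int nnodes = 16, steal_rounds = 4;
    /* The joining node repeats the steal to even out the load. */
    for (int r = 0; r < steal_rounds; r++) {
        int victim = pick_random_node(nnodes);
        send_steal(victim);
        take_over_virtual_nodes_and_blocks(victim);
    }
    return 0;
}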

18 Experimental environments (1). A 112-node IBM BladeCenter cluster: 70 nodes with dual 2.4 GHz Xeons and 42 nodes with dual 2.8 GHz Xeons; 1 CPU per node is used, so the slower CPUs (2.4 GHz) determine the overall performance. Nodes are connected by Gigabit Ethernet.

19 Experimental environments (2). High Performance Linpack (HPL) is by Petitet et al.; GOTO BLAS is by Kazushige Goto (UT-Austin). Our configurations: Ours (S=0) does not overlap explicitly; Ours (S=1) overlaps with one adjacent iteration; Ours (S=5) overlaps multiple (5) iterations.

20 Scalability. Ours (S=5) achieves 190 GFlops with 108 nodes, a 65x speedup. Matrix size N=61440, block size NB=240, overlap depth S=0 or 5. [Graph: speedup curves with annotations x72 and x65.]

21 Tolerance to background processes (1). We run LU/HPL with background processes: 3 background processes per randomly chosen node. The background processes are short-term; they move to other random nodes every 10 seconds.

22 Tolerance to background processes (2). HPL slows down heavily; Ours (S=0) and Ours (S=1) also suffer. By overlapping multiple iterations (S=5), our LU becomes more tolerant! 108 nodes for computation, N=46080. [Graph: slowdowns of -36%, -31%, -26%, and -16%.]

23 Tolerance to large latencies (1). We emulate a future Grid environment with high bandwidth and large latencies. Experiments are done on a cluster; large latencies (+0 ms, +200 ms, +500 ms) are emulated by software.

24 Tolerance to large latencies (2). S=0 suffers a 28% slowdown; overlapping iterations makes our LU more tolerant, and both S=1 and S=5 work well. 108 nodes for computation, N=46080. [Graph: slowdowns of -28%, -20%, and -19%.]

25 Performance with joining nodes (1). 16 nodes at first; then 48 nodes are added dynamically (16 -> 64 nodes).

26 Performance with joining nodes (2). Flexibility in the number of nodes is useful for obtaining higher performance (the graph notes "x1.9 faster"). Compared with Fixed-64, Dynamic suffers from migration overhead etc. N=30720, S=5.

27 Related work: Dyn-MPI [Weatherly et al. 03], an extended MPI library that supports dynamically changing nodes.

                            Dyn-MPI                   Our approach
Redistribution method       Synchronous               Asynchronous
Distribution of 2D matrix   Only the first dimension  Arbitrary (left to the programmer)

28 Summary. An LU implementation suitable for non-dedicated clusters and the Grid: it is scalable, supports dynamically changing nodes, and tolerates background processes & large latencies.

29 Future work. Perform pivoting: more data dependencies are introduced; is our LU still tolerant? Improve dynamic load balancing: choose better target nodes for stealing, and take CPU speeds into account. Apply our approach to other HPC applications, e.g., CFD applications.

30 Thank you!

31 Typical task scheduling. Each node updates blocks synchronously, which is not tolerant to background processes. [Figure: all blocks carry the same iteration number at any moment, advancing together from 0 to 3.]

32 Our task scheduling to tolerate delays (1). Each block is updated asynchronously, so blocks may have different iteration numbers at the same time. [Figure: a snapshot in which neighboring blocks carry iteration numbers between 0 and 3.]

33 Our task scheduling to tolerate delays (2). We not only allow skew, but make skew explicit: we introduce prioritized task scheduling and give higher priority to upper-left blocks. Target skew = 3 (similar to pipeline depth). [Figure: iteration numbers across blocks showing a controlled skew of 3.]

34 Performance with joining processes (3). [Graph annotations: the added processes suffer from migration; good peak speed; a longer tail end.]

