# Communication costs of LU decomposition algorithms for banded matrices Razvan Carbunescu 12/02/2011

## Presentation transcript:


Outline (1/2)

- Sequential general LU factorization (GETRF) and Lower Bounds
  - Definitions and Lower Bounds
  - LAPACK algorithm
  - Communication cost
  - Summary
- Sequential banded LU factorization (GBTRF) and Lower Bounds
  - Definitions and Lower Bounds
  - Banded format
  - LAPACK algorithm
  - Communication cost
  - Summary
- Sequential LU Summary

Outline (2/2)

- Parallel LU definitions and Lower bounds
- Parallel Cholesky algorithms (Saad, Schultz 85)
- SPIKE Cholesky algorithm (Sameh85)
- Parallel banded LU factorization (PGBTRF)
  - ScaLAPACK algorithm
  - Communication cost
  - Summary
- Parallel banded LU and Cholesky Summary
- Future Work
- General Summary

GETRF – Definitions and Lower Bounds

Variables:

- n - size of the matrix
- r - block size (panel width)
- i - current panel number
- M - size of fast memory

GETRF fits into the pattern of 3-nested loops and has the usual lower bounds:
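For reference, the standard bounds for algorithms that fit the 3-nested-loops pattern are sketched below. They are quoted from the general communication-lower-bound literature, not recovered from the slide, and assume $\Theta(n^3)$ flops with a fast memory of size $M$ that is too small to hold the whole matrix:

```latex
\begin{align*}
\text{words moved} &= \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right), &
\text{messages} &= \Omega\!\left(\frac{n^{3}}{M^{3/2}}\right).
\end{align*}
```

The message bound follows from the word bound because a single message can carry at most $M$ words.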

GETRF - Communication assumptions

- BLAS2 LU on (m x n) matrix takes
- TRSM on (n x m) with LL (n x n) takes
- GEMM in (m x n) - (m x k) (k x n) takes

[Figures: block diagrams of the panel factorization P = L U, the TRSM update U = LL^-1 A, and the GEMM update A = A - L U, with dimensions labeled m, n and k.]

GETRF – LAPACK algorithm

For each panel block:

1) Factorize panel (n x r)
2) Permute matrix
3) Compute U update (TRSM) of size r x (n-ir) with LL of size r x r
4) Compute GEMM update of size: (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))
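The four steps can be sketched in NumPy as follows. This is a minimal illustration of a right-looking blocked LU with partial pivoting, not the LAPACK source: the function name `blocked_lu` is hypothetical, row swaps are applied to the whole matrix immediately instead of in a separate permutation pass, and `np.linalg.solve` stands in for a true triangular solve (TRSM). The input is assumed to be square and nonsingular.

```python
import numpy as np

def blocked_lu(a, r):
    """Blocked LU with partial pivoting; returns (lu, perm) with a[perm] = L @ U."""
    lu = np.array(a, dtype=float)
    n = lu.shape[0]
    perm = np.arange(n)
    for k in range(0, n, r):
        e = min(k + r, n)  # panel covers columns k .. e-1
        # 1) Factorize the panel column by column (BLAS2-style).
        for j in range(k, e):
            p = j + np.argmax(np.abs(lu[j:, j]))
            if p != j:
                # 2) Permute: swap full rows so L and the trailing matrix stay consistent.
                lu[[j, p], :] = lu[[p, j], :]
                perm[[j, p]] = perm[[p, j]]
            lu[j + 1:, j] /= lu[j, j]
            lu[j + 1:, j + 1:e] -= np.outer(lu[j + 1:, j], lu[j, j + 1:e])
        if e < n:
            # 3) U update: solve LL * U12 = A12 with the unit lower triangle LL.
            ll = np.tril(lu[k:e, k:e], -1) + np.eye(e - k)
            lu[k:e, e:] = np.linalg.solve(ll, lu[k:e, e:])
            # 4) GEMM update of the trailing matrix: A22 -= L21 @ U12.
            lu[e:, e:] -= lu[e:, k:e] @ lu[k:e, e:]
    return lu, perm
```

The unit lower triangle of `lu` holds L and the upper triangle holds U, matching the in-place convention of GETRF.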

GETRF – LAPACK algorithm (1/4)

Factorize panel P

Words:

Total words:

[Figure: panel P of size (n-(i-1)r) x r factored into L and U.]

GETRF – LAPACK algorithm (2/4)

Permute matrix with pivot information from panel

Words:

Total words:

GETRF – LAPACK algorithm (3/4)

Compute U update (TRSM) of size r x (n-ir) with LL of size r x r

Words:

Total words:

[Figure: U = LL^-1 A, with LL of size r x r and U, A of size r x (n-ir).]

GETRF – LAPACK algorithm (4/4)

Compute GEMM update of size (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))

Words:

Total words:

[Figure: trailing update A = A - L U, with L of size (n-ir) x r and U of size r x (n-ir).]

GETRF – Communication cost

Communication cost:

Simplified in big-O notation, we get:

GETRF - General LU Summary

General LU lower bounds are:

LAPACK LU algorithm gives:

GBTRF - Banded LU factorization

Variables:

- n - size of the matrix
- b - matrix bandwidth
- r - block size (panel width)
- M - size of fast memory

GBTRF also fits into the 3-nested-loops lower bounds:
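As a hedged reference point: LU on a matrix of bandwidth b performs on the order of $n b^{2}$ flops, so applying the generic flop-based argument for 3-nested-loops algorithms would give the bounds below. This is an assumption based on that general argument (valid when the working band does not fit in fast memory), not the expressions from the slide:

```latex
\begin{align*}
\text{words moved} &= \Omega\!\left(\frac{n b^{2}}{\sqrt{M}}\right), &
\text{messages} &= \Omega\!\left(\frac{n b^{2}}{M^{3/2}}\right).
\end{align*}
```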

Banded Format

- GBTRF uses a special banded format
- Packed data format that stores mostly data and very few zeros
- Columns map to columns; diagonals map to rows
- Easy to retrieve a square block from the original A by using lda – 1 as the leading dimension
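A small NumPy sketch of this packing, assuming the usual LAPACK band convention in which entry A(i, j) is stored at row ku + i - j of column j (0-based). The helper names `pack_band` and `dense_block` are hypothetical, and the extra kl rows that GBTRF reserves for fill-in are left out for brevity. The second helper illustrates the lda – 1 trick: in column-major storage, A(i, j) sits at flat offset ku + i + j*(ldab - 1), so an in-band square block can be addressed like a dense matrix with leading dimension ldab - 1.

```python
import numpy as np

def pack_band(a, kl, ku):
    """Pack a banded n x n matrix into a (kl + ku + 1) x n column-major array."""
    n = a.shape[0]
    ab = np.zeros((kl + ku + 1, n), order="F")
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[ku + i - j, j] = a[i, j]  # diagonals become rows
    return ab

def dense_block(ab, ku, i0, j0, m):
    """Read the m x m block A[i0:i0+m, j0:j0+m] using leading dimension ldab - 1.

    Only valid when the whole block lies inside the band.
    """
    ldab = ab.shape[0]
    flat = ab.ravel(order="F")  # column-major storage as one vector
    rows = np.arange(m)[:, None]
    cols = np.arange(m)[None, :]
    return flat[ku + i0 + rows + (j0 + cols) * (ldab - 1)]
```

For example, with kl = ku = 2 a 3 x 3 diagonal block lies entirely inside the band, so `dense_block(ab, 2, 2, 2, 3)` reproduces `a[2:5, 2:5]`.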

Banded Format

[Figure: conceptual versus actual layout of the band matrix in storage.]

Because of the format, the updates of U and of the Schur complement are split into multiple stages for the parts of the band matrix near the edges of the storage array.

GBTRF Algorithm

For each panel block:

1) Factorize panel of size b x r
2) Permute rest of matrix affected by panel
3) Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)
4) Compute U update (TRSM) of size r x r with LL of size (r x r)
5) Compute 4 GEMM updates of sizes:
   - (b-2r) x (b-2r) + ((b-2r) x r) * (r x (b-2r))
   - (b-2r) x r + ((b-2r) x r) * (r x r)
   - r x (b-2r) + (r x r) * (r x (b-2r))
   - r x r + (r x r) * (r x r)

GBTRF – LAPACK algorithm (1/8)

Factorize panel P

Words:

Total words:

[Figure: panel P within the band, with dimensions labeled b and r.]

GBTRF – LAPACK algorithm (2/8)

Apply permutations

Words:

Total words:

GBTRF – LAPACK algorithm (3/8)

Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)

Words:

Total words:

[Figure: TRSM blocks with dimensions labeled r and b – 2r.]

GBTRF – LAPACK algorithm (4/8)

Compute U update (TRSM) of size r x r with LL of size (r x r)

Words:

Total words:

[Figure: TRSM blocks, all with dimensions labeled r.]

GBTRF – LAPACK algorithm (5/8)

Compute GEMM update of size (b-2r) x (b-2r) + ((b-2r) x r) * (r x (b-2r))

Words:

Total words:

[Figure: GEMM blocks with dimensions labeled b – 2r and r.]

GBTRF – LAPACK algorithm (6/8)

Compute GEMM update of size

Words:

Total words:

[Figure: GEMM blocks with dimensions labeled b – 2r and r.]

GBTRF – LAPACK algorithm (7/8)

Compute GEMM update of size

Words:

Total words:

[Figure: GEMM blocks with dimensions labeled b – 2r and r.]

GBTRF – LAPACK algorithm (8/8)

Compute GEMM update of size

Words:

Total words:

[Figure: GEMM blocks, all with dimensions labeled r.]

GBTRF communication cost

A full cost would be:

If we choose r < b/3, this simplifies the leading terms to:

Since r < b, the other option is b/3 < r < b, in which case we get:

GBTRF - Banded LU Summary

Banded LU lower bounds are:

LAPACK banded LU algorithm gives:

Sequential Summary

Parallel banded LU - Definitions

Variables:

- n - size of the matrix
- p - number of processors
- b - matrix bandwidth
- M - size of fast memory

Parallel banded LU – Lower Bounds

Assuming the banded matrix is distributed in a 1D layout across n

Lower Bounds:

[Figure: neighboring processors P(i-1) and P(i) in the 1D layout.]

Parallel banded algorithms – (Saad 85)

- (Saad, Schultz 85) presents a computation and communication analysis for banded Cholesky (LL^T) solvers on a 1D ring, a 2D torus and an n-D hypercube, as well as a pipelined approach.
- While this is a different computation from LU, Cholesky can be viewed as a minimum cost for LU, since it requires neither pivoting nor the computation of U, yet is still a form of Gaussian Elimination.
- Since most parallel banded algorithms also increase the amount of computation done, that increase is also compared between the algorithms, as a multiplicative factor on the leading term.

Parallel banded algorithms – RIGBE

Parallel banded algorithms – BIGBE

Parallel banded algorithms – HBGE

Same algorithm as BIGGE, but the 2D grid is embedded in the hypercube to allow for faster communication.

Parallel banded algorithms – WFGE

Uses the 2D cyclic layout and then performs operations diagonally.

Parallel banded algorithms – (Saad 85)

Parallel band LU lower bounds:

Banded Cholesky algorithms:

Parallel banded algorithms – SPIKE (1/3)

- Another parallel banded implementation is the SPIKE algorithm (Lawrie, Sameh 84), a Cholesky solver, which is just a special case of Gaussian Elimination.
- This factorization and solve algorithm is extended to a pivoting LU implementation in (Sameh 05).

Parallel banded algorithms – SPIKE (2/3)

Parallel banded algorithms – SPIKE (3/3)

Parallel band LU lower bounds:

SPIKE Cholesky algorithm:

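PGBTRF – Data Layout

Adopts the same banded layout as the sequential algorithm, with slightly larger bandwidth storage (4b instead of 3b) and a 1D block distribution.

[Figure: band matrix of order n, with a dimension labeled 2b, split into contiguous blocks owned by P1, P2, P3 and P4.]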
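A 1D block distribution of the n columns over p processors can be sketched as below. This is only an illustration under the assumption that each processor owns one contiguous block of ceil(n/p) columns; the function name `block_columns` is hypothetical and is not a ScaLAPACK routine.

```python
def block_columns(n, p):
    """Return the half-open column range (start, stop) owned by each of p processors."""
    nb = -(-n // p)  # ceiling division: block size ceil(n / p)
    return [(q * nb, min((q + 1) * nb, n)) for q in range(p)]
```

With n = 10 and p = 4 the block size is 3, so the first three processors own three columns each and the last one owns the remaining column.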

PGBTRF – Algorithm

Description from ScaLAPACK code:

1) Compute fully independent band LU factorizations of the submatrices located in local memory.
2) Pass the upper triangular matrix from the end of the local storage on to the next processor.
3) From the local factorization and the upper triangular matrix, form a reduced blocked bidiagonal system and store the extra data in Af (extra storage).
4) Solve the reduced blocked bidiagonal system to compute the extra factors and store them in Af.

PGBTRF – Communication cost

Parallel band LU lower bounds:

ScaLAPACK band LU algorithm:

Parallel Summary

[Table comparing: Lower Bounds, (Saad85), SPIKE, ScaLAPACK.]

Future Work

- Checking the lower bounds and implementation details of applying CALU to the panel in the LAPACK algorithm
- Investigating parallel band LU lower bounds for an exact cost
- Heterogeneous analysis of the implemented MAGMA sgbtrf and lower bounds for a heterogeneous model
- Looking at Nested Dissection as another Divide and Conquer method for parallel banded LU
- Analysis of the cost of applying a parallel banded algorithm to the sequential model, to see if we can reduce the communication by increasing computation

General Summary

Questions?
