# Block LU Factorization Lecture 24 MA471 Fall 2003.


Example Case

1) Suppose we are faced with the solution of a linear system Ax=b.
2) Further suppose:
   1) A is large (dim(A) > 10,000)
   2) A is dense
   3) A is full
   4) We have a sequence of different b vectors.

Problems

Suppose we are able to compute the matrix:

- It costs N^2 doubles to store the matrix.
- E.g. for N=100,000 we require 8×10^10 bytes, about 80 GB (roughly 74.5 GiB), of storage for the matrix alone.
- 32 bit processors are limited to 4 gigabytes of memory.
- Most desktops (even 64 bit) do not have that much memory.
- What to do?
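The storage estimate can be checked with a few lines of arithmetic. This is a minimal sketch assuming 8 bytes per double; the function name `dense_storage_bytes` is made up for illustration. The figure quoted in "gigabytes" depends on whether a gigabyte means 10^9 or 2^30 bytes.

```python
def dense_storage_bytes(n, bytes_per_entry=8):
    """Bytes needed to store a dense n-by-n matrix (8 bytes per double assumed)."""
    return n * n * bytes_per_entry

n = 100_000
total = dense_storage_bytes(n)
print(f"{total / 1e9:.1f} GB (10^9 bytes)")
print(f"{total / 2**30:.1f} GiB (2^30 bytes)")
# Share per processor if the matrix is split evenly over a 4-by-4 grid.
print(f"{total / 16 / 2**30:.2f} GiB per processor on 16 processors")
```

Even split over 16 processors, each block still exceeds the 4 gigabyte limit of a 32 bit processor, so a larger grid would be needed at this size.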

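Divide and Conquer

One approach is to assume we have a square number of processors. We then divide the matrix into blocks, storing one block per processor. With 16 processors the blocks are assigned as:

$$
\begin{bmatrix}
P_0 & P_1 & P_2 & P_3 \\
P_4 & P_5 & P_6 & P_7 \\
P_8 & P_9 & P_{10} & P_{11} \\
P_{12} & P_{13} & P_{14} & P_{15}
\end{bmatrix}
$$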
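The mapping from matrix entries to processors can be sketched as follows. This assumes row-by-row rank numbering as in the grid above and a matrix dimension divisible by the grid width; `block_owner` and `block_range` are hypothetical helper names, not taken from the course code.

```python
import math

def block_owner(i, j, n, nprocs):
    """Rank of the processor storing entry (i, j) of an n-by-n matrix
    split into a q-by-q grid of equal blocks, ranks numbered row by row."""
    q = math.isqrt(nprocs)
    if q * q != nprocs:
        raise ValueError("nprocs must be a perfect square")
    if n % q != 0:
        raise ValueError("this sketch assumes the grid width divides n")
    nb = n // q  # rows (and columns) per block
    return (i // nb) * q + (j // nb)

def block_range(rank, n, nprocs):
    """Half-open (row range, column range) of the block stored on `rank`."""
    q = math.isqrt(nprocs)
    nb = n // q
    br, bc = divmod(rank, q)
    return (br * nb, (br + 1) * nb), (bc * nb, (bc + 1) * nb)

print(block_owner(5, 10, 16, 16))   # entry (5, 10) of a 16x16 matrix on 16 ranks
print(block_range(6, 16, 16))
```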

Back to the Linear System

We are now faced with LU factorization of a distributed matrix. This calls for a modified LU routine which acts on blocks of the matrix. We will demonstrate this algorithm for one level, i.e. we need to construct matrices L, U such that A=LU while storing only single blocks of A, L, U on any processor.

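Constructing the Block LU Factorization

$$
\begin{bmatrix}
A_{00} & A_{01} & A_{02} \\
A_{10} & A_{11} & A_{12} \\
A_{20} & A_{21} & A_{22}
\end{bmatrix}
=
\begin{bmatrix}
L_{00} & 0 & 0 \\
L_{10} & I & 0 \\
L_{20} & 0 & I
\end{bmatrix}
\begin{bmatrix}
U_{00} & U_{01} & U_{02} \\
0 & \tilde{A}_{11} & \tilde{A}_{12} \\
0 & \tilde{A}_{21} & \tilde{A}_{22}
\end{bmatrix}
$$

First we LU factorize A00 and look for the above block factorization, where I is an identity block and Ãnm denotes a block that is not yet factorized. We need to figure out what each of the entries is:

- A00 = L00*U00 (compute L00, U00 by LU factorization)
- A01 = L00*U01 => U01 = L00\A01
- A02 = L00*U02 => U02 = L00\A02
- A10 = L10*U00 => L10 = A10/U00
- A20 = L20*U00 => L20 = A20/U00
- A11 = L10*U01 + Ã11 => Ã11 = A11 - L10*U01
- ...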

Constructing the Block LU Factorization (cont.)

- A00 = L00*U00 (compute L00, U00 by LU factorization)
- A01 = L00*U01 => U01 = L00\A01
- A02 = L00*U02 => U02 = L00\A02
- A10 = L10*U00 => L10 = A10/U00
- A20 = L20*U00 => L20 = A20/U00
- A11 = L10*U01 + Ã11 => Ã11 = A11 - L10*U01
- A12 = L10*U02 + Ã12 => Ã12 = A12 - L10*U02
- A21 = L20*U01 + Ã21 => Ã21 = A21 - L20*U01
- A22 = L20*U02 + Ã22 => Ã22 = A22 - L20*U02

In the general case, with Ãnm the updated (not yet factorized) block: Anm = Ln0*U0m + Ãnm => Ãnm = Anm - Ln0*U0m

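Summary First Stage

$$
\begin{bmatrix}
A_{00} & A_{01} & A_{02} \\
A_{10} & A_{11} & A_{12} \\
A_{20} & A_{21} & A_{22}
\end{bmatrix}
=
\begin{bmatrix}
L_{00} & 0 & 0 \\
L_{10} & I & 0 \\
L_{20} & 0 & I
\end{bmatrix}
\begin{bmatrix}
U_{00} & U_{01} & U_{02} \\
0 & \tilde{A}_{11} & \tilde{A}_{12} \\
0 & \tilde{A}_{21} & \tilde{A}_{22}
\end{bmatrix}
$$

- First step: LU factorize the uppermost diagonal block, A00 = L00*U00.
- Second step:
  - a) compute U0n = L00\A0n, n>0
  - b) compute Ln0 = An0/U00, n>0
- Third step: compute the updated blocks Ãnm = Anm - Ln0*U0m, (n,m>0)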
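The three steps of the first stage can be sketched serially in pure Python. This is an illustration under stated assumptions, not the course's Matlab or C code: blocks are stored as nested lists, the LU of the diagonal block uses no pivoting (so nonzero pivots are assumed), and all function names are hypothetical.

```python
def matmul(A, B):
    """Dense product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lu_nopivot(A):
    """Doolittle LU without pivoting: unit lower L and upper U with A = L*U."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[float(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def left_solve_lower(L, B):
    """X = L \\ B for lower-triangular L (forward substitution per column)."""
    n, m = len(L), len(B[0])
    X = [[0.0] * m for _ in range(n)]
    for c in range(m):
        for i in range(n):
            s = B[i][c] - sum(L[i][k] * X[k][c] for k in range(i))
            X[i][c] = s / L[i][i]
    return X

def right_solve_upper(B, U):
    """X = B / U for upper-triangular U, i.e. X*U = B (solved row by row)."""
    n = len(U)
    X = [[0.0] * n for _ in B]
    for r, row in enumerate(B):
        for j in range(n):
            s = row[j] - sum(X[r][k] * U[k][j] for k in range(j))
            X[r][j] = s / U[j][j]
    return X

def first_stage(A):
    """A[n][m] is block Anm of a q-by-q block matrix."""
    q = len(A)
    L00, U00 = lu_nopivot(A[0][0])                                    # first step
    U0 = {m: left_solve_lower(L00, A[0][m]) for m in range(1, q)}     # step 2a
    L0 = {n: right_solve_upper(A[n][0], U00) for n in range(1, q)}    # step 2b
    S = {}                                                            # third step
    for n in range(1, q):
        for m in range(1, q):
            P = matmul(L0[n], U0[m])
            S[n, m] = [[a - p for a, p in zip(ra, rp)]
                       for ra, rp in zip(A[n][m], P)]
    return L00, U00, U0, L0, S

# 2-by-2 grid of 2-by-2 blocks, chosen so that every intermediate value is exact.
A = [[[[2, 1], [4, 5]], [[2, 4], [6, 11]]],
     [[[2, 4], [6, 6]], [[5, 8], [9, 17]]]]
L00, U00, U0, L0, S = first_stage(A)
print(S[1, 1])
```

The dictionary `S` holds the updated blocks Anm - Ln0*U0m, which are what the next stage factorizes.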

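Now Factorize Lower SE Block

$$
\begin{bmatrix}
\tilde{A}_{11} & \tilde{A}_{12} \\
\tilde{A}_{21} & \tilde{A}_{22}
\end{bmatrix}
=
\begin{bmatrix}
L_{11} & 0 \\
L_{21} & I
\end{bmatrix}
\begin{bmatrix}
U_{11} & U_{12} \\
0 & \tilde{\tilde{A}}_{22}
\end{bmatrix}
$$

We repeat the previous algorithm, this time on the two-by-two south-east (lower right) block of updated entries. The last remaining block, written with a double tilde, is factorized in the final stage.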

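End Result

$$
\begin{bmatrix}
A_{00} & A_{01} & A_{02} \\
A_{10} & A_{11} & A_{12} \\
A_{20} & A_{21} & A_{22}
\end{bmatrix}
=
\begin{bmatrix}
L_{00} & 0 & 0 \\
L_{10} & L_{11} & 0 \\
L_{20} & L_{21} & L_{22}
\end{bmatrix}
\begin{bmatrix}
U_{00} & U_{01} & U_{02} \\
0 & U_{11} & U_{12} \\
0 & 0 & U_{22}
\end{bmatrix}
$$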
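Repeating the three steps on each successive lower right block produces the full factorization. The following is a minimal serial sketch in pure Python, assuming no pivoting (nonzero pivots) and a matrix stored as a list of rows; `block_lu_inplace` and `unpack_lu` are hypothetical names. L and U overwrite A, with the unit diagonal of L left implicit.

```python
def block_lu_inplace(A, nb):
    """Overwrite A with its LU factors, working on nb-by-nb blocks.
    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U. No pivoting is performed."""
    n = len(A)
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Step 1: unblocked LU of the diagonal block.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                A[i][k] /= A[k][k]
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        # Step 2a: block row to the right, U_km = L_kk \ A_km.
        for j in range(k1, n):
            for i in range(k0, k1):
                for k in range(k0, i):
                    A[i][j] -= A[i][k] * A[k][j]
        # Step 2b: block column below, L_nk = A_nk / U_kk.
        for i in range(k1, n):
            for j in range(k0, k1):
                for k in range(k0, j):
                    A[i][j] -= A[i][k] * A[k][j]
                A[i][j] /= A[j][j]
        # Step 3: update the trailing blocks, A_nm <- A_nm - L_nk * U_km.
        for i in range(k1, n):
            for j in range(k1, n):
                A[i][j] -= sum(A[i][k] * A[k][j] for k in range(k0, k1))
    return A

def unpack_lu(A):
    """Split the packed result into explicit L and U matrices."""
    n = len(A)
    L = [[A[i][j] if j < i else float(i == j) for j in range(n)] for i in range(n)]
    U = [[A[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U

A = [[2.0, 1.0, 1.0, 2.0],
     [4.0, 3.0, 4.0, 5.0],
     [2.0, 4.0, 9.0, 6.0],
     [6.0, 4.0, 9.0, 10.0]]
L, U = unpack_lu(block_lu_inplace([row[:] for row in A], nb=2))
# Multiply back to confirm that L*U reproduces A.
LU = [[sum(L[i][k] * U[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
print(LU == A)
```

In the distributed version each of the three inner steps runs on a different set of processors, which is where the communication described later comes from.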

Matlab Version

Parallel Algorithm

$$
\begin{bmatrix}
P_0 & P_1 & P_2 \\
P_3 & P_4 & P_5 \\
P_6 & P_7 & P_8
\end{bmatrix}
$$

- P0: A00 = L00*U00 (compute L00, U00 by LU factorization)
- P1: U01 = L00\A01
- P2: U02 = L00\A02
- P3: L10 = A10/U00
- P6: L20 = A20/U00
- P4: A11 <- A11 - L10*U01
- P5: A12 <- A12 - L10*U02
- P7: A21 <- A21 - L20*U01
- P8: A22 <- A22 - L20*U02

In the general case, with Ãnm the updated block: Anm = Ln0*U0m + Ãnm => Ãnm = Anm - Ln0*U0m

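Parallel Communication

Contents of each processor's block after the first stage:

$$
\begin{array}{|c|c|c|}
\hline
L_{00},\ U_{00} & U_{01} & U_{02} \\ \hline
L_{10} & A_{11} & A_{12} \\ \hline
L_{20} & A_{21} & A_{22} \\ \hline
\end{array}
$$

- P0: L00,U00 = lu(A00)
- P1: U01 = L00\A01
- P2: U02 = L00\A02
- P3: L10 = A10/U00
- P6: L20 = A20/U00
- P4: A11 <- A11 - L10*U01
- P5: A12 <- A12 - L10*U02
- P7: A21 <- A21 - L20*U01
- P8: A22 <- A22 - L20*U02

In the general case, with Ãnm the updated block: Anm = Ln0*U0m + Ãnm => Ãnm = Anm - Ln0*U0m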

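Communication Summary

Computation:

- P0: L00,U00 = lu(A00)
- P1: U01 = L00\A01
- P2: U02 = L00\A02
- P3: L10 = A10/U00
- P6: L20 = A20/U00
- P4: A11 <- A11 - L10*U01
- P5: A12 <- A12 - L10*U02
- P7: A21 <- A21 - L20*U01
- P8: A22 <- A22 - L20*U02

Communication:

- P0: sends L00 to P1,P2; sends U00 to P3,P6
- P1: sends U01 to P4,P7
- P2: sends U02 to P5,P8
- P3: sends L10 to P4,P5
- P6: sends L20 to P7,P8

Processor grid and block contents after the first stage:

$$
\begin{bmatrix}
P_0 & P_1 & P_2 \\
P_3 & P_4 & P_5 \\
P_6 & P_7 & P_8
\end{bmatrix}
\qquad
\begin{array}{|c|c|c|}
\hline
L_{00},\ U_{00} & U_{01} & U_{02} \\ \hline
L_{10} & A_{11} & A_{12} \\ \hline
L_{20} & A_{21} & A_{22} \\ \hline
\end{array}
$$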
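The send pattern generalizes to any stage and any square grid: the diagonal processor sends its L block along its row and its U block down its column, and each newly computed U or L block goes to the trailing processors in its column or row. A small sketch that lists these messages follows, assuming row-by-row rank numbering; `stage_sends` is a hypothetical helper and only describes the messages, it does not perform any MPI calls.

```python
def stage_sends(k, q):
    """Messages for stage k of block LU on a q-by-q processor grid.
    Returns (sender rank, block name, receiver ranks) tuples."""
    def rank(r, c):
        return r * q + c

    rest = range(k + 1, q)  # block rows/columns still to be processed
    sends = []
    if k + 1 < q:
        # The diagonal processor shares its factors with its row and column.
        sends.append((rank(k, k), f"L{k}{k}", [rank(k, m) for m in rest]))
        sends.append((rank(k, k), f"U{k}{k}", [rank(n, k) for n in rest]))
    for m in rest:
        # U_km travels down column m to the trailing blocks.
        sends.append((rank(k, m), f"U{k}{m}", [rank(n, m) for n in rest]))
    for n in rest:
        # L_nk travels along row n to the trailing blocks.
        sends.append((rank(n, k), f"L{n}{k}", [rank(n, m) for m in rest]))
    return sends

for stage in range(3):
    for sender, block, receivers in stage_sends(stage, 3):
        print(f"stage {stage}: P{sender} sends {block} to "
              + ",".join(f"P{r}" for r in receivers))
```

For a 3-by-3 grid, stage 0 reproduces the five senders listed above and the last stage needs no messages at all.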

Upshot

Notes:

1) I added an MPI_Barrier purely to separate the LU factorization and the backsolve.
2) In terms of efficiency we can see that quite a bit of time is spent in MPI_Wait compared to compute time.
3) The compute part of this code can be optimized much more, which would make the parallel efficiency even worse.

Communication events labeled (a) to (h) in the Upshot timeline:

1st stage:

- (a) P0: sends L00 to P1,P2; sends U00 to P3,P6
- (b) P1: sends U01 to P4,P7
- (c) P2: sends U02 to P5,P8
- (d) P3: sends L10 to P4,P5
- (e) P6: sends L20 to P7,P8

2nd stage:

- (f) P4: sends L11 to P5; sends U11 to P7
- (g) P5: sends U12 to P8
- (h) P7: sends L21 to P8

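Block Back Solve

After factorization we are left with the task of using the distributed L and U to compute the backsolve.

Block distribution of L and U:

$$
\begin{array}{|c|c|c|}
\hline
P_0:\ L_{00},\ U_{00} & P_1:\ U_{01} & P_2:\ U_{02} \\ \hline
P_3:\ L_{10} & P_4:\ L_{11},\ U_{11} & P_5:\ U_{12} \\ \hline
P_6:\ L_{20} & P_7:\ L_{21} & P_8:\ L_{22},\ U_{22} \\ \hline
\end{array}
$$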

Recall

Given an LU factorization of A, namely L, U such that A=LU, we can solve Ax=b in two triangular solves:

- y = L\b (forward substitution)
- x = U\y (back substitution)
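The two triangular solves can be written out directly. This is a minimal serial sketch in pure Python, assuming L and U are stored as full lists of rows; `forward_sub` and `back_sub` are illustrative names.

```python
def forward_sub(L, b):
    """Solve L*y = b for lower-triangular L, first row to last."""
    y = []
    for i, row in enumerate(L):
        s = b[i] - sum(row[k] * y[k] for k in range(i))
        y.append(s / row[i])
    return y

def back_sub(U, y):
    """Solve U*x = y for upper-triangular U, last row to first."""
    n = len(U)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / U[i][i]
    return x

# A = [[2, 1], [4, 5]] factorizes as L*U with the factors below.
L = [[1.0, 0.0], [2.0, 1.0]]
U = [[2.0, 1.0], [0.0, 3.0]]
b = [4.0, 14.0]          # equals A * [1, 2]
y = forward_sub(L, b)    # y = L \ b
x = back_sub(U, y)       # x = U \ y
print(x)
```

Once L and U are known, each new right-hand side b costs only these two solves, which is why the factorization pays off for a sequence of different b vectors.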

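Distributed Back Solve

$$
\begin{bmatrix}
L_{00} & 0 & 0 \\
L_{10} & L_{11} & 0 \\
L_{20} & L_{21} & L_{22}
\end{bmatrix}
\begin{bmatrix}
y_0 \\ y_1 \\ y_2
\end{bmatrix}
=
\begin{bmatrix}
b_0 \\ b_1 \\ b_2
\end{bmatrix}
$$

- P0: solve L00*y0 = b0; send y0 to P3,P6
- P3: send L10*y0 to P4
- P4: solve L11*y1 = b1 - L10*y0; send y1 to P7
- P6: send L20*y0 to P8
- P7: send L21*y1 to P8
- P8: solve L22*y2 = b2 - L20*y0 - L21*y1

Results: y0 on P0, y1 on P4, y2 on P8

$$
\begin{bmatrix}
P_0 & P_1 & P_2 \\
P_3 & P_4 & P_5 \\
P_6 & P_7 & P_8
\end{bmatrix}
$$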
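The same sequence can be simulated serially to see the data dependencies. In this sketch each product L_nm*y_m stands for the work done on an off-diagonal processor and each triangular solve for the work on a diagonal processor. It assumes blocks stored as nested lists, and the function names are hypothetical.

```python
def matvec(M, v):
    """Product of a list-of-lists matrix with a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def forward_sub(L, b):
    """Solve L*y = b for a lower-triangular block L."""
    y = []
    for i, row in enumerate(L):
        s = b[i] - sum(row[k] * y[k] for k in range(i))
        y.append(s / row[i])
    return y

def block_forward_solve(Lb, bb):
    """Lb[n][m] is block L_nm (only m <= n is used); bb[n] is block b_n.
    Returns the blocks [y_0, y_1, ...]."""
    y = []
    for n in range(len(bb)):
        rhs = list(bb[n])
        for m in range(n):
            # Off-diagonal processor (n, m): form L_nm * y_m and pass it
            # to the diagonal processor of block row n.
            contrib = matvec(Lb[n][m], y[m])
            rhs = [r - c for r, c in zip(rhs, contrib)]
        # Diagonal processor (n, n): solve L_nn * y_n = b_n - sum_m L_nm * y_m.
        y.append(forward_sub(Lb[n][n], rhs))
    return y

L00 = [[1.0, 0.0], [2.0, 1.0]]
L10 = [[1.0, 1.0], [3.0, 1.0]]
L11 = [[1.0, 0.0], [2.0, 1.0]]
Lb = [[L00, None], [L10, L11]]
bb = [[1.0, 4.0], [6.0, 15.0]]
print(block_forward_solve(Lb, bb))
```

Block y_n cannot be computed until every earlier y_m is available, so the diagonal processors work one after another.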

Matlab Code

Back Solve

After the factorization we compute a solution to Ax=b. This consists of solving two distributed block triangular systems.

Barrier Between Back Solves

This time I inserted an MPI_Barrier call between the backsolves. This highlights the serial nature of the backsolves.

Example Code

- http://www.math.unm.edu/~timwar/MA471F03/blocklu.m
- http://www.math.unm.edu/~timwar/MA471F03/parlufact2.c