Block LU Factorization Lecture 24 MA471 Fall 2003.


1 Block LU Factorization Lecture 24 MA471 Fall 2003

2 Example Case
1) Suppose we are faced with the solution of a linear system Ax = b.
2) Further suppose:
   1) A is large (dim(A) > 10,000)
   2) A is dense
   3) A is full
   4) We have a sequence of different b vectors.

3 Problems
Suppose we are able to compute the matrix:
- It costs N^2 doubles to store the matrix.
- E.g. for N = 100,000 we require 8*10^10 bytes (about 74.5 gigabytes) of storage for the matrix alone.
- 32-bit processors are limited to 4 gigabytes of memory.
- Most desktops (even 64-bit) do not have 74.5 gigabytes.
- What to do?
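A quick Matlab check of that storage arithmetic (added for this transcript, not part of the original slides):

    % Memory needed to store a dense N x N matrix of 8-byte doubles
    N = 100000;
    bytes = 8 * N^2;        % 8e10 bytes
    gib = bytes / 2^30      % about 74.5 (binary) gigabytes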

4 Divide and Conquer

    P0  P1  P2  P3
    P4  P5  P6  P7
    P8  P9  P10 P11
    P12 P13 P14 P15

One approach is to assume we have a square number of processors. We then divide the matrix into blocks, storing one block per processor (a 4-by-4 processor grid is shown above).
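For illustration (my own sketch, not from the original slides), Matlab's mat2cell produces exactly this kind of block decomposition:

    % Carve an N x N matrix into a p x p grid of square blocks,
    % one block per processor; requires mod(N, p) == 0.
    N = 8; p = 4;
    A = rand(N);
    nb = N / p;                                        % block size
    blocks = mat2cell(A, nb*ones(1,p), nb*ones(1,p));
    % blocks{i,j} is the block that processor P((i-1)*p + j - 1) stores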

5 Back to the Linear System
We are now faced with LU factorization of a distributed matrix. This calls for a modified LU routine which acts on blocks of the matrix. We will demonstrate this algorithm for one level; i.e., we need to construct matrices L, U such that A = LU, while storing only single blocks of A, L, U on any processor.

6 Constructing the Block LU Factorization

    [ A00 A01 A02 ]   [ L00  0  0 ]   [ U00 U01 U02 ]
    [ A10 A11 A12 ] = [ L10  1  0 ] * [  0  ?11 ?12 ]
    [ A20 A21 A22 ]   [ L20  0  1 ]   [  0  ?21 ?22 ]

First we LU factorize A00 and look for the above block factorization. We need to work out what each of the entries is:

    A00 = L00*U00   (compute L00, U00 by LU factorization)
    A01 = L00*U01   =>  U01 = L00\A01
    A02 = L00*U02   =>  U02 = L00\A02
    A10 = L10*U00   =>  L10 = A10/U00
    A20 = L20*U00   =>  L20 = A20/U00
    A11 = L10*U01 + ?11  =>  ?11 = A11 - L10*U01
    ...

7 cont.

    A00 = L00*U00   (compute L00, U00 by LU factorization)
    A01 = L00*U01   =>  U01 = L00\A01
    A02 = L00*U02   =>  U02 = L00\A02
    A10 = L10*U00   =>  L10 = A10/U00
    A20 = L20*U00   =>  L20 = A20/U00
    A11 = L10*U01 + ?11  =>  ?11 = A11 - L10*U01
    A12 = L10*U02 + ?12  =>  ?12 = A12 - L10*U02
    A21 = L20*U01 + ?21  =>  ?21 = A21 - L20*U01
    A22 = L20*U02 + ?22  =>  ?22 = A22 - L20*U02

In the general case: Anm = Ln0*U0m + ?nm  =>  ?nm = Anm - Ln0*U0m

8 Summary: First Stage

    [ A00 A01 A02 ]   [ L00  0  0 ]   [ U00 U01 U02 ]
    [ A10 A11 A12 ] = [ L10  1  0 ] * [  0  ?11 ?12 ]
    [ A20 A21 A22 ]   [ L20  0  1 ]   [  0  ?21 ?22 ]

First step: LU factorize the uppermost diagonal block A00.
Second step: a) compute U0n = L00\A0n, n > 0
             b) compute Ln0 = An0/U00, n > 0
Third step: compute ?nm = Anm - Ln0*U0m, (n, m > 0)
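A minimal Matlab sketch of this first stage (my own, added for this transcript), assuming the blocks are held in a 3-by-3 cell array A{i,j} with 1-based indices, so A{1,1} plays the role of A00:

    A = mat2cell(rand(9) + 9*eye(9), [3 3 3], [3 3 3]);  % example blocks
    [L00, U00] = lu(A{1,1});                 % first step
    U0 = cell(1,3); L0 = cell(1,3);
    for n = 2:3
        U0{n} = L00 \ A{1,n};                % second step a)
        L0{n} = A{n,1} / U00;                % second step b)
    end
    for n = 2:3
        for m = 2:3
            A{n,m} = A{n,m} - L0{n}*U0{m};   % third step: ?nm
        end
    end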

9 Now Factorize the Lower SE Block

    [ ?11 ?12 ]   [ L11  0 ]   [ U11 U12 ]
    [ ?21 ?22 ] = [ L21  1 ] * [  0  ??22 ]

We repeat the previous algorithm, this time on the two-by-two southeast (SE) block.

10 End Result

    [ A00 A01 A02 ]   [ L00  0   0  ]   [ U00 U01 U02 ]
    [ A10 A11 A12 ] = [ L10 L11  0  ] * [  0  U11 U12 ]
    [ A20 A21 A22 ]   [ L20 L21 L22 ]   [  0   0  U22 ]

11 Matlab Version
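The Matlab listing on the original slide is not preserved in this transcript. The following sketch (the function name blocklu is my own) applies the first-stage algorithm repeatedly, which is equivalent to factorizing each successive SE block:

    function [L, U] = blocklu(A)
    % BLOCKLU  Block LU factorization of a p x p cell array of blocks.
    % Returns cell arrays L and U with A = L*U blockwise; cells left
    % empty above the diagonal of L (below the diagonal of U) stand
    % for zero blocks.
    p = size(A, 1);
    L = cell(p); U = cell(p);
    for k = 1:p
        [L{k,k}, U{k,k}] = lu(A{k,k});      % factorize diagonal block
        for n = k+1:p
            U{k,n} = L{k,k} \ A{k,n};       % block row of U
            L{n,k} = A{n,k} / U{k,k};       % block column of L
        end
        for n = k+1:p                       % Schur complement update
            for m = k+1:p
                A{n,m} = A{n,m} - L{n,k} * U{k,m};
            end
        end
    end
    end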

12 Parallel Algorithm

    P0 P1 P2
    P3 P4 P5
    P6 P7 P8

P0: A00 = L00*U00   (compute L00, U00 by LU factorization)
P1: U01 = L00\A01
P2: U02 = L00\A02
P3: L10 = A10/U00
P6: L20 = A20/U00
P4: A11 <- A11 - L10*U01
P5: A12 <- A12 - L10*U02
P7: A21 <- A21 - L20*U01
P8: A22 <- A22 - L20*U02

In the general case: Anm = Ln0*U0m + ?nm  =>  ?nm = Anm - Ln0*U0m

13 Parallel Communication

Block contents after the first stage, by processor:

    L00,U00  U01  U02
    L10      A11  A12
    L20      A21  A22

P0: L00,U00 = lu(A00)
P1: U01 = L00\A01
P2: U02 = L00\A02
P3: L10 = A10/U00
P6: L20 = A20/U00
P4: A11 <- A11 - L10*U01
P5: A12 <- A12 - L10*U02
P7: A21 <- A21 - L20*U01
P8: A22 <- A22 - L20*U02

In the general case: Anm = Ln0*U0m + ?nm  =>  ?nm = Anm - Ln0*U0m

14 Communication Summary

Tasks:
P0: L00,U00 = lu(A00)
P1: U01 = L00\A01
P2: U02 = L00\A02
P3: L10 = A10/U00
P6: L20 = A20/U00
P4: A11 <- A11 - L10*U01
P5: A12 <- A12 - L10*U02
P7: A21 <- A21 - L20*U01
P8: A22 <- A22 - L20*U02

Communication:
P0: sends L00 to P1,P2 and U00 to P3,P6
P1: sends U01 to P4,P7
P2: sends U02 to P5,P8
P3: sends L10 to P4,P5
P6: sends L20 to P7,P8

    P0 P1 P2      L00,U00  U01  U02
    P3 P4 P5      L10      A11  A12
    P6 P7 P8      L20      A21  A22

15 Upshot

Notes:
1) I added an MPI_Barrier purely to separate the LU factorization and the backsolve.
2) In terms of efficiency, we can see that quite a bit of time is spent in MPI_Wait compared to compute time.
3) The compute part of this code can be optimized much more, making the parallel efficiency even worse.

[The slide's timing chart is not preserved; its labeled communication events were:]

1st stage:
(a) P0: sends L00 to P1,P2 and U00 to P3,P6
(b) P1: sends U01 to P4,P7
(c) P2: sends U02 to P5,P8
(d) P3: sends L10 to P4,P5
(e) P6: sends L20 to P7,P8

2nd stage:
(f) P4: sends L11 to P5 and U11 to P7
(g) P5: sends U12 to P8
(h) P7: sends L21 to P8

16 Block Back Solve

After factorization we are left with the task of using the distributed L and U to compute the backsolve.

Block distribution of L and U, by processor:

    P0 P1 P2      L00,U00  U01      U02
    P3 P4 P5      L10      L11,U11  U12
    P6 P7 P8      L20      L21      L22,U22

17 Recall

Given an LU factorization of A, namely L, U such that A = LU, we can solve Ax = b by:

    y = L\b
    x = U\y
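As a small worked example (added for this transcript):

    A = rand(5) + 5*eye(5);   % example system
    b = rand(5, 1);
    [L, U] = lu(A);           % L is a row-permuted lower triangle
    y = L \ b;                % forward solve L*y = b
    x = U \ y;                % back solve    U*x = y
    norm(A*x - b)             % residual at machine-precision level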

18 Distributed Back Solve

    [ L00  0   0  ]   [ y0 ]   [ b0 ]
    [ L10 L11  0  ] * [ y1 ] = [ b1 ]
    [ L20 L21 L22 ]   [ y2 ]   [ b2 ]

P0: solve L00*y0 = b0; send y0 to P3,P6
P3: send L10*y0 to P4
P4: solve L11*y1 = b1 - L10*y0; send y1 to P7
P6: send L20*y0 to P8
P7: send L21*y1 to P8
P8: solve L22*y2 = b2 - L20*y0 - L21*y1

Results: y0 on P0, y1 on P4, y2 on P8

    P0 P1 P2
    P3 P4 P5
    P6 P7 P8

19 Matlab Code
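The Matlab listing from this slide is likewise not preserved. A serial sketch of the distributed forward solve (the name blockforward is my own), operating on the cell arrays produced by blocklu above:

    function y = blockforward(L, b)
    % BLOCKFORWARD  Solve L*y = b where L is a p x p block lower
    % triangular cell array and b is a p x 1 cell array of block
    % vectors. Each subtraction below corresponds to one of the
    % "send Lnm*ym" messages in the distributed algorithm.
    p = size(L, 1);
    y = cell(p, 1);
    for n = 1:p
        rhs = b{n};
        for m = 1:n-1
            rhs = rhs - L{n,m} * y{m};
        end
        y{n} = L{n,n} \ rhs;
    end
    end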

20 Back Solve

After the factorization we compute the solution to Ax = b. This consists of two distributed block triangular systems to solve.

21 Barrier Between Back Solves

This time I inserted an MPI_Barrier call between the backsolves. This highlights the serial nature of the backsolves.

22 Example Code
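The example code from this slide is not preserved either. A hypothetical end-to-end driver built from the sketches above (blocklu and blockforward are my own names, not the original course code):

    N = 12; p = 3; nb = N/p;
    A = rand(N) + N*eye(N);                  % well-conditioned test matrix
    b = rand(N, 1);
    Ab = mat2cell(A, nb*ones(1,p), nb*ones(1,p));
    bb = mat2cell(b, nb*ones(1,p), 1);
    [L, U] = blocklu(Ab);                    % block factorization
    y = blockforward(L, bb);                 % forward solve L*y = b
    x = cell(p, 1);                          % back solve U*x = y, bottom up
    for n = p:-1:1
        rhs = y{n};
        for m = n+1:p
            rhs = rhs - U{n,m} * x{m};
        end
        x{n} = U{n,n} \ rhs;
    end
    norm(A * cell2mat(x) - b)                % residual check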

