Parallel Sparse LU Factorization on Second-class Message Passing Platforms. Kai Shen, University of Rochester. ICS'2005, 6/22/2005.


1 Parallel Sparse LU Factorization on Second-class Message Passing Platforms. Kai Shen, University of Rochester.

2 Preliminary: Parallel Sparse LU Factorization. LU factorization with partial pivoting is used for solving a linear system Ax = b (PA = LU). Applications: device/circuit simulation, fluid dynamics, ...; it is also used within Newton's method for solving non-linear systems. Challenges for parallel sparse LU factorization: runtime data structure variation and non-uniform computation/communication patterns, which make the computation irregular.
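As a point of reference for PA = LU, here is a minimal dense, sequential sketch of LU factorization with partial pivoting in plain Python. It is not the sparse parallel solver discussed in the talk; the function names `lu_partial_pivot` and `solve` are chosen here for illustration only.

```python
def lu_partial_pivot(a):
    """Factor a square matrix so that P*A = L*U.

    Returns (perm, l, u), where row i of P*A is row perm[i] of A.
    """
    n = len(a)
    u = [row[:] for row in a]            # working copy, becomes U
    l = [[0.0] * n for _ in range(n)]    # unit lower-triangular factor
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: largest-magnitude entry in column k, rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(u[i][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]
            perm[k], perm[p] = perm[p], perm[k]
        l[k][k] = 1.0
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return perm, l, u


def solve(a, b):
    """Solve A x = b using the factorization above."""
    perm, l, u = lu_partial_pivot(a)
    n = len(a)
    y = [0.0] * n
    for i in range(n):                   # forward substitution: L y = P b
        y[i] = b[perm[i]] - sum(l[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):         # back substitution: U x = y
        x[i] = (y[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))) / u[i][i]
    return x
```

For example, `solve([[0.0, 2.0], [1.0, 1.0]], [4.0, 3.0])` needs a row swap in the first step, because the leading entry is zero. In a distributed setting, each such pivot search spans rows held by several processors, which is where the per-column communication cost comes from.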

3 Existing Solvers and Their Portability. Shared memory solvers: SuperLU [Li, Demmel et al. 1999], WSMP [Gupta 2000], PARDISO [Schenk & Gärtner 2004]. Message passing solvers: S+ [Shen et al. 2000], MUMPS [Amestoy et al. 2001], SuperLU_DIST [Li & Demmel 2004]. Existing message passing solvers are portable but perform poorly on platforms with slow message passing; they were mostly designed for parallel computers with fast interconnects. Performance portability is desirable, given the large variation in the characteristics of available platforms.

4 Example Message Passing Platforms. Three platforms running MPI: Regatta-shmem, Regatta-TCP/IP, and a PC cluster. Per-CPU peak BLAS-3 performance is 971 MFLOPS on Regatta and 1382 MFLOPS on a PC.

5 Parallel Sparse LU Factorization on the Three Platforms. Performance of S+ [Shen et al. 2000]. We investigate communication-reduction techniques to improve performance on platforms with slow communication.

6 Data Structure and Computation Steps.
    for each column block K (1 → N)
        Perform Factor(K);
        Perform SwapScale(K);
        Perform Update(K);
    endfor
Processor mapping: 1-D cyclic or 2-D cyclic (more scalable). Figure labels: column block K, row block K.
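The two processor mappings can be sketched as simple index arithmetic. The slide does not give the exact rank numbering, so the row-major numbering of the processor grid below is an assumption, and the function names are hypothetical.

```python
def owner_1d(col_block, num_procs):
    """1-D cyclic mapping: whole column block K lives on one processor."""
    return col_block % num_procs


def owner_2d(row_block, col_block, grid_rows, grid_cols):
    """2-D cyclic mapping: submatrix block (I, J) lives on one processor of a
    grid_rows x grid_cols grid (row-major rank numbering is assumed here)."""
    return (row_block % grid_rows) * grid_cols + (col_block % grid_cols)


def column_block_owners(col_block, num_blocks, grid_rows, grid_cols):
    """Ranks that hold some part of column block K under the 2-D mapping."""
    return sorted({owner_2d(i, col_block, grid_rows, grid_cols)
                   for i in range(num_blocks)})
```

Under the 1-D mapping, a column block is factored by a single processor with no communication during pivot selection. Under the 2-D mapping, a column block is spread over one column of the processor grid, so each pivot search inside Factor(K) involves every rank returned by `column_block_owners`.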

7 Large Diagonal Batch Pivoting. Batch pivoting reduces communication. Large diagonal batch pivoting: locate the largest elements for all columns in a block using one round of communication. Using them as pivot elements may be numerically unstable, so we check the error and fall back to the original pivoting if necessary. Previous approaches [Duff and Koster 1999, 2001; Li & Demmel 2004] use it in iterative methods.
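A rough single-process sketch of the idea follows. The slide does not specify the error check, so the multiplier bound `max_multiplier` used here is an assumed stand-in, and all function names are hypothetical. The panel is a list of rows of one column block and is assumed to have at least as many rows as columns.

```python
def choose_large_diagonal_pivots(panel):
    """For each column, pick the not-yet-used row with the largest |entry|.

    All choices are made on the original (unfactored) entries, which is what
    allows them to be found in a single round of communication.
    """
    rows, cols = len(panel), len(panel[0])
    used, pivots = set(), []
    for j in range(cols):
        best = max((i for i in range(rows) if i not in used),
                   key=lambda i: abs(panel[i][j]))
        used.add(best)
        pivots.append(best)
    return pivots


def eliminate_with_fixed_pivots(panel, pivots, max_multiplier=10.0):
    """Eliminate with pre-selected pivots; return None if the check fails."""
    work = [row[:] for row in panel]
    done = set()
    for j, p in enumerate(pivots):
        done.add(p)
        if work[p][j] == 0.0:
            return None
        for i in range(len(work)):
            if i in done:
                continue
            m = work[i][j] / work[p][j]
            if abs(m) > max_multiplier:   # assumed stability test
                return None
            for k in range(j, len(work[i])):
                work[i][k] -= m * work[p][k]
    return work


def batch_pivot(panel):
    """Return the batch pivots, or None to signal fallback to the original
    column-by-column pivoting."""
    pivots = choose_large_diagonal_pivots(panel)
    if eliminate_with_fixed_pivots(panel, pivots) is None:
        return None
    return pivots
```

With the panel `[[2.0, 2.0], [1.0, 1.0001], [0.0, 1.0]]`, row 1 looks like a good pivot for the second column before elimination but becomes tiny after the first column is eliminated, so `batch_pivot` returns `None`. This is the kind of failure that motivates the fallback.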

8 Speculative Batch Pivoting. Large diagonal batch pivoting frequently fails the numerical stability test. Speculative batch pivoting: collect candidate pivot rows (for all columns in a block) at one processor using one gather communication; perform factorization at that processor to determine the pivots; check the error and fall back to the original pivoting if necessary. Both batch pivoting strategies require additional computation and may slightly weaken numerical stability.
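The gather-then-factor step might look like the following single-process simulation. How each processor nominates its candidate rows is not stated on the slide; nominating the locally largest row per column is an assumption made here, and the function names are hypothetical. Each processor's data is a list of `(global_row_index, values)` pairs.

```python
def local_candidates(local_rows):
    """Nominate, for each column, the local row with the largest |entry|.

    This selection rule is an assumption; duplicates are merged.
    """
    cols = len(local_rows[0][1])
    chosen = {}
    for j in range(cols):
        idx, vals = max(local_rows, key=lambda r: abs(r[1][j]))
        chosen[idx] = vals
    return sorted(chosen.items())


def speculative_pivots(per_processor_rows):
    """Simulate one gather, then factor the candidates on one processor.

    Returns the global row indices chosen as pivots, or None when the
    candidates cannot supply a nonzero pivot (caller falls back).
    """
    gathered = [c for rows in per_processor_rows for c in local_candidates(rows)]
    ids = [g for g, _ in gathered]
    work = [list(v) for _, v in gathered]
    cols = len(work[0])
    if len(work) < cols:
        return None
    pivots = []
    for j in range(cols):
        # Ordinary partial pivoting, restricted to the gathered rows.
        p = max(range(j, len(work)), key=lambda i: abs(work[i][j]))
        if work[p][j] == 0.0:
            return None
        work[j], work[p] = work[p], work[j]
        ids[j], ids[p] = ids[p], ids[j]
        pivots.append(ids[j])
        for i in range(j + 1, len(work)):
            m = work[i][j] / work[j][j]
            for k in range(j, cols):
                work[i][k] -= m * work[j][k]
    return pivots
```

Unlike the large diagonal variant, the pivots here are chosen on partially eliminated values, so a row that only looked large before elimination is not selected. The error check against the full column block and the fallback path are left out of this sketch.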

9 Performance on Regatta-shmem. Virtually no performance benefit. Legend: LD = large diagonal batch pivoting; SBP = speculative batch pivoting; TP = threshold pivoting [Duff et al. 1986].

10 Performance on Platforms with Slower Message Passing. PC cluster: the improvement of SBP is 28-292% for a set of 8 test matrices. Regatta-TCP/IP: the improvement is up to 48%.

11 Application Adaptation. Communication-reduction techniques are effective on platforms with relatively slow message passing but ineffective on first-class platforms, where their by-products (e.g., additional computation) may not be worthwhile. Sampling-based adaptation: collect application statistics in a sampling phase and combine them with platform characteristics to adaptively determine whether candidate techniques should be employed.
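One way to picture the decision is a simple cost comparison. The slide does not give the actual statistics or decision rule, so the fields, the rule, and every number below are hypothetical and serve only to illustrate combining sampled application statistics with platform characteristics.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Platform characteristics (hypothetical fields)."""
    message_latency_s: float   # cost of one small message, in seconds
    flops_per_s: float         # sustained per-CPU compute rate


@dataclass
class SampleStats:
    """Application statistics from a sampling phase (hypothetical fields)."""
    messages_saved: int        # messages a technique would eliminate
    extra_flops: float         # additional computation it would introduce


def should_enable(platform, stats):
    """Enable the technique only if the estimated communication time saved
    exceeds the estimated extra computation time (an assumed rule)."""
    time_saved = stats.messages_saved * platform.message_latency_s
    time_added = stats.extra_flops / platform.flops_per_s
    return time_saved > time_added


# Illustrative, made-up numbers; not measurements from the talk.
stats = SampleStats(messages_saved=1000, extra_flops=2e7)
fast = Platform(message_latency_s=5e-6, flops_per_s=1e9)
slow = Platform(message_latency_s=1e-4, flops_per_s=1e9)
decisions = (should_enable(fast, stats), should_enable(slow, stats))
```

With these made-up values the same application statistics lead to opposite decisions on the two platforms, which is the behavior the adaptation is meant to capture.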

12 Adaptation on Regatta-shmem. The “Adaptive” version disables the communication-reduction techniques for most matrices and achieves numerical stability similar to the “Original” version.

13 Adaptation on the PC Cluster. The “Adaptive” version employs the communication-reduction techniques for all matrices and performs close to the TP+SBP version.

14 Conclusion. Contributions: we propose communication-reduction techniques to improve LU factorization performance on platforms with relatively slow message passing, and a runtime sampling-based adaptation that automatically chooses the appropriate version of the application. http://www.cs.rochester.edu/~kshen/research/s+/

