Presentation transcript:

6/22/2005 ICS'2005, Slide 1: Parallel Sparse LU Factorization on Second-class Message Passing Platforms. Kai Shen, University of Rochester.

6/22/2005 ICS'2005, Slide 2: Preliminary: Parallel Sparse LU Factorization
- LU factorization with partial pivoting: used for solving a linear system Ax = b (PA = LU); a background sketch follows below.
- Applications: device/circuit simulation, fluid dynamics, ...; e.g., within Newton's method for solving non-linear systems.
- Challenges for parallel sparse LU factorization: runtime data structure variation and non-uniform computation/communication patterns ⇒ irregular.
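For readers who need the background, here is a minimal dense sketch of LU factorization with partial pivoting (PA = LU) in C. It is not code from the talk; the solver itself operates on sparse, block-structured data, and the routine name and in-place layout are illustrative assumptions.

```c
/* Sketch: dense LU with partial pivoting, computed in place.  After the call,
 * U occupies the upper triangle, L (unit diagonal implicit) the strict lower
 * triangle, and perm[] records the row permutation P. */
#include <math.h>

void lu_partial_pivot(double *A, int *perm, int n)
{
    for (int i = 0; i < n; i++) perm[i] = i;

    for (int k = 0; k < n; k++) {
        /* Partial pivoting: pick the row with the largest |A[i][k]|, i >= k. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i * n + k]) > fabs(A[p * n + k])) p = i;
        if (p != k) {                          /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = A[k * n + j];
                A[k * n + j] = A[p * n + j];
                A[p * n + j] = t;
            }
            int t = perm[k]; perm[k] = perm[p]; perm[p] = t;
        }
        /* Eliminate below the pivot; multipliers (L) stay in the lower triangle. */
        for (int i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];
            for (int j = k + 1; j < n; j++)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
}
```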

6/22/2005 ICS'2005, Slide 3: Existing Solvers and Their Portability
- Shared memory solvers: SuperLU [Li, Demmel et al. 1999], WSMP [Gupta 2000], PARDISO [Schenk & Gärtner 2004].
- Message passing solvers: S+ [Shen et al. 2000], MUMPS [Amestoy et al. 2001], SuperLU_DIST [Li & Demmel 2004].
- Existing message passing solvers are portable, but perform poorly on platforms with slow message passing; they are mostly designed for parallel computers with fast interconnect.
- Performance portability is desirable, given the large variation in the characteristics of available platforms.

6/22/2005 ICS'2005, Slide 4: Example Message Passing Platforms
- Three platforms running MPI: Regatta-shmem, Regatta-TCP/IP, PC cluster.
- Per-CPU peak BLAS-3 performance is 971 MFLOPS on the Regatta and 1382 MFLOPS on a PC (see the measurement sketch below).
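As a rough illustration of how a per-CPU peak BLAS-3 figure like those above can be estimated, the sketch below times one large DGEMM through the CBLAS interface. The matrix size and the use of CBLAS are assumptions, not details from the talk.

```c
/* Sketch: estimate a single-CPU BLAS-3 rate by timing C = A * B (DGEMM). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    const int n = 1000;                        /* n x n matrices (assumed size) */
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("DGEMM rate: %.0f MFLOPS\n", 2.0 * n * n * n / secs / 1e6);

    free(A); free(B); free(C);
    return 0;
}
```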

6/22/2005 ICS'2005, Slide 5: Parallel Sparse LU Factorization on the Three Platforms
- Performance of S+ [Shen et al. 2000].
- We investigate communication-reduction techniques to improve performance on platforms with slow communication.

6/22/2005 ICS'2005, Slide 6: Data Structure and Computation Steps

  for each column block K (1 → N)
      Perform Factor(K);
      Perform SwapScale(K);
      Perform Update(K);
  endfor

- Processor mapping: 1-D cyclic, or 2-D cyclic (more scalable); see the sketch below.
- (Figure on slide: matrix layout highlighting column block K and row block K.)
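A sketch of how the loop above might be driven under a 2-D cyclic mapping. This is not the S+ code: factor_block, swap_scale and update are hypothetical kernel names standing in for the Factor/SwapScale/Update steps, the PR x PC grid dimensions are assumed, and the barrier is a simplification (real solvers pipeline and overlap these steps).

```c
/* Sketch: per-block outer loop with a 2-D cyclic mapping of block (I,J)
 * onto a PR x PC process grid. */
#include <mpi.h>

#define PR 2                                 /* process-grid rows (assumed)    */
#define PC 2                                 /* process-grid columns (assumed) */

static int owner(int I, int J)               /* 2-D cyclic block-to-rank map   */
{
    return (I % PR) * PC + (J % PC);
}

void factor_block(int K);                    /* hypothetical: factorize column block K        */
void swap_scale(int K);                      /* hypothetical: apply row swaps / scale L part  */
void update(int K);                          /* hypothetical: update trailing blocks we own   */

void sparse_lu_outer_loop(int nblocks, MPI_Comm comm)
{
    int myrank;
    MPI_Comm_rank(comm, &myrank);

    for (int K = 0; K < nblocks; K++) {
        if (owner(K, K) == myrank)           /* owner of the diagonal block leads Factor(K)   */
            factor_block(K);
        swap_scale(K);                       /* every rank applies pivoting to its own blocks */
        update(K);                           /* ...and updates the trailing blocks it owns    */
        MPI_Barrier(comm);                   /* simplification; real solvers pipeline blocks  */
    }
}
```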

6/22/2005 ICS'2005, Slide 7: Large Diagonal Batch Pivoting
- Batch pivoting to reduce communication: locate the largest element of every column in a block using one round of communication (one-collective sketch below).
- Using these elements as pivots may be numerically unstable; we check the error and fall back to the original pivoting if necessary.
- Previous approaches [Duff and Koster 1999, 2001; Li & Demmel 2004] use the large-diagonal permutation in iterative methods.
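The sketch below shows one way the "one round of communication" could look in MPI (an assumption, not the S+ implementation): every column's local maximum magnitude is combined with MPI_MAXLOC so that all ranks learn both the winning value and the rank that holds it, in a single collective. The block width BW and the column-major local layout are assumed.

```c
/* Sketch: largest-magnitude element of each column in a block of width BW,
 * found with one MPI_Allreduce instead of per-column pivot searches. */
#include <math.h>
#include <mpi.h>

#define BW 32                                /* block width (assumed) */

struct maxloc { double val; int rank; };     /* layout matches MPI_DOUBLE_INT */

void batch_column_maxima(const double *local_cols,   /* BW columns, column-major */
                         int local_rows,
                         struct maxloc *global_max,  /* out: BW winners */
                         MPI_Comm comm)
{
    int myrank;
    MPI_Comm_rank(comm, &myrank);

    struct maxloc local_max[BW];
    for (int j = 0; j < BW; j++) {           /* local max magnitude per column */
        local_max[j].val = 0.0;
        local_max[j].rank = myrank;
        for (int i = 0; i < local_rows; i++) {
            double v = fabs(local_cols[j * local_rows + i]);
            if (v > local_max[j].val) local_max[j].val = v;
        }
    }
    /* One collective for the whole block. */
    MPI_Allreduce(local_max, global_max, BW, MPI_DOUBLE_INT, MPI_MAXLOC, comm);
}
```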

6/22/2005 ICS'2005, Slide 8: Speculative Batch Pivoting
- Large diagonal batch pivoting fails the numerical stability test frequently.
- Speculative batch pivoting: collect candidate pivot rows (for all columns in a block) at one processor using one gather communication; perform a factorization at that processor to determine the pivots; check the error and fall back to the original pivoting if necessary (sketch below).
- Both batch pivoting strategies require additional computation and may slightly weaken numerical stability.
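A sketch of the communication structure, again an assumption rather than the S+ code: one gather brings every rank's candidate pivot rows to a root, which speculatively factorizes the small candidate matrix to choose pivots for the whole block. choose_pivots is a hypothetical helper (a dense partial-pivoting routine like the one sketched on slide 2, applied to the gathered candidates, would serve), and the stability check with fall-back is omitted.

```c
/* Sketch: speculative batch pivoting with one gather and one broadcast. */
#include <mpi.h>
#include <stdlib.h>

#define BW 32                                /* block width (assumed) */

/* Hypothetical helper: pick BW pivot rows from the m x BW gathered candidate
 * matrix by factorizing it with partial pivoting. */
void choose_pivots(double *cand, int m, int *pivots);

void speculative_batch_pivot(double *my_cand_rows,   /* this rank's BW x BW candidates */
                             int *pivots,            /* out: BW speculated pivot rows  */
                             MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    double *all_cand = NULL;
    if (rank == 0)
        all_cand = malloc(sizeof(double) * (size_t)nprocs * BW * BW);

    /* One gather replaces per-column pivot communication. */
    MPI_Gather(my_cand_rows, BW * BW, MPI_DOUBLE,
               all_cand,     BW * BW, MPI_DOUBLE, 0, comm);

    if (rank == 0)
        choose_pivots(all_cand, nprocs * BW, pivots);

    /* Everyone learns the speculated pivots; the real algorithm then verifies
     * numerical stability and falls back to original pivoting if the check fails. */
    MPI_Bcast(pivots, BW, MPI_INT, 0, comm);

    if (rank == 0)
        free(all_cand);
}
```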

6/22/2005 ICS'2005, Slide 9: Performance on Regatta-shmem
- Virtually no performance benefit from the communication-reduction techniques on this platform.
- Legend: LD = large diagonal; SBP = speculative batch pivoting; TP = threshold pivoting [Duff et al. 1986] (see the sketch below).
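Since threshold pivoting (TP) only appears in the legend, here is a minimal sketch of its standard acceptance test: a candidate row may serve as the pivot of column k if its magnitude is within a factor u of the column maximum (u = 1.0 recovers partial pivoting). The function name and interface are illustrative assumptions, not from the paper.

```c
/* Sketch of the threshold-pivoting acceptance test. */
#include <math.h>

int threshold_pivot_ok(const double *col, int nrows, int k, int cand, double u)
{
    double colmax = 0.0;
    for (int i = k; i < nrows; i++)          /* column maximum at or below row k */
        if (fabs(col[i]) > colmax) colmax = fabs(col[i]);
    return fabs(col[cand]) >= u * colmax;    /* accept if "large enough" */
}
```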

6/22/2005 ICS'2005, Slide 10: Performance on Platforms with Slower Message Passing
- PC cluster: improvement of SBP is % for a set of 8 test matrices.
- Regatta-TCP/IP: the improvement is up to 48%.

6/22/2005 ICS'2005, Slide 11: Application Adaptation
- Communication-reduction techniques are effective on platforms with relatively slow message passing, but ineffective on first-class platforms, where their by-products (e.g., additional computation) may not be worthwhile.
- Sampling-based adaptation: collect application statistics in a sampling phase and couple them with platform characteristics to adaptively determine whether the candidate techniques should be employed (decision sketch below).
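The sketch below illustrates one possible form of such a decision, not the paper's exact cost model: the sampling phase estimates how much communication a technique would eliminate and how much extra computation it would add, and the technique is enabled only if the projected saving exceeds the projected cost on the current platform. All structure and field names are assumptions.

```c
/* Sketch: sampling-based decision for enabling a communication-reduction technique. */
#include <stddef.h>

struct platform {               /* measured once per platform (assumed fields)   */
    double latency_sec;         /* per-message latency                           */
    double sec_per_byte;        /* inverse bandwidth                             */
    double sec_per_flop;        /* inverse BLAS-3 compute rate                   */
};

struct sample {                 /* collected during the sampling phase           */
    long   msgs_saved;          /* messages the technique would eliminate        */
    double bytes_saved;         /* traffic it would eliminate                    */
    double extra_flops;         /* additional computation it introduces          */
};

/* Enable the technique only if its projected saving exceeds its projected cost. */
int enable_technique(const struct platform *p, const struct sample *s)
{
    double comm_saved = s->msgs_saved * p->latency_sec
                      + s->bytes_saved * p->sec_per_byte;
    double extra_cost = s->extra_flops * p->sec_per_flop;
    return comm_saved > extra_cost;
}
```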

6/22/2005 ICS'2005, Slide 12: Adaptation on Regatta-shmem
The "Adaptive" version:
- Disables the communication-reduction techniques for most matrices.
- Achieves numerical stability similar to the "Original" version.

6/22/2005 ICS'2005, Slide 13: Adaptation on the PC Cluster
The "Adaptive" version:
- Employs the communication-reduction techniques for all matrices.
- Performs close to the TP+SBP version.

6/22/2005 ICS'2005, Slide 14: Conclusion
Contributions:
- Communication-reduction techniques that improve sparse LU factorization performance on platforms with relatively slow message passing.
- Runtime sampling-based adaptation to automatically choose the appropriate version of the application.