Parallel Jacobi Algorithm Steven Dong Applied Mathematics

Overview
 Parallel Jacobi algorithm
 Different data distribution schemes
   Row-wise distribution
   Column-wise distribution
   Cyclic shifting
   Global reduction
 Domain decomposition for solving Laplacian equations
 Related MPI functions for the parallel Jacobi algorithm and your project

Linear Equation Solvers
 Direct solvers
   Gauss elimination
   LU decomposition
 Iterative solvers
   Basic iterative solvers
     Jacobi
     Gauss-Seidel
     Successive over-relaxation
   Krylov subspace methods
     Generalized minimum residual (GMRES)
     Conjugate gradient

Sequential Jacobi Algorithm
Split A = D + L + U, where
 D is the diagonal part of A
 L is the strictly lower triangular part of A
 U is the strictly upper triangular part of A
The Jacobi iteration then reads x^(k+1) = D^(-1) * (b - (L + U) x^(k)), repeated until convergence.
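A minimal sequential sketch of one such sweep in C, written directly from the splitting above; the row-major storage of A and the names (a, b, x, x_new) are illustrative assumptions rather than code from the slides.

/* One Jacobi sweep: x_new = D^{-1} (b - (L+U) x). */
void jacobi_sweep(int n, const double *a, const double *b,
                  const double *x, double *x_new)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i)
                sum += a[i*n + j] * x[j];     /* (L+U)x contribution    */
        x_new[i] = (b[i] - sum) / a[i*n + i]; /* divide by the diagonal */
    }
}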

Parallel Jacobi Algorithm: Ideas
 Shared memory or distributed memory?
   Shared-memory parallelization is very straightforward
   Here we consider a distributed-memory machine using MPI
 Questions to answer in the parallelization:
   Identify the concurrency
   Data distribution (data locality)
     How to distribute the coefficient matrix among CPUs?
     How to distribute the vector of unknowns?
     How to distribute the RHS?
   Communication: what data needs to be communicated?
 We want to:
   Achieve data locality
   Minimize the number of communications
   Overlap communications with computations
   Load balance

Row-wise Distribution
 Assume the dimension n of the matrix is divisible by the number of CPUs, P
 Blocks of m = n/P rows of the coefficient matrix A (n x n) are distributed to different CPUs
 The vector of unknowns x and the RHS b are distributed similarly
[Figure: A x = b, with A partitioned into P blocks of m rows each]
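The index arithmetic this distribution implies can be sketched as below, under the stated assumption that P divides n; the helper and variable names are hypothetical.

#include <mpi.h>

/* Compute the block of rows owned by this rank: rank p owns rows [p*m, (p+1)*m). */
void my_row_block(int n, MPI_Comm comm, int *row_start, int *nrows)
{
    int my_rank, ncpus;
    MPI_Comm_rank(comm, &my_rank);
    MPI_Comm_size(comm, &ncpus);
    *nrows     = n / ncpus;          /* m = n/P rows per CPU                */
    *row_start = my_rank * (*nrows); /* first global row owned by this rank */
}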

Data to be Communicated
 Each CPU already has all n columns of its own rows of the matrix A
 Only part of the vector x is available on a CPU, so the matrix-vector multiplication cannot be carried out directly
 The vector x therefore needs to be communicated during the computations
[Figure: A x = b, with the row blocks of A and the segments of x owned by CPU 0, CPU 1, ...]

How to Communicate Vector X?
 Gather the partial vectors x from all CPUs to form the whole vector on every CPU; the matrix-vector multiplications on different CPUs then proceed independently (textbook approach)
 Needs the MPI_Allgather() function call
 Simple to implement, but
   Involves a lot of communication
   Does not scale well to a large number of processors
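A sketch of how this MPI_Allgather() approach might look in C; the buffer names and the row-major layout of the local row block are assumptions.

#include <mpi.h>

/* y_local = A_local * x, where a_local is this rank's m-by-n row block.
   x_local holds this rank's m entries of x; x_full must hold n entries. */
void matvec_allgather(int n, int m, const double *a_local,
                      const double *x_local, double *x_full,
                      double *y_local, MPI_Comm comm)
{
    /* Every rank contributes its m entries; afterwards x_full holds all n. */
    MPI_Allgather(x_local, m, MPI_DOUBLE, x_full, m, MPI_DOUBLE, comm);

    for (int i = 0; i < m; i++) {            /* local rows */
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += a_local[i*n + j] * x_full[j];
        y_local[i] = sum;
    }
}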

How to Communicate X?
 Another method: cyclic shift
   Shift the partial vector x upward at each step
   Do a partial matrix-vector multiplication on each CPU at each step
   After P steps (P is the number of CPUs), the overall matrix-vector multiplication is complete
 Each CPU needs to communicate only with its neighboring CPUs
 Provides opportunities to overlap communication with computations
 Detailed illustration follows.

[Illustration: a 4x4 example on 4 CPUs, one row of A and one entry of x per CPU. At each of the 4 steps every CPU multiplies the matrix entries that match the x block it currently holds and accumulates the result into its row of A*x; the x entries x1..x4 are then shifted upward cyclically through CPU 0..3, so after 4 steps each CPU has its complete row of the product.]
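One possible realization of this cyclic shift in C, using a blocking MPI_Sendrecv_replace() for the rotation; the buffer names and the convention that the "upper" neighbor is rank-1 are assumptions.

#include <mpi.h>

/* y_local = A_local * x for the row-wise distribution.  a_local is this
   rank's m-by-n row block; x_block initially holds this rank's m entries
   of x and is rotated through all ranks during the computation. */
void matvec_cyclic(int n, int m, const double *a_local,
                   double *x_block, double *y_local, MPI_Comm comm)
{
    int my_rank, ncpus;
    MPI_Comm_rank(comm, &my_rank);
    MPI_Comm_size(comm, &ncpus);
    int up   = (my_rank - 1 + ncpus) % ncpus;   /* destination of each shift */
    int down = (my_rank + 1) % ncpus;           /* source of each shift      */

    for (int i = 0; i < m; i++) y_local[i] = 0.0;

    for (int step = 0; step < ncpus; step++) {
        int blk = (my_rank + step) % ncpus;     /* which x block we hold now */
        for (int i = 0; i < m; i++)             /* partial product against   */
            for (int j = 0; j < m; j++)         /* the matching column block */
                y_local[i] += a_local[i*n + blk*m + j] * x_block[j];

        /* Rotate x upward; after ncpus shifts x_block is back where it started. */
        MPI_Sendrecv_replace(x_block, m, MPI_DOUBLE, up, 0, down, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}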

Overlap Communications with Computations
 Communications:
   Each CPU needs to send its own partial vector x to the upper neighboring CPU
   Each CPU needs to receive data from the lower neighboring CPU
 Overlap communications with computations; each CPU does the following:
   Post non-blocking requests to send data to the upper neighbor and to receive data from the lower neighbor; these calls return immediately
   Do the partial computation with the data currently available
   Check the non-blocking communication status; wait if necessary
   Repeat the above steps
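The previous sketch can be reorganized along these lines with non-blocking calls and a second buffer, so the partial product overlaps the shift; the names x_cur and x_next are hypothetical.

#include <string.h>
#include <mpi.h>

/* Cyclic-shift matrix-vector product with communication/computation overlap.
   x_cur initially holds this rank's m entries of x; x_next is scratch space. */
void matvec_cyclic_overlap(int n, int m, const double *a_local,
                           double *x_cur, double *x_next, double *y_local,
                           MPI_Comm comm)
{
    int my_rank, ncpus;
    MPI_Comm_rank(comm, &my_rank);
    MPI_Comm_size(comm, &ncpus);
    int up   = (my_rank - 1 + ncpus) % ncpus;
    int down = (my_rank + 1) % ncpus;

    for (int i = 0; i < m; i++) y_local[i] = 0.0;

    for (int step = 0; step < ncpus; step++) {
        MPI_Request reqs[2];
        /* Post the shift first; both calls return immediately. */
        MPI_Isend(x_cur,  m, MPI_DOUBLE, up,   0, comm, &reqs[0]);
        MPI_Irecv(x_next, m, MPI_DOUBLE, down, 0, comm, &reqs[1]);

        /* Overlap: partial product with the block already on this CPU. */
        int blk = (my_rank + step) % ncpus;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                y_local[i] += a_local[i*n + blk*m + j] * x_cur[j];

        /* Make sure the shifted block has arrived before using it. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        memcpy(x_cur, x_next, m * sizeof(double));
    }
}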

Stopping Criterion
 Computing the norm of the update (or residual) requires information from the whole vector
 A global reduction (SUM) is needed to compute the norm, using MPI_Allreduce or MPI_Reduce
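A sketch of such a convergence test, assuming the 2-norm of the change between successive iterates is used; the variable names are illustrative.

#include <math.h>
#include <mpi.h>

/* Each rank sums the squares of its own m entries of (x_new - x_old);
   MPI_Allreduce with MPI_SUM gives every rank the global norm, so all
   ranks can apply the same stopping test. */
double global_update_norm(int m, const double *x_new, const double *x_old,
                          MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < m; i++) {
        double d = x_new[i] - x_old[i];
        local += d * d;
    }
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return sqrt(global);
}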

Column-wise Distribution
 Blocks of m columns of the coefficient matrix A are distributed to different CPUs
 Blocks of m rows of the vector x and the RHS b are distributed to different CPUs
[Figure: A x = b, with A partitioned into P blocks of m columns each]

Data to be Communicated
 Each CPU already has its m columns of the coefficient matrix and a block of m rows of the vector x
 So a partial A*x can be computed on each CPU independently
 Communication is needed to obtain the whole A*x
[Figure: A x = b, with the column blocks of A and the segments of x owned by different CPUs]

How to Communicate
 After getting the partial A*x, do a global reduction (SUM) using MPI_Allreduce to obtain the whole A*x; the new vector x can then be calculated
 Another method: cyclic shift
   Shift the coefficient matrix leftward and the vector of unknowns upward at each step
   Do a partial matrix-vector multiplication, and subtract it from the RHS
   After P steps (P is the number of CPUs), the matrix-vector multiplication is completed and fully subtracted from the RHS; the new vector x can then be computed
 Detailed illustration of the cyclic shift follows; a sketch of the global-reduction variant appears after the illustration.

[Illustration: a 4x4 example on 4 CPUs, one column of A and one entry of x per CPU. Every CPU keeps its own RHS entries fixed; at each of the 4 steps it subtracts the partial products of its current column of A with its current x entry from them, and then the columns of A are shifted leftward and the x entries upward, so after 4 steps each CPU holds the complete b_i - sum_j a_ij*x_j for its rows.]
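For the first (global-reduction) variant, a minimal sketch in C might look like the following; the layout of the local column block (n-by-m, row-major) and the names are assumptions.

#include <mpi.h>

/* Column-wise distribution: a_cols is this rank's n-by-m column block of A
   and x_block its m entries of x.  The length-n partial product is summed
   across ranks with MPI_Allreduce, so every rank obtains the full A*x. */
void matvec_columnwise(int n, int m, const double *a_cols,
                       const double *x_block, double *ax_partial,
                       double *ax_full, MPI_Comm comm)
{
    for (int i = 0; i < n; i++) {
        ax_partial[i] = 0.0;
        for (int j = 0; j < m; j++)
            ax_partial[i] += a_cols[i*m + j] * x_block[j];
    }
    MPI_Allreduce(ax_partial, ax_full, n, MPI_DOUBLE, MPI_SUM, comm);
}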

Solving the Diffusion Equation
 How do we solve it in parallel in practice?
 We need to do domain decomposition.

Domain Decomposition
 Column-wise decomposition of the 2D domain
 Boundary points depend on data from the neighboring CPU
 During each iteration, each CPU needs to send its own boundary data to its neighbors and receive boundary data from the neighboring CPUs
 Interior points depend only on data residing on the same CPU (local data)
[Figure: 2D x-y domain split into vertical strips assigned to cpu 0, cpu 1, ...]

Overlap Communication with Computations
 Compute boundary points and interior points at different stages; specifically:
   At the beginning of an iteration, post non-blocking send and receive requests for communicating boundary data with the neighboring CPUs
   Update the values at the interior points
   Check the communication status (it should have completed by this point); wait if necessary
   With the boundary data received, update the boundary points
   Begin the next iteration and repeat the above steps
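A sketch of one Jacobi iteration organized this way for a 1D (column-wise) decomposition of the grid; the ghost-column layout, the variable names, and the 5-point Laplacian update are assumptions that the slide does not spell out.

#include <mpi.h>

/* u and u_new are (nx_local+2) columns of ny points each, stored column by
   column; columns 0 and nx_local+1 are ghost columns for neighbor data.
   left/right are neighbor ranks (MPI_PROC_NULL at the physical boundary). */
void jacobi_dd_step(int nx_local, int ny, double *u, double *u_new,
                    int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    /* 1. Post the non-blocking exchange of boundary columns. */
    MPI_Isend(&u[1*ny],            ny, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(&u[nx_local*ny],     ny, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Irecv(&u[0],               ny, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Irecv(&u[(nx_local+1)*ny], ny, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* 2. Update interior points, which need no neighbor data. */
    for (int i = 2; i < nx_local; i++)
        for (int j = 1; j < ny - 1; j++)
            u_new[i*ny + j] = 0.25 * (u[(i-1)*ny + j] + u[(i+1)*ny + j]
                                    + u[i*ny + j - 1] + u[i*ny + j + 1]);

    /* 3. Wait for the ghost columns, then update the two boundary columns. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    for (int j = 1; j < ny - 1; j++) {
        u_new[1*ny + j]        = 0.25 * (u[0*ny + j] + u[2*ny + j]
                                       + u[1*ny + j - 1] + u[1*ny + j + 1]);
        u_new[nx_local*ny + j] = 0.25 * (u[(nx_local-1)*ny + j]
                                       + u[(nx_local+1)*ny + j]
                                       + u[nx_local*ny + j - 1]
                                       + u[nx_local*ny + j + 1]);
    }
}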

Other Domain Decompositions
[Figure: a 1D decomposition (strips) and a 2D decomposition (rectangular blocks) of the domain]

Related MPI Functions for the Parallel Jacobi Algorithm
 MPI_Allgather()
 MPI_Isend()
 MPI_Irecv()
 MPI_Reduce()
 MPI_Allreduce()

MPI Programming Related to Your Project
 Parallel Jacobi algorithm
 Compiling MPI programs
 Running MPI programs
 Machines: