Slides for Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc.


Chapter 11: Numerical Algorithms

Numerical Algorithms
In the textbook:
• Matrix multiplication
• Solving a system of linear equations

Matrices — A Review
An n x m matrix:
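The slide's matrix figure is not reproduced in this transcript; in the book's zero-based notation an n x m matrix A has the form

    A = \begin{pmatrix}
        a_{0,0}   & a_{0,1}   & \cdots & a_{0,m-1}   \\
        a_{1,0}   & a_{1,1}   & \cdots & a_{1,m-1}   \\
        \vdots    & \vdots    & \ddots & \vdots      \\
        a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,m-1}
        \end{pmatrix}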

Matrix Addition
Involves adding corresponding elements of each matrix to form the result matrix. Given the elements of A as a_{i,j} and the elements of B as b_{i,j}, each element of C is computed as

    c_{i,j} = a_{i,j} + b_{i,j}

Matrix Multiplication
Multiplication of two matrices, A and B, produces the matrix C whose elements c_{i,j} (0 <= i < n, 0 <= j < m) are computed as follows:

    c_{i,j} = \sum_{k=0}^{l-1} a_{i,k} b_{k,j}

where A is an n x l matrix and B is an l x m matrix.

Matrix multiplication, C = A x B

Matrix-Vector Multiplication, c = A x b
Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making B an n x 1 matrix (a vector). The result is an n x 1 matrix (vector).

Relationship of Matrices to Linear Equations
A system of linear equations can be written in matrix form: Ax = b
• Matrix A holds the a constants
• x is a vector of the unknowns
• b is a vector of the b constants

Implementing Matrix Multiplication: Sequential Code
Assume throughout that the matrices are square (n x n matrices). The sequential code to compute A x B could simply be:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            c[i][j] = 0;
            for (k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.

Parallel Code
With n processors (and n x n matrices), one can obtain:
• Time complexity of O(n^2) with n processors: each instance of the inner loop is independent and can be done by a separate processor.
• Time complexity of O(n) with n^2 processors: one element of A and B assigned to each processor. Cost-optimal, since O(n^3) = n x O(n^2) = n^2 x O(n).
• Time complexity of O(log n) with n^3 processors: by parallelizing the inner loop. Not cost-optimal, since O(n^3) ≠ n^3 x O(log n).
O(log n) is the lower bound for parallel matrix multiplication.
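As an illustration of the first case, a minimal shared-memory sketch (using OpenMP, which is not part of this slide set) that parallelizes the outer loop, so each thread handles a subset of the rows:

    #include <omp.h>

    /* Parallel matrix multiply: rows of C are distributed across threads,
       mirroring the "one inner-loop instance per processor" idea above. */
    void matmul_par(int n, double a[n][n], double b[n][n], double c[n][n])
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
    }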

Partitioning into Submatrices
Suppose each matrix is divided into s^2 submatrices, each with n/s x n/s elements. Using the notation A_{p,q} for the submatrix in submatrix row p and submatrix column q:

    for (p = 0; p < s; p++)
        for (q = 0; q < s; q++) {
            Cp,q = 0;                        /* clear elements of submatrix */
            for (r = 0; r < s; r++)          /* submatrix multiplication */
                Cp,q = Cp,q + Ap,r * Br,q;   /* add to accumulating submatrix */
        }

The line Cp,q = Cp,q + Ap,r * Br,q; means: multiply submatrices A_{p,r} and B_{r,q} using matrix multiplication and add the product to submatrix C_{p,q} using matrix addition. Known as block matrix multiplication.
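A concrete sequential C sketch of this block algorithm with plain arrays (illustrative, assuming s divides n; ss is the submatrix dimension):

    void block_matmul(int n, int s, double a[n][n], double b[n][n], double c[n][n])
    {
        int ss = n / s;                      /* submatrix dimension */
        for (int p = 0; p < s; p++)
            for (int q = 0; q < s; q++) {
                /* clear submatrix C[p][q] */
                for (int i = 0; i < ss; i++)
                    for (int j = 0; j < ss; j++)
                        c[p*ss + i][q*ss + j] = 0.0;
                /* accumulate A[p][r] * B[r][q] over all r */
                for (int r = 0; r < s; r++)
                    for (int i = 0; i < ss; i++)
                        for (int j = 0; j < ss; j++)
                            for (int k = 0; k < ss; k++)
                                c[p*ss + i][q*ss + j] +=
                                    a[p*ss + i][r*ss + k] * b[r*ss + k][q*ss + j];
            }
    }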

Block Matrix Multiplication


Direct Implementation
One processor computes each element of C, so n^2 processors are needed. Each processor needs one row of elements of A and one column of elements of B. Some of the same elements are sent to more than one processor. Submatrices can be used instead of single elements.

Performance Improvement
Using a tree construction, n numbers can be added in log n steps using n processors. This gives a computational time complexity of O(log n) using n^3 processors (n^2 elements of C, each a sum of n products computed by n processors).
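A minimal sequential sketch of the log-step pairwise addition (assuming n is a power of two; each outer step is one level of the tree, and all additions within a level are independent, hence parallelizable):

    /* Tree summation: log2(n) levels of independent pairwise additions. */
    double tree_sum(double x[], int n)       /* n must be a power of two */
    {
        for (int step = 1; step < n; step *= 2)          /* one tree level */
            for (int i = 0; i + step < n; i += 2 * step)
                x[i] += x[i + step];                     /* independent pairs */
        return x[0];
    }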


Recursive Implementation
Apply the same algorithm to each submatrix recursively. An excellent algorithm for shared-memory systems because of its locality of operations.
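An illustrative C version of the recursion, not the book's code, using the quadrant decomposition shown on the next slide (assumes n is a power of two, C is pre-zeroed, and all blocks share the leading dimension ld):

    /* Multiply-accumulate C += A*B on n x n blocks stored with leading dimension ld. */
    void rec_matmul(int n, int ld, double *a, double *b, double *c)
    {
        if (n <= 16) {                       /* base case: plain triple loop */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        c[i*ld + j] += a[i*ld + k] * b[k*ld + j];
            return;
        }
        int h = n / 2;                       /* split each matrix into quadrants */
        double *a00 = a, *a01 = a + h, *a10 = a + h*ld, *a11 = a + h*ld + h;
        double *b00 = b, *b01 = b + h, *b10 = b + h*ld, *b11 = b + h*ld + h;
        double *c00 = c, *c01 = c + h, *c10 = c + h*ld, *c11 = c + h*ld + h;
        /* Eight half-size products, two accumulating into each C quadrant. */
        rec_matmul(h, ld, a00, b00, c00); rec_matmul(h, ld, a01, b10, c00);
        rec_matmul(h, ld, a00, b01, c01); rec_matmul(h, ld, a01, b11, c01);
        rec_matmul(h, ld, a10, b00, c10); rec_matmul(h, ld, a11, b10, c10);
        rec_matmul(h, ld, a10, b01, c11); rec_matmul(h, ld, a11, b11, c11);
    }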

Recursive Algorithm

Mesh Implementations
• Cannon's algorithm
• Fox's algorithm (not in textbook, but similar complexity)
• Systolic array
All involve processors arranged in a mesh, shifting elements of the arrays through the mesh and accumulating partial sums at each processor.

Mesh Implementations: Cannon's Algorithm
Uses a mesh of processors with wraparound connections (a torus) to shift the A elements (or submatrices) left and the B elements (or submatrices) up.
1. Initially, processor P_{i,j} has elements a_{i,j} and b_{i,j} (0 <= i < n, 0 <= j < n).
2. Elements are moved from their initial positions to an "aligned" position. The complete ith row of A is shifted i places left, and the complete jth column of B is shifted j places upward. This places element a_{i,j+i} and element b_{i+j,j} in processor P_{i,j}; these are a pair of the elements required in the accumulation of c_{i,j}.
3. Each processor, P_{i,j}, multiplies its elements.
4. The ith row of A is shifted one place left, and the jth column of B is shifted one place upward. This brings together the next pair of elements of A and B required in the accumulation.
5. Each processor, P_{i,j}, multiplies the elements brought to it and adds the result to its accumulating sum.
6. Steps 4 and 5 are repeated until the final result is obtained (n - 1 shifts with n rows and n columns of elements). A sequential sketch simulating these steps follows.
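A sequential C sketch that simulates the element-wise algorithm on an n x n logical mesh: each array cell (i, j) plays the role of processor P_{i,j}, and the in-place copies stand in for the nearest-neighbour messages of a real torus (illustrative; c must be zeroed by the caller):

    #include <string.h>

    void cannon(int n, double a[n][n], double b[n][n], double c[n][n])
    {
        double ta[n], tb[n];

        /* Step 2: alignment. Shift row i of A left by i, column j of B up by j. */
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) ta[j] = a[i][(j + i) % n];
            memcpy(a[i], ta, n * sizeof(double));
        }
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < n; i++) tb[i] = b[(i + j) % n][j];
            for (int i = 0; i < n; i++) b[i][j] = tb[i];
        }

        for (int step = 0; step < n; step++) {
            /* Steps 3 and 5: every cell multiplies its pair and accumulates. */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][j] * b[i][j];
            if (step == n - 1) break;        /* n - 1 shifts in total */
            /* Step 4: shift A one place left, B one place up (wraparound). */
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) ta[j] = a[i][(j + 1) % n];
                memcpy(a[i], ta, n * sizeof(double));
            }
            for (int j = 0; j < n; j++) {
                for (int i = 0; i < n; i++) tb[i] = b[(i + 1) % n][j];
                for (int i = 0; i < n; i++) b[i][j] = tb[i];
            }
        }
    }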

Movement of A and B elements

Step 2 — Alignment of elements of A and B

Step 4 - One-place shift of elements of A and B


Solving a System of Linear Equations
The objective is to find values for the unknowns x_0, x_1, …, x_{n-1}, given values for a_{0,0}, a_{0,1}, …, a_{n-1,n-1} and b_0, …, b_{n-1}.
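Written out in full, Ax = b is

    \begin{pmatrix}
    a_{0,0}   & \cdots & a_{0,n-1}   \\
    \vdots    & \ddots & \vdots      \\
    a_{n-1,0} & \cdots & a_{n-1,n-1}
    \end{pmatrix}
    \begin{pmatrix} x_0 \\ \vdots \\ x_{n-1} \end{pmatrix}
    =
    \begin{pmatrix} b_0 \\ \vdots \\ b_{n-1} \end{pmatrix}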

Solving a System of Linear Equations
Dense matrices:
• Gaussian elimination: parallel time complexity O(n^2)
Sparse matrices:
• By iteration: depends upon the iteration method and the number of iterations, but typically O(log n)
  – Jacobi iteration
  – Gauss-Seidel relaxation (not good for parallelization)
  – Red-Black ordering
  – Multigrid

Gaussian Elimination
Converts a general system of linear equations into a triangular system, which can then be solved by back substitution. Uses the property of linear equations that any row can be replaced by that row added to another row multiplied by a constant. Starts at the first row and works toward the bottom row. At the ith row, each row j below it is replaced by row j + (row i)(-a_{j,i}/a_{i,i}); the constant used for row j is -a_{j,i}/a_{i,i}. This makes all the elements in the ith column below the ith row zero, because each becomes

    a_{j,i} + a_{i,i} (-a_{j,i}/a_{i,i}) = 0

Gaussian elimination

Partial Pivoting
If a_{i,i} is zero or close to zero, the quantity -a_{j,i}/a_{i,i} cannot be computed. The procedure must be modified into so-called partial pivoting: swap the ith row with whichever row below it has the largest absolute element in the ith column, if such a row exists. (Reordering the equations does not affect the system.) In the following, partial pivoting is not considered.

Sequential Code
Without partial pivoting:

    for (i = 0; i < n-1; i++)             /* for each row, except last */
        for (j = i+1; j < n; j++) {       /* step through subsequent rows */
            m = a[j][i] / a[i][i];        /* compute multiplier */
            for (k = i; k < n; k++)       /* last n - i elements of row j */
                a[j][k] = a[j][k] - a[i][k] * m;
            b[j] = b[j] - b[i] * m;       /* modify right side */
        }

The time complexity is O(n^3).
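The deck cites back substitution (slide 29) but does not show it; a minimal C sketch to complete the picture (assumes elimination has already made a upper triangular and all diagonal elements are nonzero):

    /* Solve the upper-triangular system a x = b, bottom row upward. */
    void back_substitute(int n, double a[n][n], double b[n], double x[n])
    {
        for (int i = n - 1; i >= 0; i--) {
            double sum = b[i];
            for (int k = i + 1; k < n; k++)
                sum -= a[i][k] * x[k];    /* subtract already-known terms */
            x[i] = sum / a[i][i];
        }
    }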

Parallel Implementation

Analysis
Communication: n - 1 broadcasts are performed sequentially; the ith broadcast contains n - i + 1 elements. Time complexity of O(n^2) (see textbook).
Computation: after the row broadcast, each processor P_j beyond the broadcast processor P_i computes its multiplier and operates upon n - j + 2 elements of its row. Ignoring the computation of the multiplier, there are n - j + 2 multiplications and n - j + 2 subtractions. Time complexity of O(n^2) (see textbook).
Efficiency is relatively low because the processors before the processor holding row i do not participate in the computation again.

Pipeline implementation of Gaussian elimination

Strip Partitioning
Poor processor allocation! Processors do not participate in computation after their last row is processed.

Cyclic-Striped Partitioning
An alternative that equalizes the processor workload, as sketched below:
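A sketch of how cyclic striping might be coded (illustrative: owner(), my_rank, and the surrounding SPMD structure are assumptions, and the pivot-row broadcast is omitted; every processor is assumed to hold a current copy of row i):

    /* Row i is owned by processor i % p, so every processor still owns
       rows below any pivot and the workload stays balanced. */
    static int owner(int row, int p) { return row % p; }

    void eliminate_cyclic(int n, int p, int my_rank,
                          double a[n][n], double b[n])
    {
        for (int i = 0; i < n - 1; i++)          /* pivot row i (assumed broadcast) */
            for (int j = i + 1; j < n; j++)
                if (owner(j, p) == my_rank) {    /* update only owned rows */
                    double m = a[j][i] / a[i][i];
                    for (int k = i; k < n; k++)
                        a[j][k] -= a[i][k] * m;
                    b[j] -= b[i] * m;
                }
    }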

Iterative Methods: skipped (see text).