Dense Matrix Algorithms CS 524 – High-Performance Computing

CS 524 (Wi 2003/04) - Asim, LUMS

Definitions
- p = number of processors (indexed 0 to p-1)
- n = dimension of array/matrix (indexed 0 to n-1)
- q = number of blocks along one dimension (indexed 0 to q-1)
- t_c = computation time for one flop
- t_s = communication startup time
- t_w = communication transfer time per word
- Interconnection network: crossbar switch with bi-directional links

Uniform Striped Partitioning

Checkerboard Partitioning

Matrix Transpose (MT)
- A^T(i, j) = A(j, i) for all i and j
- Sequential algorithm:
    do i = 0, n-1
      do j = 0, n-1
        B(i, j) = A(j, i)
      end do
    end do
- Run time is (n^2 - n)/2 element swaps, or approximately n^2/2
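The transpose loop above can be written directly in Python (an illustrative translation of the slide's pseudocode, not code from the course):

```python
def transpose(A):
    """Return B with B[i][j] = A[j][i], the loop from the slide."""
    n = len(A)
    return [[A[j][i] for j in range(n)] for i in range(n)]
```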

MT - Checkerboard Partitioning (1)

MT - Checkerboard Partitioning (2)

MT - Striped Partitioning

Matrix-Vector Multiplication (MVM)
- MVM: y = Ax
    do i = 0, n-1
      do j = 0, n-1
        y(i) = y(i) + A(i, j)*x(j)
      end do
    end do
- Sequential algorithm requires n^2 multiplications and n^2 additions
- Assuming one flop takes t_c time, sequential run time is 2*t_c*n^2
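A direct Python rendering of this doubly nested loop (illustrative sketch only):

```python
def mvm(A, x):
    """Compute y = A x with the two nested loops from the slide."""
    n = len(A)
    y = [0] * n
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]   # one multiply + one add per (i, j)
    return y
```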

Row-wise Striping – p = n (1)

Row-wise Striping – p = n (2)
- Data partitioning: P_i has row i of A and element i of x
- Communication: each processor broadcasts its element of x
- Computation: each processor performs n multiplications and n additions
- Parallel run time: T_p = 2n*t_c + p(t_s + t_w) = 2n*t_c + n(t_s + t_w)
- Algorithm is cost-optimal as both parallel and serial costs are O(n^2)

Row-wise Striping – p < n
- Data partitioning: each processor has n/p rows of A and the corresponding n/p elements of x
- Communication: each processor broadcasts its elements of x
- Computation: each processor performs n^2/p multiplications and n^2/p additions
- Parallel run time: T_p = 2*t_c*n^2/p + p[t_s + (n/p)*t_w]
- Algorithm is cost-optimal for p = O(n)
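The row-striped computation can be simulated sequentially: each of the p "processors" owns n/p rows and, after the (simulated) all-to-all broadcast of x, computes its n/p entries of y. This is a sketch of the data decomposition only, not an MPI implementation:

```python
def striped_mvm(A, x, p):
    """Simulate row-wise striped y = Ax on p processors (n divisible by p)."""
    n = len(A)
    r = n // p                               # rows owned by each processor
    y = []
    for proc in range(p):                    # each pass plays one processor
        local_rows = A[proc * r:(proc + 1) * r]
        # after the broadcast, every processor sees the full vector x
        y += [sum(a * b for a, b in zip(row, x)) for row in local_rows]
    return y
```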

Checkerboard Partitioning – p = n^2 (1)

Checkerboard Partitioning – p = n^2 (2)
- Data partitioning: each processor has one element of A; only processors in the last column have one element of x
- Communication
  - One element of x moves from the last-column processor to the diagonal processor in its row
  - Broadcast from the diagonal processor to all processors in its column
  - Global sum of the y contributions from all processors in a row to the last processor
- Computation: one multiplication and one addition per processor
- Parallel run time: T_p = 2t_c + 3(t_s + t_w)
- Algorithm is cost-optimal as both serial and parallel costs are O(n^2)
- For a bus network, communication time is 3n(t_s + t_w); the system is then not cost-optimal as its cost is O(n^3)

Checkerboard Partitioning – p < n^2
- Data partitioning: each processor has an (n/√p) x (n/√p) block of A; processors in the last column also have n/√p elements of x
- Communication
  - n/√p elements of x move from the last-column processor to the diagonal processor in its row
  - Broadcast from the diagonal processor to all processors in its column
  - Global sum of the y contributions from all processors in a row to the last processor
- Computation: n^2/p multiplications and n^2/p additions per processor
- Parallel run time: T_p = 2*t_c*n^2/p + 3[t_s + (n/√p)*t_w]
- Algorithm is cost-optimal only if p = O(n^2)

Matrix-Matrix Multiplication (MMM)
- C = A x B, where A, B, C are n x n matrices
- Block matrix multiplication: algebraic operations on sub-matrices, or blocks, of the matrices. This view of MMM aids parallelization. With a q x q grid of blocks:
    do i = 0, q-1
      do j = 0, q-1
        do k = 0, q-1
          C_{i,j} = C_{i,j} + A_{i,k} x B_{k,j}
        end do
      end do
    end do
- Number of multiplications and additions = n^3 each; sequential run time = 2*t_c*n^3
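The block triple loop gives the same result as ordinary matrix multiplication; the sketch below (illustrative, not course code) mirrors it, with each block product expanded into scalar operations:

```python
def block_matmul(A, B, q):
    """Multiply n x n matrices using a q x q block grid (n divisible by q),
    following the triple loop over blocks on the slide."""
    n = len(A)
    b = n // q                                    # block dimension
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for k in range(q):
                # C_{i,j} += A_{i,k} * B_{k,j} as an ordinary matrix product
                for r in range(b):
                    for c in range(b):
                        C[i*b + r][j*b + c] += sum(
                            A[i*b + r][k*b + t] * B[k*b + t][j*b + c]
                            for t in range(b))
    return C
```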

Checkerboard Partitioning – q = √p
- Data partitioning: P_{i,j} has blocks A_{i,j} and B_{i,j} of A and B, each of dimension (n/√p) x (n/√p)
- Communication: each processor broadcasts its submatrix A_{i,j} to all processors in its row and its submatrix B_{i,j} to all processors in its column
- Computation: each processor performs n * (n/√p) * (n/√p) = n^3/p multiplications and as many additions
- Parallel run time: T_p = 2*t_c*n^3/p + 2√p[t_s + (n^2/p)*t_w]
- Algorithm is cost-optimal only if p = O(n^2)

Cannon's Algorithm (1)
- Memory-efficient version of the checkerboard-partitioned block MMM
  - At any time, each processor holds one block of A and one block of B
  - Blocks are cycled after each computation step in such a way that after √p steps the multiplication is complete for C_{i,j}
  - Initial distribution of the matrices is the same as checkerboard partitioning
- Communication
  - Initial alignment: block A_{i,j} is moved left by i steps (with wraparound); block B_{i,j} is moved up by j steps (with wraparound)
  - In each of the subsequent √p - 1 steps: each block of A is moved left by one step and each block of B is moved up by one step (both with wraparound)
- After √p computation and communication steps, the multiplication is complete for C_{i,j}
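The alignment and shift pattern can be checked with a sequential simulation: a q x q grid of "processors" each holding one block, shifted with wraparound exactly as described above. This is a sketch of the algorithm's data movement, not an MPI program:

```python
def cannon_matmul(A, B, q):
    """Simulate Cannon's algorithm on a q x q processor grid (n divisible by q)."""
    n = len(A)
    b = n // q

    def block(M, i, j):
        return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

    # Initial alignment: A_{i,j} shifted left by i, B_{i,j} shifted up by j.
    Ablk = [[block(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bblk = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cblk = [[[[0]*b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):                       # sqrt(p) compute-and-shift steps
        for i in range(q):
            for j in range(q):
                X, Y, Z = Ablk[i][j], Bblk[i][j], Cblk[i][j]
                for r in range(b):           # local block multiply-accumulate
                    for c in range(b):
                        Z[r][c] += sum(X[r][k] * Y[k][c] for k in range(b))
        # Shift every A block left by one and every B block up by one.
        Ablk = [[Ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Reassemble C from the per-processor blocks.
    C = [[0]*n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(b):
                for c in range(b):
                    C[i*b + r][j*b + c] = Cblk[i][j][r][c]
    return C
```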

Cannon's Algorithm (2)

Cannon's Algorithm (3)

Cannon's Algorithm (4)
- Communication
  - √p point-to-point communications of size n^2/p along rows
  - √p point-to-point communications of size n^2/p along columns
- Computation: over the √p steps, each processor performs n^3/p multiplications and as many additions
- Parallel run time: T_p = 2*t_c*n^3/p + 2√p[t_s + (n^2/p)*t_w]
- Algorithm is cost-optimal if p = O(n^2)

Fox's Algorithm (1)
- Another memory-efficient version of the checkerboard-partitioned block MMM
  - Initial distribution of the matrices is the same as checkerboard partitioning
  - At any time, each processor holds one block of A and one block of B
- Steps (repeated √p times, starting with the diagonal block A_{i,i} in each row)
  1. Broadcast the selected block of A to all processors in its row
  2. Multiply the block of A received with the resident block of B
  3. Send the resident block of B up one step (with wraparound)
  4. Select the block of A one column to the right (with wraparound), i.e. A_{i,(k+1) mod √p} if A_{i,k} was just broadcast, and go to step 1
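Fox's broadcast-and-roll pattern can also be simulated sequentially. In the sketch below (illustrative only), at step t row i broadcasts block A_{i,(i+t) mod q} while the B blocks roll upward:

```python
def fox_matmul(A, B, q):
    """Simulate Fox's algorithm on a q x q processor grid (n divisible by q)."""
    n = len(A)
    b = n // q

    def block(M, i, j):
        return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

    Bblk = [[block(B, i, j) for j in range(q)] for i in range(q)]
    Cblk = [[[[0]*b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for t in range(q):
        for i in range(q):
            Abcast = block(A, i, (i + t) % q)   # broadcast along row i
            for j in range(q):
                Y, Z = Bblk[i][j], Cblk[i][j]
                for r in range(b):
                    for c in range(b):
                        Z[r][c] += sum(Abcast[r][k] * Y[k][c] for k in range(b))
        # Send every B block up one step (with wraparound).
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    C = [[0]*n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(b):
                for c in range(b):
                    C[i*b + r][j*b + c] = Cblk[i][j][r][c]
    return C
```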

Fox's Algorithm (2)

Fox's Algorithm (3)
- Communication
  - √p broadcasts of size n^2/p along rows
  - √p point-to-point communications of size n^2/p along columns
- Computation: each processor performs n^3/p multiplications and as many additions
- Parallel run time: T_p = 2*t_c*n^3/p + 2√p[t_s + (n^2/p)*t_w]
- Algorithm is cost-optimal if p = O(n^2)

Solving a System of Linear Equations
- System of linear equations: Ax = b
  - A is a dense n x n matrix of coefficients
  - b is an n x 1 vector of right-hand-side values
  - x is an n x 1 vector of unknowns
- Solving for x is usually done in two stages
  - First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix [U(i,j) = 0 if i > j; otherwise U(i,j) ≠ 0, and U(i,i) = 1 for 0 ≤ i < n]. This stage is called Gaussian elimination.
  - Second, the unknowns are solved for in reverse order, starting from x(n-1). This stage is called back-substitution.

Gaussian Elimination (1)
    do k = 0, n-1
      do j = k+1, n-1
        A(k, j) = A(k, j)/A(k, k)
      end do
      y(k) = b(k)/A(k, k)
      A(k, k) = 1
      do i = k+1, n-1
        do j = k+1, n-1
          A(i, j) = A(i, j) - A(i, k)*A(k, j)
        end do
        b(i) = b(i) - A(i, k)*y(k)
        A(i, k) = 0
      end do
    end do
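This pseudocode translates line for line into Python. As on the slide, no pivoting is performed, so A(k,k) must be nonzero at every step; A is overwritten in place with the unit upper triangular U:

```python
def gaussian_elimination(A, b):
    """Reduce Ax = b to Ux = y following the slide's pseudocode.

    A is modified in place into a unit upper triangular matrix U,
    b is consumed, and the vector y is returned. No pivoting.
    """
    n = len(A)
    y = [0.0] * n
    for k in range(n):
        for j in range(k + 1, n):
            A[k][j] /= A[k][k]          # division step: scale row k
        y[k] = b[k] / A[k][k]
        A[k][k] = 1.0
        for i in range(k + 1, n):       # elimination step: rows below k
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
            b[i] -= A[i][k] * y[k]
            A[i][k] = 0.0
    return y
```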

Gaussian Elimination (2)
- Computations
  - Approximately n^2/2 divisions
  - Approximately n^3/3 - n^2/2 multiplications and as many subtractions
- Approximate sequential run time: T_s = 2*t_c*n^3/3

Striped Partitioning – p = n (1)
- Data partitioning: each processor has one row of matrix A
- Communication during iteration k (outermost loop)
  - Broadcast of the active part of the kth row (size n-k-1) to processors k+1 to n-1
- Computation during iteration k (outermost loop)
  - n-k-1 divisions at processor P_k
  - n-k-1 multiplications and subtractions at each processor P_i (k < i < n)
- Parallel run time: T_p = (3/2)n(n-1)t_c + n*t_s + 0.5n(n-1)t_w
- Algorithm is not cost-optimal: although serial and parallel costs are both O(n^3), processors 0 through k are idle during iteration k, so the parallel cost exceeds the serial cost by a constant factor

Striped Partitioning – p = n (2)

Striped Partitioning – p = n (3)

Pipelined Version (Striped Partitioning)
- In the non-pipelined (synchronous) version, the outer loop over k is executed in order
  - When P_k is performing the division step, all other processors are idle
  - When the elimination step is being performed, only processors k+1 to n-1 are active; the rest are idle
- In the pipelined version, the division step, communication, and elimination step are overlapped
  - Each processor communicates if it has data to communicate, computes if it has computations to do, or waits if it can do neither
  - Cost-optimal on linear array, mesh, and hypercube interconnection networks, which have directly connected processors

Pipelined Version (2)

Pipelined Version (3)

Striped Partitioning – p < n (1)

Striped Partitioning – p < n (2)

Checkerboard Partitioning – p = n^2 (1)

Checkerboard Partitioning – p = n^2 (2)
- Data partitioning: P_{i,j} has element A(i, j) of matrix A
- Communication during iteration k (outermost loop)
  - Broadcast of A(k, k) to processors (k, k+1) through (k, n-1) in the kth row
  - Broadcast of the modified A(i, k) along the ith row, for k < i < n
  - Broadcast of the modified A(k, j) along the jth column, for k < j < n
- Computation during iteration k (outermost loop)
  - One division at each processor in the active part of the kth row
  - One multiplication and subtraction at each processor P_{i,j} (k < i, j < n)
- Parallel run time: T_p ≈ 3n*t_c + 3n(t_s + t_w), since each of the n iterations performs a constant number of flops and three single-element broadcasts per processor
- Algorithm is cost-optimal since the serial cost and the parallel cost p*T_p = n^2 * O(n) are both O(n^3)

Back-Substitution
- Solution of Ux = y, where U is a unit upper triangular matrix
    do k = n-1, 0, -1
      x(k) = y(k)
      do i = k-1, 0, -1
        y(i) = y(i) - x(k)*U(i, k)
      end do
    end do
- Computation: approximately n^2/2 multiplications and as many subtractions
- The parallel algorithm is similar to that for the Gaussian elimination stage
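The back-substitution loop in Python (an illustrative translation; pairs with the output Ux = y of the elimination stage):

```python
def back_substitution(U, y):
    """Solve Ux = y for a unit upper triangular U, following the slide's loop."""
    n = len(U)
    x = [0.0] * n
    y = list(y)                       # keep the caller's y intact
    for k in range(n - 1, -1, -1):
        x[k] = y[k]                   # U(k,k) = 1, so no division needed
        for i in range(k - 1, -1, -1):
            y[i] -= x[k] * U[i][k]    # eliminate x(k) from earlier equations
    return x
```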