Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Chapter 11: Numerical Algorithms. Sec 11.2: Implementing Matrix Multiplication.


Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers
Chapter 11: Numerical Algorithms
Sec 11.2: Implementing Matrix Multiplication
Barry Wilkinson and Michael Allen
(c) Prentice-Hall Inc. All rights reserved. Modified 5/11/05 by T. O'Neil for 3460:4/577, Fall 2005, Univ. of Akron. Previous revision: 7/28/03.

Implementing Matrix Multiplication: Sequential Code

Assume throughout that the matrices are square (n × n matrices). The sequential code to compute A × B could simply be

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

This algorithm requires n³ multiplications and n³ additions, leading to a sequential time complexity of O(n³). It is very easy to parallelize.

Parallel Code

With n processors (and n × n matrices), we can obtain:

• Time complexity of O(n²) with n processors. Each instance of the inner loop is independent and can be done by a separate processor. Cost-optimal, since O(n³) = n × O(n²).
• Time complexity of O(n) with n² processors. One element of A and B assigned to each processor. Cost-optimal, since O(n³) = n² × O(n).
• Time complexity of O(log n) with n³ processors, by parallelizing the inner loop. Not cost-optimal, since O(n³) ≠ n³ × O(log n).

O(log n) is the lower bound for parallel matrix multiplication.
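As an added illustration (not part of the original slides), a minimal OpenMP sketch of the first case: the n iterations of the outer loop are independent, so each can be handled by a separate processor, each doing O(n²) work.

#include <omp.h>

/* one row of C per thread; with n threads this is the O(n^2), n-processor case */
void mat_mult_rows(int n, double a[n][n], double b[n][n], double c[n][n])
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}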

Partitioning into Submatrices

Suppose the matrix is divided into s² submatrices, each with n/s × n/s elements. Using the notation Ap,q for the submatrix in submatrix row p and submatrix column q:

for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        Cp,q = 0;                         /* clear elements of submatrix */
        for (r = 0; r < s; r++)           /* submatrix mult and */
            Cp,q = Cp,q + Ap,r * Br,q;    /* add to accum submatrix */
    }

Partitioning into Submatrices (cont)

for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        Cp,q = 0;
        for (r = 0; r < s; r++)
            Cp,q = Cp,q + Ap,r * Br,q;
    }

The line Cp,q = Cp,q + Ap,r * Br,q; means: multiply submatrix Ap,r by Br,q using matrix multiplication and add the result to submatrix Cp,q using matrix addition. This is known as block matrix multiplication.
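A concrete sequential sketch of block matrix multiplication (added here, not code from the slides; the block size parameter bs = n/s is an assumption of this sketch):

/* each (p,q) iteration computes one n/s x n/s submatrix of C */
void block_mat_mult(int n, int bs, double a[n][n], double b[n][n], double c[n][n])
{
    int s = n / bs;                          /* submatrices per row/column */
    for (int p = 0; p < s; p++)
        for (int q = 0; q < s; q++) {
            for (int i = 0; i < bs; i++)     /* clear submatrix C[p][q] */
                for (int j = 0; j < bs; j++)
                    c[p*bs + i][q*bs + j] = 0.0;
            for (int r = 0; r < s; r++)      /* C[p][q] += A[p][r] * B[r][q] */
                for (int i = 0; i < bs; i++)
                    for (int k = 0; k < bs; k++)
                        for (int j = 0; j < bs; j++)
                            c[p*bs + i][q*bs + j] +=
                                a[p*bs + i][r*bs + k] * b[r*bs + k][q*bs + j];
        }
}

On a machine with caches, choosing bs so that three bs × bs blocks fit in cache is the usual motivation for this ordering.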

Block Matrix Multiplication

Submatrix Multiplication: Matrices

Submatrix Multiplication: Multiplying A0,0 × B0,0 to obtain C0,0

Direct Implementation

One processor computes each element of C, so n² processors are needed. Each processor needs one row of elements of A and one column of elements of B, so some of the same elements must be sent to more than one processor. Submatrices can be used instead of individual elements.
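An added MPI sketch of the direct approach (my illustration, not from the slides): the rank responsible for element ci,j receives row i of A and column j of B from a hypothetical root rank 0 and computes that single element.

#include <mpi.h>

#define MAXN 256                             /* assumed bound on n */

void direct_element(int n, MPI_Comm comm)
{
    double arow[MAXN], bcol[MAXN], c = 0.0;
    MPI_Recv(arow, n, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);  /* row i of A    */
    MPI_Recv(bcol, n, MPI_DOUBLE, 0, 1, comm, MPI_STATUS_IGNORE);  /* column j of B */
    for (int k = 0; k < n; k++)
        c += arow[k] * bcol[k];                                    /* inner product */
    MPI_Send(&c, 1, MPI_DOUBLE, 0, 2, comm);                       /* return c[i][j] */
}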

Performance Improvement

Using a tree construction, n numbers can be added in log n steps using n processors. This gives a computational time complexity of O(log n) using n³ processors: all n³ products a[i][k] * b[k][j] can be formed in one step, and each of the n² elements of C is then the sum of n products, accumulated with a tree of depth log n.
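A minimal sketch of the tree summation itself (added here; sequential code, where the inner loop's iterations are the additions that would run on separate processors in each step):

void tree_sum(int n, double x[])                    /* n assumed a power of two */
{
    for (int stride = 1; stride < n; stride *= 2)   /* log2(n) steps */
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];                  /* independent across i */
    /* the sum of all n values ends up in x[0] */
}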

Recursive Implementation

Apply the same algorithm to each submatrix recursively. This is an excellent algorithm for a shared memory system because of its locality of operations.

Recursive Algorithm

mat_mult(App, Bpp, s)
{
    if (s == 1)              /* if submatrix has one elt, multiply */
        C = A * B;
    else {                   /* continue to make recursive calls */
        s = s / 2;           /* no. of elts in each row/column */
        P0 = mat_mult(App, Bpp, s);
        P1 = mat_mult(Apq, Bqp, s);
        P2 = mat_mult(App, Bpq, s);
        P3 = mat_mult(Apq, Bqq, s);

Recursive Algorithm (cont)

        P4 = mat_mult(Aqp, Bpp, s);
        P5 = mat_mult(Aqq, Bqp, s);
        P6 = mat_mult(Aqp, Bpq, s);
        P7 = mat_mult(Aqq, Bqq, s);
        /* add submatrix products to form */
        /* submatrices of final matrix */
        Cpp = P0 + P1;
        Cpq = P2 + P3;
        Cqp = P4 + P5;
        Cqq = P6 + P7;
    }
    return(C);               /* return final matrix */
}
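A hedged sequential C sketch of this recursion (added here, not the book's code; it assumes n is a power of two, row-major storage with row stride ld, and C zeroed by the caller). The eight recursive calls are independent, which is what a parallel version would exploit:

static void mat_mult_rec(int s, int ld, const double *a, const double *b, double *c)
{
    if (s == 1) {                              /* one element: scalar multiply-add */
        c[0] += a[0] * b[0];
        return;
    }
    int h = s / 2;                             /* half the row/column length */
    const double *a11 = a, *a12 = a + h, *a21 = a + h*ld, *a22 = a + h*ld + h;
    const double *b11 = b, *b12 = b + h, *b21 = b + h*ld, *b22 = b + h*ld + h;
    double       *c11 = c, *c12 = c + h, *c21 = c + h*ld, *c22 = c + h*ld + h;

    mat_mult_rec(h, ld, a11, b11, c11);        /* C11 = A11*B11 + A12*B21 */
    mat_mult_rec(h, ld, a12, b21, c11);
    mat_mult_rec(h, ld, a11, b12, c12);        /* C12 = A11*B12 + A12*B22 */
    mat_mult_rec(h, ld, a12, b22, c12);
    mat_mult_rec(h, ld, a21, b11, c21);        /* C21 = A21*B11 + A22*B21 */
    mat_mult_rec(h, ld, a22, b21, c21);
    mat_mult_rec(h, ld, a21, b12, c22);        /* C22 = A21*B12 + A22*B22 */
    mat_mult_rec(h, ld, a22, b22, c22);
}
/* usage: zero C, then mat_mult_rec(n, n, &A[0][0], &B[0][0], &C[0][0]) */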

Two-Dimensional Array (Mesh) Interconnection Network

Three-dimensional meshes are also used in some large high-performance systems.

Mesh Implementations

• Cannon's algorithm
• Fox's algorithm (not in the textbook, but of similar complexity)
• Systolic array

All involve processors arranged in a mesh, shifting elements of the arrays through the mesh, and accumulating the partial sums at each processor.

Mesh Implementations: Cannon's Algorithm

Uses a mesh of processors with wraparound connections (a torus) to shift the A elements (or submatrices) left and the B elements (or submatrices) up.

1. Initially, processor Pi,j has elements ai,j and bi,j (0 ≤ i, j < n).
2. Elements are moved from their initial position to an "aligned" position: the complete ith row of A is shifted i places left and the complete jth column of B is shifted j places upward. This has the effect of placing element ai,j+i and element bi+j,j in processor Pi,j. These elements are a pair of those required in the accumulation of ci,j.

Mesh Implementations: Cannon's Algorithm (cont)

3. Each processor Pi,j multiplies its elements.
4. The ith row of A is shifted one place left, and the jth column of B is shifted one place upward. This brings together the adjacent elements of A and B, which will also be required in the accumulation.
5. Each processor Pi,j multiplies the elements brought to it and adds the result to its accumulating sum.
6. Steps 4 and 5 are repeated until the final result is obtained (n − 1 shifts with n rows and n columns of elements).
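A hedged MPI sketch of the whole procedure for submatrices (added here, not from the slides; it assumes a square number of processes, each already holding its bs × bs blocks of A and B and a zeroed block of C). MPI's Cartesian topology routines provide the torus and the shifts:

#include <mpi.h>
#include <math.h>

void cannon(int bs, double *blocka, double *blockb, double *blockc, MPI_Comm comm)
{
    int p, rank, left, right, up, down, n2 = bs * bs;
    int dims[2], periods[2] = {1, 1}, coords[2];
    MPI_Comm torus;

    MPI_Comm_size(comm, &p);
    dims[0] = dims[1] = (int)sqrt((double)p);       /* sqrt(p) x sqrt(p) mesh */
    MPI_Cart_create(comm, 2, dims, periods, 1, &torus);
    MPI_Comm_rank(torus, &rank);
    MPI_Cart_coords(torus, rank, 2, coords);

    /* step 2: row i shifts i places left, column j shifts j places up */
    MPI_Cart_shift(torus, 1, -coords[0], &right, &left);
    MPI_Sendrecv_replace(blocka, n2, MPI_DOUBLE, left, 0, right, 0, torus, MPI_STATUS_IGNORE);
    MPI_Cart_shift(torus, 0, -coords[1], &down, &up);
    MPI_Sendrecv_replace(blockb, n2, MPI_DOUBLE, up, 0, down, 0, torus, MPI_STATUS_IGNORE);

    for (int step = 0; step < dims[0]; step++) {
        /* steps 3 and 5: multiply local blocks, accumulate into blockc */
        for (int i = 0; i < bs; i++)
            for (int k = 0; k < bs; k++)
                for (int j = 0; j < bs; j++)
                    blockc[i*bs + j] += blocka[i*bs + k] * blockb[k*bs + j];
        /* step 4: shift A one place left, B one place up */
        MPI_Cart_shift(torus, 1, -1, &right, &left);
        MPI_Sendrecv_replace(blocka, n2, MPI_DOUBLE, left, 0, right, 0, torus, MPI_STATUS_IGNORE);
        MPI_Cart_shift(torus, 0, -1, &down, &up);
        MPI_Sendrecv_replace(blockb, n2, MPI_DOUBLE, up, 0, down, 0, torus, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&torus);
}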

Movement of A and B elements

Step 2 – Alignment of elements of A and B

Step 4 – One-place shift of elements of A and B

Cannon's Algorithm Example: matrix elements from A in red, from B in blue

Cannon's Algorithm Example (cont): after realignment; accumulator values on right

Cannon's Algorithm Example (cont): after first shift

Cannon's Algorithm Example (cont): after second (final) shift

Systolic array

Pseudocode

Each processor Pi,j repeatedly performs the same algorithm (after c is initialized to zero):

recv(&a, Pi,j-1);    /* receive from left */
recv(&b, Pi-1,j);    /* receive from above */
c = c + a * b;       /* accumulate value */
send(&a, Pi,j+1);    /* send right */
send(&b, Pi+1,j);    /* send downwards */

which accumulates the required summations.

Matrix-Vector Multiplication

Other Matrix Multiplication Methods: Strassen's Method (1969)

Suppose we wish to multiply n × n matrices A and B to produce C. Divide each of these into four n/2 × n/2 submatrices as follows:

A = | A11 A12 |    B = | B11 B12 |    C = | C11 C12 |
    | A21 A22 |        | B21 B22 |        | C21 C22 |

Strassen's Method (cont)

A traditional recursive approach executes in O(n³) time, so it is no faster. Strassen discovered a different recursive approach that executes in O(n^(log₂ 7)) ≈ O(n^2.81) time:

1. Divide the input matrices into quarters, as above.
2. Use O(n²) scalar additions and subtractions to form the operands of, and then recursively compute, seven matrix products Q1 through Q7.
3. Compute the desired submatrices of C by adding and/or subtracting various combinations of the Qi matrices, using O(n²) scalar operations.

Strassen's Method (cont)

To this end, set

Q1 = (A11 + A22) · (B11 + B22)
Q2 = (A21 + A22) · B11
Q3 = A11 · (B12 − B22)
Q4 = A22 · (B21 − B11)
Q5 = (A11 + A12) · B22
Q6 = (A21 − A11) · (B11 + B12)
Q7 = (A12 − A22) · (B21 + B22)

As indicated above, these products can be computed recursively.

Strassen's Method (cont)

Now

C11 = Q1 + Q4 − Q5 + Q7
C21 = Q2 + Q4
C12 = Q3 + Q5
C22 = Q1 + Q3 − Q2 + Q6
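Because the submatrices of a 2 × 2 matrix are scalars, Strassen's identities can be checked directly against the classical formulas. A small self-check (added here, not from the slides):

#include <stdio.h>

int main(void)
{
    double A[2][2] = {{1, 2}, {3, 4}}, B[2][2] = {{5, 6}, {7, 8}};
    double q1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    double q2 = (A[1][0] + A[1][1]) * B[0][0];
    double q3 = A[0][0] * (B[0][1] - B[1][1]);
    double q4 = A[1][1] * (B[1][0] - B[0][0]);
    double q5 = (A[0][0] + A[0][1]) * B[1][1];
    double q6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    double q7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
    double C[2][2] = {{q1 + q4 - q5 + q7, q3 + q5},
                      {q2 + q4, q1 + q3 - q2 + q6}};   /* 7 multiplications */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)                    /* classical: 8 mults */
            printf("C[%d][%d] = %g (classical %g)\n", i, j,
                   C[i][j], A[i][0]*B[0][j] + A[i][1]*B[1][j]);
    return 0;
}

Each line prints matching values: 19, 22, 43 and 50.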