Parallel Processing1 Parallel Processing (CS 676) Lecture: Grouping Data and Communicators in MPI Jeremy R. Johnson *Parts of this lecture was derived.

Presentation transcript:

Slide 1: Parallel Processing (CS 676)
Lecture: Grouping Data and Communicators in MPI
Jeremy R. Johnson
*Parts of this lecture were derived from chapters 6 and 7 in Pacheco.

Slide 2: Introduction
Objective: To introduce MPI commands for creating types and communicators. To discuss performance models and considerations in MPI and ways of reducing communication.
Topics
- MPI datatypes and packing
    - Revised version of Get_Data
    - Matrix transposition
- Creating communicators
    - Topologies
    - Grids
- Matrix multiplication
    - Fox's algorithm
- Performance model

Slide 3: Derived Types
Because of communication latency, it is usually a good idea to package several elements into a single message.
MPI_Send and MPI_Recv allow a message to be given by a start address, a basic type, and a count.
- This allows multiple data elements to be sent in one message
- The elements must all be of the same type
- The elements must be contiguous in memory
A generalized type
- {(t_0, d_0), …, (t_{n-1}, d_{n-1})}
- t_i is an existing type
- d_i is a displacement

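Slide 4: Functions for Creating MPI Types
    int MPI_Type_struct(int count, int block_lengths[], MPI_Aint displacements[], MPI_Datatype typelist[], MPI_Datatype* new_mpi_t)
    int MPI_Address(void* location, MPI_Aint* address)
    int MPI_Type_commit(MPI_Datatype* new_mpi_t)
Note: MPI_Type_struct and MPI_Address were deprecated in MPI-2 and removed in MPI-3.0; the current equivalents are MPI_Type_create_struct and MPI_Get_address.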

Slide 5: Other Derived Datatype Constructors
    int MPI_Type_vector(int count, int block_length, int stride, MPI_Datatype element_type, MPI_Datatype* new_mpi_t)
    int MPI_Type_contiguous(int count, MPI_Datatype old_type, MPI_Datatype* new_mpi_t)
    int MPI_Type_indexed(int count, int block_lengths[], int displacements[], MPI_Datatype old_type, MPI_Datatype* new_mpi_t)

Slide 6: Transpose
    float A[10][10];  /* stored in row-major order */

    /* Send 3rd row of A from process 0 to process 1. */
    if (my_rank == 0) {
        MPI_Send(&(A[2][0]), 10, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else {  /* my_rank == 1 */
        MPI_Recv(&(A[2][0]), 10, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
    }
    /* This does not work for columns, since a column is not contiguous in memory. */

Slide 7: Transpose
    float A[10][10];  /* stored in row-major order */
    MPI_Datatype column_mpi_t;

    /* Send 3rd column of A from process 0 to process 1. */
    /* 10 blocks of 1 float each, block starts 10 floats apart. */
    MPI_Type_vector(10, 1, 10, MPI_FLOAT, &column_mpi_t);
    MPI_Type_commit(&column_mpi_t);
    if (my_rank == 0) {
        MPI_Send(&(A[0][2]), 1, column_mpi_t, 1, 0, MPI_COMM_WORLD);
    } else {  /* my_rank == 1 */
        MPI_Recv(&(A[0][2]), 1, column_mpi_t, 0, 0, MPI_COMM_WORLD, &status);
    }
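To make the roles of count, block_length, and stride concrete, the sketch below copies out exactly the elements that a vector layout of this shape selects. It is plain C rather than an MPI call, and gather_strided is a hypothetical helper introduced here only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper (not part of MPI): copy the elements selected by a
 * vector layout into the contiguous array out. The layout consists of
 * count blocks of block_length elements each, with consecutive block
 * starts stride elements apart. Returns the number of elements copied. */
size_t gather_strided(const float *base, int count, int block_length,
                      int stride, float *out)
{
    size_t copied = 0;
    for (int block = 0; block < count; block++) {
        const float *start = base + (size_t)block * (size_t)stride;
        for (int k = 0; k < block_length; k++) {
            out[copied++] = start[k];
        }
    }
    return copied;
}
```

For the 10 × 10 array on this slide, gather_strided(&A[0][2], 10, 1, 10, column) would fill column with A[0][2], A[1][2], …, A[9][2], the same elements that column_mpi_t describes.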

Slide 8: Upper Triangular Matrix
    float A[n][n];  /* Complete matrix */
    float T[n][n];  /* Upper triangle */
    int displacements[n];
    int block_lengths[n];
    MPI_Datatype index_mpi_t;

    /* Row i contributes n-i elements, starting at the diagonal entry A[i][i]. */
    for (i = 0; i < n; i++) {
        block_lengths[i] = n - i;
        displacements[i] = (n + 1) * i;
    }
    MPI_Type_indexed(n, block_lengths, displacements, MPI_FLOAT, &index_mpi_t);
    MPI_Type_commit(&index_mpi_t);
    if (my_rank == 0) {
        MPI_Send(A, 1, index_mpi_t, 1, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(T, 1, index_mpi_t, 0, 0, MPI_COMM_WORLD, &status);
    }
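The layout arithmetic on this slide can be checked without MPI. The following sketch isolates the loop in a hypothetical helper, upper_triangle_layout (a name assumed here, not taken from the lecture), and also returns the number of elements selected.

```c
/* Hypothetical helper: fill block_lengths and displacements (both in units
 * of elements) for the upper triangle of an n x n row-major matrix.
 * Row i contributes n-i elements starting at offset i*n + i = (n+1)*i.
 * Returns the total number of elements selected, which is n*(n+1)/2. */
int upper_triangle_layout(int n, int block_lengths[], int displacements[])
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        block_lengths[i] = n - i;
        displacements[i] = (n + 1) * i;
        total += block_lengths[i];
    }
    return total;
}
```

With n = 4 this gives block lengths 4, 3, 2, 1 at displacements 0, 5, 10, 15, for 10 elements in total.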

Slide 9: Type Matching
When can a receiving process match the data sent by a sending process?
    MPI_Send(message, send_count, send_mpi_t, 1, 0, MPI_COMM_WORLD)
    MPI_Recv(message, recv_count, recv_mpi_t, 0, 0, MPI_COMM_WORLD, &status)
Given a derived type {(t_0, d_0), …, (t_{n-1}, d_{n-1})}
- Displacements do not matter
- The type signatures {t_0, …, t_{n-1}} (sent) and {u_0, …, u_{m-1}} (received) must be compatible
- n ≤ m, and t_i = u_i for i = 0, …, n-1
- For collective communications, the sending and receiving types must be identical

Slide 10: Type Matching Example
For type column_mpi_t (a column of a 10 × 10 array of floats)
- {(MPI_FLOAT, 0), (MPI_FLOAT, 10*sizeof(float)), (MPI_FLOAT, 20*sizeof(float)), …, (MPI_FLOAT, 90*sizeof(float))}
- The signature is {MPI_FLOAT, …, MPI_FLOAT}, that is, MPI_FLOAT 10 times
- Example: a column can be sent and received as a row
    float A[10][10];  /* stored in row-major order */

    if (my_rank == 0)
        MPI_Send(&(A[0][0]), 1, column_mpi_t, 1, 0, MPI_COMM_WORLD);
    else if (my_rank == 1)
        MPI_Recv(&(A[0][0]), 10, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);

Slide 11: Pack and Unpack
MPI_Pack and MPI_Unpack allow a user to copy non-contiguous memory locations into a contiguous buffer, and to copy a contiguous buffer back into non-contiguous memory locations.
    int MPI_Pack(void* pack_data, int in_count, MPI_Datatype datatype, void* buffer, int buffer_size, int* position, MPI_Comm comm)
- On input, the data in pack_data is copied into buffer starting at offset *position
- On output, *position refers to the first location in buffer after the packed data
    int MPI_Unpack(void* buffer, int size, int* position, void* unpack_data, int count, MPI_Datatype datatype, MPI_Comm comm)
- Data starting at offset *position in buffer is copied into the memory referenced by unpack_data
- count data elements of type datatype are copied into unpack_data
- *position is updated to refer to the location in buffer after the data just copied
Messages constructed with MPI_Pack should be communicated with the datatype argument MPI_PACKED.
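The position argument acts as a cursor that each call advances. The sketch below imitates only that bookkeeping with raw bytes; pack_bytes and unpack_bytes are hypothetical stand-ins written for this illustration, and the real routines work in terms of MPI datatypes with an implementation-defined packed size.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the cursor handling of MPI_Pack: copy nbytes
 * from data into buffer at offset *position, then advance *position.
 * Returns 0 on success, -1 if the data would not fit in the buffer. */
int pack_bytes(const void *data, size_t nbytes,
               char *buffer, size_t buffer_size, size_t *position)
{
    if (*position > buffer_size || nbytes > buffer_size - *position) {
        return -1;
    }
    memcpy(buffer + *position, data, nbytes);
    *position += nbytes;
    return 0;
}

/* Hypothetical stand-in for MPI_Unpack: copy nbytes from buffer at offset
 * *position into data, then advance *position.
 * Returns 0 on success, -1 if the read would run past the buffer. */
int unpack_bytes(const char *buffer, size_t buffer_size, size_t *position,
                 void *data, size_t nbytes)
{
    if (*position > buffer_size || nbytes > buffer_size - *position) {
        return -1;
    }
    memcpy(data, buffer + *position, nbytes);
    *position += nbytes;
    return 0;
}
```

Both sides start with the cursor at 0 and must pack and unpack the items in the same order, which is the discipline the variable-length message example in this lecture follows.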

Slide 12: Deciding which Method to Use
- Creating a derived type has overhead associated with it, so whether it pays off depends on the number of times the type will be used
- Pack/Unpack can avoid system buffering
- Pack/Unpack can handle variable-length messages

Slide 13: Variable Length Messages
    float* entries;
    int* column_subscripts;
    int nonzeros;
    int position;
    int row_number;
    char buffer[HUGE];

    if (my_rank == 0) {
        position = 0;
        /* Pack the element count first so the receiver knows how much to allocate. */
        MPI_Pack(&nonzeros, 1, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
        MPI_Pack(&row_number, 1, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
        MPI_Pack(entries, nonzeros, MPI_FLOAT, buffer, HUGE, &position, MPI_COMM_WORLD);
        MPI_Pack(column_subscripts, nonzeros, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    }

Slide 14: Variable Length Messages (cont)
    else {
        MPI_Recv(buffer, HUGE, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
        position = 0;
        /* Unpack in the same order the sender packed. */
        MPI_Unpack(buffer, HUGE, &position, &nonzeros, 1, MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, HUGE, &position, &row_number, 1, MPI_INT, MPI_COMM_WORLD);
        entries = (float*) malloc(nonzeros*sizeof(float));
        column_subscripts = (int*) malloc(nonzeros*sizeof(int));
        MPI_Unpack(buffer, HUGE, &position, entries, nonzeros, MPI_FLOAT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, HUGE, &position, column_subscripts, nonzeros, MPI_INT, MPI_COMM_WORLD);
    }

Slide 15: Communicators
A mechanism to treat a subset of processes as a universe for communication (both point-to-point and collective).
Types
- intra-communicator
- inter-communicator
Components
- group (ordered collection of processes)
- context (unique identifier)
- optional additional information such as topology
New communicators are created from existing communicators.

Slide 16: Working with Groups, Contexts, and Communicators
    MPI_Comm_group(MPI_Comm comm, MPI_Group* group)
    MPI_Group_incl(MPI_Group old_group, int new_group_size, int ranks_in_old_group[], MPI_Group* new_group)
    MPI_Comm_create(MPI_Comm old_comm, MPI_Group group, MPI_Comm* new_comm)

Slide 17: Creating a Communicator
    /* Create communicator out of first row of q^2 processes
       organized in a q × q grid in row-major order. */
    MPI_Group group_world;
    MPI_Group first_row_group;
    MPI_Comm first_row_comm;
    int* process_ranks;
    int proc;

    process_ranks = (int*) malloc(q*sizeof(int));
    for (proc = 0; proc < q; proc++)
        process_ranks[proc] = proc;
    MPI_Comm_group(MPI_COMM_WORLD, &group_world);
    MPI_Group_incl(group_world, q, process_ranks, &first_row_group);
    MPI_Comm_create(MPI_COMM_WORLD, first_row_group, &first_row_comm);

Slide 18: Using a Communicator
    /* Broadcast first block to all processes in the same row. */
    int my_rank_in_first_row;
    float* A_00;

    if (my_rank < q) {
        MPI_Comm_rank(first_row_comm, &my_rank_in_first_row);
        A_00 = (float*) malloc(n_bar*n_bar*sizeof(float));
        if (my_rank_in_first_row == 0) { /* initialize A_00 */ }
        MPI_Bcast(A_00, n_bar*n_bar, MPI_FLOAT, 0, first_row_comm);
    }

Slide 19: MPI_Comm_split
    int MPI_Comm_split(MPI_Comm old_comm, int split_key, int rank_key, MPI_Comm* new_comm)

    MPI_Comm my_row_comm;
    int my_row;

    /* my_rank is in MPI_COMM_WORLD, q*q = p */
    my_row = my_rank/q;
    MPI_Comm_split(MPI_COMM_WORLD, my_row, my_rank, &my_row_comm);
    /* Creates q new communicators. Processes with the same value of
       split_key form a new group. The rank in the new communicator is
       determined by rank_key, and order is preserved. If several processes
       pass the same rank_key, ties are broken by their rank in old_comm. */
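As a check on what the call above produces, the sketch below computes, without MPI, which row communicator a process joins and the rank it would get there. It assumes the row-major q × q layout from this slide and rank_key = my_rank; row_of and rank_in_row are hypothetical helper names.

```c
/* Hypothetical helper: the split_key used on this slide, i.e. the grid row
 * of a process when q*q processes are laid out in row-major order. */
int row_of(int world_rank, int q)
{
    return world_rank / q;
}

/* Hypothetical helper: the rank a process would have in its row
 * communicator, assuming rank_key is the world rank. Ranks in a row are
 * consecutive, so ordering by world rank gives the column index. */
int rank_in_row(int world_rank, int q)
{
    return world_rank % q;
}
```

For q = 3, world rank 7 has row_of equal to 2 and rank_in_row equal to 1, so it would be the second process of the third row communicator.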

Slide 20: Topologies
Communicators can have attributes. One such attribute is a topology.
A topology is a mechanism for associating a different addressing scheme with the processes belonging to a group.
It provides a virtual interconnection organization of processes that may be convenient for a particular algorithm.
Types
- Cartesian (grid)
- Graph

Slide 21: Working with Cartesian Topologies
    MPI_Cart_create(MPI_Comm old_comm, int number_of_dims, int dim_sizes[], int wrap_around[], int reorder, MPI_Comm* cart_comm)
    MPI_Cart_rank(MPI_Comm comm, int coordinates[], int* rank)
    MPI_Cart_coords(MPI_Comm comm, int rank, int number_of_dims, int coordinates[])
    MPI_Cart_sub(MPI_Comm cart_comm, int free_coords[], MPI_Comm* new_comm)

Slide 22: Creating a Cartesian Topology
    /* Create communicator with 2D grid topology. */
    MPI_Comm grid_comm;
    int dim_sizes[2];
    int wrap_around[2];
    int reorder = 1;

    dim_sizes[0] = dim_sizes[1] = q;
    wrap_around[0] = wrap_around[1] = 1;  /* both dimensions are periodic */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, wrap_around, reorder, &grid_comm);

Slide 23: Creating a Sub-Cartesian Topology
    int free_coords[2];
    MPI_Comm row_comm;
    MPI_Comm col_comm;

    /* Create communicator for each row of grid_comm:
       the row coordinate is fixed, the column coordinate varies. */
    free_coords[0] = 0;
    free_coords[1] = 1;
    MPI_Cart_sub(grid_comm, free_coords, &row_comm);

    /* Create communicator for each column of grid_comm. */
    free_coords[0] = 1;
    free_coords[1] = 0;
    MPI_Cart_sub(grid_comm, free_coords, &col_comm);

Slide 24: Cartesian Addressing
    int coordinates[2];
    int my_grid_rank;

    MPI_Comm_rank(grid_comm, &my_grid_rank);
    MPI_Cart_coords(grid_comm, my_grid_rank, 2, coordinates);

    /* inverse operation */
    MPI_Cart_rank(grid_comm, coordinates, &my_grid_rank);
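The two conversions on this slide are inverses of each other. A minimal sketch of the arithmetic, assuming the row-major numbering that MPI uses for ranks inside a Cartesian communicator and ignoring periodic wrap-around of out-of-range coordinates, is given below; coords_of_rank and rank_of_coords are hypothetical names, not MPI routines.

```c
/* Hypothetical helper: rank -> coordinates under row-major numbering.
 * The last dimension varies fastest. */
void coords_of_rank(int rank, int ndims, const int dim_sizes[], int coords[])
{
    for (int d = ndims - 1; d >= 0; d--) {
        coords[d] = rank % dim_sizes[d];
        rank /= dim_sizes[d];
    }
}

/* Hypothetical helper: coordinates -> rank under row-major numbering.
 * Coordinates are assumed to lie in 0..dim_sizes[d]-1. */
int rank_of_coords(int ndims, const int dim_sizes[], const int coords[])
{
    int rank = 0;
    for (int d = 0; d < ndims; d++) {
        rank = rank * dim_sizes[d] + coords[d];
    }
    return rank;
}
```

In a 3 × 3 grid, rank 5 corresponds to coordinates (1, 2), and converting (1, 2) back yields 5.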

Slide 25: Matrix Multiplication
Let A, B be n × n matrices, and C = A*B
    void Serial_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                C[i][j] = 0.0;
                for (k = 0; k < n; k++)
                    C[i][j] = C[i][j] + A[i][k]*B[k][j];
            }
    }

Slide 26: Parallel Matrix Multiplication
    /* Distribute matrices by rows. */
    void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
        for each column of B {
            Allgather(column);
            Compute dot product of my row of A with column;
        }
    }
    /* Can distribute matrices by blocks of rows.
       Also B could be distributed by columns. */

Slide 27: Cyclic Matrix Multiplication
    /* Arrange processors in a circle, storing rows of A and B in each process.
       C[i,*] = A[i,0]*B[0,*] + … + A[i,n-1]*B[n-1,*] */
    void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
        i = rank;
        Blocal = ith row of B;
        Alocal = ith row of A;
        Clocal = 0;  /* ith row of C */
        /* Send to the previous process and receive from the next, so that
           after k steps Blocal holds row (i+k) mod n of B. Adding n keeps
           the result of % non-negative in C. */
        dest = (i - 1 + n) % n;
        src = (i + 1) % n;
        for (k = 0; k < n; k++) {
            C[i,*] = C[i,*] + A[i,(i+k) mod n] * B[(i+k) mod n,*];  /* B row is in Blocal */
            send_recv(Blocal, dest, src);
        }
    }

Slide 28: Fox's Matrix Multiplication
Let A, B be n × n matrices, and C = A*B
Organize the p processors into a q × q grid, q = sqrt(p)
Store block (i,j) of each matrix on processor (i,j)
At stage k = 0, …, q-1, broadcast block A[i,(i+k) mod q] across process row i
Cyclically rotate the blocks of B upward along each process column.

Slide 29: Example (q = 3; each entry shows the blocks of A and B multiplied on process (i,j))
Stage 0:
    A00 B00   A00 B01   A00 B02
    A11 B10   A11 B11   A11 B12
    A22 B20   A22 B21   A22 B22
Stage 1:
    A01 B10   A01 B11   A01 B12
    A12 B20   A12 B21   A12 B22
    A20 B00   A20 B01   A20 B02
Stage 2:
    A02 B20   A02 B21   A02 B22
    A10 B00   A10 B01   A10 B02
    A21 B10   A21 B11   A21 B12

Slide 30: Fox's Matrix Multiplication
    /* Uses a block matrix allocation. Group processors in a q × q grid,
       where q = sqrt(p). Processor (i,j) stores A[i,j] and initially B[i,j]. */
    void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
        i = my process row;
        j = my process column;
        dest = ((i - 1 + q) % q, j);  /* adding q keeps % non-negative in C */
        src = ((i + 1) % q, j);
        for (stage = 0; stage < q; stage++) {
            k_bar = (i + stage) mod q;
            Broadcast A[i,k_bar] across process row i;
            C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j];
            Send B[k_bar,j] to dest;
            Receive B[(k_bar+1) mod q,j] from src;
        }
    }
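The staging in Fox's algorithm can be checked on a single machine. The following is a serial simulation under stated assumptions, not an MPI implementation: the q × q process grid becomes a pair of loops, the row broadcast becomes a shared pointer, and the upward rotation of B becomes a block copy. The names fox_simulate and block_multiply_add and the block-contiguous storage layout are choices made for this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* c += a * b for nb x nb row-major blocks. */
static void block_multiply_add(const double *a, const double *b,
                               double *c, int nb)
{
    for (int r = 0; r < nb; r++)
        for (int col = 0; col < nb; col++)
            for (int k = 0; k < nb; k++)
                c[r * nb + col] += a[r * nb + k] * b[k * nb + col];
}

/* Serial simulation of Fox's algorithm on a q x q grid of nb x nb blocks.
 * Block (i,j) of each matrix is stored contiguously at index (i*q + j).
 * Returns 0 on success, -1 if memory allocation fails. */
int fox_simulate(const double *A, const double *B, double *C, int q, int nb)
{
    size_t block = (size_t)nb * (size_t)nb;
    size_t total = (size_t)q * (size_t)q * block;
    double *Bcur = malloc(total * sizeof *Bcur);
    double *Bnext = malloc(total * sizeof *Bnext);
    if (Bcur == NULL || Bnext == NULL) {
        free(Bcur);
        free(Bnext);
        return -1;
    }
    memcpy(Bcur, B, total * sizeof *Bcur);
    memset(C, 0, total * sizeof *C);

    for (int stage = 0; stage < q; stage++) {
        for (int i = 0; i < q; i++) {
            /* "Broadcast" A[i,k_bar] across process row i. */
            int k_bar = (i + stage) % q;
            const double *a_bcast = A + ((size_t)i * q + k_bar) * block;
            for (int j = 0; j < q; j++) {
                size_t ij = ((size_t)i * q + j) * block;
                block_multiply_add(a_bcast, Bcur + ij, C + ij, nb);
            }
        }
        /* Rotate B upward: process (i,j) receives from ((i+1) mod q, j). */
        for (int i = 0; i < q; i++) {
            for (int j = 0; j < q; j++) {
                size_t ij = ((size_t)i * q + j) * block;
                size_t below = ((size_t)((i + 1) % q) * q + j) * block;
                memcpy(Bnext + ij, Bcur + below, block * sizeof *Bnext);
            }
        }
        double *tmp = Bcur;
        Bcur = Bnext;
        Bnext = tmp;
    }
    free(Bcur);
    free(Bnext);
    return 0;
}
```

With nb = 1 the block layout coincides with ordinary row-major storage, so calling fox_simulate on two small matrices and comparing against the serial triple loop is an easy sanity check.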

Slide 31: Variant of Fox's Matrix Multiplication
Let A, B be n × n matrices partitioned into a q × q array of blocks, and C = A*B
Organize the p processors into a sqrt(p) × sqrt(p) grid, q = sqrt(p)
Store block (i, (j+i) mod q) of A and block ((i+j) mod q, j) of B on processor (i,j)
Cyclically rotate rows of A to the left.
Cyclically rotate columns of B upward.

Slide 32: Example (q = 3; each entry shows the blocks of A and B multiplied on process (i,j))
Stage 0:
    A00 B00   A01 B11   A02 B22
    A11 B10   A12 B21   A10 B02
    A22 B20   A20 B01   A21 B12
Stage 1:
    A01 B10   A02 B21   A00 B02
    A12 B20   A10 B01   A11 B12
    A20 B00   A21 B11   A22 B22
Stage 2:
    A02 B20   A00 B01   A01 B12
    A10 B00   A11 B11   A12 B22
    A21 B10   A22 B21   A20 B02

Slide 33: Variant of Fox's Matrix Multiplication
    /* Uses a block matrix allocation. Group processors in a q × q grid,
       where q = sqrt(p). Processor (i,j) initially stores A[i,(i+j) mod q]
       and B[(i+j) mod q,j]. */
    void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n) {
        i = my process row;
        j = my process column;
        /* adding q keeps % non-negative in C */
        coldest = ((i - 1 + q) % q, j);
        colsrc = ((i + 1) % q, j);
        rowdest = (i, (j - 1 + q) % q);
        rowsrc = (i, (j + 1) % q);
        for (stage = 0; stage < q; stage++) {
            k_bar = (i + j + stage) mod q;
            C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j];
            Send_Recv A[i,k_bar] to/from rowdest, rowsrc;
            Send_Recv B[k_bar,j] to/from coldest, colsrc;
        }
    }

Slide 34: Performance Model
Communication cost: C(n) = α + βn
- α = latency
- 1/β = bandwidth
Empirically determine α and β by measuring the time to send/recv messages with different lengths
- Least squares fit
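The least squares fit mentioned on this slide has a closed form for a straight line. The sketch below fits measured times to a latency-plus-per-element cost; fit_latency_bandwidth is a hypothetical helper name, and the timing samples are assumed to have been collected separately (for example with ping-pong messages of several lengths).

```c
/* Hypothetical helper: ordinary least-squares fit of t[i] to
 * alpha + beta * n[i], where n[i] is a message length and t[i] the
 * measured time. Returns 0 on success, -1 if the fit is undetermined
 * (fewer than two samples, or all lengths equal). */
int fit_latency_bandwidth(const double n[], const double t[], int count,
                          double *alpha, double *beta)
{
    if (count < 2) {
        return -1;
    }
    double sum_n = 0.0, sum_t = 0.0, sum_nn = 0.0, sum_nt = 0.0;
    for (int i = 0; i < count; i++) {
        sum_n += n[i];
        sum_t += t[i];
        sum_nn += n[i] * n[i];
        sum_nt += n[i] * t[i];
    }
    double denom = count * sum_nn - sum_n * sum_n;
    if (denom == 0.0) {
        return -1;
    }
    *beta = (count * sum_nt - sum_n * sum_t) / denom;   /* slope */
    *alpha = (sum_t - *beta * sum_n) / count;           /* intercept */
    return 0;
}
```

The fitted intercept estimates the latency and the reciprocal of the fitted slope estimates the bandwidth, in whatever units the samples use.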

Slide 35: Analysis of Matrix Multiplication
Let A, B be n × n matrices, and C = A*B
Sequential cost
- T(n) = an^3 + bn^2 + cn + d = Θ(n^3)
- Least squares fit
- T(n) ≈ an^3

Slide 36: Analysis of Parallel Matrix Multiplication (Allgather)
Let A, B be n × n matrices, and C = A*B
Let p = number of processors
Store the i-th block of n/p rows of A, B, and C on process i
Parallel computing time: Θ(n^3/p + p log(p) + n^2 log(p))
Computation time: p(n/p × n × n/p) = n^3/p
Communication time: p log(p)(α + βn^2/p) [Allgather]

Slide 37: Analysis of Parallel Matrix Multiplication (Cyclic)
Let A, B be n × n matrices, and C = A*B
Let p = number of processors
Store the i-th block of n/p rows of A, B, and C on process i
Parallel computing time: Θ(n^3/p + p + n^2)
Computation time: p(n/p × n/p × n) = n^3/p
Communication time: p(α + βn^2/p)

Slide 38: Analysis of Parallel Matrix Multiplication (Fox)
Let A, B be n × n matrices, and C = A*B
Let p = q^2 be the number of processors, organized in a q × q grid
Store the (i,j)-th n/q × n/q block of A, B, and C on process (i,j)
Parallel computing time: Θ(n^3/p + q log(q) + log(q) n^2/q)
Computation time: q(n/q × n/q × n/q) = n^3/q^2 = n^3/p
Communication time: q log(q)(α + β(n/q)^2)

Slide 39: Analysis of Parallel Matrix Multiplication (Fox Variant)
Let A, B be n × n matrices, and C = A*B
Let p = q^2 be the number of processors, organized in a q × q grid
Store the (i,j)-th n/q × n/q block of A, B, and C on process (i,j)
Parallel computing time: Θ(n^3/p + q + n^2/q)
Computation time: q(n/q × n/q × n/q) = n^3/q^2 = n^3/p
Communication time: q(α + β(n/q)^2)
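The two communication-time formulas for Fox's algorithm and its variant differ only by the log(q) factor from the row broadcast. The sketch below evaluates both so they can be compared for given parameters. It assumes the logarithm is base 2 and rounds it up to a whole number of broadcast steps, which the slides do not specify; the function names are hypothetical.

```c
/* Number of steps of a tree broadcast among q processes, assuming a
 * binary tree: the smallest s with 2^s >= q. */
static int ceil_log2(int q)
{
    int steps = 0;
    int reach = 1;
    while (reach < q) {
        reach *= 2;
        steps++;
    }
    return steps;
}

/* Modeled communication time of Fox's algorithm:
 * q * log(q) * (alpha + beta * (n/q)^2). */
double fox_comm_cost(double alpha, double beta, double n, int q)
{
    double block = (n / q) * (n / q);
    return (double)q * ceil_log2(q) * (alpha + beta * block);
}

/* Modeled communication time of the variant:
 * q * (alpha + beta * (n/q)^2). */
double fox_variant_comm_cost(double alpha, double beta, double n, int q)
{
    double block = (n / q) * (n / q);
    return (double)q * (alpha + beta * block);
}
```

Under this model the variant is cheaper by exactly the log(q) factor: with α = 10, β = 0.5, n = 8, and q = 4, the modeled costs are 96 and 48.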