1 Lecture 8 Architecture Independent (MPI) Algorithm Design Parallel Computing Fall 2007

2 Traditional PRAM Algorithm vs. Architecture Independent Parallel Algorithm Design
Under the PRAM model, synchronization is ignored and is therefore treated as free, since PRAM processors work synchronously. Communication is also ignored, because in the PRAM the cost of accessing shared memory is as small as the cost of accessing a processor's local registers.
In practice, however, the exchange of data can significantly impact the efficiency of parallel programs by introducing interaction delays during their execution. A simple exchange of an m-word message between two processes running on different nodes of an interconnection network with cut-through routing takes roughly t_s + m·t_w time, where
t_s: latency, or the startup time for the data transfer
t_w: per-word transfer time, which is inversely proportional to the available bandwidth between the nodes.
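As a rough illustration (not from the slides), the point-to-point cost model can be written as a tiny helper; the constants in the comment are made-up placeholder values, not measurements.

/* Toy model of the cut-through point-to-point cost t_s + m*t_w. */
double transfer_time(double ts, double tw, int m)
{
    return ts + m * tw;
}
/* Example with placeholder values ts = 50e-6 s and tw = 1e-8 s/word:
   a 1000-word message costs 50e-6 + 1000*1e-8 = 60e-6 s, so the startup
   term still dominates at this message size. */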

3 Basic Communication Operations – One-to-all Broadcast and All-to-one Reduction
Assume that p processes participate in the operation and that the data to be broadcast or reduced contains m words.
A one-to-all broadcast or all-to-one reduction involves log p point-to-point message transfers, each at a time cost of t_s + m·t_w. The total time taken by the procedure is therefore
T = (t_s + m·t_w)·log p
This holds for all of the interconnection networks considered.
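These two operations correspond directly to the MPI collectives MPI_Bcast and MPI_Reduce; a minimal usage sketch (assuming an m-int buffer and root process 0):

#include <mpi.h>

/* One-to-all broadcast and all-to-one reduction of an m-word buffer,
   rooted at process 0; the cost analyzed above, (t_s + m*t_w) log p,
   is what a typical implementation achieves for these calls. */
void bcast_and_reduce(int *buf, int *sum, int m, MPI_Comm comm)
{
    MPI_Bcast(buf, m, MPI_INT, 0, comm);                  /* one-to-all broadcast */
    MPI_Reduce(buf, sum, m, MPI_INT, MPI_SUM, 0, comm);   /* all-to-one reduction */
}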

4 All-to-all Broadcast and Reduction
Linear Array and Ring: p different messages circulate in the p-node ensemble. If communication is performed circularly in a single direction, then each node receives all (p − 1) pieces of information from the other nodes in (p − 1) steps. The total time is:
T = (t_s + m·t_w)(p − 1)
2-D Mesh: based on the linear-array algorithm, treating the rows and columns of the mesh as linear arrays. Two phases:
Phase one: each row of the mesh performs an all-to-all broadcast using the procedure for the linear array. In this phase, every node collects the √p messages corresponding to the √p nodes of its row and consolidates them into a single message of size m·√p. The time for this phase is:
T1 = (t_s + m·t_w)(√p − 1)
Phase two: column-wise all-to-all broadcast of the consolidated messages. By the end of this phase, each node has obtained all p pieces of m-word data that originally resided on different nodes. The time for this phase is:
T2 = (t_s + m·√p·t_w)(√p − 1)
The time for the entire all-to-all broadcast on a p-node two-dimensional square mesh is the sum of the times spent in the individual phases:
T = 2·t_s(√p − 1) + m·t_w(p − 1)
Hypercube: the same idea extends across the log p dimensions of the hypercube, with the message size doubling in each step, giving
T = t_s·log p + m·t_w(p − 1)
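A minimal MPI sketch of the single-direction ring algorithm, assuming each process contributes one int and using MPI_Sendrecv so the blocking exchanges cannot deadlock:

#include <mpi.h>

/* All-to-all broadcast on a logical ring: in each of the p-1 steps every
   process forwards the piece it obtained in the previous step to its right
   neighbour, so after p-1 steps all[] holds one int from every process. */
void allgather_ring(int mine, int *all, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;

    all[rank] = mine;
    int send_idx = rank;
    for (int step = 1; step < p; step++) {
        int recv_idx = (rank - step + p) % p;      /* piece arriving from the left */
        MPI_Sendrecv(&all[send_idx], 1, MPI_INT, right, 0,
                     &all[recv_idx], 1, MPI_INT, left, 0,
                     comm, MPI_STATUS_IGNORE);
        send_idx = recv_idx;                       /* forward it in the next step */
    }
}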

5 Traditional PRAM Algorithm vs. Architecture Independent Parallel Algorithm Design
As an example of how traditional PRAM algorithm design differs from architecture independent parallel algorithm design, an algorithm for broadcasting in a parallel machine is examined.
Problem: in a parallel machine with p processors numbered 0,..., p − 1, one of them, say processor 0, holds a one-word message. The problem of broadcasting involves the dissemination of this message to the local memory of the remaining p − 1 processors.
The performance of a well-known exclusive-access PRAM algorithm for broadcasting is analyzed below in two ways, under the assumption that no concurrent operations are allowed. One follows the traditional (PRAM) analysis that minimizes parallel running time. The other takes into consideration the issues of communication and synchronization. This leads to a modification of the PRAM-based algorithm into an architecture independent algorithm for broadcasting whose performance is consistent with observations of broadcasting operations on real parallel machines.

6 Broadcasting: PRAM Algorithm 1
Algorithm 1. Without loss of generality, assume that p is a power of two. The message is broadcast in lg p rounds of communication by binary replication. In round i = 1,..., lg p, each processor j with index j < 2^(i−1) sends the message it currently holds to processor j + 2^(i−1) (on a shared-memory system, this may mean copying the information into a cell read by that processor). The number of processors holding the message at the end of round i is thus 2^i.
Analysis of Algorithm 1. Under the PRAM model the algorithm requires lg p communication rounds, and as many parallel steps, to complete. This cost, however, ignores synchronization, which is free since PRAM processors work synchronously. It also ignores communication, as in the PRAM the cost of accessing shared memory is as small as the cost of accessing local registers.
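A minimal MPI sketch of the binary-replication rounds, assuming the message is a single int (the bounds checks also cover p that is not a power of two):

#include <mpi.h>

/* Algorithm 1 (binary replication): in the round where `half` processors
   already hold the message, ranks 0..half-1 send it to ranks half..2*half-1. */
void broadcast_binary(int *msg, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int half = 1; half < p; half *= 2) {
        if (rank < half && rank + half < p)
            MPI_Send(msg, 1, MPI_INT, rank + half, 0, comm);
        else if (rank >= half && rank < 2 * half)
            MPI_Recv(msg, 1, MPI_INT, rank - half, 0, comm, MPI_STATUS_IGNORE);
    }
}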

7 Broadcasting: PRAM Algorithm 1
Under the MPI cost model each communication round is assigned a cost of max{t_s, t_w·1}, since in each round a processor sends or receives at most one message containing the one-word payload. The BSP cost of the algorithm is lg p · max{t_s, t_w·1}, as there are lg p rounds of communication.
Because the information communicated by any processor is small, latency is likely to dominate the transmission time (i.e., the bandwidth-based cost t_w·1 is insignificant compared to the latency/synchronization term t_s). On high-latency machines the dominant term is t_s·lg p rather than t_w·lg p. Even though each communication round lasts for at least t_s time units, only a small fraction (proportional to t_w) of it is used for actual communication; the remainder is wasted. It therefore makes sense to increase communication-round utilization so that each processor sends the one-word message to as many processors as it can accommodate within a round.
The total time is: lg p · (t_s + t_w)

8 Broadcasting: PRAM Algorithm 2
Input: p processors numbered 0,..., p − 1. Processor 0 holds a message of length equal to one word.
Output: the message is disseminated to the remaining p − 1 processors.
Algorithm 2. In one superstep, processor 0 sends the message to be broadcast to processors 1,..., p − 1 in turn (a "sequential"-looking algorithm).
Analysis of Algorithm 2. The communication time of Algorithm 2 is 1 · max{t_s, (p − 1)·t_w} (in a single superstep, the message is replicated p − 1 times by processor 0).
The total time is: t_s + (p − 1)·t_w
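A minimal MPI sketch of this flat scheme, again assuming a one-int message:

#include <mpi.h>

/* Algorithm 2 (flat broadcast): rank 0 sends the message to every other
   rank in turn; all p-1 sends are charged to the single superstep. */
void broadcast_flat(int *msg, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    if (rank == 0) {
        for (int dest = 1; dest < p; dest++)
            MPI_Send(msg, 1, MPI_INT, dest, 0, comm);    /* p-1 sends by rank 0 */
    } else {
        MPI_Recv(msg, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
    }
}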

9 Broadcasting: PRAM Algorithm 3
Algorithm 3. Both Algorithm 1 and Algorithm 2 can be viewed as extreme cases of an Algorithm 3. The main observation is that up to L/g words can be sent in a superstep at a cost of t_s, so it makes sense for each processor to send the one-word message to as many processors as a round can accommodate. Let k − 1 be the number of messages a processor sends to other processors in a broadcasting step. The number of processors holding the message at the end of a broadcasting superstep is then k times larger than at its start. We call k the degree of replication of the broadcast operation.
Architecture independent Algorithm 3. In each round, every processor that holds the message sends it to k − 1 other processors. In round i = 0, 1,..., each processor j with index j < k^i sends the message to the k − 1 distinct processors numbered j + k^i·l, where l = 1,..., k − 1. At the end of round i (the (i+1)-st overall round), the message has reached k^i·(k − 1) + k^i = k^(i+1) processors. The number of rounds required is the minimum integer r such that k^r ≥ p; the number of rounds necessary for full dissemination is thus reduced to log_k p, and the total cost becomes log_k p · max{t_s, (k − 1)·t_w}.
At the end of each superstep the number of processors possessing the message is k times that of the previous superstep, and during each superstep each processor sends the message to exactly k − 1 other processors. Algorithm 3 therefore uses a number of rounds between 1 (k = p, when it becomes Algorithm 2) and lg p (k = 2, when it becomes Algorithm 1).
The total time is: log_k p · (t_s + (k − 1)·t_w)

10 Broadcasting: PRAM Algorithm 3

Broadcast(0, p, k)
{
    my_pid = pid();  mask_pid = 1;
    while (mask_pid < p) {
        if (my_pid < mask_pid) {
            /* this processor already holds the message: send it to up to k-1 others */
            for (i = 1, j = mask_pid; i < k; i++, j += mask_pid) {
                target_pid = my_pid + j;
                if (target_pid < p)
                    mpi_put(target_pid, &M, &M, 0, sizeof(M));   /* or mpi_send ... */
            }
        } else if (my_pid >= mask_pid && my_pid < k * mask_pid) {
            /* receivers of this round: every such processor is sent the message
               by processor my_pid % mask_pid */
            mpi_get();                                           /* or mpi_recv ... */
        }
        mask_pid = mask_pid * k;
    }
}

11 Broadcasting n > p words: Algorithm 4
Now suppose that the message to be broadcast consists not of a single word but is of size n > p. Algorithm 4 may then be a better choice than the previous algorithms, in each of which some processor sends or receives substantially more than n words of information (assuming n·t_w >> t_s).
There is a broadcasting algorithm, call it Algorithm 4, that requires only two communication rounds and is optimal (for the communication model abstracted by t_s and t_w) in terms of the amount of information (up to a constant) that each processor sends or receives.
Algorithm 4. Two-phase broadcasting. The idea is to split the message into p pieces, have processor 0 send piece i to processor i in the first round, and in the second round have processor i replicate the i-th piece p − 1 times by sending one copy to each of the remaining p − 1 processors (see the attached figure).
The total time is that of the piece distribution (p − 1 one-to-one transfers) plus one all-to-all broadcast:
(t_s + (n/p)·t_w)(p − 1) + (t_s + (n/p)·t_w)(p − 1) = 2(t_s + (n/p)·t_w)(p − 1)
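A minimal MPI sketch of the two phases using the corresponding collectives, assuming the n-int message divides evenly among the p processes and that msg is allocated with n ints on every rank:

#include <mpi.h>
#include <stdlib.h>

/* Two-phase broadcast: MPI_Scatter distributes piece i to rank i (phase one);
   MPI_Allgather then performs the all-to-all broadcast that reassembles the
   full n-word message on every rank (phase two). */
void broadcast_two_phase(int *msg, int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int piece = n / p;                                 /* assume p divides n */
    int *mine = malloc(piece * sizeof(int));

    /* Phase one: processor 0 sends piece i to processor i. */
    MPI_Scatter(msg, piece, MPI_INT, mine, piece, MPI_INT, 0, comm);

    /* Phase two: every processor broadcasts its piece to all others. */
    MPI_Allgather(mine, piece, MPI_INT, msg, piece, MPI_INT, comm);

    free(mine);
}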

12 [Figure: two-phase broadcasting of an n-word message among p processors, referenced on the previous slide]