Bulk Synchronous Processing (BSP) Model Course: CSC 8350 Instructor: Dr. Sushil Prasad Presented by: Chris Moultrie

Outline: The Model; Computation on the BSP Model; Automatic Memory Management; Matrix Multiplication; Computational Analysis; BSP vs. PRAM; BSPRAM.

The Model This model was proposed by Leslie Valiant in 1990. It is a combination of three attributes: components, which perform processing and/or memory functions; a router, which delivers messages among the components; and a periodicity parameter L, which facilitates synchronization at regular intervals of L time units.

Computation on BSP Model A computation consists of several supersteps. A superstep consists of: a computation phase in which each processor uses only locally held values; a global message transmission from each processor to any subset of the others; and a barrier synchronization. At the end of a superstep, the transmitted messages become available as local data for the next superstep.
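
A minimal sketch of this superstep discipline in Python, assuming a toy single-process simulator rather than a real BSP runtime (the function and data names are illustrative): each processor computes on locally held values only, posts outgoing messages, and the barrier delivers them so they appear as local data in the next superstep.

```python
def run_superstep(local_state, compute, outboxes):
    """One BSP superstep: local computation, then global message delivery at the barrier."""
    p = len(local_state)
    # 1. Local computation: each processor uses only locally held values.
    for pid in range(p):
        compute(pid, local_state[pid], outboxes[pid])
    # 2. Communication + barrier synchronization: messages posted in this
    #    superstep only become visible as local data in the next superstep.
    inboxes = [[] for _ in range(p)]
    for pid in range(p):
        for dest, msg in outboxes[pid]:
            inboxes[dest].append(msg)
        outboxes[pid].clear()
    for pid in range(p):
        local_state[pid]["received"] = inboxes[pid]

# Example: every processor sends its id to processor 0 in one superstep.
p = 4
state = [{"received": []} for _ in range(p)]
out = [[] for _ in range(p)]
run_superstep(state, lambda pid, st, ob: ob.append((0, pid)), out)
print(state[0]["received"])  # [0, 1, 2, 3]
```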

Continued.. The components can be seen as processors, the router as the interconnection network, and the periodicity parameter as the barrier. (Figure: one superstep across the virtual processors — local computation, then global communication, then barrier synchronization.)

Components (Processors) Programmers do not need to manage memory, assign communication, or perform low-level synchronization; this is achieved by programs written with sufficient parallel slackness. When programs written for v virtual processors are run on p real processors with v >> p (e.g. v = p log p), there is parallel slackness. Parallel slackness makes work distribution more balanced than in cases such as v = p or v < p.

Barrier Synchronization After each period of L time units (the periodicity parameter), a global check is made to determine whether every processor has completed its task. If all processors have completed the superstep, the machine proceeds to the next superstep; otherwise, the next period of L units is allocated to the unfinished superstep. Synchronization can be switched off for a subset of processors; however, they can still send messages over the network.

Continued.. What is the optimal value for L? The lower bound is set by the hardware; the upper bound is set by the software, which in turn defines the granularity of the system. Optimal processor utilization can be achieved only when each processor has an independent task of approximately L steps.

The Network (Router) The network delivers messages point to point. It assumes no combining, duplicating, or broadcasting facilities. It basically realizes arbitrary h-relations; that is, each processor sends at most h messages and receives at most h messages.

Continued.. If ĝ is network throughput when it is in continuous operation and s is the latency or startup cost then h- relation is ĝh + s. If ĝh > s, then we can let g = 2ĝ and the cost of a h- relation becomes gh (an overestimate of at most 2). h-relations therefore can be realized in gh time for h larger than some h 0 If L > gh 0 then every h-relation for h < h 0 will cost as if it were a h 0 relation.

Continued.. The value of g is dictated by the network design. By increasing the bandwidth of network connections and providing better switching, the value of g is kept low. As p increases, the required communication can increase with p^2, and to maintain a fixed or low g, network costs increase similarly.

Automatic Memory Management Assume random distribution of memory and equally frequent accesses. If p accesses are made to p components, one component will receive about log p / log log p accesses with high probability, which will need Ω(log p / log log p) time units. If p log p accesses are made, the probability is high that each component will receive no more than 3 log p, and the time requirement will be O(log p). In general, if p·f(p) accesses are made, where f(p) grows faster than log p, the worst-case access rate will exceed the average rate by even smaller factors. To make the mapping from symbolic addresses to physical addresses efficiently computable, hashing can be used.
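
A quick simulation of this claim, sketched in Python with an ordinary pseudo-random choice standing in for the address hash (all parameters are illustrative): hash p accesses versus p log p accesses onto p components and report the busiest component's load.

```python
import math
import random
from collections import Counter

def max_load(num_accesses, p):
    """Hash num_accesses addresses uniformly onto p components; return the busiest component's load."""
    loads = Counter(random.randrange(p) for _ in range(num_accesses))
    return max(loads.values())

p = 1024
random.seed(0)
print("p accesses       -> max load:", max_load(p, p))                      # about log p / log log p
print("p log p accesses -> max load:", max_load(int(p * math.log2(p)), p))  # roughly <= 3 log p w.h.p.
```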

Matrix Multiplication Example: n = 16, p = 4. Each processor has to perform 2n^3/p additions and multiplications, and receives 2n^2/√p messages. Every processor holds 2n^2/p elements, which are to be sent at most √p times. This may be achieved by data replication at the source when g = O(n/√p) and L = O(n^3/p), provided h is suitably small. (Figure: an n × n matrix partitioned into blocks of size n/√p × n/√p.)
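
Plugging the slide's example values n = 16 and p = 4 into these counts (a worked illustration only):

```python
from math import isqrt

n, p = 16, 4
ops      = 2 * n**3 // p         # additions and multiplications per processor
messages = 2 * n**2 // isqrt(p)  # messages received per processor
held     = 2 * n**2 // p         # elements held per processor (each sent at most sqrt(p) times)
print(ops, messages, held)       # 2048 256 128
```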

Matrix Multiplication on Hypercube Let us assume that in g units of time a packet can traverse one edge of the hypercube; that is, a packet takes O(g log p) time to reach an arbitrary destination. In the previous example, the computational (local) bounds stay intact when implemented on a hypercube. The communication now becomes O(n log p/√p). Therefore, L = O(n^3/p) suffices, provided the network can realize the h-relations within the O(g log p) per-packet time given above.

Computational Analysis The execution time of one superstep S_i of a BSP program consisting of S supersteps is w_i + g·h_i + L, where w_i is the largest amount of work done by any processor and h_i is the largest number of messages sent or received by any processor during superstep S_i. The execution time of the entire program is W + g·H + L·S, where W = Σ w_i and H = Σ h_i, for i = 0 to S−1.
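
These formulas translate directly into a small cost calculator; the per-superstep values w_i and h_i below, as well as g and L, are made up purely for illustration.

```python
def superstep_cost(w_i, h_i, g, L):
    return w_i + g * h_i + L

def program_cost(supersteps, g, L):
    """supersteps is a list of (w_i, h_i) pairs; the total cost is W + g*H + L*S."""
    W = sum(w for w, _ in supersteps)
    H = sum(h for _, h in supersteps)
    S = len(supersteps)
    return W + g * H + L * S

steps = [(1000, 50), (800, 20), (1200, 75)]   # hypothetical (w_i, h_i) per superstep
g, L = 4, 100                                  # hypothetical machine parameters
assert program_cost(steps, g, L) == sum(superstep_cost(w, h, g, L) for w, h in steps)
print(program_cost(steps, g, L))               # 3880
```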

BSP vs. PRAM BSP can be regarded as a generalization of the PRAM model: if the BSP architecture has a small value of g (g = 1), it can be regarded as a PRAM. Hashing can be used to automatically achieve efficient memory management. The value of L determines the degree of parallel slackness required to achieve optimal efficiency; L = g = 1 corresponds to the idealized PRAM, where no slackness is required.

BSPRAM A variant of BSP, intended to support shared-memory-style programming. There are two levels of memory: the local memory of the individual processors, and a shared global memory. The network is implemented as a random-access shared memory unit. As in BSP, the computation proceeds in supersteps; a superstep consists of an input phase, a local computation phase, and an output phase.

Continued.. In the input phase a processor can read data from the main memory; in the output phase it can write data to the main memory. The processors are synchronized between supersteps; the computation within a superstep is asynchronous. There are two types of BSPRAM: EREW BSPRAM, in which every memory cell can be read from and written to only once in every superstep, and CRCW BSPRAM, which has no such restriction on memory access.
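
A toy Python sketch of one such superstep — an input phase, an asynchronous local computation phase, and an output phase — with the EREW restriction checked by assertions. All names and data here are illustrative, not taken from the paper.

```python
def bsp_ram_superstep(shared, reads, compute):
    """reads[pid] lists the shared cells processor pid reads;
    compute(pid, values) returns a dict of shared cells it writes."""
    p = len(reads)
    # Input phase (EREW: each cell may be read at most once per superstep).
    flat_reads = [c for r in reads for c in r]
    assert len(flat_reads) == len(set(flat_reads)), "concurrent read not allowed in EREW"
    local = [{c: shared[c] for c in reads[pid]} for pid in range(p)]
    # Local computation phase: purely local, asynchronous.
    writes = [compute(pid, local[pid]) for pid in range(p)]
    # Output phase (EREW: each cell may be written at most once), then barrier.
    flat_writes = [c for w in writes for c in w]
    assert len(flat_writes) == len(set(flat_writes)), "concurrent write not allowed in EREW"
    for w in writes:
        shared.update(w)

shared = {"a": 3, "b": 4, "s0": 0, "s1": 0}
bsp_ram_superstep(shared, [["a"], ["b"]],
                  lambda pid, vals: {f"s{pid}": 2 * sum(vals.values())})
print(shared["s0"], shared["s1"])  # 6 8
```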

Computational Analysis We will assume, for the sake of convenience, that if a value x is written to a memory cell containing the value y, the result may be determined by any function f(x, y) computable in O(1) time. Similarly, if values x_1, x_2, …, x_m are written to a main-memory cell containing the value y, the result may be determined by any prescribed function f(x_1 ⊕ … ⊕ x_m, y), where ⊕ is a commutative and associative operator and both f and ⊕ are computable in O(1) time.
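
A tiny illustration of this convention, taking ⊕ to be addition and f(x, y) = x + y (both O(1), and both chosen here only as an example), so that concurrent writes simply accumulate into the cell:

```python
from functools import reduce
import operator

def resolve_writes(xs, y, combine=operator.add, f=lambda x, y: x + y):
    """Resolve concurrent writes x_1..x_m to a cell holding y as f(x_1 (+) ... (+) x_m, y)."""
    return f(reduce(combine, xs), y)

print(resolve_writes([2, 5, 7], 10))  # 24: the cell accumulates the sum of all written values
```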

Continued.. The computation cost is similar to BSP and can be given by w + h·g + l, where w is the total number of local operations performed by each processor, and h is defined as the sum of the numbers of data units read from and written to the main memory; g and l are fixed parameters of the computer. We write BSPRAM(p, g, l) to denote a BSPRAM with the given values of p, g, and l. An asynchronous EREW PRAM that charges a unit cost for a global read/write operation, d units for communication startup, and B units for synchronization is equivalent to an EREW BSPRAM(p, 1, d + B).

Simulation For efficient simulation of BSPRAM on BSP, some extra "parallelism" is necessary. A BSPRAM algorithm has slackness σ if the communication cost of each of its supersteps is at least σ. Theorem: an optimal randomized simulation on BSP(p, g, l) can be achieved for (i) any EREW BSPRAM(p, g, l) algorithm with slackness σ ≥ log p; and (ii) any CRCW BSPRAM(p, g, l) algorithm with slackness σ ≥ p^ε for some ε > 0.

Continued.. A BSPRAM algorithm is said to be communication-oblivious if the sequence of communication and synchronization operations executed by any processor is the same for all inputs of a given size; no such restriction is made on the local computation. A BSPRAM algorithm is said to have granularity γ if all memory cells used by the algorithm can be partitioned into granules of size at least γ (σ ≥ γ).

Matrix multiplication on BSPRAM We need to multiply two matrices X and Y and output the result matrix Z: Z_ik = Σ_j X_ij · Y_jk for j = 1, …, n, where 1 ≤ i, k ≤ n. Initialization: Z_ik ← 0 for i, k = 1, …, n. Computation: V_ijk ← X_ij · Y_jk, Z_ik ← Z_ik + V_ijk for all i, j, k with 1 ≤ i, j, k ≤ n. The computation of different triples (i, j, k) is independent and can therefore be performed in parallel.
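
A plain sequential sketch of this scheme (no parallelism, just the same initialization and update rule) to make the indexing concrete:

```python
def matmul(X, Y):
    n = len(X)
    Z = [[0] * n for _ in range(n)]       # initialization: Z_ik <- 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                v_ijk = X[i][j] * Y[j][k]  # V_ijk <- X_ij * Y_jk
                Z[i][k] += v_ijk           # Z_ik  <- Z_ik + V_ijk
    return Z

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```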

Continued.. The array V = (V_ijk) is represented as a cube of volume n^3 in integer three-dimensional space. The matrices are represented as projections of the cube. The computation of a point V_ijk requires as input its X and Y projections X_ij and Y_jk, and outputs to its Z projection Z_ik.

Continued.. In order to provide a communication-efficient BSP algorithm, the array V must be divided into p regular cubic blocks with sides of length n/p^(1/3); each matrix is correspondingly partitioned into p^(2/3) square blocks of size n/p^(1/3) × n/p^(1/3). Each processor can compute a block product sequentially. Cost analysis: W = O(n^3/p), H = O(n^2/p^(2/3)), S = O(1). The algorithm is oblivious with σ = γ = n^2/p^(2/3).
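
A sketch of the block decomposition and the resulting counts, assuming p is a perfect cube so that q = p^(1/3) blocks fit along each axis; the values of n and p below are illustrative only, and the counts simply restate the asymptotic costs for that choice.

```python
n, p = 512, 64
q = round(p ** (1 / 3))   # q = p^(1/3) blocks along each axis (p assumed a perfect cube)
b = n // q                # side of each cubic block, n / p^(1/3)

# Each processor handles one b x b x b block of V: it reads one b x b block of X
# and one of Y, multiplies them, and writes its partial b x b block of Z
# (contributions to the same Z block are combined by the concurrent-write rule).
W = 2 * b ** 3            # local work           = O(n^3 / p)
H = 3 * b ** 2            # data read + written  = O(n^2 / p^(2/3))
S = 1                     # number of supersteps = O(1)
print(q, b, W, H, S)      # 4 128 4194304 49152 1
```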

References Leslie G. Valiant, "A bridging model for parallel computation," Communications of the ACM, 1990. Alexandre Tiskin, "The bulk-synchronous parallel random access machine," Theoretical Computer Science, Elsevier, 1998.