PRAM Model for Parallel Computation

PRAM Model for Parallel Computation (Chapter 1A: Part 2 of Chapter 1)

The RAM Model of Computation Revisited
The Random Access Machine (RAM) model for sequential computation was discussed earlier.
- Assume the memory has M memory locations, where M is a large (finite) number.
- Accessing memory can be done in unit time.
- Instructions are executed one after another, with no concurrent operations.
- The input size depends on the problem being studied and is the number of items in the input.
- The running time of an algorithm is the number of primitive operations or steps it performs.

PRAM (Parallel Random Access Machine)
The PRAM is a natural generalization of the sequential RAM model.
- Each of the p processors P0, P1, ..., Pp-1 is identical to a RAM processor; they are often referred to as processing elements (PEs) or simply as processors.
- All processors can read from or write to a shared global memory in parallel (i.e., at the same time).
- The processors can also perform various arithmetic and logical operations in parallel.
- Running time can be measured in terms of the number of parallel memory accesses an algorithm performs.

PRAM Properties
- An unbounded number of processors, all of which can access an unbounded shared memory.
- All processors' execution steps are synchronized.
- However, processors can run different programs.
- Each processor has a unique id, called the pid.
- Processors can be instructed to do different things based on their pid (e.g., if pid < 200, do this, else do that).

The PRAM Model
- Parallel Random Access Machine: a theoretical model for parallel machines.
- p processors with uniform access to a large memory bank.
- UMA (uniform memory access): equal memory access time for any processor to any address.

The PRAM Models
- PRAM models vary according to how they handle read/write conflicts, and the models differ in how fast they can solve various problems.
- Exclusive Read, Exclusive Write (EREW): only one processor is allowed to read or write a given memory cell during any one step.
- Concurrent Read, Exclusive Write (CREW).
- Concurrent Read, Concurrent Write (CRCW).
- An algorithm that works correctly for EREW will also work correctly for CREW and CRCW, but not vice versa.

Summary of Memory Protocols
- Exclusive-Read Exclusive-Write (EREW)
- Exclusive-Read Concurrent-Write (ERCW)
- Concurrent-Read Exclusive-Write (CREW)
- Concurrent-Read Concurrent-Write (CRCW)
If concurrent write is allowed, we must decide which "written value" to accept.
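
To make the read/write distinctions concrete, here is a minimal sketch (my own illustration, not from the slides) that inspects the memory accesses issued during one synchronous step and reports the weakest model that could execute it. The step representation and the function name classify_step are assumptions.

```python
from collections import Counter

def classify_step(reads, writes):
    """Classify one synchronous PRAM step by its memory-access pattern.

    reads  -- addresses read in this step (one entry per reading processor)
    writes -- addresses written in this step (one entry per writing processor)
    Returns the weakest model that can execute the step: 'EREW', 'CREW', or 'CRCW'.
    """
    concurrent_read = any(c > 1 for c in Counter(reads).values())
    concurrent_write = any(c > 1 for c in Counter(writes).values())
    if concurrent_write:
        return "CRCW"      # needs a concurrent-write resolution rule
    if concurrent_read:
        return "CREW"
    return "EREW"

# Example: three processors read address 7 in the same step, writes are distinct.
print(classify_step(reads=[7, 7, 7], writes=[1, 2, 3]))  # -> CREW
```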

Assumptions
- There is no upper bound on the number of processors in the PRAM model.
- Any memory location is uniformly accessible from any processor.
- There is no limit on the amount of shared memory in the system.
- Resource contention is absent.
- Algorithms designed for the PRAM model can be implemented on real parallel machines, but they will incur communication costs.
- Since communication and resource costs vary across real machines, PRAM algorithms can be used to establish a lower bound on the running time for a problem.

More Details on the PRAM Model
- Both the memory size and the number of processors are unbounded.
- There is no direct communication between processors; they communicate via the shared memory.
- Every processor accesses any memory location in one cycle.
- Typically all processors execute the same algorithm in a synchronous fashion, although each processor can run a different program. Each step consists of a READ phase, a COMPUTE phase, and a WRITE phase.
- Some subset of the processors can stay idle (e.g., the even-numbered processors may sit out a step while the odd-numbered ones work, and conversely).
(Slide diagram: processors P1, P2, P3, ..., PN connected to a shared memory.)

PRAM CW?
What ends up being stored when multiple writes occur?
- Priority CW: processors are assigned priorities, and the top-priority processor is the one that performs the write for each group write.
- Fail common CW: if the values are not all equal, no write occurs.
- Collision common CW: if the values are not all equal, a "failure value" is written.
- Fail-safe common CW: if the values are not all equal, the algorithm aborts.
- Random CW: a non-deterministic choice of which value is written.
- Combining CW: write the sum, average, max, min, etc. of the values.
For a CRCW PRAM, one of the above types of CW is assumed. The CW rules can be ordered by power, so that a later (stronger) rule can simulate the earlier (weaker) ones. As shown in the textbook "Parallel Computation: Models & Algorithms" by Selim Akl, a PRAM machine that can perform all of the PRAM operations, including all of these CWs, can be built with circuits and runs in O(log n) time. In fact, most PRAM algorithms end up not needing CW.
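
The write-resolution rules above can be viewed as small reduction functions over the values aimed at the same cell. Below is a hedged sketch (illustrative names, not from the slides) that resolves one step of concurrent writes under a priority, common, or combining rule.

```python
def resolve_concurrent_writes(requests, rule="priority", combine=sum):
    """Resolve one CRCW write step.

    requests -- list of (pid, address, value) write requests issued in this step
    rule     -- 'priority' (lowest pid wins), 'common' (all values must agree),
                or 'combining' (apply `combine` to the competing values)
    Returns a dict mapping address -> value actually stored.
    """
    by_addr = {}
    for pid, addr, val in requests:
        by_addr.setdefault(addr, []).append((pid, val))

    memory = {}
    for addr, writers in by_addr.items():
        values = [v for _, v in writers]
        if rule == "priority":
            memory[addr] = min(writers)[1]       # smallest pid wins
        elif rule == "common":
            if len(set(values)) != 1:
                raise RuntimeError(f"write conflict at address {addr}: {values}")
            memory[addr] = values[0]
        elif rule == "combining":
            memory[addr] = combine(values)       # e.g. sum, max, min
    return memory

reqs = [(0, 29, 43), (1, 8, 12), (2, 29, 43), (4, 92, 26)]
print(resolve_concurrent_writes(reqs, rule="common"))  # all writers to 29 agree
```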

PRAM Example 1
Problem: we have a linked list of length n. For each element i, compute its distance to the end of the list:
  d[i] = 0 if next[i] = NIL
  d[i] = d[next[i]] + 1 otherwise
- The sequential algorithm is O(n).
- We can define a PRAM algorithm running in O(log n) time.
- Associate one processor with each element of the list.
- At each iteration, split the list in two, with the odd-placed and even-placed elements going to different lists.
- The list size is divided by 2 at each step, hence O(log n).

PRAM Example 1
Principle:
- Look at the next element.
- Add its d value to yours.
- Point to the next element's next element.
The number of active elements in each list is halved at each step, hence the O(log n) complexity.
(Slide figure: the d values of a five-element list after each doubling step.)

PRAM Example 1
Algorithm:
  forall i
    if next[i] == NIL then d[i] ← 0 else d[i] ← 1
  while there is an i such that next[i] ≠ NIL
    forall i
      if next[i] ≠ NIL then
        d[i] ← d[i] + d[next[i]]
        next[i] ← next[next[i]]
What about the correctness of this algorithm?

forall Loop
At each step the updates must be synchronized, so that the pointers point to the right things after the step:
  next[i] ← next[next[i]]
This is ensured by the semantics of forall: all reads happen before any writes. Nobody really writes it out, but one must not forget that it is really what happens underneath. Writing
  forall i
    A[i] = B[i]
really means
  forall i
    tmp[i] = B[i]
  forall i
    A[i] = tmp[i]
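
A small way to see why this matters: the sketch below (my illustration; None stands for NIL) performs one pointer-jumping step next[i] ← next[next[i]] once with the correct read-all-then-write-all semantics and once with a naive in-place loop, and the results differ.

```python
def forall_jump_synchronous(nxt):
    """All processors read next[next[i]] first, then all write: PRAM forall semantics."""
    tmp = [nxt[nxt[i]] if nxt[i] is not None else None for i in range(len(nxt))]  # READ phase
    return tmp                                                                    # WRITE phase

def forall_jump_sequential(nxt):
    """Naive in-place loop: later iterations see values that were already overwritten."""
    nxt = nxt[:]
    for i in range(len(nxt)):
        if nxt[i] is not None:
            nxt[i] = nxt[nxt[i]]
    return nxt

chain = [None, 0, 1, 2, 3]             # list 4 -> 3 -> 2 -> 1 -> 0 (element 0 is the tail)
print(forall_jump_synchronous(chain))  # [None, None, 0, 1, 2]
print(forall_jump_sequential(chain))   # [None, None, None, None, None]  (not the same!)
```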

while Condition
  while there is an i such that next[i] ≠ NIL
- How can one do such a global test on a PRAM?
- It cannot be done in constant time unless the PRAM is CRCW: at the end of the step, each processor writes TRUE or FALSE to the same memory location depending on whether its next[i] equals NIL, and the concurrent writes are resolved by taking the AND of all the values.
- On a CREW PRAM, one needs O(log n) steps to do a global test like the one above.
- In this case, one can simply rewrite the while loop into a for loop, because we have already analyzed how the iterations go:
  for step = 1 to log n
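
For completeness, a hedged sketch of such an O(log n) global test as a tree reduction (here an AND over per-processor "done" flags, matching the discussion above; the function name and layout are my assumptions). Only the processors at indices that are multiples of 2^(k+1) are active in round k, so no cell is read or written by two processors in the same round.

```python
import math

def global_and(flags):
    """AND-reduce one boolean per processor in ceil(log2 n) synchronous rounds."""
    flags = list(flags)
    n = len(flags)
    for k in range(math.ceil(math.log2(n)) if n > 1 else 0):
        stride = 2 ** k
        updates = {i: flags[i] and flags[i + stride]                  # READ phase
                   for i in range(0, n, 2 * stride) if i + stride < n}
        for i, v in updates.items():                                  # WRITE phase
            flags[i] = v
    return flags[0]

# done[i] = (next[i] == NIL); the while loop stops once the AND becomes True.
print(global_and([True, True, False, True, True]))  # False: some next[i] is not NIL
```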

What Type of PRAM?
The previous algorithm does not require a CW machine, but the statement
  tmp[i] ← d[i] + d[next[i]]
could require concurrent reads by processors j and k if k = next[j]: k reads d[k] for the "d[i]" part of its execution while j reads the same cell for the "d[next[i]]" part of its execution.
- Solution 1: this will not occur if execution is strictly synchronous, since the read of d[k] by k and the read of d[next[j]] by j then happen at different times within the step.
- Solution 2: split the statement into two instructions:
  tmp2[i] ← d[i]
  tmp[i] ← tmp2[i] + d[next[i]]
  (note that these are technically two different steps in each pass through the loop)
Now we have an execution that works on an EREW PRAM, which is the most restrictive type.

Final Algorithm on an EREW PRAM
  forall i                                            O(1)
    if next[i] == NIL then d[i] ← 0 else d[i] ← 1
  for step = 1 to log n                               O(log n) iterations
    forall i                                          O(1) per iteration
      if next[i] ≠ NIL then
        tmp[i] ← d[i]
        d[i] ← tmp[i] + d[next[i]]
        next[i] ← next[next[i]]
Conclusion: one can rank a list of size n (and hence compute its length) in O(log n) time on any PRAM, including EREW.
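
For reference, a minimal sequential simulation of this list-ranking algorithm (my sketch, not the slides' code): each doubling pass performs the READ phase for all elements before any WRITE, exactly as a synchronous PRAM step would.

```python
import math

def list_rank(nxt):
    """Pointer-jumping list ranking.

    nxt[i] is the index of i's successor, or None for the last element.
    Returns d with d[i] = distance from i to the end of the list.
    Runs ceil(log2 n) doubling passes, i.e. O(log n) PRAM time.
    """
    n = len(nxt)
    nxt = list(nxt)
    d = [0 if nxt[i] is None else 1 for i in range(n)]   # forall i: initialise d
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # READ phase: every "processor" reads before anyone writes.
        new_d = [d[i] + d[nxt[i]] if nxt[i] is not None else d[i] for i in range(n)]
        new_nxt = [nxt[nxt[i]] if nxt[i] is not None else None for i in range(n)]
        # WRITE phase.
        d, nxt = new_d, new_nxt
    return d

# List 0 -> 1 -> 2 -> 3 -> 4: distances to the end are 4, 3, 2, 1, 0.
print(list_rank([1, 2, 3, 4, None]))  # [4, 3, 2, 1, 0]
```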

Are All PRAMs Equivalent?
Consider the following problem: given an array of n distinct elements e_1, ..., e_n, determine whether some element e is in the array.
On a CREW PRAM, there is an algorithm that works in time O(1) on n processors:
- P1 initializes a boolean variable B to FALSE.
- Each processor i reads e_i and e and compares them.
- If they are equal, processor P_i writes TRUE into the boolean memory location B.
- Only one P_i will write, since the e_i are distinct, so the write is exclusive and CREW suffices.
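
A short sketch of that O(1) CREW search, with the n processors simulated by a loop (the names are illustrative, not the slides' code):

```python
def crew_search(array, e):
    """O(1) CREW membership test: every processor reads e (concurrent read), but
    since the array elements are distinct, at most one processor writes B."""
    B = False
    for i, ei in enumerate(array):   # conceptually, all i run in the same step
        if ei == e:
            B = True                 # at most one i matches, so the write is exclusive
    return B

print(crew_search([5, 9, 2, 7], 2))  # True
print(crew_search([5, 9, 2, 7], 3))  # False
```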

Are All PRAMs Equivalent?
On an EREW PRAM, one cannot do better than a log n running time:
- Each processor must read e separately.
- At worst, a complexity of O(n) with sequential reads.
- At best, a complexity of O(log n), by repeatedly doubling the number of copies of the value at each step so that eventually every processor has one (just like a broadcast in a binary tree, or in fact a k-ary tree for some constant k).
- In general, "diffusing" a piece of information to n processors on an EREW PRAM takes O(log n).
Conclusion: CREW PRAMs are more powerful than EREW PRAMs.
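
A hedged sketch of the O(log n) EREW "doubling" broadcast: in each round every processor that already holds the value copies it into one new, distinct cell, so the number of copies doubles without any concurrent read or write. The function name and cell layout are my assumptions.

```python
def erew_broadcast(value, n):
    """Spread `value` to n cells in ceil(log2 n) rounds with exclusive reads/writes."""
    cells = [None] * n
    cells[0] = value                     # processor 0 starts with the value
    have = 1                             # cells 0 .. have-1 currently hold the value
    rounds = 0
    while have < n:
        copies = min(have, n - have)
        for i in range(copies):          # processor i copies cell i into cell have+i:
            cells[have + i] = cells[i]   # each cell is read once and written once
        have += copies
        rounds += 1
    return cells, rounds

cells, rounds = erew_broadcast(42, 10)
print(cells, rounds)   # ten copies of 42, after ceil(log2(10)) = 4 rounds
```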

This is a Typical Question for Various Parallel Models
- Is model A more powerful than model B? Basically, you are asking whether one can simulate the other.
- Whether or not the model maps to a "real" machine is another question of interest.
- Often a research group tries to build a machine that has the characteristics of a given model.

Simulation Theorem
Simulation theorem: any algorithm running on a CRCW PRAM with p processors cannot be more than O(log p) times faster than the best algorithm for the same problem on an EREW PRAM with p processors.
Proof: we "simulate" the concurrent writes.
- When P_i wants to write value x_i to address l_i, one replaces the write by an (exclusive) write of (l_i, x_i) to A[i], where A is an auxiliary array with one slot per processor.
- One then sorts array A by the first component of its entries.
- Processor i of the EREW PRAM looks at A[i] and A[i-1]. If their first components are different, or if i = 0, it writes the value to the address found in A[i].
- Since A is sorted by the first component, the writing is exclusive.

Proof (continued)
Picking one processor for each competing write. The write requests (address, value) are:
  P0 → (29, 43) = A[0]
  P1 → (8, 12)  = A[1]
  P2 → (29, 43) = A[2]
  P3 → (29, 43) = A[3]
  P4 → (92, 26) = A[4]
  P5 → (8, 12)  = A[5]
After sorting A by address:
  A[0] = (8, 12)   P0 writes
  A[1] = (8, 12)   P1 nothing
  A[2] = (29, 43)  P2 writes
  A[3] = (29, 43)  P3 nothing
  A[4] = (29, 43)  P4 nothing
  A[5] = (92, 26)  P5 writes
(Slide figure: the memory cells at addresses 8, 29, and 92 receiving the values 12, 43, and 26.)
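
The example can be replayed with a small sketch of the simulation (my illustration, with illustrative names): gather the write requests, sort them, and let only the first processor in each run of equal addresses perform the write. Python's built-in sort stands in for Cole's O(log p) parallel sort from the next slide.

```python
def simulate_crcw_step(requests, memory=None):
    """Simulate one CRCW write step on an EREW machine (sort + leader election).

    requests -- list of (pid, address, value); competing writes are resolved in
                favour of the lowest pid, which matches the slide's example.
    """
    memory = {} if memory is None else memory
    # Step 1: each processor writes its request into its own slot of A (exclusive).
    A = sorted(requests, key=lambda r: (r[1], r[0]))   # Step 2: sort by (address, pid)
    # Step 3: processor i compares A[i] with A[i-1]; only the first writer per address writes.
    for i, (pid, addr, val) in enumerate(A):
        if i == 0 or A[i - 1][1] != addr:
            memory[addr] = val
    return memory

reqs = [(0, 29, 43), (1, 8, 12), (2, 29, 43), (3, 29, 43), (4, 92, 26), (5, 8, 12)]
print(simulate_crcw_step(reqs))   # {8: 12, 29: 43, 92: 26}
```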

Proof (continued)
Note that we said that we "just sort" array A.
- If we have an algorithm that sorts p elements with O(p) processors in O(log p) time, we are set.
- It turns out there is such an algorithm: Cole's algorithm, essentially a merge sort in which lists are merged in constant time. It is beautiful, but we do not really have time for it, and it is rather complicated.
- The constant factor is quite large, so the algorithm is currently not practical.
Therefore, the proof is complete.

Brent's Theorem
Brent's theorem: if a p-processor PRAM algorithm A runs in time t, then for any p' < p there is a p'-processor PRAM algorithm A' for the same problem that runs in time O(pt/p').
Proof sketch:
- Let the time steps of algorithm A be numbered 1, 2, ..., t. Each of these steps requires O(1) time in A. (Steps inside loops are counted once per iteration, so t may depend on the input size.)
- Algorithm A' completes the execution of each step before going on to the next one.
- The p tasks performed by the p processors in each step are divided among the p' processors, with each of them performing ⌈p/p'⌉ of the tasks. Since the processors have the same speed, this takes O(p/p') time.
- There are t steps in A, so the entire simulation takes O((p/p') · t) = O(pt/p') time.
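
A tiny sketch of the scheduling arithmetic (illustrative only, with assumed names): each of the t steps is simulated in ⌈p/p'⌉ sub-steps by the p' processors, giving t · ⌈p/p'⌉ = O(pt/p') total time.

```python
import math

def simulated_time(p, t, p_prime):
    """Time for p' processors to simulate a p-processor, t-step PRAM algorithm."""
    return t * math.ceil(p / p_prime)   # each step costs ceil(p/p') sub-steps

# Example: a 1024-processor, 10-step algorithm run on 64 processors.
print(simulated_time(1024, 10, 64))     # 160 = 10 * ceil(1024/64)
```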