Parallel Computation Models (Lectures 3 and 4)

Presentation transcript:

Slide 1 Parallel Computation Models (Lectures 3 and 4)

Slide 2 Parallel Computation Models
– PRAM (parallel RAM)
– Fixed interconnection network: bus, ring, mesh, hypercube, shuffle-exchange
– Boolean circuits
– Combinatorial circuits
– BSP
– LogP

Slide 3 PARALLEL AND DISTRIBUTED COMPUTATION: many interconnected processors working concurrently.
[Figure: processors P1, P2, ..., Pn joined by an interconnection network. Examples range from the Connection Machine to the Internet, which connects all the computers of the world.]

Slide 4 TYPES OF MULTIPROCESSING FRAMEWORKS: PARALLEL vs. DISTRIBUTED
Technical aspects:
– Parallel computers (usually) work in tight synchrony, share memory to a large extent, and have a very fast and reliable communication mechanism between them.
– Distributed computers are more independent; communication is less frequent and less synchronous, and cooperation is limited.
Purposes:
– Parallel computers cooperate to solve difficult problems more efficiently (possibly).
– Distributed computers have individual goals and private activities; sometimes communication with other computers is needed (e.g. distributed database operations).
In short: parallel computers cooperate in a positive sense; distributed computers cooperate in a negative sense, only when it is necessary.

Slide 5 For parallel systems we are interested in solving any problem in parallel. For distributed systems we are interested in solving only particular problems in parallel; typical examples are:
Communication services:
– routing
– broadcasting
Maintenance of control structures:
– spanning tree construction
– topology update
– leader election
Resource control activities:
– load balancing
– managing global directories

Slide 6 PARALLEL ALGORITHMS
– Which model of computation is the best to use?
– How much time do we expect to save by using a parallel algorithm?
– How do we construct efficient algorithms?
– Many concepts of complexity theory must be revisited.
– Is parallelism a solution for hard problems?
– Are there problems that do not admit an efficient parallel solution, i.e. inherently sequential problems?

Slide 7 We need a model of computation.
NETWORK (VLSI) MODEL: the processors are connected by a network of bounded degree; no shared memory is available; several interconnection topologies exist; operation is synchronous.
MESH-CONNECTED ARRAY of N processors (a √N × √N grid): degree = 4, diameter = 2(√N − 1).

Slide 8 N = 2 4 PROCESSORS diameter = 4 degree = 4 (log 2 N) HYPERCUBE

Slide 9 Other important topologies: binary trees, meshes of trees, cube-connected cycles.
In the network model, a PARALLEL MACHINE is a very complex ensemble of small interconnected units performing elementary operations.
– Each processor has its own memory.
– Processors work synchronously.
LIMITS OF THE MODEL:
– Different topologies require different algorithms to solve the same problem.
– It is difficult to describe and analyse algorithms (the migration of data has to be described).
A shared-memory model is more suitable from an algorithmic point of view.

Slide 10 Model equivalence: given two models M1 and M2 and a problem Π of size n, if M1 and M2 are equivalent, then a solution of Π taking T(n) time and P(n) processors on M1 takes T(n)^O(1) time and P(n)^O(1) processors on M2, i.e. the resources are polynomially related.

Slide 11 PRAM (Parallel Random Access Machine): a shared-memory multiprocessor with
– an unlimited number of processors, each of which has unlimited local memory, knows its own ID, and can access the shared memory;
– an unlimited shared memory.

Slide 12 PRAM MODEL: n RAM processors P1, ..., Pn connected to a common memory of m cells.
ASSUMPTION: in each time unit, each Pi can read a memory cell, make an internal computation, and write another memory cell, all in constant time.
CONSEQUENCE: any pair of processors Pi, Pj can communicate in constant time: Pi writes the message in cell x at time t, and Pj reads the message from cell x at time t+1.
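To make the lockstep read/compute/write discipline concrete, here is a minimal Python sketch of one synchronous PRAM step, simulated sequentially (all names, such as `pram_step`, are illustrative, not part of any standard library):

```python
def pram_step(shared, programs):
    """Simulate one synchronous PRAM step.

    shared   -- list modelling the common memory cells
    programs -- one (read_addr, compute, write_addr) triple per processor
    All reads happen before any write, mimicking lockstep semantics.
    """
    # Phase 1: every processor reads its cell from the *old* memory state.
    values = [shared[read_addr] for read_addr, _, _ in programs]
    # Phase 2: local computation, then all writes take effect together.
    for (_, compute, write_addr), v in zip(programs, values):
        shared[write_addr] = compute(v)

# Pi writes a message into cell 5 at time t;
# Pj reads it from cell 5 at time t+1 (two steps, constant time each).
mem = [0] * 8
pram_step(mem, [(0, lambda v: 42, 5)])   # t:   Pi writes 42 into cell 5
pram_step(mem, [(5, lambda v: v, 6)])    # t+1: Pj copies cell 5 into cell 6
print(mem[6])  # 42
```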

Slide 13 PRAM
– Inputs/outputs are placed in the shared memory (at a designated address).
– A memory cell stores an arbitrarily large integer.
– Each instruction takes unit time.
– Instructions are synchronized across the processors.

Slide 14 PRAM Instruction Set
– Accumulator architecture: memory cell R0 accumulates results.
– Multiply/divide instructions take only constant operands; this prevents generating exponentially large numbers in polynomial time.

Slide 15 PRAM Complexity Measures
For each individual processor:
– time: number of instructions executed
– space: number of memory cells accessed
For the PRAM machine:
– time: time taken by the longest-running processor
– hardware: maximum number of active processors

Slide 16 Two Technical Issues for PRAM
– How processors are activated
– How shared memory is accessed

Slide 17 Processor Activation
– P0 places the number of processors (p) in the designated shared-memory cell; each active Pi, where i < p, starts executing. O(1) time to activate; all processors halt when P0 halts.
– Alternatively, active processors explicitly activate additional processors via FORK instructions: tree-like activation, O(log p) time to activate p processors (see the sketch below).
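The O(log p) bound comes from doubling: in each round, every active processor can FORK one new processor, so the number of active processors doubles per step. A small Python sketch of this counting argument (purely illustrative):

```python
import math

def rounds_to_activate(p: int) -> int:
    """Tree-like FORK activation: starting from one active processor,
    each round every active processor forks one more, doubling the
    active count. Return the number of rounds until >= p are active."""
    active, rounds = 1, 0
    while active < p:
        active *= 2   # every active processor forks one new processor
        rounds += 1
    return rounds

print(rounds_to_activate(1024), math.ceil(math.log2(1024)))  # 10 10
```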

Slide 18 THE PRAM IS A THEORETICAL (UNFEASIBLE) MODEL
– The interconnection network between processors and memory would require a very large amount of area.
– Message routing on the interconnection network would require time proportional to the network size (i.e. the assumption of constant-time access to the memory is not realistic).
WHY IS THE PRAM A REFERENCE MODEL?
– Algorithm designers can forget the communication problems and focus their attention on the parallel computation only.
– There exist algorithms simulating any PRAM algorithm on bounded-degree networks. E.g. a PRAM algorithm requiring time T(n) can be simulated on a mesh of trees in time T(n) · log^2(n) / log log(n), that is, each step can be simulated with a slow-down of log^2(n) / log log(n).
– Instead of designing ad hoc algorithms for bounded-degree networks, design more general algorithms for the PRAM model and simulate them on a feasible network.

Slide 19 For the PRAM model there exists a well-developed body of techniques and methods to handle different classes of computational problems.
The discussion on parallel models of computation is still HOT. The current trend is COARSE-GRAINED MODELS, in which the degree of parallelism allowed is independent of the number of processors. The computation is divided into supersteps, each of which includes:
– local computation
– a communication phase
– a synchronization phase
The study of these models is still at the beginning!

Slide 20 Metrics
A measure of the relative performance between a multiprocessor system and a single-processor system is the speed-up S(p), defined as follows:
S(p) = (execution time using a single processor) / (execution time using a multiprocessor with p processors) = T1 / Tp
Efficiency: Ep = S(p) / p
Cost: Cp = p · Tp
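A quick numeric check of these three metrics as a minimal Python sketch (the timings are made-up illustrative values):

```python
def metrics(t1: float, tp: float, p: int):
    """Speed-up, efficiency, and cost from sequential time t1,
    parallel time tp, and processor count p."""
    speedup = t1 / tp
    efficiency = speedup / p
    cost = p * tp
    return speedup, efficiency, cost

# Hypothetical numbers: 100 s sequentially, 16 s on 8 processors.
s, e, c = metrics(100.0, 16.0, 8)
print(f"S(p) = {s:.2f}, Ep = {e:.0%}, Cp = {c:.0f}")
# S(p) = 6.25, Ep = 78%, Cp = 128
```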

Slide 21 Metrics
A parallel algorithm is cost-optimal when its parallel cost equals the sequential time: Cp = T1, i.e. Ep = 100%.
This is critical when down-scaling: a parallel implementation may become slower than the sequential one. Example: if T1 = n^3 and Tp = n^2.5 with p = n^2 processors, then Cp = n^4.5 > T1, so the algorithm is not cost-optimal.

Slide 22 Amdahl's Law
f = fraction of the problem that is inherently sequential; (1 − f) = fraction that is parallel.
Parallel time with p processors: Tp = f · T1 + (1 − f) · T1 / p
Speedup with p processors: S(p) = T1 / Tp = 1 / (f + (1 − f) / p)
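Plugging in numbers shows how quickly the sequential fraction dominates; a minimal Python sketch:

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Amdahl's law: speedup with sequential fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

# With f = 2%, the speedup approaches 1/f = 50 as p grows.
for p in (10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(0.02, p), 2))
# prints 8.47, 33.56, 47.66, 50.0 -- converging to 1/f = 50
```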

Slide 23 Amdahl's Law
Upper bound on speedup (p = ∞): as p → ∞, the term (1 − f)/p converges to 0, so S ≤ 1/f.
Example: f = 2% gives S = 1 / 0.02 = 50.

Slide 24 PRAM
Too many interconnections give problems with synchronization. However, the PRAM is the best conceptual model for designing efficient parallel algorithms, due to its simplicity and the possibility of simulating PRAM algorithms efficiently on more realistic parallel architectures.

Slide 25 Shared-Memory Access
Concurrent (C) means many processors can perform the operation simultaneously on the same memory location; Exclusive (E) means they cannot.
– EREW (Exclusive Read, Exclusive Write)
– CREW (Concurrent Read, Exclusive Write): many processors can read the same location simultaneously, but only one may write to a given location.
– ERCW (Exclusive Read, Concurrent Write)
– CRCW (Concurrent Read, Concurrent Write): many processors can read from and write to the same memory location.

Slide 26 Example CRCW-PRAM
Initially, table A contains values 0 and 1, and output contains the value 0. The program computes the "Boolean OR" of A[1], A[2], A[3], A[4], A[5] (a sketch follows).
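The slide's program itself is not reproduced here; the standard O(1) CRCW solution has every processor whose input bit is 1 write 1 into the output cell concurrently, which is safe because all writers write the same value (common CRCW). A hedged Python sketch of that idea, simulated sequentially:

```python
def crcw_boolean_or(A):
    """O(1)-time CRCW-PRAM Boolean OR, simulated sequentially.
    Processor i reads A[i]; all processors that see a 1 write 1 into
    `output` concurrently. Since every writer writes the same value,
    the concurrent write is well defined (common CRCW)."""
    output = 0
    for i in range(len(A)):   # conceptually: all i in parallel, one step
        if A[i] == 1:
            output = 1        # concurrent write of the same value
    return output

print(crcw_boolean_or([0, 1, 0, 0, 1]))  # 1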

Slide 27 Example CREW-PRAM
Assume table A initially contains [0,0,0,0,0,1], and we have a parallel program computing the OR without concurrent writes (a sketch follows).
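Without concurrent writes, the usual CREW approach combines values pairwise along a balanced binary tree, taking O(log n) steps instead of O(1). A hedged Python sketch of that standard scheme (not necessarily the exact program from the slide):

```python
def crew_boolean_or(A):
    """O(log n)-time CREW-PRAM Boolean OR, simulated sequentially.
    In each round, processor i combines cells 2i and 2i+1 into cell i;
    every write target is distinct, so writes are exclusive."""
    cells = list(A)
    n = len(cells)
    while n > 1:
        half = (n + 1) // 2
        # conceptually: processors 0..half-1 act in parallel in one step
        for i in range(half):
            left = cells[2 * i]
            right = cells[2 * i + 1] if 2 * i + 1 < n else 0
            cells[i] = left | right   # exclusive write to cell i
        n = half
    return cells[0]

print(crew_boolean_or([0, 0, 0, 0, 0, 1]))  # 1
```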

Slide 28 Pascal triangle on a CREW PRAM (a sketch follows).
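The slide gives only the title; a natural CREW formulation computes row k+1 from row k in one parallel step: processor i reads the two shared neighbours row[i-1] and row[i] (concurrent reads, since each cell is read by two processors) and exclusively writes its own output cell. A hedged Python sketch of that idea:

```python
def pascal_triangle_crew(n):
    """Rows 0..n of Pascal's triangle, one CREW-PRAM step per row
    (simulated sequentially). Each cell of a row is read by two
    processors (concurrent read) but written by exactly one
    (exclusive write)."""
    rows = [[1]]
    for _ in range(n):
        prev = rows[-1]
        # conceptually: processors 0..len(prev) act in parallel
        new = [(prev[i - 1] if i > 0 else 0) +
               (prev[i] if i < len(prev) else 0)
               for i in range(len(prev) + 1)]
        rows.append(new)
    return rows

for row in pascal_triangle_crew(4):
    print(row)   # [1], [1, 1], [1, 2, 1], [1, 3, 3, 1], [1, 4, 6, 4, 1]
```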