1 Introduction to Parallel Computing

2 Multiprocessor Architectures
Message-Passing Architectures
–Separate address space for each processor.
–Processors communicate via message passing.
Shared-Memory Architectures
–Single address space shared by all processors.
–Processors communicate by memory reads and writes.
–SMP or NUMA.
–Cache coherence is an important issue.
There is a lot of middle ground and many hybrids, and no clear consensus on terminology.

3 Message-Passing Architecture
[Diagram: a number of nodes, each consisting of a processor, a cache, and a private memory, connected by an interconnection network.]

4 Shared-Memory Architecture
[Diagram: processors 1 to N, each with its own cache, connected through an interconnection network to shared memory modules 1 to M.]

5 Shared-Memory Architecture: SMP and NUMA
SMP = Symmetric Multiprocessor
–All memory is equally close to all processors.
–Typical interconnection network is a shared bus.
–Easier to program, but doesn't scale to many processors.
NUMA = Non-Uniform Memory Access
–Each memory is closer to some processors than to others.
–a.k.a. "Distributed Shared Memory".
–Typical interconnection is a grid or hypercube.
–Harder to program, but scales to more processors.

6 Shared-Memory Architecture: Cache Coherence
Effective caching reduces memory contention.
Processors must see a single consistent memory.
There are many different consistency models; weak consistency is sufficient here.
Snoopy cache coherence is used for bus-based SMPs, distributed directories for NUMA.
Many implementation issues: multiple cache levels, instruction/data separation, cache line size, update policy, etc.
Usually you don't need to know all the details.

7 Example: Quad-Processor Pentium Pro
SMP, bus interconnection.
4 x 200 MHz Intel Pentium Pro processors.
8 + 8 Kb L1 cache (instruction + data) per processor.
512 Kb L2 cache per processor.
Snoopy cache coherence.
Vendors: Compaq, HP, IBM, NetPower.
Runs Windows NT, Solaris, Linux, etc.

8 Example: Diplopodus
Beowulf-based cluster of 24 Linux/Intel PCs connected by a 100 Mbit switch.
Node configuration:
–2 x 500 MHz Pentium III
–512 Mb RAM
–… Gb local disk

9 The first program
Purpose: illustrate notation.
Given
–Length of vectors M.
–Data x_m, y_m, m=0,1,…,M-1 of real numbers, and two real scalars α and β.
Compute
–z = αx + βy, i.e., z[m] = αx[m] + βy[m] for m=0,1,…,M-1.

10 Program Vector_Sum_1
declare m: integer; x, y, z: array[0,1,…,M-1] of real;
initially …
assign || m=0,1,…,M-1 : z[m] := αx[m] + βy[m]
end

Here || is a concurrency operator: if two operations O_1 and O_2 are separated by ||, i.e. O_1 || O_2, then the two operations can be performed concurrently, independently of each other. In addition, || m=0,1,…,M-1 : O_m is short for O_0 || O_1 || … || O_{M-1}, meaning that all M operations can be done concurrently.
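As an aside (not part of the original slides), the concurrent assignment above corresponds to a loop whose iterations are independent of each other. A minimal C sketch, using an OpenMP directive to express the concurrency and arbitrary values for M, α and β:

#include <stdio.h>

#define M 1000                 /* vector length; arbitrary for this sketch */

int main(void)
{
    double x[M], y[M], z[M];
    const double alpha = 2.0, beta = 3.0;   /* the scalars α and β */

    for (int m = 0; m < M; m++) {           /* arbitrary input data */
        x[m] = m;
        y[m] = 2.0 * m;
    }

    /* The M assignments z[m] := αx[m] + βy[m] are independent,
       so they may all be performed concurrently (the || operator).
       Compile with -fopenmp (gcc/clang) to run them in parallel;
       without that flag the pragma is simply ignored. */
    #pragma omp parallel for
    for (int m = 0; m < M; m++)
        z[m] = alpha * x[m] + beta * y[m];

    printf("z[M-1] = %g\n", z[M - 1]);
    return 0;
}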

11 Sequential assignment
initially a=1, b=2
assign a:=b; b:=a
results in a=b=2.
Concurrent assignment
initially a=1, b=2
assign a:=b || b:=a
results in a=2, b=1.
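The distinction can be mimicked in ordinary C (a sketch of my own, not from the slides): a sequential emulation of the concurrent assignment must read both old values before writing either of them.

#include <stdio.h>

int main(void)
{
    /* Sequential assignment: a := b; b := a */
    int a = 1, b = 2;
    a = b;                      /* a becomes 2          */
    b = a;                      /* b reads the new a: 2 */
    printf("sequential: a=%d, b=%d\n", a, b);   /* a=2, b=2 */

    /* Concurrent assignment: a := b || b := a
       Both right-hand sides refer to the old values, so a sequential
       emulation needs temporaries holding those old values. */
    a = 1; b = 2;
    int old_a = a, old_b = b;
    a = old_b;                  /* a becomes 2 */
    b = old_a;                  /* b becomes 1 */
    printf("concurrent: a=%d, b=%d\n", a, b);   /* a=2, b=1 */

    return 0;
}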

12 A model of a parallel computer
P processors (nodes), p=0,1,…,P-1.
All processors are identical, and each processor computes sequentially.
Any node can communicate with any other node.
Communication is handled by mechanisms for sending and receiving data at each processor.

13 Data distribution
Suppose we distribute a vector x with M elements x_0,…,x_{M-1} over a collection of P identical computers.
On each computer, define the index set J_p = {0,1,…,I_p-1}, where I_p is the number of indices stored at processor p.
Assume I_0 + I_1 + … + I_{P-1} = M, so that
x = (x_0,…,x_{I_0-1}, …, x_{M-1}),
with the first I_0 entries stored on processor 0, …, and the last I_{P-1} entries stored on processor P-1.

14 A proper data distribution defines a one-to-one mapping μ from a global index m to a local index i on processor p.
For a global index m, μ(m) gives a unique index i on a unique processor p.
Similarly, an index i on processor p is uniquely mapped back to a global index m = μ^{-1}(p,i).
Globally: x = x_0,…,x_{M-1}.
Locally: x_0,…,x_{I_0-1} on processor 0; x_0,…,x_{I_1-1} on processor 1; …; x_0,…,x_{I_{P-1}-1} on processor P-1.
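One concrete choice of distribution (an assumption made here for illustration; the slides leave the distribution general) is a block distribution, where processor 0 stores the first I_0 elements, processor 1 the next I_1 elements, and so on, with the first M mod P processors holding one extra element. A small C sketch of the mapping μ and its inverse under that assumption:

#include <stdio.h>

/* Block distribution of M elements over P processors. */
static int block_size(int M, int P, int p)   /* I_p */
{
    return M / P + (p < M % P ? 1 : 0);
}

static int block_start(int M, int P, int p)  /* global index of local index 0 on p */
{
    int base = M / P, rem = M % P;
    return p * base + (p < rem ? p : rem);
}

/* mu: global index m -> (processor p, local index i) */
static void mu(int M, int P, int m, int *p, int *i)
{
    for (*p = 0; m >= block_start(M, P, *p) + block_size(M, P, *p); (*p)++)
        ;
    *i = m - block_start(M, P, *p);
}

/* mu^{-1}: (processor p, local index i) -> global index m */
static int mu_inv(int M, int P, int p, int i)
{
    return block_start(M, P, p) + i;
}

int main(void)
{
    int M = 10, P = 3, p, i;
    for (int m = 0; m < M; m++) {
        mu(M, P, m, &p, &i);
        printf("global %d -> proc %d, local %d -> global %d\n",
               m, p, i, mu_inv(M, P, p, i));
    }
    return 0;
}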

15 Purpose:
–derive a multicomputer version of Vector_Sum_1.
Given
–Length of vectors M.
–Data x_m, y_m, m=0,1,…,M-1 of real numbers, and two real scalars α and β.
–Number of processors P.
–Set of indices J_p = {0,1,…,I_p-1}, where the number of entries I_p on the p-th processor is given.
–A one-to-one mapping between global and local indices.
Compute
–z = αx + βy, i.e., z[m] = αx[m] + βy[m] for m=0,1,…,M-1.

16 || p=0,1,…,P-1 : Program Vector_Sum_2
declare i: integer; x, y, z: array[J_p] of real;
initially …
assign || i=0,1,…,I_p-1 : z[i] := αx[i] + βy[i]
end

Notice that we have one program for each processor, all programs being identical. In each program the identifier p is known, and the mapping between global and local indices is assumed to be known. The result is stored in the distributed vector z.
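A hedged C/MPI sketch of the same structure: every processor runs the identical program, learns its own identifier p from MPI, and updates only its local block. The block distribution and the test data are assumptions made for the example, not part of the slides.

/* compile: mpicc vector_sum_2.c -o vector_sum_2
   run:     mpirun -np 4 ./vector_sum_2            */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int P, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &p);

    const int M = 1000;                    /* global vector length (arbitrary) */
    const double alpha = 2.0, beta = 3.0;  /* the scalars α and β */

    /* block distribution: I_p local entries on processor p */
    int Ip = M / P + (p < M % P ? 1 : 0);

    double *x = malloc(Ip * sizeof *x);
    double *y = malloc(Ip * sizeof *y);
    double *z = malloc(Ip * sizeof *z);

    for (int i = 0; i < Ip; i++) {         /* arbitrary local data */
        x[i] = i;
        y[i] = 2.0 * i;
    }

    /* z[i] := αx[i] + βy[i] for the local indices i in J_p:
       no communication is needed at all */
    for (int i = 0; i < Ip; i++)
        z[i] = alpha * x[i] + beta * y[i];

    printf("processor %d computed %d local entries\n", p, Ip);

    free(x); free(y); free(z);
    MPI_Finalize();
    return 0;
}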

17 Performance analysis
Let P be the number of processors, and let T = T(P) denote the execution time of a program on this multicomputer. Performance analysis is the study of the properties of T(P). In order to analyze concurrent algorithms, we have to assume certain properties of the computer. These assumptions are in fact rather strict and thus leave out a lot of existing computers; on the other hand, without them the analysis tends to become extremely complicated.

18 Observation
Let T(1) be the fastest possible scalar (single-processor) computation. Then T(P) ≥ T(1)/P. This relation states a bound on how fast a computation can be done on a parallel computer compared with a scalar computer.

19 Definitions
Speed-up: The speed-up of a P-node computation with execution time T(P) is given by S(P) = T(1)/T(P).
Efficiency: The efficiency of a P-node computation with speed-up S(P) is given by η(P) = S(P)/P.
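A trivial C helper (mine, with made-up timings) that evaluates these two definitions from measured run times:

#include <stdio.h>

static double speedup(double T1, double TP)            { return T1 / TP; }
static double efficiency(double T1, double TP, int P)  { return speedup(T1, TP) / P; }

int main(void)
{
    /* made-up measurements: T(1) = 100 s, T(4) = 30 s */
    double T1 = 100.0, T4 = 30.0;
    printf("S(4) = %.2f, eta(4) = %.2f\n",
           speedup(T1, T4), efficiency(T1, T4, 4));   /* S(4)=3.33, eta(4)=0.83 */
    return 0;
}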

20 Discussion
Suppose we are in the optimal situation, i.e., we have T(P) = T(1)/P. Then the speed-up is given by S(P) = T(1)/T(P) = P, and the efficiency is η(P) = S(P)/P = 1.

21 More generally we have T(P) ≥ T(1)/P, which implies that S(P) = T(1)/T(P) ≤ P and η(P) = S(P)/P ≤ 1. In practical computations we are pleased if we are close to these optimal results: a speed-up close to P and an efficiency close to 1 is very good. Practical details often result in weaker performance than expected from the analysis.

22 Efficiency modelling
Goal: estimate how fast a certain algorithm can run on a multicomputer. The models depend on the following parameters:
τ_A = arithmetic time; the time of one single arithmetic operation. Integer operations are ignored, and all nodes are assumed equal.
τ_C(L) = message exchange time; the time it takes to send a message of length L (in proper units) from one processor to another. We assume this time is equal for any pair of processors.
τ_L = latency; the start-up time for a communication, or the time it takes to send a message of length zero.
1/β = bandwidth; the maximum rate at which messages (in proper units) can be exchanged.

23 Efficiency modelling
In our efficiency models, we will assume that there is a linear relation between the message exchange time and the length of the message: τ_C(L) = τ_L + βL.

24 Analysis of Vector_Sum_2
|| p=0,1,…,P-1 : Program Vector_Sum_2
declare i: integer; x, y, z: array[J_p] of real;
initially …
assign || i=0,1,…,I_p-1 : z[i] := αx[i] + βy[i]
end

Recall that J_p = {0,1,…,I_p-1}, and define I = max_p I_p. Then a model of the execution time is given by T(P) = 3 max_p I_p τ_A = 3I τ_A. Notice that there are three arithmetic operations for each entry of the array (two multiplications and one addition).

25 Load balancing
Obviously, we would like to balance the load of the processors: each of them should perform approximately the same number of operations (recall that we assume all processors have the same capacity). In the notation used for the present vector operation, we have load balance if I is as small as possible. In the case that M (the number of array entries) is a multiple of P (the number of processors), we have load balance if I = M/P, meaning that there are equally many vector entries on each processor.

26 Speed-up
For this problem the speed-up is S(P) = T(1)/T(P) = 3Mτ_A / 3Iτ_A = M/I. If the problem is load balanced, we have I = M/P and thus S(P) = P, which is optimal. Notice that we are typically interested in very large values of M, whereas the number of processors P is usually below 1000.

27 The communication cost In the above example, no communication at all was necessary. In the next example, one real number must be communicated. This changes the analysis a bit!

28 The communication cost
Purpose:
–derive a multicomputer program for computation of an inner product.
Given
–Length of vectors M.
–Data x_m, y_m, m=0,1,…,M-1 of real numbers.
–Number of processors P.
–Set of indices J_p = {0,1,…,I_p-1}, where the number of entries I_p on the p-th processor is given.
–A one-to-one mapping between global and local indices.
Compute
–σ = (x,y), i.e., σ = x[0]y[0] + x[1]y[1] + … + x[M-1]y[M-1].

29 Program Inner_Product
|| p=0,1,…,P-1 : Program Inner_Product
declare i: integer; w: array[0,1,…,P-1] of real; x, y: array[J_p] of real;
initially …
assign
w[p] := x[0]y[0] + x[1]y[1] + … + x[I_p-1]y[I_p-1] ;
send w[p] to all other processors ;
σ := w[0] + w[1] + … + w[P-1] ;
end
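A hedged C/MPI counterpart: the local partial sum w[p] is computed first, and the "send w[p] to all, then add" step is expressed here with MPI_Allreduce, which combines the P partial sums and leaves the result σ on every processor. Vector length and data are arbitrary choices for the example.

/* compile: mpicc inner_product.c -o inner_product
   run:     mpirun -np 4 ./inner_product           */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int P, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &p);

    const int M = 1000;                    /* global vector length (arbitrary) */
    int Ip = M / P + (p < M % P ? 1 : 0);  /* block-distributed local length   */

    double *x = malloc(Ip * sizeof *x);
    double *y = malloc(Ip * sizeof *y);
    for (int i = 0; i < Ip; i++) {         /* arbitrary local data */
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* w[p] := sum of x[i]*y[i] over the local indices (2*I_p - 1 operations) */
    double w_local = 0.0;
    for (int i = 0; i < Ip; i++)
        w_local += x[i] * y[i];

    /* exchange the partial sums and add them up on every processor:
       sigma := w[0] + w[1] + ... + w[P-1]                            */
    double sigma;
    MPI_Allreduce(&w_local, &sigma, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (p == 0)
        printf("inner product = %g (expected %g)\n", sigma, 2.0 * M);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}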

30 Performance modelling of Inner_Product
Recall J_p = {0,1,…,I_p-1} and I = max_p I_p. A model of the execution time for Inner_Product is given by
T(P) = (2I-1)τ_A + (P-1)τ_C(1) + (P-1)τ_A.
Here the first term arises from the sum of x[i]y[i] over the local i values (I_p multiplications and I_p-1 additions). The second term arises from the cost of sending one real number from one processor to all the others. The third term arises from the computation of the inner product based on the values held on each processor (P-1 additions).

31 Simplifications
Assume I = M/P, i.e., a load balanced problem. Assume (as always) P ≪ M, and τ_C(1) = γτ_A; for practical computers γ is quite large (the examples below use γ = 50). We then have T(P) ≈ 2Iτ_A + Pτ_C(1), or T(P) ≈ (2M/P + γP)τ_A.

32 Example I
Choosing M = 10^5 and γ = 50, we get T(P) = (2·10^5/P + 50P)τ_A.

33 Example II
Choosing M = 10^7 and γ = 50, we get T(P) = (2·10^7/P + 50P)τ_A.
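A small C sketch (mine, not from the slides) that tabulates the model T(P) ≈ (2M/P + γP)τ_A, in units of τ_A, for the two examples above. Since d/dP (2M/P + γP) = -2M/P² + γ, the model is minimized at P = sqrt(2M/γ), roughly 63 processors for Example I and 632 for Example II; beyond that point the communication term γP dominates and T(P) grows again.

#include <stdio.h>
#include <math.h>

/* execution-time model for the inner product, in units of tau_A */
static double T_model(double M, double gamma_, double P)
{
    return 2.0 * M / P + gamma_ * P;
}

int main(void)
{
    const double gamma_ = 50.0;            /* tau_C(1) = gamma * tau_A */
    const double Ms[] = { 1e5, 1e7 };      /* Examples I and II        */

    for (int k = 0; k < 2; k++) {
        double M = Ms[k];
        printf("M = %.0e, optimal P ~ %.0f\n", M, sqrt(2.0 * M / gamma_));
        for (int P = 1; P <= 1024; P *= 4)
            printf("  P = %4d   T(P)/tau_A = %10.0f\n", P, T_model(M, gamma_, P));
    }
    return 0;
}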

34 Speed-up
For this problem, the speed-up is
S(P) = T(1)/T(P) ≈ [(2M + γ)τ_A] / [(2M/P + γP)τ_A] = P[1 + γ/(2M)] / [1 + γP²/(2M)].
Optimal speed-up is characterized by S(P) ≈ P; for this to be the case we must require γP²/(2M) ≪ 1.