1 Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

2 The conventional von Neumann architecture consists of a processor executing a program stored in a (main) memory. Each main memory location is identified by its address; addresses start at zero and extend to 2^n − 1 when there are n bits (binary digits) in the address. Parallel Architectures and Performance Analysis – Slide 2

3 Parallel computer: a multiple-processor system supporting parallel programming. Three principal types of architecture: vector computers, in particular processor arrays; shared memory multiprocessors (specially designed and manufactured systems); and distributed memory multicomputers (message-passing systems readily formed from a cluster of workstations). Parallel Architectures and Performance Analysis – Slide 3

4 Vector computer: the instruction set includes operations on vectors as well as scalars. Two ways to implement vector computers: a pipelined vector processor (e.g. Cray), which streams data through pipelined arithmetic units, and a processor array, with many identical, synchronized arithmetic processing elements. Parallel Architectures and Performance Analysis – Slide 4

5 The case for processor arrays: historically, the high cost of a control unit, and the fact that scientific applications exhibit data parallelism. Parallel Architectures and Performance Analysis – Slide 5

6 Front-end computer (a standard uniprocessor): holds the program; data is manipulated sequentially. Processor array (individual processor/memory pairs): data is manipulated in parallel. Performance depends on the speed of the processing elements, the utilization of the processing elements, and the size of the data structure. Parallel Architectures and Performance Analysis – Slide 6

7 Each VLSI chip has 16 processing elements Parallel Architectures and Performance Analysis – Slide 7

8 Drawbacks of processor arrays: not all problems are data parallel; speed drops for conditionally executed code; they do not adapt well to multiple users; they do not scale down well to “starter” systems; they rely on custom VLSI for processors; and the expense of control units has dropped. Parallel Architectures and Performance Analysis – Slide 8

9 A natural way to extend the single-processor model is to have multiple processors connected to multiple memory modules, such that each processor can access any memory module: the so-called shared memory configuration. Parallel Architectures and Performance Analysis – Slide 9

10 Parallel Architectures and Performance Analysis – Slide 10

11 Any memory location is accessible by any of the processors. A single address space exists, meaning that each memory location is given a unique address within a single range of addresses. Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.). Parallel Architectures and Performance Analysis – Slide 11

12 Alternately known as a tightly coupled architecture: no local memory is associated with processors. Shared memory multiprocessors avoid three problems of processor arrays: they can be built from commodity CPUs, they naturally support multiple users, and they maintain efficiency in conditionally executed code. Parallel Architectures and Performance Analysis – Slide 12

13 Several alternatives for programming shared memory multiprocessors: Using threads (pthreads, Java, …), in which the programmer decomposes the program into individual parallel sequences, each being a thread, and each being able to access variables declared outside the threads (a minimal sketch follows). Using a sequential programming language with user-level libraries to declare and access shared variables. Parallel Architectures and Performance Analysis – Slide 13
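A minimal pthreads sketch of the thread approach, in C (the array, slice size, and thread count are illustrative assumptions, not from the slides); each thread works on its own slice of a shared array declared outside the threads:

```c
/* Sketch: threads sharing an array declared outside the threads.
 * Compile with: cc -pthread example.c */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];               /* shared: visible to every thread */

static void *fill_slice(void *arg) {
    long id = (long)arg;             /* thread index 0 .. NTHREADS-1 */
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    for (long i = lo; i < hi; i++)
        data[i] = 2.0 * i;           /* disjoint slices, so no locking needed */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, fill_slice, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("data[N-1] = %f\n", data[N - 1]);
    return 0;
}
```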

14 Several alternatives for programming shared memory multiprocessors (continued): Using a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Ex: OpenMP, the industry standard: an API for shared-memory systems that supports higher-performance parallel programming of symmetrical multiprocessors (example below). Parallel Architectures and Performance Analysis – Slide 14
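A minimal OpenMP sketch of the directive approach (the loop body and array names are illustrative): one compiler directive marks the loop as parallel, and the runtime divides the iterations among the threads of a shared-memory machine.

```c
/* Sketch: OpenMP directive-based parallelism.
 * Compile with, e.g., gcc -fopenmp example.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N], b[N];        /* shared by all threads */

    #pragma omp parallel for         /* iterations split across threads */
    for (int i = 0; i < N; i++)
        a[i] = 0.5 * b[i] + 1.0;

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}
```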

15 Several alternatives for programming shared memory multiprocessors (continued): Using a parallel programming language with syntax for parallelism, in which the compiler creates the appropriate executable code for each processor. Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code. Neither of these is now common. Parallel Architectures and Performance Analysis – Slide 15

16 Type 1: Centralized Multiprocessor. A straightforward extension of the uniprocessor: add CPUs to the bus; all processors share the same primary memory; memory access time is the same for all CPUs. An example of a uniform memory access (UMA) multiprocessor, also known as a symmetrical multiprocessor (SMP). Parallel Architectures and Performance Analysis – Slide 16

17 Parallel Architectures and Performance Analysis – Slide 17

18 Private data: items used only by a single processor. Shared data: values used by multiple processors. In a centralized multiprocessor, processors communicate via shared data values. Problems associated with shared data: cache coherence (replicating data across multiple caches reduces contention, but how do we ensure different processors see the same value for the same address?) and synchronization (mutual exclusion, barriers); a sketch of the two synchronization mechanisms follows. Parallel Architectures and Performance Analysis – Slide 18
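A small POSIX-threads sketch of the two synchronization mechanisms just named, mutual exclusion and barriers (the shared counter and thread count are illustrative assumptions):

```c
/* Sketch: a mutex protects a shared counter; a barrier makes all
 * threads wait for one another. Compile with: cc -pthread sync.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                                  /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* mutual exclusion */
static pthread_barrier_t barrier;                         /* initialized in main */

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);       /* only one thread updates counter at a time */
    counter++;
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier);  /* no thread proceeds until all have arrived */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&barrier);
    printf("counter = %ld\n", counter);                   /* always NTHREADS */
    return 0;
}
```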

19 Making the main memory of a cluster of computers look as though it is a single memory with a single address space (via hidden message passing). Shared memory programming techniques can then be used. Parallel Architectures and Performance Analysis – Slide 19

20 Type 2: Distributed Multiprocessor. Distribute primary memory among processors to increase aggregate memory bandwidth, lower average memory access time, and allow a greater number of processors. Also called a non-uniform memory access (NUMA) multiprocessor. Parallel Architectures and Performance Analysis – Slide 20

21 Parallel Architectures and Performance Analysis – Slide 21

22 Cache coherence: some NUMA multiprocessors do not support it in hardware, so only instructions and private data are cached, giving a large memory access time variance. Implementations are more difficult: there is no shared memory bus to “snoop”, so a directory-based protocol is needed. Parallel Architectures and Performance Analysis – Slide 22

23 A distributed directory contains information about cacheable memory blocks, with one directory entry for each cache block. Each entry records the sharing status (uncached: block not in any processor’s cache; shared: cached by one or more processors, read only; exclusive: cached by exactly one processor which has written the block, so the copy in memory is obsolete) and which processors have copies; a sketch of such an entry appears below. Parallel Architectures and Performance Analysis – Slide 23
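A hypothetical C sketch of what one directory entry might hold (the field names and the 64-processor bitmask are illustrative, not a description of any particular machine):

```c
/* Sketch: one entry of a directory-based cache-coherence directory. */
#include <stdint.h>

enum sharing_status {
    UNCACHED,   /* block is in no processor's cache                   */
    SHARED,     /* cached read-only by one or more processors         */
    EXCLUSIVE   /* cached by exactly one processor, which has written */
                /* the block, so the copy in memory is obsolete       */
};

struct directory_entry {
    enum sharing_status status;
    uint64_t sharers;  /* bit i set => processor i holds a copy (up to 64 CPUs) */
};
```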

24 Complete computers connected through an interconnection network Parallel Architectures and Performance Analysis – Slide 24

25 Distributed memory multiple-CPU computer: the same address on different processors refers to different physical memory locations, and processors interact through message passing. Examples: commercial multicomputers and commodity clusters. Parallel Architectures and Performance Analysis – Slide 25

26 Alternately known as a loosely coupled architecture. Each processor has its own memory, accessible only to that processor. A message-passing interconnection network provides point-to-point connections among processors. Memory access time varies between processors. Parallel Architectures and Performance Analysis – Slide 26

27 Parallel Architectures and Performance Analysis – Slide 27

28 Advantages: back-end processors are dedicated to parallel computations; it is easier to understand, model, and tune performance; only a simple back-end operating system is needed; and it is easy for a vendor to create. Disadvantages: the front-end computer is a single point of failure; a single front-end computer limits the scalability of the system; the primitive operating system in the back-end processors makes debugging difficult; and every application requires development of both front-end and back-end programs. Parallel Architectures and Performance Analysis – Slide 28

29 Parallel Architectures and Performance Analysis – Slide 29

30 Advantages: alleviates the performance bottleneck caused by a single front-end computer; better support for debugging; every processor executes the same program. Disadvantages: more difficult to maintain the illusion of a single “parallel computer”; no simple way to balance the program development workload among processors; more difficult to achieve high performance when multiple processes run on each processor. Parallel Architectures and Performance Analysis – Slide 30

31 Parallel Architectures and Performance Analysis – Slide 31

32 Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams. Also important are number of processors, number of programs which can be executed, and the memory structure. Parallel Architectures and Performance Analysis – Slide 32

33 Single instruction stream, single data stream (SISD) computer In a single processor computer, a single stream of instructions is generated from the program. The instructions operate upon a single stream of data items. The single CPU executes one instruction at a time and fetches or stores one item of data at a time. Parallel Architectures and Performance Analysis – Slide 33

34 [Figure: SISD organization. A control unit sends control signals to an arithmetic processor; an instruction stream flows to the control unit, a data stream flows between memory and the processor, and results are returned to memory.] Parallel Architectures and Performance Analysis – Slide 34

35 Single instruction stream, multiple data stream (SIMD) computer: a specially designed computer in which a single instruction stream is generated from a single program, but multiple data streams exist. The instructions from the program are broadcast to more than one processor. Each processor executes the same instruction in synchronism, but using different data. Developed because a number of important applications mostly operate upon arrays of data. Parallel Architectures and Performance Analysis – Slide 35
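A minimal data-parallel sketch in C (the function and array names are illustrative): the same multiply-add is applied to every element, so the loop maps naturally onto an SIMD machine, with each processing element handling one element or a block of elements.

```c
/* Sketch: identical operation, different data, for every element. */
void saxpy(int n, float alpha, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];   /* same instruction broadcast to all PEs */
}
```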

36 [Figure: SIMD organization. A single control unit broadcasts the control signal to processing elements PE 1 through PE n, each operating on its own data stream.] Parallel Architectures and Performance Analysis – Slide 36

37 Processing distributed over a large amount of hardware. Operates concurrently on many different data elements. Performs the same computation on all data elements. Processors operate synchronously. Examples: pipelined vector processors (e.g. Cray-1) and processor arrays (e.g. Connection Machine) Parallel Architectures and Performance Analysis – Slide 37

38 [Figure: handling a conditional on SIMD vs. SISD. On the SIMD machine, statements X1 through X4 surrounding the test “a = 0?” are executed in two passes: first the PEs for which a = 0 are active while the others are idle, then the PEs for which a ≠ 0 are active. On the SISD machine the test simply selects the Yes or No branch.] Parallel Architectures and Performance Analysis – Slide 38

39 Multiple instruction stream, single data stream (MISD) computer: MISD machines may execute several different programs on the same data item. There are two categories: distinct processing units performing distinct instructions on the same data (currently there is no such machine), and pipelined architectures, where data flows through a series of processing elements. Parallel Architectures and Performance Analysis – Slide 39

40 [Figure: MISD organization. Control units 1 through n each issue their own instruction stream to processing elements 1 through n, all operating on a single data stream.] Parallel Architectures and Performance Analysis – Slide 40

41 A pipeline processor works according to the principle of pipelining. A process can be broken down into several stages (segments). While one stage is executing, another stage is being loaded and the input of one stage is the output of the previous stage. The processor carries out many different computations concurrently. Example: systolic array Parallel Architectures and Performance Analysis – Slide 41

42 [Figure: serial vs. pipelined execution of stages S1 through S4.] Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage. Pipelined execution of the same two processes: T = 5t. Parallel Architectures and Performance Analysis – Slide 42
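The general timing behind these numbers, stated here as a standard pipelining result rather than something shown on the slide: for n processes of k equal stages, each stage taking time t,

```latex
\[
  T_{\text{serial}} = n\,k\,t, \qquad
  T_{\text{pipelined}} = (k + n - 1)\,t .
\]
% For n = 2 processes of k = 4 stages this gives 8t and (4 + 2 - 1)t = 5t,
% matching the slide.
```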

43 Multiple instruction stream, multiple data stream (MIMD) computer General purpose multiprocessor system. Multiple processors, each with a separate (different) program operating on its own data. One instruction stream is generated from each program for each processor. Each instruction operates upon different data. Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification. Parallel Architectures and Performance Analysis – Slide 43

44 [Figure: MIMD organization. Control units 1 through n each issue their own instruction stream to processing elements 1 through n, each operating on its own data stream.] Parallel Architectures and Performance Analysis – Slide 44

45 Processing distributed over a number of processors operating independently and concurrently. Resources (memory) shared among processors. Each processor runs its own program. MIMD systems execute operations in a parallel asynchronous fashion. Parallel Architectures and Performance Analysis – Slide 45

46 MIMD systems differ with regard to interconnection networks, memory addressing techniques, synchronization, and control structures. A high throughput can be achieved if the processing can be broken into parallel streams, keeping all the processors active concurrently. Parallel Architectures and Performance Analysis – Slide 46

47 Multiple Program Multiple Data (MPMD) Structure: within the MIMD classification, with which we are concerned, each processor has its own program to execute. Parallel Architectures and Performance Analysis – Slide 47

48 Single Program Multiple Data (SPMD) Structure: a single source program is written, and each processor executes its personal copy of this program, although independently and not in synchronism. The source program can be constructed so that parts of the program are executed by certain computers and not others, depending upon the identity of the computer (see the sketch below). The software equivalent of SIMD; it can perform SIMD calculations on MIMD hardware. Parallel Architectures and Performance Analysis – Slide 48
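A minimal SPMD sketch using MPI (an assumed example, not from the slides): every processor runs this same program and branches on its own identity, its MPI rank.

```c
/* Sketch: one source program; behavior selected by processor identity. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("rank 0: coordinating %d processes\n", size);  /* only one CPU */
    else
        printf("rank %d: doing worker computation\n", rank);  /* all the others */

    MPI_Finalize();
    return 0;
}
```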

49 SIMD needs less hardware (only one control unit); in MIMD each processor has its own control unit. SIMD needs less memory than MIMD, since SIMD needs only one copy of the instructions, whereas in MIMD the program and operating system need to be stored at each processor. SIMD has implicit synchronization of PEs; in contrast, explicit synchronization may be required in MIMD. Parallel Architectures and Performance Analysis – Slide 49

50 MIMD allows different operations to be performed on different processing elements simultaneously (functional parallelism); SIMD is limited to data parallelism. For MIMD it is possible to use a general-purpose microprocessor as the processing unit, which may be cheaper and more powerful. Parallel Architectures and Performance Analysis – Slide 50

51 The time to execute a sequence of instructions whose execution time is data dependent is less for MIMD than for SIMD. MIMD allows each instruction to execute independently; in SIMD each processing element must wait until all the others have finished executing the current instruction. Thus, with t_i the time spent on instruction i and the maxima taken over the processing elements, T(MIMD) = max{ t_1 + t_2 + … + t_n } while T(SIMD) = max{ t_1 } + max{ t_2 } + … + max{ t_n }, so T(MIMD) ≤ T(SIMD). Parallel Architectures and Performance Analysis – Slide 51

52 In MIMD each processing element can independently follow either path when executing an if-then-else statement; on SIMD this requires two phases (sketched below). MIMD can also operate in SIMD mode. Parallel Architectures and Performance Analysis – Slide 52
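An illustrative sketch of the two-phase SIMD treatment of an if-then-else, written as ordinary C loops (the per-element tests mimic the masking a real SIMD machine applies; names are illustrative):

```c
/* Computes x[i] = (a[i] != 0) ? b[i] / a[i] : b[i] in SIMD style. */
void simd_style_branch(int n, const float *a, const float *b, float *x) {
    /* Phase 1: PEs whose element satisfies a[i] != 0 are active. */
    for (int i = 0; i < n; i++)
        if (a[i] != 0.0f) x[i] = b[i] / a[i];

    /* Phase 2: the remaining PEs (a[i] == 0) are active. */
    for (int i = 0; i < n; i++)
        if (a[i] == 0.0f) x[i] = b[i];
}
```

A MIMD machine would instead let each processing element take whichever branch its own data requires, in a single pass.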

53 Architectures Vector computers Shared memory multiprocessors: tightly coupled Centralized/symmetrical multiprocessor (SMP): UMA Distributed multiprocessor: NUMA Distributed memory/message-passing multicomputers: loosely coupled Asymmetrical vs. symmetrical Flynn’s Taxonomy SISD, SIMD, MISD, MIMD (MPMD, SPMD) Parallel Architectures and Performance Analysis – Slide 53

54 A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input. The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements. Parallel Architectures and Performance Analysis – Slide 54

55 The degree of parallelism is a measure of the number of operations that an algorithm can perform in parallel for a problem of size W, and it is independent of the parallel architecture. If P(W) is the degree of parallelism of a parallel algorithm, then for a problem of size W no more than P(W) processors can be employed effectively. Want to be able to do two things: predict performance of parallel programs, and understand barriers to higher performance. Parallel Architectures and Performance Analysis – Slide 55

56 Outline: the general speedup formula; Amdahl’s Law (decide if a program merits parallelization); and Gustafson-Barsis’ Law (evaluate the performance of a parallel program). Parallel Architectures and Performance Analysis – Slide 56

57 The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel. The speedup factor of a parallel computation utilizing p processors is defined as the ratio S(p) = t_s / t_p, where t_s is the execution time using a single processor and t_p is the execution time using p processors. In other words, S(p) is the ratio of the sequential processing time to the parallel processing time. Parallel Architectures and Performance Analysis – Slide 57

58 The speedup factor can also be cast in terms of computational steps: S(p) = (number of computational steps using one processor) / (number of parallel computational steps using p processors). The maximum speedup is (usually) p with p processors (linear speedup). Parallel Architectures and Performance Analysis – Slide 58

59 It is assumed that the processors used in the parallel computation are identical to the one used by the sequential algorithm. S(p) then gives the increase in speed obtained by using the multiprocessor. Note that the underlying algorithm for the parallel implementation might be (and usually is) different. Parallel Architectures and Performance Analysis – Slide 59

60 The sequential algorithm has to be the best algorithm known for a particular computation problem. This means that it is fair to judge the performance of parallel computation with respect to the fastest sequential algorithm for solving the same problem in a single processor architecture. Several issues such as synchronization and communication are involved in the parallel computation. Parallel Architectures and Performance Analysis – Slide 60

61 Given a problem of size n on p processors, let σ(n) be the time spent on inherently sequential computations, φ(n) the time spent on potentially parallel computations, and κ(n,p) the time spent on communication operations. Then: S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)). Parallel Architectures and Performance Analysis – Slide 61

62 [Figure: speedup versus number of processors, illustrating “elbowing out”.] Parallel Architectures and Performance Analysis – Slide 62

63 The efficiency of a parallel computation is defined as the ratio between the speedup factor and the number of processing elements in a parallel system: E = S(p) / p. Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation. Parallel Architectures and Performance Analysis – Slide 63

64 In an ideal parallel system the speedup factor is equal to p and the efficiency is equal to one. In practice ideal behavior is not achieved, since processors cannot devote 100 percent of their time to the computation. Every parallel program has overhead factors such as creating processes, process synchronization and communication. In practice efficiency is between zero and one, depending on the degree of effectiveness with which processing elements are utilized. Parallel Architectures and Performance Analysis – Slide 64

65 Since E = S(p)/p, by what we did earlier E ≤ (σ(n) + φ(n)) / (p σ(n) + φ(n) + p κ(n,p)). Since all terms are positive, E > 0. Furthermore, since the denominator is larger than the numerator, E < 1. Parallel Architectures and Performance Analysis – Slide 65

66 Consider the problem of adding n numbers on a p-processor system. Initial brute force approach: all tasks send their values to one processor, which adds them all up. Parallel Architectures and Performance Analysis – Slide 66

67 Parallel algorithm: find the global sum by using a binomial tree (a message-passing sketch follows). [Figure: binomial-tree combination of partial sums into the global sum S.] Parallel Architectures and Performance Analysis – Slide 67
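A hedged message-passing sketch of the binomial-tree sum, in C with MPI (the rank pairing, tag, and stand-in partial sums are illustrative assumptions): the p partial sums are combined in about log2 p send-and-add steps, matching the analysis on the next slide.

```c
/* Sketch: binomial-tree global sum. Each rank first forms a local
 * partial sum (a stand-in value here), then the partial sums are
 * combined pairwise up the tree; rank 0 ends up with the total. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double partial = (double)rank + 1.0;   /* stand-in for the local n/p sum */

    for (int step = 1; step < p; step *= 2) {
        if (rank % (2 * step) == 0) {              /* receiver at this level */
            if (rank + step < p) {
                double incoming;
                MPI_Recv(&incoming, 1, MPI_DOUBLE, rank + step, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                partial += incoming;
            }
        } else {                                   /* sender: pass sum up, then stop */
            MPI_Send(&partial, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;
        }
    }

    if (rank == 0)
        printf("global sum = %f\n", partial);
    MPI_Finalize();
    return 0;
}
```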

68 Assume it takes one unit of time for two directly connected processors to add two numbers and to communicate with each other. Adding n/p numbers locally on each processor takes n/p − 1 units of time. The p partial sums may be added in log p steps, each consisting of one addition and one communication. Parallel Architectures and Performance Analysis – Slide 68

69 The total parallel computation time T_p is n/p − 1 + 2 log p. For large values of p and n this can be approximated by T_p = n/p + 2 log p. The serial computation time can be approximated by T_s = n. Parallel Architectures and Performance Analysis – Slide 69

70 The expression for speedup is S(p) = n / (n/p + 2 log p). The expression for efficiency is E = S(p)/p = n / (n + 2 p log p). Speedup and efficiency can be calculated for any p and n. Parallel Architectures and Performance Analysis – Slide 70

71 Computational efficiency as a function of n and p.

            processors p
      n     1     2     4     8    16    32
     64     1  .980  .930  .815  .623  .399
    192     1  .990  .975  .930  .832  .665
    320     1  .995  .985  .956  .892  .768
    512     1  .995  .990  .972  .930  .841

Parallel Architectures and Performance Analysis – Slide 71

72 [Figure: speedup versus number of processors (up to about 30) for n = 64, 192, 320, and 512.] Parallel Architectures and Performance Analysis – Slide 72

73 Parallel Architectures and Performance Analysis – Slide 73

74 As before, S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p), since the communication time must be non-trivial. Let f = σ(n) / (σ(n) + φ(n)) represent the inherently sequential portion of the computation; then σ(n) = f (σ(n) + φ(n)) and φ(n) = (1 − f)(σ(n) + φ(n)). Parallel Architectures and Performance Analysis – Slide 74

75 Then S(p) ≤ (σ(n) + φ(n)) / (f (σ(n) + φ(n)) + (1 − f)(σ(n) + φ(n))/p) = 1 / (f + (1 − f)/p). In short, the maximum speedup factor is given by S(p) ≤ 1 / (f + (1 − f)/p), where f is the fraction of the computation that cannot be divided into concurrent tasks. Parallel Architectures and Performance Analysis – Slide 75

76 Limitations of Amdahl’s Law: it ignores communication time, and so overestimates the achievable speedup. The Amdahl Effect: typically κ(n,p) has lower complexity than φ(n)/p, so as the problem size n increases, φ(n)/p dominates κ(n,p), and thus the speedup increases. Parallel Architectures and Performance Analysis – Slide 76

77 Even with an infinite number of processors, the maximum speedup is limited to 1/f. Example: with only 5% of a computation being serial, the maximum speedup is 20, irrespective of the number of processors (see the limit below). Parallel Architectures and Performance Analysis – Slide 77
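Making the limit explicit (a short derivation added here for completeness, using the bound from the previous slide):

```latex
\[
  \lim_{p \to \infty} \frac{1}{f + (1-f)/p} = \frac{1}{f},
  \qquad
  f = 0.05 \;\Longrightarrow\; S \le \frac{1}{0.05} = 20 .
\]
```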

78 So Amdahl’s Law treats the problem size as a constant and shows how execution time decreases as the number of processors increases. However, we often use faster computers to solve larger problem instances. So let’s treat time as a constant and allow the problem size to increase with the number of processors. Parallel Architectures and Performance Analysis – Slide 78

79 As before, S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p). Let s = σ(n) / (σ(n) + φ(n)/p) represent the fraction of time spent in the parallel computation performing inherently sequential operations; then σ(n) = s (σ(n) + φ(n)/p) and φ(n)/p = (1 − s)(σ(n) + φ(n)/p). Parallel Architectures and Performance Analysis – Slide 79

80 Then S(p) ≤ (s (σ(n) + φ(n)/p) + p (1 − s)(σ(n) + φ(n)/p)) / (σ(n) + φ(n)/p) = s + p (1 − s) = p + (1 − p) s. Parallel Architectures and Performance Analysis – Slide 80

81 The Gustafson-Barsis approach: begin with the parallel execution time instead of the sequential time, and estimate the sequential execution time to solve the same problem. Problem size is treated as an increasing function of p, so the law predicts scaled speedup. Parallel Architectures and Performance Analysis – Slide 81

82 An application running on 10 processors spends 3% of its time in serial code. According to Amdahl’s Law the maximum speedup is S ≤ 1 / (0.03 + 0.97/10) ≈ 7.87. However, the scaled speedup (Gustafson-Barsis) is S = 10 + (1 − 10)(0.03) = 9.73. Parallel Architectures and Performance Analysis – Slide 82
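A small C sketch that reproduces these two numbers (the function names are illustrative):

```c
#include <stdio.h>

/* Amdahl: fixed problem size; f is the serial fraction of the sequential run. */
static double amdahl(double f, int p)    { return 1.0 / (f + (1.0 - f) / p); }

/* Gustafson-Barsis: fixed time; s is the serial fraction of the parallel run. */
static double gustafson(double s, int p) { return p + (1 - p) * s; }

int main(void) {
    printf("Amdahl    : %.2f\n", amdahl(0.03, 10));     /* about 7.87 */
    printf("Gustafson : %.2f\n", gustafson(0.03, 10));  /* 9.73 */
    return 0;
}
```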

83 Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time, so both overestimate the speedup or scaled speedup achievable. [Photos: Gene Amdahl; John L. Gustafson.] Parallel Architectures and Performance Analysis – Slide 83

84 Performance terms: speedup, efficiency Model of speedup: serial, parallel and communication components What prevents linear speedup? Serial and communication operations Process start-up Imbalanced workloads Architectural limitations Analyzing parallel performance Amdahl’s Law Gustafson-Barsis’ Law Parallel Architectures and Performance Analysis – Slide 84

85 Based on original material from The University of Akron: Tim O’Neil, Kathy Liszka Hiram College: Irena Lomonosov The University of North Carolina at Charlotte Barry Wilkinson, Michael Allen Oregon State University: Michael Quinn Revision history: last updated 7/28/2011. Parallel Architectures and Performance Analysis – Slide 85

