Parallel Architectures and Performance Analysis
Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Conventional von Neumann architecture consists of a processor executing a program stored in a (main) memory: Each main memory location is identified by its address. Addresses start at zero and extend to 2^n – 1 when there are n bits (binary digits) in the address. Parallel Architectures and Performance Analysis – Slide 2

Parallel computer: a multiple-processor system supporting parallel programming. Three principal types of architecture: Vector computers, in particular processor arrays Shared memory multiprocessors Specially designed and manufactured systems Distributed memory multicomputers Message passing systems readily formed from a cluster of workstations Parallel Architectures and Performance Analysis – Slide 3

Vector computer: instruction set includes operations on vectors as well as scalars Two ways to implement vector computers Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units Processor array: many identical, synchronized arithmetic processing elements Parallel Architectures and Performance Analysis – Slide 4
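To make the vector idea concrete, here is a minimal C sketch (an illustrative loop, not taken from the slides): a scalar processor executes the element-wise addition one iteration at a time, whereas a pipelined vector processor would issue it as a single vector-add operation and a processor array would assign one element to each processing element.

```c
#include <stddef.h>

/* Element-wise vector addition: c[i] = a[i] + b[i].
 * A scalar CPU runs this loop one element per iteration; a vector
 * computer combines the operand vectors element by element with a
 * single vector operation. */
void vector_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```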

Historically, high cost of a control unit Scientific applications have data parallelism Parallel Architectures and Performance Analysis – Slide 5

Front end computer (standard uniprocessor) Program Data manipulated sequentially Processor array (individual processor/memory pairs) Data manipulated in parallel Performance Speed of processing elements Utilization of processing elements Size of data structure Parallel Architectures and Performance Analysis – Slide 6

Each VLSI chip has 16 processing elements Parallel Architectures and Performance Analysis – Slide 7

Not all problems are data parallel Speed drops for conditionally executed code Do not adapt to multiple users well Do not scale down well to “starter” systems Rely on custom VLSI for processors Expense of control units has dropped Parallel Architectures and Performance Analysis – Slide 8

Natural way to extend single processor model Have multiple processors connected to multiple memory modules such that each processor can access any memory module So-called shared memory configuration: Parallel Architectures and Performance Analysis – Slide 9

Parallel Architectures and Performance Analysis – Slide 10

Any memory location is accessible to any of the processors. A single address space exists, meaning that each memory location is given a unique address within a single range of addresses. Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.). Parallel Architectures and Performance Analysis – Slide 11

Alternatively known as a tightly coupled architecture. No local memory associated with processors. Avoids three problems of processor arrays: Can be built from commodity CPUs Naturally support multiple users Maintain efficiency in conditional code Parallel Architectures and Performance Analysis – Slide 12

Several alternatives for programming shared memory multiprocessors Using threads (pthreads, Java, …) in which the programmer decomposes the program into individual parallel sequences, each being a thread, and each being able to access variables declared outside the threads. Using a sequential programming language with user-level libraries to declare and access shared variables. Parallel Architectures and Performance Analysis – Slide 13
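As a concrete illustration of the thread-based approach, here is a minimal pthreads sketch in C (an illustrative example, not part of the original slides): several threads update a counter declared outside the thread function, with a mutex providing the access control mentioned above.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static long counter = 0;                      /* shared variable        */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);            /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);          /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, NULL);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    printf("counter = %ld\n", counter);       /* expect 400000 */
    return 0;
}
```

Compile with, for example, cc -pthread.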

Several alternatives for programming shared memory multiprocessors Using a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Ex: OpenMP – the industry standard An API for shared-memory systems Supports higher performance parallel programming of symmetrical multiprocessors Parallel Architectures and Performance Analysis – Slide 14
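A minimal OpenMP sketch in C (illustrative, not part of the original slides) shows the directive style: a compiler directive parallelizes an ordinary sequential loop, and the reduction clause manages the shared sum.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* The directive splits the loop iterations among threads;
     * reduction(+:sum) gives each thread a private partial sum and
     * combines them at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;

    printf("harmonic sum H_%d = %f (max threads: %d)\n",
           n, sum, omp_get_max_threads());
    return 0;
}
```

Compile with, for example, cc -fopenmp.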

Several alternatives for programming shared memory multiprocessors Using a parallel programming language with syntax for parallelism, in which the compiler creates the appropriate executable code for each processor. Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code. Neither of these is now common. Parallel Architectures and Performance Analysis – Slide 15

Type 1: Centralized Multiprocessor Straightforward extension of uniprocessor Add CPUs to bus All processors share same primary memory Memory access time same for all CPUs An example of a uniform memory access (UMA) multiprocessor Symmetrical multiprocessor (SMP) Parallel Architectures and Performance Analysis – Slide 16

Parallel Architectures and Performance Analysis – Slide 17

Private data: items used only by a single processor Shared data: values used by multiple processors In a centralized multiprocessor, processors communicate via shared data values Problems associated with shared data Cache coherence Replicating data across multiple caches reduces contention How to ensure different processors have the same value for the same address? Synchronization Mutual exclusion Barriers Parallel Architectures and Performance Analysis – Slide 18

Making the main memory of a cluster of computers look as though it is a single memory with a single address space (via hidden message passing). Then can use shared memory programming techniques. Parallel Architectures and Performance Analysis – Slide 19

Type 2: Distributed Multiprocessor Distribute primary memory among processors Increase aggregate memory bandwidth and lower average memory access time Allow greater number of processors Also called non-uniform memory access (NUMA) multiprocessor Parallel Architectures and Performance Analysis – Slide 20

Parallel Architectures and Performance Analysis – Slide 21

Some NUMA multiprocessors do not support cache coherence in hardware Only instructions and private data are cached Large memory access time variance Implementations more difficult No shared memory bus to “snoop” Directory-based protocol needed Parallel Architectures and Performance Analysis – Slide 22

Distributed directory contains information about cacheable memory blocks One directory entry for each cache block Each entry has Sharing status Uncached: block not in any processor’s cache Shared: cached by one or more processors; read only Exclusive: cached by exactly one processor which has written block, so copy in memory obsolete Which processors have copies Parallel Architectures and Performance Analysis – Slide 23
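For illustration only, here is a small C sketch (the names and sizes are assumptions, not taken from the slides) of what such a directory entry might look like: a sharing status plus a bit vector recording which processors hold a copy of the block.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_PROCS 64            /* assumed machine size for this sketch */

/* Sharing status of one cacheable memory block, as on the slide. */
typedef enum {
    UNCACHED,                   /* block not in any processor's cache            */
    SHARED,                     /* cached read-only by one or more processors    */
    EXCLUSIVE                   /* cached, and written, by exactly one processor */
} dir_state_t;

/* One directory entry per cache block. */
typedef struct {
    dir_state_t state;
    uint64_t    sharers;        /* bit i set => processor i has a copy */
} dir_entry_t;

static bool has_copy(const dir_entry_t *e, int proc)
{
    return (e->sharers >> proc) & 1u;
}
```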

Complete computers connected through an interconnection network Parallel Architectures and Performance Analysis – Slide 24

Distributed memory multiple-CPU computer Same address on different processors refers to different physical memory locations Processors interact through message passing Commercial multicomputers Commodity clusters Parallel Architectures and Performance Analysis – Slide 25

Alternate name for message-passing multicomputer systems. Each processor has its own memory accessible only to that processor. A message passing interconnection network provides point-to-point connections among processors. Memory access varies between processors. Parallel Architectures and Performance Analysis – Slide 26

Parallel Architectures and Performance Analysis – Slide 27

Advantages: Back-end processors dedicated to parallel computations Easier to understand, model, tune performance Only a simple back-end operating system needed Easy for a vendor to create Disadvantages: Front-end computer is a single point of failure Single front-end computer limits scalability of system Primitive operating system in back-end processors makes debugging difficult Every application requires development of both front-end and back-end programs Parallel Architectures and Performance Analysis – Slide 28

Parallel Architectures and Performance Analysis – Slide 29

Advantages: Alleviate performance bottleneck caused by single front-end computer Better support for debugging Every processor executes same program Disadvantages: More difficult to maintain illusion of single “parallel computer” No simple way to balance program development workload among processors More difficult to achieve high performance when multiple processes on each processor Parallel Architectures and Performance Analysis – Slide 30

Parallel Architectures and Performance Analysis – Slide 31

Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams. Also important are number of processors, number of programs which can be executed, and the memory structure. Parallel Architectures and Performance Analysis – Slide 32

Single instruction stream, single data stream (SISD) computer In a single processor computer, a single stream of instructions is generated from the program. The instructions operate upon a single stream of data items. The single CPU executes one instruction at a time and fetches or stores one item of data at a time. Parallel Architectures and Performance Analysis – Slide 33

Parallel Architectures and Performance Analysis – Slide 34 [Figure: SISD organization – control unit, arithmetic processor and memory, linked by control signals, an instruction stream, a data stream and results.]

Single instruction stream, multiple data stream (SIMD) computer A specially designed computer in which a single instruction stream is from a single program, but multiple data streams exist. The instructions from the program are broadcast to more than one processor. Each processor executes the same instruction in synchronism, but using different data. Developed because there are a number of important applications that mostly operate upon arrays of data. Parallel Architectures and Performance Analysis – Slide 35

Parallel Architectures and Performance Analysis – Slide 36 [Figure: SIMD organization – a single control unit broadcasts the control signal to processing elements PE 1, PE 2, …, PE n, each operating on its own data stream 1, 2, …, n.]

Processing distributed over a large amount of hardware. Operates concurrently on many different data elements. Performs the same computation on all data elements. Processors operate synchronously. Examples: pipelined vector processors (e.g. Cray-1) and processor arrays (e.g. Connection Machine) Parallel Architectures and Performance Analysis – Slide 37

Parallel Architectures and Performance Analysis – Slide 38 [Figure: conditional execution of statements X1–X4 on SISD versus SIMD machines. The SISD machine tests a = 0 and follows only one branch (yes or no). The SIMD machine runs both branches in sequence: all PEs execute the common statements, then the PEs satisfying a = 0 execute their branch while the others are idle, and then the PEs satisfying a ≠ 0 execute theirs while the others are idle.]

Multiple instruction stream, single data stream (MISD) computer MISD machines may execute several different programs on the same data item. There are two categories Distinct processing units perform distinct instructions on the same data. Currently there is no such machine. Pipelined architectures, where data flows through a series of processing elements. Parallel Architectures and Performance Analysis – Slide 39

Parallel Architectures and Performance Analysis – Slide 40 [Figure: MISD organization – control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, all operating on a single data stream.]

A pipeline processor works according to the principle of pipelining. A process can be broken down into several stages (segments). While one stage is executing, another stage is being loaded and the input of one stage is the output of the previous stage. The processor carries out many different computations concurrently. Example: systolic array Parallel Architectures and Performance Analysis – Slide 41

Parallel Architectures and Performance Analysis – Slide 42 [Figure: serial execution of two processes with 4 stages (S1–S4) each takes T = 8t, where t is the time to execute one stage; pipelined execution of the same two processes takes T = 5t.]
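The figure’s numbers follow from the usual pipeline timing argument (a standard formula, stated here for completeness rather than taken from the slide): with k stages of length t and m processes, the first process completes after k stages and each later process completes one stage after the previous one.

```latex
% Pipeline timing for m processes through k stages, each of duration t:
T_{\text{serial}}    = m \, k \, t, \qquad
T_{\text{pipelined}} = (k + m - 1)\, t .
% Slide example: k = 4, m = 2 gives 2 * 4 * t = 8t serially
% and (4 + 2 - 1) t = 5t pipelined.
```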

Multiple instruction stream, multiple data stream (MIMD) computer General purpose multiprocessor system. Multiple processors, each with a separate (different) program operating on its own data. One instruction stream is generated from each program for each processor. Each instruction operates upon different data. Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification. Parallel Architectures and Performance Analysis – Slide 43

Parallel Architectures and Performance Analysis – Slide 44 [Figure: MIMD organization – control units 1 through n issue separate instruction streams 1 through n to processing elements 1 through n, each operating on its own data stream.]

Processing distributed over a number of processors operating independently and concurrently. Resources (memory) shared among processors. Each processor runs its own program. MIMD systems execute operations in a parallel asynchronous fashion. Parallel Architectures and Performance Analysis – Slide 45

Differ with regard to Interconnection networks Memory addressing techniques Synchronization Control structures A high throughput can be achieved if the processing can be broken into parallel streams keeping all the processors active concurrently. Parallel Architectures and Performance Analysis – Slide 46

Multiple Program Multiple Data (MPMD) Structure Within the MIMD classification, which we are concerned with, each processor will have its own program to execute. Parallel Architectures and Performance Analysis – Slide 47

Single Program Multiple Data (SPMD) Structure Single source program is written and each processor will execute its personal copy of this program, although independently and not in synchronism. The source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer. Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware. Parallel Architectures and Performance Analysis – Slide 48
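A minimal MPI sketch in C (illustrative, not part of the original slides) shows the SPMD idea: every processor runs the same source program and branches on its own rank, so different processors execute different parts of the program.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* identity of this copy  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of copies */

    if (rank == 0) {
        /* Only the copy on processor 0 executes this part. */
        printf("Master: %d processes are running this program\n", size);
    } else {
        /* Every other copy executes this part, on its own data. */
        printf("Worker %d: doing my share of the work\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```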

SIMD needs less hardware (only one control unit). In MIMD each processor has its own control unit. SIMD needs less memory than MIMD (SIMD needs only one copy of the instructions). In MIMD the program and operating system need to be stored at each processor. SIMD has implicit synchronization of PEs. In contrast, explicit synchronization may be required in MIMD. Parallel Architectures and Performance Analysis – Slide 49

MIMD allows different operations to be performed on different processing elements simultaneously (functional parallelism). SIMD is limited to data parallelism. For MIMD it is possible to use a general-purpose microprocessor as a processing unit. Such processors may be cheaper and more powerful. Parallel Architectures and Performance Analysis – Slide 50

Time to execute a sequence of instructions in which the execution time is data dependent is less for MIMD than for SIMD. MIMD allows each instruction to execute independently. In SIMD each processing element must wait until all the others have finished the execution of one instruction. Thus T(MIMD) = max{t_1 + t_2 + … + t_n} while T(SIMD) = max{t_1} + max{t_2} + … + max{t_n}, so T(MIMD) ≤ T(SIMD). Parallel Architectures and Performance Analysis – Slide 51

In MIMD each processing element can independently follow either path when executing an if-then-else statement. This requires two phases on a SIMD machine. MIMD can operate in SIMD mode. Parallel Architectures and Performance Analysis – Slide 52

Architectures Vector computers Shared memory multiprocessors: tightly coupled Centralized/symmetrical multiprocessor (SMP): UMA Distributed multiprocessor: NUMA Distributed memory/message-passing multicomputers: loosely coupled Asymmetrical vs. symmetrical Flynn’s Taxonomy SISD, SIMD, MISD, MIMD (MPMD, SPMD) Parallel Architectures and Performance Analysis – Slide 53

A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input. The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements. Parallel Architectures and Performance Analysis – Slide 54

The degree of parallelism is a measure of the number of operations that an algorithm can perform in parallel for a problem of size W, and it is independent of the parallel architecture. If P(W) is the degree of parallelism of a parallel algorithm, then for a problem of size W no more than P(W) processors can be employed effectively. Want to be able to do two things: predict performance of parallel programs, and understand barriers to higher performance. Parallel Architectures and Performance Analysis – Slide 55

General speedup formula Amdahl’s Law Decide if program merits parallelization Gustafson-Barsis’ Law Evaluate performance of a parallel program Parallel Architectures and Performance Analysis – Slide 56

The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel. The speedup factor of a parallel computation utilizing p processors is defined as the following ratio: S(p) = t_s / t_p, where t_s is the execution time using one processor and t_p is the execution time using p processors. In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time. Parallel Architectures and Performance Analysis – Slide 57

Speedup factor can also be cast in terms of computational steps: S(p) = (number of computational steps using one processor) / (number of parallel computational steps using p processors). Maximum speedup is (usually) p with p processors (linear speedup). Parallel Architectures and Performance Analysis – Slide 58

It is assumed that the processors used in the parallel computation are identical to the one used by the sequential algorithm. S(p) gives the increase in speed obtained by using a multiprocessor. The underlying algorithm for the parallel implementation might be (and usually is) different. Parallel Architectures and Performance Analysis – Slide 59

The sequential algorithm has to be the best algorithm known for a particular computation problem. This means that it is fair to judge the performance of parallel computation with respect to the fastest sequential algorithm for solving the same problem in a single processor architecture. Several issues such as synchronization and communication are involved in the parallel computation. Parallel Architectures and Performance Analysis – Slide 60

Given a problem of size n on p processors, let σ(n) be the time spent in inherently sequential computations, φ(n) the time spent in potentially parallel computations, and κ(n,p) the time spent in communication operations. Then: S(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)). Parallel Architectures and Performance Analysis – Slide 61

Parallel Architectures and Performance Analysis – Slide 62 [Figure: speedup plotted against the number of processors – the curve bends over (“elbowing out”) as more processors are added.]

The efficiency of a parallel computation is defined as a ratio between the speedup factor and the number of processing elements in a parallel system: E(p) = S(p) / p. Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation. Parallel Architectures and Performance Analysis – Slide 63

In an ideal parallel system the speedup factor is equal to p and the efficiency is equal to one. In practice ideal behavior is not achieved, since processors cannot devote 100 percent of their time to the computation. Every parallel program has overhead factors such as creating processes, process synchronization and communication. In practice efficiency is between zero and one, depending on the degree of effectiveness with which processing elements are utilized. Parallel Architectures and Performance Analysis – Slide 64

Since E = S(p)/p, by what we did earlier E(n,p) ≤ (σ(n) + φ(n)) / (p σ(n) + φ(n) + p κ(n,p)). Since all terms are positive, E > 0. Furthermore, since the denominator is larger than the numerator, E < 1. Parallel Architectures and Performance Analysis – Slide 65

Consider the problem of adding n numbers on a p-processor system. Initial brute force approach: all tasks send values to one processor which adds them all up. Parallel Architectures and Performance Analysis – Slide 66

Parallel algorithm: find the global sum by using a binomial tree. Parallel Architectures and Performance Analysis – Slide 67 [Figure: binomial-tree summation combining partial sums pairwise into a single global sum S.]

Assume it takes one unit of time for two directly connected processors to add two numbers and to communicate to each other. Adding n/p numbers locally on each processor takes n/p –1 units of time. The p partial sums may be added in log p steps, each consisting of one addition and one communication. Parallel Architectures and Performance Analysis – Slide 68
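As an illustration (not part of the original slides), the algorithm just described maps naturally onto MPI: each process adds its n/p local numbers and MPI_Reduce combines the p partial sums, which implementations typically do with a reduction tree of the kind shown above.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const long n = 1000000;                 /* total numbers to add    */
    long chunk = n / p;                     /* about n/p per process   */
    long lo = rank * chunk + 1;
    long hi = (rank == p - 1) ? n : lo + chunk - 1;

    /* Local phase: roughly n/p - 1 additions on each process. */
    double local = 0.0;
    for (long i = lo; i <= hi; i++)
        local += (double)i;

    /* Reduction phase: the p partial sums are combined, taking about
     * log p communication steps. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 1..%ld = %.0f\n", n, global);

    MPI_Finalize();
    return 0;
}
```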

The total parallel computation time T_p is n/p − 1 + 2 log p. For large values of p and n this can be approximated by T_p = n/p + 2 log p. The serial computation time can be approximated by T_s = n. Parallel Architectures and Performance Analysis – Slide 69

The expression for speedup is S(p) = T_s / T_p = n / (n/p + 2 log p). The expression for efficiency is E(p) = S(p)/p = n / (n + 2 p log p). Speedup and efficiency can be calculated for any p and n. Parallel Architectures and Performance Analysis – Slide 70
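A short C sketch (not from the slides; it simply evaluates the two approximate expressions above) reproduces the kind of numbers plotted on the next two slides.

```c
#include <stdio.h>
#include <math.h>

/* Model from the slides: T_s = n, T_p = n/p + 2 log2(p). */
static double speedup(double n, double p)    { return n / (n / p + 2.0 * log2(p)); }
static double efficiency(double n, double p) { return speedup(n, p) / p; }

int main(void)
{
    const double sizes[] = {64, 192, 320, 512};
    for (int i = 0; i < 4; i++)
        for (double p = 1; p <= 32; p *= 2)
            printf("n=%4.0f p=%2.0f  S=%6.2f  E=%5.2f\n",
                   sizes[i], p, speedup(sizes[i], p), efficiency(sizes[i], p));
    return 0;
}
```

Link with -lm when compiling.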

Computational efficiency as a function of n and p. Parallel Architectures and Performance Analysis – Slide 71 [Figure: efficiency as a function of the number of processors p and the problem size n.]

Parallel Architectures and Performance Analysis – Slide 72 [Figure: speedup versus number of processors for problem sizes n = 64, 192, 320 and 512.]

Parallel Architectures and Performance Analysis – Slide 73

As before, S(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p), since the communication time must be non-trivial. Let f = σ(n) / (σ(n) + φ(n)) represent the inherently sequential portion of the computation; then S(p) ≤ 1 / (f + (1 − f)/p). Parallel Architectures and Performance Analysis – Slide 74

In short, the maximum speedup factor is given by S(p) ≤ 1 / (f + (1 − f)/p) = p / (1 + (p − 1)f), where f is the fraction of the computation that cannot be divided into concurrent tasks. Parallel Architectures and Performance Analysis – Slide 75

Limitations Ignores communication time Overestimates speedup achievable Amdahl Effect Typically κ(n,p) has lower complexity than φ(n)/p So as n increases, φ(n)/p dominates κ(n,p) Thus as n increases, speedup increases Parallel Architectures and Performance Analysis – Slide 76

Even with an infinite number of processors, maximum speedup limited to 1 / f. Ex: With only 5% of a computation being serial, maximum speedup is 20, irrespective of number of processors. Parallel Architectures and Performance Analysis – Slide 77

So Amdahl’s Law Treats problem size as a constant Shows how execution time decreases as the number of processors increases However, we often use faster computers to solve larger problem instances Let’s treat time as a constant and allow the problem size to increase with the number of processors Parallel Architectures and Performance Analysis – Slide 78

As before, speedup is the ratio of the sequential execution time to the parallel execution time. Let s represent the fraction of time spent in the parallel computation performing inherently sequential operations, so that 1 − s is the fraction spent performing parallelizable operations; then Parallel Architectures and Performance Analysis – Slide 79

Then the sequential time to perform the same computation would be (s + (1 − s)p) times the parallel execution time, so the scaled speedup is S(p) = s + (1 − s)p = p + (1 − p)s. Parallel Architectures and Performance Analysis – Slide 80

Begin with parallel execution time instead of sequential time Estimate sequential execution time to solve same problem Problem size is an increasing function of p Predicts scaled speedup Parallel Architectures and Performance Analysis – Slide 81

An application running on 10 processors spends 3% of its time in serial code. According to Amdahl’s Law the maximum speedup is 1 / (0.03 + 0.97/10) ≈ 7.87. However the scaled speedup, by Gustafson-Barsis’ Law, is 10 + (1 − 10)(0.03) = 9.73. Parallel Architectures and Performance Analysis – Slide 82
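For checking numbers like these, here is a small C sketch (not from the slides) that evaluates both bounds using the formulas given earlier.

```c
#include <stdio.h>

/* Amdahl's Law: maximum speedup with serial fraction f on p processors. */
static double amdahl(double f, int p)    { return 1.0 / (f + (1.0 - f) / p); }

/* Gustafson-Barsis' Law: scaled speedup when fraction s of the parallel
 * execution time is spent on inherently sequential operations. */
static double gustafson(double s, int p) { return p + (1 - p) * s; }

int main(void)
{
    int p = 10;
    double serial = 0.03;                                  /* 3% serial code */
    printf("Amdahl maximum speedup:   %.2f\n", amdahl(serial, p));     /* ~7.87 */
    printf("Gustafson scaled speedup: %.2f\n", gustafson(serial, p));  /*  9.73 */
    return 0;
}
```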

Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time Both overestimate speedup or scaled speedup achievable [Photos: Gene Amdahl and John L. Gustafson] Parallel Architectures and Performance Analysis – Slide 83

Performance terms: speedup, efficiency Model of speedup: serial, parallel and communication components What prevents linear speedup? Serial and communication operations Process start-up Imbalanced workloads Architectural limitations Analyzing parallel performance Amdahl’s Law Gustafson-Barsis’ Law Parallel Architectures and Performance Analysis – Slide 84

Based on original material from The University of Akron: Tim O’Neil, Kathy Liszka Hiram College: Irena Lomonosov The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen Oregon State University: Michael Quinn Revision history: last updated 7/28/2011. Parallel Architectures and Performance Analysis – Slide 85