What Exactly is Parallel Processing?

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

What Exactly is Parallel Processing?
Parallelism = concurrency: doing more than one thing at a time.
Multiple agents (hardware units, software processes) collaborate to perform our main computational task, for example:
- Multiplying two matrices
- Breaking a secret code
- Deciding on the next chess move

Why High-Performance Computing?
- Higher speed (solve problems faster)
- Higher throughput (solve more problems)
- Higher computational power (solve larger problems)
Categories of supercomputers:
- Uniprocessor; vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)

Types of Parallelism: A Taxonomy
Fig. 1.11: The Flynn-Johnson classification of computer systems.

1.2 A Motivating Example
Fig. 1.3: The sieve of Eratosthenes yielding a list of 10 primes for n = 30 (the list after initialization and after passes 1-3; marked elements have been distinguished by erasure from the list).
Any composite number has a prime factor that is no greater than its square root.

Single-Processor Implementation of the Sieve
Fig. 1.4: Schematic representation of the single-processor solution for the sieve of Eratosthenes; the list is held as a bit vector.
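
As a concrete companion to Fig. 1.4, here is a minimal single-processor C sketch that stores the list as a bit vector, one bit per candidate number; the constant N and the BIT/SET macros are illustrative choices, not taken from the slides.

/* Sequential sieve of Eratosthenes over a bit vector; prints the
   10 primes up to N = 30, matching Fig. 1.3. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define N 30
#define BIT(v, i) ((v)[(i) >> 3] &  (1u << ((i) & 7)))   /* test bit i */
#define SET(v, i) ((v)[(i) >> 3] |= (1u << ((i) & 7)))   /* mark bit i */

int main(void) {
    uint8_t composite[N / 8 + 1];
    memset(composite, 0, sizeof composite);
    for (long i = 2; i * i <= N; i++)              /* next surviving prime     */
        if (!BIT(composite, i))
            for (long j = i * i; j <= N; j += i)   /* strike out its multiples */
                SET(composite, j);
    for (long i = 2; i <= N; i++)
        if (!BIT(composite, i)) printf("%ld ", i);
    printf("\n");
    return 0;
}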

Control-Parallel Implementation of the Sieve
Fig. 1.5: Schematic representation of a control-parallel solution for the sieve of Eratosthenes.
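
A rough pthreads sketch of this control-parallel scheme follows: a few threads share the list and repeatedly claim the next unmarked candidate, then strike out its multiples. N, NTHREADS, and the locking scheme are illustrative assumptions rather than the exact organization of Fig. 1.5.

/* Control-parallel sieve sketch: threads share the candidate list and a
   cursor; each thread claims the next unmarked candidate and strikes out
   its multiples.  Claiming a composite by accident only causes redundant
   (never incorrect) work, since its multiples are composite anyway. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N        1000
#define NTHREADS 3

static _Atomic char composite[N + 1];   /* shared sieve array (0 = unmarked) */
static long cursor = 2;                 /* next candidate to hand out        */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (cursor * cursor <= N && composite[cursor]) cursor++;
        long k = cursor++;              /* claim this candidate              */
        pthread_mutex_unlock(&lock);
        if (k * k > N) break;           /* nothing useful left to claim      */
        for (long m = k * k; m <= N; m += k)
            composite[m] = 1;           /* strike out multiples of k         */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    int count = 0;
    for (long i = 2; i <= N; i++) if (!composite[i]) count++;
    printf("%d primes up to %d\n", count, N);   /* 168 for N = 1000 */
    return 0;
}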

Running Time of the Sequential/Parallel Sieve
Fig. 1.6: Control-parallel realization of the sieve of Eratosthenes with n = 1000 and 1 ≤ p ≤ 3.

Data-Parallel Implementation of the Sieve
Assume at most √n processors, so that all prime factors dealt with are in P1's sublist (which P1 broadcasts); this requires √n < n/p.
Fig. 1.7: Data-parallel realization of the sieve of Eratosthenes.
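
A hedged MPI sketch of this data-parallel organization follows: rank 0 plays the role of P1, proposing each successive prime from its own block and broadcasting it, while every rank strikes that prime's multiples out of its own block. The problem size n, and the simplifying assumptions that p divides n and that p ≤ √n, are illustrative and not taken from the slide.

/* Data-parallel sieve sketch with MPI: the range 1..n is split into p
   equal blocks; rank 0 (P1) finds each prime up to sqrt(n) in its block
   and broadcasts it, and all ranks strike out that prime's multiples. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const long n = 1000;                    /* illustrative problem size      */
    const long block = n / p;               /* assumes p divides n            */
    const long lo = (long)rank * block + 1; /* my block covers lo..hi of 1..n */
    const long hi = lo + block - 1;
    char *mark = calloc(block, 1);          /* mark[i - lo] == 1: i is struck */

    long k = 2;                             /* current prime, driven by P1    */
    while (k * k <= n) {
        long first = ((lo + k - 1) / k) * k;    /* first multiple of k >= lo  */
        if (first < k * k) first = k * k;
        for (long m = first; m <= hi; m += k)
            mark[m - lo] = 1;               /* strike k's multiples locally   */
        if (rank == 0)                      /* P1 scans for the next prime... */
            do { k++; } while (k <= hi && mark[k - lo]);
        MPI_Bcast(&k, 1, MPI_LONG, 0, MPI_COMM_WORLD);   /* ...and shares it  */
    }

    long local = 0, total = 0;              /* count the surviving primes     */
    for (long i = (lo < 2 ? 2 : lo); i <= hi; i++)
        if (!mark[i - lo]) local++;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("%ld primes up to %ld\n", total, n);

    free(mark);
    MPI_Finalize();
    return 0;
}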

One Reason for Sublinear Speedup: Communication Overhead
Fig. 1.8: Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.

Another Reason for Sublinear Speedup: Input/Output Overhead
Fig. 1.9: Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.

Amdahl's Law
f = fraction unaffected; p = speedup of the rest
s = 1 / [f + (1 – f)/p] ≤ min(p, 1/f)
Fig. 1.12: Limit on speedup according to Amdahl's law.
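
The bound is easy to check numerically; the tiny C program below (with illustrative values of f and p) evaluates s = 1/[f + (1 – f)/p] and shows it saturating at 1/f.

/* Numeric check of Amdahl's law: the speedup bound tends to 1/f. */
#include <stdio.h>

static double amdahl(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);   /* s = 1 / [f + (1 - f)/p] */
}

int main(void) {
    const double f = 0.1;               /* illustrative unaffected fraction */
    const double ps[] = {1, 2, 4, 8, 16, 1024};
    for (int i = 0; i < 6; i++)
        printf("p = %6g   s <= %.2f\n", ps[i], amdahl(f, ps[i]));
    return 0;                           /* bound approaches 1/f = 10 */
}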

1.6 Effectiveness of Parallel Processing
Fig. 1.13: Task graph exhibiting limited inherent parallelism; W(1) = 13, T(1) = 13, T(∞) = 8.
p      Number of processors
W(p)   Work performed by p processors
T(p)   Execution time with p processors; T(1) = W(1), T(p) ≤ W(p)
S(p)   Speedup = T(1) / T(p)
E(p)   Efficiency = T(1) / [p T(p)]
R(p)   Redundancy = W(p) / W(1)
U(p)   Utilization = W(p) / [p T(p)]
Q(p)   Quality = T³(1) / [p T²(p) W(p)]
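
To see how these measures relate, the short C sketch below evaluates them for the task graph of Fig. 1.13. W(1) = T(1) = 13 and T(∞) = 8 are from the slide; the choice p = 2 with W(2) = 13 and T(2) = 8 is an illustrative assumption about a possible schedule, not a figure from the slide.

/* Effectiveness measures for an assumed 2-processor schedule of Fig. 1.13. */
#include <stdio.h>

int main(void) {
    const double W1 = 13, T1 = 13;          /* sequential work and time (slide)  */
    const double p = 2, Wp = 13, Tp = 8;    /* assumed: no redundant work,
                                               schedule length = critical path   */
    printf("S(2) = %.3f\n", T1 / Tp);                           /* speedup     */
    printf("E(2) = %.3f\n", T1 / (p * Tp));                     /* efficiency  */
    printf("R(2) = %.3f\n", Wp / W1);                           /* redundancy  */
    printf("U(2) = %.3f\n", Wp / (p * Tp));                     /* utilization */
    printf("Q(2) = %.3f\n", T1 * T1 * T1 / (p * Tp * Tp * Wp)); /* quality     */
    return 0;
}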

Reduction or Fan-in Computation
Example: adding 16 numbers, 8 processors, unit-time additions.
Zero-time communication: S(8) = 15 / 4 = 3.75, E(8) = 15 / (8 × 4) = 47%, R(8) = 15 / 15 = 1, Q(8) = 1.76.
Unit-time communication: S(8) = 15 / 7 = 2.14, E(8) = 15 / (8 × 7) = 27%, R(8) = 22 / 15 = 1.47, Q(8) = 0.39.
Fig. 1.14: Computation graph for finding the sum of 16 numbers.
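
The zero-communication case corresponds to a binary-tree (fan-in) summation such as the small C sketch below: 15 additions in log2(16) = 4 levels, with 8-way parallelism available at the first level. The array contents are illustrative.

/* Fan-in (tree) reduction of 16 numbers: 15 additions in 4 levels. */
#include <stdio.h>

int main(void) {
    double x[16];
    for (int i = 0; i < 16; i++) x[i] = i + 1;       /* 1 + 2 + ... + 16 = 136      */
    for (int stride = 1; stride < 16; stride *= 2)   /* one pass per tree level     */
        for (int i = 0; i + stride < 16; i += 2 * stride)
            x[i] += x[i + stride];                   /* pairs combine "in parallel" */
    printf("sum = %g (15 additions, 4 parallel steps)\n", x[0]);
    return 0;
}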

ABCs of Parallel Processing in One Slide
A  Amdahl's Law (Speedup Formula)
Bad news – Sequential overhead will kill you, because:
Speedup = T1/Tp ≤ 1/[f + (1 – f)/p] ≤ min(1/f, p)
Moral: For f = 0.1, speedup is at best 10, regardless of peak OPS.
B  Brent's Scheduling Theorem
Good news – Optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
T1/p ≤ Tp ≤ T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
Result: For a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p ≤ T1/T∞), good speedup and efficiency are possible.
C  Cost-Effectiveness Adage
Real news – The most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speedup (at what cost?), or best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
Parhami, B., "ABCs of Parallel Processing in One Transparency," Computer Architecture News, Vol. 27, No. 1, p. 2, March 1999.
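
As a small numeric companion to Brent's bound, the sketch below plugs in T1 = 15 and T∞ = 4 (the 16-number reduction from the previous slide) for a few assumed processor counts.

/* Brent's bound T1/p <= Tp <= T1/p + Tinf, evaluated for sample p. */
#include <stdio.h>

int main(void) {
    const double T1 = 15, Tinf = 4;     /* work and critical path of the
                                           16-number reduction example   */
    const int ps[] = {2, 4, 8};         /* illustrative processor counts */
    for (int i = 0; i < 3; i++) {
        double p = ps[i];
        printf("p = %g: %5.2f <= Tp <= %5.2f\n", p, T1 / p, T1 / p + Tinf);
    }
    return 0;
}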