Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Parallel Programming in C with MPI and OpenMP
Michael J. Quinn

Chapter 7: Performance Analysis

Learning Objectives
- Predict performance of parallel programs
- Understand barriers to higher performance

Outline
- General speedup formula
- Amdahl's Law
- Gustafson-Barsis' Law
- Karp-Flatt metric
- Isoefficiency metric

Speedup Formula
Speedup = (sequential execution time) / (parallel execution time)
We denote the speedup obtained solving a problem of size n on p processors by ψ(n,p).

Execution Time Components
- Inherently sequential computations: σ(n)
- Potentially parallel computations: φ(n)
- Communication operations: κ(n,p)

Speedup Expression
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))

The denominator terms behave differently as p grows:
- φ(n)/p, the parallel computation time, decreases as processors are added
- κ(n,p), the communication time, increases as processors are added
- φ(n)/p + κ(n,p) therefore first decreases and then increases, so speedup peaks at some finite p

Speedup Plot
(Plot of speedup versus processors: the curve rises, then levels off and bends back, the "elbowing out" caused by growing communication overhead.)

Efficiency
Efficiency = (sequential execution time) / (processors used × parallel execution time) = speedup / processors used
ε(n,p) = ψ(n,p) / p

0 ≤ ε(n,p) ≤ 1
- All terms > 0 ⇒ ε(n,p) > 0
- Denominator > numerator ⇒ ε(n,p) < 1

Amdahl's Law
Let f = σ(n) / (σ(n) + φ(n)), the inherently sequential fraction of the computation. Then
ψ ≤ 1 / (f + (1 - f)/p)
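A brief derivation, added for reference (the standard argument, not reproduced from a slide): drop the communication term κ(n,p) from the speedup expression and divide numerator and denominator by σ(n) + φ(n):
ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p) = 1 / (f + (1 - f)/p)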

Example 1
95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
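Worked solution (added; not on the original slide): with f = 0.05 and p = 8,
ψ ≤ 1 / (0.05 + 0.95/8) ≈ 5.9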

Example 2
20% of a program's execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?
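Worked solution (added; not on the original slide): with f = 0.2, as p → ∞,
ψ ≤ 1 / (0.2 + 0.8/p) → 1 / 0.2 = 5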

Pop Quiz
An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?
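One possible answer (added; the slide leaves the quiz open, and this assumes the remaining 20% is inherently sequential): with f = 0.2 and p = 8,
ψ ≤ 1 / (0.2 + 0.8/8) = 1 / 0.3 ≈ 3.3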

Pop Quiz
A computer animation program generates a feature movie frame-by-frame. Each frame can be generated independently and is output to its own file. If it takes 99 seconds to render a frame and 1 second to output it, how much speedup can be achieved by rendering the movie on 100 processors?
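One common reading (added; it treats the 1 second of output per frame as inherently sequential and the 99 seconds of rendering as perfectly parallel): f = 1/100 = 0.01, so
ψ ≤ 1 / (0.01 + 0.99/100) ≈ 50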

Limitations of Amdahl's Law
- Ignores κ(n,p)
- Overestimates the speedup achievable

Amdahl Effect
- Typically κ(n,p) has lower complexity than φ(n)/p
- As n increases, φ(n)/p dominates κ(n,p)
- As n increases, speedup increases

Illustration of Amdahl Effect
(Plot of speedup versus processors for n = 100, n = 1,000, and n = 10,000: the larger the problem size, the higher the speedup curve.)

Review of Amdahl's Law
- Treats problem size as a constant
- Shows how execution time decreases as the number of processors increases

Another Perspective
- We often use faster computers to solve larger problem instances
- Let's treat time as a constant and allow problem size to increase with the number of processors

Gustafson-Barsis's Law
Let s = σ(n) / (σ(n) + φ(n)/p), the fraction of the parallel execution time spent in serial code. Then the scaled speedup is
ψ ≤ p + (1 - p)s

Gustafson-Barsis's Law
- Begin with parallel execution time
- Estimate the sequential execution time to solve the same problem
- Problem size is an increasing function of p
- Predicts scaled speedup
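A brief derivation, added for reference (the standard argument, not reproduced from a slide): normalize the parallel execution time to 1, so the serial part takes time s and the parallel part takes time 1 - s. A single processor would need s + (1 - s)p time for the same problem, so the scaled speedup is
ψ = s + (1 - s)p = p + (1 - p)s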

Example 1
An application running on 10 processors spends 3% of its time in serial code. What is the scaled speedup of the application?
Hint: execution on 1 CPU takes 10 times as long… except 9 of those pieces do not have to execute the serial code.
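Worked solution (added; not on the original slide): with s = 0.03 and p = 10,
ψ = 10 + (1 - 10)(0.03) = 10 - 0.27 = 9.73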

Example 2
What is the maximum fraction of a program's parallel execution time that can be spent in serial code if it is to achieve a scaled speedup of 7 on 8 processors?
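Worked solution (added; not on the original slide): we need 7 ≤ 8 + (1 - 8)s, i.e. 7s ≤ 1, so s ≤ 1/7 ≈ 0.14. The serial code can take at most about 14% of the parallel execution time.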

Pop Quiz
A parallel program executing on 32 processors spends 5% of its time in sequential code. What is the scaled speedup of this program?
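Worked solution (added; the slide leaves the quiz open): with s = 0.05 and p = 32,
ψ = 32 + (1 - 32)(0.05) = 32 - 1.55 = 30.45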

The Karp-Flatt Metric
- Amdahl's Law and Gustafson-Barsis' Law ignore κ(n,p)
- They can overestimate speedup or scaled speedup
- Karp and Flatt proposed another metric

Experimentally Determined Serial Fraction
e = (inherently serial component of the parallel computation + processor communication and synchronization overhead) / (single-processor execution time)
In terms of measured quantities, e = (1/ψ - 1/p) / (1 - 1/p), where ψ is the speedup observed on p processors.

Experimentally Determined Serial Fraction
- Takes into account parallel overhead
- Detects other sources of overhead or inefficiency ignored in the speedup model:
  - Process startup time
  - Process synchronization time
  - Imbalanced workload
  - Architectural overhead
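A small C helper, added here as a sketch (not part of Quinn's slides), that computes the Karp-Flatt serial fraction from measured speedups; the processor counts and speedups in main() are hypothetical values chosen only to illustrate the calculation:

#include <stdio.h>

/* Karp-Flatt metric: experimentally determined serial fraction
   e = (1/psi - 1/p) / (1 - 1/p), where psi is the measured speedup
   on p processors (valid for p > 1). */
static double karp_flatt(double psi, int p)
{
    return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p);
}

int main(void)
{
    /* Hypothetical measured speedups, for illustration only. */
    int    procs[]   = { 2, 4, 8 };
    double speedup[] = { 1.8, 3.1, 4.7 };

    for (int i = 0; i < 3; i++)
        printf("p = %d  speedup = %.1f  e = %.3f\n",
               procs[i], speedup[i], karp_flatt(speedup[i], procs[i]));
    return 0;
}

Reading the output: if e stays roughly constant as p grows, a large serial fraction is the main limit on speedup; if e grows with p, parallel overhead (communication, synchronization, load imbalance) is the main limit. The two examples that follow use exactly this distinction.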

Example 1
Benchmarking a parallel program on 2 through 8 processors shows a speedup of only 4.7 on 8 CPUs. What is the primary reason? Computing e for each processor count shows that e stays essentially constant at 0.1. Since e is constant, a large serial fraction is the primary reason.

Example 2
Again the speedup is only 4.7 on 8 CPUs; what is the primary reason? This time, computing e for each processor count shows that e increases steadily with p. Since e is steadily increasing, parallel overhead is the primary reason.

Pop Quiz
Is this program likely to achieve a speedup of 10 on 12 processors? (The slide tabulates measured speedups for smaller processor counts; compute e for each measurement and see whether the trend supports a speedup of 10 at p = 12.)

Isoefficiency Metric
- Parallel system: a parallel program executing on a parallel computer
- Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
- A scalable system maintains efficiency as processors are added
- Isoefficiency: a way to measure scalability

Isoefficiency Derivation Steps
- Begin with the speedup formula
- Compute the total amount of overhead
- Assume efficiency remains constant
- Determine the relation between sequential execution time and overhead

Deriving the Isoefficiency Relation
- Determine the overhead: T₀(n,p) = (p - 1)σ(n) + pκ(n,p)
- Substitute the overhead into the speedup equation: ψ(n,p) ≤ p·T(n,1) / (T(n,1) + T₀(n,p)), so the efficiency is ε(n,p) ≤ T(n,1) / (T(n,1) + T₀(n,p))
- Substitute T(n,1) = σ(n) + φ(n) and assume efficiency is constant
- Isoefficiency relation: T(n,1) ≥ C·T₀(n,p), where C = ε(n,p) / (1 - ε(n,p))

The Scalability Function
- Suppose the isoefficiency relation is n ≥ f(p)
- Let M(n) denote the memory required for a problem of size n
- M(f(p))/p shows how memory usage per processor must increase to maintain the same efficiency
- We call M(f(p))/p the scalability function

Meaning of the Scalability Function
- To maintain efficiency when increasing p, we must increase n
- The maximum problem size is limited by available memory, which is linear in p
- The scalability function shows how memory usage per processor must grow to maintain efficiency
- If the scalability function is a constant, the parallel system is perfectly scalable

Interpreting the Scalability Function
(Plot of memory needed per processor versus number of processors for scalability functions C, C log p, Cp, and Cp log p, with a horizontal line marking the memory size per processor. Curves that stay below that line can maintain efficiency; those that rise above it cannot.)

Example 1: Reduction
- Sequential algorithm complexity: T(n,1) = Θ(n)
- Parallel algorithm:
  - Computational complexity = Θ(n/p)
  - Communication complexity = Θ(log p)
- Parallel overhead: T₀(n,p) = Θ(p log p)

Reduction (continued)
- Isoefficiency relation: n ≥ C p log p
- We ask: to maintain the same level of efficiency, how must n increase when p increases?
- M(n) = n, so the scalability function is M(C p log p)/p = C log p
- The system has good scalability

Example 2: Floyd's Algorithm
- Sequential time complexity: Θ(n³)
- Parallel computation time: Θ(n³/p)
- Parallel communication time: Θ(n² log p)
- Parallel overhead: T₀(n,p) = Θ(p n² log p)

Floyd's Algorithm (continued)
- Isoefficiency relation: n³ ≥ C p n² log p, i.e. n ≥ C p log p
- M(n) = n², so the scalability function is M(C p log p)/p = C² p (log p)²
- The parallel system has poor scalability

Example 3: Finite Difference
- Sequential time complexity per iteration: Θ(n²)
- Parallel communication complexity per iteration: Θ(n/√p)
- Parallel overhead: Θ(n√p)

Finite Difference (continued)
- Isoefficiency relation: n² ≥ C n √p, i.e. n ≥ C√p
- M(n) = n², so the scalability function is M(C√p)/p = C²p/p = C²
- Since the scalability function is a constant, this algorithm is perfectly scalable

Summary (1/3)
- Performance terms
  - Speedup
  - Efficiency
- Model of speedup
  - Serial component
  - Parallel component
  - Communication component

Summary (2/3)
- What prevents linear speedup?
  - Serial operations
  - Communication operations
  - Process start-up
  - Imbalanced workloads
  - Architectural limitations

Summary (3/3)
- Analyzing parallel performance
  - Amdahl's Law
  - Gustafson-Barsis' Law
  - Karp-Flatt metric
  - Isoefficiency metric