Slides prepared from the CI-Tutor courses at NCSA by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University, March 2009. Parallel Computing Explained: Parallel Performance Analysis

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
  8.1 Speedup
  8.2 Speedup Extremes
  8.3 Efficiency
  8.4 Amdahl's Law
  8.5 Speedup Limitations
  8.6 Benchmarks
  8.7 Summary
9 About the IBM Regatta P690

Parallel Performance Analysis Now that you have parallelized your code and run it on a parallel computer using multiple processors, you may want to know the performance gain that parallelization has achieved. This chapter describes how to compute parallel code performance. Often the performance gain is not perfect, and this chapter also explains some of the reasons for limitations on parallel performance. Finally, this chapter covers the kinds of information you should provide in a benchmark, and some sample benchmarks are given.

Speedup The speedup of your code tells you how much performance gain is achieved by running your program in parallel on multiple processors. A simple definition is that it is the length of time it takes a program to run on a single processor, divided by the time it takes to run on multiple processors. Speedup generally ranges between 0 and p, where p is the number of processors. Scalability When you compute with multiple processors in a parallel environment, you will also want to know how your code scales. The scalability of a parallel code is defined as its ability to achieve performance proportional to the number of processors used. As you run your code with more and more processors, you want to see the performance of the code continue to improve. Computing speedup is a good way to measure how a program scales as more processors are used.
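As a concrete illustration (not part of the original slides), here is a minimal C sketch of the speedup calculation; the timings in main() are made-up placeholder values standing in for measured wall-clock times.

```c
#include <stdio.h>

/* Speedup: time on one processor divided by time on p processors. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

int main(void)
{
    /* Hypothetical wall-clock times in seconds from two timing runs. */
    double t1 = 100.0;   /* 1 processor  */
    double t8 = 14.0;    /* 8 processors */

    printf("Speedup on 8 processors: %.2f\n", speedup(t1, t8));
    return 0;
}
```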

Speedup Linear Speedup If it takes one processor an amount of time t to do a task and if p processors can do the task in time t / p, then you have perfect or linear speedup (S_p = p). That is, running with 4 processors improves the time by a factor of 4, running with 8 processors improves the time by a factor of 8, and so on. This is shown in the following illustration.

Speedup Extremes The extremes of speedup happen when speedup is greater than p, called super-linear speedup, or less than 1. Super-Linear Speedup You might wonder how super-linear speedup can occur. How can speedup be greater than the number of processors used? The answer usually lies with the program's memory use. When using multiple processors, each processor only gets part of the problem compared to the single processor case. It is possible that the smaller problem can make better use of the memory hierarchy, that is, the cache and the registers. For example, the smaller problem may fit in cache when the entire problem would not. When super-linear speedup is achieved, it is often an indication that the sequential code, run on one processor, had serious cache miss problems. The most common programs that achieve super-linear speedup are those that solve dense linear algebra problems.

Speedup Extremes Parallel Code Slower than Sequential Code When speedup is less than one, it means that the parallel code runs slower than the sequential code. This happens when there isn't enough computation to be done by each processor. The overhead of creating and controlling the parallel threads outweighs the benefits of parallel computation, and it causes the code to run slower. To eliminate this problem you can try to increase the problem size or run with fewer processors.

Efficiency Efficiency is a measure of parallel performance that is closely related to speedup and is often also presented in a description of the performance of a parallel program. Efficiency with p processors is defined as the ratio of speedup with p processors to p. Efficiency is a fraction that usually ranges between 0 and 1. E_p = 1 corresponds to perfect speedup of S_p = p. You can think of efficiency as describing the average speedup per processor.
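Continuing the same hypothetical timings used in the Speedup sketch above, a minimal C sketch of the efficiency calculation:

```c
#include <stdio.h>

/* Efficiency: speedup with p processors divided by p.
 * E_p = 1.0 corresponds to perfect (linear) speedup, S_p = p. */
double efficiency(double t_serial, double t_parallel, int p)
{
    double speedup = t_serial / t_parallel;
    return speedup / (double)p;
}

int main(void)
{
    /* Hypothetical timings: 100 s serial, 14 s on 8 processors.
     * Speedup is about 7.14, so efficiency is about 0.89. */
    printf("Efficiency on 8 processors: %.2f\n", efficiency(100.0, 14.0, 8));
    return 0;
}
```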

Amdahl's Law An alternative formula for speedup is Amdahl's Law, attributed to Gene Amdahl, one of America's great computer scientists. This formula, introduced in 1967, states that no matter how many processors are used in a parallel run, a program's speedup will be limited by its fraction of sequential code. That is, almost every program has a fraction of the code that doesn't lend itself to parallelism. This is the fraction of code that will have to be run with just one processor, even in a parallel run. Amdahl's Law defines speedup with p processors as S_p = 1 / (f + (1 - f) / p), where the term f stands for the fraction of operations done sequentially with just one processor, and the term (1 - f) stands for the fraction of operations done in perfect parallelism with p processors.

Amdahl's Law The sequential fraction of code, f, is a unitless measure ranging between 0 and 1. When f is 0, meaning there is no sequential code, then speedup is p, or perfect parallelism. This can be seen by substituting f = 0 in the formula above, which results in S_p = p. When f is 1, meaning there is no parallel code, then speedup is 1, or there is no benefit from parallelism. This can be seen by substituting f = 1 in the formula above, which results in S_p = 1. This shows that Amdahl's speedup ranges between 1 and p, where p is the number of processors used in a parallel processing run.

Amdahl's Law The interpretation of Amdahl's Law is that speedup is limited by the fact that not all parts of a code can be run in parallel. Substituting in the formula, as the number of processors goes to infinity, your code's speedup is still limited by 1 / f. Amdahl's Law shows that the sequential fraction of code has a strong effect on speedup. This helps to explain the need for large problem sizes when using parallel computers. It is well known in the parallel computing community that you cannot take a small application and expect it to show good performance on a parallel computer. To get good performance, you need to run large applications, with large data array sizes and lots of computation. The reason is that as the problem size increases, the opportunity for parallelism grows and the sequential fraction shrinks, so it matters less and less for speedup.
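To make the 1 / f ceiling concrete, here is a small C sketch of Amdahl's formula; the 5% sequential fraction is an arbitrary example, not a measurement from any particular code.

```c
#include <stdio.h>

/* Amdahl's Law: S_p = 1 / (f + (1 - f) / p), where f is the
 * sequential fraction of the code and p is the processor count. */
double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / (double)p);
}

int main(void)
{
    int procs[] = {1, 4, 16, 64, 256, 1024};
    double f = 0.05;   /* example: 5% of the work is sequential */

    for (int i = 0; i < 6; ++i)
        printf("p = %4d  S_p = %6.2f\n", procs[i], amdahl_speedup(f, procs[i]));

    /* As p grows, S_p approaches 1 / f = 20, no matter how many
     * processors are added. */
    return 0;
}
```

Substituting f = 0 or f = 1 in the same function reproduces the limits from the previous slide: S_p = p and S_p = 1, respectively.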

Agenda
8 Parallel Performance Analysis
  8.1 Speedup
  8.2 Speedup Extremes
  8.3 Efficiency
  8.4 Amdahl's Law
  8.5 Speedup Limitations
    Memory Contention Limitation
    Problem Size Limitation
  8.6 Benchmarks
  8.7 Summary

Speedup Limitations This section covers some of the reasons why a program doesn't get perfect speedup. Some of the reasons for limitations on speedup are:
Too much I/O: Speedup is limited when the code is I/O bound, that is, when there is too much input or output compared to the amount of computation.
Wrong algorithm: Speedup is limited when the numerical algorithm is not suitable for a parallel computer. You need to replace it with a parallel algorithm.
Too much memory contention: Speedup is limited when there is too much memory contention. You need to redesign the code with attention to data locality. Cache reutilization techniques will help here.

Speedup Limitations
Wrong problem size: Speedup is limited when the problem size is too small to take best advantage of a parallel computer. In addition, speedup is limited when the problem size is fixed, that is, when the problem size doesn't grow as you compute with more processors.
Too much sequential code: Speedup is limited when there's too much sequential code. This is shown by Amdahl's Law.
Too much parallel overhead: Speedup is limited when there is too much parallel overhead compared to the amount of computation. These are the additional CPU cycles accumulated in creating parallel regions, creating threads, synchronizing threads, spin/blocking threads, and ending parallel regions.
Load imbalance: Speedup is limited when the processors have different workloads. The processors that finish early will be idle while they are waiting for the other processors to catch up.

Memory Contention Limitation Gene Golub, a professor of Computer Science at Stanford University, writes in his book on parallel computing that the best way to define memory contention is with the word delay. When different processors all want to read or write into the main memory, there is a delay until the memory is free. On the SGI Origin2000 computer, you can determine whether your code has memory contention problems by using SGI's perfex utility. The perfex utility is covered in the Cache Tuning lecture in this course. You can also refer to SGI's manual page, man perfex, for more details. On the Linux clusters, you can use the hardware performance counter tools to get information on memory performance. On the IA32 platform, use perfex, vprof, hmpcount, psrun/perfsuite. On the IA64 platform, use vprof, pfmon, psrun/perfsuite.

Memory Contention Limitation Many of these tools can be used with the PAPI performance counter interface. Be sure to refer to the man pages and webpages on the NCSA website for more information. If the output of the utility shows that memory contention is a problem, you will want to use some programming techniques for reducing memory contention. A good way to reduce memory contention is to access elements from the processor's cache memory instead of the main memory. Some programming techniques for doing this are (see the C sketch after this slide):
Access arrays with unit stride.
Order nested do loops (in Fortran) so that the innermost loop index is the leftmost index of the arrays in the loop. For the C language, the order is the opposite of Fortran.
Avoid specific array sizes that are the same as the size of the data cache or that are exact fractions or exact multiples of the size of the data cache.
Pad common blocks.
These techniques are called cache tuning optimizations. The details for performing these code modifications are covered in the section on Cache Optimization of this lecture.
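As a concrete illustration of the loop-ordering advice (a minimal sketch, not taken from the original course material): C stores arrays row by row, so the innermost loop should vary the rightmost index to get unit stride; Fortran stores arrays column by column, so the opposite ordering is the cache-friendly one there.

```c
#define N 1024
double a[N][N], b[N][N];

/* Cache-friendly in C: the innermost loop walks the rightmost index,
 * so consecutive iterations touch adjacent memory (unit stride). */
void copy_good(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = b[i][j];
}

/* Cache-unfriendly in C: the innermost loop walks the leftmost index,
 * striding N elements between accesses and defeating the cache. */
void copy_bad(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[j][i] = b[j][i];
}
```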

Problem Size Limitation Small Problem Size Speedup is almost always an increasing function of problem size. If there's not enough work to be done by the available processors, the code will show limited speedup. The effect of small problem size on speedup is shown in the following illustration.

Problem Size Limitation Fixed Problem Size When the problem size is fixed, you can reach a point of negative returns when using additional processors. As you compute with more and more processors, each processor has less and less computation to perform. The additional parallel overhead, compared to the amount of computation, causes the speedup curve to start turning downward as shown in the following figure.

Benchmarks It will finally be time to report the parallel performance of your application code. You will want to show a speedup graph with the number of processors on the x axis, and speedup on the y axis. Some other things you should report and record are:
the date you obtained the results
the problem size
the computer model
the compiler and the version number of the compiler
any special compiler options you used

Benchmarks When doing computational science, it is often helpful to find out what kind of performance your colleagues are obtaining. In this regard, NCSA has a compilation of parallel performance benchmarks online. You might be interested in looking at these benchmarks to see how other people report their parallel performance. In particular, the NAMD benchmark is a report about the performance of the NAMD program, which does molecular dynamics simulations.

Summary There are many good texts on parallel computing which treat the subject of parallel performance analysis. Here are two useful references:
Scientific Computing: An Introduction with Parallel Computing, Gene Golub and James Ortega, Academic Press, Inc.
Parallel Computing: Theory and Practice, Michael J. Quinn, McGraw-Hill, Inc.