Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Supercomputers – David Bailey (1991) Eileen Kraemer August 25, 2002

1. Quote 32-bit performance results, not 64-bit results. 32-bit arithmetic is generally faster, but 64-bit arithmetic is often needed for the kinds of applications run on supercomputers.
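A minimal sketch of the difference (assuming NumPy is available; the vector length, repetition count, and the axpy-style kernel are illustrative choices, not from Bailey's paper). The same kernel is timed in 32-bit and 64-bit precision, and the 32-bit run will usually report the higher rate:

```python
import time
import numpy as np

def timed_axpy(dtype, n=10_000_000, reps=10):
    # Time y = 2*x + y, which performs 2 floating-point operations per element.
    x = np.ones(n, dtype=dtype)
    y = np.ones(n, dtype=dtype)
    t0 = time.perf_counter()
    for _ in range(reps):
        y = 2.0 * x + y
    elapsed = time.perf_counter() - t0
    return 2 * n * reps / elapsed / 1e6   # MFLOPS

print("float32 MFLOPS:", round(timed_axpy(np.float32)))
print("float64 MFLOPS:", round(timed_axpy(np.float64)))
```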

2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application. Although the application typically spends a good deal of its time in the inner kernel, the kernel tends to exhibit greater parallelism than the application as a whole. Representing the kernel's speedup as the speedup of the overall application is therefore misleading. See: Amdahl's Law.
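Amdahl's Law makes the gap concrete. A small worked example (the kernel speedup and the time fractions below are illustrative, not taken from the paper): if a fraction p of the run time is in the kernel and the kernel speeds up by a factor s, the whole application speeds up by only 1 / ((1 - p) + p / s).

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run time speeds up by s."""
    return 1.0 / ((1.0 - p) + p / s)

kernel_speedup = 100            # what the kernel alone achieves (illustrative)
for p in (0.5, 0.8, 0.95):      # fraction of run time spent in the kernel
    print(f"kernel fraction {p:.2f}: overall speedup "
          f"{amdahl_speedup(p, kernel_speedup):.1f}x "
          f"(kernel alone: {kernel_speedup}x)")
```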

3. Quietly employ assembly code and other low-level language constructs. The compiler for a parallel supercomputer may not take full advantage of the hardware of the system. Using assembly code or other low-level constructs will permit better use of the underlying hardware. However, the use of such low-level constructs should be reported when providing performance results.

4. Scale up the problem size with the number of processors, but omit any mention of this fact. For a fixed problem size, the benefit of each additional processor drops off as overhead grows relative to the amount of useful computation, so speedup is less than linear. Scaling up the problem as you add processors improves the ratio of useful work to overhead. Failing to state how you measured speedup is misleading.
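For contrast, a sketch of fixed-size speedup (Amdahl) versus scaled speedup in the Gustafson sense, with an assumed serial fraction; quoting the scaled numbers without saying the problem grew with the processor count is exactly this trick:

```python
def fixed_size_speedup(f, n):
    """Amdahl: problem size held constant; f is the serial fraction."""
    return 1.0 / (f + (1.0 - f) / n)

def scaled_speedup(f, n):
    """Gustafson: problem size grows with n; f is the serial fraction."""
    return f + (1.0 - f) * n

f = 0.05                          # assumed serial fraction (illustrative)
for n in (16, 64, 256, 1024):
    print(f"{n:5d} procs: fixed-size {fixed_size_speedup(f, n):7.1f}x,"
          f"  scaled {scaled_speedup(f, n):8.1f}x")
```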

5. Quote performance results projected to a full system. Such projections assume that performance scales linearly with system size, which is rarely true.
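A toy illustration of why a linear projection flatters the result (all constants are invented, and the communication term is a simple log-depth model, not a claim about any real machine):

```python
import math

def run_time(n, work=1000.0, comm_per_level=0.5):
    # Toy model: perfectly divisible work plus a tree-structured
    # communication term that grows as log2(n).
    return work / n + comm_per_level * math.log2(n)

measured_n, full_n = 64, 1024
linear_projection = run_time(measured_n) * measured_n / full_n
print(f"measured at {measured_n} procs:        {run_time(measured_n):6.2f} s")
print(f"linear projection to {full_n} procs: {linear_projection:6.2f} s")
print(f"toy model at {full_n} procs:         {run_time(full_n):6.2f} s")
```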

6. Compare your results against scalar, unoptimized code on Crays. You should compare your parallel code to the best known serial implementation, and more generally to the best implementation on whatever architecture you are comparing against, not to a naïve or deliberately poor version.
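With invented run times, the choice of baseline changes the story dramatically:

```python
parallel_time         = 10.0   # seconds, your parallel code (made up)
best_serial_time      = 60.0   # best tuned serial implementation (made up)
unoptimized_cray_time = 600.0  # scalar, untuned code on the Cray (made up)

print("honest speedup:    ", best_serial_time / parallel_time)        # 6x
print("flattering speedup:", unoptimized_cray_time / parallel_time)   # 60x
```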

7. When direct run time comparisons are required, compare with an old code on an obsolete system. The same principle applies: compare against the best current code on current hardware.

8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best serial implementation. The parallel implementation typically performs more operations than the best serial algorithm (and, run on a single processor, it is typically slower than the serial version because of the added overhead), so using its operation count inflates the quoted rate.
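An illustration with made-up counts: the quoted MFLOPS rate depends entirely on which operation count you divide by.

```python
serial_ops   = 1.0e9   # flops performed by the best serial algorithm (made up)
parallel_ops = 1.6e9   # flops performed by the parallel algorithm, including
                       # redundant work (made up)
run_time_s   = 4.0     # measured parallel run time in seconds (made up)

print("MFLOPS (serial op count):  ", serial_ops   / run_time_s / 1e6)  # 250
print("MFLOPS (parallel op count):", parallel_ops / run_time_s / 1e6)  # 400
```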

9. Quote performance in terms of processor utilization, parallel speedups, or MFLOPS per dollar. Run time or MFLOPS, though likely more informative, don't make your codes look quite so impressive.
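A made-up example: a large quoted speedup and high processor utilization can coexist with a run time that is worse than the tuned code on the comparison machine.

```python
procs           = 256
parallel_time   = 45.0     # seconds on 256 processors (made up)
one_proc_time   = 9000.0   # same parallel code on 1 processor (made up)
tuned_cray_time = 30.0     # best serial/vector code on the comparison machine (made up)

speedup     = one_proc_time / parallel_time   # 200x
utilization = speedup / procs                 # ~78%
print(f"quoted speedup: {speedup:.0f}x, utilization {utilization:.0%}")
print(f"but run time: {parallel_time:.0f} s vs {tuned_cray_time:.0f} s "
      f"for the tuned serial code")
```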

10. Mutilate the algorithm used in the parallel implementation to match the architecture. For example, switch to an algorithm that performs more floating-point operations in order to report a higher MFLOPS rate, even though the run time gets longer.

11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment. Again, you should be comparing "your best" to "their best".

12. If all else fails, show pretty pictures and animated videos, and don't talk about performance. … you get the idea.