CSE 260 – Introduction to Parallel Computation
Class 5: Parallel Performance
October 4, 2001

Speed vs. Cost
How do you reconcile the fact that you may have to pay a LOT for a small speed gain? Typically, the fastest computer is much more expensive per FLOP than a slightly slower one. And if you're in no hurry, you can probably do the computation for free.

Speed vs. Cost
How do you reconcile the fact that you may have to pay a LOT for a small speed gain? Answer from economics: one needs to consider the "time value of answers". For each situation, one can imagine two curves, the value of a solution and the cost of a solution, each plotted against the time of solution.
[Figure: value of solution and cost of solution ($) vs. time of solution, annotated with conference deadlines and the point of maximum "utility".]

Speed vs. Cost (2)
But this information is difficult to determine; users often don't know the value of early answers. So one of two answers is usually considered:
– Raw speed: choose the fastest solution, independent of cost.
– Cost-performance: choose the most economical solution, independent of speed.
Example: the Gordon Bell prizes are awarded yearly for the fastest entry and for the best cost-performance.

Measuring Performance
The following should be self-evident:
– The best way to compare two algorithms is to measure the execution time (or cost) of well-tuned implementations on the same computer. Ideally, the computer you plan to use.
– The best way to compare two computers is to measure the execution time (or cost) of the best algorithm on each one. Ideally, for the problem you want to solve.

How to lie with performance results
The best way to compare two algorithms is to measure their execution time (or cost) on the same computer. Instead:
– Measure their FLOP/sec rate.
– See which one "scales" better.
– Implement one of them carelessly.
– Run them on different computers and try to "normalize".
The best way to compare two computers is to measure the execution time (or cost) of the best algorithm on each. Instead:
– Look only at theoretical peak speed.
– Use the same algorithm on both computers, even though it's not well suited to one of them.

Speed-Up
Speedup: S(p) = (execution time on one CPU) / (execution time on p processors)
Speedup is a measure of how well a program "scales" as you increase the number of processors. We need more information to define speedup:
– What problem size? Do we use the same problem for all values of p?
– What serial algorithm and program should we use for the numerator?
– Can we use different algorithms for the numerator and the denominator?
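A minimal sketch (not from the slides) of how speedup would be computed from measured wall-clock times; the timing values below are made up for illustration:

    #include <stdio.h>

    /* Speedup from measured wall-clock times.  t1 and tp would come from
       timing the same problem on 1 and on p processors (e.g. with
       MPI_Wtime); the values used here are illustrative only. */
    double speedup(double t1, double tp) { return t1 / tp; }

    int main(void) {
        double t1  = 120.0;   /* seconds on one CPU (made up) */
        double t16 = 9.2;     /* seconds on p = 16 processors (made up) */
        printf("S(16) = %.2f\n", speedup(t1, t16));
        return 0;
    }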

Common Definitions*
Speedup = fixed-size problem. "Scaleup" or scaling = the problem size grows as p increases.
Choice of serial algorithm:
1. The serial algorithm is the parallel algorithm with "p = 1". (It may even include code that initializes message buffers, etc.)
2. The serial time is that of the fastest known serial algorithm running on one processor of the parallel machine.
Choice 1 is OK for getting a warm fuzzy feeling, but it doesn't mean it's a good algorithm, and it doesn't mean it's worth running the job in parallel. Choice 2 is much better.
* Warning: these terms are used loosely. Check their meaning by reading (or writing) the paper carefully.

What is "good" speedup or scaling?
Hopefully, S(p) > 1.
– Linear speedup or scaling: S(p) = p. The parallel program is considered perfectly scalable.
– Superlinear speedup: S(p) > p. This actually can happen! Why?

What is the maximum possible speedup?
Let f = the fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
– loop initialization
– reading/writing to a single disk
– procedure call overhead
Amdahl's law gives a limit on speedup in terms of f: S(p) ≤ p / (1 + (p – 1) f).

Example of Amdahl's Law
Suppose that a calculation has a 4% serial portion. What is the limit on speedup with 16 processors?
16 / (1 + (16 – 1) × 0.04) = 10
What is the maximum speedup, no matter how many processors you use?
1 / 0.04 = 25
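A small C sketch (not part of the original slides) that reproduces these numbers using Amdahl's law in the form S(p) = p / (1 + (p – 1) f):

    #include <stdio.h>

    /* Amdahl's law in the form used on the slide:
       S(p) = p / (1 + (p - 1) * f), where f is the serial fraction. */
    double amdahl(double f, int p) {
        return p / (1.0 + (p - 1) * f);
    }

    int main(void) {
        double f = 0.04;                                /* 4% serial portion */
        printf("S(16)       = %.1f\n", amdahl(f, 16));  /* prints 10.0 */
        printf("upper limit = %.1f\n", 1.0 / f);        /* prints 25.0 */
        return 0;
    }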

Variants of Speedup: Efficiency
Parallel efficiency: E(p) = S(p)/p × 100%.
Efficiency is the fraction of the total potential processing power that is actually used.
– A program with linear speedup is 100% efficient.
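As an illustration (not from the slides), the sketch below tabulates E(p) for speedups predicted by Amdahl's law with the 4% serial fraction used in the earlier example:

    #include <stdio.h>

    /* E(p) = S(p)/p * 100%, tabulated for Amdahl speedups with f = 0.04. */
    int main(void) {
        double f = 0.04;
        for (int p = 1; p <= 64; p *= 2) {
            double s = p / (1.0 + (p - 1) * f);   /* Amdahl speedup */
            double e = s / p * 100.0;             /* efficiency in percent */
            printf("p = %2d   S = %6.2f   E = %5.1f%%\n", p, s, e);
        }
        return 0;
    }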

Important (but subtle) point!
Speedup is defined for a fixed problem size.
– As p gets arbitrarily large, speedup must reach a limit (Ts/Tp < Ts/clock period).
– This doesn't reflect how big computers are used: people run bigger problems on bigger computers.

How should we increase problem size?
Several choices:
– Keep the amount of "real work" per processor constant.
– Keep the size of the problem per processor constant.
– Keep the efficiency constant (Vipin Kumar's "isoefficiency"): analyze how the problem size must grow.

Speedup vs. Scalability
Gustafson* questioned Amdahl's law's assumption that the proportion of a program doing serial computation (f) remains the same over all problem sizes.
– Example: suppose the serial part includes O(N) initialization for an O(N^3) algorithm. Then initialization takes a smaller fraction of the time as the problem size grows, so f may become smaller as the problem size grows larger.
– Conversely, the cost of setting up communication buffers to all (p – 1) other processors may increase f (particularly if the work per processor is fixed).
*Gustafson: FixedTime/FixedTime.html
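A small sketch of the O(N)/O(N^3) example above: with illustrative cost coefficients a and b (not given in the slides), the serial fraction f(N) = aN / (aN + bN^3) shrinks as N grows:

    #include <stdio.h>

    /* Serial fraction of an algorithm with O(N) serial setup inside an
       O(N^3) computation.  The coefficients a and b are assumed values,
       chosen only to illustrate the trend. */
    int main(void) {
        double a = 100.0;  /* cost per element of the serial initialization (assumed) */
        double b = 1.0;    /* cost per inner-loop operation (assumed) */
        for (double N = 10; N <= 10000; N *= 10) {
            double serial = a * N;
            double total  = a * N + b * N * N * N;
            printf("N = %7.0f   f = %.6f\n", N, serial / total);
        }
        return 0;
    }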

Speedup vs. Scalability
An interesting consequence of increasing the problem size: there is no bound on scaleup as N goes to infinity; scaled speedup can be superlinear! This doesn't happen often in practice.

Isoefficiency
Kumar argues that, for many algorithms, there is a problem size that keeps efficiency constant (e.g. 50%).
– Make the problem small enough and it's sure to be inefficient.
– Good parallel algorithms get more efficient as problems get larger: the serial fraction decreases. The only problem is whether there's enough local memory.

Isoefficiency
The "isoefficiency" of an algorithm is the function N(p) that tells how much you must increase the problem size to keep efficiency constant as you increase p (the number of processors).
– A small isoefficiency function (e.g. N(p) = p lg p) means it's easy to keep the parallel computer working well.
– A large isoefficiency function (e.g. N(p) = p^3) indicates that the algorithm doesn't scale up very well.
– The isoefficiency function doesn't exist for unscalable methods.

Example of isoefficiency
Tree-based algorithm to add a list of N numbers:
– Give each processor N/p numbers to add.
– Arrange the processors in a tree to communicate results.
The parallel time is T(N,p) = N/p + c lg(p), where T(N,p) is the time for a problem of size N on p processors, c is the time to send one number, and lg is log base 2.
The efficiency is N/(p T(N,p)) = N/(N + c p lg(p)) = 1/(1 + c p lg(p)/N).
For isoefficiency, we must keep c p lg(p)/N constant; that is, we need N(p) = c p lg(p). This is excellent isoefficiency: the per-processor problem size increases only with lg(p).
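A small sketch (not in the slides) that evaluates this model numerically: choosing N(p) proportional to p lg(p) holds the modeled efficiency constant. The constant c is an assumed communication-to-computation cost ratio:

    #include <stdio.h>
    #include <math.h>

    /* Model from the slide: T(N,p) = N/p + c*lg(p),
       E(N,p) = N / (N + c*p*lg(p)).  c is an assumed cost of sending one
       number, measured in units of one addition. */
    double efficiency(double N, int p, double c) {
        return N / (N + c * p * log2((double)p));
    }

    int main(void) {
        double c = 100.0;   /* illustrative communication/computation ratio */
        for (int p = 2; p <= 1024; p *= 4) {
            double N = 2.0 * c * p * log2((double)p);  /* N(p) proportional to p*lg(p) */
            printf("p = %4d   N = %10.0f   E = %.3f\n", p, N, efficiency(N, p, c));
        }
        return 0;   /* E stays at 2/3 for every p */
    }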

More problems with scalability
For many problems, it's not obvious how to increase the size of the problem. Example: sparse matrix multiplication. Do we
– keep the number of nonzeros per row constant?
– keep the fraction of nonzeros per row constant?
– keep the fraction of nonzeros per matrix constant?
– keep the matrices square?
The NAS Parallel Benchmarks made arbitrary decisions (class S, A, B, and C problems).

Scalability analysis can be misused!
Example: sparse matrix-vector product y = Ax. Suppose A has k non-zeros per row and column. Processors only need to store non-zeros; the vectors are dense.
– Each processor "owns" one block of A in a p^½ × p^½ block decomposition, plus (1/p)-th of x and y.
– Each processor must receive the relevant portions of x from their owners, compute its local product, and send the resulting portions of y back to their owners.
– We'll keep the number of non-zeros per processor constant, so the work per processor is constant and the storage per processor is constant.

A scalable y = Ax algorithm
Each number is sent in a separate message. When A is large, the non-zeros owned by a processor will be in distinct columns, so the processor needs to receive M messages (M = its memory size). Thus the communication time per processor is constant, even as p grows arbitrarily large.
Actually, this is a terrible algorithm! Using a separate message per non-zero is very expensive. There are much more efficient algorithms (for reasonable problem sizes), but they aren't "perfectly scalable". Cryptic details are in Alpern & Carter, "Is Scalability Relevant?", 1995 SIAM Conference on Parallel Processing for Scientific Computing.
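A rough cost model (my own sketch, not from the slides or the paper) of why one message per non-zero is so expensive compared with aggregating the same words into fewer messages; alpha, beta, M, and k are assumed, illustrative parameters:

    #include <stdio.h>

    /* Communication cost per processor for gathering M words of x.
       alpha = per-message latency, beta = per-word transfer time.
       All parameter values are assumed for illustration. */
    int main(void) {
        double alpha = 1e-5;    /* 10 microseconds per message (assumed) */
        double beta  = 1e-8;    /* 10 ns per word (assumed) */
        long   M     = 100000;  /* words of x needed per processor (assumed) */
        long   k     = 1000;    /* distinct owners those words come from (assumed) */

        double t_per_word   = M * (alpha + beta);     /* one message per non-zero */
        double t_aggregated = k * alpha + M * beta;   /* one message per owner */

        printf("one message per non-zero: %.4f s\n", t_per_word);
        printf("aggregated messages:      %.4f s\n", t_aggregated);
        return 0;
    }

Both variants cost a constant amount per processor as p grows, but the per-non-zero version pays M message latencies instead of k, which is what makes it "perfectly scalable" yet terrible in practice.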

Little's Law
Transmission time × bits per time unit = bits in flight.
Translation by Burton Smith: latency (in cycles) × instructions per cycle = concurrency.
Note: the concurrent instructions must be independent.
Relevance to parallel computing: coarse-grained computation needs lots of independent chunks!
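A tiny illustration (numbers are assumed, not from the slides) of Burton Smith's form of the law:

    #include <stdio.h>

    /* Little's law as stated on the slide: latency (in cycles) times
       instructions per cycle equals the number of independent
       instructions that must be in flight.  The values are assumed. */
    int main(void) {
        double latency_cycles = 200.0;   /* e.g. a memory access (assumed) */
        double issue_rate     = 4.0;     /* instructions per cycle (assumed) */
        printf("required concurrency = %.0f independent instructions\n",
               latency_cycles * issue_rate);
        return 0;
    }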