EECC756 - Shaaban, lec # 10, Spring 2001 (4-17-2001): Parallel System Performance: Evaluation & Scalability

EECC756 - Shaaban #1 lec # 10 Spring 2001

Parallel System Performance: Evaluation & Scalability
Factors affecting parallel system performance:
– Algorithm-related, parallel program-related, architecture/hardware-related.
Workload-Driven Quantitative Architectural Evaluation:
– Select applications or a suite of benchmarks to evaluate the architecture on either a real or a simulated machine.
– From measured performance results, compute performance metrics: Speedup, System Efficiency, Redundancy, Utilization, Quality of Parallelism.
Application Models of Parallel Computers: how the speedup of an application is affected subject to specific constraints:
– Fixed-load Model.
– Fixed-time Model.
– Fixed-memory Model.
Performance Scalability:
– Definition.
– Conditions of scalability.
– Factors affecting scalability.

EECC756 - Shaaban #2 lec # 10 Spring 2001

Factors Affecting Parallel System Performance
Parallel algorithm-related:
– Available concurrency and profile, grain size, uniformity, patterns.
– Required communication/synchronization, uniformity and patterns.
– Data size requirements.
– Communication-to-computation ratio.
Parallel program-related:
– Programming model used.
– Resulting data/code memory requirements, locality, and working-set characteristics.
– Parallel task grain size.
– Assignment: dynamic or static.
– Cost of communication/synchronization.
Hardware/architecture-related:
– Total CPU computational power available.
– Shared address space vs. message passing.
– Communication network characteristics.
– Memory hierarchy properties.

EECC756 - Shaaban #3 lec # 10 Spring 2001

Performance Evaluation of Uniprocessors
Design modification decisions are made only after quantitative performance evaluation.
For existing systems: comparison of measured results.
For future systems: careful extrapolation from known quantities.
A wide base of programs leads to standard benchmarks:
– Measured on a wide range of machines and successive generations.
Measurements and technology assessment lead to proposed features.
Simulation of new feature designs:
– A simulator is developed that can run with and without the feature.
– Benchmarks are run through the simulator to obtain results.
– Together with cost and complexity, decisions are made.

EECC756 - Shaaban #4 lec # 10 Spring 2001

Performance Isolation: Microbenchmarks
Microbenchmarks: small, specially written programs that isolate individual performance characteristics:
– Processing.
– Local memory.
– Input/output.
– Communication and remote access (read/write, send/receive).
– Synchronization (locks, barriers).
– Contention.
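
To make the idea concrete, here is a minimal microbenchmark sketch in Python that isolates one characteristic from the list above (local memory access time) by timing dependent loads over arrays of growing size. The sizes, iteration count, and pointer-chasing approach are illustrative assumptions; in CPython the interpreter overhead dominates the absolute numbers, so a C version would expose the memory hierarchy more cleanly.

```python
import random
import time

def memory_latency(size_bytes, iters=200_000):
    """Average time per dependent load, chasing a random permutation."""
    n = size_bytes // 8                       # treat each slot as ~8 bytes
    perm = list(range(n))
    random.shuffle(perm)                      # random order defeats prefetching
    i = 0
    start = time.perf_counter()
    for _ in range(iters):
        i = perm[i]                           # each load depends on the previous one
    return (time.perf_counter() - start) / iters

for size in (16 * 1024, 256 * 1024, 8 * 1024 * 1024):
    print(f"{size // 1024:6d} KB: {memory_latency(size) * 1e9:8.1f} ns/access")
```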

EECC756 - Shaaban #5 lec # 10 Spring 2001

Types of Workloads/Benchmarks
– Kernels: matrix factorization, FFT, depth-first tree search.
– Complete applications: ocean simulation, crew scheduling, database.
– Multiprogrammed workloads.
[Figure: a spectrum from microbenchmarks through kernels and complete applications to multiprogrammed workloads. Toward full applications: realistic, complex, with the higher-level interactions that are what really matter. Toward microbenchmarks: easier to understand, controlled, repeatable, exposing basic machine characteristics.]
Each has its place: use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance.

EECC756 - Shaaban #6 lec # 10 Spring 2001

Sample Workload/Benchmark Suites
– Numerical Aerodynamic Simulation (NAS): originally pencil-and-paper benchmarks.
– SPLASH/SPLASH-2: shared-address-space parallel programs.
– ParkBench: message-passing parallel programs.
– ScaLAPACK: message-passing kernels.
– TPC: transaction processing.
– SPEC-HPC.
...

EECC756 - Shaaban #7 lec # 10 Spring 2001

Multiprocessor Simulation
Simulation runs on a uniprocessor (can be parallelized too):
– Simulated processes are interleaved on the processor.
Two parts to a simulator:
– Reference generator: plays the role of the simulated processors and schedules simulated processes based on simulated time.
– Simulator of the extended memory hierarchy: simulates the operations (references, commands) issued by the reference generator.
Coupling or information flow between the two parts varies:
– Trace-driven simulation: one way, from generator to simulator.
– Execution-driven simulation: in both directions (more accurate).
The simulator keeps track of simulated time and detailed statistics.

EECC756 - Shaaban #8 lec # 10 Spring 2001

Execution-Driven Simulation
The memory hierarchy simulator returns simulated time information to the reference generator, which uses it to schedule the simulated processes.
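
As a sketch of this structure, the toy simulator below has the two parts described above: a reference generator that interleaves simulated processes by simulated time, and a memory hierarchy model whose returned latency feeds back into the scheduling (the execution-driven coupling). The class names, 64-byte block size, and hit/miss latencies are invented for illustration.

```python
import heapq

class MemorySimulator:
    """Extended memory hierarchy model: returns a latency per reference."""
    def __init__(self, hit_time=1, miss_time=50):
        self.cache, self.hit, self.miss = set(), hit_time, miss_time

    def access(self, addr):
        block = addr // 64                    # 64-byte blocks (illustrative)
        if block in self.cache:
            return self.hit
        self.cache.add(block)                 # install block, charge miss latency
        return self.miss

def run(traces):
    """Reference generator: interleaves simulated processes by simulated time.
    Execution-driven coupling: the latency returned by the memory simulator
    determines when each process issues its next reference."""
    mem = MemorySimulator()
    ready = [(0, pid, iter(t)) for pid, t in enumerate(traces)]
    heapq.heapify(ready)
    while ready:
        now, pid, refs = heapq.heappop(ready)
        addr = next(refs, None)
        if addr is not None:                  # reschedule at simulated completion time
            heapq.heappush(ready, (now + mem.access(addr), pid, refs))

run([[0, 64, 0, 128], [4096, 4096, 64]])      # two tiny address traces
```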

EECC756 - Shaaban #9 lec # 10 Spring 2001

Parallel Performance Metrics Revisited
Degree of Parallelism (DOP): for a given time period, reflects the number of processors in a specific parallel computer actually executing a particular parallel program.
Average Parallelism:
– Given maximum parallelism = m
– n homogeneous processors
– Computing capacity of a single processor Δ
– Total amount of work (instructions, computations):
    W = Δ ∫ DOP(t) dt, over the observation period t1 to t2
or as a discrete summation:
    W = Δ Σ (i · t_i), for i = 1 to m
where t_i is the total time that DOP = i, and Σ t_i = t2 − t1.
The average parallelism A:
    A = (1 / (t2 − t1)) ∫ DOP(t) dt, over t1 to t2
In discrete form:
    A = ( Σ i · t_i ) / ( Σ t_i ), for i = 1 to m
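
A small numeric sketch of the discrete definitions above, with an invented DOP profile of (i, t_i) pairs and Δ = 1:

```python
DELTA = 1.0                                   # computing capacity of one processor

profile = [(1, 4.0), (2, 3.0), (4, 2.0), (8, 1.0)]    # illustrative (i, t_i) pairs

W = DELTA * sum(i * t for i, t in profile)            # W = Delta * sum(i * t_i)
A = sum(i * t for i, t in profile) / sum(t for _, t in profile)
print(f"W = {W}, average parallelism A = {A:.2f}")    # A = 2.60 here
```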

EECC756 - Shaaban #10 lec # 10 Spring 2001

Parallel Performance Metrics Revisited
Asymptotic Speedup (with W_i = the work done while DOP = i):
Execution time with one processor:
    T(1) = Σ t_i(1) = Σ W_i / Δ, for i = 1 to m
Execution time with an infinite number of available processors:
    T(∞) = Σ t_i(∞) = Σ W_i / (i·Δ), for i = 1 to m
Asymptotic speedup S∞:
    S∞ = T(1) / T(∞) = ( Σ W_i ) / ( Σ W_i / i )
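
Continuing the sketch, the asymptotic speedup follows directly from a work profile W_i (Δ cancels out); the profile values are again illustrative:

```python
W = {1: 10.0, 2: 20.0, 4: 40.0}               # W_i indexed by DOP i

T1 = sum(W[i] for i in W)                     # T(1): all work serialized
Tinf = sum(W[i] / i for i in W)               # T(inf): DOP-i work runs i-wide
print(f"S_inf = {T1 / Tinf:.2f}")             # sum(W_i) / sum(W_i / i) = 2.33
```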

EECC756 - Shaaban #11 lec # 10 Spring 2001

Parallel Performance Metrics Revisited
Arithmetic Mean Performance: let R_i be the execution rates (instructions per second) of m programs or benchmarks, i = 1, 2, ..., m.
– With equal weighting:
    R_a = (1/m) Σ R_i
– With weight distribution π = {f_i | i = 1, 2, ..., m}:
    R_a* = Σ f_i · R_i
Geometric Mean Performance:
    R_g = ( Π R_i )^(1/m)
– With weight distribution π = {f_i | i = 1, 2, ..., m}:
    R_g* = Π R_i^(f_i)

EECC756 - Shaaban #12 lec # 10 Spring 2001

Harmonic Mean Performance
Arithmetic mean execution time per instruction:
    T_a = (1/m) Σ T_i = (1/m) Σ 1/R_i
The harmonic mean execution rate across m benchmark programs:
    R_h = 1/T_a = m / ( Σ 1/R_i )
Weighted harmonic mean execution rate with weight distribution π = {f_i | i = 1, 2, ..., m}:
    R_h* = 1 / ( Σ f_i / R_i )
Harmonic mean speedup for a program with n parallel execution modes (mode i runs at rate R_i = i):
    S = T_1 / T* = 1 / ( Σ f_i / R_i )
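
The three means are easy to compare side by side. A short sketch with invented rates R_i and weights f_i (the ordering arithmetic >= geometric >= harmonic always holds):

```python
import math

R = [10.0, 20.0, 40.0]                        # execution rates R_i (e.g., MIPS)
f = [0.5, 0.25, 0.25]                         # weight distribution, sums to 1

arith = sum(fi * ri for fi, ri in zip(f, R))            # weighted arithmetic mean
geom = math.prod(ri ** fi for fi, ri in zip(f, R))      # weighted geometric mean
harm = 1.0 / sum(fi / ri for fi, ri in zip(f, R))       # weighted harmonic mean
print(f"{arith:.2f} >= {geom:.2f} >= {harm:.2f}")
```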

EECC756 - Shaaban #13 lec # 10 Spring 2001

Harmonic Mean Speedup for an n-Execution-Mode Multiprocessor System
Fig 3.2, page 111. See handout.

EECC756 - Shaaban #14 lec # 10 Spring 2001

Parallel Performance Metrics Revisited: Efficiency, Utilization, Redundancy, Quality of Parallelism
System Efficiency: let O(n) be the total number of unit operations performed by an n-processor system and T(n) the execution time in unit time steps.
– Speedup factor: S(n) = T(1) / T(n)
– System efficiency for an n-processor system: E(n) = S(n)/n = T(1) / [n T(n)]
Redundancy: R(n) = O(n) / O(1)
Utilization: U(n) = R(n) E(n) = O(n) / [n T(n)]
Quality of Parallelism: Q(n) = S(n) E(n) / R(n) = T³(1) / [n T²(n) O(n)]
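
These five metrics are one-liners once O(n) and T(n) are known. The sketch below encodes the definitions and evaluates them on the n³ workload from the "A Parallel Performance Measures Example" slide later in this lecture (O(1) = T(1) = n³, O(n) = n³ + n²·log₂ n, T(n) = 4n³/(n + 3)):

```python
import math

def metrics(T1, Tn, O1, On, n):
    S = T1 / Tn                               # speedup S(n)
    E = S / n                                 # efficiency E(n)
    R = On / O1                               # redundancy R(n)
    U = R * E                                 # utilization U(n)
    Q = S * E / R                             # quality of parallelism Q(n)
    return S, E, R, U, Q

n = 16
T1 = O1 = n ** 3
On = n ** 3 + n ** 2 * math.log2(n)
Tn = 4 * n ** 3 / (n + 3)
print([f"{x:.3f}" for x in metrics(T1, Tn, O1, On, n)])
```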

EECC756 - Shaaban #15 lec # 10 Spring 2001

Parallel Performance Example
The execution time T for three parallel programs is given in terms of processor count P and problem size N. In each case, we assume that the total computation work performed by an optimal sequential algorithm scales as N + N².
1. For the first parallel algorithm: T = N + N²/P. This algorithm partitions the computationally demanding O(N²) component of the algorithm but replicates the O(N) component on every processor. There are no other sources of overhead.
2. For the second parallel algorithm: T = (N + N²)/P + 100. This algorithm optimally divides all the computation among all processors but introduces an additional cost of 100.
3. For the third parallel algorithm: T = (N + N²)/P + 0.6P². This algorithm also partitions all the computation optimally but introduces an additional cost of 0.6P².
All three algorithms achieve a speedup of about 10.8 when P = 12 and N = 100. However, they behave differently in other situations, as shown next. With N = 100, all three algorithms perform poorly for larger P, although Algorithm (3) does noticeably worse than the other two. When N = 1000, Algorithm (2) is much better than Algorithm (1) for larger P.

EECC756 - Shaaban #16 lec # 10 Spring 2001

Parallel Performance Example (continued)
All algorithms achieve: speedup = 10.8 when P = 12 and N = 100.
When N = 1000, Algorithm (2) performs much better than Algorithm (1) for larger P.
Algorithm 1: T = N + N²/P
Algorithm 2: T = (N + N²)/P + 100
Algorithm 3: T = (N + N²)/P + 0.6P²
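
The example's numbers can be checked directly. Note that the constant 100 in Algorithm 2 is inferred from the stated "all about 10.8 at P = 12, N = 100" condition, since that overhead term is garbled in the transcript:

```python
def speedups(N, P):
    seq = N + N ** 2                          # optimal sequential work
    T1 = N + N ** 2 / P                       # replicates the O(N) component
    T2 = seq / P + 100                        # fixed overhead (inferred constant)
    T3 = seq / P + 0.6 * P ** 2               # overhead growing as P^2
    return [seq / T for T in (T1, T2, T3)]

print([f"{s:.1f}" for s in speedups(100, 12)])     # ['10.8', '10.7', '10.9']
print([f"{s:.1f}" for s in speedups(1000, 128)])   # Algorithm 2 now wins clearly
```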

EECC756 - Shaaban #17 lec # 10 Spring 2001

A Parallel Performance Measures Example
For this example: O(1) = T(1) = n³, O(n) = n³ + n²·log₂ n, T(n) = 4n³/(n + 3).
Fig 3.4, page 114; Table 3.1, page 115. See handout.

EECC756 - Shaaban #18 lec # 10 Spring 2001

Parallel Performance Metrics Revisited: Amdahl's Law
Harmonic mean speedup (i = number of processors used):
    S = 1 / ( Σ f_i / i ), for i = 1 to n
In the case w = {f_i | i = 1, 2, ..., n} = (α, 0, 0, ..., 1 − α), the system is running sequential code with probability α and utilizing n processors with probability (1 − α), with the other processor modes not utilized.
Amdahl's Law:
    S = 1 / (α + (1 − α)/n), so S → 1/α as n → ∞
Under these conditions the best speedup is upper-bounded by 1/α.
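
A two-line sketch of this special case; with sequential fraction α = 0.05 the speedup saturates near 1/α = 20 no matter how many processors are added:

```python
def amdahl(alpha, n):
    return 1.0 / (alpha + (1.0 - alpha) / n)  # S = 1 / (alpha + (1 - alpha)/n)

for n in (4, 16, 64, 1024):
    print(n, f"{amdahl(0.05, n):.2f}")        # 3.48, 9.14, 15.42, 19.64
```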

EECC756 - Shaaban #19 lec # 10 Spring 2001

Application Models of Parallel Computers
If the workload W or problem size s is unchanged, then:
– The efficiency E decreases rapidly as the machine size n increases, because the overhead h increases faster than the machine size.
The condition of a scalable parallel computer solving a scalable parallel problem exists when:
– A desired level of efficiency is maintained by increasing the machine size and problem size proportionally.
– In the ideal case the workload curve is a linear function of n (linear scalability in problem size).
Application workload models for parallel computers, bounded by limited memory, limited tolerance to interprocess communication (IPC) latency, or limited I/O bandwidth:
– Fixed-load model: corresponds to a constant workload.
– Fixed-time model: constant execution time.
– Fixed-memory model: limited by the memory bound.

EECC756 - Shaaban #20 lec # 10 Spring 2001

The Isoefficiency Concept
Workload w as a function of problem size s: w = w(s).
Total communication/other overhead h as a function of problem size s and machine size n: h = h(s, n).
The efficiency of a parallel algorithm implemented on a given parallel computer can be defined as:
    E = w(s) / [w(s) + h(s, n)]
Isoefficiency function: E can be rewritten as E = 1 / (1 + h(s, n)/w(s)). To maintain a constant E, w(s) should grow in proportion to h(s, n):
    w(s) = C · h(s, n), where C = E / (1 − E) is a constant for a fixed efficiency E.
The isoefficiency function is then defined as f_E(n) = C · h(s, n): if the workload w(s) grows as fast as f_E(n), a constant efficiency can be maintained for the algorithm-architecture combination.
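
A sketch of how the isoefficiency function is used: assume an overhead h(s, n) = n·log₂ n (this overhead function is an illustration, not from the lecture) and grow the workload as w = C·h(s, n) with C = E/(1 − E); the efficiency then stays pinned at the target:

```python
import math

def efficiency(w, h):
    return 1.0 / (1.0 + h / w)                # E = 1 / (1 + h(s,n)/w(s))

E_target = 0.8
C = E_target / (1.0 - E_target)               # C = E / (1 - E)
for n in (8, 64, 512):
    h = n * math.log2(n)                      # assumed overhead growth
    w = C * h                                 # isoefficiency-scaled workload
    print(f"n = {n:4d}: w = {w:8.1f}, E = {efficiency(w, h):.2f}")
```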

EECC756 - Shaaban #21 lec # 10 Spring 2001

Speedup Performance Laws: Fixed-Workload Speedup
When DOP = i > n (n = number of processors), the execution time of the work W_i is:
    t_i(n) = (W_i / (i·Δ)) · ⌈i/n⌉
Total execution time:
    T(n) = Σ (W_i / (i·Δ)) · ⌈i/n⌉, for i = 1 to m
The fixed-load speedup factor is defined as the ratio of T(1) to T(n):
    S_n = T(1) / T(n) = ( Σ W_i ) / ( Σ (W_i / i) · ⌈i/n⌉ )
Let Q(n) be the total system overheads on an n-processor system:
    S_n = T(1) / [T(n) + Q(n)] = ( Σ W_i ) / ( Σ (W_i / i) · ⌈i/n⌉ + Q(n) )
The overhead delay Q(n) is both application- and machine-dependent and is difficult to obtain in closed form.

EECC756 - Shaaban #22 lec # 10 Spring 2001

Amdahl's Law for Fixed-Load Speedup
For the special case where the system either operates in sequential mode (DOP = 1) or in a perfect parallel mode (DOP = n), i.e. W_i = 0 for i ≠ 1 and i ≠ n, the fixed-load speedup simplifies to:
    S_n = (W_1 + W_n) / (W_1 + W_n / n)
We assume here that the overhead factor Q(n) = 0.
For the normalized case where W_1 + W_n = α + (1 − α) = 1, with α = W_1, the equation reduces to the previously seen form of Amdahl's Law:
    S_n = 1 / (α + (1 − α)/n)

EECC756 - Shaaban #23 lec # 10 Spring 2001

Fixed-Time Speedup
Goal: run the largest problem size possible on a larger machine with about the same execution time.
Speedup is given by:
    S'_n = T'(1) / T'(n) = ( Σ W'_i ) / ( Σ (W'_i / i) · ⌈i/n⌉ + Q(n) )
where W'_i is the scaled workload at DOP = i, chosen so that the fixed-time constraint T'(n) = T(1) holds, i.e. Σ (W'_i / i) · ⌈i/n⌉ + Q(n) = Σ W_i; hence S'_n = ( Σ W'_i ) / ( Σ W_i ).

EECC756 - Shaaban #24 lec # 10 Spring 2001

Gustafson's Fixed-Time Speedup
For the special fixed-time speedup case where DOP can be either 1 or n, and assuming Q(n) = 0:
The scaled workload is W'_1 = W_1 and W'_n = n·W_n (the parallel part grows to keep the n processors busy for the same time), so:
    S'_n = (W'_1 + W'_n) / (W_1 + W_n) = (W_1 + n·W_n) / (W_1 + W_n)
For the normalized case W_1 + W_n = α + (1 − α) = 1:
    S'_n = α + n(1 − α) = n − α(n − 1)
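
Contrasting the two laws for the same sequential fraction α makes the fixed-time argument vivid: Amdahl's fixed-load speedup saturates at 1/α, while Gustafson's scaled speedup keeps growing almost linearly. A minimal sketch:

```python
def amdahl(alpha, n):                         # fixed-load: saturates at 1/alpha
    return 1.0 / (alpha + (1.0 - alpha) / n)

def gustafson(alpha, n):                      # fixed-time: S'_n = n - alpha*(n - 1)
    return n - alpha * (n - 1)

for n in (16, 256, 1024):
    print(f"n = {n:4d}: Amdahl {amdahl(0.05, n):6.2f}, Gustafson {gustafson(0.05, n):7.1f}")
```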

EECC756 - Shaaban #25 lec # 10 Spring 2001

Fixed-Memory Speedup
Let M be the memory requirement of a given problem, and let W = g(M) or M = g⁻¹(W), where g relates the workload to the memory requirement.
The scaled workload on n nodes (each with memory M) is W'_n = g*(nM) = G(n)·W_n, where G(n) reflects the increase in workload as memory increases n-fold.
The fixed-memory speedup is defined by:
    S*_n = (W_1 + G(n)·W_n) / (W_1 + G(n)·W_n / n)
Special cases: G(n) = 1 gives the fixed-load speedup; G(n) = n gives the fixed-time speedup; G(n) > n gives a scaled speedup higher than fixed-time.
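
A sketch of the fixed-memory speedup as reconstructed above (the formulation is assumed from Hwang's text, on which this lecture is based); the G(n) = 1 and G(n) = n cases recovering Amdahl and Gustafson serve as a sanity check:

```python
def fixed_memory_speedup(W1, Wn, G, n):
    return (W1 + G * Wn) / (W1 + G * Wn / n)  # S*_n with workload scaling G(n)

W1, Wn, n = 0.05, 0.95, 16                    # normalized workload, alpha = 0.05
print(fixed_memory_speedup(W1, Wn, 1, n))     # G = 1: 9.14, matches Amdahl
print(fixed_memory_speedup(W1, Wn, n, n))     # G = n: 15.25, matches Gustafson
print(fixed_memory_speedup(W1, Wn, 2 * n, n)) # G > n: higher than fixed-time
```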

EECC756 - Shaaban #26 lec # 10 Spring 2001

Scalability Metrics
The study of scalability is concerned with determining the degree of matching between a computer architecture and an application algorithm, and whether this degree of matching continues to hold as problem and machine sizes are scaled up.
Basic scalability metrics affecting the scalability of the system for a given problem:
– Machine size n
– Clock rate f
– Problem size s
– CPU time T
– I/O demand d
– Memory capacity m
– Communication overhead h(s, n), where h(s, 1) = 0
– Computer cost c
– Programming overhead p

EECC756 - Shaaban #27 lec # 10 Spring 2001

Parallel Scalability Metrics
[Figure: scalability of an architecture/algorithm combination, influenced by machine size, hardware cost, CPU time, I/O demand, memory demand, programming cost, problem size, and communication overhead.]

EECC756 - Shaaban #28 lec # 10 Spring 2001

Revised Asymptotic Speedup, Efficiency
Revised asymptotic speedup:
    S(s, n) = T(s, 1) / [T(s, n) + h(s, n)]
– s: problem size.
– T(s, 1): minimal sequential execution time on a uniprocessor.
– T(s, n): minimal parallel execution time on an n-processor system.
– h(s, n): lump sum of all communication and other overheads.
Revised asymptotic efficiency:
    E(s, n) = S(s, n) / n

EECC756 - Shaaban #29 lec # 10 Spring 2001

Parallel System Scalability
Scalability (informal, restrictive definition): a system architecture is scalable if the system efficiency E(s, n) = 1 for all algorithms with any number of processors n and any problem size s.
Scalability definition (more formal): the scalability Φ(s, n) of a machine for a given algorithm is defined as the ratio of the asymptotic speedup S(s, n) on the real machine to the asymptotic speedup S_I(s, n) on the ideal realization of an EREW PRAM:
    Φ(s, n) = S(s, n) / S_I(s, n)

EECC756 - Shaaban #30 lec # 10 Spring 2001

Example: Scalability of Network Architectures for Parity Calculation
Table 3.7, page 142. See handout.

EECC756 - Shaaban #31 lec # 10 Spring 2001

Programmability vs. Scalability
[Figure: message-passing multicomputers with distributed memory offer increased scalability; multiprocessors with shared memory offer increased programmability; ideal parallel computers would offer both.]

EECC756 - Shaaban #32 lec # 10 Spring 2001

MPPs Scalability Issues
Problems:
– Memory-access latency.
– Interprocess communication complexity or synchronization overhead.
– Multi-cache inconsistency.
– Message-passing overheads.
– Low processor utilization and poor system performance for very large system sizes.
Possible solutions:
– Low-latency, fast synchronization techniques.
– Weaker memory consistency models.
– Scalable cache coherence protocols.
– Realizing shared virtual memory.
– Improved software portability; standard parallel and distributed operating system support.

EECC756 - Shaaban #33 lec # 10 Spring 2001

Scalable Machines
Design choices:
– Custom-designed or commodity nodes?
– Capability of the node-to-network interface?
– Supported programming models?
What does hardware scalability mean?
– Avoids inherent design limits on resources.
– Bandwidth increases with machine size P.
– Latency does not increase with P.
– Cost increases slowly with P.

EECC756 - Shaaban #34 lec # 10 Spring 2001

Limited Scaling of a Bus
Bus: each level of the system design is grounded in the scaling limits of the layers below and in assumptions of close coupling between components.

Characteristic              Bus
Physical length             ~1 ft
Number of connections       fixed
Maximum bandwidth           fixed
Interface to comm. medium   memory interface
Global order                arbitration
Protection                  virtual -> physical
Trust                       total
OS                          single
Comm. abstraction           HW

EECC756 - Shaaban #35 lec # 10 Spring 2001

Scaling of Workstations in a LAN?
No clear limit to physical scaling, no global order; consensus is difficult to achieve. Independent failure and restart.

Characteristic              Bus                  LAN
Physical length             ~1 ft                km
Number of connections       fixed                many
Maximum bandwidth           fixed                ???
Interface to comm. medium   memory interface     peripheral
Global order                arbitration          ???
Protection                  virtual -> physical  OS
Trust                       total                none
OS                          single               independent
Comm. abstraction           HW                   SW

EECC756 - Shaaban #36 lec # 10 Spring 2001

Bandwidth Scalability
What fundamentally limits bandwidth?
– A single set of wires.
Must have many independent wires:
– Connect modules through switches.
Bus vs. network switch.

EECC756 - Shaaban #37 lec # 10 Spring 2001

Dancehall MP Organization
Network bandwidth? Bandwidth demand?
– Independent processes?
– Communicating processes?
Latency?

EECC756 - Shaaban #38 lec # 10 Spring 2001

Generic Distributed Memory Organization
Network bandwidth? Bandwidth demand?
– Independent processes?
– Communicating processes?
Latency?

EECC756 - Shaaban #39 lec # 10 Spring 2001

Key System Scaling Property
A large number of independent communication paths between nodes:
=> allows a large number of concurrent transactions, using different wires, initiated independently.
No global arbitration.
The effect of a transaction is visible only to the nodes involved:
– Effects are propagated through additional transactions.

EECC756 - Shaaban #40 lec # 10 Spring 2001

Network Latency Scaling Example
Assume: max distance = log n hops (n = 64: 6 hops; n = 1024: 10 hops), number of switches proportional to n log n, overhead = 1 µs, BW = 64 MB/s, 200 ns per hop. A 128-byte message therefore occupies a link for 128 B / (64 MB/s) = 2 µs.
Pipelined:
    T_64(128) = 1.0 µs + 2.0 µs + 6 hops × 0.2 µs/hop = 4.2 µs
    T_1024(128) = 1.0 µs + 2.0 µs + 10 hops × 0.2 µs/hop = 5.0 µs
Store-and-forward:
    T_64^sf(128) = 1.0 µs + 6 hops × (2.0 + 0.2) µs/hop = 14.2 µs
    T_1024^sf(128) = 1.0 µs + 10 hops × (2.0 + 0.2) µs/hop = 23 µs
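
The slide's numbers fall out of two small formulas. This sketch reproduces them exactly, using the stated parameters (1 µs overhead, 64 MB/s channels so a 128-byte message occupies a link for 2 µs, and 0.2 µs per hop):

```python
def pipelined(n_bytes, hops, overhead=1.0, mb_per_s=64, hop_us=0.2):
    transfer = n_bytes / mb_per_s             # in us: bytes / (MB/s) = microseconds
    return overhead + transfer + hops * hop_us

def store_and_forward(n_bytes, hops, overhead=1.0, mb_per_s=64, hop_us=0.2):
    transfer = n_bytes / mb_per_s             # whole message retransmitted per hop
    return overhead + hops * (transfer + hop_us)

print(f"{pipelined(128, 6):.1f} {pipelined(128, 10):.1f}")                  # 4.2 5.0
print(f"{store_and_forward(128, 6):.1f} {store_and_forward(128, 10):.1f}")  # 14.2 23.0
```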

EECC756 - Shaaban #41 lec # 10 Spring 2001

Cost Scaling
cost(p, m) = fixed cost + incremental cost(p, m)
Bus-based SMP?
Ratio of processors : memory : network : I/O?
Parallel efficiency(p) = speedup(p) / p
Costup(p) = cost(p) / cost(1)
Cost-effective: speedup(p) > costup(p)
Is super-linear speedup needed for cost-effectiveness?
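
A sketch of the cost-effectiveness test with an assumed cost model cost(p) = fixed + incremental·p (the dollar figures are illustrative): when the fixed cost dominates, costup grows far more slowly than p, so even sub-linear speedup can satisfy speedup(p) > costup(p) and super-linear speedup is not required.

```python
def costup(p, fixed=100_000.0, incremental=4_000.0):
    return (fixed + incremental * p) / (fixed + incremental)   # cost(p) / cost(1)

def amdahl(alpha, n):
    return 1.0 / (alpha + (1.0 - alpha) / n)  # sub-linear speedup model

for p in (4, 16, 64):
    s, c = amdahl(0.05, p), costup(p)
    verdict = "cost-effective" if s > c else "not cost-effective"
    print(f"p = {p:3d}: speedup {s:5.2f}, costup {c:4.2f} -> {verdict}")
```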

EECC756 - Shaaban #42 lec # 10 Spring 2001

nCUBE/2 Machine Organization
Entire machine synchronous at 40 MHz. Single-chip node; basic module; hypercube network configuration.
[Figure: single-chip node block diagram with MMU, I-fetch & decode, 64-bit integer unit, IEEE floating point, operand cache, execution unit, DRAM interface, DMA channels, and router.]

EECC756 - Shaaban #43 lec # 10 Spring 2001

CM-5 Machine Organization