Presentation is loading. Please wait.

Presentation is loading. Please wait.

EECC756 - Shaaban #1 lec # 10 Spring2001 4-17-2001 Parallel System Performance: Evaluation & Scalability Factors affecting parallel system performance:

Similar presentations


Presentation on theme: "EECC756 - Shaaban #1 lec # 10 Spring2001 4-17-2001 Parallel System Performance: Evaluation & Scalability Factors affecting parallel system performance:"— Presentation transcript:

1 EECC756 - Shaaban #1 lec # 10 Spring2001 4-17-2001 Parallel System Performance: Evaluation & Scalability Factors affecting parallel system performance: –Algorithm-related, parallel program related, architecture/hardware- related. Workload-Driven Quantitative Architectural Evaluation: –Select applications or suite of benchmarks to evaluate architecture either on real or simulated machine. –From measured performance results compute performance metrics: Speedup, System Efficiency, Redundancy, Utilization, Quality of Parallelism. –Application Models of Parallel Computer Models: How the speedup of an application is affected subject to specific constraints: Fixed-load Model. Fixed-time Model. Fixed-Memory Model. Performance Scalability: –Definition. –Conditions of scalability. –Factors affecting scalability.

2 EECC756 - Shaaban #2 lec # 10 Spring2001 4-17-2001 Factors affecting Parallel System Performance Parallel Algorithm-related: –Available concurrency and profile, grain, uniformity, patterns. –Required communication/synchronization, uniformity and patterns. –Data size requirements. –Communication to computation ratio. Parallel program related: –Programming model used. –Resulting data/code memory requirements, locality and working set characteristics. –Parallel task grain size. –Assignment: Dynamic or static. –Cost of communication/synchronization. Hardware/Architecture related: –Total CPU computational power available. –Shared address space Vs. message passing. –Communication network characteristics. –Memory hierarchy properties.

3 EECC756 - Shaaban #3 lec # 10 Spring2001 4-17-2001 Performance Evaluation of Uniprocessors Design modification decisions made only after quantitative performance evaluation. For existing systems: Comparison of results. For future systems: Careful extrapolation from known quantities Wide base of programs leads to standard benchmarks: –Measured on wide range of machines and successive generations Measurements and technology assessment lead to proposed features. Simulation of new features designs: –Simulator developed that can run with and without a feature –Benchmarks run through the simulator to obtain results –Together with cost and complexity, decisions made.

4 EECC756 - Shaaban #4 lec # 10 Spring2001 4-17-2001 Performance Isolation: Microbenchmarks Microbenchmarks: Small, specially written programs to isolate performance characteristics –Processing. –Local memory. –Input/output. –Communication and remote access (read/write, send/receive) –Synchronization (locks, barriers). –Contention.

5 EECC756 - Shaaban #5 lec # 10 Spring2001 4-17-2001 Types of Workloads/Benchmarks –Kernels: matrix factorization, FFT, depth-first tree search –Complete Applications: ocean simulation, crew scheduling, database. –Multiprogrammed Workloads. Multiprog. ApplsKernels Microbench. Realistic Complex Higher level interactions Are what really matters Easier to understand Controlled Repeatable Basic machine characteristics Each has its place: Use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance

6 EECC756 - Shaaban #6 lec # 10 Spring2001 4-17-2001 Sample Workload/Benchmark Suites Numerical Aerodynamic Simulation (NAS) –Originally pencil and paper benchmarks SPLASH/SPLASH-2 –Shared address space parallel programs ParkBench –Message-passing parallel programs ScaLapack –Message-passing kernels TPC –Transaction processing –SPEC-HPC...

7 EECC756 - Shaaban #7 lec # 10 Spring2001 4-17-2001 Multiprocessor Simulation Simulation runs on a uniprocessor (can be parallelized too) –Simulated processes are interleaved on the processor Two parts to a simulator: –Reference generator: plays role of simulated processors And schedules simulated processes based on simulated time –Simulator of extended memory hierarchy Simulates operations (references, commands) issued by reference generator Coupling or information flow between the two parts varies –Trace-driven simulation: from generator to simulator –Execution-driven simulation: in both directions (more accurate) Simulator keeps track of simulated time and detailed statistics.

8 EECC756 - Shaaban #8 lec # 10 Spring2001 4-17-2001 Execution-Driven Simulation Memory hierarchy simulator returns simulated time information to reference generator, which is used to schedule simulated processes.

9 EECC756 - Shaaban #9 lec # 10 Spring2001 4-17-2001 Parallel Performance Metrics Revisited Degree of Parallelism (DOP): For a given time period, reflects the number of processors in a specific parallel computer actually executing a particular parallel program. Average Parallelism: –Given maximum parallelism = m –n homogeneous processors –Computing capacity of a single processor  –Total amount of work (instructions, computations: or as a discrete summation Where t i is the total time that DOP = i and The average parallelism A: In discrete form

10 EECC756 - Shaaban #10 lec # 10 Spring2001 4-17-2001 Parallel Performance Metrics Revisited Asymptotic Speedup: Execution time with one processor Execution time with an infinite number of available processors Asymptotic speedup S 

11 EECC756 - Shaaban #11 lec # 10 Spring2001 4-17-2001 Parallel Performance Metrics Revisited Arithmetic Mean Performance: Let R i be the execution rates (instructions per second) of m programs or benchmarks i = 1, 2 … m –With equal weighting: –With distribution  = (f i i = 1,2, …, m) Geometric Mean Performance: –With distribution  = (f i i = 1,2, …, m)

12 EECC756 - Shaaban #12 lec # 10 Spring2001 4-17-2001 Harmonic Mean Performance Arithmetic mean execution time per instruction: The harmonic mean execution rate across m benchmark programs: Weighted harmonic mean execution rate with weight distribution  = {f i |i = 1, 2, …, m} Harmonic Mean Speedup for a program with n parallel execution modes:

13 EECC756 - Shaaban #13 lec # 10 Spring2001 4-17-2001 Harmonic Mean Speedup for n Execution Mode Multiprocessor system Fig 3.2 page 111 See handout

14 EECC756 - Shaaban #14 lec # 10 Spring2001 4-17-2001 Efficiency, Utilization, Redundancy, Quality of Parallelism System Efficiency: Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be the execution time in unit time steps: –Speedup factor: S(n) = T(1) /T(n) –System efficiency for an n-processor system: E(n) = S(n)/n = T(1)/[nT(n)] Redundancy: R(n) = O(n)/O(1) Utilization: U(n) = R(n)E(n) = O(n) /[nT(n)] Quality of Parallelism: Q(n) = S(n) E(n) / R(n) = T 3 (1) /[nT 2 (n)O(n)] Parallel Performance Metrics Revisited

15 EECC756 - Shaaban #15 lec # 10 Spring2001 4-17-2001 Parallel Performance Example The execution time T for three parallel programs is given in terms of processor count P and problem size N In each case, we assume that the total computation work performed by an optimal sequential algorithm scales as N+N 2. 1 For first parallel algorithm: T = N + N 2 /P This algorithm partitions the computationally demanding O(N 2 ) component of the algorithm but replicates the O(N) component on every processor. There are no other sources of overhead. 2 For the second parallel algorithm: T = (N+N 2 )/P + 100 This algorithm optimally divides all the computation among all processors but introduces an additional cost of 100. 3 For the third parallel algorithm: T = (N+N 2 )/P + 0.6P 2 This algorithm also partitions all the computation optimally but introduces an additional cost of 0.6P 2. All three algorithms achieve a speedup of about 10.8 when P = 12 and N=100. However, they behave differently in other situations as shown next. With N=100, all three algorithms perform poorly for larger P, although Algorithm (3) does noticeably worse than the other two. When N=1000, Algorithm (2) is much better than Algorithm (1) for larger P.

16 EECC756 - Shaaban #16 lec # 10 Spring2001 4-17-2001 Parallel Performance Example (continued) All algorithms achieve: Speedup = 10.8 when P = 12 and N=100 N=1000, Algorithm (2) performs much better than Algorithm (1) for larger P. Algorithm 1: T = N + N 2 /P Algorithm 2: T = (N+N 2 )/P + 100 Algorithm 3: T = (N+N 2 )/P + 0.6P 2

17 EECC756 - Shaaban #17 lec # 10 Spring2001 4-17-2001 A Parallel Performance measures Example O(1) = T(1) = n 3 O(n) = n 3 + n 2 log 2 n T(n) = 4n 3 /(n+3) Fig 3.4 page 114 Table 3.1 page 115 See handout

18 EECC756 - Shaaban #18 lec # 10 Spring2001 4-17-2001 Parallel Performance Metrics Revisited: Amdahl’s Law Harmonic Mean Speedup (i number of processors used): In the case w = {f i for i = 1, 2,.., n} = ( , 0, 0, …, 1-  ), the system is running sequential code with probability  and utilizing n processors with probability (1-  ) with other processor modes not utilized. Amdahl’s Law: S  1/  as n    Under these conditions the best speedup is upper-bounded by 1/ 

19 EECC756 - Shaaban #19 lec # 10 Spring2001 4-17-2001 Application Models of Parallel Computers If work load W or problem size s is unchanged then: –The efficiency E decreases rapidly as the machine size n increases because the overhead h increases faster than the machine size. The condition of a scalable parallel computer solving a scalable parallel problems exists when: –A desired level of efficiency is maintained by increasing the machine size and problem size proportionally. –In the ideal case the workload curve is a linear function of n: (Linear scalability in problem size). Application Workload Models for Parallel Computers: Bounded by limited memory, limited tolerance to interprocess communication (IPC) latency, or limited I/O bandwidth: –Fixed-load Model: Corresponds to a constant workload. –Fixed-time Model: Constant execution time. –Fixed-memory Model: Limited by the memory bound.

20 EECC756 - Shaaban #20 lec # 10 Spring2001 4-17-2001 The Isoefficiency Concept Workload w as a function of problem size s : w = w(s) h total communication/other overhead, as a function of problem size s and machine size n, h = h(s,n) Efficiency of a parallel algorithm implemented on a given parallel computer can be defined as: Isoefficiency Function: E can be rewritten as: E = 1/(1 + h(s, n)/w(s)). To maintain a constant E, W(s) should grow in proportion to h(s,n) or, C = E/(1-E) is a constant for a fixed efficiency E. The isoefficiency function is defined as follows: If the workload w(s) grows as fast as f E (n) then a constant efficiency can be maintained for the algorithm-architecture combination.

21 EECC756 - Shaaban #21 lec # 10 Spring2001 4-17-2001 Speedup Performance Laws: Fixed-Workload Speedup When DOP = i > n (n = number of processors) Execution time of W i Total execution time Fixed-load speedup factor is defined as the ratio of T(1) to T(n): Let Q(n) be the total system overheads on an n-processor system: The overhead delay Q(n) is both application- and machine-dependent and difficult to obtain in closed form.

22 EECC756 - Shaaban #22 lec # 10 Spring2001 4-17-2001 Amdahl’s Law for Fixed-Load Speedup For the special case where the system either operates in sequential mode (DOP = 1) or a perfect parallel mode (DOP = n), the Fixed-load speedup is simplified to: We assume here that the overhead factor Q(n)= 0 For the normalized case where: The equation is reduced to the previously seen form of Amdahl’s Law:

23 EECC756 - Shaaban #23 lec # 10 Spring2001 4-17-2001 Fixed-Time Speedup To run the largest problem size possible on a larger machine with about the same execution time. Speedup is given by:

24 EECC756 - Shaaban #24 lec # 10 Spring2001 4-17-2001 Gustafson’s Fixed-Time Speedup For the special fixed-time speedup case where DOP can either be 1 or n and assuming Q(n) = 0

25 EECC756 - Shaaban #25 lec # 10 Spring2001 4-17-2001 Fixed-Memory Speedup Let M be the memory requirement of a given problem Let W = g(M) or M = g -1 (W) where The fixed-memory speedup is defined by:

26 EECC756 - Shaaban #26 lec # 10 Spring2001 4-17-2001 Scalability Metrics The study of scalability is concerned with determining the degree of matching between a computer architecture and and an application algorithm and whether this degree of matching continues to hold as problem and machine sizes are scaled up. Basic scalablity metrics affecting the scalability of the system for a given problem: Machine Size n Clock rate f Problem Size s CPU time T I/O Demand d Memory Capacity m Communication overhead h(s, n), where h(s, 1) =0 Computer Cost c Programming Overhead p

27 EECC756 - Shaaban #27 lec # 10 Spring2001 4-17-2001 Parallel Scalability Metrics Scalability of An architecture/algorithm Combination Machine Size Hardware Cost CPU Time I/O Demand Memory Demand Programming Cost Problem Size Communication Overhead

28 EECC756 - Shaaban #28 lec # 10 Spring2001 4-17-2001 Revised Asymptotic Speedup, Efficiency Revised Asymptotic Speedup: –s problem size. –T(s, 1) minimal sequential execution time on a uniprocessor. –T(s, n) minimal parallel execution time on an n-processor system. –h(s, n) lump sum of all communication and other overheads. Revised Asymptotic Efficiency:

29 EECC756 - Shaaban #29 lec # 10 Spring2001 4-17-2001 Parallel System Scalability Scalability (informal restrictive definition): A system architecture is scalable if the system efficiency E(s, n) = 1 for all algorithms with any number of processors and any size problem s Scalability Definition (more formal): The scalability  (s, n) of a machine for a given algorithm is defined as the ratio of the asymptotic speedup S(s,n) on the real machine to the asymptotic speedup S I (s, n) on the ideal realization of an EREW PRAM

30 EECC756 - Shaaban #30 lec # 10 Spring2001 4-17-2001 Example: Scalability of Network Architectures for Parity Calculation Table 3.7 page 142 see handout

31 EECC756 - Shaaban #31 lec # 10 Spring2001 4-17-2001 Programmability Vs. Scalability Increased Scalability Increased Programmability Ideal Parallel Computers Message-passing multicomuter with distributed memory Multiprocessor with shared memory

32 EECC756 - Shaaban #32 lec # 10 Spring2001 4-17-2001 MPPs Scalability Issues Problems: –Memory-access latency. –Interprocess communication complexity or synchronization overhead. –Multi-cache inconsistency. –Message-passing overheads. –Low processor utilization and poor system performance for very large system sizes. Possible Solutions: –Low-latency fast synchronization techniques. –Weaker memory consistency models. –Scalable cache coherence protocols. –To relize shared virtual memory. –Improved software portability; standard parallel and distributed operating system support.

33 EECC756 - Shaaban #33 lec # 10 Spring2001 4-17-2001 Scalable Machines Design Choices: –Custom-designed or commodity nodes? –Capability of node-to-network interface –Supporting programming models? What does hardware scalability mean? –Avoids inherent design limits on resources –Bandwidth increases with machine size P –Latency does not. –Cost increases slowly with P.

34 EECC756 - Shaaban #34 lec # 10 Spring2001 4-17-2001 Limited Scaling of a Bus Bus: each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components CharacteristicBus Physical Length~ 1 ft Number of Connectionsfixed Maximum Bandwidthfixed Interface to Comm. mediummemory inf Global Orderarbitration ProtectionVirt -> physical Trusttotal OSsingle comm. abstractionHW

35 EECC756 - Shaaban #35 lec # 10 Spring2001 4-17-2001 Scaling of Workstations in a LAN? No clear limit to physical scaling, no global order, consensus difficult to achieve. Independent failure and restart. CharacteristicBusLAN Physical Length~ 1 ftKM Number of Connectionsfixedmany Maximum Bandwidthfixed??? Interface to Comm. mediummemory infperipheral Global Orderarbitration??? ProtectionVirt -> physicalOS Trusttotalnone OSsingleindependent comm. abstractionHWSW

36 EECC756 - Shaaban #36 lec # 10 Spring2001 4-17-2001 What fundamentally limits bandwidth? –single set of wires Must have many independent wires Connect modules through switches Bus vs Network Switch. Bandwidth Scalability

37 EECC756 - Shaaban #37 lec # 10 Spring2001 4-17-2001 Dancehall MP Organization Network bandwidth? Bandwidth demand? –independent processes? –communicating processes? Latency?

38 EECC756 - Shaaban #38 lec # 10 Spring2001 4-17-2001 Generic Distributed Memory Organization Network bandwidth? Bandwidth demand? –independent processes? –communicating processes? Latency?

39 EECC756 - Shaaban #39 lec # 10 Spring2001 4-17-2001 Key System Scaling Property Large number of independent communication paths between nodes. => allow a large number of concurrent transactions using different wires. initiated independently. No global arbitration Effect of a transaction only visible to the nodes involved –Effects propagated through additional transactions.

40 EECC756 - Shaaban #40 lec # 10 Spring2001 4-17-2001 Network Latency Scaling Example max distance: log n number of switches:  n log n overhead = 1 us, BW = 64 MB/s, 200 ns per hop Pipelined T 64 (128) = 1.0 us + 2.0 us + 6 hops * 0.2 us/hop = 4.2 us T 1024 (128) = 1.0 us + 2.0 us + 10 hops * 0.2 us/hop = 5.0 us Store and Forward T 64 sf (128) = 1.0 us + 6 hops * (2.0 + 0.2) us/hop = 14.2 us T 64 sf (1024) = 1.0 us + 10 hops * (2.0 + 0.2) us/hop = 23 us

41 EECC756 - Shaaban #41 lec # 10 Spring2001 4-17-2001 Cost Scaling cost(p,m) = fixed cost + incremental cost (p,m) Bus Based SMP? Ratio of processors : memory : network : I/O ? Parallel efficiency(p) = Speedup(P) / P Costup(p) = Cost(p) / Cost(1) Cost-effective: speedup(p) > costup(p) Is super-linear speedup

42 EECC756 - Shaaban #42 lec # 10 Spring2001 4-17-2001 nCUBE/2 Machine Organization Entire machine synchronous at 40 MHz Single-chip node Basic module Hypercube network configuration DRAM interface D M A c h a n n e l s R o u t e r MMU I-Fetch & decode 64-bit integer IEEE floating point Operand $ Execution unit

43 EECC756 - Shaaban #43 lec # 10 Spring2001 4-17-2001 CM-5 Machine Organization


Download ppt "EECC756 - Shaaban #1 lec # 10 Spring2001 4-17-2001 Parallel System Performance: Evaluation & Scalability Factors affecting parallel system performance:"

Similar presentations


Ads by Google