
Workload-Driven Architectural Evaluation (II)

2 Outline

3 Questions in Scaling

Under what constraints to scale the application?
What are the appropriate metrics for performance improvement?
  Work is not fixed any more, so time alone is not enough
How should the application be scaled?

Definitions:
  Scaling a machine: can scale power in many ways
    Assume adding identical nodes, each bringing memory
  Problem size: vector of input parameters, e.g. N = (n, q, Δt)
    Determines work done
    Distinct from data set size and memory usage
    Start by assuming it is only one parameter, n, for simplicity

4 Under What Constraints to Scale?

Two types of constraints:
  User-oriented, e.g. particles, rows, transactions, I/Os per processor
  Resource-oriented, e.g. memory, time
Which is more appropriate depends on the application domain
  User-oriented is easier for the user to think about and change
  Resource-oriented is more general, and often more realistic
Resource-oriented scaling models:
  Problem constrained (PC)
  Memory constrained (MC)
  Time constrained (TC)
Growth under MC and TC may be hard to predict

5 Problem Constrained Scaling

User wants to solve the same problem, only faster
  Video compression
  Computer graphics
  VLSI routing
But limited when evaluating larger machines

  Speedup_PC(p) = Time(1) / Time(p)

6 Time Constrained Scaling

Execution time is kept fixed as the system scales
  User has fixed time to use machine or wait for result
Performance = Work/Time as usual, and time is fixed, so

  Speedup_TC(p) = Work(p) / Work(1)

How to measure work?
  Execution time on a single processor? (thrashing problems)
  Should be easy to measure, ideally analytical and intuitive
  Should scale linearly with sequential complexity
    Or ideal speedup will not be linear in p (e.g. no. of rows in matrix program)
  If cannot find an intuitive application measure, as is often true, measure execution time with an ideal memory system on a uniprocessor

7 Memory Constrained Scaling

Scaled speedup: Time(1) / Time(p) for the scaled-up problem
  Hard to measure Time(1), and inappropriate

  Speedup_MC(p) = [Work(p)/Time(p)] x [Time(1)/Work(1)] = Increase in Work / Increase in Time

Can lead to large increases in execution time
  If work grows faster than linearly in memory usage
  e.g. matrix factorization
    10,000-by-10,000 matrix takes 800MB and 1 hour on a uniprocessor
    With 1,024 processors, can run a 320K-by-320K matrix, but ideal parallel time grows to 32 hours!
    With 10,240 processors, 100 hours...
Time constrained seems to be the most generally viable model
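The matrix factorization numbers above can be reproduced from the slide's assumptions alone: work grows as O(n^3) flops and memory as O(n^2) elements. A minimal sketch (function names are illustrative, not from the slides):

```python
def mc_scaled_size(n1, p):
    """Memory-constrained scaling: total memory grows as p, so n grows as sqrt(p)."""
    return n1 * p ** 0.5

def ideal_parallel_time(n, p, t1=1.0, n1=10_000):
    """Ideal parallel time for O(n^3) work, normalized to t1 hours at (n1, p=1)."""
    return (n / n1) ** 3 / p * t1

n_mc = mc_scaled_size(10_000, 1024)     # 320,000: the 320K-by-320K matrix
t_mc = ideal_parallel_time(n_mc, 1024)  # 32.0 hours, up from 1 hour
```

Time grows as sqrt(p) under MC here because work grows as p^1.5 while processors grow only as p.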

8 Impact of Scaling Models: Grid Solver

MC scaling:
  Grid size = n√p-by-n√p
  Iterations to converge = n√p
  Work = O((n√p)^3)
  Ideal parallel execution time = O((n√p)^3 / p) = n^3 √p
    Grows by √p: 1 hr on uniprocessor means 32 hr on 1024 processors
TC scaling:
  If scaled grid size is k-by-k, then k^3 / p = n^3, so k = n ∛p
  Memory needed per processor = k^2 / p = n^2 / ∛p
    Diminishes as cube root of number of processors
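The grid solver scaling math above can be checked numerically (a sketch; a k-by-k grid does O(k^3) work, and times are in units where the n-by-n uniprocessor run costs n^3):

```python
def mc_grid(n, p):
    """Memory-constrained: per-processor memory fixed, so grid side grows as sqrt(p)."""
    k = n * p ** 0.5
    return k, k ** 3 / p            # scaled grid side, ideal parallel time

def tc_grid(n, p):
    """Time-constrained: k^3 / p = n^3 gives k = n * p**(1/3)."""
    k = n * p ** (1 / 3)
    return k, k ** 2 / p            # scaled grid side, memory per processor

_, t_mc = mc_grid(1.0, 1024)        # 32.0: MC time grows by sqrt(p)
_, m_tc = tc_grid(1.0, 1024)        # memory/processor = 1 / p**(1/3)
```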

9 Impact on Solver Execution Characteristics

Concurrency: PC: fixed; MC: grows as p; TC: grows as p^0.67
Communication-to-computation ratio: PC: grows as √p; MC: fixed; TC: grows as ⁶√p
Working set: PC: shrinks as p; MC: fixed; TC: shrinks as ∛p
Spatial locality? PC: decreases quickly; MC: fixed; TC: decreases less quickly
Message size in message passing? A border row or column of a partition

Expect speedups to be best under MC and worst under PC
Should evaluate under all three models, unless some are unrealistic

10 Scaling Workload Parameters: Barnes-Hut

Different parameters should be scaled relative to one another to meet the chosen constraint
  Number of bodies (n)
  Time-step resolution (Δt)
  Force-calculation accuracy (θ)
Scaling rule: all parameters should scale at the same rate
Result: if n scales by a factor of s, Δt and θ must both scale by a factor of 1/⁴√s

11 Performance and Scaling Summary

Performance improvement due to parallelism measured by speedup
Scaling models are fundamental to proper evaluation
  Scaling constraints affect growth rates of key execution properties
Time constrained scaling is a realistic method for many applications
Should scale workload parameters appropriately with one another too
  Scaling only data set size can yield misleading results
Proper scaling requires understanding the workload

12 Outline

13 Evaluating a Real Machine

Performance isolation using microbenchmarks
Choosing workloads
Evaluating a fixed-size machine
Varying machine size
Metrics

All these issues, plus more, are relevant to evaluating a tradeoff via simulation

14 Performance Isolation: Microbenchmarks

Microbenchmarks: small, specially written programs to isolate performance characteristics
  Processing
  Local memory
  Input/output
  Communication and remote access (read/write, send/receive)
  Synchronization (locks, barriers)
  Contention

Example memory microbenchmark (the slide showed its results on the CRAY T3D):
  for times = 0 to 10,000 do
    for i = 0 to ArraySize-1 by stride do
      load A[i]
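The stride loop above can be sketched as follows. The real microbenchmark is written in C against raw memory; a Python bytearray stands in here, so the absolute numbers are illustrative only and mainly show the measurement structure:

```python
import time

def stride_benchmark(array_size, stride, repeats=3):
    """Average seconds per access when walking an array at a given stride."""
    a = bytearray(array_size)
    sink = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for i in range(0, array_size, stride):
            sink += a[i]                  # the "load A[i]" from the slide
    elapsed = time.perf_counter() - start
    accesses = repeats * len(range(0, array_size, stride))
    return elapsed / accesses

t_unit = stride_benchmark(1 << 14, 1)     # unit stride
t_wide = stride_benchmark(1 << 14, 64)    # one access per 64 bytes
```

In C, plotting access time against stride and array size exposes cache line size, cache capacity, and memory latency as knees in the curve.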

15 Evaluation using Realistic Workloads

Must navigate three major axes:
  Workloads
  Problem sizes
  Number of processors (one measure of machine size)
(Other machine parameters are fixed)
Focus first on a fixed number of processors

16 Types of Workloads

Kernels: matrix factorization, FFT, depth-first tree search
Complete applications: ocean simulation, crew scheduling, database
Multiprogrammed workloads

These form a spectrum from microbenchmarks through kernels and complete applications to multiprogrammed workloads:
  Toward applications: realistic, complex, higher-level interactions, which are what really matter
  Toward microbenchmarks: easier to understand, controlled, repeatable, expose basic machine characteristics
Each has its place: use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance

17 Desirable Properties of Workloads

Representativeness of application domains
Coverage of behavioral properties
Adequate concurrency

18 Representativeness

Should adequately represent domains of interest, e.g.:
  Scientific: physics, chemistry, biology, weather...
  Engineering: CAD, circuit analysis...
  Graphics: rendering, radiosity...
  Information management: databases, transaction processing, decision support...
  Optimization
  Artificial intelligence: robotics, expert systems...
  Multiprogrammed general-purpose workloads
  System software: e.g. the operating system

19 Coverage: Stressing Features

Easy to mislead with workloads
  Choose those with features for which the machine is good, avoid others
Some features of interest:
  Compute vs. memory vs. communication vs. I/O bound
  Working set size and spatial locality
  Local memory and communication bandwidth needs
  Importance of communication latency
  Fine-grained or coarse-grained
    Data access, communication, task size
  Synchronization patterns and granularity
  Contention
  Communication patterns
Choose workloads that cover a range of properties

20 Coverage: Levels of Optimization

Many ways in which an application can be suboptimal
  Algorithmic, e.g. assignment, blocking
  Data structuring, e.g. 2-d or 4-d arrays for SAS grid problem
  Data layout, distribution and alignment, even if properly structured
  Orchestration
    contention
    long versus short messages
    synchronization frequency and cost, ...
Optimizing applications takes work
  Many practical applications may not be very well optimized
May examine selected different levels to test robustness of system

21 Concurrency

Should have enough to utilize the processors
Algorithmic speedup: useful measure of concurrency/imbalance
  Speedup (under scaling model) assuming all memory/communication operations take zero time
  Ignores memory system; measures imbalance and extra work
  Uses PRAM machine model (Parallel Random Access Machine)
    Unrealistic, but widely used for theoretical algorithm development
At least, should isolate performance limitations due to program characteristics that a machine cannot do much about (concurrency) from those that it can.
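The algorithmic-speedup idea can be illustrated with a toy example (hypothetical, not from the slides): summing n numbers on p processors with a tree reduction, under the PRAM assumption that memory and communication are free, so only operation counts and load imbalance remain:

```python
import math

def algorithmic_speedup(n, p):
    """PRAM-style speedup for a parallel sum: local sums plus a combine tree."""
    t_seq = n - 1                                           # sequential additions
    t_par = math.ceil(n / p) - 1 + math.ceil(math.log2(p))  # local phase + log p combine
    return t_seq / t_par

s_big   = algorithmic_speedup(1 << 20, 64)   # near 64: ample concurrency
s_small = algorithmic_speedup(64, 64)        # well below 64: too little work per processor
```

Even with zero-cost memory, s_small stays far from p, isolating a concurrency limit that no machine can fix.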

22 Workload/Benchmark Suites

Numerical Aerodynamic Simulation (NAS)
  Originally pencil-and-paper benchmarks
SPLASH/SPLASH-2
  Shared address space parallel programs
ParkBench
  Message-passing parallel programs
ScaLapack
  Message-passing kernels
TPC
  Transaction processing
SPEC-HPC
...

23 Evaluating a Fixed-size Machine

Many critical characteristics depend on problem size
  (Having fixed the workload and the machine size)
Inherent application characteristics
  Concurrency and load balance (generally improve with problem size)
  Communication-to-computation ratio (generally improves)
  Working sets and spatial locality (generally worsen and improve, respectively)
Interactions with machine organizational parameters
Nature of the major bottleneck: communication, imbalance, local access...
Insufficient to use a single problem size
Need to choose problem sizes appropriately
  Understanding of workloads will help
Examine step by step using the grid solver
  Assume 64 processors with 1MB cache and 64MB memory each

24 Steps in Choosing Problem Sizes

1. Determine range of useful sizes
  May know that users care only about a few problem sizes, but not generally applicable
  Below which problems are unrealistically small for the machine at hand
  Above which execution time or memory usage is too large
2. Use understanding of inherent characteristics
  Communication-to-computation ratio, load balance...
  For grid solver, perhaps at least 32-by-32 points per processor
    64 processors: 256 x 256 grid overall; communication is 4 x 32 = 128 grid points per process

25 Steps in Choosing Problem Sizes (cont'd)

  Computation: 32 x 32; c-to-c ratio: (128 x 8 bytes) / (32 x 32 x 5 floating point operations)
    40MB/s c-to-c bandwidth with a 200MFLOPS processor
  No need to go below 5MB/s (larger than a 256-by-256 subgrid per processor) from this perspective, or a 2K-by-2K grid overall
  So assume we choose 256-by-256, 1K-by-1K and 2K-by-2K so far
Variation of characteristics with problem size is usually smooth for inherent communication and load balance
  So, pick some sizes along the range
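The bandwidth arithmetic above can be reproduced directly: each solver iteration communicates the four borders of a sub-by-sub partition (8-byte values) and does about 5 flops per point. A minimal sketch:

```python
def comm_bw_needed(sub, flops_rate=200e6, elem_bytes=8, flops_per_pt=5):
    """Communication bandwidth (bytes/s) implied by a sub-by-sub partition."""
    comm_bytes = 4 * sub * elem_bytes      # four border rows/columns per iteration
    flops = sub * sub * flops_per_pt       # computation per iteration
    return comm_bytes / flops * flops_rate

bw_32  = comm_bw_needed(32)     # 40 MB/s for 32x32 partitions
bw_256 = comm_bw_needed(256)    # 5 MB/s for 256x256 partitions
```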

26 Steps in Choosing Problem Sizes (cont'd)

Interactions of locality with architecture often have thresholds (knees)
  Greatly affect characteristics like local traffic, artifactual communication
  May require problem sizes to be added to ensure both sides of a knee are captured
  But also help prune the design space
3. Use temporal locality and working sets
  Fitting or not dramatically changes local traffic and artifactual communication
  E.g. Raytrace working sets are nonlocal, Ocean's are local

27 Steps in Choosing Problem Sizes (cont'd)

Choose problem sizes on both sides of a knee if realistic
  Also try to pick one very large size (exercises TLB misses etc.)
Solver: first working set (2 subrows) usually fits; second (full partition) may or may not
  Doesn't for the largest (2K), so add a 4K-by-4K grid
  Add 16K as the large size, so grid sizes are now 256, 1K, 2K, 4K, 16K (in each dimension)

[Figure: (a) miss ratio vs. cache size, showing working-set knees; (b) % of working set that fits in a cache of size C vs. problem size, for several problems]

28 Steps in Choosing Problem Sizes (cont'd)

4. Use spatial locality and granularity interactions
  E.g., in grid solver, can we distribute data at page granularity in SAS?
    Affects whether cache misses are satisfied locally or cause communication
  With 2D array representation and a 4KB page: grid sizes 512, 1K, 2K no; 4K, 16K yes
  For 4D array representation: yes, except for very small problems
So no need to expand choices for this reason
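The 2D-array claim above can be checked under stated assumptions (64 processors in an 8x8 decomposition, 8-byte elements, 4KB pages; these parameters are taken from the earlier slides). With a row-major 2D array, a partition's subrow is the unit of contiguity, so per-processor page placement works only if one subrow spans at least a page:

```python
PAGE = 4096                                   # assumed page size in bytes

def subrow_bytes(grid_side, procs_per_dim=8, elem_bytes=8):
    """Contiguous bytes in one subrow of a partition of a row-major 2D array."""
    return grid_side // procs_per_dim * elem_bytes

ok = {g: subrow_bytes(g) >= PAGE for g in (512, 1024, 2048, 4096, 16384)}
# ok == {512: False, 1024: False, 2048: False, 4096: True, 16384: True}
```

A 4D array makes each processor's whole partition contiguous, which is why it passes at almost any size.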

29 Steps in Choosing Problem Sizes (cont'd)

More stark example: false sharing in radix sort
  Becomes a problem when cache line size exceeds n/(r*p) for radix r
Many applications don't display strong dependence (e.g. Barnes-Hut, irregular access patterns)
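The radix sort threshold can be sketched as follows: during the permutation phase, each processor writes an average of n/(r*p) contiguous keys per digit bucket, so false sharing appears once a cache line holds more keys than that. Line and key sizes here are illustrative assumptions, not from the slides:

```python
def false_sharing_risk(n, radix, p, line_bytes=64, key_bytes=8):
    """True if a cache line spans more keys than the average per-bucket run, n/(r*p)."""
    keys_per_line = line_bytes // key_bytes
    return keys_per_line > n / (radix * p)

small_n = false_sharing_risk(n=1 << 16, radix=256, p=64)  # 4 keys per bucket < 8 per line
large_n = false_sharing_risk(n=1 << 24, radix=256, p=64)  # ~1024 keys per bucket
```

This is a problem-size dependence with a sharp threshold, which is exactly why such interactions should steer the choice of problem sizes.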