ECE 669 Parallel Computer Architecture, Lecture 9: Workload Evaluation (February 26, 2004)



Outline
- Evaluation of applications is important
- Simulation of sample data sets provides important information
- Working sets indicate grain size
- Preliminary results offer an opportunity for tuning
- Understanding communication costs
- Remember: software and communication!

Workload-Driven Evaluation
- Evaluating real machines
- Evaluating an architectural idea or trade-offs
  => need good metrics of performance
  => need to pick good workloads
  => need to pay attention to scaling
- Many factors involved
- Today: a narrow architectural comparison, set in a wider context

Evaluation in Uniprocessors
- Decisions are made only after quantitative evaluation
- For existing systems: comparison and procurement evaluation
- For future systems: careful extrapolation from known quantities
- A wide base of programs leads to standard benchmarks
  - Measured on a wide range of machines and successive generations
- Measurements and technology assessment lead to proposed features
- Then simulation:
  - A simulator is developed that can run with and without the feature
  - Benchmarks are run through the simulator to obtain results
  - Together with cost and complexity, decisions are made

More Difficult for Multiprocessors
- What is a representative workload?
- The software model has not stabilized
- Many architectural and application degrees of freedom
  - Huge design space: number of processors, other architectural and application parameters
  - The impact of these parameters and their interactions can be huge
  - High cost of communication
- What are the appropriate metrics?
- Simulation is expensive
  - Realistic configurations and sensitivity analysis are difficult
  - Larger design space, but more difficult to cover
- Understanding parallel programs as workloads is critical
  - Particularly the interaction of application and architectural parameters

A Lot Depends on Sizes
- Application parameters and the number of processors affect inherent properties
  - Load balance, communication, extra work, temporal and spatial locality
- Interactions with organization parameters of the extended memory hierarchy affect communication and performance
- Effects are often dramatic, sometimes small: application-dependent
  - Understanding size interactions and scaling relationships is key
[Figure: size interactions for Ocean and Barnes-Hut]

Scaling: Why Worry?
- Fixed problem size is limited
- Too small a problem:
  - May be appropriate for a small machine
  - Parallelism overheads begin to dominate benefits for larger machines
    - Load imbalance
    - Communication-to-computation ratio
  - May even achieve slowdowns
  - Doesn't reflect real usage, and is inappropriate for large machines
    - Can exaggerate benefits of architectural improvements, especially when measured as percentage improvement in performance
- Too large a problem:
  - Difficult to measure improvement (next)

Too Large a Problem
- Suppose the problem is realistically large for the big machine
- It may not "fit" on a small machine
  - Can't run
  - Thrashing to disk
  - Working set doesn't fit in cache
- Fits at some p, leading to superlinear speedup
- A real effect, but it doesn't help evaluate effectiveness
- Finally, users want to scale problems as machines grow
  - This can help avoid these problems

Demonstrating Scaling Problems
- Small Ocean and big equation solver problems on the SGI Origin2000

Communication and Replication
- View a parallel machine as an extended memory hierarchy
  - Local cache, local memory, remote memory
  - Classify "misses" in the "cache" at any level as for uniprocessors:
    - compulsory or cold misses (no size effect)
    - capacity misses (size effect)
    - conflict or collision misses (size effect)
    - communication or coherence misses (no size effect)
- Communication induced by finite capacity is the most fundamental artifact
  - Like cache size and miss rate or memory traffic in uniprocessors
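The cold/capacity distinction above can be illustrated with a toy model (a sketch, not from the lecture): a fully associative LRU cache, where a miss is cold if the block was never referenced before and a capacity miss if it was referenced but has since been evicted.

```python
from collections import OrderedDict

def classify_misses(trace, capacity):
    """Classify each miss in a block-address trace as cold or capacity,
    using a fully associative LRU cache of `capacity` blocks.
    (Conflict misses need a set-associative model, and coherence misses
    need a multiprocessor trace - both omitted in this sketch.)"""
    cache = OrderedDict()          # block -> None, kept in LRU order
    seen = set()                   # blocks referenced at least once
    stats = {"hit": 0, "cold": 0, "capacity": 0}
    for block in trace:
        if block in cache:
            cache.move_to_end(block)
            stats["hit"] += 1
        else:
            stats["cold" if block not in seen else "capacity"] += 1
            seen.add(block)
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the LRU block
    return stats

# Sweeping 4 blocks twice through a 2-block cache: the second sweep
# misses entirely, and those misses are capacity, not cold.
print(classify_misses([0, 1, 2, 3, 0, 1, 2, 3], capacity=2))
# -> {'hit': 0, 'cold': 4, 'capacity': 4}
```

Shrinking or growing `capacity` moves misses between the "hit" and "capacity" buckets while the cold count stays fixed, which is exactly the "size effect" distinction in the slide.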

Working Set Perspective
- Hierarchy of working sets
- At the first-level cache (fully associative, one-word block), inherent to the algorithm
  - the working-set curve for the program
- Traffic from any type of miss can be local or nonlocal (communication)
- At a given level of the hierarchy (to the next further one):
[Figure: data traffic vs. replication capacity (cache size), with knees at the first and second working sets; components: cold-start (compulsory) traffic, capacity-generated traffic (including conflicts), other capacity-independent communication, and inherent communication]


Questions in Scaling
- Scaling a machine: power can be scaled in many ways
  - Assume adding identical nodes, each bringing memory
- Problem size: a vector of input parameters, e.g. N = (n, θ, Δt)
  - Determines the work done
  - Distinct from data set size and memory usage
- Under what constraints should the application be scaled?
  - What are the appropriate metrics for performance improvement?
    - Work is no longer fixed, so time alone is not enough
- How should the application be scaled?

Under What Constraints to Scale?
- Two types of constraints:
  - User-oriented, e.g. particles, rows, transactions, I/Os per processor
  - Resource-oriented, e.g. memory, time
- Which is more appropriate depends on the application domain
  - User-oriented is easier for the user to think about and change
  - Resource-oriented is more general, and often more realistic
- Resource-oriented scaling models:
  - Problem constrained (PC)
  - Memory constrained (MC)
  - Time constrained (TC)

Problem Constrained Scaling
- The user wants to solve the same problem, only faster
  - Video compression
  - Computer graphics
  - VLSI routing
- But limited when evaluating larger machines

  Speedup_PC(p) = Time(1) / Time(p)
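Why PC scaling is "limited when evaluating larger machines" follows directly from Amdahl's law: with the problem fixed, any serial fraction caps Time(1)/Time(p). A minimal sketch, where the serial fraction is an assumed input rather than a measured one:

```python
def speedup_pc(p, serial_fraction):
    """Amdahl-style problem-constrained speedup, Time(1)/Time(p),
    when a fraction of the (fixed) work cannot be parallelized."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / p)

# With just 5% serial work, even 1000 processors stay under 20x:
for p in (10, 100, 1000):
    print(p, round(speedup_pc(p, 0.05), 2))
```

The diminishing returns at large p are exactly why larger machines need scaled workloads to show their benefit.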

Time Constrained Scaling
- Execution time is kept fixed as the system scales
  - The user has a fixed time to use the machine or wait for the result
- Performance = Work/Time as usual, and time is fixed, so

  Speedup_TC(p) = Work(p) / Work(1)

- How to measure work?
  - Execution time on a single processor? (thrashing problems)
  - Should be easy to measure, ideally analytical and intuitive
  - Should scale linearly with sequential complexity
    - Otherwise ideal speedup will not be linear in p (e.g. number of rows in a matrix program)
  - If an intuitive application measure cannot be found, as is often true, measure execution time with an ideal memory system on a uniprocessor
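The classic closed form for TC scaling is Gustafson's law (not named on the slide, but it is the standard result): with the time budget fixed, the parallel portion of the work grows with p, so Work(p)/Work(1) grows nearly linearly even when a serial fraction exists. A sketch, with the serial fraction taken as an assumed input measured on the scaled run:

```python
def speedup_tc(p, serial_fraction):
    """Gustafson-style time-constrained speedup: with a fixed time
    budget, the parallel part of the work scales with p, giving
    Work(p)/Work(1) = s + (1 - s) * p."""
    s = serial_fraction
    return s + (1.0 - s) * p

# The same 5% serial fraction that capped PC speedup below 20x
# still allows near-linear scaled speedup under TC scaling:
for p in (10, 100, 1000):
    print(p, round(speedup_tc(p, 0.05), 2))
```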

Memory Constrained Scaling
- Scale so that memory usage per processor stays fixed
- Scaled speedup: Time(1) / Time(p) for the scaled-up problem
  - Time(1) is hard to measure, and inappropriate

  Speedup_MC(p) = (Work(p) / Work(1)) x (Time(1) / Time(p)) = Increase in Work / Increase in Time

- Can lead to large increases in execution time
  - If work grows faster than linearly in memory usage, e.g. matrix factorization:
    - A 10,000-by-10,000 matrix takes 800 MB and 1 hour on a uniprocessor
    - With 1,000 processors, a 320K-by-320K matrix can be run, but ideal parallel time grows to 32 hours!
    - With 10,000 processors, 100 hours...
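The factorization numbers above can be checked directly from the cost model they assume: memory grows as n^2 (8-byte words) and dense-factorization work as O(n^3), so giving the problem p times the memory scales n by sqrt(p), work by p^1.5, and ideal parallel time by sqrt(p).

```python
import math

def mc_scaling(n1, hours1, p, bytes_per_word=8):
    """Memory-constrained scaling of an O(n^3)-work, O(n^2)-memory
    dense factorization: p times the memory lets n grow by sqrt(p),
    so work grows by p**1.5 and ideal parallel time by sqrt(p)."""
    n_p = n1 * math.sqrt(p)               # matrix dimension at p processors
    work_ratio = (n_p / n1) ** 3          # = p ** 1.5
    time_p = hours1 * work_ratio / p      # = hours1 * sqrt(p)
    mem_gb = n_p ** 2 * bytes_per_word / 1e9
    return n_p, time_p, mem_gb

n_p, time_p, mem_gb = mc_scaling(n1=10_000, hours1=1.0, p=1000)
print(round(n_p), round(time_p, 1), round(mem_gb))
# -> 316228 31.6 800  (the slide's "320K" matrix and ~32 hours)
```

With p = 10,000 the same model gives sqrt(10000) = 100 hours, matching the slide's last bullet.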

Scaling Summary
- Under any scaling rule, the relative structure of the problem changes with p
  - PC scaling: the per-processor portion gets smaller
  - MC and TC scaling: the total problem gets larger
- Need to understand hardware/software interactions with scale
- For a given problem, there is often a natural scaling rule
  - Example: equal-error scaling

Types of Workloads
- Kernels: matrix factorization, FFT, depth-first tree search
- Complete applications: ocean simulation, crew scheduling, database
- Multiprogrammed workloads

[Spectrum: Microbenchmarks -> Kernels -> Applications -> Multiprogrammed workloads. Toward full applications: realistic, complex, higher-level interactions are what really matter. Toward microbenchmarks: easier to understand, controlled, repeatable, basic machine characteristics.]

Each has its place: use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance.

Coverage: Stressing Features
- It is easy to mislead with workloads
  - Choose those with features for which the machine is good, avoid others
- Some features of interest:
  - Compute- vs. memory- vs. communication- vs. I/O-bound
  - Working set size and spatial locality
  - Local memory and communication bandwidth needs
  - Importance of communication latency
  - Fine-grained or coarse-grained
    - Data access, communication, task size
  - Synchronization patterns and granularity
  - Contention
  - Communication patterns
- Choose workloads that cover a range of properties

Coverage: Levels of Optimization
- Many ways in which an application can be suboptimal
  - Algorithmic, e.g. assignment, blocking
  - Data structuring, e.g. 2-d or 4-d arrays for the SAS grid problem
  - Data layout, distribution and alignment, even if properly structured
  - Orchestration
    - contention
    - long versus short messages
    - synchronization frequency and cost, ...
  - Also, random problems with "unimportant" data structures
- Optimizing applications takes work
  - Many practical applications may not be very well optimized

Concurrency
- Should have enough concurrency to utilize the processors
  - If load imbalance dominates, there may not be much the machine can do
  - (Still, it is useful to know what kinds of workloads/configurations don't have enough concurrency)
- Algorithmic speedup: a useful measure of concurrency/imbalance
  - Speedup (under the scaling model) assuming all memory/communication operations take zero time
  - Ignores the memory system; measures imbalance and extra work
  - Uses the PRAM machine model (Parallel Random Access Machine)
    - Unrealistic, but widely used for theoretical algorithm development
- At the least, performance limitations due to program characteristics that a machine cannot do much about (concurrency) should be isolated from those that it can address.
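Under the zero-cost PRAM-like assumption above, algorithmic speedup reduces to counting work: sequential time is total work, parallel time is the busiest processor's work. A minimal sketch (the task costs and assignment are hypothetical inputs):

```python
def algorithmic_speedup(task_costs, assignment, p):
    """Speedup with zero-cost communication and memory (PRAM-like):
    sequential time = total work; parallel time = maximum work on
    any one processor, so only imbalance limits the result."""
    per_proc = [0] * p
    for cost, proc in zip(task_costs, assignment):
        per_proc[proc] += cost
    return sum(task_costs) / max(per_proc)

# Eight unit tasks on four processors: a balanced assignment gives
# ideal speedup 4; piling three tasks onto one processor drops it.
balanced = [0, 1, 2, 3, 0, 1, 2, 3]
skewed = [0, 0, 0, 1, 1, 2, 2, 3]
print(algorithmic_speedup([1] * 8, balanced, 4))  # -> 4.0
print(algorithmic_speedup([1] * 8, skewed, 4))    # -> ~2.67
```

Comparing this number against measured speedup isolates imbalance and extra work from memory-system and communication effects, which is the separation the slide argues for.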

Steps in Choosing Problem Sizes
- Variation of characteristics with problem size is usually smooth
  - So, for inherent communication and load balance, pick some sizes along the range
- Interactions of locality with architecture often have thresholds (knees)
  - They greatly affect characteristics like local traffic and artifactual communication
  - May require problem sizes to be added
    - to ensure both sides of a knee are captured
  - But knees also help prune the design space

Our Cache Sizes (16 x 1 MB, 16 x 64 KB)

Multiprocessor Simulation
- Simulation runs on a uniprocessor (it can be parallelized too)
  - Simulated processes are interleaved on the processor
- Two parts to a simulator:
  - Reference generator: plays the role of the simulated processors
    - And schedules simulated processes based on simulated time
  - Simulator of the extended memory hierarchy
    - Simulates operations (references, commands) issued by the reference generator
- Coupling or information flow between the two parts varies:
  - Trace-driven simulation: from generator to simulator only
  - Execution-driven simulation: in both directions (more accurate)
- The simulator keeps track of simulated time and detailed statistics
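The two-part structure can be sketched as a tiny trace-driven simulator: a reference generator interleaves per-process address traces round-robin, and a memory-hierarchy model consumes each reference, with information flowing only from generator to model. Everything here (direct-mapped per-processor caches, round-robin interleaving, 64-byte lines) is an illustrative assumption, not the lecture's simulator.

```python
def simulate(traces, cache_lines, line_bytes=64):
    """Trace-driven simulation sketch: interleave per-process address
    traces round-robin (the reference generator) and feed each
    reference to a direct-mapped cache per processor (the simulated
    memory hierarchy). No timing feeds back to the generator."""
    caches = [[None] * cache_lines for _ in traces]  # per-proc tag arrays
    hits = misses = 0
    cursors = [0] * len(traces)
    while any(c < len(t) for c, t in zip(cursors, traces)):
        for pid, trace in enumerate(traces):          # round-robin schedule
            if cursors[pid] >= len(trace):
                continue
            addr = trace[cursors[pid]]
            cursors[pid] += 1
            block = addr // line_bytes
            index = block % cache_lines
            if caches[pid][index] == block:
                hits += 1
            else:
                misses += 1
                caches[pid][index] = block            # fill on miss
    return hits, misses

# Two processors each streaming over the same 4 blocks twice:
trace = [i * 64 for i in range(4)] * 2
print(simulate([trace, trace], cache_lines=8))  # -> (8, 8)
```

An execution-driven simulator differs exactly where this sketch is simplest: the memory model would return a latency for each reference, and the scheduler would use those latencies to decide which simulated process runs next.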

Execution-driven Simulation
- The memory hierarchy simulator returns simulated time information to the reference generator, which uses it to schedule simulated processes

Summary
- Evaluate design tradeoffs
  - many underlying design choices
  - prove coherence, consistency
- Evaluation must be based on a sound understanding of workloads
  - drive the factors you want to study
  - representative scaling factors
- Use workload-driven evaluation to resolve architectural questions