# Benchmarking: Science? Art? Neither?

J. E. Smith, University of Wisconsin-Madison. Copyright © 2006 by James E. Smith.



## Slide 2: Everyone has an opinion…

- Like most people working in research and development, I'm a great benchmark hobbyist
- My perspective on the benchmark process
- Emphasis on the scientific (or non-scientific) aspects
- Accumulated observations and opinions

## Slide 3: Benchmarking Process

Steps:

1) Define workload
2) Extract benchmarks from applications
3) Choose performance metric
4) Execute benchmarks on target machine(s)
5) Project workload performance for target machine(s) and summarize results

## Slide 4: The Benchmarking Process (Science)

## Slide 5: Extracting Benchmarks

- Total work in application environment; total work in benchmarks
- Fraction of work from each job type; fraction of work from each benchmark
- Perfect scaling assumption: perfect scaling is often implicitly assumed, but does not always hold

## Slide 6: Projecting Performance

- Define scale factor
- Constant work model: work done is the same regardless of machine used
  - What is projected time? (assume perfect scaling)
- Constant time model: time spent in each program is the same regardless of machine used
  - What is projected work? (assume perfect scaling)
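The two projection models can be sketched numerically. All times, weights, and scale factors below are hypothetical, and perfect scaling is assumed as on the slide:

```python
# Projecting workload performance under the two models.
# All times and weights below are hypothetical examples.

base_times   = [10.0, 40.0, 50.0]   # seconds on the base machine
target_times = [ 5.0, 30.0, 40.0]   # seconds on the target machine
weights      = [0.5, 0.3, 0.2]      # fraction of workload work per benchmark

# Constant work model: work is fixed, so project the target's total time.
projected_time = sum(w * t for w, t in zip(weights, target_times))

# Constant time model: time per program is fixed, so project the relative
# work done (a weighted speedup) in the same time budget.
projected_work = sum(w * (b / t)
                     for w, b, t in zip(weights, base_times, target_times))

print(projected_time)  # 19.5
print(projected_work)  # ~1.65
```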

## Slide 7: Performance Measured as a Rate

- Rate = work/time
- For the constant work model: the weighted harmonic mean rate
- For the constant time model: because the t_i are fixed, this is essentially a weighted arithmetic mean
- What about the geometric mean? Neither science nor art: it has a nice (but mostly useless) mathematical property
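A small numeric sketch of the three means over benchmark rates (the rates and work-fraction weights are hypothetical):

```python
from statistics import geometric_mean

# Hypothetical benchmark rates (work/time) and work-fraction weights.
rates   = [2.0, 4.0, 8.0]
weights = [0.5, 0.3, 0.2]

# Constant work model -> weighted harmonic mean of the rates.
whm = sum(weights) / sum(w / r for w, r in zip(weights, rates))

# Constant time model -> time-weighted arithmetic mean of the rates.
wam = sum(w * r for w, r in zip(weights, rates))

# Geometric mean: ratio-consistent, but corresponds to neither work model.
gm = geometric_mean(rates)

print(whm, wam, gm)  # ~2.857, 3.8, ~4.0
```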

## Slide 8: Defining the Workload

Who is the benchmark designed for?

- Purchaser
  - In the best position (theoretically): workload is well known
  - In theory, should develop own benchmarks; in practice, often does not
  - Problem: matching standard benchmarks to the workload
- Developer
  - Typically uses both internal and standard benchmarks
  - Uses standard benchmarks more often than is admitted
  - Deals with markets (application domains); needs to know the market (the designer's paradox)
  - Only needs to satisfy decision makers in the organization

## Slide 9: The Designer's Paradox

Consider multiple application domains and multiple computer designs. Computer 3 gives the best overall performance BUT WON'T SELL: customers in domain 1 will choose Computer 1, and customers in domain 2 will choose Computer 2.

| Application domain | Computer 1 time (sec.) | Computer 2 time (sec.) | Computer 3 time (sec.) |
|---|---|---|---|
| Domain 1 | 10 | 100 | 20 |
| Domain 2 | 100 | 10 | 20 |
| Total time | 110 | 110 | 40 |
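The paradox can be checked mechanically; the times below are the ones from the slide's table:

```python
# Customers pick the per-domain winner, not the best-overall machine.
times = {  # seconds, from the slide's table
    "Domain 1": {"Computer 1": 10, "Computer 2": 100, "Computer 3": 20},
    "Domain 2": {"Computer 1": 100, "Computer 2": 10, "Computer 3": 20},
}
computers = ["Computer 1", "Computer 2", "Computer 3"]

best_overall = min(computers, key=lambda c: sum(times[d][c] for d in times))
per_domain   = {d: min(times[d], key=times[d].get) for d in times}

print(best_overall)  # Computer 3 (40 s total)...
print(per_domain)    # ...but each domain buys its own specialist
```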

## Slide 10: Defining the Workload (continued)

Who is the benchmark designed for?

- Researcher
  - Faces the biggest problems: no market (or all markets)
  - Can easily fall prey to the designer's paradox
  - Must satisfy anonymous reviewers
  - Often falls prey to conventional wisdom, e.g. "execution-driven simulation == good, trace-driven simulation == bad"

## Slide 11: Program Space

Choosing the benchmarks from a program space (art) is where the main difference lies among user, designer, and researcher.

- User can make a scenario-based or "day in the life" choice, e.g. Sysmark, Winstone
- Designer can combine multiple scenarios based on marketing input
- Researcher has a problem: the set of all programs is not a well-defined space to choose from
  - All possible programs? Put them in alphabetical order and choose randomly?
  - Modeling may have a role to play (later)

## Slide 12: Extracting Benchmarks (Scaling)

Cutting real applications down to size:

- May change the relative time spent in different parts of the program
- Changing the data set can be risky: data-dependent optimizations exist in SW and HW
- Generating a special data set is even riskier
- A bigger problem with multi-threaded benchmarks (more later)

## Slide 13: Metrics

- Constant work model: harmonic mean
- Gmean gives equal reward for speeding up all benchmarks
  - It is easier to speed up programs with more inherent parallelism, so the already-fast programs get faster
- Hmean gives greater reward for speeding up the slow benchmarks
  - Consistent with Amdahl's law
  - "You can pay for bandwidth" puts Hmean at a disadvantage
  - Will become a greater issue with parallel benchmarks
- Arithmetic mean gives greater reward for speeding up the already-fast benchmarks
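These reward properties are easy to demonstrate with two hypothetical benchmark rates, one slow and one fast, doubling each in turn:

```python
from statistics import geometric_mean, harmonic_mean

# Base rates are a slow benchmark (1.0) and a fast one (10.0); hypothetical.
speed_slow = [2.0, 10.0]   # slow benchmark doubled
speed_fast = [1.0, 20.0]   # fast benchmark doubled

def amean(xs):
    return sum(xs) / len(xs)

# Geometric mean: identical reward either way.
g1, g2 = geometric_mean(speed_slow), geometric_mean(speed_fast)
# Harmonic mean: rewards speeding up the slow benchmark (Amdahl-consistent).
h1, h2 = harmonic_mean(speed_slow), harmonic_mean(speed_fast)
# Arithmetic mean: rewards speeding up the already-fast benchmark.
a1, a2 = amean(speed_slow), amean(speed_fast)
```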

## Slide 14: Reward for Speeding Up Slow Benchmark (Gmean)

## Slide 15: Reward for Speeding Up Slow Benchmark (Hmean)

## Slide 16: Reward for Speeding Up Slow Benchmark (Amean)

## Slide 17: Defining Work

- Some benchmarks (SPEC) work with speedup rather than rate
- This leads to a rather odd work metric: that which the base machine can do in a fixed amount of time
  - Changing the base machine therefore changes the amount of work
- Here is where the geometric mean would seem to have an advantage, but it is hiding the lack of good weights
  - Weights can be adjusted if the baseline changes
- How do we solve this? Maybe use a non-optimizing compiler for a RISC ISA and count dynamic instructions (or run on a non-pipelined processor?)
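The base-machine dependence can be illustrated with hypothetical times on two candidate base machines and two machines under test: the geometric mean of speedups ranks the machines the same way under either base (it factors into a ratio of gmeans), while the arithmetic mean of speedups can flip the ranking when the base changes:

```python
from statistics import geometric_mean

# SPEC-style scores: speedups over a base machine's times (all hypothetical).
base_a = [10.0, 100.0]   # times on base machine A
base_b = [20.0,  50.0]   # times on a different base machine B
m1     = [ 5.0,  50.0]   # times on machine under test 1
m2     = [ 8.0,  25.0]   # times on machine under test 2

def ratios(base, m):
    return [b / t for b, t in zip(base, m)]

def amean(xs):
    return sum(xs) / len(xs)

# Geometric mean of speedups: same m1-vs-m2 ranking under either base.
ga1, ga2 = geometric_mean(ratios(base_a, m1)), geometric_mean(ratios(base_a, m2))
gb1, gb2 = geometric_mean(ratios(base_b, m1)), geometric_mean(ratios(base_b, m2))

# Arithmetic mean of speedups: the ranking flips when the base changes,
# exposing the missing weights that the geometric mean quietly hides.
aa1, aa2 = amean(ratios(base_a, m1)), amean(ratios(base_a, m2))
ab1, ab2 = amean(ratios(base_b, m1)), amean(ratios(base_b, m2))
```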

## Slide 18: Weights

- Carefully chosen weights can play an important role in making the benchmark process more scientific
- Using unweighted means ignores relative importance in a real workload
- Give a set of weights along with the benchmarks; ideally give several sets of weights for different application domains
- With scenario-based benchmarking, weights can be assigned during scenario creation

| Application domain | Benchmark 1 weight | Benchmark 2 weight | Benchmark 3 weight |
|---|---|---|---|
| Domain 1 | .5 | .3 | .2 |
| Domain 2 | .1 | .6 | .3 |
| Domain 3 | .7 | 0 | .3 |
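Per-domain weight sets like these combine directly with measured rates; a minimal sketch using the weighted harmonic mean (constant work model) with hypothetical benchmark rates:

```python
# Scoring hypothetical benchmark rates against per-domain weight sets,
# using the weighted harmonic mean (constant work model).
rates = {"Benchmark1": 4.0, "Benchmark2": 2.0, "Benchmark3": 8.0}

domain_weights = {
    "Domain 1": {"Benchmark1": 0.5, "Benchmark2": 0.3, "Benchmark3": 0.2},
    "Domain 2": {"Benchmark1": 0.1, "Benchmark2": 0.6, "Benchmark3": 0.3},
    "Domain 3": {"Benchmark1": 0.7, "Benchmark2": 0.0, "Benchmark3": 0.3},
}

def weighted_hmean(weights, rates):
    return sum(weights.values()) / sum(w / rates[b] for b, w in weights.items())

scores = {d: weighted_hmean(w, rates) for d, w in domain_weights.items()}
print(scores)  # one summary rate per application domain
```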

## Slide 19: Reducing Simulation Time

- For the researcher, cycle-accurate simulation can be very time consuming, especially for realistic benchmarks
- Need to cut down benchmarks
- Sampling: simulate only a portion of the benchmark
  - May require warm-up
  - Difficult to do with the perfect scaling property
- Significant recent study in this direction

## Slide 20: Reducing Simulation Time (Sampling)

- Random sampling
  - More scientifically based: relies on the Central Limit Theorem
  - Can provide confidence measures
- Phase-based sampling
  - Analyze program phase behavior to select representative samples
  - More of an art… if that; phase characterization is not as scientific as one might think
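A sketch of the random-sampling idea, with a synthetic stand-in for a benchmark's per-interval CPI values: estimate mean CPI from a small random sample and attach a CLT-based confidence interval.

```python
import random
from statistics import mean, stdev

random.seed(0)
# Hypothetical per-interval CPI values standing in for a full benchmark run.
population = [random.gauss(1.5, 0.4) for _ in range(100_000)]

# Random sampling: simulate only n randomly chosen intervals.
n = 400
sample = random.sample(population, n)

est = mean(sample)
# CLT-based 95% confidence interval on the mean CPI.
half_width = 1.96 * stdev(sample) / n ** 0.5
ci = (est - half_width, est + half_width)
print(est, ci)
```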

## Slide 21: Role of Modeling

- Receiving significant research interest
- In general, can be used to cut a large space down to a small space, and to select representative benchmarks
- Can be used by researchers for subsetting, based on quantifiable, key characteristics
- Empirical models vs. mechanistic models
  - Empirical models guess at the significant characteristics; mechanistic models derive them
- Modeling superscalar-based computers is not as complex as it seems

## Slide 22: Interval Analysis

Superscalar execution can be divided into intervals separated by miss events:

- Branch mispredictions
- I-cache misses
- Long D-cache misses
- TLB misses, etc.

## Slide 23: Branch Misprediction Interval

Interval total time = N/D + c_dr + c_fe, where:

- N = number of instructions in the interval
- D = decode/dispatch width
- c_dr = window drain cycles
- c_fe = front-end pipeline length

The interval decomposes into useful issue time (N/D), the window drain time, the branch latency, and the front-end pipeline refill time.
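The interval equation can be written directly as a function; the parameter values in the example are hypothetical:

```python
def branch_interval_cycles(N, D, c_dr, c_fe):
    """Cycles for one branch-misprediction interval: N instructions issue
    at width D, then the window drains (c_dr) and the front-end pipeline
    refills (c_fe)."""
    return N / D + c_dr + c_fe

print(branch_interval_cycles(N=80, D=4, c_dr=6, c_fe=10))  # 36.0
```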

## Slide 24: Overall Performance

Total cycles = N_total/D + ((D−1)/2) × (m_iL1 + m_br + m_L2) + m_iL1 × c_iL1 + m_br × (c_dr + c_fe) + m_L2 × (−W/D + c_lr + c_L2)

where:

- N_total = total number of instructions
- D = pipeline decode/issue/retire width
- W = window size
- m_iL1 = I-cache misses; c_iL1 = I-cache miss latency
- m_br = branch mispredictions; c_dr = window drain time
- m_L2 = L2 cache misses; c_fe = pipeline front-end latency (non-overlapped)
- c_lr = load resolution time; c_L2 = L2 cache miss latency
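The total-cycle equation, transcribed as a function; symbol names follow the slide, and the example inputs are hypothetical:

```python
# The interval model's total-cycle equation. Example inputs are hypothetical.
def total_cycles(N_total, D, W,
                 m_iL1, c_iL1,      # I-cache misses, miss latency
                 m_br, c_dr, c_fe,  # mispredictions, drain time, front-end length
                 m_L2, c_lr, c_L2): # L2 misses, load resolution, L2 latency
    return (N_total / D
            + ((D - 1) / 2) * (m_iL1 + m_br + m_L2)
            + m_iL1 * c_iL1
            + m_br * (c_dr + c_fe)
            + m_L2 * (-W / D + c_lr + c_L2))

print(total_cycles(N_total=1000, D=4, W=32,
                   m_iL1=2, c_iL1=10,
                   m_br=5, c_dr=4, c_fe=8,
                   m_L2=1, c_lr=2, c_L2=100))  # 436.0
```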

## Slide 25: Model Accuracy

- Average error: 3%
- Worst error: 6.5%

## Slide 26: Modeling Instruction Dependences

Slide a window over the dynamic instruction stream and compute the average critical path length, K:

- Unit latency: avg. IPC = W/K
- Non-unit latency: avg. IPC = W/(K × avg. latency)
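A minimal sketch of the unit-latency case: slide a window of size W over a dynamic instruction stream, compute the dependence critical path per window, average to get K, and estimate IPC = W/K. The stream and its dependences below are hypothetical.

```python
W = 4
# deps[i] = indices of earlier instructions that instruction i depends on.
deps = [[], [0], [1], [], [3], [2, 4], [], [6]]

def window_critical_path(deps, start, W):
    depth = {}  # longest dependence chain ending at each in-window instruction
    for i in range(start, min(start + W, len(deps))):
        preds = [depth[j] for j in deps[i] if j in depth]
        depth[i] = 1 + max(preds, default=0)
    return max(depth.values())

paths = [window_critical_path(deps, s, W) for s in range(0, len(deps), W)]
K = sum(paths) / len(paths)   # average critical path per window
ipc = W / K                   # unit-latency IPC estimate
print(K, ipc)  # 2.5 1.6
```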

## Slide 28: Challenges: Multiprocessors

- Multiple cores are with us (and will be pushed hard by manufacturers)
- Are the SPLASH benchmarks the best we can do?
- Problems
  - Performance determines the dynamic workload: thread self-scheduling, spinning on locks, barriers
  - Sampling must respect lock structure
  - Scaling must respect Amdahl's law (intra-benchmark)
- Get out in front of the problems, as opposed to what happened with uniprocessors

## Slide 29: Challenges: Multiprocessors (continued)

- Metrics
  - Should account for Amdahl's law (inter-benchmark)
  - Harmonic mean is even more important than with ILP
- Programming models
  - Standardized thread libraries
  - Automatic parallelization? SPEC CPU has traditionally been a compiler test as well as a CPU test
- Consider NAS-type benchmarks: the benchmarker codes the application

## Slide 30: Challenges: Realistic Workloads

- Originally SPEC was RISC-workstation based: Unix, compiled C
- What about Windows apps? No source code, OS-specific, BUT used all the time
- Consider Winstone/Sysmark
  - In our research, we are weaning ourselves off SPEC and toward Winstone/Sysmark
  - It does make a difference, e.g. the IA-32 EL study
- Scenario-based benchmarks include context switches, OS code, etc.; essential for sound system architecture research

## Slide 31: Conclusions

- A scientific benchmarking process should be a foundation
  - Program space, weights, defining work, scaling, metrics
- Modeling has a role to play
- Many multiprocessor (multi-core) challenges
- Scenario-based benchmarking
