
1 ARCHITECTURE PERFORMANCE EVALUATION Matthew Jacob SERC, Indian Institute of Science, Bangalore

2 © MJT, IISc 2 Architecture Performance Evaluation
1. Introduction: Modeling, Simulation
2. Benchmark programs and suites
3. Fast simulation techniques
4. Analytical modeling

3 © MJT, IISc 3 Evaluating Computer Systems: When?
Designer: During design (system not available)
Administrator: Before purchase (system available)
Administrator: While tuning/configuring (system available)
User: In deciding which system to use (system available)

4 © MJT, IISc 4 Performance Evaluation 1. Performance measurement 2. Performance modeling

5 © MJT, IISc 5 Performance Evaluation. 1. Performance measurement  Time, space, power …  Using hardware or software probes  Example: Pentium hardware performance counters 2. Performance modeling

6 © MJT, IISc 6 Performance Evaluation.. 1. Performance measurement  Time, space, power …  Using hardware or software probes  Example: Pentium hardware performance counters 2. Performance modeling  Model: a representation of the system under study  A simplifying set of assumptions about how it behaves: how it interacts with the outside world, and how it changes with time through the interactions between its own components

7 © MJT, IISc 7 Performance Evaluation… 1. Performance measurement  Time, space, power …  Using hardware or software probes  Example: Pentium hardware performance counters 2. Performance modeling  Kinds of Models 1. Physical or scale model 2. Analytical model: using mathematical equations 3. Simulation model: computer-based approach; a computer program mimics the behaviour of the system We will first look at Simulation, then at Analytical Modeling

8 © MJT, IISc 8 Simulation Imitation of some real thing, state of affairs, or process (Wikipedia) Using a system model instead of the actual physical system The act of simulating something generally entails representing certain key characteristics or behaviours of a selected physical or abstract system A central concept: the state of the system

9 © MJT, IISc 9 State State of a system  at a moment in time  a function of the values of the attributes of the objects that comprise the system Example: Consider a coffee shop, where there is a cashier and a coffee dispenser  State can be described by (Number of customers at Cashier, Number of Customers at Coffee dispenser)

10 © MJT, IISc 10 State Transition Diagram Change of state occurs due to 2 kinds of events  Arrival or Departure of a customer  Can label each state transition arc as A or D [Figure: state transition diagram over states (0,0), (1,0), (2,0), (3,0), (0,1), (1,1), …; an A arc is taken when a new customer arrives, a D arc when a customer departs from the system]

11 © MJT, IISc 11 Event An incident or situation which occurs in a particular place during a particular interval of time  Example: Cashier is busy between times t1 and t2 [Timeline figure: axis starting at 0, with the busy interval marked between t1 and t2]

12 © MJT, IISc 12 Discrete Event An incident or situation which occurs at a particular instant in time  Example: Cashier becomes busy at time t1  System state only changes instantaneously at such moments in time Discrete Event System Model  States  Discrete events and corresponding state changes [Timeline figure: axis starting at 0, with the instant t1 marked]

13 © MJT, IISc 13 Discrete Event Simulation Involves keeping track of 1. System state 2. Pending events, each with an associated time: (Event type, Time) 3. Simulated time (Simulation Clock)

14 © MJT, IISc 14 The DES Algorithm
Variables: SystemState, SimnClock, PendingEventList
Initialize variables
Insert first event into PendingEventList
while (not done) {
    Delete event E with lowest time t from PendingEventList
    Advance SimnClock to that time t
    Update SystemState by calling event handler of event E
}
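As a concrete illustration, here is a minimal runnable sketch of this loop in Python (our rendering, not from the slides): the pending event list becomes a binary min-heap, and a counter breaks ties between events scheduled for the same time.

    import heapq, itertools

    pending = []                  # PendingEventList: min-heap ordered by event time
    counter = itertools.count()   # tie-breaker for events with equal times
    clock = 0.0                   # SimnClock

    def schedule(time, handler):
        heapq.heappush(pending, (time, next(counter), handler))

    schedule(0.0, lambda t: print("first event at", t))   # insert first event

    while pending:                # "not done": here, run until no events remain
        time, _, handler = heapq.heappop(pending)   # delete event E with lowest time t
        clock = time              # advance SimnClock to that time t
        handler(clock)            # update SystemState via E's event handler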

15 © MJT, IISc 15 Example: Cashier at Coffee Shop Events? State? Event Handlers?

16 © MJT, IISc 16 Example: Cashier at Coffee Shop
Events?  Arrival of customer, Departure of customer
State?
 boolean CashierBusy?
 queue CashQueue (each item records the arrival time of that customer; operations: EnQueue, DeQueue, IsEmpty)
 Variables keeping track of properties of interest, e.g., cashier utilization, average wait time in cash queue
Event Handlers? (next two slides)

17 © MJT, IISc 17 Example: Handler for Arrival (time t)
if (CashierBusy?) {
    EnQueue(CashQueue, t)
} else {
    CashierBusy? = TRUE
    TimeCashierBecameBusy = t
    NumThroughQueue++
    ScheduleEvent(D, t + SERVICETIME)
}

18 © MJT, IISc 18 Example: Handler for Departure (time t)
if (IsEmpty(CashQueue)) {
    CashierBusy? = FALSE
    TotalCashierBusyTime += (t – TimeCashierBecameBusy)
} else {
    next = DeQueue(CashQueue)
    NumThroughQueue++
    TotalTimeInQueue += (t – next.arrivaltime)
    ScheduleEvent(D, t + SERVICETIME)
}

19 © MJT, IISc 19 The DES Algorithm
Variables: SystemState, SimnClock, PendingEventList
Initialize variables
Insert first event into PendingEventList
while (not done) {
    Delete event E with lowest time t from PendingEventList
    Advance SimnClock to that time t
    Update SystemState by calling event handler of event E
}
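Putting slides 14 through 18 together, the following self-contained Python sketch simulates the cashier and reports the properties of interest. The handler bodies follow the slides; the workload (20 customers with exponential interarrivals of mean 5) and the SERVICETIME value are assumptions made purely for illustration.

    import heapq, itertools, random

    SERVICETIME = 4.0                    # assumed constant service time

    pending, counter = [], itertools.count()
    cashier_busy = False                 # CashierBusy?
    cash_queue = []                      # CashQueue: FIFO of arrival times
    time_busy_start = 0.0                # TimeCashierBecameBusy
    total_busy = total_wait = 0.0        # TotalCashierBusyTime, TotalTimeInQueue
    num_through = 0                      # NumThroughQueue

    def schedule(t, handler):
        heapq.heappush(pending, (t, next(counter), handler))

    def arrival(t):                      # handler for Arrival(t), slide 17
        global cashier_busy, time_busy_start, num_through
        if cashier_busy:
            cash_queue.append(t)         # EnQueue(CashQueue, t)
        else:
            cashier_busy = True
            time_busy_start = t
            num_through += 1
            schedule(t + SERVICETIME, departure)   # ScheduleEvent(D, t + SERVICETIME)

    def departure(t):                    # handler for Departure(t), slide 18
        global cashier_busy, total_busy, total_wait, num_through
        if not cash_queue:               # IsEmpty(CashQueue)
            cashier_busy = False
            total_busy += t - time_busy_start
        else:
            arrived = cash_queue.pop(0)  # DeQueue(CashQueue)
            num_through += 1
            total_wait += t - arrived
            schedule(t + SERVICETIME, departure)

    random.seed(1)                       # assumed workload, for illustration only
    t = 0.0
    for _ in range(20):
        t += random.expovariate(1 / 5.0) # mean interarrival time of 5
        schedule(t, arrival)

    while pending:                       # the DES loop of slides 14 and 19
        t, _, handler = heapq.heappop(pending)
        handler(t)

    print("utilization:", total_busy / t, "average wait:", total_wait / num_through)

Running it prints the cashier utilization (total busy time divided by total simulated time) and the average wait per customer served.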

20 © MJT, IISc 20 Architectural Simulation Example: Simulation of memory system behaviour during execution of a given program  Objective: Average memory access time, Number of cache hits, etc. At least 3 different ways to do this

21 © MJT, IISc 21 Architectural Simulation. 1. Trace Driven Simulation 2. Stochastic Simulation 3. Execution Driven Simulation

22 © MJT, IISc 22 Architectural Simulation.. 1. Trace Driven Simulation  Trace: A log or record of all the relevant events that must be simulated Example: (R, 0x1279E, 1B), (R, 0xAB7800, 4B),…

23 © MJT, IISc 23 Architectural Simulation… 1. Trace Driven Simulation  Trace: A log or record of all the relevant events that must be simulated Example: (R, 0x1279E, 1B), (R, 0xAB7800, 4B),… 2. Stochastic Simulation  Driven by random number generators Example: Addresses are uniformly distributed between 0 and 2^32 − 1; 45% of memory operations are Reads

24 © MJT, IISc 24 Architectural Simulation…. 1. Trace Driven Simulation  Trace: A log or record of all the relevant events that must be simulated Example: (R, 0x1279E, 1B), (R, 0xAB7800, 4B),… 2. Stochastic Simulation  Driven by random number generators Example: Addresses are uniformly distributed between 0 and 2^32 − 1; 45% of memory operations are Reads 3. Execution Driven Simulation  Interleaves the execution of the program (whose execution is being simulated) with the simulation of the target architecture
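To make trace-driven simulation concrete, here is a hedged Python sketch that replays (operation, address, size) records, like those in the example trace above, through a direct-mapped cache model; the cache geometry, latencies, and the three-record trace are illustrative assumptions, not values from the slides.

    # Replay a memory trace through a direct-mapped cache and report
    # the hit count and the average memory access time (AMAT).
    HIT_TIME, MISS_PENALTY = 1, 100        # assumed latencies, in cycles
    NUM_LINES, LINE_SIZE = 256, 32         # assumed cache geometry

    trace = [("R", 0x1279E, 1), ("R", 0xAB7800, 4), ("R", 0x127A0, 2)]
    tags = [None] * NUM_LINES              # one tag per direct-mapped line
    hits = 0

    for op, addr, size in trace:
        block = addr // LINE_SIZE          # which memory block is touched
        index = block % NUM_LINES          # which cache line it maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block            # fill the line on a miss

    miss_ratio = 1 - hits / len(trace)
    amat = HIT_TIME + miss_ratio * MISS_PENALTY
    print(f"hits: {hits}/{len(trace)}  AMAT: {amat:.1f} cycles")

A stochastic simulator would look the same except that the trace list is replaced by a generator drawing addresses and operation types from the assumed distributions.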

25 © MJT, IISc 25 Example: SimpleScalar A widely used execution driven architecture simulator (www.simplescalar.com) Tool set: compiler, assembler, linker, simulation and visualization tools Facilitates simulation of real programs on a range of modern processors  Fast functional simulator (roughly 10 MIPS)  Detailed out-of-order issue processor with non-blocking caches, speculative execution, branch prediction, etc. (roughly 1 MIPS)

26 © MJT, IISc 26 SimpleScalar. [Figure from Austin, Larson, Ernst, IEEE Computer, Feb 2002: the simulator emulates execution of the instructions of the program being simulated (e.g., a MIPS ISA), interleaved with updates to architectural state and statistics; system calls are executed on the host system (e.g., P4 Linux) where the simulation is running]

27 © MJT, IISc 27 What programs are used? Performance can vary substantially from program to program To compare architectural alternatives, it would be good if a standard set of programs were used This has led to some degree of consensus on which programs to use in architectural studies: benchmark programs

28 © MJT, IISc 28 Kinds of Benchmark Programs
1. Toy Benchmarks  Factorial, Quicksort, Hanoi, Ackermann, Sieve
2. Synthetic Benchmarks  Dhrystone, Whetstone
3. Benchmark Kernels  DAXPY, Livermore loops
4. Benchmark Suites  SPEC benchmarks

29 © MJT, IISc 29 Synthetic Benchmarks: Whetstone  Created in Whetstone Lab, UK, 1970s  Synthetic, originally in Algol 60  Floating point, math libraries
Synthetic Benchmarks: Dhrystone  Pun on Whetstone; Weicker (1984)  Integer performance  “Typical” application mix of mathematical and other operations (string handling)

30 © MJT, IISc 30 Kernel Benchmarks: Livermore Loops Fortran DO loops extracted from frequently used programs at Lawrence Livermore National Labs, USA To assess floating point arithmetic performance http://www.netlib.org/benchmark/livermorec
1. Hydro fragment
      DO 1 L = 1, Loop
      DO 1 k = 1, n
    1 X(k) = Q + Y(k) * (R * ZX(k+10) + T * ZX(k+11))
2. ICCG excerpt (Incomplete Cholesky Conjugate Gradient)
3. Inner product
4. Banded linear equations
5. Tri-diagonal elimination, below diagonal
6. General linear recurrence equations
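For readers more at home in Python than Fortran 77, the hydro fragment translates directly; this is an illustrative rendering with assumed array sizes and constants (the suite itself is distributed in Fortran and C).

    import numpy as np

    n, LOOP = 1001, 10                  # assumed problem size and repeat count
    Q, R, T = 0.5, 1.5, 0.25            # assumed scalar constants
    Y  = np.linspace(0.0, 1.0, n + 1)   # index 0 unused, to mimic 1-based Fortran
    ZX = np.linspace(0.0, 1.0, n + 12)  # needs indices up to n + 11
    X  = np.zeros(n + 1)

    k = np.arange(1, n + 1)             # k = 1 .. n
    for _ in range(LOOP):               # outer DO 1 L = 1, Loop
        # X(k) = Q + Y(k) * (R * ZX(k+10) + T * ZX(k+11))
        X[k] = Q + Y[k] * (R * ZX[k + 10] + T * ZX[k + 11])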

31 © MJT, IISc 31 SPEC Benchmark Suites Standard Performance Evaluation Corporation  “Non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers”  “Develops suites of benchmarks and also reviews and publishes submitted results from member organizations and other benchmark licensees”

32 © MJT, IISc 32 SPEC Consortium Members Acer Inc, Action S.A., AMD, Apple Inc, Azul Systems, Inc, BEA Systems, BlueArc, Bull S.A., Citrix Online, CommuniGate Systems, Dell, EMC, Fujitsu Limited, Fujitsu Siemens, Hewlett-Packard, Hitachi Data Systems, Hitachi Ltd., IBM, Intel, ION Computer Systems, Itautec S/A, Microsoft, NEC – Japan, NetEffect, Network Appliance, NVIDIA, Openwave Systems, Oracle, Panasas, Pathscale, Principled Technologies, QLogic Corporation, The Portland Group, Rackable Systems, Red Hat, SAP AG, Scali, SGI, Sun Microsystems, Super Micro Computer, Inc., SWsoft, Symantec Corporation, Trigence, Unisys

33 © MJT, IISc 33 SPEC Benchmark Suites …
1. CPU
2. Enterprise Services
3. Graphics/Applications
4. High Performance Computing
5. Java Client/Server
6. Mail Servers
7. Network File System
8. Web Servers

34 © MJT, IISc 34 Example: SPEC CPU2000 26 programs with source code, input data sets, makefiles
CINT2000
1. gzip (C) Compression
2. vpr (C) FPGA Circuit Placement and Routing
3. gcc (C) C Programming Language Compiler
4. mcf (C) Combinatorial Optimization
5. crafty (C) Game Playing: Chess
6. parser (C) Word Processing
7. eon (C++) Computer Visualization
8. perlbmk (C) PERL Programming Language
9. gap (C) Group Theory, Interpreter
10. vortex (C) Object-oriented Database
11. bzip2 (C) Compression
12. twolf (C) Place and Route Simulator

35 © MJT, IISc 35 SPEC CPU2000 …
CFP2000
1. wupwise (Fortran 77) Quantum Chromodynamics
2. swim (Fortran 77) Shallow Water Modeling
3. mgrid (Fortran 77) Multi-grid Solver: 3D Potential Field
4. applu (Fortran 77) Parabolic/Elliptic PDEs
5. mesa (C) 3-D Graphics Library
6. galgel (Fortran 90) Computational Fluid Dynamics
7. art (C) Image Recognition / Neural Networks
8. equake (C) Seismic Wave Propagation Simulation
9. facerec (Fortran 90) Face Recognition
10. ammp (C) Computational Chemistry
11. lucas (Fortran 90) Number Theory / Primality Testing
12. fma3d (Fortran 90) Finite-element Crash Simulation
13. sixtrack (Fortran 77) High Energy Physics Accelerator Design
14. apsi (Fortran 77) Meteorology: Pollutant Distribution

36 © MJT, IISc 36 More Recently: SPEC CINT2006
1. perlbench (C) PERL Programming Language
2. bzip2 (C) Compression
3. gcc (C) C Compiler
4. mcf (C) Combinatorial Optimization
5. gobmk (C) Artificial Intelligence: Go
6. hmmer (C) Search Gene Sequence
7. sjeng (C) Artificial Intelligence: Chess
8. libquantum (C) Physics: Quantum Computing
9. h264ref (C) Video Compression
10. omnetpp (C++) Discrete Event Simulation
11. astar (C++) Path-finding Algorithms
12. xalancbmk (C++) XML Processing

37 © MJT, IISc 37 SPEC CFP2006
1. bwaves (Fortran) Fluid Dynamics
2. gamess (Fortran) Quantum Chemistry
3. milc (C) Physics: Quantum Chromodynamics
4. zeusmp (Fortran) Physics/CFD
5. gromacs (C/Fortran) Biochemistry/Molecular Dynamics
6. cactusADM (C/Fortran) Physics/General Relativity
7. leslie3d (Fortran) Fluid Dynamics
8. namd (C++) Biology/Molecular Dynamics
9. dealII (C++) Finite Element Analysis
10. soplex (C++) Linear Programming, Optimization
11. povray (C++) Image Ray-tracing
12. calculix (C/Fortran) Structural Mechanics
13. GemsFDTD (Fortran) Computational Electromagnetics
14. tonto (Fortran) Quantum Chemistry
15. lbm (C) Fluid Dynamics
16. wrf (C/Fortran) Weather Prediction
17. sphinx3 (C) Speech Recognition

38 © MJT, IISc 38 Problem: SPEC program execution duration In terms of instructions executed  CPU2000 average: ~300 billion  Simulated at a speed of 1 MIPS, one run would take about 4 days Programs to be simulated are getting larger  SPEC CPU2006: program execution length increases by an order of magnitude Even more detailed simulation is needed  System-level simulation, which takes the operating system into account, is about 1000 times slower than SimpleScalar
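The time estimate follows directly from the numbers on this slide:

\[ t_{\text{sim}} \;=\; \frac{300 \times 10^{9}\ \text{instructions}}{10^{6}\ \text{instructions/s}} \;=\; 3 \times 10^{5}\ \text{s} \;\approx\; 3.5\ \text{days} \]

and a system-level simulation of the same run, at 1000 times slower, would take on the order of 3 × 10^8 seconds, i.e., roughly a decade.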

39 © MJT, IISc 39 Approaches to Address this Problem Purpose of simulation: to estimate program CPI
1. Use (small) input data so that execution time is reduced
2. Don’t simulate the entire program execution  Example: skip the initial 1 billion instructions and then estimate CPI by simulating only the next 1 billion instructions (see the sketch below)
3. Simulate (carefully) selected parts of program execution on the regular input data  Example: SimPoint, SMARTS
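A hedged sketch of the second approach: the ToySim class below is a hypothetical stand-in for a simulator such as SimpleScalar that offers a fast functional mode and a slow detailed mode, and the instruction counts are scaled far down so the sketch runs instantly.

    import random

    SKIP, MEASURE = 1_000, 1_000      # stand-ins for "skip 1 billion, measure 1 billion"

    class ToySim:
        def functional_step(self):    # fast-forward: no timing model
            pass
        def detailed_step(self):      # detailed mode: returns cycles consumed
            return 1 + (random.random() < 0.02) * 100   # rare long-latency event

    sim = ToySim()
    for _ in range(SKIP):             # 1. skip past initialization, cheaply
        sim.functional_step()
    cycles = sum(sim.detailed_step() for _ in range(MEASURE))
    print("estimated CPI over the measured window:", cycles / MEASURE)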

40 © MJT, IISc 40 Reference: Wunderlich, Wenisch, Falsafi and Hoe, “SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling”, 30th ISCA (ACM/IEEE International Symposium on Computer Architecture), 2003 The Problem: A lot of computer architecture research is done through simulation Microarchitecture simulation is extremely time consuming

41 © MJT, IISc 41 Architecture Conferences
1. ISCA: International Symposium on Computer Architecture
2. ASPLOS: International Conference on Architectural Support for Programming Languages and Operating Systems
3. HPCA: International Symposium on High Performance Computer Architecture
4. MICRO: International Symposium on Microarchitecture

42 © MJT, IISc 42 SMARTS Framework [Figure: the complete program execution drawn as a line; the axis is the instruction stream, not time] From Wunderlich et al., 30th ISCA 2003

43 © MJT, IISc 43 SMARTS Framework Must simulate more than 1 instruction to estimate CPI  U, the Sampling Unit size: the number of instructions that are simulated in detail in each sampling unit From Wunderlich et al., 30th ISCA 2003

44 © MJT, IISc 44 SMARTS Framework.  U, the Sampling Unit size: the number of instructions simulated in detail in each sampling unit  N, the benchmark length in terms of sampling units (total instructions divided by U) From Wunderlich et al., 30th ISCA 2003

45–46 © MJT, IISc Systematic Sampling: Every k-th sampling unit is simulated in detail From Wunderlich et al., 30th ISCA 2003

47–48 © MJT, IISc Systematic Sampling (continued):  W, the number of instructions over which detailed warming is done before each sampling unit is measured From Wunderlich et al., 30th ISCA 2003

49 © MJT, IISc 49 SMARTS Framework……  Systematic Sampling: every k-th sampling unit is simulated in detail  W, the number of instructions over which detailed warming is done before each sampling unit  n, the total number of sampling units measured  Functional Warming: functional simulation plus maintenance of selected microarchitecture state (such as cache hierarchy state and branch predictor state) between sampling units From Wunderlich et al., 30th ISCA 2003
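The “rigorous statistical sampling” of the title rests on standard sampling theory. As a sketch (a textbook result in our notation, not an equation quoted from these slides), the number of sampling units n needed to estimate CPI to within relative error ±ε at a confidence level with critical value z, given an estimated coefficient of variation V̂ of the per-unit CPI, is

\[ n \;\ge\; \left( \frac{z\, \hat{V}}{\varepsilon} \right)^{2}, \qquad \hat{V} \;=\; \frac{\hat{\sigma}_{\text{CPI}}}{\hat{\mu}_{\text{CPI}}}, \]

and the program’s CPI is then estimated as the mean over the n measured units.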

50 © MJT, IISc 50 Choice of Sampling Unit size U [Figure: results across benchmarks as a function of U] From Wunderlich et al., 30th ISCA 2003

51–53 © MJT, IISc Large Samples Are Not Necessary  The curve levels off after U = 1000  Previous approaches used samples of 100M to 1B instructions From Wunderlich et al., 30th ISCA 2003

54–56 © MJT, IISc How Effective is SMARTS?  Much faster simulation: about 30x faster than full SimpleScalar detailed simulation  Much lower average error than SimPoint, but 1.8 times slower From Wunderlich et al., 30th ISCA 2003

57 © MJT, IISc 57 Analytical Modeling The Problem: A lot of computer architecture research is done through simulation Microarchitecture simulation is extremely time consuming, and doesn’t provide insight into what is happening in the processor Another solution: Analytical modeling Example: Karkhanis & Smith, “A First-Order Superscalar Processor Model”, 31st ISCA 2004

58–62 © MJT, IISc Approach Objective: Analytical model for estimating superscalar processor program CPI  Inputs to the model: program characteristics Basic idea: start from the steady-state IPC and model the IPC loss due to the three major miss events (branch mispredictions, instruction cache misses, long data cache misses)  The miss events can be considered to be independent From Karkhanis, Smith 31st ISCA 2004
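The structure of the model can be written in one line (a sketch consistent with these slides; the symbols are ours, not the paper’s):

\[ \text{CPI} \;\approx\; \frac{1}{\text{IPC}_{\text{steady}}} \;+\; m_{\text{br}}\,P_{\text{br}} \;+\; m_{\text{i\$}}\,P_{\text{i\$}} \;+\; m_{\text{d\$}}\,P_{\text{d\$}} \]

where each m is the per-instruction rate of a miss event, each P is its isolated penalty in cycles (derived on the slides that follow), and the assumed independence of the miss events is what lets the penalty terms simply add.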

63 © MJT, IISc 63 Important Input: IW Characteristic The relationship between the number of instructions in the instruction window and the number of instructions that issue  Used to calculate the steady-state IPC  The points on the IW curve follow a power-law relationship  “Starting with dependence statistics taken from instruction traces, the points on the IW curve … can be characterized by a set of relatively complex simultaneous non-linear equations” From Karkhanis, Smith 31st ISCA 2004

64 © MJT, IISc 64 Important Input: IW Characteristic. The relationship between the number of instructions in the instruction window and the number of instructions that issue  Underlying dependence statistics: for i = 1, …, N (N = size of the instruction window), the probability that instruction j+i is dependent on instruction j From Karkhanis, Smith 31st ISCA 2004
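A hedged sketch of the power-law form mentioned on the previous slide (the functional form is as commonly reported in this literature; the constants are program-dependent, and the paper derives the curve from the dependence probabilities above):

\[ i(w) \;\approx\; \alpha\, w^{\beta}, \qquad 0 < \beta < 1, \]

where w is the number of instructions in the window and i(w) is the average number that issue per cycle; exponents near 1/2 are frequently observed.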

65 © MJT, IISc 65 Branch Misprediction Penalty [Figure: IPC vs. time (cycles), showing the steady-state IPC around a mispredicted branch] From Karkhanis, Smith 31st ISCA 2004

66–75 © MJT, IISc Branch Misprediction Penalty. [Slides 66–75 step through the branch misprediction timing diagram] From Karkhanis, Smith 31st ISCA 2004

76 © MJT, IISc 76 Branch Misprediction Penalty.. Isolated Branch Misprediction Penalty = Front-end pipeline depth + Window drain + Ramp up From Karkhanis, Smith 31st ISCA 2004

77 © MJT, IISc 77 ICache Miss Penalty From Karkhanis, Smith 31st ISCA 2004

78 © MJT, IISc 78 ICache Miss Penalty Isolated ICache Miss Penalty = Miss delay − Window drain + Ramp up From Karkhanis, Smith 31st ISCA 2004

79 © MJT, IISc 79 Long DCache Miss Penalty Isolated DCache Miss Penalty = Miss delay − ROB fill − Window drain + Ramp up From Karkhanis, Smith 31st ISCA 2004
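Collecting the three isolated-penalty expressions from slides 76 through 79 in one place (our notation, same content):

\[ \begin{aligned} P_{\text{branch}} &= d_{\text{front-end}} + t_{\text{drain}} + t_{\text{ramp-up}} \\ P_{\text{icache}} &= d_{\text{miss}} - t_{\text{drain}} + t_{\text{ramp-up}} \\ P_{\text{dcache}} &= d_{\text{miss}} - t_{\text{ROB fill}} - t_{\text{drain}} + t_{\text{ramp-up}} \end{aligned} \]

where the d terms are raw delays (front-end pipeline depth, cache miss delay) and the signs reflect whether window drain, ROB fill, and ramp-up add to each penalty or hide part of it.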

80 © MJT, IISc 80 Accuracy of Model Average error 5.8%, maximum error 13%  The error is higher than SMARTS’, but the model is much faster From Karkhanis, Smith 31st ISCA 2004

81 © MJT, IISc 81 Lecture Summary Architecture evaluation studies make heavy use of simulation Simulation speedup through techniques like sampling is widely used Analytical modeling has been attempted too; it is much faster but less accurate Simulation speedup and model building are still areas of research activity

