Statistical Simulation of Superscalar Architectures using Commercial Workloads Lieven Eeckhout and Koen De Bosschere Dept. of Electronics and Information.

Presentation transcript:

Statistical Simulation of Superscalar Architectures using Commercial Workloads
Lieven Eeckhout and Koen De Bosschere
Dept. of Electronics and Information Systems (ELIS), Ghent University, Belgium
CAECW’01, January 21, 2001

2 Outline
Introduction
Statistical Simulation
  – Statistical profiling
  – Synthetic trace generation
Methodology
Evaluation
Conclusion

3 Introduction
Architectural simulation
  – trace-driven or execution-driven
  – accurate
  – long simulation times
  – long traces to be stored
Need for fast simulation techniques
  – take part of a full trace
  – analytical modeling
  – trace sampling
  – statistical simulation

4 Goal
Previous work used SPEC benchmarks to evaluate statistical simulation
In this talk we use both commercial and scientific workloads
  – SPECint, SPECfp, system traces, multimedia, X graphics, database

5 Statistical Simulation
Three steps:
  – extract statistical profile from a program execution
  – generate synthetic trace from it
  – simulate on a trace-driven simulator
Two major advantages:
  – statistical profile is more compact than full trace
  – fast simulation due to statistical nature
⇒ design space exploration in limited time

6 Statistical Simulation
[Diagram: real trace (e.g. SPEC benchmark) → branch profiling, cache profiling, instruction profiling → statistical profile (branch statistics, cache statistics, instruction statistics) → synthetic trace generator → synthetic trace → trace-driven simulator]

7 Statistical Profiling
Microarchitecture-independent statistics
  – instruction statistics
Microarchitecture-dependent statistics
  – branch statistics
  – cache statistics
Result: statistical simulation is used only to explore design options of the processor core (cache and branch predictor are fixed)

8 Statistical Profiling: Instruction Statistics
Instruction mix (13 classes)
Number of register operands
Age of register operands
  – probability that register operand was produced δ instructions before it in the trace (only RAW)
Memory dependencies
  – probability that load is memory-dependent on the δ-th store before it in the trace (only RAW)
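The instruction statistics listed on this slide can be illustrated with a minimal Python sketch. It is not the authors' profiler: the trace record layout `(op_class, dest_reg, src_regs)` and the function names `profile_instructions` and `normalize` are assumptions made for illustration only. The sketch counts the instruction mix, the number of register source operands, and the RAW "age" of each operand (distance to the most recent producer in the trace); memory dependencies are left out for brevity.

```python
from collections import Counter

def profile_instructions(trace):
    """Collect instruction statistics from a trace.

    trace: iterable of (op_class, dest_reg_or_None, src_regs_tuple).
    This record layout is a hypothetical one chosen for the example.
    """
    mix = Counter()            # instruction mix per class
    operand_count = Counter()  # number of register source operands
    age = Counter()            # RAW distance to the producing instruction
    last_writer = {}           # register -> index of its latest producer
    for idx, (op, dest, srcs) in enumerate(trace):
        mix[op] += 1
        operand_count[len(srcs)] += 1
        for reg in srcs:
            if reg in last_writer:
                age[idx - last_writer[reg]] += 1
        if dest is not None:
            last_writer[dest] = idx
    return mix, operand_count, age

def normalize(counter):
    """Turn raw counts into a probability distribution."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()} if total else {}
```

For example, profiling the four-instruction trace `ld r1; add r2 <- r1; add r3 <- r1, r2; st r3` and calling `normalize` on the age counts gives a probability per dependency distance, which is the form a synthetic trace generator would sample from.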

9 Statistical Profiling: Branch Statistics
Six branch types
  – conditional branch, unconditional branch, call with offset, indirect jump, indirect call, return
Distinction
  – branch prediction accuracy: refill pipeline on branch misprediction
  – branch target prediction accuracy: single-cycle bubble in pipeline on correct branch prediction but target misprediction
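The distinction between the two kinds of misprediction can be sketched as a small penalty model in Python. This is only an illustration: the function names and the `refill_cycles` parameter are hypothetical (the slides do not give a pipeline refill cost), and the target misprediction probability is assumed to be conditional on a correct direction prediction.

```python
def branch_penalty(direction_correct, target_correct, refill_cycles):
    """Penalty in cycles for one branch.

    refill_cycles is a hypothetical parameter; the slides give no value.
    """
    if not direction_correct:
        return refill_cycles  # branch misprediction: refill the pipeline
    if not target_correct:
        return 1              # correct prediction, wrong target: one bubble
    return 0

def expected_branch_penalty(p_mispredict, p_target_mispredict, refill_cycles):
    """Average penalty per branch.

    p_target_mispredict is assumed to be conditional on a correct
    direction prediction (an assumption, not stated in the slides).
    """
    return (p_mispredict * refill_cycles
            + (1.0 - p_mispredict) * p_target_mispredict * 1)
```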

10 Statistical Profiling: Cache Statistics
D-cache statistics
  – L1 D-cache miss rate
  – L2 D-cache miss rate
I-cache statistics
  – L1 I-cache miss rate
  – L2 I-cache miss rate

11 Synthetic Trace Generation
Instruction-by-instruction through random number generation
Determine
  – instruction type
  – number of operands
  – age of register operands
  – memory dependency
  – branch behavior
  – D-cache behavior
  – I-cache behavior
[Diagram: example synthetic instruction sequence (st, add, ld, br) annotated with: mispredicted, D-cache miss, I-cache miss]
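A minimal Python sketch of instruction-by-instruction generation from a statistical profile follows. It is an illustration under stated assumptions, not the authors' generator: the profile dictionary layout, the key names (`mix`, `operands`, `age`, `l1i_miss`, `l1d_miss`, `br_mispredict`), and the choice to condition the operand count on the instruction class are all hypothetical. Only L1 miss flags and a single branch class are modelled, D-cache behavior is attached to loads only, and memory dependencies are omitted.

```python
import random

def sample(dist, rng):
    """Draw one key from a {value: probability} dictionary."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def generate_trace(profile, n, seed=0):
    """Generate n synthetic instructions from a (hypothetical) profile."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    trace = []
    for _ in range(n):
        op = sample(profile["mix"], rng)               # instruction type
        n_src = sample(profile["operands"][op], rng)   # number of operands
        inst = {
            "op": op,
            # age of each register operand (RAW dependency distance)
            "dep_distances": [sample(profile["age"], rng) for _ in range(n_src)],
            "icache_miss": rng.random() < profile["l1i_miss"],
        }
        if op == "ld":   # D-cache behavior, loads only in this sketch
            inst["dcache_miss"] = rng.random() < profile["l1d_miss"]
        if op == "br":   # branch behavior
            inst["mispredicted"] = rng.random() < profile["br_mispredict"]
        trace.append(inst)
    return trace
```

Each generated record carries only the properties a trace-driven simulator needs to model timing (dependencies, miss flags, misprediction flags), which is why no addresses or register names appear.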

12 Methodology: microarchitecture
Out-of-order processor
  – 8 and 16 issue
  – windows of 64 and 128 instructions
McFarling branch predictor
‘small’ cache configuration
  – 8KB DM L1 I-cache, 8KB DM L1 D-cache, 64KB 2WSA unified L2 cache
‘large’ cache configuration
  – 32KB DM L1 I-cache, 64KB 2WSA L1 D-cache, 512KB 4WSA unified L2 cache
Access time
  – L1 I-cache (1 cycle), L1 D-cache (2 cycles), L2 cache (10 cycles), main memory (80 cycles)

13 Methodology: benchmarks
8 SPECint95 benchmarks
5 SPECfp95 benchmarks (hydro2d, su2cor, swim, tomcatv, wave5)
8 IBS system traces (mpeg, jpeg, gs, verilog, gcc, sdet, nroff, groff)
4 MediaBench applications (g721, gs, gsm, mpeg2)
4 X graphics benchmarks (DooM, POVRay, Xanim, Quake)
2 TPC-D queries running on Postgres 6.3
⇒ ~ 200 million instructions / trace

14 Evaluation
$$\text{IPC prediction error} = \frac{\mathrm{IPC}_{\text{real trace}} - \mathrm{IPC}_{\text{synthetic trace}}}{\mathrm{IPC}_{\text{real trace}}}$$
$\mathrm{IPC}_{\text{real trace}}$ = IPC when running real trace on trace-driven simulator
$\mathrm{IPC}_{\text{synthetic trace}}$ = IPC when running synthetic trace generated from the statistical profile of the real trace
Simulation speed: $\sigma_{\mathrm{IPC}} / \bar{x}_{\mathrm{IPC}}$ less than 1% after simulating 1 million instructions
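The two evaluation quantities can be sketched in Python as follows. The error function follows the definition on this slide (real minus synthetic, divided by real). The second function reads the simulation-speed criterion as a coefficient of variation, that is, the standard deviation of IPC over its mean across several synthetic runs; that reading, and both function names, are assumptions made for this example.

```python
import statistics

def ipc_prediction_error(ipc_real, ipc_synthetic):
    """Relative IPC prediction error as defined on the slide."""
    return (ipc_real - ipc_synthetic) / ipc_real

def coefficient_of_variation(ipc_samples):
    """Sample standard deviation of IPC divided by its mean.

    Interpreting the slide's criterion this way is an assumption.
    """
    return statistics.stdev(ipc_samples) / statistics.mean(ipc_samples)
```

Under this reading, a set of synthetic-trace runs would be considered converged once `coefficient_of_variation` over their IPC values drops below 0.01.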

15 IPC prediction error (1)
[Bar chart: IPC prediction error per benchmark, grouped by suite (SPECint95, SPECfp95, IBS, MediaBench, X graphics, TPC-D). Y-axis: IPC prediction error, -30% to 40%. Configuration: 16-issue, 128-entry window, ‘small’ cache configuration. Two off-scale values are labeled 157% and 135%, and two bars are annotated "high D-cache miss rate".]

16 IPC prediction error (2)
[Bar chart: IPC prediction error per benchmark, grouped by suite (SPECint95, SPECfp95, IBS, MediaBench, X graphics, TPC-D). Y-axis: IPC prediction error, -30% to 30%. Configuration: 16-issue, 128-entry window, ‘large’ cache configuration.]

17 IPC prediction error vs. static instruction count
[Scatter plot. X-axis: static instruction count (number of instructions executed at least once). Y-axis: IPC prediction error, -40% to 160%. Series: w = 64, i = 8, 'small' cache; w = 128, i = 16, 'small' cache; w = 64, i = 8, 'large' cache; w = 128, i = 16, 'large' cache. Labeled points: DooM, Quake, gs (IBS), gcc, gcc (IBS), mpeg (IBS), groff, nroff, jpeg (IBS), verilog, sdet, TPC-D, vortex, go.]

18 Conclusion (1)
Higher IPC prediction errors for applications with smaller static instruction count:
  – MediaBench applications
  – SPECfp95 benchmarks
  – 2 X graphics benchmarks (POVRay and Xanim)
  – 5 SPECint95 benchmarks

19 Conclusion (2)
Smaller IPC prediction errors for applications with larger instruction footprint:
  – IBS system traces
  – TPC-D traces
  – 2 X graphics benchmarks (DooM and Quake)
  – 3 SPECint95 benchmarks (go, gcc, vortex)
⇒ IPC prediction error between -1% and 25%

20 Conclusion (3)
Statistical simulation is a useful fast simulation technique for commercial workloads
  – due to higher variability in instructions
  – since commercial workloads have larger instruction footprint
  – which makes a statistical technique more powerful