SimpleScalar v3.0 Tutorial U. of Wisconsin, CS752, Fall 2004 Andrey Litvin (main source: Austin & Burger) (also Dana Vantrease’ slides)

Slides:

Advertisements

Similar presentations

1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.

Advertisements

1 Lecture 13: Cache and Virtual Memroy Review Cache optimization approaches, cache miss classification, Adapted from UCB CS252 S01.

Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.

Combining Statistical and Symbolic Simulation Mark Oskin Fred Chong and Matthew Farrens Dept. of Computer Science University of California at Davis.

1 SS and Pipelining: The Sequel Data Forwarding Caches Branch Prediction Michele Co, September 24, 2001.

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

Branch Prediction in SimpleScalar

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )

Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

Csci4203/ece43631 Review Quiz. 1)It is less expensive 2)It is usually faster 3)Its average CPI is smaller 4)It allows a faster clock rate 5)It has a simpler.

Chapter 12 CPU Structure and Function. Example Register Organizations.

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )

A Configurable Simulator for OOO Speculative Execution Design & Implementation By Mustafa Imran Ali ID#

1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )

Introduction to SimpleScalar (Based on SimpleScalar Tutorial)

CS 152 Computer Architecture and Engineering Lecture 23: Putting it all together: Intel Nehalem Krste Asanovic Electrical Engineering and Computer Sciences.

Introduction to SimpleScalar (Based on SimpleScalar Tutorial) TA: Kyung Hoon Kim CSCE614 Texas A&M University.

Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.

1 Introduction to SimpleScalar (Based on SimpleScalar Tutorial) CPSC 614 Texas A&M University.

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

Introduction to SimpleScalar (Based on SimpleScalar Tutorial) CSCE614 Hyunjun Jang Texas A&M University.

UltraSPARC III Hari P. Ananthanarayanan Anand S. Rajan.

1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.

Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.

The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.

COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.

Jan. 5, 2000Systems Architecture II1 Machine Organization (CS 570) Lecture 1: Overview of High Performance Processors * Jeremy R. Johnson Wed. Sept. 27,

OOO Pipelines - III Smruti R. Sarangi Computer Science and Engineering, IIT Delhi.

Computer Structure 2015 – Intel ® Core TM μArch 1 Computer Structure Multi-Threading Lihu Rappoport and Adi Yoaz.

1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

High Performance Computing1 High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors * Jeremy R. Johnson *This lecture was.

CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.

Lecture: Out-of-order Processors

??? ple r B Amulya Sai EDM14b005 What is simple scalar?? Simple scalar is an open source computer architecture simulator developed by Todd.

Dynamic Scheduling Why go out of style?

/ Computer Architecture and Design

Computer Structure Multi-Threading

PowerPC 604 Superscalar Microprocessor

Introduction to SimpleScalar

Introduction to SimpleScalar (Based on SimpleScalar Tutorial)

CS203 – Advanced Computer Architecture

Lecture: Out-of-order Processors

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Introduction to SimpleScalar (Based on SimpleScalar Tutorial)

Lecture 14 Virtual Memory and the Alpha Memory Hierarchy

Lecture 10: Out-of-order Processors

Lecture 11: Out-of-order Processors

Lecture: Out-of-order Processors

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Comparison of Two Processors

Lecture 17: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.

Ka-Ming Keung Swamy D Ponpandi

Lecture: Out-of-order Processors

Lecture 8: Dynamic ILP Topics: out-of-order processors

15-740/ Computer Architecture Lecture 5: Precise Exceptions

Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)

Advanced Computer Architecture

Morgan Kaufmann Publishers Memory Hierarchy: Virtual Memory

* From AMD 1996 Publication #18522 Revision E

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

Overview Prof. Eric Rotenberg

A Configurable Simulator for OOO Speculative Execution

The University of Adelaide, School of Computer Science

Lecture 9: Dynamic ILP Topics: out-of-order processors

Ka-Ming Keung Swamy D Ponpandi

Presentation transcript:

SimpleScalar v3.0 Tutorial U. of Wisconsin, CS752, Fall 2004 Andrey Litvin (main source: Austin & Burger) (also Dana Vantrease’ slides)

Simulator Basics What is an architectural simulator? –Tool that reproduces the behavior of a computing device Why use a simulator? –Flexible Rapid design space exploration Tailor abstraction level to need –Cheap Why not use a simulator? –Slow –Correctness?

Functional vs. Performance Functional simulators implement the architecture - Perform the actual execution - Implement what programmers see Performance (or timing) simulators implement the microarchitecture. - Model system resources/internals - Measure time - Implement what programmers do not see

Trace vs. Execution Driven (2) Trace + Easy to Implement -Requires large disk files to store instruction stream -Limited state stored - Speculation? Multiprocessor? Execution - Hard to Implement + Allows access to full state information at any point during program execution +/- Execution requires inclusion of instruction set emulator and an I/O emulation module

Simplescalar Developed by Todd Austin with Doug Burger at UW, ’94-’96 Execution – driven Collection of simulators that emulate the microprocessor at different detail levels (functional, functional + cache / bpred, out-of-order cycle-timer, etc.) Tools: –C/Fortran compiler, assembler, linker (for PISA) –DLite: target-machine-level debugger –pipeline trace viewer

Advantages of SimpleScalar Fast (given priority over clarity) –4 MIPS functional, 300 KIPS OOO on 1.5 GHz host Relatively(!) simple - short learning curve Modular design Well documented and commented Popular - support community, extensions Limitations will be summarized at the end

Sim-Fast Bare functional simulator Does not account for the behavior of any part of the microarchitecture Optimized for speed

Sim-Safe Similar to sim-fast (slower) Implements some memory op safeguards –Memory alignment –Memory access permission Good for debugging sim-fast crashes

Sim-EIO External trace/checkpoint generator Functional simulator like sim-fast and sim- safe Implements checks like sim-safe

Sim-Profile Profiles by symbol and address Keeps track of and reports (many options) –Dynamic instruction counts –Instruction class counts –Usage of address modes –Profiles of the text & data segment

Sim-Cache/Sim-Bpred Functional core drives the detailed model of the cache/branch predictor Similar to trace-driven cache/bpred simulation Fast results for miss/misprediction rates No timing simulation/performance impact

Cache Implementation block size, # of sets, associativity, all customizable Replacement policies: random, FIFO, LRU 2-level cache hierarchy supported (easily extended) Unified/separate I/D L1 and L2 caches

Implemented Predictors Specifying branch predictor type -bpred Implemented predictors –nottaken always predicts not taken –taken always taken –perfect always right –bimod BTB with 2-bit counters –2lev 2-level adaptive branch predictor Specify level1 size, level 2 size, history size(# bits), XOR PC and history? GAg, GAp, PAg, PAp, gshare –comb combining (meta) predictor

Sim-Outorder Detailed performance simulator Out-of-order execution core Register renaming, reorder buffer, speculative execution 2-level cache hierarchy Branch prediction

Making a Fast Timing Simulator Hardware advantage: parallelism Software emulator advantage: free space for auxillary structures Functional and detailed performance parts can be decoupled

Simulator RUU vs. Logical RUU No consumer -> producer links (tags) Rather, producer->consumer links for efficient result broadcast at writeback Values not tracked (except address) “Completed” bit (ROB and IQ combined) Same struct used for LSQ

Other Simulator Structures ready_queue –Ready to issue instructions –Used at issue stage –Built at writeback –Limited issue bandwidth –Policy: mul/div, load/store, branch first, then oldest first

Other Simulator Structures fu_pool – functional units –Issue latency vs. operational latency –Both constant, hard-coded in sim-outorder –Read port latency more detailed (cache simulation) –Issue latency – busy counter at FU –Operational latency – event queue event_queue –ordered by cycle –schedules writeback events

Other Simulator Structures create_vector –Register renaming –Maps logical register to RUU entry (architectural register up to date if entry is null) –Updated at dispatch and writeback –Similar to Qi in Tomasulo –Backed up during dispatch if sim “divines” misspeculation

Stage Implementation (greater detail in SS hack guide and v2.0 tutorial) Fetch (at ruu_fetch()) –Fetch ins-s up to cache line boundary –Block on miss –Put in Fetch Queue (FQ) –Probe branch predictor for next cache line to fetch from

Stage Implementation Decode (at ruu_decode()) –Fetch from FQ –Decode –Rename registers –Put into RUU and LSQ –Update RUU dependency lists, renaming table –Sim (functional core): Execute instruction, update machine state Detect branch misprediction, backup state (checkpoint)

Stage Implementation Issue (at ruu_issue() and lsq_refresh()) –Ready queue -> Event queue –Order based on policy (see ready_queue slide) –Check and reserve FU’s –Mem. Ops check memory dependences No load speculation (“maybe” dependence respected)

Stage Implementation Writeback (at ruu_writeback()) –Get finished instructions from ready queue –Wake up (put in ready queue) instructions with ready operands (use dependence list) –Performance core detects misprediction here and rolls back the state

Stage Implementation Commit (at ruu_commit()) –Service D-TLB misses –Update register file (logically) and rename table –Retire stores to D-cache –Reclaim RUU/LSQ entries of retirees

Limitations I/O and other system calls –Only some limited functional simulation Lacks support for arbitrary speculation –Only branches can cause rollback –No speculative loads –Harder problem: Decoupling of functional and timing cores (for performance) complicates data speculation extensions

Limitations of Memory System No causality in memory system –All events are calculated at time of first access Simple TLB and virtual memory model –No address translation –TLB miss is a (small) fixed latency parallel with cache access Bandwidth, non-blocking –Modeled as n FU’s (read/write ports with fixed issue delay) Accurate if memory system is lightly utilized –SMT extensions? Overhaul required for multiprocessor simulation

Extensions trace cache value prediction SMT multiprocessor more target ISA / host OS support

Miscellaneous Enter “sim-outorder” with no options to get options list and usage examples Options database (options.h) –Interface to register an option and check if entered at initialization Stats database (stats.h) –Register new stats –Define secondary stats and track distributions

Miscellaneous