Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators. Derek Chiou, UT Austin, RAMP, 6/15/06.


1 Confessions of a RAMP Heretic: Fast, Full-System, Cycle-Accurate x86/PowerPC/ARM/Sparc Simulators
  Derek Chiou
  University of Texas at Austin, Electrical and Computer Engineering
  (Slide footer: 6/15/06, Derek Chiou, UT Austin, RAMP)

2 FAST Goals
  Fast: as fast as possible
    2-3 orders of magnitude slower than target?
    Fast enough to run real datasets to completion
    Interactive?
  Accurate: produce cycle-accurate numbers for modern microprocessors (Pentium M)
  Complete: run unmodified operating systems, applications, ISAs,…
  Transparent: full visibility, no performance hit
  Inexpensive: need thousands
  Usable: quick changes, use RTL to generate
  I/O: the MOST important part of systems

3 Functional/Timing Partitioning
  Proven partitioning
    Asim, SimpleScalar, Timing-First, Memoized, etc.
  Simplifies simulator. Promotes reuse
  Same performance in software
    Asim at 10 kHz
    Most of the time spent in timing model!
  Hardware???
  [Diagram: the Functional Model (ISA: instructions, architectural registers, peripheral functionality, …) sends an instruction stream to the Timing Model (micro-architecture: fetch, decode, rename, reservation stations, scheduling window, reorder buffer, …)]
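The partitioning on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FAST implementation: the three-opcode toy ISA, the `TraceEntry` record, and the one-cycle load-use stall rule are all hypothetical. The point it shows is that the functional model computes values while the timing model sees only the instruction stream and counts cycles.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class TraceEntry:
    """One element of the instruction stream (hypothetical format)."""
    pc: int
    op: str
    dst: Optional[int]
    srcs: Tuple[int, ...]


def functional_model(program, regs: List[int], mem: Dict[int, int]) -> Iterator[TraceEntry]:
    """Execute a toy ISA and emit one trace entry per instruction.

    All architectural state (registers, memory) lives here, not in the timing model.
    """
    for pc, (op, dst, a, b) in enumerate(program):
        if op == "addi":      # regs[dst] = regs[a] + immediate b
            regs[dst] = regs[a] + b
            srcs = (a,)
        elif op == "add":     # regs[dst] = regs[a] + regs[b]
            regs[dst] = regs[a] + regs[b]
            srcs = (a, b)
        elif op == "load":    # regs[dst] = mem[regs[a] + offset b]
            regs[dst] = mem.get(regs[a] + b, 0)
            srcs = (a,)
        else:
            raise ValueError(f"unknown opcode {op!r}")
        yield TraceEntry(pc, op, dst, srcs)


def timing_model(stream) -> int:
    """Count cycles for an in-order pipeline: 1 cycle per instruction,
    plus 1 stall when an instruction reads the result of the load just before it.
    No data values are needed, only register names and opcodes."""
    cycles = 0
    prev_load_dst = None
    for entry in stream:
        cycles += 1
        if prev_load_dst is not None and prev_load_dst in entry.srcs:
            cycles += 1  # load-use interlock (assumed rule)
        prev_load_dst = entry.dst if entry.op == "load" else None
    return cycles


if __name__ == "__main__":
    regs = [0, 0, 0, 0]
    mem = {5: 7}
    program = [("addi", 1, 0, 5), ("load", 2, 1, 0), ("add", 3, 2, 1)]
    print(timing_model(functional_model(program, regs, mem)), regs)
```

Because the two halves communicate only through the stream, either one could be swapped out (for example, a different micro-architecture) without touching the other, which is the reuse argument the slide makes.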

4 FAST
  Functional model could be
    Pure software (QEMU, Bochs, Simics, SimNow)
      Use JIT for performance, very fast
      No better hardware for executing ISA than processor
      Can operate under the covers (flush cache for example)
    Pure hardware (Hoe et al)
    Hybrid (Hoe et al)
  Timing model: very simple hardware
  [Diagram labels: Functional Model (ISA), Timing Model (Micro-architecture), Inst stream, FPGA, Full-System Simulator]

5 What is a FAST Timing Model?
  [Diagram: a pipeline datapath driven by a trace. Labels include instruction memory, GPR file, immediate extend, ALU, data memory, PC, IR, and bypass/interlock logic.]

6 More Complexity
  Caches/TLBs?
    Keep tags, pass address (virtual and physical if necessary)
    Hits, misses determined but don’t need data
  Superscalar (multiple issue)?
    “Fetch and issue” multiple instructions assuming they meet boundary constraints
    Multiple “functional units”
    Schedulers
    Reorder buffer/instruction window
    Pipeline control along with instructions
  NO DATAPATH (and only part of control path)!!!!
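The "keep tags, don't need data" idea can be illustrated with a tag-only cache model. The sketch below is an assumption-laden software analogue, not the FPGA design: the class name `TagOnlyCache` is hypothetical, and LRU replacement is assumed because the slides do not name a replacement policy. It stores no data at all, yet it still decides hit or miss for every address.

```python
from collections import OrderedDict


class TagOnlyCache:
    """Set-associative cache model that stores tags only (no data array).

    Replacement is LRU, which is an assumption made for this sketch.
    """

    def __init__(self, size_bytes: int, ways: int, line_bytes: int):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        # One ordered dict per set; keys are tags, order tracks recency.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss; update tags and counters."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.move_to_end(tag)       # mark as most recently used
            self.hits += 1
            return True
        if len(tags) >= self.ways:
            tags.popitem(last=False)    # evict the least recently used tag
        tags[tag] = None
        self.misses += 1
        return False


if __name__ == "__main__":
    # Geometry taken from the resources slide: 32KB, 4-way, 16B lines.
    cache = TagOnlyCache(32 * 1024, ways=4, line_bytes=16)
    for addr in (0x1000, 0x1004, 0x2000, 0x1008):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

The functional model supplies the addresses; the timing model only needs the hit/miss outcome to charge the right latency, which is why the data array can be dropped.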

7 Driving a Timing Model
  [Diagram blocks: iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, Functional Model, Memory & I/O timing models]

8 Complexity: BP (branch prediction)
  [Diagram blocks: iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, Functional Model, Memory & I/O timing models]
  Wrong-path instructions!
    Implement BP in timing model
    Timing model forces ISA simulator to mis-speculate
    Rollback, restore
  BP only works in processor if it’s fairly accurate
    Degrades to trace driven!
  FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path
  Most complexity (BP, parallelism) can be handled this way

9 Parallelism: Detect Problem & Rollback
  [Diagram labels: FM, Memory, FM, TM, Network, TM, Memory Model]

10 Functional Model Rollback
  Need to
    Rollback, force branch
    Rollback, restore and continue
  How? set_pc(inst_num, pc)
    Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC
    Sufficient
  Currently implemented with checkpoints
    ISA state, memory, peripherals
  Works for parallelism too
  [Diagram label: BR]
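A minimal sketch of what `set_pc(inst_num, pc)` could look like when built on checkpoints follows. Only the name and argument order of `set_pc` come from the slide; the `FunctionalModel` class, the two-opcode toy ISA, and the per-instruction checkpointing granularity are assumptions for illustration. The slide says the real checkpoints cover ISA state, memory, and peripherals; this toy checkpoints registers only.

```python
class FunctionalModel:
    """Toy functional model with per-instruction checkpoints (hypothetical)."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.regs = [0, 0, 0, 0]
        self.inst_num = 0       # index of the next dynamic instruction
        self.checkpoints = {}   # inst_num -> copy of regs before that instruction

    def step(self) -> int:
        """Execute one instruction; return the pc that was executed."""
        self.checkpoints[self.inst_num] = list(self.regs)
        executed_pc = self.pc
        op, a, b = self.program[self.pc]
        if op == "addi":        # regs[a] += immediate b
            self.regs[a] += b
            self.pc += 1
        elif op == "bnez":      # branch to b if regs[a] != 0
            self.pc = b if self.regs[a] != 0 else self.pc + 1
        else:
            raise ValueError(f"unknown opcode {op!r}")
        self.inst_num += 1
        return executed_pc

    def set_pc(self, inst_num: int, pc: int) -> None:
        """Make dynamic instruction `inst_num` be the instruction at `pc`.

        If that instruction already executed, roll state back to just before it.
        """
        if inst_num < self.inst_num:
            self.regs = list(self.checkpoints[inst_num])
            self.inst_num = inst_num
        # Checkpoints at or after inst_num are no longer valid.
        for n in [k for k in self.checkpoints if k >= inst_num]:
            del self.checkpoints[n]
        self.pc = pc


if __name__ == "__main__":
    program = [("addi", 1, 1), ("bnez", 1, 3), ("addi", 2, 10), ("addi", 2, 100)]
    fm = FunctionalModel(program)
    fm.step()
    fm.step()               # branch is taken: the functional model is at pc 3
    fm.set_pc(2, 2)         # timing model predicted not-taken: force the wrong path
    fm.step()               # wrong-path instruction executes
    fm.set_pc(2, 3)         # branch resolves: roll back, restore, continue
    fm.step()
    print(fm.regs, fm.inst_num)
```

The same call covers both cases on the slide: "rollback, force branch" is `set_pc` onto the wrong path, and "rollback, restore and continue" is `set_pc` back onto the correct one.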

11 RTL to Timing Model
  [Diagram: the same trace-driven pipeline datapath as slide 5. Labels include instruction memory, GPR file, immediate extend, ALU, data memory, PC, IR, and bypass/interlock logic.]
  Timing model perfectly models RTL
  Verification???

12 Current FAST System

13 QEMU on Xilinx PowerPC

14 Status
  x86 functional model boots Linux, targeting 80486 to Pentium D-like and beyond (Dam Sunwoo)
    Modified Bochs and QEMU
  Branch-predicted, multi-function-unit, OOO timing model compiles in Bluespec (FAST group)
    Synthesized for FPGA, 8.5K lines of code, rated Top 5 User!
  Memory, disk models
    Hope to have network model soon
  Have straight pipeline 486 model with TLBs and caches
    Preliminary statistics gathered in hardware timing model
  RTL-to-timing model (Nikhil Patil)
  Defining tools for ISA extension and timing model assembly

15 Timing Model Resources
  OOO, superscalar, 2b branch prediction, five functional units, 32KB DCache
    [INTERFACE: Fast_if] + [TM: IfcVB (interface between Bluespec & Verilog)/CmdQ/Fetch/Decode/Rename/Execute]: 26% of V2P30 (3593 slices)
    22 block RAMs (out of 136)
    ROB broken right now
  Early configurable cache model (state shouldn’t change much)
    32KB 4-way set-associative cache with 16B cache lines
      165 slices (1% of a 2VP30)
      17 block RAMs (12% of a 2VP30)
    2MB 4-way set-associative cache with 64B cache lines
      140 slices (1% of a 2VP30)
      40 block RAMs (29% of a 2VP30)
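The cache figures above can be cross-checked with a little arithmetic. The sketch below derives the geometry of the two caches and the block RAM fractions from the numbers on the slide. The 32-bit address width is an assumption (the slide does not give one), and the helper names are hypothetical; the tag-store size is only a lower bound on state, since valid bits and replacement state are not counted.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int, addr_bits: int = 32) -> dict:
    """Derive line count, set count, and tag width for a set-associative cache.

    addr_bits=32 is an assumed address width, not a figure from the slides.
    Sizes are assumed to be powers of two.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = (line_bytes - 1).bit_length()
    index_bits = (sets - 1).bit_length()
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "tag_store_bits": lines * tag_bits,  # tags only; no data, valid, or LRU bits
    }


def bram_fraction(used: int, total: int = 136) -> float:
    """Fraction of the device's block RAMs in use (136 total, per the slide)."""
    return used / total


if __name__ == "__main__":
    for name, size, line, brams in (("32KB", 32 * 1024, 16, 17),
                                    ("2MB", 2 * 1024 * 1024, 64, 40)):
        geometry = cache_geometry(size, ways=4, line_bytes=line)
        print(name, geometry, f"{bram_fraction(brams):.1%} of block RAMs")
```

With these inputs, 17 and 40 block RAMs come out to roughly 12% and 29% of 136, matching the percentages on the slide.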

16 Current Performance
  Functional model
    Up to 500K x86 inst/sec today on V2P30 FPGA
      Includes rollbacks assuming 5% mis-speculation
      Not that optimized
    5 MIPS unmodified
    10M+ on 3.0GHz Pentium 4
      DRC box should give this performance
    PowerPC ISA should be much faster!
      PowerPC on PowerPC
  Timing model
    Not bottleneck!

17 Conclusions
  1MHz to 100MHz, cycle-accurate, full-system, multiprocessor x86, x86-64, PowerPC, ARM, Sparc simulator
  Leverage extant full-system simulators
  FPGA timing models maximize performance and statistics-gathering capabilities
  Pretty much any timing model seems to fit into a single FPGA (Pentium M in V2P30?)
  Uniprocessor, multi-processor capable
  Tools can minimize creation/modification effort

