
1 Functional/Timing Split in UT FAST. Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe, Hari Angepat. University of Texas at Austin, Electrical and Computer Engineering. © Derek Chiou

2 Test of size 1/17/08 RAMP Retreat 2 First, Some Terminology. Host: the system on which a simulator runs (a Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM; a Xilinx FPGA board). Target: the system being modeled (an Alpha 21264 processor; a Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM). Example: Host = your desktop; Simulator = Simplescalar (sim-alpha); Target = Alpha 21264.

3 FAST Goals. RTL-level cycle-accuracy. Complex ISA capable (x86, PowerPC). Complex micro-architecture capable (Intel Core 2). Off-the-shelf OS, apps (Windows, MS Word, Linux): a lesson I learned from my grad school career; can run other stuff too, of course. MP-capable (scale with FPGA resources). Fast (10MIPS range). (Relatively) easy to implement, modify, extend.

4 Test of size 1/17/08RAMP Retreat 4 FAST Prototype in Real Time

5 FAST: Speculative Functional/Timing Partitioning. Proven partitioning (FastSim): the FM executes instructions to completion and pushes an inst trace to the TM; FM insts are used as the TM's fetched insts. If functional insts != timing insts, the TM forces the FM to roll back (e.g., branch mis-speculation, resolve, memory ordering). Clean inst trace/rollback interface. Optimize the common case! The FM runs independently from the TM when functional insts == timing insts! Easy to parallelize. A better target uArch simulates faster. Factorized, not partitioned! (FM + TM) < (monolithic simulator): FM fairly simple, only functionality; TM fairly simple, only timing. [Figure: Functional Model (ISA + peripherals: instructions, architectural registers, peripheral functionality, …) sends an inst trace to the Timing Model (micro-architecture: caches, arbitration, pipelining, associativity, …).]
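
The trace/rollback interface on this slide can be sketched as a toy loop. This is a sequential illustration only, not the FAST implementation: `ToyFunctionalModel`, `simulate`, the always-fall-through predictor, and the `resolve_latency` parameter are all hypothetical names and assumptions. The toy FM has no architectural state other than the PC, so its rollback is trivial.

```python
from dataclasses import dataclass


@dataclass
class TraceEntry:
    inst_num: int  # dynamic instruction number
    pc: int        # address of this instruction
    next_pc: int   # where the functional model went next


class ToyFunctionalModel:
    """Toy FM: every pc falls through to pc + 1 unless listed in `jumps`."""

    def __init__(self, jumps, start_pc=0):
        self.jumps = jumps
        self.pc = start_pc
        self.inst_num = 0

    def step(self):
        next_pc = self.jumps.get(self.pc, self.pc + 1)
        entry = TraceEntry(self.inst_num, self.pc, next_pc)
        self.pc = next_pc
        self.inst_num += 1
        return entry

    def set_pc(self, inst_num, pc):
        # Rollback is trivial here because the pc is the only state.
        self.inst_num = inst_num
        self.pc = pc


def simulate(jumps, n_insts, resolve_latency=3):
    """Toy TM loop: consume the trace, force a rollback on each mismatch."""
    fm = ToyFunctionalModel(jumps)
    right_path, wrong_path, round_trips = [], 0, 0
    while len(right_path) < n_insts:
        entry = fm.step()
        right_path.append(entry.pc)
        predicted = entry.pc + 1  # assumed predictor: always fall through
        if predicted != entry.next_pc:
            # The target would fetch down the wrong path: send the FM there.
            fm.set_pc(entry.inst_num + 1, predicted)
            round_trips += 1
            for _ in range(resolve_latency):
                fm.step()  # wrong-path instruction seen by the TM
                wrong_path += 1
            # The branch resolves: return the FM to the functional path.
            fm.set_pc(entry.inst_num + 1, entry.next_pc)
            round_trips += 1
    return right_path, wrong_path, round_trips


# A four-instruction loop: the instruction at pc 3 jumps back to pc 0.
right_path, wrong_path, round_trips = simulate({3: 0}, n_insts=8)
print(right_path, wrong_path, round_trips)
```

In this toy, round trips occur only at the mispredicted branch, which is the "optimize the common case" point: while the predicted path matches the functional path, the FM never waits on the TM.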

6 Test of size 1/17/08RAMP Retreat 6 High Level FAST Architecture: A Parallelized Simulator Parallelized between FM & TM Parallel target would have a parallelized FM, each target core conceptually running on a separate host core Parallelized TM Parallelizes nicely in hardware (FPGA) Latency tolerant, infrequent round-trips Stats can be done in hardware! (no performance impact) trace Functional Model (ISA + peripherals) Timing Model (Micro-architecture) Host FPGA

7 Test of size 1/17/08RAMP Retreat 7 What Is A FAST Functional Model? Requirements Fast, Full System, generate instruction trace, support rollback Hardware functional models Fast, but FPGA implementation difficult to make complete x86, boots Windows? Simple, resource efficient FM sufficient (but need trace, rollback) Software functional models Bochs, QEMU, Simics, SimNow, SimOS, etc. Run on fastest hardware we know about to execute an ISA Full system

8 What is a FAST Timing Model? [Figure: trace-driven pipeline datapath with instruction memory, GPR file, immediate extend, ALU, data memory, and bypass/interlock logic.]

9 Modular Timing Models: Modules + Connectors. Modules model timing functionality (e.g., rename, caches). Built hierarchically for extensibility: CAM, FIFOs, arbiters, etc.; branch predictors, caches, TLBs, schedulers, ALUs; Fetch, Decode, Rename, RS, ROB. Many are essentially wires (e.g., ALU). Often written to execute one operation (e.g., Rename, Cache) and executed multiple times per target cycle for a wider processor or higher associativity; this simplifies implementation and trades time for space. Connectors connect modules and abstract timing from modules: throughput (input, output), delay, maxTransactions; stats and tracing. (Bill Reinhart)
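
A connector parameterized by throughput, delay, and maxTransactions might look like the following minimal software sketch. The class and method names (`Connector`, `can_enq`, `can_deq`, `end_cycle`) are illustrative assumptions, not the Bluespec interface used in FAST.

```python
from collections import deque


class Connector:
    """Toy connector: a FIFO that enforces target-cycle timing."""

    def __init__(self, delay, in_throughput, out_throughput, max_transactions):
        self.delay = delay                      # target cycles in flight
        self.in_throughput = in_throughput      # max enqueues per target cycle
        self.out_throughput = out_throughput    # max dequeues per target cycle
        self.max_transactions = max_transactions
        self.cycle = 0
        self.items = deque()                    # (ready_cycle, payload)
        self.enq_count = 0
        self.deq_count = 0

    def can_enq(self):
        return (self.enq_count < self.in_throughput
                and len(self.items) < self.max_transactions)

    def enq(self, payload):
        if not self.can_enq():
            raise RuntimeError("connector cannot accept more this cycle")
        self.items.append((self.cycle + self.delay, payload))
        self.enq_count += 1

    def can_deq(self):
        return (self.deq_count < self.out_throughput
                and bool(self.items)
                and self.items[0][0] <= self.cycle)

    def first(self):
        if not self.can_deq():
            raise RuntimeError("nothing ready this cycle")
        return self.items[0][1]

    def deq(self):
        payload = self.first()
        self.items.popleft()
        self.deq_count += 1
        return payload

    def end_cycle(self):
        # Advance one target cycle and reset the per-cycle throughput counters.
        self.cycle += 1
        self.enq_count = 0
        self.deq_count = 0


c = Connector(delay=2, in_throughput=1, out_throughput=1, max_transactions=4)
c.enq("uop-a")      # enqueued in target cycle 0, visible in cycle 2
c.end_cycle()
c.enq("uop-b")      # enqueued in target cycle 1, visible in cycle 3
c.end_cycle()
print(c.deq())      # target cycle 2
```

Because the timing lives in the connector, the modules on either side can stay simple and be retimed by changing connector parameters alone.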

10 Test of size 1/17/08RAMP Retreat 10 Microcode Compiler Intention Automate generation of new ISA instructions Automatically retarget new micro-architectures Necessary for x86 Uses the LLVM Compiler Infrastructure developed at UIUC http://www.llvm.org Compile the Bochs CPU model Bochs is another portable x86 full-system emulator. Backend retargeted to micro-op ISA Generates microcode that “runs” on the timing model Over 99% dynamic inst coverage for most INT benchmarks floating point instructions not yet supported Average 1.27 uOps per handled dynamic x86 instruction Microcode ISA is simple load/store with some x86 specialization Build a functional model as a processor executing microcode ISA? Nikhil Patil

11 Test of size 1/17/08RAMP Retreat 11 Prototype Overview Software functional model Eventually hardware functional model, but software sim exists FPGA-based timing model written in Bluespec Complex OoO micro-architecture fits in a single FPGA DRC or XUP trace Functional Model Software Timing Model Bluespec HDL ProcessorFPGA DRC Computer HT Xilinx FPGA PowerPC 405

12 Test of size 1/17/08RAMP Retreat 12 Current Prototype Functional Model Derived from QEMU Fast (JIT), boots Linux, Windows Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, … Prototype QEMU currently supports x86 Added tracing, rollback (implemented with checkpoint) Including I/O (keyboard, mouse, video) Hosts x86 machines PowerPC inside of an FPGA Dam Sunwoo, Jeb Keefe

13 Test of size 1/17/08RAMP Retreat 13 Current Prototype Timing Model: OOO Superscalar Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson

14 Test of size 1/17/08RAMP Retreat 14 Current Simulator Performance on DRC Includes Operating System Code

15 FAST Future Work. Improve performance (currently TM bottlenecked). TM (5MIPS-20MIPS). FM (10MIPS-100MIPS). Optimize simulator (10MIPS-20MIPS). Hardware FM (FPGA, ~100MIPS): simple pipe running microcode ISA; multithreaded (Protoflex) to improve throughput. Real processor??? (debug for trace, transactional for rollback?) FAST-MP: need MP host. PowerPC ISA support (this month). Power estimation capabilities in FPGAs: don't slow down FAST performance.

16 FAST and RAMP-Classic: A Comparison
Cores: FAST: arbitrary, complex cores from simple core+rollback+TM | RAMP: full RTL, fits in FPGA
ISAs/OSs: FAST: arbitrary ISA/OS (x86/Windows, PowerPC, …) | RAMP: depends on core (PowerPC, Sparc)
Accuracy: FAST: RTL cycle-accurate; host cycles >= target cycles, enabling reuse of hardware resources | RAMP: only accurate target RTL available (unless timing model used); host cycle == target cycle
Speed: FAST: complexity/resource tradeoff | RAMP: as fast as the FPGA will run
Scalability: FAST: depends on FPGA resources (TM costs) | RAMP: depends on FPGA resources

17 FAST-MP: RAMP as a functional model. How to do FAST-MP? Need a multicore host! RAMP will not be able to accurately model the Intel Core X micro-architecture, unless it becomes a much simpler in-order pipeline; even then, will Intel give us RTL? But a RAMP processor can execute the ISA. Target ISA == host ISA: add trace, rollback capabilities. Target ISA != host ISA: simulate the target ISA; hardware support for target ISA, trace, rollback? RAMP becomes a FAST functional model. Use the timing model to predict arbitrary behavior.

18 Test of size 1/17/08RAMP Retreat 18 FAST & RAMP-White Shared Infrastructure FAST connectors as White connectors Stats, timing FAST modules as White modules Quad ported RAM CAM/Cache on top of quad-ported RAM Multi-host cycle caches Branch predictors Load/store units, bus interface units Interconnection networks Eventually, full processors? Executing microcode, software cracking? FAST TM as RAMP TM At that point, RAMP == FAST-MP

19 Test of size 1/17/08RAMP Retreat 19 Conclusions Split functional/timing can be cycle-accurate Roll back (FAST) or Timing-directed (timing-first?) (HASim) We believe that roll back can be done relatively cheaply Can be done in software at about order of magnitude impact Hardware support appears to be reasonable Log old values, playback (easy for simple core) Transactional processor? Simple cores + roll back as host for functional model? Virtualization for more threads is orthogonal, but multiplicative in resources

20 Test of size 1/17/08RAMP Retreat 20 Another View of FAST Old processor/system modeling new processor/system Leverage the fact that most of the time, functionality and order is identical Differences in timing between old and new modeled in timing model Roll back/re-execute to deal with differences Use current generation host as functional model for next generation target? Trace implemented by debug support Roll back implemented with transactional support

21 Test of size 1/17/08RAMP Retreat 21 Host == Target Modules in FAST FAST can leverage modules that are full and accurate implementations of modules Host TLB == target TLB Memory requests issued when timing model

22 Test of size 1/17/08RAMP Retreat 22 FAST-MP: Mixing PP IU

23 Test of size 1/17/08RAMP Retreat 23 What Is A FAST Functional Model? Requirements Fast Full System Generates instruction trace Supports rollback Hardware functional models (very fast) Real processor doesn’t support trace/rollback FPGA implementation difficult to make complete x86, boots Windows? Software functional models exist today Bochs, QEMU, Simics, SimNow, SimOS, etc. Relatively fast, full system Run on fastest hardware we know about to execute an ISA Can be modified to generate trace/support rollback

24 Test of size 1/17/08RAMP Retreat 24 trace Step 1: Improving Performance via Parallelization Parallel slowdown due to communication? FM runs ahead, speculatively, round-trip communication infrequent Round-trip communication only when (functional path != timing path) Microprocessors have same problem Multiple issue, deep pipelines only work if predicted path is correct FM like perfect front end of processor, real uArch (TM) slows it down The better the target micro-architecture, the faster the simulator Functional Model (ISA + peripherals) Timing Model (Micro-architecture) Host

25 Test of size 1/17/08RAMP Retreat 25 Current Prototype Functional Model Derived from QEMU Fast (JIT), boots Linux, Windows Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, … Prototype currently supports x86 Added tracing, rollback (implemented with checkpoint) Including I/O (keyboard, mouse, video) Hosts x86 machines PowerPC inside of an FPGA PowerPC target by January about 1 month to port Dam Sunwoo, Jeb Keefe

26 Test of size 1/17/08RAMP Retreat 26 Current Prototype Timing Model Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson

27 Test of size 1/17/08RAMP Retreat 27 Microcode Compiler Intention Automate generation of new ISA instructions Automatically retarget new micro-architectures Necessary for x86 Uses the LLVM Compiler Infrastructure developed at UIUC http://www.llvm.org Compile the Bochs CPU model Bochs is another portable x86 full-system emulator. Backend retargeted to micro-op ISA Generates microcode that “runs” on the timing model Over 99% dynamic inst coverage for most INT benchmarks floating point instructions not yet supported Average 1.27 uOps per handled dynamic x86 instruction Nikhil Patil

28 Test of size 1/17/08RAMP Retreat 28 Current Simulator Performance on DRC Includes Operating System Code

29 Performance Details. Timing model is the current bottleneck: 100MHz host cycle (not pushing timing); currently taking ~30 host (FPGA) cycles per target cycle, max about 54 cycles (currently max latency defines target clock); BP is a simple gshare predictor. Functional model: unoptimized, modified QEMU; with perfect BP and immediate return from the TM, 5.4MIPS. FM/TM communication: 469ns blocking read from Opteron on DRC (has gotten better); poll every other basic block; 13ns/word for burst write.
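
As a back-of-the-envelope check on these figures, the arithmetic below converts the quoted host-cycle counts into target cycles per second and expresses the blocking read in host cycles. The function names are illustrative. The result is a target cycle rate, not MIPS: converting to MIPS would need the target's instructions per cycle, which the slide does not give.

```python
HOST_CLOCK_HZ = 100e6  # 100MHz host (FPGA) clock, from the slide


def target_cycles_per_second(host_cycles_per_target_cycle):
    """Target cycles simulated per second at the given host-cycle cost."""
    return HOST_CLOCK_HZ / host_cycles_per_target_cycle


def host_cycles(duration_ns):
    """How many host cycles elapse during `duration_ns` nanoseconds."""
    return duration_ns * 1e-9 * HOST_CLOCK_HZ


typical_rate = target_cycles_per_second(30)   # ~30 host cycles per target cycle
worst_rate = target_cycles_per_second(54)     # max of about 54 host cycles
read_stall = host_cycles(469)                 # 469ns blocking read

print(f"typical: {typical_rate / 1e6:.2f}M target cycles/s")
print(f"worst:   {worst_rate / 1e6:.2f}M target cycles/s")
print(f"blocking read: about {read_stall:.0f} host cycles")
```

At ~30 host cycles per target cycle the timing model advances roughly 3.3 million target cycles per second, and a single blocking read costs about as many host cycles as one and a half target cycles, which is why polling only every other basic block matters.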

30 Some Related Work (there is a lot). Software, functional/timing partitioned: Asim, current M5, Timing-First, and Opal are all timing-model driven (the timing model tells the functional model what to do and when to do it). FastSim (Schnarr et al., ASPLOS 98): functional/timing, rollback when functional path != timing path; but instrumented binaries, not parallelized, no hardware. Hardware: HASim, Hardware ASim (Emer et al.): timing-first; seven points of communication between FM & TM (requires infinitely renamed out-of-order FM); currently supports a simplified MIPS ISA.

31 Test of size 1/17/08RAMP Retreat 31 Conclusions/Future Work It works Current FAST simulator prototype 1.2MIPS (unoptimized), about 1000 times slower than target Timely: during architecture phase Complete: runs Windows, Linux Transparent: extensive, hardware-based stats Relatively inexpensive, easy to build and extend (Some) future work Optimize 5MIPS soon, 10MIPS-20MIPS later (hardware FM using uCode?) More realistic timing model & calibration Tattler: automatic bottleneck detection CMP/SMP targets

32 Test of size 1/17/08RAMP Retreat 32 FAST Relative Speeds (Current Prototype)

33 Test of size 1/17/08RAMP Retreat 33 X86 Micro-Op Coverage

34 Test of size 1/17/08RAMP Retreat 34 Number of uOps/x86 Instruction

35 Outline. What is FAST (1 minute): targeted properties. Demo (30 sec): runs x86, Windows at speeds fast enough to interact. How? Start with partition: proven strategy (FastSim); rollback to handle branch mis-speculation; simplifies full-system capabilities, since the complexity of the full system is encapsulated in the functional model; functional model can be a full-system simulator or a processor; FM passes trace to TM; TM is simple and can model complex structures at fairly low weight. Improve performance: parallelize on the FM/TM boundary; round-trip communication infrequent; permit the FM to run ahead speculatively. Doesn't mis-speculation slow things down? Learn from computer architecture (microcoded, pipelined, OoO): speculation permits a computer system to run faster; the key is not to mis-speculate often. Parallelize on the functional/timing boundary: handles branch prediction and anything where the functional path is not equal to the target path; rollback required from the FM. Bottleneck is the timing model. Parallelize the TM? Difficult to do in software: practical limitations due to the number of processors that communicate quickly. Hardware (FPGA) is ideal: parallelizes nicely; TM partition simplifies hardware; latency tolerant, infrequent round-trips. Prototype: overview of prototype (software functional model running on processor host; TM in Bluespec on FPGA); describe some details (DRC or XUP; FM description; TM description; block diagram; microcode compiler). Performance: relative performance; bottlenecks. Future work. Conclusions.

36 Functional Model Modifications. Need to: roll back, force branch; roll back, restore and continue. How? set_pc(inst_num, pc): set a particular dynamic instance of an instruction to a particular instruction pointed to by PC. Sufficient. Can be implemented with checkpoints (ISA state, memory, peripherals). [Figure label: BR]
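
One way `set_pc(inst_num, pc)` could sit on top of checkpoints is sketched below. This is a toy under stated assumptions, not the modified QEMU: the class name `CheckpointingFM`, the two-field state, the made-up "instruction" in `step`, and the checkpoint interval are all hypothetical. A real checkpoint would cover ISA state, memory, and peripherals, as the slide says.

```python
import copy


class CheckpointingFM:
    """Toy functional model whose rollback replays from periodic checkpoints."""

    def __init__(self, interval=4):
        self.state = {"pc": 0, "acc": 0}   # stand-in for all target state
        self.inst_num = 0                  # next dynamic instruction number
        self.interval = interval
        self.checkpoints = {0: copy.deepcopy(self.state)}

    def step(self):
        if self.inst_num % self.interval == 0:
            self.checkpoints[self.inst_num] = copy.deepcopy(self.state)
        # Toy "instruction": accumulate the pc, then fall through.
        self.state["acc"] += self.state["pc"]
        self.state["pc"] += 1
        self.inst_num += 1

    def set_pc(self, inst_num, pc):
        """Make dynamic instruction `inst_num` the instruction at `pc`."""
        base = max(n for n in self.checkpoints if n <= inst_num)
        # Checkpoints taken after the restore point describe a discarded path.
        self.checkpoints = {n: s for n, s in self.checkpoints.items() if n <= base}
        self.state = copy.deepcopy(self.checkpoints[base])
        self.inst_num = base
        while self.inst_num < inst_num:    # replay up to the rollback point
            self.step()
        self.state["pc"] = pc
        # Remember the forced pc so a later replay passes through it.
        self.checkpoints[inst_num] = copy.deepcopy(self.state)


fm = CheckpointingFM(interval=4)
for _ in range(10):
    fm.step()
fm.set_pc(6, 100)   # restore checkpoint 4, replay insts 4 and 5, force pc
print(fm.inst_num, fm.state)
```

The extra checkpoint taken inside `set_pc` is a design choice of this sketch: without it, a later rollback that replays across instruction 6 would silently follow the original path instead of the forced one.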

38 Test of size 1/17/08RAMP Retreat 38 Why

39 Test of size 1/17/08RAMP Retreat 39 Why Are Simulators Slow? Complexity 40 entry fully-assoc iTLB 40 entry fully-assoc dTLB 16-way L2 cache Many schedulers Multiple Decode Out-of-order issue I/O Multiple cores, processors Modeling PARALLELISM Can we parallelize simulator?

40 Test of size 1/17/08RAMP Retreat 40 Parallelizing Simulators Software Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors Impractical If a software simulator could be parallelized, use same techniques to make the target faster? Hardware It’s what processors are made from Plenty of parallelism Difficult to implement? FPGAs for configurability

41 Test of size 1/17/08RAMP Retreat 41 FPGA Modeling Complex Processors on FPGA(s) Compile RTL for FPGA Pentium fits in large FPGA 3.1M transistors (Lu, Intel) Issues Fit: Core 2 in a single FPGA? Impossible: Processors grow as fast as FPGAs Multiple FPGAs to model one processor Full RTL required A lot of RTL Difficult to cover all cases Difficult to modify FPGA Pentium Pentium Core 2

42 Test of size 1/17/08RAMP Retreat 42 Strawman Solution Partition Break problem down into multiple pieces Target module boundaries obeyed? Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code Hybrid simulation Partitioned simulator, each partition running in potentially different host technology

43 Simple Module-based Software/Hardware Partitioning. Partitioning on module boundaries over FPGAs and software. Simplescalar + FPGA L1 DCache (Suh, WARFP 2006). [Figure: pipeline datapath with instruction cache/memory, GPR file, immediate extend, ALU, data cache/memory, and bypass.] IS THERE A BETTER PARTITIONING?

44 Test of size 1/17/08RAMP Retreat 44 FAST Functional Model (ISA) Timing Model (Micro-architecture) Instructions Architectural registers Peripheral functionality ….. Fetch Decode Rename Reservation stations Scheduling window Reorder buffer …. Inst trace FPGA

45 Test of size 1/17/08RAMP Retreat 45 More Complexity Caches/TLBs? Keep tags, pass address (virtual and physical if necessary) Hits, misses determined but don’t need data Superscalar (multiple issue)? “Fetch and issue” multiple instructions assuming they meet boundary constraints Multiple “functional units” Schedulers Reorder buffer/instruction window Pipeline control along with instructions NO DATAPATH only part of control path
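
The "keep tags, pass address" idea can be illustrated with a tag-only cache model: it decides hit or miss from the address alone and never stores data. This is a minimal sketch; the class name `TagOnlyCache`, the geometry, and the LRU replacement policy are assumptions, not details from the slides.

```python
class TagOnlyCache:
    """Set-associative cache timing model that stores tags but no data."""

    def __init__(self, num_sets, ways, line_bytes):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One list of tags per set, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)      # move to most-recently-used position
            return True
        if len(tags) == self.ways:
            tags.pop(0)           # evict the least recently used tag
        tags.append(tag)
        return False


cache = TagOnlyCache(num_sets=2, ways=2, line_bytes=64)
results = [cache.access(a) for a in (0, 0, 32, 128, 256, 0)]
print(results)
```

Since the functional model already produced the data values, the timing model only needs this hit/miss outcome to decide how many target cycles the access takes.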

46 Test of size 1/17/08RAMP Retreat 46 A Timing Model iTLBiCache dTLBdCache Align & Pick Decode Sched L2 Cache Functional Model Memory & I/O timing models Can Easily Execute in Parallel in Hardware

47 Test of size 1/17/08RAMP Retreat 47 Trace-Driven Simulation? Described a trace-driven simulator Accurate if “functional path” == “target path” “Ideal” processor Unfortunately, functional path is not always equal to timing path Even for simple pipelines Real processors speculatively execute, assuming their target path is the functional path

48 Branch Prediction. Wrong-path instructions! Implement BP in the timing model. The timing model forces the ISA simulator to mis-speculate. Rollback, restore. Branch predictor predictor in ISA simulator? BP only works in a processor if it's fairly accurate. FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path. Most complexity (BP, parallelism) can be handled this way. [Figure: timing model with iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, and memory & I/O timing models, fed by the Functional Model.]

49 Test of size 1/17/08RAMP Retreat 49 Speculative Simulation FAST simulators themselves are speculative Speculate functional path == target path Timing model detects functional path != target path Forces functional model down wrong path OR Returns functional model to functional path Speculation reduces need for round-trip communication Makes FAST latency tolerant (good for parallelization) Functional model reusable Timing model determines target behavior

50 Test of size 1/17/08RAMP Retreat 50 FAST Tool Flow Will be using Intel AWB for stats, etc.

51 Test of size 1/17/08RAMP Retreat 51 Performance Details Functional model Unoptimized modified QEMU Driving an FPGA-based timing model Standard producer/consumer parallel communication problems 469ns blocking round-trip from Opteron to FPGA Poll every other basic block BP is a simple gshare predictor Timing model (unoptimized) 100MHz host cycle right now (not pushing timing) Currently taking ~30 host cycles per target cycle, max about 54 cycles (currently max latency defines target clock) Complex timing model fits into single FPGA
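
For readers unfamiliar with gshare, the sketch below shows the general scheme: the global branch history is XORed with the branch PC to index a table of 2-bit saturating counters. The history length, table size, and initial counter value are assumptions for illustration; the slides do not give the parameters of the prototype's predictor.

```python
class Gshare:
    """Generic gshare branch predictor with 2-bit saturating counters."""

    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.history = 0                            # global taken/not-taken history
        self.counters = [1] * (1 << history_bits)   # start weakly not-taken

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Predict taken when the indexed counter is 2 or 3."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


bp = Gshare(history_bits=8)
first_guess = bp.predict(0x400)
for _ in range(20):
    bp.update(0x400, taken=True)    # train on an always-taken branch
trained_guess = bp.predict(0x400)
print(first_guess, trained_guess)
```

In a FAST-style simulator a predictor like this lives in the timing model, and a disagreement between its prediction and the trace is what triggers the forced wrong path and later rollback.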

52 Test of size 1/17/08RAMP Retreat 52 Outline Motivation FAST: Parallelized complex core, full-system simulator RAMP-White: Parallel hosts running parallel targets

53 Test of size 1/17/08RAMP Retreat 53 RAMP-White Requirements Coherent shared memory experimental platform Configurable coherence protocol, engine Scalable to the same level as other RAMP machines 1K eventual target Down to 2 Full system (OS, I/O, etc.) Intentions ISA/Architecture independent (like all RAMP efforts) Use different cores Integrate components from other RAMP participants A test-bed for sharing IP

54 Test of size 1/17/08RAMP Retreat 54 Texas Modifications to RAMP-White New code in Bluespec rather than Verilog/VHDL Many advantages including interfaces, configurability My group’s hardware development is exclusively Bluespec Free/low cost to academics (www.bluespec.com) Start with XUP board We had XUP before BEE2 Support Leon3 and PowerPC Started with embedded PowerPC but Linux, non-MP core issues Restarted with Leon3 Linux, MP-capable core Get RAMP-White infrastructure to work Plan to port back to embedded PowerPC, easy to move to soft PowerPC

55 Test of size 1/17/08RAMP Retreat 55 High-Level Architecture Philosophy Flexibility Avoid wasted work Easy changes Module-agnostic Processors, network, I/O, etc. Interfaces Complete set of necessary interfaces All communication via messages Fixed fields, but fields are configurable “shims” connect components to White infrastructure Use existing IP

56 Test of size 1/17/08RAMP Retreat 56 RAMP-White Block Diagram Network Router Intersection Unit (IU) Memory Controller (MC) IO & Platform Devices Processor Network Interface (NIU) Coherent $ Proc dependent

57 Three Phase Approach to Hardware. H1: Incoherent shared memory. No hardware global cache, just global shared memory support; optional cache for local memory. However, software can maintain coherence if necessary: network virtual memory; run a simulator on top of the processor. Ring network. H2: Ring-based coherence (scalable bus). Requires a coherent cache, IU awareness. Running what is essentially a snoopy protocol: true coherence engine not required, but very restricted communication. Sufficient for testing, modeling many targets. H3: General network-based coherence. Requires general coherence engine, general network. H4: Different cores. [Figures: node diagrams for the phases, each node with IU, P, $$, MC, I/O; some nodes add a coherent cache (C$).]

58 Test of size 1/17/08RAMP Retreat 58 Operating System Issues with SMP OS on embedded PowerPC Incoherent cache Load-reservation/store-conditional instructions not MP capable Also missing TLB Invalidation & OpenPIC (interprocessor interrupts, bring-up) How scalable anyways? (1K processors) Four phase approach using RAMP-White hardware O1: Separate OS per core (PowerPC) working for XUP Region of memory is global (mmap) Locks using regular loads/stores + sequential consistency O2: SMP OS on Leon3 Use RAMP-White scalable hardware across multiple FPGAs Snoopy cache O3: SMP OS on Leon3 using directory cache O4: SMP OS on PowerPC port

59 Programmer View. Sequential consistency. PowerPC: global addresses labeled as uncached (ordered accesses from PowerPC 405); coherent global cache still uncached from processor. Soft cores can be weaker. User interface: terminal per core/OS if desired; mmap to map shared memory.

60 Test of size 1/17/08RAMP Retreat 60 H1/O1 RAMP-White Hari Angepat did the work Components Written in Bluespec NIU code complete and tested 2 processor ring (PowerPC) IU code complete and tested Processor Slave (no coherence right now) PLB Master/slave interface (I/O) NIU interface Hardware intended to target different ISAs PLB master and slave shims written Some preliminary OS work Multi-image mmap interface running

61 Test of size 1/17/08RAMP Retreat 61 Current RAMP-White Phase 1 Intersection Unit (IU) IO & Platform Devices PPC 405/ Leon3 Network Interface (NIU) Memory Controller (MC) PLB shim Intersection Unit (IU) PPC 405/ Leon3 Network Interface (NIU) Linux

62 Test of size 1/17/08RAMP Retreat 62 Our Long Term Plans H1/O1, XUP working June 2007 PowerPC With multi-OS, limited device support H2/O2 BEE2, Leon3 Coherent cache, IU forwarding modifications SMP H3/O3 BEE2/BEE3, Leon3 Arbitrary network, cache coherency engine Getting network from Washington, Berkeley H4/O4 BEE2/BEE3, PowerPC Port back to PowerPC

63 Test of size 1/17/08RAMP Retreat 63 RAMP Conclusions RAMP-White architecture Phased approach minimizes wasted work Designed to be easy to modify for your purpose Many architectures only require modified coherence engine, maybe cache ISA/implementation agnostic Care taken to not be specific RAMP-White Phase 1 works Running on XUP Phase 2 close to working Running on BEE2

64 Future Work: FAST + RAMP-White. FAST is currently not scalable. Need parallel functional models and parallel timing models: a parallel functional model runs on a parallel host to maintain speed. Intention is to run FAST on top of RAMP-White: modify the soft core to support checkpoint, rollback in hardware. FAST provides cycle-accurate simulation of complex cores. RAMP provides a high-performance, scalable host. FAST+RAMP provides a scalable complex-core simulator.

65 Test of size 1/17/08RAMP Retreat 65 Acknowledgements Students Dam Sunwoo (FM) Joonsoo Kim (TM, Interface) Nikhil Patil (TM, tools) Bill Reinhart (TM connector) Hari Angepat (RAMP-White) Eric Johnson (verification, Linux) Funding DOE Early Career, NSF, SRC Intel, IBM, Xilinx, Freescale Software Bluespec Open-source full system simulators QEMU, Bochs

66 © Derek Chiou 66 Extra slides

67 What is in a Trace? Conceptually, everything a functional model can produce: flattened opcode, virtual/physical address of instruction, virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc. Can be heavily compressed, e.g., simulator TLB to avoid physical address.
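
One possible reading of "simulator TLB to avoid physical address" is that the trace receiver keeps its own page-translation table, so the sender ships a physical page number only when a translation is new or has changed. The sketch below illustrates that reading only; the message format, the class names, and the 4KB page size are assumptions, not the FAST trace encoding.

```python
PAGE_BITS = 12                       # assumed 4KB pages
OFFSET_MASK = (1 << PAGE_BITS) - 1


class TraceCompressor:
    """Sender side: omit the physical page once the receiver knows it."""

    def __init__(self):
        self.sent = {}               # virtual page -> physical page already sent

    def encode(self, vaddr, paddr):
        vpage, ppage = vaddr >> PAGE_BITS, paddr >> PAGE_BITS
        if self.sent.get(vpage) == ppage:
            return (vaddr,)          # short form: virtual address only
        self.sent[vpage] = ppage
        return (vaddr, ppage)        # long form: also carries the translation


class TraceDecompressor:
    """Receiver side: rebuild physical addresses from its own table."""

    def __init__(self):
        self.tlb = {}

    def decode(self, message):
        vaddr = message[0]
        vpage = vaddr >> PAGE_BITS
        if len(message) == 2:
            self.tlb[vpage] = message[1]
        paddr = (self.tlb[vpage] << PAGE_BITS) | (vaddr & OFFSET_MASK)
        return vaddr, paddr


enc, dec = TraceCompressor(), TraceDecompressor()
for vaddr, paddr in [(0x1234, 0x7234), (0x1238, 0x7238), (0x1234, 0x9234)]:
    message = enc.encode(vaddr, paddr)
    print(len(message), [hex(x) for x in dec.decode(message)])
```

Accesses to an already-translated page shrink to a single field, and a remapping is handled by resending the long form.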

68 Test of size 1/17/08RAMP Retreat 68 Modules Modules predict what happens each cycle Which instruction is scheduled? Modules hierarchically constructed CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB Many are essentially wires (e.g., ALU) Often written to execute one operation (simplifies) E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher associativity Simplifies implementation, tradeoff time for space Joonsoo Kim, Nikhil Patil

69 Connector Interface. Commit: indicates the module is done with the target cycle. Done: indicates the Connector is done with the target cycle. Enq, First, Deq: just like a FIFO. [Figure: connector with Enq, Commit, Done on one side and Deq, Commit, First, Done on the other.] (Bill Reinhart)

70 Test of size 1/17/08RAMP Retreat 70 32b Address in Shared Memory Machine?? 4GB possible per BEE2 FPGA Need more than 32b Eventually, hope for 64b soft-core processors For now two options: live with 4GB space Or, provide one more layer of translation Physical address in certain region is global virtual address Translated by hardware to node + physical address Also useful for multiple OSs in single memory OSs tend to assume they own physical address 0
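
The extra translation layer described here can be sketched as follows: addresses inside a designated region are treated as global virtual addresses and mapped to a (node, local physical address) pair. All constants and the fixed-slice-per-node layout are assumptions chosen for illustration; the slides do not specify the actual mapping.

```python
GLOBAL_BASE = 0x8000_0000        # assumed start of the global region
GLOBAL_SIZE = 0x4000_0000        # assumed size of the global region (1GB)
NODE_SLICE = 0x1000_0000         # assumed bytes of global memory per node
LOCAL_SLICE_BASE = 0x0400_0000   # assumed location of the slice in each node


def translate(paddr):
    """Map a processor physical address to (node, local physical address).

    Returns None for addresses outside the global region, which stay local
    and untranslated.
    """
    if not (GLOBAL_BASE <= paddr < GLOBAL_BASE + GLOBAL_SIZE):
        return None
    offset = paddr - GLOBAL_BASE
    node = offset // NODE_SLICE
    return node, LOCAL_SLICE_BASE + offset % NODE_SLICE


for addr in (0x0000_1000, 0x8000_0010, 0x9000_0020):
    print(hex(addr), translate(addr))
```

With a mapping like this, every OS image can still own its physical address 0 locally, while the shared region is carved out of each node's memory above `LOCAL_SLICE_BASE`.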

71 Node Architecture. [Figure: node diagrams built from IU, P, $$, MC, and I/O blocks; some nodes add a coherent cache (C$).]

72 Test of size 1/17/08RAMP Retreat 72 Generalized Architecture Proc IUNIUMC $ Mem OPB bridge Intersection Unit Network Interface Unit PLB Proc dependent Proc independent

73 Test of size 1/17/08RAMP Retreat 73 Intersection Unit Processor interface Slave Snoop Network interface Master (send) Slave (receive) Memory interface Master (issue memory requests) Hooks for coherency engine Bluespec nice to specify coherence engine Incoherent version is a special case Programmable memory regions Global (local and remote) Local translation Intersection Unit (IU) Memory Controller (MC) IO & Platform Devices Processor Network Interface (NIU) Coherent $

74 Network Interface Unit. Currently two virtual channels. Split into two components: msg composition/queuing; net transmit/receive (insert/extract for ring), intended to permit other net-specific transmit/receive. One input/one output: creates a simple unidirectional ring; can interface to more advanced fabrics. [Figure: block diagram with Processor, Coherent $, Intersection Unit (IU), Network Interface (NIU), Memory Controller (MC), IO & Platform Devices.]

75 Test of size 1/17/08RAMP Retreat 75 Sharing IP: Some Preliminary Experience We looked at RAMP-Red XUP Used some code (PLB master) Red-BEE is not ready to distribute Looking for switch code Berkeley’s code on CVS repository But, we can’t use memory controller because we don’t have BEE2 board yet Bluespec We are spinning almost all of our own code right now Would like to steal software OS (kernel proxy) SMP OS port Naming MPI reference design in BEE2 repository Is that RAMP-Blue? A central CVS repository for RAMP code?

76 Test of size 1/17/08RAMP Retreat 76 Sharing Over the Long Term Processor is shared Leon PowerPC MicroBlaze Everything else MC is shared Xilinx or Berkeley Coherent cache can be shared Transactional/traditional Borrow Stanford’s? Coherency engine can be shared CMU/Stanford IU functionality can be shared Trying to make ours general NIU can be shared Borrow half from Berkeley? Network can be shared Borrow Berkeley’s? Proc IUNIUMC $ Mem Peripherals CCE

77 IU Internal Message Defaults. PRI: high priority, low priority. CMD: Read, Write, Coherence, … PERM: Modified, Exclusive, Shared, Invalid. SIZE: byte, word, double word, cache-line. GADDR: global address (translated by IU). DATA: dependent on size. Bluespec permits easy modification for your protocol. Message layout: PRI | CMD | PERM | SIZE | TAG | GADDR | DATA.
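
The message fields listed on this slide can be written down as a small record type. This is a software sketch, not the Bluespec definition: the enum members mirror the slide, but the byte counts per SIZE value (including the 32-byte cache line) and the rule that data is either absent or exactly SIZE bytes are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Pri(Enum):
    HIGH = 0
    LOW = 1


class Cmd(Enum):
    READ = 0
    WRITE = 1
    COHERENCE = 2


class Perm(Enum):
    MODIFIED = 0
    EXCLUSIVE = 1
    SHARED = 2
    INVALID = 3


class Size(Enum):
    BYTE = 0
    WORD = 1
    DOUBLE_WORD = 2
    CACHE_LINE = 3


# Assumed byte counts for each SIZE encoding.
SIZE_BYTES = {Size.BYTE: 1, Size.WORD: 4, Size.DOUBLE_WORD: 8, Size.CACHE_LINE: 32}


@dataclass
class IUMessage:
    pri: Pri
    cmd: Cmd
    perm: Perm
    size: Size
    tag: int
    gaddr: int          # global address, translated by the IU
    data: bytes = b""   # empty for requests that carry no payload

    def __post_init__(self):
        if self.data and len(self.data) != SIZE_BYTES[self.size]:
            raise ValueError("DATA length must match SIZE")


write = IUMessage(Pri.LOW, Cmd.WRITE, Perm.MODIFIED, Size.WORD,
                  tag=1, gaddr=0x8000_0000, data=bytes(4))
read = IUMessage(Pri.HIGH, Cmd.READ, Perm.SHARED, Size.CACHE_LINE,
                 tag=2, gaddr=0x8000_0040)
print(write.cmd.name, len(write.data), read.cmd.name, len(read.data))
```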

78 Network Message. PRI: High and Low. DEST, SRC: destination, source of message. SIZE: total message size. NETTAG: network tag (optional). CMD: network command (optional). MESSAGE: data. [Figure: message layout with fields PRI, DEST, SRC, NETTAG, CMD, MESSAGE, and a SIZE label.]

79 Test of size 1/17/08RAMP Retreat 79 Intersection Unit Internals Intersection Unit Controller Memory Controller & DRAM Controller BRAMs ProcIONetProcIONet Global Address Translation hardware

80 © Derek Chiou 80 Fast, Full-System, Cycle-Accurate Computer Simulators via Parallelization Derek Chiou University of Texas at Austin Electrical and Computer Engineering

81 Test of size 1/17/08RAMP Retreat 81 My Ideal Computer Simulator Fast: as fast as possible 2-3 orders of magnitude slower than target? Fast enough to run real datasets to completion Interactive? Timely: enough time to make decisions Accurate: produce cycle-accurate numbers Complete: run unmodified operating systems, applications,… Transparent: full visibility, no performance hit Inexpensive: need thousands Flexible: quick changes, generate from RTL

82 Current Software Simulators. Performance-detail tug-of-war: higher performance means fewer simulation tasks; higher detail means more simulation tasks.
Simulator | Slowdown | Sim 2 min
ISA (SimNow, Simics) | 1-100 | 10 min
Caches | 10-1000 | 10 hrs
OOO (ASim ~1-10KIPS) | 10K-1M | 1 year
RTL | 100M-1B | 2000 years
CMP | ?? | ??

83 Test of size 1/17/08RAMP Retreat 83 Why So Slow? Complexity 40 entry fully-assoc iTLB 40 entry fully-assoc dTLB 16-way L2 cache Many schedulers Decoder I/O Modeling PARALLELISM Parallelize simulator? Sampling, benchmarking are orthogonal

84 Test of size 1/17/08RAMP Retreat 84 Outline Motivation How to Parallelize? Functional Models and Timing Models Status, Conclusions

85 Test of size 1/17/08RAMP Retreat 85 Parallelizing Simulators Software Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors Impractical If a software simulator could be parallelized, use same techniques to make the target faster? Hardware It’s what processors are made from Plenty of parallelism Difficult to implement? FPGAs for configurability

86 Test of size 1/17/08RAMP Retreat 86 FPGA Modeling Complex Processors on FPGA(s) Compile RTL for FPGA Pentium fits in large FPGA 3.1M transistors (Lu, Intel) Issues Fit: Core 2 in a single FPGA? Impossible: Processors grow as fast as FPGAs Multiple FPGAs to model one processor Full RTL required A lot of RTL Difficult to cover all cases Difficult to modify FPGA Pentium Pentium Core 2

87 Test of size 1/17/08RAMP Retreat 87 Strawman Solution Partition Break problem down into multiple pieces Target module boundaries obeyed? Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code Hybrid simulation Partitioned simulator, each partition running in potentially different host technology

88 Test of size 1/17/08RAMP Retreat 88 Simple Module-based Software/Hardware Partitioning Partitioning on module boundaries over FPGAs and software Simplescalar + FPGA L1 DCache (Suh, WARFP 2006) [Figure: pipeline datapath (PC, instruction cache/memory, GPR file, immediate extend, ALU, data cache/memory, bypass)] IS THERE A BETTER PARTITIONING?

89 Test of size 1/17/08RAMP Retreat 89 Our Partitioning: Functionality/Timing Boundaries Proven Software Partitioning Asim, FastSim, etc. Factorized, not partitioned! Functional simulators exist Timing model becomes very simple Promotes reuse Functional/Timing Simplifies timing model Latency tolerant Separate FM & TM Software functional model? Hardware timing model? Balance time/space Functional Model (ISA + peripherals) Timing Model (Micro-architecture) Instructions Architectural registers Peripheral functionality ….. Caches Arbitration Pipelining Associativity …. Inst trace

90 Test of size 1/17/08RAMP Retreat 90 Partition on ISA/Timing Proven Partitioning Asim, Simplescalar, Timing-First, FastSim, etc. Simplifies simulator. Promotes reuse Same performance in software Asim at 10KHz Most of the time spent in timing model! Hardware??? Functional Model (ISA) Timing Model (Micro-architecture) Instructions Architectural registers Peripheral functionality ….. Fetch Decode Rename Reservation stations Scheduling window Reorder buffer …. Inst trace

91 Test of size 1/17/08RAMP Retreat 91 FAST Functional Model (ISA) Timing Model (Micro-architecture) Instructions Architectural registers Peripheral functionality ….. Fetch Decode Rename Reservation stations Scheduling window Reorder buffer …. Inst trace FPGA

92 Test of size 1/17/08RAMP Retreat 92 What is in a Trace? Conceptually, everything a functional model can produce Flattened opcode, virtual/physical address of instruction, virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc. Can be heavily compressed Eg., simulator TLB to avoid physical address

93 Test of size 1/17/08RAMP Retreat 93 What is a FAST Timing Model? [Figure: pipeline datapath driven by the trace (PC, instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock)] Stats gathering in hardware => no performance impact

94 Test of size 1/17/08RAMP Retreat 94 More Complexity Caches/TLBs? Keep tags, pass address (virtual and physical if necessary) Hits, misses determined but don’t need data Superscalar (multiple issue)? “Fetch and issue” multiple instructions assuming they meet boundary constraints Multiple “functional units” Schedulers Reorder buffer/instruction window Pipeline control along with instructions NO DATAPATH only part of control path
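The "keep tags, pass address" idea can be sketched in Python as a cache model that decides hit or miss without ever storing data. This is a software illustration only, not the FAST hardware module; the class name, the LRU replacement policy, and the example geometry are assumptions made for the sketch.

```python
class TagOnlyCache:
    """Set-associative cache timing model: stores tags, never data (LRU assumed)."""

    def __init__(self, size_bytes: int, ways: int, line_bytes: int):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # Each set is a list of tags ordered from most to least recently used.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss; update tags and statistics."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        hit = tag in tags
        if hit:
            tags.remove(tag)
            self.hits += 1
        else:
            self.misses += 1
            if len(tags) == self.ways:
                tags.pop()  # evict the least recently used tag
        tags.insert(0, tag)  # this line is now the most recently used
        return hit


# Example geometry: 32KB, 4-way, 16B lines.
cache = TagOnlyCache(32 * 1024, 4, 16)
for addr in (0x1000, 0x1004, 0x2000):
    print(hex(addr), "hit" if cache.access(addr) else "miss")
```

Because only tags are kept, the storage cost is a small fraction of the modeled cache, which is what lets large target caches fit in modest FPGA block RAM.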

95 Test of size 1/17/08RAMP Retreat 95 A Timing Model [Figure: timing model blocks: iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, Memory & I/O timing models; Functional Model]

96 Test of size 1/17/08RAMP Retreat 96 Driving a Timing Model [Figure: timing model blocks: iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, Memory & I/O timing models; Functional Model] Can Easily Execute in Parallel in Hardware

97 Test of size 1/17/08RAMP Retreat 97 Trace-Driven Simulation? What we’ve described is a trace-driven simulator Accurate if “functional path” == “target path” “Ideal” processor Unfortunately, functional path is not always equal to timing path Real processors speculatively execute, assuming their target path is the functional path

98 Test of size 1/17/08RAMP Retreat 98 Branch Prediction [Figure: timing model blocks: iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, Memory & I/O timing models; Functional Model] Wrong-path instructions! Implement BP in timing model Timing model forces ISA simulator to mis-speculate Rollback, restore Branch predictor predictor in ISA simulator? BP only works in processor if it’s fairly accurate FAST simulators take advantage of the fact that most of the time micro-architecture is on the right path Most complexity (BP, parallelism) can be handled this way

99 Test of size 1/17/08RAMP Retreat 99 Speculative Simulation FAST simulators themselves are speculative Speculate functional path == target path Timing model detects functional path != target path Forces functional model down wrong path OR Returns functional model to functional path Speculation reduces need for round-trip communication Makes FAST latency tolerant (good for parallelization) Functional model reusable Timing model determines target behavior
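To illustrate why speculation reduces round-trip communication, the Python sketch below counts how often a timing model would have to redirect the functional model while consuming a trace. The trace format, the `count_redirects` name, and the fall-through predictor are all hypothetical simplifications; a real timing model would use its modeled branch predictor.

```python
def count_redirects(trace, predict):
    """Count trace entries where the timing model's predicted next PC differs
    from the functional path, i.e. where it must talk back to the functional model.

    trace:   list of (pc, actual_next_pc) pairs produced by the functional model
    predict: function mapping a pc to the timing model's predicted next pc
    """
    redirects = 0
    for pc, actual_next in trace:
        if predict(pc) != actual_next:
            redirects += 1  # force wrong path now, restore when the branch resolves
    return redirects


# A three-instruction loop body executed four times; pc 2 is the backward branch.
trace = []
for i in range(4):
    trace += [(0, 1), (1, 2), (2, 0 if i < 3 else 3)]

# Hypothetical predictor that always predicts fall-through.
redirects = count_redirects(trace, lambda pc: pc + 1)
print(f"{redirects} redirects for {len(trace)} instructions")
```

Every entry that does not trigger a redirect is one the functional model produced without waiting on the timing model, so a better predictor in the target means fewer redirects and a faster simulation.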

100 Test of size 1/17/08RAMP Retreat 100 A Parallel Simulator Functional runs in parallel with timing Functional model can run in parallel OoO superscalar processor? Timing model runs in parallel in hardware Hard to not run in parallel Every register-to-register transition could potentially be parallelized

101 Test of size 1/17/08RAMP Retreat 101 Outline Motivation How to Parallelize? Functional Models, Timing Models and Tools Status, Related Work, Conclusions

102 Test of size 1/17/08RAMP Retreat 102 Functional Models Hardware functional models Difficult to make complete (use Protoflex methods?) Size Software functional models exist today Fast Full system Very usable Bochs, QEMU, Simics, SimNow, SimOS, etc. Runs on fastest hardware we know about to execute an ISA Can we use them? What modifications?

103 Test of size 1/17/08RAMP Retreat 103 Functional Model Modifications Need to Rollback, force branch Rollback, restore and continue How? set_pc(inst_num, pc) Set a particular dynamic instance of an instruction to a particular instruction pointed to by PC Sufficient Can be implemented with checkpoints ISA state, memory, peripherals BR
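A minimal Python sketch of how `set_pc(inst_num, pc)` could sit on top of checkpoints. The two-instruction toy ISA, the class name, and checkpointing before every instruction are assumptions made for illustration; the QEMU-based functional model is not shown here, and memory and peripheral state are omitted.

```python
import copy


class ToyFunctionalModel:
    """Toy functional model: runs ahead, emits trace entries, supports set_pc."""

    def __init__(self, program):
        self.program = program   # list of (op, arg) tuples in a hypothetical ISA
        self.pc = 0
        self.regs = {"acc": 0}
        self.inst_num = 0        # dynamic instruction number
        self.checkpoints = {}    # inst_num -> (pc, regs) before that instruction

    def step(self):
        """Execute one instruction and return its trace entry."""
        # Checkpoint before every instruction for simplicity; a real model
        # would checkpoint less often and replay forward.
        self.checkpoints[self.inst_num] = (self.pc, copy.deepcopy(self.regs))
        op, arg = self.program[self.pc]
        entry = (self.inst_num, self.pc, op)
        if op == "addi":
            self.regs["acc"] += arg
            self.pc += 1
        elif op == "bnez":       # branch to arg if acc != 0
            self.pc = arg if self.regs["acc"] != 0 else self.pc + 1
        else:
            raise ValueError(f"unknown op {op!r}")
        self.inst_num += 1
        return entry

    def set_pc(self, inst_num, pc):
        """Make dynamic instruction inst_num be the instruction at pc."""
        _, regs = self.checkpoints[inst_num]
        self.regs = copy.deepcopy(regs)
        self.pc = pc
        self.inst_num = inst_num
        # Checkpoints at or after the rollback point are no longer valid.
        self.checkpoints = {n: c for n, c in self.checkpoints.items() if n < inst_num}


PROGRAM = [("addi", 1), ("bnez", 3), ("addi", 10), ("addi", 100)]
fm = ToyFunctionalModel(PROGRAM)
print([fm.step() for _ in range(3)])  # functional path: branch at pc 1 is taken
fm.set_pc(2, 2)                       # timing model predicted not-taken: go down wrong path
print(fm.step())
fm.set_pc(2, 3)                       # branch resolved: return to the functional path
print(fm.step())
```

The same call serves both cases on the slide: forcing the wrong path and restoring the right one differ only in the PC that is passed.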

104 Test of size 1/17/08RAMP Retreat 104 Our Functional Model Derived from QEMU Fast Boots Linux, Windows x86, x86-64, PowerPC, Sparc, ARM, MIPS, … We support x86 for now Added tracing, checkpoint, rollback Runs on x86 hosts as well as embedded PowerPC host inside FPGA PowerPC in the works Dam Sunwoo

105 Test of size 1/17/08RAMP Retreat 105 Modular Timing Models Modules Models timing functionality Rename, caches, etc. Connectors Attach modules together Abstract timing from modules Throughput (input, output), delay, maxTransactions Stats and tracing

106 Test of size 1/17/08RAMP Retreat 106 Modules Modules predict what happens each cycle Which instruction is scheduled? Modules hierarchically constructed CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB Many are essentially wires (e.g., ALU) Often written to execute one operation (simplifies) E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher associativity Simplifies implementation, trades time for space Joonsoo Kim, Nikhil Patil

107 Test of size 1/17/08RAMP Retreat 107 Connector Interface Commit Indicates module is done with target cycle Done Indicates Connector is done with target cycle Enq, First, Deq Just like FIFO [Figure: Connector between two modules; producer side: Enq, Commit, Done; consumer side: Deq, First, Commit, Done] Bill Reinhart
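A rough software analogue of the Connector, written in Python as a sketch. The parameter names follow the characteristics listed on the Modular Timing Models slide (throughput, delay, maxTransactions), but the class, its method names, and the exact semantics of advancing the target cycle are assumptions; the real Connector is a Bluespec hardware module.

```python
from collections import deque


class Connector:
    """FIFO-like link between two modules that also models target-cycle timing."""

    def __init__(self, in_throughput, out_throughput, min_delay, max_transactions):
        self.in_throughput = in_throughput        # max enqueues per target cycle
        self.out_throughput = out_throughput      # max dequeues per target cycle
        self.min_delay = min_delay                # target cycles before an entry is visible
        self.max_transactions = max_transactions  # max entries in flight
        self.cycle = 0
        self.entries = deque()                    # (ready_cycle, item)
        self.enq_count = 0
        self.deq_count = 0
        self.producer_committed = False
        self.consumer_committed = False

    def can_enq(self):
        return (self.enq_count < self.in_throughput
                and len(self.entries) < self.max_transactions)

    def enq(self, item):
        if not self.can_enq():
            raise RuntimeError("enq not allowed in this target cycle")
        self.entries.append((self.cycle + self.min_delay, item))
        self.enq_count += 1

    def can_deq(self):
        return (self.deq_count < self.out_throughput
                and bool(self.entries)
                and self.entries[0][0] <= self.cycle)

    def first(self):
        if not self.can_deq():
            raise RuntimeError("nothing visible in this target cycle")
        return self.entries[0][1]

    def deq(self):
        item = self.first()
        self.entries.popleft()
        self.deq_count += 1
        return item

    def commit_producer(self):
        self.producer_committed = True
        self._maybe_advance()

    def commit_consumer(self):
        self.consumer_committed = True
        self._maybe_advance()

    def _maybe_advance(self):
        # The target cycle advances only when both attached modules are done with it.
        if self.producer_committed and self.consumer_committed:
            self.cycle += 1
            self.enq_count = self.deq_count = 0
            self.producer_committed = self.consumer_committed = False


link = Connector(in_throughput=1, out_throughput=1, min_delay=2, max_transactions=4)
link.enq("inst0")
while not link.can_deq():
    link.commit_producer()
    link.commit_consumer()
print("visible at target cycle", link.cycle, "->", link.deq())
```

Keeping latency in the link rather than in the module is what allows a module such as the ALU to be little more than a wire between two Connectors.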

108 Test of size 1/17/08RAMP Retreat 108 Current Timing Model

109 Test of size 1/17/08RAMP Retreat 109 FAST Tool Flow Will be using Intel AWB for stats, etc.

110 Test of size 1/17/08RAMP Retreat 110 Outline Motivation How to Parallelize? Functional Models and Timing Models Status, Related Work, Conclusions

111 Test of size 1/17/08RAMP Retreat 111 Execution Platforms DRC Computer Opteron + FPGA on HT Xilinx development boards XUP, ML-310 (embedded PowerPC) Research Accelerator for MP (RAMP) A shared infrastructure for MP development Uses BEE2/BEE3 FPGA boards 4/5 large FPGAs and fast off-board I/O Large collaboration between Berkeley, CMU (Hoe), MIT, Stanford, Texas, Washington and Intel.

112 Test of size 1/17/08RAMP Retreat 112 Simulator Performance on DRC Includes Operating System Code

113 Test of size 1/17/08RAMP Retreat 113 Comparison to Related Work

114 Test of size 1/17/08RAMP Retreat 114 Acknowledgements Students Dam Sunwoo (FM) Joonsoo Kim (TM, Interface) Nikhil Patil (TM, tools) Bill Reinhart (TM connector) Eric Johnson (verification, Linux) Funding DOE Early Career, NSF, SRC Intel, IBM, Xilinx, Freescale Software Bluespec Open-source full system simulators QEMU, Bochs

115 Test of size 1/17/08RAMP Retreat 115 Conclusions Fast: Expect 2MIPS-10MIPS soon 2-4 orders of magnitude slower than target Interactive Timely: quick model building Accurate: produce cycle-accurate numbers Complete: runs Linux/Windows + apps now, compile apps on standard machines Transparent: Hardware stats provide full visibility, no performance hit Inexpensive: $300/$600 XUP boards can fit many timing models $9K (academic) DRC is still cost effective per cycle Flexible: tools

116 © Derek Chiou 116 Backup Slides

117 Test of size 1/17/08RAMP Retreat 117 Connectors Abstract timing information from modules Cannot do so perfectly (state within modules) Characteristics Throughput (input and output), minimum delay, maximum outstanding Entries are equal and fully shiftable If you have distinct lanes, need separate connectors Provide stats gathering, tracing 512 entry trigger-based trace Stats funneled BACK through connector helps with place-and-route (Nikhil Patil) Bill Reinhart

118 Test of size 1/17/08RAMP Retreat 118 Example Connector Interface Code (in ALU Module)
    module mkALU#(ConsumerPort#(Tuple2#(Maybe#(PReg_t), ROBTag_t)) inQ,
                  ProducerPort#(Tuple2#(Maybe#(PReg_t), Execute2ROB_t)) outQ) (ALU);

      // Pass each entry from the input Connector straight to the output Connector.
      rule pass;
        inQ.deq;
        match {.dest, .robtag} = inQ.first;
        outQ.enq(tuple2(dest, Execute2ROB_t{robtag: robtag, exception: Invalid}));
      endrule

      // Once either Connector is done with the target cycle, commit on both.
      rule done (inQ.done || outQ.done);
        inQ.commit(True);
        outQ.commit(True);
      endrule

    endmodule
ALU latency defined by Connector min-delay!

119 Test of size 1/17/08RAMP Retreat 119 FAST Tool Flow Will be using AWB for stats, etc.

120 Test of size 1/17/08RAMP Retreat 120 Microcode Compiler Uses the LLVM Compiler Infrastructure developed at UIUC (http://www.llvm.org) Compile the Bochs CPU (Bochs is another portable x86 full-system emulator.) Backend retargeted to the micro-op "ISA" Intention is to automate generation of new ISA instructions, automatically retarget new micro-architectures Generates microcode that “runs” on the timing model Nikhil Patil

121 Test of size 1/17/08RAMP Retreat 121 Example Microcode Compile "ADD r/m32, r32" instruction.
    void BX_CPU_C::ADD_EdGd(bxInstruction_c *i)
    {
      Bit32u op2_32, op1_32, sum_32;

      op2_32 = BX_READ_32BIT_REG(i->nnn());

      if (i->modC0()) {
        // Register destination
        op1_32 = BX_READ_32BIT_REG(i->rm());
        sum_32 = op1_32 + op2_32;
        BX_WRITE_32BIT_REGZ(i->rm(), sum_32);
      }
      else {
        // Memory destination: read-modify-write
        read_RMW_virtual_dword(i->seg(), RMAddr(i), &op1_32);
        sum_32 = op1_32 + op2_32;
        write_RMW_virtual_dword(sum_32);
      }

      SET_FLAGS_OSZAPC_32(op1_32, op2_32, sum_32, BX_INSTR_ADD32);
    }
When ModRM != 0xC0:
    %u0 = ADDRGEN
    LOADd %v0:[%u0] -> %u1
    cc,%u1 = add %u1, %nnn
    STOREd %v0:[%u0] <- %u1
When ModRM == 0xC0:
    cc,%rm = add %rm, %nnn

122 Test of size 1/17/08RAMP Retreat 122 RTL to Timing Model [Figure: pipeline datapath driven by the trace (PC, instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock)] Timing model perfectly models RTL Verification??? But, do you really want to do it this way?

123 Test of size 1/17/08RAMP Retreat 123 Traces to Timing Model Interface provides ability for functional model to pass instruction trace to timing model Storage that timing model components can read Compression TLB to eliminate physical address Cache static information (src/dest registers, opcode, etc.) Joonsoo Kim
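The "cache static information" form of compression can be sketched in Python as follows. The message tuples, class names, and field choices are hypothetical; the sketch only shows the idea that per-PC static fields are sent once and then referenced, while dynamic fields (here a data virtual address) are sent every time. Physical addresses are never sent, on the assumption that the timing side keeps its own TLB to recover them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StaticInfo:
    """Fields that do not change between executions of the same instruction."""
    opcode: str
    src_regs: Tuple[str, ...]
    dst_regs: Tuple[str, ...]


class TraceCompressor:
    """Functional-model side: send static info only the first time a PC is seen."""

    def __init__(self):
        self.known = {}  # pc -> StaticInfo already sent to the timing model

    def encode(self, pc: int, static: StaticInfo, data_vaddr: Optional[int] = None):
        if self.known.get(pc) == static:
            return ("short", pc, data_vaddr)
        self.known[pc] = static
        return ("full", pc, static, data_vaddr)


class TraceDecompressor:
    """Timing-model side: rebuild the full entry from its own copy of static info."""

    def __init__(self):
        self.known = {}

    def decode(self, msg):
        if msg[0] == "full":
            _, pc, static, data_vaddr = msg
            self.known[pc] = static
        else:
            _, pc, data_vaddr = msg
            static = self.known[pc]
        return pc, static, data_vaddr


load = StaticInfo("load", ("r1",), ("r2",))
enc, dec = TraceCompressor(), TraceDecompressor()
for vaddr in (0x1000, 0x1004, 0x1008):
    msg = enc.encode(0x400, load, vaddr)
    print(msg[0], dec.decode(msg))
```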

124 Test of size 1/17/08RAMP Retreat 124 Adding Modules Easy to add modules to improve accuracy No functionality, only timing E.g., DRAM model Model banks, row register, control timing, refresh Results in some requests taking longer If modeling memory controller, reorder memory operations as well
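As a sketch of a timing-only DRAM module, the Python class below tracks the open row per bank and returns a latency for each request, with no data stored. The class name, the address mapping, and the latency values are illustrative assumptions; refresh and controller reordering are left out.

```python
class DramTimingModel:
    """Per-bank open-row tracking; returns a latency in target cycles per access."""

    def __init__(self, num_banks=8, row_bytes=2048, t_row_hit=10, t_row_miss=30):
        self.num_banks = num_banks
        self.row_bytes = row_bytes
        self.t_row_hit = t_row_hit      # hypothetical latency when the row is open
        self.t_row_miss = t_row_miss    # hypothetical latency to open a new row
        self.open_row = [None] * num_banks

    def access(self, addr: int) -> int:
        row_id = addr // self.row_bytes
        bank = row_id % self.num_banks  # consecutive rows interleave across banks
        row = row_id // self.num_banks
        if self.open_row[bank] == row:
            return self.t_row_hit
        self.open_row[bank] = row       # the row register now holds the new row
        return self.t_row_miss


dram = DramTimingModel()
for addr in (0, 64, 2048 * 8, 0):
    print(hex(addr), dram.access(addr), "cycles")
```

Only some requests take longer, which matches the slide's point that the added module changes timing without adding any functionality.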

125 Test of size 1/17/08RAMP Retreat 125 Long Term Plans/Impact CMP/SMP support Early power estimation (with Freescale) Simulator available for software development/tuning before hardware True co-development Performance/power revs of software Design derived from simulator Write cycle-accurate simulator, automatically generate design (at RTL + libraries level) Change the way computer systems are designed and evaluated

126 Test of size 1/17/08RAMP Retreat 126 Demo Compile code, run on simulator BP Perfect, gshare See difference in simulator performance #ALUs 1, 8 Decode Fetch R/S ROB Rename L1$ BrALUTLBLd/st$ L2$ BP

127 Test of size 1/17/08RAMP Retreat 127 Better Partitioning? Traditional partitioning on module boundaries Timing and functionality Is there a better way? 64b adder cannot be implemented as a single monolithic entity But, 64 1b adders very tractable => Can compose a 64b adder from 64 x 1b adders Is there a better partitioning for simulation?
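The adder analogy can be made concrete with a short Python sketch that composes a 64-bit ripple-carry adder from 1-bit full adders. The function names are illustrative; the point is only that a simple piece, repeated, yields the larger function.

```python
def full_adder(a: int, b: int, cin: int):
    """1-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x: int, y: int, width: int = 64):
    """Compose `width` 1-bit full adders into one adder; returns (sum, carry out)."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_add(3, 5))
print(ripple_add(2**64 - 1, 1))
```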

128 Test of size 1/17/08RAMP Retreat 128 Current/Future Work Finish uni-processor version Very close, running on one platform Debug rest of pipeline Shared/coherent bus MP version Porting QEMU to MP host MP timing model Hardware functional model Tool chain

129 Test of size 1/17/08RAMP Retreat 129 FPGA Resources for TM OOO, branch prediction, ROB, two ALUs, 1 branch unit, 1 load/store unit (32 entry load/store queue), iTLB, dTLB 7221 slices (52% of a 2VP30) 9 block RAMs (6% of a 2VP30) NOTE: large structures have not yet been mapped to block RAMs (trace buffer, ROB) Configurable cache model (old Verilog version) 32KB 4-way set associative cache with 16B cache-lines 165 slices (1% of a 2VP30) 17 block RAMs (12% of a 2VP30) 2MB 4-way set-associative cache with 64B cache-lines 140 slices (1% of a 2VP30) 40 block RAMs (29% of a 2VP30) 2VP30 is an old FPGA found on a $600 list-price XUP board Current FPGAs 10 times as much logic (330K logic cells (2VP30 is around 30K)) 3 to 4 times as much block RAM

130 Test of size 1/17/08RAMP Retreat 130 Current Limitations No datapath Data speculation requires datapath Control path/data path crossover Informing loads, Query loads Can support But, can require additional communication Quite possible

131 Test of size 1/17/08RAMP Retreat 131 FAST Tool Flow

132 Test of size 1/17/08RAMP Retreat 132 RTL to Timing Model [Figure: pipeline datapath driven by the trace (PC, instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock)] Timing model perfectly models RTL Verification???

133 Test of size 1/17/08RAMP Retreat 133 Hardware Related Work FAST is first FPGA-based functional/timing model partitioned simulator that we know of HASim (Emer et al) FPGA-based processor emulators Academic: RAMP, MIT UMUM, Scale, Stanford FAST, ATLAS, CMU, … Intel (Shih-Lien Lu) !Flexible (huge effort), !Accurate (old processors, not modeling everything)

134 Test of size 1/17/08RAMP Retreat 134 Software Related Work FastSim: Schnarr, Larus (1998) Direct execution with rollback, memoization for speed Simplescalar accuracy Emer, et al: Asim Intel cycle-accurate, about 10KHz PTLSim: Yourst Full system x86, about 200KHz Seems similar to Simplescalar in terms of accuracy Uses Xen to fast forward

135 Test of size 1/17/08RAMP Retreat 135 Simulator Users and Tradeoffs Architects: e.g., Matt Performance/Power/Reliability Timely, ~Accurate, Flexible !Fast, !Complete Software: e.g., Wen-Mei Development, tuning Fast, OS !Accurate, !Flexible, !Timely Implementation (RTL) Correctness Accurate, ~Timely !Fast, !Complete

136 Test of size 1/17/08RAMP Retreat 136 Stolen from http://research.microsoft.com/si/PPT/HardwareModelingInfrastructure.pdf Impossible????

137 Test of size 1/17/08RAMP Retreat 137 Microprocessors One instruction executes to completion before next starts Implementation is different leading to different performance

138 Test of size 1/17/08RAMP Retreat 138 Microprocessors In-Order Front Out-of-Order Middle In-Order End

139 Test of size 1/17/08RAMP Retreat 139 Current Software Simulators Performance-Detail tug-of-war Higher performance means fewer simulation tasks Higher detail means more simulation tasks
Simulator | Slowdown | Sim 2 min
ISA (SimNow, Simics) | 1-100 | 10 min
Caches | 10-1000 | 10 hrs
OOO (ASim ~1-10KIPS) | 10K-1M | 1 year
RTL | 100M-1B | 2000 years
CMP | ???? |

140 Test of size 1/17/08RAMP Retreat 140 Why Build? (Anant) Software won’t work unless you are building hardware Motivation for software tools Large data sets Hard problems are better understood and show up once you begin building Have to solve hard problems More radical the idea, more important it is to build World only trusts end-to-end results Cycle simulator only becomes accurate after hardware gets precisely defined Needed for commercialization

141 Test of size 1/17/08RAMP Retreat 141 Functional Models Hardware functional models Difficult to make complete (use Protoflex methods?) Size Software functional models exist today Fast Full system Very usable Bochs, QEMU, Simics, SimNow, SimOS, etc. Runs on fastest hardware we know about to execute an ISA Can we use them? What modifications?

