Presentation is loading. Please wait.

Presentation is loading. Please wait.

ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs

Similar presentations


Presentation on theme: "ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs"— Presentation transcript:

1 ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs
Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai No need to dwell PROTOFLEX Our work in this area has been supported in part by NSF, IBM, Intel, Sun, and Xilinx.

2 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael
Overview Part I: Basic ProtoFlex Concepts FPGA Virtualization techniques BlueSPARC FPGA-accelerated simulator Part II: Hands-on Technology Preview Access CMU’s remote FPGA infrastructure Compile and run code within an FPGA-accelerated simulation of 16-CPU UltraSPARC III server Run real-time CMP cache simulations 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

3 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael
Part I: Basic Concepts 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

4 Full-system Functional Simulators
Convenient exploration of real (or future) HW Can boot OS, run commercial apps But too slow for large-scale (>100-way) MP studies Existing functional simulators single-threaded High instrumentation overhead Get rid of the first part Everyone should know MP research is interesting Parallel software tools (existing don’t scale) Big systems Existing SW simulators don’t scale I’d like to open my talk by first telling you about the current state of full-system, functional simulation. Full-System simulators are software-based tools that are widely used in both the computer Architecture and software research communities. These software tools can provide a very convenient Functional model of either real or non-existent hardware systems. These simulators also provide enough detail to Execute unmodified commercial operating systems and applications with no available source code. However, with the recent Multicore scaling trend, existing simulation tools are being pushed to the limit as we begin to study Systems incorporating up to hundreds or even thousands of processors. 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

5 FPGA-based simulation
Advantages Fast and flexible Low HW instrumentation performance overhead Sounds like “Hammer”, where’s the “nail” in this slide Slide 2, MP picture Slide 3, have same MP picture (but w/ FPGAs) 2 problems should come out obviously in this slide 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

6 FPGA-based simulation
Sounds like “Hammer”, where’s the “nail” in this slide Slide 2, MP picture Slide 3, have same MP picture (but w/ FPGAs) 2 problems should come out obviously in this slide Caveat: simulation w/ FPGAs more complex than SW Large-MP configurations (>100)  expensive to scale Full-system support need device models + full ISA 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

7 Reducing Complexity Hybrid Full-System Simulation
Multiprocessor Host Interleaving 1 2 P P P P P P P P P CPU P Cut down on the # of words These points get re-iterated 3x over in the talk Concrete illustration that’s consistent w/ figures in slide 2 and 3 Memory Devices 4-way P 4-way P Common-case behaviors Uncommon behaviors Memory 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

8 Outline Hybrid Full-System Simulation Multiprocessor Host Interleaving
BlueSPARC Implementation Conclusion 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

9 Hybrid Full-System Simulation
transplant PC Host Simulator FPGA host I/O instr P P P P P 3 CD FIBRE TERM 2 Memory SCSI GFX NIC PCI 1 3 ways to map target components FPGA-only Simulation-only Transplantable CPUs can fallback to SW by “transplanting” between hosts Only common-case instructions/behaviors implemented in FPGA Complex, rare behaviors relegated to software Said too much in this slide Point is really: transplants Cpu-only, FPGA-only eliminate from discussion talked too much in first half of talk Need to script 1st slide or two Rare + complex don’t always go together I/o instr not actually hard, its side-effects are Consistently use “full-system simulator host” throughout the talk PC Host looks like it’s actually running the devices Identify that components are simulated target components 1 2 3 Transplants reduce full-system complexity 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

10 Multiprocessor Host Interleaving
Advantages: Trade away FPGA throughput for smaller implementation Decouple logical simulated size from FPGA host size Host processor in FPGA can be made very simple # processors to model in functional simulator # soft cores implemented in FPGA 1-to-1 mapping between target and host CPUs P P P P P P P P P 1-to-1 P P P P P P P P Combine with slide 9 once we move this to slide 2 Too many new terms in this talk “Host Interleaving” 4-to-1 host interleaving P 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

11 FPGA Host Processor Architecturally executes multiple contexts
Existing multithreaded micro-architectures are good candidates Our design: Instruction-Interleaved Pipeline Switch in new processor context on each cycle Simple, efficient design w/ no stalling or forwarding Long-latency tolerance (e.g., cache miss, transplants) Coherence is “free” between contexts mapped to same pipeline 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

12 Outline Hybrid Full-System Simulation Host MP Interleaving
BlueSPARC Implementation Conclusion 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

13 BlueSPARC Simulator Functionally models 16-CPU UltraSPARC III Server
HW components are hosted on BEE2 FPGA platform Berkeley Emulation Engine 2 • 5 Virtex-II Pro 70s (~66 KLUTs) • 2 embedded PowerPCs per FPGA • Only use 2 out of 5 FPGAs (pipeline + DDR controllers) Virtutech Simics used as full-system I/O simulator 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

14 BlueSPARC Simulator Virtex II Pro 70 Core 2 Duo
16-way Pipeline PowerPC (Hard core) Simulated I/O Devices (using full-system simulator) Processor Bus Cut down the “top” portion Indicate latencies to PowerPC + Simics 12, 15, 14, 13 Interface to 2nd FPGA (memory) Ethernet Simics simulates devices, memory-mapped I/O and DMA (~ µsec) PowerPC simulates unimplemented SPARC instructions (~10-50 µsec) 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

15 Hybrid host partitioning choices
BlueSPARC PowerPC405 Simics on PC • Integer ALU • Register Windows • Traps/Interrupts • I-/D-MMU • Memory + Atomics • 36% special SPARC insns • Multimedia • Floating Point • Rare TLB operations • 64% special SPARC insns PCI bus ISP2200 Fibre Channel I21152 PCI bridge IRQ bus Text Console SBBC PCI device Serengeti I/O PROM Cheerio-hme NIC SCSI bus Just simply “summarize” what % of what is implemented Reduce # of points, eliminate SPARC-based stuff Make 3 columns to be consistent with This slide goes before pipeline slide Swap 14 to 13 Implementation = 60% of basic ISA, 36% of special SPARC insns, 0% devices 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

16 BlueSPARC host microarchitecture
Audience won’t know what “Simics” is… Use “full system simulator” instead of “Simics” Shorten “anecdote” 64-bit ISA, SW-visible MMU, complex memory  high # of pipeline stages 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

17 BlueSPARC Simulator (continued)
Processing Nodes 16 64-bit UltraSPARC III contexts 14-stage instruction-interleaved pipeline L1 caches Split I/D, 64KB, 64B, direct-mapped, writeback Non-blocking loads/stores 16-entry MSHR, 4-entry store buffer Clock frequency 90MHz on Xilinx V2P70 Main memory 4GB total Resources (Xilinx V2P70) 33.5 KLUTs (50%), 222 BRAMs (67%) w/o stats+debug 43.2 KLUTs (65%), 238 BRAMs (72%) Instrumentation All internal state fully traceable Attachable to FPGA-based CMP cache simulator Statistics 25K lines Bluespec HDL, 511 rules, 89 module types Software Runs unmodified Solaris + closed-source binaries Re-consider what I’m highlighting – compared to what I talk about Highlight Bluespec, resource Reduce precision of the numbers kBs of BRAMs EDA tools aren’t interesting Checkpointing is orthogonal – can get rid fo this 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

18 39x speedup on average over Simics-trace
Performance Too many discussion points Graph should go to 90 MIPS If important point, should be on the slide List exact slowdown relative to real system Don’t need “Perf comparable to Simics-fast” 39x speedup on average over Simics-trace 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

19 One more thing… Our latest application w/ BlueSPARC: Trace CMP … …
L1 Caches L2 Cache Instruction Caches 8 ways Statistics Memory References Cache Contents Data Caches 8-way Pseudo-LRU Statistics Statistics 16 64KB 2-way split I&D L1 caches 4MB 8-way L2 cache 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

20 BlueSPARC Demo on BEE2 BEE2 Platform
4 DDR2 Controllers + 4 GB memory Demo application On-Line Transaction Processing benchmark (TPC-C) in Oracle Runs in Solaris 8 (unmodified binary) FPGA + Memory directly loaded from Simics checkpoint Ethernet (to Simics on PC) Virtex-II Pro 70 (PowerPC & BlueSPARC) RS232 (Debugging) BEE2 Platform 20 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

21 Summary Two techniques for reducing complexity Future work
Hybrid full-system simulation MP host interleaving Future work Timing extensions Larger-scale implementation (hundreds of CPUs) Run-time instrumentation tools 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

22 Part II: Hands-on Tutorial
27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

23 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

24 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

25 Setup Procedures Divide up into FOUR groups
Each group needs 1 laptop with: Remote Desktop Connection for Windows XP Instructions: user/login: ramp/ramp 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

26 Live CMP Cache Statistics
Statistics updated in real-time on webpage: Click on tabs to watch statistics for any of the four BEE2 boards as they run simulations 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

27 Thanks! Any questions? Acknowledgements echung@ece.cmu.edu
Acknowledgements We would like to thank our colleagues in the RAMP and TRUSS projects. 27-Nov-18 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael


Download ppt "ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs"

Similar presentations


Ads by Google