Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computer Architecture Lab at ProtoFlex: An Architectural Exploration Vehicle Using FPGA-Accelerated Full-System Multiprocessor Simulations Eric S. Chung,

Similar presentations


Presentation on theme: "Computer Architecture Lab at ProtoFlex: An Architectural Exploration Vehicle Using FPGA-Accelerated Full-System Multiprocessor Simulations Eric S. Chung,"— Presentation transcript:

1 Computer Architecture Lab at ProtoFlex: An Architectural Exploration Vehicle Using FPGA-Accelerated Full-System Multiprocessor Simulations Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, and Ken Mai P ROTO F LEX Our work in this area has been supported in part by NSF, IBM, Intel, Sun, Xilinx, and Microsoft.

2 Field Programmable Gate Arrays What are FPGAs? –Sea of programmable logic, RAMs –Capacity: 30 RISC cores, 0.5-3MB –Speed: 100-300MHz 7/9/20162 Interconnect switch Lookup Tables (LUTs) FI 0 I 1 I 2 I 3 I 4 I 5 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 … LUTs can be used as a 6- input gate or a 64x1 RAM Truth Table BlockRAMs 64 x 1-bit SRAM I0I0 I1I1 I2I2 I3I3 I4I4 I5I5 F Typical applications –ASIC replacement, logic emulation –Communications, auto industry, consumer electronics, defense, etc.

3 The ProtoFlex Simulator History –Project started (circa 2007) to build a scalable, full-system multiprocessor simulator using Field-Programmable Gate Arrays Key Features –Functional simulation of 16-way UltraSPARC III server (~50-90 MIPS) –Using hybrid simulation, runs real server apps + Solaris OS –Employs multithreading to virtualize # CPUs per FPGA core 3 Hybrid Simulation Virtualization 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael

4 The ProtoFlex Simulator User perceives UltraSPARC III functional simulator –Software user interface –Applications load directly from Simics checkpoints –Features: state viewing, scripting, single-stepping, terminal, statistics 4 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael

5 Availability: Now What we’re giving out –Source RTL, scripts for SPARCV9 CPU model –XUPV5 Reference Design for EDK 10.1 –Virtutech Simics plug-ins for hybrid simulation –Top-level SW controller, user command-line interface –Documentation through online wiki Why open source? –Demonstration of FPGAs as viable architecture research vehicle –Facilitate adoption of hybrid simulation & host multithreading –Encourage building on top of our work 5

6 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael6 Tutorial Agenda Part I: Core Concepts (45min) –FPGA Virtualization techniques –Simulator implementation Part II: Hands-on Tutorial (2 hr) –Download, familiarize with tool flows –Access CMU’s remote FPGA infrastructure –Compile, run code within FPGA-simulated 4-CPU UltraSPARC III server

7 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael7 Part I: Core Concepts

8 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 8 Full-system Functional Simulators Convenient exploration of real (or future) HW –Can boot OS, run commercial apps But too slow for large-scale (>100-way) MP studies –Existing functional simulators single-threaded –High instrumentation overhead

9 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 9 FPGA-based simulation Advantages –Fast and flexible –Low HW instrumentation performance overhead

10 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 10 FPGA-based simulation Caveat: simulation w/ FPGAs more complex than SW –Large-MP configurations (>100)  expensive to scale –Full-system support  need device models + full ISA

11 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 11 Reducing Complexity Hybrid Full-System Simulation Multiprocessor Host Interleaving 21 CPU P Memory Devices P Common-case behaviors Uncommon behaviors Memory 4-way P P P P P P P P P

12 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 12 3 P Hybrid Full-System Simulation 3 ways to map target components FPGA-only Simulation-only Transplantable CPUs can fallback to SW by “transplanting” between hosts –Only common-case instructions/behaviors implemented in FPGA –Complex, rare behaviors relegated to software 1 2 3 PP Memory CD FIBRE GFX NIC PCI TERM SCSI PC Host Simulator FPGA host 1 2 I/O instr P P transplant Transplants reduce full-system complexity

13 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 13 Multiprocessor Host Interleaving # processors to model in functional simulator 1-to-1 mapping between target and host CPUs # soft cores implemented in FPGA 1-to-1 P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P 4-to-1 host interleaving P P P P P P P P P P P P P P P P P P P P Advantages: Trade away FPGA throughput for smaller implementation Decouple logical simulated size from FPGA host size Host processor in FPGA can be made very simple

14 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 14 FPGA Host Processor Architecturally executes multiple contexts –Existing multithreaded micro-architectures are good candidates Our design: Instruction-Interleaved Pipeline –Switch in new processor context on each cycle –Simple, efficient design w/ no stalling or forwarding –Long-latency tolerance (e.g., cache miss, transplants) –Coherence is “free” between contexts mapped to same pipeline

15 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 15 Outline Introduction The ProtoFlex Simulator Applications Platform Release

16 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 16 Top-level Simulator View FPGA Core 2 Duo BlueSPARC Core PowerPC (or Microblaze) Ethernet (or PCIe) Interface to Memory Processor Bus Simulated I/O Devices (using full-system simulator) Simics simulates devices, memory- mapped I/O and DMA (~500-1000 µsec) Nearby core simulates unimplemented SPARC instructions (~10-50 µsec)

17 The BlueSPARC Core Mirrors the Virtutech Simics model –64-bit SPARCV9 ISA + US III extensions –8 register windows, 4 global register files –512-entry D-TLB, 128 I-TLB Implementation –14-stage, multi-threaded pipeline, switch context on each cycle –On Virtex-5, XST~148MHz, Placed & routed @ 100MHz –Parameterized non-blocking caches –FP + rare MMU instructions are SW-emulated by nearby uBlaze 17 I-TLB Stage 1 I-TLB Stage 2 64-bit ALU Stage 1 Context Scheduler I-Fetch Address Generate I-Fetch Tag Check US III Decoder 64-bit ALU Stage 2 D-TLB Stage 1 D-TLB Stage 2 D-TLB Stage 3 D-Cache Address Generate D-Cache Tag Check Multi-Cycle Instruction Unit Nonblocking I-cache (BRAM) Writeback Arbiter to DDR Memory Integer RF (BRAM) Nonblocking D-cache (BRAM)

18 Why UltraSPARC III? Popular ISA used in architecture community –www.virtutech.com/academia/selected-pubs.htmlwww.virtutech.com/academia/selected-pubs.html –Many simulation users will find this model familiar High-quality reference model from Virtutech –Simics has complete & validated reference model (CPU and system devices) –Runs commercial out-of-box software and OS –Simics system simulation supports ~ 300 CPUs in target model Backwards compatibility –US III workloads brought up in Simics usable in ProtoFlex 18

19 Core Design Statistics Runs 100MHz on V5 –Synthesizes up to 148MHz using standard tools (ISE XST) Logic usage –23.5 KLUTs (11.3% LX330T) BRAM usage –120 BRAMs for 16-context configuration (37% LX330T) Future optimizations –Paging structures to SRAM or DRAM can reduce BRAM by significant amount 19

20 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 20 Hybrid host partitioning choices BlueSPARCPowerPC405Simics on PC Integer ALU Register Windows Traps/Interrupts I-/D-MMU Memory + Atomics 36% special SPARC insns Multimedia Floating Point Rare TLB operations 64% special SPARC insns PCI bus ISP2200 Fibre Channel I21152 PCI bridge IRQ bus Text Console SBBC PCI device Serengeti I/O PROM Cheerio-hme NIC SCSI bus Implementation = 60% of basic ISA, 36% of special SPARC insns, 0% devices

21 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 21 BlueSPARC on the BEE2 Platform Functionally models 16-CPU UltraSPARC III Server –HW components are hosted on BEE2 FPGA platform Virtutech Simics used as full-system I/O simulator 5 Virtex-II Pro 70s (~66 KLUTs) 2 embedded PowerPCs per FPGA Only use 2 out of 5 FPGAs (pipeline + DDR controllers) Berkeley Emulation Engine 2

22 Functional CMP Cache Model Functional Branch Predictor Model HW vs. SW Simulation Performance Simulation Infrastructure: –Full-system functional simulator (e.g. Simics) –Instrumentation (e.g. functional cache model) Functional Branch Predictor Model Functional CMP Cache Model ? BlueSPARC Simics SW-based HW-based SWHW NO Instrumentation Speedup: 1.23x MIPS

23 Functional CMP Cache Model Functional Branch Predictor Model HW vs. SW Simulation Performance Simulation Infrastructure: –Full-system functional simulator (e.g. Simics) –Instrumentation (e.g. functional cache model) Functional Branch Predictor Model Functional CMP Cache Model ? BlueSPARC Simics SW-based HW-based SWHW WITH Instrumentation Speedup: 37x MIPS

24 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 24 Outline Introduction The ProtoFlex Simulator Applications Platform Release

25 Using ProtoFlex Add passive monitors –Counters, histograms –Roll your own 25 I-TLB Stage 1 I-TLB Stage 2 64-bit ALU Stage 1 Context Scheduler I-Fetch Address Generate I-Fetch Tag Check US III Decoder 64-bit ALU Stage 2 D-TLB Stage 1 D-TLB Stage 2 D-TLB Stage 3 D-Cache Address Generate D-Cache Tag Check Multi-Cycle Instruction Unit Nonblocking I-cache (BRAM) Writeback Arbiter to DDR Memory Integer RF (BRAM) Nonblocking D-cache (BRAM) Trace-based simulation –Collect dynamic traces –Feed traces to functional-first timing model Sampled Program Monitoring –Use micro-blaze (or PPC) to monitor core/memory state –Unintrusive profiling w/o changes to target SW Counters Histogram Tracker Histogram Trackers Timing Model FPGA Hard/Soft Core (PowerPC or MicroBlaze)

26 Applications of ProtoFlex Examples –Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09] –Real-time stack trace profiling –CMP interconnect model (in progress) –Realistic CPU traffic generators (in progress) … running real 16-CPU server workloads –Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K 26 Piranha CMP Cache (First-Order Timing Model) Statistics + Warmed Coherency & Tag States

27 What does the RTL look like? We use Bluespec System Verilog (high-level, synthesizable HDL) –4-8 weeks learning curve for normal HDL users –Requires BSV compiler (free for academics) –Paper in MEMOCODE’09 describes BSV coding/validation of core Sample code: 27 rule split_ALU_pipeline (True); … p1 = piperegs[DECODE]; piperegs[ALU1] <= doALUStage1(p1, alu_ifc); p2 = piperegs[ALU1]; piperegs[ALU2] <= doALUStage2(p2, alu_ifc); … endrule rule merged_ALU_pipeline (True); … p1 = piperegs[DECODE]; p_tmp1 = doALUStage1(p1, alu_ifc); p_tmp2 = doALUStage2(p_tmp1, alu_ifc); piperegs[ALU] <= p_tmp2; … endrule 2-stage ALU 1-stage ALU

28 Other Simulator Features Changing Core Parameters –Number of CPU contexts –Cache sizes –Merge/split pipeline stages –Enable/disable modules for profiling & debugging –Clock frequency (tested @ 10 MHz – 100 MHz) –Set optimal LUTRAM size (16 = V2P, 64 = V5) –Choose LUTRAMs or BRAMs for any CPU state 28

29 Current Limitations Simulator is currently non-deterministic Simics checkpoint saving not included in release No hardware DP FPU No coherency between multiple BlueSPARCs 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 29

30 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 30 Outline Introduction The ProtoFlex Simulator Applications Platform Release

31 Platform Release: XUPV5 Why XUPv5? –Inexpensive (~$750), easily accessible –Standard tool flows (EDK, ISE) –Reference design portable to other platforms – just drop in our ‘pcores’ Supporting other platforms –Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP) –Contributions are welcomed! 31

32 Required Equipment 32 BlueSPARC PFMON + Simics Bitstream Loader + RS232 Terminal

33 33 XUPv5 Overview Virtex-5 LX110T DDR2 Memory –up to 2GB 1ΜΒ SRAM PCI Express x1 Serial Port

34 XUPv5 Plugged into Host System 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 34

35 SRAM Controller Multi-Port Memory Controller Reference Design Block Diagram 35 PLB BlueSPARCMicroBlaze PCI Express BRAM Serial Port BRAM XUPv5 LX110T DRAM SRAM

36 Controller Multi-Port Memory Controller BlueSPARC 36 PLB MicroBlaze BRAM Serial Port BRAM XUPv5 LX110T DRAM SRAM PCI Express Wrapped in EDK IP core –connects to PLB & NPI Runs @ 100MHz 4 CPU contexts 64KB I&D L1 caches BlueSPARC

37 SRAM Controller Multi-Port Memory Controller BlueSPARC 37 PLB BlueSPARCMicroBlaze PCI Express Serial Port XUPv5 LX110T DRAM SRAM PCI Express 81% utilization –Core 51% (76 out of 148) –Rest 30% (45 out of 148) BRAM

38 SRAM Controller Multi-Port Memory Controller PCI Express 38 PLB BlueSPARCMicroBlaze BRAM Serial Port BRAM XUPv5 LX110T DRAM SRAM 200 MB/sec bandwidth ~10 usec RTT latency Socket Abstraction PCI Express

39 SRAM Controller DDR2 Memory Controller 39 PLB BlueSPARCMicroBlaze Ethernet BRAM Serial Port BRAM XUPv5 LX110T DRAM SRAM PCI Express 1.5GB/s peak BW 115ns latency Multiple ports/interfaces Multi-Port Memory Controller

40 40 Linux PC Software requirements –OpenSuSE Linux 11.0 –Bluespec SystemVerilog Compiler v2008.11.C –Xilinx ISE 10.1, Xilinx EDK 10.1 –Virtutech Simics 3.0.22 –Synopsys VCS Y-2006.06 (optional for Verilog simulations) We provide –Linux PCI Express DMA driver –Simics hybrid simulation plug-in modules –ProtoFlex MONitor tool (PFMON)

41 41 Linux PC Runs PFMON (ProtoFlex MONitor) –Orchestrates communication between Simics & BlueSPARC –Provides CLI interface to simulator (like Simics Console) Runs Simics –Handles I/O, FPGA Core and memory initialization BlueSPARC PFMON Linux PC Simics

42 42 From RTL to Running System Bluespec  Verilog – Bluespec compiler – ~30 minutes Verilog  Bitstream – Xilinx EDK – ~ 3 hours Bitstream  Working System – Stream mem. image over PCIe – ~ 1 minute (for 1GB image) Bluespec code Verilog code Bitstream 1 2 3 1 2 Working System 3

43 XUPv5 Caveats Differences from BEE2 platform –Uses Virtex-5 LX110T  no PowerPC core, fewer BRAMs –Uses soft microblaze core instead  5X slower than PPC –Lower memory capacity (1 GB max) Our implementation on XUPv5 –Uses PCI express instead of ethernet  fast mem loading –Limited to simulating 4-CPU system –Not enough resources for “extras” (e.g., cache simulator) 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 43

44 9-Jul-16RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 44 BlueSPARC Demo on BEE2 44 Demo application –On-Line Transaction Processing benchmark (TPC-C) in Oracle –Runs in Solaris 8 (unmodified binary) –FPGA + Memory directly loaded from Simics checkpoint BEE2 Platform

45 45 BEE2 Demo Configuration 4 GB memory (4 DDR2 Controllers) Ethernet (to Simics on PC) Virtex-II Pro 70 (BlueSPARC & PowerPC) RS232 (Debugging) FACS (FPGA Accelerated Cache Simulator) Memory references

46 Demo Web-based Real-time Viewing of Statistics 46

47 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 47 Part II: Hands-on Tutorial

48 Objectives Download open source code Familiarize with standard tool flows Implement an FPGA design from scratch Prepare Simics workload and checkpoint Load Simics checkpoint into FPGA Extract runtime statistics 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 48

49 CMU Remote Setup 7/9/2016 49

50 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 50 Groups We have a total of FOUR remote workstations –Please divide evenly into 4 groups –We will assign a group number to you (1-4) Each group will need two laptops: 1) Remote Desktop Connection (Windows XP) 2) Web browser for displaying online instructions

51 Instructions Navigate to www.ece.cmu.edu/~protoflex 7/9/2016IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 51 1) Click Here 2) Register new account here

52 Thank you! Acknowledgments –Funding for this work provided by: NSF CCF-0811702, NSF CNS- 0509356, C2S2 Marco Center, SUN, and Microsoft –Paul Hartke & Xilinx for FPGA donations & email support Topics & References –Hybrid Simulation and FPGA Core Multithreading A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs [FPGA’08] –Functional-First, First-Order Timing Model for a CMP cache ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs [TRETS’09] –FPGA Core Design & Validation Using Flight Data Recorder Implementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation [MEMOCODE’09] 52

53 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 53 Thanks! Any questions? echung@ece.cmu.edu http://www.ece.cmu.edu/~protoflex Acknowledgements We would like to thank our colleagues in the RAMP and TRUSS projects.


Download ppt "Computer Architecture Lab at ProtoFlex: An Architectural Exploration Vehicle Using FPGA-Accelerated Full-System Multiprocessor Simulations Eric S. Chung,"

Similar presentations


Ads by Google