Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computer Architecture Lab at 1 P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi,

Similar presentations


Presentation on theme: "Computer Architecture Lab at 1 P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi,"— Presentation transcript:

1 Computer Architecture Lab at 1 P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai {echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu P ROTO F LEX/ S IM F LEX

2 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 2 Multiprocessor Functional Simulation Functionally simulating one processor in software is slow Simulating many processors is of course even slower Parallelism of FPGAs can scale up functional MP simulation perf  conduct large-scale (>64-way) SW research, cache simulations, perf sampling studies, etc. But we can’t forfeit full-ISA, full-system fidelity (run stock OS) Memory PCI Bus Ethernet controller Graphics card I/O MMU controller Disk DMA controller IRQ controller Terminal SCSI controller CPU FPGAs FPGAs = unprecedented level of scalability but full-system building effort can outweigh any benefits FPGAs = unprecedented level of scalability but full-system building effort can outweigh any benefits

3 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 3 Cpu Mem Combining FPGAs and simulators 3 ways to map target object to hybrid-simulation host Emulation-only Simulation-only Transplantable Transplant runtime system –target processors switch modes between FPGA & simulator hosts –processors need not execute 100% in FPGA mode e.g., implement only the frequently used ISA subset in FPGA 1 2 3 Target design FPGASimulator Mem Disk 1 2 3 Cpu I/O instr DMA Advantages: Leverage full-system simulators for reference designs Infrequent, complex behaviors remain simulated: TLB misses, block memory instrs, disk I/O instrs, SCSI disks, graphics, … Advantages: Leverage full-system simulators for reference designs Infrequent, complex behaviors remain simulated: TLB misses, block memory instrs, disk I/O instrs, SCSI disks, graphics, …

4 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 4 It Really Works + = SUN 3800 Server (1x UltraSPARC III, Solaris 8) Xilinx XUP Virtex-II Pro 30 Virtutech Simics (commercial simulator) Transplant & message interface Ethernet Simics UltraSPARC Simulated target devices Our SPARCV9 core Embedded PowerPC DDR memory developed in 6 months x86 also works BlueSPARC specs: 7k lines Bluespec UltraSPARC III ISA Validated against Simics w/ real apps (e.g., Solaris 8, SPEC2000, DB2, Oracle, etc.) 41% all instr groups implemented + MMU 8kB I/D direct-mapped caches multi-cycle func model (CPI ideal = 5 @ 100MHz) 16K LUTs (50% of XUP Virtex-II Pro 30) BlueSPARC specs: 7k lines Bluespec UltraSPARC III ISA Validated against Simics w/ real apps (e.g., Solaris 8, SPEC2000, DB2, Oracle, etc.) 41% all instr groups implemented + MMU 8kB I/D direct-mapped caches multi-cycle func model (CPI ideal = 5 @ 100MHz) 16K LUTs (50% of XUP Virtex-II Pro 30)

5 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 5 coverage=99.9999% CPI raw = 1 coverage=99.9999% CPI raw = 1 coverage=99.99999% CPI=1,000 CPI effective = 1.1 Reality check: transplants are expensive! ( 10ms=1,000,000 cycles ) –given CPI = 1 @ 100 Mhz (100 MIPS), 1 transplant per 1 million instructions increases CPI to 2 (50 MIPS) Recall lessons in hierarchical cache design … Hierarchical transplants –Run “simulator kernel” on nearby embedded PowerPC –write SW to cover the entire ISA –only I/O operations need full transplant to SIMICS (a 10x reduction in our case) Is this the best we can do? FPGA fabric full-system SIMICS coverage=100% CPI tplant =1,000,000 Embedded PPC ISAsim CPI effective = 2 Advantages: Now it makes sense to optimize towards CPI raw = 1 You actually need fewer instructions in hardware (especially beneficial for x86) Advantages: Now it makes sense to optimize towards CPI raw = 1 You actually need fewer instructions in hardware (especially beneficial for x86)

6 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 6 Demo

7 How to build a 1024-node MP functional emulator, without building 1024 nodes?

8 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 8 How fast do you need to simulate? In the uniprocessor world up to 100x slowdown for interactive software research (e.g. Simics) 1k to 10k slowdown for design exploration (e.g. cache simulation) Aggregate Throughput “fast enough” for 1024-way arch. studies

9 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 9 Different approaches to scale to 1K Even for 1K-node MP, only 1000 to 10,000 MIPS (aggregate) to do useful work The obvious approach –build fast ISA core (estimate 100 MIPS per core) –physically replicate the core 1000 times  10x to 100x faster than needed, why spend effort and area on perf I don’t need? The better approach  think in terms of MIPS –build 100 MIPS ISA emulation engine supporting multiple contexts –map 100 simulated processors onto single engine –with just 10 physical engines, I can emulate 1000-way system (10 x 100 MIPS = 1000 MIPS)

10 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 10 P ROTO F LEX MP Build 1000-MIPS simulator from 10s of emulation engines –multiplex large # of emulated contexts onto few emulation engines Decide # of emulation engines to build from desired performance, not from # nodes to emulate N-way target system P-way FPGA emulation engines, P<<N Memory CPU CPU N CPU CPU P

11 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 11 Interleaved Emulation Engine Statically interleaved emulation engine (ala HEP) –issue new instr from new context per cycle  maximize engine throughput –simple pipeline (no fwding or interlock if # context > # pipe stages) –deeper pipelines for higher frequency (or complex x86 instrs) –hide the latency of memory and transplants It is actually easier to optimize instruction throughput Open issues –How to manage very large # of contexts? Do we have to dynamically “page” clusters of contexts in and out of the engine? –How to “fake” memory capacity? How much DRAM to emulate 1000-node system?

12 Jan 11, 2007Eric S. Chung / RAMP 2007 Retreat 12 Conclusion Contributions –hybrid transplant simulation reduces FPGA development effort –proof-of-concept demonstrates up to 16 MIPS on select SPECINT  plan to run TPC-C on DB2 and Oracle on BEE2 (not enough DRAM on XUP) Future work –1024-way system on 10-way interleaved emulation engines Thanks! Questions? echung@ece.cmu.edu P ROTO F LEX /S IM F LEX (http://www.ece.cmu.edu/~simflex)


Download ppt "Computer Architecture Lab at 1 P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi,"

Similar presentations


Ads by Google