Computer Architecture Lab at 1 ProtoFlex: FPGA-Accelerated Hybrid Functional Simulator Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi,

Presentation transcript:

Computer Architecture Lab at
ProtoFlex: FPGA-Accelerated Hybrid Functional Simulator
Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai
{echung, enurvita, jhoe, babak,
ProtoFlex/SimFlex

Jan 11, 2007 Eric S. Chung / RAMP 2007 Retreat

Multiprocessor Functional Simulation
Functionally simulating one processor in software is slow
Simulating many processors is of course even slower
Parallelism of FPGAs can scale up functional MP simulation perf
→ conduct large-scale (>64-way) SW research, cache simulations, perf sampling studies, etc.
But we can’t forfeit full-ISA, full-system fidelity (run stock OS)
[Figure labels: Memory, PCI Bus, Ethernet controller, Graphics card, I/O MMU controller, Disk, DMA controller, IRQ controller, Terminal, SCSI controller, CPU, FPGAs]
FPGAs = unprecedented level of scalability, but full-system building effort can outweigh any benefits

Combining FPGAs and simulators
3 ways to map target object to hybrid-simulation host: emulation-only, simulation-only, transplantable
Transplant runtime system
–target processors switch modes between FPGA & simulator hosts
–processors need not execute 100% in FPGA mode, e.g., implement only the frequently used ISA subset in FPGA
[Figure labels: target design (Cpu, Mem, Disk, I/O instr, DMA) mapped onto FPGA and simulator]
Advantages:
–Leverage full-system simulators for reference designs
–Infrequent, complex behaviors remain simulated: TLB misses, block memory instrs, disk I/O instrs, SCSI disks, graphics, …
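
The transplant idea on this slide can be illustrated with a small Python sketch. It is not the ProtoFlex runtime: the opcode names, the `FPGA_SUBSET` set, and the `TargetCpu` class are hypothetical, chosen only to show a target processor switching hosts when it reaches behavior the FPGA does not implement.

```python
from dataclasses import dataclass, field

# Hypothetical opcode names; the slides do not list the implemented ISA subset.
FPGA_SUBSET = {"add", "sub", "load", "store", "branch"}


@dataclass
class TargetCpu:
    host: str = "fpga"      # host currently executing this target CPU
    transplants: int = 0    # number of hand-offs from FPGA to simulator
    trace: list = field(default_factory=list)


def run(cpu, instructions):
    """Run each instruction on the FPGA when possible, else transplant."""
    for opcode in instructions:
        if opcode in FPGA_SUBSET:
            # Frequent, simple behavior stays on the FPGA host.
            cpu.host = "fpga"
        else:
            # Infrequent or complex behavior: hand the target CPU to the
            # software simulator for this instruction.
            cpu.host = "simulator"
            cpu.transplants += 1
        cpu.trace.append((opcode, cpu.host))
    return cpu


cpu = run(TargetCpu(), ["add", "load", "tlb_miss", "store", "disk_io"])
print(cpu.transplants, cpu.trace)
```

In this toy model only the two unimplemented operations leave the FPGA, which is the property the slide relies on: the less often a behavior occurs, the less it matters that it runs on the slower host.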

It Really Works
[Figure labels: SUN 3800 Server (1x UltraSPARC III, Solaris 8); Xilinx XUP Virtex-II Pro 30; Virtutech Simics (commercial simulator); Transplant & message interface; Ethernet; Simics UltraSPARC; Simulated target devices; Our SPARCV9 core; Embedded PowerPC; DDR memory]
developed in 6 months
x86 also works
BlueSPARC specs:
–7k lines Bluespec
–UltraSPARC III ISA
–Validated against Simics w/ real apps (e.g., Solaris 8, SPEC2000, DB2, Oracle, etc.)
–41% all instr groups implemented + MMU
–8kB I/D direct-mapped caches
–multi-cycle func model (CPI ideal = 100MHz)
–16K LUTs (50% of XUP Virtex-II Pro 30)

Reality check: transplants are expensive! (10 ms = 1,000,000 cycles)
–given CPI raw = 1 at 100 MHz (100 MIPS), 1 transplant per 1 million instructions increases CPI to 2 (50 MIPS)
Recall lessons in hierarchical cache design …
Hierarchical transplants
–Run “simulator kernel” on nearby embedded PowerPC
–write SW to cover the entire ISA
–only I/O operations need full transplant to SIMICS (a 10x reduction in our case)
Is this the best we can do?
[Figure labels: FPGA fabric (coverage= %, CPI raw = 1); Embedded PPC ISAsim (coverage= %, CPI = 1,000); full-system SIMICS (coverage=100%, CPI tplant = 1,000,000); CPI effective = 2; CPI effective = 1.1]
Advantages:
–Now it makes sense to optimize towards CPI raw = 1
–You actually need fewer instructions in hardware (especially beneficial for x86)
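
The CPI arithmetic on this slide follows the same pattern as average memory access time in a cache hierarchy. Below is a minimal Python sketch of that calculation; the function names are made up for illustration. The single-level numbers come from the slide (one transplant per 1,000,000 instructions, 1,000,000 cycles each, 100 MHz clock). The two-level intervals are hypothetical, since the per-level coverage values are not recoverable from the transcript.

```python
def effective_cpi(cpi_raw, levels):
    """Effective CPI for a host hierarchy.

    cpi_raw: CPI of instructions executed directly in the FPGA.
    levels:  list of (instructions_between_events, penalty_cycles), one entry
             per slower host that occasionally takes over.
    """
    return cpi_raw + sum(penalty / interval for interval, penalty in levels)


def mips(clock_mhz, cpi):
    """Instruction throughput in MIPS for a given clock and CPI."""
    return clock_mhz / cpi


# Slide numbers: one full transplant per 1M instructions, 1M cycles each.
single_level = effective_cpi(1.0, [(1_000_000, 1_000_000)])
print(single_level, mips(100, single_level))

# Hypothetical hierarchy (NOT the slide's figures): the embedded PowerPC
# handles one event per 20,000 instructions at 1,000 cycles each, and full
# transplants become 10x rarer than in the single-level case.
two_level = effective_cpi(1.0, [(20_000, 1_000), (10_000_000, 1_000_000)])
print(two_level, mips(100, two_level))
```

The sketch makes the design point concrete: once the expensive level is reached rarely enough, the raw FPGA CPI dominates, so optimizing toward CPI raw = 1 starts to pay off.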

Demo

How to build a 1024-node MP functional emulator, without building 1024 nodes?

How fast do you need to simulate?
In the uniprocessor world
–up to 100x slowdown for interactive software research (e.g. Simics)
–1,000x to 10,000x slowdown for design exploration (e.g. cache simulation)
Aggregate throughput “fast enough” for 1024-way arch. studies

Different approaches to scale to 1K
Even for a 1K-node MP, only 1000 to 10,000 MIPS (aggregate) is needed to do useful work
The obvious approach
–build fast ISA core (estimate 100 MIPS per core)
–physically replicate the core 1000 times
→ 10x to 100x faster than needed, why spend effort and area on perf I don’t need?
The better approach → think in terms of MIPS
–build 100 MIPS ISA emulation engine supporting multiple contexts
–map 100 simulated processors onto single engine
–with just 10 physical engines, I can emulate 1000-way system (10 x 100 MIPS = 1000 MIPS)
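
The sizing argument above is simple arithmetic, sketched here in Python. The helper names are hypothetical; the figures (100 MIPS per engine, 1000 target processors, 1000 to 10,000 MIPS of useful aggregate throughput) are the ones quoted on the slide.

```python
import math


def engines_needed(target_aggregate_mips, engine_mips):
    """Number of physical engines required to reach an aggregate MIPS goal."""
    return math.ceil(target_aggregate_mips / engine_mips)


def overprovision_factor(num_cores, core_mips, needed_mips):
    """How many times more throughput full replication delivers than needed."""
    return (num_cores * core_mips) / needed_mips


# Obvious approach: replicate a 100-MIPS core 1000 times.
print(overprovision_factor(1000, 100, 10_000))  # versus the high-end need
print(overprovision_factor(1000, 100, 1_000))   # versus the low-end need

# Better approach: size the engine count from the throughput goal instead.
engines = engines_needed(1_000, 100)
contexts_per_engine = 1000 // engines
print(engines, contexts_per_engine, 100 / contexts_per_engine)
```

With these inputs the replicated design is 10x to 100x overprovisioned, while 10 multiplexed engines with 100 contexts each deliver roughly 1 MIPS per simulated processor.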

ProtoFlex MP
Build 1000-MIPS simulator from 10s of emulation engines
–multiplex large # of emulated contexts onto few emulation engines
Decide # of emulation engines to build from desired performance, not from # nodes to emulate
[Figure labels: N-way target system (N CPUs, Memory); P-way FPGA emulation engines (P CPUs), P<<N]

Interleaved Emulation Engine
Statically interleaved emulation engine (à la HEP)
–issue new instr from new context per cycle → maximize engine throughput
–simple pipeline (no fwding or interlock if # context > # pipe stages)
–deeper pipelines for higher frequency (or complex x86 instrs)
–hide the latency of memory and transplants
It is actually easier to optimize instruction throughput
Open issues
–How to manage very large # of contexts? Do we have to dynamically “page” clusters of contexts in and out of the engine?
–How to “fake” memory capacity? How much DRAM to emulate 1000-node system?
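
The claim that a statically interleaved pipeline needs no forwarding or interlocks can be checked with a toy model. The Python sketch below is an assumption-laden illustration, not the ProtoFlex engine: it issues one instruction per cycle from contexts in round-robin order and reports whether two instructions from the same context are ever in flight together, which is the situation that would require forwarding or an interlock. The function names are hypothetical.

```python
def interleave(num_contexts, num_stages, cycles):
    """Simulate round-robin issue; return pipeline occupancy for each cycle.

    Each occupancy entry is a list of context ids, stage 0 first, with None
    for stages that have not been filled yet.
    """
    pipeline = [None] * num_stages
    history = []
    for cycle in range(cycles):
        issued = cycle % num_contexts          # next context in fixed order
        pipeline = [issued] + pipeline[:-1]    # advance every stage by one
        history.append(list(pipeline))
    return history


def has_hazard(history):
    """True if any cycle holds two in-flight instructions of one context."""
    for occupancy in history:
        live = [ctx for ctx in occupancy if ctx is not None]
        if len(live) != len(set(live)):
            return True
    return False


print(has_hazard(interleave(num_contexts=8, num_stages=5, cycles=40)))
print(has_hazard(interleave(num_contexts=3, num_stages=5, cycles=40)))
```

With at least as many contexts as stages, an instruction leaves the pipeline before its context issues again, so no intra-context dependence can be in flight. With fewer contexts than stages the model reports overlapping instructions from one context.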

Conclusion
Contributions
–hybrid transplant simulation reduces FPGA development effort
–proof-of-concept demonstrates up to 16 MIPS on select SPECINT
→ plan to run TPC-C on DB2 and Oracle on BEE2 (not enough DRAM on XUP)
Future work
–1024-way system on 10-way interleaved emulation engines
Thanks! Questions?
ProtoFlex/SimFlex (