Fast and Cycle-Accurate Modeling of a Multicore Processor. Advisor: 周哲民. Student: 陳佑銓. CAD Group, Department of Electrical Engineering, National Cheng Kung University.


Fast and Cycle-Accurate Modeling of a Multicore Processor
Advisor: 周哲民
Student: 陳佑銓
CAD Group, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
2013/5/20

NCKU EE CAD ASIC Lab

Outline
- Abstract
- Introduction
- Cycle-accurate simulation, simplifications and refinements
- Implementation methodology
- Flexible simulation platform
- Related work
- Conclusion

Abstract
- An ideal simulator allows an architect to swiftly explore design alternatives and accurately determine their impact on performance.
- Design exploration requires simulators to be easily modifiable, and accurate performance estimates require detailed models.
- In this paper we present Arete, an FPGA-based processor simulator, which offers high performance along with accuracy and modifiability.
- We begin with a cycle-level specification of a multicore architecture which includes realistic in-order cores and detailed models of shared, coherent memory and the on-chip network.
- Arete delivers a performance of up to 11 MIPS per core. We run a subset of the PARSEC benchmark suite on top of off-the-shelf SMP Linux, and achieve an average performance of 55 MIPS for an 8-core model.
- We also describe two significant architectural explorations:
  - one involving three different branch predictors;
  - the other requiring major modifications to the cache-coherence protocol.
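As a quick sanity check on the two throughput figures, the short calculation below relates them. It assumes the 55 MIPS figure is the aggregate over all 8 cores, which the slide does not state explicitly.

```python
# Figures quoted on the slide.
PEAK_MIPS_PER_CORE = 11.0
AGGREGATE_MIPS = 55.0   # ASSUMPTION: total over all cores, not per core
NUM_CORES = 8

# Average per-core throughput implied by the aggregate figure.
avg_per_core = AGGREGATE_MIPS / NUM_CORES
# How close the average comes to the quoted per-core peak.
fraction_of_peak = avg_per_core / PEAK_MIPS_PER_CORE

print(f"average per core: {avg_per_core:.3f} MIPS")
print(f"fraction of peak: {fraction_of_peak:.3f}")
```

Under that assumption the benchmarks average a little under 7 MIPS per core, roughly five eighths of the quoted peak.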

Introduction (1/3)
- Performance modeling plays a critical role in the design and development of microprocessors.
- There is an ever-rising need for fast, accurate and flexible simulators to explore new architectural ideas and evaluate their impact on performance.
- As processor architectures get more complex, it becomes more difficult to implement processor simulators which are both accurate and fast.
- The availability of large FPGAs and new high-level synthesis tools has provided a new opportunity for cycle-accurate simulation.

Introduction (2/3)
- These FPGA-based cycle-accurate simulators can provide a three-orders-of-magnitude improvement in performance over software simulators.
- The initial effort to develop such FPGA simulators is somewhat greater than that required for software simulators, but it is still far smaller than the effort needed to develop a processor chip.
- It is also possible to design these FPGA simulators so that they are amenable to modular refinement and facilitate the generation of simulators for many different variants of a base architecture.
- In this paper we present Arete, an FPGA-based cycle-accurate simulator for a multicore PowerPC architecture.
- We developed this simulator adhering to a cycle-level specification of the architecture.

Introduction (3/3)
- For efficient FPGA implementation we used the LI-BDN technique [5], which helps to improve the FPGA cycle time and to reduce the FPGA resource requirements by using multiple FPGA cycles to simulate one cycle of the target architecture.
- Our simulator is also suitable for architectural exploration. We demonstrate this by evaluating three different branch prediction schemes and by extending the cache-coherence scheme to provide software with better control over the contents of the caches.
- To our knowledge, Arete is the first cycle-accurate FPGA-based multicore processor simulator which includes both a realistic core architecture and a detailed cache-coherence engine.

Cycle-accurate simulation, simplifications and refinements (1/2)
- The term "cycle-accurate simulation" is used in the literature to characterize many different types of simulations.
- In this paper we define it as a simulation that conforms to the cycle-by-cycle behavior of the target design.
- The behavior may be characterized in terms of the values of all the state elements of a machine (registers, memories, etc.) for every clock cycle.
- Cycle-accurate simulators tend to be both slow and complex.
- To overcome these obstacles, architects often simplify the target design.

Cycle-accurate simulation, simplifications and refinements (2/2)
- Once the cycle-by-cycle behavior of a model (which may include target simplifications) has been specified, the specification can be transformed into a netlist.
- This netlist can be used to program an FPGA, but it may require too many FPGA resources or present an unacceptably long critical path.
- In order to reduce the resource requirements and shorten the critical path, an implementation may use several FPGA cycles to simulate one model cycle while preserving model timing accuracy.

Implementation methodology (1/3)
- We employ the LI-BDN (Latency-Insensitive Bounded Dataflow Network) technique [5] to implement our model on an FPGA because it enables the use of implementation refinements while preserving the cycle-accuracy of the model and guaranteeing the absence of deadlocks in the implementation.
- We give a brief overview of the LI-BDN technique using the example in Figure 1.
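Figure 1 is not reproduced in this transcript, so the following is only a minimal Python sketch of the general idea, not the authors' Bluespec implementation. The hypothetical target is an accumulator that outputs its old value and adds inputs `a` and `b` every model cycle. The hypothetical host implementation shares one adder, so it spends two host (FPGA) cycles per model cycle, and bounded FIFOs on every port decouple model time from host time. All class and function names are illustrative assumptions.

```python
from collections import deque

class Fifo:
    """Bounded FIFO placed on each port of the wrapped module."""
    def __init__(self, capacity=2):
        self.items, self.capacity = deque(), capacity
    def can_enq(self): return len(self.items) < self.capacity
    def can_deq(self): return len(self.items) > 0
    def enq(self, x): self.items.append(x)
    def deq(self): return self.items.popleft()

class AccumulatorLI:
    """Target per model cycle: out = acc; acc = acc + a + b."""
    def __init__(self):
        self.in_a, self.in_b, self.out = Fifo(), Fifo(), Fifo()
        self.acc = 0
        self.model_cycle = 0
        self._partial = None  # acc + a, held between the two host cycles

    def host_cycle(self):
        if self._partial is None:
            # Host step 1: start a model cycle only when every input has a
            # token and the output FIFO has room.
            if self.in_a.can_deq() and self.in_b.can_deq() and self.out.can_enq():
                self.out.enq(self.acc)
                self._partial = self.acc + self.in_a.deq()
        else:
            # Host step 2: reuse the single adder, then commit the model cycle.
            self.acc = self._partial + self.in_b.deq()
            self._partial = None
            self.model_cycle += 1

def reference(a_vals, b_vals):
    """Plain cycle-by-cycle specification of the same target."""
    acc, outs = 0, []
    for a, b in zip(a_vals, b_vals):
        outs.append(acc)
        acc += a + b
    return outs

def run(a_vals, b_vals, stall_every=3):
    """Drive the wrapped module with an environment that delivers tokens irregularly."""
    assert stall_every > 1  # otherwise the environment never delivers a token
    m, a_q, b_q = AccumulatorLI(), deque(a_vals), deque(b_vals)
    outs, host = [], 0
    while len(outs) < len(a_vals):
        if host % stall_every != 0:  # the environment skips some host cycles
            if a_q and m.in_a.can_enq(): m.in_a.enq(a_q.popleft())
            if b_q and m.in_b.can_enq(): m.in_b.enq(b_q.popleft())
        m.host_cycle()
        if m.out.can_deq(): outs.append(m.out.deq())
        host += 1
    return outs, host

outs, host_cycles = run([1, 2, 3], [10, 20, 30])
print(outs == reference([1, 2, 3], [10, 20, 30]), host_cycles)
```

The output sequence matches the reference regardless of how the environment stalls, which is the property that lets an implementation trade host cycles for area or clock speed without changing model timing.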

Implementation methodology (2/3)
- Debugging using the LI-BDN technique: the major requirement for debugging a large and complex model is the ability to freeze it in a particular model cycle so that a precise snapshot of all the state can be obtained.
- Such an ability is similar to taking a snapshot of the architectural state of an out-of-order processor for precise exceptions.

Implementation methodology (3/3)
- We use the module from Figure 1(a) to demonstrate how its LI-BDN implementation can facilitate debugging.
- We add a 1-bit input port and a 1-bit output port to the module, as shown in Figure 2(a). Every model cycle, the module produces 1 or 0 on the new output port and ignores the new input port.
- We then transform the module into an LI-BDN and attach the external interface of the new ports to some logic, as shown in Figure 2(b).
- The logic can freeze the module in model cycle n by dequeuing n times from the FIFO attached to the new output port and enqueuing n-1 times into the FIFO attached to the new input port. A debugger can now either read or assign the value of the state in the nth model cycle. Also, any such transformed module can be frozen independently of the rest of the model.
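The freeze mechanism can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the design in Figure 2: the module here is a hypothetical counter, the debug FIFOs are unbounded for simplicity, and `DebuggableModule` and `freeze_at` are made-up names. The key point it shows is that the module cannot finish model cycle n until it receives the nth token on the debug input.

```python
from collections import deque

class DebuggableModule:
    """Hypothetical counter wrapped with one extra debug output and input port."""
    def __init__(self):
        self.dbg_out = deque()   # module -> debug logic, one token per model cycle
        self.dbg_in = deque()    # debug logic -> module, one token per model cycle
        self.state = 0
        self.model_cycle = 1     # model cycle currently being simulated (1-based)
        self._sent = False       # debug output already produced this model cycle?

    def host_cycle(self):
        if not self._sent:
            self.dbg_out.append(1)   # announce that this model cycle has started
            self._sent = True
        elif self.dbg_in:
            self.dbg_in.popleft()    # the token's value is ignored
            self.state += 1          # commit the model cycle
            self.model_cycle += 1
            self._sent = False
        # Otherwise: no debug input token yet, so the module simply waits.

def freeze_at(module, n, max_host_cycles=10_000):
    """Dequeue n debug outputs and enqueue only n-1 debug inputs."""
    deqs = enqs = 0
    for _ in range(max_host_cycles):
        if enqs < n - 1:
            module.dbg_in.append(0)
            enqs += 1
        module.host_cycle()
        if module.dbg_out and deqs < n:
            module.dbg_out.popleft()
            deqs += 1
        if deqs == n:
            return
    raise RuntimeError("module did not reach the requested model cycle")

m = DebuggableModule()
freeze_at(m, 3)
for _ in range(10):          # extra host cycles do not advance a frozen module
    m.host_cycle()
print(m.model_cycle, m.state)
```

Supplying one more token on the debug input (`m.dbg_in.append(0)`) lets the module resume, and because each module has its own pair of debug FIFOs, freezing one does not require freezing the others.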

Flexible simulation platform (1/10)
- The design and implementation of Arete provide simulation speed and accuracy along with ease of modification and portability.
- We started by writing a cycle-level specification of the processor, and then employed the LI-BDN technique to incorporate various implementation refinements which helped achieve an efficient FPGA implementation.
- In the process, we built a library of components which may be used for FPGA implementations of other models.
- We used Bluespec SystemVerilog (BSV) [8] to develop Arete.

Flexible simulation platform (2/10)
- A. Processor architecture
- The processor makes use of a tiled architecture where the number of tiles is a synthesis parameter that is specified according to the resources available on a particular FPGA platform.

[Figure: tile block diagram; labels: PowerPC core, PowerPC core, L2$, Network Controller, Dir Ctrl, DRAM]

Flexible simulation platform (3/10)
- Core: the core comprises a 64-bit, in-order PowerPC pipeline and implements the Power ISA, Embedded Environment [9].
- The pipeline is designed to provide a high degree of flexibility, and includes the following features:
  - Pipeline stages can be split or combined without modifying the rest of the pipeline because the stages are designed to be latency-tolerant.
  - The mechanism that handles changes in instruction flow allows any stage to perform branch prediction, branch resolution or exception handling.
  - Any stage can read the register file and the various special-purpose registers, but only the last stage updates them when committing instructions.
  - Updated register values are fully bypassed, but the pipeline may still stall due to read-after-write hazards.
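The last point, that full bypassing does not eliminate every stall, can be illustrated with a small hazard check. This is a generic sketch, not Arete's pipeline logic: it assumes a consumer must wait whenever an older in-flight instruction writes one of its source registers and has not yet produced a value that could be bypassed (a load still waiting on the cache is a typical case). `InFlight` and `must_stall` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class InFlight:
    dest: int            # destination register of an older, uncommitted instruction
    result_ready: bool   # True once its value exists and can be bypassed

def must_stall(src_regs, in_flight):
    """Read-after-write check: stall if any source is still being computed."""
    return any(op.dest in src_regs and not op.result_ready for op in in_flight)

# r3 is being loaded and the data has not returned: the consumer stalls.
print(must_stall({3, 4}, [InFlight(dest=3, result_ready=False)]))
# Once the value is available it is bypassed and no stall is needed.
print(must_stall({3, 4}, [InFlight(dest=3, result_ready=True)]))
```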

Flexible simulation platform (4/10)
- Each core has private instruction and data L1 caches with a pipelined hit latency of 1 model cycle.
- These caches are parameterized for associativity, line size, number of entries and replacement policy.
- One of the key features of the core's design is its modularity.
- It can support a completely different RISC ISA with appropriate modifications confined to the decode and the MMU modules.

[Figure: pipeline diagram; legible labels: Excep Handler, ALU, Mem-2, Branch Resol, Addr Calc, TLB]
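To make the cache parameters concrete, here is a small Python sketch of how such a parameter set determines the cache geometry and the tag/index/offset split of an address. The field names, the power-of-two restriction and the example values are assumptions for illustration; the slides do not give Arete's actual parameter names or defaults.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheConfig:
    num_entries: int          # total number of cache lines
    associativity: int        # ways per set
    line_size: int            # bytes per line
    replacement: str = "LRU"  # policy name (hypothetical value)

    def __post_init__(self):
        # ASSUMPTION: all sizes are powers of two so that fields are bit slices.
        for value in (self.num_entries, self.associativity, self.line_size):
            if value <= 0 or value & (value - 1):
                raise ValueError("sizes must be positive powers of two")
        if self.associativity > self.num_entries:
            raise ValueError("associativity cannot exceed the number of entries")

    @property
    def num_sets(self):
        return self.num_entries // self.associativity

    @property
    def capacity_bytes(self):
        return self.num_entries * self.line_size

    def split(self, addr):
        """Return (tag, set index, byte offset) for a physical address."""
        offset_bits = self.line_size.bit_length() - 1
        index_bits = self.num_sets.bit_length() - 1
        offset = addr & (self.line_size - 1)
        index = (addr >> offset_bits) & (self.num_sets - 1)
        tag = addr >> (offset_bits + index_bits)
        return tag, index, offset

cfg = CacheConfig(num_entries=512, associativity=4, line_size=64)
print(cfg.num_sets, cfg.capacity_bytes, cfg.split(0x12345))
```

Changing only the three numbers yields a different geometry without touching the lookup logic, which is the kind of exploration that parameterization is meant to make cheap.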

Flexible simulation platform (5/10)
- Shared memory and cache-coherence: we have designed and implemented a hierarchical, directory-based MSI protocol to provide cache-coherence.
- The protocol maintains a set of invariants which guarantee the absence of deadlocks.
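For readers unfamiliar with MSI, the sketch below shows the stable-state behavior of a directory for a single cache line. It is deliberately much simpler than the protocol described here: it is flat rather than hierarchical, every transaction is atomic, and there are no transient states or network messages, so it says nothing about the deadlock-avoidance invariants. The invariant it does check is the basic coherence one (at most one Modified copy, and never alongside Shared copies). All names are hypothetical.

```python
from enum import Enum

class State(Enum):
    I = "Invalid"
    S = "Shared"
    M = "Modified"

class LineDirectory:
    """Tracks which caches hold ONE cache line, and in what state."""
    def __init__(self, num_caches):
        self.states = [State.I] * num_caches

    def read(self, cache):
        if self.states[cache] is State.I:
            # A Modified copy elsewhere is written back and downgraded to Shared.
            for other, st in enumerate(self.states):
                if st is State.M:
                    self.states[other] = State.S
            self.states[cache] = State.S
        self._check()

    def write(self, cache):
        if self.states[cache] is not State.M:
            # Every other copy is invalidated before write permission is granted.
            for other in range(len(self.states)):
                if other != cache:
                    self.states[other] = State.I
            self.states[cache] = State.M
        self._check()

    def _check(self):
        modified = self.states.count(State.M)
        shared = self.states.count(State.S)
        assert modified <= 1 and not (modified and shared), "coherence violated"

d = LineDirectory(num_caches=4)
d.read(0); d.read(1)   # two sharers
d.write(2)             # sharers invalidated, cache 2 becomes the owner
d.read(0)              # owner downgraded, cache 0 joins as a sharer
print([s.name for s in d.states])
```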

Flexible simulation platform (6/10)
- We have arranged the main memory in a distributed and shared manner: each tile has fast access to the region of main memory to which it is directly connected, but it has to traverse the network layer to access regions connected to other tiles.
- Off-chip main memory is incorporated into Arete as an LI-BDN module.
- This enables us to model its access latency, which is another runtime parameter of the model.
- A private region of DRAM is used to implement the directory state in the main memory, which provides cache-coherence among L2 caches.
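The local-versus-remote distinction can be sketched as an address-to-home-tile mapping plus a latency rule. Everything specific here is an assumption for illustration: the slides do not say how addresses are assigned to tiles or give any latency values. The sketch assumes equal, contiguous regions per tile and charges one network traversal each way for a remote access.

```python
def home_tile(addr, num_tiles, region_bytes):
    """ASSUMPTION: tile k owns addresses [k * region_bytes, (k + 1) * region_bytes)."""
    tile = addr // region_bytes
    if not 0 <= tile < num_tiles:
        raise ValueError("address outside the modeled main memory")
    return tile

def access_latency(requesting_tile, addr, num_tiles, region_bytes,
                   dram_latency, hop_latency):
    """Model cycles for one main-memory access (hypothetical cost model)."""
    if home_tile(addr, num_tiles, region_bytes) == requesting_tile:
        return dram_latency                    # directly connected region
    return dram_latency + 2 * hop_latency      # request and response cross the network

# Four tiles with 4 KiB regions; latencies are made-up runtime parameters.
print(access_latency(1, 0x1800, 4, 0x1000, dram_latency=20, hop_latency=3))
print(access_latency(0, 0x1800, 4, 0x1000, dram_latency=20, hop_latency=3))
```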

Flexible simulation platform (7/10)
- On-chip network: the current implementation of the network architecture supports a bidirectional, all-to-all topology.
- It is capable of handling four types of traffic: cache-coherence, inter-core messaging, debugging and display.

[Figure: network traffic types: Cache Coherence, Inter-core Messaging, Debugging, Display]
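A short sketch of what an all-to-all topology implies for link count and hop count, with the four traffic classes as an enumeration. The names are hypothetical and the sketch models only the topology, not Arete's router or arbitration.

```python
from enum import Enum
from itertools import combinations

class Traffic(Enum):
    CACHE_COHERENCE = 0
    INTER_CORE_MESSAGING = 1
    DEBUGGING = 2
    DISPLAY = 3

def all_to_all_links(num_tiles):
    """One bidirectional link per unordered pair of tiles."""
    return list(combinations(range(num_tiles), 2))

def hops(src, dst):
    """Every pair of distinct tiles is directly connected."""
    return 0 if src == dst else 1

links = all_to_all_links(8)
print(len(links), hops(2, 5), [t.name for t in Traffic])
```

The link count grows as n(n-1)/2, so an 8-tile model needs 28 bidirectional links.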

Flexible simulation platform (8/10)
- Flexibility: owing to our platform's modularity and parameterization, we were able to conduct two significant and distinct architectural explorations on Arete with limited effort.
- The design, verification and evaluation of three different branch prediction schemes required only 2 man-days of work.
- A significant overhaul of the cache-coherence protocol to support software management of caches was carried out in 30 man-days.

Flexible simulation platform (9/10)
- Portability: the model communicates with three external resources: a Xilinx multi-ported memory controller (MPMC) which provides access to DRAM, a MicroBlaze soft core which runs debugging software, and a PC which provides access to a text terminal.

[Figure: block diagram; labels: PowerPC Model, Xilinx MPMC, MicroBlaze, DRAM, PC]

Flexible simulation platform (10/10)
- For a particular FPGA platform, we wrap the interfaces to the three resources in order to present latency-insensitive, request-response interfaces to the model.
- We have ported Arete to three FPGA boards: XUPv5, ML605 and BEE3.
- This portability does not require any modifications to the design of the model; one only needs to specify appropriate values of certain parameters before synthesis.
- Simulation infrastructure: we have attempted to provide a comprehensive simulation infrastructure for architectural exploration and verification.
- We make use of the debugging feature enabled by the LI-BDN technique to build a debugging environment for Arete.
- The debugging software handles low-level model initialization and provides access to all model state during simulation.
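The role of the latency-insensitive request-response wrappers can be sketched as follows. This is an illustrative Python model, not Arete's interface: `RequestResponsePort`, `model_reads` and the dictionary standing in for DRAM are all hypothetical. The point is that model-side code only asks "is a response ready?", so the same code produces the same results whatever the platform's actual latency.

```python
from collections import deque

class RequestResponsePort:
    """Wraps a platform resource behind a latency-insensitive interface."""
    def __init__(self, backend, latency):
        self.backend = backend      # callable: request -> response
        self.latency = latency      # platform-specific delay in host cycles
        self.now = 0
        self._in_flight = deque()   # (ready_time, response), in request order
        self._responses = deque()

    def request(self, req):
        self._in_flight.append((self.now + self.latency, self.backend(req)))

    def tick(self):
        """Advance one host cycle and release any responses that are due."""
        self.now += 1
        while self._in_flight and self._in_flight[0][0] <= self.now:
            self._responses.append(self._in_flight.popleft()[1])

    def response_ready(self):
        return bool(self._responses)

    def response(self):
        return self._responses.popleft()

def model_reads(port, addrs):
    """Model-side code: never assumes a particular latency."""
    results = []
    for addr in addrs:
        port.request(addr)
        while not port.response_ready():
            port.tick()
        results.append(port.response())
    return results

memory = {0: 7, 4: 9}   # stand-in for DRAM contents
fast_board = RequestResponsePort(memory.get, latency=1)
slow_board = RequestResponsePort(memory.get, latency=50)
print(model_reads(fast_board, [0, 4]) == model_reads(slow_board, [0, 4]))
```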

Related work (1/2)
- Rsim [15] is a discrete event-driven simulator written in C++ and C, and provides detailed models of out-of-order superscalar processors connected via coherent shared memory.
- It does not run an operating system and only models user-level activity of applications.
- Simics [16], on the other hand, is a popular commercial functional simulator which can boot an operating system and run applications on top of it.
- Simics can be coupled with detailed execution-driven performance models like Gems [17] and M5 [18].
- Gems and M5 provide accurate models of the memory hierarchy and the on-chip network for a multicore system, allowing detailed evaluation of these components.
- A recent multicore processor simulator called Graphite [24] targets systems with thousands of cores.
- It relaxes cycle-accuracy to attain a higher simulation speed, in the tens of MIPS.
- Unlike Arete, Graphite is not a full-system simulator, and it does not run an operating system.

Related work (2/2)
- In the RAMP Gold [4] effort, Tan et al. have demonstrated a 64-core shared-memory target architecture.
- They have built a detailed memory model which does not include cache-coherence.
- They have a perfect core model which only stalls due to cache misses, and their network model comprises a magic crossbar.
- Pellauer's technique uses what are called A-Ports [7], which are FIFOs connecting modules.
- Their methodology is similar to LI-BDNs, but it does not enforce the conditions needed to avoid deadlocks the way LI-BDNs do.
- Chiou's FAST simulator [3] is split between a QEMU-based [26] functional emulator and an FPGA-based accurate timing model.
- They have also developed a multicore simulator using a functional-timing split [27].

Conclusion (1/2)
- We have presented a fast and cycle-accurate simulator for a multicore PowerPC architecture.
- The simulator accurately models a shared memory subsystem which includes a cache-coherence engine.
- We employed several novel ideas to provide a user-friendly simulation infrastructure, which others may want to adopt.
- A distributed debugging environment using the LI-BDN technique enables us to independently freeze any module in any model cycle.
- The use of standardized interfaces makes it possible to port Arete to multiple FPGA platforms without any modifications.
- Functionally identical partitions and a distributed protocol for assigning identifiers make it possible to use one configuration file for all the FPGAs in a multi-FPGA platform.

Conclusion (2/2)
- Moving forward, we are developing a new high-level hardware description language that allows architects to conveniently specify the cycle-by-cycle behavior of a target design.
- One of the goals of this work is to generate efficient synthesizable RTL from these specifications.
- Another goal is to develop a tool that will automatically transform these specifications into LI-BDNs.
- We are also extending Arete to facilitate research on hardware-software co-design.
- One of the key challenges in this area of research is to figure out the optimal hardware-software partitioning of algorithms for performance and power.
- Due to its modularity, Arete can readily accommodate algorithm-specific hardware accelerators for exploring many such partitions.

Thanks for your attention.