A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor Farhad Mehdipour, H. Noori, B.

Slides:

Advertisements

Similar presentations

Tuning of Loop Cache Architectures to Programs in Embedded System Design Susan Cotterell and Frank Vahid Department of Computer Science and Engineering.

Advertisements

Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow

University of Michigan Electrical Engineering and Computer Science 1 A Distributed Control Path Architecture for VLIW Processors Hongtao Zhong, Kevin Fan,

1 A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky *Dept. of Electrical Engineering Dept. of Computer.

Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.

Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.

CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit Kynan Fraser.

Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.

Zheming CSCE715.  A wireless sensor network (WSN) ◦ Spatially distributed sensors to monitor physical or environmental conditions, and to cooperatively.

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

University of Michigan Electrical Engineering and Computer Science 1 Reducing Control Power in CGRAs with Token Flow Hyunchul Park, Yongjun Park, and Scott.

1 HW/SW Partitioning Embedded Systems Design. 2 Hardware/Software Codesign “Exploration of the system design space formed by combinations of hardware.

Chapter XI Reduced Instruction Set Computing (RISC) CS 147 Li-Chuan Fang.

Defining Wakeup Width for Efficient Dynamic Scheduling A. Aggarwal, O. Ergin – Binghamton University M. Franklin – University of Maryland Presented by:

Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine.

Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.

Storage Assignment during High-level Synthesis for Configurable Architectures Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.

The Effect of Data-Reuse Transformations on Multimedia Applications for Different Processing Platforms N. Vassiliadis, A. Chormoviti, N. Kavvadias, S.

Enhancing Embedded Processors with Specific Instruction Set Extensions for Network Applications A. Chormoviti, N. Vassiliadis, G. Theodoridis, S. Nikolaidis.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in partitioned architectures Rajeev Balasubramonian Naveen.

Author: D. Brooks, V.Tiwari and M. Martonosi Reviewer: Junxia Ma

University of Michigan Electrical Engineering and Computer Science Data-centric Subgraph Mapping for Narrow Computation Accelerators Amir Hormati, Nathan.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.

Exploring the Tradeoffs of Configurability and Heterogeneity in Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable.

Register Cache System not for Latency Reduction Purpose Ryota Shioya, Kazuo Horio, Masahiro Goshima, and Shuichi Sakai The University of Tokyo 1.

Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.

A Novel Cache Architecture with Enhanced Performance and Security Zhenghong Wang and Ruby B. Lee.

A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,

LOPASS: A Low Power Architectural Synthesis for FPGAs with Interconnect Estimation and Optimization Harikrishnan K.C. University of Massachusetts Amherst.

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.

Automated Design of Custom Architecture Tulika Mitra

ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.

1 Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives Lin, Hai Fei, Yunsi ACM/IEEE.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor Hamid Noori †, Farhad Mehdipour ‡, Kazuaki Murakami †, Koji.

Reconfigurable Computing Using Content Addressable Memory (CAM) for Improved Performance and Resource Usage Group Members: Anderson Raid Marie Beltrao.

Design Space Exploration for Application Specific FPGAs in System-on-a-Chip Designs Mark Hammerquist, Roman Lysecky Department of Electrical and Computer.

Garo Bournoutian and Alex Orailoglu Proceedings of the 45th ACM/IEEE Design Automation Conference (DAC’08) June /10/28.

A Single-Pass Cache Simulation Methodology for Two-level Unified Caches + Also affiliated with NSF Center for High-Performance Reconfigurable Computing.

An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit Farhad Mehdipour †, Hamid.

Houman Homayoun, Sudeep Pasricha, Mohammad Makhzan, Alex Veidenbaum Center for Embedded Computer Systems, University of California, Irvine,

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science 1 Bundled Execution.

A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

Pipelining and Parallelism Mark Staveley

Kyushu University Las Vegas, June 2007 The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems.

Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

Improving Energy Efficiency of Configurable Caches via Temperature-Aware Configuration Selection Hamid Noori †, Maziar Goudarzi ‡, Koji Inoue ‡, and Kazuaki.

1 Energy-Efficient Register Access Jessica H. Tseng and Krste Asanović MIT Laboratory for Computer Science, Cambridge, MA 02139, USA SBCCI2000.

Reduction of Register File Power Consumption Approach: Value Lifetime Characteristics - Pradnyesh Gudadhe.

High Performance, Low Power Reconfigurable Processor for Embedded Systems Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami Kyushu University,

Hardware Architectures for Power and Energy Adaptation Phillip Stanley-Marbell.

Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida.

Codesigned On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also.

NISC set computer no-instruction

An Automated Development Framework for a RISC Processor with Reconfigurable Instruction Set Extensions Nikolaos Vassiliadis, George Theodoridis and Spiridon.

Winter-Spring 2001Codesign of Embedded Systems1 Essential Issues in Codesign: Architectures Part of HW/SW Codesign of Embedded Systems Course (CE )

1 of 14 Lab 2: Design-Space Exploration with MPARM.

The Effect of Data-Reuse Transformations on Multimedia Applications for Application Specific Processors N. Vassiliadis, A. Chormoviti, N. Kavvadias, S.

Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Electrical.

Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.

Application-Specific Customization of Soft Processor Microarchitecture

How will execution time grow with SIZE?

Hamid Noori*, Farhad Mehdipour†, Norifumi Yoshimastu‡,

Dynamically Reconfigurable Architectures: An Overview

Serial versus Pipelined Execution

Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle

Automatic Tuning of Two-Level Caches to Embedded Applications

Application-Specific Customization of Soft Processor Microarchitecture

Presentation transcript:

A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor Farhad Mehdipour, H. Noori, B. Javadi, H. Honda, K. Inoue and K. J. Murakami Department of Information Science and Electrical Engineering, KYUSHU UNIVERSITY, Fukuoka, JAPAN

2/XXXII Outline Reconfigurable Instructions Set Processors A Combined Analytical and Simulation-Based Model (CAnSO) Model Extraction and Calibration Basic Model Definitions Speedup Formulations Simplification and Calibration Experiments Experimental Setup Model Validation Design Space Exploration Using CAnSO Effects of Modifications Conclusions and Future Work

3/XXXII Designing Embedded Systems Embedded Microprocessors Application-Specific Integrated Circuits (ASICs) Application-Specific Instruction set Processors (ASIPs) Extensible Processors

4/XXXII Extensible Processors Mechanism Acceleration by using CFU a hardware is augmented to the base processor Executes hot portions of applications

5/XXXII Extensible Processors Base processor (BP)'s fixed instruction set + Custom Instructions Goals Improving the performance and energy efficiency Maintaining compatibility and flexibility CPU Instruction Dispatcher Register File +&x LD/ST CFU1CFU2 LD/ST: Load / Store CFU: Custom Functional Unit

6/XXXII Custom Instructions Instruction set customization  hardware/software partitioning (Identifying critical segments in applications) Custom Instructions (CIs) are extracted from critical segments of an application and executed on a Custom Functional Unit (CFU) A CI can be represented as a DFG Critical segments: Most frequently executed (Hot) portions of the applications

7/XXXII Extensible Processors Drawbacks: Lack of flexibility Long time and cost of designing and verifying Many issues associated with designing a new processor from scratch: longer time-to-market and significant NRE (Non-Recurring Engineering) costs Solution Using a Reconfigurable Functional Unit (RFU) instead of fixed architecture CFU

8/XXXII Reconfigurable Processors Microprocessor Reconfigurable Logic Reconfigurable Processor

9/XXXII Processor coupling Coprocessor Processor RFU Memory Attached Processor Bridge

10/XXXII Reconfigurable Instruction Set Processors (RISPs) Adding and generating custom instructions after fabrication Using a reconfigurable FU(RFU) instead of custom FU CPU Instruction Dispatcher Register File +&x LD/ST CFU1CFU2 RFU Config Mem CFU: Custom Functional Unit RFU: Reconfigurable Functional Unit

11/XXXII How a RISP Works GPP: General Purpose Processor RAC=RFU: Reconfigurable Accelerator subiu $25,$25, lbu$13,0($7) lbu$2,0($4) sll$2,$2,0x a0sra $14,$2,0x a8addiu$4,$4,1 4006b0srl$8,$2,0x1c 4006b8sll$2,$8,0x2 4006c0addu$2,$2,$ c8lw$2,0($2) 4006d0xori$13,$13,1 4006d8addu$10,$10,$ subiu $25,$25, sll$2,$2,0x a0sra $14,$2,0x lbu$13,0($7) 4006e0bgez$10,4006f0... A Hot Basic Block RISP

12/XXXII RISP Benefits and Drawbacks Benefits Specialized datapath Shared hardware Higher Speedup Less power consumption Drawbacks More area Difficult to use

13/XXXII Performance Evaluation of a RISP Performance evaluation of a RISP challenges designing of a RISP architecture optimizing an existing arch. for an objective function For a designer obtaining optimum system configuration is desirable a performance analysis in terms of the performance metrics (speedup, area and so on) is required Performance evaluation models Structural models: includes empirical studies based on measurements and simulations of the target system Analytical models: incorporates a system (usually simplified) structure to obtain mathematically solvable models

14/XXXII Fraction of Dynamic Instructions in Applications the RAC is responsible for executing almost 30% of dynamic instructions of applications in average

15/XXXII Model Extraction and Utilization

16/XXXII General Template of a RISP

17/XXXII Basic Model Definitions Base Processor an in-order general five-stage RISC processor RAC a coarse-grained tightly-coupled reconfigurable hardware CIs are indexed for direct accessing of the configuration bit-stream The content of all registers are sent to the RAC (Shared RF) Controlling configurations Hardware-based: starting address of CI and index to the config. Mem. is stored in a CAM for quick retrieval Software-based: starting address of a CI is replaced with a special instruction Memory accesses Control instructions

18/XXXII Single and Continuous Executions Single Execution Continuous Execution

19/XXXII Speedup Formulation Latency of execution of CIi instructions on the BP frequency of jth occurrence of CIi Fraction of instructions executing on BP Overall Speedup Execution time on the RAC Latency of RAC and the overhead reconfiguration time Total no. of executions of CIi Execution time on the BP

20/XXXII The Effect of CI Length Large CIs Including more instructions than the no. of available resources in the RAC Temporal Partitioning Dividing larger CIs to a number of smaller CIs

21/XXXII Side-Effects Control Instructions the rate of miss-predicted branches might be reduced  higher speedup Instruction Cache Misses no need for fetching instructions belonging to the CIs access and miss rates to instruction cache are reduced BP fraction reduces  speedup increases variation in branch/cache miss- predictions/misses no. of penalty cycles for branch miss- predictions/cache misses

22/XXXII RF’s Input/Output Ports Register file is shared between BP and RAC Additional clock cycles for reading/writing from/to the RF no. of CIi’s inputs no. of RF’s write ports no. of RF’s read ports

23/XXXII The Assumed RAC Architecture

24/XXXII RAC’s Delay All FUs in the RAC implement similar operations Each mux receives all outputs of the FUs in upper rows and Outputs from its adjacent FUs at the same row

25/XXXII Simplification and Calibration Control instructions are not supported Reduction in instruction cache accesses as well as cache misses average reduction in access to i-cache is almost 17% average i-cache miss rate is almost 3%. Average i-Cache Accesses: 17% Misses: 3%

26/XXXII Simplification and Calibration Fraction of Single & Continuous Executions the ratio of single to continuous execution  43%

27/XXXII Experimental Setup Fourteen applications of Mibench automotive, security, consumer, network, telecommunication CIs (DFGs) are extracted from applications Simplescalar’s cycle-accurate simulator is extended to simulate a reconfigurable instruction set processor Model Establishment simulating all applications collecting required information model simplification and calibration ~ 4 hours to completion on a PC: Dual Core, Intel 2GB RAM

28/XXXII Model Validation Average variation= 2% Average variation= 22%

29/XXXII Design Space Exploration Using CAnSO The design of a RAC including different components entails a multitude of design parameters Examining 100 design points using 14 applications: Simulation: 17 days CAnSO: 4 hours Using CAnSO, re-simulation is not needed after establishing the model

30/XXXII Using CAnSO for Design Space Exploration of the RAC Increasing the width of RAC increases speedup Width> 6: no more speedup is achievable the small heights  very low speedup Height> 5: RAC’s longer critical path delay  speedup declines

31/XXXII Effect of Modifications Applying modification to the design  - Small time is required for repeating the simulation - Each iteration of the CAnSO takes less than a minute

32/XXXII C ONCLUSION Reconfigurable instruction set processors A combined analytical and simulation-based model (CAnSO) Suitable for exploring a large design space for the accelerator Sufficient flexibility in a rapid evaluation of modified target architectures Substantially reduce the design or optimization time while preserving a reasonable accuracy Proves less than 2% variation in evaluation results Uncalibrated CAnSO depicts 22% difference in average Future work: Expanding CAnSO to support control instructions Considering more complicated RAC architectures