Synthesis of Custom Processors based on Extensible Platforms Fei Sun +, Srivaths Ravi ++, Anand Raghunathan ++ and Niraj K. Jha + + : Dept. of Electrical.


Synthesis of Custom Processors based on Extensible Platforms
Fei Sun +, Srivaths Ravi ++, Anand Raghunathan ++ and Niraj K. Jha +
+ : Dept. of Electrical Engineering, Princeton University
++ : NEC Laboratories America, Inc.

Outline
- SoC design constraints
- Background
  - Previous work in ASIP design
  - Xtensa platform
  - Manual custom instruction generation procedure
- Automatic custom instruction generation flow
- Experimental results
- Conclusions

SoC Design Constraints
- Time to market
- Cost
- Performance
- Power
- Cost-performance trade-off
- Flexibility
- ...

Comparison of Different Approaches
(Table: columns ASIC, ASIP, GPP; rows Time to market, Cost, Performance, Power, Cost-performance, Flexibility. Each cell holds a rating on a scale from "Very good" through "+ Good" to "-- Very bad".)

Flexibility vs. Energy Efficiency
(Figure: energy efficiency, in MIPS/mW or MOPS/mW, plotted against flexibility for ASIC, ASIP (Xtensa), Domain Specific Processor (DSP), and General Embedded Processor (AMD-K6E).)

Previous Work in ASIP Design
- ASIP architectures and overall design methodologies
  - [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999]
- Application-specific instruction set selection
  - [Choi, 1999], [Gschwind, 1999], [Arnold, 1999]
- Low power ASIP design
  - [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001]
- Commercial offerings
  - Xtensa, ARCtangent, Jazz, SP-5flex, Carmel

Xtensa Architecture (block diagram; source: Xtensa Architecture)
- Blocks: Processor Controls; TRACE Port; JTAG Tap Control; On Chip Debug; Align and Decode; Coprocessor Register File; Coprocessor Execution Units; Window Register File; ALU & Address Generation; MAC 16; Designer Defined Instruction Execution Unit; Instruction Memory or Cache & Tags; Branch Logic & Instruction Fetch; Data Memory or Cache & Tags; Processor Interface; Write Buffer; Timers 1 to n; Special Function Register Access; Data Address Watch 0 to n; Instruction Address Watch 0 to n; Exception Support; Interrupt Control; Memory Protection Unit
- Signals: Instruction; Data; Instruction Address; Data Address
- Legend: Base ISA Feature; Configurable Function; Optional Function; Configurable & Optional Function; Extensible

Xtensa Processor Design Flow (flow diagram)
- Inputs: Processor Configuration Inputs; Designer-Defined Instruction Descriptions; Application Source Code; Sample Application Data
- Generated items: Configuration File; Configured GNU C/C++ Compiler; Configured GNU Assembler/Disassembler; Configured Instruction Set Simulator/Emulator; Configured Processor HDL
- Hardware steps: Area, Power and Timing Estimation; Logic Synthesis (Synopsys or Ambit); Block Place/Route (Avant! or Cadence); Timing Verification; Hardware Profile
- Software steps: Application Specific Compile, Assemble, Link; Application Simulation with ISS and/or Emulator; Software Debugging/Profiling
- Results: Optimized Software; Optimized Hardware
- Legend: Generator Output; Internal Database; Design data; Use of Generated Data
Source:

Manual Custom Instruction Generation Procedure
- Steps
  - Identify potential new instructions
  - Describe custom instructions
  - Insert custom instructions
  - Verify functional correctness
- Designer tasks
  - Profile, read source code
  - Understand source code
  - Rewrite source code
- Slow and error-prone

Contributions of Our Work
- Automatic custom instruction selection
  - From an application program to an extensible processor with custom instructions
- Features
  - Efficient design space search
  - Uses accurate information from the instruction set simulator and synthesis
  - Bridges the gap between automatically synthesized and manually designed architectures

Automatic Custom Instruction Generation Flow

Example Illustration of Template Generation

Key Observations for Pruning
- The higher the weight of a template, the higher its potential for improvement (Amdahl's law)
- The scope for optimization is determined by computation: the number of cycles needed to execute the template
- The scope for optimization is also determined by the read/write port limitation: additional cycles are needed for extra reads/writes of input/output variables
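The first observation can be made concrete with Amdahl's law. The sketch below is not taken from the slides: it assumes a template that accounts for a fraction `weight` of the original execution time and that runs `local_speedup` times faster once it is folded into a custom instruction. The function names are hypothetical.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `weight` of the run time
 * (0 <= weight <= 1) is accelerated by a factor `local_speedup` (> 0).
 * Names and interface are illustrative, not from the original work. */
double amdahl_speedup(double weight, double local_speedup)
{
    assert(weight >= 0.0 && weight <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - weight) + weight / local_speedup);
}

/* Upper bound on the overall speedup as local_speedup grows without limit.
 * Requires weight < 1 so that the bound is finite. */
double amdahl_limit(double weight)
{
    assert(weight >= 0.0 && weight < 1.0);
    return 1.0 / (1.0 - weight);
}
```

For instance, `amdahl_speedup(0.5, 2.0)` is about 1.33, while a template with weight 0.1 can never yield more than `amdahl_limit(0.1)`, about 1.11, however fast the custom instruction is. This is why low-weight templates are natural candidates for pruning.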

Pruning Algorithm
- Ranking criterion, with terms:
  - OriginalTime: Fraction of the total execution time of the original program spent in the template (weight)
  - In, Out: Number of inputs and outputs of the template, respectively
  - α, β: Number of inputs/outputs encoded in the instruction
  - γ: Number of cycles needed for executing the template
- Higher priority means greater potential for speedup

Template Generation with Pruning
(Figure: a ranked pool of seed templates, the highest-priority template, a threshold of 0.1, and the resulting template set. The value 12.73 appears as a label in the figure.)
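The pruning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it assumes the threshold is a ratio applied to the highest priority in the pool, so that templates whose priority falls below `ratio * highest` are dropped. The `Template` record and `prune_templates` are hypothetical names, and the priority values are taken as given because the ranking formula is not reproduced here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record: the slides do not show the data layout. */
typedef struct {
    int    id;        /* template identifier */
    double priority;  /* ranking value, computed elsewhere */
} Template;

/* qsort comparator: orders templates by descending priority. */
static int by_priority_desc(const void *a, const void *b)
{
    double pa = ((const Template *)a)->priority;
    double pb = ((const Template *)b)->priority;
    return (pa < pb) - (pa > pb);
}

/* Sorts `pool` by descending priority and returns how many leading entries
 * survive pruning, i.e. have priority >= ratio * (highest priority).
 * ASSUMPTION: the threshold is relative to the top-ranked template. */
size_t prune_templates(Template *pool, size_t n, double ratio)
{
    if (n == 0)
        return 0;
    qsort(pool, n, sizeof *pool, by_priority_desc);
    double cutoff = ratio * pool[0].priority;
    size_t kept = 0;
    while (kept < n && pool[kept].priority >= cutoff)
        kept++;
    return kept;
}
```

After the call, the first `kept` entries of `pool` form the surviving template set. A larger ratio keeps fewer templates, which trades search time against the chance of discarding a useful candidate.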

No. of Templates vs. Threshold Ratio

Custom Instruction Insertion
- Care must be taken to insert custom instructions into appropriate places without affecting the program's functional correctness
- If custom instructions need extra inputs (outputs), care must be taken to select appropriate variables to write to (read from) user-defined registers

Example Illustration of Custom Instruction Insertion

Example Illustration of Custom Instruction Insertion (Contd.)

(a) Original code:

    ....
    offset = t + 1;
    for (i=0; i<100; i++) {
        j = ....
        result = offset + i * j;
    }
    ....

(b) Code after custom instruction insertion:

    ....
    offset = t + 1;
    WUR(offset,0);
    for (i=0; i<100; i++) {
        j = ....
        result = CustomInstr(i,j);
    }
    ....
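The transformation from (a) to (b) can be checked in plain C by emulating the hardware in software. The sketch below is an illustration only: `WUR` and `CustomInstr` are software stand-ins written for this example (on a real extensible processor they would come from the vendor toolchain), the single-entry `user_reg` array is a hypothetical model of the user-defined register file, and `j = i + 2` is an arbitrary placeholder for the computation elided on the slide.

```c
/* Emulated user-defined register file (hypothetical, one entry). */
static int user_reg[1];

/* Stand-in for "write user register": stores `value` at `index`. */
static void WUR(int value, int index)
{
    user_reg[index] = value;
}

/* Stand-in for the custom instruction: a multiply-add whose third operand
 * (offset) is read implicitly from user register 0. */
static int CustomInstr(int i, int j)
{
    return user_reg[0] + i * j;
}

/* (a) Original loop. */
int loop_original(int t, int n)
{
    int offset = t + 1, result = 0;
    for (int i = 0; i < n; i++) {
        int j = i + 2;               /* placeholder for the elided code */
        result = offset + i * j;
    }
    return result;
}

/* (b) Loop after insertion: offset is written to the user register once,
 * before the loop, because the instruction takes only two explicit inputs. */
int loop_with_custom_instr(int t, int n)
{
    int offset = t + 1, result = 0;
    WUR(offset, 0);
    for (int i = 0; i < n; i++) {
        int j = i + 2;               /* placeholder for the elided code */
        result = CustomInstr(i, j);
    }
    return result;
}
```

Both functions return the same value for any `t` and `n`. Note that `WUR` must follow the assignment to `offset` and precede the loop; placing it anywhere else would break functional correctness, which is the point the slide makes about choosing insertion sites carefully.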

Custom Instruction Combination Selection: Problem Statement
Given a set of non-overlapping custom instructions, with each instruction having several versions, find a version for each instruction such that performance is maximized while area is under a certain threshold.
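The problem statement above has the shape of a multiple-choice knapsack problem. The sketch below solves it by exhaustive enumeration; this is a stand-in chosen for clarity, not the authors' selection procedure. It assumes performance is measured as total cycles saved and that area simply adds up across instructions. All type and function names, and the `MAX_*` limits, are hypothetical.

```c
#include <stddef.h>

#define MAX_VERSIONS 4
#define MAX_INSTRS   8

typedef struct {
    double area;          /* hardware cost of this version */
    long   cycles_saved;  /* cycles saved relative to the base processor */
} Version;

typedef struct {
    size_t  n_versions;
    Version versions[MAX_VERSIONS];
} CustomInstruction;

/* Picks one version per instruction so that total cycles saved is maximal
 * and total area stays within `area_budget`. Writes the chosen version
 * indices to best_pick[0..n-1] and returns the best total, or -1 if no
 * combination fits (or the input exceeds the limits above). */
long select_versions(const CustomInstruction *ins, size_t n,
                     double area_budget, size_t *best_pick)
{
    size_t pick[MAX_INSTRS] = {0};
    long best = -1;

    if (n > MAX_INSTRS)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (ins[i].n_versions == 0 || ins[i].n_versions > MAX_VERSIONS)
            return -1;

    for (;;) {
        double area = 0.0;
        long saved = 0;
        for (size_t i = 0; i < n; i++) {
            const Version *v = &ins[i].versions[pick[i]];
            area += v->area;
            saved += v->cycles_saved;
        }
        if (area <= area_budget && saved > best) {
            best = saved;
            for (size_t i = 0; i < n; i++)
                best_pick[i] = pick[i];
        }
        /* Advance `pick` like a mixed-radix counter; stop after wrap-around. */
        size_t i = 0;
        while (i < n && ++pick[i] == ins[i].n_versions) {
            pick[i] = 0;
            i++;
        }
        if (i == n)
            break;
    }
    return best;
}
```

Enumeration grows exponentially with the number of instructions, so it is only practical for small sets; a dynamic-programming or heuristic search would be the usual replacement at larger scale.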

Custom Instruction Combination Selection --- Flow Chart

Experimental Methodology (flow diagram)
- Input: C Program
- Tools and stages: Automatic Custom Instruction Generation; Aristotle; Xtensa TIE Compiler; Synopsys Design Compiler; Xtensa GNU Profiler; Tensilica Processor Generator; Cross Compiler; ISS; Sente Wattwatcher
- Intermediate items: TIE; Modified C program; Custom Processor (HDL Description); NEC CB11
- Measured outputs: Area; Clock Period; Execution Cycles; Power

Experimental Results (Contd.)
Average:
- Performance improvement: 3.4X
- Energy reduction: 3.2X
- Energy*delay reduction: 12.6X
- Area increase: 1.8%

Conclusions
- Automatic custom instruction synthesis for ASIPs
  - Template generation/selection
  - Custom instruction insertion
  - Custom instruction combination selection
- Experimental results
  - 3.4X average performance improvement
  - 12.6X average energy*delay reduction