1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

Presentation transcript:

1 Towards Optimal Custom Instruction Processors
Wayne Luk
Kubilay Atasu, Rob Dimond and Oskar Mencer
Department of Computing, Imperial College London
HOT CHIPS 18

2 Overview
1. background: extensible processors
2. design flow: C to custom processor silicon
3. instruction selection: bandwidth/area constraints
4. application-specific processor synthesis
5. results: 3x area delay product reduction
6. current and future work + summary

3 1. Instruction-set extensible processors
● base processor + custom logic
  – partition data-flow graphs into custom instructions
[Figure labels: data in, Register File, ALU, data out]

4 Previous work
● many techniques, e.g.
  – Atasu et al. (DAC 03)
  – Goodwin and Petkov (CASES 03)
  – Clark et al. (MICRO 03, HOT CHIPS 04)
● current challenges
  – optimality and robustness of heuristics
  – complete tool chain: application to silicon
  – research infrastructure for custom processor design

5 2. Custom processor research at Imperial
● focus on effective optimization techniques
  – e.g. Integer Linear Programming (ILP)
● complete tool-chain
  – high-level descriptions to custom processor silicon
● open infrastructure for research in
  – custom processor synthesis
  – automatic customization techniques
● current tools
  – optimizing compiler (Trimaran) for custom CPUs
  – custom processor synthesis tool

6 Application to custom processor flow
[Flow diagram labels: Application Source (C); Template Generation; Template Selection; Area Constraint; Generate Custom Unit; Generate Base CPU; Processor Description; ASIC Tools; Area, Timing]

7 Custom instruction model
[Figure labels: Register File; input ports; output ports; Input Register; Pipeline Register; Output Register]

8 3. Optimal instruction identification
● minimize schedule length of program data flow graphs (DFGs)
● subject to constraints
  – convexity: ensure feasible schedules
  – fixed processor critical path: pipeline for multi-cycle instructions
  – fixed data bandwidth: limited by register file ports
● steps: based on Integer Linear Programming (ILP)
  a. template generation
  b. template selection
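
The convexity and data-bandwidth constraints on this slide can be illustrated with a small check on a candidate subgraph. This is a minimal sketch, not the ILP formulation used in the talk: the edge-list representation and the function names `reachable`, `is_convex` and `port_counts` are assumptions made for illustration. A subgraph is treated as convex when no path leaves it and re-enters it through an outside node, and its port demand is counted as the number of distinct outside producers and the number of inside nodes whose value is consumed outside.

```python
from collections import defaultdict


def reachable(adj, sources):
    """Return every node reachable from `sources` by following edges in `adj`."""
    seen, stack = set(), list(sources)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_convex(edges, subgraph):
    """True if no path between two subgraph nodes passes through an outside node."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    sub = set(subgraph)
    downstream = reachable(succ, sub) - sub  # outside nodes fed by the subgraph
    upstream = reachable(pred, sub) - sub    # outside nodes feeding the subgraph
    # An outside node that is both fed by and feeds the subgraph breaks convexity.
    return not (downstream & upstream)


def port_counts(edges, subgraph):
    """Return (inputs, outputs) that the subgraph needs from the register file."""
    sub = set(subgraph)
    inputs = {u for u, v in edges if v in sub and u not in sub}
    outputs = {u for u, v in edges if u in sub and v not in sub}
    return len(inputs), len(outputs)


# Hypothetical DFG: x and y feed a; a feeds b and c; b feeds c; c feeds out.
dfg = [("x", "a"), ("y", "a"), ("a", "b"), ("b", "c"), ("a", "c"), ("c", "out")]
print(is_convex(dfg, {"a", "b", "c"}), port_counts(dfg, {"a", "b", "c"}))
print(is_convex(dfg, {"a", "c"}))  # b lies on a path from a to c
```

In the example, {a, b, c} is convex and needs two inputs and one output, while {a, c} is rejected because node b sits between its two members.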

9 a. Template generation
1. Solve ILP for DFG to generate a template
2. Collapse template to a single DFG node
3. Repeat while (objective > 0)
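
The three steps above can be sketched as a loop around a pluggable solver. This is a sketch under stated assumptions: `collapse`, `generate_templates` and `toy_solver` are hypothetical names, and `toy_solver` is only a stand-in for the ILP solve (it ignores the convexity, critical-path and bandwidth constraints and simply picks the first edge between two uncollapsed nodes). Only the generate, collapse, repeat structure follows the slide.

```python
def collapse(edges, template, name):
    """Replace all nodes in `template` by one node called `name`."""
    merged = set()
    for u, v in edges:
        u2 = name if u in template else u
        v2 = name if v in template else v
        if u2 != v2:  # edges internal to the template disappear
            merged.add((u2, v2))
    return sorted(merged)


def generate_templates(edges, solve):
    """Repeat: solve for a template, collapse it, stop when the objective is 0."""
    templates = []
    while True:
        template, objective = solve(edges)
        if objective <= 0:
            break
        name = f"T{len(templates)}"
        templates.append((name, frozenset(template)))
        edges = collapse(edges, template, name)
    return templates, edges


def toy_solver(edges):
    """Stand-in for the ILP: first edge whose endpoints are both uncollapsed.

    Collapsed nodes are recognised by the "T" prefix used above, which is an
    assumption of this toy example only.
    """
    for u, v in edges:
        if not u.startswith("T") and not v.startswith("T"):
            return {u, v}, 1
    return set(), 0


# Hypothetical four-operation chain a -> b -> c -> d.
chain = [("a", "b"), ("b", "c"), ("c", "d")]
found, remaining = generate_templates(chain, toy_solver)
print(found)      # two templates: {a, b} and {c, d}
print(remaining)  # the collapsed graph: T0 -> T1
```

Passing the solver in as a function keeps the loop independent of how each template is found, so an ILP-based solver could be substituted without changing the iteration.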

10 b. Template selection
● determine isomorphism classes
  – find templates that can be implemented using the same instruction
  – calculate speed-up potential of each class
● solve Knapsack problem using ILP
  – maximize speedup within area constraint
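
The selection step is a 0/1 knapsack: each isomorphism class has an area cost and a speed-up potential, and the total area is bounded. The slide solves this with ILP; the sketch below swaps in the textbook dynamic-programming solution, which gives the same optimum when areas are integers. The function name `select_templates`, the class names and all numbers are hypothetical.

```python
def select_templates(classes, area_budget):
    """Pick classes maximising total gain with total area <= area_budget.

    classes: list of (name, area, gain) with integer area units.
    Returns (best_gain, tuple_of_chosen_names).
    """
    # best[a] holds the best (gain, chosen names) using at most `a` area units.
    best = [(0, ())] * (area_budget + 1)
    for name, area, gain in classes:
        # Iterate downwards so each class is used at most once.
        for a in range(area_budget, area - 1, -1):
            candidate = best[a - area][0] + gain
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - area][1] + (name,))
    return best[area_budget]


# Hypothetical classes: (name, area units, cycles saved).
candidates = [("sbox", 6, 50), ("rot", 3, 20), ("xor3", 4, 25), ("mix", 5, 30)]
print(select_templates(candidates, 10))  # (75, ('sbox', 'xor3'))
```

With a budget of 10 area units the best choice is `sbox` plus `xor3`, which uses the whole budget and saves 75 cycles; the runner-up, `sbox` plus `rot`, saves 70.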

11 Optimizing compilation flow
[Flow diagram labels: Application in C/C++; Impact Front-end; CDFG Formation; a) Template Generation; b) Template Selection; MDES Generation; Instruction Replacement; Scheduling, Reg. Allocation; Elcor Backend; Assembly Code and Statistics; Gain; Data Bandwidth Constraints; Area Constraints; Synopsys Synthesis; Area; VHDL]

12 4. Application-specific processor synthesis
● design space exploration framework
  – Processor Component Library
  – specialized structural description
● prototype: MIPS integer instruction set
  – custom instructions
  – flexible micro-architecture
● evaluate using actual implementation
  – timing and area

13 Processor synthesis flow
[Flow diagram labels: Custom Data paths from compiler; Processor Component Library; pipeline stages FE, EX, MW; interface; Custom Processor]
● merging
● add state registers
● processor interface
● pipeline description
● parameters
● interface: data in/out, stall control

14 Implementation
● based on Python scripts
  – structural meta-language for processors
  – combine RTL (Verilog/VHDL) IP blocks
  – module generators for custom units
● generate 100s of designs automatically
  – ASIC processor cores
  – complete system on FPGA: CPU + memory + I/O
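
To give a flavour of what a Python module generator for a custom unit might look like, the sketch below emits Verilog text for a combinational XOR of several operands. It is purely illustrative and is not the tool described in the talk: the function name `xor_tree_module`, the port names and the generated module are all assumptions.

```python
def xor_tree_module(name, n_inputs, width=32):
    """Return Verilog source for a combinational XOR of `n_inputs` operands."""
    ports = [f"    input  wire [{width - 1}:0] in{i}," for i in range(n_inputs)]
    expr = " ^ ".join(f"in{i}" for i in range(n_inputs))
    lines = [f"module {name} ("] + ports + [
        f"    output wire [{width - 1}:0] out",
        ");",
        f"    assign out = {expr};",
        "endmodule",
    ]
    return "\n".join(lines)


# Generate a hypothetical three-input custom unit.
print(xor_tree_module("cust_xor3", 3))
```

Because the generator is an ordinary function of its parameters, sweeping `n_inputs` or `width` in a loop is one way a script could produce many design variants automatically.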

15 5. Results
● cryptography benchmarks: C source
  – AES decrypt, AES encrypt, DES, MD5, SHA
● 4/5 stage pipelined MIPS base processor
  – 0.225 mm² area, 200 MHz clock speed
  – single issue processor
  – register file with 2 input ports, 1 output port
● processors synthesized to 130nm library
  – Synopsys DC and Cadence SoC Encounter
  – also synthesize to Xilinx FPGA for testing

16 AES Decryption Processor
● 130nm CMOS, 200 MHz, 0.307 mm²
● 35% area cost (mostly one instruction)
● 76% cycle reduction
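
The figures on this slide can be combined into an area-delay product as a quick sanity check. This is a back-of-the-envelope sketch, assuming the custom and base processors run at the same 200 MHz clock (so delay scales with cycle count) and taking the 0.225 mm² base area from the Results slide; the function name is hypothetical.

```python
def area_delay_improvement(base_area, custom_area, cycle_reduction):
    """Return how many times smaller the area-delay product becomes."""
    area_ratio = custom_area / base_area   # area grows
    delay_ratio = 1.0 - cycle_reduction    # cycles shrink, clock unchanged
    return 1.0 / (area_ratio * delay_ratio)


# AES decryption: 0.225 mm² base, 0.307 mm² custom, 76% fewer cycles.
improvement = area_delay_improvement(0.225, 0.307, 0.76)
print(f"{improvement:.2f}x")  # about 3.05x
```

The quoted areas give an area ratio of about 1.36, close to the 35% area cost stated on the slide, and the resulting improvement of roughly 3x matches the area-delay product reduction claimed in the overview.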

18 Execution time
[Chart labels: custom instruction configurations: 4 inputs, 1 output; 4 inputs, 4 outputs; 4 inputs, 2 outputs; 4 inputs, 1 output. Reductions shown: 76%, 63%, 43%]
Register file in all cases: 2 input ports, 1 output port

19 Timing
● 48% of designs meet timing at 200 MHz without manual optimization

20 Area (for maximum speedup)
[Chart: area overhead values shown: 35%, 28%, 42%, 93%, 23%]

21 6. Current and future work
● support memory access in custom instructions
  – automate data partitioning for memory access
  – automate SIMD load/store instructions for state registers
● use architectural techniques e.g. shadow registers
  – improve bandwidth without additional register file ports
● study trade-offs for VLIW style
  – multiple register file ports
  – multiple issue and custom instructions
● extend compiler: e.g. ILP model for cyclic graphs
  – adapt software pipelining for hardware

22 Summary
● complete flow from C to custom processor
● automatic instruction set extension
  – based on integer linear programming
  – optimize schedule length under constraints
● application-specific processor synthesis
  – complete flow: permits real hardware evaluation
● up to 76% reduction in execution cycles
  – 3x area delay product reduction
● max speedup: 23% to 93% area overhead