
Coordinating Transformations for High-Level Synthesis of High Performance Microprocessor Blocks
SPARK High Level Synthesis System
Sumit Gupta, Timothy Kam, Michael Kishinevsky, Shai Rotem, Nick Savoiu, Nikil Dutt, Rajesh Gupta, Alex Nicolau
Center for Embedded Computer Systems, University of California, Irvine
Strategic CAD Labs, Design Technologies, Intel Inc., Hillsboro
Supported by the Semiconductor Research Corporation
08/31/2001 Copyright CECS & The Spark Project

2 Classical High Level Synthesis: From C to CDFG to Architecture
- Classical HLS targets ASIC designs
- Target of this work: microprocessor block design, a new domain for the application of high-level synthesis
- A new synthesis methodology has been developed; the focus is on code transformations to improve quality of results (QoR)

3 Characteristics of ASIC Design
- Large designs
  - Several ALUs and multipliers
  - Controller (FSM)
  - Register file
- Multi-cycle implementation
  - Intermediate results stored in latches or pipeline registers

4 HL-Synthesis of ASIC Designs
Large designs with multi-cycle implementations. Implications on the high-level synthesis methodology:
- Resource constrained: extraction of parallelism is constrained by area limitations
- Speculation may lead to additional registers
- More conservative with transformations such as loop unrolling

5 Microprocessor Architecture
(Figure: microprocessor pipeline with instruction decode, register file, a deeply pipelined execution unit, and a specialized unit)
- Microprocessors are deeply pipelined, with complex blocks within the pipeline stages
- Previous work: pipeline scheduling; mapping applications to a microprocessor architecture

6 Characteristics of Microprocessor Blocks
- Small, complex units: several small computation blocks with an intermix of control and data logic
- Single- or dual-cycle implementation
- Inputs and outputs are stored in memory elements

7 HL-Synthesis of Microprocessor Blocks
Small designs with high performance requirements. Implications on the high-level synthesis methodology:
- Area constraints are lax: extract maximal parallelism
- All loops have to be unrolled
- Pack all operations into a small number of cycles and the shortest cycle time; operations within a behavior are chained together with no intermediate latching
- This changes the stress on which transformations are “useful” and how they must be applied

8 Loop Unrolling
- Loop unrolling is usually restricted for ASIC designs
  - It leads to code explosion; in terms of hardware, this means large FSM controllers and complex interconnect logic
- For microprocessor blocks, loops represent a programming convenience
  - The whole loop is scheduled in one or two cycles: all iterations have to execute within that one or two cycles
  - In hardware, the loop will be unrolled anyway
(Figure: a loop "i = 0; i < N; i = i + 1" around body LB(i), executed between pipeline registers in one cycle)

9 Fully Unroll Loops
(Figure: the loop "i = 0; i < N; i = i + 1" around body LB(i) is unrolled into the sequence 1st iteration LB(0), 2nd iteration LB(1), ..., Nth iteration LB(N-1))
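
As an illustration of what full unrolling does at the source level, here is a minimal C sketch; LB, N, and the in/out arrays are hypothetical placeholders, not taken from the actual design:

```c
#define N 4

/* Stand-in loop body; the real LB(i) would be the block's computation. */
static int LB(int x) { return x + 1; }

/* Before: the loop is only a programming convenience in the C input. */
void loop_rolled(const int in[N], int out[N]) {
    for (int i = 0; i < N; i = i + 1)
        out[i] = LB(in[i]);                 /* loop body LB(i) */
}

/* After full unrolling: one copy of the body per iteration and no
   loop-back edge, so all copies can be scheduled into the same cycle. */
void loop_unrolled(const int in[N], int out[N]) {
    out[0] = LB(in[0]);                     /* 1st iteration LB(0)   */
    out[1] = LB(in[1]);                     /* 2nd iteration LB(1)   */
    out[2] = LB(in[2]);
    out[3] = LB(in[3]);                     /* Nth iteration LB(N-1) */
}
```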

10 Chaining Operations Across Conditional Boundaries

11 Inserting “Wire-Variables” to enable Chaining
Before (basic blocks BB0-BB3, with X a register written on both the True and False branches of Cond):
  if (Cond) X = a + b; else X = c;  followed by  Z = X + d
After inserting the wire-variable Wv:
  if (Cond) { Wv = a + b; X = Wv; } else { Wv = c; X = Wv; }  followed by  Z = Wv + d
Wv is mapped to a wire; all other variables are mapped to registers.
(Figure: the resulting datapath, in which the ALU producing a + b or c, the multiplexer on Cond, and the addition with d are chained)
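
To make the transformation concrete, here is a hedged C rendering of the slide's example; the variable names follow the figure, and mapping Wv to a wire (rather than a register) is a binding decision made by the synthesis tool, not something expressed in the C itself:

```c
/* Before: X is a register written in both branches, so the final addition
   cannot be chained past the conditional. */
int chain_before(int a, int b, int c, int d, int cond) {
    int X;
    if (cond) X = a + b;      /* BB1 (True branch)  */
    else      X = c;          /* BB2 (False branch) */
    return X + d;             /* BB3: Z = X + d     */
}

/* After inserting the wire-variable Wv: each branch writes Wv and copies it
   into the register X; the consumer reads Wv, so the adder for a + b (or c),
   the multiplexer on cond, and the addition with d chain in one cycle. */
int chain_after(int a, int b, int c, int d, int cond, int *X) {
    int Wv;                   /* mapped to a wire by the tool */
    if (cond) { Wv = a + b; *X = Wv; }
    else      { Wv = c;     *X = Wv; }
    return Wv + d;            /* Z */
}
```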

12 Supporting Transformations: Beyond Basic Block Code Motions
(Figure: an if node with True and False branches, annotated with the directions of the possible code motions)
- Speculation
- Reverse speculation
- Conditional speculation
- Speculation across hierarchical blocks
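
As a quick illustration of the simplest of these motions, speculation hoists an operation above the conditional that guards it. The following generic C fragment is illustrative only and is not taken from the slides:

```c
/* Before: the addition waits until the condition has been evaluated. */
int spec_before(int a, int b, int cond, int x) {
    if (cond)
        x = a + b;
    return x;
}

/* After speculation: the addition executes unconditionally into a new
   temporary, and the conditional only commits the result, so the ALU
   work can overlap with the evaluation of cond. */
int spec_after(int a, int b, int cond, int x) {
    int t = a + b;            /* speculated operation */
    if (cond)
        x = t;                /* commit only when the condition holds */
    return x;
}
```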

13 A Case Study: Instruction Length Decoder
- Validated this methodology using a design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
  - Takes a stream of instructions from memory and decodes the length of these instructions
  - Has to look at up to 4 bytes at a time
  - Has to execute in one cycle
- Implemented this methodology along with supporting transformations in the Spark high-level synthesis (HLS) framework
  - Takes a behavioral description in C as input and produces synthesizable VHDL
  - Has various supporting code optimizations: constant propagation, dead code elimination

14 Basic Instruction Length Decoder: Initial Description
(Figure: bytes 1-4 produce length contributions 1-4, each gated by a "Need Byte 2/3/4?" check; their sum is the total length of the instruction)
- Single-cycle implementation
- The natural behavioral description is sequential and slow
- It must be parallelized and compacted into one cycle with a low clock time
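
The natural sequential description might look roughly like the following C sketch; length_contrib and need_next are hypothetical helpers standing in for the real per-byte decode logic:

```c
/* Hypothetical placeholders for the real per-byte decode tables. */
static int length_contrib(unsigned char b) { return (b & 0x3) + 1; }
static int need_next(unsigned char b)      { return (b & 0x80) != 0; }

/* Sequential description: each length contribution is computed only after
   the previous byte has shown that the next byte is needed. */
int decode_length(const unsigned char byte[4]) {
    int len = length_contrib(byte[0]);          /* Length Contribution 1 */
    if (need_next(byte[0])) {                   /* Need Byte 2?          */
        len += length_contrib(byte[1]);         /* Length Contribution 2 */
        if (need_next(byte[1])) {               /* Need Byte 3?          */
            len += length_contrib(byte[2]);     /* Length Contribution 3 */
            if (need_next(byte[2]))             /* Need Byte 4?          */
                len += length_contrib(byte[3]); /* Length Contribution 4 */
        }
    }
    return len;    /* total length of the instruction */
}
```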

15 Instruction Length Decoder: Parallelized Description
- Speculatively calculate the length contribution of all 4 bytes at a time
- Determine the actual total length of the instruction based on this data
(Figure: length contributions 1-4 are computed from bytes 1-4 in parallel; the "Need Byte 2/3/4?" checks then select which contributions are summed into the total length)
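
A corresponding hedged C sketch of the speculated version: all contributions and all "need next byte" flags are computed unconditionally, and the conditionals only select the total. The helpers are the same hypothetical placeholders as in the previous sketch, repeated so this fragment stands alone:

```c
static int length_contrib(unsigned char b) { return (b & 0x3) + 1; }  /* placeholder */
static int need_next(unsigned char b)      { return (b & 0x80) != 0; } /* placeholder */

int decode_length_speculative(const unsigned char byte[4]) {
    /* Speculation: every contribution and every "need next byte" flag is
       computed up front; in hardware these all evaluate in parallel. */
    int lc1 = length_contrib(byte[0]), n2 = need_next(byte[0]);
    int lc2 = length_contrib(byte[1]), n3 = need_next(byte[1]);
    int lc3 = length_contrib(byte[2]), n4 = need_next(byte[2]);
    int lc4 = length_contrib(byte[3]);

    /* The control logic only selects among the precomputed sums. */
    if (!n2) return lc1;
    if (!n3) return lc1 + lc2;
    if (!n4) return lc1 + lc2 + lc3;
    return lc1 + lc2 + lc3 + lc4;
}
```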

16 Instruction Length Decoder: Parallelized Description
(Figure: instruction length calculations for instructions starting at byte 1, byte 2, byte 3, byte 4, byte 5, ... performed side by side over the byte stream)
- Speculatively calculate the length of instructions assuming a new instruction starts at each byte; do this calculation for all bytes in parallel
- Traverse from the 1st byte to the last, determining the lengths of the instructions starting from the 1st till the last; discard the unused calculations
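
A possible C sketch of this per-byte speculation over a fetch window; NBYTES and insn_length_at are hypothetical, and the trivial stub only stands in for the real length calculation:

```c
#define NBYTES 16   /* hypothetical width of the fetched byte window */

/* Trivial stand-in so the sketch is self-contained; the real logic would
   combine the per-byte length contributions and always returns at least 1. */
static int insn_length_at(const unsigned char bytes[NBYTES], int i) {
    return (bytes[i] & 0x3) + 1;
}

void decode_window(const unsigned char bytes[NBYTES], int mark[NBYTES]) {
    int len[NBYTES];

    /* Speculation: for every byte position, compute the length of the
       instruction that would start there; in hardware all of these
       calculations run in parallel. */
    for (int i = 0; i < NBYTES; i++)
        len[i] = insn_length_at(bytes, i);

    /* Traverse from the 1st byte to the last, keeping only the lengths that
       fall on real instruction boundaries; the other results are discarded. */
    for (int i = 0; i < NBYTES; i++)
        mark[i] = 0;
    for (int i = 0; i < NBYTES; i += len[i])
        mark[i] = len[i];       /* an instruction starts at byte i */
}
```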

17 Steps Involved in Synthesis of the ILD
- Speculatively calculate all possible lengths of an instruction at byte "i"
  - Achieved by speculative code motions
- Speculatively calculate the length of instructions assuming an instruction starts at each byte
  - Achieved by loop unrolling, loop index variable elimination, and speculative code motions
- Pack all operations into one cycle
  - Achieved by chaining all operations across conditional boundaries
The step-by-step code refinement is presented in the paper.

18 Initial: Multi-Cycle Sequential Architecture
(Figure: the sequential architecture computes length contributions 1-4 from bytes 1-4 one after another, each stage gated by the corresponding "Need Byte 2/3/4?" check)

19 ILD Synthesis: Resultant Architecture
Speculating operations, fully unrolling the loop, and eliminating the loop index variable transform the multi-cycle sequential architecture into a single-cycle parallel architecture.

20 Conclusions
- Demonstrated a high-level synthesis methodology for a new domain: microprocessor block design
  - Very low latency: a small number of cycles with short cycle times
- Extract maximal parallelism
  - Aggressive speculative code motions
  - Unrolling loops fully + other loop transformations
- Pack all operations in the behavior into a few cycles
  - Chaining operations across conditionals
- Implemented in the Spark HL synthesis framework
  - Takes C input and produces synthesizable VHDL
- Industrial case study: Instruction Length Decoder
- Ongoing work: broaden the application base of this methodology and develop more supporting transformations

21 Thank You !

22 Additional Slides

23 Loop Index Variable Elimination
Before (after unrolling, with i = 0):
  R1(i) = Op1(i); R1(i+1) = Op1(i+1); ... ; R1(i+N-1) = Op1(i+N-1)
After propagating the constant i = 0:
  R1(0) = Op1(0); R1(1) = Op1(1); ... ; R1(N-1) = Op1(N-1)
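
A small C sketch of this step, with hypothetical Op1 and N: both functions compute the same values, but in the second every index is a literal, so the loop index register and its increment logic can be removed:

```c
#define N 3
static int Op1(int i) { return 2 * i; }   /* stand-in operation */

/* Before: after unrolling, every copy of the body still reads the index i. */
void before_elim(int R1[N]) {
    int i = 0;
    R1[i]         = Op1(i);
    R1[i + 1]     = Op1(i + 1);
    R1[i + N - 1] = Op1(i + N - 1);
}

/* After propagating the constant i = 0: every index is a literal, so the
   loop index variable (and the hardware that maintained it) is eliminated. */
void after_elim(int R1[N]) {
    R1[0]     = Op1(0);
    R1[1]     = Op1(1);
    R1[N - 1] = Op1(N - 1);
}
```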

24 Original Specification
(Figure: the original code, with the control logic and the data calculation marked; the data calculation is speculated so that all possible lengths at byte i are calculated up front)

25 After Speculative Calculation at each Byte
(Figure: the code after unrolling the loop and propagating the loop index variable, showing the speculative calculation of all instruction lengths assuming an instruction starts at each byte)

26 ILD: Final Architecture

27 ILD: Algorithmic Description
Do in a loop, starting with the 1st byte till the Nth byte:
- Calculate LC1; if the 2nd byte is not needed, Length = LC1
- Otherwise calculate LC2; if the 3rd byte is not needed, Length = LC1 + LC2
- Otherwise calculate LC3; if the 4th byte is not needed, Length = LC1 + LC2 + LC3
- Otherwise calculate LC4; Length = LC1 + LC2 + LC3 + LC4
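
Restated as a hedged C sketch (the per-byte helpers are hypothetical placeholders, as in the earlier sketches): the outer loop runs from the 1st byte to the Nth byte, and for each starting position the length contributions LC1..LC4 are accumulated while further bytes are needed:

```c
/* Hypothetical per-byte decode placeholders; the real tables come from the
   instruction set. */
static int length_contrib(unsigned char b) { return (b & 0x3) + 1; }
static int need_next(unsigned char b)      { return (b & 0x80) != 0; }

/* "Do in a loop starting with the 1st byte till the Nth byte": for each
   starting position, accumulate LC1..LC4 while further bytes are needed. */
void ild_lengths(const unsigned char bytes[], int n, int length[]) {
    for (int i = 0; i + 3 < n; i++) {
        int len = length_contrib(bytes[i]);              /* LC1        */
        if (need_next(bytes[i])) {                       /* 2nd byte?  */
            len += length_contrib(bytes[i + 1]);         /* + LC2      */
            if (need_next(bytes[i + 1])) {               /* 3rd byte?  */
                len += length_contrib(bytes[i + 2]);     /* + LC3      */
                if (need_next(bytes[i + 2]))             /* 4th byte?  */
                    len += length_contrib(bytes[i + 3]); /* + LC4      */
            }
        }
        length[i] = len;
    }
}
```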