
08/31/2001. Copyright CECS & The Spark Project. SPARK High Level Synthesis System. Sumit Gupta, Timothy Kam, Michael Kishinevsky, Shai Rotem, Nick Savoiu, Nikil Dutt, Rajesh.


1 SPARK High Level Synthesis System
Coordinating Transformations for High-Level Synthesis of High Performance Microprocessor Blocks
Sumit Gupta, Timothy Kam, Michael Kishinevsky, Shai Rotem, Nick Savoiu, Nikil Dutt, Rajesh Gupta, Alex Nicolau
Center for Embedded Computer Systems, University of California, Irvine, http://www.cecs.uci.edu/~spark
Strategic CAD Labs, Design Technologies, Intel Inc, Hillsboro, http://www.intel.com/research/scl
Supported by Semiconductor Research Corporation
08/31/2001. Copyright CECS & The Spark Project

2 Classical High Level Synthesis
[Figure: From C to CDFG to Architecture]
- Classical HLS targets ASIC designs
- Target of this work: Microprocessor Block Design
  - A new domain for the application of high-level synthesis
  - New synthesis methodology has been developed
  - Focus is on code transformations to improve quality of results (QOR)

3 Characteristics of ASIC Design
- Large designs
  - Several ALUs, Multipliers
  - Controller (FSM)
  - Register File
- Multi-cycle implementation
  - Intermediate results stored in latches or pipeline registers

4 HL-Synthesis of ASIC Designs
- Large designs
- Multi-cycle implementation
Implications on High-Level Synthesis Methodology
- Resource constrained
- Extraction of parallelism constrained by area limitations
- Speculation may lead to additional registers
- More conservative with transformations such as loop unrolling

5 Microprocessor Architecture
[Figure: Register File, Instruction Decode, Deeply Pipelined Execution Unit, Specialized Unit]
- Microprocessors:
  - Deeply pipelined
  - Complex blocks within pipeline stages
- Previous work:
  - Pipeline scheduling
  - Mapping applications to a microprocessor architecture

6 Characteristics of Microprocessor Blocks
- Small, Complex Units
  - Several small computation blocks
  - Intermix of control and data logic
- Single or Dual cycle implementation
  - Inputs and outputs are stored in memory elements

7 HL-Synthesis of Microprocessor Blocks
Small designs with high performance requirements
Implications on High-Level Synthesis Methodology
- Area constraints are lax
- Extract maximal parallelism
- All loops have to be unrolled
- Pack all operations into a small number of cycles and in the shortest cycle time
- Operations within behavior are chained together with no intermediate latching
- Changes the stress on which transformations are “useful” and how they must be applied

8 Loop Unrolling
- Loop unrolling is usually restricted for ASIC designs
  - Leads to code explosion
  - In terms of hardware, it means
    - Large FSM controllers
    - Complex interconnect logic
- For Microprocessor Blocks
  - Loops represent a programming convenience
  - Whole loop is scheduled in one/two cycles
    - All iterations have to execute within one/two cycles
  - In hardware, the loop will be unrolled anyway
[Figure: loop with i = 0, test i < N, Loop Body LB(i), and i = i + 1, placed within One Cycle between two sets of Pipeline Registers]

9 Fully Unroll Loops
[Figure: Unroll Loop. The loop with i = 0, test i < N, Loop Body LB(i), and i = i + 1 becomes i = 0 followed by the 1st Iteration LB(0), the 2nd Iteration LB(1), ..., the Nth Iteration LB(N-1)]
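A minimal sketch of what full unrolling looks like at the C level, assuming a fixed bound N = 4. The names `loop_body`, `run_rolled` and `run_unrolled`, and the arithmetic inside `loop_body`, are hypothetical stand-ins for the slide's LB(i); they do not come from the Spark sources.

```c
/* Illustrative only: N and loop_body are hypothetical stand-ins for the
   slide's loop bound and LB(i). */
enum { N = 4 };

/* LB(i): one loop iteration, folding the index into a running value. */
static int loop_body(int i, int acc) {
    return acc * 2 + i;
}

/* Rolled form: i = 0; i < N; i = i + 1. */
int run_rolled(int acc) {
    for (int i = 0; i < N; i = i + 1) {
        acc = loop_body(i, acc);
    }
    return acc;
}

/* Fully unrolled form: LB(0), LB(1), ..., LB(N-1), with no loop control
   left to be turned into a controller. */
int run_unrolled(int acc) {
    acc = loop_body(0, acc);
    acc = loop_body(1, acc);
    acc = loop_body(2, acc);
    acc = loop_body(3, acc);
    return acc;
}
```

Both functions compute the same value for any input; the unrolled one simply exposes every iteration as straight-line code that a scheduler could place in one or two cycles.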

10 Chaining Operations Across Conditional Boundaries

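11 Inserting “Wire-Variables” to enable Chaining
[Figure: control flow graph with basic blocks BB 0 to BB 3 and True/False branches on Cond, before and after inserting Wv]
Before: in one branch of Cond, X = a + b; in the other branch, X = c; after the branches join, Z = X + d
After: in one branch of Cond, Wv = a + b and X = Wv; in the other branch, Wv = c and X = Wv; after the branches join, Z = Wv + d
[Figure: resulting hardware with an ALU, inputs a, b, c, d and Cond, outputs Z and X, and Wv]
Wv is mapped to a wire; all other variables are mapped to registers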
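The rewrite on this slide can be sketched in C as follows. This is only an illustration of the source-level change: the struct `Result` and the function names are hypothetical, and which branch of Cond holds `a + b` is an assumption. In C both versions behave identically; the point of the wire-variable is that a synthesis tool can map `wv` to a wire so that `z` is chained in the same cycle instead of reading a registered `x`.

```c
/* Hypothetical container so both outputs of the slide's example
   (X and Z) can be returned together. */
typedef struct {
    int x;
    int z;
} Result;

/* Before: Z reads X, which would be mapped to a register. */
Result without_wire_variable(int cond, int a, int b, int c, int d) {
    Result r;
    if (cond) {
        r.x = a + b;
    } else {
        r.x = c;
    }
    r.z = r.x + d;
    return r;
}

/* After: both branches write the wire-variable wv, X is a copy of it,
   and Z reads wv directly. */
Result with_wire_variable(int cond, int a, int b, int c, int d) {
    Result r;
    int wv; /* intended to be mapped to a wire, not a register */
    if (cond) {
        wv = a + b;
        r.x = wv;
    } else {
        wv = c;
        r.x = wv;
    }
    r.z = wv + d;
    return r;
}
```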

12 Supporting Transformations: Beyond Basic Block Code Motions
[Figure: code motions of "+" operations around an If Node with T and F branches, labeled Conditional Speculation, Reverse Speculation, Speculation, Across Hierarchical Blocks]
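As a rough illustration of the simplest of these motions, plain speculation, the sketch below hoists two operations above the condition that selects between them. The function names and the choice of operations are hypothetical; the other motions named on the slide are not shown.

```c
/* Before: each operation executes only inside its own branch. */
int select_guarded(int sel, int a, int b) {
    int r;
    if (sel) {
        r = a + b;
    } else {
        r = a - b;
    }
    return r;
}

/* After speculation: both operations are computed unconditionally,
   before sel is consulted; the condition only picks a result. */
int select_speculated(int sel, int a, int b) {
    int sum = a + b;   /* speculated out of the true branch */
    int diff = a - b;  /* speculated out of the false branch */
    return sel ? sum : diff;
}
```

The speculated form spends an extra operation per call, which fits a setting where area constraints are lax and the goal is to shorten the dependence chain.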

13 A Case Study: Instruction Length Decoder
- Validated this methodology using a design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
  - Takes a stream of instructions from memory
  - Decodes the length of these instructions
    - Has to look at up to 4 bytes at a time
  - Has to execute in one cycle
- Implemented this methodology along with supporting transformations in the Spark high-level synthesis (HLS) framework
  - Takes a behavioral description in C as input and produces synthesizable VHDL
  - Has various supporting code optimizations
    - Constant propagation, Dead code elimination

14 Basic Instruction Length Decoder: Initial Description
[Figure: Byte 1 to Byte 4 feed Length Contribution 1 to Length Contribution 4; the checks Need Byte 2?, Need Byte 3? and Need Byte 4? decide which contributions are added (+) to give the Total Length Of Instruction]
- Single Cycle implementation
- Natural behavioral description is sequential and slow
- Must be parallelized and compacted into one cycle with low clock time

15 Instruction Length Decoder: Parallelized Description
- Speculatively calculate the length contribution of all 4 bytes at a time
- Determine actual total length of instruction based on this data
[Figure: Byte 1 to Byte 4 feed Length Contribution 1 to Length Contribution 4 in parallel; the checks Need Byte 2?, Need Byte 3? and Need Byte 4? select which contributions are added (+) to give the Total Length Of Instruction]
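A minimal sketch of the parallelized description for one instruction. The slides do not define how a byte's length contribution or the "need next byte" checks are computed, so `length_contribution` and `need_next_byte` below are invented toy rules, and the assumption that each check depends on the preceding byte is also ours.

```c
#include <stdint.h>

/* Toy rules, NOT the real decoder: hypothetical stand-ins. */
static int length_contribution(uint8_t byte) {
    return 1 + (byte & 0x01);
}

static int need_next_byte(uint8_t byte) {
    return (byte & 0x80) != 0;
}

/* All four contributions and all three checks are computed up front;
   none of them depends on another, so they can be evaluated in parallel.
   Only the final selection looks at the checks. */
int ild_length_parallel(const uint8_t bytes[4]) {
    int lc1 = length_contribution(bytes[0]);
    int lc2 = length_contribution(bytes[1]);
    int lc3 = length_contribution(bytes[2]);
    int lc4 = length_contribution(bytes[3]);

    int need2 = need_next_byte(bytes[0]);
    int need3 = need_next_byte(bytes[1]);
    int need4 = need_next_byte(bytes[2]);

    int length = lc1;
    if (need2) length += lc2;
    if (need2 && need3) length += lc3;
    if (need2 && need3 && need4) length += lc4;
    return length;
}
```

Contributions that turn out not to be needed are simply ignored by the selection step, which is the cost of speculating.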

16 Instruction Length Decoder: Parallelized Description
[Figure: an instruction length calculation (Insn. Len Calc) attached to each of Byte 1 to Byte 5]
- Speculatively calculate length of instructions assuming a new instruction starts at each byte
- Do this calculation for all bytes in parallel
- Traverse from 1st byte to last
  - Determine length of instructions starting from the 1st till the last
  - Discard unused calculations
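The two phases described here can be sketched as below, under several assumptions: the per-byte rules are invented toy functions, bytes past the end of the buffer are read as 0, and all function and parameter names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { MAX_BYTES_EXAMINED = 4 }; /* "up to 4 bytes at a time" */

/* Toy rules, NOT the real decoder: hypothetical stand-ins. */
static int length_contribution(uint8_t byte) { return 1 + (byte & 0x01); }
static int need_next_byte(uint8_t byte) { return (byte & 0x80) != 0; }

/* Length of the instruction that would start at position i.
   Assumption: bytes beyond the buffer read as 0. */
static int length_at(const uint8_t *buf, size_t n, size_t i) {
    int length = 0;
    for (size_t k = 0; k < MAX_BYTES_EXAMINED; k++) {
        uint8_t byte = (i + k < n) ? buf[i + k] : 0;
        length += length_contribution(byte);
        if (!need_next_byte(byte)) break;
    }
    return length;
}

/* Phase 1 fills len_at[] for every byte (iterations are independent).
   Phase 2 walks from the first byte, marking real instruction starts;
   entries of len_at[] that are never visited are the discarded work.
   Returns the number of instructions found. */
size_t mark_instruction_starts(const uint8_t *buf, size_t n,
                               int *len_at, int *is_start) {
    for (size_t i = 0; i < n; i++) {
        len_at[i] = length_at(buf, n, i);
        is_start[i] = 0;
    }
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        is_start[i] = 1;
        count++;
        i += (size_t)len_at[i]; /* always >= 1, so the walk terminates */
    }
    return count;
}
```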

17 Steps Involved in Synthesis of the ILD
- Speculatively calculate all possible lengths of an instruction at byte “i”
  - Achieved by speculative code motions
- Speculatively calculate length of instructions assuming an instruction starts at each byte
  - Achieved by loop unrolling, loop index variable elimination and speculative code motions
- Pack all operations into one cycle
  - Achieved by chaining all operations across conditional boundaries
Step-by-step code refinement is presented in the paper

18 Initial: Multi-Cycle Sequential Architecture
[Figure: Byte 1 to Byte 4 feed Length Contribution 1 to Length Contribution 4, evaluated in sequence through the checks Need Byte 2?, Need Byte 3? and Need Byte 4?]

19 ILD Synthesis: Resultant Architecture
[Figure: the Multi-cycle Sequential Architecture is transformed into the Single cycle Parallel Architecture by: Speculate Operations, Fully Unroll Loop, Eliminate Loop Index Variable]

20 Conclusions
- Demonstrated a high-level synthesis methodology for a new domain: Microprocessor Block Design
  - Small number of Cycles
  - Short Cycle Times
- Extract Maximal Parallelism
  - Aggressive Speculative Code Motions
  - Unrolling loops fully + other loop transformations
- Pack all operations in behavior into a few cycles
  - Chaining operations across conditionals
- Implemented in the Spark HL Synthesis Framework
  - Takes C input and produces synthesizable VHDL
- Industrial case study: Instruction Length Decoder
- Ongoing work => Broaden the application base of this methodology and develop more supporting transformations
[Callout: Very Low Latency]

21 Thank You!

22 Additional Slides

23 Loop Index Variable Elimination
[Figure: Propagate Constant i = 0]
Before: i = 0; R1(i) = Op1(i); R1(i+1) = Op1(i+1); ...; R1(i+N-1) = Op1(i+N-1)
After: i = 0; R1(0) = Op1(0); R1(1) = Op1(1); ...; R1(N-1) = Op1(N-1)
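In C terms, this step amounts to constant propagation of i through an already unrolled loop. The sketch below assumes N = 4 and uses a hypothetical `op1` and an array `r1` in place of the slide's Op1 and R1.

```c
enum { N = 4 };

/* Hypothetical stand-in for the slide's Op1(i). */
static int op1(int i) {
    return 3 * i + 1;
}

/* Unrolled loop that still carries the index variable i. */
void unrolled_with_index(int r1[N]) {
    int i = 0;
    r1[i] = op1(i);
    r1[i + 1] = op1(i + 1);
    r1[i + 2] = op1(i + 2);
    r1[i + 3] = op1(i + 3);
}

/* After propagating the constant i = 0: every index is a literal,
   so no adder or register is needed for i. */
void index_eliminated(int r1[N]) {
    r1[0] = op1(0);
    r1[1] = op1(1);
    r1[2] = op1(2);
    r1[3] = op1(3);
}
```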

24 Original Specification
[Figure: original specification divided into Data Calculation and Control Logic. Transformation applied: Speculate, to speculatively calculate all possible lengths at i]

25 After Speculative Calculation at each byte
[Figure: transformations applied: Unroll Loop, Propagate Loop Index Var. Result: Speculative Calculation of All Instruction Lengths Assuming an Instruction Starts at each Byte]

26 ILD: Final Architecture

27 ILD: Algorithmic Description
Do in a loop, starting with the 1st byte till the Nth byte:
Calculate LC1
if Need 2nd Byte?
  No: Length = LC1
  Yes: Calculate LC2
    if Need 3rd Byte?
      No: Length = LC1 + LC2
      Yes: Calculate LC3
        if Need 4th Byte?
          No: Length = LC1 + LC2 + LC3
          Yes: Calculate LC4; Length = LC1 + LC2 + LC3 + LC4
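The flowchart for a single instruction can be written as nested conditionals in C, as sketched below. The rules in `length_contribution` and `need_next_byte` are invented placeholders, since the slides do not say how LC1 to LC4 or the "Need Byte" checks are computed, and the function name is hypothetical.

```c
#include <stdint.h>

/* Toy rules, NOT the real decoder: hypothetical stand-ins. */
static int length_contribution(uint8_t byte) {
    return 1 + (byte & 0x01);
}

static int need_next_byte(uint8_t byte) {
    return (byte & 0x80) != 0;
}

/* Sequential description: each contribution is computed only after the
   previous check says the byte is needed, so the checks form a chain. */
int ild_length_sequential(const uint8_t bytes[4]) {
    int length = length_contribution(bytes[0]);         /* LC1 */
    if (need_next_byte(bytes[0])) {                     /* Need 2nd Byte? */
        length += length_contribution(bytes[1]);        /* LC2 */
        if (need_next_byte(bytes[1])) {                 /* Need 3rd Byte? */
            length += length_contribution(bytes[2]);    /* LC3 */
            if (need_next_byte(bytes[2])) {             /* Need 4th Byte? */
                length += length_contribution(bytes[3]); /* LC4 */
            }
        }
    }
    return length;
}
```

This is the "natural" form that is easy to write but serializes the work; it is the starting point for the speculation and unrolling steps described in the deck.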

