Center for Embedded Computer Systems University of California, Irvine Coordinated Coarse-Grain and Fine-Grain Optimizations.


1 Center for Embedded Computer Systems, University of California, Irvine
http://www.cecs.uci.edu/~spark
Coordinated Coarse-Grain and Fine-Grain Optimizations for High-Level Synthesis
Topic Defense
Sumit Gupta

2 High Level Synthesis
- Transform behavioral descriptions to RTL/gate level
- From C to CDFG to Architecture
Example behavioral description:
    x = a + b;
    c = a < b;
    if (c) then d = e - f;
    else g = h + i;
    j = d x g;
    l = e + x;
[Figure: the example's control-data flow graph (an If Node on c with true and false branches) mapped to an architecture with Memory, ALU, Control, and Data path]
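As a rough illustration of the "C to CDFG" step, the slide's example can be written down as basic blocks plus control edges, with data dependencies derived from which operation reads a value another one writes. This is a minimal sketch, not SPARK's internal representation: the block names (bb0, bb_true, bb_false, bb_join), the tuple layout, and the function name are assumptions made for illustration, and the dependency analysis ignores control flow.

```python
# Hypothetical CDFG for the slide's example. Each operation is
# (destination, operator, source operands); block names are illustrative.
cdfg = {
    "bb0":      [("x", "+", ("a", "b")), ("c", "<", ("a", "b"))],
    "bb_true":  [("d", "-", ("e", "f"))],
    "bb_false": [("g", "+", ("h", "i"))],
    "bb_join":  [("j", "*", ("d", "g")), ("l", "+", ("e", "x"))],
}

# Control flow: the If Node on c chooses between bb_true and bb_false.
control_edges = [
    ("bb0", "bb_true"), ("bb0", "bb_false"),
    ("bb_true", "bb_join"), ("bb_false", "bb_join"),
]

def data_dependencies(graph):
    """Return (producer, consumer) pairs of destination variables.

    A pair (p, q) means the operation writing q reads the value p.
    This simple version is flow-insensitive: it only matches names.
    """
    written = {dest for ops in graph.values() for dest, _, _ in ops}
    deps = set()
    for ops in graph.values():
        for dest, _, sources in ops:
            for src in sources:
                if src in written:
                    deps.add((src, dest))
    return deps

deps = data_dependencies(cdfg)
```

Here j depends on both d and g, and l depends on x; c is consumed by the If Node rather than by another operation, so it does not appear as a producer.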

3 High-level Synthesis
- Well-researched area, from the early 1980s, so what's new?
- Level of design entry has moved up from schematic entry to coding in hardware description languages (VHDL, Verilog, C)
- No comprehensive synthesis framework
  - Few and scattered optimizations: mostly algebraic and at operation level of granularity
  - Results presented for scheduling
  - Effects on logic synthesis not understood
- Small, synthetic benchmarks: primarily data-intensive DSP algorithms
- Quality of synthesis results severely affected by complex control flow
  - Nested ifs and loops not handled or handled poorly
- Poor understanding of the interaction between source-level and fine-grain "compiler" transformations

4 Focus of this Work
- Target Applications:
  - Behavioral descriptions with complex and nested conditionals and loops; for example:
    - mixed data and control-intensive multimedia and image processing applications
    - control-intensive microprocessor blocks: resource rich, few highly packed cycles
- Objectives:
  - Improve quality of HLS results by concurrency enhancement
  - Improve controllability of the HLS solutions

5 Characteristics of Target Applications
- Moderately control-intensive behaviors
  - Operations that execute under conditions
  - Entire behaviors within nested loops
- Programming styles significantly affect quality of results:
  - Placement of operations and control flow
  - Choice of control flow: nesting of ifs and loops
- A need for high-level and compiler transformations
  - To overcome the variance due to programming style
  - Increase resource utilization in the presence of conditionals
  - Exploit mutual exclusivity of operations to enhance resource sharing
  - Maximally parallelize operations under given resource constraints

6 Recent Related Work
- Code motions in the presence of conditionals
  - Condition Vector List Scheduling [Wakabayashi 89]
  - Symbolic Scheduling [Radivojevic 96]
  - WaveSched Scheduler [Lakshminarayana 98]
  - Basic Block Control Graph Scheduling [Santos 99]
- Limitations
  - Arbitrary nesting of conditionals and loops not handled or handled poorly
  - Ad hoc optimizations
  - Not part of a complete synthesis system
  - Limited analysis of logic and control costs

7 Parallelizing Compiler Background
- Scheduling for increasing instruction-level parallelism
  - Percolation Scheduling
    - Can produce optimal schedule given enough resources
  - Trailblazing
    - Hierarchical Code Motion Technique
  - Trace Scheduling, Superblock and Hyperblock Scheduling
- Loop Transformations
  - Loop Invariant Code Motion
  - Loop Pipelining
  - Induction Variable Analysis
  - Loop fusion, interchange, distribution
- Partial evaluation
  - CSE, Copy Propagation, Constant Folding

8 In the Context of High-Level Synthesis
- Cost Models are different
  - Operation and Resource Models
  - Non-sequential designs
- Transformations have implications on hardware
  - Non-trivial control costs
  - Operation duplication leads to flexible scheduling; however, it can lead to higher control costs
- Mutual exclusivity of operations
  - Resource Sharing

9 Coarse and Fine-Grain Code Optimizations
- Beyond Basic Block Code Motions
  - Speculation
  - Reverse Speculation
  - Early Condition Execution
  - Conditional Speculation
- Dynamic Common Sub-expression Elimination
- Loop Unrolling
- Loop Index Variable Elimination
- Chaining Operations across Conditionals
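To make loop unrolling and loop index variable elimination concrete, here is a small before/after sketch. It is only an illustration of the idea under simple assumptions (a fixed trip count of 4 and a hypothetical loop body); the function names are invented for this example and it does not reproduce how SPARK applies these passes.

```python
def rolled(v):
    """Original loop: the index i is both an array index and an operand."""
    total = 0
    for i in range(4):
        total += v[i] * (i + 1)
    return total

def unrolled_index_eliminated(v):
    """Fully unrolled loop with the index variable eliminated.

    Each copy of the body uses the constant value i had in that
    iteration, so the index increments, the loop bound comparison,
    and the (i + 1) additions all disappear. The four multiplies no
    longer depend on a shared index variable.
    """
    total = 0
    total += v[0] * 1
    total += v[1] * 2
    total += v[2] * 3
    total += v[3] * 4
    return total
```

Both versions compute the same result; the second exposes operations that a scheduler is free to execute concurrently if resources allow.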

10 Concurrency Enhancement by Code Motions
[Figure: Hierarchical Task Graph representation of a control-data flow graph with an If Node (T/F branches), showing operations a, b, c moved by Speculation, Reverse Speculation, Conditional Speculation, and motion Across Hierarchical Blocks, alongside a Resource Utilization view]

11 Concurrency Enhancement by Code Motions
[Figure: Hierarchical Task Graph representation of a control-data flow graph with an If Node (T/F branches), showing operations a, b, c moved by Speculation, Reverse Speculation, Conditional Speculation, and motion Across Hierarchical Blocks, alongside a Resource Utilization view]
- Leads to Higher Resource Utilization
- Shorter Schedule Lengths
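The effect of speculation can be sketched on the example behavior from slide 2. In the sketch below the operations inside the two branches are computed before the condition is used, into temporaries, and only the assignments stay conditional. This is an illustrative model written as ordinary functions; the temporaries t1 and t2 and the function names are assumptions, and d and g are passed in so that the untaken branch leaves them unchanged.

```python
def original(a, b, e, f, h, i, d, g):
    x = a + b
    c = a < b
    if c:
        d = e - f          # executes only when c is true
    else:
        g = h + i          # executes only when c is false
    return d * g, e + x    # (j, l)

def speculated(a, b, e, f, h, i, d, g):
    x = a + b
    c = a < b
    t1 = e - f             # speculated: no longer waits for c
    t2 = h + i             # speculated: can share a cycle with x, c, t1
    if c:
        d = t1             # only the commit remains conditional
    else:
        g = t2
    return d * g, e + x    # (j, l)
```

Because t1 and t2 do not depend on c, a scheduler with idle functional units could place them in the same cycle as the computation of c, which is the source of the higher utilization and shorter schedules noted on the slide.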

12 Scheduling Heuristic
[Figure: control flow graph of basic blocks BB 0 to BB 7 with candidate operations a, b, c, d and the code motions (Speculate, Across HTG) needed to move each to the block being scheduled]
- Get Available Ops: a, b, c, d
- Determine Code Motions Required
- Assign Cost to each Operation
- Schedule Op with lowest Cost
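The four steps on this slide follow the shape of a priority-based list scheduler. The sketch below shows that general shape only: it assumes single-cycle operations, a fixed number of identical functional units, and a cost table supplied by the caller. The step that determines which code motions are required is not modeled, and the function name, argument names, and the cost values in the usage example are hypothetical.

```python
def list_schedule(ops, preds, num_units, cost):
    """Schedule ops cycle by cycle, lowest cost first.

    ops:       list of operation names
    preds:     dict mapping an op to the set of ops it depends on
    num_units: how many ops can be scheduled per cycle
    cost:      dict mapping an op to its cost (lower is scheduled first)
    Returns a list of cycles, each a list of op names.
    """
    done = set()
    remaining = set(ops)
    schedule = []
    while remaining:
        # Get available ops: all predecessors finished in earlier cycles.
        available = [o for o in remaining if preds.get(o, set()) <= done]
        if not available:
            raise ValueError("cyclic dependencies between operations")
        # Order by assigned cost (ties broken by name for determinism).
        available.sort(key=lambda o: (cost[o], o))
        # Schedule the lowest-cost ops that fit in this cycle.
        step = available[:num_units]
        schedule.append(step)
        done.update(step)
        remaining.difference_update(step)
    return schedule

# Usage with made-up dependencies and costs for ops a, b, c, d.
example_preds = {"c": {"a"}, "d": {"b"}}
example_cost = {"a": 1, "b": 2, "c": 0, "d": 3}
one_unit = list_schedule(["a", "b", "c", "d"], example_preds, 1, example_cost)
two_units = list_schedule(["a", "b", "c", "d"], example_preds, 2, example_cost)
```

With one unit the low cost of c pulls it ahead of b as soon as a has finished; with two units the schedule collapses to two cycles.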

13 Scheduling Heuristic
[Figure: before and after views of the control flow graph (BB 0 to BB 7) for operations a, b, c, d, showing code motion Across HTG and Conditional Speculation, which duplicates operations a and d into the conditional branches]

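14 Dynamic Common Sub-expression Elimination
[Figure: control flow graph of basic blocks BB 0 to BB 7, before and after speculation]
- Before: a = b + c and d = b + c are computed in separate basic blocks
- After speculating b + c: dcse = b + c is computed once, and the original operations become a = dcse and d = dcse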
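The transformation in this figure can be sketched as a rewrite over a table of basic blocks: once b + c has been moved into an earlier block, every later occurrence of the same expression is replaced by a copy of the new temporary. This is a simplified illustration, not the algorithm from the talk. It assumes the target block dominates every block that holds the expression and that b and c are not redefined in between; neither condition is checked. The block names are placeholders, since the exact placement in the figure is not recoverable from the transcript.

```python
def hoist_and_eliminate(blocks, target, expr, temp="dcse"):
    """Compute expr once in the target block and reuse it elsewhere.

    blocks: dict mapping a block name to a list of (dest, expression)
            pairs, where an expression is a tuple like ("+", "b", "c")
    target: block that receives temp = expr (assumed to dominate all uses)
    Returns a new dict; the input is left unmodified.
    """
    if any(e == expr for _, e in blocks[target]):
        raise ValueError("target block already computes the expression")
    result = {}
    for name, ops in blocks.items():
        result[name] = [
            (dest, ("copy", temp)) if e == expr else (dest, e)
            for dest, e in ops
        ]
    # The speculated operation is placed at the end of the target block.
    result[target] = result[target] + [(temp, expr)]
    return result

before = {
    "BB0": [],
    "BB_first": [("a", ("+", "b", "c"))],
    "BB_later": [("d", ("+", "b", "c")), ("k", ("-", "b", "c"))],
}
after = hoist_and_eliminate(before, "BB0", ("+", "b", "c"))
```

The "dynamic" part of the technique is that this check runs during scheduling, after a code motion creates the opportunity, rather than as a one-time pass beforehand.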

15 Interconnect minimization by resource binding
- Minimize the complexity of steering logic
  - Multiplexors and demultiplexors
- Introduce additional interconnect constraints/costs during resource binding
- Operation and Variable binding have been formulated as network flow problems

16 Operation Binding
[Figure: two + operations, one over variables a, b, c and one over variables e, b, f (sharing b), bound to a single ALU]
- Bind Operations with the same inputs or outputs to the same functional unit
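The binding rule on this slide can be illustrated with a greedy heuristic: each operation goes to the free functional unit that already handles the most of its variables, so fewer distinct sources need to be multiplexed onto that unit. Note that this greedy pass is a stand-in chosen for brevity; the talk states that binding was formulated as a network flow problem, which this sketch does not implement. The operation names, cycle numbers, and the reading of the figure as a = b + c and e = b + f are assumptions.

```python
def bind_operations(ops, num_units):
    """Greedily bind operations to functional units.

    ops: list of (name, cycle, inputs, output); two operations in the
         same cycle cannot share a unit
    Returns a dict mapping operation name to unit index.
    """
    units = [{"vars": set(), "cycles": set()} for _ in range(num_units)]
    binding = {}
    for name, cycle, inputs, output in ops:
        op_vars = set(inputs) | {output}
        free = [i for i in range(num_units) if cycle not in units[i]["cycles"]]
        if not free:
            raise ValueError(f"no free unit for {name} in cycle {cycle}")
        # Prefer the unit sharing the most variables; break ties by index.
        best = max(free, key=lambda i: (len(units[i]["vars"] & op_vars), -i))
        units[best]["vars"] |= op_vars
        units[best]["cycles"].add(cycle)
        binding[name] = best
    return binding

example_ops = [
    ("add1", 0, ("b", "c"), "a"),   # a = b + c
    ("add2", 0, ("x", "y"), "z"),   # unrelated op in the same cycle
    ("add3", 1, ("b", "f"), "e"),   # e = b + f, shares input b with add1
]
binding = bind_operations(example_ops, 2)
```

In this example add3 lands on the same unit as add1 because both read b, so that unit's input needs one fewer multiplexer source than if add3 had been placed with add2.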

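17 Variable Binding
[Figure: ALU with variables a, b, c, e, f grouped at its inputs and output]
- Bind Variables that are inputs or outputs to the same functional unit to the same registers

18 Variable Binding
[Figure: ALU with variables a, b, c, e, f grouped at its inputs and output]
- Bind Variables that are inputs or outputs to the same functional unit to the same registers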

19 Implementation SPARK High Level Synthesis Framework

20 Experimental Setup
- Benchmarks derived from several industrial designs
  - MPEG-1 Prediction Block
  - ADPCM Encoder
  - Several image processing passes from GIMP software
- Synthesized using Spark
  - Number of States in FSM
  - Cycles on Longest Path in Design
- RTL VHDL from Spark synthesized using Synopsys
  - Critical Path Length (ns) => dictates Clock Period
  - Unit Area (in terms of synthesis library used)
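The two scheduling metrics listed here can be illustrated on a small acyclic graph of scheduled basic blocks: the controller size is the total number of states, while the cycles on the longest path is the largest sum of states along any entry-to-exit path. This is a sketch under stated assumptions (one cycle per state, no loops in the graph); the block names, state counts, and function names are invented for illustration and are not taken from the benchmarks.

```python
from functools import lru_cache

def fsm_states(states):
    """Controller size: total number of states over all basic blocks."""
    return sum(states.values())

def longest_path_cycles(states, successors, entry):
    """Cycles on the longest path from entry, assuming an acyclic graph
    and one clock cycle per state."""
    @lru_cache(maxsize=None)
    def longest(block):
        tails = (longest(s) for s in successors.get(block, ()))
        return states[block] + max(tails, default=0)
    return longest(entry)

# Hypothetical schedule: states needed by each basic block.
example_states = {"bb0": 2, "bb_true": 3, "bb_false": 1, "bb_join": 2}
example_succ = {
    "bb0": ("bb_true", "bb_false"),
    "bb_true": ("bb_join",),
    "bb_false": ("bb_join",),
}
total_states = fsm_states(example_states)
longest = longest_path_cycles(example_states, example_succ, "bb0")
```

The two numbers differ whenever there are conditionals: mutually exclusive branches both add states to the controller, but only the longer branch contributes to the longest path.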

21 HLS Results for Code Motions
[Chart: Number of States in FSM Controller and Cycles on Longest Path through Design, plotted for each set of allowed code motions]
Allowed Code Motions:
- Within Basic Blocks
- Within BBs, Across Hierarchical Blocks
- Within BBs, Across Hier Blocks, Speculation
- Within BBs, Across Hier Blocks, Speculation, Early Condition Execution
- Within BBs, Across Hier Blocks, Speculation, Early Cond Exec, Conditional Speculation
Overall performance gains of up to 50 % in controller size and longest path cycles

22 Logic Synthesis Results for Code Motions
[Chart: logic synthesis results plotted for each set of allowed code motions]
Allowed Code Motions:
- Within Basic Blocks
- Within BBs, Across Hierarchical Blocks, Speculation
- Within BBs, Across Hier Blocks, Speculation, Early Condition Execution
- Within BBs, Across Hier Blocks, Speculation, Early Cond Exec, Conditional Speculation
Enabling all code motions leads to
- Reduced Circuit Delays: up to 50 %
- Increased Area/interconnect costs:
  - Reduced by interconnect aware resource binding

23 Results after Interconnect Minimization
[Chart: Critical Path, Total Delay, and Unit Area, comparing Naïve Resource Binding with Interconnect Minimizing Resource Binding]
- Reductions in area of between 15-32 %
- Fairly constant critical path lengths and circuit delay

24 Synthesis Results with Dynamic CSE
[Chart: synthesis results for four configurations: No CSE, With CSE, With Dynamic CSE, With CSE & Dynamic CSE]

25 DCSE Synthesis Results: Pred0
[Chart: synthesis results for four configurations: No CSE, With CSE, With Dynamic CSE, With CSE & Dynamic CSE]
- Delays reduce by up to 40 %
- Area reduces by up to 35 %
- Register Usage Reduces!

26 Summary of Work Done
- Priority-based List Scheduling Heuristic
  - Allows control of Code Motions employed
  - Dynamic application of CSE and Copy Propagation
- Speculative Code Motions
- Code Motion Techniques
  - Trailblazing
- Compiler Passes
  - Copy & Constant Propagation
  - Dead Code Elimination
  - Common Sub-Expression Elimination
  - Dynamic Renaming
- Loop Unrolling
- Loop Index Variable Elimination
- Chaining across Conditional blocks
- Interconnect Minimizing Resource Binding
- FSM Generation
  - Non-trivial in the presence of chaining across conditionals and multi-cycle operations
- VHDL Generation

27 Future Directions
- Interactive GUI: ability to
  - Specify scheduling decisions
- Timing Constraints
- Loop Pipelining Heuristic
- Loop Transformations
  - Loop Fusion
  - Loop Invariant Code Motion
- Analysis of Effects of Code Motions on Power
- Ability to model Complex Resources
  - Pipelined Resources
- More Transformations targeting Microprocessor Functional Blocks

28 Thank You

29 Publications
- Dynamic Common Sub-Expression Elimination during Scheduling in High-Level Synthesis. S. Gupta, M. Reshadi, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. To appear in the International Symposium on System Synthesis, October 2002
- Coordinated Transformations for High-Level Synthesis of High Performance Microprocessor Blocks. S. Gupta, T. Kam, M. Kishinevsky, S. Rotem, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. Design Automation Conference, June 2002
- Conditional Speculation and its Effects on Performance and Area for High-Level Synthesis. S. Gupta, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. ISSS 2001
- Speculation Techniques for High Level synthesis of Control Intensive Designs. S. Gupta, N. Savoiu, S. Kim, N.D. Dutt, R.K. Gupta, A. Nicolau. DAC 2001
- Analysis of High-level Address Code Transformations for Programmable Processors. S. Gupta, M. Miranda, F. Catthoor, R. K. Gupta. DATE 2000
Book Chapter:
- ASIC Design. S. Gupta, R. K. Gupta. Chapter 64, The VLSI Handbook, edited by Wai-Kai Chen
Under Submission to Journal:
- Using Global Code Motions to Improve the Quality of Results for High-Level Synthesis. S. Gupta, N. Savoiu, N.D. Dutt, R.K. Gupta, A. Nicolau. Submitted to TCAD

30 Additional Slides

31 SPARK Core Strengths
- Focus on
  - Transformations that increase amount of parallelism available in the source description
  - Tightly integrate with parallelizing compiler transformations
  - Provide a HLS "toolbox" for the micro-architect
- Develop transformations that
  - Limit effects of control-flow
    - Generalized code motions
  - Reduce data dependencies
    - Renaming, loop unrolling, loop index variable elimination

32 SPARK Framework
- Customizable extensible scheduler
  - Range of transformations in modular toolbox
    - Percolation, trailblazing, loop pipelining (RDLP)
  - Selected under heuristics and/or user control
    - Code motion, loop transformations
- Input in C and output to synthesizable RTL VHDL
  - Flow from architecture design to synthesis
- Quality of results measured in terms of
  - Scheduling results: cycles in longest path
  - Controller size: number of states in FSM
  - Logic synthesis results: critical path length, unit area

33 Summary of Work Done
- Developed a set of code transformations targeted towards HLS
- Implemented in a complete high-level synthesis framework
  - Implemented supporting compiler passes
  - Produce synthesizable VHDL output from input C
- Analyzed effects of transformations on final logic synthesis results
- Applied to moderately complex industrial benchmarks

34 Ongoing Work
- Loop Transformations
  - Loop Invariant Code Motion
  - Loop Pipelining Heuristics
  - Loop Fusion
- High-level Power analysis of transformations
  - Can power consumption be reduced despite increased resource utilization?

35 Scheduler Heuristic
[Figure: before and after views of the control flow graph (BB 0 to BB 7) for operations a, b, c, d, annotated with the code motions Speculate, Reverse Speculate, Across HTG, and Conditional Speculation, the last of which duplicates operation a into copies a1 and a2]

