
1 Loop Shifting and Compaction for the High-Level Synthesis of Designs with Complex Control Flow
Sumit Gupta, Nikil Dutt, Rajesh Gupta, Alex Nicolau
Center for Embedded Computer Systems, University of California, Irvine and San Diego
http://www.cecs.uci.edu/~spark
Supported by Semiconductor Research Corporation & Intel Inc

2 High Level Synthesis
Copyright Center for Embedded Computer Systems 2004
Transform behavioral descriptions to RTL/gate level: from C to CDFG to architecture.
Example behavioral description:
    x = a + b;
    c = a < b;
    if (c) then d = e - f;
    else g = h + i;
    j = d x g;
    l = e + x;
[Figure: CDFG of this description, with an If node on c and its true and false branches, next to the target architecture: memory, ALU, control, data path]
Problem # 1: Poor quality of HLS results beyond straight-line behavioral descriptions
Problem # 2: Poor/No controllability of the HLS results

3 Related Work: Loop/Software Pipelining
- Loop pipelining has been explored extensively in the compiler community
- Early work in high-level synthesis (HLS) was on innermost loops of DSP applications
- Loop winding, loop folding, percolation based synthesis, rotation scheduling
- Loop pipelining during synthesis of designs with complex control flow has received less attention
- Control and interconnect (multiplexer) costs are a major concern
- Most analysis is limited to the number of cycles in the schedule
- We found that circuit area and delay can be worse if pipelining is not applied judiciously
- Our aim: exploit inter-loop iteration parallelism
  - Look beyond scheduling results, to circuit area and delay
  - For large designs with complex control flow (nested conditionals/loops)

4 Loop/Software Pipelining: Basics
[Figure: successive iterations (Iteration 1 to Iteration 5) of a loop whose body contains operations a, b, c, d, e]
- Find a repeating pattern across loop iterations
- Re-structure loop with this pattern

5 Loop/Software Pipelining: Basics
[Figure: Iterations 1 to 3 of the loop with operations a, b, c, d, e, divided into Loop Prologue, Repeating Pattern, and Loop Epilogue]
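The prologue / repeating pattern / epilogue structure can be sketched in a few lines of Python. This is only an illustration under assumptions: the two-stage loop body (`stage_a`, `stage_b`) and every function name are hypothetical, whereas the slides use five operations a to e.

```python
def stage_a(x):
    # Hypothetical first operation of the loop body.
    return x * 3


def stage_b(t):
    # Hypothetical second operation; it depends on stage_a of the same iteration.
    return t + 1


def loop_plain(xs):
    # Original loop: each iteration runs stage_a and then stage_b.
    out = []
    for x in xs:
        t = stage_a(x)
        out.append(stage_b(t))
    return out


def loop_pipelined(xs):
    # Same computation restructured as prologue, repeating pattern, epilogue.
    out = []
    if not xs:
        return out
    # Loop prologue: start iteration 0.
    t = stage_a(xs[0])
    # Repeating pattern: finish iteration i while starting iteration i + 1.
    # The two calls are independent, so hardware could run them in one cycle.
    for i in range(len(xs) - 1):
        t_next = stage_a(xs[i + 1])
        out.append(stage_b(t))
        t = t_next
    # Loop epilogue: finish the last iteration.
    out.append(stage_b(t))
    return out
```

Both functions return the same list for any input; only the grouping of operations into loop iterations differs.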

6 Loop Pipelining: Benefits
- Powerful technique for exploiting parallelism across loop iterations
- Very useful for optimizing DSP kernels
- However, for high-level synthesis, there are hardware costs of loop pipelining
- Multiplexing and control costs
- Direct effect on area and longest combinational paths (critical paths)

7 Loop Pipelining: Hardware Costs
[Figure: Initial Code and the code After Pipelining, with operations a, b, c, d, e mapped onto a shared ALU]
- Number of operations in loop can increase substantially
- Increased resource sharing
- More complex & larger multiplexers and control logic

8 Loop Pipelining of Code with Conditionals
[Figure: loop body with basic blocks BB 0 to BB 7, an If Node with true and false branches, operations a, b, c, d, f, g, and a Loop Exit]
- Control flow within loops introduces even more complexity
- The increase in operations and control flow due to loop pipelining increases control and multiplexer logic manifold
- Traditional loop pipelining techniques do not work for HLS of designs with complex control flow

9 Our Solution
- Loop Shifting: Incremental Loop Pipelining Technique
- Exploits inter-iteration parallelism by incrementally moving operations from one iteration to the previous iteration
- Does not lead to an explosion in the number of operations in loop
- Useful for code with nested conditionals and loops
- Can be directed based on resource allocation & utilization

10 Loop Shifting: An Incremental Loop Pipelining Technique
[Figure: a loop (BB 0 to BB 4, Loop Node, Loop Exit) with operations a, b, c, d, shown before and after Loop Shifting of operations a and c]

11 Loop Shifting: An Incremental Loop Pipelining Technique
[Figure: the same loop (BB 0 to BB 4, Loop Node, Loop Exit) with operations a, b, c, d, shown after Loop Shifting and then after Compaction]
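A minimal Python sketch of one loop shift, under stated assumptions: the operation names a, b, c, d follow the figure, but the arithmetic, the input list `xs`, and the dependencies (b and d both read a and c) are invented for illustration and are not taken from the slides.

```python
def loop_before_shift(xs):
    # Original loop body: a and c run first, then b and d consume them.
    out = []
    for x in xs:
        a = x + 1
        c = x - 1
        b = a + c
        d = a - c
        out.append((b, d))
    return out


def loop_after_shift(xs):
    # a and c are shifted: one copy runs ahead of the loop, and inside the
    # loop they now compute values for the NEXT iteration.
    out = []
    n = len(xs)
    if n == 0:
        return out
    a = xs[0] + 1  # copy of a placed before the loop
    c = xs[0] - 1  # copy of c placed before the loop
    for i in range(n):
        b = a + c
        d = a - c
        out.append((b, d))
        if i + 1 < n:
            # Shifted operations. They do not depend on b or d, so a
            # compaction pass could schedule them in the same step as b
            # and d (with renamed destinations for a and c).
            a = xs[i + 1] + 1
            c = xs[i + 1] - 1
    return out
```

The guard `i + 1 < n` only keeps the Python list access in range; it is an artifact of this sketch rather than part of the technique.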

12 Shifting Loops with Conditionals
[Figure: loop body with basic blocks BB 0 to BB 7, an If Node with true and false branches, operations a, b, c, d, f, g, and a Loop Exit]
- Shift a set of concurrent operations from longer conditional branch
- Leads to speculation of operations
  - Out of conditional
  - Across loop iteration
- Followed by loop body compaction

13 Shifting Loops with Conditionals
[Figure: the loop with conditionals (BB 0 to BB 7, If Node, Loop Exit) before and after shifting operations a and c]

14 Shifting Loops with Conditionals
[Figure: the shifted loop with conditionals (BB 0 to BB 7, If Node, Loop Exit) before and after compaction, with the moved operations marked 1 and 2]
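The speculation that shifting introduces can be illustrated with a small Python sketch. Everything here is assumed for the example: the condition `x < 10`, the arithmetic, and the function names are hypothetical and do not reproduce the basic blocks in the figures.

```python
def cond_loop_plain(xs):
    # Original loop: the true branch is the longer one (two dependent ops).
    out = []
    for x in xs:
        if x < 10:
            t = x + 3          # first operation on the longer branch
            out.append(t - 1)  # second operation, depends on t
        else:
            out.append(x)
    return out


def cond_loop_shifted(xs):
    # The first operation of the longer branch is shifted to the previous
    # iteration. It now executes before the condition of its own iteration
    # is known, i.e. speculatively: out of the conditional and across the
    # loop iteration boundary.
    out = []
    n = len(xs)
    if n == 0:
        return out
    t = xs[0] + 3  # speculated copy ahead of the loop
    for i in range(n):
        x = xs[i]
        if x < 10:
            out.append(t - 1)  # the branch is now one operation long
        else:
            out.append(x)      # the speculated t is simply unused
        if i + 1 < n:
            t = xs[i + 1] + 3  # speculated for the next iteration
    return out
```

The speculated value is wasted whenever the false branch is taken, which is one way to see why the extra operations and multiplexing have a hardware cost.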

15 Implementation and Results

16 SPARK High Level Synthesis Framework

17 SPARK Parallelizing HLS Framework
- C input and Synthesizable RTL VHDL output
- Tool-box of Transformations and Heuristics
- Each of these can be developed independently of the other
- Script based control over transformations & heuristics
- Hierarchical Intermediate Representation (HTGs)
- Retains structural information about design (conditional blocks, loops)
- Enables efficient and structured application of transformations
- Complete HLS tool: Does Binding, Control Synthesis and Backend VHDL generation
- Interconnect Minimizing Resource Binding
- Enables Graphical Visualization of Design description and intermediate results
- 100,000+ lines of C++ code

18 How We Apply Loop Shifting
Schedule Entire Design
Repeat for Num_Of_Loop_Shifts
Begin
    Shift one set of concurrent ops in Loop Body
    ReSchedule Loop Body /* Compaction */
End
We do Loop Body Compaction after each Loop Shift
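The procedure above can be sketched as a toy in Python. This is not SPARK's scheduler: the greedy list scheduler, the single `max_ops_per_step` resource limit, the function names, and the simplification that shifted operations have no loop-carried inputs are all assumptions made for illustration.

```python
def schedule(ops, deps, max_ops_per_step):
    # Greedy list scheduling. `ops` must be in topological order and
    # `deps[op]` is the set of same-iteration producers of `op`.
    # Returns a list of steps, each a list of operations.
    step_of = {}
    steps = []
    for op in ops:
        # Earliest step after all producers within this iteration.
        s = max((step_of[p] + 1 for p in deps.get(op, ())), default=0)
        # Skip steps whose resources are already used up.
        while s < len(steps) and len(steps[s]) >= max_ops_per_step:
            s += 1
        while len(steps) <= s:
            steps.append([])
        steps[s].append(op)
        step_of[op] = s
    return steps


def shift_once(ops, deps, steps):
    # Shift the set of concurrent ops in the first step to the end of the
    # loop body. Edges from them to their consumers become loop-carried,
    # so they are dropped from the intra-iteration dependency map.
    first = set(steps[0]) if steps else set()
    new_ops = [o for o in ops if o not in first] + [o for o in ops if o in first]
    new_deps = {o: {p for p in ps if p not in first} for o, ps in deps.items()}
    return new_ops, new_deps


def apply_loop_shifting(ops, deps, max_ops_per_step, num_shifts):
    steps = schedule(ops, deps, max_ops_per_step)      # Schedule Entire Design
    for _ in range(num_shifts):                        # Repeat for Num_Of_Loop_Shifts
        ops, deps = shift_once(ops, deps, steps)       # Shift one set of concurrent ops
        steps = schedule(ops, deps, max_ops_per_step)  # ReSchedule (compaction)
    return steps
```

For example, with `ops = ["a", "c", "b", "d"]` and `deps = {"b": {"a"}, "d": {"c"}}`, one shift with four operations allowed per step compacts the body from two steps to one, while with only two allowed per step the body stays at two steps. That difference is why shifting is best directed by resource allocation and utilization.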

19 Loop Unrolling is Done up Front
In contrast, Loop Unrolling is done before scheduling in our experiments
Repeat for Num_Of_Loop_Unrolls
Begin
    Unroll Loop Body
End
Schedule Entire Design
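For comparison, a single unroll of a loop body can be sketched as follows. The accumulation performed by `body` and the function names are hypothetical; the point is only that the unrolled loop contains two copies of every operation before any scheduling happens.

```python
def body(acc, x):
    # Hypothetical loop body: one multiply and one add per iteration.
    return acc + x * x


def loop_rolled(xs):
    acc = 0
    for x in xs:
        acc = body(acc, x)
    return acc


def loop_unrolled_once(xs):
    # One unroll: the loop body is duplicated, so each trip through the
    # loop handles two original iterations.
    acc = 0
    i = 0
    n = len(xs)
    while i + 1 < n:
        acc = body(acc, xs[i])      # original copy of the body
        acc = body(acc, xs[i + 1])  # unrolled copy of the body
        i += 2
    if i < n:
        # Leftover iteration when the trip count is odd.
        acc = body(acc, xs[i])
    return acc
```

Each further unroll adds another full copy of the body, whereas a loop shift moves only one set of concurrent operations at a time.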

20 Experiments and Results
- Results presented here for
  - Loop Unrolling (represents increasing operations in loops)
  - Loop Shifting
- We used SPARK to synthesize designs derived from multimedia and image processing applications
  - MPEG-1, GIMP Image Processing software
- Baseline for our experiments is a design optimized by a range of parallelizing transformations (speculative code motions and pre-synthesis dynamic transformations)
- Scheduling Results
  - Number of States in FSM
  - Cycles on Longest Path through Design
- VHDL: Logic Synthesis
  - Critical Path Length (ns)
  - Unit Area

21 Target Applications
Design          # of Ifs   # of Loops   # Non-Empty Basic Blocks   # of Operations
MPEG-1 pred1    4          2            17                         123
MPEG-1 pred2    11         6            45                         287
GIMP tiler      11         2            35                         150

22 Loop Unrolling: Scheduling & Logic Synthesis Results
[Figure: bar charts for the MPEG Pred-1 Function and the MPEG Pred-2 Function. Series: Baseline (design optimized by Parallelizing Transformations), + 1 Loop Unroll, + 3 Loop Unrolls. Chart labels: +36%, +24%, -20%, +13%, -8%, +24%]
- No improvement in Delay, with 24-36 % Worse Area
- Improvements in Cycles do not translate to better Circuit Delay and Area

23 Loop Unrolling: Scheduling & Logic Synthesis Results
[Figure: bar chart for the GIMP Tiler Function. Series: Baseline (design optimized by Parallelizing Transformations), + 1 Loop Unroll, + 3 Loop Unrolls. Chart labels: -7%, -17%, +146%]
- 7 % improvement in Delay, with 146 % Worse Area

24 Loop Shifting: Scheduling & Logic Synthesis Results
[Figure: bar charts for the MPEG Pred-1 Function and the MPEG Pred-2 Function. Series: Baseline (design optimized by Parallelizing Transformations), + 1 Loop Shift, + 2 Loop Shifts, + 3 Loop Shifts. Chart labels: +16%, -17%, -20%, -11%, -9%, -4%]
- Overall: 11-17 % improvement in Delay
- Up to 16 % Worse Area

25 Loop Shifting: Scheduling & Logic Synthesis Results
[Figure: bar chart for the GIMP Tiler Function. Series: Baseline (design optimized by Parallelizing Transformations), + 1 Loop Shift, + 2 Loop Shifts, + 3 Loop Shifts. Chart labels: -13%, -19%, +34%; annotation: Same Delay, Less Area]
- 13 % improvement in Delay, 34 % Worse Area
- It is possible to achieve similar delay improvements with smaller area increase
- Need to carefully guide Loop Shifting/Pipelining

26 Conclusions
- Traditional/Aggressive Loop Pipelining techniques do not work for designs with complex control flow
  - Multiplexer and Control logic costs are too high
  - Critical path length, clock period, and area increase
- Instead, it is possible to incrementally exploit inter-loop iteration parallelism by Loop Shifting
  - Must be guided carefully to find best circuit delay & area
- Results on large multimedia and image processing designs illustrate the utility of loop shifting
  - We achieve 11-17 % improvement in circuit delay with little increase in area

27 Software Release of SPARK
- Download from http://www.cecs.uci.edu/~spark
- Includes User Manual
- Tutorial: Synthesis of motion compensation from MPEG-1 player
- Available for Linux, Solaris and Windows

28 Thank You

