Bitwidth-Aware Scheduling and Binding in High-Level Synthesis X. Cheng +, J. Cong, Y. Fan, G. Han, J. Lin, J. Xu +, Z. Zhang Computer Science Department,

1 Bitwidth-Aware Scheduling and Binding in High-Level Synthesis
X. Cheng+, J. Cong, Y. Fan, G. Han, J. Lin, J. Xu+, Z. Zhang
Computer Science Department, UCLA
+ Microprocessor Development and Research Center, PKU

2 Outline
- Motivation
- Bitwidth-aware synthesis flow
- Scheduling and binding to minimize total bits of functional units (FU)
- Minimum weighted-interval-graph coloring problem for register allocation and binding
- Experimental results
- Conclusion

3 Motivation
- High-level languages
  - Big gap between design productivity and complexity
  - Alleviate the design complexity
  - Need to produce high-quality products
- Need to consider multi-bitwidth
  - Recent research shows there are 40% redundant bits in programs of high-level languages [Stephenson et al, SIGPLAN’00]
  - Hardware resource cost will be reduced with consideration of multi-bitwidth
    - Area is proportional to input bitwidth for adders and registers, and is proportional to the square of input bitwidth for multipliers
    - Wire-length is reduced accordingly
  - Conventional high-level synthesis only focuses on resources with uniform bitwidth

4 Motivational Example - Impact of Bitwidth
- Operation latency: * (3 clock cycles), + (1 clock cycle)
- Execution time: 8 clock cycles
- Binding 1: adders 26, 18; multipliers 32x16, 24x16
- Binding 2: adders 26, 5; multipliers 32x16, 18x6
- Binding 2 gives a 30% saving for adders and a 31% saving for multipliers
[Figure: the same DFG scheduled and bound in two ways; additions of bitwidths 18, 5, 26, 16 and multiplications 18*6, 16*4, 24*16, 32*16.]
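The two savings on this slide can be checked against the area model stated on the Motivation slide. The sketch below assumes adder area is the sum of adder bitwidths and multiplier area is the product of the two operand bitwidths (a natural reading of "square of input bitwidth" for unequal operands); the function names are illustrative, not from the authors' tool.

```python
def adder_area(widths):
    """Relative adder area: assumed proportional to the input bitwidth."""
    return sum(widths)


def multiplier_area(shapes):
    """Relative multiplier area: assumed proportional to the product of operand bitwidths."""
    return sum(a * b for a, b in shapes)


def saving(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Functional units required by the two bindings on the slide.
adders_1, adders_2 = [26, 18], [26, 5]
mults_1, mults_2 = [(32, 16), (24, 16)], [(32, 16), (18, 6)]

adder_saving = saving(adder_area(adders_1), adder_area(adders_2))
mult_saving = saving(multiplier_area(mults_1), multiplier_area(mults_2))
print(f"adders: {adder_saving:.1f}%  multipliers: {mult_saving:.1f}%")
```

Under these assumptions the results round to the 30% and 31% quoted on the slide.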

5 Related Works
- High-level synthesis with consideration of bitwidth
  - ILP formulation [Constantinides et al, IEEE Electronics Letters’00]
  - Heuristic solution [Kum et al ’01] [Constantinides et al, DATE’01]
  - Split adders into 1-bit [Molina et al, DAC’02]
  - Partially guarded computation [Choi et al, ISLPED’00]
- Limitation
  - No consideration of interconnect delay in scheduling and binding
    - Interconnect delays dominate the timing in DSM tech
  - No optimality evaluation of proposed solutions for register allocation and binding

6 Outline
- Motivation
- Bitwidth-aware synthesis flow
- Scheduling and binding to minimize total bits of functional units (FU)
- Minimum weighted-interval-graph coloring problem for register allocation and binding
- Experimental results
- Conclusion

7 Bitwidth-Aware Synthesis Flow
- Multiple bitwidth scheduling and binding problem
  - Given: (1) a DFG annotated with bitwidths, (2) a time constraint, (3) placement information of functional units, and (4) a resource IP library, where each resource type has arbitrary bitwidth configurations, each of which is associated with an area cost.
  - Objective: Schedule and bind the DFG into the library with consideration of interconnect delay from placement and without violating the time constraint, such that the final area of the required resources is minimized.

8 RDR+MCAS
- One solution for multi-cycle on-chip communication
  - Regular Distributed Register (RDR) micro-architecture [Cong et al, ISPD’03] [Cong et al, ICCAD’03]
    - The whole chip is divided into an array of islands
    - Choose the island size such that local computation and communication in each island can be done in a single cycle
  - MCAS: Architectural Synthesis for Multi-cycle Communication
    - Efficiently maps the behavioral descriptions to RDR uArch
    - Integrates architectural synthesis with physical planning
    - Placement information of functional units
[Figure: RDR micro-architecture. An array of islands (width Wi, height Hi), each holding a Local Computational Cluster (LCC) with ALU, MUL, MUX, FSM, and a register file under a cluster area constraint; global interconnect between islands takes 1 cycle, 2 cycles, ..., K cycles.]

9 Outline
- Motivation
- Bitwidth-aware synthesis flow
- Scheduling and binding to minimize total bits of functional units (FU)
- Minimum weighted-interval-graph coloring problem for register allocation and binding
- Experimental results
- Conclusion

10 Scheduling and Binding
- Lower bound estimation of FU bitwidth for a DFG
  - Prior works focus on the number of FUs
- Lower-bound-based simultaneous scheduling and binding
  - Time constrained
  - Consider the interconnect delay obtained from placement information given by MCAS

11 Lower Bound Estimation
- Extend the interval-based technique of [Sharma et al, 93] to support multi-bitwidth FUs
- Main idea
  - Compute the minimum resource requirement R(p, q) for each time interval [p, q] ⊆ [1, T]
  - The maximum of R(p, q) over all intervals is the final bitwidth lower bound

12 Example of Lower-bound Estimation
- The minimum bitwidth requirement for multipliers in interval [4, 7]
- Theorem: For any feasible scheduling, the minimum overlap between operation o and interval [p, q] is O(o, p, q) = min{ |Lifetime_ASAP ∩ [p, q]|, |Lifetime_ALAP ∩ [p, q]| }
- The minimum overlap between the multiplications a, b, c and d, and interval [4, 7]
  - O(a 18*6, 4, 7) = 1
  - O(b 24*16, 4, 7) = 2
  - O(c 32*16, 4, 7) = 1
  - O(d 16*4, 4, 7) = 1
- The operation bitwidths that must be executed in [4, 7] are {18, 24, 24, 32, 16}
  - Sorted: {32, 24, 24, 18, 16}
- The minimum bitwidth requirement for multipliers in [4, 7] will be R(4, 7) = {32, 16}
[Figure: ASAP and ALAP schedules of the example DFG over step1 to step8, with additions of bitwidths 18, 5, 26, 16 and multiplications a (18*6), b (24*16), c (32*16), d (16*4).]
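A minimal sketch of the computation on this slide follows. It rests on one reading of how {32, 16} is obtained from the sorted list: a single FU can cover at most q - p + 1 control steps of the interval, so the sorted list is cut into groups of that size and each group needs an FU as wide as its first (largest) entry. The function names, and the ASAP/ALAP start steps used in the last line, are hypothetical.

```python
def overlap_len(start, latency, p, q):
    """Control steps of the lifetime [start, start + latency - 1] that fall inside [p, q]."""
    lo = max(start, p)
    hi = min(start + latency - 1, q)
    return max(0, hi - lo + 1)


def min_overlap(asap_start, alap_start, latency, p, q):
    """O(o, p, q) from the theorem: the smaller of the ASAP and ALAP overlaps."""
    return min(overlap_len(asap_start, latency, p, q),
               overlap_len(alap_start, latency, p, q))


def interval_requirement(ops, p, q):
    """R(p, q) for a list of (bitwidth, minimum overlap in steps) pairs."""
    capacity = q - p + 1  # steps one FU can serve inside [p, q]
    # One entry per control step that must execute inside the interval.
    units = sorted((w for w, k in ops for _ in range(k)), reverse=True)
    # Every `capacity`-th entry opens a new FU, sized by that entry.
    return units[::capacity]


# Bitwidths and overlaps of multiplications a, b, c, d as given on the slide.
slide_ops = [(18, 1), (24, 2), (32, 1), (16, 1)]
print(interval_requirement(slide_ops, 4, 7))

# Hypothetical operation: ASAP start 2, ALAP start 5, latency 3 cycles.
print(min_overlap(2, 5, 3, 4, 7))
```

With the slide's values the first call reproduces R(4, 7) = {32, 16}.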

13 Area Cost
- Weighted-area lower bound of an unscheduled DFG is defined as A = A_add + α · A_mul
  - A_add: area for adders
  - A_mul: area for multipliers
  - α: a ratio weight of multiplier area over adder area
- For a partially scheduled DFG, scheduling status S records the control steps for scheduled operations and feasible control steps for un-scheduled operations
  - A is calculated the same way, denoted as A(S)

14 Scheduling and Binding Algorithm-1
- Goal: Minimize the area cost of required FUs
  - Consider interconnect delay
- Basic idea
  - In each step, schedule an operation at a control step such that the resulting weighted-area lower bound A(S) is kept as small as possible
- How to choose an operation and one of its feasible control steps
  - add-16: feasible control steps [1, 2]; A(16, 1) = 48, A(16, 2) = 48
  - add-32: feasible control steps [2, 3]; A(32, 2) = 64, A(32, 3) = 48
[Figure: partially scheduled DFG over step1 to step3 with 16-bit and 32-bit additions.]
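The selection step can be sketched as a search over all (operation, control step) candidates for the smallest resulting A(S). This is only an illustration: `pick_next` and the table lookup standing in for the real lower-bound computation are hypothetical, and breaking ties by taking the first candidate is an assumption, since the transcript does not say how ties are resolved.

```python
def pick_next(feasible_steps, lower_bound):
    """Return the (operation, step) pair whose tentative scheduling gives the smallest A(S).

    feasible_steps: dict mapping an operation to its feasible control steps.
    lower_bound:    callable (operation, step) -> weighted-area lower bound A(S).
    Ties are broken by candidate order (an assumption).
    """
    candidates = [(op, c) for op, steps in feasible_steps.items() for c in steps]
    return min(candidates, key=lambda oc: lower_bound(*oc))


# A(S) values copied from the slide; a real flow would recompute them.
slide_values = {("add-16", 1): 48, ("add-16", 2): 48,
                ("add-32", 2): 64, ("add-32", 3): 48}
feasible = {"add-16": [1, 2], "add-32": [2, 3]}

choice = pick_next(feasible, lambda op, c: slide_values[(op, c)])
print(choice, slide_values[choice])
```

On the slide's numbers, the candidate with A(32, 2) = 64 is never selected, because a candidate with A(S) = 48 is always available.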

15 Scheduling and Binding Algorithm-2
- Simultaneous scheduling and binding with consideration of interconnect delay
- After operation o and control step c are chosen, FU binding is performed to decide whether o can finally be scheduled at step c
  - There is an available FU usable by o at step c
  - Data dependence between o and its scheduled and bound predecessors and successors is maintained
[Figure: operations * and + bound to MUL and ADD units over step1 to step3; labels: island, 1 clock cycle.]

16 Outline
- Motivation
- Bitwidth-aware synthesis flow
- Scheduling and binding to minimize total bits of functional units (FU)
- Minimum weighted-interval-graph coloring problem for register allocation and binding
- Experimental results
- Conclusion

17 Register Allocation and Binding
- Problem formulation
  - Given: A scheduled DFG annotated with bitwidth
  - Objective: Perform register allocation and binding to minimize the total bitwidth of registers
- Register allocation
  - Decide the minimum required registers
- Register binding
  - Explicitly map variables to register instances

18 Preliminaries
- Lifetime of a variable
  - s(o): the control step where variable o is produced
  - e(o): the last control step where variable o is consumed
- Weighted interval graph
  - A proper coloring of G corresponds to a register allocation and binding scheme
- Weight of a coloring scheme
  - The weight of color c: W(c) = max{ w(v) | v is colored with c }
  - The weight of the coloring scheme P is defined as W(G, P) = Σ W(c)
  - Example: 24 + 16 + 18 = 58
[Figure: scheduled DFG, lifetimes of variables with bitwidths 24, 18, 16, 5, and the corresponding weighted interval graph.]
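The weight definition translates directly into code: each register (color) must be as wide as its widest variable, and the scheme's weight is the sum over registers. In the sketch below the variable names and the particular register assignment are hypothetical; they are chosen only so that the total matches the 58 on the slide. The function does not check that the coloring is proper.

```python
from collections import defaultdict


def coloring_weight(width, register_of):
    """W(G, P): sum over registers of the widest variable bound to each register.

    width:       dict variable -> bitwidth w(v)
    register_of: dict variable -> register (color); assumed to be a proper coloring.
    """
    widest = defaultdict(int)
    for var, reg in register_of.items():
        widest[reg] = max(widest[reg], width[var])
    return sum(widest.values())


# Four variables with the bitwidths shown on the slide (names are illustrative).
width = {"v24": 24, "v18": 18, "v16": 16, "v5": 5}
# Hypothetical binding: the 5-bit variable shares a register with the 16-bit one.
three_regs = {"v24": "r0", "v18": "r1", "v16": "r2", "v5": "r2"}
print(coloring_weight(width, three_regs))
```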

19 Coloring Problem
- Weighted-interval-graph coloring problem
  - Given: A weighted interval graph G(V, E)
  - Objective: Find a coloring scheme P of G, such that the weight of the coloring scheme P, W(G, P), is minimized
- Uniform weights
  - Can be solved in polynomial time (Left-edge)
- Various weights
  - The complexity remains unknown
  - We propose a lower-bound estimation and an efficient algorithm
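For the uniform-weight case, a left-edge style greedy can be sketched as follows: visit variables in order of their left endpoint and reuse the first register that is already free. The sketch treats lifetimes as half-open intervals [s, e), so a register released at step e can be taken by a variable produced at step e; that convention, the function name, and the sample lifetimes are assumptions.

```python
def left_edge(lifetimes):
    """Bind variables to registers using a left-edge style greedy.

    lifetimes: dict variable -> (s, e), treated as the half-open interval [s, e).
    Returns a dict variable -> register index.
    """
    binding = {}
    free_at = []  # free_at[r] = step at which register r becomes free again
    for var in sorted(lifetimes, key=lambda v: lifetimes[v]):
        s, e = lifetimes[var]
        for r, end in enumerate(free_at):
            if end <= s:  # register r is free before this variable is produced
                free_at[r] = e
                binding[var] = r
                break
        else:  # no free register: allocate a new one
            free_at.append(e)
            binding[var] = len(free_at) - 1
    return binding


# Hypothetical lifetimes; at most two variables are alive at the same time.
example = {"a": (1, 3), "b": (2, 5), "c": (3, 6), "d": (5, 7)}
binding = left_edge(example)
print(binding, "registers used:", len(set(binding.values())))
```

Processing intervals by left endpoint uses as many registers as the maximum number of simultaneously live variables, which is why the uniform-weight problem is easy; with differing bitwidths the register count alone no longer determines the cost.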

20 Lower-Bound Estimation
- |C_{≥24}| ≥ 1
- |C_{≥18}| ≥ 1
- |C_{≥16}| ≥ 2
- |C_{≥5}| ≥ 3
- Bitwidth lower bound: 24*1 + 16*1 + 5*1 = 45
[Figure: scheduled DFG and lifetimes of variables with bitwidths 24, 18, 16, 5.]
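One reading of this slide: for each bitwidth threshold w, the number of registers at least w bits wide cannot be smaller than the maximum number of simultaneously live variables of width w or more. Charging each additional required register at the threshold where it first becomes necessary yields the bound. The sketch below follows that reading; the function names and the lifetimes (half-open intervals) are hypothetical, picked so that the counts match the 1, 1, 2, 3 on the slide.

```python
def max_density(intervals):
    """Maximum number of half-open intervals [s, e) alive at the same time."""
    events = []
    for s, e in intervals:
        events.append((s, 1))
        events.append((e, -1))
    events.sort()  # at equal steps, ends (-1) sort before starts (+1)
    best = alive = 0
    for _, delta in events:
        alive += delta
        best = max(best, alive)
    return best


def weight_lower_bound(variables):
    """Lower bound on total register bits for a list of (bitwidth, s, e) variables."""
    bound = 0
    prev = 0
    for w in sorted({w for w, _, _ in variables}, reverse=True):
        # Registers of width >= w needed by variables of width >= w.
        k = max_density([(s, e) for x, s, e in variables if x >= w])
        bound += w * (k - prev)  # newly required registers are at least w bits wide
        prev = k
    return bound


# Hypothetical lifetimes for the four variables on the slide.
variables = [(24, 1, 3), (18, 3, 5), (16, 2, 4), (5, 2, 4)]
print(weight_lower_bound(variables))
```

With these lifetimes the bound evaluates to 24*1 + 16*1 + 5*1 = 45, the value on the slide.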

21 Coloring Algorithm
- Weight of coloring: 24*1 + 16*1 + 5*1 = 45
[Figure: scheduled and bound DFG and lifetimes of variables with bitwidths 24, 18, 16, 5.]

22 Outline
- Motivation
- Bitwidth-aware synthesis flow
  - Scheduling and binding to minimize total bits of functional units (FU)
  - Minimum weighted-interval-graph coloring problem for register allocation and binding
- Experimental results
- Conclusion

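23 Experimental Results - Weighted Interval-Graph Coloring

| Designs | Lower Bound | Left-Edge+PostProcess | [Kum et al ’01] | Weighted IGC |
|---|---|---|---|---|
| aircraft | 1270 | 1402 | 1335 | 1270 |
| chem | 896 | 962 | 929 | 897 |
| dir | 474 | 487 | 505 | 474 |
| honda | 312 | 328 | 368 | 313 |
| lee | 216 | 216 | 232 | 216 |
| mcm | 689 | 721 | 691 | 689 |
| pr | 270 | 297 | 298 | 270 |
| u5ml | 1717 | 1892 | 1778 | 1717 |
| wang | 269 | 293 | 302 | 269 |
| Ave gap | - | +6.6% | +7.5% | +0.05% |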

24 Experimental Results - Three Synthesis Flows
- Flow1 (MCAS)
  - MCAS generates the scheduling and binding results and placement information. All operations and variables have uniform bitwidth (32 bits).
- Flow2 (MCAS+MB-PP)
  - Perform a bitwidth post-processing after Flow1 is done: set the bitwidth of an FU to the maximum bitwidth of all operations executed on it, and set the bitwidth of a register to the maximum bitwidth of all variables stored in it.
- Flow3 (MCAS-MB)
  - After MCAS generates the scheduling and binding results and placement, the lower-bound-based scheduling & binding and the bitwidth-aware register allocation and binding are performed.
- The three flows share the same backend to generate datapath and controllers
- Altera’s Quartus II version 2.2 is used to synthesize the resulting RTL VHDL onto the FPGA device Stratix™ EP1S80F1508C6

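25 Experimental Results - Comparison of the Three Synthesis Flows

| Design | Node# | MCAS LE | MCAS WL(k) | MCAS+MB-PP LE | MCAS+MB-PP WL(k) | MCAS-MB LE | MCAS-MB WL(k) |
|---|---|---|---|---|---|---|---|
| aircraft | 422 | - | - | 10559 | 267 | 6860 | 181 |
| chem. | 342 | 8339 | 247 | 7101 | 191 | 4814 | 136 |
| dir | 127 | 2810 | 91 | 2075 | 48 | 1135 | 27 |
| honda | 107 | 2433 | 77 | 1774 | 38 | 1124 | 24 |
| lee | 49 | 1033 | 54 | 722 | 35 | 614 | 25 |
| mcm | 94 | 2562 | 105 | 2411 | 83 | 2392 | 75 |
| pr | 42 | 1194 | 63 | 1030 | 45 | 967 | 38 |
| u5ml | 565 | 14447 | 396 | 12774 | 318 | 7143 | 166 |
| wang | 48 | 1275 | 73 | 1078 | 36 | 1050 | 38 |
| Ave Red. | - | 1 | 1 | -18.1% | -34.5% | -36.3% | -51.5% |

LE: Area results for datapath and control logic in terms of logic elements
WL: Wire-length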
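The "Ave Red." row can be recomputed from the LE columns. The sketch below assumes that the average is the mean of per-design relative reductions against the MCAS baseline, and that aircraft is left out because it has no MCAS result; both are inferences from the numbers rather than something the slide states.

```python
# Logic-element (LE) counts per design: (MCAS, MCAS+MB-PP, MCAS-MB).
# aircraft is omitted because the table lists no MCAS baseline for it.
le_counts = {
    "chem": (8339, 7101, 4814),
    "dir": (2810, 2075, 1135),
    "honda": (2433, 1774, 1124),
    "lee": (1033, 722, 614),
    "mcm": (2562, 2411, 2392),
    "pr": (1194, 1030, 967),
    "u5ml": (14447, 12774, 7143),
    "wang": (1275, 1078, 1050),
}


def average_reduction(rows, column):
    """Mean percentage change of `column` relative to the baseline in column 0."""
    changes = [100.0 * (row[column] - row[0]) / row[0] for row in rows.values()]
    return sum(changes) / len(changes)


mbpp_le = average_reduction(le_counts, 1)
mb_le = average_reduction(le_counts, 2)
print(f"MCAS+MB-PP LE: {mbpp_le:.1f}%  MCAS-MB LE: {mb_le:.1f}%")
```

Under these assumptions the two averages agree with the -18.1% and -36.3% reported in the table.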

26 Conclusions
- We presented a complete bitwidth-aware high-level synthesis flow based on the MCAS synthesis system
- Experimental results
  - Our bitwidth-aware synthesis flow achieves significant reduction for area and wire-length

27 Reference
- J. Choi, J. Jeon and K. Choi, “Power Minimization of Functional Units by Partially Guarded Computation,” Proc. of ISLPED, 2000.
- J. Cong, Y. Fan, X. Yang, and Z. Zhang, “Architecture and Synthesis for Multi-Cycle Communication,” Proc. of International Symposium on Physical Design, 2003.
- J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and Synthesis for On-Chip Multicycle Communication,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2004.
- G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Optimal Datapath Allocation for Multiple-Wordlength Systems,” IEEE Electronics Letters, 2000.
- G. A. Constantinides, P. Y. K. Cheung, and W. Luk, “Heuristic Datapath Allocation for Multiple Wordlength Systems,” Proc. of Design, Automation and Test in Europe (DATE), 2001.
- K. Kum and W. Sung, “Combined Word-Length Optimization and High-Level Synthesis of Digital Signal Processing Systems,” IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, 2001.
- M. C. Molina, J. M. Mendias, and R. Hermida, “High-Level Synthesis of Multiple-Precision Circuits Independent of Data-Objects Length,” Proc. of the 39th Design Automation Conference, 2002.
- A. Sharma and R. Jain, “Estimating Architectural Resources and Performance for High-Level Synthesis Applications,” IEEE Trans. on VLSI Systems, 1993.
- M. Stephenson, J. Babb, and S. Amarasinghe, “Bitwidth Analysis with Application to Silicon Compilation,” Proc. of the ACM SIGPLAN'2000 Conference on Programming Language Design and Implementation, 2000.
