Bitwidth-Aware Scheduling and Binding in High-Level Synthesis X. Cheng +, J. Cong, Y. Fan, G. Han, J. Lin, J. Xu +, Z. Zhang Computer Science Department,

Presentation transcript:

Bitwidth-Aware Scheduling and Binding in High-Level Synthesis X. Cheng +, J. Cong, Y. Fan, G. Han, J. Lin, J. Xu +, Z. Zhang Computer Science Department, UCLA + Microprocessor Development and Research Center, PKU

Outline
- Motivation
- Bitwidth-aware synthesis flow
  - Scheduling and binding to minimize total bits of functional units (FU)
  - Minimum weighted-interval-graph coloring problem for register allocation and binding
- Experimental results
- Conclusion

Motivation
- High-level languages
  - Big gap between design productivity and complexity
  - Alleviate the design complexity
  - Need to produce high-quality products
- Need to consider multi-bitwidth
  - Recent research shows there are 40% redundant bits in programs of high-level languages [Stephenson et al, SIGPLAN'00]
  - Hardware resource cost will be reduced with consideration of multi-bitwidth
    - Area is proportional to input bitwidth for adders and registers, and is proportional to the square of input bitwidth for multipliers
    - Wire-length is reduced accordingly
  - Conventional high-level synthesis only focuses on resources with uniform bitwidth

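Motivational Example - Impact of Bitwidth
- Multiplication (*) takes 3 clock cycles; addition (+) takes 1 clock cycle
- Execution time: 8 clock cycles
[Figure: two scheduling and binding solutions for the same DFG of multiplications (operand widths include 16*4 and 24*16) and additions. Each solution lists its adders and multipliers; a 32x16 multiplier appears in both. The slide labels the savings as 30% and 31%.]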

Related Works
- High-level synthesis with consideration of bitwidth
  - ILP formulation [Constantinides et al, IEEE Electronics Letters'00]
  - Heuristic solution [Kum et al '01] [Constantinides et al, DATE'01]
  - Split adders into 1-bit [Molina et al, DAC'02]
  - Partially guarded computation [Choi et al, ISLPED'00]
- Limitation
  - No consideration of interconnect delay in scheduling and binding
    - Interconnect delays dominate the timing in DSM tech
  - No optimality evaluation of proposed solutions for register allocation and binding

Bitwidth-Aware Synthesis Flow
- Multiple bitwidth scheduling and binding problem
  - Given:
    (1) a DFG annotated with bitwidths,
    (2) a time constraint,
    (3) placement information of functional units, and
    (4) a resource IP library, where each resource type has arbitrary bitwidth configurations, each of which is associated with an area cost.
  - Objective: Schedule and bind the DFG into the library with consideration of interconnect delay from placement and without violating the time constraint, such that the final area of the required resources is minimized.

RDR+MCAS
- One solution for multi-cycle on-chip communication
  - Regular Distributed Register (RDR) micro-architecture [Cong et al, ISPD'03] [Cong et al, ICCAD'03]
    - The whole chip is divided into an array of islands
    - Choose the island size such that local computation and communication in each island can be done in a single cycle
  - MCAS: Architectural Synthesis for Multi-cycle Communication
    - Efficiently maps the behavioral descriptions to RDR uArch
    - Integrates architectural synthesis with physical planning
    - Placement information of functional units
[Figure: RDR micro-architecture. An array of islands connected by global interconnect with delays of 1 cycle, 2 cycles, ..., K cycles. Each island (width Wi, height Hi) contains a Local Computational Cluster (LCC) with FSM, ALU, MUL, and MUX under a cluster area constraint, plus a register file.]

Scheduling and Binding
- Lower bound estimation of FU bitwidth for a DFG
  - Prior works focus on the number of FUs
- Lower-bound-based simultaneous scheduling and binding
  - Time constrained
  - Consider the interconnect delay obtained from placement information given by MCAS

Lower Bound Estimation
- Extend the interval-based technique of [Sharma et al, 93] to support multi-bitwidth FUs
- Main idea
  - Compute the minimum resource requirement $R(p, q)$ for each time interval $[p, q] \subseteq [1, T]$
  - The maximum of $R(p, q)$ over all intervals is the final bitwidth lower bound

Example of Lower-bound Estimation
- Goal: the minimum bitwidth requirement for multipliers in interval [4, 7]
- Theorem: For any feasible schedule, the minimum overlap between operation $o$ and interval $[p, q]$ is
  $O(o, p, q) = \min\{\, |\mathrm{Lifetime}_{ASAP} \cap [p, q]|,\; |\mathrm{Lifetime}_{ALAP} \cap [p, q]| \,\}$
- The minimum overlap between the multiplications a, b, c, and d and interval [4, 7]
  - O(a 18*6, 4, 7) = 1
  - O(b 24*16, 4, 7): value not preserved in the transcript
  - O(c 32*16, 4, 7) = 1
  - O(d 16*4, 4, 7) = 1
- The operation bitwidths that must be executed in [4, 7] are {18, 24, 24, 32, 16}
  - Sorted: {32, 24, 24, 18, 16}
- The minimum bitwidth requirement for multipliers in [4, 7] is R(4, 7) = {32, 16}
[Figure: ASAP and ALAP schedules of the multiplications a (18*6), b (24*16), c (32*16), and d (16*4) across the control steps.]
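
The example above can be reproduced with a short sketch. The theorem's overlap is computed directly; the step from the sorted demand list to R(4, 7) is not spelled out on the slide, so the packing rule below (sort descending, give each FU as many single-cycle slots as the interval has steps, and size each FU by the widest entry it receives) is an assumption chosen because it reproduces {32, 16}. All function names are hypothetical.

```python
from typing import List


def overlap(start: int, duration: int, p: int, q: int) -> int:
    """Control steps shared by [start, start + duration - 1] and [p, q]."""
    end = start + duration - 1
    return max(0, min(end, q) - max(start, p) + 1)


def min_overlap(asap: int, alap: int, duration: int, p: int, q: int) -> int:
    """O(o, p, q): the smaller of the ASAP and ALAP overlaps with [p, q]."""
    return min(overlap(asap, duration, p, q), overlap(alap, duration, p, q))


def fu_width_lower_bound(unit_widths: List[int], p: int, q: int) -> List[int]:
    """Assumed packing rule: one FU serves at most q - p + 1 single-cycle
    demands, and the widest demand placed on an FU fixes its bitwidth."""
    slots = q - p + 1
    ordered = sorted(unit_widths, reverse=True)
    return [ordered[i] for i in range(0, len(ordered), slots)]


# Per-cycle bitwidth demands in [4, 7], taken from the slide.
demand = [18, 24, 24, 32, 16]
print(fu_width_lower_bound(demand, 4, 7))  # [32, 16]
```

With five single-cycle demands and only four steps in [4, 7], at least two multipliers are needed; the widest one must cover the 32-bit operation and the second only has to cover the narrowest leftover demand.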

Area Cost
- The weighted-area lower bound of an unscheduled DFG is defined as
  $A = A_{add} + \alpha \cdot A_{mul}$
  - $A_{add}$: area for adders
  - $A_{mul}$: area for multipliers
  - $\alpha$: a ratio weight of multiplier area over adder area
- For a partially scheduled DFG, the scheduling status S records the control steps for scheduled operations and the feasible control steps for unscheduled operations
  - A is calculated the same way, denoted as A(S)

Scheduling and Binding Algorithm-1
- Goal: Minimize the area cost of required FUs
  - Consider interconnect delay
- Basic idea
  - In each step, schedule an operation at a control step such that the resulting weighted-area lower bound A(S) is kept as small as possible
- How to choose an operation and one of its feasible control steps
  - add-16: feasible control steps [1, 2]
    - A(16, 1) = 48
    - A(16, 2) = 48
  - add-32: feasible control steps [2, 3]
    - A(32, 2) = 64
    - A(32, 3) = 48

Scheduling and Binding Algorithm-2
- Simultaneous scheduling and binding with consideration of interconnect delay
- After operation o and control step c are chosen, FU binding is performed to decide whether o can finally be scheduled at step c
  - There is an available FU usable by o at step c
  - Data dependence between o and its scheduled and bound predecessors and successors is maintained
[Figure: a multiplication and an addition scheduled over steps 1 to 3 and bound to MUL and ADD units placed on islands, with a 1-clock-cycle interconnect delay between islands.]

Register Allocation and Binding
- Problem formulation
  - Given: A scheduled DFG annotated with bitwidth
  - Objective: Perform register allocation and binding to minimize the total bitwidth of registers
- Register allocation
  - Decide the minimum required registers
- Register binding
  - Explicitly map variables to register instances

Preliminaries
- Lifetime of a variable
  - s(o): the control step where variable o is produced
  - e(o): the last control step where variable o is consumed
- Weighted interval graph
  - A proper coloring of G corresponds to a register allocation and binding scheme
- Weight of a coloring scheme
  - The weight of color c: $W(c) = \max\{\, w(v) \mid v \text{ is colored with } c \,\}$
  - The weight of the coloring scheme P is defined as $W(G, P) = \sum_{c} W(c)$ (= 58 in the slide's example)
[Figure: a scheduled DFG, the lifetimes of its variables, and the corresponding weighted interval graph.]
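
As a small illustration of the weight definition, the sketch below sums, over all colors (registers), the widest variable assigned to each. The variable names, bitwidths, and coloring are made up for the example and are not the slide's data.

```python
from collections import defaultdict
from typing import Dict


def coloring_weight(weights: Dict[str, int], color_of: Dict[str, int]) -> int:
    """W(G, P): sum over colors of the largest weight assigned to that color."""
    heaviest = defaultdict(int)  # color -> max weight seen so far
    for var, color in color_of.items():
        heaviest[color] = max(heaviest[color], weights[var])
    return sum(heaviest.values())


# Hypothetical variables (bitwidths) and a proper coloring (register indices).
weights = {"a": 24, "b": 18, "c": 16, "d": 5}
color_of = {"a": 0, "b": 0, "c": 1, "d": 2}
print(coloring_weight(weights, color_of))  # 24 + 16 + 5 = 45
```

Sharing register 0 between `a` and `b` costs only the wider of the two, which is why grouping variables of similar bitwidth lowers the total.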

Coloring Problem
- Weighted-interval-graph coloring problem
  - Given: A weighted interval graph G(V, E)
  - Objective: Find a coloring scheme P of G such that the weight of the coloring scheme, W(G, P), is minimized
- Uniform weights
  - Can be solved in polynomial time (left-edge algorithm)
- Various weights
  - The complexity remains unknown
  - We propose a lower-bound estimation and an efficient algorithm
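
For the uniform-weight case, the classical left-edge algorithm can be sketched as follows. This is the textbook procedure, not the weighted algorithm proposed in the slides. Lifetimes are treated as half-open intervals [start, end), an assumption made here so that a variable whose lifetime ends at step t can hand its register to one that starts at t.

```python
from typing import Dict, List, Tuple


def left_edge(intervals: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Assign each lifetime to the first register that is already free.

    intervals maps a variable name to a half-open lifetime (start, end).
    Returns a mapping from variable name to register index.
    """
    free_at: List[int] = []  # free_at[r] = step at which register r becomes free
    assignment: Dict[str, int] = {}
    # Visit lifetimes in order of their left edge (start step).
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        for reg, t in enumerate(free_at):
            if t <= start:  # register is free again, reuse it
                free_at[reg] = end
                assignment[name] = reg
                break
        else:  # no register is free, allocate a new one
            free_at.append(end)
            assignment[name] = len(free_at) - 1
    return assignment


# Hypothetical lifetimes.
lifetimes = {"a": (0, 2), "b": (2, 4), "c": (1, 3), "d": (1, 2)}
print(left_edge(lifetimes))
```

The number of registers used equals the largest number of simultaneously live variables, which is optimal when every register has the same width. With mixed bitwidths the same assignment can be far from the minimum total weight, which is the gap the weighted formulation addresses.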

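Lower-Bound Estimation
- Let $C_{\ge w}$ be the set of colors whose weight is at least $w$
  - $|C_{\ge 24}| \ge 1$
  - $|C_{\ge 18}| \ge 1$
  - $|C_{\ge 16}| \ge 2$
  - $|C_{\ge 5}| \ge 3$
- Bitwidth lower bound: 24*1 + 16*1 + 5*1 = 45
[Figure: a scheduled DFG and the lifetimes of its variables.]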
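
One way to read the bound above: for each weight threshold w, the variables of width at least w that are live at the same time need distinct registers, each at least w bits wide. The sketch below follows that reading; it is an interpretation of the slide, not a transcription of the authors' procedure, and the lifetimes are hypothetical values chosen to give the same counts (1, 1, 2, 3). Lifetimes are assumed to be half-open intervals [start, end).

```python
from typing import List, Tuple


def max_overlap(intervals: List[Tuple[int, int]]) -> int:
    """Largest number of half-open intervals [s, e) live at the same time."""
    events = []
    for s, e in intervals:
        events.append((s, 1))
        events.append((e, -1))
    events.sort()  # at equal times, ends (-1) sort before starts (+1)
    live = best = 0
    for _, delta in events:
        live += delta
        best = max(best, live)
    return best


def coloring_lower_bound(variables: List[Tuple[int, int, int]]) -> int:
    """variables holds (weight, start, end) triples.

    Walk the distinct weights from widest to narrowest; whenever the clique
    size among variables of weight >= w grows, the extra colors must each
    weigh at least w.
    """
    bound = 0
    colors = 0
    for w in sorted({v[0] for v in variables}, reverse=True):
        need = max_overlap([(s, e) for wt, s, e in variables if wt >= w])
        if need > colors:
            bound += w * (need - colors)
            colors = need
    return bound


# Hypothetical lifetimes for variables of width 24, 18, 16, and 5.
variables = [(24, 0, 2), (18, 2, 4), (16, 1, 3), (5, 1, 2)]
print(coloring_lower_bound(variables))  # 24*1 + 16*1 + 5*1 = 45
```

The 18-bit variable adds nothing to the bound because it never overlaps the 24-bit one, so both can share the widest register.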

Coloring Algorithm
- Weight of coloring: 24*1 + 16*1 + 5*1 = 45
[Figure: the scheduled and bound DFG and the lifetimes of its variables.]

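Experimental Results - Weighted Interval-Graph Coloring
- Designs: aircraft, chem, dir, honda, lee, mcm, pr, u5ml, wang
- Average gap to the lower bound:

  Lower Bound | Left-Edge+PostProcess | [Kum et al '01] | Weighted IGC
  -           | +6.6%                 | +7.5%           | +0.05%

[Table: the per-design values are not preserved in the transcript.]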

Experimental Results - Three Synthesis Flows
- Flow1 (MCAS)
  - MCAS generates the scheduling and binding results and placement information. All operations and variables have uniform bitwidth (32 bits).
- Flow2 (MCAS+MB-PP)
  - Perform a bitwidth post-processing step after Flow1 is done: set the bitwidth of an FU to the maximum bitwidth of all operations executed on it, and set the bitwidth of a register to the maximum bitwidth of all variables stored in it.
- Flow3 (MCAS-MB)
  - After MCAS generates the scheduling and binding results and placement, the lower-bound-based scheduling and binding and the bitwidth-aware register allocation and binding are performed.
- All three flows share the same backend to generate datapath and controllers
- Altera's Quartus II is used to synthesize the resulting RTL VHDL onto the FPGA device Stratix EP1S80F1508C6

Experimental Results - Comparison of the Three Synthesis Flows
- Columns: Design | Node# | MCAS (LE, WL(k)) | MCAS+MB-PP (LE, WL(k)) | MCAS-MB (LE, WL(k))
- Designs: aircraft, chem, dir, honda, lee, mcm, pr, u5ml, wang
- Ave Red %: -34.5%, -36.3%, -51.5%
- LE: area results for datapath and control logic in terms of logic elements
- WL: wire-length
[Table: the per-design values are not preserved in the transcript.]

Conclusions
- We presented a complete bitwidth-aware high-level synthesis flow based on the MCAS synthesis system
- Experimental results
  - Our bitwidth-aware synthesis flow achieves significant reduction in area and wire-length

Reference
- J. Choi, J. Jeon, and K. Choi, "Power Minimization of Functional Units by Partially Guarded Computation," Proc. of ISLPED, 2000
- J. Cong, Y. Fan, X. Yang, and Z. Zhang, "Architecture and Synthesis for Multi-Cycle Communication," Proc. of International Symposium on Physical Design, 2003
- J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, "Architecture and Synthesis for On-Chip Multicycle Communication," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2004
- G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Optimal Datapath Allocation for Multiple-Wordlength Systems," IEEE Electronics Letters, 2000
- G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Heuristic Datapath Allocation for Multiple Wordlength Systems," Proc. of Design, Automation and Test in Europe (DATE), 2001
- K. Kum and W. Sung, "Combined Word-Length Optimization and High-Level Synthesis of Digital Signal Processing Systems," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2001
- M. C. Molina, J. M. Mendias, and R. Hermida, "High-Level Synthesis of Multiple-Precision Circuits Independent of Data-Objects Length," Proc. of the 39th Design Automation Conference, 2002
- A. Sharma and R. Jain, "Estimating Architectural Resources and Performance for High-Level Synthesis Applications," IEEE Trans. on VLSI Systems, 1993
- M. Stephenson, J. Babb, and S. Amarasinghe, "Bitwidth Analysis with Application to Silicon Compilation," Proc. of the ACM SIGPLAN'2000 Conference on Programming Language Design and Implementation, 2000