L12: Lower Power High Level Synthesis (3), August 1999. Sungkyunkwan University, Prof. Jun-Dong Cho



Matrix-vector product algorithm

Retiming: flip-flop insertion to minimize hazard activity; moving a flip-flop in a circuit.

Exploiting spatial locality for interconnect power reduction [Figure: global and local interconnect, Adder1 and Adder2]

Balancing maximal time-sharing and fully-parallel implementation A fourth-order parallel-form IIR filter (a) Local assignment (2 global transfers), (b) Non-local assignment (20 global transfers)

Retiming/pipelining for Critical path

Effective Resource Utilization

Hazard propagation elimination by clocked sampling: by sampling a steady-state signal at a register input, no glitches are propagated into the next stage of combinational logic.

Regularity: common patterns enable the design of a less complex architecture and therefore a simpler interconnect structure (muxes, buffers, and buses). Regular designs often have less control hardware.

Module Selection: select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (i.e., where to put registers), such that a minimal hardware cost is obtained under the given timing and throughput constraints. Full pipelining is ineffective when the clock period mismatches the execution times of the operators; performing operations in sequence without intermediate buffering can reduce the critical path. Operations are clustered into non-pipelined hardware modules so that the reusability of these modules over the complete computational graph is maximized. During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.

Estimation: estimate min and max bounds on the required resources to (1) delimit the design space, with the min bounds serving as an initial solution, and (2) serve as entries in a resource utilization table that guides the transformation, assignment, and scheduling operations. The max bound on execution time is $t_{max}$; the DFG is topologically ordered using ASAP and ALAP. Minimum bound on the number of resources for each resource class: $N_{R_i} \ge \lceil O_{R_i} \, d_{R_i} / t_{max} \rceil$, where $N_{R_i}$ is the number of resources of class $R_i$, $d_{R_i}$ is the duration of a single operation, and $O_{R_i}$ is the number of operations.
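The minimum-resource bound can be tried out with a few lines of Python. This is a minimal sketch that assumes the usual form of the bound, the ceiling of (number of operations × operation duration) / t_max; the function name `min_resources` and the workload numbers are hypothetical and not taken from the slides.

```python
def min_resources(num_ops, op_duration, t_max):
    """Lower bound on the units needed for one resource class.

    Assumed form of the bound: ceil(num_ops * op_duration / t_max).
    """
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-(num_ops * op_duration) // t_max)


# Hypothetical workload: 6 two-cycle multiplications and 3 one-cycle
# additions that must all finish within t_max = 4 control steps.
bounds = {
    "multiplier": min_resources(6, 2, 4),
    "adder": min_resources(3, 1, 4),
}
print(bounds)
```

The bound ignores data dependencies, so it is only a starting point for the search; a real schedule may need more units than it suggests.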

Exploring the Design Space: find the minimal-area solution subject to the timing constraints. By checking the critical paths, determine whether the proposed graph violates the timing constraints. If so, retiming, pipelining, and tree-height reduction can be applied. After an acceptable graph is obtained, the resource allocation process is initiated: change the available hardware (FUs, registers, buses); redistribute the time allocation over the sub-graphs; transform the graph to reduce the hardware requirements. Use a rejectionless probabilistic iterative search technique (a variant of simulated annealing), in which moves are always accepted. This approach reduces computational complexity and gives faster convergence.

Data path Synthesis

Scheduling and Binding: the scheduling task selects the control step in which a given operation will happen, i.e., it assigns each operation to an execution cycle. Sharing: bind a resource to more than one operation; such operations must not execute concurrently. The graph is scheduled hierarchically in a bottom-up fashion. Power tradeoffs: shorter schedules enable supply voltage (Vdd) scaling; the schedule directly impacts resource sharing; energy consumption depends on what the previous instruction was; reordering can minimize the switching on the control path. Clock selection: eliminate slack; choose the optimal system clock period.

ASAP Scheduling Algorithm: HAL Example
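ASAP scheduling places every operation in the earliest control step allowed by its predecessors. The sketch below is one way to write it, not the slide's own listing. The edge list `HAL_LIKE_DFG` is an assumption modeled on the well-known differential-equation benchmark (the slide's figure is not available), and unit delays are assumed unless a `delay` map is given.

```python
from collections import deque


def asap_schedule(succ, delay=None):
    """Return {op: earliest start step}, with steps numbered from 1."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    delay = delay or {v: 1 for v in nodes}  # assumption: unit delay
    preds = {v: [] for v in nodes}
    for u, vs in succ.items():
        for v in vs:
            preds[v].append(u)
    pending = {v: len(preds[v]) for v in nodes}
    ready = deque(sorted(v for v in nodes if pending[v] == 0))
    start = {}
    while ready:
        u = ready.popleft()
        # Start right after the latest-finishing predecessor (step 1 if none).
        start[u] = max((start[p] + delay[p] for p in preds[u]), default=1)
        for v in succ.get(u, []):
            pending[v] -= 1
            if pending[v] == 0:
                ready.append(v)
    if len(start) != len(nodes):
        raise ValueError("graph has a cycle; ASAP needs a DAG")
    return start


# Assumed dependency edges (operation -> successors).
HAL_LIKE_DFG = {
    "v1": ["v3"], "v2": ["v3"], "v3": ["v4"], "v4": ["v5"],
    "v6": ["v7"], "v7": ["v5"], "v8": ["v9"], "v10": ["v11"],
}
asap = asap_schedule(HAL_LIKE_DFG)
print(sorted(asap.items()))
```

The length of the ASAP schedule (here the largest start step) is the minimum latency when resources are unlimited.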

ALAP Scheduling Algorithm: HAL Example
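ALAP scheduling mirrors ASAP: each operation is pushed to the latest control step that still meets a latency bound. The following is a minimal sketch under the same assumptions as an ASAP pass would need (acyclic graph, unit delays by default). The graph `HAL_LIKE_DFG` and the latency bound of 4 are illustrative assumptions, not values read from the slide.

```python
def alap_schedule(succ, latency, delay=None):
    """Return {op: latest start step} so that everything ends by `latency`."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    delay = delay or {v: 1 for v in nodes}  # assumption: unit delay
    start = {}

    def visit(v):
        # Memoized depth-first walk; assumes the graph is acyclic.
        if v not in start:
            latest_end = min((visit(s) for s in succ.get(v, [])),
                             default=latency + 1)
            start[v] = latest_end - delay[v]
        return start[v]

    for v in nodes:
        visit(v)
    if min(start.values()) < 1:
        raise ValueError("latency bound is too tight for this graph")
    return start


# Assumed dependency edges (operation -> successors).
HAL_LIKE_DFG = {
    "v1": ["v3"], "v2": ["v3"], "v3": ["v4"], "v4": ["v5"],
    "v6": ["v7"], "v7": ["v5"], "v8": ["v9"], "v10": ["v11"],
}
alap = alap_schedule(HAL_LIKE_DFG, latency=4)
print(sorted(alap.items()))
```

The difference between an operation's ALAP and ASAP steps is its mobility, which list scheduling and force-directed scheduling use as scheduling freedom.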

Force Directed Scheduling: force is used as the priority function. Force is related to concurrency. Sort operations for least force. Mechanical analogy: force = constant × displacement, where constant = operation-type distribution and displacement = change in probability.
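The analogy can be made concrete with a small calculation. In this sketch each operation is assumed to be equally likely in every step of its time frame [ASAP, ALAP]; the distribution graph sums those probabilities per step, and the self-force of placing an operation in a step is the distribution value times the change in probability, summed over the frame. The time frames in `MULT_FRAMES` are hypothetical values for six multiplications, not data copied from the slides.

```python
def distribution(frames, num_steps):
    """Operation-type distribution: expected number of ops in each step."""
    dg = [0.0] * (num_steps + 1)  # index 0 unused; steps are 1..num_steps
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)
        for s in range(lo, hi + 1):
            dg[s] += p
    return dg


def self_force(op, step, frames, dg):
    """Sum over the op's frame of distribution * change in probability."""
    lo, hi = frames[op]
    p = 1.0 / (hi - lo + 1)
    return sum(dg[s] * ((1.0 if s == step else 0.0) - p)
               for s in range(lo, hi + 1))


# Hypothetical (ASAP, ALAP) frames for six multiplications, 4 control steps.
MULT_FRAMES = {
    "v1": (1, 1), "v2": (1, 1), "v3": (2, 2),
    "v6": (1, 2), "v7": (2, 3), "v8": (1, 3),
}
dg = distribution(MULT_FRAMES, num_steps=4)
force_step1 = self_force("v6", 1, MULT_FRAMES, dg)
force_step2 = self_force("v6", 2, MULT_FRAMES, dg)
print(dg[1:], force_step1, force_step2)
```

A positive force means the placement increases concurrency in an already crowded step, so the placement with the lower force is preferred. The sketch leaves out predecessor and successor forces, which a full implementation adds before comparing candidates.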

Force Directed Scheduling

Example: Operation v6

Force-Directed Scheduling Algorithm (Paulin)

Force-Directed Scheduling Example [Figures: (a) probability of scheduling operations into control steps; (b) probability of scheduling operations into control steps after operation o3 is scheduled to step s2; (c) operator cost for multiplications in (a); (d) operator cost for multiplications in (c)]

List Scheduling [Figures: the scheduled DFG; the DFG with mobility labeling (inside <>); the ready operation list and resource constraint]
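List scheduling fills control steps one at a time, taking ready operations in priority order until the resource limit of each type is reached. The sketch below uses mobility as the priority (lowest first) and assumes unit-delay operations. The graph, the operation types, the mobility values, and the limit of two units per type are all illustrative assumptions.

```python
def list_schedule(succ, op_type, limits, priority):
    """Resource-constrained list scheduling with unit-delay operations."""
    if any(limits[t] < 1 for t in set(op_type.values())):
        raise ValueError("every used resource type needs at least one unit")
    preds = {v: [] for v in op_type}
    for u, vs in succ.items():
        for v in vs:
            preds[v].append(u)
    start, step = {}, 1
    while len(start) < len(op_type):
        # Ready: unscheduled, with all predecessors done before this step.
        ready = [v for v in op_type if v not in start
                 and all(p in start and start[p] < step for p in preds[v])]
        ready.sort(key=lambda v: (priority[v], v))  # least mobility first
        used = {}
        for v in ready:
            t = op_type[v]
            if used.get(t, 0) < limits[t]:
                start[v] = step
                used[t] = used.get(t, 0) + 1
        step += 1
    return start


# Assumed acyclic DFG (operation -> successors), types, and mobilities.
DFG = {"v1": ["v3"], "v2": ["v3"], "v3": ["v4"], "v4": ["v5"],
       "v6": ["v7"], "v7": ["v5"], "v8": ["v9"], "v10": ["v11"]}
OP_TYPE = {"v1": "mul", "v2": "mul", "v3": "mul", "v6": "mul", "v7": "mul",
           "v8": "mul", "v4": "alu", "v5": "alu", "v9": "alu",
           "v10": "alu", "v11": "alu"}
MOBILITY = {"v1": 0, "v2": 0, "v3": 0, "v4": 0, "v5": 0, "v6": 1,
            "v7": 1, "v8": 2, "v9": 2, "v10": 2, "v11": 2}
schedule = list_schedule(DFG, OP_TYPE, {"mul": 2, "alu": 2}, MOBILITY)
print(sorted(schedule.items()))
```

List scheduling is a heuristic: it respects the resource limits by construction but does not guarantee the shortest possible schedule.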

Static-List Scheduling [Figures: the DFG; the priority list; a partial schedule of five nodes; the final schedule]

Divide-and-Conquer to minimize the power consumption: decompose a computation into strongly connected components (SCCs); merge any adjacent trivial SCCs into a sub-part; use pipelining to isolate the sub-parts; for each sub-part, minimize the number of delays using retiming, then, if the sub-part is linear, apply optimal unfolding, else apply unfolding after the isolation of nonlinear operations; merge linear sub-parts to further optimize; schedule the merged sub-parts to minimize memory usage.

Choosing Optimal Clock Period

SCC decomposition step: the SCCs are found using the standard depth-first-search-based algorithm [Tarjan, 1972], which runs in time linear in the number of nodes and edges. For any pair of operations A and B within an SCC, there exists both a path from A to B and a path from B to A. The graph formed by collapsing each SCC into a single node is acyclic. Thus, the SCCs can be isolated from each other using pipeline delays, which enables us to optimize each SCC separately.
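A compact recursive version of Tarjan's algorithm is sketched below for reference. The example graph is hypothetical: it contains one three-node feedback loop, one two-node loop, and one trivial component. The recursive form is fine for small computation graphs; very deep graphs would need an iterative rewrite because of Python's recursion limit.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of {node: [successors]}."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop the component off the stack.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in list(graph):
        if v not in index:
            visit(v)
    return sccs


# Hypothetical computation graph with two feedback loops.
GRAPH = {"a": ["b"], "b": ["c"], "c": ["a", "d"],
         "d": ["e"], "e": ["d"], "f": ["a"]}
components = tarjan_scc(GRAPH)
print(components)
```

Every edge that connects two different components in the result is a place where pipeline delays can be inserted to isolate the components.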

Identifying SCCs: the first step of the approach is to identify the computation's strongly connected components.

Choosing Optimal Clock Period

Supply Voltage Scaling: lowering Vdd reduces energy but increases delay.
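The tradeoff can be illustrated with the common first-order CMOS model, in which switching energy scales with Vdd² and gate delay scales with Vdd / (Vdd − Vt)². This model and the numbers used (a 5.0 V reference, a 3.3 V target, Vt = 0.7 V) are assumptions chosen for illustration and do not come from the slides.

```python
def relative_energy(vdd, vref):
    """Switching energy relative to the reference supply (E ~ Vdd^2)."""
    return (vdd / vref) ** 2


def relative_delay(vdd, vref, vt):
    """Gate delay relative to the reference supply.

    Assumed first-order model: delay ~ Vdd / (Vdd - Vt)^2.
    """
    if vdd <= vt or vref <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def gate_delay(v):
        return v / (v - vt) ** 2

    return gate_delay(vdd) / gate_delay(vref)


# Hypothetical operating points.
energy = relative_energy(3.3, 5.0)
delay = relative_delay(3.3, 5.0, vt=0.7)
print(f"energy x{energy:.3f}, delay x{delay:.3f}")
```

Under these assumptions the lower supply cuts energy to under half while making each gate almost twice as slow, which is why a shorter schedule is needed to leave room for voltage scaling.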

Multiple Supply Voltages Filter Example

Scheduling Using Shut-down: |a-b|

Loop Scheduling: sequential execution; partial loop unrolling; loop folding.

Reduce execution delay of a loop. Pipeline operations inside a loop. Overlap execution of operations. Need a prologue and epilogue. Use pipeline scheduling for loop graph model.
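The gain from overlapping iterations can be estimated with a simple latency count. This sketch assumes a fixed loop-body latency and a constant initiation interval (the number of steps between the starts of consecutive iterations); the function names and the numbers are hypothetical.

```python
def sequential_latency(iterations, body_latency):
    """Total steps when each iteration waits for the previous one to finish."""
    return iterations * body_latency


def folded_latency(iterations, body_latency, initiation_interval):
    """Total steps when a new iteration starts every `initiation_interval`."""
    if not 1 <= initiation_interval <= body_latency:
        raise ValueError("initiation interval must be in 1..body_latency")
    return body_latency + (iterations - 1) * initiation_interval


def iteration_starts(iterations, initiation_interval):
    """Start step of each iteration in the folded schedule (steps from 1)."""
    return [1 + i * initiation_interval for i in range(iterations)]


# Hypothetical loop: 10 iterations, 4-step body, new iteration every 2 steps.
seq = sequential_latency(10, 4)
folded = folded_latency(10, 4, 2)
print(seq, folded, iteration_starts(3, 2))
```

The steps before the pipeline is full and after the last iteration has started correspond to the prologue and epilogue; only the middle part of the schedule runs at the steady-state rate.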

DFG Restructuring [Figures: DFG2; DFG2 after redundant operation insertion]

Minimizing the bit transitions for constants during Scheduling
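One way to see the effect is to count the bit flips between consecutive constants on a shared bus and compare two orderings. The greedy nearest-neighbor reordering below is only an illustrative heuristic, not the method from the slides, and it assumes the operations using the constants are independent so that they may be reordered freely. The constant values are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")


def transitions(sequence):
    """Total bit flips when the values appear on a bus in this order."""
    return sum(hamming(x, y) for x, y in zip(sequence, sequence[1:]))


def greedy_order(constants):
    """Keep the first constant, then always pick the closest remaining one."""
    remaining = list(constants)
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda c: (hamming(order[-1], c), c))
        remaining.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical 4-bit constants in their original schedule order.
original = [0b0000, 0b1111, 0b0001, 0b1110]
reordered = greedy_order(original)
print(transitions(original), transitions(reordered), reordered)
```

In a real scheduler the reordering would be restricted to orders that respect the data dependencies and timing constraints of the DFG.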

Control Synthesis: synthesize a circuit that executes the scheduled operations, provides synchronization, and supports iteration, branching, hierarchy, and interfaces.

Allocation: bind a resource to more than one operation.
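Sharing is possible whenever two operations of the same type occupy disjoint control steps. The sketch below binds scheduled operations to units with a left-edge style greedy pass: operations are sorted by start step and each one goes to the first unit that is already free. For operations described by simple busy intervals this uses the minimum number of units. The intervals are hypothetical (start step, end step) pairs for six multiplications.

```python
def left_edge_binding(intervals):
    """Bind ops to units; ops on one unit never overlap in time.

    intervals: {op: (first_busy_step, last_busy_step)}, both inclusive.
    Returns {op: unit_index}.
    """
    busy_until = []  # last busy step of each allocated unit
    binding = {}
    by_start = sorted(intervals.items(), key=lambda kv: (kv[1][0], kv[0]))
    for op, (first, last) in by_start:
        for unit, free_after in enumerate(busy_until):
            if free_after < first:  # unit is idle when this op starts
                busy_until[unit] = last
                binding[op] = unit
                break
        else:
            busy_until.append(last)  # no idle unit: allocate a new one
            binding[op] = len(busy_until) - 1
    return binding


# Hypothetical schedule of six single-step multiplications.
MULT_INTERVALS = {"v1": (1, 1), "v2": (1, 1), "v3": (2, 2),
                  "v6": (2, 2), "v7": (3, 3), "v8": (3, 3)}
binding = left_edge_binding(MULT_INTERVALS)
print(binding, "units used:", len(set(binding.values())))
```

This sketch minimizes only the unit count. A power-aware binding would also weigh how much switching each sharing choice causes at the unit's inputs.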

Optimum binding

Example