
High-Level Synthesis Algorithms

2 Temporal Partitioning & Scheduling

Scheduling:
• Inputs:
− A DFG
− An architecture (i.e. a set of processing elements)
• Output:
− The starting time of each node on a given resource

Temporal partitioning:
• Inputs:
− A DFG
− A reconfigurable device
• Outputs:
− A set of partitions
− The starting time of each node, which is the starting time of the partition to which it belongs

Solution approaches:
• List scheduling
• Integer linear programming (exact method)
• Network flow*
• Spectral method*
− * recursive bi-partitioning approaches
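To make the algorithm slides that follow concrete, here is a minimal Python sketch of how such a DFG could be represented; the node names and the concrete graph are illustrative assumptions, not taken from the slides (the latencies follow the ASAP example on slide 6, and the comparison latency is an assumption).

```python
# Illustrative DFG representation used by the sketches on later slides.
DELAY = {"mul": 100, "add": 50, "sub": 50, "cmp": 50}  # clocks, per slide 6

# Each node maps to its operation type.
NODES = {"n1": "mul", "n2": "mul", "n3": "add", "n4": "sub", "n5": "cmp"}

# Edges (u, v): node v consumes a result produced by node u.
EDGES = [("n1", "n3"), ("n2", "n3"), ("n3", "n4"), ("n4", "n5")]
```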

3 Unconstrained Scheduling

Unconstrained scheduling:
• Assumption: an unlimited amount of resources
− A device with unlimited size
• Usually serves as a pre-processing step for other algorithms
− E.g. computation of upper and lower bounds on the starting times of operations:
• Lower bound: the earliest time at which a module can be scheduled,
• Upper bound: the latest time at which a module can be started.

4 Unconstrained Scheduling

ASAP (as soon as possible):
• Defines the earliest starting time for each node in the DFG
• Computes a minimal latency

ALAP (as late as possible):
• Defines the latest starting time for each node in the DFG according to a given latency

Mobility of a node:
• (ALAP starting time) − (ASAP starting time)
• Mobility = 0 → the node is on a critical path

5 ASAP Example

• Unconstrained schedule with optimal latency L = 4
(Figure: a DFG of multiplications, additions, a subtraction and a comparison scheduled into time steps 0 through 4; not reproduced in this transcript.)

6 ASAP Example

Assumptions:
• Multiplication: latency of 100 clocks,
• Addition/subtraction: 50 clocks,
• Data transmission delay is neglected.
(Figure: the example DFG annotated with each node's computation delay and the starting time computed by the algorithm; not reproduced.)

7 ASAP Algorithm

ASAP(G(V,E), d) {
  FOREACH (v_i without predecessor)
    s(v_i) := 0;
  REPEAT {
    choose a node v_i whose predecessors are all planned;
    s(v_i) := max_{j:(v_j,v_i)∈E} { s(v_j) + d_j };
  } UNTIL (all nodes v_i are planned);
  RETURN s;
}
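A runnable Python version of this pseudocode; a sketch assuming the illustrative DFG representation introduced after slide 2.

```python
def asap(nodes, edges, delay):
    """Earliest starting time s(v) for every node (ASAP scheduling)."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    s, unplanned = {}, set(nodes)
    while unplanned:
        # Choose a node whose predecessors are all planned.
        v = next(n for n in unplanned if all(p in s for p in preds[n]))
        # Start once the slowest predecessor has delivered its result.
        s[v] = max((s[p] + delay[nodes[p]] for p in preds[v]), default=0)
        unplanned.remove(v)
    return s

# With the example data: asap(NODES, EDGES, DELAY)
#   -> {'n1': 0, 'n2': 0, 'n3': 100, 'n4': 150, 'n5': 200}
```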

8 ALAP Example

• Unconstrained schedule with optimal latency L = 4
(Figure: the same DFG scheduled as late as possible within time steps 0 through 4; not reproduced.)

9 Mobility

(Figure: the ASAP and ALAP schedules over time steps 0 through 4 annotated with each node's mobility; nodes with mobility 0 form the critical path. Not reproduced.)

10 ALAP Example

Assumptions:
• Multiplication: latency of 100 clocks,
• Addition/subtraction: 50 clocks,
• Overall computation time: 250 clocks.
(Figure: the example DFG annotated with each node's computation delay and the ALAP starting time computed by the algorithm; not reproduced.)

11 ALAP Algorithm

ALAP(G(V,E), d, L) {
  FOREACH (v_i without successor)
    s(v_i) := L − d_i;
  REPEAT {
    choose a node v_i whose successors are all planned;
    s(v_i) := min_{j:(v_i,v_j)∈E} { s(v_j) } − d_i;
  } UNTIL (all nodes v_i are planned);
  RETURN s;
}
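The matching ALAP sketch, plus the mobility computation from slide 4, again under the assumed DFG representation and reusing the asap() sketch above.

```python
def alap(nodes, edges, delay, latency):
    """Latest starting time s(v) for every node, given an overall latency."""
    succs = {v: [] for v in nodes}
    for u, v in edges:
        succs[u].append(v)
    s, unplanned = {}, set(nodes)
    while unplanned:
        # Choose a node whose successors are all planned.
        v = next(n for n in unplanned if all(w in s for w in succs[n]))
        if succs[v]:
            s[v] = min(s[w] for w in succs[v]) - delay[nodes[v]]
        else:
            s[v] = latency - delay[nodes[v]]  # sinks finish exactly at L
        unplanned.remove(v)
    return s

def mobility(nodes, edges, delay, latency):
    """ALAP start minus ASAP start; 0 means the node is on a critical path."""
    early = asap(nodes, edges, delay)
    late = alap(nodes, edges, delay, latency)
    return {v: late[v] - early[v] for v in nodes}
```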

12 Constrained Scheduling

Constrained scheduling:
• A fixed set of resources is available (as in an ASIC).
• Many tasks compete for a given resource,
− → one of them must be chosen according to a given criterion and the rest scheduled later.

1. Extended ASAP, ALAP:
• Compute ASAP or ALAP
• Shift tasks to earlier (ASAP) or later (ALAP) time steps until the resource constraints (e.g. area) are fulfilled.

13 Extended ASAP

• Constraint: 2 multipliers, 2 ALUs (+, −, <)
(Figure: the example DFG rescheduled over time steps 0 through 4 so that no step uses more than two multipliers and two ALUs; not reproduced.)

14 Constrained Scheduling

List scheduling:
• Sort nodes in topological order
• Assign a priority to each node
• The priority criterion can be:
− number of successors,
− depth (length of the longest path from the inputs),
− latency-weighted depth (w: latency of the operation executed by each node on the path),
− mobility,
− connectivity,
− ...

15 Constrained Scheduling

At any time step t:
• A ready set L is constructed (operations ready to be scheduled)
− L: operations whose predecessors have already been scheduled early enough to complete their execution by time t.
• Tasks are placed in L in decreasing priority order.
• At a given step, a free resource is assigned the task with the highest priority.

16 Constrained Scheduling

• At a given step, a free resource is assigned the task with the highest priority.
(Flowchart: for each resource type k, check whether there are enough resources of type k to implement all ready operations of type k; if yes, assign resources to all of them, otherwise assign resources only to the high-priority operations. A code sketch follows below.)
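A sketch of the list scheduler described on slides 14–16, assuming the illustrative DFG representation from earlier. Here rclass maps an operation type to a resource class (e.g. +, − and < all share the ALU) and resources gives the number of units per class; all names are illustrative.

```python
def list_schedule(nodes, edges, delay, rclass, resources, priority):
    """Resource-constrained list scheduling with single-clock time steps."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    start, t = {}, 0
    while len(start) < len(nodes):
        # Ready set L: unscheduled ops whose predecessors finish by time t,
        # ordered by decreasing priority.
        ready = sorted((v for v in nodes if v not in start
                        and all(p in start and start[p] + delay[nodes[p]] <= t
                                for p in preds[v])),
                       key=priority, reverse=True)
        # Units already occupied by operations still executing at time t.
        busy = {}
        for v, s0 in start.items():
            if s0 <= t < s0 + delay[nodes[v]]:
                k = rclass[nodes[v]]
                busy[k] = busy.get(k, 0) + 1
        # Hand the free units to the highest-priority ready operations.
        for v in ready:
            k = rclass[nodes[v]]
            if busy.get(k, 0) < resources[k]:
                start[v] = t
                busy[k] = busy.get(k, 0) + 1
        t += 1
    return start
```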

17 Constrained Scheduling (Example)

• Criterion: number of successors
• Resources: 1 multiplier, 1 ALU (+, −, <)
(Figure: the example DFG annotated with node priorities; not reproduced.)

18 Constrained Scheduling (Example)

(Figure: the resulting schedule over time steps 0 through 7, using at most one multiplier and one ALU per step; not reproduced.)

19 List Scheduling: Example

• Resources: 1 multiplier, 1 adder
• Latency:
− Multiplication: 100 clocks,
− Addition/subtraction: 50 clocks.
(Figure: the scheduled DFG; not reproduced. A usage sketch follows below.)
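Applied to the illustrative DFG with this slide's resource set (one multiplier, one adder/ALU) and the number of successors as the priority criterion, the list_schedule sketch above could be driven like this:

```python
# Number of successors of each node, used as the priority criterion.
num_succs = {v: sum(1 for (u, _) in EDGES if u == v) for v in NODES}

RCLASS = {"mul": "mult", "add": "alu", "sub": "alu", "cmp": "alu"}
RESOURCES = {"mult": 1, "alu": 1}

schedule = list_schedule(NODES, EDGES, DELAY, RCLASS, RESOURCES,
                         priority=lambda v: num_succs[v])
# The two multiplications are serialized on the single multiplier;
# everything else follows the data dependencies.
```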

20 Force-Directed List Scheduling

21 Temporal Partitioning vs. Constrained Scheduling

In RCS (reconfigurable computing systems):
• Resource types are not important.
− The amount of basic resources (area) is important.
• Operators do not compete for resources.
− They compete for area.
• Usually only the starting and end times of the complete partition are considered.

22 Temporal Partitioning in RCS

Temporal partitioning:
• The same as list scheduling
• Assignment criterion: there must be enough space left on the device to accommodate the new component.

Algorithm: list scheduling for reconfigurable devices

  sort the nodes of V according to their priorities
  P_0 := Ø
  while V ≠ Ø do
    select the vertex v ∈ V with the highest priority whose predecessors are all placed
    if (a partition P_i exists with s(P_i) + s(v) ≤ s(H)) then
      P_i := P_i ∪ {v}
    else
      create a new partition P_{i+1} and set P_{i+1} := {v}
    end if
    V := V \ {v}
  end while
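A runnable sketch of this algorithm. For simplicity it only tries to fill the most recently opened partition (one way to realize the slide's "a partition P_i exists" test); the size function and the device size s(H) are taken from the example on slide 23, and all names are illustrative.

```python
def temporal_partition(nodes, edges, size, device_size, priority):
    """List-scheduling-based temporal partitioning for one device."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    order = sorted(nodes, key=priority, reverse=True)
    partitions, placed = [], set()
    while len(placed) < len(nodes):
        # Highest-priority unplaced vertex whose predecessors are placed.
        v = next(n for n in order
                 if n not in placed and all(p in placed for p in preds[n]))
        used = sum(size[nodes[u]] for u in partitions[-1]) if partitions else 0
        if partitions and used + size[nodes[v]] <= device_size:
            partitions[-1].append(v)   # v still fits: reuse the partition
        else:
            partitions.append([v])     # open a new configuration
        placed.add(v)
    return partitions

# E.g., with slide 23's sizes (assuming our op types match that example):
# temporal_partition(NODES, EDGES,
#                    {"mul": 100, "add": 20, "sub": 20, "cmp": 10},
#                    250, priority=lambda v: num_succs[v])
```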

23 Temporal Partitioning vs. Constrained Scheduling

• Criterion: number of successors
• size(FPGA) = 250,
• size(mult) = 100,
• size(add) = size(sub) = 20,
• size(comp) = 10.

• Connectivity:
− c(P1) = 1/6,
− c(P2) = 1/3,
− c(P3) = 2/6.
• Quality: 0.28 (the average connectivity of the three partitions)
(Figure: the example DFG split into partitions P1, P2, P3; not reproduced.)

24 Improvement

Best criterion:
• Total computation time of the DFG: t_DFG = n × C_H + Σ_{i=1,…,n} t_Pi
− C_H: reconfiguration time of device H
− t_Pi: computation time of partition P_i
− n: number of partitions

Optimization:
• If C_H is large, the optimization will tend to minimize the number of partitions.
• If C_H « t_P, the algorithm will tend to avoid long paths within partitions.

25 Improvement

Advantages of LS-based temporal partitioning:
• Fast (linear-time algorithm)
• Local optimization is possible
− e.g. configuration switching

Disadvantage:
• Levelization:
− Modules are assigned to partitions based more on their level number than on their interconnectivity with other components.
− → Interconnectivity (data exchange) must be optimized.
(Figure: a DFG whose nodes are grouped into levels 0 through 3; not reproduced.)

26 LS-Based Temporal Partitioning

• Criterion: number of successors
• size(FPGA) = 250,
• size(mult) = 100,
• size(add) = size(sub) = 20,
• size(comp) = 10.

• Connectivity:
− c(P1) = 1/6,
− c(P2) = 1/3,
− c(P3) = 2/6.
• Quality: 0.28
(Figure: the levelized partitioning produced by list scheduling; not reproduced.)

27 Improved Temporal Partitioning

• Connectivity:
− c(P1) = 2/10,
− c(P2) = 2/3,
− c(P3) = 2/3.
• Quality: 0.51
− → Better than the 0.28 of the levelized partitioning: grouping strongly connected nodes raises the average partition connectivity.
(Figure: the improved partitioning of the same DFG; not reproduced.)

28 Improved List Scheduling

• Pairwise interchange of components between partitions.
(Figure not reproduced.)

29 2.2 Temporal Partitioning – ILP

With ILP (integer linear programming):
• Temporal partitioning constraints are formulated as equations.
• The equations are then solved using an ILP solver.

The constraints usually considered are:
• Uniqueness constraint
• Temporal order constraint
• Memory constraint
• Resource constraint
• Latency constraint
(Notation: the notation slide was rendered as an image; the variables y_ik used in the example below denote the assignment of node v_i to partition P_k.)

30 2.2 Temporal Partitioning – ILP

Unique assignment constraint:
• Each task must be placed in exactly one partition (m = number of partitions).

Precedence constraint:
• For each edge e = (u, v) in the graph, u must be placed either in the same partition as v or in an earlier partition than the one in which v is placed.
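The constraint formulas themselves were rendered as images and did not survive the transcript; a standard formulation in the y_ik notation (y_ik = 1 iff node v_i is placed in partition P_k) would be:

```latex
% Unique assignment: each node v_i lies in exactly one of the m partitions.
\sum_{k=1}^{m} y_{ik} = 1 \qquad \forall v_i \in V

% Precedence: for every edge (v_i, v_j), the partition index of v_i
% must not exceed that of v_j.
\sum_{k=1}^{m} k \, y_{ik} \;\le\; \sum_{k=1}^{m} k \, y_{jk}
\qquad \forall (v_i, v_j) \in E
```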

31 Temporal Partitioning – ILP

Resource constraint:
• The sum of the resources needed to implement the modules in one partition must not exceed the total amount of available resources.
− Device area constraint: s(H)
− Device terminal constraint: T (size of the communication memory)
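Reconstructing the missing area formula in the same notation (s(v_i) is the area of the operator implementing v_i, s(H) the device area); the terminal/memory constraint is analogous, with terminal counts bounded by T:

```latex
% Device area: the modules of each partition P_k must fit on the device H.
\sum_{v_i \in V} s(v_i) \, y_{ik} \;\le\; s(H) \qquad \forall k = 1, \dots, m
```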

32 Temporal Partitioning by ILP: Example

Assignment constraint:
• y11 + y12 + y13 = 1
• y21 + y22 + y23 = 1
• ...
• y71 + y72 + y73 = 1

A satisfying assignment:
• Partition P1: y21 = y31 = y41 = 1 (y22 = y23 = y32 = y33 = y42 = y43 = 0)
• Partition P2: y12 = y52 = y62 = 1 (y11 = y13 = y51 = y53 = y61 = y63 = 0)
• Partition P3: y73 = 1 (y71 = y72 = 0)

33 Temporal Partitioning by ILP: Example

Precedence constraint:
(The constraint equations for the example were rendered as images and did not survive the transcript.)

34 Temporal Partitioning by ILP: Example

Resource constraint:
• A device with a size of 200 LUTs; 100 LUTs for the multiplication and 50 LUTs each for the addition and the comparison.
(s(u) in the constraint denotes the area of the operator implementing u; the equation itself did not survive the transcript.)

35 Temporal Partitioning by ILP: Example

Communication memory constraint:
• Assume that a memory of 50 bytes is available for communication and each datum is 32 bits (4 bytes) wide, i.e. at most 12 data items can reside in the communication memory at once.
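As a sketch of how such a model could be handed to an off-the-shelf solver, here is a PuLP formulation of the assignment, precedence and area constraints for the illustrative DFG from earlier; the partition count M and the objective are assumptions, and slide 34's LUT figures are reused.

```python
# pip install pulp -- a hypothetical, minimal model; not the slides' exact one.
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary

SIZE = {"mul": 100, "add": 50, "sub": 50, "cmp": 50}  # LUTs, per slide 34
DEVICE = 200                                          # device size in LUTs
M = 3                                                 # partitions (assumed)

prob = LpProblem("temporal_partitioning", LpMinimize)
y = {(v, k): LpVariable(f"y_{v}_{k}", cat=LpBinary)
     for v in NODES for k in range(1, M + 1)}

# Objective (one arbitrary choice): prefer early partitions.
prob += lpSum(k * y[v, k] for v in NODES for k in range(1, M + 1))

for v in NODES:  # unique assignment
    prob += lpSum(y[v, k] for k in range(1, M + 1)) == 1
for u, v in EDGES:  # precedence: u's partition index <= v's
    prob += (lpSum(k * y[u, k] for k in range(1, M + 1))
             <= lpSum(k * y[v, k] for k in range(1, M + 1)))
for k in range(1, M + 1):  # device area per partition
    prob += lpSum(SIZE[NODES[v]] * y[v, k] for v in NODES) <= DEVICE

prob.solve()
partition_of = {v: k for (v, k) in y if y[v, k].value() == 1}
```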

36 2.3 Temporal Partitioning – Network Flow

Recursive bipartitioning:
• The goal at each step is to compute a unidirectional bipartition which minimizes the edge-cut size between the two partitions.
• Network-flow methods are used to compute a bipartition with minimal edge-cut size.
• Directly applying the max-flow min-cut theorem may lead to a non-unidirectional cut. Therefore, the original graph G is first transformed into a new graph G′ in which every cut is unidirectional.
(Figures: a unidirectional recursive bipartitioning and a bidirectional cut; not reproduced.)

37 2.3 Temporal Partitioning – Network Flow

Two-terminal net transformation:
• Replace an edge (v1, v2) with two edges: (v1, v2) with capacity 1 and (v2, v1) with infinite capacity.

Multi-terminal net transformation:
• For a multi-terminal net {v1, v2, ..., vn}:
− Introduce a dummy node v with no weight and a bridging edge (v1, v) with capacity 1.
− Introduce the edges (v, v2), ..., (v, vn), each of which is assigned a capacity of 1.
− Introduce the edges (v2, v1), ..., (vn, v1), each of which is assigned an infinite capacity.
• Having computed a min-cut in the transformed graph G′, a min-cut can be derived in G: each node of G is assigned to the partition to which its counterpart in G′ was assigned.
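A sketch of the two-terminal transformation followed by a min-cut computation using networkx's max-flow-based minimum_cut; the function and terminal names are illustrative, and the multi-terminal transformation is omitted for brevity.

```python
import networkx as nx

def unidirectional_bipartition(edges, source, sink):
    """Min edge-cut of a DAG such that all cut edges point the same way."""
    g = nx.DiGraph()
    for u, v in edges:
        g.add_edge(u, v, capacity=1)              # cutting this edge costs 1
        g.add_edge(v, u, capacity=float("inf"))   # backward edge: never cut
    # Max-flow min-cut: 'keep' contains the source, 'move' the sink.
    cut_size, (keep, move) = nx.minimum_cut(g, source, sink)
    return cut_size, keep, move

# e.g. unidirectional_bipartition(EDGES, "n1", "n5")
```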

Multi-Context FPGAs

39 Multi-Context FPGAs

Reconfiguration time:
• Can be high (compared to computation time)
• If reconfiguration happens inside a loop, too many reconfigurations occur
− → high total computation time

Solutions:
• Multi-context devices
• Partial reconfiguration
• Pipeline reconfiguration [Trimberger97]

40 Multi-Context FPGAs

Advantages:
• Switching between stored configurations is fast (sometimes a single clock cycle)
− → dramatically reduced reconfiguration overhead if the next configuration is present in one of the alternate contexts
• Background loading of configuration data during circuit operation
− → computation overlaps with reconfiguration

41 Multi-Context FPGAs

(Figure: multi-context FPGA organization, p. 99 of [Hauck08]; not reproduced.)

42 Multi-Context FPGAs

Problems with multi-context devices:
• The additional configuration data and the required multiplexing consume valuable area
− which could otherwise be used for logic or routing.
• Either all needed contexts must fit in the available hardware, or some control mechanism must determine when contexts are loaded from external memory.
• Apparently never commercialized [Bobda07]; an eight-context DRFPGA was, however, fabricated by NEC [Fujii99].

43 Partial Reconfiguration

Partial reconfiguration:
• Only some part of the device is reconfigured.
• Can decrease reconfiguration time,
− especially if only a small part needs to change,
− e.g. in a cryptographic system, only the key is changed.
• Allows multiple independent configurations to be swapped in and out independently.

44 Partial Reconfiguration

Devices:
• Xilinx 6200 family (1997):
− Each logic block could be programmed individually.
• Atmel AT40K (1999)
• Xilinx Virtex FPGA family:
− Reconfigures logic blocks in groups called frames
− Virtex-II (2004): frame = a full column
− Virtex-5 (2006): frame = a partial column (41 32-bit words)

45 Virtex Devices

Partial reconfiguration in Virtex:
• Frames: the smallest unit of reconfiguration.
• Frames in Xilinx devices:
− Virtex, Virtex-II, Virtex-II Pro: a whole column.
− Virtex-4, Virtex-5, Virtex-6: only a complete tile; frame width and height differ between devices.
(Figure: two tasks communicating through a logical shared memory built from CLBs [Banerjee07]; not reproduced.)

46 Partial Reconfiguration

Problems:
• If configurations occupy large areas, the time spent transmitting configuration addresses may exceed the time saved on configuration data
− → full serial loading can be better.
• If the full configuration sequence is not known at compile time, configurations may overlap on the device.
− Solution: de-fragmentation.

47 Pipeline Reconfiguration

Pipeline reconfiguration:
• Uses a series of physical pipeline stages.
• The number of virtual stages is generally not constrained by the number of physical stages.
• Example: PipeRench (2000)
(Figure: stage-by-stage reconfiguration; numbers in boxes denote pipeline stages, shaded boxes mark the stage being reconfigured in a given cycle. Not reproduced.)

48 Pipeline Reconfiguration

Problem:
• Data can only propagate forward through the pipeline stages.
− → Any feedback connections must be completely contained within a single stage.

49 References

• [Bobda07] C. Bobda, Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications, Springer, 2007.
• [Hauck08] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, Morgan Kaufmann, 2007.
• [Fujii99] T. Fujii et al., “A dynamically reconfigurable logic engine with a multi-context/multi-mode unified-cell architecture,” in Proc. IEEE Int. Solid-State Circuits Conf., 1999, pp. 364–365.
• [Mehdipour06] F. Mehdipour, M. Saheb Zamani, and M. Sedighi, “An integrated temporal partitioning and physical design framework for static compilation of reconfigurable computing systems,” Microprocessors and Microsystems, Elsevier, vol. 30, 2006, pp. 52–62.