High-Level Synthesis Algorithms

Slide 2: Temporal partitioning & Scheduling

Scheduling:
- Inputs:
  - A DFG (data-flow graph)
  - An architecture (i.e. a set of processing elements)
- Output:
  - Starting time of each node on a given resource

Temporal partitioning:
- Input:
  - A DFG
  - A reconfigurable device
- Output:
  - A set of partitions
  - The starting time of each node is the starting time of the partition to which it belongs

Solution approaches:
- List scheduling
- Integer linear programming (exact method)
- Network flow
- Spectral method
  - * Recursive bi-partitioning approaches

Slide 3: Unconstrained Scheduling

Unconstrained scheduling:
- Assumption: unlimited amount of resources
  - Device with unlimited size
- Usually used as a pre-processing step for other algorithms
  - E.g. computation of the upper and lower bounds on the starting time of operations
- Lower bound: the earliest time at which a module can be scheduled
- Upper bound: the latest time at which a module can be started

Slide 4: Unconstrained Scheduling

ASAP (as soon as possible):
- Defines the earliest starting time for each node in the DFG
- Computes a minimal latency

ALAP (as late as possible):
- Defines the latest starting time for each node in the DFG according to a given latency

The mobility of a node:
- Mobility = (ALAP starting time) - (ASAP starting time)
- Mobility = 0: the node is on a critical path

Slide 5: ASAP Example

- Unconstrained scheduling with optimal latency: L = 4

[Figure: ASAP schedule of the example DFG (operations *, +, -, <) over time steps 0 to 4]

Slide 6: ASAP Example

Assumptions:
- Multiplication: latency of 100 clocks
- Addition/subtraction: latency of 50 clocks
- Data transmission delay is neglected

[Figure: annotated DFG; each node is labeled with the computation delay of the previous node and the node's starting time as computed by the algorithm]

Slide 7: ASAP Algorithm

    ASAP(G(V,E), d) {
        FOREACH (v_i without predecessor)
            s(v_i) := 0;
        REPEAT {
            choose a node v_i whose predecessors are all planned;
            s(v_i) := max_{j:(v_j,v_i) ∈ E} {s(v_j) + d_j};
        } UNTIL (all nodes v_i are planned);
        RETURN s;
    }
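The ASAP pseudocode can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: the function name `asap`, the edge-list representation, and the four-node example graph are assumptions, with latencies taken from the example assumptions (multiplication 100 clocks, addition/subtraction 50 clocks).

```python
from collections import deque

def asap(nodes, edges, delay):
    """Earliest start time of each node of a DAG (ASAP schedule).

    nodes: list of node names; edges: list of (u, v) pairs; delay: node -> latency.
    """
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    pending = {v: len(preds[v]) for v in nodes}      # unplanned predecessors
    start = {v: 0 for v in nodes if not preds[v]}    # nodes without predecessor
    ready = deque(start)
    while ready:
        u = ready.popleft()
        for v in succs[u]:
            pending[v] -= 1
            if pending[v] == 0:                      # all predecessors planned
                start[v] = max(start[p] + delay[p] for p in preds[v])
                ready.append(v)
    return start

# Hypothetical example: two multiplications feed an addition, then a subtraction.
nodes = ["m1", "m2", "a1", "s1"]
edges = [("m1", "a1"), ("m2", "a1"), ("a1", "s1")]
delay = {"m1": 100, "m2": 100, "a1": 50, "s1": 50}
start = asap(nodes, edges, delay)
latency = max(start[v] + delay[v] for v in nodes)
print(start, latency)
```

The minimal latency follows from the schedule as the largest finishing time over all nodes.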

Slide 8: ALAP Example

- Unconstrained scheduling with optimal latency: L = 4

[Figure: ALAP schedule of the example DFG (operations *, +, -, <) over time steps 0 to 4]

Slide 9: Mobility

[Figure: example DFG over time steps 0 to 4; each node is annotated with its mobility (0, 1 or 2)]

Slide 10: ALAP Example

Assumptions:
- Multiplication: latency of 100 clocks
- Addition/subtraction: latency of 50 clocks
- Overall computation time: 250

[Figure: annotated DFG; each node is labeled with the computation delay of the previous node and the node's starting time as computed by the algorithm]

Slide 11: ALAP Algorithm

    ALAP(G(V,E), d, L) {
        FOREACH (v_i without successor)
            s(v_i) := L - d_i;
        REPEAT {
            choose a node v_i whose successors are all planned;
            s(v_i) := min_{j:(v_i,v_j) ∈ E} {s(v_j)} - d_i;
        } UNTIL (all nodes v_i are planned);
        RETURN s;
    }
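A possible Python rendering of the ALAP pseudocode, followed by the mobility computation (ALAP start minus ASAP start). The function name `alap`, the small graph, and the hand-computed `asap_start` values are assumptions made for illustration only.

```python
def alap(nodes, edges, delay, latency):
    """Latest start time of each node of a DAG for a given overall latency."""
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    pending = {v: len(succs[v]) for v in nodes}      # unplanned successors
    start = {v: latency - delay[v] for v in nodes if not succs[v]}
    ready = list(start)
    while ready:
        v = ready.pop()
        for u in preds[v]:
            pending[u] -= 1
            if pending[u] == 0:                      # all successors planned
                start[u] = min(start[w] for w in succs[u]) - delay[u]
                ready.append(u)
    return start

# Hypothetical example: m1 -> a1 -> s1, plus an independent addition a2 -> s1.
nodes = ["m1", "a1", "a2", "s1"]
edges = [("m1", "a1"), ("a1", "s1"), ("a2", "s1")]
delay = {"m1": 100, "a1": 50, "a2": 50, "s1": 50}
alap_start = alap(nodes, edges, delay, latency=200)

# ASAP start times of the same graph, worked out by hand for this sketch.
asap_start = {"m1": 0, "a1": 100, "a2": 0, "s1": 150}
mobility = {v: alap_start[v] - asap_start[v] for v in nodes}
print(alap_start, mobility)
```

In this example only `a2` has non-zero mobility; the other three nodes form the critical path.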

Slide 12: Constrained Scheduling

Constrained scheduling:
- A fixed set of resources is available (ASIC).
- Many tasks compete for a given resource.
  - One of them must be chosen according to a given criterion, and the rest will be scheduled later.

1. Extended ASAP, ALAP:
- Compute ASAP or ALAP.
- Assign the tasks earlier (ASAP) or later (ALAP), until the resource constraints (e.g. area) are fulfilled.

Slide 13: Extended ASAP

- Constraint: 2 multipliers, 2 ALUs (+, -, <)

[Figure: extended ASAP schedule of the example DFG over time steps 0 to 4]

Slide 14: Constrained Scheduling

List scheduling:
- Sort the nodes in topological order
- Assign a priority to each node
- Criteria can be:
  - number of successors,
  - depth (length of the longest path from the inputs),
  - latency-weighted depth,
    - w: latency of the operation to be executed by the nodes on the path,
  - mobility,
  - connectivity,
  - ...

Slide 15: Constrained Scheduling

At any time step t:
- A ready set L is constructed (operations ready to be scheduled).
  - L: operations whose predecessors have already been scheduled early enough to complete their execution by time t.
- Tasks are placed in L in decreasing priority order.
- At a given step, the free resource is assigned the task with the highest priority.
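The ready-set procedure above can be sketched as a small resource-constrained list scheduler. This is an illustrative sketch under assumptions: the function name `list_schedule`, the data layout, the example graph, and the choice to jump from one finishing time to the next (rather than stepping clock by clock) are not from the slides. Priorities here are the number of successors, and all latencies are assumed to be positive.

```python
def list_schedule(ops, edges, latency, capacity, priority):
    """Resource-constrained list scheduling.

    ops: op name -> resource type; latency: type -> clocks (must be > 0);
    capacity: type -> number of units; priority: op name -> number.
    """
    preds = {v: [] for v in ops}
    for u, v in edges:
        preds[v].append(u)
    start, finish = {}, {}
    free_at = {k: [0] * n for k, n in capacity.items()}  # release time per unit
    t = 0
    while len(start) < len(ops):
        # Ready set: unscheduled ops whose predecessors have finished by time t.
        ready = [v for v in ops if v not in start
                 and all(p in finish and finish[p] <= t for p in preds[v])]
        ready.sort(key=lambda v: -priority[v])           # highest priority first
        for v in ready:
            units = free_at[ops[v]]
            for i, release in enumerate(units):
                if release <= t:                         # a unit of this type is free
                    start[v] = t
                    finish[v] = t + latency[ops[v]]
                    units[i] = finish[v]
                    break
        if len(start) < len(ops):
            t = min(f for f in finish.values() if f > t) # next finishing time
    return start

# Hypothetical example with 1 multiplier and 1 ALU.
ops = {"m1": "mul", "m2": "mul", "a3": "alu", "a1": "alu", "s1": "alu"}
edges = [("m1", "a1"), ("m2", "a1"), ("a1", "s1"), ("a3", "s1")]
latency = {"mul": 100, "alu": 50}
capacity = {"mul": 1, "alu": 1}
priority = {"m1": 1, "m2": 1, "a3": 1, "a1": 1, "s1": 0}  # number of successors
schedule = list_schedule(ops, edges, latency, capacity, priority)
print(schedule)
```

With a single multiplier, `m2` must wait until `m1` releases the unit, which delays the dependent addition and subtraction.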

Slide 16: Constrained Scheduling

- At a given step, the free resource is assigned the task with the highest priority.

[Flowchart: "Are there enough resources of type k to implement all the operations of type k?" If yes: assign resources to the operations. If no: assign resources to the high-priority operations.]

Slide 17: Constrained Scheduling (Example)

- Criterion: number of successors
- Resources: 1 multiplier, 1 ALU (+, -, <)

[Figure: example DFG; each node is annotated with its priority (number of successors): 3, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0]

Slide 18: Constrained Scheduling (Example)

[Figure: resulting schedule of the example DFG over time steps 0 to 7]

Slide 19: List Scheduling: Example

Resources: 1 multiplier, 1 adder

Latency:
- Multiplication: 100 clocks
- Add/sub: 50 clocks

[Figure: example DFG]

Slide 20: Force Directed List Scheduling

Slide 21: Temporal Partitioning vs. Constrained Scheduling

In RCS (reconfigurable computing systems):
- Resource types are not important.
  - The amount of basic resources is what matters.
- Operators do not compete for resources.
  - They compete for area.
- Usually only the starting time and the end time of the complete partition are considered.

Slide 22: Temporal Partitioning in RCS

Temporal partitioning:
- The same as list scheduling
- Assignment criterion: there must be enough space left on the device to accommodate the new component.

Algorithm: list-scheduling algorithm for reconfigurable devices

    sort the nodes of V according to their priorities
    P_0 := Ø
    while V ≠ Ø do
        select a vertex v ∈ V with highest priority whose predecessors are all placed
        if (a partition P_i exists with s(P_i) + s(v) ≤ s(H)) then
            P_i := P_i ∪ {v}
        else
            create a new partition P_{i+1} and set P_{i+1} := {v}
        end if
        V := V \ {v}
    end while
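A minimal Python sketch of this partitioning loop is given below. Several details are assumptions rather than part of the slide: the function name `ls_temporal_partition`, the alphabetical tie-break, the example graph, and the rule that a node is only tried in partitions no earlier than those of its predecessors (which keeps the precedence order intact). Component sizes follow the example values size(mult) = 100, size(add) = 20, size(comp) = 10 and size(FPGA) = 250, and every component is assumed to fit on the device by itself.

```python
def ls_temporal_partition(size, edges, priority, device_size):
    """List-scheduling-based temporal partitioning; returns a list of partitions."""
    preds = {v: [] for v in size}
    for u, v in edges:
        preds[v].append(u)
    partitions, used, placed = [], [], {}        # placed: node -> partition index
    remaining = set(size)
    while remaining:
        # Candidates: unplaced nodes whose predecessors are all placed.
        candidates = [v for v in remaining if all(p in placed for p in preds[v])]
        v = min(candidates, key=lambda n: (-priority[n], n))
        # Assumption: never place a node before any of its predecessors.
        lowest = max((placed[p] for p in preds[v]), default=0)
        for i in range(lowest, len(partitions)):
            if used[i] + size[v] <= device_size:  # enough space left in P_i
                break
        else:                                     # no partition fits: open a new one
            partitions.append([])
            used.append(0)
            i = len(partitions) - 1
        partitions[i].append(v)
        used[i] += size[v]
        placed[v] = i
        remaining.remove(v)
    return partitions

# Hypothetical example: three multipliers, one adder, one comparator.
size = {"m1": 100, "m2": 100, "m3": 100, "a1": 20, "c1": 10}
edges = [("m1", "a1"), ("m2", "a1"), ("a1", "c1"), ("m3", "c1")]
priority = {"m1": 1, "m2": 1, "m3": 1, "a1": 1, "c1": 0}  # number of successors
result = ls_temporal_partition(size, edges, priority, device_size=250)
print(result)
```

The third multiplier does not fit next to the first two, so it opens a second partition, and the comparator follows it there because it depends on it.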

Slide 23: Temporal Partitioning vs. Constrained Scheduling

- Criterion: number of successors
- size(FPGA) = 250
- size(mult) = 100
- size(add) = size(sub) = 20
- size(comp) = 10

[Figure: example DFG with node priorities, partitioned into P1, P2 and P3]

- Connectivity:
  - c(P1) = 1/6
  - c(P2) = 1/3
  - c(P3) = 2/6
- Quality: 0.28

Slide 24: Improvement

Best criterion:
- Total computation time of the DFG: $t_{DFG} = n \times C_H + \sum_{i=1}^{n} t_{P_i}$
  - $C_H$: reconfiguration time of device H
  - $t_{P_i}$: computation time of partition $P_i$
  - $n$: number of partitions

Optimization:
- If $C_H$ is too large, the optimization will tend to minimize the number of partitions.
- If $C_H \ll t_{P}$, the algorithm will tend to avoid long paths in partitions.
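As a worked illustration of the total-time formula, consider two alternative partitionings of the same DFG. All numbers are hypothetical and chosen only to show the trade-off: a reconfiguration time of $C_H = 10$ time units, and partition computation times of (2, 3, 1) for a three-partition solution versus (4, 3) for a two-partition solution.

```latex
\begin{align*}
t_{DFG} &= n \times C_H + \sum_{i=1}^{n} t_{P_i} \\
\text{three partitions: } t_{DFG} &= 3 \times 10 + (2 + 3 + 1) = 36 \\
\text{two partitions: } t_{DFG} &= 2 \times 10 + (4 + 3) = 27
\end{align*}
```

Although the two-partition solution spends more time computing (7 versus 6), it saves one reconfiguration and is faster overall, which is the behaviour expected when $C_H$ dominates.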

Slide 25: Improvement

Advantages of LS-based temporal partitioning:
- Fast (linear-time algorithm)
- Local optimization possible
  - e.g. configuration switching

[Figure: example DFG (operations +, -, *, /) levelized into Level 0 to Level 3]

Disadvantage:
- Levelization:
  - Modules are assigned to partitions based more on their level number than on their interconnectivity with other components.
- Interconnectivity (data exchange) must be optimized.

Slide 26: LS-Based Temporal Partitioning

- Criterion: number of successors
- size(FPGA) = 250
- size(mult) = 100
- size(add) = size(sub) = 20
- size(comp) = 10

[Figure: example DFG with node priorities, partitioned into P1, P2 and P3]

- Connectivity:
  - c(P1) = 1/6
  - c(P2) = 1/3
  - c(P3) = 2/6
- Quality: 0.28

Slide 27: Improved Temporal Partitioning

[Figure: the same example DFG with node priorities, with an improved partitioning into P1, P2 and P3]

- Connectivity:
  - c(P1) = 2/10
  - c(P2) = 2/3
  - c(P3) = 2/3
- Quality: 0.51
- The quality is better.
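The slides do not state how the quality figure is derived from the connectivity values. The reported numbers are, however, consistent with taking the arithmetic mean of the per-partition connectivities, which the following check illustrates. Treating quality as the mean is an assumption, and `mean_connectivity` is a hypothetical helper name.

```python
from fractions import Fraction

def mean_connectivity(values):
    """Arithmetic mean of the per-partition connectivity values."""
    return sum(values) / len(values)

# Connectivity values as listed for the LS-based and the improved partitioning.
ls_based = [Fraction(1, 6), Fraction(1, 3), Fraction(2, 6)]
improved = [Fraction(2, 10), Fraction(2, 3), Fraction(2, 3)]

print(round(float(mean_connectivity(ls_based)), 2))   # 0.28
print(round(float(mean_connectivity(improved)), 2))   # 0.51
```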

Slide 28: Improved List Scheduling

- Pairwise interchange

Slide 29: 2.2 Temporal partitioning – ILP

With ILP (integer linear programming):
- Temporal partitioning constraints are formulated as equations.
- The equations are then solved using an ILP solver.

The constraints usually considered are:
- Uniqueness constraint
- Temporal order constraint
- Memory constraint
- Resource constraint
- Latency constraint

Notations: [not preserved in this transcript]

Slide 30: 2.2 Temporal partitioning – ILP

Unique assignment constraint:
- Each task must be placed in exactly one partition (m = number of partitions).

Precedence constraint:
- For each edge e = (u, v) in the graph, u must be placed either in the same partition as v or in an earlier partition than the one in which v is placed.

Slide 31: Temporal partitioning – ILP

Resource constraint:
- The sum of the resources needed to implement the modules in one partition must not exceed the total amount of available resources.
  - Device area constraint: s
  - Device terminal constraint: T (size of the communication memory)
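The constraint equations themselves did not survive in this transcript. One common way to write the uniqueness, precedence and area constraints is sketched below. It assumes binary variables $y_{ui} \in \{0, 1\}$ with $y_{ui} = 1$ when task $u$ is placed in partition $P_i$ (matching the $y_{11} + y_{12} + y_{13} = 1$ indexing of the ILP example), $m$ partitions, component sizes $s(u)$ and device size $s(H)$. The exact form used on the original slides may differ, and the terminal (communication memory) constraint is not reconstructed here.

```latex
\begin{align}
\sum_{i=1}^{m} y_{ui} &= 1 && \forall u \in V \\
\sum_{i=1}^{m} i \, y_{ui} &\le \sum_{i=1}^{m} i \, y_{vi} && \forall (u, v) \in E \\
\sum_{u \in V} s(u) \, y_{ui} &\le s(H) && \forall i \in \{1, \dots, m\}
\end{align}
```

The first line places each task in exactly one partition, the second compares the partition indices of the two endpoints of every edge, and the third bounds the area used in each partition.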

Slide 32: Temporal partitioning by ILP: Example

Assignment constraint:
- y11 + y12 + y13 = 1
- y21 + y22 + y23 = 1
- ...
- y71 + y72 + y73 = 1

Partition P1:
- y22 = y23 = 0, y21 = 1
- y32 = y33 = 0, y31 = 1
- y42 = y43 = 0, y41 = 1

Partition P2:
- y11 = y13 = 0, y12 = 1
- y51 = y53 = 0, y52 = 1
- y61 = y63 = 0, y62 = 1

Partition P3:
- y71 = y72 = 0, y73 = 1

Slide 33: Temporal partitioning by ILP: Example

Precedence constraint:

[Equations not preserved in this transcript]

Slide 34: Temporal partitioning by ILP: Example

Resource constraint:
- Device with a size of 200 LUTs; 100 LUTs for the multiplication, and 50 LUTs each for the addition and the comparison.

[Equations for s(u) not preserved in this transcript]

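Slide 35: Temporal partitioning by ILP: Example

Communication memory constraint:
- Assume that a memory of 50 bytes (400 bits) is available for communication and that each datum is 32 bits wide, so at most 12 data items fit in the memory.

[Equations not preserved in this transcript]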

Slide 36: 2.3 Temporal partitioning – Network flow

Recursive bipartitioning:
- The goal at each step is the generation of a unidirectional bipartition.
- The goal at each step is to compute a bipartition which minimizes the edge-cut size between the two partitions.
- Network-flow methods are used to compute a bipartition with minimal edge-cut size.
- Directly applying the max-flow min-cut theorem may lead to a non-unidirectional cut.
- Therefore, the original graph G is first transformed into a new graph G' in which each cut is unidirectional.

[Figures: unidirectional recursive bipartitioning; a bidirectional cut]

Slide 37: 2.3 Temporal partitioning – Network flow

Two-terminal net transformation:
- Replace an edge (v_1, v_2) with two edges: (v_1, v_2) with capacity 1 and (v_2, v_1) with infinite capacity.

Multi-terminal net transformation:
- For a multi-terminal net {v_1, v_2, ..., v_n}:
  - Introduce a dummy node v with no weight and a bridging edge (v_1, v) with capacity 1.
  - Introduce the edges (v, v_2), ..., (v, v_n), each of which is assigned a capacity of 1.
  - Introduce the edges (v_2, v_1), ..., (v_n, v_1), each of which is assigned an infinite capacity.
- Having computed a min-cut in the transformed graph G', a min-cut can be derived in G: for each node of G' assigned to a partition, its counterpart in G is assigned to the corresponding partition in G.
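The effect of the two-terminal transformation can be illustrated with a short Python sketch. It only builds the capacities of G' and evaluates the capacity of a given bipartition; it does not compute a max-flow. The function names `transform_two_terminal` and `cut_capacity` and the diamond-shaped example graph are assumptions made for this illustration.

```python
import math

def transform_two_terminal(edges):
    """Capacities of G' for a DAG G given as a list of (u, v) edges."""
    cap = {}
    for u, v in edges:
        cap[(u, v)] = 1          # forward edge keeps unit capacity
        cap[(v, u)] = math.inf   # reverse edge makes backward cuts infinitely costly
    return cap

def cut_capacity(cap, source_side):
    """Total capacity of the edges leaving the set source_side."""
    return sum(c for (u, v), c in cap.items()
               if u in source_side and v not in source_side)

# Hypothetical DFG: a -> b, a -> c, b -> d, c -> d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
cap = transform_two_terminal(edges)

# Unidirectional cut: every cut edge of G goes from the first to the second part.
unidirectional = cut_capacity(cap, {"a", "b"})
# Bidirectional cut: b -> d and c -> d point back into the first part.
bidirectional = cut_capacity(cap, {"a", "d"})
print(unidirectional, bidirectional)
```

Because any cut that is crossed backwards by an edge of G picks up an infinite-capacity edge in G', a minimum cut of G' with finite capacity can only be unidirectional.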

Multi-Context FPGAs

Slide 39: Multi-Context FPGAs

Reconfiguration time:
- Can be high (compared to computation time)
- If reconfiguration occurs inside a loop, there are too many reconfigurations
  - High total computation time

Solutions:
- Multi-context
- Partial reconfiguration
- Pipeline reconfiguration [Trimberger97]

Slide 40: Multi-Context FPGA

Advantages:
- Switch between stored configurations quickly (some in a single clock cycle)
  - Dramatically reduces reconfiguration overhead if the next configuration is present in one of the alternate contexts
- Background loading of configuration data during circuit operation
  - Overlaps computation with reconfiguration

Slide 41: Multi-Context FPGAs

[Figure: from p. 99 of [Hauck08]]

Slide 42: Multi-Context FPGAs

Multi-context problems:
- Consumes valuable area which could be used for logic
- Either all needed contexts must fit in the available hardware,
- or some control must determine when contexts should be loaded from external memory
- Additional configuration data and the required multiplexing occupy valuable area
  - This could otherwise be used for logic or routing.
- Never been commercialized? [Bobda07]
  1. Eight-context DRFPGA fabricated by NEC [Fujii99]

Slide 43: Partial Reconfiguration

Partial reconfiguration:
- Only some part of the device is configured.
- Can decrease reconfiguration time.
  - Especially if only a small part needs to be changed
  - E.g. in a cryptography system, when the key is changed
- Can allow multiple independent configurations to be swapped in and out independently.

Slide 44: Partial Reconfiguration

Devices:
- Xilinx 6200 family (1997):
  - Each logic block could be programmed individually.
- Atmel AT40K (1999)
- Xilinx Virtex FPGA family:
  - Reconfigures logic blocks in groups called frames
  - Virtex II (2004): frame = a full column
  - Virtex 5 (2006): frame = a partial column (41 32-bit words)

Slide 45: Virtex Devices

Partial reconfiguration in Virtex:

Frames:
- Smallest unit of reconfiguration.

Frames in Xilinx devices:
- Virtex, Virtex II, Virtex II-Pro:
  - The whole column.
- Virtex 4, Virtex 5, Virtex 6:
  - Only a complete tile.
  - Differs between devices in width and height.

[Figure: Task 1 and Task 2 with a logical shared memory and CLBs [Banerjee07]]

Slide 46: Partial Reconfiguration

Problems:
- If configurations occupy large areas, the time spent transmitting configuration addresses may exceed the time saved in transmitting configuration data.
  - Serial loading is better in that case.
- If the full configuration sequence is not known at compile time, configurations may overlap.
  - Solution: defragmentation

Slide 47: Pipeline Reconfiguration

Pipeline reconfiguration:
- Uses a series of physical pipeline stages.
- The number of virtual stages is generally not constrained by the number of physical stages.
- PipeRench (2000)

[Figure: numbers in boxes denote the pipeline stage; shaded boxes denote reconfiguration in the given cycle]

Slide 48: Pipeline Reconfiguration

Problem:
- Data can only propagate forward through the pipeline stages.
  - Any feedback connections must be completely contained within a single stage.

Slide 49: References

- [Bobda07] C. Bobda, "Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications," Springer, 2007.
- [Hauck08] S. Hauck, A. DeHon, "Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation," Morgan-Kaufmann, 2007.
- [Fujii99] T. Fujii et al., "A dynamically reconfigurable logic engine with a multicontext/multi-mode unified-cell architecture," in Proc. IEEE Int. Solid-State Circuits Conf., 1999, pp. 364–365.
- [Mehdipour06] F. Mehdipour, M. Saheb Zamani, M. Sedighi, "An integrated temporal partitioning and physical design framework for static compilation of reconfigurable computing systems," Journal of Microprocessors and Microsystems, Elsevier, vol. 30, 2006, pp. 52–62.
