Scheduling for Synthesis of Embedded Hardware


1 Scheduling for Synthesis of Embedded Hardware
EE202A (Fall 2004): Lecture #11 Note: Several slides in this lecture are from Prof. Miodrag Potkonjak, UCLA Computer Science

2 Reading List for This Lecture
Recommended:
R. Walker and S. Choudhuri, "Introduction to the scheduling problem", IEEE Design & Test of Computers (special issue on high-level synthesis), vol. 12, iss. 2.
C-T. Hwang, J-H. Lee, and Y-C. Hsu, "A formal approach to the scheduling problem in high level synthesis", IEEE Transactions on CAD, vol. 10, iss. 4, April 1991.
P. Paulin and J. P. Knight, "Force directed scheduling for the behavioral synthesis of ASICs", IEEE Transactions on CAD, vol. 8, iss. 6, June 1989.

3 HW-SW Design Flow
[Design flow: System Specification (C/C++, behavioral C/VHDL/Verilog) -> High-Level Synthesis -> Structural RTL; software path -> Assembly Code] [Gupta, UCSD]

4 High-Level Synthesis (HLS)
Converts a behavioral specification into a structural register-transfer level (RTL) description. Example: GCD computation. [Ghosh, ICCAD '96]

5 Typical HLS System

6 High Level Synthesis
Resource allocation - how much?
Scheduling - when?
Assignment - where?
Module selection
Template matching & operation chaining
Clock selection
Partitioning
Transformations

7 Algorithm Description

8 CDFG

9 Precedence Graph

10 Sequence Graph: Start & End Nodes

11 Hierarchy in Sequence Graphs

12 Hierarchy in Sequence Graphs (contd.)

13 Hierarchy in Sequence Graphs (contd.)

14 Implementation

15 Timing Constraints
Time measured in "cycles" or "control steps" - problem?
Max & min timing constraints

16 Constraint Graphs

17 Operations with Unknown Delays
Unknown but bounded delay: e.g., conditionals, loops
Unknown and unbounded delay: I/O operations, synchronization; completion is indicated by a completion signal
Such operations are called "anchor nodes"; we need to schedule relative to these anchors

18 Scheduling Under Timing Constraints
Feasible constraint graph: timing constraints are satisfied when the execution delays of all the anchors are zero. Necessary for existence of a schedule.
Well-posed constraint graph: timing constraints are satisfied for all values of the execution delays. Implies feasibility.
A feasible constraint graph is well-posed, or can be made well-posed, iff no cycles with unbounded weight exist.

19 Ill-posed (a, b) vs. Well-posed (c) Timing Constraints

20

21 Allocation, Assignment, and Scheduling
Techniques well understood and mature

22 Example: Scheduling, Allocation, and Assignment
Control Step

23 Variants of HLS scheduling
Unconstrained Scheduling (UCS): unlimited HW resources, no latency constraints
Time Constrained Scheduling (TCS): given an upper bound on schedule length, minimize total resource cost
Resource Constrained Scheduling (RCS): given the maximum number of each resource type, minimize the schedule length
Time & Resource Constrained Scheduling (TRCS)

24 ASAP Scheduling Algorithm
(Solves the Unconstrained Scheduling Problem)

25 ASAP Scheduling Example

26 ASAP Scheduling Example
Dummy start node scheduled at time = 0. [Sequence graph and the resulting ASAP schedule]
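The ASAP rule itself is simple: start each operation as soon as all of its predecessors have finished. A minimal Python sketch of this idea (the graph, the unit delays, and the convention that the first control step is 1 are illustrative; the dummy source and sink nodes are omitted):

```python
from graphlib import TopologicalSorter

def asap_schedule(preds, delay):
    """ASAP: start each op as soon as all of its predecessors complete.
    preds: dict op -> list of predecessor ops; delay: dict op -> cycles."""
    start = {}
    for op in TopologicalSorter(preds).static_order():
        # earliest start = latest finish time among predecessors (step 1 if none)
        start[op] = max((start[p] + delay[p] for p in preds[op]), default=1)
    return start

# Illustrative 4-operation graph, not the one drawn on the slide
preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
delay = {op: 1 for op in preds}
print(asap_schedule(preds, delay))   # {'a': 1, 'b': 1, 'c': 2, 'd': 3}
```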

27 ALAP Scheduling Algorithm
(Solves the Unconstrained Scheduling Problem, but a latency constraint is required for the algorithm to make sense)

28 ALAP Scheduling Example

29 ALAP Scheduling Example
Dummy sink node scheduled at (latency constraint + 1). [Sequence graph and the ALAP schedule for a latency constraint of 4]

30 Observation about ALAP & ASAP
The start time of an operation given by ASAP is the earliest possible through any scheduling algorithm. For a given latency constraint, the start time given by ALAP is the latest possible through any scheduling algorithm. (ALAP start time - ASAP start time) denotes the mobility of the operation (see the sketch below).
Neither algorithm gives priority to nodes on the critical path, so unimportant nodes may be scheduled ahead of critical ones. This is no problem with unlimited hardware, but with limited resources, less critical nodes may block critical ones and yield poor schedules. List scheduling techniques overcome this problem by utilizing a more global node selection criterion.
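Continuing the ASAP sketch above, a matching ALAP pass and the mobility it implies might look as follows (same illustrative graph and unit delays; the latency bound is a parameter):

```python
def alap_schedule(preds, delay, latency):
    """ALAP: start each op as late as the latency bound and its successors allow."""
    succs = {op: [] for op in preds}             # invert the predecessor lists
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    start = {}
    # reverse topological order: successors are scheduled before their predecessors
    for op in reversed(list(TopologicalSorter(preds).static_order())):
        start[op] = min((start[s] - delay[op] for s in succs[op]),
                        default=latency - delay[op] + 1)
    return start

asap = asap_schedule(preds, delay)
alap = alap_schedule(preds, delay, latency=4)
mobility = {op: alap[op] - asap[op] for op in preds}   # slack of each operation
```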

31 List Scheduling Illustration
Candidate list in each control step; the operation(s) in shaded boxes are the ones selected for scheduling in the current control step (CS). Resource constraint: 1 ADD, 2 MUL. [Candidate lists for CS 1-4] The order of selecting candidate operations affects schedule quality.

32 List Scheduling Algorithm
Commonly used selection criteria: Nodes with least mobility picked first, Nodes with maximum number of successors picked first
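A minimal resource-constrained list-scheduling sketch using the first criterion (least mobility first); it reuses preds and mobility from the ASAP/ALAP sketches above, and the unit-delay assumption and resource table are illustrative:

```python
def list_schedule(preds, op_type, avail, priority):
    """Greedy list scheduling with unit-delay operations: in each control
    step, fill the free resources with ready ops in priority order."""
    start, step = {}, 1
    while len(start) < len(preds):
        # ready = unscheduled ops whose predecessors finished in earlier steps
        ready = [op for op in preds if op not in start
                 and all(start.get(p, step) < step for p in preds[op])]
        used = {}
        for op in sorted(ready, key=priority):        # most urgent first
            if used.get(op_type[op], 0) < avail[op_type[op]]:
                start[op] = step
                used[op_type[op]] = used.get(op_type[op], 0) + 1
        step += 1
    return start

# Illustrative run: every op needs the single available ALU
sched = list_schedule(preds, {op: "alu" for op in preds}, {"alu": 1},
                      priority=lambda op: mobility[op])
```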

33 Taxonomy of scheduling algorithms
NP-complete problem
Optimal techniques: Integer Linear Programming (ILP)
Heuristics, iterative-improvement based: e.g., simulated annealing
Heuristics, constructive: e.g., force directed scheduling, list scheduling
If all resources are identical, the problem reduces to multiprocessor scheduling; minimum-latency multiprocessor scheduling is also NP-complete

34 Scheduling - Optimal Techniques
Integer Linear Programming Branch and Bound

35 Integer Linear Programming
Given: an integer-valued m x n matrix A and vectors B = (b1, b2, …, bm), C = (c1, c2, …, cn)
Minimize: CᵀX
Subject to: AX ≥ B, where X = (x1, x2, …, xn) is an integer-valued vector

36 ILP based scheduling
RCS version: for a set of (dependent) operations {V0, V1, ..., Vn}, given an upper bound ak on the # of available resources of type k, where k ∈ {1, …, nres}, and the latency di of each operation Vi, find a schedule of minimum length that satisfies all resource and precedence constraints. V0 denotes the dummy start node, Vn the dummy sink node.
Step 1: Run a heuristic, e.g., list scheduling, to obtain a possibly sub-optimal but achievable schedule length. Say it is λ.
Step 2: Run ASAP & ALAP without resource constraints to get the earliest & latest start times for each operation (used for pruning).

37 Integer Linear Programming
For each computation dependency (ti has to be done before tj), introduce the constraint:
k·x1i + (k-1)·x2i + ... + xki ≥ k·x1j + (k-1)·x2j + ... + xkj + 1    (*)
Minimize: y0
Subject to: x1i + x2i + ... + xki = 1 for all 1 ≤ i ≤ n
yj ≤ y0 for all 1 ≤ j ≤ k
all computation dependency constraints of type (*)

38 An Example
6 computations (c1, …, c6), 3 control steps [dependency graph]

39 An Example
Introduce variables:
xij for 1 ≤ i ≤ 3, 1 ≤ j ≤ 6 (xij = 1 if computation cj executes in control step i)
yi = xi1 + xi2 + xi3 + xi4 + xi5 + xi6 for 1 ≤ i ≤ 3, and y0
Dependency constraints: e.g., to execute c1 before c4: 3·x11 + 2·x21 + x31 ≥ 3·x14 + 2·x24 + x34 + 1
Execution constraints: x1i + x2i + x3i = 1 for 1 ≤ i ≤ 6

40 An Example
Minimize: y0
Subject to: yi ≤ y0 for all 1 ≤ i ≤ 3, plus the dependency constraints and the execution constraints
One solution: y0 = 2, with x11 = 1, x12 = 1, x23 = 1, x24 = 1, x35 = 1, x36 = 1, and all other xij = 0
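This small model maps almost directly onto an off-the-shelf ILP solver. The sketch below uses the PuLP package (an external dependency, installed with `pip install pulp`); only the c1-before-c4 dependency quoted on the slide is encoded, since the full dependency set of the example is not listed here, so the model is illustrative rather than a reproduction of the slide's exact problem.

```python
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, value

steps, ops = range(1, 4), range(1, 7)        # 3 control steps, computations c1..c6
x = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary) for i in steps for j in ops}
y0 = LpVariable("y0", lowBound=0)

prob = LpProblem("time_constrained_scheduling", LpMinimize)
prob += y0                                   # objective: minimize the per-step bound
for j in ops:                                # each computation runs in exactly one step
    prob += lpSum(x[i, j] for i in steps) == 1
for i in steps:                              # yi = # of ops in step i, bounded by y0
    prob += lpSum(x[i, j] for j in ops) <= y0
# dependency from the slide: c1 must execute before c4
prob += 3 * x[1, 1] + 2 * x[2, 1] + x[3, 1] >= 3 * x[1, 4] + 2 * x[2, 4] + x[3, 4] + 1

prob.solve()
print(value(y0))                             # 2.0: six ops never fit in fewer per step
```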

41 ILP Model of Scheduling
For each operation Vi and each control step j (1 ≤ j ≤ λ), define the variable xij:
xij = 1 if operation Vi starts execution in control step j, xij = 0 otherwise
Constraint 1: Start time is unique (i.e., each operation must be scheduled exactly once): Σj xij = 1 for every i
The start time and end time of operation Vi are then given by Σj j·xij and Σj j·xij + di, respectively

42 ILP Model of Scheduling (contd.)
Constraint 2: Sequencing relationships must be satisfied: for each dependency edge (Vj, Vi), Σl l·xil ≥ Σl l·xjl + dj
Constraint 3: Resource bounds must be met: since the upper bound on the # of resources of type k is ak, Σ{i : type(Vi) = k} Σ{m = l-di+1 … l} xim ≤ ak for every resource type k and every control step l

43 Minimum-latency Scheduling Under Resource-constraints
Let t be the vector whose entries are the start times, and let c = [0, 0, …, 1]ᵀ. The formal ILP model is given by:
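In standard notation (a reconstruction following the usual formulation, with ti = Σl l·xil, operation delays di, ak units of resource type k, and E the edge set of the sequence graph):

$$
\begin{aligned}
\min\;& c^{T} t \\
\text{s.t.}\;& \sum_{l} x_{il} = 1 \quad \forall i, \qquad t_i = \sum_{l} l \cdot x_{il} \\
& t_i \ge t_j + d_j \quad \forall (v_j, v_i) \in E \\
& \sum_{i:\,\mathrm{type}(v_i) = k}\;\sum_{m = l - d_i + 1}^{l} x_{im} \le a_k \quad \forall k,\ \forall l
\end{aligned}
$$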

44 Example
Two types of resources, both with 1 cycle execution time:
Multiplier (2 available)
ALU (2 available): addition, subtraction, comparison

45 Example (contd.) Heuristic (list scheduling) gives latency = 4 steps
Use ASAP and ALAP (with no resource constraints) to get bounds on the start times. The ASAP latency matches that of the heuristic, so the heuristic is optimum, but let us ignore that and write down the constraints.

46 Example (contd.) Start time is unique

47 Example (contd.) Sequencing constraints
Note: only the non-trivial ones are listed, i.e., those where at least one operation has more than one possible start time.

48 Example (contd.) Resource constraints

49 Example (contd.)
Consider c = [0, 0, …, 1]ᵀ: minimum-latency schedule. Since the sink has no mobility (xn,5 = 1), any feasible schedule is optimum.
Consider c = [1, 1, …, 1]ᵀ: finds the earliest start times for all operations; equivalently, it minimizes Σi Σj j·xij.

50 Example: Optimum Schedule Under Resource Constraints

51 Example (contd.)
Extension to the TCS version: assume a multiplier costs 5 units of area and an ALU costs 1 unit of area. λ now becomes the (given) schedule length bound. Same uniqueness and sequencing constraints as before; the resource constraints are now in terms of the unknown variables a1 and a2, where a1 = # of multipliers and a2 = # of ALUs.

52 Example (contd.) Resource constraints

53 Example Solution
Minimize cᵀa = 5·a1 + 1·a2. Solution with cost 12.

54 Extensions ILP formulation can be extended to consider:
Operation chaining Functional pipelining and other transformations See recommended reading paper for details

55 Precedence-constrained Multiprocessor Scheduling
All operations are done by the same type of resource. NP-complete even if all operations have unit delay.

56 Scheduling - Iterative Improvement
Kernighan-Lin (deterministic)
Simulated Annealing
Lottery Iterative Improvement
Neural Networks
Genetic Algorithms
Tabu Search

57 Scheduling - Constructive Techniques
Most Constrained Least Constraining

58 Force Directed Scheduling
Goal is to reduce hardware by balancing concurrency. Iterative algorithm: one operation scheduled per iteration. Information (i.e., speed & area) is fed back into the scheduler.

59 The Force Directed Scheduling Algorithm

60 Step 1: Determine ASAP and ALAP schedules [data-flow graph with multiply, subtract, add, and compare operations]

61 Step 2: Determine the Time Frame of each op
Length of box ~ possible execution cycles; width of box ~ probability of assignment. Uniform distribution, area assigned = 1. [Time frames of the ops over C-steps 1-4, with probabilities such as 1/2 and 1/3]

62 Step 3: Create Distribution Graphs
Sum of the probabilities of each op type; indicates the concurrency of similar ops: DG(i) = Σ_Op Prob(Op, i). [DG for multiply; DG for add, sub, comp]

63 Diff Eq Example: Precedence Graph Recalled

64 Diff Eq Example: Time Frame & Probability Calculation

65 Diff Eq Example: DG Calculation

66 Conditional Statements
Operations in different branches are mutually exclusive, so operations of the same type can be overlapped onto the DG; the probability of the most likely operation is added to the DG. [DG for add across a fork/join]

67 Self Forces
Scheduling an operation will affect the overall concurrency. Every operation has a 'self force' for every C-step of its time frame, analogous to the effect of a spring: f = Kx. A desirable scheduling will have negative self force, i.e., it achieves better concurrency (lower potential energy).
Force(i) = DG(i) * x(i), where DG(i) is the current distribution graph value and x(i) is the change in the operation's probability in C-step i.
Self Force(j) = Σi Force(i), summed over the C-steps i in the operation's time frame.

68 Example Attempt to schedule multiply in C-step 1
Self Force(1) = Force(1) + Force(2) = (DG(1) * x(1)) + (DG(2) * x(2)) = [2.833 * (+0.5)] + [2.333 * (-0.5)] = +0.25
This is positive, so scheduling the multiply in the first C-step would be bad. [Time frames and the DG for multiply over C-steps 1-4]
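A small Python sketch of these two quantities, assuming uniform probabilities over each operation's time frame (the frame and DG values in the demo are taken from the example above; everything else is illustrative):

```python
def distribution_graph(frames, op_type, steps):
    """DG(type, step) = sum over ops of the probability that the op executes
    in that step; probability is uniform, 1 / (width of the op's time frame)."""
    dg = {(t, s): 0.0 for t in set(op_type.values()) for s in steps}
    for op, (lo, hi) in frames.items():
        for s in range(lo, hi + 1):
            dg[op_type[op], s] += 1.0 / (hi - lo + 1)
    return dg

def self_force(op, step, frames, op_type, dg):
    """Force of fixing `op` into `step`: sum over its frame of DG * change
    in probability (probability becomes 1 in `step` and 0 elsewhere)."""
    lo, hi = frames[op]
    p = 1.0 / (hi - lo + 1)
    return sum(dg[op_type[op], s] * ((1.0 if s == step else 0.0) - p)
               for s in range(lo, hi + 1))

# Reproducing the example: a multiply with time frame {1, 2} and DG values
# DG(mult, 1) = 2.833, DG(mult, 2) = 2.333 as on the slide
frames, op_type = {"m": (1, 2)}, {"m": "mult"}
dg = {("mult", 1): 2.833, ("mult", 2): 2.333}
print(self_force("m", 1, frames, op_type, dg))   # +0.25
```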

69 Diff Eq. Example: Self Force for Node 4

70 Predecessor & Successor Forces
Scheduling an operation may affect the time frames of other linked operations, which may negate the benefits of the desired assignment. Predecessor/successor forces = sum of the self forces of any implicitly scheduled operations.

71 Diff Eq Example: Successor Force on Node 4
If node 4 is scheduled in step 1, there is no effect on the time frame of successor node 8; total force = Force4(1) = +0.25.
If node 4 is scheduled in step 2, it forces node 8 into step 3, so the successor force must be calculated.

72 Diff Eq Example: Final Time Frame and Schedule

73 Diff Eq Example: Final DG

74 Lookahead
Temporarily modify the constant DG(i) to include the effect of the iteration being considered: Force(i) = temp_DG(i) * x(i), where temp_DG(i) = DG(i) + x(i)/3.
Consider the previous example: Self Force(1) = (DG(1) + x(1)/3)·x(1) + (DG(2) + x(2)/3)·x(2) = 0.5·(2.833 + 0.5/3) - 0.5·(2.333 - 0.5/3) = +0.417. This is even worse than before.

75 Minimization of Bus Costs
Basic algorithm is suitable for a narrow class of problems; it can be refined to consider "cost" factors.
Number of buses ~ number of concurrent data transfers; number of buses = maximum transfers in any C-step.
Create a modified DG that includes transfers (transfer DG): Trans DG(i) = Σop [Prob(op, i) * Opn_No_InOuts], where Opn_No_InOuts ~ combined distinct in/outputs of the op.
Calculate the force with this DG and add it to the self force.

76 Minimization of Register Costs
The minimum number of registers required is given by the largest number of data arcs crossing a C-step boundary.
Create storage operations at the output of any operation that transfers a value to a destination in a later C-step, and generate a storage DG for these "operations". The length of a storage operation depends on the final schedule. [Storage distribution for a value S, with ASAP, ALAP, and MAX lifetimes]

77 Minimization of Register Costs (contd.)
[avg life] and storage DG(i) are computed from the ASAP, ALAP, and maximum lifetimes, with one formula when the ASAP & ALAP lifetimes do not overlap and another when they do. Calculate the "storage" force and add it to the self force.
Example: ASAP needs a minimum of 7 registers; force-directed scheduling needs a minimum of 5.

78 Pipelining
Functional pipelining: pipelining across multiple operations. Must balance the distribution across groups of concurrent C-steps: cut the DG horizontally, superimpose the pieces, then perform regular force directed scheduling. [DG for multiply with two overlapped instances]
Structural pipelining: pipelining within an operation. For non data-dependent operations, only the first C-step need be considered. [Pipelined multiplier occupying successive C-steps]

79 Other Optimizations
Local timing constraints: insert dummy timing operations -> restricted time frames
Multiclass FUs: create a multiclass DG by summing the probabilities of the relevant ops
Multistep/chained operations: carry propagation delay information with the operation; extend time frames into other C-steps as required
Hardware constraints: use force as the priority function in list scheduling algorithms

80 Scheduling using Simulated Annealing
Reference: S. Devadas and A. R. Newton, "Algorithms for hardware allocation in data path synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, no. 7, July 1989.

81 Simulated Annealing
Local search: solution space, cost function, ?

82 Statistical Mechanics Combinatorial Optimization
State {ri} (configuration: a set of atomic positions) has weight e^(-E({ri})/(kB·T)), the Boltzmann distribution. E({ri}): energy of the configuration; kB: Boltzmann constant; T: temperature. Low temperature limit ??

83 Analogy
Physical System <-> Optimization Problem
State (configuration) <-> Solution
Energy <-> Cost Function
Ground State <-> Optimal Solution
Rapid Quenching <-> Iterative Improvement
Careful Annealing <-> Simulated Annealing

84 Generic Simulated Annealing Algorithm
1. Get an initial solution S
2. Get an initial temperature T > 0
3. While not yet 'frozen', do the following:
   3.1 For 1 ≤ i ≤ L, do the following:
       3.1.1 Pick a random neighbor S' of S
       3.1.2 Let Δ = cost(S') - cost(S)
       3.1.3 If Δ ≤ 0 (downhill move), set S = S'
       3.1.4 If Δ > 0 (uphill move), set S = S' with probability e^(-Δ/T)
   3.2 Set T = rT (reduce temperature)
4. Return S
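The generic loop translates directly to code; a minimal Python sketch (the cost and neighbor functions are placeholders to be supplied by the scheduling problem, and the cooling parameters are illustrative):

```python
import math, random

def simulated_annealing(initial, cost, neighbor, temp=100.0, cooling=0.9,
                        frozen_temp=0.01, moves_per_temp=50):
    """Generic SA: always accept downhill moves, accept uphill moves with
    probability exp(-delta / T), and cool by a constant factor until frozen."""
    s, t = initial, temp
    while t > frozen_temp:                        # 'frozen' criterion
        for _ in range(moves_per_temp):           # L moves at each temperature
            s2 = neighbor(s)                      # random neighbor S' of S
            delta = cost(s2) - cost(s)
            if delta <= 0 or random.random() < math.exp(-delta / t):
                s = s2                            # accept the move
        t *= cooling                              # T = rT
    return s
```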

85 Basic Ingredients for S.A.
Solution Space Neighborhood Structure Cost Function Annealing Schedule

86 Observation
All scheduling algorithms we have discussed so far are critical path schedulers: they can only generate schedules whose iteration period is larger than or equal to the critical path, and they only exploit concurrency within a single iteration, utilizing only the intra-iteration precedence constraints.

87 Example
Can one do better than an iteration period of 4? Pipelining + retiming can reduce the critical path to 3, and also the # of functional units.
Approaches: transformations followed by scheduling, or transformations integrated with scheduling.

88 Conclusions
High-level synthesis connects a behavioral description with a structural description; scheduling is a key step, and estimation and transformations are others. High level of abstraction, high impact on the final design.

