Ajay K. Verma, Philip Brisk and Paolo Ienne Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA) Ecole Polytechnique Fédérale.

Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions. Ajay K. Verma, Philip Brisk and Paolo Ienne, Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA), Ecole Polytechnique Fédérale de Lausanne (EPFL)

2 Custom ISE Identification [Figure: processor datapath with register file, ALU, MUL, LD/ST unit, data memory and an AFU computing out1 = F(in1, in2, in3, in4) and out2 = G(in1, in2, in3, in4)] Limited number of I/O ports

3 Outline: Problem formulation; ISE selection; I/O serialisation; Related work; Non-optimality of earlier work; Integer Linear Programming (ILP) formulation; Results; Conclusions

4 Problem Formulation. Given: a dataflow graph and a set of forbidden nodes. Find: a subgraph S that is convex, is free of forbidden nodes, and has the largest gain M(S) = N_exec * (SW(S) - HW(S)). [Figure: example dataflow graph with nodes a to h and inputs x1, x2, x3]

5 Convex Subgraph [Figure: dataflow graph with nodes a, b, c, d] To execute the AFU, we need the output of node b. Computing node b requires an output of the AFU. A non-convex AFU cannot be scheduled without creating a deadlock.
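The convexity requirement can be checked directly on a graph: a selection is non-convex exactly when some path leaves it and later re-enters it. The following is a minimal sketch, not the authors' method. The function name `is_convex`, the edge-list representation and the four-node example graph are assumptions made for illustration, since the topology of the slide's figure is not recoverable from the transcript.

```python
from collections import defaultdict

def is_convex(edges, selected):
    """Return True if no path leaves `selected` and later re-enters it.

    `edges` is a list of (producer, consumer) pairs of a DAG.
    """
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    selected = set(selected)
    # Start from every outside node fed directly by a selected node.
    stack = [v for u in selected for v in succ[u] if v not in selected]
    seen = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for nxt in succ[node]:
            if nxt in selected:
                return False  # path: selected -> outside -> selected
            stack.append(nxt)
    return True

# Hypothetical graph: a -> b -> d and a -> c -> d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(is_convex(edges, {"a", "c", "d"}))  # b lies on a path from a to d
print(is_convex(edges, {"a", "b", "c"}))
```

In the first call, node b is outside the selection but sits between two selected nodes, which mirrors the deadlock described on the slide.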

6 I/O Serialisation [Figure: example ISE with nodes b, c, d, e, f] 2 inputs, 4 outputs. Available I/O ports: (1, 2).

7 ISE Merit Estimation: M(S) = N_exec * (SW(S) - HW(S)) [Figure: dataflow graph with nodes a to h and inputs x1, x2, x3, next to the example ISE with nodes b, c, d, e, f]

8 Related Work. ISE identification under I/O constraints: search space pruning using I/O and convexity constraints [Atasu03, Clark03, Yu04, Pozzi06, Yu07, Chen07]; ILP-based approach [Atasu05]; pseudo-polynomial time algorithm [Bonzini07]. ISE identification under relaxed I/O constraints: restricted search space exploration [Pozzi05]; generation of a semi-compact set of connected ISEs [Pothineni07]. I/O serialisation: exponential time algorithms [Pozzi05, Pothineni07]. Algorithms for specific processor models: single-issue RISC processor model [Verma07].

9 Earlier Work. ISE Selection [Atasu03, Yu07, Chen07, Bonzini07]: optimal ISE selection under various I/O constraints. I/O Serialisation [Pozzi05, Pothineni07]: exponential time I/O serialisation algorithm.

10 Non-Optimality of Earlier Work [Figure: example graph with node delays 0.5, 0.6, 0.5, 0.6, 0.5, 0.6, 0.3, 0.2, shown twice to compare the cycles saved]

11 Our Contributions. Optimal ILP formulation for a large class of processor models: earlier work considers the RISC processor model only. Single run: in earlier work, ISE selection was done separately for various I/O constraints. ISE selection and I/O scheduling together: treating them separately is another source of non-optimality in earlier work.

12 Integer Linear Programming: objective function; linear constraints
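For readers unfamiliar with the term, the general shape of an integer linear program can be written as follows. The symbols z, c, A and b are generic textbook notation and do not come from the slides; in this talk the decision vector would collect the per-node and per-edge variables introduced on the later slides.

```latex
\begin{aligned}
\text{maximise}\quad   & c^{\mathsf{T}} z \\
\text{subject to}\quad & A z \le b, \\
                       & z \in \mathbb{Z}^{d}
\end{aligned}
```

Boolean variables are the special case where each component of z is additionally bounded between 0 and 1.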

13 ILP Formulation. Linear constraints: no forbidden nodes; convexity constraints; I/O serialisation based constraints; I/O access per cycle based constraints. Objective function: the saving in cycles should be maximum.

14 ISE Selection Constraints (1 of 2). Variable: for each node n_i, a Boolean variable x_i; x_i is true iff node n_i is in the selected ISE. Constraint: no forbidden node should be in the ISE; if n_i is a forbidden node, then x_i = 0. Variable: for each node n_i, two Boolean variables p_i and s_i; p_i (s_i) is true iff at least one predecessor (successor) of n_i is in the selected ISE. Constraint: the subgraph corresponding to the selected ISE must be convex; if p_i and s_i are both true, then x_i must be true (i.e., p_i + s_i - x_i ≤ 1).

15 ISE Selection Constraints (2 of 2). Relationship between p_i, s_i and x_i: p_i = 0 if n_i has no children; otherwise p_i = ∨_j (x_j ∨ p_j), where the n_j are the children of n_i. s_i = 0 if n_i has no parents; otherwise s_i = ∨_j (x_j ∨ s_j), where the n_j are the parents of n_i.
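The selection variables can be evaluated by hand on a small graph to see how the inequality p_i + s_i - x_i ≤ 1 rejects a non-convex choice. The sketch below is illustrative only: it evaluates the recurrences for a fixed x rather than solving an ILP. It assumes that edges run from producer to consumer and that a node's operands are its predecessors. The function names and the four-node graph are hypothetical.

```python
def selection_flags(nodes, edges, x):
    """Evaluate p_i and s_i for a fixed selection x.

    `nodes` must be listed in topological order (producers first);
    `x` maps each node to 0 or 1.
    """
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    p, s = {}, {}
    for n in nodes:  # producers before consumers
        p[n] = int(any(x[j] or p[j] for j in preds[n]))
    for n in reversed(nodes):  # consumers before producers
        s[n] = int(any(x[j] or s[j] for j in succs[n]))
    return p, s

def satisfies_convexity(nodes, edges, x):
    """Check p_i + s_i - x_i <= 1 for every node."""
    p, s = selection_flags(nodes, edges, x)
    return all(p[n] + s[n] - x[n] <= 1 for n in nodes)

# Hypothetical graph: a -> b -> d and a -> c -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(satisfies_convexity(nodes, edges, {"a": 1, "b": 0, "c": 1, "d": 1}))
print(satisfies_convexity(nodes, edges, {"a": 1, "b": 1, "c": 1, "d": 0}))
```

In the first selection, node b has both a selected predecessor and a selected successor while x_b = 0, so the left-hand side equals 2 and the constraint is violated.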

16 I/O Serialisation Based Constraints (1 of 3) [Figure: pipelined example graph with nodes n_1 to n_5] Variable: an integer variable intDelay_i denotes the cycle in which node n_i is executed, e.g., intDelay_1 = 0, intDelay_4 = 1, intDelay_5 = 2. Variable: a real variable fractionalDelay_i denotes the smallest time after the start of cycle intDelay_i at which the output of n_i is available, e.g., fractionalDelay_3 = HW(n_3), fractionalDelay_4 = HW(n_3) + HW(n_4). Variable: an integer variable ρ_ij denotes the number of stages across the edge between nodes n_i and n_j, e.g., ρ_13 = 1, ρ_34 = 0, ρ_25 = 2.

17 I/O Serialisation Based Constraints (2 of 3). Constraint: the difference between the cycles of a predecessor and a successor node is the same as the number of latches on the edge connecting them, e.g., intDelay_4 = intDelay_3 + ρ_34, intDelay_5 = intDelay_2 + ρ_25. Constraint: the total number of stages R is the same as the last cycle in which an output node is computed, e.g., R = intDelay_5 + ρ_57, R = intDelay_2 + ρ_26. [Figure: example graph with nodes n_1 to n_7] Extra latches on output edges are created in order to realize an imaginary sink node.

18 I/O Serialisation Based Constraints (3 of 3). Constraint: the fractionalDelay of a node depends on the fractionalDelay of its predecessor nodes, e.g., Case 1: if the node is the first node in its cycle, fractionalDelay_3 = HW(n_3); Case 2: if the node is not the first node in its cycle, fractionalDelay_4 = fractionalDelay_3 + HW(n_4). Constraint: the fractionalDelay of a node should never exceed the cycle time λ, e.g., fractionalDelay_3 ≤ λ, fractionalDelay_4 ≤ λ. [Figure: example graph with nodes n_1 to n_7]
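To make the roles of intDelay, fractionalDelay and ρ concrete, the sketch below assigns them with a simple greedy as-soon-as-possible rule. This is not the ILP from the talk and it ignores I/O port limits; it only produces values that satisfy the delay constraints stated above. The function name `asap_pipeline`, the three-node chain and its delay values are assumptions for illustration, and each node's delay is assumed not to exceed the cycle time.

```python
def asap_pipeline(nodes, edges, hw, cycle_time=1.0):
    """Greedy assignment of intDelay, fractionalDelay and rho.

    `nodes` must be in topological order; `hw` maps node -> delay.
    """
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    int_delay, frac_delay = {}, {}
    for n in nodes:
        # Earliest cycle in which all operands exist.
        cycle = max((int_delay[p] for p in preds[n]), default=0)
        # Operands from earlier cycles are latched, so only operands
        # computed in this same cycle add combinational delay.
        start = max(
            (frac_delay[p] for p in preds[n] if int_delay[p] == cycle),
            default=0.0,
        )
        if start + hw[n] > cycle_time:
            # Does not fit: the node becomes the first in the next cycle.
            cycle, start = cycle + 1, 0.0
        int_delay[n] = cycle
        frac_delay[n] = start + hw[n]
    # rho_ij is the number of latches on edge (i, j).
    rho = {(u, v): int_delay[v] - int_delay[u] for u, v in edges}
    return int_delay, frac_delay, rho

# Hypothetical chain n1 -> n2 -> n3 with a cycle time of 1.0.
nodes = ["n1", "n2", "n3"]
edges = [("n1", "n2"), ("n2", "n3")]
hw = {"n1": 0.5, "n2": 0.75, "n3": 0.25}
int_delay, frac_delay, rho = asap_pipeline(nodes, edges, hw)
print(int_delay, frac_delay, rho)
```

Here n2 does not fit after n1 in cycle 0, so it starts cycle 1 (Case 1), while n3 chains behind n2 in the same cycle (Case 2).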

19 I/O Access Per Cycle Based Constraints. Variable: Boolean variables c_ik^IN and c_ik^OUT; c_ik^IN is true iff n_i is an input of the ISE and is accessed in the k-th stage of execution (similarly for c_ik^OUT). Constraint: in each stage, no more than m inputs should be accessed and no more than n outputs should be written back, i.e., for each k, ∑_i c_ik^IN ≤ m and ∑_i c_ik^OUT ≤ n. c_ik^IN and c_ik^OUT can be computed using the intDelay and fractionalDelay of nodes and the ρ values of the incoming and outgoing edges of the AFU.
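The per-stage port limits amount to counting how many inputs are read and how many outputs are written in each stage. The sketch below checks a given assignment rather than computing one. The function name, the dictionaries mapping each operand to its stage, and the port counts m = 1 and n = 2 are assumptions for illustration.

```python
from collections import Counter

def respects_port_limits(input_stage, output_stage, m, n):
    """Check sum_i c_ik^IN <= m and sum_i c_ik^OUT <= n for every stage k.

    `input_stage` maps each ISE input to the stage in which it is read;
    `output_stage` maps each ISE output to the stage in which it is written.
    """
    reads = Counter(input_stage.values())
    writes = Counter(output_stage.values())
    return (all(count <= m for count in reads.values())
            and all(count <= n for count in writes.values()))

# Hypothetical ISE with 2 inputs and 4 outputs, 1 read and 2 write ports.
serialised_in = {"in1": 0, "in2": 1}
serialised_out = {"out1": 2, "out2": 2, "out3": 3, "out4": 3}
print(respects_port_limits(serialised_in, serialised_out, m=1, n=2))

# Reading both inputs in stage 0 would need two read ports.
unserialised_in = {"in1": 0, "in2": 0}
print(respects_port_limits(unserialised_in, serialised_out, m=1, n=2))
```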

20 Objective Function. The saving in cycles should be maximised: SW(S) - HW(S) should be maximum, where SW(S) = ∑_i x_i SW(n_i) and HW(S) = R. Any processor model where SW(S) and HW(S) can be computed using linear inequalities can be handled using ILP.
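As a small worked instance of the objective, the sketch below evaluates SW(S) - HW(S) for a fixed selection and scales it by the execution count to obtain the merit M(S) from the problem formulation. The function names, the three-node selection, the software latencies and the execution count are all hypothetical values chosen for illustration.

```python
def cycles_saved(x, sw, num_stages):
    """SW(S) - HW(S), with SW(S) = sum_i x_i * SW(n_i) and HW(S) = R."""
    sw_total = sum(x[node] * sw[node] for node in x)
    return sw_total - num_stages

def merit(x, sw, num_stages, n_exec):
    """M(S) = N_exec * (SW(S) - HW(S))."""
    return n_exec * cycles_saved(x, sw, num_stages)

# Hypothetical selection: nodes a and b are in the ISE, c is not.
x = {"a": 1, "b": 1, "c": 0}
sw = {"a": 1, "b": 2, "c": 1}   # software latency of each node, in cycles
print(cycles_saved(x, sw, num_stages=2))
print(merit(x, sw, num_stages=2, n_exec=100))
```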

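21 Experimental Setup [Figure: flow from the input dataflow graph through three compared methods: ISE selection [Atasu03] followed by I/O serialisation [Pozzi05]; ISE selection [Atasu03] with no serialisation; and the ILP method. Annotations: "exp / subopt" and "exp / opt".]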

22 Results (1 of 3) [Figure: results on the viterbi, adpcmdecoder and adpcmcoder benchmarks for three methods: no pipelining, Pozzi's algorithm and the ILP method]

23 Results (2 of 3). Benchmark: aes. Biggest dataflow graph: 703. [Figure: results after 3 minutes and after an hour] Pozzi's algorithm takes several hours on this benchmark and produces inferior results.

24 Results (3 of 3) The best AFU with 22 inputs and 22 outputs

25 Conclusions [Figure: ISE Selection (Atasu03, Yu07, Chen07, Bonzini07) and I/O Serialisation (Pozzi05, Pothineni07)] The methodology can be generalised to a large class of processor models. Optimal, single-run algorithm.
