Resource Sharing and Binding 4541.633A SoC Design Automation School of EECS Seoul National University
Data-Dominated Circuits
Resource sharing in non-hierarchical CDFG
Compatibility graph $G_+(V, E)$
$E = \{(v_i, v_j) \mid t(v_i) = t(v_j) \text{ and } ((t_i + d_i \le t_j) \text{ or } (t_j + d_j \le t_i)),\ i, j = 1, \ldots, n_{ops}\}$
$t(v_i) = t(v_j)$: same type
$t_i + d_i \le t_j$ or $t_j + d_j \le t_i$: no concurrency
transitive orientation property --> $G_+(V, E)$ is a comparability graph --> minimum clique partitioning in polynomial time
[Figure: scheduled sequencing graph (v0 to vn, C-steps 1 to 4) and its compatibility graph, with the Mult and ALU operations in separate subgraphs]
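The edge rule of the compatibility graph can be sketched in a few lines. This is only an illustration: the function name `compatibility_edges` and the five-operation schedule are hypothetical and are not the example drawn on the slide.

```python
from itertools import combinations

def compatibility_edges(ops):
    """ops maps an operation name to (type, start time, delay).

    Two operations are compatible when they have the same type and
    their executions do not overlap (t_i + d_i <= t_j or t_j + d_j <= t_i).
    """
    edges = set()
    for (u, (type_u, t_u, d_u)), (v, (type_v, t_v, d_v)) in combinations(sorted(ops.items()), 2):
        same_type = type_u == type_v
        no_concurrency = t_u + d_u <= t_v or t_v + d_v <= t_u
        if same_type and no_concurrency:
            edges.add((u, v))
    return edges

# Hypothetical schedule: three multiplications and two ALU operations.
ops = {
    "m1": ("mult", 1, 1),
    "m2": ("mult", 1, 1),
    "m3": ("mult", 2, 1),
    "a1": ("alu", 2, 1),
    "a2": ("alu", 3, 1),
}
edges = compatibility_edges(ops)
print(sorted(edges))
```

In this sketch m1 and m2 start in the same step, so they are not joined by an edge and need separate multipliers, while m3 can share with either of them.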
Data-Dominated Circuits
Conflict graph $G_-(V, E)$: complement of $G_+(V, E)$
vertex coloring: same color --> no conflict --> can share one resource
chromatic number of $G_-(V, E)$ = clique cover number of $G_+(V, E)$
[Figure: scheduled sequencing graph (v0 to vn, C-steps 1 to 4) and its conflict graph on operations 1 to 11]
Data-Dominated Circuits
Conflict graph $G_-(V, E)$ as an interval graph
execution interval $[t_i, t_i + d_i - 1]$
intersection between two intervals --> edge
minimum vertex coloring in polynomial time (left edge algorithm)
[Figure: execution intervals of operations 1 to 11 next to the scheduled sequencing graph (v0 to vn, C-steps 1 to 4)]
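A minimal sketch of the left edge algorithm on inclusive execution intervals $[t_i, t_i + d_i - 1]$. The function name `left_edge` and the five intervals are hypothetical. Each color stands for one resource instance of the given type.

```python
def left_edge(intervals):
    """intervals maps a name to an inclusive (start, end) pair.

    Sort by left edge, then fill one color (resource) at a time with
    intervals that start strictly after the previously packed one ends.
    """
    pending = sorted(intervals.items(), key=lambda kv: kv[1])
    color_of = {}
    color = 0
    while pending:
        last_end = None
        remaining = []
        for name, (start, end) in pending:
            if last_end is None or start > last_end:
                color_of[name] = color      # fits on the current resource
                last_end = end
            else:
                remaining.append((name, (start, end)))  # retry on the next color
        pending = remaining
        color += 1
    return color_of

# Hypothetical intervals for operations of a single type.
intervals = {"a": (1, 2), "b": (1, 1), "c": (2, 3), "d": (3, 4), "e": (4, 4)}
colors = left_edge(intervals)
print(colors)
```

Two colors are used here, which matches the largest number of intervals that overlap in any single step.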
Data-Dominated Circuits
[Figure: execution intervals of operations 1 to 11 and the scheduled sequencing graph. C-step 1: v1 (*), v2 (*), v10 (+); C-step 2: v3 (*), v6 (*), v11 (<); C-step 3: v4 (-), v7 (*), v8 (*); C-step 4: v5 (-), v9 (+)]
Data-Dominated Circuits
Resource sharing in hierarchical CDFG
Model call
we can flatten the hierarchy to compute the compatibility of operations across different levels of hierarchy
[Figure: sequencing graphs containing calls to models a and b]
Data-Dominated Circuits
single call --> interval graph can be used
multiple calls --> not an interval graph when the hierarchy is to be preserved --> coloring a general graph is NP-hard
[Figure: sequencing graph with two calls to model a and the resulting conflict graph: not chordal --> not an interval graph]
Data-Dominated Circuits
Iteration
unroll: similar to the case of model call
Branching
same type --> compatible
[Figure: branching graph (BR) with operations a, b, c, d and its conflict graph: not chordal --> not an interval graph]
General Circuits
Register sharing
Lifetime of variable
Variables alive in non-overlapping intervals or under alternative conditions are compatible
Compatibility graph --> min. clique partitioning
Conflict graph --> min. vertex coloring
Non-hierarchical --> intervals --> left edge algorithm
[Figure: sequencing graph (v1 to v7) with variables z1 to z6, their lifetime intervals, and the conflict graph (interval graph)]
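For the non-hierarchical case, the number of registers found by coloring the interval conflict graph equals the largest number of variables alive at the same time, because interval graphs are perfect. The sketch below computes that number with a sweep over lifetime endpoints. It assumes half-open lifetimes [birth, death), meaning a register freed at a step can be reloaded in the same step; the function name `min_registers` and the lifetimes are hypothetical.

```python
def min_registers(lifetimes):
    """lifetimes maps a variable to (birth, death), alive over [birth, death)."""
    events = []
    for birth, death in lifetimes.values():
        events.append((birth, +1))   # variable becomes alive
        events.append((death, -1))   # variable dies, register is freed
    alive = peak = 0
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a register freed at step t is available to a variable born at t.
    for _, delta in sorted(events):
        alive += delta
        peak = max(peak, alive)
    return peak

# Hypothetical lifetimes, not read off the slide figure.
lifetimes = {"z1": (1, 3), "z2": (1, 2), "z3": (2, 4), "z4": (3, 4)}
print(min_registers(lifetimes))
```

This bound holds only when the conflict graph is an interval graph, which is the non-hierarchical case on this slide.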
General Circuits
Hierarchical (iteration)
[Figure: loop-body sequencing graph (v1 to v11) with loop variables x, y, u, dx and temporaries z1 to z7, and their lifetimes across iterations]
General Circuits
[Figure: lifetimes of x, y, u, and z1 to z7 drawn as circular arcs, and the resulting circular-arc conflict graph]
circular-arc conflict graph: not a chordal graph --> intractable
General Circuits
Multi-port memory binding
Given a scheduled graph, minimize the number of ports of the memory. The minimum number of ports is
$\max_{l} \sum_{i=1}^{n_{var}} x_{il}$
where $x_{il}$ is 1 if the i-th variable is accessed at step l.
Given a, the number of ports of the multi-port memory, maximize the number of variables to be stored in the memory. That is,
maximize $\mathbf{1}^T b = \sum_{i=1}^{n_{var}} b_i$
subject to $\sum_{i=1}^{n_{var}} b_i x_{il} \le a$ for every step l
where $b^T = [b_1, b_2, \ldots, b_{n_{var}}]$, $b_i = 1$ if the i-th variable is stored in the memory.
General Circuits
Example (assume all ports are read/write ports)
time-step 1: z3 = z1 + z2; z12 = z1
time-step 2: z5 = z3 + z4; z7 = z3 * z6; z13 = z3
time-step 3: z8 = z3 + z5; z9 = z1 + z7; z11 = z10 / z5
time-step 4: z14 = z11 ∧ z8; z15 = z12 ∨ z9
time-step 5: z1 = z14; z2 = z15
maximize $\sum_{i=1}^{15} b_i$ subject to
$b_1 + b_2 + b_3 + b_{12} \le a$
$b_3 + b_4 + b_5 + b_6 + b_7 + b_{13} \le a$
$b_1 + b_3 + b_5 + b_7 + b_8 + b_9 + b_{10} + b_{11} \le a$
$b_8 + b_9 + b_{11} + b_{12} + b_{14} + b_{15} \le a$
$b_1 + b_2 + b_{14} + b_{15} \le a$
a = 1 --> b2 = b4 = b8 = 1 --> only z2, z4, and z8 can be stored
a = 2 --> z2, z4, z5, z10, z12, z14 can be stored
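The example is small enough to check by exhaustive search. The sketch below is a brute-force stand-in for an ILP solver, not the method used in the lecture; the names `ACCESSES`, `feasible`, and `max_stored` are hypothetical. Each set lists the variable indices accessed in one time step, copied from the five constraints above.

```python
from itertools import combinations

# Variable indices accessed in each time step (from the example).
ACCESSES = {
    1: {1, 2, 3, 12},
    2: {3, 4, 5, 6, 7, 13},
    3: {1, 3, 5, 7, 8, 9, 10, 11},
    4: {8, 9, 11, 12, 14, 15},
    5: {1, 2, 14, 15},
}

def feasible(stored, ports):
    """True if no time step accesses more than `ports` stored variables."""
    return all(len(stored & accessed) <= ports for accessed in ACCESSES.values())

def max_stored(ports):
    """Largest feasible set of variables, found by trying sizes from large to small."""
    variables = sorted(set().union(*ACCESSES.values()))
    for size in range(len(variables), 0, -1):
        for chosen in combinations(variables, size):
            if feasible(set(chosen), ports):
                return size, set(chosen)
    return 0, set()

# Ports needed if every variable is stored: the busiest step decides.
ports_for_all = max(len(accessed) for accessed in ACCESSES.values())
print(ports_for_all, max_stored(1)[0], max_stored(2)[0])
```

The search confirms three variables for a = 1 and six for a = 2. Several sets of that size may exist; the ones named on the slide are feasible choices.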
General Circuits
Bus sharing and binding
Analogous to multi-port memory binding problem
minimize the number of buses
maximize the number of data transfers
Example 1: number of write buses = $a_w$, number of read buses = $a_r$
$w_1 + w_2 \le a_w$
$r_1 + r_2 \le a_r$
$w_3 + w_4 \le a_w$
$r_3 + r_4 \le a_r$
$w_5 + w_6 \le a_w$
$r_5 + r_6 \le a_r$
Example 2: number of read/write buses = a
$w_1 + w_2 \le a$
$r_1 + r_2 + w_3 + w_4 \le a$
$r_3 + r_4 + w_5 + w_6 \le a$
$r_5 + r_6 \le a$
[Figure: sequencing graph (v1 to v7) with variables z1 to z6]
General Circuits
Multiplexers
Unconstrained minimum-area binding
Example: n add operations, a adders
[Equation: area for n add operations bound to a adders]
> 0 --> area increases as a increases
< 0 --> area decreases as a increases
may omit 2:
1. mux area accounts for two muxes
2. consider operand sharing --> approximated average
General Circuits
Weighted compatibility graph
Spread the mux cost over the operations
share --> overhead (mux + wiring) --> assign weights to the graph --> the problem becomes a weighted clique partitioning problem --> how to weight and how to solve?
General Circuits
Example
each vertex has the triple
dedicated:
v1, v2, v3 share a resource:
[Figure: weighted compatibility graph on vertices 1 to 4]
General Circuits
Performance-constrained
Add performance constraint and minimize area
$\text{area} = c^T a + \text{mux\_area}(B) + \text{wire\_area}(B)$
where $c^T a = [\text{area}_1, \text{area}_2, \ldots, \text{area}_{n_{res}}]\,[a_1\ a_2\ \ldots\ a_{n_{res}}]^T$
$\text{path\_delay} = \sum_{i \in \text{path}} d_i + \text{mux\_delay}(B) + \text{wire\_delay}(B) < f \quad \forall\ \text{path}$
chaining is considered
$d_i$: propagation delay of functional resource
B: binding
f: cycle time
mux_delay(B), wire_delay(B), mux_area(B), wire_area(B): non-linear functions of B
Performance-directed binding
Minimize path delay
More functional resources --> fewer muxes --> less mux delay, but more area --> more wire delay
Module Selection Problem
Same operation with different resource types
Ripple-carry adder, carry look-ahead adder --> different area, propagation delay
Serial, parallel --> different area, cycle time, execution delay in cycles
Example: 32-bit x 32-bit multiplier
fully serial multiplier: (area, delay in cycles) = (1, 1024)
serial-parallel multiplier: (area, delay in cycles) = (32, 32)
fully parallel multiplier: (area, delay in cycles) = (1024, 1)
Module selection and scheduling
Module selection --> execution delay --> scheduling
Module selection and binding
Same module must be selected for operations sharing a resource
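The multiplier example can be turned into a small selection rule: among the modules that meet a delay bound, pick the one with the smallest area. This is only a sketch of the trade-off, not the selection procedure of the lecture; the function name `smallest_module` and the delay bounds in the usage lines are hypothetical, while the (area, delay) pairs are the ones listed above.

```python
# (area, delay in cycles) for the 32-bit x 32-bit multiplier variants.
MULTIPLIERS = {
    "fully serial": (1, 1024),
    "serial-parallel": (32, 32),
    "fully parallel": (1024, 1),
}

def smallest_module(max_delay):
    """Return the smallest-area multiplier whose delay fits within max_delay cycles."""
    candidates = [
        (area, name)
        for name, (area, delay) in MULTIPLIERS.items()
        if delay <= max_delay
    ]
    if not candidates:
        return None          # no module is fast enough
    return min(candidates)[1]

print(smallest_module(1))     # only the fully parallel multiplier fits
print(smallest_module(40))    # serial-parallel is the cheapest that fits
print(smallest_module(2000))  # the fully serial multiplier is enough
```

All three variants have the same area-delay product (1024), so a tighter delay bound is paid for directly in area.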
Module Selection Problem
Minimize latency using the fastest resource types, then replace with slower and smaller resource types for non-critical operations
Example
mult (area, delay) = (5, 1), (2, 2)
ALU (area, delay) = (1, 1)
latency = 4
v1, v2, v3: two fast mult
v8, v6 or v7: non-critical --> small mult: sharing is impossible --> use just two fast mult (area 10)
[Figure: scheduled sequencing graph (v0 to vn, C-steps 1 to 4)]
Module Selection Problem
latency = 5
v1, v2, v3, v7: one fast mult
v6, v8: one small mult
area = 7
[Figure: scheduled sequencing graph (v0 to vn, C-steps 1 to 5)]
Module Selection Problem
Module selection and resource sharing
Example: adder vs. ALU
dedicated resource: area = 3 area_add + area_ALU
{v1, v2, v3}, {v4}: area = area_add + area_ALU + 2 area_Dmux
{v2, v3}, {v1, v4}:
[Figure: sequencing graph with v1 (<) and v2, v3, v4 (+), drawn once for each of the two bindings]
Resource Sharing and Binding for Pipelined Circuits
Operations with start time $l + p\,d_0$ conflict with each other for $p \in \mathbb{Z}$
example: $d_0 = 2$
[Figure: two-stage pipelined schedule of operations v1 to v11 (stage 1 and stage 2, C-steps 1 and 2 each), the folded schedule, and the compatibility graph on operations 1 to 11]
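The conflict rule for pipelined circuits can be sketched as a modular test on start times. The sketch assumes every operation takes a single step, which the slide does not state; the function name `pipelined_compatible` and the sample operations are hypothetical.

```python
def pipelined_compatible(op_a, op_b, d0):
    """op = (type, start step). Assumes unit execution delay.

    Operations of the same type whose start times differ by a multiple
    of d0 (start = l + p * d0) are active in the same folded step and
    therefore conflict; otherwise they can share a resource.
    """
    (type_a, t_a), (type_b, t_b) = op_a, op_b
    return type_a == type_b and (t_a - t_b) % d0 != 0

d0 = 2
print(pipelined_compatible(("mult", 1), ("mult", 3), d0))  # steps 1 and 3 fold together
print(pipelined_compatible(("mult", 1), ("mult", 2), d0))  # different folded steps
```

With $d_0 = 2$, operations in steps 1 and 3 conflict even though they never overlap in a non-pipelined schedule.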
Resource Sharing and Binding for Pipelined Circuits
Pipelining with branching
K. Hwang, A. Casavant, M. Dragomirecky, and M. d'Abreu, "Constrained conditional resource sharing in pipeline synthesis," Proc. ICCAD, Nov. 1988.
Alternative path operations may not be compatible
Twisted pair: only one pair can share a resource
if (cond == 1) {
    d = a + b;
    y = c * d;
} else {
    e = a * b;
    y = c + e;
}
[Figure: data-flow graphs of the true block (d = a + b; y = c * d) and the false block (e = a * b; y = c + e)]
Resource Sharing and Binding for Pipelined Circuits
[Figure: pipelined datapaths for the true block and the false block, with pipeline registers and multiplexers selected by cond_i and cond_i-1]