Architecture Synthesis

SoC Design Automation, School of EECS, Seoul National University

Introduction
Architecture synthesis
  behavioral description --> exploration of design space --> RTL description
  RTL description = structural view of datapath + logic-level specification of control unit
Datapath: interconnection of resources
Resources
  functional resources (ALU, adder, multiplier, ...)
  memory resources (register, RAM, ROM, ...)
  interface resources (bus, steering logic, I/O pad, ...)

Inputs and outputs of architecture synthesis
  input: behavioral model (CDFG)
  output: RTL description
  constraints: timing, area, performance, resource binding
  resources (library + module generator)
    area and delay are given for primitives and estimated otherwise

Architecture Synthesis Problem
Place operations in TIME and SPACE
  time: scheduling
  space: resource binding
Given a CDFG $G(V, E)$ with $V = \{v_i \mid i = 0, 1, \ldots, n\}$ and $E \subseteq \{(v_i, v_j) \mid i, j = 0, 1, \ldots, n\}$
Temporal domain: scheduling
  task of determining the start times of operations subject to precedence constraints
  latency: $\lambda = t_n - t_0$, where $t_n$ is the start time of the sink and $t_0$ is the start time of the source

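Definition: Scheduling is a mapping $\varphi: V \to \mathbb{Z}^+$ with $\varphi(v_j) = t_j$ such that $t_j \ge t_i + d_i$ for all $i, j$ with $(v_i, v_j) \in E$, where $d_i$ is the execution delay of $v_i$.
[Figure: scheduled sequencing graph with source v0 (NOP) and sink vn (NOP). C-step 1: v1 (*), v2 (*), v6 (*), v8 (*), v10 (+); C-step 2: v3 (*), v7 (*), v9 (+), v11 (<); C-step 3: v4 (-); C-step 4: v5 (-). Latency $\lambda = 4$.]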
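The scheduling definition above can be checked mechanically. The following is a minimal sketch, not the lecture's own algorithm: it computes an as-soon-as-possible (ASAP) schedule and verifies the precedence constraint. The edge list is an assumption reconstructed from the example's update equations (the figure did not survive in this transcript), and unit delays are assumed for every operation.

```python
# Data dependencies of the example, ASSUMED from the update equations
# (v3 = v1*v2, v4 = u - v3, v5 = v4 - v7, ...); the figure itself is lost.
EDGES = [("v1", "v3"), ("v2", "v3"), ("v3", "v4"), ("v4", "v5"),
         ("v6", "v7"), ("v7", "v5"), ("v8", "v9"), ("v10", "v11")]
NODES = [f"v{i}" for i in range(1, 12)]
DELAY = {v: 1 for v in NODES}  # assumption: every operation takes one C-step


def asap(nodes, edges, delay):
    """Earliest start times: t_j = max over predecessors of (t_i + d_i), else 1."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    start = {}

    def visit(v):
        if v not in start:
            start[v] = max((visit(u) + delay[u] for u in preds[v]), default=1)
        return start[v]

    for v in nodes:
        visit(v)
    return start


def is_valid(start, edges, delay):
    """Check t_j >= t_i + d_i for every edge (v_i, v_j)."""
    return all(start[v] >= start[u] + delay[u] for u, v in edges)


def latency(start, delay):
    """Sink start (when the last operation finishes) minus source start (first step)."""
    return max(start[v] + delay[v] for v in start) - min(start.values())


schedule = asap(NODES, EDGES, DELAY)
print(schedule)
print("valid:", is_valid(schedule, EDGES, DELAY), "latency:", latency(schedule, DELAY))
```

Under these assumptions the ASAP schedule places v4 in step 3 and v5 in step 4, which agrees with the four-step schedule described for the figure.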

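[Figure: schedule of the same graph using one multiplier and one ALU. C-step 1: v1 (*), v10 (+); C-step 2: v2 (*), v11 (<); C-step 3: v3 (*); C-step 4: v6 (*), v4 (-); C-step 5: v7 (*); C-step 6: v8 (*), v5 (-); C-step 7: v9 (+). Latency $\lambda = 7$.]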
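A longer schedule like the seven-step one arises when resources are limited. As a hedged illustration, the sketch below uses list scheduling with a longest-path-to-sink priority; this heuristic is my choice for the example and is not named in the slides. The dependency edges and the mult/ALU typing of each operation are assumptions reconstructed from the update equations, and every operation is assumed to take one C-step.

```python
# Operation types and dependencies: ASSUMED from the example's update equations.
OPS = {"v1": "mult", "v2": "mult", "v3": "mult", "v6": "mult", "v7": "mult",
       "v8": "mult", "v4": "alu", "v5": "alu", "v9": "alu", "v10": "alu",
       "v11": "alu"}
EDGES = [("v1", "v3"), ("v2", "v3"), ("v3", "v4"), ("v4", "v5"),
         ("v6", "v7"), ("v7", "v5"), ("v8", "v9"), ("v10", "v11")]


def list_schedule(ops, edges, limits):
    """Resource-constrained list scheduling with unit delays.

    limits maps a resource type to its number of instances (each >= 1).
    """
    succs = {v: [] for v in ops}
    preds = {v: [] for v in ops}
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    def priority(v):
        # Length of the longest path from v to the sink, counted in operations.
        return 1 + max((priority(w) for w in succs[v]), default=0)

    start, step = {}, 1
    while len(start) < len(ops):
        used = {kind: 0 for kind in limits}
        # Ready = unscheduled, with all predecessors finished in an earlier step.
        ready = [v for v in ops if v not in start
                 and all(u in start and start[u] < step for u in preds[v])]
        for v in sorted(ready, key=lambda v: (-priority(v), int(v[1:]))):
            if used[ops[v]] < limits[ops[v]]:
                start[v] = step
                used[ops[v]] += 1
        step += 1
    return start


for limits in ({"mult": 1, "alu": 1}, {"mult": 2, "alu": 2}):
    s = list_schedule(OPS, EDGES, limits)
    print(limits, "latency =", max(s.values()))
```

With one multiplier and one ALU the six multiplications alone need six steps, and the last one still feeds an ALU operation, so no schedule can be shorter than seven steps.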

Chaining, pipelining, multi-cycle operation
  multiplier: 35 ns
  others: 25 ns
  cycle time: 50 ns
[Figure: schedule with chaining. C-step 1: v1, v2, v6, v8, v10; C-step 2: v3, v7, v9, v11; C-step 3: v4 and v5 chained. Latency $\lambda = 3$.]

Spatial domain: binding
  $R = \{1, 2, \ldots, n_{res}\}$: set of resource types
  $\tau: V \to R$ maps each operation to a resource type
    one-to-many: resource (module) selection can be applied
    many-to-one: resource sharing can be applied
Definition: Resource binding is a mapping $\beta: V \to R \times \mathbb{Z}^+$, where $\beta(v_i) = (t, r)$ denotes that $v_i$ is implemented by the $r$-th instance of resource type $\tau(v_i) = t$, $t \in R$.
  dedicated resource binding --> $\beta$ is one-to-one
  shared resource --> $\beta$ is many-to-one --> the corresponding operations cannot execute concurrently

Resource binding can be represented by a hypergraph.
[Figure: the four-step scheduled sequencing graph (C-steps 1 to 4), used to illustrate a resource binding.]

Binding constraints
  partial binding: the resource binding must be compatible with the partial binding
  upper bound on the resource usage of each type
    resource allocation: $\{a_k \mid k = 1, 2, \ldots, n_{res}\}$
    resource binding: $\beta(v_i) = (t, r)$ with $r \le a_t$ for each operation $v_i$
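The two binding rules stated so far (operations sharing an instance must not run concurrently, and the instance index must not exceed the allocation) can be expressed as a small checker. This is a sketch under assumptions: unit-delay operations, the two schedules described for the earlier figures, and a hypothetical function name `check_binding`.

```python
# Resource type of each operation (ASSUMED from the example's update equations).
KIND = {"v1": "mult", "v2": "mult", "v3": "mult", "v6": "mult", "v7": "mult",
        "v8": "mult", "v4": "alu", "v5": "alu", "v9": "alu", "v10": "alu",
        "v11": "alu"}

# Start steps of the four-step and seven-step schedules of the example.
START_4 = {"v1": 1, "v2": 1, "v6": 1, "v8": 1, "v10": 1,
           "v3": 2, "v7": 2, "v9": 2, "v11": 2, "v4": 3, "v5": 4}
START_7 = {"v1": 1, "v10": 1, "v2": 2, "v11": 2, "v3": 3, "v6": 4,
           "v4": 4, "v7": 5, "v8": 6, "v5": 6, "v9": 7}


def check_binding(start, binding, allocation):
    """binding[v] = (type, instance); allocation[type] = number of instances."""
    busy = set()
    for v, (kind, inst) in binding.items():
        if not 1 <= inst <= allocation[kind]:
            return False          # violates r <= a_t
        slot = (kind, inst, start[v])
        if slot in busy:
            return False          # two concurrent operations on one instance
        busy.add(slot)
    return True


# Shared binding: every operation uses instance 1 of its type.
shared = {v: (k, 1) for v, k in KIND.items()}

# Dedicated binding: the i-th operation of a type gets instance i.
dedicated, count = {}, {"mult": 0, "alu": 0}
for v, k in KIND.items():
    count[k] += 1
    dedicated[v] = (k, count[k])

print(check_binding(START_7, shared, {"mult": 1, "alu": 1}))     # True
print(check_binding(START_4, shared, {"mult": 1, "alu": 1}))     # False
print(check_binding(START_4, dedicated, {"mult": 6, "alu": 5}))  # True
```

The shared binding is only legal for the seven-step schedule, because the four-step schedule runs several multiplications in the same C-step.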

Area and Performance Estimation
Data-dominated circuits
  mult: 5 units, 35 ns
  ALU: 1 unit, 25 ns
  cycle time = 40 ns
  overhead = 1 unit
  dedicated resources: area = 6 * 5 + 5 * 1 + 1 = 36 units, latency = 4 cycles
  one instance of each type: area = 5 + 1 + 1 = 7 units, latency = 7 cycles
[Figure: sequencing graph of the example, v0 (NOP) through vn (NOP), with operation types *, +, -, <.]

General circuits
  Registers
    used at cycle boundaries
    add to area and time (set-up time + propagation delay)
  Steering logic
    MUX: area and time can be estimated easily
    bus: drivers must be considered
  Wiring
    a fast floor-planner can be used
    average interconnect length = $(\#\text{blocks})^{\alpha}$, $0 \le \alpha \le 1$
  Control unit
    latency = #control steps --> #states --> area (address space in ROM-based control units)
    optimization using state encoding and state minimization --> hard to estimate area

Example: registers
  7 intermediate variables
  3 loop variables (x, y, u)
  3 loop invariants (a, 3, dx)
  total 13 variables
  compute #registers considering the variables' lifetimes
  loop body:
    xl = x + dx;
    ul = u - (3*x*u*dx) - (3*y*dx);
    yl = y + (u*dx);
    c = xl < a;
[Figure: sequencing graph of the loop body with operand labels; inputs 3, x, u, dx, y, a and results xl, ul, yl, c.]

Example: steering logic
  dedicated resources: no MUX
  shared resources: 5 operand pairs for the mult, 5 operand pairs for the ALU
  shared registers also need MUXs
  1 mult, 1 ALU, 2 registers: four 5-way MUXs + two 2-way MUXs
[Figure: sequencing graph with operations bound to the shared mult and ALU.]

Example: control unit
  4 states vs. 7 states
[Figure: the seven-step schedule bound to one mult, one ALU, and two registers (reg1, reg2), with the mult and ALU outputs marked.]

Strategies for Architecture Optimization
Area/latency optimization
  given a cycle-time
    resource-constrained minimum-latency scheduling
    latency-constrained minimum-resource scheduling
  scheduling before/after binding
    circular dependency: scheduling determines concurrency, which affects binding; binding determines delay, which affects scheduling
    solve jointly or iteratively
    data-dominated: scheduling before binding (DSP)
    control-dominated: binding before scheduling

Example
  constraints: area ≤ 20 units, latency ≤ 8
  mult: 35 ns, 5 units; ALU: 25 ns, 1 unit; area overhead: 1 unit

  cycle-time = 40 ns
    (#mult, #ALU)   latency   area
    (1, 1)          7         7
    (1, 2)          7         8
    (2, 1)          5         12
    (2, 2)          4         13

  cycle-time = 30 ns --> mult in multi-cycle operation
    (#mult, #ALU)   latency   area
    (2, 1)          8         12
    (3, 1)          7         17
    (3, 2)          6         18

[Figure: area versus latency for the (#mult, #ALU) implementations above at cycle-times of 40 ns and 30 ns, distinguishing Pareto points from non-Pareto points.]
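The area figures in this example follow from the unit costs, and the Pareto points can be filtered from the (latency, area) pairs. The sketch below is illustrative only: the latencies are copied from the example, the area formula is the one implied by its numbers, and treating each cycle-time separately when testing dominance is my assumption.

```python
MULT_AREA, ALU_AREA, OVERHEAD = 5, 1, 1   # units, from the example


def area(n_mult, n_alu):
    return n_mult * MULT_AREA + n_alu * ALU_AREA + OVERHEAD


# (#mult, #ALU) -> latency, copied from the example for each cycle-time (ns).
LATENCY = {
    40: {(1, 1): 7, (1, 2): 7, (2, 1): 5, (2, 2): 4},
    30: {(2, 1): 8, (3, 1): 7, (3, 2): 6},
}


def pareto(points):
    """Return the configurations not dominated in (latency, area)."""
    keep = []
    for cfg, (lat, ar) in points.items():
        dominated = any(
            l2 <= lat and a2 <= ar and (l2 < lat or a2 < ar)
            for other, (l2, a2) in points.items() if other != cfg
        )
        if not dominated:
            keep.append(cfg)
    return sorted(keep)


for cycle_time, table in LATENCY.items():
    points = {cfg: (lat, area(*cfg)) for cfg, lat in table.items()}
    print(cycle_time, "ns:", points, "Pareto:", pareto(points))
```

At 40 ns, (1, 2) has the same latency as (1, 1) but more area, so it is dominated; within the 30 ns set no implementation dominates another.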

Cycle-time/latency optimization
  fixed area: after binding
  cycle-time/latency trade-off by multi-cycle operation or chaining
Example
  constraints: 20 ns ≤ cycle-time ≤ 50 ns, latency ≤ 8
  dedicated resources:
    cycle-time t (ns)   latency   note
    t > 35              4
    25 < t < 35         6         mult: multi-cycle operation
    20 < t < 25         8         mult, ALU: multi-cycle operation
    t = 50              3         ALU: chaining
  1 mult + 1 ALU: t > 35 --> latency = 7
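The multi-cycle rows of this trade-off can be reproduced by rounding each resource delay up to a whole number of cycles and taking the critical path. This is a sketch under assumptions: dedicated resources, the dependency edges reconstructed from the update equations, and no register or MUX overhead. Chaining is not modelled, so the latency of 3 at 50 ns is outside what this code computes.

```python
import math

DELAY_NS = {"mult": 35, "alu": 25}   # from the example
# Operation types and dependencies: ASSUMED from the example's update equations.
KIND = {"v1": "mult", "v2": "mult", "v3": "mult", "v6": "mult", "v7": "mult",
        "v8": "mult", "v4": "alu", "v5": "alu", "v9": "alu", "v10": "alu",
        "v11": "alu"}
EDGES = [("v1", "v3"), ("v2", "v3"), ("v3", "v4"), ("v4", "v5"),
         ("v6", "v7"), ("v7", "v5"), ("v8", "v9"), ("v10", "v11")]


def dedicated_latency(cycle_time):
    """Critical-path length in cycles when every operation has its own resource."""
    # Multi-cycle operation: a resource slower than the clock takes ceil(d / T) cycles.
    cycles = {v: math.ceil(DELAY_NS[k] / cycle_time) for v, k in KIND.items()}
    preds = {v: [] for v in KIND}
    for u, v in EDGES:
        preds[v].append(u)
    finish = {}

    def done(v):
        if v not in finish:
            finish[v] = cycles[v] + max((done(u) for u in preds[v]), default=0)
        return finish[v]

    return max(done(v) for v in KIND)


for t in (40, 30, 22):
    print(f"cycle-time {t} ns --> latency {dedicated_latency(t)}")
```

The critical path is v1, v3, v4, v5 (two multiplications followed by two ALU operations), which gives 4, 6, and 8 cycles as the clock period shrinks.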

[Figure: design points plotted against the area, latency, and cycle-time axes.]

Cycle-time/area optimization
  fixed latency (after scheduling)
  resource sharing needs multiplexers
    less area, but added delay --> cycle-time increases
Example
  dedicated resources (no MUX): latency = 4, area = 36, cycle-time = mult delay = 35 ns

4 mult + 2 ALU: latency = 4
  2 * 3-way MUX + 6 * 2-way MUX
  3-way MUX: area = 0.3 unit, delay = 3 ns
  2-way MUX: area = 0.2 unit, delay = 2 ns
  total area = 4 * 5 + 2 * 1 + 1 + 2 * 0.3 + 6 * 0.2 = 24.8
  cycle-time = max(25 + 3, 35 + 2) = 37 ns
[Figure: the four-step schedule. C-step 1: v1, v2, v6, v8, v10; C-step 2: v3, v7, v9, v11; C-step 3: v4; C-step 4: v5.]

1 mult + 1 ALU: latency = 7
  4 * 5-way MUX
  total area = 1 * 5 + 1 * 1 + 1 + 4 * 0.5 = 9
  cycle-time = 35 + 5 = 40 ns
[Figure: cycle-time versus area and latency for the implementations (6, 5), (4, 2), and (1, 1).]
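The three implementations can be compared with a small estimator that adds multiplexer area and delay to the resource figures. This is a sketch, not the lecture's tool: the 2-way and 3-way MUX costs come from the example, while the 5-way delay of 5 ns and the assignment of MUX sizes to the mult and ALU inputs are assumptions chosen to match the stated totals.

```python
MULT = (5, 35)   # (area units, delay ns), from the example
ALU = (1, 25)
OVERHEAD = 1     # area units

# ways -> (area units, delay ns); 1 means a direct connection with no MUX.
# The 5-way delay is an ASSUMPTION that follows the pattern of the other two.
MUX = {1: (0.0, 0), 2: (0.2, 2), 3: (0.3, 3), 5: (0.5, 5)}


def estimate(n_mult, n_alu, mux_counts, mult_mux_ways, alu_mux_ways):
    """Return (area, cycle_time) for an implementation with steering logic.

    mux_counts maps MUX size (ways) to how many MUXs of that size are used;
    mult_mux_ways / alu_mux_ways give the MUX size in front of each resource type.
    """
    mux_area = sum(n * MUX[ways][0] for ways, n in mux_counts.items())
    total_area = n_mult * MULT[0] + n_alu * ALU[0] + OVERHEAD + mux_area
    cycle_time = max(MULT[1] + MUX[mult_mux_ways][1], ALU[1] + MUX[alu_mux_ways][1])
    return round(total_area, 1), cycle_time


print(estimate(6, 5, {}, 1, 1))            # dedicated resources, no MUX
print(estimate(4, 2, {3: 2, 2: 6}, 2, 3))  # 4 mult + 2 ALU
print(estimate(1, 1, {5: 4}, 5, 5))        # 1 mult + 1 ALU
```

The estimator makes the trade-off explicit: each step toward more sharing lowers the area but lengthens the path through the steering logic.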

Data-path synthesis at the architecture level
Connectivity synthesis
  define the interconnection among
    functional resources
    steering logic
    memory resources
    I/O
    control unit
[Figure: schedule used for data-path synthesis. C-step 1: v1 (*), v2 (*), v10 (+); C-step 2: v3 (*), v6 (*), v11 (<); C-step 3: v4 (-), v7 (*), v8 (*); C-step 4: v5 (-), v9 (+).]

Data-Path Synthesis
[Figure: data path and controller. Data inputs a, 3, dx, x, y, u; registers r1, r2, r3; multipliers and an ALU (+, -, <). The controller issues register enables, mux controls, and ALU control, and receives the condition c.]

Control-Unit Synthesis
Micro-coded control and hardwired control
Micro-coded control synthesis
  for the non-hierarchical, data-independent delay case
  horizontal micro-code and vertical micro-code
Horizontal micro-code example
  dedicated resources
  11 operations --> 11 register enables --> 11 bits/word
  4 steps --> 4 words --> 2-bit counter
[Figure: the four-step schedule of the example (C-steps 1 to 4).]

[Figure: horizontal micro-coded control unit. A counter with reset generates the addresses 00, 01, 10, 11; each addressed micro-word drives the activation signals of one C-step of the four-step schedule.]
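The horizontal micro-words themselves did not survive in this transcript, but they can be regenerated from a schedule. The sketch below assumes the four-step schedule with dedicated resources and a bit ordering in which bit i (from the left, starting at 1) enables operation v_i; that ordering is an illustrative choice, not taken from the slides.

```python
# Operations activated in each C-step (dedicated resources, four-step schedule).
SCHEDULE = {1: [1, 2, 6, 8, 10], 2: [3, 7, 9, 11], 3: [4], 4: [5]}
N_OPS = 11


def horizontal_microcode(schedule, n_ops):
    """One micro-word per C-step, one bit per activation signal."""
    words = []
    for step in sorted(schedule):
        bits = ["0"] * n_ops
        for op in schedule[step]:
            bits[op - 1] = "1"   # ASSUMED ordering: leftmost bit enables v1
        words.append("".join(bits))
    return words


words = horizontal_microcode(SCHEDULE, N_OPS)
address_bits = (len(words) - 1).bit_length()   # width of the step counter

for address, word in enumerate(words):
    print(format(address, f"0{address_bits}b"), word)
```

Each word is 11 bits wide and the counter needs 2 bits for the four words, matching the sizes stated in the example.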

Vertical micro-code
  fully vertical --> $\lceil \log_2 N_{act} \rceil$ bits per micro-word
  one activation per micro-word
  lengthening the schedule, or multiple micro-word reads per schedule step
  micro-words: 0001 0010 1010 0011 0110 1011 0100 0111 1000 0101 1001
[Figure: the micro-ROM outputs $\lceil \log_2 N_{act} \rceil$ bits to a decoder / nano-ROM that produces the $N_{act}$ activation signals.]
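The listed micro-words are consistent with encoding each operation's index in binary and emitting the operations step by step. The sketch below reproduces that encoding; the per-step grouping {v1, v2, v10}, {v3, v6, v11}, {v4, v7, v8}, {v5, v9} is taken from the compaction example elsewhere in these slides, and using the operation index as its code is an assumption that happens to match the listed words.

```python
import math

# Operations per C-step, in the order the micro-words are listed.
STEPS = [[1, 2, 10], [3, 6, 11], [4, 7, 8], [5, 9]]
N_ACT = 11   # number of activation signals


def vertical_microcode(steps, n_act):
    """Fully vertical encoding: one micro-word per activation."""
    width = math.ceil(math.log2(n_act))
    # ASSUMPTION: code = operation index, which must fit in `width` bits.
    assert n_act < 2 ** width
    return [format(op, f"0{width}b") for step in steps for op in step]


words = vertical_microcode(STEPS, N_ACT)
print(" ".join(words))
```

The result is 11 words of 4 bits instead of 4 words of 11 bits, which is why the schedule must be lengthened or several words read per step.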

Micro-coded control optimization
  micro-code compaction
    group of operations {Oi, Oj, Ok} in which no pair is concurrent, encoded in one field:
      field code   00      01   10   11
      operation    no-op   Oi   Oj   Ok
    conflict graph: vertex <--> operation, edge <--> concurrency (conflict)
      --> min. vertex coloring --> same color in same group --> #colors = #fields
    min. #fields does not imply min. #bits
      3 fields of 4 op's, 3 bits each: 9 bits
      4 fields of 3 op's, 2 bits each: 8 bits
    complement --> compatibility graph --> weighted clique partitioning problem
      weight = #bits --> minimize total weight

Example
  groups of operations: {v1, v3, v4}, {v2}, {v6, v7, v5}, {v8, v9}, {v10, v11} --> 5 fields
  each field has a no-op code
    field   operations
    A       v1, v3, v4
    B       v2
    C       v6, v7, v5
    D       v8, v9
    E       v10, v11
  codes, e.g. field A: v1 = 01, v3 = 10, v4 = 11; field B: v2 = 1
  micro-words, one per control step: {v1, v2, v10}, {v3, v6, v11}, {v4, v7, v8}, {v5, v9}
[Figure: field decoders D1 to D4 generating the activation signals.]
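The bit counts in the compaction discussion follow from one rule: a field holding k operations plus a no-op needs ceil(log2(k + 1)) bits. The sketch below applies that rule to the example grouping and checks that no two operations in a field are concurrent. The step assignment is the per-step grouping given in the example; the function names are hypothetical.

```python
import math

# Fields of the example and the C-step in which each operation is active.
FIELDS = {"A": [1, 3, 4], "B": [2], "C": [6, 7, 5], "D": [8, 9], "E": [10, 11]}
STEPS = [[1, 2, 10], [3, 6, 11], [4, 7, 8], [5, 9]]
STEP_OF = {op: i + 1 for i, ops in enumerate(STEPS) for op in ops}


def field_bits(n_ops):
    """Bits needed to encode n_ops operations plus the no-op code."""
    return math.ceil(math.log2(n_ops + 1))


def total_bits(group_sizes):
    return sum(field_bits(n) for n in group_sizes)


def is_compatible(group):
    """A group is legal if no two of its operations share a C-step."""
    steps = [STEP_OF[op] for op in group]
    return len(steps) == len(set(steps))


print(all(is_compatible(g) for g in FIELDS.values()))
print("example fields:", total_bits(len(g) for g in FIELDS.values()), "bits")
print("3 fields of 4:", total_bits([4, 4, 4]), "bits")
print("4 fields of 3:", total_bits([3, 3, 3, 3]), "bits")
```

The last two lines reproduce the point that fewer fields need not mean fewer bits: three fields of four operations cost 9 bits, while four fields of three cost 8.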

Hard-wired control synthesis
  Moore-type FSM
  synthesis is straightforward
[Figure: state diagram with states S1 to S4 and transitions labelled reset and reset'. Activated operations: S1: 1, 2, 6, 8, 10; S2: 3, 7, 9, 11; S3: 4; S4: 5.]

Pipelined Circuits
  Structural pipelining: pipelined resources
  Functional pipelining: pipelined CDFG
    data introduction interval d0 (in # cycles)
[Figure: the four-step schedule folded from d0 = 4 to d0 = 2, so that C-steps 3 and 4 are relabelled C-step 1 and C-step 2.]
Given d0 and the cycle-time, find the optimal area/latency trade-off points
Example
  cycle-time = 40 ns, area bound = 20
  d0 = 1: area = 36
  d0 = 2: 3 mult, 3 ALU --> area = 19
[Figure: schedule for d0 = 2. C-step 1: v1, v2, v6, v10; C-step 2: v3, v7, v8, v11; C-step 3 (folded onto 1): v4, v9; C-step 4 (folded onto 2): v5. Control states: S1 activates 1, 2, 6, 10, 4, 9; S2 activates 3, 7, 8, 11, 5.]
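The resource counts for a given data introduction interval can be derived by folding the schedule modulo d0: operations whose C-steps are congruent modulo d0 run concurrently in steady state. The sketch below is an illustration under assumptions: unit-delay operations, the schedule read from the S1/S2 grouping in this example, and the area model used earlier in the slides (mult 5, ALU 1, overhead 1).

```python
from collections import Counter

# Resource type of each operation (ASSUMED from the example's update equations).
KIND = {"v1": "mult", "v2": "mult", "v3": "mult", "v6": "mult", "v7": "mult",
        "v8": "mult", "v4": "alu", "v5": "alu", "v9": "alu", "v10": "alu",
        "v11": "alu"}

# Start steps of the schedule used in the pipelining example.
START = {"v1": 1, "v2": 1, "v6": 1, "v10": 1, "v3": 2, "v7": 2, "v8": 2,
         "v11": 2, "v4": 3, "v9": 3, "v5": 4}

AREA = {"mult": 5, "alu": 1}
OVERHEAD = 1


def pipelined_usage(start, kind, d0):
    """Resources of each type needed when a new data set enters every d0 cycles."""
    usage = Counter()
    for v, t in start.items():
        usage[((t - 1) % d0, kind[v])] += 1   # fold C-step t onto slot (t-1) mod d0
    need = {}
    for (_, k), n in usage.items():
        need[k] = max(need.get(k, 0), n)
    return need


def area(need):
    return sum(AREA[k] * n for k, n in need.items()) + OVERHEAD


for d0 in (1, 2, 4):
    need = pipelined_usage(START, KIND, d0)
    print(f"d0 = {d0}: {need}, area = {area(need)}")
```

With d0 = 1 every operation needs its own resource; with d0 = 2 the folded schedule needs three multipliers and three ALUs, which fits the area bound of 20.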

[Figure: area versus latency for (cycle-time, d0) = (40, 4) and (40, 2); marked implementations include (1, 1), (1, 2), (2, 1), (2, 2), and (3, 3).]
