Resource Sharing and Binding


Resource Sharing and Binding 4541.633A SoC Design Automation School of EECS Seoul National University

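Data-Dominated Circuits: resource sharing in a non-hierarchical CDFG
Compatibility graph $G_+(V, E)$ with $E = \{(v_i, v_j) \mid t(v_i) = t(v_j) \text{ and } (t_i + d_i \le t_j \text{ or } t_j + d_j \le t_i),\ i, j = 1, \dots, n_{ops}\}$.
The first condition requires the same resource type; the second requires no concurrency.
$G_+(V, E)$ has the transitive orientation property, so it is a comparability graph and minimum clique partitioning can be done in polynomial time.
[Figure: scheduled sequencing graph (v0 to vn, C-steps 1 to 4) and its compatibility graph, with one component for the multiplier operations (1, 2, 3, 6, 7, 8) and one for the ALU operations (4, 5, 9, 10, 11).]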
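The edge rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the schedule in `OPS` (start steps and unit delays) is an assumption chosen to resemble the scheduled sequencing graph of the slides, and the names `OPS`, `compatible`, and `compatibility_edges` are hypothetical.

```python
from itertools import combinations

# Assumed schedule for illustration: name -> (resource type, start step t_i, delay d_i).
# Unit delays and these start steps are assumptions, not data stated in the slide text.
OPS = {
    "v1": ("mult", 1, 1), "v2": ("mult", 1, 1),
    "v3": ("mult", 2, 1), "v6": ("mult", 2, 1),
    "v7": ("mult", 3, 1), "v8": ("mult", 3, 1),
    "v10": ("alu", 1, 1), "v11": ("alu", 2, 1),
    "v4": ("alu", 3, 1), "v5": ("alu", 4, 1), "v9": ("alu", 4, 1),
}


def compatible(op_a, op_b):
    """Same resource type and no overlap in time (the edge condition of G+)."""
    type_a, t_a, d_a = op_a
    type_b, t_b, d_b = op_b
    return type_a == type_b and (t_a + d_a <= t_b or t_b + d_b <= t_a)


def compatibility_edges(ops):
    """Return the edge set of the compatibility graph as unordered pairs."""
    return {
        frozenset((u, v))
        for u, v in combinations(ops, 2)
        if compatible(ops[u], ops[v])
    }


EDGES = compatibility_edges(OPS)
```

With this assumed schedule, v1 and v3 are joined by an edge (same type, different steps), while v1 and v2 (same step) and v1 and v10 (different types) are not.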

Data-Dominated Circuits: conflict graph
The conflict graph $G_-(V, E)$ is the complement of $G_+(V, E)$.
Vertex coloring: vertices with the same color have no conflict, so they can share one resource.
The chromatic number of $G_-(V, E)$ equals the clique cover number of $G_+(V, E)$.
[Figure: scheduled sequencing graph (v0 to vn, C-steps 1 to 4) and its conflict graph on operations 1 to 11.]

Data-Dominated Circuits: conflict graph $G_-(V, E)$ as an interval graph
Each operation has the execution interval $[t_i, t_i + d_i - 1]$.
Two intervals that intersect give an edge.
Minimum vertex coloring can be done in polynomial time (left-edge algorithm).
[Figure: execution intervals of operations 1 to 11 next to the scheduled sequencing graph (v0 to vn, C-steps 1 to 4).]
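A minimal sketch of interval coloring in the spirit of the left-edge algorithm follows. It is a first-fit variant that visits intervals in order of their left edge, which is known to color an interval graph with the minimum number of colors; the classic formulation fills one resource at a time instead. The function name `left_edge` and the interval data (ALU operations with assumed unit delays) are illustrative assumptions.

```python
def left_edge(intervals):
    """Bind intervals to resources.

    intervals: dict name -> (start, end), both inclusive, i.e. [t_i, t_i + d_i - 1].
    Returns dict name -> resource index (0-based).
    """
    # Visit intervals by increasing left edge (start step).
    order = sorted(intervals, key=lambda name: intervals[name][0])
    busy_until = []  # busy_until[r] = last occupied step of resource r
    binding = {}
    for name in order:
        start, end = intervals[name]
        for r, last in enumerate(busy_until):
            if last < start:  # resource r is free again: reuse it
                busy_until[r] = end
                binding[name] = r
                break
        else:  # every existing resource overlaps: open a new one
            busy_until.append(end)
            binding[name] = len(busy_until) - 1
    return binding


# Assumed ALU operations with unit delay: interval = [step, step].
ALU_INTERVALS = {"v10": (1, 1), "v11": (2, 2), "v4": (3, 3), "v5": (4, 4), "v9": (4, 4)}
ALU_BINDING = left_edge(ALU_INTERVALS)
```

Under these assumptions two ALUs suffice: v5 and v9 overlap in step 4 and land on different resources, while all other operations reuse the first ALU.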

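Data-Dominated Circuits
[Figure: scheduled sequencing graph with the execution intervals of operations 1 to 11. C-step 1: v1 (*), v2 (*), v10 (+). C-step 2: v3 (*), v6 (*), v11 (<). C-step 3: v4 (-), v7 (*), v8 (*). C-step 4: v5 (-), v9 (+). v0 and vn are NOPs.]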

Data-Dominated Circuits: resource sharing in a hierarchical CDFG
Model call: we can flatten the hierarchy to compute the compatibility of operations across different levels of the hierarchy.
[Figure: a calling graph with operations + and * and model calls a and b, shown before and after flattening.]

Data-Dominated Circuits: model calls
Single call: an interval graph can be used.
Multiple calls: the conflict graph is not an interval graph when the hierarchy is to be preserved, and coloring a general graph is NP-hard.
[Figure: model a called at several steps, with the resulting conflict graph; it is not chordal, hence not an interval graph.]

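Data-Dominated Circuits: iteration and branching
Iteration: unroll; this is similar to the case of a model call.
Branching: operations of the same type on alternative branches are compatible.
[Figure: branching graph (BR) with operations a, b, c, d and its conflict graph; it is not chordal, hence not an interval graph.]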

General Circuits: register sharing
Lifetime of a variable: variables alive in non-overlapping intervals or under alternative conditions are compatible.
Compatibility graph: minimum clique partitioning.
Conflict graph: minimum vertex coloring.
Non-hierarchical case: lifetimes are intervals, so the left-edge algorithm applies.
[Figure: sequencing graph (v1 to v7), lifetimes of variables z1 to z6, and the resulting conflict graph, which is an interval graph.]

General Circuits: register sharing, hierarchical (iteration)
[Figure: sequencing graph of a loop body (v1 to v11) with the lifetimes of variables x, y, u, dx, a, c, and z1 to z7 across an iteration.]

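General Circuits: circular-arc conflict graph
The conflict graph of the variable lifetimes (x, y, u, z1 to z7) is a circular-arc graph.
It is not a chordal graph, so coloring it is intractable.
[Figure: lifetimes drawn as arcs on a circle and the corresponding circular-arc conflict graph.]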

General Circuits: multi-port memory binding
Given a scheduled graph, minimize the number of ports of the memory: $a = \max_{l} \sum_{i=1}^{n_{var}} x_{il}$, where $x_{il}$ is 1 if the i-th variable is accessed at step $l$.
Given $a$, the number of ports of the multi-port memory, maximize the number of variables to be stored in the memory. That is, maximize $\mathbf{1}^T b = \sum_{i=1}^{n_{var}} b_i$ subject to $\sum_{i=1}^{n_{var}} b_i x_{il} \le a$ for every step $l$, where $b^T = [b_1, b_2, \dots, b_{n_{var}}]$ and $b_i = 1$ if the i-th variable is stored in the memory.

General Circuits: example (assume all ports are read/write ports)
time-step 1: z3 = z1 + z2; z12 = z1
time-step 2: z5 = z3 + z4; z7 = z3 * z6; z13 = z3
time-step 3: z8 = z3 + z5; z9 = z1 + z7; z11 = z10 / z5
time-step 4: z14 = z11 ∧ z8; z15 = z12 ∨ z9
time-step 5: z1 = z14; z2 = z15
Maximize $\sum_{i=1}^{15} b_i$ subject to
$b_1 + b_2 + b_3 + b_{12} \le a$
$b_3 + b_4 + b_5 + b_6 + b_7 + b_{13} \le a$
$b_1 + b_3 + b_5 + b_7 + b_8 + b_9 + b_{10} + b_{11} \le a$
$b_8 + b_9 + b_{11} + b_{12} + b_{14} + b_{15} \le a$
$b_1 + b_2 + b_{14} + b_{15} \le a$
a = 1: one optimal solution is b2 = b4 = b8 = 1, so only three variables (z2, z4, and z8) can be stored.
a = 2: six variables (z2, z4, z5, z10, z12, z14) can be stored.
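The example is small enough to check by exhaustive search. The sketch below is a brute-force check, not an ILP solver and not the method the slides prescribe; `STEPS` transcribes the variables accessed in each time step, and the helper names `feasible` and `max_stored` are hypothetical.

```python
from itertools import combinations

# Variables (by index) accessed in each of the five time steps of the example.
STEPS = [
    {1, 2, 3, 12},
    {3, 4, 5, 6, 7, 13},
    {1, 3, 5, 7, 8, 9, 10, 11},
    {8, 9, 11, 12, 14, 15},
    {1, 2, 14, 15},
]
VARIABLES = range(1, 16)


def feasible(stored, ports):
    """True if no time step accesses more than `ports` stored variables."""
    return all(len(step & stored) <= ports for step in STEPS)


def max_stored(ports):
    """Largest set of variables that fits a memory with `ports` ports.

    Tries subsets from largest to smallest, so the first feasible one is optimal.
    """
    for size in range(len(VARIABLES), 0, -1):
        for subset in combinations(VARIABLES, size):
            if feasible(set(subset), ports):
                return set(subset)
    return set()


ONE_PORT = max_stored(1)
TWO_PORTS = max_stored(2)
```

The search agrees with the slide: at most three variables fit with one port and at most six with two ports, and the slide's two-port choice {z2, z4, z5, z10, z12, z14} satisfies all five constraints. Other optimal sets of the same size may exist.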

General Circuits: bus sharing and binding
Analogous to the multi-port memory binding problem: minimize the number of buses, or maximize the number of data transfers.
Example 1: number of write buses = $a_w$, number of read buses = $a_r$.
$w_1 + w_2 \le a_w$, $r_1 + r_2 \le a_r$
$w_3 + w_4 \le a_w$, $r_3 + r_4 \le a_r$
$w_5 + w_6 \le a_w$, $r_5 + r_6 \le a_r$
Example 2: number of read/write buses = $a$.
$w_1 + w_2 \le a$
$r_1 + r_2 + w_3 + w_4 \le a$
$r_3 + r_4 + w_5 + w_6 \le a$
$r_5 + r_6 \le a$
[Figure: sequencing graph (v1 to v7) with variables z1 to z6.]
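If every transfer must be carried (all $w_i = r_i = 1$), the constraints reduce to a lower bound on the bus count: the largest number of simultaneous transfers in any step. The sketch below computes that bound; the grouping of transfers into four steps follows Example 2, and the names `TRANSFERS`, `min_shared_buses`, and `min_split_buses` are hypothetical.

```python
# Transfers per control step, grouped as in the constraints of Example 2.
TRANSFERS = [
    {"write": ["w1", "w2"], "read": []},
    {"write": ["w3", "w4"], "read": ["r1", "r2"]},
    {"write": ["w5", "w6"], "read": ["r3", "r4"]},
    {"write": [], "read": ["r5", "r6"]},
]


def min_shared_buses(steps):
    """Fewest read/write buses that carry every transfer (Example 2 style)."""
    return max(len(s["write"]) + len(s["read"]) for s in steps)


def min_split_buses(steps):
    """Fewest (write buses, read buses) when the two kinds are separate (Example 1 style)."""
    writes = max(len(s["write"]) for s in steps)
    reads = max(len(s["read"]) for s in steps)
    return writes, reads


SHARED = min_shared_buses(TRANSFERS)
SPLIT = min_split_buses(TRANSFERS)
```

Under these assumptions four shared buses, or two write buses plus two read buses, are needed to carry all twelve transfers.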

General Circuits: multiplexers
Unconstrained minimum-area binding.
Example: n add operations, a adders.
[Equation: area as a function of a; not preserved in the transcript.]
If the coefficient of a is > 0, the area increases as a increases; if it is < 0, the area decreases as a increases.
The factor 2 may be omitted: 1. the mux area accounts for two muxes; 2. operand sharing is considered, giving an approximated average.

General Circuits: weighted compatibility graph
Spread the mux cost over the operations.
Sharing causes overhead (mux + wiring), so weights are assigned to the graph and the problem becomes a weighted clique partitioning problem.
Open questions: how to weight and how to solve?

General Circuits: example
Each vertex has the triple [not preserved in the transcript].
Dedicated resources: [expression not preserved].
v1, v2, v3 share a resource: [expression not preserved].
[Figure: two graphs on vertices 1 to 4.]

General Circuits: performance-constrained binding
Add a performance constraint and minimize area:
$\text{area} = c^T a + \text{mux\_area}(B) + \text{wire\_area}(B)$, where $c^T a = [\text{area}_1, \text{area}_2, \dots, \text{area}_{n_{res}}]\,[a_1\ a_2\ \dots\ a_{n_{res}}]^T$.
$d_i$: propagation delay of a functional resource; $B$: binding; $f$: cycle time.
mux_delay(B), wire_delay(B), mux_area(B), and wire_area(B) are non-linear functions of B.
Performance-directed binding: minimize path delay. More functional resources mean fewer muxes and less mux delay, but also more area and more wire delay.
$\text{path\_delay} = \sum_{i \in \text{path}} d_i + \text{mux\_delay}(B) + \text{wire\_delay}(B) < f \quad \forall\, \text{path}$
Chaining is considered.

Module Selection Problem
The same operation can be implemented with different resource types.
Ripple-carry adder vs. carry look-ahead adder: different area and propagation delay.
Serial vs. parallel: different area, cycle time, and execution delay in cycles.
Example: 32-bit x 32-bit multiplier.
Fully serial multiplier: (area, delay in cycles) = (1, 1024)
Serial-parallel multiplier: (area, delay in cycles) = (32, 32)
Fully parallel multiplier: (area, delay in cycles) = (1024, 1)
Module selection and scheduling: module selection determines the execution delay, which affects scheduling.
Module selection and binding: the same module must be selected for operations sharing a resource.

Module Selection Problem
Minimize latency using the fastest resource types, then replace them with slower and smaller resource types for non-critical operations.
Example: mult (area, delay) = (5, 1) or (2, 2); ALU (area, delay) = (1, 1); latency = 4.
v1, v2, v3: two fast mults.
v8, v6 or v7: non-critical, so a small mult is a candidate; sharing it is impossible, so use just the two fast mults (area 10).
[Figure: scheduled sequencing graph (v0 to vn) over C-steps 1 to 4.]

Module Selection Problem
latency = 5
v1, v2, v3, v7: one fast mult.
v6, v8: one small mult.
area = 7
[Figure: scheduled sequencing graph (v0 to vn) over C-steps 1 to 5.]

Module Selection Problem: module selection and resource sharing
Example: adder vs. ALU.
Dedicated resources: area = 3 area_add + area_ALU
{v1, v2, v3}, {v4}: area = area_add + area_ALU + 2 areaDmux
{v2, v3}, {v1, v4}: [expression not preserved in the transcript]
[Figure: the two bindings of v1 (<), v2 (+), v3 (+), v4 (+).]

Resource Sharing and Binding for Pipelined Circuits
Operations with start times $l + p\,d_0$, for $p \in \mathbb{Z}$, conflict with each other.
Example: $d_0 = 2$.
[Figure: two-stage pipelined schedule (stage 1: C-steps 1 and 2; stage 2: C-steps 1 and 2), the folded schedule with v1, v2, v6, v10, v4, v9 in C-step 1 and v3, v7, v8, v11, v5 in C-step 2, and the compatibility graph on operations 1 to 11.]
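The conflict rule can be sketched as a test on start times modulo $d_0$. This is an illustration under stated assumptions: $d_0$ is taken to be the interval between successive data introductions, all operations have unit delay, only operations of the same type compete for a resource, and the start steps in `START` are assumed for the example. The names `START`, `conflicts`, and `resources_needed` are hypothetical.

```python
from collections import Counter

D0 = 2  # value used in the slide's example

# Assumed unfolded schedule: name -> (resource type, absolute start step).
START = {
    "v1": ("mult", 1), "v2": ("mult", 1), "v6": ("mult", 1), "v10": ("alu", 1),
    "v3": ("mult", 2), "v7": ("mult", 2), "v8": ("mult", 2), "v11": ("alu", 2),
    "v4": ("alu", 3), "v9": ("alu", 3),
    "v5": ("alu", 4),
}


def conflicts(op_a, op_b, d0=D0):
    """Same type and start times that differ by a multiple of d0 (unit delays assumed)."""
    type_a, t_a = op_a
    type_b, t_b = op_b
    return type_a == type_b and (t_a - t_b) % d0 == 0


def resources_needed(schedule, d0=D0):
    """Per type, the largest group of mutually conflicting operations.

    With unit delays each conflict class is a clique, so this equals the
    number of resources of that type.
    """
    load = Counter((rtype, t % d0) for rtype, t in schedule.values())
    needed = {}
    for (rtype, _), count in load.items():
        needed[rtype] = max(needed.get(rtype, 0), count)
    return needed


NEEDED = resources_needed(START)
```

With this assumed schedule, v10 (step 1) and v4 (step 3) conflict even though they are two steps apart, and three multipliers and three ALUs are needed.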

Resource Sharing and Binding for Pipelined Circuits: pipelining with branching
K. Hwang, A. Casavant, M. Dragomirecky, and M. d'Abreu, "Constrained conditional resource sharing in pipeline synthesis," Proc. ICCAD, Nov. 1988.
Operations on alternative paths may not be compatible.
Twisted pair: only one pair can share a resource.
    if (cond == 1) { d = a + b; y = c * d; }
    else { e = a * b; y = c + e; }
[Figure: data-flow graphs of the true block (d = a + b, then y = c * d) and the false block (e = a * b, then y = c + e).]

Resource Sharing and Binding for Pipelined Circuits
[Figure: two pipelined datapaths for the true and false blocks, built from an adder, a multiplier, pipeline registers, and 2-to-1 MUXes selected by cond_i and cond_{i-1}, showing which operation pair shares a resource.]