High-Level Synthesis: Creating Custom Circuits from High-Level Code

Slides:



Advertisements
Similar presentations
P3 / 2004 Register Allocation. Kostis Sagonas 2 Spring 2004 Outline What is register allocation Webs Interference Graphs Graph coloring Spilling Live-Range.
Advertisements

ECE 667 Synthesis and Verification of Digital Circuits
Architecture-dependent optimizations Functional units, delay slots and dependency analysis.
ECE Synthesis & Verification - Lecture 2 1 ECE 667 Spring 2011 ECE 667 Spring 2011 Synthesis and Verification of Digital Circuits High-Level (Architectural)
ECE 454 Computer Systems Programming Compiler and Optimization (I) Ding Yuan ECE Dept., University of Toronto
Program Representations. Representing programs Goals.
Introduction to Data Flow Graphs and their Scheduling Sources: Gang Quan.
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 10: RC Principles: Software (3/4) Prof. Sherief Reda.
Common Sub-expression Elim Want to compute when an expression is available in a var Domain:
ECE Synthesis & Verification - Lecture 2 1 ECE 697B (667) Spring 2006 ECE 697B (667) Spring 2006 Synthesis and Verification of Digital Circuits Scheduling.
Representing programs Goals. Representing programs Primary goals –analysis is easy and effective just a few cases to handle directly link related things.
Courseware High-Level Synthesis an introduction Prof. Jan Madsen Informatics and Mathematical Modelling Technical University of Denmark Richard Petersens.
ICS 252 Introduction to Computer Design
Direction of analysis Although constraints are not directional, flow functions are All flow functions we have seen so far are in the forward direction.
Introduction to Data Flow Graphs and their Scheduling Sources: Gang Quan.
Register-Transfer (RT) Synthesis Greg Stitt ECE Department University of Florida.
Precision Going back to constant prop, in what cases would we lose precision?
CPSC 388 – Compiler Design and Construction Parsers – Context Free Grammars.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
Chapter 10: Compilers and Language Translation Invitation to Computer Science, Java Version, Third Edition.
Compiler course 1. Introduction. Outline Scope of the course Disciplines involved in it Abstract view for a compiler Front-end and back-end tasks Modules.
Unit-1 Introduction Prepared by: Prof. Harish I Rathod
1 Code optimization “Code optimization refers to the techniques used by the compiler to improve the execution efficiency of the generated object code”
1 Compiler Design (40-414)  Main Text Book: Compilers: Principles, Techniques & Tools, 2 nd ed., Aho, Lam, Sethi, and Ullman, 2007  Evaluation:  Midterm.
High-Level Synthesis: Creating Custom Circuits from High-Level Code Greg Stitt ECE Department University of Florida.
ENGG3190 Logic Synthesis High Level Synthesis Winter 2014 S. Areibi School of Engineering University of Guelph.
Compiler Optimizations ECE 454 Computer Systems Programming Topics: The Role of the Compiler Common Compiler (Automatic) Code Optimizations Cristiana Amza.
Exploiting Parallelism
High-Level Synthesis: Creating Custom Circuits from High-Level Code Greg Stitt ECE Department University of Florida.
System-on-Chip Design Analysis of Control Data Flow
Buffering Techniques Greg Stitt ECE Department University of Florida.
High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida.
Code Optimization Overview and Examples
Compiler Design (40-414) Main Text Book:
Introduction to Compiler Construction
Introduction to Parsing (adapted from CS 164 at Berkeley)
Register Transfer Specification And Design
ECE 565 High-Level Synthesis—An Introduction
Chap 7. Register Transfers and Datapaths
Morgan Kaufmann Publishers
High-Level Synthesis: Creating Custom Circuits from High-Level Code
ESE532: System-on-a-Chip Architecture
课程名 编译原理 Compiling Techniques
Optimization Code Optimization ©SoftMoore Consulting.
High-Level Synthesis: Creating Custom Circuits from High-Level Code
High-Level Synthesis Creating Custom Circuits from High-Level Code
CS/COE0447 Computer Organization & Assembly Language
CSCI1600: Embedded and Real Time Software
High-Level Synthesis: Creating Custom Circuits from High-Level Code
High-Level Synthesis: Creating Custom Circuits from High-Level Code
Manual Example How to manually convert high-level code into circuit
High-Level Synthesis: Creating Custom Circuits from High-Level Code
Optimizing Transformations Hal Perkins Autumn 2011
High-Level Synthesis: Creating Custom Circuits from High-Level Code
Lesson 4 Synchronous Design Architectures: Data Path and High-level Synthesis (part two) Sept EE37E Adv. Digital Electronics.
ECE-C662 Introduction to Behavioral Synthesis Knapp Text Ch
Instruction encoding We’ve already seen some important aspects of processor design. A datapath contains an ALU, registers and memory. Programmers and compilers.
High-Level Synthesis: Creating Custom Circuits from High-Level Code
Optimizing Transformations Hal Perkins Winter 2008
Code Optimization Overview and Examples Control Flow Graph
A SoC Design Automation Seoul National University
Guest Lecturer TA: Shreyas Chand
Architectural-Level Synthesis
Static Code Scheduling
Optimization 薛智文 (textbook ch# 9) 薛智文 96 Spring.
Chapter 10: Compilers and Language Translation
Lecture 4: Instruction Set Design/Pipelining
CSCI1600: Embedded and Real Time Software
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

High-Level Synthesis: Creating Custom Circuits from High-Level Code Greg Stitt ECE Department University of Florida

Existing FPGA Tool Flow Register-transfer (RT) synthesis Specify RT structure (muxes, registers, etc) + Allows precise specification - But, time consuming, difficult, error prone HDL RT Synthesis Netlist Technology Mapping Physical Design Placement Bitfile Routing Processor FPGA

Future FPGA Tool Flow? High-level Synthesis RT Synthesis C/C++, Java, etc. High-level Synthesis HDL RT Synthesis Netlist Technology Mapping Physical Design Placement Bitfile Routing Processor FPGA

High-level Synthesis Wouldn’t it be nice to write high-level code? Ratio of C to VHDL developers (10000:1 ?) + Easier to specify + Separates function from architecture + More portable - Hardware potentially slower Similar to assembly code era Programmers could always beat compiler But, no longer the case Hopefully, high-level synthesis will catch up to manual effort

High-level Synthesis More challenging than compilation Compilation maps behavior into assembly instructions Architecture is known to compiler High-level synthesis creates a custom architecture to execute behavior Huge hardware exploration space Best solution may include microprocessors Should handle any high-level code Not all code appropriate for hardware

High-level Synthesis First, consider how to manually convert high-level code into circuit Steps 1) Build FSM for controller 2) Build datapath based on FSM acc = 0; for (i=0; i < 128; i++) acc += a[i];

Manual Example Build a FSM (controller) Decompose code into states acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc=0, i = 0 if (i < 128) load a[i] Done acc += a[i] i++

Manual Example Build a datapath Allocate resources for each state + + acc=0, i = 0 if (i < 128) load a[i] Done i acc a[i] addr 1 128 1 acc += a[i] + + < + i++ acc = 0; for (i=0; i < 128; i++) acc += a[i];

Manual Example Build a datapath Determine register inputs + + < + In from memory acc=0, i = 0 &a if (i < 128) 2x1 2x1 2x1 load a[i] Done i acc a[i] addr 1 128 1 acc += a[i] + < + + i++ acc = 0; for (i=0; i < 128; i++) acc += a[i];

Manual Example Build a datapath Add outputs + + + < In from memory acc=0, i = 0 &a if (i < 128) 2x1 2x1 2x1 load a[i] Done i acc a[i] addr 1 128 1 acc += a[i] + + < + i++ acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc Memory address

Manual Example Build a datapath Add control signals + + + < In from memory acc=0, i = 0 &a if (i < 128) 2x1 2x1 2x1 load a[i] Done i acc a[i] addr 1 128 1 acc += a[i] + + < + i++ acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc Memory address

Manual Example Combine controller+datapath Controller + + + < In from memory Controller &a 2x1 2x1 2x1 i acc a[i] addr 1 128 1 + + < + acc = 0; for (i=0; i < 128; i++) acc += a[i]; Done Memory Read acc Memory address

Manual Example Alternatives Use one adder (plus muxes) < + In from memory &a 2x1 2x1 2x1 i acc a[i] addr 1 128 < MUX MUX + acc Memory address

Manual Example Comparison with high-level synthesis Determining when to perform each operation => Scheduling Allocating resource for each operation => Resource allocation Mapping operations onto resources => Binding

Another Example Your turn Steps for (i=0; i < 100; i++) { if (a[i] > 0) x ++; else x --; a[i] = x; } //output x Steps 1) Build FSM (do not perform if conversion) 2) Build datapath based on FSM

High-Level Synthesis Could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc. High-level Code High-Level Synthesis Custom Circuit Usually a RT VHDL description, but could as low level as a bit file

High-Level Synthesis Controller < + High-Level Synthesis acc i addr for (i=0; i < 128; i++) acc += a[i]; High-Level Synthesis acc i < addr a[i] + 1 128 2x1 &a In from memory Memory address Done Memory Read Controller

Main Steps Controller + Datapath High-level Code Converts code to intermediate representation - allows all following steps to use common format. Front-end Syntactic Analysis Intermediate Representation Optimization Determines when each operation will execute, and resources used Scheduling/Resource Allocation Back-end Binding/Resource Sharing Maps operations onto physical resources Controller + Datapath

Intermediate Representation Syntactic Analysis Definition: Analysis of code to verify syntactic correctness Converts code into intermediate representation 2 steps 1) Lexical analysis (Lexing) 2) Parsing High-level Code Lexical Analysis Syntactic Analysis Parsing Intermediate Representation

Lexical Analysis Lexical analysis (lexing) breaks code into a series of defined tokens Token: defined language constructs x = 0; if (y < z) x = 1; Lexical Analysis ID(x), ASSIGN, INT(0), SEMICOLON, IF, LPAREN, ID(y), LT, ID(z), RPAREN, ID(x), ASSIGN, INT(1), SEMICOLON

Lexing Tools Define tokens using regular expressions - outputs C code that lexes input Common tool is “lex” /* braces and parentheses */ "[" { YYPRINT; return LBRACE; } "]" { YYPRINT; return RBRACE; } "," { YYPRINT; return COMMA; } ";" { YYPRINT; return SEMICOLON; } "!" { YYPRINT; return EXCLAMATION; } "{" { YYPRINT; return LBRACKET; } "}" { YYPRINT; return RBRACKET; } "-" { YYPRINT; return MINUS; } /* integers [0-9]+ { yylval.intVal = atoi( yytext ); return INT; }

Parsing Analysis of token sequence to determine correct grammatical structure Languages defined by context-free grammar Correct Programs Grammar x = 0; x = 0; y = 1; Program = Exp Exp = Stmt SEMICOLON | IF LPAREN Cond RPAREN Exp | Exp Exp if (a < b) x = 10; if (var1 != var2) x = 10; Cond = ID Comp ID x = 0; if (y < z) x = 1; x = 0; if (y < z) x = 1; y = 5; t = 1; Stmt = ID ASSIGN INT Comp = LT | NE

Parsing Incorrect Programs Grammar Program = Exp Exp = S SEMICOLON | IF LPAREN Cond RPAREN Exp | Exp Exp x = 5;; x = 5 if (x+5 > y) x = 2; Cond = ID Comp ID x = y; S = ID ASSIGN INT Comp = LT | NE

Parsing Tools Define grammar in special language Automatically creates parser based on grammar Popular tool is “yacc” - yet-another-compiler-compiler program: functions { $$ = $1; } ; functions: function { $$ = $1; } | functions function { $$ = $1; } function: HEXNUMBER LABEL COLON code { $$ = $2; }

Intermediate Representation Parser converts tokens to intermediate representation Usually, an abstract syntax tree Assign x = 0; if (y < z) x = 1; d = 6; x if assign cond assign y < z x 1 d 6

Intermediate Representation Why use intermediate representation? Easier to analyze/optimize than source code Theoretically can be used for all languages Makes synthesis back end language independent C Code Java Perl Syntactic Analysis Syntactic Analysis Syntactic Analysis Intermediate Representation Scheduling, resource allocation, binding, independent of source language - sometimes optimizations too Back End

Intermediate Representation Different Types Abstract Syntax Tree Control/Data Flow Graph (CDFG) Sequencing Graph Etc. We will focus on CDFG Combines control flow graph (CFG) and data flow graph (DFG)

Control flow graphs CFG Represents control flow dependencies of basic blocks Basic block is section of code that always executes from beginning to end I.e. no jumps into or out of block acc=0, i = 0 acc = 0; for (i=0; i < 128; i++) acc += a[i]; if (i < 128) acc += a[i] i ++ Done

Control flow graphs Your turn Create a CFG for this code i = 0; while (j < 10) { if (x < 5) y = 2; else if (z < 10) y = 6; }

Data Flow Graphs DFG Represents data dependencies between operations a x = a+b; y = c*d; z = x - y; + * - x y z

Control/Data Flow Graph Combines CFG and DFG Maintains DFG for each node of CFG acc = 0; for (i=0; i < 128; i++) acc += a[i]; acc i acc=0; i=0; if (i < 128) acc a[i] i 1 acc += a[i] i ++ Done + + acc i

High-Level Synthesis: Optimization

Synthesis Optimizations After creating CDFG, high-level synthesis optimizes graph Goals Reduce area Improve latency Increase parallelism Reduce power/energy 2 types Data flow optimizations Control flow optimizations

Data Flow Optimizations Tree-height reduction Generally made possible from commutativity, associativity, and distributivity a b c d a b c d + + + + + + a b c d a b c d * + * + + +

Data Flow Optimizations Operator Strength Reduction Replacing an expensive (“strong”) operation with a faster one Common example: replacing multiply/divide with shift 1 multiplication 0 multiplications b[i] = a[i] * 8; b[i] = a[i] << 3; a = b * 5; c = b << 2; a = b + c; a = b * 13; c = b << 2; d = b << 3; a = c + d + b;

Data Flow Optimizations Constant propagation Statically evaluate expressions with constants x = 0; y = x * 15; z = y + 10; x = 0; y = 0; z = 10;

Data Flow Optimizations Function Specialization Create specialized code for common inputs Treat common inputs as constants If inputs not known statically, must include if statement for each call to specialized function int f (int x) { y = x * 15; return y + 10; } int f (int x) { y = x * 15; return y + 10; } int f_opt () { return 10; } Treat frequent input as a constant for (I=0; I < 1000; I++) f(0); … } for (I=0; I < 1000; I++) f_opt(0); … }

Data Flow Optimizations Common sub-expression elimination If expression appears more than once, repetitions can be replaced a = x + y; . . . . . . b = c * 25 + x + y; a = x + y; . . . . . . b = c * 25 + a; x + y already determined

Data Flow Optimizations Dead code elimination Remove code that is never executed May seem like stupid code, but often comes from constant propagation or function specialization int f (int x) { if (x > 0 ) a = b * 15; else a = b / 4; return a; } int f_opt () { a = b * 15; return a; } Specialized version for x > 0 does not need else branch - “dead code”

Data Flow Optimizations Code motion (hoisting/sinking) Avoid repeated computation for (I=0; I < 100; I++) { z = x + y; b[i] = a[i] + z ; } z = x + y; for (I=0; I < 100; I++) { b[i] = a[i] + z ; }

Control Flow Optimizations Loop Unrolling Replicate body of loop May increase parallelism for (i=0; i < 128; i++) a[i] = b[i] + c[i]; for (i=0; i < 128; i+=2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1] }

Control Flow Optimizations Function Inlining Replace function call with body of function Common for both SW and HW SW - Eliminates function call instructions HW - Eliminates unnecessary control states for (i=0; i < 128; i++) a[i] = f( b[i], c[i] ); . . . . int f (int a, int b) { return a + b * 15; } for (i=0; i < 128; i++) a[i] = b[i] + c[i] * 15;

Control Flow Optimizations Conditional Expansion Replace if with logic expression Execute if/else bodies in parallel y = ab if (a) x = b+d else x =bd y = ab x = a(b+d) + a’bd [DeMicheli] Can be further optimized to: y = ab x = y + d(a+b)

Example Optimize this x = 0; y = a + b; if (x < 15) z = a + b - c; else z = x + 12; output = z * 12;

High-Level Synthesis: Scheduling/Resource Allocation

Scheduling Scheduling assigns a start time to each operation in DFG Start times must not violate dependencies in DFG Start times must meet performance constraints Alternatively, resource constraints Performed on the DFG of each CFG node => Can’t execute multiple CFG nodes in parallel

Examples a b c d a b c d + Cycle1 Cycle1 + + Cycle2 Cycle2 + Cycle3 +

Scheduling Problems Several types of scheduling problems Problems: Usually some combination of performance and resource constraints Problems: Unconstrained Not very useful, every schedule is valid Minimum latency Latency constrained Mininum-latency, resource constrained i.e. find the schedule with the shortest latency, that uses less than a specified # of resources NP-Complete Mininum-resource, latency constrained i.e. find the schedule that meets the latency constraint (which may be anything), and uses the minimum # of resources

Minimum Latency Scheduling ASAP (as soon as possible) algorithm Find a node whose predecessors are scheduled (or has no predecessors) Schedule node one cycle later than max cycle of predecessor Repeat until all nodes scheduled a b c d e f g h Cycle1 + + - < Cycle2 * Cycle3 * Cycle4 + Minimum possible latency - 4 cycles

Minimum Latency Scheduling ALAP (as late as possible) algorithm Run ASAP, get minimum latency L Find a node whose successors are scheduled (or has none) Schedule node one cycle before min cycle of predecessor Nodes with no successors scheduled to cycle L Repeat until all nodes scheduled a b c d e f g h Cycle1 + + - < Cycle4 Cycle3 Cycle2 * Cycle3 * Cycle4 + L = 4 cycles

Minimum Latency Scheduling ALAP (as late as possible) algorithm Run ASAP, get minimum latency L Find a node whose successors are scheduled Schedule node one cycle before min cycle of predecessor Nodes with no successors scheduled to cycle L Repeat until all nodes scheduled a b c d e f g h Cycle1 + + Cycle2 * Cycle3 - * Cycle4 + < L = 4 cycles

Minimum Latency Scheduling ALAP Has to run ASAP first, seems pointless But, many heuristics need the mobility/slack of each operation ASAP gives the earliest possible time for an operation ALAP gives the latest possible time for an operation Slack = difference between earliest and latest possible schedule Slack = 0 implies operation has to be done in the current scheduled cycle The larger the slack, the more options a heuristic has to schedule the operation

Latency-Constrained Scheduling Instead of finding the minimum latency, find latency less than L Solutions: Use ASAP, verify that minimum latency less than L Use ALAP starting with cycle L instead of minimum latency (don’t need ASAP)

Scheduling with Resource Constraints Schedule must use less than specified number of resources Constraints: 1 ALU (+/-), 1 Multiplier a b c d e f g Cycle1 + + - Cycle2 Cycle3 * * + Cycle4 Cycle5 +

Scheduling with Resource Constraints Schedule must use less than specified number of resources Constraints: 2 ALU (+/-), 1 Multiplier a b c d e f g Cycle1 + + - Cycle2 * * + Cycle3 Cycle4 +

Mininum-Latency, Resource-Constrained Scheduling Definition: Given resource constraints, find schedule that has the minimum latency Example: Constraints: 1 ALU (+/-), 1 Multiplier a b c d e f g Cycle1 + + - Cycle2 Cycle4 Cycle3 + * Cycle5 + Cycle6

Mininum-Latency, Resource-Constrained Scheduling Definition: Given resource constraints, find schedule that has the minimum latency Example: Constraints: 1 ALU (+/-), 1 Multiplier a b c d e f g Cycle1 + + - Cycle2 Cycle3 Cycle4 + * + Cycle5 Different schedules may use same resources, but have different latencies

Mininum-Latency, Resource-Constrained Scheduling Hu’s Algorithm Assumes one type of resource Basic Idea Input: graph, # of resources r 1) Label each node by max distance from output i.e. Use path length as priority 2) Determine C => set of unscheduled nodes that are candidates to be scheduled Candidate if either no predecessor, or predecessors scheduled 3) From C, schedule up to r nodes to current cycle, using label as priority 4) Increment current cycle, repeat from 2) until all nodes scheduled

Mininum-Latency, Resource-Constrained Scheduling Hu’s Algorithm Example a b c d e f g j k + + - - * * + + r = 3

Mininum-Latency, Resource-Constrained Scheduling Hu’s Algorithm Step 1 - Label each node by max distance from output i.e. use path length as priority a b c d e f g j k 3 4 1 4 3 2 2 1 r = 3

Mininum-Latency, Resource-Constrained Scheduling Hu’s Algorithm Step 2 - Determine C => set of unscheduled nodes that are candidates to be scheduled a b c d e f g j k C = 3 4 1 4 3 2 2 Cycle 1 1 r = 3

Mininum-Latency, Resource-Constrained Scheduling Hu’s Algorithm Step 3 - From C, schedule up to r nodes to current cycle, using label as priority a b c d e f g j k Cycle1 3 4 1 4 3 2 Not scheduled due to lower priority 2 Cycle 1 1 r = 3

Mininum-Latency, Resource-Constrained Scheduling Hu’s Algorithm Step 2 a b c d e f g j k Cycle1 3 4 1 4 3 2 C = 2 Cycle 2 1 r = 3

Mininum-Latency, Resource-Constrained Scheduling Hu’s Algorithm Step 3 a b c d e f g j k Cycle1 3 4 1 4 Cycle2 3 2 2 Cycle 2 1 r = 3

Mininum-Latency, Resource-Constrained Scheduling Hu’s Algorithm Skipping to finish a b c d e f g j k Cycle1 3 4 1 4 Cycle2 3 2 Cycle3 2 Cycle4 1 r = 3

Mininum-Latency, Resource-Constrained Scheduling Hu’s is simplified problem Common Extensions: Multiple resource types Multi-cycle operation a b c d Cycle1 + - * Cycle2 /

Mininum-Latency, Resource-Constrained Scheduling List Scheduling - (minimum latency, resource-constrained version) Basic Idea - Hu’s algorithm for each resource type Input: graph, set of constraints R for each resource type 1) Label nodes based on max distance to output 2) For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to Rt operations from C based on priority, to current cycle Rt is the constraint on resource type t 3) Increment cycle, repeat from 2) until all nodes scheduled

Mininum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency Step 1 - Label nodes based on max distance to output (not shown, so you can see operations) *nodes given IDs for illustration purposes 2 ALUs (+/-), 2 Multipliers a b c d e f g j k + 4 1 * 2 3 - + * 6 * 5 + 7 - 8

Mininum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to Rt operations from C based on priority, to current cycle Rt is the constraint on resource type t 2 ALUs (+/-), 2 Multipliers Candidates a b c d e f g j k Mult ALU Cycle 1 2,3,4 1 + 2 3 4 - 1 * + * 6 * 5 + 7 - 8

Mininum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to Rt operations from C based on priority, to current cycle Rt is the constraint on resource type t 2 ALUs (+/-), 2 Multipliers Candidates a b c d e f g j k Mult ALU Cycle 1 2,3,4 1 * 2 3 + 4 - 1 + Cycle1 * 6 * Candidate, but not scheduled due to low priority 5 + 7 - 8

Mininum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to Rt operations from C based on priority, to current cycle Rt is the constraint on resource type t 2 ALUs (+/-), 2 Multipliers Candidates a b c d e f g j k Mult ALU Cycle 2,3,4 1 5,6 4 2 + 2 3 4 * - 1 + Cycle1 * 6 * 5 + 7 - 8

Mininum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to Rt operations from C based on priority, to current cycle Rt is the constraint on resource type t 2 ALUs (+/-), 2 Multipliers Candidates a b c d e f g j k Mult ALU Cycle 2,3,4 1 5,6 4 2 + 1 * 2 3 4 - + Cycle1 * 6 * 5 Cycle2 + 7 - 8

Mininum-Latency, Resource-Constrained Scheduling List scheduling - minimum latency For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to Rt operations from C based on priority, to current cycle Rt is the constraint on resource type t 2 ALUs (+/-), 2 Multipliers Candidates a b c d e f g j k Mult ALU Cycle 2,3,4 1 5,6 4 2 7 3 + 1 * 2 3 4 - + Cycle1 * 6 * 5 Cycle2 + 7 - 8

Mininum-Latency, Resource-Constrained Scheduling List scheduling - (minimum latency) Final schedule Note - ASAP would require more resources ALAP wouldn’t but in general, it would 2 ALUs (+/-), 2 Multipliers a b c d e f g j k Cycle1 + 1 * 2 3 + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8

Mininum-Latency, Resource-Constrained Scheduling Extensions: Multicycle operations Same idea (differences shown in red) Input: graph, set of constraints R for each resource type 1) Label nodes based on max distance to output 2) For each resource type t 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors) 4) Schedule up to (Rt - nt) operations from C based on priority, one cycle after predecessor Rt is the constraint on resource type t nt is the number of resource t in use from previous cycles Repeat from 2) until all nodes scheduled

Mininum-Latency, Resource-Constrained Scheduling Example: 2 ALUs (+/-), 2 Multipliers a b c d e f g j k Cycle1 * 2 3 + 1 + Cycle2 * 4 Cycle3 6 * Cycle4 5 * Cycle5 Cycle6 + 7 Cycle7 - 8

List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult) Steps (will be on test) 1) Label nodes with priority 2) Update candidate list for each cycle 3) Redraw graph to show schedule 6 2 3 * - 5 + + 1 + * 4 * 8 + 7 + - 10 9 - 11

List Scheduling (Min Latency) Your turn (2 ALUs, 1 Mult, Mults take 2 cycles) a b c d e f g 2 3 * * 1 + + 4 * 5 + 7

Minimum-Resource, Latency-Constrained Note that if no resource constraints given, schedule determines number of required resources Max # of each resource type used in a single cycle a b c d e f g 3 ALUs Cycle1 + + - 2 Mults Cycle2 * * + Cycle3 Cycle4 +

Minimum-Resource, Latency-Constrained Minimum-Resource Latency-Constrained Scheduling: For all schedules that have latency less than the constraint, find the one that uses the fewest resources Latency Constraint <= 4 Latency Constraint <= 4 a b c d e f g a b c d e f g Cycle1 + + - Cycle1 + + - Cycle2 * * Cycle2 * * + Cycle3 + Cycle3 Cycle4 + Cycle4 + 2 ALUs, 1 Mult 3 ALUs, 2 Mult

Minimum-Resource, Latency-Constrained List scheduling (Minimum resource version) Basic Idea 1) Compute latest start times for each op using ALAP with specified latency constraint 2) For each resource type 3) Determine candidate nodes 4) Compute slack for each candidate Slack = current cycle - latest possible cycle 5) Schedule ops with 0 slack Update required number of resources 6) Schedule ops that require no extra resources 7) Repeat from 2) until all nodes scheduled

Minimum-Resource, Latency-Constrained 1) Find ALAP schedule a b c d e f g j k + 1 * 2 + 3 4 - Last Possible Cycle 6 * 5 * Node LPC - 7 1 3 2 Latency Constraint = 3 cycles a b c d e f g j k Cycle1 1 * 2 + 3 + Cycle2 * 6 5 * Cycle3 4 7 - - Defines last possible cycle for each operation

Minimum-Resource, Latency-Constrained 2) For each resource type 3) Determine candidate nodes C 4) Compute slack for each candidate Slack = current cycle - latest possible cycle Candidates = {1,2,3,4} Node LPC Slack Cycle 1 3 2 2 a b c d e f g j k * 2 + 3 + 4 1 - * 6 * 5 - 7 Cycle 1

Minimum-Resource, Latency-Constrained 5)Schedule ops with 0 slack Update required number of resources 6) Schedule ops that require no extra resources Node LPC Slack Cycle Candidates = {1,2,3,4} 1 3 2 0 1 2 X Resources = 1 Mult, 2 ALU a b c d e f g j k * 2 + 3 + 4 1 - * 6 * 5 4 requires 1 more ALU - 7 Cycle 1

Minimum-Resource, Latency-Constrained 2)For each resource type 3) Determine candidate nodes C 4) Compute slack for each candidate Slack = current cycle - latest possible cycle Node LPC Slack Cycle Candidates = {4,5,6} 1 3 2 1 Resources = 1 Mult, 2 ALU a b c d e f g j k * 2 + 3 + 4 1 - * 6 * 5 - 7 Cycle 2

Minimum-Resource, Latency-Constrained 5)Schedule ops with 0 slack Update required number of resources 6) Schedule ops that require no extra resources Node LPC Slack Cycle Candidates = {4,5,6} 1 3 2 1 1 2 0 2 Resources = 2 Mult, 2 ALU a b c d e f g j k + 1 * 2 + 3 4 - * 6 * 5 - 7 Already 1 ALU - 4 can be scheduled Cycle 2

Minimum-Resource, Latency-Constrained 2)For each resource type 3) Determine candidate nodes C 4) Compute slack for each candidate Slack = current cycle - latest possible cycle Node LPC Slack Cycle Candidates = {7} 1 3 2 1 2 Resources = 2 Mult, 2 ALU a b c d e f g j k * 2 + 3 + 4 1 - * 6 * 5 - 7 Cycle 3

Minimum-Resource, Latency-Constrained Final Schedule Required Resources = 2 Mult, 2 ALU Node LPC Slack Cycle a b c d e f g j k 1 3 2 1 2 3 Cycle1 * 2 + 3 1 + Cycle2 * 6 * 4 - 5 Cycle3 - 7

Other extensions Chaining Multiple operations in a single cycle a b c d e f Multiple adds may be faster than 1 divide - perform adds in one cycle + + / + -

Summary Scheduling assigns each operation in a DFG a start time Done for each DFG in the CDFG Different Types Minimum Latency ASAP, ALAP Latency-constrained Minimum-latency, Resource-constrained Hu’s Algorithm List Scheduling Minimum-resource, latency-constrained

High-level Synthesis: Binding/Resource Sharing

Binding During scheduling, we determined: When ops will execute How many resources are needed We still need to decide which ops execute on which resources => Binding If multiple ops use the same resource =>Resource Sharing

Binding Basic Idea - Map operations onto resources such that operations in same cycle can’t use same resource 2 ALUs (+/-), 2 Multipliers Cycle1 * 2 3 + 1 + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 Mult1 ALU1 ALU2 Mult2

Binding Many possibilities Bad binding may increase resources, require huge steering logic, reduce clock, etc. 2 ALUs (+/-), 2 Multipliers Cycle1 * 2 3 + 1 + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 Mult1 ALU1 Mult2 ALU2

Binding Can’t do this 1 resource can’t perform multiple ops simultaneously! 2 ALUs (+/-), 2 Multipliers Cycle1 * 2 3 + 1 + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8

Binding How to automate? Compatibility Graph More graph theory Each node is an operation Edges represent compatible operations Compatible - if two ops can share a resource I.e. Ops that use same type of resource (ALU, etc.) and are scheduled to different cycles

Compatibility Graph ALUs Mults Cycle1 Cycle2 Cycle3 Cycle4 + 1 * + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 ALUs Mults 2 8 1 6 5 and 6 not compatible (same cycle) 5 7 4 2 and 3 not compatible (same cycle) 3

Compatibility Graph ALUs Mults Cycle1 2 3 + 1 * + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 ALUs Mults 2 8 1 6 5 7 4 Note - Fully connected subgraphs can share a resource (all involved nodes are compatible) 3

Compatibility Graph ALUs Mults Cycle1 2 3 + 1 * + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 ALUs Mults 2 8 1 6 5 7 4 Note - Fully connected subgraphs can share a resource (all involved nodes are compatible) 3

Compatibility Graph ALUs Mults Cycle1 2 3 + 1 * + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 ALUs Mults 2 8 1 6 5 7 4 Note - Fully connected subgraphs can share a resource (all involved nodes are compatible) 3

Compatibility Graph Binding: Find minimum number of fully connected graphs that cover entire graph Well-known problem: Clique partitioning (NP-complete) Cliques = { {2,8,7,4},{3},{1,5},{6} } ALU1 executes 2,8,7,4 ALU2 executes 3 MULT1 executes 1,5 MULT2 executes 6 2 8 1 6 7 4 5 3

Compatibility Graph Final Binding: ALUs Mults Cycle1 Cycle2 Cycle3 * 2 3 + 1 + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 ALUs Mults 2 8 1 6 5 7 4 3

Translation to Datapath b c d e f g h i Cycle1 * 2 3 + 1 + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 Add mux for each input Add register to each resource Add input to left mux for each left input in DFG Do same for right mux If only 1 input, remove mux a b c h d e i g e f Mux Mux Mux Mux Mult(1,5) ALU(2,7,8,4) Mult(6) ALU(3) Reg Reg Reg Reg

Compatibility Graph Alternative Final Binding: ALUs Mults Cycle1 * 2 3 + 1 + Cycle2 * 6 * 4 - 5 Cycle3 + 7 Cycle4 - 8 ALUs Mults 2 8 1 6 5 7 4 3

Left Edge Algorithm Alternative to clique partitioning Take scheduled DFG, rotate it 90 degrees a b f g * + - j k 1 2 3 4 5 6 7 8 Cycle1 Cycle2 Cycle5 Cycle6 Cycle4 Cycle3 Cycle7 c d e 2 ALUs (+/-), 2 Multipliers

Left Edge Algorithm 2 ALUs (+/-), 2 Multipliers * 4 Initialize right_edge to 0 Find a node N whose left edge is >= right_edge Bind N a particular resource Update right_edge to the right edge of N Repeat from 2) for nodes using the same resource type until right_edge passes all nodes Repeat from 1) until all nodes bound * 6 + 3 - 8 + + 7 2 * 5 * 1 Cycle1 Cycle2 Cycle3 Cycle4 Cycle5 Cycle6 Cycle7