
1 Resource Sharing in LegUp

2-4 Resource Sharing in High Level Synthesis Resource sharing is a well-known technique in HLS to reduce circuit area by sharing functional units. E.g., consider a C program that performs division twice: z = a / b and w = c / d. Without sharing, the circuit instantiates two dividers; with sharing, a single divider computes both z and w, with 2-to-1 multiplexers selecting between the operand pairs (a, b) and (c, d).
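
A minimal Verilog sketch of the shared form (module and signal names are illustrative, not LegUp's generated names; sel would come from the HLS controller's FSM):

module shared_div #(parameter W = 32) (
    input              clk,
    input              sel,          // 0: compute z = a / b; 1: compute w = c / d
    input  [W-1:0]     a, b, c, d,
    output reg [W-1:0] z, w
);
    wire [W-1:0] num = sel ? c : a;  // 2-to-1 muxes pick the operand pair
    wire [W-1:0] den = sel ? d : b;
    wire [W-1:0] quo = num / den;    // the single shared divider
    always @(posedge clk)
        if (sel) w <= quo;           // each result lands in its own register
        else     z <= quo;
endmodule

The unshared form would instantiate two '/' operators; the shared form trades the second divider for two word-wide muxes, a good trade because a divider is far larger than a mux.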

5 Resource Sharing in High Level Synthesis Intuitively, large operators such as dividers, remainder units, and multipliers are beneficial to share. But because multiplexers are relatively expensive to implement in FPGAs, smaller operators (adders, bitwise operations) are generally not shared.

6-10 Example – Sharing a Bitwise AND Consider a bitwise AND: each output bit is a 2-input function, which fits in a 2-input LUT. A 2-to-1 MUX, however, is a 3-input function, requiring a 3-input LUT. Sharing therefore seems like a bad idea: the multiplexing needed to share the AND costs more logic than the AND itself. But in fact, this depends on the LUT architecture.

11 Project Overview and Goals Determine conclusively for which operators sharing is beneficial in FPGAs. Consider architectural impact: 4-input LUT architectures (Cyclone II) and 6-input, adaptive-LUT architectures (Stratix IV). Identify and analyze the benefits of sharing patterns of smaller operations (e.g. a multiplication followed by an addition).

12-13 Stratix IV, Adaptive Logic Modules (ALM) Each ALM contains 2 Adaptive LUTs (ALUTs), which can implement a function of between 4 and 7 inputs. [Figure: Stratix IV ALM structure, contrasted with the fixed 4-input LUT logic element of Cyclone II.]

14-20 ALM Example Consider two circuits: Circuit 1 is implemented using 100 3-input LUTs; Circuit 2 is implemented using 45 3-input LUTs and 45 5-input LUTs. Circuit 1 requires 50 ALMs, because each ALM can implement two small LUTs. Circuit 2 requires only 45 ALMs, even though it contains more logic, because each ALM can pack one 5-input LUT together with one 3-input LUT.

21-22 Resource Sharing in Stratix IV The circuits created by LegUp tend to use mostly 2- and 3-input functions (LUTs). [Chart: percentage of small ALUTs per benchmark: 71%, 70%, 78%, 45%, 48%, 57%, 65%, 55%, 53%, 75%. Average: 62%.]

23-24 Sharing Single Operations Given that LegUp-generated circuits contain mostly 2-3 input functions, the number of ALMs can be reduced by packing many "smaller LUTs" into fewer "larger LUTs". Revisit the example of the bitwise AND.

25-29 Example – Sharing a Bitwise AND Consider a 32-bit bitwise AND: it requires 32 LUTs, one 2-input LUT per output bit. Two unshared 32-bit ANDs therefore require 64 2-input LUTs. Sharing them requires only 32 LUTs, but each grows into a 5-input LUT (one operand bit from each of the four inputs, plus the select).
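
As a hedged sketch (names illustrative, not LegUp's generated RTL), the shared version computes each output bit as a 5-input function of the select and the four corresponding operand bits, which is why it maps to 32 5-input LUTs:

module shared_and32 (
    input         sel,           // selects which of the two ANDs is computed
    input  [31:0] a, b, c, d,
    output [31:0] out
);
    // Per bit: out[i] = sel ? (a[i] & b[i]) : (c[i] & d[i]),
    // a 5-input function that fits one adaptive LUT on Stratix IV.
    assign out = sel ? (a & b) : (c & d);
endmodule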

30 Sharing Single Operations In the example of bitwise operations, we can reduce the number of LUTs by half at the expense of increasing their size. However, if a circuit contains mostly small LUTs, its ALMs are under-utilized and can absorb these larger logic functions. Therefore, sharing even small operations reduces ALUT and ALM usage.

31-33 Variable Liveness Analysis Consider next if each bitwise AND had its output stored in a register: unshared, the two 32-bit ANDs need 64 registers; shared, 32 registers suffice, provided the two values' lifetimes are independent (non-overlapping).

34-38 Evaluating Area of Single Operators Goal: determine, for each LUT architecture, which single operators produce an area reduction when shared. 1. A Verilog module was created for each single LLVM instruction, with multiplexing ("sharable") and without ("unsharable"). 2. Registers were placed at the inputs and outputs to isolate delay. 3. Area and speed results were obtained for each instruction, in each configuration, for both Cyclone II and Stratix IV.
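
A hedged sketch of the per-instruction measurement setup (the harness actually used in the experiments may differ): registering all inputs and outputs isolates the operator, so timing analysis reports only the register-to-register path through it.

module eval_harness #(parameter W = 32) (
    input              clk,
    input              sel_i,
    input  [W-1:0]     a_i, b_i, c_i, d_i,
    output reg [W-1:0] out_o
);
    reg         sel;
    reg [W-1:0] a, b, c, d;
    wire [W-1:0] dut = sel ? (a & b) : (c & d);           // "sharable" AND under test
    always @(posedge clk) begin
        {sel, a, b, c, d} <= {sel_i, a_i, b_i, c_i, d_i}; // input registers
        out_o <= dut;                                     // output register
    end
endmodule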

39 Evaluating Area of Single Operators [Table: per-operator area for the sharable and unsharable versions, with ratios in brackets.] Sharing is beneficial when the ratio (in brackets) is less than 2. More operators show a benefit in Stratix IV, due to its flexible LUT architecture.

40-47 Sharing Computational Patterns So far, ALUTs and registers were saved by sharing single operations. Consider instead a chain of operations (the slides show a small dataflow graph of an AND, two adds, and a subtract) that occurs twice in the program: the entire chain can be shared as one unit. By sharing chains instead of only single operations, multiplexing is needed only at the inputs of the chain rather than in front of every operation, so the amount of multiplexing is reduced and ALUT usage decreases further. A sketch follows.
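
A hedged sketch of sharing a whole chain (the exact graph shape is assumed; a three-operation chain is shown for brevity): muxes appear only at the pattern boundary, while the interior operators are instantiated once, with no muxes between them.

module shared_pattern #(parameter W = 32) (
    input          sel,              // selects which occurrence executes
    input  [W-1:0] a0, b0, c0, d0,   // inputs of occurrence 0
    input  [W-1:0] a1, b1, c1, d1,   // inputs of occurrence 1
    output [W-1:0] out
);
    wire [W-1:0] a = sel ? a1 : a0;  // muxing happens once, at the boundary
    wire [W-1:0] b = sel ? b1 : b0;
    wire [W-1:0] c = sel ? c1 : c0;
    wire [W-1:0] d = sel ? d1 : d0;
    assign out = (a & b) + (c - d);  // interior &, -, + are shared mux-free
endmodule

Sharing the three operators individually would require muxes in front of each of them; sharing the chain needs only the four boundary muxes.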

48 Sharing Computational Patterns Computational patterns are represented as directed graphs with a single output ("root") node; each node is an instruction. [Figure: a size-5 graph with nodes +, –, *, +, and &, fed by three inputs.]

49 Sharing Computational Patterns Pattern Sharing Algorithm: 1. Find all computational patterns in the software program. 2. Sort patterns by equivalent functionality. 3. Determine which patterns are candidates for sharing and choose an (ideally optimal) pairing.

50-57 1. Finding all Computational Patterns LLVM produces a data flow graph to represent each compiled C program. The first step of pattern sharing is to find all subgraphs that are candidates for sharing. [Figure: a candidate subgraph, including a const node, is grown one instruction at a time from size 1 up to size 5; a growth step that would create a second root is disallowed, since only one root is allowed.]

58-59 2. Sorting Patterns By Functional Equivalence [Figure: (a) a graph (+, –, <<, +, & over inputs A-E) with a re-converging path; (b) a graph that is functionally identical to (a) but topologically different due to commutativity.] Patterns are therefore sorted by functional equivalence, as opposed to just topological equivalence.

60-65 3. Decide which Pattern Instances to Share So far, steps 1 and 2 have provided sets of equivalent patterns. For example, we may have found 4 instances (A, B, C, D) of one pattern (the figure shows a subtract feeding an add). Our goal is to split these 4 into pairs (create groups of 2) so that each hardware unit implements two patterns. But which combination of pairs is best?

66-67 Variable Lifetimes Optimization Prefer to share patterns with non-overlapping lifetimes; this saves registers. [Figure: patterns P1 and P2 scheduled over cycles 1-6; in (a) values A and B have overlapping lifetimes, in (b) values A and B have non-overlapping lifetimes, so they can share a register. See the sketch below.]
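
A hedged sketch of the register saving (the write enables are assumed to come from the schedule's FSM): because A and B are never live at the same time in case (b), both results can be committed to a single register.

module shared_result_reg #(parameter W = 32) (
    input              clk,
    input              we_p1, we_p2,   // asserted in the cycle each pattern finishes
    input  [W-1:0]     p1_result, p2_result,
    output reg [W-1:0] r               // holds A, then later B; never both
);
    always @(posedge clk)
        if      (we_p1) r <= p1_result;
        else if (we_p2) r <= p2_result;
endmodule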

68-69 Shared Input Variable Optimization Prefer to share patterns with shared input variables; this reduces multiplexing cost. [Figure: adder 1 computes A + B, adder 2 computes A + C, and adder 3 computes D + E; sharing adders 1 and 2 yields one adder with input A and a mux over B and C.]
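
A sketch using the operand names from the slide: sharing adder 1 (A + B) with adder 2 (A + C) needs only one mux, because A feeds both occurrences; pairing either with adder 3 (D + E) would need two muxes.

module shared_adder #(parameter W = 32) (
    input          sel,               // 0: compute A + B; 1: compute A + C
    input  [W-1:0] A, B, C,
    output [W-1:0] sum
);
    wire [W-1:0] rhs = sel ? C : B;   // the only mux: A is a shared input
    assign sum = A + rhs;
endmodule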

70-71 Bit Width Optimization Adder C would be optimized by synthesis tools because only six output bits are needed. Sharing adder C with A or B would force a 6-bit addition to be implemented using a 32-bit adder. [Figure: 32-bit adders A and B, and adder C, whose result is masked down to six bits.]
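
A sketch of why sharing hurts here (the 6-bit mask value is inferred from the slide's figure): because adder C's result is truncated to six bits, synthesis trims C to a 6-bit adder when it is left unshared; folding it onto a shared 32-bit adder forfeits that trimming.

module adder_c (
    input  [31:0] x, y,
    output [31:0] z
);
    // Only six output bits are consumed downstream, so synthesis
    // implements just a 6-bit addition when this adder is unshared.
    assign z = (x + y) & 32'h3F;
endmodule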

72 3. Decide which Pattern Instances to Share Considering these optimizations, a cost function is used to select between possible pairs of graphs. Once pairs have been determined, the Binding phase of LegUp is modified to implement each pair of computational patterns with the same hardware.

73 Results

74 [Table: area results.] Geomean: 4.9% improvement for pattern sharing (i.e. between columns 2 and 3).

75 [Table: area results.] Geomean: 4.2% improvement for pattern sharing (i.e. between columns 2 and 3).

76 [Chart: percentage of small ALUTs per benchmark after sharing: 48%, 57%, 31%, 41%, 55%, 43%, 42%, 40%, 36%, 44%, 64%. Average: 45% (was 62%).]

77 Summary FPGA logic architecture has a significant impact on resource sharing. Resource sharing can provide >10% area reduction. Future work: alter scheduling to favor the creation of certain patterns, providing more sharing opportunities. A paper on this work is under review for FPGA 2012; it contains many details, and an advance copy is available.

