1 Combined Instruction and Loop Level Parallelism for Regular Array Synthesis on FPGAs
S. Derrien, S. Rajopadhye, S. Sur-Kolay*
IRISA, France; *ISI Calcutta
ISSS'01 (ISSS 2001, Montréal)

2 Outline
- Context and motivation
- Space-time transformations
- Transformation flow
- Experimental validation
- Conclusion

3 High performance IP-Cores
- High-level specifications
  - Matlab, C, C++ or a specific language (Alpha)
  - Targeting nested loops
  - Core must be formally correct
- Hard/Soft co-generation
  - Hardware RTL module (VHDL)
  - Simple driver API (C)
- Regular Processor Arrays
  - High data throughput, specialized datapath
  - Well suited for VLSI/FPGA

4 Targeting FPGAs
- Poor clock speed
  - Typical clock speed is 1/10 of ASIC speed
  - Very design dependent
  - Good at low-precision arithmetic (8 bits)
  - Really bad for complex operations (floats)
- But high performance
  - Optimized designs can compete with ASICs
  - Performance gain due to parallelism
  - Pipelining comes for free (lots of DFFs)

5 Processor Array Synthesis
Matrix multiplication example:
    For i:=1 to 3
      For j:=1 to 3
        For k:=1 to 3
          C[i,j] := C[i,j] + A[i,k]*B[k,j];
        End for;
      End for;
    End for;
- Iteration domain extracted from loop bounds
- Data dependence vectors between iterations
- Iteration domain is projected on the processor grid
- Iterations are scheduled on their associated PE
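As a rough illustration of the mapping described on this slide, the sketch below enumerates the 3x3x3 iteration domain, projects it along k onto a 2D PE grid, and gives each iteration a time step. The projection direction, the uniformized dependence vectors and the linear schedule t = i + j + k are assumptions chosen for illustration; the slide does not say which ones the authors use, and the function names are hypothetical.

```python
from itertools import product

N = 3  # loop bounds from the slide: i, j, k in 1..3

def allocate(i, j, k):
    """Assumed projection along k: iteration (i, j, k) runs on PE (i, j)."""
    return (i, j)

def schedule(i, j, k):
    """Assumed linear schedule. It respects the assumed dependence vectors
    (1,0,0), (0,1,0) and (0,0,1), each of which advances time by one step."""
    return i + j + k

def map_iterations(n=N):
    """Group the iterations of the n x n x n domain by PE, sorted by time."""
    mapping = {}
    for i, j, k in product(range(1, n + 1), repeat=3):
        pe = allocate(i, j, k)
        mapping.setdefault(pe, []).append((schedule(i, j, k), (i, j, k)))
    for work in mapping.values():
        work.sort()
    return mapping

if __name__ == "__main__":
    for pe, work in sorted(map_iterations().items()):
        print(pe, [t for t, _ in work])
```

Under these assumptions each of the 9 PEs executes 3 iterations, one per time step, so no PE is asked to run two iterations at once.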

6 PE Architecture
- Combinational datapath connected to registers
- Unidirectional flow and pipelined connections
- N classes of registers (N = loop dimension)
- One critical path for each register class
- Operating frequency set by the worst critical path
Figure labels:
- Temporal registers act as local memory
- Spatio-temporal registers must be disambiguated
- Spatial registers serve as interconnect between PEs

7 Conclusion
- Simplistic schedule inside a PE (no ILP)
- Complex loop bodies induce poor performance
  - Floating-point matrix multiplication operating at 12 MHz
  - 2D SOR on 16 bits operating at 40 MHz
- The PE architecture is not suited to FPGAs!
- Proposed solution: allow pipelined datapaths by altering the PE architecture through simple space-time transformations

8 Retiming
Figure labels: Tc = 1 logic level; Tc = 2 logic levels
- Move registers to minimize the clock period
- Handled by most FPGA RTL synthesis tools
- Effective only if the design has a sufficient number of registers
- We just need to add registers in the PE!
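To make the "sufficient number of registers" point concrete, here is a minimal sketch that models a datapath as a feed-forward chain of combinational blocks and tries every placement of a given number of registers between them. This is a toy model, not the retiming algorithm used by synthesis tools (which works on general graphs with cycles); `clock_period` and `best_retiming` are hypothetical names.

```python
from itertools import combinations

def clock_period(delays, cuts):
    """Longest register-to-register delay when a register sits before
    each block index listed in cuts (cuts must be sorted ascending)."""
    period, start = 0, 0
    for cut in list(cuts) + [len(delays)]:
        period = max(period, sum(delays[start:cut]))
        start = cut
    return period

def best_retiming(delays, n_registers):
    """Exhaustively place n_registers in the gaps between blocks and
    return (clock period, register positions) for the best placement.
    Requires 0 <= n_registers <= len(delays) - 1."""
    gaps = range(1, len(delays))
    return min(
        (clock_period(delays, cuts), cuts)
        for cuts in combinations(gaps, n_registers)
    )

if __name__ == "__main__":
    chain = [1, 1, 1, 1]  # four blocks of one logic level each
    for r in range(4):
        print(r, best_retiming(chain, r))
```

With four unit-delay blocks, zero registers give a period of 4 logic levels, one register gives 2, and three registers give 1: the achievable clock period is bounded by how many registers are available to move.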

9 Serialization (1/2)
- Regroup PEs into clusters
- Iterations in a cluster are executed sequentially
- Throughput is slowed down by the cluster size
- Local memory is duplicated
Figure captions: original PE array before clustering; array after clustering

10 Serialization (2/2)
- Decomposed along each spatial dimension
- Serialization impacts the PE according to simple transformation rules
- Loop-level parallelism is traded for instruction-level parallelism
Figure labels:
- Temporal registers are duplicated by the serialization factor of axis i
- Feedback loops are created for all spatial paths along the i-th axis
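The bookkeeping behind clustering can be sketched as follows. The code assumes 0-based PE coordinates, one serialization factor per spatial axis, and that the per-axis effects multiply across axes (so the cluster size is the product of the factors); these are illustrative assumptions, and `cluster_of` and `serialize` are hypothetical names.

```python
def cluster_of(pe, factors):
    """Return (cluster coordinates, slot inside the cluster) for a PE,
    given one serialization factor per spatial axis."""
    cluster = tuple(c // s for c, s in zip(pe, factors))
    slot = tuple(c % s for c, s in zip(pe, factors))
    return cluster, slot

def serialize(grid_shape, factors):
    """Summarize the effect of clustering a PE grid along each axis."""
    cluster_size = 1
    for s in factors:
        cluster_size *= s
    # Ceiling division: a partially filled cluster still needs a physical PE.
    clustered_shape = tuple(-(-n // s) for n, s in zip(grid_shape, factors))
    return {
        "clustered_shape": clustered_shape,
        "throughput_slowdown": cluster_size,       # iterations run sequentially
        "temporal_register_copies": cluster_size,  # local memory is duplicated
    }

if __name__ == "__main__":
    print(serialize((4, 4), (2, 2)))
    print(cluster_of((3, 1), (2, 2)))
```

For a 4x4 array serialized by 2 along both axes, this yields a 2x2 array of physical PEs, each handling 4 virtual PEs in turn.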

11 Skewing
Figure caption: skewing by factor 2 along the vertical PE axis
- Affects latency, but not throughput
- Adds temporal registers along the spatial axis
- Skewing can be used before and after serialization
- Cannot reduce the original temporal critical path

12 Problem formulation
- Find the optimal set of transformation parameters
- Minimize the number of registers
- Preserve loop-level parallelism
Figure annotations:
- T_c = 86 ns: requires d_j = 6 stages to obtain T_c = 15 ns
- T_c = 70 ns: requires d_i = 5 stages to obtain T_c = 15 ns
- T_c = 60 ns: requires d_t = 4 stages to obtain T_c = 15 ns
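The stage counts quoted on this slide are consistent with dividing each critical path by the target clock period and rounding up. The sketch below checks that arithmetic; it assumes the path can be cut into perfectly balanced stages, which real retiming may not achieve, and `stages_needed` is a hypothetical helper.

```python
def stages_needed(critical_path_ns, target_period_ns):
    """Smallest number of pipeline stages such that an evenly split
    critical path fits in the target clock period (integer ns inputs)."""
    return -(-critical_path_ns // target_period_ns)  # ceiling division

if __name__ == "__main__":
    target = 15  # ns, target T_c from the slide
    for name, tc in (("d_j", 86), ("d_i", 70), ("d_t", 60)):
        print(name, stages_needed(tc, target))
```

This reproduces d_j = 6, d_i = 5 and d_t = 4 for the 86 ns, 70 ns and 60 ns paths.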

13 Proposed heuristic
1. Assume the serialization factor of each axis i is given (partitioning step)
2. Sort the PE space axes in ascending order of T_c
3. Determine all the skewing parameters
4. For each PE axis i do
   i. Pre-serialization skewing along axis i
   ii. Serialization along axis i
5. For each PE axis i do
   i. Post-serialization skewing along axis i

14 Transformation example
1. Pre-skew along axis y by factor 1.
2. Serialization along axis y by factor 2.
3. Pre-skew along axis x by factor 2.
4. Serialization along axis x by factor 2.
5. Post-skew along axis y by factor 1.
6. Apply retiming.
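The ordering of this example can be reproduced by a small plan generator: pre-skew then serialize each axis in turn, then apply the post-skews, then retime. The sketch below is only a reading of the step order shown on the slides, not the authors' tool; the dictionary keys and the function name are hypothetical, and the axes are assumed to be already sorted and their parameters already chosen.

```python
def transformation_plan(axes):
    """Build the ordered list of transformations for pre-sorted axes.
    Each axis is a dict with keys: name, pre, sigma, post."""
    plan = []
    for axis in axes:
        if axis["pre"] > 0:
            plan.append(("pre-skew", axis["name"], axis["pre"]))
        if axis["sigma"] > 1:
            plan.append(("serialize", axis["name"], axis["sigma"]))
    for axis in axes:
        if axis["post"] > 0:
            plan.append(("post-skew", axis["name"], axis["post"]))
    plan.append(("retime", None, None))  # retiming is always the last step
    return plan

if __name__ == "__main__":
    example = [
        {"name": "y", "pre": 1, "sigma": 2, "post": 1},
        {"name": "x", "pre": 2, "sigma": 2, "post": 0},
    ]
    for step in transformation_plan(example):
        print(step)
```

Run on the parameters of the example (with no post-skew along x), it emits the six steps in the order listed above.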

15 Experimental validation
- Chosen benchmarks
  - Matrix multiplication (8 bits, 16 bits and floats)
  - Adaptive filter (DLMS) (8 bits, 16 bits and floats)
  - String matching (DNA, protein)
- Performance metrics
  - A_pe: PE area usage
  - f_pe: PE operating frequency
  - Raw performance = N_pe × f_pe
  - N_pe approximated by 1/A_pe
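The raw performance metric can be written out as a short calculation. The sketch below follows the slide's approximation N_pe = 1/A_pe; the function names are hypothetical and the numbers in the usage example are invented purely to show the arithmetic, not measurements from the talk.

```python
def raw_performance(area_pe, freq_pe):
    """Raw performance N_pe * f_pe with N_pe approximated by 1 / A_pe."""
    return freq_pe / area_pe

def performance_gain(area_before, freq_before, area_after, freq_after):
    """Ratio of raw performance after the transformations to before."""
    return raw_performance(area_after, freq_after) / raw_performance(
        area_before, freq_before
    )

if __name__ == "__main__":
    # Hypothetical values: 25% more area per PE, 5x higher clock frequency.
    print(performance_gain(1.0, 10.0, 1.25, 50.0))
```

The metric makes the trade-off explicit: extra pipeline registers increase A_pe and so reduce the number of PEs that fit, and the transformation pays off only when the frequency gain outweighs that area overhead.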

16 Area overhead
Area overhead decreases as the area cost of the combinational datapath grows

17 Frequency improvement
Speed improvement of up to one order of magnitude (for floats)

18 Raw performance
Speed improvement of up to one order of magnitude (for floats)

19 Conclusion
- Extracts very fine-grain ILP from the datapath as a whole
- The space-time transformations are simple but yield impressive results
- Preserves circuit correctness and the regularity and simplicity of the control logic
- Performance benefits are limited by the lack of place-and-route-aware retiming tools

