Data Parallel FPGA Workloads: Software Versus Hardware Peter Yiannacouras J. Gregory Steffan Jonathan Rose FPL 2009.


1 Data Parallel FPGA Workloads: Software Versus Hardware. Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose. FPL 2009

2 FPGA Systems and Soft Processors
For computation in a digital system, custom HW (HDL + CAD) takes months but is faster, smaller, and uses less power; a soft processor (software + compiler) is easier and takes weeks. Soft processors are used in 25% of designs [source: Altera, 2009] and, being configurable, can compete with custom HW if the soft processor architecture is customized. Goal: simplify FPGA design. Target: data-level parallelism -> vector processors.

3 Vector Processing Primer

  // C code
  for (i = 0; i < 16; i++)
      c[i] = a[i] + b[i];

  // Vectorized code
  set    vl, 16
  vload  vr0, a
  vload  vr1, b
  vadd   vr2, vr0, vr1
  vstore vr2, c

Each vector instruction holds many units of independent operations: the single vadd performs vr2[i] = vr0[i] + vr1[i] for i = 0..15. With 1 vector lane, these 16 element operations execute one at a time.

4 Vector Processing Primer (continued)
The same code on 16 vector lanes: all 16 element operations vr2[i] = vr0[i] + vr1[i] of the vadd execute simultaneously, for a 16x speedup. Previous work on soft vector processors demonstrated: 1. Scalability, 2. Flexibility, 3. Portability.

5 Soft Vector Processors vs HW
A soft vector processor (software + compiler + vectorizer, weeks of effort) is scalable, fine-tunable, and customizable: performance scales by adding vector lanes (1, 2, ..., 16). Custom HW (HDL + CAD, months of effort) is faster, smaller, and uses less power. Question: how large is the gap between a soft vector processor and FPGA custom HW (and versus a scalar soft processor)?

6 Measuring the Gap
Implement the EEMBC benchmarks on a scalar soft processor, a soft vector processor, and as HW circuits; evaluate the speed and area of each; compare; draw conclusions.

7 VESPA Architecture Design (Vector Extended Soft Processor Architecture)
A 3-stage scalar pipeline (Icache, decode, register file, ALU, writeback) is coupled to a 3-stage vector control pipeline and a 6-stage vector pipeline (decode, replicate, hazard check, vector register file, lane ALUs, writeback). Lanes are 32-bit and share the Dcache; each lane has an ALU and memory unit, with multipliers in a subset of lanes. VESPA supports integer and fixed-point operations [VIRAM].

8 VESPA Parameters

  Description                 Symbol  Values
  Number of Lanes             L       1,2,4,8,...
  Memory Crossbar Lanes       M       1,2,...,L
  Multiplier Lanes            X       1,2,...,L
  Maximum Vector Length       MVL     2,4,8,...
  Width of Lanes (in bits)    W       1-32
  Instruction Enable (each)   -       on/off
  Data Cache Capacity         DD      any
  Data Cache Line Size        DW      any
  Data Prefetch Size          DPK     < DD
  Vector Data Prefetch Size   DPV     < DD/MVL

The parameters span the compute architecture, the instruction set architecture, and the memory hierarchy.

9 VESPA Evaluation Infrastructure
Software flow: the EEMBC C benchmarks are compiled with GCC, vectorized assembly subroutines are assembled with GNU as, and ld links them into an ELF binary; instruction set simulation (scalar uP + VPU) is used for verification. Hardware flow: the Verilog design goes through RTL simulation for cycle counts, through Altera Quartus II v8.1 for area and clock frequency, and runs on the TM4 board. A realistic and detailed evaluation.

10 Measuring the Gap (recap)
EEMBC benchmarks -> scalar soft processor, soft vector processor, and HW circuits -> evaluate speed and area of each -> compare -> conclusions.

11 Designing HW Circuits (with simplifying assumptions)
Each HW circuit is a datapath with memory request control fed from a DDR core. Cycle counts are idealized (modelled): assume the datapath is fed at full DDR bandwidth and calculate execution time from the data size. Area and clock frequency come from Altera Quartus II v8.1. These are optimistic HW implementations compared against real processors.

12 Benchmarks Converted to HW
EEMBC benchmarks (VIRAM-vectorized) on a Stratix III 3S200C2. VESPA clock: 120-140 MHz. HW clock: 275-475 MHz. HW advantage: 3x faster clock frequency.

13 Performance/Area Space (vs HW)
Plotting slowdown vs HW against area vs HW, with the optimistic HW itself at (1,1): the scalar soft processor is 432x slower and 7x larger than HW; the fastest VESPA is 17x slower and 64x larger. Soft vector processors can significantly close the performance gap.

14 Area-Delay Product
Commonly used to measure efficiency in silicon: it considers both performance and area, and is the inverse of performance-per-area. Calculated using: (Area) x (Wall Clock Execution Time).

15 Area-Delay Space (vs HW)
The scalar soft processor has roughly 2900x worse area-delay than HW; the best VESPA is roughly 900x worse, making VESPA up to 3 times better in silicon usage than scalar.

16 Reducing the Performance Gap
Previously, VESPA was 50x slower than HW. Reducing loop overhead: decoupled pipelines (+7% speed). Improving data delivery: parameterized cache (2x speed, 2x area) and data prefetching (+42% speed). Combined, these enhancements give a 3x performance improvement and were key to reducing the gap.

17 Vector Memory Crossbar
A 16-lane VESPA executing vld.w (load 16 sequential 32-bit words): the vector memory crossbar routes the lanes' requests to a 4KB Dcache with 16B lines, so a single vector load spans several cache lines and takes several cache accesses.

18 Vector Memory Crossbar: Wider Cache Line Size
The same vld.w against a 16KB Dcache with 64B lines (a 4x wider line): all 16 sequential 32-bit words fall in one line. Result: 2x speed for 2x area, from reduced cache accesses plus some prefetching.

19 Hardware Prefetching Example
No prefetching: each vld.w that misses in the Dcache pays a 10-cycle penalty going to DDR. Prefetching 3 blocks: the first vld.w misses (10-cycle penalty) but also fetches 3 additional blocks, so the following vld.w hits. Result: 42% speed improvement from reduced miss cycles.

20 Reducing the Area Gap (by Customizing the Instruction Set)
FPGAs can be reconfigured between applications. Observations: not all applications (1) operate on 32-bit data types or (2) use the entire vector instruction set. Therefore, eliminate unused hardware.

21 VESPA Parameters (customization targets)

  Description                 Symbol  Values
  Number of Lanes             L       1,2,4,8,...
  Maximum Vector Length       MVL     2,4,8,...
  Width of Lanes (in bits)    W       1-32      <- reduce width
  Memory Crossbar Lanes       M       1,2,...,L
  Multiplier Lanes            X       1,2,...,L
  Instruction Enable (each)   -       on/off    <- subset instruction set
  Data Cache Capacity         DD      any
  Data Cache Line Size        DW      any
  Data Prefetch Size          DPK     < DD
  Vector Data Prefetch Size   DPV     < DD/MVL

22 Customized VESPA vs HW
In the slowdown-vs-HW versus area-vs-HW space, up to 45% of VESPA's area is saved with width reduction and instruction subsetting.

23 Summary
VESPA is more competitive with HW design: the fastest VESPA is only 17x slower than HW, while the scalar soft processor was 432x slower. Attacking loop overhead and data delivery was key (decoupled pipelines, cache tuning, data prefetching), and further enhancements can reduce the gap more. VESPA also improves efficiency of silicon usage: 900x worse area-delay than HW, versus 2900x for the scalar soft processor, and subsetting/width reduction can further reduce this to 561x. Together these enable software implementation of non-critical data-parallel computation.

24 Thank You! Stay tuned for public release: 1. GNU assembler ported for VIRAM (integer only), 2. VESPA hardware design (DE3 ready)

25 Breaking Down Performance
Execution time of the benchmark loops decomposes into three components: (a) iteration-level parallelism across loop iterations, (b) cycles per iteration, and (c) clock period. Measure the HW advantage in each of these components.

26 Breakdown of Performance Loss (16-lane VESPA vs HW)

  Benchmark     Clock Frequency  Iteration-Level Parallelism  Cycles Per Iteration
  autcor        2.6x             1x                           9.1x
  conven        3.9x             1x                           6.1x
  rgbcmyk       3.7x             0.375x                       13.8x
  rgbyiq        2.2x             0.375x                       19.0x
  ip_checksum   3.7x             0.5x                         4.8x
  imgblend      3.6x             1x                           4.4x
  GEOMEAN       3.2x             0.64x                        8.2x

Cycles per iteration is the largest factor; clock frequency was previously worse but has recently improved. Total HW advantage: 17x.

27 1-Lane VESPA vs Scalar
1. Efficient pipeline execution
2. Large vector register file for storage
3. Amortization of loop control instructions
4. More powerful ISA (VIRAM vs MIPS): support for fixed-point operations, predication, and built-in min/max/absolute instructions
5. Execution in both the scalar processor and the vector co-processor
6. Manual vectorization in assembly versus scalar GCC

28 Measuring the Gap
Scalar: MIPS soft processor, compiled from the EEMBC C benchmarks (complete & real). VESPA: VIRAM soft vector processor, from hand-vectorized assembly (complete & real). HW: custom Verilog circuit for each benchmark (simplified & idealized). Both processors are compared against the HW circuits.

29 Reporting Comparison Results
Performance is wall clock time; area is actual silicon area.
  HW Speed Advantage = (Execution Time of Processor) / (Execution Time of Hardware)
  HW Area Advantage = (Area of Processor) / (Area of Hardware)
Comparisons: 1. Scalar (C) vs HW (Verilog); 2. VESPA (vector assembly) vs HW (Verilog).

30 Cache Design Space: Performance (Wall Clock Time)
The best cache design almost doubles the performance of the original VESPA. Clock frequencies across the designs range from 122 to 129 MHz; more pipelining/retiming could reduce the clock frequency penalty. Cache line size matters more than cache depth (lots of streaming).

31 Vector Length Prefetching: Performance
1*VL prefetching provides good speedup (21%) without tuning; 8*VL is best (29% average, 2.2x peak), with no cache pollution. One benchmark is not receptive to prefetching.

32 Overall Memory System Performance
With 16 lanes, the wider cache line (4KB vs 16KB configurations) plus prefetching reduces memory unit stall cycles significantly and eliminates all but 4% of miss cycles.

