Presentation transcript:

1 Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008

2 Soft Processors in FPGA Systems
Soft processor (C + compiler): easier. Custom logic (HDL + CAD): faster, smaller, less power.
Data-level parallelism → soft vector processors.
Configurable – how can we make use of this?

3 Vector Processing Primer

    // C code
    for (i = 0; i < 16; i++)
        b[i] += a[i];

    // Vectorized code
    set    vl, 16
    vload  vr0, b
    vload  vr1, a
    vadd   vr0, vr0, vr1
    vstore vr0, b

Each vector instruction holds many units of independent operations.
[Figure: with 1 vector lane, the vadd steps through b[0]+=a[0] … b[15]+=a[15] one element at a time]

4 Vector Processing Primer
(Same C and vectorized code as the previous slide.)
Each vector instruction holds many units of independent operations.
[Figure: with 16 vector lanes, the vadd performs b[0]+=a[0] … b[15]+=a[15] in parallel → 16x speedup]
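To make the lane/cycle trade-off concrete, here is a minimal C sketch (my own model, not VESPA code) of executing the vadd above: each cycle, every lane performs one element operation, so VL=16 elements take 16 cycles on 1 lane but a single cycle on 16 lanes – the 16x speedup shown on the slide.

    #include <stdio.h>

    /* Minimal lane model (an assumption, not VESPA RTL): execute b[i] += a[i]
     * for VL elements on NUM_LANES lanes, one element per lane per cycle.    */
    #define VL        16
    #define NUM_LANES 16   /* try 1, 4, 8, 16 and watch the cycle count */

    int main(void) {
        int a[VL], b[VL];
        for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 100 + i; }

        int cycles = 0;
        for (int base = 0; base < VL; base += NUM_LANES) {      /* one cycle */
            for (int lane = 0; lane < NUM_LANES && base + lane < VL; lane++)
                b[base + lane] += a[base + lane];               /* one op per lane */
            cycles++;
        }
        printf("VL=%d on %d lane(s): %d cycle(s)\n", VL, NUM_LANES, cycles);
        return 0;
    }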

5 Sub-Linear Scalability
The vector lanes are not being fully utilized.

6 Where Are The Cycles Spent?
With 16 lanes, 67% (about 2/3) of cycles are spent waiting on the memory unit, often due to cache misses.

7 Our Goals
1. Improve the memory system: better cache design, hardware prefetching.
2. Evaluate improvements for real: using a complete hardware design (in Verilog), on real FPGA hardware (Stratix 1S80C6), running full benchmarks (EEMBC), from off-chip memory (DDR-133MHz).

8 Current Infrastructure
SOFTWARE: EEMBC C benchmarks and vectorized assembly subroutines are compiled with GCC and GNU as (with vector support) and linked with ld into an ELF binary, which runs on the MINT instruction set simulator (scalar µP + VPU).
HARDWARE: the Verilog design is simulated in Modelsim (RTL simulator) for cycle counts and verification against MINT, and synthesized with Altera Quartus II v8.0 for area and frequency.
[Figure: block diagram of the vector pipeline – vector control/scalar/vector register files, replicate and hazard-check logic, per-lane ALUs with multiply/saturate and round-shift units, memory unit, writeback]

9 VESPA Architecture Design
Scalar pipeline: 3-stage. Vector control pipeline: 3-stage. Vector pipeline: 6-stage.
Supports integer and fixed-point operations, and predication. 32-bit datapaths. Shared Dcache between the scalar and vector units.
[Figure: Icache and Dcache feeding the three pipelines – decode, register files, replicate, hazard check, per-lane ALUs with multiply/saturate and round-shift, memory unit, writeback]

10 Memory System Design
VESPA with 16 lanes: the vector lanes connect through a vector memory crossbar to a shared Dcache (4KB, 16B lines), backed by DDR with a 9-cycle access.
Example: a vld.w loads 16 contiguous 32-bit words.

11 Memory System Design
The same system with a 16KB Dcache and 64B lines: the vld.w of 16 contiguous 32-bit words now needs 4x fewer cache accesses, and each wider line provides some prefetching.
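The 4x figure is simple arithmetic: 16 contiguous 32-bit words occupy 64 bytes, which spans four 16B lines but just one 64B line. A small C sketch of that calculation, assuming an aligned unit-stride load (an unaligned one may touch one extra line):

    #include <stdio.h>

    /* Cache lines touched by an aligned, unit-stride vector load
     * (assumption: the load starts on a line boundary).          */
    static int lines_touched(int vl, int elem_bytes, int line_bytes) {
        int total_bytes = vl * elem_bytes;
        return (total_bytes + line_bytes - 1) / line_bytes;   /* ceiling divide */
    }

    int main(void) {
        /* vld.w: 16 contiguous 32-bit (4-byte) words */
        printf("16B lines: %d cache accesses\n", lines_touched(16, 4, 16));  /* 4 */
        printf("64B lines: %d cache access\n",  lines_touched(16, 4, 64));   /* 1 */
        return 0;
    }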

12 Improving Cache Design
Vary the cache depth and cache line size using a parameterized design: line sizes of 16, 32, 64, 128 bytes; cache depths of 4, 8, 16, 32, 64 KB.
Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware.
Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs.
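For reference, a tiny sketch that enumerates the swept design space from the parameter values above (the loop and print format are purely illustrative, not the actual Verilog parameterization):

    #include <stdio.h>

    /* Enumerate the swept dcache configurations: 4 line sizes x 5 depths
     * = 20 design points, each synthesized and run on the 9 benchmarks. */
    int main(void) {
        const int line_bytes[] = {16, 32, 64, 128};
        const int depth_kb[]   = {4, 8, 16, 32, 64};
        int points = 0;

        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 5; j++, points++)
                printf("dcache: %3dB line, %2dKB deep\n",
                       line_bytes[i], depth_kb[j]);

        printf("%d design points total\n", points);   /* 20 */
        return 0;
    }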

13 Cache Design Space – Performance (Wall Clock Time)
The best cache design almost doubles the performance of the original VESPA; clock frequency varies only slightly across configurations (122MHz to 129MHz), and more pipelining/retiming could reduce the clock frequency penalty.
Cache line size is more important than cache depth (lots of streaming).

14 Cache Design Space – Area
[Figure: a 64B (512-bit) cache line built from 16-bit-wide, 4096-bit M4K block RAMs needs 32 M4Ks, i.e. 16KB of storage; MRAM blocks are the alternative]
System area almost doubled in the worst case.

15 Cache Design Space – Area
a) Choose the cache depth to fill the block RAMs (M4Ks) needed for the line size.
b) Don't use MRAMs: they are big, few, and overkill.
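Guideline a) falls out of the block RAM geometry on the previous slide: a 64B (512-bit) line built from 16-bit-wide M4Ks needs 512/16 = 32 of them, and 32 M4Ks already hold 16KB, so a shallower cache wastes storage that is paid for anyway. A small sketch of that arithmetic (M4K width and capacity taken from slide 14):

    #include <stdio.h>

    /* Minimum cache depth that fully uses the M4Ks required for a given
     * line size, using the geometry quoted on slide 14: each M4K is
     * 16 bits wide and holds 4096 bits.                                 */
    #define M4K_WIDTH_BITS 16
    #define M4K_BITS       4096

    int main(void) {
        int line_bytes   = 64;                                   /* 512-bit line */
        int m4ks_needed  = (line_bytes * 8) / M4K_WIDTH_BITS;    /* 32 M4Ks */
        int min_depth_kb = m4ks_needed * M4K_BITS / 8 / 1024;    /* 16KB */

        printf("%dB line -> %d M4Ks -> choose depth >= %dKB\n",
               line_bytes, m4ks_needed, min_depth_kb);
        return 0;
    }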

16 Hardware Prefetching Example
No prefetching: each vld.w that misses in the Dcache pays the 9-cycle DDR penalty.
Prefetching 3 blocks: the first vld.w still misses and pays the 9-cycle penalty, but it also brings in the next 3 blocks, so the following vld.w hits.

17 Hardware Data Prefetching
Advantages: little area overhead; parallelizes memory fetching with computation; uses the full memory bandwidth.
Disadvantage: cache pollution.
We use sequential prefetching triggered on: a) any miss, or b) a sequential vector instruction miss (see the sketch below).
We measure performance/area using a 64B-line, 16KB Dcache.
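Below is a minimal C model of this sequential prefetching policy (an assumption on my part – pseudocode-level, not the VESPA Verilog): on a triggering miss the cache fetches the missed line plus the next K lines, so a later access to those lines hits.

    #include <stdbool.h>
    #include <stdio.h>

    /* Sequential prefetching sketch (an assumed model, not the VESPA RTL):
     * on a triggering miss, fetch the missed line plus the next K lines.  */
    #define LINE_BYTES 64
    #define K          3          /* prefetch depth; slide 16 prefetches 3 blocks */
    #define MAX_LINES  1024       /* toy "cache" large enough for the demo */

    static bool present[MAX_LINES];   /* true = line resident in the Dcache */
    static int  ddr_fetches;          /* each costs ~9 cycles on the slides */

    static void fetch_line(unsigned line) { present[line] = true; ddr_fetches++; }

    enum policy { ANY_MISS, SEQ_VECTOR_MISS };

    static void cache_access(unsigned addr, bool seq_vector, enum policy p) {
        unsigned line = addr / LINE_BYTES;
        if (present[line]) return;                    /* hit */

        fetch_line(line);                             /* demand miss */
        /* Trigger: (a) any miss, or (b) only sequential-vector-instruction misses */
        if (p == ANY_MISS || seq_vector)
            for (unsigned k = 1; k <= K && line + k < MAX_LINES; k++)
                if (!present[line + k])
                    fetch_line(line + k);             /* prefetch next K lines */
    }

    int main(void) {
        /* Two back-to-back 64B vld.w accesses: the first misses and prefetches,
         * the second hits entirely.                                            */
        cache_access(0,  true, ANY_MISS);
        cache_access(64, true, ANY_MISS);
        printf("DDR fetches: %d (1 demand + %d prefetched)\n", ddr_fetches, K);
        return 0;
    }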

18 Prefetching K Blocks – Any Miss
Only half the benchmarks are significantly sped up (the rest are not receptive); the maximum speedup is 2.2x and the peak average speedup is 28%.

19 Prefetching Area Cost: Writeback Buffer
When a prefetch would evict dirty lines there are two options: deny the prefetch, or buffer all the dirty lines in a writeback (WB) buffer.
The area cost is small: 1.6% of system area, mostly block RAMs with little logic, and no clock frequency impact.
[Figure: the 3-block prefetch example again, with a WB buffer next to the Dcache holding the evicted dirty lines]

20 Any Miss vs Sequential Vector Miss
The two triggers are collinear – nearly all misses in our benchmarks come from sequential vector memory instructions.

21 Vector Length Prefetching
Previously a constant number of cache lines was prefetched; now we prefetch a multiple of the vector length, and only for sequential vector memory instructions.
E.g. for a vector load of 32 elements, this guarantees at most 1 miss per vector memory instruction.
[Figure: a vld.w of elements 0–31 issues one demand fetch plus a prefetch scaled by k]
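A hedged sketch of the prefetch amount under this scheme (my own formulation of the slide's rule, not the actual hardware): on a miss by a sequential vector memory instruction, prefetch k*VL elements' worth of cache lines beyond the demand fetch, which for k >= 1 covers the rest of the current vector access.

    #include <stdio.h>

    /* Vector length prefetching sketch (assumed formulation, not the RTL):
     * on a sequential vector miss, prefetch k * VL elements' worth of lines
     * beyond the missed line.                                               */
    static int lines_to_prefetch(int k, int vl, int elem_bytes, int line_bytes) {
        int bytes = k * vl * elem_bytes;
        return (bytes + line_bytes - 1) / line_bytes;        /* ceiling divide */
    }

    int main(void) {
        /* Slide example: a sequential vector load of 32 x 32-bit elements,
         * with the 64B-line cache used for these experiments.              */
        printf("1*VL: prefetch %2d lines\n", lines_to_prefetch(1, 32, 4, 64)); /*  2 */
        printf("8*VL: prefetch %2d lines\n", lines_to_prefetch(8, 32, 4, 64)); /* 16 */
        return 0;
    }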

22 Vector Length Prefetching – Performance
1*VL prefetching provides a good speedup (21% on average) without tuning; 8*VL is best, with a peak average speedup of 29% and a maximum of 2.2x, and no cache pollution (some benchmarks are not receptive).

23 Overall Memory System Performance
The wider line plus prefetching reduces memory unit stall cycles significantly – from 67% with the original 4KB cache to 48% with the 16KB/64B cache and 31% with prefetching – and eliminates all but 4% of the miss cycles.

24 Improved Scalability
Previously: a 3-8x speedup range, with an average of 5x for 16 lanes. Now: a 6-13x range, with an average of 10x for 16 lanes.

25 Summary
Explored the cache design space: ~2x performance for ~2x system area (area growth largely due to the memory crossbar); widened the cache line to 64B and the depth to 16KB.
Enhanced VESPA with hardware data prefetching: up to 2.2x performance, 28% average speedup for K=15.
Vector length prefetching gains 21% on average for 1*VL – good for mixed workloads, with no tuning and no cache pollution – and peaks at 8*VL with an average speedup of 29%.
Overall, improved the VESPA memory system and scalability: miss cycles reduced to 4%, memory unit stall cycles reduced to 31%.

26 Vector Memory Unit
Each lane l (l = 0 … L, where L = #lanes - 1) forms its address by selecting either stride*l or index_l through a MUX and adding it to base; the addresses enter a memory request queue, read data returns to the lanes (rddata0 … rddataL) through a read crossbar, and write data (wrdata0 … wrdataL) passes through a write crossbar into a memory write queue before reaching the Dcache.
[Figure: the diagram shows 4 memory lanes]
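A minimal C model of the per-lane address generation implied by the diagram (my assumption: the MUX selects between the strided offset stride*l and the per-lane index index_l before the add; queueing, crossbars, and data widths are omitted):

    #include <stdio.h>

    #define NUM_LANES 4            /* the diagram shows Memory Lanes = 4 */

    /* Per-lane address generation sketch: lane l selects either stride*l
     * (constant-stride access) or index[l] (indexed access) via the MUX,
     * then adds it to base; the request is then queued for the Dcache.  */
    static void gen_addresses(unsigned base, unsigned stride,
                              const unsigned index[NUM_LANES], int indexed,
                              unsigned addr_out[NUM_LANES]) {
        for (int l = 0; l < NUM_LANES; l++) {
            unsigned offset = indexed ? index[l] : stride * (unsigned)l; /* MUX */
            addr_out[l] = base + offset;                                 /* adder */
        }
    }

    int main(void) {
        unsigned addrs[NUM_LANES];
        unsigned idx[NUM_LANES] = {0, 8, 32, 60};

        gen_addresses(0x1000, 4, idx, 0, addrs);   /* unit-stride 32-bit load */
        for (int l = 0; l < NUM_LANES; l++) printf("lane %d: 0x%x\n", l, addrs[l]);

        gen_addresses(0x2000, 0, idx, 1, addrs);   /* indexed (gather) access */
        for (int l = 0; l < NUM_LANES; l++) printf("lane %d: 0x%x\n", l, addrs[l]);
        return 0;
    }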

