
1 Vector Processing as a Soft-core CPU Accelerator
Jason Yu, Guy Lemieux, Chris Eagleston
{jasony, lemieux, ceaglest}@ece.ubc.ca
University of British Columbia
Prepared for FPGA 2008, Altera, and Xilinx, February 26-28, 2008

2 Motivation
- FPGAs for embedded processing
  - High performance, computationally intensive workloads
  - Growing use of embedded processors on FPGA
  - Nios/MicroBlaze too slow
- Routes to faster performance
  - Faster Nios/MicroBlaze
  - Multiprocessor-on-FPGA
  - Custom hardware accelerator
  - Synthesized accelerator

3 Problems...
- Faster Nios/MicroBlaze not feasible
  - 2- or 4-way superscalar/VLIW register file maps inefficiently to FPGA
  - Superscalar requires complex dependency checking
- Multiprocessor-on-FPGA complexity
  - Parallel programming and debugging
  - System design: cache coherence, memory consistency
- Custom hardware accelerator cost
  - Needs a hardware engineer
  - Time-consuming to design and debug
  - 1 hardware accelerator per function

4 Possible Solutions...
- Automatically synthesized hardware accelerators
  - Change software → regenerate & recompile RTL
  - Altera C2H, Xilinx CHiMPS, Mitrion Virtual Processor, CriticalBlue Cascade
- Soft vector processor
  - Change software → same RTL, just recompile software
  - Purely software-based
  - Decouples hardware/software development teams

5 Advantages of Vector Processing
- Simple programming model
  - Short to long vector data parallelism
  - Regular, easy to accelerate
- Purely software-based
  - One hardware accelerator supports many applications
- Scalable performance and area

6 Contributions
- Configurable soft vector processor
  - Selectable performance/resource tradeoff
  - Area customization
- FPGA-specific enhancements
  - Partitioned register file
  - Vector reductions using MAC chain
  - Local vector datapath memory

7 Overview of Vector Processing

8 Acceleration with Vector Processing
- Organize data as long vectors → data-level parallelism
- Vector instruction execution
  - Multiple vector lanes (SIMD)
  - SIMD operation repeated over the length of the vector
- Example: the loop
    for (i=0; i<NELEM; i++)
      a[i] = b[i] * c[i];
  becomes the single vector instruction
    vmult a, b, c
- (Figure: source vector registers feeding the vector lanes into a destination vector register)
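The execution model on this slide can be sketched in plain C. This is a minimal software model, not the processor's RTL; the VL and NLANE constants and the vmult name are illustrative (the real lane count and vector length are configuration parameters).

```c
#include <assert.h>

#define VL    8   /* vector length: elements per vector instruction (hypothetical) */
#define NLANE 4   /* number of hardware vector lanes (hypothetical) */

/* Model of "vmult a, b, c": the NLANE-wide SIMD operation is repeated
 * over the length of the vector, one group of lanes per pass. */
static void vmult(int a[VL], const int b[VL], const int c[VL]) {
    for (int base = 0; base < VL; base += NLANE)        /* one pass per lane group */
        for (int lane = 0; lane < NLANE && base + lane < VL; lane++)
            a[base + lane] = b[base + lane] * c[base + lane];
}
```

With VL = 8 and NLANE = 4, one vmult covers the whole loop in two passes through the lanes.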

9 Compared to CPUs with SIMD Extensions
- Intel SSE2, PowerPC AltiVec, etc.
- Short, fixed-length vectors (e.g., 4 elements)
- Single cycle per instruction
- Many data pack/unpack instructions
- (Figure: source SIMD registers, SIMD unit, destination SIMD register)

10 Hybrid Vector-SIMD
- Consider the code sequence:
    for (i=0; i<NELEM; i++) {
      C[i] = A[i] + B[i];
      E[i] = C[i] * D[i];
    }
- (Figure: loop iterations scheduled under traditional vector, SIMD, and hybrid vector-SIMD execution)

11 Hybrid Vector-SIMD vs. Traditional Vector
- Same code sequence:
    for (i=0; i<NELEM; i++) {
      C[i] = A[i] + B[i];
      E[i] = C[i] * D[i];
    }
- Traditional vector processing: compute C for the entire vector, then E for the entire vector
- Hybrid vector-SIMD processing: compute both C and E for one lane-wide group of elements (e.g., 0-3), then the next group (4-7)
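The two execution orders above can be sketched in C. This is an illustrative model of the instruction ordering only (the function names and the NLANE width are assumptions); both orders produce identical results, but the hybrid order consumes each C[i] while it is still fresh in the datapath.

```c
#include <assert.h>

#define NELEM 8
#define NLANE 4   /* lane-group width (hypothetical) */

/* Hybrid vector-SIMD order: finish BOTH operations on one lane-wide
 * group of elements before advancing to the next group. */
static void hybrid_add_mul(const int A[], const int B[], const int D[],
                           int C[], int E[]) {
    for (int base = 0; base < NELEM; base += NLANE) {
        for (int i = base; i < base + NLANE && i < NELEM; i++)
            C[i] = A[i] + B[i];      /* first op on this group */
        for (int i = base; i < base + NLANE && i < NELEM; i++)
            E[i] = C[i] * D[i];      /* second op, reusing the fresh C */
    }
}
```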

12 Vector ISA Features
- Vector length (VL) register
- Conditional execution
  - Vector flag registers
- Vector addressing modes
  - Unit stride
  - Constant stride
  - Indexed offset
- (Figure: vector merge operation, with source registers, flag register, and destination register)
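The merge operation in the slide's figure can be modeled in a few lines of C. This is a sketch of the semantics only; the vmerge name and VL constant are illustrative, not the processor's actual mnemonics.

```c
#include <assert.h>

#define VL 8   /* current vector length (hypothetical) */

/* Model of a vector merge under a flag register: each destination
 * element takes src_a where its flag bit is set, src_b otherwise.
 * This is the building block for conditional (masked) execution. */
static void vmerge(int dest[VL], const int src_a[VL],
                   const int src_b[VL], const int flag[VL]) {
    for (int i = 0; i < VL; i++)
        dest[i] = flag[i] ? src_a[i] : src_b[i];
}
```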

13 Example: Simple 5x5 Median Filtering
- Pseudocode (bubble sort):
    load the 25 pixel vectors P[0..24]
    for i = 0 to 12 {
      minimum = P[i]
      for j = i to 24 {
        if (P[j] < minimum)
          swap(minimum, P[j])
      }
    }
- Slide the "window" over after finding 1 median
- Repeated over the entire image
  - Many windows
- (Figure: 5x5 window and its output pixel)

14 Example: Simple 5x5 Median Filtering
- Same pseudocode (bubble sort):
    load the 25 pixel vectors P[0..24]
    for i = 0 to 12 {
      minimum = P[i]
      for j = i to 24 {
        if (P[j] < minimum)
          swap(minimum, P[j])
      }
    }
- Bubble sort on vector registers
  - 25 rows → 25 vector registers, "VL" pixels each
- Vector flag register to mask execution
- "VL" results at once!
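The vectorized sort above can be sketched in C, with the innermost loop standing in for one flag-masked vector compare-and-swap across the lanes. This is an illustrative model (the VL width and function name are assumptions): after 13 passes, row 12 of P holds the median of the 25 pixels for every lane at once.

```c
#include <assert.h>

#define VL 4   /* pixels (lanes) processed in parallel (hypothetical) */

/* Partial selection sort of the 25 window rows, per lane. Each (i, j)
 * step models one vector compare (setting a flag register) followed by
 * a flag-masked swap; after pass i = 12, P[12] is the median vector. */
static void median5x5(int P[25][VL], int median[VL]) {
    for (int i = 0; i <= 12; i++)
        for (int j = i + 1; j < 25; j++)
            for (int lane = 0; lane < VL; lane++)   /* masked swap */
                if (P[j][lane] < P[i][lane]) {
                    int t = P[i][lane];
                    P[i][lane] = P[j][lane];
                    P[j][lane] = t;
                }
    for (int lane = 0; lane < VL; lane++)
        median[lane] = P[12][lane];    /* 13th smallest of 25 = median */
}
```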

15 Soft Vector Processor Architecture

16 Processor block diagram (figure), highlighting:
- Nios II core
- Shared instruction memory (scalar / vector instructions)
- Shared scalar / vector memory interface
- Distributed vector register file
- Overlapped scalar / vector execution
- Configurable memory width
- Configurable number of lanes

17 Distributed vector register file
- (Figure: the elements of one vector register, e.g. v0, distributed across the vector lanes)

18 Vector lane datapath
- (Figure: local vector datapath memory and the MAC chain, whose result returns to vector lane 0)

19 Vector Sum Reduction with MAC
- Sum reduction
  - R = Σ A[i] * B[i]
  - R = Σ A[i] (using B[i] = 1)
  - Reduces the VL elements of a vector register to a single number
- Two-instruction sequence:
  - vmac → multiply-accumulate into the lane accumulators
  - vcczacc → compress copy and zero accumulators
- Side effect: can only reduce 18-bit inputs
- (Figure: accumulate chain)
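The semantics of the two-instruction sequence can be sketched in C. This is a behavioral model only, assuming one accumulator per lane (the VL and NLANE values are illustrative); it ignores the 18-bit input restriction of the real DSP-block chain.

```c
#include <assert.h>

#define VL    8   /* vector length (hypothetical) */
#define NLANE 4   /* lanes, each with its own accumulator (hypothetical) */

/* Model of the reduction sequence:
 *   vmac    - each lane multiplies its share of the elements and adds
 *             the products into its private accumulator;
 *   vcczacc - sums the NLANE accumulators through the MAC chain into
 *             one result and zeros the accumulators. */
static int vmac_then_vcczacc(const int A[VL], const int B[VL]) {
    int acc[NLANE] = {0};
    for (int i = 0; i < VL; i++)               /* vmac */
        acc[i % NLANE] += A[i] * B[i];
    int r = 0;
    for (int lane = 0; lane < NLANE; lane++) { /* vcczacc */
        r += acc[lane];
        acc[lane] = 0;                         /* accumulators cleared */
    }
    return r;
}
```

Setting every B[i] to 1 turns the same sequence into a plain sum reduction, as the slide notes.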

20 Configurable Parameters
- Some configurable features:
  - Number of vector lanes
  - Vector ALU width
  - Vector memory access granularity (8, 16, 32 bits)
  - Local memory size (or none)
- These strongly affect performance and area

21 Partial List of Configurable Parameters

Primary parameters                                                      Soft vector processors
Parameter     Description                                  Typical      V4     V8     V16M32
NLane         Number of vector lanes                       4-128        4      8      16
MVL           Maximum vector length                        16-512       16     32     64
VPUW          Processor data width (bits)                  8, 16, 32    32     32     32
MemMinWidth   Minimum accessible data width in memory      8, 16, 32    8      8      32

Parameters for optional features
MultW         Multiplier width (bits, 0 is off)            0, 8, 16, 32  16    16     16
MACL          MAC chain length (0 is no MAC)               0, 1, 2, 4    1     2      0
LMemN         Local memory number of words                 0-1024        256   256    0
LMemShare     Shared local memory address space within lane  On/Off      Off   Off    Off

22 Performance Results

23 Benchmarking
- 3 sample application kernels:
  - 5x5 median filter
  - Motion estimation (full search block matching)
  - 128-bit AES encryption (MiBench)
- C code, 3 versions:
  - Nios II
  - Nios II with inline vector assembly
  - Nios II with C2H accelerator

24 Methodology and Assumptions
- Compile C code with nios2-gcc
- Run time = instructions * cycles-per-instruction / Fmax
- Nios II:
  - Instruction: 1 cycle
  - Memory load: 1 cycle
- Nios II with vectors:
  - Vector instruction: (VL / NLane) cycles
  - Vector load: 2 * (VL / NLane) + 2 cycles
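The cycle model above is simple enough to capture directly. A small sketch, assuming VL/NLane rounds up when the vector length is not a multiple of the lane count (the slide leaves that case unspecified):

```c
#include <assert.h>

/* Vector instruction cost: VL / NLane cycles (rounded up here,
 * an assumption for VL not divisible by NLane). */
static int vec_op_cycles(int vl, int nlane) {
    return (vl + nlane - 1) / nlane;
}

/* Vector load cost from the slide: 2 * (VL / NLane) + 2 cycles. */
static int vec_load_cycles(int vl, int nlane) {
    return 2 * vec_op_cycles(vl, nlane) + 2;
}
```

For example, a 64-element vector on 16 lanes costs 4 cycles per vector instruction and 10 cycles per vector load under this model.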

25 Altera C2H Compiler
- Nios II with C2H accelerator
  - Synthesizes a HW accelerator from a C function
  - Each C memory reference becomes a master port to that memory
- Current limitations:
  - No automatic loop unrolling
  - Up to the user to efficiently partition memory
- (Figure: accelerator master ports connected through the Avalon fabric and memory arbiter)

26 C2H Methodology
- Compile application kernels with the C2H compiler
  - Automatic pipelining and scheduling
  - Manually unroll loops
  - Manually "vectorize" C code
- Nios II with C2H accelerator
  - C2H compiler reports # of clock cycles
  - Includes memory arbitration overhead

27 C2H Example
- AES encryption round
  - Shift 4 32-bit words (by different amounts)
  - 4 table lookups
  - XOR results, XOR with key
- Acceleration steps:
  1. Process multiple blocks in parallel (increase array sizes)
  2. Manually create 4 on-chip memories for the 4 lookup tables
- (Figure: 32-bit word datapath)

28 Synthesize system, place and route

29 Resource Utilization
- Biggest Stratix III = 7x more resources
- Note: these vector processors include a large local memory in each vector lane (an optional feature), hence the high M9K utilization. Removing it would save 60% of the M9K blocks in V16.

30 Resource Utilization Estimates

                            ALM     DSP Elements   M9K    Fmax (MHz)
Smallest Stratix III        19000   216            108    -
Nios II/s                   489     8              4      153
 + C2H Median filtering     825     8              4*     147
 + C2H Motion estimation    977     10             4*     135
 + C2H AES encryption       2480    8              6*     119
UTIIe                       324     0              3      193
 + V4                       5215    21             32     115
 + V8                       7011    34             53     114
 + V16                      10266   58             95     113

* C2H results are obtained from compiling to Stratix II; uses M4K memories

31 Results: Clock Cycles

32 Speedup vs. Resource Utilization Summary
- (Chart: speedup against resource use for Nios II/s, the C2H accelerators, and the vector processors up to V16/V32, across median filtering, AES encryption, and motion estimation)

33 Summary of Effort
- C2H accelerators
  1. "Vectorize" code for C2H: 1 day
  2. Extra-effort optimization: 1 day
  3. Place-and-route waiting: 1 hour
  - Each iteration = 1 day + P&R
- Vector soft processor
  1. Vector algorithm, write vector assembly: 2 days
  2. Revise vector algorithm: 0.5 day
  - Each iteration = 0.5 day + SW compile only

34 Lessons from Vector Processor Design
- Register files
  - 2-read, 1-write memory very common for CPUs
  - Wide-issue processing needs multiple write ports
- Wide, flexible vector memory interface is very costly
  - Memory crossbars: several multi-bit multiplexers
  - ~1/3 the resources of the soft vector processor (128-bit, byte access)
- Stratix III specific
  - DSP shift chain can no longer dynamically select its input
  - MAC chain is useful; would like a 32-bit MAC chain

35 Current Progress
- Development toolchain integration
  - Packaged as an SOPC Builder component
  - No built-in debug core; uses a real Nios II processor to download code onto the system
  - Inline vector assembly in the Nios II IDE
- Future work
  - Compiler
  - Floating-point

36 Conclusion
- Vector processing maps well to FPGA
  - Many small memories, DSP blocks
  - Simple programming model
- Soft vector processor
  - Purely software-based acceleration: no hardware design or RTL recompile needed, just program
  - One hardware accelerator supports many applications
- Scalable performance and area
  - More vector lanes → more performance for more area
  - Soft core parameters/features → area customization

37 Conclusion (continued)
- FPGA-specific enhancements
  - Partitioned register file reduces resource utilization
  - MAC chain for efficient vector reduction
  - Local vector datapath memory → table lookup operations
- Download the processor now! http://www.ece.ubc.ca/~jasony/

