Christopher Han-Yu Chou Supervisor: Dr. Guy Lemieux


1 VIPERS II: A Soft-core Vector Processor with Single-copy Data Scratchpad Memory
Christopher Han-Yu Chou Supervisor: Dr. Guy Lemieux

2 Outline Motivation New Pipeline Structure VIPERS II Architecture
Results Conclusion

3 Motivation
The VIPERS soft vector processor provides scalable performance for data-parallel applications on FPGAs. The original VIPERS has a few shortcomings: high latency for copying data from memory to the register file; duplicate copies of data in precious on-chip memory; a scalar core that is not pipelined and has no debug core.

4 Duplicate Copies of Data
VIPERS uses a dual read-port vector register file: 2 identical copies of the register file, plus an original copy of the data in on-chip memory. These data duplicates are wasteful given the limited on-chip memory capacity. Today’s FPGAs offer fast on-chip memories. Why not access the memory directly?

5 Contribution
Use address registers and scratchpad memory to replace the vector register file: eliminate slow load/store operations; more efficient on-chip memory usage.
Auto-increment/decrement and circular buffer features: reduce need for loop unrolling; lower loop overhead.

6 Outline Motivation New Pipeline Structure VIPERS II Architecture
Results Conclusion

7 New Pipeline Structure
Classic 5-stage pipeline. Swap the execution stage with the memory access stage. Note that the names of the stages are changed to memory read and memory write.
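The stage swap can be sketched as a reordering of a stage list. This is only an illustration: the classic stage names and the mapping of "memory read" and "memory write" onto specific positions are assumptions, since the slide's figure is not part of the transcript.

```python
# Classic 5-stage pipeline order (textbook naming, assumed here).
CLASSIC = ["fetch", "decode", "execute", "memory access", "write back"]

# Assumed renaming: the slide only says the stages are renamed to
# "memory read" and "memory write"; which stage gets which name is a guess.
RENAME = {"memory access": "memory read", "write back": "memory write"}


def swap_execute_and_memory(stages):
    """Return a new stage order with execute and memory access exchanged."""
    out = list(stages)
    i, j = out.index("execute"), out.index("memory access")
    out[i], out[j] = out[j], out[i]
    return out


def vipers2_stage_order():
    """Apply the swap, then the assumed renaming."""
    return [RENAME.get(s, s) for s in swap_execute_and_memory(CLASSIC)]
```

Under these assumptions, operands are read from memory before the execute stage, which is what lets an instruction work directly on scratchpad data.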

8 Implementation The “data” register file is replaced by address registers and a scratchpad memory. Eliminates load/store when data set fits in scratchpad memory.
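A rough way to see the load/store savings is to compare instruction sequences for one vector addition. The mnemonics and operand names below are hypothetical placeholders, not taken from the VIPERS or VIPERS II instruction sets, and the sketch assumes the data set already fits in the scratchpad.

```python
# Hypothetical mnemonics, for illustration only.
# Register-file style: operands are loaded, combined, and stored back.
REGISTER_FILE_STYLE = [
    ("vload", "v1", "a"),
    ("vload", "v2", "b"),
    ("vadd", "v3", "v1", "v2"),
    ("vstore", "v3", "c"),
]

# Scratchpad style: address registers point at the data in place.
SCRATCHPAD_STYLE = [
    ("vadd", "vA2", "vA0", "vA1"),
]


def count_memory_transfers(program):
    """Count explicit load/store instructions in a program."""
    return sum(1 for op, *_ in program if op in ("vload", "vstore"))
```

In this toy comparison the scratchpad version issues no explicit transfers, while the register-file version issues three.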

9 VIPERS II ISA

10 Outline Motivation New Pipeline Structure VIPERS II Architecture
Results Conclusion

11 VIPERS II Architecture

12 Architectural Changes
Vector address registers; vector scratchpad memory; data alignment crossbar network (DACN); fracturable ALUs

13 Vector Address Registers
Feature auto post-increment, pre-decrement, and circular buffer modes. Reduce loop overheads. Require fewer address registers than data registers to implement an application.
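The three addressing modes can be modelled in a few lines. This is a behavioural sketch only: the class name, the `step` parameter, and the way the circular region is described (`circ_start`, `circ_size`) are assumptions, not the hardware's actual register fields.

```python
class AddressRegister:
    """Toy model of a vector address register (names are hypothetical)."""

    def __init__(self, base, step=1, circ_start=None, circ_size=None):
        self.addr = base
        self.step = step
        self.circ_start = circ_start  # start of circular region, if any
        self.circ_size = circ_size    # size of circular region, if any

    def _wrap(self, addr):
        """Wrap the address into the circular region when one is set."""
        if self.circ_size is None:
            return addr
        return self.circ_start + (addr - self.circ_start) % self.circ_size

    def post_increment(self):
        """Return the current address, then advance by step."""
        current = self.addr
        self.addr = self._wrap(self.addr + self.step)
        return current

    def pre_decrement(self):
        """Step back first, then return the new address."""
        self.addr = self._wrap(self.addr - self.step)
        return self.addr
```

Because the address update happens as a side effect of the access, a loop body needs no separate pointer-arithmetic instruction, which is where the reduced loop overhead comes from.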

14 Vector Address Register

15 Vector Scratchpad Memory
Reduced load/store latencies with a simpler memory interface. Operates at 2X clock.

16 Vector Scratchpad Memory
Efficient data storage. Flexible data set size restriction, e.g. median filter benchmark with byte-size data.

17 Data Alignment Crossbar Network
With vector lanes coupled directly to memory, input vectors must be aligned. For misaligned operands, the vector move instruction (vmov) is used to move data into alignment.
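The alignment requirement can be illustrated with a toy memory model. The sketch assumes that an element at address `addr` is served by lane `addr % NUM_LANES`; that mapping, the lane count, and the Python `vmov` signature are all assumptions for illustration, not the documented behaviour of the instruction.

```python
NUM_LANES = 4  # hypothetical lane count


def lane_of(addr, num_lanes=NUM_LANES):
    """Lane that serves a given element address (assumed mapping)."""
    return addr % num_lanes


def is_aligned(addr_a, addr_b, num_lanes=NUM_LANES):
    """Two vectors are aligned if their first elements sit in the same lane."""
    return lane_of(addr_a, num_lanes) == lane_of(addr_b, num_lanes)


def vmov(mem, dst, src, length):
    """Copy `length` elements from src to dst (stand-in for a vector move)."""
    mem[dst:dst + length] = mem[src:src + length]
```

For example, a vector starting at address 5 is misaligned with one starting at address 0; moving it to address 8 puts both in lane 0.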

18 Example

19 Data Alignment Crossbar Network
Implemented with a multistage switching network to trade off performance for area. Crossbar: quadratic growth. DACN: N log(N) growth, slower.
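The growth rates can be compared numerically. The functions below count abstract switching elements with all constant factors ignored, so they show the trend only and are not area figures for the actual design; the function names are made up for this sketch.

```python
def _log2_exact(n):
    """log2 of n, for n a power of two >= 2."""
    if n < 2 or (n & (n - 1)) != 0:
        raise ValueError("n must be a power of two >= 2")
    return n.bit_length() - 1


def crossbar_cost(n):
    """Full crossbar: one crosspoint per input/output pair."""
    return n * n


def multistage_cost(n):
    """Multistage network: about n elements in each of log2(n) stages."""
    return n * _log2_exact(n)


def multistage_stages(n):
    """Stages a value passes through, the source of the extra latency."""
    return _log2_exact(n)
```

At 16 ports this gives 256 versus 64 elements, at the price of 4 stages of switching instead of 1.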

20 Fracturable ALUs
Data elements are stored in their natural length. Fracturable ALUs are used to execute on operands with varying widths.
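A fracturable adder can be modelled as one wide adder whose carry chain is cut at element boundaries. The sketch below assumes a 32-bit datapath that splits into 8-, 16-, or 32-bit elements; those widths and the function name are assumptions for illustration.

```python
def fracturable_add(a, b, elem_bits):
    """Add two 32-bit words as independent elements of elem_bits each.

    Carries do not propagate across element boundaries.
    """
    if elem_bits not in (8, 16, 32):
        raise ValueError("elem_bits must be 8, 16, or 32")
    mask = (1 << elem_bits) - 1
    result = 0
    for shift in range(0, 32, elem_bits):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        # Keep only elem_bits of the sum so the carry is dropped.
        result |= ((x + y) & mask) << shift
    return result
```

The same input words give different results depending on the element width, because the carry out of each element is discarded rather than passed to its neighbour.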

21 Fracturable ALUs

22 Fracturable ALUs
Increased processing power: a 4-lane VIPERS II operating on byte-size data is equivalent to having 16 lanes. Speaker note: if VL is increased to 64 to fully utilize the pipeline, it takes as little as 70 cycles per pixel.
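The lane-equivalence claim is simple arithmetic once a lane width is fixed. The sketch assumes 32-bit lanes, which is consistent with 4 lanes of byte data behaving like 16 lanes but is not stated explicitly on the slide.

```python
def effective_lanes(num_lanes, elem_bits, lane_bits=32):
    """Number of elements processed in parallel (lane_bits is assumed)."""
    return num_lanes * (lane_bits // elem_bits)
```

With these assumptions, 4 lanes give 16 parallel byte operations, 8 halfword operations, or 4 word operations.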

23 Outline Motivation New Pipeline Structure VIPERS II Architecture
Results Conclusion

24 Resource Usage Explain DSP breakdown

25 Simulated Performance

26 Hardware Performance

27 Future Work
Increase operating frequency. Implement strided and indexed moves. Implement DACN with Omega network. Alternative implementation of address register.

28 Related Works
VESPA (Rose, CASES08) and VIPERS (Lemieux, FPGA08) are two previous soft-core vector processors; VIPERS II uses vector scratchpad memory instead of a register file.
IBM’s CELL processor (Pham, ISSCC05) features SRAM scratchpad memory populated by DMA; VIPERS II does not require load/store operations.
Register pointer architecture (Dally, DATE07) reduces the need for loop unrolling by dynamically changing the register pointer; VIPERS II is the first vector processor to utilize this technique.

29 Conclusion
VIPERS II architecture provides many advantages: improved performance by eliminating slow load/store operations; unrolled performance without unrolling; efficient usage of on-chip memory; increased processing power when executing smaller operands.

30 Thank you

31 Vector Scratchpad Memory
e.g. Largest median filter that can be realized given a 64kb memory budget

32 Implementation

33 Strided/Indexed Access
Strided/indexed loads are replaced by strided/indexed move operations. Similar to ‘vmov’, strided move ‘vmovs’ simply moves scattered elements to contiguous locations in the memory. e.g. vmovs vA1, vA0, vstride0;
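The effect of a strided move can be sketched on a flat list standing in for the scratchpad. The Python signature is an assumption: the slide's example suggests a destination-first operand order with the stride held in a separate register, but it does not spell out the operand semantics.

```python
def vmovs(mem, dst, src, stride, length):
    """Gather `length` elements spaced `stride` apart, starting at src,
    into contiguous locations starting at dst (toy model of a strided move).
    """
    gathered = [mem[src + i * stride] for i in range(length)]
    mem[dst:dst + length] = gathered
```

After the move, the gathered elements sit side by side, so later vector instructions can treat them as an ordinary contiguous operand.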

34 Permutation Requirement
Show by figure, offset, stride, and index

