A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.

1 A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications. From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University of Florida. Presentation by John Potts, University of Guelph.

2 Outline ● Introduction – What is a sliding-window application? – Justification ● Background – Applications ● Methodologies ● Results ● Analysis ● Conclusions

3 Introduction: Sliding-Window Applications ● What is a sliding-window application? ● 2-dimensional signal analysis ● x by y image, n by m kernel or window

4 Introduction: Sliding-Window Applications

5 Introduction: Justification ● Computing architectures are trending toward parallelism and heterogeneity ● GPUs are common ● Multitude of accelerator options available ● Metrics for different devices vary widely with applications ● Often many Pareto-optimal solutions ● Study focuses on a particular application type and two particular design criteria

6 Introduction: Devices ● Devices tested: ● Altera Stratix III E260 FPGA ● NVIDIA GeForce GTX 295 GPU using the CUDA framework ● Quad-core Xeon W3520 using the OpenCL multicore framework ● Single-chip systems also examined

7 Background: Previous Work ● Application performance for FPGAs and GPUs – Sinha et al.: feature tracker on GPU – Porter et al.: stereo matching algorithms on FPGA ● FPGA, GPU and CPU comparisons – Baker et al.: matched filter algorithm, Cell processor for performance and energy, GPU for performance per dollar – Pauwels et al.: vision-based algorithms, FPGAs best for single-stage algorithms only ● Different use cases: – Cope et al.: 2D convolution and colour correction, performance dependent on kernel size – Asano et al.: CPU, GPU and FPGA for applications of 2D filter, SAD stereo vision disparity, k-means clustering

8 Background: Improvements offered by this study ● Study provides a more in-depth analysis of sliding-window applications ● Wider range of image and kernel sizes ● Presents a generalized circuit architecture ● Optimizations deliver real-time sliding-window processing of HD video on a single GPU or FPGA ● Evaluates a new application based on Information Theoretic Learning

9 Background: Applications ● Applications where the kernel is fully immersed ● SAD – Sum of Absolute Differences ● 2D Convolution ● Correntropy ● 2D FFT – GPU and multicore only

10 Applications: Sum of Absolute Differences ● Detects the degree of similarity between images ● E.g., a security system ● Operation ● Output: structure of size (x-n+1) x (y-m+1)
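The SAD operation named on this slide can be illustrated with a minimal pure-Python sketch. The function name `sad_map` and the nested-list image layout are assumptions made for illustration, not the paper's implementation; the kernel is only placed where it is fully immersed in the image, which gives the output size stated above.

```python
def sad_map(image, kernel):
    """Return the sum of absolute differences at every fully immersed position.

    `image` and `kernel` are lists of rows (hypothetical layout for this sketch).
    """
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - k_rows + 1):          # top edge of the window
        out_row = []
        for c in range(cols - k_cols + 1):      # left edge of the window
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += abs(image[r + i][c + j] - kernel[i][j])
            out_row.append(total)
        out.append(out_row)
    return out


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
kernel = [
    [6, 7],
    [10, 11],
]
print(sad_map(image, kernel))  # a SAD of 0 marks an exact match
```

A lower SAD means a closer match, so the position whose value is 0 is where the kernel appears exactly in the image.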

11 Applications: 2D Convolution ● Used in digital signal processing, scientific computing, small to high-performance embedded systems ● Operation ● Equation: ● Common optimization
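The equation on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the textbook direct-form 2D convolution restricted to fully immersed windows; the function name `convolve2d_valid` is an assumption, and this is not necessarily the exact formulation or optimization the slide showed.

```python
def convolve2d_valid(image, kernel):
    """Direct 2D convolution over fully immersed windows (textbook definition)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    out = [[0] * out_cols for _ in range(out_rows)]
    for r in range(out_rows):
        for c in range(out_cols):
            acc = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    # The kernel is indexed in reverse: this flip is what
                    # separates convolution from plain correlation.
                    acc += image[r + i][c + j] * kernel[k_rows - 1 - i][k_cols - 1 - j]
            out[r][c] = acc
    return out


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
print(convolve2d_valid(image, kernel))
```

Each output pixel costs one multiply-accumulate per kernel element, which is why the work grows with kernel area.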

12 Applications: Correntropy ● Measure of similarity based on Information Theoretic Learning ● Many possible applications; study focuses on one similar to SAD ● Equation: ● Operation
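The correntropy equation is also missing from the transcript. The sketch below assumes the common form in which a Gaussian kernel is applied to each pixel difference and the results are averaged over the window. The unnormalized Gaussian, the `sigma` parameter, and the names `correntropy_map` and `best_match` are assumptions for illustration, not details taken from the paper.

```python
import math


def correntropy_map(image, kernel, sigma=1.0):
    """Average Gaussian similarity between the kernel and each immersed window.

    Assumes an unnormalized Gaussian, so a perfect match scores exactly 1.0.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - k_rows + 1):
        out_row = []
        for c in range(len(image[0]) - k_cols + 1):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    diff = image[r + i][c + j] - kernel[i][j]
                    total += math.exp(-(diff * diff) / (2.0 * sigma * sigma))
            out_row.append(total / (k_rows * k_cols))
        out.append(out_row)
    return out


def best_match(similarity):
    """Return (row, col) of the largest similarity value."""
    best = (0, 0)
    for r, row in enumerate(similarity):
        for c, value in enumerate(row):
            if value > similarity[best[0]][best[1]]:
                best = (r, c)
    return best


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
kernel = [[6, 7], [10, 11]]
scores = correntropy_map(image, kernel, sigma=2.0)
print(best_match(scores))
```

Unlike SAD, a larger value means a closer match, so the search is for a maximum rather than a minimum.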

13 Methodology: FPGA ● Circuit architecture:

14 Methodology: FPGA ● Uses a window generator to reduce bandwidth requirements ● Controller and host software transfer the image, initialize the circuit, poll for completion, and read the output ● SAD implementation ● 2D Convolution
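The bandwidth saving from a window generator can be sketched in software: pixels arrive once in raster order, the most recent rows are held in line buffers, and every window is assembled from those buffers instead of being re-read from memory. This is a behavioural illustration only; the function name `window_generator` and its parameters are hypothetical, and the actual circuit would emit windows cycle by cycle rather than row by row.

```python
from collections import deque


def window_generator(pixel_stream, width, k_rows, k_cols):
    """Yield (top, left, window) for each fully immersed window.

    Every pixel is consumed from `pixel_stream` exactly once.
    """
    line_buffers = deque(maxlen=k_rows)  # holds the most recent k_rows rows
    current_row = []
    rows_done = 0
    for pixel in pixel_stream:
        current_row.append(pixel)
        if len(current_row) < width:
            continue
        # A full image row has arrived: push it into the line buffers.
        line_buffers.append(current_row)
        current_row = []
        rows_done += 1
        if len(line_buffers) == k_rows:
            top = rows_done - k_rows
            for left in range(width - k_cols + 1):
                window = [buf[left:left + k_cols] for buf in line_buffers]
                yield top, left, window


# A 4-wide, 3-row image streamed as 12 pixels, with a 2x2 window.
for top, left, window in window_generator(range(12), width=4, k_rows=2, k_cols=2):
    print(top, left, window)
```

Only `k_rows` rows are buffered at any time, so the storage needed depends on the image width and kernel height, not on the full frame size.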

15 Methodology: FPGA ● Correntropy

16 Methodology: FPGA Resources

Application                      LUTs      Registers  Block Memory Bits  DSP Blocks
SAD                              137,260   156,377    2,256,464          0
2D Convolution: Fixed Point      33,547    57,122     1,601,104          738
2D Convolution: Floating Point   129,024   126,821    1,633,872          676
Correntropy                      141,633   143,137    2,256,464          0

17 Methodology: GPU ● Uses a specialized memory organisation ● a x b output pixels, 64x32 selected ● Macroblock size balances threads per block against memory bank conflicts; 2x2 chosen ● SAD: calculated between the kernel and the four windows in the corresponding macroblock
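The tile and macroblock figures above imply some simple launch arithmetic, sketched here. It rests on two assumptions the slide does not state outright: that each 64x32 tile of output pixels maps to one thread block, and that each thread handles one 2x2 macroblock. The function name `launch_shape` is hypothetical.

```python
import math


def launch_shape(out_width, out_height, tile_w=64, tile_h=32, macroblock=2):
    """Return (threads_per_block, blocks_x, blocks_y) for a tiled output.

    Assumes one thread block per tile and one thread per macroblock.
    """
    threads_per_block = (tile_w // macroblock) * (tile_h // macroblock)
    blocks_x = math.ceil(out_width / tile_w)
    blocks_y = math.ceil(out_height / tile_h)
    return threads_per_block, blocks_x, blocks_y


# 1920x1080 frame with a 16x16 kernel: output is (1920-16+1) x (1080-16+1).
print(launch_shape(1920 - 16 + 1, 1080 - 16 + 1))
```

Under these assumptions a 2x2 macroblock quarters the thread count per tile, which is the trade-off against bank conflicts that the slide refers to.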

18 Methodology: GPU ● 2D Convolution: similar to SAD – Frequency domain also implemented (2D FFT) ● Correntropy: SAD with an extra step – Challenge: locating maximum similarity values

19 Methodology: Multicore ● Utilized the OpenCL parallel programming standard ● Optimizations focused on minimizing communication between threads ● Implementation consists of a straightforward specification of the window function

20 Results ● Results examined include FPS, speedup analysis, energy efficiency ● Single-chip systems such as APUs and standalone FPGAs examined – Upper-bound estimates found by removing PCIe transfer times ● Sequential C++ implementations used as baseline ● Implementations evaluated for 480p, 720p, 1080p video ● Kernel sizes of 4x4, 9x9, 16x16, 25x25 evaluated for all applications, also 36x36 and 45x45 for SAD and correntropy
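To give a feel for how the workload scales across these test cases, the sketch below counts per-pixel operations per frame for a direct sliding-window implementation. The frame dimensions used for 480p, 720p, and 1080p are assumptions (the slide gives only the labels), and `window_ops` is a hypothetical helper.

```python
def window_ops(width, height, k):
    """Per-frame element operations for a k x k kernel over immersed windows."""
    windows = (width - k + 1) * (height - k + 1)
    return windows * k * k


# Assumed frame sizes; the slide names only 480p, 720p, and 1080p.
frames = {"480p": (640, 480), "720p": (1280, 720), "1080p": (1920, 1080)}
for name, (w, h) in frames.items():
    for k in (4, 9, 16, 25, 36, 45):
        print(f"{name} {k}x{k}: {window_ops(w, h, k):,} ops per frame")
```

The operation count grows roughly with kernel area, which is why large kernels stress devices whose run time depends on per-window work.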

21 Results: Sum of Absolute Differences

22 Results: 2D Convolution

23 Results: Correntropy

24 Results: Speedup

25 Results: Application Comparison

26 Results: Analysis ● GPU is best for smaller kernels (4x4 and 9x9), equivalent at 16x16 ● FPGA speedup reached 240x, 45x, and 298x over the sequential baseline for SAD, 2D Convolution, and Correntropy respectively ● 2D Convolution: GPU-FFT was faster than FPGA ● FPGA implementations ran in near-constant time due to pipelining; extra steps add latency rather than reduce throughput

27 Results: Single-Chip Systems ● PCIe transfer times were as much as 65% of GPU execution time and 64% of FPGA execution time ● Single-chip FPGA is consistently ~2x faster than the PCIe version ● At the time of writing, GPU time minus PCIe transfer time is not a realistic representation, as standalone or single-chip GPU systems do not have nearly the capability of the device tested
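The upper-bound estimate obtained by removing PCIe transfer time follows from simple arithmetic: if a fraction f of execution time is transfer, removing it gives a speedup of at most 1 / (1 - f). A minimal sketch, with `speedup_without_transfer` as a hypothetical helper name:

```python
def speedup_without_transfer(transfer_fraction):
    """Upper-bound speedup when transfer time is removed entirely."""
    if not 0.0 <= transfer_fraction < 1.0:
        raise ValueError("transfer_fraction must be in [0, 1)")
    return 1.0 / (1.0 - transfer_fraction)


for label, fraction in (("GPU, worst case", 0.65), ("FPGA, worst case", 0.64)):
    print(f"{label}: up to {speedup_without_transfer(fraction):.2f}x")
```

By the same formula, a ~2x gain corresponds to transfers taking about half of the execution time, so the 64% and 65% figures are best read as worst cases.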

28 Results: Energy Comparison ● Energy consumed for one frame

29 Results: Energy Comparison ● Theoretical wattage for 30 fps:
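The chart for this slide is not in the transcript, but the conversion from energy per frame to sustained power is plain arithmetic: power in watts equals joules per frame times frames per second. The sketch below uses made-up placeholder energies, not the paper's measurements, and `power_watts` is a hypothetical helper.

```python
def power_watts(joules_per_frame, fps=30):
    """Average power needed to sustain `fps` frames per second."""
    return joules_per_frame * fps


# Placeholder values for illustration only; not results from the study.
example_energy = {"device A": 0.5, "device B": 2.0}
for device, joules in example_energy.items():
    print(f"{device}: {power_watts(joules):.1f} W at 30 fps")
```

This only applies when the device can actually process a frame in 1/30 s or less; otherwise 30 fps is not reachable regardless of power.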

30 Results: Energy Comparison ● Example application: embedded system using correntropy for target tracking

31 Conclusions ● Compared performance and energy requirements of sliding-window applications across a variety of devices and use cases ● FPGAs were faster except for small inputs ● FPGAs had lower power requirements ● Consistency of results suggests applicability to other sliding-window applications

32 Questions?
