
1 Fast and Efficient Implementation of Convolutional Neural Networks on FPGA
Abhinav Podili, Chi Zhang, Viktor Prasanna
Ming Hsieh Department of Electrical Engineering, University of Southern California
{podili, zhan527, fpga.usc.edu
ASAP, July 2017

2 Convolutional Neural Networks
Convolutional layer (followed by a ReLU and a pooling layer)
Fully connected layer

3 Motivation
The convolutional layers account for about 90% of the total CNN computation time, so accelerating convolution => accelerating the CNN. Reducing the computational complexity of the convolutional layers yields higher power efficiency and higher throughput per DSP.

Layer                    AlexNet (GOP)   VGG16 (GOP)
Convolutional layers     1.52            30.7
Fully connected layers   0.12            0.5

4 Related Work
Dynamic precision data quantization: J. Qiu et al., Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proc. ACM FPGA, 2016.
Convolutional layer computation reduction: C. Zhang et al., Frequency Domain Acceleration of CNNs on CPU-FPGA Shared Memory System. In Proc. ACM FPGA, 2017.
Automatic code generation: H. Sharma et al., From High-Level Deep Neural Models to FPGAs. In Proc. IEEE/ACM International Symposium on Microarchitecture, 2016.
Model compression: S. Han et al., Efficient Inference Engine on Compressed Deep Neural Network. In Proc. ISCA, 2016.

5 Goal
Fast inference: reduce the latency of inference on the FPGA.
Efficient inference: reduce the power and resource consumption of inference.
Achieve state-of-the-art throughput using mid-range devices.
[Diagram: FPGA CNN design relating output class, latency, and memory]

6 Results Summary

Metric (VGG16)                           [1]      Our work
Power (W)                                9.63     8.04
Resource efficiency (GOP/s/multiplier)   0.24     0.89
Memory efficiency (GOP/s/MB)             88.17    208.3
Power efficiency (GOP/s/W)               19.5     28.5

[1] J. Qiu et al., Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proc. ACM FPGA, 2016.

7 Main Idea (1)
Reduce the number of multiplications required for convolution by using the Winograd minimal filtering algorithm F(m = 2, r = 3), where m is the number of outputs, r is the size of the filter, d_x is the input feature-map data, and g_x is the filter data:

$$F(2,3):\quad \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} m_1 + m_2 + m_3 \\ m_2 - m_3 - m_4 \end{bmatrix}$$

$$m_1 = (d_0 - d_2)\,g_0 \qquad m_2 = (d_1 + d_2)\,\tfrac{g_0 + g_1 + g_2}{2} \qquad m_3 = (d_2 - d_1)\,\tfrac{g_0 - g_1 + g_2}{2} \qquad m_4 = (d_1 - d_3)\,g_2$$

Four multiplications (m_1 through m_4) replace the six of direct evaluation.
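A minimal NumPy sketch of F(2, 3) (our illustration; the paper's engine is an FPGA datapath, not software) makes the count concrete: four multiplications produce the two outputs, verified against direct evaluation. Note that the terms (g0 ± g1 + g2)/2 depend only on the filter, so they can be precomputed once per kernel rather than paid per tile.

```python
import numpy as np

def winograd_f2_3(d, g):
    """F(2, 3): two outputs of a 3-tap filter using 4 multiplications
    (direct evaluation uses 6)."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.rand(4)   # input tile d0..d3
g = np.random.rand(3)   # filter g0..g2
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f2_3(d, g), direct)
```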

8 Main Idea (2)
2D minimal filtering algorithm: m×m outputs with an r×r filter are computed by nesting F(m, r) with itself:

$$F(m \times m,\, r \times r) = \begin{bmatrix} F(m,r)_{1,1} & \cdots & F(m,r)_{1,m} \\ \vdots & & \vdots \\ F(m,r)_{m,1} & \cdots & F(m,r)_{m,m} \end{bmatrix} \begin{bmatrix} g_{r,1} \\ \vdots \\ g_{r,r} \end{bmatrix}$$

The input tile size for F(m×m, r×r) is (m + r − 1)².

For F(m×m, r×r) = F(2×2, 3×3):

                  Winograd algorithm   Spatial convolution
Multiplications   16                   36
Additions         77                   32
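The nesting is commonly realized with the input, filter, and output transform matrices of Lavin and Gray, "Fast Algorithms for Convolutional Neural Networks" (2015). The slide does not show them, so the sketch below is our reconstruction of F(2×2, 3×3) in NumPy: it performs exactly the 16 elementwise multiplications counted above, checked against direct spatial convolution.

```python
import numpy as np

# Transform matrices for F(2x2, 3x3) (Lavin & Gray, 2015).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """2x2 output tile of a 3x3 convolution over a 4x4 input tile."""
    U = G @ g @ G.T       # transformed filter (precomputable once per kernel)
    V = B_T @ d @ B_T.T   # transformed input tile
    M = U * V             # the 16 elementwise multiplications
    return A_T @ M @ A_T.T

def direct_2x2_3x3(d, g):
    """Reference: spatial 3x3 convolution producing the same 2x2 tile."""
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                     for i in range(2)])

d, g = np.random.rand(4, 4), np.random.rand(3, 3)
assert np.allclose(winograd_2x2_3x3(d, g), direct_2x2_3x3(d, g))
```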

9 Algorithm Complexity

$$\text{Multiplication complexity} = \frac{(m+r-1)^2}{m^2}\cdot H\cdot W\cdot C$$

$$\text{Addition complexity} = \frac{H\cdot W\cdot C}{m^2}\left[\,Mul\cdot Add_f + Add_f\cdot m + 2\cdot Mul\cdot Add_d\,\right]$$

H, W, C = height, width, and number of channels of the input feature map
Mul, Add_f, Add_d = number of multiplications, final additions, and data additions in F(m, r)

Percentage reduction of multiplications for various state-of-the-art models:

Model       % reduction of multiplications
AlexNet     53
VGG16       55
GoogLeNet   38
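The table's percentages follow from the multiplication-complexity formula. A small sketch: the spatial baseline of r²·H·W·C multiplications for direct convolution is our assumption, and the H·W·C factor cancels in the ratio.

```python
# Percentage reduction of multiplications for F(m, r) versus spatial
# convolution over the same H x W x C input. Winograd count: the slide's
# formula ((m+r-1)^2 / m^2) * H*W*C; spatial baseline (our assumption):
# r^2 * H*W*C. The H*W*C factor cancels.
def mult_reduction(m, r):
    winograd = (m + r - 1) ** 2 / m ** 2
    spatial = r ** 2
    return 1 - winograd / spatial

# Every convolutional layer of VGG16 uses 3x3 filters, so F(2x2, 3x3)
# applies throughout and the model-level number matches the layer-level one:
print(f"{mult_reduction(2, 3):.1%}")  # 55.6%, the ~55% reported above
```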

10 Multiplication Complexity
Reduced multiplication complexity leads to:
Lower power consumption
Lower DSP/logic consumption

11 Implementation (1)
A high-performance CNN design must:
Sustain peak computational throughput
Increase data reuse
Effectively utilize memory bandwidth
Effectively utilize on-chip memory

12 Implementation (2): Sustaining the peak computational throughput offered by the device
A 6-stage pipeline design utilizes resources fully to achieve high throughput.

13 Implementation (3): Effectively utilize the memory bandwidth
Design choices for where to parallelize, compared on memory bandwidth requirement, data reuse, and overhead of additions:

Parallelism across       Memory bandwidth requirement   Data reuse   Overhead of additions
Only kernel buffers      Less                           High
Only image buffers
Kernel & image buffers

14 Implementation (4)
High data reuse and parallelism across kernel buffers => peak throughput at a lower memory bandwidth requirement.

Data reuse for the kernel data = (number of computations) / (number of external memory accesses) ≥ 196 (VGG16)

Memory bandwidth needed to completely hide the data-access latency, where t1 is the time for the processing element (PE) to fetch data from external memory and t2 is the computation time (requiring t1 ≤ t2):

$$\text{Bandwidth} \geq 4 \cdot (m + r - 1) \cdot m \cdot f$$
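As a rough calculator for this bound, a sketch under stated assumptions: the slide does not define f or the units, so we take f to be the design's clock frequency in Hz and the leading 4 to be bytes per 32-bit fixed-point word, giving a bound in bytes per second.

```python
# Bandwidth >= 4 * (m + r - 1) * m * f, from the slide.
# Assumptions (ours, not stated on the slide): f is the clock frequency
# in Hz, and the factor 4 is bytes per 32-bit fixed-point word.
def min_bandwidth_gb_s(m, r, f_hz):
    return 4 * (m + r - 1) * m * f_hz / 1e9

# F(2x2, 3x3) at this design's 200 MHz operating frequency:
print(min_bandwidth_gb_s(2, 3, 200e6))  # 6.4 GB/s
```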

15 Implementation (5): Effectively use on-chip memory
On-chip memory consumed depends on the number of buffers and the size of the buffers:
Storage optimization for the image buffer targets the number of image buffers
Storage optimization for the kernel buffer targets the size of the kernel buffers

16 Data Layout
Data layout for the VGG16 model with F(2×2, 3×3): only a 4×2×C tile has to be brought into on-chip memory instead of 4×4×C, so the memory bandwidth requirement is reduced by 2×. The sketch below illustrates the tile overlap behind this.
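The 2× saving comes from tile overlap: with F(2×2, 3×3), output tiles advance by m = 2 columns, so consecutive 4×4 input tiles share r − 1 = 2 columns. In this sketch (our illustration; the variable names and the 8×8 image are hypothetical), the shared 4×2 half is reused from on-chip storage and only the new 4×2 slice is fetched per step.

```python
import numpy as np

# One channel of an input feature map; H, W chosen only for the demo.
H, W = 8, 8
image = np.arange(H * W, dtype=np.float32).reshape(H, W)

row = 0
tile = image[row:row + 4, 0:4].copy()  # first tile: a full 4x4 fetch
for col in range(2, W - 3, 2):
    # Only a 4x2 slice is new; the other 4x2 is reused from the last tile.
    new_slice = image[row:row + 4, col + 2:col + 4]
    tile = np.hstack([tile[:, 2:], new_slice])
    assert np.array_equal(tile, image[row:row + 4, col:col + 4])
```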

17 Overall System Design
Number of memory accesses: reduced
Tile-based output accumulation: eliminates the need for adder trees (see the sketch below)
Convolution and pooling layers are pipelined: hides the latency of the pooling layer
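A sketch of our reading of tile-based output accumulation (conv_tile is a hypothetical stand-in for the engine's per-channel Winograd tile output): each channel's 2×2 partial result is added into the same output-tile accumulator as channels stream through, so no adder tree spanning all C channels is needed.

```python
import numpy as np

def conv_tile(d, g):
    """2x2 outputs of a 3x3 convolution over one channel's 4x4 tile."""
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                     for i in range(2)])

C = 64                           # input channels of the layer
d = np.random.rand(C, 4, 4)      # one 4x4 input tile per channel
g = np.random.rand(C, 3, 3)      # the kernel's per-channel 3x3 filters

y = np.zeros((2, 2))
for c in range(C):               # channels stream through the engine
    y += conv_tile(d[c], g[c])   # accumulate in place: no adder tree

# Reference: sum over channels computed in one shot.
ref = np.array([[np.sum(d[:, i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(y, ref)
```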

18 Target Platform
Intel QuickAssist QPI FPGA Platform (Intel Heterogeneous Architecture Research Platform v1):
10-core Intel Xeon E v2 processor
Altera Stratix V FPGA: 6.25 MB BRAM, 234,720 ALMs, 256 DSPs
CPU and FPGA share a 2 GB address space
The FPGA accesses data over QPI from the last-level cache of the CPU memory system

19 Experimental Results (1)
Metric (VGG16)          [1]            Our work
Data precision          16-bit fixed   32-bit fixed
Frequency (MHz)         150            200
Memory (MB)             2.13           1.05
Number of multipliers   780            256
Overall delay (ms)      163.4          142.3
Throughput (GOP/s)      187.8          229.2

Overall delay is the sum of the delays of all convolutional layers. Throughput is the overall number of operations divided by the overall latency, where the number of operations is the number of additions plus the number of multiplications; the operation counts of spatial convolution and the Winograd algorithm are almost the same. Our design achieves 229.2 GOP/s, whereas [1] achieved 187.8 GOP/s.

[1] J. Qiu et al., Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proc. ACM FPGA, 2016.

20 Experimental Results (2)
Metric (VGG16)                           [1]      Our work
Power (W)                                9.63     8.04
Resource efficiency (GOP/s/multiplier)   0.24     0.89
Memory efficiency (GOP/s/MB)             88.17    208.3
Power efficiency (GOP/s/W)               19.5     28.5

[1] J. Qiu et al., Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proc. ACM FPGA, 2016.

21 Scalability
The design scales linearly with the number of multipliers.

22 Choice of Parameter m

Metric                      Small m   Large m
Latency of inference        High      Less
Bandwidth requirement       Low       High
Loss of accuracy            No        Yes
Overhead of additions       Low       High
Number of multiplications   High      Low

23 FFT or Winograd or Spatial

Storage size for filters: FFT O(n²); Winograd O(k²); spatial O(k²)
Overhead of operations: FFT high for small k, low for large k; Winograd low for small k, high for large k; spatial convolution has no overhead
On-chip memory and memory bandwidth requirement: FFT high; Winograd low
Degree of parallelism: restricted by on-chip memory

n, k = size of the input feature map and of the filter

24 Conclusion
A high-throughput, resource- and energy-efficient implementation of CNN inference is possible using a Winograd convolution engine.
The convolution and pooling layers can be pipelined with the Winograd convolution engine.
Performance improvements will be even higher with reduced-precision arithmetic.

25 Thank You fpga.usc.edu

