Fast and Efficient Implementation of Convolutional Neural Networks on FPGA
Abhinav Podili, Chi Zhang, Viktor Prasanna
Ming Hsieh Department of Electrical Engineering, University of Southern California
{podili, zhan527, prasanna}@usc.edu | fpga.usc.edu
ASAP, July 2017
Convolutional Neural Networks
- Convolutional layer (followed by ReLU and pooling layer)
- Fully connected layer
Motivation
- ~90% of the total computation time of a CNN is spent in the convolutional layers
- Accelerating convolution => accelerating the CNN
- Reducing the convolutional-layer computation complexity gives:
  - Higher power efficiency
  - Higher throughput per DSP

Layer                  AlexNet (GOP)  VGG16 (GOP)
Convolutional layer    1.52           30.7
Fully connected layer  0.12           0.5
Related Work
- Dynamic-precision data quantization: J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM FPGA, 2016
- Convolutional-layer computation reduction: C. Zhang et al., "Frequency Domain Acceleration of CNNs on CPU-FPGA Shared Memory System," Proc. ACM FPGA, 2017
- Automatic code generation: H. Sharma et al., "From High-Level Deep Neural Models to FPGAs," IEEE/ACM International Symposium on Microarchitecture, 2016
- Model compression: S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," ISCA, 2016
Goal
- Fast inference: reduce the latency of inference on FPGA
- Efficient inference: reduce the power and resource consumption of inference
- Achieve state-of-the-art throughput using mid-range devices
(Diagram: CNN design mapped onto the FPGA under latency and memory constraints, producing the output class)
Results Summary (VGG16)

Metric                                  [1]     Our work
Power (W)                               9.63    8.04
Resource efficiency (GOP/s/multiplier)  0.24    0.89
Memory efficiency (GOP/s/MB)            88.17   208.3
Power efficiency (GOP/s/W)              19.5    28.5

Notes: Overall delay is the sum of the delays of all convolutional layers. Our design achieves 229.2 GOP/s versus 187.8 GOP/s for [1]. The number of operations is the number of additions plus the number of multiplications, and throughput is the total number of operations divided by the overall latency; the operation counts of spatial convolution and the Winograd algorithm are almost the same.

[1] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM FPGA, 2016
Main Idea (1)
Reduce the number of multiplications required for convolution by using the Winograd minimal filtering algorithm F(m = 2, r = 3):
  m = number of outputs, r = size of the filter
  d(x) = input feature-map data, g(x) = filter data

  [d0 d1 d2]   [g0]   [m0 + m1 + m2]
  [d1 d2 d3] * [g1] = [m1 - m2 - m3]
               [g2]

  where m0 = (d0 - d2) g0, m1 = (d1 + d2)(g0 + g1 + g2)/2,
        m2 = (d2 - d1)(g0 - g1 + g2)/2, m3 = (d1 - d3) g2

Two outputs are produced with 4 multiplications instead of the 6 a direct convolution needs.
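As a sanity check, the F(2, 3) minimal filtering algorithm can be sketched in Python against a plain sliding-dot-product reference (a minimal illustration; the function names are ours, not from the paper):

```python
# 1D Winograd minimal filtering F(2, 3): 2 outputs of a 3-tap filter
# using 4 multiplications (a direct sliding dot product needs 6).
def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    m0 = (d[0] - d[2]) * g[0]
    m1 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m2 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m3 = (d[1] - d[3]) * g[2]
    return [m0 + m1 + m2, m1 - m2 - m3]

def direct_conv(d, g):
    """Reference spatial convolution: 6 multiplications for the same 2 outputs."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
```

In a hardware setting the filter-side expressions (g0 + g1 + g2)/2 etc. are precomputed once per kernel, so only the 4 products per output pair remain on the critical path.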
Main Idea (2)
2D minimal filtering: the m*m outputs of an r*r filter are computed by nesting F(m, r) -- F(m, r) is applied along the rows of the input tile, and then again along the columns of the row results.
The input tile size for F(m*m, r*r) is (m + r - 1)^2.

For F(2*2, 3*3):
                 Winograd algorithm  Spatial convolution
Multiplications  16                  36
Additions        77                  32
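A pure-Python sketch of F(2*2, 3*3) using the standard Winograd transform matrices (Y = A^T[(G g G^T) . (B^T d B)]A, "." elementwise) -- these matrices come from the usual F(2x2, 3x3) construction, not from the paper's code:

```python
# Standard F(2x2, 3x3) transform matrices: 16 elementwise multiplications
# for a 2x2 output tile instead of the 36 a spatial 3x3 convolution needs.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G  = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile."""
    U = matmul(matmul(G, g), transpose(G))    # transformed filter, 4x4
    V = matmul(matmul(BT, d), transpose(BT))  # transformed data tile, 4x4
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # the 16 multiplies
    return matmul(matmul(AT, M), transpose(AT))

def spatial_3x3(d, g):
    """Reference spatial convolution on the same 4x4 tile."""
    return [[sum(d[i + u][j + v] * g[u][v] for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]
```

The filter transform U is computed once per kernel and reused across all input tiles, which is where most of the data reuse in the hardware design comes from.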
Algorithm Complexity
Multiplication complexity = ((m + r - 1)^2 / m^2) * H * W * C
Addition complexity = (H * W * C / m^2) * [Mul * Add_f + Add_f * m + 2 * Mul * Add_d]

  H, W, C = height, width, and number of channels of the input feature map
  Mul = number of multiplications; Add_d, Add_f = number of data and final additions in F(m, r)

Percentage reduction in multiplications for state-of-the-art models:

Model      % reduction of multiplications
AlexNet    53
VGG16      55
GoogLeNet  38
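The per-tile multiplication counts behind the formula above can be checked with a short sketch (per input channel, per m*m output tile; the function names are illustrative):

```python
# Multiplications per m*m output tile, per input channel:
# spatial convolution needs m^2 * r^2, Winograd F(m*m, r*r) needs (m+r-1)^2.
def spatial_mults(m, r):
    return m * m * r * r

def winograd_mults(m, r):
    return (m + r - 1) ** 2

def reduction_pct(m, r):
    return 100.0 * (1 - winograd_mults(m, r) / spatial_mults(m, r))
```

For F(2*2, 3*3), reduction_pct(2, 3) is about 55.6%, consistent with the ~55% reported for VGG16, whose convolutional layers are all 3x3; AlexNet and GoogLeNet see smaller overall reductions because they mix other filter sizes.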
Multiplication Complexity
Reduced multiplication complexity leads to:
- Lower power consumption
- Lower DSP/logic consumption
Implementation (1)
A high-performance CNN design must:
- Sustain peak computational throughput
- Increase data reuse
- Effectively utilize memory bandwidth
- Effectively utilize on-chip memory
Implementation (2)
Sustaining the peak computational throughput offered by the device:
- A 6-stage pipeline design utilizes resources fully to achieve high throughput
Implementation (3)
Effectively utilize the memory bandwidth. The design choices -- parallelism across only the kernel buffers, only the image buffers, or both kernel and image buffers -- are compared on memory bandwidth requirement, data reuse, and overhead of additions; parallelism across only the kernel buffers needs less memory bandwidth and gives high data reuse.
Implementation (4)
High data reuse and parallelism across the kernel buffers give peak throughput at a lower memory bandwidth requirement.
Data reuse for the kernel data = (no. of computations) / (no. of external memory accesses) >= 196 (VGG16)
Memory bandwidth needed to completely hide the data-access latency (t1 = external-memory access time, t2 = P.E. compute time; require t1 <= t2):

  Bandwidth >= 4 * (m + r - 1) * m * f
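The bandwidth bound can be evaluated numerically -- a sketch assuming 4-byte words (matching the design's 32-bit fixed-point data) and f given in Hz as the operating frequency:

```python
# Minimum external-memory bandwidth (bytes/s) to hide data-access latency:
# Bandwidth >= 4 * (m + r - 1) * m * f, i.e. an (m+r-1) x m block of
# 4-byte words must arrive every cycle at operating frequency f.
def min_bandwidth(m, r, f_hz, word_bytes=4):
    return word_bytes * (m + r - 1) * m * f_hz
```

At m = 2, r = 3 and the reported 200 MHz clock, this gives 6.4 GB/s.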
Implementation (5)
Effectively use on-chip memory. On-chip memory consumption depends on:
- The number of buffers
- The size of the buffers
Optimizations:
- Storage optimization for the image buffers reduces the number of image buffers
- Storage optimization for the kernel buffers reduces the size of the kernel buffers
Data Layout
Data layout for the VGG16 model with F(2*2, 3*3):
- Only a 4*2*C tile has to be brought into on-chip memory instead of 4*4*C
- The memory bandwidth requirement is reduced by 2x
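The 2x saving can be illustrated by counting fetched columns along one feature-map row: consecutive 4-wide input tiles overlap by r - 1 = 2 columns, and the overlap is kept on chip. This is a 1D illustration under that assumption, not the paper's buffer logic:

```python
def fetched_columns(width, m=2, r=3):
    """External-memory columns fetched for one row of output tiles.

    naive:  re-fetch the full (m+r-1)-wide input tile for every output tile
    reused: fetch the first tile fully, then only the m new columns per tile
    """
    tile = m + r - 1
    n_tiles = (width - tile) // m + 1
    naive = n_tiles * tile
    reused = tile + (n_tiles - 1) * m
    return naive, reused
```

For a 224-wide VGG16 input row, fetched_columns(224) gives 444 columns naively versus 224 with reuse -- roughly the 2x reduction stated on this slide.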
Overall System Design
- Number of memory accesses: reduced
- Tile-based output accumulation: eliminates the need for adder trees
- Convolution and pooling layers are pipelined: hides the latency of the pooling layer
Target Platform
Intel QuickAssist QPI FPGA Platform (Intel Heterogeneous Architecture Research Platform v1):
- 10-core Intel Xeon E5-2600 v2 processor
- Altera Stratix V FPGA: 6.25 MB BRAM, 234,720 ALMs, 256 DSPs
- CPU and FPGA share a 2 GB address space
- The FPGA can access data through QPI from the last-level cache of the CPU's memory system
Experimental Results (1) -- VGG16

Metric              [1]           Our work
Data precision      16-bit fixed  32-bit fixed
Frequency (MHz)     150           200
Memory (MB)         2.13          1.05
# Multipliers       780           256
Overall delay (ms)  163.4         142.3
Throughput (GOP/s)  187.8         229.2

[1] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM FPGA, 2016
Experimental Results (2) -- VGG16

Metric                                  [1]     Our work
Power (W)                               9.63    8.04
Resource efficiency (GOP/s/multiplier)  0.24    0.89
Memory efficiency (GOP/s/MB)            88.17   208.3
Power efficiency (GOP/s/W)              19.5    28.5

[1] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM FPGA, 2016
Scalability
Linearly scalable with the number of multipliers
Choice of Parameter (m)

                        Small m  Large m
Latency of inference    High     Less
Bandwidth requirement   Less     High
Loss of accuracy        No       Yes
Overhead of additions   Less     High
No. of multiplications  High     Less
FFT vs. Winograd vs. Spatial

Metric                        FFT approach        Winograd approach   Spatial convolution
Storage size for filters      O(n^2)              O(k^2)              O(k^2)
Overhead of operations        High (for small k)  Low (for small k)   No overhead
                              Low (for large k)   High (for large k)
On-chip memory & memory
bandwidth requirement         High                Low                 -

The degree of parallelism is restricted by the on-chip memory.
n, k = size of the input feature map and the filter
Conclusion
- High-throughput, resource- and energy-efficient CNN inference is possible using a Winograd convolution engine
- The convolution and pooling layers can be pipelined using the Winograd convolution engine
- Performance improvements will be even higher with reduced-precision arithmetic
Thank You fpga.usc.edu