Abhinav Podili, Chi Zhang, Viktor Prasanna


Fast and Efficient Implementation of Convolutional Neural Networks on FPGA
Abhinav Podili, Chi Zhang, Viktor Prasanna
Ming Hsieh Department of Electrical Engineering, University of Southern California
{podili, zhan527, prasanna}@usc.edu | fpga.usc.edu
ASAP, July 2017

Convolutional Neural Networks
- Convolutional layers (each followed by a ReLU and a pooling layer)
- Fully connected layers

Motivation
The convolutional layers account for about 90% of the total CNN computation time, so accelerating convolution accelerates the CNN as a whole. Reducing the computational complexity of the convolutional layers yields:
- Higher power efficiency
- Higher throughput per DSP

Workload (GOP)           AlexNet   VGG16
Convolutional layers     1.52      30.7
Fully connected layers   0.12      0.5

Related Work
- Dynamic-precision data quantization: J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM FPGA, 2016.
- Convolutional-layer computation reduction: C. Zhang et al., "Frequency Domain Acceleration of CNNs on CPU-FPGA Shared Memory System," Proc. ACM FPGA, 2017.
- Automatic code generation: H. Sharma et al., "From High-Level Deep Neural Models to FPGAs," IEEE/ACM Int. Symp. on Microarchitecture, 2016.
- Model compression: S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," ISCA, 2016.

Goal
- Fast inference: reduce the latency of inference on FPGA
- Efficient inference: reduce the power and resource consumption of inference
- Achieve state-of-the-art throughput using mid-range devices

Results Summary (VGG16 convolution layers)

Metric                                   [1]     Our work
Power (W)                                9.63    8.04
Resource efficiency (GOP/s/multiplier)   0.24    0.89
Memory efficiency (GOP/s/MB)             88.17   208.3
Power efficiency (GOP/s/W)               19.5    28.5

Speaker notes: Overall delay is the sum of the delays of all convolution layers. The throughput of our work is 229.2 GOP/s, whereas [1] achieved 187.8 GOP/s. The number of operations is counted as the number of additions plus the number of multiplications, and throughput is the overall number of operations divided by the overall latency. The operation counts for spatial convolution and the Winograd algorithm are almost the same.

[1] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM FPGA, 2016.

Main Idea (1)
Reduce the number of multiplications required for convolution by using the Winograd minimal filtering algorithm F(m = 2, r = 3), where m is the number of outputs, r is the size of the filter, d_x is the input feature-map data, and g_x is the filter data:

           [ d0 d1 d2 ] [ g0 ]   [ m0 + m1 + m2 ]
F(2, 3) =  [ d1 d2 d3 ] [ g1 ] = [ m1 - m2 - m3 ]
                        [ g2 ]
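The 1D algorithm above can be written out directly. Below is a minimal Python sketch (not the authors' FPGA implementation) of F(2, 3) using the standard Winograd intermediate products m0..m3, checked against direct convolution; it needs 4 multiplications per output pair instead of 6:

```python
# Winograd minimal filtering F(2, 3): 2 outputs of a 3-tap filter from a
# 4-element input tile, with 4 multiplications instead of 6.

def winograd_f23(d, g):
    """d: input tile [d0, d1, d2, d3]; g: filter [g0, g1, g2]."""
    m0 = (d[0] - d[2]) * g[0]
    m1 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m2 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m3 = (d[1] - d[3]) * g[2]
    return [m0 + m1 + m2, m1 - m2 - m3]

def direct_conv(d, g):
    """Reference: the same 2 outputs via direct (spatial) convolution."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

print(winograd_f23([1, 2, 3, 4], [1, 1, 1]))  # -> [6.0, 9.0]
```

Note that the filter-dependent factors in m1 and m2 can be precomputed once per filter, so at run time only the multiplications by the transformed data remain.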

Main Idea (2)
2D minimal filtering algorithm: an m×m output tile with an r×r filter is computed by nesting F(m, r) with itself, applying F(m, r) across the filter rows g_{r,1} ... g_{r,r} and then across the results. The input tile size for F(m*m, r*r) is (m + r − 1)².

F(2*2, 3*3):       Winograd algorithm   Spatial convolution
Multiplications    16                   36
Additions          77                   32
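The nested 2D algorithm is usually written in the transform form Y = Aᵀ[(G g Gᵀ) ⊙ (Bᵀ d B)]A, with the standard F(2×2, 3×3) transform matrices; the sketch below (an illustration, not the authors' hardware datapath) shows the 16 element-wise multiplications against the 36 of spatial convolution:

```python
# F(2*2, 3*3): a 2x2 output tile from a 4x4 input tile with 16
# element-wise multiplications (the U * V product), versus 36 for
# spatial convolution.
import numpy as np

G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])       # filter transform
Bt = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]])        # data transform
At = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]])       # inverse transform

def winograd_f22_33(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile."""
    U = G @ g @ G.T                   # transformed filter (4x4)
    V = Bt @ d @ Bt.T                 # transformed input tile (4x4)
    return At @ (U * V) @ At.T        # only U * V multiplies element-wise

def spatial_conv(d, g):
    """Reference 2x2 'valid' correlation with 36 multiplications."""
    return np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                     for i in range(2)])
```

As in the 1D case, G g Gᵀ is computed once per filter, so the per-tile cost is the data transform, the 16 multiplications, and the inverse transform.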

Algorithm Complexity
Multiplication complexity = ((m + r − 1)² / m²) · H · W · C
Addition complexity = (H · W · C / m²) · (Mul · Add_f + Add_f · m + 2 · Mul · Add_d)

where H, W, C are the height, width, and number of channels of the input feature map, and Mul, Add_d, Add_f are the numbers of multiplications, data additions, and final additions in F(m, r).

Percentage reduction of multiplications for various state-of-the-art models:
Model       % reduction of multiplications
AlexNet     53
VGG16       55
GoogleNet   38
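The multiplication term can be sanity-checked per output: Winograd costs (m + r − 1)²/m² multiplications per output versus r² for spatial convolution. For 3×3 filters with F(2×2, 3×3) that is 4 versus 9, roughly a 56% reduction per layer, which is consistent with the ~55% reported above for VGG16 (an all-3×3 network) once boundary tiles are accounted for:

```python
# Per-output multiplication cost: Winograd (m+r-1)^2/m^2 vs spatial r^2.

def mult_reduction(m, r):
    winograd = (m + r - 1) ** 2 / m ** 2   # mults per output, nested 2D
    spatial = r ** 2                       # mults per output, direct
    return 100 * (1 - winograd / spatial)  # percent saved

print(round(mult_reduction(2, 3)))  # -> 56 (all-3x3 case, e.g. VGG16)
```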

Multiplication Complexity
Reduced multiplication complexity leads to:
- Lower power consumption
- Lower DSP/logic consumption

Implementation (1)
A high-performance CNN design must:
- Sustain peak computational throughput
- Increase data reuse
- Effectively utilize memory bandwidth
- Effectively utilize on-chip memory

Implementation (2)
Sustaining the peak computational throughput offered by the device: a 6-stage pipelined design fully utilizes the resources to achieve high throughput.

Implementation (3)
Effectively utilize the memory bandwidth. The design choice is where to apply parallelism, trading off bandwidth, data reuse, and addition overhead:

Parallelism across       Memory bandwidth requirement   Data reuse   Overhead of additions
Only kernel buffers      Less                           High
Only image buffers
Kernel & image buffers

Implementation (4)
High data reuse and parallelism across kernel buffers achieve peak throughput at a lower memory-bandwidth requirement:

Data reuse for the kernel data = (no. of computations) / (no. of external memory accesses) ≥ 196 (VGG16)

Memory bandwidth needed to completely hide the data-access latency (external-memory transfer time t1 must not exceed PE compute time t2, i.e. t1 ≤ t2):

Bandwidth ≥ 4 · (m + r − 1) · m · f
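As a back-of-the-envelope evaluation of the bound above: reading the leading factor of 4 as bytes per word (our assumption; the slide does not spell it out), with m = 2, r = 3 and the 200 MHz clock reported later, the external link must sustain roughly 6.4 GB/s to keep the pipeline from stalling:

```python
# Minimum external bandwidth from the slide's bound
# Bandwidth >= 4 * (m + r - 1) * m * f, interpreting 4 as bytes/word.

def min_bandwidth_bytes_per_s(m, r, f_hz):
    return 4 * (m + r - 1) * m * f_hz

print(min_bandwidth_bytes_per_s(2, 3, 200e6) / 1e9)  # -> 6.4 (GB/s)
```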

Implementation (5)
Effectively use on-chip memory. The on-chip memory consumed depends on the number of buffers and the size of the buffers:
- Image buffers: storage optimization reduces the number of image buffers
- Kernel buffers: storage optimization reduces the size of the kernel buffers

Data Layout
Data layout for the VGG16 model with F(2*2, 3*3): only a 4*2*C tile has to be brought into on-chip memory instead of a 4*4*C tile, so the memory-bandwidth requirement is reduced by 2×.

Overall System Design
- Number of memory accesses: reduced
- Tile-based output accumulation: eliminates the need for adder trees
- Convolution and pooling layers are pipelined: hides the latency of the pooling layer

Target Platform
Intel QuickAssist QPI FPGA Platform (Intel Heterogeneous Architecture Research Platform v1):
- 10-core Intel Xeon E5-2600 v2 processor
- Altera Stratix V FPGA: 6.25 MB BRAM, 234,720 ALMs, 256 DSPs
- CPU and FPGA share a 2 GB address space
- The FPGA can access data over QPI from the last-level cache of the CPU's memory system

Experimental Results (1)

VGG16 conv layers (metric)   [1]            Our work
Data precision               16-bit fixed   32-bit fixed
Frequency (MHz)              150            200
Memory (MB)                  2.13           1.05
# Multipliers                780            256
Overall delay (ms)           163.4          142.3
Throughput (GOP/s)           187.8          229.2

[1] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM FPGA, 2016.
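The speaker notes define throughput as total operations (additions plus multiplications) divided by overall latency. A toy check of that formula against the table, with the ~32.6 GOP total back-derived from the reported throughput and delay (our derivation, not a figure stated on the slide):

```python
# Throughput per the speaker notes: total operations / overall latency.

def throughput_gops(total_ops_gop, latency_ms):
    return total_ops_gop / (latency_ms / 1000.0)

# ~32.6 GOP of convolution work over 142.3 ms gives ~229 GOP/s,
# matching the reported throughput to rounding.
print(throughput_gops(32.6, 142.3))
```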

Experimental Results (2)

VGG16 conv layers (metric)               [1]     Our work
Power (W)                                9.63    8.04
Resource efficiency (GOP/s/multiplier)   0.24    0.89
Memory efficiency (GOP/s/MB)             88.17   208.3
Power efficiency (GOP/s/W)               19.5    28.5

[1] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. ACM FPGA, 2016.

Scalability
Performance scales linearly with the number of multipliers.

Choice of Parameter (m)

                         Small m   Large m
Latency of inference     High      Less
Bandwidth requirement
Loss of accuracy         No        Yes
Overhead of additions
No. of multiplications

FFT or Winograd or Spatial

                               FFT approach       Winograd approach   Spatial convolution
Storage size for filters       O(n²)                                  O(k²)
Overhead of operations         High (small k),    Low (small k),      No overhead
                               Low (large k)      High (large k)
On-chip memory & memory        High               Low
bandwidth requirement
Degree of parallelism          (restricted by on-chip memory)

n, k = size of the input feature map and of the filter, respectively

Conclusion
- A high-throughput, resource- and energy-efficient implementation of CNN inference is possible using a Winograd convolution engine.
- The convolution and pooling layers can be pipelined with the Winograd convolution engine.
- Performance improvements will be even higher with reduced-precision arithmetic.

Thank You fpga.usc.edu