Optimizing FPGA Accelerator Design for Deep Convolutional Neural Networks By: Mohamad Kanafanai
Outline Introduction Background Methodology Results Evaluation of the system Criticism Q&A
Introduction CNN is an extension of the artificial neural network Applications include image processing Requires high-performance computation hardware Design exploration is a must! -Well-known learning architecture (machine learning) -Used mainly in imaging: object detection, surveillance, robot vision -Due to its computation pattern, CNN requires very high-performance hardware; a general-purpose processor is not adequate -Candidate platforms: FPGA, GPU, ASIC -There can be a performance difference of 90% between two solutions on the same FPGA -Concerns are memory, bandwidth, and utilization -The problem got harder with newer FPGAs, since size and performance keep increasing
What is a Deep Convolutional Neural Network? Type of machine learning 8 steps Limitations Feed-forward computation -Used to detect objects, commonly faces -32x32 images only Steps: Extract the Y channel (black and white; raw RGB or YUV doesn't work) Normalize* 1- Filter (convolution layer), learned, e.g. edge detection (spatial convolution); a bigger layer captures more details 2- Hyperbolic tangent (tanh) to increase accuracy; Rectified Linear Units are an alternative 3- Pooling (max or average): averages the feature data over a region of the image, ensures the data stays consistent **4- Sub-sampling to get features 5- Apply filter (convolution map): random input to a random (trained) filter 6- Get a result in the range -1 to 1 for one objective (face or background)
Roofline Model Provides a graphical representation of performance and productivity Rates and efficiencies (GFLOPS, % of peak) Limitations Benefits Focus: computation, communication, locality Not for fine tuning; loop-based modeling Limiting factor: execution vs. data transfer Goals of the roofline model: Provide everyone with a graphical aid that gives realistic expectations of performance and productivity Show inherent hardware limitations for a given kernel Show the potential benefit and priority of optimizations -Shows limitations -Good for comparison
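The roofline limit described on this slide reduces to one formula: attainable performance is the minimum of the compute roof and bandwidth times operational intensity (FLOPs per byte moved). A minimal sketch, with illustrative numbers of my own:

```c
/* Roofline model: performance is capped either by the peak compute
 * rate (execution-limited) or by memory bandwidth times operational
 * intensity (data-transfer-limited), whichever is lower. */
double roofline_gflops(double peak_gflops, double bw_gbs,
                       double flops_per_byte)
{
    double bw_bound = bw_gbs * flops_per_byte;  /* bandwidth-limited rate */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

A kernel doing 4 FLOPs per byte on a 10 GB/s, 100 GFLOPS machine sits on the bandwidth slope at 40 GFLOPS; raise the intensity past 10 FLOPs/byte and the compute roof takes over.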
Types of data communication (with respect to a loop dimension) Irrelevant Independent Dependent
Double buffering Allows data transfer and computation to overlap (ping-pong buffers) Increases throughput
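The ping-pong pattern on this slide can be sketched sequentially in C. The load/compute stages below are hypothetical stand-ins (in the real design these would be DMA transfers and the PE array, and the two stages would run concurrently):

```c
#define N 4   /* elements per tile (illustrative) */

static void load_tile(int tile, float buf[N])   /* stand-in for a DMA load */
{
    for (int i = 0; i < N; i++) buf[i] = (float)(tile * N + i);
}

static float compute_tile(const float buf[N])   /* stand-in for the PE array */
{
    float s = 0.0f;
    for (int i = 0; i < N; i++) s += buf[i];
    return s;
}

/* Double buffering: while one buffer is being computed on, the next
 * tile is loaded into the other buffer, hiding transfer latency. */
float process_all(int tiles)
{
    float buf[2][N];
    float total = 0.0f;
    load_tile(0, buf[0]);                 /* prologue: fill first buffer */
    for (int t = 0; t < tiles; t++) {
        int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < tiles)
            load_tile(t + 1, buf[nxt]);   /* overlaps with compute in HW */
        total += compute_tile(buf[cur]);
    }
    return total;
}
```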
Main concerns Communication overhead Buffer management Bandwidth optimization Better utilization of the FPGA What is their goal? -Roofline model -Decrease external communication -Double buffering -Increase input bandwidth -Use the FPGA better -Data reuse -Speed
Design Exploration Computation-to-communication (CTC) ratio Loop scheduling Loop tile sizes -How they went about testing different designs -Plotted each design against the two factors (computational performance and CTC ratio) -Example
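The exploration on this slide amounts to scoring every candidate design by its roofline-attainable performance and keeping the best. A minimal sketch; the struct fields and numbers are illustrative assumptions, not the paper's actual design points:

```c
typedef struct {
    double compute_roof_gflops;  /* set by the unroll factors           */
    double ctc_ratio;            /* FLOPs per byte of external traffic,
                                    set by the tile sizes               */
} design_t;

/* Pick the design with the highest attainable performance under the
 * roofline: min(compute roof, CTC ratio x bandwidth). */
int best_design(const design_t *d, int n, double bw_gbs)
{
    int best = 0;
    double best_perf = 0.0;
    for (int i = 0; i < n; i++) {
        double bw_bound = d[i].ctc_ratio * bw_gbs;
        double perf = bw_bound < d[i].compute_roof_gflops
                    ? bw_bound : d[i].compute_roof_gflops;
        if (perf > best_perf) { best_perf = perf; best = i; }
    }
    return best;
}
```

Note how a design with a lower compute roof can still win if its CTC ratio keeps it off the bandwidth slope.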
Directives: Loop Pipelining Software pipelining Increases throughput
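In Vivado HLS, the pipelining directive lets successive loop iterations overlap so a new one starts every initiation interval (II) cycles. A small sketch (the loop itself is illustrative; the pragma follows Vivado HLS syntax and is simply ignored by an ordinary C compiler):

```c
/* With PIPELINE II=1, the synthesized hardware issues one multiply
 * per cycle instead of waiting for each iteration to finish. */
void scale(const float in[64], float out[64], float k)
{
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * k;
    }
}
```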
Directives: Loop Unrolling Maximizes parallel computation Dataflow design
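Unrolling replicates the loop body so several elements are processed per iteration; in hardware each copy becomes a parallel compute unit. A manual 4-way unroll of a dot product (my own example; it assumes n is a multiple of 4):

```c
/* Four independent accumulators expose four multiplies per iteration,
 * which HLS can map to four parallel multiplier units. */
float dot4(const float *a, const float *b, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* short final reduction */
}
```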
Directives: Loop Tiling Divides loops into smaller loops Ensures data stays in cache (on-chip memory) Great for data reuse
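Tiling splits one large loop into an outer loop over tiles and an inner loop within a tile, so each tile's working set fits in on-chip memory and gets reused before the next tile is fetched. A minimal sketch (tile size T is an illustrative choice; n is assumed a multiple of T):

```c
#define T 8   /* tile size, sized to fit local memory (illustrative) */

float sum_tiled(const float *x, int n)
{
    float total = 0.0f;
    for (int t0 = 0; t0 < n; t0 += T)       /* outer loop walks tiles   */
        for (int i = t0; i < t0 + T; i++)   /* inner loop stays in-tile */
            total += x[i];
    return total;
}
```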
Memory Optimization Polyhedral-based optimization Local memory promotion for irrelevant-type communication Data reuse -Polyhedral-based optimization is an optimization for nested loops -Using local memory for loops whose data is irrelevant to the loop index -Tiling for reuse
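Local memory promotion, as described above, copies an array whose accesses do not depend on a loop index ("irrelevant" communication) into on-chip memory once, instead of re-fetching it from external DRAM every iteration. A sketch with hypothetical names (KW and the access pattern are my own):

```c
#define KW 4   /* weight-buffer size (illustrative) */

/* The weights are irrelevant to the i loop, so one up-front copy into
 * a local buffer (BRAM in HLS) replaces n external reads. */
float apply_weights(const float *ext_weights, const float *in, int n)
{
    float local_w[KW];
    for (int j = 0; j < KW; j++)       /* single transfer before the loop */
        local_w[j] = ext_weights[j];

    float acc = 0.0f;
    for (int i = 0; i < n; i++)        /* weights reused, never re-read   */
        acc += local_w[i % KW] * in[i];
    return acc;
}
```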
Designed Model -Programmed with Vivado HLS (VHLS) on a Virtex-7 485T -Frequency: 100 MHz -A soft general-purpose processor issues commands over an AXI4 bus -A MicroBlaze helps with start-up, communication with the host CPU, and time measurement -DDR3 DRAM external to the FPGA
Detail of the final design Computation engine Two unrolled loops -Tree-shaped poly structure for better cross-layer design
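The compute engine's multiply stage feeding a tree-shaped reduction can be sketched in C. This is an illustrative 8-input version (PE count and names are my own); the adder tree needs only log2(PE) levels, shortening the critical path versus a sequential accumulate chain:

```c
#define PE 8   /* number of parallel multipliers (assumed) */

float mac_tree(const float x[PE], const float w[PE])
{
    float p[PE];
    for (int i = 0; i < PE; i++)   /* multiply stage: all in parallel */
        p[i] = x[i] * w[i];

    /* tree-shaped reduction: halve the live values each level */
    for (int stride = PE / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)
            p[i] += p[i + stride];
    return p[0];
}
```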
Results Virtex-7 at 100 MHz, as IP generated with Vivado HLS Intel Xeon E5 at 2.2 GHz with 15 MB cache Pre-synthesis report used for performance and exploration Test using a 256x256 RGB image (3 channels) Virtex-7 is a 28 nm part Xeon E5-2430 v2 (SR1AH): 6 cores, 2.5 GHz, 6 × 256 KB L2, 15 MB L3, 80 W, LGA 1356, 7.2 GT/s QPI, 3 × DDR3-1600, released Jan 9, 2014, $551
Evaluation of the system 17.42x speedup over the 1-thread general-purpose (CPU) implementation 4.8x speedup over the 16-thread CPU implementation 18.6 watts vs. 95 watts for the CPU 3.62x speedup over the ICCD 2013 design
My opinion The techniques used to optimize loops are well thought out It's a unique way of looking at an accelerator The memory enhancements offer great insight
Pitfalls of the claims Pre-cached data tests Evaluation metrics when comparing against other designs Only tested using one image Technology difference Claiming the design has the best utilization The CPU's wattage is estimated 1.8x performance density over the second best -The Virtex-7 family is based on a 28 nm design and is reported to deliver a two-fold system performance improvement at 50 percent lower power compared to previous-generation Virtex-6 devices. In addition, Virtex-7 doubles the memory bandwidth compared to previous-generation Virtex FPGAs, with 1866 Mbit/s memory interfacing performance and over two million logic cells. -The Virtex-6 family is built on a 40 nm process for compute-intensive electronic systems, and the company claims it consumes 15 percent less power and has 15 percent improved performance over competing 40 nm FPGAs.
Q&A
References
http://crd.lbl.gov/assets/pubs_presos/parlab08-roofline-talk.pdf
https://www.youtube.com/watch?v=n6hpQwq7Inw
http://en.wikipedia.org/wiki/Loop_tiling
http://en.wikipedia.org/wiki/Polytope_model
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, Jason Cong. Center for Energy-Efficient Computing and Applications, Peking University, China; Computer Science Department, University of California, Los Angeles, USA.