
1 Optimizing FPGA Accelerator Design for Deep Convolutional Neural Networks
By: Mohamad Kanafanai

2 Outline
Introduction
Background
Methodology
Results
Evaluation of the system
Criticism
Q&A

3 Introduction
CNN is an extension of the artificial neural network
Applications include image processing
Requires high-performance computation hardware
Design exploration is a must!
-CNNs are a well-known machine-learning architecture
-Used mainly in imaging: object detection, surveillance, robot vision
-Due to the CNN computation pattern, very high-performance hardware is required; a general-purpose processor is not adequate
-FPGA, GPU, or ASIC implementations can be used
-There can be a difference of up to 90% in performance between two FPGA solutions of the same CNN
-The main concerns are memory bandwidth and resource utilization
-The problem has become harder with newer FPGAs, since both size and performance have increased

4 What is a Deep Convolutional Neural Network?
A type of machine learning
Feed-forward computation
Limitations
-Used to detect objects, most commonly faces
-Accepts 32x32 images only
-Works on the luminance (Y) channel; raw RGB or full YUV input does not work (effectively black and white)
Steps:
1- Extract the Y channel and normalize
2- Convolution layer: apply learned filters, e.g., edge detection (spatial convolution); a bigger layer captures more detail
3- Apply a non-linearity: hyperbolic tangent (tanh) to increase accuracy, or Rectified Linear Units
4- Pooling (max or average): aggregates the feature data over a region of the image so the data size stays manageable
5- Subsampling to extract features
6- Apply further filters (convolution maps): inputs mapped to trained filters
7- Output: a result between -1 and 1 for each objective (e.g., face or background)
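The convolution, non-linearity, and subsampling steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the presentation's trained network: the Sobel-like kernel stands in for a learned filter, and the subsampling is a crude stride-2 decimation rather than a proper pooling layer.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2-D 'valid' convolution of a single-channel (Y) image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 32x32 luminance-only input, matching the size limit on the slide
image = np.random.rand(32, 32)

# Illustrative edge-detect kernel; in a real CNN the filter values are learned
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

feature = np.tanh(conv2d_valid(image, kernel))  # tanh non-linearity, output in (-1, 1)
pooled = feature[:30:2, :30:2]                  # crude 2x2 subsampling (stride 2)
print(feature.shape, pooled.shape)              # (30, 30) (15, 15)
```

Note how tanh naturally bounds every output to the (-1, 1) range mentioned in the final step.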

5 Roofline Model
Provides a graphical representation of performance and productivity
Rates and efficiencies (GFLOPS, % of peak)
Focus: computation, communication, locality
Limitations: not for fine tuning; loop-based modeling
Limiting factor: execution vs. data transfer
Goals of the roofline model:
-Provide everyone with a graphical aid that gives realistic expectations of performance and productivity
-Show the inherent hardware limitations for a given kernel
-Show the potential benefit and priority of optimizations
-Shows limitations; good for comparisons
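The roofline bound itself is a one-line formula: attainable performance is the minimum of the computational roof and the product of the computation-to-communication (CTC) ratio and memory bandwidth. The peak and bandwidth numbers below are placeholders for illustration, not measurements from the presentation.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, ctc_ratio):
    """Roofline model: a kernel is capped by either the compute roof
    or by memory bandwidth times its CTC ratio (FLOP per byte)."""
    return min(peak_gflops, bandwidth_gbs * ctc_ratio)

# Illustrative numbers only (not from the slides)
PEAK = 100.0  # GFLOP/s computational roof
BW = 4.5      # GB/s external memory bandwidth

for ctc in (1, 8, 64):
    print(ctc, attainable_gflops(PEAK, BW, ctc))
# Low-CTC designs are bandwidth-bound; high-CTC designs hit the compute roof.
```

Plotting this bound against CTC gives the characteristic slanted-then-flat "roof" shape that the design points are compared against.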

6 Types of data
Irrelevant
Independent
Dependent

7 Double buffering
Overlaps computation with data transfer using two ping-pong buffers
Increases throughput
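The ping-pong idea can be sketched in plain Python: while the engine computes on one buffer, the next tile is fetched into the other. The `load_tile` and `compute` functions here are stand-ins for a DMA transfer and the accelerator's compute engine, not part of the actual FPGA design.

```python
from concurrent.futures import ThreadPoolExecutor

def load_tile(i):
    """Stand-in for a DMA transfer from external DDR3 memory."""
    return [i] * 4

def compute(tile):
    """Stand-in for the accelerator's compute engine."""
    return sum(tile)

def process_double_buffered(n_tiles):
    """Overlap loading tile i+1 with computing tile i (ping-pong buffers)."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_tile, 0)              # prefetch the first tile
        for i in range(n_tiles):
            tile = pending.result()                      # wait for the in-flight load
            if i + 1 < n_tiles:
                pending = pool.submit(load_tile, i + 1)  # start the next load now...
            results.append(compute(tile))                # ...so it overlaps this compute
    return results

print(process_double_buffered(4))  # [0, 4, 8, 12]
```

When load and compute take similar time, this roughly halves total latency compared to a serial load-then-compute loop.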

8 Main concerns
Communication overhead
Buffer management
Bandwidth optimization
Better utilization of the FPGA
What is their goal?
-Apply the roofline model
-Decrease external communication
-Double buffering
-Increase input bandwidth
-Use the FPGA better
-Data reuse
-Speed

9 Design Exploration
Computation-to-communication ratio
Loop scheduling
Loop tile sizes
-How they tested different designs: each candidate design is plotted against the two factors (computational performance and communication ratio), for example.

10 Directives: Loop Pipelining
Software pipelining
Increases throughput

11 Directives: Loop Unrolling
Maximizes computation
Dataflow design

12 Directives: Loop Tiling
Divides loops into smaller loops to ensure data stays in on-chip memory (cache)
Great for data reuse
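Loop tiling is easiest to see on a matrix multiply: the loop nest is split into tile-sized blocks so each block's working set fits in fast local memory and is reused many times before being evicted. This Python sketch shows the transformed loop structure only; an HLS implementation would express the same nest in C with directives.

```python
def tiled_matmul(A, B, n, tile=4):
    """n x n matrix multiply with loop tiling: each (ii, jj, kk) block touches
    only tile-sized sub-matrices, so data brought into fast memory is reused
    roughly `tile` times before being replaced."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # outer loops walk over tiles
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):   # inner loops stay
                    for j in range(jj, min(jj + tile, n)):  # inside one tile
                        s = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C

n = 8
A = [[1.0] * n for _ in range(n)]
B = [[1.0] * n for _ in range(n)]
C = tiled_matmul(A, B, n)
print(C[0][0])  # 8.0: each entry is the sum of n = 8 products
```

The result is identical to the untiled triple loop; only the traversal order (and therefore the memory locality) changes.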

13 Memory Optimization
Polyhedral-based optimization
Local memory promotion for irrelevant-type communications
Data reuse
-Polyhedral-based optimization targets nested loops
-Local memory is used for loops with irrelevant data
-Tiling for reuse

14 Designed Model
-Implemented with Vivado HLS on a Virtex-7 485T
-Frequency: 100 MHz
-A soft general-purpose processor issues commands over an AXI4 bus
-The MicroBlaze helps with startup, communication with the host CPU, and time measurement
-DDR3 DRAM is external to the FPGA

15 Detail of the final design
Computation engine: two unrolled loops
-Tree-shaped structure for a better cross-layer design

16 Results
Virtex-7 at 100 MHz, implemented as an IP core using Vivado HLS (28 nm process)
Intel Xeon E5 v2 baseline: 2.2 GHz tested (6 cores, 2.5 GHz rated, 6 × 256 KB L2, 15 MB L3, 80 W, QPI, 3 × DDR3 channels, launched Jan 9, 2014 at $551)
Pre-synthesis report used for performance and exploration
Tested using a 256x256 RGB image (3 channels)

17 Evaluation of the system
17.42x speedup over the 1-thread general-purpose CPU implementation
4.8x speedup over the 16-thread CPU implementation
18.6 W vs. 95 W for the CPU
3.62x speedup over the ICCD 2013 design

18 My opinion
The techniques used to optimize loops are well thought out
It's a unique way of looking at an accelerator
The memory enhancements offer great insight

19 Pitfalls of the claims
Pre-cached data tests
Evaluation metrics when comparing against other designs
Only tested using one image
Technology difference
Claiming the design has the best utilization
CPU wattage is estimated
1.8x performance density over the second-best design
-The Virtex-7 family is based on a 28 nm design and is reported to deliver a two-fold system performance improvement at 50 percent lower power compared to previous-generation Virtex-6 devices. In addition, Virtex-7 doubles the memory bandwidth compared to previous-generation Virtex FPGAs, with 1866 Mbit/s memory-interfacing performance and over two million logic cells.
-The Virtex-6 family is built on a 40 nm process for compute-intensive electronic systems; the company claims it consumes 15 percent less power and has 15 percent improved performance over competing 40 nm FPGAs.

20 Q&A

21 References
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Center for Energy-Efficient Computing and Applications, Peking University, China; Computer Science Department, University of California, Los Angeles, USA.
8-roofline-talk.pdf

