Optimizing FPGA Accelerator Design for Deep Convolutional Neural Networks By: Mohamad Kanafanai



Outline Introduction Background Methodology Results Evaluation of the system Criticism Q&A

Introduction CNN is an extension of the artificial neural network Applications include image processing Requires high-performance computation hardware Design exploration is a must! -well-known learning architecture (machine learning) -used mainly in imaging: object detection, surveillance, robot vision -due to the CNN computation pattern it requires very high-performance hardware; a general-purpose processor is not adequate -candidates are FPGA, GPU, and ASIC -there can be up to a 90% performance difference between two FPGA solutions -the main concerns are memory, bandwidth, and utilization -the problem got harder with newer FPGAs, since size and performance increased

What is a Deep Convolutional Neural Network? A type of machine learning 8 steps Limitations Feed-forward computation -used to detect objects, commonly faces -32x32 images only Steps: Extract the Y (luminance) channel (raw RGB or YUV doesn't work; black and white) Normalize 1- filter (convolution layer) with learned kernels, e.g. edge detection (spatial convolution); a bigger layer captures more detail 2- hyperbolic tangent (tanh) to increase accuracy; Rectified Linear Units are an alternative 3- pooling (max or average) of the feature data over a region of the image; ensures the data stays consistent 4- sub-sampling to extract features 5- apply the filter (convolution map): input mapped through trained filters 6- the result is in the range -1 to 1 for one objective (face or background)
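The feed-forward computation pattern of one convolution layer (steps 1-2 above) can be sketched as follows. This is an illustrative toy, not the presented accelerator code; sizes, the `Img` type, and the single-channel layout are assumptions for clarity.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Img = std::vector<std::vector<float>>;

// Slide one K x K learned filter over the input (valid convolution),
// then apply the tanh non-linearity from step 2.
Img conv_layer(const Img& in, const Img& k) {
    int K = static_cast<int>(k.size());
    int R = static_cast<int>(in.size()) - K + 1;
    int C = static_cast<int>(in[0].size()) - K + 1;
    Img out(R, std::vector<float>(C, 0.0f));
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c) {
            float acc = 0.0f;
            for (int i = 0; i < K; ++i)        // filter rows
                for (int j = 0; j < K; ++j)    // filter columns
                    acc += in[r + i][c + j] * k[i][j];
            out[r][c] = std::tanh(acc);        // non-linearity
        }
    return out;
}
```

The four nested loops are exactly the structure the later slides optimize with tiling, unrolling, and pipelining.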

Roofline Model Provides a graphical representation of performance and productivity Rates and efficiencies (GFLOPS, % of peak) Limitations Benefits Focus: computation, communication, locality Not for fine tuning Loop-based modeling Limiting factor: execution vs. data transfer Goals of the Roofline: Provide everyone with a graphical aid that gives realistic expectations of performance and productivity Show inherent hardware limitations for a given kernel Show the potential benefit and priority of optimizations -shows limitations -good for comparison
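The roofline bound itself is a single min: attainable performance is the smaller of the computational roof and the memory roof (operational intensity times bandwidth). A minimal sketch; the numbers in the usage note are illustrative, not measurements from the paper.

```cpp
#include <algorithm>
#include <cassert>

// Roofline bound: a kernel cannot run faster than either the compute peak
// or what the memory system can feed it.
double attainable_gflops(double peak_gflops,
                         double ctc_flops_per_byte,  // operational intensity
                         double bandwidth_gb_s) {
    double memory_roof = ctc_flops_per_byte * bandwidth_gb_s;
    return std::min(peak_gflops, memory_roof);
}
```

For example, with a 100 GFLOPS roof and 10 GB/s of bandwidth, a kernel at 2 FLOP/byte is bandwidth-bound at 20 GFLOPS, while one at 20 FLOP/byte hits the 100 GFLOPS compute roof.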

Types of data communication (per loop dimension): Irrelevant, Independent, Dependent

Double buffering Allows data transfer and computation to overlap (two-way communication) Increases throughput
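The idea can be sketched as a ping-pong scheme: while the engine computes on one buffer, the next tile is loaded into the other. This is an illustrative host-side model (sequential here for clarity; in the accelerator the load and compute run concurrently), and `load_tile`/`compute` are stand-in names, not the design's interfaces.

```cpp
#include <array>
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

using Tile = std::array<int, 4>;

Tile load_tile(int t) {            // stands in for a DMA transfer from DRAM
    Tile buf;
    buf.fill(t);
    return buf;
}

int compute(const Tile& buf) {     // stands in for the accelerator kernel
    return std::accumulate(buf.begin(), buf.end(), 0);
}

std::vector<int> run(int num_tiles) {
    std::vector<int> results;
    Tile ping = load_tile(0), pong{};
    for (int t = 0; t < num_tiles; ++t) {
        if (t + 1 < num_tiles) pong = load_tile(t + 1);  // prefetch next tile
        results.push_back(compute(ping));                // compute current tile
        std::swap(ping, pong);                           // ping-pong swap
    }
    return results;
}
```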

Main concerns Communication overhead Buffer management Bandwidth optimization Better utilization of the FPGA What is their goal? -roofline model -decrease external communication -double buffering -increase input bandwidth -use the FPGA better -data reuse -speed

Design Exploration Computation-to-communication ratio Loop scheduling Loop tile sizes -How they tested different designs -Plotted each design against the two factors -Example.
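The exploration loop can be sketched as: enumerate candidate tile sizes, estimate each design's computation-to-communication (CTC) ratio, bound its performance with the roofline, and keep the best. The CTC model below (`tile / 4.0`) is a toy placeholder, not the paper's analytical model.

```cpp
#include <algorithm>
#include <cassert>

struct Design { int tile; double gflops; };

// Toy design-space exploration over power-of-two tile sizes.
Design explore(double peak_gflops, double bw_gb_s) {
    Design best{0, 0.0};
    for (int tile = 1; tile <= 64; tile *= 2) {
        double ctc = tile / 4.0;                            // toy model: CTC grows with tile size
        double perf = std::min(peak_gflops, ctc * bw_gb_s); // roofline bound
        if (perf > best.gflops) best = {tile, perf};
    }
    return best;
}
```

With ample bandwidth the best design saturates the compute roof; with scarce bandwidth every design stays bandwidth-bound, which is exactly the trade-off the paper's plots visualize.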

Directives: Loop Pipelining Software pipelining Increases throughput
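In Vivado HLS-style C, pipelining is requested with a pragma; a sketch is below. The pragma asks the tool to start a new loop iteration every II cycles (II=1 here); a plain C++ compiler simply ignores the unknown pragma, and the function name is illustrative.

```cpp
#include <cassert>

// Element-wise add whose loop is pipelined by the HLS tool.
void vadd(const int a[64], const int b[64], int out[64]) {
    for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];  // one new iteration issued per cycle
    }
}
```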

Directives: Loop Unrolling Maximizes parallel computation Data-flow design
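In HLS an UNROLL directive replicates the loop body in hardware; the same transformation written out by hand for a factor of 4 looks like this (illustrative example, not the accelerator's unrolled loops):

```cpp
#include <cassert>

// Factor-4 unrolled reduction: four independent accumulators run in
// parallel in hardware, combined at the end.
int sum_unrolled(const int x[16]) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // four parallel accumulators
    for (int i = 0; i < 16; i += 4) {     // factor-4 unroll
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return s0 + s1 + s2 + s3;             // final reduction of partial sums
}
```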

Directives: Loop Tiling Divides loops into smaller loops to ensure data stays in on-chip memory Great for data reuse
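A minimal sketch of tiling: the loop is split so each tile is copied into a small local buffer (standing in for FPGA BRAM) before being processed. Tile size and the operation are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int TILE = 4;

// Process the input tile by tile; each tile is staged in a small
// on-chip buffer so it can be reused without re-reading external memory.
std::vector<int> scale_tiled(const std::vector<int>& in, int factor) {
    std::vector<int> out(in.size());
    for (std::size_t jj = 0; jj < in.size(); jj += TILE) {          // tile loop
        std::size_t end = std::min(jj + TILE, in.size());
        int local[TILE];                                            // local (BRAM-like) buffer
        for (std::size_t j = jj; j < end; ++j)
            local[j - jj] = in[j];                                  // load tile
        for (std::size_t j = jj; j < end; ++j)
            out[j] = local[j - jj] * factor;                        // compute on tile
    }
    return out;
}
```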

Memory Optimization Polyhedral-based optimization Local memory promotion for irrelevant-type communication Data reuse -polyhedral-based optimization is an optimization for nested loops -using local memory for loops with irrelevant data -tiling for reuse

Designed Model -Programmed with Vivado HLS on a Virtex-7 485T -Frequency: 100 MHz -A soft general-purpose processor issues commands over an AXI4 bus -MicroBlaze helps with start-up, communication with the host CPU, and time measurement -DDR3 DRAM is external to the FPGA

Detail of the final design Computation engine: two unrolled loop levels -tree-shaped compute structure for a better cross-layer design

Results Virtex-7 at 100 MHz, implemented as IP using Vivado HLS Baseline: Intel Xeon E5, 2.2 GHz, 15 MB cache Pre-synthesis report used for performance and exploration Tested using one 256x256 RGB image (3 channels) The Virtex-7 is a 28 nm part; the baseline spec (Xeon E5-2430 v2, S1 stepping): 6 cores, 2.5 GHz, 6 × 256 KB L2, 15 MB L3, 80 W, LGA 1356, 7.2 GT/s QPI, 3 × DDR3-1600, launched Jan 9, 2014, $551

Evaluation of the system 17.42x speedup over the 1-thread general-purpose CPU implementation 4.8x speedup over the 16-thread general-purpose CPU implementation 18.6 watts vs. 95 watts for the CPU 3.62x speedup over the ICCD 2013 design
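Taken together, the 4.8x speedup at 18.6 W vs. 95 W implies a performance-per-watt advantage that the slides don't state explicitly; the arithmetic is just speedup times the power ratio. A quick check (derived from the numbers above, not a figure from the paper):

```cpp
#include <cassert>

// Performance-per-watt advantage = speedup x (CPU watts / FPGA watts).
double perf_per_watt_gain(double speedup, double cpu_watts, double fpga_watts) {
    return speedup * (cpu_watts / fpga_watts);
}
```

With the reported numbers, 4.8 x (95 / 18.6) works out to roughly a 24.5x energy-efficiency advantage over the 16-thread CPU implementation.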

My opinion The techniques used to optimize loops are well thought out It's a unique way of looking at an accelerator The memory enhancements offer great insight

Pitfalls of the claims Tests used pre-cached data Evaluation metrics differ when comparing against other designs Only tested using one image Technology difference between compared platforms Claiming the design has the best utilization The CPU wattage is estimated 1.8x performance density over the second best The Virtex-7 family is based on a 28 nm design and is reported to deliver a two-fold system performance improvement at 50 percent lower power compared to previous-generation Virtex-6 devices; it also doubles memory bandwidth compared to previous-generation Virtex FPGAs, with 1866 Mbit/s memory interfacing performance and over two million logic cells The Virtex-6 family is built on a 40 nm process for compute-intensive electronic systems, and the company claims it consumes 15 percent less power and has 15 percent better performance than competing 40 nm FPGAs

Q&A

References http://crd.lbl.gov/assets/pubs_presos/parlab08-roofline-talk.pdf https://www.youtube.com/watch?v=n6hpQwq7Inw http://en.wikipedia.org/wiki/Loop_tiling http://en.wikipedia.org/wiki/Polytope_model Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, Jason Cong. Center for Energy-Efficient Computing and Applications, Peking University, China; Computer Science Department, University of California, Los Angeles, USA