Final Project presentation

Final Project Presentation
Design Space Exploration and Optimization of an FPGA-Based Neural Network Accelerator
Maedeh Hemmat
ECE734 - VLSI Array Processors for Digital Signal Processing

Outline
Goal: For any CNN on any arbitrary FPGA platform, find a mapping that:
- Minimizes the data transfer between computational units and memory
- Maximizes data reuse
- Increases the throughput of the accelerator
Approach and progress: a software-hardware co-design
- Software: developing an algorithm to explore the design space by considering the structure of the network and the characteristics of the CNN
- Hardware: implementing the optimal mapping in hardware (ModelSim); measuring the power consumption and throughput of the network under different scenarios

Speaker notes: Nowadays, CNNs have gained significant attention for performing more complicated tasks such as .... To perform these complicated tasks, neural networks are becoming deeper and deeper, that is, the number of layers is increasing. Deeper neural networks make power-efficient hardware implementation more challenging, especially for embedded devices, cell phones, and wearable devices, where the energy budget is limited.

Motivation
Input feature map: N_x × N_y × N_if
Convolutional layer: K_x × K_y × N_if × N_of
Output feature map: N_x' × N_y' × N_of
The four nested loops lead to a large design space, with various choices for implementing the CONV layer while exploiting parallelism.
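The four nested loops over output feature maps, output positions, and input channels can be sketched as follows (a minimal Python sketch; the array shapes mirror the slide's notation, while function and variable names are illustrative):

```python
import numpy as np

def conv_layer(ifmap, weights):
    """Naive four-nested-loop CONV layer (stride 1, no padding).

    ifmap:   (Nx, Ny, Nif)       input feature map
    weights: (Kx, Ky, Nif, Nof)  convolution kernels
    returns: (Nx', Ny', Nof)     output feature map
    """
    Nx, Ny, Nif = ifmap.shape
    Kx, Ky, _, Nof = weights.shape
    Nxp, Nyp = Nx - Kx + 1, Ny - Ky + 1
    ofmap = np.zeros((Nxp, Nyp, Nof))
    for of in range(Nof):              # loop over output feature maps (kernels)
        for x in range(Nxp):           # loop over output rows
            for y in range(Nyp):       # loop over output columns
                acc = 0.0
                for c in range(Nif):   # loop over input channels
                    acc += np.sum(ifmap[x:x+Kx, y:y+Ky, c] * weights[:, :, c, of])
                ofmap[x, y, of] = acc
    return ofmap
```

Each choice of which of these loops to unroll or reorder in hardware corresponds to one point in the design space the algorithm explores.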

Proposed Algorithm
What are the choices? Which one is optimal?
Unroll over input feature maps (channels):
- Process P_if channels in one clock cycle
- P_if pixels from different kernels/feature maps are used
- An accumulator is required to accumulate the results from different channels
- If P_if = N_if, one output per clock cycle
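The channel-unrolled datapath can be sketched as follows (a behavioral Python sketch, not RTL; each outer-loop iteration stands in for one clock cycle of P_if parallel multipliers feeding an accumulator):

```python
import numpy as np

def unroll_channels(pixels, weights, P_if):
    """Compute one output value by processing P_if channels per 'cycle'.

    pixels, weights: length-Nif vectors for a fixed kernel position and
    a fixed output feature map. Each outer iteration models one clock
    cycle: P_if multiplications in parallel, accumulated across cycles.
    """
    Nif = len(pixels)
    acc = 0.0
    for base in range(0, Nif, P_if):                  # one 'clock cycle' per group
        group = slice(base, min(base + P_if, Nif))
        acc += np.dot(pixels[group], weights[group])  # P_if parallel MACs
    return acc
```

With P_if = N_if, the loop body runs once, matching the slide's one-output-per-cycle case.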

Proposed Algorithm
What are the choices? Which one is optimal?
Unroll over output feature maps (kernels):
- Process P_of kernels in one clock cycle
- Input pixels from the same feature map are multiplied with weights from different kernels
- Each input pixel is reused P_of times
- Partial results over channels need to be stored
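The kernel-unrolled alternative can be sketched the same way (a behavioral Python sketch; each outer-loop iteration models one clock cycle in which a single input pixel is broadcast to P_of multipliers and the per-kernel partial sums are updated):

```python
import numpy as np

def unroll_kernels(pixel, weights, P_of, partial):
    """Broadcast one input pixel to P_of kernels per 'cycle'.

    pixel:   scalar input pixel (fixed channel and position)
    weights: length-Nof vector, one weight per kernel
    partial: length-Nof running partial sums over channels
    The pixel is reused P_of times per cycle; partial sums must be
    stored between channel iterations, as the slide notes.
    """
    Nof = len(weights)
    for base in range(0, Nof, P_of):                      # one 'clock cycle' per group
        group = slice(base, min(base + P_of, Nof))
        partial[group] += pixel * weights[group]          # P_of parallel MACs, pixel reused
    return partial
```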

Proposed Algorithm
Pick the mapping with lower latency and fewer on-chip buffer accesses.
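The selection step can be sketched as a comparison over candidate mappings (the dictionary keys and the latency-first tie-breaking order are assumptions for illustration; the slide only states that the chosen mapping should have lower latency and fewer buffer accesses):

```python
def pick_mapping(candidates):
    """Select the candidate minimizing latency, breaking ties by
    on-chip buffer accesses (tie-breaking order is assumed)."""
    return min(candidates, key=lambda m: (m["latency"], m["buffer_accesses"]))
```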

Experimental Results
Trade-off between the different mappings under different numbers of channels
Size of weight matrix: 5 × 5 × Number_of_channels × 32

Experimental Results
Target CNN: LeNet-5
- Two convolutional layers followed by one fully connected layer
- Trained and tested on the MNIST data
First CONV layer: 5 × 5 × 1 × 20 (unrolling over channels)

Kernel Tiling Factor   Number of kernels per Tile   On-chip buffer access   Output pixels per cycle
1                      20                           525                     10
4                      5                            1025
8                      3                            1225                    1.5

Second CONV layer: 5 × 5 × 20 × 50
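The first two columns of the table follow from splitting the layer's 20 kernels across tiles; this one-line sketch reproduces them (the buffer-access and output-rate columns are measured results, not derived from this formula):

```python
import math

def kernels_per_tile(num_kernels, tiling_factor):
    """Kernels processed per tile when a layer's kernels are split
    across `tiling_factor` tiles (rounded up, so the last tile may
    be only partially filled)."""
    return math.ceil(num_kernels / tiling_factor)
```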

Hardware Simulation
Implementing the second layer of LeNet-5 in hardware:
- 16-bit fixed-point operations
- Two-input multipliers and adder trees to implement the MACs
- Quantizing the generated outputs to 16 bits to control dynamic-range growth
- Using ModelSim to implement the design
- Synthesizing the design in a 45 nm technology
- Measuring the power consumption of the layer under different scenarios (by changing the available on-chip buffer size)
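The fixed-point MAC datapath described above can be modeled in Python (a behavioral sketch; the Q7.8 format and saturating quantization are assumptions for illustration, since the slides only state 16-bit operands and 16-bit quantized outputs):

```python
def to_fixed(x, frac_bits=8):
    """Quantize a real number to 16-bit signed fixed point (Q7.8 assumed),
    saturating at the 16-bit range limits."""
    v = int(round(x * (1 << frac_bits)))
    return max(-(1 << 15), min((1 << 15) - 1, v))

def adder_tree(values):
    """Pairwise reduction, mirroring a hardware adder tree of two-input adders."""
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)] \
                 + ([values[-1]] if len(values) % 2 else [])
    return values[0]

def mac_16bit(pixels, weights, frac_bits=8):
    """Multiply 16-bit operands, sum the products with an adder tree in a
    wide accumulator, then requantize to 16 bits to control dynamic-range
    growth (arithmetic shift drops the extra fractional bits)."""
    products = [to_fixed(p, frac_bits) * to_fixed(w, frac_bits)
                for p, w in zip(pixels, weights)]
    acc = adder_tree(products)
    return max(-(1 << 15), min((1 << 15) - 1, acc >> frac_bits))
```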

Experimental Results
Power consumption of the second layer of LeNet-5 under different scenarios
The on-chip buffer size varies from 40 kb to 8 kb.

Conclusion
- Developing an algorithm that finds the optimal mapping of a CNN onto FPGA platforms
- Minimizing the number of on-chip buffer accesses to minimize power consumption
- Implementing and synthesizing one layer of LeNet-5

Q & A