Approximate Computing on FPGA using Neural Acceleration
Presented By: Mikkel Nielsen, Nirvedh Meshram, Shashank Gupta, Kenneth Siu

Approximate Computing
- Involves computations that do not need to be exact (tolerant to quality degradation)
- A neural network's (NN) speed can be exploited
- Trades accuracy for performance and energy efficiency
- Goal: implement an NN accelerator that interacts with the CPU
- Useful in many computer vision and image processing applications, such as edge detection

Motivation
- Combine the approaches of specialized logic (an accelerator) and approximate computing for improved performance and energy efficiency
[Figure: Top-Level System Design]

Architecture Design of NPU
[Figure: Top-Level Diagram of the NPU]

Architecture and Features
- A total of 8 Processing Elements (PEs) per Processing Unit in the initial design
- Weights needed for neural processing are loaded into the weight FIFO at configuration time
- A scheduling buffer, also set up during the configuration phase, generates the control signals for the input, output, sigmoid, and accumulator FIFOs, PE input selection, and the sigmoid function
- After configuration, inputs are loaded into the input FIFO (using the enqd instruction)
- Inputs and weights are 16 bits wide in fixed point with 7 fractional bits; the NPU accepts 32-bit integers and single-precision floating point, and the input interface performs the required format conversion (see the sketch below)
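As an illustration of the format conversion the input interface performs, here is a minimal C sketch of converting between single-precision floats and the 16-bit, 7-fractional-bit fixed-point format; the rounding and saturation behavior shown is an assumption, since the slides do not specify it.

    #include <stdint.h>
    #include <math.h>

    /* 16-bit signed fixed point with 7 fractional bits. */
    #define FRAC_BITS 7

    /* Convert a single-precision float to the 16-bit fixed-point format,
     * rounding to nearest and saturating on overflow (assumed behavior). */
    static int16_t float_to_fix(float x)
    {
        float scaled = roundf(x * (1 << FRAC_BITS));
        if (scaled > INT16_MAX) return INT16_MAX;
        if (scaled < INT16_MIN) return INT16_MIN;
        return (int16_t)scaled;
    }

    /* Convert back to float, e.g. when dequeuing NPU outputs. */
    static float fix_to_float(int16_t x)
    {
        return (float)x / (1 << FRAC_BITS);
    }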

Architecture and Features
- Compute Unit: performs the multiply and add operations (see the sketch below)
- State Machine: controls and configures the NPU; stalls on insufficient input and pushes outputs to the FIFO
- Accumulator FIFO: stores intermediate results when the number of inputs exceeds the number of PEs
- Sigmoid Function Unit: the current NPU supports tan-sigmoid and linear activation functions
- Output FIFO: holds the output of the NPU
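A functional sketch of what one processing element computes per neuron: a multiply-accumulate over the inputs followed by the activation selected at configuration time. This floating-point version is for clarity only; the hardware works in 16-bit fixed point and streams partial sums through the accumulator FIFO, and the bias term here is an assumption.

    #include <math.h>

    /* One neuron as evaluated by the compute unit: multiply-accumulate
     * over the inputs, then the configured activation (tan-sigmoid here;
     * the linear activation would simply return the accumulated sum). */
    static float neuron_tansig(const float *in, const float *w, int n, float bias)
    {
        float acc = bias;
        for (int i = 0; i < n; i++)
            acc += in[i] * w[i];   /* multiply and add in the PE */
        return tanhf(acc);          /* tan-sigmoid activation     */
    }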

Software for Configuration
- Weights can be generated with custom MATLAB code or through a compiler
- A Perl-based compiler takes the weights and the structure of the neural network as input (an illustrative example follows below)
- The compiler then generates a sequence of instructions that is loaded into the NPU
- These instructions fill the weight buffers as well as the scheduling buffer
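For illustration only, the compiler's inputs amount to a layer structure plus a flat weight array, roughly like the following C data; the 2-8-1 topology, names, and values are hypothetical and do not reflect the actual input format.

    /* Hypothetical 2-8-1 network description: layer sizes and a flat,
     * row-major weight array as produced by MATLAB training. Entries
     * omitted here default to zero. */
    static const int layer_sizes[] = { 2, 8, 1 };
    static const float weights[2 * 8 + 8 * 1] = {
        0.13f, -0.52f, 0.74f, 0.08f,   /* hidden-layer weights ...             */
        /* ... output-layer weights follow the hidden-layer weights ... */
    };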

Zedboard Implementation
- Used the Vivado tools to build the programmable-logic design and generate a bitstream
- The bitstream is wrapped with the boot files so it is loaded by the first-stage boot loader
- When the Zedboard boots, the programmable logic is programmed with the design
- A driver interfaces C code with the programmable logic (the NPU), as sketched below
- For comparison, the same C code is also run natively under Digilent Linux on the Zedboard's ARM core
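A hedged sketch of how a benchmark might hand work to the NPU through the driver; npu_enq() and npu_deq() are placeholder names for the driver's enqueue/dequeue entry points, not the actual API.

    #include <stdint.h>

    /* Placeholder driver entry points (hypothetical names): push inputs
     * into the NPU's input FIFO and pull results from its output FIFO. */
    extern void npu_enq(const int16_t *inputs, int n);
    extern void npu_deq(int16_t *outputs, int n);

    /* Offload one 3x3 neighborhood to the neural accelerator and read
     * back the approximate result. */
    static void npu_offload(const int16_t window[9], int16_t *result)
    {
        npu_enq(window, 9);
        npu_deq(result, 1);
    }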

Zedboard Challenges
- Configuring Vivado to generate the bitstream
- Synthesis/implementation debugging and errors
- Creating an appropriate wrapper so the Zedboard does not crash on boot

Benchmarks: Sobel Edge Detection
- A good program for approximate computing
- Uses convolution with a 3x3 kernel to find edges (reference code below)
- Took 0.4 ms for a 512x512 image
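For reference, a minimal C sketch of the exact Sobel computation the NN approximates, for one interior pixel of an 8-bit grayscale image (border handling is left to the caller):

    #include <math.h>

    /* Exact Sobel kernel: convolve the 3x3 neighborhood of (x, y) with the
     * horizontal and vertical Sobel masks and combine the two gradients. */
    static unsigned char sobel_pixel(const unsigned char *img, int w, int x, int y)
    {
        static const int gx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
        static const int gy[3][3] = { {-1,-2,-1}, { 0, 0, 0}, { 1, 2, 1} };
        int sx = 0, sy = 0;
        for (int j = -1; j <= 1; j++)
            for (int i = -1; i <= 1; i++) {
                int p = img[(y + j) * w + (x + i)];
                sx += gx[j + 1][i + 1] * p;
                sy += gy[j + 1][i + 1] * p;
            }
        int mag = (int)sqrt((double)(sx * sx + sy * sy));
        return (unsigned char)(mag > 255 ? 255 : mag);
    }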

AxBench Benchmarks
- Uses AxBench
- Utilizes a software NN (the FANN library), as in the sketch below
- A hardware NN is needed to fully realize the efficiency gains
- Benchmarks were run both with and without the NN
[Table: NN time (s), original time (s), and % error for fft, inversek2j, and jpeg]
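A minimal sketch of evaluating a software NN with the FANN library, as the AxBench runs do; the network file name and the input values are assumptions.

    #include <stdio.h>
    #include "floatfann.h"   /* FANN with float fann_type; link with -lfloatfann */

    int main(void)
    {
        /* Load a trained network from disk and evaluate one input vector.
         * "inversek2j.net" is a placeholder file name. */
        struct fann *ann = fann_create_from_file("inversek2j.net");
        if (!ann)
            return 1;

        fann_type input[2] = { 0.5f, -0.25f };
        fann_type *output = fann_run(ann, input);
        printf("nn output: %f\n", output[0]);

        fann_destroy(ann);
        return 0;
    }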

In Progress
1. Compare the performance against another Processing Unit with 16 PEs and check the speedup gains
2. Build an NPU with 2 Processing Units of 8 PEs each and again compare performance and speedup
3. Modify the scheduler to remove stalls due to unavailable data
4. More benchmarks

References
[1] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in MICRO, 2012.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. MIT Press, 1986, pp. 318-362.
[3] M. de Kruijf and K. Sankaralingam, "Exploring the synergy of emerging workloads and silicon reliability trends," in SELSE.
[4] T. Moreau, M. Wyse, J. Nelson, A. Sampson, H. Esmaeilzadeh, L. Ceze, and M. Oskin, "SNNAP: Approximate computing on programmable SoCs via neural acceleration," in HPCA, 2015.