Generic SOC Architecture for Convolutional Neural Networks CDR 12.01.2015 By: Merav Natanson & Yotam Platner Supervisor: Guy Revach HSDSL Lab, Technion.


NN Coprocessor and Algorithm on SOC
- Hardware implementation of a generic & modular NN coprocessor on FPGA logic
- Software driver and API
- Software implementation of specific test-case algorithms
- Linux OS running on the ARM processor
Our board: Avnet ZedBoard (System on Chip)
- Programmable logic – Xilinx Zynq XC7Z020-1 (FPGA)
- Processing system – dual ARM Cortex-A9
- Memory – 512MB DDR3
- External interface – 10/100/1000 Ethernet

Project A Stages
- Research and learning of convolutional neural networks, with a focus on the LeNet-5 algorithm
- Ramp-up on the ZedBoard and Zynq platforms
- Hardware architecture document for the FPGA coprocessor
- Analysis of practicability & throughput for different operation modes (software-hardware configuration)
- Architecture document for the software API and algorithm
- Functional simulation of the coprocessor (ModelSim)

Background – Neural Networks (NN)
Neural networks are inspired by the biological neural system. The basic units that construct a network are neurons and weights.
Neuron operation:
- Multiply all relevant pixels by the appropriate weights
- Sum the products and add a constant bias
- Apply an activation function (e.g. tanh) to the result
Neuron – connected to multiple inputs and outputs; its output is the result of an activation function applied to the sum of its inputs.
Weight – the basic unit that connects neurons; the data passing through a weight is multiplied by the weight's value.
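As a sketch, the neuron operation above can be written in software as follows. The tanh activation and double precision are illustrative only; the coprocessor itself works with fixed-point values and a LUT-based activation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One neuron, as described above: multiply each input pixel by its weight,
// sum the products, add a constant bias, then apply an activation function.
// tanh and double precision are illustrative; the hardware uses fixed-point
// arithmetic and an activation LUT.
double neuronOutput(const std::vector<double>& pixels,
                    const std::vector<double>& weights,
                    double bias) {
    double acc = bias;
    for (std::size_t i = 0; i < pixels.size(); ++i)
        acc += pixels[i] * weights[i];
    return std::tanh(acc);  // activation function
}
```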

Background – Neural Networks (NN)
From neurons and weights we can construct a neural network with as many layers as we like. Each layer contains a certain number of neurons, and a set of weights connects the layer to other layers. The complexity of the network is determined by the dimension of the inputs: the more complex and variable the input, the larger the network.

Example of an Algorithm – LeNet-5
Purpose – handwritten digit recognition.
Input – a handwritten digit represented by a 32x32 pixel matrix.
Output – 10 values of +1 or -1; the matching digit should be the only one represented by +1.

LeNet-5 – Layer Types
Convolution – matrix convolution between single/multiple input feature maps (FMs) and a small kernel matrix of weights.
Sub-sampling – performs local averaging, reducing the resolution of the feature map and the sensitivity of the outputs.
Fully connected – each output neuron receives all of the previous layer's neurons as inputs, with a different weight for each input.
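Assuming a "valid" convolution and non-overlapping 2x2 averaging for sub-sampling (both typical for LeNet-5; the exact hardware behaviour is fixed in the architecture document), the first two layer types can be sketched as:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// "Valid" 2D convolution of one input feature map with a small kernel of
// weights: every output pixel is a weighted sum over a kernel-sized window.
Matrix convolve(const Matrix& fm, const Matrix& kernel) {
    std::size_t k = kernel.size();
    std::size_t h = fm.size() - k + 1, w = fm[0].size() - k + 1;
    Matrix out(h, std::vector<double>(w, 0.0));
    for (std::size_t r = 0; r < h; ++r)
        for (std::size_t c = 0; c < w; ++c)
            for (std::size_t i = 0; i < k; ++i)
                for (std::size_t j = 0; j < k; ++j)
                    out[r][c] += fm[r + i][c + j] * kernel[i][j];
    return out;
}

// Sub-sampling: local averaging over non-overlapping 2x2 blocks, halving
// the feature map resolution.
Matrix subsample(const Matrix& fm) {
    std::size_t h = fm.size() / 2, w = fm[0].size() / 2;
    Matrix out(h, std::vector<double>(w));
    for (std::size_t r = 0; r < h; ++r)
        for (std::size_t c = 0; c < w; ++c)
            out[r][c] = (fm[2*r][2*c] + fm[2*r][2*c+1] +
                         fm[2*r+1][2*c] + fm[2*r+1][2*c+1]) / 4.0;
    return out;
}
```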

FPGA – Block Scheme

Neuron Bank
Neuron operation:
- Data & weight inputs are received into a FIFO
- Multiply and accumulate until the finish flag is received
- Return the result to the neuron read controller
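A minimal software model of this streaming behaviour (illustrative only, not the VHDL):

```cpp
#include <cassert>

// Software model of one neuron in the bank: data/weight pairs stream in
// (as if popped from the FIFO) and are multiplied and accumulated; when the
// finish flag arrives, the result is emitted and the accumulator cleared.
struct NeuronMac {
    double acc = 0.0;

    // Returns true and writes *result when the finish flag is set.
    bool push(double data, double weight, bool finish, double* result) {
        acc += data * weight;
        if (finish) {
            *result = acc;
            acc = 0.0;  // ready for the next stage
            return true;
        }
        return false;
    }
};
```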

Execution Units

Neuron Write Controller:
- Read the stage configuration
- Write data & weights to the neurons (transfer order is decided according to the mode)
- Raise finish flags to the neurons
- Write each configuration field to the relevant controller
- Repeat the operation for a new stage (if available)
Neuron Read Controller:
- Pull the results from the assigned neurons in cyclic order, until all outputs are finished
- Send the results to the calculation unit, with the appropriate bias (from the bias memory block) and a finish flag
- In fully connected mode, the module pulls results from multiple neurons (with a counter); otherwise, every read produces an output

Execution Units
Calculation Unit:
- Sums all its inputs until receiving the finish flag
- Adds the bias to the sum
- Passes the result through an activation function (defined in a LUT)
- Passes the results (in order) to the results write controller
Results Write Controller:
- Writes results into the data memory block with an adjacent valid bit
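A software sketch of the calculation unit path; the LUT range, resolution, and rounding policy here are our assumptions for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Nearest-entry lookup in an activation LUT that samples the activation
// function uniformly over [lo, hi]; inputs outside the range are clamped.
double lutActivation(const std::vector<double>& lut, double x,
                     double lo, double hi) {
    double t = (std::min(std::max(x, lo), hi) - lo) / (hi - lo);
    std::size_t idx = static_cast<std::size_t>(t * (lut.size() - 1) + 0.5);
    return lut[idx];
}

// Calculation unit: sum the partial results from the neurons until the
// finish flag (modelled here as the end of the vector), add the bias,
// then pass the sum through the activation LUT.
double calcUnit(const std::vector<double>& partials, double bias,
                const std::vector<double>& lut, double lo, double hi) {
    double sum = bias;
    for (double p : partials) sum += p;
    return lutActivation(lut, sum, lo, hi);
}
```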

Memory Blocks
Priorities:
- Data memory read priority is higher than write priority
- When the read FIFO is full, it sends a "force write" signal
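The priority rule can be modelled as a small arbiter; the signal names are our own, chosen for illustration:

```cpp
#include <cassert>

// Sketch of the data memory arbitration rule: reads normally win over
// writes, but a full read FIFO asserts "force write" so that pending
// writes can proceed. Signal names here are our own.
enum class Grant { None, Read, Write };

Grant arbitrate(bool readReq, bool writeReq, bool forceWrite) {
    if (writeReq && forceWrite) return Grant::Write;  // read FIFO is full
    if (readReq)                return Grant::Read;   // reads have priority
    if (writeReq)               return Grant::Write;
    return Grant::None;
}
```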

Register Bank

Coprocessor Configuration
The configuration block is the processor's means of managing the coprocessor, so all of its data (input) is transferred by the ARM. The configuration block is built from fields. The EU (neuron write controller) reads the configuration and transfers every field to the relevant FSMs. For example, the kernel dimension is needed by the neuron write controller to know when to raise the finish flag (at which point the neuron stops accumulating and produces an output).
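As an illustration only — the real field layout and widths are fixed by the hardware architecture document — one entry of a configuration block might look roughly like:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical configuration-block fields; the names and widths are our own
// guesses for illustration, not the documented hardware layout.
struct StageConfig {
    std::uint8_t  stageType;     // CONV / SUBS / FC
    std::uint16_t kernelDim;     // tells the neuron write controller when
                                 // to raise the finish flag
    std::uint16_t inputWidth, inputHeight;
    std::uint16_t outputWidth, outputHeight;
    std::uint8_t  neuronCount;   // neurons allocated to the execution unit
};
```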

Configuration Methods
Possible configurations for achieving good performance & throughput for a specific stage:
- Allocate a large number of neurons to a single execution unit
- Divide each stage into several parallel execution units
- Use multiple EUs to run the algorithm on several inputs in parallel

Processor - Coprocessor
Roles of the CPU:
- Allocation of neurons to the execution units
- Transfer of data and weights to the FPGA's RAM (through DMA)
- Configuration flow per execution unit
- Starting the execution units

Software API & Driver
Running on the ARM processor and embedded Linux OS.
Low level - drivers for the Xilinx CDMA IP and for AXI register access
Mid level - API functions for the coprocessor:
- Add new algorithm
- Add stages to algorithm
- Switch between different configurations
- Run algorithm stages (on a single input or multiple inputs in parallel)
High level - specific application per algorithm

General Structs
sDataBlock: Holds a block of data in the DDR. Used for images, weights, bias and configuration blocks.
Fields:
- Size of block in bytes
- Pointer to start of data in DDR
- Valid flag on data
sSlotArray: A slot array is used as a double buffer for images or weights, to allow "online" writing/reading to/from the coprocessor memory and thus increase throughput. For example:
- when images are loaded or read during the operation of the coprocessor,
- when weights are too big to fit in the device memory.
Fields:
- List of addresses on the AXI bus for the slots in the slot array
- List of data blocks to be written, or data blocks that were read
- ID of the execution unit that the slot array is assigned to
- Number of slots to be written in parallel (when the slot array is advanced)
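The two structs might be declared roughly as follows; the exact field types are assumptions, only the field meanings come from the slides:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the structs described above; field types are assumptions.
struct sDataBlock {
    std::uint32_t sizeBytes;  // size of the block in bytes
    std::uint8_t* ddrPtr;     // pointer to the start of the data in DDR
    bool          valid;      // valid flag on the data
};

struct sSlotArray {
    std::vector<std::uint32_t> slotAddrs;      // AXI addresses of the slots
    std::vector<sDataBlock>    blocks;         // blocks to write / blocks read
    std::uint8_t               execUnitId;     // EU this array is assigned to
    std::uint8_t               parallelSlots;  // slots advanced at once
    std::size_t                nextSlotId = 0; // advanced after each access
};
```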

sExecUnitConfig: A struct that gives the algorithm run function the operations to perform after an interrupt is received from a specific EU.
Fields:
- A wait vector of EU IDs (operation will not continue until all of them finish)
- Boolean determining whether to start the EU again
- EU configuration address
- Neuron configuration for the EU
- Slot operations to perform (enum)
- A list of other EUs that should be configured

cNeuronBank
A static class that implements the main API functions.
Key Members:
- A list of algorithms (cAlgorithm) that were added by the user [sorted by name]
- A list of activation functions [sorted by name]
- A list of configuration methods [sorted by name]
- Four cMemHandler instances for the different memory blocks in the FPGA
- Configuration address & neuron allocation for each execution unit
- Pointer to the currently loaded algorithm

addAlgorithm: Adds a new deep neural network algorithm.
Parameters - algorithm name, chosen configuration method & activation function, input image size.
Return value - pointer to the created cAlgorithm object.
To add additional (private) configuration methods and activation functions: addPrivateMethod, addLUT
configAlgorithm: Activates the method pointed to by the chosen algorithm's "config function pointer" (for example, cascadeMethod). Readies the coprocessor and API for the algorithm run.
Parameters - algorithm name.
receivedIRQ: Called when an interrupt is received from the coprocessor. Checks the interrupt cause (execution unit ID) and calls the run method of the currently loaded algorithm.
Execution unit handling: setExecUnitConfiguration, setExecutionUnitNeurons, startExecUnit

cAlgorithm
This class holds the algorithm parameters and configuration data.
Key Members:
- A list of stages (cStage) that were added by the user
- Input image size
- Pointer to the LUT and configuration function chosen for this algorithm
- Execution unit configuration lists for IRQ handling
addStage: Adds a new stage to the algorithm.
Parameters - the stage type (CONV/SUBS/FC/EUCLIDEAN), stage dimension (kernel size for CONV/SUBS, output size for FC/EUCLIDEAN), run stage on COP/NEON.
Return value - pointer to the created cStage object.
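To make the flow concrete, here is a mock of how an algorithm's stages might be registered through addStage. The class bodies below only record the calls (the real cAlgorithm is far richer), and the LeNet-5-style dimensions are illustrative:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Mock of the addAlgorithm/addStage flow described above; names follow the
// slides, but these minimal bodies exist only for this illustration.
enum eStageType { CONV, SUBS, FC, EUCLIDEAN };
enum eTarget { COP, NEON };

struct cStage { eStageType type; int dim; eTarget target; };

struct cAlgorithm {
    std::string name;
    int inputSize;
    std::vector<cStage> stages;

    // Adds a new stage and returns a pointer to it, as in the API above.
    cStage* addStage(eStageType t, int dim, eTarget tgt) {
        stages.push_back({t, dim, tgt});
        return &stages.back();
    }
};

// LeNet-5-style stage list: conv/sub-sampling pairs, then a fully connected
// stage, then a Euclidean output stage running on the NEON.
cAlgorithm buildLenet5() {
    cAlgorithm alg{"lenet5", 32, {}};
    alg.addStage(CONV, 5, COP);        // 5x5 convolution kernels
    alg.addStage(SUBS, 2, COP);        // 2x2 sub-sampling
    alg.addStage(CONV, 5, COP);
    alg.addStage(SUBS, 2, COP);
    alg.addStage(FC, 84, COP);         // fully connected, 84 outputs
    alg.addStage(EUCLIDEAN, 10, NEON); // 10 digit outputs
    return alg;
}
```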

For LUT configuration: setLUTAddr, getLUTConfig
cascadeMethod (or singleMethod or splitMethod or private): Called by configAlgorithm when the chosen config method is "cascade".
- Calculates the assignment of the algorithm stages to the different execution units
- Writes weights and bias to the appropriate memory handlers
- Creates write and read slots for input and output images
- Generates the configuration blocks from the stage objects
- Writes the configuration blocks to the memory handler
- Creates IRQ handling lists for each EU (used in the run method)
run: Called by receivedIRQ for the currently loaded algorithm. Goes over the appropriate configuration lists for the execution units and accordingly changes neuron assignment, changes config addresses, advances read and write slots, and activates EUs.
Parameters - vector of execution unit IDs for the received interrupt.

cStage
Abstract class that contains the basic stage data.
Key Members:
- Stage type
- Run on FPGA/NEON (enum)
- Next configuration address
runOnNeon: Each derived class implements it according to its needs. Runs the relevant NEON functions for the stage on the data.
Parameters - input data picture/FM for running the stage.
Return value - result data block (address & size).
getStageConfig: Creates the configuration data block. The configuration is made from the derived class data (the class that implements this virtual method). See table for the configuration block structure.
Return value - configuration data block (address & size).

cStageFM (inherits cStage)
Data structure for sub-sampling and convolution stages. Contains all the data necessary for creating the configuration data block.
Key Members:
- Input width & height
- Output width & height
- Kernel dimension
- A list of relevant output feature maps [cOfm] for this stage
addOfm: Creates a new output feature map object and associates it with the current stage.
Parameters - output feature map ID, bias block (address & size).
Return value - pointer to the new cOfm object.
getStageConfig: Creates a configuration data block for the given output feature maps only. In this case one stage can contain more than one configuration.
Parameters - IDs of all the output feature maps for this configuration.
Return value - configuration data block (address & size).

cOfm
Data structure for output feature map data.
Key Members:
- DDR addresses - for the bias and for all input feature map weights
- BRAM addresses - for the stage output, bias, and input feature map weights & data
addIFM: Adds an input feature map (ID & weights) to the output feature map object. For a sub-sampling stage, only one IFM is allowed.
Parameters - input feature map ID, weights block (DDR address & size).

cStageFC (inherits cStage)
Data structure for the fully connected stage. Contains all the data necessary for creating the configuration data block.
Key Members:
- DDR addresses - for bias and weights
- BRAM addresses - for stage output and input, bias and weights
- Input & output size
addWeights - receives weights and bias blocks.
cStageEuclidean (inherits cStage)
Data structure for the Euclidean stage. Contains all the data necessary for creating the configuration data block. Runs on NEON only.
Key Members: input & output size, weights DDR address.

cMemHandler
Data structure for handling all the FPGA memory usage (BAC, BRAMs). Our implementation will contain 4 cMemHandler objects: the data, weights, bias & configuration memory blocks.
Key Members:
- Number of memory units for this object in bytes (all of the object's units are identical in size)
- Start address for this object in the BAC
- Map structure for all the slot arrays used to return results to the ARM
- Map structure for all the slot arrays used to write new data from the ARM
- Next 'empty' (BRAM) address to write for each inner memory unit; this structure is used both for data blocks & slots, before the algorithm starts running

createSlot: Creates a new slot, and a slotArray if needed.
Parameters - slot array ID, memory unit ID, execution unit ID, slot size, write/read slot (flag - T/F), number of slots to advance at once.
Return value - the new slot's BRAM address.
createSlotDirect: Called from the second run of the algorithm onward; does the same as createSlot, but the old data from the previous run already exists.
writeBlock: Writes a data block to a specific memory unit and returns its BRAM address.
writeBlockDirect: Writes data (weights/picture/NEON output) to a BRAM address. Called from the second run of the algorithm onward (same as writeBlock, but the old data from the previous run already exists).
Parameters - BRAM address & DDR address.
writeDataToSlot: Writes data (weights/picture/NEON output) to a slot.
Parameters - slot array ID, data block (address & size), useOnlyOnce - flag identifying whether this data will be written once (e.g. picture) or more than once (e.g. weights); saved in the slotArray and influences the advanceWriteSlots behaviour.
readDataFromSlot: Reads data from a specific slot (and returns its DDR address).
advanceWriteSlots / advanceReadSlots: Advance nextSlotId in the sSlotArray defined by the method input. Called after every write/read from a slot.
Parameters - vector of EU IDs, defining which sSlotArrays to advance (can be more than one).
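The slot-advance mechanic can be sketched as follows; the cyclic wrap-around policy is our assumption for illustration:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of advanceWriteSlots / advanceReadSlots: nextSlotId steps through
// the slot array 'parallelSlots' positions at a time (double buffering),
// wrapping around when it reaches the end. The wrap-around policy here is
// an assumption, not the documented behaviour.
struct SlotArray {
    std::size_t slotCount;
    std::size_t parallelSlots;
    std::size_t nextSlotId = 0;

    void advance() {
        nextSlotId = (nextSlotId + parallelSlots) % slotCount;
    }
};
```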

Upper Bounds

Next Task - Functional Simulation
- Simulation at the execution unit level
- Partial VHDL implementation - controllers only
- Read from files & write to files (no RAM)

Project A - Gantt