Final Presentation: Neural Network Implementation on FPGA
Department of Electrical Engineering, Technion - Israel Institute of Technology

Background
A neural network is a machine learning system designed for supervised learning from examples. Such a network can be used for handwritten digit recognition, but a software implementation is inefficient in both time and resources. This project is the third part of a three-part project; our goal is to implement an efficient hardware solution to the handwritten digit recognition problem. Implementing dedicated hardware for this task follows a current trend in VLSI architecture called heterogeneous computing: designing a system on chip with many accelerators, each dedicated to a specific task, to achieve a better performance/power ratio.

Theoretical Background - The Network
Neural networks are modeled on the biological neural system. The basic units that construct the network are neurons and weights:
- Neuron: connected to multiple inputs and outputs; the neuron's output is the result of an activation function applied to the sum of its inputs.
- Weight: the basic unit that connects neurons; a weight multiplies the data passing through it by the weight value.

Theoretical Background - The Network
From neurons and weights we can construct a neural network with as many layers as we like. Each layer contains a certain number of neurons, and a set of weights connects the layer to other layers. The complexity of the network is determined by the dimension and variability of the input: the more complex and variable the input, the larger the network must be.

Learning Algorithm
The output of the network can be multiple neurons. Each output neuron can be represented mathematically as a function of the inputs, with the weights treated as parameters. For each input X, we would like to minimize the average error between the network output Y and the desired vector D.
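Written out (a standard mean-squared-error formulation; the exact expression on the original slide was an image and is not preserved), the objective over a training set of N examples is:

$$ E(W) = \frac{1}{N} \sum_{n=1}^{N} \left\| Y(X_n; W) - D_n \right\|^2 $$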

Learning Algorithm
The method we use to reach the minimum error is a gradient-based algorithm. For each example input we compute a weight update (the update formula was shown as an image on the original slide). This step is performed for each weight in each layer; the calculation moves the weights one small step towards a minimum of the error. An error function typically has many local minima, so the algorithm does not guarantee that we will reach the global minimum.
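A typical per-weight update of this form, assuming a learning rate $\eta$ (the slide's exact expression is not preserved):

$$ w \leftarrow w - \eta \, \frac{\partial E}{\partial w} $$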

Network Description - Structure & Functionality
(Network overview diagram: the input is a 29x29 grayscale image feeding layer 0; layers 1-2 are convolutional layers and layers 3-4 are fully-connected layers; the output is 10 neurons, where +1 marks the recognized digit ("answer") and -1 the other 9.)

NN Structure & Functionality
(Diagram of the input and first layers: the 29x29 input image forms layer #0 with 841 neurons; layer #1 consists of six 13x13 feature maps, feature map #0 through #5, 1014 neurons in total.)

NN Structure & Functionality
(Diagram of the remaining layers: layer #1's six 13x13 feature maps feed layer #2's fifty 5x5 feature maps, map #0 through #49, which feed the fully-connected layer #3 with 100 neurons, n#0-n#99, and the output layer #4 with 10 neurons, d#0-d#9.)

Network Description - Structure & Functionality
Layer #0: The first layer. The input to this layer is the 29x29-pixel image. The pixels can be seen as 841 neurons without weights, and they are the input to the next layer.
Layer #1: The first convolution layer, which produces 6 feature maps, each with 13x13 pixels/neurons. Each neuron on the output layer is the result of a masking operation between a 5x5 weight kernel (+1 for the bias, 26 weights in total, different for each of the 6 maps) and 25 pixels from the input image; the 25 products are summed with the bias and fed into the activation function (tanh). In other words, each feature map is the result of a non-standard 2D masking between a 5x5 weight kernel (each kernel results in a different feature map) and the 29x29 input neurons, summed with an added bias. The masking is non-standard because the 5x5 neuron sub-matrices are derived by shifts of 2 (instead of 1), both vertically and horizontally, starting with the 5x5 sub-matrix at the upper left corner of the 29x29 input neurons.
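As an illustration, a minimal MATLAB sketch of computing one layer-1 feature map with this shift-of-2 masking (the image, kernel and bias values here are placeholders of ours; this is not the project's bit-accurate model):

    % Minimal sketch of one layer-1 feature map ("masking" with shift of 2).
    img    = rand(29, 29);              % layer-0 neurons (29x29 input image), placeholder
    kernel = rand(5, 5) - 0.5;          % 5x5 weight kernel for this feature map, placeholder
    bias   = 0.1;                       % bias weight (the "+1" weight), placeholder

    fmap = zeros(13, 13);               % (29 - 5)/2 + 1 = 13 outputs per dimension
    for r = 1:13
        for c = 1:13
            rows = 2*(r-1) + (1:5);     % vertical shift of 2 between sub-matrices
            cols = 2*(c-1) + (1:5);     % horizontal shift of 2
            fmap(r, c) = tanh(sum(sum(img(rows, cols) .* kernel)) + bias);
        end
    end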

Network Description - Structure & Functionality
Layer #2: The second convolution layer. Its output is 50 feature maps of 5x5 neurons (a total of 1250 neurons). Each neuron is the result of a masking calculation similar to that of the previous layer, only now each of the 50 feature maps is the sum of six 2D mask operations; each masking has its own 5x5 (+1 bias) weight kernel and is performed between the kernel and its matching feature map of layer 1 (the horizontal and vertical shifts are 2, as in the previous layer).

Network Description - Structure & Functionality
Layer #3: A fully connected layer containing 100 neurons. Each neuron has 1250 entries (the output neurons of the previous layer) that are multiplied by 1250 corresponding weights. There are 125,100 weights in this layer (1251 x 100, including a bias weight per neuron).
Layer #4: The last fully connected layer. It contains 10 output neurons, each of which is connected to the previous layer's 100 neurons by a different weight vector. The 10 outputs represent the 10 possible recognition options; the neuron with the highest value corresponds to the recognized image. There are 1010 weights in this layer (101 x 10).
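For illustration, a minimal MATLAB sketch of the two fully-connected layers (the variable names and the random placeholder weights are ours; we also assume the same tanh activation on the output layer):

    % Minimal sketch of layers #3 and #4; real weights come from the training phase.
    x2 = rand(1250, 1) * 2 - 1;         % layer-2 outputs (placeholder values)
    W3 = rand(100, 1250) - 0.5;  b3 = rand(100, 1) - 0.5;
    W4 = rand(10, 100)   - 0.5;  b4 = rand(10, 1)  - 0.5;

    x3 = tanh(W3 * x2 + b3);            % layer #3: 100 fully-connected neurons
    x4 = tanh(W4 * x3 + b4);            % layer #4: 10 output neurons
    [~, idx] = max(x4);                 % highest-valued output neuron
    digit = idx - 1;                    % recognized digit, 0..9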

Network Description - Structure & Functionality
Summary table:
Layer | Num of neurons  | Inputs            | Num of weights
#0    | 29x29 = 841     | -                 | -
#1    | 13x13x6 = 1014  | 5x5 = 25          | 26*6 = 156
#2    | 5x5x50 = 1250   | 5x5x6 = 150       | 50*6*26 = 7800
#3    | 100             | 1250              | 1251*100 = 125,100
#4    | 10              | 100               | 101*10 = 1010
Total: 3215 neurons and 134,066 weights

SW Simulator Implementation Summary
In project A, we implemented the neural network described in the previous slides in MATLAB. The MATLAB implementation achieved a 98.5% correct digit recognition rate.

Software Simulation Results Usage
In the current project, we used the results of the previous software implementation as a reference point for the hardware implementation, both for the success rate and for the performance. To achieve the same success rate, we simulated several fixed-point implementations of the network (as opposed to the previous MATLAB floating-point arithmetic) and chose the minimal format that achieved ~98.5%. Another use of the software simulation results is as the source of the network's weight parameters, which were produced by the learning-process implementation.
(The slide's table of recognition rate versus the number of integer and fractional bits is garbled in this transcript; the rates ranged from roughly 10% for the smallest formats up to ~98.43%, and the chosen implementation is the 3.5 fixed-point format, i.e. 3 integer bits and 5 fractional bits.)
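A minimal sketch of the fixed-point quantization used in such a simulation (our own helper, assuming rounding and saturation; not the project's simulator code):

    % Quantize to signed fixed point with ibits integer bits (incl. sign) and
    % fbits fractional bits, with rounding and saturation.
    to_fixed = @(x, ibits, fbits) ...
        min(max(round(x * 2^fbits) / 2^fbits, -2^(ibits-1)), 2^(ibits-1) - 2^(-fbits));

    % Example: the chosen 3.5 format (8 bits total)
    to_fixed(0.7431, 3, 5)    % -> 0.75
    to_fixed(5.2,    3, 5)    % -> 3.96875 (saturated)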

Project Goals
- Devise an efficient & scalable HW architecture for the algorithm.
- Implement dedicated HW for handwritten digit recognition.
- Achieve the SW model's recognition rate (~98.5%).
- Major performance improvement compared to the SW simulator.
- Low cell count, low power consumption.
- Fully functional system: NN HW implementation on FPGA with a PC interface that runs a digit recognition application.

Project Top Block Diagram

Architecture Aspects
This architecture tries to optimize the resources/throughput tradeoff. Neural networks have a strongly parallel nature, and our implementation tries to exploit this parallelism (which is expressed as high throughput at the output of the system). A fully parallel implementation would require 338,850 multipliers (one for each of the 338,850 multiplications needed for a single digit recognition), which is obviously not feasible. In our architecture, we decided to use 150 multipliers. This number was chosen with careful attention to current FPGA technology: on the one hand we did not want to utilize all the multipliers of the FPGA, but on the other hand we did want to use a substantial number of multipliers in order to exploit the parallel nature of the algorithm.
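The multiplication count follows from the per-layer figures in the summary table (counting only the multiply operations, not the bias additions):

$$ 1014 \cdot 25 + 1250 \cdot 150 + 100 \cdot 1250 + 10 \cdot 100 = 25{,}350 + 187{,}500 + 125{,}000 + 1{,}000 = 338{,}850 $$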

Architecture Aspects
Our target technology is the Virtex-6 XC6VLX240T, which offers 768 DSP slices, each containing (among other things) a 25x18-bit multiplier; we are therefore utilizing 150 of the 768 DSP blocks. In theory, we can add future functionality to the FPGA, as we are far from the resource limit. This was done intentionally, as modern FPGA DSP designs are usually systems that integrate many DSP modules.

Memory Aspects
Another important guideline for the architecture is memory capacity. The algorithm requires ~135,000 weights and ~3250 neurons, each represented by 8 bits (the 3.5 fixed-point format, which is the minimum number of bits required to achieve the same success rate as the MATLAB double-precision model). This means that a minimum of ~1.1 Mb (megabit) of memory is required. The Virtex-6 XC6VLX240T offers 416 RAM blocks of 36 kb (kilobit) each, totaling ~14.6 Mb. This means that we utilize only about 7.5% of the internal FPGA RAM.
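In numbers:

$$ (135{,}000 + 3{,}250) \times 8\ \text{bits} \approx 1.1\ \text{Mb}, \qquad 416 \times 36\ \text{kb} \approx 14.6\ \text{Mb}, \qquad 1.1 / 14.6 \approx 7.5\% $$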

Micro-architecture Implementation - Memories
All RAM memories were generated using Coregen, specifically targeted at the destination technology (Virtex-6). Small memories (~10 kb) were implemented as distributed RAM, and large memories were implemented using block RAM. Overall, 4 memory blocks were generated:
- Layer 0 neuron memory: single-port distributed RAM block of depth 32 and width 29*8 = 232 bits; total memory size ~9 kb.
- Layer 1 neuron memory: single-port distributed RAM block of depth 16 and width 13*6*8 = 624 bits; total memory size ~10 kb.
- Weights bias memory: single-port ROM block of depth 261 and width 6*8 = 48 bits; total memory size ~12 kb.

Micro-architecture Implementation - Memories (cont.)
- Weights and layer 2 memory: dual-port block RAM. One port has a read & write width of 1200 bits (depth 970 each); the second port has a write width of 600 bits (depth 1940) and a read width of 1200 bits (depth 970). Total memory size ~1.15 Mb. The layer 2 neuron memory and the weights memory were combined into one big RAM block for better utilization of the memory architecture provided by the Virtex-6.
- Layer 3 & layer 4 neuron memory: implemented in registers (110 bytes).

Micro-architecture Implementation - Mult_add_top
This unit receives 150 neurons, 150 weights & 6 bias weights, and returns 6 sum-of-products outputs (the output formula was shown as an image on the original slide). The arithmetic is implemented using 150 DSP blocks and 6 adder trees, each adder tree containing 15 signed adders (all adders were generated using Coregen and implemented in fabric rather than DSP blocks), for a total of 90 adders.
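Given that the 150 products are grouped into 6 sums of 25 with a bias each, the output presumably has the form (our reconstruction, since the slide's formula image is not preserved):

$$ \text{out}_j = b_j + \sum_{i=1}^{25} n_{j,i} \cdot w_{j,i}, \qquad j = 1, \dots, 6 $$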


Micro-architecture Implementation - Tanh
Implemented as a simple LUT: 8 input bits are mapped to 8 output bits (total LUT size is therefore 256 x 8).
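A minimal MATLAB sketch of how such a LUT could be generated, assuming the 3.5 fixed-point format at both the input and the output (not the project's actual generator):

    % Build a 256-entry tanh LUT for 8-bit signed 3.5 fixed point in and out.
    codes = (-128:127)';                               % all 8-bit two's-complement codes
    x     = codes / 2^5;                               % interpret as 3.5 fixed point: -4 .. 3.96875
    yq    = max(min(round(tanh(x) * 2^5), 127), -128); % quantize the output back to 3.5
    lut   = int8(yq);                                  % 256x1 table; entry i holds tanh for code i-129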

Micro-architecture Implementation - Rounding & Saturation Unit
Logic to cope with the bit expansion caused by the multiplication operands (the result has twice the number of bits of the multiplicands) and the addition operands (the result has 1 more bit than the added numbers' representation). Neurons are represented using 8 bits in 3.5 fixed-point format. This format was decided upon after simulating several fixed-point formats and finding the minimal number of bits needed to achieve 98.5% accurate digit recognition (equal to the success rate of MATLAB's floating point). The rounding & saturation logic operates according to the following rules:
- if input < -4, then output = -4 (binary '10000000')
- else if input > 4 - 2^-5 = 3.96875, then output = 3.96875 (binary '01111111')
- else output = round(input * 2^5) * 2^-5
where round() is the hardware implementation of MATLAB's round() function.
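For example (illustrative values of ours), a wide intermediate result of 1.29 is rounded to the nearest multiple of $2^{-5}$, while 4.7 saturates:

$$ \mathrm{round}(1.29 \cdot 2^5) \cdot 2^{-5} = 41 \cdot 2^{-5} = 1.28125, \qquad 4.7 \mapsto 4 - 2^{-5} = 3.96875 $$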

Micro-architecture Implementation - UART Interface
The uart_if module receives the 29x29-byte image from the PC bit by bit (serial I/F). The RX module coalesces every 29 bytes into one memory word written to the L0 neuron memory. After all 29 memory words are written, the start signal rises and image processing begins. When image processing is done, the digit recognition unit outputs the result (a 10-byte bus, one byte per digit) to the UART TX, which drives the results onto the serial I/F.

Resource Utilization Summary
Resource                         | Naïve implementation | Our implementation
Memory                           | ~1.1 Mb              | ~1.2 Mb
Multipliers                      | ~350,000             | 150
Adders                           | ~175,000             | ~100
Activation function (tanh) units | ~2400                | 6
As can be seen, the naïve implementation (a brute-force, fully parallel implementation of the network) is not feasible in hardware because of its impractical resource demands. Our architecture offers reasonable resource utilization while still improving performance substantially compared to the software implementation.

Development Environment
- SW development platform: MATLAB
- HW development platform: Editor - Eclipse; Simulation - ModelSim 10.1b; Synthesis - XST (ISE)
- FPGA: Virtex-6 XC6VLX240T

HDL Implementation & Verification
All of the modules described in the previous slides were successfully implemented in Verilog HDL. A testbench was created for each module, input vectors were created & injected, and the simulation results were compared against a bit-accurate MATLAB model. Once the simulation results for all stimulus vectors were consistent with the bit-accurate model, the module was considered verified. After each module was individually verified, we connected the different modules and implemented a controller over the entire logic. A testbench & bit-accurate MATLAB models for all stages of the controller were created for the entire project.

HDL Implementation & Verification
ModelSim simulation results for the recognition of the digit '9' (waveform not reproduced in this transcript).

Performance Analysis
As mentioned before, the purpose of implementing the system in hardware was to better exploit the parallel nature of the algorithm and thus increase performance. The previous SW implementation required ~5000 μs to perform a single digit recognition. The FPGA implementation requires ~3000 clock cycles per digit recognition. Our implementation uses a 150 MHz system clock, so the total time required for a single digit recognition is ~20 μs. In summary, the hardware implementation achieves a performance speed-up of about 250.
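In numbers:

$$ t_{\text{HW}} \approx \frac{3000\ \text{cycles}}{150\ \text{MHz}} = 20\ \mu\text{s}, \qquad \text{speed-up} \approx \frac{5000\ \mu\text{s}}{20\ \mu\text{s}} = 250 $$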

FPGA Resource Utilization
Virtex-6 basic resources:
1) Each Virtex-6 FPGA slice contains four LUTs and eight flip-flops.
2) Each DSP48E1 slice contains a 25x18 multiplier, an adder, and an accumulator.
3) Block RAMs are fundamentally 36 Kbits in size; each block can also be used as two independent 18 Kb blocks.
Utilization of the FPGA's basic resources in our implementation: (utilization table not reproduced in this transcript)

User Interface (GUI)
In order to make the MATLAB implementation usable for future users, we designed a GUI (Graphical User Interface) that wraps the network's functions while allowing configurability. The GUI includes 4 modes of work: training mode, verification mode, user mode and FPGA mode. The first 3 modes (training, verification and user mode) run the SW simulator from the project's first part. In FPGA mode, the user sends a digit image via UART to the digit recognition unit implemented on the FPGA.

Project Challenges
The main goal of the project, implementing a well-functioning handwritten digit recognition system, proved very challenging. We can divide the challenges into 3 main categories: architecture, implementation and verification.
Architecture-oriented problems:
- Devising an efficient hardware architecture for the algorithm proved one of the biggest challenges of the project. Neural networks can theoretically be implemented completely in parallel, but this solution is not practical resource-wise. A lot of thought was put into the tradeoff between parallelism and resource usage. In order to allow a degree of scalability, we had to identify what the different layers have in common and devise an architecture that lets all layers use the same logic modules, instead of implementing each layer in a straightforward fashion.
- Resource estimation: our target device is a Virtex-6 FPGA, and our architecture had to take that into consideration. We had a well-defined limit on the amount of memory & multipliers available to us, and therefore needed to devise an architecture that would not exceed these limits.

Project Challenges
Implementation-oriented problems:
- Our target device (Xilinx Virtex-6 FPGA) was unfamiliar to us, so we had to learn how to operate Xilinx's tools to implement logic modules compatible with the target technology.
- We used fixed-point arithmetic for the first time and gained much experience in this area, including implementing hardware rounding & saturation logic. At first, we implemented simple truncation rounding, but found that it lowered the success rate unacceptably. We therefore needed to implement a more elaborate rounding method that imitates MATLAB's round() function.
- In order to implement such a modular & scalable architecture, a smart controller had to be implemented. Composing the control algorithm and afterwards coding the controller proved very challenging.

Project Challenges
Verification-oriented problems:
- Verification of the system was probably the most challenging aspect.
- As stated in the previous section (implementation-oriented problems), we were unfamiliar with Xilinx's tools, and therefore, after creating the desired logic (such as memories, multipliers etc.) using these tools, we had to verify that it worked. Nothing worked at first, so this proved a long process until we learned to use Xilinx's IPs properly.
- Most challenging of all was achieving a successful verification of the entire system. Our system contains an extremely large amount of data (neurons, weights, partial results), so every small mistake in the controller produces a lot of wrong output data, and it is very difficult to pinpoint the origin of the mistake. For example, if we accidentally coded the controller such that a data_valid strobe arrives at a certain module one clock earlier than it should, then the entire data flow continues with data which is essentially garbage, and it is hard to find the origin of the mistake. To overcome this, we had to produce a MATLAB bit-accurate model for each step of the design, not only for the final results.

Thank you