
Final Project presentation


1 Final Project Presentation
Design space exploration and optimization of FPGA-based Neural Network Accelerator
Maedeh Hemmat
ECE734 - VLSI Array Processors for Digital Signal Processing

2 Outline
Goal: For any CNN on any arbitrary FPGA platform, find a mapping that:
Minimizes the data transfer between computational units and memory
Maximizes data reuse
Increases the throughput of the accelerator
Approach and Progress: a software-hardware co-design
Software: developing an algorithm to explore the design space by considering the structure of the network and the characteristics of the CNN
Hardware: implementing the optimal mapping in hardware (ModelSim)
Measuring the power consumption and throughput of the network under different scenarios
Speaker notes: Nowadays, CNNs have gained significant attention for performing more complicated tasks. To perform these tasks, neural networks are becoming deeper and deeper, that is, the number of layers increases. Deeper neural networks make power-efficient hardware implementation more challenging, especially for embedded devices, cell phones, and wearable devices where the energy budget is limited.

3 Motivation
Input feature map: N_x × N_y × N_if
Convolutional layer weights: K_x × K_y × N_if × N_of
Output feature map: N_x' × N_y' × N_of
The four-nested loop leads to a large design space with various choices for implementing the CONV layer while exploiting parallelism.
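The four-nested loop over output feature maps, output pixels, and kernel weights can be sketched as follows. This is a behavioral reference, not the accelerator itself; the stride-1, no-padding setting is an assumption, and the innermost kernel-position and channel loops are vectorized for brevity:

```python
import numpy as np

def conv_layer(ifmap, weights):
    """Naive convolutional layer: the explicit loops over kernels and
    output pixels (plus the vectorized inner MAC over Kx * Ky * Nif
    weights) expose the design space for unrolling.
    Assumes stride 1 and no padding."""
    Nx, Ny, Nif = ifmap.shape
    Kx, Ky, _, Nof = weights.shape
    Nx_out, Ny_out = Nx - Kx + 1, Ny - Ky + 1
    ofmap = np.zeros((Nx_out, Ny_out, Nof))
    for of in range(Nof):                      # loop over output feature maps
        for x in range(Nx_out):                # loop over output rows
            for y in range(Ny_out):            # loop over output columns
                # inner MAC: Kx * Ky * Nif multiply-accumulates
                ofmap[x, y, of] = np.sum(
                    ifmap[x:x + Kx, y:y + Ky, :] * weights[:, :, :, of])
    return ofmap
```

Each level of this loop nest is a candidate for unrolling in hardware, which is exactly the choice the next slides explore.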

4 Proposed Algorithm
What are the choices? Which one is the optimal one?
Unroll over input feature maps (channels):
Processing P_if channels in one clock cycle
P_if pixels from different feature maps are used
An accumulator is required to accumulate the results from different channels
If P_if = N_if => one output per clock cycle
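Channel unrolling can be sketched behaviorally as follows (a software model of the datapath, not RTL; the grouping of P_if channels per "cycle" follows the slide):

```python
import numpy as np

def pixel_channel_unrolled(ifmap_patch, kernel, P_if):
    """Compute one output pixel by unrolling over input channels:
    each 'cycle' multiplies P_if pixels from different input feature
    maps with the matching kernel weights, and an accumulator sums the
    partial results.  ifmap_patch and kernel both have shape
    (Kx, Ky, Nif).  Returns the pixel value and the cycle count."""
    Kx, Ky, Nif = kernel.shape
    acc = 0.0
    cycles = 0
    for kx in range(Kx):
        for ky in range(Ky):
            for c0 in range(0, Nif, P_if):     # P_if channels per cycle
                acc += np.dot(ifmap_patch[kx, ky, c0:c0 + P_if],
                              kernel[kx, ky, c0:c0 + P_if])
                cycles += 1
    return acc, cycles
```

With P_if = N_if (and a 1x1 kernel position per cycle), the cycle count per output pixel reaches its minimum, matching the "one output per clock cycle" case on the slide.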

5 Proposed Algorithm
What are the choices? Which one is the optimal one?
Unroll over output feature maps (kernels):
Processing P_of kernels in one clock cycle
Input pixels from the same feature map are multiplied with weights from different kernels
Each input pixel is reused P_of times
Partial results over channels need to be stored
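Kernel unrolling can be sketched the same way (again a behavioral model, not RTL): one fetched pixel feeds P_of multipliers per "cycle", and one partial sum per kernel must be kept alive across the channel loop.

```python
import numpy as np

def pixel_kernel_unrolled(ifmap_patch, kernels, P_of):
    """Unroll over output feature maps: each 'cycle' broadcasts one
    input pixel to P_of multipliers holding weights from different
    kernels, so every fetched pixel is reused P_of times.  One partial
    sum per kernel is stored across the channel loop.
    kernels has shape (Kx, Ky, Nif, Nof); returns Nof output pixels."""
    Kx, Ky, Nif, Nof = kernels.shape
    partial = np.zeros(Nof)                    # per-kernel partial sums
    for kx in range(Kx):
        for ky in range(Ky):
            for c in range(Nif):
                pix = ifmap_patch[kx, ky, c]   # fetched once ...
                for of0 in range(0, Nof, P_of):
                    # ... reused by P_of kernels in one cycle
                    partial[of0:of0 + P_of] += (
                        pix * kernels[kx, ky, c, of0:of0 + P_of])
    return partial
```

The `partial` array makes the storage cost on the slide concrete: it is the set of per-kernel partial results that must survive until the channel loop finishes.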

6 Proposed Algorithm
Pick the mapping with less latency and fewer on-chip buffer accesses.
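The selection step above can be sketched as a simple comparison over per-mapping cost estimates. Treating buffer accesses as the primary objective with latency as the tie-break is an assumption for illustration; the slide only says both should be minimized:

```python
def pick_mapping(candidates):
    """Given per-mapping (latency_cycles, buffer_accesses) estimates,
    pick the mapping with the fewest on-chip buffer accesses, breaking
    ties on latency.  The objective ordering is an assumption; the
    candidate names below are hypothetical."""
    return min(candidates,
               key=lambda m: (candidates[m][1], candidates[m][0]))

# Hypothetical example: channel unrolling wins on buffer accesses.
choice = pick_mapping({"channel_unroll": (1200, 525),
                       "kernel_unroll": (900, 1025)})
```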

7 Experimental Results
Trade-off between the different mappings under different numbers of channels
Size of weight matrix: 5 x 5 x Number_of_channels x 32

9 Experimental Results
Target CNN: LeNet5
Two convolutional layers followed by one fully connected layer
Trained and tested on MNIST data
First CONV layer: 5 x 5 x 1 x 20 (unrolling over channels)

Kernel Tiling Factor | Number of kernels per Tile | On-chip buffer accesses | Output pixels per cycle
1                    | 20                         | 525                     | 10
4                    | 5                          | 1025                    |
8                    | 3                          | 1225                    | 1.5

Second CONV layer: 5 x 5 x 20 x 50
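The "Number of kernels per Tile" column is consistent with a ceiling division of the layer's 20 kernels by the tiling factor. A small sketch, under the assumption that this is how the counts on the slide were derived:

```python
import math

def kernels_per_tile(total_kernels, tiling_factor):
    """Number of kernels processed together in one tile, assuming the
    slide's counts come from a simple ceiling division."""
    return math.ceil(total_kernels / tiling_factor)
```

For the first CONV layer (20 kernels), tiling factors 1, 4, and 8 give 20, 5, and 3 kernels per tile, matching the table.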

10 Hardware Simulation
Implementing the second layer of LeNet5 in hardware
16-bit fixed-point operation
Two-input multipliers and adder trees to implement the MAC
Quantizing the generated output to 16 bits to control dynamic-range growth
Using ModelSim to implement the design
Synthesizing the design at 45 nm technology
Measuring the power consumption of the layer under different scenarios (by changing the available on-chip buffer size)
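The fixed-point datapath can be modeled behaviorally as below. The slide only fixes the 16-bit width, so the Q8.8 split, round-to-nearest, and saturation are assumptions of this sketch:

```python
def to_q8_8(x):
    """Quantize a float to 16-bit Q8.8 fixed point (assumed format),
    saturating to the representable 16-bit range."""
    v = int(round(x * 256))                      # 8 fractional bits
    return max(-32768, min(32767, v))

def mac_fixed_point(pixels_q, weights_q):
    """16-bit x 16-bit multiplies accumulated in a wide register (as an
    adder tree would produce), then the result is quantized back to
    16 bits to control dynamic-range growth, as on the slide."""
    acc = 0
    for p, w in zip(pixels_q, weights_q):
        acc += p * w                             # 32-bit products
    # The product of two Q8.8 values is Q16.16: shift back to Q8.8
    # with rounding, then saturate to the 16-bit output width.
    return max(-32768, min(32767, (acc + 128) >> 8))
```

Keeping the accumulator wide and quantizing only the final output mirrors the adder-tree-then-truncate structure described on the slide.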

11 Experimental Results
Power consumption of the second layer of LeNet5 under different scenarios
The on-chip buffer size varies from 40 kb down to 8 kb

12 Conclusion
Developing an algorithm that finds the optimal mapping of a CNN onto FPGA platforms
Minimizing the number of on-chip buffer accesses to minimize power consumption
Implementing and synthesizing one layer of LeNet5

13 Q & A

