1
CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms
Authors: Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, Nachiket Kapre Published in: 2016 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES)
2
Contents
Abstract
Introduction
Background
Exploiting Parallelism
CaffePresso Flow
Results
3
Abstract Low-complexity classifiers used in power-constrained and performance-limited scenarios. Embedded systems with 5-20 W power budgets. Automated code generation and auto-tuning approach. Interesting results: slower GPU processing than most other systems for small datasets, and a faster, more energy-efficient implementation on the older 28 nm TI Keystone II DSP than on the newer 20 nm Nvidia TX1.
This paper focuses on low-complexity classifiers used in power-constrained and performance-limited scenarios, which are characterized by operations on small image maps with 2-3 deep layers and few class labels. The authors consider a range of embedded systems with 5-20 W power budgets: the Xilinx ZC706 board (with the MXP soft vector processor), the Nvidia Jetson TX1 (GPU), the TI Keystone II (DSP), and the Adapteva Parallella board (custom multi-core NoC). The paper presents CaffePresso, an optimized Caffe-compatible framework that targets various accelerators such as FPGAs, DSPs, and GPUs. It uses an automated code-generation and auto-tuning approach based on knowledge of the ConvNet requirements as well as platform-specific constraints. One might expect the Jetson TX1 with cuDNN to deliver the highest ConvNet performance; instead, the authors observe the opposite: slower GPU processing than most other systems for small datasets such as MNIST and CIFAR10, and a faster, more energy-efficient implementation on the TI Keystone II DSP than on the newer Nvidia TX1 SoC in all cases.
4
Introduction Accelerator-based embedded SoC platforms are able to support computer vision applications. Embedded classification tasks are restricted to a few classes. DNNs can be offloaded to DSP, GPU, and FPGA accelerators, as these support high processing throughputs at very low power consumption. Potential of commercial off-the-shelf SoC hardware for implementing these tasks efficiently at lower cost and power in an embedded context.
Modern accelerator-based embedded SoC platforms are able to support computer vision applications such as video analytics in smart cameras, drone-based image processing, medical patient monitoring, and automotive navigational intelligence. The scope of the embedded classification task is restricted to a few classes (e.g., detecting humans, identifying roadblocks, classifying a few faces) because the primary objectives are energy efficiency and low response latency. For embedded scenarios, the DNN can be offloaded to DSP, GPU, and FPGA accelerators, as they support high processing throughputs with very low power consumption. The paper examines the potential of commercial off-the-shelf SoC hardware for implementing these tasks efficiently at lower cost and power in an embedded context.
5
Introduction (cont…) The challenges for the platforms
GPUs: Nvidia GPUs offer high data-parallel processing throughput and the highly optimized cuDNN library.
DSPs: The TI Keystone II exploits an energy-efficient multi-core VLIW organization; its DSPs have optimized DSPLib and IMGLib libraries that take full advantage of the DSP cores.
Multi-cores: The key benefit of the Adapteva Epiphany III SoC is the low power consumption of the 16-core chip.
FPGAs: The Xilinx ZC706 can deliver higher energy efficiency.
GPUs: GPU-based SoC platforms such as the Nvidia Jetson TX1 are a popular choice for embedded computer vision because Nvidia offers high data-parallel processing throughput and the highly optimized cuDNN library, which reformulates the parallelism in the DNN. DSPs: DSP-based platforms such as the TI Keystone II exploit an energy-efficient multi-core VLIW organization that combines multiple instructions into a single cycle for high performance; these DSPs have optimized DSPLib and IMGLib libraries that take full advantage of the DSP cores. Multi-cores: The Adapteva Epiphany III SoC is an exotic multi-core floating-point architecture supported by a message-passing NoC; the key benefit of this alternative organization is the low power consumption of the 16-core chip. FPGAs: can deliver higher energy efficiencies.
6
Introduction (cont…) The key contributions:
Development of Caffe-compatible backends for Deep Learning on various accelerator-based SoCs. Automated code generation and performance tuning. Quantification and analysis of performance and energy efficiency for different datasets and various optimization strategies across the embedded SoC platforms.
7
Background Convolutional Neural Networks
The embedded implementation is parameterized in terms of: the number of layers in the network, the number of feature maps in each layer, the kernel sizes for the 2D convolutions in each layer, and the pooling and subsampling factors (see the sketch below). For embedded CNN implementations, the authors are primarily interested in energy-efficient acceleration of classifiers with few class labels (MNIST and CIFAR10 have only 10 object classes). In these scenarios, the classification model is trained offline on a more capable machine, and only the forward pass must be evaluated quickly with real data on the embedded device.
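As a rough illustration of this parameterization, the record below captures the quantities listed above; the field names and the fixed-size layer array are assumptions for the sketch, not CaffePresso's actual data structures.

```c
/* Illustrative sketch only: a per-layer parameter record for the
 * quantities the mapping is parameterized by. Names and types are
 * hypothetical, not taken from the CaffePresso sources. */
typedef struct {
    int num_maps;      /* feature maps produced by this layer       */
    int kernel_size;   /* K for the KxK 2D convolution kernels      */
    int pool_size;     /* pooling window (1 = no pooling)           */
    int subsample;     /* subsampling / stride factor               */
} conv_layer_spec_t;

typedef struct {
    int num_layers;                 /* depth of the network          */
    conv_layer_spec_t layers[8];    /* 2-3 layers are typical here   */
} convnet_spec_t;
```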
8
Background (cont…) Convolutional Neural Networks
Raw timing numbers for one 32x32 image patch on the Jetson TX1 (CPU and GPU) using Caffe + cuDNN v4 show that, beyond the obvious 8-13x acceleration from using the GPU, 2D convolutions are overwhelmingly the slowest computation in this stack (a reference formulation is sketched below). When sequencing a series of ConvNet layers, storing the intermediate maps can become a challenge for embedded platforms with limited on-chip capacity. The optimization focus for constrained embedded platforms therefore needs to be on faster 2D convolutions as well as smarter data sharing between the layers of the ConvNet stack.
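To make concrete why the 2D convolutions dominate, here is a minimal, unoptimized reference convolution (single map, no padding, float data — all assumptions made for the sketch); the triple-nested multiply-accumulate is the hot loop every backend ultimately optimizes.

```c
/* Reference (unoptimized) KxK convolution of one input map into one
 * output map; no padding, so the output shrinks by K-1 in each
 * dimension. This multiply-accumulate nest is where the bulk of the
 * ConvNet runtime is spent. */
void conv2d_ref(const float *in, int H, int W,
                const float *kernel, int K,
                float *out)
{
    for (int y = 0; y <= H - K; y++) {
        for (int x = 0; x <= W - K; x++) {
            float acc = 0.0f;
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    acc += in[(y + ky) * W + (x + kx)] * kernel[ky * K + kx];
            out[y * (W - K + 1) + x] = acc;
        }
    }
}
```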
9
Background (cont…) Architecture Potential
Comparing various embedded accelerator-based SoC platforms (the back-of-envelope arithmetic behind the peak figures is sketched below):
GPU: The Jetson TX1 contains 256 single-precision floating-point Maxwell cores running at 1 GHz, supported by a 64 KB L1 cache; this gives 256 Gops/s and consumes W.
DSP: The TI Keystone II has eight C66 DSP cores that can each process 32 16x16 integer multiply-add operations per cycle in VLIW fashion at 1.4 GHz; this gives Gops/s and consumes W.
Multi-core + NoC: The multi-core Epiphany III SoC has 16 eCores running at 667 MHz; this gives 10.5 Gops/s and consumes 3-4 W.
FPGA: The FPGA on the SoC uses the higher-performance Kintex-class fabric; this gives 11.5 Gops/s and consumes 19 W.
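The peak figures above follow the usual back-of-envelope pattern of cores x ops-per-cycle x clock; the helper below only sketches that arithmetic with the numbers quoted on this slide and the slide's own counting convention — it is not a figure taken from the paper.

```c
/* Peak-throughput back-of-envelope for the comparison above:
 * Gops/s = cores x ops-per-core-per-cycle x clock (GHz). */
static double peak_gops(int cores, int ops_per_cycle, double clock_ghz)
{
    return (double)cores * ops_per_cycle * clock_ghz;
}
/* peak_gops(256, 1, 1.0)   == 256.0  -- Jetson TX1 figure on this slide */
/* peak_gops(16,  1, 0.667) ~= 10.7   -- roughly the ~10.5 Gops/s quoted
 *                                       for the 16 Epiphany III eCores  */
```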
10
Exploiting parallelism
Optimizing the convolutional network strategy: parallelizing the problem, data storage management for intermediate maps, and communication management for moving inter-layer results (a rough sizing sketch follows below).
Challenge: unlike solutions using abundant hardware resources, embedded solutions are severely constrained by the limited capacity of logic, on-chip memory, and communication bandwidth. Most mappings of DNN computations spend most of their time performing 2D convolutions. Furthermore, sequencing the various convolutional layers generates large amounts of memory traffic due to the dependencies on inter-layer intermediate maps.
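A rough sizing sketch of the inter-layer traffic, assuming each intermediate map is written once by one layer and read once by the next; the formula and the example figures are illustrative, not the paper's cost model.

```c
#include <stddef.h>

/* Bytes needed to hold one layer's output maps. When these maps do
 * not fit on chip, roughly this many bytes are written to and then
 * read back from external memory between consecutive layers. */
static size_t intermediate_map_bytes(int num_maps, int H, int W,
                                     size_t bytes_per_pixel)
{
    return (size_t)num_maps * H * W * bytes_per_pixel;
}
/* e.g. 32 maps of 32x32 16-bit pixels: 32*32*32*2 bytes = 64 KB per layer */
```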
11
Exploiting parallelism (cont…) Parallelism
Each pixel can be processed in parallel, provided sufficient storage and bandwidth for the neighboring pixels are available. The platform-optimized libraries directly use VLIW or SIMD intrinsics as appropriate to exploit this obvious parallelism.
12
Exploiting parallelism (cont…) Memory Storage
A patch-based partitioning strategy decomposes large image frames into smaller tiles that can be processed entirely on-chip. The contribution is the design of an auto-tuning optimization backend (see the sketch below).
For small-scale datasets (such as MNIST or CIFAR10), it is easy to fit the intermediate maps and temporary results in the on-chip RAMs of most SoCs. However, large-scale datasets (such as Caltech101 or ImageNet) have memory requirements that exceed the available on-chip capacity. The Jetson TX1 manages its memory through cuDNN and Caffe, while the other platforms require explicit memory management. When required, a patch-based partitioning strategy decomposes large image frames into smaller tiles that can be processed entirely on-chip. However, this requires an extra, redundant copy of the ghost-pixel border region around each patch so that operation stays fully on-chip when the kernel window falls outside the patch region. The idea of decomposing images into patches is nothing new; the contribution is the design of an auto-tuning optimization backend that, in choosing a patch size for each SoC platform, balances the benefit of fast on-chip processing with small patch sizes against the extra DMA copying time for the redundant data.
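A minimal sketch of the patch-plus-ghost bookkeeping, assuming square PxP output patches and KxK kernels; the helper names are hypothetical, but they show the storage-versus-redundant-copy trade-off the auto-tuner balances.

```c
#include <stddef.h>

/* Each PxP output patch needs a (P+K-1)x(P+K-1) input region, i.e. a
 * ghost border of K-1 extra rows/columns copied redundantly so the
 * KxK kernel never reads outside on-chip memory. */
static size_t patch_input_bytes(int P, int K, size_t bytes_per_pixel)
{
    int side = P + K - 1;              /* patch plus ghost border */
    return (size_t)side * side * bytes_per_pixel;
}

/* Fraction of redundant ghost data relative to the useful patch data:
 * shrinking P reduces on-chip storage but inflates this ratio, which
 * is the trade-off made when picking a patch size per platform. */
static double ghost_overhead(int P, int K)
{
    double in  = (double)(P + K - 1) * (P + K - 1);
    double use = (double)P * P;
    return (in - use) / use;
}
```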
13
CaffePresso Flow Mapping methodology
Caffe input: prototxt files in Google ProtoBuf format, which describe the network layers, with trained kernel weights in an .lmdb file (an illustrative layer entry is shown below). Code generation: individual Caffe layers are translated to low-level platform-specific code automatically. Auto-tuning: this stage tailors the final mapping of a given ConvNet specification to the target platform.
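For reference, a convolution layer in the standard Caffe prototxt format looks roughly like the fragment below; the layer name and sizes are invented for illustration, and the code generator walks entries of this form.

```
# Illustrative Caffe prototxt fragment (names and sizes are made up).
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20      # feature maps produced by this layer
    kernel_size: 5      # 5x5 2D convolution kernels
    stride: 1
  }
}
```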
14
CaffePresso Flow (cont…) Platform-Specific Optimization
The optimization strategy is generalized: identify the performance limits of each platform; determine a high-level parallelization and partitioning strategy as well as the memory organization; then let the auto-tuning framework choose specific implementation parameters and degrees of optimization to fully customize the mapping for each platform and ConvNet combination (a sketch of the tuning loop follows below).
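A minimal sketch of the auto-tuning idea, reduced to a single parameter (patch size); `time_layer_with_patch` is a hypothetical timing hook, and the real tuner also considers unrolling factors and memory-placement choices.

```c
#include <float.h>

/* Try each candidate patch size, time the layer with platform-specific
 * timing code behind the hook, and keep the fastest configuration. */
extern double time_layer_with_patch(int patch_size);  /* hypothetical hook */

static int tune_patch_size(const int *candidates, int n)
{
    int best = candidates[0];
    double best_t = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double t = time_layer_with_patch(candidates[i]);
        if (t < best_t) { best_t = t; best = candidates[i]; }
    }
    return best;
}
```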
15
CaffePresso Flow (cont…) Platform-Specific Optimization
TI DSPs Memory management: the authors write their own memory allocator to store intermediate maps in the local 6 MB MSMC RAM. Patch-based map partitioning: the available on-chip memory is large enough to use large patch sizes. Improving ALU utilization: unrolling the loops to reduce loop overheads. Data type: fixed-point IMGLib routines for pixel processing (a generic fixed-point sketch follows below).
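A generic sketch of the fixed-point, fully unrolled inner-loop style this implies — not the actual IMGLib routine — assuming Q15-style 16-bit pixels and coefficients and a 3x3 kernel.

```c
#include <stdint.h>

/* Fixed-point 3x3 convolution for one output row, with the kernel loop
 * fully unrolled so a VLIW compiler can pack the multiply-accumulates.
 * `in` points at the top-left pixel of the 3-row input window, W is
 * the input row stride, and `shift` rescales the 32-bit accumulator. */
void conv3x3_q15(const int16_t *in, int W, int16_t *out, int out_w,
                 const int16_t k[9], int shift)
{
    for (int x = 0; x < out_w; x++) {
        int32_t acc =
            in[0 * W + x + 0] * k[0] + in[0 * W + x + 1] * k[1] + in[0 * W + x + 2] * k[2] +
            in[1 * W + x + 0] * k[3] + in[1 * W + x + 1] * k[4] + in[1 * W + x + 2] * k[5] +
            in[2 * W + x + 0] * k[6] + in[2 * W + x + 1] * k[7] + in[2 * W + x + 2] * k[8];
        out[x] = (int16_t)(acc >> shift);
    }
}
```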
16
CaffePresso Flow (cont…) Platform-Specific Optimization
FPGA MXP Engine AXI instruction-dispatch latencies: avoiding these requires restructuring the parallel vector operations. Kernel access: kernel coefficients are pre-assembled into long vectors with repeated scalar entries on the MXP. Instruction reordering: the order of VMUL and VADD operations is manually reorganized to avoid sequential dependencies in the vector engines (illustrated generically below). CPU-FPGA partitioning: the fully connected and pooling layers run faster on the ARM CPU than on the MXP.
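The dependency-chain point can be illustrated with a plain-C stand-in for the VMUL/VADD reordering: splitting an accumulation across independent accumulators shortens the serial chain of additions so multiplies and adds can overlap. This is only an analogy for the vector-engine change, not MXP code.

```c
/* Serialized: every addition depends on the previous one. */
float dot_chained(const float *a, const float *b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Reordered: two independent accumulators halve the length of the
 * serial add chain, exposing more overlap between operations. */
float dot_reordered(const float *a, const float *b, int n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    if (i < n) acc0 += a[i] * b[i];
    return acc0 + acc1;
}
```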
17
CaffePresso Flow (cont…) Platform-Specific Optimization
Parallella On-chip NoC: when the dataset is small enough, the entire stack fits in on-chip memory. Patch-based map partitioning: the largest workable patch size is determined statically (see the sketch below).
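A sketch of what "statically determine the largest patch size" can look like, assuming square patches, a KxK kernel, and a per-core memory budget left over after code and stack (the Epiphany III eCores have 32 KB of local memory each); the function is illustrative, not the Parallella backend's code.

```c
#include <stddef.h>

/* Pick the largest P whose input tile (with ghost border), output tile
 * and kernel all fit in the per-core memory budget. */
static int largest_patch(size_t budget, int K, size_t bytes_per_pixel)
{
    int best = 0;
    for (int P = 1; P < 1024; P++) {
        size_t in  = (size_t)(P + K - 1) * (P + K - 1) * bytes_per_pixel;
        size_t out = (size_t)P * P * bytes_per_pixel;
        size_t ker = (size_t)K * K * bytes_per_pixel;
        if (in + out + ker > budget) break;
        best = P;
    }
    return best;
}
```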
18
CaffePresso Flow (cont…) Platform-Specific Optimization
Nvidia GPU Uses the optimized cuDNN library. Matrix arithmetic based on BLAS routines is among the fastest operations possible on a GPU (the conventional im2col lowering is sketched below).
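The BLAS-based formulation conventionally lowers convolution to a matrix multiply via an im2col transform; the single-channel sketch below illustrates that lowering and is not cuDNN's actual implementation.

```c
/* im2col for one input channel: each KxK input window becomes one
 * column of `col`, so convolution with M kernels reduces to an
 * (M x K*K) * (K*K x out_h*out_w) matrix product handled by a BLAS
 * routine. `col` must hold K*K * out_h*out_w floats. */
void im2col_single(const float *in, int H, int W, int K, float *col)
{
    int out_h = H - K + 1, out_w = W - K + 1;
    for (int ky = 0; ky < K; ky++)
        for (int kx = 0; kx < K; kx++)
            for (int y = 0; y < out_h; y++)
                for (int x = 0; x < out_w; x++)
                    col[((ky * K + kx) * out_h + y) * out_w + x] =
                        in[(y + ky) * W + (x + kx)];
}
```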
19
Results The Keystone II DSP offers the best performance and energy efficiency in all cases. For larger configurations, the GPU shows better performance. The Parallella implementation is surprisingly competitive for small datasets.
20
Q&A