YOLO-based Object Detection on ARM Mali GPU

Trunghai Do, Jemin Lee, Hyungshin Kim*
Dept. of Computer Science and Engineering, Chungnam National University

Good morning. Today I am presenting our work, “YOLO-based Object Detection on ARM Mali GPU”. My name is Trung Hai, and I am a master's student in the Embedded System Laboratory at Chungnam National University, supervised by Prof. Kim Hyungshin.

Contents
Introduction
Related works
Implementation
Results
Conclusions

First, I will introduce our work, including the motivation and the challenges we faced in implementing it. In the second part, I will discuss studies related to our work. Next, I will describe our implementation and the results we obtained with our method. Finally, I will briefly conclude.

Introduction
Object detection = classify object + localize object.
Object detection requires more computation and memory than classification.
CNN-based object detection for embedded devices.
Training: high-end NVIDIA GPUs.
Deploying: embedded devices.

Recently, convolutional neural networks (CNNs) have been widely used in vision applications. Object detection is one of the most challenging tasks in computer vision because it must not only classify an object's category but also localize the object's position in an image. Real-time object detection is known to require more computation and memory than image classification. For training, we can still use high-end NVIDIA GPUs, but once the models are trained, they should be deployable on a wide range of target devices, from high-end GPUs to embedded devices.

Introduction (cont…)
Motivation: autonomous systems such as self-driving vehicles, drones or robots all need real-time object detection to guarantee control.
Challenges:
Deploying deep networks on systems that are resource-constrained compared to high-end GPUs.
Maintaining accuracy and detection speed.
Models should be small.

The motivation for this investigation is that the trend towards automation in the processing and manufacturing sectors, together with the continued evolution of low-cost pervasive computing, has been a key market driver over the last few years. Autonomous systems such as self-driving vehicles, drones or robots all need real-time object detection to guarantee control without delay. However, most recent research focuses on implementing real-time object detection on high-end devices.

We can obtain higher accuracy if we deepen the CNN architecture. Starting with AlexNet in 2012 and continuing with GoogLeNet, VGGNet and ResNet in 2014 and 2015, deeper networks have provided evidence of this, and it applies to both classification and object detection. However, the model size grows as the CNN architecture gets deeper. We need small models because, given the memory hierarchy, large networks do not fit in on-chip SRAM and require more costly off-chip DRAM accesses. Accessing SRAM is faster and consumes less energy than accessing DRAM. Therefore, the models need to be deep enough to maintain accuracy but still small enough to be deployed on resource-constrained embedded systems.

Introduction (cont…) This work
Training caffe-yolo models on NVIDIA GTX-1080.
Deploying models on ARM Mali-T628 MP6 GPU.
The GoogLeNet-based model takes 7.5 seconds per image.
The Fast YOLO-based model takes 1.6 seconds per image.

In this work, we use the YOLO algorithm for object detection.
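
To put the two per-image times on a common scale, the short sketch below converts them to throughput and to a relative speedup. Only the 7.5 s and 1.6 s figures come from the slide; the function name and the dictionary labels are illustrative.

```python
def frames_per_second(seconds_per_image):
    """Convert a per-image latency in seconds to throughput in images/s."""
    if seconds_per_image <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / seconds_per_image


# Per-image latencies reported on the slide (seconds).
latencies = {"GoogLeNet-based": 7.5, "Fast YOLO-based": 1.6}

for name, seconds in latencies.items():
    print(f"{name}: {frames_per_second(seconds):.3f} images/s")

# How many times faster the Fast YOLO-based model is per image.
speedup = latencies["GoogLeNet-based"] / latencies["Fast YOLO-based"]
print(f"speedup: {speedup:.2f}x")
```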

Related works
Hardware solutions: FPGA or ASIC.
Software solutions on high-end GPUs:
Classification-based solutions: R-CNN series.
Regression-based solutions: YOLO.

Several works are related to our research. For object detection, some hardware solutions have been proposed. FPGAs and ASICs are custom hardware solutions for CNN-based vision tasks. They have the potential to achieve both high performance and relatively low power consumption. Compared with custom hardware solutions, software solutions can be easily deployed on a variety of popular embedded hardware. For the object detection task, there are two attractive families of methods. The first is classification-based (e.g. the R-CNN series) and the second is regression-based (e.g. YOLO).

Related works (cont…)
Software solutions on embedded devices
One study based on the Fast R-CNN method was evaluated on the Jetson TK1 board.
LCDet is inspired by YOLO; its authors train the model with the TensorFlow framework for the face detection task and experiment on an NVIDIA K40 GPU.
SqueezeDet is also based on YOLO; its authors train the model with TensorFlow, evaluate it on a 3-category object detection dataset, and experiment on NVIDIA TITAN X GPUs.

When researchers turn their attention to embedded devices, they try to migrate these object detection algorithms to them. One study based on the Fast R-CNN method was evaluated on the Jetson TK1. However, this board still relies on the NVIDIA CUDA library. LCDet is inspired by YOLO; its authors train the model with the TensorFlow framework for the face detection task and experiment on an NVIDIA K40 GPU. SqueezeDet is also based on YOLO; its authors train the model with TensorFlow and evaluate it on a 3-category object detection dataset comprising car, cyclist and pedestrian, because their work focuses on autonomous driving. They also experiment on NVIDIA TITAN X GPUs. What all these works have in common is that they use NVIDIA GPUs for inference.

Implementation
Deep learning framework
Most well-known deep learning frameworks use the CUDA library.
Most embedded GPUs support the OpenCL library.
CK-Caffe is a deep learning framework based on Caffe.
CK-Caffe runs on the OpenCL API.

Most well-known deep learning frameworks, such as Caffe, Caffe2, TensorFlow and Torch, rely on the CUDA library, which only supports NVIDIA GPUs. However, most embedded GPUs support the OpenCL library instead. In this work, we choose CK-Caffe, an open-source deep learning framework based on Caffe. Unlike Caffe, CK-Caffe runs on the OpenCL API. In the training phase, we can easily define a Caffe model architecture and train it on an NVIDIA GTX-1080 GPU. Once the model is trained, it can be deployed on the ARM Mali-T628 GPU, which runs CK-Caffe.

Implementation (cont…)
Object detection algorithm: YOLO

We employ YOLO in our work because it is faster than the Faster R-CNN algorithm. YOLO adopts a single-pass detection pipeline that combines bounding box localization and classification in a single network. The input image is divided into an SxS grid. Each grid cell predicts B bounding boxes and confidence scores for those boxes. Each bounding box consists of 5 predictions: the 4 coordinates of the bounding box and a confidence. The YOLO predictions are encoded as an SxSx(5B+C) tensor. For evaluation on PASCAL VOC, S=7, B=2, and C=20 is the number of classes in PASCAL VOC. Therefore, the final prediction is a 7x7x30 tensor.

Joseph Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, 2015
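
The SxSx(5B+C) encoding can be checked with a few lines of Python. This is a minimal sketch, not the code used in this work: the function names are hypothetical, and the per-cell memory layout (B boxes of 5 values each, followed by C class scores) is an assumption made for illustration, since implementations order these values differently.

```python
def yolo_output_shape(S, B, C):
    """Shape of the YOLO prediction tensor: S x S x (5B + C)."""
    return (S, S, 5 * B + C)


def split_cell_prediction(cell, B, C):
    """Split one grid cell's flat vector into B boxes and C class scores.

    Assumed layout (for illustration only):
    [x, y, w, h, confidence] repeated B times, then C class scores.
    """
    if len(cell) != 5 * B + C:
        raise ValueError("cell vector must have 5*B + C entries")
    boxes = [tuple(cell[5 * i:5 * i + 5]) for i in range(B)]
    class_scores = list(cell[5 * B:])
    return boxes, class_scores


# PASCAL VOC setting from the slide: S=7, B=2, C=20.
shape = yolo_output_shape(7, 2, 20)
print(shape)

# A dummy cell vector, just to show how the 30 values are grouped.
boxes, class_scores = split_cell_prediction(list(range(30)), B=2, C=20)
print(len(boxes), len(class_scores))
```

With S=7, B=2 and C=20, each cell carries 2 x 5 + 20 = 30 values, which matches the 7x7x30 tensor stated above.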

Network architectures:
Originally, the Base YOLO and Fast YOLO models use fully connected layers for detection.
We use the GoogLeNet model and the Fast YOLO model as backbone models.
Our models replace all fully connected layers with a convolutional layer for detection.

Our work focuses on designing the network architecture. Originally, the Base YOLO and Fast YOLO models both use fully connected layers for detection. We use the GoogLeNet model and the Fast YOLO model as backbone models. Compared to Base YOLO, the Inception module of GoogLeNet was designed to work well even under strict constraints on memory and computational budget, so we use the GoogLeNet model instead of the Base YOLO model. In both cases, we use the same strategy: stacking two more convolutional layers to obtain richer features before forwarding the feature maps to the detection layer. As you know, most of the parameters lie in the fully connected layers, which makes the model size bigger. To overcome this problem, we replace all fully connected layers with a convolutional layer for detection (let’s call it ConvDet). As illustrated in the table, the number of parameters in the detection layer has been dramatically reduced.
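
To see why swapping fully connected layers for a convolutional detection layer shrinks the model, the sketch below counts parameters for both kinds of head. The sizes are illustrative assumptions, not the values from the table in the slides: the fully connected head uses the dimensions of the original YOLO paper (7x7x1024 feature map, 4096 hidden units, 7x7x30 = 1470 outputs), and the convolutional head is assumed to be a single 3x3 convolution from 1024 channels to 30 channels.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out


# Assumed fully connected detection head: 7*7*1024 -> 4096 -> 1470.
fc_head = fc_params(7 * 7 * 1024, 4096) + fc_params(4096, 7 * 7 * 30)

# Assumed convolutional detection head: one 3x3 conv, 1024 -> 30 channels.
conv_head = conv_params(3, 1024, 30)

print(f"fully connected head: {fc_head:,} parameters")
print(f"convolutional head:   {conv_head:,} parameters")
print(f"reduction factor:     {fc_head / conv_head:.0f}x")
```

The convolutional layer reuses the same small kernel at every grid position, so its parameter count does not depend on the 7x7 spatial size, whereas the first fully connected layer needs one weight per input value per hidden unit.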

Training: all models are trained on the PASCAL VOC 2007 & 2012 “trainval” sets with input images of size 448x448.
GoogLeNet + ConvDet model: pretrained GoogLeNet from Caffe’s Model Zoo + ConvDet.
Fast YOLO + ConvDet model: Fast YOLO pretrained on the ILSVRC dataset, then the ConvDet layer is appended for detection and the network is fine-tuned on the PASCAL VOC dataset.

We apply transfer learning in our work. All models are pre-trained on the ILSVRC 2012 ImageNet dataset and then fine-tuned on the PASCAL VOC 2007 and 2012 “trainval” sets with input images of size 448x448. We take the pre-trained GoogLeNet model from Caffe’s Model Zoo and then apply our strategy to construct the architecture before training. Because there is no pre-trained Fast YOLO model in Caffe’s Model Zoo, we had to train it from scratch on the ILSVRC 2012 dataset for about one week, until it reached a top-1 accuracy % and a top-5 accuracy of 80.1% on the validation set. Finally, we append the ConvDet layer to the pre-trained model and fine-tune the network on PASCAL VOC for the object detection task.

Results
Inference accuracy

After designing our model architectures and training them on the PASCAL VOC dataset on an NVIDIA GTX-1080 GPU, we deploy them on the ODROID-XU4 board and obtain the following results. As the inference accuracy results show, by employing GoogLeNet as the backbone architecture we obtain a mean Average Precision (mAP) comparable to that of Base YOLO (63.4%). Meanwhile, our Fast YOLO architecture has a relatively low mAP compared to the original Fast YOLO (52.7%). This poor mAP can be explained by the low accuracy of the feature extraction layers pre-trained on ILSVRC 2012.

Results (cont…)
Execution time

The ODROID-XU4 is equipped with an Exynos 5422 with four big cores (ARM Cortex-A15) and four small cores (ARM Cortex-A7). It is also equipped with an ARM Mali-T628 MP6 GPU. The MP6 version has six shader cores: 4 shader cores for GPU-0 and 2 shader cores for GPU-1. The table shows that deploying models on GPU-0 is more efficient than on the CPU or GPU-1. GPU-0 has twice as many shader cores as GPU-1, and thus GPU-0 is twice as fast as GPU-1. Additionally, CK-Caffe takes advantage of the OpenCL API, which lets the deep learning framework run more efficiently on the GPU than on the CPU.

Conclusions

By leveraging YOLO, we trained GoogLeNet-based and Fast YOLO-based models on an NVIDIA GPU and deployed them on the ARM Mali-T628 GPU of the ODROID-XU4. We have experimented and found that real-time object detection can be deployed on a resource-constrained ARM-based GPU. In the future, we will investigate YOLO9000 to improve our results.

Thank you
