1
A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing
Tong Geng, Tianqi Wang, Ahmed Sanaullah, Chen Yang, Rushi Patel, Martin Herbordt
Boston University & USTC
FPL'18
2
CNN: most popular algorithm in ML
Widely used in computer vision tasks such as object classification, detection, and recognition.
In machine learning, CNNs have become one of the most widely used algorithms, especially for computer vision tasks such as object classification, detection, and recognition.
3
Inference of CNN
Inference with a CNN is useful but difficult to accelerate. Tens of billions of dollars have been invested, and a large community of researchers and engineers is devoted to accelerating it.
4
Faster Training is important!
"Training in days" makes design space exploration of neural networks difficult; fast training helps the development of ML.
Compared to inference, training has not received sufficient attention. Acceleration of training is the next step and deserves to be explored.
Slow training has become a barrier to the development of neural networks: when researchers explore new network structures or topologies to train a more accurate network, training times of several days to a week make sufficient design space exploration impossible. Hence, fast training is very important for the development of the ML field.
5
Training is much more difficult than Inference!
Inference (AlexNet): ~720 million FLOPs per image.
Training: ~2.2 billion FLOPs per image × batch size × iterations × epochs → can reach 550 trillion FLOPs.
To train a CNN, clusters or clouds are necessary.
While the design space of inference acceleration has been explored extensively, acceleration of CNN training has yet to receive enough attention, especially in the FPGA field. Training a network is very heavy work: AlexNet, for example, takes around 550 trillion FLOPs to train, so clusters or clouds should be used. The problem is how to map CNN training logic onto multiple devices efficiently.
Ben-Nun, Tal, and Torsten Hoefler. "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis." arXiv preprint (2018).
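The arithmetic above is easy to reproduce. Below is a minimal Python sketch: the per-image FLOP counts are the slide's figures, while the batch size, iteration count, and epoch count are hypothetical values chosen only so the product lands near the quoted ~550 trillion FLOPs.

```python
# Back-of-the-envelope training cost, using the slide's per-image figures.
INFERENCE_FLOPS_PER_IMAGE = 720e6  # AlexNet forward pass (slide figure)
TRAINING_FLOPS_PER_IMAGE = 2.2e9   # forward + backward + weight gradients (slide figure)

def total_training_flops(batch_size: int, iterations_per_epoch: int, epochs: int) -> float:
    """Total cost = per-image training cost * number of images processed."""
    return TRAINING_FLOPS_PER_IMAGE * batch_size * iterations_per_epoch * epochs

# Hypothetical schedule; real schedules depend on dataset size and convergence.
total = total_training_flops(batch_size=256, iterations_per_epoch=100, epochs=10)
print(f"training vs. inference per image: {TRAINING_FLOPS_PER_IMAGE / INFERENCE_FLOPS_PER_IMAGE:.1f}x")
print(f"total training cost: {total / 1e12:.0f} trillion FLOPs")  # ~563 trillion here
```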
6
How to map training logic to multiple devices?
Most widely used: data parallelism.
Large batch sizes limit scalability; weights are broadcast between the server and all clients in a centralized topology, or exchanged all-to-all in a decentralized topology.
There are three ways to map a network onto multiple devices (whether FPGA, GPU, or CPU): data parallelism, model parallelism, and layer parallelism. Data parallelism is the most widely used method in ML training, especially on GPUs, where most training is done. However, it has two serious problems. First, a large cluster with many devices requires an extremely large minibatch when training with SGD, and with very large batches multi-device training becomes inefficient. Second, the communication among devices is extremely heavy: in forward propagation parameters are transferred between devices, and in backward propagation gradients of the parameters are transferred. In a centralized topology, the parameter server broadcasts parameters to all clients in the forward pass, while all clients send their weight gradients back to the server for the update at almost the same time, creating heavy bandwidth pressure.
Focusing on the first problem: hundreds of works address the large-batch problem, but they only raise the upper bound on feasible minibatch sizes; they do not remove it. As the two figures show, once the batch size grows beyond roughly 10k, the error becomes unacceptable and training performance improves only slowly.
Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint (2017).
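To see why the centralized exchange creates so much bandwidth pressure, a rough per-iteration traffic estimate at the parameter server is useful. The sketch below is a simplification, assuming 32-bit values, AlexNet's commonly cited ~61M parameters, and a purely hypothetical client count.

```python
def parameter_server_traffic_gb(num_params: float, num_clients: int, bytes_per_value: int = 4) -> float:
    """Per-iteration traffic at a central parameter server under data parallelism:
    the server broadcasts one copy of the weights to every client and receives one
    gradient set back from every client (no compression or overlap assumed)."""
    one_copy_bytes = num_params * bytes_per_value
    return 2 * num_clients * one_copy_bytes / 1e9  # GB moved per iteration

# AlexNet has roughly 61M parameters; the client count here is illustrative only.
print(f"{parameter_server_traffic_gb(61e6, num_clients=64):.1f} GB per iteration")  # ~31 GB
```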
7
Cont.
This figure is from Facebook's "ImageNet in 1 hour" paper. They applied many clever techniques to make training time scale down with the number of devices; here the mini-batch size grows linearly with the number of devices. The ideal relation between "time per epoch" and "number of devices" is linear (an epoch is one pass over the whole data set, and training a network typically needs around 80 epochs, so with multiple devices the time per epoch should shrink linearly as devices are added), but Facebook's work does not reach it: the measured curve (the red line) follows roughly y = -a(2^x) + b.
Ideal result: a linear relation between "time per epoch" and "number of devices". Facebook did a great deal of work to relax the batch-size limitation, but the result is still not as good as expected.
Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint (2017).
8
How to map training logic to multiple devices?
Model parallelism: weights are stored in distributed fashion, but there is too much data exchange among devices to combine intermediate results.
Layer parallelism: weights are stored in distributed fashion, but load balancing is difficult, as the workloads of different layers vary greatly.
The GPU is not a good platform for either: there is no direct inter-SM communication, and inter-device communication is inefficient.
Is the FPGA the right choice? How do we balance the workload under layer parallelism, and how do we reduce the massive data exchange under model parallelism? FPDeep is proposed!
These are the other two mapping methods. With model parallelism, for each layer all intermediate results from all devices have to be summed and then broadcast to every device, which leads to heavy communication. With layer parallelism, the workloads of the devices are not balanced, which leads to low utilization. With these two methods there is no large-batch problem: any batch size can be used during training, even with hundreds or thousands of devices. In our view, the reason they are not widely used today is that GPUs cannot support them well, since GPUs lack direct inter-SM communication and inter-device communication is expensive. The FPGA, however, is a good fit, provided we solve the problems of model and layer parallelism: balancing the workload and reducing the massive data exchange. FPDeep is proposed to solve these problems.
9
Hybrid Model/Layer Parallelism
Weights are stored in distributed fashion.
Less data exchange compared to pure model parallelism.
Workloads are balanced among devices.
Small batch sizes are always supported, even with hundreds of devices.
In FPDeep we use hybrid model/layer parallelism. The layer-parallel dimension makes the data exchange much smaller than in pure model parallelism, while the model-parallel dimension brings the opportunity to balance the workload. Compared to data parallelism, batch size is not a problem in our work, so training time decreases linearly as the number of FPGAs grows.
10
FPDeep
A framework to map CNN training onto FPGA clusters using hybrid model/layer parallelism.
Provides: high performance and energy efficiency; workload balancing; storage balancing for parameters, with only on-chip memory needed at CONV layers; good portability (1-D topology); high scalability (up to 83 FPGAs with only 5 transceivers).
FPDeep takes the network configuration and the hardware constraints as inputs. Via partitioning and mapping, the network is distributed over multiple FPGAs so that each FPGA receives a balanced workload, and the framework generates the RTL implementation for each FPGA automatically. The framework provides high performance and energy efficiency for two reasons. First, FPDeep uses novel inter- and intra-layer partitioning and mapping methodologies that improve workload balance among FPGAs by increasing the flexibility of workload allocation. Second, CNN training is executed in a fine-grained pipelined manner: features propagate quickly, which shortens the time features must be cached while waiting for backward propagation and reduces the storage demand to the point that only on-chip memory is needed for CONV layers. In addition, FPDeep has very good scalability and portability; its 1-D topology makes it suitable for many existing platforms.
11
How to partition the workload?
Inter-Layer Partition
Resources are allocated according to the workload of each layer.
For example: according to the number of FLOPs, Layer 1 needs 4.8 FPGAs and Layer 2 needs 2.2 FPGAs.
The computation resources of $T_{FP}$ FPGAs are allocated to the $L$ layers in proportion to their computational requirements, i.e., the numbers of operations for FP, EB, and WG.
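A minimal sketch of this proportional allocation, assuming the per-layer operation counts (FP + EB + WG) are already known; the function name and the numbers below are illustrative, not taken from the paper.

```python
def allocate_fpgas(layer_ops, total_fpgas):
    """Split total_fpgas across layers in proportion to each layer's
    operation count (FP + EB + WG), as described on the slide."""
    total_ops = sum(layer_ops)
    return [total_fpgas * ops / total_ops for ops in layer_ops]

# Illustrative operation counts whose ratio reproduces the slide's 4.8 : 2.2 example.
shares = allocate_fpgas(layer_ops=[4.8e9, 2.2e9], total_fpgas=7)
print(shares)  # -> [4.8, 2.2] FPGAs for Layer 1 and Layer 2
```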
12
Cont.: Intra-Layer Partition
IFP: each FPGA evaluates a part of the input features.
The figure illustrates how the target CONV layer is partitioned and mapped onto 4.8 FPGAs using IFP. The 192 input features and the corresponding weights are partitioned into 5 segments: four large segments of 40 features each and one smaller segment of 32 features. Each FPGA receives one of the five segments and performs partial evaluations; each output feature is then calculated by summing the related partial results from the 5 FPGAs. Because of hardware resource constraints, it is not possible to compute partial results for all output features on a single FPGA in parallel. Hence, by comparing the hardware required to compute each partial output result with the hardware available on chip, the number of output channels processed in parallel is decided; in the figure, 2 channels are processed in parallel.
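A small sketch of this channel segmentation; the splitting heuristic below is a plausible reconstruction that reproduces the slide's 192-channel example, not the framework's actual code.

```python
import math

def split_channels(num_channels: int, fpgas_for_layer: float) -> list:
    """Split the input channels of a CONV layer into one segment per (whole or
    fractional) FPGA, e.g. 192 channels over 4.8 FPGAs -> four segments of 40
    channels and one smaller segment of 32 channels."""
    num_segments = math.ceil(fpgas_for_layer)              # 5 segments for 4.8 FPGAs
    per_full_fpga = round(num_channels / fpgas_for_layer)  # 40 channels per full FPGA
    sizes = [per_full_fpga] * (num_segments - 1)
    sizes.append(num_channels - sum(sizes))                # remainder to the partial FPGA
    return sizes

print(split_channels(192, 4.8))  # -> [40, 40, 40, 40, 32]
```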
13
Cont.: Integration of Inter- and Intra-Layer Partition
Each FPGA does its part of the workload and forwards the result to the next FPGA.
For example: 4.8 FPGAs for Layer 1 and 2.2 FPGAs for Layer 2 (with the IFP method).
The figure illustrates how the partitioning schemes work together. All input features (Fi1) of Layer 1 are partitioned into 5 (4.8) segments and processed on 5 (4.8) FPGAs, respectively. Each FPGA produces partial results for all output features (Fo1) of Layer 1, and the complete output features are computed by summing the partial results from the 5 (4.8) FPGAs. Afterwards, these output features are partitioned into 3 (2.2) segments and fed into the 3 (2.2) FPGAs of Layer 2, so each FPGA in Layer 2 consumes data from all FPGAs in Layer 1. The most straightforward way to build the communication network is to connect the FPGAs involved directly, but then each FPGA would need tens of SMA ports. This results in a very complex network topology that makes the FPGA cluster difficult to maintain and limits scalability and portability. Hence, a 1-D daisy-chain topology is used.
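To make the inter-/intra-layer split concrete, here is a sketch that walks a 1-D chain of FPGAs and records which fraction of each layer every device hosts. It only illustrates the idea of fractional, boundary-straddling assignments; it is not the mapping algorithm from the paper.

```python
def map_layers_to_chain(fpgas_per_layer):
    """Assign layers to a 1-D daisy chain of FPGAs, given fractional per-layer
    allocations (e.g. [4.8, 2.2]); a device on a layer boundary hosts parts of
    both neighboring layers. Layer index 0 corresponds to 'Layer 1' on the slide."""
    assignments = []   # one dict per FPGA: {layer_index: fraction of that layer}
    current, capacity = {}, 1.0
    for layer, need in enumerate(fpgas_per_layer):
        while need > 1e-9:
            used = min(capacity, need)
            current[layer] = current.get(layer, 0.0) + used
            capacity -= used
            need -= used
            if capacity <= 1e-9:          # device is full: move to the next one
                assignments.append(current)
                current, capacity = {}, 1.0
    if current:
        assignments.append(current)
    return assignments

for i, a in enumerate(map_layers_to_chain([4.8, 2.2])):
    print(f"FPGA {i}: {a}")
# FPGAs 0-3 host only Layer 1, FPGA 4 hosts ~0.8 of Layer 1 and ~0.2 of Layer 2,
# and FPGAs 5-6 host the rest of Layer 2.
```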
14
Layer Fusion: fine-grained pipelining
Activations propagate faster: there is no need to wait for the completion of a whole layer.
The time features must be cached while waiting for backward propagation is reduced.
Reduced storage demand.
CNN training is executed in a fine-grained pipelined manner: features propagate quickly, which shortens the time they must be cached while waiting for backward propagation and reduces the storage demand to the point that only on-chip memory is needed for CONV layers. Output pixels sharing the same data dependency start to be calculated as soon as all of their dependent data are ready. For the 3-layer network illustrated in the upper figure, once the data covered by the gray window in the first layer are ready, the gray pixel in the second layer is computed. When all data covered by the yellow rectangle in the first layer are ready, all data in the yellow rectangle in Layer 2 can be calculated, and afterwards the yellow pixels in Layer 3 are ready to be calculated; at this point, partial results of the FC layer can be computed from these Layer-3 data. The yellow rectangles indicate the data dependency from the first CONV layer to the first FC layer. After the data under the yellow window are completely processed, the window slides right and then down until it reaches the red one, while the data of the remaining layers covered in yellow are processed at the same time. Hence, different layers are fused and processed in parallel, and their latencies overlap.
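The dependency that triggers this fine-grained pipelining can be computed from the layer geometry. Below is a small sketch (1-D for simplicity) that gives the width of the first-layer window needed before one deeper-layer pixel can be produced; the kernel sizes and strides used are illustrative only.

```python
def input_window(kernel_sizes, strides):
    """Width of the first layer's input window needed to produce a single output
    pixel after the given stack of CONV layers (1-D receptive-field recurrence).
    A deeper-layer pixel can start computing as soon as this window is ready."""
    size = 1
    for k, s in zip(reversed(kernel_sizes), reversed(strides)):
        size = (size - 1) * s + k
    return size

# Illustrative 3-layer stack of 3x3 CONVs with stride 1: a 7-pixel-wide window of
# layer-1 input is enough to start computing one layer-3 output pixel.
print(input_window(kernel_sizes=[3, 3, 3], strides=[1, 1, 1]))  # -> 7
```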
15
Architecture
Here is the architecture of the FPDeep accelerator. All modules are parameterized and written in HDL. The FPGAs work in a deeply pipelined manner: as the figure shows, multiple FPGAs can cooperatively compute the same layer, while multiple layers can also be mapped onto the same device.
16
Weight Balancing
17
Overall Resource Allocation Optimization
Solve the optimization problem for the whole FPGA cluster's resources.
18
Overall Resource Allocation Optimization
According to the optimization result, allocate CONV engines and buffers to each FPGA device.
19
Resource Utilization
Example: mapping AlexNet / VGG-16 / VGG-19 onto 15 FPGAs.
BRAM utilization; DSP utilization & throughput.
AlexNet, without weight balancing: unbalanced BRAM utilization.
VGG-16/19, with weight balancing: balanced BRAM utilization.
Three different networks are mapped to a cluster of 15 FPGAs; with the help of weight balancing, BRAM utilization is spread much more evenly across the devices.
20
Results: Scalability to 80+ FPGAs, Utilization >98%
Scalability: 80+ FPGAs. Utilization: >98%.
Here are the results using AlexNet and VGGNet as testbeds. As the roofline-model figures show, FPDeep has excellent scalability: as the number of FPGAs increases, throughput scales linearly up to 80+ FPGAs while using only 4-5 transceivers. Each FPGA provides about 1 TOPS of performance on a VC709. As the utilization figures show, the fraction of idle stages is always below 5%, and with sufficient FPGAs it stays below 2%.
21
Performance: Power Efficiency (GOPS/J)
Power efficiency (GOPS/J): 5.5x versus the K80 GPU, 8.8x versus the Titan X GPU, and 5.7x versus other FPGA designs.
Here we compare our design with commercial GPUs and existing FPGA designs. Among the many results, the most important is energy efficiency, evaluated in GOPS/J: our design is more than 5x better than the others. Training uses 16-bit fixed point with the rounding algorithm proposed at ICML 2015.
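The ICML 2015 reference is the low-precision training work of Gupta et al., "Deep Learning with Limited Numerical Precision", which quantizes to fixed point with stochastic rounding. A minimal sketch of that rounding idea follows; the 8 fractional bits chosen here are illustrative, not the configuration used in FPDeep.

```python
import math
import random

def to_fixed16_stochastic(x: float, frac_bits: int = 8) -> float:
    """Quantize x to a signed 16-bit fixed-point value with `frac_bits` fractional
    bits using stochastic rounding: round up with probability equal to the
    fractional remainder, so the quantization error is zero in expectation."""
    scaled = x * (1 << frac_bits)
    low = math.floor(scaled)
    q = low + (1 if random.random() < scaled - low else 0)
    q = max(-(1 << 15), min((1 << 15) - 1, q))  # saturate to the int16 range
    return q / (1 << frac_bits)

# Averaging many stochastically rounded copies recovers the original value.
samples = [to_fixed16_stochastic(0.1234) for _ in range(10_000)]
print(sum(samples) / len(samples))  # ~0.1234 despite only 8 fractional bits
```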
22
Off-chip memory access
The bandwidth of DDR is a significant bottleneck, and memory-access optimization is always a challenge.
Example: Zhao's work.
Zhao, Wenlai, et al. "F-CNN: An FPGA-based Framework for Training Convolutional Neural Networks." 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2016.
23
Off-chip memory access
Another example: Zhang's work.
The bandwidth constraint limits the overall performance of the design, even though many memory-access optimizations are applied.
Zhang, Chen, et al. "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks." Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015.
24
Unbalanced Storage and workload
The storage and operation demands of each AlexNet layer are shown in the table.
Implementation example: Zhang's work. The table lists the parameters of AlexNet, and the figure shows the architecture that uses multiple FPGAs to accelerate a CNN in a deeply pipelined manner. Both the storage and the workload are unbalanced across devices.
Zhang, Chen, et al. "Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster." Proceedings of the 2016 International Symposium on Low Power Electronics and Design. ACM, 2016.