1
Neural Network Optimizations for On-Device AI
TinyML Meetup: Neural Network Optimizations for On-Device AI. Machine Learning Technology Group, Arm, San Jose. Lingchuan Meng and Naveen Suda, December 2019.
2
Why is ML Moving to the Edge?
Bandwidth, reliability, power, security, cost, latency. We simply cannot defy the laws of physics or economics: it costs a lot of bandwidth and power to transmit data from billions of edge devices to the cloud and get the result back. It adds latency to the application, and the network may not be reliable or secure.
3
Wide Range of “Edge” Applications
Applications span a wide range of power and performance: autonomous drive and image enhancement (~100 W), voice & image recognition and object detection (~10 W), pattern training (~1 W), and keyword detection (~1-10 mW) in the Tiny-ML region; performance (ops/second) rises together with power and silicon cost. Different people have different views of their "edge": it can range from self-driving cars to small devices sitting in a smart light bulb. In this talk we focus on the Tiny-ML end of the range. ML applications that can be powered by Cortex-M systems include always-on keyword spotting, pattern matching/learning (e.g. a smart thermostat), and low-range, high-volume surveillance cameras.
4
On-Device ML - Challenges
Tiny-edge device constraints for deploying ML algorithms:
- Limited memory: Flash (32 kB to a few MB), SRAM (16 kB and up)
- Limited compute capability (100 MHz - 1 GHz)
Hardware features to exploit:
- Compression HW: pruning, clustering, etc.
- Mixed precision: 8-bit, 16-bit, etc.
- Algorithmic: Winograd, etc.
- Layer fusion: conv-add-pool-relu, etc.
Model optimization investigation:
- Enabling end-to-end exploration of Arm ML products
- Consistent and reproducible technologies and results
- Developing expertise and a playbook
- Target platform for RSH tech transfer (Matt Mattina)
Complete ML products = Model Optimization + Software + Hardware
On-Device ML solutions = Model Optimization + Software + Hardware
5
End-to-end Technology Exploration
ML networks (vision, voice, vibration) flow through model optimizations (pruning, quantization, clustering), software (drivers, libraries, algorithms), and hardware (models, RTL, FPGA), which are evaluated on PPAB: performance, power, area, and bandwidth. Sparsity, low-precision arithmetic, and compression-HW algorithms are developed jointly across the stack. IanB jumpstarted the initiative for a network optimization framework to realize end-to-end technology exploration: an infrastructure with representative workloads and algorithm/software/hardware co-development.
6
Accuracy versus Performance
Region 1: minimal accuracy impact, no downside. Region 2: trading off accuracy vs. performance. Region 3: the limit of optimizations. An accuracy-driven flow takes more iterations; quick deployment applies some optimization and accepts a small accuracy loss; highest performance pushes the optimizations to their limits.
7
Overview of Model Optimizations
Collaborative optimizations:
- Cascading with optimization preservation: sparsity / clusters / quantization
- Wide support of networks: CNNs, SSD, KWS, DS (preliminary)
- Hyper-parameter optimization techniques: heuristic, RL, simulated annealing
- Good accuracy: from small degradations to better-than-original
8
Optimized Models
Networks     | Optimization                                                               | Accuracy Loss/Increase
Inception V3 | pruned (50%), clustered (5-bit), quantized (8-bit)                         | 1% loss
Resnet 50    |                                                                            | 1.1% loss
VGG-16       | pruned (50%), quantized (8-bit), clustered (3 clusters for last 3 layers)  | 0.3% increase
* Post-training quantization applied; accuracy further improves with fine-tuning.
This is a placeholder slide for showing optimization numbers; we'll refine and reduce the number of networks shown here. Optimizations are applied pruning-only, clustering-only, or cascaded, across application domains such as image classification, object detection, and speech recognition. The goals are to reduce model size, improve compressibility, and enable efficient on-device computation.
9
Algorithmic Optimizations
Complex-domain Winograd and 8-bit Winograd. The Winograd Conv2D pipeline applies an input transform and a filter transform (each followed by a FakeQuant node), a batched MatMul, and an output transform. The vector implementation gives a 2.24x speedup over the scalar one.

Inception V3 accuracy (Top-1 / Top-5), with and without quantization-aware training:
Method               | FQ training | Top-1  | Top-5
FP32                 | no          | 76.94% | 93.40%
Int8 (FQ), Im2Col    | no          | 76.38% | 93.13%
Int8, F(4x4) real    | no          | 0.52%  | 1.79%
Int8, F(4x4) real    | yes         | 60.49% | 82.86%
Int8, F(4x4) complex | no          | 74.86% | 92.43%
Int8, F(4x4) complex | yes         | 76.27% | 93.10%

Speedup vs. NCNN F(2x2): VGG-16 94.55%, ResNet-18 21.13%, GoogleNet 12.82%, SqueezeNet 8.86%.
Lingchuan Meng et al., "Efficient Winograd Convolution via Integer Arithmetic," arXiv: NCNN:
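To make the Winograd pipeline above concrete, here is a minimal NumPy sketch of the standard real-domain F(2,3) 1-D building block (the Lavin-Gray transforms), not the complex-domain or quantized variants discussed on the slide: two correlation outputs are computed from four inputs with four multiplies via Y = Aᵀ[(Gg) ⊙ (Bᵀd)].

```python
import numpy as np

# Standard 1-D Winograd F(2,3) transforms (Lavin & Gray):
# two outputs of a 3-tap correlation from four inputs, using 4 multiplies.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_f23(d, g):
    """Compute two correlation outputs: y = A^T [(G g) * (B^T d)]."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)  # input tile
g = np.array([0.5, -1.0, 2.0], dtype=np.float32)      # 3-tap filter

direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(winograd_f23(d, g), direct)   # the two results match
```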
10
Overview of Pruning Techniques
Magnitude pruning, channel pruning, and structured pruning.
Song Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," arXiv (2015).
Sajid Anwar et al., "Structured Pruning of Deep Convolutional Neural Networks," arXiv (2015).
Yihui He et al., "Channel Pruning for Accelerating Very Deep Neural Networks," arXiv (2017).
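As an illustration of the simplest of these techniques, here is a minimal NumPy sketch of magnitude pruning; `magnitude_prune` is a hypothetical helper for illustration, not an Arm or TensorFlow API.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights.
    Returns pruned weights and the binary keep-mask (illustrative helper)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

w = np.random.randn(64, 64).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.5)
print(1.0 - mask.mean())   # ~0.5 of the weights are now zero
```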
11
Hyper-Parameter Optimization
Search strategies: deterministic (uniform or heuristic) and reinforcement learning.
Heuristic per-layer target sparsity: pr_i = α · log(var_i), with α = pr · Σ_i var_i / Σ_i (var_i · log var_i), so that the overall pruning ratio equals pr.
Dynamically increase the pruning ratio during training: pr_t = pr_final · (1 − (1 − (t − t₀) / (nΔt))³); this schedule is sketched in code after the references below.
Michael Zhu et al., "To prune, or not to prune: exploring the efficacy of pruning for model compression," arXiv (2017).
Yihui He et al., "AMC: AutoML for Model Compression and Acceleration on Mobile Devices," arXiv (2018).
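A minimal sketch of the gradual pruning schedule above (the Zhu & Gupta polynomial ramp), assuming sparsity starts at 0 at step t₀ and reaches the final value after n pruning steps spaced Δt apart; the parameter values below are arbitrary examples.

```python
def target_sparsity(t, t0, n, dt, final_sparsity, initial_sparsity=0.0):
    """Polynomial sparsity schedule (Zhu & Gupta, 2017):
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt))**3,
    ramping from s_i at step t0 to s_f at step t0 + n*dt."""
    if t < t0:
        return initial_sparsity
    progress = min(1.0, (t - t0) / float(n * dt))
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Prune every dt=100 steps for n=10 rounds, starting at step 1000.
for step in range(1000, 2001, 250):
    print(step, round(target_sparsity(step, t0=1000, n=10, dt=100,
                                      final_sparsity=0.5), 3))
```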
12
Clustering: Non-uniform Quantization
Cluster n weights to k centroids (n >> k); use k-means for the initial clustering (a sketch follows the reference below). This enables weight compression, since only cluster indices plus a small codebook need to be stored. Centroids are updated during retraining. Sparsity preservation keeps pruned weights at zero through the clustering step.
Song Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," arXiv (2015).
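A sketch of the initial k-means clustering step using scikit-learn, assuming per-layer clustering of a single weight tensor; this is illustrative, not the toolchain described in the talk. During retraining the centroids would then be updated, e.g. by accumulating gradients per cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(weights, n_clusters=16):
    """Non-uniform quantization: snap each weight to one of k centroids.
    Storing only cluster indices + a small codebook enables compression."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    codebook = km.cluster_centers_.ravel()          # k fp32 centroids
    indices = km.labels_.reshape(weights.shape)     # per-weight cluster id
    return codebook[indices].astype(weights.dtype), codebook, indices

w = np.random.randn(128, 128).astype(np.float32)
w_clustered, codebook, idx = cluster_weights(w, n_clusters=16)
print(np.unique(w_clustered).size)   # at most 16 distinct values remain
```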
13
Uniform Quantization: Balancing Range vs. Resolution
Finding the optimal (min, max) for quantization: a range that is too wide gives large quantization error due to limited resolution, while a range that is too narrow gives large quantization error due to saturation, since values outside (xmin, xmax) are clamped. Goal: find (xmin_opt, xmax_opt) that minimizes the quantization error. Solution: use the signal-to-quantization-noise ratio (SQNR) as the metric for choosing optimal quantization ranges.
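A small NumPy sketch of the idea: sweep symmetric clipping ranges, quantize to 8 bits, and keep the range that maximizes SQNR. The sweep granularity and the symmetric-range assumption are simplifications for illustration.

```python
import numpy as np

def sqnr_db(x, x_hat):
    """Signal-to-quantization-noise ratio in dB."""
    noise = np.mean((x - x_hat) ** 2) + 1e-12
    return 10.0 * np.log10(np.mean(x ** 2) / noise)

def quantize(x, xmin, xmax, num_bits=8):
    """Affine quantization with clamping (saturation) at [xmin, xmax]."""
    scale = (xmax - xmin) / (2 ** num_bits - 1)
    q = np.round((np.clip(x, xmin, xmax) - xmin) / scale)
    return q * scale + xmin

def best_range(x, num_bits=8, steps=50):
    """Shrink the clip symmetrically; keep the clip that maximizes SQNR."""
    amax = np.max(np.abs(x))
    candidates = [amax * (1.0 - i / steps) for i in range(steps)]
    return max(candidates, key=lambda c: sqnr_db(x, quantize(x, -c, c, num_bits)))

x = np.random.laplace(scale=1.0, size=100000)   # long-tailed activations
clip = best_range(x)
print(clip, np.max(np.abs(x)))   # the optimal clip is tighter than the raw max
```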
14
Quantization Workflow
1. Start from a trained floating-point model; parse the model definition and get the node names (weights, biases, activations).
2. Weight quantization: find optimal quantization ranges for the weights and evaluate the model with quantized weights.
3. Collect statistics of the activations: insert histogram ops, gather histogram stats, and find optimal quantization thresholds for the activations.
4. Quantize and evaluate the weight/activation-quantized model to obtain the quantized model.
5. Optional fine-tuning: if training data is available, fine-tune to produce the final fine-tuned quantized model.
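A minimal post-training sketch of steps 2-4 of this workflow for a single dense layer, using per-tensor min/max for the weights and percentile clipping for the activations in place of the histogram/SQNR search described earlier; all names and data here are illustrative.

```python
import numpy as np

# Minimal post-training quantization flow for one dense layer (illustrative).
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128)).astype(np.float32)       # trained fp32 weights
calib = rng.normal(size=(512, 256)).astype(np.float32)   # calibration inputs

def affine_quant(x, xmin, xmax, num_bits=8):
    scale = (xmax - xmin) / (2 ** num_bits - 1)
    q = np.clip(np.round((x - xmin) / scale), 0, 2 ** num_bits - 1)
    return q * scale + xmin                               # de-quantized view

# 1) Weight quantization: per-tensor min/max is often good enough for weights.
W_q = affine_quant(W, W.min(), W.max())

# 2) Collect activation statistics on calibration data, then pick ranges.
acts = calib @ W                                          # fp32 activations
lo, hi = np.percentile(acts, [0.1, 99.9])                 # clip the long tails
acts_q = affine_quant(acts, lo, hi)

# 3) Evaluate the quantized tensors against the float reference.
print("weight MSE:", float(np.mean((W - W_q) ** 2)))
print("activation MSE:", float(np.mean((acts - acts_q) ** 2)))
```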
15
Quantizing Nodes
A quantization (Q) node (tf.fake_quant) is inserted on the output y of each Conv/FC layer with weights W and bias b. Fused layers: Conv/FC + BiasAdd + Batch-Norm + ReLU are treated as one block, with a single Q node on the fused output. Concat layers: inputs x1, x2, x3 are concatenated and one Q node is placed on the concatenated output y. Residual add layers: the element-wise add of x1 and x2 gets one Q node on its output y. Quantize fused layers wherever possible, as this improves accuracy; the exact placement depends on the software/hardware on which the model will be deployed.
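A minimal TensorFlow 2.x sketch of placing one fake-quant node after a float Conv + BiasAdd + ReLU block, using tf.quantization.fake_quant_with_min_max_args; the (0, 6) range and the tensor shapes are arbitrary examples, and real placement depends on the deployment target as noted above.

```python
import tensorflow as tf

x = tf.random.normal([1, 8, 8, 16])
w = tf.random.normal([3, 3, 16, 32])
b = tf.zeros([32])

# Conv + BiasAdd + ReLU computed in float...
y = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, w, strides=1, padding="SAME"), b))

# ...followed by one fake-quant node that simulates 8-bit quantization of the
# whole fused Conv-BiasAdd-ReLU block (one Q node per fused block).
y_q = tf.quantization.fake_quant_with_min_max_args(y, min=0.0, max=6.0, num_bits=8)
print(float(tf.reduce_max(tf.abs(y - y_q))))   # per-element quantization error
```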
16
Equalization: Making Models Easier to Quantize
Min/max ranges vary widely across the different feature maps (channels) within a layer. Equalization makes the min/max ranges roughly equal across all feature maps in a layer: scale the current layer's weights per channel and un-scale the downstream layer's weights correspondingly, so the overall function is preserved. The slide's charts compare MobileNet-v2 accuracy versus quantization bit-width and the per-channel weight ranges before and after equalization.
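A hedged NumPy sketch of one common formulation of cross-layer equalization (in the spirit of Nagel et al., 2019) for a dense Layer1 → ReLU → Layer2 pair; the scale choice s_i = sqrt(r1_i·r2_i)/r2_i follows that paper and is not necessarily the exact method used in this work.

```python
import numpy as np

def equalize(W1, b1, W2):
    """Cross-layer equalization for Layer1 -> ReLU -> Layer2 (dense layout:
    W1 is [out1, in1], W2 is [out2, out1]). Row i of W1 is scaled down by s_i
    and column i of W2 scaled up by s_i; ReLU's positive-scaling equivariance
    keeps the composed function unchanged."""
    r1 = np.max(np.abs(W1), axis=1)           # per-output-channel range of W1
    r2 = np.max(np.abs(W2), axis=0)           # per-input-channel range of W2
    s = np.sqrt(r1 * r2) / (r2 + 1e-12)
    return W1 / s[:, None], b1 / s, W2 * s[None, :]

rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 64)) * rng.uniform(0.1, 5.0, size=(32, 1))  # uneven ranges
b1 = rng.normal(size=32)
W2 = rng.normal(size=(16, 32))
W1_eq, b1_eq, W2_eq = equalize(W1, b1, W2)

x = rng.normal(size=(8, 64))
y_ref = np.maximum(x @ W1.T + b1, 0) @ W2.T
y_eq  = np.maximum(x @ W1_eq.T + b1_eq, 0) @ W2_eq.T
print(np.max(np.abs(y_ref - y_eq)))           # ~0: network function preserved

r1, r1_eq = np.max(np.abs(W1), axis=1), np.max(np.abs(W1_eq), axis=1)
print(r1.max() / r1.min(), r1_eq.max() / r1_eq.min())  # channel spread shrinks
```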
17
Collaborative Optimizations
Pruning takes the original fp32 weight distribution to 50% sparsity. Clustering into 16 clusters preserves the 50% sparsity and leaves 16 unique fp32 values. Quantization then maps these to 16 unique uint8 values while still preserving the 50% sparsity. This cascade is where things can go wrong, and it exposes disadvantages of the stock TF APIs: applied naively, a later step can destroy the structure created by an earlier one.
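One simple way to make clustering sparsity-preserving is to cluster only the non-zero weights and pin the pruned zeros to a dedicated zero "cluster". The sketch below does this with scikit-learn k-means; it is illustrative only and is not the TF API whose drawbacks are mentioned on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

def sparsity_preserving_cluster(weights, n_clusters=16):
    """Cluster only non-zero weights; pruned zeros stay exactly zero, so the
    sparsity produced by pruning survives the clustering step."""
    mask = weights != 0
    nonzero = weights[mask].reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters - 1, n_init=10, random_state=0).fit(nonzero)
    out = np.zeros_like(weights)
    out[mask] = km.cluster_centers_.ravel()[km.labels_]
    return out

w = np.random.randn(128, 128).astype(np.float32)
w[np.abs(w) < np.percentile(np.abs(w), 50)] = 0.0      # 50% magnitude pruning
wc = sparsity_preserving_cluster(w, n_clusters=16)
print((wc == 0).mean(), np.unique(wc).size)            # sparsity kept, <=16 values
```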
18
Sensitivity-based Mixed-Precision Quantization
Find the lowest bit-width per layer that does not cause a significant accuracy drop, considering the cascaded effect of quantization error from layer to layer. Start from the largest layer, so the layers holding the most weights are compressed the most. Examples for MobileNet V2: MobilenetV2/Logits/Conv2d_1c_1x1 tolerates 3 bits, MobilenetV2/expanded_conv_16/project needs 5 bits, and MobilenetV2/Conv_1 tolerates 3 bits. Overall, MobileNet V2 reaches an effective bit-width of 4.5 bits with a 2% accuracy drop without retraining; fine-tuning recovers 1.5% of that accuracy.
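A toy sketch of the greedy, sensitivity-based search: layers are visited largest-first, each layer's bit-width is lowered until a proxy accuracy drop exceeds 2%, and earlier choices stay in place so the cascaded quantization error is included. The three-layer MLP, random probe data, and argmax-agreement proxy are all stand-ins for the real network and validation set.

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy 3-layer MLP stands in for the real network; "accuracy" is proxied by
# agreement with the float model on random probe data (illustrative only).
Ws = [rng.normal(size=(64, 32)), rng.normal(size=(64, 64)), rng.normal(size=(10, 64))]
X = rng.normal(size=(256, 32))

def quantize(w, bits):
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.clip(np.round((w - lo) / scale), 0, 2 ** bits - 1) * scale + lo

def forward(weights, x):
    for W in weights[:-1]:
        x = np.maximum(x @ W.T, 0)
    return x @ weights[-1].T

ref = forward(Ws, X)
def proxy_accuracy(weights):                     # fraction of matching argmax
    return float(np.mean(forward(weights, X).argmax(1) == ref.argmax(1)))

# Greedy search, largest layer first, keeping earlier choices in place so the
# cascaded quantization error is accounted for.
chosen = list(Ws)
order = sorted(range(len(Ws)), key=lambda i: -Ws[i].size)
bitwidths = {}
for i in order:
    for bits in (8, 6, 5, 4, 3, 2):
        trial = list(chosen)
        trial[i] = quantize(Ws[i], bits)
        if proxy_accuracy(trial) >= 0.98:        # <=2% drop allowed
            bitwidths[i], chosen[i] = bits, trial[i]
        else:
            break
print(bitwidths)
```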
19
Summary: Model Optimizations (Pruning, Clustering, Quantization)
Critical for hardware and algorithm exploration and verification, but with a huge optimization space and multi-day trials:
- Size of the optimization space = S_p × S_c × S_q × …
- Learning rate and schedule further expand the hyperparameter search space
- Duration of trials = T_p + T_c + T_q + …
Collaborative model optimizations address this: sparsity-preserving clustering, and sparsity- and cluster-preserving quantization.
On-Device ML solutions = Model Optimization + Software + Hardware