Differentiable Neural Architecture Search: Promises, Challenges, and Our Solutions
Speaker: Lingxi Xie (谢凌曦)
Noah's Ark Lab, Huawei Inc. (华为诺亚方舟实验室)
Slides available at my homepage (TALKS)
Outline
Neural Architecture Search: what and why?
Popular pipelines of Neural Architecture Search
Why Differentiable Neural Architecture Search?
Challenges: Over-fitting Issues in Super-network Optimization
Progressive DARTS: Bridging the Gap between Search and Evaluation
Partially-Connected DARTS: Regularization and Normalization
Stabilizing DARTS with Amended Optimization: Delving into Mathematics
Conclusions and future directions
Take-Home Messages
There are two major challenges in computer vision, and in AI more broadly:
Knowledge representation: how to build a mathematical model that covers real data?
Model optimization: how to train a complex model with a limited amount of data?
State-of-the-art solutions?
Knowledge representation: hierarchical features and their interaction, as pointed out by David Marr in the 1970s; however, there is still no effective way of representing common sense
Model optimization: training deep neural networks; one step further is AutoML
Automated Machine Learning (AutoML) is the future
Deep learning makes feature learning automatic; AutoML makes deep learning automatic
Neural Architecture Search (NAS) is an important subtopic of AutoML
Fundamental Ideology of AutoML
One step further in discarding human expertise and trusting the training data
Background: the big data era
Knowing this principle helps to judge which work is good for AutoML
Very promising for industry, in particular for companies like Huawei
Engineers often do not have that much expertise
Yet they need to develop many different models in a short period of time
Is this the correct direction for academia?
Yes, but we need to keep reminding ourselves of the other problem of AI: knowledge representation
Today, we focus on NAS
Introduction: Neural Architecture Search
Neural Architecture Search (NAS): instead of manually designing network architectures (e.g., AlexNet, VGGNet, GoogLeNet, ResNet, DenseNet, etc.), exploring the possibility of discovering new architectures with automatic algorithms
A brief timeline of NAS
2016: the first NAS paper submitted (published at ICLR'17)
2017: NAS requires thousands of GPU-days to surpass human-designed architectures
2018: NAS becomes approachable to everyone (a few GPU-hours)
2019: researchers find several critical issues of NAS, in particular differentiable NAS
[Krizhevsky, 2012] A. Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012.
[Simonyan, 2015] K. Simonyan et al., Very Deep Convolutional Networks for Large-scale Image Recognition, ICLR, 2015.
[Szegedy, 2015] C. Szegedy et al., Going Deeper with Convolutions, CVPR, 2015.
[He, 2016] K. He et al., Deep Residual Learning for Image Recognition, CVPR, 2016.
[Huang, 2017] G. Huang et al., Densely Connected Convolutional Networks, CVPR, 2017.
[Zoph, 2017] B. Zoph et al., Neural Architecture Search with Reinforcement Learning, ICLR, 2017.
Human-designed & Searched Architectures
From human-designed to automatically searched architectures: things change very fast!
[He, 2016] K. He et al., Deep Residual Learning for Image Recognition, CVPR, 2016.
[Xie, 2017] L. Xie et al., Genetic CNN, ICCV, 2017.
[Zoph, 2018] B. Zoph et al., Learning Transferable Architectures for Scalable Image Recognition, CVPR, 2018.
[Pham, 2018] H. Pham et al., Efficient Neural Architecture Search via Parameter Sharing, ICML, 2018.
[Bi, 2019] K. Bi et al., Stabilizing DARTS with Amended Gradient Estimation on Architectural Parameters, submitted, 2019.
Framework: Trial and Update
Almost all NAS algorithms follow the "trial and update" framework (a minimal sketch follows below):
Starting with a set of initial architectures (e.g., manually defined) as individuals
Assuming that better architectures can be obtained by slight modifications
Applying different operations to the existing architectures
Preserving the high-quality individuals and updating the individual pool
Iterating until the end
Three fundamental requirements
The search space: defining the possible architectures (depth, layers, operators, etc.)
The search method: defining the transitions between individuals
The evaluation method: determining whether a generated individual is of high quality
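To make the framework concrete, below is a minimal, hypothetical sketch of the trial-and-update loop in Python. The callables random_architecture, mutate, and evaluate stand in for a concrete search space, search method, and evaluation method; they are illustrative assumptions, not part of any specific NAS codebase.

```python
# A minimal, hypothetical sketch of the generic trial-and-update loop.
# `random_architecture`, `mutate`, and `evaluate` are illustrative stand-ins for a
# concrete search space, search method, and evaluation method.
def trial_and_update(random_architecture, mutate, evaluate,
                     pool_size=20, num_rounds=100):
    # Start from a pool of initial individuals (e.g., manually defined or random).
    pool = [(arch, evaluate(arch))
            for arch in (random_architecture() for _ in range(pool_size))]

    for _ in range(num_rounds):
        # Assume better architectures arise from slight modifications of good ones.
        parent, _ = max(pool, key=lambda item: item[1])
        child = mutate(parent)          # the search method defines this transition
        child_score = evaluate(child)   # the evaluation method scores the new individual

        # Preserve high-quality individuals and update the pool.
        worst_idx = min(range(len(pool)), key=lambda i: pool[i][1])
        if child_score > pool[worst_idx][1]:
            pool[worst_idx] = (child, child_score)

    # Return the best architecture found so far.
    return max(pool, key=lambda item: item[1])[0]
```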
Framework: Search Space
The search space determines the complexity of the search, as well as its challenges
Ideally, it should be large enough to contain all manually designed operators, but…
Question 1: ideally, the space should be open, but currently most spaces are closed
Question 2: how fine is the granularity? Genetic CNN: only 3×3 convolution; others: a fixed set of operations
Question 3: how complex can the inter-layer connections be? Usually a fixed subset of all connections
[Xie, 2017] L. Xie et al., Genetic CNN, ICCV, 2017.
[Zoph, 2018] B. Zoph et al., Learning Transferable Architectures for Scalable Image Recognition, CVPR, 2018.
[Liu, 2018] C. Liu et al., Progressive Neural Architecture Search, ECCV, 2018.
[Pham, 2018] H. Pham et al., Efficient Neural Architecture Search via Parameter Sharing, ICML, 2018.
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
Framework: Search Method
Finding new individuals that have the potential to work better
Heuristic search in a large space
Two widely used methods: the genetic algorithm and reinforcement learning
Both are heuristic algorithms designed for scenarios with a large search space and a limited ability to explore every single element in the space
A fundamental assumption: both heuristics can preserve good genes and, based on them, discover possible improvements
It is also possible to integrate architecture search into network optimization
These algorithms are often much faster
[Real, 2017] E. Real et al., Large-Scale Evolution of Image Classifiers, ICML, 2017.
[Xie, 2017] L. Xie et al., Genetic CNN, ICCV, 2017.
[Zoph, 2018] B. Zoph et al., Learning Transferable Architectures for Scalable Image Recognition, CVPR, 2018.
[Liu, 2018] C. Liu et al., Progressive Neural Architecture Search, ECCV, 2018.
[Pham, 2018] H. Pham et al., Efficient Neural Architecture Search via Parameter Sharing, ICML, 2018.
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
Framework: Evaluation Method
Evaluation determines which individuals are good and should be preserved
Conventionally, this was done by training each network from scratch
This is extremely time-consuming, so researchers often run NAS on a small dataset like CIFAR and then transfer the found architecture to larger datasets like ImageNet
Even so, training is very slow: Genetic-CNN requires 17 GPU-days for a single training process, and NAS-RL requires more than 20,000 GPU-days
Efficient weight-sharing methods were proposed later (a sketch of the idea follows below)
Ideas include parameter sharing (no need to re-train everything for each new individual) and differentiable architectures (joint optimization)
Now an efficient search on CIFAR can be reduced to a few GPU-hours, though training the searched architecture on ImageNet is still time-consuming
[Xie, 2017] L. Xie et al., Genetic CNN, ICCV, 2017.
[Zoph, 2017] B. Zoph et al., Neural Architecture Search with Reinforcement Learning, ICLR, 2017.
[Pham, 2018] H. Pham et al., Efficient Neural Architecture Search via Parameter Sharing, ICML, 2018.
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
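As an illustration of the weight-sharing idea (our sketch, not the ENAS or DARTS implementation), all candidate operations can live inside one shared super-network, so evaluating a new individual only requires picking operations rather than training from scratch; the uniform random sampling below is an assumption made for brevity.

```python
import random
import torch.nn as nn

class SharedEdge(nn.Module):
    """One edge of a shared super-network: its candidate operations (and their
    weights) are reused by every architecture sampled during the search."""

    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)  # shared weights across all sampled architectures

    def forward(self, x, op_index=None):
        if op_index is None:
            # Sample one candidate operation for this trial; a learned controller
            # (ENAS) or a continuous relaxation (DARTS) refines this basic idea.
            op_index = random.randrange(len(self.ops))
        return self.ops[op_index](x)
```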
State-of-the-Art Search Methods
Method 1: discrete optimization (individual search and evaluation)
Pros: the search pipeline is relatively flexible; multi-target optimization is easier to achieve; search and evaluation are decoupled and thus more stable
Cons: the search process is time-consuming (speed-accuracy tradeoff)
Method 2: continuous optimization (joint search and evaluation)
Pros: the search process is computationally efficient; the search space can be complex; the overall pipeline is closer to the design nature of NAS
Cons: the stability of exploring a large space is still unsatisfactory
[Tan, 2019] M. Tan et al., EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML, 2019.
[Xie, 2019] S. Xie et al., Exploring Randomly Wired Neural Networks for Image Recognition, ICCV, 2019.
[Pham, 2018] H. Pham et al., Efficient Neural Architecture Search via Parameter Sharing, ICML, 2018.
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
An Example of Discrete Optimization
EfficientNet: a conservative "NAS" algorithm (see the compound-scaling sketch below)
Based on MobileNet cells, only allowing rescaling to be "searched"
Using reinforcement learning for optimization
State-of-the-art top-1 accuracy (84.4%) on ImageNet
[Tan, 2019] M. Tan et al., EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML, 2019.
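For reference, the "rescaling" in EfficientNet is the compound-scaling rule: depth, width, and input resolution are scaled jointly by one coefficient. A small sketch is given below; the constants are the values reported in the EfficientNet paper, while the function name and the rounding are illustrative choices of ours.

```python
# Compound scaling as described in the EfficientNet paper: alpha, beta, gamma were
# grid-searched under the constraint alpha * beta^2 * gamma^2 ~= 2, and a single
# coefficient phi scales the whole network. Rounding here is only illustrative.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution multipliers

def compound_scale(base_depth, base_width, base_resolution, phi):
    depth = round(base_depth * ALPHA ** phi)              # layers per stage
    width = round(base_width * BETA ** phi)               # channels per layer
    resolution = round(base_resolution * GAMMA ** phi)    # input image size
    return depth, width, resolution

# Example: scaling up a B0-like baseline stage with phi = 3.
print(compound_scale(base_depth=2, base_width=16, base_resolution=224, phi=3))
```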
We focus on Differentiable NAS
We believe this is the right direction for developing NAS algorithms
An example: DARTS vs. EfficientNet
However, differentiable search (in particular, DARTS) has a lot of problems
In what follows, we discuss our solutions to some of them
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
Mathematics of DARTS
Constructing a super-network $f(\mathbf{x}; \boldsymbol{\omega}, \boldsymbol{\alpha})$
$\mathbf{x}$: input data (e.g., an image); $\boldsymbol{\omega}$: network parameters; $\boldsymbol{\alpha}$: architectural parameters
Each cell is a graph of $N$ nodes; each edge $(i,j)$ performs a mixed operation
$$\mathbf{y}_{i,j}(\mathbf{x}_i) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)} \cdot o(\mathbf{x}_i)$$
$o(\mathbf{x}_i)$ applies a specific operation to $\mathbf{x}_i$, where $o$ is chosen from a pre-defined set $\mathcal{O}$
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
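A minimal PyTorch sketch of this mixed operation is shown below. The candidate set and channel handling are illustrative: the actual DARTS operation set also contains separable/dilated convolutions and the none (zero) operator, and the alphas are stored at the edge level rather than inside the module as done here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def candidate_ops(channels):
    # An illustrative candidate set O; real DARTS uses separable/dilated convolutions,
    # pooling, skip-connect, and a "none" (zero) operator.
    return nn.ModuleList([
        nn.Identity(),                                             # skip-connect
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),   # 3x3 convolution
        nn.Conv2d(channels, channels, 5, padding=2, bias=False),   # 5x5 convolution
        nn.MaxPool2d(3, stride=1, padding=1),                      # 3x3 max pooling
    ])

class MixedOp(nn.Module):
    """One edge (i, j): a softmax-weighted sum of all candidate operations."""

    def __init__(self, channels):
        super().__init__()
        self.ops = candidate_ops(channels)
        # Architectural parameters alpha_o^{(i,j)}, one per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)  # the softmax in the formula above
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Usage: y = MixedOp(channels=16)(torch.randn(2, 16, 32, 32))
```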
Mathematics of DARTS (cont'd)
Optimization of $f(\mathbf{x}; \boldsymbol{\omega}, \boldsymbol{\alpha})$
$\boldsymbol{\omega}$ and $\boldsymbol{\alpha}$ start from $\boldsymbol{\omega}_0$ and $\boldsymbol{\alpha}_0$, and get updated alternately
In the $t$-th epoch, $\boldsymbol{\alpha}$ is first fixed to $\boldsymbol{\alpha} = \boldsymbol{\alpha}_t$ and $\boldsymbol{\omega}$ is updated from $\boldsymbol{\omega}_t$ to $\boldsymbol{\omega}_{t+1}$; then $\boldsymbol{\omega}$ is fixed to $\boldsymbol{\omega} = \boldsymbol{\omega}_{t+1}$ and $\boldsymbol{\alpha}$ is updated from $\boldsymbol{\alpha}_t$ to $\boldsymbol{\alpha}_{t+1}$
There is a critical issue here, which we will cover in our latest work
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
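The alternating schedule can be sketched as follows; this is a first-order sketch under the assumption that w_optimizer and alpha_optimizer hold the network weights ω and the architectural parameters α respectively, and it omits the second-order update of the original DARTS.

```python
# A minimal first-order sketch of the alternating updates described above.
# `model` is the super-network; `w_optimizer` updates omega (network weights) and
# `alpha_optimizer` updates alpha (architectural parameters); all are assumptions.
def search_epoch(model, train_loader, val_loader, loss_fn,
                 w_optimizer, alpha_optimizer):
    for (x_train, y_train), (x_val, y_val) in zip(train_loader, val_loader):
        # Step 1: fix alpha = alpha_t, update omega_t -> omega_{t+1} on training data.
        w_optimizer.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        w_optimizer.step()

        # Step 2: fix omega = omega_{t+1}, update alpha_t -> alpha_{t+1} on validation data.
        alpha_optimizer.zero_grad()
        loss_fn(model(x_val), y_val).backward()
        alpha_optimizer.step()
```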
Mathematics of DARTS (cont'd)
After $f(\mathbf{x}; \boldsymbol{\omega}, \boldsymbol{\alpha})$ has been well trained
The final network is composed of the operators with the most significant weights; the other operators are "pruned"
Each node is allowed to connect to two nodes with lower indices
A special case: the none operator
It is part of every edge of the super-network, but it cannot be chosen as the most significant operator
Role 1: numerical stability
Role 2: latent edge selection
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
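A minimal sketch of this pruning step is given below (our illustration, not the released DARTS code). Here alphas is assumed to map each edge (i, j) to its vector of architectural parameters, with index none_idx for the none operator; the handling of the cell's input nodes is omitted.

```python
import torch

def discretize(alphas, num_nodes, none_idx=0):
    """Prune a trained super-network cell: on each edge keep the most significant
    non-none operator, then keep the two strongest incoming edges per node."""
    genotype = {}
    for node in range(1, num_nodes):
        edges = []
        for prev in range(node):  # candidate predecessors with lower indices
            weights = torch.softmax(alphas[(prev, node)].detach(), dim=0)
            weights[none_idx] = -1.0  # the none operator cannot be selected
            best_op = int(torch.argmax(weights))
            edges.append((float(weights[best_op]), prev, best_op))
        # Each node keeps its two strongest incoming edges.
        for strength, prev, best_op in sorted(edges, reverse=True)[:2]:
            genotype[(prev, node)] = best_op
    return genotype
```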
Challenges of Differentiable NAS
Instability: the devil behind all differentiable search methods
Phenomenon 1: the searched architecture sometimes performs well, sometimes not
Phenomenon 2: the search method can get lost in a very large search space (e.g., producing even worse results than randomly generated networks)
Phenomenon 3: when the super-network is optimized until convergence, the obtained architecture can be ridiculously bad
The essence of instability: "over-fitting" the super-network
The best-optimized super-network does not necessarily lead to the best architecture
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
[Liang, 2019] H. Liang et al., DARTS+: Improved Differentiable Architecture Search with Early Stopping, arXiv preprint, 2019.
[Anonymous, 2019] Anonymous Author(s), Understanding and Robustifying Differentiable Architecture Search, submitted, 2019.
Gaps between Search and Evaluation
Architectural gap: is the ultimate goal to find the best super-network?
Operators in the super-network are coupled; what happens after they are pruned?
Other designs also bring in large gaps (e.g., the none operator)
Hyper-parameter gap
Depth/width gap: do shallow/narrow architectures generalize to deep/wide ones?
Training policy matters: regularization, normalization, data augmentation, etc.
Our work
P-DARTS: starting with the depth gap, producing a preliminary solution
PC-DARTS: alleviating the "over-fitting" ability of the super-network
Stabilizing DARTS: how to alleviate this gap mathematically?
P-DARTS: Overview
We start with the drawbacks of DARTS
There is a depth gap between search and evaluation
The search process is not stable: multiple runs, different results
The search process does not transfer well: it only works on CIFAR10
We propose a new approach named Progressive DARTS
A multi-stage search process which gradually increases the search depth
Two useful techniques: search space approximation and search space regularization
We obtain nice results
SOTA accuracy with the searched networks on CIFAR10/CIFAR100 and ImageNet
Search cost as small as 0.3 GPU-days (one single GPU, 7 hours)
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
[Chen, 2019] X. Chen et al., Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation, ICCV, 2019.
P-DARTS: Motivation
The depth gap and why it is important
[Figure: DARTS searches with a shallow (8-cell) network but evaluates a deep (20-cell) one, giving 2.83% test error on CIFAR10; P-DARTS progressively increases the search depth toward the evaluation depth, giving 2.55% test error on CIFAR10.]
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
[Chen, 2019] X. Chen et al., Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation, ICCV, 2019.
P-DARTS: Search Space Approximation
The progressive way of increasing search depth
P-DARTS: Search Space Regularization
Problem: the strange behavior of skip-connect
Searching on a deep network leads to many skip-connect operations (and poor results)
Reasons? On the one hand, skip-connect often leads to the fastest gradient descent; on the other hand, skip-connect has no parameters and thus leads to bad results
Solution: regularization (a minimal sketch follows below)
Adding a Dropout after each skip-connect, decaying the rate during search
Preserving a fixed number of skip-connects after the entire search
Results:
Dropout on skip-connect | Test error, 2 SC | Test error, 3 SC | Test error, 4 SC
with Dropout            | 2.93%            | 3.28%            | 3.51%
without Dropout         | 2.69%            | 2.84%            | 2.97%
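As a sketch of the search space regularization above (ours, assuming an initial drop rate and a linear decay schedule; the paper's exact schedule may differ), a Dropout layer follows every skip-connect in the super-network and its rate is decayed as the search proceeds:

```python
import torch.nn as nn

class RegularizedSkip(nn.Module):
    """A skip-connect followed by Dropout, used only during the search phase."""

    def __init__(self, drop_rate=0.3):
        super().__init__()
        self.initial_rate = drop_rate
        self.drop = nn.Dropout(p=drop_rate)

    def forward(self, x):
        return self.drop(x)  # identity path, partially dropped during search

    def decay(self, epoch, total_epochs):
        # Decay the drop rate so that late in the search the skip path is barely
        # penalized and genuinely useful skip-connects can survive.
        self.drop.p = self.initial_rate * (1.0 - epoch / total_epochs)
```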
P-DARTS: CIFAR10/100 Experiments
CIFAR10 and CIFAR100 (with Cutout as default data augmentation)
P-DARTS: ImageNet Experiments
ImageNet (ILSVRC2012, under the mobile setting)
P-DARTS: Searched Cells
Searched architectures (verification of depth gap!)
P-DARTS: Summary
The depth gap needs to be solved
Networks of different depths have different properties
Depth is still the key issue in deep learning
Our approach
State-of-the-art results on both CIFAR10/100 and ImageNet
Search cost as small as 0.3 GPU-days
Future directions
Directly searching on ImageNet
PC-DARTS: Overview
We still build our approach upon DARTS
We propose a new approach named Partially-Connected DARTS
An alternative approach to deal with the over-fitting issue of DARTS
Using partial channel connection as regularization
The method is stable enough to be executed directly on ImageNet
We obtain nice results
SOTA accuracy with the searched networks on ImageNet
Search cost as small as 0.06 GPU-days (one single GPU, 1.5 hours) on CIFAR10/100, or 4 GPU-days (8 GPUs, 11.5 hours) on ImageNet
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
[Xu, 2019] Y. Xu et al., PC-DARTS: Partial Channel Connections for Memory-Efficient Differentiable Architecture Search, arXiv preprint, 2019.
PC-DARTS: Motivation
There exists an "over-fitting" issue: the search process optimizes the super-network very well, but this does not imply that the pruned network is equally good
This is fundamentally the gap between search and evaluation
The original search space uses a none operator for edge-level selection
This enlarges the gap between search and evaluation (one operator, two functions)
During search, these two functions can couple with each other and confuse the optimization process
PC-DARTS: Overall Pipeline
Partial channel connection and edge normalization
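A minimal sketch of the two components is given below (our illustration under simplified assumptions; the paper additionally shuffles channels between steps): a fraction 1/K of the channels is sent through the mixed operation and the rest bypass it, while edge normalization introduces edge-level parameters beta used when a node aggregates its inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialMixedOp(nn.Module):
    """Partial channel connection: only C/K channels go through the candidate
    operations; the remaining channels bypass them and are concatenated back."""

    def __init__(self, channels, ops, k=4):
        super().__init__()
        self.active = channels // k    # number of channels sent through the ops
        self.ops = nn.ModuleList(ops)  # candidate ops built for `channels // k` channels
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(ops)))

    def forward(self, x):
        x_active, x_bypass = x[:, :self.active], x[:, self.active:]
        weights = F.softmax(self.alpha, dim=0)
        y = sum(w * op(x_active) for w, op in zip(weights, self.ops))
        return torch.cat([y, x_bypass], dim=1)

def aggregate_node(edge_outputs, beta):
    # Edge normalization: weight each incoming edge by a softmax over beta, so that
    # edge selection no longer relies on the "none" operator.
    edge_weights = F.softmax(beta, dim=0)
    return sum(w * out for w, out in zip(edge_weights, edge_outputs))
```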
PC-DARTS: Side Benefits
Partial channel connection Computationally efficient: only 1/𝐾 time and memory is required In addition, each mini-batch can be 𝐾× larger, which contributes to search stability Edge normalization The none operator does not need to exist during search With partial channel connection (which brings more randomness and uncertainty), edge normalization contributes to stability Even without partial channel connection, edge normalization is a useful technique to achieve more stable search results
PC-DARTS: Speed-Accuracy Tradeoff
The parameter 𝐾 (folds of channels) controls search speed and accuracy A smaller 𝐾: weaker regularization, lower speed-up, higher search accuracy A larger 𝐾: stronger regularization, higher speed-up, lower search accuracy A proper 𝐾 works best 𝐾=4 for CIFAR10, 𝐾=2 for ImageNet More complex data requires higher search accuracy (a smaller 𝐾)
PC-DARTS: CIFAR10 Experiments
CIFAR10 (with Cutout as default data augmentation)
PC-DARTS: ImageNet Experiments
ImageNet (ILSVRC2012, under the mobile setting)
PC-DARTS: Stability Tests
Test 1: a few individual runs The architectures searched by PC-DARTS produce a lower standard deviation (±0.07%), compared to DARTS-v1 (±0.15%) and DARTS-v2 (±0.21%) Test 2: different search epochs PC-DARTS is less impacted by the varying number of epochs Test 3: different nodes in each cell PC-DARTS produces consistently better results in different search spaces
PC-DARTS: Summary
Regularization is still a big issue
Partial channel connection in order to prevent over-fitting
Edge normalization in order to make partial channel connection more stable
Our approach
State-of-the-art results on ImageNet (direct search)
Search cost as small as 0.06 GPU-days on CIFAR10
Edge normalization is a useful technique for search stability
Future directions
Searching on a larger number of classes
Stabilizing DARTS: Overview
We deal with a critical issue that causes the instability of DARTS
DARTS, when trained for sufficiently long, can converge to dramatically bad results
We find that the main reason lies in the current way of DARTS optimization
The approximation of the architectural gradients is problematic
We present a solution with an approximation of guaranteed accuracy
It requires a similar computational overhead to the second-order version of DARTS
The amended optimization achieves satisfactory stability
It produces stable results when the search arrives at convergence
It enables very large search spaces to be explored
It closes the gap between search and evaluation in a formal manner
[Liu, 2019] H. Liu et al., DARTS: Differentiable Architecture Search, ICLR, 2019.
[Bi, 2019] K. Bi et al., Stabilizing DARTS with Amended Gradient Estimation on Architectural Parameters, submitted, 2019.
Stabilizing DARTS: DARTS Fails!
When DARTS is executed for a sufficiently long time (e.g., 200 epochs; the original version uses 50 epochs)
All preserved operators are skip-connect
The none operator occupies over 95% of the weights (and still increasing), with half of the remaining weights occupied by skip-connect
Previous work also reported this issue: P-DARTS, DARTS+
Aftermath: the foundation of DARTS is flawed
Convergence of the super-network implies very bad performance (we do not dare to let the search converge!)
All the tricks (early stopping, a fixed number of skip-connects, etc.) actually violate the ideology of AutoML
[Chen, 2019] X. Chen et al., Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation, ICCV, 2019.
[Liang, 2019] H. Liang et al., DARTS+: Improved Differentiable Architecture Search with Early Stopping, arXiv preprint, 2019.
Stabilizing DARTS: Mathematics
Recall: optimizing $f(\mathbf{x}; \boldsymbol{\omega}, \boldsymbol{\alpha})$ with loss $\mathcal{L}(\mathbf{x}; \boldsymbol{\omega}, \boldsymbol{\alpha}) \doteq \mathcal{L}(\boldsymbol{\omega}, \boldsymbol{\alpha})$
$\boldsymbol{\omega}$ and $\boldsymbol{\alpha}$ start from $\boldsymbol{\omega}_0$ and $\boldsymbol{\alpha}_0$, and get updated alternately
In the $t$-th epoch:
On the training set, $\boldsymbol{\alpha}$ is first fixed to $\boldsymbol{\alpha} = \boldsymbol{\alpha}_t$, and $\boldsymbol{\omega}$ is updated from $\boldsymbol{\omega}_t$ to $\boldsymbol{\omega}_{t+1}$, with the loss function denoted as $\mathcal{L}_{\mathrm{train}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$
On the validation set, $\boldsymbol{\omega}$ is fixed to $\boldsymbol{\omega} = \boldsymbol{\omega}_{t+1}$, and $\boldsymbol{\alpha}$ is updated from $\boldsymbol{\alpha}_t$ to $\boldsymbol{\alpha}_{t+1}$, with the loss function denoted as $\mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}_{t+1}} = \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}^\star(\boldsymbol{\alpha}), \boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$
Here, the assumption is that $\boldsymbol{\omega}_{t+1}$ has arrived at the optimum (i.e., $\mathcal{L}_{\mathrm{train}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$ is minimized under $\boldsymbol{\alpha} = \boldsymbol{\alpha}_t$), denoted as $\boldsymbol{\omega}_{t+1} = \boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t) = \boldsymbol{\omega}^\star(\boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$
It requires computing the gradient with respect to $\boldsymbol{\alpha}$ (the gradient with respect to $\boldsymbol{\omega}$ is easy):
$$\nabla_{\boldsymbol{\alpha}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}^\star(\boldsymbol{\alpha}), \boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t} = \nabla_{\boldsymbol{\alpha}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t} + \nabla_{\boldsymbol{\alpha}} \boldsymbol{\omega}^\star(\boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t} \cdot \nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$$
The two terms on the right-hand side are just the first-order and second-order terms in the original DARTS
Stabilizing DARTS: Mathematics (cont'd)
In the formula
$$\nabla_{\boldsymbol{\alpha}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}^\star(\boldsymbol{\alpha}), \boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t} = \nabla_{\boldsymbol{\alpha}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t} + \nabla_{\boldsymbol{\alpha}} \boldsymbol{\omega}^\star(\boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t} \cdot \nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t},$$
the term $\nabla_{\boldsymbol{\alpha}} \boldsymbol{\omega}^\star(\boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$ is the most difficult to compute, since $\boldsymbol{\omega}^\star(\boldsymbol{\alpha})$ is unknown
Estimating $\boldsymbol{\omega}^\star(\boldsymbol{\alpha})$ accurately at each $\boldsymbol{\alpha} = \boldsymbol{\alpha}_t$ is computationally intractable
It can be proved that even a very close approximation can lead to a large error
We make use of a straightforward property of $\boldsymbol{\omega}^\star(\boldsymbol{\alpha})$:
$$\nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}^\dagger), \boldsymbol{\alpha}=\boldsymbol{\alpha}^\dagger} \equiv \mathbf{0} \quad \text{for any } \boldsymbol{\alpha}^\dagger$$
Differentiating both sides and applying the chain rule at $\boldsymbol{\alpha} = \boldsymbol{\alpha}_t$, we have
$$\nabla_{\boldsymbol{\alpha},\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t} + \nabla_{\boldsymbol{\alpha}} \boldsymbol{\omega}^\star(\boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t} \cdot \nabla^2_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t} = \mathbf{0}$$
Using this equation, we can solve for $\nabla_{\boldsymbol{\alpha}} \boldsymbol{\omega}^\star(\boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$ without knowing $\boldsymbol{\omega}^\star(\boldsymbol{\alpha})$
Stabilizing DARTS: Mathematics (cont'd)
Let $\mathbf{H} = \nabla^2_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$, i.e., the Hessian matrix. Then
$$\nabla_{\boldsymbol{\alpha}} \boldsymbol{\omega}^\star(\boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t} = -\nabla_{\boldsymbol{\alpha},\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t} \cdot \mathbf{H}^{-1}$$
$$\nabla_{\boldsymbol{\alpha}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}^\star(\boldsymbol{\alpha}), \boldsymbol{\alpha})\big|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_t} = \nabla_{\boldsymbol{\alpha}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t} - \nabla_{\boldsymbol{\alpha},\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t} \cdot \mathbf{H}^{-1} \cdot \nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{val}}(\boldsymbol{\omega}, \boldsymbol{\alpha})\big|_{\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t), \boldsymbol{\alpha}=\boldsymbol{\alpha}_t}$$
Again, $\mathbf{H}^{-1}$ is difficult to compute (millions of parameters!)
Our solution is to approximate the second-order term directly with
$$\nabla_{\boldsymbol{\alpha},\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}} \cdot \mathbf{H}^{-1} \cdot \nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{val}} \approx \eta \cdot \nabla_{\boldsymbol{\alpha},\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}} \cdot \mathbf{H} \cdot \nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{val}}$$
(all gradients evaluated at $\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t)$, $\boldsymbol{\alpha}=\boldsymbol{\alpha}_t$), where $\eta$ is the amending coefficient
We can prove that the inner product of these two terms is not smaller than 0
Our approximation guarantees that each update of $\boldsymbol{\alpha}$ is roughly in the correct direction
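One possible way to compute the amended term with standard automatic differentiation is via Hessian-vector products, sketched below (our illustration, not the authors' released implementation). Here w_params and alpha_params are the weight and architectural parameters, train_loss and val_loss are scalar losses on the current batches, and eta is the amending coefficient; the caller subtracts the returned term from the gradient of the validation loss with respect to alpha, mirroring the exact formula above.

```python
import torch

def amended_alpha_term(train_loss, val_loss, w_params, alpha_params, eta=0.1):
    """Approximate grad_{alpha,omega} L_train . H . grad_omega L_val (times eta)
    using two backward passes through the gradient graph of L_train."""
    # v = grad_omega L_val, treated as a constant vector.
    v = [g.detach() for g in torch.autograd.grad(val_loss, w_params)]

    # g_w = grad_omega L_train, kept in the graph so it can be differentiated again.
    g_w = torch.autograd.grad(train_loss, w_params, create_graph=True)

    # u = H v: a Hessian-vector product, obtained by differentiating <g_w, v> w.r.t. omega.
    dot_gv = sum((g * vi).sum() for g, vi in zip(g_w, v))
    u = [h.detach() for h in
         torch.autograd.grad(dot_gv, w_params, retain_graph=True)]

    # grad_alpha <g_w, u> = grad_{alpha,omega} L_train . u, the amended second-order term.
    dot_gu = sum((g * ui).sum() for g, ui in zip(g_w, u))
    mixed = torch.autograd.grad(dot_gu, alpha_params)

    return [eta * m for m in mixed]
```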
Stabilizing DARTS: Comparison to DARTS
The first-order version of DARTS directly discards the second-order term:
$$\nabla_{\boldsymbol{\alpha},\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}} \cdot \mathbf{H}^{-1} \cdot \nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{val}} \approx \mathbf{0}$$
Large errors, reflected in the instability of first-order DARTS
The second-order version of DARTS is equivalent to using a unit matrix to approximate $\mathbf{H}^{-1}$ (we use $\mathbf{H}$):
$$\nabla_{\boldsymbol{\alpha},\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}} \cdot \mathbf{H}^{-1} \cdot \nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{val}} \approx \nabla_{\boldsymbol{\alpha},\boldsymbol{\omega}} \mathcal{L}_{\mathrm{train}} \cdot \mathbf{I} \cdot \nabla_{\boldsymbol{\omega}} \mathcal{L}_{\mathrm{val}}$$
(all gradients evaluated at $\boldsymbol{\omega}=\boldsymbol{\omega}^\star(\boldsymbol{\alpha}_t)$, $\boldsymbol{\alpha}=\boldsymbol{\alpha}_t$)
It is easy to show that this strategy cannot guarantee that the inner product of these two terms is not smaller than 0
Hence, large errors remain in second-order DARTS
Stabilizing DARTS: What We Have Done
The most important goal is to achieve consistency between search and re-training
In other words, we hope to make sure that good networks found in the search stage transfer to the re-training stage
Previously, this was not guaranteed because of the large error in gradient computation
The dramatically bad search results are mainly due to this reason
Our algorithm is only a slight modification away from the baseline!
Speed: similar to second-order DARTS, slower than first-order DARTS
Question: is this the only inconsistency between search and re-training?
Stabilizing DARTS: The Gaps of NAS
Gap 1: the "optimization gap"
The optimization goals differ between search and re-training
We have discussed this thoroughly above
Gap 2: the "architecture gap"
An example from DARTS: the architecture in search has 8 cells, but the architecture in re-training has 20 cells, and a part of the edges are removed after search
Why not use the same architecture? Previous efforts come from P-DARTS
Gap 3: the "hyper-parameter gap"
Base channels, learning rate, dropout, auxiliary loss, etc.
Why not consider this factor? It is not as important as the previous ones
[Chen, 2019] X. Chen et al., Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation, ICCV, 2019.
Stabilizing DARTS: Shrinking the Gaps
Gap 1: the "optimization gap" (the most critical one)
Alleviated with mathematics
Gap 2: the "architecture gap" (people noticed it, but did not solve it)
Using the same number of cells during search and re-training
Not allowing edge removal: a fixed edge connection
More elegant solutions are left for future work
Gap 3: the "hyper-parameter gap" (revealed in this work)
Using the same set of hyper-parameters: base channels, learning rate, dropout and auxiliary loss
We did observe accuracy drops with different hyper-parameters!
Stabilizing DARTS: Searched Architectures
Modified search space
Edge connection is fixed: each node $i$ is connected to nodes $i-1$ and $i-2$
The basic search space: 6 identical normal cells and 2 identical reduction cells (as in DARTS)
The enlarged search space: 18 normal cells and 2 reduction cells, all different
Even larger spaces? GPU memory is a problem, but we will try
Stabilizing DARTS: The Amending Coefficient
The effect of $\eta$: balancing the "fitting ratio" of the search stage
$\eta = 0$: same as DARTS
$\eta < 0.01$: too small
$\eta = 0.1$: about right
$\eta > 1$: too large
Stabilizing DARTS: CIFAR10 Experiments
CIFAR10 (with Cutout as default data augmentation)
Stabilizing DARTS: ImageNet Experiments
ImageNet (ILSVRC2012, under the mobile setting)
Stabilizing DARTS: Summary
Instability is a critical issue of differentiable NAS
Previous methods were mostly based on highly inaccurate optimization
Dramatically bad results can be produced when the search converges
Our approach
Amending the errors in gradient computation
Bridging a few gaps between search and re-training
Stable architectures found in much larger search spaces
Future directions
Continuing to improve the stability of differentiable NAS
Conclusions
AutoML is an emerging topic which paves the way for future AI
NAS is an important subfield of AutoML, which has attracted a lot of attention
Two main pipelines of NAS: heuristic search and differentiable optimization
Heuristic search is conservative: it works on small spaces and behaves stably
Differentiable search is promising, but still has serious issues to solve
The critical problem of differentiable search: over-fitting the super-network
P-DARTS: progressively reducing the depth gap between search and evaluation
PC-DARTS: using partial channel connection to mimic super-network pruning
Stabilizing DARTS: delving into the mathematics, closing the gap between optimizing the super-network and training the pruned network
Problems and Future Directions
Problem 1: which search method is better, discrete or continuous?
Discrete: the computational overhead seems unaffordable, in particular in a large space
Continuous: stability is still unsatisfactory, but progress has been made
Problem 2: what should the basic search unit be, a layer or a basic operator?
Smaller basic units imply a larger search space, which requires search stability
Smaller basic units may bring challenges to hardware design
Problem 3: how to apply the searched architectures to real-world scenarios?
Hardware: differentiable search methods are not yet friendly to constraints such as latency
Individual vision tasks: sharing the same backbone or not?
Thanks! Questions, please?