Incremental Training of Deep Convolutional Neural Networks

1 Incremental Training of Deep Convolutional Neural Networks
R. Istrate, A. C. I. Malossi, C. Bekas, and D. Nikolopoulos ArXiv: v1

2 Depth trade-off The depth of a deep neural network determines its capacity.
How do we choose the depth of our network? A deep network has high capacity, but it needs a lot of resources. A shallow network has limited capacity, but it converges quickly. Existing solution: grid search. Disadvantage: only late in the process do we learn that a network is not well suited for the dataset.

3 Methodology Consider a generic CNN 𝒩 composed of 𝑛 layers.
Partition 𝒩 into 𝐾 sub-networks 𝑆_𝑘, 𝑘 = 1, …, 𝐾, with 𝐾 ≤ 𝑛. Each sub-network contains learnable parameters; no sub-network is composed of only pooling and dropout layers.
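A minimal sketch of this partitioning, assuming PyTorch; the grouping rule (every layer with learnable parameters opens a new sub-network, and parameter-free layers such as pooling or dropout stay with the preceding one) is one illustration of the constraint above, not the paper's exact code.

```python
import torch.nn as nn

def partition_into_subnetworks(layers):
    """Group an ordered list of layers into sub-networks S_1, ..., S_K.

    Assumes the first layer has learnable parameters (true for typical CNNs),
    so every sub-network contains at least one learnable layer.
    """
    groups = []
    for layer in layers:
        if any(p.requires_grad for p in layer.parameters()):
            groups.append([layer])        # a learnable layer opens a new S_k
        else:
            groups[-1].append(layer)      # pooling/dropout stay with the previous S_k
    return [nn.Sequential(*group) for group in groups]
```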

4 Methodology The training process starts with sub-network 𝑆_1.
To determine the optimal time to insert the second sub-network 𝑆_2 between 𝑆_1 and the classifier, we compute the improvement in validation accuracy every window-size (ws) epochs. When the observed improvement falls below a fixed threshold, training of the current configuration is stopped and the network depth is increased by adding the next sub-network.
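A minimal sketch of the resulting incremental training loop. The helpers train_for_epochs and validation_accuracy are hypothetical placeholders, and should_grow implements the criterion of slide 6 (sketched after that slide); names and defaults are illustrative.

```python
def incremental_training(subnetworks, classifier, ws=5, gamma=0.5, max_epochs=200):
    """Grow the network sub-network by sub-network."""
    active = [subnetworks[0]]                  # start with S_1 + classifier
    history = []                               # one validation accuracy per epoch
    for epoch in range(max_epochs):
        train_for_epochs(active, classifier, epochs=1)
        history.append(validation_accuracy(active, classifier))
        if (epoch + 1) % ws == 0 and should_grow(history, ws, gamma):
            if len(active) == len(subnetworks):
                break                          # full depth reached: stop training
            # insert S_{k+1} between S_k and the classifier
            # (look-ahead initialization of the new sub-network: see slide 7)
            active.append(subnetworks[len(active)])
    return active
```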

5 Methodology

6 Criterion for ending training
Every ws epochs, compute the angle 𝛼 between the linear approximation of the last ws accuracy points and the x-axis. Training of the current configuration is stopped when 𝛼_𝑖 ≤ 𝛾 · 𝛼_{𝑖−1}, where 𝛾 is a predefined threshold and 𝛼_𝑖 is the angle characterizing the accuracy over the i-th window.
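A sketch of this criterion, assuming NumPy: window_angle fits a least-squares line to the last ws validation accuracies and returns its angle with the x-axis, and should_grow compares consecutive windows as stated above. Function names are illustrative.

```python
import numpy as np

def window_angle(accuracies, ws):
    """Angle (radians) of the least-squares line through the last ws points."""
    y = np.asarray(accuracies[-ws:])
    x = np.arange(ws)
    slope, _ = np.polyfit(x, y, deg=1)     # linear approximation of the window
    return np.arctan(slope)

def should_grow(accuracies, ws, gamma):
    """True when alpha_i <= gamma * alpha_{i-1} for the last two windows."""
    if len(accuracies) < 2 * ws:
        return False                       # need two full windows to compare
    alpha_prev = window_angle(accuracies[:-ws], ws)
    alpha_curr = window_angle(accuracies, ws)
    return alpha_curr <= gamma * alpha_prev
```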

7 Look-ahead initialization
When a new sub-network 𝑆_{𝑘+1} is inserted into the current architecture, its weights need to be initialized. Random initialization empirically performs poorly. Look-ahead initialization: freeze the previous sub-networks 𝑆_1, …, 𝑆_𝑘 and train only 𝑆_{𝑘+1} for a few epochs. The depth of the look-ahead network is considerably smaller than the depth of the final network, so the look-ahead training is not considered expensive.
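A minimal sketch of look-ahead initialization, reusing the hypothetical train_for_epochs helper from the loop above: the already-trained sub-networks are frozen while only the new sub-network (together with the classifier) is trained for a few epochs.

```python
def look_ahead_init(trained_subnets, new_subnet, classifier, epochs=3):
    for subnet in trained_subnets:
        for p in subnet.parameters():
            p.requires_grad = False        # freeze S_1, ..., S_k
    train_for_epochs(trained_subnets + [new_subnet], classifier, epochs=epochs)
    for subnet in trained_subnets:
        for p in subnet.parameters():
            p.requires_grad = True         # unfreeze for joint training
```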

8 Experiments Datasets: CIFAR-10. Base networks: VGGNet and ResNet.

9 Experiments Look-ahead initialization reduces the drop in validation accuracy when a new sub-network is added. Given the same resource budget, incremental learning outperforms the baseline model.

10 Experiments

11 Experiments

12 Conclusion Incremental learning in this work does not mean learning from a streaming dataset; it means incrementally growing the depth of the network. It can easily be applied to the online learning setting. The maximum depth of the model does not need to be predefined before learning starts. If we use equivalent blocks for ResNet or VGGNet, we can keep attaching new sub-networks until the model converges.

13 Online Deep Learning: Learning Deep Neural Networks on the Fly
D. Sahoo, Q. Pham, J. Lu, and S. C. H. Hoi. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18). arXiv:

14 Online deep learning In many cases, data arrives sequentially in a stream and may be too large to store in memory. Moreover, the data may exhibit concept drift. Online learning: a class of learning algorithms that optimize predictive models over a stream of data instances, processed sequentially.

15 Online learning for deep learning
Previous online learning methods focus on linear and kernel (two-layer) models. The point is that in online learning the data is small at first and grows gradually, so the depth trade-off reappears: how do we choose the depth? Explicit and implicit methods exist, but their prediction is performed only by the last hidden layer, which hinders learning of the lower layers' weights.

16 Proposed model Model parameters: 𝑊, Θ, 𝛼
The prediction is the 𝛼-weighted sum of the outputs of classifiers attached to each hidden layer, as sketched below.
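A minimal sketch of such a model, assuming PyTorch and fully connected layers; layer sizes, activations, and class names are illustrative. Each hidden layer feeds its own classifier f_l, and the final prediction is the 𝛼-weighted sum of the per-layer predictions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HedgedMLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes, depth):
        super().__init__()
        self.hidden = nn.ModuleList(
            [nn.Linear(in_dim if l == 0 else hidden_dim, hidden_dim) for l in range(depth)]
        )                                   # feature parameters W
        self.classifiers = nn.ModuleList(
            [nn.Linear(hidden_dim, n_classes) for _ in range(depth)]
        )                                   # per-layer classifier parameters Theta
        self.register_buffer("alpha", torch.full((depth,), 1.0 / depth))

    def forward(self, x):
        h, logits = x, []
        for layer, clf in zip(self.hidden, self.classifiers):
            h = F.relu(layer(h))
            logits.append(clf(h))           # f_l(x): one classifier per hidden layer
        probs = torch.stack([F.softmax(z, dim=-1) for z in logits])
        prediction = (self.alpha.view(-1, 1, 1) * probs).sum(dim=0)
        return prediction, logits           # weighted prediction + per-layer outputs
```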

17 Hedge Algorithm Algorithm for learning 𝛼 (Freund and Schapire, 1997), the weighting scheme behind AdaBoost.
Loss function: ℒ(𝐹(𝑥), 𝑦) = Σ_{𝑙=0}^{𝐿} 𝛼^(𝑙) ℒ(𝑓_𝑙(𝑥), 𝑦).
Initialize 𝛼_0^(𝑙) uniformly, i.e. 𝛼_0^(𝑙) = 1/(𝐿+1).
At every iteration 𝑡, update 𝛼^(𝑙) as 𝛼_{𝑡+1}^(𝑙) ← 𝛼_𝑡^(𝑙) 𝛽^{ℒ(𝑓_𝑙(𝑥), 𝑦)}, where 𝛽 ∈ (0,1) is the discount rate parameter and ℒ(𝑓_𝑙(𝑥), 𝑦) ∈ (0,1).
Normalize so that Σ_{𝑙=0}^{𝐿} 𝛼_{𝑡+1}^(𝑙) = 1.
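A sketch of this update, assuming PyTorch tensors; the clamp reflects the requirement that each per-classifier loss lies in (0, 1).

```python
import torch

def hedge_update(alpha, per_layer_losses, beta=0.99):
    losses = torch.clamp(per_layer_losses, 0.0, 1.0)  # keep L(f_l(x), y) within [0, 1]
    alpha = alpha * beta ** losses                    # discount poorly performing classifiers
    return alpha / alpha.sum()                        # normalize so the weights sum to 1
```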

18 Hedge Algorithm Hedge enjoys a regret of 𝑅_𝑇 ≤ √(𝑇 ln 𝑁), where 𝑁 is the number of experts (Freund and Schapire, 1999), which in this case is the network depth. Since shallower models tend to converge faster than deeper models, the hedging strategy would drive the 𝛼 weights of the deeper classifiers to very small values. To alleviate this, a smoothing parameter 𝑠 ∈ (0,1) is used to set a minimum weight for each classifier: 𝛼_𝑡^(𝑙) ← max(𝛼_𝑡^(𝑙), 𝑠/𝐿).
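A sketch of the smoothing step applied after each Hedge update; the renormalization at the end is an assumption made to keep the weights summing to one.

```python
import torch

def smooth_alpha(alpha, s=0.2):
    L = alpha.numel()
    alpha = torch.clamp(alpha, min=s / L)   # alpha^(l) <- max(alpha^(l), s / L)
    return alpha / alpha.sum()              # assumed renormalization
```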

19 Online Deep Learning using HBP
Learning 𝑊 and Θ is based on standard backpropagation, applied to the 𝛼-weighted sum of the per-classifier losses.
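A minimal sketch of one online HBP step, building on the HedgedMLP, hedge_update, and smooth_alpha sketches above; the optimizer and the cross-entropy loss are assumptions, not necessarily the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def hbp_step(model, optimizer, x, y, beta=0.99, s=0.2):
    _, logits = model(x)
    losses = torch.stack([F.cross_entropy(z, y) for z in logits])  # L(f_l(x), y) per classifier
    weighted_loss = (model.alpha * losses).sum()                   # alpha-weighted objective
    optimizer.zero_grad()
    weighted_loss.backward()                                       # gradient step on W and Theta
    optimizer.step()
    with torch.no_grad():                                          # Hedge step on alpha
        new_alpha = smooth_alpha(hedge_update(model.alpha, losses.detach(), beta), s)
        model.alpha.copy_(new_alpha)
```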

20 Contribution Dynamic objective: having a dynamically adaptive objective function mitigates the impact of vanishing gradients and helps escape saddle points and local minima. Also discussed: student-teacher learning, ensembles, concept drift, convolutional networks.

21 Experiments - Datasets

22 Experiments – Traditional Online BP

23 Experiments – Comparison

24 Experiments – Convergence speed

25 Experiments – Evolution of weight 𝜶

26 Experiments – Robust to the Base Net

