Lecture 3a Analysis of training of NN

Lecture 3a Analysis of training of NN boris.ginsburg@gmail.com

Agenda
- Analysis of deep networks
- Variance analysis
- Non-linear units
- Weight initialization
- Local Response Normalization (LRN)
- Batch Normalization

Understanding the difficulty of training convolutional networks
The key idea: debug training by monitoring
- the mean and variance of the activations y_l (outputs of the non-linear units),
- the mean and variance of the gradients ∂E/∂y_{l-1} and ∂E/∂W_l.
Reminder: variance of x: Var(x) = E[(x − E[x])²].
We compute a scalar mean and variance for each layer, and then average over the images in the test set.
X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks
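As a concrete illustration, here is a minimal NumPy sketch of this kind of per-layer monitoring (the function name and the data layout are assumptions, not the authors' code):

```python
import numpy as np

def layer_stats(activations):
    """Scalar mean / variance per layer, averaged over a batch of images.

    activations: list of arrays, one per layer, each of shape (batch, units).
    Returns a list of (mean, variance) pairs, one per layer.
    """
    stats = []
    for a in activations:
        # per-image scalar statistics, then averaged over the batch
        stats.append((a.mean(axis=1).mean(), a.var(axis=1).mean()))
    return stats
```

The same helper can be applied to the gradients ∂E/∂y and ∂E/∂W collected during the backward pass.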

Understanding the difficulty of training convolutional networks
Activation graph for an MLP: 4 x [fully connected layer + sigmoid].
Mean and standard deviation of the activation (output of the sigmoid) during learning, for the 4 hidden layers. The top hidden layer quickly saturates at 0 (slowing down all learning), but then slowly de-saturates around epoch 100.
The sigmoid is not symmetric around zero, which makes the network difficult to train.

Understanding the non-linear function behavior
Let's try an MLP with symmetric non-linear functions: tanh and soft-sign.
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
soft-sign(x) = x / (1 + |x|)
[Figure: 98th percentile of the activations (markers only) and standard deviation (solid lines with markers) during training.]
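For reference, the two activations written out in NumPy (a trivial sketch, not part of the slides):

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def softsign(x):
    return x / (1.0 + np.abs(x))

x = np.linspace(-5, 5, 11)
print(tanh(x))      # saturates to +/-1 exponentially fast
print(softsign(x))  # approaches +/-1 only polynomially (softer tails)
```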

MLP: Debugging the forward path
How can we use variance analysis to debug NN training? Let's start with a classical Multi-Layer Perceptron (MLP).
Forward propagation for an FC layer:
y_i = f( Σ_{j=1..n} w_ij * x_j )
where:
x = [x_j] is the vector of layer inputs, n is the number of inputs,
W is the layer weight matrix,
y = [y_i] is the vector of layer outputs (hidden nodes).
Assume that f is a symmetric non-linear activation function. For the initial analysis we ignore the non-linearity f and its derivative (f is not saturated and f' ≈ 1).
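A minimal NumPy sketch of this FC forward pass (layer sizes and names are illustrative assumptions):

```python
import numpy as np

def fc_forward(x, W, f=np.tanh):
    """Fully connected layer: y_i = f(sum_j W[i, j] * x[j]).

    x: input vector of shape (n_in,)
    W: weight matrix of shape (n_out, n_in)
    """
    return f(W @ x)

n_in, n_out = 256, 128
x = np.random.randn(n_in)
W = np.random.randn(n_out, n_in) * 0.01   # initialization scale discussed below
y = fc_forward(x, W)
```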

MLP: Debugging the forward path
Assume that all x_j are independent with variance Var(x), and all w_ij are independent with variance Var(w). Then
Var(y) = n_in * Var(w) * Var(x)
We want to keep the output y in the same dynamic range as the input x:
n_in * Var(w) = 1  =>  Var(W) = 1 / n_in
Xavier rule for weight initialization with a uniform rand():
W = uni_rand[ −sqrt(3 / n_in), sqrt(3 / n_in) ]
X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks
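A sketch of this rule in NumPy (the variance of uniform[−a, a] is a²/3, so a = sqrt(3 / n_in)):

```python
import numpy as np

def xavier_uniform_forward(n_in, n_out):
    """Forward-path Xavier rule: Var(W) = 1 / n_in."""
    a = np.sqrt(3.0 / n_in)
    return np.random.uniform(-a, a, size=(n_out, n_in))

W = xavier_uniform_forward(256, 128)
print(W.var(), 1.0 / 256)   # the two values should be close
```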

MLP: Debugging the backward propagation
Backward propagation of gradients:
∂E/∂x = ∂E/∂y * W^T ;  ∂E/∂w = ∂E/∂y * x
Then
Var(∂E/∂x) = n_out * Var(w) * Var(∂E/∂y)
Var(∂E/∂w) = Var(x) * Var(∂E/∂y)
We want to keep the gradients ∂E/∂x from vanishing and from exploding:
n_out * Var(w) = 1  =>  Var(W) = 1 / n_out
Combining this with the formula from the forward path:
Var(W) = 2 / (n_in + n_out)
Xavier rule 2 for weight initialization with a uniform rand():
W = uni_rand[ −sqrt(6 / (n_in + n_out)), sqrt(6 / (n_in + n_out)) ]
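The combined rule in the same style (a sketch, not library code):

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    """Xavier rule 2 (Glorot): Var(W) = 2 / (n_in + n_out).

    Balances the forward constraint Var(W) = 1/n_in and the
    backward constraint Var(W) = 1/n_out; a = sqrt(6 / (n_in + n_out)).
    """
    a = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-a, a, size=(n_out, n_in))
```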

Extension of gradient analysis for convolutional networks
Convolutional layer, forward propagation:
Y_i = f( Σ_{j=1..M} W_ij * X_j )
where:
Y_i is an output feature map (H' x W'),
W_ij is a convolutional filter (K x K),
X_j is an input feature map (H x W),
W_ij * X_j is the convolution of input feature map X_j with filter W_ij,
M is the number of input feature maps (each H x W).
Backward propagation:
∂E/∂X_j = Σ_{i=1..N} ∂E/∂Y_i * W_ij ,   ∂E/∂W_ij = ∂E/∂Y_i * X_j
Here * denotes convolution.
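A direct, loop-based NumPy/SciPy sketch of this forward pass (for clarity only; the shapes are assumptions):

```python
import numpy as np
from scipy.signal import convolve2d

def conv_layer_forward(X, W, f=np.tanh):
    """Y_i = f(sum_j W[i, j] * X[j]) with 'valid' 2-D convolutions.

    X: input feature maps, shape (M, H, W)
    W: filters, shape (N, M, K, K)
    Returns N output feature maps of shape (H-K+1, W-K+1).
    """
    N, M, K, _ = W.shape
    H, Wd = X.shape[1], X.shape[2]
    Y = np.zeros((N, H - K + 1, Wd - K + 1))
    for i in range(N):
        for j in range(M):
            Y[i] += convolve2d(X[j], W[i, j], mode='valid')
    return f(Y)
```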

Extension of gradient analysis for convolutional networks
Convolutional layer, forward propagation:
Var(Y) = n_in * Var(W) * Var(X), where n_in = (# input feature maps) * K²
Backward propagation:
Var(∂E/∂X) = n_out * Var(W) * Var(∂E/∂Y), where n_out = (# output feature maps) * K²
For the weight gradients:
Var(∂E/∂W) ~ (H * W) * Var(X) * Var(∂E/∂Y)
The factor (H * W) can be compensated with a per-layer learning rate.
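Applying the Xavier rule with these fan-in / fan-out counts (a sketch; the function name is hypothetical):

```python
import numpy as np

def xavier_uniform_conv(maps_in, maps_out, k):
    """Xavier initialization for a conv layer.

    fan_in  = maps_in  * k**2   (inputs contributing to one output value)
    fan_out = maps_out * k**2
    """
    fan_in, fan_out = maps_in * k * k, maps_out * k * k
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-a, a, size=(maps_out, maps_in, k, k))
```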

Local Contrast Normalization
Local Contrast Normalization can be performed on the state of every layer, including the input.
Subtractive LCN: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (a high-pass filter).
Divisive LCN: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
The two can be combined: subtractive + divisive LCN.
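A possible NumPy/SciPy sketch of the two operations, assuming a Gaussian-weighted spatial neighborhood with an illustrative sigma (not the exact filter of any particular implementation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def subtractive_lcn(X, sigma=2.0):
    """Subtract a Gaussian-weighted local mean from each feature map.

    X: feature maps of shape (C, H, W); sigma=0 on the channel axis
    keeps the smoothing purely spatial.
    """
    return X - gaussian_filter(X, sigma=(0, sigma, sigma))

def divisive_lcn(X, sigma=2.0, eps=1e-8):
    """Divide by a Gaussian-weighted local standard deviation,
    pooled over space and over all feature maps."""
    local_mean = gaussian_filter(X, sigma=(0, sigma, sigma))
    local_var = gaussian_filter((X - local_mean) ** 2, sigma=(0, sigma, sigma))
    local_std = np.sqrt(local_var.mean(axis=0, keepdims=True) + eps)
    return X / local_std

# subtractive + divisive LCN:
# X_normalized = divisive_lcn(subtractive_lcn(X))
```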

Local Response Normalization Layer
The LRN layer "damps" responses that are too large by normalizing over a local neighborhood inside the feature map:
y_l(x, y) = y_{l-1}(x, y) / ( 1 + (α / N²) * Σ_{x'=x−N/2..x+N/2} Σ_{y'=y−N/2..y+N/2} y_{l-1}(x', y')² )
where:
y_{l-1}(x, y) is the activity map prior to normalization,
N is the size of the region used for normalization,
the constant 1 prevents numerical issues for very small activations.
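A sketch of this within-channel LRN in NumPy/SciPy; the window size N and the value of alpha are illustrative, and boundary handling is left to the filter's default mode:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def lrn_within_channel(Y, N=5, alpha=1e-4):
    """y(x, y) / (1 + alpha/N^2 * sum of squared neighbors in an N x N window).

    Y: feature maps of shape (C, H, W); the window is spatial only.
    """
    # uniform_filter returns the local mean, so multiply by N*N to get the sum
    sq_sum = uniform_filter(Y ** 2, size=(1, N, N)) * (N * N)
    return Y / (1.0 + (alpha / (N * N)) * sq_sum)
```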

Local Response Normalization Layer
[Network diagram, from bottom to top: Data Layer → Convolutional layer → ReLU → Pooling → LRN layer → Convolutional layer → ReLU → Pooling → LRN layer → Inner Product → Soft Max]

Batch Normalization Layer
A layer that normalizes the output of a convolutional layer before the non-linearity:
Whitening: normalize each element of the feature map over the mini-batch. All locations of the same feature map are normalized in the same way.
An adaptive scale γ and shift β (per feature map) are learned parameters.
S. Ioffe, C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015
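A minimal NumPy sketch of the training-time forward pass (per-channel normalization over the batch and spatial dimensions; not the Caffe implementation referenced later):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Batch normalization for conv feature maps (training mode).

    X: mini-batch of feature maps, shape (N, C, H, W).
    gamma, beta: per-feature-map scale and shift, shape (C,).
    Each channel is normalized over the batch and all spatial locations,
    so all locations of the same feature map are treated the same way.
    """
    mu = X.mean(axis=(0, 2, 3), keepdims=True)     # shape (1, C, 1, 1)
    var = X.var(axis=(0, 2, 3), keepdims=True)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * X_hat + beta.reshape(1, -1, 1, 1)
```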

Batch Normalization Layer
[Network diagram, from bottom to top: Data Layer → Convolutional layer → Batch Normalization layer → ReLU → Pooling → Convolutional layer → Batch Normalization layer → ReLU → Pooling → Inner Product → Soft Max]

Batch Normalization: training
Back-propagation through the BN layer (the gradient formulas are derived in Ioffe & Szegedy, 2015).
Implemented in Caffe: https://github.com/BVLC/caffe/pull/1965
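For reference, a NumPy sketch of the standard BN backward pass from the paper, in its simplified per-channel form (this is not the Caffe code linked above):

```python
import numpy as np

def batchnorm_backward(dY, X, gamma, eps=1e-5):
    """Gradients of the BN layer for conv feature maps.

    dY: upstream gradient, shape (N, C, H, W); X: the layer input saved
    from the forward pass; gamma: per-channel scale, shape (C,).
    Returns (dX, dgamma, dbeta).
    """
    axes = (0, 2, 3)
    mu = X.mean(axis=axes, keepdims=True)
    var = X.var(axis=axes, keepdims=True)
    X_hat = (X - mu) / np.sqrt(var + eps)

    dbeta = dY.sum(axis=axes)
    dgamma = (dY * X_hat).sum(axis=axes)
    dX_hat = dY * gamma.reshape(1, -1, 1, 1)
    # chain rule through the normalization: the mean and variance also depend on X
    dX = (dX_hat
          - dX_hat.mean(axis=axes, keepdims=True)
          - X_hat * (dX_hat * X_hat).mean(axis=axes, keepdims=True)
          ) / np.sqrt(var + eps)
    return dX, dgamma, dbeta
```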

Batch Normalization: inference
During inference we do not have a mini-batch to normalize over, so we instead use a fixed mean and variance estimated over the whole training set:
x̂ = (x − E[x]) / sqrt(Var[x] + ε)
For testing during training we can use running estimates of E[x] and Var[x].
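A sketch of inference with running estimates, assuming an exponential moving average with an illustrative momentum value:

```python
import numpy as np

class RunningStats:
    """Running estimates of E[x] and Var[x] per channel for BN inference."""
    def __init__(self, C, momentum=0.9):
        self.mean = np.zeros((1, C, 1, 1))
        self.var = np.ones((1, C, 1, 1))
        self.momentum = momentum

    def update(self, X):
        # update the estimates from the current training mini-batch
        mu = X.mean(axis=(0, 2, 3), keepdims=True)
        var = X.var(axis=(0, 2, 3), keepdims=True)
        self.mean = self.momentum * self.mean + (1 - self.momentum) * mu
        self.var = self.momentum * self.var + (1 - self.momentum) * var

def batchnorm_inference(X, gamma, beta, stats, eps=1e-5):
    """x_hat = (x - E[x]) / sqrt(Var[x] + eps), then scale and shift."""
    X_hat = (X - stats.mean) / np.sqrt(stats.var + eps)
    return gamma.reshape(1, -1, 1, 1) * X_hat + beta.reshape(1, -1, 1, 1)
```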

Batch Normalization: performance
Networks with batch normalization train much faster:
- a much higher learning rate with fast exponential decay can be used,
- there is no need for LRN.
Baselines: Caffe cifar_full; VGG-16: Caffe VGG_ILSVRC_16.

Batch Normalization: ImageNet performance
Models:
- GoogLeNet Inception (ILSVRC 2014) with a learning rate of 0.0015
- BN-Baseline: Inception + Batch Normalization before each ReLU
- BN-x5: Inception + Batch Normalization, without dropout and LRN; the initial learning rate was increased 5x, to 0.0075
- BN-x30: like BN-x5, but with an initial learning rate of 0.045 (30x that of Inception)
- BN-x5-Sigmoid: like BN-x5, but with sigmoid instead of ReLU