1 Lecture 2b: Convolutional NN: Optimization Algorithms

2 Agenda
Stochastic Gradient Descent (SGD):
- batch size
- adaptive learning rate
- momentum
- weight decay
Advanced optimization:
- Nesterov Accelerated Gradient descent
- Adagrad and Adadelta
SGD near saddle points
Why does training work?

3 References
- LeCun et al., "Efficient BackProp"
- Bottou, "Stochastic Gradient Descent Tricks"
- Hinton, lecture notes
- Le, Ng et al., "On Optimization Methods for Deep Learning"

4 Batch Gradient Descent
We want to minimize the loss over a batch with N samples $(x_n, y_n)$:
$L(w) = \sum_{n=1}^{N} E(f(x_n, w), y_n)$
Batch optimization: compute the gradient using back-propagation:
$\frac{\partial E}{\partial y_{l-1}} = \frac{\partial E}{\partial y_l} \cdot \frac{\partial y_l(w, y_{l-1})}{\partial y_{l-1}}$ ;  $\frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial y_l} \cdot \frac{\partial y_l(w, y_{l-1})}{\partial w_l}$
Accumulate the gradients over all samples in the batch, then update W:
$W(t+1) = W(t) - \lambda \sum_{n=1}^{N} \frac{\partial E}{\partial w}(x_n, w, y_n)$
Issue: computing the gradient for the whole batch is expensive.
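As a concrete sketch (not the lecture's code), here is batch gradient descent for a tiny least-squares model; the per-sample gradient stands in for the back-propagated $\frac{\partial E}{\partial w}(x_n, w, y_n)$:

```python
import numpy as np

def per_sample_grad(w, x, y):
    # Gradient of E = 0.5 * (w.x - y)^2 for one sample (stand-in for back-propagation).
    return (w @ x - y) * x

def batch_gradient_descent(w, X, Y, lr=0.01, epochs=100):
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for x, y in zip(X, Y):               # accumulate gradients over all N samples
            grad += per_sample_grad(w, x, y)
        w = w - lr * grad                    # single update: W(t+1) = W(t) - lr * sum of gradients
    return w
```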

5 Stochastic Gradient Descent
Stochastic Gradient Descent (on-line learning): randomly choose a sample $(x_k, y_k)$:
$W(t+1) = W(t) - \lambda \frac{\partial E}{\partial w}(x_k, w, y_k)$
Stochastic Gradient Descent with mini-batches:
- divide the dataset into small mini-batches, choosing samples from different classes
- compute the gradient using a single mini-batch and make an update
- move to the next mini-batch ...
Don't forget to shuffle / shift the data between epochs!
SGD with mini-batches is:
- faster than batch training
- more robust to redundant data
- better behaved near local minima and saddle points
"Use stochastic gradient descent when training time is the bottleneck" (Leon Bottou, Stochastic Gradient Descent Tricks)
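A minimal sketch of the mini-batch loop for the same toy least-squares loss, shuffling between epochs as the slide recommends (illustrative only):

```python
import numpy as np

def sgd_minibatch(w, X, Y, lr=0.01, batch_size=32, epochs=10):
    # Mini-batch SGD for the toy loss E = 0.5 * ||X w - Y||^2.
    n = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(n)               # shuffle the data between epochs
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]      # one mini-batch
            Xb, Yb = X[idx], Y[idx]
            grad = Xb.T @ (Xb @ w - Yb) / len(idx)    # average gradient over the mini-batch
            w = w - lr * grad                         # one update per mini-batch
    return w
```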

6 Mini-batch size and Learning Rate
Two key parameters: mini-batch size and learning rate.
How is the learning rate affected by the mini-batch size? What is the optimal batch size?
- N = full batch: the gradient computation is heavy and the step is small.
- N = 1 (on-line training): the gradient is very noisy and zig-zags around the "true" gradient.
- Mini-batch training follows the curve of the gradient: the expected value of the weight change continuously points in the direction of the gradient at the current point in weight space.
(Figure: batch vs. mini-batch trajectories.)
Wilson, "The General Inefficiency of Batch Training for Gradient Descent Learning", 2003

7 Mini-batch size and Learning Rate
Wilson's study asked how much faster a classical fully connected NN could be trained using a cluster of computers, each operating on a part of the training set. He came to interesting conclusions:
- on-line training on a single machine is faster than batch training on a cluster
- on-line training is ~100x faster than batch training for large datasets
- his rule: reduce the learning rate by a factor of ~$\sqrt{N}$ to get batch training to train with the same stability as on-line training on a training set of N instances: $lr(B) = lr(1) / \sqrt{B}$
What about convolutional NNs? Alex Krizhevsky ("One weird trick for parallelizing convolutional neural networks"):
- Theory: multiply the learning rate by $\sqrt{k}$ when increasing the batch size by a factor k, to keep the variance in the gradient expectation constant.
- Practice: multiply the learning rate by k when multiplying the batch size by k. But this rule breaks down for large batch sizes.
Project: explore the optimal training parameters for CIFAR-10.
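A small sketch of the two scaling rules above (the function name and signature are my own, for illustration only):

```python
import math

def scaled_lr(base_lr, k, rule="linear"):
    """Scale the learning rate when the mini-batch size grows by a factor k."""
    if rule == "sqrt":          # variance-preserving rule: lr ~ sqrt(k)
        return base_lr * math.sqrt(k)
    return base_lr * k          # Krizhevsky's practical rule: lr ~ k (breaks for large k)
```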

8 Learning Rate Adaptation
$W(t+1) = W(t) - \lambda(t) \cdot \frac{\partial E}{\partial w}$
Anneal the learning rate: choose $\lambda(t)$ such that $\sum_{t=1}^{\infty} \lambda^2(t) < \infty$ and $\sum_{t=1}^{\infty} \lambda(t) = \infty$, e.g. $\lambda(t) = c / t$.
Caffe supports four learning-rate policies:
- fixed: $\lambda = const$
- exp: $\lambda(n) = \lambda_0 \cdot \gamma^n$
- step: $\lambda(n) = \lambda_0 \cdot \gamma^{\lfloor n / step \rfloor}$
- inverse (inv): $\lambda(n) = \lambda_0 \cdot (1 + \gamma n)^{-c}$
Choose the learning rate per layer: $\lambda$ is proportional to the square root of the number of connections which share the weight.
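For illustration, the four policies written out as a small Python function (a sketch, not Caffe code; the parameter names only roughly follow Caffe's solver settings):

```python
def learning_rate(policy, n, base_lr, gamma=0.1, step=10000, power=0.75):
    # n is the iteration number; returns lambda(n) for the chosen policy.
    if policy == "fixed":
        return base_lr
    if policy == "exp":
        return base_lr * gamma ** n
    if policy == "step":
        return base_lr * gamma ** (n // step)
    if policy == "inv":
        return base_lr * (1.0 + gamma * n) ** (-power)
    raise ValueError("unknown policy: " + policy)
```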

9 SGD with weight decay
Weight decay works as a regularization of the weights:
$W(t+1) = W(t) - \lambda \left( \frac{\partial E}{\partial w} + \theta \cdot W(t) \right)$
This is equivalent to adding a penalty on the weights to the loss function:
$E'(W) = E(W) + \frac{\theta}{2} \cdot \|W\|^2$
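A one-function sketch of this update (variable names are mine; the $\theta W$ term is exactly the gradient of the $\frac{\theta}{2}\|W\|^2$ penalty):

```python
import numpy as np

def sgd_weight_decay_step(w, grad, lr=0.01, theta=1e-4):
    # W(t+1) = W(t) - lr * (dE/dW + theta * W(t)); the theta*w term is the
    # gradient of the (theta/2)*||W||^2 penalty added to the loss.
    return w - lr * (grad + theta * w)
```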

10 SGD with momentum
$W(t+1) = W(t) + \Delta W(t+1)$
$\Delta W(t+1) = \beta \cdot \Delta W(t) - \lambda \cdot \frac{\partial E}{\partial w}$
Momentum works as a weighted average of past gradients:
$\Delta W(t+1) = -\lambda \sum_{k=1}^{t+1} \beta^{t+1-k} \cdot \frac{\partial E}{\partial w}(k)$
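A minimal sketch of one momentum step (assumed names, not Caffe code):

```python
import numpy as np

def sgd_momentum_step(w, dw, grad, lr=0.01, beta=0.9):
    dw = beta * dw - lr * grad    # dW(t+1) = beta * dW(t) - lr * dE/dw
    w = w + dw                    # W(t+1) = W(t) + dW(t+1)
    return w, dw                  # dw unrolls into an exponentially weighted sum of past gradients
```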

11 Nesterov Accelerated Gradient
Nesterov accelerated gradient (1983) is similar to SGD with momentum:
$V(t+1) = W(t) - \lambda \cdot \frac{\partial E}{\partial w}(t+1)$   // basic gradient step
$W(t+1) = V(t+1) + \beta(t+1) \cdot \Delta V(t)$   // momentum step
where $\beta(t)$ is a sequence such that $0 < \beta(t) < 1$ and $\beta(t) \to 1$   // variable momentum
NAG: mrtz.org/blog/the-zen-of-gradient-descent/
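A sketch of this recursion under the reading $\Delta V(t) = V(t+1) - V(t)$ (the standard Nesterov form); `gradient(w)` is a hypothetical callable returning $\partial E/\partial w$ at the current iterate, and the $\beta(t)$ schedule is one common choice, not prescribed by the slide:

```python
import numpy as np

def nag(w, gradient, lr=0.01, steps=100):
    v_prev = np.copy(w)                     # V(0) = W(0)
    for t in range(1, steps + 1):
        beta = t / (t + 3.0)                # variable momentum: 0 < beta < 1, beta -> 1
        v = w - lr * gradient(w)            # basic gradient step
        w = v + beta * (v - v_prev)         # momentum step using Delta V
        v_prev = v
    return w
```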

12 Caffe's Nesterov Accelerated Gradient
Caffe's NAG implementation is different. Denote: $\lambda$ = learning rate, $\beta$ = momentum, $\theta$ = weight decay.
Weight update:
- Backward() computes $diff(t+1) = \frac{\partial E}{\partial w}(t+1)$
- add weight decay: $diff(t+1) = \frac{\partial E}{\partial w}(t+1) + \theta \cdot w(t+1)$
- add diff to the history: $hist(t+1) = \beta \cdot hist(t) + \lambda \cdot diff(t+1)$
- update: $\Delta w(t+1) = hist(t+1) + \beta \cdot (hist(t+1) - hist(t))$
$hist(t+1)$ works as a weighted sum of past gradients (like an IIR filter):
$hist(t+1) = \sum_{k=1}^{t+1} \beta^{t+1-k} \cdot \lambda(k) \cdot diff(k)$
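A NumPy sketch that mirrors these steps (illustrative only, not the actual Caffe C++ code; the final subtraction assumes the usual gradient-descent sign convention):

```python
import numpy as np

def caffe_nesterov_step(w, hist, grad, lr=0.01, beta=0.9, theta=1e-4):
    diff = grad + theta * w                      # add weight decay to the raw gradient
    hist_new = beta * hist + lr * diff           # hist(t+1) = beta*hist(t) + lr*diff(t+1)
    dw = hist_new + beta * (hist_new - hist)     # dw(t+1) = hist(t+1) + beta*(hist(t+1) - hist(t))
    return w - dw, hist_new                      # apply the update and keep the new history
```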

13 AdaGrad & AdaDelta
LeCun '98: "Learning rate per layer ~ square root of the output size (h x w)"
AdaGrad: adapt the learning rate for each weight:
$\Delta W_{ij}(t+1) = -\frac{\gamma}{\sqrt{\sum_{\tau=1}^{t+1} \left( \frac{\partial E}{\partial w_{ij}}(\tau) \right)^2}} \cdot \frac{\partial E}{\partial w_{ij}}(t+1)$
AdaDelta idea: accumulate the denominator over the last k gradients (sliding window):
$\alpha(t+1) = \sum_{\tau=t-k+1}^{t+1} \left( \frac{\partial E}{\partial w_{ij}}(\tau) \right)^2$ and $\Delta W_{ij}(t+1) = -\frac{\gamma}{\sqrt{\alpha(t+1)}} \cdot \frac{\partial E}{\partial w_{ij}}(t+1)$
This requires keeping k gradients. Instead we can use a simpler running-average formula:
$\beta(t+1) = \rho \cdot \beta(t) + (1-\rho) \cdot \left( \frac{\partial E}{\partial w_{ij}}(t+1) \right)^2$ and $\Delta W_{ij}(t+1) = -\frac{\gamma}{\sqrt{\beta(t+1) + \epsilon}} \cdot \frac{\partial E}{\partial w_{ij}}(t+1)$
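A per-weight sketch of both schemes in NumPy (names are mine; the second function implements the running-average simplification from the slide):

```python
import numpy as np

def adagrad_step(w, g2_sum, grad, gamma=0.01, eps=1e-8):
    g2_sum = g2_sum + grad ** 2                        # accumulate squared gradients from the start
    w = w - gamma / (np.sqrt(g2_sum) + eps) * grad     # per-weight scaled step
    return w, g2_sum

def adadelta_like_step(w, g2_avg, grad, gamma=0.01, rho=0.95, eps=1e-8):
    g2_avg = rho * g2_avg + (1 - rho) * grad ** 2      # running average instead of a k-window
    w = w - gamma / np.sqrt(g2_avg + eps) * grad
    return w, g2_avg
```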

14 Non-convex optimization
The cost function is highly non-convex:
- a large number of solutions
- a large number of saddle points and local minima
Wilson, "The General Inefficiency of Batch Training for Gradient Descent Learning"

15 Optimization near Saddle Points
"The problem with convnet cost functions is not local minima, but saddle points. How do SGD methods behave near a saddle point?"
- R. Pascanu et al., "On the saddle point problem for non-convex optimization"
- Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization"

16 Performance near Saddle Points
(Animation by Alec Radford)

17 Performance near Saddle Points
(Animation by Alec Radford)

18 Performance near Saddle Points
(Animation by Alec Radford)

19 Going beyond basic SGD
See Le et al., "On Optimization Methods for Deep Learning"

20 CNN training with different optimization algorithms
Karpathy, MNIST demo: “Adagrad/Adadelta are "safer" because they don't depend so strongly on setting of learning rates (with Adadelta being slightly better), but well-tuned SGD+Momentum almost always converges faster and at better final values”

21 Why SGD works for NN training?
Questions:
- Do neural networks enter and escape a series of local minima?
- Do they move at varying speed as they approach and then pass a variety of saddle points?
- Do they follow a narrow and winding ravine as it gradually descends to a low valley?
Goodfellow et al., "Qualitatively Characterizing Neural Network Optimization Problems", ICLR 2015

22 Why SGD works for NN training?
Goodfellow introduced a very simple technique to visualize the loss function: compute the loss along the line that connects an initial point X0 with a solution X1:
$X(\theta) = (1 - \theta) \cdot X_0 + \theta \cdot X_1$
(MNIST, fully connected NN with maxout)
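A sketch of this interpolation experiment; `loss_fn` is a hypothetical function (not from the paper) that returns the training loss for a flat parameter vector:

```python
import numpy as np

def interpolate_loss(x0, x1, loss_fn, num=50):
    thetas = np.linspace(0.0, 1.0, num)
    # X(theta) = (1 - theta) * X0 + theta * X1: a 1-D slice of parameter space
    losses = np.array([loss_fn((1 - t) * x0 + t * x1) for t in thetas])
    return thetas, losses     # plot losses vs. thetas to see the slice of the loss surface
```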

23 Why SGD works for NN training?
Line between 2 solutions ( FC NN+maxout, MNIST)

24 Why SGD works for NN training?
Line between random point and solution (FC NN+maxout, MNIST)

25 Why SGD works for NN training?
Line between initial and final point for Conv NN, CIFAR-10

26 Projects
What are the best optimization parameters for CIFAR-10 (batch size, learning rate, momentum, optimization algorithm, ...)?
Add new optimization algorithms to Caffe:
- RMSProp / Rprop
- Variance-based SGD (vSGD; Schaul, LeCun, "No More Pesky Learning Rates")
- Averaged SGD (Bottou, "SGD Tricks")

