1
Activation Function: SELU
2
ReLU: Rectified Linear Unit

a = z  if z > 0
a = 0  if z ≤ 0

Reasons for using it:
1. Fast to compute
2. Biological reason
3. Equivalent to infinitely many sigmoids with different biases
4. Handles the vanishing gradient problem

Not differentiable at z = 0; in practice a fixed value (e.g., 0) is used for the gradient there.

Variant: Parametric ReLU (PReLU), shown on the next slide.

Ref: Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." International Conference on Artificial Intelligence and Statistics, 2011.
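A minimal NumPy sketch of the definition above; the function name and test values are illustrative, not from the slides.

```python
import numpy as np

def relu(z):
    # a = z for z > 0, a = 0 for z <= 0
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))  # [0.  0.  0.  0.5 2. ]
```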
3
ReLU variants

Leaky ReLU:
a = z       if z > 0
a = 0.01 z  otherwise

Parametric ReLU (PReLU):
a = z    if z > 0
a = α z  otherwise
α is also learned by gradient descent
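A sketch of the two variants, using the slide's 0.01 slope for Leaky ReLU; in PReLU, α would be a trainable parameter, which is only simulated here with a fixed value.

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    # a = z for z > 0, a = 0.01 * z otherwise
    return np.where(z > 0, z, slope * z)

def prelu(z, alpha):
    # Same shape as Leaky ReLU, but alpha is a parameter that would be
    # learned by gradient descent together with the network weights.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(z))         # [-0.02  -0.005  0.5    2.   ]
print(prelu(z, alpha=0.25))  # [-0.5   -0.125  0.5    2.   ]
```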
4
ReLU variants

Exponential Linear Unit (ELU):
a = z            if z > 0
a = α (e^z - 1)  otherwise

Scaled ELU (SELU): multiply both branches by λ
a = λ z            if z > 0
a = λ α (e^z - 1)  otherwise
(the specific values of α and λ are given on the next slide)
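A sketch of the ELU branch above; α = 1.0 is an arbitrary example value, not something fixed by the slide.

```python
import numpy as np

def elu(z, alpha=1.0):
    # a = z for z > 0, a = alpha * (e^z - 1) otherwise
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-3.0, -1.0, 0.0, 1.0])
print(elu(z))  # approx [-0.950, -0.632, 0.0, 1.0]
```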
5
SELU

α = 1.673263242…   λ = 1.050700987…

a = λ z            if z > 0
a = λ α (e^z - 1)  if z ≤ 0

Properties:
- Takes both positive and negative values: the whole ReLU family has this property except the original ReLU.
- Has a saturation region: ELU also has this property.
- Slope larger than 1 (λ > 1): only SELU has this property.
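A sketch of SELU with the constants quoted above (full-precision values from Klambauer et al., "Self-Normalizing Neural Networks", 2017); printing the slope and saturation level shows the two properties numerically.

```python
import numpy as np

# SELU constants; the slide quotes alpha = 1.673263242..., lambda = 1.050700987...
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(z):
    # a = lambda * z                  for z > 0
    # a = lambda * alpha * (e^z - 1)  for z <= 0
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

z = np.array([-10.0, -1.0, 0.0, 1.0, 2.0])
print(selu(z))
print(LAMBDA)           # slope for z > 0 is ~1.05, larger than 1
print(-LAMBDA * ALPHA)  # saturation value for very negative z, ~ -1.758
```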
6
SELU: mean of z

Let z = Σ_{k=1}^{K} a_k w_k, where the inputs a_k are i.i.d. random variables with mean μ and variance σ² (target: μ = 0, σ² = 1; they do not have to be Gaussian), and the weights satisfy Σ_{k=1}^{K} w_k = K μ_w = 0. Then

μ_z = E[z] = Σ_{k=1}^{K} E[a_k] w_k = μ Σ_{k=1}^{K} w_k = μ · K μ_w = 0

The intuition here is that high variance in one layer is mapped to low variance in the next layer and vice versa. This works because SELU decreases the variance for negative inputs and increases the variance for positive inputs. The decrease effect is stronger for very negative inputs, and the increase effect is stronger for near-zero values.

Theorem 2: the mapping of variance is bounded from above, so gradients cannot explode.
Theorem 3: the mapping of variance is bounded from below, so gradients cannot vanish.
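A quick Monte Carlo sketch of this calculation under the stated assumptions (i.i.d. inputs with μ = 0, σ = 1; weights rescaled so that K·μ_w = 0 and K·σ_w² = 1). The sample sizes, seed, and variable names are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 10000, 256

# Inputs a_k: i.i.d. with mean mu = 0 and variance sigma^2 = 1
# (Gaussian here for convenience; the mean argument does not require it).
a = rng.normal(size=(n, K))

# Weights rescaled so that sum_k w_k = K * mu_w = 0 and sum_k w_k^2 = K * sigma_w^2 = 1.
w = rng.normal(size=K)
w -= w.mean()
w /= np.sqrt(np.sum(w ** 2))

z = a @ w
print(z.mean())  # close to 0: mu_z = mu * K * mu_w = 0
print(z.var())   # close to 1 (this is the next slide's calculation)
```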
7
SELU: variance of z

σ_z² = E[(z - μ_z)²] = E[z²]   (since μ_z = 0)
     = E[(a_1 w_1 + a_2 w_2 + ⋯ + a_K w_K)²]

The cross terms vanish: E[a_i a_j w_i w_j] = w_i w_j E[a_i] E[a_j] = 0 for i ≠ j.
The square terms give:  E[(a_k w_k)²] = w_k² E[a_k²] = w_k² σ².

σ_z² = Σ_{k=1}^{K} w_k² σ² = σ² · K σ_w² = 1

Target: σ_z² = 1, using σ² = 1 and the assumption Σ_{k=1}^{K} w_k² = K σ_w² = 1.
Here the inputs are assumed Gaussian with μ = 0, σ = 1.
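Combining the two derivations, a sketch of the self-normalizing behaviour they aim at: stacking random layers whose weights have mean 0 and variance 1/K, each followed by SELU, keeps the activations' mean near 0 and variance near 1. Width, depth, and seed are arbitrary choices; this is a numerical illustration, not the paper's proof.

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

rng = np.random.default_rng(1)
n, width, depth = 10000, 256, 20

# Start from inputs with mu = 0, sigma = 1, as assumed on the slide.
x = rng.normal(size=(n, width))

for layer in range(depth):
    # w_k ~ N(0, 1/K), so E[sum_k w_k] = 0 and E[sum_k w_k^2] = 1.
    W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
    x = selu(x @ W)
    print(layer, round(float(x.mean()), 3), round(float(x.var()), 3))
# The printed mean stays near 0 and the variance near 1 across layers,
# instead of exploding or vanishing.
```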
8
Demo
9
SELU is actually more general.
A 93-page proof. Source of joke:
10
Newest activation function: SELF-NORMALIZATION NEURAL NETWORK (SELU)
MNIST  CIFAR-10
GELU:
11
Demo