Presentation on theme: "Activation Function: SELU" - Presentation transcript:

1 Activation Function: SELU

2 ReLU
Rectified Linear Unit (ReLU): $a = \sigma(z) = \max(0, z)$, i.e. $a = z$ for $z > 0$ and $a = 0$ for $z \le 0$.
Reasons for using it:
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Helps with the vanishing gradient problem
Not differentiable at $z = 0$? In practice, just pick one of the one-sided derivatives there.
Variants such as Parametric ReLU (PReLU) follow on the next slides.
Ref: Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." International Conference on Artificial Intelligence and Statistics, 2011.

3 ReLU - variant
Leaky ReLU: $a = z$ for $z > 0$, $a = 0.01z$ for $z \le 0$.
Parametric ReLU (PReLU): $a = z$ for $z > 0$, $a = \alpha z$ for $z \le 0$, where $\alpha$ is also learned by gradient descent.
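A minimal NumPy sketch of the two variants; here $\alpha$ is passed as a plain argument, whereas in an actual PReLU layer it is a learned parameter:

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    """Leaky ReLU: a = z for z > 0, a = slope * z otherwise (slope fixed at 0.01)."""
    return np.where(z > 0, z, slope * z)

def prelu(z, alpha):
    """Parametric ReLU: same form, but alpha is learned by gradient descent."""
    return np.where(z > 0, z, alpha * z)
```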

4 ReLU - variant
Exponential Linear Unit (ELU): $a = z$ for $z > 0$, $a = \alpha(e^z - 1)$ for $z \le 0$.
Scaled ELU (SELU): multiply both pieces by $\lambda$, i.e. $a = \lambda z$ for $z > 0$ and $a = \lambda\alpha(e^z - 1)$ for $z \le 0$, with specific values of $\alpha$ and $\lambda$ given on the next slide.
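A minimal NumPy sketch of ELU, and of SELU as ELU scaled by $\lambda$ (the particular constants come on the next slide):

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU: a = z for z > 0, a = alpha * (exp(z) - 1) otherwise."""
    # expm1 of the clipped input avoids overflow in the branch that is not selected.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def selu(z, alpha, lam):
    """SELU is ELU with both pieces scaled by lambda."""
    return lam * elu(z, alpha)
```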

5 SELU ๐›ผ=1.673263242โ€ฆ ๐œ†=1.050700987โ€ฆ ๐‘Ž=๐œ†๐‘ง ๐‘Ž=๐œ†๐›ผ ๐‘’ ๐‘ง โˆ’1
๐‘Ž=๐œ†๐›ผ ๐‘’ ๐‘ง โˆ’1 Positive and negative values The whole ReLU family has this property except the original ReLU. Saturation region ELU also has this property Slope larger than 1 Only SELU also has this property

6 SELU โ€ฆ ๐œ‡ ๐‘ง =๐ธ ๐‘ง = ๐‘˜=1 ๐พ ๐ธ ๐‘Ž ๐‘˜ ๐‘ค ๐‘˜ =๐œ‡ ๐‘˜=1 ๐พ ๐‘ค ๐‘˜ =๐œ‡โˆ™ ๐พ๐œ‡ ๐‘ค ๐œ‡ =0 =0
= ๐‘˜=1 ๐พ ๐ธ ๐‘Ž ๐‘˜ ๐‘ค ๐‘˜ =๐œ‡ ๐‘˜=1 ๐พ ๐‘ค ๐‘˜ =๐œ‡โˆ™ ๐พ๐œ‡ ๐‘ค ๐œ‡ =0 =0 The inputs are i.i.d random variables with mean ๐œ‡ and variance ๐œŽ 2 . =0 =1 โ€ฆ The intuition here is that high variance in one layer is mapped to low variance in the next layer and vice versa. This works because the SELU decreases the variance for negative inputs and increases the variance for positive inputs. The decrease effect is stronger for very negative inputs, and the increase effect is stronger for near-zero values. Theorem 2โ€Šโ€”โ€Šthe mapping of variance is bounded from above so gradients cannot explode Theorem 3โ€”the mapping of variance is bounded from below so gradients cannot vanish Do not have to be Gaussian
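A quick Monte Carlo check of this mean calculation (deliberately non-Gaussian inputs; the sizes and distributions are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 100, 200_000

# Weights with sum(w) = K * mu_w = 0, as assumed on the slide.
w = rng.normal(size=K)
w -= w.mean()

# i.i.d. inputs with mean mu = 0, uniform rather than Gaussian on purpose.
a = rng.uniform(-1.0, 1.0, size=(N, K))

z = a @ w
print(z.mean(), z.std())  # mean is close to 0 (compare with the much larger std)
```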

7 SELU โ€ฆ ๐œ‡ ๐‘ง =0 ๐œ‡ ๐‘ค =0 ๐œŽ ๐‘ง 2 =๐ธ ๐‘งโˆ’ ๐œ‡ ๐‘ง 2 =๐ธ ๐‘ง 2
๐œŽ ๐‘ง 2 =๐ธ ๐‘งโˆ’ ๐œ‡ ๐‘ง 2 =๐ธ ๐‘ง 2 =๐ธ ๐‘Ž 1 ๐‘ค 1 + ๐‘Ž 2 ๐‘ค 2 +โ‹ฏ 2 The inputs are i.i.d random variables with mean ๐œ‡ and variance ๐œŽ 2 . = ๐‘˜=1 ๐พ ๐‘ค ๐‘˜ 2 ๐œŽ 2 = ๐œŽ 2 โˆ™ ๐พ๐œŽ ๐‘ค 2 =1 =0 =1 =1 =1 โ€ฆ ๐ธ ๐‘Ž ๐‘˜ ๐‘ค ๐‘˜ 2 = ๐‘ค ๐‘˜ 2 ๐ธ ๐‘Ž ๐‘˜ 2 = ๐‘ค ๐‘˜ 2 ๐œŽ 2 ๐ธ ๐‘Ž ๐‘– ๐‘Ž ๐‘— ๐‘ค ๐‘– ๐‘ค ๐‘— = ๐‘ค ๐‘– ๐‘ค ๐‘— ๐ธ ๐‘Ž ๐‘– ๐ธ ๐‘Ž ๐‘— =0 The intuition here is that high variance in one layer is mapped to low variance in the next layer and vice versa. This works because the SELU decreases the variance for negative inputs and increases the variance for positive inputs. The decrease effect is stronger for very negative inputs, and the increase effect is stronger for near-zero values. Theorem 2โ€Šโ€”โ€Šthe mapping of variance is bounded from above so gradients cannot explode Theorem 3โ€”the mapping of variance is bounded from below so gradients cannot vanish target Assume Gaussian ๐œ‡=0,๐œŽ=1

8 Demo

9 SELU is actually more general.
A 93-page proof. Source of joke:

10 The newest activation unit: Self-Normalizing Neural Networks (SELU)
Experimental results on MNIST and CIFAR-10. See also: GELU.
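A minimal PyTorch sketch of the kind of fully-connected SELU network used in such MNIST experiments; the depth, width, and LeCun-normal initialization here are my own assumptions rather than the exact configuration from the paper:

```python
import torch.nn as nn
import torch.nn.init as init

class SELUNet(nn.Module):
    """Hypothetical fully-connected SELU network (sizes are assumptions)."""

    def __init__(self, in_dim=784, hidden=256, n_classes=10, depth=8):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(depth):
            linear = nn.Linear(dim, hidden)
            # LeCun-normal initialization keeps pre-activations near mean 0 / variance 1,
            # which the SELU fixed-point argument on the earlier slides relies on.
            init.normal_(linear.weight, mean=0.0, std=(1.0 / dim) ** 0.5)
            init.zeros_(linear.bias)
            layers += [linear, nn.SELU()]
            dim = hidden
        layers.append(nn.Linear(dim, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: a batch of 28x28 MNIST images, flattened to vectors.
        return self.net(x.flatten(1))
```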

11 Demo

