A neural network is a computing methodology loosely modeled on the learning behavior of the brain. It is one of the most popular techniques in machine learning because, given sufficient capacity and training data, it can approximate a very wide range of functions. For this reason, neural networks are popularly known as universal function approximators.

Neurons in our brain receive information from other neurons or from external stimuli. If the combined input to a neuron crosses its activation threshold, the neuron transmits information to the next connection, which reinforces learning. Inspired by this, machine learning uses mathematical activation functions in a neural network to learn from input data and produce a relevant response. Choosing the right activation function is therefore critical for efficient learning. This article covers the activation functions most widely used in neural networks.

By definition, activation functions are the functions that make neural networks powerful by giving them the ability to learn complex data (images, video, audio, etc.). One constraint is that an activation function should be differentiable, at least almost everywhere. This is because a neural network uses the backpropagation algorithm to correct its errors, and backpropagation relies on partial derivatives of the error with respect to the weights. These can only be computed if the activation function has a usable derivative.
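To make the role of the derivative concrete, here is a minimal sketch of one gradient step for a single neuron. It assumes a sigmoid activation and a squared-error loss, and the names `loss`, `grad_w`, the learning rate `lr`, and all numeric values are illustrative choices, not something prescribed by this article. The analytic gradient from the chain rule is compared with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Squared error of a single neuron with a sigmoid activation (assumed setup).
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    # Chain rule: dL/dw = (a - y) * sigmoid'(z) * x, with sigmoid'(z) = a * (1 - a).
    a = sigmoid(w * x + b)
    return (a - y) * a * (1.0 - a) * x

# Hypothetical weight, bias, input, and target.
w, b, x, y = 0.5, 0.1, 2.0, 1.0
lr = 0.5  # illustrative learning rate

analytic = grad_w(w, b, x, y)

# Central finite difference as an independent check of the derivative.
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

# One gradient-descent update of the weight.
new_w = w - lr * analytic

print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
print("loss before:", loss(w, b, x, y), "loss after:", loss(new_w, b, x, y))
```

The factor `a * (1.0 - a)` is the derivative of the activation. Without it, the chain rule could not carry the error back to the weight.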

**Popular activation functions:**

#### a. Step function:

The step function activates a neuron when its input is above a specified threshold. Mathematically, it can be expressed as shown in equation 1.

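$$f(x) = \begin{cases} 1, & x > \theta \\ 0, & x \le \theta \end{cases}$$

Here $\theta$ denotes the threshold.

**Eqn. 1: Step function**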

The threshold is chosen with respect to the available data and the problem at hand. Figure 1 visualizes the step function.

**Fig. 1: Step function**

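As can be observed, the step function outputs only two values (0 and 1). This makes it suitable for binary decisions. However, it fails for problems, such as multi-class classification, that need graded activations like 37% or 83%. In addition, its gradient is zero everywhere except at the threshold, so backpropagation receives no useful signal from it.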
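A minimal sketch of the step function in Python follows. The function name `step`, the default threshold of 0, and the choice to return 0 exactly at the threshold are assumptions made for illustration.

```python
def step(x, threshold=0.0):
    """Return 1 when x is strictly above the threshold, otherwise 0.

    Treating x == threshold as inactive is a convention chosen here.
    """
    return 1 if x > threshold else 0

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [step(x) for x in inputs]
print(outputs)
```

Whatever the input, the output is either 0 or 1, so an intermediate level of activation cannot be expressed.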

#### b. Linear function:

A linear function produces an activation that is proportional to its input. The higher the input value, the more strongly the neuron is activated, and the neuron with the highest value can pass its information on to the next connection. Because it provides a range of activations, this overcomes the step function’s limitation. Equation 2 depicts a linear activation function.

$$f(x) = c\,x$$

Here $c$ is a constant.

**Eqn. 2: Linear function**

However, the gradient (derivative) of the above equation is a constant and has no relation to the input. Therefore, the activation contributes the same constant factor to every backpropagation update, regardless of the input. Also, stacking multiple layers brings no benefit, since a composition of linear functions is itself a linear function, equivalent to a single layer. A network with linear activations can therefore learn only linear relationships, which are rare in real-world data.
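The collapse of stacked linear layers can be checked with a short sketch. It assumes scalar weights and biases with arbitrary illustrative values, and `two_layer` and `collapsed` are hypothetical helper names.

```python
def linear(x, c=2.0):
    # Linear activation: output proportional to the input.
    return c * x

def linear_grad(x, c=2.0):
    # The derivative is the constant c, whatever the input is.
    return c

def two_layer(x, w1=3.0, b1=1.0, w2=0.5, b2=-2.0):
    # Two stacked layers with identity (linear) activations.
    hidden = w1 * x + b1
    return w2 * hidden + b2

def collapsed(x, w1=3.0, b1=1.0, w2=0.5, b2=-2.0):
    # The same mapping written as a single linear layer.
    return (w2 * w1) * x + (w2 * b1 + b2)

for x in [-4.0, 0.0, 1.5, 10.0]:
    print(x, two_layer(x), collapsed(x), linear_grad(x))
```

Both functions return the same value for every input, which is why extra layers add nothing when the activation is linear.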

#### c. Sigmoid function:

The sigmoid function introduces non-linearity into the neural network and allows building effective deep models. It is a special case of the logistic function, with outputs in the range (0, 1). Equation 3 gives the formula for the sigmoid function, which is visualized in figure 3.

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

**Eqn. 3: Sigmoid function**

**Figure 3: Sigmoid function**

As can be seen in the figure, the curve is steep between -2 and 2, meaning that small changes in the input have a big impact on the output. This makes it a suitable choice for classification problems. However, sigmoid suffers from the vanishing gradient problem, which causes slow convergence. Towards either end of the sigmoid function, the output responds very little to changes in the input. The gradient in those regions is small, making backpropagation inefficient: the network learns drastically more slowly or stops learning altogether. Also, the output is not zero-centered, which makes optimization more difficult.
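The vanishing gradient can be seen numerically with a small sketch. It uses the standard sigmoid and its derivative, sigmoid(x) * (1 - sigmoid(x)), and the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), largest at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 for an input of 0 and is close to zero at -10 and 10, which is where learning stalls.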

#### d. Tanh function:

Tanh is a scaled and shifted version of the sigmoid function. Equation 4 shows the tanh function and its relation to sigmoid.

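$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1$$

Here $\sigma$ denotes the sigmoid function.

**Eqn 4: Tanh function and relation with the sigmoid function**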

The advantage of tanh is that its output is zero-centered, since it lies in the range (-1, 1), which makes optimization easier. But as can be observed in figure 4, tanh, like sigmoid, saturates at both ends and therefore suffers from the vanishing gradient problem.

**Figure 4: Tanh function**
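The relation between tanh and sigmoid, tanh(x) = 2 * sigmoid(2x) - 1, can be verified with a short sketch. The helper names `tanh_from_sigmoid` and `tanh_grad` are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    # Tanh written as a scaled and shifted sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, which shrinks towards 0 at both ends.
    return 1.0 - math.tanh(x) ** 2

for x in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    print(x, math.tanh(x), tanh_from_sigmoid(x), tanh_grad(x))
```

The two tanh columns agree up to floating-point rounding, and the gradient column shows the same saturation as sigmoid for large positive or negative inputs.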

#### e. Rectified Linear Units (ReLU):

ReLU is a non-linear function that has been reported to converge about six times faster than tanh. It is also computationally cheaper because of its simplicity, as can be seen in equation 5. The range of ReLU is [0, infinity). Because ReLU outputs exactly 0 for every negative input, it allows sparsity in activation, unlike sigmoid or tanh. In a network with many neurons, sigmoid and tanh produce a non-zero output for almost every neuron, so almost all neurons contribute to the output, which is computationally expensive. ReLU allows the activation to be sparse and efficient.

$$f(x) = \max(0, x)$$

**Equation 5: ReLU function**

Equation 6 shows that the gradient of ReLU is a constant 1 for every positive input, so it does not shrink as the input grows. Hence ReLU mitigates the vanishing gradient problem.

$$f'(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$$

**Equation 6: Gradient of ReLU**

ReLU calculates the maximum of the input and 0, as can be seen in figure 5.

**Figure 5: ReLU function**

Its limitation is that it should only be used within the hidden layers of a neural network model. In the horizontal part of ReLU the gradient is 0, so the weights of neurons receiving input from this region are not updated. This is called the dying ReLU problem. It can cause several neurons to die and stop responding, making a substantial part of the network passive. A dead neuron in the final layer is undesirable, hence ReLU is avoided there.
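The sparsity and the zero-gradient region can both be seen in a minimal sketch of ReLU and its gradient. Using a gradient of 0 at exactly x = 0 is a common convention assumed here, and the sample inputs are arbitrary.

```python
def relu(x):
    # Maximum of the input and 0.
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (0 at x == 0 is a convention).
    return 1.0 if x > 0 else 0.0

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
activations = [relu(x) for x in inputs]
gradients = [relu_grad(x) for x in inputs]

# Fraction of neurons whose output is exactly zero (sparse activation).
sparsity = sum(1 for a in activations if a == 0.0) / len(activations)

print(activations)
print(gradients)
print(sparsity)
```

Every input with a gradient of 0 receives no weight update during backpropagation, which is how a neuron stuck in that region dies.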

#### f. Leaky ReLU:

Leaky ReLU addresses the dying ReLU problem. The horizontal line is replaced by a line with a small non-zero slope through a simple change, as shown in equation 7.

$$f(x) = \begin{cases} x, & x > 0 \\ p\,x, & x \le 0 \end{cases}$$

**Equation 7: Leaky ReLU**

Here p is a small positive slope, usually 0 < p < 1; a value such as 0.01 is a common choice.

Figure 6 shows Leaky ReLU. Its range is (-infinity, infinity), so negative inputs still produce a small non-zero output and gradient, which is why it is often preferred over ReLU.

**Figure 6: Leaky ReLU**
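A minimal sketch of Leaky ReLU and its gradient follows. The default slope `p=0.01` is an assumed, commonly used value rather than one fixed by this article.

```python
def leaky_relu(x, p=0.01):
    # Positive inputs pass through; negative inputs are scaled by the small slope p.
    return x if x > 0 else p * x

def leaky_relu_grad(x, p=0.01):
    # The gradient is p instead of 0 for non-positive inputs, so weights keep updating.
    return 1.0 if x > 0 else p

for x in [-2.0, -0.5, 0.5, 2.0]:
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Unlike ReLU, the gradient never becomes exactly zero, so a neuron that receives negative inputs can still recover.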

#### g. Softmax function:

The softmax function calculates the probability of each target class over all possible target classes, as shown in equation 8. It computes the exponential (e-power) of each input value and the sum of the exponentials of all the input values. The output for a given input value is the ratio of its exponential to that sum, so the outputs are positive and add up to 1.

$$P(y = j \mid x) = \frac{e^{x^{\top} w_j + b_j}}{\sum_{k} e^{x^{\top} w_k + b_k}}$$

**Equation 8: Softmax function**

Here x is the input, and w and b represent the weights and biases of the neural network.
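The computation can be sketched in a few lines of Python. The sketch assumes the class scores (the weighted inputs plus biases) have already been computed, and the example scores are made up. Subtracting the maximum score before exponentiating is a common numerical-stability step that is not part of the description above; it leaves the result unchanged.

```python
import math

def softmax(scores):
    # Shift by the maximum score to avoid overflow; the ratios are unaffected.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    # Each output is the exponential of one score divided by the sum of all exponentials.
    return [e / total for e in exps]

# Hypothetical scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score receives the largest probability, and the probabilities sum to 1 up to floating-point rounding.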

##### About the author:

I am currently working as a data analyst with the Toshiba Research and Development (R&D) team. I am an enthusiastic learner who loves reading about everything, from science and technology to literature. I am a sports fan. I play lawn tennis and table tennis and have won some tournaments. A food lover, peace lover. Above all, I value knowledge, which is only enhanced by sharing. I hope my article will be of some help to readers.
