Deep Learning: Choosing an Activation Function

Neural networks are constructed by data scientists who must make critical decisions, such as selecting the most suitable activation function, which has a significant impact on the network’s performance. Activation functions are responsible for determining whether a node in a neural network should be activated or not. It’s therefore paramount for data scientists to carefully evaluate the available activation functions and choose the one that will optimise network performance and achieve the intended results.

The aim of this article is to provide an in-depth understanding of activation functions: their types, how they work, and their benefits and drawbacks. First, let’s briefly discuss how neural networks operate. Neural networks are computing models that aim to emulate the human brain’s functions. These networks are made up of layers of interconnected nodes that analyse and identify patterns in data. Activation functions are a critical part of these networks, as they determine the output of each node. By introducing non-linearity, activation functions enable the network to learn intricate patterns and use that knowledge to solve complex problems.

What Is a Neural Network?

The human brain’s capacity to solve intricate problems, such as interpreting cryptic captchas and ancient texts, is remarkable. Computer scientists have attempted to reproduce this capability by constructing artificial neural networks modelled on the brain’s architecture. These networks consist of nodes that operate like neurons in the brain: each node receives incoming data, processes it, and passes on an output. Unlike the brain, however, these networks need to be trained before they can be used.

Typically, a neural network is organised into three kinds of layers:

Input layer: This is where the network receives data, which could range from images to sound recordings, in its raw form. The received data is then forwarded to other nodes for further processing.

Hidden layer: This layer processes the input data. A neural network can have any number of hidden layers, depending on the training required. The processed information is then transferred to the output layer.

Output layer: The generated output comes from this layer.

In a typical diagram of such a network, every circle represents a node with an activation value between zero and one, indicating the degree to which the neuron is activated. Each node in a layer is connected to the nodes of the next layer. The number of nodes and layers in the network’s hidden portion can be increased or decreased, depending on the network’s needs. Data passes from one layer to the next, node to node, until it reaches the output layer, where the nodes with the highest activation values determine the result.

Consider, for example, how such a network recognises a handwritten digit:

When a blurry digit is provided to the system, it breaks the image down into pixels and delivers an activation signal to each input node. Since each digit is distinct, the system must first learn to recognise its boundaries. This is accomplished by comparing the input against the actual digit through a process of trial and error. During this process, some nodes will be activated, while others will not.

Why are certain neurons activated while others are not? To answer that, we need to examine the factors that govern this behaviour, and this is where activation functions become important.

What Is an Activation Function?

An activation function is a mathematical equation that determines whether a node is active or not. If a node is active, it transmits signals to the nodes in the layer immediately above it. To compute a node’s output, each input is multiplied by its weight, a bias is added to the sum, and the activation function is applied to the result.

The mathematical representation is:

z = (weights x inputs) + bias, and output = activation function(z)

Therefore, if x1, x2, x3, …, xn are the inputs, and w1, w2, w3, …, wn are their respective weights, then:

activation = activation function(x1w1 + x2w2 + x3w3 + … + xnwn + bias)

Weights are the coefficients that scale each input and thereby alter the equation’s output. A bias is a constant added to the weighted sum of the inputs, which can tilt the outcome in either direction.
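The weighted-sum-plus-bias computation described above can be sketched in a few lines of Python. The function and variable names here are illustrative, not taken from any particular library:

```python
import math

def node_output(inputs, weights, bias, activation):
    # Weighted sum of inputs plus bias: z = x1*w1 + ... + xn*wn + b
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The activation function is then applied to z
    return activation(z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# z = 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1, then sigmoid(0.1) ≈ 0.525
print(node_output([1.0, 2.0], [0.5, -0.25], 0.1, sigmoid))
```

Any of the activation functions discussed below can be passed in place of `sigmoid`.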

Activation functions are instrumental in introducing nonlinearity into neural networks. Training a neural network to produce a specific output usually takes many iterations, so backpropagation is employed to adjust the network’s weights and biases. Activation functions facilitate backpropagation by directing the gradients to the relevant neurons, allowing the weights and biases to be adjusted as required.
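As a minimal illustration of how an activation function gates gradients during backpropagation, a finite-difference check on ReLU shows that the gradient flows through active nodes but is blocked for inactive ones (this is a toy sketch, not a full training loop):

```python
def relu(z):
    # Rectified Linear Unit: 0 for negative inputs, identity otherwise
    return max(0.0, z)

def numerical_grad(f, z, h=1e-6):
    # Central finite-difference approximation of df/dz
    return (f(z + h) - f(z - h)) / (2 * h)

# Gradient flows through an active node (z > 0)...
print(numerical_grad(relu, 2.0))   # ≈ 1.0
# ...but is blocked for an inactive one (z < 0)
print(numerical_grad(relu, -2.0))  # ≈ 0.0
```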

Types of Activation Functions

One type of activation function is the binary step function.

Mathematically, f(z) = 0 when z ≤ 0, and 1 when z > 0.

The step function, or threshold function, activates a node based on a set threshold: it yields a single output value when the input is above the threshold, and nothing when it falls below it. Because of this all-or-nothing behaviour, the step function is poorly suited to handling intricate patterns. A further drawback is that its gradient is always 0, which makes backpropagation ineffective.
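The binary step function can be sketched directly from its definition (the threshold parameter here is an illustrative generalisation; the formula above fixes it at 0):

```python
def binary_step(z, threshold=0.0):
    # Fires 1 only when the input exceeds the threshold, else 0
    return 1 if z > threshold else 0

print(binary_step(0.4), binary_step(-0.4))  # 1 0
```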


The equation for the sigmoid function is: f(z) = 1/(1 + e^-z).

Nonlinear functions like the sigmoid are extensively utilised, notably in logistic regression. Represented on a graph, the sigmoid appears as an ‘S’-shaped curve, converting the input to a probability value ranging from 0 to 1. It’s important to note that large negative numbers are squashed towards 0, whereas large positive numbers approach 1.

The hidden layers of a convolutional neural network generally avoid this function owing to its demanding computational requirements. Moreover, since the network receives only a very weak gradient for values outside the range [-3, 3], it cannot effectively learn or improve there. This is known as the vanishing gradient problem.
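A short sketch makes the saturation behaviour concrete: the sigmoid’s derivative, s(z)(1 − s(z)), collapses towards zero for large inputs, which is the weak gradient described above:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)  # derivative of the sigmoid

# Outputs saturate towards 0 and 1 for large |z| ...
print(sigmoid(-10), sigmoid(10))
# ... and the gradient collapses outside roughly [-3, 3]
print(sigmoid_grad(0), sigmoid_grad(10))
```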


The equation for the tanh function is: f(z) = tanh(z) = (e^z - e^-z)/(e^z + e^-z).

The hyperbolic tangent (tanh) function is applicable for classifying a dataset into two distinct groups. Similar to the sigmoid function, the non-linear graph of the tanh function takes the form of an ‘S’, but it yields values between -1 and 1. Notably, the tanh function is zero-centred, which permits faster optimisation. As the input increases, the output grows closer to 1; as it decreases, the output gets nearer to -1. However, it is usually avoided as a final output stage.
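A quick sketch of tanh computed from its exponential form (Python’s `math.tanh` gives the same result and is what you would use in practice):

```python
import math

def tanh(z):
    # (e^z - e^-z) / (e^z + e^-z)
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

print(tanh(0.0))   # 0.0: zero-centred at the origin
print(tanh(5.0))   # approaches 1 for large positive inputs
print(tanh(-5.0))  # approaches -1 for large negative inputs
```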


The equation for the ReLU function is: f(z) = max(0, z).

The Rectified Linear Unit (ReLU) is one of the most commonly employed activation functions in convolutional neural networks and deep learning. Unlike the sigmoid and tanh activation functions, ReLU largely avoids the vanishing gradient problem, since its gradient does not shrink towards zero for positive inputs. The range of ReLU runs from zero to positive infinity.

Since it involves no exponential terms, the ReLU function is cheap to compute and produces results quickly. Nonetheless, it’s worth noting that because ReLU is unbounded on the positive side, activations can grow very large during training and cause numerical problems.
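ReLU itself is a one-liner, which is precisely why it is so cheap to compute:

```python
def relu(z):
    # max(0, z): negative inputs map to 0, positives pass through unchanged
    return max(0.0, z)

print([relu(z) for z in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 3.5]
```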


The formula for the softmax function is: f(xi) = e^xi / Σj e^xj.

The softmax function is utilised for non-linear multiclass processing. Typically, it is used for multi-class input classification and implemented in the output layer of a neural network. This function normalises the output for each class to a value between 0 and 1, calculating the probability that an input falls into a specific category by dividing that class’s exponential by the sum of the exponentials across all classes.
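The per-class normalisation described above can be sketched as follows (subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import math

def softmax(zs):
    # Shift by the max for numerical stability, then exponentiate
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    # Each class's exponential divided by the sum over all classes
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # each value lies in (0, 1)
print(sum(probs))  # sums to 1, up to float rounding
```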

Choosing the Optimal Activation Function for Neural Networks

The choice of activation function for a neural network is largely task-specific. For those who are new to deep learning, starting with the sigmoid function is a reasonable approach. As one’s expertise deepens, a wider range of activation functions can be explored.

When dealing with data analysis, two main types of problems may arise – regression and classification. The linear activation function has proven to be a useful solution for regression tasks, while nonlinear functions are generally more apt for classification tasks.

For binary classification, the sigmoid function is a useful activation function, while for multiclass classification, the softmax activation is the preferred choice.

When choosing an activation function, keep the following additional points in mind:

  • To ensure that the neural network model can handle complex scenarios efficiently, use differentiable nonlinear functions in the hidden layers of the model. This improves the model’s ability to fit more intricate patterns.
  • When dealing with hidden layers, ReLU is usually the best choice.
  • Typically, the softmax function is used for the output layer of a neural network model.
  • Currently, the Rectified Linear Unit (ReLU) activation function is the most commonly used, making it an excellent selection for those unsure of which activation function to apply.
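The rules of thumb above can be summarised in a small sketch; `pick_activations` is a hypothetical helper invented for illustration, not part of any framework:

```python
def pick_activations(task, n_hidden):
    # Hypothetical helper: ReLU in every hidden layer, then an output
    # activation matched to the task, per the guidelines above.
    hidden = ["relu"] * n_hidden
    output = {"binary": "sigmoid",
              "multiclass": "softmax",
              "regression": "linear"}[task]
    return hidden + [output]

print(pick_activations("multiclass", 2))  # ['relu', 'relu', 'softmax']
```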

Join the Top 1% of Remote Developers and Designers

Works connects the top 1% of remote developers and designers with the leading brands and startups around the world. We focus on sophisticated, challenging tier-one projects which require highly skilled talent and problem solvers.