# Machine Learning Optimisation Using Gradient Descent

Gradients are vectors that point in the direction of the steepest ascent of a cost function in machine learning (ML). Gradient descent is a popular optimisation technique used to train complex machine learning and deep learning models. At each step of the optimisation process, the cost function is evaluated to measure how well the current parameters fit the data, and the model keeps refining its parameters until the cost function approaches a minimum.

The iterative procedure of the gradient descent method consists of two stages:

1. Compute the gradient, i.e. the first-order derivative of the function at the current point.
2. Move in the direction opposite to the gradient, away from the incline.
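The two stages above can be sketched in a few lines of Python. The quadratic objective, starting point, and hyperparameter values here are illustrative choices, not part of the original text:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeat the two stages: evaluate the gradient, then step against it."""
    x = x0
    for _ in range(steps):
        g = grad(x)                # stage 1: first-order derivative at x
        x = x - learning_rate * g  # stage 2: move opposite the incline
    return x

# f(x) = x^2 has gradient 2x and its minimum at x = 0.
minimum = gradient_descent(grad=lambda x: 2 * x, x0=5.0)
print(minimum)  # very close to 0.0
```

Each iteration moves the current position a little way downhill, so repeated application drives the position towards the minimiser.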

## Speed of learning

To find the minimum of a given function, gradient descent multiplies the gradient at the current position by a parameter known as the learning rate and subtracts the result from that position. The effectiveness of the technique depends heavily on the learning rate: if it is too low, the model may fail to reach the optimal solution within the available iterations, whereas if it is too high, the updates may overshoot the minimum and the model may fail to converge, or even diverge.

The accompanying illustration shows that the model converges more slowly with a lower learning rate (left) than with a higher one (right). To attain a model with the highest accuracy, it is therefore essential to select an appropriate learning rate.
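The effect of the learning rate can be reproduced numerically with a sketch like the following (the quadratic objective and the three rates are illustrative assumptions):

```python
# Same quadratic f(x) = x^2 each time; only the learning rate varies.
def run(lr, steps=50, x0=5.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # position minus learning rate times gradient
    return abs(x)        # distance from the minimum at x = 0

too_low = run(lr=0.01)   # converges, but is still far from 0 after 50 steps
suitable = run(lr=0.1)   # converges quickly
too_high = run(lr=1.1)   # overshoots on every step, so |x| grows: divergence
```

With the rate too high, each update flips the sign of the position and increases its magnitude, which is exactly the divergence described above.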

There are three distinct implementations of the gradient descent method, each of which is suitable for different scenarios based on the amount of data available and the desired speed-accuracy trade-off. These implementations include:

1. Batch gradient descent: Gradients are calculated over the entire dataset as a single batch, making the process time-consuming and impractical for datasets that exceed the capacity of random access memory (RAM). While this may not be the ideal method for larger datasets, it can produce excellent results for smaller ones because the parameter updates are less volatile.

2. Stochastic gradient descent: Stochastic gradient descent (SGD) can be significantly faster than batch gradient descent because it evaluates the gradient on random samples from the training data. However, this introduces considerable noise and volatility into the model's parameter updates. Reducing the learning rate can mitigate the problem, at the cost of more iterations and longer training time.

3. Mini-batch gradient descent: Mini-batch gradient descent combines the best features of stochastic and batch gradient descent into a streamlined process, updating the model's parameters on small subsets of the training data. It pairs the resilience of stochastic gradient descent with the efficiency of batch gradient descent, making it a strong choice whenever both speed and accuracy matter.
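The mini-batch variant can be sketched as follows for least-squares linear regression. The dataset, batch size, and learning rate are illustrative assumptions; setting the batch size to 1 recovers SGD, and setting it to the full dataset size recovers batch gradient descent:

```python
import numpy as np

# Synthetic regression data: y = X @ true_w plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))             # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]     # indices of one mini-batch
        err = X[b] @ w - y[b]
        grad = X[b].T @ err / len(b)          # gradient on the subset only
        w -= lr * grad

print(w)  # approaches true_w
```

Each update touches only 32 rows, so the cost per step is small, yet averaging over the batch keeps the updates far less noisy than single-sample SGD.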

## Gradient descent optimisation algorithms

### The Momentum Approach

This technique helps stochastic gradient descent make progress towards the desired direction, reducing the oscillations that are inherent to the algorithm. It does so by incorporating the update vector from the previous step: that vector is scaled by a momentum constant, usually set to approximately 0.9, and added to the current gradient step to obtain the final update.
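A minimal sketch of the momentum update, using the 0.9 constant mentioned above; the quadratic objective and the remaining values are illustrative:

```python
def momentum_descent(grad, x0, lr=0.1, beta=0.9, steps=200):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + lr * grad(x)  # fold in the previous update vector
        x = x - v                    # step with the smoothed direction
    return x

# f(x) = x^2 has gradient 2x and its minimum at x = 0.
x_mom = momentum_descent(lambda x: 2 * x, x0=5.0)
```

Because `v` accumulates past updates, components of the gradient that keep pointing the same way are amplified while oscillating components partially cancel.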

### Adagrad

Because it adapts the size of each weight's update to how often the corresponding feature occurs, the Adagrad optimizer is particularly useful for processing sparse data. By assigning a separate learning rate to each parameter at each time step, Adagrad gives less frequent features higher learning rates and more frequent features lower ones. One of its main advantages is that it eliminates the need to tune the learning rate manually; however, because the accumulated squared gradients only ever grow, the effective learning rate can shrink to nearly zero, at which point the model stops learning.
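A sketch of the Adagrad update rule; the objective and hyperparameter values are illustrative assumptions:

```python
import math

def adagrad(grad, x0, lr=0.5, eps=1e-8, steps=500):
    x = list(x0)
    cache = [0.0] * len(x)         # per-parameter sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            cache[i] += g[i] ** 2  # only ever grows, never shrinks
            # larger cache -> smaller effective learning rate for parameter i
            x[i] -= lr * g[i] / (math.sqrt(cache[i]) + eps)
    return x

# f(x, y) = x^2 + 10*y^2 has gradient (2x, 20y) and its minimum at (0, 0).
x_ada = adagrad(lambda p: [2 * p[0], 20 * p[1]], x0=[5.0, 5.0])
```

The monotonically growing `cache` is what makes the per-parameter rates shrink over time, illustrating the vanishing-learning-rate drawback noted above.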

### RMSprop

Although the RMSprop optimizer resembles both gradient descent with momentum and Adagrad, it adjusts parameters in a distinct way. RMSprop limits vertical oscillations, enabling larger steps to be taken horizontally. To do so, the algorithm keeps a decaying moving average of squared gradients, prioritising the most recently measured ones. As a result, RMSprop's learning rate is adaptive: it is not a fixed hyperparameter but varies as training proceeds.
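The decaying moving average described above can be sketched as follows; the decay rate of 0.9 and the other values are common illustrative defaults, not figures from the text:

```python
import math

def rmsprop(grad, x0, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    x = list(x0)
    avg = [0.0] * len(x)  # decaying moving average of squared gradients
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            # recent gradients dominate; old ones fade geometrically
            avg[i] = decay * avg[i] + (1 - decay) * g[i] ** 2
            x[i] -= lr * g[i] / (math.sqrt(avg[i]) + eps)
    return x

# f(x, y) = x^2 + 10*y^2 has gradient (2x, 20y) and its minimum at (0, 0).
x_rms = rmsprop(lambda p: [2 * p[0], 20 * p[1]], x0=[5.0, 5.0])
```

Unlike Adagrad's ever-growing cache, the decaying average can shrink again when gradients become small, so the effective learning rate does not vanish permanently.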

### Adam

Adam is an optimisation algorithm that combines the most advantageous aspects of Adagrad and RMSprop. It adapts the learning rate for each parameter using running averages of the gradients' first moment (the mean) and second moment (the uncentred variance). Its consistent performance has made Adam a popular choice in both machine learning and deep learning applications, and it is frequently the default algorithm of choice for many expert practitioners.
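A sketch of the Adam update, combining a momentum-style average of gradients (first moment) with an RMSprop-style average of squared gradients (second moment); the beta constants and learning rate are the commonly used illustrative defaults:

```python
import math

def adam(grad, x0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    x = list(x0)
    m = [0.0] * len(x)  # first moment: running mean of gradients
    v = [0.0] * len(x)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # correct the zero-initial bias
            v_hat = v[i] / (1 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# f(x, y) = x^2 + 10*y^2 has gradient (2x, 20y) and its minimum at (0, 0).
x_adam = adam(lambda p: [2 * p[0], 20 * p[1]], x0=[5.0, 5.0])
```

The bias-correction terms compensate for both averages starting at zero, which would otherwise make the earliest steps too small.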

Despite its many benefits, the Adam optimizer has been shown to fail to converge even on some simple optimisation problems. The AMSGrad method was developed in an attempt to address this issue; unfortunately, in practice it does not offer consistently better performance than Adam.