Gradient descent is one of the most popular optimization algorithms used in machine learning!

What is a Gradient?

“A gradient measures how much the output of a function changes if you change the inputs a little bit.” — Lex Fridman (MIT)

In machine learning terms, the gradient measures how much the error changes when the weights change by a small amount.
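To make the idea of "changing the inputs a little bit" concrete, here is a minimal sketch that approximates a gradient numerically with central differences. The error function `error` and the helper `numerical_gradient` are made-up names for illustration, not part of any library.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate the gradient of f at x using central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps  # nudge only the i-th input
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def error(w):
    # Hypothetical error surface: E(w) = w0^2 + 3 * w1^2
    return w[0] ** 2 + 3 * w[1] ** 2

g = numerical_gradient(error, [1.0, 2.0])
print(g)  # close to the analytic gradient [2*w0, 6*w1] = [2, 12]
```

Each entry of `g` tells you how quickly the error changes if you nudge the corresponding weight while holding the others fixed.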

What is Gradient Descent?

Imagine you are walking down a mountain toward the valley floor. You start with bigger steps, but they keep getting shorter as you approach the bottom, so that you don’t overshoot the target area. That’s the concept behind gradient descent, an optimisation algorithm used in ML (as hopefully you can visualise in the attached image).

– It is used to find a local minimum of a differentiable function.

– It is used to find the values of a function’s parameters (coefficients) that minimize a cost function as far as possible.

– The gradient measures the change in error with regard to a change in each of the weights.
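The points above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function being minimized is f(w) = (w - 3)^2, and the names `gradient_descent`, `grad_f`, and `learning_rate` are chosen for this example.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    history = [w]
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)  # move downhill
        history.append(w)
    return w, history

def grad_f(w):
    # Derivative of the example function f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w_min, history = gradient_descent(grad_f, w0=0.0)
print(w_min)  # very close to 3, the minimum of f
```

Notice that the learning rate stays fixed, yet the steps get shorter on their own: the gradient shrinks as `w` approaches the minimum, so each update moves a smaller distance.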

# Variants of Gradient descent:

There are three variants of gradient descent. They differ in the amount of data used to compute the gradient of the objective function.

**Stochastic Gradient Descent:**

Stochastic gradient descent (SGD) computes the gradient using a single sample. Because a gradient estimated from one sample is noisy, SGD updates the parameters often and with high variance. As a result, the objective function fluctuates a lot.

SGD has the advantage of being computationally much cheaper per update. Large datasets are often too large to fit in RAM, which makes vectorizing over the whole dataset impractical. Instead, each sample or batch of samples must be loaded and processed in turn, and the results saved.
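As a rough sketch of SGD, the example below fits a line y = w·x + b to synthetic data, updating the parameters after every single sample. The data, the squared-error loss, and the names `sgd`, `learning_rate`, and `n_epochs` are all assumptions made for illustration.

```python
import numpy as np

# Synthetic data (assumed for this example): y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0 + rng.normal(0.0, 0.1, size=200)

def sgd(X, y, learning_rate=0.05, n_epochs=50, seed=0):
    """Fit y = w*x + b with one parameter update per training sample."""
    order_rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(n_epochs):
        # Visit the samples in a fresh random order each epoch
        for i in order_rng.permutation(len(X)):
            err = (w * X[i] + b) - y[i]            # error on one sample
            w -= learning_rate * 2.0 * err * X[i]  # d(err^2)/dw
            b -= learning_rate * 2.0 * err         # d(err^2)/db
    return w, b

w, b = sgd(X, y)
print(w, b)  # should land near the true values 2 and 1
```

Because each update looks at only one sample, the parameters jitter around the best fit rather than settling exactly on it.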

**Batch Gradient Descent:**

For each step of batch gradient descent, we consider all the examples, which means we compute the derivatives over all the training examples to get the new parameters. As a result, the objective function decreases more smoothly than with SGD.

Batch gradient descent, on the other hand, is computationally very costly when the number of training examples is high, so it is not recommended in that case. Instead, we choose stochastic gradient descent or mini-batch gradient descent, which is discussed next.
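For comparison, here is a minimal sketch of batch gradient descent on a small line-fitting problem, where every update averages the gradient over the entire dataset. The synthetic data and the names `batch_gradient_descent`, `learning_rate`, and `n_steps` are assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumed for this example): y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0 + rng.normal(0.0, 0.1, size=200)

def batch_gradient_descent(X, y, learning_rate=0.1, n_steps=500):
    """Fit y = w*x + b using the full dataset for every update."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        err = (w * X + b) - y              # errors on all samples at once
        grad_w = 2.0 * np.mean(err * X)    # gradient of the mean squared error
        grad_b = 2.0 * np.mean(err)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = batch_gradient_descent(X, y)
print(w, b)  # should land near the true values 2 and 1
```

Each step here touches all 200 samples. That is cheap for a toy dataset, but the cost of a single update grows with the size of the training set.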

**Mini-Batch Gradient Descent:**

This is a form of gradient descent that is often faster in practice than both batch and stochastic gradient descent. We don’t use the whole dataset at once, nor do we use a single example at a time. A mini-batch is a batch with a fixed number of training instances that is smaller than the entire dataset.

This allows one to reap the benefits of both of the previously discussed versions: updates are less noisy than with SGD and cheaper than with batch gradient descent. In exchange, the learning algorithm must be configured with an additional “mini-batch size” hyperparameter.
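A mini-batch version might look like the sketch below, which again fits a line to synthetic data. The data and the names `minibatch_gradient_descent`, `batch_size`, `learning_rate`, and `n_epochs` are assumptions made for this example.

```python
import numpy as np

# Synthetic data (assumed for this example): y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0 + rng.normal(0.0, 0.1, size=200)

def minibatch_gradient_descent(X, y, batch_size=32, learning_rate=0.1,
                               n_epochs=100, seed=0):
    """Fit y = w*x + b, updating once per mini-batch of samples."""
    order_rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(n_epochs):
        order = order_rng.permutation(n)  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            err = (w * X[idx] + b) - y[idx]
            w -= learning_rate * 2.0 * np.mean(err * X[idx])
            b -= learning_rate * 2.0 * np.mean(err)
    return w, b

w, b = minibatch_gradient_descent(X, y)
print(w, b)  # should land near the true values 2 and 1
```

In this sketch, setting `batch_size=1` gives one update per sample (as in SGD), and setting `batch_size=len(X)` gives one update per pass over the data (as in batch gradient descent), which is one way to see mini-batch as the middle ground between the two.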

This brings us to the end of this post, in which we saw how gradient descent and its variants work.