October 15, 2024

Gradient Descent: a generic algorithm for finding the optimal solution

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea is to tweak the model's parameters iteratively, each step moving in the direction that reduces the cost function, until the algorithm converges to the optimal solution.

An important factor is the size of the steps taken at each iteration, which is determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will need too many iterations to reach the optimal solution. If it is too high, the algorithm may jump around erratically and can end up at a suboptimal solution, or even diverge.
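To make this concrete, here is a minimal sketch of a gradient descent loop on a simple one-dimensional cost J(θ) = θ² (the cost function, starting point, and learning rates are illustrative choices, not from any particular library):

def gradient_descent(eta, n_steps=50, theta0=5.0):
    # Minimize J(theta) = theta**2, whose gradient is 2 * theta
    theta = theta0
    for _ in range(n_steps):
        gradient = 2 * theta            # dJ/dtheta at the current theta
        theta = theta - eta * gradient  # step of size eta in the downhill direction
    return theta

print(gradient_descent(eta=0.1))    # converges close to the minimum at 0
print(gradient_descent(eta=0.001))  # too small: barely moves in 50 steps
print(gradient_descent(eta=1.1))    # too large: the steps overshoot and diverge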

When using Gradient Descent, it is best if the cost function is convex, that is, if we pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. Ideally the cost function is also continuous, with a slope that never changes abruptly. Thanks to these two properties, Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (given enough iterations and a suitable learning rate).
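Formally, a cost function J(θ) is convex if J(t·θ1 + (1 − t)·θ2) ≤ t·J(θ1) + (1 − t)·J(θ2) for any two parameter vectors θ1, θ2 and any t between 0 and 1, which is just the "the chord never dips below the curve" condition stated above; the MSE cost function of Linear Regression, for example, satisfies it.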

When using Gradient Descent, we should also make sure that all the features are on a similar scale, for example by using the StandardScaler class provided by Scikit-Learn; otherwise the algorithm will take much longer to converge to the optimal solution.
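As a quick sketch (X_train and X_test below are hypothetical feature arrays, not defined in this article):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and standard deviation from the training set
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to new data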

Let’s now talk about various types of Gradient Descent algorithms available:

Batch Gradient Descent:

At every step, the algorithm processes the entire training set, that is, it uses the whole batch of training data to compute the gradients. As a result, it is very slow on large training sets. On the other hand, it scales well with the number of features.
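Here is a minimal NumPy sketch of Batch Gradient Descent for Linear Regression with the MSE cost (the synthetic data, learning rate, and number of epochs are illustrative assumptions):

import numpy as np

m = 100                           # number of training instances (synthetic example)
X = 2 * np.random.rand(m, 1)      # one input feature
y = 4 + 3 * X + np.random.randn(m, 1)

X_b = np.c_[np.ones((m, 1)), X]   # add x0 = 1 (bias term) to each instance
eta = 0.1                         # learning rate
n_epochs = 1000
theta = np.random.randn(2, 1)     # random initialization of the parameters

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)  # MSE gradient over the whole training set
    theta = theta - eta * gradients

print(theta)                      # should end up close to [[4.], [3.]]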

Stochastic Gradient Descent:

The algorithm picks a random instance from the training set at every step and computes the gradients based only on that single instance. This makes the algorithm much faster. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration. However, due to its stochastic nature, the cost function keeps bouncing around even after reaching the vicinity of the optimal solution and never settles down. So once the algorithm stops, the final parameter values are good, but not optimal. On the plus side, this stochastic nature helps the algorithm jump out of local minima.

One solution to this problem is to gradually reduce the learning rate. The steps start out large, which helps the algorithm make quick progress and jump out of local minima, then get smaller and smaller, allowing the algorithm to settle at the global minimum. The function that determines the learning rate at each iteration is called the learning schedule.

Since the instances are picked randomly, some instances may be picked several times per epoch while others may not be picked at all. An alternative is to shuffle the training set (input features and labels together), go through it instance by instance, shuffle it again, and so on; however, this approach generally converges more slowly.
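Here is a sketch of Stochastic GD with a simple learning schedule, reusing the X_b, y, and m arrays from the Batch GD sketch above (the schedule constants t0 and t1 are illustrative choices):

n_epochs = 50
t0, t1 = 5, 50                      # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)            # the learning rate shrinks as training progresses

theta = np.random.randn(2, 1)       # random initialization

for epoch in range(n_epochs):
    for iteration in range(m):
        random_index = np.random.randint(m)          # pick a single instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)     # gradient based on that one instance
        eta = learning_schedule(epoch * m + iteration)
        theta = theta - eta * gradients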

To perform Linear Regression using Stochastic GD with Scikit-Learn, we can use the SGDRegressor class. The following code runs for a maximum of 1,000 epochs, or until the loss drops by less than 0.001 during one epoch (the tolerance, tol). It starts with a learning rate of 0.1 (eta0), uses the default learning schedule, and does not use any regularization for now:

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(
    max_iter=1000,   # run for at most 1,000 epochs
    tol=1e-3,        # stop early if the loss improves by less than 0.001 in an epoch
    penalty=None,    # no regularization for now
    eta0=0.1,        # initial learning rate
)
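To actually train the model, you would then call fit; X and y below stand for hypothetical, already scaled training features and targets:

sgd_reg.fit(X, y.ravel())                  # fit expects the targets as a 1D array
print(sgd_reg.intercept_, sgd_reg.coef_)   # learned bias term and feature weights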

Mini-Batch Gradient Descent:

This algorithm is a combination of the previous two: at each step, it computes the gradients on small random sets of instances called mini-batches. The main advantages are a performance boost compared to Batch GD and less erratic progress compared to Stochastic GD. It still doesn't settle exactly at the optimal solution, but it ends up walking closer to it than Stochastic GD does.
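Here is a sketch of Mini-Batch Gradient Descent, again reusing X_b, y, m, and the learning_schedule function from the sketches above (the batch size of 20 is an illustrative choice):

n_epochs = 50
batch_size = 20
theta = np.random.randn(2, 1)                        # random initialization

for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)      # reshuffle the training set each epoch
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, batch_size):
        xi = X_b_shuffled[i:i + batch_size]
        yi = y_shuffled[i:i + batch_size]
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)  # MSE gradient over one mini-batch
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients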

Comparing the 3 Gradient Descent Algorithms

Algorithm        Large m   Out-of-core support   Large n   Hyperparams   Scaling required
Batch GD         Slow      No                    Fast      2             Yes
Stochastic GD    Fast      Yes                   Fast      ≥2            Yes
Mini-batch GD    Fast      Yes                   Fast      ≥2            Yes

(Here m is the number of training instances and n is the number of features.)

Hopefully, this article gave you an overview of Gradient Descent, its different variants, and the basic differences between them.

You can find the documentation for SGDRegressor here.

