# Summary of Gradient Descent Algorithms

In this blog, I summarize the characteristics of 3 different Gradient Descent algorithms: Batch Gradient Descent computes the gradients on the full training set, which makes each step slow; Stochastic Gradient Descent computes them on a single random instance at each step; and Mini-batch Gradient Descent computes them on small random sets of instances.

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function. An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. In this blog, I will talk about how the step size is handled in different cases by the following variants, and compare them at the end:

• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent

To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter. This is called a partial derivative.

The partial derivatives of the MSE cost function can be written as a single gradient vector:

$$\nabla_{\theta}\,\mathrm{MSE}(\theta) = \frac{2}{m}\,\mathbf{X}^{T}(\mathbf{X}\theta - \mathbf{y})$$

Notice that this formula involves calculations over the full training set X at each Gradient Descent step! That is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step.
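As a minimal sketch of Batch Gradient Descent (the toy linear data, learning rate, and epoch count are assumptions chosen for illustration):

```python
import numpy as np

# toy linear data (assumed): y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]  # add bias feature x0 = 1

eta = 0.1          # learning rate (assumed)
n_epochs = 1000
m = len(X_b)
theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    # gradient of MSE over the FULL training set at every step
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4], [3]]
```

Each iteration is one vectorized pass over all m instances, which is exactly why this variant slows down as the training set grows.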

The main problem with Batch Gradient Descent is that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent picks a random instance from the training set at each step and computes the gradients based only on that single instance.

Randomness is good for escaping local optima, but bad because it means the algorithm can never settle at the minimum. One solution is to gradually reduce the learning rate: the steps start out large, then get smaller and smaller, allowing the algorithm to settle at the global minimum. However, if you want to be sure that the algorithm goes through every instance in each epoch, another approach is to shuffle the training set, go through it instance by instance, then shuffle it again, and so on.
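A minimal sketch of Stochastic Gradient Descent with a gradually decreasing learning rate (the toy data and the `t0`/`t1` schedule constants are assumptions for illustration):

```python
import numpy as np

# toy linear data (assumed): y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters (assumed values)

def learning_schedule(t):
    # gradually reduce the learning rate as training progresses
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        # pick ONE random instance and compute its gradient only
        idx = np.random.randint(m)
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

print(theta)
```

Because `np.random.randint` samples with replacement, some instances may be visited several times per epoch and others not at all; the shuffle-then-iterate approach described above avoids that.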

To perform Linear Regression using SGD with Scikit-Learn, we can use the SGDRegressor class, which defaults to optimizing the squared error cost function.
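A minimal sketch of that, assuming the same toy linear data as before (the hyperparameter values mirror the 50 epochs, starting learning rate of 0.1, and lack of regularization discussed below):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# toy linear data (assumed): y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)

# 50 epochs (tol=None disables early stopping), eta0=0.1 starting rate,
# default learning schedule, no regularization penalty
sgd_reg = SGDRegressor(max_iter=50, tol=None, penalty=None, eta0=0.1,
                       random_state=42)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_)
```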

The code above runs 50 epochs, starts with a learning rate of 0.1, uses the default learning schedule, and applies no regularization.

## Comparison

Let’s compare the algorithms we’ve discussed, where m is the number of training instances and n is the number of features.

| Algorithm | Large m | Out-of-core support | Large n | Hyperparams | Scaling required | Scikit-Learn class |
|---|---|---|---|---|---|---|
| Normal Equation | Fast | No | Slow | 0 | No | LinearRegression |
| Batch GD | Slow | No | Fast | 2 | Yes | n/a |
| Stochastic GD | Fast | Yes | Fast | >= 2 | Yes | SGDRegressor |
| Mini-batch GD | Fast | Yes | Fast | >= 2 | Yes | SGDRegressor |
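Mini-batch GD, the last row of the table, computes each gradient on a small random subset rather than one instance or the full set. A minimal sketch (the toy data, batch size, learning rate, and epoch count are assumptions for illustration):

```python
import numpy as np

# toy linear data (assumed): y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 100
batch_size = 20   # assumed mini-batch size
eta = 0.05        # assumed learning rate
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    # shuffle once per epoch so every instance is seen each epoch
    shuffled = np.random.permutation(m)
    X_shuf, y_shuf = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuf[start:start + batch_size]
        yi = y_shuf[start:start + batch_size]
        # gradient over a small batch: one vectorized matrix operation,
        # which is where the hardware speedup comes from
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)
        theta = theta - eta * gradients

print(theta)
```

Each step is a single matrix multiply over the batch, so optimized BLAS routines (or a GPU) can process it far faster than 20 separate single-instance updates.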

## Conclusion

In this blog, I summarized the characteristics of 3 different Gradient Descent algorithms: Batch Gradient Descent computes the gradients on the full training set, so each step takes a long time; Stochastic Gradient Descent picks just one instance of the training set per step, which gives it a better chance of finding the global minimum than Batch GD; Mini-batch Gradient Descent computes the gradients on small random sets of instances, which gets a performance boost from hardware-optimized matrix operations. Hope it’s useful for you.