SGD

Stochastic Gradient Descent

The stochastic gradient descent (SGD) optimizer updates the parameters by stepping in the direction of steepest descent, that is, along the negative gradient of the loss. For a parameter $\theta$, the update rule is

$$\theta = \theta - \alpha \nabla_{\theta}\mathcal{L}(\hat{Y},Y)$$

where $\alpha$ is the learning rate and the gradient of the loss function $\nabla_{\theta}\mathcal{L}(\hat{Y},Y)$ is computed with the backpropagation algorithm.
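The following is a minimal Rust sketch of this update rule, applied elementwise to a parameter vector stored as a flat `f64` slice. The function name and signature are illustrative only and are not part of neuro's API; the gradient is assumed to have already been computed by backpropagation.

```rust
// Plain SGD step: theta = theta - alpha * grad (illustrative sketch, not neuro's API).
fn sgd_step(theta: &mut [f64], grad: &[f64], alpha: f64) {
    for (t, g) in theta.iter_mut().zip(grad.iter()) {
        *t -= alpha * g;
    }
}

fn main() {
    // Toy parameter vector and gradient (placeholder values).
    let mut theta = vec![0.5, -1.2, 3.0];
    let grad = vec![0.1, -0.4, 0.25];
    sgd_step(&mut theta, &grad, 0.01);
    println!("{:?}", theta); // approximately [0.499, -1.196, 2.9975]
}
```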

Stochastic Gradient Descent with Momentum

Training a neural network with the SGD optimizer can be slow. A method to improve the convergence rate is to introduce momentum. This method keeps track of past gradients by maintaining a moving average of the gradient. The moving average (sometimes called the velocity, by analogy with physical kinematic systems) is initialized as $v = 0$ and is then updated according to

$$v = \beta v - \alpha \nabla_{\theta}\mathcal{L}(\hat{Y}, Y)$$

where $\beta$ is the exponential decay rate of the first-moment estimate (the momentum). The update rule for $\theta$ for SGD with momentum becomes

$$\theta = \theta + v$$

By default, $\beta$ is set to 0 in neuro’s SGD optimizer implementation. Momentum can be enabled by creating the optimizer with the with_param method. A typical value is $\beta = 0.9$ [1].
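Below is a minimal Rust sketch of a single momentum step, again on flat `f64` slices. The velocity buffer must be initialized to zero before the first step; the function and variable names are placeholders for illustration and do not correspond to neuro's types or API.

```rust
// SGD with momentum: v = beta * v - alpha * grad, then theta = theta + v
// (illustrative sketch, not neuro's API).
fn sgd_momentum_step(
    theta: &mut [f64],
    velocity: &mut [f64], // must be initialized to zero before the first step
    grad: &[f64],
    alpha: f64,
    beta: f64, // momentum, e.g. 0.9
) {
    for ((t, v), g) in theta.iter_mut().zip(velocity.iter_mut()).zip(grad.iter()) {
        *v = beta * *v - alpha * g;
        *t += *v;
    }
}

fn main() {
    // Toy parameters, zero-initialized velocity, and a placeholder gradient.
    let mut theta = vec![0.5, -1.2, 3.0];
    let mut velocity = vec![0.0; 3];
    let grad = vec![0.1, -0.4, 0.25];
    sgd_momentum_step(&mut theta, &mut velocity, &grad, 0.01, 0.9);
    println!("{:?}", theta); // first step matches plain SGD since v started at zero
}
```

Note that on the first step the result coincides with plain SGD (the velocity starts at zero); the accumulated velocity only changes the trajectory on subsequent steps.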

References

[1] Stanford CS231n course notes: https://cs231n.github.io/neural-networks-3/#sgd, accessed November 2019.

[2] Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, Cambridge, MA, 2016.