Adam

Adaptive moments

The Adam optimizer is an adaptive learning rate algorithm that is widely used, in part because of its robustness to the choice of hyperparameters. The algorithm keeps running estimates of the first and second moments of the gradient for each parameter $\theta$, so it natively incorporates momentum (through the first-moment estimate). The moment estimates are initialized to $v=0$ and $s=0$. At each iteration, the first moment estimate is updated with

$$v = \beta_1v+(1-\beta_1)\nabla_{\theta}\mathcal{L}(\hat{Y},Y)$$

and the second one with

$$s = \beta_2s + (1-\beta_2)[\nabla_{\theta}\mathcal{L}(\hat{Y}, Y)]^2$$

where $\beta_1$ and $\beta_2$ are the exponential decay rates for the first and second moment estimates respectively, the square in the second update is applied element-wise, and the gradient of the loss function $\nabla_{\theta}\mathcal{L}(\hat{Y},Y)$ is obtained from the backpropagation algorithm. Because both estimates are initialized at zero, they are biased toward zero during the first iterations, so we correct for this bias:

$$\hat{v}=\frac{v}{1-\beta_1^t}$$

$$\hat{s} = \frac{s}{1-\beta_2^t}$$
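
To see why this correction matters, consider the first step ($t=1$): since $v$ starts at zero, the update gives $v=(1-\beta_1)\nabla_{\theta}\mathcal{L}(\hat{Y},Y)$, which with $\beta_1=0.9$ is only one tenth of the gradient. Dividing by $1-\beta_1^1=1-\beta_1$ recovers the full gradient:

$$\hat{v}=\frac{(1-\beta_1)\nabla_{\theta}\mathcal{L}(\hat{Y},Y)}{1-\beta_1}=\nabla_{\theta}\mathcal{L}(\hat{Y},Y)$$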

where $t$ is the time step, incremented after each batch has been processed. Each parameter is finally updated with

$$\theta = \theta - \alpha \frac{\hat{v}}{\sqrt{\hat{s}}+\epsilon}$$
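
To make these update rules concrete, the following is a minimal NumPy sketch of a single Adam step implementing the equations above. The function name `adam_step`, its signature, and the default learning rate are illustrative assumptions and are not neuro's actual API.

```python
import numpy as np

def adam_step(theta, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative single Adam update (not neuro's API).

    theta : parameter array; grad : gradient of the loss w.r.t. theta;
    v, s  : running first- and second-moment estimates (start as zeros);
    t     : 1-based time step.
    """
    # First-moment (momentum-like) estimate.
    v = beta1 * v + (1.0 - beta1) * grad
    # Second-moment estimate (element-wise squared gradient).
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    # Correct for the bias introduced by the zero initialization.
    v_hat = v / (1.0 - beta1 ** t)
    s_hat = s / (1.0 - beta2 ** t)
    # Parameter update.
    theta = theta - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return theta, v, s
```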

where $\alpha$ is the learning rate and $\epsilon$ is a small constant added for numerical stability. In neuro, the default values for the hyperparameters of the optimizer follow the recommendations of Goodfellow et al. [1]:

$$\begin{align} \beta_1 &= 0.9 \newline \beta_2 &= 0.999 \newline \epsilon &= 10^{-8} \end{align}$$
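
As a usage sketch (again hypothetical, not neuro's interface), the step function above can be iterated with these defaults on a toy quadratic loss:

```python
import numpy as np

# Minimize L(theta) = ||theta||^2, whose gradient is 2 * theta,
# using adam_step defined above with the default beta1, beta2 and epsilon.
theta = np.array([1.0, -2.0, 3.0])
v = np.zeros_like(theta)
s = np.zeros_like(theta)

for t in range(1, 1001):
    grad = 2.0 * theta  # gradient of ||theta||^2
    theta, v, s = adam_step(theta, grad, v, s, t,
                            alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8)

print(theta)  # approximately [0, 0, 0]
```

The learning rate $\alpha$ is problem-dependent and chosen here only for the example; the decay rates and $\epsilon$ match the defaults listed above.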

References

[1] Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, Cambridge, MA, 2016