Adadelta

This optimizer was first introduced by Zeiler [1] to address two issues that arise in the formulation of the Adagrad optimizer. In the Adagrad optimizer, the update rule is

$$\theta_{t+1} = \theta_t + \Delta \theta_t = \theta_t -\frac{\eta}{\sqrt{\sum_{\tau=1}^t g_\tau^2}}g_t$$

where $\theta_t$ is the parameter at step $t$, $\eta$ the learning rate, and $g_t$ the gradient. As can be seen from this equation, if the gradients are large at the beginning of training, the effective learning rate (the factor multiplying $g_t$) will remain small for the entire run, since the denominator stays large no matter how small the gradient later becomes. This can be mitigated by choosing a large $\eta$, but that makes the method sensitive to the choice of learning rate. Moreover, since the squared gradients keep accumulating in the denominator, the effective learning rate continually decreases and eventually approaches zero, which stops the training. Adadelta was designed to address these two limitations of Adagrad.
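
To make the accumulation concrete, here is a minimal NumPy sketch of a single Adagrad step (an illustration only, not the neuro implementation; the function name and the stabilizing constant `eps` are assumptions):

```python
import numpy as np

def adagrad_step(theta, grad, accum_sq, eta=0.01, eps=1e-8):
    """One Adagrad update: the sum of squared gradients only ever grows,
    so the effective learning rate eta / sqrt(accum_sq) only ever shrinks."""
    accum_sq = accum_sq + grad ** 2                     # sum_{tau=1}^{t} g_tau^2
    theta = theta - eta / np.sqrt(accum_sq + eps) * grad
    return theta, accum_sq
```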

The Adadelta optimizer keeps track of only recent past gradients by accumulating them using an exponentially decaying average:

$$E[g^2]_{t} = \rho E[g^2]_{t-1} + (1-\rho)g_t^2$$

where $\rho$ is the exponential decay rate. That way, large gradients that may arise at the beginning of training are slowly “forgotten” as training progresses and thus do not affect the effective learning rate later in the optimization. Since the denominator in Adagrad is a square root, the corresponding quantity here is a root mean square:

$$RMS[g]_{t} = \sqrt{E[g^2]_t + \epsilon}$$
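
As a small sketch of these two quantities (names are illustrative):

```python
import numpy as np

def decayed_rms(avg_sq, grad, rho=0.95, eps=1e-6):
    """Exponentially decaying average of squared gradients and its RMS."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2   # E[g^2]_t
    rms = np.sqrt(avg_sq + eps)                       # RMS[g]_t
    return avg_sq, rms
```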

where $\epsilon$ is a small constant added for numerical stability. The expression for the numerator follows from the observation that the units of $\Delta \theta_t$ should match those of $\theta_t$; drawing on an analogy with second-order methods, the author proposes accumulating the squared updates in the same way:

$$E[\Delta \theta^2]_{t} = \rho E[\Delta \theta^2]_{t-1} + (1-\rho)\Delta \theta_t^2$$

$$RMS[\Delta \theta]_t = \sqrt{E[\Delta \theta^2]_t + \epsilon}$$

Since $\Delta \theta_t$ is not known until the update has been computed, the accumulated squared updates from the previous step are used in the numerator. Hence, the update term in the Adadelta method is

$$\Delta \theta_t = - \frac{RMS[\Delta \theta]_{t-1}}{RMS[g]_t}g_t$$
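
Putting the pieces together, one full Adadelta step might be sketched as follows (a plain NumPy illustration of the equations above, not the neuro implementation; the function name and state handling are assumptions):

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One Adadelta update with the two decaying accumulators."""
    # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Delta theta_t = -(RMS[Delta theta]_{t-1} / RMS[g]_t) * g_t
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # E[Delta theta^2]_t = rho * E[Delta theta^2]_{t-1} + (1 - rho) * Delta theta_t^2
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2
    theta = theta + delta
    return theta, avg_sq_grad, avg_sq_delta
```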

In neuro, the default values for $\rho$ and $\epsilon$ are

$$\rho = 0.95$$

$$\epsilon = 10^{-6}$$
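
With these defaults, a toy use of the `adadelta_step` sketch above on $f(\theta) = \tfrac{1}{2}\lVert\theta\rVert^2$ could look like this (again purely illustrative):

```python
theta = np.array([1.0, -2.0])
avg_sq_grad = np.zeros_like(theta)    # E[g^2]_0 = 0
avg_sq_delta = np.zeros_like(theta)   # E[Delta theta^2]_0 = 0

for _ in range(1000):
    grad = theta                      # gradient of 0.5 * ||theta||^2
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(
        theta, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6)

print(theta)  # theta has moved toward the minimum at the origin
```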

References

[1] Zeiler, M.D., Adadelta: An Adaptive Learning Rate Method, arXiv:1212.5701v1, 2012.