Initializers

Parameter initialization plays a very important role in deep learning. A proper initialization can determine whether the optimization algorithm converges at all, and it influences the convergence speed. Symmetry breaking is another important part of the initialization process. If two units are connected to the same inputs and start with the same weights and activation function, their weights will be updated identically during optimization, assuming a deterministic loss function is used. The two units therefore learn the same function, which reduces the overall ability of the network to represent complex functions. To avoid this symmetry, the parameters are initialized randomly so that each unit starts with different weights and biases. Several initialization methods have been developed over the last decades; the ones implemented in neuro are presented here.
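
To make the symmetry argument concrete, here is a minimal NumPy sketch (the network shape, the data, and the constant value 0.5 are made up for illustration): two hidden units that start with identical weights receive identical gradients, so a gradient step can never tell them apart.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 3))              # a small batch of inputs
    y = rng.normal(size=(8, 1))              # targets
    W1 = np.full((3, 2), 0.5)                # two hidden units, identical weights
    W2 = np.full((2, 1), 0.5)
    h = np.tanh(x @ W1)                      # hidden activations (identical columns)
    err = h @ W2 - y                         # residual of a squared-error loss
    grad_h = (err @ W2.T) * (1.0 - h**2)     # backprop through tanh
    grad_W1 = x.T @ grad_h                   # gradient w.r.t. W1
    print(grad_W1[:, 0], grad_W1[:, 1])      # the two columns are identical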

Glorot Normal

This initialization scheme was first introduced by Glorot and Bengio [1]. The authors propose to use a normal distribution with mean $\mu=0$ and variance $\sigma^2=2/(n_i+n_{i+1})$, that is

$$X \sim \mathcal{N}\left(0,\frac{2}{n_i+n_{i+1}} \right )=\sqrt{\frac{2}{n_i+n_{i+1}}}\mathcal{N}(0,1)$$

where $n_i$ is the number of input units (sometimes called fan in) and $n_{i+1}$ is the number of output units (sometimes called fan out). Note that this initialization method is sometimes called Xavier normal, after the first name of the first author, Xavier Glorot.
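
As a minimal NumPy sketch of this draw (the layer sizes and the (fan_in, fan_out) weight shape are hypothetical and do not refer to neuro's actual API):

    import numpy as np

    rng = np.random.default_rng()
    fan_in, fan_out = 512, 256                   # hypothetical layer sizes
    std = np.sqrt(2.0 / (fan_in + fan_out))      # sigma = sqrt(2 / (n_i + n_{i+1}))
    W = rng.normal(0.0, std, size=(fan_in, fan_out))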

Glorot Uniform

In this initialization scheme also proposed by Glorot and Bengio [1], the initial values are drawn from the uniform distribution:

$$X \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_i+n_{i+1}}},\sqrt{\frac{6}{n_i+n_{i+1}}}\right)$$

where, as in the Glorot normal initializer, $n_i$ is the number of input units and $n_{i+1}$ the number of output units. This initialization method is likewise sometimes called Xavier uniform, after the first name of the first author, Xavier Glorot.
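
A corresponding NumPy sketch (again with hypothetical layer sizes); note that a uniform distribution on $[-a,a]$ has variance $a^2/3$, so the limit $\sqrt{6/(n_i+n_{i+1})}$ gives the same variance $2/(n_i+n_{i+1})$ as the Glorot normal initializer:

    import numpy as np

    rng = np.random.default_rng()
    fan_in, fan_out = 512, 256                   # hypothetical layer sizes
    limit = np.sqrt(6.0 / (fan_in + fan_out))    # uniform limit
    W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    print(W.var(), 2.0 / (fan_in + fan_out))     # empirically close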

He Normal

He et al. [2] build on the results of Glorot and Bengio and, targeting rectified linear (ReLU) activations, propose to draw the initial values from a normal distribution with mean $\mu=0$ and variance $\sigma^2=2/n_i$:

$$X \sim \mathcal{N}\left(0,\frac{2}{n_i}\right)=\sqrt{\frac{2}{n_i}}\mathcal{N}\left(0,1\right)$$

where $n_i$ is the number of input units.
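
A minimal NumPy sketch with hypothetical layer sizes; the empirical variance of the sample is close to $2/n_i$:

    import numpy as np

    rng = np.random.default_rng()
    fan_in, fan_out = 512, 256                   # hypothetical layer sizes
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
    print(W.var(), 2.0 / fan_in)                 # empirically close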

He Uniform

The He uniform initializer generates random values using a uniform distribution within $\pm\sqrt{6/n_i}$:

$$X \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_i}},\sqrt{\frac{6}{n_i}}\right)$$

where $n_i$ is the number of input units.
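
Sketched in NumPy (hypothetical sizes); the limit $\sqrt{6/n_i}$ again yields variance $6/(3n_i)=2/n_i$, matching the He normal initializer:

    import numpy as np

    rng = np.random.default_rng()
    fan_in, fan_out = 512, 256                   # hypothetical layer sizes
    limit = np.sqrt(6.0 / fan_in)
    W = rng.uniform(-limit, limit, size=(fan_in, fan_out))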

Lecun Normal

LeCun et al. [3] propose to initialize the weights using a normal distribution scaled by a factor $\sqrt{1/n_i}$ where $n_i$ is the number of input units. That is, the random values are drawn from the following distribution:

$$X \sim \mathcal{N}\left(0, \frac{1}{n_i} \right) = \sqrt{\frac{1}{n_i}}\mathcal{N}\left(0, 1 \right)$$
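
A NumPy sketch with hypothetical layer sizes:

    import numpy as np

    rng = np.random.default_rng()
    fan_in, fan_out = 512, 256                   # hypothetical layer sizes
    W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))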

Lecun Uniform

The Lecun uniform initializer uses a uniform distribution within $\pm \sqrt{3/n_i}$:

$$X \sim \mathcal{U}\left(-\sqrt{\frac{3}{n_i}}, \sqrt{\frac{3}{n_i}} \right )$$

where $n_i$ is the number of input units.
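
Sketched in NumPy (hypothetical sizes); the limit $\sqrt{3/n_i}$ gives variance $3/(3n_i)=1/n_i$, the same as the LeCun normal initializer:

    import numpy as np

    rng = np.random.default_rng()
    fan_in, fan_out = 512, 256                   # hypothetical layer sizes
    limit = np.sqrt(3.0 / fan_in)
    W = rng.uniform(-limit, limit, size=(fan_in, fan_out))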

Random Normal

The initial values are drawn from a normal distribution with mean $\mu=0$ and variance $\sigma^2 = 0.1$, i.e.

$$X \sim \mathcal{N}(0,0.1)$$
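
Note that a variance of 0.1 corresponds to a standard deviation of $\sqrt{0.1}\approx 0.316$. A NumPy sketch with a hypothetical weight shape:

    import numpy as np

    rng = np.random.default_rng()
    W = rng.normal(0.0, np.sqrt(0.1), size=(512, 256))   # scale argument is the standard deviation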

Random Uniform

In the random uniform initializer, the initial values are drawn from a uniform distribution between -0.01 and 0.01:

$$X \sim \mathcal{U}(-0.01,0.01)$$
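
The corresponding NumPy sketch (hypothetical weight shape):

    import numpy as np

    rng = np.random.default_rng()
    W = rng.uniform(-0.01, 0.01, size=(512, 256))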

Zeros, Ones, Constant

These initializers set all values to zero, one, or a given constant value, respectively. Note that it is very common to initialize the biases with zeros. The weights, however, should be initialized with random values in order to break the symmetry discussed at the top of the page.
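
A NumPy sketch of the three constant initializers (the shapes and the constant 0.5 are hypothetical); in a typical setup only the bias vector would be initialized this way:

    import numpy as np

    fan_in, fan_out = 512, 256                   # hypothetical layer sizes
    b = np.zeros(fan_out)                        # zeros: common choice for biases
    g = np.ones(fan_out)                         # ones: e.g. for scale parameters
    C = np.full((fan_in, fan_out), 0.5)          # constant initializer with value 0.5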

References

[1] Glorot, X., Bengio, Y., Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 2010.

[2] He, K., Zhang, X., Ren, S., Sun, J., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852, Feb. 2015.

[3] LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R., Efficient BackProp. In: Neural Networks: Tricks of the Trade, pp. 9-48, Springer, Berlin, Heidelberg, 1998.