Initializers
Parameter initialization plays an important role in deep learning. A proper initialization can determine whether the optimization algorithm converges at all, and it influences the convergence speed. Symmetry breaking is also an important part of the initialization process. If two units are connected to the same inputs, start with the same weights, and use the same activation function, their weights will be updated identically during optimization, assuming a deterministic loss function. The units therefore learn the same function, which reduces the overall ability of the network to learn complex functions. To avoid this symmetry, the parameters are initialized randomly so that each unit starts with different weights and biases. Several methods have been developed over the last decades, and the ones implemented in neuro are presented here.
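To make the symmetry argument concrete, the following sketch (plain NumPy, not the neuro API; all names and sizes are illustrative) trains a tiny two-layer network whose two hidden units start with identical weights. The units receive identical gradients and remain identical after every update.

```python
# Two hidden units that start with identical weights receive identical
# gradients and therefore stay identical after every update.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))       # 4 samples, 3 input features
y = rng.normal(size=(4, 1))       # regression targets

W1 = np.full((3, 2), 0.5)         # both hidden units start with the same weights
W2 = np.full((2, 1), 0.5)

for _ in range(10):
    h = np.tanh(x @ W1)           # hidden activations
    err = h @ W2 - y              # dL/dy_hat for a squared-error loss
    dW2 = h.T @ err
    dW1 = x.T @ ((err @ W2.T) * (1 - h ** 2))
    W1 -= 0.01 * dW1
    W2 -= 0.01 * dW2

print(np.allclose(W1[:, 0], W1[:, 1]))   # True: the two units never diverge
```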
Glorot Normal
This initialization scheme was first introduced by Glorot and Bengio [1]. The authors propose to use a normal distribution with mean $\mu=0$ and variance $\sigma^2=2/(n_i+n_{i+1})$, that is
$$X \sim \mathcal{N}\left(0,\frac{2}{n_i+n_{i+1}} \right )=\sqrt{\frac{2}{n_i+n_{i+1}}}\mathcal{N}(0,1)$$
where $n_i$ is the number of input units (sometimes called fan-in) and $n_{i+1}$ the number of output units (sometimes called fan-out). Note that this initialization method is sometimes called Xavier normal, after the first name of the first author, Xavier Glorot.
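As an illustration, a Glorot normal draw can be reproduced with plain NumPy; the function name and shapes below are illustrative and not part of the neuro API.

```python
import numpy as np

rng = np.random.default_rng()

def glorot_normal(fan_in, fan_out):
    std = np.sqrt(2.0 / (fan_in + fan_out))   # sigma^2 = 2 / (n_i + n_{i+1})
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = glorot_normal(256, 128)
print(W.std())   # close to sqrt(2 / 384) ~ 0.072
```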
Glorot Uniform
In this initialization scheme also proposed by Glorot and Bengio [1], the initial values are drawn from the uniform distribution:
$$X \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_i+n_{i+1}}},\sqrt{\frac{6}{n_i+n_{i+1}}}\right)$$
where, as for the Glorot normal initializer, $n_i$ is the number of input units and $n_{i+1}$ the number of output units. Note that this initialization method is sometimes called Xavier uniform, after the first name of the first author, Xavier Glorot.
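A corresponding NumPy sketch (again with illustrative names, not the neuro API) is shown below. The limit is chosen so that the uniform distribution has the same variance as the Glorot normal initializer, since $\mathcal{U}(-a,a)$ has variance $a^2/3$.

```python
import numpy as np

rng = np.random.default_rng()

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    # U(-a, a) has variance a^2 / 3, so this limit reproduces the
    # Glorot normal variance 2 / (n_i + n_{i+1}).
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```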
He Normal
He et al. [2] build on top of the results obtained by Glorot and Bengio and propose to draw the initial values from a normal distribution with mean $\mu=0$ and variance $\sigma^2=2/n_i$:
$$X \sim \mathcal{N}\left(0,\frac{2}{n_i}\right)=\sqrt{\frac{2}{n_i}}\mathcal{N}\left(0,1\right)$$
where $n_i$ is the number of input units.
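A NumPy sketch of this draw (illustrative, not the neuro API):

```python
import numpy as np

rng = np.random.default_rng()

def he_normal(fan_in, fan_out):
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```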
He Uniform
The He uniform initializer generates random values using a uniform distribution within $\pm\sqrt{6/n_i}$:
$$X \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_i}},\sqrt{\frac{6}{n_i}}\right)$$
where $n_i$ is the number of input units.
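In NumPy terms (an illustrative sketch, not the neuro API), the limit again follows from $\mathcal{U}(-a,a)$ having variance $a^2/3$, which reproduces the He normal variance $2/n_i$:

```python
import numpy as np

rng = np.random.default_rng()

def he_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / fan_in)   # a^2 / 3 = 2 / n_i, the He normal variance
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```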
Lecun Normal
LeCun et al. [3] propose to initialize the weights with a zero-mean normal distribution scaled by a factor $\sqrt{1/n_i}$, where $n_i$ is the number of input units. That is, the random values are drawn from the following distribution:
$$X \sim \mathcal{N}\left(0, \frac{1}{n_i} \right) = \sqrt{\frac{1}{n_i}}\mathcal{N}\left(0, 1 \right)$$
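A NumPy sketch of this initializer (illustrative, not the neuro API):

```python
import numpy as np

rng = np.random.default_rng()

def lecun_normal(fan_in, fan_out):
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))
```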
Lecun Uniform
The Lecun uniform initializer uses a uniform distribution within $\pm \sqrt{3/n_i}$:
$$X \sim \mathcal{U}\left(-\sqrt{\frac{3}{n_i}}, \sqrt{\frac{3}{n_i}} \right )$$
where $n_i$ is the number of input units.
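A NumPy sketch (illustrative, not the neuro API); the limit $\sqrt{3/n_i}$ gives the uniform distribution the LeCun normal variance $1/n_i$:

```python
import numpy as np

rng = np.random.default_rng()

def lecun_uniform(fan_in, fan_out):
    limit = np.sqrt(3.0 / fan_in)   # a^2 / 3 = 1 / n_i, the LeCun normal variance
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```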
Random Normal
The initial values are drawn from a normal distribution with mean $\mu=0$ and variance $\sigma^2 = 0.1$, i.e.
$$X \sim \mathcal{N}(0,0.1)$$
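Note that 0.1 is the variance, not the standard deviation; a NumPy sketch (illustrative, not the neuro API) therefore passes $\sqrt{0.1}$ to the sampler:

```python
import numpy as np

# 0.1 is the variance, so the standard deviation passed to the sampler
# is sqrt(0.1) ~ 0.316.
W = np.random.default_rng().normal(0.0, np.sqrt(0.1), size=(256, 128))
```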
Random Uniform
In the random uniform initializer, the initial values are drawn from a uniform distribution between -0.01 and 0.01:
$$X \sim \mathcal{U}(-0.01,0.01)$$
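In NumPy (illustrative, not the neuro API):

```python
import numpy as np

W = np.random.default_rng().uniform(-0.01, 0.01, size=(256, 128))
```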
Zeros, Ones, Constant
These initializers set all values to zeros, ones, or the given constant value. Note that it is very common to initialize the biases with zeros. The weights, however, should be initialized with random values in order to break the symmetry discussed at the top of the page.
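A NumPy sketch of these constant initializers (illustrative, not the neuro API), with the biases set to zero as described above:

```python
import numpy as np

b = np.zeros(128)               # biases: commonly initialized to zero
g = np.ones(128)                # ones, e.g. for a scaling parameter
W = np.full((256, 128), 0.01)   # constant initializer with the given value
```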
References
[1] Glorot, X., Bengio, Y., Understanding the Difficulty of Training Deep Feedforward Neural Networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[2] He, K., Zhang, X., Ren, S., Sun, J., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852, Feb. 2015.
[3] LeCun, Y., Bottou, L., Orr, G. B., Müller, K.-R., Efficient BackProp. In Neural Networks: Tricks of the Trade, Springer, 1998.