Cross Entropy

The cross-entropy loss function is typically used in multiclass classification problems. For a model with $C$ classes, let $\hat{Y} = \begin{bmatrix} \boldsymbol{\hat{y}}^{(1)} & \boldsymbol{\hat{y}}^{(2)} & \dots & \boldsymbol{\hat{y}}^{(m)} \end{bmatrix}$ be the predicted labels and $Y = \begin{bmatrix} \boldsymbol{y}^{(1)} & \boldsymbol{y}^{(2)} & \dots & \boldsymbol{y}^{(m)} \end{bmatrix}$ the true labels. The cross-entropy loss is then defined as

$$\mathcal{L}(\hat{Y}, Y) = \frac{1}{m}\sum_{i=1}^m\left(-\sum_{j=1}^Cy^{(i)}_j\log\hat{y}^{(i)}_j \right)$$

where $m$ is the number of samples in the mini-batch.
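As a concrete illustration, here is a minimal NumPy sketch of this formula for a small mini-batch, with samples stored as columns as in the definition above. It is only a numerical check, not the neuro implementation, and the array names are chosen for this example.

```python
import numpy as np

# C = 3 classes, m = 2 samples; columns are samples.
Y_hat = np.array([[0.7, 0.1],   # predicted probabilities (each column sums to 1)
                  [0.2, 0.8],
                  [0.1, 0.1]])
Y = np.array([[1.0, 0.0],       # one-hot true labels
              [0.0, 1.0],
              [0.0, 0.0]])

m = Y.shape[1]
loss = -np.sum(Y * np.log(Y_hat)) / m
print(loss)  # (-log 0.7 - log 0.8) / 2, approximately 0.290
```

The derivative of this function with respect to output $j$ of sample $i$ is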

$$\frac{\partial \mathcal{L}}{\partial \hat{y}_j^{(i)}} = -\frac{1}{m}\frac{y_j^{(i)}}{\hat{y}_j^{(i)}}$$

or, in tensor form,

$$\nabla_{\hat{Y}}\mathcal{L}(\hat{Y},Y) = -\frac{1}{m}\frac{Y}{\hat{Y}}$$

where the division is taken element-wise.
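Continuing the sketch above (again illustrative only), the tensor form is a one-liner, and a forward-difference quotient on a single entry agrees with it:

```python
import numpy as np

# Same mini-batch as before: columns are samples.
Y_hat = np.array([[0.7, 0.1],
                  [0.2, 0.8],
                  [0.1, 0.1]])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
m = Y.shape[1]

def loss(Y_hat):
    return -np.sum(Y * np.log(Y_hat)) / m

# Element-wise tensor form of the gradient.
grad = -Y / (m * Y_hat)

# Forward-difference check on the single entry (j=0, i=0).
eps = 1e-6
Y_hat_eps = Y_hat.copy()
Y_hat_eps[0, 0] += eps
numeric = (loss(Y_hat_eps) - loss(Y_hat)) / eps

print(grad[0, 0], numeric)  # both approximately -0.714
```

Now, if the activation of the last layer is a softmax function, the predicted label will be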

$$\hat{y}_j^{(i)} = \frac{e^{x_j^{(i)}}}{\sum_{k=1}^Ce^{x_k^{(i)}}}$$
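A column-wise softmax can be sketched as follows (illustrative, not neuro's code); subtracting the column maximum before exponentiating is a standard stabilization trick and leaves the result unchanged.

```python
import numpy as np

def softmax(X):
    # X has shape (C, m); the softmax is applied to each column (one column per sample).
    Z = X - X.max(axis=0, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

X = np.array([[2.0, 0.5],
              [1.0, 2.5],
              [0.1, 0.3]])
Y_hat = softmax(X)
print(Y_hat.sum(axis=0))  # each column sums to 1
```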

As seen in the derivation of the softmax, the derivative is

$$\frac{\partial \hat{y}_j^{(i)}}{\partial x_l^{(i)}} = \hat{y}_j^{(i)}(\delta_{jl} - \hat{y}_l^{(i)})$$
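For a single sample this Jacobian has a compact matrix form, $\operatorname{diag}(\boldsymbol{\hat{y}}) - \boldsymbol{\hat{y}}\boldsymbol{\hat{y}}^T$, which the following sketch (illustrative only) builds directly:

```python
import numpy as np

def softmax_jacobian(s):
    # s: softmax output of one sample, shape (C,).
    # J[j, l] = s[j] * (delta_{jl} - s[l]), i.e. diag(s) - s s^T.
    return np.diag(s) - np.outer(s, s)

s = np.array([0.7, 0.2, 0.1])
J = softmax_jacobian(s)
print(J.sum(axis=0))  # each column sums to 0, since the softmax outputs always sum to 1
```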

The derivative of $\mathcal{L}$ with respect to $x_l^{(i)}$ then follows from the chain rule,

$$\frac{\partial \mathcal{L}}{\partial x_l^{(i)}} = \sum_{k=1}^C\frac{\partial \mathcal{L}}{\partial \hat{y}_k^{(i)}}\frac{\partial \hat{y}_k^{(i)}}{\partial x_l^{(i)}}$$

where $C$ is the number of classes. Replacing the two factors on the right-hand side with the expressions found previously gives

$$\frac{\partial \mathcal{L}}{\partial x_l^{(i)}} = -\frac{1}{m}\sum_{k=1}^C\frac{y_k^{(i)}}{\hat{y}_k^{(i)}}\hat{y}_k^{(i)}(\delta_{kl} - \hat{y}_l^{(i)}) = -\frac{1}{m}\sum_{k=1}^Cy_k^{(i)}(\delta_{kl} - \hat{y}_l^{(i)})$$

Separating the cases $l=k$ and $l \neq k$ yields

$$\frac{\partial \mathcal{L}}{\partial x_l^{(i)}} = \frac{1}{m}\left(-y_l^{(i)}(1 - \hat{y}_l^{(i)}) + \sum_{k\neq l}^Cy_k^{(i)}\hat{y}_l^{(i)}\right) = \frac{1}{m}\left(\hat{y}_l^{(i)}\left(y_l^{(i)} + \sum_{k\neq l}^Cy_k^{(i)}\right) - y_l^{(i)}\right)$$

but since the true labels are one-hot encoded,

$$y_l^{(i)} + \sum_{k\neq l}^Cy_k^{(i)} = 1$$

so that

$$\frac{\partial \mathcal{L}}{\partial x_l^{(i)}} = \frac{1}{m}\left(\hat{y}_l^{(i)} - y_l^{(i)}\right)$$

Hence, if the activation of the last layer is a softmax function, the gradient of the cross-entropy loss with respect to the linear activation $X = \begin{bmatrix} \boldsymbol{x}^{(1)} & \boldsymbol{x}^{(2)} & \dots & \boldsymbol{x}^{(m)} \end{bmatrix}$ of the last layer (the input of the softmax) takes the simple form

$$\nabla_X\mathcal{L}(\hat{Y}, Y) = \frac{1}{m}(\hat{Y} - Y)$$

which is cheap to compute. For this reason, the current implementation of the cross-entropy loss function in neuro assumes that the last activation is a softmax. Similarly, the softmax activation can only be used in the last layer of the network, paired with the cross-entropy loss. This might change in the future.
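The simplification can be checked numerically. The sketch below (illustrative only, not the neuro code) compares the simplified gradient $\frac{1}{m}(\hat{Y} - Y)$ with the explicit chain-rule computation that multiplies the softmax Jacobian by the gradient of the loss with respect to $\hat{Y}$; all helper names are made up for this example.

```python
import numpy as np

def softmax(X):
    Z = X - X.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
C, m = 4, 3
X = rng.normal(size=(C, m))                   # linear activation of the last layer
Y = np.eye(C)[:, rng.integers(0, C, size=m)]  # random one-hot labels, shape (C, m)

Y_hat = softmax(X)

# Simplified gradient derived above: (Y_hat - Y) / m.
grad_simple = (Y_hat - Y) / m

# Explicit chain rule, sample by sample: dL/dx = J^T (dL/dy_hat).
grad_chain = np.zeros_like(X)
for i in range(m):
    s = Y_hat[:, i]
    J = np.diag(s) - np.outer(s, s)           # softmax Jacobian, J[j, l] = s_j (delta_jl - s_l)
    dL_dyhat = -Y[:, i] / (m * s)             # gradient of the loss w.r.t. the softmax output
    grad_chain[:, i] = J.T @ dL_dyhat

print(np.allclose(grad_simple, grad_chain))   # True
```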