Softmax

Multiclass classification

The softmax activation function differs from the other activation functions in that it takes a vector as input rather than a scalar:

$$\vec{g}(\vec{x})=\frac{e^{\vec{x}}}{\sum_{k=1}^ne^{x_k}}$$

where $\vec{x} = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}$. Component-wise, we have

$$g_i(\vec{x})=\frac{e^{x_i}}{\sum_{k=1}^ne^{x_k}}$$
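
As a concrete illustration of the component-wise formula, here is a minimal NumPy sketch (the function name `softmax_naive` is mine, not part of any library):

```python
import numpy as np

def softmax_naive(x):
    """Direct translation of g_i(x) = exp(x_i) / sum_k exp(x_k)."""
    exps = np.exp(x)           # e^{x_i} for every component
    return exps / exps.sum()   # normalize by the sum of exponentials
```

As written, `np.exp` overflows for large inputs, which is exactly the instability discussed next.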

Note that the exponential terms in the above equation can quickly grow and cause numerical instability. To mitigate this undesired behavior, we can multiply both the numerator and the denominator by a constant $C$ such that

$$g_i(\vec{x}) = \frac{Ce^{x_i}}{C\sum_{k=1}^ne^{x_k}} = \frac{e^{\ln C}e^{x_i}}{e^{ \ln C}\sum_{k=1}^ne^{x_k}} = \frac{e^{x_i + \ln C}}{\sum_{k=1}^ne^{x_k+\ln C}}$$

A common choice for the constant is to set $\ln C = -\max_k(x_k)$ [1], which shifts the largest exponent to zero. The definition of the softmax function is such that

$$\sum_{i=1}^ng_i(\vec{x})=1$$

which allows us to interpret $g_i(\vec{x})$ as the probability that $\vec{x}$ belongs to class $i$. The softmax function is therefore widely used as the activation of the output layer in multiclass classification models.
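
Putting the pieces together, here is a hedged sketch of the numerically stable version, using the shift $\ln C = -\max_k(x_k)$ described above, together with a quick check that the outputs sum to one (names are again illustrative):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shift the inputs by max(x) before exponentiating."""
    shifted = x - np.max(x)    # x_i + ln C with ln C = -max_k(x_k)
    exps = np.exp(shifted)     # the largest term is now exp(0) = 1
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # roughly [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0 (up to floating-point rounding)
```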

The partial derivatives of this function are given by

$$\frac{\partial g_i}{\partial x_i} = \frac{e^{x_i}\sum_{k=1}^ne^{x_k} - e^{x_i}e^{x_i}}{\left(\sum_{k=1}^ne^{x_k} \right)^2} = \frac{e^{x_i}}{\sum_{k=1}^ne^{x_k}}\left(1 - \frac{e^{x_i}}{\sum_{k=1}^ne^{x_k}} \right)$$

and

$$\frac{\partial g_i}{\partial x_j} = \frac{-e^{x_i}e^{x_j}}{\left(\sum_{k=1}^ne^{x_k} \right)^2} =- \frac{e^{x_i}}{\sum_{k=1}^ne^{x_k}}\frac{e^{x_j}}{\sum_{k=1}^ne^{x_k}}$$

These two equations can be rewritten more compactly as

$$\frac{\partial g_i}{\partial x_j} = g_i(\vec{x})(\delta_{ij} - g_j(\vec{x}))$$

where $\delta_{ij}$ is the Kronecker delta, defined as

$$\delta_{ij} = \begin{cases} 1 &\text{if } i = j \newline 0 &\text{if } i \neq j \end{cases}$$
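
As a sanity check on the compact formula, one can compare it against a central finite difference for a single pair of indices. This is only a verification sketch (the input vector, index choice, and step size are arbitrary), reusing the stable softmax sketched above:

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))   # stable softmax, as sketched earlier
    return exps / exps.sum()

x = np.array([0.5, -1.0, 2.0])
i, j, eps = 0, 2, 1e-6

s = softmax(x)
analytic = s[i] * ((i == j) - s[j])   # g_i(x) * (delta_ij - g_j(x))

x_plus, x_minus = x.copy(), x.copy()
x_plus[j] += eps
x_minus[j] -= eps
numeric = (softmax(x_plus)[i] - softmax(x_minus)[i]) / (2 * eps)

print(analytic, numeric)  # the two values agree up to the finite-difference error
```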

The Jacobian of $\vec{g}$ is given by

$$\mathbf{J} = \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \frac{\partial g_1}{\partial x_3} & \dots & \frac{\partial g_1}{\partial x_n}\newline \frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \frac{\partial g_2}{\partial x_3} & \dots & \frac{\partial g_2}{\partial x_n}\newline \frac{\partial g_3}{\partial x_1} & \frac{\partial g_3}{\partial x_2} & \frac{\partial g_3}{\partial x_3} & \dots & \frac{\partial g_3}{\partial x_n}\newline \vdots & \vdots & \vdots & \ddots & \vdots\newline \frac{\partial g_n}{\partial x_1} & \frac{\partial g_n}{\partial x_2} & \frac{\partial g_n}{\partial x_3} & \dots & \frac{\partial g_n}{\partial x_n} \end{bmatrix}$$

which, based on the partial derivatives computed above, becomes

$$\mathbf{J} = \begin{bmatrix} g_1(\vec{x})(1 - g_1(\vec{x})) & -g_1(\vec{x})g_2(\vec{x}) & -g_1(\vec{x})g_3(\vec{x}) & \dots & -g_1(\vec{x})g_n(\vec{x})\newline -g_2(\vec{x})g_1(\vec{x}) & g_2(\vec{x})(1 - g_2(\vec{x})) & -g_2(\vec{x})g_3(\vec{x}) & \dots & -g_2(\vec{x})g_n(\vec{x})\newline -g_3(\vec{x})g_1(\vec{x}) & -g_3(\vec{x})g_2(\vec{x}) & g_3(\vec{x})(1 - g_3(\vec{x})) & \dots & -g_3(\vec{x})g_n(\vec{x})\newline \vdots & \vdots & \vdots & \ddots & \vdots\newline -g_n(\vec{x})g_1(\vec{x}) & -g_n(\vec{x})g_2(\vec{x}) & -g_n(\vec{x})g_3(\vec{x}) & \dots & g_n(\vec{x})(1-g_n(\vec{x})) \end{bmatrix}$$
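
In matrix form, this is $\mathbf{J} = \mathrm{diag}(\vec{g}) - \vec{g}\,\vec{g}^{\mathsf{T}}$, which suggests a one-line implementation. A minimal sketch, assuming NumPy and the stable softmax from earlier (the function name `softmax_jacobian` is mine):

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))   # stable softmax, as sketched earlier
    return exps / exps.sum()

def softmax_jacobian(x):
    """J[i, j] = g_i(x) * (delta_ij - g_j(x)), i.e. diag(g) - g g^T."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)
```

Note that the Jacobian is symmetric, since the off-diagonal terms $-g_i(\vec{x})g_j(\vec{x})$ do not depend on the order of the indices.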

Note: for multiclass classification problems, the loss function is usually the cross-entropy loss and the activation of the output layer is the softmax function. When combined, the gradient of the cross-entropy loss with respect to the linear activation of the output layer (that is, the pre-softmax activation) takes a very simple form, as shown here. For this reason, the gradient of the softmax activation function in neuro returns its output, and the softmax should only be used for the output layer when the cross-entropy loss is used. This might change in the future.
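
To illustrate why this combination is convenient (a generic sketch only, not a description of neuro's API): with a one-hot target $\vec{y}$ and cross-entropy loss $L = -\sum_i y_i \ln g_i(\vec{z})$, chaining the loss gradient through the softmax Jacobian above collapses to $\partial L / \partial \vec{z} = \vec{g}(\vec{z}) - \vec{y}$. The function names below are hypothetical.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))   # stable softmax, as sketched earlier
    return exps / exps.sum()

def cross_entropy_softmax_grad(z, y_onehot):
    """Gradient of the cross-entropy loss w.r.t. the pre-softmax activation z.

    Chaining dL/dg_i = -y_i / g_i through J[i, j] = g_i * (delta_ij - g_j)
    collapses to softmax(z) - y.
    """
    return softmax(z) - y_onehot

# Hypothetical example with 3 classes; the true class is index 1.
z = np.array([1.5, 0.3, -0.8])
y = np.array([0.0, 1.0, 0.0])
print(cross_entropy_softmax_grad(z, y))
```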

References

[1] Stanford CS231n course notes: https://cs231n.github.io/linear-classify/#softmax, accessed November 2019.