Mean Squared Error

The mean squared error (MSE) loss function computes the average, over the samples in a mini-batch, of the squared Euclidean norm of the model's prediction error. If the model predicts $\hat{Y} = \begin{bmatrix} \boldsymbol{\hat{y}}^{(1)} & \boldsymbol{\hat{y}}^{(2)} & \dots & \boldsymbol{\hat{y}}^{(m)} \end{bmatrix}$ and the true values are $Y = \begin{bmatrix} \boldsymbol{y}^{(1)} & \boldsymbol{y}^{(2)} & \dots & \boldsymbol{y}^{(m)} \end{bmatrix}$, with one column per sample, then the mean squared error is computed as

$$\mathcal{L}(\hat{Y},Y) = \frac{1}{m}\sum_{i=1}^m\Vert\boldsymbol{\hat{y}}^{(i)}-\boldsymbol{y}^{(i)}\Vert^2$$

where $m$ is the number of samples in the mini-batch. Each entry of $\hat{Y}$ appears in exactly one squared term of the sum, so differentiating entry by entry with respect to the predicted values yields

$$\nabla_{\hat{Y}}\mathcal{L}(\hat{Y},Y)=\frac{2}{m}(\hat{Y}-Y)$$
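
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `mse_loss` and `mse_grad` are hypothetical, and the matrices are assumed to be nested lists of shape (outputs, $m$) so that column $i$ holds sample $i$, matching the layout of $\hat{Y}$ and $Y$.

```python
def mse_loss(y_hat, y):
    """Mean squared error: (1/m) * sum_i ||y_hat^(i) - y^(i)||^2.

    y_hat and y are nested lists with one row per output component and
    one column per sample (an assumed layout mirroring the text).
    """
    m = len(y[0])  # number of samples = number of columns
    # Summing squared differences over every entry equals summing the
    # squared Euclidean norm of each column.
    total = sum(
        (p - t) ** 2
        for row_p, row_t in zip(y_hat, y)
        for p, t in zip(row_p, row_t)
    )
    return total / m


def mse_grad(y_hat, y):
    """Gradient of the loss with respect to y_hat: (2/m) * (y_hat - y)."""
    m = len(y[0])
    return [
        [2.0 * (p - t) / m for p, t in zip(row_p, row_t)]
        for row_p, row_t in zip(y_hat, y)
    ]


# Two output components, m = 2 samples (illustrative values only).
y_hat = [[1.0, 2.0],
         [3.0, 5.0]]
y = [[0.0, 2.0],
     [1.0, 1.0]]

loss = mse_loss(y_hat, y)   # (1 + 0 + 4 + 16) / 2
grad = mse_grad(y_hat, y)   # (2 / 2) * (y_hat - y)
```

The gradient has the same shape as $\hat{Y}$, which is what a backward pass would typically expect. Note that this sketch divides by $m$ only, as in the definition above; some libraries also average over the output components, which rescales both the loss and the gradient by a constant factor.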