Dense

This page is in draft status.

Forward Pass

During the forward pass, the inputs $o_{l-1}$, which are the output of the previous layer (or the training inputs when the layer is the first layer of the network), are multiplied by the weights $W^{[l]}$ of the layer, and the biases $b^{[l]}$ are added to form the linear activation:

$$z^{[l]}=W^{[l]}o_{l-1}+b^{[l]}$$

The linear activation is then used as input for the activation function to compute the nonlinear activation of the layer:

$$o_l = g(z^{[l]})$$

This value is the output of the layer and is passed on to the next layer or is used to compute the loss if it is the output layer.
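As a concrete illustration, here is a minimal NumPy sketch of the forward pass through a single dense layer; the layer sizes, the ReLU activation, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 4, 3                        # assumed layer sizes for the example
o_prev = rng.normal(size=(n_in, 1))       # o_{l-1}: output of the previous layer (column vector)
W = rng.normal(size=(n_out, n_in))        # W^{[l]}: weights of the layer
b = np.zeros((n_out, 1))                  # b^{[l]}: biases of the layer

def g(z):
    """Example activation: ReLU."""
    return np.maximum(0.0, z)

z = W @ o_prev + b                        # linear activation z^{[l]} = W^{[l]} o_{l-1} + b^{[l]}
o = g(z)                                  # nonlinear activation o_l = g(z^{[l]})

print(z.shape, o.shape)                   # both (n_out, 1)
```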

Backward Pass

In the backward pass, we compute how the loss function used to train the model reacts to small variations in the weights, biases, and inputs. That is, if $\mathcal{L}$ is the loss function, we compute $\nabla_{W^{[l]}}\mathcal{L}$, $\nabla_{b^{[l]}}\mathcal{L}$, and $\nabla_{o_{l-1}}\mathcal{L}$. The input of the backward pass corresponds to the gradient of the loss function with respect to the output of the layer. That is, if $o_l^{\prime}$ is the input of the backward pass, we have

$$o^{\prime}_l \equiv \nabla_{o_l}\mathcal{L}$$

Applying the chain rule, we start by computing the partial derivative with respect to the weights:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}}{\partial o_l}\frac{\partial o_l}{\partial W^{[l]}} = o_l^{\prime}\frac{\partial o_l}{\partial z^{[l]}}\frac{\partial z^{[l]}}{\partial W^{[l]}} = (o_l^{\prime}\odot g'(z^{[l]}))o_{l-1}^T$$

where $\odot$ denotes the Hadamard product (i.e. element-wise product) and the transpose of $o_{l-1}$ is taken to have consistent dimensions. We then proceed with the partial derivative with respect to the biases:

$$\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \frac{\partial \mathcal{L}}{\partial o_l}\frac{\partial o_l}{\partial b^{[l]}} = o_l^{\prime}\frac{\partial o_l}{\partial z^{[l]}}\frac{\partial z^{[l]}}{\partial b^{[l]}} = o_l^{\prime}\odot g'(z^{[l]})$$

And finally, the partial derivatives with respect to the layer’s inputs:

$$\frac{\partial \mathcal{L}}{\partial o_{l-1}} = \frac{\partial \mathcal{L}}{\partial o_l}\frac{\partial o_l}{\partial o_{l-1}} = o_l^{\prime}\frac{\partial o_l}{\partial z^{[l]}}\frac{\partial z^{[l]}}{\partial o_{l-1}} = {W^{[l]}}^T (o_l^{\prime}\odot g'(z^{[l]}))$$

where the transpose of the weights is taken to have consistent dimensions. This last value is the output of the backward pass and is passed on to the previous layer. The following figure illustrates the forward and backward passes of the dense layer.
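As a complement, the three gradients translate almost directly into code. The NumPy sketch below continues the forward-pass example above; the ReLU derivative and the upstream gradient `dout` (standing in for $o_l^{\prime}$) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 4, 3
o_prev = rng.normal(size=(n_in, 1))     # o_{l-1}
W = rng.normal(size=(n_out, n_in))      # W^{[l]}
b = np.zeros((n_out, 1))                # b^{[l]}

def g(z):
    return np.maximum(0.0, z)           # example activation: ReLU

def g_prime(z):
    return (z > 0).astype(z.dtype)      # derivative of ReLU

z = W @ o_prev + b                      # recompute (or read from the cache) z^{[l]}
dout = rng.normal(size=(n_out, 1))      # o_l' = dL/do_l, supplied by the next layer or the loss

delta = dout * g_prime(z)               # o_l' ⊙ g'(z^{[l]})
dW = delta @ o_prev.T                   # dL/dW^{[l]}, shape (n_out, n_in), same as W
db = delta                              # dL/db^{[l]}, shape (n_out, 1), same as b
do_prev = W.T @ delta                   # dL/do_{l-1}, passed on to the previous layer

print(dW.shape, db.shape, do_prev.shape)
```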

The parameters of the layer are the weights and biases. During the forward pass, the parameters are used to compute the linear and nonlinear activations, and the inputs and linear activations are cached for later use by the backprop algorithm. During the backward pass, the parameters and the previously cached values are used to compute the gradients with respect to the weights and biases, which are cached in turn. These gradients are used once backprop has been run through every layer; at that point, an optimizer updates all the parameters in the network. Finally, the gradient with respect to the inputs is computed and returned.
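Putting the two passes and the caching behaviour together, a dense layer could be organised roughly as the sketch below. The class name `Dense`, the `update` method, and the use of plain attributes as the cache are assumptions of this sketch, not a prescribed interface; the example handles a single column-vector input.

```python
import numpy as np

class Dense:
    """Minimal dense layer sketch: forward pass, backward pass, and a plain SGD update."""

    def __init__(self, n_in, n_out, activation, activation_prime, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))    # W^{[l]}
        self.b = np.zeros((n_out, 1))                         # b^{[l]}
        self.g = activation
        self.g_prime = activation_prime

    def forward(self, o_prev):
        # Cache the input and the linear activation for the backward pass.
        self.o_prev = o_prev
        self.z = self.W @ o_prev + self.b
        return self.g(self.z)

    def backward(self, dout):
        # dout is o_l' = dL/do_l, supplied by the next layer or by the loss.
        delta = dout * self.g_prime(self.z)
        self.dW = delta @ self.o_prev.T    # cached until the optimizer step
        self.db = delta                    # single example; a batch would sum over the batch axis
        return self.W.T @ delta            # dL/do_{l-1}, passed on to the previous layer

    def update(self, lr):
        # Plain SGD step, standing in for whatever optimizer is used.
        self.W -= lr * self.dW
        self.b -= lr * self.db


# Example usage with an assumed ReLU activation.
layer = Dense(4, 3, lambda z: np.maximum(0.0, z), lambda z: (z > 0).astype(float))
o = layer.forward(np.ones((4, 1)))           # forward: caches o_prev and z
do_prev = layer.backward(np.ones_like(o))    # backward: caches dW and db, returns dL/do_{l-1}
layer.update(lr=0.01)                        # applied after backprop has visited every layer
```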