Neural network: gradient descent and backpropagation
Gradient descent and backpropagation are
two related algorithms used in the training of neural networks.
Gradient descent is an optimization algorithm used to minimize
the loss function of a neural network.
The loss function measures how well the network fits the training data,
and the goal of training is to find the parameter values
that yield the lowest possible loss.
The gradient descent algorithm starts with an initial
set of parameters and iteratively updates them in the direction
of the negative gradient of the loss function until it converges to a (typically local) minimum.
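As a minimal sketch of this update loop (the quadratic example loss, learning rate, and step count below are illustrative assumptions, not part of the text):

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    """Repeatedly step opposite the gradient to reduce the loss."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)  # negative-gradient update
    return theta

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3);
# the iterates converge toward the minimizer theta = 3.
theta = gradient_descent(lambda t: 2 * (t - 3.0), theta=0.0)
```

In a neural network, `theta` would be the full set of weights and biases, and `grad_fn` is where backpropagation comes in.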
Backpropagation is a technique used to efficiently compute
the gradient of the loss function with respect to the parameters of the network.
This gradient is exactly what gradient descent needs
in order to minimize the loss and train the network.
The basic idea of backpropagation is to use the chain rule of calculus
to compute the gradient of the loss function with respect to the parameters,
by propagating the error back through the network
from the output layer to the input layer.
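In one common notation (assumed here, not given in the text), writing z(l) = W(l) a(l-1) + b(l) for the weighted sum at layer l and a(l) = sigma(z(l)) for its activated output, the chain rule behind this backward sweep reads:

```latex
\frac{\partial L}{\partial z^{(l)}}
  = \left( \bigl(W^{(l+1)}\bigr)^{\top} \frac{\partial L}{\partial z^{(l+1)}} \right)
    \odot \sigma'\!\bigl(z^{(l)}\bigr),
\qquad
\frac{\partial L}{\partial W^{(l)}}
  = \frac{\partial L}{\partial z^{(l)}} \bigl(a^{(l-1)}\bigr)^{\top}
```

The first identity carries the error one layer back toward the input; the second turns it into a gradient for that layer's parameters.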
In detail, the process of backpropagation can be broken down into two steps:
Forward propagation: This involves computing the outputs of the network
given the input and current parameter values.
The computation of each output involves applying a non-linear activation function
to the weighted sum of the inputs,
and the outputs of one layer serve as the inputs to the next layer.
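The forward pass just described can be sketched in NumPy; the layer sizes and the tanh activation below are illustrative assumptions:

```python
import numpy as np

def forward(x, params):
    """Forward propagation: each layer applies a non-linear activation
    to a weighted sum of its inputs; its outputs feed the next layer."""
    activations = [x]
    for W, b in params:
        z = W @ activations[-1] + b      # weighted sum of the inputs
        activations.append(np.tanh(z))   # non-linear activation
    return activations

# Two layers: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 3)) * 0.1, np.zeros(4)),
          (rng.standard_normal((2, 4)) * 0.1, np.zeros(2))]
acts = forward(rng.standard_normal(3), params)
```

Keeping every layer's activations around, rather than only the final output, is what makes the backward pass cheap: they are reused when the gradients are computed.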
Backward propagation: This involves computing the gradient of the loss function
with respect to the parameters of the network.
The gradient is computed by propagating the error from the output layer back
to the input layer, using the chain rule of calculus.
At each layer, the gradient of the loss with respect to that layer's outputs
is multiplied by the derivative of the activation function
to obtain the gradient of the loss with respect to the layer's weighted sum (its pre-activation).
The gradient of the loss with respect to the layer's weights is then
this pre-activation gradient multiplied by the layer's inputs,
and multiplying it by the weights instead carries the error back to the outputs of the previous layer.
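A minimal NumPy sketch of this backward pass, assuming a tanh activation and a squared-error loss (both illustrative choices, not fixed by the text):

```python
import numpy as np

def backward(x, y, params):
    """Backward propagation for a small tanh network with squared-error loss.
    Returns the gradient of the loss with respect to each (W, b) pair."""
    # Forward pass, storing every layer's activations for reuse.
    acts = [x]
    for W, b in params:
        acts.append(np.tanh(W @ acts[-1] + b))

    # Error at the output layer: dL/dz for L = 0.5 * ||a - y||^2,
    # using tanh'(z) = 1 - tanh(z)^2.
    delta = (acts[-1] - y) * (1 - acts[-1] ** 2)
    grads = []
    for (W, b), a_in in zip(reversed(params), reversed(acts[:-1])):
        grads.append((np.outer(delta, a_in), delta))  # dL/dW, dL/db
        delta = (W.T @ delta) * (1 - a_in ** 2)       # carry error back a layer
    return list(reversed(grads))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(2)
params = [(rng.standard_normal((4, 3)) * 0.5, np.zeros(4)),
          (rng.standard_normal((2, 4)) * 0.5, np.zeros(2))]
grads = backward(x, y, params)
```

Each returned gradient has the same shape as the parameter it corresponds to, which is what lets a gradient descent step subtract them elementwise.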
These two steps are repeated many times,
with the parameters updated after each iteration using gradient descent,
until the loss function reaches a minimum or stops improving.
The final values of the parameters are the learned parameters of the network,
which can be used to make predictions on new, unseen data.
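Putting the pieces together, here is a sketch of the full cycle: a one-hidden-layer tanh network trained by repeated forward passes, backward passes, and gradient descent steps to fit XOR (the dataset, layer size, learning rate, and iteration count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny dataset: XOR, a classic function no linear model can fit.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([0., 1., 1., 0.])

# One hidden tanh layer, linear output; small random initial weights.
W1, b1 = rng.standard_normal((8, 2)) * 0.5, np.zeros(8)
w2, b2 = rng.standard_normal(8) * 0.5, 0.0
lr = 0.5

for _ in range(2000):
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
    dw2 = np.zeros_like(w2); db2 = 0.0
    for x, y in zip(X, Y):
        h = np.tanh(W1 @ x + b1)           # forward pass
        pred = w2 @ h + b2
        err = pred - y                     # dL/dpred for L = 0.5*(pred - y)^2
        dw2 += err * h; db2 += err         # backward pass (chain rule)
        dh = err * w2 * (1 - h ** 2)
        dW1 += np.outer(dh, x); db1 += dh
    # Gradient descent step on the averaged batch gradients.
    W1 -= lr * dW1 / len(X); b1 -= lr * db1 / len(X)
    w2 -= lr * dw2 / len(X); b2 -= lr * db2 / len(X)

preds = np.array([w2 @ np.tanh(W1 @ x + b1) + b2 for x in X])
```

After training, the predictions should track the XOR targets far better than a constant predictor, and the learned weights can then score new inputs with a single forward pass.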