Backpropagation Explained
Stochastic Gradient Descent with Chain Rule
Backpropagation is the process by which we reduce the error of a neural network slightly with each iteration by adjusting its weights.
After completing a feedforward pass, we get the output and calculate the error by comparing the predicted output with the actual output. Then we are ready to go backwards to change the weights with the goal of decreasing the network error.
Going back from the output to the input and changing the weights along the way is called backpropagation, which is essentially stochastic gradient descent computed using the chain rule.
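To make this concrete, here is a minimal sketch (my own illustration, not code from the lesson) of a feedforward pass through a single sigmoid neuron followed by the error calculation; the input values, weights, and target are made up for the example.

```python
import numpy as np

def sigmoid(z):
    # Squashing activation applied by the neuron
    return 1.0 / (1.0 + np.exp(-z))

# Made-up input, weights, and target for illustration
x = np.array([0.5, -1.2])      # input features
w = np.array([0.8, 0.3])       # weights of a single neuron
target = 1.0                   # the actual (desired) output

# Feedforward pass: weighted sum of the inputs, then the activation
output = sigmoid(np.dot(w, x))

# Network error for this example (squared error)
error = 0.5 * (target - output) ** 2
print(output, error)
```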
Since partial derivatives are the key mathematical concept used in backpropagation, it’s important that you feel confident in your ability to calculate them. Once you know how to calculate basic derivatives, calculating partial derivatives is easy to understand.
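As a quick refresher (an example of my own, not from the lesson): if E(w1, w2) = w1² + 3·w1·w2, then treating w2 as a constant gives ∂E/∂w1 = 2·w1 + 3·w2, while treating w1 as a constant gives ∂E/∂w2 = 3·w1.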
For more information on partial derivatives, use the following link.
Neural networks are the main building block of AI. Thanks to open-source tools, one doesn't need a deep understanding of mathematics to implement a basic neural network, but to really understand how it works and to optimize an application, it's always important to know the math.
Our goal is to find a set of weights that minimizes the network error. We do that using an iterative process, presenting our network with one input at a time from our training set.
As mentioned before, we use the feedforward pass to calculate the network error. We can then use this error to slightly change the weights in the correct direction, each time reducing the error by just a bit. We continue to do so until we determine that the error is small enough.
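Schematically, the whole process looks something like the sketch below. The single sigmoid neuron, the tiny training set, the learning rate, and the "small enough" threshold are all assumptions made for illustration, and the gradient is estimated numerically here just to show the idea; backpropagation, described below, computes it analytically and far more efficiently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def network_error(w, inputs, targets):
    # Total squared error of a single sigmoid neuron over the training set
    predictions = sigmoid(inputs @ w)
    return 0.5 * np.sum((targets - predictions) ** 2)

# Tiny made-up training set: three examples with two features each
inputs = np.array([[1.0, 0.5], [-1.0, -0.5], [0.5, 1.0]])
targets = np.array([1.0, 0.0, 1.0])

w = np.array([0.1, -0.1])    # initial weights
learning_rate = 0.5          # assumed step size
eps = 1e-6                   # step used for the finite-difference gradient estimate

print("error before training:", network_error(w, inputs, targets))

for iteration in range(2000):
    error = network_error(w, inputs, targets)
    if error < 0.05:                   # "small enough" threshold, chosen for illustration
        break
    # Estimate how the error changes with each weight (backprop does this analytically)
    grad = np.array([
        (network_error(w + eps * np.eye(2)[i], inputs, targets) - error) / eps
        for i in range(2)
    ])
    w = w - learning_rate * grad       # step in the negative direction of the gradient

print("error after training:", network_error(w, inputs, targets))
```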
Now how small an error is small enough? To answer that question we have to understand the concept of overfitting. For the time being, let us imagine a network with only one weight W and plot the error E as a function of that weight.
Imagine that at a certain stage in the training process we are at a point A on this error curve, where the weight is W(A) and the error is E(A).
To reduce the error we have to increase the weight. The derivative, in other words the gradient, which is the slope of the curve at point A, is negative (the curve is pointing down), so taking a step in the negative direction of the gradient correctly increases the value of the weight.
On the other hand, if we consider a point B where the weight is W(B) and the error is E(B), to reduce the error we have to decrease the weight. Looking at the gradient at point B, we can see that it is positive. In this case too, changing the weight by taking a step in the negative direction of the gradient means that we are correctly decreasing the value of the weight.
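A quick numerical check with a made-up one-weight error curve, E(W) = (W − 2)², illustrates both cases:

```python
def error(w):
    # Made-up one-weight error curve with its minimum at w = 2
    return (w - 2.0) ** 2

def gradient(w):
    # dE/dw for the curve above
    return 2.0 * (w - 2.0)

learning_rate = 0.1

# Point A: weight below the minimum, gradient is negative
w_a = 0.5
print(gradient(w_a))                        # -3.0 (negative slope)
print(w_a - learning_rate * gradient(w_a))  # 0.8: the weight correctly increases

# Point B: weight above the minimum, gradient is positive
w_b = 4.0
print(gradient(w_b))                        # 4.0 (positive slope)
print(w_b - learning_rate * gradient(w_b))  # 3.6: the weight correctly decreases
```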
The network we considered here had a single weight, which is an oversimplification of the more practical case where a neural network has many weights.
We can summarize the weight update process using this equation.
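For each weight W_i, the standard gradient descent update is

W_i(new) = W_i(old) − α · ∂E/∂W_i

where α is the learning rate that controls how big a step we take, and ∂E/∂W_i is the partial derivative of the error with respect to that weight.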
We are looking at partial derivatives here because the error is a function of many variables, and the partial derivative lets us measure how the error is affected by each weight separately.
Now, to understand how a partial derivative is calculated in a multi-layer neural network, we need to understand the Chain Rule.
The chain rule says that if you have a variable x and a function f that you apply to x to get f(x), which we will call A (that is, A = f(x)), and then another function g that you apply to f(x) to get g(f(x)), which we will call B, then the partial derivative of B with respect to x is just the partial derivative of B with respect to A multiplied by the partial derivative of A with respect to x; in symbols, ∂B/∂x = (∂B/∂A) · (∂A/∂x).
This says that when composing functions, the derivatives simply multiply. The feedforward pass is literally composing a bunch of functions, and backpropagation is literally taking the derivative at each piece; since the derivative of a composition is the product of the derivatives of its pieces, to get the gradient we want, all we have to do is multiply a bunch of partial derivatives.
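To see this multiplication of partial derivatives in action, here is a minimal sketch for a single weight feeding a sigmoid neuron with a squared error; the numbers are made up, and the finite-difference check at the end is only there to confirm the result:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values for a single weight, input, and target
x, w, target = 1.5, 0.4, 1.0

# Feedforward: a composition of functions
z = w * x                            # A = f(w) = w * x
out = sigmoid(z)                     # B = g(A) = sigmoid(A)
error = 0.5 * (target - out) ** 2    # E = h(B), the squared error

# Backpropagation: multiply the local (partial) derivatives
dE_dout = out - target          # derivative of the error w.r.t. the output
dout_dz = out * (1.0 - out)     # derivative of the sigmoid w.r.t. its input
dz_dw = x                       # derivative of the weighted sum w.r.t. the weight
dE_dw = dE_dout * dout_dz * dz_dw

# Finite-difference check of the same derivative
eps = 1e-6
error_shifted = 0.5 * (target - sigmoid((w + eps) * x)) ** 2
print(dE_dw, (error_shifted - error) / eps)   # the two values agree closely
```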
Thus backpropagation, which is essentially stochastic gradient descent, is achieved using the chain rule.
References:
Udacity