Backpropagation Derivation
Although I learnt it a long time ago, I have never had a chance to derive it formally, so now is the time to do it.

Definitions

There are n layers; the last layer is the output layer and the first layer is the input layer.

\(\mathbf{z_{i}^{(k)}}\): input of the neuron with index i at layer k,
\[z_{i}^{(k)}=\sum\limits_{j}{y_{j}^{(k-1)}w_{ji}^{(k-1)}}\]

\(\mathbf{y_{i}^{(k)}}\): output of the neuron with index i at layer k,
\[y_{i}^{(k)}=g(z_{i}^{(k)})\]
where g is the activation function. Here we use the logistic function
\[g(z)=\frac{1}{1+e^{-z}}\]

\(\mathbf{w_{ij}^{(k)}}\): weight between neuron i at layer k and neuron j at layer k+1, if the two neurons are connected.

\(\mathbf{E}\): the cost for a particular training case,
\[E=J(y_{1}^{(n)},\dots,y_{I(n)}^{(n)})\]
where J is the cost function and I(t) is the number of neurons in layer t.

Objective

\[\frac{\partial E}{\partial w_{ij}^{(k)}}\]

Derivation

The weight \(w_{ij}^{(k)}\) affects E only through \(z_{j}^{(k+1)}\), so

\[\frac{\partial E}{\partial w_{ij}^{(k)}}
=\frac{\partial E}{\partial z_{j}^{(k+1)}}\frac{\partial z_{j}^{(k+1)}}{\partial w_{ij}^{(k)}}
=\frac{\partial E}{\partial y_{j}^{(k+1)}}\frac{dy_{j}^{(k+1)}}{dz_{j}^{(k+1)}}\,y_{i}^{(k)}\]

There are two cases for the partial derivative \(\frac{\partial E}{\partial y_{i}^{(k)}}\).

Case 1: k = n

The partial derivative can be calculated directly, since \(E=J(\dots,y_i^{(n)},\dots)\).

Case 2: \(2\leq k <n\)

\[\frac{\partial E}{\partial y_{i}^{(k)}}
=\sum\limits_{j}\frac{\partial E}{\partial z_{j}^{(k+1)}}\frac{\partial z_{j}^{(k+1)}}{\partial y_{i}^{(k)}}
=\sum\limits_{j}\frac{\partial E}{\partial z_{j}^{(k+1)}}w_{ij}^{(k)}
=\sum\limits_{j}\frac{\partial E}{\partial y_{j}^{(k+1)}}\frac{dy_{j}^{(k+1)}}{dz_{j}^{(k+1)}}w_{ij}^{(k)}\]

Note that the activation function is
\[y=\frac{1}{1+e^{-z}}\]
so
\[\begin{aligned}
\frac{dy}{dz}&=\frac{d}{dz}\frac{1}{1+e^{-z}}\\
&=\frac{d}{du}\frac{1}{u}\frac{du}{dz}\quad(u=1+e^{-z})\\
&=-\frac{1}{u^2}\frac{du}{dz}\\
&=-\frac{1}{u^2}\frac{d}{dz}(1+e^{-z})\\
&=-\frac{1}{u^2}\frac{d}{du'}(e^{u'})\frac{du'}{dz}\quad(u'=-z)\\
&=\frac{e^{u'}}{u^2}\\
&=\frac{e^{-z}}{(1+e^{-z})^2}\\
&=\frac{1+e^{-z}-1}{(1+e^{-z})^2}\\
&=\frac{1+e^{-z}}{(1+e^{-z})^2}-\frac{1}{(1+e^{-z})^2}\\
&=y-y^2\\
&=y(1-y)
\end{aligned}\]

(It's magical, isn't it? The derivative of the function depends only on the output of the function; this is called mathematical convenience.)

Therefore

\[\frac{\partial E}{\partial y_{i}^{(k)}}=\sum\limits_{j}\frac{\partial E}{\partial y_{j}^{(k+1)}}y_j^{(k+1)}(1-y_j^{(k+1)})w_{ij}^{(k)}\]

and the objective becomes

\[\frac{\partial E}{\partial w_{ij}^{(k)}}=\frac{\partial E}{\partial y_{j}^{(k+1)}}y_j^{(k+1)}(1-y_j^{(k+1)})y_{i}^{(k)}\]

Since \(y_{j}^{(k+1)}\) is in the next layer, \(\frac{\partial E}{\partial y_{j}^{(k+1)}}\) has already been computed. We can repeat this for as many layers as we want, hence we propagate from the last layer (the output layer) back to the second layer.
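As a sanity check on the derivation, here is a minimal NumPy sketch. It assumes a squared-error cost \(J=\frac{1}{2}\sum_i (y_i^{(n)}-t_i)^2\) (the post keeps J generic), stores each layer's weights as a matrix, and uses 0-based indexing; the names sigmoid, forward, backprop, and numerical_grads are my own, not from the original. It applies the rules derived above and compares them against finite-difference gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward pass: z^(k+1) = y^(k) W^(k), y^(k+1) = g(z^(k+1))."""
    ys = [x]
    for W in weights:
        ys.append(sigmoid(ys[-1] @ W))
    return ys

def backprop(ys, weights, t):
    """Gradients dE/dW^(k) for the squared-error cost E = 0.5 * ||y^(n) - t||^2."""
    grads = [None] * len(weights)
    # Case 1 (output layer): dE/dy^(n) follows directly from the cost.
    dE_dy = ys[-1] - t
    for k in reversed(range(len(weights))):
        # dE/dz^(k+1) = dE/dy^(k+1) * y^(k+1) (1 - y^(k+1))   (sigmoid derivative)
        dE_dz = dE_dy * ys[k + 1] * (1.0 - ys[k + 1])
        # dE/dw_ij^(k) = dE/dz_j^(k+1) * y_i^(k)
        grads[k] = np.outer(ys[k], dE_dz)
        # Case 2: dE/dy_i^(k) = sum_j dE/dz_j^(k+1) * w_ij^(k)
        dE_dy = dE_dz @ weights[k].T
    return grads

def numerical_grads(x, weights, t, eps=1e-6):
    """Central finite differences of E with respect to every weight."""
    def cost(ws):
        y = forward(x, ws)[-1]
        return 0.5 * np.sum((y - t) ** 2)
    grads = []
    for k, W in enumerate(weights):
        G = np.zeros_like(W)
        for idx in np.ndindex(W.shape):
            Wp = [w.copy() for w in weights]; Wp[k][idx] += eps
            Wm = [w.copy() for w in weights]; Wm[k][idx] -= eps
            G[idx] = (cost(Wp) - cost(Wm)) / (2 * eps)
        grads.append(G)
    return grads

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A 3-4-2 network; the layer sizes are arbitrary, chosen only for the check.
    weights = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]
    x = rng.standard_normal(3)
    t = rng.standard_normal(2)
    analytic = backprop(forward(x, weights), weights, t)
    numeric = numerical_grads(x, weights, t)
    for a, n in zip(analytic, numeric):
        print(np.max(np.abs(a - n)))  # should be on the order of 1e-9 or smaller
```

Running it prints the largest absolute difference per weight matrix; if the chain-rule steps and the y(1-y) derivative above are right, those differences should be tiny (limited only by finite-difference error).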