Deep learning is huge in machine learning at the moment, and no wonder – it is making large and important strides in solving problems in computer vision, natural language processing, reinforcement learning and many other areas. Deep learning neural networks are neural networks characterized by *many layers* – making them *deep* rather than *wide*. Deep networks have been demonstrated to be more practically capable of solving problems than simple, wide two-layer networks. Neural networks have been around for a long time, but initial success using them was elusive. One of the issues that had to be overcome in making them more useful, and in transitioning to modern deep learning networks, was the *vanishing gradient* problem. This problem manifests as the early layers of deep neural networks not learning (or learning very slowly), resulting in difficulties in solving practical problems.

This post will examine the vanishing gradient problem, and demonstrate how it can be improved through the use of the rectified linear unit activation function, or ReLU. The examination will take place using TensorFlow and visualizing with the TensorBoard utility. The TensorFlow code used in this tutorial can be found on this site’s Github repository.

### Eager to learn more? Get the book here

## The vanishing gradient problem

The vanishing gradient problem arises due to the nature of the back-propagation optimization which occurs in neural network training (for a comprehensive introduction to back-propagation, see my free ebook). The weight and bias values in the various layers within a neural network are updated each optimization iteration by stepping in the direction of the negative *gradient* of the loss function with respect to the weight/bias values. In other words, the weight values change in proportion to the following gradient:

$$ \partial C/ \partial W_l $$

Where $W_l$ represents the weights of layer *l* and *C* is the cost or loss function at the output layer (again, if these terms are gibberish to you, check out my free ebook which will get you up to speed). In the final layer, this calculation is straightforward; however, in earlier layers, the back-propagation of errors method needs to be utilized. At the final layer, the error term $\delta$ looks like:

$$\delta_i^{(n_l)} = -(y_i - h_i^{(n_l)})\cdot f^\prime(z_i^{(n_l)})$$

Don’t worry too much about the notation, but basically the equation above shows first that the error is related to the difference between the output of the network $h_i^{(n_l)}$ and the training labels $y_i$ (i.e. $(y_i - h_i^{(n_l)})$). It is also, more importantly for the vanishing gradient problem, proportional to the derivative of the activation function $f^\prime(z_i^{(n_l)})$. The weights in the final layer change in direct proportion to this $\delta$ value. For earlier layers, the error from the later layers is back-propagated via the following rule:

$$\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})$$

Again, in the second part of this equation, there is the derivative of the activation function $f'(z^{(l)})$. Notice that $\delta^{(l)}$ is also proportional to the error propagated from the downstream layer, $\delta^{(l+1)}$. These downstream $\delta$ values in turn include their own derivative terms, $f'(z^{(l+1)})$, $f'(z^{(l+2)})$ and so on. So, basically, the gradient of the weights of a given layer with respect to the loss function, which controls how these weight values are updated, is proportional to chained multiplications of the derivative of the activation function, i.e.:

$$ \frac{\partial C} {\partial W_l} \propto f'(z^{(l)}) f'(z^{(l+1)}) f'(z^{(l+2)}) \dots$$

The vanishing gradient problem comes about in deep neural networks when the $f'$ terms are all outputting values << 1. When we multiply lots of numbers << 1 together, we end up with a vanishing product, which leads to a very small $\frac{\partial C} {\partial W_l}$ value and hence practically no learning of the weight values – the predictive power of the neural network then plateaus.
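This chaining effect is easy to see numerically. Below is a small pure-Python sketch (mine, not from the tutorial’s repository) that multiplies together a run of identical derivative values, all well below 1:

```python
def chained_gradient(derivative, num_layers):
    """Product of `num_layers` identical derivative terms, mimicking the
    chained multiplication in the back-propagated gradient expression."""
    product = 1.0
    for _ in range(num_layers):
        product *= derivative
    return product

# With derivative values of 0.2, the product collapses quickly as
# the number of chained layers grows:
for n in (2, 4, 10, 20):
    print(n, chained_gradient(0.2, n))
```

Even at 10 layers the product is already down around $10^{-7}$ – effectively no gradient signal reaches the early layers.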

### The sigmoid activation function

The vanishing gradient problem is particularly problematic with sigmoid activation functions. The plot below shows the sigmoid activation function and its first derivative:
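The plotted curves are easy to compute directly. A small pure-Python sketch (mine, not from the tutorial’s repository) of the sigmoid and its first derivative:

```python
import math

def sigmoid(x):
    # The standard logistic function: f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # A handy identity for the sigmoid: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 with a value of 0.25, and decays
# towards zero in both saturated regions:
print(sigmoid_derivative(0.0))   # 0.25
```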

As can be observed, when the sigmoid function value is either too high or too low, the derivative (orange line) becomes very small i.e. << 1. This causes vanishing gradients and poor learning for deep networks. It can occur when the weights of our networks are initialized poorly – with too-large negative and positive values. These too-large values *saturate* the input to the sigmoid and push the derivatives into the small-valued regions. However, even if the weights are initialized nicely and the derivatives are sitting near the maximum of 0.25, i.e. ~0.2, with many layers there will still be a vanishing gradient problem. With only 4 layers of 0.2-valued derivatives we have a product of $0.2^{4} = 0.0016$ – not very large! Consider how the ResNet architecture, generally with tens or hundreds of layers, would train using sigmoid activation functions, even with the best initialized weights. Most of the layers would be static or dead and impervious to training.

So what’s the solution to this problem? It’s called a rectified linear unit activation function, or ReLU.

### The ReLU activation function

The ReLU activation function is defined as:

$$f(x) = \max(0, x)$$

This function and its first derivative look like:

As can be observed, the ReLU activation simply returns its argument *x* whenever it is greater than zero, and returns 0 otherwise. The first derivative of ReLU is also very simple – it is equal to 1 when *x* is greater than zero, but otherwise it is 0. You can probably see the advantage of ReLU at this point – when its derivative is back-propagated there is no degradation of the error signal, as 1 x 1 x 1 x 1… = 1. However, the ReLU activation still maintains a non-linearity or “switch on” characteristic which enables it to behave analogously to a biological neuron.
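Both the function and its derivative can be written in a couple of lines each. A pure-Python sketch (mine, not from the tutorial’s repository):

```python
def relu(x):
    # f(x) = max(0, x)
    return max(0.0, x)

def relu_derivative(x):
    # Gradient is 1 whenever x > 0, and 0 otherwise
    return 1.0 if x > 0 else 0.0
```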

There is only one problem with the ReLU activation – sometimes, because the derivative is zero when *x* < 0, certain weights can be “killed off” or become “dead”. This is because the back-propagated error is cancelled out whenever there is a negative input into a given neuron, and therefore the gradient $\frac{\partial C} {\partial W_l}$ also falls to zero. This means there is no way for the associated weights to update, which can obviously impact learning.

What’s the solution? A variant of ReLU which is called a Leaky ReLU activation.

### The Leaky ReLU activation

The Leaky ReLU activation is defined as:

$$f(x) = \max(0.01x, x)$$

As you can observe, when *x* is below zero, the output switches from *x* to 0.01*x*. I won’t plot this activation function, as it is too difficult to see the difference between 0.01*x* and 0, and therefore in plots it looks just like a normal ReLU. However, the good thing about the Leaky ReLU activation function is that the derivative when *x* is below zero is 0.01 – small, but no longer 0. This gives the neuron and associated weights the *chance* to reactivate, and therefore this should improve the overall learning performance.
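The Leaky ReLU takes the same couple of lines as the standard version. A pure-Python sketch (mine, not from the tutorial’s repository), with the 0.01 slope from the definition above exposed as a parameter:

```python
def leaky_relu(x, alpha=0.01):
    # f(x) = max(alpha * x, x); for x < 0, alpha * x > x, so the
    # "leaky" branch is selected
    return max(alpha * x, x)

def leaky_relu_derivative(x, alpha=0.01):
    # Gradient is 1 for x > 0, and alpha (small but non-zero) otherwise
    return 1.0 if x > 0 else alpha
```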

Now it’s time to test out these ideas in a real example using TensorFlow.

## Demonstrating the vanishing gradient problem in TensorFlow

### Creating the model

In the TensorFlow code I am about to show you, we’ll be creating a 7-layer densely connected network (including the input and output layers) and using the TensorFlow summary operations and TensorBoard visualization to see what is going on with the gradients. The code uses the TensorFlow layers (tf.layers) framework, which allows quick and easy building of networks. The data we will be training the network on is the MNIST hand-written digit recognition dataset that comes packaged up with the TensorFlow installation.

To create the dataset, we can run the following:
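The exact code is in the Github repository; a sketch using the TensorFlow 1.x tutorial helper of the time might look like the following (the `"MNIST_data"` download path is my assumption):

```python
from tensorflow.examples.tutorials.mnist import input_data

# Downloads (if necessary) and loads MNIST, with labels one-hot encoded
mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
```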

The MNIST data can be extracted from this mnist object by calling mnist.train.next_batch(batch_size). In this case, we’ll just be looking at the training data, but you can also extract a test dataset from the same object. In this example, I’ll be using the feed_dict methodology and placeholder variables to feed in the training data, which isn’t the optimal method (see my Dataset tutorial for the most efficient data consumption methodology) but it will do for these purposes. First, I’ll setup the data placeholders:
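The original code is on the Github repository; a sketch consistent with the description below might look like this (the attribute names *self._input_size* and *self._label_size* follow the text; the placeholder variable names are my assumptions):

```python
import tensorflow as tf

# Sketch – these lines live inside the Model class constructor;
# self._input_size = 784 and self._label_size = 10 for MNIST
self.input_images = tf.placeholder(tf.float32,
                                   shape=[None, self._input_size])
self.labels = tf.placeholder(tf.float32,
                             shape=[None, self._label_size])
```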

Note, I have created these variables in an overarching class called Model, hence all the *self* references. The MNIST data input size (*self._input_size*) is equal to the 28 x 28 image pixels i.e. 784 pixels. The number of associated labels, *self._label_size* is equal to the 10 possible hand-written digit classes in the MNIST dataset.

In this tutorial, we’ll be creating a slightly deep fully connected network – a network with 7 total layers including input and output layers. To create these densely connected layers easily, we’ll be using TensorFlow’s handy tf.layers API and a simple Python loop, as follows:
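A sketch consistent with the description below (the full code is on the Github repository; the loop-bound attribute *self._num_layers* is my assumption):

```python
# Sketch – inside the Model class graph-building code.
# Each dense layer is named so its weights can later be retrieved
# via get_tensor_by_name('layerX/kernel:0')
input = self.input_images
for i in range(self._num_layers - 1):
    input = tf.layers.dense(input, self._hidden_size,
                            activation=self._activation,
                            name='layer{}'.format(i + 1))
```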

First, the generic *input* variable is initialized to be equal to the input images (fed via the placeholder). Next, the code runs through a loop where multiple dense layers are created, each named ‘layerX’, where X is the layer number. The number of nodes in each layer is set equal to the class property *self._hidden_size*, and the activation function is supplied via the property *self._activation*.

Next we create the final, output layer (you’ll note that the loop above terminates before it gets to creating the final layer), and we don’t supply an activation to this layer. In the tf.layers API, a linear activation (i.e. *f(x) = x*) is applied by default if no activation argument is supplied.
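A sketch of this final layer (the *self.logits* name is my assumption):

```python
# Sketch – final output layer; no activation argument is supplied, so
# tf.layers applies a linear activation f(x) = x by default
self.logits = tf.layers.dense(input, self._label_size,
                              name='layer{}'.format(self._num_layers))
```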

Next, the loss operation is setup and logged:
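A sketch consistent with the description below (the summary tag name is my assumption):

```python
# Sketch – softmax cross entropy applied to the raw (linear) logits,
# with the scalar loss value logged to the summary framework
self.loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.labels,
                                               logits=self.logits))
tf.summary.scalar('loss', self.loss)
</imports>
```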

The loss used in this instance is the handy TensorFlow softmax_cross_entropy_with_logits_v2 (the original version is soon to be deprecated). This loss function will apply the softmax operation to the un-activated output of the network, then apply the cross entropy loss to this outcome. After this loss operation is created, its output value is added to the tf.summary framework. This framework allows scalar values to be logged and subsequently visualized in the TensorBoard web-based visualization page. It can also log histogram information, along with audio and images – all of these can be observed through the aforementioned TensorBoard visualization.

Next, the program calls a method to log the gradients, which we will visualize to examine the vanishing gradient problem:
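The call itself is a single line; a sketch (the method name *_log_gradients* is my assumption, matched to the description that follows):

```python
# Sketch – log gradient information for every layer in the network
self._log_gradients(self._num_layers)
```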

This method looks like the following:
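A sketch of the method, reconstructed from the description below (the method and summary tag names are my assumptions; the layer/kernel naming follows the text):

```python
def _log_gradients(self, num_layers):
    # Sketch – a method of the Model class
    gr = tf.get_default_graph()
    for i in range(num_layers):
        # Grab the weight (kernel) tensor of each named dense layer
        weight = gr.get_tensor_by_name('layer{}/kernel:0'.format(i + 1))
        # d(loss)/d(weight): first argument to tf.gradients is y,
        # the second is x
        grad = tf.gradients(self.loss, weight)[0]
        mean = tf.reduce_mean(tf.abs(grad))
        tf.summary.scalar('mean_{}'.format(i + 1), mean)
        tf.summary.histogram('histogram_{}'.format(i + 1), grad)
        tf.summary.histogram('hist_weights_{}'.format(i + 1), weight)
```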

In this method, first the TensorFlow computational graph is extracted so that weight variables can be called out of it. Then a loop is entered to cycle through all the layers. For each layer, the weight tensor is grabbed by the handy function get_tensor_by_name. You will recall that each layer was named “layerX”, where X is the layer number. This name is supplied to the function, along with “/kernel:0” – this tells the function that we are trying to access the weight variable (also called a kernel) as opposed to the bias value, which would be “/bias:0”.

On the next line, the tf.gradients() function is used. This will calculate gradients of the form $\partial y / \partial x$ where the first argument supplied to the function is *y* and the second is *x*. In the gradient descent step, the weight update is made in proportion to $\partial loss / \partial W$, so in this case the first argument supplied to tf.gradients() is the loss, and the second is the weight tensor.

Next, the mean absolute value of the gradient is calculated, and then this is logged as a scalar in the summary. Next, histograms of the gradients and the weight values are also logged in the summary. The flow now returns back to the main method in the class.
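A sketch of these operations (the accuracy method itself is on the Github repository; the stand-in body here is my assumption):

```python
# Sketch – back in the main construction method of the Model class
self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)
# Plausible stand-in for the accuracy method on the Github repo:
# fraction of predictions matching the one-hot labels
correct = tf.equal(tf.argmax(self.logits, 1), tf.argmax(self.labels, 1))
self.accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
tf.summary.scalar('acc', self.accuracy)
```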

The code above is fairly standard TensorFlow usage – defining an optimizer, in this case the flexible and powerful AdamOptimizer(), and also a generic accuracy operation, the outcome of which is also added to the summary (see the Github code for the accuracy method called).

Finally a summary merge operation is created, which will gather up all the summary data ready for export to the TensorBoard file whenever it is executed:
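A sketch of these final graph operations (the attribute names are my assumptions):

```python
# Sketch – gather all logged summaries into one op, and create the
# variable initialization op
self.merged = tf.summary.merge_all()
self.init_op = tf.global_variables_initializer()
```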

An initialization operation is also created. Now all that is left is to run the main training loop.

### Training the model

The main training loop of this experimental model is shown in the code below:
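The full loop is on the Github repository; a sketch consistent with the description below (the function signature, iteration count and batch size are my assumptions):

```python
# Sketch – names and hyper-parameters here are assumptions
def run_training(model, mnist, log_path, iterations=3000, batch_size=30):
    with tf.Session() as sess:
        sess.run(model.init_op)
        # A separate sub-folder per run aids comparison in TensorBoard
        train_writer = tf.summary.FileWriter(log_path, sess.graph)
        for i in range(iterations):
            image_batch, label_batch = mnist.train.next_batch(batch_size)
            feed_dict = {model.input_images: image_batch,
                         model.labels: label_batch}
            sess.run(model.optimizer, feed_dict=feed_dict)
            if i % 200 == 0:
                # Gather all logged summary data and write it out
                summary = sess.run(model.merged, feed_dict=feed_dict)
                train_writer.add_summary(summary, i)
```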

This is a pretty standard TensorFlow training loop (if you’re unfamiliar with this, see my TensorFlow tutorial) – however, one non-standard addition is the tf.summary.FileWriter() operation and its associated uses. This operation generally takes two arguments – the location to store the files and the session graph. Note that it is a good idea to setup a different sub folder for each of your TensorFlow runs when using summaries, as this allows for better visualization and comparison of the various runs within TensorBoard.

Every 200 iterations, we run the *merged* operation, which is defined in the class instance model – as mentioned previously, this gathers up all the logged summary data ready for writing. The train_writer.add_summary() operation is then run on this output, which writes the data into the chosen location (optionally along with the iteration/epoch number).

The summary data can then be visualized using TensorBoard. To run TensorBoard, using command prompt, navigate to the base directory where all the sub folders are stored, and run the following command:

tensorboard --logdir=whatever_your_folder_path_is

Upon running this command, you will see startup information in the prompt which will tell you the address to type into your browser which will bring up the TensorBoard interface. Note that the TensorBoard page will update itself dynamically during training, so you can visually monitor the progress.

Now, to run this whole experiment, we can run the following code which cycles through each of the activation functions:
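A sketch of this experiment runner (the Model constructor signature and log folder names are my assumptions):

```python
# Sketch – cycle through the three activation scenarios
scenarios = ["sigmoid", "relu", "leaky_relu"]
act_funcs = [tf.sigmoid, tf.nn.relu, tf.nn.leaky_relu]
for scenario, act_func in zip(scenarios, act_funcs):
    tf.reset_default_graph()
    print("Running scenario: {}".format(scenario))
    model = Model(input_size=784, label_size=10,
                  num_layers=6, hidden_size=10, activation=act_func)
    run_training(model, mnist, log_path="logs/" + scenario)
```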

This should be pretty self-explanatory. Three scenarios are investigated – a scenario for each type of activation reviewed: sigmoid, ReLU and Leaky ReLU. Note that, in this experiment, I’ve setup a densely connected model with 6 layers (including the output layer but excluding the input layer), with each having a layer size of 10 nodes.

### Analyzing the results

The first figure below shows the training accuracy of the network, for each of the activations:

As can be observed, the sigmoid (blue) significantly underperforms the ReLU and Leaky ReLU activation functions. Is this due to the vanishing gradient problem? The plots below show the mean absolute gradient logs during training, again for the three scenarios:

The first graph shows the mean absolute gradients of the loss with respect to the weights for the output layer, and the second graph shows the same gradients for the first layer, for all three activation scenarios. First, it is clear that the overall magnitudes of the gradients for the ReLU activated networks are significantly greater than those in the sigmoid activated network. It can also be observed that there is a significant reduction in the gradient magnitudes between the output layer (layer 6) and the first layer (layer 1). This is the vanishing gradient problem.

You may be wondering why the ReLU activated networks still experience a significant reduction in the gradient values from the output layer to the first layer – weren’t these activation functions, with their gradients of 1 for activated regions, supposed to stop vanishing gradients? Yes and no. The gradient of the ReLU functions where *x > 0* is 1, so there is no degradation in multiplying 1’s together. However, the “chaining” expression I showed previously describing the vanishing gradient problem, i.e.:

$$ \frac{\partial C} {\partial W_l} \propto f'(z^{(l)}) f'(z^{(l+1)}) f'(z^{(l+2)}) \dots$$

isn’t quite the full picture. Rather, the back-propagation product is also in some sense proportional to the values of the weights in each layer, so more completely, it looks something like this:

$$ \frac{\partial C} {\partial W_l} \propto f'(z^{(l)}) \cdot W_{l} \cdot f'(z^{(l+1)}) \cdot W_{l+1} \cdot f'(z^{(l+2)}) \cdot W_{l+2} \dots$$

So if the weight values have magnitudes consistently < 1, then we will also see a vanishing of gradients, as the chained expression will shrink through the layers as these small weight values are multiplied together. We can confirm that the weight magnitudes in this case are < 1 by checking the histogram that was logged for the weight values in each layer:

The diagram above shows the histogram of layer 4 weights in the leaky ReLU scenario as they evolve through the epochs (y axis) – this is a handy visualization available in the TensorBoard panel. Note that the weight magnitudes are consistently < 1, and therefore we should expect the gradients to reduce even in the ReLU scenarios.

All that said, we can observe that the degradation of the gradients is *significantly worse* in the sigmoid scenario than in the ReLU scenarios. The mean absolute gradient reduces by a factor of 30 between layer 6 and layer 1 for the sigmoid scenario, compared to a factor of 6 for the leaky ReLU scenario (the standard ReLU scenario is much the same). Therefore, while there is still a vanishing gradient problem in the network presented, it is *greatly reduced* by using the ReLU activation functions. This benefit can be observed in the significantly better performance of the ReLU activation scenarios compared to the sigmoid scenario. Note that, at least in this example, there is no observable benefit of the leaky ReLU activation function over the standard ReLU activation function.

In summary then, this post has shown you how the vanishing gradient problem comes about, particularly when using the old canonical sigmoid activation function. However, the problem can be greatly reduced using the ReLU family of activation functions. You will also have seen how to log summary information in TensorFlow and plot it in TensorBoard to understand more about your networks. Hope it helps.
