# Introduction to ResNet in TensorFlow 2

In previous tutorials, I’ve explained convolutional neural networks (CNNs) and shown how to code them. The convolutional layer has proven to be a great success in image recognition and processing. However, state-of-the-art techniques don’t involve just a few CNN layers. Rather, they can be very deep, ranging from tens of layers to more than 100. One of the most successful CNN architectures has been ResNet. It was first introduced in 2015 (see this paper) and won the ILSVRC 2015 image classification task. The winning ResNet consisted of a whopping 152 layers, and making a network that deep work required a significant innovation in CNN architecture. This post discusses that innovation, then develops an example ResNet architecture in TensorFlow 2 and compares it to a standard architecture. Because of the training requirements for this task, I have developed the code in Google Colaboratory (which gives free GPU time; see my tutorial here), and the notebook can be found on this site’s GitHub repository.

## Introduction to the ResNet architecture

The vanishing gradient problem was an initial barrier to making neural networks deeper and more powerful. However, as explained in this post, the problem has now largely been solved through the use of ReLU activations and batch normalization. Given this is true, and given enough computational power and data, we should be able to stack many CNN layers and dramatically increase classification accuracy, right? Well, to a degree. An early architecture, called VGG-19, had 19 layers. However, this is a long way off the 152 layers of the version of ResNet that won the ILSVRC 2015 image classification task. Deeper networks were not successful prior to ResNet because of something called the degradation problem. Note that this is not the vanishing gradient problem, but something else. It was observed that, beyond a certain depth, making the network deeper led to higher classification errors. One might think this is due to overfitting, but not so fast: the degradation problem leads to higher training errors too! Consider the diagrams below from the original ResNet paper:

Illustration of degradation problem that ResNet solves

Note that the 56-layer network has higher test and training errors than the 20-layer network. Theoretically, this doesn’t make much sense. Let’s say the 20-layer network learns some mapping H(x) that gives a training error of 10%. If another 36 layers are added, we would expect the error to be no worse than 10%. Why? Well, the 36 extra layers could, at worst, just learn identity functions. In other words, the extra 36 layers could just learn to pass through the output from the first 20 layers of the network. This would give the same error of 10%. This doesn’t seem to happen though. It appears neural networks aren’t great at learning the identity function in deep architectures. Not only do they fail to learn the identity function (and hence pass through the 20-layer error rate), they make things worse. Beyond a certain number of layers, they begin to degrade the performance of the network compared to shallower implementations. Here is where the ResNet architecture comes in.

### The ResNet solution

The ResNet solution relies on making the identity function option explicit in the architecture, rather than relying on the network itself to learn the identity function where appropriate. Networks are built by stacking the following CNN building block:

ResNet building block from here

In the diagram above, the input tensor x enters the building block and then splits. On one path, the input is processed by two stacked convolutional layers (each labelled “weight layer” in the diagram). This path is the “standard” CNN processing part of the building block. The ResNet innovation is the “identity” path. Here, the input x is simply added to the output of the CNN component of the building block, F(x). The output from the block is then F(x) + x, with a final ReLU activation applied at the end. This identity path in the ResNet building block allows the neural network to more easily pass through any abstractions learnt in previous layers. Alternatively, it can more easily build incremental abstractions on top of the abstractions learnt in the previous layers. What do I mean by this? The diagram below may help:

Layers and abstractions

Generally speaking, as CNN layers are added to a network, the network during training will learn lower-level abstractions in the early layers (e.g. lines, colours, corners and basic shapes) and higher-level abstractions in the later layers (groups of geometries, objects etc.). Let’s say that, when trying to classify an aircraft in an image, there are some mid-level abstractions which reliably signal that an aircraft is present, such as the shape of a jet engine near a wing (this is just an example). These abstractions might be learnt in, say, 10 layers.

However, if we add an additional 20 or more layers after these first 10 layers, these reliable signals may get degraded / obfuscated. The ResNet architecture gives the network a more explicit chance of muting further CNN abstractions on some filters by driving F(x) to zero, with the output of the block defaulting to its input x. Not only that, the ResNet architecture allows blocks to “tinker” more easily with the input. This is because the block only has to learn the incremental difference between the previous layer abstraction and the optimal output H(x). In other words, it has to learn F(x) = H(x) – x. This is a residual expression, hence the name ResNet. This, theoretically at least, should be easier to learn than the full expression H(x).
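To make the “driving F(x) to zero” point concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: the two “weight layers” are small matrix multiplications on a 1-D vector rather than convolutions, and the names `dense`, `residual_block` and `plain_block` are hypothetical helpers for this illustration only.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def dense(v, weights):
    """Stand-in for a "weight layer": a plain matrix-vector product."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Return ReLU(F(x) + x), where F is two stacked weight layers."""
    f = dense(relu(dense(x, w1)), w2)               # F(x): layer -> ReLU -> layer
    return relu([fi + xi for fi, xi in zip(f, x)])  # identity path added back in

def plain_block(x, w1, w2):
    """The same two layers, but without the identity path."""
    return relu(dense(relu(dense(x, w1)), w2))

x = [1.0, 2.0, 3.0]                    # non-negative, as it would be after a ReLU
zeros = [[0.0] * 3 for _ in range(3)]  # weights that make F(x) = 0

print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
print(plain_block(x, zeros, zeros))     # [0.0, 0.0, 0.0]
```

With all weights at zero, the residual block passes its input straight through, while the plain block wipes it out. A plain block would have to learn a specific set of non-zero weights to reproduce x, which is the harder problem.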

A (somewhat tortured) analogy might assist here. Say you are trying to draw a picture of a tree. Someone hands you a pencil outline of the main structure of the tree: the trunk, large branches, smaller branches etc. Now say you are somewhat proud, and you don’t want too much help in drawing the picture. So, you rub out parts of the pencil outline that you were handed. You then proceed to add some detail to the picture, but you have to redraw the parts that you rubbed out. This is kind of like the case of a standard non-ResNet network. Because layers seem to struggle to reproduce an identity function, at each subsequent layer they essentially erase or degrade some of the previous level abstractions, and these need to be re-estimated (at least to an extent).

Alternatively, you, the artist, might not be too proud and you happily accept the pencil outline that you received. It is much easier to then add new details to what you have already been given. This is like what the ResNet blocks do: they take what they are given, i.e. x, and just make tweaks to it by adding F(x). This analogy isn’t perfect, but it should give you an idea of what is going on here, and how the ResNet blocks help the learning along.

A full 34-layer version of ResNet is (partially) illustrated below (from the original paper):

ResNet-34 architecture (partial)

The diagram above shows roughly the first half of the ResNet 34-layer architecture, along with the equivalent layers of the VGG-19 architecture and a “plain” version of the ResNet architecture. The “plain” version has the same CNN layers, but lacks the identity path previously presented in the ResNet building block. These identity paths can be seen looping around every second CNN layer on the right hand side of the ResNet (“residual”) architecture.
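One consequence of the identity path being a simple addition is that it adds no trainable parameters, so a “plain” block and a “residual” block built from the same layers are the same size. The sketch below checks this with some arithmetic. The 3 x 3 kernels and 64 channels are illustrative values chosen for this example, the batch normalization layers are an assumption about how the block is built, and `conv2d_params` and `batch_norm_params` are hypothetical helper names.

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Kernel weights plus one bias per output channel."""
    return (kernel * kernel * in_channels + 1) * out_channels

def batch_norm_params(channels):
    """Scale, shift, moving mean and moving variance for each channel."""
    return 4 * channels

channels, kernel = 64, 3

# Two convolution + batch normalization pairs, as in a two-layer building block
two_layers = 2 * (conv2d_params(kernel, channels, channels)
                  + batch_norm_params(channels))
identity_path = 0  # element-wise addition has nothing to learn

plain_block_params = two_layers
residual_block_params = two_layers + identity_path

print(plain_block_params, residual_block_params)  # 74368 74368
```

Any difference in accuracy between the two variants therefore comes from how the layers are wired together, not from extra capacity.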

In the next section, I’m going to show you how to build a ResNet architecture in TensorFlow 2/Keras. In the example, we’ll compare both the “plain” and “residual” networks on the CIFAR-10 classification task. Note that for computational ease, I’ll only include 10 ResNet blocks.

## Building ResNet in TensorFlow 2

As mentioned previously, the code for this example can be found on this site’s GitHub repository. The CIFAR-10 dataset can be imported easily using the Keras datasets API:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import datetime as dt

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

We then perform some pre-processing of the training and test data. This pre-processing includes rescaling the image (converting the data so it resides in the range [0, 1]) and centrally cropping the image to 75% of its original extents. Data augmentation is also performed on the training data by randomly flipping the image left to right about its vertical centre axis. This is performed using the TensorFlow Dataset API. More details on the code below can be found in these two posts and my book.

# Shuffle individual images before batching so that each batch is a fresh mix
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(10000).batch(64)
train_dataset = train_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.random_flip_left_right(x), y))
train_dataset = train_dataset.repeat()

valid_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(5000).shuffle(10000)
valid_dataset = valid_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
valid_dataset = valid_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
valid_dataset = valid_dataset.repeat()
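The rescaling and cropping steps above can be checked with a few lines of plain Python. This is a simplified sketch that assumes the crop is centred and applied equally to both spatial axes. It works on a nested list rather than a tensor, and `crop_bounds` and `central_crop` are hypothetical stand-ins written for this example, not the TensorFlow functions.

```python
def crop_bounds(size, fraction):
    """Return (offset, new_size) for keeping the central `fraction` of `size`."""
    offset = int((size - size * fraction) / 2)  # pixels trimmed from each side
    return offset, size - 2 * offset

def central_crop(image, fraction):
    """Crop a 2-D list of pixels (rows x columns) around its centre."""
    row_off, rows = crop_bounds(len(image), fraction)
    col_off, cols = crop_bounds(len(image[0]), fraction)
    return [row[col_off:col_off + cols]
            for row in image[row_off:row_off + rows]]

# A fake 32 x 32 single-channel image with values in 0..255
image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]

scaled = [[p / 255.0 for p in row] for row in image]  # now in [0, 1]
cropped = central_crop(scaled, 0.75)

print(crop_bounds(32, 0.75))          # (4, 24)
print(len(cropped), len(cropped[0]))  # 24 24
```

Four pixels are trimmed from each edge, which is where the 24 x 24 input size used in the model definition comes from.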

In this example, to build the network, we’re going to use the Keras Functional API, in the TensorFlow 2 context. Here is what the ResNet model definition looks like:

inputs = keras.Input(shape=(24, 24, 3))
x = layers.Conv2D(32, 3, activation='relu')(inputs)
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D(3)(x)

num_res_net_blocks = 10
for i in range(num_res_net_blocks):
    x = res_net_block(x, 64, 3)

x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10, activation='softmax')(x)

res_net_model = keras.Model(inputs, outputs)

First, we specify the input dimensions to Keras. The raw CIFAR-10 images have a size of (32, 32, 3), but because we are performing central cropping of 75%, the post-processed images are of size (24, 24, 3). Next, we create two standard CNN layers, with 32 and 64 filters respectively (for more on convolutional layers, see this post and my book). The filter window sizes are 3 x 3, in line with the original ResNet architectures. Next, some max pooling is performed, and then it is time to produce some ResNet building blocks. In this case, 10 ResNet blocks are created by calling the res_net_block() function, which needs to be defined before the model code above is run:

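def res_net_block(input_data, filters, conv_size):
    # F(x): two convolutional layers, each followed by batch normalization
    x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, conv_size, activation=None, padding='same')(x)
    x = layers.BatchNormalization()(x)
    # Identity path: add the block input x to F(x)
    x = layers.Add()([x, input_data])
    # Final ReLU applied to F(x) + x
    x = layers.Activation('relu')(x)
    return x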

The first few lines of this function are standard CNN layers with Batch Normalization, except the 2nd layer does not have an activation function (this is because one will be applied after the residual addition part of the block). After these two layers, the residual addition part, where the input data is added to the CNN output (F(x)), is executed. Here we can make use of the Keras Add layer, which simply adds two tensors together. Finally, a ReLU activation is applied to the result of this addition and the outcome is returned.
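The addition only works if F(x) and the block input have exactly the same shape, which is why both convolutions in the block use padding='same' and the same number of filters as the incoming tensor. The sketch below traces the tensor shapes through the model definition shown earlier using plain arithmetic. It assumes stride-1 convolutions and max pooling with a stride equal to the pool size (the Keras defaults), and `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel, padding):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool):
    """Spatial size after max pooling with stride equal to the pool size."""
    return (size - pool) // pool + 1

shapes = []
size, channels = 24, 3                   # cropped CIFAR-10 input
shapes.append(("input", size, channels))

for filters in (32, 64):                 # two unpadded 3 x 3 convolutions
    size, channels = conv_out(size, 3, "valid"), filters
    shapes.append(("conv", size, channels))

size = pool_out(size, 3)
shapes.append(("max_pool", size, channels))

for _ in range(10):                      # ten residual blocks
    block_in = (size, channels)
    size, channels = conv_out(size, 3, "same"), 64
    size, channels = conv_out(size, 3, "same"), 64
    assert (size, channels) == block_in  # F(x) and x must match to be added
shapes.append(("res_blocks", size, channels))

size = conv_out(size, 3, "valid")        # final unpadded convolution
shapes.append(("final_conv", size, channels))

for name, s, c in shapes:
    print(f"{name:<11} {s} x {s} x {c}")
```

Under these assumptions the tensor enters the residual blocks at 6 x 6 x 64 and leaves them unchanged, and the final convolution reduces it to 4 x 4 x 64 before global average pooling.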

After the ResNet block loop is finished, some final layers are added. First, a final CNN layer is added, followed by a Global Average Pooling (GAP) layer (for more on GAP layers, see here). Finally, we have a couple of dense classification layers with a dropout layer in between. This model was trained over 30 epochs and then an alternative “plain” model was also created. This was created by taking the same architecture but replacing the res_net_block function with the following function:

def non_res_block(input_data, filters, conv_size):
    x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    return x

Note that this function is simply two standard CNN layers, with no residual components included. The training code is as follows:

callbacks = [
    # Write TensorBoard logs to a timestamped sub-directory of ./log
    keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")), write_images=True),
]

# NOTE: the optimizer is not stated in the text; Adam is an assumption here
res_net_model.compile(optimizer=keras.optimizers.Adam(),
                      loss='sparse_categorical_crossentropy',
                      metrics=['acc'])
res_net_model.fit(train_dataset, epochs=30, steps_per_epoch=195,
                  validation_data=valid_dataset,
                  validation_steps=3, callbacks=callbacks)
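It is worth working out what these settings mean in terms of data seen, because the datasets repeat indefinitely and an “epoch” here is defined by steps_per_epoch rather than by a full pass over the data. The arithmetic below is a quick check using the standard CIFAR-10 split sizes (50,000 training and 10,000 test images); the variable names are just for this illustration.

```python
import math

train_images, test_images = 50_000, 10_000  # CIFAR-10 split sizes
batch_size, steps_per_epoch, epochs = 64, 195, 30
valid_batch_size, validation_steps = 5000, 3

seen_per_epoch = batch_size * steps_per_epoch            # images per "epoch"
full_pass_steps = math.ceil(train_images / batch_size)   # steps for one full pass
total_passes = epochs * seen_per_epoch / train_images    # passes over 30 epochs
valid_seen = valid_batch_size * validation_steps         # images per validation run

print(seen_per_epoch)          # 12480
print(full_pass_steps)         # 782
print(f"{total_passes:.2f}")   # 7.49
print(valid_seen)              # 15000
```

So each “epoch” covers roughly a quarter of the training set, the 30 epochs amount to about seven and a half full passes, and each validation run sees the test set one and a half times.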

## ResNet training and validation results

The accuracy results of the training of these two models can be observed below:

ResNet (red) vs “plain” (pink) training accuracy

ResNet (blue) vs “plain” (green) training accuracy

As can be observed, there is around a 5-6% improvement in training accuracy from the ResNet architecture compared to the “plain” non-ResNet architecture. I have run this comparison a number of times and the 5-6% gap is consistent across the runs. These results illustrate the power of the ResNet idea, even for a relatively shallow architecture of 10 ResNet blocks. As demonstrated in the original paper, this effect will be more pronounced in deeper networks. Note that this network is not very well optimized, and the accuracy could be improved by running for more iterations. However, it is enough to show the benefits of the ResNet architecture. In future posts, I’ll demonstrate other ResNet-based architectures which can achieve even better results.