An introduction to Global Average Pooling in convolutional neural networks

For those familiar with convolutional neural networks (if you’re not, check out this post), you will know that, for many architectures, the final set of layers are often of the fully connected variety. This is like bolting a standard neural network classifier onto the end of an image processor. The convolutional neural network starts with a series of convolutional (and, potentially, pooling) layers which create feature maps which represent different components of the input images. The fully connected layers at the end then “interpret” the output of these features maps and make category predictions. However, as with many things in the fast moving world of deep learning research, this practice is starting to fall by the wayside in favor of something called Global Average Pooling (GAP). In this post, I’ll introduce the benefits of Global Average Pooling and apply it on the Cats vs Dogs image classification task using TensorFlow 2. In the process, I’ll compare its performance to the standard fully connected layer paradigm. The code for this tutorial can be found in a Jupyter Notebook on this site’s Github repository, ready for use in Google Colaboratory.


Eager to build deep learning systems? Get the book here


Global Average Pooling

Global Average Pooling is an operation that calculates the average output of each feature map in the previous layer. This fairly simple operation reduces the data significantly and prepares the model for the final classification layer. It also has no trainable parameters – just like Max Pooling (see here for more details). The diagram below shows how it is commonly used in a convolutional neural network:

Global Average Pooling in a CNN architecture

Global Average Pooling in a CNN architecture

As can be observed, the final layers consist simply of a Global Average Pooling layer and a final softmax output layer. As can be observed, in the architecture above, there are 64 averaging calculations corresponding to the 64, 7 x 7 channels at the output of the second convolutional layer. The GAP layer transforms the dimensions from (7, 7, 64) to (1, 1, 64) by performing the averaging across the 7 x 7 channel values. Global Average Pooling has the following advantages over the fully connected final layers paradigm:

  • The removal of a large number of trainable parameters from the model. Fully connected or dense layers have lots of parameters. A 7 x 7 x 64 CNN output being flattened and fed into a 500 node dense layer yields 1.56 million weights which need to be trained. Removing these layers speeds up the training of your model.
  • The elimination of all these trainable parameters also reduces the tendency of over-fitting, which needs to be managed in fully connected layers by the use of dropout.
  • The authors argue in the original paper that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories – so that each feature map becomes a kind of “category confidence map”.
  • Finally, the authors also argue that, due to the averaging operation over the feature maps, this makes the model more robust to spatial translations in the data. In other words, as long as the requisite feature is included / or activated in the feature map somewhere, it will still be “picked up” by the averaging operation.

To test out these ideas in practice, in the next section I’ll show you an example comparing the benefits of the Global Average Pooling with the historical paradigm. This example problem will be the Cats vs Dogs image classification task and I’ll be using TensorFlow 2 to build the models. At the time of writing, only TensorFlow 2 Alpha is available, and the reader can follow this link to find out how to install it.

Global Average Pooling with TensorFlow 2 and Cats vs Dogs

To download the Cats vs Dogs data for this example, you can use the following code:

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)

(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)

The code above utilizes the TensorFlow Datasets repository which allows you to import common machine learning datasets into TF Dataset objects.  For more on using Dataset objects in TensorFlow 2, check out this post. A few things to note. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. Following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. In order to examine the images in the data set, the following code can be run:

import matplotlib.pylab as plt

for image, label in cat_train.take(2):
  plt.figure()
  plt.imshow(image)

This produces the following images: As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:

IMAGE_SIZE = 100
def pre_process_image(image, label):
  image = tf.cast(image, tf.float32)
  image = image / 255.0
  image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
  return image, label

In this example, we’ll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I’ve chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:

TRAIN_BATCH_SIZE = 64
cat_train = cat_train.map(pre_process_image).shuffle(1000).repeat().batch(TRAIN_BATCH_SIZE)
cat_valid = cat_valid.map(pre_process_image).repeat().batch(1000)

For more on TensorFlow datasets, see this post. Now it is time to build the model – in this example, we’ll be using the Keras API in TensorFlow 2. In this example, I’ll be using a common “head” model, which consists of layers of standard convolutional operations – convolution and max pooling, with batch normalization and ReLU activations:

head = tf.keras.Sequential()
head.add(layers.Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(32, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(64, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

Next, we need to add the “back-end” of the network to perform the classification.

Standard fully connected classifier results

In the first instance, I’ll show the results of a standard fully connected classifier, without dropout. Because, for this example, there are only two possible classes – “cat” or “dog” – the final output layer is a dense / fully connected layer with a single node and a sigmoid activation.

standard_classifier = tf.keras.Sequential()
standard_classifier.add(layers.Flatten())
standard_classifier.add(layers.BatchNormalization())
standard_classifier.add(layers.Dense(100))
standard_classifier.add(layers.Activation('relu'))
standard_classifier.add(layers.BatchNormalization())
standard_classifier.add(layers.Dense(100))
standard_classifier.add(layers.Activation('relu'))
standard_classifier.add(layers.Dense(1))
standard_classifier.add(layers.Activation('sigmoid'))

As can be observed, in this case, the output classification layers includes 2 x 100 node dense layers. To combine the head model and this standard classifier, the following commands can be run:

standard_model = tf.keras.Sequential([
    head, 
    standard_classifier
])

Finally, the model is compiled, a TensorBoard callback is created for visualization purposes, and the Keras fit command is executed:

standard_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")))]

standard_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=10, validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Note that the loss used is binary crossentropy, due to the binary classes for this example. The training progress over 7 epochs can be seen in the figure below:

Standard classifier without average pooling

Standard classifier accuracy (red – training, blue – validation)

Standard classifier loss without average pooling

Standard classifier loss (red – training, blue – validation)

As can be observed, with a standard fully connected classifier back-end to the model (without dropout), the training accuracy reaches high values but it overfits with respect to the validation dataset. The validation dataset accuracy stagnates around 80% and the loss begins to increase – a sure sign of overfitting.

Global Average Pooling results

The next step is to test the results of the Global Average Pooling in TensorFlow 2. To build the GAP layer and associated model, the following code is added:

average_pool = tf.keras.Sequential()
average_pool.add(layers.AveragePooling2D())
average_pool.add(layers.Flatten())
average_pool.add(layers.Dense(1, activation='sigmoid'))

pool_model = tf.keras.Sequential([
    head, 
    average_pool
])

The accuracy results for this model, along with the results of the standard fully connected classifier model, are shown below:

Global Average Pooling accuracy

Global average pooling accuracy vs standard fully connected classifier model (pink – GAP training, green – GAP validation, blue – FC classifier validation)

As can be observed from the graph above, the Global Average Pooling model has a higher validation accuracy by the 7th epoch than the fully connected model. The training accuracy is lower than the FC model, but this is clearly due to overfitting being reduced in the GAP model. A final comparison including the case of the FC model with a dropout layer inserted is shown below:

standard_classifier_with_do = tf.keras.Sequential()
standard_classifier_with_do.add(layers.Flatten())
standard_classifier_with_do.add(layers.BatchNormalization())
standard_classifier_with_do.add(layers.Dense(100))
standard_classifier_with_do.add(layers.Activation('relu'))
standard_classifier_with_do.add(layers.Dropout(0.5))
standard_classifier_with_do.add(layers.BatchNormalization())
standard_classifier_with_do.add(layers.Dense(100))
standard_classifier_with_do.add(layers.Activation('relu'))
standard_classifier_with_do.add(layers.Dense(1))
standard_classifier_with_do.add(layers.Activation('sigmoid'))
Global Average Pooling accuracy vs FC with dropout

Global average pooling validation accuracy vs FC classifier with and without dropout (green – GAP model, blue – FC model without DO, orange – FC model with DO)

As can be seen, of the three model options sharing the same convolutional front end, the GAP model has the best validation accuracy after 7 epochs of training (x – axis in the graph above is the number of batches). Dropout improves the validation accuracy of the FC model, but the GAP model is still narrowly out in front. Further tuning could be performed on the fully connected models and results may improve. However, one would expect Global Average Pooling to be at least equivalent to a FC model with dropout – even though it has hundreds of thousands of fewer parameters. I hope this short tutorial gives you a good understanding of Global Average Pooling and its benefits. You may want to consider it in the architecture of your next image classifier design.


Eager to build deep learning systems? Get the book here

Leave a Reply

Your email address will not be published. Required fields are marked *