A PyTorch tutorial – deep learning in Python

So – if you’re a follower of this blog and you’ve been trying out your own deep learning networks in TensorFlow and Keras, you’ve probably come across the somewhat frustrating business of debugging these deep learning libraries. Sure, they have Python APIs, but it’s kinda hard to figure out what exactly is happening when something goes wrong. They also don’t seem to play well with Python libraries such as numpy, scipy, scikit-learn, Cython and so on. Enter the PyTorch deep learning library – one of its purported benefits is that it is a deep learning library that is more at home in Python, which, for a Python aficionado like myself, sounds great. It also has nifty features such as dynamic computational graph construction, as opposed to the static computational graphs present in TensorFlow and Keras (for more on computational graphs, see below). It’s also on the up and up, with its development supported by companies such as Facebook, Twitter, NVIDIA and so on. So let’s dive into it in this PyTorch tutorial.

The first question to consider – is it better than TensorFlow? That’s a fairly subjective judgement – performance-wise there doesn’t appear to be a great deal of difference. Check out this article for a quick comparison. In any case, it’s clear that PyTorch is here to stay and is likely to be a real contender in the “contest” between deep learning libraries, so let’s kick start our learning of it. I’ll leave it to you to decide which is “better”.

In this PyTorch tutorial we will introduce some of the core features of PyTorch, and build a fairly simple densely connected neural network to classify hand-written digits. To learn how to build more complex models in PyTorch, check out my post Convolutional Neural Networks Tutorial in PyTorch.

Recommended online course: If you’re more of a video course learner, check out this inexpensive, highly rated, Udemy course: Practical Deep Learning with PyTorch

A PyTorch tutorial – the basics

In this section, we’ll go through the basic ideas of PyTorch starting at tensors and computational graphs and finishing at the Variable class and the PyTorch autograd functionality.

Installing on Windows

For starters, if you are a Windows user like myself, you’ll find that, at the time of writing, there are no straightforward installation options for that operating system on the PyTorch website. However, there is a way to do it successfully – check out this website for instructions. It’s well worth the effort to get this library installed.

Computational graphs

The first thing to understand about any deep learning library is the idea of a computational graph. A computational graph is a set of calculations, which are called nodes, and these nodes are connected in a directional ordering of computation. In other words, some nodes are dependent on other nodes for their input, and these nodes in turn output the results of their calculations to other nodes. A simple example of a computational graph for the calculation $a = (b + c) * (c + 2)$ can be seen below – we can break this calculation up into the following steps/nodes:

\begin{align}
d &= b + c \\
e &= c + 2 \\
a &= d * e
\end{align}

Simple computational graph


The benefit of using a computational graph is that each node is like its own independently functioning piece of code (once it receives all its required inputs). This allows various performance optimizations to be performed in running the calculations, such as threading and multiple processing / parallelism. All the major deep learning frameworks (TensorFlow, Theano, PyTorch etc.) involve constructing such computational graphs, through which neural network operations can be built and through which gradients can be back-propagated (if you’re unfamiliar with back-propagation, see my neural networks tutorial).
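To make the node idea concrete, here is a minimal plain-Python sketch of the graph for a = (b + c) * (c + 2). The function names node_d, node_e and node_a and the input values are made up for illustration; no deep learning library is involved yet.

```python
# Each function below plays the role of one node in the graph.
def node_d(b, c):
    return b + c

def node_e(c):
    return c + 2

def node_a(d, e):
    return d * e

b, c = 1.0, 2.0      # illustrative input values
d = node_d(b, c)     # depends only on the inputs b and c
e = node_e(c)        # independent of d, so it could run in parallel with it
a = node_a(d, e)     # needs both d and e before it can run
print(d, e, a)
```

Notice that d and e do not depend on each other, which is exactly the kind of independence a framework can exploit when scheduling work.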


Tensors

Tensors are matrix-like data structures which are essential components in deep learning libraries and efficient computation. Graphics Processing Units (GPUs) are especially effective at calculating operations between tensors, and this has spurred the surge in deep learning capability in recent times. In PyTorch, tensors can be declared simply in a number of ways:

import torch
x = torch.Tensor(2, 3)

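This code creates a tensor of size (2, 3) – i.e. 2 rows and 3 columns. Note that torch.Tensor(2, 3) only allocates the memory and does not initialize it, so the values are whatever happened to be in memory (often zeros, but this is not guaranteed); use torch.zeros(2, 3) when you need a zero-filled tensor. The output may look like this: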

 0  0  0
 0  0  0
[torch.FloatTensor of size 2x3]

We can also create tensors filled with random float values:

x = torch.rand(2, 3)

Multiplying tensors, adding them and so forth is straight-forward:

x = torch.ones(2,3)
y = torch.ones(2,3) * 2
x + y

This returns:

 3  3  3
 3  3  3
[torch.FloatTensor of size 2x3]

Another great thing is the numpy-style slice functionality that is available – for instance, y[:, 1] selects the second column of y, which we can update in place:

y[:,1] = y[:,1] + 1

After this assignment, y contains:

 2  3  2
 2  3  2
[torch.FloatTensor of size 2x3]
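Since playing well with numpy is one of the selling points mentioned earlier, here is a small sketch of the tensor operations above together with a round trip to numpy. The variable names are illustrative, and it assumes numpy is installed alongside PyTorch.

```python
import numpy as np
import torch

zeros = torch.zeros(2, 3)        # guaranteed to be filled with zeros
y = torch.ones(2, 3) * 2
y[:, 1] = y[:, 1] + 1            # numpy-style slice assignment on the second column

arr = y.numpy()                  # view the CPU tensor as a numpy array (shares memory)
back = torch.from_numpy(arr)     # wrap a numpy array as a tensor again

print(arr)
print(back.shape)
```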

Now you know how to create tensors and manipulate them in PyTorch, in the next step of this PyTorch tutorial let’s look at something a bit more complicated.

Autograd in PyTorch

In any deep learning library, there needs to be a mechanism where error gradients are calculated and back-propagated through the computational graph. This mechanism, called autograd in PyTorch, is easily accessible and intuitive. The Variable class is the main component of this autograd system in PyTorch. This Variable class wraps a tensor, and allows automatic gradient computation on the tensor when the .backward() function is called (more on this later). The object contains the data of the tensor, the gradient of the tensor (once computed with respect to some other value, i.e. the loss) and also a reference to whatever function created the Variable (if the Variable was created directly by the user rather than by an operation, this reference will be None).

Let’s create a Variable from a simple tensor:

from torch.autograd import Variable
x = Variable(torch.ones(2, 2) * 2, requires_grad=True)

In the Variable declaration above, we pass in a tensor of (2, 2) 2-values and we specify that this variable requires a gradient. If we were using this in a neural network, this would mean that this Variable would be trainable. If we set this flag to False, the Variable would not be trained. For this simple example we aren’t training anything, but we do want to interrogate the gradient for this Variable as will be shown below.

Next, let’s create another Variable, constructed based on operations on our original Variable x.

z = 2 * (x * x) + 5 * x

To get the gradient of this operation with respect to x, i.e. dz/dx, we can analytically calculate this to be 4x + 5. If all elements of x are 2, then we should expect the gradient dz/dx to be a (2, 2) shaped tensor with 13-values. However, first we have to run the .backward() operation to compute these gradients. Of course, to compute gradients, we need to compute them with respect to something. In this case, we can supply a (2, 2) tensor of 1-values to be what we compute the gradients against – so the calculation simply becomes dz/dx:

z.backward(torch.ones(2, 2))
print(x.grad)

This produces the following output:

Variable containing:
 13  13
 13  13
[torch.FloatTensor of size 2x2]

As you can observe, the gradient is equal to a (2, 2), 13-valued tensor as we predicted. Note that the gradient is stored in the Variable, in the property .grad.
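If you are on PyTorch 0.4 or later, tensors track gradients themselves and the Variable wrapper is no longer needed. The sketch below repeats the same calculation under that assumption; x2 is just an illustrative second copy used to show the scalar route.

```python
import torch

# requires_grad=True asks autograd to track operations on x
x = torch.full((2, 2), 2.0, requires_grad=True)
z = 2 * (x * x) + 5 * x

# z is not a scalar, so backward() needs a tensor of the same shape
z.backward(torch.ones(2, 2))
print(x.grad)    # every element is 4 * 2 + 5 = 13

# Equivalent route: reduce to a scalar first, then no argument is needed
x2 = torch.full((2, 2), 2.0, requires_grad=True)
(2 * (x2 * x2) + 5 * x2).sum().backward()
print(x2.grad)
```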

Now that we’ve covered the basics of tensors, Variables and the autograd functionality within PyTorch, we can move onto creating a simple neural network in PyTorch which will showcase this functionality further.

Creating a neural network in PyTorch

This section is the main show of this PyTorch tutorial. To access the code for this tutorial, check out this website’s GitHub repository. Here we will create a simple 4-layer fully connected neural network (including an “input layer” and two hidden layers) to classify the hand-written digits of the MNIST dataset. The architecture we’ll use can be seen in the figure below:

Fully connected neural network example architecture

The input layer consists of 28 x 28 (=784) greyscale pixels which constitute the input data of the MNIST data set. This input is then passed through two fully connected hidden layers, each with 200 nodes, with the nodes utilizing a ReLU activation function. Finally, we have an output layer with ten nodes corresponding to the 10 possible classes of hand-written digits (i.e. 0 to 9). We will use a softmax output layer to perform this classification.

Let’s create the neural network.

The neural network class

In order to create a neural network in PyTorch, you need to use the included class nn.Module. To use this base class, we also need to use Python class inheritance – this basically allows us to use all of the functionality of the nn.Module base class, while still being able to override parts of it for the model construction / forward pass through the network. Some actual code will help explain:

import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 10)

In the class definition, you can see the inheritance of the base class nn.Module. Then, in the first line of the class initialization (def __init__(self):) we have the required Python super() call, which runs the initialization of the base nn.Module class. The following three lines are where we create our fully connected layers as per the architecture diagram. A fully connected neural network layer is represented by the nn.Linear object, with the first argument in the definition being the number of nodes in layer l and the next argument being the number of nodes in layer l+1. As you can observe, the first layer takes the 28 x 28 input pixels and connects to the first 200 node hidden layer. Then we have another 200 to 200 hidden layer, and finally a connection between the last hidden layer and the output layer (with 10 nodes).

Now that we’ve set up the “skeleton” of our network architecture, we have to define how data flows through our network. We do this by defining a forward() method in our class – this method overrides a dummy method in the base class, and needs to be defined for each network:

def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    # dim=1 applies the softmax across the 10 class scores of each sample
    return F.log_softmax(x, dim=1)

For the forward() method, we supply the input data x as the primary argument. We feed this into our first fully connected layer (self.fc1(x)) and then apply a ReLU activation to the nodes in this layer using F.relu(). Because of the hierarchical nature of this network, we replace x at each stage, feeding it into the next layer. We do this through our three fully connected layers, except for the last one – instead of a ReLU activation we return a log softmax “activation”. This, combined with the negative log likelihood loss function which will be defined later, gives us a multi-class cross entropy based loss function which we will use to train the network.
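Putting the two pieces together, the whole class might look like the sketch below, followed by a quick sanity check on a made-up batch. It assumes PyTorch 0.4 or later (plain tensors as input, and the dim argument to log_softmax); fake_batch is random data rather than real MNIST images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.log_softmax(self.fc3(x), dim=1)

net = Net()
fake_batch = torch.rand(4, 28 * 28)   # made-up stand-in for four flattened images
out = net(fake_batch)
print(out.shape)                      # one row of 10 log-probabilities per image
print(out.exp().sum(dim=1))           # exponentiating each row gives probabilities summing to ~1
```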

So that’s it – we’ve defined our neural network. Pretty easy right?

The next step is to create an instance of this network architecture:

net = Net()

When we print the instance of the class Net, we get the following output:

Net (
(fc1): Linear (784 -> 200)
(fc2): Linear (200 -> 200)
(fc3): Linear (200 -> 10)
)

This is pretty handy as it confirms the structure of our network for us.
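Another quick check is to count the trainable parameters. The sketch below rebuilds the three layers on their own (the layers list is just for illustration) and counts the weights and biases in each.

```python
import torch.nn as nn

layers = [nn.Linear(784, 200), nn.Linear(200, 200), nn.Linear(200, 10)]
# each Linear layer holds an (out x in) weight matrix plus one bias per output node
counts = [sum(p.numel() for p in layer.parameters()) for layer in layers]
total = sum(counts)
print(counts, total)
```

The per-layer counts come out as 784 x 200 + 200, 200 x 200 + 200 and 200 x 10 + 10, so even this small network has just under 200,000 parameters.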

Training the network

Next we have to set up an optimizer and a loss criterion (optim here refers to the torch.optim module, imported with import torch.optim as optim):

# create a stochastic gradient descent optimizer
optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=0.9)
# create a loss function
criterion = nn.NLLLoss()

In the first line, we create a stochastic gradient descent optimizer, and we specify the learning rate (which I’ve passed to this function as 0.01) and a momentum of 0.9. The other ingredient we need to supply to our optimizer is all the parameters of our network – thankfully PyTorch makes supplying these parameters easy through the .parameters() method of the base nn.Module class that we inherit from in the Net class.

Next, we set our loss criterion to be the negative log likelihood loss – this combined with our log softmax output from the neural network gives us an equivalent cross entropy loss for our 10 classification classes.
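The equivalence claimed here is easy to check numerically. The sketch below compares the log softmax plus negative log likelihood combination against nn.CrossEntropyLoss applied to the raw outputs; the logits and target values are made up for the comparison.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)               # made-up raw network outputs for 5 samples
target = torch.tensor([3, 7, 0, 9, 1])    # made-up class labels

nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
ce = nn.CrossEntropyLoss()(logits, target)
print(nll.item(), ce.item())
```

Both numbers should agree to within floating point error, so you could equally drop the log softmax from forward() and use nn.CrossEntropyLoss directly.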

Now it’s time to train the network. During training, I will be extracting data from a data loader object which is included in the PyTorch utilities module. I won’t go into the details here (I’ll leave that for a future post), but you can find the code on this site’s Github repository. This data loader will supply batches of input and target data which we’ll supply to our network and loss function respectively. Here’s the full training code:

# run the main training loop
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = Variable(data), Variable(target)
        # resize data from (batch_size, 1, 28, 28) to (batch_size, 28*28)
        data = data.view(-1, 28*28)
        # reset the gradients left over from the previous batch
        optimizer.zero_grad()
        net_out = net(data)
        loss = criterion(net_out, target)
        # back-propagate the loss and take one optimization step
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                           100. * batch_idx / len(train_loader), loss.data[0]))

The outer training loop runs over the number of epochs, whereas the inner training loop runs through the entire training set in batches whose size is specified in the code as batch_size. On the next line, we convert data and target into PyTorch variables. The MNIST input data set which is supplied in the torchvision package (which you’ll need to install using pip if you run the code for this tutorial) has the size (batch_size, 1, 28, 28) when extracted from the data loader – this 4D tensor is more suited to convolutional neural network architecture, and not so much our fully connected network. Therefore we need to flatten out the (1, 28, 28) data to a single dimension of 28 x 28 = 784 input nodes.
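If you want to see what the data loader hands back without downloading MNIST, the sketch below builds one from random tensors of the same shape. This is a stand-in, not the loader used in this tutorial: the images and labels are made up, and 1000 samples with a batch size of 200 are arbitrary choices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# random stand-ins for MNIST: 1000 "images" of shape (1, 28, 28) and integer labels 0-9
images = torch.rand(1000, 1, 28, 28)
labels = torch.randint(0, 10, (1000,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=200, shuffle=True)

data, target = next(iter(train_loader))
print(data.shape, target.shape)                       # one batch of inputs and targets
print(len(train_loader), len(train_loader.dataset))   # number of batches, number of samples
```

Note the difference between len(train_loader), the number of batches, and len(train_loader.dataset), the number of samples; the progress print-out in the training loop uses both.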

The .view() function operates on PyTorch variables to reshape them. If we want to be agnostic about the size of a given dimension, we can use the “-1” notation in the size definition. So by using data.view(-1, 28*28) we say that the second dimension must be equal to 28 x 28, but the first dimension should be calculated from the size of the original data variable. In practice, this means that data will now be of size (batch_size, 784). We can pass a batch of input data like this into our network and the magic of PyTorch will do all the hard work by efficiently performing the required operations on the tensors.
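As a quick illustration of the “-1” notation, here is a sketch on a zero-filled tensor shaped like one batch; the batch size of 200 is an assumption made for the example.

```python
import torch

data = torch.zeros(200, 1, 28, 28)   # shape of one batch from the loader
flat = data.view(-1, 28 * 28)        # -1 lets PyTorch infer the first dimension
print(flat.shape)
```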

On the next line, we run optimizer.zero_grad() – this zeroes / resets all the gradients in the model, so that it is ready to go for the next back propagation pass. In other libraries this is performed implicitly, but in PyTorch you have to remember to do it explicitly. Let’s single out the next two lines:

net_out = net(data)
loss = criterion(net_out, target)

The first line is where we pass the input data batch into the model – this will actually call the forward() method in our Net class. After this line is run, the variable net_out will hold the log softmax output of our neural network for the given data batch. That’s one of the great things about PyTorch: you can activate whatever normal Python debugger you usually use and instantly get a gauge of what is happening in your network. This is in contrast to other deep learning libraries such as TensorFlow and Keras, which require elaborate debugging sessions to be set up before you can check out what your network is actually producing. I hope you’ll play around with how useful this debugging is, by utilizing the code for this PyTorch tutorial here.

The second line is where we get the negative log likelihood loss between the output of our network and our target batch data.

Let’s look at the next two lines:

loss.backward()
optimizer.step()

The first line here runs a back-propagation operation from the loss Variable backwards through the network. If you compare this with our review of the .backward() operation that we undertook earlier in this PyTorch tutorial, you’ll notice that we aren’t supplying the .backward() operation with an argument. Scalar Variables, when we call .backward() on them, don’t require arguments – only non-scalar Variables require a matching sized tensor argument to be passed to the .backward() operation.

The next line is where we tell PyTorch to execute a gradient descent step based on the gradients calculated during the .backward() operation.

Finally, we print out some results every time we reach a certain number of iterations:

if batch_idx % log_interval == 0:
    print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                           100. * batch_idx / len(train_loader), loss.data[0]))

This print function shows our progress through the epochs and also gives the network loss at that point in the training. Note how you access the loss – you access the Variable .data property, which in this case will be a single valued array. We access the scalar loss by executing loss.data[0].
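A note if you are on PyTorch 0.4 or later: there the loss is a zero-dimensional tensor, and recent versions raise an error when you index it with [0]. The usual replacement is .item(), sketched below with a made-up loss value.

```python
import torch

loss = torch.tensor(0.25)   # made-up zero-dimensional loss value
value = loss.item()         # converts the single element to a plain Python float
print(type(value).__name__, value)
```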

Running this training loop you’ll get an output that looks something like this:

Train Epoch: 9 [52000/60000 (87%)] Loss: 0.015086
Train Epoch: 9 [54000/60000 (90%)] Loss: 0.030631
Train Epoch: 9 [56000/60000 (93%)] Loss: 0.052631
Train Epoch: 9 [58000/60000 (97%)] Loss: 0.052678

After 10 epochs, you should get a loss value below about 0.05.
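To see the four training steps (zero the gradients, forward pass, backward pass, optimizer step) working together without the MNIST download, here is a self-contained sketch. It assumes PyTorch 0.4 or later, uses nn.Sequential as a compact stand-in for the Net class, and trains on a single made-up batch, so the numbers it prints say nothing about real MNIST performance.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
# compact stand-in for the Net class: same layer sizes, built with nn.Sequential
net = nn.Sequential(
    nn.Linear(784, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10), nn.LogSoftmax(dim=1),
)
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
criterion = nn.NLLLoss()

data = torch.rand(64, 784)             # made-up batch, not real MNIST
target = torch.randint(0, 10, (64,))   # made-up labels

losses = []
for step in range(30):
    optimizer.zero_grad()                # clear gradients from the previous step
    loss = criterion(net(data), target)  # forward pass and loss
    loss.backward()                      # back-propagate from the scalar loss
    optimizer.step()                     # update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Because the same batch is reused on every step, the loss should drift downwards as the network starts to memorize it.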

Testing the network

To test the trained network on our test MNIST data set, we can run the following code:

# run a test loop
test_loss = 0
correct = 0
for data, target in test_loader:
    data, target = Variable(data, volatile=True), Variable(target)
    data = data.view(-1, 28 * 28)
    net_out = net(data)
    # sum up batch loss
    test_loss += criterion(net_out, target).data[0]
    pred = net_out.data.max(1)[1]  # get the index of the max log-probability
    correct += pred.eq(target.data).sum()
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

This loop is the same as the previous training loop up until the test_loss line – here we extract the network loss using the .data[0] property as before, but all in the same line. Next, we have the pred line, where the data.max(1) method is used – this .max() method can return the index of the maximum value in a certain dimension of a tensor. Now, the output of our neural network will be of size (batch_size, 10), where each value of the 10-length second dimension is a log probability which the network assigns to each output class (i.e. the log probability that the given image is each of the digits 0 to 9). So for each input sample/row in the batch, net_out.data will look something like this:

[-1.3106e+01, -1.6731e+01, -1.1728e+01, -1.1995e+01, -1.5886e+01, -1.7700e+01, -2.4950e+01, -5.9817e-04, -1.3334e+01, -7.4527e+00]


The value with the highest log probability is the digit that the network considers to be the most probable given the input image – this is the best prediction of the class from the network. In the example of net_out.data above, it is the value -5.9817e-04 which is maximum, which corresponds to the digit “7”. So for this sample, the predicted digit is “7”. The .max(1) function will determine this maximum value in the second dimension (if we wanted the maximum in the first dimension, we’d supply an argument of 0) and returns both the maximum values that it has found and the indices at which they were found. The result is therefore a pair of tensors, each of size (batch_size) – in this case we are interested in the indices, so we access them by calling .max(1)[1].
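Here is that lookup as a small sketch, using the example row above as a batch of one (and assuming PyTorch 0.4 or later for torch.tensor and .item()).

```python
import torch

net_out = torch.tensor([[-1.3106e+01, -1.6731e+01, -1.1728e+01, -1.1995e+01, -1.5886e+01,
                         -1.7700e+01, -2.4950e+01, -5.9817e-04, -1.3334e+01, -7.4527e+00]])
values, indices = net_out.max(1)     # maximum along the class dimension
print(values.shape, indices.shape)   # one entry per row of the batch
print(indices[0].item())             # position of the largest log-probability
```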

Now that we have the prediction of the neural network for each sample in the batch, we can compare this with the actual target class from our test data, and count how many times in the batch the neural network got it right. We can use the PyTorch .eq() function to do this, which compares the values in two tensors and, if they match, returns a 1. If they don’t match, it returns a 0:

correct += pred.eq(target.data).sum()

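By summing the output of the .eq() function, we get a count of the number of times the neural network has produced a correct output, and we take an accumulating sum of these correct predictions so that we can determine the overall accuracy of the network on our test data set. Note that nn.NLLLoss already averages over each batch by default, so dividing the summed batch losses by the number of test samples, as done below, understates the per-sample loss by roughly a factor of the batch size. Finally, after running through the test data in batches, we print out the averaged loss and accuracy: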

test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

After training the network for 10 epochs, we get the following output from the above code on the test data:

Test set: Average loss: 0.0003, Accuracy: 9783/10000 (98%)

A 98% accuracy – not bad!
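For readers on PyTorch 0.4 or later, the prediction and counting steps of the test loop can be sketched without Variable and the volatile flag, using torch.no_grad() instead. The network below is an untrained, made-up stand-in and the batch is random, so the count it prints is meaningless; the point is only the sequence of calls.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# hypothetical stand-in for the trained Net
net = nn.Sequential(nn.Linear(784, 10), nn.LogSoftmax(dim=1))
data = torch.rand(5, 784)                 # made-up batch
target = torch.tensor([7, 2, 1, 0, 4])    # made-up labels

with torch.no_grad():                     # no gradients are needed when testing
    net_out = net(data)
pred = net_out.max(1)[1]                  # index of the max log-probability per row
correct = pred.eq(target).sum().item()    # count matches as a plain Python int
print(correct, len(target))
```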

So there you have it – this PyTorch tutorial has shown you the basic ideas in PyTorch, from tensors to the autograd functionality, and finished with how to build a fully connected neural network using the nn.Module. I hope it was helpful. If you’d like to learn more about PyTorch, check out my post on Convolutional Neural Networks in PyTorch.
