How to create a TensorFlow deep learning powerhouse on Amazon AWS


In my previous tutorial on recurrent neural networks and LSTM networks in TensorFlow, we weren't able to get fantastic results. This is because I was running the code on my little ol' laptop CPU – not exactly the ideal setup for big deep learning networks. So what to do? I could fork out thousands on a specced-up desktop with NVIDIA GPUs, but, you know, I have a family and bills to pay. So the best option, I think, is to rent some GPU time on Amazon AWS. That's just what I did, and I'm going to give you a how-to guide below on how to do it. Then I'm going to run the sequence-to-sequence LSTM model that I created in TensorFlow, and show you the improvements. So let's get to it.

Recommended online course: If you are more of a video course learner, check out the following highly rated and inexpensive Udemy course, which covers deep learning concepts and how to deploy on Amazon AWS too: Modern Deep Learning in Python

Step 1 – Setup an Amazon AWS account and load up an instance

The first thing to do is to head over to Amazon AWS and create an account. You’ll need to supply some credit card details, as the computing power isn’t free – but we’ll be using a cheap option here, so it shouldn’t cost you too much if you want to follow along (a few dollars). At this stage, you may have to request, via Amazon AWS support, for them to free up an EC2 instance for you in your region. To do this, log into your Amazon AWS account and go to the dashboard. At the top of the window you’ll see a “Services” drop down – click this and select the Support link on the left hand side. Once you’ve clicked this, on the next page click “Create Case”, again on the left hand side menu.

On this page, next to the heading "Regarding", select "Service Limit Increase". Then, under "Limit Type", select "EC2 Instances". Select your closest region, and under "Primary Instance Type" select "p2.xlarge". Leave the "Limit" field as "Instance Limit", and put a "1" in the "New limit value" field. Put in a use case description, e.g. "Deep learning computing", then submit the case. Amazon AWS will then free up an instance for you to use, which might take a little while. If terms like "EC2 instance" and "p2.xlarge" don't make sense at this stage, don't worry – they are explained more fully later.

Once you’re done that, head over to this link. This page (see below) details a specifically setup Amazon Machine Instance (AMI) with all your favorite deep learning packages already loaded up – TensorFlow, Keras, PyTorch, CNTK, MXNet and more.

Amazon AWS TensorFlow how-to: AMI selection

Amazon AWS – deep learning AMI selection

Scroll down and check out the EC2 instances available and the hourly prices on the right hand side. EC2 instances are scalable cloud computing services offered by Amazon AWS, and there are lots of different machine arrangements to choose from. In this case, we want an instance type with at least one NVIDIA GPU. To proceed, select your appropriate region on the right hand side and then hit the continue button.

You’ll then be taken to a launch page that looks like:

Amazon AWS TensorFlow - AMI instance selection

AMI instance selection

Let’s go with the “1-Click Launch” option to make things nice and easy. Then, I’d suggest selecting the p2.xlarge EC2 instance under the “EC2 Instance Type” pane. This gives us 1 NVIDIA K80 GPU to play with. At the time of writing, this instance costs $1.54 / hour for an Asia Pacific (Sydney) deploy. Not too bad.

If you’re like me and haven’t done this before, scroll down to the bottom of the page and you’ll find this box:

Amazon AWS TensorFlow - key pair creation

Key pair creation

Expand the Key Pair pane and follow the instructions – this Key Pair is a security measure that performs the necessary encryption when you log on to your instance. Once you've done that, refresh the page, and make sure the selected region matches the one you created your Key Pair in. Once the region and Key Pair match, the "Launch with 1-click" button will become enabled, as shown below. Click this, and your instance will be created after a few minutes.

Amazon AWS TensorFlow - launch button enabled

Launch button enabled

Once you’ve hit the button above, you can go back to your Amazon AWS dashboard. Search or select the “EC2” service (under the “Compute” heading) in the AWS Services. This will take you to your EC2 dashboard, it should look something like this:

Amazon AWS TensorFlow - EC2 console

Amazon AWS EC2 console

Note that under the Resources heading, there should be "1 Running Instances" showing – this is your instance. To access your running AMI, select "Instances" on the left hand side. You'll then see your p2.xlarge instance up and running in the main pane. Select the "Connect" button. You'll be presented with a pop-up window, "Connect To Your Instance" – select either option. I'm using "A standalone SSH client" (PuTTY on Windows) – but you can choose whichever method you like to connect. Just follow the instructions Amazon AWS gives you to set up the connection.
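For reference, if you pick the standalone SSH client option from a Linux or macOS terminal instead of PuTTY, the connection looks something like the sketch below. The key file name and hostname are placeholders – the "Connect To Your Instance" dialog shows the actual values for your instance:

```shell
# The key file must not be readable by other users, or ssh will refuse it.
chmod 400 my-keypair.pem

# Placeholder hostname - use the public DNS shown in the connect dialog.
ssh -i my-keypair.pem ec2-user@ec2-xx-xx-xx-xx.ap-southeast-2.compute.amazonaws.com
```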

If you’re using PuTTY, there is one final step to allow you to properly use a Linux text manager and terminal multiplexer called Byobu. In your PuTTY program, before you connect, go to the settings menu on the left hand side. Under Connections – Data, in the field “Terminal-type string” enter “putty-256color”. This allows you to hit Ctrl-F2 in Windows to create multiple screens in Linux, which will let us monitor our GPU performance while training – this will be discussed later.

Once you’ve done that – you’re all connected! You should see a command prompt that looks like:

Amazon AWS TensorFlow - remote Linux console

AMI remote Linux console

Step 2 – Exploring the instance and loading up the code

The first thing you want to do when you have your instance running is update all the packages – you do this by running:

sudo yum upgrade

Next, let’s clone the Adventures in Machine Learning github repo by executing the following:

git clone

Let’s also install a Python package called gpustat that we will use to monitor how our Nvidia GPU on the Amazon AWS instance is going as we train our recurrent neural network. Run:

pip install gpustat

Ok, so we’re not too far off being able to run the code using the GPU. However, first we’ll want to be able to monitor the GPU as we train. To do this on a Linux machine we need two screens, and we can use the package mentioned earlier called byobu to do this. To install it, we first need to go back to the root or administration privilege of our instance. Run this:

sudo su -

Then to install byobu run this:

yum install byobu

Ok – now you can run byobu by simply typing “byobu” at the command prompt. To open up a new window, press Ctrl-F2. You’ll see this opens a new screen in your Linux session. To switch between the screens, press Ctrl-F3 and Ctrl-F4. Now, on one screen, we want to run the following to start a background monitoring process (which speeds up our gpustat package):

sudo nvidia-smi daemon

Then, on the same screen let’s setup our gpustat watch function, which will give us data about the GPU usage:

watch -n1.0 gpustat -cp

You should now see a utility printout with the GPU temperature, percentage usage and memory stats (see below for an example when we are actually running the code).

Now switch back to the other screen, using either Ctrl-F3 or Ctrl-F4.

Step 3 – Download the data and run

One final thing remains before we run the code – we first have to download the training data onto our instance. In the TensorFlow recurrent neural network tutorial we used a text dataset from the following link: You'll need to download and extract this tarfile – to do this, run the following:

curl | tar xvz
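The dataset URL has been lost from the command above, but the pattern is `curl <url> | tar xvz` – stream the tarball from the web straight into tar. The sketch below demonstrates the same pipe on a small, locally created archive (the file names here are made up for illustration; `-f -` just tells tar explicitly to read the archive from the pipe):

```shell
# Build a tiny stand-in archive so we can demonstrate the extraction pipe.
mkdir -p simple-examples/data
echo "sample training text" > simple-examples/data/ptb.train.txt
tar czf examples.tgz simple-examples
rm -r simple-examples                  # remove it so the pipe below recreates it

# Same pattern as "curl <url> | tar xvz": tar reads the gzipped
# stream from stdin and extracts it into the current directory.
cat examples.tgz | tar xvzf -
ls simple-examples/data                # the extracted training files
```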

You can navigate around this extracted file / folder by using the Linux commands ls (to list the contents of the current path) and cd (for change directory). You need to find the path to simple-examples/data/ – this is where our training data files are located. Once you’ve done this, we can finally run the following command to start training the LSTM network created in the aforementioned tutorial:

python 1 --data_path /home/ec2-user/data/simple-examples/data/

Once you run the above command, the program will start and, after printing some text data, it will begin to train the network (note: that is a double dash before "data_path"). After every 50 iterations, you can observe the loss, the accuracy on the training set, and the average time it took to execute each iteration. You'll see something like this:

Amazon AWS TensorFlow - GPU training times

Example output with GPU training times

As you can observe, each iteration takes an average of 0.14 seconds to execute. I've also run this on my own Intel i5 CPU, where the average iteration time is around 3 seconds – so we get a greater than 20-fold performance increase with a single Amazon AWS NVIDIA GPU. Not bad!
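A quick check of that speedup figure, using the timings quoted above (about 3 s per iteration on the CPU versus 0.14 s on the GPU):

```shell
# Speedup = CPU seconds per iteration / GPU seconds per iteration.
awk 'BEGIN { printf "%.1fx speedup\n", 3 / 0.14 }'   # prints: 21.4x speedup
```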

While it’s training, let’s take a look at what our GPU doing – hit Ctrl-F3 or Ctrl-F4 and you’ll return to your gpustat watch print-out. It should look something like this:

Amazon AWS TensorFlow - GPU status

GPU status while training the LSTM network

So here we can see that the GPU is running close to maximum capacity, at 81% utilization. Good to see!

WARNING: Remember, you have to shut down your instance in your EC2 console on Amazon AWS when you are finished. It's not enough to close your PuTTY session or similar – you have to go and shut down your instance on the AWS dashboard. If you don't, you'll keep getting charged per hour while the instance sits there doing nothing!
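If you have the AWS command line interface installed and configured on your local machine, you can also stop the instance from a terminal rather than clicking through the console. A sketch, with a placeholder instance ID – stopping halts the hourly compute charge, although any attached EBS storage still accrues its (much smaller) charge:

```shell
# List your instance IDs to find the right one:
aws ec2 describe-instances --query "Reservations[].Instances[].InstanceId"

# Stop the instance - the ID below is a placeholder, substitute your own.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
```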

I hope that’s been helpful and will let you get your own Amazon AWS deep learning instance up and running. Enjoy your faster model training!
