Recurrent neural networks and LSTM tutorial in Python and TensorFlow


In the deep learning journey so far on this website, I’ve introduced dense neural networks and convolutional neural networks (CNNs), which explain how to perform classification tasks on static images. We’ve seen good results, especially with CNNs. However, what happens if we want to analyze dynamic data? What about videos, voice recognition or sequences of text? There are ways to do some of this using CNNs, but the most popular method of performing classification and other analysis on sequences of data is recurrent neural networks. This tutorial will be a very comprehensive introduction to recurrent neural networks and a subset of such networks – long short-term memory (LSTM) networks. I’ll also show you how to implement such networks in TensorFlow – including the data preparation step. It’s going to be a long one, so settle in and enjoy these pivotal networks in deep learning – at the end of this post, you’ll have a very solid understanding of recurrent neural networks and LSTMs.

As always, all the code for this post can be found on this site’s Github repository.


Recommended online course: If you are more of a video course learner, I’d recommend this inexpensive Udemy course: Deep Learning: Recurrent Neural Networks in Python


An introduction to recurrent neural networks

A recurrent neural network, at its most fundamental level, is simply a type of densely connected neural network (for an introduction to such networks, see my tutorial). However, the key difference from a normal feedforward network is the introduction of time – in particular, the output of the hidden layer in a recurrent neural network is fed back into itself. Diagrams help here, so observe:

Recurrent neural network diagram with nodes shown

In the diagram above, we have a simple recurrent neural network with three input nodes. These input nodes are fed into a hidden layer, with sigmoid activations, as per any normal densely connected neural network. What happens next is what is interesting – the output of the hidden layer is then fed back into the same hidden layer. As you can see, the hidden layer outputs are passed through a conceptual delay block to allow the input of $\textbf{h}_{t-1}$ into the hidden layer. What is the point of this? Simply, the point is that we can now model time or sequence-dependent data.

A particularly good example of this is predicting text sequences.  Consider the following text string: “A girl walked into a bar, and she said ‘Can I have a drink please?’.  The bartender said ‘Certainly {}”. There are many options for what could fill in the {} symbol in the above string, for instance, “miss”, “ma’am” and so on. However, other words could also fit, such as “sir”, “Mister” etc. In order to get the correct gender of the noun, the neural network needs to “recall” that two previous words designating the likely gender (i.e. “girl” and “she”) were used. This type of flow of information through time (or sequence) in a recurrent neural network is shown in the diagram below, which unrolls the sequence:

Unrolled recurrent neural network

On the left-hand side of the above diagram, we have basically the same diagram as the first (the one which shows all the nodes explicitly). What the previous diagram neglected to show explicitly was that we in fact only ever supply finite length sequences to such networks – therefore we can unroll the network as shown on the right-hand side of the diagram above. This unrolled network shows how we can supply a stream of data to the recurrent neural network. For instance, first, we supply the word vector for “A” (more about word vectors later) to the network F – the output of the nodes in F is fed into the “next” network and also acts as a stand-alone output ($h_0$).  The next network (though it is really the same network) F at time t=1 takes the next word vector for “girl” and the previous output $h_0$ into its hidden nodes, producing the next output $h_1$, and so on.
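
To make this unrolling concrete, here is a minimal NumPy sketch of the recurrence described above. The dimensions (three input nodes, four hidden nodes) and random weights are assumed purely for illustration – it simply shows the same weights being reused at each time step:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy dimensions, assumed purely for illustration: 3 input nodes, 4 hidden nodes
np.random.seed(0)
U = np.random.randn(4, 3) * 0.1   # input-to-hidden weights
V = np.random.randn(4, 4) * 0.1   # hidden-to-hidden (recurrent) weights

x_seq = [np.random.randn(3) for _ in range(5)]   # five time steps of input
h = np.zeros(4)                                  # the initial hidden state

outputs = []
for x_t in x_seq:
    h = sigmoid(U @ x_t + V @ h)   # the same weights U and V are reused at every time step
    outputs.append(h)

print(len(outputs), outputs[0].shape)   # 5 hidden-state outputs, each of length 4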

As discussed above, the words themselves i.e. “A”, “girl” etc. aren’t input directly into the neural network. Neither are their one-hot vector type representations – rather, an embedding vector is used for each word. An embedding vector is an efficient vector representation of the word (often between 50-300 in length), which should maintain some meaning or context of the word. I won’t go into detail on word embeddings here, as I have covered them extensively in other posts – Word2Vec word embedding tutorial in Python and TensorFlow, A Word2Vec Keras tutorial and Python gensim Word2Vec tutorial with TensorFlow and Keras. It is an interesting topic and well worth the time investigating.

Now, back to recurrent neural networks themselves. Recurrent neural networks are very flexible. In the implementation shown above, we have a many-to-many model – in other words, we have the input sequence “A girl walked into a bar…” and many outputs – $h_0$ to $h_t$. We could also have multiple other configurations.  Another option is one-to-many i.e. supplying one input, say “girl” and predicting multiple outputs $h_0$ to $h_t$ (i.e. trying to generate sentences based on a single starting word). A further configuration is many-to-one i.e. supplying many words as input, like the sentence “A girl walked into a bar, and she said ‘Can I have a drink please?’.  The bartender said ‘Certainly {}” and predicting the next word i.e. {}. The diagram below shows an example one-to-many and many-to-one configuration, respectively (the words next to the outputs are the target words which we would supply during training).

Recurrent neural network – one-to-many configuration
Recurrent neural network – many-to-one configuration

There are also different many-to-many configurations that can be constructed – but you get the idea: recurrent neural networks are quite flexible. One last thing to note – the weights of the connections between time steps are shared i.e. there isn’t a different set of weights for each time step.

Now that you have a pretty good idea of what recurrent neural networks are, it is time to point out their dominant problem.

The problem with basic recurrent neural networks

Vanilla recurrent neural networks aren’t actually used very often in practice. Why? The main reason is the vanishing gradient problem. For recurrent neural networks, ideally, we would want to have long memories, so the network can connect data relationships at significant distances in time. That sort of network could make real progress in understanding how language and narrative works, how stock market events are correlated and so on. However, the more time steps we have, the more chance we have of back-propagation gradients either accumulating and exploding or vanishing down to nothing.

Consider the following representation of a recurrent neural network:

$$\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{Vh}_{t-1})$$

where $\textbf{U}$ and $\textbf{V}$ are the weight matrices connecting the inputs and the recurrent outputs respectively. We then often will perform a softmax of all the $\textbf{h}_t$ outputs (if we have some sort of many-to-many or one-to-many configuration). Notice, however, that if we go back three time steps in our recurrent neural network, we have the following:

$$\textbf{h}_t = \sigma(\textbf{Ux}_t + \textbf{V}\sigma(\textbf{Ux}_{t-1} + \textbf{V}\sigma(\textbf{Ux}_{t-2} + \dots)))$$

From the above you can see, as we work our way back in time, we are essentially adding deeper and deeper layers to our network. This causes a problem – consider the gradient of the error with respect to the weight matrix U during backpropagation through time, it looks something along the lines of this:

$$\frac{\partial E_3}{\partial U} = \frac{\partial E_3}{\partial out_3}\frac{\partial out_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial U}$$

The equation above is only a rough approximation of what is going on during backpropagation through time, but it will suffice for our purposes (for more on back-propagation, see my comprehensive neural networks tutorial). Each of these gradients will involve calculating the gradient of the sigmoid function. The problem with the sigmoid function occurs when the input values are such that the output is close to either 0 or 1 – at this point, the gradient is very small, see the plot below.

Sigmoid gradient

As you can observe, the values of the gradient (orange line) are always <0.25 and get very low when the output gets close to 0 or 1. What does this mean? It means that when you multiply many sigmoid gradients together, you are multiplying many values which are potentially much less than one – this leads to a vanishing gradient $\frac{\partial E}{\partial U}$. Because the gradient will become basically zero when dealing with many prior time steps, the weights won’t adjust to take into account these values, and therefore the network won’t learn relationships separated by significant periods of time. This makes vanilla recurrent neural networks not very useful.
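
To see this numerically, here is a quick illustrative snippet (not part of the model code) showing that the sigmoid gradient never exceeds 0.25, and how quickly a product of such gradients shrinks:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-6, 6, 13)
print(sigmoid_grad(z).max())          # ~0.25 - the largest the sigmoid gradient can ever be

# multiplying many such gradients together shrinks rapidly towards zero
print(np.prod(np.full(10, 0.25)))     # ~9.5e-07 after only 10 time steps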

We could use ReLU activation functions to reduce this problem, though not eliminate it. However, the most popular way of dealing with this issue in recurrent neural networks is by using long short-term memory (LSTM) networks, which will be introduced in the next section.

Introduction to LSTM networks

To reduce the vanishing (and exploding) gradient problem, and therefore allow deeper networks and recurrent neural networks to perform well in practical settings, there needs to be a way to reduce the multiplication of gradients which are less than one. The LSTM cell is a specifically designed unit of logic that will help reduce the vanishing gradient problem sufficiently to make recurrent neural networks more useful for long-term memory tasks i.e. text sequence predictions. The way it does so is by creating an internal memory state which is simply added to the processed input, which greatly reduces the multiplicative effect of small gradients. The time dependence and effects of previous inputs are controlled by an interesting concept called a forget gate, which determines which states are remembered or forgotten. Two other gates, the input gate and output gate, are also featured in LSTM cells.

Let’s first have a look at LSTM cells more carefully, then I’ll discuss how they help reduce the vanishing gradient problem.

The structure of an LSTM cell

The structure of a typical LSTM cell is shown in the diagram below:

LSTM cell diagram

The data flow is from left-to-right in the diagram above, with the current input $x_t$ and the previous cell output $h_{t-1}$ concatenated together and entering the top “data rail”. Here’s where things get interesting.

The input gate

First, the input is squashed between -1 and 1 using a tanh activation function. This can be expressed by:

$$g = tanh(b^g + x_tU^g + h_{t-1}V^g)$$

where $U^g$ and $V^g$ are the weights for the input and previous cell output, respectively, and $b^g$ is the input bias. Note that the superscript g is not a raised power, but rather signifies that these are the input weights and bias values (as opposed to the input gate, forget gate, output gate etc.).

This squashed input is then multiplied element-wise by the output of the input gate. The input gate is basically a hidden layer of sigmoid activated nodes, with weighted $x_t$ and $h_{t-1}$ input values, which outputs values between 0 and 1; when multiplied element-wise by the squashed input, these values determine which inputs are switched on and off. In other words, it is a kind of input filter or gate. The expression for the input gate is:

$$i = \sigma(b^i + x_tU^i + h_{t-1}V^i)$$

The output of the input stage of the LSTM cell can be expressed below, where the $\circ$ operator expresses element-wise multiplication:

$$g \circ i$$

As you can observe, the input gate output i acts as the weights for the squashed input g.  We now move onto the next stage of the LSTM cell – the internal state and the forget gate.

The internal state and the forget gate

This stage in the LSTM is where most of the magic happens. As can be observed, there is a new variable $s_t$ which is the inner state of the LSTM cell. This state is delayed by one time step and is ultimately added to the $g \circ i$ input to provide an internal recurrence loop to learn the relationship between inputs separated by time. Two things to notice – first, there is a forget gate here – this forget gate is again a sigmoid activated set of nodes which is element-wise multiplied by $s_{t-1}$ to determine which previous states should be remembered (i.e. forget gate output close to 1) and which should be forgotten (i.e. forget gate output close to 0). This allows the LSTM cell to learn appropriate context. Consider the sentence “Clare took Helen to Paris and she was very grateful” – for the LSTM cell to learn who “she” refers to, it needs to forget the subject “Clare” and replace it with the subject “Helen”. The forget gate can facilitate such operations and is expressed as:

$$f = \sigma(b^f + x_tU^f + h_{t-1}V^f)$$

The output of the element-wise product of the previous state and the forget gate is expressed as $s_{t-1} \circ f$. Again, the forget gate output acts as weights for the internal state. The second thing to notice about this stage is that the forget-gate-“filtered” state is simply added to the input, rather than multiplied by it, or mixed with it via weights and a sigmoid activation function as occurs in a standard recurrent neural network. This is important to reduce the issue of vanishing gradients. The output from this stage, $s_t$ is expressed by:

$$s_t = s_{t-1} \circ f + g \circ i$$

The final stage of the LSTM cell is the output gate.

The output gate

The final stage of the LSTM cell is the output gate. The output gate has two components – another tanh squashing function and an output sigmoid gating function. The output sigmoid gating function, like the other gating functions in the cell, is multiplied element-wise by the tanh-squashed state $s_t$ to determine which values of the state are output from the cell. As you can tell, the LSTM cell is very flexible, with gating functions controlling what is input, what is “remembered” in the internal state variable, and finally what is output from the LSTM cell.

The output gate is expressed as:

$$o = \sigma(b^o + x_tU^o + h_{t-1}V^o)$$

So the final output of the cell can be expressed as:

$$h_t = tanh(s_t) \circ o$$
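
To tie the equations above together, here is a minimal NumPy sketch of a single LSTM cell step. The sizes and random weights are assumed purely for illustration – it simply mirrors the g, i, f, o, $s_t$ and $h_t$ expressions given above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy sizes, assumed purely for illustration: input vectors of length 3, hidden size 4
np.random.seed(1)
n_in, n_hid = 3, 4
def w(shape):
    return np.random.randn(*shape) * 0.1

U_g, V_g, b_g = w((n_in, n_hid)), w((n_hid, n_hid)), np.zeros(n_hid)
U_i, V_i, b_i = w((n_in, n_hid)), w((n_hid, n_hid)), np.zeros(n_hid)
U_f, V_f, b_f = w((n_in, n_hid)), w((n_hid, n_hid)), np.ones(n_hid)   # forget bias starts at 1
U_o, V_o, b_o = w((n_in, n_hid)), w((n_hid, n_hid)), np.zeros(n_hid)

def lstm_step(x_t, h_prev, s_prev):
    g = np.tanh(b_g + x_t @ U_g + h_prev @ V_g)   # squashed input
    i = sigmoid(b_i + x_t @ U_i + h_prev @ V_i)   # input gate
    f = sigmoid(b_f + x_t @ U_f + h_prev @ V_f)   # forget gate
    o = sigmoid(b_o + x_t @ U_o + h_prev @ V_o)   # output gate
    s_t = s_prev * f + g * i                      # internal state - note the addition
    h_t = np.tanh(s_t) * o                        # cell output
    return h_t, s_t

h, s = np.zeros(n_hid), np.zeros(n_hid)
for x_t in [np.random.randn(n_in) for _ in range(5)]:
    h, s = lstm_step(x_t, h, s)
print(h.shape, s.shape)   # (4,) (4,)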

The next question is, how does the LSTM cell reduce the vanishing gradient problem?

Reducing the vanishing gradient problem

Recall before that the issue with vanilla recurrent neural networks is that calculating the gradient to update the weights involves cascading terms like:

$$\frac {\partial h_n}{\partial h_{n-1}} \frac {\partial h_{n-1}}{\partial h_{n-2}} \frac {\partial h_{n-2}}{\partial h_{n-3}} \cdots$$

This is a problem because of the sigmoid derivative, which is present in all of the partial derivatives above, being <0.25 (often greatly so). There is also a repeated product of the weights involved, so if they are consistently <1, we get a similar result – a vanishing gradient.

In an LSTM cell, the recurrency of the internal state of the LSTM cell involves, as shown above, an addition – like so:

$$s_t = s_{t-1} \circ f + g \circ i$$

If we take the partial derivative of this recurrency like we did above for a vanilla recurrent neural network, we find the following:

$$\frac{\partial s_t}{\partial s_{t-1}} = f$$

Notice that the $g \circ i$ term drops away and we are just left with a repeated multiplication of $f$. So for three time steps, we would have $f \circ f \circ f$. Notice that if the output of $f=1$, there will be no decay of the gradient. Generally, the bias of the sigmoid in $f$ is made large at the beginning of training so that $f$ starts out close to 1, meaning that all past input states will be “remembered” in the cell. During training, the forget gate will reduce or eliminate the memory of certain components of the state $s_{t-1}$.

This might be a bit confusing, so I’ll explain another way before we move on. Imagine if we let in a single input during the first time step, but then we block all future inputs (by setting the input gate to output zeros) and remember all previous states (by setting the forget gate to output ones). We would have a kind of circulating memory of $s_t$ which never decays i.e. $s_t$ = $s_{t-1}$. A back-propagated error “entering” this loop would also never decay. With the vanilla recurrent neural network, however, if we did the same thing our back-propagated error would be continuously degraded by the gradient of the activation function of the hidden nodes, and therefore eventually decay to zero.

Hopefully, that helps you to understand, at least in part, why LSTM cells are a great solution to the vanishing gradient problem, and therefore why they are currently used so extensively. Now, so far, we have been dealing with the data in the LSTM cells as if they were single values (i.e. scalars), however, in reality, they are tensors or vectors, and this can get confusing. So in the next section, I’ll spend a bit of time explaining the tensor sizes we can expect to be flowing around our unrolled LSTM networks.

The dimensions of data inside an LSTM cell

In the example code that is going to be discussed below, we are going to be performing text prediction. Now, as discussed in previous tutorials on the Word2Vec algorithm, words are input into neural networks using meaningful word vectors i.e. the word “cat” might be represented by, say, a 650 length vector. This vector is encoded in such a way as to capture some aspect of the meaning of the word (where meaning is usually construed as the context the word is usually found in). So each word input into our LSTM network below will be a 650 length vector. Next, because we will be inputting a sequence of words into our unrolled LSTM network, for each input row we will be inputting 35 of these word vectors. So the input for each row will be (35 x 650) in size. Finally, with TensorFlow, we can process batches of data via multi-dimensional tensors (to learn more about basic TensorFlow, see this TensorFlow tutorial). If we have a batch size of 20, our training input data will be (20 x 35 x 650). For future reference, the way I have presented the tensor size here (i.e. (20 x 35 x 650)) is called a “batch-major” arrangement, where the batch size is the first dimension of the tensor. We could also alternatively arrange the data in “time-major” format, which would be (35 x 20 x 650) – same data, just a different arrangement.

Now, the next thing to consider is that each of the input, forget and output gates, along with the inner state variable $s_t$ and the squashing functions, are not single functions with single/scalar weights. Rather, they comprise the hidden layer of the network and therefore include multiple nodes, connecting weights, bias values and so on. It is up to us to set the size of the hidden layer. The output from the unrolled LSTM network will, therefore, include the size of the hidden layer. With a hidden layer of size 650, a batch size of 20 and 35 time steps, the output from the unrolled LSTM network will be (20, 35, 650). Often, the output of an unrolled LSTM will be partially flattened and fed into a softmax layer for classification – so, for instance, the first two dimensions of the tensor are flattened to give a softmax layer input size of (700, 650). The output of the softmax is then matched against the expected training outputs during training. The diagram below shows all this:

LSTM sample many-to-many classifier

As can be observed in the architecture above (which we will be creating in the code below), it is possible to stack layers of LSTM cells on top of each other – this increases the model complexity and predictive power but at the expense of training times and difficulties. The architecture shown above is what we will implement in TensorFlow in the next section. Note the small batch size – this is to allow a more stochastic gradient descent which will avoid settling in local minima during many training iterations (see here).
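
As a quick sanity check of the shapes discussed above, here is an illustrative snippet using placeholder zeros rather than real data:

import numpy as np

batch_size, num_steps, hidden_size = 20, 35, 650
lstm_output = np.zeros((batch_size, num_steps, hidden_size))   # output of the unrolled LSTM

# flatten the batch and time dimensions before feeding the softmax layer
softmax_input = lstm_output.reshape(-1, hidden_size)
print(softmax_input.shape)   # (700, 650)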

Creating an LSTM network in TensorFlow

We are now going to create an LSTM network in TensorFlow. The code will loosely follow the TensorFlow team tutorial found here, but with updates and my own substantial modifications. The text dataset that will be used is the Penn Treebank (PTB) dataset, which is a common benchmarking corpus. As usual, all the code for this post can be found on the AdventuresinML Github site. To run this code, you’ll first have to download and extract the .tgz file from here. First off, we’ll go through the data preparation part of the code.

Preparing the data

This code will use, verbatim, the following functions from the previously mentioned TensorFlow tutorial: read_words, build_vocab and file_to_word_ids. I won’t go into these functions in detail, but basically, they first split the given text file into separate words and sentence-based tokens (i.e. end-of-sentence <eos>). Then, each unique word is identified and assigned a unique integer. Finally, the original text file is converted into a list of these unique integers, where each word is substituted with its new integer identifier. This allows the text data to be consumed in the neural network.

The code below shows how these functions are used in my code:

def load_data():
    # get the data paths (data_path is defined elsewhere in the script and points to the extracted PTB files)
    train_path = os.path.join(data_path, "ptb.train.txt")
    valid_path = os.path.join(data_path, "ptb.valid.txt")
    test_path = os.path.join(data_path, "ptb.test.txt")

    # build the complete vocabulary, then convert text data to list of integers
    word_to_id = build_vocab(train_path)
    train_data = file_to_word_ids(train_path, word_to_id)
    valid_data = file_to_word_ids(valid_path, word_to_id)
    test_data = file_to_word_ids(test_path, word_to_id)
    vocabulary = len(word_to_id)
    reversed_dictionary = dict(zip(word_to_id.values(), word_to_id.keys()))

    print(train_data[:5])
    print(word_to_id)
    print(vocabulary)
    print(" ".join([reversed_dictionary[x] for x in train_data[:10]]))
    return train_data, valid_data, test_data, vocabulary, reversed_dictionary

First, we simply setup the directory paths for the train, validation and test datasets respectively. Then, build_vocab() is invoked on the training data to create a dictionary that has each word as a key, and a unique integer as the associated value. Here is a sample of what the word_to_id dictionary looks like:

{'write-off': 7229, 'ports': 8314, 'fundamentals': 4478, 'toronto-based': 5034, 'head': 638, 'fairness': 6417, …

Next, we convert the text data for each file into a list of integers using the word_to_id dictionary. The first 5 items of the list train_data look like:

[9970, 9971, 9972, 9974, 9975]

I’ve also created a reverse dictionary which allows you to go the other direction – from a unique integer identifier to the corresponding word. This will be used later when we are reconstructing the outputs of our LSTM network back into plain English sentences.
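
As a toy illustration (with an assumed three-word vocabulary), the reverse lookup works like this:

word_to_id = {"the": 0, "cat": 1, "sat": 2}   # toy vocabulary, assumed for illustration only
reversed_dictionary = dict(zip(word_to_id.values(), word_to_id.keys()))
print(" ".join(reversed_dictionary[i] for i in [0, 1, 2]))   # "the cat sat"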

The next step is to develop an input data pipeline that allows the extraction of batches of data in an efficient manner.

Creating an input data pipeline

As discussed in my TensorFlow queues and threads tutorial, the use of a feed dictionary to supply data to your model during training, while common in tutorials, is not efficient – as can be read here on the TensorFlow site. Rather, it is more efficient to use TensorFlow queues and threading. Note that there is a newer way of doing things, using the Dataset API, which won’t be used in this tutorial, but I will perhaps update the post in the future to include it. I’ve packaged up this code in a function called batch_producer – this function extracts batches of x, y training data, where each batch is formatted as sequences of time-stepped text data. The y batch is the same data, except delayed by one time step. So, for instance, a single x, y sample in a batch, with the number of time steps being 8, looks like:

  • x = “A girl walked into a bar, and she”
  • y = “girl walked into a bar, and she said”

Remember that x and y will be batches of integer data, with the size (batch_size, num_steps), not text as shown above – however, I have shown the above x and y sample in text form to aid understanding. So, as demonstrated in the model architecture diagram above, we are producing a many-to-many LSTM model, where the model will be trained to predict the very next word in the sequence for each word in the number of time steps.

Here’s what the code looks like:

def batch_producer(raw_data, batch_size, num_steps):
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)

    data_len = tf.size(raw_data)
    batch_len = data_len // batch_size
    data = tf.reshape(raw_data[0: batch_size * batch_len],
                      [batch_size, batch_len])

    epoch_size = (batch_len - 1) // num_steps

    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    x = data[:, i * num_steps:(i + 1) * num_steps]
    x.set_shape([batch_size, num_steps])
    y = data[:, i * num_steps + 1: (i + 1) * num_steps + 1]
    y.set_shape([batch_size, num_steps])
    return x, y

In the code above, first, the raw text data is converted into an int32 tensor. Next, the length of the full data set is calculated and stored in data_len and this is then divided by the batch size in an integer division (//) to get the number of full batches of data available within the dataset. The next line reshapes the raw_data tensor (restricted in size to the number of full batches of data i.e. 0 to batch_size * batch_len) into a (batch_size, batch_len) shape. The next line sets the number of iterations in each epoch – usually, this is set so that all the training data is passed through the algorithm in each epoch. This is what occurs here – the number of batches in the data (batch_len) is integer divided by the number of time steps – this gives the number of time-step-sized batches that are available to be iterated through in a single epoch.

The next line sets up an input range producer queue – this is a simple queue which allows the asynchronous and threaded extraction of data batches from a pre-existing dataset. For more on threads and queues, check out my tutorial. Basically, each time more data is required in the training of the model, a new integer is extracted between 0 and epoch_size – this is then used in the following lines to extract a batch of data asynchronously from the data tensor. With the shuffle argument set to False, this integer simply cycles from 0 to epoch_size and then resets back at 0 to repeat.

To produce the x, y batches of data, data slices are extracted from the data tensor based on the dequeued integer i. To see how this works, it is easier to imagine a dummy dataset of the integers 1 to 20 – [1, 2, 3, 4, 5, 6, …, 19, 20]. Let’s say we set the batch size to 3, and the number of steps to 2. The variables batch_len and epoch_size will therefore be equal to 6 and 2, respectively. The dummy reshaped data will look like:

$$\begin{bmatrix}
1 & 2 & 3 & 4 & 5 & 6 \\
7 & 8 & 9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16 & 17 & 18 \\
\end{bmatrix}$$

For the first data batch extraction, i = 0, therefore the extracted x for our dummy dataset will be data[:, 0:2]:

$$\begin{bmatrix}
1 & 2\\
7 & 8\\
13 & 14\\
\end{bmatrix}$$

The extracted y will be data[:, 1:3]:

$$\begin{bmatrix}
2 & 3\\
8 & 9\\
14 & 15\\
\end{bmatrix}$$

As can be observed, each row of the extracted x and y tensors is an individual sample of length num_steps, and the number of rows is the batch size. By organizing the data in this fashion, it is straightforward to extract batch data while still maintaining the correct sentence sequence within each data sample.
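
If you want to verify the dummy example above, a few lines of NumPy reproduce the same reshaping and slicing:

import numpy as np

raw = np.arange(1, 21)                 # the dummy dataset of integers 1 to 20
batch_size, num_steps = 3, 2
batch_len = raw.size // batch_size     # 6
data = raw[:batch_size * batch_len].reshape(batch_size, batch_len)

i = 0                                  # the first dequeued index
x = data[:, i * num_steps:(i + 1) * num_steps]
y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
print(x)   # [[ 1  2] [ 7  8] [13 14]]
print(y)   # [[ 2  3] [ 8  9] [14 15]]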

Creating the model

In this code example, in order to have nice encapsulation and better-looking code, I’ll be building the model in Python classes. The first class is a simple class that contains the input data:

class Input(object):
    def __init__(self, batch_size, num_steps, data):
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
        self.input_data, self.targets = batch_producer(data, batch_size, num_steps)

We pass this object important input data information such as batch size, the number of recurrent time steps and finally the raw data file we wish to extract batch data from. The previously explained batch_producer function, when called, will return our input data batch x and the associated time step + 1 target data batch, y.

The next step is to create our LSTM model. Again, I’ve used a Python class to hold all the information and TensorFlow operations:

# create the main model
class Model(object):
    def __init__(self, input, is_training, hidden_size, vocab_size, num_layers,
                 dropout=0.5, init_scale=0.05):
        self.is_training = is_training
        self.input_obj = input
        self.batch_size = input.batch_size
        self.num_steps = input.num_steps
        self.hidden_size = hidden_size

The first part of initialization is pretty self-explanatory, with the input data information and batch producer operation found in input_obj. Another important input is the boolean is_training – this allows the model instance to be created either as a model setup for training, or alternatively setup for validation or testing only.

# create the word embeddings
with tf.device("/cpu:0"):
    embedding = tf.Variable(tf.random_uniform([vocab_size, self.hidden_size], -init_scale, init_scale))
    inputs = tf.nn.embedding_lookup(embedding, self.input_obj.input_data)

The block of code above creates the word embeddings. As previously discussed and shown in my tutorial, word embedding creates meaningful vectors to represent each word. First, we initialize the embedding variable with size (vocab_size, hidden_size) which creates the “lookup table” where each row represents a word in the dataset, and the set of columns is the embedding vector. In this case, our embedding vector length is set equal to the size of our LSTM hidden layer.

The next line performs a lookup action on the embedding tensor, where each word in the input data set is matched with a row in the embedding tensor, with the matched embedding vector being returned within inputs.
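
If it helps to see the lookup outside of TensorFlow, here is a rough NumPy analogue with assumed toy sizes – each word id simply selects a row of the embedding table:

import numpy as np

vocab_size, hidden_size = 10, 4        # toy sizes, assumed for illustration only
embedding = np.random.uniform(-0.05, 0.05, (vocab_size, hidden_size))

input_data = np.array([[1, 5, 2],      # a (batch_size=2, num_steps=3) batch of word ids
                       [0, 3, 9]])
inputs = embedding[input_data]         # each word id is replaced by its embedding row
print(inputs.shape)                    # (2, 3, 4)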

In this model, the embedding layer / vectors will be learned during the model training – however, if we so desired, we could also pre-learn embedding vectors using another model and upload these into our models. I’ve shown how to do this in my gensim tutorial if you want to check it out.

The next step adds a drop-out wrapper to the input data – this helps prevent overfitting by continually changing the structure of the network connections:

if is_training and dropout < 1:
    inputs = tf.nn.dropout(inputs, dropout)

Creating the LSTM network

The next step is to setup the initial state TensorFlow placeholder. This placeholder will be loaded with the initial state of the LSTM cells for each training batch. At the beginning of each training epoch, the input data will reset to the beginning of the text data set, so we want to reset the state variables to zero. However, during the multiple training batches executed in each epoch, we want to load the final state variables from the previous training batch into our LSTM cells for the current training batch. This keeps a certain continuity of state in our model, as we are progressing linearly through our text data set. We define the placeholder by:

# set up the state storage / extraction
self.init_state = tf.placeholder(tf.float32, [num_layers, 2, self.batch_size, self.hidden_size])

The second argument to the placeholder function is the size of the variable – (num_layers, 2, batch_size, hidden_size) and requires some explanation. If we consider an individual LSTM cell, for each training sample it processes it has two other inputs – the previous output from the cell ($h_{t-1}$) and the previous state variable ($s_{t-1}$). These two inputs, h and s, are what is required to load the full state data into an LSTM cell. Remember also that h and s for each sample are actually vectors with the size equal to the hidden layer size. Therefore, for all the samples in the batch, for a single LSTM cell we have state data required of shape (2, batch_size, hidden_size). Finally, if we have stacked LSTM cell layers, we need state variables for each layer – num_layers. This gives the final shape of the state variables: (num_layers, 2, batch_size, hidden_size).
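
As a quick illustration of that shape (placeholder zeros only – the real values are loaded during training):

import numpy as np

num_layers, batch_size, hidden_size = 2, 20, 650
init_state = np.zeros((num_layers, 2, batch_size, hidden_size))   # dimension 2 holds (s, h) per layer
print(init_state.shape)   # (2, 2, 20, 650)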

The next two steps involve setting up this state data variable in the format required to feed it into the TensorFlow LSTM data structure:

state_per_layer_list = tf.unstack(self.init_state, axis=0)
rnn_tuple_state = tuple(
            [tf.contrib.rnn.LSTMStateTuple(state_per_layer_list[idx][0], state_per_layer_list[idx][1])
             for idx in range(num_layers)]
        )

The TensorFlow LSTM cell can accept the state as a tuple if a flag is set to True (more on this later). The tf.unstack command creates a number of tensors, each of shape (2, batch_size, hidden_size), from the init_state tensor, one for each stacked LSTM layer (num_layers). These tensors are then loaded into a specific TensorFlow data structure, LSTMStateTuple, which is the format required for input into the LSTM cells.

Next, we create an LSTM cell which will be “unrolled” over the number of time steps. Following this, we apply a drop-out wrapper to again protect against overfitting. Notice that we set the forget bias values to be equal to 1.0, which helps guard against repeated low forget gate outputs causing vanishing gradients, as explained above:

# create an LSTM cell to be unrolled
cell = tf.contrib.rnn.LSTMCell(hidden_size, forget_bias=1.0)
# add a dropout wrapper if training
if is_training and dropout < 1:
    cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=dropout)

Next, if we include many layers of stacked LSTM cells in the model, we need to use another TensorFlow object called MultiRNNCell which performs the requisite cell stacking / layering:

if num_layers > 1:
    cell = tf.contrib.rnn.MultiRNNCell([cell for _ in range(num_layers)], state_is_tuple=True)

Note that we tell MultiRNNCell to expect the state variables in the form of a LSTMStateTuple by setting the flag state_is_tuple to True.

The final step in creating the LSTM network structure is to create a dynamic RNN object in TensorFlow. This object will dynamically perform the unrolling of the LSTM cell over each time step.

output, self.state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32, initial_state=rnn_tuple_state)

The dynamic_rnn object takes our defined LSTM cell as the first argument, and the embedding vector tensor inputs as the second argument. The final argument, initial_state is where we load our time-step zero state variables, that we created earlier, into the unrolled LSTM network.

This operation creates two outputs, the first is the output from all the unrolled LSTM cells, and will have a shape of (batch_size, num_steps, hidden_size). This data will be flattened in the next step to feed into a softmax classification layer. The second output, state, is the (s, h) state tuple taken from the final time step of the LSTM cells. This state operation / tuple will be extracted during each batch training operation to be used as inputs (via init_state) into the next training batch.

Creating the softmax, loss and optimizer operations

Next we have to flatten the outputs so that we can feed them into our proposed softmax classification layer. We can use the -1 notation to reshape our output tensor, with the second axis set to be equal to the hidden layer size:

# reshape to (batch_size * num_steps, hidden_size)
output = tf.reshape(output, [-1, hidden_size])

Next we setup our softmax weight variables and the standard $xw+b$ operation:

softmax_w = tf.Variable(tf.random_uniform([hidden_size, vocab_size], -init_scale, init_scale))
softmax_b = tf.Variable(tf.random_uniform([vocab_size], -init_scale, init_scale))
logits = tf.nn.xw_plus_b(output, softmax_w, softmax_b)

Note that the logits operation is simply the output of our tensor multiplication – we haven’t yet added the softmax operation – this will occur in the loss calculations below (and also in our ancillary accuracy calculations).

Following this, we have to setup our loss or cost function which will be used to train our LSTM network. In this case, we will use the specialized TensorFlow sequence to sequence loss function. This loss function allows one to calculate (a potentially) weighted cross entropy loss over a sequence of values. The first argument to this loss function is the logits argument, which requires tensors with the shape (batch_size, num_steps, vocab_size) – so we’ll need to reshape our logits tensor. The second argument to the loss function is the targets tensor which has a shape (batch_size, num_steps) with each value being an integer (which corresponds to a unique word in our case) – in other words, this tensor contains the true values of the word sequence that we want our LSTM network to predict. The third important argument is the weights tensor, of shape (batch_size, num_steps), which allows you to weight different samples or time steps with respect to the loss i.e. you might want the loss to favor the latter time steps rather than the earlier ones. No weighting is applied in this model, so a tensor of ones is passed to this argument.

# Reshape logits to be a 3-D tensor for sequence loss
logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])

# Use the contrib sequence loss and average over the batches
loss = tf.contrib.seq2seq.sequence_loss(
            logits,
            self.input_obj.targets,
            tf.ones([self.batch_size, self.num_steps], dtype=tf.float32),
            average_across_timesteps=False,
            average_across_batch=True)
# Update the cost
self.cost = tf.reduce_sum(loss)

There are two more important arguments for this function – average_across_timesteps and average_across_batch. If average_across_timesteps is set to True, the cost will be averaged across the time dimension; if average_across_batch is True, the cost will be averaged across the batch dimension. In this case, we are favoring the latter option.

Finally, we produce the cost operation which reduces the loss to a single scalar value – we could also do something similar by setting average_across_timesteps to True – however, I am keeping things consistent with the TensorFlow tutorial.

In the next few steps, we set up some operations to calculate the accuracy of predictions over the batch samples:

# get the prediction accuracy
self.softmax_out = tf.nn.softmax(tf.reshape(logits, [-1, vocab_size]))
self.predict = tf.cast(tf.argmax(self.softmax_out, axis=1), tf.int32)
correct_prediction = tf.equal(self.predict, tf.reshape(self.input_obj.targets, [-1]))
self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

First we apply a softmax operation to get the predicted probabilities of each word for each output of the LSTM network. We then make the network predictions equal to those words with the highest softmax probability by using the argmax function. These predictions are then compared to the actual target words and then averaged to get the accuracy.

Now we move onto constructing the optimization operations – in this case we aren’t using a simple “out of the box” optimizer – rather we are doing a few manipulations to improve results:

if not is_training:
    return
self.learning_rate = tf.Variable(0.0, trainable=False)

tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), 5)
optimizer = tf.train.GradientDescentOptimizer(self.learning_rate)
self.train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            global_step=tf.contrib.framework.get_or_create_global_step())

First off, if the model has been created for predictions, validations or testing only, these operations do not need to be created. The first step if the model is being used for training, is to create a learning rate variable. This will be used so that we can decrease the learning rate during training – this improves the final outcome of the model.

Next, we wish to clip the size of the gradients in our network during back-propagation – this is recommended in recurrent neural networks to improve outcomes. Clipping values of between 1 and 5 are commonly used. Finally, we create the optimizer operation, using the learning_rate variable, and apply the clipped gradients – assigning this gradient descent step to train_op. This operation, train_op, will be called for each training batch.
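
If the clipping step seems opaque, here is a rough sketch of what global-norm clipping does with some toy gradient values – this mimics, rather than calls, tf.clip_by_global_norm:

import numpy as np

grads = [np.array([3.0, 4.0]), np.array([12.0])]            # toy gradients for two variables
clip_norm = 5.0
global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))   # sqrt(9 + 16 + 144) = 13
scale = clip_norm / max(global_norm, clip_norm)             # only scales if the norm exceeds 5
clipped = [g * scale for g in grads]
print(global_norm, [np.round(g, 2) for g in clipped])       # 13.0 [array([1.15, 1.54]), array([4.62])]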

The final two lines of the model creation involve the updating of the learning_rate:

self.new_lr = tf.placeholder(tf.float32, shape=[])
self.lr_update = tf.assign(self.learning_rate, self.new_lr)

First, a placeholder is created which will be input via the feed_dict argument when running the training, new_lr. This new learning rate is then assigned to learning_rate via a tf.assign operation. This operation, lr_update, will be run at the beginning of each epoch.

Now that the model structure is fully created, we can move on to the training loops:

Training the LSTM model

The training function will take as input the training data, along with various model parameters (batch sizes, number of steps etc.). The first part of the function looks like:

def train(train_data, vocabulary, num_layers, num_epochs, batch_size, model_save_name,
          learning_rate=1.0, max_lr_epoch=10, lr_decay=0.93):
    # setup data and models
    training_input = Input(batch_size=batch_size, num_steps=35, data=train_data)
    m = Model(training_input, is_training=True, hidden_size=650, vocab_size=vocabulary,
              num_layers=num_layers)
    init_op = tf.global_variables_initializer()

First we create an Input object instance and a Model object instance, passing in the necessary parameters. Because the TensorFlow graph is being created during the initialization of these objects, the TensorFlow global variable initializer operation can only be properly run after the creation of these instances.

orig_decay = lr_decay
with tf.Session() as sess:
    # start threads
    sess.run([init_op])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    saver = tf.train.Saver()

Next we start the session, and run the variable initializer operation. Because we are using queuing in the Input object, we also need to create a thread coordinator and start the running of the threads (for more information, see this tutorial). If you skip this step, or put it before the creation of training_input, your program will hang. Finally, a saver instance is created as we want to store model training checkpoints and the final trained model.

Next, the epochal training loop is entered into:

for epoch in range(num_epochs):
    new_lr_decay = orig_decay ** max(epoch + 1 - max_lr_epoch, 0.0)
    m.assign_lr(sess, learning_rate * new_lr_decay)
    current_state = np.zeros((num_layers, 2, batch_size, m.hidden_size))
    for step in range(training_input.epoch_size):
        if step % 50 != 0:
            cost, _, current_state = sess.run([m.cost, m.train_op, m.state],
                                                             feed_dict={m.init_state: current_state})
        else:
            cost, _, current_state, acc = sess.run([m.cost, m.train_op, m.state, m.accuracy],
                                                      feed_dict={m.init_state: current_state})
            print("Epoch {}, Step {}, cost: {:.3f}, accuracy: {:.3f}".format(epoch, step, cost, acc))
    # save a model checkpoint
    saver.save(sess, data_path + '\\' + model_save_name, global_step=epoch)
# do a final save
saver.save(sess, data_path + '\\' + model_save_name + '-final')
# close threads
coord.request_stop()
coord.join(threads)

The first step in every epoch is to calculate the learning rate decay factor, which gradually decreases after max_lr_epoch number of epochs has been reached. This learning rate decay factor, new_lr_decay, is multiplied by the learning rate and assigned to the model by calling the Model method assign_lr. This method looks like:

def assign_lr(self, session, lr_value):
    session.run(self.lr_update, feed_dict={self.new_lr: lr_value})

As can be observed, this function simply runs the lr_update operation which was explained in the prior section.
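
If you want to sanity check the decay schedule, a few lines reproduce the decayed learning rates for the default parameters used in the train function above:

learning_rate, max_lr_epoch, lr_decay = 1.0, 10, 0.93   # the default parameters used above
for epoch in range(15):
    new_lr_decay = lr_decay ** max(epoch + 1 - max_lr_epoch, 0.0)
    print(epoch, round(learning_rate * new_lr_decay, 4))
# epochs 0 to 9 stay at 1.0; from epoch 10 the rate decays: 0.93, 0.8649, 0.8044, ...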

The next step is to create a zeroed initial state tensor for our LSTM model – we assign this zeroed tensor to the variable current_state. Then each training operation is looped through within our specified epoch size. Every iteration we run the m.cost, m.train_op and m.state operations. The train_op operation, as previously shown, calculates the clipped gradients of the model and takes a batched step to minimize the cost. The state operation returns the state of the final unrolled LSTM cell, which we will require as the input state for the next training batch – note that it replaces the contents of the current_state variable. This current_state variable is inserted into the m.init_state placeholder via the feed_dict.

Every 50 iterations we also extract the current cost of the model in training, as well as the accuracy against the current training batch, to provide printed feedback during training. The outputs look like this:

Epoch 9, Step 1850, cost: 96.185, accuracy: 0.198
Epoch 9, Step 1900, cost: 94.755, accuracy: 0.235

Finally, at the end of each epoch, we use the saver object to save a model checkpoint, and finally at the end of the training a final save of the state of the model is performed.

Expected training outcomes

The expected cost and accuracy progress through the epochs depends on the multitude of parameters supplied to the models and also the results of the random initialization of the variables. Training time is also dependent on whether you are using only CPUs, or whether you are using GPUs too (note, I have not tested the code on the Github repository with GPUs).

My model achieved an average cost and training batch accuracy on the order of 110-120 and 30%, respectively, after 38 epochs with the following parameters:

Hidden size: 650, Number of steps: 35, Initialization scale: 0.05, Batch size: 20, Number of stacked LSTM layers: 2, Keep probability / dropout: 0.5

You are probably thinking the accuracy isn’t very high, and you are correct, however further training and a larger hidden layer would provide better final accuracy values. To perform further training on a larger network you really need to be using GPUs to accelerate the training – I’ll do this in a future post and present the results.

Testing the model

To test the model on the test or validation data, I’ve created another function called test which looks like so:

def test(model_path, test_data, reversed_dictionary):
    # vocabulary (the vocab size returned by load_data) is assumed to be available in scope
    test_input = Input(batch_size=20, num_steps=35, data=test_data)
    m = Model(test_input, is_training=False, hidden_size=650, vocab_size=vocabulary,
              num_layers=2)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # start threads
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        current_state = np.zeros((2, 2, m.batch_size, m.hidden_size))
        # restore the trained model
        saver.restore(sess, model_path)
        # get an average accuracy over num_acc_batches
        num_acc_batches = 30
        check_batch_idx = 25
        acc_check_thresh = 5
        accuracy = 0
        for batch in range(num_acc_batches):
            if batch == check_batch_idx:
                true_vals, pred, current_state, acc = sess.run([m.input_obj.targets, m.predict, m.state, m.accuracy],
                                                               feed_dict={m.init_state: current_state})
                pred_string = [reversed_dictionary[x] for x in pred[:m.num_steps]]
                true_vals_string = [reversed_dictionary[x] for x in true_vals[0]]
                print("True values (1st line) vs predicted values (2nd line):")
                print(" ".join(true_vals_string))
                print(" ".join(pred_string))
            else:
                acc, current_state = sess.run([m.accuracy, m.state], feed_dict={m.init_state: current_state})
            if batch >= acc_check_thresh:
                accuracy += acc
        print("Average accuracy: {:.3f}".format(accuracy / (num_acc_batches-acc_check_thresh)))
        # close threads
        coord.request_stop()
        coord.join(threads)

We start with creating an Input and Model class that matches our training Input and Model classes. It is important that key parameters match the training model, such as the hidden size, number of steps, batch size etc. We are going to load our saved model variables into the computational graph created by the test Model instance, and if the dimensions don’t match TensorFlow will throw an error.

Next we create a tf.train.Saver() operation – this will load all our saved model variables into our test model when we run the line saver.restore(sess, model_path). After dealing with all of the threads and creating a zeroed state variable, we setup some variables which relate to how we are going to assess the accuracy and look at some specific instances of predicted strings. Because we have to “warm up” the model by feeding it some data to get good state variables, we only measure the accuracy after a certain number of batches i.e. acc_check_thresh.

When the batch number is equal to check_batch_idx the code runs the m.predict operation to extract the predictions for the particular batch of data. The first prediction of the batch is passed through the reverse dictionary to convert them back to actual words (along with the batch target words) and then compared with what should have been predicted via printing.

Using the trained model, we can see the following output:

True values (1st line) vs predicted values (2nd line):
stock market is headed many traders were afraid to trust stock prices quoted on the big board <eos> the futures halt was even <unk> by big board floor traders <eos> it <unk> things up said
market market is n’t for traders say willing to buy the prices <eos> <eos> the big board <eos> the dow market is a worse <eos> the board traders traders <eos> the ‘s the to to
Average accuracy: 0.283

The accuracy isn’t fantastic, but you can see the network is matching the “gist” of the sentence i.e. not producing all of the exact words but matching the general subject matter. As I mentioned above, in a future post I’ll present the data from a model trained for longer using GPUs.

I hope you enjoyed the post – it’s been a long one, but I hope that this gives you a solid foundation in understanding recurrent neural networks and LSTMs and how to implement them in TensorFlow.


Recommended online course: If you are more of a video course learner, I’d recommend this inexpensive Udemy course: Deep Learning: Recurrent Neural Networks in Python


 
