In the deep learning journey so far on this website, I’ve introduced dense neural networks and convolutional neural networks (CNNs), which explain how to perform classification tasks on static images. We’ve seen good results, especially with CNNs. However, what happens if we want to analyze dynamic data? What about videos, voice recognition or sequences of text? There are ways to do some of this using CNNs, but the most popular method of performing classification and other analysis on *sequences* of data is recurrent neural networks. This tutorial will be a comprehensive introduction to recurrent neural networks and a subset of such networks – long short-term memory networks (or LSTM networks). I’ll also show you how to implement such networks in TensorFlow – including the data preparation step. It’s going to be a long one, so settle in and enjoy these pivotal networks in deep learning – at the end of this post, you’ll have a very solid understanding of recurrent neural networks and LSTMs. By the way, if you’d like to learn how to build LSTM networks in Keras, see this tutorial.

As always, all the code for this post can be found on this site’s Github repository.

# An introduction to recurrent neural networks

A recurrent neural network, at its most fundamental level, is simply a type of densely connected neural network (for an introduction to such networks, see my tutorial). However, the key difference from normal feed forward networks is the introduction of *time* – in particular, the output of the hidden layer in a recurrent neural network is *fed back into itself*. Diagrams help here, so observe:

In the diagram above, we have a simple recurrent neural network with three input nodes. These input nodes are fed into a hidden layer, with sigmoid activations, as per any normal densely connected neural network. What happens next is what is interesting – the output of the hidden layer is then *fed back* into the same hidden layer. As you can see, the hidden layer outputs are passed through a conceptual *delay* block to allow the input of $\textbf{h}^{t-1}$ into the hidden layer. What is the point of this? Simply, the point is that we can now model *time* or sequence-dependent data.

A particularly good example of this is predicting text sequences. Consider the following text string: “A girl walked into a bar, and she said ‘Can I have a drink please?’. The bartender said ‘Certainly {}”. There are many options for what could fill in the {} symbol in the above string, for instance, “miss”, “ma’am” and so on. However, other words could also fit, such as “sir”, “Mister” etc. In order to get the correct gender of the noun, the neural network needs to “recall” that two previous words designating the likely gender (i.e. “girl” and “she”) were used. This type of flow of information through time (or sequence) in a recurrent neural network is shown in the diagram below, which *unrolls* the sequence:

On the left-hand side of the above diagram, we have basically the same diagram as the first (the one which shows all the nodes explicitly). What the previous diagram neglected to show explicitly was that we in fact only ever supply finite length sequences to such networks – therefore we can *unroll* the network as shown on the right-hand side of the diagram above. This unrolled network shows how we can supply a stream of data to the recurrent neural network. For instance, first, we supply the word vector for “A” (more about word vectors later) to the network *F* – the outputs of the nodes in *F* are fed into the “next” network and also act as a stand-alone output ($h_0$). The next network (though it is really the same network) *F* at time *t=1* takes the next word vector for “girl” and the previous output $h_0$ into its hidden nodes, producing the next output $h_1$ and so on.

As discussed above, the words themselves i.e. “A”, “girl” etc. aren’t input directly into the neural network. Neither are their one-hot vector type representations – rather, an embedding vector is used for each word. An embedding vector is an efficient vector representation of the word (often between 50-300 in length), which should maintain some meaning or context of the word. I won’t go into detail on word embeddings here, as I have covered them extensively in other posts – Word2Vec word embedding tutorial in Python and TensorFlow, A Word2Vec Keras tutorial and Python gensim Word2Vec tutorial with TensorFlow and Keras. It is an interesting topic and well worth the time investigating.

Now, back to recurrent neural networks themselves. Recurrent neural networks are very flexible. In the implementation shown above, we have a many-to-many model – in other words, we have the input sequence “A girl walked into a bar…” and many outputs – $h_0$ to $h_t$. We could also have multiple other configurations. Another option is one-to-many i.e. supplying one input, say “girl” and predicting multiple outputs $h_0$ to $h_t$ (i.e. trying to generate sentences based on a single starting word). A further configuration is many-to-one i.e. supplying many words as input, like the sentence “A girl walked into a bar, and she said ‘Can I have a drink please?’. The bartender said ‘Certainly {}” and predicting the next word i.e. {}. The diagram below shows an example one-to-many and many-to-one configuration, respectively (the words next to the outputs are the target words which we would supply during training).

There are also different many-to-many configurations that can be constructed – but you get the idea: recurrent neural networks are quite flexible. One last thing to note – the weights of the connections between time steps are *shared* i.e. there isn’t a different set of weights for each time step.

Now you have a pretty good idea of what recurrent neural networks are, it is time to point out their dominant problem.

## The problem with basic recurrent neural networks

Vanilla recurrent neural networks aren’t actually used very often in practice. Why? The main reason is the vanishing gradient problem. For recurrent neural networks, ideally, we would want to have long memories, so the network can connect data relationships at significant distances in time. That sort of network could make real progress in understanding how language and narrative works, how stock market events are correlated and so on. However, the more time steps we have, the more chance we have of back-propagation gradients either accumulating and exploding or vanishing down to nothing.

Consider the following representation of a recurrent neural network:

$$\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{Vh}_{t-1})$$

Where $\textbf{U}$ and $\textbf{V}$ are the weight matrices connecting the inputs and the recurrent outputs respectively. We then often will perform a softmax of all the $\textbf{h}_t$ outputs (if we have some sort of many-to-many or one-to-many configuration). Notice, however, that if we go back three time steps in our recurrent neural network, we have the following:

$$\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{V}(\sigma(\textbf{Ux}_{t-1} + \textbf{V}(\sigma(\textbf{Ux}_{t-2})))))$$
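To make the recurrence concrete, here is a minimal pure-Python sketch of the forward pass $\textbf{h}_t = \sigma (\textbf{Ux}_t + \textbf{Vh}_{t-1})$. The function names, the tiny weight values and the one-hot inputs are all made up for illustration, and the initial hidden state is assumed to be zero. Note how the same `U` and `V` are reused at every time step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_step(x_t, h_prev, U, V):
    """One step of h_t = sigmoid(U x_t + V h_{t-1}) using plain lists."""
    hidden_size = len(V)
    h_t = []
    for j in range(hidden_size):
        total = sum(U[j][k] * x_t[k] for k in range(len(x_t)))
        total += sum(V[j][k] * h_prev[k] for k in range(hidden_size))
        h_t.append(sigmoid(total))
    return h_t

def rnn_forward(xs, U, V):
    """Unroll the same cell (shared U and V) over a finite sequence."""
    h = [0.0] * len(V)  # assumed zero initial state
    outputs = []
    for x_t in xs:
        h = rnn_step(x_t, h, U, V)
        outputs.append(h)
    return outputs

# Illustrative weights: 3 inputs, 2 hidden nodes.
U = [[0.5, -0.3, 0.8], [0.1, 0.4, -0.6]]
V = [[0.2, -0.1], [0.7, 0.3]]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

hs = rnn_forward(xs, U, V)
for t, h in enumerate(hs):
    print("h_%d = %s" % (t, h))
```

Each `h_t` is computed from the previous one, so the final output is literally a nested application of the same function, which is what the unrolled equation expresses.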

From the above you can see that, as we work our way back in time, we are essentially adding deeper and deeper layers to our network. This causes a problem – consider the gradient of the error with respect to the weight matrix $\textbf{U}$ during backpropagation through time, which looks something along the lines of this:

$$\frac{\partial E_3}{\partial U} = \frac{\partial E_3}{\partial out_3}\frac{\partial out_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial U}$$

The equation above is only a rough approximation of what is going on during backpropagation through time, but it will suffice for our purposes (for more on back-propagation, see my comprehensive neural networks tutorial). Each of these gradients will involve calculating the gradient of the sigmoid function. The problem with the sigmoid function occurs when the input values are such that the output is close to either 0 or 1 – at this point, the gradient is very small, see the plot below.

As you can observe, the values of the gradient (orange line) never exceed 0.25 and get to very low values when the output gets close to 0 or 1. What does this mean? It means that when you multiply many sigmoid gradients together you are multiplying many values which are potentially much less than one – this leads to a vanishing gradient $\frac{\partial E}{\partial U}$. Because the gradient will become basically zero when dealing with many prior time steps, the weights won’t adjust to take into account these values, and therefore the network won’t learn relationships separated by significant periods of time. This makes vanilla recurrent neural networks not very useful. If you’d like to learn more about the vanishing gradient problem, see my dedicated post about it here.
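A quick numerical check makes the shrinkage tangible. The sketch below multiplies sigmoid gradients over ten time steps, ignoring the weight terms for simplicity. The pre-activation values 0.0 and 4.0 are arbitrary choices for a best case and a saturated case.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, which peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_product(pre_activations):
    """Multiply the sigmoid gradients met along a chain of time steps."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_grad(z)
    return product

# Best case: every node sits exactly where the gradient is largest.
best_case = gradient_product([0.0] * 10)
# Saturated case: outputs close to 1, so each gradient is tiny.
saturated = gradient_product([4.0] * 10)

print("best case over 10 steps:", best_case)
print("saturated over 10 steps:", saturated)
```

Even in the best case the product is below one in a million after only ten steps, and saturated nodes make it far smaller still.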

We could use ReLU activation functions to reduce this problem, though not eliminate it. However, the most popular way of dealing with this issue in recurrent neural networks is by using long short-term memory (LSTM) networks, which will be introduced in the next section.

# Introduction to LSTM networks

To reduce the vanishing (and exploding) gradient problem, and therefore allow deeper networks and recurrent neural networks to perform well in practical settings, there needs to be a way to reduce the multiplication of gradients which are less than one. The LSTM cell is a specifically designed unit of logic that will help reduce the vanishing gradient problem sufficiently to make recurrent neural networks more useful for long-term memory tasks i.e. text sequence predictions. The way it does so is by creating an internal memory state which is simply *added* to the processed input, which greatly reduces the multiplicative effect of small gradients. The time dependence and effects of previous inputs are controlled by an interesting concept called a *forget gate*, which determines which states are remembered or forgotten. Two other gates, the *input gate* and *output gate*, are also featured in LSTM cells.

Let’s first have a look at LSTM cells more carefully, then I’ll discuss how they help reduce the vanishing gradient problem.

## The structure of an LSTM cell

The structure of a typical LSTM cell is shown in the diagram below:

The data flow is from left-to-right in the diagram above, with the current input $x_t$ and the previous cell output $h_{t-1}$ concatenated together and entering the top “data rail”. Here’s where things get interesting.

### The input gate

First, the input is squashed between -1 and 1 using a *tanh* activation function. This can be expressed by:

$$g = tanh(b^g + x_tU^g + h_{t-1}V^g)$$

Where $U^g$ and $V^g$ are the weights for the input and previous cell output, respectively, and $b^g$ is the input bias. Note that the superscripts *g* are not a raised power, but rather signify that these are the input weights and bias values (as opposed to those of the input gate, forget gate, output gate etc.).

This squashed input is then multiplied element-wise by the output of the *input gate*. The input gate is basically a hidden layer of sigmoid activated nodes, with weighted $x_t$ and $h_{t-1}$ input values, which outputs values of between 0 and 1 and when multiplied element-wise by the input determines which inputs are switched on and off. In other words, it is a kind of input filter or gate. The expression for the input gate is:

$$i = \sigma(b^i + x_tU^i + h_{t-1}V^i)$$

The output of the input stage of the LSTM cell can be expressed below, where the $\circ$ operator expresses element-wise multiplication:

$$g \circ i$$

As you can observe, the input gate output *i* acts as the weights for the squashed input *g*. We now move onto the next stage of the LSTM cell – the internal state and the forget gate.

### The internal state and the forget gate

This stage in the LSTM is where most of the magic happens. As can be observed, there is a new variable $s_t$ which is the inner state of the LSTM cell. This state is delayed by one time step and is ultimately added to the $g \circ i$ input to provide an internal recurrence loop to learn the relationship between inputs separated by time. Two things to notice – first, there is a forget gate here – this forget gate is again a sigmoid activated set of nodes which is element-wise multiplied by $s_{t-1}$ to determine which previous states should be remembered (i.e. forget gate output close to 1) and which should be forgotten (i.e. forget gate output close to 0). This allows the LSTM cell to learn appropriate context. Consider the sentence “Clare took Helen to Paris and she was very grateful” – for the LSTM cell to learn who “she” refers to, it needs to forget the subject “Clare” and replace it with the subject “Helen”. The forget gate can facilitate such operations and is expressed as:

$$f = \sigma(b^f + x_tU^f + h_{t-1}V^f)$$

The output of the element-wise product of the previous state and the forget gate is expressed as $s_{t-1} \circ f$. Again, the forget gate output acts as weights for the internal state. The second thing to notice about this stage is that the forget-gate-“filtered” state is simply added to the input, rather than multiplied by it, or mixed with it via weights and a sigmoid activation function as occurs in a standard recurrent neural network. This is important to reduce the issue of vanishing gradients. The output from this stage, $s_t$ is expressed by:

$$s_t = s_{t-1} \circ f + g \circ i$$

The final stage of the LSTM cell is the output gate.

### The output gate

The output gate has two components – another *tanh* squashing function and an output sigmoid gating function. The output sigmoid gating function, like the other gating functions in the cell, is multiplied by the squashed state $s_t$ to determine which values of the state are output from the cell. As you can tell, the LSTM cell is very flexible, with gating functions controlling what is input, what is “remembered” in the internal state variable, and finally what is output from the LSTM cell.

The output gate is expressed as:

$$o = \sigma(b^o + x_tU^o + h_{t-1}V^o)$$

So the final output of the cell can be expressed as:

$$h_t = tanh(s_t) \circ o$$
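Putting the six equations above together, a single LSTM step can be sketched in a few lines. This is a scalar version (one input, one hidden node) so that each line maps directly onto one equation. The parameter dictionary, its key names and the value 0.5 used for every weight are illustrative assumptions only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One scalar LSTM step. p maps names such as 'Ug', 'Vg', 'bg' to floats."""
    g = math.tanh(p["bg"] + x_t * p["Ug"] + h_prev * p["Vg"])  # squashed input
    i = sigmoid(p["bi"] + x_t * p["Ui"] + h_prev * p["Vi"])    # input gate
    f = sigmoid(p["bf"] + x_t * p["Uf"] + h_prev * p["Vf"])    # forget gate
    o = sigmoid(p["bo"] + x_t * p["Uo"] + h_prev * p["Vo"])    # output gate
    s_t = s_prev * f + g * i                                   # new internal state
    h_t = math.tanh(s_t) * o                                   # cell output
    return h_t, s_t

# Every weight and bias is set to 0.5 purely for illustration.
names = ("Ug", "Vg", "bg", "Ui", "Vi", "bi", "Uf", "Vf", "bf", "Uo", "Vo", "bo")
params = {name: 0.5 for name in names}

h, s = 0.0, 0.0  # assumed zero initial output and state
for x in [1.0, -0.5, 0.25]:
    h, s = lstm_step(x, h, s, params)
    print("x=%+.2f  s_t=%.4f  h_t=%.4f" % (x, s, h))
```

In the real network each of these quantities is a vector and the products with `U` and `V` are matrix multiplications, but the data flow is the same.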

The next question is, how does the LSTM cell reduce the vanishing gradient problem?

## Reducing the vanishing gradient problem

Recall before that the issue with vanilla recurrent neural networks is that calculating the gradient to update the weights involves cascading terms like:

$$\frac {\partial h_n}{\partial h_{n-1}} \frac {\partial h_{n-1}}{\partial h_{n-2}} \frac {\partial h_{n-2}}{\partial h_{n-3}} …$$

This is a problem because the sigmoid derivative, which is present in all of the partial derivatives above, is at most 0.25 (and often much smaller). Each of these terms also carries a factor of the recurrent weights, so the weights are multiplied together repeatedly; if they are consistently <1, we get a similar result – a vanishing gradient.

In an LSTM cell, the recurrency of the internal state of the LSTM cell involves, as shown above, an addition – like so:

$$s_t = s_{t-1} \circ f + g \circ i$$

If we take the partial derivative of this recurrency like we did above for a vanilla recurrent neural network, we find the following:

$$\frac{\partial s_t}{\partial s_{t-1}} = f$$

Notice that the $g \circ i$ term drops away and we are just left with a repeated multiplication of $f$ (treating the gate outputs as constants with respect to $s_{t-1}$). So for three time steps, we would have $f \circ f \circ f$. Notice that if the output of $f=1$, there will be no decay of the gradient. Generally, the bias of the sigmoid in $f$ is made large at the beginning of training so that $f$ starts out close to 1, meaning that all past input states will be “remembered” in the cell. During training, the forget gate will reduce or eliminate the memory of certain components of the state $s_{t-1}$.

This might be a bit confusing, so I’ll explain another way before we move on. Imagine if we let in a single input during the first time step, but then we block all future inputs (by setting the input gate to output zeros) and remember all previous states (by setting the forget gate to output ones). We would have a kind of circulating memory of $s_t$ which never decays i.e. $s_t$ = $s_{t-1}$. A back-propagated error “entering” this loop would also never decay. With the vanilla recurrent neural network, however, if we did the same thing our back-propagated error would be continuously degraded by the gradient of the activation function of the hidden nodes, and therefore eventually decay to zero.
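The thought experiment above can be checked with a few lines of arithmetic. In this sketch the gate outputs are fixed by hand rather than learned, the vanilla network is assumed to have a recurrent weight of 1 and to sit at the sigmoid’s best-case gradient of 0.25, and 50 time steps is an arbitrary choice.

```python
def run_state(s0, steps, f=1.0, i=0.0, g=0.5):
    """Apply s_t = s_{t-1} * f + g * i repeatedly with fixed gate outputs."""
    s = s0
    for _ in range(steps):
        s = s * f + g * i
    return s

def lstm_state_gradient(forget_outputs):
    """Product of ds_t/ds_{t-1} = f_t along the state path (gates held constant)."""
    grad = 1.0
    for f in forget_outputs:
        grad *= f
    return grad

def vanilla_gradient(num_steps, activation_grad=0.25, recurrent_weight=1.0):
    """Best-case decay for a sigmoid RNN: one gradient and one weight factor per step."""
    return (activation_grad * recurrent_weight) ** num_steps

steps = 50
circulating = run_state(0.7, steps)                # input gate shut, forget gate open
open_gate = lstm_state_gradient([1.0] * steps)     # no decay at all
leaky_gate = lstm_state_gradient([0.9] * steps)    # slow decay
vanilla = vanilla_gradient(steps)                  # collapses to almost nothing

print("state after %d steps: %s" % (steps, circulating))
print("LSTM gradient, f=1.0: %s" % open_gate)
print("LSTM gradient, f=0.9: %s" % leaky_gate)
print("vanilla RNN gradient: %s" % vanilla)
```

With the forget gate fully open both the state and the back-propagated error survive untouched, whereas the vanilla network’s gradient is effectively zero after the same number of steps.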

Hopefully, that helps you to understand, at least in part, why LSTM cells are a great solution to the vanishing gradient problem, and therefore why they are currently used so extensively. Now, so far, we have been dealing with the data in the LSTM cells as if they were single values (i.e. scalars), however, in reality, they are tensors or vectors, and this can get confusing. So in the next section, I’ll spend a bit of time explaining the tensor sizes we can expect to be flowing around our unrolled LSTM networks.

## The dimensions of data inside an LSTM cell

In the example code that is going to be discussed below, we are going to be performing text prediction. Now, as discussed in previous tutorials on the Word2Vec algorithm, words are input into neural networks using meaningful word vectors i.e. the word “cat” might be represented by, say, a 650 length vector. This vector is encoded in such a way as to capture some aspect of the meaning of the word (where meaning is usually construed as the context the word is usually found in). So each word input into our LSTM network below will be a 650 length vector. Next, because we will be inputting a sequence of words into our unrolled LSTM network, for each input row we will be inputting 35 of these word vectors. So the input for each row will be (35 x 650) in size. Finally, with TensorFlow, we can process batches of data via multi-dimensional tensors (to learn more about basic TensorFlow, see this TensorFlow tutorial). If we have a batch size of 20, our *training* input data will be (20 x 35 x 650). For future reference, the way I have presented the tensor size here (i.e. (20 x 35 x 650)) is called a “batch-major” arrangement, where the batch size is the first dimension of the tensor. We could also alternatively arrange the data in “time-major” format, which would be (35 x 20 x 650) – same data, just a different arrangement.

Now, the next thing to consider is that each of the input, forget and output gates, along with the inner state variable $s_t$ and the squashing functions, are not single functions with single/scalar weights. Rather, they comprise the hidden layer of the network and therefore include multiple nodes, connecting weights, bias values and so on. It is up to us to set the size of the hidden layer. The output from the unrolled LSTM network will, therefore, include the size of the hidden layer. The size of the output from the unrolled LSTM network with a size 650 hidden layer, and a 20 length batch-size and 35 time steps will be (20, 35, 650). Often, the output of an unrolled LSTM will be partially flattened and fed into a softmax layer for classification – so, for instance, the first two dimensions of the tensor are flattened to give a softmax layer input size of (700, 650). The output of the softmax is then matched against the expected training outputs during training. The diagram below shows all this:
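The shape bookkeeping described above is easy to lose track of, so here is a small helper that just computes the tuples. It is a hypothetical utility written for this explanation, not part of TensorFlow, and it assumes the softmax input is formed by merging the batch and time axes.

```python
def lstm_shapes(batch_size, num_steps, embed_size, hidden_size, time_major=False):
    """Return the tensor shapes at each stage of the unrolled network."""
    if time_major:
        inputs = (num_steps, batch_size, embed_size)
        outputs = (num_steps, batch_size, hidden_size)
    else:
        inputs = (batch_size, num_steps, embed_size)
        outputs = (batch_size, num_steps, hidden_size)
    # Flattening merges the batch and time axes, leaving the hidden axis intact.
    softmax_input = (batch_size * num_steps, hidden_size)
    return {"inputs": inputs, "outputs": outputs, "softmax_input": softmax_input}

shapes = lstm_shapes(batch_size=20, num_steps=35, embed_size=650, hidden_size=650)
time_major_shapes = lstm_shapes(20, 35, 650, 650, time_major=True)

for name in ("inputs", "outputs", "softmax_input"):
    print(name, shapes[name])
```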

As can be observed in the architecture above (which we will be creating in the code below), it is possible to stack layers of LSTM cells on top of each other – this increases the model complexity and predictive power but at the expense of training times and difficulties. The architecture shown above is what we will implement in TensorFlow in the next section. Note the small batch size – this is to allow a more stochastic gradient descent, which helps the model avoid settling in local minima during many training iterations (see here).

# Creating an LSTM network in TensorFlow

We are now going to create an LSTM network in TensorFlow. The code will loosely follow the TensorFlow team tutorial found here, but with updates and my own substantial modifications. The text dataset that will be used is the Penn Treebank (PTB) dataset, a common benchmarking corpus. As usual, all the code for this post can be found on the AdventuresinML Github site. To run this code, you’ll first have to download and extract the .tgz file from here. First off, we’ll go through the data preparation part of the code.

## Preparing the data

This code will use, verbatim, the following functions from the previously mentioned TensorFlow tutorial: *read_words*, *build_vocab* and *file_to_word_ids*. I won’t go into these functions in detail, but basically, they first split the given text file into separate words and sentence based characters (i.e. end-of-sentence <eos>). Then, each unique word is identified and assigned a unique integer. Finally, the original text file is converted into a list of these unique integers, where each word is substituted with its new integer identifier. This allows the text data to be consumed in the neural network.

The code below shows how these functions are used in my code:

First, we simply set up the directory paths for the train, validation and test datasets respectively. Then, *build_vocab()* is invoked on the training data to create a dictionary that has each word as a key, and a unique integer as the associated value. Here is a sample of what the *word_to_id* dictionary looks like:

{‘write-off’: 7229, ‘ports’: 8314, ‘fundamentals’: 4478, ‘toronto-based’: 5034, ‘head’: 638, ‘fairness’: 6417,…

Next, we convert the text data for each file into a list of integers using the *word_to_id* dictionary. The first 5 items of the list *train_data* look like:

[9970, 9971, 9972, 9974, 9975]

I’ve also created a reverse dictionary which allows you to go the other direction – from a unique integer identifier to the corresponding word. This will be used later when we are reconstructing the outputs of our LSTM network back into plain English sentences.
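To illustrate the three steps just described, here is a simplified sketch of the helper functions and the reverse dictionary. It is modeled on the description above rather than copied from the TensorFlow tutorial. It uses plain `open()` instead of TensorFlow’s file utilities, and the two-line sample corpus, the temporary file and the frequency-based id ordering are assumptions made for the demo.

```python
import collections
import os
import tempfile

def read_words(filename):
    """Split a text file into words, marking each line end with <eos>."""
    with open(filename, "r") as f:
        return f.read().replace("\n", " <eos> ").split()

def build_vocab(filename):
    """Map each unique word to an integer id (most frequent words first)."""
    counter = collections.Counter(read_words(filename))
    # Sort by descending count, breaking ties alphabetically.
    pairs = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(pairs)}

def file_to_word_ids(filename, word_to_id):
    """Convert a text file into a list of integer ids, skipping unknown words."""
    return [word_to_id[w] for w in read_words(filename) if w in word_to_id]

# Hypothetical two-line corpus written to a temporary file for the demo.
sample_text = "a girl walked into a bar\nthe bartender said hello\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w") as f:
        f.write(sample_text)
    word_to_id = build_vocab(path)
    data = file_to_word_ids(path, word_to_id)

# Reverse dictionary: integer id back to word.
reversed_dictionary = dict(zip(word_to_id.values(), word_to_id.keys()))
decoded = " ".join(reversed_dictionary[i] for i in data)

print(data)
print(decoded)
```

For the real dataset the same functions would be pointed at the extracted training, validation and test files instead of a temporary file.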

The next step is to develop an input data pipeline that allows the extraction of batches of data in an efficient manner.

## Creating an input data pipeline

As discussed in my TensorFlow queues and threads tutorial, the use of a feed dictionary to supply data to your model during training, while common in tutorials, is not efficient – as can be read here on the TensorFlow site. Rather, it is more efficient to use TensorFlow queues and threading. Note that there is a new way of doing things, using the Dataset API, which won’t be used in this tutorial, but I will perhaps update it in the future to include this new way of doing things. I’ve packaged up this code in a function called *batch_producer* – this function extracts batches of *x, y* training data – the *x* batch is formatted as the time stepped text data. The *y* batch is the same data, except delayed one time step. So, for instance, a single *x, y* sample in a batch, with the number of time steps being 8, looks like:

*x* = “A girl walked into a bar, and she”

*y* = “girl walked into a bar, and she said”

Remember that *x* and *y* will be batches of integer data, with the size (*batch_size*, *num_steps*), not text as shown above – however, I have shown the above *x* and *y* sample in text form to aid understanding. So, as demonstrated in the model architecture diagram above, we are producing a many-to-many LSTM model, where the model will be trained to predict the very next word in the sequence *for each* word in the number of time steps.

Here’s what the code looks like:

In the code above, first, the raw text data is converted into an *int32* tensor. Next, the length of the full data set is calculated and stored in *data_len*, and this is then divided by the batch size in an integer division (//) to get *batch_len*, the length of each of the *batch_size* rows that the data will be split into. The next line reshapes the *raw_data* tensor (restricted in size to the number of full rows of data i.e. 0 to *batch_size* × *batch_len*) into a (*batch_size, batch_len*) shape. The next line sets the number of iterations in each epoch – usually, this is set so that all the training data is passed through the algorithm in each epoch. This is what occurs here – *batch_len*, less one to leave room for the one-step-shifted targets, is integer divided by the number of time steps – this gives the number of time-step-sized batches that are available to be iterated through in a single epoch.

The next line sets up an input range producer queue – this is a simple queue which allows the asynchronous and threaded extraction of data batches from a pre-existing dataset. For more on threads and queues, check out my tutorial. Basically, each time more data is required in the training of the model, a new integer is extracted between 0 and *epoch_size* – this is then used in the following lines to extract a batch of data asynchronously from the *data* tensor. With the *shuffle* argument set to False, this integer simply cycles from 0 to *epoch_size* and then resets back at 0 to repeat.

To produce the *x, y* batches of data, data slices are extracted from the data tensor based on the dequeued integer *i*. To see how this works, it is easier to imagine a dummy dataset of integers up to 20 – [1, 2, 3, 4, 5, 6, …, 19, 20]. Let’s say we set the batch size to 3, and the number of steps to 2. The variables *batch_len* and *epoch_size* will therefore be equal to 6 and 2, respectively (the last two integers don’t fit into a full row and are dropped). The dummy reshaped data will look like:

$$\begin{bmatrix}
1 & 2 & 3 & 4 & 5 & 6 \\
7 & 8 & 9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16 & 17 & 18 \\
\end{bmatrix}$$

For the first data batch extraction, *i = 0*, therefore the extracted *x* for our dummy dataset will be *data[:, 0:2]*:

$$\begin{bmatrix}
1 & 2\\
7 & 8\\
13 & 14\\
\end{bmatrix}$$

The extracted *y* will be *data[:, 1:3]*:

$$\begin{bmatrix}
2 & 3\\
8 & 9\\
14 & 15\\
\end{bmatrix}$$

As can be observed, each row of the extracted *x* and *y* tensors will be an individual sample of length *num_steps* and the number of rows is the batch size. By organizing the data in this fashion, it is straightforward to extract batch data while still maintaining the correct sentence sequence within each data sample.
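The slicing logic in the dummy example can be reproduced without TensorFlow. The sketch below uses plain Python lists in place of tensors and a simple loop in place of the queue. The function name `make_batches` is made up, the dummy data is assumed to be the integers 1 to 20, and the epoch size is assumed to be `(batch_len - 1) // num_steps`, which matches the values quoted above.

```python
def make_batches(raw_data, batch_size, num_steps):
    """Reshape a flat list into rows, then slice out (x, y) batches."""
    data_len = len(raw_data)
    batch_len = data_len // batch_size
    # Keep only as much data as fills batch_size complete rows.
    trimmed = raw_data[:batch_size * batch_len]
    data = [trimmed[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    # One column is held back because y is x shifted one step forward.
    epoch_size = (batch_len - 1) // num_steps
    batches = []
    for i in range(epoch_size):
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        batches.append((x, y))
    return data, batches

data, batches = make_batches(list(range(1, 21)), batch_size=3, num_steps=2)
x0, y0 = batches[0]

print("reshaped data:", data)
print("x for i = 0:  ", x0)
print("y for i = 0:  ", y0)
```

Running this for `i = 1` gives the next slice along each row, so every row of the batch continues its own sentence sequence from one batch to the next.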

## Creating the model

In this code example, in order to have nice encapsulation and better-looking code, I’ll be building the model in Python classes. The first class is a simple class that contains the input data:

We pass this object important input data information such as batch size, the number of recurrent time steps and finally the raw data file we wish to extract batch data from. The previously explained *batch_producer* function, when called, will return our input data batch *x* and the associated time step + 1 target data batch, *y*.

The next step is to create our LSTM model. Again, I’ve used a Python class to hold all the information and TensorFlow operations:

The first part of initialization is pretty self-explanatory, with the input data information and batch producer operation found in *input_obj*. Another important input is the boolean *is_training* – this allows the model instance to be created either as a model setup for training, or alternatively setup for validation or testing only.

The block of code above creates the word embeddings. As previously discussed and shown in my tutorial, word embedding creates meaningful vectors to represent each word. First, we initialize the embedding variable with size (vocab_size, hidden_size) which creates the “lookup table” where each row represents a word in the dataset, and the set of columns is the embedding vector. In this case, our embedding vector length is set equal to the size of our LSTM hidden layer.

The next line performs a lookup action on the embedding tensor, where each word in the input data set is matched with a row in the embedding tensor, with the matched embedding vector being returned within *inputs.*
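Conceptually, the lookup is nothing more than row indexing into a table. Here is a minimal sketch with nested lists standing in for tensors. The uniform initialization range of ±0.1, the tiny sizes and the function names are illustrative assumptions rather than the values used in the model.

```python
import random

def make_embedding(vocab_size, hidden_size, seed=0):
    """Create a (vocab_size, hidden_size) lookup table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
            for _ in range(vocab_size)]

def embedding_lookup(embedding, batch_ids):
    """Replace every integer id in a (batch, steps) list with its table row."""
    return [[embedding[word_id] for word_id in row] for row in batch_ids]

embedding = make_embedding(vocab_size=10, hidden_size=4)
batch_ids = [[1, 4, 9], [6, 1, 2]]            # (batch_size=2, num_steps=3)
inputs = embedding_lookup(embedding, batch_ids)

# The result has shape (batch_size, num_steps, hidden_size).
print(len(inputs), len(inputs[0]), len(inputs[0][0]))
```

Because the same id always selects the same row, word 1 receives an identical vector wherever it appears in the batch, and training adjusts that single row.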

In this model, the embedding layer / vectors will be learned during the model training – however, if we so desired, we could also pre-learn embedding vectors using another model and upload these into our models. I’ve shown how to do this in my gensim tutorial if you want to check it out.

The next step adds a drop-out wrapper to the input data – this helps prevent overfitting by continually changing the structure of the network connections:
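For readers unfamiliar with drop-out, the idea can be sketched in plain Python. This is the common “inverted” formulation, where surviving values are scaled up by `1 / keep_prob` so the expected sum is unchanged. The function name, the keep probability of 0.5 and the fixed random seed are assumptions for the demo.

```python
import random

def dropout(values, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob and rescale the rest."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(42)
activations = [1.0] * 10000
dropped = dropout(activations, keep_prob=0.5, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print("fraction kept:", kept_fraction)
print("mean after dropout:", mean_after)
```

A different random subset of values is zeroed on every call, which is the sense in which the structure of the network keeps changing during training. At validation or test time drop-out is normally switched off, which corresponds to `keep_prob = 1.0`.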

### Creating the LSTM network

The next step is to set up the initial state TensorFlow placeholder. This placeholder will be loaded with the initial state of the LSTM cells for each training batch. At the beginning of each training epoch, the input data will reset to the beginning of the text data set, so we want to reset the state variables to zero. However, during the multiple training batches executed in each epoch, we want to load the final state variables from the previous training batch into our LSTM cells for the current training batch. This keeps a certain continuity of state in our model, as we are progressing linearly through our text data set. We define the placeholder by:

The second argument to the placeholder function is the size of the variable – (num_layers, 2, batch_size, hidden_size) and requires some explanation. If we consider an individual LSTM cell, for each training sample it processes it has two other inputs – the previous output from the cell ($h_{t-1}$) and the previous state variable ($s_{t-1}$). These two inputs, *h* and *s*, are what is required to load the full state data into an LSTM cell. Remember also that *h* and *s* for each sample are actually vectors with the size equal to the hidden layer size. Therefore, for all the samples in the batch, for a single LSTM cell we have state data required of shape (2, batch_size, hidden_size). Finally, if we have stacked LSTM cell layers, we need state variables for each layer – *num_layers*. This gives the final shape of the state variables: (num_layers, 2, batch_size, hidden_size).

The next two steps involve setting up this state data variable in the format required to feed it into the TensorFlow LSTM data structure:

The TensorFlow LSTM cell can accept the state as a tuple if a flag is set to True (more on this later). The *tf.unstack* command creates a number of tensors, each of shape (2, batch_size, hidden_size), from the *init_state* tensor, one for each stacked LSTM layer (*num_layers*). These tensors are then loaded into a specific TensorFlow data structure, *LSTMStateTuple*, which is the required format for input into the LSTM cells.
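The unstacking step can be pictured with nested lists. In the sketch below `StateTuple` is a hypothetical stand-in for TensorFlow’s *LSTMStateTuple*, using the (s, h) naming from this post, and the assumption is that index 0 of the second axis holds *s* and index 1 holds *h*.

```python
import collections

# Hypothetical stand-in for TensorFlow's LSTMStateTuple.
StateTuple = collections.namedtuple("StateTuple", ["s", "h"])

def zero_state(num_layers, batch_size, hidden_size):
    """Build a nested list of zeros shaped (num_layers, 2, batch_size, hidden_size)."""
    return [[[[0.0] * hidden_size for _ in range(batch_size)]
             for _ in range(2)]
            for _ in range(num_layers)]

def to_state_tuples(init_state):
    """Split along the first axis: one (s, h) pair per stacked layer."""
    return tuple(StateTuple(s=layer[0], h=layer[1]) for layer in init_state)

init_state = zero_state(num_layers=2, batch_size=20, hidden_size=650)
state_tuples = to_state_tuples(init_state)

for idx, layer_state in enumerate(state_tuples):
    print("layer %d: s is (%d x %d), h is (%d x %d)" % (
        idx, len(layer_state.s), len(layer_state.s[0]),
        len(layer_state.h), len(layer_state.h[0])))
```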

Next, we create an LSTM cell which will be “unrolled” over the number of time steps. Following this, we apply a drop-out wrapper to again protect against overfitting. Notice that we set the forget bias values to be equal to 1.0, which helps guard against repeated low forget gate outputs causing vanishing gradients, as explained above:

Next, if we include many layers of stacked LSTM cells in the model, we need to use another TensorFlow object called *MultiRNNCell* which performs the requisite cell stacking / layering:

Note that we tell *MultiRNNCell* to expect the state variables in the form of a *LSTMStateTuple* by setting the flag *state_is_tuple* to True.

The final step in creating the LSTM network structure is to create a dynamic RNN object in TensorFlow. This object will dynamically perform the unrolling of the LSTM cell over each time step.

The *dynamic_rnn* object takes our defined LSTM cell as the first argument, and the embedding vector tensor *inputs* as the second argument. The final argument, *initial_state*, is where we load our time-step zero state variables, which we created earlier, into the unrolled LSTM network.

This operation creates two outputs, the first is the output from all the unrolled LSTM cells, and will have a shape of (batch_size, num_steps, hidden_size). This data will be flattened in the next step to feed into a softmax classification layer. The second output, *state*, is the (s, h) state tuple taken from the final time step of the LSTM cells. This *state* operation / tuple will be extracted during each batch training operation to be used as inputs (via *init_state*) into the next training batch.
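The reason for feeding the final state back in as the next initial state can be demonstrated with a toy cell. The sketch below unrolls a scalar LSTM step over a sequence in plain Python. It is not how *dynamic_rnn* is implemented, and the shared weight of 0.5 and the forget bias of 1.0 are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, state, w=0.5, forget_bias=1.0):
    """Scalar LSTM step; state is an (s, h) pair and every weight equals w."""
    s_prev, h_prev = state
    pre = w * x_t + w * h_prev
    g = math.tanh(pre)
    i = sigmoid(pre)
    f = sigmoid(forget_bias + pre)
    o = sigmoid(pre)
    s_t = s_prev * f + g * i
    h_t = math.tanh(s_t) * o
    return h_t, (s_t, h_t)

def unroll(inputs, initial_state):
    """Run the same cell over every time step, collecting outputs."""
    state = initial_state
    outputs = []
    for x_t in inputs:
        h_t, state = lstm_step(x_t, state)
        outputs.append(h_t)
    return outputs, state

sequence = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
zero = (0.0, 0.0)

# Process the whole sequence in one go...
full_outputs, full_state = unroll(sequence, zero)
# ...or as two consecutive batches, carrying the state across.
first_outputs, carried = unroll(sequence[:3], zero)
second_outputs, final_state = unroll(sequence[3:], carried)

print(first_outputs + second_outputs == full_outputs)
```

Carrying the state across gives exactly the same outputs as processing the text in one long pass, which is the continuity the model relies on within an epoch.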

### Creating the softmax, loss and optimizer operations

Next we have to flatten the outputs so that we can feed them into our proposed softmax classification layer. We can use the -1 notation to reshape our output tensor, with the second axis set to be equal to the hidden layer size:

Next we set up our softmax weight variables and the standard $xw+b$ operation:

Note that the *logits* operation is simply the output of our tensor multiplication – we haven’t yet added the softmax operation – this will occur in the loss calculations below (and also in our ancillary accuracy calculations).
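The flattening and $xw+b$ steps can be sketched in plain Python with nested lists. The helper names `flatten_outputs` and `dense` and the toy sizes are assumptions made for illustration; the actual model performs the same arithmetic with TensorFlow tensors.

```python
def flatten_outputs(outputs):
    """Reshape (batch_size, num_steps, hidden_size) to (batch_size * num_steps, hidden_size)."""
    return [step_vec for sequence in outputs for step_vec in sequence]

def dense(x, w, b):
    """Compute x @ w + b for x of shape (rows, hidden) and w of shape (hidden, vocab)."""
    return [
        [sum(x_i * w_row[j] for x_i, w_row in zip(row, w)) + b[j]
         for j in range(len(b))]
        for row in x
    ]

# Toy sizes: batch_size=2, num_steps=3, hidden_size=2, vocab_size=4
outputs = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
           [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]]
softmax_w = [[0.1, 0.2, 0.3, 0.4],
             [0.5, 0.6, 0.7, 0.8]]
softmax_b = [0.0, 0.0, 0.0, 1.0]

flat = flatten_outputs(outputs)             # shape (6, 2)
logits = dense(flat, softmax_w, softmax_b)  # shape (6, 4), unnormalized scores
```

Every row of `logits` holds one unnormalized score per vocabulary word for a single (sample, time step) position.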

Following this, we have to set up our loss or cost function, which will be used to train our LSTM network. In this case, we will use the specialized TensorFlow sequence to sequence loss function. This loss function allows one to calculate a (potentially) weighted cross entropy loss over a sequence of values. The first argument to this loss function is the *logits* argument, which requires tensors with the shape (batch_size, num_steps, vocab_size), so we’ll need to reshape our logits tensor. The second argument to the loss function is the *targets* tensor, which has a shape (batch_size, num_steps) with each value being an integer (which corresponds to a unique word in our case). In other words, this tensor contains the true values of the word sequence that we want our LSTM network to predict. The third important argument is the weights tensor, of shape (batch_size, num_steps), which allows you to weight different samples or time steps with respect to the loss, e.g. you might want the loss to favor the later time steps over the earlier ones. No weighting is applied in this model, so a tensor of ones is passed to this argument.

There are two more important arguments for this function: *average_across_timesteps* and *average_across_batch*. If *average_across_timesteps* is set to True, the cost is summed across the time dimension and divided by the total weight over the time steps, i.e. it is averaged over time. If *average_across_batch* is True, the same averaging is applied across the batch dimension instead. In this case we are favoring the latter option.

Finally, we produce the *cost* operation, which reduces the loss to a single scalar value. We could do something similar by setting *average_across_timesteps* to True, but I am keeping things consistent with the TensorFlow tutorial.
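As a rough illustration of the loss described above, the following plain-Python sketch computes a weighted cross entropy that is averaged across the batch but kept per time step, and then sums over time to get a scalar cost. The function names are hypothetical, and this mirrors my understanding of the batch-averaging option rather than reproducing TensorFlow's implementation.

```python
import math

def cross_entropy(step_logits, target):
    """Softmax cross entropy for one position, computed stably from logits."""
    m = max(step_logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in step_logits))
    return log_z - step_logits[target]

def loss_per_timestep(logits, targets, weights):
    """Average the weighted cross entropy across the batch; keep the time axis."""
    num_steps = len(targets[0])
    losses = []
    for t in range(num_steps):
        total = sum(w[t] * cross_entropy(l[t], y[t])
                    for l, y, w in zip(logits, targets, weights))
        total_weight = sum(w[t] for w in weights)
        losses.append(total / total_weight)
    return losses

# Toy sizes: batch_size=2, num_steps=2, vocab_size=3, uniform (untrained) logits
logits = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
          [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
targets = [[0, 2], [1, 1]]
weights = [[1.0, 1.0], [1.0, 1.0]]  # no weighting, as in this model

loss = loss_per_timestep(logits, targets, weights)  # one value per time step
cost = sum(loss)                                    # single scalar to minimize
```

With uniform logits every position costs ln(vocab_size), so the summed cost here is 2 · ln 3. The same reasoning gives a quick sanity check for the starting cost of an untrained model.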

In the next few steps, we set up some operations to calculate the accuracy of predictions over the batch samples:

First we apply a softmax operation to get the predicted probabilities of each word for each output of the LSTM network. We then make the network predictions equal to those words with the highest softmax probability by using the *argmax* function. These predictions are then compared to the actual target words and then averaged to get the accuracy.
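The accuracy calculation can be sketched in plain Python as follows. The helper names and the toy logits are assumptions for illustration; each row of `flat_logits` stands for one (sample, time step) position.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def batch_accuracy(flat_logits, flat_targets):
    """Predict the most probable word per position and compare with the targets."""
    predictions = [argmax(softmax(row)) for row in flat_logits]
    correct = [int(p == t) for p, t in zip(predictions, flat_targets)]
    return predictions, sum(correct) / len(correct)

flat_logits = [[2.0, 0.5, 0.1],
               [0.2, 0.1, 3.0],
               [1.0, 1.5, 0.3],
               [0.0, 0.2, 0.1]]
flat_targets = [0, 2, 0, 1]

predictions, accuracy = batch_accuracy(flat_logits, flat_targets)
```

Because softmax preserves the ordering of the logits, the argmax of the probabilities equals the argmax of the raw logits. The softmax is only needed if you also want the probabilities themselves.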

Now we move onto constructing the optimization operations – in this case we aren’t using a simple “out of the box” optimizer – rather we are doing a few manipulations to improve results:

First off, if the model has been created for predictions, validations or testing only, these operations do not need to be created. If the model is being used for training, the first step is to create a learning rate variable. This will be used so that we can decrease the learning rate during training, which improves the final outcome of the model.

Next we wish to clip the size of the gradients in our network during back-propagation. This is recommended in recurrent neural networks to improve outcomes, and clipping values between 1 and 5 are commonly used. Finally, we create the optimizer operation, using the *learning_rate* variable, and apply the clipped gradients. A gradient descent step is then performed, and this operation is assigned to *train_op*. This operation, *train_op*, will be called for each training batch.
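Here is a minimal plain-Python sketch of gradient clipping followed by a gradient descent step. It assumes clipping by the global norm (all gradients are rescaled together when their combined norm exceeds the threshold), which is the behavior of TensorFlow's *tf.clip_by_global_norm*. The function names, parameters and values below are hypothetical.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together if their combined L2 norm exceeds clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)  # equals 1.0 when no clipping is needed
    return [[g * scale for g in grad] for grad in grads], global_norm

def sgd_step(params, grads, learning_rate):
    """Plain gradient descent update."""
    return [[p - learning_rate * g for p, g in zip(param, grad)]
            for param, grad in zip(params, grads)]

params = [[1.0, 2.0], [3.0]]
grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13

clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
params = sgd_step(params, clipped, learning_rate=1.0)
```

Rescaling every gradient by the same factor keeps the direction of the update unchanged and only limits its length, which is why it is a popular choice for recurrent networks.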

The final two lines of the model creation involve the updating of the *learning_rate*:

First, a placeholder, *new_lr*, is created. Its value will be supplied via the *feed_dict* argument when running the training. This new learning rate is then assigned to *learning_rate* via a *tf.assign* operation. This operation, *lr_update*, will be run at the beginning of each epoch.

Now that the model structure is fully created, we can move on to the training loops.

## Training the LSTM model

The training function will take as input the training data, along with various model parameters (batch sizes, number of steps etc.). The first part of the function looks like:

First we create an Input object instance and a Model object instance, passing in the necessary parameters. Because the TensorFlow graph is being created during the initialization of these objects, the TensorFlow global variable initializer operation can only be properly run *after* the creation of these instances.

Next we start the session, and run the variable initializer operation. Because we are using queuing in the Input object, we also need to create a thread coordinator and start the running of the threads (for more information, see this tutorial). If you skip this step, or put it before the creation of *training_input*, your program will hang. Finally, a saver instance is created as we want to store model training checkpoints and the final trained model.

Next, the epochal training loop is entered into:

The first step in every epoch is to calculate the learning rate decay factor, which gradually decreases after *max_lr_epoch* number of epochs has been reached. This learning rate decay factor, *new_lr_decay*, is multiplied by the learning rate and assigned to the model by calling the Model method *assign_lr*. This method looks like:

As can be observed, this function simply runs the *lr_update* operation which was explained in the prior section.
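One way such a schedule could look is sketched below. It assumes a geometric decay, i.e. the learning rate is held constant until *max_lr_epoch* and then multiplied by a fixed factor each epoch. The function name and the example values for `base_lr` and `lr_decay` are hypothetical.

```python
def decayed_learning_rate(epoch, base_lr, lr_decay, max_lr_epoch):
    """Hold base_lr until max_lr_epoch is reached, then shrink geometrically each epoch."""
    new_lr_decay = lr_decay ** max(epoch + 1 - max_lr_epoch, 0)
    return base_lr * new_lr_decay

# Example values (assumptions): start at 1.0, halve each epoch once the threshold is passed
schedule = [
    decayed_learning_rate(epoch, base_lr=1.0, lr_decay=0.5, max_lr_epoch=3)
    for epoch in range(6)
]
print(schedule)
```

In the training loop, the value returned for the current epoch is what would be passed to *assign_lr*.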

The next step is to create a zeroed initial state tensor for our LSTM model, which we assign to the variable *current_state*. Then each training operation is looped through within our specified epoch size. Every iteration we run the following operations: *m.train_op* and *m.state*. The *train_op* operation, as previously shown, calculates the clipped gradients of the model and takes a batched step to minimize the cost. The *state* operation returns the *state* of the final unrolled LSTM cell, which we will require as the input state for the next training batch. Note that it replaces the contents of the *current_state* variable. This *current_state* variable is inserted into the *m.init_state* placeholder via the *feed_dict*.
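The way the state is carried from one batch to the next can be sketched in plain Python. Here `run_training_step` is a hypothetical stand-in for the session call that runs *m.train_op* and *m.state*; its toy arithmetic only exists so the loop can be executed.

```python
def run_training_step(batch, state):
    """Hypothetical stand-in for running train_op and state with a feed_dict."""
    # Toy "network": the new state is a running sum, the cost is a placeholder value
    new_state = [s + x for s, x in zip(state, batch)]
    cost = sum(abs(x) for x in batch)
    return cost, new_state

def train_one_epoch(batches, state_size, print_iter=50):
    """Loop over the batches, feeding each batch's final state into the next batch."""
    current_state = [0.0] * state_size  # zeroed state at the start of the epoch
    for step, batch in enumerate(batches):
        cost, current_state = run_training_step(batch, current_state)
        if step % print_iter == 0:
            print("Step {}, cost: {:.3f}".format(step, cost))
    return current_state

final_state = train_one_epoch([[1.0, 2.0], [3.0, 4.0]], state_size=2)
```

The key point is the reassignment of `current_state` inside the loop: the network never restarts from zeros mid-epoch, so information can flow across batch boundaries.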

Every 50 iterations we also extract the current cost of the model in training, as well as the accuracy against the current training batch, to provide printed feedback during training. The outputs look like this:

Epoch 9, Step 1850, cost: 96.185, accuracy: 0.198

Epoch 9, Step 1900, cost: 94.755, accuracy: 0.235

Finally, at the end of each epoch, we use the *saver* object to save a model checkpoint, and once training is complete a final save of the model state is performed.

### Expected training outcomes

The expected cost and accuracy progress through the epochs depends on the multitude of parameters supplied to the models and also the results of the random initialization of the variables. Training time is also dependent on whether you are using only CPUs, or whether you are using GPUs too (note, I have not tested the code on the Github repository with GPUs).

My model achieved an average cost and *training batch* accuracy on the order of 110-120 and 30%, respectively, after 38 epochs with the following parameters:

Hidden size:650, Number of steps:35, Initialization scale:0.05, Batch size:20, Number of stacked LSTM layers:2, Keep probability / dropout: 0.5

You are probably thinking the accuracy isn’t very high, and you are correct. However, further training and a larger hidden layer would provide better final accuracy values. To perform further training on a larger network you really need to be using GPUs to accelerate the training. I’ll do this in a future post and present the results.

## Testing the model

To test the model on the test or validation data, I’ve created another function called *test* which looks like so:

We start by creating Input and Model instances that match our training Input and Model instances. It is important that key parameters match the training model, such as the hidden size, number of steps, batch size etc. We are going to load our saved model variables into the computational graph created by the test Model instance, and if the dimensions don’t match TensorFlow will throw an error.

Next we create a *tf.train.Saver()* operation, which will load all our saved model variables into our test model when we run the line *saver.restore(sess, model_path)*. After dealing with all of the threads and creating a zeroed state variable, we set up some variables which relate to how we are going to assess the accuracy and look at some specific instances of predicted strings. Because we have to “warm up” the model by feeding it some data to get good state variables, we only measure the accuracy after a certain number of batches, i.e. *acc_check_thresh*.

When the batch number is equal to *check_batch_idx*, the code runs the *m.predict* operation to extract the predictions for that particular batch of data. The first predicted sequence of the batch is passed through the reverse dictionary to convert it back to actual words (along with the batch target words), and the two are then printed so that the predictions can be compared with what should have been predicted.
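The warm-up and decoding logic can be sketched in plain Python. The `evaluate` and `batch_accuracy` names, the tiny vocabulary and the toy predictions are all hypothetical. The sketch also assumes that accuracy is counted from batch index *acc_check_thresh* onwards (inclusive).

```python
def batch_accuracy(predictions, targets):
    """Fraction of positions in the batch where the predicted word id matches the target."""
    pairs = [(p, t) for p_seq, t_seq in zip(predictions, targets)
             for p, t in zip(p_seq, t_seq)]
    return sum(p == t for p, t in pairs) / len(pairs)

def evaluate(batches, reversed_dictionary, acc_check_thresh, check_batch_idx):
    """Average accuracy after the warm-up batches, plus one decoded example."""
    accuracies = []
    decoded = None
    for batch_idx, (predictions, targets) in enumerate(batches):
        if batch_idx == check_batch_idx:
            # Decode only the first sequence of the batch back into words
            true_words = " ".join(reversed_dictionary[i] for i in targets[0])
            pred_words = " ".join(reversed_dictionary[i] for i in predictions[0])
            decoded = (true_words, pred_words)
        if batch_idx >= acc_check_thresh:  # skip the warm-up batches
            accuracies.append(batch_accuracy(predictions, targets))
    return sum(accuracies) / len(accuracies), decoded

reversed_dictionary = {0: "<eos>", 1: "the", 2: "market", 3: "is"}
batches = [
    ([[1, 1, 0, 0]], [[1, 2, 3, 0]]),  # warm-up batch, not scored
    ([[2, 2, 3, 0]], [[1, 2, 3, 0]]),
    ([[1, 2, 3, 0]], [[1, 2, 3, 0]]),
]

avg_accuracy, decoded = evaluate(batches, reversed_dictionary,
                                 acc_check_thresh=1, check_batch_idx=1)
```

Skipping the first batches avoids penalizing the model for predictions made while its state is still close to the zeroed starting point.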

Using the trained model, we can see the following output:

True values (1st line) vs *predicted values* (2nd line):

stock market is headed many traders were afraid to trust stock prices quoted on the big board <eos> the futures halt was even <unk> by big board floor traders <eos> it <unk> things up said

*market market is n’t for traders say willing to buy the prices <eos> <eos> the big board <eos> the dow market is a worse <eos> the board traders traders <eos> the ‘s the to to*

Average accuracy: 0.283

The accuracy isn’t fantastic, but you can see the network is matching the “gist” of the sentence, i.e. it is not producing all of the exact words but is matching the general subject matter. As I mentioned above, in a future post I’ll present the data from a model trained for longer using GPUs.

I hope you enjoyed the post – it’s been a long one, but I hope that this gives you a solid foundation in understanding recurrent neural networks and LSTMs and how to implement them in TensorFlow. If you’d like to learn how to build LSTM networks in Keras, see this tutorial.