Keras LSTM tutorial – How to easily build a powerful deep learning language model

In previous posts, I introduced Keras for building convolutional neural networks and performing word embedding. The next natural step is to talk about implementing recurrent neural networks in Keras. In a previous tutorial of mine, I gave a very comprehensive introduction to recurrent neural networks and long short term memory (LSTM) networks, implemented in TensorFlow. In this tutorial, I’ll concentrate on creating LSTM networks in Keras, briefly giving a recap or overview of how LSTMs work. In this Keras LSTM tutorial, we’ll implement a sequence-to-sequence text prediction model by utilizing a large text data set called the PTB corpus. All the code in this tutorial can be found on this site’s Github repository.

A brief introduction to LSTM networks

Recurrent neural networks

An LSTM network is a kind of recurrent neural network. A recurrent neural network is a neural network that attempts to model time or sequence dependent behaviour, such as language, stock prices, electricity demand and so on. This is performed by feeding back the output of a neural network layer at time t to the input of the same network layer at time t + 1. It looks like this:

Recurrent neural network diagram with nodes shown

Recurrent neural networks are “unrolled” programmatically during training and prediction, so we get something like the following:

Unrolled recurrent neural network

Here you can see that at each time step, a new word is being supplied – the output of the previous F (i.e. $h_{t-1}$) is supplied to the network at each time step also. If you’re wondering what those example words are referring to, it is an example sentence I used in my previous LSTM tutorial in TensorFlow: “A girl walked into a bar, and she said ‘Can I have a drink please?’.  The bartender said ‘Certainly’”. The problem with vanilla recurrent neural networks, constructed from regular neural network nodes, is that as we try to model dependencies between words or sequence values that are separated by a significant number of other words, we experience the vanishing gradient problem (and also sometimes the exploding gradient problem) – to learn more about the vanishing gradient problem, see my post on the topic. This is because small gradients or weights (values less than 1) are multiplied many times over through the multiple time steps, and the gradients shrink asymptotically to zero. This means the weights of those earlier layers won’t be changed significantly and therefore the network won’t learn long-term dependencies. LSTM networks are a way of solving this problem.
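To make the shrinking effect concrete, here is a minimal sketch in plain Python. It assumes, purely for illustration, that the same gradient factor is applied at every time step; the function name gradient_scale is hypothetical and not part of the tutorial code.

```python
def gradient_scale(per_step_factor, num_time_steps):
    """Multiply the same per-step factor once for every time step."""
    scale = 1.0
    for _ in range(num_time_steps):
        scale *= per_step_factor
    return scale


# A factor below 1 shrinks towards zero (vanishing gradient),
# while a factor above 1 grows very quickly (exploding gradient).
for steps in (1, 10, 50):
    print(steps, gradient_scale(0.5, steps), gradient_scale(1.5, steps))
```

Even with a fairly mild factor of 0.5, the signal reaching a word 50 steps back is tiny, which is why long-range dependencies are so hard for a vanilla recurrent network to learn.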

LSTM networks

As mentioned previously, in this Keras LSTM tutorial we will be building an LSTM network for text prediction. An LSTM network is a recurrent neural network that has LSTM cell blocks in place of our standard neural network layers. These cells have various components called the input gate, the forget gate, and the output gate – these will be explained more fully later. Here is a graphical representation of the LSTM cell:

LSTM cell diagram

Notice first, on the left hand side, we have our new word/sequence value $x_t$ being concatenated to the previous output from the cell $h_{t-1}$. The first step for this combined input is for it to be squashed via a tanh layer. The second step is that this input is passed through an input gate. An input gate is a layer of sigmoid activated nodes whose output is multiplied by the squashed input. These input gate sigmoids can act to “kill off” any elements of the input vector that aren’t required. A sigmoid function outputs values between 0 and 1, so the weights connecting the input to these nodes can be trained to output values close to zero to “switch off” certain input values (or, conversely, outputs close to 1 to “pass through” other values).

The next step in the flow of data through this cell is the internal state / forget gate loop. LSTM cells have an internal state variable $s_t$. This variable, lagged one time step i.e. $s_{t-1}$ is added to the input data to create an effective layer of recurrence. This addition operation, instead of a multiplication operation, helps to reduce the risk of vanishing gradients. However, this recurrence loop is controlled by a forget gate – this works the same as the input gate, but instead helps the network learn which state variables should be “remembered” or “forgotten”.

Finally, we have an output layer tanh squashing function, the output of which is controlled by an output gate. This gate determines which values are actually allowed as an output from the cell $h_t$. The mathematics of the LSTM cell looks like this:


First, the input is squashed between -1 and 1 using a tanh activation function. This can be expressed by:

$$g = \tanh(b^g + x_tU^g + h_{t-1}V^g)$$

Where $U^g$ and $V^g$ are the weights for the input and previous cell output, respectively, and $b^g$ is the input bias. Note that the superscripts g are not raised powers, but rather signify that these are the input weights and bias values (as opposed to those of the input gate, forget gate, output gate etc.). This squashed input is then multiplied element-wise by the output of the input gate, which, as discussed above, is a series of sigmoid activated nodes:

$$i = \sigma(b^i + x_tU^i + h_{t-1}V^i)$$

The output of the input section of the LSTM cell is then given by: $$g \circ i$$

Where the $\circ$ operator expresses element-wise multiplication.

Forget gate and state loop

The forget gate output is expressed as:

$$f = \sigma(b^f + x_tU^f + h_{t-1}V^f)$$

The output of the element-wise product of the previous state and the forget gate is expressed as $s_{t-1} \circ f$. The output from the forget gate / state loop stage is:

$$s_t = s_{t-1} \circ f + g \circ i$$

Output gate

The output gate is expressed as:

$$o = \sigma(b^o + x_tU^o + h_{t-1}V^o)$$

So the final output of the cell, with the tanh squashing, can be shown as:

$$h_t = \tanh(s_t) \circ o$$
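The equations above can be traced through in code. The following is a minimal, dependency-free sketch of a single LSTM time step using plain Python lists. The helper names (affine, lstm_step, zero_params) and the dictionary layout for the weights are assumptions made for illustration, not the way Keras stores its parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(b, x, U, h_prev, V):
    """Return the row vector b + x U + h_prev V."""
    return [
        b[j]
        + sum(x[k] * U[k][j] for k in range(len(x)))
        + sum(h_prev[k] * V[k][j] for k in range(len(h_prev)))
        for j in range(len(b))
    ]


def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step; p maps names such as "U_g" to weights (hypothetical layout)."""
    g = [math.tanh(z) for z in affine(p["b_g"], x, p["U_g"], h_prev, p["V_g"])]
    i = [sigmoid(z) for z in affine(p["b_i"], x, p["U_i"], h_prev, p["V_i"])]
    f = [sigmoid(z) for z in affine(p["b_f"], x, p["U_f"], h_prev, p["V_f"])]
    o = [sigmoid(z) for z in affine(p["b_o"], x, p["U_o"], h_prev, p["V_o"])]
    # s_t = s_{t-1} o f + g o i  (element-wise products)
    s = [sp * fj + gj * ij for sp, fj, gj, ij in zip(s_prev, f, g, i)]
    # h_t = tanh(s_t) o o
    h = [math.tanh(sj) * oj for sj, oj in zip(s, o)]
    return h, s


def zero_params(input_size, hidden_size):
    """All-zero weights and biases, handy for checking the arithmetic by hand."""
    params = {}
    for gate in "gifo":
        params["b_" + gate] = [0.0] * hidden_size
        params["U_" + gate] = [[0.0] * hidden_size for _ in range(input_size)]
        params["V_" + gate] = [[0.0] * hidden_size for _ in range(hidden_size)]
    return params


h, s = lstm_step([1.0, -2.0, 0.5], [0.0, 0.0], [2.0, -1.0], zero_params(3, 2))
print(h, s)
```

With all weights at zero, every gate outputs 0.5 and the squashed input is 0, so the new state is simply half of the previous state. That makes it easy to check each equation by hand before plugging in real weights.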

LSTM word embedding and hidden layer size

It should be remembered that in all of the mathematics above we are dealing with vectors i.e. the input $x_t$ and $h_{t-1}$ are not single-valued scalars, but rather vectors of a certain length. Likewise, all the weights and bias values are matrices and vectors respectively.

Now, you may be wondering, how do we represent words to input them to a neural network? The answer is word embedding. I’ve written about this extensively in previous tutorials, in particular Word2Vec word embedding tutorial in Python and TensorFlow and A Word2Vec Keras tutorial. Basically it involves taking a word and finding a vector representation of that word which captures some meaning of the word.

In Word2Vec, this meaning is usually quantified by context – i.e. word vectors which are close together in vector space are those words which appear in sentences close to the same words. The word vectors can be learnt separately, as in this tutorial, or they can be learnt during the training of your Keras LSTM network. In the example to follow, we’ll be setting up what is called an embedding layer, to convert each word into a meaningful word vector. We have to specify the size of the embedding layer – this is the length of the vector each word is represented by – this is usually in the region of between 100-500. In other words, if the embedding layer size is 250, each word will be represented by a 250-length vector i.e. [$x_1, x_2, x_3,\ldots, x_{250}$].
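Conceptually, an embedding layer is a lookup table with one row per word. The sketch below illustrates this with plain Python lists and random, untrained values. The names make_embedding_table and embed, and the tiny sizes used, are assumptions for illustration only.

```python
import random


def make_embedding_table(vocabulary, embedding_size, seed=0):
    """One randomly initialised vector per word id (untrained, for illustration)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_size)]
            for _ in range(vocabulary)]


def embed(batch_of_ids, table):
    """Replace every integer word id with its row of the table."""
    return [[table[word_id] for word_id in sample] for sample in batch_of_ids]


table = make_embedding_table(vocabulary=20, embedding_size=4)
batch = [[3, 7, 7], [0, 19, 3]]  # 2 samples, 3 time steps each
vectors = embed(batch, table)
# batch size, number of time steps, embedding size
print(len(vectors), len(vectors[0]), len(vectors[0][0]))
```

The same word id always maps to the same vector, and the result gains one extra dimension compared to the integer input. During training, the values in the table would be adjusted along with the rest of the network's weights.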

LSTM hidden layer size

We usually match up the size of the embedding layer output with the hidden layer size of the LSTM cell. You might be wondering where the hidden layers in the LSTM cell come from. In my LSTM overview diagram, I simply showed “data rails” through which our input data flowed. However, each sigmoid, tanh or hidden state layer in the cell is actually a set of nodes, whose number is equal to the hidden layer size. Therefore each of the “nodes” in the LSTM cell is actually a cluster of normal neural network nodes, as in each layer of a densely connected neural network.

The Keras LSTM architecture

This section will illustrate what a full LSTM architecture looks like, and show the architecture of the network that we are building in Keras. This will further illuminate some of the ideas expressed above, including the embedding layer and the tensor sizes flowing around the network. The proposed architecture looks like the following:  

Keras LSTM tutorial architecture

The input shape of the text data is ordered as follows : (batch size, number of time steps, hidden size). In other words, for each batch sample and each word in the number of time steps, there is a 500 length embedding word vector to represent the input word. These embedding vectors will be learnt as part of the overall model learning. The input data is then fed into two “stacked” layers of LSTM cells (of 500 length hidden size) – in the diagram above, the LSTM network is shown as unrolled over all the time steps. The output from these unrolled cells is still (batch size, number of time steps, hidden size). This output data is then passed to a Keras layer called TimeDistributed, which will be explained more fully below.

Finally, the output layer has a softmax activation applied to it. This output is compared to the training y data for each batch, and the error and gradient back propagation is performed from there in Keras. The training data in this case is the input words advanced one time step – in other words, at each time step the model is trying to predict the very next word in the sequence. However, it does this at every time step – hence the output layer has the same number of time steps as the input layer. This will be made more clear later.

Building the Keras LSTM model

In this section, each line of code to create the Keras LSTM architecture shown above will be stepped through and discussed. However, I’ll only briefly discuss the text preprocessing code, which mostly uses the code found on the TensorFlow site here. The complete code for this Keras LSTM tutorial can be found at this site’s Github repository. Note that you first have to download the Penn Tree Bank (PTB) dataset, which will be used as the training and validation corpus. You’ll need to change the data_path variable in the Github code to match the location of this downloaded data.

The text preprocessing code

In order to get the text data into the right shape for input into the Keras LSTM model, each unique word in the corpus must be assigned a unique integer index. Then the text corpus needs to be re-constituted in order, but rather than text words we have the integer identifiers in order. The three functions which do this in the code are read_words, build_vocab and file_to_word_ids. I won’t go into these functions in detail, but basically, they first split the given text file into separate words and sentence based characters (i.e. end-of-sentence <eos>). Then, each unique word is identified and assigned a unique integer. Finally, the original text file is converted into a list of these unique integers, where each word is substituted with its new integer identifier. This allows the text data to be consumed in the neural network.
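The three steps described above can be sketched on a small in-memory string instead of a file. The function names below are hypothetical stand-ins for read_words, build_vocab and file_to_word_ids, and the choice to hand out ids in order of descending word frequency is an assumption made for illustration.

```python
import collections


def read_words_from_text(text):
    """Split text into words, marking each line end with an <eos> token."""
    return text.replace("\n", " <eos> ").split()


def build_vocab_from_text(text):
    """Map each unique word to an integer id."""
    counter = collections.Counter(read_words_from_text(text))
    # most frequent words first; ties are broken alphabetically
    ordered = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))
    return {word: index for index, (word, _) in enumerate(ordered)}


def text_to_word_ids(text, word_to_id):
    """Convert the text into a list of integer ids, skipping unknown words."""
    return [word_to_id[w] for w in read_words_from_text(text) if w in word_to_id]


sample_text = "the cat sat\nthe cat ran\n"
word_to_id = build_vocab_from_text(sample_text)
ids = text_to_word_ids(sample_text, word_to_id)
id_to_word = {index: word for word, index in word_to_id.items()}
print(ids)
print(" ".join(id_to_word[i] for i in ids))
```

The last line rebuilds the original word sequence from the integers, which is the same round trip that is needed later to turn model predictions back into text.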

The load_data function which I created to run these functions is shown below:

def load_data():
    # get the data paths
    train_path = os.path.join(data_path, "ptb.train.txt")
    valid_path = os.path.join(data_path, "ptb.valid.txt")
    test_path = os.path.join(data_path, "ptb.test.txt")
    # build the complete vocabulary, then convert text data to list of integers
    word_to_id = build_vocab(train_path)
    train_data = file_to_word_ids(train_path, word_to_id)
    valid_data = file_to_word_ids(valid_path, word_to_id)
    test_data = file_to_word_ids(test_path, word_to_id)
    vocabulary = len(word_to_id)
    reversed_dictionary = dict(zip(word_to_id.values(), word_to_id.keys()))
    print(" ".join([reversed_dictionary[x] for x in train_data[:10]]))
    return train_data, valid_data, test_data, vocabulary, reversed_dictionary

To call this function, we can run:

train_data, valid_data, test_data, vocabulary, reversed_dictionary = load_data()

The first three outputs from this function are the training data, validation data and test data from the data set, respectively, but with each word represented as an integer in a list. Some information is printed out during the running of load_data(), one of which is print(train_data[:5]) – this produces the following output:

[9970, 9971, 9972, 9974, 9975]

As you can observe, the training data is comprised of a list of integers, as expected. Next, the output vocabulary is simply the number of unique words in our vocabulary, not the size of the whole text corpus. When words are incorporated into the training data, not every unique word is considered. Rather, in natural language processing, the text data is usually limited to a certain number of the most common words. In this case N = vocabulary = 10,000.

Finally, reversed_dictionary is a Python dictionary where the key is the unique integer identifier of a word, and the associated value is the word in text. This allows us to work backwards from predicted integer words that our model will produce, and translate them back to real text. For instance, the following code converts the integers in train_data back to text which is then printed: print(" ".join([reversed_dictionary[x] for x in train_data[100:110]])). This code snippet produces:

workers exposed to it more than N years ago researchers

That’s about all the explanation required with regard to the text pre-processing, so let’s progress to setting up the input data generator which will feed samples into our Keras LSTM model.

Creating the Keras LSTM data generators

When training neural networks, we generally feed data into them in small batches, called mini-batches, or just “batches” (for more information on mini-batch gradient descent, see my tutorial here). Keras has some handy functions which can extract training data automatically from a pre-supplied Python iterator/generator object and input it to the model. One of these Keras functions is called fit_generator. 

The first argument to fit_generator is the Python iterator function that we will create, and it will be used to extract batches of data during the training process. This function in Keras will handle all of the data extraction, input into the model, executing gradient steps, logging metrics such as accuracy, and executing callbacks (these will be discussed later). The Python iterator function needs to have a form like:

while True:
    # do some things to create a batch of data (x, y)
    yield x, y

In this case, I have created a generator class that contains a method that implements such a structure. The initialization of this class looks like:

class KerasBatchGenerator(object):
    def __init__(self, data, num_steps, batch_size, vocabulary, skip_step=5):
        self.data = data
        self.num_steps = num_steps
        self.batch_size = batch_size
        self.vocabulary = vocabulary
        # this will track the progress of the batches sequentially through the
        # data set - once the data reaches the end of the data set it will reset
        # back to zero
        self.current_idx = 0
        # skip_step is the number of words which will be skipped before the next
        # batch is skimmed from the data set
        self.skip_step = skip_step

Here the KerasBatchGenerator object takes our data as the first argument. Note, this data can be either training, validation or test data – multiple instances of the same class can be created and used in the various stages of our machine learning development cycle – training, validation tuning, test.

The next argument supplied is called num_steps – this is the number of words that we will feed into the time distributed input layer of the network. In other words (pun intended), this is the set of words that the model will learn from to predict the words coming after. The argument batch_size is pretty self-explanatory, and we’ve discussed vocabulary already (it is equal to 10,000 in this case).

Finally, skip_step is the number of words we want to skip over between training samples within each batch. To make this a bit clearer, consider the following sentence: “The cat sat on the mat, and ate his hat. Then he jumped up and spat”. If num_steps is set to 5, the data consumed as the input data for a given sample would be “The cat sat on the”. In this case, because we are predicting the very next word in the sequence via our model, for each time step, the matching output or target data would be “cat sat on the mat”.

The skip_step argument is the number of words to skip over before the next sample is taken. If, in this example, skip_step=num_steps, the next 5 input words would be “mat and ate his hat”. Hopefully, that makes sense.
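The cat sentence example can be reproduced with a few lines of plain Python. This is only a sketch of the windowing idea; make_samples is a hypothetical helper, and punctuation is dropped from the sentence for simplicity.

```python
def make_samples(words, num_steps, skip_step, num_samples):
    """Cut (input, target) windows out of a word list."""
    samples = []
    start = 0
    for _ in range(num_samples):
        x = words[start:start + num_steps]          # input words
        y = words[start + 1:start + num_steps + 1]  # same window advanced by one
        samples.append((x, y))
        start += skip_step
    return samples


words = "The cat sat on the mat and ate his hat Then he jumped up and spat".split()
for x, y in make_samples(words, num_steps=5, skip_step=5, num_samples=2):
    print("input: ", " ".join(x))
    print("target:", " ".join(y))
```

Setting skip_step to a value smaller than num_steps would make consecutive samples overlap, which yields more training samples from the same text at the cost of some redundancy.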

One final item in the initialization of the class needs to be discussed. This is the variable current_idx, which is initialized at zero. This variable is required to track the extraction of data through the full data set. Once the full data set has been consumed in the training, we need to reset current_idx to zero so that the data consumption starts from the beginning of the data set again. In other words, it is basically a data set location pointer.

Ok, now we need to discuss the generator method that will be called during fit_generator:

def generate(self):
    x = np.zeros((self.batch_size, self.num_steps))
    y = np.zeros((self.batch_size, self.num_steps, self.vocabulary))
    while True:
        for i in range(self.batch_size):
            if self.current_idx + self.num_steps >= len(self.data):
                # reset the index back to the start of the data set
                self.current_idx = 0
            x[i, :] = self.data[self.current_idx:self.current_idx + self.num_steps]
            temp_y = self.data[self.current_idx + 1:self.current_idx + self.num_steps + 1]
            # convert all of temp_y into a one hot representation
            y[i, :, :] = to_categorical(temp_y, num_classes=self.vocabulary)
            self.current_idx += self.skip_step
        yield x, y

In the first couple of lines, our x and y output arrays are created. The size of variable x is fairly straightforward to understand: its first dimension is the number of samples we specify in the batch. The second dimension is the number of words we are going to base our predictions on.

The size of variable y is a little more complicated. First, it has the batch size as the first dimension, then it has the number of time steps as the second, as discussed above. However, y has an additional third dimension, equal to the size of our vocabulary, in this case, 10,000. The reason for this is that the output layer of our Keras LSTM network will be a standard softmax layer, which will assign a probability to each of the 10,000 possible words. The one word with the highest probability will be the predicted word – in other words, the Keras LSTM network will predict one word out of 10,000 possible categories. Therefore, in order to train this network, we need to create a training sample for each word that has a 1 in the location of the true word, and zeros in all the other 9,999 locations. It will look something like this: (0, 0, 0, …, 1, 0, …, 0, 0) – this is called a one-hot representation, or alternatively, a categorical representation. Therefore, for each target word, there needs to be a 10,000 length vector with only one of the elements in this vector set to 1.

Ok, now onto the while True: yield x, y paradigm that was discussed earlier for the generator. In the first line, we enter into a for loop of size batch_size, to populate all the data in the batch. Next, there is a condition to test regarding whether we need to reset the current_idx pointer. Remember that for each training sample we consume num_steps words, plus one extra word for the final target. Therefore, if the current index point plus num_steps is greater than or equal to the length of the data set, then the current_idx pointer needs to be reset to zero to start over with the data set.

After this check is performed, the input data is consumed into the array. The data indices consumed are pretty straight-forward to understand – it is the current index to the current-index-plus-num_steps number of words. Next, a temporary variable is populated which works in pretty much the same way – the only difference is that the starting point and the endpoint of the data consumption is advanced by 1 (i.e. + 1). If this is confusing, please refer to the “cat sat on the mat etc.” example discussed above.
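The index bookkeeping, including the reset back to the start of the data, can be watched in a stripped-down version of the generator. The sketch below is an assumption-laden simplification: it uses plain lists, skips the one-hot conversion, and the function name integer_batches is hypothetical.

```python
def integer_batches(data, num_steps, batch_size, skip_step):
    """Yield (x, y) batches of integer windows, wrapping around at the end of data."""
    current_idx = 0
    while True:
        x, y = [], []
        for _ in range(batch_size):
            if current_idx + num_steps >= len(data):
                # not enough words left for an input window plus its final target
                current_idx = 0
            x.append(data[current_idx:current_idx + num_steps])
            y.append(data[current_idx + 1:current_idx + num_steps + 1])
            current_idx += skip_step
        yield x, y


batches = integer_batches(list(range(10)), num_steps=3, batch_size=2, skip_step=3)
first = next(batches)
second = next(batches)
print(first)
print(second)
```

In the second batch, the second sample starts again from index 0, because a window starting at index 9 would run past the end of the ten-element data set.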

The final step is converting each of the target words in each sample into the one-hot or categorical representation that was discussed previously. To do this, you can use the Keras to_categorical function. This function takes a series of integers as its first argument and adds an additional dimension to the vector of integers. This dimension is the one-hot representation of each integer.

Its size is specified by the second argument passed to the function. So say we have a series of integers with a shape (100, 1) and we pass it to the to_categorical function and specify the size to be equal to 10,000 – the returned shape will be (100, 10000). For instance, let’s say the series / vector of integers looked like: (0, 1, 2, 3, ….), the to_categorical output would look like: (1, 0, 0, 0, 0, ….) (0, 1, 0, 0, 0, ….) (0, 0, 1, 0, 0, ….) and so on… Here the “…” represents a whole lot of zeroes ensuring that the total number of elements associated with each integer is 10,000. Hopefully, that makes sense.
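To see the one-hot idea without any library, here is a small pure-Python sketch. The one_hot function is a hypothetical stand-in that mimics the behaviour described above for a flat list of integers; it is not the Keras implementation.

```python
def one_hot(word_ids, num_classes):
    """Return one row per integer, with a single 1 at that integer's position."""
    rows = []
    for word_id in word_ids:
        row = [0] * num_classes
        row[word_id] = 1
        rows.append(row)
    return rows


encoded = one_hot([0, 1, 2, 3], num_classes=6)
for row in encoded:
    print(row)
```

With a vocabulary of 10,000 words, each row would be 10,000 elements long, which is why the y array in the generator is so much larger than the x array.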

The final two lines of the generator function are straightforward. First, the current_idx pointer is incremented by skip_step, whose role was discussed previously. The last line yields the batch of x and y data. Now that the generator class has been created, we need to create instances of it. As mentioned previously, we can set up instances of the same class to correspond to the training and validation data.

In the code, this looks like the following:

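# skip_step=num_steps is assumed here, following the example discussed earlier
train_data_generator = KerasBatchGenerator(train_data, num_steps, batch_size, vocabulary,
                                           skip_step=num_steps)
valid_data_generator = KerasBatchGenerator(valid_data, num_steps, batch_size, vocabulary,
                                           skip_step=num_steps)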

Now that the input data for our Keras LSTM code is all setup and ready to go, it is time to create the LSTM network itself.

Creating the Keras LSTM structure

In this example, the Sequential way of building deep learning networks will be used. This way of building networks was introduced in my Keras tutorial – build a convolutional neural network in 11 lines. The alternate way of building networks in Keras is the Functional API, which I used in my Word2Vec Keras tutorial. Basically, the sequential methodology allows you to easily stack layers into your network without worrying too much about all the tensors (and their shapes) flowing through the model. However, you still have to keep your wits about you for some of the more complicated layers, as will be discussed below.

In this example, it looks like the following:

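model = Sequential()
model.add(Embedding(vocabulary, hidden_size, input_length=num_steps))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(LSTM(hidden_size, return_sequences=True))
if use_dropout:
    # the dropout rate is not shown in this listing; 0.5 is an assumed value
    model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(vocabulary)))
model.add(Activation('softmax'))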

The first step involves creating a Keras model with the Sequential() constructor.

The first layer in the network, as per the architecture diagram shown previously, is a word embedding layer. This will convert our words (referenced by integers in the data) into meaningful embedding vectors. This Embedding() layer takes the size of the vocabulary as its first argument, then the size of the resultant embedding vector that you want as the next argument.

Finally, because this layer is the first layer in the network, we must specify the “length” of the input i.e. the number of steps/words in each sample. It’s worthwhile keeping track of the Tensor shapes in the network – in this case, the input to the embedding layer is (batch_size, num_steps) and the output is (batch_size, num_steps, hidden_size). Note that Keras, in the Sequential model, always maintains the batch size as the first dimension. It receives the batch size from the Keras fitting function (i.e. fit_generator in this case), and therefore it is rarely (never?) included in the definitions of the Sequential model layers.

The next layer is the first of our two LSTM layers. To specify an LSTM layer, first you have to provide the number of nodes in the hidden layers within the LSTM cell, e.g. the number of cells in the forget gate layer, the tanh squashing input layer and so on. The next argument that is specified in the code above is the return_sequences=True argument. What this does is ensure that the LSTM cell returns all of the outputs from the unrolled LSTM cell through time. If this argument is left out, the LSTM cell will simply provide the output of the LSTM cell from the last time step.

The diagram below shows what I mean:

Keras LSTM return sequences argument comparison

As can be observed in the diagram above, there is only one output when return_sequences=False – $h_t$ . However, when return_sequences=True all of the unrolled outputs from the LSTM cells are returned $h_0 … h_t$. In this case, we want the latter arrangement. Why?

Well, in this example we are trying to predict the very next word in the sequence. However, if we are trying to train the model, it is best to be able to compare the LSTM cell output at each time step with the very next word in the sequence – in this way we get num_steps sources to correct errors in the model (via back-propagation) rather than just one for each sample. Therefore, for both stacked LSTM layers, we want to return all the sequences. The output shape of each LSTM layer is (batch_size, num_steps, hidden_size).
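The difference between the two settings can be mimicked with a toy recurrence. In the sketch below, a single tanh update stands in for the full LSTM cell, which is a deliberate simplification; run_recurrence is a hypothetical function, not a Keras API.

```python
import math


def run_recurrence(inputs, return_sequences):
    """Unroll a toy recurrent cell over the inputs."""
    h = 0.0
    outputs = []
    for x_t in inputs:
        h = math.tanh(x_t + h)  # toy stand-in for an LSTM cell update
        outputs.append(h)
    # either every h_0 ... h_t, or only the final h_t
    return outputs if return_sequences else outputs[-1]


inputs = [0.5, -1.0, 0.25, 2.0]
all_steps = run_recurrence(inputs, return_sequences=True)
last_only = run_recurrence(inputs, return_sequences=False)
print(len(all_steps))
print(last_only == all_steps[-1])
```

With all outputs returned, there is one value per time step that can be compared against the next word, rather than a single value per sample.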

The next layer in our Keras LSTM network is a dropout layer to prevent overfitting. After that, there is a special Keras layer for use in recurrent neural networks called TimeDistributed. This wrapper applies the layer it is given to every time step of the sequence, using the same weights at each step. So, for instance, if we have 10 time steps in a model, a TimeDistributed layer operating on a Dense layer applies that Dense layer 10 times, once for each time step, and produces one output vector per step. The activation for this dense layer is set to be softmax in the final layer of our Keras LSTM model.
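What the time-distributed dense layer and the softmax activation compute can be sketched in plain Python. In Keras, the wrapped Dense layer's weights are shared across all time steps, and the sketch mirrors that by reusing one set of weights. The function names and the tiny made-up sizes (hidden size 2, vocabulary of 3) are assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector, weights, bias):
    """Compute bias + vector x weights for a single time step."""
    return [bias[j] + sum(vector[k] * weights[k][j] for k in range(len(vector)))
            for j in range(len(bias))]


def time_distributed_dense_softmax(sequence, weights, bias):
    """Apply the same dense layer and softmax to every time step."""
    return [softmax(dense(h_t, weights, bias)) for h_t in sequence]


weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
bias = [0.0, 0.1, 0.0]
lstm_outputs = [[0.2, -0.1], [0.9, 0.3], [0.2, -0.1]]  # 3 time steps
probs = time_distributed_dense_softmax(lstm_outputs, weights, bias)
for step_probs in probs:
    print(step_probs)
```

Each time step ends up with its own probability distribution over the vocabulary, and identical LSTM outputs at different time steps give identical distributions because the weights are the same.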

Compiling and running the Keras LSTM model

The next step in Keras, once you’ve completed your model, is to run the compile command on the model. It looks like this:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

In this command, the type of loss that Keras should use to train the model needs to be specified. In this case, we are using ‘categorical_crossentropy’ which is cross-entropy applied in cases where there are many classes or categories, of which only one is true. Next, in this example, the optimizer that will be used is the Adam optimizer – an effective “all-round” optimizer with adaptive stepping.
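For a single target word, categorical cross-entropy reduces to the negative log of the probability the model assigned to the true word. The sketch below shows this in plain Python; it is a simplified illustration, and Keras's own implementation also handles details such as numerical clipping and averaging, which are left out here.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)


target = [0, 1, 0]
loss_good = categorical_crossentropy(target, [0.1, 0.8, 0.1])
loss_bad = categorical_crossentropy(target, [0.7, 0.2, 0.1])
print(loss_good, loss_bad)
```

The loss is small when the true word receives a high probability and grows as that probability drops, which is the signal back-propagated at every time step.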

Finally, a metric is specified – ‘categorical_accuracy’, which can let us see how the accuracy is improving during training. The next line of code involves creating a Keras callback – callbacks are certain functions which Keras can optionally call, usually after the end of a training epoch. For more on callbacks, see my Keras tutorial. The callback that is used in this example is a model checkpoint callback – this callback saves the model after each epoch, which can be handy for when you are running long-term training.

checkpointer = ModelCheckpoint(filepath=data_path + '/model-{epoch:02d}.hdf5', verbose=1)

Note that the model checkpoint function can include the epoch in its naming of the model, which is good for keeping track of things. The final step in training the Keras LSTM model is to call the aforementioned fit_generator function.

The line below shows you how to do this:

model.fit_generator(train_data_generator.generate(), len(train_data)//(batch_size*num_steps), num_epochs,
                        validation_data=valid_data_generator.generate(),
                        validation_steps=len(valid_data)//(batch_size*num_steps), callbacks=[checkpointer])

The first argument to fit_generator is our generator function that was explained earlier. The next argument is the number of iterations to run for each training epoch. The value given len(train_data)//(batch_size*num_steps) ensures that the whole data set is run through the model in each epoch. Likewise, a generator for the smaller validation data set is called, with the same argument for the number of iterations to run. At the end of each epoch, the validation data will be run through the model and the accuracy will be returned.
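The iteration count is simple integer arithmetic, sketched below. The data set length of 1,000,000 words is a made-up round number for illustration, not the real size of the PTB training set, and steps_per_epoch is a hypothetical helper name.

```python
def steps_per_epoch(num_words, batch_size, num_steps):
    """Number of batches needed to pass over the data once."""
    return num_words // (batch_size * num_steps)


# hypothetical data set of 1,000,000 words with the parameters used in this tutorial
print(steps_per_epoch(1_000_000, batch_size=20, num_steps=30))
```

Note that this calculation only covers the whole data set once per epoch if each sample advances by num_steps words, i.e. when skip_step equals num_steps. With a smaller skip_step, the same number of iterations would cover only part of the data.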

Finally, the model checkpoint callback explained above is supplied via the callbacks argument in fit_generator. Now the model is good to go! Before some results are presented – some caveats are required. First, the PTB data set is a serious text data set – not a toy problem to demonstrate how good LSTM models are. Therefore, in order to get good results, you’ll likely have to run the model over many epochs, and the model will need to have a significant level of complexity. Therefore, it is likely to take a long time on a CPU machine, and I’d suggest running it on a machine with a good GPU if you want to try and replicate things.

If you don’t have a GPU machine yourself, you can create an Amazon EC2 instance as shown in my Amazon AWS tutorial. Another alternative is to use Google Colaboratory which offers free GPU time, see my introduction here. I’m in the latter camp, and wasn’t looking to give too many dollars to Amazon to train, optimize learning parameters and so on. However, I’ve run the model up to 40 epochs and gotten some reasonable initial results. My model parameters for the results presented below are as follows:

num_steps=30 batch_size=20 hidden_size=500

After 40 epochs, training data set accuracy was around 40%, while validation set accuracy reached approximately 20-25%. This is the sort of output you’ll see while running the training session:

Keras LSTM tutorial – example training output

The Keras LSTM results

In order to test the trained Keras LSTM model, one can compare the predicted word outputs against what the actual word sequences are in the training and test data set. The code below is a snippet of how to do this, where the comparison is against the predicted model output and the training data set (the same can be done with the test_data data).

model = load_model(data_path + "/model-40.hdf5")
dummy_iters = 40
# skip_step=1 so that each call advances one word, matching the indexing below
example_training_generator = KerasBatchGenerator(train_data, num_steps, 1, vocabulary,
                                                 skip_step=1)
print("Training data:")
for i in range(dummy_iters):
    dummy = next(example_training_generator.generate())
num_predict = 10
true_print_out = "Actual words: "
pred_print_out = "Predicted words: "
for i in range(num_predict):
    data = next(example_training_generator.generate())
    prediction = model.predict(data[0])
    # take the most probable word at the final time step
    predict_word = np.argmax(prediction[:, num_steps-1, :])
    true_print_out += reversed_dictionary[train_data[num_steps + dummy_iters + i]] + " "
    pred_print_out += reversed_dictionary[predict_word] + " "
print(true_print_out)
print(pred_print_out)

In the code above, first, the model is reloaded from the trained data (in the example above, it is the checkpoint from the 40th epoch of training). Then another KerasBatchGenerator class is created, as was discussed previously – in this case, a batch of length 1 is used, as we only want one num_steps worth of text data to compare. Then a loop of dummy data extractions from the generator is created – this is to control where in the data-set the comparison sentences are drawn from.

The second loop, from 0 to num_predict, is where the interesting stuff is happening. First, a batch of data is extracted from the generator and this is passed to the model.predict() method. This returns num_steps worth of predicted words. However, each predicted word is represented by the softmax output of the network, i.e. a vector of 10,000 probabilities, one for each word in the vocabulary. The index of the largest probability is the integer representation of the predicted English word. To extract this index, we can use the np.argmax() function, which identifies the index where the maximum value occurs in a vector. Note that the code only looks at the prediction for the final time step (index num_steps-1), which corresponds to the word following the input sequence.
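The argmax step can be illustrated on a tiny hand-written prediction. The sketch below uses plain lists and a hypothetical argmax helper in place of np.argmax, with made-up probabilities for a 4-word vocabulary and 3 time steps.

```python
def argmax(values):
    """Index of the largest value (the first one if there is a tie)."""
    best_index = 0
    for index, value in enumerate(values):
        if value > values[best_index]:
            best_index = index
    return best_index


# prediction for one sample: one probability distribution per time step
prediction = [
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.3, 0.3, 0.1],
    [0.05, 0.15, 0.1, 0.7],
]
num_steps = len(prediction)
predicted_word_id = argmax(prediction[num_steps - 1])
print(predicted_word_id)
```

The resulting integer is the id of the most probable next word, which can then be looked up in a dictionary that maps ids back to text.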

Once the index has been identified, it can be translated into an actual English word by using the reversed_dictionary that was constructed during the data pre-processing. This English word is then added to the predicted words string, and finally the actual and predicted words are printed.

The output below is the comparison between the actual and predicted words after 10 epochs of training on the training data set:

Comparison on the training data set after 10 epochs of training

As can be observed, while some words match, after 10 epochs of training the match is pretty poor. By the way “<unk>” refers to words not included in the 10,000 length vocabulary of the data set. Alternatively, if we look at the comparison after 40 epochs of training (again, just on the training data set):

Comparison on the training data set after 40 epochs of training

It can be observed that the match is quite good between the actual and predicted words in the training set. However, when we look at the test data set, the match after 40 epochs of training isn’t quite as good:

Comparison on the test data set after 40 epochs of training

Despite there not being a perfect correspondence between the predicted and actual words, you can see that there is a rough correspondence and the predicted sub-sentence at least makes some grammatical sense. So not so bad after all. However, in order to train a Keras LSTM network which can perform well on this realistic, large text corpus, more training and optimization is required. I will leave it up to you, the reader, to experiment further if you desire. However, the current code is sufficient for you to gain an understanding of how to build a Keras LSTM network, along with an understanding of the theory behind LSTM networks.

I hope this (large) tutorial is a help to you in understanding Keras LSTM networks, and LSTM networks in general.