Python gensim Word2Vec tutorial with TensorFlow and Keras

I’ve been dedicating quite a bit of time recently to Word2Vec tutorials because of the importance of the Word2Vec concept for natural language processing (NLP), and also because I’ll soon be presenting some tutorials on recurrent neural networks and LSTMs for sequence prediction/NLP.  There are also some very interesting ideas floating around, such as thought vectors, which require an understanding of the Word2Vec concept.  My two existing Word2Vec tutorials are Word2Vec word embedding tutorial in Python and TensorFlow and A Word2Vec Keras tutorial, showing the concepts of Word2Vec and implementing them in TensorFlow and Keras, respectively.  In this tutorial, I am going to show you how you can use the original Google Word2Vec algorithm to generate word vectors via the Python gensim library, which provides a very fast, C-compiled implementation, and then apply the results in TensorFlow and Keras.

The gensim Word2Vec implementation is very fast due to its underlying C-compiled code – but to get that speed you will first need to install the Cython library. In this tutorial, I’ll show how to load the resulting embedding layer generated by gensim into TensorFlow and Keras embedding implementations.  Because of gensim’s blazing-fast, C-compiled code, this is a good alternative to training native Word2Vec embeddings in TensorFlow and Keras.


Recommended online course: If you are more of a video course learner, check out this inexpensive Udemy course: Natural Language Processing with Deep Learning in Python


Word2Vec and gensim

I’ve devoted plenty of words to explaining Word2Vec in my previous tutorials (here and here), so I’ll only briefly introduce the Word2Vec concepts here.  For further details, check out those tutorials. Here’s the (relatively) quick version – for each text data set we work with, we have to create a vocabulary. The vocabulary is the list of unique words within the text, and for serious data sets it is often >10,000 words.  Machine learning models generally can’t take raw word inputs, so we first need to convert our data set into some number format – generally a list of integers, with one unique integer assigned to each word in the vocabulary.
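
To make this concrete, here is a quick illustrative sketch (not part of the tutorial code) of building a vocabulary and converting words to integers:

from collections import Counter

text = "the cat sat on the mat".split()
# order words by frequency so that index 0 is the most common word
counts = Counter(text)
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}
int_data = [vocab[word] for word in text]
print(vocab)     # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4} (tie order may vary)
print(int_data)  # [0, 1, 2, 3, 0, 4]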

Neural-network-based models like vector inputs, so we need to convert the integers into vectors.  A naive way of converting integers into vectors is to convert them into one-hot vectors – these are vectors where all of the values are set to zero, except for one, i.e. [0, 0, 0, …, 1, …, 0, 0].  The “one-hot” value is located at the array index which matches the unique integer representation of the word. Therefore, our input one-hot vectors must be at least the size of the vocabulary in length – i.e. >10,000 elements.
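
As a quick sketch of the idea (again, not part of the tutorial code), building a one-hot vector for a single word looks like this:

import numpy as np

vocab_size = 10000
word_index = 42  # the unique integer assigned to some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0  # a single 1 at the word's index, zeros everywhere else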

There are two main problems with this type of representation of words. The first is that it is inefficient: each word is represented by a vector of 10,000-plus elements, which for neural networks means a heck of a lot of associated weights between the input layer and the first hidden layer (generally millions).  The second is that it loses all contextual meaning of the words.  We need a way of representing words that is both efficient and yet retains some of the original meaning of the word and its relation to other words. Enter word embedding and Word2Vec.

Word embedding and Word2Vec

Word embedding involves creating better vector representations of words – both in terms of efficiency and maintaining meaning. For instance, a word embedding layer may involve creating a 10,000 x 300 sized matrix, whereby we look up a 300 length vector representation for each of the 10,000 words in our vocabulary.  This new, 300 length vector is obviously a lot more efficient than a 10,000 length one-hot representation.  But we also need to create this 300 length vector in such a way as to preserve some semblance of the meaning of the word.
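
In other words, the embedding layer is simply a lookup table. Here is a minimal sketch of the idea, using a random matrix as a stand-in for trained weights:

import numpy as np

vocab_size, embedding_dim = 10000, 300
embedding = np.random.rand(vocab_size, embedding_dim)  # stand-in for trained weights
word_index = 42
word_vector = embedding[word_index]  # the 300 length vector for this word
# this lookup is equivalent to multiplying the word's one-hot vector by the matrix:
# one_hot @ embedding == embedding[word_index]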

Word2Vec does this by taking the context of words surrounding the target word.  So, if we have a context window of 2, the context of the target word “sat” in the sentence “the cat sat on the mat” is the list of words [“the”, “cat”, “on”, “the”]. In Word2Vec, the meaning of a word is roughly translatable to context – and it basically works. Target words which share similar common context words often have similar meanings. The way Word2Vec trains the embedding vectors is via a neural network of sorts – the neural network, given a one-hot representation of a target word, tries to predict the most likely context words.  For an introduction to neural networks, see this tutorial.
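
To make the context window idea concrete, here is a small sketch (not part of the gensim code) that extracts the context words for each target word using a window size of 2:

sentence = "the cat sat on the mat".split()
window = 2
for pos, target in enumerate(sentence):
    # take up to `window` words on either side of the target word
    context = sentence[max(0, pos - window):pos] + sentence[pos + 1:pos + window + 1]
    print(target, context)
# for the target word "sat" this prints: sat ['the', 'cat', 'on', 'the']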

Here’s a naive way of performing the neural network training using an output softmax layer:

gensim word embedding softmax trainer
A word embedding softmax trainer

In this network, the 300 node hidden layer weights are trained by trying to predict (via a softmax output layer) genuine, high probability context words.  Once the training is complete, the output softmax layer is discarded and what is of real value is the 10,000 x 300 weight matrix connecting the input to the hidden layer. This is our embedding matrix, and we can look up any member of our 10,000-word vocabulary and get its 300 length vector representation.

It turns out that this softmax way of training the embedding layer is very inefficient, due to the millions of weights that need to be involved in updating and calculating the softmax values. Therefore, a concept called negative sampling is used in the real Word2Vec, which involves training the layer with real context words and a few negative samples which are chosen randomly from outside the context.  For more details on this, see my Word2Vec Keras tutorial.
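
As a rough sketch of the negative sampling idea (the real Word2Vec implementation draws negatives from a smoothed unigram distribution rather than uniformly, but the principle is the same):

import numpy as np

vocab_size = 10000
context_indexes = {10, 25, 3, 876}  # integer indexes of the genuine context words
num_negative_samples = 5
negative_samples = []
while len(negative_samples) < num_negative_samples:
    candidate = np.random.randint(vocab_size)
    if candidate not in context_indexes:  # only keep words from outside the context
        negative_samples.append(candidate)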

Now that we understand what Word2Vec training of embedding layers involves, let’s talk about the gensim Word2Vec module.

A gensim Word2Vec tutorial

gensim Word2Vec - nearest words
Nearest words by cosine similarity

This section will give a brief introduction to the gensim Word2Vec module.  The gensim library is an open-source Python library that specializes in vector space and topic modeling.  It can be made very fast with the use of the Cython module, which allows C-compiled code to be run inside the Python environment. This is good for our purposes: the original Google Word2Vec implementation is written in C, and gensim provides a heavily optimized, C-compiled implementation of the same algorithm, which we will use below.

For this tutorial, we are going to use the text8 corpus sourced from here for our text data. All the code for this tutorial can be found on this site’s Github repository.

First off, we need to download the text8.zip file (if required) and extract it:

import os
import zipfile

# maybe_download is a small helper (included in this site's GitHub repository)
# that downloads the file from the url if it isn't already present locally
url = 'http://mattmahoney.net/dc/'
filename = maybe_download('text8.zip', url, 31344016)
root_path = "C:\\Users\\Andy\\PycharmProjects\\adventures-in-ml-code\\"
if not os.path.exists((root_path + filename).strip('.zip')):
    zipfile.ZipFile(root_path + filename).extractall()

This is all fairly straightforward Python file handling, downloading and zip file manipulation, so I won’t go into it here.

The next step that is required is to create an iterator for gensim to extract its data from.  We can cheat a little bit here and use a supplied iterator that gensim provides for the text8 corpus:

sentences = word2vec.Text8Corpus((root_path + filename).strip('.zip'))

The required input to the gensim Word2Vec module is an iterator object, which sequentially supplies sentences from which gensim will train the embedding layer. The line above shows the supplied gensim iterator for the text8 corpus, but below shows another generic form that could be used in its place for a different data set (not actually implemented in the code for this tutorial), where the data set also contains multiple files:

class MySentences(object):
    """Memory-friendly iterator - yields one tokenized sentence (line) at a time
    from each file in the given directory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

This capability of gensim is great, as it means you can set up iterators which cycle through the data without having to load the entire data set into memory.  This is vital, as some text data sets are huge, i.e. tens of GB.

After we’ve set up the iterator object, it is dead simple to train our word vectors:

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = word2vec.Word2Vec(sentences, iter=10, min_count=10, size=300, workers=4)

The first line just lets us see the INFO logging that gensim provides as it trains. The second line executes the training on the provided sentences iterator.  The first optional argument, iter, specifies how many times the training code will run through the data set to train the neural network (roughly the number of training epochs). The gensim training code will actually run through all the data iter+1 times, as the first pass involves collecting all the unique words, creating dictionaries etc.  The next argument, min_count, specifies the minimum number of times a word has to appear in the corpus before it is included in the vocabulary – this allows us to easily eliminate rare words and reduce our vocabulary size.  The third argument, size, is the length of the resultant word vectors – in this case, we set it to 300. In other words, each word in our vocabulary, after training, will be represented by a 300 length word vector. Finally, if we are using Cython, we can use workers to specify how many parallel worker threads we would like to work on the data – this will speed up the training process. There are lots of other arguments, but these are the main ones to consider.

Let’s examine our results and see what else gensim can do.

# get the word vector of "the"
print(model.wv['the'])

This returns a 300 length numpy vector – as you can see, each word vector can be retrieved from the model via a dictionary key i.e. a word within our vocabulary.

# get the most common words
print(model.wv.index2word[0], model.wv.index2word[1], model.wv.index2word[2])

The word vectors are also arranged within the wv object with indexes – the lowest index (i.e. 0) represents the most common word, the highest (i.e. the length of the vocabulary minus 1) the least common word.  The above code returns: “the of and”, which is unsurprising, as these are very common words.

# get the least common words
vocab_size = len(model.wv.vocab)
print(model.wv.index2word[vocab_size - 1], model.wv.index2word[vocab_size - 2], model.wv.index2word[vocab_size - 3])

The discovered vocabulary is found in model.wv.vocab – by taking the length of this dictionary, we can determine the vocabulary size (in this case, it is 47,134 elements long). The code above returns: “zanetti markschies absentia” – rare words indeed.

# find the index of the 2nd most common word ("of")
print('Index of "of" is: {}'.format(model.wv.vocab['of'].index))

We can also go the other way, i.e. retrieve the index of a word we supply.  In this case, we are getting the index of the second most common word, “of”. As expected, the above code returns: ‘Index of “of” is: 1’.

# some similarity fun
print(model.wv.similarity('woman', 'man'), model.wv.similarity('man', 'elephant'))

We can also easily extract similarity measures between word vectors (gensim uses cosine similarity). The above code returns “0.6599 0.2955”, which again makes sense given the context such words are generally used in.
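
If you want to check what gensim is doing under the hood, cosine similarity is simply the dot product of the two word vectors divided by the product of their norms – something like:

import numpy as np

# manually reproduce model.wv.similarity('woman', 'man')
woman, man = model.wv['woman'], model.wv['man']
print(np.dot(woman, man) / (np.linalg.norm(woman) * np.linalg.norm(man)))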

# what doesn't fit?
print(model.wv.doesnt_match("green blue red zebra".split()))

This fun function determines which word doesn’t match the context of the others – in this case, “zebra” is returned.

We also want to be able to convert our data set from a list of words to a list of integer indexes, based on the vocabulary developed by gensim.  To do so, we can use the following code:

# convert the input data into a list of integer indexes aligning with the wv indexes
# Read the data into a list of strings.
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words."""
    with zipfile.ZipFile(filename) as f:
        # decode the raw bytes so the words are strings matching the gensim vocabulary
        data = f.read(f.namelist()[0]).decode('utf-8').split()
    return data

def convert_data_to_index(string_data, wv):
    index_data = []
    for word in string_data:
        if word in wv:
            index_data.append(wv.vocab[word].index)
    return index_data

str_data = read_data(root_path + filename)
index_data = convert_data_to_index(str_data, model.wv)
print(str_data[:4], index_data[:4])

The first function, read_data, simply extracts the zip file data and returns a list of strings in the same order as our original text data set.  The second function loops through each word in the data set, determines if it is in the vocabulary*, and if so, adds the matching integer index to a list.  The code above returns: “[‘anarchism’, ‘originated’, ‘as’, ‘a’] [5237, 3080, 11, 5]”.

* Remember that some words in the data set will be missing from the vocabulary if they are very rare in the corpus.

We can also save and reload our trained word vectors/embeddings by the following simple code:

# save and reload the model
model.save(root_path + "mymodel")
model = gensim.models.Word2Vec.load(root_path + "mymodel")

Finally, I’ll show you how we can extract the embedding weights from the gensim Word2Vec embedding layer and store it in a numpy array, ready for use in TensorFlow and Keras.

import numpy as np

# convert the wv word vectors into a numpy matrix that is suitable for insertion
# into our TensorFlow and Keras models
vector_dim = 300  # must match the size argument used when training the gensim model
embedding_matrix = np.zeros((len(model.wv.vocab), vector_dim))
for i in range(len(model.wv.vocab)):
    embedding_vector = model.wv[model.wv.index2word[i]]
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In this case, we first create an appropriately sized numpy zeros array.  Then we loop through each word in the vocabulary, grabbing the word vector associated with that word by using the wv dictionary.  We then add the word vector into our numpy array.

So there we have it – gensim Word2Vec is a great little library that can execute the word embedding process very quickly, and also has a host of other useful functionality.

Now I will show how you can use pre-trained gensim embedding layers in our TensorFlow and Keras models.

Using gensim Word2Vec embeddings in TensorFlow

For this application, we’ll set up a dummy TensorFlow network with an embedding layer and measure the similarity between some words.  If you’re not up to speed with TensorFlow, I suggest you check out my TensorFlow tutorial or this online course: Data Science: Practical Deep Learning in Theano + TensorFlow.  Also, it’s probably a good idea to check out my Word2Vec TensorFlow tutorial to understand how the embedding layer works.

The first step is to select some random words from the top 100 most common words in our text data set.

valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

The last line loads the array of 16 random word indexes into a TensorFlow constant, valid_dataset.

For the next step, we take the embedding matrix from our gensim Word2Vec simulation and “implant it” into a TensorFlow variable which we use as our embedding layer.

# embedding layer weights are frozen to avoid updating embeddings while training
saved_embeddings = tf.constant(embedding_matrix)
embedding = tf.Variable(initial_value=saved_embeddings, trainable=False)

Note that in the second line above, the TensorFlow variable declaration sets the trainable argument to False. If we were using this layer in, say, a recurrent neural network and we didn’t set this argument to False, the embedding layer would be further trained in TensorFlow, slowing things down. Starting with a gensim embedding matrix and then training it further with something like a recurrent neural network isn’t necessarily a bad strategy, but if you want your embedding layer fixed for performance reasons, you need to set trainable to False.

The next chunk of code calculates the similarity between each of the word vectors using the cosine similarity measure. It is explained more fully in my Word2Vec TensorFlow tutorial, but basically it calculates the norm of all the embedding vectors, then performs a dot product between the validation words and all other word vectors.

# create the cosine similarity operations
norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
normalized_embeddings = embedding / norm
valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

Now we can run our TensorFlow session and sort the eight words which are closest to our validation example words.  Again, this code is explained in more detail in the previously mentioned tutorial.

# Add variable initializer.
init = tf.global_variables_initializer()

# alias for the gensim word vectors trained above
wv = model.wv

with tf.Session() as sess:
    sess.run(init)
    # call our similarity operation
    sim = similarity.eval()
    # run through each valid example, finding closest words
    for i in range(valid_size):
        valid_word = wv.index2word[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]
        log_str = 'Nearest to %s:' % valid_word
        for k in range(top_k):
            close_word = wv.index2word[nearest[k]]
            log_str = '%s %s,' % (log_str, close_word)
        print(log_str)

This code will produce lines like:

Nearest to two: three, five, zero, four, six, one, seven, eight

As you can see, our Word2Vec embeddings produced by gensim have the expected results – in this example, we have number words being grouped together in similarity which makes sense.

Next up, let’s see how we can use the gensim Word2Vec embeddings in Keras.

Using gensim Word2Vec embeddings in Keras

We can perform similar steps with a Keras model. In this case, following the example code previously shown in the Keras Word2Vec tutorial, our model takes two single word samples as input and finds the similarity between them.  The top 8 closest words loop is therefore slightly different than the previous example:

valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
# input words - in this case we do sample by sample evaluations of the similarity
valid_word = Input((1,), dtype='int32')
other_word = Input((1,), dtype='int32')
# setup the embedding layer
embeddings = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1],
                      weights=[embedding_matrix])
embedded_a = embeddings(valid_word)
embedded_b = embeddings(other_word)
similarity = merge([embedded_a, embedded_b], mode='cos', dot_axes=2)
# create the Keras model
k_model = Model(input=[valid_word, other_word], output=similarity)

def get_sim(valid_word_idx, vocab_size):
    sim = np.zeros((vocab_size,))
    in_arr1 = np.zeros((1,))
    in_arr2 = np.zeros((1,))
    in_arr1[0,] = valid_word_idx
    for i in range(vocab_size):
        in_arr2[0,] = i
        out = k_model.predict_on_batch([in_arr1, in_arr2])
        sim[i] = out
    return sim

# now run the model and get the closest words to the valid examples
for i in range(valid_size):
    valid_word = wv.index2word[valid_examples[i]]
    top_k = 8  # number of nearest neighbors
    sim = get_sim(valid_examples[i], len(wv.vocab))
    nearest = (-sim).argsort()[1:top_k + 1]
    log_str = 'Nearest to %s:' % valid_word
    for k in range(top_k):
        close_word = wv.index2word[nearest[k]]
        log_str = '%s %s,' % (log_str, close_word)
    print(log_str)

As you can see, when I set up the embedding layer (using Keras’ dedicated Embedding() layer), all we need to do is specify the input and output dimensions (vocabulary size and embedding vector length, respectively) and then assign the gensim embedding_matrix to the weights argument. All the remaining logic is a copy from the Keras Word2Vec tutorial, so check that post out for more details.

The code produces lines like:

Nearest to when: unless, if, where, whenever, then, before, once, finally

Here we can see that subordinating conjunction word types have been grouped together – which is a good, expected result.

So that wraps up the tutorial – in this post, I’ve shown you how to use gensim to create Word2Vec word embeddings in a quick and efficient fashion.  I then gave an overview of how to “upload” these learned embeddings into TensorFlow and Keras.  I hope it has been helpful.
