How to Create Word Embeddings in TensorFlow

Francesco Zuppichini
5 min read · Jul 20, 2018


Three different ways

You can find the Jupyter notebook for this article here

Today we are going to see how to create word embeddings using TensorFlow.

Updated to TensorFlow 1.9

Word embeddings are a way to represent words as points in a high-dimensional vector space in which similar words are close to each other.

Long story short, neural networks work with numbers, so you can’t just throw words into them. You could one-hot encode all the words, but you would lose the notion of similarity between them.

Usually, almost always, you place the embedding layer in front of your neural network.

Preprocessing

Usually you have some text files, you extract tokens from the text, and you build a vocabulary, with something similar to the following code.
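A minimal sketch (the full preprocessing is in the notebook); the text here is the opening of Cicero’s first Catiline oration, which is where the vocabulary below comes from.

# raw text: the opening of Cicero's first Catiline oration
text = ('Quo usque tandem abutere, Catilina, patientia nostra? '
        'Quamdiu etiam furor iste tuus nos eludet? '
        'Quem ad finem sese effrenata iactabit audacia?')

# split on whitespace, strip punctuation and lowercase everything
tokens = [word.strip('?,.').lower() for word in text.split()]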

Then you map every unique token to an integer id.
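A sketch that assigns ids in alphabetical order (reproducing the vocabulary printed below):

# map each unique token to an integer id, sorted alphabetically
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}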

This is a very simple example; usually you need more preprocessing to open multiple text sources in parallel, tokenize them, and build the vocab. For example, you could use PySpark to preprocess text efficiently.

Let’s print the vocabulary

{'abutere': 0, 'ad': 1, 'audacia': 2, 'catilina': 3, 'effrenata': 4, 'eludet': 5, 'etiam': 6, 'finem': 7, 'furor': 8, 'iactabit': 9, 'iste': 10, 'nos': 11, 'nostra': 12, 'patientia': 13, 'quamdiu': 14, 'quem': 15, 'quo': 16, 'sese': 17, 'tandem': 18, 'tuus': 19, 'usque': 20}

Now we need to define the embedding size, i.e. the dimension of each vector (50 in our case), and the vocabulary length.
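In code (a sketch; VOCAB_LEN and EMBED_SIZE are the names used in the rest of the article):

EMBED_SIZE = 50
VOCAB_LEN = len(vocab.keys())
print(VOCAB_LEN)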

21

We now need to define the ids of the words we want to embed. As an example, we are going to take abutere and patientia.
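A sketch, reusing the vocab built above:

import tensorflow as tf

# 'abutere' -> 0 and 'patientia' -> 13 in our vocabulary
words_ids = tf.constant([vocab['abutere'], vocab['patientia']])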

Just to be sure you are following me: words_ids represents the ids of some words in a vocabulary, and a vocabulary is a map of words (tokens) to ids.

Why do we have to do that? Neural networks work with numbers, so we have to pass numbers to the embedding layer.

‘Native’ method

To embed the words we can use the low-level API. We first need to define a matrix of size [VOCAB_LEN, EMBED_SIZE], in our case (21, 50), and then tell TensorFlow where to look up our word ids using tf.nn.embedding_lookup.

tf.nn.embedding_lookup creates an operation that retrieves the rows of the first parameter based on the indices in the second.
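A minimal sketch of this approach (the full code is in the notebook):

# a trainable [VOCAB_LEN, EMBED_SIZE] matrix, initialised uniformly in [0, 1)
embeddings = tf.Variable(tf.random_uniform([VOCAB_LEN, EMBED_SIZE]))
# retrieve the rows corresponding to our word ids
embed = tf.nn.embedding_lookup(embeddings, words_ids)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(embed))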

[[0.2508639 0.6186842 0.04858994 0.5210395 0.46944225 0.93606484 0.31613624 0.37244523 0.8245921 0.7652482 0.05056596 0.82652867 0.637517 0.5321804 0.84733844 0.90017974 0.41220248 0.659974 0.7645968 0.5598999 0.40155995 0.06464231 0.8390876 0.139521 0.23042619 0.04655147 0.32764542 0.80585504 0.01360166 0.9290798 0.25056374 0.9695363 0.5877855 0.9006752 0.49083364 0.5052364 0.56793296 0.50847435 0.89294696 0.4142543 0.70229757 0.56847537 0.8818027 0.8013681 0.12879837 0.75869775 0.40932536 0.04723692 0.61465013 0.97508 ] [0.846097 0.8248534 0.5730028 0.32177114 0.37013817 0.71865106 0.2488327 0.88490605 0.6985643 0.8720304 0.4982674 0.75656927 0.34931898 0.20750809 0.16621685 0.38027227 0.23989546 0.43870246 0.49193907 0.9563453 0.92043686 0.9371239 0.3556149 0.08938527 0.28407085 0.29870117 0.44801772 0.21189022 0.48243213 0.946913 0.40073442 0.71190274 0.59758437 0.70785224 0.09750676 0.27404332 0.4761486 0.64353764 0.2631061 0.19715095 0.6992599 0.72724617 0.27448702 0.3829409 0.15989089 0.09099603 0.43427885 0.78103256 0.30195284 0.888047 ]]

Keras Layer

I am not a fan of Keras, but sometimes it can be useful. You can use a stand-alone layer to create the embeddings.
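A sketch with tf.keras.layers.Embedding, reusing words_ids; the uniform [0, 1) initializer is my assumption to match the sample output below (the Keras default uses a much smaller range):

embedding_layer = tf.keras.layers.Embedding(
    VOCAB_LEN,
    EMBED_SIZE,
    embeddings_initializer=tf.random_uniform_initializer(0, 1))
embed = embedding_layer(words_ids)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(embed))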

[[0.9134803 0.36847484 0.51816785 0.19543898 0.07610226 0.8685185 0.7445053 0.5340642 0.5453609 0.72966635 0.06846464 0.19424069 0.2804587 0.77481234 0.7343868 0.16347027 0.56002617 0.76706755 0.16558647 0.6719606 0.05563295 0.22389805 0.47797906 0.98075724 0.47506428 0.7846818 0.65209556 0.89036727 0.14960134 0.8801923 0.23688185 0.70695686 0.59664845 0.6206044 0.69665396 0.60709286 0.42249918 0.7317171 0.03822994 0.37915635 0.60433483 0.4168439 0.5516542 0.84362316 0.27857065 0.33540523 0.8601098 0.47720838 0.9827635 0.09320438] [0.27832222 0.8259096 0.5726856 0.96932447 0.21936393 0.26346993 0.38576245 0.60339177 0.03083277 0.665465 0.9077859 0.6219367 0.5185654 0.5444832 0.16380131 0.6688931 0.82876015 0.9705752 0.40097427 0.28450823 0.9425919 0.50802815 0.02394092 0.24661314 0.45858765 0.7080616 0.8434526 0.46829247 0.0329994 0.10844195 0.6812979 0.3505745 0.67980576 0.71404254 0.8574227 0.40939808 0.8668809 0.58524954 0.52820635 0.31366992 0.05352783 0.8875419 0.04600751 0.27407455 0.6398467 0.74402344 0.9710648 0.5717342 0.78711486 0.9209585 ]]

TensorFlow Layers

Since TensorFlow is a mess, there are always several ways to do the same thing, and it is not clear which one you should use, so no surprise that there is also another function to create embeddings.
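One such function (and likely the one behind the output below) is tf.contrib.layers.embed_sequence, which creates the embedding matrix and performs the lookup in a single call; a sketch:

embed = tf.contrib.layers.embed_sequence(words_ids,
                                         vocab_size=VOCAB_LEN,
                                         embed_dim=EMBED_SIZE)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(embed))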

[[ 0.11656719 -0.21488819 0.04018757 -0.18151578 -0.12417153 -0.00693065 0.27286723 0.00712651 -0.05931629 -0.20677638 0.14741448 -0.24938995 -0.21667814 0.09805503 0.2690411 0.20826831 0.19904876 0.08541816 0.20128882 0.15323257 -0.0386056 0.03025511 0.11573204 0.2161583 -0.02596462 -0.15845075 -0.26478297 -0.13366173 0.27797714 -0.08158416 -0.25292248 -0.16360758 -0.1846793 0.2444193 0.13292032 0.15807101 0.24052963 -0.0346185 0.02243239 0.2350963 -0.0260604 0.12481615 -0.1984439 0.20924723 -0.00630271 -0.26579106 0.04491454 0.10764262 0.170991 0.21768841] [-0.09142873 -0.25572282 0.2879894 -0.2416141 0.0688259 -0.06163606 0.2885336 -0.19590749 -0.04164416 0.28198788 0.18056017 -0.03718823 -0.09900685 0.14315534 -0.25260317 -0.00199199 -0.08959872 0.23495004 -0.18945126 -0.16665417 0.18416747 0.05468053 -0.23341912 0.02287021 0.27363363 0.07707322 -0.02453846 0.08111072 0.12435484 0.12095574 0.2879583 0.12930956 0.09152126 -0.2874632 -0.26153982 -0.10861655 -0.01751739 0.20820773 0.22776482 -0.17411226 -0.10380474 -0.14888035 0.01492503 0.24255303 -0.10528904 0.19635591 -0.22860856 0.2117649 -0.08887576 0.16184562]]

Write your own module

You can create your own custom class with the classic __init__ + __call__ paradigm to build it.
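A sketch of such a module, wrapping the ‘native’ lookup from before:

class Embedding:
    def __init__(self, vocab_len, embed_size):
        # create the embedding matrix once, when the module is built
        self.embeddings = tf.Variable(
            tf.random_uniform([vocab_len, embed_size]))

    def __call__(self, ids):
        # look up the rows for the given ids
        return tf.nn.embedding_lookup(self.embeddings, ids)

embed = Embedding(VOCAB_LEN, EMBED_SIZE)(words_ids)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(embed))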

[[0.19501674 0.954353 0.30957866 0.65923584 0.28241146 0.80623126 0.46677458 0.5877205 0.25624812 0.03041542 0.24185908 0.8056189 0.61915445 0.04368758 0.16852558 0.24910712 0.66250837 0.01929498 0.82387006 0.8489572 0.3970251 0.8156922 0.5550339 0.39991164 0.64657426 0.1980362 0.35962176 0.89992213 0.99705064 0.7636745 0.5627477 0.09286976 0.12509382 0.9644747 0.3412783 0.3238287 0.08844066 0.06885219 0.2377944 0.04519224 0.6535493 0.39360797 0.69070065 0.44310153 0.58286166 0.32064807 0.9180571 0.47852004 0.6686201 0.44279683] [0.0843749 0.77335155 0.14301467 0.23359239 0.77076364 0.3579203 0.95124376 0.03154683 0.11837351 0.622192 0.44682932 0.4268434 0.21531689 0.5922301 0.12666893 0.72407126 0.7601874 0.9128723 0.07651949 0.7025702 0.9072187 0.5582067 0.14753926 0.6066953 0.7564144 0.2200278 0.1666696 0.63408077 0.57941747 0.9417999 0.6540415 0.01334655 0.8736309 0.4756062 0.66136014 0.12366748 0.8578756 0.71376395 0.624522 0.22263229 0.35624254 0.00424874 0.1616261 0.43327594 0.83355534 0.51896024 0.53433514 0.47303247 0.7777432 0.4082179 ]]

Pretrained Embeddings

You can benefit from using pre-trained word embeddings, since they can improve the performance of your model. They are usually trained on enormous datasets, e.g. Wikipedia, with bag-of-words or skip-gram models.

There are several pre-trained embeddings available; the most famous are word2vec and GloVe.

You can read more about word embeddings and how to train them from scratch in this TensorFlow tutorial and this nice article.

In the past, loading pre-trained word embeddings into TensorFlow was not so easy. You had to download the matrix from somewhere, load it into your program, and maybe store it again. Now we can use TensorFlow Hub.

TensorFlow Hub

You can use pre-trained word embeddings easily with TensorFlow Hub: a collection of pre-trained modules that you can just import into your code. The full list is here.

Let’s see it in action.
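A sketch with the Wiki-words-250-with-normalization module (the same one reused below for fine-tuning); the input words are just an example:

import tensorflow_hub as hub

# download the module and embed a few words
embed = hub.Module(
    "https://tfhub.dev/google/Wiki-words-250-with-normalization/1")
embeddings = embed(['cat', 'dog', 'tree'])

with tf.Session() as sess:
    # hub text modules also need the table initializer
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings))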

By the way, TensorFlow Hub is buggy and does not work well in Jupyter, or at least not on my machine; you can try to run it and see if it works for you. In PyCharm I have no problems, so I will copy and paste the output.

If you have the same problem, please open an issue on the TensorFlow repository.

You can also re-train them by setting the trainable parameter of the hub.Module constructor to True. This is useful when you have a domain-specific text corpus and want your embeddings to specialize on it.

hub.Module("https://tfhub.dev/google/Wiki-words-250-with-normalization/1", trainable=True)

Conclusions

Now you should be able to create your word embedding layer efficiently in TensorFlow!

You may also find the following article interesting

Thank you for reading,

Francesco Saverio Zuppichini
