Identifying the right meaning of the words using BERT

Published in Towards Data Science · 5 min read · Oct 21, 2019

An important reason for using contextualised word embeddings is that standard embeddings assign a single vector to each word, even though many words have multiple meanings. The hypothesis is that using the context can solve the problem of squeezing the different meanings of such words (homonyms and homographs) into the same embedding vector. In this story, we will analyse whether BERT embeddings can be used to classify the different meanings of a word, to test whether contextualised word embeddings solve this problem.

Duck or duck — based on images from Pixabay

Building a dataset

The idea of using the word ‘duck’ comes from a tweet I saw earlier today. As words have multiple meanings, this type of misunderstanding is relatively common. According to the Merriam-Webster dictionary, the word ‘duck’ has 4 meanings. For simplicity, I categorised the verb ‘duck’ and the corresponding noun as the same meaning:

1. a) any of various swimming birds (family Anatidae, the duck family) in which the neck and legs are short, the feet typically webbed, the bill often broad and flat, and the sexes usually different from each other in plumage
   b) the flesh of any of these birds used as food
2. to lower the head or body suddenly
3. a) a durable closely woven usually cotton fabric
   b) ducks plural: light clothes and especially trousers made of duck

Misunderstanding the meaning of ‘duck’ — Tweet by natsmama75 on Twitter

To build a dataset, I used Your Dictionary’s sentence database, searching for sentences that include ‘duck’. From the examples given there, I excluded names, very long sentences and cases where I could not classify the meaning.

The final dataset is available on GitHub. It contains 77 sentences with the following distribution: 50 refer to the animal (type 0), 17 are forms of the verb (type 1) and 10 refer to the fabric (type 2).

The duck dataset

BERT embeddings

For the next part of this story, we will use the BERT embeddings from the original BERT base uncased model [1]. This means that, using the last hidden layer, we generate a 768-dimensional vector for every token.

We generate contextualised embedding vectors for every word depending on its sentence. Then, we keep only the embedding of the token that corresponds to the word ‘duck’. The question of the story is whether we can classify the different meanings using these 768-dimensional vectors.
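The extraction step described above could be sketched as follows. This is only an illustration, not the code from the story's notebook: it assumes the Hugging Face `transformers` library with the `bert-base-uncased` checkpoint, and the helper names `find_token_index` and `duck_embedding` are hypothetical. It also assumes that the first word piece starting with "duck" represents the word, which may not hold for every sentence.

```python
def find_token_index(tokens, target="duck"):
    """Return the index of the first token that starts with `target`.

    Assumption: the first word piece of 'duck', 'ducks', 'ducked', ...
    starts with the target string (continuation pieces start with '##').
    """
    for index, token in enumerate(tokens):
        if token.startswith(target):
            return index
    raise ValueError(f"no token starting with {target!r} found")


def duck_embedding(sentence, tokenizer, model, target="duck"):
    """Return the last-hidden-layer vector of the target token in `sentence`."""
    # Imported lazily so the pure-Python helper above works without PyTorch.
    import torch

    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Shape (sequence_length, 768) for a BERT base model.
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[find_token_index(tokens, target)].numpy()


def load_bert(name="bert-base-uncased"):
    """Load tokenizer and model (downloads the weights on first use)."""
    from transformers import BertModel, BertTokenizer

    return BertTokenizer.from_pretrained(name), BertModel.from_pretrained(name).eval()
```

With these helpers, `tokenizer, model = load_bert()` followed by `duck_embedding("The duck swam across the pond.", tokenizer, model)` should return one 768-dimensional vector per sentence; stacking them row by row would give a matrix like `duck_embs`.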

Principal Component Analysis (PCA)

PCA is an orthogonal transformation that we will use to reduce the dimension of the vectors [2,3]. PCA finds special basis vectors (eigenvectors) for the projection such that the variance of the reduced-dimension data is maximised. Using PCA has two important benefits. On the one hand, we can project the 768-dimensional vectors onto a 2D subspace where we can plot the data. On the other hand, although any projection loses information, PCA keeps the maximum possible variance, so it might keep enough of it for us to identify the classes in the picture.

Maximising variance for the first principal component — Equation from Wikipedia
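The equation image itself is not reproduced here. The standard formulation of the first principal component, which is presumably what the caption refers to, can be written as below, where x_(i) is the i-th mean-centred row of the data matrix X and w is a unit-length weight vector:

```latex
\mathbf{w}_{(1)}
  = \arg\max_{\lVert \mathbf{w} \rVert = 1}
    \left\{ \sum_i \left( \mathbf{x}_{(i)} \cdot \mathbf{w} \right)^2 \right\}
  = \arg\max_{\lVert \mathbf{w} \rVert = 1}
    \left\{ \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \, \mathbf{w} \right\}
```

The maximiser is the eigenvector of X^T X that belongs to the largest eigenvalue.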

Using PCA with sklearn.decomposition.PCA is a one-liner once the class is imported:
from sklearn.decomposition import PCA
duck_pca = PCA(n_components=2).fit_transform(duck_embs)
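To make the projection less of a black box, here is a minimal NumPy sketch of what a PCA projection computes. It is an illustration under stated assumptions: the data is random noise standing in for the real 77 x 768 embedding matrix, and `pca_project` is a hypothetical helper, not part of sklearn.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project `data` onto its first principal components via SVD."""
    centred = data - data.mean(axis=0)            # PCA works on mean-centred data
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]                # orthonormal basis vectors
    explained = singular_values**2 / np.sum(singular_values**2)
    return centred @ components.T, components, explained[:n_components]


# Assumption: random stand-in for the real embeddings (77 sentences x 768 dims).
rng = np.random.default_rng(0)
fake_embs = rng.normal(size=(77, 768))

projected, components, explained_ratio = pca_project(fake_embs)
print(projected.shape)      # one 2D point per sentence
print(explained_ratio)      # share of the variance kept by each component
```

The explained variance ratio shows how much information survives the projection; on the real embeddings it indicates how far the 2D picture can be trusted.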

The figure below shows the result of the projection using the first two principal components. The classes are manually annotated types. As we can see, using PCA we can easily separate the verb type from the others.

Duck meaning types. Projection using PCA

Nearest Neighbour Classification

Because of the relatively small size of the dataset, we will use a k-NN classifier instead of a neural network [4]. A k-NN classifier uses the k closest samples to predict the class of a new example. Because we have only 10 samples in the third class, we need to use k<20: k-NN selects the most represented class in the neighbourhood, so with a larger k the fabric samples could be outvoted even when all of them are among the neighbours.

For the validation, we will use LOOCV (Leave One Out Cross-Validation) [5]. This means that for every sample, we build a model based on the other 76 samples and validate it with the single held-out sample. Then, we calculate the accuracy over these 77 successful or unsuccessful predictions.

Using sklearn, these steps are readily available and easy to execute:

from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier

loocv = model_selection.LeaveOneOut()
model = KNeighborsClassifier(n_neighbors=8)
results = model_selection.cross_val_score(model, X, Y, cv=loocv)
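For readers who want to try the procedure without the real embeddings, the sketch below runs the same k-NN and LOOCV steps on synthetic data. The class sizes (50, 17, 10) match the dataset, but the features are an assumption: three artificial, well-separated clusters in 16 dimensions, so the resulting accuracy says nothing about BERT.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Same class sizes as the duck dataset: animal (0), verb (1), fabric (2).
labels = np.array([0] * 50 + [1] * 17 + [2] * 10)

# Assumption: synthetic features, one well-separated cluster per class.
rng = np.random.default_rng(0)
centres = 10.0 * np.eye(3, 16)
features = centres[labels] + rng.normal(size=(77, 16))

# Each of the 77 folds trains on 76 samples and scores the held-out one,
# so every entry of `scores` is either 0.0 (wrong) or 1.0 (correct).
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=8), features, labels, cv=LeaveOneOut()
)
print(len(scores), scores.mean(), scores.std())
```

Because each fold scores a single sample, the mean of `scores` is the accuracy and its standard deviation only reflects how the 0/1 outcomes are spread.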

The LOOCV evaluation gives an accuracy of 92.208% with a standard deviation of 26.805% over the 77 per-sample scores.
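These two numbers are related: in LOOCV every score is either 0 or 1, so the standard deviation follows directly from the accuracy. The short check below assumes 71 correct predictions out of 77, a count inferred from the reported accuracy rather than taken from the notebook.

```python
from statistics import fmean, pstdev

n_samples = 77
n_correct = 71  # assumption: inferred from 92.208% of 77 predictions

# One score per left-out sample: 1.0 if predicted correctly, else 0.0.
scores = [1.0] * n_correct + [0.0] * (n_samples - n_correct)

accuracy = fmean(scores)   # mean of the 0/1 scores
spread = pstdev(scores)    # population standard deviation, sqrt(p * (1 - p))
print(f"accuracy = {accuracy:.3%}, standard deviation = {spread:.3%}")
```

The large standard deviation is therefore a property of binary per-sample scores, not a sign of an unstable classifier.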

Comparison with Google Translate

My native language, Hungarian, uses different words for the different meanings of ‘duck’. Therefore, using the English-Hungarian translator, we can see which meaning the neural network behind the translator (called GNMT [6]) assigns to ‘duck’.

Google Translate English-Hungarian translation of the ‘duck’ sentences

The table above shows that all the bird-type ‘duck’ words and 8 out of the 17 verb-type words are translated with the correct meaning; however, all the fabric ‘duck’ words were recognised as birds. This result corresponds to a 75.641% accuracy.

Summary

In this story, we showed that contextualised word embeddings can successfully address the problem of multiple-meaning words. The experiments support the hypothesis that BERT embeddings store the different meanings of words, as we could reconstruct the types using a k-NN classifier.

All the corresponding code is available on Google Colab.

References

[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[2] Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.

[3] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6), 417.

[4] Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1), 21–27.

[5] Lachenbruch, P. A., & Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10(1), 1–11.

[6] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … & Klingner, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
