A primer on word embeddings

Words into Vectors

Concepts for word embeddings

Jon Gimpel
Towards Data Science
13 min read · Jul 26, 2022


printing press letters, a coffee cup, and aligned colored pencils
Photos by Amador Loureiro, Mukul Wadhwa, Tamanna Rumee on Unsplash

This article is the 2ⁿᵈ in the series A primer on word embeddings:
1. What’s Behind Word2vec | 2. Words into Vectors |
3. Statistical Learning Theory | 4. The Word2vec Classifier |
5. The Word2vec Hyperparameters | 6. Characteristics of Word Embeddings

The first article, What’s Behind Word2vec, introduced the series of articles and set the stage for word embeddings. In this article, we’ll review the foundational NLP concepts that led to modern word embeddings.

Introduction to Data-Driven NLP

While linguistics, the study of language, has been contemplated since the earliest human writings, and Natural Language Processing (NLP) has been developed since the earliest computers, it wasn’t until the 1990s that the “statistical revolution” in NLP took shape (Johnson, 2009). This era marked a change in NLP’s focus from symbolic or rules-based algorithms, such as grammar or syntax, to data-driven techniques that combined machine learning on more powerful computers with large corpora available via the Internet.

According to Nivre (2002), statistics is involved in NLP in three main ways:

  1. Application: Where a deterministic algorithm is applied to a stochastic model
  2. Acquisition: Where models are inferred from empirical data
  3. Evaluation: Which includes descriptive statistics, estimation, and hypothesis testing

The “statistical revolution” that Johnson (2009) discusses is principally about acquisition methods.

Next, we’ll review the statistical acquisition methods that led to the application of machine learning techniques to NLP.

Words hold the primary meaning in a written document. Although punctuation, capitalization, symbols, spelling, layout, formatting, fonts/handwriting, and illustrations also convey meaning, and can even be encoded as vectors or integrated into word vectors, most of the value is in the words and their sequencing.

Frequency-Based Matrices

The idea of using vectors to represent words in computer language processing dates back to at least 1975, when Salton et al. (1975) published foundational work introducing the concept of a vector space model. The simplest vector space model is a one-hot vector, in which a 1 is placed in one cell of a 1×v vector to mark the presence of a particular word from a vocabulary of size v, with all other cells set to 0.

One-hot Word Vector Example (image by author)
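
As a minimal sketch of the idea (the four-word vocabulary is hypothetical, and NumPy is assumed), a one-hot vector simply marks a single position in a length-v array:

```python
import numpy as np

# A tiny, hypothetical vocabulary of size v = 4.
vocab = ["cat", "dog", "bird", "fish"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a 1 x v one-hot vector: a 1 in the word's cell, 0s elsewhere."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("bird"))  # [0 0 1 0]
```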

We will introduce the following three additional examples of how words can be represented by vectors for use in computer language processing (NSS, 2017):

  1. Word Count
  2. Term Frequency–Inverse Document Frequency (TF-IDF)
  3. Co-occurrence

Language modeling techniques using these vector space models are known as count-based approaches because they are built from sums of global word-occurrence counts (Almeida and Xexéo, 2019; Levy et al., 2015).

Word Count

When collecting word data for distributional word representations, one can begin with a simple count of the words in a set of documents. The number of times each word appears in each document forms a count vector. For example:

Word Count Example
(image by author, inspired by Potts and MacCartney, 2019)

The resulting count matrix is a basic descriptive statistic of the words as they appear in the documents. It has dimension v×d, where v is the number of words in the vocabulary and d is the number of documents. Each row of this sparse matrix can be treated as a word vector of length d.
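
As a minimal sketch of building such a v×d count matrix (the three toy documents are hypothetical, and NumPy is assumed):

```python
import numpy as np

# Three tiny, hypothetical documents.
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
word_to_index = {w: i for i, w in enumerate(vocab)}

# Build the v x d count matrix: one row per word, one column per document.
counts = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(tokenized):
    for word in doc:
        counts[word_to_index[word], j] += 1

print(vocab)
print(counts)  # each row is a count vector of length d for one word
```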

Term Frequency–Inverse Document Frequency (TF-IDF)

A more significant descriptive statistic for a document is the relative frequency of a word in that document, that is, how often a word appears in the document relative to how often it appears in the entire corpus. This value is useful when determining how important a word is within a document, for example for ranking results in a text-based search engine.

A typical way to calculate this relative frequency value is known as Term Frequency–Inverse Document Frequency (TF-IDF), defined as the product TF×IDF. Various formulas exist for each factor, but the following two are most typically used (Voita, 2020):

TF and IDF equations

where t is the term or word, d is the document, n is the number of occurrences of t in d, N is the number of occurrences of t in all documents, D is the total number of documents, and |{d ∈ D : t ∈ d}| is the number of documents in which t occurs.

There are other variations of these formulas for both the TF and IDF statistics. A notable aspect of the IDF formula is the consistent use of log across the variations, which has theoretical justification in information theory (Robertson, 2004), although that justification was formalized only after TF-IDF had come into wide use for Internet search.

A take-away here is that the TF-IDF value has been extremely useful in large part because it is a reweighting of the raw statistics.

The TF-IDF matrix has the same dimensions as the count matrix: v×d. Each row of the matrix can be used as a word vector of length d.
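
As a minimal sketch of reweighting a count matrix with the variant suggested by the definitions above (TF = n/N and IDF = log of D over the number of documents containing the term; the small count matrix is hypothetical, and other TF-IDF variants exist):

```python
import numpy as np

# A small, hypothetical v x d count matrix (rows = words, columns = documents),
# such as the one built in the word-count sketch above.
counts = np.array([[2, 0, 1],
                   [0, 1, 2],
                   [1, 1, 0],
                   [3, 3, 2]])

D = counts.shape[1]                           # total number of documents
N = counts.sum(axis=1, keepdims=True)         # occurrences of each word in all documents
df = (counts > 0).sum(axis=1, keepdims=True)  # documents in which each word occurs

tf = counts / N                               # term frequency
idf = np.log(D / df)                          # inverse document frequency
tfidf = tf * idf                              # v x d TF-IDF matrix

print(np.round(tfidf, 3))
```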

Co-occurrence

Now, let’s look at an example of a co-occurrence matrix. The figure below is a table that shows how often pairs of words appear together in the same document across a corpus. This table can also be considered a matrix of size v×v that is independent of the number of documents. Because the matrix quantifies word proximity, its values point towards word meaning according to the distributional hypothesis in linguistics (Firth, 1957).

Co-occurrence Matrix Example
(image by author, inspired by Potts and MacCartney, 2019)

Each row of this matrix can be used as a word vector of length v, where v is the size of the vocabulary. The matrix is a model of word proximity and is denser than the word count matrix shown previously.

But a co-occurrence matrix need not be computed only over whole documents. One could look instead at words paired within paragraphs, sentences, or a window of a certain number of words. It turns out that setting a relevant window size is quite important. We’ll discuss window size in the section below titled “Experimental Design Considerations”.
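
As a minimal sketch of a window-based co-occurrence matrix (the toy corpus is hypothetical, and the ±2 window is an arbitrary choice for illustration):

```python
import numpy as np

# Two tiny, hypothetical tokenized documents.
docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "chased", "the", "cat"]]

vocab = sorted({w for doc in docs for w in doc})
idx = {w: i for i, w in enumerate(vocab)}
window = 2  # count words appearing within +/- 2 positions of each other

cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in docs:
    for i, word in enumerate(doc):
        lo, hi = max(0, i - window), min(len(doc), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[idx[word], idx[doc[j]]] += 1

print(vocab)
print(cooc)  # symmetric v x v matrix; row i is a word vector for vocab[i]
```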

Still, a co-occurrence matrix is not as compact as would be ideal: much of its information is redundant, so one can apply dimensionality reduction techniques without losing significant information, as we’ll see in the section below titled “Dimensionality Reduction”.

There are many ways to form word count vectors, depending on what’s being modeled. For context, a few more count vectors are: word × discourse context, phonological segment × feature values, and word × syntactic context (Potts and MacCartney, 2019).

Reweightings

We previously discussed TF-IDF, where the IDF factor added a weighting to the raw word counts. TF-IDF values are an example of reweighting, which can be used to concentrate information into a word vector.

Reweighting in a vector space model typically involves an adjustment for the frequency of words, as we saw with TF-IDF. With natural language, raw word counts are heavily shaped by the fact that word frequency tends to follow Zipf’s law, that is, the relationship is nearly linear when plotting log(rank) vs. log(frequency).

Zipf’s Law in Word Frequency of Wikipedia Corpora
(by Sergio Jimenez via Wikipedia, CC BY-SA 4.0)
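
As a minimal sketch of checking this on a corpus (the toy token list is hypothetical; on a real corpus the fitted slope is typically close to −1):

```python
import numpy as np
from collections import Counter

# A tiny, hypothetical token list; in practice this would be a large corpus.
tokens = "the cat sat on the mat and the dog sat on the log".split()

freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# Fit a line to log(frequency) vs. log(rank); Zipf's law predicts a slope near -1
# on a real corpus (a toy corpus like this one is only illustrative).
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(round(slope, 2))
```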

A challenge of reweighting is to ensure that information for infrequent words is adequately represented without introducing anomalies from the amplification of outlier data. Examples of reweighting formulas for count-based vectors include (Potts and MacCartney, 2019):

  1. Normalization: Euclidean or L² norming (L2 norm) of the vector values
  2. Probability: Representing vector values as a probability P(d|t), summing to 1
  3. Observed/Expected: O/E and associated χ² or G-test statistics
  4. TF-IDF: See the section above, noting that variants of TF-IDF consider the empirical distribution of words
  5. PMI: Pointwise Mutual Information
  6. PPMI: Positive Pointwise Mutual Information

The reweighting formulas above apply to word count vectors where t is the word and d is the document.

Which reweighting formula to use depends on the application. In the Natural Language Understanding course, Potts refers to PMI as the “hero of the story” (Potts and MacCartney, 2019) because it tends to reveal insights in natural language processing. The formula for PMI is:

PMI equation, uses log base 2

where P(t,d) is the probability of word t and document d occurring together, P(t) is the probability of the word occurring, and P(d) is the probability of the document occurring. PMI quantifies how much more (or less) likely two events are to occur together than they would be if they were independent. Since PMI is difficult to estimate reliably when the individual probabilities are very small, PPMI is used more often (Jurafsky and Martin, 2019). PPMI replaces negative PMI values with 0.
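
As a minimal sketch of turning a count matrix into PPMI values (the counts are hypothetical; the same reweighting applies to word × document or word × word co-occurrence counts):

```python
import numpy as np

# A small, hypothetical non-negative count matrix (e.g., word x document counts).
counts = np.array([[10, 0, 3],
                   [ 0, 8, 1],
                   [ 2, 1, 6]], dtype=float)

total = counts.sum()
p_td = counts / total                  # joint probabilities P(t, d)
p_t = p_td.sum(axis=1, keepdims=True)  # marginal P(t)
p_d = p_td.sum(axis=0, keepdims=True)  # marginal P(d)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_td / (p_t * p_d))  # PMI, log base 2 (zero counts give -inf)

ppmi = np.maximum(pmi, 0)              # PPMI: replace negative values (and -inf) with 0
print(np.round(ppmi, 3))
```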

Experimental Design Considerations

Window scaling is a concept related to reweighting that has to do with how the data is gathered. In the co-occurrence matrix example, the co-occurrence of words is measured and recorded for each document. If the scope of co-occurrence is changed to sections or paragraphs instead of the entire document, the resulting matrix is different. Context windows are, in fact, commonly measured as a certain number of words, such as ±10. Whether words are counted when they appear in separate sentences, and how punctuation is handled, are other considerations.

An important aspect of counting words to create word vectors is the criteria used to pre-process the documents. For example:

  • Should punctuation be included or removed?
  • Should capitalization be removed so words are considered the same even though capitalizing some words can change their meaning?
  • Should a named-entity recognition script be run to classify names into pre-defined categories, such as person names, business names, and street names?
  • Should parts of speech be tagged so that the verb and noun forms of a word, such as ‘run’, can be distinguished?
  • Should verbs be modified so that all tenses of a verb are treated the same (e.g., lemmatization or stemming)? (In English, this is often done by removing endings such as ‘ed’.)
  • What should be done with misspellings and rare words?
  • Should very common words, such as ‘the’ and ‘a’, be removed?
  • Should emojis be included? 🤔

These types of questions need to be answered when deciding whether and how to pre-process documents to prepare them for the word embedding training.
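
As a minimal sketch of how a few of these choices might be exposed as options in a simple tokenizer (the flags and the tiny stop-word list are hypothetical; lemmatization, POS tagging, and named-entity recognition would require a full NLP library):

```python
import re

def preprocess(text, lowercase=True, strip_punct=True, remove_stopwords=False):
    """A toy tokenizer in which each flag mirrors one of the questions above."""
    stopwords = {"the", "a", "an"}  # a tiny, hypothetical stop-word list
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(preprocess("The cat sat on the mat."))
print(preprocess("The cat sat on the mat.", remove_stopwords=True))
```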

Something as simple as where a word begins and ends is not always clear, either. In English, spaces generally separate words, but hyphens aid in creating compound words. In German, on the other hand, compound words are unhyphenated and can get quite long, as for example in the spelling out of the number 123,456:

einhundertdreiundzwanzigtausendvierhundertsechsundfünfzig

Furthermore, is it worthwhile to move away from ‘words’ as the basic unit of measurement for a vector altogether, and instead consider groupings of letters (character n-grams) to find patterns of meaning?
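
As a minimal sketch of that idea (the boundary markers ‘<’ and ‘>’ and the choice of n = 3 are illustrative assumptions):

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with boundary markers added."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("running"))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']
```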

Making appropriate decisions about how to pre-process documents is an important aspect of the experimental design that depends on the application and language model being built.

Distance Measures

Once the word vectors are designed and created from the data, how should one measure their similarity? That is, if a language model predicts a word with a certain vector estimation, how should one discover the nearest word, given its vector in the data set?

Euclidean distance is the geometric standard for vectors. Its equation is:

Euclidean distance equation, often learned as a squared plus b squared equals c squared

where a and b are two vectors of size n. However, in a vector space model that records word frequencies, vectors for more frequent words tend to have greater magnitude, which often hinders word comparison (Jurafsky and Martin, 2019). When only the direction of the vectors matters, one can use cosine similarity, which is the most common metric used to measure the similarity of word embeddings (Jurafsky and Martin, 2019; Turney and Pantel, 2010). The equation for cosine similarity is:

cosine similarity equation, a dot b over norm a times norm b

Here, the dot symbol indicates the dot product of the vectors (in bold), and the double vertical lines indicate the L² norm of each vector, used in the denominator to normalize for magnitude. The values range from −1, meaning exactly opposite, to 1, meaning exactly the same direction.

The following figure shows how Euclidean distance and cosine similarity can contradict one another when determining a closest neighbor.

three points in 2D space with lines illustrating Euclidean and cosine distance measures
Euclidean Distance and Cosine Similarity (image by author)

Note that if the vectors are first centered by subtracting their means, their cosine similarity becomes the Pearson correlation coefficient. Like cosine similarity, the Pearson correlation coefficient ranges from −1 to 1, but it is a measure of the degree of linear relationship and is invariant to location and scale.

To convert the cosine similarity into a ‘distance’ measure, where farther away is indicated by a larger number, one might consider using:

cosine ‘distance’ equation

But note that this measure is not a ‘proper’ distance measure. According to Potts and MacCartney, a true distance must be symmetric, must be 0 for identical vectors, and must satisfy the triangle inequality (Potts and MacCartney, 2019). So, cosine similarity should be converted to angular distance, which ranges from 0 to 1:

angular distance equation
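
As a minimal sketch comparing these measures (NumPy is assumed, and the two word-count vectors are hypothetical; angular distance is computed here as the arccos of cosine similarity divided by π, as described above):

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)  # simple, but not a 'proper' distance

def angular_distance(a, b):
    cos = np.clip(cosine_similarity(a, b), -1.0, 1.0)
    return np.arccos(cos) / np.pi         # ranges from 0 to 1

a = np.array([3.0, 4.0])    # a hypothetical word-count vector
b = np.array([30.0, 41.0])  # nearly the same direction, much larger magnitude

print(euclidean(a, b))          # large, because the magnitudes differ
print(cosine_similarity(a, b))  # close to 1, because the directions nearly agree
print(cosine_distance(a, b))    # close to 0
print(angular_distance(a, b))   # close to 0
```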

Additional distance measures, whether proper or not, that have been used in the literature for word vectors are (Potts and MacCartney, 2019):

  • Matching-based methods related to the intersection, including: Matching, Jaccard, Dice, and Overlap
  • Kullback-Leibler (KL) divergence methods related to Fisher information to derive probabilities, including: KL divergence, Symmetric KL and Jensen-Shannon divergence, and KL with skew

Which distance measure to choose depends on what needs to be emphasized for the application. In statistical NLP, that choice often involves more trial and error than statistical purity.

Dimensionality Reduction

Visualizations help our understanding of vector relationships by reducing high-dimensional data down to two or three dimensions that visually show the dominant relationships. In statistics, one often uses Principal Component Analysis (PCA), in which the vectors are mapped linearly into a two-dimensional plane chosen so that the variance of the data is maximized.

t-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization that was published in 2008 by Van der Maaten and Hinton with the goal of “capturing much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales” (Van der Maaten and Hinton, 2008). It’s common to use t-SNE to visualize a small set of word relationships or to get a high-level view of a large set of them.

PCA and t-SNE are dimensionality reduction techniques that are well known for visualizing vectors in two or three dimensions, but they can also be used to reduce a word vector to any smaller number of dimensions.
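
As a minimal sketch of both uses (scikit-learn is assumed, and the random 100-dimensional vectors stand in for real word vectors):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.random((50, 100))  # 50 hypothetical word vectors of dimension 100

# Reduce to 2 dimensions for visualization (PCA can also target any smaller dimension).
pca_2d = PCA(n_components=2).fit_transform(vectors)
tsne_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(vectors)

print(pca_2d.shape, tsne_2d.shape)  # (50, 2) (50, 2)
```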

Additional dimensionality reduction techniques for the vector space model include (Potts and MacCartney, 2019):

  1. Singular value decomposition (SVD)
  2. LSA (also known as LSI), which uses truncated SVD (see the sketch after this list)
  3. Non-negative matrix factorization (NMF)
  4. Probabilistic LSA (PLSA)
  5. Latent Dirichlet Allocation (LDA)
  6. Autoencoders using neural networks
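
As a minimal sketch of item 2 above, reducing a count matrix with truncated SVD in the style of LSA (scikit-learn is assumed, and the random count matrix stands in for a real v×d matrix):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(1000, 200)).astype(float)  # hypothetical v x d counts

# Keep only the top 50 singular vectors of the (possibly reweighted) count matrix.
lsa = TruncatedSVD(n_components=50, random_state=0)
word_vectors = lsa.fit_transform(counts)  # 1000 dense word vectors of length 50

print(word_vectors.shape)  # (1000, 50)
```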

Dimensionality reduction techniques led to machine learning prediction-based word embeddings, the most notable early method being Word2vec, which we’ll discuss in detail in future articles after an introduction to statistical learning theory.

Summary

In this article, we learned how to use vectors to quantify word proximity relationships in text, how to measure those relationships, and how to make the data more usable with dimensionality reduction and machine learning techniques.

In the next article in this series, Statistical Learning Theory, we’ll review the mathematical theory for understanding a shallow neural network classifier, such as Word2vec.

This article was the 2ⁿᵈ in the series A primer on word embeddings:
1. What’s Behind Word2vec | 2. Words into Vectors |
3. Statistical Learning Theory | 4. The Word2vec Classifier |
5. The Word2vec Hyperparameters | 6. Characteristics of Word Embeddings

More on this Topic: A resource I recommend for learning more about the foundations of word embeddings is this online computer science class at Stanford University: Potts, C. and MacCartney, B. (2019). CS224U Natural Language Understanding.

References

Almeida, F. and Xexéo, G. (2019). Word Embeddings: A Survey. Available at arXiv:1901.09069v1.

Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955. In Firth (ed), Studies in Linguistic Analysis, Special Volume of the Philological Society, pages 1–32. Oxford, England: Basil Blackwell Publishing.

Johnson, M. (2009). How the Statistical Revolution Changes (Computational) Linguistics. Proceedings of the European Chapter of the Association of Computational Linguistics 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious, or Vacuous?, pages 3–11. PDF.

Jurafsky, D. and Martin, J. (2019). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Third Edition, 2019 draft.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. In Transactions of the Association for Computational Linguistics, 3:211–225. Available at doi 10.1162/tacl_a_00134.

Nivre, J. (2002). On Statistical Methods in Natural Language Processing. In Bubenko, J. and Wangler, B. (eds) Promote IT. Second Conference for the Promotion of Research in IT at New Universities and University Colleges in Sweden, pages 684–694. University of Skövde.

NSS (2017). An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec. Analytics Vidhya.

Potts, C. and MacCartney, B. (2019). CS224U Natural Language Understanding, Online computer science course. Stanford, California: Stanford University.

Robertson, S. (2004). Understanding Inverse Document Frequency: On Theoretical Arguments for IDF. Journal of Documentation, 60(5):503–520. Available at doi 10.1108/00220410410560582.

Salton, G., Wong A., and Yang, C.S. (1975). A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613–620. Available at doi 10.1145/361219.361220.

Turney, P. D. and Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188. PDF.

Van der Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605. PDF.

Voita, L. (2020). Natural Language Processing Course: Word Embeddings. GitHub.

*Figures and images are by the author, unless otherwise noted.

