
Embeddings, Embeddings Everywhere!

Food for thought on the incredible versatility of word embeddings, with five unconventional domains of application


What is a Word Embedding?

The concept of embedding is arguably one of the most fascinating ideas in modern machine learning. If you have ever used a voice assistant, googled something, or followed a Spotify artist recommendation, chances are you have run into embeddings without even being aware of it.

Even though we assume some familiarity with the concept of word embeddings here, let’s start with a quick recap of what it means and why we care about this technique. And if you need a broader overview of word embeddings, I strongly recommend Jay Alammar’s awesome blog.

A word embedding is any of a set of NLP techniques that maps plain text into a continuous, low-dimensional vector space where the semantic properties of human language are preserved in the form of mathematical operations.

In other words, embeddings convert words into small dense vectors that still carry their meaning and can be used to perform linguistic analyses.

For instance, the words man and woman, which are semantically similar, will still be similar in the embedding space. The concept of similarity in the embedding space is expressed through proximity and can be evaluated with scalar products and vector sums.
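To make the idea of similarity-as-proximity concrete, here is a minimal sketch with made-up three-dimensional vectors (real embeddings typically have hundreds of dimensions); the words and numbers are purely illustrative.

```python
import numpy as np

# Toy 3-dimensional "embeddings", just to illustrate the idea:
# semantically related words get nearby vectors.
vectors = {
    "man":    np.array([0.8, 0.1, 0.3]),
    "woman":  np.array([0.7, 0.2, 0.3]),
    "banana": np.array([0.0, 0.9, 0.6]),
}

def cosine_similarity(a, b):
    """Scalar product normalized by vector lengths: 1 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["man"], vectors["woman"]))   # high: close in space
print(cosine_similarity(vectors["man"], vectors["banana"]))  # lower: further apart
```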

Being able to cleverly transpose human language into a mathematical space allows machine learning algorithms to leverage text for their predictions and unlocks infinite opportunities in many fields of knowledge.

This technique is incredibly versatile, to the point that the information learned and encoded with word embeddings can be easily passed from one task to another using transfer learning. The most common ways to embed words and documents nowadays are based on neural networks, like Word2Vec (Mikolov et al., 2013) and BERT (Devlin et al., 2019).
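As a quick illustration of how easily such pretrained representations can be reused, the sketch below loads publicly available GloVe vectors through gensim’s downloader and queries them directly; the specific model name and library are my choice here, and Word2Vec or BERT embeddings could be swapped in.

```python
import gensim.downloader as api

# Download pretrained GloVe vectors (trained on Wikipedia) once and reuse them:
# transfer learning in its simplest form.
model = api.load("glove-wiki-gigaword-50")

print(model.similarity("man", "woman"))                # semantic similarity
print(model.most_similar(positive=["king", "woman"],
                         negative=["man"], topn=3))    # vector arithmetic
```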

We will now investigate this versatility by showing embeddings applications from five completely different domains.


1. Healthcare

Up to 70% of meaningful information for medical registries, outcomes researchers, and clinicians is held within medical notes in Electronic Health Records (Lin et al., 2013). These clinical texts have little to no standardization in terms of content, format, and quality, which makes them really hard to handle.

One particular problem with clinical text is the presence of complex, highly specific words that are prone to misspelling. Traditional frequency-based text mining methods can struggle with these terms, whereas prediction-based techniques like neural word embeddings handle them much more gracefully.

Word embeddings convey this information in a format that is digestible for traditional predictive models, leveraging clinical notes to achieve better performance on many downstream tasks like automatic code assignment for medical discharge summaries (Schafer et al., 2019) or mortality prediction (Payrovnaziri et al., 2019).
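As a toy illustration of this pipeline (the notes, outcomes, and model choices below are invented, not a real clinical dataset), one could average the word vectors of each note into a fixed-length feature vector and feed it to a standard classifier:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Tiny, made-up "discharge summaries" and outcomes, for illustration only.
notes = [
    "patient stable discharged home oral antibiotics",
    "severe sepsis transferred icu vasopressors started",
    "routine follow up wound healing well",
    "cardiac arrest resuscitation prolonged ventilation",
]
outcome = [0, 1, 0, 1]  # e.g. 1 = adverse outcome

# Train a small Word2Vec model on the (toy) clinical corpus.
tokenized = [n.split() for n in notes]
w2v = Word2Vec(tokenized, vector_size=20, min_count=1, seed=0)

def note_vector(tokens):
    """Average the word vectors of a note into one fixed-length feature vector."""
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

X = np.stack([note_vector(t) for t in tokenized])
clf = LogisticRegression().fit(X, outcome)
print(clf.predict(X))
```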


2. Finance

Text classification is one of the most important business challenges nowadays, with applications in financial forecasting, banking, and corporate finance, where unstructured textual data have been growing rapidly (Lewis et al., 2019).

The textual part of an annual report contains richer information than the financial ratios (Kloptchenko et al., 2004). Word embeddings can then help retrieve this information, providing useful insights for data-driven decision-making and helping to minimize risks in the financial market.

In practical terms, the notion of similarity between words carried by word embeddings can enable more accurate sentiment analysis of financial reports, for instance for stock return classification (Yeh et al., 2019).
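For example, one might use nearest neighbors in a pretrained embedding space to expand a small hand-made financial sentiment lexicon; the seed words below are arbitrary and the GloVe model is just a convenient stand-in for embeddings trained on financial text:

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# Expand a tiny hand-made financial sentiment lexicon with nearest neighbours
# in the embedding space (seed words and usage are illustrative only).
seed_positive = ["profit", "growth"]
seed_negative = ["loss", "debt"]

print(model.most_similar(positive=seed_positive, topn=5))
print(model.most_similar(positive=seed_negative, topn=5))
```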


3. Music Recommendation

Modern streaming platforms are very focused on recommending content. Spotify, for instance, has always stood out for its great ability to provide personalized radios or recommended playlists like Discover Weekly, where the main task is generating sequences of similar songs that go well together and match your taste.

Observing global song preferences across all users can help us understand whether users who like Metallica would also like Britney Spears (spoiler: probably not), but it can’t tell much about local co-occurrences, i.e. which songs are frequently played together in similar contexts (Ramakrishnan et al., 2017).

Embeddings are very good at encoding information about the meaning of words in relation to the context in which they appear. Replacing words with songs, we can use the same neural model to learn song embeddings from users’ listening queues.

We can then average the embeddings of the last songs played by the user to retrieve a user embedding, and look at the k-nearest neighbors to populate our suggested playlist.
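Here is a minimal sketch of that idea with gensim’s Word2Vec; the song IDs and listening queues are made up, and a real system would train on millions of sessions:

```python
import numpy as np
from gensim.models import Word2Vec

# Made-up listening queues: each queue is a "sentence" and each song ID a "word".
queues = [
    ["metallica_one", "iron_maiden_trooper", "slayer_raining_blood"],
    ["britney_toxic", "rihanna_umbrella", "katy_perry_firework"],
    ["metallica_one", "slayer_raining_blood", "pantera_walk"],
    ["britney_toxic", "katy_perry_firework", "lady_gaga_poker_face"],
]

# The same skip-gram model used for words, just applied to song IDs.
model = Word2Vec(queues, vector_size=16, window=2, min_count=1, sg=1, seed=0)

# "User embedding" = average of the last songs the user played.
last_played = ["metallica_one", "pantera_walk"]
user_vec = np.mean([model.wv[s] for s in last_played], axis=0)

# k-nearest neighbours in the song space become the suggested playlist.
print(model.wv.similar_by_vector(user_vec, topn=3))
```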


4. Advertising

Embeddings have the power to preserve meaning, i.e. semantic relationships and similarities among words. Semantic similarity is preserved through proximity, therefore similar content will be clustered together in the embedding space.

This can be very useful in advertising and sponsored search.

Sponsored search represents a major source of revenue for web search engines. When a web user communicates his/her intent through a search query, advertising services need to intercept this intent and provide coherent ads. However, this can be challenging with advanced or ambiguous queries.

Embedding techniques allow learning query and ad representations in a low-dimensional space where queries are close to related ads. The training data consists of user activity logs, including search queries, ads clicked, and so on (Grbovic et al., 2016).
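A minimal sketch of this idea, assuming we simply treat each session of queries and clicked ad IDs as a “sentence” (the logs below are invented):

```python
from gensim.models import Word2Vec

# Made-up activity logs: each session mixes search queries and clicked ad IDs,
# so queries end up close to the ads they co-occur with.
sessions = [
    ["cheap flights rome", "AD_travel_123", "rome hotels"],
    ["rome hotels", "AD_hotel_456", "colosseum tickets"],
    ["running shoes", "AD_sport_789", "marathon training plan"],
]

model = Word2Vec(sessions, vector_size=16, window=3, min_count=1, sg=1, seed=0)

# Ads closest to a query in the shared space are candidates to show the user.
print(model.wv.most_similar("rome hotels", topn=2))
```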

5. Proteomics

Embeddings are used for biological sequences as well. It should be clear at this point that we don’t really need words for word embeddings: any type of information expressed through sequences of symbols can benefit from embedding techniques.

For instance, nature uses certain languages to describe biological sequences such as DNA, RNA, and proteins, which are just amino-acid sequences.

With minimal changes to Word2Vec or any other word embedding model, we can also learn low-dimensional embeddings for proteins, replacing words with amino acids (Asgari et al., 2015). These protein embeddings can be applied to a wide range of bioinformatics tasks, such as protein visualization, protein family classification, structure prediction, and interaction prediction.
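A minimal sketch, assuming we follow the ProtVec-style choice of splitting each sequence into overlapping 3-mers and treating those as words (the sequences below are shortened, made-up examples):

```python
from gensim.models import Word2Vec

# A couple of shortened, made-up protein sequences.
proteins = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSRSLLLRFLLFLLLLPPLP"]

def to_kmers(sequence, k=3):
    """Split an amino-acid sequence into overlapping k-mers ('biological words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

corpus = [to_kmers(p) for p in proteins]

# The very same Word2Vec model, with 3-mers of amino acids playing the role of words.
model = Word2Vec(corpus, vector_size=16, window=5, min_count=1, sg=1, seed=0)
print(model.wv.most_similar("MKT", topn=3))
```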


Conclusion

Hopefully, this post has been able to highlight the wide variety of fields of knowledge that can benefit from embedding techniques. Please leave your thoughts in the comments section and share if you find this helpful!

If you want to see word embeddings from a different angle, check out my post about social biases and gender inequalities in NLP.

Man is to Doctor as Woman is to Nurse: the Dangerous Bias of Word Embeddings



