
Large Language Models (LLMs) have been all the rage lately, with the introduction of OpenAI’s ChatGPT.
The majority of its 100m+ users are using Chat/Text Completions to make their daily lives easier. However, a little-known API provided by OpenAI called Embeddings is truly transformational in how we do search, clustering and anomaly detection.
What is an Embedding?
An Embedding (also called a Vector Embedding) is a vector of numbers providing a mathematical representation of a word, sentence or document. The vector captures the semantic meaning and context of the text.
The benefit of Vector Embeddings is that they allow us to compare and analyse words and phrases mathematically, enabling tasks such as natural language processing, text classification, and information retrieval. They also allow us to identify similarities and relationships between words, even when the words themselves look nothing alike.

For example, the Vector Embeddings for "dog" and "cat" would be much closer to each other than those for "dog" and "banana". "Dog" and "cat" are not synonyms and share neither spelling nor a dictionary definition, yet they share characteristics: both are four-legged, can be kept as pets, can be trained, and so on.
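As a rough sketch of how you might check this yourself, the snippet below requests embeddings from OpenAI’s Embeddings API and compares the three words with cosine similarity. It assumes an OPENAI_API_KEY environment variable is set, the openai and numpy packages are installed, and uses text-embedding-3-small as one possible model choice.

```python
# Minimal sketch: compare word embeddings with cosine similarity.
# Assumes OPENAI_API_KEY is set and the openai and numpy packages are installed.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Return one embedding vector per input string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative model choice
        input=texts,
    )
    return [np.array(item.embedding) for item in response.data]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog, cat, banana = embed(["dog", "cat", "banana"])
print("dog vs cat:   ", cosine_similarity(dog, cat))
print("dog vs banana:", cosine_similarity(dog, banana))
# Expect the dog/cat score to be noticeably higher than dog/banana.
```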
Now that we understand the basic concept of Embeddings, let’s look at how we can use it to simplify painful Data Management activities.
1. Data Marketplace with Semantic Context
Using data dictionaries and glossaries is manual, time-consuming and prone to errors.
In many organisations I have worked with, the metadata management tool is implemented as part of a project and then not maintained, hence becoming stale and unusable. Vector Embeddings can revolutionise this by providing a more efficient and effective way to organise and access information. By representing each data element as a vector, we can measure the similarity and relationship between different data elements, making it easier to search and retrieve relevant information. This can help create a more intuitive and user-friendly experience, where data elements are organised based on their semantic meaning and context rather than just their names or labels.
Do we really need a dictionary and a glossary anymore, if all anyone has to do is provide a reasonably legible prompt and let Embeddings look up the data elements that most closely match the ask?
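To make that concrete, here is a minimal sketch of such a lookup, assuming the same OpenAI setup as the earlier snippet; the data dictionary entries and the query are invented purely for illustration.

```python
# Minimal sketch: semantic lookup against a (made-up) data dictionary.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Hypothetical data dictionary entries: element name plus plain-English definition.
entries = [
    "customer_dob: the customer's date of birth, stored as YYYY-MM-DD",
    "acct_open_dt: the date the customer's account was opened",
    "churn_flag: 1 if the customer closed all accounts in the last 12 months",
]

entry_vectors = embed(entries)
query_vector = embed(["when did this customer join us?"])[0]

# Cosine similarity between the query and every entry.
scores = entry_vectors @ query_vector / (
    np.linalg.norm(entry_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(entries[int(np.argmax(scores))])  # expect the account-open-date entry
```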
2. Classifying Reference Data Sets
Reference Data, such as product hierarchies, currency and country codes, etc., must be explicitly defined and stored to ensure accurate reporting.
Having delivered many Data Management projects, I understand the pain of aligning all the reference data to help the organisation speak the "same language". In this instance, an organisation can use Vector Embeddings to create a similarity score between each product based on the semantic meaning of the product descriptions. This could help identify products that are the same but have different names or identifiers in other parts of the organisation. Vector Embeddings could also identify potential matches for new products based on their similarity to existing products in the reference data.
Instead of storing the reference data explicitly, we can rely on the closest meaning from the vectors to classify the underlying data.
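A minimal sketch of that idea, again assuming the OpenAI setup from the earlier snippets, might look like the following; the reference and incoming product names are invented for illustration.

```python
# Minimal sketch: match incoming product names to a reference list by nearest embedding.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Both lists are invented for illustration.
reference_products = ["Instant Access Savings Account", "Fixed Rate Mortgage", "Travel Insurance"]
incoming_products = ["easy-access saver", "5yr fixed home loan"]

ref_vectors = embed(reference_products)

for name, vector in zip(incoming_products, embed(incoming_products)):
    scores = ref_vectors @ vector / (
        np.linalg.norm(ref_vectors, axis=1) * np.linalg.norm(vector)
    )
    best = int(np.argmax(scores))
    print(f"{name!r} -> {reference_products[best]!r} (similarity {scores[best]:.2f})")
```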
3. Data Quality Anomaly Detection
Data Profiling is the first step in understanding Data Quality.
Using the baseline distribution of the data elements, we can compare the new data set with the existing data distribution. Anomalies, or data points significantly different from the baseline, will have a large distance or dissimilarity score, indicating that they may be incorrect, incomplete, or otherwise problematic. By setting a threshold for the dissimilarity score, we can automatically identify and flag data points that are potential anomalies, allowing data management teams to investigate and resolve the issue.
For example, a customer record whose phone number is significantly different from the norm for that area code would receive a high dissimilarity score and be flagged for review.
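As a rough illustration of the approach, the sketch below scores new records against a baseline centroid of embeddings and flags anything beyond a dissimilarity threshold; the records and the 0.25 threshold are assumptions, and in practice the threshold would be tuned against the profiled baseline.

```python
# Minimal sketch: flag records whose embedding sits far from the baseline.
# The records and the 0.25 threshold are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

baseline_records = [
    "customer in area code 415, phone +1 415 555 0101",
    "customer in area code 415, phone +1 415 555 0188",
    "customer in area code 415, phone +1 415 555 0123",
]
new_records = [
    "customer in area code 415, phone +1 415 555 0142",
    "customer in area code 415, phone 0044 7700 900123",  # out-of-pattern number
]

# Baseline "profile" as the centroid of the baseline embeddings.
centroid = embed(baseline_records).mean(axis=0)

for record, vector in zip(new_records, embed(new_records)):
    similarity = float(vector @ centroid / (np.linalg.norm(vector) * np.linalg.norm(centroid)))
    dissimilarity = 1.0 - similarity
    flag = "ANOMALY" if dissimilarity > 0.25 else "ok"
    print(f"{flag}: {record} (dissimilarity {dissimilarity:.3f})")
```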
Conclusion
The endless use cases of Vector Embeddings have made several potentially ground-breaking technologies popular: vector databases like Pinecone and pgvector, along with sentence transformers and text embedding models.
Although I am excited about the future of Embeddings, I am cautious knowing the poor state of Data Governance and Quality in many organisations. If you are interested in how we improve the foundations, check out my FREE Ultimate Data Quality handbook. By claiming your copy, you’ll also become part of our educational community, receiving valuable insights and updates via our email list.
If you are not subscribed to Medium, consider subscribing using my referral link. It’s cheaper than Netflix and objectively a much better use of your time. If you use my link, I earn a small commission, and you get access to unlimited stories on Medium, win-win.