
Encoding Data with Transformers

How to use transformer-based technology to perform data encoding

Michelangiolo Mazzeschi
Towards Data Science
5 min read · Dec 2, 2021


Data encoding is one of the most recent technological advancements in the domain of Artificial Intelligence. By using encoder models, we can convert categorical data into numerical data, which allows us to make comparisons, see how pieces of data relate to each other, make recommendations, and improve searches.

In this article, I am going to explain how to convert a set of articles (textual data) into vectors (numerical data) using one of the models available in the RelevanceAI library.

If you wish to use the API, there is a quick start guide that you can follow to perform your first semantic search on a dataset using vector-based technology.

What is encoding?

Encoding means converting categorical data into numerical data. There are very rudimentary kinds of encoding, for example one-hot encoding or index-based encoding. However, when working with textual data, the most advanced form of encoding can be achieved through embeddings.
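As a quick refresher, here is a minimal sketch of those two rudimentary approaches using pandas (the color data is made up purely for illustration):

import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green'])

# index-based encoding: each category becomes an integer code
print(colors.astype('category').cat.codes.tolist())  # [2, 1, 0, 1]

# one-hot encoding: each category becomes its own 0/1 column
print(pd.get_dummies(colors))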

Embeddings are able to scan a corpus of words and place each one of them into a multidimensional space, essentially converting each word into a vector. Once the model has been trained, each word in the corpus has been placed into this mathematical space in proximity to words with similar meanings.
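To make "proximity" concrete: closeness between two word vectors is usually measured with cosine similarity. A minimal sketch, using made-up toy vectors rather than real embeddings:

import numpy as np

# toy 3-dimensional "embeddings", for illustration only;
# real models use hundreds of dimensions
king = np.array([0.9, 0.7, 0.1])
queen = np.array([0.8, 0.8, 0.2])
apple = np.array([0.1, 0.2, 0.9])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(king, queen))  # high: similar meanings sit close together
print(cosine_similarity(king, apple))  # low: unrelated words sit far apart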

You can have fun exploring an embedding using Google’s embedding projector:

Tensorflow Embedding projector, retrieved from: https://projector.tensorflow.org/

This technology is having a huge impact on the way search works right now, finding most of its applications in search engines, recommendation systems, and computer vision.

How many encoders exist?

Up to a few years ago, the most popular textual encoder was word2vec. Available in several pre-trained models, it could convert each word into its corresponding vector in space. However, this is known as a static embedding, meaning the vectors never change: the model encodes word by word, ignoring the context of the sentence. We can do better than that!
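To see what "static" means in practice, here is a sketch using the gensim library (not part of this article's setup) and its pre-trained word2vec model: whatever the sentence, the vector for a given word is always the same.

import gensim.downloader as api

# download a pre-trained static word2vec model (the file is large, around 1.6 GB)
model = api.load('word2vec-google-news-300')

# the vector for 'bank' is fixed, regardless of the surrounding sentence
print(model['bank'].shape)  # (300,)
print(model.most_similar('bank', topn=3))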

The answer to this problem has now taken the form of transformer models. These encoders use dynamic embeddings: each word can have a different vector according to the words around it.
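As a sketch of what "dynamic" means, here the word "bank" gets a different vector in each sentence. This uses the Hugging Face transformers library directly, purely for illustration; the word_vector helper below is my own, not part of any library:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def word_vector(sentence, word):
    # encode the whole sentence, then pull out the contextual vector of `word`
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs['input_ids'][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = word_vector('I deposited cash at the bank', 'bank')
v2 = word_vector('We sat on the bank of the river', 'bank')

# same word, different contexts -> noticeably different vectors
print(torch.cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0)))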

As you can imagine, this is much more accurate than using static embeddings: RelevanceAI is committed to using this same technology.

Encoding data in a few lines of code

The only thing you need to do to encode textual data is to install the vectorhub library, which hosts the RelevanceAI encoders:

# encode on local
from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec

# load a transformer-based sentence encoder
model = SentenceTransformer2Vec("bert-base-uncased")
# add a vector field to each document in df_json (a list of dictionaries),
# encoding the text stored under the 'raw' field
df_json = model.encode_documents(documents=df_json, fields=['raw'])
df_json

Encoding big data

Because it is always useful to try things on a bigger dataset, you can make use of our sample datasets through the relevanceai API. Let us encode one of them; in later articles we will upload it to your relevanceai workspace and experiment with several methods:

1. Install relevanceai and vectorhub

The first step is to install relevanceai in your notebook. The installation is quite straightforward, as it uses pip.

!pip install vectorhub[encoders-text-sentence-transformers]
!pip install -U relevanceai

import relevanceai
print(relevanceai.__version__)
# restart the notebook if you are updating the API rather than just installing it for the first time

Output:
0.12.17

2. Load dataset

RelevanceAI allows you to download several sample datasets. In this case, I will use the flipkart dataset, with around 20,000 samples. To download it, just use the following code:

from relevanceai import datasets

# download the sample e-commerce dataset as a list of dictionaries
json_files = datasets.get_flipkart_dataset()
json_files

3. Dataset schema

Once the download has finished, let us check the schema of the dataset so that we can see all of its fields. So far, none of the fields has been encoded.

[{'_id': 0,
  'product_name': "Alisha Solid Women's Cycling Shorts",
  'description': "Key Features of Alisha Solid...",
  'retail_price': 999.0},
 {'_id': 1,
  'product_name': 'FabHomeDecor Fabric Double Sofa Bed',
  'description': "FabHomeDecor Fabric Double ...",
  'retail_price': 32157.0},
 {'_id': 2,
  'product_name': 'AW Bellies',
  'description': 'Key Features of AW Bellies Sandals...',
  'retail_price': 999.0},
 {'_id': 3,
  'product_name': "Alisha Solid Women's Cycling Shorts",
  'description': "Key Features of Alisha Solid Women's Cycling...",
  'retail_price': 699.0},
 ...]
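If you prefer to check the fields programmatically rather than by eye, a quick sketch:

# number of documents, and the fields of the first one
print(len(json_files))
print(json_files[0].keys())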

4. Perform the encoding

To start encoding the textual data locally, you can access some of our transformer models through the vectorhub library. Performing the encoding is very simple: you just need to pass in the json_files data, specifying the fields you wish to encode:

# encode on local
from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec

# load the same transformer-based sentence encoder as before
model = SentenceTransformer2Vec("bert-base-uncased")
# encode the 'product_name' field of the first 1000 documents
df_json = model.encode_documents(documents=json_files[0:1000], fields=['product_name'])
df_json

I will only encode the first 1000 samples; otherwise, the encoder may run for a while. After about one minute, this will be the output: as you can see, a new field containing the vectors has been added to each dictionary.

Output:
[{'_id': 0,
  'product_name': "Alisha Solid Women's Cycling Shorts",
  'description': "Key Features of Alisha Solid Women's...",
  'retail_price': 999.0,
  'product_name_sentence_transformers_vector_': [0.29085323214530945,
   -0.12144982814788818,
   -0.33044129610061646,
   0.07810567319393158,
   0.3813101351261139,
   -0.13027772307395935,
   ...]},
 ...]
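You can quickly confirm that the vectors are in place and check their dimensionality (768 for bert-base-uncased):

# the new field holds one 768-dimensional vector per document
vector = df_json[0]['product_name_sentence_transformers_vector_']
print(len(vector))  # 768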

5. Prepare data for visualization

Because you have not yet uploaded your dataset into RelevanceAI (we will show you how to do this in the next article), you will have to visualize your data manually. Here is some sample code you can use to transform the output dictionaries into a pandas DataFrame.

import pandas as pd

df = pd.DataFrame(df_json)

# expand each 768-dimensional vector into one column per dimension,
# using the product name as the row index
df_vectors = pd.DataFrame(df['product_name_sentence_transformers_vector_'].tolist())
df_vectors.index = df['product_name']
df_vectors
Result of the conversion into a pandas DataFrame, Image by Author

6. Visualize data

Because the data consists of 768 columns, you need to compress it before you can visualize it. You can use PCA to reduce it to two dimensions. There are plenty of more advanced dimensionality-reduction techniques, but this will be sufficient for a quick look at the data.

from sklearn.decomposition import PCA
import plotly.express as px

# compress the 768-dimensional vectors down to 2 dimensions
pca = PCA(n_components=2, svd_solver='auto')
pca_result = pca.fit_transform(df_vectors.values)

x = pca_result[:, 0]
y = pca_result[:, 1]

# interactive scatter plot: hover over a point to see the product name
fig = px.scatter(x=x, y=y, hover_name=df_vectors.index)
fig.update_traces(textfont_size=10)
fig.show()

Wonderful! All 1000 samples have been placed in space, and we can now see them.

Data compression of 1000 products, Image by Author
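Keep in mind that a 2-component PCA throws most of the information away. If you are curious how much of the variance the two components actually retain, you can check:

# fraction of the original variance captured by the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())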

By zooming in on the data, we can look at how each individual product relates to the others:

Zoom on the data compression to visualize text, Image by Author
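Beyond eyeballing the plot, you can also query neighbours numerically on the full 768-dimensional vectors. A minimal sketch using scikit-learn's cosine similarity (the query index is arbitrary):

from sklearn.metrics.pairwise import cosine_similarity

# similarity of every product to every other product
sims = cosine_similarity(df_vectors.values)

# the five closest neighbours of the first product (skipping the product itself)
query = 0
neighbours = sims[query].argsort()[::-1][1:6]
print(df_vectors.index[query])
print(df_vectors.index[neighbours].tolist())

This is exactly the kind of comparison that semantic search is built on; in the next article, we will upload the dataset to the RelevanceAI workspace and experiment further.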
