A Practical Approach for Businesses

Product tags give users a snapshot of what to expect and help them to explore related products. If tags are not created with high accuracy, users will find it difficult to trust them and interaction will be limited. Yet creating accurate tags becomes increasingly difficult as the number of products increases. To solve this, we can utilise various Machine Learning techniques. In particular, we’ll discuss how to create product tags through the combination of manual tagging with machine-learning techniques.
As an example we’ll use www.govoyagin.com where the products are represented by activities.

Choosing Tags
Choosing the right tags to use is not necessarily obvious. There are various approaches:
- User Based: what tags do users care the most about? For this approach, the first step is to talk to users or create a survey to find out which tags they’re interested in and what makes sense to them.
- Machine Learning: what tags are built into the data? Clustering algorithms such as k-means can be used to group together products and then topic modelling techniques can come up with the tags that represent the clusters.
- Market Research: what are competitors doing? Whatever the industry, it is possible to find competitors who use tags in some way or another. Often they’ve done various types of research and testing to come up with their tags and these can be a great starting point. It can also indicate the industry standard that users will be expecting.
- Internal Expertise: what do you think is best? Within your company there should be various experts in the industry and they’ll likely have a great way to set tags based on their experience. It’s another great starting point, though you should always be wary about making product decisions based on the ideas of a few people.
Of course, these are only some of the approaches, and combining them usually gives the best results. For example, you can run a machine learning algorithm to work out an initial set of clusters, then use your internal experts to choose the precise tags, and finally get feedback from users to see if the tags make sense and resonate well with the market.
For this example, we’ll use a list of tags corresponding to major categories of products:
'Culture', 'Nature', 'Relaxation', 'Crazy', 'Celebration', 'Adventure', 'Food', 'Romantic', 'Socializing', 'Instagrammable', 'Family', 'Theme Parks'
Once a list of tags is decided, the products must be tagged.
Tagging Products
For companies with hundreds of products, it’s feasible to tag products manually, but as the number of products grows this becomes arduous to maintain. In this situation it’s best to turn to an automated process.
The approach detailed below is:
- Manually tag some products
- Convert product descriptions to machine-readable numbers
- Build up an understanding of what each tag means
- Use this general understanding to assign tags to products
This is an example of supervised learning, where we provide examples of input-output pairs and the model learns to predict what the output should be, given the input.
Manual Tagging
The first step is to manually tag some products. By manually tagging activities first, we introduce some ‘expert guidance’ for better performance. A representative set of the products should be chosen to give the most information to the model. For the techniques discussed we will tag just tens of products in each class.
This initial tagging can also be done using word matching. For example, if your product tag is shirt, then the word shirt is probably in the product description. This technique works well in such cases but will fail with more generic tags.
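As a rough illustration of word matching, here is a minimal sketch. The tags are taken from this article, but the keyword lists and the function name `keyword_tags` are hypothetical and chosen only for the example.

```python
import re

# Hypothetical keyword lists: the tags come from the article,
# the keywords themselves are illustrative assumptions.
TAG_KEYWORDS = {
    "Food": {"sushi", "restaurant", "ramen"},
    "Theme Parks": {"disneyland", "theme park"},
}


def keyword_tags(description, tag_keywords=TAG_KEYWORDS):
    """Return the sorted list of tags whose keywords appear in the description."""
    text = description.lower()
    matched = []
    for tag, keywords in tag_keywords.items():
        # Word boundaries (\b) avoid matching a keyword inside a longer word.
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            matched.append(tag)
    return sorted(matched)


print(keyword_tags("Reservation for Sushi Jiro Roppongi Michelin 2-star Tokyo"))
print(keyword_tags("34% OFF Robot Restaurant Shinjuku Tokyo Discount E-Tickets"))
```

The second call returns Food because the title contains the word "restaurant", even though a person would tag that product as Crazy. This is the kind of failure to expect with more generic tags.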
As an example, we could assign the products manually, like this:
34% OFF Robot Restaurant Shinjuku Tokyo Discount E-Tickets (899)
=> Crazy
Studio Ghibli Museum Tickets - Preorder & Last-Minute Tickets (186)
=> Culture
Tokyo Disneyland Tickets - Maihama Station Pickup (10069)
=> Theme Parks
Reservation for Sushi Jiro Roppongi Michelin 2-star Tokyo (1690)
=> Food
Giving us a final JSON object with tags and product ids to work with:
{
"Crazy": [123, 98, 899, etc.],
"Culture": [186, 1200, 323, etc.],
...
}
While these were prepared by just one person, it’s often helpful to have at least three people go over each product as an attempt to remove any biases that may exist.
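One simple way to combine the opinions of several taggers is a majority vote. The sketch below assumes each product receives one tag from each person; the votes shown are made up for illustration, and products without a clear majority are left untagged so they can be discussed.

```python
from collections import Counter


def majority_tag(votes):
    """Return the tag chosen by a strict majority of taggers, or None."""
    if not votes:
        return None
    tag, count = Counter(votes).most_common(1)[0]
    return tag if count > len(votes) / 2 else None


# Hypothetical votes from three people, keyed by product id.
votes_by_product = {
    899: ["Crazy", "Crazy", "Food"],
    186: ["Culture", "Family", "Theme Parks"],
}

agreed = {pid: majority_tag(votes) for pid, votes in votes_by_product.items()}
print(agreed)
```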
The Meaning of a Tag
Once we have a set of tags to describe our products, we need to work out what they actually mean. We’ll do this with machine learning. For a machine learning algorithm to understand products, it needs a way to reduce those products to numbers. In this case, we’ll refer to the numbers as a vector, since each product is usually represented by a group of numbers. Typical things we would feed into the model are product descriptions or product images. For product descriptions we have a few techniques to choose from, including:
- Bag of Words
- TF-IDF
- Doc2Vec
Doc2Vec is a great technique and quite easy to implement in Python using Gensim. This model will reduce the product descriptions to a vector with a certain number of data points representing qualities of the products. What each data point represents is not known in advance, but they can often be seen to align with particular qualities such as colour, shape, size or, in our case, the tags or categories we are using.
from gensim.models.doc2vec import Doc2Vec
# train_corpus is not defined in this snippet. It is assumed to be a list of
# gensim TaggedDocument objects, one per tokenised product description.
# Create a model with some pre-defined parameters.
model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
# Build the vocabulary from some training data.
model.build_vocab(train_corpus)
# Train the model.
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
In the case of product images, image recognition techniques can be applied. For example, the images can be converted to data based on the colour of each pixel, and a neural network can then be trained to predict which tag (class) a given image belongs to. These models can be combined with the ones for product descriptions discussed in this article.
With the product descriptions reduced to vectors, we can work out what a given tag means. A simple approach, which does fairly well for independent tags (a product belongs to one tag, but not another), is to take the average of the vectors for products with that tag. The result is a vector which represents the average product for that tag.
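The averaging step can be sketched in a few lines. This is a minimal illustration using toy two-dimensional vectors; in practice the vectors would come from the trained model and have many more dimensions. The names `mean_vector` and `tag_centroids` and the vector values are assumptions made for the example.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def tag_centroids(tagged_ids, product_vectors):
    """Map each tag to the average vector of its manually tagged products."""
    return {
        tag: mean_vector([product_vectors[pid] for pid in ids])
        for tag, ids in tagged_ids.items()
    }


# Toy vectors keyed by product id (illustrative values only).
product_vectors = {899: [1.0, 0.0], 123: [0.5, 0.5], 186: [0.0, 1.0]}
# Same shape as the tags-to-product-ids object built during manual tagging.
tagged_ids = {"Crazy": [899, 123], "Culture": [186]}

centroids = tag_centroids(tagged_ids, product_vectors)
print(centroids)
```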
An example for some product tags is plotted in the graph below. Similar tags such as Celebration and Instagrammable (centre-left) almost completely overlap, so we might want to reconsider whether both are needed. The tag Crazy is significantly removed from the other tags, which makes sense: these products must be very unusual!

Assigning Tags
Now that we have the average product, we can find the products which should be tagged accordingly by finding any nearby vectors. If a vector is nearby, it should be similar. We can limit the distance to ensure we’re only capturing products which are sufficiently similar to the tag in question. This distance can be referred to as a similarity measure.
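A minimal sketch of this assignment step follows. It assumes cosine similarity as the similarity measure, which is a common choice for document vectors but is an assumption here, and the `cutoff` value is a placeholder to be tuned per dataset. The tag vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_tags(vector, centroids, cutoff=0.6):
    """Return every tag whose average vector is similar enough to the product."""
    return sorted(
        tag
        for tag, centroid in centroids.items()
        if cosine_similarity(vector, centroid) >= cutoff
    )


# Toy average vectors for two tags (illustrative values only).
centroids = {"Crazy": [1.0, 0.0], "Culture": [0.0, 1.0]}

print(assign_tags([0.9, 0.1], centroids))
print(assign_tags([1.0, 1.0], centroids))
```

Note that a product can receive several tags, or none at all, depending on where the cutoff is set.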
At this point it’s helpful to plot a histogram of the similarity of products to tags to see what a good cutoff point might be. You want to make sure you’re tagging a good number of products but not tagging every product with certain tags!
In the four examples shown below we have:
- Nature: many products don’t relate to this tag, with the peak around 0.1 similarity and then a steady drop-off in Nature-related products.
- Relaxation: unlike Nature, there seems to be a secondary peak with many products around 0.4 similarity. We want a cutoff higher than this to avoid capturing all these products, which must have some similar aspects but are not fully related.
- Food: similar to Relaxation, we can see a second peak, but this time it is much higher in similarity, around 0.6.
- Theme Parks: this has a clear plateau on the far right, after 0.5, where the similarity drops off quite quickly. This is likely because the Theme Parks tag is straightforward, meaning a product can easily be said to be a theme park or not. Contrast this with some of the previous tags, such as Relaxation.
We might choose 0.6 as a cutoff point that balances having good meaning but not too few products tagged. Looking at the graphs it can be seen that this is not perfect and we should be able to do better.
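To explore candidate cutoffs without a plotting library, the similarities for one tag can be binned and the share of products above a cutoff computed. The sketch below uses made-up similarity values; the function names are assumptions for illustration.

```python
def similarity_histogram(similarities, n_bins=10):
    """Count similarities into n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for s in similarities:
        # Cosine similarity can be negative; clip into [0, 1] for binning.
        clipped = min(max(s, 0.0), 1.0)
        index = min(int(clipped * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def fraction_tagged(similarities, cutoff):
    """Share of products that would receive the tag at this cutoff."""
    return sum(s >= cutoff for s in similarities) / len(similarities)


# Hypothetical similarities of seven products to a single tag.
sims = [0.05, 0.12, 0.15, 0.41, 0.62, 0.65, 0.97]

print(similarity_histogram(sims))
print(fraction_tagged(sims, cutoff=0.6))
```

Running `fraction_tagged` for each tag at a few cutoffs gives a quick sense of whether a single global value is reasonable or whether each tag needs its own.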

Validating Tags
Once tags have been prepared, the next step is to validate that they make sense. The simplest approach is to go through a random list of products under each tag and check that the tag fits. Ultimately, though, it comes down to how users respond to your tags. It’s useful to test your tags online and see how users interact with them. Are they using tags to research products? Do they view many products from tags or drop off at this point?
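Drawing the random list for a manual check can be sketched as follows. The product ids and the function name `sample_for_review` are hypothetical, and a fixed seed is used only so that the same sample can be reproduced between reviewers.

```python
import random


def sample_for_review(product_ids_by_tag, k=5, seed=0):
    """Pick up to k random product ids per tag for a manual spot check."""
    rng = random.Random(seed)
    return {
        tag: rng.sample(ids, min(k, len(ids)))
        for tag, ids in product_ids_by_tag.items()
    }


# Hypothetical output of the automated tagging step.
auto_tagged = {"Food": [1690, 12, 45, 78], "Crazy": [899]}

review_list = sample_for_review(auto_tagged, k=2)
print(review_list)
```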
Next Steps
In this article, we’ve created a list of tagged products using basic data processing and machine learning techniques. With these techniques it’s possible to get a decent-quality tagging system up and running quickly.
There are various ways we can try to improve upon our method. Rather than choosing the nearest products to our tags, we can apply a machine learning algorithm such as a support vector machine (SVM) or Multinomial Naive Bayes, which learns to predict the tag in a more sophisticated way. For these models, more training data is required, but in return we will have greater predictive power.
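To give a feel for how Multinomial Naive Bayes works, here is a small pure-Python sketch with add-one smoothing. It is only meant to show the mechanics under simplifying assumptions (whitespace tokenisation, one tag per product, made-up training descriptions); in practice a library implementation such as scikit-learn's `MultinomialNB` would be the more sensible choice.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple tokeniser: lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(examples):
    """examples: list of (description, tag) pairs. Returns the fitted counts."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, tag in examples:
        tag_counts[tag] += 1
        for word in tokenize(text):
            word_counts[tag][word] += 1
            vocabulary.add(word)
    return tag_counts, word_counts, vocabulary


def predict_tag(text, model):
    """Return the tag with the highest log-probability for the description."""
    tag_counts, word_counts, vocabulary = model
    total_docs = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in sorted(tag_counts):
        total_words = sum(word_counts[tag].values())
        score = math.log(tag_counts[tag] / total_docs)  # prior for the tag
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                likelihood = (word_counts[tag][word] + 1) / (total_words + len(vocabulary))
                score += math.log(likelihood)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Made-up training descriptions for two of the article's tags.
examples = [
    ("sushi dinner reservation", "Food"),
    ("ramen tasting tour", "Food"),
    ("roller coaster park tickets", "Theme Parks"),
    ("disneyland park tickets", "Theme Parks"),
]
model = train_naive_bayes(examples)

print(predict_tag("sushi tasting", model))
print(predict_tag("park tickets", model))
```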