List of Contents
Introduction
Word2Vec to Product2Vec – Product Embeddings Generator
Product I: Similar Items Recommender
Product II: Product Taxonomy Expander
Product III: Advanced Content-Based Similar Items Recommender
Product IV: Personalised Items Recommender
Product V: Listing Category Corrector
Product VI: User Interest Classifier
Conclusion – The Big Picture
INTRODUCTION
This is the story of how we developed a bundle of Data Science products containing 6 different solutions for an e-commerce platform (a marketplace in our case). The essence of the story is that every product in the bundle stems from a single source: numerical representations of products, more popularly known as ‘product embeddings’. The products were developed as a chain, with the outcome of one product leading to the next. The final suite can be seen as a comprehensive e-commerce solution covering recommendations, personalisation and product categorisation.
Personally, what I like most about this project is that once you have representative numerical product vectors, you can build all 6 data products for your business. The most vital thing, therefore, is to come up with a robust approach for generating the good-quality product embeddings that lie at the heart of every product. That is why this article starts by describing how we created the embeddings.
Before diving into the details of this multi-tiered project, let me draw your attention to one simple yet subtle fact: the importance of a data scientist’s thorough business/product understanding. As you will see in the rest of the article, each product we developed addresses our business problems from a different angle. Matching a specific business problem with the right data solution is therefore one of the crucial qualities of a data scientist, and it requires a good level of product knowledge.
From a technical perspective, a data scientist must also be capable of adapting an existing Machine Learning algorithm to fit their own business problem. In the next section, we will give an example of this and talk about how we converted one of the most well-known NLP (Natural Language Processing) algorithms, ‘Word2Vec’, into a ‘product embeddings generator’.
Important Note: The words ‘product’, ‘item’ and ‘listing’ will be used interchangeably throughout the article.
WORD2VEC to PRODUCT2VEC – PRODUCT EMBEDDINGS GENERATOR
Word2Vec is a well-known NLP approach that makes use of a neural network with a single hidden layer. It uses the hidden layer’s activations as its outcome rather than the final output layer’s predictions. The main idea behind the approach is to create a bottleneck at the hidden layer and convert sparse one-hot-encoded input word vectors into dense, low-dimensional numerical vectors.
Training data for such a network is generated by sliding a context window over all sentences in a set of documents. A context window is a fixed-size grouping of consecutive words in a sentence. Using the pair combinations of the words within each context window, we generate our (x = input word, y = output word) labelled training data and feed it to the network. In the end, words that share common contexts end up with similar embeddings. For example, people use the words "The Netherlands" and "Holland" interchangeably all the time. (Fun fact: the official name of the country is "The Netherlands", which covers all 12 provinces, while "Holland" refers to a central region of just 2 provinces, including the city of Amsterdam.) Since these two words are wrapped in similar surrounding words, i.e. they share similar contexts across lots of documents, they will also get similar embeddings from the algorithm. Discussing Word2Vec at a deeper level is beyond the scope of this article, but there are lots of blog posts from which you can learn more about it.
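To make the pair generation concrete, here is a minimal sketch of how a context window produces training pairs. The window size of 2 and the toy sentence are assumptions purely for illustration:

```python
def generate_pairs(sentence, window=2):
    """Generate (input word, output word) training pairs from one sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        # Pair the center word with every word within `window` positions of it
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

print(generate_pairs(["the", "netherlands", "is", "also", "called", "holland"]))
```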
Now, let’s think about an e-commerce platform and answer the question: what are the basic units of an e-commerce business? First of all, the main assets are our products, our users and the interactions between the two. Each user is likely to have a few sessions on the platform within a certain time period, and each session consists of many consecutive product views by the same user. Now, take a step back and try to draw an analogy between the basic units of Word2Vec and those of an e-commerce platform.

As shown in the picture, a single product view is analogous to a word. A single session, which is a sequence of products viewed by a user, can be interpreted as a sentence, and a user’s whole journey containing all of their sessions can be seen as a document.
With the help of this analogy, we can use the Word2Vec algorithm to generate our product embeddings exactly the way it works for words. Instead of ending up with word embeddings, we get numerical representations of our products. That is why, as a naming convention, it makes sense to call our product embedding generator ‘Product2Vec’.
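In practice, treating sessions as sentences means an off-the-shelf Word2Vec implementation can be trained directly on sequences of product ids. A minimal sketch with gensim might look like the following; the toy sessions, variable names and parameter values are assumptions for illustration (the actual choices are discussed below and in the Product I section):

```python
from gensim.models import Word2Vec

# Each "sentence" is one user session: an ordered list of viewed product ids (as strings)
sessions = [
    ["p101", "p202", "p150", "p101"],
    ["p202", "p303", "p150"],
    # ... millions more sessions from the clickstream
]

product2vec = Word2Vec(
    sentences=sessions,
    vector_size=64,   # embedding dimensionality (64 for the recommenders in our case)
    window=5,         # context window over consecutive views
    min_count=1,      # kept at 1 for this toy example; we used 10 in practice (see Product I)
    sg=1,             # skip-gram variant
    workers=4,
)

vector = product2vec.wv["p101"]  # the embedding of a single product
```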
Now, let’s take a moment to think about what the values in the dimensions of a product embedding represent. Product2Vec is based only on user click/view data, so it encodes product attributes implicitly. In other words, we don’t use any explicit product qualities, attributes or features while creating the embeddings. We therefore can’t match each dimension to a specific product attribute, but you can think of each one as some latent attribute such as colour, size, location or the seller’s trust score.
The reason we had to use such an approach to generate representative feature vectors is that we don’t have many explicit product attributes or tags for the listings on our platform. I believe this is a common problem for the vast majority of e-commerce platforms, but marketplace platforms in particular suffer from inadequate product information since their content depends entirely on users. However, if every single item on your platform carries clear, descriptive product tags, I strongly recommend generating product embeddings from those explicit tags instead.
Before wrapping up this section, I would like to mention 2 technical decisions we took during this embedding process. The first decision was about different ways of creating our training data. Remember that we treated each user session as a sentence and generated (input word, output word) pairs accordingly. (Note: a session is a sequence of product views with no more than 30 minutes between any two subsequent views.) We also tried creating training pairs by sliding over a user’s whole journey, i.e. the concatenation of all of their sessions. It turned out the latter gave us better-quality product embeddings in our case. This is a problem-dependent design decision that needs to be tuned by the project developer.
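For reference, sessionising a raw clickstream with the 30-minute rule takes only a few lines. This is a sketch under assumptions: the input file name and the columns `user_id`, `product_id` and `timestamp` are hypothetical names for however your clickstream is stored:

```python
import pandas as pd

clicks = pd.read_parquet("clickstream.parquet")  # hypothetical clickstream dump
clicks = clicks.sort_values(["user_id", "timestamp"])

# A new session starts whenever the gap to the user's previous view exceeds 30 minutes
gap = clicks.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=30)
clicks["session_id"] = gap.groupby(clicks["user_id"]).cumsum()

# Sessions as sentences: one ordered list of product ids per (user, session)
sessions = (
    clicks.groupby(["user_id", "session_id"])["product_id"]
    .apply(list)
    .tolist()
)
```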
The second decision we took was about the optimal size of the embeddings. As you will see in the following sections, different products require different embedding sizes. For recommenders we use 64-dimensional product embeddings, and for the products that involve clustering analysis we use 16-dimensional embeddings so as not to suffer from the curse of dimensionality. It is worth noting that you need to tune the embedding size as well, depending on your problem.
The main purpose of this article is to walk you through the development journey of a bundle of 6 different products and show how product embeddings pave the way for them. I will not go into more technical details of Product2Vec here, but if you are interested, you can read my previous blog post about it. In the rest of the article, the products are presented in the exact order in which we came up with their respective ideas. In this way, we aim to convey a sense of the development chain that ran through the whole project.
PRODUCT I: SIMILAR ITEMS RECOMMENDER
The Similar Items Recommender is one of the most popular types of recommenders and is widely used by e-commerce platforms. Its strong impact on CTR (Click-Through Rate) and CR (Conversion Rate) makes it one of the most attractive data science products of all. The common use case is to show similar products to users in a recommendation carousel on a product page, under a title such as "Recommended Similar Products".
Now, let’s take a quick look at the ‘Flow – Building Blocks’ chart below. You will see a similar chart for all the following products as well, so it is worth taking a moment to define the individual blocks now. The top row shows the high-level product flow with building blocks and intermediate datasets. The building blocks are shown as black oval shapes, and we briefly discuss the details of each one in a separate sub-section under the respective product section. These building blocks are either an existing algorithm or a function we created for a specific task.
In the flow, the grey rectangular shapes represent input or intermediate datasets. The last, colourful block that we zoom in on shows the final outcome of each product. As you can see in the chart, we represent embeddings with a set of small colour-coded squares for ease of understanding. In the example, it is intuitive that Audi car listings have similarly coloured embeddings, whereas the embeddings of Ford cars are different from the Audi ones but similar to each other.

Building Block: HNSW
Once we have embeddings representing the implicit qualities and attributes of products, all you need to do to come up with similar-item recommendations for a particular product is to find the closest products in the embedding space using a distance metric. If you don’t have many products in your problem, a brute-force k-nearest-neighbour search over the embeddings is the best thing to do. However, since we are talking about millions of products in our case, the brute-force approach, comparing a single item against every other item, is not a practical or timely option for us. Instead, we utilised one of the ‘Approximate Nearest Neighbour’ algorithms: HNSW. If you have only 10 seconds, a simplified intuition for this family of algorithms is this: draw a set of lines in the space (the number of lines is a parameter you can set) and decide how close two points are by counting how many times they fall on the same side of each line. In other words, it assumes that if points or vectors are close to each other in the space, they will tend to stay on the same side of the lines drawn. This makes the process far faster than the brute-force version, because instead of comparing a point against all other points, you compare it against a small number of lines only. You can see a simplified explanation of Approximate Nearest Neighbour search below.

For each product, we simply find the top-N closest embeddings using HNSW and save them in a table as recommended products. As summarised in the Product I flow chart above, you now have your similar items recommender. The final outcome is as simple as a table with only two columns: the product id we are generating recommendations for, and the list of similar product ids.
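A minimal sketch of this step with the hnswlib library is shown below. The index parameters, the random stand-in embeddings and the variable names are assumptions for illustration rather than our production settings:

```python
import numpy as np
import hnswlib

# embeddings: (num_products, 64) matrix from Product2Vec; product_ids: matching integer ids
embeddings = np.random.rand(10_000, 64).astype(np.float32)  # stand-in for the real vectors
product_ids = np.arange(10_000)

index = hnswlib.Index(space="cosine", dim=64)
index.init_index(max_elements=len(product_ids), ef_construction=200, M=16)
index.add_items(embeddings, product_ids)
index.set_ef(50)  # query-time accuracy/speed trade-off

# For every product, retrieve its top-5 most similar products (drop the item itself)
labels, distances = index.knn_query(embeddings, k=6)
recommendations = {int(pid): row[1:].tolist() for pid, row in zip(product_ids, labels)}
```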
Before jumping into the showcase, I would like to mention a technical decision we took at this stage of the project. Since HNSW simply returns similar items for an input product embedding, we got a chance to see the quality of the product embeddings produced by Product2Vec. In this process, we realised that products viewed only a few times by our users don’t get good-quality embeddings, and thus don’t get relevant similar-item recommendations from HNSW either. These products have few clicks for various possible reasons: they could be newly posted, fresh listings, or simply unpopular listings on the platform. The reason they end up with non-representative embeddings is that Word2Vec, by design, performs poorly on rare words.
There is a specific parameter in the algorithm, the minimum word frequency, that prevents embeddings from being generated for such rare words. You set this parameter to a certain value and the algorithm ignores any word (or product, in our case) that appears in the whole corpus fewer times than the value you set. In our case, we set this parameter to 10, essentially telling the algorithm not to bother with products viewed fewer than 10 times. This is, of course, another problem-specific parameter that should be tuned by the developer. As you can see, this product generates recommendations only for ‘popular’ listings viewed at least 10 times. What about all the rest? Keep reading and you will find a solution in the Product III section developed specifically for those fresh, new or unpopular listings.
Showcase
While assessing the performance of the product and the quality of the recommendations, we first randomly selected a product id as a seed item and then retrieved similar products for it. We put the product pages of both the seed and the recommended items into a single picture to help us evaluate them visually. In the picture below, you will see the seed product on the top left and its top-4 recommended similar products on the right-hand side.

The seed ad is a vacuum cleaner posted by a user living in Haarlem, priced at 100 euros. From the title of the ad, we can see that its brand is Dyson. Now, let’s check the recommended items. As you can see, all the recommendations are Dyson vacuum cleaners posted from various locations close to Haarlem and priced between 75 and 150 euros. It is worth praising this approach for how well it captures different qualities of products, such as location, price and brand, despite not using a single explicit product attribute. Once again, it is all based on user click behaviour data.
PRODUCT II: PRODUCT TAXONOMY EXPANDER
After Product I, we shifted our focus back to the product embeddings and started thinking about other ways we could make use of them, because we believed there was a lot of potential hidden in those embeddings. That thinking led to the idea of our second product: the Product Taxonomy Expander.
Since the content of a marketplace platform depends entirely on users, product categorisation can become challenging and easily get out of hand, simply because the inventory is so dynamic and managed by users. That is actually the main difference between a marketplace and a pure e-commerce platform. In an e-commerce platform where all products are listed by a dedicated inventory or categorisation team, it is relatively easy to keep control over product category names. However, if you are running a marketplace business, it is highly likely that you will need to automate product categorisation at some point. The reason we need such an automated system for our business is to dynamically discover the more specific product types that exist on the platform, so that all available listings can be organised in a way that lets customers find what they are looking for in as few clicks as possible.
In this product, the flow starts with the embeddings and is followed by 2 different building blocks: K-Means and Define Clusters. Let’s look into them in the next two sections.

Building Block I: K-Means
In the first step, we fit the k-means clustering algorithm on top of the embeddings to collect similar products into the same product group. We already have some existing categories listed on our marketplace, such as dishwashers or refrigerators, but they are neither numerous nor detailed enough. Therefore, by setting the number-of-clusters parameter (K) to a number preferably bigger than the total number of current categories, we end up with more specific, more detailed product-type groups that exist on our platform, such as Countertop Dishwashers or Wine Refrigerators.
Let’s stick with the aforementioned cars example from the Product II flow chart for the sake of understanding. As a result of this clustering, Audi cars will be collected into one cluster and Fiat cars will come together in another, separate cluster. Of course, these clusters will not consist of only Audi or Fiat cars; you can expect to see some other brands, possibly from the same segment. For example, there might be some BMW cars in the Audi cluster and some Ford or VW cars in the Fiat cluster. This is completely normal and unproblematic for our solution as long as the clustering algorithm groups similar products together as expected. This is just a hypothetical example to make our point clear; in reality you will probably want to set the number of clusters much higher, so that even different variants of a specific car brand fall into separate clusters, such as Fiat 500 Lounge Manual or Black Fiat 500 Sport. Through the number of clusters (parameter K), we can adjust the degree of refinement of the product clusters and pick the best value accordingly.
How did we find the optimal K in our case? First of all, we tested this product on one main category, White Goods and Equipment, which consisted of around 40 different sub-categories such as washing machines or stoves. We therefore wanted to set K bigger than 40, because our main objective is to expand the current categories and discover more detailed ones. You can think of the number of existing categories as your lower bound while finding the optimal K. What about the upper bound? Here, the size of the clusters comes into play: we can increase K only up to the point where the product clusters are still large enough to be called categories. We gradually increased K starting from 40 and found the optimal value to be around 100, which gave us the most meaningful product groups in the end. The bottom line is that K needs to be tuned specifically for your problem and platform.
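A minimal sketch of this clustering step with scikit-learn is shown below. The 16-dimensional embeddings and K=100 come from the discussion above; the random stand-in matrix, the seed and the variable names are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# 16-dimensional Product2Vec embeddings for the listings in one main category
embeddings_16d = np.random.rand(50_000, 16)  # stand-in for the real embedding matrix

kmeans = KMeans(n_clusters=100, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(embeddings_16d)  # one cluster id per listing
```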
Building Block II: Define Clusters
After clustering our products into sub-categories, the next step was to find good descriptions of the new product categories so as to name them. This building block does exactly that, and the logic is simple: for each cluster, we take the titles of all cluster member products, compute the word frequency distribution and list the top-used words with their respective percentages. Since listing titles are really important in a marketplace for reaching more buyers, most of them are already written clearly and in detail by our sellers, for example "Mobile cool box – serving cabinet – refrigerator on wheels". Analysing the listing titles alone therefore gave us nicely descriptive words for each product cluster.
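The core of this building block can be sketched in a few lines. The dataframe columns `cluster_id` and `title`, the toy rows and the simple whitespace tokenisation are assumptions for illustration:

```python
from collections import Counter
import pandas as pd

def describe_cluster(titles, top_n=10):
    """Return the most frequent title words of one cluster with their usage percentages."""
    words = [w.lower() for title in titles for w in title.split()]
    counts = Counter(words)
    total = sum(counts.values())
    return [(word, round(100 * count / total, 2)) for word, count in counts.most_common(top_n)]

# listings: one row per listing with its assigned cluster
listings = pd.DataFrame({
    "cluster_id": [7, 7, 7],
    "title": ["Philips Wake Up Light", "Philips wake up light HF3505", "Wake up light alarm"],
})
cluster_descriptions = listings.groupby("cluster_id")["title"].apply(describe_cluster)
print(cluster_descriptions)
```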
As shown in the Product II flow chart, the output of this building block is also the final outcome of the product: a table with 2 columns, Product Cluster Number and Top Words Distribution. One important note about the table is that although it is simple and easy to use, we still need human interpretation of the result to produce polished, top-quality names for the newly discovered product categories. This is much like the procedure the Netflix tagger team follows to come up with movie genres such as Dark Scandinavian Movies.
Showcase
In the picture below, you will see examples from 2 different clusters out of the more than 100 product clusters in total. In the first row, all the products come from a single existing category named ‘alarm clocks’ ("wekkers" in Dutch), as you can see from the breadcrumbs at the top. However, these products have something unique in common: they are not just alarm clocks but wake-up lights, and the brand of all of them is Philips. The top 4 words for this product cluster are ‘Philips Wake Up Light’, which is already pretty self-explanatory for this specific product group. In the second row, all the products belong to the same existing category, ‘Personal Care Equipment’, but they are actually one specific type of product: ‘Foot Bath’ (‘Voetenbad’ in Dutch). As expected, it turned out that the word ‘Voetenbad’ has a noticeably higher usage percentage than any other word in this product cluster.

There are 2 clear use cases of this product for our marketplace. In the first scenario, we can use the product groups as new level-2 categories listed on a level-1 main category page (note that the level-1 category was White Goods and Equipment in this case). In the second use case, we can improve our refine-search-results component by adding more product-type filters based on the outcome. Below you can see the pages on our marketplace platform where the final outcome of this product could be used.

PRODUCT III: ADVANCED CONTENT-BASED SIMILAR ITEMS RECOMMENDER
Remember that in Product I we skipped generating recommendations for ads viewed only a few times by our users. The vast majority of those products are newly posted, fresh listings on the platform. In the recommendation space, there is a special name for this situation: the Cold Start Problem. Here, we got our chance to deal with that problem and generate recommendations for these listings as well.
In this product, we made use of the outcome of the previous product: the product clusters with their descriptive top words. As shown in the flow chart, we simply added a new building block, Weighted Jaccard Distance, between this outcome and the collection of newly posted listings’ titles.

Building Block: Weighted Jaccard Distance
This function compares the title of each newly posted listing with the descriptive word list of every product cluster. To decide which product cluster is the best fit for a particular newly posted listing, we use the Jaccard Distance (or rather, Similarity) with a little weighting trick. Jaccard Similarity is the ratio of the number of words two sentences have in common to the number of words in their combined (union) set.

We used a little weighting trick while computing the numerator, which is the intersection of a listing’s title and a product cluster’s descriptive words: a match with a word near the top of the descriptive word list gets more weight. As the simple example below shows, this is an intuitive and stronger indication of similarity between a listing and a product cluster.

Once a listing has been compared with each product cluster in the way just described, this building block assigns it to the single best-fitting product cluster. After that, it generates similar-item recommendations for the listing by randomly picking some items from the assigned cluster.
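A minimal sketch of this building block is shown below. The weighting scheme (linearly descending weights over the ranked cluster words), the toy clusters and all names are illustrative assumptions; the exact weights used in the project may differ:

```python
import random

def weighted_jaccard(title_words, cluster_words):
    """Weighted Jaccard score between a listing title and a cluster's ranked top words."""
    title = set(w.lower() for w in title_words)
    # Assumed weighting: words higher in the cluster's top-words list count more
    weights = {w.lower(): len(cluster_words) - i for i, w in enumerate(cluster_words)}
    intersection = sum(weights[w] for w in title if w in weights)
    union = len(title | set(weights))
    return intersection / union if union else 0.0

def assign_and_recommend(title_words, clusters, n_recs=4):
    """Assign a new listing to its best-fitting cluster and sample recommendations from it."""
    best = max(clusters, key=lambda c: weighted_jaccard(title_words, clusters[c]["top_words"]))
    members = clusters[best]["member_ids"]
    return best, random.sample(members, min(n_recs, len(members)))

clusters = {
    7: {"top_words": ["robot", "vacuum", "cleaner", "roomba"], "member_ids": [11, 12, 13, 14, 15]},
    8: {"top_words": ["philips", "wake", "up", "light"], "member_ids": [21, 22, 23]},
}
print(assign_and_recommend(["iRobot", "robot", "vacuum", "new"], clusters))
```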
The output of this building block is also the final outcome of the product, and it has exactly the same format as the final table we created for ‘popular’ listings in Product I. In other words, with the combination of Product I and this product, we now have similar-item recommendations for the entire inventory on our platform.
There is one more thing I have intentionally skipped until now but that is worth mentioning here. There are around 100 product clusters in total in our case. When comparing a new listing with those clusters via the weighted Jaccard distance, ties with a few different clusters are likely. To break those ties and end up with a single assigned product cluster, we also compare the region and the price of the listing with the region distribution and median price of the tied product clusters. You can think of this as a final small nudge in the cluster assignment process.
The biggest advantage of this approach over other solutions to the Cold Start Problem is that it is based on user behavioural data rather than just 1-to-1 attribute comparisons. We compare each listing with a group of products that is itself an outcome of analysing user click behaviour. This can be seen as a more robust 1-to-many approach compared with 1-to-1 methodologies based only on attribute comparisons.
Showcase
The showcase is very similar to what we saw in Product I. We feed the model a newly posted listing as a seed input and get the top-4 recommended similar items in return. As you can see in the picture, the seed listing we are looking for similar items to is a robot vacuum cleaner. On the right-hand side, you will see recommendations relevant to this particular product type. Three of them are different robot vacuum cleaner listings and one is a special filter for a robot vacuum cleaner. That last filter recommendation actually sums up our approach and its advantage over the alternatives.

Since these recommendations come from the assigned product cluster, it is expected to see different product types that are all relevant in some way. This lets us increase the variety of similar-item recommendations for a given listing. In addition, take a quick look at the top-right listing in the picture above: there are no common words or shared attributes between it and the seed listing. If a 1-to-1 attribute comparison approach had been used, there would have been no way to end up with such a recommendation. In general, we can say that this recommender for newly posted listings captures the product type correctly and then generates relevant recommendations accordingly.
PRODUCT IV: PERSONALISED ITEMS RECOMMENDER
At this stage of the project, we started thinking about a way to generate personalised product recommendations for our users. Our main objective for this product is to show users the most relevant product recommendations based on their recent browsing behaviour on the platform.
Remember that we are now able to produce similar-item recommendations for all the listings on our platform, with Product I covering popular ads and Product III covering newly posted, fresh ads. All we had to do, therefore, was place a new building block, ‘Weighted Random Selection’, between the last listings viewed by the users and the combined outcome of Product I and Product III. You can see these connections clearly in the flow chart below.

Building Block: Weighted Random Selection
This function takes the last 10 listings viewed by a user as input and, for each of them, retrieves the top-5 similar-item recommendations from the previously developed recommenders. As you can imagine, this yields a recommendation pool of 50 listings. The numbers 10 and 5 are arbitrary but determine the size of the final pool of recommendation candidates; in other words, you should set them according to how much variety you want in the final recommendations shown to users.
The next step for this building block is to give each listing in the pool a weight according to the recency of its seed ad and its position in the similar-items list. The weights are distributed so that the top recommended item of the most recently viewed listing gets the highest weight, while the 5th recommended item of the 10th most recently viewed listing gets the lowest. This procedure is depicted in the picture below.

After the weighting step, the last thing to do is a random selection from the recommendation pool with respect to the given weights. The selected listings are served as personalised recommendations to our users. As you can see from the Product IV flow chart above, the final outcome is a table consisting of only 2 columns: User Id and Recommended Product Ids.
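A minimal sketch of this weighted random selection is below. The exact weighting formula is an assumption (here, seed recency and recommendation rank are simply combined multiplicatively), and the lookup function and names are hypothetical:

```python
import numpy as np

def personalised_recommendations(last_viewed, get_top5, n_final=4, seed=None):
    """last_viewed: listing ids ordered from most to least recent."""
    rng = np.random.default_rng(seed)
    pool, weights = [], []
    for recency_rank, listing_id in enumerate(last_viewed):          # 0 = most recent view
        for rec_rank, rec_id in enumerate(get_top5(listing_id)):     # 0 = top recommendation
            pool.append(rec_id)
            # Assumed weighting: fresher seeds and higher-ranked recommendations weigh more
            weights.append((len(last_viewed) - recency_rank) * (5 - rec_rank))
    probs = np.array(weights, dtype=float)
    probs /= probs.sum()
    return list(rng.choice(pool, size=n_final, replace=False, p=probs))

# Toy usage with a fake lookup into the Product I / Product III recommendation tables
fake_top5 = lambda lid: [f"{lid}_rec{i}" for i in range(1, 6)]
print(personalised_recommendations([f"p{i}" for i in range(1, 11)], fake_top5))
```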
Before the showcase, I would like to emphasise why we chose an approach like Weighted Random Selection. The main reason is that variety in the recommended products is really important for a personalised recommender. Rather than sticking to a single product type or recommending the same items over and over, you should give different products a chance to be picked as recommendations. This way, you also get a fairer distribution of user traffic across many different products.
Showcase
At the top of the picture below, you will see a user’s recent viewing sequence. We list the last 10 items viewed by the user, from the oldest to the most recent. As you can see, most of the listings, and especially the most recently visited ones, are robot vacuum cleaners. There are also some listings of different product types, such as mobile air conditioners (1, 3 and 7) and a split air conditioner (5).

Now, take a look at the bottom row, where we show the top four recommendations generated for this specific user. As expected, two of the recommended listings are robot vacuum cleaners, because the majority, and the most recent, of the listings viewed by the user were vacuum cleaners; it therefore makes sense to show more recommended items of that product type first. The 2nd and 4th recommendations are listings of other product types that also appeared in the user’s recently viewed sequence. This is an important indication of good variety in the user recommendations.
PRODUCT V: LISTING CATEGORY CORRECTOR
In a marketplace platform, since the content depends entirely on users, there can be many wrongly categorised listings. The reasons are worth mentioning here to understand the business problem properly. First of all, there is always the possibility that a user inadvertently picks the wrong category for their item from the category list. Another reason is not finding the specific category they are looking for and having to post their item under the next best category they can find. This is just one side of the problem, and these are in fact the main reasons for the majority of wrongly categorised listings on our platform.
However, there is another, more harmful aspect of this problem, related to freemium exploitation and fraud-like behaviour by users. For instance, if you have 2 refrigerators to sell on our platform, you are entitled to post one of them for free under the refrigerator category, but you have to pay a posting fee to list the second refrigerator under the same category. Some users therefore deliberately post such listings under other, irrelevant categories such as dishwashers just to avoid the payment.
This situation has two harmful effects on our platform. First, it causes the business to lose money and, worse than that, it turns the marketplace into a mishmash of products where users have a hard time finding what they are looking for.
For the solution to this problem, we go back to the product clusters we generated before. Some product clusters may consist of different product types, but most tend to be fairly homogeneous. In those homogeneous clusters, the majority of listings belong to the same product category; only a few are posted under other categories but fell into the same cluster as the rest because our model considers them similar. In this context, we named this minority group of products the Usual Suspects. Let’s talk about this building block in detail in the next section.

Building Block: Usual Suspects
This building block first finds the most common product category among the listings in each cluster; let’s call it the dominant cluster category. It then compares the actual category of each cluster member listing with the dominant category of the same cluster. If there is a mismatch between the two, we label the cluster member a usual suspect. Our model thinks those usual suspects were possibly wrongly categorised and therefore fell into a cluster whose dominant product category differs from their own, so their correct category is likely to be the dominant category of the cluster. The category the model proposes for those listings is thus the dominant category of their respective clusters.
Let’s look at the cars example in the Product V flow chart above. The product cluster at the top consists mostly of Audi cars, while the one at the bottom has Fiat as the dominant product category. As you can see from the picture, there is a BMW listing in the Audi cluster and Ford and VW listings in the Fiat cluster. These listings act somewhat like outliers among the other cluster members. Our model therefore suspects that the BMW listing could in fact be an Audi car, and likewise that the Ford and VW listings could be Fiat cars.
The final result set is as simple as a table with only three columns: Listing Id – Current Category – Proposed Category. It is a dynamic table that is updated periodically and captures more and more wrongly categorised listings over time. However, human effort is still needed to validate the final result set, because there may be false positives among the usual suspects. In other words, it is still possible for our model to suspect a listing even though it was posted under the correct category. There is an intuitive and simple explanation for this.
Remember that the clustering algorithm groups similar listings into the same cluster, so the usual suspects might simply be similar listings from other categories and not necessarily wrongly categorised. It turned out that if the purity of a cluster (the share of the dominant category) is high, it is more likely that its usual suspects are genuinely wrongly categorised. You can therefore control the accuracy of the final result set by adjusting the threshold on the dominant-category share when choosing the clusters in which to look for usual suspects. If you increase this threshold, you will probably end up with fewer usual suspects but a higher true-positive ratio. The threshold selection process for the usual-suspects analysis is described in the picture below.
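Before moving to the showcase, here is a minimal sketch of the usual-suspects logic. The dataframe columns, the 0.8 purity threshold and the variable names are illustrative assumptions, not the exact values we used:

```python
import pandas as pd

def find_usual_suspects(listings, purity_threshold=0.8):
    """listings: dataframe with columns listing_id, cluster_id, category."""
    suspects = []
    for cluster_id, group in listings.groupby("cluster_id"):
        shares = group["category"].value_counts(normalize=True)
        dominant, purity = shares.index[0], shares.iloc[0]
        if purity < purity_threshold:
            continue  # mixed cluster: mismatches here are probably not miscategorised
        mismatched = group[group["category"] != dominant]
        for _, row in mismatched.iterrows():
            suspects.append((row["listing_id"], row["category"], dominant))
    return pd.DataFrame(suspects, columns=["listing_id", "current_category", "proposed_category"])
```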

Showcase
In the picture below, you will see just 3 examples out of the many wrongly categorised listings we detected. We draw a rectangle around the actual category each listing was posted under, and at the bottom you will see the recommended (proposed) correct category for each listing. Since the screenshots are taken from our Dutch platform, all titles and categories are in Dutch; let me translate them into English. On the far left is a mini fridge listed under the category Fans and Air Conditioners; our model proposes Refrigerators as the correct category. In the middle is a Nescafé Dolce Gusto coffee machine listed under the Taps category, for which our model recommends the correct category: Coffee and Espresso Machines. On the far right is a table clock posted under the Other White Goods category by its seller, although we actually have a separate category for such items: Alarm Clocks (Wekkers). Note that we tried this out on all products in a single main (L1) category, White Goods and Equipment; that is why the picture says recommended L2 categories at the bottom, since we aimed to detect listings with wrong L2-level categories on our platform.

PRODUCT VI: USER INTEREST CLASSIFIER
After creating the product clusters from the product embeddings, we thought we could apply a similar clustering to our users as well. That way, we can group similar users into the same Interest Clusters based on their recent browsing history on our platform. The main objective of this product is that with user clusters showing similar characteristics, we can target different user groups in a more personalised way and do more effective marketing.
Using the outcome of this product, we would like to track the constantly changing interests of our users over time by answering the following 3 questions:
- Q1: Which specific product type are the users interested in at the moment?
- Q2: Are these users more into expensive listings, or are they bargain seekers?
- Q3: Is the listing location a deal-breaker for these users?
The first thing to do before we could take any further steps was to find a way to generate user embeddings. The Weighted Mean building block is added to the flow for exactly this purpose. Let’s look at the details of our approach below.

Building Block I: Weighted Mean
While generating user embeddings, we wanted to make use of the product embeddings created earlier. We placed a building block in the flow just to transform those product embeddings into new user embeddings. Our approach is so simple and clear that even the name of the building block gives it away: we retrieve the last 10 listings viewed by each user and take a weighted mean of their embeddings, where the more recently a listing was viewed, the more weight its embedding gets in the computation.
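A minimal sketch of this building block is below. The linear recency weights are an assumption for illustration and may differ from the exact scheme we used:

```python
import numpy as np

def user_embedding(viewed_embeddings):
    """viewed_embeddings: product vectors of the user's last views, most recent first."""
    n = len(viewed_embeddings)
    # Assumed recency weights: most recent view gets weight n, the oldest gets weight 1
    weights = np.arange(n, 0, -1, dtype=float)
    return np.average(np.asarray(viewed_embeddings), axis=0, weights=weights)

# Toy usage: 10 recently viewed products, each with a 16-dimensional embedding
recent_views = np.random.rand(10, 16)
print(user_embedding(recent_views).shape)  # (16,)
```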
Looking at our cars example in the Product VI flow chart again, since users tend to be interested in certain brands or car types, we end up with user embeddings that reflect each user’s recent product taste on the platform. If two users have been visiting similar items recently, they will have similar embeddings at the end of this process. Now it is time to apply a clustering algorithm on top of the user embeddings.
Building Block II: K-Means
This building block works exactly the same way as the K-Means block in Product II; the only difference is that this time we apply the clustering to user embeddings, so I won’t dive into the details again. The outcome of this building block is a set of user clusters, which are passed to the next building block as input.
Building Block III: Define Clusters
This building block also works similarly to the Define Clusters building block in Product II. Remember that we used only top words to describe product clusters; here we also take regions and price range into consideration when generating the descriptive qualities of user clusters.
The outcome of this last building block is also the final result set returned by this product: a table with 4 columns, Cluster Id – Top Words Distribution – Region Distribution – Median Price.
Cluster Id: The number that represents a user cluster
Top Words Distribution: a list of frequently used words in the titles of listings recently viewed by the users of that cluster
Region Distribution: a sorted list of the postal codes from which the viewed listings were posted
Median Price: the median price of all the listings viewed by the users of the cluster
Showcase
In the picture below, you will see examples of 4 different user clusters out of many. These 4 clusters nicely sum up the qualities of this product once we compare them in pairs. Let’s start by comparing user clusters 1 and 2, looking first at their word distributions. The size of each circle signifies how frequently that word appears in the titles of the viewed listings.

It is clear that users in cluster 1 are looking for coffee machines but have no clear brand preference: they checked out coffee machines from different brands such as Jura, Siemens and Philips. Users in cluster 2 are also looking for a coffee machine, but they are into one particular brand, Jura. We can say that users in cluster 2 are brand-specific compared with those in cluster 1.
Now, let’s compare the region distribution of user cluster 3 with the first two clusters. As you can see, the vast majority of the listings viewed by cluster 3 users were posted in regions 1, 2 and 3, which correspond to the well-known Amsterdam area of the Netherlands, whereas there is no region pattern for the first two clusters. The conclusion we can draw is that item location matters to the users of cluster 3.
Finally, let’s compare user clusters 3 and 4 in terms of both region distribution and median price. As with cluster 3, we can see a region pattern for cluster 4 as well: these users are into listings posted in the north of the Netherlands (the Groningen region). What is more interesting is the difference between the median prices, even though both user groups are into the same specific product type, a Philips Senseo coffee machine. As you can see, users in the 4th cluster show interest in more expensive listings than those in the other cluster. We believe this final result set holds hidden potential for marketers to extract useful insights like these about their users.
CONCLUSION – THE BIG PICTURE
If you have made it this far into the article, you have probably already noticed that all the products are connected to each other in some way. As you can see from the big picture below, the most striking aspect of this project is that all the products stem from one single source: the ad embeddings. On top of that, the output of one product is used as the input for another, so the whole thing can be seen as a development chain of multiple data products.

To build such a big picture and these 6 different data products for your business, all you need is basic clickstream data of users’ product views and the titles of those products. That’s all! It is that simple to integrate this big picture into your business. Keep in mind that the performance of all the products strongly depends on the quality of the initial product embeddings, so if you can improve your product embeddings, by using another approach or adding extra logic layers on top of ours, all your products will gradually give better results over time.
Well done, you made it to the end! This is the longest article I have ever written, and I hope you enjoyed reading it despite its length. We welcome your feedback and comments about the article and the project. Also, feel free to share it on your profile if you find this work interesting and want to let others know about it.
