
Building a search engine system usually follows a set of well-known steps:
- Choose an appropriate database for the search operation.
- Save a set of documents.
- Apply transformations to the document fields (remove accents, filter plurals, split on whitespace…).
- Create a querying rule for retrieving those documents.
These steps are often all that's necessary to implement a reasonably effective search engine for a given application.
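As a rough sketch, the field transformations above could look like this in Python (a minimal illustration only; production systems usually delegate this work to the database's own analyzers, and the `analyze` helper here is just a made-up name):

```python
import unicodedata

def analyze(text):
    # Remove accents by decomposing characters and dropping combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Lowercase and break on whitespace.
    tokens = text.lower().split()
    # Naive plural filtering (illustrative only; real stemmers are smarter).
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(analyze("Téléphones Portables"))  # prints ['telephone', 'portable']
```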
Eventually, the requirement to upgrade the system to deliver customized results may arise. Doing so should be simple. One could choose from a set of machine learning ranking algorithms, train some selected models, prepare them for production and observe the results.
It can indeed be that simple, except for one missing part: regardless of the chosen ranking model, we still need to somehow prepare a training dataset capable of teaching the algorithms how to properly rank. There has to be, somehow, a dataset telling the algorithms what a good document is versus a bad one.
And that's going to be the main focus of this post. In order to push search engines to the next level with AI algorithms, training data that "teaches" algorithms how to properly rank results is fundamental.
We'll discuss how to use ClickModels, a field that uses Probabilistic Graphical Models (PGMs), to develop a set of variables that explain the interactions between users and the retrieved documents. Those variables will give us the concept of relevance (or judgment, as it's also known in the literature), which basically works as a marker of how valuable a document is for a given search query.

Our challenge will be to create a probabilistic model that learns from user interactions how relevant documents are for each query and context. The final code, built on top of Cython, can be found in the pyClickModels repository (Cython is required, as the optimization process of these algorithms may take a considerable amount of time to converge).
So, with no further ado, let’s see how to take search engines to the next level.
So, Why Relevance?
Figuring out what is relevant for people can be quite challenging. Imagine working on building a search engine for an eCommerce store.

In this environment (and we could argue that most retrieval engines work in similar contexts), the observed data contains:
- What users are searching.
- Context of the search (region of the user, demographic data, their preferences and so on).
- Products they are clicking.
- Products they are purchasing after clicking.
We expect that products purchased from a search result page are quite relevant, so they should have a good relevance score. But then we realize that there might be products that weren't shown to the user (for instance, products on page 2 or higher) that they would end up enjoying even more!
Not only that, several patterns tend to be observed when ranking items to users. For instance, the first document on a result page tends to get more clicks than others, not necessarily because it’s an adequate product for the query but rather simply because it’s on the very first position of the catalog (also known as "position bias").
Well, now things start to get tricky.
One approach to finding what is relevant to people is simply asking them. And, interestingly enough, that's actually an available solution used by some companies. But the downside of this approach is that it doesn't scale well and it tends to be less accurate in general.
Another approach, the one we are interested in, infers implicit relevance from "clickstream" data. Clickstream data is basically a recording of the browsing interactions between users and the search engine itself. Here's one example:
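A single clickstream session might be represented as follows (a hypothetical record; the field names loosely mirror the NDJSON input that pyClickModels consumes, but treat the exact schema as illustrative):

```python
import json

# One hypothetical session: the user searched for "smartphone", examined the
# results top-down, clicked two documents and purchased one of them.
session = {
    "search_keys": {"query": "smartphone", "region": "south", "channel": "organic"},
    "judgment_keys": [
        {"doc": "doc0", "click": 1, "purchase": 0},  # clicked, not bought
        {"doc": "doc1", "click": 0, "purchase": 0},  # skipped
        {"doc": "doc2", "click": 1, "purchase": 1},  # clicked and bought
    ],
}
print(json.dumps(session))
```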
We can use this information in order to model how relevant each document likely is for each query.
ClickModels To The Rescue!
One way to extract relevance from clickstream data is by using ClickModels, which consist of PGMs built to infer the likelihood of users clicking on documents. We can assume that the higher the probability of a click, the more relevant the document is likely to be.
The ideas and concepts discussed in what follows were extracted from the book Click Models for Web Search by Aleksandr Chuklin and others.
We have numerous ways of modeling user interactions with a ranked list of documents. Here, the [Dynamic Bayesian Network](https://en.wikipedia.org/wiki/Dynamic_Bayesian_network#:~:text=A%20Dynamic%20Bayesian%20Network%20(DBN,other%20over%20adjacent%20time%20steps.) (DBN) approach will be the one chosen for the discussion and implementation. Keep in mind, though, that once you understand the concepts and theory behind the model, you can extrapolate this knowledge using whatever variables you see fit. The book by Chuklin et al. discusses several other models, for instance.
To begin with, here’s a graphical representation of the DBN used for extracting relevance from data:

All variables in this model follow a Bernoulli distribution, meaning they can be either 0 or 1 with some probability p. These are the main concepts to grasp from the net:
- r is the position (rank) associated with a given document in a search page; r=0 means the first document, and r+1 refers to the document following position r.
- u is an identifier of the document.
- E_r is a random variable known as "Examination". As the name suggests, it determines whether the user examined a document (i.e., E=1) or not (E=0). A document can't receive clicks or purchases if it wasn't examined.
- A_u indicates whether the corresponding document u is attractive for the result. If the document is attractive (A=1), then a click event is observed.
- S_ur indicates whether the document at position r is satisfactory for customers. When they are satisfied (S=1), they stop examining new documents. Purchased documents are already considered satisfactory.
- C_r is the observed click variable and comes from the clickstream dataset.
- P_r relates to observed purchase events. If P=1 then S=1. This variable only makes sense when documents are for sale.
- γ is the probability that users will continue examining documents at positions greater than r when not yet satisfied.
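To make the generative story concrete, here's a small simulation of a single session under these variables (a sketch only: purchases are omitted for brevity, and the α, σ and γ values are made up):

```python
import random

def simulate_session(alphas, sigmas, gamma, seed=None):
    # Walk down the ranked list following the DBN story: examine a document,
    # click it if attractive, stop if satisfied, otherwise persist with
    # probability gamma.
    rng = random.Random(seed)
    clicks = []
    examining = True  # the first position is always examined
    for alpha, sigma in zip(alphas, sigmas):
        if not examining:
            clicks.append(0)
            continue
        clicked = int(rng.random() < alpha)  # attractive => click
        clicks.append(clicked)
        satisfied = clicked and rng.random() < sigma
        examining = (not satisfied) and rng.random() < gamma
    return clicks

print(simulate_session([0.8, 0.5, 0.3], [0.4, 0.4, 0.4], gamma=0.7, seed=1))
# prints [1, 0, 0]: clicked the top result, wasn't satisfied, then left
```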
Using our previous example of smartphones, here’s how the DBN would be represented:

Notice that, for each document, we associate a set of latent variables (Attractiveness, Satisfaction, Examination) that model how and why users interact the way they do with the engine results. Clicks and Purchases are filled with gray as they are observed variables (clickstream data).
Here’s the catch: if we assign values that represent the probability for each latent variable, we could use those same values to extrapolate what the relevance of each document is for each search query.
For instance, let's say we associate α with the probability that a given document is attractive and σ with the probability that the user is satisfied. If we use some optimization technique to learn from data what those values should be, we could find the relevance of each document as a consequence.
To do so, let’s begin describing our network with a set of rules (these rules will be used in the optimization process):

Here’s a summary explanation of each rule:
(1) means that the probability of a user examining a given document at a given rank r is zero if the previous document at r-1 was not examined.
(2) α is the parameter that determines the probability of attractiveness being equal to 1.
(3) We must observe a click if a document is examined and attractive.
(4) If a document was neither clicked nor purchased, then it can't be considered satisfactory for the given query.
(5) The probability of a document being satisfactory is σ if the document was clicked but not purchased.
(6) Satisfaction is 1 if both a click and a purchase are observed.
(7) If the previous document was satisfactory, users won't continue examining new products.
(8) If a document is not satisfactory and the previous document was examined, then the customer will continue examining new documents with probability γ.
(9) The probability of observing a click is the probability of a click given examination times the probability of examination (by the chain rule, this is the same as the probability of being clicked and examined), given by α * ε.
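Written out, the rule summaries above correspond to the following conditional probabilities (notation reconstructed from the descriptions; treat this as a sketch of the original equations):

```latex
\begin{align}
P(E_r = 1 \mid E_{r-1} = 0) &= 0 && (1)\\
P(A_u = 1) &= \alpha_{uq} && (2)\\
P(C_r = 1 \mid E_r = 1, A_u = 1) &= 1 && (3)\\
P(S_{ur} = 1 \mid C_r = 0) &= 0 && (4)\\
P(S_{ur} = 1 \mid C_r = 1, P_r = 0) &= \sigma_{uq} && (5)\\
P(S_{ur} = 1 \mid C_r = 1, P_r = 1) &= 1 && (6)\\
P(E_{r+1} = 1 \mid S_{ur} = 1) &= 0 && (7)\\
P(E_{r+1} = 1 \mid E_r = 1, S_{ur} = 0) &= \gamma && (8)\\
P(C_r = 1) &= \alpha_{uq}\,\varepsilon_r && (9)
\end{align}
```

where ε_r denotes the probability of examination, P(E_r = 1).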
Having the equations that govern the DBN model, some techniques can be used to find the values of α, σ and γ that best explain the observed data.
The first technique would be a straightforward implementation of maximum likelihood estimation (MLE). It basically consists of finding an expression for the likelihood of the model (expressed in log terms), which is the probability of observing the data given the model variables.
In our case, our DBN has the following log-likelihood expression:

- X is each Bernoulli variable used to describe the data (in our case, the "A", "S" and "E").
- C and P are observed clicks and purchases.
- ψ is the set of variables we are looking for to describe the data (α, σ and γ).
The summation happens over "s", which stands for session and represents the user's interactions with the search engine.
With the MLE approach, we could simply take the derivative of this expression with respect to each latent variable and set it equal to zero. But this approach wouldn't work very well: the resulting expression would be intractable to solve due to the latent variables.
This is actually a well-known problem. As it turns out, besides MLE, another available technique is the Expectation-Maximization (EM) algorithm (it works somewhat like the popular k-means algorithm does).
It starts by assigning random expected values to the latent variables α, σ and γ. It then iterates an optimization process using a specific cost function: new values are assigned to the variables and the process repeats itself until convergence or a maximum number of iterations is reached.
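To make the EM mechanics concrete, here's the classic two-coin toy example (unrelated to our DBN, but the same expectation-then-maximization loop): two coins with unknown biases, where each trial of ten flips comes from one of the coins without our knowing which.

```python
def likelihood(p, heads, tails):
    # Probability of observing this trial if the coin has bias p.
    return (p ** heads) * ((1 - p) ** tails)

# (heads, tails) counts for five trials, each from one of two unknown coins.
trials = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
theta_a, theta_b = 0.6, 0.5  # initial guesses for the latent biases

for _ in range(50):
    ha = ta = hb = tb = 0.0
    for h, t in trials:
        # E-step: expected responsibility of coin A for this trial.
        wa, wb = likelihood(theta_a, h, t), likelihood(theta_b, h, t)
        ra = wa / (wa + wb)
        ha += ra * h; ta += ra * t
        hb += (1 - ra) * h; tb += (1 - ra) * t
    # M-step: re-estimate each bias from its expected counts.
    theta_a, theta_b = ha / (ha + ta), hb / (hb + tb)

print(round(theta_a, 2), round(theta_b, 2))  # converges near 0.80 and 0.52
```

Exactly as described above: random-ish initial values, an E-step that fills in expected values for the latent assignments, and an M-step that re-optimizes the parameters, repeated until convergence.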
The cost function used as reference for optimization is given by:

It's similar to the log-likelihood expression seen before, but it replaces part of the expression with the expected values of the random variables, which makes the derivative much easier to compute (though it also ends up being an approximation of the optimal solution).
Here’s the cost function Q as defined in our model:

Par(X)=p refers to the parents of the variable X. For instance, in our DBN, the variable S (satisfaction) has Clicks and Purchases as its parents, given the edge connections of the network. The variable θ represents the latent parameters we are trying to find (α and so on). I is the indicator function and Z is just a reference for values not related to θ that will be discarded in the derivation process.
What we accomplish by using the expected value of the random variables is that now deriving this equation is much simpler. Here’s the result we get already isolating for θ:

That's basically the essence of the Expectation-Maximization algorithm: we first replace the random variables in the log-likelihood expression with their expected values, then differentiate the equation and set it equal to zero. We obtain an update rule (equation 10) for each variable, which should slowly converge to a local or global optimum.
Let's see how each variable gets updated using the definition in equation 10.
Attractiveness – α
To begin with, the attractiveness variable is defined as:

q stands for the query and u represents the document. In our DBN representation, notice that α doesn't have any parents, so it's directly defined.
To develop its optimization equation, we first need to describe the probability associated with the examination of documents as follows:

Where cr is the conversion rate of the given document u for a given query q, defined as total sales divided by the total number of times the document appeared in results pages for query q. It's only defined if documents are for sale; otherwise it's just considered to be zero.
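As a quick sketch, the conversion rate could be computed from raw impression logs like so (the data shape and helper name are illustrative, not pyClickModels' API):

```python
from collections import defaultdict

def conversion_rates(impressions):
    # impressions: (query, doc, purchased) tuples, one per time the document
    # was displayed on a results page for that query.
    sales = defaultdict(int)
    shows = defaultdict(int)
    for query, doc, purchased in impressions:
        shows[(query, doc)] += 1
        sales[(query, doc)] += int(purchased)
    # Total sales divided by total displays; zero for never-sold documents.
    return {key: sales[key] / shows[key] for key in shows}

rates = conversion_rates([
    ("smartphone", "doc0", True),
    ("smartphone", "doc0", False),
    ("smartphone", "doc1", False),
])
print(rates)  # {('smartphone', 'doc0'): 0.5, ('smartphone', 'doc1'): 0.0}
```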
So far we have that:

Where |S| is the cardinality of all sessions where the search results page for query q returned document u.
Notice that, given the architecture of our DBN, the purchase variable P is conditionally independent of examination given the clicks C, which means we only need to account for the latter:

Equation 13 can be further developed as follows:

C_{>r} is 1 if any click is observed at a position greater than r, and 0 otherwise.
For the numerator of 14, we can further develop it as:

And the denominator is a bit more complex:



Equation 14.4 should be implemented recursively. You can see how it was done in pyClickModels; here's a sample:
Finally, we can use all these equations and blend them together to come up with the final rule for adapting the attractiveness variable:

You can see here how this rule was implemented on pyClickModels:
Satisfaction – σ
The satisfaction variable σ does have parents as defined in our DBN. Therefore, from equation 10, we only use sessions where the variable is defined given its appropriate parents.
Specifically, equation 5 already described it. We have that:

Using equation 10 we have, therefore:

S'_{uq} encompasses all sessions where document u was clicked but not purchased.
The numerator of (16) can be further developed as:

Which resolves directly into the adaptation rule for satisfaction:

That was more straightforward than attractiveness. Here’s how this equation was implemented:
Persistence – 𝛾
Now comes the most challenging variable of them all.
Notice that, as defined in our DBN, persistence is globally defined, i.e., it doesn't depend on queries or documents.
As seen before, it’s given by:

In order to obtain the updating rule for 𝛾, we’ll be using the following Expected Sufficient Statistics (ESS):

Where E_r and S_ur are the parents of E_{r+1}.
An auxiliary function Φ can be implemented in order to further develop equation 19. We'd have that:

Where:

The strategy is to divide equation 20 into three main parts:



The third one is developed as:

To develop the inner probability equation, we have that:

And also that:

Using 22 and 23 into 21 we finally get:

The first one is developed as (notice the conversion rate variable has been called "w" here):

And, finally, the second equation is a direct implementation of equations 1 through 9.
You can see how these 3 equations were implemented here:
Updating the parameter is then straightforward:
And finally, that's all it takes to implement the DBN as previously defined.
Well, now back to our main goal, how do we extract relevance from all that?
Finally…Relevance
Now it's simple. After training the model on the clickstream data, we'll obtain optimized values (remember, those can still refer to local optima) for attractiveness and satisfaction.
The ranking relevance, or judgments, will simply be the product of both, i.e.:
relevance(u, q) = α_uq * σ_uq
It's possible in pyClickModels to export the judgment values after fitting the model. A newline-delimited JSON file is created for each search query/context and all respective documents.
Putting it All Together: Coding Time
Let’s see all these concepts in action now.
As it turns out, we can use public datasets available on Google BigQuery to test these ideas. In practice, we'll be querying a sample of Google Analytics data from which we can extract customers, what they searched, clicked and purchased (still, we only have access to very little data, so it remains a simple experiment after all).
Here’s the code necessary for our test:
It requires the GCS client and pyClickModels; both can be installed with:
pip install pyClickModels google-cloud-storage
Just copy and paste the code locally and run it with Python. A file will be saved at /tmp/pyclickmodels/model/model.gz. Opening the file, you should see several JSON lines such as this one:
Notice that the library returns as key the whole context of the search (in this case, the channel group through which the user arrived at the website, as well as their average spending ticket). Values are each document for that search context and the respective judgment associated with it, ranging from 0 to 1.
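Reading the exported judgments back into Python is straightforward; here's a sketch that assumes each line maps a serialized search context to its documents' judgments (the exact schema may differ between library versions):

```python
import gzip
import json

def load_judgments(path):
    # Each line is one JSON object mapping a search context to its documents'
    # judgments, e.g. {"channel:organic": {"doc0": 0.71, "doc1": 0.12}}.
    judgments = {}
    with gzip.open(path, "rt") as f:
        for line in f:
            judgments.update(json.loads(line))
    return judgments

# Demo with a fabricated file (the real one is written by the fitted model):
with gzip.open("/tmp/model_sample.gz", "wt") as f:
    f.write(json.dumps({"channel:organic": {"doc0": 0.71, "doc1": 0.12}}) + "\n")

print(load_judgments("/tmp/model_sample.gz"))
```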
Conclusion
And that’s pretty much it!
By using PGMs we can use as many latent variables as we want in order to describe the behavior of the observed data. As we’ve seen, we do so in a way that we can extract concepts such as how attractive a document probably is as well as how satisfied users are when exposed to that document.
With these relevance values, we can train learning-to-rank models that bring search engines to the next level of operation, making them capable of retrieving personalized results for each user and their respective context, as long as those contexts were observed during training.
From that point on, more sophisticated algorithms can be tailored inside the search engine, ranging from simple decision trees up to deep learning nets fully capturing the statistical nuances of the data. Still, this is only possible if the relevance data is available; otherwise there's no way of teaching the learning algorithm how to properly rank.
Full code implementation can be found in pyClickModels open-source repository:
References
[2] https://clickmodels.weebly.com/
[3] https://github.com/varepsilon/clickmodels
[4] https://www.cs.ubc.ca/~murphyk/Thesis/thesis.html
[5] https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
[6] https://medium.com/@jonathan_hui/machine-learning-expectation-maximization-algorithm-em-2e954cb76959
[7] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46485.pdf