
Introduction
Recommendation engines are a solid pillar upon which our experience as customers is built. From in-app push notifications to personalized mailing lists, we are accustomed to scrolling through suggestions that intelligent systems devise to arouse our interest. Despite the complexity of the topic and the variety of approaches¹, the foundations of any recommendation system lie in at least three entities:
- Users
- Products
- Ratings
While users and products are straightforward concepts, ratings hide more subtleties, as their implementation depends on several factors, such as the business domain, context, user experience (UX) and data availability.
At a high level, ratings can be categorized as either explicit (likes, scores, …) or implicit (purchases, clicks, …).
Now, let us imagine an e-commerce website selling a fairly diverse assortment of products. As the site collects browsing activity, the company would like to exploit it to derive interpretable insights into how users' interactions with the web pages affect purchases. **Moreover, the company would like to turn these interactions into ratings** through a scoring system, aiming to improve both the quality of the recommendations and the user experience.
This post describes how to achieve both objectives by leveraging Survival Analysis through a practical example.
The dataset
We import the needed libraries:
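A minimal sketch, assuming the analysis relies on pandas, matplotlib and the lifelines package:

```python
import pandas as pd
import matplotlib.pyplot as plt

from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test
```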
We load a sample dataset that was synthetically manufactured for the purpose of this analysis:
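For instance, assuming the synthetic data is stored in a CSV file (the file name below is hypothetical):

```python
# Hypothetical file name; adjust to wherever the synthetic dataset lives
df = pd.read_csv("web_sessions.csv")
df.head()
```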

Each observation describes a web session, i.e. browsing activity captured from the landing on a product page up to an event, which can be either a purchase or a dropout. In particular:
- _user_id_: user identifier.
- _product_id_: product identifier.
- _added_to_wishlist_: whether the product was added to a wishlist or not.
- _direct_or_recommend_: whether the user landed on the product page directly or by selecting an internal recommendation from a previous page.
- _click_on_reviews_: whether the user pressed a "Reviews" button to read other customers' remarks on the product.
- _click_on_more_details_: whether the user pressed a "More Details" button to read further product specifics.
- _open_internal_links_: whether the user selected links from within the page to other internal resources, i.e. other product pages.
- _click_on_photos_: whether the user clicked on the product's photos or not.
- _buy_: whether the product was purchased or not.
- _time_to_event_: elapsed time from the landing on the product page to the event, either a purchase or a dropout, expressed in minutes.
We may imagine this data being collected from interactions with web pages looking as follows:

Our dataset consists of 163 web sessions, 156 of which concluded with a purchase, while 7 ended in customer abandonment without buying any product.
We can display and compare summary statistics and covariate distributions grouped by the two event groups (0 = dropout, 1 = purchase):
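One possible way to obtain such a comparison with pandas, assuming the column names listed above:

```python
interaction_cols = [
    "added_to_wishlist", "direct_or_recommend", "click_on_reviews",
    "click_on_more_details", "open_internal_links", "click_on_photos",
]

# Session duration statistics per outcome group (0 = dropout, 1 = purchase)
print(df.groupby("buy")["time_to_event"].describe())

# Share of sessions featuring each interaction, per outcome group
print(df.groupby("buy")[interaction_cols].mean())
```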

Survival Analysis
Survival analysis encompasses a collection of statistical methods for describing time-to-event data. It is widely adopted in clinical studies, although it finds suitable applications in many other fields (e.g. predictive maintenance).
In our case, we are interested in how the browsing activity affects the time to the event. We know that some users made a purchase, while other users abandoned the website after a certain amount of time (right censoring):
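To visualize which observations are censored, one option is lifelines' plot_lifetimes utility (a sketch, assuming the column names above):

```python
from lifelines.plotting import plot_lifetimes

# Each horizontal line is a web session; observed events (purchases) and
# censored sessions (dropouts) are drawn in different colours.
ax = plot_lifetimes(df["time_to_event"], event_observed=df["buy"])
ax.set_xlabel("Minutes since landing on the product page")
plt.show()
```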

Kaplan-Meier estimator
We use the Kaplan-Meier curve² to estimate the survival function S(t). The survival function gives the probability that customers will "survive" (i.e. not buy) past a certain time after landing on the product page. In particular, the unconditional probability of survival up to time t is estimated as the product of conditional probabilities of surviving to the different event times:

S^(t) = ∏_{i: t(i) ≤ t} (1 − dᵢ / nᵢ)
where:
- S^(t): Kaplan-Meier estimate of the survival function.
- t(i): a time where at least one purchase (event) happened, as we observe events on a discrete time scale.
- nᵢ: number of customers that have not yet bought the product at time t(i), nor left the website.
- dᵢ: number of events (purchases) at time t(i).
We can stratify the Kaplan-Meier curve for a given condition, and thus verify whether the condition affects the survival estimate or not.
For example, does saving the product into the wishlist affect the probability of buying it (the event) over a given period of time?
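A sketch of the stratified Kaplan-Meier curves with lifelines (column names as above) could look like this:

```python
fig, ax = plt.subplots()
kmf = KaplanMeierFitter()

# Fit and plot one survival curve per wishlist condition
for label, group in df.groupby("added_to_wishlist"):
    kmf.fit(
        group["time_to_event"],
        event_observed=group["buy"],
        label=f"added_to_wishlist={label}",
    )
    kmf.plot_survival_function(ax=ax)

ax.set_xlabel("Minutes since landing on the product page")
ax.set_ylabel("Probability of not having purchased yet, S(t)")
plt.show()
```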

At a glance, we can identify the positive effect on purchases (i.e. the negative effect on purchase-free survival) of adding the product to a wishlist.
The log-rank statistical test is performed to evaluate differences between the two curves, and the p-value allows us to reject the null hypothesis (H₀: the curves are identical), confirming a statistically significant difference between survival rates in presence vs. absence of wishlist use.
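The test is available in lifelines; a sketch, again assuming the column names above:

```python
wishlist = df[df["added_to_wishlist"] == 1]
no_wishlist = df[df["added_to_wishlist"] == 0]

# Null hypothesis: the two survival curves are identical
result = logrank_test(
    wishlist["time_to_event"], no_wishlist["time_to_event"],
    event_observed_A=wishlist["buy"], event_observed_B=no_wishlist["buy"],
)
result.print_summary()  # test statistic and p-value
```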
Similarly, we can assess the effect of clicking the "More Details" button:

Again, we have evidence that users' browsing behaviour can provide valuable insights into their appreciation of the viewed product.
But how can we build a model taking into account the effect of multiple variables? How can we turn the interactions with web page elements into measures of hazard or ratings?
The Cox proportional hazards model
The Cox proportional hazards model³ can be used to assess the association between covariates and the survival rate, and is defined as:

h(t|xᵢ) = h₀(t)exp(β’xᵢ)
The formula states that the hazard function h(t|xᵢ) is the product of a baseline hazard function h₀(t) and the relative risk exp(β’xᵢ).
As the form of the underlying hazard function h₀(.) is unspecified, the model is semi-parametric. Moreover, it is possible to estimate β without estimating h₀(.).
Furthermore, the Cox model provides the additional advantage of interpretable coefficients. As an example, consider a single covariate xᵢ representing the click on the "Reviews" button, where xᵢ=1 indicates that the user accessed the product reviews and xᵢ=0 that the user did not. The Cox model can then be expressed as h(t|xᵢ) = h₀(t)exp(βxᵢ), where exp(β) indicates the relative risk of purchase given by clicking the "Reviews" button over not clicking it:
- Risk given by clicking the "Reviews" button (xᵢ=1): h₀(t)exp(β⋅xᵢ) = h₀(t)exp(β⋅1) = h₀(t)exp(β)
- Risk given by not clicking the "Reviews" button (xᵢ=0): h₀(t)exp(β⋅xᵢ) = h₀(t)exp(β⋅0) = h₀(t)
- Relative risk = risk given by clicking the button / risk given by not clicking it = h₀(t)exp(β) / h₀(t) = exp(β)
We can fit the Cox model and inspect the obtained results:
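With lifelines, this might look as follows (the identifiers are dropped, since they are not covariates):

```python
covariates = df.drop(columns=["user_id", "product_id"])

cph = CoxPHFitter()
cph.fit(covariates, duration_col="time_to_event", event_col="buy")
cph.print_summary()  # coefficients, exp(coef), confidence intervals, p-values
```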

For each covariate xᵢ, we can observe its relative risk exp(β), together with its confidence interval:
- exp(β)>1 (or β>0) indicates an increased risk of purchase.
- exp(β)<1 (or β<0) a reduced risk of purchase.
Some interpretations of the feature effects from the above table follow:
- Adding the product to the wishlist significantly (p-value <0.005) increases the risk of purchase from a minimum of 3.81 times (exp(β) lower 95% CI) to a maximum of 11.19 times (exp(β) upper 95% CI).
- Accessing the product page through an internal recommendation significantly (p-value=0.02) reduces the risk of purchase compared to a direct access to the page, with exp(β) ranging from 0.45 (lower 95% CI) to 0.93 (upper 95% CI).
- Clicking on the product's photos neither significantly increases nor reduces the risk of purchase (p-value=0.94): in fact, we can notice that the exp(β) CI oscillates around 1 (lower 95%=0.69, upper 95%=1.42).
We can further observe the coefficients β by plotting them:
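In lifelines, the fitted model exposes a plotting helper for this purpose:

```python
# Plots the log hazard ratios (the β coefficients) with their confidence intervals
cph.plot()
plt.show()
```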

Among the Cox model's assumptions, we should verify that the proportional hazards hypothesis holds:
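lifelines provides a convenience method that tests this assumption on the fitted model:

```python
# Statistical tests (and advice) based on the scaled Schoenfeld residuals
cph.check_assumptions(covariates, p_value_threshold=0.05)
```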

In conclusion, thanks to survival analysis we can obtain interpretable hazard measures that provide insights into the association between customers' interactions with the web pages and the survival rate.
For a given web browsing session, we may calculate the partial hazard as exp(β’xᵢ), thus neglecting the baseline hazard h₀(t), and thereby estimate the relative risk associated with a product purchase:
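As a sketch, the partial hazard of a single session can be obtained from the fitted model; the example session below is hypothetical, chosen only to illustrate the call. Note that, depending on the lifelines version, covariates may be mean-centered before computing exp(β’x), so the returned values are best read relative to each other:

```python
# Hypothetical session: wishlist, reviews, details and photos interactions
session_a = pd.DataFrame([{
    "added_to_wishlist": 1, "direct_or_recommend": 0, "click_on_reviews": 1,
    "click_on_more_details": 1, "open_internal_links": 0, "click_on_photos": 1,
}])
print(cph.predict_partial_hazard(session_a))  # baseline hazard excluded
```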

We can compare it with the hazard from a different web session on the same product page:
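A second, equally hypothetical session with fewer interactions can be scored in the same way (these sketched sessions will not reproduce the exact figures quoted below):

```python
# Hypothetical second session on the same product page, with fewer interactions
session_b = session_a.copy()
session_b[["added_to_wishlist", "click_on_reviews", "click_on_photos"]] = 0
print(cph.predict_partial_hazard(session_b))
```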

We notice that the first session is associated with a higher partial hazard (4.93) than the second (1.17), suggesting that the first user implicitly manifested a higher preference for that product than the second user, and may be more likely to buy it.
Indeed, we may use the partial hazards as a naive estimate of ratings, thus turning our browsing activity dataset into the triad of users, products and ratings that is essential to any recommendation strategy:
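A sketch of this final step, reusing the covariates frame defined earlier:

```python
ratings = df[["user_id", "product_id"]].copy()
# Use each session's partial hazard (baseline hazard excluded) as an implicit rating
ratings["rating"] = cph.predict_partial_hazard(covariates).to_numpy().ravel()
print(ratings.head())
```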

Conclusions
In this post, we applied survival analysis techniques to browsing information collected by a fictional e-commerce website.
This approach may provide the following benefits:
- Incorporation of web browsing activity (users' behaviour) into existing ratings calculations to improve recommendations.
- Interpretable hazards able to explain what kind of interaction with the UX may lead to an event (purchase).
- Independence from the user's identity: web browsing activity can be exploited even if the user remains unknown.
- Identification of potential areas of improvement for the existing UI (e.g. when the interaction between customers and a new web page element is associated with a decreased risk of purchase).
It is important to remember that:
- In this post, we modelled an observation over a web session, defined as the time elapsed between the landing on the product's web page and either a purchase or a dropout. For example, we did not consider whether the same user landed on the same page multiple times on different occasions. We also did not consider whether a user purchased a product from a previously recommended result (internal hops). A different problem definition would lead to different assumptions and thus to the need to consider different survival analysis methods.
- We also assumed that our covariates do not change over time, as we are treating the web session data as a sort of "baseline condition" to understand customer preferences over a product and propensity to purchase. Nevertheless, we may take time-varying covariates into account by modifying the Cox model⁴:

h(t|xᵢ(t)) = h₀(t)exp(β’xᵢ(t))
- In terms of implementation, as we used the lifelines⁵ Python package, adopting a Cox time-varying proportional hazards model would mean fitting the _CoxTimeVaryingFitter_⁶ object instead of the _CoxPHFitter_⁷. Although the object still exposes methods to calculate hazard values at known observations, the meaning of predictions is not straightforward: as covariates change over time, a prediction from an observation time t₁ to a time t₂ would require knowing the values assumed by the covariates at t₂, which are, in principle, unknown.
- We chose to infer a rating measure by leveraging the partial hazards, exp(β’xᵢ). Other approaches could be pursued⁷.
- The Cox proportional hazards regression model may not be able to capture complex, non-linear relationships in the data. For this purpose, different models, such as Survival Random Forests⁸ or Neural Networks⁹ (implementations are available in the scikit-survival¹⁰ and pycox¹¹ packages), could be investigated instead.
- Further information concerning recommendation strategies in conditions of implicit ratings can be found, for example, here¹².
References
[1] S. M. Al-Ghuribi, S. Noah, "A Comprehensive Overview of Recommender System and Sentiment Analysis", arXiv:2109.08794, 2021.
[2] E. L. Kaplan and P. Meier, "Nonparametric Estimation from Incomplete Observations", Journal of the American Statistical Association, Vol. 53, n°282, pp. 457–481, 1958.
[3] D. R. Cox, "Regression Models and Life-Tables", Journal of the Royal Statistical Society. Series B (Methodological), Vol. 34, n°2, pp. 187–220, 1972.
[4] Zhongheng Zhang, Jaakko Reinikainen, Kazeem Adedayo Adeleke, Marcel E. Pieterse, Catharina G. M. Groothuis-Oudshoorn, "Time-varying covariates and coefficients in Cox regression models", Annals of Translational Medicine, 6(7): 121, 2018.
[5] https://lifelines.readthedocs.io/en/latest/
[6] https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxTimeVaryingFitter.html
[7] https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html
[8] Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, Michael S. Lauer, "Random Survival Forests", Annals of Applied Statistics, Vol. 2, n°3, 841–860, 2008.
[9] Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger, "Deepsurv: personalized treatment recommender system using a Cox proportional hazards deep neural network", BMC Medical Research Methodology, 18(1), 2018.
[10] https://scikit-survival.readthedocs.io/en/stable/user_guide/random-survival-forest.html
[11] https://github.com/havakv/pycox
[12] Yifan Hu, Yehuda Koren, Chris Volinsky, "Collaborative Filtering for Implicit Feedback Datasets", Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), Pisa, Italy, 2008.