Five tips on survival analysis for a data scientist

A guide on how to leverage survival analysis in business

Gosia Komor
Towards Data Science


Photo by Jill Heyer on Unsplash

Survival analysis predicts time to an event

A number of analytical problems require prediction of the time until an event will occur. For example, an internet provider needs to know when a customer will terminate their contract to act on time and prevent the churn. Challenges like these can be solved with survival analysis.

The goal of survival analysis is to predict the time until an event happens and to estimate the “survival probability”. The name comes from medical research, where patients are followed over a long period until an event such as death occurs. Survival analysis is used on datasets with right-censoring, where for a subset of the samples the event has not yet been observed at the time of the study. For example, if a patient entered a study only 2 months ago, at the time of the analysis we do not know whether they will survive 5 months or 10 years. Besides predicting survival time, survival analysis can delineate the effect of single factors on survival. For instance, the effect of a treatment on the disease-free survival of a certain cancer type is often studied with this methodology.
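As a rough illustration of how a survival probability can be estimated while still using censored samples, here is a minimal sketch of the Kaplan-Meier estimator in plain NumPy. The function name and the toy data are made up for this example; in practice you would normally rely on a dedicated survival package.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return (event_times, survival_probabilities) for right-censored data.

    durations: follow-up time of each subject
    observed:  1 if the event was seen, 0 if the subject is right-censored
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)

    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)               # subjects still followed just before t
        events = np.sum((durations == t) & observed)   # events occurring exactly at t
        s *= 1.0 - events / at_risk                    # conditional probability of surviving t
        survival.append(s)
    return event_times, np.array(survival)

# Toy data (hypothetical): months of follow-up and whether the event was seen.
months = [2, 5, 5, 8, 12, 12, 15]
event = [0, 1, 1, 0, 1, 0, 1]

times, surv = kaplan_meier(months, event)
for t, s in zip(times, surv):
    print(f"S({t:.0f}) = {s:.3f}")
```

Censored subjects never count as events, but they stay in the at-risk count for as long as they were followed, which is how their partial information is used.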

Medium itself offers a wide range of blogs on the theoretical background of survival analysis, so I recommend looking at those if you are not familiar with the technique [1–3]. In this article, I will instead share additional tips on survival analysis from my experience as a data scientist.

Classical regression or classification are not meant for right-censored data

The most important reason to choose survival analysis is the presence of right-censored samples in your dataset. Right-censored data should not be analyzed with regression or classification techniques, as such analyses do not take censoring or time to an event into account[4]. This can result in drawing incorrect conclusions. Here is an (over-simplified) example of how things can go wrong:

Imagine you work for hospital X, where you want to delineate features influencing the survival of cancer patients. You have data on patients diagnosed between 2005 and 2015, whose survival was followed until 2020, with standard patient characteristics like age, gender etc. However, in 2010 hospital X introduced DNA sequencing to give patients personalized cancer therapy according to their DNA profile (see Figure 1 for an illustrative study overview). You perform a regression analysis and draw the following conclusion: “Patients with DNA sequencing have worse survival than patients without”. That is because the patients with DNA testing have an observed survival of at most 10 years (2010–2020), while the patients without DNA testing can be observed for up to 15 years (2005–2020), which has an impact on the average survival per group. Given the known success of several DNA-guided cancer therapies, this conclusion is unexpected and incorrect, and moreover, could spread false information. This is what can happen if right-censoring is ignored.

Image by Author
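The hospital example can be mimicked with a small simulation. This is only a sketch under loudly stated assumptions: both groups are given the same true, exponentially distributed survival (mean 8 years), the group sizes are arbitrary, and the censoring-aware estimate is the simple exponential maximum-likelihood estimate (total follow-up time divided by the number of observed events) rather than a full survival model.

```python
import numpy as np

rng = np.random.default_rng(0)
STUDY_END = 2020.0
TRUE_MEAN_YEARS = 8.0   # assumption: identical true survival in both groups
N_PATIENTS = 5000       # per group, arbitrary

def simulate_group(first_year, last_year):
    """Simulate right-censored follow-up for patients diagnosed in [first_year, last_year)."""
    diagnosed = rng.uniform(first_year, last_year, N_PATIENTS)
    true_time = rng.exponential(TRUE_MEAN_YEARS, N_PATIENTS)
    max_follow_up = STUDY_END - diagnosed
    observed_time = np.minimum(true_time, max_follow_up)
    event_seen = true_time <= max_follow_up   # False = still alive in 2020 (censored)
    return observed_time, event_seen

groups = {"no DNA sequencing": (2005, 2010), "DNA sequencing": (2010, 2015)}
results = {}
for name, (first_year, last_year) in groups.items():
    time, event = simulate_group(first_year, last_year)
    naive_mean = time.mean()                         # treats censored times as survival times
    censoring_aware_mean = time.sum() / event.sum()  # exponential MLE: total exposure / events
    results[name] = (naive_mean, censoring_aware_mean)
    print(f"{name}: naive {naive_mean:.1f} y, censoring-aware {censoring_aware_mean:.1f} y")
```

Although both simulated groups have the same true survival, the naive average is lower for the group with the shorter follow-up window, while the censoring-aware estimate stays close to the true mean for both.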

Going beyond Cox regression unravels extra opportunities

The go-to model for multivariate survival analysis is Cox regression. It is commonly used in medical research and easy to interpret. It is implemented in programming languages such as R and Python and in software packages for statistical analysis like SPSS, which is still a favorite of many clinicians. However, just as you probably don’t restrict yourself to linear or logistic regression for regression or classification problems, you should also consider approaches beyond Cox regression for survival analysis. The most important limitation of Cox regression is that it is a linear modelling technique that relies on several assumptions[5], such as proportional hazards. If you aim for high predictive performance on complex datasets that may include non-linear relationships, more suitable methodologies exist.
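To make the “linear” and “assumptions” remarks concrete, the Cox model is usually written as follows. This is the standard textbook form, not something specific to this article; the symbols (baseline hazard h_0, feature vector x, coefficients β) are introduced here for illustration.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^{\top}(x_a - x_b)\right)
```

The features enter only through the linear term β^T x, and the hazard ratio between two individuals does not depend on t, which is the proportional hazards assumption.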

An example of an alternative algorithm is Random Forest Survival Analysis. Random Forest is still a favorite of many data scientists because, in comparison to classical regression, it has plenty of advantages, including the ones listed below:

  • It identifies non-linear relations between variables
  • It requires fewer data transformations than methodologies restricted by their assumptions (e.g. linear, logistic or Cox regression)
  • It accounts for interactions between features
  • It performs well with a high number of features
  • It is a feasible solution in limited timeframes

Using Random Forest in survival analysis is very similar to using it for classification or regression, i.e. many trees are each trained on a subsample of the dataset and their predictions are aggregated across the trees. The survival-specific part is that the model uses, for example, the log-rank test instead of Gini impurity to evaluate each split, based on the difference between the observed and expected number of events in each daughter node[6].

Image by Author
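The split criterion described above can be sketched in a few lines. The code below computes a two-sample log-rank statistic for one candidate split and scans the thresholds of a single feature. It is a simplified illustration, not how any particular Random Survival Forest library implements it; the function names and toy data are hypothetical.

```python
import numpy as np

def log_rank_statistic(durations, observed, in_left):
    """Chi-square-type log-rank statistic comparing the left and right daughter nodes."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    in_left = np.asarray(in_left, dtype=bool)

    diff, var = 0.0, 0.0
    for t in np.unique(durations[observed]):
        at_risk = durations >= t
        n = at_risk.sum()                        # at risk in both nodes
        n_left = (at_risk & in_left).sum()       # at risk in the left node
        at_t = (durations == t) & observed
        d = at_t.sum()                           # events at t in both nodes
        d_left = (at_t & in_left).sum()          # events at t in the left node
        diff += d_left - d * n_left / n          # observed minus expected events (left)
        if n > 1:
            var += d * (n_left / n) * (1 - n_left / n) * (n - d) / (n - 1)
    return diff ** 2 / var if var > 0 else 0.0

def best_split(feature, durations, observed):
    """Return the threshold (feature <= threshold goes left) with the largest statistic."""
    feature = np.asarray(feature, dtype=float)
    candidates = np.unique(feature)[:-1]         # splitting on the maximum leaves the right node empty
    scores = [log_rank_statistic(durations, observed, feature <= c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy data (hypothetical): age, months of follow-up, event flag.
age = [35, 42, 48, 55, 61, 67, 72, 78]
months = [40, 36, 38, 30, 12, 9, 10, 5]
event = [0, 1, 0, 1, 1, 1, 0, 1]

threshold, score = best_split(age, months, event)
print(f"best split: age <= {threshold:.0f} (log-rank statistic {score:.2f})")
```

A larger statistic means the two daughter nodes differ more in survival, so the tree prefers that split.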

Other interesting alternatives are survival models based on, for example, penalized Cox models, Gradient Boosting, Support Vector Machines and Bayesian statistics (see the next section for details).

Selecting the right programming language provides more flexibility to the analysis

Whether to use Python or R is an ongoing discussion in the data science field, with each language having its specific advantages over the other [7–9].

Python is a general-purpose programming language that offers better stability, modularity, and code readability. It is preferred in industry due to its deep learning libraries and because it is more suitable for model deployment in production environments.

Survival analysis using Python is less prevalent than in R and therefore Python provides less freedom in the choice of survival models. Nevertheless, even though limited in number, there are great packages for survival analysis available in Python:

  • lifelines is a package covering univariate models (e.g. Kaplan-Meier), Cox regression and parametric models.
  • scikit-survival includes more complex or non-linear models, like Cox regression with optional L1 or L2 regularization, Random Forest, Gradient Boosting or Support Vector Machine.
  • pysurvival implements more than 10 models with very useful model evaluation visualizations. Unfortunately, it is currently only available on Linux.

On the other hand, R is specialized in statistical computing and graphics, and therefore it is currently very popular in academia. As survival analysis originates from and is frequently performed in medical research (academic environment), most of the innovation is initiated by this field. Therefore, R offers a wide variety of CRAN packages with plenty of implementations of survival models, incl. alternative algorithms (e.g. randomForestSRC for survival modelling using Random Forest or mboost for Gradient Boosting algorithms), extensions to Cox regression (e.g. glmnet for Cox modelling with regularization or BMA for Bayesian Model Averaging of Cox regression models) or packages for model evaluation (e.g. timeAUC), giving a lot of flexibility to the user. If integration into industrialized production environment is needed, R can be run at scale using solutions like Azure Databricks.

Therefore, before selecting the programming language for the analysis, consider the business question, the data and the final product that needs to be delivered.

Survival analysis can be applied across different industries

In data science, regression and classification are “must-know” data analysis techniques. Unfortunately, survival analysis is not, and it is unclear why. Personally, I believe part of the reason is in the name, “Survival Analysis”. It indicates the analysis can be used only for survival prediction and therefore it is useful in medical research alone.

This is not the case, as survival analysis is actually a very versatile and powerful analysis technique. Next to plenty of health analytics cases, survival analysis can be leveraged across many industries. Within our team, we performed survival analysis in several different projects:

We conducted a health analytics project on cancer patients’ survival prediction. In contrast to many cancer survival studies, this project involved integration of survival analysis with natural language processing. We analyzed free text from medical patient records to extract valuable information, such as specific radiology scores, using regular expressions, as well as to estimate the overall sentiment of the doctor filling in the document. Combining these variables together with the standard clinical data allowed for strong predictive performance of the final model.

In the past, we also applied survival analysis in customer lifetime value (CLV) calculation. CLV estimates how much revenue a business can expect from a customer for as long as they remain a customer, and it can be used to adjust marketing campaigns to target the right customers at the right time. In this project we calculated CLV as a product of revenue per customer per period and the predicted customer retention rate for that period, where the latter was modelled using survival analysis. This enabled the business to focus on retaining their most valuable customers who were at risk of churning in the future.
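As a minimal sketch of this calculation, the snippet below multiplies the revenue of each period by the predicted probability that the customer is still retained in that period and sums the products over a planning horizon. The summation over periods, the function name and the numbers are assumptions made for illustration; the retention probabilities would come from a fitted survival model.

```python
def customer_lifetime_value(revenue_per_period, retention_probability):
    """Sum of per-period revenue weighted by the predicted retention probability."""
    if len(revenue_per_period) != len(retention_probability):
        raise ValueError("both inputs must cover the same periods")
    return sum(r * p for r, p in zip(revenue_per_period, retention_probability))

# Hypothetical customer: 50 in revenue per period, retention taken from a survival curve.
revenue = [50.0, 50.0, 50.0, 50.0]
retention = [0.95, 0.85, 0.70, 0.50]

clv = customer_lifetime_value(revenue, retention)
print(f"Expected revenue over 4 periods: {clv:.2f}")
```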

In another project, we used survival analysis to predict popularity of used cars. Using brand and age of the car we modelled its popularity, which was defined as inventory days until the car was purchased. Knowing the effect of brand and age on car popularity enabled the business to determine the optimal selling price for each car.

Other examples of applying survival analysis outside of health analytics are:

  • Similar to the customer lifetime value calculation, there is a well-known case about customer churn prediction at a telecom company[10], where survival time corresponds to customer tenure and the event is churn. In the same fashion, survival analysis can be used to predict the time until an employee terminates their contract.
  • In finance, survival analysis can be used for probability of default calculation in credit risk.
  • In industrial companies, survival analysis can be used for prediction of machine lifetime.

Final remarks

In summary, these are the five things a data scientist should know about survival analysis:

  1. It is a technique to predict time to an event.
  2. Using other data modelling techniques, like regression or classification, on right-censored data can lead to biased results.
  3. Next to Cox regression, there is a wide variety of models that can be used for survival analysis depending on your dataset and business questions.
  4. For now, R offers more flexibility in your analysis than Python.
  5. Survival analysis is very useful, also outside of health analytics.

Finally, it is important to mention that this article is a subjective opinion. In practice, these are the five tips about survival analysis I would have liked to know at the time I was handed my first right-censored dataset. I hope these tips can help other data scientists in their work.

References

[1] S. Dhamodharan, Survival Analysis | An Introduction (2020), https://medium.com/analytics-vidhya/survival-analysis-an-introduction-87a94c98061

[2] T. Zahid, Survival Analysis — Part A (2019), https://towardsdatascience.com/survival-analysis-part-a-70213df21c2e

[3] E. Lewinson, Introduction to Survival Analysis (2020), https://towardsdatascience.com/introduction-to-survival-analysis-6f7e19c31d96

[4] K. Sawarkar, Survival Analysis- What is it? (2019), https://medium.com/inside-machine-learning/survival-analysis-cb5832ffcd78

[5] A. Kassambara, Cox Model Assumptions, http://www.sthda.com/english/wiki/cox-model-assumptions

[6] H. Ishwaran, U.B. Kogalur, E.H. Blackstone and M.S. Lauer, Random Survival Forests (2008), The Annals of Applied Statistics, Vol. 2, No. 3, 841–860

[7] R. Cotton, Python vs. R for Data Science: What’s the Difference? (2020), https://www.datacamp.com/community/blog/when-to-use-python-or-r

[8] Data Driven Science, Python vs R for Data Science: And the winner is.. (2018), https://medium.com/@datadrivenscience/python-vs-r-for-data-science-and-the-winner-is-3ebb1a968197

[9] B. Karakan, Python vs R for Data Science (2020), https://towardsdatascience.com/python-vs-r-for-data-science-6a83e4541000

[10] A. Pandey, Survival Analysis: Intuition & Implementation in Python (2019), https://towardsdatascience.com/survival-analysis-intuition-implementation-in-python-504fde4fcf8e


Data scientist at Deloitte Netherlands with a PhD in bioinformatics, focused on leveraging advanced statistics in data analytics projects