
A Guide to Model Selection For Survival Analysis

How to examine and evaluate models for time-to-event data

Photo by m. on Unsplash

Selecting a suitable model for the time-to-event data of interest is an important step in survival analysis.

Unfortunately, with the sheer number of models available for survival analysis, it is easy to become inundated by all of the available information.

While it might be tempting to meticulously study every modeling approach, drowning in statistical jargon will only hamper the effort to find the best model.

Here, we explore a few popular models in survival analysis and then discuss the factors that should be considered when identifying the best model for the time-to-event data of interest.

Key Terminology

Before covering any of the models, it is important to become familiar with some of the measures that are of interest in survival analysis.

  1. Survival function

The survival function is denoted by S(t), where:

Survival function (Created By Author)
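
In standard notation, with T denoting the time until the event:

S(t) = P(T > t)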

It shows the probability of a subject surviving (i.e., not experiencing the event) beyond time t. This is a non-increasing function.

2. Hazard function

The hazard function, also known as the hazard rate, is denoted by h(t), where:

Hazard function (Created By Author)
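
In standard notation:

h(t) = lim (Δt → 0) P(t ≤ T < t + Δt | T ≥ t) / Δt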

It shows the instantaneous probability of the event occurring at time t, given that the event has not already occurred. The hazard function can be derived from the survival function, and vice versa.

3. Cumulative hazard function

The cumulative hazard function is a non-decreasing function that shows the total accumulated risk of the event occurring up to time t. In mathematical terms, it is the area under the hazard function up to time t.
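
In symbols:

H(t) = ∫₀ᵗ h(u) du

and the survival function can be recovered from it as S(t) = exp(−H(t)).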

4. Hazard Ratio

The hazard ratio is the ratio of the hazard rate between two groups. This is a quantified measure of how a covariate affects the survival duration of the subjects in the study.
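
In symbols, for two groups with hazard rates h₁(t) and h₂(t):

HR = h₁(t) / h₂(t)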

Picking the Right Model

There are many models that can be leveraged for survival analysis.

However, each model is unique in terms of:

  • the assumptions they make
  • the information they provide

Therefore, picking the right model comes down to identifying the one that makes the assumptions that befit the time-to-event data and enables users to obtain the desired information with the subsequent analysis.

Popular Models in Survival Analysis

Out of the many models that can be used to analyze time-to-event data, four are the most prominent: the Kaplan-Meier model, the Exponential model, the Weibull model, and the Cox Proportional-Hazards model.

In the following demonstration, each modeling approach will be explored with Python’s lifelines module. The models will be used to examine time-to-event data from a built-in dataset.

Below is a preview of the dataset:

Code Output (Created By Author)

The week column shows the survival duration and the arrest column shows whether or not the event (i.e., arrest) has occurred.
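
As a sketch, such a dataset can be loaded directly from lifelines (this walkthrough assumes the built-in Rossi recidivism dataset, which contains the week and arrest columns along with covariates such as fin, age, and prio):

from lifelines.datasets import load_rossi

# week = survival duration, arrest = event indicator (1 = arrested, 0 = censored)
df = load_rossi()
print(df.head())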

1 – Kaplan-Meier Model

The Kaplan-Meier model is arguably the most well-known model in survival analysis.

It is classified as a non-parametric model, meaning that it makes no assumptions about the underlying distribution of the data. It estimates the survival function using only the information provided.

The Kaplan-Meier model computes the survival function with the formula:

Survival function (Created By Author)
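
Here, the tᵢ are the observed event times, dᵢ is the number of events at tᵢ, and nᵢ is the number of subjects still at risk just before tᵢ:

S(t) = ∏ (1 − dᵢ / nᵢ), taking the product over all tᵢ ≤ t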

Here’s the Kaplan-Meier curve built with the time-to-event data:
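
A minimal sketch of building this curve with lifelines (using the df loaded above):

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(durations=df["week"], event_observed=df["arrest"])

# Plot the estimated survival function
kmf.plot_survival_function()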

Code Output (Created By Author)

The survival functions of different groups can be compared with the log-rank test, a non-parametric hypothesis test.

As an example, we can compare the survival functions of subjects with and without financial aid to see if financial aid affects the survival duration, using the logrank_test function.
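
A sketch of this test, assuming fin is a binary indicator for receiving financial aid:

from lifelines.statistics import logrank_test

aid = df[df["fin"] == 1]
no_aid = df[df["fin"] == 0]

results = logrank_test(
    durations_A=aid["week"],
    durations_B=no_aid["week"],
    event_observed_A=aid["arrest"],
    event_observed_B=no_aid["arrest"],
)

# A small p-value suggests the two survival functions differ
print(results.p_value)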

Code Output (Created By Author)

The benefit of the Kaplan-Meier model is that it is intuitive and easy to interpret. As it makes few underlying assumptions, this model is often used as a baseline in survival analysis.

Unfortunately, due to the model’s minimal complexity, it can be difficult to draw deeper insights from it. It cannot be used to compare risk between groups or to compute metrics like the hazard ratio.

2 – Exponential Model

The Exponential model is another popular model in survival analysis. Unlike the Kaplan-Meier model, it is a parametric model, meaning that it assumes that the data follows a specific distribution.

The survival function of the Exponential model is derived from the formula:

Survival function (Created By Author)

The hazard rate of the Exponential model is derived from the formula:

Hazard function (Created By Author)
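
In lifelines’ parameterization (many texts write the equivalent form S(t) = e^(−λt), with h(t) = λ):

S(t) = exp(−t / λ)
h(t) = 1 / λ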

The exponential model assumes that the hazard rate is constant. In other words, the risk of the event of interest occurring remains the same throughout the period of observation.

Below is the plotted survival curve from the Exponential model:
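
A sketch of fitting and plotting the model with lifelines (same df as before):

from lifelines import ExponentialFitter

exf = ExponentialFitter()
exf.fit(df["week"], event_observed=df["arrest"])

# Plot the fitted exponential survival curve
exf.plot_survival_function()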

To determine the hazard metrics, one can compute the λ value in this distribution:
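
In lifelines, the fitted parameter is exposed as the lambda_ attribute:

# In this parameterization, the constant hazard rate is 1 / lambda_
print(exf.lambda_)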

Code Output (Created By Author)

One can also obtain the hazard metrics directly through the properties offered in the lifelines package, which report the hazard rate and the cumulative hazard function as data frames:
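
For example (using the exf object fitted above):

# Hazard rate and cumulative hazard, returned as data frames indexed by the timeline
print(exf.hazard_.head())
print(exf.cumulative_hazard_.head())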

Code Output (Created By Author)

Based on the output, the hazard rate remains constant, which is in line with the nature of the Exponential model.

Overall, the Exponential model provides substantial information on the survival function and the hazard function. Moreover, it can be used to compare the hazard rates of different groups.

However, it makes the strong assumption that the hazard rate is constant at any given time, which may not suit the time-to-event data of interest.

3 – Weibull Model

Another parametric model one can consider is the Weibull model.

The survival function in a Weibull model is determined by the following formula:

Survival function (Created By Author)
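
In lifelines’ parameterization:

S(t) = exp(−(t / λ)^ρ)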

The hazard rate in a Weibull model is determined by the following formula:

Hazard rate (Created By Author)
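
In the same parameterization:

h(t) = (ρ / λ) · (t / λ)^(ρ − 1)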

The data distribution in the Weibull model is determined by two parameters: λ and ρ.

The λ parameter indicates how long it takes for 63.2% of the subjects to experience the event.

The ρ parameter indicates whether the hazard rate is increasing, decreasing, or constant. If ρ is greater than 1, the hazard rate is always increasing. If ρ is less than 1, the hazard rate is always decreasing. If ρ equals 1, the hazard rate is constant and the model reduces to the Exponential model.

In other words, the Weibull model assumes that the hazard rate changes monotonically over time. The hazard rate can always increase, always decrease, or stay the same. However, it cannot fluctuate.

Below is a plotted survival curve derived from the Weibull model:
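
A sketch of fitting and plotting the model with lifelines:

from lifelines import WeibullFitter

wf = WeibullFitter()
wf.fit(df["week"], event_observed=df["arrest"])

# Plot the fitted Weibull survival curve
wf.plot_survival_function()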

Code Output (Created By Author)

The lifelines package can be used to obtain the λ and ρ parameters:
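
For example (using the wf object fitted above):

print(wf.lambda_, wf.rho_)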

Code Output (Created By Author)

Since the ρ value is greater than 1, the hazard rate in this model is always increasing.

We can confirm this by deriving the hazard rate and cumulative hazard function.
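
A quick check with the fitted wf object:

# The hazard rate should rise with time, and the cumulative hazard should grow accordingly
print(wf.hazard_.head())
print(wf.cumulative_hazard_.head())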

Similar to the Exponential model, the Weibull model can compute many of the relevant metrics in survival analysis.

However, its results are based on the strong assumption that the hazard rate changes monotonically over time, which may not suit the time-to-event data in question.

4 – Cox Proportional-Hazards Model

Although the Exponential model and the Weibull model can evaluate covariates, they can only examine each covariate individually.

If the goal is to conduct a survival analysis that examines time-to-event data with respect to multiple variables at once, the Cox Proportional-Hazards model (also known as the Cox model) can be the preferable alternative.

The Cox model implements survival regression, a technique that regresses covariates against the survival duration, to give insight into how the covariates affect survival duration.

This model is classified as semi-parametric since it incorporates both parametric and non-parametric elements.

It is based on the following hazard rate formula:

Hazard rate formula (Created By Author)
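
With covariates x₁, …, xₙ, coefficients b₁, …, bₙ, and a baseline hazard b₀(t), the hazard takes the form:

h(t | x) = b₀(t) · exp(b₁x₁ + b₂x₂ + … + bₙxₙ)

The baseline hazard b₀(t) is the non-parametric part of the model, while the exponential term is the parametric part.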

The Cox model allows the hazard rate to fluctuate, as opposed to the parametric models where the hazard rate adheres to a fixed pattern.

The model is, however, dependent on the proportional hazards assumption. It assumes that the hazard ratios between groups remain constant. In other words, no matter how the hazard rates of the subjects change during the period of observation, the hazard rate of one group relative to the other will always stay the same.

The lifelines module allows users to visualize the baseline survival curve, which illustrates the survival function when all the covariates are set to their median value.
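
A sketch of fitting the Cox model and plotting the baseline survival curve with lifelines (the full data frame is passed in, with the duration and event columns named explicitly):

from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

# Plot the baseline survival function
cph.baseline_survival_.plot()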

Code Output (Created By Author)

In addition, the Cox model can quantify the strength of the relationships between the covariates and the survival duration of the subjects with survival regression. The print_summary method in the lifelines module displays the results of the regression.
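
For example:

cph.print_summary()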

The output provides a lot of information, but the greatest insights can be obtained with the exp(coef) column and the p column (p-value).

The p-value indicates which covariates have a significant effect on the survival duration. Based on the results, the fin, age, and prio covariates are statistically significant predictors of survival duration, given their coefficients’ small p-values.

Moreover, the hazard ratio for each covariate is equivalent to e to the power of the covariate’s coefficient (eᶜᵒᵉᶠ), which is already provided in the exp(coef) column.

As an example, the eᶜᵒᵉᶠ value of the fin covariate, which represents financial aid, is 0.68.

Mathematically, this can be interpreted as follows:

Hazard ratio of fin (Created By Author)
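
That is, for any time t:

h(t | fin = 1) ≈ 0.68 × h(t | fin = 0)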

This means that receiving financial aid changes the hazard rate by a factor of 0.68 (i.e., a 32% decrease).

Of course, the Cox model is only suitable if its assumption of proportional hazards befits the time-to-event data. To justify the use of the model, one can test this assumption using the check_assumptions method.
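
For example (using the cph model fitted above):

cph.check_assumptions(df, p_value_threshold=0.05)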

How to pick the right model

Naturally, there are many other models outside of these 4 that are worth considering, such as the Log-logistic model and the Accelerated Failure Time model (AFT). As a result, you may find yourself with too many models to pick from.

To simplify your search for the best model, consider the following factors:

  1. The objective

The first thing you should decide is what information you are looking to obtain from your survival analysis. With a specific objective in mind, it will be easier to home in on the ideal model.

2. Domain knowledge

There might be multiple models that can be used to derive the desired metrics for the time-to-event data in question. In such a case, utilizing the domain knowledge of the subject can help filter out inadequate models.

For instance, if you are conducting a survival analysis to study machine failure, you should account for the fact that machines are subject to wear and tear. Due to the wear and tear, the risk of machine failure should increase with time. For such a case, the Exponential model, which assumes that the hazard rate is always constant, is not appropriate and should be removed from consideration.

3. Model performance

If you find that there are multiple models that help achieve the aim of the study and conform to the domain knowledge of the subject, you can identify the ideal model by gauging the models’ performance with evaluation metrics.

One suitable metric is the Akaike Information Criterion (AIC), which is an estimate of the prediction error of the model. A lower AIC score corresponds to a better model.

As an example, if we are choosing between the Exponential model and the Weibull model for analyzing the time-to-event data, we can determine the superior model by computing the AIC scores for both models.
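
A sketch of this comparison, using the AIC_ attribute of the fitted lifelines models:

from lifelines import ExponentialFitter, WeibullFitter

exf = ExponentialFitter().fit(df["week"], event_observed=df["arrest"])
wf = WeibullFitter().fit(df["week"], event_observed=df["arrest"])

# Lower AIC indicates the better-fitting model
print("Exponential AIC:", exf.AIC_)
print("Weibull AIC:", wf.AIC_)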

Code Output (Created By Author)

Based on the AIC metric, the Weibull model is a better fit for analyzing the time-to-event data.

Conclusion

Photo by Prateek Katyal on Unsplash

To sum it up, every model is unique in terms of the assumptions it makes and the information it provides.

If you understand the nature of the time-to-event data of interest and know what you wish to obtain from the survival analysis, finding the ideal model should be a relatively straightforward task.

If you are familiar with the methodology of survival analysis but want to know why it can trump other types of analysis on time-to-event data, check out the following article:

Survival Analysis: A Brief Introduction

I wish you the best of luck in your Data Science endeavors!

