
Selecting a suitable model for the time-to-event data of interest is an important step in survival analysis.
Unfortunately, with the sheer number of models available for survival analysis, it is easy to become inundated by all of the available information.
While it might be tempting to meticulously study every modeling approach, drowning yourself in statistical jargon will only hamper efforts to properly find the best model.
Here, we explore a few popular models in survival analysis and then discuss the factors that should be considered when identifying the best model for the time-to-event data of interest.
Key Terminology
Before covering any of the models, it is important to become familiar with some of the measures that are of interest in survival analysis.
1. Survival function
The survival function is denoted by S(t), where:

S(t) = P(T > t)

It shows the probability of a subject surviving (i.e., not experiencing the event) beyond time t. This is a non-increasing function.
2. Hazard function
The hazard function, also known as the hazard rate, is denoted by h(t), where:

h(t) = lim (Δt → 0) P(t ≤ T < t + Δt | T ≥ t) / Δt

It shows the instantaneous rate at which the event occurs at time t, given that the event has not already occurred. The hazard function can be derived from the survival function, and vice versa.
3. Cumulative hazard function
The cumulative hazard function, denoted by H(t), is a non-decreasing function that shows the total accumulated risk of the event occurring by time t. In mathematical terms, it is the area under the hazard function: H(t) = ∫₀ᵗ h(u) du.
4. Hazard Ratio
The hazard ratio is the ratio of the hazard rate between two groups. This is a quantified measure of how a covariate affects the survival duration of the subjects in the study.
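To make these relationships concrete, here is a minimal pure-Python sketch. It uses a constant hazard purely as a hypothetical example, and shows that S(t) = exp(−H(t)) and that a hazard ratio is just the ratio of two groups' hazard rates:

```python
import math

# Hypothetical constant hazard rates for two groups (per unit time)
h_control, h_treated = 0.10, 0.05

t = 5.0  # a time point of interest

# Cumulative hazard: area under the hazard function (here simply h * t)
H_control = h_control * t

# The survival function follows from the cumulative hazard: S(t) = exp(-H(t))
S_control = math.exp(-H_control)

# Hazard ratio: the ratio of the two groups' hazard rates
hazard_ratio = h_treated / h_control

print(S_control)     # probability that a control subject survives beyond t = 5
print(hazard_ratio)  # 0.5: the treated group's risk is half the control group's
```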
Picking the Right Model
There are many models that can be leveraged for survival analysis.
However, each model is unique in terms of:
- the assumptions they make
- the information they provide
Therefore, picking the right model comes down to identifying the one that makes the assumptions that befit the time-to-event data and enables users to obtain the desired information with the subsequent analysis.
Popular Models in Survival Analysis
Out of the many models that can be used to analyze time-to-event data, four are most prominent: the Kaplan-Meier model, the Exponential model, the Weibull model, and the Cox Proportional-Hazards model.
In the following demonstration, each modeling approach will be explored with Python’s lifelines module. The models will be used to examine time-to-event data from a built-in dataset.
Below is a preview of the dataset:

The week column shows the survival duration, and the arrest column shows whether or not the event (i.e., arrest) has occurred.
1 – Kaplan-Meier Model
The Kaplan-Meier model is arguably the most well-known model in survival analysis.
It is classified as a non-parametric model, meaning that it does not assume the distribution of the data. It generates a survival function with only the information provided.
The Kaplan-Meier model computes the survival function with the product-limit formula:

Ŝ(t) = ∏ (over tᵢ ≤ t) (1 − dᵢ / nᵢ)

where nᵢ is the number of subjects at risk just before time tᵢ and dᵢ is the number of events observed at time tᵢ.
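To make the product-limit formula concrete, here is a minimal hand-rolled sketch of the estimator on a tiny made-up sample, where each subject has a duration and a flag for whether the event was observed (0 = censored):

```python
# Tiny hypothetical sample: (duration, observed), observed=0 means censored
data = [(2, 1), (3, 1), (3, 0), (5, 1), (7, 0), (8, 1)]

def kaplan_meier(data):
    """Product-limit estimate: S(t) = product over t_i <= t of (1 - d_i / n_i)."""
    survival = {}
    s = 1.0
    at_risk = len(data)  # n_i: subjects still at risk
    for t in sorted({d for d, _ in data}):
        deaths = sum(1 for d, e in data if d == t and e == 1)   # d_i
        removed = sum(1 for d, _ in data if d == t)             # events + censored
        if deaths > 0:
            s *= 1 - deaths / at_risk
        survival[t] = s
        at_risk -= removed
    return survival

S = kaplan_meier(data)
print(S)  # step-function values of the estimated survival curve
```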
Here’s the Kaplan-Meier curve built with the time-to-event data:

The survival functions of different groups can be compared with the log-rank test, a non-parametric hypothesis test.
As an example, we can compare the survival functions of subjects with and without financial aid using the logrank_test function, to see whether financial aid affects the survival duration.

The benefit of the Kaplan-Meier model is that it is intuitive and easy to interpret. As it makes few underlying assumptions, this model is often used as a baseline in survival analysis.
Unfortunately, due to the model’s minimal complexity, it can be difficult to draw meaningful insights from it. It cannot be used to quantify the risk between groups or compute metrics like the hazard ratio.
2 – Exponential Model
The Exponential model is another popular model in survival analysis. Unlike the Kaplan Meier model, it is a parametric model, meaning that it assumes that the data fits a specific distribution.
The survival function of the Exponential model is derived from the formula:

S(t) = e^(−t/λ)

The hazard rate of the Exponential model is derived from the formula:

h(t) = 1/λ

(This follows lifelines’ parametrization, in which λ is a scale parameter.)
The exponential model assumes that the hazard rate is constant. In other words, the risk of the event of interest occurring remains the same throughout the period of observation.
Below is the plotted survival curve from the Exponential model:
To determine the hazard metrics, one can compute the λ value in this distribution:

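As a sketch of what the fitting amounts to under this parametrization, the maximum-likelihood estimate of λ for right-censored data is simply the total observed time divided by the number of observed events. The numbers below are made up:

```python
# Hypothetical right-censored sample: (duration, observed event flag)
data = [(4, 1), (6, 0), (7, 1), (10, 1), (12, 0)]

total_time = sum(t for t, _ in data)  # every subject contributes its full time
n_events = sum(e for _, e in data)    # only observed events count

lam = total_time / n_events           # MLE of the scale parameter lambda
hazard_rate = 1 / lam                 # constant hazard under this model

print(lam, hazard_rate)
```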
One can also directly get the hazard metrics with the properties offered in the lifelines package, which reports the hazard rate and the cumulative hazard function in a data frame:

Based on the output, the hazard rate remains constant, which is in line with the nature of the Exponential model.
Overall, the Exponential model provides substantial information on the survival function and the hazard function. Moreover, it can be used to compare the hazard rates of different groups.
However, it makes the strong assumption that the hazard rate is constant at any given time, which may not suit the time-to-event data of interest.
3 – Weibull Model
Another parametric model one can consider is the Weibull model.
The survival function of the Weibull model is determined by the following formula:

S(t) = e^(−(t/λ)^ρ)

The hazard rate of the Weibull model is determined by the following formula:

h(t) = (ρ/λ) · (t/λ)^(ρ−1)
The data distribution in the Weibull model is determined by the two parameters: λ and ρ.
The λ parameter indicates how long it takes for 63.2% of the subjects to experience the event.
The ρ parameter indicates whether the hazard rate is increasing, decreasing, or constant over time. If ρ is greater than 1, the hazard rate increases over time; if ρ is less than 1, it decreases over time; if ρ equals 1, it is constant and the model reduces to the Exponential model.
In other words, the Weibull model assumes that the hazard rate is monotonic: it can always increase, always decrease, or stay the same, but it cannot fluctuate.
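A quick numeric check of the hazard formula h(t) = (ρ/λ)(t/λ)^(ρ−1), with made-up parameter values, shows this monotonic behavior:

```python
# Weibull hazard: h(t) = (rho / lam) * (t / lam) ** (rho - 1)
def weibull_hazard(t, lam, rho):
    return (rho / lam) * (t / lam) ** (rho - 1)

times = [1, 2, 5, 10]

# rho > 1: hazard rises over time (e.g., wear and tear)
rising = [weibull_hazard(t, lam=10, rho=1.5) for t in times]

# rho < 1: hazard falls over time (e.g., early failures)
falling = [weibull_hazard(t, lam=10, rho=0.5) for t in times]

print(rising)
print(falling)
```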
Below is a plotted survival curve derived from the Weibull model:

The lifelines package can be used to obtain the λ and ρ parameters:

Since the ρ value is greater than 1, the hazard rate in this model is always increasing.
We can confirm this by deriving the hazard rate and cumulative hazard function.

Similar to the Exponential model, the Weibull model is capable of computing many of the relevant metrics in survival analysis.
However, its results rest on the strong assumption that the hazard rate changes monotonically over time, which may not suit the time-to-event data in question.
4 – Cox Proportional-Hazards Model
Although the Exponential model and the Weibull model can be used to evaluate covariates, they can only examine one covariate at a time (e.g., by fitting a separate model to each group).
If the goal is to conduct a survival analysis that examines time-to-event data with respect to multiple variables at once, the Cox Proportional-Hazards model (also known as the Cox model) can be the preferable alternative.
The Cox model implements survival regression, a technique that regresses covariates against the survival duration, to give insight into how the covariates affect survival duration.
This model is classified as semi-parametric since it incorporates both parametric and non-parametric elements.
It is based on the following hazard rate formula:

h(t | x) = h₀(t) · e^(β₁x₁ + β₂x₂ + … + βₚxₚ)

where h₀(t) is the non-parametric baseline hazard and the βᵢ are the coefficients of the covariates.
The Cox model allows the hazard rate to fluctuate, as opposed to the parametric models where the hazard rate adheres to a fixed pattern.
The model is, however, dependent on the proportional hazards assumption. It assumes that the hazard ratios between groups remain constant. In other words, no matter how the hazard rates of the subjects change during the period of observation, the hazard rate of one group relative to the other will always stay the same.
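A small numeric sketch of the proportional hazards assumption: whatever (hypothetical, fluctuating) baseline hazard we pick, the ratio between the two groups' hazards stays fixed at e^β:

```python
import math

beta = -0.38  # hypothetical coefficient for a binary covariate

# An arbitrary, fluctuating baseline hazard (the non-parametric part)
def baseline_hazard(t):
    return 0.05 + 0.02 * math.sin(t)

# Cox hazard: h(t | x) = h0(t) * exp(beta * x)
def hazard(t, x):
    return baseline_hazard(t) * math.exp(beta * x)

# The hazard ratio between x = 1 and x = 0 is constant over time
ratios = [hazard(t, 1) / hazard(t, 0) for t in (1.0, 5.0, 20.0)]
print(ratios)  # each ratio equals exp(beta), about 0.68
```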
The lifelines module allows users to visualize the baseline survival curve, which illustrates the survival function when all the covariates are set to their median value.

In addition, the Cox model can quantify the strength of the relationships between the covariates and the survival duration of the subjects with survival regression. The summary attribute in the lifelines module shows the results of the regression as a data frame.

The output provides a lot of information, but the greatest insights can be obtained from the exp(coef) column and the p column (the p-value).
The p-value indicates which covariates have a significant effect on the survival duration. Based on the results, the fin, age, and prio covariates are statistically significant predictors of survival duration, given their coefficients’ small p-values.
Moreover, the hazard ratio for each covariate is equivalent to e raised to the power of the covariate’s coefficient (eᶜᵒᵉᶠ), which is already provided in the exp(coef) column.
As an example, the eᶜᵒᵉᶠ value of the fin covariate, which represents financial aid, is 0.68.
Mathematically, this can be interpreted as the following:

h(t | fin = 1) / h(t | fin = 0) = eᶜᵒᵉᶠ ≈ 0.68
This means that receiving financial aid changes the hazard rate by a factor of 0.68 (i.e., a 32% decrease).
Of course, the Cox model is only suitable if its assumption of proportional hazards befits the time-to-event data. To justify the use of the model, one can test this assumption with the check_assumptions method.
How to Pick the Right Model
Naturally, there are many other models outside of these four that are worth considering, such as the Log-logistic model and the Accelerated Failure Time (AFT) model. As a result, you may find yourself with too many models to pick from.
To simplify your search for the best model, consider the following factors:
1. The objective
The first thing you should decide is what information you are looking to obtain from your survival analysis. With a specific objective in mind, it will be easier to home in on the ideal model.
2. Domain knowledge
There might be multiple models that can be used to derive the desired metrics for the time-to-event data in question. In such a case, utilizing the domain knowledge of the subject can help filter out inadequate models.
For instance, if you are conducting a survival analysis to study machine failure, you should account for the fact that machines are subject to wear and tear. Due to the wear and tear, the risk of machine failure should increase with time. For such a case, the Exponential model, which assumes that the hazard rate is always constant, is not appropriate and should be removed from consideration.
3. Model performance
If you find that there are multiple models that help achieve the aim of the study and conform to the domain knowledge of the subject, you can identify the ideal model by gauging the models’ performance with evaluation metrics.
One suitable metric is the Akaike Information Criterion (AIC), which is an estimate of the prediction error of the model. A lower AIC score corresponds to a better model.
As an example, if we are choosing between the Exponential model and the Weibull model for analyzing the time-to-event data, we can determine the superior model by computing the AIC scores for both models.

Based on the AIC metric, the Weibull model is a better fit for analyzing the time-to-event data.
Conclusion

To sum it up, every model is unique in terms of the assumptions it makes and the information it provides.
If you understand the nature of the time-to-event data of interest and know what you wish to obtain from the survival analysis, finding the ideal model should be a relatively straightforward task.
I wish you the best of luck in your Data Science endeavors!