Survival analysis is a tried and trusted method for obtaining insights from time-to-event data.
Unfortunately, despite spending a significant amount of time learning about survival analysis, I neglected the technique in my own projects even when it was a fitting choice.
This is most likely because many of the resources I used to learn about the technique went straight to explaining the methodology without spending enough time emphasizing the nature of time-to-event data.
I wish to save others from making the same mistake by providing an overview of survival analysis that focuses on the features of time-to-event data that make this technique indispensable.
Survival Analysis
Survival analysis is a branch of statistical methods that focuses on analyzing time-to-event variables. The name stems from its original use for examining mortality rates from illness and disease.
That being said, its usefulness extends well beyond healthcare and into sectors like manufacturing and finance. Businesses today leverage this technique to evaluate a variety of events, such as machine failure, customer churn, and loan repayment.
The word "survived" in this context describes subjects in the study that did not experience the event in question. For instance, in a study of machine failure, a machine that "survived" is one that is still functioning.
With this technique, you can answer questions like:
- What is the probability of a subject surviving after X years?
- Do different groups of subjects exhibit different rates of survival?
- Which features influence the survivability of subjects?
You may look at the list of questions and think: there’s nothing special about this analysis. One could answer the same questions with traditional statistical approaches, right?
Well, not necessarily.
Why Use Survival Analysis?
The benefits of survival analysis lie in the nature of time-to-event variables.
Due to the way that time-to-event data is procured, the resulting dataset usually contains censoring.
Censoring is a phenomenon that occurs when a study fails to accurately capture the survival time of a subject.
There are three types of censoring in total: right censoring, left censoring, and interval censoring.
Right censoring occurs when the true survival time of a subject is greater than the recorded survival time. Left censoring occurs when the true survival time is less than the recorded survival time. Interval censoring occurs when the true survival time lies within a certain range.
Censoring is the product of the inherent flaws of a data collection procedure. Depending on the circumstance, censoring can be difficult (if not impossible) to prevent, which is why it is so prevalent in time-to-event data.
The Risk of Censoring
Censoring in datasets warrants attention for a few reasons.
1. Censored data skews results
Since censored data does not properly capture the survival time of subjects, its inclusion can produce misleading values.
As an example, suppose you are conducting a study in which you examine machine failures. In the study, you observed 5 machines and recorded the amount of time it took for them to reach failure. However, due to the constraints of the experimental design, you have to stop the observation even though some machines are still operational. In the end, you obtain the following durations from the 5 machines:
Machine 1: 5 hours (Failed)
Machine 2: 10 hours (Operational)
Machine 3: 12 hours (Failed)
Machine 4: 6 hours (Operational)
Machine 5: 15 hours (Operational)
Since 3 out of the 5 machines have not experienced failure and are still operational, the recorded durations are less than the real durations of these machines. This is an example of right censoring.
If you calculated how long a machine lasts on average with basic aggregation, you would derive a value that underestimates the real average duration.
Similarly, if you utilized machine learning and built a regression model to predict how long a machine will function, the model would be trained with inaccurate durations and, as a result, generate unreliable predictions.
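To make the underestimation concrete, here is a minimal sketch in plain Python using the hypothetical durations from the example above:

```python
# Recorded durations (hours) and whether each machine actually failed.
# A False flag means the machine was still operational when observation
# stopped, so its true duration is longer than the recorded one.
durations = [5, 10, 12, 6, 15]
failed = [True, False, True, False, False]

# Naive average that ignores censoring
naive_avg = sum(durations) / len(durations)
print(naive_avg)  # 9.6 hours

# Three of the five durations are only lower bounds, not true failure
# times, so the real average failure time must exceed 9.6 hours.
```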
For these reasons, applying traditional data science techniques to censored data can produce unreliable results.
2. Censored data is difficult to detect
Unfortunately, unlike missing data, which is very easy to spot, censored data is obscure and can easily evade detection.
After all, raw data usually doesn’t provide a feature that directly labels censored or non-censored data. This piece of information usually has to be procured from existing data.
Simply recognizing the censored records in a dataset is a significant step. Even when performing basic aggregation on such data, factoring in the presence of censoring leads to more rational conclusions.
I’ve made the mistake of examining time-series data containing time-to-event variables without realizing that some of the subjects had not even experienced the event in question.
3. Censored data cannot be remedied
Even if users recognize censored records in their data, they can sabotage their efforts by addressing them with the wrong approach.
It might be tempting to treat censored data like missing data and simply remove it from the dataset.
Unfortunately, while including unaccounted-for censored data is detrimental to a study, removing censored data is not a viable solution either. Since subjects are typically not censored at random (e.g., machines that don't reach failure may have longer survival times), excluding them from the analysis will undoubtedly beget biased results.
What Survival Analysis Does in a Nutshell
The appeal of survival analysis lies in its ability to handle censoring.
Although it does not directly fill in the missing durations or omit the irrelevant durations, it is capable of accounting for censored data.
Survival analysis accomplishes this by modeling time-to-event data with a probability function called the survival function.
In mathematical terms, the survival function can be represented by the following formula:

S(t) = Pr(T > t)

The survival function, denoted by S(t), represents the probability that a subject survives past time t, where T is the subject's event time.
By representing the time-to-event data with a model, users can make predictions on the survival of the subjects or identify factors that affect survivability.
Conducting a survival analysis requires two key pieces of information:
- If the event has occurred for each subject (censored or not censored)
- The survival duration of each subject
Case Study
A case study is the best way to show how survival analysis performs with censored data.
This demonstration will utilize the lifelines package, the go-to module for conducting survival analysis in Python.
It provides a built-in dataset, from a study on recidivism, that records the number of weeks it takes for released convicts to be arrested.

For each row, we can see how many weeks the subject was under observation, whether the subject was arrested, and whether the subject received financial aid.
Here, the "arrest" feature labels the records as censored or uncensored. Subjects that have experienced the event (i.e., been arrested) are assigned the value 1, while subjects that have not experienced the event are assigned the value 0.
We can use the Kaplan-Meier estimator, a popular tool in survival analysis, to estimate the survival function.
For convenience, we can visualize the survival function for this data by building a survival curve.

With the survival function, we can easily determine the estimated survival rate of the subjects at any given point in time.
For instance, we can derive the survival rate of the subjects after 30 weeks.

Based on the results, an estimated 86.11% of the subjects will not have been arrested after 30 weeks.
In addition, we can determine if providing financial aid to the subjects influences their chances of surviving.
We can first visualize the contrast in survival rates between the two groups by plotting the survival curve for each group.

From the visualization alone, it seems that those with financial aid have a better chance of not getting arrested than those without financial aid. However, it isn’t clear if this difference is statistically significant.
This observation can be validated with a hypothesis test. We can use the log-rank test, a hypothesis test that compares the survival distribution of two groups.
Let’s conduct the test and see the results.

Given that the p-value is greater than 0.05, we do not have sufficient evidence to conclude that financial aid affects the survivability of the subjects.
Conclusion

Hopefully, in addition to briefly explaining the "how" behind survival analysis, I have also explained the "why".
I know that many are eager to dive straight into the ins and outs of survival analysis (as I was), but I believe that understanding the nature of time-to-event data, which often comprises censored records, is just as important.
Survival analysis is capable of dealing with censoring, which is what makes it so practical. However, if you fail to identify the censored records in your data, you may end up sticking with other statistical methods that cannot properly handle them.
Only when you identify the potential shortcomings of your time-to-event data can you then recognize the need for survival analysis and take advantage of its various modeling methods.
I wish you the best of luck in your data science endeavors!