Introduction to Survival Analysis

In Cows using R

Dr. Marc Jacobs
Towards Data Science
6 min read · Oct 8, 2021


My background is in psychology and the life sciences, which means that sooner or later you will have to deal with survival analysis. Especially during my PhD, in which I evaluated hundreds of papers on cancer interventions, the number one outcome of interest was survival: overall survival, disease-free survival, recurrence-free survival.

In agriculture it was a bit different: I seldom saw survival analyses being performed, although there is a lot of room for them. In essence, survival analysis is a sexier term for event-analysis, and that is really all it is: the analysis of events.

So, to be able to start performing an event-analysis — or survival analysis — all you need are two variables:

  1. A numeric metric of time such as number of days.
  2. A categorical indicator of the event, often coded as 0 or 1.

Combined, these two variables can indicate:

  1. If and when the subject of interest has met the event of interest (can be absolutely anything!) or not.
  2. If and when the subject of interest entered the study.
  3. If and when the subject of interest left the study.

And this is where we enter the realm of censoring, which is perhaps the trickiest part of event-analysis to comprehend. When the event is observed during the observation window (from the start to the end of the study), we have both the time variable and the event variable for that subject of interest. However, for other subjects we may not know the exact start date and/or we may never observe the event. These issues lead to left and right censoring, respectively. Right censoring is much more common, and for brevity's sake we will leave interval censoring aside for now.

Just keep in mind that when you want to conduct event-analysis, your dataset needs to have columns for event-time (or event-free-time) and event. The algorithm will figure out the censoring for you!
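As a minimal sketch of what this looks like in R: the Surv() function from the survival package pairs the time column with the event column and marks censored records with a plus sign. The data below are made up purely for illustration.

```r
# Toy example: Surv() pairs follow-up time with the event indicator;
# a trailing "+" in the printed object marks a right-censored record.
library(survival)

toy <- data.frame(
  days  = c(120, 365, 80, 365),  # follow-up time in days
  event = c(1,   0,   1,  0)     # 1 = event observed, 0 = censored
)

Surv(toy$days, toy$event)
# prints: 120  365+  80  365+
```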

Let's load the necessary packages to perform survival analysis.
And then conduct the necessary data wrangling, including standardizing the covariates.
Here you see the dataset used to perform the survival analysis. For each cow, we have a SurvivalTime metric and a Vstatus (event) metric. The rest are covariates of interest and potential worth.
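The original code is embedded as screenshots in the post, so what follows is only a rough sketch of what the setup could look like. The file name cows.csv and any covariates beyond SurvivalTime, Vstatus, TREAT, INF, and Ca4 are assumptions for illustration.

```r
# A sketch of the setup: packages, data import, and standardized covariates.
library(survival)    # Surv(), survfit(), coxph(), cox.zph()
library(survminer)   # ggsurvplot(), ggcoxzph(), ggcoxdiagnostics(), ggforest()
library(dplyr)

cows <- read.csv("cows.csv") %>%        # hypothetical file name
  mutate(
    TREAT = factor(TREAT),
    INF   = factor(INF),
    Ca4   = as.numeric(scale(Ca4))      # standardize continuous covariates
  )

str(cows)   # SurvivalTime, Vstatus, and the covariates of interest
```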

Important to note is that in event-analysis it is the number of events that counts, not the number of observations nor the amount of time spent observing. If there are no events of interest, the analysis makes no sense at all!

A first batch of plots to look for differences in survival times across potential factors of interest.
Events start happening after some time, and there seem to be differences between these variables. Image by author.
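Not the author's exact plotting code, but a sketch of this kind of exploratory plot: survival time per level of a factor, with the event status mapped to colour.

```r
# Exploratory plot: survival time by treatment, coloured by event status.
library(ggplot2)

ggplot(cows, aes(x = TREAT, y = SurvivalTime, colour = factor(Vstatus))) +
  geom_boxplot(alpha = 0.2, outlier.shape = NA) +
  geom_jitter(width = 0.15, height = 0) +
  labs(colour = "Event (Vstatus)")
```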

Now, the most straightforward way to start analyzing events is by using the Kaplan-Meier method. It is a non-parametric method, which basically means it does not really care about any underlying model assumptions. As long as the dataset has what it needs (observed times and events), it is good to go. Here, we start asking for life tables and a simple event-analysis. Combined, they will give you the overall probability of surviving past a particular time-point, based on the time-event combinations observed.

Life tables and event curves to look at the probability of surviving past particular time-points. Image by author.
After one year, nine events had happened, leading to a survival probability of 64% [48%, 86%].
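A minimal Kaplan-Meier sketch that produces the life table and the one-year estimate; the 365 assumes SurvivalTime is measured in days.

```r
# Overall Kaplan-Meier fit: life table plus the survival probability at one year.
km_all <- survfit(Surv(SurvivalTime, Vstatus) ~ 1, data = cows)

summary(km_all)                # full life table
summary(km_all, times = 365)   # survival probability after one year, with 95% CI
```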

The life-table is a good way to assess whether there are enough events to even conduct an event-analysis. If so, you can move on to the analysis of interest, which is the comparison of event curves between treatments. In a Kaplan-Meier analysis, comparisons are made with the log-rank test, another non-parametric method.
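A sketch of that comparison, assuming TREAT is the treatment factor: survdiff() runs the log-rank test and ggsurvplot() draws the curves with the p-value added.

```r
# Kaplan-Meier curves per treatment plus the log-rank test.
km_treat <- survfit(Surv(SurvivalTime, Vstatus) ~ TREAT, data = cows)

survdiff(Surv(SurvivalTime, Vstatus) ~ TREAT, data = cows)   # log-rank test

ggsurvplot(km_treat, data = cows,
           pval = TRUE, conf.int = TRUE, risk.table = TRUE)
```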

There seems to be a statistically significant difference between treatments, but the huge confidence intervals for treatment 2 do not inspire great trust in the worth of the p-value (which in general deserves a lot of distrust anyway). Image by author.

Event-analysis becomes really interesting once you leave Kaplan-Meier behind and go for Cox regression, also called the proportional hazards model. Cox regression is a semi-parametric method able to include multiple predictors. Semi-parametric means it does care about assumptions, but it does not need to specify a baseline hazard (the model's "intercept"). The problem with the Cox regression is that it is quite susceptible to breaking its assumptions, especially the one that gave it its name: the assumption of proportional hazards.

But we have not even talked about what a hazard is yet. A hazard is, in general, a risk; more specifically, the instantaneous risk of moving from non-event to event at ANY given time-point, given that the event has not happened yet. The proportional hazards assumption then states that the ratio of these hazards between groups stays the same regardless of time. THIS is what makes the Cox regression almost unrealistic to use, but let's go for it anyhow. In a future post, I will discuss ways to apply parametric models and deal with this assumption.

The code to conduct a Cox regression is not difficult.
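Indeed, a multivariable fit is essentially a one-liner with coxph(). The names X4 and X5 below are placeholders for the remaining two predictors, which are not named in the text.

```r
# Cox proportional hazards model with multiple predictors.
cox_fit <- coxph(Surv(SurvivalTime, Vstatus) ~ TREAT + INF + Ca4 + X4 + X5,
                 data = cows)

summary(cox_fit)   # coef, exp(coef), concordance, LR / Wald / score tests
```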

The summary below is regression output, no doubt about it. We see that we have 43 subjects of interest and have observed 22 events. For each factor of interest we get several pieces of output:

  1. coef: the estimated coefficient on the log-hazard scale.
  2. exp(coef): the hazard ratio (HR). Here, 4.9 means that, compared to TREAT1, TREAT2 multiplies the hazard by almost five (an increase of roughly 390%).
  3. exp(-coef): the inverse hazard ratio, i.e. the HR for TREAT1 relative to TREAT2.

The concordance statistic is the event-analysis equivalent of an ROC statistic. The likelihood-ratio, Wald, and score tests indicate that this model is better than a model with no predictors included. This agrees with the previous exploratory plots.

Request for Variance Inflation Factors (VIF) to detect heavy cross-correlation / mutual dependency, which often leads to exploded standard errors or non-convergence.
Hazard Ratios, Confidence Intervals and P-values. Image by author.
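The post shows these as screenshots; one way to get comparable output is sketched below, using rms for the VIFs and survminer's ggforest() for the hazard-ratio plot (continuing from the coxph fit above, with X4 and X5 still as placeholders).

```r
# VIFs via an rms refit, and a forest plot of hazard ratios with CIs and p-values.
library(rms)

cph_fit <- cph(Surv(SurvivalTime, Vstatus) ~ TREAT + INF + Ca4 + X4 + X5,
               data = cows, x = TRUE, y = TRUE)
vif(cph_fit)                      # variance inflation factors per coefficient

ggforest(cox_fit, data = cows)    # hazard ratios, confidence intervals, p-values
```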

So, up until now everything seemed very straightforward, which is actually the biggest problem in survival analysis: it really is not that difficult. But we also said that by moving from a non-parametric to a semi-parametric method, we have some assumptions to look into, which we will do now. The most important assumption to check is the proportionality of hazards (PH) assumption. We will do this by looking at Schoenfeld residuals, and by plotting cumulative log-log curves per predictor.
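A sketch of both checks: cox.zph() tests and plots the scaled Schoenfeld residuals, and plot.survfit() with fun = "cloglog" draws the cumulative log-log curves (shown here for TREAT only).

```r
# Proportional hazards checks.
ph_check <- cox.zph(cox_fit)
ph_check              # per-covariate and global tests of the PH assumption
ggcoxzph(ph_check)    # scaled Schoenfeld residuals over time

# Cumulative log-log curves; roughly parallel curves support the PH assumption.
plot(survfit(Surv(SurvivalTime, Vstatus) ~ TREAT, data = cows),
     fun = "cloglog", col = 1:2,
     xlab = "Time (log scale)", ylab = "log(-log(S(t)))")
```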

What you are looking for are:

  1. A roughly constant (proportional) distance between the cumulative log-log curves.
  2. Homogeneity of the Schoenfeld residuals over time.

The PH assumption seems to be upheld for TREAT, INF, and Ca4, but it troubles me for the other two. The Schoenfeld residuals look clean. Image by author.

Next up, we will take a look at the linearity assumption by plotting Martingale residuals.
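A common way to do this, sketched here with Ca4 as the example of a continuous covariate: plot martingale residuals from a null Cox model against the covariate and look for a roughly linear trend.

```r
# Martingale residuals from an intercept-only Cox model vs. a continuous covariate.
res_mart <- residuals(coxph(Surv(SurvivalTime, Vstatus) ~ 1, data = cows),
                      type = "martingale")

plot(cows$Ca4, res_mart,
     xlab = "Ca4 (standardized)", ylab = "Martingale residuals")
lines(lowess(cows$Ca4, res_mart), col = "red")   # smoother to judge linearity
```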

Nothing strange to see here. Image by author.

Then, we will take a look at influential subjects of interest.
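One way to get such influence plots, as a sketch: dfbeta values show how much each cow shifts each coefficient when it is left out.

```r
# Influence diagnostics: one panel per coefficient, one point per cow.
ggcoxdiagnostics(cox_fit, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())
```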

Some observations seem to be influential, but with just 43 subjects of interest, I am hard-pressed to delete any subject that does not show biologically incompatible data. Image by author.

The Cox regression and the assessment of its assumptions do seem to hint at two predictors that may cause problems. What I will do now is use L1 and L2 penalized methods to see how much they actually contribute to the model. The code is already five years old, so I am sure there are more automated ways of doing this by now, but it still runs!

Quite some code to get the penalization methods going.
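That original code is not shown here; as a present-day sketch, glmnet fits both the L1 (lasso) and L2 (ridge) penalized Cox models, with the penalty strength chosen by cross-validation (X4 and X5 remain placeholder names).

```r
# Penalized Cox regression with glmnet: lasso (alpha = 1) and ridge (alpha = 0).
library(glmnet)

x <- model.matrix(~ TREAT + INF + Ca4 + X4 + X5, data = cows)[, -1]
y <- cbind(time = cows$SurvivalTime, status = cows$Vstatus)

set.seed(1)
cv_lasso <- cv.glmnet(x, y, family = "cox", alpha = 1)
cv_ridge <- cv.glmnet(x, y, family = "cox", alpha = 0)

coef(cv_lasso, s = "lambda.1se")   # predictors that survive the lasso penalty
plot(cv_lasso)                     # cross-validated partial likelihood deviance
```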
Penalization hints at three predictors of interest. Image by author.
A second method agrees with the first method. Image by author.
The coefficients not deemed of interest turn out to also be the ones that did not meet the PH assumption. That makes sense: any predictor that makes the model unstable will be booted out by resampling methods focused on limiting the possibility of overfitting.

Last but not least, let's go a bit crazy, refit the Cox regression with just the three predictors of interest, and resample the analysis multiple times. This way, we can get a good glimpse of the stability of the hazard ratios, above and beyond the profiled confidence intervals we got from the Cox regression itself (which, by the way, already showed some wide intervals).

Lots of code to get going, but it works. In the end, I asked R to resample the Cox regression and store all its output in a dataframe suitable for plotting.
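A sketch of what that resampling can look like with base R: refit the reduced model on bootstrap samples of the cows and collect the hazard ratios. For simplicity, this sketch assumes every resample contains all factor levels.

```r
# Bootstrap the reduced Cox model and store the hazard ratios per resample.
set.seed(42)
boot_hr <- do.call(rbind, lapply(1:1000, function(i) {
  d   <- cows[sample(nrow(cows), replace = TRUE), ]
  fit <- coxph(Surv(SurvivalTime, Vstatus) ~ TREAT + INF + Ca4, data = d)
  data.frame(t(exp(coef(fit))))          # hazard ratios for this resample
}))

# Distribution of the bootstrapped hazard ratios per coefficient.
ggplot(stack(boot_hr), aes(values)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ ind, scales = "free_x") +
  labs(x = "Bootstrapped hazard ratio")
```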
Skewed distributions of the hazard ratios based on bootstrapping. Hence, the point estimate of each HR should be interpreted with great care. Image by author.
Jack-knife estimates looking at the individual contributions of the subjects of interest. You want the plots to follow the horizontal lines. Image by author.

I hope this post helps you on your road to event-analysis, covering Kaplan-Meier, Cox regression, penalized regression, and bootstrapping techniques to assess the worth of your dataset.

Remember, it is about the events, not the number of observations!

Enjoy!


Scientist. Builder of models, and enthusiast of statistics, research, epidemiology, probability, and simulations for 10+ years.