Understanding Kaplan-Meier Estimator (Survival Analysis)

An introduction to one of the Survival Analysis techniques.

Pratik Kumar
Towards Data Science
Jun 20, 2020

What is Survival Analysis?

Survival analysis is a set of statistical methods for data in which the outcome variable of interest is the time until an event occurs. The event could be death, disease incidence, customer churn, recovery, etc.

It is used to estimate the lifespan of a particular population under study.

It is also called ‘Time to Event’ analysis, as the goal is to estimate the time it takes for an individual or a group of individuals to experience an event of interest. Survival analysis is used to compare groups when time is an important factor. Other methods, such as simple linear regression, can compare groups, but they do not factor in time. Survival analysis relies on two pieces of information for each participant: first, whether or not the participant experiences the event of interest during the study period; second, the follow-up time for that participant.

Survival analysis consists of the following parts:

  1. Survival Data
  2. Survival Function
  3. Analysis Method

What exactly is the Kaplan-Meier Estimator?

Kaplan-Meier analysis measures the survival time from a defined starting point to the time of death, failure, or another significant event. The Kaplan-Meier estimator, also known as the product-limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data.

For example, it can be used to calculate:

  • How long people remain unemployed after a job loss.
  • How long it takes couples undergoing fertility treatment to get pregnant.
  • The time-to-failure of machine parts.
  • Survival time after treatment (in medical practice).

A graph of the Kaplan-Meier estimator is a series of decreasing horizontal steps which, given a large enough sample size, approaches the true survival function for that population. The Kaplan-Meier estimate is often used because of its perceived ease of use.

Understanding with an example

As an example, we will investigate the lifetimes of political leaders around the world. A political leader, in this case, is defined as a single individual’s time in office as the person who controls the ruling regime. The birth event is the start of the individual’s tenure, and the death event is the retirement of the individual.

An observation is censored if the leader:

  1. Is still in office at the time of dataset compilation (2008).
  2. Dies while in power (this includes assassinations).
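
The censoring rules above can be sketched in code. This is a minimal illustration, assuming a hypothetical `Tenure` record and a helper `to_survival_record`; neither comes from the dataset or from any library, and the three tenures are made-up values rather than real rows.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

STUDY_END = 2008  # year the dataset was compiled, as stated in the text


@dataclass
class Tenure:
    """Hypothetical record for one leader's time in office."""
    start_year: int
    end_year: Optional[int]       # None if still in office at STUDY_END
    died_in_office: bool = False


def to_survival_record(t: Tenure) -> Tuple[int, bool]:
    """Return (duration_in_years, event_observed) for one tenure."""
    if t.end_year is None:
        # Still in office: we only know the tenure lasted at least this long.
        return STUDY_END - t.start_year, False
    duration = t.end_year - t.start_year
    # Dying in power counts as censoring, not as the retirement event.
    return duration, not t.died_in_office


# Made-up tenures for illustration only.
tenures = [
    Tenure(1990, 1998),                       # retired: event observed
    Tenure(2001, None),                       # still in office in 2008: censored
    Tenure(1975, 1981, died_in_office=True),  # died in power: censored
]
records = [to_survival_record(t) for t in tenures]
print(records)  # [(8, True), (7, False), (6, False)]
```

Each observation ends up as a duration plus a flag saying whether the retirement event was actually observed, which is the form survival methods expect.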

Consider the following data (the first 20 of 1,808 observations):

To estimate the survival function, we first use the Kaplan-Meier estimate, defined as:

$$\hat{S}(t) = \prod_{t_i \le t} \frac{n_i - d_i}{n_i}$$

where ‘d_i’ is the number of death events at time ‘t_i’, and ‘n_i’ is the number of subjects at risk of death just prior to time ‘t_i’.
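
To make the product concrete, here is a small pure-Python sketch of the estimator. The function name `kaplan_meier` and the seven toy observations are assumptions made for illustration; this is not the code used to produce the article's plots, and a library such as lifelines would normally be used instead.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time with at least one observed event."""
    deaths = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)   # everyone leaving the risk set at t (event or censored)
    n_at_risk = len(durations)   # subjects at risk just before the first time point
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply in the factor (n - d) / n for this event time.
            survival *= (n_at_risk - d) / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects drop out of the risk set after t.
        n_at_risk -= exits[t]
    return curve


# Toy data: durations in years; False marks a censored observation.
durations = [2, 3, 3, 5, 8, 8, 10]
observed = [True, True, False, True, True, False, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 3))
# 2 0.857
# 3 0.714
# 5 0.536
# 8 0.357
```

Censored subjects never contribute a factor to the product, but they still count in the risk set until the time they drop out, which is why the curve does not step down at t = 10.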

Survival Function

The above plot shows the survival function for political leaders, estimated with the Kaplan-Meier estimator. The y-axis represents the probability that a leader is still in office after ‘t’ years, with ‘t’ in years on the x-axis. We see that very few leaders make it past 20 years in office.
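
Reading values off such a curve amounts to evaluating a step function. The sketch below assumes the curve is stored as sorted `(time, survival)` pairs; the helper names `survival_at` and `median_survival` and the numbers in `curve` are hypothetical and are not read from the plot.

```python
from bisect import bisect_right


def survival_at(curve, t):
    """S(t) for a right-continuous step curve given as sorted (time, S) pairs."""
    times = [time for time, _ in curve]
    i = bisect_right(times, t)
    # Before the first event time, nobody has experienced the event yet.
    return 1.0 if i == 0 else curve[i - 1][1]


def median_survival(curve):
    """First time at which S(t) drops to 0.5 or below, or None if it never does."""
    for time, s in curve:
        if s <= 0.5:
            return time
    return None


# Hypothetical curve for illustration only.
curve = [(1, 0.80), (4, 0.55), (8, 0.30), (20, 0.05)]

print(survival_at(curve, 5))    # 0.55
print(survival_at(curve, 0.5))  # 1.0
print(median_survival(curve))   # 8
```

The value stays flat between event times, so the probability at year 5 is the one set by the last step at year 4.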

Further, we may also segment the data by political regime, as shown in the following graph:

Global regimes

It is incredible how much longer these non-democratic regimes last! This can also be seen in the following country-to-country comparison:

Important!

While performing Kaplan-Meier analysis, keep the following in mind to avoid common mistakes:

  1. To make inferences about these survival probabilities, we need the log-rank test.
  2. Because the log-rank test only compares survival between groups, a continuous variable has to be dichotomized so that values are classified as low or high. The median cutpoint is often used to separate the low and high groups.
  3. Kaplan-Meier is a univariable method. This means Kaplan-Meier results are easily biased, either exaggerating prognostic importance or missing the signal entirely.
  4. One should investigate the added value of new prognostic factors, quantifying how much new markers improve predictions.
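
The first two points can be sketched together: split a continuous variable at its median, then compare the two groups with a log-rank test. The function `logrank_test` below is a hand-written illustration of the standard two-group statistic (one degree of freedom), not a call into any library, and the marker values and follow-up times are made up.

```python
import math
from statistics import median


def logrank_test(dur_a, obs_a, dur_b, obs_b):
    """Two-group log-rank test; returns (chi_square, p_value) with 1 degree of freedom."""
    event_times = sorted(
        {t for t, e in zip(dur_a, obs_a) if e}
        | {t for t, e in zip(dur_b, obs_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in dur_a if x >= t)   # at risk in group A just before t
        n_b = sum(1 for x in dur_b if x >= t)   # at risk in group B just before t
        d_a = sum(1 for x, e in zip(dur_a, obs_a) if e and x == t)
        d_b = sum(1 for x, e in zip(dur_b, obs_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        # Observed minus expected events in group A under the null hypothesis.
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Made-up marker values and follow-up times (all events observed).
marker = [0.2, 0.4, 0.5, 0.9, 1.1, 1.4, 1.8, 2.3]
durations = [1, 2, 3, 4, 5, 6, 7, 8]
observed = [True] * 8

cut = median(marker)  # median cutpoint separating low and high groups
low = [i for i, m in enumerate(marker) if m <= cut]
high = [i for i, m in enumerate(marker) if m > cut]

chi_square, p_value = logrank_test(
    [durations[i] for i in low], [observed[i] for i in low],
    [durations[i] for i in high], [observed[i] for i in high],
)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

A median split is convenient, but it discards information in the continuous variable, which is one reason the last two points in the list above still matter.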

Conclusion

The Kaplan-Meier estimator is widely used because of its simplicity and ease of access. However, it should be applied with care, because wrong assumptions can lead to wrong results.

References

  1. https://lifelines.readthedocs.io/en/latest/Quickstart.html
  2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3932959/
  3. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3059453/
  4. Complete Python Analysis,
