The Mathematical Relationship between the Survival Function and Hazard Function

Dr. Dennis Robert MBBS, MMST
Towards Data Science
6 min read · Dec 8, 2021


Who is the intended reader of this article?

This article is intended for those who are already familiar with at least the basics of survival (time-to-event) analysis and the Cox proportional hazards model. If you are new to survival analysis and/or Cox proportional hazards, I strongly encourage you to get the basics right before reading this article, as otherwise it might very well be a waste of your time. It is also my opinion that you do not need to know this mathematical proof of the relationship between the survival and hazard functions in order to model survival data effectively. So, in a nutshell, this article is for the curious minds who want to dig deep into the math, but it is by no means a prerequisite for survival analysis.

Introduction

Writing this article was motivated by the fact that I used to spend a significant amount of time modelling survival data (time-to-event data) using methodologies such as Kaplan-Meier Survival Analysis (KM Analysis) and Cox Proportional Hazards Models (CPHM), and I became intrigued by what exactly the connection between the two is. Regardless of the methodology used, the dependent variable of interest in survival analysis is the survival time. However, in KM analysis (probably the most commonly applied methodology) what we estimate is the survival probability, whereas in CPHM the estimation revolves around hazard rates and hazard ratios. At first I wasn’t very inquisitive about the exact connection between these two similar but conceptually and mathematically different estimates, but my curiosity gradually grew, and so I tried to dig a bit deeper into the connection from a purely mathematical point of view. The intuition is that if CPHM estimates hazard rates and KM analysis estimates survival probability (survival rate), both from the same survival data, there must be a clear connection between the two estimates, right? We will derive a fundamental mathematical relationship between the survival function and the hazard function and show that the two are closely connected.

Into the Math…

Without further ado, I will start with the notations…

Let T be a random variable representing the survival time, that is, the time-to-event.

Let S(t) be the survival probability, the probability that the event has NOT occurred until time ‘t’.

Let F(t) be the failure probability, the probability that the event has occurred by time ‘t’.

S(t) and F(t) can thus be represented mathematically as:

$$S(t) = P(T > t)$$

$$F(t) = P(T \le t) = 1 - S(t)$$

Note that ‘t’ is a specific value of interest for ‘T’, the random variable.

S(t) is nothing but the survival function (a.k.a. survivor function)

Some properties of S(t)

  • S(t = 0) = 1 (everyone is surviving at the start of follow-up, typically the study start time or index date in epidemiological studies)
  • S(t → ∞) = 0 (no one survives indefinitely)
  • 0 ≤ S(t) ≤ 1
  • S(t) is a non-increasing function. That is, S(t1) ≥ S(t2) for t1 ≤ t2
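These properties are easy to check numerically on a concrete case. The sketch below assumes, purely for illustration, an exponentially distributed survival time with an arbitrary rate of 0.5; the function name and the time grid are my own choices and not part of any library.

```python
import math

def survival_exponential(t, rate=0.5):
    """S(t) = P(T > t) for an exponential survival time (assumed example)."""
    return math.exp(-rate * t)

# Arbitrary grid of time points, in increasing order.
times = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]
surv = [survival_exponential(t) for t in times]

# Property checks on the grid.
starts_at_one = surv[0] == 1.0
stays_in_unit_interval = all(0.0 <= s <= 1.0 for s in surv)
non_increasing = all(a >= b for a, b in zip(surv, surv[1:]))

# Failure probability is the complement of the survival probability.
failure = [1.0 - s for s in surv]

print(starts_at_one, stays_in_unit_interval, non_increasing)
print([round(s, 4) for s in surv])
```

Any other valid survival function could be swapped in for the exponential one; the same three checks should still pass.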

In KM analysis the objective is to estimate S(t). One of the best articles detailing the steps of estimating S(t) non-parametrically by KM analysis is documented here. CPHM is used when you also want to adjust for covariates, as in a multiple regression model.

Hazard Rate and Hazard Ratio

CPHM aims to estimate the hazard ratio (HR), which is the ratio of two hazard rates. Which two hazard rates are being compared is outside the scope of this article. The hazard rate h(t) is the probability, per unit time, that an event occurs during a very small time interval between t and t+∆t, given that the individual did not have an event until ‘t’.

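Thus h(t) is an “instantaneous” rate. Strictly speaking it is a rate rather than a probability, so it can take values greater than 1. A related concept is the cumulative hazard function, usually denoted by H(t). Both h(t) and H(t) can be mathematically represented using the concepts of limits and integral calculus as follows.

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t} \tag{1}$$

$$H(t) = \int_0^t h(u)\, du \tag{2}$$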

Equation (1) is simply the mathematical translation of the definition of the hazard rate h(t). Its numerator is a conditional probability (note the ‘|’ sign) of an event occurring in a small time interval between t and t+∆t, given that the individual survived until ‘t’; dividing by ∆t and taking the limit turns it into a rate. Integrating h(t) over time thus gives you the cumulative hazard function H(t).
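To see the limit in the definition of h(t) at work, here is a minimal numerical sketch. It assumes a Weibull survival time with arbitrary shape 1.5 and scale 2.0 (my choice, for illustration only), computes the conditional probability as (S(t) − S(t+∆t)) / S(t), which is the ratio the derivation arrives at a few steps later, and compares the result with the known closed-form Weibull hazard.

```python
import math

# Assumed example: Weibull survival time with shape K and scale LAM.
K, LAM = 1.5, 2.0

def survival(t):
    """S(t) = P(T > t) for the assumed Weibull distribution."""
    return math.exp(-((t / LAM) ** K))

def hazard_exact(t):
    """Closed-form Weibull hazard, used only as the reference value."""
    return (K / LAM) * (t / LAM) ** (K - 1)

def hazard_from_definition(t, dt):
    """P(t < T <= t + dt | T > t) / dt, with the conditional probability
    computed as (S(t) - S(t + dt)) / S(t)."""
    prob_event_in_interval = survival(t) - survival(t + dt)
    return prob_event_in_interval / survival(t) / dt

t = 1.0
for dt in (1.0, 0.1, 0.01, 0.001, 0.0001):
    approx = hazard_from_definition(t, dt)
    print(f"dt={dt:<7} approx={approx:.6f} exact={hazard_exact(t):.6f}")
```

As ∆t shrinks, the approximation should settle on the closed-form value, which is what the limit in the definition expresses.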

Our goal is to find the mathematical relationship between h(t), H(t) and S(t)

From (1), let’s set the limit aside for a moment and concentrate on the numerator:

$$P(t < T \le t + \Delta t \mid T > t)$$

This conditional probability can be transformed using the familiar definition of conditional probability:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Applying the same transformation, Equation (1) becomes

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t)}{\Delta t \; S(t)} \tag{4}$$

Equation 4 follows from two basic facts. First, the probability that both t<T≤t+∆t and T>t hold is simply P(t<T≤t+∆t), because any T that falls in that interval is automatically greater than t. Second, P(T>t) is nothing but our S(t), which we defined earlier! Now we can see that something is getting unraveled already, as we managed to get S(t) into the mathematical representation of h(t) in Equation 4.

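We are now going to re-arrange Equation 4 again into something more useful…

$$P(t < T \le t + \Delta t) = F(t + \Delta t) - F(t)$$

$$h(t) = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t \; S(t)} = \frac{1}{S(t)} \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t}$$

$$h(t) = \frac{1}{S(t)} \frac{dF(t)}{dt} \tag{8}$$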

Each equation is derived from the previous one, and we also use the failure probability F(t) that we defined earlier. Equation 8 gives us a beautiful and simple relationship between hazard rates and survival rates! But we are not done yet! Can we transform equation 8 into an equation containing only h(t) and S(t)? Yes, we can. For this, we need to apply the chain rule of differentiation.

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x) \tag{9}$$

Applying the above rule to a function composed with the natural logarithm:

$$\frac{d}{dx} \log f(x) = \frac{1}{f(x)} \frac{df(x)}{dx} \tag{10}$$

Chain rule example using composite function over natural logarithm function

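Let’s park equations 9 and 10 for now and re-write equation 8 in terms of S(t) only, using F(t) = 1 − S(t):

$$h(t) = \frac{1}{S(t)} \frac{d}{dt}\bigl[1 - S(t)\bigr]$$

$$h(t) = -\frac{1}{S(t)} \frac{dS(t)}{dt} \tag{12}$$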

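If you look at equation 12 and equation 10, they seem very similar, right? Indeed! The right-hand side of equation 12 is the right-hand side of equation 10 with S in place of f and a minus sign in front, so we can reduce equation 12 to this:

$$h(t) = -\frac{d}{dt} \log S(t) \tag{13}$$

Relationship between hazard rate and survival rate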

That’s it. We can now say from equation 13 that the hazard rate is simply the negative of the time derivative of the natural logarithm of the survival function (survival probability). Wow!
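The relationship h(t) = −d/dt log S(t) can be sanity-checked numerically. The sketch below again assumes a Weibull survival time with arbitrary shape 1.5 and scale 2.0, approximates the derivative of log S(t) with a central finite difference, and compares it against the closed-form Weibull hazard. The helper names and the step size are illustrative assumptions.

```python
import math

# Assumed example: Weibull survival time with shape K and scale LAM.
K, LAM = 1.5, 2.0

def log_survival(t):
    """log S(t) for the assumed Weibull distribution, S(t) = exp(-(t/LAM)**K)."""
    return -((t / LAM) ** K)

def hazard_exact(t):
    """Closed-form Weibull hazard, used only as the reference value."""
    return (K / LAM) * (t / LAM) ** (K - 1)

def hazard_from_log_survival(t, eps=1e-5):
    """-d/dt log S(t), approximated by a central finite difference."""
    return -(log_survival(t + eps) - log_survival(t - eps)) / (2 * eps)

check_times = (0.5, 1.0, 2.0, 4.0)
for t in check_times:
    numeric = hazard_from_log_survival(t)
    print(f"t={t:<4} -d/dt log S(t)={numeric:.6f} h(t)={hazard_exact(t):.6f}")
```

The two columns should agree to several decimal places at every time point, up to finite-difference error.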

It’s not over yet! We can also find the relationship between the cumulative hazard function H(t) and S(t). H(t), as per equation 2, is simply the integral of h(t) over time. So all we need to do is integrate equation 13 from 0 to t and use S(0) = 1, so that log S(0) = 0, and that gives us a beautiful terse equation:

$$H(t) = \int_0^t h(u)\, du = -\bigl[\log S(t) - \log S(0)\bigr] = -\log S(t)$$

Relationship between cumulative hazard function and survival function

Yes, the cumulative hazard (cumulative hazard function) at time t is the negative logarithm of the survival probability at time t! Isn’t this beautiful? I think it’s Walter Lewin who once said that ‘this equation is so beautiful that it makes you cry.’
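As a final check, the cumulative hazard can be obtained by integrating h(t) numerically and compared with −log S(t). This is a minimal sketch under the same illustrative assumption of a Weibull survival time (shape 1.5, scale 2.0); the trapezoidal rule and the number of steps are arbitrary choices of mine.

```python
import math

# Assumed example: Weibull survival time with shape K and scale LAM.
K, LAM = 1.5, 2.0

def survival(t):
    """S(t) = P(T > t) for the assumed Weibull distribution."""
    return math.exp(-((t / LAM) ** K))

def hazard(t):
    """Closed-form Weibull hazard h(t)."""
    return (K / LAM) * (t / LAM) ** (K - 1)

def cumulative_hazard(t, n_steps=10_000):
    """H(t) = integral of h(u) du from 0 to t, by the trapezoidal rule."""
    step = t / n_steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, n_steps):
        total += hazard(i * step)
    return total * step

t = 3.0
H_numeric = cumulative_hazard(t)
print(f"H(t) by integration : {H_numeric:.6f}")
print(f"-log S(t)           : {-math.log(survival(t)):.6f}")
print(f"exp(-H(t)) vs S(t)  : {math.exp(-H_numeric):.6f} vs {survival(t):.6f}")
```

The last line reads the same identity in the other direction: once H(t) is known, S(t) = exp(−H(t)).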


Healthcare Data Science Professional, Physician (currently not practising). Alumnus of IIT Kharagpur & Medical College Kottayam. Khorana Scholar, AIPMT Top 150