Survival Analysis: Optimize the Partial Likelihood of the Cox Model

Finding the coefficients that maximize the log-partial likelihood in Python

Negative log-partial likelihood of the Cox model with local optimum. Image by author.

Table of contents

  1. Introduction
  2. The Cox proportional hazards model
  3. Optimization problem
  4. Implementation
  5. Conclusions
  6. References

1. Introduction

Survival analysis encompasses a collection of statistical methods for describing time to event data.

In this post, we introduce a popular survival analysis algorithm, the Cox proportional hazards model¹. Then, we define its log-partial likelihood and gradient, and optimize the former to find the best set of model parameters through a practical Python example.

2. The Cox proportional hazards model

We define the survival rate as the percentage of patients who have not experienced the adverse event (e.g. death) after a certain period of time.

The Cox proportional hazards model can assess the association between variables and survival rate. Given a set of covariates x, it defines the hazard function as:

h(t|x) = h₀(t) exp(βx)

From the formula, we observe that the hazard function h(t|x) is the product of a baseline hazard function h₀(t) and the relative risk exp(βx).

The underlying hazard function h₀(t) does not depend on the covariates. As the form of h₀(.) is unspecified, the model is semi-parametric.

Let us explain the meaning of the model coefficients through a simplified scenario with one covariate only. Let us consider a risk factor xᵢ, for example smoking, as a binary variable (0: non-smoker vs. 1: smoker). The Cox model can be expressed as h(t|xᵢ) = h₀(t)exp(βxᵢ), where exp(β) indicates the relative risk of the adverse event given by smoking over not smoking:

  • Risk given by smoking: (xᵢ=1): h₀(t)exp(β⋅xᵢ) = h₀(t)exp(β⋅1) = h₀(t)exp(β)

  • Risk given by not smoking: (xᵢ=0): h₀(t)exp(β⋅xᵢ) = h₀(t)exp(β⋅0) = h₀(t)

  • Relative risk = risk given by smoking / risk given by not smoking: h₀(t)exp(β) / h₀(t) = exp(β)

The relative risk exp(β), also called hazard ratio, is constant and does not depend on time.
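
As a quick numerical sketch (with a made-up coefficient and baseline hazard, purely for intuition), we can verify in Python that the hazard ratio does not depend on t:

import numpy as np

# Hypothetical coefficient for the smoking covariate (illustrative value only)
beta = 0.7

# Hazard ratio: exp(beta) ~ 2.01, i.e. smoking roughly doubles the hazard
print(np.exp(beta))

# The ratio is constant in t: any baseline hazard h0(t) cancels out
def h0(t):
    return 0.1 * t  # made-up baseline hazard

for t in (1.0, 5.0, 10.0):
    print(h0(t) * np.exp(beta * 1) / (h0(t) * np.exp(beta * 0)))  # always exp(beta)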

3. Optimization problem

In Data Science, the task of "fitting" a model to a dataset indicates the search for the set of model parameters that optimizes a certain objective function. Some common examples are the minimization of a loss function or the maximization of a log-likelihood.

In our case, we need to estimate β without knowing h₀(.). To this aim, Cox proposed to maximize the partial likelihood²:

L(β) = ∏ⱼ∈K [ exp(βxⱼ) / Σₗ∈R(tⱼ) exp(βxₗ) ]

In the previous equation:

  • K is the set of chronologically ordered event (death) times: t₁ < t₂ < ... < tₖ.
  • R(tⱼ) identifies the set of subjects at risk at time tⱼ.

Intuitively, the partial likelihood is a product of the conditional probabilities of seeing the adverse events over the set of observed event times, given the set of patients at risk at those times and under the assumption of proportional hazards.

We can observe that:

  • L(β) is independent of h₀(t), which can remain unspecified.
  • L(β) does not account for the actual event times, but only their order.
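
To make the construction concrete, here is a minimal sketch (with made-up data for three subjects and no censoring) that evaluates the partial likelihood by hand:

import numpy as np

# Toy data: three subjects sorted by increasing event time, one covariate each
beta = np.array([0.5])              # made-up coefficient
x = np.array([[1.0], [0.0], [1.0]])

theta = np.exp(x @ beta)            # relative risks exp(beta * x)

# At the j-th event time the risk set contains all subjects still at risk,
# i.e. those with an event time >= t_j: the tail of the sorted array
partial_likelihood = 1.0
for j in range(len(theta)):
    partial_likelihood *= theta[j] / theta[j:].sum()

print(partial_likelihood)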

Taking the logarithm, we can derive the log-partial likelihood as:

l(β) = Σⱼ₌₁ᴺ δⱼ [ βxⱼ − log( Σₗ∈R(tⱼ) θₗ ) ]

In the previous equation:

  • N is the number of subjects.
  • θ = exp(βx).
  • δⱼ indicates the event (1: death, 0: otherwise).

To fit the Cox model, it is necessary to find the β coefficients that minimize the negative log-partial likelihood.

We recall that the negative log-partial likelihood is, in most cases, a strictly convex³ function. Thus, it has a unique global minimizer.

4. Implementation

Let us import the needed libraries:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from sksurv.datasets import load_whas500

We load the Worcester Heart Attack Study dataset⁵ available in the scikit-survival⁴ package. In particular:

  • We focus on two covariates:
      • afb: atrial fibrillation (0: no, 1: yes)
      • mitype: MI type (0: non Q-wave, 1: Q-wave)
  • We adjust the data to account for ties, i.e. those patients who experienced the adverse event at the same time. Due to the assumption of a continuous hazard, the Cox model does not admit ties. For the sake of simplicity, we add a small amount of random noise to each event time to remove them.
  • We sort the dataset by descending event time, as the partial likelihood requires ordered event times; with this ordering, the sum over each risk set R(tⱼ) becomes a simple cumulative sum.

# load the whas500 dataset
X, target = load_whas500()

# let us consider two covariates
cols = ["afb", "mitype"]

df = (X[cols]
      .rename(columns={cols[0]: "v1",
                       cols[1]: "v2"})
      .astype(float))

# extract events and respective times
df["event"], df["time"] = [list(i) for i in zip(*target)]

# add random noise to the event time to avoid ties
df.time = df.time.apply(lambda x : x + np.random.normal(2, 1))

# sort observations by descending event time
df = df.sort_values("time", ascending=False).reset_index(drop=True)

# inspect first rows
df.head(10)
Starting dataset. v1, v2: covariates; event: True/False (death/no event); time: time to event. Image by author.
v = df[["v1", "v2"]].to_numpy()
time, event = df.time.to_numpy(), df.event.to_numpy().astype(int)

Now we need to define the objective function for the optimization task, i.e. the negative log-partial likelihood, which we are going to minimize:

−l(β) = −Σⱼ₌₁ᴺ δⱼ [ βxⱼ − log( Σₗ∈R(tⱼ) θₗ ) ]

Note: in standard Machine Learning problems, X generally describes the input features. In our case, instead, the unknown variable is β, for which we are trying to find the best set of values.

def get_theta(x):
    '''
    Return the cumulative sums of theta used by the negative
    log-partial likelihood of the Cox model and by its gradient.
    It assumes the input features v to be sorted by descending
    event time, so that each cumulative sum spans a risk set.

    Args:
        - x: beta coefficients <numpy.ndarray>

    Output:
        - theta_l: cumulative theta <numpy.ndarray>
        - theta_l_v: cumulative theta by features <numpy.ndarray>
    '''
    theta = np.exp(np.dot(v, x))
    theta_l = np.cumsum(theta)
    theta_l_v = np.cumsum(v * theta.reshape(-1,1), axis=0)
    return theta_l, theta_l_v

def objective_function(x):
    '''
    Return the negative log-partial likelihood 
    of the Cox model

    Args:
        - x: beta coefficients <numpy.ndarray>
    Output:
        - Negative log-partial likelihood of the Cox model
    '''
    theta_l, _ = get_theta(x)
    return -np.sum(event * (np.dot(v, x) - np.log(theta_l)))

We derive the gradient of the objective function, i.e., its derivative with respect to β, as follows:

−∂l(β)/∂β = −Σⱼ₌₁ᴺ δⱼ [ xⱼ − Σₗ∈R(tⱼ) xₗθₗ / Σₗ∈R(tⱼ) θₗ ]

def gradient(x):
    '''
    Return the gradient of the negative log-partial
    likelihood of the Cox model

    Args:
        - x: beta coefficients <numpy.ndarray>
    Output:
        - Gradient of the objective function
    '''
    theta_l, theta_l_v = get_theta(x)
    return -np.sum(event.reshape(-1,1) * (v-(theta_l_v/theta_l.reshape(-1,1))), axis=0)
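
As an optional sanity check (not part of the original workflow), we can compare the analytic gradient with a finite-difference approximation via scipy.optimize.check_grad; a value close to zero indicates the two agree:

from scipy.optimize import check_grad

# Norm of the difference between analytic and finite-difference gradients
# at an arbitrary test point; small values validate our gradient function
print(check_grad(objective_function, gradient, np.array([1.0, 1.0])))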

We can now initialize β and perform the minimization task. We use the Newton-CG⁶ algorithm from the scipy package:

# starting values for beta
beta = np.array([1, 1])

opt_result = minimize(
    objective_function,
    beta, 
    method = "Newton-CG",
    jac = gradient
)

opt_result
Output of the optimization task. The β coefficients are stored in x. Image by author.

The results are:

  • β₁ = 0.5293
  • β₂ = -0.6541

We can fit the Cox model on the same input data using a library and verify that we would obtain the same set of values for β:

from sksurv.linear_model import CoxPHSurvivalAnalysis

model = CoxPHSurvivalAnalysis()
model_fit = model.fit(
    df[["v1", "v2"]], 
    np.array(list(zip(df.event, df.time)), dtype=np.dtype("bool, float")))

model_fit.coef_
Coefficients of the fitted model. Image by author.

Indeed, the β coefficients are almost identical, with small differences after the seventh decimal place.
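
As a side note, scikit-survival also provides the Surv helper to build the structured target array, which replaces the manual zip-and-dtype construction above; a sketch equivalent to the fit we just ran:

from sksurv.util import Surv

# Build the structured (event, time) array expected by scikit-survival
y = Surv.from_arrays(event=df.event.astype(bool), time=df.time)
model_fit = CoxPHSurvivalAnalysis().fit(df[["v1", "v2"]], y)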

Let us plot the estimated optimum and the objective function:

def objective_function_in_x(x0, x1):
    '''
    Return the negative log-partial likelihood 
    of the Cox model evaluated in the given beta

    Args:
        - x0: input beta_0 <numpy.ndarray>
        - x1: input beta_1 <numpy.ndarray>
    Output:
        - objective function in beta_0, beta_1 <numpy.ndarray>
    '''
    v0, v1 = v[:, 0], v[:, 1]
    theta = np.exp(x0*v0 + x1*v1)
    return -np.sum(event * ((x0*v0 + x1*v1) - np.log(np.cumsum(theta))))

def get_plot_data(inf=-5, sup=5, size=10):
    '''
    Return a three-dim square box with the evaluation
    of the negative log-partial likelihood of the Cox model

    Args:
      - inf: min value of the box, default: -5 <int>
      - sup: max value of the box, default: 5 <int>
      - size: size of the output coordinates arrays, default: 10 <int>
    Output:
      - x0: input beta_0 <numpy.ndarray>
      - x1: input beta_1 <numpy.ndarray>
      - z: objective function in beta_0, beta_1 <numpy.ndarray>
    '''
    x0, x1 = np.linspace(inf, sup, size), np.linspace(inf, sup, size)
    x0, x1 = np.meshgrid(x0, x1)
    z = np.zeros((size, size))
    for i in range(0, x0.shape[0]):
        for j in range(0, x0.shape[1]):
            z[i][j] = objective_function_in_x(x0[i][j], x1[i][j])
    return x0, x1, z

def get_min_obj_function(model):
    '''
    Return the coordinates of the optimum found by the optimizer

    Args:
      - model: instance of <scipy.optimize.OptimizeResult>
    Output:
      - x0: optimum for beta_0 <numpy.ndarray>
      - x1: optimum for beta_1 <numpy.ndarray>
      - z: objective function in the optimum <numpy.ndarray>
    '''
    x0, x1 = np.array(model.x[0]), np.array(model.x[1])
    z = objective_function_in_x(x0, x1)
    return x0, x1, z

# generate data
x0, x1, z = get_plot_data(-5, 5, 10)
x0_min, x1_min, z_min = get_min_obj_function(opt_result)

# plot the objective function and the local optimum
fig = plt.figure(figsize=plt.figaspect(0.4))

# left subplot
ax = fig.add_subplot(1, 2, 1, projection="3d")
ax.contour3D(x0, x1, z, 100, cmap="viridis")
ax.scatter(x0_min, x1_min, z_min, marker="o", color="red", linewidth=50000)
ax.set_xlabel("$β_1$")
ax.set_ylabel("$β_2$")
ax.set_zlabel("- Log-Partial Likelihood")

# right subplot
ax = fig.add_subplot(1, 2, 2, projection="3d")
ax.contour3D(x0, x1, z, 100, cmap="viridis")
ax.scatter(x0_min, x1_min, z_min, marker="o", color="red", linewidth=50000)
ax.set_xlabel("$β_1$")
ax.set_ylabel("$β_2$")
ax.set_zlabel("- Log-Partial Likelihood")
ax.view_init(10, 30)
fig.suptitle("Negative log-partial likelihood of the Cox model with local optimum", fontsize=10);
Negative log-partial likelihood of the Cox model with local optimum. Image by author.

Note: the optimization problem with the previously defined functions can be solved for any number of input variables v. The 3-D plot, however, requires us to restrict to two covariates, as it can display only one β coefficient per axis.

From the plot, we can see that the negative log-partial likelihood is a convex function, consistent with the unique global minimizer discussed above.

5. Conclusions

In the context of Survival Analysis, we introduced the Cox proportional hazards model and fit it to input data. In particular, we wrote the negative log-partial likelihood and its gradient in Python. Then, we minimized it to find the best set of model parameters.

6. References

[1] D. R. Cox, Regression Models and Life-Tables, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 34, No. 2, pp. 187–220, 1972.

[2] https://en.wikipedia.org/wiki/Likelihood_function

[3] C. M. Wilson et al., Fenchel duality of Cox partial likelihood with an application in survival kernel learning, Artificial Intelligence in Medicine, Vol. 116, 102077, 2021.

[4] S. Pölsterl, scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn, Journal of Machine Learning Research, vol. 21, no. 212, pp. 1–6, 2020 (Package website).

[5] https://scikit-survival.readthedocs.io/en/stable/api/generated/sksurv.datasets.load_whas500.html

[6] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-newtoncg.html#optimize-minimize-newtoncg

Note: the whas500 dataset is freely available for use from the scikit-survival package. The scikit-survival package is licensed under the GPL v3.

