
In my last post, I walked through examples using Random Forest, Gradient Boosted, and SVM survival models. Today, I will show you how you can approach the same type of problems using deep neural networks. Specifically, we will go through examples using the continuous-time model (DeepSurv) and the discrete-time model (DeepHit) using pycox based on the PyTorch environment.
Since I covered the fundamentals of time-to-event analyses in my last post, I will omit that today to avoid any redundant information. Please check out the link above if you are interested in more detail.
Before I begin, I encourage everyone to check out the pycox repo linked above. It is filled with useful information along with additional examples and resources. You can even load several time-to-event datasets directly, one of which I will be using today.
Data
For today's examples, I used the Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT) dataset, which investigates survival time for seriously ill hospitalized patients. It contains data from 8,873 patients, with 14 predictor variables and 2 outcome variables (duration and event). Below, you can preview our dataset.
For a quick description of the variables present in our dataset, refer to the figure below. When I opened the dataset, the features were named x0 through x13. I renamed them according to the information found here.
For a complete list of the libraries we will be using, please check out my notebook here.
Next, we are going to divide our dataset into training, validation, and test sets using the code below. Alternatively, you could use sklearn's train_test_split.
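The split can be sketched as follows. A synthetic stand-in dataframe keeps the snippet self-contained; with the real data you would reuse the SUPPORT dataframe loaded earlier:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the SUPPORT dataframe
df = pd.DataFrame(np.random.randn(1000, 3), columns=['x0', 'duration', 'event'])

# Hold out 20% for testing, then 20% of the remainder for validation
df_test = df.sample(frac=0.2, random_state=1)
df_train = df.drop(df_test.index)
df_val = df_train.sample(frac=0.2, random_state=1)
df_train = df_train.drop(df_val.index)
```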
Preprocessing
Before we begin, we will need to perform some data preprocessing based on the variable types:
- Numeric variables: Standardize
- Binary variables: No preprocessing necessary
- Categorical variables: Create embeddings
Preprocessing our data can be accomplished using the code in figure 4. There’s quite a bit happening here, so let’s unpack it. In the first block, we are simply creating lists to separate the different variable types because we will be applying different transformations to each.
The second block contains the transformations we are going to perform. Specifically, we are going to use StandardScaler for the numeric variables, nothing will be done to the binary variables, and OrderedCategoricalLong will be used to transform our categorical variables.
The DataFrameMapper in the next block is part of the sklearn-pandas package and allows us to easily apply sklearn transformations to our pandas dataframe.
Finally, the last two blocks apply the transformations to our datasets. You will notice that the results from x_fit_transform and x_transform are wrapped in a tuple using tt.tuplefy. This tuple, called a tuple tree, is used to train PyTorch models, and it can work with data arranged in nested tuples.
In the following section, we will start to explore DeepHit, which is a discrete-time model, and DeepSurv, which is a continuous-time model and see how they perform on our dataset.
DeepHit
DeepHit is a deep learning model adapted to survival (time-to-event) analyses. It can be used to investigate single or competing risks. Since this is a discrete-time model, the first step in our investigation is to define the discrete times to evaluate.
There are two ways we can approach this. We can either create equally spaced (equidistant) discrete time intervals or quantiles using the code below. Here we have defined 10 equally spaced intervals, however, by changing the scheme we can define intervals based on quantiles. It’s important to note that we do not need to transform the labels on the test set.
To get an idea of what this looks like, refer to the figure below which demonstrates the differences between equidistant (left) and quantile (right) discretization. In each, there are 10 intervals of time defined; however, in the quantile discretization, intervals are defined by the proportion of events (deaths). Where there are many events, there are more intervals. In our dataset, we see that more events occur earlier than later. If you’re interested in going a little deeper, please refer to this paper.

If you are following along on my notebook, you will see I have done several experiments looking at the effect of modifying different parameters. I will present the model which performed best under the conditions that I tested. It’s important when you are running these experiments to try modifying different parameters to see what effect it has on your model’s performance.
Before we run our model, we have to define the number and dimension of our embeddings. As you can see in the code below, each categorical variable is represented by a vector whose length is half its number of levels. The top performing DeepHit model consisted of a multi-layer perceptron with 2 hidden layers of 64 nodes each. Batch normalization was performed following each layer, along with 20% dropout.
The optimizer selected was AdamWR, a cyclic (warm restarts) variant of the Adam optimizer with decoupled weight decay regularization.
In the event you are using a different dataset without the need for categorical embeddings, you would require some modifications to your code. First, you would not require the first two lines defining our embeddings. Second, you would need to remove num_embeddings and embedding_dims from the third block of code. Finally, you would need to substitute MLPVanilla for MixedInputMLP in the third block of code. The MixedInputMLP applies entity embeddings to the categorical variables and concatenates the result with our other variables.
Finding the appropriate learning rate is a common problem when training a model. You will often find a wide range in the literature. A very helpful tool available in pycox is the learning rate finder. Although it may not provide the best learning rate, it gives you a good place to start your experimentation. Implementation can be performed using the following code, which gave us a value of 0.04.
Now that we have defined all of the model parameters, we fit it using the following code.
Evaluating the model
To evaluate our model, we can begin by looking at some predictions on our test data. The code for this (figure 10) plots the predicted survival for five patients. The interpolate function on the first line creates a linear interpolation to smooth out the steps across time (remember this is a discrete-time model). The plot can be seen directly below the code.

To evaluate how well the model performed on our test set, there are several metrics available. I will present the inverse probability of censoring weighting (IPCW) Brier score. This calculates the mean squared error (MSE) between the predicted and observed outcomes at each time point, using weights to account for censoring. Since the MSE of a perfect prediction would be 0, a lower score indicates better performance.
We can plot the performance over time using the following code. The first block allows us to plot the Brier score across time (figure 13). The integrated score across the whole time span can also be calculated (line 6), which resulted in a value of 0.21.

Let’s see how well this compares to the results from DeepSurv in the following section.
DeepSurv
DeepSurv is another option for incorporating the potential of deep neural networks in time-to-event analyses. The main difference is that this is a continuous-time model, meaning we won't need to define discrete intervals as we did with DeepHit. Here, we simply transform our labels as seen below.
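The label step amounts to pulling out raw (duration, event) pairs; no discretization grid is needed. A small sketch, with the float32 casts following the pycox examples:

```python
import pandas as pd

# Placeholder labels; for a continuous-time model the durations stay as-is
df_train = pd.DataFrame({'duration': [5., 12., 30.], 'event': [1, 0, 1]})

get_target = lambda df: (df['duration'].values.astype('float32'),
                         df['event'].values.astype('float32'))
y_train = get_target(df_train)
```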
The rest is similar to above, where we define our model and fit it to our data. As you can see (figure 15), the main difference is on line 3, where out_features = 1. This, once again, is because we are working with a continuous-time model, whereas with DeepHit this was set to the number of discrete intervals we defined. One thing worth noting is that the learning rate for this model was 0.07 (versus 0.04 for DeepHit), as determined by the learning rate finder.
Evaluating the model
Finally, we evaluate the model in precisely the same way as we did above. Below, you will find a plot of the predictions for 5 patients. You may notice that the lines are smoother than those plotted for DeepHit because we didn't require interpolation between intervals.

The IPCW Brier score (figure 17) is slightly better than that of the DeepHit model, and the integrated score confirms this with a value of 0.18 (versus 0.21 for DeepHit).

Summary
Today we worked through a couple of examples of how you can use deep neural networks for your time-to-event analyses. I hope you give them both a try on your data and see how they compare. Once again, I recommend you visit the pycox repo to see all of the options available. They have adapted several discrete-time and continuous-time neural networks for survival analyses. Thanks for reading!