Model-Based Learning | Hands-On Tutorial

Model-Based Learning on Cancer Data

Tutorial on deploying model-based AI on imbalanced and missing health data

Raveena Jay
Towards Data Science
9 min read · Jan 24, 2022


Photo by National Cancer Institute on Unsplash

In this article, I’ll give a start-to-finish tutorial on applying model-based learning to real-world health data (in my case, carcinoma data). By the end of the article, I hope you’ll have a sense of why model-based learning is a helpful general framework to apply to many different datasets, and how it can inform the specific classification decisions you make in a domain like healthcare.

For example, in the healthcare domain, predicting that someone is healthy when they may in fact need surgery or treatment can have deadly costs: their underlying condition might not be seen by a human doctor until it’s too late, because the automated system has already deemed them healthy. In contrast, the model predicting cancer for a patient who is actually healthy isn’t as bad, since the patient can simply get regular checkups through their insurance, and if a condition does develop, their doctor will be the first to know.

The dataset I’ll be working with is real data from the Coimbra Hospital and University Centre in Portugal, and consists of 165 patients with hepatocellular carcinoma, a common but deadly type of liver cancer often associated with hepatitis B and C.

Exploratory Data Analysis (EDA)

Fixing the Missing Health-Record Data

First, we need to download the data from the UC Irvine Machine Learning Repository at this link. You’ll notice that in the link, it says that about 10% of the data is missing, with only 8 patient records having complete data. So our first step will be to fix that.

Image Credit: Author. The missing values are labeled with a question mark “?”.

Before we do any feature selection or machine learning, we first have to deal with the missing data. Usually the pandas library provides an option such as DataFrame.fillna, which fills the missing values in each feature column with a value you choose: typically the mean or median for a numeric column, or the mode for a categorical one (nearest-neighbour imputation is also possible, for example with scikit-learn’s KNNImputer).
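For instance, a minimal pandas sketch might look like this (the file name and the column handling are my own placeholders, not the dataset’s exact setup):

import pandas as pd

# Read the data; missing values are marked with "?" in this dataset
df = pd.read_csv("hcc-data.csv", na_values="?")

# Fill numeric columns with their median, categorical columns with their mode
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])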

In this case, however, for our carcinoma data, the authors of the paper (who kindly published the dataset for free on UCI) outlined a missing-data scheme based on a patient similarity metric called the “Heterogeneous Euclidean-Overlap Metric” (HEOM), which, according to the authors, works well for filling missing values in datasets with both continuous and discrete attributes. The code contains the details of how to implement this in Python; you can view it here.
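The full implementation is in the linked code; as a rough illustration only, a simplified HEOM-style distance between two patient records (my own sketch, not the authors’ exact implementation) could look something like this:

import numpy as np

def heom_distance(x, y, is_numeric, ranges):
    # x, y: 1-D arrays for two patients, with np.nan marking missing entries
    # is_numeric: boolean array, True for continuous attributes
    # ranges: max minus min of each numeric attribute in the training data
    d = np.zeros(len(x))
    for j in range(len(x)):
        if np.isnan(x[j]) or np.isnan(y[j]):
            d[j] = 1.0                            # maximal distance when a value is missing
        elif is_numeric[j]:
            d[j] = abs(x[j] - y[j]) / ranges[j]   # range-normalised difference for numeric attributes
        else:
            d[j] = 0.0 if x[j] == y[j] else 1.0   # overlap metric for nominal attributes
    return np.sqrt(np.sum(d ** 2))

A missing entry can then be filled in with the corresponding value from the most similar complete record under this metric (or an average over the k nearest ones).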

Finding the Best Features

Now that we’ve filled in the missing data, let’s figure out how to pick our features. I decided not to spend too much time on feature selection. There are many ways to do this: recursive feature elimination (RFE), mutual information, chi-squared tests of independence, and so on. I usually wouldn’t recommend RFE, because it repeatedly retrains the model as features are eliminated, which becomes expensive as the number of features grows.

In this case, for the health records, I’m going to do something else entirely: I’ll use a random forest to pick the features. You may be asking, “why the hell would I use a random forest to figure that out??”, but it turns out random forests have a built-in attribute called feature_importances_, computed from how much each feature contributes across the forest’s decision trees. (By the way, if you would like a quick review article on random forests, check out my previous post on them, with an application to real-world diabetes data!)
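As a rough sketch of that step (variable names like X and y are my placeholders for the imputed feature matrix and the survival labels; the importance attribute itself, feature_importances_, is standard scikit-learn):

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Fit a forest on the imputed data (X as a DataFrame, y as 0/1 survival labels)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank features by the forest's impurity-based importances
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(3))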

Using a random forest, we find that the top 3 features predictive of patient survival are, in order: Alpha-Fetoprotein (ng/mL), Haemoglobin (g/dL), and Alkaline Phosphatase (U/L). (These health-record attributes are also listed on the UCI page where this dataset is hosted.)

Image credit: Author. The top 3 features correspond to the highest 3 bars.

Model-Based Learning

Now comes the juicy section I’d like to put my energy and attention toward: model-based learning.

We’ve filled in our missing data and found our most informative features, so now let’s take a look at the shape of the dataset. We have 165 records in total, with 102 cases of patient survival and 63 cases of patient death. Already we run into two major challenges: our sample size is not that large for machine learning, and we have class imbalance.

Now you could say, “Raveena, why don’t you just oversample the smaller class, or downsample the bigger class?” Indeed, I could! I could oversample the “patient-death” class with a synthetic oversampling technique called SMOTE (there’s a great tutorial on it in the scikit-learn-compatible imbalanced-learn library). However, I’m trying to incorporate model-based learning, which means I’ll be going along the more Bayesian route; I’ll explain that shortly below.

The basic idea is right there in the name: I’ll create a model of the top-3 features for patients who survived, and another model for those who unfortunately passed away from the carcinoma. The hope is that once I have these two models, then for a new test patient I can compare the judgements from each model and decide whether the patient is likely to survive or die, or withhold a decision when the uncertainty is too large.

The Bayesian step adds one more ingredient before the final decision: I’ll incorporate our prior knowledge about the patients and multiply it into the model judgements. What’s the prior knowledge, you ask? Well, in the training set there are 73 patients who survive and 42 who die, reflecting the class imbalance I mentioned earlier. That might have sounded like a bad thing, but in model-based learning we can actually use it to our advantage! What I’ll do is combine the prior difference between the surviving and dying patients with the model judgements, and use those two together to come to a final decision.
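Concretely, the prior is just the class proportion in the training set, so the prior (log-)odds of survival take only a couple of lines (a sketch with my own variable names):

import numpy as np

n_survived, n_died = 73, 42                    # training-set class counts from above
log_prior_odds = np.log(n_survived / n_died)   # prior evidence in favour of survival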

Creating the Models

To create the actual models, I’ll be using a type of model known as a Gaussian mixture. Described simply, a Gaussian mixture model looks at a set of data and tries to decompose its distribution into a sum of individual bell curves, i.e. Gaussian (normal) distributions. Each component normal distribution has a specific weight associated with it; the weight tells the model how much importance to give that specific Gaussian. A GIF of a Gaussian mixture model being fitted to data is linked here, and one screenshot is below:

Image Credit: Wikipedia. Screenshot of a Gaussian Mixture Model and its components. You can see the individual red, yellow, purple, green and teal normal distributions, and each one is added with a specific weight to give the overall thin darker blue curve

Gaussian mixture models are also less prone to overfitting the training data than kernel density estimators, which place a kernel on every training point and can therefore fit the training data too closely.

from sklearn.mixture import GaussianMixture

# Gaussian Mixture model for patients who survived carcinoma
life_model_2feature = GaussianMixture(n_components=6, random_state=0).fit(X_train_2life)
# Gaussian Mixture model for patients who died from carcinoma
death_model_2feature = GaussianMixture(n_components=4, random_state=0).fit(X_train_2death)
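Once fitted, the mixture really is a weighted sum of its components: scikit-learn exposes the weights, means and covariances directly, so you can check the decomposition yourself. A quick sketch (assuming X_train_2life is a NumPy array):

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, gmm):
    # Overall density at x is sum_k weight_k * Normal(x; mean_k, cov_k)
    return sum(
        w * multivariate_normal.pdf(x, mean=m, cov=c)
        for w, m, c in zip(gmm.weights_, gmm.means_, gmm.covariances_)
    )

x = np.asarray(X_train_2life)[0]
# score_samples returns the log-density, so these two numbers should match
print(mixture_density(x, life_model_2feature),
      np.exp(life_model_2feature.score_samples(x.reshape(1, -1)))[0])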

We’ll look at the models of the survived and died patients based on the top 2 features; the top-3-feature version is detailed in my code.

Image credit: author. This is the computer’s internal model of the survived patients, visualized by generating 100 synthetic samples of survived patients from that model.
Image credit: author. This is the computer’s internal model of the died patients, visualized by generating 100 synthetic samples of died patients from that model.
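The synthetic patients in the two plots above come from the generative side of the models: a fitted GaussianMixture can draw new samples directly. Roughly how such a plot can be produced (the plotting details and labels are mine, matching the top-2 features):

import matplotlib.pyplot as plt

# Draw 100 synthetic "survived" and 100 synthetic "died" patients
X_life_synth, _ = life_model_2feature.sample(100)
X_death_synth, _ = death_model_2feature.sample(100)

plt.scatter(X_life_synth[:, 0], X_life_synth[:, 1], label="survived (synthetic)")
plt.scatter(X_death_synth[:, 0], X_death_synth[:, 1], label="died (synthetic)")
plt.xlabel("Alpha-Fetoprotein")
plt.ylabel("Haemoglobin")
plt.legend()
plt.show()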

Model Predictions

Finally, it’s time to see how the model will predict patient survival on test patient data. I want to discuss one important thing before jumping right in though.

We Can’t Risk False-Positives Here

If we are predicting whether a patient will survive this specific carcinoma based on their Alpha-Fetoprotein and Haemoglobin levels, it would be extremely dangerous and unethical to accidentally predict that a patient will survive the carcinoma when they actually have a high chance of dying. In statistics this is known as a “false positive”, where in this situation positive = survival and negative = death. If we’re going to make a final decision about test patients, we need to put more weight on false-positive mistakes: we need to “punish” that decision so it happens less often, and so we don’t accidentally lie to patients by telling them they’ll survive the carcinoma when their chance of dying without proper treatment is much higher.

Say we want to give false-positive decisions 5 times the weight of false-negative decisions. If we want to punish false-positive decisions 5 times as much, then we can only accept a decision to predict the patient as “survived” if the final odds ratio is above a certain amount. More specifically:

Image credit: author. This is basically Bayes’ theorem; to punish false-positive decisions 5 times more heavily, we raise the decision threshold.

(Just as an important reminder: in this project, Y=1 means the patient will survive, and Y=0 means the patient will not survive the carcinoma.)

(The details are in the code, so feel free to check it out yourself here.) But here’s a sample of the code if you’re interested:

import numpy as np

threshold = 5
decisions = []
for odds in log_posterior_odds:
    if odds > np.log(threshold):
        decisions.append(1)
    else:
        decisions.append(0)
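For completeness, here’s roughly where log_posterior_odds comes from: each test patient’s log-likelihood under the two mixture models, plus the log prior odds from the 73/42 training counts. (A sketch; X_test_2 is my placeholder for the test features, and the exact version is in my code.)

# log p(x | survived) and log p(x | died) from the two fitted mixtures
log_lik_life = life_model_2feature.score_samples(X_test_2)
log_lik_death = death_model_2feature.score_samples(X_test_2)

# Bayes' theorem in log form:
# log[ P(Y=1|x) / P(Y=0|x) ] = log[ p(x|Y=1) / p(x|Y=0) ] + log[ P(Y=1) / P(Y=0) ]
log_posterior_odds = (log_lik_life - log_lik_death) + np.log(73 / 42)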

So…What does the Model Predict?

Implementing this decision rule, we get the following confusion-matrix results. (A confusion matrix is just a fancy term for a 2x2 table showing how the model performs on test data, including false positives and false negatives rather than just accuracy. Accuracy can sometimes be misleading!)
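If you’re following along, scikit-learn computes it in one line (a sketch, assuming y_test holds the true 0/1 survival labels for the 50 test patients):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, decisions)   # rows: true class, columns: predicted class
ConfusionMatrixDisplay(cm, display_labels=["died", "survived"]).plot()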

The model didn’t do too badly: around 36 correct out of 50, which gives 72% accuracy. The dark purple box in the top right shows the false positives: for 5 test patients, the model predicted they would survive when in fact they would not survive the cancer.

But I want to draw your attention to something more important than accuracy for this model-based learning framework, because remember, accuracy isn’t everything. Remember how I said we would punish the model for making false-positive mistakes (where the model falsely tells the patient they will survive the carcinoma) by setting the odds-ratio threshold at 5?

Well, watch what happens to the number in the top-right purple box when I increase the threshold to 10, so we put 10 times more weight:

By increasing the threshold to 10, the model made one less false-positive mistake, at the cost of making a false-negative mistake.

But watch what happens if the threshold is very high, say at 20. So we put 20 times more weight on false-positive classification mistakes than false-negatives.

Now the model has really started to minimize false positives, which is great, but it comes at the huge cost of increasing false negatives. This means that too many patients who might not need checkups have been sent by the model to their medical providers for checkups. Unless their insurance can cover regular checkups, this would not be good for the patient financially. So, as you can see, setting the threshold is a balancing game for the model.
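You can see this trade-off directly by sweeping the threshold and re-counting the two kinds of mistakes (a small sketch in the spirit of the experiments above; log_posterior_odds and y_test are as before):

import numpy as np
from sklearn.metrics import confusion_matrix

for threshold in [5, 10, 20]:
    preds = (log_posterior_odds > np.log(threshold)).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")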

Final Words

I hope this article has given you a sense of the flexibility of model-based learning, and how it can be applied to real-world data in the health domain.

One of the major advantages of model-based learning, in my opinion, is that this framework can be applied to many types of tabular data, not just health records. This type of machine learning can even be applied to vision and speech-recognition domains, as long as the right kinds of features are extracted from the data and effective probabilistic generative models are used.

Another advantage, I’d argue, is that the final analysis we did on false positives, false negatives, and compensating for either mistake based on the domain of our data is pretty straightforward with model-based learning. All we have to do is decide how much weight to put on punishing false positives versus false negatives, and set that as the threshold, as we did. It takes just a few lines of Python, but this analysis is extremely important in, for example, the healthcare domain, where making a false-positive mistake might be more costly or deadly than a false negative, regardless of accuracy.

Thank you so much for reading, and I’ll see you in my next article!

You can check out my LinkedIn here, if you’d like to learn more about me, what I’m interested in and what I’d like to pursue in machine learning and data science!



I recently earned my B.A. in Mathematics and I'm interested in AI's social impact & creating human-like AI/ML systems. @raveena-jay.