
Using Sparse Matrices in XGBoost

An alternative way to deal with high cardinality

Empty space is an opportunity. Photo by Nasa on Unsplash

You don’t have to be involved in Data Science long before hearing about XGBoost and the many Kaggle competitions it has been used in with great success. There is also no shortage of great tutorials online (including on Towards Data Science) on how to get started with the algorithm. However, there is an amazing feature of XGBoost that is often overlooked and sadly missing from most tutorials: its ability to take a sparse matrix as an input. If you’re unfamiliar with this data structure, why it is so useful, or how to use it within XGBoost, you’ve come to the right place! Using sparse matrices can be intimidating at first, but by the end of this article, I’m going to show you just how easy it can be. I’m sure you’ll be using them in your own data science projects in no time, especially if you have data sets with high cardinality.

What is a Sparse Matrix?

A sparse matrix is a data structure that stores tabular information very efficiently when most of the cells contain NULL values (you will see other definitions that say sparse matrices are made up mostly of 0s, which is the usual convention, but the matrices created in this article technically contain NULL or missing values instead). This might seem very odd at first. After all, why would you even want to use a data set with lots of NULL values in the first place? It all comes down to high-cardinality categorical variables, and how we have to transform them before we can use XGBoost.

Cardinality is a $10 word for the number of distinct values a categorical (non-numerical) feature can take; a high-cardinality feature has a lot of possible values. Let’s say you had data on where people lived, including their zip/postal code. Even though zip codes contain 5 digits in the United States, this really is a categorical data feature (e.g., 73301 isn’t "less than" 73302 just because of the numeric value of the code). There are thousands of different zip codes, so this is indeed a high-cardinality feature. One strategy to make this feature readable by XGBoost (which only takes numeric features) is to one-hot encode it. This is where you create an indicator variable for every possible zip code (e.g., one column called lives_in_73301, another called lives_in_73302, etc.) and populate those columns with 1s or 0s, where a 1 indicates that a person lives in that zip code. You will soon realize that this results in a data structure populated mostly with 0s. Repeat this exercise with a few other high-cardinality columns, and your data will become highly sparse in no time.
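To see how quickly one-hot encoding widens a table, here is a minimal sketch using pandas’ get_dummies on a made-up zip_code column (the column name and values are hypothetical, not from the data set used later):

import pandas as pd

# A hypothetical data frame with a high-cardinality zip_code column
people = pd.DataFrame({'zip_code': ['73301', '73302', '10001', '73301']})

# One-hot encode: one indicator column per distinct zip code
encoded = pd.get_dummies(people, columns=['zip_code'], prefix='lives_in')
print(encoded)
# Each row contains a single 1 and the rest 0s; with thousands of zip
# codes, the result is a very wide table that is almost entirely 0s.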

However, if you try to store this sparse data in a pandas data frame (or any other kind of dense tabular structure for that matter), you can run into memory problems very quickly. So quickly that it’s going to make you want to start over and try to figure out a way to reduce the cardinality of your columns. But don’t give up! There could be very valuable information embedded in those columns that you are throwing away for the sake of reducing cardinality and memory usage. Don’t get me wrong, pandas data frames are great, and I use them frequently in my own work. However, they are not good at storing sparse data, especially when the data set in question is quite large or has a lot of features.

Enter the sparse matrix data structure. Specifically, the CSR matrix, which stands for "compressed sparse row" (there are other sparse formats, but the pipeline below uses CSR). This data structure takes advantage of the fact that most of the matrix will have missing information: it stores only the non-NULL elements along with their locations. When most of the information in the matrix is missing, the CSR matrix uses far less memory than its dense counterparts, which store every cell explicitly, missing or not.
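To make the memory savings concrete, here is a small sketch using scipy with made-up dimensions; the exact numbers will vary, but the gap between dense and CSR storage is the point:

from scipy.sparse import random as sparse_random

# A 5,000 x 2,000 matrix where only 0.1% of the entries are filled in
sp = sparse_random(5_000, 2_000, density=0.001, format='csr', random_state=0)
dense = sp.toarray()

# Dense storage keeps every cell, filled or not (~80 MB of float64 here);
# CSR keeps only the non-zero values plus their positions (well under 1 MB)
print(dense.nbytes)
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)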

It should be noted that there are drawbacks to using sparse matrices, the main one being that they are not nearly as easy to use as other data structures, like pandas data frames for exploratory data analysis. Their main purpose is to store sparse data efficiently. Therefore, my recommendation is to rely on other tools for exploring and understanding your data set. When you’re ready to build a model pipeline, only then transition to using a sparse matrix, if appropriate.

Use Case with a High-Cardinality Data Set

Not this kind of cardinal. Patrice Bouchard on Unsplash

For this exercise, we will use the "Diabetes 130-US hospitals for years 1999–2008 Data Set", and we are going to build a model to predict readmission within 30 days. You can also download this data set yourself and execute the Python code below to follow along (after installing the requisite packages, if you haven’t already). Many of the fields in this data set are potentially useful, but we are going to mostly focus on a few of the high-cardinality ones here. Specifically, we are going to include:

  • encounter_id (a unique identifier for each hospital admission)
  • readmitted (used to develop the target variable)
  • gender (admittedly low cardinality, but gender is often an excellent predictor for many medical conditions, so we will keep it)
  • age (the 10-year age bracket of the patient; again low cardinality, but like gender, age is often an excellent predictor for medical conditions)
  • diag_1 (primary diagnosis)
  • diag_2 (secondary diagnosis, if applicable)
  • diag_3 (tertiary diagnosis, if applicable)

Undoubtedly the highest cardinality feature will be the diagnosis data. Even if you’re unfamiliar with diagnosis codes, just think of all the ways a medical professional could diagnose you during a visit, and you will soon realize there are a lot of possibilities for this field. Since there are three fields for diagnosis codes, we are going to have three features with high cardinality that we have to deal with.

All that said, the data set used in this article isn’t particularly large, and you could probably fit all of it into memory using your trusty old pandas data frame and your usual methods. However, that’s not the point of this tutorial. I am intentionally using a smaller, easy-to-work-with data set so that later, when you are working with a very large data set with many high-cardinality columns that takes a long time to run, you will be comfortable working with the sparse matrix data structure. While your colleagues are struggling to fit their models within memory constraints, or dropping valuable features for the sake of reducing memory overhead, it will be smooth sailing for you, and you will gain insights that are missing from their models.

Now, let’s import our data and only keep the columns we want for this exercise.

import pandas as pd
import xgboost as xgb
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
from sklearn.model_selection import train_test_split
import shap

# Load the raw data and keep only the columns used in this exercise
df = pd.read_csv('diabetic_data.csv', low_memory = False)
ex_df = df[[
    'encounter_id'
    ,'readmitted'
    ,'gender'
    ,'age'
    ,'diag_1'
    ,'diag_2'
    ,'diag_3'
]]
ex_df.readmitted.unique()

Exploring the readmitted column, we see there are three possible values: NO, >30 and <30, representing no readmission, a readmission after 30 days, and a readmission before 30 days, respectively (it’s unclear to me how this data set handles cases where readmission happens at exactly 30 days, but we will ignore this scenario and press on). We will create a new column called _admit_lt_30days, an indicator variable populated with a 1 if the patient was readmitted within 30 days of being discharged, and 0 otherwise.

# Create the binary target: 1 if readmitted within 30 days, 0 otherwise
ex_df = ex_df.assign(_admit_lt_30days = 0)
ex_df.loc[ex_df.readmitted == '<30', '_admit_lt_30days'] = 1
ex_df = ex_df.drop('readmitted', axis = 1)

Why the underscore in front of the variable name _admit_lt_30days? We will get to that soon, promise!

Converting Data to Tall Format

Soon, you’re going to create data sets even taller than this. Jeff Tumale on Unsplash

Next, we are going to organize our data in a way that is going to seem like a step in the wrong direction. We will restructure our data in a tall format, where the data is represented by three columns:

  • encounter_id (the encounter or admission ID).
  • variable (the name of the variable being stored).
  • value (the value of the variable in question).

For example, for the gender column, we will create two variables, gender_f and gender_m. The value of these columns will be 1 if the patient during the admission is female or male, respectively.

This might seem counter to everything you’ve ever learned about Machine Learning algorithms based on structured data. After all, aren’t you supposed to have a single row for each observation? In the tall form, each encounter is spread across multiple rows: one for the target and one for each of the five melted features. However, this is just an intermediate step to help us cast a wide and sparse matrix later, where each admission will indeed be represented by a single row.

For now, we structure the data in this tall format, taking advantage of the built-in melt function in pandas.

numeric_df = ex_df[[
    'encounter_id', '_admit_lt_30days'
]].melt(id_vars = 'encounter_id')
text_df = ex_df[[
    'encounter_id', 'gender', 'age', 'diag_1', 'diag_2', 'diag_3'
]].melt(id_vars = 'encounter_id')
# Append the value to the variable name so each (variable, value)
# pair becomes its own column when we cast the data wide later
text_df.variable = text_df.variable + '_' + text_df.value
# Remove the special characters "[" and ")" from variable names
# (they appear in the age brackets, e.g. [0-10))
text_df.variable = text_df.variable.str.replace(r'\[|\)', '', regex = True)
text_df = text_df.assign(value = 1)
# Stack the numeric target rows and the one-hot text rows together
tall_df = pd.concat([numeric_df, text_df])
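If you want to see what the tall structure looks like for a single admission, here is a quick peek (a small sketch using the objects created above):

# Inspect every tall row for one arbitrary encounter: you should see one
# row for the target plus one row per melted feature
one_id = tall_df.encounter_id.iloc[0]
print(tall_df.loc[tall_df.encounter_id == one_id])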

Casting Data as Wide and Sparse Matrix

By first creating a tall data frame, we make it easier to later cast a wide and sparse matrix. If we instead tried to create a wide data set in pandas and then convert it to a sparse matrix, the wide, dense pandas data frame would use up a lot of memory, defeating the purpose of the sparse matrix data structure in the first place.

To convert the tall data into wide and sparse data, we apply a transformation to the tall data set that is very similar to the pivot function in pandas, except that the result is a CSR matrix rather than a data frame. Each distinct entry in the variable column becomes its own column, each encounter_id becomes its own row, and the element stored at a given row and column is the number from the value column for that encounter_id and variable. Note that for many combinations of encounter_id and variable, there will be no corresponding value in the new sparse matrix; this is especially true for the columns derived from diagnosis codes. In these situations, the value within the CSR matrix is essentially NULL, though in practice nothing is stored at all. After this transformation, each observation occupies a single row, and the data will be ready for the XGBoost algorithm.

# Categorical types define the row (encounter) and column (variable) ordering
encounter_c = CategoricalDtype(sorted(tall_df.encounter_id.unique()), ordered=True)
var_c = CategoricalDtype(sorted(tall_df.variable.unique()), ordered=True)
# Map each tall row to its (row, column) position in the sparse matrix
row = tall_df.encounter_id.astype(encounter_c).cat.codes
col = tall_df.variable.astype(var_c).cat.codes
sparse_matrix = csr_matrix(
    ( tall_df["value"], (row, col) )
    , shape = ( encounter_c.categories.size, var_c.categories.size )
)
#Everything after the first column is a feature
X = sparse_matrix[:,1:]
#The first column is the target variable
Y = pd.DataFrame(sparse_matrix[:,0].astype(int).todense())
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=888)

Note that when creating var_c, we sort the columns before applying the CategoricalDtype function. This has the effect of alphabetizing all of the variable names. Recall that we placed an underscore in front of our target variable, and an underscore is sorted before any lowercase letter. So, the very first column of our sparse matrix contains our target variable, and the rest of the matrix contains our features. This is why we placed an underscore in front of our target variable: we immediately know where to look for it when we cast the data as a sparse matrix (i.e., the very first column of the CSR matrix), even if we add or remove other features later.
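As a quick sanity check (a small sketch using the objects created above), you can confirm that the target really does land in the first column of the matrix:

# Because "_" sorts before lowercase letters, the target is category 0
print(var_c.categories[0])   # expected: '_admit_lt_30days'
print(var_c.categories[:5])  # the first few column names of the sparse matrix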

Building an XGBoost Model

We are now ready to build our XGBoost model. We’re going to skip hyperparameter tuning and dive right into fitting the model, since the focus here is just getting started with sparse matrices in XGBoost.

xgb_model = xgb.XGBClassifier(
    objective='binary:logistic'
    ,booster='gbtree'
    ,tree_method='auto'
    ,eval_metric='logloss'
    ,n_jobs=4
    ,max_delta_step=0
    ,random_state=888
    ,verbosity=1
)
# XGBoost accepts the CSR matrix directly as training input
xgb_model.fit(X_train, Y_train)
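As a quick sanity check (a minimal sketch, not part of the original pipeline), you can score the holdout set and compute an AUC; no tuning has been done, so treat this purely as a baseline:

from sklearn.metrics import roc_auc_score

# Predicted probability of readmission within 30 days for the holdout set
test_proba = xgb_model.predict_proba(X_test)[:, 1]
print(round(roc_auc_score(Y_test.values.ravel(), test_proba), 3))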

We can also get feature importance by gain. We see that the gain is quite high for certain types of diagnoses. That is, these features are considered very important when it comes to predicting the probability of readmission.

#Get the feature importance by gain and sort
#This has "dummy" names like f0, f1, etc. that have little meaning
#However, we will update them with meaningful names below.
d2 = xgb_model.get_booster().get_score(importance_type='gain')
feature_by_gain_noname = pd.DataFrame(
    data=list(d2.items())
    ,columns=['feature','importance_by_gain']
).sort_values('importance_by_gain', ascending = False)
#Get the index values of the features (strip the leading "f" from f0, f1, ...)
to_index = feature_by_gain_noname['feature'].str.slice(1,).astype(int).tolist()
#Create a data frame with the feature names in gain order
#(skip the first category, which is the target column)
names_in_order = var_c.categories[1:][to_index].to_frame()
names_in_order.columns = ['feature']
names_in_order.index = range(len(names_in_order.index))
#Create a data frame that does not have the dummy name column
by_gain = feature_by_gain_noname['importance_by_gain'].to_frame()
by_gain.columns = ['importance_by_gain']
by_gain.index = range(len(by_gain.index))
#Join the data frame with names to the gain values
feature_by_gain = names_in_order.join(by_gain)
feature_by_gain.head(10)
Top 10 Variables by Gain

Insights from Model using SHAP Values

While gain is a valuable metric, it doesn’t tell us if the variable is a positive or negative predictor. For example, does the primary diagnosis V58 increase or decrease the probability of a readmission? This is impossible to say using the gain metric alone.

By using the SHAP package, we can then look at which variables were most impactful in predicting 30-day readmissions. The downside is that SHAP cannot take a sparse matrix as an input (remember how I said pandas data frames are often easier to work with?). As a workaround, we can reduce the number of dimensions in our original data set by only including those features that were selected using the sparse matrix implementation. Another option to consider for much larger data sets would be to limit SHAP analysis to only a subset of data.

Let’s say we want to calculate the SHAP values for our holdout data. We can do this by limiting our tall data set to only those columns that were kept when fitting the model on the sparse training data. We then cast that reduced data as a dense pandas data frame (removing the columns XGBoost did not select during the first round of model creation keeps the memory overhead manageable), refit the model on a fresh train/test split of this data, and finally calculate the SHAP values for the holdout portion.

# Keep only the target and the variables XGBoost actually used
filter_df = tall_df.loc[
    tall_df.variable.isin(feature_by_gain.feature)
    | (tall_df.variable == '_admit_lt_30days'),
]
# Pivot the reduced tall data into a dense, wide data frame
filter_df_pivot = filter_df.pivot_table(
    index='encounter_id'
    ,columns='variable'
    ,values='value'
).rename_axis(None, axis=1)
df = filter_df_pivot.reset_index()
df = df.drop(columns='encounter_id')
feature_final = list(df.columns[~df.columns.isin(['_admit_lt_30days'])])
X = df.loc[:,feature_final]
Y = df.loc[:,'_admit_lt_30days']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=652)
# Refit on the dense data, then explain the holdout predictions with SHAP
model = xgb_model.fit(X_train, Y_train)
explainer = shap.Explainer(model, X_test)
shap_values = explainer(X_test)

A very natural and common follow-up is to explain why the model scored certain observations the way it did. For example, let’s look at patients who, according to the model, had a probability of readmission greater than 50%. We might want to understand why the model assigned these admissions such a high probability of readmission.

score_pred_score = xgb_model.predict_proba(X_test)
score_pred_score_df = pd.DataFrame(score_pred_score, columns=['proba_0', 'proba_1'])
# Admissions the model scores as more likely than not to be readmitted
score_pred_score_df.loc[score_pred_score_df.proba_1 > .5,]
Patients with a probability of readmission >50%

The observation with the highest probability of readmission has an index value of 8782 and a chance of readmission of approximately 65%, according to our model. We can look at the SHAP waterfall plot to get some insight into why this admission was scored this way.
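The plot below can be generated with SHAP’s plotting API; here is a minimal sketch, assuming the index 8782 shown above is the positional row in X_test (adjust the index for your own run, since the split is random):

# Explain the single holdout observation with the highest predicted
# probability of readmission (positional index 8782 in this run)
shap.plots.waterfall(shap_values[8782])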

SHAP Waterfall Plot that Explains High Readmission Probability Calculation

Based on the SHAP plot above, this particular observation has a very high probability of readmission largely due to the presence of three particular conditions, visible in the waterfall plot.

Thanks to using the sparse matrix data structure, we were able to gain some valuable insight that would have been very difficult otherwise. We can now see which diagnoses drive higher or lower readmission probabilities among particular admissions, information that would have otherwise been lost had we reduced the cardinality of the diagnosis data. This model can be analyzed further to help identify which diagnoses tend to indicate a high probability of readmission, and it can even be used as a framework to develop a predictive model to identify patients at risk of readmission without intervention. Armed with this information, a hospital or health plan could develop a plan to help prevent readmission among diabetic patients with certain diagnoses.

Further Steps and Other Tips

It’s your turn, you got this! Ashkan Forouzani on Unsplash

I have kept this example intentionally simple for the sake of this tutorial. By using this pipeline, or some modification of it, you will be able to use high-cardinality fields in a way that reduces memory overhead without loss of information. But this just scratches the surface of the power of this technique.

Let’s say you wanted to go back and include the medical specialty feature in this model to see if it helps predict readmissions. Luckily, that’s very easy to do! Just go up to the section where you converted the data into the tall format and add medical specialty to the list of variables included, using the existing code as a template (see the sketch below). That new item will flow through the rest of your pipeline, all the way down to fitting the model in XGBoost and creating the SHAP output. Or let’s say you wanted to create a variable based on the first three characters of the diagnosis code (which groups similar diagnoses together, though note that the sample data set may already have done this for non-diabetic diagnoses). Again, very easy! You only need to go up to the section where you convert the data into the tall format and create a column based on the first three characters of the diagnosis code. In general, adding new features to this pipeline, or removing unnecessary ones, is straightforward. You just need to figure out an observation index (encounter_id in this article), a variable name, and a value to input. You then append this new information to your existing tall data frame, and the rest flows from there.
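As an illustration, here is a minimal sketch of melting a new feature into the tall data frame; it assumes the raw CSV column is named medical_specialty and reuses the same cleanup steps as before:

# Re-read the raw data (df was reassigned earlier in the pipeline)
raw_df = pd.read_csv('diabetic_data.csv', low_memory = False)

# Melt the new feature into the same tall structure used above
spec_df = raw_df[['encounter_id', 'medical_specialty']].melt(id_vars = 'encounter_id')
spec_df.variable = spec_df.variable + '_' + spec_df.value
spec_df.variable = spec_df.variable.str.replace(r'\[|\)', '', regex = True)
spec_df = spec_df.assign(value = 1)

# Append to the existing tall data frame; everything downstream
# (casting to CSR, fitting XGBoost, SHAP) works unchanged
tall_df = pd.concat([tall_df, spec_df])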

As a final note, recall that we one-hot encoded the diagnosis information for this example. However, you are free to put other values into the value column besides 1. Let’s say that, instead of the limited information available to us in this data set, we knew the number of times a patient received a diagnosis in the previous year. If the patient was diagnosed with diagnosis code 250.8 ten times in the past year, value could be populated with 10 instead of 1. You could include all kinds of values in here that you think might be predictive, from dollars spent, to days since the diagnosis was made, to visit count, and on and on. By doing this, you embed valuable information into your sparse matrix that XGBoost can then use to build a better model (e.g., maybe patients who frequently receive a certain diagnosis are more at risk of readmission, and perhaps the model can pick up on this pattern, etc.).
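For instance, a hypothetical row like the one below could be appended to the tall data frame; the encounter ID, variable name, and count are all made up for illustration:

# Hypothetical: this admission had diagnosis 250.8 recorded ten times in
# the prior year, so we store a count of 10 rather than an indicator of 1
extra_rows = pd.DataFrame({
    'encounter_id': [12345678],        # placeholder encounter ID
    'variable': ['diag_count_250.8'],  # hypothetical engineered feature name
    'value': [10],
})
tall_df = pd.concat([tall_df, extra_rows])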

I hope you found this helpful, and I wish you the best of luck as you incorporate sparse matrices into your own models!

