
Women in Data Science (WiDS) Datathon on Kaggle

My experience participating in WiDS and a promising attempt to predict diabetes.

Photo by Christina @ wocintechchat.com on Unsplash

The 4th Annual WiDS Datathon focuses on social impact, namely patient health, with an emphasis on the chronic condition of diabetes.

The competition is organized by the WiDS Worldwide team at Stanford, the West Big Data Innovation Hub, and the WiDS Datathon Committee, and is hosted on Kaggle.

The dataset for the competition is provided by MIT’s GOSSIS (Global Open Source Severity of Illness Score) initiative.

Additionally, an online conference will take place on March 8, 2021, where top female voices in data science will take the stage and share their insights on a variety of topics.

Just to be on the same page

  • APACHE – Acute Physiology, Age, and Chronic Health Evaluation: a severity score and mortality estimation tool developed in the United States.
  • ICU – Intensive Care Unit: a special department of a hospital or health care facility that provides intensive care medicine.
  • BMI – Body Mass Index: a measure of body fat based on height and weight.

Problem description

The goal of the competition is to determine whether a patient admitted to an ICU has been previously diagnosed with Diabetes Mellitus (a particular type of diabetes) or not.

You should build a model for Diabetes Mellitus prediction, using the data gathered during the first 24 hours of the patient’s intensive care.

The chances are that you are not from a medical background, so I would recommend reading a short overview of the disease here (The American Diabetes Association).

Data exploration

I start data exploration with the DataDictionaryWiDS2021.csv file, which contains a detailed description of all features.

There are 181 features divided into several categories:

  • demographics
  • APACHE covariates
  • APACHE comorbidity
  • vitals
  • labs
  • labs blood gas
  • Diabetes Mellitus flag – target variable

For each feature, the dictionary provides details such as the unit of measure, a description, and an example.
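To get this overview quickly, you can load the dictionary with pandas and count features per category. This is a minimal sketch; the column name "Category" is an assumption about the dictionary file’s layout.

import pandas as pd

# Load the data dictionary shipped with the competition data.
data_dict = pd.read_csv("DataDictionaryWiDS2021.csv")

# Count how many features fall into each category
# ("Category" is an assumed column name).
print(data_dict["Category"].value_counts())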

For the demographic group of features, I decided to look at each one individually and decide whether it will be useful for model building.

  • The age feature will be useful, but we need to transform it further in the feature engineering stage.
  • The BMI feature can also be quite interesting for us because, according to the ADA, anyone with a body mass index higher than 25, regardless of age, who has additional risk factors should be screened for diabetes.
  • Ethnicity should not impact diabetes, and the dataset mostly contains only one ethnicity anyway, so this feature will not be used further.
Number of patients by ethnicity
  • Gender can be useful, but it needs to be encoded because the initial dataset contains only "F" and "M" values.
  • Height and weight are already included in the BMI calculation, so they are not needed further.
  • I will not use any features related to hospital or ICU type, besides icu_id and hospital_admit_source.

The APACHE covariate, vitals, labs, and labs blood gas groups are related to the results of various medical tests and hence contain many features. Instead of checking them one by one, I decided to check the correlation between each of them and the target variable using correlation matrices.

For model training, I will use features whose correlation with the target exceeds a 0.1 threshold. In most cases, the selected features are related to glucose concentration and blood pressure.
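A minimal sketch of this selection step, assuming the training data is loaded from TrainingWiDS2021.csv and the target column is named diabetes_mellitus (both names are assumptions about the competition files):

import pandas as pd

df = pd.read_csv("TrainingWiDS2021.csv")  # assumed training file name

# Absolute correlation of every numeric feature with the target.
corr = df.corr(numeric_only=True)["diabetes_mellitus"].abs()

# Keep features whose correlation exceeds the 0.1 threshold.
selected = corr[corr > 0.1].drop("diabetes_mellitus").index.tolist()
print(selected)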

Correlation matrix for some features from the labs feature group

The APACHE comorbidity group holds questionnaire information on whether the patient has certain diagnoses or conditions. No features from this group are selected, as none of them shows a linear dependency with diabetes.

Correlation matrix for the APACHE comorbidity group of features

It is also quite important to explore how many NA values we have in the dataset. In total, there are ca. 130k observations, so it makes no sense to use features with more than 50k NAs. There is no point in imputing missing values for them either, as that would create 50k observations with the same value and make the model even more biased.

The dataset is half-empty.

Going deeper into this topic, 7 features with significant correlation have more than 70k NAs each, so we will drop them. For reference, they are:

  • h1_bun_min and *_max,
  • h1_creatinine_min and *_max,
  • h1_glucose_min and *_max,
  • h1_diasbp_invasive_min.

There are also observations with four or more missing features. Once their missing values are imputed, such observations become little more than copies of other observations. We have ca. 19k of them, which I drop to keep the original distribution of the dataset. A sketch of both dropping rules follows below.
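A minimal sketch of both rules, under the same file-name assumption as above:

import pandas as pd

df = pd.read_csv("TrainingWiDS2021.csv")  # assumed training file name

# Drop features with more than 50k missing values each.
na_per_column = df.isna().sum()
df = df.drop(columns=na_per_column[na_per_column > 50_000].index)

# Drop observations with four or more missing features, so that
# imputation does not manufacture near-duplicate rows.
df = df[df.isna().sum(axis=1) < 4]
print(df.shape)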

Missing values in the gender feature are replaced with the mode; other NAs are replaced with the median.

Handling the missing values should be done after the train/test split; otherwise, imputation statistics computed on the full dataset would let the training and test sets leak information into each other.
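A sketch of leakage-free imputation, continuing from the cleaned frame df above; the target name and split ratio are assumptions:

from sklearn.model_selection import train_test_split

# Split first, then impute, so the two sets stay independent.
x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns="diabetes_mellitus"), df["diabetes_mellitus"],
    test_size=0.2, random_state=42)

# Imputation statistics are computed on the training set only.
gender_mode = x_train["gender"].mode()[0]
medians = x_train.median(numeric_only=True)

for frame in (x_train, x_test):
    frame["gender"] = frame["gender"].fillna(gender_mode)
    frame.fillna(medians, inplace=True)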

Feature engineering

I chose the final set of features using common sense, the correlation with the target variable, and the NAs threshold. They are:

  • age,
  • bmi,
  • icu_id,
  • hospital_admit_source,
  • gender,
  • d1_bun_min and *_max,
  • d1_creatinine_min and *_max,
  • d1_glucose_min and *_max,
  • h1_glucose_min and *_max,
  • arf_apache, bun_apache, glucose_apache, and creatinine_apache.

As shown in the graph, there is a dependency between age and the percentage of people diagnosed with diabetes.

The age feature can also be grouped into bins. However, it is not clear how many bins should be created and which approach to choose, fixed-width binning or adaptive binning, so you need to experiment and find the most appropriate form; both variants are sketched below. In the end, I added the age feature without any transformations.
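Both variants are one-liners in pandas; a minimal sketch, assuming df is the training frame from the earlier sketches:

import pandas as pd

# Fixed-width binning: five equal-width age intervals.
df["age_bin_fixed"] = pd.cut(df["age"], bins=5)

# Adaptive binning: five quantile-based bins with roughly
# equal numbers of patients in each.
df["age_bin_adaptive"] = pd.qcut(df["age"], q=5, duplicates="drop")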

All categorical features should be transformed using encoding. For instance, the gender feature with "F" and "M" values should be transformed into a feature with 0 and 1 values instead. It can be achieved using one of the scikit-learn encoders, as sketched below.
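For example, with LabelEncoder (a sketch; for multi-valued categoricals such as hospital_admit_source, OneHotEncoder would be the more natural choice):

from sklearn.preprocessing import LabelEncoder

# "F"/"M" -> 0/1; fit on the training set, reuse for the test set.
encoder = LabelEncoder()
x_train["gender"] = encoder.fit_transform(x_train["gender"])
x_test["gender"] = encoder.transform(x_test["gender"])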

Model training

In order to find the best parameters, I used GridSearch with two scoring metrics and optimised the following parameters: n_estimators, max_depth, learning_rate, and alpha. GridSearch also runs cross-validation, so I set the number of folds to 3 (cv=3). This is quite convenient, because you then do not need to run cross-validation separately.

import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Evaluate each candidate on both ROC AUC and accuracy;
# refit the final model on the parameters with the best AUC.
scor = {'AUC': 'roc_auc',
        'Accuracy': metrics.make_scorer(metrics.accuracy_score)}
grid_param = {"colsample_bytree": [1],
              "learning_rate": [0.01, 0.1, 0.5],
              "n_estimators": [250, 300, 500],
              "alpha": [0.1, 0.5, 1],
              "max_depth": [2, 4, 5, 10]}
model = xgb.XGBClassifier(learning_rate=0.1, alpha=0.1)
grid_mse = GridSearchCV(estimator=model, param_grid=grid_param,
                        scoring=scor, cv=3, verbose=1, refit='AUC')
grid_mse.fit(x_train, y_train)
print("Best parameters found: ", grid_mse.best_params_)

Sometimes a parameter’s best value lands on the max or min of my grid values, in which case I run GridSearch again with an extended range for that parameter. Additionally, you can specify several scoring metrics for a more comprehensive evaluation.

There are also alternatives to GridSearch available, e.g. RandomizedSearch or any other method; a sketch follows below.
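For instance, scikit-learn’s RandomizedSearchCV samples a fixed number of parameter combinations instead of exhausting the grid, which is much cheaper on large grids. A minimal sketch, reusing model, grid_param, and scor from the code above:

from sklearn.model_selection import RandomizedSearchCV

# Sample 20 parameter combinations instead of trying the full grid.
random_search = RandomizedSearchCV(
    estimator=model, param_distributions=grid_param,
    n_iter=20, scoring=scor, cv=3, refit='AUC',
    random_state=42, verbose=1)
random_search.fit(x_train, y_train)
print("Best parameters found: ", random_search.best_params_)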

Model evaluation

Having a look at the learning curves, it is clear that the gap between them is small, which indicates low variance in our model. The error on the training set is not particularly high either, since the model reaches an accuracy of ~0.85. Still, we want it to be better.

Adding more observations would not help in this case, because the model already has low variance. Instead, adding more complexity to the model should improve it, i.e. adding new features or using a more complex algorithm.

Additional complexity can help when the learning curves have a small gap between them and you want accuracy to be higher.
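scikit-learn can produce such curves directly; a minimal sketch, reusing the tuned model from the grid search above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Score the tuned model on growing subsets of the training data.
sizes, train_scores, val_scores = learning_curve(
    grid_mse.best_estimator_, x_train, y_train, cv=3,
    scoring='roc_auc', train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label='training')
plt.plot(sizes, val_scores.mean(axis=1), label='validation')
plt.xlabel('Training set size')
plt.ylabel('ROC AUC')
plt.legend()
plt.show()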

The ROC AUC confirms the conclusion made by exploring the learning curves. The area under the curve is 0.85, which is a good result, but it can be improved.
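Computing the AUC itself is a single call once the model predicts probabilities; a sketch against the held-out set:

from sklearn.metrics import roc_auc_score

# Probability of the positive class on the held-out set.
y_proba = grid_mse.best_estimator_.predict_proba(x_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, y_proba))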

Despite being a good starting point, the model by itself definitely cannot give us the desired level of accuracy. It can be used as one of the classifiers in the ensemble model going forward.

Summary

In general, participating in the WiDS competition and workshops gives you an opportunity to learn new approaches, ask questions, and discuss solutions with like-minded professionals.

Compete. Learn. Grow.

Whether you are a novice or a veteran of data science, you can find several tutorials from the WiDS Datathon Committee to help you get started.

