
Pump it Up with CatBoost

Data Mining and a Simple Starter Model

Photo by sofiya kirik on Unsplash

Introduction

This article is based on the Pump it Up competition published by DrivenData about water pumps in Tanzania. The competition data was collected by the Tanzania Ministry of Water and aggregated through Taarifa, an open-source platform. Tanzania is the largest country in East Africa, with a population of about 60 million. Half of the population does not have access to clean water, and two-thirds suffer from poor sanitation. In poor households, family members often have to spend several hours walking to fetch water from pumps. Billions of dollars in foreign aid have been provided to Tanzania to tackle the freshwater problem, yet the government has not been able to solve it: a significant share of the water pumps is entirely out of order or barely functioning, and many others require repair. Tanzania’s Ministry of Water teamed up with Taarifa to launch the DrivenData competition.

Data

The data has many characteristics associated with the water pumps: geographic locations, the organizations that built and manage them, some data about the surrounding regions and local government areas, as well as information about extraction types and the types and amounts of payments. The water supply points are divided into three classes: functional, non-functional, and functional but in need of repair. The goal of the competition is to build a model that predicts the functionality of water supply points.

The modelling data has 59,400 rows and 40 columns, not counting the label, which comes in a separate file.

The metric used for this competition is the classification rate, i.e., the percentage of rows where the predicted class matches the actual class in the test set. It ranges from 0 (worst) to 1 (best), and the goal is to maximize it.

Exploratory Data Analysis

The following set of information about waterpoints is presented for analysis:

  • amount_tsh – Total static head (amount water available to waterpoint)
  • date_recorded – The date the row was entered
  • funder – Who funded the well
  • gps_height – Altitude of the well
  • installer – Organization that installed the well
  • longitude – GPS coordinate
  • latitude – GPS coordinate
  • wpt_name – Name of the waterpoint if there is one
  • num_private – No information
  • basin – Geographic water basin
  • subvillage – Geographic location
  • region – Geographic location
  • region_code – Geographic location (coded)
  • district_code – Geographic location (coded)
  • lga – Geographic location
  • ward – Geographic location
  • population – Population around the well
  • public_meeting – True/False
  • recorded_by – Group entering this row of data
  • scheme_management – Who operates the waterpoint
  • scheme_name – Who operates the waterpoint
  • permit – If the waterpoint is permitted
  • construction_year – Year the waterpoint was constructed
  • extraction_type – The kind of extraction the waterpoint uses
  • extraction_type_group – The kind of extraction the waterpoint uses
  • extraction_type_class – The kind of extraction the waterpoint uses
  • management – How the waterpoint is managed
  • management_group – How the waterpoint is managed
  • payment – What the water costs
  • payment_type – What the water costs
  • water_quality – The quality of the water
  • quality_group – The quality of the water
  • quantity – The quantity of water
  • quantity_group – The quantity of water (duplicates quantity)
  • source – The source of the water
  • source_type – The source of the water
  • source_class – The source of the water
  • waterpoint_type – The kind of waterpoint
  • waterpoint_type_group – The kind of waterpoint

First of all, let’s look at the target – the classes don’t have an even distribution.

It is worth noting the small number of labels for water pumps in need of repair. There are several ways to mitigate the issue of class imbalance:

  • under-sampling
  • over-sampling
  • do nothing and rely on the modelling library’s built-in handling of imbalance (see the sketch below)
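
For the third option, CatBoost can compensate for the skew by itself. A minimal sketch, assuming a reasonably recent CatBoost version:

from catboost import CatBoostClassifier

# Derive per-class weights inversely proportional to class frequencies
# instead of resampling the data.
model = CatBoostClassifier(
    loss_function='MultiClass',
    auto_class_weights='Balanced',
)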

Let’s see how the water pumps are distributed across the territory of the country.

Some features contain empty values.

We can see that relatively few features have missing values, with scheme_name having the largest number.
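
A quick way to get these counts, assuming the training data is loaded into a pandas DataFrame named train (the file name below is hypothetical):

import pandas as pd

# Count missing values per column and show only the affected features.
train = pd.read_csv('train.csv')
missing = train.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))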

The following heatmap shows how the presence or absence of values in one feature correlates with another. The correlation between permit, installer, and funder is worth noting.

Let’s see the general picture of the relationships on the dendrogram.
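
Both the heatmap and the dendrogram can be produced with the missingno library; a sketch, reusing the DataFrame train from above:

import missingno as msno

# Nullity correlation heatmap: how strongly the presence or absence
# of one feature's values predicts another's.
msno.heatmap(train)

# Hierarchical clustering of the same nullity correlations.
msno.dendrogram(train)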

Among the characteristics of the water pumps, there is one that shows the amount of available water. We can check how the amount of water relates to the pumps’ condition (quantity_group).
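
One way to draw this comparison, assuming the labels from the separate file have been joined to train as a status_group column (its name in the competition files):

import seaborn as sns
import matplotlib.pyplot as plt

# Pump condition broken down by the amount of available water.
sns.countplot(data=train, x='quantity_group', hue='status_group')
plt.xticks(rotation=45)
plt.show()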

It can be seen that many wells with sufficient water are not functioning. From the point of view of investment efficiency, it is logical to focus on repairing this particular group first. It is also observed that most dry pumps are not working; if a way can be found to fill these wells with water again, they could probably be returned to service.

Does water quality affect the condition of the water pumps? We can see the data grouped by quality_group.

Unfortunately, this graph is not very informative, since sources with good water dominate. Let’s try to group only the sources with lower-quality water.

Most pumps with an unknown quality_group are non-functional.

There is another interesting characteristic of waterpoints – their type (waterpoint_type_group).

Analysis of the data by waterpoint type shows that the other group contains many inoperative pumps. Are they outdated? We can check how the year a pump was constructed affects its functionality.

A reasonably expected result – the older the waterpoint, the higher the probability that it is not functioning, especially for those built before the 1980s.
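
A sketch of one way to see this pattern, bucketing construction_year into decades; rows with the placeholder year 0 are dropped:

import pandas as pd

# Share of each status per construction decade; year 0 is a
# missing-value placeholder and is excluded.
built = train[train['construction_year'] > 0].copy()
built['decade'] = (built['construction_year'] // 10) * 10
print(pd.crosstab(built['decade'], built['status_group'], normalize='index'))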

Now we will try to get insights from the information about the funding organizations. The condition of the wells should be correlated with funding. Consider only organizations that fund more than 500 waterpoints.
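
Selecting those organizations is straightforward; a sketch:

import pandas as pd

# Keep only funders with more than 500 waterpoints and compare
# the pump condition across them.
funder_counts = train['funder'].value_counts()
big_funders = funder_counts[funder_counts > 500].index
subset = train[train['funder'].isin(big_funders)]
print(pd.crosstab(subset['funder'], subset['status_group'], normalize='index'))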

Danida – a joint Tanzanian-Danish well-funding programme – has many working waterpoints, but the percentage of broken ones is very high. The situation is similar for RWSSP (the Rural Water Supply and Sanitation Programme), Dhv, and a few others. It should be noted that most of the wells financed by the German Republic and by private individuals are in working condition. In contrast, a large share of the wells financed by the state are not functioning, and most of the waterpoints established by the central government and district councils are also not working.

Let us consider the hypothesis that the water’s purity and the water basin to which the well belongs can influence the functioning. First of all, let’s look at the water basins.

Two basins stand out strongly – Ruvuma / Southern Coast and Lake Rukwa – where broken waterpoints are in the majority.

It is known that some of the wells are not free. We can assume that charging for water has a positive effect on keeping the pumps in working order.

The hypothesis is confirmed – payment for water helps to keep the source in working condition.

In addition to categorical parameters, the data contains numeric information that we can look at and maybe find something interesting.

Part of the data is filled with the value 0 instead of real measurements. We can also see that amount_tsh is higher in functional waterpoints (label = 0). The outliers in the amount_tsh feature are also worth paying attention to. Other notable observations are the spread in elevation and the fact that a significant part of the population lives more than 500 metres above sea level.
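
To quantify how much of the numeric data is zero-filled, a quick check along these lines helps:

# Share of zero values in the numeric columns, where 0 often stands
# in for a missing measurement rather than a real reading.
numeric_cols = ['amount_tsh', 'gps_height', 'population', 'construction_year']
print((train[numeric_cols] == 0).mean().sort_values(ascending=False))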

Data Cleaning

Before starting to create a model, we need to clean and prepare the data.

  • The installer feature contains many repeated values with inconsistent case, spelling errors, and abbreviations. Let’s put everything in lowercase first; then, using simple rules, we reduce the number of mistakes and group the values (see the sketch after this list).
  • After cleaning, we replace any item that occurs fewer than 71 times (the 0.95 quantile) with ‘other’.
  • We repeat the same procedure for the funder feature; the cut-off threshold is 98.
  • The data contains groups of features with very similar categories, so we keep only one feature from each group. Since there is not much data in the dataset, we keep the feature with the smallest set of categories and delete scheme_management, quantity_group, water_quality, payment_type, extraction_type, waterpoint_type_group, region_code.
  • Replace outlier values of latitude and longitude with the median values for the corresponding region_code.
  • A similar technique for replacing missing values applies to subvillage and scheme_name.
  • Missing values in public_meeting and permit are replaced with their median values.
  • For subvillage, public_meeting, scheme_name, and permit, we can create separate binary features that flag missing values.
  • The features scheme_management, quantity_group, water_quality, region_code, payment_type, extraction_type, waterpoint_type_group, date_recorded, and recorded_by can be deleted, as they either duplicate other features or are useless.
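
A condensed sketch of a few of these steps (lowercasing installer, grouping rare categories, and imputing outlier coordinates by region); the spelling-correction rules are omitted, and the outlier thresholds below are illustrative assumptions:

import numpy as np

# Normalize case and group rare categories in installer.
train['installer'] = train['installer'].str.lower()
counts = train['installer'].value_counts()
rare = counts[counts < 71].index
train.loc[train['installer'].isin(rare), 'installer'] = 'other'

# Treat near-zero coordinates as missing (Tanzania lies roughly
# within 29-41°E and 1-12°S) and impute with the region median.
train.loc[train['longitude'] < 20, 'longitude'] = np.nan
train.loc[train['latitude'] > -0.5, 'latitude'] = np.nan
for col in ['longitude', 'latitude']:
    train[col] = train[col].fillna(
        train.groupby('region_code')[col].transform('median'))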

Modelling

The data contains a large number of categorical features. The most suitable library for obtaining a baseline model, in my opinion, is CatBoost, a high-performance, open-source library for gradient boosting on decision trees.

We will not search for the optimal parameters; let that be homework. Let’s write a function to initialize and train the model.

from catboost import CatBoostClassifier

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        max_ctr_complexity=5,
        task_type='CPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',
        od_wait=500,
        **kwargs
    )
    # Train with early stopping on the validation pool and keep the
    # best iteration.
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True)

AUC was chosen for evaluation because the data is highly imbalanced, and this metric suits such cases well.

For the target metric, we can write our own function.

import numpy as np

def classification_rate(y, y_pred):
    # Share of predictions that match the true labels.
    return np.sum(y == y_pred) / len(y)

Since there is little data, splitting the dataset into a single train and validation pair is not ideal. In this case, it is better to use OOF (Out-of-Fold) predictions. We will not use third-party libraries; let’s write a simple function instead. Note that the split into folds must be stratified.

import numpy as np
from catboost import Pool
from sklearn.model_selection import StratifiedKFold

def get_oof(n_folds, x_train, y, x_test, cat_features, seeds):
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]

    # Out-of-fold probabilities for the 3 classes, one slice per seed.
    oof_train = np.zeros((len(seeds), ntrain, 3))
    oof_test = np.zeros((ntest, 3))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 3))
    test_pool = Pool(data=x_test, cat_features=cat_features)
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)
        for i, (train_index, test_index) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train.iloc[train_index, :]
            y_tr = y[train_index]
            x_te = x_train.iloc[test_index, :]
            y_te = y[test_index]
            train_pool = Pool(data=x_tr, label=y_tr, cat_features=cat_features)
            valid_pool = Pool(data=x_te, label=y_te, cat_features=cat_features)
            model = fit_model(
                train_pool, valid_pool,
                loss_function='MultiClass',
                random_seed=seed
            )
            oof_train[iseed, test_index, :] = model.predict_proba(x_te)
            oof_test_skf[iseed, i, :, :] = model.predict_proba(x_test)
            models[(seed, i)] = model
    # Average test predictions over folds, then over seeds.
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models

To reduce the dependence on splitting randomness, we will set several different seeds to calculate predictions.
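
Putting it together, a usage sketch; the fold count, seed values, and variable names are illustrative:

from sklearn.metrics import balanced_accuracy_score

# Three seeds, five stratified folds each.
oof_train, oof_test, models = get_oof(
    n_folds=5,
    x_train=x_train, y=y, x_test=x_test,
    cat_features=cat_features,
    seeds=[0, 42, 888],
)

# Class predictions from the averaged OOF probabilities.
y_oof = oof_train.argmax(axis=1)
print('balanced accuracy:', balanced_accuracy_score(y, y_oof))
print('classification rate:', classification_rate(y, y_oof))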

The learning curve of one of the folds

The learning curves look very optimistic, and the model should perform well.

Looking at the model’s feature importances, we can make sure there is no obvious leak.

The feature importance in one of the models

After averaging the predictions:

balanced accuracy: 0.6703822994494413
classification rate: 0.8198316498316498

This is the score that was obtained when the predictions were uploaded to the competition website.

Considering that the top-5 result was only about 0.005 better at the time of writing, we can say that the baseline model is good.

Summary

In this story, we:

  • got acquainted with the data and looked for insights that could guide feature generation;
  • cleaned and prepared the provided data for modelling;
  • chose CatBoost, since the bulk of the features are categorical;
  • wrote a function for OOF predictions;
  • obtained an excellent result for the baseline model.

The right approach to data preparation and the right choice of tools for building a model can give great results even without additional feature engineering.

As a homework assignment, I suggest adding new features, choosing the model’s optimal parameters, using other libraries for gradient boosting, and building ensembles from the resulting models.

The code from the article can be viewed here.

