Using the Tanzania water pump dataset compiled from Taarifa and the Tanzanian Water Ministry data to explore KNN and predict the functionality of water wells in Tanzania.

Data Science has many wonderful applications, and one of them is its use for social good – to analyze and find innovative, data-driven solutions to complex problems.
This analysis showcases one such example. Through the data pre-processing, cleaning, exploratory data analysis and modelling stages, I will explore the use of K-Nearest Neighbours in a three-class classification analysis of the functionality of water pumps in Tanzania: specifically, predicting whether they are functional, non-functional, or functional but in need of repair.
A little context on the case
Tanzania is an East African country that today is one of several in the world experiencing a severe water crisis, with millions of its people suffering from a lack of access to safe, clean water and sanitation, despite the country’s ample natural water resources.
With an overall population of 58 million, this crisis currently affects an estimated 24.6 million people — over 40% of the population — and has led to a wide array of severe consequences, notably the prevalence of cholera, typhoid and other waterborne infections and diseases. UNICEF estimates that approximately 70% of Tanzania’s health budget is spent on preventable diseases directly tied to the widespread lack of access to clean water and improved sanitation.
Thus, the objective of this analysis is to dive into the problem, focusing on water pump functionality, since water pumps remain a primary source of water for a large portion of the rural population. By building a classification model using K-Nearest Neighbours, I aim to identify possible avenues that could be explored to improve the situation for millions of Tanzanians.
K-Nearest Neighbour (KNN) is an effective and straightforward supervised learning algorithm which can be used for classification and regression problems. It works by calculating the distance between a query point and every point in the training data, then generating a prediction by taking the most common class among the K nearest points.
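The majority-vote idea can be sketched on toy 2-D data (not the pump dataset — the points and labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters with made-up labels
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array(["functional", "functional", "functional",
              "non functional", "non functional", "non functional"])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# A query near the first cluster takes that cluster's majority class
print(knn.predict([[0.5, 0.5]])[0])  # functional
```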
Below is an extract of the dataset used.

Looking at the shape of the data frame, there are 59,400 data entries (rows) and 40 independent variables (columns). A quick run-through to sum up the duplicated entries confirms that each entry is unique.
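These two checks can be sketched as below — a tiny stand-in frame is used here since the real CSV isn’t bundled with this post; on the full dataset `df.shape` returns (59400, 40):

```python
import pandas as pd

# Toy stand-in; in the actual analysis the dataset CSV would be loaded instead
df = pd.DataFrame({"id": [1, 2, 3],
                   "status_group": ["functional", "non functional", "functional"]})

print(df.shape)               # (rows, columns)
print(df.duplicated().sum())  # 0 confirms every entry is unique
```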
Preprocessing the Data
The first step I generally take when cleaning data is to check for null values. Assuming I am interested in keeping most or all of the columns, thinking through how to deal with the missing values gives me a better understanding of the dataset and an idea of which columns can potentially be dropped.
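A minimal sketch of that null check, on a toy frame borrowing a few of the dataset’s column names:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few columns that contain nulls in the real data
df = pd.DataFrame({
    "subvillage": ["Majengo", np.nan, "Madukani"],
    "permit": [True, True, np.nan],
    "gps_height": [1390, 686, 263],
})

# Missing values per column, largest first
print(df.isna().sum().sort_values(ascending=False))
```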

Data cleaning often takes a substantial portion of a Data Scientist’s working time. It can be characterised as the most tedious step, as most of us are impatient to start tinkering with the model and creating visualizations to see whether we can extract any useful insights for the problem we are trying to solve.
However, I have always thought of data cleaning as an art form, especially when dealing with missing data as the decisions taken will directly affect the modelling stage. What are the best decisions that you can make in order to maximize the preservation of your dataset?
In this case, I have decided to deal with it in the following ways (check out Github for the full code):
- Sub village – This column has many unique categories, so I replaced the 371 missing values with a new ‘Unknown’ category
- Public Meeting – This column seems open to some degree of interpretation, but the large majority of entries are ‘True’ (51,011 vs. 5,055). Assuming it indicates whether the area where the water pump is located serves as a place for public meetings, the null values can likely be assumed to be ‘True’ in most cases as well.
- Permit – Among the 3,065 missing values, 2,424 come from unknown installers and/or funders, so for these I assumed there is no permit. For the remaining 370, either the operator or the management is parastatal or the water authority, in which case the permit is assumed to be ‘True’ given the connection with the authorities.
- Funder & Installer – Compared both columns along the lines below:

Since the funder and installer entries differ more often than they match (across the 36,762 rows where neither entry is null), it is safer to fill the missing values of both columns with a new ‘Unknown’ category.
- Management & Scheme Management – both these columns seem to have similar entries overall, as seen below when compared:

There are no null entries in the management column; therefore, it is not too far-fetched to replace the 3,877 missing values in scheme_management with their equivalent in management.
- Scheme Name – This column has many missing and almost entirely unique values, so it is better to drop it, since it is unlikely to provide any value in the analysis.
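The fill decisions above can be sketched as follows — a toy frame is used, the column names follow the dataset, and the conditional permit rule is omitted for brevity since it depends on installer/funder/management lookups:

```python
import numpy as np
import pandas as pd

# Toy frame; column names follow the dataset
df = pd.DataFrame({
    "subvillage": ["Majengo", np.nan],
    "public_meeting": [np.nan, False],
    "funder": [np.nan, "Danida"],
    "installer": [np.nan, "DWE"],
    "scheme_management": [np.nan, "VWC"],
    "management": ["vwc", "vwc"],
    "scheme_name": ["A", np.nan],
})

df["subvillage"] = df["subvillage"].fillna("Unknown")
df["public_meeting"] = df["public_meeting"].fillna(True)
df["funder"] = df["funder"].fillna("Unknown")
df["installer"] = df["installer"].fillna("Unknown")
# Borrow the equivalent value from 'management' where scheme_management is null
df["scheme_management"] = df["scheme_management"].fillna(df["management"])
# Mostly unique and heavily missing, so drop it outright
df = df.drop(columns=["scheme_name"])

print(df.isna().sum().sum())  # 0 -- no missing values remain
```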
In the next step, I selected some additional columns to remove, since they presented identical information (e.g. ‘payment_type’ and ‘payment’).
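Before dropping one column of such a pair, it is worth confirming the two really carry the same information — a sketch on made-up values:

```python
import pandas as pd

# Toy pair of columns that restate each other
df = pd.DataFrame({
    "payment": ["never pay", "monthly", "never pay"],
    "payment_type": ["never pay", "monthly", "never pay"],
})

# Confirm the pair is identical row by row before dropping one of them
print((df["payment"] == df["payment_type"]).all())  # True
df = df.drop(columns=["payment_type"])
```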
Finally, the data preprocessing is over.
Prepping for Modeling
- Defining the dependent and independent variables
The first step in prepping the data is defining the independent and target variables X and y. Since the model is looking to predict pump functionality, the target y is ‘status_group’.
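That split can be sketched as below, on a toy frame with a couple of the dataset’s columns:

```python
import pandas as pd

df = pd.DataFrame({
    "status_group": ["functional", "non functional"],
    "gps_height": [1390, 686],
    "basin": ["Lake Nyasa", "Lake Victoria"],
})

# Target is pump functionality; every other column is a feature
y = df["status_group"]
X = df.drop(columns=["status_group"])
```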

- Address Categorical variables
The Tanzania water dataset has many categorical variables which need to be addressed and dummified prior to modelling. However, a select number of these variables have too many unique values, including ‘funder’ (1,897), ‘installer’ (1,935), ‘subvillage’ (19,288) and ‘ward’ (2,092).
Keeping these variables and creating their dummies would lead to a very complex dataset with thousands of dimensions. Therefore, for the purposes of this analysis, I have decided to remove these columns instead.
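The cardinality check and drop can be sketched as follows (toy values; in the full data ‘funder’ alone has 1,897 categories):

```python
import pandas as pd

df = pd.DataFrame({
    "funder": ["Danida", "Hesawa", "World Bank"],
    "basin": ["Lake Nyasa", "Lake Nyasa", "Pangani"],
})

# Inspect cardinality: high-cardinality columns explode under one-hot encoding
print(df.nunique())

# Drop the columns whose dummies would add thousands of dimensions
df = df.drop(columns=["funder"])
```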

- Finalizing the DataFrame
After creating dummies for all the variables and their respective unique values, the final step is to finalize X for modelling by putting the categorical features (now a DataFrame with 240 columns) and continuous features (previously 6 identified columns) back into the single DataFrame X.
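The dummify-and-reassemble step can be sketched on a toy pair of frames (on the real data the result is the 246-column X described below):

```python
import pandas as pd

categorical = pd.DataFrame({"basin": ["Lake Nyasa", "Pangani"],
                            "source": ["spring", "river"]})
continuous = pd.DataFrame({"gps_height": [1390, 686]})

# One-hot encode the categoricals, then reassemble the single frame X
dummies = pd.get_dummies(categorical)
X = pd.concat([dummies, continuous], axis=1)
print(X.shape)  # (2, 5): four dummy columns plus one continuous column
```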

Its final shape is 59,400 rows and 246 columns. The data is ready for modelling.
Data Modeling using KNN
Now that the data preprocessing, cleaning and prepping are done, modelling the data is as easy as instantiating the model, fitting it on the train set, and predicting.
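A runnable sketch of those three steps — synthetic stand-in data is generated here, whereas in the analysis X and y come from the prepared DataFrame:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in 3-class data; in the analysis X and y come from the prepared frame
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier()   # default parameters, n_neighbors=5
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```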

In this case, I am instantiating with the default parameters, where k (n_neighbors) = 5. For the full detail of available parameters, check out the scikit-learn KNN documentation.
Furthermore, by defining a function to compare the accuracy scores of the model on the train and test set, it is possible to get a better understanding of how well fitted this model is to the data.
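One possible shape for such a comparison function — a sketch on the same synthetic stand-in data, not the exact function from the repo:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def compare_accuracy(model, X_train, X_test, y_train, y_test):
    # A train score well above the test score points to overfitting
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Train accuracy: {train_acc:.2%} | Test accuracy: {test_acc:.2%}")
    return train_acc, test_acc

# Synthetic stand-in data, as in the modelling sketch above
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = KNeighborsClassifier().fit(X_train, y_train)
train_acc, test_acc = compare_accuracy(model, X_train, X_test, y_train, y_test)
```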

In this case, the accuracy is 82.94% on the training set compared with 76.32% on the test set, which suggests some degree of overfitting. One possible avenue to reduce overfitting could be to increase the k-value.
Otherwise, the class with the lowest precision, recall and f1-score is ‘functional needs repair’. Considering the distribution between the categories, this class has by far the fewest cases in the dataset, as seen in the chart below, so the result makes sense.
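The class imbalance behind that chart can be inspected with a simple value count — a toy series with the same imbalance pattern as the real ‘status_group’ is used here:

```python
import pandas as pd

# Toy target mimicking the imbalance of the real 'status_group' column
y = pd.Series(["functional"] * 6 + ["non functional"] * 4
              + ["functional needs repair"] * 1)

print(y.value_counts())  # the 'needs repair' class is by far the smallest
```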

Conclusion
KNN is a versatile model that can be used for classification problems. However, because prediction requires computing distances to every training point, the larger the dataset, the more calculations are required, which makes computational complexity an important factor to consider when choosing this model.
Overall, this model appears to perform moderately well, particularly when predicting the non-functionality of water wells. This could thus provide crucial information to Tanzanian authorities on which areas to prioritize for water & sanitation development projects in order to ease access for communities suffering from limited access to water.
In this case, given the low precision and recall for water pumps that are functional but in need of repair, this model may benefit from a binary split between functional and non-functional instead, in order to triage the areas in most critical need of improved water access in the country.