Python Water Quality — Baseline Classification Model

Assess feature importance when estimating water quality using a reference baseline model

James McNeill
Towards Data Science


Understanding which features can be used to classify water quality can be a challenge. Expert knowledge of a region can provide local insight into the factors that influence its water, but without the time to fully review those details, the opportunity to learn from them and benefit others is reduced. A quantifiable alternative is to collect datasets of the features that impact water quality: once measurements exist, computer science techniques can be applied to gain data-driven insights. In this article, we aim to apply a baseline Machine Learning classification model to help highlight key features. Model predictions will be produced on unseen data, commonly referred to as test data, to validate model performance.

For details of the initial exploratory data analysis performed on the input dataset, the article “Python water quality EDA and Potability analysis” is shared at the end.

Dataset

For this piece of analysis, the Water Quality dataset has been taken from Kaggle¹.

A Jupyter notebook instance running Python code was used for processing.

import sys
print(sys.version) # displays the version of python installed

Running the script above shows that Python version 3.7.10 was used for this analysis. To replicate the results that follow, users should ensure that their working environment runs Python 3.

# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

To begin the process steps, several Python libraries are needed. Each library shown above contains a range of methods, functions, and outputs developed to aid data analysis. It is worth noting that these libraries are built in layers, with one taken as the base and used to produce additional outputs: pandas is built on top of NumPy for data analysis, while seaborn is built on top of matplotlib to aid data visualization.

A common initial step to begin working with Pandas for data analysis is to import a CSV file. The code shown below references the folder containing the file for review.

# Import the dataset for review as a DataFrame
df = pd.read_csv("../input/water-potability/water_potability.csv")

Pre-processing data for modeling

Before beginning to produce classification models, we need to understand and pre-process the data. As mentioned earlier, the EDA (Exploratory Data Analysis) was assessed in a previous article. Using the knowledge gained, a pre-processing data pipeline can be produced. Creating a data pipeline has a range of benefits. Firstly, the steps that have been tested on sample data are used to automate processing for future iterations. Secondly, it allows other users to quickly begin working with a cleaner version of the dataset and avoid having to repeat the same initial steps. Lastly, the data pipeline can be copied or forked by other users, and new additions can be made without impacting the initial pipeline.

With ML modeling, the main aim is a robust data pipeline whose pre-processing steps allow other users to test different ML model algorithms. A great book called Effective Pandas by Matt Harrison shows how method chaining can aid code legibility. We will show two methods that perform similar tasks; readers are welcome to follow either approach.

A common data pre-processing step is to review missing data values. How these ultimately impact the model is unknown until the testing phase, so it is advisable to retain the raw data variable and develop a new variable to allow for later comparison. In the steps below, Method #1 updates the raw data variable directly, while Method #2 assigns the result to a new variable.
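As a minimal sketch of that advice, assuming memory allows for a duplicate, retaining the raw data before any updates could look like this:

# Keep an untouched copy of the raw data for later pre/post comparison
# (illustrative only; the methods below operate on df and df1)
df_raw = df.copy()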

Firstly, we need to review the volume of missing values associated with each variable.

# Understand missing values per variable within DataFrame
(
    df
    .isnull().sum()
)
Output 1.1 Missing value per column within DataFrame

The output highlights three variables with missing values. However, there is wide variation in the totals: Sulfate shows the highest count, with lower values for the other variables. When larger proportions of missing values are present, we must be careful when applying adjustments. Methods that remove the underlying characteristics of the missing values could produce final estimates that do not align with expectations. Expert knowledge of the dataset's domain can help in weighing the different options.
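To see the scale of the problem relative to the size of the dataset, the same check can be expressed as a percentage of rows per column:

# Express missing values as a percentage of all rows in each column
(
    df
    .isnull()
    .mean()
    .mul(100)
    .round(2)
)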

Method #1

# Apply mean value to the missing values
# Note: recent pandas versions discourage inplace=True on a selected column;
# direct assignment, e.g. df['ph'] = df['ph'].fillna(df['ph'].mean()), is the safer pattern there
df['ph'].fillna(df['ph'].mean(), inplace=True)
df['Sulfate'].fillna(df['Sulfate'].mean(), inplace=True)
df['Trihalomethanes'].fillna(df['Trihalomethanes'].mean(), inplace=True)
df.isnull().sum()

Applying the mean of all non-missing values provides a good first approximation. Multiple variables required updating, and the keyword (KW) parameters of the fillna method cater for in-line updates. Including the KW inplace applies the method to the input dataframe df directly, without requiring a copy to be assigned.

Method #2

# Make updates with method chaining, which allows comments to document the column updates.
# The output is assigned to a new dataframe variable (df1)
df1 = (
    df
    # .isnull().sum()
    .assign(ph=lambda df_: df_.ph.fillna(df_.ph.mean()),
            Sulfate=lambda df_: df_.Sulfate.fillna(df_.Sulfate.mean()),
            Trihalomethanes=lambda df_: df_.Trihalomethanes.fillna(df_.Trihalomethanes.mean())
    )
)

# Confirm that the columns have been updated
df1.isnull().sum()

The second method uses chaining to perform the variable updates, with the same mean value adjustment still being applied. Using the assign method with single-line lambda functions allows for greater readability. Another important aspect is that individual lines within the chain can be commented out or back in; if a review of pre- and post-processing were required, it would be a simple step to toggle the relevant lines. The output below highlights that all missing values have been updated.

Output 1.2 Post-processing update made to resolve missing values shows zero null values

With pre-processing complete we are now able to split the dataframe into dependent (target or y) and independent (X) variables.

# Separate into X and y variables
X, y = df1.drop(['Potability'], axis=1), df1['Potability'].values

# Show that only independent variables have been retained
X.head()

Python allows multiple variables to be created on the left-hand side of an assignment within a single line of code. When the names on the left and the values on the right are each separated by a comma, Python unpacks the values into the two new variables.
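In miniature, the same unpacking pattern works for any pair of values:

# Tuple unpacking: both names are assigned in one statement
first, second = 'features', 'target'
print(first, second) # features target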

Output 1.3 Top 5 rows from the DataFrame show only independent variables

The top five rows have been displayed with the head method. The y variable holds the binary target values as a NumPy array.

Classification model — Baseline

A common Python library used to develop ML models is scikit-learn. Within the library repository a wide range of techniques aid with model development. Many years of development have resulted in a mature library that continues to progress.

When building a classification model, many users dive straight into development with the latest ML techniques. However, a better approach is to first develop a baseline model. It can act as a point of reference, with any model estimate below this baseline showing a less effective technique. A good first approximation can be produced before attempting to adjust the model hyper-parameters, the keyword variables that can be tuned to improve ML model performance.

Scikit-learn contains a dummy classifier algorithm that can provide a baseline model. Its output can then be compared against more complex classifiers.

# Dummy classifier - create a baseline accuracy score
from sklearn.dummy import DummyClassifier

# Define the reference model
dummy_clf = DummyClassifier(strategy='most_frequent')

# Fit the model
dummy_clf.fit(X, y)

# Predict the model
dummy_clf.predict(X)

# Evaluate the model
score = dummy_clf.score(X, y)
print(score)

# Print statement displayed value
0.6098901098901099

The model steps above create a model classifier that can then be fit to the input data. A prediction of the target (y) is produced using the predict method. Finally, scoring will show the accuracy of the model.

Because the dummy classifier applies the most frequent value, we are effectively predicting a target value of 0 for every observation. It should be noted that applying this method can provide good context for future predictions.

# Review the dependent variable frequency and percentage
(
    df1
    .Potability
    # .value_counts()
    .value_counts(normalize=True) # display frequencies as a percentage
)

To validate the baseline score of 0.60989, we can perform a value_counts on the target variable. The output below displays the same percentage as the scored prediction.

Output 1.4 Display the percentage portion of the binary target variable

Therefore, should any future classification model result in a lower score, we can discount that model as performing no better than the majority-class baseline.
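A small illustrative helper (the function name is hypothetical, not part of the original analysis) makes this decision rule explicit:

# Hypothetical helper: keep only candidate models that beat the baseline
baseline_score = dummy_clf.score(X, y)

def beats_baseline(model_score, baseline=baseline_score):
    # True only when a candidate outperforms the majority-class baseline
    return model_score > baseline

print(beats_baseline(0.65)) # True, since 0.65 > ~0.6099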

Classification model — Complex approach

Now let's attempt to produce a more complex model to understand the ML challenge. It is with Gradient Boosted Models (GBM) that we will look for improved performance. A GBM is a tree-based ensemble that builds multiple trees sequentially, with each new tree assessing the input data to correct the errors of the trees before it when predicting the target variable. For this exercise, we will use a LightGBM classifier. Alternatives such as XGBoost, which stands for Extreme Gradient Boosting, can be used in future developments.

# Let's try a LightGBM
from lightgbm import LGBMClassifier

# ML Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# ML Performance metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

For the ML model, the common library imports are shown above, with each section grouping the relevant steps used during model build and testing. The pre-processing imports help ensure that a pipeline of steps can be constructed to aid future developments.

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=2, stratify=y)

# Instantiate the LGBM
lgbm = LGBMClassifier()

# Fit the classifier to the training data
lgbm.fit(X_train, y_train)

# Perform prediction
y_pred = lgbm.predict(X_test)

# Print the accuracy
print(lgbm.score(X_test, y_test))

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Firstly, we need to split the input data into training and testing samples. Having this split reduces the chance of overfitting the data. The goal is to create a model with the best performance on unseen data i.e., how models are used in the wild on real-life data. Using the test data to review how the trained model performs with unseen values aims to showcase areas for improvement. Including the keyword parameter stratify ensures that the target variable distribution is aligned across train and test data. Applying this step aims to ensure that the characteristics of the underlying variable distribution are not lost. Model predictions should align with what is observed within the data.
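As a quick sanity check on the stratify argument, the class proportions of the two samples can be compared directly (a small sketch using the variables created above):

# Confirm that stratification preserved the target distribution in both samples
print(pd.Series(y_train).value_counts(normalize=True))
print(pd.Series(y_test).value_counts(normalize=True))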

Assigning the classifier to the variable lgbm allows the user to work with all of the methods (functions) and attributes (data) of the Python object. Standard steps to train the model and score it on the test data are then followed.

The results displayed below highlight how the model performance has increased relative to the baseline model. Scoring against the test data provides comfort that the predictions generalize well. Accuracy has been produced with the score method; it shows the number of correct predictions divided by the total number of predictions made.

Using the classification report highlights the classification metrics of most interest. Precision shows how well the True Positive values have been predicted relative to all positive predictions (True Positives plus False Positives). False Positives correspond to Type I errors, i.e., misclassifying an instance as positive; in medical screening, this produces a misdiagnosis. Recall assesses the True Positive values relative to all actual positives (True Positives plus False Negatives). False Negatives correspond to Type II errors, i.e., misclassifying an instance as negative; in fraud detection, this can result in financial loss.

Output 1.5 Provides details on accuracy and classification metrics
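To make the link between the confusion matrix and these metrics explicit, they can be recomputed by hand from the matrix entries, as in the sketch below using the y_test and y_pred values produced above:

# Derive accuracy, precision, and recall directly from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) # False Positives (Type I errors) lower precision
recall = tp / (tp + fn) # False Negatives (Type II errors) lower recall
print(accuracy, precision, recall)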

When working with ML algorithms, a set of default keyword parameter values is included to produce baseline results. It is the optimization of these initial parameters that can produce better model predictions.

# Let's understand the baseline params
lgbm.get_params()

Using the method get_params on the lgbm variable will display the output shown below. For further details on what each parameter means, users can review the documentation online.

Output 1.6 Default keyword parameter values

The next important step in ML model development is to review the hyperparameter space of potential values. By performing hyperparameter tuning across a relevant space of options it is possible to efficiently produce improved predictions.

# Set up the pipeline
steps = [('scaler', StandardScaler()),
         ('lgbm', LGBMClassifier())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {
    'lgbm__learning_rate': [0.03, 0.05, 0.1],
    'lgbm__objective': ['binary'],
    'lgbm__metric': ['binary_logloss'],
    'lgbm__max_depth': [10],
    'lgbm__n_estimators': [100, 200, 300]
}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate the GridSearchCV object
cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = cv.predict(X_test)

Introducing a pipeline step that scales the numeric variables onto similar ranges reduces the potential for variables with larger numeric values to dominate. Two steps have been included within the pipeline: producing scaled independent variables, then applying the LGBM classifier. Producing code in this format helps other users understand the pre-processing steps, and a pipeline can include many more steps depending on the complexity required.

A parameter dictionary has been produced to allow a mixture of hyperparameter inputs to be tested. Prefixing each parameter name with the pipeline step name and a double underscore (e.g., lgbm__learning_rate) lets scikit-learn recognize which step of the pipeline the parameter should be applied to.

GridSearchCV fits and evaluates a model for every combination of the input parameters. By setting the cv (cross-validation) parameter to 3, it performs three-fold cross-validation: each fold holds out a different sample for validation while the model is trained on the remainder. The aim is to ensure that a model does not overfit a unique aspect present within only one sample of the input independent variables.
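The fold-by-fold behaviour can also be inspected on its own with cross_val_score, sketched below for the pipeline defined above:

# Inspect the individual cross-validation accuracy scores for the pipeline
from sklearn.model_selection import cross_val_score

fold_scores = cross_val_score(pipeline, X_train, y_train, cv=3)
print(fold_scores) # one score per fold
print(fold_scores.mean()) # average across the three folds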

# Display best score and params
print(f'Best score : {cv.best_score_}')
print(f'Best params : {cv.best_params_}')

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))

Once processing has been completed, the best score and hyperparameters for the lgbm can be reviewed. As we have only reviewed a small number of potential hyperparameters, users could have identified this best model using a more brute-force manual approach. However, the real benefit of the grid search comes with a much larger hyperparameter input space.

Output 1.7 Results from the hyperparameter tuning of the LGBM classifier

After selecting the best parameters for the lgbm we can see an improvement in the model accuracy. It is this parameter selection that could be applied when new data is available to make future predictions.

Further steps to improve model performance could include a review of the relationships between the independent variables via correlation analysis, as sketched below. Whether some variables require more refined pre-processing of missing values could also be assessed.
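As a minimal sketch of the first suggestion, using only the libraries already imported, the pairwise correlations could be visualized as a heatmap:

# Review pairwise correlations between the independent variables
corr = X.corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between independent variables')
plt.show()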

Conclusion

Within this article, we have reviewed how the inclusion of a baseline ML model can help when assessing how well models are making predictions. Using a model accuracy metric will determine if alternative approaches have provided improvements. Instead of making a blind assessment of the model performance, there is a data-driven approach in place. We also reviewed how pipeline steps and hyperparameter tuning can aid with ML model performance.

Thanks very much for reading! If you have any comments I would appreciate these as well.
