Build Fish Weight Prediction Model

· Introduction
· Part 2.1 Build Machine Learning Pipeline
  ∘ Step 1: Collect the data
  ∘ Step 2: Visualize the data (Ask yourself these questions and answer)
  ∘ Step 3: Clean the data
  ∘ Step 4: Train the model
  ∘ Step 5: Evaluate
  ∘ Step 6: Hyperparameter tuning using hyperopt
  ∘ Step 7: Choose the best model and prediction
· Part 2.2 Analyze ML algorithms
  ∘ What is a Decision Tree?
  ∘ What is Random Forest?
  ∘ What is Extreme Gradient Boosting? (XGBoost)
  ∘ Decision Tree vs Random Forest vs XGBoost
  ∘ Linear Models vs Tree-Based models
· Conclusion
Introduction
As I explained in my previous post, a real data scientist thinks from a problem/application perspective and finds an approach to solve it with the help of programming languages or frameworks. In Part 1, the fish weight estimation problem was solved using linear ML models; today I will introduce tree-based algorithms such as Decision Tree, Random Forest, and XGBoost to solve the same problem. In the first half of the article (Part 2.1) I will build a model, and in the second half (Part 2.2) I will explain each algorithm theoretically, compare them to each other, and discuss their advantages and disadvantages.
Part 2.1 Build Machine learning Pipeline
To build an ML model, we need to follow the pipeline steps below; they apply to almost all kinds of models.

Since the problem we are solving is the same as before, some pipeline steps stay the same, such as 1. collecting the data and 2. visualizing the data. However, there will be some modifications to the other steps.
Step 1: Collect the data
The data is a public dataset that can be downloaded from Kaggle.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import combinations
import numpy as np
data = pd.read_csv("Fish.csv")
Step 2: Visualize the data (Ask yourself these questions and answer)
- What does the data look like?
data.head()

- Does the data have missing values?
data.isna().sum()

- What is the distribution of the numerical features?
data_num = data.drop(columns=["Species"])
fig, axes = plt.subplots(len(data_num.columns)//3, 3, figsize=(15, 6))
i = 0
for triaxis in axes:
    for axis in triaxis:
        data_num.hist(column=data_num.columns[i], ax=axis)
        i = i + 1

- What is the distribution of the target variable(Weight) with respect to fish Species?
sns.displot(
    data=data,
    x="Weight",
    hue="Species",
    kind="hist",
    height=6,
    aspect=1.4,
    bins=15
)
plt.show()

The distribution of the target variable with respect to species shows that some species, such as Pike, have a huge weight compared to others. This visualization gives us additional information on how the "Species" feature can be used for prediction.
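A quick way to quantify this observation (a small sketch, assuming the data DataFrame loaded above) is to summarize the weight per species:

data.groupby("Species")["Weight"].agg(["count", "mean", "max"]).sort_values("mean", ascending=False)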
Step 3: Clean the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
data_cleaned = data.drop("Weight", axis=1)
y = data['Weight']
x_train, x_test, y_train, y_test = train_test_split(data_cleaned,y, test_size=0.2, random_state=42)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# label encoder
label_encoder = LabelEncoder()
x_train['Species'] = label_encoder.fit_transform(x_train['Species'].values)
x_test['Species'] = label_encoder.transform(x_test['Species'].values)

We are using tree-based models, therefore we do not need feature scaling. In addition, to convert text into numbers, I simply assigned a unique numerical value to each fish species using LabelEncoder.
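If you want to check which integer each species was mapped to, a minimal sketch using the label_encoder fitted above:

# Inspect the species -> integer mapping produced by LabelEncoder
species_mapping = dict(zip(label_encoder.classes_,
                           label_encoder.transform(label_encoder.classes_)))
print(species_mapping)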
Step 4: Train the model
def evaluation_model(pred, y_val):
    score_MSE = round(mean_squared_error(y_val, pred), 2)
    score_MAE = round(mean_absolute_error(y_val, pred), 2)
    score_r2score = round(r2_score(y_val, pred), 2)  # r2_score expects (y_true, y_pred)
    return score_MSE, score_MAE, score_r2score
def models_score(model_name, train_data, y_train, val_data, y_val):
    model_list = ["Decision_Tree", "Random_Forest", "XGboost_Regressor"]
    # model_1
    if model_name == "Decision_Tree":
        reg = DecisionTreeRegressor(random_state=42)
    # model_2
    elif model_name == "Random_Forest":
        reg = RandomForestRegressor(random_state=42)
    # model_3
    elif model_name == "XGboost_Regressor":
        reg = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)
    else:
        print("please enter a correct regressor name")

    if model_name in model_list:
        reg.fit(train_data, y_train)
        pred = reg.predict(val_data)
        score_MSE, score_MAE, score_r2score = evaluation_model(pred, y_val)
        return score_MSE, score_MAE, score_r2score
model_list = ["Decision_Tree","Random_Forest","XGboost_Regressor"]
result_scores = []
for model in model_list:
score = models_score(model, x_train, y_train, x_test, y_test)
result_scores.append((model, score[0], score[1],score[2]))
print(model,score)
I trained a Decision Tree, a Random Forest, and XGBoost, and stored all the evaluation scores.
Step 5: Evaluate
df_result_scores = pd.DataFrame(result_scores, columns=["model", "mse", "mae", "r2score"])
df_result_scores

The result is really fascinating: as you may remember, the linear models performed considerably worse (their scores are also shown below). So even before any hyperparameter tuning, we can say that all the tree-based models outperform the linear models on this dataset.

Step 6: Hyperparameter tuning using hyperopt
Today we use hyperopt to tune hyperparameters with the TPE algorithm. Instead of drawing random values from the search space, TPE takes into account that some hyperparameter assignments are known to be irrelevant given particular values of other hyperparameters. This makes the search more effective than random search and faster than grid search.
from hyperopt import hp
from hyperopt import fmin, tpe, STATUS_OK, STATUS_FAIL, Trials
from sklearn.model_selection import cross_val_score
num_estimator = [100, 150, 200, 250]

space = {
    'max_depth': hp.quniform("max_depth", 3, 18, 1),
    'gamma': hp.uniform('gamma', 1, 9),
    'reg_alpha': hp.quniform('reg_alpha', 30, 180, 1),
    'reg_lambda': hp.uniform('reg_lambda', 0, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': hp.choice("n_estimators", num_estimator),
}
def hyperparameter_tuning(space):
    model = xgb.XGBRegressor(
        n_estimators=space['n_estimators'],
        max_depth=int(space['max_depth']),
        gamma=space['gamma'],
        reg_alpha=int(space['reg_alpha']),
        min_child_weight=space['min_child_weight'],
        colsample_bytree=space['colsample_bytree'],
        objective="reg:squarederror"
    )
    # neg_mean_absolute_error is negative, so -score_cv is the MAE we want to minimize
    score_cv = cross_val_score(model, x_train, y_train, cv=5, scoring="neg_mean_absolute_error").mean()
    return {'loss': -score_cv, 'status': STATUS_OK, 'model': model}
trials = Trials()
best = fmin(fn=hyperparameter_tuning,
            space=space,
            algo=tpe.suggest,
            max_evals=200,
            trials=trials)
print(best)

Above are the best hyperparameters found by the algorithm after 200 trials. If the dataset is very large, the number of trials can be reduced accordingly.
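If you want to see the cross-validated MAE that corresponds to the best configuration, a small sketch using the trials object populated above (best_trial is part of hyperopt's Trials API):

# The objective returned the cross-validated MAE as the loss,
# so the best trial's loss is the CV MAE of the best configuration found
print("best CV MAE:", trials.best_trial["result"]["loss"])
print("total trials:", len(trials.trials))

Now we retrain the model on the full training set using these hyperparameters: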
best['max_depth'] = int(best['max_depth'])  # convert to int
best["n_estimators"] = num_estimator[best["n_estimators"]]  # hp.choice returns an index, so map it back to the value
reg = xgb.XGBRegressor(**best)
reg.fit(x_train, y_train)
pred = reg.predict(x_test)
score_MSE, score_MAE, score_r2score = evaluation_model(pred, y_test)
to_append = ["XGboost_hyper_tuned", score_MSE, score_MAE, score_r2score]
df_result_scores.loc[len(df_result_scores)] = to_append
df_result_scores

The result is fantastic! The hyperparameter-tuned model performs really well compared to the other algorithms. For instance, tuning improved XGBoost's MAE from 41.65 to 36.33. It is a great illustration of how powerful hyperparameter tuning is.
Step 7: Choose the best model and prediction
# winner
reg = xgb.XGBRegressor(**best)
reg.fit(x_train,y_train)
pred = reg.predict(x_test)
plt.figure(figsize=(18,7))
plt.subplot(1, 2, 1) # row 1, col 2 index 1
plt.scatter(range(0,len(x_test)), pred,color="green",label="predicted")
plt.scatter(range(0,len(x_test)), y_test,color="red",label="True value")
plt.legend()
plt.subplot(1, 2, 2) # index 2
plt.plot(range(0,len(x_test)), pred,color="green",label="predicted")
plt.plot(range(0,len(x_test)), y_test,color="red",label="True value")
plt.legend()
plt.show()

The visualization clearly illustrates how close the predicted and true values are to each other and how well the tuned XGBoost performed.
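To use the chosen model on a new, unseen fish, you only need to encode the species and keep the same column order as the training data. A minimal sketch; the measurements below are hypothetical and only for illustration, and the column names are those of the Kaggle dataset loaded above:

# Hypothetical measurements for a single new fish (same columns/order as x_train)
new_fish = pd.DataFrame([{
    "Species": label_encoder.transform(["Perch"])[0],  # encode the species name (must appear in the training data)
    "Length1": 25.0,   # hypothetical length measurement (cm)
    "Length2": 27.0,   # hypothetical length measurement (cm)
    "Length3": 29.0,   # hypothetical length measurement (cm)
    "Height": 8.0,     # hypothetical height (cm)
    "Width": 4.5       # hypothetical width (cm)
}])[x_train.columns]   # enforce the training column order

print("predicted weight (g):", reg.predict(new_fish)[0])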
Part 2.2 Analyze ML algorithms
What is a Decision Tree?
A Decision Tree is a supervised ML algorithm that is good at capturing non-linear relationships between the features and the target variable. The intuition behind the algorithm is similar to human logic: at each node, the algorithm finds the feature and threshold on which the data is split into two parts. Below is an illustration of a Decision Tree.

First, let’s see what each variable in the figure represents, taking the first (root) node as an example.
width ≤ 5.154: the feature and value threshold on which the algorithm decided to split the data samples.
samples = 127: there are 127 data points before the split.
value = 386.794: the average value of the target variable (fish weight) for these samples.
squared_error = 122928.22: the same as MSE(true, pred), where pred equals value (the average fish weight of the samples).
So, based on the width ≤ 5.154 threshold, the algorithm split the data into two parts. But how did the algorithm find this threshold? There are several splitting criteria; for a regression task, the CART algorithm greedily searches for the feature k and threshold t_k that minimize the weighted average of the MSE of the two resulting subgroups.

For instance, in our case the first split gives the lowest weighted average MSE of the two subgroups compared to any other split:
J(k, t_k) = (m_left / m) * MSE_left + (m_right / m) * MSE_right = 88/127 * 20583.394 + 39/127 * 75630.727 = 37487.69
where each subgroup's MSE is weighted by its share of the 127 samples.
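A tree diagram like the one described above can be drawn with scikit-learn's plot_tree utility. A minimal sketch, assuming you fit a fresh DecisionTreeRegressor on the training data from Step 3 (the exact thresholds will match the figure only with the same split and random state):

from sklearn.tree import plot_tree

# Fit a decision tree and draw only its first two levels for readability
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(x_train, y_train)

plt.figure(figsize=(14, 6))
plot_tree(tree_reg,
          feature_names=x_train.columns,
          max_depth=2,
          filled=True,
          fontsize=10)
plt.show()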
Problem with a Decision Tree:
Trees are very sensitive to small variations in the training data. A small change in the data can result in a major change in the structure of the decision tree. The solution to this limitation is a random forest.
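A quick way to see this instability for yourself is to fit the same tree on two slightly different subsamples of the training data and compare the root split it chooses. A rough sketch; tree_.feature and tree_.threshold are scikit-learn's internal arrays describing each node:

# Fit the same model on two random 80% subsamples and compare the root split
for seed in (0, 1):
    sample = x_train.sample(frac=0.8, random_state=seed)
    t = DecisionTreeRegressor(random_state=42).fit(sample, y_train.loc[sample.index])
    root_feature = x_train.columns[t.tree_.feature[0]]
    root_threshold = t.tree_.threshold[0]
    print(f"subsample {seed}: root split on {root_feature} <= {root_threshold:.3f}")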
What is Random Forest?
Random Forest is an ensemble of Decision Trees. The intuition behind Random Forest is to build multiple decision trees, and in each tree, instead of searching for the best feature to split on among all features, it searches for the best feature among a random subset of features, which increases tree diversity. However, it is less interpretable than a single decision tree. Also, it needs a large number of trees, which makes it slow for real-time applications: it is generally fast to train but slow to make predictions. XGBoost is another improvement built on decision trees.
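In scikit-learn, the size of that random feature subset is controlled by the max_features parameter. A small sketch of how one might experiment with it (the values here are just examples):

# Compare a forest that considers all features at each split with one that
# only considers a random subset, which increases tree diversity
for max_features in (None, "sqrt"):
    rf = RandomForestRegressor(n_estimators=200, max_features=max_features, random_state=42)
    rf.fit(x_train, y_train)
    mae = mean_absolute_error(y_test, rf.predict(x_test))
    print(f"max_features={max_features}: test MAE = {mae:.2f}")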
What is Extreme Gradient Boosting? (XGBoost)
XGBoost is also a tree-based, ensemble, supervised learning algorithm that uses a gradient boosting framework. The intuition behind the algorithm is that each new predictor is fit to the residual errors made by the previous predictor. It is extremely fast, scalable, and portable.
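The "fit the new predictor to the residuals of the previous one" idea can be illustrated with two plain decision trees. This is only a toy sketch of the boosting intuition, not what XGBoost does internally with its regularized objective:

# First tree fits the target, second tree fits the residual errors of the first
tree1 = DecisionTreeRegressor(max_depth=3, random_state=42)
tree1.fit(x_train, y_train)

residuals = y_train - tree1.predict(x_train)

tree2 = DecisionTreeRegressor(max_depth=3, random_state=42)
tree2.fit(x_train, residuals)

# The boosted prediction is the sum of the two trees' predictions
boosted_pred = tree1.predict(x_test) + tree2.predict(x_test)
print("single tree MAE:", round(mean_absolute_error(y_test, tree1.predict(x_test)), 2))
print("boosted MAE    :", round(mean_absolute_error(y_test, boosted_pred), 2))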
Decision Tree vs Random Forest vs XGBoost
In our experiment, XGBoost outperformed the others in terms of performance. Theoretically, we can conclude that the Decision Tree is the simplest tree-based algorithm: it is perfectly interpretable, but it is unstable, since a small variation in the data can cause a big change in the tree structure. Random Forest and XGBoost are more complex. One of the differences is that Random Forest combines the results of its trees at the end of the process (by averaging, or majority vote for classification), while XGBoost combines the results along the way. In general, XGBoost performs better than Random Forest; however, XGBoost may not be a good choice when the data contains a lot of noise, since it can overfit, and it is also harder to tune than Random Forest.
Linear Models vs Tree-Based models.
- Linear models capture linear relationships between the independent and dependent variables, which is often not the case in real-world scenarios; tree-based models can capture more complex relationships.
- Linear models usually need feature scaling, whereas tree-based models do not.
- Tree-based models often perform better than linear models. Our experiment is a good illustration of that: the best hyper-tuned linear model achieved an MAE of 66.20, while the best tree-based model achieved 36.33, which is a big improvement (a quick sketch comparing the two model families follows this list).
- Tree-based algorithms are more easily interpretable than linear models.
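As a rough sanity check of these points, one could compare a scaled linear regression against an untuned random forest on the same label-encoded features. This is only a sketch, not a reproduction of Part 1's setup, and a label-encoded species column is a questionable feature for a linear model, so treat it purely as an illustration:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A scaled linear baseline vs. an untuned random forest on the same features
linear_model = make_pipeline(StandardScaler(), LinearRegression())
linear_model.fit(x_train, y_train)

forest = RandomForestRegressor(random_state=42).fit(x_train, y_train)

print("linear MAE:", round(mean_absolute_error(y_test, linear_model.predict(x_test)), 2))
print("forest MAE:", round(mean_absolute_error(y_test, forest.predict(x_test)), 2))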
Conclusion
As discussed before, there is no ready-made recipe for which type of algorithm will work best; everything depends on the data and the task. That is why several algorithms should be tested and evaluated. However, it is beneficial to know the intuition behind each algorithm, what their advantages and disadvantages are, and how to cope with their limitations.
Here is the full code on my GitHub.
You can follow me on Medium to stay updated on upcoming articles.