I’ve Trained My First Machine Learning Models: Here’s How It’s Gone
An overview of a couple of ML algorithms and some personal reflections

Finally, I’ve trained my first Machine Learning model! Yes! In the end, I did it. There is so much hype around the field, and I couldn’t wait to finally do it.
But maybe this is the reason for my initial sadness: maybe there is too much hype around the field.
I started studying Data Science and programming around March 2021, when I was:
- studying for the last exam of my Bachelor’s in Mechanical Engineering
- studying for my thesis, which of course relies on Data Science
- working full time
- taking care of everything related to my family
So, of course, it took me a year to get here: training my first ML model. But I want to tell you how it all went, and what I learned (and what I learned really made me happy in the end!).
K-nearest neighbor: the first algorithm I tried
On my way to an Engineering degree, one really important thing I learned (maybe the most important thing of all) is to start small and begin with easy things, so that I can master them over time.
The first models I was advised to study (these days, I’m finishing a course in Data Science) were K-NN and Decision Tree, and I decided to start with K-NN. I searched a little online and found a typical dataset for a good K-NN exercise: one with features like BMI, number of pregnancies, blood pressure, etc., which are related to diabetes (you can find the data online from different sources; for example, here). The purpose of the exercise is to train an ML model to predict whether a patient with such medical conditions is likely to have diabetes.
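Before diving into the real code, here’s the intuition: K-NN classifies a new point by looking at the k closest points in the training set and taking a majority vote among their labels. A minimal sketch of the idea, with made-up numbers (two features per patient, purely for illustration):
import numpy as np
#four 'patients', each described by two features (say, glucose and BMI)
X_train = np.array([[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1]])
y_train = np.array([1, 0, 1, 0])  #1 = diabetes, 0 = no diabetes
new_patient = np.array([150, 30.0])
#Euclidean distance from the new patient to every training point
distances = np.linalg.norm(X_train - new_patient, axis=1)
#indices of the k nearest neighbors, then majority vote on their labels
k = 3
nearest = np.argsort(distances)[:k]
prediction = np.bincount(y_train[nearest]).argmax()
print(prediction)  #1: most of the 3 closest patients have diabetes
That’s really all there is to it; scikit-learn just does this efficiently and with more options.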
So, let’s see some code with some comments.
#importing all the needed libraries
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.preprocessing import MinMaxScaler
#importing data
diab = pd.read_csv('diabetes.csv')
Now, the first thing I want to show you is a package built on top of Pandas, which I discovered a few days ago: ‘pandas profiling’, imported with the code above. This package is very useful in Exploratory Data Analysis because it gives us a lot of information. Let’s see it:
#profiling data
profile = ProfileReport(diab, explorative=True, dark_mode=True)
#output in HTML
profile.to_file('output.html')
This package really gives you a lot of information on the data. For example:
[Image: the overview section of the Pandas Profiling report]
The above image shows what Pandas Profiling gives us in the first instance, which is info on the variables; for example, the dataset has a total of 9 variables, 8 of which are numerical and 1 categorical.
And in fact, if we type ‘diab.head()’ we can see:
[Image: the first rows of the dataset, as returned by diab.head()]
Indeed, we have 8 numerical features (the first 8 columns) and 1 categorical (the last column, named ‘Outcome’, which tells us whether the patient has diabetes, with the value 1, or not, with the value 0).
If you look at the previous image, you can see that the output given by Pandas Profiling tells us there are 15 alerts. In that section, the package gives us more information; for example, it points out the features with high correlations, and much more. I really advise you to take a look at it.
Now, let’s continue our analysis. First of all, we have to deal with the rows containing 0 values; it is not realistic to have a blood pressure of 0 mmHg, for example. To do so, we can substitute the zeros with the mean value of the same column, with the following code:
#filling zeros with the column mean value
non_zero = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
for column in non_zero:
    diab[column] = diab[column].replace(0, np.NaN)
    mean = int(diab[column].mean(skipna=True))
    diab[column] = diab[column].replace(np.NaN, mean)
    print(diab[column])
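As a side note, pandas can do the same substitution without an explicit loop; this is just an equivalent, more compact variant of the code above:
#equivalent, vectorized version of the loop above
diab[non_zero] = diab[non_zero].replace(0, np.NaN)
diab[non_zero] = diab[non_zero].fillna(diab[non_zero].mean().astype(int))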
Now, we can begin training our model. We want the features to be the first 8 columns, so that, given a patient with the values associated with them (number of pregnancies, glucose, blood pressure, BMI, age, etc.), we can predict whether he/she is likely to have diabetes; the ‘Outcome’ column (has/hasn’t diabetes) will be our label.
#all the columns except the last one (features)
X = diab.iloc[:, 0:8]
#the outcome column (label)
y = diab.iloc[:, 8]
#feature scaling
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
#testing and training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0, stratify=y)
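One caveat I learned about later: here the scaler is fitted on the whole dataset before the split, so a little information from the test set ‘leaks’ into the scaling. A slightly more rigorous variant (just a sketch, starting again from the unscaled X) is to split first and fit the scaler on the training set only:
#splitting first, then fitting the scaler on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
#the test set is only transformed, using the min/max learned from the training set
X_test = scaler.transform(X_test)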
Now, let’s fit our model and see our prediction:
#fitting model
model = KNN(n_neighbors=5).fit(X_train, y_train)
#predictions and model accuracy
y_pred = model.predict(X_test)
print(f'Model accuracy on test set: {accuracy_score(y_test, y_pred):.2f}')
And the result is:
Model accuracy on test set: 0.75
"mhmmhmh….pretty good" was my first comment. But can we say something else? I’ve used k=5…is it a good choice? How can I understand if I made a good choice? Well, we can try to find a good ‘k’ with the following code:
#best k
for k in [1, 2, 3, 5, 10]:
    model = KNN(n_neighbors=k)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    validation_accuracy = accuracy_score(y_test, predictions)
    print('Validation accuracy with k {}: {:.2f}'.format(k, validation_accuracy))
With the above for loop, we calculate the test accuracy for each k in the list. The result is:
Validation accuracy with k 1: 0.71
Validation accuracy with k 2: 0.72
Validation accuracy with k 3: 0.75
Validation accuracy with k 5: 0.75
Validation accuracy with k 10: 0.78
So, according to that calculation, the k that gives us the best accuracy is k=10, and we are going to use it. Why did I choose to test those k values? Well, 3 is a typical value; a very small k makes the model unstable (too sensitive to noise), while a very large one just over-smooths the decision boundary. In fact, if you try it yourself, you will see that for values higher than k=10 the accuracy stays practically the same.
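By the way, a more robust way to pick k is cross-validation, so that the choice does not depend on a single train/test split; here is a sketch using scikit-learn’s GridSearchCV:
from sklearn.model_selection import GridSearchCV
#cross-validated search for the best k (5 folds on the training set)
grid = GridSearchCV(KNN(), param_grid={'n_neighbors': [1, 2, 3, 5, 10]}, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)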
Now, let’s train our model with the best k:
#training model with best k
model = KNN(n_neighbors=10).fit(X_train, y_train)
#predictions and model accuracy
y_pred = model.predict(X_test)
Now, let’s look at the confusion matrix to evaluate some metrics:
#confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
--------
array([[89, 11],
       [23, 31]])
Well, the confusion matrix gives us a good result. We have 89+31=120 values predicted correctly (89 true negatives and 31 true positives) and 23+11=34 values predicted wrongly (23 false negatives and 11 false positives). And if we calculate the precision score, which measures how many of the patients predicted as positive actually are positive, we get:
#precision score
precision = precision_score(y_test, y_pred)
print(f'{precision:.2f}')
--------------------
0.74
Which is a satisfying value.
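Precision, though, is only half of the story in a medical context: the false negatives (sick patients predicted as healthy) matter a lot, and they are captured by the recall. From the confusion matrix above, the recall would be 31/(31+23) ≈ 0.57, which we can verify with:
from sklearn.metrics import recall_score
#recall: the fraction of actual positives the model catches
recall = recall_score(y_test, y_pred)
print(f'{recall:.2f}')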
That’s it; hence my (initial) sadness. The model is well trained, and K-NN, with k=10, is a good algorithm to predict whether a patient has diabetes, knowing the other values related to their health state.
Does all the hype around Machine Learning end here?
Well, I’m a guy who is never satisfied, and especially in this field I have a hunger for knowledge that I cannot satisfy. So, I want to go deeper.
Also, I love graphs… I need to see things. So, I found a way to evaluate my model graphically. For example, we can plot the Kernel Density Estimation; in statistics, the KDE is a non-parametric way to estimate the probability density function of a random variable (you can read more here, for example). We can do it this way:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 10))
#diabetes outcome (Kernel Density Estimation)
ax = sns.kdeplot(y, color="r", label="Actual Value") #y = diab['Outcome']
#predictions (Kernel Density Estimation), in the same plot (ax=ax)
sns.kdeplot(y_pred, color="b", label="Predicted Values", ax=ax)
#labeling title
plt.title('Actual vs Predicted Values')
#showing legend
plt.legend()
#showing plot
plt.show()
[Image: KDE plot of actual vs predicted values]
So, as we can see, the two curves are very similar, confirming the previous analysis.
At this point, I was becoming a little happier with my study, but not happy enough; in fact, I asked myself whether I could do even better, at least in terms of understanding. I had studied the Decision Tree model, and I wanted to see how it would perform on this data.
Decision tree: the second ML algorithm I tried
Since the code related to preprocessing and splitting is the same as before, I’m showing you only the part from the model definition onwards:
from sklearn.tree import DecisionTreeClassifier
#decision tree model and fitting
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train,y_train)
#predictions and model accuracy
y_pred = clf.predict(X_test)
print(f'Model accuracy on test set: {accuracy_score(y_test, y_pred):.2f}')
----------------------
Model accuracy on test set: 0.79
This first result really made me reflect: I got an accuracy of 0.79, while the accuracy for K-NN was 0.78 with k=10 (‘the best’ k!), and I was pretty sure that K-NN would be the better algorithm to use in such a case. I then wanted to find the best depth of the tree, using the same method as for the best k:
#best depth
for d in [3, 5, 10, 12, 15]:
    model = DecisionTreeClassifier(max_depth=d)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    validation_accuracy = accuracy_score(y_test, predictions)
    print('Validation accuracy with d {}: {:.2f}'.format(d, validation_accuracy))
------------------------------
Validation accuracy with d 3: 0.79
Validation accuracy with d 5: 0.79
Validation accuracy with d 10: 0.79
Validation accuracy with d 12: 0.79
Validation accuracy with d 15: 0.79
So, we get the same accuracy with a bunch of different ‘d’ parameters.
Now, if we calculate the confusion matrix we get:
cm = confusion_matrix(y_test, y_pred)
cm
-------------------
array([[84, 16],
       [16, 38]], dtype=int64)
And, finally, the precision score:
#precision score
precision = precision_score(y_test, y_pred)
print(f'{precision:.2f}')
-------------------------------
0.70
And, in the end, I’m not reporting the KDE plot since it is practically the same as the one generated with the K-NN algorithm.
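Since I love graphs, though, I’ll mention one nice thing decision trees offer that K-NN doesn’t: the fitted model itself can be drawn. A minimal sketch using scikit-learn’s plot_tree on the clf fitted above (plt was already imported for the KDE plot):
from sklearn.tree import plot_tree
#drawing the fitted tree, with feature and class names for readability
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=list(diab.columns[:8]), class_names=['no diabetes', 'diabetes'], filled=True)
plt.show()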
These results made me a little sad again because, as I said before, I was pretty sure K-NN was a better algorithm than the Decision Tree in such cases; but, as you can see, the results we get are pretty similar for both models.
So, I asked myself: there are so many algorithms out there to be used in ML; how can one (especially a newbie like me) understand which algorithm suits a given problem well?
And, believe it or not, I found an answer that finally made me definitively happy.
PyCaret
So, I googled a little bit and found PyCaret. As they say on their website, "PyCaret is an open-source, low-code Machine Learning library that automates Machine Learning workflows". To install it, you just need to type:
pip install pycaret
And the work is done, as easily as usual.
Now, what really made me happy was discovering that with PyCaret we can compare the ML models we could use for the problem we are studying, in a very simple way. PyCaret runs the comparison using cross-validation to evaluate the various models, and gives us the metrics we need to decide which algorithm to use for our problem. Let’s see how it works:
from pycaret.classification import *
#defining the numeric features
numeric_features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
#setting up the data
clf = setup(data=diab, target='Outcome', numeric_features=numeric_features)
#comparing the models
best = compare_models()
[Image: the compare_models results table, with models ranked by their metrics]
As we can see, the comparison run by PyCaret is very useful, and it tells us that:
- K-NN and DT are far from being the best algorithms to use in this case (they sit in the lower middle of the ranking), but they give metrics with values similar to ours (remember that we chose the hyperparameters with for loops, while PyCaret uses cross-validation, so the numbers are a little different)
- The best algorithm to use, based on PyCaret’s metric calculations, is the Random Forest Classifier (see the sketch right after this list for what we can do with it)
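And once compare_models has picked a winner, we can keep working with it in PyCaret; a quick sketch of the typical next steps, assuming the same setup as above:
#tuning the hyperparameters of the best model found by compare_models
tuned = tune_model(best)
#evaluating the tuned model on the hold-out set created by setup()
predict_model(tuned)
#retraining on the whole dataset before using the model for real predictions
final = finalize_model(tuned)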
Before concluding, please note that at the time of writing (March 2022), PyCaret is not compatible with all the available scikit-learn versions; I had scikit-learn version 1.0.2 and had to downgrade it to 0.23.2. Doing so is very simple:
pip uninstall scikit-learn
pip install scikit-learn==0.23.2
Conclusions
As we have seen, when learning to train an ML model, trying different algorithms is a good idea if you are a newbie, as I am. These kinds of exercises really give us the chance to get hands-on with projects and algorithms.
Of course, the need to know more and more remains, but with time we can get more experienced, every day. I’m quite sure that PyCaret is a good tool for deciding which algorithm to use for a particular problem, but I’m equally sure that experience (which I’m currently missing, as a newbie) will help, maybe most of all.