
Predicting Diabetes with Machine Learning – Part II

The final part of an overview of different ML models to predict diabetes

Photo by Mykenzie Johnson on Unsplash

This is the second part of an overview I made of different Machine Learning models, comparing them in predicting diabetes, using the famous ‘diabetes dataset’ provided by the scikit-learn library.

You can find Part I here. Since I go through different ML models to compare them, I recommend reading Part I first. Also, in Part I you will find the complete Exploratory Data Analysis.

Also, at the end of this article, you will find my GitHub repository where I’ve stored the full code of this analysis.


We concluded Part I by saying that the Simple Linear Regression model is not a good model for this ML problem; let’s see what happens if we try a regularized model.


[EDIT 04/06/2022] I’d like to thank Francis Van Schie for contacting me and pointing out a mistake I made in fitting the polynomial method. The mistake has now been fixed.

1. The Regularized Linear Regression Model: Lasso

I want to try a regularized version of the Linear Regression model, and I choose Lasso Regression, since Ridge is best suited when there is high correlation between the variables, and the correlation matrix has shown us that this is not the case here.

#imports
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold, GridSearchCV

#defining the lasso model
model = Lasso()
#define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
#define grid
grid = dict()
grid['alpha'] = np.arange(0, 1, 0.01)
#define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
#performing the search on the train dataset
results = search.fit(X_train, y_train)
#printing
print(f'MAE:{results.best_score_: .2f}')
print(f'Best Alpha:{results.best_params_}')
------------------------
>>>
MAE:-44.86
Best Alpha:{'alpha': 0.01}

The best alpha value is 0.01, and it gives me a fairly high Mean Absolute Error (the fact that the MAE is negative doesn’t matter: sklearn makes it negative for its internal optimization reasons; in any case, a good value should be as close as possible to 0). It should also be noted that alpha = 0.01 is a very small value; alpha = 0 is the case of normal (non-regularized) regression, which, in addition to the high MAE, tells me that this model is not very good for solving this type of ML problem.
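To make sklearn’s sign convention concrete, here is a minimal sketch (reusing the model, cv, and train split defined above) of how to read the cross-validated score as a positive MAE:

#imports
from sklearn.model_selection import cross_val_score

#sklearn maximizes scores, so error metrics are negated:
#'neg_mean_absolute_error' returns -MAE, and values closer to 0 are better
scores = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
#flipping the sign to read the error directly
print(f'MAE: {-scores.mean(): .2f}')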

Apart from the sign of the number, the MAE is practically the same as with the simple regression method. In the above example, the grid was uniform; now I want to try to extend it using the ‘loguniform’ method:

# imports
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

# define model
model = Lasso()
# define evaluation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search space
space = dict()
space['alpha'] = loguniform(1e-5, 100)
space['fit_intercept'] = [True, False]
#note: 'normalize' was removed from scikit-learn in v1.2; drop this line on recent versions
space['normalize'] = [True, False]
#define search
search = RandomizedSearchCV(model, space, n_iter=500, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv, random_state=1)
# execute search
results = search.fit(X, y)
#printing
print(f'MAE:{results.best_score_: .2f}')
print(f'Best Alpha:{results.best_params_}')
-----------------------
>>>
MAE:-44.86
Best Alpha:{'alpha': 0.01}

I found the same best alpha value as before, so I’m going to do the fit with alpha = 0.01 to evaluate the performance of the Lasso-type regularized regression model. What I expect is poor performance, since the MAE has practically the same value as the one seen with the Simple Linear Regression model, but let’s see:

#lasso with best alpha
model_best = Lasso(alpha=0.01).fit(X_train, y_train)
#predictions
y_test_pred = model_best.predict(X_test)
y_train_pred = model_best.predict(X_train)
#R^2
print(f'Coeff. of determination on train set:{model_best.score(X_train, y_train): .2f}') #train set
print(f'Coeff. of determination on test set:{model_best.score(X_test, y_test): .2f}') #test set
---------------------------
>>>
Coeff. of determination on train set: 0.53
Coeff. of determination on test set: 0.46

These values are very similar to the ones obtained with the simple linear regression model, as expected; and this is telling us that even the regularized method is not a good one. This was to be expected, since the best alpha is 0.01, and we have to remember that alpha = 0 is the simple linear regression case.

Anyway, let’s see a couple of visualizations:

#figure size
plt.figure(figsize=(10, 7))
#scatterplot of y_test and y_test_pred
plt.scatter(y_test, y_test_pred)
plt.plot(y_test, y_test, color='r')
#labeling
plt.title('ACTUAL VS PREDICTED VALUES (TEST SET)')
plt.xlabel('ACTUAL VALUES')
plt.ylabel('PREDICTED VALUES')
#showing plot
plt.show()
Actual VS Predicted (by Lasso model) values with linear regression. Image by Author.

As can be seen from the graph above, there is no clear tendency for the points to be distributed around a line. Now, let’s see the KDE:
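The full plotting code is in my GitHub repo; a minimal sketch of how such a KDE plot can be produced (assuming seaborn is imported as sns, as in the residual plot below) could look like this:

#figure size
plt.figure(figsize=(10, 7))
#KDE of the actual values
sns.kdeplot(y_test, label='ACTUAL VALUES')
#KDE of the predicted values
sns.kdeplot(y_test_pred, label='PREDICTED VALUES')
#labeling
plt.title('KDE OF ACTUAL AND PREDICTED VALUES (TEST SET)')
plt.xlabel('DIABETES PROGRESSION')
plt.legend()
#showing plot
plt.show()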

KDE for actual and predicted (by Lasso model) values. Image by Author.

As can be seen from this graph, the probability density of the predicted values does not approximate that of the actual values at all.

Finally, I do a graphical analysis of the residuals:

#figure size
plt.figure(figsize=(10, 7))
#residual plot
sns.residplot(x=y_test, y=y_test_pred)
#labeling
plt.title('RESIDUALS VS PREDICTED VALUES')
plt.xlabel('PREDICTED VALUES (DIABETES PROGRESSION)')
plt.ylabel('RESIDUALS')
#showing plot
plt.show()
Residuals VS predicted (by Lasso model) values. Image by Author.

The residuals are randomly distributed (there is no clear pattern in the plot above), which tells us that the chosen model is not entirely bad; but there are too many high residual values (even over 100), which means that the errors of the model are large. There is no particular tendency to underestimate or overestimate values; there is, however, a slight tendency to high errors, especially in the area with low disease progression values, while the errors decrease a little for high progression values, with the exception of some outliers.

Thus, this graph also confirms the fact that the linear regression model (albeit, regularized) is not a good model for this ML problem, and another one must be sought. So, we have to try with a different model: let’s try the polynomial regression method.

2. The Polynomial Regression Method

Considering the values of the MSE and RMSE and the graphs seen, I try the path of increasing the degree of the polynomial; that is, I try polynomial regression.

Considering the results obtained previously, I am going to use a polynomial of degree 3 directly, as degree 2 immediately seems a little too low to me. However, I do not want to exaggerate with a degree that is too high, as here it is a question of making a fit by directly transforming the available data and then using the functions already seen for linear regression; in practice, if I use a polynomial degree that is too high, I risk overfitting on the training set.

I create the 3rd-degree polynomial features and split the data:

#imports
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

#creating the 3rd degree polynomial features
poly = PolynomialFeatures(degree=3, include_bias=False)
#transforming the values in all X
poly_features = poly.fit_transform(X)
#splitting
X_train3, X_test3, y_train3, y_test3 = train_test_split(poly_features, y, test_size=0.2, random_state=42)

Creating the polynomial regression:

#creating the polynomial regression (a linear regression on the transformed features)
from sklearn.linear_model import LinearRegression
poly_reg = LinearRegression()
#fitting
poly_reg.fit(X_train3, y_train3)
#predictions
y_test3_pred = poly_reg.predict(X_test3)
y_train3_pred = poly_reg.predict(X_train3)

Printing metrics:

#R^2
#train set
print(f'Coeff. of determination on train set:{poly_reg.score(X_train3, y_train3): .2f}') 
#test set
print(f'Coeff. of determination on test set:{poly_reg.score(X_test3, y_test3): .2f}')
---------------
>>>
Coeff. of determination on train set: 0.88
Coeff. of determination on test set: -17.42

The coefficient of determination on the train set, for this model, is much better than the one of the linear regression model. But the coefficient of determination on the test set drops dramatically (with respect to the one on the train set), which leads me to think that an obvious situation happened here: overfitting (on the train set)! Let’s see the other metrics:

#model metrics
print(f'The mean absolute error is:{metrics.mean_absolute_error(y_test3, y_test3_pred): .2f}')
print(f'The root mean squared error is:{np.sqrt(metrics.mean_squared_error(y_test3, y_test3_pred)): .2f}')
------------
>>>
The mean absolute error is: 169.65
The root mean squared error is: 312.43

The MAE and the RMSE are much higher than the ones calculated with the previous linear models! I make a scatter chart of the actual values in comparison with the predicted ones:
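The code is analogous to the scatter plot shown for the Lasso model; a minimal sketch, using the transformed split defined above:

#figure size
plt.figure(figsize=(10, 7))
#scatterplot of y_test3 and y_test3_pred
plt.scatter(y_test3, y_test3_pred)
plt.plot(y_test3, y_test3, color='r')
#labeling
plt.title('ACTUAL VS PREDICTED VALUES (TEST SET)')
plt.xlabel('ACTUAL VALUES')
plt.ylabel('PREDICTED VALUES')
#showing plot
plt.show()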

Actual VS predicted (by polynomial model) values. Image by Author.

Remembering that we have transformed the starting domain with a third-degree polynomial function, we can now see that the data are quite clearly clustered around a line (which is a line in the transformed domain), but there are high outlier values, which leads me to think that this model is not as good as I thought for this particular ML problem. Finally, here too I want to make a visualization with the KDE:

KDE plot for actual VS predicted (by polynomial model) values. Image by Author.

The KDE confirms that polynomial regression is indeed not a good model to use here. Also, due to overfitting, this is the worst model seen so far.


When I showed this work to a senior Data Scientist, he told me: "good; but do you know an ML model which ‘transforms’ your data, leaving linear relations between them?"

My answer was "yes!", and the model is Support Vector Regression. I tried SVR, but the results are very poor; I’m not going into it here since you have understood the method: you can find the results in my GitHub repo, if you want to take a look (and/or try it yourself).
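If you want a starting point before opening the repo, a minimal sketch of an SVR attempt could look like this (the hyperparameter values here are illustrative, not the tuned ones from the repo):

#imports
from sklearn.svm import SVR

#defining the SVR model (the RBF kernel implicitly 'transforms' the data)
svr = SVR(kernel='rbf', C=100, epsilon=0.1)
#fitting on the train set
svr.fit(X_train, y_train)
#R^2 on train and test sets
print(f'Coeff. of determination on train set:{svr.score(X_train, y_train): .2f}')
print(f'Coeff. of determination on test set:{svr.score(X_test, y_test): .2f}')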


Conclusions

In this series of articles, we have seen how different models can perform on a given dataset and what we can expect from the metrics used. Trying different ML models is what we have to do to find one which gives us the best predictions; in this case, we found the linear regression models to be the best among all the models we tried, but even they are not good in absolute terms; so other models have to be tried to solve this ML problem.


Thanks for reading!

You can find my GitHub repo with full code here.


Let’s connect!

MEDIUM

LINKEDIN (send me a connection request)

If you want, you can subscribe to my mailing list so you can stay always updated!


Consider becoming a member: you could support me and other writers like me with no additional fee. Click here to become a member.

