Inside AI
Prediction without accuracy is worse than no prediction. Learn how to save trained machine learning models for future use and how to gauge their accuracy.

In the first part of this series, I discussed a basic, simplified implementation of machine learning algorithms to predict the defect per cent of future purchase orders based on input parameters.
In this part, I will touch on the accuracy metrics of trained models. An estimator takes a number of parameters (known as its hyper-parameters) as arguments that control how it learns. In practice, these hyper-parameters are tweaked based on the accuracy metrics of the trained model before the model is used for prediction in production. Instead of tuning the hyper-parameters manually by trial and error to optimise the accuracy score, it is possible for algorithms to search for and recommend optimised hyper-parameters. I will discuss efficient parameter search strategies in a later part of this series.
As mentioned in the earlier part of this series, it is not pragmatic to train the model from scratch every time before prediction. I will also discuss saving a trained model and importing it directly into another program for prediction.
Note: I will explain new areas and concepts in detail and will avoid repeating the parts explained in my earlier article. I encourage you to refer to the earlier part for those.
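To make the term hyper-parameter concrete, here is a minimal sketch showing that hyper-parameters are simply the arguments passed when an estimator is created. The max_depth=3 value is an arbitrary illustrative choice and is not used anywhere else in this article.

```python
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hyper-parameters are set when the estimator is constructed, before any training
svm_reg = SVR(kernel="linear", C=1)
tree_reg = DecisionTreeRegressor(max_depth=3)  # max_depth=3 is an arbitrary illustrative value

# get_params() lists every hyper-parameter, including those left at their defaults
print(svm_reg.get_params())
print(tree_reg.get_params())
```

Any hyper-parameter that is not passed explicitly keeps its default value, which is why the printed dictionaries are longer than the argument lists.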
Step 1
First, we will import the packages required for our model. StratifiedShuffleSplit is required to build a training set in which the different value ranges are well represented. The pickle module will help us save the trained model and then import it directly into other programs for prediction. Finally, sklearn.metrics has a set of methods to measure the accuracy of any model.
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit # import to keep each value range proportionally represented in the train and test sets
from sklearn.tree import DecisionTreeRegressor # import for Decision Tree Algorithm
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR #import for support vector regressor
from sklearn.metrics import mean_squared_error # import to calculate the mean squared error
Step 2
Read the sample dataset exported from ERP and other applications into a pandas DataFrame. Please refer to the earlier article to understand the structure of the dataset and other details.
SourceData=pd.read_excel("Supplier Past Performance.xlsx") # Load the data into Pandas DataFrame
Step 3
After a cursory analysis of the data sample, it seems that "PO Amount" has a strong influence on "Defect Percent"; hence we would like to make sure that we train the model with "PO Amount" records from different ranges. If we train our model with a dataset over-represented by "PO Amount" values between 30,000 and 60,000 GBP, then what the model learns will not reflect the real-life scenario and it may not predict accurately.
In the code below, a new column "PO Category" is introduced to categorise "PO Amount": values from 0 to 30,000 GBP are classified as PO Category 1, from 30,000 to 60,000 GBP as PO Category 2, and so on.
SourceData["PO Category"]=pd.cut(SourceData["PO Amount"],
bins=[0., 30000, 60000, 90000,
np.inf],
labels=[1, 2, 3, 4])
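As a quick sanity check of the binning, here is a minimal sketch on a handful of made-up PO amounts (the values are illustrative assumptions, not taken from the dataset). Note that pd.cut uses right-closed intervals by default, so an amount of exactly 30,000 falls into category 1, and an amount of exactly 0 would fall outside the first bin and become NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical PO amounts in GBP, chosen only to illustrate the bin edges
demo = pd.DataFrame({"PO Amount": [500, 30000, 30001, 75000, 120000]})

demo["PO Category"] = pd.cut(
    demo["PO Amount"],
    bins=[0., 30000, 60000, 90000, np.inf],
    labels=[1, 2, 3, 4],
)

# Bins are right-closed by default: (0, 30000], (30000, 60000], ...
print(demo)
print(demo["PO Category"].value_counts().sort_index())
```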
Step 4
StratifiedShuffleSplit provides training and test indices to split the past dataset into train and test sets. In the code below, we are reserving 30% of the data for testing the model and the remaining 70% for training it.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3) # a single train/test split is all we need
Step 5

Using the training and test indices from the earlier step, we will divide the initial SourceData into two parts, viz. strat_train_set as the training dataset and strat_test_set as the test dataset for the model.
In the code below, we are using "PO Category" to make sure that the different PO categories are well represented in each of the divided sets.
for train_index, test_index in split.split(SourceData, SourceData["PO Category"]):
    strat_train_set = SourceData.loc[train_index] # stratified train dataset
    strat_test_set = SourceData.loc[test_index] # stratified test dataset
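To see what stratification buys us, here is a minimal sketch on a synthetic DataFrame (the column "Feature", the 50/30/10/10 category mix and random_state=42 are all assumptions made for the illustration). It checks that each category has the same share in the train and test sets. The split returns positional indices, so the sketch uses iloc; loc gives the same result when the DataFrame has the default 0..n-1 index, as it does after read_excel.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the real dataset: 100 rows with a 50/30/10/10 category mix
demo = pd.DataFrame({
    "Feature": np.arange(100),
    "PO Category": [1] * 50 + [2] * 30 + [3] * 10 + [4] * 10,
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_index, test_index = next(split.split(demo, demo["PO Category"]))

# split() yields positional indices, so iloc is the safe accessor
train_set = demo.iloc[train_index]
test_set = demo.iloc[test_index]

# Share of each category in the two sets
train_share = train_set["PO Category"].value_counts(normalize=True).sort_index()
test_share = test_set["PO Category"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```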
Step 6
We introduced the additional column "PO Category" to ensure ample representation of purchase orders from all PO Amount ranges in the test and train datasets. As that has been achieved, we will delete this additional column from our datasets.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("PO Category", axis=1, inplace=True)
Step 7
Now we will define the independent and dependent variables for the training and test data. As this is supervised machine learning, we will train the model using the independent and dependent training datasets. Further, we will test the model performance using the test dataset, which the model has not seen earlier.
SourceData_train_independent= strat_train_set.drop(["Defect Percent"], axis=1)
SourceData_train_dependent=strat_train_set["Defect Percent"].copy()
SourceData_test_independent= strat_test_set.drop(["Defect Percent"], axis=1)
SourceData_test_dependent=strat_test_set["Defect Percent"].copy()
Step 8
As the data attributes have different ranges, we need to scale them before using them for training. Please refer to my earlier article for more information on scaling the data.
In the code below, we are using pickle.dump() to save the fitted scaler as "Scaler.sav" so that we can import it later in other programs.
sc_X = StandardScaler()
X_train=sc_X.fit_transform(SourceData_train_independent.values)
y_train=SourceData_train_dependent
pickle.dump(sc_X, open("Scaler.sav", 'wb'))
X_test=sc_X.transform(SourceData_test_independent.values) # reuse the scaler fitted on the training data; do not refit on test data
y_test=SourceData_test_dependent
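A note on scaling the test data: the scaler should learn its mean and standard deviation from the training data only, and the test data should then be transformed with those same statistics. The minimal sketch below, on made-up numbers, contrasts reusing a pickled scaler with refitting a new one on the test data. The arrays and variable names here are assumptions for illustration only.

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for the train and test independent variables
train_X = np.array([[10000., 5.], [40000., 10.], [70000., 15.]])
test_X = np.array([[40000., 10.], [100000., 20.]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_X)  # learns mean and std from training data only

# Round-trip the fitted scaler through pickle, as a separate program would
restored = pickle.loads(pickle.dumps(scaler))

reused = restored.transform(test_X)             # same scale as the training data
refit = StandardScaler().fit_transform(test_X)  # different scale: statistics come from the test data

print("reused training scaler:\n", reused)
print("scaler refit on test data:\n", refit)
```

The first test row equals the training mean, so the reused scaler maps it to zeros, while the refit scaler maps it to a different value. A model trained on the first scale would misread inputs on the second.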
Step 9
We will train the support vector model with the training dataset and save the trained model as "SVR_TrainedModel.sav" using pickle. Because we pass only a file name rather than a full path, "SVR_TrainedModel.sav" is saved in the current working directory, which is usually the directory the program is run from.
svm_reg = SVR(kernel="linear", C=1)
svm_reg.fit(X_train, y_train)
filename = 'SVR_TrainedModel.sav'
pickle.dump(svm_reg, open(filename, 'wb'),protocol=-1)
Step 10
Now we will predict the dependent variable, i.e. the defect per cent values, from the independent training dataset and measure the error/accuracy of the model. In the code below, we pass the training dataset to the model and compare its predictions with the actual values. The score method returns the R² score for regression estimators.
decision_predictions = svm_reg.predict(X_train)
Score = (svm_reg.score(X_train, y_train)) # It provides the R-Squared Value
print ( "The score of the Support Vector model is", round(Score,2))
lin_mse = mean_squared_error(y_train, decision_predictions)
print("MSE of Vector model is ", round(lin_mse,2))
lin_rmse = np.sqrt(lin_mse) # RMSE is the square root of the MSE (the squared=False argument is deprecated in recent scikit-learn releases)
print("RMSE of Support Vector Learning model is ", round(lin_rmse,2))
I will not go into the statistical details of R-squared, mean squared error and root mean squared error in this article, and I highly encourage you to read the Wikipedia pages on these statistical measures. They will help you interpret whether the model is trained to an acceptable limit or whether we need to fine-tune the data and hyper-parameters.
The score of the Support Vector model is 0.09
MSE of Vector model is 0.05
RMSE of Support Vector Learning model is 0.12
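For readers who prefer code to formulas, the three measures can be computed directly from their standard definitions. The sketch below uses made-up actual and predicted values (not results from the models above) and checks the hand-computed numbers against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted defect percentages, for illustration only
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_pred = np.array([0.12, 0.18, 0.33, 0.37])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                   # average squared error
rmse = np.sqrt(mse)                             # same units as the target
ss_res = np.sum(residuals ** 2)                 # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot                        # share of variation explained

print("By hand:      MSE", mse, "RMSE", rmse, "R2", r2)
print("scikit-learn: MSE", mean_squared_error(y_true, y_pred), "R2", r2_score(y_true, y_pred))
```

An R² close to 1 means the predictions explain most of the variation in the actual values, while a value near 0 means the model does little better than always predicting the mean.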
Step 11
We follow the same steps for the decision tree learning model and check the model accuracy.
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
filename = 'DecisionTree_TrainedModel.sav'
pickle.dump(tree_reg, open(filename, 'wb'),protocol=-1)
predictions = tree_reg.predict(X_train)
Score = (tree_reg.score(X_train, y_train)) # It provides the R-Squared Value
print ( "The score of the Decision Tree model is ", round(Score,2))
lin_mse = mean_squared_error(y_train, predictions)
print("MSE of Decision Tree model is ", round(lin_mse,2))
lin_rmse = np.sqrt(lin_mse) # RMSE of the decision tree predictions (square root of the MSE above)
print("RMSE of Decision Tree model is ", round(lin_rmse,2))
Step 12
Once the model is predicting results within acceptable error limits, we can feed the test dataset, which the model has not seen before, into the model for predictions. We can measure the accuracy on the test dataset in the same way as we did with the training dataset.
test_predictions = tree_reg.predict(X_test)
test_decision_predictions = svm_reg.predict(X_test)
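The comparison between training and test accuracy can be sketched as follows. The report helper is a hypothetical convenience function, and the data is synthetic, standing in for X_train, y_train, X_test and y_test; none of the numbers it prints relate to the supplier dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def report(name, model, X, y):
    """Print and return (R-squared, MSE, RMSE) for a fitted regressor on X, y."""
    predictions = model.predict(X)
    r2 = model.score(X, y)
    mse = mean_squared_error(y, predictions)
    rmse = np.sqrt(mse)
    print(f"{name}: R2={r2:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}")
    return r2, mse, rmse

# Synthetic data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=200)
X_tr, y_tr, X_te, y_te = X[:140], y[:140], X[140:], y[140:]

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_scores = report("Decision tree (train)", tree, X_tr, y_tr)
test_scores = report("Decision tree (test)", tree, X_te, y_te)
```

An unconstrained decision tree can memorise its training data, so a near-perfect training score on its own says little. The score on unseen test data is the one to judge the model by.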
Importing trained model and prediction
To use the trained model in another program, we need to load the fitted scaler for the independent variables and the trained model using "pickle.load", as shown in the code snippet below.
Here we read the new independent-variable dataset from the "Supply Chain Predict.xlsx" file; the further steps for predicting the corresponding dependent variable are the same as explained above for the test data.
import pickle
import pandas as pd
testdata=pd.read_excel("Supply Chain Predict.xlsx") # Load the test data
sc_X = pickle.load(open('Scaler.sav', 'rb')) # Load the pickle
loaded_model = pickle.load(open('DecisionTree_TrainedModel.sav', 'rb')) # load the trained model
X_test=sc_X.transform(testdata.values) # scale the independent variables for test data
decision_predictions = loaded_model.predict(X_test) # Predict the value of dependent variable
print("The prediction by Decision Tree model is " , decision_predictions )
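A slightly more defensive variant of the save and load steps is sketched below, assuming synthetic data and a hypothetical file name "bundle.sav". It stores the scaler and the model together so they cannot get out of sync, uses with-blocks so the files are always closed, and confirms that the reloaded objects give the same predictions as the originals.

```python
import os
import pickle
import tempfile
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data, for illustration only
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(50, 2))
y = 0.01 * X[:, 0] + rng.normal(scale=0.05, size=50)

scaler = StandardScaler().fit(X)
model = DecisionTreeRegressor(random_state=0).fit(scaler.transform(X), y)

# "bundle.sav" is a hypothetical file name; a temp directory keeps the demo tidy
path = os.path.join(tempfile.mkdtemp(), "bundle.sav")
with open(path, "wb") as f:  # the with-block closes the file for us
    pickle.dump({"scaler": scaler, "model": model}, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    bundle = pickle.load(f)

# Predictions from the original and the reloaded objects should match
new_X = np.array([[25.0, 60.0], [80.0, 10.0]])
before = model.predict(scaler.transform(new_X))
after = bundle["model"].predict(bundle["scaler"].transform(new_X))
print(before, after)
```

Two practical cautions apply to pickle files: load them only from sources you trust, because unpickling can execute arbitrary code, and load them with the same scikit-learn version that saved them where possible.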
Full Code Snippet
# Importing the required modules
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit # import to keep each value range proportionally represented in the train and test sets
from sklearn.tree import DecisionTreeRegressor # import for Decision Tree Algorithm
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR #import for support vector regressor
from sklearn.metrics import mean_squared_error # import to calculate root mean square
SourceData=pd.read_excel("Supplier Past Performance.xlsx") # Load the data into Pandas DataFrame
SourceData_independent= SourceData.drop(["Defect Percent"], axis=1) # Drop the dependent variable to keep only the independent variables
SourceData_dependent=SourceData["Defect Percent"].copy() # New series with only the dependent variable values
SourceData["PO Category"]=pd.cut(SourceData["PO Amount"],
bins=[0., 30000, 60000, 90000,
np.inf],
labels=[1, 2, 3, 4])
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3)
for train_index, test_index in split.split(SourceData, SourceData["PO Category"]):
    strat_train_set = SourceData.loc[train_index] # stratified train dataset
    strat_test_set = SourceData.loc[test_index] # stratified test dataset
for set_ in (strat_train_set, strat_test_set):
    set_.drop("PO Category", axis=1, inplace=True)
SourceData_train_independent= strat_train_set.drop(["Defect Percent"], axis=1)
SourceData_train_dependent=strat_train_set["Defect Percent"].copy()
SourceData_test_independent= strat_test_set.drop(["Defect Percent"], axis=1)
SourceData_test_dependent=strat_test_set["Defect Percent"].copy()
sc_X = StandardScaler()
X_train=sc_X.fit_transform(SourceData_train_independent.values)
y_train=SourceData_train_dependent
pickle.dump(sc_X, open("Scaler.sav", 'wb'))
X_test=sc_X.transform(SourceData_test_independent.values) # reuse the scaler fitted on the training data; do not refit on test data
y_test=SourceData_test_dependent
svm_reg = SVR(kernel="linear", C=1)
svm_reg.fit(X_train, y_train)
filename = 'SVR_TrainedModel.sav'
pickle.dump(svm_reg, open(filename, 'wb'),protocol=-1)
decision_predictions = svm_reg.predict(X_test)
Score = (svm_reg.score(X_test, y_test)) # It provides the R-Squared Value
print ( "The score of the Support Vector model is", round(Score,2))
lin_mse = mean_squared_error(y_test, decision_predictions)
print("MSE of Vector model is ", round(lin_mse,2))
lin_rmse = np.sqrt(lin_mse) # RMSE is the square root of the MSE (the squared=False argument is deprecated in recent scikit-learn releases)
print("RMSE of Support Vector Learning model is ", round(lin_rmse,2))
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
filename = 'DecisionTree_TrainedModel.sav'
pickle.dump(tree_reg, open(filename, 'wb'),protocol=-1)
predictions = tree_reg.predict(X_test)
Score = (tree_reg.score(X_test, y_test)) # It provides the R-Squared Value
print ( "The score of the Decision Tree model is ", round(Score,2))
lin_mse = mean_squared_error(y_test, predictions)
print("MSE of Decision Tree model is ", round(lin_mse,2))
lin_rmse = np.sqrt(lin_mse) # RMSE of the decision tree predictions (square root of the MSE above)
print("RMSE of Decision Tree model is ", round(lin_rmse,2))
In the next part of this series, I will discuss hyper-parameters in different machine learning algorithms and the options available to identify optimal values for these parameters for robust machine learning models.
Machine Learning and Supply Chain Management: Hand-on Series #1
Learn a structured approach on how to identify the right independent variables for Machine Learning Supervised Algorithms?