Introduction
Before I understood configuration files, my scripts were often very long, repetitive, and inefficient. In addition, every time a variable changed, I spent most of my time making changes in different parts of the script, which was time-consuming. Then I noticed that others were using configuration files as part of their development, so I started exploring and implementing them as well. I realized that things became more efficient, flexible, and organized when using a configuration file.
So what are Configuration Files?
- Configuration files allow us to configure parameters and initial settings.
- Configuration files can be written in formats such as YAML, INI, JSON, and XML.
Configuration files are commonly used to store sensitive information such as database credentials, passwords, and server hostnames, and to manage parameters.
In this article, I will share the difference between using and not using a configuration file in a Machine Learning project. The configuration file format that we will be using is YAML. YAML, which originally stood for Yet Another Markup Language (and is now officially "YAML Ain't Markup Language"), was selected because it needs very little formatting such as braces and brackets, which makes it popular for being easy to read and write.
The scenario in this use case is to perform scoring (prediction) based on different pre-built models. Each model requires a different set of data to perform the prediction, but the source table is the same. The image below illustrates the process we need to build:

Overview of Scoring Process:
- There are 2 models, Model_A and Model_B, that have already been built, each on a different set of data retrieved from the same features table.
- Prepare a scoring script for Model A to predict based on data set A and push the forecast result into the database.
- Prepare a scoring script for Model B to predict based on data set B and push the forecast result into the database.
(For this use case, the model developed is based on a data set taken from Kaggle: Superstore Sales Dataset)
Now let's look at how to implement configuration files, and compare the result with a script that does not use them.
Scoring Script with Config File:
First, let's look at what a configuration file in YAML looks like. Below is an example of a configuration file (model_A.yml) that specifies the segment, model file name, and columns required for Model A to perform model prediction.
We will then load and read the YAML file as part of the scoring script. Below is an example of loading a YAML file and reading the values inside the configuration file with Python. Notice that the config file name is built from the variable model (the model name followed by ".yml") instead of being hard-coded as "model_A.yml", because we have multiple YAML files to load into the script (model_A.yml, model_B.yml). This approach allows us to reuse the same script for different models.
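As a rough sketch, model_A.yml could look like the following. The three keys match the ones read by the scoring script, and the segment and model file name are the values used for Model A elsewhere in this article. The column names are hypothetical placeholders, not the actual columns used by the model.

```yaml
# model_A.yml (illustrative sketch; the column names are assumed)
segment: Consumer
model_file_name: model_A_consumer.pkl
columns:
  - year_week     # appears in the query filter; assumed to also be a feature
  - total_orders  # hypothetical feature column
  - avg_discount  # hypothetical feature column
```

A model_B.yml would follow the same layout with its own segment, model file name, and column list.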
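import yaml

# Build the config file name from the model name, e.g. "model_A" -> "model_A.yml"
config_file = model + ".yml"
with open(config_file, "rb") as file:
    # safe_load parses plain YAML without constructing arbitrary Python objects
    config = yaml.safe_load(file)

segment = config['segment']
model_file_name = config['model_file_name']
columns = config['columns']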
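Because the loaded config is a plain Python dict, a misspelled or missing key only surfaces when it is first looked up. One optional safeguard, sketched below, is to check the expected keys right after loading. The helper name validate_config and the sample values are assumptions for illustration; only the three key names come from the script above.

```python
REQUIRED_KEYS = ("segment", "model_file_name", "columns")


def validate_config(config):
    """Return config unchanged if it has every required key, else raise ValueError."""
    if not isinstance(config, dict):
        raise ValueError("config file must contain a YAML mapping at the top level")
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("config is missing keys: " + ", ".join(missing))
    if not isinstance(config["columns"], list) or not config["columns"]:
        raise ValueError("'columns' must be a non-empty list of column names")
    return config


# Example with a dict shaped like the result of loading model_A.yml
# (the column names here are hypothetical).
config = validate_config({
    "segment": "Consumer",
    "model_file_name": "model_A_consumer.pkl",
    "columns": ["year_week", "total_orders"],
})
```

Failing early with a message that names the missing keys is usually easier to debug than a KeyError raised halfway through scoring.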
In our scoring script, the config variables are used in two places:
(1) The config variables ['segment'] and ['columns'] are used to retrieve the required data set for Model A from Google BigQuery
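from google.cloud import bigquery

client = bigquery.Client()
table_id = 'sales_data.superstore_sales_data_processed'
# Column names come from the config file; segment and year_week are passed
# as positional (?) query parameters, in the order listed below.
sql = "SELECT {columns} FROM `sue-gcp-learning-env.sales_data.superstore_sales_data_processed` where segment = ? and year_week = ?;".format(columns=",".join(columns))
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter(None, "STRING", segment),
        # INT64 is the documented BigQuery type name for integer parameters
        bigquery.ScalarQueryParameter(None, "INT64", int(year_week)),
    ]
)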
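Query parameters can stand in for values such as segment and year_week, but not for column names, which is why the columns are formatted into the SQL string. Since that text comes straight from the config file, it can be worth checking that each entry looks like a plain identifier before building the query. The sketch below shows one way to do that; build_query, the regular expression, and the sample column names are illustrative assumptions, and the table name is the one used in this article.

```python
import re

TABLE = "sue-gcp-learning-env.sales_data.superstore_sales_data_processed"
# Assumed rule: letters, digits, and underscores, not starting with a digit.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_query(columns):
    """Build the scoring query for a list of column names from the config file."""
    for column in columns:
        if not IDENTIFIER.fullmatch(column):
            raise ValueError("unexpected column name in config: " + repr(column))
    return (
        "SELECT {columns} FROM `{table}` "
        "WHERE segment = ? AND year_week = ?;"
    ).format(columns=", ".join(columns), table=TABLE)


# The same function serves both models; only the config-driven column list changes.
sql_a = build_query(["year_week", "total_orders"])  # hypothetical columns for Model A
sql_b = build_query(["year_week", "avg_discount"])  # hypothetical columns for Model B
```

The two placeholders are still filled in by the ScalarQueryParameter entries, so the values themselves never need to be formatted into the string.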
(2) The config variable ['model_file_name'] is used to load the Model A pickle file
import pickle

pickle_file_name = model_file_name
with open(pickle_file_name, 'rb') as pickle_model:
    model = pickle.load(pickle_model)
Now let's look at what our final scoring script looks like:
To run our scoring script for Model A or Model B, we can call the function with the corresponding parameter (model_A or model_B). For example, in the image below, the same scoring script is run for Model A and Model B by passing a parameter with the same name as the YAML file (model_A.yml, model_B.yml).
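One way to pass the model name is from the command line. The sketch below uses argparse from the standard library; the script name score.py, the function names, and the sample year_week value are assumptions, and the real scoring logic is left out.

```python
import argparse


def config_file_for(model):
    """Map a model name such as 'model_A' to its config file, 'model_A.yml'."""
    return model + ".yml"


def parse_args(argv=None):
    """Parse the model name and the week to score from the command line."""
    parser = argparse.ArgumentParser(description="Run the scoring script for one model.")
    parser.add_argument("model", help="config name without the extension, e.g. model_A")
    parser.add_argument("year_week", type=int, help="week to score")
    return parser.parse_args(argv)


# Equivalent to running: python score.py model_A 202101
# (202101 is a made-up week value used only for illustration)
args = parse_args(["model_A", "202101"])
config_file = config_file_for(args.model)
```

Scoring Model B would then only require changing the first argument to model_B.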

Scoring Script without Config File:
Now let's look at the difference when configuration files are not used. In the script below, notice that the segment value required for Model A is hard-coded (segment = 'Consumer').
from google.cloud import bigquery

client = bigquery.Client()
table_id = 'sales_data.superstore_sales_data_processed'
# The segment value and the column selection are hard-coded for Model A
sql = "SELECT * EXCEPT(total_sales) FROM `sue-gcp-learning-env.sales_data.superstore_sales_data_processed` where segment = 'Consumer' and year_week = ?"
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        # INT64 is the documented BigQuery type name for integer parameters
        bigquery.ScalarQueryParameter(None, "INT64", int(year_week)),
    ]
)
In addition, the model file name for Model A is also hard-coded:
import pickle

pickle_file_name = 'model_A_consumer.pkl'
with open(pickle_file_name, 'rb') as pickle_model:
    model = pickle.load(pickle_model)
Below is what the final script looks like without a configuration file. Right now the script covers only ONE model; to use this approach for Model B, we would have to duplicate the script with Model B's variables hard-coded. As we add more models, the duplication keeps growing. For example, if we have 10 models to score and configuration files are not introduced, our script will be roughly 10 times the length of the script below.
If you would like to view the entire code, including the configuration file for Model B, it is available on GitHub.
Conclusion:
Adding config files as part of your software development makes things more manageable, and it has definitely made my life easier. In my current project, we have hundreds of models to develop and score. If I were to hard-code everything, it would take up a lot of time and could reach the point of being unmanageable. I'm glad that I realized the importance of using a configuration file, and I hope this article helps anyone out there start using one in their next project.
References & Links
[1] https://www.analyticsvidhya.com/blog/2021/05/reproducible-ml-reports-using-yaml-configs-with-codes/