How to Build A First-Time Machine Learning Project (with Full Code)

A Machine Learning Walk-Through using a Facilities Operations Example

Cory Randolph
Towards Data Science


Photo by Crystal Kwok on Unsplash

While Machine Learning can seem overwhelming, knowing where to begin is half the battle. A great jumping-off point is to look wherever a business carries out a repeatable process. With a career background of several years in facilities operations and reliability engineering, my day-to-day experience has given me insight into how machine learning can apply to operations with a repeatable process.

When I first got interested in machine learning, I was searching for ways to apply my newfound interest by developing technical solutions that actually created business value. However, I often found that many of the sources I turned to early on were pitching their own proprietary ML platform or software that I didn’t want to buy. While these companies can provide value, I was determined to learn how to apply Machine Learning without relying on costly third-party solutions.

For this reason, I’ve created a sample Machine Learning project that can be done for free with open source technologies, has a relatively small barrier to entry, and is stripped of any “proprietary data” or “hidden sales pitch” for software you don’t want to buy. Rather, it is designed as a starting place to inspire others to experiment with Machine Learning and possibly utilize the code to create their own solutions. (Even though this example is related to the facilities industry, this walk-through can be applied to any “repeated process” within whatever domain expertise you may have.)

Problem Statement and Background

Getting maintenance done on time is critical for a facility to reliably provide its business value, and many operations today use a CMMS (Computerized Maintenance Management System) to document and control the work going on. Stored within these enterprise database systems are the historical records (often called Work Orders) of when work was completed on time or not. Machine Learning can be used to find the patterns in that data so that proactive decisions can be made to ensure the right work is being done on time.

Machine Learning Project Overview

Here is the high-level overview of what our Machine Learning project will do:

  1. Learn from historical Work Order data
  2. Take features of a Work Order as input
  3. Predict if a future Work Order will be late or not
  4. Provide detailed explanations for how the input affected the prediction
Model Overview (Image by Cory Randolph)

Detailed Walkthrough

For this detailed walk-through, I will only show and explain the relevant code sections, since the entire notebook can be viewed and run in Google Colab here.

The Data

Many facility operations rely on enterprise databases that store historical information about Work Orders. For this project, I created 500 fictionalized Work Orders that mirror the types of information found in such systems. When trying to identify data of your own, I find there are 5 good questions to ask when kicking off a Machine Learning project; I give a detailed explanation of this process in my article 5 Simple Questions to Find Data for a Machine Learning Project.

Sample Data (Image by Cory Randolph)

Explanation of each feature/column:

Department = The department name of the type of work being performed on a given Work Order/Maintenance Task. (e.g. Electrical, Mechanical, etc.)

Name = Name of the technician who completed the work. (Randomly generated fictional names for this sample data)

Estimated Labor Hours = The approximate number of labor hours a given Work Order/Task is expected to take.

Frequency = The frequency interval for when these Work Orders/Tasks will have to be completed again. (e.g. a 90 Day task would be done ~4 times per year (365 Days / 90 Days)).

Past Due = The label for if a particular work order was past due. (e.g. 1 = Work Order was Past Due, and 0 = Work Order completed on time.)

To get the complete data in a Jupyter Notebook:

# Import pandas to work with tabular data in Python
import pandas as pd

# Set the url of where the csv data can be downloaded from
url = 'https://raw.githubusercontent.com/coryroyce/Facilities_ML_Project/main/Data/Maintenace_Past_Due_Sample_Data.csv'

# Load the csv data into a Pandas DataFrame to easily work with the data in python
df = pd.read_csv(url)
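Before moving on, it is worth a quick sanity check that the data loaded as expected (a small extra step beyond the original code):

# Confirm the load: expect 500 rows and 5 columns (4 features plus 1 label)
print(df.shape)
print(df.head())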

Next, we process the data so that it can be used by the ML model by separating the features/inputs (denoted as “X”) from the labels/outputs (denoted as “y”). We then hold out 20% of the data as test data, which will be used to validate that the ML model is actually learning patterns versus just memorizing the training data.

# Import the train/test split helper from scikit-learn
from sklearn.model_selection import train_test_split

# Separate the features/inputs from the labels/outputs
X = df.copy()
y = X.pop('Past_Due')

# Split the data into train and test sets, holding out 20% of the data to test the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

The Machine Learning Model

Now that the data is loaded and prepared, we can begin creating a Machine Learning model. If you’re just getting started with ML, it’s worth noting that our model is a supervised classification model with numerical and categorical feature data. CatBoost is an ML framework that is a good fit here in terms of simplicity and strong performance without much hyper-parameter tuning. To better understand how these ML models work in greater detail, a good place to begin is Google’s Machine Learning Crash Course.

Since we are using the CatBoost library, we can create and train a well-performing model in only 2 lines of code: first, set the model as a CatBoostClassifier; next, fit/train it on the data we prepared earlier.

# Import the CatBoost classifier
from catboost import CatBoostClassifier

# Select the ML Model type
model = CatBoostClassifier()

# Fit the model to the training data, telling CatBoost which features are categorical
model.fit(X_train, y_train, cat_features=['Department', 'Name', 'Frequency'], eval_set=(X_test, y_test))

After running the above code block, we now have a trained machine learning model that we can use to make predictions for future work orders being late or not.

The Metrics

Now that we have a trained machine learning model ready to go, we need to check how “good” our model is and whether it can provide useful predictions about a Work Order being late or not. While there are many metrics and ways to evaluate a model, the most useful and easy-to-understand metric is Accuracy (for a deeper dive into Accuracy, check out Jeremy Jordan’s article, Evaluating a machine learning model). Simply put, accuracy is the percentage of correct predictions a model makes, so an accuracy score of 50% would be the same as a random guess or coin flip when deciding if a Work Order will be late or not.

Using a common ML library called scikit-learn makes getting the accuracy of our model really straightforward.

# Import the metrics module from scikit-learn
from sklearn import metrics

# Store the predictions from the test dataset
preds_test = model.predict(X_test)

# Print the accuracy of the test predictions
print(f'Model Accuracy on test data: {metrics.accuracy_score(y_test, preds_test)*100}%')

While the overall accuracy can vary slightly based on the data split and model training, the most recent accuracy I got was 89%. This means we have a model that will correctly predict whether a Work Order will be late approximately 9 out of 10 times.
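One quick way to confirm that the model is learning patterns rather than just memorizing (the concern we held out the test set to address) is to compare training accuracy against test accuracy. This small check goes beyond the original walkthrough but reuses the same objects:

# Compare accuracy on the training data to the test accuracy above
preds_train = model.predict(X_train)
print(f'Model Accuracy on train data: {metrics.accuracy_score(y_train, preds_train)*100}%')
# A training accuracy far above the test accuracy would suggest overfitting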

The Predictions

With our working and accurate model, the value comes from making predictions on future Work Orders. This is what gives a business the insight to adjust a process so that work at risk of becoming late can be proactively addressed ahead of time.

This next code block shows how to make a prediction for a single Work Order. It could easily be modified to run batch predictions, for example checking all Work Orders due in the next month and exporting the results to a csv, Excel file, or Google Sheet to best fit the operational processes your business is already using (a sketch of that batch version follows the single-prediction example below).

# Manually input any combination of features to get a prediction
# (Note: the order of the data has to match the column order)
sample = ['Electrical', 'Chris', 4, '90 Days']

# Send the sample to the model for a prediction
sample_prediction = model.predict(sample)

# Display the prediction
print(f'Current Sample is predicted as {sample_prediction} \n(Note: 1 = Past Due, 0 = On Time)')
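Here is a minimal sketch of that batch version, assuming a hypothetical upcoming_work_orders.csv export from your CMMS with the same four feature columns as the training data:

# Hypothetical file: next month's open Work Orders with the same feature columns
upcoming_df = pd.read_csv('upcoming_work_orders.csv')

# Predict every upcoming Work Order at once (1 = Past Due, 0 = On Time)
upcoming_df['Predicted_Past_Due'] = model.predict(upcoming_df)

# Export the results to a csv that fits the existing operational process
upcoming_df.to_csv('work_order_risk_report.csv', index=False)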

The Explanations (Bonus)

This working machine learning model for predicting late Work Orders is already in a great place to add business value, but opening up the “Black-Box” of machine learning can help us understand the data better and provide insight into “why” the model made a certain prediction. Being able to provide detailed explanations helps establish the trustworthiness of the project, both internally and with clients/customers.

A tool that is growing in popularity for these explanations is called SHAP. To keep the following portion of the article easy to read, I will leave out the detailed code and only show the visuals and explanations (full code can be found here).
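That said, the core setup is short. Here is a minimal sketch of how the SHAP values behind the following visuals could be computed, assuming the model and data from the earlier code blocks (the exact calls in the full notebook may differ):

import shap
from catboost import Pool

# Wrap the test data in a CatBoost Pool so the categorical features are handled
test_pool = Pool(X_test, y_test, cat_features=['Department', 'Name', 'Frequency'])

# TreeExplainer supports tree-based models like CatBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_pool)

# Bar graph summary plot of overall feature importance
shap.summary_plot(shap_values, X_test, plot_type='bar')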

The first level of explanation comes at the model summary level, and an important question to answer is: “Which features/inputs are most important and how important are they?” There are a few different ways to visually see this with SHAP, and for simplicity we will use the bar graph summary plot.

SHAP Summary Plot of Feature Importance (Image by Cory Randolph)

This summary bar graph shows that the Department has the most influence on the model and that the Estimated Labor Hours has the least influence on the model.

The next question that needs answering is: “How much did each input actually affect the model’s prediction?” Here again the SHAP tool provides a way to see this for any individual Work Order.

SHAP Waterfall Plot for Single Prediction (Image by Cory Randolph)

To understand this graph, we start by reading from the bottom left at E[f(x)], which is the model’s expected (baseline) output for a given Work Order. Movements to the right (in red) show an increased chance of the Work Order being late, and movements to the left (in blue) show a decreased chance of the Work Order being late.

  • Having an Estimated Labor Hours value of 16 hours increases the chance of this Work Order being late.
  • Having Nathan as the technician assigned to the Work Order also increases the chance that it will be late.
  • Having an annual frequency of 360 Days also increases the chance of it being late.
  • Having Plumbing as the department significantly decreases the chance of the Work Order going late.

Overall, since the final f(x) is to the right of the starting place E[f(x)], this Work Order would be predicted as being late, but now we have a detailed explanation of why this conclusion was reached.
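For reference, a waterfall plot like the one above can be produced for any single prediction. Here is a sketch building on the explainer from the summary-plot example (row 0 of the test set is just an example index):

# Package one row's SHAP values as an Explanation object for plotting
single_explanation = shap.Explanation(
    values=shap_values[0],
    base_values=explainer.expected_value,
    data=X_test.iloc[0].values,
    feature_names=list(X_test.columns),
)

# Waterfall plot showing how each feature pushed this prediction up or down
shap.plots.waterfall(single_explanation)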

The last question to ask is: “Which detailed groups of data are causing Work Orders to be late?” In SHAP, the tool to help answer this question is called Dependence Plots. These plots help us do a deep dive into the actual values of an input (both categorical and numerical) and see how they affect the model’s predictions. While the full code file shows these plots for each feature/input, let’s just take a look at the Dependence Plot for the Department.
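For reference, a Dependence Plot for a single feature is one call, reusing the shap_values computed earlier (a sketch; if the raw category strings cause plotting issues, mapping them to pandas category codes first is a common workaround):

# Dependence Plot for the Department feature
shap.dependence_plot('Department', shap_values, X_test)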

SHAP Dependence Plot for Department (Image by Cory Randolph)

To understand this plot, if you were to draw a horizontal line at the 0 value, then points above that line indicate a Work Order is more likely to go late for that Department, and points below 0 indicate a Work Order is more likely to be completed on time. In summary, the HVAC department has a very high likelihood of late Work Orders, whereas Plumbing has a very low likelihood of late Work Orders. Electrical and Mechanical are fairly balanced, with a similar amount of late and on-time Work Orders.

This too becomes actionable data. Since we see a correlation between HVAC and late Work Orders, we can ask further business questions: Is HVAC understaffed? Are there specific types of work in HVAC that cause issues with being late? What is Plumbing doing differently that allows them to finish work on time?

Summary

In summary, an early stage ML project can be broken down into 4 main phases:

  • Defining an initial business problem
  • Determining what repeatable process provides usable data
  • Creating a working machine learning model
  • Generating detailed explanations

Using the code provided as a template, I hope you will venture through each of these 4 phases to create something valuable.

If you go through this tutorial and are able to utilize it on a project of your own, please share your experience and application in the comment section below. I’m genuinely curious to hear about how you’ve been able to use Machine Learning to solve a problem.

