Battle of the Auto ML titans for People Analytics application

Evaluating Tpot & Pycaret Auto Machine Learning libraries for employee attrition prediction

Shilpa Leo
Towards Data Science


Photo by GR Stocks on Unsplash

How powerful would it be to look into the future and predict something that is yet to happen? In Data Science, Machine Learning enables exactly this, ingeniously learning from what has happened in the past to predict what might happen in the future. As we can imagine, the applications of such a technique can be revolutionary in People Analytics, one classic use case being predicting employee resignations.

ML has matured so much since its advent that budding Data Scientists who are new to the concept, and usually daunted by the very idea of predictive analytics, can now also jump into exploring ML and shorten their time-to-solution. Auto ML is the gateway enabling just this: evaluating and applying the best ML pipeline for your dataset to perform predictive analytics. Thus, I would highly encourage People Analysts (with programming knowledge) to take these reassuring steps into the world of predictive analytics. In this article, we'll evaluate two powerful auto ML libraries, Tpot and Pycaret, on an open-source dataset to predict attrition, i.e., whether someone stays or leaves, based on some known feature information.

Why Auto ML?

Opting for auto ML over traditional ML was a no-brainer decision for me, simply because it shrinks the time-to-solution from data preparation to predictions. A lot of time leaks away when you try to optimize one ML model at a time for the best predictive ability on your dataset, following the sequence of steps outlined in the image below. In auto ML, generally speaking, all of these steps, including hyperparameter tuning and combining individual models into the best overall ensemble, are configured to happen under the hood, with minimal to no intervention from the user.

Image by author

Understanding Fundamentals

The next segment illustrates how the two auto ML libraries under evaluation perform at predicting attrition on an open-source HR Attrition Kaggle dataset. Before that, we need to familiarize ourselves with some crucial metrics for choosing the winning ML model, captured in the image below: Confusion Matrix, Precision, Recall, and F1. Note that this is not an exhaustive list of metrics for evaluating a machine learning model, but these are the ones most relevant to the task at hand, and they are explained here in the context of this particular ML task.

Image by author

Beyond the crucial metrics above, I would also suggest looking at the AUC (area-under-the-curve) score, which, simply put, gauges how much better the model separates the two classes than a random guess would. The Accuracy metric, which measures all True predictions (both positives and negatives) over all observations, is not a very indicative metric for this case compared to Precision and Recall, simply because a model can be highly accurate thanks to its correct predictions on the population that stays (True Negatives) yet be completely off in its predictions on the population that leaves (True Positives), which is the target for our case.

It is important to note that the example considered in this article is a Supervised (Binary) Classification Machine Learning problem: Classification because the response variable is categorical, holding values that denote an employee leaving or staying. The metrics above are therefore explained in a simplified way, molded to this case study.

Evaluating Auto ML libraries

There are many online resources introducing the libraries under evaluation; some useful links are included in the resources segment at the end of this article for further reading. This article will focus on applying the Pycaret and Tpot auto ML libraries to the business task of predicting resigning employees by understanding the given dataset.

The very first step is to install the auto ML libraries we're evaluating by running pip install tpot and pip install pycaret in your command prompt window. Once the installations are done, the next step is to open the Jupyter environment to code (this article uses JupyterLab as the coding platform).

Let’s begin with the imports for the libraries and the dataset.

Image by author
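Since the code above is shown as a screenshot, here is a minimal sketch of what the data load could look like; the file name "HR_comma_sep.csv" is an assumption based on the common Kaggle distribution of this HR attrition dataset, and the library-specific imports appear in the later snippets.

import pandas as pd

# Load the open-source HR attrition dataset into a pandas data frame.
# The file name is an assumption; point it to wherever you saved the Kaggle CSV.
df = pd.read_csv("HR_comma_sep.csv")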

Printing the shape of the dataset (or, more specifically, pandas data frame) shows us there are 14,999 rows and 9 columns. The head() function outputs the first 5 rows in the dataset and lets us get a glimpse of the information we’re dealing with.

Image by author
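A quick sketch of that inspection step, assuming the data frame loaded in the previous snippet:

print(df.shape)  # expected to show (14999, 9) for this dataset
df.head()        # first 5 rows for a glimpse of the columns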

Clearly, the column labeled “left” captures attrition with a “1” implying those that left and “0” implying those that continued to stay. This is our Response column. The remaining columns are the Features that we will feed to the ML model to interpret relations to the Response and then make predictions based on those interpretations.

We can further deduce that all the Feature columns are Numerical, simply because the values they hold are not unordered categories (unlike, say, Gender, where Male and Female are categorical values with no inherent order). This also holds true for the "salary" column: although it stores the strings "low," "medium," and "high," these values clearly exhibit an inherent order, meaning high salary > medium salary > low salary. Before we evaluate the auto ML models, a simple check with the code below confirms whether any of the columns hold missing values; luckily for this dataset there were none, and the code simply returned 0. In real business datasets that you might be working with, it is crucial to decide how you want to handle any missing values (drop them, impute them with the average, and so on). Generally, the auto ML libraries have some level of missing-value imputation configured into the modeling to cater for most cases.

Image by author
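One simple way to express that check, assuming the same data frame:

# Total count of missing values across all columns; returns 0 for this dataset.
print(df.isnull().sum().sum())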

1. Pycaret

Let's begin by setting up the Pycaret classification module. The mandatory elements to pass at this step are the data frame in the "data" argument and the response column in the "target" argument. There are a LOT more arguments to play around with and further enhance the usage of this setup() function; however, we will simply rely on the configuration defaults for this illustration. One default worth calling out is that the train-test split is 70-30% unless tweaked while calling setup().

Image by author
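A minimal sketch of that baseline setup call, relying on Pycaret's defaults:

from pycaret.classification import setup

# Only the data frame and the response column are passed;
# Pycaret infers data types and applies the default 70-30 train-test split.
clf = setup(data=df, target="left")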

As we can see from the below output of executing the setup() function, Pycaret interprets most of the feature columns as categorical.

Image by author

We can pass the columns whose data type we want to change to Numeric using the "numeric_features" argument, so the updated code looks like the below.

Image by author
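A sketch of that updated call; the list comprehension is just one convenient way to collect every feature column except "salary" (still a string column at this point) without typing the column names out:

# Treat every feature except the response and the string-typed "salary" as numeric.
num_cols = [c for c in df.columns if c not in ("left", "salary")]
clf = setup(data=df, target="left", numeric_features=num_cols)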

Executing this updated setup code now shows the feature columns passed as a list to the “numeric_features” argument, all reflected with Numeric data type.

Image by author

We still need to deal with the "salary" column to convert it from its current Categorical data type to Numeric. Because the values in the column are strings, one approach for the conversion is to assign numbers to these string values so that the relation high > medium > low is maintained as 2 > 1 > 0. This change can be done as part of the data preparation before evaluating the auto ML models, since it is a universal change that applies when evaluating Tpot too.

Image by author
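A minimal sketch of that ordinal encoding using a pandas mapping:

# Preserve the inherent order high > medium > low as 2 > 1 > 0.
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})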

Now we can see that the values in the "salary" column have been converted to numbers representing the previous string values of low, medium, and high. Subsequently, executing the Pycaret classification setup() with "salary" also included in the list passed to the "numeric_features" argument shows all feature columns reflected with the Numeric data type.

Image by author
Image by author
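With salary now numeric, the final setup call could look like this sketch:

# Every column except the response can now be passed as a numeric feature.
feature_cols = [c for c in df.columns if c != "left"]
clf = setup(data=df, target="left", numeric_features=feature_cols)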

We can then press "enter" as prompted by the setup() function to initialize the model and get on to the exciting part of visualizing the magic that auto ML can bring to the table. By just calling the compare_models() function as below, we will see in the code environment that Pycaret runs 10-fold cross-validation on the train split to make the best interpretation of the feature-to-response relation, enabling the best predictions on the test split or future unseen data. Results are sorted by the Accuracy metric by default, so the sort argument has been set to the F1 metric instead, to surface the model with the best precision-recall balance for this use case.

Image by author
Image by author
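A sketch of that comparison step, sorting the leaderboard by F1 instead of the default Accuracy:

from pycaret.classification import compare_models

# Trains and cross-validates Pycaret's candidate classifiers and
# returns the top model according to the chosen sort metric.
best_model = compare_models(sort="F1")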

The ML models with the best metric scores are conveniently highlighted in yellow to narrow down and ease our selection process. After executing the compare_models() function, the variable "best_model" holds the Random forest classifier for this case. We can simply go with the Random forest classifier for the given dataset since this ML model shows the most promising overall metric scores. It does not get simpler than this!

Something to note is that the metrics tabulated as the output of the compare_models() function are averages from the 10-fold cross-validation. Let’s now look at the Confusion Matrix for the test split to reveal how accurate the predictions were compared to the actuals.

Image by author
Image by author
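A sketch of how that confusion matrix is drawn with Pycaret's plotting helper:

from pycaret.classification import plot_model

# Confusion matrix evaluated on the hold-out (test) split.
plot_model(best_model, plot="confusion_matrix")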

We can see that the misses reflected in the False Negatives and False Positives are rather small compared to the vast majority of the accurate predictions reflected in True Negatives and True Positives, thus explaining the >95% mean metrics.

The plot_model() function also offers a suite of other essential plots under its hood that you can generate by just changing the "plot" argument to other values, like "pr" to plot the precision-recall curve, "auc" to plot the ROC curve and confirm the AUC, and so on, as sketched below.
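For example, continuing with the best_model selected above:

plot_model(best_model, plot="pr")   # precision-recall curve
plot_model(best_model, plot="auc")  # ROC curve with the AUC score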

The final step of prediction is as straightforward as calling the predict_model() function and passing the selected ML model and the dataset to make predictions on. For this illustration, the original dataset has been reused, but it can easily be swapped for a new dataset of unseen data for the model to make fresh predictions on.

Image by author
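A sketch of that prediction step; note that the exact names of the added columns depend on the Pycaret version (older releases add "Label" and "Score" as described below, while newer ones use "prediction_label" and "prediction_score"):

from pycaret.classification import predict_model

# Score the original data frame; swap in a data frame of unseen data for fresh predictions.
predictions = predict_model(best_model, data=df)
predictions.head()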

The predict_model() function output, as we can see below, provides additional columns: "Label," which reflects the ML model's prediction for each row in the dataset, and "Score," which reflects the probability of the prediction in "Label." This is all we need to equip our business stakeholders with foresight into who is likely to leave, so that appropriate actions can be taken to try and retain them.

Image by author

2. TPOT

Tpot differs from Pycaret in how it is executed. Firstly, we need to split the dataset to clearly define the features and response in "X" and "y," and then define train-test splits using the same 70-30% ratio to stay consistent with Pycaret. The shape outputs confirm the splits.

Image by author
Image by author
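A sketch of that splitting step with scikit-learn; the random_state and stratify settings are illustrative choices beyond the stated 70-30 ratio:

from sklearn.model_selection import train_test_split

X = df.drop("left", axis=1)   # feature columns
y = df["left"]                # response column

# 70-30 train-test split, consistent with Pycaret's default.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)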

In preparation for training the Tpot classifier model, the "n_splits" argument is similarly set to 10 to match the 10-fold cross-validation used with Pycaret. The other arguments tweaked for this use case vs. what you'll find in the resource link further down are "scoring," changed from accuracy to f1 (mirroring the choice of the best-ranked model in Pycaret based on the same precision-recall balance metric), and "n_jobs," changed from -1 to 6, which controls how many system cores Tpot uses to construct the ML pipeline (-1 uses all available cores, while 6 caps the usage at six).

Image by author
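A sketch of that classifier configuration; the generations and population_size values here are illustrative placeholders rather than the exact settings used for the original two-hour run:

from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

# 10-fold cross-validation, F1 scoring, and 6 parallel jobs as discussed above.
cv = StratifiedKFold(n_splits=10)
tpot = TPOTClassifier(
    generations=5,
    population_size=50,
    cv=cv,
    scoring="f1",
    n_jobs=6,
    verbosity=2,
    random_state=42,
)
tpot.fit(X_train, y_train)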

An important difference to highlight between Pycaret and Tpot is that the latter took close to 2 hours to fit the best ML model, while the former took less than 2 minutes. Two hours later, here is Tpot's output after fitting the train features and response. The top-performing ML pipeline for the given dataset achieved a mean F1 score of about 97.2%, which isn't vastly different from what we got from Pycaret's compare_models() output. We also see that the top model is itself a combination of KNeighbors and Decision tree classifiers, so it differs from Pycaret's best model too.

Image by author

Great! Now that we have come this far, let's finally use the trained Tpot model to make predictions on the test set and see whether we really achieve superior results vs. Pycaret. To perform predictions, we simply call the function plot_metrics(X_test, y_test, model), where plot_metrics() is a custom function defined to produce the plot below using the traditional matplotlib and seaborn libraries (refer to the GitHub gist link provided in the resources for the code details).

Image by author
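The exact function lives in the GitHub gist linked in the resources; one possible implementation along the same lines could look like this:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

def plot_metrics(X_test, y_test, model):
    """Plot a confusion-matrix heatmap and print precision, recall, and F1."""
    y_pred = model.predict(X_test)
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
    print(classification_report(y_test, y_pred))

plot_metrics(X_test, y_test, tpot)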

Conclusion

The results above lead us to conclude that, for this given dataset, the auto ML library we could narrow in on and potentially "productionize" with minimal intervention and time would be Pycaret. An important takeaway is that finalizing ML models on actual business data might involve more experimentation than relying on defaults (hyperparameter tuning, stacking, blending models, feature scaling, etc.), even with auto ML doing a lot of the heavy lifting in the initial narrowing phase. Happy exploring these on your business data!

Resources for further reading
  1. Pycaret classification tutorial
  2. Tpot classification tutorial
  3. GitHub gist link to code
