
Cracking the Employee Attrition Problem with Machine Learning

Why it's not a simple problem

Photo by Nick Fewings on Unsplash

Employee attrition, i.e. the process of employees leaving an organisation, has become more than an alarming problem in recent times. As an aftereffect of the pandemic, globalised remote jobs have become a staple of the market, which unsurprisingly has made changing jobs much easier for highly skilled individuals. This, together with other characteristics of the new order, has caused turnover rates to increase significantly, with an additional aggravating factor for companies: the most qualified employees are the ones leaving most frequently. Clearly, this problem should not be taken lightly by organisations, as it is associated not only with immediate costs such as drops in productivity, recruitment, and training, but also with the motivation and engagement of the remaining employees, which in turn takes a toll on the potential growth of the company.

In this regard, developing a machine learning model capable of predicting employee attrition has become a critical component of the decision-making process for companies that want to retain their employees. Instead of blindly assigning resources (salary raises, promotions, business trips, integration events, training, etc.), machine learning can help you redirect your retention efforts to where they actually matter. Attrition models can help you understand which employees are at a higher risk of leaving, which conditions could explain this risk, and what could be done to avoid their potential exit (provided there is something that can be done).

But are these models simple to build? Have you developed one? And if so, did you follow the correct framework? Though it may sound like a simple classification task, once you dig a little deeper you will realise that it is not as easy as it looks. In fact, many of the examples going around the internet repeat the same mistakes over and over again… but don’t worry, we’ll address them soon. For now, let’s just say that depending on the available data sources, the questions to be answered, and the solution scope, building a solid machine learning model for this task can be a significant challenge. In addition, note that at the end of the day these models will have an impact on real people, so one must be extremely careful with the design decisions, i.e. which variables are going to be used, the model framework, and its specification.

The purpose of this article is to offer a brief introduction to the current solutions and uncover the challenges behind this problem, while shedding light on the common mistakes so that they can easily be avoided in the future. As a quick reminder, do not forget to check whether you are following the requirements to succeed when building a project of this kind (here are two articles – part I and part II – that explain them in a simple fashion).


Problem definition

If we want to provide a real-world solution instead of a toy model, we need to start by defining the business questions and the minimum viable product/solution. Based on our experience, there are three main questions to be answered about employee attrition:

  1. whether an employee will leave the company or not;
  2. when it will occur;
  3. why it may happen.

As you can guess, these questions are not independent but strongly linked. Thus, it is in our best interest to develop a single tool that can answer them simultaneously. Let us summarise how this is currently done (or, more precisely, how it is being approached).

Current attempts to solve the problem

i) A simple classification problem?

The problem of predicting employee attrition has been addressed by many studies and machine learning articles. However, many of them simply implement the same models without any critical analysis of why a certain approach is being followed. In this sense, and as hinted before, most non-academic examples available on the internet suggest treating the problem as a classification task, where:

  • The target variable is 1 if the employee left the company and 0 otherwise;
  • Characteristics of the employees, mainly categorical (role, department, age group, performance class, etc.) along with some low-variance continuous variables such as salary, are used to predict the exit event (there’s a synthetic dataset built by IBM that summarises these variables);
  • The censoring problem is completely disregarded. But what exactly is censoring? In simple terms, censoring is a condition in which we have incomplete information about a subject or measurement (if an event hasn’t been observed, it doesn’t mean it won’t happen; we just haven’t seen it yet). In the attrition problem, we have incomplete information about active employees: to have complete information we would need them to leave the company, which would tell us the exact time of their departure and hence give us a complete measurement. The omission of this critical concept means that these solutions are flawed from their conception and will never give a proper answer about the timing of the event, even if they include some additional transformations of the model’s input (we’ll talk about them later on);
  • Each observation is treated as independent, both across employees and over time for the same employee. Note that this is conceptually wrong unless you either reduce all the information about an employee to a single observation or make sure the model can handle separate sequences of characteristics per employee, i.e. the sequences of employee A are not mixed with those of employee B. That said, although conceptually wrong, the impact of ignoring this issue is difficult to measure.
  • Classes are artificially balanced, since exit events are relatively rare compared to the number of employees staying;
  • Results are measured using the metrics derived from the confusion matrix, paying particular attention to the F1-score, precision, and recall (as we know, accuracy is a bad metric for problems where classes are naturally unbalanced). A minimal sketch of this common setup follows below.
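To make the shortcomings concrete, here is a minimal sketch of this common setup, assuming a snapshot dataset in the spirit of IBM’s synthetic one (the file name and column names are illustrative):

```python
# A minimal sketch of the common classification approach, assuming a
# snapshot dataset similar to IBM's synthetic attrition data: one row
# per employee, a binary "Attrition" label, and no notion of time.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("ibm_attrition.csv")  # hypothetical local copy

X = pd.get_dummies(df.drop(columns=["Attrition"]))  # one-hot encode categoricals
y = (df["Attrition"] == "Yes").astype(int)          # 1 = employee left

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight="balanced" is the usual shortcut for the class imbalance
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

# precision, recall and F1 per class -- note there is no notion of "when"
print(classification_report(y_test, clf.predict(X_test)))
```

Everything here looks reasonable in isolation; the trouble is what the setup leaves out, as we’ll see next.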

Note that this approach has many flaws, but one in particular is critical: it does not take time into consideration, so we cannot forecast an employee’s exit within a time period of interest. This happens because the solution does not address the real issue at hand, i.e. a time-to-event problem. In simple words, we are ignoring the fact that all employees will leave the company at some point; the question is when.

But how can we take care of this? The answer lies in the literature on time-to-event problems or, more specifically, survival analysis.

ii) Survival Analysis

For time-to-event problems, some of the most accurate solutions come from the field of survival analysis, best known for its applications in medical studies (for example, predicting the time left to a patient suffering from a terminal illness).

Survival models differ from classification models in a fundamental sense: the idea is to predict the expected survival time (or, equivalently, the expected time to "death") conditional on a current state. This means we are talking about a regression model instead of a classification one; we are predicting the expected time to an event instead of a label. With regard to the machine learning models used to predict the expected survival time, there are two main families with different traits:

Proportional hazards models: they are simple and transparent for inference, but they lack predictive power because of their restrictive functional form, just as happens with linear regression models.

Accelerated failure time models: they reach higher predictive power by taking advantage of ensemble and boosting algorithms. However, this comes at the cost of transparency in the feature-transformation process and, consequently, in the model’s decision function (which means inference is left behind).
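As a rough illustration of both families in their simplest forms, here is a sketch using the lifelines library on synthetic data (the data and column names are made up for the example; the boosted AFT variants the transparency caveat refers to appear in the next section):

```python
# A sketch of the two survival model families with the lifelines library.
# Synthetic data: tenure in months, an event flag (1 = left, 0 = still
# active, i.e. censored) and two illustrative covariates.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, WeibullAFTFitter

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(22, 60, n),
    "salary_band": rng.integers(1, 5, n),
    "tenure_months": rng.exponential(24, n).round().clip(min=1),
    "left": rng.integers(0, 2, n),  # 0 = censored (still employed)
})

# Proportional hazards: transparent coefficients, easy inference
cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_months", event_col="left")
cph.print_summary()  # one interpretable hazard ratio per covariate

# Accelerated failure time (parametric flavour): models survival time directly
aft = WeibullAFTFitter()
aft.fit(df, duration_col="tenure_months", event_col="left")
print(aft.predict_median(df).head())  # median time-to-exit per employee
```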

You may be thinking: isn’t there any model that has high accuracy and is capable of explaining the reasons behind its decisions? Unfortunately, at the time of writing, there are no open-source survival model implementations that offer both predictive power and full transparency. Nevertheless, we can come up with two alternative solutions, with a huge disclaimer that we’ll address at the end: you still have to make some adjustments to take the time factor into consideration.

iii) Black box models + SHAP/LIME or other explanatory models

One potential solution is to implement black-box models such as traditional ensemble models (CatBoost or XGBoost with an AFT loss function) and then use additional tools such as SHAP or LIME to get an explanation of the results. But keep in mind that this approach is not robust, since the explanatory models are new models built on top of the results of the main model, so they do not exactly represent the real decision function fitted by the ensemble.
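A minimal sketch of this route on synthetic data, combining XGBoost’s AFT objective with SHAP (all names and hyperparameters are illustrative, not a recommendation):

```python
# A sketch of the black-box route: XGBoost with an AFT loss, plus SHAP
# for post-hoc explanations. Data and column names are synthetic.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.integers(22, 60, n),
    "salary_band": rng.integers(1, 5, n),
    "seniority_months": rng.integers(1, 120, n),
})
tenure = rng.exponential(24, n) + 1.0
left = rng.integers(0, 2, n).astype(bool)

# AFT labels: an exact interval for observed exits, right-open (+inf)
# for censored employees who are still active
dtrain = xgb.DMatrix(X)
dtrain.set_float_info("label_lower_bound", tenure)
dtrain.set_float_info("label_upper_bound", np.where(left, tenure, np.inf))

params = {
    "objective": "survival:aft",
    "eval_metric": "aft-nloglik",
    "aft_loss_distribution": "normal",
    "aft_loss_distribution_scale": 1.0,
}
model = xgb.train(params, dtrain, num_boost_round=100)

# SHAP builds a *separate* attribution on top of the trees' raw output;
# it approximates, but is not identical to, the fitted decision function
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```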

iv) Generalised Additive Models

The second option consists of implementing GAMs (Generalised Additive Models). These models offer potential benefits in terms of both predictive power and transparency. They are non-parametric models that build complex decision functions in an additive way, summing the individual contributions of the variables used, which means it is possible to recover the decision function of the final model. However, to use GAMs we must leave survival analysis techniques behind and move to regular classifiers. Yet… the censoring and time-dependency problems arise again.
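For illustration, here is a tiny sketch with the pygam library, fitting a logistic GAM and reading back each term’s contribution (the feature layout and data are made up):

```python
# A sketch of a transparent GAM classifier with the pygam library.
# Two smooth terms (age, seniority) and one factor term (department).
import numpy as np
from pygam import LogisticGAM, f, s

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(22, 60, n),   # age
    rng.integers(1, 120, n),   # seniority in months
    rng.integers(0, 4, n),     # department code (categorical)
])
y = rng.integers(0, 2, n)      # 1 = left within the window (illustrative)

gam = LogisticGAM(s(0) + s(1) + f(2)).fit(X, y)

# Each term's partial effect is recoverable -- the transparency gain
for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue
    XX = gam.generate_X_grid(term=i)
    print(f"term {i} partial effect (first values): "
          f"{gam.partial_dependence(term=i, X=XX)[:5].round(2)}")
```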

Our alternative solution

So there are powerful and transparent models available, but we need a way of coping with the censoring problem. Here’s where data transformations come into play. Let’s see what we can do. Since we know it can be hard to form a mental image of what we are about to describe, a dynamic figure of the problem and the solution is presented below.

Image by author

To predict the probability of an employee exit (churn) over the next, say, 12 months using a classification model, we need to restructure the data and make some important assumptions. Our goal is to create a dataset from which the model learns the main differences between people who remain at the company and those who leave. To do that, we take a dataset containing historical data from both profiles and consider their main characteristics, such as age, seniority, and salary changes, among many other variables, during the 12 months prior to the event.

To introduce the notion of time into the problem, we sort the data for each employee from the most recent monthly observation to the oldest one and create a time index based on this order. We can then use this index as the prediction label, indicating that, for example, 10 months before leaving the company an employee showed certain characteristics, in contrast to the characteristics of employees who remained. However, we still have a problem with active employees: we are not sure whether they will leave the company in the following months. This is the censoring problem we talked about before, and it affects the observations of the active employees.

To deal with this problem, we exclude the last 12 months of records for active employees, so that if we look at their characteristics, say 13 months before the last available record, we can be sure that these characteristics represent someone who remained at least 12 more months. In the case of people who leave the company, there is no problem with their characteristics, because the occurrence of the event is a fact we already know.

It is also recommended to exclude the 12 most recent months of information for both active employees and employees who leave the company during this period, to avoid misrepresenting one of the two types of employees due to changes in the inter-temporal distribution. Once we make these changes, we can simplify the problem from a multi-class classification problem to a binary one. To do that, we create a dummy variable based on the time index, assigning a 0 to the last 12 remaining observations of active employees and a 1 to the last 12 observations of employees who leave.

Finally, we reduce the entire dataset to this subset of records, also excluding the records beyond the 12-month time window. If we don’t, we face another problem: a highly unbalanced dataset that makes it very difficult to train any model. We can then use our GAMs to predict employee churn probability.
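Putting these steps together, here is a minimal pandas sketch of the whole transformation. It assumes a hypothetical long-format panel (one row per employee per month, with employee_id, month, an active flag, and the feature columns); employees who leave during the excluded recent window would need extra care that the sketch glosses over:

```python
# A minimal sketch of the restructuring described above, assuming a
# long-format monthly panel with an "active" flag for current employees.
import pandas as pd

WINDOW = 12  # months of interest before a (potential) exit

df = pd.read_csv("monthly_panel.csv", parse_dates=["month"])  # hypothetical

# Exclude the 12 most recent calendar months for everyone, so that the
# last retained record of an active employee provably precedes at least
# 12 further months of tenure (the censoring fix described above)
cutoff = df["month"].max() - pd.DateOffset(months=WINDOW)
df = df[df["month"] <= cutoff].copy()

# Reverse time index per employee: 0 = last retained monthly record
df = df.sort_values(["employee_id", "month"], ascending=[True, False])
df["months_before_end"] = df.groupby("employee_id").cumcount()

# Binary dummy: 1 for the last WINDOW records of leavers, 0 otherwise
df["churn_within_12m"] = (
    ~df["active"].astype(bool) & (df["months_before_end"] < WINDOW)
).astype(int)

# Keep only the window of interest to avoid a highly unbalanced dataset
train_df = df[df["months_before_end"] < WINDOW]
```

The resulting train_df contains at most 12 labelled rows per employee and can be fed straight into the GAM from the previous section.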

It is important to note that we are making a strong assumption in this process: that month-by-month observations are independent, even when they are records of the same employee. However, this is something we have to tolerate if we want to use a simple and transparent classification model. There are alternatives for dealing with this problem, but most of them involve more complex models (for example, neural networks) that may compromise the transparency of the solution.

At the end of the day, we must not forget that we are building a model that could alter the future of real employees. Transparency is key. If we use a model to decide which action plan we should take to reduce someone’s likelihood of quitting in order to change an undesired outcome, we have to be sure that we can both understand the reasoning behind the model’s decisions and explain it clearly to the rest of the organisation.

Conclusion

All in all, depending on the specific context, employee attrition can be a persistent and challenging problem for companies. In this scenario, being able to predict when and why an employee may leave is crucial for retaining valuable talent and mitigating the risk of deviating from the company’s strategic vision. As we’ve seen, though not easy, it is still possible to take a real-world problem and design a powerful, transparent solution that helps companies make informed decisions about retention strategies and invest in employee development accordingly.

Thanks for checking out my content! If you enjoyed it, please don’t forget to follow me and like this post for more content.

References

Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2), 187–202. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x

Hastie, T., & Tibshirani, R. (1986). Generalized Additive Models. Statistical Science, 1(3). https://doi.org/10.1214/ss/1177013604

Kumar, I.E., Venkatasubramanian, S., Scheidegger, C.E., & Friedler, S.A. (2020). Problems with Shapley-value-based explanations as feature importance measures. International Conference on Machine Learning.

Wei, L. J. (1992). The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11(14–15), 1871–1879. https://doi.org/10.1002/sim.4780111409

