Using Association Rules for HR Analysis

Or: how to use Apriori with data other than market basket analysis

Eduardo Furtado
Towards Data Science


Association rule algorithms such as Apriori are a great way to find which items regularly occur together in your dataset and to see how they relate.
It is an unsupervised machine learning technique commonly used to find relationships in transactions, e.g., clients' purchases.

An item is considered frequent if its number of occurrences is greater than the threshold we set. If you have 100 lines of data and set a support threshold of 0.1, everything that happens more than 10 times will show up in the result.

In this example, we’ll use Apriori for a different type of problem. We are going to use an HR dataset that contains age, gender, education level, etc., and we will try to find the frequent characteristics shared by employees who do not have Attrition.

Firstly, a quick recap of the main Apriori components:

Support is the fraction of transactions in which an itemset appears:
freq(A, B)/Total

Confidence is the probability of B happening given that A happened:
freq(A, B)/freq(A)

Lift is similar to confidence, but it also accounts for how popular B is on its own:
support(A, B)/(support(A) × support(B))
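A quick numeric sketch of the three metrics; the counts below are made up for illustration:

```python
# Toy numbers: 100 transactions; A appears in 30, B in 40,
# and A and B appear together in 20.
total = 100
freq_a, freq_b, freq_ab = 30, 40, 20

support_ab = freq_ab / total   # 0.2 -- A and B together in 20% of rows
confidence = freq_ab / freq_a  # P(B | A): of the rows with A, how many have B
lift = support_ab / ((freq_a / total) * (freq_b / total))

print(support_ab, confidence, lift)
```

A lift above 1 (here ≈1.67) means A and B co-occur more often than they would if they were independent.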

In this example, we’ll use the IBM HR Analytics Employee Attrition & Performance from Kaggle which you can download from the link below:

This code will be written in Python using the MLxtend library (http://rasbt.github.io/mlxtend/)

Firstly, we import our libraries. For this project, only Pandas and MLxtend are needed.

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

After reading the data, we can see that there are 35 columns to work with, but we will only use a few that look more interesting to us.

There are a couple of columns that would be nice to use but are not categorical, so what we can do is create bins for them. Pandas’ qcut function can divide the Age, DistanceFromHome and HourlyRate columns into 4 quantile-based bins for us.

pd.qcut(df['Age'], q=4)

This piece of code shows us that the Age column could be divided into these 4 categories:

Categories (4, interval[float64]): [(17.999, 30.0] < (30.0, 36.0] < (36.0, 43.0] < (43.0, 60.0]]

So we can create an Age_Range column like so:

df['Age_Range'] = pd.qcut(df['Age'], q=4, labels=['<=30', '>30 <=36', '>36 <=43', '>43'])

After doing the same for DistanceFromHome and HourlyRate now we have to prepare our data for the algorithm. Apriori only accepts boolean values, so instead of sending the MaritalStatus column with ‘Single’, ‘Married’ and ‘Divorced’ values we will convert this into 3 different columns called ‘MaritalStatus_Single’, ‘MaritalStatus_Married’ and ‘MaritalStatus_Divorced’ with 0 or 1 values in them.

For this, the Pandas get_dummies function can be used for each column that needs to be converted. First we’ll create a list with all the columns that are going to be used, and a second list with the remaining columns. After using get_dummies, the original columns will be removed and only the boolean columns will remain; however, the unused columns are still there too, so we drop them.

columns = ['Attrition',
'Age_Range',
'BusinessTravel',
'Department',
'DistanceFromHome_Range',
'Education',
'EducationField',
'EnvironmentSatisfaction',
'Gender',
'HourlyRate_Range',
'JobInvolvement',
'JobLevel',
'JobRole',
'JobSatisfaction',
'MaritalStatus']
not_used_columns = list(set(df.columns.to_list()) - set(columns))
df = pd.get_dummies(df, columns=columns)
df.drop(labels=not_used_columns, axis=1, inplace=True)

Now we have a dataframe ready to use and generate the frequent items.

Our dataset has a total of 1,470 rows, so let’s start by choosing a minimum support of 0.05. This means that only itemsets that occur more than 73 times (1,470 × 0.05) in our data will be considered. max_len controls how many columns can be combined in the antecedents column (we pass max_len + 1 to apriori to leave room for the consequent).

#Apriori min support
min_support = 0.05
#Max length of apriori itemsets
max_len = 3
frequent_items = apriori(df, use_colnames=True, min_support=min_support, max_len=max_len + 1)
rules = association_rules(frequent_items, metric='lift', min_threshold=1)
rules.sort_values(by='confidence', ascending=False).head(10)
Association Rules results

That’s it! Now you can see the frequent relationships in the dataset.

But what if we want to know only what gives us ‘Attrition_Yes’ or ‘Attrition_No’ in the consequents column?

First of all, let’s see how many occurrences of each there are in our dataframe.

df['Attrition_No'].value_counts()
#1    1233
#0     237
#Name: Attrition_No, dtype: int64

There are far more cases of No than Yes, so we’ll also need to take that into consideration when choosing our thresholds, since one is much more common than the other.
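To make the imbalance concrete, here is a small sketch of the arithmetic, using the row counts from the value_counts output above:

```python
# Row counts taken from the value_counts output above
total_rows = 1470
yes_count = 237

# At min_support = 0.1, an itemset must appear in about 147 rows:
rows_needed = total_rows * 0.1
print(rows_needed)

# Attrition_Yes itself only covers ~16% of the rows, which caps the
# support any rule ending in Attrition_Yes can possibly reach:
yes_ceiling = yes_count / total_rows
print(yes_ceiling)
```

So a 0.1 threshold is reasonable for Attrition_No rules but would leave very little room for Attrition_Yes rules, whose support can never exceed roughly 0.16.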

For No, let’s increase the threshold to 0.1 and filter the consequents column for Attrition_No:

#Apriori min support
min_support = 0.1
#Max length of apriori itemsets
max_len = 3
frequent_items = apriori(df, use_colnames=True, min_support=min_support, max_len=max_len + 1)
rules = association_rules(frequent_items, metric='lift', min_threshold=1)
target = '{\'Attrition_No\'}'
results_attrition_no = rules[rules['consequents'].astype(str).str.contains(target, na=False)].sort_values(by='confidence', ascending=False)
results_attrition_no.head(10)
Association Rules results with only Attrition_No in the consequents column

Now we can see that the combination of being in the Research & Development department and at Job Level 2 occurs in 11% of our dataset; however, almost 95% of those employees do not have Attrition.

Sidenote: the antecedents and consequents columns are frozensets which look like this:

array([frozenset({'JobSatisfaction_4', 'JobLevel_2'}),
       frozenset({'Department_Research_&_Development', 'JobLevel_2'}),
       frozenset({'Department_Research_&_Development', 'JobInvolvement_3', 'JobLevel_2'})…

If you want to beautify it and export this result you can use this piece of code:

results_attrition_no['antecedents'] = results_attrition_no['antecedents'].apply(lambda x: ','.join(list(x))).astype('unicode')
results_attrition_no['antecedents'] = results_attrition_no['antecedents'].str.title().str.replace('_', ' ')
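The same treatment presumably applies to the consequents column, and the cleaned frame can then be exported. The tiny single-rule frame and the to_csv call below are assumptions for illustration, not the original gist’s code:

```python
import pandas as pd

# Made-up single-rule frame standing in for the real results
rules = pd.DataFrame({
    'antecedents': [frozenset({'JobLevel_2', 'Department_Research_&_Development'})],
    'consequents': [frozenset({'Attrition_No'})],
})

# Turn each frozenset into a readable comma-separated string
# (sorted() just makes the item order deterministic)
for col in ['antecedents', 'consequents']:
    rules[col] = (rules[col]
                  .apply(lambda x: ','.join(sorted(x)))
                  .str.title()
                  .str.replace('_', ' '))

csv_text = rules.to_csv(index=False)  # export as CSV text, no file needed
print(csv_text)
```

Note that str.title() lowercases mid-word capitals, so ‘JobLevel 2’ comes out as ‘Joblevel 2’, exactly as with the article’s snippet.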

Now let’s change our threshold to 0.02 and see what we have for Attrition_Yes:

Association Rules results with only Attrition_Yes in the consequents column

The highest confidence is for people who are 30 or younger, single, and at Job Level 1. But unlike the previous case, only 45% of those employees have Attrition.

And that’s the basics of using Apriori with HR data! I hope you try it in your next project and find new relationships in your dataset that give you more insight into your problem.

You can download the full notebook used to create this post here: https://gist.github.com/eduardoftdo/e3d2b7ca4a06d8d86b144482d0aed5a1

If you have any questions and would like to contact me, feel free to send a message at https://www.linkedin.com/in/eduardo-furtado/
