How to Build a Resume Recommender like the Applicant Tracking System (ATS)

One headcount. 1,000 resumes. Which one to read first?

Jenny Wang
Towards Data Science



In 2014, Google listed 7,000 job openings on its careers site, and the company received 3 million resumes for those openings that year.

Now, that is a lot of resumes to screen through when fewer than 1% of applicants get hired! There's no doubt that hiring is labor-intensive, and the larger the candidate pool gets, the harder it is to sift through it to find the most qualified candidate for the role. As a company expands, recruiting becomes inefficient and expensive if the recruiting team has to grow at the same pace as the company. How do we optimize this labor-intensive process of screening countless resumes?

Goal

We can create a recommender system that recommends the best-matched resume based on your job listing!


To do so, we’ll use Natural Language Processing (NLP) techniques to process resumes and job listings and sort resumes by cosine similarity in descending order, to create this ultimate resume screening tool!

Natural Language Processing is a branch of artificial intelligence that allows computers to read, understand, and derive meaning from human languages.


Approach

1. Data Collection & Simple EDA

The dataset was found on Kaggle, with a total of 8,653 entries of applicant experiences and 80,000 job listings. Note that these applicant experiences are not specifically for any of the job listings.

Resumes

To make a resume out of this data, I concatenated all job experiences by applicant ID. Resumes with an incomplete job title or experience were removed to ensure there is enough text to model later. For example, the job description for applicant_id 10001 is NaN in the first three rows, so those rows are removed and won't be considered in the dataset.

The employer and salary columns won't be helpful for our recommendation engine, so we'll remove them as well.
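The data preparation above can be sketched with pandas on a toy stand-in for the Kaggle data (the column names applicant_id, job_title, job_description, employer, salary are assumptions matching the text):

```python
import pandas as pd

# Toy stand-in for the applicant-experience data; column names are assumptions
experiences = pd.DataFrame({
    "applicant_id": [10001, 10001, 10002, 10002],
    "job_title": ["Cashier", "Cashier", "Analyst", "Data Analyst"],
    "job_description": [None, "Handled register", "Built dashboards", "Wrote SQL reports"],
    "employer": ["A", "B", "C", "D"],
    "salary": [30000, 32000, 60000, 70000],
})

# Drop rows with a missing title or description, and columns we won't use
experiences = experiences.dropna(subset=["job_title", "job_description"])
experiences = experiences.drop(columns=["employer", "salary"])

# One "resume" per applicant: concatenate title + description across experiences
experiences["text"] = experiences["job_title"] + " " + experiences["job_description"]
resumes = (experiences.groupby("applicant_id")["text"]
           .agg(" ".join)
           .reset_index(name="resume"))
```

The first row (missing description) drops out before concatenation, so it never pollutes an applicant's resume text.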

Job Description

For job descriptions, to keep it simple, I kept only job_title and job_description in the job listings data frame. There is a lot of information about the location of each job (city, State.Name, State.Code, Address, Latitude, Longitude), which we won't use to recommend resumes. The rest of the columns either don't add much value for our use case or have a lot of missing values.

2. Text Preprocessing

Text preprocessing is the practice of cleaning and preparing text data. Using spaCy and NLTK, the resumes and job descriptions are pre-processed with the following steps:

  • stop words removal
  • part-of-speech tagging
  • lemmatization
  • alpha characters only

After pre-processing, resumes that were already short got even shorter, some to fewer than 20 words. I decided to remove resumes shorter than 23 words; this threshold still leaves at least 1,000 resumes in the dataset, ensuring there's enough text for the model to be trained on.
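The length filter itself is a one-liner; the 23-word threshold comes from the text:

```python
# Drop cleaned resumes shorter than 23 words (threshold from the article)
cleaned_resumes = [
    "python sql dashboard " * 10,      # 30 words: kept
    "sales associate cash register",   # 4 words: dropped
]
kept = [r for r in cleaned_resumes if len(r.split()) >= 23]
```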

3. Vectorizer

Why choose Term Frequency-Inverse Document Frequency (TF-IDF) instead of Count Vectorizer (a.k.a. Term Frequency)?

Count Vectorizer allows popular words to dominate. Some words are mentioned multiple times within a resume and across all resumes, making them "popular" words. Longer resumes also have a higher chance of containing these popular words more than once. Count Vectorizer concludes that resumes mentioning these popular words are more similar to each other, because it puts more emphasis on them, even though they may not actually be important in our context.

TF-IDF places more weight on rare words. If a word isn't mentioned across all resumes but appears in just two of them, that indicates those two documents may be more similar to each other than to the rest.

After all the text has been cleaned, TF-IDF is fit on the resumes to compare similarities between them. We tokenize into two- and three-word phrases (bi- and tri-grams), and each phrase must appear in at least 12 documents to be kept as a feature for topic modeling later.
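With scikit-learn, those vectorizer settings look like this. Note that min_df is lowered to 1 for the toy corpus, since the article's min_df=12 would filter out every phrase here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_resumes = [
    "customer service sales associate cash register",
    "customer service answer questions sales associate",
    "machine learning python data analysis pipelines",
]

# ngram_range=(2, 3) keeps bi- and tri-grams only; the article uses min_df=12,
# lowered to 1 here so the toy corpus produces features at all
vectorizer = TfidfVectorizer(ngram_range=(2, 3), min_df=1)
X = vectorizer.fit_transform(toy_resumes)
```

Each row of X is one resume; each column is one bi- or tri-gram, weighted down if it appears in many documents.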

4. Topic Modeling (Dimension Reduction)

I’ve tried 3 different topic modeling techniques: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-negative Matrix Factorization (NMF).

In the beginning, I used LDA and pyLDAvis to gauge the appropriate number of topics in the resumes, the n-gram range to use in TF-IDF, and the minimum number of documents a term must appear in. Then, for each topic modeling technique, I created a UMAP plot and inspected the terms in each topic. LDA showed the best separability between topics.
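A sketch of trying all three techniques with scikit-learn follows (the pyLDAvis and UMAP inspection steps are omitted). The n_components=2 setting is just for the toy corpus, not the 8 topics found on the real data:

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "customer service sales associate cash register retail",
    "sales associate answer questions customer service",
    "python machine learning data pipelines model",
    "data analysis python sql dashboards reporting",
]
X = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(corpus)

# Note: LDA is usually fit on raw counts rather than TF-IDF weights;
# here all three models are fit on the same TF-IDF matrix for comparison
models = {
    "LDA": LatentDirichletAllocation(n_components=2, random_state=42),
    "LSA": TruncatedSVD(n_components=2, random_state=42),
    "NMF": NMF(n_components=2, init="nndsvd", random_state=42),
}
doc_topic = {name: m.fit_transform(X) for name, m in models.items()}
```

Each fit_transform returns a document-by-topic matrix, which is what the UMAP plots and term inspections are built on.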

Below is the chart created using PyLDAvis, which presents a global view of our topic modeling.

PyLDAvis Chart

How to interpret the chart generated from pyLDAvis?

The left panel is the Intertopic Distance Map, containing 8 circles, which indicates that there are 8 topics in our resume dataset. Each topic is composed of its most relevant/important terms, which are shown on the right panel: the top-10 most relevant terms for that topic. For topic 2, some of the most relevant terms are "sales associate", "customer service", "answer questions", and "cash register", which sounds like a sales role, potentially in the retail industry. Lastly, the blue bar indicates the overall term frequency, while the red bar indicates the estimated term frequency within that topic.

What should you aim for in a PyLDAvis chart?

In short, you want your PyLDAvis chart to look similar to what we have here.

Intertopic Distance Map (Left) — Notice that each circle is fairly spaced out and similarly sized, meaning the topics are mostly evenly distributed across the documents (resumes) and have terms that are unique to them. Still, some overlap between topics is inevitable, since people can have varied experience across their careers, e.g., going from being an Admin to a Data Scientist after 3 years.

Top-10 most relevant terms for Topic 2 (Right) — You can see that the red bars come very close to filling the blue bars, which means these terms appear mostly within this topic, with little overlap with other topics.

5. Recommendation System Based on Cosine Similarity

Lastly, we need some measure of similarity between a job description and the pool of resumes. I debated between using the dot product and cosine similarity; see my thought process below.

Cosine similarity is the cosine of the angle between two vectors, which measures whether the two vectors point in roughly the same direction. In our case: do the terms (the bi- and tri-grams from TF-IDF) show up in both the resume and the job description?

The dot product is the product of the Euclidean magnitudes of two vectors and the cosine of the angle between them. Beyond whether a term occurs at all, how frequently it shows up in a resume also bumps up the match.

Which similarity metric should we use?

The dot product sounds like a win-win: we consider both whether a term occurs and how frequently it occurs in each document. That may sound like a good idea, but recall that this is exactly the downside we tried to stay away from earlier (TF vs. TF-IDF): placing too much importance on frequently mentioned words.

Think of the dot product between a and b as projecting a onto b (or vice versa), then taking the product of the projected length of a (|a| cos θ) with the length of b (|b|).

  • When a is orthogonal to b, the dot product is zero. The projection of a onto b has zero length, hence zero similarity.
  • When a and b point in the same direction, the dot product yields the largest value.
  • When a and b point in opposite directions, the dot product yields the lowest (most negative) value.

The dot product takes magnitude into account, so it increases as the vectors get longer. To normalize it, we divide the dot product by the product of the two vectors' magnitudes. The output is the cosine similarity, which is invariant to scaling and bounded between -1 and 1. A cosine value of 0 means the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors.

In other words, cos θ is just the dot product of the two vectors divided by the product of their lengths (magnitudes): cos θ = (a · b) / (|a| |b|). In general, cos θ indicates similarity in terms of the direction of the vectors. This still holds as the number of dimensions (i.e., terms) increases, which makes cos θ a helpful measure in high-dimensional space.
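The normalization step can be checked with a few lines of NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # orthogonal to a
c = np.array([5.0, 0.0])   # same direction as a, 5x the magnitude
```

Here dot(a, c) = 5 and keeps growing with magnitude, while the cosine similarity of a and c stays at 1: direction alone decides the score.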

Example

Here’s a job listing I found: “Retail Sales Associate” at Staples in San Francisco.
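Putting the pieces together, ranking the resume pool against a listing like that one looks roughly like this. It's a simplified sketch that ranks on TF-IDF vectors directly, whereas the article's pipeline additionally pre-processes the text and reduces dimensions with topic modeling first:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy resume pool; real data would be the cleaned resumes from earlier
resumes = [
    "software engineer python machine learning pipelines",
    "retail sales associate customer service cash register",
    "office admin scheduling calendar answer phones",
]
job_listing = "retail sales associate customer service san francisco"

vec = TfidfVectorizer(ngram_range=(1, 3), min_df=1)
resume_vecs = vec.fit_transform(resumes)
job_vec = vec.transform([job_listing])  # project the listing into resume space

scores = cosine_similarity(job_vec, resume_vecs).ravel()
ranking = np.argsort(scores)[::-1]  # resume indices, best match first
```

The retail resume shares several n-grams with the listing and comes out on top, while the engineering and admin resumes score near zero.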

Future Work

In the future, given more time and access to company data, I would love to do the following to improve the recommendation engine.

  • Obtain company data on job descriptions, candidate resumes, and additional internal information, e.g., what job level is this role? Is this a people-manager role?
  • Model the text data for one job category or role at a time. Since this project was based on a resume dataset spanning a wide range of job functions with only 1,000 observations, the topics were mostly a high-level breakdown of job functions. Modeling only one job role would let us narrow the topics down to be more specific to that role, which could lead to a better recommendation engine.
  • Work with Recruiters to apply supervised learning techniques, using screened-in resumes as an indicator that this resume is a good recommendation.
  • Work with Engineering to apply a user feedback loop to continuously improve the recommendation engine.
