
How to identify the most requested skills on the data science job market, with data science

Use scraping and NLP to extract information from job ads

Making a nice word cloud is not an essential skill, but it should be.

A few days ago, a tech recruiter in my LinkedIn network posted a request for learning resource recommendations for prospective data scientists. She wanted to ensure candidates were acquiring the right skills for today’s market. Among the replies she received, many were anecdotal ("I did this…I learned that"), suggesting the usual set of online learning platforms, while others contained generic recommendations, such as "practice coding interviews" or "try a Kaggle competition".

I thought: this is useless. How can I do all these things at once? Is there a better way of gathering useful information about the most requested skills in the job market right now? As a data science newcomer, I have felt like I am chasing the market, picking new projects or learning materials depending on what I see (or think I see) most commonly in job ads, or on what I am asked at interviews.

I decided to find out for myself: in this project I collect a number of job descriptions from a popular job board website, find the recurring key words in them, identify which ones correspond to acquirable skills, and investigate whether (and how) they correlate with the advertised salary.

Project structure

The project is divided into three main parts:

  • Data acquisition: the source of information is identified, the relevant job ad web pages are scraped, and the contents are saved and stored;
  • Data manipulation: the text contained in the job descriptions is cleaned through stemming, and vectorised;
  • Data analysis and conclusion: the most frequently recurring words corresponding to key skills are identified, and their correlation with the advertised salary is investigated.

You can find the code of the project here.

Data acquisition

The first step in the process is to identify the source of information, select the right tools, and acquire the data. As a source, I picked reed.co.uk, as it is easy to navigate, rich in detail, and quite popular in my region.

Using Beautiful Soup, I made a simple script that scrapes the job ads from a set number of result pages when searching for "Data Scientist" jobs in London, within 20 miles. The script saves the search results, then accesses the page corresponding to every job ad and stores the job description, salary, contract type and location information in JSON format. Each job is identified by a reference number, and if a job appears multiple times in the search (for example if promoted), it is saved only once.
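Below is a minimal sketch of what this scraping step could look like. The search URL pattern and the CSS selectors are hypothetical placeholders (reed.co.uk's real markup will differ), but the overall flow, paginated search, per-ad scraping, and de-duplication by reference number, mirrors the description above.

```python
# Minimal scraping sketch; URL pattern and selectors are hypothetical placeholders.
import json
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.reed.co.uk"  # assumption: illustrative search URL pattern below


def scrape_search_page(page_number):
    """Fetch one page of search results and return links to job detail pages."""
    url = f"{BASE_URL}/jobs/data-scientist-jobs-in-london?pageno={page_number}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Hypothetical selector: each result card links to a job detail page.
    return [a["href"] for a in soup.select("article a[href*='/jobs/']")]


def scrape_job(job_url, store):
    """Store description and salary, keyed by the job reference number."""
    job_id = job_url.rstrip("/").split("/")[-1]  # reference number de-duplicates promoted ads
    if job_id in store:
        return
    soup = BeautifulSoup(requests.get(BASE_URL + job_url).text, "html.parser")
    store[job_id] = {
        # Hypothetical selectors for the description and salary fields.
        "description": soup.select_one("span[itemprop='description']").get_text(" ", strip=True),
        "salary": soup.select_one("span[data-qa='salaryLbl']").get_text(strip=True),
    }


jobs = {}
for page in range(1, 5):
    for link in scrape_search_page(page):
        scrape_job(link, jobs)

with open("jobs.json", "w") as f:
    json.dump(jobs, f)
```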

Data manipulation

There is no shortage of tools for text manipulation and analysis, as NLP is one of the trendiest applications of machine learning at the moment (look [here](https://www.topbots.com/top-ai-ml-research-trends-2020/) and here for some interesting insight on this topic). My personal recommendation for learning is to take a look at the fastAI online material, which covers the fastAI library and much more.

The job ad descriptions contain words with different inflections (mainly plurals and -ing suffixes), punctuation and stop-words. They can all be cleaned up, leaving only the kind of words which may convey useful information. As a first step, I used the nltk package to perform stemming on the text, which removes word inflections and converts everything to lower case.

Comparison between the original text and its stemmed and lemmatised versions. Lemmatisation requires the part of speech of each word as input to work properly; for example, the words "Spoken" and "Understanding" start with capital letters, are treated as nouns, and are therefore not transformed into "speak" and "understand".

An alternative to stemming is lemmatising, which reduces words based on meaning, rather than spelling. I decided to use stemming, since my purpose is to isolate words in the text, rather than study the text meaning.
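A minimal sketch of this clean-up step using nltk's PorterStemmer, assuming the raw job descriptions are already available as plain strings:

```python
# Lower-case the text, keep only alphabetic words, drop stop-words, stem the rest.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))


def clean_text(text):
    """Return a string of stemmed, lower-case, stop-word-free words."""
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(w) for w in words if w not in stop_words)


clean_text("Experience with statistical modelling and machine learning")
# produces stems such as 'experi', 'statist', 'machin', matching tokens discussed later
```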

The second step is to vectorise the text using the TF-IDF vectoriser in scikit-learn. It returns a vocabulary containing all the words in the analysed text and, for each job ad description, a vector of term frequency times inverse document frequency (hence TF-IDF) values, one for each stemmed word. This score highlights meaningful words while down-weighting highly recurring ones such as "the", "a", "is", etc.
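A minimal sketch of this step, assuming `cleaned_descriptions` is the list of stemmed job description strings produced above:

```python
# Vectorise the cleaned job descriptions with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_descriptions)  # rows: job ads, columns: words
vocabulary = vectorizer.get_feature_names_out()

print(tfidf_matrix.shape)  # (number of job ads, vocabulary size)
```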

The matrix in the figure contains the TF-IDF values for the analysed text corpus. Each row corresponds to a job description text, and each column to a word in the corpus vocabulary.

Simply put, each row of the TF-IDF matrix corresponds to an observation (a job ad description), each column to a variable (a word in our vocabulary), and each entry to the value of that variable for that observation.

This is very useful to characterise the job descriptions, and we can use it to find the most frequent words, and see if they correlate to other factors, such as salary.

Most frequent key skills

The most frequent words across the whole corpus of job descriptions can be identified by summing the TF-IDF matrix along its columns, which gives a cumulative score for each word (a minimal sketch of this step follows the list below). I have filtered the results to remove some omnipresent words, such as "data" and "work". At the time of the analysis, I can observe, roughly in order and grouped by affinity:

Stemmed words with their relative importance score, across the job ads.
  • Experience, and Senior (much more than Junior);
  • Product, Market;
  • Modelling, Statistics;
  • Python, SQL;
  • NLP, Machine Learning, Deep Learning;
  • AWS;
  • Research, Insight, Analytics.
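As anticipated above, a minimal sketch of the column sum, assuming `tfidf_matrix` and `vocabulary` come from the vectorisation step:

```python
# Rank words by their cumulative TF-IDF score across all job ads.
import numpy as np
import pandas as pd

word_scores = np.asarray(tfidf_matrix.sum(axis=0)).ravel()  # column sums: one score per word
ranking = pd.Series(word_scores, index=vocabulary).sort_values(ascending=False)

# Drop a few omnipresent words before looking at the top of the ranking.
ranking = ranking.drop(["data", "work"], errors="ignore")
print(ranking.head(20))
```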

To my surprise, computer vision, FinTech and social media were somewhat less present. From a technical point of view, NLP and SQL stand out as skills worth acquiring, along with familiarity with AWS.

If you are Junior or entering the market (like me), then you probably won’t like seeing "experience" at the top of the list, but it is consistent with market trends.

In absolute terms, we can take a look at how frequently certain key-words appear in job ads, and use the obtained figure as a metric for our decision-making process.

Out of 167 analysed job ads, distribution for type of employment (a), occurrence of the term "Senior" in the job description (b), and of some key skills (c).

As an example, I have counted the occurrence of contract and permanent jobs, of the term "Senior", and of some key skills in job descriptions. Note how, from a simple vocabulary analysis, it is possible to gather valuable information about the importance of certain skills.
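A minimal sketch of this kind of count, assuming `jobs` is the dictionary of scraped ads from the data acquisition step (the key word list here is only illustrative):

```python
# Count how many job ads mention each key word at least once.
key_words = ["senior", "python", "sql", "nlp", "aws"]

counts = {
    word: sum(word in job["description"].lower() for job in jobs.values())
    for word in key_words
}
print(counts)  # number of analysed ads mentioning each key skill
```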

Correlation with salary

My plan is to define a data acquisition scheme to collect information from hundreds of job ads, and to explore the possibility of building a regression model to estimate the salary. I am testing the viability of this idea by evaluating the quality of the reported salary information, and by checking whether the occurrence of any key word correlates with salary.

Job ad pages contain information about the proposed salary range in a well defined field, which is easy to read. However, salary figures are often replaced by "Competitive salary", "Salary negotiable", or similar expressions. In this dataset, 69% of salary values are unfortunately missing. Another complication is that salary is reported per annum for full-time jobs and per day for contractor jobs, and differences in taxation regimes make comparison tricky.

We can check whether salary correlates with any key word by creating a single table with each job's salary figure and the subset of the TF-IDF matrix corresponding to the key skills, and using Pandas' corr() method to get the corresponding correlation matrix.
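A minimal sketch of this step, assuming `tfidf_df` is the TF-IDF matrix as a pandas DataFrame with vocabulary words as column names, `salaries` is a Series of parsed salary figures aligned to the same ads, and the key-skill list is illustrative:

```python
# Build a table of key-skill TF-IDF scores plus salary, then compute and plot its correlations.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

key_skills = ["python", "sql", "nlp", "vision", "deep", "machin", "statist", "financ"]
table = tfidf_df[key_skills].assign(salary=salaries)

corr = table.corr()  # pairwise correlation matrix
sns.clustermap(corr, cmap="vlag", vmin=-1, vmax=1)  # correlation-clustered heat map
plt.show()
```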

Correlation-clustered heat map (see seaborn's guide on how to make one) of "Salary" with relevant key words.

In a clustered correlation matrix, correlated quantities are put next to each other, and are therefore easy to spot. In this case, the minimum correlation score between any two variables is higher than -0.25, meaning that there are no significant negative correlations. With regards to positive correlation, we are interested in the squares which are white to red. The dendrograms show how the variables are progressively clustered.

The most evident correlations are between Python and SQL, which probably often appear together in the kind of job description that lists its requirements in detail, and between deep (learning) and NLP and vision, as NLP and computer vision are two of the trendiest applications of neural networks. Other visible correlations are between machin(e learning) and statist(ics), and between product and market.

What about salary? It weakly correlates with financ(e), which might say something about the most remunerative sector, but it is not enough evidence.

Note: clustered maps are also very useful for getting a visually intuitive feel for how the chosen key words are distributed across the job description corpus, how they could be clustered on that basis, and how the job descriptions could be clustered amongst themselves. You can see, for example, that the key words experience and python are almost evenly distributed across the corpus, whereas others, such as NLP and vision, are more concentrated (see my map for further detail).

Conclusions

In this article I have shown my approach to smart Job Hunting, using basic web-scraping and NLP techniques to gather insights into the job market. Based on the results:

  • I would consider investing time in improving my skills in NLP and SQL.
  • I would pay particular attention to enriching my portfolio with real-life projects and thus enhance my demonstrable experience in the field.
  • I would also work on my analytics and research skills, creating my own problems to solve, and producing valuable insights from them.

Further work

This is very much an open-ended project, as the information acquired depends on the moment and will change over time, and the analysis can be done on many different levels. For example, it would be interesting to:

  • Perform clustering on job ads, on the basis of their content, to find further links between key words and quantities or features characterising jobs.
  • Develop a job salary regression predictor, gathering large quantities of salary observations and correctly comparing compensation across different kinds of contracts.

I hope the work I have described here can be useful to others. Feel free to reach out and connect if you would like to talk about it.

About me

I am a Data Scientist currently looking for new opportunities. In the past few years, I have been working on applied quantum technologies for space applications.

GitHub: https://github.com/RaffaToSpace

LinkedIn: https://www.linkedin.com/in/raffaele-nolli-581a4351/

