Can we redefine data science titles based on roles?

Let’s find out if job descriptions are accurate to the titles they describe.

Published in Towards Data Science · 6 min read · Mar 12, 2020

For an earlier post, I had already done much of the hard work of scraping and cleaning job posting data. I expanded that dataset by collecting data science postings for various cities in North America. This time I wanted to focus on the job descriptions, the job titles given, and whether the descriptions matched the titles.

Here, I’m going to apply some techniques to try to better differentiate between these positions. First, I took the text of each job posting and recorded the presence or absence of key data science terms that I expected to be relevant. I then sorted each term by whether it was listed as a ‘requirement’ for the position or as an ‘asset’ (a nice-to-have).
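As a rough illustration of this step, here is a minimal sketch of turning posting text into presence/absence flags. It assumes the posting has already been split into a ‘requirements’ part and an ‘assets’ part; the function names (`skill_flags`, `encode_posting`), the shortened skill list, and the example text are hypothetical and not taken from the original analysis.

```python
import re

# A shortened, illustrative subset of the skill terms (assumption).
SKILLS = ["python", "sql", "tableau", "spark", "hadoop", "phd"]


def skill_flags(text, skills=SKILLS):
    """Return {skill: 0 or 1} for whether each term appears in the text."""
    lowered = text.lower()
    flags = {}
    for skill in skills:
        # Require non-alphanumeric neighbours so "sql" does not match "mysql".
        pattern = r"(?<![a-z0-9])" + re.escape(skill) + r"(?![a-z0-9])"
        flags[skill] = int(re.search(pattern, lowered) is not None)
    return flags


def encode_posting(requirements_text, assets_text):
    """Build one row of features such as 'python_requirements'."""
    row = {}
    for suffix, text in (("requirements", requirements_text),
                         ("assets", assets_text)):
        for skill, flag in skill_flags(text).items():
            row[f"{skill}_{suffix}"] = flag
    return row


example = encode_posting(
    "PhD or Master's in statistics; strong Python and SQL.",
    "Experience with Spark or Hadoop is an asset.",
)
print(example)
```

Single-letter terms such as “R” or “C” would need case-sensitive matching, which this sketch does not attempt.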

All job titles were assigned a ‘job type’ based on whether the title contained keywords such as ‘analyst’, ‘scientist’, ‘engineer’, or ‘manager’. I started with these because some version of them is often accepted as the ‘types’ of data science jobs out there. I started with ~6000 job postings and filtered out roughly 1000 whose titles I couldn’t sort into these ‘types’ (such as ‘VP of workforce planning’ or positions without a title given).
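The title-based assignment can be sketched roughly as follows. The function name `job_type` and the order in which keywords are checked are assumptions; the post does not say how titles matching more than one keyword were handled.

```python
# Keywords named in the post; the checking order is an assumption.
JOB_TYPES = ("analyst", "scientist", "engineer", "manager")


def job_type(title):
    """Return the first matching job type, or None if the title can't be sorted."""
    if not title:
        return None  # postings without a title given
    lowered = title.lower()
    for kind in JOB_TYPES:
        if kind in lowered:
            return kind
    return None


titles = ["Senior Data Scientist", "VP of workforce planning", "", "Data Analyst II"]
typed = [(t, job_type(t)) for t in titles]
# Keep only postings whose title could be sorted into a type.
kept = [(t, k) for t, k in typed if k is not None]
print(kept)
```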

For this dataset, we have 50 dimensions of skills, programming languages, and software knowledge that companies are looking for as either a ‘requirement’ or an ‘asset’. I first ran a PCA that used a maximum likelihood estimator to choose the number of components. I then dropped any components that explained less than 5% of the variance, which left 5 meaningful components. Finally, I re-ran the PCA with 5 components. Here are the results, coloured by the described position type:
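The two-pass PCA described above might look something like this in scikit-learn. This is a sketch on synthetic data, not the original code: the matrix `X`, the four hypothetical posting profiles used to generate it, and all variable names are assumptions. Only `n_components="mle"` and the 5% cut-off come from the description in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 600 postings x 50 binary
# skill flags, drawn from 4 made-up posting profiles so some structure exists.
n_postings, n_features, n_profiles = 600, 50, 4
profiles = rng.choice([0.1, 0.9], size=(n_profiles, n_features))
membership = rng.integers(0, n_profiles, size=n_postings)
X = (rng.random((n_postings, n_features)) < profiles[membership]).astype(float)

# Pass 1: let the maximum likelihood estimate choose the dimensionality.
first_pass = PCA(n_components="mle", svd_solver="full").fit(X)
ratios = first_pass.explained_variance_ratio_

# Keep only components that each explain at least 5% of the variance
# (guarded so that at least one component is always kept).
n_keep = max(1, int((ratios >= 0.05).sum()))

# Pass 2: refit with the reduced number of components.
pca = PCA(n_components=n_keep)
reduced = pca.fit_transform(X)
print(n_keep, reduced.shape)
```

On the real data this procedure reportedly left 5 components; the synthetic data here will generally give a different number.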

This is a pretty big mess, with job type giving no clear delineation of the spread in the data. So, I ran a k-means clustering with 5 clusters. I then reproduced the above figure, changing the colour from the described position type to the cluster label identified by k-means.
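A minimal sketch of the clustering step, again on synthetic data: `scores` stands in for the 5-component PCA output, and the `n_init` and `random_state` settings are assumptions rather than the settings used in the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the 5-component PCA scores (assumption):
# 5 blobs of 100 postings each in 5 dimensions.
centres = rng.normal(scale=3.0, size=(5, 5))
scores = np.vstack([c + rng.normal(size=(100, 5)) for c in centres])

# Fit k-means with 5 clusters and get one label per posting.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(scores)

# Number of postings assigned to each cluster.
sizes = np.bincount(labels, minlength=5)
print(sizes)
```

To recolour the PCA figure, the first two columns of `scores` could be plotted with `labels` as the colour variable in place of the job type.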

PCA with 5 components on job postings, coloured by k-means cluster label.

There is still some overlap, especially between clusters 0 and 2, and some clusters are still widely spread, but these k-means labels seem to describe the variance in the positions better than the titles do.

But what do these k-means labels mean?

On their own, they don’t really mean anything except that the algorithm has identified and classified different clusters. So let’s focus on the greatest differences between these clusters across the required and nice-to-have skills and attributes. For the heatmap below, we want to focus on the darkest colours (deepest reds and blues), as these are the skills that explain the greatest variance for that cluster. In addition, reading the chart along the ‘columns’, we can see which attributes account for the greatest variation between clusters. Examples are python_requirements, which varies between most of the clusters, and graduateEducation_requirements, which shows a large variation between clusters 0 and 1, and between those two and the rest of the clusters.

Heatmap of the contribution of PCA for each skill for each cluster.

Next, we turn to how often these skills appear in the job postings for different clusters. First, we subset to the skills or attributes whose occurrence differs by more than 40% between the k-means clusters. Cluster 2 is often the highest, listing lots of requirements and assets that other postings don’t. Other major differences emerge, with clusters 1 and 4 putting lower emphasis on university education (notice the ‘university’, ‘master’, and ‘phd’ panels). You can also see an emphasis on some machine learning skills in clusters 0, 2, and 4 (e.g., python, machine learning, and spark).

Job attributes with a greater than 40% difference between the minimum occurrence in a cluster and the maximum occurrence in a cluster.
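The 40% filter behind this figure can be sketched as below. The helper names (`cluster_rates`, `high_spread_features`) and the toy rows are hypothetical. The only part taken from the text is the rule itself: keep an attribute when the gap between its lowest and highest occurrence across clusters exceeds 40 percentage points.

```python
from collections import defaultdict


def cluster_rates(rows, labels):
    """Share of postings in each cluster that list each attribute."""
    totals = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        counts[label] += 1
        for feature, flag in row.items():
            totals[label][feature] += flag
    return {
        label: {f: totals[label][f] / counts[label] for f in totals[label]}
        for label in counts
    }


def high_spread_features(rates, threshold=0.40):
    """Attributes whose max-minus-min occurrence across clusters exceeds threshold."""
    features = set().union(*(r.keys() for r in rates.values()))
    spread = {}
    for f in features:
        values = [r.get(f, 0.0) for r in rates.values()]
        spread[f] = max(values) - min(values)
    return {f: s for f, s in spread.items() if s > threshold}


# Toy data (assumption): four postings in two clusters.
rows = [
    {"python_requirements": 1, "sql_requirements": 1},
    {"python_requirements": 1, "sql_requirements": 1},
    {"python_requirements": 0, "sql_requirements": 1},
    {"python_requirements": 0, "sql_requirements": 1},
]
labels = [0, 0, 1, 1]

rates = cluster_rates(rows, labels)
selected = high_spread_features(rates)
print(selected)
```

In this toy example only the Python requirement differs between the two clusters, so it is the only attribute that passes the filter.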

If we zoom out to all requirements, we again see that Cluster 2 places strong importance on education, with graduate education required for 100% of posts and a PhD listed as a requirement for over 50% of these jobs. The cluster also features related keywords that fit our understanding of it as highly educated, such as computer science and statistics (e.g., as in ‘PhD or Master’s in computer science or statistics’).

Cluster 2 importance of different skills with ‘education’ requirements topping the list.

But are these clusters meaningful? Looking at the job requirements for the fourth cluster, they are almost equally likely to include Hadoop and Tableau. These would seem to sit at opposite ends of the data science spectrum: data engineering on one side, and more analytics-focused roles using dashboards and visualizations on the other. That SQL is the second most common requirement in this cluster could be evidence in either direction.

Another of the identified clusters was made up of positions with few requirements given. My explanation for this is that these job postings describe the position without listing many requirements or assets, so they are being clustered together as some kind of ‘null’ position. This isn’t really useful for understanding these positions.

Cluster 1 with very few required skills (note the x-axis scale).

If we come back to Randy’s ‘optimistic result’ of coming up with better role definitions, I don’t think we achieved this. The data is likely complicated and noisy because data science postings are often written in a laundry-list style or are looking for a data scientist ‘unicorn’. A related caveat is whether these differences are actually attributable to differences in the role, or whether some companies or people simply write descriptions in different ways. Some might write postings that go heavy on listing requirements and assets, whereas others might just generally describe the position without the keywords I was searching for here. So while it would be nice if we could get to a clearer understanding of different data science positions and their requirements and roles, I wasn’t able to get there with this line of analysis.

If you have ideas on how to improve this or other questions, feel free to reach out on LinkedIn or Twitter.

Notes

The updated code I used for this analysis is available in my GitHub repository.
I wanted to do this after some earlier work focused on my current city of Vancouver.
Other people have had more luck than me sorting out these roles: https://towardsdatascience.com/what-type-of-data-scientist-are-you-84c3c2b9fc16. The difference in success might be attributed to searching for more specific job titles (e.g., “data engineer”, “machine learning engineer”) and with a smaller dataset.
Full list of skills/assets: “bigquery”, “python”, “sql”, “jira”, “tableau”, “docker”, “scala”, “java”, “spark”, “hadoop”, “statistics”, “nlp”, “cnn”, “rnn”, “programming”, “R”, “bachelor”, “master”, “phd”, “C”, “machine learning”, “CS”, “SAS”, and “AI”.

Written by Tim Cashion