Create and Explore the Landscape of Roles and Salaries in Data Science
The data science field is in constant development, and new roles and functions keep being created. The traditional data science role is evolving into dozens of new ones, such as Data Engineer, ML Engineer, Product Data Analyst, Research Scientist, and Cloud Data Engineer. In this blog, we will load the data science salary data set and create a landscape to examine how roles relate to each other based on location, remote work, job title, and experience level, and how these relate to salary. I will demonstrate the steps to create such a landscape and how to analyze the samples in more depth using unsupervised clustering. The cluster analysis is performed with the clusteval library, whereas the scatter plots are created with scatterd and, for interactive plots, d3blocks.
Introduction
Data Science is one of the fastest-changing fields in today’s digital landscape. The role of a data scientist is to solve complex problems and drive informed decision-making. Knowledge of statistical techniques, machine learning algorithms, and data visualization is fundamental. But when it comes to creating products, it also requires deep engineering knowledge together with data governance, ethics, and privacy, among others. The know-how needed as a data scientist thus depends on the exact role description and the business, among other factors. Nowadays there are many new roles, such as Data Engineer, ML Engineer, Product Data Analyst, Research Scientist, and Cloud Data Engineer. Each role in its turn also changes with experience level. Entry-level positions typically involve working under the guidance of senior data scientists, assisting with data preprocessing, and contributing to model development. As you progress, mid-level roles require a deeper understanding of statistical analysis, feature engineering, and algorithm selection. Senior data scientists often lead projects, mentor junior team members, and contribute to strategic planning.
Salaries in the data science field vary based on experience, industry, country, and advanced qualifications. For many years, location played a very important role in data science opportunities and compensation. Tech hubs like Silicon Valley, New York City, and San Francisco offer high-paying positions. However, in recent years it has become technically easier, and more accepted, to work from different locations worldwide. Let’s see whether we can confirm some of these trends using the data science salary data set.
If you find this article helpful, you are welcome to follow me because I write more about topics like this one. If you are thinking of taking a Medium membership, you can support my work a bit by using my referral link. It is the same price as a coffee but this allows you to read unlimited articles every month!
Libraries used for the analysis.
We will use seven libraries that help to load the data set, perform the pre-processing steps, analyze the data, create visualizations, and build the data science landscape. If you want to dive into the details, I recommend reading the blogs listed below the installation instructions.
# Easy import of data sets
pip install datazets
# One-hot encoding
pip install df2onehot
# PCA analysis. Create explainable biplots
pip install pca
# Clustering with automatic evaluation
pip install clusteval
# Making beautiful scatter plots
pip install scatterd
# Making beautiful interactive scatter plots
pip install d3blocks
# HNET for association analysis of the cluster labels
pip install hnet
What are PCA loadings and how to effectively use Biplots?
From Data to Clusters: When is Your Clustering Good Enough?
From Clusters To Insights; The Next Step
Get the most out of your Scatterplot by making it interactive using D3js and Python.
Explore and understand your data with a network of significant associations.
The Data Science Salary Dataset
The data science salary data set is derived from ai-jobs.net [1] and is also available as a Kaggle competition [2]. The data set contains 11 features for 4134 samples. The samples are collected worldwide and updated weekly, from 2020 up to the time of writing (early 2023). The data set is published in the public domain and is free to use. Let’s load the data and have a look at the variables.
# Import library
import datazets as dz
# Get the data science salary data set
df = dz.get('ds_salaries.zip')
# The features are as follows
df.columns
# 'work_year' > The year the salary was paid.
# 'experience_level' > The experience level in the job during the year.
# 'employment_type' > Type of employment: Part-time, Full-time, Contract, or Freelance.
# 'job_title' > Name of the role.
# 'salary' > Total gross salary amount paid.
# 'salary_currency' > Currency of the salary paid (ISO 4217 code).
# 'salary_in_usd' > Salary converted to USD.
# 'employee_residence' > Primary country of residence.
# 'remote_ratio' > Amount of remote work: none (<20%), partially remote, or mostly remote (>80%).
# 'company_location' > Country of the employer's main office.
# 'company_size' > Average number of people that worked for the company during the year.
# Selection of only European countries (requires numpy)
# import numpy as np
# countries_europe = ['SM', 'DE', 'GB', 'ES', 'FR', 'RU', 'IT', 'NL', 'CH', 'CF', 'FI', 'UA', 'IE', 'GR', 'MK', 'RO', 'AL', 'LT', 'BA', 'LV', 'EE', 'AM', 'HR', 'SI', 'PT', 'HU', 'AT', 'SK', 'CZ', 'DK', 'BE', 'MD', 'MT']
# df['europe'] = np.isin(df['company_location'], countries_europe)
A summary of the top job titles together with the distribution of the salaries is shown in Figure 1. The two top panels are worldwide, whereas the bottom two panels are for Europe only. Although such graphs are informative, they show averages, and it remains unclear how location, experience level, remote work, country, and so on are related in a particular context. As an example: is the salary of an entry-level data engineer who works remotely for a small company more or less similar to that of an experienced data engineer with other properties? Such questions can be better answered with the analysis shown in the next sections.
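The plotting code for Figure 1 is not shown in this blog, but a similar summary can be put together with a few lines of pandas and matplotlib. Below is a minimal sketch; the variable names are mine, and the original figure may have been created differently.
# Import library
import matplotlib.pyplot as plt
# Top 10 most frequent job titles and the overall salary distribution (worldwide)
top_titles = df['job_title'].value_counts().head(10)
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
top_titles.sort_values().plot(kind='barh', ax=axs[0], title='Top 10 job titles')
df['salary_in_usd'].plot(kind='hist', bins=50, ax=axs[1], title='Salary distribution (USD)')
plt.tight_layout()
plt.show()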
Preprocessing
The data science salary data set is a mixed data set containing continuous and categorical variables. We will perform an unsupervised analysis and create the data science landscape. But before doing any preprocessing, we need to remove redundant features such as `salary_currency` and `salary` to prevent multicollinearity issues. In addition, we will exclude the variable `salary_in_usd` from the data set and store it as the target variable `y`, because we do not want the grouping to be driven by the salary itself. Based on the clustering, we can then investigate whether any of the detected groupings can be related to salary. The cleaned data set results in 8 features for the same 4134 samples.
# Store salary in separate target variable.
y = df['salary_in_usd']
# Remove redundant variables
df.drop(labels=['salary_currency', 'salary', 'salary_in_usd'], inplace=True, axis=1)
# Make the categorical variables easier to understand.
df['experience_level'] = df['experience_level'].replace({'EN':'Entry-level', 'MI':'Junior Mid-level', 'SE':'Intermediate Senior-level', 'EX':'Expert Executive-level / Director'}, regex=True)
df['employment_type'] = df['employment_type'].replace({'PT':'Part-time', 'FT':'Full-time', 'CT':'Contract', 'FL':'Freelance'}, regex=True)
df['company_size'] = df['company_size'].replace({'S':'Small (less than 50)', 'M':'Medium (50 to 250)', 'L':'Large (>250)'}, regex=True)
df['remote_ratio'] = df['remote_ratio'].replace({0:'No remote', 50:'Partially remote', 100:'>80% remote'}, regex=True)
df['work_year'] = df['work_year'].astype(str)
df.shape
# (4134, 8)
The next step is to get all measurements into the same unit of measurement. To do this, we will carefully perform one-hot encoding and take care of the multicollinearity that we can unknowingly introduce. In other words, when we transform a categorical variable into multiple one-hot variables, we introduce a bias that allows us to perfectly predict one feature from the other features of the same categorical column (the sum of the one-hot encoded features is always one). This is called the dummy trap, and we can prevent it by breaking the chain of linearity by simply dropping one column. The `df2onehot` package contains a dummy trap protection feature. This feature is slightly more advanced than simply dropping one one-hot column per category, because it only removes a one-hot column if the chain of linearity is not already broken by other cleaning actions, such as requiring a minimum number of samples per one-hot feature or removing the `False` state of boolean features.
# Import library
from df2onehot import df2onehot
# One hot encoding and removing any multicollinearity to prevent the dummy trap.
dfhot = df2onehot(df,
                  remove_multicollinearity=True,
                  y_min=5,
                  verbose=4)['onehot']
print(dfhot)
# work_year_2021 ... company_size_Small (less than 50)
# 0 False ... False
# 1 False ... False
# 2 False ... False
# 3 False ... False
# 4 False ... False
# ... ... ...
# 4129 False ... False
# 4130 True ... False
# 4131 False ... True
# 4132 False ... False
# 4133 True ... False
# [4134 rows x 115 columns]
In our case, we will remove one-hot encoded features that contain fewer than 5 samples (`y_min=5`), and remove multicollinearity to prevent the dummy trap (`remove_multicollinearity=True`). This results in 115 one-hot encoded features for the same 4134 samples.
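For intuition, a similar protection can be mimicked with plain pandas by dropping the first one-hot column of every category, although df2onehot adds the extra cleaning logic described above. A minimal sketch for comparison; dfhot_pd is my own variable name and is not used further in the analysis.
# Import library
import pandas as pd
# One-hot encode all categorical columns; drop_first=True removes one column per category to avoid the dummy trap
dfhot_pd = pd.get_dummies(df, drop_first=True)
print(dfhot_pd.shape)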
PCA Analysis Shows that Salary is Driven by Experience Level, Company Size, and Location.
We are going to analyze the preprocessed data set using PCA and determine how the samples are related to each other based on the loadings and by visual inspection. See the code section below on how to perform the PCA analysis, and read this blog for more details. The relationships between the samples, together with the loadings of the features, are shown in the biplot (Figure 2). Take the time to understand it; it contains a lot of information. First of all, each point in this 2D space is one of the 4134 samples, and the distance between two points describes their similarity based on the 115 features: the closer two points are, the more similar they are in some way. In the background of the plot, a density map is shown with a red glow that stresses high- and low-density areas. The size of each point reflects the salary; a higher salary gives a larger point. The color is based on `job_title`, and the marker is set to `experience_level`.
# Import library
from pca import pca
# Initialize
model = pca(normalize=False)
# Fit model using PCA
model.fit_transform(dfhot)
# Make biplot
model.biplot(labels=df['job_title'],
             s=y/500,
             marker=df['experience_level'],
             n_feat=10,
             density=True,
             fontsize=0,
             jitter=0.05,
             alpha=0.8,
             color_arrow='#000000',
             arrowdict={'color_text': '#000000', 'fontsize': 32},
             figsize=(40, 30),
             verbose=4,
             )
If we look at Figure 2, we can see that the 1st principal component contains 18.9% of the explained variance and the 2nd principal component 14.7%. This indicates that the first two components capture a large chunk of the explained variance in the data, which makes it worthwhile to further interpret the sample relationships and the loadings. The loadings (black arrows) describe which of the original features were responsible for the variance in the principal components. The largest loading is from the category `work_year`, which separates the data into two cigar-like shapes (bottom left towards top right), whereas `experience_level` and `company_size` create the long tail of samples towards the top right corner.
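These numbers can also be inspected programmatically. The sketch below assumes that the fitted pca model stores the explained variance and the loadings under the 'explained_var' and 'loadings' keys of model.results, as recent versions of the pca library do; adjust the key names if your version differs.
# Explained variance captured by the principal components
print(model.results['explained_var'][:2])
# Top-10 one-hot features with the strongest absolute loading on the first principal component
# (orientation assumed: rows are the PCs, columns are the features)
loadings = model.results['loadings']
print(loadings.loc['PC1'].abs().sort_values(ascending=False).head(10))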
Samples with larger point sizes indicate higher salaries, and these appear to group together (bottom left). Many of the samples in the bottom left corner have a rectangular marker, which depicts the senior experience level. The more we move towards the tail in the top right, the smaller the dots become (relatively lower salaries) and the more mixed the markers, meaning various experience levels. For visual investigation, we can apply different colorings, as shown in Figure 3. Overall, we can see that samples cluster per year into cigar-like shapes. Within and across the years, patterns of experience level, company size, and remote work are visible.
# Import library
from scatterd import scatterd
# Create various scatter plots with different coloring.
model.scatter(labels=df['company_size'],
              s=y/500,
              marker=df['experience_level'],
              density=True,
              fontsize=20,
              jitter=0.05,
              alpha=0.8,
              figsize=(40, 30),
              verbose=4,
              grid=True,
              legend=True,
              )
When we color the samples based on country (Figure 4), the European countries pop up mainly in the tail (colored grey). These results indicate that the higher-paid positions are mainly outside Europe, are held at senior experience levels, and are at large companies. Remote work does not seem to make a difference. This trend also does not seem to have changed over the years 2020 to 2023.
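Figure 4 can be approximated by coloring the same scatter plot with a Europe versus non-Europe label derived from company_location. A minimal sketch, reusing the country list from the commented-out snippet earlier; the label is kept in a separate array so that it does not end up in the one-hot encoding later on.
# Import library
import numpy as np
# European country codes (same list as in the commented-out snippet above)
countries_europe = ['SM', 'DE', 'GB', 'ES', 'FR', 'RU', 'IT', 'NL', 'CH', 'CF', 'FI', 'UA', 'IE', 'GR', 'MK', 'RO', 'AL', 'LT', 'BA', 'LV', 'EE', 'AM', 'HR', 'SI', 'PT', 'HU', 'AT', 'SK', 'CZ', 'DK', 'BE', 'MD', 'MT']
# Label per sample, stored separately to keep df unchanged
europe_label = np.where(np.isin(df['company_location'], countries_europe), 'Europe', 'Outside Europe')
# Color the PCA scatter plot by the Europe label
model.scatter(labels=europe_label,
              s=y/500,
              marker=df['experience_level'],
              density=True,
              fontsize=20,
              jitter=0.05,
              alpha=0.8,
              figsize=(40, 30),
              grid=True,
              legend=True,
              )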
Creating The Landscape for Data Science Roles and Salaries.
To create the landscape, we will use t-distributed stochastic neighbor embedding (t-SNE). In the preprocessing step, the `work_year` column is removed to avoid any separation by year (the patterns across the years were very similar). After the t-SNE embedding, we can scatter the data points in the 2D space using the `scatterd` library. This library highlights the dense areas (`density=True`) and adds transparency towards the less dense areas (`gradient='opaque'`), which keeps the scatter plot tidy. The landscape is shown in Figure 5. Just like in the previous section, samples are colored by job title, markers are set to experience level, and the dot size is based on the salary in USD.
# Import libraries
from scatterd import scatterd
from sklearn.manifold import TSNE
# Remove work year from dataframe
df.drop(labels=['work_year'], inplace=True, axis=1)
# Create new one hot matrix without work year
dfhot = df2onehot(df, remove_multicollinearity=True, y_min=5, verbose=4)['onehot']
# Feature embedding using tSNE
X = TSNE(n_components=2, init='pca', perplexity=100).fit_transform(dfhot.values)
# Create the scatter plot of the t-SNE coordinates
fig, ax = scatterd(X[:, 0],
                   X[:, 1],
                   marker=df['experience_level'],
                   s=y/500,
                   labels=df['job_title'],
                   fontsize=0,
                   density=True,
                   args_density={'alpha': 0.4},
                   gradient='opaque',
                   edgecolor='#000000',
                   jitter=1,
                   grid=True,
                   legend=False,
                   figsize=(40, 30),
                   )
Detection of the driving features.
To determine which features drive the grouping of samples, we need to investigate the groups in the landscape (Figure 5). We will tackle this question in two parts: first by clustering and then by enrichment analysis. Both steps can be performed with the [clusteval](https://towardsdatascience.com/from-data-to-clusters-when-is-your-clustering-good-enough-5895440a978a) library. Let’s load the library and start clustering the data.
The clustering method is `DBSCAN`, and the evaluation approach is the `Silhouette score`. Within [clusteval](https://towardsdatascience.com/from-data-to-clusters-when-is-your-clustering-good-enough-5895440a978a), a grid search is performed across the epsilon parameter of DBSCAN, and the clustering with the best Silhouette score is used. In our case, we detected 24 clusters (clusters 0 to 23). The results are shown in the code section below, and Figure 6 shows the grid search over epsilon together with the Silhouette coefficients. For more details, I recommend reading the blog: From Data to Clusters: When is Your Clustering Good Enough?
# Import library
from clusteval import clusteval
# Initialize clusteval
ce = clusteval(cluster='dbscan', metric='euclidean', linkage='complete', min_clust=7, normalize=True, verbose='info')
# Cluster evaluation
results = ce.fit(X)
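To verify the number of detected clusters, we can count the unique cluster labels. The sketch below assumes that the fit results contain the labels under the 'labx' key, and that ce.plot() and ce.plot_silhouette() produce the evaluation plots of Figure 6, as in recent versions of clusteval.
# Import library
import numpy as np
# Count the unique cluster labels and their sizes
uilabels, counts = np.unique(results['labx'], return_counts=True)
print('Unique cluster labels:', uilabels)
print('Cluster sizes:', counts)
# Plot the epsilon grid search and the Silhouette coefficients (Figure 6)
ce.plot()
ce.plot_silhouette()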
For the detected cluster labels, we can analyze the driving features behind the clusters using enrichment analysis. Read From Clusters To Insights; The Next Step for more details. The code section below shows which features are significantly associated with each cluster label. We can explore these results manually, but we can also make a scatter plot that overlays the top n features on the scatter plot (Figures 7 and 8).
# Compute enrichment for each of the cluster labels
ce.enrichment(df)
# Show the significantly associated categories for the cluster labels
print(ce.results['enrichment'])
# category_label P ... category_name Padj
# 0 Entry-level 8.988604e-31 ... experience_level 5.954950e-27
# 1 Junior Mid-level 3.818216e-294 ... experience_level 2.547895e-290
# 2 Intermediate Senior-level 5.812236e-51 ... experience_level 3.857000e-47
# 3 Junior Mid-level 4.519280e-43 ... experience_level 2.997639e-39
# 4 Junior Mid-level 1.477602e-68 ... experience_level 9.821622e-65
# .. ... ... ... ... ...
# 146 Medium (50 to 250) 6.991347e-12 ... company_size 4.603802e-08
# 147 Large (>250) 1.424008e-61 ... company_size 9.459684e-58
# 148 Small (less than 50) 1.487384e-55 ... company_size 9.874743e-52
# 149 Medium (50 to 250) 4.985496e-22 ... company_size 3.296410e-18
# 150 Medium (50 to 250) 1.461693e-06 ... company_size 9.553627e-03
# [151 rows x 11 columns]
# Create scatter plot with enrichment results.
ce.scatter(n_feat=4, s=y/500, jitter=0.05, fontsize=14, density=True, params_scatterd={'marker':df['experience_level'], 'gradient':'opaque', 'dpi':200}, figsize=(40,30))
# Show only the dense areas with the enrichment results (the individual points are hidden with s=0).
ce.scatter(n_feat=4, s=0, jitter=0.05, fontsize=14, density=True, params_scatterd={'marker':df['experience_level'], 'gradient':'opaque', 'dpi':200}, figsize=(40,30))
Interactive visualization using D3Blocks.
As a final exercise, we will create an interactive scatter plot for both the 2D PCA and the t-SNE embedding. This helps to better understand how the individual samples are distributed; zooming and panning make it easy to investigate individual samples and/or groups.
# Import libraries
from scatterd import scatterd, jitter_func
from d3blocks import D3Blocks, normalize
import numpy as np
# Initialize
d3 = D3Blocks()
tooltip = []
for i in range(0, df.shape[0]):
    # Build a tooltip per sample with all feature values plus the salary
    tip = '<br>'.join(list(map(lambda x: x[0].replace('_', ' ').title() + ': ' + x[1], np.array(list(zip(df.columns, df.iloc[i, :].values))))))
    tip = tip + '<br>' + 'Salary: $' + str(y[i])
    tooltip.append(tip)
# Set all properties
d3.scatter(jitter_func(X[:, 0], jitter=1),                                  # tSNE x-coordinates
           jitter_func(X[:, 1], jitter=1),                                  # tSNE y-coordinates
           x1=jitter_func(model.results['PC']['PC1'].values, jitter=0.05),  # PC1 x-coordinates
           y1=jitter_func(model.results['PC']['PC2'].values, jitter=0.05),  # PC2 y-coordinates
           color=df['job_title'].values,                                    # Color on job title
           tooltip=tooltip,                                                 # Tooltip per sample
           size=normalize(y.values, minscale=1, maxscale=25),               # Node size based on salary
           opacity='opaque',                                                # Tidy scatter plot by only highlighting dense areas
           stroke='#000000',                                                # Edge color is black
           cmap='tab20',                                                    # Colormap
           scale=True,                                                      # Scale the data points
           label_radio=['tSNE', 'PCA'],                                     # Radio buttons to switch between the two embeddings
           figsize=[1024, 768],
           filepath='Data_science_landscape.html',
           )
Final words.
We started this quest to examine trends in the data science field using the data science salary data set. We can conclude that the higher-paid positions are mainly outside Europe, are held at senior experience levels, and are at large companies. Remote work does not seem to make a difference. This trend appears unchanged over the years 2020 to 2023. We created an interactive data science landscape and analyzed the clusters using enrichment analysis. The interactive scatter plot is created with the D3Blocks library. I hope you enjoyed reading and learned from this blog. Just remember that two important features were not included in the data set: happiness and time spent with family and friends. – Nothing is forever, enjoy the moment. –
Be Safe. Stay Frosty.
Cheers, E.
If you found this article helpful, you are welcome to follow me because I write more about such topics. If you are thinking of taking a Medium membership, you can support my work by using my referral link. It is the same price as a coffee but this allows you to read unlimited articles every month!
Software
- Scatterd Documentation pages
- Clusteval Documentation pages
- D3Blocks Documentation pages
- PCA Documentation pages
Let’s connect!
References
- https://ai-jobs.net/salaries/download/ (CC0: Public Domain)
- Salary trends in AI, ML, Data around the world from 2020–2023
- Creating beautiful stand-alone interactive D3 charts with Python, Medium, 2022
- From Data to Clusters: When is Your Clustering Good Enough? Medium, 2023
- From Clusters To Insights; The Next Step, Medium, 2022
- Get the Most Out of Your Scatterplot by Making It Interactive Using D3js and Python, Medium, November 2022
- Quantitative comparisons between t-SNE, UMAP, PCA, and Other Mappings, Medium, 2022