Salary, Job Satisfaction, Trend of Data Jobs

What Can We Learn From Stack Overflow Survey Data

Chuangxin Lin

Published in Towards Data Science, Jun 6, 2021 (12 min read)

We are now in the era of data, and the job market has seen sustained demand for data-related jobs over the last few years. Data Scientist, Data Analyst and Data Engineer are the three main streams of data-related jobs. For anyone who is interested in entering the field, or who is already in it, it is useful to understand the current state of the data job market. Information such as job demand, salary, and satisfaction can give you more insight when you plan your next move along a data career path. In this post, I use one particular dataset, the Stack Overflow Annual Developer Survey, to explore some of these questions. The post covers some data analysis and modeling, in three main parts:

Part 1: Salary

What data role earns the highest salary?
Salaries of data roles among different countries;
Salaries by years of work experience;
Salaries by gender in the data field;
Salaries vs. job satisfaction.

Part 2: Change of data jobs, comparing 2020 data to 2019

Change of the numbers of data roles;
Change of Salary;
Change of Job Satisfaction.

Part 3: Job Satisfaction

Multi-class classification using XGBoost to predict job satisfaction;
Insight from modeling.

Notes on data:

Before presenting the analysis results, it is worth explaining some important data processing steps. Feel free to skip this section and return to it when you find it necessary.

The survey targets people with a general developer background, covering both software development and data analytics. The 2020 survey data consists of 64,461 responses. I used the “DevType” question to pre-process the data so that only respondents who hold a data-related role are analyzed. The question from the questionnaire is as follows.

Which of the following describe you? Please select all that apply.
[] Academic researcher
[] Data or business analyst (named as DA)
[] Data scientist or machine learning specialist (named as DS)
[] Database administrator
[] Designer
[] Developer, backend
[] Developer, desktop or enterprise applications
[] Developer, embedded applications or devices
[] Developer, frontend
[] Developer, fullstack
[] Developer, game or graphics
[] Developer, mobile
[] Developer, QA or test
[] DevOps specialist
[] Educator
[] Engineer, data (named as DE)
[] Engineer, site reliability
[] Engineering manager
[] Marketing or sales professional
[] Product manager
[] Scientist
[] Senior Executive (CSuite, VP, etc.)
[] Student
[] System administrator
[] Other

Note that a respondent can select more than one DevType option, and even more than one data role (i.e., DA, DS, DE). I keep only the entries that are associated with a data role and split them by type of data role. As a result, a single survey response that lists several data roles produces several instances, one per role.

To make the results more consistent, I also filtered on employment status and kept only respondents who are currently employed. After this preparation, there are 11,186 instances for our analyses of the data roles. The same preparation was applied to the 2019 survey data, producing 17,370 instances, which will be compared with the 2020 data in Part 2.
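The role-splitting and employment filter described in these notes can be sketched in plain Python. This is only an illustration, not the code from the notebook: the sample rows are made up, and the sketch assumes that multi-select DevType answers are stored as one semicolon-separated string and that employed respondents have an “Employment” value starting with “Employed”. The helper name `to_role_instances` is hypothetical.

```python
# Map the survey's DevType options to the short role names used in this post.
DATA_ROLES = {
    "Data or business analyst": "DA",
    "Data scientist or machine learning specialist": "DS",
    "Engineer, data": "DE",
}


def to_role_instances(rows, sep=";"):
    """Yield one record per (respondent, data role) pair, employed respondents only.

    Assumptions: multi-select DevType answers are joined with `sep`, and the
    Employment value of an employed respondent starts with "Employed".
    """
    for row in rows:
        if not (row.get("Employment") or "").startswith("Employed"):
            continue  # drop respondents who are not currently employed
        for dev_type in (row.get("DevType") or "").split(sep):
            role = DATA_ROLES.get(dev_type.strip())
            if role is not None:
                # Copy the response and tag it with a single data role.
                yield {**row, "DataRole": role}


# Made-up rows, for illustration only.
sample = [
    {"Respondent": 1, "Employment": "Employed full-time",
     "DevType": "Data or business analyst;Engineer, data;Developer, backend"},
    {"Respondent": 2, "Employment": "Student",
     "DevType": "Data scientist or machine learning specialist"},
    {"Respondent": 3, "Employment": "Employed part-time",
     "DevType": "Developer, frontend"},
]

instances = list(to_role_instances(sample))
print([(r["Respondent"], r["DataRole"]) for r in instances])
```

Respondent 1 appears twice in the result, once as DA and once as DE, which is why the number of instances can exceed the number of respondents with a data role.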

Part 1: Facts on Salary

Let’s first check the distribution of the three data jobs. Their counts are very close, with DA ranked first, followed by DS and DE (Fig. 1a). In terms of average salary (in USD), DS and DE are almost the same, while DA is slightly lower (Fig. 1b).

Fig.1. (a) Distribution of data roles from the 2020 survey. (b) Salaries of data jobs. Image by author.

When further displaying the salaries among different countries (top 9 countries with the most responses), we can see some interesting details in Fig. 2:

Data jobs in the United States pay the most. The average salary in the United States is substantially higher than in other countries, including other developed Western countries (e.g., Germany, UK, Canada).
DS (and DE) is not necessarily a higher-income job than DA. Indeed, the DA salary is noticeably higher than the DS salary in Canada and France. This may remind you of the saying that data scientist is the sexiest job of the 21st century. It may not be as sexy as you thought, at least judging by the salaries in the survey data. However, the Stack Overflow survey data does not necessarily represent the real-world population.

Fig. 2. Salaries of data jobs in different countries. Image by author.

An important factor related to salary is years of work experience. In the survey data, we use “Year of Professional Coding” as a proxy for work experience and divide the respondents into four groups by it. In Fig. 3, from “0–3 years” to “6–13 years”, salaries increase gradually and slowly, while the “13+ years” group shows a substantial rise.

Fig. 3. Salaries of data jobs in different “Year of Professional Coding” groups. The groups are divided so that their sizes are as equal as possible. Image by author.
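Splitting respondents into groups of roughly equal size is quantile binning. Below is a minimal sketch using only the standard library; the years are made up and the variable names are hypothetical, so the cut points will not match the group boundaries in Fig. 3.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

# Made-up years of professional coding, for illustration only.
years_code_pro = [1, 2, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20]

# Three cut points split the data into four groups of roughly equal size.
cut_points = quantiles(years_code_pro, n=4)

# Group index 0..3 for each respondent, from least to most experienced.
groups = [bisect_right(cut_points, years) for years in years_code_pro]

print(cut_points)
print(Counter(groups))
```

With pandas, `pandas.qcut` performs the same kind of quantile-based binning directly on a column.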

Another interesting and important point to check is the gender distribution in this field. The gender disparity in the developer field has been a long-term concern. Here we see this issue in the data field as well (Fig. 4).

Fig. 4. Gender distribution in data jobs. For simplicity of plotting, all non-binary genders are grouped into “Others”. Image by author.

In terms of salary, there is no substantial difference between men and women (Fig. 5a), but the salaries for women span a wider range (higher variance, Fig. 5b).

Fig. 5. Salaries in data jobs by gender. (a) Bar plot. (b) Violin plot. Image by author.

Finally, let’s see how job satisfaction relates to salary. As you may expect, job satisfaction is not always positively correlated with salary, and that is also what the data shows in Fig. 6a. It also suggests that there is a group of data engineers who are very dissatisfied with their jobs despite being paid very well (see the tall green bar at the far right of Fig. 6b). Of course, many other factors contribute to job satisfaction in the real world. In Part 3, I will apply machine learning to predict job satisfaction and look for more insight.

Fig. 6. Salary vs. Job Satisfaction. Image by author.

Part 2: Change of data jobs, comparing 2020 data to 2019

In this part, let’s look at some trends in the data field. Because our analyses rely on the Stack Overflow survey data, and the 2018 survey form was quite different from the 2019 and 2020 forms, we will compare the 2019 and 2020 data only. We first note that the number of valid survey responses filtered from the raw 2020 data is 53,159, and 15.69% of those respondents have a data-related job. In comparison, the 2019 survey data has 77,420 valid responses, and 16.71% of those respondents have a data-related job.

Fig. 7. Numbers of data roles and distributions, 2019 vs. 2020. Note that due to the way we created the data instances, the total numbers are larger than the numbers of respondents with a data role. Image by author.

First, the data shows that the distribution across the three data roles has changed little (Fig. 7). The counts from 2019 are larger because the 2019 survey had more responses; this does not necessarily mean that the number of data-related jobs decreased in 2020.
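As a quick back-of-the-envelope check, the figures quoted in this post can be combined to estimate how many respondents hold a data role and how many roles each of them selected on average. The sketch assumes the percentages are taken over the valid responses quoted for each year; the variable names are only for illustration.

```python
# Figures quoted in this post: (valid responses, share with a data job, instances).
figures = {
    2019: (77_420, 0.1671, 17_370),
    2020: (53_159, 0.1569, 11_186),
}

summary = {}
for year, (valid, share, instances) in figures.items():
    data_respondents = valid * share  # respondents with at least one data role
    roles_per_respondent = instances / data_respondents
    summary[year] = (round(data_respondents), round(roles_per_respondent, 2))

print(summary)
```

Under that assumption, the average number of data roles per respondent is about the same in both years, which is consistent with the larger 2019 counts coming mainly from the larger number of responses.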

Fig. 8. Comparison of salaries between 2019 and 2020. Image by author.

The change in salary is perhaps the most surprising result I encountered. Salaries decreased in all three data jobs, and the overall decrease is about $16,000 (Fig. 8).

Fig. 9. Comparison of salaries between 2019 and 2020, US data only. Image by author.

To further check this change, I made the same plot with US data only, and a similar decrease in salary is observed (Fig. 9). This time the DS salary shows only a slight decrease, while DA and DE show a more substantial one. It would be interesting to use other data sources to cross-check whether this decrease holds for data jobs in the wider job market, and, if so, what the reason and the implications might be. As a side note, I first thought this decrease in salary might be due to Covid-19 and the resulting market downturn. But the 2020 survey data was actually collected in February 2020, so the pandemic had not yet come into play.

Last, let’s look at job satisfaction, another important factor for understanding the job market. The questionnaire asks about job satisfaction as follows:

How satisfied are you with your current job? (If you work multiple jobs, answer for the one you spend the most hours on.)
o Very dissatisfied (-2)
o Slightly dissatisfied (-1)
o Neither satisfied nor dissatisfied (0)
o Slightly satisfied (1)
o Very satisfied (2)

I converted the satisfaction grades into values ranging from -2 to 2; the average scores are shown in Fig. 10. DS has the highest satisfaction score, followed by DE and DA, in both 2019 and 2020. Moreover, satisfaction decreased from 2019 to 2020 across all three data jobs, moving in the same direction as the salaries.
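The conversion and averaging can be sketched as follows. The mapping follows the scale listed in the questionnaire above; the (role, answer) pairs are made up, so the resulting averages are not the values in Fig. 10.

```python
from statistics import mean

# Scores follow the -2..2 scale listed with the survey question.
SCORE = {
    "Very dissatisfied": -2,
    "Slightly dissatisfied": -1,
    "Neither satisfied nor dissatisfied": 0,
    "Slightly satisfied": 1,
    "Very satisfied": 2,
}

# Made-up (role, answer) pairs, for illustration only.
answers = [
    ("DS", "Very satisfied"), ("DS", "Slightly satisfied"),
    ("DA", "Slightly dissatisfied"), ("DA", "Very satisfied"),
    ("DE", "Neither satisfied nor dissatisfied"), ("DE", "Slightly satisfied"),
]

# Collect the numeric scores per data role, then average them.
scores_by_role = {}
for role, answer in answers:
    scores_by_role.setdefault(role, []).append(SCORE[answer])

avg_score = {role: mean(scores) for role, scores in scores_by_role.items()}
print(avg_score)
```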

Fig. 10. Comparison of job satisfaction between 2019 and 2020. Image by author.

Part 3: Job Satisfaction Prediction

In the last part of this post, I build a machine learning model to predict job satisfaction. As we have seen, there are five possible answers to the job satisfaction question, so the prediction is a multi-class classification problem.

To avoid inconsistencies in the data distribution between years, only the 2020 survey data is used for modeling. Also, note that the data is imbalanced: “Very satisfied” and “Slightly satisfied” together account for more than 60% of the data, while “Very dissatisfied” accounts for less than 10%.

Some important data processing steps include: 1) data cleaning; 2) missing data imputation; 3) categorical data encoding; 4) feature selection/engineering. I will skip these technical details here; you can refer to the notebook and code in the GitHub repository for more. For the modeling, I used the XGBoost algorithm with the one-vs-rest approach for this multi-class classification problem.
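In a one-vs-rest setup, the five-class problem is split into five binary problems, one per class, and the final prediction is the class whose binary model gives the highest score. The sketch below shows only this decomposition and decision rule in plain Python; the binary learners themselves (XGBoost in this post) are left out, and the labels, scores, and helper names are made up for illustration.

```python
CLASSES = [
    "Very dissatisfied",
    "Slightly dissatisfied",
    "Neither satisfied nor dissatisfied",
    "Slightly satisfied",
    "Very satisfied",
]


def one_vs_rest_targets(labels):
    """Build one binary target vector per class: 1 for that class, 0 for the rest."""
    return {c: [int(y == c) for y in labels] for c in CLASSES}


def predict(scores_per_class):
    """Pick the class whose binary model gives the highest score."""
    return max(scores_per_class, key=scores_per_class.get)


# Made-up labels, for illustration only.
labels = ["Very satisfied", "Slightly satisfied", "Very dissatisfied", "Very satisfied"]
targets = one_vs_rest_targets(labels)
print(targets["Very satisfied"])

# Hypothetical scores from the five binary models for a single respondent.
scores = {
    "Very dissatisfied": 0.05,
    "Slightly dissatisfied": 0.10,
    "Neither satisfied nor dissatisfied": 0.15,
    "Slightly satisfied": 0.45,
    "Very satisfied": 0.40,
}
print(predict(scores))
```

In scikit-learn, `OneVsRestClassifier` wraps this pattern around any binary estimator.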

3.1. Exploratory Data Analysis (EDA)

Before presenting the modeling result, I will show some EDA, as is routine in data science projects, to give you some quick insight into job satisfaction. As shown in Part 1, salary and job satisfaction do not necessarily correlate with each other (Fig. 6). What about other important factors that could affect job satisfaction?

There is certainly some pattern between company size and job satisfaction, as shown in Fig. 11, but it is difficult to draw a general conclusion. One plausible observation is that in small companies the percentage of “Very satisfied” tends to be higher than that of “Slightly satisfied”.

Fig. 11. Job satisfaction distributions among companies of different sizes. Image by author.

Overtime is another factor that one would generally expect to relate to job satisfaction. However, the pattern, if there is one, is vague (Fig. 12). Indeed, even in the group that never works overtime, “Very satisfied” is not the most common response.

Fig. 12. Job satisfaction distributions among groups with different overtime. Image by author.

Last, let’s check the job satisfaction distributions across countries. One observation stands out: the percentage of “Very satisfied” is relatively low in developing countries such as India and Brazil. Nevertheless, it would be hasty to conclude that this is specific to the data field. The lower percentage of “Very satisfied” in developing countries may instead be a general phenomenon across fields.

Fig. 13. Job satisfaction distributions across different countries. Note that, for simplicity, we don’t use Country as a feature in the modeling step: Country is a high-cardinality feature, and it would call for more advanced encoding techniques than simple one-hot encoding. Image by author.

3.2. Multi-class classification with XGBoost

Here we jump straight to the model performance by checking the confusion matrix. There are three observations on our model:

The model overfits, even though several techniques were applied in the modeling step to combat overfitting;
The model predicts the minority classes correctly at a reasonable rate, although many mistakes come from predicting instances as “Very satisfied” or “Slightly satisfied”;
The model confuses “Very satisfied” with “Slightly satisfied” (Fig. 14, right).

Fig. 14. Model performance by confusion matrix. Left: confusion matrix for training data. Right: confusion matrix for test data. Image by author.

Certainly, there is room to improve the model, but its performance is reasonably good compared with a Naive Bayes baseline, which I haven’t shown here. Moreover, the dataset itself may be challenging, making it hard to learn a general pattern (hence the overfitting) in any case.

3.3. Model explainability and insight

In order to understand the model and get more insight, let’s check the top features calculated during the tree-building process (Fig. 16).

The use of one-hot encoding makes interpretation somewhat inconvenient, because each original feature is now broken into multiple binary features. The two most important features both come from “NEWOnboardGood”: “NEWOnboardGood_1” means a “Yes” response to the question “Do you think your company has a good onboarding process?” and “NEWOnboardGood_2” means “No”. It is straightforward to see that a good onboarding process is usually associated with high job satisfaction.
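To make the naming concrete, here is a minimal sketch of how one-hot encoding turns a single coded column into several binary columns. The rows are made up and the `one_hot` helper is hypothetical; the codes 1 = “Yes” and 2 = “No” follow the description above.

```python
def one_hot(rows, column):
    """Replace `column` with one binary column per distinct value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: val for key, val in row.items() if key != column}
        for value in values:
            # e.g. NEWOnboardGood_1 is 1 when the original code was 1, else 0.
            new_row[f"{column}_{value}"] = int(row[column] == value)
        encoded.append(new_row)
    return encoded


# Made-up rows; codes 1 = "Yes" and 2 = "No", as described above.
rows = [{"NEWOnboardGood": 1}, {"NEWOnboardGood": 2}, {"NEWOnboardGood": 1}]
encoded_rows = one_hot(rows, "NEWOnboardGood")
print(encoded_rows)
```

Because each binary column is scored separately by the model, the importance of the original question is spread over “NEWOnboardGood_1” and “NEWOnboardGood_2”.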

We also see that “UndergradMajor” is an important feature, following “NEWOnboardGood”. Indeed, what the model has learned from the data is:

If respondents have an undergraduate major in a natural science (such as biology, chemistry, physics, etc.), they are more likely to be “Very satisfied”;
The same holds for respondents who have an undergraduate major in another engineering discipline (such as civil, electrical, mechanical, etc.).

The feature “Bash/Shell/PowerShell” is interesting: respondents who use “Bash/Shell/PowerShell” as one of their programming languages are more likely to be “Very satisfied” than those who don’t. At least that’s what the model learned from the data.

Fig. 16. Feature importance calculated by gain during the tree-building process of XGBoost. Image by author.

We can use SHAP, a very versatile tool for model explainability, to better understand the model behavior. You are encouraged to explore more in the notebook. Here I will show you the top features based on SHAP values and also post two examples of dependence plots for Salary and Company Size.

First, the top features given by SHAP values (Fig. 17) are different from those calculated from information gain during node splitting (Fig. 16). The SHAP approach is considered more consistent for evaluating feature importance (see the SHAP paper for reference). Here, the top features are “Salary”, “NEWOnboardGood_1”, “Age”, “YearsCode”, and “OrgSize”, which in my opinion agree better with common sense.
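SHAP attributes each prediction to the input features using Shapley values. The brute-force sketch below only illustrates that definition on a made-up two-feature model; it is not how the SHAP library computes values for tree models, which relies on much faster algorithms. The toy model, the feature values, and the single-baseline simplification are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance by enumerating feature subsets.

    Features outside a subset are set to their baseline value, a
    simplification of the background-data averaging used by SHAP.
    """
    names = list(x)
    n = len(names)

    def value(subset):
        inputs = {f: (x[f] if f in subset else baseline[f]) for f in names}
        return model(inputs)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of f when added to the subset.
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_model(inp):
    """Hypothetical score with an interaction term, for illustration only."""
    return 2.0 * inp["Salary"] + 1.0 * inp["OnboardGood"] + 0.5 * inp["Salary"] * inp["OnboardGood"]


x = {"Salary": 1.0, "OnboardGood": 1.0}
baseline = {"Salary": 0.0, "OnboardGood": 0.0}
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions add up to the difference between the model output for the instance and for the baseline, which is the additivity property that makes Shapley-based importances easier to compare across features than gain.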

Fig. 17. Top features from SHAP values, in contrast to those from information gain in Fig. 16. From Class 0 to 4: ‘Neither satisfied nor dissatisfied’, ‘Slightly dissatisfied’, ‘Slightly satisfied’, ‘Very dissatisfied’, ‘Very satisfied’. Image by author.

Fig. 18. SHAP dependence plot for Salary. Image by author.

Fig. 19. SHAP dependence plot for Company Size. The mapping of OrgSize is: {1: ‘Just me — I am a freelancer, sole proprietor, etc.’, 2: ‘2 to 9 employees’, 3: ’10 to 19 employees’, 4: ’20 to 99 employees’, 5: ‘100 to 499 employees’, 6: ‘500 to 999 employees’, 7: ‘1,000 to 4,999 employees’, 8: ‘5,000 to 9,999 employees’, 9: ‘10,000 or more employees’, -1: ‘Missing’}. Image by author.

The SHAP dependence plot in Fig. 18 shows how Salary contributes to predicting whether an instance is “Very satisfied”. Overall, as the increasing trend indicates, a higher salary pushes the prediction toward “Very satisfied”.

For Company Size, smaller sizes tend to contribute positively to predicting “Very satisfied” (Fig. 19).

Conclusion

In this post, we used the Stack Overflow survey data to do some exploratory analyses and gain some insight into the data field.

We analyzed the salary distribution of different data roles, also taking into account other factors such as country, years of experience, and gender.
We compared the 2019 survey data with the 2020 data. It was quite surprising to see that both salaries and job satisfaction decreased for data-related jobs.
Finally, we built an XGBoost multi-class classification model to predict job satisfaction, and extracted some insight by checking feature importance and dependence relationships.

The analyses and modeling in this post are mainly for practice, and no solid conclusions can be drawn without further investigation. Most importantly, the Stack Overflow survey data itself may be biased and does not necessarily represent the real-world population. I hope you had some fun reading through this post. For more technical details, please refer to the GitHub repository.