
Data science has been predicted to be the most promising job of the 21st century. As such, it is expected to attract many professionals looking to change careers. For them, it is important to know upfront what the key requirements for working as a data scientist are, and what can help them succeed in the role. In this article we tackle some of those points. More specifically, we try to understand the main technological and educational characteristics of data scientists and what distinguishes them from other kinds of developers. We also try to measure the importance of each of those features by how it affects data scientists’ compensation levels.
To answer those questions, we use data from the Stack Overflow Survey, one of the largest worldwide surveys of people who code. It is an established, traditional initiative in the field that has been running for almost a decade. In 2020, the survey had nearly 65,000 respondents.
All the data and code used for this article are available in my GitHub repository.
Perspectives and opportunities
Although data science is expected to be a promising profession, the share of data scientists in our sample of developers fell by 1.87 p.p. from 2017 to 2020. The difference is small, but it was found to be statistically significant. It could be argued that our sample underrepresents data scientists. Even so, if data science really were a promising job, we would expect to see our estimates rise.
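One common way to check whether a change in share like this is statistically significant is a two-proportion z-test. The article does not say which test was used, so the sketch below is only an illustration under that assumption. The function name and the counts are hypothetical and are not the survey's figures.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two independent proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both shares are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Hypothetical counts for illustration only, NOT the survey's actual numbers.
diff, z, p_value = two_proportion_z_test(1000, 10000, 820, 10000)
print(f"difference = {diff:.4f}, z = {z:.2f}, p-value = {p_value:.6f}")
```

With samples in the tens of thousands, even a difference of one or two percentage points can give a very small p-value. That is consistent with a small change still being statistically significant.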

Data science could still be a promising job if what is keeping the numbers from rising is a shortage of professionals. In that case, employers would still be increasingly interested in hiring data scientists. Unfortunately, since unemployment levels are higher in the data scientists’ group, that cannot be confirmed either.

What distinguishes a data scientist?
To better understand data scientists, we use only the data from the 2020 survey. Among the respondents who write code professionally (77.54% of our initial sample), only 7.83% are data scientists. Most respondents regarded themselves as a more traditional type of developer: full-stack (56.22%), back-end (56.21%) or front-end (37.19%). Since respondents could select more than one option, the relative frequencies of all options sum to more than one.
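The effect of multi-select answers on the relative frequencies can be sketched as follows. This assumes each respondent's choices are stored in a single field separated by semicolons. That is an assumption about the raw file, and the function name and the sample answers are made up for illustration.

```python
from collections import Counter


def option_shares(answers, separator=";"):
    """Share of respondents selecting each option in a multi-select question."""
    counts = Counter()
    respondents = 0
    for answer in answers:
        if not answer:  # skip respondents who left the question blank
            continue
        respondents += 1
        # A set ensures each option is counted at most once per respondent.
        counts.update({option.strip() for option in answer.split(separator)})
    return {option: count / respondents for option, count in counts.items()}


# Made-up answers for illustration only (labels and separator are assumptions).
sample = [
    "Developer, full-stack;Developer, back-end",
    "Developer, back-end",
    "Data scientist or machine learning specialist;Developer, back-end",
    "Developer, front-end;Developer, full-stack",
]
shares = option_shares(sample)
total = sum(shares.values())  # exceeds 1 because answers overlap
print(shares, total)
```

Each share is computed over respondents rather than over selections, which is why the shares can add up to more than one.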

Comparing the educational levels of data scientists and other developers, we find that data scientists generally have a higher educational level. While more than half of the other developers stop at a bachelor’s degree (50.96%), 41.11% of data scientists go further and earn a master’s degree, and 14.68% go even further with a doctoral degree.

Most data scientists and other developers declared computer science, computer engineering or software engineering as their undergraduate major. Although it is the most declared major in both groups, its share is 13.46 p.p. higher among other developers. The lower share in the data scientists’ group is offset by higher relative frequencies in mathematics or statistics (12.35% against 2.81%), natural science (10.73% against 3.77%) and other engineering disciplines (11.93% against 9.01%).

The key point here is that the “science” in data science matters. Data scientists’ approach resembles the scientific method: we start with an initial hypothesis and then do research to verify our assumptions. Data scientists usually study more scientific fields and go further in their academic training. Those aspects might be important for developing the mathematics and research skills needed to work as a data scientist.
Comparing the top 10 most used technologies in 2020, by type of technology, for both groups of developers, we identified some significant differences in languages and miscellaneous technologies.

While Python is the most popular language among data scientists, it only reached fifth place among the other developers. Another important distinction concerns the R programming language. While it ranked eighth in the data scientists’ distribution, it was not even in the top 10 for the other developers.
Miscellaneous technologies form the most specific technological layer, and that is where most of the differences can be observed. First, the distribution looks much more concentrated in the data scientists’ group, reflecting a certain uniformity and specialization in technology usage. In the top 10 miscellaneous technologies, we can identify two important kinds: technologies for machine learning (TensorFlow, Keras and Torch/PyTorch) and technologies for big data (Apache Spark and Hadoop). Neither kind appears in the top 10 of other developers, which underscores how specific data science work is.
How do technological and educational characteristics affect compensation?
To understand the relationship between the technological and educational features described above and compensation levels, we use a multiple linear regression model (ordinary least squares, OLS). By fitting a model of all those features against the log of compensation, we can measure, in percentage terms, how the presence of a specific feature affects compensation levels.
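To illustrate why a log-transformed target gives effects in percentage terms, here is a minimal sketch for a single 0/1 feature. It is a simplification: the article's model fits many features jointly. With only an intercept and one dummy variable, the OLS coefficient equals the difference in mean log compensation between the two groups. The function names and salaries are hypothetical. The article also does not state whether its percentages come from the conversion exp(β) − 1 used here or from reading the raw coefficient directly.

```python
import math


def dummy_coefficient(log_pay_with, log_pay_without):
    """OLS slope for one 0/1 feature plus intercept: difference in mean log pay."""
    mean_with = sum(log_pay_with) / len(log_pay_with)
    mean_without = sum(log_pay_without) / len(log_pay_without)
    return mean_with - mean_without


def percent_effect(coefficient):
    """Percentage change in pay implied by a dummy coefficient in a log-linear model."""
    return (math.exp(coefficient) - 1) * 100


# Made-up salaries for illustration only: the first group earns exactly 20% more.
with_feature = [math.log(x) for x in (60000, 72000, 90000)]
without_feature = [math.log(x) for x in (50000, 60000, 75000)]

beta = dummy_coefficient(with_feature, without_feature)
effect = percent_effect(beta)
print(f"beta = {beta:.4f}, implied effect = {effect:.2f}%")
```

For small coefficients, 100 × β is close to the exact conversion, but the two diverge as the coefficient grows. The distinction therefore matters for effects as large as the ones discussed in this article.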
Some of the features used to predict compensation were found not to be significant (their coefficients were statistically indistinguishable from zero). Those features were dropped from our analysis, and we focus only on those that are significant in explaining differences in compensation.

From the ordered bar graph above, we can see that a higher educational level is what contributes most to professionals’ compensation. By our estimates, having a doctoral degree is expected to increase compensation by 71.37%, while a professional degree and a master’s degree increase it by 52.60% and 32.31%, respectively. Surprisingly, declaring an undergraduate major in a more technical area is negatively associated with compensation. Those who majored in information systems, information technology or system administration earn 47.62% less, and those who majored in computer science, computer engineering or software engineering earn 47.02% less.
Of the ten most popular languages among data scientists, only HTML/CSS and Bash/Shell/PowerShell were statistically significant: the command-line language increases compensation by 25.35%, and HTML/CSS decreases it by 15.87%. Other, less popular languages had a positive impact on compensation, with Scala increasing it by 26.18%, Perl by 27.10% and Objective-C by 33.06%. Even though we did not find Python to be positively related to compensation, that does not mean the language is unimportant for data scientists. It might be that programming in Python is becoming a requirement for working as a data scientist. As such, it does not contribute to professional differentiation and therefore does not improve expected compensation.
For platforms, we identified three deployment technologies that matter in determining compensation levels. Heroku is the only one with a negative impact (22%). Docker (13.00%) and AWS (24.17%) both had a positive relationship with compensation. Another surprising result is that the most popular miscellaneous technologies are not important for compensation. Only Keras was significant, but not in a positive way: having worked with this technology in 2020 was associated with a 20.30% reduction in compensation levels.
Conclusion
We could not confirm that data science is still a promising profession. This might be because we have a biased sample that does not correctly represent the proportion of data scientists in the population of developers. Or, contrary to our initial hypothesis, data science is no longer so promising and the boom in the number of data scientists has already occurred.
What we could confirm is that data scientists differ significantly from other kinds of developers. Python is their main language, used by most professionals. There are also other technologies specific to the profession, such as TensorFlow, Torch/PyTorch and Spark. But even though those tools are popular, they are not important for earning above-average compensation. It could be that they are effectively a requirement for the profession and so provide no competitive advantage.
Getting a higher educational level is what really matters. Most data scientists have a master’s or doctoral degree. A higher educational level is also what matters most for receiving above-average compensation. Even though it is not a requirement, studying more is what might pave the path to success in the data science profession.