Learning about Data Science using the Stack Overflow survey
![Ada Lovelace following CRISP-DM [Public Domain]](https://towardsdatascience.com/wp-content/uploads/2020/09/1DGeYeKR9zll2fhViFemm5g.jpeg)
Introduction
As someone looking to bring more Data Science to my work, I felt it would be interesting to look at the characteristics of those already in the field – and see if these differed from other developers.
The Stack Overflow Developer Survey is a great resource for detailed information on all kinds of developers, including data scientists. My analysis is based on the 2019 survey results, which had 88,883 responses in total.
Who is the typical coder?
I wanted to look at who the typical coder is. What are their attributes? Can we generalize them in any way? Further, can we identify a typical data scientist?
First, I looked at the most common questions and answers. The single most common answer to a question was "Yes" to "Do you code as a hobby?". Around 80% of respondents answered in this way.
![[Image by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/1GRH8VKT5CiY2BKor8S9_ew.png)
Not all questions were open to such simple analysis as many allowed multiple answers. For example, respondents were given a long list of programming languages to select from, plus a box to enter others. Once I de-duplicated this information I found that JavaScript came out as the most popular language (almost 60,000 users).
![[Image by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/1FzRUaMhxxMN7r9DEx3siaA.png)
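A minimal sketch of how a multi-answer field like this can be counted, assuming the selections are stored as one semicolon-separated string per respondent. The column name `LanguageWorkedWith`, the helper `count_languages` and the handful of rows are illustrative assumptions, not the survey data or the code behind the chart above.

```python
import pandas as pd

# Made-up rows standing in for survey responses (not real survey data).
responses = pd.DataFrame({
    "LanguageWorkedWith": [
        "JavaScript;Python;SQL",
        "JavaScript;HTML/CSS",
        "Python;Python",   # the same option listed twice in one response
        None,              # respondent skipped the question
    ]
})

def count_languages(df, column="LanguageWorkedWith", sep=";"):
    """Count how many respondents selected each option at least once."""
    exploded = df[column].dropna().str.split(sep).explode().str.strip()
    # One row per (respondent, option) pair, with repeats removed.
    pairs = (
        exploded.rename("option")
        .rename_axis("respondent")
        .reset_index()
        .drop_duplicates()
    )
    return pairs["option"].value_counts()

language_counts = count_languages(responses)
print(language_counts)
```

De-duplicating on the (respondent, option) pair means a language is counted once per person, however many times it appears in their answer.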
Whilst these single response statistics are interesting, I felt that the best way to define the typical coder was to look at all responses and find the most common combination of all answers.
One note: whilst most answers were categorical or grouped values, four of the questions allowed continuous numerical values. To process these, I created four quartile buckets (i.e. each covering 25% of the responses).
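Quartile bucketing of this kind can be sketched with `pandas.qcut`. The column name `WorkWeekHrs`, the helper `to_quartile_buckets` and the values are assumptions made up for the example.

```python
import pandas as pd

# Made-up numeric answers (e.g. hours worked per week); not survey data.
hours = pd.Series([20, 35, 38, 40, 40, 42, 45, 60], name="WorkWeekHrs")

def to_quartile_buckets(values):
    """Label each value Q1..Q4 by quartile; missing values stay missing."""
    # Ranking first breaks ties, so qcut always sees four distinct bin edges.
    ranks = values.rank(method="first")
    return pd.qcut(ranks, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

buckets = to_quartile_buckets(hours)
print(buckets.value_counts().sort_index())
```

Ranking before bucketing can split tied values across neighbouring buckets. That is a choice made for this sketch so that each bucket holds the same number of responses.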
It became clear that this method divided the responses into lots of very small slices. I found that, by this definition, the most common set of answers only occurred 15 times… and that for those 15 people only four questions were answered.
![The typical coder's responses; other answers not shown were all "(No response)" [Image by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/1yeLFDvhgR_WO1NI8Sw2cyw.png)
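Finding the most common combination of answers amounts to counting identical rows. A minimal sketch, assuming every column has already been made categorical; the column names, the helper `most_common_combination` and the rows are made up for illustration.

```python
import pandas as pd

# Made-up categorical answers; "(No response)" marks a skipped question.
answers = pd.DataFrame({
    "Hobbyist": ["Yes", "Yes", "No", "Yes", "Yes"],
    "OpenSourcer": ["Never", "Never", "Often", "Never", None],
    "Employment": ["Full-time", "Full-time", "Student", "Full-time", "Full-time"],
})

def most_common_combination(df):
    """Return the most frequent full row of answers and how often it occurs."""
    # value_counts on a DataFrame counts identical rows, most frequent first.
    counts = df.fillna("(No response)").value_counts()
    top_row = dict(zip(counts.index.names, counts.index[0]))
    return top_row, int(counts.iloc[0])

combo, n = most_common_combination(answers)
print(combo, n)
```

With many questions, almost every row is distinct, which is why the counts in the real data end up so small.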
Perhaps all we can say so far is that developers are unique individuals and aren’t all the same!
Using the same methodology but just focusing on data scientists, no clear "winner" emerged as all combinations of answers only occurred once. Data scientists are even more unique! 🙂
What makes data scientists stand out?
Moving on from typical responses, I next wanted to see whether we could identify attributes that clearly make data scientists different from other developers. Can we predict which developers are data scientists based on those attributes?
I wanted to model this as a machine learning classification problem. First, I needed a target variable to train the model on. The "DevType" question seemed to hold the key to this, but first I needed to extract the distinct roles from this multi-answer field. "Data scientist or machine learning specialist" was the value that most accurately described the target group. By this method, 6,460 of the respondents are data scientists.
![[Image by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/1Wqs35LoIiRcIyBuewPdwVg.png)
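Deriving the target variable can be sketched as below, assuming "DevType" holds semicolon-separated roles. The role label is the one quoted later in this article; the helper name `is_data_scientist`, the sample rows and the decision to treat a missing answer as "not a data scientist" are assumptions for this example.

```python
import pandas as pd

TARGET_ROLE = "Data scientist or machine learning specialist"

# Made-up rows; not survey data.
devs = pd.DataFrame({
    "DevType": [
        "Developer, back-end;Data scientist or machine learning specialist",
        "Developer, front-end",
        "Data scientist or machine learning specialist",
        None,
    ]
})

def is_data_scientist(dev_type, role=TARGET_ROLE, sep=";"):
    """True when the target role is one of the respondent's selected roles."""
    if pd.isna(dev_type):
        return False  # assumption: no answer counts as "not a data scientist"
    return role in [part.strip() for part in dev_type.split(sep)]

devs["is_data_scientist"] = devs["DevType"].apply(is_data_scientist)
print(devs["is_data_scientist"].sum(), "data scientists")
```

Splitting on the separator and comparing whole roles avoids accidental matches on partial strings.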
The next step was preparing the data, which involved:
- Removing irrelevant fields which don’t aid prediction
- Converting categorical values into numeric
- Removing rows with missing values
- Scaling features, so that they all carried the same weight in our model
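The preparation steps above can be sketched with pandas and scikit-learn. The column names, the helper `prepare` and the data are hypothetical, and one-hot encoding plus `StandardScaler` are one reasonable way to do the conversion and scaling, not necessarily the one used in the analysis.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows; not survey data.
raw = pd.DataFrame({
    "Respondent": [1, 2, 3, 4, 5],              # an ID, irrelevant for prediction
    "Hobbyist": ["Yes", "No", "Yes", None, "Yes"],
    "YearsCode": [5, 12, 3, 8, 20],
    "is_data_scientist": [0, 1, 0, 0, 1],       # target variable
})

def prepare(df, target="is_data_scientist", drop_cols=("Respondent",)):
    """Drop unused fields and incomplete rows, encode categories, scale features."""
    df = df.drop(columns=list(drop_cols))       # remove irrelevant fields
    df = df.dropna()                            # remove rows with missing values
    y = df[target].astype(int)
    # One-hot encode the categorical columns into numeric indicator columns.
    X = pd.get_dummies(df.drop(columns=[target]), dtype=float)
    # Scale every feature to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(X)
    return pd.DataFrame(scaled, columns=X.columns, index=X.index), y

X, y = prepare(raw)
print(X.round(2))
```

In this sketch the incomplete rows are dropped before encoding, because `get_dummies` would otherwise turn a missing category into a row of zeros rather than a missing value.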
After that it was time to train and test my model. I selected three classification algorithms to evaluate: Naive Bayes, AdaBoost and Gradient Boosting. Comparing the three produced some interesting results:
- Naive Bayes generally performed poorly, except in eliminating false negatives (recall) – fewer than 2% of data scientists were missed
- AdaBoost and Gradient Boosting performed similarly to one another: both were 93% accurate overall, whilst doing a middling job of eliminating false positives (precision) and a poor job of eliminating false negatives (recall)
![[Image by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/1hcOqnQoyv6HOeJMuL66lIA.png)
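A comparison like this can be sketched with scikit-learn. The data below is synthetic (generated with `make_classification`) and only stands in for the prepared survey features, `GaussianNB` is an assumed choice of Naive Bayes variant, and all hyperparameters are library defaults, so the scores it prints will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced data: roughly 8% of rows belong to the positive class.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.92, 0.08], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

def evaluate(model):
    """Fit on the training split and score predictions on the test split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

results = {name: evaluate(model) for name, model in models.items()}
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

With a rare positive class, accuracy alone can look high even when few positives are found, which is why precision, recall and F1 are reported alongside it.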
On balance, I feel that the AdaBoost algorithm gave the most encouraging results. The overall accuracy was good and it had the least-worst balance of precision and recall (represented by the F1 score). But there’s certainly lots of room for improvement in the model!
Is data science inclusive?
My final question was whether data science is an inclusive field. Or rather, more specifically, whether it shows greater diversity in its practitioners than other coding disciplines. Are minority groups better represented in data science?
I decided to focus on three main representations of diversity – ethnicity, sexuality and gender – and excluded any respondents who hadn’t answered these questions. We can see quite quickly that coding in general is dominated by white heterosexual men…
![Count of Developers by Ethnicity [Image by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/108iPiTLNwZoxSOy_VibIMA.png)
![Count of Developers by Gender [Image by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/12VOddAzbYe8IEel3kg6Vzw.png)
![Count of Developers by Sexuality [Image by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/1HfWLH1G5Mh6MEBcvTDKDGw.png)
As with the previous question, I used the "Data scientist or machine learning specialist" role to divide the data. I created diversity groups to record each combination of gender, sexuality and ethnicity. I also calculated a diversity score by counting, for each person, how many of the three dimensions placed them outside the predominant group.
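The group and score definitions can be sketched as follows. The labels and rows are made up, the helper `add_diversity_columns` is hypothetical, and the sketch assumes a single value per dimension and takes the most common value in each dimension as the predominant group.

```python
import pandas as pd

# Made-up respondents with illustrative labels; not survey data.
people = pd.DataFrame({
    "Gender": ["Man", "Woman", "Man", "Man"],
    "Sexuality": ["Heterosexual", "Heterosexual", "Gay or Lesbian", "Heterosexual"],
    "Ethnicity": ["White", "South Asian", "White", "White"],
})
DIMENSIONS = ["Gender", "Sexuality", "Ethnicity"]

def add_diversity_columns(df, dims=DIMENSIONS):
    """Add a combined diversity group label and a 0-3 diversity score."""
    out = df.copy()
    # Group: one label per distinct combination of the three dimensions.
    out["diversity_group"] = out[dims].agg(" | ".join, axis=1)
    # Predominant value per dimension: the most common answer.
    predominant = {d: out[d].mode().iloc[0] for d in dims}
    # Score: number of dimensions where the person is outside that value.
    out["diversity_score"] = sum(
        (out[d] != predominant[d]).astype(int) for d in dims
    )
    return out

scored = add_diversity_columns(people)
print(scored[["diversity_group", "diversity_score"]])
```

A score of 0 means the person is in the predominant group on all three dimensions, and 3 means they are outside it on every one.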
Looking at the spread of diversity groups, we can see that 218 groups are represented within data scientists, as opposed to 550 within other developers. When you factor in that the data scientist group is much smaller, this translates, on average, to two data scientists being over four times more likely to come from different diversity groups (1 in 24 vs 1 in 109).
Now looking at the diversity scores, we find that for each point increase in score the proportion of data scientists increases. For example, 44% of data scientists are NOT white heterosexual men, as opposed to 40% of all other developers.
![[Images by author]](https://towardsdatascience.com/wp-content/uploads/2020/09/1HjCvvwBl5Ae5Ad6vQNaaYg.png)
Conclusion
- Coders actually aren’t all identical. No set of answers appeared very often at all. In fact, if we just look at data scientists we find that no set of answers appears more than once!
- Data scientists aren’t that different from other developers. Or at least not enough that we can confidently predict which developers are data scientists based on their other responses (yet!).
- Data science is more diverse than other development roles. Not by a lot, but there is a measurable difference.
Full code and analysis can be found here: https://github.com/deacona/stackoverflow2019
Thanks for reading!