R vs Python: Comparing Data Science Job Postings Seeking R or Python Specialists

What are employers looking for when writing language-targeted data science job ads?

Alex daSilva
Towards Data Science


There are a million posts online titled something like “R vs Python: the battle for Data Science”, pitting the two languages against each other in an attempt to crown “the one language that rules them all for data science”. Both are powerful tools, each with its own strengths (e.g., advanced statistical libraries for R, production code for Python), and in general it makes a lot more sense to think about how they complement each other than to try to choose one over the other.

That being said, not all data science job ads list both R and Python among their preferred skills. That makes for a potentially interesting comparison: what are companies looking for when hiring a data scientist fluent in R but not Python? And vice versa? To answer this question, I scraped job descriptions, titles, and locations from a popular job posting website with two different search queries: one containing the terms data science AND R but NOT Python, and the other containing the terms data science AND Python but NOT R. The R-specific search returned about 30% fewer hits than the Python-specific search. Not surprisingly, a search containing both languages returned far more hits than either of the language-specific searches.
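The same include/exclude logic can also be applied directly to scraped description text. The following is a minimal Python sketch, not the search the job site actually ran: the function name `classify` and the regular expressions are assumptions made for illustration. The main wrinkle is that R is a single letter, so it is matched only as a standalone upper-case token to avoid false hits such as "R&D".

```python
import re
from typing import Optional

# Hypothetical patterns; the original queries were run on the job site itself.
DATA_SCIENCE = re.compile(r"\bdata\s+scien(?:ce|tist)s?\b", re.IGNORECASE)
PYTHON = re.compile(r"\bpython\b", re.IGNORECASE)
# Case-sensitive on purpose: a lone lower-case "r" is ignored, and "R" must not
# touch word characters, "&", or "-" (so "R&D" and "R-squared" do not count).
R_LANG = re.compile(r"(?<![\w&.-])R(?![\w&-])")


def classify(description: str) -> Optional[str]:
    """Return 'r_only', 'python_only', or None for one job description."""
    if not DATA_SCIENCE.search(description):
        return None
    has_r = bool(R_LANG.search(description))
    has_python = bool(PYTHON.search(description))
    if has_r and not has_python:
        return "r_only"
    if has_python and not has_r:
        return "python_only"
    return None  # mentions both languages, or neither
```

Postings that mention both languages fall through to `None`, mirroring the decision to compare only the language-specific ads.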

Geographical Preferences for R vs Python

As a first step, I looked at the breakdown of postings by city for the 10 most common cities in both the R and Python job searches. When I collected the data, the largest number of postings for both languages was in New York and Boston (though I scraped postings again 2 weeks later and Silicon Valley took over for Python jobs while New York and Boston remained constant for R jobs). After the first two cities, things begin to diverge. There seems to be a higher demand for R-fluent data scientists in Washington, D.C. and a higher demand for Python-fluent data scientists in San Francisco. I’m guessing the demand for R data scientists in DC is driven by companies looking for political data scientists, many of whom probably have PhDs in political science, an area in which R has strong roots.

A surprising number of job postings fell on the East Coast

Data Science Specific Terms by Posting

A quick way to get a sense of how the postings for these roles differ is to simply count how often data science tools and techniques occur in each set of job postings. That is, what tools and techniques are mentioned alongside R in the R data science jobs, and alongside Python in the Python data science jobs? The word clouds below show how often these terms occur.

By seemingly shouting “research” or “machine learning” at you, the word clouds begin to paint a picture of some differences in the ads. In the R data science job postings, the most common term is clearly “research”, followed by terms such as “SQL” and “statistics”. For the Python data science job postings, “machine learning” appears the most, followed by “SQL”, “research”, and tools for working with big data such as AWS and Spark.

The larger the term, the more frequently it occurred in the postings
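The counting step behind a word cloud like this can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used for the figures: the `TERMS` list and the `count_terms` helper are hypothetical, and the article does not publish the term list it actually used.

```python
import re
from collections import Counter

# Hypothetical term list, chosen only for illustration.
TERMS = ["machine learning", "research", "statistics", "sql", "spark", "aws"]


def count_terms(postings, terms=TERMS):
    """Count how many times each term occurs across all postings."""
    counts = Counter()
    for text in postings:
        lowered = text.lower()
        for term in terms:
            # Match whole terms only, so "sql" does not match inside "mysql".
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            counts[term] += len(re.findall(pattern, lowered))
    return counts
```

A frequency table like the one returned here is the kind of input a word cloud is drawn from, with each term sized by its count.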

Topic Modeling on Job Descriptions

We can look deeper at the content of the job descriptions using topic modeling. After some text cleaning, but before model fitting, it’s a good idea to get a feel for how many topics might be appropriate to model, rather than just choosing an arbitrary number. For help with that, we can turn to the “ldatuning” package in R. After being fed a document-term matrix (DTM), ldatuning returns a few different metrics for assessing the number of topics to move forward with. The plot below represents the information from the R job descriptions; it looks like around 25 topics might be a reasonable choice. The results from the Python descriptions were almost identical, so 25 topics were used in the Python models as well.

The y axis represents a normalized metric that’s either to be minimized (top pane) or maximized (bottom pane)
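For readers unfamiliar with the term, a document-term matrix has one row per document and one column per vocabulary term, with each cell holding a count. The sketch below builds one in plain Python purely to show the structure; it is not ldatuning (which is an R package) and not the cleaning pipeline used for this analysis. The `build_dtm` name, the letters-only tokenizer, and the `min_doc_freq` parameter are all assumptions.

```python
import re
from collections import Counter


def build_dtm(documents, min_doc_freq=1):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in document i."""
    # Naive tokenizer (assumption): lower-case runs of letters only.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term in tokens:
            j = index.get(term)
            if j is not None:  # terms below min_doc_freq are dropped
                row[j] += 1
        rows.append(row)
    return vocabulary, rows
```

Real pipelines typically also remove stop words and store the matrix in a sparse format, since most cells are zero.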

Below are the resulting topics, starting with the R jobs. The terms are sorted in descending order with respect to their weighting on each topic. Some of the topics aren’t terribly interesting, as they are related to hiring practices and benefits (e.g., 10, 13, 14, 17, 22) or even to the city of New York (6), though that makes sense as many of the job postings are located there. Multiple topics seem to represent the health and healthcare industries (18, 23). Others depict scientific research, particularly biological and clinical trial research (2, 7, 25) - this helps explain Durham, NC’s place on the list, as it is a hotbed for biotech/pharma research. The business world is also represented, with topic 15 related to business broadly, topic 20 seemingly related to finance, and topic 24 depicting market research. Topic 5 may confirm the hypothesis from above regarding work related to political science, as it contains terms related to policy, government, and surveys. Finally, we come to what I’ll call general data science topics, such as topics 3, 8, 11, 16, and 18, which cover themes like teamwork, analytics broadly, machine learning, statistics, and databases.

Terms are sorted in descending order with respect to their weighting on each topic
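Ranking terms within a topic is a simple sort on the topic's term weights. The sketch below assumes the fitted model exposes one weight per vocabulary term for each topic; the `top_terms` helper and the example numbers are hypothetical and are not taken from the fitted models.

```python
def top_terms(topic_weights, vocabulary, n=10):
    """Return the n terms with the largest weight on a single topic."""
    ranked = sorted(zip(vocabulary, topic_weights),
                    key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Made-up weights for one topic, for illustration only.
example = top_terms([0.1, 0.6, 0.3], ["policy", "survey", "government"], n=2)
```

Here `example` holds the two highest-weighted terms for that made-up topic, in descending order of weight.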

Next, we can move to the topic modeling results from the Python-only descriptions. Ignoring the topics pertaining to hiring practices and benefits, we see some similar themes emerging. For example, topic 14 appears to represent health research, 9 marketing, 11 finance, and 24 business generally, and we have general data science topics such as 4, 20, and 25 emphasizing teamwork, machine learning, and analytics.

Now, on to the differences, and there are some big differences. That served as a segue to mention topics related to big data (ha!). Topic 1 literally contains the term “big” in addition to platforms (Hadoop, Spark) dedicated to processing big data; topic 23 also seems to represent large data in some form. Topic 21 depicts cloud computing; there are two topics devoted to databases (10, 22) and multiple topics related to software engineering (15, 19).

Terms are sorted in descending order with respect to their weighting on each topic

What did we learn?

There are of course a lot of similarities (e.g., SQL - data has to live somewhere) between the postings for R-focused data scientists and Python-fluent data scientists, but there are some key differences. At the highest level, it looks like the postings seeking a data scientist with R experience are looking for an academically trained researcher with a heavy dose of analysis experience. This makes sense, as R is pretty dominant in academia, and it fits well with the many research-themed topics in the R job descriptions. The postings seeking a Python-fluent data scientist seem to be looking for someone with more of a computer science or engineering background who would match the description of a data engineer or machine learning engineer.

At another level, I think the differences we’re seeing in description content have already been articulated in a Quora post from 2014 by Michael Hochster (current director of data science at Stitch Fix), where he explains two different types of data scientists:

Type A Data Scientist: The A is for Analysis. This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren’t taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.

The Type A Data Scientist can code well enough to work with data but is not necessarily an expert. The Type A data scientist may be an expert in experimental design, forecasting, modeling, statistical inference, or other things typically taught in statistics departments. Generally speaking though, the work product of a data scientist is not “p-values and confidence intervals” as academic statistics sometimes seems to suggest (and as it sometimes is for traditional statisticians working in the pharmaceutical industry, for example). At Google, Type A Data Scientists are known variously as Statistician, Quantitative Analyst, Decision Support Engineering Analyst, or Data Scientist, and probably a few more.

Type B Data Scientist: The B is for Building. Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B Data Scientist is mainly interested in using data “in production.” They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results). At Google, a Type B Data Scientist would typically be called a Software Engineer. Type B Data Scientists may use the term Data Scientist to refer just to themselves, and since the definition of the field is very much in flux, they may be right. But I see the term being used most often in the general way I am proposing here.

The distinction between these two types is also well explained and expanded upon in another Medium post by Robert Chang. To me, the Type A data scientist seems to match up fairly well with what we’ve uncovered about postings for an R data scientist, and the Type B data scientist with the ads seeking a Python data scientist. More recently, Airbnb restructured their data science division into 3 branches: Data Scientist-Analytics, Data Scientist-Algorithms, and Data Scientist-Inference. From the data here, it looks like the data scientists described by the Python posts would be a good fit for the Algorithms branch and the data scientists depicted in the R posts for the Inference branch, with either potentially fitting in the Analytics branch.

In closing, R and Python are critical tools for data science, and knowing them both well will undoubtedly get you further than knowing one but not the other. Nonetheless, as we’ve seen, there will be positions and postings that require one language and not the other, and understanding the differences in the content of those postings might help you decide which language to prioritize as your primary working language.


Social Science and Data Science Nerd | PhD Candidate in Psychological and Brain Sciences at Dartmouth College | https://www.linkedin.com/in/alex-w-dasil/