When people ask coding questions on Stack Overflow, which topics are interrelated for R and for Python?
I used two datasets containing all R and Python questions on Stack Overflow up to Oct 16. By looking at the 240K R tags and 1.8M Python tags attached to those questions, we can detect clusters of topics based on how often tags co-occur.
Most R questions revolve around ggplot2, dataframe and shiny

We can see how tags like ggplot2 and dataframe branch out into their subtopics: dataframe is related to csv (data input), pre-processing libraries (reshape2, dplyr, data.table), and conditionals and loops, while ggplot2 is related to chart elements, data visualization, and the interactive shiny dashboard.
To have an even clearer view of the clusters, we can color the communities.

By counting how often each tag appears, we can tell that ggplot2 appeared the most, which suggests that R is used for data visualization more than for any other task.
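The counting step can be sketched in a few lines. This is a minimal Python illustration, not the R code used for the original analysis; the `rows` list is made-up sample data standing in for the real table of question ids and tags.

```python
from collections import Counter

# Hypothetical (question_id, tag) rows; the real dataset has one row per tag per question.
rows = [
    (1, "ggplot2"), (1, "plot"),
    (2, "ggplot2"), (2, "shiny"),
    (3, "dataframe"), (3, "dplyr"),
    (4, "ggplot2"), (4, "dataframe"),
]

# Count how many rows (i.e. questions) carry each tag.
tag_counts = Counter(tag for _, tag in rows)

# Most frequent tags first.
top_tags = tag_counts.most_common(2)
print(top_tags)
```

On real data the same ranking would simply run over every tag row of the R (or Python) question set.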

Python questions mainly revolve around django, pandas and numpy
Next we look into Python questions. Because Python is a general-purpose language, it has more questions than R.

Similarly, we can take a quick look at the major communities of topics.

These topics appeared the most among Python questions: django, pandas, numpy.

Commonalities & Differences in R/Python questions
Next I looked at tags that co-occurred with both R and Python tags, though in different questions.

We can observe shared topics like dataframe, plot, loop and function. Tags specific to packages available in only one language will rarely co-occur with both R and Python tags.
We can also observe tags that appear only with Python or R tags.
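One way to separate shared tags from language-specific ones is with plain set operations. The sketch below is in Python and uses invented sample rows; the helper `cotags` is a hypothetical name introduced only for this illustration and is not taken from the original code.

```python
from collections import defaultdict

# Hypothetical (question_id, tag) rows mixing R and Python questions.
rows = [
    (1, "r"), (1, "dataframe"),
    (2, "python"), (2, "dataframe"),
    (3, "r"), (3, "ggplot2"),
    (4, "python"), (4, "django"),
    (5, "python"), (5, "plot"),
    (6, "r"), (6, "plot"),
]

# Group tags by question.
tags_by_question = defaultdict(set)
for question_id, tag in rows:
    tags_by_question[question_id].add(tag)

def cotags(language):
    """Return every tag that appears on a question also tagged with `language`."""
    found = set()
    for tags in tags_by_question.values():
        if language in tags:
            found |= tags - {language}
    return found

r_cotags = cotags("r")
python_cotags = cotags("python")

shared = r_cotags & python_cotags        # seen with both languages
r_only = r_cotags - python_cotags        # seen only with R
python_only = python_cotags - r_cotags   # seen only with Python
print(sorted(shared), sorted(r_only), sorted(python_only))
```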

Round up
In this analysis we visualised the co-occurrence of tags among R and Python questions respectively, and compared the commonalities and differences between questions on the two languages.
What I learnt today is how to compute and visualise co-occurrence.
One way of getting co-occurrence is to calculate the incidence matrix of the bipartite graph of questions and tags, then turn it into a one-mode adjacency matrix by multiplying the matrix by its transpose. However, this method can be slow. A faster way is to use a self-join:
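library(sqldf)  # sqldf() runs SQL queries against R data frames
# Assumes df has one row per question-tag pair, with columns Id and Tag.
# The self-join pairs up tags on the same question; b.Tag > a.Tag keeps each pair once.
co_occurence <- sqldf("SELECT a.Tag a, b.Tag b, COUNT(*) cnt
                       FROM df a
                       JOIN df b
                       ON b.Id = a.Id AND b.Tag > a.Tag
                       GROUP BY a.Tag, b.Tag")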
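To show that the two approaches give the same counts, here is a small Python sketch (not the R code from the analysis) that builds the tag-by-question incidence matrix, multiplies it by its transpose, and compares the result with a per-question pair count that mirrors the self-join. The sample rows are invented, and nothing here measures speed.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (question_id, tag) rows.
rows = [
    (1, "ggplot2"), (1, "shiny"),
    (2, "ggplot2"), (2, "shiny"), (2, "dplyr"),
    (3, "dplyr"), (3, "dataframe"),
]
row_set = set(rows)
ids = sorted({q for q, _ in rows})
tags = sorted({t for _, t in rows})

# Incidence matrix: one row per tag, one column per question.
incidence = [[1 if (q, t) in row_set else 0 for q in ids] for t in tags]

# Incidence matrix times its transpose: entry [i][j] is the number of
# questions that carry both tag i and tag j (the diagonal holds tag totals).
n = len(tags)
projection = [
    [sum(incidence[i][k] * incidence[j][k] for k in range(len(ids))) for j in range(n)]
    for i in range(n)
]

# Same counts via grouping, analogous to the self-join with b.Tag > a.Tag.
tags_by_question = {}
for q, t in rows:
    tags_by_question.setdefault(q, set()).add(t)

pair_counts = Counter()
for question_tags in tags_by_question.values():
    pair_counts.update(combinations(sorted(question_tags), 2))

print(pair_counts.most_common())
```

The matrix product fills every cell of a tag-by-tag table, including the many zero cells, while the join only touches pairs that actually occur together, which is a plausible reason for the speed difference described above.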
This is #day37 of my #100dayprojects on Data Science and visual storytelling. Full code is on my GitHub. Thanks for reading. Suggestions for new topics and feedback are always welcome.