
The network of R/Python questions on Stack Overflow

Visualising co-occurrence with ggraph and community detection

When people ask coding questions on Stack Overflow, which topics are inter-related for R and Python respectively?

I used two datasets containing all R and Python questions on Stack Overflow up to Oct 16. By looking at the 240K R tags and 1.8M Python tags attached to those questions, we can detect clusters of topics based on how often tags co-occur on the same question.

Most R questions surround ggplot2, dataframe and shiny

Network of tags on R questions with more than 150 co-occurrences

We can see how tags like ggplot2 and dataframe branch out to their subtopics. dataframe is related to csv (data input), pre-processing libraries (reshape2, dplyr, data.table), and conditionals and loops. ggplot2 is related to different chart elements, data visualization, and the interactive shiny dashboard.

To have an even clearer view of the clusters, we can color the communities.
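The post does not say which community detection algorithm produced the colors, so the following is only a minimal Python sketch of one common option, weighted label propagation, using the standard library. The function name `label_propagation`, the tie-breaking rule, and the edge weights are all assumptions made for illustration, not taken from the dataset or the original code.

```python
from collections import defaultdict

def label_propagation(weighted_edges, max_rounds=20):
    """Assign a community label to each node by weighted label propagation."""
    neighbours = defaultdict(dict)
    for a, b, w in weighted_edges:
        neighbours[a][b] = w
        neighbours[b][a] = w

    labels = {node: node for node in neighbours}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            votes = defaultdict(int)
            for other, w in neighbours[node].items():
                votes[labels[other]] += w
            # Heaviest label wins; ties go to the alphabetically smallest label.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical co-occurrence counts, not taken from the dataset.
edges = [
    ("ggplot2", "plot", 900), ("ggplot2", "shiny", 400), ("plot", "shiny", 200),
    ("dplyr", "dataframe", 800), ("dplyr", "data.table", 300),
    ("dataframe", "data.table", 500),
]
communities = label_propagation(edges)
```

Each tag ends up with the label of the group it is most strongly tied to, and that label can then be mapped to a color when drawing the network.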

Clusters of R questions

Counting how often each individual tag appears shows that ggplot2 is the most frequent, which suggests that R is used for data visualization more than for anything else.
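As a rough illustration of that counting step, here is a small Python sketch. The rows are made-up examples rather than real data, and the one-row-per-tag-per-question layout is an assumption based on the join query used later in this post.

```python
from collections import Counter

# Hypothetical (question Id, tag) rows, one row per tag per question.
rows = [
    (1, "ggplot2"), (1, "plot"),
    (2, "ggplot2"), (2, "shiny"),
    (3, "dplyr"), (3, "dataframe"),
    (4, "ggplot2"), (4, "dataframe"),
]

# Count how many questions carry each tag, then take the most frequent ones.
tag_counts = Counter(tag for _, tag in rows)
top_tags = tag_counts.most_common(2)
print(top_tags)
```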


We will look into Python now.

Python questions mainly surround django, pandas and numpy

Python is a general-purpose language, so it attracts far more questions than R.

Network of tags on Python questions with more than 800 co-occurrences

Similarly, we can take a quick look at the major communities of topics.

Clusters of Python questions

These topics appeared the most among Python questions: django, pandas, numpy.


Commonalities & Differences in R/Python questions

Next I looked into tags that co-occurred with both R and Python tags on different occasions.

Commonalities

We can observe shared topics like dataframe, plot, loop and function. Tags specific to packages that are available in only one language rarely co-occur with both R and Python tags.

We can also observe tags that appear only with Python or R tags.
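One simple way to separate shared tags from language-specific ones is with set operations. This is a minimal Python sketch under the assumption that each dataset is a list of (question Id, tag) rows; the rows and variable names are hypothetical and do not come from the real data.

```python
# Hypothetical (question Id, tag) rows from the R and Python datasets.
r_rows = [(1, "ggplot2"), (1, "plot"), (2, "dataframe"), (3, "shiny")]
py_rows = [(10, "pandas"), (10, "dataframe"), (11, "plot"), (12, "django")]

r_tags = {tag for _, tag in r_rows}
py_tags = {tag for _, tag in py_rows}

shared = r_tags & py_tags        # tags seen alongside both languages
only_r = r_tags - py_tags        # tags seen only on R questions
only_python = py_tags - r_tags   # tags seen only on Python questions
```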

Differences

Round up

In this analysis we visualised co-occurrence of tags among R and Python questions respectively and compared the commonalities and differences of questions on the two languages.


What I learnt today is how to compute and visualise co-occurrence.

One way of getting co-occurrence is to build the incidence matrix of the bipartite graph (tags against questions), then turn it into a one-mode adjacency matrix by multiplying the matrix by its transpose. However, this method can be slow. A better way is to use a join:

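# Assumes df has one row per (question Id, Tag) pair
library(sqldf)

# Self-join on question Id; b.Tag > a.Tag keeps each unordered pair once
co_occurence = sqldf("SELECT a.Tag a, b.Tag b, COUNT(*) cnt
FROM df a
JOIN df b
ON b.Id = a.Id AND b.Tag > a.Tag
GROUP BY a.Tag, b.Tag")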
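To make the two approaches concrete, here is a small Python sketch using only the standard library. It computes tag co-occurrence once from the incidence matrix and its transpose, and once by running the same self-join through an in-memory SQLite database that stands in for sqldf. The toy rows are hypothetical, and the table is named `df` only to mirror the query above.

```python
import sqlite3

# Hypothetical (question Id, tag) rows standing in for the real data frame.
rows = [
    (1, "ggplot2"), (1, "plot"), (1, "r"),
    (2, "ggplot2"), (2, "plot"),
    (3, "dplyr"), (3, "r"),
]

# Approach 1: incidence matrix (tags x questions) times its transpose.
tags = sorted({tag for _, tag in rows})
ids = sorted({qid for qid, _ in rows})
present = set(rows)
incidence = [[1 if (qid, tag) in present else 0 for qid in ids] for tag in tags]
matrix_counts = {}
for i, a in enumerate(tags):
    for j in range(i + 1, len(tags)):
        # Entry (i, j) of the product is the dot product of two tag rows.
        cnt = sum(x * y for x, y in zip(incidence[i], incidence[j]))
        if cnt:
            matrix_counts[(a, tags[j])] = cnt

# Approach 2: the self-join, run through SQLite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (Id INTEGER, Tag TEXT)")
con.executemany("INSERT INTO df VALUES (?, ?)", rows)
query = """
    SELECT a.Tag, b.Tag, COUNT(*)
    FROM df a
    JOIN df b ON b.Id = a.Id AND b.Tag > a.Tag
    GROUP BY a.Tag, b.Tag
"""
join_counts = {(a, b): cnt for a, b, cnt in con.execute(query)}
con.close()
```

Both dictionaries hold the same pair counts. The matrix route fills in a cell for every pair of tags, including pairs that never appear together, while the join only ever produces pairs that share a question, which is one plausible reason the join scales better on sparse tag data.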

This is #day37 of my #100dayprojects on Data Science and visual storytelling. Full code is on my GitHub. Thanks for reading. Suggestions for new topics and feedback are always welcome.

