Visualizing Developer Activity via the GitHub API

Can we somehow measure open source love?

Vikas Negi
Towards Data Science


Photo by Roman Synkevych 🇺🇦 on Unsplash

Apart from hosting our precious code, GitHub also provides an extensive set of REST APIs. These can be used to retrieve a variety of useful metrics for a given repository, which can give an idea of its current state. Julia is a relatively new language, but its package ecosystem has seen tremendous growth over the last few years. The number of new packages seems to have skyrocketed! I would argue that simply developing a package is never enough; we also need to maintain and improve it. A mature package ecosystem also helps attract new users.

In this article, I will show you how to compare various metrics for different packages using the GitHub API. Based on that, we will also try to draw some conclusions about their popularity. To achieve this, we will use yet another package, GitHub.jl, in a Pluto notebook environment. The source code is available here.

Prerequisite

To use the GitHub API, it’s advisable to authenticate requests with a personal access token, as described here. Unauthenticated requests are subject to restrictions, such as a limit of 60 requests per hour. The level of permissions to grant is up to you, but read access is the minimum required. An example setting is shown below:

Image by author

Adding necessary packages

Pluto’s built-in package manager will handle installation of packages and their dependencies. We will make use of the following for handling data and creating plots:

using GitHub, JSON, DataFrames, VegaLite, Dates

Authentication

Using the personal access token described earlier, we can generate an authentication object, which is later passed as a keyword argument to our requests. It’s not a good practice to hardcode tokens in your code. Instead, you can read them from a file (e.g. JSON) stored in a private location.

# Contents of JSON file: 
# { "GITHUB_AUTH":"YOUR_TOKEN" }
access_token = JSON.parsefile("/path/to/JSON/file")
myauth = GitHub.authenticate(access_token["GITHUB_AUTH"])
typeof(myauth)
# GitHub.OAuth2

Test API

Let’s check if our credentials are working correctly. We will try to fetch the list of contributors for DataFrames.jl, a must-have for doing data science using Julia.

Contributors for DataFrames.jl (Image by author)

Note that results of type Tuple{Vector{T}, Dict} imply that they are paginated. By supplying Dict("page" => 1) as an input parameter, we get to see all the results in the first Vector{T}, as shown above. You can also tune the number of results per page. For example, Dict("per_page" => 25, "page" => 2) will return 25 results per page, starting from page 2.
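As a rough sketch of the call described above, the snippet below requests a single page of contributors. The helper names page_params and contributors_page are my own and are not part of GitHub.jl. The page_limit keyword is assumed to stop GitHub.jl from following links to later pages, and myauth is the authentication object created earlier.

```julia
using GitHub

# Query parameters for one page of a paginated endpoint.
# The GitHub REST API caps `per_page` at 100, so larger values are clamped.
page_params(page::Integer; per_page::Integer = 30) =
    Dict("per_page" => min(per_page, 100), "page" => page)

# Fetch one page of contributors for a repository such as
# "JuliaData/DataFrames.jl". Returns the results and the pagination metadata.
# Assumption: `page_limit = 1` restricts the request to a single page.
function contributors_page(repo_name::AbstractString, page::Integer;
                           auth, per_page::Integer = 30)
    results, page_data = GitHub.contributors(
        repo_name;
        auth = auth,
        params = page_params(page; per_page = per_page),
        page_limit = 1,
    )
    return results, page_data
end

# Usage (needs network access and a valid `myauth`):
# results, page_data = contributors_page("JuliaData/DataFrames.jl", 2;
#                                        auth = myauth, per_page = 25)
```

Each element of results should be a Dict holding the contributor and the number of contributions, while page_data carries the links to neighbouring pages.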

List of interesting packages

We can now start to gather data for multiple packages, and plot them together for comparison. I have curated the following list to cover different domains (data analysis, parsing, web, plotting, math etc.), which by no means is meant to be exhaustive. Do you think we can add more to this? Let me know in the comments, and I will update the plots accordingly.

List of packages compared in this article (Image by author)

Number of contributors

Let’s start by determining the number of contributors to a given package using the function shown below.
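A minimal sketch of such a function is given here; it is not necessarily identical to the one used in the notebook. The names count_contributors and metric_table are hypothetical, and the repository paths in the usage comment are illustrative only.

```julia
using GitHub, DataFrames

# Number of contributors for one repository, e.g. "JuliaData/DataFrames.jl".
# Without a `page_limit`, GitHub.jl is expected to follow all result pages.
function count_contributors(repo_name; auth)
    results, _ = GitHub.contributors(
        repo_name;
        auth = auth,
        params = Dict("per_page" => 100),
    )
    return length(results)
end

# Apply a counting function to several repositories and collect the results
# in a two-column table (package name, metric) that is convenient for plotting.
function metric_table(count_fn, repo_names; metric = :contributors)
    counts = [count_fn(name) for name in repo_names]
    packages = [String(last(split(name, "/"))) for name in repo_names]
    return DataFrame(:package => packages, metric => counts)
end

# Usage (needs network access and a valid `myauth`):
# repos = ["JuliaData/DataFrames.jl", "JuliaPlots/Plots.jl"]
# df = metric_table(name -> count_contributors(name; auth = myauth), repos)
```

Passing the counting function as an argument keeps the table-building step independent of the network call, so the same helper can be reused for other metrics.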

We will do this for all the packages in our list, and then plot the results using the @vlplot macro from VegaLite.jl.
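The plotting step could look roughly like the sketch below. It assumes a DataFrame with columns named package and contributors; these column names, the function name plot_contributors and the chart options are my own choices and may differ from the original notebook.

```julia
using DataFrames, VegaLite

# Bar chart of contributor counts, sorted from highest to lowest.
# Assumes `df` has a `package` column (names) and a `contributors` column (counts).
function plot_contributors(df::DataFrame)
    return df |> @vlplot(
        :bar,
        x = {"package:n", sort = "-y", title = "Package"},
        y = {"contributors:q", title = "Number of contributors"},
        width = 500,
    )
end

# Usage, with `df` built from the API results:
# plot_contributors(df)
```

In a Pluto notebook, the returned plot object is rendered when it is the last expression in a cell.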

Image by author

It seems DataFrames.jl currently has the highest number of contributors, which is not surprising given its utility in almost all data science workflows. Plots.jl is a close second, and has the highest number amongst all plotting packages used in this comparison.

Number of forks

Using the same logic as above, we can also compare the number of forks.
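Since the counting logic is the same, one option is to generalize it, as in this sketch. The helper count_items is hypothetical; it only assumes that the listing function passed in (for example GitHub.forks or GitHub.contributors) accepts auth and params keywords and returns a tuple of results and pagination data.

```julia
using GitHub

# Count the items returned by any paginated GitHub.jl listing function.
# `list_fn` is expected to return `(items, page_data)`.
function count_items(list_fn, repo_name; auth, per_page = 100)
    items, _ = list_fn(
        repo_name;
        auth = auth,
        params = Dict("per_page" => per_page),
    )
    return length(items)
end

# Usage (needs network access and a valid `myauth`):
# n_forks = count_items(GitHub.forks, "JuliaData/DataFrames.jl"; auth = myauth)
```

For repositories with thousands of forks this walks through many pages, so it is worth keeping the hourly rate limit in mind.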

Image by author

Again, DataFrames.jl appears to lead, followed closely by Plots.jl.

Weekly commits

GitHub also provides an API to determine the number of commits made to a repository in each of the last 52 weeks. We need to parse the HTTP response object, and then convert it to a DataFrame in which the package name is used as the column name.
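A possible way to do the parsing is sketched below. It assumes that the commit activity endpoint returns a JSON array with one object per week, each carrying a total field, and that GitHub.stats accepts "commit_activity" as the statistic name. The function names are hypothetical.

```julia
using GitHub, JSON, DataFrames

# Turn the JSON body of the commit activity endpoint into a one-column
# DataFrame. The column is named after the package and holds weekly totals.
function weekly_commits_table(json_body::AbstractString, package::AbstractString)
    weeks = JSON.parse(json_body)
    # GitHub may answer with an empty or non-array body while the statistics
    # are still being computed, so fail with a clear message in that case.
    weeks isa AbstractVector || error("unexpected response for $package; try again later")
    totals = Int[week["total"] for week in weeks]
    return DataFrame(package => totals)
end

# Fetch and parse the weekly commit counts for one repository.
function weekly_commits(repo_name, package; auth)
    response = GitHub.stats(repo_name, "commit_activity"; auth = auth)
    return weekly_commits_table(String(copy(response.body)), package)
end

# Usage (needs network access and a valid `myauth`):
# df_csv = weekly_commits("JuliaData/CSV.jl", "CSV"; auth = myauth)
```

Keeping the parsing separate from the request makes it easy to check the conversion on a saved response.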

We then repeat the process for all packages, and construct a new DataFrame to be used as input for creating a stacked bar plot.
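One way to build that input is to join the per-package columns side by side and then reshape them into long format, as in the following sketch. The function name stack_weekly_commits and the column names week, package and commits are assumptions of mine.

```julia
using DataFrames

# Combine one-column DataFrames (one per package) into a long table with
# columns `week`, `package` and `commits`, suitable for a stacked bar plot.
# All inputs must have the same number of rows and distinct column names.
function stack_weekly_commits(tables::Vector{DataFrame})
    wide = hcat(tables...)                           # new table, inputs untouched
    insertcols!(wide, 1, :week => 1:nrow(wide))      # week index 1, 2, ...
    return stack(wide, Not(:week); variable_name = :package, value_name = :commits)
end

# Usage, with `df_a` and `df_b` holding weekly commit counts:
# long = stack_weekly_commits([df_a, df_b])
# long |> @vlplot(:bar, x = "week:o", y = "commits:q", color = "package:n")
```

With a color encoding on the package column, VegaLite stacks the bars for each week by default.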

Note that here we are looking only at a subset of the above packages for better visual clarity.

Image by author

The first plot shows that DataFrames.jl and CSV.jl have had regular activity throughout the last year. This indicates that the respective package maintainers have been working hard. Kudos to everyone involved!

In the second plot, we notice that a lot of activity happened during weeks 1–10 for Makie.jl and Plots.jl. After that, the number of commits has been lower than usual.

Open and closed issues

Another interesting metric is the current number of open and closed issues, which could be a reasonable indicator of package maturity. After all, who would want to keep using code riddled with open issues? For example, a high ratio of open to closed issues (> 0.7) suggests that developers have been slow to fix bugs, or that the issues are complex and take time to resolve. A low ratio (< 0.3), on the other hand, indicates a healthy pace of development.
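The rule of thumb above can be written down as a small helper. This is only an illustration: the function name issue_health and the labels are hypothetical, and the thresholds are the ones quoted in the text.

```julia
# Classify a package by its ratio of open to closed issues.
# Ratios above `high` suggest a backlog, ratios below `low` a healthy pace.
function issue_health(open_count::Integer, closed_count::Integer;
                      high = 0.7, low = 0.3)
    # Avoid dividing by zero for repositories without closed issues.
    closed_count > 0 || return (ratio = Inf, label = "no closed issues")
    ratio = open_count / closed_count
    label = if ratio > high
        "large backlog"
    elseif ratio < low
        "healthy"
    else
        "in between"
    end
    return (ratio = ratio, label = label)
end

# Usage:
# issue_health(10, 100)   # ratio 0.1, labelled "healthy"
```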

Keep in mind that the API also counts pull requests as issues. We would like to separate those from issues reported as bugs by users.
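A sketch of that separation follows. It assumes that each item returned by GitHub.issues has a state field and a pull_request field that is nothing for plain issues, mirroring the REST payload. The helper names are hypothetical.

```julia
using GitHub

# An item is a plain issue when it carries no pull request information.
is_plain_issue(item) = item.pull_request === nothing

# Count open and closed plain issues in a collection of issue-like items.
function count_issues(items)
    plain = filter(is_plain_issue, items)
    n_open = count(item -> item.state == "open", plain)
    return (n_open = n_open, n_closed = length(plain) - n_open)
end

# Fetch all issues (open and closed) for a repository and count them.
# Note that this may need many requests for repositories with a long history.
function issue_counts(repo_name; auth)
    items, _ = GitHub.issues(
        repo_name;
        auth = auth,
        params = Dict("state" => "all", "per_page" => 100),
    )
    return count_issues(items)
end

# Usage (needs network access and a valid `myauth`):
# issue_counts("JuliaData/CSV.jl"; auth = myauth)
```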

The gathered results can be combined and visualized once again as a stacked bar plot.

Image by author

It’s very heartening to see that most of the packages in our list do not have a big backlog of open issues. They have undergone a robust development and testing cycle, thus leading to a very mature state.

Open source love

It could also be interesting to have a look at some social metrics such as the number of people that have starred or are following updates of a repository.
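One inexpensive way to obtain both numbers is to read them from the repository object itself, rather than listing every stargazer. The sketch below assumes that the object returned by GitHub.repo exposes stargazers_count and subscribers_count fields, as the REST payload does; in the REST API, subscribers_count is the number of watchers. The function name social_metrics is hypothetical, and the notebook may have used a different approach.

```julia
using GitHub

# Extract star and watcher counts from a repository object.
# Assumption: the field names mirror the REST payload for a repository.
function social_metrics(repo_obj)
    return (
        stars = repo_obj.stargazers_count,
        watchers = repo_obj.subscribers_count,
    )
end

# Usage (needs network access and a valid `myauth`):
# r = GitHub.repo("SciML/DifferentialEquations.jl"; auth = myauth)
# social_metrics(r)
```

This needs a single request per repository, which helps when comparing many packages under the hourly rate limit.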

Image by author

DifferentialEquations.jl is the clear winner here, owing to its huge popularity in the Julia SciML ecosystem. Amongst the plotting engines, Plots.jl and Makie.jl appear to be neck and neck. I was surprised to see PyCall.jl with so many stars. Now that I think about it, it makes sense, since a lot of new Julia users might be switching from Python. It could also be that they intend to use Julia only for the performance-critical parts of their code.

The number of watchers shows a similar trend, although I don’t think watching a repository is a common habit amongst developers.

Image by author

Conclusion

The Julia ecosystem is evolving at a rapid pace. I am very happy to see that most of the prominent packages are being actively maintained, which is essential to the open source spirit. I hope you found this exercise interesting. Thank you for your time! Connect with me on LinkedIn or visit my Web 3.0 powered website.
