NeurIPS Conference: Historical Data Analysis

Analysing conference trends from 1987 to 2020.

Nemanja Rakicevic
Towards Data Science

--

Closing ceremony NIPS 2017 | Image by Author

Conferences often provide statistics for a particular year, presenting information about the acceptance rate, diversity, geographical and topic distribution, etc. Usually, this is given only for that particular year, or perhaps up to a few years back.

However, it might be worthwhile to take a step back and see the big picture, in order to visualise some trends and gain potentially useful insights.

Data Sources

In this blog post, I will focus on data from the NeurIPS conference. The accepted papers from all conference instances since 1987, when it was first held, are publicly available at https://papers.nips.cc/. Rejected papers would have offered additional insights, but unfortunately they are not available. I would also like to thank the NeurIPS organisers for maintaining this database and giving permission to download and use the data for this analysis.

The information consistently available for each paper is: Title and Authors. For some years, the following metadata is also available: Abstract, Full paper text, Author affiliations, Reviews and Supplemental material.

In addition, I tried certain approaches to extract more data about the papers, authors and their institutions that is not directly available on the main website. All the source code I used to obtain the data and generate the plots is available on my GitHub. (All images by author)

Analysis

Paper statistics

Let’s start with the total number of accepted papers. As the total number of submissions is not available, we cannot track the acceptance rate. Still, the number of accepted papers has increased drastically in recent years:

In the first conference instance, in 1987, there were 90 accepted papers in total, while in 2020 this number reached 1898. Overall, 11578 papers have been accepted over the years.

If we (over)fit a polynomial curve to this data, we can make a wager for 2021:
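
Such an extrapolation can be sketched with a simple polynomial fit. The series below is illustrative: only the 1987 and 2020 counts come from the text above, while the interior values are rough placeholders (the actual counts are in the repository).

```python
import numpy as np

# Illustrative accepted-paper counts; only the 1987 and 2020 values are
# from the post, the interior points are rough placeholders.
years = np.array([1987, 1995, 2005, 2015, 2020])
papers = np.array([90, 150, 400, 600, 1898])

# (Over)fit a low-degree polynomial and extrapolate one year ahead.
coeffs = np.polyfit(years, papers, deg=2)
prediction_2021 = float(np.polyval(coeffs, 2021))
print(f"Extrapolated 2021 paper count: {prediction_2021:.0f}")
```

As the "(over)fit" in the text suggests, such extrapolations are fragile: the prediction changes considerably with the polynomial degree.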

Moving on to the paper titles, the average title length has gone down slightly:

while the use of abbreviations in the title is coming back in fashion:

The abstract length, measured by word count, has been steadily increasing (variance is shown as vertical black lines). However, abstract data was not consistently available before 2007, so there is no information for the earlier period:

One of my favourite plots (and the reason I started this analysis) concerns supplemental material. Since 2007, the proportion of accepted papers containing some form of supplemental material (.pdf or .zip) has been increasing, reaching almost 100% nowadays.

Research trends

It would be nice to roughly estimate research trends based on the keywords appearing in paper titles. Interesting insights can be found by looking at the most frequent keyword phrases appearing in titles, normalised by the number of accepted papers per year.

The frequencies of the top 5 two-word keywords show two periods when the phrase neural networks (blue bars) was popular: approximately before 2000 and in the 2010s. The period in between shows an increase in statistical ML terminology such as support vector (green) and Gaussian process (red). Reinforcement learning (orange) has been mentioned consistently since the 1990s, with larger peaks towards the end of the 1990s and after 2016.
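
The underlying counting can be done in a few lines of Python. A minimal sketch, with made-up titles standing in for one year's accepted papers:

```python
from collections import Counter

# Hypothetical titles standing in for one conference year.
titles_2017 = [
    "Deep Neural Networks for Structured Prediction",
    "Bayesian Optimisation with Gaussian Process Priors",
    "Recurrent Neural Networks Revisited",
]

def bigram_frequencies(titles):
    """Count two-word phrases across titles, normalised by paper count."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        counts.update(zip(words, words[1:]))
    return {" ".join(bg): c / len(titles) for bg, c in counts.items()}

freqs = bigram_frequencies(titles_2017)
print(freqs["neural networks"])  # appears in 2 of the 3 titles
```

The same counting extends to three-word phrases by zipping three shifted copies of the word list.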

Focusing on the top 6 three-word phrases, similarly to the previous graph, there are two distinct “neural network” clusters: artificial neural network (red) and deep neural networks (green), with support vector machines (blue) in between. Moreover, we can notice a resurgence in mentions of recurrent neural networks (orange) from the mid-2010s, following the deep learning revolution.

Author statistics

Regarding author information, a quick disclaimer: there are some inconsistencies in given names within the data, as they are sometimes abbreviated (e.g. “J.”, “Josh”, “Joshua”). Although it is difficult to fix all such instances manually due to the large number of authors, we can still get good enough estimates from the data at hand to perform the analysis.

The total number of unique authors participating each year is shown below; it closely follows the number of accepted papers:

Or more interestingly, normalised by the actual world population at the time:

In the first year there were 166 participating authors, while at the most recent conference there were 5917, leading to an overall total of 17670 participants over 34 conference instances.

The top 50 authors by number of accepted papers:

To investigate the actual inequality in the distribution of papers among authors, one can estimate the Gini coefficient. The blue part of the graph below shows the cumulative number of publications once the authors are sorted by their number of publications. If every author had the same number of publications, the blue area would also cover the orange area (the ideal, equal distribution). Therefore, the ratio (orange area) / (orange + blue area) gives us the Gini coefficient. A value of 0 indicates perfect equality (no orange area), while a value near 1 indicates total inequality (almost all papers published by a single author).
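
The computation itself is compact. A minimal sketch over a list of per-author publication counts, using the standard sorted-cumulative-share formula:

```python
def gini(counts):
    """Gini coefficient of per-author publication counts.
    0 = perfect equality; values near 1 = almost all papers by one author."""
    sorted_counts = sorted(counts)
    n = len(sorted_counts)
    total = sum(sorted_counts)
    # G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n, with x sorted ascending
    weighted = sum((i + 1) * x for i, x in enumerate(sorted_counts))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([3, 3, 3, 3]))              # perfect equality -> 0.0
print(round(gini([0, 0, 0, 100]), 2))  # one dominant author -> 0.75
```

Note that for a finite sample of n authors the maximum attainable value is (n - 1) / n, which is why the single-dominant-author example gives 0.75 rather than 1.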

Although author diversity is an important aspect, information such as author gender is not available from the NeurIPS paper website. It is possible to get a very rough estimate using a Python library that compares an author’s given name against a name database (consisting mostly of “western” names). The “unknown” label is mostly assigned to non-western names, or to cases where the author’s given name is abbreviated.
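
The post does not name the library it used (gender-guesser is one such package); the idea can be sketched with a tiny hardcoded lookup standing in for a real name database:

```python
# Hypothetical miniature name database standing in for a real one.
NAME_TABLE = {"joshua": "male", "maria": "female", "john": "male"}

def guess_gender(given_name):
    """Very rough name-based estimate; abbreviated or out-of-database
    names fall into the "unknown" bucket, as described above."""
    name = given_name.strip(".").lower()
    if len(name) <= 1 or name not in NAME_TABLE:
        return "unknown"
    return NAME_TABLE[name]

print(guess_gender("Joshua"))  # male
print(guess_gender("J."))      # unknown (abbreviated given name)
print(guess_gender("Nemanja")) # unknown (not in this toy database)
```

This makes the limitation explicit: the estimate is only as good as the coverage of the name database, which is why non-western and abbreviated names dominate the "unknown" label.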

Below are the graphs showing the proportion of all papers, and last-author papers, published by each author group. The last author is usually a more senior academic, such as a group leader or a supervisor.

In both graphs we can see that the female-male gap is still very high.

If we take the label “unknown” to be assigned mostly to non-western names, the recent increase in this label could be linked to more non-western authors participating in the conference in recent years. However, when looking at papers’ last authors, this increase is slower.

“PhD” author statistics

Next, I was interested in estimating the number of publications by PhD students over the years. There is no reliable way to determine which authors are PhD students, but as a proxy we can count how many papers each author published within the first 5 years after their first publication at this conference.
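
The proxy described above can be sketched as follows; the author/year records here are hypothetical.

```python
from collections import defaultdict

# Hypothetical (author, year) publication records.
records = [
    ("alice", 2012), ("alice", 2014), ("alice", 2019),
    ("bob", 2016), ("bob", 2017),
]

# First NeurIPS publication year per author.
first_year = {}
for author, year in sorted(records, key=lambda r: r[1]):
    first_year.setdefault(author, year)

# Count papers falling within each author's first 5 years.
phd_papers = defaultdict(int)
for author, year in records:
    if year - first_year[author] < 5:
        phd_papers[author] += 1

print(dict(phd_papers))  # alice's 2019 paper falls outside her 5-year window
```

The same grouping, restricted to records where the author appears first in the author list, gives the first-author variant discussed below.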

Both the total number of papers and the number of first-author papers are mostly constant, with a slight upward trend.

Institution statistics

More interesting insights come from the authors’ institutional affiliations and their participation over the years. Since 2013, the first conference for which author affiliations are available, a total of 2672 institutions have participated.

Disclaimer:

  • Affiliation information was not available before 2013.
  • The actual institution names are sometimes inconsistent (e.g. “Harvard Unviersity” (small typo), “Harvard University”, “Harvard College”, “Harvard”, “Harvard Medical School”, “Harvard/MIT”, etc.), which might lead to some inaccuracies in the final analysis. However, even though these examples are counted as separate institutions, we can still get good enough estimates for the analysis.
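
One way to merge at least the near-duplicate spellings is fuzzy string matching against a canonical list. A minimal sketch with the standard library's difflib (the canonical list here is illustrative):

```python
import difflib

# Illustrative canonical affiliation names.
CANONICAL = ["harvard university", "mit", "stanford university"]

def canonicalise(raw):
    """Map a raw affiliation string onto the closest canonical name;
    names without a close-enough match are kept as-is."""
    name = raw.lower().strip()
    match = difflib.get_close_matches(name, CANONICAL, n=1, cutoff=0.8)
    return match[0] if match else name

print(canonicalise("Harvard Unviersity"))  # typo maps to "harvard university"
print(canonicalise("Harvard"))             # too short a match: kept as-is
```

This only catches typos and minor variants; genuinely different names for the same institution ("Harvard" vs "Harvard Medical School") still need manual curation, as the disclaimer notes.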

The top 50 institutions by number of accepted papers:

As before, the estimated Gini coefficient below shows the inequality in the distribution of papers among institutions. In this case, the coefficient value of 0.68 indicates a much higher inequality than among individual authors.

Collaboration statistics

A proxy for collaboration can be obtained via the average number of authors per paper. This number has been steadily increasing, which could indicate more complex research projects, greater openness to collaboration, etc.

The share of single-author papers, on the other hand, has gone down significantly:

While the number of institutions per paper has not changed significantly (variance is shown as vertical black lines):

Funding statistics

Another interesting observation is the relationship between the GDP of the institutions’ countries and the total number of accepted papers coming from those countries. The country information is not included in the paper metadata, so it needs to be inferred from the institution name (by scraping Wikipedia, with limited success). The GDP data for each country is available from the World Bank.

It would be interesting to understand how the GDP of a country relates to the available research funding, and subsequently, if there is a correlation with the number of accepted papers.
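
Such a correlation check can be sketched with a plain Pearson coefficient on the log-transformed values (the GDP and paper figures below are hypothetical):

```python
import math

# Hypothetical (GDP in $B, accepted-paper count) pairs for a few countries.
gdp = [21_400, 14_300, 3_800, 2_800, 1_700]
papers = [8_000, 1_500, 400, 900, 300]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Correlate in log space, matching the log-plot in the post.
log_gdp = [math.log(x) for x in gdp]
log_papers = [math.log(y) for y in papers]
print(round(pearson(log_gdp, log_papers), 2))
```

Correlation here says nothing about causation, of course: GDP, research funding and paper counts are all entangled with many other factors.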

The graphs above show a correlation between a country’s GDP and the total number of accepted papers from that country’s institutions, in both standard and log plots, with the US as a clear front runner.

Final thoughts

I hope these graphs were informative, or at least interesting, as some results might be expected while others less so. I would be happy to get feedback and thoughts on analyses I might have missed that would have been insightful.

Interesting insights could also be gained from looking into the average number of citations over the years, which I defer as a future task (since the Google Scholar API is limited); anyone who wishes is welcome to contribute to the repository.

Acknowledgements

I would like to thank Robert Walecki, Linh Tran, Arash Tavakoli and Suzana Ilić for useful discussions and comments, which helped shape my first blogpost!

Citation

Cited as:

Rakicevic, Nemanja. (Feb 2021). NeurIPS Conference: Historical Data Analysis. Towards Data Science blog. https://towardsdatascience.com/neurips-conference-historical-data-analysis-e45f7641d232.

Or

@article{rakicevic2021neuripsanalysis,
title = "NeurIPS Conference: Historical Data Analysis.",
author = "Rakicevic, Nemanja",
journal = "towardsdatascience.com",
year = "2021",
month = "Feb",
url = "https://towardsdatascience.com/neurips-conference-historical-data-analysis-e45f7641d232"
}
