Reviewing for Machine Learning Conferences Explained

From reading a paper for the first time to writing its complete review in a single Medium article

Ievgen Redko
Towards Data Science


Photo by Markus Winkler on Unsplash

Peer review is the cornerstone of modern science, and almost all major conferences in machine learning (ML), such as NeurIPS and ICML, rely on it to decide whether submitted papers are relevant to the community and original enough to be published there. Unfortunately, with the exponentially increasing number of submitted articles over the last ten years, reviewing quality has been dropping just as fast, with one-line reviews becoming widespread. You’ve probably been there already if you’ve ever submitted a paper to one of these conferences: after having worked hard for months on what felt like a brilliant idea, you receive awful, useless, and (worst of all) ironic reviews, meaning that you will have to go through the submission process all over again without any hint of what was wrong with your paper in the first place.

Geoffrey Hinton, the famous Turing Award winner for his contributions to machine learning and AI, gave one of the reasons why this happens in a 2018 interview with Wired:

Now if you send in a paper that has a radically new idea, there’s no chance in hell it will get accepted, because it’s going to get some junior reviewer who doesn’t understand it. Or it’s going to get a senior reviewer who’s trying to review too many papers and doesn’t understand it first time round and assumes it must be nonsense. Anything that makes the brain hurt is not going to get accepted. And I think that’s really bad.

While senior reviewers have little to no excuse for such behavior (why would you voluntarily agree to review if you do not have time to do it properly?!), junior reviewers may simply not know how to write a good, thoughtful review. Conference organizers usually provide helpful guidelines with examples from reviews gathered over the years, but these do not explain how to write a full review from scratch: from reading the submitted article for the first time to finalizing your review and submitting it on the conference website. As I happen to have won several so-called “Top reviewer” awards (IJCAI’18, NeurIPS’19, ’20), I would like to explain below how I proceed when I review papers, hoping that it will be useful for people who may need such guidance.

I wrote this article based on the Research Methodology course that I teach to Master’s students in machine learning. One of its lectures goes as follows: we read a paper together, paragraph by paragraph, and I explain which parts of it a reviewer has to pay particular attention to. As an example, I use the ICLR’19 paper entitled “Learning what and where to attend with humans in the loop” (its first submitted version can be found here). I chose this paper for two reasons: 1) it is not in one of my areas of primary expertise, and 2) it remains largely accessible to anybody with a general background in ML. I thought that the first point was very important, as most future Ph.D. students will start their reviewing careers in similar conditions, without long-standing prior experience in any particular ML field.

I now invite you to follow me through the paper in order to understand how to write a review for it. To do this, I suggest reading each full section of the paper indicated in the headings below before reading my comments on it.

Abstract

The abstract is one of the most important parts of the paper for a reviewer, as it gives a general outline of what they will find in it. When reading this part, I note every promise made by the authors and expect each of them to be supported by facts in the main body of the work. Let’s look at the abstract of our paper.

Image by Author based on the original paper.

I put the important things in bold here. What information does this abstract give me as a reviewer? First, it defines the general area of the submission, which is the study of attention mechanisms in DCNs. Second, and most importantly, it puts forward two claims that I will want to verify, namely: 1) attention mechanisms with human supervision significantly improve the DCN’s performance, and 2) the features learned in this case are more interpretable. I note this and move on to the introduction.

Introduction

The introduction is an extended version of the abstract that includes references to previous work and provides more details on the proposed contribution. In this paper, the introduction contains several things that attract my attention.

Images by Author based on the original paper.
Images by Author based on the original paper.

First, I identify several closely related prior works mentioned numerous times in the second paragraph, namely (Linsley et al. ’17) and (Jiang et al. ’15). As a reviewer, I would now briefly go through the contents of these two papers, with a particular emphasis on the first one because 1) it is more recent and most likely includes a comparison to the other related works mentioned in the introduction, and 2) the authors single it out for comparison.

Second, I note the positioning of the proposed contributions w.r.t. the state of the art, namely: 1) the authors propose a more efficient strategy, implemented on the ClickMe.ai platform, to obtain attention maps for large-scale datasets compared to the Salicon dataset and Linsley et al.’s work; 2) the authors propose a novel module for DCNs based on the idea of combining global contextual guidance with local saliency; 3) the authors improve performance with human-in-the-loop attention. Once again, as a reviewer, I will now seek arguments that support each of these claims.

Section 2: Description of ClickMe.ai

This section is very important, as it is entirely devoted to supporting the first claim mentioned above. On the one hand, it is supposed to show that the proposed strategy used to collect attention maps scales better than previous work. On the other hand, it should show that the obtained “top-down” maps are superior to the “bottom-up” ones collected previously. Here is my summary of the first part.

Image by Author based on the original paper.

You may note that I list the ClickMe.ai strategy proposed by the authors as a strength of the paper, as it involves only one human participant instead of the two required by Linsley et al., and it allows more attention maps to be collected. A downside is that the comparison with Linsley et al. allows me to guess the identities of the authors, who mention ClickMe.ai in their previous paper.

Here is my summary of the second claim.

Source: original paper.

As shown above, “top-down” features (ClickMe maps) seem to perform better than “bottom-up” features (Salicon maps) when shown to human observers. This supports the authors’ claim about their superior performance w.r.t. the maps from the Salicon dataset. So far, I have only been praising the strengths of the paper, but is there something to say about the weaknesses? Here are some of my remarks to be included in the review.

Image by Author.

The authors say that the ClickMe game scales better than the Clicktionary game of Linsley et al., but they never mention how many maps were collected using the latter. The second point is that the authors also use the maps that did not allow the DCN to recognize the object correctly. Is this reasonable? Why should we consider these maps useful later? Two other minor points are that the authors often talk about “top-down” and “bottom-up” features but never explain the difference between the two (I had to google it), and that they claim these features are “sufficient for human object recognition”. The latter may be too strong a statement, as the overall recognition accuracy never reaches 70%, which is far from what is considered human-level performance. I note all of this and move on to Section 3.

Section 3: Proposed network architecture

This section describes the module inspired by the idea of “combining local saliency and global contextual signals to guide attention towards image regions that are diagnostic for object recognition”. I am not an expert in attention mechanisms for DCNs, so I cannot judge the soundness or novelty of what the authors propose. At this point, I start to think that my confidence score for this paper would not be very high if I were its official reviewer, and that I would have to indicate this clearly in my review for the area chair (AC). Despite this, I still notice the following phrase from the authors:

Images by Author based on the original paper.

They do not explain how they chose these layers, nor why they omit low-level layers altogether: an ablation study might be useful to back up this choice. This is one more thing to add to the review, as such a discussion could be highly useful for researchers who may decide to implement the module for architectures other than ResNet-50.
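For readers less familiar with this kind of module, here is a minimal sketch of the general idea of combining a global contextual signal with a local saliency map to reweight a backbone’s feature maps. This is my own illustrative paraphrase in PyTorch, not the authors’ actual architecture: the layer shapes, reduction ratio, and names are assumptions made purely for the example.

```python
# Illustrative sketch only: NOT the architecture from the reviewed paper.
import torch
import torch.nn as nn


class GlobalLocalAttention(nn.Module):
    """Reweights feature maps with a global (channel) gate and a local (spatial) saliency map."""

    def __init__(self, channels: int):
        super().__init__()
        # Global context: squeeze spatial dimensions, produce per-channel gates.
        self.global_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Local saliency: a 1x1 convolution producing a single spatial attention map.
        self.local_saliency = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, channels, height, width) feature maps from some intermediate layer.
        channel_gates = self.global_gate(x)    # (B, C, 1, 1)
        saliency_map = self.local_saliency(x)  # (B, 1, H, W)
        attended = x * channel_gates * saliency_map
        # Returning the saliency map as well lets a training loss compare it
        # to human-derived attention maps (see the next section).
        return attended, saliency_map


if __name__ == "__main__":
    # Example: attach the module to the 1024-channel output of a residual block.
    features = torch.randn(2, 1024, 14, 14)
    module = GlobalLocalAttention(channels=1024)
    out, saliency = module(features)
    print(out.shape, saliency.shape)  # (2, 1024, 14, 14) and (2, 1, 14, 14)
```

A reviewer’s question about which layers to attach such a module to, and why, is exactly the kind of design choice an ablation study would clarify.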

Section 4: Training with humans-in-the-loop

This section presents most of the experimental results for the architecture proposed in Section 3, with an additional regularization that forces the learned maps to look similar to those provided by human participants on ClickMe.ai. Here is my short summary of this part.

Images by Author.

I note that most of the results do indeed seem to back up the authors’ third claim: human-in-the-loop supervision improves performance on popular object recognition datasets. Even though I tend to be convinced by the experimental results, I still notice several inconsistencies.

Images by Author.

The first remark is rather obvious: why use the magic number 6 for the regularization parameter? The second question is related to Table 1 of the paper, which shows a significant improvement in terms of both the classification accuracy on the ILSVRC12 dataset and the ability to learn features similar to ClickMe maps. What’s inconsistent here? Well, the latter improvement seems rather obvious to me, as it merely indicates that the regularization forcing learned features to look like ClickMe maps works well. The other baselines do not particularly seek to enforce such behavior, so this performance gain would be better presented as an argument justifying the chosen regularization strength. Third, the authors mention that with a reduced set of ClickMe maps (Table 4 in the Appendix), their method still performs better than all other baselines, but one can see that in this case the performance gap becomes very small. Finally, the authors mention that “Without additional training, the model’s attention localized foreground objects in Microsoft COCO 2014 (Lin et al., 2014)” but do not provide quantitative results on this dataset and show only 6 derived maps in Figure 4 (this was improved in the published version).
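To make the role of that regularization parameter concrete, here is a minimal sketch, under my own assumptions, of how a human-in-the-loop attention term can be combined with the usual classification objective. This is a paraphrase of the general recipe, not the authors’ exact loss: the choice of a mean-squared-error penalty and the function name are purely illustrative.

```python
# Illustrative sketch only: NOT the exact objective from the reviewed paper.
import torch
import torch.nn.functional as F


def human_in_the_loop_loss(logits: torch.Tensor,
                           labels: torch.Tensor,
                           predicted_map: torch.Tensor,
                           human_map: torch.Tensor,
                           reg_weight: float = 6.0) -> torch.Tensor:
    """Cross-entropy plus a penalty pushing the model's attention map towards the human map.

    logits:        (B, num_classes) classifier outputs.
    labels:        (B,) ground-truth class indices.
    predicted_map: (B, 1, H, W) attention/saliency map produced by the model.
    human_map:     (B, 1, H, W) ClickMe-style human attention map in [0, 1].
    reg_weight:    strength of the attention regularization (the "magic number" questioned above).
    """
    classification = F.cross_entropy(logits, labels)
    # One simple choice of similarity penalty: mean squared error between maps.
    attention_penalty = F.mse_loss(predicted_map, human_map)
    return classification + reg_weight * attention_penalty


if __name__ == "__main__":
    # Dummy tensors to show how the two terms combine.
    logits = torch.randn(4, 1000)
    labels = torch.randint(0, 1000, (4,))
    pred_map = torch.rand(4, 1, 14, 14)
    clickme_map = torch.rand(4, 1, 14, 14)
    print(human_in_the_loop_loss(logits, labels, pred_map, clickme_map).item())
    print(human_in_the_loop_loss(logits, labels, pred_map, clickme_map, reg_weight=0.0).item())
```

Seen this way, the reviewer’s two concerns become clear: the value of reg_weight is unmotivated, and a metric that measures similarity to ClickMe maps will naturally favor the model trained with this very penalty.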

Putting it all together

After explaining how I went through this paper, it is now time to put it all together into a review ready to be submitted on the conference website. As required by many conferences, I start with a summary of the paper.

Image by Author.

Note that the summary is very important, as it shows the authors that you understood their work. Then, I list its strengths and weaknesses.

Image by Author.

I find it crucial to give some positive feedback, even if I plan to suggest rejecting the paper in the end. This shows the authors which parts of their work were appreciated by the reviewers. I then continue with several detailed comments.

Image by Author.
Image by Author.
Image by Author.

You may note that the review’s contents are just the remarks I wrote down while reading the paper. Usually, I produce a first draft of a review in around three hours and then go back to the paper at least twice before the deadline to make sure that I didn’t miss anything.

What do other reviewers say?

The good thing about how the reviewing process works nowadays is that you can often see the reviews for a given paper once it has been accepted or rejected. In the case of this submission, you can check the reviews here. After reading them, you may notice that the other reviewers raise concerns similar to those I mention in my review, namely: 1) a lack of motivation/justification for many design choices and 2) only qualitative results for interpretability. Note also that, like me, Reviewer 2 admits to not being an expert in attention mechanisms for DCNs and gives a confidence score of 3/5 to indicate this to the area chair. This is very important, as an unknowledgeable reviewer with high confidence is a nightmare for both the authors and the ACs. It goes the other way around too: if you review a paper from your narrow area of expertise, you should clearly indicate it so that the AC is able to identify the most informative reviews.

What do the authors do then?

I reviewed the first submitted version of this paper on purpose so that you can compare it with the camera-ready version submitted by the authors once their paper was accepted. You may notice several differences compared to the first version: the title was changed to “Learning what and where to attend”, as suggested by Reviewer 1, and many details were added throughout the text to make the paper clearer, following the reviewers’ remarks (you can see this in the diff file between the final and original versions). Overall, it shows that your duty as a reviewer is not only to criticize somebody else’s work but also to help them improve it with your feedback.

Losing sight of this last point is what I see as a major source of bad reviews.

A bad reviewer often sees themselves not as a peer of the authors, with whom they want to advance the state of research in their area, but as an ultimate (and sometimes superior) referee who is there to judge others’ work.

The first approach to reviewing takes time and requires patience and more than a pinch of goodwill. The second requires none of these and leads to a destructive, half-random reviewing process in which it can take years for important contributions to actually be published. Luckily, however, it is up to all of us to choose how we want it to be in the end.

Afternote

This article explains my approach to reviewing papers, but I am not the highest authority on this matter, and I do not claim that it is the only right way to do it. There may be other opinions on what a good review should look like, as well as people who find my reviews bad and uninformative. Also, there are different types of papers, and reviewing a theoretical research paper may be very different from reviewing an applied one. The goal of this article was to show one possible way of doing it, hoping that it can be helpful for those who find it suits them personally.

P.S. Thanks to Quentin Bouniot and Sofiane Dhouib for proofreading this article.
