
How AI could save a pillar of science

Peer review is a human job, but we may need the aid of the machine

Image from Patrick Tomasso at unsplash.com

Many researchers and professors are pushing for change. The discussion, however, is slowed by several open questions: how do we assess the quality of peer review? Can we do it objectively, or measure it quantitatively?

A recently published scientific article shows that this is possible, and possible at scale, using machine learning models. This post explores how the researchers analyzed peer review, why it matters, and what interesting prospects open up for an unexpected new application of artificial intelligence.

Peer review: a pillar of modern science

The pillar of peer review in the sea of science. Image created by the author using the open-source Stable Diffusion code.

Peer review is considered one of the pillars of modern science. Every article and every research proposal is submitted to the judgment of “peers”: researchers and/or professors who evaluate the merit of the manuscript. Today, peer review is routinely used in all disciplines (from biology to medicine, from physics to computer science) to decide whether a manuscript can be accepted for publication in a journal or conference.

In addition, peer review is also used to evaluate funding proposals, career advancement, and so on. There are different variants, but in essence an article is judged by anonymous reviewers who provide comments on the manuscript or proposal.

However, the system is not without flaws, as shown by the retraction of COVID-19 papers, the emergence of predatory journals and conferences, and so on. This is only a brief introduction; if you are interested in the topic, I have outlined the benefits, problems, and proposals for improving peer review in another article (link below).

How Science Contribution Has Become a Toxic Environment

On one hand, several questions remain open about how to judge the quality of an article, or whether it is possible to use an algorithm to evaluate a submitted manuscript. On the other hand, how do we identify the determinants and characteristics of high-quality peer review? Anna Severin and colleagues recently tried to answer the latter question with the help of artificial intelligence.

Evaluating peer review with AI

Image by Tingey Injury Law Firm at https://unsplash.com/

The dataset

The confidential nature of many peer review reports and the lack of databases and tools for assessing their quality have hampered larger-scale research on peer review. – Original article

As noted by the authors, until a few years ago peer review was a private discussion between reviewers, the editor (or conference committee), and the authors. Today, however, the Publons site collects millions of reviews that can be downloaded and used as a dataset. The site gathers reviews conducted by scholars around the world, along with their metadata (currently about 15 million reviews from roughly 1 million scholars).

For the analysis, the authors downloaded 5 million reviews, using the following criteria:

  • They excluded Physics, Space Science, and Mathematics, since reviews in these fields contain many mathematical formulae, which are difficult to categorize
  • They selected only medical and life science journals
  • They included only verified pre-publication reviews (10,000 in total)
  • They divided the journals into 10 equally sized groups based on impact factor and sampled 1,000 reviews from each group (a sketch of this sampling step follows the list)
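As a rough illustration of that stratified sampling step, here is a minimal sketch in Python; the data are synthetic stand-ins for the Publons download, and all column names are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the Publons download: one row per review, with the
# impact factor (jif) of the journal the review was written for.
reviews = pd.DataFrame({
    "review_id": np.arange(100_000),
    "jif": rng.lognormal(mean=1.0, sigma=0.8, size=100_000),
})

# Cut the reviews into 10 equally sized impact-factor groups (deciles)...
reviews["jif_group"] = pd.qcut(reviews["jif"], q=10, labels=False)

# ...and draw 1,000 reviews from each group, 10,000 in total.
sample = reviews.groupby("jif_group").sample(n=1_000, random_state=42)
print(sample["jif_group"].value_counts())  # 1,000 reviews per group
```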

What is the impact factor? Since the quality of a peer review is hard to define, the authors selected the journal impact factor as a proxy measure. The impact factor was originally developed to help librarians decide which journals to buy: it is simply the ratio between the number of citations a journal’s articles receive in a given year and the number of articles the journal published in the previous two years. This simple metric is often treated as synonymous with the quality of a journal.
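Written out, the two-year impact factor for a given year $Y$ is:

$$
\mathrm{JIF}_Y = \frac{\text{citations received in year } Y \text{ by items published in } Y-1 \text{ and } Y-2}{\text{citable items published in } Y-1 \text{ and } Y-2}
$$

For example, a journal that published 200 articles in 2020–2021 which received 1,000 citations during 2022 has a 2022 impact factor of 1,000 / 200 = 5.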

In theory, one would expect that a journal with a higher impact factor (or a conference with a lower acceptance rate) would carry out a more thorough peer review and selection of submitted manuscripts. The article addresses exactly that.

The Model and training

The authors selected 2,000 sentences from the reviews to be labeled. They defined eight categories and manually scored whether each sentence fell into each of them. The categories covered two macro-areas:

We focused on thoroughness (whether sentences could be categorized as commenting on materials and methods, presentation, results and discussion, or the paper’s importance), and helpfulness (if a sentence related to praise or criticism, provided examples or made improvement suggestions). – The authors in an interview with Nature
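In practice, this amounts to sentence-level, multi-label annotation: each sentence gets a binary flag for each of the eight categories. A minimal sketch of such a scheme (the category names here are my paraphrase, not the paper’s exact labels):

```python
# Two macro-areas, four categories each, as described by the authors.
THOROUGHNESS = ["materials_and_methods", "presentation", "results_and_discussion", "importance"]
HELPFULNESS = ["praise", "criticism", "examples", "improvement_suggestions"]

# Each sentence is scored independently against all eight categories.
sentence = "Please report confidence intervals for the estimates in Table 2."
labels = {category: 0 for category in THOROUGHNESS + HELPFULNESS}
labels["materials_and_methods"] = 1    # comments on methods/statistics
labels["improvement_suggestions"] = 1  # suggests a concrete improvement
```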

After that, they trained a Naïve Bayes classifier and evaluated it with five-fold cross-validation. They computed several evaluation metrics and identified which words were most important for classifying each category. In addition, they used a series of linear mixed-effects models to study the association between review characteristics and journal impact factor.
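The paper’s own code is not shown here, but the pipeline it describes maps naturally onto scikit-learn. A minimal, self-contained sketch with toy data (one binary classifier per category, bag-of-words features):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the ~2,000 manually labeled sentences.
sentences = [
    "The methods section lacks detail on sample preparation.",
    "Figure 2 is beautifully presented and easy to read.",
    "Consider adding a power analysis to justify the sample size.",
    "The discussion overstates the clinical relevance of the findings.",
] * 50
labels = [1, 0, 1, 0] * 50  # 1 = sentence comments on materials and methods

# Bag-of-words features + Naive Bayes, evaluated with five-fold cross-validation.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(clf, sentences, labels, cv=5)
print(f"5-fold accuracy: {scores.mean():.2f}")

# The fitted log-probabilities reveal which words drive each category.
clf.fit(sentences, labels)
vocab = clf.named_steps["countvectorizer"].get_feature_names_out()
log_probs = clf.named_steps["multinomialnb"].feature_log_prob_
top = np.argsort(log_probs[1] - log_probs[0])[-5:]
print("most 'methods-like' words:", vocab[top])
```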

As the authors note, the proportion of reviewers from Asia, Africa, and South America decreases from lower- to higher-impact-factor journals (the trend is reversed for Europe and North America).

Summary table of reviews and scientific journals reviewed. From the original article.

In general, the model’s predictions were not very far from the labels encoded by the human annotators, demonstrating the feasibility of using machine learning models to evaluate reviews.
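How close “not very far” is can be quantified; a common choice (my choice here, not necessarily the paper’s exact metric) is chance-corrected agreement plus per-class precision and recall:

```python
from sklearn.metrics import classification_report, cohen_kappa_score

# Human labels vs. model predictions for one category, on held-out sentences.
human = [1, 0, 1, 1, 0, 0, 1, 0]
model = [1, 0, 1, 0, 0, 0, 1, 1]

print(f"Cohen's kappa: {cohen_kappa_score(human, model):.2f}")  # agreement beyond chance
print(classification_report(human, model))  # precision, recall, F1 per class
```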

The authors also examined how review content relates to the impact factor: reviews for journals with a higher impact factor are generally longer and pay more attention to materials and methods, but offer fewer suggestions on how to improve the manuscript.

But these proportions varied widely even among journals with similar impact factors. I would say this suggests that the impact factor is a poor predictor of the ‘thoroughness’ and ‘helpfulness’ of reviews.
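The association analysis behind these findings relies, as described above, on linear mixed-effects models. A minimal sketch of how such a model could be set up (synthetic data, hypothetical column names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic stand-in: one row per review, with the share of its sentences that
# address materials and methods, the journal's impact-factor group, and a
# journal identifier.
df = pd.DataFrame({
    "prop_methods": rng.beta(2, 5, size=500),
    "jif_group": rng.integers(0, 10, size=500),
    "journal_id": rng.integers(0, 50, size=500),
})

# Fixed effect for the impact-factor group; random intercept per journal,
# since reviews of the same journal are not independent observations.
model = smf.mixedlm("prop_methods ~ C(jif_group)", data=df, groups=df["journal_id"])
print(model.fit().summary())
```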

From the original article: "Percentage point change in the proportion of sentences addressing thoroughness and helpfulness categories, relative to the lowest journal impact factor group."

Conclusions and parting thoughts

In general, several researchers have complained about the current state of peer review (both in the life sciences and in computer science). For years the paradigm has been: a high impact factor (or a low acceptance rate) equals high-quality peer review and articles.

In fact, the San Francisco Declaration on Research Assessment (DORA) proposes eliminating journal-based metrics from funding and promotion decisions. Utrecht University, for example, has formally abandoned the journal impact factor as a criterion in staff hiring and promotion decisions (the new policy instead weighs teamwork, leadership, public engagement, and open science practices).

This study also shows that peer review is not inclusive: several geographic areas are under-represented among the reviewers of high-impact-factor journals. Greater inclusion during peer review would allow fairer consideration of articles written by authors from those same geographic areas.

Furthermore, this study shows that even though the quality of comments is not tied to a journal’s impact factor, it is possible to use artificial intelligence to monitor peer review quality. The European Union and several committees are pushing for peer review reform and for peer review data to become more transparent. Studies like this one show how publishers could monitor reviewer comments or set thoroughness and helpfulness standards for reviews.

In the future, it is not inconceivable that reviewers will be aided by AI assistants during peer review. Indeed, the most prestigious conferences and journals often receive thousands of submissions and have little time to evaluate them, while the number of researchers available as reviewers dwindles. Using AI before submitting a review could enable a more objective, fair, and inclusive peer review.

If you have found it interesting:

You can look for my other articles, subscribe to get notified when I publish them, and connect with or reach me on LinkedIn. Thanks for your support!

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

Or feel free to check out some of my other articles on Medium:

A critical analysis of your dataset

Machine learning to tackle climate change

Machine unlearning: The duty of forgetting

A New BLOOM in AI? Why the BLOOM Model Can Be a Gamechanger

