Analyze Scientific Publications with E-utilities and Python

How to gather data about scientific literature and discover trends

Jozsef Meszaros
Towards Data Science


Staying on top of scientific research trends is essential for any scientist, science writer, or aspiring start-up founder. When it comes to searching the biomedical literature, most people turn to Google Scholar, PubMed, or their favorite reference manager. As you might expect, these public-facing search tools offer ease of use but sacrifice efficiency, control, and scalability, so they are generally not well suited to data science. Instead, you’ll want to use the E-utilities offered by the National Center for Biotechnology Information (NCBI).[1] The scientific articles available from NCBI are stored within the PubMed database, which primarily covers life science research but also includes a few journals related to chemistry and physics.[2]

For data science applications, web-based search engines and popular reference managers won’t cut it

The information on how to efficiently use NCBI for data science is strewn across various governmental and academic websites. Many of the resources were uploaded in the 1990s, focus on searching for genetic sequences rather than primary literature, and are accompanied by code examples written only in Perl. There was once a course on using NCBI’s E-utilities hosted at NIH during the early 2000s, for which PowerPoint “slidesets” (complete with obligatory clip-art illustrations) are available here. In this article, I will convey the lessons I have learned by compiling information from these various sources and through regular trial and error.

Note: There are many types of data searchable from within NCBI, including genetic sequences, phylogenetic trees, 3-D structures, and other information. This article will be focused exclusively on searching primary literature.

What kind of data science questions can be answered using NCBI?

There are many kinds of scientific questions you can answer using NCBI. You could, for instance, create a cluster-map of author-provided keywords to examine publishing trends over time, as I demonstrate at the end of this article. You could also build LLMs and NLP models from the abstracts and text of publications to help make connections across the literature. Here are the three steps I will cover in the article:

(1) querying the databases,

(2) returning results for multiple publications, and

(3) retrieving similar articles and full text versions of articles.

At the end, I will provide a fully documented code example showing how to carry out these three steps for a relatively large corpus (~10,000 articles), along with an accompanying data visualization.

Querying NCBI databases

To query an NCBI database effectively, you’ll want to learn about the relevant E-utilities, define your search fields, and choose your query parameters, which control how results are returned (to your browser or, in our case, to the Python code we’ll use to query the databases).

Four most useful E-utilities

There are nine E-utilities available from NCBI, all implemented as server-side FastCGI programs. This means you will access them by creating URLs that end in .fcgi, with query parameters specified after a question mark and separated by ampersands (a small helper for building such URLs is sketched after the list below). All of them, except for EFetch, can return either XML or JSON output.

  • ESearch generates a list of ID numbers that meet your search query

The following E-Utilities can be used with one or more ID numbers:

  • ESummary returns the journal, author list, grants, dates, references, and publication type
  • EFetch (**XML only**) returns everything ESummary provides, plus the abstract, the grants used in the research, the authors’ institutions, and the MeSH keywords
  • ELink provides a list of related citations, ranked by a computed similarity score, as well as a link to the published item itself [your gateway to the full text of the article]
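
Since every E-utility call is just a URL with the same basic shape, it can be convenient to wrap the construction in a small helper. Below is a minimal sketch (the function build_eutils_url and its parameter names are my own, not part of NCBI’s tooling); note that urlencode will percent-encode brackets and plus signs, so for boolean queries you may prefer to build the term string by hand, as I show later in the article.

from urllib.parse import urlencode

EUTILS_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'

def build_eutils_url(utility, **params):
    # utility is one of 'esearch', 'esummary', 'efetch', 'elink', 'einfo', ...
    # params become the key=value pairs after the question mark, joined by ampersands
    return f'{EUTILS_BASE}/{utility}.fcgi?{urlencode(params)}'

# For example, the myoglobin ESearch query used later in this article:
print(build_eutils_url('esearch', db='pubmed', term='myoglobin[mesh]',
                       retmode='json', retmax=50))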

NCBI hosts 38 databases across its servers, covering a variety of data that goes beyond literature citations. To get a complete list of the current databases, you can call EInfo without any search terms:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi

Each database varies in how it can be accessed and the information it returns. For our purposes, we’ll focus on the pubmed and pmc databases, because these are where scientific literature is searched and retrieved.
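
As a quick sanity check, here is one way you might pull that list down in Python. This is a sketch assuming the JSON retmode, where (as I understand the layout) the database names sit under einforesult → dblist:

import json
import urllib.request

einfo_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=json'
einfo = json.loads(urllib.request.urlopen(einfo_url).read().decode('utf-8'))

# The JSON output nests the database names under einforesult -> dblist
db_list = einfo['einforesult']['dblist']
print(len(db_list), 'databases, for example:', db_list[:5])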

The two most important things to learn about searching NCBI are search fields and outputs. The search fields are numerous and will depend on the database. The outputs are more straightforward and learning how to use the outputs is essential, especially for doing large searches.

Search fields

You won’t be able to truly harness the potential of E-utilities without knowing about the available search fields. You can find a full list of these search fields on the NLM website along with a description of each, but for the most accurate list of search terms specific to a database, you’ll want to parse your own XML list using this link:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed

with the db flag set to the database (we will use pubmed for this article, but literature is also available through pmc).
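
For example, you could parse that XML with BeautifulSoup and list each field’s abbreviation and description. This is a sketch assuming the einfo XML wraps each searchable field in a <Field> tag with <Name> and <Description> children:

from bs4 import BeautifulSoup
import urllib.request

einfo_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed'
einfo_xml = urllib.request.urlopen(einfo_url).read().decode('utf-8')
einfo_bs = BeautifulSoup(einfo_xml, features="xml")

# Each <Field> element describes one searchable field of the pubmed database
for field in einfo_bs.find_all('Field'):
    print(field.find('Name').text, '-', field.find('Description').text)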

A list of search fields for querying PubMed MEDLINE records. (Source: https://www.nlm.nih.gov/bsd/mms/medlineelements.html)

One especially useful search field is the Medical Subject Headings (MeSH).[3] Indexers, who are experts in the field, maintain the PubMed database and assign MeSH terms to reflect the subject matter of journal articles as they are published; each indexed publication is typically described by 10 to 12 carefully selected MeSH terms. If no search fields are specified, the query is executed against every searchable field in the database you queried.[4]

Query parameters

Each of the E-utilities accepts multiple query parameters through the URL, which you can use to control the type and amount of output returned by a query. This is where you can set the number of search results retrieved or the dates searched. Here is a list of the more important parameters:

Database parameter:

  • db should be set to the database you are interested in searching — pubmed or pmc for scientific literature

Date parameters: You can get finer control over dates by using search fields (for example, [pdat] for the publication date), but the date parameters below provide a more convenient way to constrain results.

  • reldate sets the number of days to search, counted back from the current date; set reldate=1 for the most recent day
  • mindate and maxdate specify dates in the format YYYY/MM/DD, YYYY, or YYYY/MM (a query must contain both the mindate and maxdate parameters)
  • datetype sets the type of date used when filtering by date; the options are ‘mdat’ (modification date), ‘pdat’ (publication date), and ‘edat’ (Entrez date)

Retrieval parameters:

  • rettype the type of information to return (for literature searches, use the default setting)
  • retmode the format of the output (XML is the default, though all E-utilities except EFetch also support JSON)
  • retmax the maximum number of records to return; the default is 20 and the maximum value is 10,000 (ten thousand)
  • retstart given a list of hits for a query, the index of the first record to return (useful when your search exceeds the ten-thousand maximum; see the paging sketch below)
  • cmd only relevant to ELink; specifies whether to return the IDs of similar articles or URLs to full texts
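
To illustrate how retmax and retstart work together, here is a rough paging sketch; the query term, the chunk size of 500, and the variable names are arbitrary choices of mine:

import json
import time
import urllib.request

base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi' + \
       '?db=pubmed&term=myoglobin[mesh]&retmode=json&retmax=500'

all_ids, retstart = [], 0
while True:
    page_url = f'{base}&retstart={retstart}'
    page = json.loads(urllib.request.urlopen(page_url).read().decode('utf-8'))
    ids = page['esearchresult']['idlist']
    all_ids.extend(ids)
    retstart += len(ids)
    if not ids or retstart >= int(page['esearchresult']['count']):
        break
    time.sleep(0.34)  # stay under roughly three requests per second

print(f'Retrieved {len(all_ids)} IDs')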

Use Python to execute queries and store results

Once we know about the E-Utilities, have chosen our search fields, and decided upon query parameters, we’re ready to execute queries and store the results — even for multiple pages.

While you don’t specifically need to use Python to use the E-utilities, it does make it much easier to parse, store, and analyze the results of your queries. Here’s how to get started on your data science project.

Let’s say you want to search MeSH terms for “myoglobin” between 2022 and 2023. You’ll set retmax to 50 for now, but remember that the maximum is 10,000 and that you can query at a rate of up to three requests per second.

import urllib.request

search_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi' + \
             '?db=pubmed' + \
             '&term=myoglobin[mesh]' + \
             '&mindate=2022' + \
             '&maxdate=2023' + \
             '&retmode=json' + \
             '&retmax=50'

link_list = urllib.request.urlopen(search_url).read().decode('utf-8')
link_list
The output of the esearch query from above.

The results are returned as a list of IDs, which can be used in a subsequent search within the database you queried. Note that “count” shows there are 154 results for this query, which is useful if you just want a total count of publications for a certain set of search terms. If you wanted to return the IDs for all of the publications, you’d set the retmax parameter to the count, in this case 154. In general, I set this to a very high number so I can retrieve all of the results and store them.

Boolean searching is easy with PubMed: it only requires adding +OR+, +NOT+, or +AND+ between search terms in the URL. For example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=CEO[cois]+OR+CTO[cois]+OR+CSO[cois]&mindate=2022&maxdate=2023&retmax=10000

These search strings can be constructed using Python, as shown in the sketch below. In the following steps, we’ll parse the results using Python’s json package to get the IDs for each of the publications returned. The IDs can then be joined into a string, and this string of IDs can be passed to the other E-utilities to return information about the publications.
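
For instance, the boolean term from the URL above could be assembled like this before being dropped into an ESearch URL (the roles list is just an example):

# Join several field-qualified terms with the +OR+ operator PubMed expects
roles = ['CEO', 'CTO', 'CSO']
term = '+OR+'.join(f'{role}[cois]' for role in roles)

search_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi' + \
             f'?db=pubmed&term={term}&mindate=2022&maxdate=2023&retmode=json&retmax=10000'
print(search_url)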

Use ESummary to return information about publications

The purpose of ESummary is to return data that you might expect to see in a paper’s citation (date of publication, page numbers, authors, etc). Once you have a result in the form of a list of IDs from ESearch (in the step above), you can join this list into a long URL.

The limit for a URL is 2048 characters, and each publication’s ID is 8 characters long, so to be safe you should split your list of IDs into batches of 250 if it contains more than 250 IDs. See my notebook at the bottom of the article for an example.
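
A simple way to do this batching is sketched below, using numpy’s array_split as in the full example at the end of the article; the placeholder ID list is purely for illustration:

from numpy import array_split

# id_list_full would normally come from an ESearch result, e.g.
# id_list_full = json.loads(link_list)['esearchresult']['idlist']
id_list_full = [str(37000000 + i) for i in range(1200)]   # hypothetical IDs for illustration

n_batches = len(id_list_full) // 250 + 1
batches = [','.join(batch) for batch in array_split(id_list_full, n_batches)]
# Each element of batches is a comma-separated ID string short enough for one E-utility URL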

The results from an ESummary are returned in JSON format and can include a link to the paper’s full-text:

import json
result = json.loads( link_list )
id_list = ','.join( result['esearchresult']['idlist'] )

summary_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id={id_list}&retmode=json'

summary_list = urllib.request.urlopen(summary_url).read().decode('utf-8')

We can again use json to parse summary_list. When using the json package, you can browse the fields of each individual article using summary['result'][<id as a string>], as in the example below:

summary = json.loads( summary_list )
summary['result']['37047528']

We can create a dataframe to capture the ID for each article along with the name of the journal, the publication date, title of the article, a URL for retrieving the full text, as well as the first and last author.

import re
import pandas as pd

uid = [ x for x in summary['result'] if x != 'uids' ]
journals = [ summary['result'][x]['fulljournalname'] for x in summary['result'] if x != 'uids' ]
titles = [ summary['result'][x]['title'] for x in summary['result'] if x != 'uids' ]
first_authors = [ summary['result'][x]['sortfirstauthor'] for x in summary['result'] if x != 'uids' ]
last_authors = [ summary['result'][x]['lastauthor'] for x in summary['result'] if x != 'uids' ]
links = [ summary['result'][x]['elocationid'] for x in summary['result'] if x != 'uids' ]
pubdates = [ summary['result'][x]['pubdate'] for x in summary['result'] if x != 'uids' ]

# Convert "doi: 10.xxxx/yyyy" strings into resolvable URLs
links = [ re.sub(r'doi:\s', 'http://dx.doi.org/', x) for x in links ]
results_df = pd.DataFrame( {'ID':uid, 'Journal':journals, 'PublicationDate':pubdates, 'Title':titles, 'URL':links, 'FirstAuthor':first_authors, 'LastAuthor':last_authors} )

Below is a list of all the different fields that ESummary returns so you can make your own database (a sketch for handling records with missing fields follows the list):

'uid','pubdate','epubdate','source','authors','lastauthor','title',
'sorttitle','volume','issue','pages','lang','nlmuniqueid','issn',
'essn','pubtype','recordstatus','pubstatus','articleids','history',
'references','attributes','pmcrefcount','fulljournalname','elocationid',
'doctype','srccontriblist','booktitle','medium','edition',
'publisherlocation','publishername','srcdate','reportnumber',
'availablefromurl','locationlabel','doccontriblist','docdate',
'bookname','chapter','sortpubdate','sortfirstauthor','vernaculartitle'
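
Not every record carries every one of these fields, so when you assemble your own table it can be safer to use dict.get with a default value instead of direct indexing. Here is a small sketch along the lines of the DataFrame built above (the field selection is just an example), reusing the summary dictionary parsed earlier:

import pandas as pd

# Skip the 'uids' bookkeeping entry and keep one dictionary per article
records = [ v for k, v in summary['result'].items() if k != 'uids' ]

wanted = ['uid', 'fulljournalname', 'pubdate', 'title', 'sortfirstauthor', 'lastauthor', 'elocationid']
rows = [ { field: rec.get(field, '') for field in wanted } for rec in records ]

results_df_safe = pd.DataFrame(rows)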

Use EFetch when you want abstracts, keywords, and other details (XML output only)

We can use EFetch to return similar fields as ESummary, with the caveat that the result is returned in XML only. There are several interesting additional fields available through EFetch, including: the abstract, author-selected keywords, the Medical Subject Headings (MeSH terms), grants that sponsored the research, conflict of interest statements, a list of chemicals used in the research, and a complete list of all the references cited by the paper. Here’s how you would use BeautifulSoup to obtain some of these items:

from bs4 import BeautifulSoup
import lxml  # parser used by BeautifulSoup's "xml" feature
import urllib.request
import pandas as pd

abstract_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={id_list}'
abstract_ = urllib.request.urlopen(abstract_url).read().decode('utf-8')
abstract_bs = BeautifulSoup(abstract_, features="xml")

articles_iterable = abstract_bs.find_all('PubmedArticle')

# Abstracts (not every record has one, so fall back to an empty string)
abstract_texts = [ x.find('AbstractText').text if x.find('AbstractText') is not None else '' for x in articles_iterable ]

# Conflict of Interest statements
coi_texts = [ x.find('CoiStatement').text if x.find('CoiStatement') is not None else '' for x in articles_iterable ]

# MeSH terms
meshheadings_all = list()
for article in articles_iterable:
    if article.find('MeshHeadingList') is not None:
        result = article.find('MeshHeadingList').find_all('MeshHeading')
        meshheadings_all.append( [ x.text for x in result ] )
    else:
        meshheadings_all.append( [] )

# Reference lists
references_all = list()
for article in articles_iterable:
    if article.find('ReferenceList') is not None:
        result = article.find('ReferenceList').find_all('Citation')
        references_all.append( [ x.text for x in result ] )
    else:
        references_all.append( [] )

results_table = pd.DataFrame( {'COI':coi_texts, 'Abstract':abstract_texts, 'MeSH_Terms':meshheadings_all, 'References':references_all} )

Now we can use this table to search abstracts and conflict of interest statements, or to make visuals that connect different fields of research using MeSH headings and reference lists. There are, of course, many other tags returned by EFetch that you could explore; here’s how you can see them all using BeautifulSoup:

efetch_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={id_list}'
efetch_result = urllib.request.urlopen( efetch_url ).read().decode('utf-8')
efetch_bs = BeautifulSoup(efetch_result, features="xml")

# Print the name of every tag that appears in the EFetch XML
tags = efetch_bs.find_all()

for tag in tags:
    print(tag.name)

Using ELink to retrieve similar publications and full-text links

You may want to find articles similar to the ones returned by your search query. These articles are grouped according to a similarity score using a probabilistic topic-based model.[5] To retrieve the similarity scores for a given ID, you must pass cmd=neighbor_score in your call to ELink. Here’s an example for one article:

import urllib.request
import json
import pandas as pd

id_ = '37055458'
elink_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=neighbor_score'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')

elinks_json = json.loads( elinks )

# Collect the ID and similarity score of each related article
ids_ = []
score_ = []
all_links = elinks_json['linksets'][0]['linksetdbs'][0]['links']
for link in all_links:
    ids_.append( link['id'] )
    score_.append( link['score'] )

pd.DataFrame( {'id':ids_, 'score':score_} )

The other function of ELink is to provide full-text links to an article based on its ID, which can be returned if you pass cmd=prlinks to ELink instead.

If you wish to access only those full-text links that are free to the public, you will want to use links that contain “pmc” (PubMed Central). Accessing articles behind a paywall may require a subscription through a university; before downloading a large corpus of full-text articles from behind a paywall, you should consult with your organization’s librarians.

Here is a code snippet of how you could retrieve the links for a publication:

id_ = '37055458'
elink_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')

elinks_json = json.loads( elinks )

[ x['url']['value'] for x in elinks_json['linksets'][0]['idurllist'][0]['objurls'] ]

You can also retrieve links for multiple publications in one call to ELink, as I show below:

id_list = '37055458,574140'
elink_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=pubmed&id={id_list}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')

elinks_json = json.loads( elinks )

# Print every full-text URL for each ID in the list
urls_ = elinks_json['linksets'][0]['idurllist']
for url_ in urls_:
    for x in url_['objurls']:
        print( url_['id'], x['url']['value'] )
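
If you only want the links that point to free full text, one approach (following the advice above) is to keep just the URLs that mention PMC; whether a PMC version exists depends on the article. Here is a small sketch reusing elinks_json from the snippet above:

# Keep only links whose URL mentions PMC (PubMed Central)
pmc_links = []
for url_ in elinks_json['linksets'][0]['idurllist']:
    for obj in url_['objurls']:
        link = obj['url']['value']
        if 'pmc' in link.lower():
            pmc_links.append( (url_['id'], link) )

pmc_links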

Example data visualization: Scientific publications from C-suite authors

Occasionally, a scientific publication is authored by someone who is a CEO, CSO, or CTO of a company. With PubMed, we have the ability to analyze the latest life science industry trends. Conflict of interest statements, which were introduced as a search field in PubMed in 2017,[6] offer a lens into which author-provided keywords appear in publications where a C-suite executive is disclosed as an author; in other words, the keywords the authors chose to describe their findings. To carry out this analysis, simply include CEO[cois]+OR+CSO[cois]+OR+CTO[cois] as the search term in your URL, retrieve all of the results returned, and extract the keywords from the resulting XML output for each publication. Each publication contains between 4 and 8 keywords. Once the corpus is generated, you can quantify keyword frequency per year within the corpus as the number of publications in a year specifying a keyword, divided by the total number of publications for that year.

For example, if 10 publications list the keyword “cancer” and there are 1,000 publications that year, the keyword frequency would be 0.01. Using the seaborn clustermap module with these keyword frequencies, you can generate a visualization in which darker bands indicate a larger keyword frequency for a given year (I have dropped COVID-19 and SARS-CoV-2 from the visualization, as both were, predictably, represented at frequencies far greater than 0.05). Each year, approximately 1,000 to 1,500 papers were returned.

Clustermap of author-specified keyword frequencies for publications with a C-suite author listed, generated by the author using Seaborn’s clustermap module.

From this visualization, several insights about the corpus of publications with C-suite authors listed become clear. First, one of the most distinct clusters (at the bottom) contains keywords that have been strongly represented in the corpus for the past five years: cancer, machine learning, biomarkers, and artificial intelligence, just to name a few. Clearly, industry is heavily active and publishing in these areas. A second cluster, near the middle of the figure, shows keywords that disappeared from the corpus after 2018, including physical activity, public health, children, mass spectrometry, and mhealth (mobile health). This is not to say that these areas are no longer being developed in industry, only that the publication activity has slowed. Looking at the bottom right of the figure, you can pick out terms that have appeared more recently in the corpus, including liquid biopsy and precision medicine, which are indeed two very “hot” areas of medicine at the moment. By examining the publications further, you could extract the names of the companies involved and other information of interest. Below is the code I wrote to generate this visual:

import pandas as pd
import time
from bs4 import BeautifulSoup
import seaborn as sns
from matplotlib import pyplot as plt
import itertools
from collections import Counter
from numpy import array_split
from urllib.request import urlopen

class Searcher:
    # Any instance of Searcher searches for the given terms and stores the results on a per-year basis
    def __init__(self, start_, end_, term_, **kwargs):
        self.name_ = 'searcher'
        self.description_ = 'searcher'
        self.duration_ = end_ - start_
        self.start_ = start_
        self.end_ = end_
        self.term_ = term_
        self.search_results = list()
        self.count_by_year = list()
        self.ids = list()
        self.options = ''

        # Parse keyword arguments
        if 'count' in kwargs and kwargs['count'] == 1:
            self.options = 'rettype=count'

        if 'retmax' in kwargs:
            self.options = f'retmax={kwargs["retmax"]}'

        if 'run' in kwargs and kwargs['run'] == 1:
            self.do_search()
            self.parse_results()

    def do_search(self):
        datestr_ = [self.start_ + x for x in range(self.duration_)]
        for year in datestr_:
            this_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi' + \
                       f'?db=pubmed&term={self.term_}' + \
                       f'&mindate={year}&maxdate={year + 1}&{self.options}'
            print(this_url)
            self.search_results.append(
                urlopen(this_url).read().decode('utf-8'))
            time.sleep(.33)   # stay under the rate limit

    def parse_results(self):
        for result in self.search_results:
            xml_ = BeautifulSoup(result, features="xml")
            self.count_by_year.append(xml_.find('Count').text)
            self.ids.extend([id_.text for id_ in xml_.find_all('Id')])

    def __repr__(self):
        return repr(f'Search PubMed from {self.start_} to {self.end_} with search terms {self.term_}')

    def __str__(self):
        return self.description_

# Create a list which will contain searchers, that retrieve results for each of the search queries
searchers = list()
searchers.append(Searcher(2022, 2023, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2021, 2022, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2020, 2021, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2019, 2020, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2018, 2019, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))

# Create a dictionary to store keywords for all articles from a particular year
keywords_dict = dict()

# Each searcher obtained results for a particular start and end year
# Iterate over searchers
for this_search in searchers:

    # Split the results from one search into batches for URL formatting
    chunk_size = 200
    batches = array_split(this_search.ids, len(this_search.ids) // chunk_size + 1)

    # Create a dict key for this searcher object based on the years of coverage
    this_dict_key = f'{this_search.start_}to{this_search.end_}'

    # Each value in the dictionary will be a list that gets appended with keywords for each article
    keywords_all = list()

    for this_batch in batches:
        ids_ = ','.join(this_batch)

        # Pull down the XML containing all the results in a batch
        abstract_url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={ids_}'

        abstract_ = urlopen(abstract_url).read().decode('utf-8')
        abstract_bs = BeautifulSoup(abstract_, features="xml")
        articles_iterable = abstract_bs.find_all('PubmedArticle')

        # Iterate over all of the articles in the batch
        # (find_all returns an empty list when an article has no keywords)
        for article in articles_iterable:
            result = article.find_all('Keyword')
            keywords_all.append([x.text for x in result])

        # Take a break between batches!
        time.sleep(1)

    # Once all the keywords are assembled for a searcher, add them to the dictionary
    keywords_dict[this_dict_key] = keywords_all

    # Print the key to track progress
    print(this_dict_key)

# Limit to words that appeared approx five times or more in any given year

mapping_ = {'2018to2019':2018,'2019to2020':2019,'2020to2021':2020,'2021to2022':2021,'2022to2023':2022}
global_word_list = list()

for key_, value_ in keywords_dict.items():
    Ntitles = len( value_ )
    flattened_list = list( itertools.chain(*value_) )

    flattened_list = [ x.lower() for x in flattened_list ]
    counter_ = Counter( flattened_list )
    words_this_year = [ ( item, frequency/Ntitles, mapping_[key_] ) for item, frequency in counter_.items() if frequency/Ntitles >= .005 ]
    global_word_list.extend(words_this_year)

# Plot results as clustermap

global_word_df = pd.DataFrame(global_word_list)
global_word_df.columns = ['word', 'frequency', 'year']
pivot_df = global_word_df.loc[:, ['word', 'year', 'frequency']].pivot(index="word", columns="year",
values="frequency").fillna(0)

pivot_df.drop('covid-19', axis=0, inplace=True)
pivot_df.drop('sars-cov-2', axis=0, inplace=True)

sns.set(font_scale=0.7)
plt.figure(figsize=(22, 2))
res = sns.clustermap(pivot_df, col_cluster=False, yticklabels=True, cbar=True)

Conclusion

After reading this article, you should be ready to go from crafting highly tailored search queries of the scientific literature all the way to generating data visualizations for closer scrutiny. While there are other more complex ways to access and store articles using additional features of the various E-utilities, I have tried to present the most straightforward set of operations that should apply to most use cases for a data scientist interested in scientific publishing trends. By familiarizing yourself with the E-utilities as I have presented here, you will go far toward understanding the trends and connections within scientific literature. As mentioned, there are many items beyond publications that can be unlocked through mastering the E-utilities and how they operate within the larger universe of NCBI databases.

To learn more about accessing NCBI, you can download the course materials for a set of NIH courses that were held up until 2008, or check out the references below. To stay up to date on changes to the E-utilities, including a potentially new API, you may want to sign up for the NCBI’s very 1990s-looking mailing list at this website. Lastly, searching arxiv.org for “PubMed” will return a handful of results that can motivate research projects using this data.

References

[1] https://www.nlm.nih.gov/bsd/difference.html#:~:text=MEDLINE%20is%20the%20largest%20subset,Journal%20Categories%20filter%20called%20MEDLINE

[2] Chapman D. Advanced search features of PubMed. J Can Acad Child Adolesc Psychiatry. 2009 Feb;18(1):58–9. PMID: 19270851; PMCID: PMC2651214.

[3] Sayers E. The E-utilities In-Depth: Parameters, Syntax and More. 2009 May 29 [Updated 2022 Nov 30]. In: Entrez Programming Utilities Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK25499/

[4] https://ftp.ncbi.nlm.nih.gov/pub/PowerTools/eutils/Oct.2007/slides/Lecture1.pdf

[5] Lin J, Wilbur WJ. PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics. 2007 Oct 30;8:423. doi: 10.1186/1471-2105-8-423. PMID: 17971238; PMCID: PMC2212667. Available from: https://pubmed.ncbi.nlm.nih.gov/17971238/

[6] https://library.mskcc.org/blog/2019/07/conflict-of-interest-statement-field-in-pubmed/
