When you embark on any research question, one of the first things you will do is a literature review. Unless you are already an expert in your field and know exactly where to look, you will likely be overwhelmed by the sheer number of papers on Google Scholar or PubMed that you need to sift through. The most logical thing to do is then to pick the most recent review paper and hope for the best. Alternatively, what if you could quickly scan 5000 abstracts, extract each paper's keywords, rank them, and organise them into themes?
Methods
PubMed scraping
First, we need a database from which to download our abstracts. Luckily, the NCBI E-utilities API gives access to all Entrez databases, including PubMed, PMC, Gene, Nuccore and Protein [1]. We can then write a pubmed_scrapping function that takes a search term and the number of documents to download, and returns the results ordered by their relevance to the term. For each document, it returns the title, MeshHeading, abstract, publication date, and PMID.
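The steps above can be sketched with the standard library alone. This is a minimal sketch and not the original implementation: the helper names search_pmids, fetch_xml and parse_records are hypothetical, only the publication year is extracted (an assumption, since some records carry a free-text date instead), and a real run should respect NCBI's request rate limits.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def search_pmids(term, retmax=100):
    """ESearch: return up to `retmax` PMIDs for `term`, sorted by relevance."""
    query = urllib.parse.urlencode({
        "db": "pubmed", "term": term, "retmax": retmax,
        "sort": "relevance", "retmode": "json",
    })
    with urllib.request.urlopen(f"{EUTILS}/esearch.fcgi?{query}") as resp:
        return json.load(resp)["esearchresult"]["idlist"]


def fetch_xml(pmids):
    """EFetch: download the PubMed XML records for a list of PMIDs (via POST)."""
    data = urllib.parse.urlencode(
        {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    ).encode()
    with urllib.request.urlopen(f"{EUTILS}/efetch.fcgi", data=data) as resp:
        return resp.read().decode("utf-8")


def parse_records(xml_text):
    """Extract PMID, title, abstract, MeSH descriptors and year per article."""
    records = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        title = citation.find("Article/ArticleTitle")
        # An abstract may be split into several labelled AbstractText parts.
        abstract = " ".join(
            "".join(part.itertext()).strip()
            for part in citation.iterfind("Article/Abstract/AbstractText")
        )
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": "".join(title.itertext()).strip() if title is not None else "",
            "abstract": abstract,
            "mesh": [d.text for d in
                     citation.iterfind("MeshHeadingList/MeshHeading/DescriptorName")],
            # None when the record has no <Year> element (assumption: year only).
            "year": citation.findtext("Article/Journal/JournalIssue/PubDate/Year"),
        })
    return records


def pubmed_scrapping(term, n_docs=100):
    """Search PubMed for `term` and return parsed records, most relevant first."""
    pmids = search_pmids(term, n_docs)
    return parse_records(fetch_xml(pmids)) if pmids else []
```

Calling pubmed_scrapping("neurogenesis", 1000) would then return a list of dictionaries, one per paper. For large downloads it is probably safer to fetch the PMIDs in smaller batches.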
BeCAS annotation
While we could just use the MeshHeadings to group our papers, we can go a step further and scan the abstracts and titles for additional important keywords that are too specific to be included in the MeshHeadings, such as a particular molecular pathway or the name of a model organism. Doing so will later allow us to subcategorise papers with similar themes, e.g. splitting psychiatric research papers into schizophrenia vs. autism spectrum disorders.
Here, we can use BeCAS, a biomedical concept recognition service and visualisation tool, and its API to annotate key biological terms (it also has a web interface) [2]. Essentially, BeCAS acts as a big dictionary: it performs part-of-speech tagging and recognises species, anatomical concepts, miRNAs, enzymes, chemicals, drugs, diseases, metabolic pathways, cellular components, biological processes, and molecular functions, drawing on multiple meta-sources such as UMLS and NCBI BioSystems.
Combining this with the output of the pubmed_scrapping function above, we can use the BeCAS API to extract further keywords.
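To make the "big dictionary" idea concrete, here is a toy stand-in for the annotation step. It does not call BeCAS and does not reproduce its API: the five-entry dictionary, the annotate function and the group names are all hypothetical, and the real service draws on far larger vocabularies.

```python
import re

# Hypothetical mini-dictionary: surface form -> concept group.
TOY_DICTIONARY = {
    "schizophrenia": "disease",
    "zebrafish": "species",
    "rat": "species",
    "wnt signalling": "pathway",
    "oligodendrocyte": "anatomy",
}


def annotate(text, dictionary=TOY_DICTIONARY):
    """Return sorted (term, group) pairs for dictionary terms found in `text`."""
    lowered = text.lower()
    hits = set()
    for term, group in dictionary.items():
        # \b stops "rat" from matching inside "migration";
        # the optional "s" accepts simple plurals.
        if re.search(rf"\b{re.escape(term)}s?\b", lowered):
            hits.add((term, group))
    return sorted(hits)


if __name__ == "__main__":
    print(annotate("Wnt signalling drives oligodendrocytes in zebrafish."))
```

Running the same function over each title and abstract gives a list of extra keywords per paper, which can then be merged with that paper's MeshHeadings.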
BioWordVec word embedding
Once we have all the necessary keywords (from MeshHeadings and the BeCAS API), we can combine them and convert them into a numerical vector that represents each document.
Here, we can use BioWordVec, a pretrained word embedding that contains over 2.7 million tokens and has been trained on a large body of PubMed abstracts and MeshHeadings [3]. Rather than training on individual words as in Word2Vec, the authors trained on subwords, which proves particularly useful for biomedical data, where compound words are common. Given that each word in BioWordVec is represented by a 200-dimensional vector, we can translate each keyword into its BioWordVec vector, and represent each document as the mean of these vectors.
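The averaging step can be sketched as follows. This is a sketch under stated assumptions: `embeddings` stands for any mapping from token to vector (the toy table below uses 3 dimensions instead of 200), multi-word keywords are split into tokens, and tokens missing from the table are skipped. None of these choices is confirmed by the original code.

```python
def document_vector(keywords, embeddings):
    """Return the mean vector of all keyword tokens found in `embeddings`.

    Returns None when no token is in the vocabulary.
    """
    vectors = []
    for keyword in keywords:
        # Assumption: multi-word keywords are looked up token by token.
        for token in keyword.lower().split():
            if token in embeddings:
                vectors.append(embeddings[token])
    if not vectors:
        return None
    # zip(*vectors) walks the vectors dimension by dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    # Toy 3-dimensional table standing in for the pretrained embedding.
    toy_embeddings = {
        "brain": [1.0, 0.0, 2.0],
        "myelination": [3.0, 2.0, 0.0],
    }
    print(document_vector(["Brain", "Myelination"], toy_embeddings))
```

With the real pretrained vectors, one option (an assumption about the file format) is to load them with gensim's KeyedVectors.load_word2vec_format(path, binary=True); the resulting object supports the same `in` and `[]` lookups used here.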
Results
To test this proof-of-concept, I looked at 5 PubMed search terms (gliogenesis, neuronal migration, neurogenesis, synaptogenesis and myelination), downloaded 1000 abstracts for each (5000 papers in total), and extracted their MeshHeadings and BeCAS annotations (Figure 1).

As expected, the most common terms are cellular processes related to neurogenesis and neuron migration (Figure 2). More interestingly, if we zoom in on the less frequently mentioned terms, we can see different model organisms, psychiatric diseases, and pathways, suggesting that we have managed to pull out many smaller sub-categories.

After converting these terms into vectors, we can try to group individual papers together. Using K-means clustering, I divided the 5000 papers into 10 clusters and looked at the most common terms in each.
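For readers who want to see the mechanics, here is a small pure-Python version of the clustering step followed by a per-cluster term count. It is a teaching sketch and not the code behind the figures: the function names are hypothetical, the initial centroids are simply sampled from the data, and in practice a library implementation such as scikit-learn's KMeans would be the usual choice.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=100, seed=0):
    """Lloyd's algorithm: return one cluster label (0..k-1) per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # start from k random documents
    labels = None
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # no change, so the algorithm has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_terms(doc_keywords, labels, n=3):
    """Return the `n` most frequent keywords within each cluster."""
    counts = {}
    for keywords, label in zip(doc_keywords, labels):
        counts.setdefault(label, Counter()).update(keywords)
    return {label: [term for term, _ in counter.most_common(n)]
            for label, counter in counts.items()}
```

Here `vectors` would hold one mean embedding per paper and `doc_keywords` the matching keyword lists, so top_terms(doc_keywords, kmeans(vectors, 10)) gives the kind of per-cluster summary discussed next.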

Ignoring the expected most common terms such as brain, myelination, and neurogenesis (and the horrible colour scheme) in Figure 3, you can see that each of the 10 clusters appears to contain terms that set it apart from the rest. For example, if we ask which diseases are most commonly found across the 5000 papers, we see schizophrenia in one cluster, depression in another, cancer in a third, stroke in a fourth and Alzheimer's in a fifth. This is particularly exciting because this simple result reflects the current landscape of neuroscience, where these diseases are major areas of research.
Nevertheless, given that we used an unsupervised technique, some issues with interpretability are to be expected. With more refinement in the choice of PubMed search terms and in the preprocessing of the keywords, there is certainly room for improvement.
Conclusion
While applying natural language processing to biomedical data is not a breakthrough, it is surprising that tools to aid the literature review process either do not exist or are not well known. With this simple proof-of-concept, we can combine existing APIs with a pretrained word embedding model to help with our research questions.
Finally, you can check out my code for this project below:
References
[1] Sayers, E. (2010) A General Introduction to the E-utilities. Entrez Programming Utilities Help
[2] Nunes, T., Campos, D., Matos, S., and Oliveira, J. L. (2013) BeCAS: Biomedical concept recognition services and visualization. Bioinformatics. 29, 1915–1916
[3] Zhang, Y., Chen, Q., Yang, Z., Lin, H., and Lu, Z. (2019) BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data. 6, 1–9