Document search with fragment embeddings

COVID-19 questions — a use case for improving sentence fragment search

Ajit Rajasekharan
Towards Data Science
13 min read · Apr 8, 2020


Figure 1. Embeddings-driven fragment search used to answer specific questions (left panel) as well as broader questions (right panel). The text fragments highlighted in yellow are document matches to the search input, obtained using BERT embeddings. The right panel is a sample of animals with literature evidence for the presence of coronavirus; the font size is a qualitative measure of reference counts in the literature. Bats (in general, and Chinese horseshoe bats specifically) and birds have been mentioned as sources of coronavirus: bats as the gene source of alpha and beta coronaviruses, and birds as the gene source of gammacoronaviruses and deltacoronaviruses. Zoonotic transmission of coronavirus from civet cats and pangolins (betacoronavirus) has also been reported. All the information above was obtained automatically using machine learning models, without human curation. For the broad question in the right panel, a bootstrap list was created by searching for the term “animals” and clustering the results in the neighborhood of Word2vec embeddings. This list was then filtered for biological entity types using unsupervised NER with BERT, and the filtered list was used to create the final list of animals with literature evidence captured in fragments as an extractive summary of the corresponding documents. The animal source of COVID-19 is not confirmed to date. Coronavirus illustration created at CDC.

TL;DR

Embeddings for sentence fragments harvested from a document can serve as extractive summary facets of that document and potentially accelerate its discovery, particularly when user input is a sentence fragment. These fragment embeddings not…
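To make the idea of fragments acting as search facets concrete, here is a minimal sketch of ranking stored fragments against a query by cosine similarity. It is not the article's implementation: the three-dimensional vectors are made-up stand-ins for real embeddings (which would come from a model such as BERT), and the names `FRAGMENT_INDEX`, `cosine`, and `search` are hypothetical.

```python
import math

# Hypothetical toy index: (document id, fragment text, embedding).
# The 3-d vectors are invented for illustration; real fragment
# embeddings would be produced by a model and have far more dimensions.
FRAGMENT_INDEX = [
    ("doc1", "bats as the gene source of alpha and beta coronaviruses", [0.9, 0.1, 0.0]),
    ("doc2", "birds as the gene source of gammacoronavirus", [0.1, 0.9, 0.0]),
    ("doc3", "zoonotic transmission from civet cats and pangolins", [0.0, 0.2, 0.9]),
]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=2):
    """Return the top_k (score, doc_id, fragment) tuples, best match first."""
    scored = [(cosine(query_vec, vec), doc_id, text) for doc_id, text, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Assumed query embedding for a fragment-like user input.
results = search([0.8, 0.2, 0.1], FRAGMENT_INDEX)
for score, doc_id, text in results:
    print(f"{score:.3f}  {doc_id}  {text}")
```

Each hit carries both the document id and the matching fragment, so the fragment can be shown as an extractive snippet of the document it came from.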
