Data for Change

CoronaXiv, a product we built at HackJaipur 2020, won the title of "Best ElasticSearch-based Product" from a pool of 350+ teams across the country. In this blog post, we explain our methodology for building CoronaXiv.
What is CoronaXiv?
CoronaXiv is an ElasticSearch-powered AI search engine that indexes the 60k+ research papers that have piled up in response to the coronavirus pandemic, with researchers publishing new papers every day as they try to understand the nature of the virus and find a cure for it. CoronaXiv is a web app (soon to be a PWA) built on the following tech stack:
- Flask
- PyTorch (BERT model)
- nltk
- Vue.js
- ElasticSearch
- Python3
- Heroku
As you can see, CoronaXiv is an amalgamation of many modern tools, frameworks, and models. This helped us get things working quickly and saved us from re-inventing the wheel! Our project is available here. To try CoronaXiv, visit http://www.coronaxiv2.surge.sh. (It is currently down since we exhausted our free tier of ElasticSearch, but we will be back up soon!)
What problem does CoronaXiv solve?
Researchers across the globe are working under varying lockdown restrictions: some are in labs, while others work from home. To assist them in their endeavor to defeat the pandemic, it would be really handy for any researcher to have a dedicated search engine for Covid-19 papers, with AI-recommended links to similar papers, so that their time is saved. Every second is precious in this battle against the global pandemic, and hence we built CoronaXiv, an ElasticSearch-powered AI search engine for research papers related to the coronavirus.
In the current scenario, one would run a Google search to look for research papers. However, more often than not, certain keywords yield results unrelated to the pandemic, and the experience is poor since the user has to go back to the results page every time they switch from one paper to another. With CoronaXiv, one can access the papers directly, with clustering visualizations that help the user understand how papers relate to each other, identify papers by keyword, or browse papers clustered by similar domains.

CoronaXiv is a search platform that lets researchers find the most relevant papers in the clutter of new Covid-19 research coming out daily. With over 60k research papers piled up to date, CoronaXiv indexes them on a variety of criteria and also lets researchers filter by publication date range, peer-review status, total views/citations, H-index, and more. We are currently working on adding more such filters as quality-of-life improvements.
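To make the filter idea concrete, here is a minimal sketch of what such a filtered query could look like with the ElasticSearch query DSL and the official Python client. The index name `papers` and the field names `excerpt`, `publish_date`, and `citations` are illustrative assumptions, not our exact mapping:

```python
# Hedged sketch: a keyword search restricted by a publication-date range
# and a minimum citation count. Index/field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

body = {
    "query": {
        "bool": {
            "must": [{"match": {"excerpt": "antiviral drug trial"}}],
            "filter": [
                {"range": {"publish_date": {"gte": "2020-01-01", "lte": "2020-06-30"}}},
                {"range": {"citations": {"gte": 10}}},
            ],
        }
    }
}

resp = es.search(index="papers", body=body, size=10)
for hit in resp["hits"]["hits"]:
    print(round(hit["_score"], 2), hit["_source"].get("title"))
```

The `filter` clauses narrow the result set without affecting relevance scoring, which is exactly what paper filters need.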
How did we build CoronaXiv?

The problem at hand was not trivial, and diving into it as one massive task would have been a recipe for failure. Hence, we broke our main goal down into several smaller goals (a modular approach). The steps in our project's development were:
- Pre-process the CORD-19 dataset available on Kaggle
- Obtain additional metadata from external APIs such as Altmetric and Scimago
- Generate word embeddings using CovidBERT (a pre-trained BERT model fine-tuned on medical literature)
- Reduce the embedding dimensions with Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) so the search engine runs faster
- Obtain cluster information and extract keywords from the corpus of 60k+ papers
- Integrate the final CSV of the research-paper corpus with ElasticSearch
- Build the front-end of the web app and integrate it with ElasticSearch
Steps 1–5 were all performed in a Jupyter notebook, which we plan to make public in the future; a condensed sketch of steps 3–5 appears below. Steps 6 and 7 are where the real story unfolds.
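This is a hedged, condensed sketch of steps 3–5, not our exact notebook code. The checkpoint name `gsarti/covidbert-nli` and the helper `load_cord19_excerpts` are illustrative assumptions, and the cluster/component counts are placeholders:

```python
# Condensed sketch of steps 3-5: embeddings -> PCA/t-SNE -> clusters/keywords.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

excerpts = load_cord19_excerpts()  # hypothetical helper: one excerpt per paper

# CovidBERT-style sentence embeddings (checkpoint name is an assumption).
model = SentenceTransformer("gsarti/covidbert-nli")
embeddings = model.encode(excerpts, show_progress_bar=True)

# PCA cuts the embedding dimension cheaply; t-SNE on top yields 2-D
# coordinates for the cluster visualizations on the site.
reduced = PCA(n_components=50).fit_transform(embeddings)
coords_2d = TSNE(n_components=2).fit_transform(reduced)

# Group papers into topical clusters (cluster count is a placeholder).
n_clusters = 20
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(reduced)

# Keywords per cluster: highest mean TF-IDF terms among its papers.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(excerpts)
terms = np.array(vectorizer.get_feature_names_out())
for c in range(n_clusters):
    scores = np.asarray(tfidf[labels == c].mean(axis=0)).ravel()
    print(c, terms[scores.argsort()[::-1][:5]])
```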
Challenges faced while building CoronaXiv
The pre-processing stage of the CORD-19 dataset took a long time, but we finally extracted the useful information to show for each paper and generated the embeddings that drive the AI decisions. The external APIs were extremely slow and, due to multiple failed requests, ate up a chunk of our hackathon time. We also had to trade off the information retained after PCA against the time needed to run the model, and we chose to prioritize speed.

For indexing based on keywords, we could have used basic TF-IDF for each query and matched keywords on the fly. Moreover, our server would be bottlenecked when handling multiple requests, since we have only one deep-learning model instance and it can serve only one request at a time. The result would have been an extremely slow implementation that consumed a lot of resources and time. This is when we came across ElasticSearch.
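For context, the naive approach we decided against would have looked roughly like this sketch: vectorize the whole corpus with TF-IDF and re-rank it inside the web server on every query:

```python
# Sketch of the rejected approach: per-query TF-IDF ranking in-process.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["excerpt of paper one ...", "excerpt of paper two ..."]  # 60k+ in reality
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(corpus)

def naive_search(query: str, top_k: int = 10):
    # Every query scores the entire corpus inside the web server itself,
    # which is exactly the bottleneck described above.
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    return scores.argsort()[::-1][:top_k]
```

ElasticSearch moves this work into a dedicated, distributed engine with inverted indices, so the Flask process only forwards queries.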

ElasticSearch is a distributed, open-source search and analytics engine for different kinds of textual data. With Elastic, one can perform efficient and extremely fast full-text search based on keywords, and combined with our CovidBERT embeddings, we can return the most relevant papers sorted in decreasing order of importance. A paper's abstract or excerpt summarizes the entire paper and what it discusses. Hence, we indexed the data on the excerpts of the papers, since the abstract is not available for all of them. Indexing this very compressed representation of each paper's contents sped up the retrieval process.
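One way to combine CovidBERT embeddings with ElasticSearch's keyword scoring is a `script_score` query over a `dense_vector` field (supported since ElasticSearch 7.3). The sketch below assumes the index mapping stores each paper's embedding in a field called `embedding`; our production query is simplified here:

```python
# Hedged sketch: re-score keyword matches by cosine similarity to the
# CovidBERT embedding of the query. Field/index names are assumptions.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")  # placeholder host
model = SentenceTransformer("gsarti/covidbert-nli")  # checkpoint name is an assumption

query_text = "vaccine immune response"
query_vector = model.encode([query_text])[0].tolist()

body = {
    "query": {
        "script_score": {
            "query": {"match": {"excerpt": query_text}},
            "script": {
                # +1.0 keeps the score non-negative, as ElasticSearch requires.
                "source": "cosineSimilarity(params.q, 'embedding') + 1.0",
                "params": {"q": query_vector},
            },
        }
    }
}

resp = es.search(index="papers", body=body)
for hit in resp["hits"]["hits"]:
    print(round(hit["_score"], 2), hit["_source"].get("title"))
```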
Honestly, we had never used ElasticSearch before and had no experience with it, so we started with the documentation. The well-written documentation guided us in making our server interact with ElasticSearch in an extremely efficient and concise manner. Since our stack was Flask on the backend and Vue.js on the front end, we decided to use the official Python package for ElasticSearch, which can be found here and is actively maintained by developers all over the world.
We created indices with this package; every time a request comes from the frontend, the server queries ElasticSearch for results matching a specific tag and serves them to the user. We also wrote a script to bulk-insert many documents into a particular index: it takes the credentials of your ElasticSearch account, creates an index, and adds multiple documents at a time.
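A minimal sketch of such a bulk-insert script, using the client's `helpers` module, is shown below; the host, credentials, CSV filename, and index name are all placeholders:

```python
# Hedged sketch of the bulk-insert script: create an index, then stream
# rows of the final CSV into it in one bulk call.
import csv
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(
    "https://your-cluster.example.com:9243",   # placeholder cluster URL
    http_auth=("elastic", "your-password"),    # placeholder credentials
)

es.indices.create(index="papers", ignore=400)  # 400 means it already exists

def generate_actions(csv_path):
    # Yield one bulk action per row of the research-paper corpus CSV.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {"_index": "papers", "_source": row}

success, _ = bulk(es, generate_actions("papers.csv"))  # placeholder filename
print(f"Indexed {success} documents")
```

Streaming the rows through a generator keeps memory flat even with a 60k-row CSV, since documents are sent in batches rather than loaded all at once.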
The End Product (or is it?)
Our final product was barely finished within the hackathon time frame. Thanks to the extremely kind and understanding organizers of HackJaipur 2020, the deadline was extended by 5 hours at the request of multiple participants, owing to the internet issues many people faced amidst the lockdown. A demo of the project is available below:
This is a very ambitious project for our team, and we hope to keep improving our website with new features in each release. Hence, our team looks forward to maintaining this website at least until the world is freed from this pandemic. We would love for the ElasticSearch team to help us keep this project on the free tier or to sponsor our ambitious project!

Our experience with ElasticSearch and HackJaipur 2020
Once we understood the documentation, connecting ElasticSearch to our project became extremely easy. ElasticSearch helped us scale our product, which otherwise would have remained a localhost prototype. We got the hang of using ElasticSearch very quickly, thanks to the demo workshop on ElasticSearch by Aravind Putrevu, Senior Developer Advocate at Elastic.
It was our team's first remote hackathon, and at a national level too! The week-long workshops were really engaging and exciting. Since this was a virtual hackathon, we communicated over WhatsApp and Telegram. We really enjoyed the unique experience that HackJaipur gave us, and we would like to give a huge shoutout to the entire organizing team of HackJaipur 2020 and to sponsors like Major League Hacking, ElasticSearch, GitHub, and others.

About the Developers
We are Team Stochastic. We make decisions on the run; with each epoch, we make our next decision. After all, everything in this world is stochastic! You can connect with our members through their profiles:
- Arghyadeep Das: GitHub, LinkedIn
- Nachiket Bhuta: GitHub, LinkedIn
- Neelansh Mathur: GitHub, LinkedIn
- Jayesh Nirve: GitHub, LinkedIn
Happy coding! 😄