
By: Petr Koráb (Zeppelin University in Friedrichshafen, Germany; Lentiamo, Prague) and David Štrba (Lentiamo, Prague)
The JSTOR database is one of the leading sources of research articles in more than 50 scientific disciplines. In the Data for Research section, researchers can access datasets about the articles and books in the library for use in research and teaching. Data available through the service include metadata, n-grams, and word counts for most articles, book chapters, research reports, and pamphlets on JSTOR. However, the output of a data request is not a simple .csv or .txt document but a set of XML files that need some processing and cleaning before the data can be used effectively. In R, the jstor package, released in mid-2020, made the whole process far simpler.
To make it easier for data scientists and researchers to access larger volumes of data, in this article I show the Python code for parsing the XML outputs, explain the process of collecting the data from the JSTOR Data for Research database, and show an application of this type of data.
Collecting data
Data for Research (DfR) enables manual requests of up to 25 000 files in one batch, delivered as external links to your mailbox. These files may include metadata about articles and books published in the database (article or book title, journal title, year of publication, references, etc.) and n-grams (tokenized texts of articles and books).
We might, for example, be interested in the frequency of articles focusing on data science, machine learning, and big data in five top journals in economics (American Economic Review, Econometrica, Quarterly Journal of Economics, Journal of Political Economy, and Review of Economic Studies) over the whole period covered by the database. To make such a request, the following queries are entered in the DfR search:
(machine learning) jcode:(amereconrevi OR econometrica OR revieconstud OR quarjecon OR jpoliecon)
(data science) jcode:(amereconrevi OR econometrica OR revieconstud OR quarjecon OR jpoliecon)
(big data) jcode:(amereconrevi OR econometrica OR revieconstud OR quarjecon OR jpoliecon)
These requests produce 576 articles for machine learning, 9 459 for data science, and 3 059 for big data, stored in XML format.
Data transformation
We start by importing the necessary libraries. ElementTree parses the XML files that DfR requests return. The os module provides functions for working with directories, so that a single loop can access all the files. Pandas and NumPy handle the data operations, and matplotlib and seaborn plot the graphs.
import xml.etree.ElementTree as ET
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
With all files stored in a folder named data, the loop first accesses the three sub-folders machine learning, data science, and big data, and parses the source files in each of them. It follows the hierarchical structure of the XML files and ignores book reviews and notices.
From each file, the loop stores the journal title, article title, publication year, and keyword (which is the same as the name of the sub-folder where the file is stored). The data list is then converted to a Pandas data frame.
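A minimal sketch of such a loop is shown below. It is not the original code: it assumes that each metadata file has an article root element with an article-type attribute and nested journal-title, article-title, and pub-date/year elements, and that book reviews and notices carry the article-type values listed in SKIPPED_TYPES. The function names and the folder layout data/<keyword>/<file>.xml are likewise illustrative assumptions, so check them against your own files.

```python
import os
import xml.etree.ElementTree as ET

# Assumed layout: <base_dir>/<keyword>/<file>.xml, one sub-folder per search term.
KEYWORDS = ["machine learning", "data science", "big data"]

# Assumed article-type values for book reviews and notices; verify in your files.
SKIPPED_TYPES = {"book-review", "misc"}


def parse_article(path, keyword):
    """Return one record for a metadata file, or None if the item is skipped."""
    root = ET.parse(path).getroot()
    if root.get("article-type") in SKIPPED_TYPES:
        return None
    return {
        # findtext returns None when the element is missing.
        "journal": root.findtext(".//journal-title"),
        "title": root.findtext(".//article-title"),
        "year": root.findtext(".//pub-date/year"),
        "keyword": keyword,
    }


def parse_folder(base_dir, keywords=KEYWORDS):
    """Parse every XML file in each keyword sub-folder into a list of records."""
    records = []
    for keyword in keywords:
        folder = os.path.join(base_dir, keyword)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            if not name.endswith(".xml"):
                continue
            record = parse_article(os.path.join(folder, name), keyword)
            if record is not None:
                records.append(record)
    return records
```

Calling parse_folder("data") returns a list of dictionaries, which pd.DataFrame(records) turns into a data frame with the columns journal, title, year, and keyword.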
Data cleaning
To discover research trends in the data, dissertation overviews, indexes, discussions, notes, and lists of members are removed. With this basic data cleaning, only published articles stay in the data, which provides valuable information about researchers’ interest in data science applications in Economics over time.
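One way to sketch this filter in Pandas is shown below. The title patterns are assumptions chosen to match the categories named above, not the original list, and keep_articles is a hypothetical helper name. The patterns are matched at the start of the title only, so that an article such as "Indexation and Wages" is not dropped by the "index" pattern.

```python
import pandas as pd

# Assumed title prefixes for non-article items; adjust after inspecting the data.
NON_ARTICLE_PATTERN = r"(?:dissertations?|index|discussions?|notes?|list of members)\b"


def keep_articles(df):
    """Drop rows whose title starts with one of the non-article prefixes."""
    # str.match anchors the pattern at the start of each title;
    # rows with a missing title are kept (na=False).
    is_noise = df["title"].str.match(NON_ARTICLE_PATTERN, case=False, na=False)
    return df[~is_noise].reset_index(drop=True)
```

Prefix matching is deliberately conservative. It is worth listing the dropped titles once and checking them by eye before trusting the filter.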
A quick comparison of an original XML source file and the transformed dataset in Pandas data frame shows the magic:

Finally, let’s plot the series using the seaborn relplot:
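A hedged sketch of this step follows. It assumes the cleaned data frame has the columns keyword and year described earlier. count_by_year and plot_trends are illustrative names, and the axis labels are my own choice.

```python
import pandas as pd


def count_by_year(df):
    """Count articles per keyword and publication year."""
    out = df.copy()
    # Years come out of the XML as text; invalid or missing values become NaN.
    out["year"] = pd.to_numeric(out["year"], errors="coerce")
    out = out.dropna(subset=["year"])
    out["year"] = out["year"].astype(int)
    return out.groupby(["keyword", "year"]).size().reset_index(name="articles")


def plot_trends(counts):
    """Draw one line per keyword with seaborn's relplot."""
    # Imported here so the counting step works without the plotting libraries.
    import seaborn as sns

    grid = sns.relplot(data=counts, x="year", y="articles", hue="keyword", kind="line")
    grid.set_axis_labels("Year", "Number of articles")
    return grid
```

Calling plot_trends(count_by_year(df)) followed by plt.show() displays the figure.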
We can see that since the 1960s, research interest in data science has grown exponentially. Big data has experienced a boom since the 1980s, and interest in machine learning has grown steadily since the beginning of the 21st century. The 140-year span of the data covers the essential developments of the modern era of economic research.

Conclusions
JSTOR Data for Research provides valuable data about research trends across scientific disciplines. Researchers and data scientists may use text mining, machine learning, and other data science techniques to find valuable patterns in the data. Processing the final dataset might be challenging because the source data files are sent to the user in XML format. With the Python code I presented, these operations are no longer a challenging task.
Update: In May 2021, JSTOR Data for Research switched to a new platform, Constellate.