

Python for Everyday Tasks: A Bibliometrics Analysis

An example of retrieving metadata from Scopus database

Applying everyday programming and analytics can help us resolve numerous problems in our daily lives, and it remains an underrated skill.

Photo by Michael Constantin P on Unsplash

Knowledge of a programming language has become an essential requirement for many data science tasks, for reasons we are all familiar with. However, I find that many people continue to perceive coding, programming, and data analytics as an arcane field best left to the experts. While we need experts to develop complicated algorithms for artificial intelligence, it is equally important to focus on applying existing knowledge to address individual or social issues.

Reviewing the existing research is a fundamental step in any research work. When I started my PhD years ago, I had no programming knowledge. So I not only searched for the important research articles online using relevant keywords, but also manually downloaded and parsed the data files (PDFs/Excel sheets). The process was tedious and time consuming, and I wish I had known Python back then. It took me months to search for all the articles and collect the data, a task that a programming script can now do in minutes.

This blog shows how common Python skills can be used to collect metadata of research articles from Scopus, one of the most popular databases among research scholars for searching published research in a specialization [1]. Researchers frequently use such information to conduct bibliometrics studies that report the state of research in their niche domains.

The problem

Let us say we want to find all research articles that used Python for data analytics or machine learning and were published in the last 30 years. We would like to know, for example, whether it is worth learning Python for machine learning, and which countries/regions are leading the research.

Storing data

Several ways exist to store the collected information. Keeping everything in memory until the loop completes works only for a few iterations and a small dataset. Keeping all data in the memory of a Jupyter notebook while the script is running has a major drawback: whenever the script encounters an error, you may lose your progress and have to restart the work. Therefore, I suggest saving the details after each iteration into a SQL database to ensure that the effort is not repeated if any error occurs (I used the sqlite3 package).
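A minimal sketch of this save-as-you-go pattern with sqlite3 follows; the table name, columns, and record layout are illustrative, not taken from the original script:

```python
import sqlite3

# open (or create) a local database file that survives notebook crashes
conn = sqlite3.connect("scopus_results.db")
conn.execute("""CREATE TABLE IF NOT EXISTS articles (
                    scopus_id TEXT PRIMARY KEY,
                    title     TEXT,
                    year      TEXT,
                    citations INTEGER)""")

def save_entry(entry):
    """Insert one article's metadata; duplicates are ignored on restart."""
    conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?)",
        (entry["id"], entry["title"], entry["year"], entry["citations"]),
    )
    conn.commit()  # commit after every iteration so progress persists

save_entry({"id": "2-s2.0-1", "title": "Example", "year": "2015", "citations": 10})
```

Because of the `PRIMARY KEY` and `INSERT OR IGNORE`, rerunning the script after a crash simply skips the articles already saved.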

Case study

Following the Scopus website [2], I formulated the query for the request as described below (please note that you need to register to get an API key). The script shows only the important steps, not the whole code.

base_url = 'http://api.elsevier.com/content/search/scopus?'
# search in title, abstract, and keywords
scope    = 'TITLE-ABS-KEY'
# formulating the query structure
terms1   = '({python})'
terms2   = '({machine learning} OR {big data} OR {artificial intelligence})'
terms    = '({} AND {})'.format(terms1, terms2)
# insert your personal key (it is free and available on https://dev.elsevier.com/)
apiKey   = '&apiKey=INSERT YOUR KEY' 
date     = '&date=1990-2020'
# it is the maximum number of results per query for a free account
count    = '&count=25' 
sort     = '&sort=citedby-count'
view     = '&view=standard'

The search parameters included all subject codes, so as to examine every field that may have published articles on the topic of interest. The code iterates over all subject areas and customizes the query accordingly. The free version displays only 25 results per page, so a query returning more than 25 results must be retrieved page by page. I therefore used an inner loop within the main loop to progressively collect the results from subsequent pages for the same subject. The exercise returned 2524 published studies (including research articles, conference proceedings, etc.), of which 25% were open access. In practice, many of these papers may not contain information relevant to us and would need manual screening to filter out the irrelevant ones. While that is the standard research process, we can treat all results as preliminary findings for the present task. The following block of code shows the main loop that iterates over all subjects:

import json
import requests

# this function sends a request and returns the total number of articles,
# the starting position of the first article, and the metadata of each article
def search_scopus(url):

    res = requests.get(url)
    if res.status_code == 200:
        content  = json.loads(res.content)['search-results']
        total    = content['opensearch:totalResults']
        start    = content['opensearch:startIndex']
        metadata = content['entry']
        return int(total), int(start), metadata

    else:
        error = json.loads(res.content)['service-error']['status']
        print(res.status_code, error['statusText'])
# list of all subject area codes in the Scopus database
subjects = ['AGRI', 'ARTS', 'BIOC', 'BUSI', 'CENG', 'CHEM', 'COMP', 'DECI', 'DENT', 'EART', 'ECON', 'ENER', 'ENGI', 'ENVI', 'HEAL', 'IMMU', 'MATE', 'MATH', 'MEDI', 'NEUR', 'NURS', 'PHAR', 'PHYS', 'PSYC', 'SOCI', 'VETE', 'MULT']

for sub in subjects:
    # starting index of the results: an article offset, not a page number;
    # it must be initialized once per subject, outside the while loop
    start_index = 0
    while True:
        start = '&start={}'.format(start_index)
        subj  = '&subj={}'.format(sub)
        query = ('query=' + scope + terms + date + start + count +
                 sort + subj + apiKey + view)
        url   = base_url + query
        # total results for this subject, starting index of the first
        # result in this response, and the metadata itself
        total, start_index, metadata = search_scopus(url)
        # save metadata now in SQL (not shown here)
        # check how many results still need to be retrieved
        remain = total - start_index - len(metadata)
        if remain > 0:
            start_index += 25  # move on to the next 25 results
        else:
            break  # done with this subject; leave the while loop
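As an aside, the pagination arithmetic in the inner loop can be sketched as a standalone helper (not part of the original script):

```python
def start_indices(total, per_page=25):
    """Return the start offsets needed to page through `total` results."""
    return list(range(0, total, per_page))

# e.g. 60 matching articles at 25 results per request need three pages
print(start_indices(60))  # [0, 25, 50]
```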

Disclaimer: before presenting the results, I would like to explicitly mention that this is only a ‘do-it-yourself’ exercise blog for learning, not a peer-reviewed research article. Therefore, the findings should be interpreted as indicative, not authoritative.

Figure 1 shows the total number of studies published every year since the 1990s, revealing an exponential growth trend since 2012, which is not surprising. Because we restricted the search to studies with ‘python’ as a mandatory word, the number of matching works is tiny despite the overwhelming attention the domain has received in the last decade. The second subplot illustrates the total citations of all studies and the quality (citations per study) over the same period. While the number of studies has increased, no such trend is apparent for the total citations or the quality of publications. One reason for the low quality could be the recency of the research: as more research is conducted in the future, the recent papers will be cited more often.
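For illustration, the aggregation behind Figure 1 can be sketched with plain Python over toy records; the record layout here is hypothetical, standing in for the saved metadata:

```python
from collections import defaultdict

# toy records standing in for the saved Scopus metadata
records = [
    {"year": 2018, "citations": 4},
    {"year": 2018, "citations": 6},
    {"year": 2019, "citations": 3},
]

papers = defaultdict(int)
cites = defaultdict(int)
for r in records:
    papers[r["year"]] += 1
    cites[r["year"]] += r["citations"]

# 'quality' = citations per study, per year
quality = {y: cites[y] / papers[y] for y in papers}
print(quality)  # {2018: 5.0, 2019: 3.0}
```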

Figure 1 Annual scholarly publications and citations as per the Scopus database. Note: the chart shows only articles related to data science using Python in their abstract, title, or keywords. It should also be mentioned that the results are sensitive to the database of choice, as not all studies are uniformly represented by different research catalogues [1].

The country-wise record is presented in Figure 2. As expected, the United States leads in cumulative research output, but India seems to have surpassed everyone and gained the top position in annual publications. China is not the biggest producer, which is counter-intuitive and is perhaps because not all Chinese publications are listed in the Scopus database.
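A sketch of how the country counts behind Figure 2 could be tallied from the retrieved entries; the 'affiliation' and 'affiliation-country' field names are my assumptions about the response payload, not verified against the original script:

```python
from collections import Counter

# toy metadata entries mimicking the Scopus 'standard' view; the
# 'affiliation' / 'affiliation-country' field names are assumptions
entries = [
    {"affiliation": [{"affiliation-country": "India"}]},
    {"affiliation": [{"affiliation-country": "United States"}]},
    {"affiliation": [{"affiliation-country": "India"}]},
    {},  # some entries lack affiliation data entirely
]

# count the first listed affiliation per article, skipping empty entries
countries = Counter(
    e["affiliation"][0].get("affiliation-country", "Unknown")
    for e in entries
    if e.get("affiliation")
)
print(countries.most_common(2))  # [('India', 2), ('United States', 1)]
```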

Figure 2 Scholarly publications based on the affiliation of the first author as per the Scopus database. Note: only the top ten countries are shown; the graph shows only articles related to data science using Python in their abstract, title, or keywords; the free account returns only the name of the first author, so I assumed that Scopus shows the first author's affiliation.

Although India and China are clearly leading with respect to the number of papers, no such pattern is noticeable in the quality of research. Figure 3 illustrates the research quality in the top ten countries by total publications. The ten most highly cited papers were treated as outliers and excluded.

Figure 3 Scholarly publications based on the affiliation of the first author as per the Scopus database. Note: the graph shows only articles related to data science using Python in their abstract, title, or keywords; the free account returns only the name of the first author, so I assumed that Scopus shows the first author's affiliation.

Figure 4 describes a city-level geographical distribution of research productivity. Indian and Chinese cities account for a considerable portion of the papers, with Chennai being the top city, followed by Beijing.

Figure 4 Scholarly publications based on the affiliation of the first author as per the Scopus database. Note: only the top ten cities are shown; the graph shows only articles related to data science using Python in their abstract, title, or keywords; the free account returns only the name of the first author, so I assumed that Scopus shows the first author's affiliation.

Figure 5 is an interesting graph showing the ten most popular papers published in the domain. Many highly cited papers introduce either a Python package or a machine learning algorithm. The scikit-learn paper is by far the most cited, followed by the paper proposing convolutional learning.

Figure 5 Ten most cited publications in Scopus. Note: the graph shows only articles related to data science using Python in their abstract, title, or keywords. Scopus citation counts differ substantially from what you may see on Google Scholar [1].

The subject-wise distribution of publications is highly skewed. Understandably, the computer science domain accounts for an overwhelmingly large fraction of the studies (>75%), followed by the engineering and biochemical fields (see Figure 6). Further, most articles are still paywalled, and only 30% belong to the open-access category, which does not require readers to subscribe or pay for the information (see Figure 7).
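The open-access share in Figure 7 boils down to a simple tally; in this sketch the 'openaccess' flag name and its string values are assumptions about the response payload:

```python
# toy entries; the 'openaccess' flag name and values are assumptions
entries = [{"openaccess": "1"}, {"openaccess": "0"}, {"openaccess": "0"},
           {"openaccess": "1"}, {"openaccess": "0"}]

# count the flagged entries and express them as a share of the total
open_count = sum(e.get("openaccess") == "1" for e in entries)
share = 100 * open_count / len(entries)
print(f"{share:.0f}% open access")  # 40% open access
```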

Figure 6 Subject-wise breakup of publications based on the affiliation of the first author as per the Scopus database. Note: shows only the top five subjects and only articles related to data science using Python in their abstract, title, or keywords.
Figure 7 Number of open access articles as per the Scopus database. Note: shows only articles related to data science using Python in their abstract, title, or keywords.

Conclusion

The popularity of Python and its application to machine learning have been growing exponentially in recent years across the world, with no single country dominating the field. US-affiliated researchers have contributed the most, but India and China are fast catching up, at least in quantity if not in quality.

This blog post also demonstrates that basic knowledge of coding may be adequate to solve many problems in our daily college or working lives. While much of the data science discourse continues to focus on the sophisticated and attractive domains of managing and ‘fitting’ large-scale datasets, the application of programming and machine learning knowledge need not be limited to complex problems. We should also continuously explore how to deploy these simple, straightforward programming skills to solve the ‘micro’ problems we encounter daily in our lives.

I believe that for most people working in the domain, the strength of data science and programming knowledge lies not in knowing everything, but in applying whatever you know to solve issues in areas you are passionate about. ‘Good-enough’ knowledge may not have immediate commercial value, but regularly applying such basic skills will eventually empower you in the long term and also grant you immense personal satisfaction.

References

[1] A. Martín-Martín, E. Orduna-Malea, M. Thelwall, and E. Delgado López-Cózar, "Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories," J. Informetr., vol. 12, no. 4, pp. 1160–1177, 2018.

[2] Elsevier, "Scopus search API," 2020. [Online]. Available: https://dev.elsevier.com/documentation/ScopusSearchAPI.wadl. [Accessed: 28-Jan-2021].

