The world’s leading publication for data science, AI, and ML professionals.

Extracting information from XML files into a Pandas dataframe

Parse XML files with the Python's ElementTree package

Hands-on Tutorials

Website vector created by stories - www.freepik.com
Website vector created by stories – www.freepik.com

Real-world data is messy, and we know it. Not only does such data require a lot of cleaning, a lot of times, the format in which we receive data is also not suited for analysis. This means that before the analysis even begins, the data has to undergo a series of transformations to get it into a suitable format – a format that makes it easy to work upon. This happens mostly when the data is either scraped from the web or is provided in the form of documents. I came across a pretty similar dataset, which was in the form of various Xml files. In this article, I lay down the steps I took to decipher those files and convert them into an analysis-ready CSV file good enough to be ingested into the pandas’ library for further analysis.


This article is part of a complete series on finding good datasets. Here are all the articles included in the series:

Part 1: Getting Datasets for Data Analysis tasks – Advanced Google Search

Part 2: Useful sites for finding datasets for Data Analysis tasks

Part 3: Creating custom image datasets for Deep Learning projects

Part 4: Import HTML tables into Google Sheets effortlessly

Part 5: Extracting tabular data from PDFs made easy with Camelot.

Part 6: Extracting information from XML files into a Pandas dataframe

Part 7: 5 Real-World datasets for honing your Exploratory Data Analysis skills


Motivation

The entire world is collectively fighting the COVID-19 war, and the year 2021 brings a glimmer of hope. We are all aware of the vaccination drive being conducted by many countries, including my own country India, which began the world’s largest vaccination drive on 16th Jan 2021. But a vaccine takes its final form after going through several phases and trials, and only after it is deemed fit, it is given a green flag to be administered to the general population.

I discovered ClinicalTrials.gov, a database of privately and publicly funded clinical studies conducted around the world. I thought it would be pretty interesting and informative to look at all the clinical trials related to COVID 19 studies presented on the site.

Source: https://www.clinicaltrials.gov/ct2/home
Source: https://www.clinicaltrials.gov/ct2/home

However, the trials’ dataset consists of XML files where each XML file corresponds to one study. Thus, it is not possible to immediately jump to analysis. The data needs to be first into a format that could be read into a pandas dataframe. I have downloaded the dataset and uploaded it on Kaggle for others to look into it. The data is updated every week and provides some great insights into COVID vaccines and medicines’ development.

source: https://www.kaggle.com/parulpandey/covid19-clinical-trials-dataset | Image by Author
source: https://www.kaggle.com/parulpandey/covid19-clinical-trials-dataset | Image by Author

The rest of the article will deal with parsing these XML files in Python. But before that, let’s dive a little deeper into an XML file and its constituents.


What is XML?

Image by Author
Image by Author

XML stands for Extensible Markup Language. As the name suggests, it is a markup language that encodes documents by defining a set of rules in both machine-readable and human-readable format. The language defines a set of rules that are used to encode a document in a specific format. Let’s look at an example of an XML file below:

<?xml version="1.0" encoding="UTF-8"?>
<class>
  <student>
    <name>Negan</name>
    <age>40</age>
    <email>[email protected]</email>
    <address>
     <country>US</country>
   </address>
  </student>
</class>

Every XML file has a tree structure where the element at the top is the root element. The child elements are then connected to the root elements. if presented visually, the tree structure of the above XML document would look like this:

Image by Author
Image by Author
  • XML version = "1.0" encoding = "UTF-8"?: <XML version and the character encoding>
  • Root element: <Class>
  • Child elements of Student element: <name>, <age>, <email> & <address>
  • Sub-child elements: <country>

The ElementTree XML API

Let’s now see how we can parse a given XML file and extract its data in a structured way. In this article, we will look at [ElementTree](https://docs.python.org/3.10/library/xml.etree.elementtree.html#module-xml.etree.ElementTree) a built-in Python library to manipulate XML files. We have already been introduced to the dataset. Here is a glimpse of how it looks on Kaggle:

A single XML file from the dataset | Image by Author
A single XML file from the dataset | Image by Author

The dataset consists of few thousand XML files. Each XML file corresponds to one study. The filename is the NCT number, which a unique identifier of a survey in the ClinicalTrials repository.


Parsing XML files

Let’s start by importing the necessary libraries and looking at the number of files in the dataset.

import xml.etree.ElementTree as ET
path = '../input/covid19-clinical-trials-dataset/COVID-19 CLinical trials studies/'
files = os.listdir(path)
print(len(files))
------------------------------------------------
4663

There are 4663 individual files in the dataset. This number above corresponds to the number of XML files in the data, including the number of clinical trial reports. Let’s now look at the first XML file and print out values to understand how the tree is structured.

# Reading the first file
file_path_file1 = os.path.join(path, list_of_files[0])
tree = ElementTree.parse(file_path_file1)
root = tree.getroot()
print(root.tag, root.attrib)
--------------------------------------------------------
clinical_study {'rank': '4406'}

ElementTree represents the whole XML document as a tree, while Element represents a single node in this tree. Every part of a tree (root included) has a tag that describes the element. Additionally, elements may also contain attributes which, as the name suggests, are additional descriptors. If you’ll look closely at the XML example provided above – clinical study is the root element and has a rank attribute that equals 4406.

We can now look at some of the other child elements by using a simple for loop.

for child in root:     
print(child.tag, child.attrib)

All the above elements appear to be the different sections of the report. One can also look at the entire document and all the attributes or levels in the tree via the code below.

print(ElementTree.tostring(root, encoding='utf8').decode('utf8'))

We can then choose to include all or a select few elements in our formatted data. These elements will appear as separate columns in the dataframe. We’ll now initialize two empty data frames, which will be populated with the above elements. We’ll include id,overall_status, study_type, start_date, enrollment, condition, location_countries, intervention, title, date_processed and sponsors in our dataset.

This gives us a nicely formatted dataframe, just like the ones that we are used to seeing. You can also save it as a CSV file for future reference.

df.head()
df_covid.to_csv('COVID_trials.csv')
Processed dataframe | Image by Author
Processed dataframe | Image by Author

What’s next

Now that you have the data in the desired format, you can analyze it to elicit specific details like :

  • what is the overall status of the studies
  • The different study types
  • The timeline of the trials
  • The number of people enrolled in various studies, and many more such questions.

You can also create a dashboard with the tools of your choice so that it becomes easy to see all the details at once. I have only extracted few attributes in the example above, but there are plenty in the report. Go ahead and experiment with the others, too, to understand more about the data. Like It is mentioned. The dataset is available for public use, and it is waiting to be explored further.


References

  • COVID-19 Clinical Trials dataset: Database of COVID-19 related clinical studies being conducted worldwide

Related Articles