The world’s leading publication for data science, AI, and ML professionals.

Data Carpentry for Electronic Medical Records (I)

Part 1: The Shape of Data

PART I: THE SHAPE OF DATA


Data from Electronic Medical Records (EMR) can be complex and difficult to navigate. This series of articles is for clinicians who hope to understand what data analysts do and to communicate their intentions for data manipulation and analysis.

What we want data to look like

Most clinicians will be familiar with flat data. This is a format where data is in a 2-dimensional table, generally with one patient per row, and one column per attribute (such as patient age, a lab value, a discharge date). Data analysts may also describe this as tidy data. However, EMR data rarely comes structured like this!

Flat data: What clinicians usually want (Image by author)
Flat data: What clinicians usually want (Image by author)

Most data generated by patients happens through multiple, aperiodic events. We need to consider how these are represented in a database.

An example patient timeline (Image by author)
An example patient timeline (Image by author)

What real data looks like

Databases often represent data in a ‘long’ format. In a long table, each observation of a variable has one row. There can be multiple observations at different times for a variable (e.g. serial troponins). Hence, there may be more than one row per variable, and more than one row per patient. The table is ‘long’ because there are generally more rows than columns.

A long table (Image by author)
A long table (Image by author)

Long tables are usually transformed to a ‘wide’ format. Each patient in the dataset has one row. The table has multiple columns representing measurements of a variables at various time points. It is ‘wide’ because there are more columns than rows. This process is called pivoting, because the data is turned on its side. We can appreciate that the wide format is almost like a timeline.

A wide table - transformed from the above long table (Image by author)
A wide table – transformed from the above long table (Image by author)

You may notice that there are many cells in the wide table with NULL values. A NULL value is a blank. This is a reflection of aperiodic clinical measurements, which do not happen at exactly the same time for every patient. You may wonder why the EMR database does not store data like this – it is because this structure would require a new column for every time point, but many of the values in the column would be NULL values. That would be very inefficient in terms of storage!

This table is not tidy yet, as there are multiple columns under each variable. This is represented by the two-level column labels. We will need to collapse these into a single column for each variable to make the dataset tidy.

Delimitation and Aggregation

To collapse the columns under each variable, we must delimit the relevant time period, and then aggregate the observations within the period.

Between start and end of ICU admission (Delimit), the mean capillary blood glucose (Aggregate) (Image by author)
Between start and end of ICU admission (Delimit), the mean capillary blood glucose (Aggregate) (Image by author)

Delimitation ensures that we exclude time periods that are not relevant to our study. For example, if we are interested in a patient’s ICU admission, we should delimit the start and end dates of the admission.

Aggregation is any rule that takes in a few values, and puts out a single one. Common aggregation rules in Medicine include finding the mean, the median, the maximum or minimum, and the first or last value.

In the above example, we may have specified a delimitation of Time 1 to Time 2, and an aggregation using the mean. Notice that Var 3 has NULL values, because there were no relevant observations from Time 1 to Time 2. This gives us a tidy table – one row per patient, and one column per variable!

A tidy table (Image by author)
A tidy table (Image by author)

Note: Why is it important to delimit a range?

When we have a discrete event in mind, it can be tempting to just specify an aggregation directly (e.g. the ‘last’ value of Creatinine measured before ICU admission). This may work for most cases, but we should be careful – what if the patient had a ‘last’ value a year ago (e.g. patient directly admitted to the ICU from the A&E, with no labs reported yet when admitted)? Delimitation is highly recommended for valid and consistent results!

In this article, we learned about how electronic medical records data is structured, and how we can make it tidy – by pivoting from long to wide, by delimiting and aggregating. Stay tuned for more on Data Carpentry for Electronic Medical Records!


Related Articles