What Is Clinical Data Science? (Part 1)

A gentle introduction to the growing field of clinical data science

Dhikrullah Folorunsho
Towards Data Science


Photo by Owen Beard on Unsplash

In this article and the rest of the series to follow, I’ll be digging deep into the nitty-gritty of clinical data science.

The major points discussed in this series are outlined below:

  1. Introduction to clinical data science
  2. Clinical data science and related fields
  3. Sources of clinical data
  4. Regulatory constraints

What is clinical data science?

Clinical data science is a domain that focuses on applying data science to healthcare with the goal of improving the overall well-being of patients and the healthcare system. Keeping this goal in mind, clinical data scientists leverage the overwhelming volume of data generated within the healthcare system on a daily basis to solve health-related challenges.

Clinical data science is closely related to fields like biomedical data science, healthcare analytics, and biomedical informatics, albeit with certain distinctions. Before moving forward, I’d like to dedicate the next few lines to pointing out the distinctions between these fields, to give a better perspective on the main domain of discussion.

Biomedical data science involves analyzing large-scale biological datasets in order to understand and proffer solutions to health-related problems. According to Wikipedia, healthcare analytics covers the analytics activities that can be undertaken as a result of data generated from core areas of healthcare, including claims and cost data, pharmaceutical and research & development data, clinical data, and patient behavior & sentiment data. In other words, we can consider healthcare analytics to be narrower in scope than clinical data science. Biomedical informatics, on the other hand, focuses on the optimal use of biomedical information, data, and knowledge for problem-solving and decision-making by employing computational and traditional approaches.

Now that we understand the nuances of these fields, let us get down to one of the most important components of clinical data science: the Electronic Health Record (EHR). As in every other data science domain, data is the fuel that propels the engine of any analytics operation; without it, we are unable to do any analysis. In today’s world, we are surrounded by arrays of networked devices that can record clinical data. Since the entire field of data science is based on manipulating data to obtain insight, gathering this data is at the core of data science.

The EHR is the comprehensive collection of all information recorded by the individuals involved in patient care. This includes records from clinicians, laboratories, radiology imaging, health insurance, socio-demographics, genetic sequencing data, etc. Another term commonly used interchangeably with EHR is Electronic Medical Record (EMR). The two differ, however: while the EHR is broad, the EMR is more constrained because it primarily concentrates on a patient’s medical and treatment history within a single practice. The EMR is the electronic version of the traditional paper records found in clinicians’ offices.

Generally, the information found in an EHR can be trimmed down to four questions: who recorded what information, when, and why. The “who” identifies the healthcare personnel that recorded the information. This could be a physician, pharmacist, nurse, laboratory scientist, radiologist, etc. The nature of the data captured by these personnel differs considerably, so the type of information (i.e., the “what”) is determined by who recorded it. For example, clinicians may record information about the medical and treatment history of patients, pharmacists record information concerning medication, nurses may record clinical observation data such as weight and height measurements, and laboratory scientists record laboratory data such as lab test orders and lab test results. The “when” gives the specific timestamp, such as a visit to the clinic or hospital, the time of admission, or the time of discharge. The “why” gives the reason the data was recorded.
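To make the four questions concrete, here is a minimal sketch of how a single EHR entry could be represented in Python. The `EhrEntry` class, its field names, and the sample values are all hypothetical and chosen for illustration; real EHR systems use far richer schemas.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EhrEntry:
    """One recorded item in an EHR (hypothetical structure)."""
    who: str        # role of the person who recorded the entry
    what: dict      # the recorded information itself
    when: datetime  # timestamp of the recording
    why: str        # reason the data was recorded


# Made-up sample entries for a single patient visit.
entries = [
    EhrEntry(
        who="nurse",
        what={"weight_kg": 82.5, "height_cm": 170},
        when=datetime(2022, 3, 1, 9, 30),
        why="admission assessment",
    ),
    EhrEntry(
        who="laboratory scientist",
        what={"test": "HbA1c", "result": 7.2},
        when=datetime(2022, 3, 1, 11, 0),
        why="diabetes follow-up",
    ),
]


def entries_by(role, items):
    """Return the entries recorded by a given role."""
    return [entry for entry in items if entry.who == role]


nurse_entries = entries_by("nurse", entries)
print(nurse_entries)
```

Filtering by `who` shows why the recorder matters: the nurse's entry holds observation data, while the laboratory scientist's entry holds a test result.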

It is noteworthy that a lot of information is consolidated into the EHR, as it is widely encompassing. In summary, we can categorize this data into distinct forms depending on the personnel that recorded it: medication data, clinical observation data, socio-demographic data, laboratory data, billing data, etc.

Raw clinical data are saved in a database, whether relational or non-relational, for adequate documentation (i.e., storage) and simple retrieval. Depending on their requirements, organizations that administer clinical databases have their own methods (models) for storing and accessing their data. Therefore, familiarity with existing models and a solid understanding of how to set up a custom model if necessary are crucial components of a clinical data scientist’s job. Examples of clinical data models include Informatics for Integrating Biology and the Bedside (i2b2), PCORnet (the National Patient-Centered Clinical Research Network), Observational Health Data Sciences and Informatics (OHDSI, which manages the OMOP [Observational Medical Outcomes Partnership] data model), Sentinel, etc. All of these models were designed with specific goals in mind, so their architectures differ in some respects. The process of transforming raw data (from the source) into a particular model is referred to as the “extract, transform, and load” (ETL) process.
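The ETL idea can be sketched in a few lines of Python using the built-in `sqlite3` module. Everything here is an assumption made for illustration: the raw field names, the target `patient` table, and the unit conversion are invented and do not correspond to i2b2, PCORnet, OMOP, or Sentinel.

```python
import sqlite3

# Hypothetical raw export from a source system (field names are assumptions).
RAW_ROWS = [
    {"mrn": "A001", "sex": "F", "dob": "1980-02-14", "weight_lb": "176.4"},
    {"mrn": "A002", "sex": "m", "dob": "1975-11-03", "weight_lb": "220.5"},
]

LB_TO_KG = 0.45359237


def extract():
    """Extract: in practice this step would read from the source system."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: normalize codes, types, and units for the target model."""
    return [
        (
            row["mrn"],
            row["sex"].upper(),                            # consistent sex codes
            int(row["dob"][:4]),                           # keep birth year only
            round(float(row["weight_lb"]) * LB_TO_KG, 1),  # pounds to kilograms
        )
        for row in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows into the (hypothetical) target table."""
    conn.execute(
        "CREATE TABLE patient (patient_id TEXT PRIMARY KEY, sex TEXT, "
        "birth_year INTEGER, weight_kg REAL)"
    )
    conn.executemany("INSERT INTO patient VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute("SELECT * FROM patient ORDER BY patient_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the name of the process and makes it easier to swap the target model without touching the extraction step.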

Image by Author: Pictorial representation of clinical dataflow

Clinical data scientists employ a variety of methods to obtain the data once it has been saved in the format deemed most appropriate, whether to address clinical questions or for other types of study. We express a set of criteria in the form of a query, and the Database Management System (DBMS) then fetches the matching data from the database. A query language, most often Structured Query Language (SQL), is used to specify these criteria. The retrieved data is then ready for further analytical operation.

We can use the retrieved data to answer questions about the characteristics of the individuals in the whole database and to study how those characteristics change over time. We can answer questions like:
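As a rough illustration of query-based retrieval, the sketch below builds a tiny in-memory SQLite database and fetches rows that match a set of criteria. The `lab_result` table, its columns, and the values are made up for this example and are not taken from any real clinical data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lab_result "
    "(patient_id TEXT, test_name TEXT, value REAL, taken_on TEXT)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO lab_result VALUES (?, ?, ?, ?)",
    [
        ("P1", "HbA1c", 7.9, "2022-01-10"),
        ("P2", "HbA1c", 5.4, "2022-01-12"),
        ("P3", "LDL", 130.0, "2022-02-01"),
        ("P1", "HbA1c", 7.1, "2022-06-15"),
    ],
)


def fetch_results(conn, test_name, min_value):
    """Return results for one test at or above a threshold, oldest first."""
    query = """
        SELECT patient_id, value, taken_on
        FROM lab_result
        WHERE test_name = ? AND value >= ?
        ORDER BY taken_on
    """
    # Placeholders (?) let the DBMS bind the criteria safely.
    return conn.execute(query, (test_name, min_value)).fetchall()


high_hba1c = fetch_results(conn, "HbA1c", 6.5)
print(high_hba1c)
```

The `WHERE` clause carries the criteria, and the DBMS does the work of locating the matching rows.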

  1. What is the average weight of patients diagnosed with obesity?
  2. What is the incidence of diabetic retinopathy in patients diagnosed with diabetes?
  3. What proportion of hypertensive patients have hypertensive retinopathy?
  4. What is the average hospital stay of patients with coronary heart disease?

As you can see, there are many questions that can be answered with our queries, provided the data exists in the database.
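Two of the questions above can be sketched as SQL aggregates run from Python. The `patient` and `diagnosis` tables, the condition labels, and the numbers are all invented for illustration; a real database would typically use coded diagnoses rather than free-text labels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id TEXT PRIMARY KEY, weight_kg REAL)")
conn.execute("CREATE TABLE diagnosis (patient_id TEXT, condition TEXT)")
# Made-up data for illustration only.
conn.executemany(
    "INSERT INTO patient VALUES (?, ?)",
    [("P1", 110.0), ("P2", 95.0), ("P3", 70.0), ("P4", 82.0)],
)
conn.executemany(
    "INSERT INTO diagnosis VALUES (?, ?)",
    [
        ("P1", "obesity"),
        ("P2", "obesity"),
        ("P1", "hypertension"),
        ("P2", "hypertension"),
        ("P3", "hypertension"),
        ("P4", "hypertension"),
        ("P3", "hypertensive retinopathy"),
    ],
)

# Question 1: average weight of patients diagnosed with obesity.
avg_weight = conn.execute(
    """
    SELECT AVG(p.weight_kg)
    FROM patient AS p
    JOIN diagnosis AS d ON d.patient_id = p.patient_id
    WHERE d.condition = ?
    """,
    ("obesity",),
).fetchone()[0]

# Question 3: proportion of hypertensive patients with hypertensive retinopathy.
proportion = conn.execute(
    """
    SELECT AVG(CASE WHEN r.patient_id IS NOT NULL THEN 1.0 ELSE 0.0 END)
    FROM (SELECT DISTINCT patient_id FROM diagnosis WHERE condition = ?) AS h
    LEFT JOIN (SELECT DISTINCT patient_id FROM diagnosis WHERE condition = ?) AS r
        ON r.patient_id = h.patient_id
    """,
    ("hypertension", "hypertensive retinopathy"),
).fetchone()[0]

print(avg_weight, proportion)
```

The `LEFT JOIN` keeps every hypertensive patient in the denominator, so patients without retinopathy count as zero rather than dropping out of the calculation.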

One central issue when working with clinical data is the regulatory constraints. Like healthcare professionals, clinical data scientists are expected to adhere to best practices and uphold a standard that guarantees patient privacy is respected and protected. The federal Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the European Union General Data Protection Regulation (GDPR) are some of the laws that have been implemented by relevant authorities to achieve this. Research involving patient data and other sensitive data available in the EHR is also covered by institution-specific guidelines and governing bodies like Institutional Review Boards (IRBs), in part due to concerns about the institutions’ and healthcare providers’ liability. As a result, the most common clinical datasets are de-identified (i.e., anonymized) and HIPAA-limited datasets.

De-identified clinical data sets are collections of observational patient data that have been stripped of all direct Protected Health Information (PHI) components. IRB permission is typically not necessary for access to de-identified clinical data sets.
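The idea of stripping direct identifiers can be sketched as below. This is only an illustration: the `DIRECT_IDENTIFIERS` set and the record fields are hypothetical, and dropping a handful of fields is not by itself enough to make a data set de-identified under HIPAA.

```python
# Hypothetical list of direct identifiers; a real project would follow its
# institution's de-identification standard rather than this short list.
DIRECT_IDENTIFIERS = {"name", "mrn", "phone", "email", "street_address"}


def strip_direct_identifiers(record):
    """Return a copy of the record without the listed identifier fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS
    }


# Made-up patient record for illustration.
record = {
    "name": "Jane Doe",
    "mrn": "A001",
    "phone": "555-0100",
    "diagnosis": "type 2 diabetes",
    "weight_kg": 80.0,
}

deidentified = strip_direct_identifiers(record)
print(deidentified)
```

The function returns a new dictionary and leaves the original record unchanged, which makes it easier to keep the identified source data under separate, stricter access controls.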

HIPAA-limited clinical data sets include observational patient information such as dates of admission, discharge, service, birth, and death, as well as city, state, zip codes with five digits or more, and ages expressed in years, months, days, or hours. These data sets may be used or shared for research, public health, or healthcare operations without a patient’s consent or a HIPAA waiver, typically under a data use agreement.

Conclusion

In this introductory piece, I’ve explained what clinical data science entails, as well as the differences between it and other closely related disciplines. We have also seen the significance of the EHR in the field of clinical data science. We looked at the sources from which clinical data are generated, as well as how they are stored in and retrieved from a database in order to address clinical questions. Finally, we examined the regulatory constraints on the use of clinical data. In the subsequent articles in this series, with clinical data science serving as the context, the focus will be on the “further analytical operation” that follows data retrieval.

References

https://en.wikipedia.org/wiki/Data_retrieval#:~:text=In%20order%20to%20retrieve%20the,used%20to%20prepare%20the%20queries.

https://www.researchgate.net/publication/332988365_Clinical_Data_Sources_and_Types_Regulatory_Constraints_Applications
