
My academic research centers on identifying datasets for health research, harmonizing them, and combining (pooling) the individual datasets to analyze them together. This means combining datasets across populations, study sites, or countries. It also means harmonizing variables so that they can be analyzed together effectively. In other words, I work in the data pooling field, where I have worked full time since 2017.
I will outline the methodology I follow to extract data from individual datasets and to combine them into one pooled dataset ready for analysis. It is based on over seven years of experience working in academic environments around the world. This story includes code in R.
Data pooling – what is it?
In most settings we collect new data (primary data collection) or work with a single dataset that is already available for analysis. This dataset can come from one hospital, a specific population (e.g., an epidemiological study conducted in a community), or a health survey conducted throughout a country (i.e., a nationally representative health survey such as NHANES).
Nonetheless, we can also combine, or pool, multiple individual datasets. Pooling maximizes statistical power and minimizes statistical uncertainty. It also improves the representativeness of the data, because the pooled dataset includes a larger and more diverse sample. So there are multiple benefits to working with pooled datasets, though there are many challenges as well. Here, I will share a few tips and R code to address some of these pragmatic challenges.
What steps will I cover here?
I will cover two key steps in the data pooling process.
First, extracting and harmonizing data from individual datasets. This means reading one dataset, keeping only the variables of interest, and naming them consistently so that they can be combined with the variables from other datasets. We will also harmonize the variables in terms of units and labels. The output of this first step will be a processed dataset: a version of the original dataset that contains only the variables of interest in a consistent format.
Second, reading the processed individual datasets into one pooled dataframe. In this step we will read the datasets we extracted in the first step and combine (pool or append) them. Because all the variables in the individual datasets were extracted using the same names, units, and labels, appending the datasets should not produce errors. The output of this second step will be a pooled dataset of size N x V, where N is the total number of observations across the unique datasets and V is the number of variables in each dataset. For example, if we processed 10 datasets in step one, each with 100 observations and 10 variables, the pooled dataset will have 1,000 observations (10 datasets x 100 observations) and 10 variables.
Step 1: Processing each individual dataset.
In this step I will: a) read one dataset of interest; b) extract the variables we are interested in; c) process the variables so that they have consistent names, units, and labels; and d) save a new dataset containing only the selected variables. The output: a new dataset with only the relevant variables in a consistent format.
This example assumes you are already familiar with the datasets you are working with: you know what variables they contain and what those variables are named.
First, I will read the dataset I will work with. It could be in any format, such as CSV or Excel, or it could be a dataset from Stata, SAS, or SPSS. Let's assume I am reading a CSV file.
data <- read.csv("my/path/to/the/dataset.csv")
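If your files come from Stata, SPSS, or SAS instead, the haven package reads them directly. A minimal sketch, using hypothetical file paths; note that haven returns labelled columns, which you can strip with haven::zap_labels() if you prefer plain values:
library(haven) # install.packages("haven") if needed.
data <- read_dta("my/path/to/the/dataset.dta") # Stata
data <- read_sav("my/path/to/the/dataset.sav") # SPSS
data <- read_sas("my/path/to/the/dataset.sas7bdat") # SAS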
Second, I will create a new dataframe including only the variables of interest. In this example, I will extract 10 variables in total, including one variable to identify the original dataset (original_data_id). The original_data_id variable will be critical to identify the original unique datasets at the end, when we have pooled all the datasets.
Note that, although the original variable names are self-explanatory, we are still creating a dataframe with new variable names. This is because the same original variables may have different names in other datasets. For instance, in this particular dataset the variable sex is named GENDER_ASKED, but in another dataset it may be named sex, Sex, gender, or just a code such as question_0001. So, for all the variables to have consistent names, we create this new dataframe. The variable names should be easy to remember and clear to understand.
attach(data) # This attaches data to the search path so that we do not have to reference the object in every line.
df <- data.frame(
original_data_id = "MOCK_DATASET_2025", # Note that I created this variable as a string to name this dataset. You can use any name, as long as it's meaningful and easy to remember.
subject_id = PATIENT_ID,
sex = GENDER_ASKED,
age = as.numeric(AGE_YEARS),
race = RACE_ETHNICITY,
edu = as.numeric(SCHOOLING_YEARS),
weight = as.numeric(WEIGHT_MEASURED),
height = as.numeric(HEIGHT_MEASURED),
sbp_1 = as.numeric(SYSTOL_1), # Systolic blood pressure 1.
sbp_2 = as.numeric(SYSTOL_2) # Systolic blood pressure 2.
)
detach(data) # This removes data from the search path.
After running this R code, there will be two elements in the environment: the full original dataset (data) and the new dataset with the subset of variables (df).
Third, we will harmonize all the variables in terms of units and labels. This process will vary tremendously between datasets. I recommend mapping all categorical variables to numbers and keeping track of what each numeric label refers to. For instance, in this case the variable sex was originally coded as "MEN" and "WOMEN". We will change this variable from a string to an integer, mapping "MEN" to 1 and "WOMEN" to 2.
df$sex <- ifelse(df$sex == "MEN", 1,
ifelse(df$sex == "WOMEN", 2, NA))
This recoding is needed to keep consistency with the other datasets. In some datasets the variable sex may already be coded as 1 and 2, but in others it may be labelled "M" and "W", or "male" and "female". So, to avoid inconsistencies across datasets when they are finally combined (pooled), we should recode them into consistent labels. We will repeat this recoding or relabelling for all variables as needed.
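When a dataset uses several of those labellings at once, a named lookup vector can replace the nested ifelse calls. A minimal sketch, where the extra labels ("M", "MALE", and so on) are hypothetical examples of what you might encounter:
sex_map <- c("MEN" = 1, "M" = 1, "MALE" = 1, "WOMEN" = 2, "W" = 2, "FEMALE" = 2)
df$sex <- unname(sex_map[toupper(trimws(df$sex))]) # Labels not in the map become NA.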
We might also need to transform numeric variables. For example, the variable height may originally be in centimeters when we want it in meters. We proceed to transform this variable, and we will apply the same transformation to all the datasets we are working with. The goal is to have all variables, across datasets, in the same units.
df$height <- df$height / 100 # Convert from centimeters to meters.
Bonus track! Alongside the variables we want to extract from the original datasets, I recommend also creating string variables with the units of the original variables. See an example below. This is the same code as above, but in this case I included XXX_units variables. These show the units in which the original variables are recorded. For example, age is in years, weight in kilograms, height in centimeters, and blood pressure in millimeters of mercury. This approach will help you keep track of the units.
This will also help you if you do not want to transform the units of the original variables as we did above with height, from centimeters to meters. If you do not want to transform the units now, you can simply update the XXX_units variables with the correct units and transform the variables to consistent units at a later stage, using the XXX_units variables as a guide.
attach(data)
df <- data.frame(
original_data_id = "MOCK_DATASET_2025",
subject_id = PATIENT_ID,
sex = GENDER_ASKED,
age = as.numeric(AGE_YEARS), age_units = "years",
race = RACE_ETHNICITY,
edu = as.numeric(SCHOOLING_YEARS),
weight = as.numeric(WEIGHT_MEASURED), weight_units = "kg",
height = as.numeric(HEIGHT_MEASURED), height_units = "cm",
sbp_1 = as.numeric(SYSTOL_1), sbp_1_units = "mmHg",
sbp_2 = as.numeric(SYSTOL_2), sbp_2_units = "mmHg"
)
detach(data)
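As a minimal sketch of that later-stage conversion (assuming the pooled dataframe from Step 2 is called pooleddata and some datasets still carry height in centimeters):
pooleddata$height <- ifelse(pooleddata$height_units == "cm", pooleddata$height / 100, pooleddata$height)
pooleddata$height_units <- "m" # Update the unit label after converting.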
Fourth, I recommend running the following two lines of code. The first line prints a summary of the dataset. Take this opportunity to double-check that all string or categorical variables have been mapped to numbers, and that all numeric variables are in the desired units. Similarly, the second line prints the class of each variable, whether character, integer, or numeric. Take this opportunity to double-check that everything is in order.
summary(df)
sapply(df, class)
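If you prefer checks that fail loudly rather than relying on visual inspection, stopifnot() works well. A short sketch; the expected values are assumptions from this example, so adjust them to your variables:
stopifnot(all(df$sex %in% c(1, 2, NA))) # Only the mapped codes, or missing, remain.
stopifnot(is.numeric(df$age), is.numeric(df$height)) # Numeric variables stayed numeric.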
Finally, save the extracted dataset in a folder. In this folder you will save all the datasets you extract. For instance, if you are going to process 10 datasets, you will end up with 10 new datasets in this folder. I recommend saving the dataset under the same name you stored in the variable original_data_id.
write.csv(df, "path/to/folder/where/to/store/new/datasets/MOCK_DATASET_2025.csv", row.names = FALSE)
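To guarantee that the file name always matches original_data_id, you can also build it programmatically. A sketch, where out_dir stands in for the folder path above:
out_dir <- "path/to/folder/where/to/store/new/datasets"
write.csv(df, file.path(out_dir, paste0(unique(df$original_data_id), ".csv")), row.names = FALSE)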
At the end of this first step, you will have saved a new dataframe with only the variables of interest, with consistent names and format, so that it can be merged with other datasets. You should also save the R script with all the lines of code I showed: reading the original dataset, creating a new dataset with only the variables of interest, harmonizing variables, checking formats, and saving the new extracted dataset. I recommend saving this R script where you keep the original dataset. Why? So that you can simply copy and paste the script into the folder where you have the next dataset, and you will only need to update some lines of code to extract it. For example, you may need to update the original names of the variables and some of the lines of code for relabelling the variables.
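If you want to cut down the copy-paste editing even further, one option is to wrap the extraction in a small helper function and pass the dataset-specific column names as arguments. This is only a sketch with a hypothetical helper (extract_dataset) covering a few variables; the remaining variables follow the same pattern:
extract_dataset <- function(path, dataset_id, id_var, sex_var, age_var) {
  raw <- read.csv(path)
  data.frame(
    original_data_id = dataset_id,
    subject_id = raw[[id_var]],
    sex = raw[[sex_var]],
    age = as.numeric(raw[[age_var]])
  )
}
df <- extract_dataset("my/path/to/the/dataset.csv", "MOCK_DATASET_2025", id_var = "PATIENT_ID", sex_var = "GENDER_ASKED", age_var = "AGE_YEARS")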
Step 2: Read the extracted datasets and append.
In Step 1 we read and processed each of the datasets we want to pool. At the end of Step 1 we had separate datasets, one for each original dataset, but these extracted datasets contain only the variables of interest, with consistent names, labels, and units. Thus, they can be appended without conflicts from differing names or labels. In Step 2 I will read the extracted datasets and append them, so that we can analyze all datasets as one.
The following chunk of code will:
A. Load the dplyr library.
B. Clean the environment so that it starts empty.
C. Set the working directory to the folder containing all the extracted datasets.
D. Create a string variable (datasets) in your environment with the names of the CSV files in the directory.
E. Read all the CSV files named in the datasets string variable (a for loop). This loads all the datasets into your environment. Depending on how many there are, or how big they are, this might take a minute. After this line of code, your environment will contain all the CSV files you read (all the extracted CSV files in the directory), the datasets string variable, and the i variable from the for loop.
F. Drop the datasets (string variable) and i (for-loop index) objects from the environment.
G. Check whether the number of objects in the environment equals the number of CSV files in the directory. In this example, we knew we were going to extract data from 10 original datasets, so we expect to have loaded 10 CSV datasets into the environment. If this logical statement prints FALSE, we should double-check.
H. Condense all the individual datasets in the environment into a list (datalist).
I. Use the rbind.fill function from the plyr package to append all the datasets in the list (datalist). This outputs the new dataframe pooleddata, which contains all the datasets we processed.
J. Clean the environment again so that only the new pooleddata remains.
library(dplyr)
rm(list = ls())
setwd("path/to/folder/where/to/store/new/datasets")
datasets = list.files(pattern = "\\.csv$") # "\\.csv$" is a regex matching files ending in .csv; "*.csv" is a glob, not a regex.
for(i in seq_along(datasets)) assign(datasets[i], read.csv(datasets[i])) # Read each CSV into an object named after its file.
rm(datasets,i)
length(ls()) == 10 # ls() lists the objects in the environment; expect one per extracted dataset.
datalist <- lapply(ls(), function(x) if (is.data.frame(get(x))) get(x)) # Non-dataframe objects become NULL; rbind.fill drops NULLs.
pooleddata <- plyr::rbind.fill(datalist)
rm(list = setdiff(ls(), "pooleddata"))
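If you prefer to avoid assign() and the environment clean-up altogether, a more compact equivalent (a sketch, assuming the same working directory) reads the files straight into a list:
datalist <- lapply(list.files(pattern = "\\.csv$"), read.csv)
pooleddata <- plyr::rbind.fill(datalist)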
The number of observations (rows) in pooleddata should equal the sum of the observations across all the extracted datasets. Similarly, the number of variables in pooleddata should equal the number of variables in each extracted dataframe. For example, if we extracted 10 datasets, each with 100 observations and 10 variables, pooleddata should have 1,000 observations (10 x 100) and 10 variables. You can verify this with the quick checks below.
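A minimal sketch of those checks:
dim(pooleddata) # Expect 1000 rows and 10 columns in this example.
table(pooleddata$original_data_id) # How many observations each original dataset contributed.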
You have now gone from ten original dataframes to one pooled dataset ready to analyze!
Final remarks
This has been a quick tutorial to help you navigate some pragmatic steps when working on data pooling for health research. We looked at how to extract and harmonize variables from the original datasets we want to pool, and how to combine (pool or append) the extracted dataframes into one pooled dataset ready for analysis. This is not a comprehensive guide, so stay tuned for more content on this topic with more details, recommendations, and practical tips!
If you found this code helpful, please share it with your friends and colleagues! Also, please give this story a thumbs up and leave a comment! Let me know if you'd like a full course on this subject. Feel free to connect with me on LinkedIn and follow me on Medium for more stories.