
Analyzing Health Surveys Made Easy with Functions in R

Solving the issue of having missing data in the variables for sampling design

Have you ever pooled many health surveys which have a complex sampling design and include variables such as primary sampling unit (PSU), stratum, and sampling weights? Only to find that some of these surveys did not have any of these variables at all (e.g., PSU and sampling weights only, not stratum)? Here’s an efficient solution packing the datasets in a list and applying functions that will deliver the results you’re looking for.

Photo by Scott Graham on Unsplash

In a pooled dataset of many health surveys, some of which may not have all three variables (PSU, stratum, and sampling weights), a complete-case analysis would keep only the surveys with data for all three variables. This could substantially reduce the sample size, because we would have to exclude the datasets with, for example, PSU and sampling weights but no data for the stratum variable.

My solution? Analyze each dataset independently, leveraging whatever variables it has (e.g., PSU and sampling weights only, not stratum). Unfortunately, analyzing one dataset at a time is inefficient. However, having faced this issue several times in over seven years of professional experience in research and dozens of peer-reviewed papers utilizing health surveys, I came up with a set of functions and steps.


  1. Drop missing observations in PSU, stratum, and sampling weights if the number of missing values differs from the sample size of that dataset. In other words, if a dataset has 100 observations and 20 have missing data in PSU, I will exclude those 20 observations. Conversely, if the number of missing values in PSU were 100 (i.e., PSU completely empty), I will keep the dataset as is, because this would suggest that the PSU variable was not available by design or was simply not included with the dataset, as may be the case with datasets we download.
  2. In each of the surveys, I will create a new variable called type that identifies which of the three variables, if any, are completely missing, covering all possible combinations. For example, type = 1 if all three variables are present, type = 2 if only PSU is completely missing, and so forth.
  3. Compute the metrics of interest using whatever variable(s) are available, leveraging the information in the type variable created in the previous step. For example, if type flags a dataset as only having PSU and sampling weights, we will compute the results for that dataset using those two variables, whereas for datasets with all three variables complete, the results will be computed using PSU, stratum, and sampling weights.

The trick is to put every dataset inside a list and apply the functions to each element in the list (i.e., to each dataset or survey in the list). That way, you can also work with each dataset independently if needed.

First, I will explain each function and what they do. Then, I will provide a practical example using the functions.


A. Deal with the missing values

These functions, when applied to each dataset, will exclude missing observations if the number of missing values is different from the total number of observations in the dataset.

For example, if the dataset has 100 people and 20 were missing the PSU variable, these 20 observations will be dropped. Conversely, if 100 people were missing the PSU variable, we will keep the dataset as is.

This approach assumes that, if all observations had missing data for the PSU variable, it is because that was the case by design, or this information was not provided with the dataset. In either case, there is little or nothing we can do. But we can still leverage whatever information is available. That is, if PSU is completely missing but the dataset has stratum and sampling weights, we can leverage these variables in our analysis.

countNA <- function(x) sum(is.na(x))  # count missing values in a vector

# wstep2 is the name of the sampling-weight variable in these surveys
cleaning_svy <- function(df){
  if (countNA(df$psu) != nrow(df)){
    df <- df[which(!is.na(df$psu)), ]
  }
  if (countNA(df$stratum) != nrow(df)){
    df <- df[which(!is.na(df$stratum)), ]
  }
  if (countNA(df$wstep2) != nrow(df)){
    df <- df[which(!is.na(df$wstep2)), ]
  }
  return(df)
}
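A quick sanity check on toy data illustrates the rule (the data are invented; the functions from above are repeated so the snippet runs on its own):

```r
# Functions repeated from above so this snippet is self-contained.
countNA <- function(x) sum(is.na(x))

cleaning_svy <- function(df){
  if (countNA(df$psu) != nrow(df))     df <- df[which(!is.na(df$psu)), ]
  if (countNA(df$stratum) != nrow(df)) df <- df[which(!is.na(df$stratum)), ]
  if (countNA(df$wstep2) != nrow(df))  df <- df[which(!is.na(df$wstep2)), ]
  return(df)
}

# Survey 1: PSU partially missing -> that observation is dropped.
svy1 <- data.frame(psu = c(1, 2, NA, 4), stratum = 1:4, wstep2 = runif(4))

# Survey 2: stratum entirely missing -> dataset kept as is.
svy2 <- data.frame(psu = 1:4, stratum = NA, wstep2 = runif(4))

nrow(cleaning_svy(svy1))  # 3: the row with missing PSU is gone
nrow(cleaning_svy(svy2))  # 4: empty stratum kept, nothing dropped
```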

B. Flagging the datasets

We have excluded the few observations with missing data for PSU, stratum, or sampling weights, yet we have kept the datasets in which any of these variables was entirely missing.

In this step, we will create a new variable that will flag each dataset according to the variables it has. For example, we will identify whether the dataset has all three variables, only two (e.g., PSU and sampling weights but not stratum), only one (e.g., PSU only), or none.

This new variable, which we will name type, will be used later to compute the results using the available data. For instance, if a dataset is flagged as having all three variables, we will use the three variables to compute the results, whereas if a dataset is flagged as only having PSU and sampling weights, we will compute the results using these two variables only.

grouping_svy <- function(df){
  df$type <- ifelse(!is.na(df$psu) & !is.na(df$stratum) & !is.na(df$wstep2), 1,  # all three
             ifelse( is.na(df$psu) & !is.na(df$stratum) & !is.na(df$wstep2), 2,  # stratum & weights
             ifelse(!is.na(df$psu) &  is.na(df$stratum) & !is.na(df$wstep2), 3,  # psu & weights
             ifelse(!is.na(df$psu) & !is.na(df$stratum) &  is.na(df$wstep2), 4,  # psu & stratum
             ifelse(!is.na(df$psu) &  is.na(df$stratum) &  is.na(df$wstep2), 5,  # psu
             ifelse( is.na(df$psu) & !is.na(df$stratum) &  is.na(df$wstep2), 6,  # stratum
             ifelse( is.na(df$psu) &  is.na(df$stratum) & !is.na(df$wstep2), 7,  # weights
             ifelse( is.na(df$psu) &  is.na(df$stratum) &  is.na(df$wstep2), 8, NA)))))))) # none
  return(df)
}
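A toy check of the flags (data invented; the function is repeated so the snippet runs standalone):

```r
# Function repeated from above so this snippet is self-contained.
grouping_svy <- function(df){
  df$type <- ifelse(!is.na(df$psu) & !is.na(df$stratum) & !is.na(df$wstep2), 1,
             ifelse( is.na(df$psu) & !is.na(df$stratum) & !is.na(df$wstep2), 2,
             ifelse(!is.na(df$psu) &  is.na(df$stratum) & !is.na(df$wstep2), 3,
             ifelse(!is.na(df$psu) & !is.na(df$stratum) &  is.na(df$wstep2), 4,
             ifelse(!is.na(df$psu) &  is.na(df$stratum) &  is.na(df$wstep2), 5,
             ifelse( is.na(df$psu) & !is.na(df$stratum) &  is.na(df$wstep2), 6,
             ifelse( is.na(df$psu) &  is.na(df$stratum) & !is.na(df$wstep2), 7,
             ifelse( is.na(df$psu) &  is.na(df$stratum) &  is.na(df$wstep2), 8, NA))))))))
  return(df)
}

complete <- data.frame(psu = 1:3, stratum = 1, wstep2 = 0.5)   # all three present
no_strat <- data.frame(psu = 1:3, stratum = NA, wstep2 = 0.5)  # psu & weights only

unique(grouping_svy(complete)$type)  # 1
unique(grouping_svy(no_strat)$type)  # 3
```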

C. Computing results

We excluded missing data while keeping the datasets in which any of PSU, stratum, or sampling weights was entirely missing. We also flagged the datasets depending on which variables they have. We will now compute the results of interest.

In this example, we will compute the survey-adjusted mean for a variable called age. This is a numeric variable for which we will compute the mean and the 95% confidence intervals leveraging the complex sampling design. That means that the results will be representative of the underlying population. For example, if we were working with a health survey designed to deliver results representative of a country, we will compute the mean age in that country.

First, this function will be applied to each dataset and will look at the variable type to understand which sampling design variables are available, whether PSU, stratum, or sampling weights. Second, the function will compute the result of interest, in this case the mean of age. Third, the results will be stored in a dataframe, and the dataframe will be returned.

You can edit the second step to compute your metric of interest, whether it’s the mean of another numeric variable or the proportion of a categorical variable. In this example, the function results_svy_mean_age will generate the mean age by sex.

library(survey)  # svydesign() and svyby() come from the survey package

results_svy_mean_age <- function(df){
  options(survey.lonely.psu = "adjust", survey.adjust.domain.lonely = TRUE)
  if (unique(df$type) == 1) {            # psu, stratum & weights
    dsub <- svydesign(id = ~psu,
                      strata = ~stratum,
                      weights = ~wstep2,
                      data = df,
                      nest = TRUE) }
  if (unique(df$type) == 2) {            # stratum & weights
    dsub <- svydesign(id = ~1,
                      strata = ~stratum,
                      weights = ~wstep2,
                      data = df,
                      nest = TRUE) }
  if (unique(df$type) == 3) {            # psu & weights
    dsub <- svydesign(id = ~psu,
                      strata = NULL,
                      weights = ~wstep2,
                      data = df,
                      nest = TRUE) }
  if (unique(df$type) == 4) {            # psu & stratum
    dsub <- svydesign(id = ~psu,
                      strata = ~stratum,
                      weights = NULL,
                      data = df,
                      nest = TRUE) }
  if (unique(df$type) == 5) {            # psu only
    dsub <- svydesign(id = ~psu,
                      strata = NULL,
                      weights = NULL,
                      data = df,
                      nest = TRUE) }
  if (unique(df$type) == 6) {            # stratum only
    dsub <- svydesign(id = ~1,
                      strata = ~stratum,
                      weights = NULL,
                      data = df,
                      nest = TRUE) }
  if (unique(df$type) == 7) {            # weights only
    dsub <- svydesign(id = ~1,
                      strata = NULL,
                      weights = ~wstep2,
                      data = df,
                      nest = TRUE) }
  if (unique(df$type) == 8) {            # none
    dsub <- svydesign(id = ~1,
                      strata = NULL,
                      weights = NULL,
                      data = df,
                      nest = TRUE) }

  age_m <- svyby(~age,
                 ~sex,
                 design = dsub,
                 vartype = c("ci"),
                 svymean)

  age_m <- as.data.frame(age_m)

  return(age_m)
}

D. Put it all together

Let’s assume that your pooled dataset containing many health surveys is called pooleddata, and that there is a variable called study_id that identifies each individual survey.

df <- list()
for (i in unique(pooleddata$study_id)){
  df[[i]] <- pooleddata[which(pooleddata$study_id == i),]
}
df <- lapply(df, cleaning_svy)

We will create an empty list and store each dataset inside it as an independent element, using study_id to identify each unique survey. For instance, if pooleddata pools 10 surveys, our list will have ten elements.
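As a more concise base-R alternative to the loop, split() builds the same list in one line (sketched here on invented pooled data with the same study_id column):

```r
# Invented toy pooled data with two surveys identified by study_id.
pooleddata <- data.frame(study_id = c("svyA", "svyA", "svyB"),
                         age = c(30, 41, 25))

# split() returns a list with one data frame per unique study_id,
# equivalent to looping over unique(pooleddata$study_id).
df <- split(pooleddata, pooleddata$study_id)

length(df)     # 2
names(df)      # "svyA" "svyB"
nrow(df$svyA)  # 2
```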

We will apply the function cleaning_svy to each element of the list, in other words, to each individual dataset stored in the list. Remember, we explained the function cleaning_svy in step A above.

At the end of step A, we will have the same list containing our individual surveys, having excluded the few missing observations in the sampling design variables while keeping the datasets in which any of these variables (PSU, stratum, or sampling weights) was entirely missing.

We will now create the type variable in each dataset inside the list (i.e., in each survey). As a result, we will have the same list containing the unique surveys, each with a new variable named type that will flag what sampling design variables are available.

df <- lapply(df, grouping_svy)

As a sanity check, I like to print the value of the type variable in each dataset. This print should only show one value for each survey.

for (i in seq_along(df)) {     # each MUST only have one value
  print(unique(df[[i]]$type))
}

Next, you can apply our function results_svy_mean_age to all the datasets in our list.

A <- lapply(df, results_svy_mean_age)

The output of A will look like:

Screenshot by the author. The names of the surveys were blurred for privacy.

You can further process A so that it looks like a tidy dataframe. Now you have a dataframe where the first column is the unique identifier of each original survey, followed by the sex indicator used to stratify the mean age, and the mean age together with the lower and upper bounds of the 95% confidence interval. Notably, the mean age and the 95% confidence intervals account for the complex sampling design variables (PSU, stratum, and sampling weights) available in each survey.
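One way to stack A into a single tidy data frame (sketched on an invented stand-in for A; the column names assume svyby's defaults of sex, age, ci_l, and ci_u):

```r
# Invented stand-in for A: one small svyby-style result per survey.
A <- list(svyA = data.frame(sex = c("F", "M"), age = c(34.1, 35.9),
                            ci_l = c(32.0, 33.8), ci_u = c(36.2, 38.0)),
          svyB = data.frame(sex = c("F", "M"), age = c(29.4, 30.2),
                            ci_l = c(27.1, 28.0), ci_u = c(31.7, 32.4)))

# Prefix each result with its survey id, then stack into one data frame.
tidy <- do.call(rbind, Map(function(id, res) cbind(study_id = id, res),
                           names(A), A))
rownames(tidy) <- NULL

head(tidy)  # study_id, sex, age, ci_l, ci_u
```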

Screenshot by the author. The names of the surveys were blurred for privacy.

In summary

If you are pooling multiple health surveys with a complex sampling design, chances are that some of those surveys are missing one or more of the three variables needed to compute your results accounting for the sampling design (PSU, stratum, and sampling weights). Instead of dropping the surveys where these variables are completely missing, you can analyze each survey independently, leveraging the variables they do have (e.g., PSU and sampling weights but not stratum). In three simple steps, we provide R functions to do so!


If you found this code useful, share it with your friends and colleagues! Also, please give this story your love with a thumbs up and leave a comment! Feel free to connect with me on LinkedIn.

