
Spatial Challenges in RCTs

Location, Location, Location

Images created by using AI tools such as Photo Realistic GPT and Super Describe.

Randomized Controlled Trials (RCTs) are a standard approach to studying cause-effect relationships and identifying the impact or effectiveness of new treatments, interventions, and policies. Still, the reliability and applicability of their outcomes may be significantly influenced by spatial factors (i.e., features related to the geographical contexts in which the studies are implemented). Understanding and tackling these spatial issues, especially where treatments are applied in real-world settings, is critical to preventing and mitigating potential distortions and biases in RCT results. But what exactly are these spatial factors, and how can they skew the results of an RCT? More importantly, how can researchers effectively manage these spatially induced variations to maintain the integrity of their studies?

Why Do Spatial Factors Matter?

When I refer to spatial factors in the context of RCTs, I mean that geographical elements often play a role in those studies, and not accounting for them can lead to severe misinterpretations. These factors can include the location’s climate, population density, cultural practices, health infrastructure, and even socioeconomic conditions.

Spatial heterogeneities may lead to significant variations in RCT outcomes across different regions that are not purely attributable to the treatment under study. Those variations pose challenges, for example, in generalizing the findings across different settings.

Let’s imagine a medication X that works well in a temperate climate but may have different effects in a tropical climate due to differences in disease transmission patterns, storage conditions of the medication, or genetic differences in the population.

In this case, the true outcomes are masked if regional factors are not taken into account. Medication X could then be wrongly recommended for all locations, possibly even threatening the lives of people in the tropical area.

Now imagine it was your responsibility… How does this make you feel? Do you think spatial factors matter? Well, it is becoming clearer that RCTs could produce results that are not universally applicable, leading to ineffective or suboptimal recommendations.

Addressing spatial factors helps ensure that research findings can effectively translate into practical advice or policies adjusted to diverse environments and population features. This highlights the importance of designing studies that are as representative and inclusive as possible to accommodate the variability introduced by these factors.

Spatial factors can influence RCTs in multiple ways. For example:

  • Contagion Effects => Treatment effects spill over to nearby control groups or units. This is known as spatial contagion or interference and may happen when individuals from control groups interact or are in close contact with individuals in the treated group. Let’s simulate it!

Assume that we must evaluate the implementation of program Z over time across a specific square area. We predefine control and treatment groups, with the treatment groups initially concentrated in the bottom-right quarter of the grid.

This setup allows us to visualize how treatment groups can influence adjacent units cumulatively during program Z (i.e., there is occasionally a "wave" interference effect from treated to untreated units).

To simplify, here, the contagion effect is determined purely by the number of treated neighbors. If this is the case, we can simulate multiple scenarios accounting for, for example, a dynamic contagion probability for each subsequent wave, where the spreading probability (positively) depends on a predetermined number of affected neighboring cells.

The following Python script simulates the progression of the contagion based on arbitrary parameters that you can change at your convenience (e.g., colors, grid size, pathname, probabilities, seed). To run this script, you must have access to numpy and matplotlib libraries.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

def plot_grid(grid, title, filename):
    # Define custom colormap
    cmap = mcolors.ListedColormap(['#1f77b4', '#ff8000', '#ffb266'])
    bounds = [0, 0.5, 1.5, 2]
    norm = mcolors.BoundaryNorm(bounds, cmap.N)

    # Plotting
    plt.figure(figsize=(8, 5))
    plt.imshow(grid, cmap=cmap, norm=norm)
    plt.colorbar(ticks=[0, 1, 2], label='Group', 
                 format=plt.FuncFormatter(lambda val, loc: ['Control', 'Treatment', 'Interference'][loc]))
    plt.title(title)
    plt.grid(False)
    plt.savefig(filename)
    plt.close()

# Apply contagion based on neighboring treatment
def apply_contagion(grid, base_probability=0.3):
    new_grid = grid.copy()
    grid_size = grid.shape[0]
    for x in range(grid_size):
        for y in range(grid_size):
            # Adjust probability based on the number of contagion effect neighbors
            affected_neighbors = 0
            for dx in [-1, 0, 1]:
                for dy in [-1, 0, 1]:
                    if dx == 0 and dy == 0:
                        continue
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < grid_size and 0 <= ny < grid_size and grid[nx, ny] == 2:
                        affected_neighbors += 1
            contagion_probability = base_probability + 0.05 * affected_neighbors

            # Apply contagion if the cell is initially untreated
            if grid[x, y] == 0:
                for dx in [-1, 0, 1]:
                    for dy in [-1, 0, 1]:
                        if dx == 0 and dy == 0:
                            continue
                        nx, ny = x + dx, y + dy
                        if 0 <= nx < grid_size and 0 <= ny < grid_size and grid[nx, ny] > 0:
                            if np.random.rand() < contagion_probability:
                                new_grid[x, y] = 2  # Mark as contagion effect
                                break
    return new_grid

def calculate_percentages(grid):
    total_cells = grid.size
    control_count = np.sum(grid == 0)
    treatment_count = np.sum(grid == 1)
    contagion_count = np.sum(grid == 2)
    return (control_count / total_cells * 100, treatment_count / total_cells * 100, contagion_count / total_cells * 100)

# Parameters
grid_size = 40
np.random.seed(0)

# Initialize the grid with treatment concentrated in the bottom-right quarter
initial_grid = np.zeros((grid_size, grid_size), dtype=int)
for x in range(grid_size):
    for y in range(grid_size):
        if x >= grid_size // 2 and y >= grid_size // 2:
            initial_grid[x, y] = np.random.choice([0, 1], p=[0.3, 0.7])  # Higher probability of treatment in the bottom-right
        else:
            initial_grid[x, y] = np.random.choice([0, 1], p=[0.9, 0.1])  # Mostly control in other areas

# Calculate percentages for the initial setup
control_percent, treatment_percent, contagion_percent = calculate_percentages(initial_grid)
print(f'Initial Setup: Control: {control_percent:.2f}%, Treatment: {treatment_percent:.2f}%, Interference: {contagion_percent:.2f}%')

# Generate initial setup image
plot_grid(initial_grid, 'Initial Setup: No Interference', '/Pathname/1initial_setup.png')

# First contagion wave
first_wave = apply_contagion(initial_grid, base_probability=0.3)
control_percent, treatment_percent, contagion_percent = calculate_percentages(first_wave)
print(f'First Wave: Control: {control_percent:.2f}%, Treatment: {treatment_percent:.2f}%, Interference: {contagion_percent:.2f}%')
plot_grid(first_wave, 'Outcome: First Contagion Wave', '/Pathname/2first_wave.png')

# Second contagion wave from the result of the first wave
second_wave = apply_contagion(first_wave, base_probability=0.3)
control_percent, treatment_percent, contagion_percent = calculate_percentages(second_wave)
print(f'Second Wave: Control: {control_percent:.2f}%, Treatment: {treatment_percent:.2f}%, Interference: {contagion_percent:.2f}%')
plot_grid(second_wave, 'Outcome: Second Contagion Wave', '/Pathname/3second_wave.png')

# Third contagion wave from the result of the second wave
third_wave = apply_contagion(second_wave, base_probability=0.3)
control_percent, treatment_percent, contagion_percent = calculate_percentages(third_wave)
print(f'Third Wave: Control: {control_percent:.2f}%, Treatment: {treatment_percent:.2f}%, Interference: {contagion_percent:.2f}%')
plot_grid(third_wave, 'Outcome: Third Contagion Wave', '/Pathname/4third_wave.png')

Below are the outputs from my code. As you can see, a contagion effect from implementing program Z may lead to more widespread treatment over time if it is not corrected or accounted for.

Contagion Effects: Outputs from my Python script.

In the initial setup, the control group covered 73.6% of the area and the treatment group covered the rest.

After the first contagion wave, interference had reached 22.4% of the area.

After the second wave, that figure increased to 50.9%.

At the end of the simulation, almost the entire control group was affected by the treatment: spatial contagion covered 67.3% of the total area, and the control group was reduced to just 6.3% of the area.
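To see why this kind of interference matters for estimation, here is a minimal sketch that is separate from the grid simulation above. All parameters are hypothetical assumptions chosen for illustration: a true effect of 2.0, 30% of control units exposed to treated neighbors, and exposed controls receiving half of the effect. Under these assumptions, the naive difference in means shrinks toward zero because part of the control group is contaminated.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000               # hypothetical number of units
true_effect = 2.0        # hypothetical effect of program Z on treated units
spillover_share = 0.5    # assumed fraction of the effect leaking to exposed controls
exposed_rate = 0.3       # assumed share of control units exposed to treated neighbors

# Random assignment, then mark which control units are exposed to spillover
treated = rng.random(n) < 0.5
exposed = (~treated) & (rng.random(n) < exposed_rate)

# Outcome = baseline + full effect for treated + partial effect for exposed controls
baseline = rng.normal(10.0, 1.0, size=n)
outcome = baseline + true_effect * treated + spillover_share * true_effect * exposed

# Naive estimate: compares treated units against all controls, contaminated or not
naive = outcome[treated].mean() - outcome[~treated].mean()

# Estimate that keeps only the controls that were not exposed
clean = outcome[treated].mean() - outcome[(~treated) & (~exposed)].mean()

print(f"True effect: {true_effect:.2f}")
print(f"Naive estimate (all controls): {naive:.2f}")
print(f"Estimate with unexposed controls only: {clean:.2f}")
```

In a real study, exposure is rarely observed this cleanly, so dropping exposed controls is only an illustration of the bias, not a general remedy.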

  • Heterogeneity of Treatment Effects => Variations in the economic and social environment depending on the location can also change the results of the same intervention.

For example, consider an RCT examining the impact of a new agricultural technology on crop yields. Researchers must consider soil quality, climate conditions, and market proximity if the technology is implemented in randomly selected villages. Those variables can differ spatially and influence the technology’s effectiveness, and ignoring them could lead to incorrect inferences about the technology’s overall efficacy.

  • External Validity => RCT outcomes may be sensitive to the specific locations in which they were conducted. This is one of the most common critiques of the RCT approach. In this case, the generalizability of the results to other settings is limited, particularly if the locations were not selected randomly.
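The heterogeneity and external validity points above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of real data: the two regions, their true effects (3.0 and 0.0), and the outcome scale are all hypothetical. It compares a pooled difference in means with region-specific estimates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical regions where the same intervention has different true effects
true_effects = {"fertile_soil": 3.0, "poor_soil": 0.0}
n_per_region = 2_000

frames = []
for region, effect in true_effects.items():
    treated = rng.integers(0, 2, size=n_per_region)  # simple randomization
    outcome = 50.0 + effect * treated + rng.normal(0.0, 5.0, size=n_per_region)
    frames.append(pd.DataFrame({"region": region, "treated": treated, "outcome": outcome}))
df = pd.concat(frames, ignore_index=True)

def diff_in_means(d):
    """Difference in mean outcome between treated and control rows."""
    return d.loc[d["treated"] == 1, "outcome"].mean() - d.loc[d["treated"] == 0, "outcome"].mean()

pooled = diff_in_means(df)
by_region = {region: diff_in_means(group) for region, group in df.groupby("region")}

print(f"Pooled estimate: {pooled:.2f}")
for region, estimate in by_region.items():
    print(f"{region}: {estimate:.2f}")
```

The pooled estimate sits between the two regional effects and describes neither region well, which is why extrapolating it to a new location can be misleading.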

Addressing Spatial Issues in RCTs

At this point, you may be wondering how researchers usually deal with these spatial factors to avoid inferential biases or causal misunderstandings.

Some common techniques are:

  • Stratified (Geographic) Randomization => Stratify the sample based on geographical features before randomly assigning treatments. This geographic randomization helps mitigate the influence of spatial autocorrelation. It minimizes the bias that may arise from an uneven distribution of spatial factors, so that each stratum is well represented in both the treatment and control groups (i.e., treatment and control groups are "evenly" distributed across different spatial contexts).

For a better understanding, I invite you to check two of my previous posts, one about spatial cross-validation in geographic data analysis and one about synthetic control and spatial analysis.

Spatial Cross-Validation in Geographic Data Analysis

Synthetic Control and Spatial Analysis: Insights for Causality and Policy

The following Python script shows an example of geographic randomization. Feel free to change parameters such as average income, colors, latitude, longitude, population density, and seed at your convenience. To run this script, you must have access to the pandas, numpy, and matplotlib libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulate data for 200 villages
np.random.seed(232)
data = {
    'village_id': range(1, 201),
    'latitude': np.random.uniform(-10, 10, 200),
    'longitude': np.random.uniform(-10, 10, 200),
    'average_income': np.random.uniform(1000, 5000, 200),
    'population_density': np.random.uniform(50, 500, 200)
}

# Create a DataFrame
villages = pd.DataFrame(data)

# Stratify villages into groups based on income and density quantiles
villages['income_strata'] = pd.qcut(villages['average_income'], 4, labels=False)
villages['density_strata'] = pd.qcut(villages['population_density'], 4, labels=False)

# Combine strata into a single stratification key
villages['strata'] = villages['income_strata'].astype(str) + villages['density_strata'].astype(str)

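# Assign treatment randomly within each stratum.
# Half of the villages in every stratum are treated, so each stratum is balanced
# (independent 50% draws per village would ignore the strata altogether).
villages['treatment'] = 0
for _, idx in villages.groupby('strata').groups.items():
    treated_idx = np.random.choice(idx, size=len(idx) // 2, replace=False)
    villages.loc[treated_idx, 'treatment'] = 1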

# Plotting the villages with treatment assignment
# (one scatter call per group so that the legend labels match the markers)
plt.figure(figsize=(10, 6))
treated = villages[villages['treatment'] == 1]
control = villages[villages['treatment'] == 0]
plt.scatter(treated['longitude'], treated['latitude'], color='blue', marker='^', s=50, label='Treatment')
plt.scatter(control['longitude'], control['latitude'], color='red', marker='o', s=50, label='Control')
plt.title('Geographic Distribution of Villages by Treatment Assignment')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True)
plt.legend(loc='upper right')
plt.show()

Below is the output from this code.

Geographic Randomization: Output from my Python script.

  • Spatial Econometric Methods => Incorporate spatial econometric techniques to account for spatial autocorrelation as well. In particular, we can use Spatial Lag Models (SLMs) or Spatial Error Models (SEMs) to adjust for the influence of nearby units or groups. If you have not done so yet, I invite you to read my previous post about addressing spatial dependencies using an SEM.

Addressing Spatial Dependencies

  • Cluster Randomization => Focus on randomizing groups such as communities, schools, villages, and regions rather than individuals. This approach is useful when there is a risk of spatial contagion between treated and untreated individuals and when interventions are aimed at community-wide effects. For example, if individuals are likely to influence each other, randomizing entire clusters may contain the intervention within controlled boundaries and reduce the likelihood of interference.
  • Adjusting for Spatial Heterogeneity => Conduct subgroup analyses using statistical techniques such as multilevel modeling or geographically weighted regression (GWR) to examine how the effects may vary across different geographic contexts. This could allow us to obtain deeper insights into the drivers of RCT outcomes and slightly reduce the concerns about external validity.
  • Geographic Information Systems (GIS) and Spatial Analysis => Visualize and quantitatively analyze the geographical distribution of treatment effects. In this sense, it is possible to map the locations of study areas a priori and analyze spatial data to help identify in advance the patterns, potential biases, and anomalies that may influence the study outcomes. I now invite you to read another post, in this case about five key techniques in spatial analysis.

Five Key Techniques in Spatial Analysis
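Several of the techniques above rely on first detecting spatial autocorrelation. One common diagnostic is Moran's I. The sketch below is a hand-rolled illustration for values on a regular grid, assuming rook contiguity (up, down, left, and right neighbors) with binary weights. The function name `morans_i` and the example grids are my own assumptions for illustration; for real analyses, a dedicated spatial library such as PySAL is a better choice.

```python
import numpy as np

def morans_i(values):
    """Moran's I for a 2D grid with rook contiguity and binary weights."""
    z = values - values.mean()
    # Cross-products of deviations over horizontally and vertically adjacent cells
    horizontal = (z[:, :-1] * z[:, 1:]).sum()
    vertical = (z[:-1, :] * z[1:, :]).sum()
    # Each neighbor pair appears twice in the weight matrix (i -> j and j -> i)
    cross = 2.0 * (horizontal + vertical)
    rows, cols = values.shape
    total_weight = 2.0 * (rows * (cols - 1) + (rows - 1) * cols)
    return (values.size / total_weight) * (cross / (z ** 2).sum())

rng = np.random.default_rng(0)

# Pure noise: no spatial structure, so Moran's I should be close to zero
noise = rng.normal(size=(40, 40))

# North-south gradient plus noise: nearby cells are similar (positive autocorrelation)
gradient = np.arange(40.0)[:, None] + rng.normal(0.0, 1.0, size=(40, 40))

# Checkerboard: every cell differs from its neighbors (negative autocorrelation)
checkerboard = np.indices((40, 40)).sum(axis=0) % 2

for name, grid in [("noise", noise), ("gradient", gradient), ("checkerboard", checkerboard)]:
    print(f"Moran's I ({name}): {morans_i(grid):.3f}")
```

Values near zero suggest little spatial dependence, while clearly positive values in outcomes or regression residuals suggest that an SLM or SEM may be worth considering.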

A Real Case: The "Thibela TB" Trial in South Africa

Let’s look at an example with real implications. This trial was a large-scale RCT to evaluate the effect of community-wide isoniazid preventive therapy (IPT) on the incidence of tuberculosis (TB) among gold miners in South Africa from 2006 to 2011 (for details, check Fielding et al., 2011 and The Aurum Institute).

Here, the spatial factors were geographical, environmental, and socioeconomic. These mines were known for their high TB incidence rates, which were exacerbated by conditions such as crowded work and living environments, and high levels of silica dust exposure. The transmission dynamics in densely populated mines could have affected the outcome of IPT intervention, requiring different strategies in those places compared to less crowded settings.

There was also a high human immunodeficiency virus (HIV) prevalence in specific areas. The IPT was highly effective in preventing TB among HIV-positive individuals, but, unfortunately, it did not significantly reduce TB incidence at the community (mine-wide) level as expected.

Moreover, the RCT was implemented in different mines and clusters with different environmental conditions and medical infrastructure, all of which could have influenced the trial’s outcomes.

However, the researchers took steps to address some of these spatial issues. How? They increased surveillance to monitor TB incidence and tracked its spread within and between mines. They also adopted a cluster-randomized design, where mining areas (or groups of areas) rather than individuals were randomized to be either in the control or treatment group. Finally, they stratified the data by mine, HIV status, and previous TB history to gain a better understanding and adjust for the variable effects of IPT across different groups.
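A cluster-randomized assignment of this kind can be sketched as follows. This is not the trial's actual data or procedure: the 12 clusters, the three strata, and the 600 individuals are hypothetical values chosen only to show the mechanics of randomizing whole clusters within strata.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical clusters (not the trial's real data): 12 clusters in 3 strata
clusters = pd.DataFrame({
    "cluster_id": np.arange(12),
    "stratum": np.repeat(["A", "B", "C"], 4),
})

# Randomize at the cluster level: within each stratum, half the clusters are treated
clusters["treated"] = 0
for _, idx in clusters.groupby("stratum").groups.items():
    chosen = rng.choice(idx, size=len(idx) // 2, replace=False)
    clusters.loc[chosen, "treated"] = 1

# Hypothetical individuals nested in clusters inherit their cluster's assignment
individuals = pd.DataFrame({"cluster_id": rng.integers(0, 12, size=600)})
individuals = individuals.merge(clusters, on="cluster_id", how="left")

print(clusters)
print(individuals.groupby(["stratum", "treated"]).size())
```

Because everyone in a cluster shares the same assignment, contact between treated and untreated individuals within a cluster is ruled out by design, although spillovers between neighboring clusters remain possible.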

At the end…

Spatial factors, indeed, can significantly affect the outcomes and interpretations of RCTs.

Acknowledging and effectively managing spatial dependencies and geographic variables in RCTs is imperative for developing robust inferences and dependable analyses. This is particularly relevant when translating RCT findings into policy recommendations.

As the world grows more connected, the importance of considering spatial factors in research will only increase, reinforcing the need for thoughtful, well-designed RCTs accounting for the complex world in which we live.


If you found the time to read this post and consider it useful, please share it with your peers. You can leave comments and/or reach me to let me know your thoughts. Do not hesitate to contact me or directly say hi on LinkedIn. My X (formerly Twitter) is @ljmaldon. My personal website is www.leonardojmaldonado.com

