SynthDiD 101: A Beginner’s Guide to Synthetic Difference-in-Differences

On the method's advantages and disadvantages, demonstrated with the synthdid package in R

Title image generated by author using Nightcafe

In this blog post, I give a quick introduction to the Synthetic Difference-in-Differences (SynthDiD) method and discuss its relation to the traditional Difference-in-Differences (DiD) and Synthetic Control Method (SCM). SynthDiD is a generalized version of SCM and DiD that combines the strengths of both methods. It enables causal inference with large panels, even with a short pretreatment period. I discuss the advantages and disadvantages of this method while demonstrating the approach using the synthdid package in R. I also summarize the key ideas in bullet points for a quick introduction.

Synthetic Control Method vs. Synthetic Difference-in-Differences

The synthetic control method and the synthetic difference-in-differences method are closely related, but differ in how they estimate causal effects. The synthetic control method is a statistical technique that creates a "synthetic" control group by combining multiple control units that are similar to the treatment unit in all relevant characteristics. The synthetic control group is constructed to match the pre-treatment outcomes of the treated unit as closely as possible. The treatment effect is then estimated by comparing the post-treatment outcomes of the treated unit to those of the synthetic control group.

On the other hand, synthetic DiD combines the synthetic control method with the difference-in-differences approach [1]. In this method, a synthetic control group is constructed using the same approach as in the synthetic control method. However, the treatment effect is estimated by comparing the change in outcomes between the treated unit and the synthetic control group before and after the treatment is introduced. This approach allows for a more robust estimation of the treatment effect by accounting for pre-existing differences between the treatment and control groups.

In summary, while both methods use a synthetic control group, the synthetic control method estimates treatment effects by comparing the post-treatment outcomes of the treated unit to those of the synthetic control group, while synthetic DiD estimates treatment effects by comparing the change in outcomes between the treated unit and the synthetic control group before and after the treatment is introduced.
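For readers who prefer formulas, the comparison can be written compactly. The notation below follows Arkhangelsky et al. [1] and is not used elsewhere in this post: Y_it is the outcome of unit i in period t, W_it is the treatment indicator, α_i and β_t are unit and time fixed effects, and ω̂_i and λ̂_t are unit and time weights.

```latex
\begin{aligned}
\hat{\tau}^{\mathrm{did}} &= \arg\min_{\tau,\mu,\alpha,\beta}
  \sum_{i=1}^{N}\sum_{t=1}^{T}
  \left(Y_{it} - \mu - \alpha_i - \beta_t - W_{it}\tau\right)^2 \\
\hat{\tau}^{\mathrm{sc}} &= \arg\min_{\tau,\mu,\beta}
  \sum_{i=1}^{N}\sum_{t=1}^{T}
  \left(Y_{it} - \mu - \beta_t - W_{it}\tau\right)^2 \hat{\omega}_i^{\mathrm{sc}} \\
\hat{\tau}^{\mathrm{sdid}} &= \arg\min_{\tau,\mu,\alpha,\beta}
  \sum_{i=1}^{N}\sum_{t=1}^{T}
  \left(Y_{it} - \mu - \alpha_i - \beta_t - W_{it}\tau\right)^2
  \hat{\omega}_i^{\mathrm{sdid}} \hat{\lambda}_t^{\mathrm{sdid}}
\end{aligned}
```

Read side by side: DiD has unit fixed effects but no weights, SCM has unit weights but no unit fixed effects and no time weights, and SynthDiD has both the fixed effects and the two sets of weights.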

Synthetic DiD in bullet points:

  • SynthDiD is a generalized version of SCM and DiD.
  • It borrows strengths from the DiD method as well as the synthetic control method [2][3].
  • It constructs a counterfactual for the treated group by optimally weighting the control group units to minimize the difference between the treated and control groups in the pretreatment period as in SCM.
  • Then, the treatment effect is estimated by comparing the outcome changes in the treated unit and synthetic control group pre- and post-intervention as in DiD.
  • SynthDiD accounts for unit-level changes in outcome as in DiD [4].
  • It facilitates inference in large panels, even when the pretreatment period is short, which sets it apart from the synthetic control method (SCM requires a long pretreatment period).
  • Same as in SCM, the units become the "variables" and we represent the outcome as a weighted average of the units (i.e., synthetic control).
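The "weights first, then a difference of differences" idea in the bullets above can be sketched in a few lines of base R. This is a minimal illustration, not the synthdid implementation: the helper name weighted_did is hypothetical, and the unit weights (omega) and time weights (lambda) are simply assumed to be given, whereas the package estimates them from the pretreatment data.

```r
# Weighted double difference on a units x periods matrix Y.
# Assumed layout: the first N0 rows are control units and the
# first T0 columns are pretreatment periods.
# omega (length N0) and lambda (length T0) are assumed given and sum to one.
weighted_did <- function(Y, N0, T0, omega, lambda) {
  n_units   <- nrow(Y)
  n_periods <- ncol(Y)
  pre  <- Y[, 1:T0, drop = FALSE]
  post <- Y[, (T0 + 1):n_periods, drop = FALSE]
  # Per-unit change: average post outcome minus weighted pre outcome
  delta <- rowMeans(post) - as.vector(pre %*% lambda)
  # Average change of treated units minus weighted change of controls
  mean(delta[(N0 + 1):n_units]) - sum(omega * delta[1:N0])
}

# Toy panel: 3 control units, 1 treated unit, 2 pre and 2 post periods.
# Every unit trends up by 1 per period; the treated unit jumps by an extra 2.
Y <- rbind(c(1, 2, 3, 4),
           c(2, 3, 4, 5),
           c(3, 4, 5, 6),
           c(2, 3, 6, 7))

tau_uniform <- weighted_did(Y, N0 = 3, T0 = 2,
                            omega = rep(1 / 3, 3), lambda = rep(1 / 2, 2))
tau_uniform  # 2: the extra jump of the treated unit
```

With uniform weights, as in this call, the calculation is just the ordinary DiD estimate; SynthDiD differs in that it chooses omega and lambda so that the weighted controls and the weighted pretreatment periods resemble the treated units and the post-treatment periods.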

Example

Suppose that we are a company that sells plant-based food products, such as soy milk or soy yogurt, and we operate in multiple countries. Some countries implement new legislation that prohibits us from marketing our plant-based products as ‘milk’ or ‘yogurt’ because it is claimed that only animal products can be marketed as ‘milk’ or ‘yogurt’ (thanks to one of my former students for the inspiration for this example :). Thus, due to this new regulation in some countries, we have to market soy milk as soy drink instead of soy milk, etc. We want to know the impact of this legislation on our revenue as this might help guide our lobbying efforts and marketing activities in different countries.

I simulated a balanced panel dataset that shows the revenue of our company in 30 different countries for 30 periods. Three of the countries implement this legislation in period 20. In the figure below, you can see a snapshot of the data. treat is a dummy variable indicating whether a country has implemented the legislation in a given period. revenue is the revenue in millions of EUR. You can find the simulation and estimation code in this Gist.

# Install and load the required packages
# devtools::install_github("synth-inference/synthdid")
library(synthdid)
library(ggplot2)
library(data.table)

# Set seed for reproducibility
set.seed(12345)

source('sim_data.R') # Import simulation function and some utilities

dt <- sim_data()
head(dt)
Snapshot of the data, image by author.
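Since sim_data.R is not reproduced in this post, here is a rough base R sketch of how a comparable balanced panel could be simulated. Everything in it is an assumption made for illustration: the function name sim_panel, the effect size, the noise levels, and the choice to switch treat on from period 20 onward. It is not the code behind the results shown in this post.

```r
# Hypothetical stand-in for sim_data(); NOT the author's simulation code.
sim_panel <- function(n_units = 30, n_periods = 30, n_treated = 3,
                      treat_start = 20, effect = -0.5, seed = 12345) {
  set.seed(seed)
  # One row per country-period combination (balanced panel)
  dt <- expand.grid(period = 1:n_periods, country = 1:n_units)
  unit_effect <- rnorm(n_units, mean = 10, sd = 1)                 # country level
  time_effect <- cumsum(rnorm(n_periods, mean = 0.05, sd = 0.1))   # common trend
  # The last n_treated countries are treated from treat_start onward
  treated_units <- (n_units - n_treated + 1):n_units
  dt$treat <- as.integer(dt$country %in% treated_units &
                           dt$period >= treat_start)
  dt$revenue <- unit_effect[dt$country] + time_effect[dt$period] +
    effect * dt$treat + rnorm(nrow(dt), sd = 0.2)
  dt[, c("country", "period", "revenue", "treat")]
}

panel <- sim_panel()
head(panel)
```

The additive structure (country level plus common trend plus treatment effect plus noise) is the simplest setting in which both DiD and SynthDiD should recover the effect.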

Next, we convert the panel data into the matrix format required by the synthdid package. Given the outcome matrix, the number of control units, and the number of pretreatment periods, the synthdid_estimate function constructs a synthetic control and estimates the treatment effect. For inference, we also need standard errors. I use the jackknife method because I have more than one treated unit; with a single treated unit, the placebo method is the only option. Using the standard error, I also calculate the 95% confidence interval for the treatment effect. I report both in the figure below.

# Convert the data into a matrix
setup = panel.matrices(dt, unit = 'country', time = 'period', 
                       outcome = 'revenue', treatment = 'treat')

# Estimate treatment effect using SynthDiD
tau.hat = synthdid_estimate(setup$Y, setup$N0, setup$T0)

# Calculate standard errors 
se = sqrt(vcov(tau.hat, method='jackknife'))
te_est <- sprintf('Point estimate for the treatment effect: %1.2f', tau.hat)
CI <- sprintf('95%% CI (%1.2f, %1.2f)', tau.hat - 1.96 * se, tau.hat + 1.96 * se)
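To make the matrix format more concrete: synthdid_estimate expects an outcome matrix Y with one row per unit and one column per period, where the first N0 rows are the control units and the first T0 columns are the pretreatment periods. The base R sketch below mimics that layout on a tiny example. The helper to_panel_matrix is hypothetical and only illustrates the idea; it is not how panel.matrices is implemented, and it assumes a balanced panel with a common treatment start.

```r
# Hypothetical helper illustrating the Y / N0 / T0 layout.
to_panel_matrix <- function(df, unit, time, outcome, treatment) {
  # Reshape long data into units x periods matrices
  Y <- tapply(df[[outcome]],   list(df[[unit]], df[[time]]), sum)
  W <- tapply(df[[treatment]], list(df[[unit]], df[[time]]), sum)
  ever_treated <- rowSums(W) > 0
  # Put control units first and treated units last
  Y <- Y[order(ever_treated), , drop = FALSE]
  list(Y  = Y,
       N0 = sum(!ever_treated),                 # number of control units
       T0 = min(which(colSums(W) > 0)) - 1)     # number of pretreatment periods
}

# Toy long-format panel: country B is treated from period 3 onward
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 4),
  period  = rep(1:4, times = 3),
  revenue = c(1, 2, 3, 4,  2, 3, 6, 7,  3, 4, 5, 6),
  treat   = c(0, 0, 0, 0,  0, 0, 1, 1,  0, 0, 0, 0)
)

m <- to_panel_matrix(toy, "country", "period", "revenue", "treat")
m$Y   # rows ordered A, C, B: controls first, treated last
m$N0  # 2
m$T0  # 2
```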

Let’s also plot the results with some additional info on the data.

# Check the number of treatment and control countries to report
num_treated <- length(unique(dt[treat==1]$country))
num_control <- length(unique(dt$country))-num_treated

# Create spaghetti plot with top 10 control units
top.controls = synthdid_controls(tau.hat)[1:10, , drop=FALSE]
plot(tau.hat, spaghetti.units=rownames(top.controls),
     trajectory.linetype = 1, line.width=.75, 
     trajectory.alpha=.9, effect.alpha=.9,
     diagram.alpha=1, onset.alpha=.9, ci.alpha = .3, spaghetti.line.alpha =.2,
     spaghetti.label.alpha = .1, overlay = 1) + 
  labs(x = 'Period', y = 'Revenue', title = 'Estimation Results', 
       subtitle = paste0(te_est, ', ', CI, '.'), 
       caption = paste0('The number of treatment and control units: ', num_treated, ' and ', num_control, '.'))

The image below displays the estimation results. Observe how the treated countries and the synthetic control exhibit fairly parallel trends on average before the treatment (the trends may not look perfectly parallel, but that is not necessary for the sake of this example). The average for treated countries is more variable, primarily because there are only three such countries, which results in less smooth trends. Transparent gray lines represent different control countries. Following the treatment in period 20, revenue declines in the treated countries by an estimated 0.51 million EUR, as indicated in the graph. This suggests that the new regulation has a negative impact on our company’s revenues and that action should be taken to prevent further declines.

Results, image by author.

Let’s plot the weights used to construct the synthetic control.

# Plot control unit contributions
synthdid_units_plot(tau.hat, se.method='jackknife') +
  labs(x = 'Country', y = 'Treatment effect', 
       caption = paste('The black horizontal line shows the actual effect;',
                       'the gray ones show the endpoints of a 95% confidence interval.'))

In the image below, you can observe how each country is weighted to construct the synthetic control. The treatment effects differ based on the untreated country selected as the control unit.

Country weights, image by author.

Now that we understand more about SynthDiD, let’s talk about its pros and cons. Like every method, it has advantages and disadvantages worth keeping in mind when getting started.

Advantages of SynthDiD method:

  • The synthetic control method is typically used with few treated and control units and needs a long, balanced pretreatment period. SynthDiD, on the other hand, works well even with a short pretreatment period [4].
  • The method is often preferred because it does not rely on a strict parallel trends assumption (PTA) the way DiD does.
  • SynthDiD guarantees a suitable quantity of control units, considers possible pre-intervention patterns, and may accommodate a degree of endogenous treatment timing [4].

Disadvantages of SynthDiD method:

  • Can be computationally expensive (even with only one treated group/block).
  • Requires a balanced panel (i.e., you can only use units observed for all time periods) and that the treatment timing is identical for all treated units.
  • Requires enough pre-treatment periods for good estimation; if you do not have enough of them, regular DiD might be the better choice.
  • Computing and comparing the average treatment effects for subgroups is tricky. One option is to split the sample into subgroups and compute the average treatment effects for each subgroup.
  • Implementing SynthDiD when the treatment timing varies might be tricky. In the case of staggered treatment timing, one solution is to estimate the average treatment effect for each treatment cohort and then aggregate the cohort-specific average treatment effects into an overall average treatment effect.
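The cohort-by-cohort workaround for staggered timing mentioned in the last bullet boils down to a weighted average. The sketch below assumes you have already estimated one effect per treatment cohort (for example, by running SynthDiD separately on each cohort against the never-treated units). The numbers are made up purely for illustration and are not results from this post, and weighting each cohort by its number of treated unit-periods is one reasonable choice rather than the only one.

```r
# Made-up cohort-level estimates, for illustration only
cohorts <- data.frame(
  cohort         = c("adopted_p15", "adopted_p20", "adopted_p25"),
  tau_hat        = c(-0.40, -0.50, -0.70),  # hypothetical cohort-specific effects
  n_treated      = c(2, 3, 1),              # treated units in each cohort
  n_post_periods = c(16, 11, 6)             # post-treatment periods per cohort
)

# Weight each cohort by its share of treated unit-periods
w <- cohorts$n_treated * cohorts$n_post_periods
w <- w / sum(w)

# Overall average treatment effect across cohorts
att_overall <- sum(w * cohorts$tau_hat)
att_overall
```

Standard errors for the aggregate need more care than this, since the cohort estimates typically share control units and are therefore not independent.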

Here are also some other points that you might want to know when getting started.

Things to note:

  • SynthDiD employs regularized ridge regression (L2) while ensuring that the resulting weights have a sum of one.
  • In the process of pretreatment matching, SynthDiD tries to determine the average treatment effect across the entire sample. This approach might cause individual time period estimates to be less precise. Nonetheless, the overall average yields an unbiased evaluation.
  • The standard errors for the treatment effects are estimated with the jackknife method or, if a cohort has only one treated unit, with the placebo method.
  • The estimator is considered consistent and asymptotically normal, given that the combination of the number of control units and pretreatment periods is sufficiently large relative to the combination of the number of treated units and posttreatment periods.
  • In practice, pre-treatment variables play a minor role in Synthetic DiD, as lagged outcomes hold more predictive power, making the treatment of these variables less critical.
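To give some intuition for the jackknife standard error mentioned above, here is a dependency-free sketch of the leave-one-unit-out idea. It is not the synthdid implementation: the helper names simple_did and jackknife_se are hypothetical, and a plain DiD estimator stands in for SynthDiD to keep the example short. The variance formula, (N - 1) / N times the sum of squared deviations of the leave-one-out estimates from the full-sample estimate, is the standard jackknife formula.

```r
# Plain DiD on a units x periods matrix (controls in the first N0 rows,
# pretreatment periods in the first T0 columns). Stand-in for SynthDiD.
simple_did <- function(Y, N0, T0) {
  delta <- rowMeans(Y[, (T0 + 1):ncol(Y), drop = FALSE]) -
    rowMeans(Y[, 1:T0, drop = FALSE])
  mean(delta[(N0 + 1):nrow(Y)]) - mean(delta[1:N0])
}

# Leave-one-unit-out jackknife standard error.
# Needs at least two control and two treated units.
jackknife_se <- function(Y, N0, T0, estimator) {
  n_units <- nrow(Y)
  tau_hat <- estimator(Y, N0, T0)
  tau_loo <- vapply(seq_len(n_units), function(i) {
    # Dropping a control unit reduces the number of controls by one
    N0_i <- if (i <= N0) N0 - 1 else N0
    estimator(Y[-i, , drop = FALSE], N0_i, T0)
  }, numeric(1))
  sqrt((n_units - 1) / n_units * sum((tau_loo - tau_hat)^2))
}

# Toy example: 4 control units, 2 treated units, 2 pre and 2 post periods
set.seed(1)
Y_toy <- matrix(rnorm(6 * 4), nrow = 6, ncol = 4)
se_toy <- jackknife_se(Y_toy, N0 = 4, T0 = 2, estimator = simple_did)
se_toy
```

The requirement of at least two treated units is visible in the code: dropping the only treated unit would leave nothing to estimate, which is why the placebo method is used in that case.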

Conclusion

In this blog post, I introduce the SynthDiD method and discuss its relationship with traditional DiD and SCM. SynthDiD combines the strengths of both SCM and DiD, allowing for causal inference with large panels even when the pretreatment period is short. I demonstrate the method using the synthdid package in R. Although it has several advantages, such as not requiring a strict parallel trends assumption, it also has drawbacks, like being computationally expensive and requiring a balanced panel. Overall, SynthDiD is a valuable tool for researchers interested in estimating causal effects using observational data, providing an alternative to traditional DiD and SCM methods.


References

[1] D. Arkhangelsky, S. Athey, D.A. Hirshberg, G.W. Imbens, and S. Wager, Synthetic Difference in Differences (2021), American Economic Review.

[2] A. Abadie, A. Diamond, J. Hainmueller, Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program (2010), Journal of the American Statistical Association.

[3] A. Abadie, Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects (2021), Journal of Economic Perspectives.

[4] R. Berman and A. Israeli, The Value of Descriptive Analytics: Evidence from Online Retailers (2022), Marketing Science.

Helpful Links

Causal inference for brave and true, synthetic difference-in-differences.

Matteo Courthoud, Understanding Synthetic Control Methods.


Thank you for reading!

If you liked the post and would like to see more of my articles consider following me.

Disclaimer: I write to learn so it might be that you spot an error in the article or code. If you do so, please let me know.

