Unhappy with statistical significance (p-value)? Here is a simple solution

Why and how to use forest plots effectively

Quentin Gallea, PhD
Towards Data Science


From regression to forest plot, image by the author

Humans like to put things into bins in a simple way: statistically significant or not statistically significant. This dichotomous view of statistical results must stop! For almost a decade, I have heard the same debate. On one hand, people argued that statistical significance has the advantage of simplicity and clarity (and many people already struggle to achieve that). On the other hand, some argued for more advanced approaches (e.g. Bayesian methods) because of the limitations of the p-value (see next section). I was somewhere in between: I recognised that the p-value is useful because of its simplicity, and, pragmatically, I accepted that I could not ask more of many students (including some scientific researchers). At the same time, I was bothered by how problematic this binary vision is.

However, today I am using a solution that seems to satisfy both sides: it is simple and includes the p-value while addressing the main weaknesses of using statistical significance alone. I hope that this proposal can help to take a step away from the excessive focus on statistical significance.

This article has two main parts. First, I present the concept: the main problems with statistical significance (the p-value), the solution (the forest plot) and a conclusion. Second, I include a complete example with data and a real research question (the relationship between temperature and the implementation of environmental policies).

What are the issues with statistical significance (p-value)?

First things first. What is the p-value, and why is it a problem to rely on statistical significance (alone)?

“The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.”— Wikipedia

It would take a whole article to explain the p-value properly. So, if you are not yet familiar with this key concept, take a quick look at this nice summary on Towards Data Science (link).

Here are the two main issues:

1. Practical relevance:

Statistical significance says nothing about the size of the effect. Let's say I see an advertisement for a shampoo that helps reduce hair loss. The advert claims that the results have been proven in the laboratory by a large experiment and that they are statistically significant (people who use the shampoo lose less hair). However, when you read the underlying research paper, you discover that, on average, a person who uses this shampoo will have five more hairs on their head. Who cares, right?

Here’s another example from my work as a researcher. I worked for two years on a paper assessing the effect of Covid containment measures on the spread of the virus (Bonardi et al. (2022) forthcoming). Here, the magnitude of the effect is key. Lockdowns have been very costly for the economy, for the psychological health of many people, etc. Therefore, we want to know not only whether they reduced the spread of the virus, but by how much, in order to assess the cost/benefit of such measures.

That is why I always advocate using statistical significance simply as a necessary condition before interpreting the effect size. Then, on the basis of the effect size, one can conclude whether the effect is practically significant or not. The emphasis is on magnitude, while statistical significance provides evidence that the results are not driven by chance.

2. Manipulation:

The focus on a fixed threshold (e.g. 5%) leads to two main problems. When working with data, there are often a few degrees of freedom in the choice of statistical test, sample, etc. This may allow some people to nudge the results slightly so that they fall below the threshold.

On the other hand, if you are slightly above the threshold, you may not be able to publish your results. Yet a p-value of 4.9% and a p-value of 5.1% provide essentially the same evidence about the hypothesis being tested. This leads to serious problems in research. Here is an illustrative example I found on LinkedIn (Florian Weigert): “60 researchers examined the effect of medical treatment X on the reduction of disease Y. 57 researchers found no effect and could not publish the results. 3 researchers documented a statistically significant relationship and were published. Conclusion: a meta-study provides overwhelming support for the fact that medical treatment X leads to a reduction in disease Y.” If you want to go further, read “The significance filter, the winner’s curse and the need to shrink” by van Zwet and Cator (2021)¹ or the American Statistical Association’s statement on p-values².
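The arithmetic behind this anecdote is easy to verify with a short simulation (a sketch with synthetic data): when the true effect is zero, about 5% of studies will still clear the 5% threshold by chance.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

n_studies = 10_000   # many hypothetical studies of a treatment with zero true effect
n_per_study = 50     # observations per study

# Each study tests H0: mean = 0 on data that truly has mean 0.
samples = rng.standard_normal((n_studies, n_per_study))
z = samples.mean(axis=1) * math.sqrt(n_per_study)              # z-statistic per study
p_values = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])  # two-sided p

share_significant = (p_values < 0.05).mean()
print(f"Share of 'significant' null studies: {share_significant:.3f}")  # ~0.05
```

Out of 60 such studies, we would expect roughly three false positives: exactly the trap described in the LinkedIn example.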

A simple solution: Forest plots

To get a full understanding of a test result, you need to do three things: assess statistical significance (with a grain of salt), focus on the magnitude (provided the result is statistically significant), and account for the variability.

Why are forest plots an elegant solution? Forest plots let us see all this information at once and make it easier to compare coefficients. The picture at the top of this article shows an example of a forest plot next to a regression table. More examples can be found in the second part of this article.

1. Statistical significance: Statistical significance remains an essential aspect of a statistical test. If the result is not statistically significant, it is not worth interpreting. Forest plots let you see statistical significance directly: if the bar of the 95% confidence interval does not include 0, we can reject, at the 5% level, the two-sided null hypothesis that the true coefficient is 0. Yet the representation is not oversimplified. In many journals, coefficients from a regression or statistical test are accompanied by small stars indicating their statistical significance (for example, * if the p-value is below 5%). The problem with this simplification is that we cannot tell whether a result is far from statistically significant or very close. Forest plots let us see visually how much of the CI crosses the 0 line, and thus be a little less obsessed with a sharp cut-off.

2. Magnitude: The magnitude is easy to read, as is the relative magnitude between coefficients. However, coefficients cannot always be compared directly. For example, in a regression, the variables may have very different scales. Normalising the variables by their standard deviation puts them on a comparable scale (see example below).

3. Variability: Uncertainty is highlighted by the width of the confidence intervals. Again, this lets us quickly see the lower and upper bounds for the coefficient and get an idea of its variability.
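All three elements can be read off a minimal forest plot. Here is a sketch with made-up coefficients and variable names (purely illustrative, not from the application later in this article):

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical point estimates with 95% confidence intervals
names = ["Temperature", "Rainfall", "GDP per capita"]
coefs = np.array([0.27, -0.05, 0.40])
ci_low = np.array([0.20, -0.15, 0.10])
ci_high = np.array([0.35, 0.05, 0.70])

y = np.arange(len(names))
fig, ax = plt.subplots(figsize=(6, 3))
# One marker per coefficient, with a horizontal bar for the CI
ax.errorbar(coefs, y, xerr=[coefs - ci_low, ci_high - coefs],
            fmt="o", capsize=4, color="black")
ax.axvline(0, linestyle="--", color="grey")  # the "no effect" line
ax.set_yticks(y)
ax.set_yticklabels(names)
ax.set_xlabel("Coefficient (95% CI)")
fig.tight_layout()
fig.savefig("forest_plot.png")

# A CI that excludes 0 means significance at the 5% level (two-sided)
significant = (ci_low > 0) | (ci_high < 0)
print(dict(zip(names, significant)))
```

In this toy example, "Temperature" and "GDP per capita" have CIs that exclude 0, while "Rainfall" does not: significance, magnitude and variability are all visible in one picture.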

Conclusion

Forest plots make all the key elements of a statistical test (statistical significance, magnitude and variability) easily accessible without requiring expertise in statistics. In addition, this richer representation naturally gives less weight to statistical significance by providing other information alongside it. If you are able to run a statistical test correctly, you have the skills to produce a forest plot and interpret the results (which is not the case for the Bayesian approach, for example).

However, it is important to note the weakness of these charts compared to a table: the loss of precision. Tables remain necessary (kept in the appendix, for example) to obtain the exact value of a coefficient. Furthermore, in the case of a regression, if you standardise the coefficients by the standard deviation, the interpretation may be less intuitive even though it eases the comparison of relative magnitudes.

Application in Python: Temperature and Environmental Policies

Let me build on my previous article, where I explored the relationship between temperature (heat-waves) and environmental policies.

Here is the list of the variables:

  • cname: Country name
  • year: Year
  • oecd_eps: Environmental Policy Stringency Index⁴ (from Botta and Kozluk (2014))
  • cckp_temp: Annual average temperature in Celsius⁵ (from the Climate Change Knowledge Portal)
  • cckp_rain: Annual average rainfall in mm⁵ (from the Climate Change Knowledge Portal)
  • wdi_fossil: Fossil fuel energy consumption (% of total)⁶ (from the World Development Indicators, World Bank)
  • gdp_pc_ppp: GDP per capita in purchasing power parity⁶ (from the World Development Indicators, World Bank)
  • pop: Population⁶ (from the World Development Indicators, World Bank)

Model

Here is the baseline model I will estimate:

EPS_it = β · Temperature_it + γ′X_it + ε_it

with i and t respectively for country and year. EPS is the environmental policy stringency index (an index taking values from 0 to 5). Temperature is the annual average temperature in Celsius. I have subtracted the country average from each variable. As observed in the preliminary analysis, the idea is to compare the warmest years in France with the coldest years in France, rather than comparing France with Canada (exploiting the variation within countries). X is a vector of controls including population and GDP per capita (in purchasing power parity). ɛ is an error term clustered at the country level (a standard procedure that accounts, in the covariance matrix, for the correlation of errors within the same country).

The methodology I used for plotting the coefficients is based on this blog post https://zhiyzuo.github.io/Python-Plot-Regression-Coefficient/ by Zhiya Zuo.

Forest plot for one regression model

Note: We can see that an increase of 1°C is associated with an increase in the environmental policy stringency index of about 0.27 (mean value of the environmental policy stringency index: 1.58) with a CI between 0.2 and 0.35. The coefficient is statistically significant at the 5% level.

Forest plot with two regression models

Here I compare the coefficient on temperature for two different samples: years with above-median rainfall and years with below-median rainfall. In the previous article, I found that the effect of temperature was significantly stronger in dry years.

Note: We can confirm the finding from the previous article: the effect is larger when rainfall is low (below the median), and it is actually not statistically significant at the 5% threshold in the above-median sample. Moreover, the difference between the two coefficients is statistically significant at the 5% threshold: neither confidence interval covers the other point estimate.

Forest plot with two regression models and multiple variables

Here, I will compare the coefficient for temperature for two different models: without and with control variables (GDP and fossil fuel consumption).

Note: We can see that the coefficient on temperature is no longer statistically significant once we control for GDP and fossil fuel consumption. It therefore seems that our main effect suffered from omitted variable bias. The correlation table below reveals that GDP is strongly negatively correlated with temperature (-0.45) and strongly positively correlated with the environmental policy stringency index (0.65). Thus, the initial positive coefficient on temperature is rather due to the fact that richer countries tend to be located in colder places, and that these countries tend to implement more environmental policies.

Note also that it is difficult to compare the coefficients here because they are at different scales (GDP in dollars, fossil fuels as a share of consumption, temperature in °C). It is therefore preferable to normalise the coefficients by the standard deviation of each variable. See next section.
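Normalising by the standard deviation is a one-liner with pandas. A sketch on synthetic columns with the scales mentioned above (the values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "cckp_temp": rng.normal(15, 7.3, 500),         # degrees Celsius
    "gdp_pc_ppp": rng.normal(30_000, 12_000, 500), # dollars
})

# Divide each regressor by its standard deviation so that a coefficient
# measures the effect of a one-standard-deviation increase.
standardized = df / df.std()
print(standardized.std().round(2))  # both columns now have std 1.0
```

Running the regression on `standardized` instead of `df` yields coefficients on a common scale, at the cost of the less intuitive "per standard deviation" interpretation discussed earlier.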

Forest plot with two regression models and multiple variables standardized

Note: Now that the coefficients are standardised, it is easier to compare their magnitudes and to see the statistical significance of each. However, as we saw in the first part of this article, the coefficients become harder to interpret. Let me illustrate this with the temperature coefficient in the model without controls: a one-standard-deviation (7.3°C) increase in temperature is associated with a 2-point increase in the policy stringency index (mean value of the index: 1.58).

References

[1] van Zwet, E. W., & Cator, E. A. (2021). The significance filter, the winner’s curse and the need to shrink. Statistica Neerlandica, 75(4), 437–452.

[2] Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133.

[3] Povitkina, M., Alvarado Pachon, N., & Dalli, C. M. (2021). The Quality of Government Environmental Indicators Dataset, version Sep21. University of Gothenburg: The Quality of Government Institute. https://www.gu.se/en/quality-government

[4] Botta, E., & Koźluk, T. (2014). Measuring environmental policy stringency in OECD countries: A composite index approach.

[5] The World Bank Group. (2021). Climate Change Knowledge Portal. https://climateknowledgeportal.worldbank.org

[6] World Bank. (2022). World Development Indicators.


Passionate about causality | researcher/lecturer/consultant | >10k students | connect with me on LinkedIn for daily content | Instagram: @stats_with_quentin