Millennial Suicides | a Probabilistic Change-Point Analysis

A simple, yet meaningful probabilistic Pyro model to uncover change-points over time.

Richard Michael
Towards Data Science


Solitude. A photo by Sasha Freemind on Unsplash

One claim made repeatedly by the media is that the rate of suicides among younger people in the UK rose from the 1980s to the 2000s. You might come across it on the news, in publications, or simply as a truth accepted by the general population. But how can you make this claim measurable?

Making an assumption tangible

In order to make this claim testable we look for data, and find an overview of the suicide rates, specifically for England and Wales, at the Office for National Statistics (UK), together with an overall visualization.

https://www.ons.gov.uk/visualisations/dvc661/suicides/index.html

Generally, one essential type of question to ask, in everyday life, business, or academia, is when a change has occurred. You operate under the assumption that something has fundamentally changed over time. In order to prove that, you have to quantify it. So you get some data on the subject matter and build a model that displays the points at which values have changed, as well as the magnitude of those changes. We are going to look at exactly how to do that.

Millennial Suicides Specifically

With our data at hand, we take the average over the age group of Millennials. For the sake of simplicity, that entails ages in the range of 25 to 35. Given our dataset, we can visualize the average suicide count for Millennials since the 1980s like this:

# select ages 25 to 35 (inclusive) and average the counts per year
suicide_millenials_df = suicide_data[suicide_data.Age.isin(range(25, 36))].groupby(["Year"]).mean().reset_index()
# round values to the closest integer
suicide_millenials_df["Suicides"] = suicide_millenials_df.Suicides.round()
suicide_millenials_df.head()
What the head of your dataframe should look like.
Fig. 1 — The average suicides (per 100.000 people) in the age group 25 to 35 for England and Wales from 1981 to 2017.

From this data we can clearly see a steady increase over time until the very early 2000s, followed by a decline until the 2010s. The inquisitive side of us wants to know exactly when that happened, so that we can go and look for the root causes and the underlying issues behind those trends. Let's build a model to find out exactly in which years to look for that substantial change. To look for a specific change-point we will use the Pyro PPL as our probabilistic framework.

The How

The method of choice is a Poisson regression, with the rates before and after the change drawn from a positive half-normal distribution. We will use the mathematical underpinnings described in [2]. This entails: a uniform prior distribution over the years T, and the rates μ drawn from a half-normal distribution (the positive side of the Gaussian), selected via Iverson bracket notation. Looking at the data, we set a large scale for our rates, as we want to capture the change from roughly 75 to 130, so something around 50. Lastly, we find the observed suicide counts through our Poisson regression.

Note: The regression entails that we will sample from a normal distribution.

T ∼ U(1981, 2017),  μ₀ ∼ N⁺(0, 50),  μ₁ ∼ N⁺(0, 50),  n_t ∼ Poisson(μ_[t>T])

These parameters can be adjusted, depending on whether you are looking for changes at different scales, either across time or in effect size. One might consider a half-Cauchy distribution for more extreme values, or simply increase or decrease the scale.
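To get a feel for why a scale of 50 is a sensible choice, here is a quick numpy sketch of the half-normal prior (the specific seed and sample size are arbitrary):

```python
import numpy as np

# The rates in the data move between roughly 75 and 130, so the prior
# on each rate should comfortably cover magnitudes of that order.
rng = np.random.default_rng(0)
mu_prior = np.abs(rng.normal(loc=0.0, scale=50.0, size=10_000))

print(mu_prior.min())   # strictly non-negative
print(mu_prior.mean())  # close to 50 * sqrt(2/pi), i.e. roughly 40
```

A draw from this prior is almost always below ~150, which keeps the model flexible without wasting probability mass on implausibly large rates.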

The Pyro Model

def model(years, suicides):
    # observation noise for the Normal likelihood, one σ per data point
    σ = pyro.param('σ', torch.ones(data_len))
    # uniform prior over the possible change-point years
    T = pyro.sample('change', dist.Uniform(1981, 2017))
    # 0 before the change-point, 1 after it
    grp = (years > T) * 1
    # the two rates are independent draws from a half-normal prior
    with pyro.plate('rate', 2):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50))
    # observe the suicide counts under the rate selected by grp
    y_obs = pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)

The grp gives us the index for our bracketed μ.
The model observations are then sampled from the Normal distribution, taking into account μ and a scale of σ = 1 (the last line in the model).
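To see how this indexing works in isolation, here is a small illustration (using numpy as a stand-in for the torch tensors in the model; the years, change-point and rates are made-up values):

```python
import numpy as np

years = np.array([1981, 1990, 2000, 2010])
T = 1995.0                     # one sampled change-point
mu = np.array([80.0, 110.0])   # hypothetical rates before/after T

# boolean comparison times 1 gives the 0/1 group index per year
grp = (years > T) * 1          # [0, 0, 1, 1]
# fancy indexing selects the rate that applies to each year
rates = mu[grp]                # [80., 80., 110., 110.]
```

Every year before T is assigned μ₀ and every year after T is assigned μ₁, which is exactly what the Iverson bracket μ_[t>T] expresses.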
Before we can run this model, we convert our dataframe to Pyro-readable tensors using PyTorch, like so:

years = torch.from_numpy(suicide_millenials_df.Year.values)
suicides = torch.from_numpy(suicide_millenials_df.Suicides.values)
data_len = len(suicide_millenials_df)

For the fitting procedure we perform Bayesian inference using MCMC methods. We sample for 1000 iterations and let the sampler warm up for 300 iterations.

from pyro.infer import MCMC, NUTS

SAMPLES = 1000
WARMUP = 300
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=SAMPLES, warmup_steps=WARMUP)
mcmc.run(years, suicides)

We specifically use the hands-off NUTS sampler [3] to perform inference and find values for our Pyro parameters, and recommend that you do the same.
Beware that, in contrast to variational inference approaches, MCMC takes its time. After training we find a relatively high acceptance probability of around 91.2% and a step size of ϵ ≈ 0.00198.

From checking the model summary (e.g. via mcmc.summary()) we can see that everything is in order: no divergences occurred, though our R-hat values are a bit too high.

Now with the samples at hand, we can display when a change-point has occurred most frequently over time:

Fig. 2 — Histogram of the occurrence of the change-point across years.
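For plotting, it helps to collect the posterior draws in a dataframe first. A hypothetical sketch with stand-in values (in the real run the draws come from mcmc.get_samples(); the column names change, rate0 and rate1 are this sketch's assumption):

```python
import numpy as np
import pandas as pd

# stand-in posterior draws: one change-point and a pair of rates per draw
change = np.array([1995.2, 1995.8, 1987.6])
mu = np.array([[80.0, 110.0],
               [79.0, 108.0],
               [82.0, 111.0]])

# one row per posterior draw
df = pd.DataFrame({'change': change, 'rate0': mu[:, 0], 'rate1': mu[:, 1]})
```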

We go ahead and plot our different rates over the given time-frame to illustrate the points at which the changes occurred.

def pl(pt):
    # pt is one posterior draw with entries 'change', 'rate0' and 'rate1'
    grp = (years > pt['change']) * 1
    line = plt.plot(years, pt['rate0'] * (1 - grp) + pt['rate1'] * grp,
                    color='red', alpha=0.005)

# df holds the posterior samples, one draw per row
df.apply(pl, axis=1)
Fig. 3 — Change-rate over time (red-line) for 1000 samples. There is a change around 1995 to 1996 to mark the downwards trend.
Fig. 4 — Another rate of change over time as the red line. We see the change manifest itself in the years between 1986 and 1989. 2000 samples were run; 1000 would have sufficed.

First we ran the previously described model with 1000 samples. This uncovered a change-point around the year 1995 (see Fig. 3), indicating the upcoming downwards trend. The rate is shown as a red line. Here the model has quantified the change-point for the better development.
We took another shot at running the model, this time with 2000 samples, and the model converged on a time-range in the years of '86 to '89 for yet another change-point. This marks the potential turning-point for an upwards trend. If we want to uncover the causes behind the increase in suicides among Millennials, we have to look for reasons in that time-range, or for earlier changes whose effects have manifested themselves in this period.
Note for replicability: set a random seed (e.g. via pyro.set_rng_seed) or fix the initial parameters of your MCMC inference, so that you can replicate your findings.

Open Questions

We have shown that there is not just one change happening over time: there is a range of points for an increase in the rate and a range of points for a decreasing rate. One might extend the model to pick up more than one change-point, e.g. by making μ three-dimensional, or by fitting an entirely different model.
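The indexing trick generalizes naturally. A numpy sketch of the grouping for a hypothetical two-change-point extension (the change-points and rates here are made-up, not fitted values):

```python
import numpy as np

years = np.arange(1981, 2018)
T1, T2 = 1988.0, 1995.0   # two hypothetical change-points, T1 < T2

# summing two indicator comparisons yields a 0/1/2 regime index
grp = (years > T1).astype(int) + (years > T2).astype(int)

# one rate per regime: before T1, between T1 and T2, after T2
mu = np.array([80.0, 110.0, 95.0])
rates = mu[grp]
```

In the Pyro model this would amount to sampling μ inside pyro.plate('rate', 3) and two change-points instead of one; the likelihood stays unchanged.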

Concluding the Model

We have shown how to get from the raw data to a concrete formulation of changes over time and their underlying probabilities. This was translated directly into Pyro, and through sampling we could infer concrete change-points and the years in which those changes occurred.

Probabilistic Programming is a powerful engine.

The complete notebook can be found here:

The Gravity of the Subject

Though this was intended as an introductory piece on Data Science methods, the topic of self-harm and death is a grave one. In case you or a loved one are affected, there are help and resources out there.

References

  1. Office for National Statistics, UK. Middle-aged generation most likely to die by suicide and drug poisoning. Dataset on suicide-rates. 2019.
  2. C. Scherrer. Bayesian Changepoint Detection with PyMC3. 2018.
  3. M. Hoffman, A. Gelman. The No-U-Turn Sampler. JMLR v15. 2014.


I am a Data Scientist and M.Sc. student in Bioinformatics at the University of Copenhagen. You can find more content on my weekly blog http://laplaceml.com/blog