Change Point Detection — A Bayesian Approach

Identifying a change point in a time series can reveal much more information about the series than you might expect

Everton Almeida
Towards Data Science


Change point analysis has become a field of interest in many areas of study. It refers to the problem of finding abrupt or sudden changes in a given time series. As defined by Iwata et al. (2018), change point analysis is “the method for identifying change points, which are the moments when the probability distribution of a time series changes.” According to Van den Burg and Williams (2020), “moments of abrupt change in the behavior of a time series are often the cause of alarm, as they can suggest a significant change in the data-generating process.”

As demonstrated by Aminikhanghahi and Cook (2017) and Iwata et al. (2018), the increasing attention paid to this kind of analysis is due to recent technological developments that generate large amounts of data needing close monitoring, in areas such as robotics, medicine, meteorology, and voice and image recognition. There is a wide variety of models and methodologies to deal with such problems. Since it is not the goal of this article to survey them, I suggest the work of Van den Burg and Williams (2020) for a better overview of those methodologies. There you can find the differences between online and offline change point detection; univariate and multivariate approaches, which can be parametric or nonparametric; and supervised or unsupervised models.

Photo by Tech Daily on Unsplash

A time series, according to Ehlers (2007), “is a collection of observations made sequentially over time,” whose main characteristic is the dependence of a given observation on its neighboring observations. A time series can be continuous or discrete; in the continuous case, according to Ehlers (2007), given the set T = {t : t1 < t < t2}, the series is denoted by {X(t) : t ∈ T}. Taking an observation window of the series with observations at times m, m+1, …, n, we have a time series denoted by {Xm, Xm+1, …, Xn}. According to Aminikhanghahi and Cook (2017, p. 3), “the detection of the change point can be defined as a problem of hypothesis testing, between two alternatives,” namely the null hypothesis H0: “no change occurs” and the alternative hypothesis H1: “a change occurs.”

From the Beginning

So if you like to code, it’s time to fire up a Jupyter Notebook and start running some simulations and analyses as we take a “random walk” through this approach. Let’s import the following packages:

import numpy as np
from numpy.random import seed
from numpy.random import randn

import matplotlib.pyplot as plt
import seaborn as sns

import math
from scipy import stats

# Raise an exception on invalid divisions so that the try/except blocks in the
# Metropolis-Hastings implementation below can catch and reject those draws
np.seterr(divide='raise')

We will start with a time series y(t) built from two pieces: y(t1) with mean μ=1 plus some noise and y(t2) with mean μ=2 plus some noise, each with 30 observations. As you might expect, the resulting series will have a very noticeable change point.

u1, u2 = 1, 2  # The means of the two segments
yt1 = u1 + 0.1*randn(30)
yt2 = u2 + 0.1*randn(30)

y = np.concatenate((yt1, yt2), axis=0)
plt.figure(figsize=(16, 10))
plt.plot(y)
A time series with two differing averages — Image by author

If you know Bayesian statistics, you know that the construction of any model is fundamentally composed of three distributions. The prior distribution h(θ) reflects our prior knowledge about the problem. The likelihood f(x|θ) carries the information in the observed data and is combined with the prior. The result is the posterior distribution h(θ|x), which is what we are interested in. That’s how we use Bayes’ theorem.
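In symbols, Bayes’ theorem ties the three pieces together: h(θ|x) = f(x|θ)h(θ) / ∫ f(x|θ)h(θ)dθ, which is usually summarized as h(θ|x) ∝ f(x|θ)h(θ) — the posterior is proportional to the likelihood times the prior.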

Fuzzy Clustering

At this point, the first (and perhaps most important) problem is to obtain a prior distribution from the time series we built — that is the first piece of the model. The problem is: we do not have one! And when dealing with a time series, once we have a prior distribution, most of the work is already done.

D’Angelo et al. (2011) took an interesting approach to get around this problem. They used a Kohonen network to cluster the time series. Unlike hard clustering, the Kohonen network here acts as a fuzzy clustering algorithm, which means that any given point X is associated with a group A with some probability. This association is given by the membership function fA(X), which assigns to each point a real number in the interval [0, 1] representing the grade of membership of X in A.

For a more complete explanation of the Kohonen network, you can refer to Kohonen (1990) and Haykin (2007). Using Python, I built such a network using two functions:

def Center_Kohonen(y, X=0, K=2, alfa=0.8, C=500):
    # This Kohonen network implementation finds two centers in a time series
    # Params:
    # y = time series
    # X = unused placeholder argument
    # K = number of neurons (for the proposed problem, the number of centers)
    # alfa = initial learning rate
    # C = number of training epochs
    M = y.shape[0]  # Number of input patterns
    N = 1           # Dimension of the input pattern (1 in our case)
    f = 0

    # Initializing the neurons' weights
    I = y.argsort(axis=0)  # Sorted indexes
    Y = np.sort(y)         # Sorted points of the time series
    c1 = Y[0:7]            # Beginning of the sorted series
    c2 = Y[M-7:M]          # End of the sorted series

    # Adjusting the values
    while np.std(c2) > 0.1:  # While the standard deviation is greater than 0.1, replace the highest value with the mean
        ma = c2.argmax()
        c2[ma] = np.mean(c2)
        y[I[M-7+ma]] = np.mean(c2)

    while np.std(c1) > 0.1:  # While the standard deviation is greater than 0.1, replace the lowest value with the mean
        mi = c1.argmin()
        c1[mi] = np.mean(c1)
        y[I[mi]] = np.mean(c1)

    # Definition of the initial weight values
    W = [np.mean(c1), np.mean(c2)]

    # Finding the centers from the Kohonen network training
    for l in range(1, C+1):
        alfa = alfa*(1-(C-(C-l))/C)  # Decay the learning rate at each epoch

        for i in range(0, M):  # For each value in the time series
            a = 999

            for j in range(0, K):  # Where K is the number of clusters
                if np.linalg.norm([(y[i]-W[j])], 2) < a:
                    a = np.linalg.norm([(y[i]-W[j])], 2)
                    f = j

            W[f] = W[f] + alfa*(y[i]-W[f])

    return c1, c2, I, Y, W, a, alfa

def Fuzzy_Set(y, W):
    # This function finds the membership values of a time series from previously found centers,
    # where y is the time series and W a vector with the found centers
    center_1 = []
    center_2 = []
    n = y.shape[0]
    l = 2
    # Finding the membership of each point in the time series to each center
    for i in range(0, l):
        for t in range(0, n):
            total = 0
            for k in range(0, l):
                total = total + (y[t]-W[k])*(y[t]-W[k])
            if i == 0:
                center_1.append(np.round(1-((y[t]-W[i])*(y[t]-W[i]))/total, 3))
            else:
                center_2.append(np.round(1-((y[t]-W[i])*(y[t]-W[i]))/total, 3))
    return center_1, center_2

If you call both functions with our time series sequentially, you may plot the results and get something like this:

c1, c2, I, Y, W, a, alfa = Center_Kohonen(y, X=0, K=2, alfa=0.8, C=500)
center_1, center_2 = Fuzzy_Set(y, W)
plt.figure(figsize=(16, 10))
plt.plot(center_1, 'b') # membership of each point in the first cluster
plt.plot(center_2, 'r') # membership of each point in the second cluster
plt.show()
The fuzzy clusterization for the time series y(t) using Kohonen Network — Image by author

And that’s pretty interesting! With the Kohonen network, we are able to split the time series y(t). This chart shows two clusters, since we set K=2. On the x-axis we have each point of the time series, while the y-axis shows the probability that a given point belongs to each of the two clusters. As you can see, the blue line shows that up to point 30 every point is much more likely (around 99%) to belong to the first group, the set μ1(t). The red line shows the opposite, since it represents the membership in the second group, the set μ2(t). That is reasonable, since we built a time series with two different means, and this plot mirrors the first one.

Although interesting, we have not actually found a change point yet (only some clues), and there is nothing Bayesian here so far.

By the way, most of the time the distinction between the points of a time series will not be that clear-cut. For instance, if we built the time series y(t) with y(t1) having mean μ=1 plus some noise and y(t2) having mean μ=1.3 (instead of 2) plus some noise, would the split be that clean? I will let you try this exercise — a quick sketch of the setup follows.
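Here is a minimal sketch of that harder case (the variable name y_hard and the mean 1.3 are just my illustrative choices, and it assumes the functions above have already been defined in your notebook):

# A harder series: the two segments differ only slightly in mean
yt1 = 1.0 + 0.1*randn(30)
yt2 = 1.3 + 0.1*randn(30)
y_hard = np.concatenate((yt1, yt2), axis=0)

c1, c2, I, Y, W, a, alfa = Center_Kohonen(y_hard, K=2, alfa=0.8, C=500)
center_1, center_2 = Fuzzy_Set(y_hard, W)  # the membership curves will overlap much more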

The Metropolis-Hastings Algorithm Comes into Play

If you tried the exercise above, you found yourself in trouble using just the Kohonen network to find any indication of a change point in the alternative time series. That is because the Kohonen network does not give us a change point, but two sets of continuous values representing the membership of each point in a given cluster.

But recall that the sets μ1(t) and μ2(t) have their values within the interval [0, 1]. This suggests that μ1(t) and μ2(t) can be approximated by Beta distributions with different parameters (have you ever heard of the Kullback–Leibler divergence?). According to D’Angelo et al. (2011), assuming the change point is denoted by m, then for t ≤ m we will have a Beta(a,b) distribution and for t > m a Beta(c,d) distribution. Given the properties of a Beta distribution, if there is a change point in the time series, parameter a will be greater than parameter b in Beta(a,b) and parameter c will be smaller than parameter d in Beta(c,d).
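As a quick illustrative check (the parameter values below are arbitrary, not taken from the article), you can see how the shape parameters control where a Beta density concentrates:

# Arbitrary values: when a > b the Beta density piles up near 1, when a < b it piles up near 0
x = np.linspace(0.01, 0.99, 99)
print(x[np.argmax(stats.beta.pdf(x, 8, 2))])  # mode close to 1 for Beta(8, 2)
print(x[np.argmax(stats.beta.pdf(x, 2, 8))])  # mode close to 0 for Beta(2, 8)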

The problem is: how do we build those two Beta distributions? The Metropolis-Hastings algorithm is a Markov chain Monte Carlo (MCMC) method initially proposed by Metropolis et al. (1953) and later generalized by Hastings (1970). According to Gelman et al. (2003), the goal of any Markov chain simulation “is to create a Markov process whose stationary distribution is specified by p(θ | x).” Running the simulation long enough allows us to obtain a distribution close enough to the stationary, posterior distribution. The posterior can be summarized in terms of expectations of particular functions of the parameter θ, that is, ∫ g(θ)p(θ | x)dθ = E[g(θ) | x]. Such integrals are rarely tractable analytically, which is why MCMC methods are useful to approximate the posterior distribution.

The Metropolis-Hastings algorithm uses the idea of rejection: it generates a value from an auxiliary (proposal) distribution and accepts it with a given probability. If you are unfamiliar with MCMC methods, you might be wondering how the algorithm decides to reject a drawn value. We use the transition rule given by Hastings (1970):
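In its standard form: if the chain is currently at θ and a candidate θ′ is drawn from a proposal distribution q(θ′ | θ), the move is accepted with probability α(θ, θ′) = min{1, [p(θ′ | x) q(θ | θ′)] / [p(θ | x) q(θ′ | θ)]}; otherwise the chain stays at θ.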

Putting it in other words, we can use the two sets of continuous values given by the fuzzy clustering to reject randomly drawn values from a given prior distribution for the change point. If you want to learn more about MCMC methods, I suggest Gamerman and Lopes (2006).
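Before applying it to the change point problem, here is a minimal, generic sketch of the idea — purely illustrative, not the article’s model — that samples from a Beta(3, 7) target with a Uniform(0, 1) independence proposal, so the flat proposal densities cancel in the acceptance ratio (the function name mh_beta_example is just mine):

def mh_beta_example(n_draws=5000, a=3, b=7, seed=42):
    rng = np.random.default_rng(seed)
    theta = 0.5                       # starting value
    draws = []
    for _ in range(n_draws):
        proposal = rng.uniform(0, 1)  # candidate from the auxiliary distribution
        # Acceptance ratio: flat proposal densities cancel, leaving the ratio of target densities
        ratio = stats.beta.pdf(proposal, a, b) / stats.beta.pdf(theta, a, b)
        if min(1, ratio) > rng.uniform(0, 1):
            theta = proposal          # accept the candidate
        draws.append(theta)           # otherwise keep the current value
    return draws

samples = mh_beta_example()
print(np.mean(samples))  # should be close to a/(a+b) = 0.3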

Let’s get back to the Jupyter Notebook. The function below is an implementation of the Metropolis-Hastings algorithm for this problem. Although powerful, the algorithm requires a few things. The first is a prior distribution for every parameter to be estimated. For the parameter m, we use a uniform distribution between 1 and 60, which means the algorithm randomly selects a change point candidate in the time series. For the parameters a, b, c, and d, I selected weakly informative gamma distributions. The function takes two arguments: a set of membership values (μ1(t) or μ2(t)) and the number of simulations.

def Metropolis_Hastings(center_kohonen, n_sims=1000):
    # center_kohonen: the membership values (mu1(t) or mu2(t)) returned by Fuzzy_Set
    # n_sims: number of Metropolis-Hastings iterations
    n = len(center_kohonen)
    m = 1 + round((n-1) * np.random.uniform(0, 1))
    shape, scale, loc = 10, 0.1, 0

    # Lists to save the draws for each parameter
    a_params = []
    b_params = []
    c_params = []
    d_params = []
    m_params = []

    # Prior distributions for the parameters
    a = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]
    b = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]
    c = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]
    d = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

    for i in range(0, n_sims):
        # Candidate values for this iteration
        m1 = 1 + round((n-1) * np.random.uniform(0, 1))
        a1 = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]
        b1 = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]
        c1 = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]
        d1 = stats.erlang.rvs(shape, loc=loc, scale=scale, size=1, random_state=None)[0]

        # PARAM A
        aux1 = 1
        for j in range(0, m):
            try:
                aux1 = aux1 * (center_kohonen[j] ** (a1-1))
            except:
                aux1 = aux1
        aux2 = 1
        for j in range(0, m):
            try:
                aux2 = aux2 * center_kohonen[j] ** (a-1)
            except:
                aux2 = aux2
        try:
            ra = ((math.gamma(a1+b)/math.gamma(a1))**m)*aux1*((((a/a1)**.9)*math.exp(-.1*(a1-a)))**2)/(((math.gamma(a+b)/math.gamma(a))**m)*aux2)
            if min(1, ra) > np.random.uniform(0, 1):
                a = a1
        except:
            pass

        # PARAM B
        aux1 = 1
        for j in range(0, m):
            try:
                aux1 = aux1*(1-center_kohonen[j])**(b1-1)
            except:
                aux1 = aux1
        aux2 = 1
        for j in range(0, m):
            try:
                aux2 = aux2*(1-center_kohonen[j])**(b-1)
            except:
                aux2 = aux2
        try:
            rb = ((math.gamma(a+b1)/math.gamma(b1))**m)*aux1*((((b/b1)**.9)*math.exp(-.1*(b1-b)))**2)/(((math.gamma(a+b)/math.gamma(b))**m)*aux2)
            if min(1, rb) > np.random.uniform(0, 1):
                b = b1
        except:
            pass

        # PARAM C
        aux1 = 1
        for j in range(m, n):
            try:
                aux1 = aux1*center_kohonen[j]**(c1-1)
            except:
                aux1 = aux1
        aux2 = 1
        for j in range(m, n):
            try:
                aux2 = aux2*center_kohonen[j]**(c-1)
            except:
                aux2 = aux2
        try:
            rc = ((math.gamma(c1+d)/math.gamma(c1))**(n-m))*aux1*((((c/c1)**.9)*math.exp(-.1*(c1-c)))**2)/(((math.gamma(c+d)/math.gamma(c))**(n-m))*aux2)
            if min(1, rc) > np.random.uniform(0, 1):
                c = c1
        except:
            pass

        # PARAM D
        aux1 = 1
        for j in range(m, n):
            try:
                aux1 = aux1*(1-center_kohonen[j])**(d1-1)
            except:
                aux1 = aux1
        aux2 = 1
        for j in range(m, n):
            try:
                aux2 = aux2*(1-center_kohonen[j])**(d-1)
            except:
                aux2 = aux2
        try:
            rd = ((math.gamma(c+d1)/math.gamma(d1))**(n-m))*aux1*((((d/d1)**.9)*math.exp(-.1*(d1-d)))**2)/(((math.gamma(c+d)/math.gamma(d))**(n-m))*aux2)
            if min(1, rd) > np.random.uniform(0, 1):
                d = d1
        except:
            pass

        # PARAM M
        aux1 = 1
        for j in range(0, m1):
            try:
                aux1 = aux1*(center_kohonen[j]**(a-1))*((1-center_kohonen[j])**(b-1))
            except:
                aux1 = aux1
        aux2 = 1
        for j in range(m1, n):
            try:
                aux2 = aux2*(center_kohonen[j]**(c-1))*((1-center_kohonen[j])**(d-1))
            except:
                aux2 = aux2
        aux3 = 1
        for j in range(0, m):
            try:
                aux3 = aux3*(center_kohonen[j]**(a-1))*((1-center_kohonen[j])**(b-1))
            except:
                aux3 = aux3
        aux4 = 1
        for j in range(m, n):
            try:
                aux4 = aux4*(center_kohonen[j]**(c-1))*((1-center_kohonen[j])**(d-1))
            except:
                aux4 = aux4
        try:
            rm = (((math.gamma(a+b)/(math.gamma(a)*math.gamma(b)))**m1)*((math.gamma(c+d)/(math.gamma(c)*math.gamma(d)))**(n-m1))*aux1*aux2)/(((math.gamma(a+b)/(math.gamma(a)*math.gamma(b)))**m)*((math.gamma(c+d)/(math.gamma(c)*math.gamma(d)))**(n-m))*aux3*aux4)
            if min(1, rm) > np.random.uniform(0, 1):
                m = m1
        except:
            pass

        a_params.append(a)
        b_params.append(b)
        c_params.append(c)
        d_params.append(d)
        m_params.append(m)

    return a_params, b_params, c_params, d_params, m_params

Call the function with the two required arguments: here I passed center_1, returned by Fuzzy_Set, and n_sims=1000.

a_params, b_params, c_params, d_params, m_params = Metropolis_Hastings(center_1, n_sims=1000)
fig_dims = (16, 10)
fig, ax = plt.subplots(figsize=fig_dims)
plt.plot(m_params, 'r')
ax.set(xlabel='# Simulations', ylabel='Change Point Candidates (m)')
The Metropolis-Hastings Algorithm Simulations — Image by author

We have finally found the change point. This chart is interesting because it shows how the sampling process unfolded. The first value drawn from the uniform distribution was m=55. The algorithm rejected it and kept trying until it reached a satisfactory and stationary result. After roughly 150 additional iterations, the value m=30 was no longer rejected by the algorithm.

Since the function returns the sampled values for every parameter, we can plot them as well, starting with the parameter m, which holds all the change point candidates drawn. To see the density plot, you may discard the first 200 simulations as burn-in:

fig_dims = (16, 10)
fig, ax = plt.subplots(figsize=fig_dims)
ax.set(xlabel='Change Point Candidates', ylabel='Density')
sns.kdeplot(m_params[200:])
Density Plot for the Change Point Candidates — Image by author
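Beyond the density plot, a minimal follow-up sketch (my own addition, using the same burn-in of 200 draws) is to take the most frequent sampled value of m as a point estimate of the change point:

from collections import Counter

m_hat = Counter(m_params[200:]).most_common(1)[0][0]
print(m_hat)  # should be close to 30 for the simulated series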

We can also build Beta distributions using the means of the four other parameters, namely a, b, c, and d. As discussed previously, these parameters are essential for the Metropolis-Hastings algorithm, since the rejection rule must ensure that for t ≤ m we have a Beta(a,b) distribution and for t > m a Beta(c,d) distribution.

Let’s go ahead and build a representation of that using the variables a_params, b_params, c_params and d_params, which contain the sampled values for a, b, c, and d:

fig, ax = plt.subplots(1, 1, figsize=(16, 8))

ax.set_ylabel('Density')

beta_1 = np.random.beta(np.mean(a_params[200:]), np.mean(b_params[200:]), size=1000)
beta_2 = np.random.beta(np.mean(c_params[200:]), np.mean(d_params[200:]), size=1000)

sns.kdeplot(beta_1, color='b', fill=True, common_norm=False, alpha=.5, linewidth=0)
sns.kdeplot(beta_2, color='r', fill=True, common_norm=False, alpha=.5, linewidth=0)

The first Beta distribution, built with the means of parameters a and b, is shown in blue, and the second, built with the means of parameters c and d, in red. In the region between them, where both densities are low, we find the change point. Undoubtedly, one of the great advantages of this approach is the availability of these distributions, since we can use them to make Bayesian inferences, enrich a predictive model, or feed other kinds of Monte Carlo simulations.

Representations of the two Beta distributions built from the parameters a, b, c, and d — Image by author

Conclusion

Finding change points in time series can prevent systems from getting into serious trouble. Just imagine a machine whose temperature has to be controlled: any abrupt change must be recognized as soon as possible so that an engineer can inspect it. Or think of the energy consumption in various production facilities: any excessive consumption has to be analyzed, because it may indicate a deviation in production or even an energy leak, significantly impacting production costs.

That said, the approach developed by D’Angelo et al. (2011) and implemented here in Python proved to be of great value for detecting change points in a given time series. In addition, as previously pointed out, another great aspect of this approach is precisely the fact that we obtain two Beta distributions as outputs, which can be really useful.

References

AMINIKHANGHAHI, Samaneh and COOK, Diane J. A Survey of Methods for Time Series Change Point Detection. Knowledge and Information Systems, 51, 2017.

D’ANGELO, Marcos Flávio et al. A Fuzzy/Bayesian Approach for the Time Series Change Point Detection Problem. Pesquisa Operacional, 31(2), 2011.

EHLERS, Ricardo S. Inferência Bayesiana. 5th ed. Departamento de Estatística da Universidade Federal do Paraná, 2007.

FAMA, Eugene. Random Walks in Stock Market Prices. Financial Analysts Journal, 21, 1965.

_____________ Efficient Capital Markets: A Review of Theory and Empirical Work. The Journal of Finance, 25, 1970.

GAMERMAN, Dani; LOPES, Hedibert Freitas. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. 2nd ed. Florida: Chapman & Hall/CRC, 2006. 315p.

GELMAN, Andrew et al. Bayesian Data Analysis. 2nd ed. Florida: Chapman & Hall/CRC, 2003. 668p.

HASTINGS, W. K. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika, 57, 1970.

HAYKIN, Simon. Redes Neurais: Princípios e Práticas. 2nd ed. Porto Alegre: Bookman, 2007. 900p.

IWATA, Takuma et al. Accelerating Online Change-Point Detection Algorithm using 10GbE FPGA NIC. Paper presented at the 24th International Conference on Parallel and Distributed Computing, Turin, 2018.

KOHONEN, Teuvo. Self-Organizing Maps. New York: Springer, 1990. 501p.

METROPOLIS, Nicholas et al. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21, 1953.

OH, K. J. et al. Developing Time-Based Clustering Neural Networks to Use Change-Point Detection: Application to Financial Time Series. Asia-Pacific Journal of Operational Research, 22(1), 2005.

PAULINO, Carlos Daniel et al. Estatística Bayesiana. 2nd ed. Lisboa: Fundação Calouste Gulbenkian, 2018. 601p.

VAN DEN BURG, Gerrit J. J.; WILLIAMS, Christopher K. I. An Evaluation of Change Point Detection Algorithms. arXiv:2003.06222 [stat.ML], 2020.
