Article Purpose
In this article, we’re diving into the critical concepts of unit roots and stationarity. Buckle up for an exploration into why checking stationarity is crucial, what unit roots are, and how these elements play a key role in our predictive maintenance arsenal. We will also master the chaos! This article is part of the series Understanding Predictive Maintenance. I plan to create the entire series in a similar style.
Check the whole series in this link. Ensure you don’t miss out on new articles by following me.
Data Stationarity – an analytical game of hide and seek
Ever wondered if your data is playing a game of hide and seek? Let’s cut to the chase – we’re talking about stationarity. It’s not just a fancy term; it’s the secret sauce to understanding how stable and predictable your time-dependent data really is. Buckle up as we explore why data stationarity is the game-changer in modeling and forecasting.
Key Rules of Stationarity
- Constant Mean: A stationary time series should exhibit a consistent average value over time. If the mean changes, it suggests a shift in the underlying behavior of the process.
- Constant Variance: The variance of the time series, representing the spread of data points, should remain constant. Fluctuations in variance can make it challenging to make accurate predictions.
- Constant Autocorrelation: Autocorrelation measures the correlation between a time series and its lagged values. In a stationary series, the strength and pattern of autocorrelation should be consistent throughout.
Just "stability" of statistical properties.
Why Stationarity is a Big Deal
Imagine your predictive models as expert navigators sailing through the sea of data. To navigate smoothly, they prefer calm waters – that’s where stationarity comes in. Stationary data is like a serene ocean, where patterns stay consistent. But, if your data is a stormy sea with waves of ups and downs (non-stationary), accurate predictions become a real challenge. That’s why we need to spot these storms and transform our data into a peaceful pond for effective time series analysis.
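To make the stormy sea concrete, here is a toy sketch (my own illustration, not from any library): compare the mean of the first and second halves of a flat, noisy series against a trending one. For the trend, the two halves disagree wildly, so any model calibrated on the first half is misled about the second.
import numpy as np

rng = np.random.default_rng(0)
calm = rng.normal(size=200)                           # stationary: flat mean
stormy = rng.normal(size=200) + 0.5 * np.arange(200)  # trending: mean keeps drifting

for name, s in [('calm', calm), ('stormy', stormy)]:
    print(f'{name}: first-half mean={s[:100].mean():.2f}, '
          f'second-half mean={s[100:].mean():.2f}')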
Real-world Implications
Data stationarity isn’t just a tech thing; it’s everywhere, influencing decisions from finance to predicting the weather. In finance, where precision is key for risk and return estimates, assuming stationarity is like having a reliable compass. Climate scientists rely on stationary models to predict long-term weather patterns – it’s like having a trustworthy weather app for Earth’s future.
Journey to Insightful Analysis
Getting our data stationary is more than a tech quest; it’s an adventure toward clarity. It’s like transforming a chaotic treasure map into a clear guide that helps analysts and decision-makers make sense of it all. In the dynamic world of time-dependent data, stationarity becomes our trusty map, guiding us to understand the patterns beneath the surface and making our journey through data waters much smoother.
Alright, now that we get why it’s cool to have calm data, let’s learn how to make it chill. But wait, before we get our hands dirty with code, let me introduce you to something called "unit roots." Think of them as the special ingredients that shape how our data behaves. Knowing about unit roots is like having a secret recipe for turning our wavy, wild data into a smooth pond, ready for us to dive in and explore. So, get ready for the next part of our journey!
Unit Roots – Mischievous Time Travelers in Data’s History Book
Unit roots are fundamental concepts in time series analysis, playing a pivotal role in understanding the behavior and characteristics of real-world data. In this exploration, we’ll delve into what unit roots are, why they are important in real data analysis, and how they influence the predictive maintenance landscape. Of course, we will do some experiments in the hands-on section.
What are unit roots?
A unit root in a time series variable implies a stochastic process where the variable’s value at any given time is influenced by its past values. Formally, consider the AR(1) process y_t = ρ·y_{t-1} + ε_t: a unit root means ρ = 1, the random walk case, in which the series never reverts to a constant mean over time. That failure to mean-revert is the hallmark of non-stationarity.
The presence of unit roots introduces persistence into the time series, leading to challenges in modeling and forecasting. The Augmented Dickey-Fuller (ADF) test and other statistical methods are employed to detect the existence of unit roots, providing a quantitative measure of non-stationarity.
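A concrete way to see this is to simulate both cases. The sketch below (my own illustration, using the same adfuller function introduced later in this article) compares a random walk (ρ = 1) with a mean-reverting AR(1) (ρ = 0.5), and shows that first-differencing the random walk recovers a stationary series.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(42)
eps = rng.normal(size=500)

# rho = 1: unit root (random walk), every shock persists forever
random_walk = np.cumsum(eps)

# rho = 0.5: no unit root, shocks decay and the series mean-reverts
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]

print('random walk p-value:', adfuller(random_walk)[1])                # large: cannot reject unit root
print('AR(0.5) p-value:', adfuller(ar1)[1])                            # small: stationary
print('differenced walk p-value:', adfuller(np.diff(random_walk))[1])  # small again: differencing removed the unit root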
Unit roots are like the storytellers of our data, weaving narratives that extend beyond individual moments and create a continuous storyline. They signify the persistence of historical influences, introducing an element of memory into the numerical fabric of our datasets.
Imagine your dataset as a historical novel, with each data point representing a chapter in the unfolding tale. Unit roots, in this context, are the recurring motifs and characters that leave an indelible mark on the narrative, guiding the plot with a subtle yet consistent influence.
Why is it important for us?
Understanding unit roots is fundamental for time series analysts and modelers. Non-stationary data poses challenges, as traditional models often assume stationarity for accurate predictions. Analysts must address unit roots by employing transformations, such as differencing, to induce stationarity and facilitate model development.
In predictive maintenance scenarios, detecting unit roots is crucial for the accuracy of forecasting models. The long-term memory embedded in unit roots can significantly degrade the reliability of predictions, making their identification and mitigation paramount for effective maintenance strategies.
As we navigate this technical exploration, we will delve deeper into unit root testing methodologies, interpret the results, and explore strategies for handling non-stationary time series data. The theoretical underpinnings of unit roots provide a solid foundation for the practical applications that follow in our analytical journey.
The Augmented Dickey-Fuller (ADF) test helps us
Imagine you have a line of ants moving in a certain direction. The ADF test checks if the ants are marching with a purpose (stationary) or if they’re randomly scattered all over the place (non-stationary).
The ADF test involves a bit of math, but let’s simplify it:
- Null Hypothesis (H0): This is the default assumption. For the ADF test, the null hypothesis is that the data has a unit root, which means it’s non-stationary. It’s like saying the ants are wandering randomly. H0: the data has a unit root (non-stationary).
- Alternative Hypothesis (H1): This is what we’re trying to prove. The alternative hypothesis is that the data is stationary, like the ants marching in a clear line. H1: the data is stationary.
- Test Statistic: The ADF test calculates a number called the test statistic. The more negative this number, the stronger the evidence that the data is stationary.
- P-value: This is a probability score. If the p-value is small (less than a chosen threshold, like 0.05), we reject the null hypothesis and accept the alternative, saying our data is probably stationary.
This is not very complicated: just run the test and check the p-value.
from statsmodels.tsa.stattools import adfuller

# Perform the Augmented Dickey-Fuller (ADF) test for stationarity
(adf_statistic, adf_p_value, adf_lags,
 adf_nobs, adf_critical_values, adf_icbest) = adfuller(stationary_series)

# Check if the series is stationary based on the p-value
is_stationary = adf_p_value < 0.05  # Using a significance level of 0.05
Most of the time, you will probably use adfuller like this:
# What you will probably use most of the time
_, adf_p_value, _, _, _, _ = adfuller(stationary_series)
But let me explain what is behind these variables:
- adf_statistic: The test statistic from the ADF test, indicating the strength of evidence against the null hypothesis of non-stationarity.
- adf_p_value: The p-value associated with the null hypothesis. A lower p-value suggests stronger evidence against non-stationarity.
- adf_lags: The number of lags used in the test.
- adf_nobs: The number of observations used in the ADF test.
- adf_critical_values: The critical values for the test statistic at the 1%, 5%, and 10% significance levels.
- adf_icbest: The maximized information criterion (AIC by default) from the automatic lag-length selection. The full regression results are not part of the default return; adfuller only exposes them if you ask for them via its store/regresults options.
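Besides the p-value, you can compare the test statistic directly against the critical values, which is another standard way to read the ADF output. A minimal sketch, assuming stationary_series is any series you want to test:
from statsmodels.tsa.stattools import adfuller

result = adfuller(stationary_series)
adf_statistic, adf_critical_values = result[0], result[4]

# If the test statistic is more negative than a critical value,
# we reject the unit-root null at that significance level.
for level, crit in adf_critical_values.items():
    verdict = 'reject H0' if adf_statistic < crit else 'fail to reject H0'
    print(f'{level}: statistic={adf_statistic:.3f}, critical={crit:.3f} -> {verdict}')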
While chaos might seem daunting, we can transform it into our ally by understanding and harnessing its patterns. In the realm of data and analysis, chaos can be a powerful force that, when properly channeled, provides insights, predictions, and a clearer path forward. It’s all about turning the unpredictable into an advantage, making chaos our strategic companion in the journey of exploration and understanding.
How "random" is your random?
Let’s kick things off by generating a straightforward stationary series, but here’s a heads-up: not all "random" is created equal. There are two main flavors of randomness – true random and pseudorandom. Chances are, you’ve been hanging out with pseudorandom more often because that’s the go-to for computers.
In computing, generating truly random numbers is a challenge because computers are deterministic machines. Pseudorandom numbers, as the name suggests, are not genuinely random but instead are generated by algorithms that simulate randomness. These algorithms start with an initial value called a seed and use it to produce a sequence of numbers that appears random.
Seeds
The seed is a crucial element in pseudorandom number generation. It serves as the starting point for the algorithm. If you use the same seed, you’ll get the same sequence of pseudorandom numbers every time. This determinism can be advantageous in scenarios where you want reproducibility. For example, if you’re running a simulation or an experiment that involves randomness, setting the seed allows you to recreate the exact sequence of random numbers.
On the flip side, changing the seed results in a different sequence of pseudorandom numbers. This property is often used to introduce variability in simulations or to provide different initial conditions for algorithms that use randomness.
In summary, pseudorandom numbers are generated by algorithms, and the seed is the starting point for these algorithms. Controlling the seed allows you to control the sequence of pseudorandom numbers, providing a balance between determinism and variability in computer-generated randomness.
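Here is a quick demonstration of that determinism: reseeding with the same value replays the exact same "random" numbers, while a different seed gives a different sequence.
import numpy as np

np.random.seed(1992)
print(np.random.randn(3))  # some sequence a, b, c

np.random.seed(1992)
print(np.random.randn(3))  # identical sequence: a, b, c again

np.random.seed(2024)
print(np.random.randn(3))  # different seed, different sequence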
Time to generate our pseudorandom distribution.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1992)  # WOW, this is our deterministic seed.

def generate_stationary_series_pseudorandom(size=100):
    # White noise drawn from the standard normal distribution
    stationary_series = np.random.randn(size)
    return stationary_series
Can we use true randomness?
Now we might feel surprised that even the randomness we interact with most of the time is deterministic. But can we produce true randomness, ensuring that no determinism is hiding behind it?
Well, good news! We can tap into something truly physical – atmospheric noise. Remember those flickering black and white dots on your TV screen? That’s our atmospheric noise, and we’re going to harness it to whip up some genuine randomness. So, your TV’s not just for shows; it’s your ticket out of the deterministic world.
import requests

def generate_stationary_series_random(size=100):
    # Fetch truly random values from the random.org atmospheric noise API
    response = requests.get(
        f'https://www.random.org/integers/?num={size}&min=-10000&max=10000'
        '&col=1&base=10&format=plain&rnd=new'
    )
    if response.status_code == 200:
        # Each value arrives on its own line of the plain-text response
        stationary_series = [int(value) for value in response.text.strip().split('\n')]
        return stationary_series
    else:
        raise Exception(f"Failed to fetch random values. Status code: {response.status_code}")
Using this function, we can generate true randomness. Hooray!
Stationarity check
First, let’s generate the series.
# Generate series
stationary_series_pseudorandom = generate_stationary_series_pseudorandom()
stationary_series_random = generate_stationary_series_random()
Create the results plot.
titles = [
    'stationary_series_pseudorandom',
    'stationary_series_random'
]
plot_multiple_series(stationary_series_pseudorandom, stationary_series_random,
                     titles=titles)
_, adf_p_value, _, _, _, _ = adfuller(stationary_series_pseudorandom)
print(f'PseudoRandom adf p-value: {adf_p_value}')

_, adf_p_value, _, _, _, _ = adfuller(stationary_series_random)
print(f'TrueRandom adf p-value: {adf_p_value}')
When the p-value is very small (<0.05), it provides evidence against the null hypothesis, suggesting that your data is likely stationary.
So, in this case, with a p-value much smaller than 0.05, you have the confidence to say, "Yes, our data is stationary."
Now, let’s take a moment to crunch the numbers. Our pseudorandom series boasts a p-value approximately 2 million times smaller than that of the truly random one.
Why does this happen? Pseudorandom numbers are generated by algorithms, introducing a level of determinism. These algorithms can unintentionally introduce patterns or structure into the data. On the other hand, truly random data, like atmospheric noise, is more likely to exhibit the characteristics of pure randomness. The ADF test, keen on detecting patterns indicative of non-stationarity, may find less evidence of such patterns in truly random data, leading to a relatively higher P-value.
Hands-on experience
Now it is time to get our hands dirty with code. We will run some experiments to help you get familiar with the concepts from this article, and I recommend reproducing them yourself.
Next, we will add a couple of examples showing how to make this data non-stationary: we are going to break our key rules of stationarity one by one. After the explanations, we will plot all of them.
Linear Trend (Non-Constant Mean)
def generate_non_stationary_linear_trend(size=100):
    time = np.arange(size)
    linear_trend = 0.5 * time
    non_stationary_series = np.random.randn(size) + linear_trend
    return non_stationary_series
Introducing a linear trend to violate the constant mean rule means adding a systematic increase or decrease over time. In the non-stationary linear trend series, the values increase linearly over time. This violates the constant mean rule because the average value of the series keeps changing, indicating a shift in the underlying behavior of the process. Strictly speaking, a deterministic trend like this makes the series trend-non-stationary rather than giving it a unit root, but both effects break the constant-mean assumption and both will mislead models that expect stationary input.
Sine Amplitude (Non-Constant Variance)
def generate_non_stationary_sin_amplitude(size=100):
    time = np.arange(size)
    amplitude = 0.5 + 0.02 * time
    sin_amplitude_component = amplitude * np.sin(2 * np.pi * time / 10)
    non_stationary_series = np.random.randn(size) + sin_amplitude_component
    return non_stationary_series
Adding a sinusoidal component with increasing amplitude violates the constant variance rule. In the non-stationary sinusoidal amplitude series, the amplitude of the sinusoidal component grows linearly with time. This causes the spread of the data points, and therefore the variance, to depend on time, which is exactly what the constant-variance rule forbids.
Exponential Growth (Non-Constant Autocorrelation)
def generate_non_stationary_exponential_growth(size=100, growth_rate=0.05):
    time = np.arange(size)
    exponential_growth_component = np.exp(growth_rate * time)
    non_stationary_series = np.random.randn(size) + exponential_growth_component
    return non_stationary_series
Incorporating an exponential growth pattern violates the constant autocorrelation rule. The exponential growth component dominates the series more and more as time passes, so the correlation between the series and its lagged values changes over time instead of staying fixed. This persistent, non-mean-reverting behavior, where the series never returns to a constant level, is exactly the kind of pattern that unit-root tests such as ADF are designed to flag.
Start the experiments
Execute the code to generate the time series and plot the results.
# Example usage
stationary_series_pseudorandom = generate_stationary_series_pseudorandom()
non_stationary_linear_trend_series = generate_non_stationary_linear_trend()
non_stationary_sin_amplitude_series = generate_non_stationary_sin_amplitude()
non_stationary_exponential_growth_series = generate_non_stationary_exponential_growth()

# Visualize the examples
plot_multiple_series(stationary_series_pseudorandom,
                     non_stationary_linear_trend_series,
                     non_stationary_sin_amplitude_series,
                     non_stationary_exponential_growth_series,
                     titles=[
                         'Stationary series',
                         'Linear Trend (Non-Constant Mean)',
                         'Sinusoidal Amplitude (Non-Constant Variance)',
                         'Exponential Growth (Non-Constant Autocorrelation)'
                     ])
Spotting a linear trend or exponential growth during exploratory data analysis is relatively straightforward, as these patterns exhibit clear visual cues. However, distinguishing between stationary and non-stationary states becomes challenging when dealing with sinusoidal amplitude. Visually, it’s hard to differentiate whether the amplitude is stationary or non-stationary just by looking at the data.
This case shows the power of statistical tests. We have powerful tools in our hands.
_, adf_p_value_stationary, _, _, _, _ = adfuller(stationary_series_pseudorandom)
_, adf_p_value_linear_trend, _, _, _, _ = adfuller(generate_non_stationary_linear_trend())
_, adf_p_value_sin_amplitude, _, _, _, _ = adfuller(generate_non_stationary_sin_amplitude())
_, adf_p_value_exponential_growth, _, _, _, _ = adfuller(generate_non_stationary_exponential_growth())
# Print the results
print(f'PseudoRandom ADF P-value (Stationary Series): {adf_p_value_stationary}')
print(f'PseudoRandom ADF P-value (Linear Trend): {adf_p_value_linear_trend}')
print(f'PseudoRandom ADF P-value (Sinusoidal Amplitude): {adf_p_value_sin_amplitude}')
print(f'PseudoRandom ADF P-value (Exponential Growth): {adf_p_value_exponential_growth}')
The ADF test provides a clear distinction between stationary and non-stationary time series. In the first case, we can confidently reject the null hypothesis, indicating that the time series is stationary. For the other cases, we fail to reject the null hypothesis and conclude that the data is non-stationary. Notably, in the case of the sinusoidal amplitude, where the non-stationarity is hard to spot visually, the ADF test still catches it by not letting us reject the null hypothesis.
Practice the transformation
Now, let’s have some fun with transformations and attempt to convert our non-stationary time series into a stationary one – like a bit of reverse engineering. In real-life scenarios, determining the exact transformation needed is often a trial-and-error process. I recommend conducting exploratory data analysis, plotting the time series, and making empirical attempts. If a transformation renders the series stationary, you not only achieve stationarity but also gain valuable insights into the characteristics of your data.
def make_linear_trend_stationary(series):
    # Subtract the linear trend to make the mean constant.
    time = np.arange(len(series))
    linear_trend = 0.5 * time  # Somehow we have found this trend :)
    stationary_series = series - linear_trend
    return stationary_series

def make_sin_amplitude_stationary(series):
    # Apply differencing to stabilize the series and make the variance constant.
    # Note: np.diff shortens the series by one sample.
    diff_series = np.diff(series)
    return diff_series

def make_exponential_growth_stationary(series, epsilon=1e-8):
    # Replace zero or negative values with a small constant so the log is defined
    series = np.where(series <= 0, epsilon, series)
    # Shift by a small constant to avoid non-finite values
    series += epsilon
    # Apply the log for variance stabilization
    series = np.log(series)
    # Take the first difference to remove the exponential growth
    stationary_series = np.diff(series)
    return stationary_series
Having defined our transformation functions, it’s time to put them to work. Let’s apply these transformations to our non-stationary time series and see if we can successfully induce stationarity.
# Apply transformations to make non-stationary examples stationary
stationary_linear_trend = make_linear_trend_stationary(generate_non_stationary_linear_trend())
stationary_sin_amplitude = make_sin_amplitude_stationary(generate_non_stationary_sin_amplitude())
stationary_exponential_growth = make_exponential_growth_stationary(generate_non_stationary_exponential_growth())
# Perform ADF test for the transformed series
adf_p_value_stationary_linear_trend = adfuller(stationary_linear_trend)[1]
adf_p_value_stationary_sin_amplitude = adfuller(stationary_sin_amplitude)[1]
adf_p_value_stationary_exponential_growth = adfuller(stationary_exponential_growth)[1]
# Print the results
print(f'ADF P-value (Stationary Linear Trend): {adf_p_value_stationary_linear_trend}')
print(f'ADF P-value (Stationary Sinusoidal Amplitude): {adf_p_value_stationary_sin_amplitude}')
print(f'ADF P-value (Stationary Exponential Growth): {adf_p_value_stationary_exponential_growth}')
And here is how this data looks:
Great news! With our data now stationary, we confidently reject the null hypothesis in each case. Now, for a bit of fun, I’ll take on the challenge of reverse engineering your random generation with the given seed. Let’s see if I can unravel the mystery! 😄
Check the whole series in this link. Ensure you don’t miss out on new articles by following me.