Employ K-Nearest Neighbors to handle missing time-series data

(If you haven’t read Part 1 yet, check it out here.)
Missing data in time-series analysis is a recurring problem.
As we explored in Part 1, simple imputation techniques or even regression-based models (linear regression, decision trees) can get us a long way.
But what if we need to handle more subtle patterns and capture the fine-grained fluctuations in complex time-series data?
In this article we will explore K-Nearest Neighbors (KNN). This model makes few assumptions about the relationships in your data, including nonlinear ones, which makes it a versatile and robust solution for missing-data imputation.
We will be using the same mock energy production dataset that you already saw in Part 1, with 10% of the values missing, introduced at random.
We will impute the missing data using a dataset that you can easily generate yourself, allowing you to follow along and apply the techniques in real time as you explore the process step by step!
This time, we will see how KNN can be more effective than simpler methods at filling in the gaps.
Hello there!
My name is Sara Nóbrega, and I am a Data Scientist specializing in AI Engineering. I hold a Master’s degree in Physics and later transitioned into the exciting world of Data Science.
I write about data science, artificial intelligence, and data science career advice. Make sure to follow me and subscribe to receive updates when the next article is published!
Contents
- Data Note: Mock Energy Production Dataset
- Why and When to Use Non-Linear Machine Learning for Imputation?
- Part 2: KNN for Time-Series Imputation
  - Statistical Comparison
  - Autocorrelation, STL Decomposition and Residual Analysis
- Potential Limitations of KNN
- When to use KNN for Time-Series Imputation
Data Note: Mock Energy Production Dataset
In Part 1, we used a mock energy production dataset with 10-minute intervals from January 1, 2023, to March 1, 2023. About 10% of the data points are missing, to simulate a realistic scenario for imputation.
Here is the data:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

# Generate the mock energy production data
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 3, 1)
datetime_index = pd.date_range(start=start_date, end=end_date, freq='10T')

# Create energy production values with day-night cycles
np.random.seed(42)
base_energy = []
for dt in datetime_index:
    hour = dt.hour
    if 6 <= hour <= 18:
        energy = np.random.normal(loc=300, scale=30)  # daytime: high production
    else:
        energy = np.random.normal(loc=50, scale=15)   # nighttime: low production
    base_energy.append(energy)

energy_production = pd.Series(base_energy)

# Introduce missing values
num_missing = int(0.1 * len(energy_production))
missing_indices = np.random.choice(len(energy_production), num_missing, replace=False)
energy_production.iloc[missing_indices] = np.nan

mock_energy_data_with_missing = pd.DataFrame({
    'Datetime': datetime_index,
    'Energy_Production': energy_production
})

# Reset index for easier handling
data_with_index = mock_energy_data_with_missing.reset_index()
data_with_index['Time_Index'] = np.arange(len(data_with_index))  # Add time-based index

plt.figure(figsize=(14, 7))
plt.plot(mock_energy_data_with_missing['Datetime'], mock_energy_data_with_missing['Energy_Production'],
         label='Energy Production (With Missing)', color='blue', alpha=0.7)
plt.scatter(mock_energy_data_with_missing['Datetime'], mock_energy_data_with_missing['Energy_Production'],
            c=mock_energy_data_with_missing['Energy_Production'].isna(), cmap='coolwarm',
            label='Missing Values', s=10)  # Reduced size of the markers
plt.title('Mock Energy Production Data with Missing Values (10-Minute Intervals)')
plt.xlabel('Datetime')
plt.ylabel('Energy Production')
plt.legend()
plt.grid(True)
plt.show()
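If you are following along, a one-line sanity check confirms that roughly 10% of the values are indeed missing:

# Fraction of missing values; should print approximately 0.10
print(mock_energy_data_with_missing['Energy_Production'].isna().mean())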

Why and When to Use Non-Linear Machine Learning for Imputation?
K-Nearest Neighbors takes advantage of the proximity between data points. This method has proved especially useful when:
- Nonlinear Patterns Exist: KNN handles unusual or periodic trends better than simple polynomials or basic smoothing.
- More Data Features Are Available: If you have several correlated features (temperature, humidity, day-of-week indicators), KNN can take advantage of them.
- Gaps Are Larger: KNN captures a wider context and therefore extends better to longer missing intervals.
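To make the mechanics concrete before we apply them to the energy data, here is a tiny sketch of what scikit-learn's KNNImputer does, with made-up numbers:

import numpy as np
from sklearn.impute import KNNImputer

# Four observations: one feature (first column), one target (second column)
X = np.array([[1.0, 10.0],
              [2.0, np.nan],  # missing value to impute
              [3.0, 30.0],
              [8.0, 80.0]])

# With n_neighbors=2 and uniform weights, the NaN is filled from the two
# rows closest in feature space (rows 0 and 2): (10 + 30) / 2 = 20
print(KNNImputer(n_neighbors=2).fit_transform(X))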
K-Nearest Neighbors for Time-Series Imputation
We will use KNNImputer from scikit-learn to fill in missing values based on the nearest neighbors’ known observations. Below is an example of how to do it:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from datetime import datetime

# Assuming data_with_index is already defined as per the data code above
# Columns: ['index', 'Datetime', 'Energy_Production', 'Time_Index']

# Step 1: Feature Engineering
data_knn = data_with_index.copy()

# Extract hour, day of week, and encode cyclical features
data_knn['Hour'] = data_knn['Datetime'].dt.hour
data_knn['DayOfWeek'] = data_knn['Datetime'].dt.dayofweek  # 0=Monday, 6=Sunday

# Cyclical encoding for 'Hour'
data_knn['Hour_Sin'] = np.sin(2 * np.pi * data_knn['Hour'] / 24)
data_knn['Hour_Cos'] = np.cos(2 * np.pi * data_knn['Hour'] / 24)

# Cyclical encoding for 'DayOfWeek'
data_knn['Day_Sin'] = np.sin(2 * np.pi * data_knn['DayOfWeek'] / 7)
data_knn['Day_Cos'] = np.cos(2 * np.pi * data_knn['DayOfWeek'] / 7)

# Optional: Remove 'Time_Index' if not adding value
# You can comment out the next line if you want to keep 'Time_Index'
data_knn = data_knn.drop(columns=['Time_Index'])

# Step 2: Prepare Features and Target
# Select relevant features for imputation
feature_columns = ['Hour_Sin', 'Hour_Cos', 'Day_Sin', 'Day_Cos']
features = data_knn[feature_columns]
target = data_knn['Energy_Production']

# Combine features and target for imputation
combined_data = pd.concat([features, target], axis=1)

# Step 3: Feature Scaling (StandardScaler disregards NaNs when fitting)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(combined_data)

# Step 4: Apply KNNImputer
knn_imputer = KNNImputer(n_neighbors=5, weights='distance')
imputed_array = knn_imputer.fit_transform(scaled_data)

# Inverse transform the scaled data back to the original scale
imputed_data = scaler.inverse_transform(imputed_array)

# Step 5: Update the 'Energy_Production' column with imputed values
knn_filled_data = data_knn.copy()
knn_filled_data['Energy_Production'] = imputed_data[:, -1]

# Step 6: Visualization of Imputation Results
# Define the time range for visualization
start_month = datetime(2023, 1, 1)
end_month = datetime(2023, 1, 31)

# Extract original and imputed data for the specified range
original_month_data = mock_energy_data_with_missing[
    (mock_energy_data_with_missing['Datetime'] >= start_month) &
    (mock_energy_data_with_missing['Datetime'] <= end_month)
]
knn_month_data = knn_filled_data[
    (knn_filled_data['Datetime'] >= start_month) &
    (knn_filled_data['Datetime'] <= end_month)
]

# Plotting
plt.figure(figsize=(14, 7))
plt.plot(knn_month_data['Datetime'], knn_month_data['Energy_Production'],
         label='Imputed Data (KNN)', color='purple', alpha=0.8)
plt.plot(original_month_data['Datetime'], original_month_data['Energy_Production'],
         label='Original Data (with Missing)', color='red', alpha=0.5)
plt.scatter(original_month_data['Datetime'], original_month_data['Energy_Production'],
            c=original_month_data['Energy_Production'].isna(), cmap='coolwarm',
            label='Missing Values', s=10)
plt.title('Original vs. KNN-Imputed Energy Production Data (January 2023)')
plt.xlabel('Datetime')
plt.ylabel('Energy Production')
plt.legend()
plt.grid(True)
plt.show()

In Figure 2, we can see how missing values have been replaced with realistic estimates based on patterns in the data.
This code preprocesses the dataset in order to fill in missing energy production values, using the time-related patterns present in the data.
Features are created for the hour of day and the day of week, and each is given a cyclical encoding: a transformation that represents a time-based feature in a way that captures its natural cycle. Consider, for instance, that the hour of the day runs from 0 (midnight) to 23 (11 PM); these two hours are actually adjacent in time, and the sine/cosine encoding places them close together. These features help capture meaningful time-based trends.
In preparation for imputation, all features are standardized so that they contribute equally to the distance computation in the KNN algorithm, which fills missing values based on the most similar data points.
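A minimal sketch shows why this matters: in raw units hours 0 and 23 look 23 apart, while in the encoded space they are nearly neighbors.

import numpy as np

def cyc(hour):
    # Map an hour (0-23) onto the unit circle
    return np.array([np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24)])

print(abs(0 - 23))                       # raw distance: 23
print(np.linalg.norm(cyc(0) - cyc(23)))  # encoded distance: about 0.26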
Observations
- Local Adaptation: KNN is based on distances between points in the engineered feature space; it captures day-night cycles provided that this cycling is reflected in features such as the encoded Hour.
- Nonlinear Preservation: Unlike a straight line, KNN predictions take the shape of the data and will retain peaks and troughs if neighboring points show such patterns.
For now, it seems the KNN imputation fairly restored the missing values using the surrounding time-series patterns! A single plot, however, is not enough to judge the result.
Next, let’s evaluate the results of this KNN imputation.
Statistical Comparison
Just as in Part 1, we evaluate mean, std, min, max, and quartiles:
original_stats = mock_energy_data_with_missing['Energy_Production'].describe()
knn_stats = knn_filled_data['Energy_Production'].describe()

stats_comparison_knn = pd.DataFrame({
    'Metric': original_stats.index,
    'Original Data': original_stats.values,
    'Imputed Data (KNN)': knn_stats.values
})
stats_comparison_knn

We can see that the descriptive statistics of the imputed data closely resemble those of the original data, with the overall characteristics preserved. That is good news: the imputed data keeps realistic variability in the energy production values!
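To go one step beyond describe(), a two-sample Kolmogorov-Smirnov test (an extra check on top of the Part 1 workflow) compares the observed values with just the imputed ones:

from scipy.stats import ks_2samp

observed = mock_energy_data_with_missing['Energy_Production'].dropna()
imputed_only = knn_filled_data.loc[
    mock_energy_data_with_missing['Energy_Production'].isna(), 'Energy_Production'
]

# A small KS statistic (and large p-value) means the imputed values are
# statistically hard to distinguish from the observed distribution
print(ks_2samp(observed, imputed_only))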
Autocorrelation, STL Decomposition and Residual Analysis
We will create autocorrelation function (ACF) plots and perform STL decomposition, as in Part 1, to assess the underlying seasonal patterns, trends, and autocorrelations in the time series; these provide a deeper understanding of the imputation results.
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose

# ACF Comparison Function
def plot_acf_comparison(original_series, imputed_series, lags=50):
    plt.figure(figsize=(14, 5))
    # Original Data ACF (using linear interpolation to handle missing values)
    original_interpolated = original_series.interpolate(method='linear')
    plt.subplot(1, 2, 1)
    sm.graphics.tsa.plot_acf(original_interpolated, lags=lags, ax=plt.gca(),
                             title="ACF of Original Data (Interpolated)")
    plt.grid(True)
    # Imputed Data ACF
    plt.subplot(1, 2, 2)
    sm.graphics.tsa.plot_acf(imputed_series, lags=lags, ax=plt.gca(),
                             title="ACF of KNN-Imputed Data")
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Perform ACF Comparison
plot_acf_comparison(
    mock_energy_data_with_missing['Energy_Production'],
    knn_filled_data['Energy_Production']
)

# Decomposition Function
# Note: seasonal_decompose performs a classical additive decomposition,
# used here as a simple stand-in for STL
def perform_stl_decomposition(series, period, title_suffix=''):
    decomposition = seasonal_decompose(series, model='additive', period=period)
    return decomposition

# Define the seasonal period
# Data is at 10-minute intervals, so one day spans 144 intervals
# (144 * 10 minutes = 1440 minutes = 24 hours)
seasonal_period = 144

# Handle missing values in the original data by linear interpolation for decomposition
original_interpolated = mock_energy_data_with_missing['Energy_Production'].interpolate(method='linear')

# Perform decomposition on the original data
original_decompose = perform_stl_decomposition(original_interpolated, period=seasonal_period, title_suffix='Original')

# Perform decomposition on the KNN-imputed data
knn_decompose = perform_stl_decomposition(knn_filled_data['Energy_Production'], period=seasonal_period, title_suffix='KNN-Imputed')

# Visualization of Decomposition Components
# Plot Trend Comparison
plt.figure(figsize=(14, 5))
plt.plot(knn_decompose.trend, label='KNN-Imputed Trend', color='green')
plt.plot(original_decompose.trend, label='Original Interpolated Trend', color='blue', alpha=0.7)
plt.title('Trend Comparison: Original vs. KNN-Imputed Data')
plt.xlabel('Datetime')
plt.ylabel('Energy Production Trend')
plt.legend()
plt.grid(True)
plt.show()

# Plot Seasonal Comparison
plt.figure(figsize=(14, 5))
plt.plot(original_decompose.seasonal, label='Original Interpolated Seasonality', color='blue', alpha=0.7)
plt.plot(knn_decompose.seasonal, label='KNN-Imputed Seasonality', color='green')
plt.title('Seasonality Comparison: Original vs. KNN-Imputed Data')
plt.xlabel('Datetime')
plt.xlim(0, 2000)
plt.ylabel('Energy Production Seasonality')
plt.legend()
plt.grid(True)
plt.show()

# Optional: Plot Residuals Comparison
plt.figure(figsize=(14, 5))
plt.plot(knn_decompose.resid, label='KNN-Imputed Residuals', color='green')
plt.plot(original_decompose.resid, label='Original Interpolated Residuals', color='blue', alpha=0.7)
plt.title('Residuals Comparison: Original vs. KNN-Imputed Data')
plt.xlabel('Datetime')
plt.ylabel('Residuals')
plt.legend()
plt.grid(True)
plt.show()

Figure 4 displays the ACF plots of the original and imputed datasets. Both plots show very similar patterns, confirming that the imputation indeed preserved the temporal dependencies!
We can conclude from these plots that the imputed dataset captured the periodic nature of the original data well.


- Figure 5: The trend component extracted by the decomposition of the KNN-imputed data follows the original very closely, with little deviation. This confirms that long-term patterns in energy production were well retained by the imputation process!
- Figure 6: Similarly, the seasonal components of the original and imputed datasets are visually indistinguishable. This shows that the inherent daily periodicity of the energy production cycles was very well preserved, an aspect that matters for any downstream time-series analysis and forecasting tasks.
Residual Analysis

The residuals from the decomposition of the KNN-imputed data show a noise pattern similar to that of the original dataset (interpolated for this analysis), indicating that very few extra irregularities or artifacts were introduced by the imputation process.
The residual variance also remains similar across both datasets, which indicates that the imputation method preserved the randomness inherent in the original time series.
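You can verify the variance claim in one line, since the decomposition objects from above expose their residuals directly:

# Similar standard deviations support the "no added noise" conclusion
print(original_decompose.resid.std(), knn_decompose.resid.std())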
Potential Limitations of KNN
We have seen that KNN is a useful model for time-series imputation. Of course, every model comes with its limitations:
- Can Become Computationally Expensive: KNN requires calculating distances to all observations, which can become expensive for large datasets or high-dimensional feature spaces.
- Sensitive to Parameters: The choice of n_neighbors and the weighting scheme significantly affects the results (see the sketch after this list).
- Assumes Similarity: KNN assumes that missing values can be inferred from proximity to other observations in feature space. When there is too little similarity in the data, or the data is too noisy, this assumption may break down.
- Stationarity Assumption: KNN does not natively handle non-stationary data (time series with shifting trends or seasonalities) without additional preprocessing. The data used here is stationary.
- Scalability: KNN does not scale well to very large datasets or streaming time-series data.
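As a minimal sketch of the parameter sensitivity, assuming the scaled_data array built earlier, you can hide a subset of known values, re-impute them with different settings of n_neighbors, and compare the holdout error:

import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Rows where the target (last column) is actually observed
known_rows = np.where(~np.isnan(scaled_data[:, -1]))[0]
holdout = rng.choice(known_rows, size=200, replace=False)

# Hide the held-out values and remember the truth
masked = scaled_data.copy()
truth = masked[holdout, -1].copy()
masked[holdout, -1] = np.nan

for k in (3, 5, 10, 20):
    imputed = KNNImputer(n_neighbors=k, weights='distance').fit_transform(masked)
    rmse = np.sqrt(np.mean((imputed[holdout, -1] - truth) ** 2))
    print(f"n_neighbors={k}: holdout RMSE (scaled units) = {rmse:.3f}")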
When to Use KNN for Time-Series Data Imputation
KNN imputation is useful when certain criteria are met in time-series data:
Moderate Levels of Missing Data:
- KNN performs well when the proportion of missing values is moderate (e.g., 5–20%). If the missing data exceeds these levels, the model might struggle.
Cyclic or Periodic Data:
- It is best applied to datasets featuring distinct periodic patterns (say, daily, weekly, or yearly cycles); energy production, traffic flow, and seasonal sales data are good examples. The cyclical feature engineering we performed above further improves KNN’s performance.
Limited Non-Stationarity:
- KNN does well when the dataset is fairly stationary, or has been preprocessed to remove non-stationarity, for example by detrending or deseasonalizing (see the sketch after this list).
Small to Medium Dataset Sizes:
- It provides a good trade-off between computational feasibility and model accuracy for time-series datasets that are not excessively large.
Missing Not at Random (MNAR):
- If the missingness is related to certain temporal patterns or periodic events, KNN can exploit those relationships effectively, especially when the features contain time indices or cyclical components.
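For the non-stationarity point above, here is one simplified, hedged sketch (using the mock dataset, not a full STL-based pipeline): estimate the daily profile from the observed values, impute the deseasonalized residuals, then add the profile back.

import numpy as np
from sklearn.impute import KNNImputer

y = mock_energy_data_with_missing['Energy_Production']
hours = mock_energy_data_with_missing['Datetime'].dt.hour

# Mean observed value per hour of day (a crude seasonal profile)
profile = y.groupby(hours).transform('mean')

# Impute the deseasonalized residuals with a plain time-index feature
residuals = y - profile
X = np.column_stack([np.arange(len(y)), residuals])
imputed_resid = KNNImputer(n_neighbors=5).fit_transform(X)[:, -1]

# Recombine: seasonal profile + imputed residuals
y_filled = profile + imputed_resid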
The application of KNN to impute the missing values in this time-series dataset was very effective.
The imputation preserved the key statistics (central tendency, variability, distribution shape) along with the main temporal features: trend, seasonality, and autocorrelation patterns. Residual analysis confirmed that no significant irregularities or distortions were introduced by the imputation process.
We also saw that it is important to engineer new cyclical features (if not already present in the dataset) so that the KNN imputer can preserve the inherent periodic nature of the energy production data and the newly imputed values integrate well into it. These results demonstrate that KNN can be a robust and reliable method for handling missing data in time-series datasets.
If you found value in this post, I’d appreciate your support with a clap. You’re also welcome to follow me on Medium for similar articles!
Book a call with me, ask me a question or send me your resume here:
References
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn & TensorFlow. O’Reilly.
- scikit-learn documentation: KNNImputer