
Missing data in time-series analysis: sounds familiar?
Gaps in your datasets caused by malfunctioning sensors, failed transmissions, or maintenance downtime are an all-too-common reality.
Well, missing values derail your forecast and skew your analysis.
So, how do you fix them?
Traditional methods, such as forward fill or interpolation, may seem like the solution. But are they good enough?
What happens when your data has complex patterns, nonlinear trends, or high variability? Simple techniques fail and produce unstable results.
What if there were smarter ways to face this challenge?
Machine learning does just that: methods ranging from regression and K-Nearest Neighbors to neural networks make few assumptions about the data; instead, they adapt to it and fill the gaps with precision.
Curious? Let’s take a deeper look at how these advanced methods can change your time-series analysis.
We will impute missing data using a dataset that you can easily generate yourself, allowing you to follow along and apply the techniques in real time as you explore the process step by step!
In this article we will employ two popular models: Linear Regression and Decision Trees.
Hello there!
My name is Sara Nóbrega, and I am a Data Scientist specializing in AI Engineering. I hold a Master’s degree in Physics and I later transitioned into the exciting world of Data Science.
I write about data science, artificial intelligence, and data science career advice. Make sure to follow me and subscribe to receive updates when the next article is published!
Contents:
1. Data Note: Mock Energy Production Dataset
2. Why and When to Use Machine Learning for Time-Series Imputation?
3. Part 1: Regression-Based Imputations
3.1 Linear Regression for Time-Series Imputation
3.2 Decision-Tree Regressors for Time-Series Imputation
4. Conclusion: Comparison between Linear Regression and Decision-Tree Regressor
Data Note: Mock Energy Production Dataset
Here I simulated a mock energy production dataset with 10-minute intervals, starting from January 1, 2023, and ending on March 1, 2023. The dataset simulates realistic day-night cycles in energy production.
In order to make this dataset a bit more realistic, 10% of the data points were randomly selected and set as missing values (NaN).
This allows us to test various imputation methods for handling missing data in time-series datasets.
Take a look:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

# Generate the mock energy production data
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 3, 1)
datetime_index = pd.date_range(start=start_date, end=end_date, freq='10T')

# Create energy production values with day-night cycles
np.random.seed(42)
base_energy = []
for dt in datetime_index:
    hour = dt.hour
    if 6 <= hour <= 18:  # Daytime: higher production
        energy = np.random.normal(loc=300, scale=30)
    else:  # Nighttime: lower production
        energy = np.random.normal(loc=50, scale=15)
    base_energy.append(energy)

energy_production = pd.Series(base_energy)

# Introduce missing values
num_missing = int(0.1 * len(energy_production))
missing_indices = np.random.choice(len(energy_production), num_missing, replace=False)
energy_production.iloc[missing_indices] = np.nan

mock_energy_data_with_missing = pd.DataFrame({
    'Datetime': datetime_index,
    'Energy_Production': energy_production
})

# Reset index for easier handling
data_with_index = mock_energy_data_with_missing.reset_index()
data_with_index['Time_Index'] = np.arange(len(data_with_index))  # Add time-based index

plt.figure(figsize=(14, 7))
plt.plot(mock_energy_data_with_missing['Datetime'], mock_energy_data_with_missing['Energy_Production'],
         label='Energy Production (With Missing)', color='blue', alpha=0.7)
plt.scatter(mock_energy_data_with_missing['Datetime'], mock_energy_data_with_missing['Energy_Production'],
            c=mock_energy_data_with_missing['Energy_Production'].isna(), cmap='coolwarm',
            label='Missing Values', s=10)  # Small markers so the gaps stay visible
plt.title('Mock Energy Production Data with Missing Values (10-Minute Intervals)')
plt.xlabel('Datetime')
plt.ylabel('Energy Production')
plt.legend()
plt.grid(True)
plt.show()

Why and When to Use Machine Learning for Time-Series Imputation?
Machine learning provides a powerful approach to missing-value imputation by learning patterns and relationships within the data.
While conventional methods rest on simple assumptions such as linear trends, ML models can learn complex nonlinear and multivariable dependencies, which typically yields more accurate imputations.
ML methods are particularly useful when:
- There are nonlinear patterns that cannot be captured by traditional methods.
- High-dimensional datasets provide additional features for imputation.
- Large gaps exist, which require a broader understanding of the data’s overall tendencies and patterns to fill sensibly.
- Robustness is needed, especially when data is missing not at random.
Though ML requires more computational resources, its flexibility makes it ideal for challenging time-series imputation tasks, as the toy sketch below illustrates.
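To make this concrete, here is a small toy sketch (separate from the dataset we will build next, with made-up numbers) showing how a simple forward fill fails on a periodic signal with a large gap, while a model given an extra cyclical feature recovers it:

# Toy illustration: forward fill vs. a decision tree with a cyclical feature.
# The signal, gap position, and period here are invented for the demo.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

t = np.arange(500)
truth = np.sin(2 * np.pi * t / 50)          # nonlinear, periodic signal
series = pd.Series(truth.copy())
series.iloc[200:260] = np.nan               # one large contiguous gap

# Forward fill just repeats the last observed value across the whole gap
ffill_mae = np.abs(series.ffill().to_numpy()[200:260] - truth[200:260]).mean()

# The model gets a "phase" feature (position within the 50-step cycle),
# so it can learn the cyclic shape from the observed points
X = np.column_stack([t, t % 50])
mask = series.notna().to_numpy()
tree = DecisionTreeRegressor(max_depth=8, random_state=0).fit(X[mask], series[mask])
tree_mae = np.abs(tree.predict(X[200:260]) - truth[200:260]).mean()

print(f"Forward-fill MAE: {ffill_mae:.3f} | Decision-tree MAE: {tree_mae:.3f}")

On a signal like this, the model’s error is a fraction of the forward-fill error, precisely because the phase feature carries the cyclic pattern across the gap.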
Part 1: Regression-Based Imputations
Regression-based imputation methods use predictive models, such as linear regression or decision-tree regressors, to estimate missing values from known relationships between other features or from temporal patterns, such as lagged values in the series.
By learning these dependencies, the model can fill the gaps according to the underlying trends and relationships in the data. A minimal sketch of the lagged-value idea follows below.
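As a minimal sketch of the lagged-value idea (the example in this article uses a plain time index instead, and the helper below is hypothetical):

# Hypothetical helper: build lagged copies of a column as regression features.
import pandas as pd

def add_lag_features(df, column, lags=(1, 2, 144)):
    """Add lagged versions of `column`; 144 steps = one day at 10-minute intervals."""
    out = df.copy()
    for lag in lags:
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    return out

# Rows where the target and all lags are observed can train the regressor;
# rows where only the target is missing can then be predicted from their lags.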
We will impute the missing data points in time series data using two models, namely Linear Regression and Decision Tree. Both methods will be evaluated and compared.
Let’s begin:
Linear Regression For Time-Series Imputation
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Separate data into features (time) and target (energy production)
features = data_with_index[['Time_Index']]
target = data_with_index['Energy_Production']

# Identify missing and non-missing data
non_missing_data = data_with_index.dropna(subset=['Energy_Production'])
missing_data = data_with_index[data_with_index['Energy_Production'].isna()]

# Fit a regression model to predict the energy production
regressor = LinearRegression()
regressor.fit(non_missing_data[['Time_Index']], non_missing_data['Energy_Production'])

# Predict missing values
predicted_values = regressor.predict(missing_data[['Time_Index']])

# Fill in the missing values in the original dataset
filled_data = data_with_index.copy()
filled_data.loc[filled_data['Energy_Production'].isna(), 'Energy_Production'] = predicted_values

# Keep only the columns we need for display
filled_data = filled_data[['Datetime', 'Energy_Production']]

# Plot original vs imputed data for one month (January 2023)
start_month = datetime(2023, 1, 1)
end_month = datetime(2023, 1, 31)
original_month_data = mock_energy_data_with_missing[
    (mock_energy_data_with_missing['Datetime'] >= start_month) &
    (mock_energy_data_with_missing['Datetime'] <= end_month)
]
imputed_month_data = filled_data[
    (filled_data['Datetime'] >= start_month) &
    (filled_data['Datetime'] <= end_month)
]

plt.figure(figsize=(14, 7))
plt.plot(imputed_month_data['Datetime'], imputed_month_data['Energy_Production'],
         label='Imputed Data (Regression)', color='green', alpha=0.8)
plt.plot(original_month_data['Datetime'], original_month_data['Energy_Production'],
         label='Original Data (with Missing)', color='red', alpha=0.9)
plt.title('Original vs. Regression-Imputed Energy Production Data (January 2023)')
plt.xlabel('Datetime')
plt.ylabel('Energy Production')
plt.legend()
plt.grid(True)
plt.show()

Here we are plotting only one month of data for visualization purposes.
Even though the imputed values follow the general trend of the original data, this alone is not enough to evaluate how well the imputation worked.
We will employ the following methods to evaluate it:
- Statistical Comparison: Compares metrics (mean, standard deviation, min, max) to ensure imputed data aligns with the original data distribution.
- Autocorrelation: Checks whether the autocorrelation structure of the series is preserved after imputation.
- Trend Analysis: STL decomposition is used to check if long-term patterns, or trends, are preserved after imputation.
- Seasonality Analysis: Checks whether the imputation maintains the cyclical pattern, such as day-night cycles, in amplitude and periodicity.
Here is how you can do it:
Statistical Comparison
from statsmodels.tsa.seasonal import seasonal_decompose
from IPython.display import display

# Step 1: Statistical Comparison for Linear Regression
original_stats = mock_energy_data_with_missing['Energy_Production'].describe()
imputed_stats = filled_data['Energy_Production'].describe()

stats_comparison = pd.DataFrame({
    'Metric': original_stats.index,
    'Original Data': original_stats.values,
    'Imputed Data (Linear Regression)': imputed_stats.values
})
display(stats_comparison)
| Metric | Original Data | Imputed Data (Linear Regression) |
|--------|---------------|----------------------------------|
| count | 7648.000000 | 8497.000000 |
| mean | 185.073509 | 185.073842 |
| std | 126.816229 | 120.313162 |
| min | -7.549833 | -7.549833 |
| 25% | 51.793304 | 54.186258 |
| 50% | 256.996772 | 185.197681 |
| 75% | 302.217789 | 298.324435 |
| max | 415.581945 | 415.581945 |
From the statistical comparison table, we can deduce the following:
- Count: The imputed dataset contains more data points (8497 versus 7648 in the original) because the missing values have been filled to produce a complete series.
- Mean: The mean of the imputed data, 185.07, is close to that of the original data, which means that the central tendency of the data was preserved during imputation.
- Standard Deviation (std): The imputed data shows lower dispersion (120.31 versus 126.82 in the original). This suggests the imputation smoothed the values: linear regression, predictably, returns values close to the average.
- Minimum and Maximum Values: The minimum (-7.55) and maximum (415.58) values are identical, which suggests that the imputation process did not introduce extreme outliers or deviate from the range of the original data.
- Quartiles (25%, 50%, 75%):
- The 25th percentile (lower quartile) is slightly higher in the imputed data, 54.19 versus 51.79, suggesting that the imputed values filled more gaps in the lower range.
- The median (50%) changed substantially, from 256.99 in the original to 185.20 after imputation. Because the regression fills every gap with a value near the overall mean, which falls between the high daytime and low nighttime levels, the median is pulled downward.
- The 75th percentile (upper quartile) remained similar, reflecting that the higher values were well-preserved.
Linear regression imputation conserved the overall distribution and range of the data, while it reduced variability and slightly smoothed the dataset.
However, the noticeable shift in the median shows that this method may not capture the skewness or extremes of the original data well. A quick look at the fitted model itself, below, makes the cause clear.
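One quick sanity check, reusing the `regressor` fitted above, is to inspect the fitted line itself. With a single time-index feature the slope is close to zero, so nearly every imputed point lands near the global mean:

# Inspect the fitted line: a near-zero slope means the predictions
# are essentially the global mean, which explains the median shift
print(f"Intercept: {regressor.intercept_:.2f}")
print(f"Slope per 10-minute step: {regressor.coef_[0]:.6f}")
print(f"Mean of observed data: {non_missing_data['Energy_Production'].mean():.2f}")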
Autocorrelation
import statsmodels.api as sm

# Autocorrelation Function (ACF) Plot
def plot_acf_comparison(original_series, imputed_series, lags=50):
    plt.figure(figsize=(14, 5))
    # Original Data ACF
    plt.subplot(1, 2, 1)
    sm.graphics.tsa.plot_acf(original_series.dropna(), lags=lags, ax=plt.gca(), title="ACF of Original Data")
    plt.grid(True)
    # Imputed Data ACF
    plt.subplot(1, 2, 2)
    sm.graphics.tsa.plot_acf(imputed_series, lags=lags, ax=plt.gca(), title="ACF of Imputed Data")
    plt.grid(True)
    plt.tight_layout()
    plt.show()
# Call the function to compare autocorrelations
plot_acf_comparison(mock_energy_data_with_missing['Energy_Production'], filled_data['Energy_Production'])

The following can be inferred from the autocorrelation comparison plots:
Preservation of Temporal Dependencies:
- The overall shape and decay of the imputed data’s ACF closely resemble those of the original, indicating that temporal dependencies are reasonably well preserved after imputation.
Slight Smoothing Effect:
- The ACF of the imputed data shows slightly lower values at some lags than the original. This likely results from the linear regression model smoothing out extreme values during imputation, which slightly reduces variability.
Cyclic Patterns in ACF:
- Any periodic spikes that may have existed in the ACF, such as daily seasonality, seem to remain consistent between the original and imputed data, showing that imputation has preserved periodic behaviors within the dataset.
Overall Robustness:
- The similarity between the two ACF plots is a good indication that linear regression imputation retains the core temporal structure of the data. A numerical version of this check follows below.
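If you prefer a number over a visual impression, a small sketch using `statsmodels.tsa.stattools.acf` compares the two ACFs directly (note that dropping NaNs concatenates the series across gaps, so this is an approximation):

from statsmodels.tsa.stattools import acf

# Compare the two autocorrelation functions numerically over 50 lags
acf_original = acf(mock_energy_data_with_missing['Energy_Production'].dropna(), nlags=50)
acf_imputed = acf(filled_data['Energy_Production'], nlags=50)
print(f"Largest absolute ACF difference over 50 lags: {np.max(np.abs(acf_original - acf_imputed)):.3f}")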
STL Decomposition (Trend and Seasonality)
# Step 2: STL Decomposition (Trend and Seasonality)
original_series = mock_energy_data_with_missing['Energy_Production']
imputed_series = filled_data['Energy_Production']
# Decompose the original and imputed series
original_decompose = seasonal_decompose(original_series.interpolate(), model='additive', period=144) # Daily seasonality (144 10-min intervals in a day)
imputed_decompose = seasonal_decompose(imputed_series.interpolate(), model='additive', period=144)
# Plot decomposition results for trends and seasonality (Original vs. Imputed)
plt.figure(figsize=(14, 5))
plt.plot(original_decompose.trend, label='Original Trend', color='blue')
plt.plot(imputed_decompose.trend, label='Imputed Trend (Linear Regression)', color='green', linestyle='--')
plt.title('Trend Comparison: Original vs. Linear Regression Imputation')
plt.legend()
plt.grid(True)
plt.show()
plt.figure(figsize=(14, 5))
plt.plot(original_decompose.seasonal, label='Original Seasonality', color='blue')
plt.plot(imputed_decompose.seasonal, label='Imputed Seasonality (Linear Regression)', color='green', linestyle='--')
plt.xlim(0, 4000)
plt.title('Seasonality Comparison: Original vs. Linear Regression Imputation')
plt.legend()
plt.grid(True)
plt.show()


Trend Comparison
- The imputed trend (green, dashed) generally agrees with the original trend (blue), so the linear regression imputation has preserved the general long-term movement of the data.
- However, the imputed trend appears smoother than the original, particularly where the original trend exhibits more variability. This is expected with linear regression, as it tends to underfit fluctuations and smooth extremes.
Seasonality Comparison
- The imputed seasonality shows periodic patterns that are quite close to the original seasonality, reflecting that the day-night production cycles are well preserved.
- The amplitude of the imputed seasonality is slightly dampened relative to the original, especially at the peaks and troughs, indicating that imputation softens the most extreme periodic variations. This can be quantified with the quick check below.
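A quick way to quantify this dampening, reusing the decompositions computed above, is to compare the peak-to-trough amplitude of the seasonal components:

# Peak-to-trough amplitude of the seasonal components
orig_amp = original_decompose.seasonal.max() - original_decompose.seasonal.min()
lr_amp = imputed_decompose.seasonal.max() - imputed_decompose.seasonal.min()
print(f"Seasonal amplitude - original: {orig_amp:.2f}, imputed (Linear Regression): {lr_amp:.2f}")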
Limitations of linear regression imputation:
Smoothing of Extremes:
- From the statistical comparison, the lower standard deviation in the imputed data (120.31 vs. 126.82) shows that variability was reduced.
- From the trend comparison, the imputed trend is smoother, missing some of the original variability in long-term patterns.
Linear Assumptions:
- The smoothing observed in the trend comparison suggests that the method struggles to catch nonlinear changes within the data, particularly around periods where sharp shifts happen.
Reduced Seasonal Amplitude:
- The seasonality comparison shows dampened peaks and troughs (lower amplitudes) in the imputed data compared to the original. This agrees with regression’s tendency to pull extreme values toward the mean.
Summary:
Linear regression imputation preserves overall trends and cyclical patterns, but it reduces variability and smooths out extreme values. It also slightly underrepresents the intensity of seasonal cycles, as reflected in both the statistical and decomposition analyses.
How to handle outliers in time-series data? Save this article for later:
The Ultimate Guide to Finding Outliers in Your Time-Series Data (Part 1)
Decision-Tree Regressors For Time-Series Imputation
from sklearn.tree import DecisionTreeRegressor
# Step 1: Imputation using Decision Tree Regressor
# Fit a Decision Tree model to predict the energy production
tree_regressor = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_regressor.fit(non_missing_data[['Time_Index']], non_missing_data['Energy_Production'])
# Predict missing values
tree_predicted_values = tree_regressor.predict(missing_data[['Time_Index']])
# Fill in the missing values in the original dataset
tree_filled_data = data_with_index.copy()
tree_filled_data.loc[tree_filled_data['Energy_Production'].isna(), 'Energy_Production'] = tree_predicted_values
# Display the imputed data
tree_filled_data = tree_filled_data[['Datetime', 'Energy_Production']]
# Plot original vs imputed data for one month (January 2023)
tree_imputed_month_data = tree_filled_data[
    (tree_filled_data['Datetime'] >= start_month) &
    (tree_filled_data['Datetime'] <= end_month)
]

plt.figure(figsize=(14, 7))
plt.plot(tree_imputed_month_data['Datetime'], tree_imputed_month_data['Energy_Production'],
         label='Imputed Data (Decision Tree)', color='orange', alpha=0.8)
plt.plot(original_month_data['Datetime'], original_month_data['Energy_Production'],
         label='Original Data (with Missing)', color='red', alpha=0.9)
plt.title('Original vs. Decision Tree-Imputed Energy Production Data (January 2023)')
plt.xlabel('Datetime')
plt.ylabel('Energy Production')
plt.legend()
plt.grid(True)
plt.show()
# Step 2: Statistical Comparison for Decision Tree
tree_imputed_stats = tree_filled_data['Energy_Production'].describe()
# Update statistical comparison DataFrame
stats_comparison['Imputed Data (Decision Tree)'] = tree_imputed_stats.values
# Display updated stats comparison
display(stats_comparison)
# Step 3: Autocorrelation Comparison for Decision Tree
plot_acf_comparison(mock_energy_data_with_missing['Energy_Production'], tree_filled_data['Energy_Production'])
# Step 4: STL Decomposition for Decision Tree
tree_imputed_series = tree_filled_data['Energy_Production']
tree_imputed_decompose = seasonal_decompose(tree_imputed_series.interpolate(), model='additive', period=144)
# Plot decomposition results for trends and seasonality (Original vs. Decision Tree Imputed)
plt.figure(figsize=(14, 5))
plt.plot(original_decompose.trend, label='Original Trend', color='blue')
plt.plot(tree_imputed_decompose.trend, label='Imputed Trend (Decision Tree)', color='orange', linestyle='--')
plt.title('Trend Comparison: Original vs. Decision Tree Imputation')
plt.legend()
plt.grid(True)
plt.show()
plt.figure(figsize=(14, 5))
plt.plot(original_decompose.seasonal, label='Original Seasonality', color='blue', alpha=0.7)
plt.plot(tree_imputed_decompose.seasonal, label='Imputed Seasonality (Decision Tree)', color='orange', linestyle='--', alpha=0.7)
plt.xlim(0, 4000)
plt.title('Seasonality Comparison: Original vs. Decision Tree Imputation')
plt.legend()
plt.grid(True)
plt.show()

- The values imputed by the decision tree track the original data closely, especially in capturing extreme values.
- The imputed data follows the overall trend while preserving variability much better than linear regression did.
Statistical Comparison
| Metric | Original Data | Imputed Data (Linear Regression) | Imputed Data (Decision Tree) |
|--------|---------------|----------------------------------|------------------------------|
| count | 7648.000000 | 8497.000000 | 8497.000000 |
| mean | 185.073509 | 185.073842 | 184.979184 |
| std | 126.816229 | 120.313162 | 120.633636 |
| min | -7.549833 | -7.549833 | -7.549833 |
| 25% | 51.793304 | 54.186258 | 53.797479 |
| 50% | 256.996772 | 185.197681 | 185.545605 |
| 75% | 302.217789 | 298.324435 | 298.531049 |
| max | 415.581945 | 415.581945 | 415.581945 |
- The mean (184.98) and range (min and max) of the decision-tree-imputed data closely match the original, meaning the imputation preserves the central tendency without introducing outliers.
- The standard deviation (120.63) is marginally closer to the original (126.82) than linear regression’s (120.31), indicating slightly better preservation of variability.
- The quartiles (25%, 50%, 75%) of the two imputation methods are very similar; the median of the decision-tree-imputed series (185.55) still sits well below the original (256.99), so neither method fully restores the original skew in this single-feature setup. The sketch below compares only the values each method filled in.
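A useful complementary check, sketched below with the objects already defined, is to describe only the values each model filled in, rather than the whole series, since whole-series statistics are dominated by the 90% of points both methods leave untouched:

# Describe only the filled-in values for each method
missing_mask = mock_energy_data_with_missing['Energy_Production'].isna().to_numpy()
lr_filled_only = filled_data.loc[missing_mask, 'Energy_Production']
dt_filled_only = tree_filled_data.loc[missing_mask, 'Energy_Production']
display(pd.DataFrame({'Linear Regression': lr_filled_only.describe(),
                      'Decision Tree': dt_filled_only.describe()}))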
Autocorrelation

- The autocorrelation of the decision-tree-imputed data is closer to the original’s, indicating that temporal dependencies are well preserved.
- Unlike linear regression, the decision tree captures periodicity without significant smoothing at certain lags, which may indicate better handling of cyclic patterns.
STL Decomposition (Trend and Seasonality)


Trend Comparison
The trend from the decision-tree imputation follows the original closely and captures fluctuations that linear regression tends to smooth over. Long-term patterns are better represented in terms of variability with decision-tree imputation.
Seasonality Comparison
The seasonal component from decision-tree imputation has similar amplitudes to the original data, with peaks and troughs more closely matching the original.
The decision tree maintains the intensity of the periodic variations, with little loss in amplitude or periodicity, as the amplitude check below confirms.
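The same amplitude check used for linear regression confirms this (reusing `orig_amp` from above):

# Peak-to-trough amplitude of the decision-tree seasonal component
dt_amp = tree_imputed_decompose.seasonal.max() - tree_imputed_decompose.seasonal.min()
print(f"Seasonal amplitude - original: {orig_amp:.2f}, imputed (Decision Tree): {dt_amp:.2f}")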
Conclusion
Comparison: Linear Regression vs. Decision-Tree Regressor
Preservation of Variability
- Linear Regression: Reduced variability (std lower than original).
- Decision Tree Regressor: Better preservation of variability (std closer to the original).
Handling of Extremes
- Linear Regression: Peaks and troughs are smoothed out.
- Decision Tree Regressor: Captures extreme values more precisely.
Temporal Dependencies
- Linear Regression: Slight smoothing in the autocorrelation at specific lags.
- Decision Tree Regressor: More accurate preservation of the autocorrelation.
Trend Representation
- Linear Regression: Preserves long-term trends, although over-smoothed.
- Decision Tree Regressor: Better captures fluctuations in trends.
Seasonality Representation
- Linear Regression: Amplitude dampened, peaks and troughs reduced.
- Decision Tree Regressor: Amplitude and periodicity closely match the original.
Computational Complexity
- Linear Regression: Simple and efficient.
- Decision Tree Regressor: Computationally heavier due to tree fitting.
The decision-tree regressor performed better overall for this dataset:
- It preserved variability, captured extreme values, and retained periodic patterns more effectively than linear regression.
- However, linear regression may still be preferred for simplicity in datasets where smoothness and computational efficiency are prioritized.
For datasets like this one, with significant variability and periodicity, the decision-tree regressor is the better choice for imputing missing values.
Decision Trees are better suited for complex, non-linear data with significant fluctuations, while Linear Regression is more efficient and works well with simpler, linear relationships.
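Because the dataset here is simulated, you can take the comparison one step further: the `base_energy` list from the generation step still holds the complete series, so each method can be scored directly on the points that were masked. A minimal sketch:

from sklearn.metrics import mean_squared_error

# Ground truth: the complete simulated series, kept from before the masking
ground_truth = pd.Series(base_energy)
mask = mock_energy_data_with_missing['Energy_Production'].isna().to_numpy()

rmse_lr = mean_squared_error(ground_truth[mask], filled_data['Energy_Production'][mask]) ** 0.5
rmse_dt = mean_squared_error(ground_truth[mask], tree_filled_data['Energy_Production'][mask]) ** 0.5
print(f"RMSE on masked points - Linear Regression: {rmse_lr:.2f}, Decision Tree: {rmse_dt:.2f}")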
Handling of Missing Data at Large Gaps:
While linear regression is good for relatively small gaps, for larger gaps, methods like KNN or interpolation (or a hybrid approach) may yield better results. Stay tuned for the next article, where we will apply KNN to impute missing data in time-series! 😉
If you found value in this post, I’d appreciate your support with a clap. You’re also welcome to follow me on Medium for similar articles!
Book a call with me, ask me a question or send me your resume here: