
The Ultimate Guide to Finding Outliers in Your Time-Series Data (Part 3)

Outliers Found: Now What? A Guide to Treatment Options

Source: DALL·E


Outliers: those data points that can throw off statistical models, mislead forecasts, and disrupt decision-making processes.

Outlier Treatment is a necessary step in data analysis that many find challenging.

With the insights from this article, I hope to ease the process and make it more enjoyable!

This article is the third of a four-part series dedicated to the identification and management of outliers in time-series data.

The first article of this series was about exploring both visual and statistical methods to identify outliers effectively in time-series data:

The Ultimate Guide to Finding Outliers in Your Time-Series Data (Part 1)

In the second article, we explored several machine learning techniques to identify outliers:

The Ultimate Guide to Finding Outliers in Your Time-Series Data (Part 2)

In this third article, we will explore various strategies for managing these outliers, with special considerations for time-series data. We will cover removing, retaining, and capping techniques, offering practical ways to handle outliers.

In the next and final article, I will continue to explore ways of managing outliers, focusing on imputation and transformation methods, as well as evaluating the impact of outlier treatment.

Data Note

All datasets used in this article are synthetic and have been generated by me, specifically for this analysis. No external datasets have been used.


Hello there!

My name is Sara, and I am a Data Scientist specializing in AI Engineering. I hold a Master’s degree in Physics and later transitioned into the exciting world of Data Science.

I write about data science, artificial intelligence, and data science career advice. Make sure to follow me and subscribe to receive updates when the next article is published!


Contents:

Part I (first article):

  1. Why should you care?
  2. Outliers vs Anomalies
  3. How to choose the right Anomaly Detection Method?
  4. Univariate vs Multivariate Time-series Data
  5. Identification of Outliers
  • Visual Inspection
  • Statistical Methods
  6. Evaluation Metrics

Part II (second article):

  1. Understanding Univariate and Multivariate Data
  2. Outlier Detection Using Machine Learning Methods
  • Isolation Forest for Outlier Detection
  • Prophet for Outlier Detection
  • Local Outlier Factor (LOF)
  • Clustering-based Anomaly Detection
  • Autoencoders

Part III (This article):

  1. Introduction: The Importance of Proper Outlier Treatment
  2. Assessing the Nature of Outliers
  3. Special Considerations for Time-Series Data
  4. Basic Strategies for Handling Outliers
  • Retaining
  • Removing
  5. Capping (Winsorization) for Outlier Treatment
  • Setting upper and lower bounds
  • Percentile-based capping
  6. Alternative Capping Methods
  7. Considerations and Best Practices

Part IV (Next article):

  1. Transformation Techniques to Handle Outliers
  • Log transformation
  • Box-Cox transformation
  • Other relevant transformations
  2. Imputation Methods
  3. Evaluating the Impact of Outlier Treatment
  • Before and after comparisons
  • Sensitivity analysis

The Importance of Proper Outlier Treatment

Handling outliers right is crucial for keeping our research accurate and reliable, no matter the field.

Why? For several reasons. For instance, outliers can significantly skew your statistical analyses by distorting measures of central tendency and dispersion.

Figure 1: Example of a histogram showing the skewed distribution of data values. | Image by author.

Research has found that even a single extreme value can dramatically affect means, standard deviations, and correlations.
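As a quick illustration (a minimal sketch with made-up numbers, not tied to any dataset from this series), here is how a single extreme value can shift the mean and standard deviation of an otherwise well-behaved sample:

import numpy as np

# A small, well-behaved sample, and the same sample with one extreme value appended
clean = np.array([10, 11, 9, 10, 12, 11, 10, 9])
with_outlier = np.append(clean, 100)  # a single extreme observation

print(f"Clean data   -> mean: {clean.mean():.2f}, std: {clean.std(ddof=1):.2f}")
print(f"With outlier -> mean: {with_outlier.mean():.2f}, std: {with_outlier.std(ddof=1):.2f}")
# The mean roughly doubles and the standard deviation explodes,
# even though only one of the nine points changed.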

In machine learning, outliers can lead to biased models and poor generalization, particularly in methods sensitive to extreme values.

When data scientists, researchers or data analysts don’t properly deal with outliers, it can result in misleading conclusions with real-world consequences.

Context is key in outlier treatment.

Domain knowledge is essential for distinguishing between erroneous data points and genuinely unusual observations that might offer important insights.

What constitutes an outlier in one field may be a critical data point in another!

Handling outliers well means picking the right approach: removing them, transforming them, or using robust statistical techniques.

It’s also super important that you document how you identified and dealt with outliers to keep things clear and reproducible.

Btw, if you work with time-series data, you need to check the article below:

5 Must-Know Techniques for Mastering Time-Series Analysis

Assessing the Nature of Outliers

Understanding the cause and significance of outliers

The cause and significance of outliers can vary widely.

Outliers can be categorized into global outliers (deviating from the entire dataset) and local outliers (deviating from nearby points).

Figure 2: Global and Local Outliers in Time-Series Data | Image by Author.
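To make this distinction concrete, here is a minimal sketch (with a hypothetical trending series) in which one point is extreme for the whole series (a global outlier) and another looks ordinary globally but is unusual for its neighbourhood (a local outlier):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t = np.arange(200)
values = pd.Series(0.1 * t + rng.normal(0, 0.5, 200))  # upward trend from ~0 to ~20
values.iloc[50] = 40   # global outlier: far outside the range of the whole series
values.iloc[150] = 5   # local outlier: an ordinary value globally, but the local level is ~15

# Global view: z-score against the whole series
global_z = (values - values.mean()) / values.std()

# Local view: z-score against a centred rolling window
roll_mean = values.rolling(window=20, center=True, min_periods=5).mean()
roll_std = values.rolling(window=20, center=True, min_periods=5).std()
local_z = (values - roll_mean) / roll_std

print(f"index 50:  global z = {global_z.iloc[50]:.1f}, local z = {local_z.iloc[50]:.1f}")
print(f"index 150: global z = {global_z.iloc[150]:.1f}, local z = {local_z.iloc[150]:.1f}")
# Index 50 stands out in both views; index 150 only stands out in the local (rolling) view.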

There are many reasons why your data contains outliers: it could be due to natural variation in the data, measurement errors, or data processing mistakes.

The significance of outliers depends on the domain. Take this example: in finance, an outlier might indicate a fraudulent transaction, while in healthcare, it could represent a rare but critical medical condition.

Determining if outliers are legitimate or erroneous

Determining whether outliers are legitimate or erroneous is a critical step in data analysis.

To identify outliers, both statistical and machine learning approaches can be used, together with domain knowledge and even data quality evaluation.

This means examining the data collection and processing methods to identify potential sources of error. It may involve checking for instrument malfunctions, transcription errors, or data corruption.

A data point that appears statistically anomalous might be a genuine observation of a rare event!

Special Considerations for Time-Series Data

Preserving temporal structure

Time-series data presents unique challenges in outlier detection due to its inherent temporal structure and potential seasonality.

Because of this, it’s important to keep the time patterns intact while spotting and handling anomalies.

Preserving the temporal structure is crucial when dealing with outliers in time-series data.

Outliers can also distort autocorrelation patterns and trends, which might cause you to misinterpret the data’s real structure.

Handling seasonal outliers

Dealing with seasonal outliers can be a bit trickier.

The big challenge is distinguishing between legitimate seasonal fluctuations and true anomalies.

Figure 3: Decomposition of time-series data into trend, seasonal, and residual components | Image by Author.

Seasonal decomposition techniques can help separate the seasonal component from the trend and residual, facilitating outlier detection (Figure 3).

The time-series dataset can be broken down into three main components:

  • Trend: The long-term progression in the data.
  • Seasonal: The repeating short-term cycle within the data.
  • Residual: The remaining part of the data after removing the trend and seasonal components, which includes noise and potential outliers.

The residual component highlights irregular patterns in the data. By removing the predictable trend and seasonal variations, what remains are deviations that are not explained by these components, making outliers more evident.
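As a minimal sketch of this decomposition-then-residual approach (using a hypothetical daily series with weekly seasonality and one injected anomaly; the 3-sigma rule on the residual is just one possible flagging criterion):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical daily series: trend + weekly seasonality + noise, with one injected anomaly
idx = pd.date_range("2024-01-01", periods=180, freq="D")
rng = np.random.default_rng(0)
series = pd.Series(
    10 + 0.05 * np.arange(180)
    + 3 * np.sin(2 * np.pi * np.arange(180) / 7)
    + rng.normal(0, 0.5, 180),
    index=idx,
)
series.iloc[90] += 12  # the anomaly we want to recover

# Decompose into trend, seasonal and residual components
result = seasonal_decompose(series, model="additive", period=7)
residual = result.resid.dropna()

# Flag residuals more than 3 standard deviations from their mean
deviations = (residual - residual.mean()).abs()
outliers = residual[deviations > 3 * residual.std()]
print(outliers)  # the injected anomaly should appear among the flagged points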

Finding outliers in time-series data means walking a fine line between keeping the data’s time and seasonal patterns intact and spotting real anomalies.

More on spotting outliers in time-series data in the first and second articles of this series.

Basic Strategies for Handling Outliers

When dealing with outliers, researchers typically face two basic strategies: retaining or removing them.

Retaining Outliers

Retaining outliers is often preferred when these data points represent genuine, although unusual, observations, as researchers say that outliers can provide valuable insights into the phenomenon being studied.

When outliers are kept rather than removed, you can apply robust statistical techniques to minimize their impact on the analysis.
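For example (a minimal sketch, not tied to any dataset from this series), summarising a sample with the median and the median absolute deviation (MAD) keeps the outlier in the data while limiting its influence on the summary statistics:

import numpy as np
from scipy import stats

values = np.array([23, 25, 24, 26, 22, 25, 24, 120])  # one retained outlier

print(f"mean: {values.mean():.1f}  vs  median: {np.median(values):.1f}")
print(f"std:  {values.std(ddof=1):.1f}  vs  MAD:    {stats.median_abs_deviation(values):.1f}")
# The mean and standard deviation are pulled strongly towards the outlier,
# while the median and MAD barely notice it.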

However, you need to be cautious since these retained outliers can still affect statistical measures and model estimates. Therefore, it is important to report the presence and treatment of retained outliers in your research findings.

If you transform the outliers, e.g., through a log transformation, are you still retaining them?

From my research, I concluded that there is no straightforward answer to this.

If you’re keeping the data points in your dataset but applying transformations to reduce their impact, this is generally considered a form of retention, especially if the transformations are reversible or preserve the relative position of the data points.

However, the specific method used should be clearly reported, as some transformations (like winsorization, more on it below) fall into a grey area between retention and removal.

Removing Outliers

Removing outliers may be appropriate in certain situations.

The current literature in the field suggests that removal is justified in these scenarios:

  • Measurement Errors: Outliers caused by malfunctioning measurement instruments or inaccurate readings.
  • Sampling Errors: Outliers resulting from non-representative sampling processes, leading to extreme values that don’t reflect the true population distribution.
  • Experimental Errors: Outliers in scientific experiments due to procedural mistakes, contamination, or unexpected conditions that make the data point irrelevant.
  • Human Errors: Outliers due to mistakes in data entry, where values are recorded far outside the expected range.
  • Data Processing Errors: Outliers introduced during data preprocessing or transformation steps that are clearly incorrect due to computation errors or algorithmic issues.

Common removal techniques include trimming (removing extreme values) and capping/winsorization (replacing extreme values with less extreme ones). These methods are detailed in the next section.
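As a minimal sketch of trimming (on a hypothetical DataFrame column, using the common 1.5 × IQR rule; the exact rule should follow your data and domain):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df_example = pd.DataFrame({"value": np.append(rng.normal(50, 5, 100), [120, -30])})

# IQR-based bounds: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are trimmed
q1, q3 = df_example["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df_trimmed = df_example[df_example["value"].between(lower, upper)]
removed_index = df_example.index.difference(df_trimmed.index)  # keep a record of what was dropped
print(f"Removed {len(removed_index)} of {len(df_example)} points")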

Just deleting outliers without careful consideration can lead to losing valuable insights and possibly skewing results.

When outliers are removed, it’s crucial to document which data points were excluded and why they were taken out.

Capping (Winsorization) for Outlier Treatment

Capping, also known as winsorization, is a technique for handling outliers by limiting extreme values in a dataset.

This method aims to reduce the impact of outliers while retaining their presence in the data: extreme values are pulled back to a specified limit rather than removed.

For this method to work, we need to define upper and lower bounds.

Setting upper and lower bounds

These bounds define the range within which data values are considered acceptable.

There are several approaches to setting these bounds, including methods based on standard deviations or interquartile ranges. The choice of bounds will influence the data distribution.

Experts in the field caution that setting the bounds arbitrarily may distort the data’s structure and relationships. Besides, fixed bounds may not be appropriate for all datasets, particularly those with naturally skewed distributions.

Percentile-based capping

Percentile-based capping offers a more data-driven approach to winsorization.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame with the time-series data
# (`time` and `data` are the synthetic series generated for this article)
df = pd.DataFrame({'time': time, 'value': data})

# Apply percentile-based capping (Winsorization)
lower_percentile = 5
upper_percentile = 95
lower_bound = np.percentile(df['value'], lower_percentile)
upper_bound = np.percentile(df['value'], upper_percentile)

df['value_capped'] = np.clip(df['value'], lower_bound, upper_bound)

# Plot the original and capped time-series data
plt.figure(figsize=(14, 7))

plt.plot(df['time'], df['value'], label='Original Data with Outliers', color='blue', linestyle='--')
plt.plot(df['time'], df['value_capped'], label='Capped Data', color='green')

plt.title('Percentile-based Capping (Winsorization) on Time-Series Data')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()
Figure 4: Capping (Winsorization) on Time-Series Data (5th/95th percentiles) | Image by Author.

This method uses specific percentiles of the data distribution as capping points. Common choices include the 5th/95th or 1st/99th percentiles, depending on the desired level of conservatism.

Researchers say that percentile-based approaches are often more robust than fixed-value capping (explained below), as they adapt to the data’s natural distribution.
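If you prefer a ready-made helper over computing the percentiles manually, scipy offers one: scipy.stats.mstats.winsorize takes fractional limits, so 0.05 on each side corresponds to the 5th/95th percentile capping above (note that it replaces the extreme observations with the nearest retained value, so the exact cut points can differ slightly from the interpolated percentiles):

import numpy as np
from scipy.stats.mstats import winsorize

# Clip the bottom and top 5% of values, keeping the rest unchanged
df['value_winsorized'] = np.asarray(winsorize(df['value'].to_numpy(), limits=[0.05, 0.05]))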

Alternative Capping Methods

As datasets grow more complex, alternative capping methods have emerged to address several outlier scenarios.

These methods offer different approaches to setting bounds, each with its own strengths and limitations.

Fixed-value Capping

Fixed-value capping involves setting specific numerical thresholds for upper and lower bounds.

It’s super straightforward and easy to understand, especially when you know the limits thanks to domain knowledge.

But keep in mind, this method isn’t very flexible, which can be a drawback in time-series datasets with evolving trends.


# Create a DataFrame with the time-series data
# (`time` and `data_with_outliers` are the synthetic series generated for this article)
df = pd.DataFrame({'time': time, 'value': data_with_outliers})

# Define fixed-value capping thresholds
lower_bound = -15
upper_bound = 25

# Apply fixed-value capping
df['value_capped'] = df['value'].clip(lower=lower_bound, upper=upper_bound)

# Plot the original and capped time-series data
plt.figure(figsize=(14, 7))

plt.plot(df['time'], df['value'], label='Original Data with Outliers', color='blue', linestyle='--')
plt.plot(df['time'], df['value_capped'], label='Capped Data', color='red')

plt.title('Fixed-Value Capping on Time-Series Data')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()
Figure 5: Fixed-Value Capping on Time-Series Data | Image by Author.

You can select the lower bound and upper bound in fixed-value capping based on domain knowledge, statistical measures such as quartiles and percentiles, and the typical range of the data.

Dynamic Capping

This method is particularly useful for time-series data where patterns change over time as it allows for the adjustment of bounds to accommodate evolving trends!

Researchers emphasize that dynamic capping requires regular updates and careful monitoring to maintain its effectiveness.

# Create a DataFrame with the time-series data
df = pd.DataFrame({'time': time, 'value': data})

# Define dynamic capping thresholds based on rolling statistics
window_size = 15  # Rolling window size for calculating dynamic bounds
df['rolling_mean'] = df['value'].rolling(window=window_size, min_periods=1).mean()
df['rolling_std'] = df['value'].rolling(window=window_size, min_periods=1).std()

# Define dynamic upper and lower bounds
df['lower_bound'] = df['rolling_mean'] - 2 * df['rolling_std']
df['upper_bound'] = df['rolling_mean'] + 2 * df['rolling_std']

# Apply dynamic capping
df['value_capped'] = np.clip(df['value'], df['lower_bound'], df['upper_bound'])

# Plot the original and capped time-series data with dynamic bounds
plt.figure(figsize=(14, 7))

plt.plot(df['time'], df['value'], label='Original Data with Outliers', color='blue', linestyle='--')
plt.plot(df['time'], df['value_capped'], label='Capped Data with Dynamic Bounds', color='green')
plt.fill_between(df['time'], df['lower_bound'], df['upper_bound'], color='gray', alpha=0.2, label='Dynamic Bounds (±2σ)')

plt.title('Dynamic Capping on Time-Series Data with Multiple Outliers')
plt.xlabel('Time')
plt.legend()
plt.show()
Figure 6: Dynamic Capping on Time-Series Data | Image by Author.

Dynamic capping adjusts upper and lower bounds based on rolling statistics, such as moving averages and standard deviations, allowing it to adapt to changing data patterns over time while mitigating the impact of outliers.

How to define the dynamic capping thresholds?

  • window_size: Should be large enough to capture the underlying patterns or cycles in your data but small enough to detect changes quickly. Common choices range from a few periods to a year or more, depending on the frequency of your data.
  • min_periods: Determines the minimum number of data points required to compute a valid rolling statistic and it ensures initial periods are represented in the analysis. Typically, it’s set to a value that provides meaningful insights while avoiding excessive sensitivity to early data.

These parameters are often selected through experimentation and consideration of the data’s inherent patterns and noise levels.

Keeping the bounds updated regularly makes sure they stay relevant and effective for maintaining data integrity.

Adaptive Capping

Adapting to the changing characteristics of time-series data is crucial.

This method ensures that the capping remains effective as new data points are added and patterns evolve.

Adaptive capping employs machine learning algorithms to determine optimal capping points as these methods can adapt to complex, multidimensional data structures.

Specific approaches include the Isolation Forest algorithm and Local Outlier Factor, which can identify outliers in high-dimensional spaces.


from sklearn.ensemble import IsolationForest

# Create a DataFrame with the time-series data
# (`time` and `data_with_anomalies` are the synthetic series generated for this article)
df = pd.DataFrame({'time': time, 'value': data_with_anomalies})

# Apply Isolation Forest for outlier detection
clf = IsolationForest(contamination=0.1, random_state=0)  # Contamination parameter can be adjusted
df['is_outlier'] = clf.fit_predict(df[['value']])

# Define dynamic capping bounds based on Isolation Forest predictions
window_size = 10
df['rolling_mean'] = df['value'].rolling(window=window_size, min_periods=1).mean()
df['rolling_std'] = df['value'].rolling(window=window_size, min_periods=1).std()

# Define dynamic upper and lower bounds
df['lower_bound'] = df['rolling_mean'] - 2 * df['rolling_std']
df['upper_bound'] = df['rolling_mean'] + 2 * df['rolling_std']

# Apply dynamic capping using np.clip()
df['value_capped'] = np.clip(df['value'], df['lower_bound'], df['upper_bound'])

# Plot the original and capped time-series data with outliers identified by Isolation Forest
plt.figure(figsize=(14, 7))

plt.plot(df['time'], df['value'], label='Original Data with Anomalies', color='blue', linestyle='-')
plt.scatter(df['time'][df['is_outlier'] == -1], df['value'][df['is_outlier'] == -1], color='red', label='Detected Outliers')
plt.plot(df['time'], df['value_capped'], label='Capped Data (Adaptive Isolation Forest)', color='green')

plt.title('Adaptive Capping using Isolation Forest on Time-Series Data with Anomalies')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.show()
Figure 7: Adaptive Capping on Time-Series Data | Image by Author.

In the code snippet above:

  • Isolation Forest first identifies outliers by assigning anomaly scores to each data point; points with scores below a certain threshold (often set via the contamination parameter, e.g., 10%) are flagged as outliers.
  • Then, we define window_size as 10, which determines the size of the rolling window used to compute dynamic statistics (mean and standard deviation) for the time-series data.
  • Using these rolling statistics, we define dynamic upper and lower bounds (lower_bound and upper_bound). These bounds are set based on the rolling mean and standard deviation.

The plot in Figure 7 visualizes both the original data with outliers and the capped data, showing how Isolation Forest can handle outliers in time-series analysis.

These methods can work really well, but they can make your data preprocessing a bit more complex.

Observation on how to tweak those parameters:

If the time-series data shows frequent short-term anomalies, consider reducing window_size to increase sensitivity to these anomalies. Conversely, if the data has stable long-term trends with occasional large spikes, a larger window_size may help smooth out these spikes while still capturing significant changes.

Regarding bounds adjustment, experiment with different factors (e.g., 1.5, 2, 2.5) to multiply the rolling standard deviation when defining lower_bound and upper_bound. This lets you adjust the bounds to better handle the ups and downs you see in the data.
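A small sketch of that kind of experiment (assuming the df with rolling_mean and rolling_std columns from the capping snippets above): loop over a few multipliers and see how many points each choice would cap:

# Try several standard-deviation multipliers and count how many points each would cap
for k in (1.5, 2.0, 2.5):
    lower = df['rolling_mean'] - k * df['rolling_std']
    upper = df['rolling_mean'] + k * df['rolling_std']
    n_capped = ((df['value'] < lower) | (df['value'] > upper)).sum()
    print(f"±{k} std bounds would cap {n_capped} of {len(df)} points")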

Considerations and Best Practices to Handle Outliers

Image by Fiona Murray-deGraaff on Unsplash.

Regardless of the chosen method, it’s essential to:

  1. Validate the impact of capping on the dataset
  • Compare descriptive statistics before and after capping to understand how measures of central tendency and dispersion have changed (more on evaluation metrics in the upcoming article); a minimal sketch of such a comparison follows this list.
  • Visualize the data distribution pre- and post-capping using histograms or box plots to identify any significant alterations in the data’s shape.
  • Conduct sensitivity analyses by varying the capping thresholds and observing how this affects your main analytical results.
  • Consider the effect on correlations between variables, since capping can alter relationships in multivariate datasets.

2. Ensure the chosen method aligns with the data’s natural distribution and research objectives.

Evaluate whether your research questions are sensitive to extreme values. In some cases, these extremes might be the focus of the study and should not be capped!

Remember that for time-series data, capping might obscure important temporal patterns or events.

3. Document the capping process transparently, including the rationale behind the selected method and bounds.

  • Make sure to clearly mention the capping method you used, including the specific thresholds or percentiles.
  • Explain why you picked that method, connecting it to your data characteristics and what you’re aiming to achieve with your research.
  • Also, be sure to report how many data points were affected by the capping, both in numbers and percentages.
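As a minimal sketch of the before/after comparison mentioned in point 1 (assuming a DataFrame with the original value column and a capped value_capped column, as in the earlier snippets):

import matplotlib.pyplot as plt

# Compare descriptive statistics before and after capping
print(df[['value', 'value_capped']].describe())

# Visualize both distributions to spot changes in shape
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
df['value'].plot.hist(bins=30, ax=axes[0], title='Before capping')
df['value_capped'].plot.hist(bins=30, ax=axes[1], title='After capping')
plt.show()

# Report how many points were actually modified
n_changed = (df['value'] != df['value_capped']).sum()
print(f"{n_changed} of {len(df)} points were capped ({n_changed / len(df):.1%})")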

In this article, we have explored the importance of proper outlier treatment, discussed special considerations for time-series data, and covered basic strategies such as retaining and removing outliers.

Additionally, we explored capping methods including setting upper and lower bounds, percentile-based capping, and alternative approaches. Lastly, we covered some best practices for maintaining data integrity and optimizing analytical outcomes.

While retaining or removing outliers are fundamental strategies, there is a middle ground that can often provide the best of both worlds: transforming outliers.

This approach involves applying mathematical transformations to the data to reduce the impact of outliers without completely discarding them.

Stay tuned for the next article coming soon, where I will explore several transformation techniques to handle outliers, as well as detail a crucial step in outlier treatment: evaluating the impact of outlier treatment, including conducting sensitivity analysis.


If you found value in this post, I’d appreciate your support with a clap. You’re also welcome to follow me!

If you want to support my work, you can also buy me my favorite coffee: a cappuccino. 😊


Don’t forget to check the first and second articles of this series, where I dive deep into statistical and machine learning methods for outlier detection in time-series data:

The Ultimate Guide to Finding Outliers in Your Time-Series Data (Part 1)

The Ultimate Guide to Finding Outliers in Your Time-Series Data (Part 2)

Resources:

Sara’s Data Science Free Resources


References

  • Aggarwal, C. C. (2017). Outlier analysis (2nd ed.). Springer International Publishing. https://doi.org/10.1007/978-3-319-47578-3
  • Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 270–301.
  • Barnett, V., & Lewis, T. (1994). Outliers in statistical data (3rd ed.). Wiley.
  • Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 1–58.
  • Chen, C., & Liu, L. M. (1993). Joint Estimation of Model Parameters and Outlier Effects in Time Series. Journal of the American Statistical Association, 88(421), 284–297.
  • Ghosh, D., & Vogt, A. (2012). Outliers: An evaluation of methodologies. Joint Statistical Meetings, 3455–3460.
  • Osborne, J. W., & Overbay, A. (2004). The power of outliers (and why researchers should always check for them). Practical Assessment, Research, and Evaluation, 9(1), 6.
