
In time series analysis, dirty or messy data can distort our reasoning and conclusions. This is especially true in this domain, because temporal dependency plays a crucial role when dealing with sequential data.
Noise and outliers must be handled with care, using ad-hoc solutions. Here the tsmoothie package can save us a lot of time in preparing time series for analysis. Tsmoothie is a Python library for time series smoothing and outlier detection that can handle multiple series in a vectorized way. It is useful because it provides the preprocessing steps we need, such as denoising or outlier removal, while preserving the temporal patterns present in the raw data.
In this post, we use these tricks to improve a clustering task. More precisely, we try to identify changes in financial data by carrying out an unsupervised approach. In the end, we expect to point out clear patterns in the closing prices that can be used to inspect the hidden behavior of the market.
THE DATA
As introduced before, we operate on financial time series. There are many tools and premade datasets that provide and store financial data. For our purposes, we use a dataset collected from Kaggle. The Stock data 2000–2018 is a cleaned collection of stock prices from 2000 to 2018 for around 39 different stocks. It reports daily volumes and open, high, low, and close prices. We focus on the close prices.
For demonstrative purposes, we consider the Amazon stock price, but the same findings also appear in other stock signals.

TIME SERIES SMOOTHING
The first step in our workflow consists of time series preprocessing. Our strategy is intuitive and effective. Given a time series of closing prices, we split it into small sliding pieces. Each piece is then smoothed in order to remove outliers. The smoothing process is essential to reduce the noise present in our series and point out the true patterns that may be present over time.
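The splitting step can be sketched in a few lines of numpy. This is a minimal illustration assuming the closing prices are a 1-D array; tsmoothie also offers its own windowing utility, mentioned below.

```python
import numpy as np

def sliding_windows(series, window_len):
    """Split a 1-D series into overlapping windows of fixed length.

    Returns an array of shape (n_windows, window_len), where each
    row is a contiguous slice of the series shifted by one step.
    """
    series = np.asarray(series)
    return np.lib.stride_tricks.sliding_window_view(series, window_len)

# Toy example: 25 "closing prices" cut into windows of 20 days
prices = np.arange(25, dtype=float)
windows = sliding_windows(prices, 20)
print(windows.shape)  # (6, 20): 25 - 20 + 1 windows
```

Each row of `windows` can then be smoothed and clustered independently.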
Tsmoothie provides different smoothing techniques for our purpose. It also has a built-in utility for operating a sliding smoothing approach: the raw time series is partitioned into equal windowed pieces, which are then smoothed independently. We select the Locally Weighted Scatterplot Smoothing (LOWESS) as the smoothing procedure.
LOWESS is a powerful non-parametric technique for fitting a smoothed line to given data, through either univariate or multivariate smoothing. It computes a regression on a collection of points in a moving range around each abscissa value, weighted according to distance, in order to calculate the ordinate values. The selection of the smoothing parameter (alpha) is often based entirely on repeated trials: there is no specific technique for choosing its exact value, and a particular choice may lead to over-smoothing or under-smoothing.
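The core idea can be sketched in pure numpy. The version below is a minimal illustration, not tsmoothie's implementation: it fits a weighted linear regression around each point using tricube distance weights and skips the robustness iterations of full LOWESS.

```python
import numpy as np

def lowess(y, alpha=0.6):
    """Minimal LOWESS: local linear fits with tricube distance weights.

    alpha is the fraction of points used in each local regression;
    larger values give smoother (possibly over-smoothed) curves.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    x = np.arange(n, dtype=float)
    k = max(int(np.ceil(alpha * n)), 2)     # points per local fit
    smoothed = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]          # k nearest abscissas
        d = dist[idx] / dist[idx].max()     # scale distances to [0, 1]
        w = (1 - d ** 3) ** 3               # tricube weights
        # weighted least-squares line through the neighborhood
        A = np.vstack([np.ones(k), x[idx]]).T * np.sqrt(w)[:, None]
        b = y[idx] * np.sqrt(w)
        beta, *_ = np.linalg.lstsq(A, b, rcond=None)
        smoothed[i] = beta[0] + beta[1] * x[i]
    return smoothed

# A noisy linear trend: the smoothed curve should track the trend
rng = np.random.default_rng(0)
y = np.linspace(0, 10, 50) + rng.normal(0, 0.5, 50)
y_smooth = lowess(y, alpha=0.6)
```

In practice, tsmoothie's `LowessSmoother` does this work for us in a vectorized way across all windows.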
Below is the result of applying the mentioned procedure with sliding windows of length 20 (days) and alpha equal to 0.6. In other words, we compute a LOWESS for every generated window.

TIME SERIES CLUSTERING
The second step involves the usage of a clustering algorithm to identify the behaviors in our time series. The creation of equal-length windows is aimed at making this task easier.
Generally speaking, clustering different time series into similar groups is challenging, because each data point follows a temporal structure that we must respect in order to obtain satisfactory results. The distance measures used in standard clustering algorithms, such as Euclidean distance, are often not appropriate for time series. A stronger approach is to replace the default distance measure with a metric for comparing time series, such as Dynamic Time Warping.
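The core of Dynamic Time Warping is a dynamic program that finds the cheapest monotonic alignment between two sequences, so that similar shapes shifted in time still score as close. A minimal sketch for 1-D series, using an absolute-difference local cost (libraries such as tslearn provide optimized versions and a K-means variant built on this metric):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    Fills a cumulative-cost matrix where each cell adds the local
    cost |a[i] - b[j]| to the cheapest of the three allowed moves
    (match, insertion, deletion).
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# A phase-shifted copy of a pattern stays close under DTW
x = np.sin(np.linspace(0, 2 * np.pi, 30))
y = np.sin(np.linspace(0, 2 * np.pi, 30) + 0.3)
print(dtw_distance(x, x))  # 0.0: identical series align perfectly
```

Because warping can realign shifted points, the DTW cost of `x` versus `y` never exceeds the cost of the rigid point-by-point (diagonal) alignment.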
The search for 4 clusters with K-means and the Dynamic Time Warping metric produces the following results:

As we can see, the creation of 4 different clusters is evident, representing 4 different market movements: an increasing trend (cluster 0), a decreasing trend (cluster 1), a downward turning point (cluster 2), and an upward turning point (cluster 3). We can do the same with the raw time windows, without computing the smoothing, and make a comparison.

Now the difference between the 4 groups is not as marked. It is more difficult to provide an interpretation of the generated clusters. The ability to generate meaningful groups is the most important prerequisite of any unsupervised approach: if we can't attribute an explanation to the results, they can't be used to make decisions. In this sense, the adoption of a smoothing preprocessing step can help the analysis.

SUMMARY
In the financial domain, the concept of volatility is fundamental for making decisions. It measures the uncertainty, i.e. the risk, present in the market. Here we went deeper, extending our idea to market regimes in the short term. We identified four clear market conditions, smoothing our time series blocks to better understand the real dynamics of the data. In this post, we took advantage of time series smoothing in a financial clustering application, but this approach is valid and useful in many other contexts involving time series analysis.
Keep in touch: Linkedin