
Practical Cython – Music Retrieval: Detrended Fluctuation Analysis (DFA)


HANDS-ON TUTORIALS

This is my third tutorial on Cython. This time we are going to implement the Detrended Fluctuation Analysis (DFA), a widely used technique for time series analysis, ranging from music to finance

Can we unveil the hidden nature of long-time events? Image by Isaac Smith on Unsplash


Welcome back to my Cython tutorials! Today we will be dealing with a widely used technique called Detrended Fluctuation Analysis (DFA). DFA has been intensively applied in music¹ ² and finance³, as it is able to capture correlation trends in time series and non-stationary signals – namely, signals that change over time. The idea behind this analysis methodology is simple and has been mathematically well founded since the 1950s. As we'll see, the implementation is straightforward, with some caveats and C tricks. Once you understand the basic implementation, you will be able to extend this analysis further, towards multifractal detrended analysis as well as fluctuation analysis for chaotic systems. At the end of this tutorial you will master the DFA technique, and you will understand how to handle multiple C files in Cython and what to do at compilation time.

All the codes are stored in this repository:

Steboss/music_retrieval

This time there is no sklearn counterpart; however, many users have implemented their own Python DFA version, such as: https://github.com/dokato/dfa

Layman's theory

The roots of DFA can be found in the 1950s, in the seminal work of Harold Hurst, a British hydrologist who devoted years to determining the optimal dam size for the Nile river, which has supported civilisation since the Egyptian era. Even if this may seem a simple task, mathematically determining the Nile's flow and a correct dam height had always been confounded by unforeseen rainy and drought conditions, making this a challenging problem for centuries. Being an excellent engineer, Hurst decided to tackle the problem by analysing all the data he could get on the Nile and its tributaries. Thus, Hurst analysed the cumulative flows of the Nile and its confluent streams over time (690 different time series, with 75 features such as geophysical phenomena, rainfall, temperature etc. – all by hand!), defining a mathematical parameter called the 'adjusted range' R: the difference between the maximum and the minimum of the cumulative departures from the mean flow over a given period. Dividing (normalising) this parameter by the data's standard deviation, Hurst obtained the Rescaled Adjusted Range R/S, a dimensionless number that proved to be a proxy for the ideal dam height over a specific time period. Furthermore, Hurst found that in each case the statistic R/S grew with the period of analysis n (n can be a year, 2 years etc.) following a power law nᵏ, with a mean value of k = 0.72 +/- 0.01.
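In formulas, Hurst's empirical law reads:

\frac{R(n)}{S(n)} \propto n^{k}, \qquad k = 0.72 \pm 0.01

where R(n) is the adjusted range measured over a window of n observations and S(n) is the corresponding standard deviation.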

All right, now what is peculiar about this "Hurst" number k? Well, if all these historical events had followed a normal distribution, as is usually expected of many natural processes, the exponent would have been equal to 0.5. Assuming that, however, would have been dangerous in dam design: the dam would have been built too low. As Hurst proved, k=0.72 reflects a long-term behaviour, taking rainy and drought spells into account, thus something hidden in the statistical distribution of these natural events. Given the knowledge of his time, Hurst could not fully explain this result, but he was paving the road for the study of "long-term memory" events.

About 20 years later, the great mathematician Benoit Mandelbrot recognised in Hurst's results the phenomena of heavy tails and fractal behaviour. The Hurst number (called the Hurst exponent, and denoted from now on with H) describes a "self-similar" phenomenon in the analysed data. Here, self-similar means that if a signal Y(t) (e.g. the Amazon share price, evolving with time t) is observed on a rescaled time axis ct, where c is any number, it behaves proportionally to cᴴY(t), where H is the Hurst exponent.

Such a remarkable discovery led Mandelbrot and his collaborator Van Ness to investigate a large variety of phenomena, in economics, music and hydrology. Believe it or not, all of these phenomena showed a 1/f noise fluctuation behaviour, namely a well-defined behaviour in time with long-range correlations. The persistence of a series over time can be read off the exponent according to these values:

  • H > 0.5: a long-time correlation in the signal is present
  • H ~ 0.5: the signal is totally random
  • H < 0.5: there is an anti-persistent/mean reverting behaviour in the signal

After decades of research, more and more signals were analysed and more and more techniques were devised to detect long-term correlations. Detrended fluctuation analysis (DFA) is one of these techniques: it determines the statistical self-affinity of a signal, i.e. whether the signal possesses a long-memory process/correlation/1/f fluctuation. In practice, DFA yields a Hurst-like exponent, and it proves valid also for non-stationary signals.

Mathematically DFA can be decomposed and defined in four steps:

  • Compute the averaged cumulative sum Xt of a signal x(t), composed of N samples
  • Subdivide the averaged cumulative sum Xt into windows of n samples and for each window compute a least-squares linear fit Yt. Repeat this process at different scales, namely from dividing the signal into small windows up to large ones
  • From this compute the detrended fluctuation as the root-mean-square deviation between the averaged cumulative sum Xt and the fitting Yt, at each scale:
F(n) = \sqrt{\frac{1}{N}\sum_{t=1}^{N} \left( X_t - Y_t \right)^2}

Eq. 1: detrended fluctuation computed as the RMS between the averaged cumulative sum Xt and the fitting Yt
  • Express F(n) as a function of the number of samples n on a logarithmic scale and extract the power-law exponent (the DFA/Hurst exponent)
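In other words, the last step is just a linear fit in log-log space, whose slope α is the DFA estimate of the Hurst exponent:

\log_2 F(n) = \alpha \, \log_2 n + \text{const}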

Now we’re ready to deal with the DFA implementation!


DFA in Cython: let’s have fun

Plan of action

Fig.1: A signal-point of view to understand DFA implementation. At first the averaged cumulative sum Xt of a signal X(t) is computed. Xt is subdivided in N samples at different scales. A linear fitting is performed for each X chunk and the RMS is computed and accumulated. The final RMS is the total amount of RMS for each Xt chunk at different scales. Finally, the RMS is transformed with log2 and a linear fitting is computed to get the Hurst exponent.

Fig.1 defines how DFA should be computed:

  • Compute the averaged cumulative sum Xt of a signal X(t), which can be thought of as a simple cumulative sum of the mean-subtracted signal (see the NumPy sketch after this list)
  • Subdivide Xt into n subsamples at different scales. To determine the scales and how many subsamples to divide Xt into, we can start from an array scales with all the scales we want to use. Then, for each scale, compute how many subsamples we can obtain and populate a temporary array tmpX with the subsamples from Xt
  • For each tmpX of curr_scale samples, perform a linear least-squares fit and compute the goodness of fit for each chunk. The goodness of fit conveniently gives us the root-mean-square (RMS) error, so accumulate all these errors
  • Once the subsample fits are completed for every scale, compute the final RMS from the partial RMS values stored at the previous step
  • Transform the final RMS with log2 and compute a linear fit. The slope of the fit coincides with the Hurst exponent
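Before splitting the work between Python, Cython and C, here is a minimal NumPy sketch of the whole pipeline just described. It mirrors the plan of action, not the repo's C source, and all names are illustrative:

import numpy as np

def dfa_sketch(x, scale_low=5, scale_high=9, scale_dense=0.25, order=1):
    # scales to analyse: 2**5 ... 2**9, log2-spaced by 0.25
    scales = (2 ** np.arange(scale_low, scale_high, scale_dense)).astype(np.intc)
    # step 1: averaged cumulative sum
    xcumsum = np.cumsum(x - np.mean(x))
    fluct = np.zeros(len(scales))
    for i, scale in enumerate(scales):
        # step 2: subdivide xcumsum into windows of `scale` samples
        n_windows = len(xcumsum) // scale
        windows = xcumsum[:n_windows * scale].reshape(n_windows, scale)
        t = np.arange(scale)
        rms = np.zeros(n_windows)
        for j, w in enumerate(windows):
            coeffs = np.polyfit(t, w, order)      # least-squares fit per window
            fit = np.polyval(coeffs, t)
            # step 3: RMS deviation between the window and its fit
            rms[j] = np.sqrt(np.mean((w - fit) ** 2))
        fluct[i] = np.sqrt(np.mean(rms ** 2))     # overall fluctuation at this scale
    # step 4: the slope in log-log space is the Hurst exponent
    hurst = np.polyfit(np.log2(scales), np.log2(fluct), 1)[0]
    return scales, fluct, hurst

For a purely random input signal, this sketch should return a Hurst exponent close to 0.5.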

Now that we have a complete idea of how to compute DFA, we can subdivide the implementation between Python, Cython and C:

  • The entrypoint is, as always, a Python code, which encodes and sends the input signal to Cython.
  • Cython prepares the ground for C: first converting the input signal to a C-readable format, the memoryview, and initializing all the arrays needed to retrieve info from C, namely the RMS (or fluctuations) and the fitting coefficients holding the Hurst exponent. Finally, Cython calls the C function to perform the DFA calculation, binding memoryviews to C pointers like &cython_memoryview[0]
  • For the first time we will have to deal with multiple C files: one for computing DFA and a second one for the linear fitting. Furthermore, we will need a function to compute the averaged cumulative sum, a function to perform a log2 transform and a floor division, as well as a main function with the DFA computation. All this info is then sent back to Cython.
Fig. 4: Plan of action for implementing DFA in C and interacting with Cython and Python

Python script

Code: https://github.com/Steboss/music_retrieval/blob/master/dfa/installer/tester.py

The Python code is the entrypoint of our application, as usual. Here we receive the input data and convert them to a suitable format. As an example, in this code we're processing a random signal – the scipy.signal.hilbert transform is used just to avoid negative magnitude values, purely for training purposes.

Once the input has been processed, dfa can be called with the following line: scales, fluct_to_array, coeff_to_array = dfa.play(X, 5, 9, 0.25)

  • To activate a DFA analysis in Cython we are going to use the play attribute
  • play receives as input: X, the signal; 5 and 9, the smallest (2⁵) and the largest (2⁹) sample subdivisions; and 0.25, the interval spacing between scales
  • dfa.play returns the scales the input signal X has been divided into, the RMS for all the scales as fluct_to_array, and coeff_to_array, which contains the Hurst exponent
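Putting the pieces together, a minimal tester.py-style call looks like this (the signal construction is illustrative; only the dfa.play line comes from the repo):

import numpy as np
from scipy.signal import hilbert
import dfa  # the compiled Cython extension

np.random.seed(42)
x = np.random.randn(10000)
X = np.abs(hilbert(x))  # envelope of the analytic signal: avoids negative magnitudes

scales, fluct_to_array, coeff_to_array = dfa.play(X, 5, 9, 0.25)
print("scales:", scales)
print("Hurst exponent:", coeff_to_array[1])  # stored in coeffs[1], see the C section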

As a final step we want to visualize the DFA function along with the log2 transform and fitting:
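A matplotlib sketch of this visualization, assuming the variables returned by dfa.play above, could be:

import numpy as np
import matplotlib.pyplot as plt

log_scales = np.log2(scales)
log_fluct = np.log2(fluct_to_array)
# assumption: coeff_to_array = [intercept, slope], as discussed in the C section
fit_line = coeff_to_array[0] + coeff_to_array[1] * log_scales

plt.plot(log_scales, log_fluct, 'o', label='log2 F(n)')
plt.plot(log_scales, fit_line, label='fit, slope ~ Hurst = %.2f' % coeff_to_array[1])
plt.xlabel('log2(scale)')
plt.ylabel('log2(F(n))')
plt.legend()
plt.show()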

Cython Script

Code: https://github.com/Steboss/music_retrieval/blob/master/dfa/installer/dfa.pyx

The first thing that pops up immediately in the pyx code is the declaration of the two C files:
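A sketch of those declarations, with illustrative signatures (check the repo for the exact ones):

# dfa.pyx
cdef extern from "../c_code/polyfit.c":
    # pulled in here so setup.py does not need to list polyfit.c
    int polyfit(const double *x_axis, const double *y_axis,
                int n_elems, int order, double *coeffs)

cdef extern from "../c_code/dfa.c":
    void dfa(float *X, int n_elems, int *scales, int scales_len,
             double *fluct, double *coeffs)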

The first file is dfa.c, which is where the main DFA algorithm runs; the second is polyfit.c, which is used by dfa.c to run a polynomial fitting of any function. As you can see, polyfit.c is an accessory element of the DFA routine, but we must declare it here, or in the setup.py file, in order for Cython to read all the C dependencies. I decided not to declare polyfit.c in setup.py in order to keep the setup script clear and more digestible, so that we only have to worry about dfa.c in the declarations (see the setup.py section below). What if we do not declare polyfit.c anywhere? Well, you'll be able to compile the Cython code; however, at run time, you will get a flat namespace error:

$ python tester.py
Traceback (most recent call last):
File "tester.py", line 2, in <module>
import dfa
ImportError: dlopen(.../dfa.cpython-38-darwin.so, 2): Symbol not found: _polyfit
Referenced from: .../dfa/installer/dfa.cpython-38-darwin.so
Expected in: flat namespace
in .../dfa/installer/dfa.cpython-38-darwin.so

This error tells us that, at import time, dlopen cannot find the symbol _polyfit in the dynamic library – note that the leading underscore in _polyfit is simply how C symbols appear in a compiled library on macOS. This is a common pitfall in Cython, so make sure to declare all your source files.

Secondly, it is worth stressing another source of error: the use of cpdef. cpdef defines a function which can be shared between C and Python, differently from def, which can be read only by Python and not by C, and cdef, which can be read only by C and not by Python. Thus, writing cdef rather than cpdef would make the attribute play hidden to Python, and dfa.play() would raise an error like module 'dfa' has no attribute 'play'.
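As a minimal sketch (the actual signature lives in the repo's pyx file):

# cpdef: callable from both Python and C; def: Python only; cdef: C only
cpdef play(X, int scale_low, int scale_high, double scale_dense):
    ...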

A third important thing in this code is the correct mapping of the scales types from NumPy to Cython to C. From the NumPy documentation it is possible to retrieve the following table:

NumPy dtype          NumPy Cython type         C Cython type 

np.bool_             None                      None
np.int_              cnp.int_t                 long
np.intc              None                      int       
np.intp              cnp.intp_t                ssize_t
np.int8              cnp.int8_t                signed char
np.int16             cnp.int16_t               signed short
np.int32             cnp.int32_t               signed int
np.int64             cnp.int64_t               signed long long
np.uint8             cnp.uint8_t               unsigned char
np.uint16            cnp.uint16_t              unsigned short
np.uint32            cnp.uint32_t              unsigned int
np.uint64            cnp.uint64_t              unsigned long
np.float_            cnp.float64_t             double
np.float32           cnp.float32_t             float
np.float64           cnp.float64_t             double
np.complex_          cnp.complex128_t          double complex
np.complex64         cnp.complex64_t           float complex
np.complex128        cnp.complex128_t          double complex

On line 55 the scales are created in NumPy as: scales=(2**np.arange(scale_low, scale_high, scale_dense)).astype(np.intc). The final cast, astype(np.intc), guarantees the correct conversion from the NumPy type to the C type int. This is often a source of errors: passing the wrong format from NumPy/Cython to C leads to Segmentation Fault errors.

Eventually, the Cython code also deals with the correct conversion of the input data to memoryviews, as well as the initialization of the RMS and coefficient arrays:
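A sketch of this part of the pyx, with illustrative dtypes (the exact ones live in the repo):

# typed memoryviews bound to the NumPy buffers
X = X.astype(np.float32)
cdef float[:] X_memview = X
fluct_to_array = np.zeros(len(scales), dtype=np.float64)
coeff_to_array = np.zeros(2, dtype=np.float64)
cdef double[:] fluct_memview = fluct_to_array
cdef double[:] coeff_memview = coeff_to_array
cdef int[:] scales_memview = scales

# call into C: &memview[0] passes a pointer to the first element
dfa(&X_memview[0], len(X), &scales_memview[0], len(scales),
    &fluct_memview[0], &coeff_memview[0])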

The main C function dfa is then called, and memoryviews are passed as &memoryview[0] in order to retrieve the final values from the C code. As the Cython documentation explains, since the size of these memoryviews is known and the underlying data are properly contiguous, it is sufficient to pass a pointer to the first memoryview element (&a[0]) for them to work in C.

C code

Code: https://github.com/Steboss/music_retrieval/blob/master/dfa/c_code/dfa.c

The C code implements the main DFA calculation. At first the code may look scary, but let’s deal with it step by step.

Fig. 8 shows the very first steps: import the basic libraries, define constants and essential functions. The most important function is cumsum_averaged, which computes the averaged cumulative sum of an input signal float *X, given its length int n_elems and the signal average float avg. The final result is written to the float *Xcumsum pointer, so the values are directly available in the main dfa function.

After these essential imports, the main dfa function can be defined; think of it as 3 principal chunks. In the first bit the fitting_order (namely, the order of the fit we want to perform), the averaged cumulative sum array Xcumsum, the RMS array rms and the fitted-line coefficients vector tmpCoeff are defined.

The second step (fig. 10) is a for-cycle through all the scales – produced in Cython – defining:

  • the final dimension shape1 of the signal subdivided by the current i-th scale curr_scale (e.g. the signal has 1000 elements and the first scale is 2⁵ = 32, so shape1 = floor_div(1000, 32) = 31; thus we will cycle through Xcumsum 31 times, extracting chunks of 32 samples each)
  • an empty vector tmpX, whose dimension is curr_scale, to store each chunk of the Xcumsum signal (e.g. in each of the 31 cycles through Xcumsum the vector tmpX will store 32 samples)
  • an empty vector tmpDiff, whose size is curr_scale, to store the element-by-element squared difference between the tmpX signal chunk and its linear fit (e.g. once tmpX holds the 32 samples and the fit against them is done, tmpDiff will store the squared difference of element 0 with fitted element 0, element 1 with fitted element 1 and so on)
  • an empty vector tmpShape1, whose size is shape1, to store the RMS of each chunk from the tmpDiff vector, computed as: tmpShape1[j] = sqrt(vector_avg(tmpDiff, curr_scale))

    The final step is the heart of DFA, where all the quantities are computed:

  • for j = 0 ... j = shape1-1, subdivide the Xcumsum signal and store the samples in tmpX (e.g. supposing shape1 = 31, in the first iteration the signal Xcumsum will be read from element 0 to 31 and these samples saved to tmpX)
  • perform a linear fit of tmpX with polyfit. polyfit takes as input the x-axis (scale_ax, which simply goes from 0 to curr_scale), the y-axis function tmpX with the current curr_scale samples, the length of tmpX, the fitting_order and a pointer tmpCoeff to store the fitted coefficients
  • from the fitted coefficients, reconstruct the fitting line by evaluating the fitted polynomial along scale_ax
  • determine the goodness of fit by computing the RMS between the tmpX elements and the fitting_results ones, and store these results in tmpShape1
  • at the end of this second cycle, remember to free all the temporary vectors and accumulate the total rms from tmpShape1 (see the Python mirror after this list)
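To see the second and third chunk at a glance, here is a Python mirror of the C loop, keeping the C variable names. It illustrates the logic only; it is not the actual dfa.c source:

import numpy as np

def dfa_core(Xcumsum, scales, fitting_order=1):
    rms = np.zeros(len(scales))
    for e, curr_scale in enumerate(scales):
        shape1 = len(Xcumsum) // curr_scale        # floor_div(n_elems, curr_scale)
        scale_ax = np.arange(curr_scale)
        tmpShape1 = np.zeros(shape1)
        for j in range(shape1):
            tmpX = Xcumsum[j * curr_scale:(j + 1) * curr_scale]   # current chunk
            tmpCoeff = np.polyfit(scale_ax, tmpX, fitting_order)  # linear fit
            fitting_results = np.polyval(tmpCoeff, scale_ax)
            tmpDiff = (tmpX - fitting_results) ** 2               # squared differences
            tmpShape1[j] = np.sqrt(np.mean(tmpDiff))              # goodness of fit (RMS)
        rms[e] = np.sqrt(np.mean(tmpShape1 ** 2))   # accumulate the total rms per scale
    return rms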

    Congrats! You have managed to deal with the most complicated steps of the DFA implementation. The very last lines of the DFA C code are pretty straightforward: they compute the log2 transform of the RMS, in order to return the Hurst exponent:
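In NumPy terms, the final step amounts to the following (a sketch of the logic using the dfa_core mirror above, not the C source):

import numpy as np

x = np.random.randn(10000)
Xcumsum = np.cumsum(x - np.mean(x))
scales = (2 ** np.arange(5, 9, 0.25)).astype(np.intc)
fluct = dfa_core(Xcumsum, scales)

coeffs = np.polyfit(np.log2(scales), np.log2(fluct), 1)
hurst = coeffs[0]  # np.polyfit returns the slope first; the C polyfit stores it in coeffs[1]
print("Hurst exponent ~", hurst)  # ~0.5 for white noise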

    As before, polyfit receives as input the x-axis (in this case the values from 1 to scales_len, log2-transformed), the y-axis function (in this case the log2-RMS), the length of the x and y axes, the fitting_order and the coeffs pointer to return the coefficients to the caller. Eventually, coeffs[1] contains the Hurst exponent, and the RMS log2 plot can easily be visualized in Python, as shown in the Python script section above.

    Prepare setup.py and install everything


As done in the previous tutorial, the codes are divided in two folders: c_code, where dfa.c and polyfit.c are stored, and installer, where the pyx and Python files live. setup.py is the installation file, saved in installer, and follows the code here: https://github.com/Steboss/music_retrieval/blob/master/dfa/installer/setup.py. I ensured this setup can run on Mac, adding the right architecture for the C compiler with os.environ['ARCHFLAGS']='-arch x86_64' and including the right NumPy headers (e.g. arrayobject.h) with include_dirs=[numpy.get_include()] under Extension.
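A minimal sketch of that setup.py (the sources assumption follows the pyx declarations shown earlier; the repo may organise them slightly differently):

# setup.py
import os
import numpy
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

# Mac caveat: force the right architecture for the C compiler
os.environ['ARCHFLAGS'] = '-arch x86_64'

ext = Extension(
    "dfa",
    sources=["dfa.pyx"],  # assumption: dfa.c and polyfit.c are pulled in by the
                          # cdef extern declarations inside dfa.pyx
    include_dirs=[numpy.get_include()],  # NumPy headers, e.g. arrayobject.h
)

setup(ext_modules=cythonize([ext]))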

Once the code is ready we can run setup.py as:

python setup.py build_ext --inplace

build_ext is the command to build the current extension, and --inplace installs the current package in the working directory. Copy the output file into the dfa folder and you can start importing your DFA Python package.


DFA in practice

To conclude, here is a final example of DFA usage on financial data. I will show a further example applied to music in one of my next posts – so stay tuned!

The Etsy, Shopify, UPS case

One of the holy grails of statistics/machine learning/deep learning/super-intense learning is predicting the stock markets. Unfortunately, such algorithms are rarely that predictive, and many of them suffer from data leakage, which makes predictions look perfect – as often happens here on Medium. What I decided to do is apply DFA to check whether there is some long-term correlation, or a trend, in all the financial tickers listed by Yahoo (about 856 tickers). Not surprisingly, the majority of stock series appeared to have a Brownian behaviour, namely no correlations in the signal. Surprising cases emerged from Etsy, Shopify and UPS, with an almost perfect 1/f noise, namely a Hurst exponent (here denoted with α) close to 1, which identifies a self-similar behaviour in the time series:

Fig. 16: ETSY shares with a Hurst coefficient ~ 1.09
Fig. 17: Shopify shares, Hurst coefficient ~ 1.12
Fig. 18: UPS shares, Hurst exponent ~ 1.17
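As a closing note, a hypothetical sketch of how a single ticker could be scanned; yfinance is my assumption here, as the post does not specify how the Yahoo series were downloaded:

import numpy as np
import yfinance as yf  # assumption: any Yahoo Finance client would do
import dfa

# daily closing prices for one ticker
close = yf.download("ETSY")["Close"].dropna().to_numpy().ravel().astype(np.float64)
scales, fluct, coeffs = dfa.play(close, 5, 9, 0.25)
print("alpha ~", coeffs[1])  # ~1 suggests 1/f-like, self-similar behaviour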

I hope you like this new tutorial on Cython 🙂 feel free to send me an email for questions or comments at: [email protected]

Bibliography

  1. Detrended Fluctuation Analysis of Music Signals: Danceability Estimation and Further Semantic Characterization, S. Streich, P. Herrera, 2005
  2. Revealing Competitive Behaviours in Music by Means of the Multifractal Detrended Fluctuation Analysis: Application to Bach's Sinfonias, L. Telesca, M. Lovallo, 2011
  3. Detrended Fluctuation Analysis of Multivariate Time Series, H. Xiong, P. Shang, 2016
  4. Multifractal Detrended Fluctuation Analysis: Practical Applications to Financial Time Series, J. R. Thompson, J. R. Wilson, 2014
