HANDS-ON TUTORIALS
This is my third tutorial on Cython. This time we are going to implement Detrended Fluctuation Analysis (DFA), a widely used technique for time series analysis, with applications ranging from music to finance

Welcome back to my Cython tutorials! Today we will be dealing with a widely used technique called Detrended Fluctuation Analysis (DFA). DFA has been intensively applied in music¹ ² and finance³ ⁴, being able to capture correlation trends in time series and non-stationary signals – namely signals changing in time. The idea behind this analysis methodology is simple and straightforward, and mathematically well founded since the 1950s. As we'll see, the implementation is straightforward too, with some caveats and C tricks. Once you understand the basic implementation you will be able to extend this analysis further, e.g. to multifractal detrended analysis or to fluctuation analysis of chaotic systems. At the end of this tutorial you will master DFA techniques, and you will understand how to handle multiple C files in Cython and what to do at compilation time.
All the code is stored in this repository: https://github.com/Steboss/music_retrieval
This time there is no sklearn counterpart; however, many users have implemented their own Python DFA version, such as: https://github.com/dokato/dfa
Layman theory
The roots of DFA can be found in the 1950s, thanks to the seminal work done by Harold Hurst, a British hydrologist, who devoted considerable effort to determining the optimal dam size for the Nile river, which has supported civilisation since the Egyptian era. Even if this may seem a very simple task, mathematically determining the Nile's flow and a correct dam height has always been highly influenced by unforeseen rainy and drought conditions, making this a challenging problem for centuries. Being an excellent engineer, Hurst decided to tackle this problem starting from the analysis of all the data he could get, from the Nile and all its confluent rivers. Thus, Hurst analysed the cumulative flows of the Nile (690 different time series with 75 features like geophysical phenomena, rainfall, temperature etc. – all by hand!) and its confluent streams over time, determining a mathematical parameter called the 'adjusted range' R, the difference between the maximum and the minimum departure of a river's flow at a specific point in time under defined environmental conditions. Dividing (normalising) this parameter by the data's standard deviation, Hurst obtained a Rescaled Adjusted Range R/S, a dimensionless number that proved to be a proxy for the ideal dam height over a specific time period. Furthermore, Hurst found out that in each case the statistical parameter R/S was proportional to the period of analysis n (n can be a year, 2 years etc.) with a power-like behaviour nᵏ, with a mean value of k = 0.72 +/- 0.01.
All right, now what is peculiar about this k "Hurst" number? Well, if all these historical events had followed a normal distribution, as is usually expected for many natural events, the number would have been equal to 0.5. However, that would have been a potentially dangerous assumption in dam design: the dam would have been built too low. As Hurst proved, k = 0.72 pointed to a long-term behaviour, taking into account rainy and drought conditions – something hidden in the statistical distribution of these natural events. Due to the lack of theoretical tools at that time, Hurst was not able to clearly explain this result, but he was paving the road for the study of "long-term memory" events.
After about 20 years, a great mathematician, Benoit Mandelbrot, recognized in Hurst's results the phenomenon of heavy tails and fractal behaviour. The Hurst number (called the Hurst exponent, and denoted from now on with H) describes a "self-similar" phenomenon in the analysed data. In this view, self-similar means that if a signal Y(t) (e.g. the Amazon share price, evolving with time t) is analysed at a rescaled point in time ct (for example today), where c is any number, it is statistically proportional to cᴴY(t), where H is the Hurst exponent.
Such a remarkable discovery led Mandelbrot and his collaborator Van Ness to investigate a large variety of phenomena, in economics, music and hydrology. Believe it or not, all of these phenomena showed a 1/f noise fluctuation behaviour, namely a defined time behaviour with long-range correlations, characterising the persistence of a series over time according to these values of H:
- H > 0.5: a long-time correlation in the signal is present
- H ~ 0.5: the signal is totally random
- H < 0.5: there is an anti-persistent/mean reverting behaviour in the signal
After decades of research, more and more signals were analysed and more and more techniques were developed to find long-term correlations. Detrended fluctuation analysis (DFA) is one of these techniques: it determines the statistical self-affinity of a signal, i.e. whether the signal possesses a long-memory process/correlation/1/f fluctuation. In practice, a Hurst exponent is obtained, and it proves valid also for non-stationary signals.
Mathematically DFA can be decomposed and defined in four steps:
- Compute the averaged cumulative sum Xt of a signal x(t), composed of N samples
- Subdivide the averaged cumulative sum Xt into windows of n samples and, for each sub-sample, compute a least-squares linear fitting Yt. Repeat this process at different scales, namely from dividing the signal into tiny windows up to large ones
- From this, compute the detrended fluctuation as the root-mean-square deviation between the averaged cumulative sum Xt and the fitting Yt at each scale n:

F(n) = √[(1/N) Σₜ (Xₜ − Yₜ)²]

- Express F(n) as a log function of the number of samples n and find the exponent of the power-law behaviour F(n) ∝ nᴴ (the DFA-Hurst exponent)
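Before moving to Cython, the four steps can be condensed into a minimal, unoptimized NumPy sketch (the function name dfa_numpy and its defaults are mine, not from the repository), which is a useful reference for the C implementation below:

```python
import numpy as np

def dfa_numpy(x, scale_low=5, scale_high=9, scale_dense=0.25):
    """Plain-NumPy sketch of DFA: returns the scales, F(n) and the Hurst exponent."""
    scales = (2 ** np.arange(scale_low, scale_high, scale_dense)).astype(np.intc)
    # Step 1: averaged cumulative sum
    X = np.cumsum(x - np.mean(x))
    fluct = np.zeros(len(scales))
    for i, sc in enumerate(scales):
        # Step 2: split X into floor(N / sc) windows of sc samples each
        n_chunks = len(X) // sc
        chunks = X[: n_chunks * sc].reshape(n_chunks, sc)
        t = np.arange(sc)
        rms = np.zeros(n_chunks)
        for j, chunk in enumerate(chunks):
            # linear least-squares fit of each window
            coeff = np.polyfit(t, chunk, 1)
            trend = np.polyval(coeff, t)
            # Step 3: root-mean-square deviation from the local trend
            rms[j] = np.sqrt(np.mean((chunk - trend) ** 2))
        fluct[i] = np.sqrt(np.mean(rms ** 2))
    # Step 4: the slope in log2-log2 space is the Hurst exponent
    hurst = np.polyfit(np.log2(scales), np.log2(fluct), 1)[0]
    return scales, fluct, hurst
```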
Now we’re ready to deal with the DFA implementation!
DFA in Cython: let’s have fun
Plan of action

Fig. 1 defines how DFA should be computed:
- Compute the averaged cumulative sum Xt of the signal X(t), which can be thought of as a simple process (see the one-line sketch right after this list)
- Subdivide Xt into subsamples at different scales. To determine the scales and into how many subsamples to divide Xt, we can start from an array scales with all the scales we want to use. Then, for each scale, compute how many subsamples we can obtain and populate a temporary array tmpX with all the subsamples from Xt
- For each tmpX with curr_scale samples, perform a linear least-squares fitting and compute the goodness of fit for each chunk. The goodness of fit conveniently gives us the root-mean-square (RMS) error, so accumulate all these errors
- Once the subsample fits for each scale are completed, compute the final RMS from the partial RMS values stored at the previous step
- Transform the final RMS with log2 and compute a linear fitting. The slope of the fitting coincides with the Hurst exponent
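In NumPy terms, the averaged cumulative sum of the first step is a one-liner (a sketch, not the repository code):

```python
import numpy as np

X = np.random.randn(1000)          # any input signal
Xcumsum = np.cumsum(X - X.mean())  # averaged cumulative sum
```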
Now that we have a complete idea of how to compute DFA, we can subdivide the implementation between Python, Cython and C:
- The entrypoint is, as always, a Python code, which encodes and sends the input signal to Cython.
- Cython prepares the ground for C: first converting the input signal to a C-readable format, like a memoryview, and initializing all the arrays needed to retrieve info from C, like the RMS (or fluctuations) and the fitting coefficients from which the Hurst exponent is read. Finally, Cython calls the C function that performs the DFA calculation, binding memoryviews to C pointers like &cython_memoryview[0]
- For the first time we will have to deal with multiple C files: one for computing DFA and a second one for the linear fitting. Furthermore, we will need to write a function to compute the averaged cumulative sum, a function to perform a log2 transform and a floor division, as well as a main function with the DFA computation. All this info will then be sent back to Cython.

Python script
Code: https://github.com/Steboss/music_retrieval/blob/master/dfa/installer/tester.py
The Python code is the entrypoint of our application, as usual. Here we receive the input data and convert them to a suitable format. As an example, in this code we process a random signal; the scipy.signal.hilbert transform is used just to avoid negative magnitude values, for training purposes.
Once the input has been processed, dfa can be called with the following line: scales, fluct_to_array, coeff_to_array = dfa.play(X, 5, 9, 0.25)
- To activate a DFA analysis in Cython we are going to use the play attribute. play receives as input: X, the signal, and 5, 9, 0.25, i.e. the smallest sample subdivision (2⁵), the highest one (2⁹) and the interval spacing (0.25)
- The dfa.play function returns the scales the input signal X has been divided into, the RMS as fluct_to_array for all the samples, and coeff_to_array, which contains the Hurst exponent
As a final step we want to visualize the DFA function along with the log2 transform and fitting:
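A possible sketch of that final plot, assuming the three arrays returned by dfa.play above (the coefficient ordering is an assumption, based on coeff_to_array[1] holding the Hurst exponent):

```python
import numpy as np
import matplotlib.pyplot as plt

# scales, fluct_to_array, coeff_to_array = dfa.play(X, 5, 9, 0.25)
log_scales = np.log2(scales)
log_fluct = np.log2(fluct_to_array)

plt.plot(log_scales, log_fluct, "o", label="log2 F(n)")
# assumed ordering: coeff_to_array = [intercept, slope],
# since coeff_to_array[1] holds the Hurst exponent
plt.plot(log_scales, coeff_to_array[1] * log_scales + coeff_to_array[0],
         label=f"fit, H ~ {coeff_to_array[1]:.2f}")
plt.xlabel("log2(n)")
plt.ylabel("log2 F(n)")
plt.legend()
plt.show()
```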
Cython Script
Code: https://github.com/Steboss/music_retrieval/blob/master/dfa/installer/dfa.pyx
The first thing that pops up immediately in the pyx code is the declaration of two C source files:
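The declarations look roughly like the following sketch (the function signatures here are hypothetical; check the actual files in the repository):

```cython
# dfa.pyx -- sketch of the C-source declarations (hypothetical signatures)
cdef extern from "polyfit.c":
    int polyfit(double* x_values, double* y_values, int n_points,
                int order, double* coefficients)

cdef extern from "dfa.c":
    void dfa(float* X, int n_elems, int* scales, int scales_len,
             int fitting_order, double* fluct, double* coeffs)
```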
The first file is dfa.c, where the main DFA algorithm runs; the second is polyfit.c, which dfa.c uses to run a polynomial fitting of any function. As you can see, polyfit.c is an accessory to the DFA routine; however, we must declare it here, or in the setup.py file, so that Cython can see all the C dependencies. I decided not to declare polyfit.c in setup.py, to keep the setup script clean and more digestible, so that we only have to worry about dfa.c in the declarations (see the setup section at the end). What if we do not declare polyfit.c anywhere? Well, you'll be able to compile the Cython code; however, at run time, you will get a flat namespace error:
```
python tester.py
Traceback (most recent call last):
  File "tester.py", line 2, in <module>
    import dfa
ImportError: dlopen(.../dfa.cpython-38-darwin.so, 2): Symbol not found: _polyfit
  Referenced from: .../dfa/installer/dfa.cpython-38-darwin.so
  Expected in: flat namespace
 in .../dfa/installer/dfa.cpython-38-darwin.so
```
This error tells us that the dynamic loader cannot find the _polyfit symbol when loading the library through dlopen. Note that the leading underscore in _polyfit is simply how C symbols are decorated on macOS; the error means the polyfit function was never compiled into the extension. This is a common pitfall in Cython, so make sure to declare all your source files.
Secondly, it is worth stressing another source of error: the use of cpdef. cpdef defines a function which can be shared between C and Python, unlike def, which can be read only by Python and not by C, and cdef, which can be read only by C and not by Python. Thus, writing cdef rather than cpdef would make the play attribute hidden to Python, and dfa.play() would raise an error like dfa has no attribute play.
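A toy example (not the tutorial's code) makes the difference clear:

```cython
def py_only(x):            # visible to Python; not callable from C
    return x * 2

cdef int c_only(int x):    # callable from C/Cython only; invisible to Python
    return x * 2

cpdef int shared(int x):   # gets both a C entry point and a Python wrapper
    return x * 2
```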
A third important thing in this code is the correct parsing of the scales types from NumPy to Cython to C. From the NumPy documentation it is possible to retrieve the following table:
| NumPy dtype | NumPy-Cython type | C type |
| --- | --- | --- |
| np.bool_ | None | None |
| np.int_ | cnp.int_t | long |
| np.intc | None | int |
| np.intp | cnp.intp_t | ssize_t |
| np.int8 | cnp.int8_t | signed char |
| np.int16 | cnp.int16_t | signed short |
| np.int32 | cnp.int32_t | signed int |
| np.int64 | cnp.int64_t | signed long long |
| np.uint8 | cnp.uint8_t | unsigned char |
| np.uint16 | cnp.uint16_t | unsigned short |
| np.uint32 | cnp.uint32_t | unsigned int |
| np.uint64 | cnp.uint64_t | unsigned long |
| np.float_ | cnp.float64_t | double |
| np.float32 | cnp.float32_t | float |
| np.float64 | cnp.float64_t | double |
| np.complex_ | cnp.complex128_t | double complex |
| np.complex64 | cnp.complex64_t | float complex |
| np.complex128 | cnp.complex128_t | double complex |
On line 55 the scales are created in NumPy as scales = (2**np.arange(scale_low, scale_high, scale_dense)).astype(np.intc). The final cast, astype(np.intc), guarantees the correct conversion from the NumPy type to the C int type. This is often a source of errors: passing the wrong format from NumPy/Cython to C leads to segmentation faults.
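For reference, a quick standalone sanity check on the resulting dtype:

```python
import numpy as np

scale_low, scale_high, scale_dense = 5, 9, 0.25
scales = (2 ** np.arange(scale_low, scale_high, scale_dense)).astype(np.intc)
print(scales.dtype)  # int32 on most platforms -> matches C int
print(scales[:4])    # [32 38 45 53]
```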
Eventually, the Cython code also deals with the correct conversion of the input data to memoryviews, as well as the initialization of the RMS and coefficients arrays:
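A sketch of what this looks like in the pyx file (names follow the tutorial; the dfa signature is the hypothetical one declared above):

```cython
# dfa.pyx -- sketch of the memoryview set-up and the C call
import numpy as np

cpdef play(X, int scale_low, int scale_high, double scale_dense):
    # contiguous memoryviews over the input and the output buffers
    cdef float[::1] X_view = np.ascontiguousarray(X, dtype=np.float32)
    cdef int[::1] scales = (2 ** np.arange(scale_low, scale_high,
                                           scale_dense)).astype(np.intc)
    cdef double[::1] fluct_to_array = np.zeros(scales.shape[0], dtype=np.float64)
    cdef double[::1] coeff_to_array = np.zeros(2, dtype=np.float64)
    # pass pointers to the first elements: the data are contiguous,
    # so &view[0] is enough for C
    dfa(&X_view[0], X_view.shape[0], &scales[0], scales.shape[0],
        1, &fluct_to_array[0], &coeff_to_array[0])
    return np.asarray(scales), np.asarray(fluct_to_array), np.asarray(coeff_to_array)
```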
The main C function dfa is then called, and memoryviews are passed as &memoryview[0] in order to retrieve all the final values from the C code. As explained in the Cython documentation, since the size of the memoryviews is known and the underlying data are properly contiguous, it is sufficient to pass a pointer to the first memoryview element (&a[0]) to make them work in C.
C code
Code: https://github.com/Steboss/music_retrieval/blob/master/dfa/c_code/dfa.c
The C code implements the main DFA calculation. At first the code may look scary, but let’s deal with it step by step.
Fig. 8 shows the very first steps: importing basic libraries, defining constants and essential functions. The most important function is cumsum_averaged, which computes the averaged cumulative sum of an input signal float *X, given its length int n_elems and the signal average float avg. The final result is written through the float *Xcumsum pointer, so the exact values are available in the main dfa function.
After these essential imports, the main dfa function can be defined; think of it as three principal chunks. In the first bit, the fitting_order (namely the type of fitting we want to perform), the averaged cumulative sum array Xcumsum, the RMS array rms and the fitted-line coefficients vector tmpCoeff are defined:
The second step (fig. 10) is a for-cycle through all the scales – produced in Cython – defining:
- the final dimension shape1 of the signal subdivided by the current i-th scale curr_scale (e.g. the signal has 1000 elements and the first scale is 2⁵ = 32, so shape1 = floor_div(1000, 32) = 31; thus we will cycle through Xcumsum 31 times, subdividing it into 31 chunks of 32 samples each)
- an empty vector tmpX, whose dimension is curr_scale, to store the chunks of the Xcumsum signal (e.g. in each of the 31 cycles through Xcumsum, tmpX will store 32 samples)
- an empty vector tmpDiff, whose size is curr_scale, to compute the element-by-element difference between the tmpX signal chunk and its linear fitting (e.g. once tmpX holds its 32 samples and the fitting against them is done, tmpDiff will store the element-by-element differences: element 0 with fitting element 0, element 1 with fitting element 1, and so on)
- an empty vector tmpShape1, whose size is shape1, to store the current RMS from the tmpDiff vector, computed as tmpShape1[j] = sqrt(vector_avg(tmpDiff, curr_scale))
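In NumPy terms, the bookkeeping of this loop corresponds to something like the following sketch (not the C code itself):

```python
import numpy as np

X = np.random.randn(1000)            # toy input signal
Xcumsum = np.cumsum(X - X.mean())    # averaged cumulative sum from the first step

curr_scale = 32                      # e.g. the first scale, 2**5
shape1 = len(Xcumsum) // curr_scale  # floor_div(1000, 32) = 31 chunks
for j in range(shape1):
    tmpX = Xcumsum[j * curr_scale:(j + 1) * curr_scale]  # one chunk of curr_scale samples
    # ... fit tmpX with a line, store the squared differences in tmpDiff,
    # and keep sqrt(vector_avg(tmpDiff, curr_scale)) in tmpShape1[j] ...
```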
The final step is the heart of DFA, where all the quantities are computed:
- For j = 0 ... shape1-1, subdivide the Xcumsum signal and store the current samples in tmpX (e.g. with curr_scale = 32, in the first iteration Xcumsum is read from element 0 to 31 and these samples are saved to tmpX)
- Perform a linear fitting of tmpX with polyfit. polyfit takes as input the x-axis (scale_ax, which simply goes from 0 to curr_scale), the y-axis function tmpX with the first curr_scale samples, the length of tmpX, the fitting_order and a pointer tmpCoeff to store the fitted coefficients
- From the fitted coefficients, retrieve the fitting line
- Determine the goodness of fit, computing the RMS between the tmpX elements and the fitting_results ones. Finally, store these results in tmpShape1
- At the end of this second cycle, remember to free all the temporary vectors and accumulate the total rms from tmpShape1

Congrats! You have managed to deal with the most complicated steps of the DFA implementation. The very last lines of the DFA C code are pretty straightforward: they compute the log2 transform of the RMS, in order to return the Hurst exponent. As before, polyfit receives as input the x-axis (in this case the values from 1 to scales_len, log2-transformed), the y-axis function (in this case the log2 RMS), the length of the x and y axes, the fitting_order and the coeffs pointer to return the coefficients to the main function. Eventually, coeffs[1] contains the Hurst exponent, and the RMS log2 plot can be easily visualized in Python.
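In NumPy terms, this final C fit amounts to (a sketch, with illustrative values to keep it self-contained):

```python
import numpy as np

# scales and fluct_to_array as returned by the C code through dfa.play;
# the numbers below are illustrative only
scales = np.array([32, 38, 45, 53, 64])
fluct_to_array = np.array([1.1, 1.4, 1.8, 2.2, 2.9])

# slope of the log2-log2 fit = Hurst exponent
# (np.polyfit returns the highest-degree coefficient first)
H, intercept = np.polyfit(np.log2(scales), np.log2(fluct_to_array), 1)
print(f"Hurst exponent ~ {H:.2f}")
```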
Prepare setup.py and install everything
As done in the previous tutorial, the code is divided into two folders: c_code, where dfa.c and polyfit.c are stored, and the installer folder, where the pyx and Python files are contained. setup.py is the installation file, which is saved in installer.
The installer file follows the code here: https://github.com/Steboss/music_retrieval/blob/master/dfa/installer/setup.py. I made sure this setup runs on a Mac by adding the right architecture for the C compiler, os.environ['ARCHFLAGS'] = '-arch x86_64', and by including the right NumPy headers (e.g. arrayobject.h) with include_dirs=[numpy.get_include()] under Extension.
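Putting those two caveats together, the setup script looks roughly like this sketch (paths and the way the C sources are pulled in are assumptions; see the repository for the exact file):

```python
# installer/setup.py -- sketch
import os
import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

# Mac caveat: force the right architecture for the C compiler
os.environ['ARCHFLAGS'] = '-arch x86_64'

ext = Extension(
    "dfa",
    sources=["dfa.pyx"],  # dfa.c and polyfit.c are pulled in through the
                          # cdef extern declarations inside dfa.pyx
    include_dirs=[numpy.get_include(),  # NumPy headers (arrayobject.h)
                  "../c_code"],         # where the C sources live (path illustrative)
)

setup(ext_modules=cythonize(ext))
```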
Once the code is ready we can run setup.py
as:
python setup.py build_ext --inplace
build_ext is the command to build the current extension, and --inplace installs the package in the working directory. Copy the output file into the dfa folder and you can start importing your DFA Python package.
DFA in practice
To conclude, a final example of DFA applied to financial data. I will show a further example applied to music in one of my next posts, so stay tuned!
The Etsy, Shopify, UPS case
One of the holy grails of statistics/machine learning/deep learning/super-intense learning is predicting the stock market. Unfortunately, such algorithms are rarely that predictive, and many of them suffer from data leakage, which makes predictions look perfect, as often happens here on Medium. What I decided to do is apply DFA to check whether there is some long-term correlation, or a trend, in all the financial tickers listed by Yahoo (about 856 tickers). Not surprisingly, the majority of stock series appeared to have a Brownian behaviour, namely no correlations in the signal. A surprising case emerged from Etsy, Shopify and UPS, with an almost perfect 1/f noise, namely with a Hurst exponent (here denoted with α) close to 1, which identifies a self-similar behaviour in the time series:



I hope you liked this new tutorial on Cython 🙂 Feel free to send me an email with questions or comments at: [email protected]
Bibliography
1. Detrended Fluctuation Analysis of Music Signals: Danceability Estimation and Further Semantic Characterization, S. Streich, P. Herrera, 2005
2. Revealing Competitive Behaviours in Music by Means of the Multifractal Detrended Fluctuation Analysis: Application to Bach's Sinfonias, L. Telesca, M. Lovallo, 2011
3. Detrended Fluctuation Analysis of Multivariate Time Series, H. Xiong, P. Shang, 2016
4. Multifractal Detrended Fluctuation Analysis: Practical Applications to Financial Time Series, J. R. Thompson, J. R. Wilson, 2014