The world’s leading publication for data science, AI, and ML professionals.

What If Standard Is Not the Best

The easy solution is not always the best

Photo by Sharon McCutcheon on Unsplash
Photo by Sharon McCutcheon on Unsplash

During researches, at some point, I needed to evaluate a particular measure (diversity index) on big rectangular matrices. It isn’t relevant for this story, but I am talking about single-cell RNA-Sequencing data.

In practise this is useful to estimate the number of diverse components found in a population (e.g. the number of different words in a book) and may explain [3] the emergence of particular power-law distributions in many complex systems [1,2,4].

Pandas

The problem of measuring this metric given data stored in a file can be solved in a single line of Python using the well-known pandas library.

pd.read_csv().apply(lambda x: len(np.unique(x)))

pandas

This is good for many application and it is quite straight forward not to require any effort by the data scientist to implement it, but..

I needed to measure this quantity on big tables (single-cell data has hundred thousands samples) and this simple task was slow with pandas. The pandas approach required much time to load data into the memory and then evaluate the function on every row.

A faster approach would be to process the rows as soon as they are loaded into the memory and then discard them (to save memory as a bonus), so I implemented it from scartch.

A C++ Solution

My solution to this particular problem used the Standard Template Library and Boost.

The idea

The idea is, as mentioned above, to parse the lines as soon as they are available. I defined a thread which read the file and stored it in a buffer and other threads which parsed the read lines available. Since my metric is estimated on columns, I needed multiple threads parsing different lines at the same time not to overlap, this was solved with an array of mutex (one per column).

Binding

Using Boost.Python I was able to load my newly created module into a python module, so it would be as easy as Pandas to use it.

Boost C++ Libraries

Here is an example of binding a function

#include <boost/python.hpp> 
BOOST_PYTHON_FUNCTION_OVERLOADS(loadFile_overloads, loadFile<long>, 1, 3) 
BOOST_PYTHON_MODULE(pydiversity){    
    namespace bp = boost::python;
    using namespace bp;
     def("diversity", loadFile<long>, loadFile_overloads(bp::args("file", "nThreads", "verbose", "out_filename")));
}

it can be easily imported in python using

import pydiversity
pydiversity.diversity("mainTable.csv")

Benchmark

I measured the time to parse a table with ten thousands of columns and 20000 rows (you can guess why) and it seems that my solution is 60% faster the the simple pandas apply. Moreover, my solution can exploit the full power of the machine which is running it and on a multi-core machine the speedup is of the 90% with respect to the Python implementation.

Times

  • Python using pandas: (8.05 ± 0,11)s
  • Python binding single Thread: (3.3 ± 0,4)s
  • Python binding with multi-thread: (0,71 ± 0,14)s
  • pure C++ single Thread: (3.2 ± 0,3)s
  • pure C++ with multi-thread: (0,66 ± 0,11)s

Speed up

Since my solution can be multithread I measured the gain when running it with more and more threads. I measured the speedup (1-time / (slowest time)) on a 6-core machine and there is an apprectiable gain.

Speedup on a 6-core machine. Image by author.
Speedup on a 6-core machine. Image by author.

Conclusions

Apart from this particular problem, I hope to have inspired someone to always investigate further and implement more efficient solutions when possible. Many times I heard "because this is the way Python does this", but as data scientist I think we should improve ourselves and always implement new solutions and we should not accept the most easy solution every time.

Code

GitHub – fvalle1/fastDiverse: A fast implementation of diversity estimation on big tables

References

  1. Lazzardi S, Valle F, Mazzolini A, et al Emergent Statistical Laws in Single-Cell Transcriptomic Data. (2021) bioRxiv 2021.06.16.448706.
  2. Mazzolini A, Gherardi M, Caselle M, et al Statistics of Shared Components in Complex Component Systems. (2018) Physical Review X
  3. Mazzarisi O, de Azevedo-Lopes A, Arenzon JJ, Corberi F Maximal Diversity and Zipf’s Law. (2021) arXiv:210309143
  4. Altmann EG, Gerlach M Statistical Laws in Linguistics. (2016) Springer International Publishing.

Related Articles