Advanced functional programming for data science: building code architectures with function operators

Vectorizing the read_csv function in Pandas using a function operator

Paul Hiemstra
Towards Data Science
6 min read · May 7, 2020

You can also read the article on Github including full reproducible code

Introduction

Of the several programming paradigms, functional programming (FP) fits data science very nicely. The core concept in functional programming is the function, hence the name. Each function takes data as input and returns a modified version of that data. For example, the mean function takes a series of numbers and returns the mean of those numbers. Crucially, a function has no side effects: it does not change any state outside the function, nor is its outcome influenced by that outside state. This makes FP functions very predictable: given a certain input, the output is always the same.
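To make the difference concrete, here is a small sketch contrasting a function without side effects with one that depends on outside state. The names centered, add_to_total and running_total are made up for this illustration.

```python
from statistics import mean


def centered(values):
    """Pure: the result depends only on the input, nothing outside is changed."""
    average = mean(values)
    return [value - average for value in values]


running_total = 0  # state that lives outside the function


def add_to_total(value):
    """Impure: reads and modifies state outside the function."""
    global running_total
    running_total += value
    return running_total


data = [1, 2, 3]
print(centered(data))   # same input, same output, and data is left untouched
print(centered(data))
print(add_to_total(5))  # the result depends on how often it was called before
print(add_to_total(5))
```

Calling centered twice with the same list gives the same answer both times, while add_to_total returns something different on every call even though the input is identical.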

So, at first glance, our notebooks will have two components: data, and functions which operate on that data. For the majority of cases, these two will be sufficient to write your notebooks. However, when writing more complex code, say a library like `sklearn`, a number of other FP concepts will come in handy. One such concept that we will focus on in this article is a so-called function operator, which is a higher-order function. A function operator takes one or several functions as input and returns a new function as its result. A good example is a progress bar function operator which can add a progress bar to any data processing function. These function operators expand our options for creating flexible and reusable FP code.

The focus of this article is on building a function operator, *vectorize*, which can vectorize any existing non-vectorized function. You will learn about the following topics:

  • how to build a function operator in Python using a closure
  • how to pass along any input arguments from one function to another using `*args` and `**kwargs`
  • how to build a vectorizer function operator
  • how function operators can be used to create a clear hierarchy of functions and function operators, comparable to a class hierarchy in object-oriented programming
  • how using function operators allows you to write clean code in your notebooks

In this article, we first build a simple function operator and expand that into the vectorisation function operator. I then wrap up with some thoughts on how to use function operators to build a code architecture.

Photo by Roman Mager on Unsplash

Building our first function operator

To ease us into function operators, we will first build a very simple one. This function operator adds a counter to the input function which tracks how often the function is called:
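A minimal sketch of such a counter, using the names the explanation below refers to (count_times_called, internal_function, number_of_times_called, do_nothing). The wording of the printed message and the choice to increment before printing are assumptions:

```python
def count_times_called(input_function):
    # Lives in the scope of count_times_called, not in internal_function
    number_of_times_called = 0

    def internal_function(*args, **kwargs):
        # nonlocal lets us rebind the variable from the enclosing scope
        nonlocal number_of_times_called
        number_of_times_called += 1
        print(f"Function has been called {number_of_times_called} time(s)")
        # Pass all arguments through unchanged, so the output is identical
        return input_function(*args, **kwargs)

    return internal_function


def do_nothing():
    """An operation function that, as the name says, does nothing."""
    return None


do_nothing_counted = count_times_called(do_nothing)
do_nothing_counted()
do_nothing_counted()
```

Each call to count_times_called creates a fresh counter, so two wrapped functions keep separate counts.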

The core trick here is to wrap input_function with internal_function. The return value of internal_function is simply the result of calling the input function, making the output of the new function equal to that of input_function. The change in behavior is in the few lines of code before that: the number of times the function has been called is printed to the screen, and the counter is incremented by one. Note that the number_of_times_called variable is defined outside the scope of internal_function, and that we use nonlocal to be able to modify that variable in spite of it being out of scope. The key to number_of_times_called being persistent across all the times the function is called is the fact that internal_function remembers the original scope it was created in, the scope of count_times_called. This functional programming technique is called a closure.

So, now we have two classes of functions: a function which performs an operation (do_nothing) and a function operator which alters its behavior (count_times_called). Given a set of operation functions and function operators, we can build quite a complex and flexible hierarchy of functions that we can mix and match to write our notebooks. Good examples of potential function operators are:

  • a progress bar. Takes a function, and the number of times that function needs to be called for the overall operation to finish. For example, you know that 25 files need to be read and you would like to see a progress bar that shows how many of those 25 have already been read. By the way, the tqdm package already implements such a function operator.
  • a slow down function operator. Although it feels counter-intuitive to slow your code down, such a function operator could be very useful to modify a function that calls an API which limits the number of times it can be called per minute.
  • a cache function operator. This function operator stores combinations of inputs and outputs, returning the cached output for a given input when that input already exists in the cache. This process is called memoisation.
  • a hyperparameter optimisation function operator using cross-validation. This function operator wraps a fitting function and searches for the optimal value of given parameters. This is essentially what GridSearchCV in sklearn already does.
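As an illustration, two of these function operators can be sketched with the same closure pattern. The names cache, slow_down, stored_results and seconds are chosen for this example, and the cache assumes that all arguments are hashable. For real work, Python's standard library already offers functools.lru_cache for memoisation.

```python
import time


def cache(input_function):
    stored_results = {}  # maps an input combination to its output

    def internal_function(*args, **kwargs):
        # Build a hashable key from positional and named arguments
        key = (args, tuple(sorted(kwargs.items())))
        if key not in stored_results:
            stored_results[key] = input_function(*args, **kwargs)
        return stored_results[key]

    return internal_function


def slow_down(input_function, seconds=0.1):
    def internal_function(*args, **kwargs):
        # Wait before every call, e.g. to respect an API rate limit
        time.sleep(seconds)
        return input_function(*args, **kwargs)

    return internal_function


def square(x):
    return x * x


# Function operators compose: the slow call only happens on a cache miss
cached_slow_square = cache(slow_down(square, seconds=0.05))
print(cached_slow_square(4))  # waits, then computes
print(cached_slow_square(4))  # served from the cache, no waiting
```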

Building a vectorisation function operator

Vectorisation means that you can pass a vector to an input argument of a function, and the function will perform the appropriate action. For example, if you pass a list of files to a function that reads csv files, it will read all the files. Sadly, pandas.read_csv does not support this kind of behavior. In this section, we will build a vectorisation function operator which will allow us to upgrade pandas.read_csv and allow you to pass a vector of input files.

First, we will build a base version of our vectorisation function operator, which will be heavily inspired by the R function Vectorize:
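A minimal sketch of what such a base version could look like, following the explanation below. The parameter name vectorize_over and the small power function used to try it out are assumptions made for this illustration.

```python
def vectorize(input_function, vectorize_over):
    """Return a version of input_function that accepts a list for one argument.

    vectorize_over is the name of the argument to iterate over.
    """

    def internal_function(**kwargs):
        # Grab the argument we iterate over and remove it from kwargs
        values = kwargs[vectorize_over]
        del kwargs[vectorize_over]
        # Call input_function once per value, passing the other arguments along
        return [input_function(**{vectorize_over: value}, **kwargs)
                for value in values]

    return internal_function


def power(base, exponent):
    return base ** exponent


vectorized_power = vectorize(power, "base")
print(vectorized_power(base=[1, 2, 3], exponent=2))  # [1, 4, 9]
```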

The key code here is the list comprehension at the end of internal_function, which actually vectorizes input_function by iterating over the list given as input. Note that we need to know which of the input variables should be vectorised in order to write the list comprehension accurately. So, we first grab the appropriate input argument from the input dictionary (kwargs), and then delete it from that dictionary using del. This allows us to accurately construct the call to input_function using the list comprehension.

Applying it to read_csv:
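A sketch of how this could look. The vectorize sketch is repeated in compact form so the example runs on its own, and the six small csv files are generated in a temporary directory purely as stand-in data. The file names and the name read_csv_vec are made up for this example; filepath_or_buffer is the name of the first argument of `pandas.read_csv`.

```python
import os
import tempfile

import pandas as pd


def vectorize(input_function, vectorize_over):
    def internal_function(**kwargs):
        values = kwargs[vectorize_over]
        del kwargs[vectorize_over]
        return [input_function(**{vectorize_over: value}, **kwargs)
                for value in values]

    return internal_function


# Stand-in data: six tiny csv files (hypothetical names and contents)
tmp_dir = tempfile.mkdtemp()
csv_files = []
for i in range(6):
    path = os.path.join(tmp_dir, f"file_{i}.csv")
    pd.DataFrame({"file_id": [i, i], "value": [1, 2]}).to_csv(path, index=False)
    csv_files.append(path)

read_csv_vec = vectorize(pd.read_csv, "filepath_or_buffer")

# Other named arguments, such as sep, are passed on to every read_csv call
list_of_dfs = read_csv_vec(filepath_or_buffer=csv_files, sep=",")
print(len(list_of_dfs))
```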

This code gives us a version of `pandas.read_csv` to which you can pass a list of files, and the function returns a list of Pandas DataFrames with the contents of the six csv files. Note the following limitations of the function operator compared to its R counterpart:

  • the output function only supports named arguments. This is needed because I use the argument names to select the appropriate input argument to vectorise over.
  • the function cannot vectorise over multiple arguments at the same time
  • no aggregation is done on the output. An example of such aggregation would be to automatically concatenate the list of pandas DataFrames into one big DataFrame

The code below adds this last feature:
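A minimal sketch of how this could look. The helper name try_to_simplify comes from the remark further down. Its body, the aggregation_function parameter and the in-memory csv sources are assumptions made for this illustration.

```python
import io

import pandas as pd


def try_to_simplify(list_of_results):
    """Combine the results when we know how to, otherwise return them as-is."""
    if list_of_results and all(isinstance(result, pd.DataFrame)
                               for result in list_of_results):
        return pd.concat(list_of_results, ignore_index=True)
    return list_of_results


def vectorize(input_function, vectorize_over,
              aggregation_function=try_to_simplify):
    def internal_function(**kwargs):
        values = kwargs[vectorize_over]
        del kwargs[vectorize_over]
        results = [input_function(**{vectorize_over: value}, **kwargs)
                   for value in values]
        # New compared to the base version: aggregate the list of results
        return aggregation_function(results)

    return internal_function


# read_csv also accepts file-like objects, used here instead of real files
csv_sources = [io.StringIO("a,b\n1,2\n"), io.StringIO("a,b\n3,4\n")]

read_csv_vec = vectorize(pd.read_csv, "filepath_or_buffer")
combined = read_csv_vec(filepath_or_buffer=csv_sources)
print(combined.shape)  # (2, 2): one row per source, two columns
```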

Now our function operator adds both vectorisation and aggregation of the final result into one big DataFrame. This, in my opinion, makes pandas.read_csv a lot more expressive: with less code you convey the same intent. This expressiveness makes the code much more readable, and to-the-point in your notebooks.

Note that the code above is a bit more verbose than strictly needed: I could simply use `pd.concat` outright without the `try_to_simplify` function. However, in its current form the code can support simplifying in many ways depending on what the function returns.

Building a code architecture using function operators

Using data, functions that operate on that data, and function operators that operate on those functions, we can build quite elaborate code architectures. For example, we have the read_csv function that reads data, and the vectorize function operator that allows us to read multiple files. We could also apply a progress bar function operator to the vectorised read_csv function to also track the progress when reading a lot of csv files:
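A sketch of such a composition, under a few assumptions: progress_bar is a hand-written stand-in that prints a simple counter rather than a real progress bar, vectorize is a compact variant that always concatenates its results, and the csv sources are in-memory stand-ins for real files. In this sketch the progress operator wraps read_csv before vectorize is applied, so that every individual file read advances the counter.

```python
import io
import sys

import pandas as pd


def vectorize(input_function, vectorize_over):
    def internal_function(**kwargs):
        values = kwargs[vectorize_over]
        del kwargs[vectorize_over]
        results = [input_function(**{vectorize_over: value}, **kwargs)
                   for value in values]
        return pd.concat(results, ignore_index=True)

    return internal_function


def progress_bar(input_function, total):
    calls_done = 0  # remembered between calls through the closure

    def internal_function(*args, **kwargs):
        nonlocal calls_done
        result = input_function(*args, **kwargs)
        calls_done += 1
        print(f"[{calls_done}/{total}] done", file=sys.stderr)
        return result

    return internal_function


csv_sources = [io.StringIO(f"id,value\n{i},{i * 10}\n") for i in range(3)]

# One function plus two function operators give the composite function
read_data = vectorize(progress_bar(pd.read_csv, total=len(csv_sources)),
                      "filepath_or_buffer")
df = read_data(filepath_or_buffer=csv_sources)
print(df)
```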

Here, read_data is the composite function which combines all the properties of the function and both function operators. From a large set of functions and function operators we can compose complex functions. This allows us to create simple, easy to understand code components that can be combined to create much more complex code. This kind of function operator based code architecture can be very powerful in a data science context.
