
Motivation
A frequent type of task a Data Scientist is confronted with involves processing information as it arrives over time. Typical examples are the monitoring of trades executed on a trading desk or the post-processing of logs generated by a complex system.
If you catch yourself waiting three or five minutes many times a day for the statistics to run on your complete set of data, it might have occurred to you that, often, a considerable amount of time is spent on repetitive tasks, say, loading the data, be it from a database, a REST-compliant resource or a file system. If you have been doing it the naive way, you may even find yourself in a situation where the trade-off between updating the information and the time it takes to wait for the update makes you hesitate to refresh the state of your system.
One way to tackle the issue more efficiently is to notice that, in a typical monitoring task, the information to be processed is the union of the information processed previously and the newly arrived information. In short, time runs forward and, in many practical cases, the information in the past is immutable (or reasonably close to immutable for one's specific intent). The higher the frequency of monitoring, the smaller the amount of new information (assuming a Poisson-ish arrival process). Hence, this large class of tasks is best processed incrementally.
Concretely, loading logs from midnight to the present instant is equivalent to concatenating the logs that have arrived since the last retrieval to the stored result of that last retrieval. In the case where the operations dealing with the immutable part dominate the actual final processing (as is often the case when retrieving data from a DB compared with, say, performing an aggregation on these data), this shift in perspective leads to a radical improvement over the previous situation: the more one refreshes, the less one waits.
The optimisation technique that consists in storing the result of lengthy operations instead of executing them again is known as lazy evaluation, caching or memoization.
The latter term is a mid-sixties Americanism that stems from the Latin word memorandum (gerundive form of the verb memorari: "to be remembered"). To publish a memorandum, i.e. something to be remembered, came to be designated colloquially in American corporate jargon as "to memo-ize", and the term has taken on a life of its own in computer science to indicate the act of making a function remember its own results. That may help the unfamiliar reader understand why memoization is not a spelling mistake.
A rough description of the lazy evaluation strategy is that a function that receives a given set of parameters will calculate the result once and somehow store it. Presented with the same parameters at a later time, it will retrieve the stored result and return it. Of course, almost every single word in this description hides many cases, constraints and design choices. "Same", for example, needs to be clearly defined. What "stored" means and how it is done in practice has given rise to many different algorithms and implementation strategies. In Python alone, there are scores of packages dedicated in part or entirely to the various aspects of this technique.
Memoization of time series: more than meets the eye
The naive approach to the lazy evaluation of time series is as follows. Given the start and end of a time series, lazily evaluate the longest series already requested whose start and end timestamps are contained in the interval [start, end], and actually compute or retrieve only the difference between these timestamps. So, one way or another, save the start and end timestamps at each evaluation and compare them to the requested start and end timestamps.
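A minimal sketch of that naive strategy, handling only the simplest cases (load_logs stands in for the expensive retrieval and is entirely hypothetical):

import datetime as dt

def load_logs(start, end):
    # hypothetical expensive retrieval, e.g. a database query;
    # here we fabricate one "log" per minute for the example's sake
    step = dt.timedelta(minutes=1)
    logs, t = [], start
    while t <= end:
        logs.append(t)
        t += step
    return logs

class NaiveIncrementalLoader:
    # caches a single contiguous window [start, end] of results
    def __init__(self):
        self.start = self.end = None
        self.logs = []

    def __call__(self, start, end):
        if self.start is None or end < self.start or start > self.end:
            # no usable overlap: full (re)load
            self.start, self.end, self.logs = start, end, load_logs(start, end)
        elif self.start <= start and end <= self.end:
            pass  # fully covered: just filter below
        elif self.start <= start:
            # contiguous forward extension: fetch only the new tail
            self.logs += load_logs(self.end + dt.timedelta(minutes=1), end)
            self.end = end
        else:
            # start < self.start: yet another case... the date
            # comparisons multiply quickly, as discussed below
            raise NotImplementedError
        return [t for t in self.logs if start <= t <= end]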
This simple task can be handled with a direct implementation. Now, if one tries to tackle the case of non-contiguous requests (the case where the time series from two hours ago to one hour ago and from half an hour ago to 10 minutes ago have already been processed), the problem surprisingly turns out to be intractable for the average mortal: the number of if cases comparing dates explodes. One is confronted with one of those typical problems that are more complex than they initially appear.
Interval arithmetic to the rescue
As often in these kinds of cases, an answer to the problem requires a step up in abstraction. Here, fortunately, the step up is a modest one and the concepts it requires are very well covered in the literature. More importantly, packages to carry out primitive operations on these concepts are available, as we shall see later in the article.
In many practical cases, a time series can be identified by its start and end timestamps. Taking the case of system logs, the time series of all logs generated is unambiguously given by all logs between two stamps. Rather than looking at two timestamps, as suggested earlier, we want to think of the time series as defined by a single interval. This seemingly mundane shift of perspective turns out to be the key to solving the problem elegantly.
Initially, we will focus on the case – a rather common one – where one wants to retrieve all the logs between two timestamps from a file system, or all the incoming records in a database, and, say, display an abridged version of each of them.
Suppose that some calls have been made previously to retrieve similar data for different timestamps. One would have taken care of storing the results of these different calls, each of them defining an atomic time interval. An atomic time interval is the spontaneous image one summons when thinking of an interval: a contiguous portion of the space/time being considered. This natural definition of an interval can be generalised to a union of such intervals. To distinguish between these two types, the former one is specified as atomic.
Given that the new call defines a new atomic interval, let us consider two simple cases:
- the new atomic interval is disjunct from the previously recorded intervals. That means one has never retrieved these data, and one consequently needs to make a call to the underlying function. Once this is done, one duly stores the result and records the time interval alongside the previous ones. The mechanism is illustrated in the "disjunct case" figures below.
- the new atomic interval is contained in one of the previously recorded intervals. In this case, a superset of the desired result is already stored. One retrieves the stored results and applies a filter afterwards. This naturally assumes that the filtering operation is faster than querying the data, a constraint that will be relaxed further on. The mechanism is illustrated in the three "full overlap" figures below.
[Figures: the "disjunct case" and the three "full overlap" cases]
So far, easy. The general case is hardly more difficult when armed with the two previous ones. In all generality, a new interval may either:
- be disjunct from all previous ones (a case already covered)
- or overlap one or several intervals.
To deal with the second case, one need only keep in mind that if one interval overlaps another one:
- their difference is disjunct from the latter
- their intersection is contained in the latter
- the union of the difference and intersection is the entire interval.
This is summarised in the pictures below:
[Figures: decomposition of an overlap into difference and intersection]
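These three facts can be checked directly with the portion package (introduced in the next section); a quick sanity check with integer bounds:

import portion

new = portion.closed(-3, -1)   # newly requested interval
old = portion.closed(-2, 0)    # previously recorded interval

difference = new - old         # part never retrieved: [-3,-2)
intersection = new & old       # part already covered: [-2,-1]

assert (difference & old).empty           # the difference is disjunct from old
assert intersection in old                # the intersection is contained in old
assert difference | intersection == new   # together they rebuild the request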
Now, if the newly requested interval overlaps several already-stored intervals, one need only apply the previous approaches to each of the stored intervals. The next three figures illustrate this general case:
[Figures: a new interval overlapping several stored intervals]
Well, enough abstraction! Let us see some code in action. There are several packages for dealing with intervals and interval arithmetic. A carefully written one is portion by Alexandre Decan. The package covers all the concepts needed for interval arithmetic and is the one we build upon.
It contains, in particular, a convenient IntervalDict, which is exactly what one would expect it to be and fits the purpose of an interval recorder perfectly, with the idea of function calls in mind. There is a small subtlety here, though, that needs to be brushed away at once. At first sight, one may wonder why not use a generalised interval to log the calls. Well, the interval is a clever object with some properties that are particularly useful when dealing with interval arithmetic but that do not apply in the context at hand. Two adjacent intervals would be reduced to a single atomic interval, for instance, which in the context of lazy evaluation of functions would amount to forgetting the two function calls and registering one that never took place.
So without further ado, here is a first version of the RecordIntervals object:
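(What follows is a minimal reconstruction, inferred from the behaviour in the examples below; the actual package implementation may differ. print_calls is a possible implementation of the small display helper used in those examples.)

import portion

def print_calls(calls):
    # display each call as a (lower, upper) pair
    print([(c.lower, c.upper) for c in calls])

class RecordIntervals:
    # Record the atomic intervals already passed to a function.
    # Calling the recorder with a new atomic interval returns the list
    # of calls to be made: previously recorded intervals are returned
    # whole (so the memoized function is hit with identical arguments),
    # and the uncovered remainder is returned as new atomic intervals.
    def __init__(self):
        self.intervals = []

    def __call__(self, interval):
        # stored intervals overlapping the request are reused in full
        overlapping = [i for i in self.intervals if i.overlaps(interval)]
        # subtract them from the request to isolate what is truly new
        uncovered = interval
        for i in overlapping:
            uncovered -= i
        new = [i for i in uncovered if not i.empty]
        self.intervals.extend(new)
        return sorted(overlapping + new, key=lambda i: i.lower)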
An instance of the RecordIntervals class is a callable object which, upon being passed an atomic interval, returns a list of calls to be made and updates the internal list of calls already made.
Let's see it in action on a trivial example. First, let's initialise an empty recorder and make a first call with the interval [-2, 0] (we'll use integer intervals in these initial, simpler examples: one can add a time unit (minutes, hours, days) to set them back in a time series context). Note that closed is the portion function that creates an atomic interval, both of whose extremities are included.
itvals = RecordIntervals()
calls = itvals(portion.closed(-2, 0))
print_calls(calls)
The output "[(-2, 0)]" tells us that a call to the function with the argument (-2, 0) needs to happen. So if one has not retrieved any data so far and needs data from two hours ago to now, the DB (as an example) needs to be queried for two hours' worth of data.
After this first retrieval, one may immediately query for the last hour's worth of data.
calls = itvals(portion.closed(-1, 0))
print_calls(calls)
The output "[(-2, 0)]" reflects the fact that a superset of the data has already been retrieved and stored. So a second call to the function will just return the stored data (assuming the function is memoized).
The next case is slightly more interesting:
calls = itvals(portion.closed(-3, -1))
print_calls(calls)
The output "[(-3, -2), (-2, 0)]" indicates that a call leading to a retrieval from the DB for the data stored between three hours and two hours ago needs to happen, while the second call is again a call to the memoized result. That's memoization of intervals in action: we query only what is not already stored.
Finally, the following code:
calls = itvals(portion.closed(-5, -4))
print_calls(calls)
calls = itvals(portion.closed(-6, 0))
print("should be broken in 5 intervals: -6->-5 | -5->-4 | -4->-3 | -3->-2 | -2->0")
print_calls(calls)
yields:
[(-5, -4)]
should be broken in 5 intervals: -6->-5 | -5->-4 | -4->-3 | -3->-2 | -2->0
[(-6, -5), (-5, -4), (-4, -3), (-3, -2), (-2, 0)]
All right. We now have a raw mechanism which, given an interval, generates the correct arguments to leverage any existing memoization. All we need to do now is combine this raw mechanism with an existing memoization implementation so as to lazily evaluate functions whose arguments include types that represent time (or other) intervals.
Before doing so, though, it might be worth giving a few reminders about memoization itself or, more specifically, about memoization in Python.
Memoization in Python: a short overview
The most commonly used implementations of memoization are probably cache and lru_cache (LRU stands for Least Recently Used) in the functools library. They are appropriate enough to illustrate many of the different aspects of memoization.
Schematic utilisation
The lazy evaluation technique can be summarised as follows:
- create a dictionary of calls, that is, map each combination of requested arguments to a key: f(1, 2, a=True, b='memo') -> key = keymap(1, 2, a=True, b='memo')
- look for a new call's arguments in that dictionary
- if the key exists, retrieve the associated result from an archive, else call the function.
The archive can be:
- in-process memory
- out-process memory
- files
- a combination
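A minimal sketch of this schema, with an in-process dictionary as the archive (the keymap below is a naive stand-in; real implementations construct keys more carefully):

def memoize(f):
    archive = {}                        # in-process archive
    def keymap(*args, **kwargs):
        # naive key construction: assumes all arguments are hashable
        return args + tuple(sorted(kwargs.items()))
    def wrapper(*args, **kwargs):
        key = keymap(*args, **kwargs)
        if key not in archive:          # first time: call and store
            archive[key] = f(*args, **kwargs)
        return archive[key]             # otherwise: retrieve
    return wrapper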
cache and lru_cache rely on a dictionary implementation, have limitations on the type of keys they can handle, and store their results in RAM.
Characteristics
Many strategies and implementation designs affect the behaviour and performance of the caching approach. Among them:
- management of memory space: for example, does one let the memory space grow without restriction over time? By default, lru_cache constrains the cache space and, once the limit is reached, starts discarding the Least Recently Used element to allow a new element to be stored.
- key generation mechanism: the default implementation through a dictionary assumes keys are hashable, for example.
Functions and dictionaries
cache is, according to the official Python documentation, a thin wrapper around a dictionary lookup. Dictionaries are a standard way to implement memoization. Anticipating the rest of this article, it is worth noting that a dictionary and a (deterministic) Python function have much in common. The notations differ – parentheses vs. square brackets – but both map an argument to a returned value.
As an aside, it is worth mentioning that the C++ Standard Library equivalent of a dictionary is called a map, and that map in many functional languages designates a higher-order function that applies a function to each element of a collection. Again, both concepts seem intrinsically linked.
Available packages
We can distinguish essentially two categories of packages for performing lazy evaluation in Python.
The first category covers Python clients or wrappers for non-Python implementations, the two most famous being:
- Memcached
- Redis
The other category contains dedicated Python implementations, among which:
- joblib
- ring – excellent documentation
- memoization – support for non-hashable arguments, Time To Live strategy…
- klepto – part of the pathos system.
The decorator concept and syntax
Caching is more often than not provided as a decorator, so that if the original function is f:
from functools import cache

@cache
def f(i):
    return i + 1

# the call below is now to a memoized version
# of the function f
f(3)
the @cache line adds lazy evaluation to the function f.
A decorator can be thought of – and is actually implemented – as a function object that takes a function and returns a function. This implies that a decorator can be a Python class that exposes a __call__ member function. We illustrate this feature by noting that the above syntax could be rewritten as:
def f(i):
    # ....
    ...

g = cache(f)
where g this time is the cache-enhanced function.
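To make the __call__ mechanism concrete, here is a minimal class-based memoizing decorator (a sketch of ours, not the functools implementation):

class Memoize:
    # a function object: takes a function, returns a memoized version
    def __init__(self):
        self.archive = {}

    def __call__(self, f):
        def wrapper(*args):
            if args not in self.archive:
                self.archive[args] = f(*args)
            return self.archive[args]
        return wrapper

@Memoize()
def f(i):
    return i + 1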
An interval-aware memoization implementation
Coming up with a convenient software component for memoizing functions that take interval parameters involves some amount of meta-programming. We chose to expose only the parts of the implementation that shed light on the actual use of the functions. The others are to be found in the code, which is made available as free software.
Overview
The main component, MemoizationWithIntervals, is provided as a function object and is commonly expected to be used as a decorator. Below is an illustrative use of the component:
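(A sketch consistent with the description that follows; the import path, parameter layout and exact signature are assumptions on our part.)

import pandas as pd
from cacheintervals import MemoizationWithIntervals  # import path assumed

# the first constructor argument lists the indices of positional
# parameters that are intervals; the second, the names of key-word ones
@MemoizationWithIntervals([0], [])
def retrieve_logs(interval):
    print(f'actual call on {interval}')   # visualise the real calls
    return [interval]

retrieve_logs(pd.Interval(-2, 0))   # triggers an actual call
retrieve_logs(pd.Interval(-1, 0))   # served entirely from the cache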
The MemoizationWithIntervals constructor
To properly handle a generic function, the interval parameters that are candidates for memoization need to be specified by the user. So the first two parameters of the MemoizationWithIntervals constructor are the indices of the positional parameters and the names of the key-word arguments that are intervals. In the example of section Overview, the only interval parameter is the interval one, and it is specified as such in the first constructor argument.
Below is an example with one positional interval parameter and one key-word interval parameter:
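For instance, continuing the sketch above (names remain illustrative):

@MemoizationWithIntervals(
    [0],                  # the first positional parameter is an interval
    ['interval_kw'],      # so is the key-word argument interval_kw
)
def f(interval_pos, factor, interval_kw=pd.Interval(-10, 0)):
    return [(interval_pos, factor, interval_kw)]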
The other arguments in the constructor will be detailed in the following sub-sections.
The return type issue: specifying an aggregation method
It is important to notice from the onset that any implementation of interval memoization as described previously presents a fundamental difference with a standard lazy-evaluation algorithm.
In standard memoization, a call to the cached function returns results indistinguishable from a call to the original function, except for speed. In particular, a call to the cached function corresponds to at most one call to the underlying function.
The algorithm described in the earlier sections involves potentially several calls to the underlying function. As a result, the return type cannot a priori match the initial function's. In the description we have given so far, we have suggested that the initial function would return a list of results. The interval-cached function would thus return a list of lists. We already hinted that turning such a list of lists into a list would involve concatenation and possibly filtering.
It need not be so, though. To properly account for the general case, the user needs the flexibility to specify an aggregation operation. This aggregation specification is a parameter of the memoization class constructor. The aggregation function takes the list of results from the different calls and returns a result whose type is compatible with the initial function's.
So this aggregation function could be as simple as aggregation=list or, for Pandas data frames, aggregation=pandas.concat. In the case where the user wishes to aggregate the values by summation, something along the lines of aggregation=lambda listr: functools.reduce(lambda x, y: x + y, listr) would apply.
The memoization algorithm
The sole purpose of the package described in this article is to apply a preprocessing step to a function taking interval parameters so that the lazy evaluation can be delegated to an existing implementation of the user's choice. The constructor of the MemoizationWithIntervals object thus takes a fully constructed memoization object that will perform the lazy evaluation.
As can be seen, we chose the klepto package as the default implementation. We found in that package features that were compelling and are using it exclusively. The user might find other implementations more suitable to their needs and would thus change the default.
So, typically, to use the functools cache algorithm:
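something along these lines (the memoization parameter name is an assumption on our part; functools.lru_cache(maxsize=None) is a fully constructed, unbounded memoizer):

import functools

@MemoizationWithIntervals(
    [0], [],
    memoization=functools.lru_cache(maxsize=None),  # replaces the klepto default
)
def retrieve_logs(interval):
    return [interval]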
Handling other interval types
Alexandre Decan's Portion package is a great package for interval arithmetic. For the interval object itself, though, it is probably not the most common implementation. Arguably, Pandas' Interval can claim that title. But one may have one's own implementation. Using CacheIntervals with a particular interval type requires creating an ad-hoc type of interval recorder and a bit of wrapping to allow a two-way translation between Portion's native interval type and the user's interval type.
The package CacheIntervals provides an example of such a wrapping for the Pandas interval type. The purpose of implementing that specific interval was twofold. On the one hand, it is a template for users who want to implement that override. On the other hand, the Pandas Interval type, along with Alexandre Decan's native type, should cover most needs. By default, the type of interval recorder is the one that accommodates Pandas intervals. To change it, specify the new interval recorder type as an argument of the constructor, e.g.:
Overview of the mechanism
This section is more involved and intended primarily for package developers. In particular, the hope is that interval memoization could be embedded directly into standard memoization packages and made readily available.
In this section, we dive deeper into the implementation details. The more casual reader can safely skip it entirely. No advanced knowledge of Python is assumed in these explanations, though, so feel free to carry on if curiosity itches.
The MemoizationWithIntervals class has only one member function besides the constructor: the __call__ function. Recall that, as explained in section The decorator concept and syntax, the MemoizationWithIntervals object is a function object that takes a function and returns a memoized version of this function. Let's see that in the code:
Details are voluntarily omitted so that the structure stands out. From this code skeleton, it appears clearly that all the __call__ function does is take a function f and return a function wrapper, itself defined within the __call__ function.
As can be guessed from this excerpt, the wrapper function itself performs calls to the memoized version of f, f_cached.
The first two lines set up an object to properly resolve the call parameters, a functionality not straightforwardly provided by the Python language to our knowledge. We shall skip over that part of the code, though, as it is largely unrelated to the article.
Leveraging the memoization algorithm twice
An intuitive but incorrect implementation of the interval lazy evaluation would be to associate each interval parameter with its own interval recorder object. To see why this is usually not what is intended, let's look at a simple example.
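Suppose the decorated function is along these lines (a hypothetical reconstruction; the aggregation into a tuple matches the outputs shown below):

import pandas as pd

@MemoizationWithIntervals(
    [], ['interval'],
    aggregation=tuple,     # combine the per-interval results into a tuple
)
def function_with_interval_param(valint, interval=pd.Interval(-2, 0)):
    return (valint, interval)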
Assume the calls are as follows:
print('==== First pass ===')
print(f'Final result:\n{function_with_interval_param(2, interval=pd.Interval(-2, 0))}')
print('==== Second pass ===')
print(f'Final result:\n{function_with_interval_param(2, interval=pd.Interval(-3, 0))}')
print('==== Third pass ===')
print(f'Final result:\n{function_with_interval_param(3, interval=pd.Interval(-3, 0))}')
The output will be:
==== First pass ===
((2, [-2,0]))
==== Second pass ===
((2, [-3,-2]), (2, [-2,0]))
==== Third pass ===
((3, [-3,-2]), (3, [-2,0]))
The careful reader will have noticed that the result of the third pass leads to an inefficiency: two calls are generated when only one is required. The reason is that there should be one recorder per value of the first parameter.
A correct algorithm needs to somehow keep one recorder per combination of non-interval parameters. One could manually keep track of these other parameters through, say, a dictionary, but one realises immediately that:
- this is a difficult endeavour, since key mapping is not as straightforward as one may initially think
- this is exactly the type of problem the memoization algorithm solves.
A more ingenious approach is thus to leverage the user-provided caching algorithm to do so. This is all the more suitable as the user has potentially specified many constraints in their setting: type of memory to use, persistence, memory management and others. To do so, let's recall the remark of section Functions and dictionaries: there is a conceptual equivalence between a dictionary and a function.
So the idea is to leverage the cached function implementation twice:
- it will be called for the standard evaluation of the function
- but, besides this standard usage, it will be used to store the appropriate interval recorders, as we would have done with a dictionary.
The question now is: how do we distinguish between a standard call and a request for recorders? One possible approach is to pass arguments that will never be passed by the user. If we detect such an argument anywhere, we know this is not a call to the underlying function but a request for the recorders.
This is the purpose of the QueryRecorder class below. The self.query_recorder variable is initialised to self.query_recorder = QueryRecorder() in the constructor of the MemoizationWithIntervals object and is used at each call to resolve the recorders.
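In a sketch, the class can be as simple as a sentinel (the real implementation may carry more information):

class QueryRecorder:
    """Sentinel argument: its presence in a call signals a request for
    the interval recorders rather than an evaluation of the function."""
    pass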
So in the f_cached function below, there is a preliminary mechanism to determine what type of call we are dealing with. If this is a request for recorders:
- the first time around, the actual code is executed and the recorders are created (with the type specified by the user, e.g. RecordIntervals, RecordIntervalsPandas or some user specialisation). Any recorder-constructor-specific parameters can be passed in by specifying them as kwargs in the constructor of the MemoizationWithIntervals object. They are then stored in the self.kwargsrecorder member variable.
- henceforth, the recorders are cached and will be reused.
Clearly, there is a small overhead involved. Whether this is acceptable or not depends on the user's context. In practice, this overhead should most of the time be negligible compared with the memoization benefits.
Below is the corresponding excerpt of the wrapper function code. It can be seen that, when the wrapped function is called, the first task of the wrapper is to replace each interval argument by a request to its corresponding recorder. Once the recorders are resolved, the wrapper function proceeds with what is conceptually a normal call to the cached function and, practically, a (possible) series of calls to the said function (not shown here).
Resolving the calls
From there on, the algorithm is relatively easy. From each interval recorder, one retrieves the calls to be made. The final series of calls is just the Cartesian product of the possible argument values, where non-interval arguments are singletons. The split between positional and key-word arguments adds a bit of complexity to the code, but that is essentially all there is to this last section.
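The product itself is a one-liner with itertools; a toy illustration (the values are made up):

import itertools
import portion

# one non-interval argument (a singleton) and one interval argument
# whose recorder returned two sub-calls
candidates = [[2], [portion.closed(-3, -2), portion.closed(-2, 0)]]
calls = list(itertools.product(*candidates))
# -> one call to the cached function per combination:
#    (2, [-3,-2]) and (2, [-2,0])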
Some final details
Tolerance
Suppose the database being queried is a proxy that gets updated every 15 minutes. If a call follows a previous one after a short interval of time, it is unlikely that the data will have changed noticeably. On the other hand, even retrieving a handful of data, or no new data at all, still requires an exchange over the network and some operations on the DB itself. This situation may seem far-fetched when one thinks of a single user interactively making the requests, but in a multi-user environment or a more automated set-up, this may actually be a very common scenario.
In order to prevent unnecessary transactions following a rapid succession of requests or, more generally, when the newly queried interval does not significantly differ from the original one, one may decide that, below a tolerance threshold, no new call is issued. This approach is common in caching algorithms and is often known as rounding.
As it turns out, implementing rounding in our case is particularly simple. All it requires is a small modification of the RecordIntervals class. The constructor now accepts a rounding argument, and the disjunct member function tests whether the boundary of the newly requested interval is within the threshold, in which case the new interval is not added.
This surgical change is all that is required. Here is how this is used:
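For instance (a sketch; the exact signature is an assumption based on the description above):

itvals = RecordIntervals(rounding=0.25)   # tolerance of a quarter of a unit
calls = itvals(portion.closed(-2, 0))     # -> [(-2, 0)]: first call, recorded
calls = itvals(portion.closed(-2.2, 0))   # -> [(-2, 0)]: the 0.2 extension is
                                          # below the tolerance, no new call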
More on aggregation
It is now time to reveal a detail we glossed over in our initial RecordIntervals that careful readers might have objected to. When dealing with a standard aggregation operation, one that for example takes the cumulative value of some columns or their average, our current strategy will not yield correct results. The reason is that aggregating over a superset of the data is not what is expected.
We thus need another algorithm. The idea is that a request for a new interval can no longer lead to a call with a wider interval. The three figures below demonstrate the behaviour of the new strategy.
[Figures: the aggregation-safe strategy on three overlap cases]
The figures may give the false impression that caching will not happen with this strategy. This is not the case, however: caching will happen for all already stored intervals that are contained in the newly requested interval.
The corresponding algorithm follows:
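A sketch, as a variant of our earlier portion-based recorder (a hypothetical reconstruction: stored intervals are reused only when fully contained in the request, so no call ever covers more than was asked):

class RecordIntervalsNoSuperset(RecordIntervals):
    # variant for aggregating functions: never return a call wider
    # than the requested interval
    def __call__(self, interval):
        # reuse only the stored intervals fully contained in the request
        contained = [i for i in self.intervals if i in interval]
        uncovered = interval
        for i in contained:
            uncovered -= i
        new = [i for i in uncovered if not i.empty]
        self.intervals.extend(new)
        return sorted(contained + new, key=lambda i: i.lower)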
Conclusion
The CacheIntervals package turns out to remove a lot of boilerplate code that is often prone to bugs. Given the ubiquitous nature of its applications, it avoids rewriting that code each time. This leads not only to improvements in development time but, most notably in our experience, favours a systematic use of lazy evaluation, which ultimately benefits the application's end users.
As a parting note, a small anecdote. This algorithm was written from scratch a few days ago. While I was designing a preliminary version of the algorithm, the Google search engine opened an Easter-egg door. It led to a spartan command-line interface where a few Unix-like commands gave access to a mission statement: a piece of code had to be produced to help Minions on another planet sort out some task-allocation problem. Whether this Easter egg was a random event or the result of my previous search requests on the Google engine was answered by the solution to the problem: it was a three-liner in Python involving memoization! Upon successfully submitting this first challenge within the allocated time, I was congratulated with a Matrix-the-movie-style animated pixelised image, which I duly kept for several days on my screen. At that stage, I was prompted to enter my credentials to take further challenges, which my corporate firewall blocked (probably the best outcome, as I can still delude myself that I was on track to be recruited by the prestigious software corporation). The innovative recruitment model did much to impress me nonetheless. So there you are: working on memoization can lead to unexpected findings…
References
The CacheIntervals repository is on GitHub at the following address:
Alexandre Decan’s Portion can be found here:
klepto is here: