
How to Use a Python Built-In Decorator to Improve Performance Significantly

How to implement a caching mechanism in Python, and when not to use it

Image created in Canva by the author

When we talk about improving Python execution performance, especially for data processing, there are many third-party libraries that can help. Most of them achieve their speed-ups by optimising data structures or memory utilisation.

For example, Dask leverages parallel computing and memory optimisation, Pandas relies on vectorisation of the dataset, and Modin optimises the utilisation of multi-core CPUs and memory.

In this article, I won’t introduce any libraries. In fact, there is a native Python decorator that can improve performance significantly. We don’t need to install anything because it’s built into Python. Of course, it isn’t suitable for every scenario, so in the last section I’ll also discuss when we should not use it.

1. A Use Case of the Cache Decorator

Image created in Canva by the author

Let’s start with a vanilla example that we are all familiar with: the Fibonacci sequence. Below is a typical implementation using recursion.

def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

Like most other programming languages, Python builds a new stack frame for each recursive call and computes the value at every level.

However, the "cache" decorator improves the performance significantly, and it is not difficult to use. We just need to import it from the functools module and add the decorator to the function.

from functools import cache

@cache
def fibonacci_cached(n):
    if n < 2:
        return n
    return fibonacci_cached(n-1) + fibonacci_cached(n-2)

Here are the running results and the performance comparison.

It shows that the performance of the cache-enabled version is roughly 120 times faster than the version without caching!

By the way, I passed the -r 1 -n 1 parameters to the %timeit magic command to make sure the function is executed only once. Otherwise, the cached Fibonacci function would appear misleadingly fast: if we ran it 10,000 times, all but the first run would load the result directly from the cache. The parameters ensure the test is executed exactly once.
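Outside a notebook, the same one-shot comparison can be reproduced with time.perf_counter instead of %timeit — a minimal sketch (the exact timings will vary by machine):

```python
import time
from functools import cache

def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

@cache
def fibonacci_cached(n):
    if n < 2:
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)

# Time a single call of each version, mirroring %timeit -r 1 -n 1.
start = time.perf_counter()
plain_result = fibonacci(30)
plain_elapsed = time.perf_counter() - start

start = time.perf_counter()
cached_result = fibonacci_cached(30)
cached_elapsed = time.perf_counter() - start

print(f"plain:  {plain_elapsed:.6f}s, result={plain_result}")
print(f"cached: {cached_elapsed:.6f}s, result={cached_result}")
```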

2. How Does the Cache Decorator Improve the Performance?

Image created in Canva by the author

Let’s have a look at the call stack of this recursive Fibonacci function. To keep the diagram manageable, we have to simplify the scenario; the diagram shows the stack of fibonacci(4).

Fibonacci recursive function without cache

When we call fibonacci(4), the recursive function calls itself with a smaller parameter at each lower level, until it reaches the base cases fibonacci(1) == 1 and fibonacci(0) == 0.

In the above diagram, every step needs to be calculated. For example, even though f(0), f(1) and f(2) appear multiple times, each occurrence is calculated separately.
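We can verify this duplicated work by counting invocations — a small sketch with a tracing counter (the counter is an illustration, not part of the original code). For fibonacci(4), nine calls are made in total, with f(2), f(1) and f(0) each evaluated more than once:

```python
from collections import Counter

calls = Counter()

def fib_traced(n):
    # Record every invocation so we can see the duplicated work.
    calls[n] += 1
    if n < 2:
        return n
    return fib_traced(n - 1) + fib_traced(n - 2)

fib_traced(4)
print(dict(calls))  # f(2) is computed twice, f(1) three times, f(0) twice
```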

Now, let’s have a look at the scenario with cache enabled.

Fibonacci recursive function with cache - All steps

This time, the steps coloured in green no longer need to be calculated. As soon as a value f(x) has been computed once, it is cached. When f(x) is needed again, the result is loaded directly from memory.

Therefore, in the above diagram, those cached steps do not even need to be calculated, so the real stack looks like the one below: some of the f(x) values are loaded directly from the cache.

Fibonacci recursive function with cache - Actual steps

If we go back to our example in the above code, fibonacci_cached(20) will look like the following.

Fibonacci recursive function with cache - f(20), partially demonstrated

From the diagram, we can easily see that only the steps on the left edge are actually calculated, and each f(x) is calculated only once.

This is why the cache-enabled version performs so much better than the normal recursive function. For this particular example, we can also see that the larger the argument, the bigger the improvement the cache gives us. For example, fibonacci_cached(30) will outperform fibonacci(30) by much more than 120x.
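Because @cache is built on the same machinery as lru_cache, the decorated function exposes a cache_info() method that lets us confirm the "each f(x) is calculated only once" claim: computing f(20) requires exactly one miss for each of f(0) through f(20), i.e. 21 misses, with every other lookup served as a hit:

```python
from functools import cache

@cache
def fibonacci_cached(n):
    if n < 2:
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)

fibonacci_cached(20)
info = fibonacci_cached.cache_info()
print(info)  # hits/misses bookkeeping provided by the decorator
```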

3. A Practical Real-World Example

Image created in Canva by the author

Let’s suppose we are developing a Python-based data dashboard that has many users. The dashboard shows weather data for 5 cities in the US and allows users to filter and aggregate temperature data for certain cities.

The code below will generate some dummy data as the dataset.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from functools import cache

# Create sample data: hourly temperature recordings over 10 days for 5 cities
np.random.seed(0)
date_range = pd.date_range(start="2024-04-01", end="2024-04-10", freq='H')
data = pd.DataFrame({
    'timestamp': np.tile(date_range, 5),
    'city': np.repeat(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], len(date_range)),
    'temperature': np.random.normal(loc=15, scale=10, size=(len(date_range)*5,))
})

Here is an output of the sample data.

Then, let’s write a function that calculates the average temperature of a city on a given day. We need two versions: a normal one and one with the @cache decorator.

def compute_avg_daily_temp(date, city):
    day_data = data[(data['timestamp'].dt.date == pd.to_datetime(date).date()) & (data['city'] == city)]
    return day_data['temperature'].mean()

@cache
def compute_avg_daily_temp_cached(date, city):
    day_data = data[(data['timestamp'].dt.date == pd.to_datetime(date).date()) & (data['city'] == city)]
    return day_data['temperature'].mean()

We can call the functions to check that they work.

compute_avg_daily_temp('2024-04-09', 'New York')
compute_avg_daily_temp_cached('2024-04-09', 'New York')

Now, let’s have a look at the performance. Please note: do not reuse the same date if you tested the functions’ output as above. In the testing above I used the date 2024-04-09, so I will use a different date, 2024-04-10, for the performance test. This avoids the result having already been cached, which would skew the measurement.

In cells [9] and [10], the performance of the normal version and the cached version is about the same. These are the first runs of each function.

Then, on the second run, the normal function shows almost no improvement in cell [11]. Note that 8.82ms versus 5.86ms doesn’t mean the latter performed better, since we only ran each once; ordinary operating-system fluctuation can easily cause such a difference.

However, looking at cell [12], the cached function is about 1,733 times faster. That is because no calculation actually happened: the result was loaded from the cache.

Why is this example practical?

You may ask why we need a caching mechanism for this example. If we only wanted to calculate the average temperature of one city, we wouldn’t need caching at all. However, we are developing a data-driven dashboard application here, so caching enables the following practices.

If this dashboard app has multiple users, they may query the same dataset and ask for the same aggregated or calculated metrics. The cache lets the second and every subsequent user get the result in a fraction of the time.

In this example, the data for each day is immutable: once a day’s average temperature has been calculated, it will never change.
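That per-day immutability is exactly what makes caching safe here. If the underlying data could change, say after a nightly reload, the cache would become stale and would need to be discarded explicitly with cache_clear(). A minimal sketch, using a hypothetical daily_temps lookup table in place of the DataFrame:

```python
from functools import cache

# Hypothetical lookup table standing in for the dashboard's dataset.
daily_temps = {("2024-04-09", "New York"): [12.0, 14.5, 13.2]}

@cache
def avg_daily_temp(date, city):
    temps = daily_temps[(date, city)]
    return sum(temps) / len(temps)

first = avg_daily_temp("2024-04-09", "New York")

# If the underlying data changes (e.g. a nightly reload),
# the stale cache must be discarded explicitly.
daily_temps[("2024-04-09", "New York")] = [20.0, 22.0]
avg_daily_temp.cache_clear()
refreshed = avg_daily_temp("2024-04-09", "New York")
```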

In most other programming languages, implementing such a caching mechanism is considerably harder. As we have seen, in Python it takes a single decorator.

4. Other Considerations and When We Should NOT Use Cache

Image created in Canva by the author

Of course, it is not recommended to use the cache everywhere. Before you start adding @cache on top of every Python function, consider the following.

The version of your Python

Please be aware that the @cache decorator was introduced in Python 3.9. If you can’t use version 3.9+, consider @lru_cache instead, which is a more configurable caching decorator and has been available since Python 3.2. I’ll introduce that decorator in my next article.
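For older interpreters, @lru_cache with an unbounded size behaves like @cache — a minimal sketch:

```python
from functools import lru_cache

# @lru_cache(maxsize=None) is an unbounded cache, equivalent to @cache,
# and works on Python versions long before 3.9.
@lru_cache(maxsize=None)
def fibonacci_cached(n):
    if n < 2:
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)

print(fibonacci_cached(30))
```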

Do not use cache in a non-deterministic function

When there is anything non-deterministic in a function, we should never cache it. Let’s reuse the average-temperature example. Suppose the requirement changes to "get the average temperature for today". We would need datetime.now() inside the function to get the current timestamp, and getting the current timestamp is non-deterministic.

If you don’t know what I mean, please look at this example.

from datetime import datetime
from functools import cache

@cache
def get_current_time_cached():
    return datetime.now()

The above function is the simplest possible non-deterministic function. Let’s run datetime.now() directly alongside the cached function.

We can see that even though I ran the two cells in quick succession, the timestamps from datetime.now() are different. That makes sense, because time keeps moving. However, no matter how many times we call the cached function, its timestamp never changes again. We have introduced a severe bug into our app!

The same principle applies to random-based functions. If our random function generates the same result every time, it’s not random anymore.
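A quick sketch of that bug — the cached "random" function freezes its first result forever:

```python
import random
from functools import cache

@cache
def cached_random():
    # The first result is frozen forever -- the function is no longer random.
    return random.random()

# One hundred calls all return the identical value.
values = {cached_random() for _ in range(100)}
print(values)  # a set containing a single value
```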

Do not use cache if the function has "side effects"

By "side effects", I mean actions beyond returning a value, such as writing text to a file or updating a database table. If we cache such a function, the side effects will never happen a second time. In other words, they only run the first time the function is called with a given set of arguments.
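A minimal sketch with a hypothetical audit_log illustrates the problem — the side effect fires only on the first call:

```python
from functools import cache

audit_log = []

@cache
def process(order_id):
    # Side effect: recording the order. With @cache, this line only
    # runs the first time a given order_id is seen.
    audit_log.append(order_id)
    return order_id * 2

process(7)
process(7)  # served from the cache; the append never runs again
print(audit_log)
```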

Do not use cache if there is a memory size concern

Where does Python store these cached results? In memory, of course. It is certainly fine to store the average temperature of one city, or even 100 cities. However, if we wanted to cache, say, the hourly rolling average temperature for the last year in every major city in the world, that would be a bad idea.

That said, if we want to cache many small results but worry that there may be too many of them, that is another great use case for lru_cache, which can cap the cache size. Please keep an eye on my profile; I will cover that decorator in a later post.
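As a brief preview, lru_cache can cap the number of retained results via maxsize, evicting the least recently used entry once the cache is full — a minimal sketch:

```python
from functools import lru_cache

# Keep at most two results in memory at any time.
@lru_cache(maxsize=2)
def square(n):
    return n * n

square(1)
square(2)
square(3)  # evicts the least recently used entry (n=1)
info = square.cache_info()
print(info)  # currsize stays capped at maxsize
```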

Do not use cache if the data is security-sensitive

Remember, the cache decorator does not encrypt the results it stores in memory. That means other processes on the same operating system could potentially read the cached information.

Therefore, please do not cache any sensitive data for security reasons.

Summary

Image created in Canva by the author

To sum up, in this article we introduced the @cache decorator from Python’s built-in functools module. It can dramatically improve the performance of typical recursive functions, and it is arguably the easiest way to add a caching feature to a Python application.

Of course, it is not a one-size-fits-all solution. We should be aware of the scenarios where we should never use it, such as non-deterministic functions, and we need to watch memory utilisation, since cached results are retained in memory. If you are interested in a more flexible (though slightly harder to use) caching decorator, keep an eye on my profile; that article is coming soon.

All images in this article are created by the author
