Running Apache Superset at Scale

A set of recommendations and starting points to efficiently run Superset at scale

Published in Towards Data Science, Mar 22, 2021 (4 min read)

Sample Superset dashboard (source: https://superset.apache.org/gallery/)

When it comes to the Business Intelligence (BI) ecosystem, proprietary tools were the standard for a very long period. Tableau, Power BI, and more recently Looker, were the go-to solutions for enterprise BI use cases. But then, Apache Superset happened.

Frustrated with the multiple inconveniences of relying on a proprietary BI solution (like the lack of compatibility with certain query execution engines and vendor lock-in), Maxime Beauchemin used an internal Airbnb hackathon to build a BI tool from scratch.

The project was open-sourced in 2016 (initially as Caravel) and, over the following five years, grew into today’s Apache Superset, offering developers worldwide the capabilities of proprietary BI tools (and more) as an open-source project. Unsurprisingly, it was quickly adopted by tens of companies.

Five years after its initial release, the documentation is still limited when it comes to running Superset at scale, even though the tool itself is cloud-native and designed for scale. This article aims to offer some starting points and recommendations that can help new adopters in that regard.

It’s all about containers

No matter which platform or cloud you want to run Superset on, if you’re planning for scalability then the first building block should be preparing your custom container image. This will allow you to have Superset containers that match your use-case and that can be deployed to all sorts of orchestration platforms (like ECS or Kubernetes).

Start from the official Superset image (available on Docker Hub), then use your Dockerfile to add the dependencies your use case needs (such as database drivers). For example, if we wanted to use Redis, the Dockerfile would look like this:

FROM apache/superset:1.0.1
# We switch to root
USER root
# We install the Python interface for Redis
RUN pip install redis
# We switch back to the `superset` user
USER superset

Additionally, you should create a superset_config.py file that contains your custom configuration. It lets you override values from Superset’s main config file, such as the authentication method or the query timeout.
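As a rough illustration, a minimal superset_config.py could look like the sketch below. The setting names exist in Superset's default configuration, but the values, the host name, and the environment variable names are assumptions made for this example, not recommendations.

```python
# superset_config.py (illustrative sketch; all values are placeholders)
import os

# Secret used to sign session cookies. Reading it from the environment lets
# every container replica share the same value.
# SUPERSET_SECRET_KEY is a hypothetical variable name.
SECRET_KEY = os.environ.get("SUPERSET_SECRET_KEY", "change-me-in-production")

# Connection string for Superset's metadata database (hypothetical host/credentials).
SQLALCHEMY_DATABASE_URI = os.environ.get(
    "SUPERSET_METADATA_DB_URI",
    "postgresql://superset:superset@metadata-db:5432/superset",
)

# Default row limit applied to chart queries.
ROW_LIMIT = 5000

# Timeout, in seconds, for synchronous SQL Lab queries.
SQLLAB_TIMEOUT = 300

# Web server timeout, in seconds; keep it at or below your load balancer's timeout.
SUPERSET_WEBSERVER_TIMEOUT = 300
```

Keeping secrets and connection strings in environment variables means the same image can be deployed to several environments without rebuilding it.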

We can finally update the Dockerfile by adding the following commands:

# We add the superset_config.py file to the container
COPY superset_config.py /app/
# We tell Superset where to find it
ENV SUPERSET_CONFIG_PATH=/app/superset_config.py

This ensures that our custom configuration is copied into the container and picked up by Superset. You can then add a CMD step to your Dockerfile to actually run Superset.

Cache, cache, cache

No matter which database(s) you want to connect Superset to, if you’re planning on running it at scale then you definitely don’t want it to run multiple queries each time someone opens a dashboard.

The tool offers multiple caching possibilities that shouldn’t be ignored. At a minimum, you should try to cache Superset’s own metadata and the different charts’ data. That way, after we create a chart, Superset will save its data to the cache we configured. The next time someone tries to look at the chart after that (via a dashboard for example), Superset will retrieve the data instantly from the cache instead of running the query on the actual database.

Because Superset is essentially a Flask application (built on the Flask-AppBuilder framework), it uses Flask-Caching for caching purposes, which in turn supports multiple caching backends like Redis and Memcached.
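A minimal sketch of what this could look like in superset_config.py, assuming a Redis backend and a Superset version that exposes both `CACHE_CONFIG` (metadata) and `DATA_CACHE_CONFIG` (chart data). The dictionary keys are Flask-Caching options; the Redis URL, the environment variable name, the key prefixes, and the timeout are assumptions for illustration.

```python
# superset_config.py (caching section; illustrative sketch)
import os

# Hypothetical variable name and host; point this at your own Redis instance.
REDIS_URL = os.environ.get("SUPERSET_REDIS_URL", "redis://redis:6379/0")

# Cache for Superset's own metadata.
CACHE_CONFIG = {
    "CACHE_TYPE": "redis",
    "CACHE_DEFAULT_TIMEOUT": 60 * 60 * 24,  # one day, in seconds
    "CACHE_KEY_PREFIX": "superset_metadata_",
    "CACHE_REDIS_URL": REDIS_URL,
}

# Cache for chart data: same backend, but a separate key prefix so the two
# kinds of entries never collide.
DATA_CACHE_CONFIG = {
    **CACHE_CONFIG,
    "CACHE_KEY_PREFIX": "superset_data_",
}
```

The right timeout depends on how often the underlying data changes; a dashboard fed by a nightly batch job can tolerate a much longer timeout than one reading near-real-time tables.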

If you also want to use Superset for long-running analytical queries, you can configure Celery as an asynchronous backend and give it a dedicated cache for query results.
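The asynchronous setup could be sketched as follows. This assumes the uppercase Celery setting names used by Superset's default configuration around the 1.0 release (newer releases may expect lowercase names) and a Redis broker; the URL and the environment variable name are hypothetical.

```python
# superset_config.py (async queries section; illustrative sketch)
import os

# Hypothetical variable name and host.
CELERY_REDIS_URL = os.environ.get("SUPERSET_CELERY_REDIS_URL", "redis://redis:6379/1")


class CeleryConfig:
    # Message broker that hands queries to the Celery workers.
    BROKER_URL = CELERY_REDIS_URL
    # Module containing the SQL Lab tasks the workers must load.
    CELERY_IMPORTS = ("superset.sql_lab",)
    # Where Celery stores task state.
    CELERY_RESULT_BACKEND = CELERY_REDIS_URL
    # Fetch one task at a time so a long query does not hold others back.
    CELERYD_PREFETCH_MULTIPLIER = 1
    CELERY_ACKS_LATE = True


CELERY_CONFIG = CeleryConfig

# The query results themselves go to a separate results backend, configured
# with a cachelib cache object, for example (requires the cachelib and redis
# packages in the image):
# from cachelib.redis import RedisCache
# RESULTS_BACKEND = RedisCache(host="redis", port=6379, key_prefix="superset_results_")
```

The Celery workers run from the same container image as the web server, started with a different command, so both sides share this configuration.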

Setting up authentication

Unsurprisingly, Superset also relies on Flask-AppBuilder for authentication. It offers multiple authentication options, like OAuth and LDAP, that can be easily activated and configured.

For a scalable design, you’d want to activate one of the authentication methods and then let users perform self-registration on the platform. To do so, you can add the following two lines to your superset_config.py:

AUTH_USER_REGISTRATION = True
AUTH_USER_REGISTRATION_ROLE = "a_custom_default_role"
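For context, here is how those two lines might sit next to an activated authentication method. This is a sketch assuming LDAP: the `AUTH_LDAP_*` names are Flask-AppBuilder settings, while the server address and search base are hypothetical. LDAP support also needs the python-ldap package installed in the image.

```python
# superset_config.py (authentication section; illustrative sketch)
try:
    from flask_appbuilder.security.manager import AUTH_LDAP
except ImportError:
    # Fallback only so this snippet can run outside a Superset environment;
    # inside the Superset image the import above succeeds.
    AUTH_LDAP = 2

# Activate LDAP authentication.
AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldap://ldap.example.com"      # hypothetical server
AUTH_LDAP_SEARCH = "ou=people,dc=example,dc=com"  # hypothetical search base
AUTH_LDAP_UID_FIELD = "uid"

# Create a Superset user on first login and give it a default role.
AUTH_USER_REGISTRATION = True
AUTH_USER_REGISTRATION_ROLE = "a_custom_default_role"
```

The role named in `AUTH_USER_REGISTRATION_ROLE` has to exist in Superset before the first user registers.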

Permissions are a mess, but that’s okay

One of the not-so-ideal aspects of Superset is its permissions management. There’s an open issue on its repository (opened about two years ago) that details the problems with the current implementation and proposes a better alternative.

Luckily for us, data-access permissions are easy to configure. The problem lies mostly with the feature-related permissions generated by Flask-AppBuilder.

To avoid managing the long list of permissions by hand, an effective approach is to clone one of the existing roles and customize it for your use case, then assign this custom role to new users by default.

Conclusion

Thanks to its large community, Superset quickly became a mature and very complete project (version 1.0 was recently released). With that in mind, it should definitely be considered as a viable option when you’re designing a BI platform at scale.

The configuration steps needed to run it at scale are a small price to pay given its long list of features and capabilities. The best example of how powerful it can be is Airbnb’s setup and how they leverage its different features (discussed in detail in a post on the company’s blog).

Finally, it is worth mentioning that Maxime Beauchemin, Superset’s creator, recently founded Preset — a company that offers a managed cloud service for Apache Superset. If you prefer to minimize the necessary configuration effort, then opting for Preset can be more convenient.

For more data engineering content, you can subscribe to my bi-weekly newsletter, Data Espresso (dataespresso.substack.com), in which I discuss various topics related to data engineering and technology in general.