Data Engineering

How to use Metabase Metadata to facilitate Data Discovery

Getting the most of your Metabase Production Database

Jacob Frackson

Published in

Towards Data Science

4 min readNov 26, 2020

Metabase is the best open-source BI tool.

For those who are unfamiliar, Metabase is an open-source business intelligence tool with both a query builder and a native query option. Being an open source tool, it’s a great MVP and, for many organizations, it’s a great tool to build on top of and really make your own on account of its open-source code and options for self-deployment.

One key way that I have built on top of Metabase is by tapping into its production database — which is either Postgres or MySQL, depending on your configuration — to really extend a self-serve data strategy. Here’s how to do it.

Before diving in, a few notes up top:

This is not an exhaustive list of what you can do with the production data, but rather the motivation to get started yourself!
All examples are using the Postgres option, so there may be some oddities that I’m not fully aware of when using a MySQL database.
And, lastly, I use leading commas. 😬

Why bother with the production database?

Well, because that’s where all the good stuff is! The chances are that if you’re using Metabase you won’t have great usage statistics coming from your database or data warehousing system because all of the usage will be funnelled through the Metabase system user. Because of this, you’re leaving a lot of usage data on the table; usage data that could help you better understand which data is most important, and to whom.

In short, this data can be used to:

Monitor your self-serve usage, and measure the impact of internal experiments and trainings: Who is using Metabase now and how often? And are these users just viewing questions or are they also creating new questions?
Aid in data discovery and highlight needs for better documentation: Which tables are viewed the most and could they use better documentation?
Identify database objects for deprecation and Metabase questions for archiving, to clean up the clutter: Which tables are not being used as often now?
Investigate impact of data model updates and, in some cases, automate data model migrations. What will the effect of changing “opened_at” to “created_at” in the orders table?

The following are a few queries that, out of the box, you can start using on your own Metabase production database.

Who is using Metabase?

Let’s start off with something simple to answer our first point above about monitoring self-serve: how many questions are being viewed and how many unique users are viewing questions?

Use Cases: as your team scales you can track adoption of Metabase; as you add new models you can see if adoption and utilization increases.

The next level of this type of query could look something like this. In addition to questions viewed we include data about tables viewed, questions created, tables used in creating those questions, unique creators, and more.

Rather than just focusing on views, which is a low threshold for usage, this helps us understand how many power users are using Metabase, and in what ways.

Uses cases: experiments around increasing usage or count of power users can be monitored; training on new models can lead to more tables being used; training for new users can lead to more unique creators; SQL training ca lead to more SQL creators.

What are the most frequently used tables?

Now on to point two: how can we help with data discovery? The key is knowing which tables are being referenced in your questions.

For simplicity theses query only focus on “GUI” questions. Native SQL questions are important too, but in general these are done by power users so the problem of data discovery is often more complex.

Use cases: most frequently used tables for data discovery; tracking changes of most frequently used tables to see if the same tables are being used more often or if more tables are being used.

A more detailed version of this query adds, in addition to count of Metabase questions, count of creators, count of views, count of unique viewers, and more. Similarly to the above, this helps give a more complete understanding of those tables are their usage.

Use Cases: serving a list of top viewed tables; serving a list of most frequently used tables; and if modified or filtered for a period in time, comparing the difference between most used tables over time.

Conversely, these queries can also be used for points 3 and 4 above. If you identify how often a table is used, it helps illustrate the potential extent of a table definition change or a deprecation.

What groups of people are using Metabase most?

Another important thing can be understanding whose activity it is and who is using these most commonly used tables. A great way to do that is by adding context to users by way of their permissions groups.

If you have finance team members, for example, you can identify them as a part of a permission group. By extensions this could then be used to filter activity or top tables.

Use Cases: who are your power users; used in combination with the above, what tables are most used by which permission groups.

Conclusion

Hopefully this small introduction has convinced you that there’s a lot of potential in your Metabase production database. Whether you use it like the above by running the occasional query to find out about your users, or you load the data into your own warehouse and run some automated reporting on it, the data has a lot of quick value add for the average data team.