Hacking MySQL’s JSON_ARRAYAGG Function to Create Dynamic, Multi-Value Dimensions

Making up for one of MySQL’s lesser-known shortcomings

Dakota Smith
Towards Data Science

--

Photo by Azamat E on Unsplash. Thanks, Azamat!

Introduction

Let’s imagine that we’re members on the Data team at a subscription box company. In a MySQL database, transaction records of purchases are written to table named subscriptions. Metadata excluded, the table contains a customer_id and subscription field, and looks something like this:

The subscriptions table. (Note: All images, unless otherwise noted, are by the author)

Note that in this example scenario, a customer can have multiple subscriptions. The uniqueness of each record is in defined by both the customer ID and a subscription, i.e., no customer can have the same subscription twice. If you want to load this test data into your own database, you can find the code to do so here.

As a subscription box company, our goal is to sell more subscription boxes. To that end, the Product team recently pointed out that all of our current customers have more than one subscription. They’re curious what this says about customer behavior. They ask our team for a data model showing combinations of subscriptions purchased by user, as well as which combinations are most common.

The Marketing team has also showed interested in this model. They believe the results might be useful for marketing bundled product promotions, profiling customers, and targeting email campaigns. For these same reasons, they’re also asking to see the most common total number of subscriptions purchased by each customer.

In short, the requested data model hopes to answer some important questions that’ll ideally end in higher subscription box sales. The question is, how exactly should we execute it?

In this article, we’re going to tackle a unique data modeling challenge by making up for one of MySQL’s lesser-known shortcomings. We’re going to go over qualitative aggregation, the JSON data type, and how to force MySQL to order a list of values in a way that produces distinct, multi-value dimensions.

Table of Contents

Aggregations as Dimensions

Conceptually, what we need to do is relatively simple: We need to bundle (group) our subscriptions by customer. Then we need to take a look at these bundles and see which are the most common and how many subscriptions are in them.

In data modeling terms, we’re looking at some form of aggregation: specifically, aggregating subscriptions by customer.

It’s normal to think of aggregate functions in the quantitative sense (SUM, COUNT, etc.), and this is largely because this is what most aggregate functions do in SQL. But we can also aggregate concatenated string values into longer, list-like strings.

The challenge with this though is lies in accessing, manipulating, or otherwise evaluating the values within these concatenated strings. MySQL is going to treat the value foo, bar, hello, world not as a list, but as text.

Why is this relevant? Primarily because in our hypothetical scenario, we want to count the number of subscriptions in each combination. We don’t want a long comma-delimited string, we want something more truly list-like.

Solving this in Python would be simple—using pandas, maybe polars, or even just Python-native data structures. But there are many cases where this isn’t an option. Maybe the Data team is a strictly dbt-only shop; or, perhaps more commonplace, you work at a company where the IT department has local environments severely locked down.

Whatever the case, if you have only SQL at your disposal, you need a solution that will provide you the most readable code and the most flexible results. Achieving this is not intuitive. For example, my first instinct when encountering this problem was to use GROUP_CONCAT, the function that concatenates strings according to your defined grouping:

WITH
subscriptions_grouped AS (
SELECT
customer_id,
GROUP_CONCAT(subscription) AS subscriptions
FROM
subscriptions
GROUP BY customer_id
)
SELECT
subscriptions,
COUNT(*) AS num_accounts
FROM subscriptions_grouped
GROUP BY subscriptions
;
Query results

As you can see, the aggregation worked, technically, but it didn’t work along the lines of our business logic. See the first and last rows. The subscriptions combination of “international_snacks, self_care” is the same combination as “self_care, international_snacks”. (The same is true of the 2nd and 4th rows.)

We can use an ORDER BY clause within GROUP_CONCAT to solve this particular issue:

WITH
subscriptions_grouped AS (
SELECT
customer_id,
GROUP_CONCAT(subscription ORDER BY subscription) AS subscriptions
FROM
subscriptions
GROUP BY 1
)
SELECT
subscriptions,
COUNT(*) AS num_accounts
FROM subscriptions_grouped
GROUP BY subscriptions
;
Query results

But this still leaves up with the problem of counting how many subscriptions are in each combination.

There is a way to do it. But not only is it needlessly complex and not very readable, in my opinion, but it also comes with some not-so-obvious pitfalls.

A quick search on how to count the number of values in a comma-delimited string in MySQL turns up a solution from StackOverflow that, translated for our purposes, amounts to this (subscriptions_grouped CTE excluded):

SELECT
subscriptions,
LENGTH(subscriptions) - LENGTH(REPLACE(subscriptions, ',', '')) + 1 AS num_subscriptions,
COUNT(*) AS num_accounts
FROM subscriptions_grouped
GROUP BY subscriptions
;

Which is essentially counting the number of commas and then adding 1 to the result. This works. But not only is this answer difficult to understand at a glance, it also introduces a potential mistake: the LENGTH and CHAR_LENGTH function don’t count the same thing.

As you can probably guess, this article details obstacles I ran into at work when I found myself in a similar situation.

Ultimately, the solution was in a somewhat hack-y, but remarkably comprehensible, workaround using the native MySQL JSON data type.

A brief overview of the JSON data type in MySQL

The JSON data type in MySQL was added in version 5.7.8, and provides a lot of useful utility for both storage and modeling.

Under the JSON data type umbrella (officially called “JSON documents”) are two different data structures: JSON arrays and JSON objects.

A JSON array can simply be thought of as an array (a list, if you’re a Pythonista): values enclosed by square brackets [ ] and separated by commas.

  • An example MySQL JSON array value: [“foo”, “bar”, 1, 2]

A JSON object can be thought of as a hash table (or, again in Python terms, a dictionary): key-value pairs, separated by commas, and enclosed by curly brackets { }.

  • An example of a MySQL JSON object value: {“foo”: “bar”, 1: 2}

MySQL has a number of functions that can be used to deal with both of these formats—almost none of which perform any sort of aggregation.

Thankfully, though, there are two that do. And they both return JSON documents, which means we can use MySQL’s built-in functions to access the values therein.

JSON_ARRAYAGG

The MySQL function JSON_ARRAYAGG acts a lot like GROUP_CONCAT. The biggest difference is that it returns a JSON array, which, again, comes with several helpful built-in functions linked above.

The JSON array data type solves one of our two problems with astounding simplicity: the problem of reliably counting the number of subscriptions in a combination. This is accomplished using the JSON_LENGTH function. The syntax is wonderfully straightforward:

SELECT JSON_LENGTH(JSON_ARRAY("foo", "bar", "hello", "world"));
-- JSON_ARRAY function used here just to quickly create an example array

The result of this statement is 4, since there are 4 values in the generated JSON array.

But let’s return to the combination of subscriptions. Unfortunately, JSON_ARRAYAGG doesn’t come with the ordering functionality that GROUP_CONCAT has. Ordering the subscription values, even in a CTE before the base query, doesn’t return the desired results:

WITH
subscriptions_ordered AS (
SELECT
customer_id,
subscription
FROM subscriptions
ORDER BY subscription
)
, subscriptions_grouped AS (
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions,
JSON_LENGTH(JSON_ARRAYAGG(subscription)) AS num_subscriptions
FROM
subscriptions_ordered
GROUP BY customer_id
)
SELECT
subscriptions,
COUNT(*) AS num_accounts
num_subscriptions
FROM subscriptions_grouped
GROUP BY subscriptions
;
Query results

The number of subscriptions in each combination is there, thanks to the JSON_LENGTH function—but combinations that are effectively the same are once again mischaracterized as distinct because of their order.

Using ROW_NUMBER to force the ordering of values

ROW_NUMBER is a window function that creates an index. The index has to be defined; that is, you have to tell it where to start, how to increment (directionally), and where to end.

We can see a quick example of this by applying the ROW_NUMBER function and telling it to order by the subscription field:

SELECT 
customer_id,
subscription,
ROW_NUMBER() OVER(ORDER BY subscription) AS alphabetical_row_num
FROM subscriptions
;
Query results

Look closely at the results. Even though we didn’t use an ORDER BY statement at the end of our query, the data is nonetheless ordered according to the ORDER BY in the OVER clause.

But of course this still isn’t exactly what we want. What we need to do next is add a PARTITION BY clause to our window function, so that the ordering of the results are related to (and in fact bounded by) each customer ID. Like so:

SELECT 
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
;
Query results

You can probably see where this is going.

If we execute the JSON_ARRAYAGG function against these results in a CTE, we see that the duplicate combinations now look exactly the same, thanks to the subscriptions being forced into an alphabetical order by the ROW_NUMBER function:

WITH 
subscriptions_ordered AS (
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
)
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions
FROM subscriptions_ordered
GROUP BY 1
ORDER BY 2
;
Query results

Now all we need to do is add in the grouping CTE following the one executing ROW_NUMBER, and alter the base query:

WITH 
subscriptions_ordered AS (
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
)
, subscriptions_grouped AS (
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions,
JSON_LENGTH(JSON_ARRAYAGG(subscription)) AS num_subscriptions
FROM subscriptions_ordered
GROUP BY customer_id
)
SELECT
subscriptions,
COUNT(*) AS num_customers,
num_subscriptions
FROM subscriptions_grouped
GROUP BY subscriptions
ORDER BY num_customers DESC
;

This gives not only accurately distinct combinations of subscriptions, but also the number of customers who’ve purchased those combinations, and how many subscriptions comprise each of them:

Query results

Voila!

To recap

  • We wanted to know how many customers purchased different combination of subscriptions, and how many subscriptions were in each of those combinations. This presented two problems: how best to obtain the latter, and how to generate accurately distinct subscription combinations.
  • To obtain the number of subscriptions in each combination, we chose to go with one of MySQL’s JSON functions, JSON_ARRAYAGG. The resulting aggregation was returned to us as a JSON data type, allowing us to use the JSON_LENGTH function.
  • We then needed to force the ordering of values inside the JSON array so that duplicate combinations didn’t mistakenly appear distinct. To do this, we used the window function ROW_NUMBER in a CTE prior to the base query, partitioning by customer ID and ordering the subscriptions alphabetically (in ascending order).
  • This ultimately allowed us to aggregate up to accurately distinct combinations of subscriptions; and with this we were able to use a simple COUNT function to see how many customers had purchased each combination.

Thanks for reading! 🤓

I hope this was helpful! If you know of any other clever tricks/workarounds in SQL (regardless of dialect), I’d love to hear about them. SQL has long been the de facto lingua franca for transforming structured data, but it’s not perfect. I always enjoy learning about innovative and/or clever solutions to real-world challenges. 🔥

I write semi-regularly about data engineering and analytics topics—the goal always being to write as clearly and simply as possible. If anything in this article confused you, please let me know in the comments. And if you’re interested in reading more articles like this, feel free to follow me and/or connect on LinkedIn.

--

--