
Native Sessionization In Apache Spark


With the upcoming Spark 3.2.0 release, among many new features, there’s one that will make our lives much easier when solving sessionization problems using Spark Structured Streaming.

Before jumping into the details, let’s clarify some streaming concepts, like windows.

Photo by Luca Bravo on Unsplash

When you look out of a window, you see only a portion of the land; you can’t see it all. Here it’s the same 🤓

In streaming, the flow of data is infinite, so we can’t group it using a batch approach, and it wouldn’t make sense to try.

So, how do you group in streaming?

Window, windows, and windows.

Windows split the stream into "chunks" of finite size, over which we can apply computations. Without them, as mentioned before, aggregating an unbounded stream would be impossible.

The most common types of windows are Tumbling Windows, Sliding Windows, and Session Windows.
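For context, Spark already covers the first two with the built-in window function. A minimal sketch of both (the df, time, and user_id names are just illustrative):

import org.apache.spark.sql.functions.window
import spark.implicits._ // enables the 'time symbol syntax

// Tumbling: fixed-size, non-overlapping 10-minute buckets.
df.groupBy(window('time, "10 minutes"), 'user_id).count()

// Sliding: 10-minute windows that start every 5 minutes, so they overlap.
df.groupBy(window('time, "10 minutes", "5 minutes"), 'user_id).count()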

As you can imagine, in this post we are going to talk about the last of these.

Session Windows

Session windows capture a period of activity in the data that is terminated by a gap of inactivity. In contrast to tumbling and sliding windows, they don’t have a fixed duration: a session’s start and end times depend on the events themselves, and sessions for the same key never overlap.

The gap of inactivity will be used to close the current session and the following events will be assigned to a new session.

As an example of a session, think about listening to music on your favorite music provider.

Once you play the first song, a session window will be created for your user, and this session will end after a period of inactivity, let’s say 10 minutes.

This means that if you listen to just one song, the window will close after 10 minutes, but if you keep listening to more songs the window won’t close until the period of inactivity is reached.
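The gap rule itself is simple enough to express without Spark. Here is a tiny plain-Scala sketch (the play times, in seconds since the first song, and the 600-second threshold are made up):

// Split a sorted list of play times into sessions: a new session starts
// whenever the gap to the previous event exceeds the threshold.
val plays = Seq(0L, 120L, 300L, 1500L, 1560L)
val gap = 600L // 10 minutes

val sessions = plays.foldLeft(List.empty[List[Long]]) {
  case (Nil, t)                                => List(List(t))
  case (cur :: done, t) if t - cur.head <= gap => (t :: cur) :: done
  case (acc, t)                                => List(t) :: acc
}.map(_.reverse).reverse

// sessions == List(List(0, 120, 300), List(1500, 1560)):
// the 20-minute silence after the third play closed the first session.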

They’re particularly useful in data analysis when studying users’ activity in our system over the specific periods during which they were engaged.

How do I do session windows in Spark < 3.2.0?

This would be too long to cover 😩, and there are great posts about it; as a hint, search for flatMapGroupsWithState.
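Just to give a feel for why it’s verbose, here is a rough skeleton of that approach. Everything below is illustrative, not the canonical recipe: the case classes, the updateSessions function, and the 10-second gap are mine, and a real version would also have to split sessions when a gap occurs between events inside a single micro-batch.

import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
// assumes spark.implicits._ is in scope for the case-class encoders

case class Event(user_id: String, time: Timestamp)
case class SessionState(start: Timestamp, lastSeen: Timestamp, count: Long)
case class SessionOutput(user_id: String, start: Timestamp, end: Timestamp, count: Long)

// Grow the user's open session while events keep arriving within the gap;
// emit it once the event-time timeout fires.
def updateSessions(
    userId: String,
    events: Iterator[Event],
    state: GroupState[SessionState]): Iterator[SessionOutput] = {
  if (state.hasTimedOut) {
    val s = state.get
    state.remove()
    Iterator(SessionOutput(userId, s.start, s.lastSeen, s.count))
  } else {
    val batch = events.toSeq.sortBy(_.time.getTime)
    val prev  = state.getOption
    val start = prev.map(_.start).getOrElse(batch.head.time)
    val count = prev.map(_.count).getOrElse(0L) + batch.size
    state.update(SessionState(start, batch.last.time, count))
    // fire once the watermark passes the last event plus the inactivity gap
    state.setTimeoutTimestamp(batch.last.time.getTime, "10 seconds")
    Iterator.empty
  }
}

val sessions = eventsDS                      // a Dataset[Event] read from some stream
  .withWatermark("time", "10 seconds")       // event-time timeouts require a watermark
  .groupByKey(_.user_id)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.EventTimeTimeout)(updateSessions)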

How do I do session windows in Spark ≥ 3.2.0?

That’s exactly why you’re here, and it’s your lucky day, because they’re way simpler than before. Like, way simpler 🎉.

All you need to define is your time column (it must be a TimestampType) and the gap of inactivity.

def session_window(timeColumn: Column, gapDuration: String): Column
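In 3.2 there is also an overload that takes the gap as a Column, so the gap can vary per row. A hedged sketch (the user_type column and the thresholds are made up):

import org.apache.spark.sql.functions.{session_window, when}

// Dynamic gap: premium users get a longer inactivity threshold.
df.groupBy(
    session_window('time, when('user_type === "premium", "20 seconds").otherwise("10 seconds")),
    'user_id)
  .count()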

Let’s set up a quick example.

Input data:

{"time":"2021-10-03 19:39:34", "user_id":"a"} - first event
{"time":"2021-10-03 19:39:41", "user_id":"a"} - gap < 10 seconds
{"time":"2021-10-03 19:39:42", "user_id":"a"} - gap < 10 seconds
{"time":"2021-10-03 19:39:49", "user_id":"a"} - gap < 10 seconds
{"time":"2021-10-03 19:40:03", "user_id":"a"} - gap > 10 seconds

Let’s use a session window to group this data:

import org.apache.spark.sql.functions.session_window
import spark.implicits._ // enables the 'time symbol syntax

val sessionDF = df
  .groupBy(session_window('time, "10 seconds"), 'user_id)
  .count()

Easy, right? 😭

Output:

+------------------------------------------+-------+-----+
|session_window                            |user_id|count|
+------------------------------------------+-------+-----+
|{2021-10-03 19:39:34, 2021-10-03 19:39:59}|a      |4    |
+------------------------------------------+-------+-----+

As you can see, the first 4 events fall into the same session window because the gap between consecutive events is always less than 10 seconds. The fifth event falls into a new session window for this user because its gap was greater than 10 seconds. Note that the window’s end, 19:39:59, is the last event’s timestamp plus the 10-second gap.
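If you want to try this on a real stream, keep in mind that in append mode Spark requires a watermark on the time column. A self-contained sketch, with the socket source, host, and port as placeholders (feed it JSON lines with nc -lk 9999):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, session_window}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder.appName("sessions").master("local[*]").getOrCreate()
import spark.implicits._

// Schema matching the input records above.
val schema = new StructType()
  .add("time", TimestampType)
  .add("user_id", StringType)

val events = spark.readStream
  .format("socket") // toy source, just for experimenting
  .option("host", "localhost").option("port", "9999")
  .load()
  .select(from_json('value, schema).as("e"))
  .select("e.*")

events
  .withWatermark("time", "10 seconds") // required for append mode
  .groupBy(session_window('time, "10 seconds"), 'user_id)
  .count()
  .writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()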

Conclusion

Spark is evolving fast, and things that weren’t possible before might be possible now. It is always good to stay up to date, not just with Spark but with every technology.

