Hands-on Tutorials
A comparative Markov chain analysis to identify the root causes of aberrant online activity
As a trained cognitive neuroscientist, I have experience studying human behavior and how information guides decisions. Now, as I transition into my role as a data scientist, thanks in part to my Data Science Fellowship at Insight, I’ve come to learn about the many other media and methods for tracking and understanding behavior on a much larger scale.
During my fellowship, I consulted for Lazy Lantern, a software company that helps other companies monitor how customers interact with their online platforms. More specifically, Lazy Lantern provides its client companies with automated monitoring and analytics of their websites and iOS/Android apps, shedding light on customer traffic, interactions, and overall behavior on those platforms.
The problem
To understand how Lazy Lantern tracks customer behavior, let’s consider a user journey on a typical retail website, depicted in the figure below.
From the homepage, customers might click on the list of products, then home in on a product of interest, add that product to their cart, and finally check out. Considering that there would be many such customer interactions on a site at a given time, Lazy Lantern uses a time series model, Prophet, to predict the rate of interactions for a given link or event (e.g. product list, product, etc.) on that platform for a given time period. The Prophet model evaluates the expected number of clicks for each of those events on an hourly basis and derives an expected normal range of clicks for a given time period. To learn more about how the Prophet model is used at Lazy Lantern and ways to improve its detection efficiency, check out my Insight colleague Yeonjoo Smith’s blog here.
If the number of customer interactions with these links falls above or below the range predicted by the model, the algorithm triggers an anomaly and the client company is alerted. While this model is able to detect anomalies in rates of user interactions for each of these events individually, it is unable to determine if the triggered anomalies are actually a part of a chain reaction of events on the platform.
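To make this concrete, here is a minimal sketch of what an hourly-rate anomaly check with Prophet can look like. The column names, the one-event-at-a-time framing, and the 95% prediction interval are my own illustrative assumptions, not Lazy Lantern’s actual pipeline.

```python
import pandas as pd
from prophet import Prophet  # older installs expose this as `from fbprophet import Prophet`

def flag_hourly_anomalies(interactions: pd.DataFrame, event_name: str) -> pd.DataFrame:
    """Fit Prophet to hourly click counts for one event and flag hours whose
    observed count falls outside the predicted range."""
    clicks = interactions[interactions["event"] == event_name]
    hourly = (
        clicks.set_index("timestamp")
        .resample("1H")
        .size()
        .rename("y")
        .reset_index()
        .rename(columns={"timestamp": "ds"})
    )
    model = Prophet(interval_width=0.95)
    model.fit(hourly)
    forecast = model.predict(hourly[["ds"]])
    result = hourly.merge(forecast[["ds", "yhat_lower", "yhat_upper"]], on="ds")
    # An hour is anomalous when its observed count leaves the expected range.
    result["anomaly"] = (result["y"] < result["yhat_lower"]) | (result["y"] > result["yhat_upper"])
    return result
```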
For example, let’s consider the same user journey as before, but this time it becomes interrupted, as depicted in the figure below.
Here, the customers again go from the homepage to the product list and click on a product of interest, but are then unable to add the product to their cart due to some glitch. In this case, the algorithm would trigger anomaly alerts for both the ‘add to cart’ and ‘checkout’ events; however, it cannot give clients any insight beyond that.
In this example (Fig. 2), it could be the case that the ‘add to cart’ button was broken, causing a cascade effect that led the ‘checkout’ event to falter as well. However, the algorithm identified these only as separate anomalies, without capturing how the events overlap and relate to one another. Thus, the central problem of this project was posed: how could we identify which event is the potential root cause of these overlapping triggered anomalies? Pinpointing the root causes of these cascading anomaly effects would give the client companies more nuanced feedback about potential hiccups on their platforms, which they could then quickly address to keep customers satisfied and avoid unnecessary loss of revenue.
The data
While it is simple to understand which event was the root cause of the issue in the example above, in practice, the answer is… not so straightforward. When investigating customer behavior on an actual company’s platform, many users and user interactions occur at once, and not all of them correspond neatly to the triggered anomaly. To build a scalable, generalizable model that encapsulates the complexity of potential user interactions across diverse platforms and use cases, I had to first gather the relevant data.
For a given client company, every event anomaly that occurs is stored in an InfluxDB database, with its corresponding event label, the start and end time of the infraction, and the platform on which the anomaly occurred (i.e. web, iOS, or Android). The raw user interactions that feed those anomaly calculations are stored in a MongoDB database, divided by company and platform; these contain the event label, user identity, and the timestamp of when the user interacted with the event.
First, I evaluated the list of anomaly events and assessed which anomalies overlapped in time with one another. These were chained anomalies, and in the real data they typically comprised between 2 and 15 anomaly events. We can refer to the time window in which anomalies overlap as an anomaly period. Considering our initial example, the window in which both the ‘add to cart’ and ‘checkout’ anomaly events occurred would form one such anomaly period because their alerts overlap in time.
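As a rough sketch, this grouping step amounts to a simple interval merge. The field names below mirror the InfluxDB anomaly records described above, but the code itself is my own illustration rather than the production implementation.

```python
def merge_into_anomaly_periods(anomalies):
    """Group temporally overlapping anomalies into anomaly periods.
    Each anomaly is assumed to be a dict with 'event', 'start', and 'end'."""
    ordered = sorted(anomalies, key=lambda a: a["start"])
    periods = []
    for anomaly in ordered:
        if periods and anomaly["start"] <= periods[-1]["end"]:
            # Overlaps the current period: extend it and record the chained event.
            periods[-1]["end"] = max(periods[-1]["end"], anomaly["end"])
            periods[-1]["events"].append(anomaly["event"])
        else:
            # No overlap: start a new anomaly period.
            periods.append(
                {"start": anomaly["start"], "end": anomaly["end"], "events": [anomaly["event"]]}
            )
    return periods
```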
After generating a list of those overlapping, chained anomalies, I extracted the raw events that occurred within the resulting anomaly period by time-matching across the anomaly period and raw events. Now with these raw events, we can try to pinpoint which of the events was the origin or root cause of the cascading anomalies.
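The time-matching itself is a straightforward filter. A short sketch, assuming the raw events are loaded from MongoDB into a DataFrame with user_id, event, and timestamp columns, and using an anomaly period from the grouping step above:

```python
import pandas as pd

def events_in_period(raw_events: pd.DataFrame, period: dict) -> pd.DataFrame:
    """Keep only the raw interactions whose timestamps fall inside an anomaly period."""
    in_window = (raw_events["timestamp"] >= period["start"]) & (
        raw_events["timestamp"] <= period["end"]
    )
    return raw_events.loc[in_window].sort_values(["user_id", "timestamp"])
```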
The solution
With the list of raw events that corresponded to an anomaly period, I now had to figure out which event(s) were the root cause of that anomaly period. The number of raw events for an anomaly period ran into the thousands and, of course, was directly related to the length of the period. Therefore, it was imperative to generate a solution to the root cause problem that was:
- Quick: Because of the sheer number of events that need to be parsed for a given anomaly period, the solution needed to be time-efficient and readily implementable.
- Flexible: Because Lazy Lantern has over 160 client companies, each with multiple platforms, the solution also had to be fairly data-agnostic and not require excessive customization.
Considering those qualifiers, my solution was to utilize Markov chain models and conduct some comparative analyses. A Markov chain is a "mathematical system that experiences transitions from one state to another according to certain probabilistic rules…where the probability of transitioning to any particular state is dependent solely on the current state." To relate this definition directly to the problem at hand, a Markov chain model would be able to tell you the likelihood of users transitioning from one event to another for a given time period.
For an anomaly time period, I determined how users traversed the associated events, tracking which event directly followed which. Then, after averaging interactions across all users for a given event, I generated the likelihood that a user would transition from one event to another, creating an event-by-event Markov transition matrix (i.e. all columns in the matrix sum to 1).
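Below is a minimal sketch of that construction, assuming the raw events sit in a DataFrame with user_id, event, and timestamp columns; it pools transitions across users and normalizes each "from" column so it sums to 1.

```python
import pandas as pd

def transition_matrix(raw_events: pd.DataFrame) -> pd.DataFrame:
    """Build an event-by-event Markov transition matrix (columns sum to 1)."""
    ordered = raw_events.sort_values(["user_id", "timestamp"])
    # Pair each event with the event the same user clicked directly after it.
    ordered["next_event"] = ordered.groupby("user_id")["event"].shift(-1)
    pairs = ordered.dropna(subset=["next_event"])
    # Rows are the "to" events, columns are the "from" events.
    counts = pd.crosstab(pairs["next_event"], pairs["event"])
    # Normalize each "from" column into transition probabilities.
    return counts / counts.sum(axis=0)
```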
This transition matrix shows all the raw events the users traversed during this anomalous time period on the x and y axes. These events are represented by numbers in Figure 4 to keep the company’s data anonymous, but they essentially correspond to events in the initial example, e.g. "add product to cart", "checkout", etc. The value of each cell of the matrix corresponds to the proportion of users that traversed from one event (numbered event on x-axis) to the next (numbered event on y-axis).
Now that I had generated a Markov chain transition matrix for an anomaly period, I needed some sort of baseline to compare it against. I built this baseline by extracting the raw events that occurred during the exact same time window backdated one week before the anomaly period, a window with no detected anomalies. I presumed that this would be a good proxy for how users traverse events during a similar, "normal" time period without any site issues. For this normal time period, I generated a transition matrix in exactly the same way.
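Concretely, and assuming the raw events pulled from MongoDB cover both windows, the baseline is simply the same clock window shifted back one week; this sketch reuses the helpers from above.

```python
from datetime import timedelta

# `period` and `raw_events` are the anomaly period and raw-event DataFrame
# from the earlier sketches; the one-week shift mirrors the backdating above.
baseline_period = {
    "start": period["start"] - timedelta(weeks=1),
    "end": period["end"] - timedelta(weeks=1),
}
normal_matrix = transition_matrix(events_in_period(raw_events, baseline_period))
anomaly_matrix = transition_matrix(events_in_period(raw_events, period))
```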
With a transition matrix for the anomaly period and a transition matrix for the time-matched non-anomaly period in hand, I compared the probabilities between them by subtracting the normal matrix from the anomaly matrix. The resulting matrix is composed of the change (delta) in transition probability between the two time periods (Fig. 4).
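One practical wrinkle, handled in the short sketch below, is that the two periods may not contain exactly the same set of events; aligning both matrices on the union of event labels (with missing transitions treated as probability 0) before subtracting keeps the difference well-defined.

```python
# Align both matrices on the union of event labels, then take the difference.
labels = (
    anomaly_matrix.index.union(anomaly_matrix.columns)
    .union(normal_matrix.index)
    .union(normal_matrix.columns)
)
delta = (
    anomaly_matrix.reindex(index=labels, columns=labels, fill_value=0)
    - normal_matrix.reindex(index=labels, columns=labels, fill_value=0)
)
```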
This matrix of probability difference scores tells us how much users differed in their transition likelihoods between the same events across the two time periods. Going back to our initial example: if we tracked the transitions of customers interacting with our retail website on a normal day (Fig. 1) and compared them to the period in which the retail website was misbehaving (Fig. 2), the difference score matrix would show that the largest change in transition proportions lies between the ‘add to cart’ and ‘checkout’ events. This would indicate that the site actions corresponding to these events should be addressed in order to rectify the anomaly.
This solution is ideal because it performs well across diverse clients with varying business types and events. Specifically, the Markov chain matrix only tracks which events customers interact with and is, therefore, agnostic to the specific type of data. With this solution, I could pinpoint which events or actions on a given client’s platform resulted in unusual traffic behavior, getting to the root cause of the anomaly. With this insight, Lazy Lantern could relay the specific pain points of a client’s platform to that client, who could then address those issues on their site or app efficiently.
The product
To make the solution scalable, I decided to create and deploy a Streamlit dashboard. With this dashboard, Lazy Lantern can easily examine customer behavior across all of their clients in one place. With a drop-down menu, one can choose the client of interest, choose that client’s specific platform (i.e. web, iOS, or Android), and then choose from an InfluxDB-linked list of associated anomalies for that client and platform. After choosing a specific anomaly, the raw events corresponding to that anomaly and its time-matched non-anomaly are extracted from MongoDB, and the matrix solution described above is deployed.
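A minimal sketch of that selection flow in Streamlit is shown below; the loader functions and difference_matrix are hypothetical placeholders for the InfluxDB/MongoDB queries and the matrix comparison described above, not Lazy Lantern’s actual code.

```python
import streamlit as st

client = st.selectbox("Client", load_clients())                    # hypothetical InfluxDB-backed loader
platform = st.selectbox("Platform", ["web", "ios", "android"])
anomaly = st.selectbox("Anomaly period", load_anomalies(client, platform))

raw_events = load_raw_events(client, platform, anomaly)            # hypothetical MongoDB query
delta = difference_matrix(raw_events, anomaly)                     # the comparison sketched above
st.dataframe(delta)                                                # show the difference matrix
```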
The dashboard outputs the resultant matrices showcased in Figure 4; Lazy Lantern can visually examine the difference matrix to discover which event transitions changed between the two time periods. In addition, the dashboard outputs a summary graph (Fig. 5), which highlights the top five event transitions responsible for the greatest probability change (positive or negative) between the two time periods. With this summary, Lazy Lantern can quickly assess which event transitions on a client’s platform are responsible for the greatest change in customer behavior, and can alert the client accordingly.
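Continuing the dashboard sketch above, the summary can be derived directly from the difference matrix by flattening it and keeping the five transitions with the largest absolute change.

```python
# Flatten the difference matrix into one row per (from -> to) transition.
flat = delta.stack()
flat.index = [f"{src} -> {dst}" for dst, src in flat.index]
# Keep the five transitions with the largest absolute probability change.
top5 = flat.reindex(flat.abs().sort_values(ascending=False).index).head(5)
st.bar_chart(top5)
```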
This dashboard was an ideal product for this solution because one could toggle between clients and anomalies with ease and relative speed, without having to modify the internal code to access and match the data with its corresponding solution.
Future considerations
This project, from idea to implementation, was done over a period of about three weeks. While the solution detailed here is fast and flexible, with more time and data it could be extended to give even more customized feedback to clients. In the current solution, I only examined one-step transitions between events on a given platform. Something else to consider would be event chains that span multiple steps, beyond just the previous or next click. This would give insight into the different "typical" routes that customers take, e.g. one route for purchasing items, another for reading more about the company and finding its social media platforms. These higher-order Markov chains would allow for more customized feedback to clients about aberrant user traffic.
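As a rough sketch of the idea, a second-order chain can be built almost exactly like the first-order one, except the "from" state becomes the pair of the two most recent events; the column names are the same assumptions as in the earlier sketches.

```python
import pandas as pd

def second_order_matrix(raw_events: pd.DataFrame) -> pd.DataFrame:
    """Transition matrix where the state is the pair of the two most recent events."""
    ordered = raw_events.sort_values(["user_id", "timestamp"])
    ordered["prev_event"] = ordered.groupby("user_id")["event"].shift(1)   # event before the current one
    ordered["next_event"] = ordered.groupby("user_id")["event"].shift(-1)  # event after the current one
    chains = ordered.dropna(subset=["prev_event", "next_event"])
    # The "from" state is now a two-event history, e.g. "product list > product".
    state = chains["prev_event"] + " > " + chains["event"]
    counts = pd.crosstab(chains["next_event"], state)
    return counts / counts.sum(axis=0)
```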
Another consideration to supplement this solution might be to train a customized Hidden Markov Model for each client company to get a representation of the event states that exist on its platforms. This would give a more stable probability distribution of event transitions than the snapshot distributions derived from the current solution. However, one caveat of this approach is that it would not only require a lot of data to train, but would also not adapt easily or quickly to changes that clients make to their platforms, and would have to be constantly updated and re-trained.
I want to thank Lazy Lantern, especially Bastien Beurier, for the opportunity to work on this problem. This was my first experience providing a business-centered solution and I really appreciate the support and feedback I received. It was vital for the data solution and product that you just finished reading about! You can find the code along with some sample data on my GitHub.