
How I Used Tracing and Data Science to Learn Redux Correlates with Terrible App Performance

Why reducing redux size is worth it

Photo by Lukas from Pexels

A couple of years ago, I was a founding member of Slack’s frontend performance team. The goal was to understand what impacts Slack’s performance on the frontend. One metric the company wanted to understand was the time it takes to switch from one channel to another; in other words, what causes a channel switch to be slow?

The beginning of the project was simply to trace different data so we could get a sense of what happens during a channel switch. For instance, we traced all the redux actions that occur during a channel switch and how long they take; later on, we traced React rendering as well. The tools we used were a tracer with the same API as OpenTelemetry, along with a few data analysis tools, including Honeycomb. If you’re not familiar with tracing, read my previous article on how tracing can be used to measure application performance:

Measuring React performance with OpenTelemetry and Honeycomb

We also tagged each trace with information about the user and the team: what kind of computer they use, what geographical region they’re from, how many channels they have, how many people are in their team, how big their redux stores are, etc.
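Slack’s frontend uses its own tracer, so the real code looks different, but to give a sense of what attaching tags to a span looks like with an OpenTelemetry-style API, here is a minimal Python sketch. The attribute names and values are made up for illustration, and it assumes the opentelemetry-api package is installed:

from opentelemetry import trace

tracer = trace.get_tracer("channel_switch_tracer")

# Wrap the operation in a span and tag it with user/team metadata
with tracer.start_as_current_span("channel_switch") as span:
    span.set_attribute("user.device_class", "high_end")   # hypothetical tag names
    span.set_attribute("user.region", "us-east")
    span.set_attribute("team.num_channels", 1200)
    span.set_attribute("redux.store_size_bytes", 5_000_000)
    # ... perform the channel switch ...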

Here is what a channel switch trace looks like:

Channel switch trace visualized with Honeycomb.io. Image by author.

Note: I crossed out information that could be considered proprietary.

The main parent span measures how long it takes from clicking on a channel until that channel finishes rendering with a list of recent messages. The child spans are redux actions, thunks, and the rendering of React components.

Adding the traces was more or less the easy part of the project. There were still technical challenges involved in implementing a robust and scalable system that other developers could easily use, but it was straightforward work. The hard part was figuring out what should be measured and how to analyze the traces to understand performance bottlenecks.

The bulk of this article is about the techniques I used to analyze frontend performance, and the lessons I learned while working on the team.


Basic tools and techniques to analyze app performance

I used a Jupyter notebook and several Python libraries to create visualizations and do the analysis.

There are several ways to get the trace data. You can download a CSV file directly from Honeycomb (though I believe there’s a limit of 1,000 rows), or you can query Honeycomb’s data via their API. Slack stores the trace data in its own database, so I queried Slack’s database to do the data analysis.

The next step is to use the pandas Python library to create a dataframe from the CSV files or API data. After you’ve done all the preprocessing to create the dataframe, you’re ready to analyze the traces!
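As a rough sketch, the loading and preprocessing step might look something like this. The file name, the column names (duration_ms, redux_store_size), and the 1-second “tolerable” threshold are all illustrative assumptions, not the actual schema or thresholds we used:

import pandas as pd

# Load the trace export (a CSV downloaded from Honeycomb or pulled from a database)
dataframe = pd.read_csv('channel_switch_traces.csv')  # hypothetical file name

# Make sure the numeric tags are actually numeric
dataframe['duration_ms'] = pd.to_numeric(dataframe['duration_ms'], errors='coerce')
dataframe['redux_store_size'] = pd.to_numeric(dataframe['redux_store_size'], errors='coerce')

# Derive a boolean "tolerable" flag (the 1000 ms cutoff is an arbitrary example)
dataframe['is_tolerable'] = (dataframe['duration_ms'] < 1000).astype(int)

# Drop rows with missing values so later correlations are computed on clean data
dataframe = dataframe.dropna(subset=['duration_ms', 'redux_store_size'])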

Correlation heatmap

A correlation heatmap is an easy and insightful way to get a sense of what tags are potentially associated with channel switch timing.

In a Jupyter notebook, use seaborn and matplotlib to make the heatmap in a few lines of code.

import seaborn as sns
import matplotlib.pyplot as plt

# dataframe and dataframe_name come from the preprocessing step above
correlations = dataframe.corr()

# Plot pairwise correlations; annot=True prints each coefficient in its cell
heatmap = sns.heatmap(correlations, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title(f'Correlation Heatmap for {dataframe_name}')
plt.show()

The resulting heatmap looks like this:

Correlation heatmap with the names of the tags redacted. Image by author.

The entire heatmap shows the correlations of all the tags with one another. The row we’re most interested in is the last one: the correlation coefficients between each tag and whether the channel switch is tolerable (i.e. not slow).

From the heatmap, I can identify the tags most associated with whether a channel switch is slow. A coefficient close to 0 means no correlation between the two columns, a coefficient close to 1 means a strong positive correlation, and a coefficient close to -1 means a strong negative correlation.

The tags most negatively associated with is_tolerable have to do with redux store sizes. In other words, this was my first hint that redux store size is correlated with application performance.
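If the full heatmap is too dense to read, another option is to pull out just that last row and sort it, which gives a ranked list of the tags most associated with slow channel switches. A minimal sketch, continuing from the snippet above and assuming the tolerability flag is a column named is_tolerable:

# Correlation of every tag with the tolerability flag, sorted from most
# negative (associated with slow switches) to most positive
tolerable_corr = correlations['is_tolerable'].drop('is_tolerable').sort_values()
print(tolerable_corr.head(10))  # the strongest negative correlations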

Scatterplots

I further looked into the effect redux store size has on channel switch performance with scatterplots.

To do so, I bucketed redux store size and put the buckets on the x-axis. On the y-axis, I plotted the percentage of channel switches within each bucket that are considered "slow."

Redux size vs. percentage of slow channel switches. Image by author.

As you can see, the greater the redux store size, the greater the percentage of slow channel switches.

I also looked at the p50 (median) of channel switch times within each bucket. The trend is similar, as expected. The greater the redux store size, the higher the p50 value.

Redux size vs. p50 of channel switch times. Image by author.

For both of these graphs, I used [matplotlib.pyplot.scatter](https://matplotlib.org/3.5.1/api/_as_gen/matplotlib.pyplot.scatter.html).
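Here is a rough sketch of how that bucketing and plotting can be done with pandas and matplotlib. The column names, the number of buckets, and the bucket edges are illustrative assumptions rather than the exact values we used:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Bucket redux store size into fixed-width bins (20 edges, i.e. 19 buckets, as an example)
bins = np.linspace(dataframe['redux_store_size'].min(),
                   dataframe['redux_store_size'].max(), 20)
dataframe['size_bucket'] = pd.cut(dataframe['redux_store_size'], bins=bins,
                                  include_lowest=True)

# Per bucket: percentage of slow channel switches and the p50 (median) duration
by_bucket = dataframe.groupby('size_bucket').agg(
    pct_slow=('is_tolerable', lambda s: 100 * (1 - s.mean())),
    p50_duration_ms=('duration_ms', 'median'),
)

# Scatter the bucket midpoints against the percentage of slow switches
midpoints = [interval.mid for interval in by_bucket.index]
plt.scatter(midpoints, by_bucket['pct_slow'])
plt.xlabel('Redux store size (bucketed)')
plt.ylabel('% of slow channel switches')
plt.show()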

I further broke down the sample data by different tag values. For instance, I compared enterprise vs. non-enterprise performance, looked at how hardware power affects performance, etc.

What about redux actions and thunks?

Analyzing how the presence or attributes of a child span affect its parent is not easily done with tools like Honeycomb. For instance, not all channel switches trigger the same set of redux actions, but you can’t compare channel switch traces that contain redux action "A" against those that don’t. You also can’t run correlation analysis between child spans and the main channel switch span. Maybe Honeycomb will add this functionality eventually, but for now, it doesn’t exist. This means the only practical way to analyze child spans is to add information about them as tags on the parent span. Alternatively, you can write complex queries and preprocessing scripts to pull the child spans and tie them to the main channel switch span via parent_id. That is a lot of effort, and ultimately we decided not to go in that direction.
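For completeness, a preprocessing script along those lines might look roughly like the sketch below, which joins child spans to their parent channel switch span via parent_id and pivots each child’s duration into its own column. The span and column names are illustrative, and this is the direction we chose not to pursue rather than code we actually shipped:

import pandas as pd

# spans: one row per span, with span_id, parent_id, name, and duration_ms columns
spans = pd.read_csv('all_spans.csv')  # hypothetical export

# Separate the top-level channel switch spans from their children
parents = spans[spans['name'] == 'channel_switch']
children = spans[spans['parent_id'].isin(parents['span_id'])]

# One column per child span name, holding that child's total duration
child_durations = (
    children.pivot_table(index='parent_id', columns='name',
                         values='duration_ms', aggfunc='sum')
    .add_prefix('child_')
)

# Attach the child durations to the parent spans for correlation analysis
enriched = parents.merge(child_durations, left_on='span_id',
                         right_index=True, how='left')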

Why redux size correlates with app performance, and caveats

Correlation does not imply causation. Though redux store size is strongly correlated with channel switch speed (and with several other metrics we measured), that correlation does not imply a causal relationship.

Generally speaking, redux store size is associated with the complexity of the application. The more complex the application is, and the more components and information it has, the bigger the redux store tends to be. Performance regression is a symptom of app complexity in general, because inefficiencies tend to be buried everywhere. For example, I’ve seen many cases of new objects (e.g. new arrays) being created in mapStateToProps before my team at Slack fixed the problem by introducing a lint rule. I’ve seen unnecessary props being passed into components, causing them to re-render unnecessarily. The bigger the application and the bigger the redux store, the more symptomatic these anti-patterns become.

Therefore, redux store size alone might not have much effect on performance if you’re doing everything else right. However, software isn’t written perfectly in production, at least not that I’ve seen. In practice, larger redux stores can lead to worse performance by causing more rendering as well as longer processing times (in mapStateToProps, in selectors, in reducers, etc.).


Challenges and lessons learned

Overall, this was an extremely challenging project that involved a lot of unknowns and trial and error. It was originally supposed to be a short project to investigate channel switch performance. However, it evolved into a multi-quarter initiative that led to the creation of a separate performance engineering department at Slack! I worked on this performance observability initiative for close to 1.5 years with a small group of 4–5 frontend engineers.

The goal of the team changed from time to time, but the idea was to find big contributors to performance regression, and create ways for developers to easily see the performance impact of their projects.

The biggest takeaway from my time on the observability team was that tooling matters. A lot. Different analytics products are ideal for different use cases. Part of the "error" in the trial and error came from tracing a lot of data that ultimately did not lead to many actionable insights. A big part of the reason, as I mentioned before, is that some of the data, such as child spans, is very difficult to analyze with the tools we had.

Thus, before implementing a trace (or any software feature), you should have a good sense of how you’re going to analyze the data, and figure out whether the way you are collecting information is ideal for the tools at hand.

Another big takeaway is to have an accessible pair programming buddy, or even better, a mentor, when tackling an area you’re not trained in.

My background is in frontend. For this project, I walked into the data science world with no prior training or mentorship. I learned Python and many of its data science libraries on my own, and I took statistics and data science classes at Berkeley and Stanford.

Even though I was resourceful in learning what I had to learn (I had Slack channels at my disposal for asking questions, and I consulted with a data scientist a few times for help), the lack of proper mentorship nevertheless led to avoidable roadblocks.


I think one of the most important skills a software engineer should have is versatility. Having a growth mindset helps. I learned so much in the last couple of years about tracing, observability, performance, AND data science and machine learning!

