
When Milliseconds Matter - My Journey to Performance Improvement

Lessons Learned From a Latency Improvement Project

Working under a strict SLA where milliseconds matter, while maintaining a complex system with multiple dependencies, can expose us to many challenges and non-trivial investigations when latency-related issues occur.

In this article, I will walk you through my journey to improve our system’s performance, initiated by a problem of the type described above, and the lessons learned along the way.

Step 1 – Expected Behavior and Problem Description

The company’s product handles transactions, and for each transaction received, we either approve or decline it.

For some of the transactions, we go through a data enrichment step to obtain more information for our real-time decisions and for future transactions.

However, for tens of thousands of transactions per day, the enrichment process simply did not take place, and it wasn’t clear why.

Step 2 – Searching for a Lion in the Desert

Let’s look at a simplified view of the relevant systems:

As this was an enrichment issue, the first question was: did the problem occur within the Enrichment service, or even before we got there? It turned out that for these problematic transactions, we never even sent a fetch request to the Enrichment service. One suspect down.

Step 3 – Getting to Know Our Topology

Let’s take a break from our story. I want to introduce you to the Decision-Making system’s topology. This system is based on Apache Storm, which is designed to process unbounded streams of data in real time.

In a nutshell, the Spout receives data from a data source (e.g. Kafka or RabbitMQ) and emits streams into the topology. Each Bolt is a component in the topology that receives and emits one or more streams. A Bolt performs simple logic to process the stream, such as filtering, aggregating, or reading from and writing to databases.

Some Bolts run in parallel, while others depend on one another. Some Bolts are bound by a relative timeout value, after which they proceed (process streams and emit them) whether or not they have received all the inputs they were waiting for. These timeouts prevent Bolts and components from delaying the topology for too long, thus enabling us to meet the expected SLA.
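To make this concrete, here is a toy Python sketch of the timeout idea (this is not the actual Storm API; the class and its names are illustrative only): a component collects expected inputs until a relative deadline, and becomes ready to proceed either when everything has arrived or when time runs out.

```python
import time


class TimeoutBolt:
    """Toy illustration (not the Storm API): a component that waits for
    its expected inputs only until a relative deadline, then proceeds
    with whatever it has."""

    def __init__(self, expected_inputs, timeout_ms):
        self.expected = set(expected_inputs)
        self.received = {}
        self.timeout_ms = timeout_ms
        self.start = time.monotonic()

    def accept(self, name, value):
        # Record one incoming input from an upstream component.
        self.received[name] = value

    def ready(self):
        elapsed_ms = (time.monotonic() - self.start) * 1000
        # Proceed when all inputs have arrived, or when the timeout
        # expired, even if some inputs are still missing.
        return self.expected <= self.received.keys() or elapsed_ms >= self.timeout_ms
```

The key trade-off is visible in `ready()`: a component that proceeds on timeout keeps the topology within its SLA, at the cost of sometimes working with incomplete inputs.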

Step 4 – Why No Enrichment?

Back to our story. Why didn’t the Decision-Making system execute the calls to the Enrichment service? I added some metrics and found that, for the problematic transactions, a timeout condition was met: the spare time left for the Enrichment process was not enough, so the fetch call was skipped in order to leave enough time for subsequent Bolts.

This was surprising. Enrichment is a core component in the flow and takes place relatively early in the topology. How come we didn’t have enough time left to execute this call?
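The skip condition can be sketched as a simple time-budget check (all numbers and names here are hypothetical, chosen for illustration; the real system’s budgets differ): given the elapsed time so far, the call is made only if the enrichment fetch still fits in the remaining SLA budget after reserving time for the Bolts that still have to run.

```python
def should_call_enrichment(elapsed_ms, sla_ms=500, reserved_for_rest_ms=150,
                           expected_enrichment_ms=80):
    """Hypothetical budget check: skip the enrichment fetch if it no
    longer fits in what's left of the SLA, after reserving time for
    downstream components."""
    remaining_ms = sla_ms - elapsed_ms
    return remaining_ms - reserved_for_rest_ms >= expected_enrichment_ms
```

Under this sketch, a slow parent Bolt inflates `elapsed_ms`, which is exactly how an upstream delay can silently starve a downstream call of its budget.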

Step 5 – Dependencies and Latencies

To understand why we were out of (relative) time, I dug into a view provided by an internal tool and looked backward at the Bolts my Enrichment Bolt depended on. I found an earlier Bolt that consistently took about 100 milliseconds. Considering our SLA, and the average time a Bolt is supposed to take, this was A LOT. What happened in this parent Bolt that took so long?

Diving into the parent Bolt’s code, I saw an Elasticsearch query and wondered if this could be the reason for our bottleneck.

And it was: looking at the relevant dashboards, I found a correlation between the hours of the day when my Bolt had high latency and the hours when the Elasticsearch cluster had high CPU usage.
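This kind of eyeball correlation can be quantified with a Pearson coefficient over hourly samples. A minimal sketch (the sample values below are made up for illustration):

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length series, e.g. hourly
    Bolt latency vs. hourly cluster CPU usage."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up hourly samples: Bolt latency (ms) and cluster CPU (%).
latency_ms = [40, 45, 120, 130, 50, 42]
cpu_pct = [30, 35, 85, 90, 40, 32]
```

A coefficient close to 1.0 supports (but does not prove) the hypothesis that the cluster’s CPU pressure drives the Bolt’s latency; confirming causality still takes the kind of follow-up with the owning team described below.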

After syncing with the team that maintains this cluster, I learned they were familiar with its long-standing performance issues and its gradual degradation.

Step 6 – Is This Dependency Necessary?

Why does the Enrichment Bolt depend on this Elasticsearch query? Does it justify the price we’re paying in the form of unenriched transactions?

In the Enrichment context, we were waiting for this query’s results for a specific feature, but further investigation showed that the feature had been bugged for who knows how long, so we weren’t making use of its desired output.

If we deleted this feature, we could disconnect the dependency between Enrichment and the Bolt calling the problematic Elasticsearch cluster. If we fixed the bugged feature, we would get back to receiving the data someone had intended to use, but we would keep our high-latency dependency and need to look for alternative solutions.

After weighing a few potential resolution paths by effort and cost-effectiveness, and having received the feature owners’ blessing to delete this piece of code, I removed the bugged feature.

Step 7 – Time to Make Delicate Changes to Our Topology

The Decision-Making system is dependency-based, and at this point, I wanted the Enrichment component to depend on a component that runs earlier than the one calling the problematic cluster. Such a change would save the time we were spending waiting for the high-latency query, and the earlier our new parent component runs, the more spare time is left for Enrichment and its subsequent components.

After investigating the code and choosing the new parent component together with the higher-ups, I made this delicate change and monitored the results.

Step 8 – What Are the Results?

At first, no dramatic improvement was seen after my changes reached production. What a bummer! Months of investigation and anticipation, and the change in rate was minor. But we shall not despair!

I checked the unenriched transactions and saw they still met the timeout condition. I investigated the dependencies view and saw that, via a different path, they were still dependent on the problematic component!

The reason was that the Enrichment Bolt waited for a few fields whose parent Bolts led, again, to our high-latency Bolt that queried the high-latency cluster. But these fields were not used in the code.

I deleted those fields, and was glad to see the results:

The daily rate of unenriched transactions was reduced from 26K to 200.

Also, at the beginning, some merchants had up to 20% of their transactions affected by this issue; after my changes, no merchant had more than 1%.

Great success!

Lessons Learned:

I learned a lot along the way. I investigated using various technologies, dug into complex code built on a complex architecture, analyzed latencies, and considered different trade-offs to resolve the problem. But here are some tips I want to share with you today:

  • Deleting Code Is Good. A good software engineer is not measured by the amount of new code she writes. Deleting code is an important task, and deleting deprecated code that overloads the system can be crucial for improving the system’s performance. Taking the time to investigate may require patience and resilience, but it can lead to precious results.
  • Code Deletion Should Be Done Thoroughly. One of the reasons component A depended on component B was input fields that weren’t used within the component. As we delete code, it is good practice to ask ourselves whether we deleted everything related to it and left no loose ends.

  • Invest in Profiling Tools. One of the things that enabled me to detect the high-latency component was an internal profiling tool. We need visibility into our components, their dependencies, and their latencies, and we need such a tool to be intuitive and easy to use.

  • Consider Monitoring the Latency of Specific Components in Your Flows. When milliseconds matter, monitoring the latency of each component can be the key to identifying the sources of your bottlenecks. Consider adding metrics to measure your components’ latencies when creating them, so the data is accessible when you need it.

  • Master the Technologies You Work With, and Learn How to Investigate with Them. Each technology our team uses has its powers and tricks. When we encounter a new technology or tool, we might learn only what we need for daily use and move on. But investing the time to learn which additional insights can be derived from the tool can come in handy when we are swamped with questions to investigate.
  • Consult and Brainstorm. Complex investigations can be hard, and a colleague might be familiar with investigation tools we weren’t aware of, or suggest a different perspective on our problem. Keep in mind that your project’s success is your team’s and company’s success, and loop colleagues in if you feel stuck or in need of another opinion.
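As a minimal sketch of the latency-monitoring tip above (all names here are hypothetical, and a real system would ship these measurements to a metrics backend rather than keep them in memory), a decorator can record each component call’s wall-clock latency:

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: component name -> recorded latencies (ms).
latencies_ms = defaultdict(list)


def measure_latency(component):
    """Record each call's wall-clock latency under the given component
    name, so per-component percentiles can be derived later."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                latencies_ms[component].append((time.monotonic() - start) * 1000)
        return wrapper
    return decorator


@measure_latency("enrichment_bolt")
def process(tx):
    # Stand-in for a component's real processing logic.
    return {**tx, "enriched": True}
```

Wrapping components this way at creation time means the per-component latency data already exists when a bottleneck investigation starts, instead of having to be added mid-incident.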
