
Designing a Rate Limiter

How do you design a rate limiter service?

System Design Basics

Photo by Jakob Søby on Unsplash

System design is one of the most significant concepts in software engineering. For a new learner, it can be confusing because there are no fixed guidelines, the various software architecture books are hard to understand, and everywhere you check seems to take a different approach.

So, I set out to design a system based on my experience of learning architecture courses. This is part of a series on system design for beginners (link is given below). For this article, let’s design a Rate Limiter service.

What is Rate Limiting?

Rate limiting is a simple concept to grasp, but it’s vital for large-scale systems. It is used for security purposes as well as for performance. A service can serve only a limited number of requests per second; if it receives a huge number of requests at once, that might be too much for the server to handle, and the server might even shut down. To handle this type of problem, we need some kind of limiting mechanism that allows only a certain number of requests through to the service.

Put simply, rate limiting restricts certain operations with a threshold. If that threshold is crossed, the system returns errors. In other words, rate limiting caps the number of operations performed within a given amount of time.

Figure 1: Rate limiter rejects 3rd request (Image by Author)

Rate limiting can be implemented in tiers. E.g., a network request can be limited to 1 request per second, 4 requests per minute, 10 requests per 5 minutes.
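To make the tiered example concrete, here is a minimal Python sketch; the tier values match the example above, but the function and variable names are illustrative, not from any particular library:

```python
# Hypothetical tiers: (max_requests, window_seconds) per the example above.
TIERS = [(1, 1), (4, 60), (10, 300)]

def within_limits(timestamps, now):
    """Return True if a new request at time `now` respects every tier,
    given the user's previous request timestamps (in seconds)."""
    for max_requests, window in TIERS:
        # Count earlier requests that fall inside this tier's window.
        recent = [t for t in timestamps if now - t < window]
        if len(recent) >= max_requests:
            return False
    return True
```

A request is allowed only if it passes every tier: a second request within the same second fails the 1 request/second tier even when the per-minute tier still has room.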

Why Do We Need Rate Limiting?

Rate limiting is needed to protect a system from being brought down by attackers. To understand the importance of rate limiting, we first have to know about the DoS attack.

A denial of service attack, commonly known as a DoS attack, is when hackers flood a system with many requests within a short period of time to shut it down. A server can serve only a limited number of requests within a certain period of time, so it cannot handle such a flood of requests properly.

Rate limiting can prevent that from happening, as it protects the system from being flooded with intentionally unnecessary requests. Beyond a threshold value, the system returns an error.

For example, on LeetCode or any other website where we can execute our code, rate limiting may be needed so that users don’t spam the code execution service needlessly.

We can rate-limit the system in various ways, for example based on user ID: say, we limit a user’s request operations to a server to 5 operations/minute. Other ways to rate-limit can be based on IP address, region, etc.

We can even apply a rate limit to the system as a whole. For example, the system may not handle more than 20K requests/minute; if total user requests exceed that number in a minute, the system will return errors.

Definition of the System:

System design is such a vast topic that if we don’t narrow it down to a specific purpose, designing the system becomes complicated, especially for newbies. In this article, we are designing a rate limiter service that will limit the number of client requests to a server within a time frame.

Photo by Mitchell Luo on Unsplash

Requirements and Goals of the System

In this part, we need to specify the features of the system. The requirements are divided into two parts:

Functional Requirements:

  1. A client can send a limited number of requests to a server within a time window, e.g., 10 requests per second.
  2. The client should get an error message if the defined request threshold is crossed for a single server or across different combinations of servers.

Non-Functional Requirements:

  1. The system should be highly available since it protects our service from external attacks.
  2. Performance is an important factor for any system, so we need to be careful that the rate limiter service does not add substantial latency to the system.

Memory Estimation:

Let’s assume we are keeping all of the data in a hash table of key-value pairs. The key would be the hash of the user ID, and the value would be a structure of count and start time, e.g., UserId → {Count, StartTime}.

Now, let’s assume ‘UserID’ takes 8 bytes, and ‘Count’ takes 2 bytes, which can count up to 65k (sufficient for our use case). Although a full timestamp would need 4 bytes, we can store only the minute and second parts; the hour part is unnecessary if we are not considering a window of an hour. So ‘StartTime’ takes 2 bytes, and we need a total of 12 bytes to store one user’s data.

Now, if our hash-table has an overhead of 20 bytes for each record and we need to track 1 million users at any time, the total memory we need would be :

(12 + 20) bytes * 1 million => 32MB

If we need a 4-byte number to lock each user’s record to resolve our atomicity problems, we would require a total of 36MB memory.
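The arithmetic above can be checked in a few lines of Python; the byte sizes are the assumptions just stated:

```python
# Back-of-the-envelope memory estimate for 1 million tracked users.
record_bytes = 8 + 2 + 2      # UserID (8) + Count (2) + StartTime minute/second (2)
overhead_bytes = 20           # assumed per-record hash-table overhead
lock_bytes = 4                # optional per-record lock for atomicity
users = 1_000_000

base = (record_bytes + overhead_bytes) * users
with_locks = base + lock_bytes * users
print(base // 10**6, "MB,", with_locks // 10**6, "MB")  # 32 MB, 36 MB
```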

Domain Model/High-level Design:

When a new request arrives from the client, the Server first asks the Rate Limiter to decide if it will be served or rejected. If the request is not rejected, then it’ll be passed to the internal API servers.

Rate Limiter’s responsibility is to decide which client request will be served and which request will be declined.

Figure: Domain Model/High-Level Design (Image by Author)

Fixed Window algorithm:

In this algorithm, we consider a time window running from a fixed start to a fixed end. For example, our window can be the 0–60 second span of each minute, irrespective of the time at which the client first requests the server. We already have a structure of count and start time, e.g., UserId → {Count, StartTime}.

Figure I: Rate limiter allows two requests per minute from the client (Image by Author)

Figure I displays the scenario where a client can make two requests per minute. Requests 1 and 2 are allowed by the rate limiter, but the third request is blocked because it arrives within 1 minute from the same user. This type of algorithm is called the Fixed Window algorithm: the period is fixed at 1 minute (in this example), irrespective of the times at which the two allowed API requests arrived.

After 1 minute of the first request from a specific client, the start time is reset for that user. This is a simple approach for rate limiters. But there are some drawbacks to this approach.
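Here is a minimal in-memory sketch of the Fixed Window approach described above. The class and field names are illustrative; a real deployment would keep the records in a shared store rather than a process-local dict:

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per user."""

    def __init__(self, limit, window=60):
        self.limit = limit
        self.window = window
        self.records = {}  # user_id -> {"count": int, "start": float}

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        rec = self.records.get(user_id)
        # Start a fresh window on the first request, or when the old one expires.
        if rec is None or now - rec["start"] >= self.window:
            self.records[user_id] = {"count": 1, "start": now}
            return True
        if rec["count"] < self.limit:
            rec["count"] += 1
            return True
        return False
```

With a limit of 2 per minute, the third call within the window is rejected, and the window resets 60 seconds after the first request.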

Drawback:

It can allow up to twice the number of allowed requests per minute. Imagine the user sends two requests in the last second of a minute (12:01:59) and immediately makes two more requests in the first second of the next minute (12:02:00). That may seem a minor problem for only two requests, but if the allowed number of requests is much higher (for example, 50) or the time frame is much smaller (e.g., 15 seconds), it can mean a huge number of requests hitting the server. And we are considering only one user here; imagine what happens to a service that serves millions of users.

Besides, in a distributed system, the "read-and-then-write" behavior can create a race condition. For example, imagine the client’s current request count is 2 and it makes two more requests. If two separate processes serve these requests and both read the Count value before either updates it, each process will conclude that the client has not hit the rate limit. Thus, we may need to lock each record.

Sliding Window Algorithm:

If we kept track of each request per user in a time frame, we could store the timestamp of each request in a sorted set in the ‘value’ field of our hash table. But that would take a lot of memory. Instead, we can use a sliding window with counters: we keep track of request counts for each user using multiple fixed time windows.

For example, if we have an hourly window, we can count requests per minute, and when a new request arrives, sum all the minute counters from the past hour to evaluate the limit. This reduces the memory we would need for the sorted set.

Suppose we rate-limit at 500 requests per hour with an additional limit of 10 requests per minute. This means the client has exceeded the rate limit when the sum of the counters in the past hour exceeds the request limit (500).

For a smaller time frame, the client can’t send more than ten requests per minute. This would be a hybrid and reasonable consideration as we may think that none of the real users would send such frequent requests. Even if they do, they will get success with retries since their limits get reset every minute.
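The hybrid scheme above (per-minute counters summed over the past hour) can be sketched as follows; the bucket layout and all names are assumptions for illustration, not a production implementation:

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Hybrid limit: at most `per_minute` requests in any minute bucket,
    and at most `per_hour` requests summed over the trailing hour."""

    def __init__(self, per_minute=10, per_hour=500):
        self.per_minute = per_minute
        self.per_hour = per_hour
        self.buckets = defaultdict(dict)  # user -> {minute_epoch: count}

    def allow(self, user, now=None):
        now = int(time.time() if now is None else now)
        minute = now // 60
        counts = self.buckets[user]
        # Evict minute buckets that fell out of the one-hour window.
        for m in list(counts):
            if m <= minute - 60:
                del counts[m]
        if counts.get(minute, 0) >= self.per_minute:
            return False  # per-minute limit hit
        if sum(counts.values()) >= self.per_hour:
            return False  # hourly sum exceeded
        counts[minute] = counts.get(minute, 0) + 1
        return True
```

Storing one counter per minute instead of one timestamp per request is what keeps the memory footprint small.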

Problems in Distributed Environment:

In distributed systems, the algorithms we discussed run into problems. The figure below illustrates one. A user is allowed 3 requests per minute. The user has already made 2 requests and then makes two new requests within 2 seconds; these go through two different load balancers to two different rate limiter services. Since the DB still contains a count of 2, both rate limiter services read a value below the threshold and permit the requests. So we end up allowing 4 requests from the same user within a minute, which breaks the rate limiter’s guarantee. This is the inconsistency problem.

Figure 1: Problems in a distributed environment (Image by Author)

As a solution, we may use sticky-session load balancing to ensure that a given user’s requests always go to the same rate limiter service. But this is not a fault-tolerant design: if that rate limiter service goes down, the user’s requests cannot be served by the system.

We can solve this problem using locks. Once a service reads counter data from the database, it locks the record, so other services cannot use it until the counter is updated. But this procedure has its own share of problems: the locking adds extra latency. So we need to decide on the trade-off between availability and performance.
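As a single-process illustration of the locking idea (a real distributed deployment would instead need a distributed lock or an atomic increment in the shared datastore), the read-modify-write can be guarded with one lock per record; names here are purely illustrative:

```python
import threading
from collections import defaultdict

counts = defaultdict(int)
locks = defaultdict(threading.Lock)  # one lock per user record (illustrative)

def try_consume(user_id, limit):
    """Atomically check-and-increment the user's counter, avoiding the race
    where two workers read the same count before either writes it back."""
    with locks[user_id]:
        if counts[user_id] >= limit:
            return False
        counts[user_id] += 1
        return True
```

Holding the lock across both the read and the write is what closes the "read-and-then-write" race discussed earlier, at the cost of some added latency.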

Caching:

This system may benefit hugely from caching recently active users. Servers can quickly check whether the cache contains the counter value for a user before hitting the database, which is faster than going to the DB on every request.

We may use a write-back cache strategy, updating all counters and timestamps in the cache only and writing to the database at fixed intervals, e.g., every hour. This keeps the latency added to the user’s requests to a minimum and should mitigate the lock-latency problem.

When the server reads data, it can always hit the cache first. This is especially useful once the user has hit their maximum limit, since the rate limiter then only reads data without performing any updates.

We may use the Least Recently Used (LRU) as a cache eviction policy for our system.
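An LRU eviction policy can be sketched with Python’s `OrderedDict`, as a stand-in for a real cache layer; the class is a simplified illustration, not the system’s actual cache:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```

With capacity for the most active users, counters for idle users fall out of the cache naturally while hot entries stay resident.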

Figure 1: Final Design of Rate Limiter (Image by Author)

DoS attack:

To protect the system from a DoS attack, we can apply rate limiting by IP address or by user.

IP: In this strategy, we limit requests per IP. It may not be a good way to differentiate between normal users and attackers, but it’s still better to have some protection than none at all.

The biggest problem with IP-based rate limiting arises when multiple users share a single public IP, for example in an internet cafe, or smartphone users behind the same gateway. Another problem is that a hacker can obtain a vast number of IPv6 addresses from even a single computer.

User: After user authentication, the system provides the user with a token, which the user passes with each request. This ensures that we can rate limit any API that requires a valid authentication token. But then we have to rate-limit the login API itself.

The weakness of this strategy is that a hacker can perform a denial of service attack against a specific user by providing wrong credentials up to the limit; after that, the actual user will be unable to log in.

So, we can use both approaches together. However, this results in more cache entries with more details per entry, so we would need more memory and storage.

Conclusion:

A rate limiter limits the number of events an entity (user, IP, etc.) can perform in a time window; for example, a user can send only five requests per minute. We can shard the data based on ‘UserID’ to distribute the users’ data, and use Consistent Hashing for fault tolerance and replication. We can have different rate limiters for different APIs per user or IP.

The task of a rate limiter is to limit the number of requests to or from a system. Rate limiting is used most often to limit the number of incoming requests from a user in order to prevent DoS attacks. The system can enforce it based on IP address, user account, etc.

We should keep in mind that rate limiting can prevent simple DoS attacks, but in the case of a distributed DoS (DDoS) attack, it may not be enough.


This article is part of the system design basics. Some article links of the series are given here:

Designing Notification System with Message Queues

System Design Basics: Getting started with Caching

System Design Basics: Client-Server architecture

System Design Basics: Load balancer 101

System Design of Google Auto-Suggestion Service

System Design Analysis of Google Drive

System Design Analysis of TinyURL

Availability in Distributed Systems


Reference: Grokking the System Design Interview Course.

Thank you for reading the article. Have a good day

