Most APIs in the wild implement rate limits. They say "you can only make X requests in Y seconds". If your requests exceed the specified limit, their servers will reject your requests for a period of time, basically saying, "Sorry, we didn't process your request, please try again in 10 seconds."
Many language-specific SDKs and clients, even from major API providers, don't come with built-in rate limit handling. For example, Dropbox's node client does not implement API throttling.
Some companies provide an external module like GitHub's plugin-throttling package for their node clients. But often it's up to you to implement throttling yourself.
These rate limits can be annoying to deal with, especially if you're working with a restrictive sandbox and trying to get something up and running quickly.
Efficiently handling these is more complex than it seems. This post will walk through several different implementations and the pros and cons of each. We'll finish with an example script you can use to run benchmarks against the API of your choice. All examples will be in vanilla JavaScript.
Quick and dirty ⏱️: Getting around API rate limits
Maybe you just want to get something working quickly without error. The easiest way around a rate limit is to delay requests so they fit within the specified window.
For example, if an API allows 6 requests over 3 seconds, we can issue a request every 500ms and never exceed the limit (3000 / 6 = 500).
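In code, that might look like this sketch, where `requests` and `callTheAPI` stand in for your own data and request logic:

```js
// A sketch: space requests 500ms apart to stay under 6 requests per 3 seconds.
// `requests` and `callTheAPI` are placeholders for your own data and request code.
const responses = []
for (const request of requests) {
  const response = await callTheAPI(request)
  responses.push(response)
  await sleep(500) // wait half a second between requests
}
```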
Where `sleep` is:
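A promise-based delay helper along these lines:

```js
// Resolve after the given number of milliseconds
function sleep (milliseconds) {
  return new Promise((resolve) => setTimeout(resolve, milliseconds))
}
```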
This is poor practice! It could still error if you are on the edge of the time window, and it can't handle legitimate bursts. What if you only need to make 6 requests? The code above will take 3 seconds, but the API allows doing all 6 in parallel, which would be significantly faster.
The sleep approach is fine for hobby projects and quick scripts; I admit I've used it in local script situations. But you probably want to keep it out of your production code.
There are better ways!
The dream
The ideal solution hides the details of the API's limits from the developer. I don't want to think about how many requests I can make; I just want to make all the requests efficiently and get the results.
My ideal in JavaScript:
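Something like this sketch, with `callTheAPI` again standing in for your request logic:

```js
// The dream: fire everything off at once and let the client
// quietly handle the rate limit details.
const responses = await Promise.all(requests.map((request) => callTheAPI(request)))
```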
As an API consumer, I also want all my requests to finish as fast as possible within the bounds of the rate limits.
Assuming 10 requests at the previous example limits of 6 requests over 3 seconds, what is the theoretical minimum time? Let's also assume the API can handle all 6 requests in parallel, and that a single request takes 200ms:
- The first 6 requests should complete in 200ms, but the rate limit forces us to wait out the full 3 second window
- The last 4 requests should start at the 3 second mark and take only 200ms
- Theoretical Total: 3200ms or 3.2 seconds
Ok, let's see how close we can get.
Handling the `429 Too Many Requests` error
The first thing we need to nail down is how to handle the `429` error responses the API sends when its limits are exceeded.
If you exceed an API provider's rate limit, their server should respond with a `429` status code (`Too Many Requests`) and a `Retry-After` header.
The `Retry-After` header may contain either the number of seconds to wait or the date when the rate limit is lifted.
The header's date format is not an ISO 8601 date, but an 'HTTP date' format:
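```
<day-name>, <day> <month> <year> <hour>:<minute>:<second> GMT
```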
An example:
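```
Retry-After: Mon, 29 Mar 2021 04:58:00 GMT
```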
Fortunately, if you are a JavaScript / Node user, this format is parsable by passing it to the `Date` constructor.
Here's a function that parses both formats in JavaScript:
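A helper along these lines works; `getMillisToSleep` is an illustrative name:

```js
// Parse a Retry-After header containing either seconds or an HTTP date,
// returning the number of milliseconds to wait.
function getMillisToSleep (retryAfterHeader) {
  let millisToSleep = Math.round(parseFloat(retryAfterHeader) * 1000)
  if (isNaN(millisToSleep)) {
    // Not a number of seconds, so treat it as an HTTP date
    millisToSleep = Math.max(0, new Date(retryAfterHeader) - new Date())
  }
  return millisToSleep
}
```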
Now we can build out a function that uses the `Retry-After` header to retry when we encounter a `429` HTTP status code:
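A sketch, assuming a fetch-style response object with `status` and `headers.get()`:

```js
// Call the API; on a 429, wait out the Retry-After period and try again.
async function fetchAndRetryIfNecessary (callAPIFn) {
  const response = await callAPIFn()
  if (response.status === 429) {
    const retryAfter = response.headers.get('retry-after')
    const millisToSleep = getMillisToSleep(retryAfter)
    await sleep(millisToSleep)
    return fetchAndRetryIfNecessary(callAPIFn)
  }
  return response
}
```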
This function will continue to retry until it no longer gets a `429` status code.
Now we're ready to make some requests!
API rate limit setup
I'm working with a local API and running 10 and 20 requests with the same example limits from above: 6 requests over 3 seconds.
The best theoretical performance we can expect with these parameters is:
- 10 requests: 3.2 seconds
- 20 requests: 9.2 seconds
Let's see how close we can get!
Baseline: sleep between requests
Remember the "quick and dirty" request method we discussed earlier? We'll use its behavior and timing as a baseline to improve on.
A reminder:
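The same sleep-between-requests sketch from before:

```js
// Baseline: one request at a time with a 500ms sleep between each
const responses = []
for (const request of requests) {
  const response = await callTheAPI(request)
  responses.push(response)
  await sleep(500)
}
```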
So how does it perform?
- With 10 requests: about 7 seconds
- With 20 requests: about 14 seconds
Our theoretical time for serial requests is 5 seconds at 10 requests and 10 seconds at 20 requests (500ms of sleep per request), but each request itself also takes time (around 200ms in our setup), so the real times are a little higher.
Here's a 10 request pass:
Approach 1: serial with no sleep
Now that we have a function for handling the error and retrying, let's try removing the sleep call from the baseline.
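A sketch: the serial loop with the sleep removed and the retry wrapper added:

```js
// Serial requests with no sleep, relying on the 429 handling to back off
const responses = []
for (const request of requests) {
  const response = await fetchAndRetryIfNecessary(() => callTheAPI(request))
  responses.push(response)
}
```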
Looks like about 4.7 seconds: definitely an improvement, but not quite at the theoretical minimum of 3.2 seconds.
Approach 2: parallel with no throttling
Let’s try burning through all requests in parallel just to see what happens.
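A sketch, reusing the retry wrapper:

```js
// Fire every request at once and let the 429 handling sort out the rest
const responses = await Promise.all(
  requests.map((request) => fetchAndRetryIfNecessary(() => callTheAPI(request)))
)
```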
This run took about 4.3 seconds. This is a slight improvement over the previous serial approach, but the retry is slowing us down. You can see the last 4 requests all had to retry.
This looks pretty reasonable with only 4 retries, but this approach does not scale; retries only get worse with more requests. If we had, say, 20 requests, a number of them would need to retry more than once: we'd need 4 separate 3 second windows to complete all 20 requests, so some requests would retry at least 3 times.
Additionally, the ratelimiter implementation my example server uses shifts the `Retry-After` timestamp on subsequent requests when a client is already at the limit: it returns a `Retry-After` timestamp based on the 6th-oldest request timestamp plus 3 seconds.
That means if you make more requests while you're already at the limit, it drops old timestamps and shifts the `Retry-After` timestamp later. As a result, the `Retry-After` timestamps held by requests waiting to retry go stale: those requests retry, fail again, and trigger yet another retry, pushing the `Retry-After` timestamp out even further. All of this spirals into a vicious loop of mostly retries. Very bad.
Here is a shortened log of it attempting to make 20 requests. Some requests needed to retry 35 times (❗) because of the shifting window and stale `Retry-After` headers. It eventually finished, but took a whole minute. Bad implementation, do not use.
Approach 3: parallel with async.mapLimit
It seems like a simple solution to the problem above would be to run only `n` requests in parallel at a time. For example, our demo API allows 6 requests in a time window, so just allow 6 in parallel, right? Let's try it out.
There is a node package called async that implements this behavior (among many other things) in a function called `mapLimit`.
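A sketch; `async.mapLimit` takes the collection, a concurrency limit, and an async iteratee, and returns a promise when no callback is given:

```js
const async = require('async')

// Run at most 6 requests in flight at any given time
const responses = await async.mapLimit(requests, 6, async (request) =>
  fetchAndRetryIfNecessary(() => callTheAPI(request))
)
```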
After many 10-request runs, 5.5 seconds was about the best case, slower than even the serial runs.
At 20 requests, it finished in about 16 seconds. The upside is that it does not suffer from the retry death spiral we saw in the previous parallel implementation! But it's still slow. Let's keep digging.
Approach 4: winning with a token bucket
So far none of the approaches have been optimal. They have all been slow, triggered many retries, or both.
The ideal scenario that would get us close to our theoretical minimum of 3.2 seconds for 10 requests would be to attempt only 6 requests per 3 second time window, e.g.
- Burst 6 requests in parallel
- Wait until the frame resets
- `GOTO` 1
The `429` handling is nice and we will keep it, but we should treat it as an exceptional case, since it represents unnecessary work. The goal here is to make all the requests without triggering a retry under common circumstances.
Enter the token bucket algorithm. Our desired behavior is its intended purpose: you have `n` tokens to spend over some time window; in our case, 6 tokens over 3 seconds. Once all tokens are spent, you need to wait the window duration to receive a new set of tokens.
Here is a simple implementation of a token bucket for our specific purpose. It will count up until it hits `maxRequests`; any requests beyond that will wait `maxRequestWindowMS` and then attempt to acquire a token again.
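A sketch of such a class, reusing the `sleep` helper from earlier; the names are my own:

```js
class TokenBucketRateLimiter {
  constructor ({ maxRequests, maxRequestWindowMS }) {
    this.maxRequests = maxRequests
    this.maxRequestWindowMS = maxRequestWindowMS
    this.reset()
  }

  reset () {
    this.count = 0
    this.resetTimeout = null
  }

  scheduleReset () {
    // The first token acquired in a window schedules that window's reset
    if (!this.resetTimeout) {
      this.resetTimeout = setTimeout(() => this.reset(), this.maxRequestWindowMS)
    }
  }

  async acquireToken (fn) {
    this.scheduleReset()
    if (this.count === this.maxRequests) {
      // Bucket is empty: wait out a window, then try again
      await sleep(this.maxRequestWindowMS)
      return this.acquireToken(fn)
    }
    this.count += 1
    return fn()
  }
}
```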
Let's try it out!
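Wiring it into the parallel approach from before:

```js
const rateLimiter = new TokenBucketRateLimiter({
  maxRequests: 6,
  maxRequestWindowMS: 3000,
})

const responses = await Promise.all(
  requests.map((request) =>
    fetchAndRetryIfNecessary(() => rateLimiter.acquireToken(() => callTheAPI(request)))
  )
)
```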
With 10 requests it's about 4 seconds. The best so far, and with no retries!
And 20 requests? It takes about 10 seconds total. The whole run is super clean with no retries. This is exactly the behavior we are looking for!
Approach 4.1: using someone else's token bucket
The token bucket implementation above was for demonstration purposes. In production, you might want to avoid maintaining your own token bucket if you can help it.
If you're using node, there is a node module called limiter that implements token bucket behavior. The library is more general than our `TokenBucketRateLimiter` class above, but we can use it to achieve the exact same behavior:
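A sketch wrapping it in the same interface; this assumes limiter's v2 API, where `RateLimiter` takes `tokensPerInterval` / `interval` options and `removeTokens` returns a promise:

```js
const { RateLimiter } = require('limiter')

class LimiterLibraryRateLimiter {
  constructor ({ maxRequests, maxRequestWindowMS }) {
    this.limiter = new RateLimiter({
      tokensPerInterval: maxRequests,
      interval: maxRequestWindowMS,
    })
  }

  async acquireToken (fn) {
    // Resolves once a token is available in the current window
    await this.limiter.removeTokens(1)
    return fn()
  }
}
```

Note that limiter refills tokens gradually over the interval rather than in a single burst at the window boundary, so burst timing can differ slightly from our class above.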
Usage is the same as the previous example; just swap `LimiterLibraryRateLimiter` in place of `TokenBucketRateLimiter`:
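```js
const rateLimiter = new LimiterLibraryRateLimiter({
  maxRequests: 6,
  maxRequestWindowMS: 3000,
})

const responses = await Promise.all(
  requests.map((request) =>
    fetchAndRetryIfNecessary(() => rateLimiter.acquireToken(() => callTheAPI(request)))
  )
)
```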
Other considerations
With the token bucket in the two approaches above, we have a workable solution for consuming rate-limited APIs in production. Depending on your architecture, there may be some other considerations.
Success rate limit headers
APIs with rate limits often return rate-limit headers on successful requests, e.g.
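Something like the following; the exact header names and values vary by provider:

```
HTTP/1.1 200 OK
X-RateLimit-Limit: 6
X-RateLimit-Remaining: 4
X-RateLimit-Reset: 1616876160
```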
The header names are a convention at the time of writing, but many APIs use headers like the ones above.
You could drive your token bucket with the values from these headers rather than keeping state in your API client.
Throttling in a distributed system
If you have multiple nodes making requests to a rate-limited API, storing the token bucket state locally on a single node will not work. A couple of options to minimize the number of retries might be:
- X-Ratelimit headers: Using the headers described above
- Shared state: You could keep the token bucket state in something available to all nodes, like redis (see the sketch below)
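As a rough sketch of the shared-state option, here is a fixed-window counter kept in redis. It assumes a connected node-redis v4 client and reuses the `sleep` helper; the key name and function are illustrative:

```js
// All nodes share one request counter per time window in redis
async function acquireSharedToken (redisClient, { maxRequests, maxRequestWindowMS }) {
  const window = Math.floor(Date.now() / maxRequestWindowMS)
  const key = `ratelimit:${window}`

  const count = await redisClient.incr(key)
  if (count === 1) {
    // First request in this window: expire the key when the window ends
    await redisClient.pExpire(key, maxRequestWindowMS)
  }

  if (count > maxRequests) {
    // Over the limit: wait for the next window, then try again
    await sleep(maxRequestWindowMS - (Date.now() % maxRequestWindowMS))
    return acquireSharedToken(redisClient, { maxRequests, maxRequestWindowMS })
  }
}
```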
API Throttling Verdict: use a token bucket
Hopefully it's clear that using a token bucket is the best way to implement API throttling. Overall this implementation is clean, scalable, and about as fast as possible without triggering retries. And if there is a retry? You're covered by the `429 Too Many Requests` handling discussed in the beginning.
Even if you don't use JavaScript, the ideas discussed here are transferable to any language. Feel free to re-implement the `TokenBucketRateLimiter` above in your favorite language if you can't find a suitable alternative!
Note: check out the example script I used to run these benchmarks. You should be able to use it against your own API by putting your request code into the `callTheAPI` function.
For any questions, feel free to reach out at developers@useanvil.com.