In modern app development, many applications are distributed, with multiple instances running in parallel. While this brings numerous benefits, it also adds complexity to things that used to be a lot simpler - one of those things is rate limiting. In this blog post, we will look closer at what rate limiting is, some real-world applications, and how we can implement it in a distributed environment.

Why distributed rate limiting?

Recently, I was tasked with building an integration platform for a client. The platform had a few downstream dependencies with very strict rate limiting policies. This meant that when many events occurred, there was a high risk of overwhelming these dependencies with numerous API calls. Typically, it’s possible to mitigate this by adding a Polly Rate Limiter strategy, but since this platform was built on Azure Integration Services using Azure Functions, an in-process limiter would only apply per instance - as the app scaled out, the combined traffic would still cause a lot of unnecessary API calls.

Distributed rate limiting solves this by turning the limit into a shared contract: every instance must obtain a token from a central store (Redis) before it can call the API. If no token is available, the instance must wait to obtain one.
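
In practice, every caller runs the same small loop: ask Redis for permission before acting, and if permission is denied, sleep for the suggested wait time and try again. Below is a minimal sketch of that loop in Python with redis-py - an assumption on my part, since any Redis client works the same way - for acquire scripts that return an {allowed, wait_ms} pair, which is the convention used by the window and bucket scripts later in this post.

import time
import redis

r = redis.Redis(host="localhost", port=6379)

def call_with_rate_limit(acquire_lua: str, keys: list, make_args, action):
    """Ask Redis for a token before every call; if refused, wait and retry."""
    script = r.register_script(acquire_lua)   # wraps EVALSHA with an EVAL fallback
    while True:
        # make_args() is called on every attempt so the timestamp argument stays fresh
        allowed, wait_ms = script(keys=keys, args=make_args())
        if allowed == 1:
            return action()
        time.sleep(wait_ms / 1000.0)          # back off for as long as Redis suggested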

Picking the right algorithm

When talking about rate limiting, it’s important to understand the different algorithms available. The table below gives a quick overview of the algorithms we will cover in this post and their typical use cases.

Algorithm             | Best for                   | Handles bursts?    | Memory | Complexity
----------------------|----------------------------|--------------------|--------|-----------
Fixed Window          | Simple, predictable caps   | Poor               | Low    | ★☆☆
Sliding Window        | Evenly spread traffic      | Good               | Medium | ★★☆
Token Bucket          | Bursts and sustained flow  | Great              | Low    | ★★★
Distributed Semaphore | Max concurrent actions     | N/A (concurrency)  | Low    | ★☆☆

Algorithms in detail

Fixed Window

This algorithm divides time into fixed windows and allots each window a set number of tokens. Each action consumes a token, and once all tokens in the current window have been used, further requests are rejected until the window ends. The available token count resets at the start of every window.

This strategy is fairly straightforward to implement, as all it does is track a count and an expiration time. When the expiration time is reached, the count is reset. This makes it a good candidate for use cases where consumption is fairly even, meaning token use is spread out over the rate limiting window. It also highlights the biggest drawback of this strategy: bursts. With a large window and a high token count, the strategy may allow many tokens to be used at once, causing spikes in usage.

Use when: Limits are defined per interval and bursts are acceptable.

local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- INCR counts this call and creates the key on first use;
-- the first call in a window also starts the expiry timer
local current = tonumber(redis.call('INCR', key))
if current == 1 then
  redis.call('PEXPIRE', key, window_ms)
end

if current > max_tokens then
  local ttl = redis.call('PTTL', key)
  return {0, ttl} -- ttl = time client should wait (ms)
else
  return {1, 0}
end

It’s effectively the same as Redis’s official INCR‑based fixed‑window rate‑limiter pattern, so you can rely on well‑tested behavior.
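
Calling the script from application code is a matter of passing the key, the limit, the window size, and the current time, then honoring the wait hint on rejection. Here is a rough sketch in Python with redis-py (an assumption - the platform in this post runs on Azure Functions), where FIXED_WINDOW_LUA is assumed to hold the script above:

import time
import redis

r = redis.Redis()
fixed_window = r.register_script(FIXED_WINDOW_LUA)  # FIXED_WINDOW_LUA = the Lua script above

def try_call(key: str, max_tokens: int, window_ms: int) -> bool:
    now_ms = int(time.time() * 1000)
    allowed, wait_ms = fixed_window(keys=[key], args=[max_tokens, window_ms, now_ms])
    if allowed == 0:
        time.sleep(wait_ms / 1000.0)  # wait_ms = time left in the current window
        return False
    return True

# e.g. allow at most 100 calls per one-minute window towards a downstream API
try_call("ratelimit:orders-api", max_tokens=100, window_ms=60_000)

The sliding window and token bucket scripts below return the same {allowed, wait_ms} pair, so the calling pattern is identical.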

Sliding Window

In contrast to the fixed window strategy, we now keep track of when each token was consumed and make it available again once a full window has passed since that moment. This helps spread out token usage to prevent spikes and is more flexible in where it can be applied. The main drawback is the increased complexity of the algorithm, as it requires storing a timestamp for every consumed token.

Use when: You must smooth traffic and avoid spikes.

local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Drop entries that have slid out of the window
redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window_ms)

local count = redis.call('ZCARD', key)

if count < max_tokens then
  -- Record this call; the random suffix keeps members unique when calls share a timestamp
  redis.call('ZADD', key, now, now .. "-" .. math.random(100000, 999999))
  redis.call('PEXPIRE', key, window_ms)
  return {1, 0}
else
  -- Rejected: the caller should wait until the oldest entry slides out of the window
  local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
  local wait_time = window_ms - (now - tonumber(oldest[2]))
  return {0, wait_time}
end
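
The calling convention is the same as for the fixed window script; only the meaning of the wait hint changes. A short sketch (again Python with redis-py, with SLIDING_WINDOW_LUA assumed to hold the script above):

import time
import redis

r = redis.Redis()
sliding_window = r.register_script(SLIDING_WINDOW_LUA)  # the Lua script above

now_ms = int(time.time() * 1000)
# Allow 50 calls per rolling 10-second window
allowed, wait_ms = sliding_window(keys=["ratelimit:search-api"], args=[50, 10_000, now_ms])
if allowed == 0:
    # wait_ms = time until the oldest recorded call slides out of the window
    time.sleep(wait_ms / 1000.0)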

Token Bucket

The easiest way to understand this algorithm is to imagine a bucket of tokens, as the name implies. The bucket starts full, each action consumes one or more tokens, and the bucket refills at a constant rate. This strategy allows for bursts when needed, but also spreads out token usage after a burst period. The algorithm is very flexible, as there are a few parameters to tweak to get the desired rate limit.

Use when: You need burst tolerance and a steady refill rate.

local rate_limit_key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])         -- current time in milliseconds
local token_cost = tonumber(ARGV[4])

local bucket = redis.call('HMGET', rate_limit_key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
if tokens == nil then
  -- First use of this key: start with a full bucket
  tokens = capacity
  last_refill = now
end

-- Refill based on how much time has passed since the last successful call
local elapsed = math.max(0, now - last_refill)
local refill_tokens = elapsed * refill_rate / 1000.0
tokens = math.min(capacity, tokens + refill_tokens)

if tokens < token_cost then
  -- Not enough tokens: report how long until the deficit has been refilled
  local tokens_needed = token_cost - tokens
  local wait_time = math.ceil((tokens_needed * 1000) / refill_rate)
  return {0, wait_time}
else
  tokens = tokens - token_cost
  redis.call('HSET', rate_limit_key, 'tokens', tokens, 'last_refill', now)
  -- Expire the key after the time it takes to refill a full bucket with no activity
  redis.call('PEXPIRE', rate_limit_key, math.ceil((capacity / refill_rate) * 1000))
  return {1, 0}
end
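
Because the script takes a token cost, heavier operations can be made to consume more of the budget than lighter ones. A rough usage sketch (Python with redis-py assumed, TOKEN_BUCKET_LUA holding the script above):

import time
import redis

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)  # the Lua script above

def acquire(key: str, capacity: int, refill_per_sec: float, cost: int = 1) -> None:
    """Block until `cost` tokens are available, using the wait hint from Redis."""
    while True:
        now_ms = int(time.time() * 1000)
        allowed, wait_ms = token_bucket(keys=[key],
                                        args=[capacity, refill_per_sec, now_ms, cost])
        if allowed == 1:
            return
        time.sleep(wait_ms / 1000.0)

# Example: a 20-token bucket refilled at 5 tokens/second; a bulk call costs 3 tokens
acquire("ratelimit:partner-api", capacity=20, refill_per_sec=5, cost=3)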

Distributed Semaphore

This strategy is a bit different from the others we have covered so far, but it has its place in rate limiting. Instead of limiting the rate of calls, it limits the number of concurrent actions at any given time. This is particularly useful if you want to cap how many applications access a database or an API at once. A caller obtains a token from the semaphore, performs its action, and releases the token when finished, freeing the slot for the next caller.

Use when: You must cap concurrency (e.g. max 10 simultaneous requests).

local key = KEYS[1]
local now = tonumber(ARGV[1])
local timeout = tonumber(ARGV[2]) -- ms after which a held slot is treated as abandoned
local limit = tonumber(ARGV[3])   -- maximum number of concurrent holders
local member = ARGV[4]            -- unique id for this caller

-- Evict holders that timed out without releasing their slot
redis.call('ZREMRANGEBYSCORE', key, '-inf', now - timeout)
local count = redis.call('ZCARD', key)
if count < limit then
  redis.call('ZADD', key, now, member)
  return 1
end
return 0
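
The script above only acquires a slot; releasing one is a single ZREM, which is atomic on its own and needs no Lua. A sketch of the acquire/act/release cycle (Python with redis-py assumed, SEMAPHORE_LUA holding the acquire script above):

import time
import uuid
import redis

r = redis.Redis()
acquire_slot = r.register_script(SEMAPHORE_LUA)  # the acquire script above

def with_semaphore(key: str, limit: int, timeout_ms: int, action):
    member = str(uuid.uuid4())              # unique holder id, so we only ever release our own slot
    while True:
        now_ms = int(time.time() * 1000)
        if acquire_slot(keys=[key], args=[now_ms, timeout_ms, limit, member]) == 1:
            break
        time.sleep(0.05)                    # the script gives no wait hint, so poll briefly
    try:
        return action()
    finally:
        r.zrem(key, member)                 # release the slot for the next caller

with_semaphore("semaphore:reporting-db", limit=10, timeout_ms=30_000,
               action=lambda: print("doing the rate-limited work"))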

Conclusion

Distributed rate limiting isn’t about picking a “best” algorithm - it’s about matching strategy to workload:

  • Fixed Window – trivial to code, fine for low‑variance traffic.
  • Sliding Window – smooths spikes when fairness matters.
  • Token Bucket – mixes bursts with sustained flow; most versatile.
  • Semaphore – caps pure concurrency where “one‑in, one‑out” is vital.

By utilizing Redis and wrapping the logic in Lua, we guarantee atomicity across every function instance, container, or worker that scales out (Redis Lua scripting guarantees atomic execution). Pair that with sensible TTLs and key namespacing, and you have a rate‑limit implementation that’s:

  • Centralised – one contract, many callers.
  • Cloud‑ready – drop‑in for Azure Functions, Kubernetes jobs, or anything stateless.

I hope the walkthrough was useful. If you’d like to see these scripts packaged in a NuGet library or hosted in a GitHub repo, just let me know in the comments!