James Stanley


Against exponential backoff

Thu 22 May 2025
Tagged: software

When software uses some external dependency, like an HTTP service, it sometimes has to handle failures. Sometimes people choose exponential backoff: after each failure, you wait longer before retrying, for example by doubling the delay each time. In this post I'm going to argue that exponential backoff is a bad idea.

I know why you are tempted by exponential backoff. It sounds clever and appealing. If your dependency is down for N minutes then you only make O(log N) attempts to contact it. Everybody loves O(log N)! What's not to like?

But I think you shouldn't do it, for these reasons:

1. it wastes O(N) time

With doubling, your final sleep is roughly as long as all the earlier sleeps put together. So if your dependency is down for an hour, you could sleep for another hour while it is actually already working. Your dependency is down for 1 hour, your service is down for up to 2 hours. Is that what you want?

That is not a good experience. Nobody would choose that. By choosing exponential backoff, you're choosing that.
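
To put rough numbers on this, here is a small simulation sketch. It assumes a made-up client whose first failed attempt happens at time 0, whose attempts are instantaneous, and which multiplies its sleep by a fixed factor after every failure. The function name and parameters are hypothetical, purely for illustration.

```python
def first_success_time(outage_end, base_delay=1.0, factor=2.0):
    """Return the time (in seconds) of the first retry made at or after outage_end.

    Hypothetical model: the first failed attempt happens at t=0 and each
    attempt is treated as instantaneous.
    """
    t = 0.0
    delay = base_delay
    while t < outage_end:
        t += delay       # sleep, then try again
        delay *= factor  # factor=2.0 doubles the sleep; factor=1.0 keeps it constant
    return t


outage = 2048.0  # the dependency comes back just after the retry at t=2047

doubling = first_success_time(outage, factor=2.0)
constant = first_success_time(outage, factor=1.0)

print(doubling - outage)  # extra downtime with doubling: 2047.0
print(constant - outage)  # extra downtime with a constant 1 second sleep: 0.0
```

This outage length is deliberately chosen to be close to the worst case. For other outage lengths the extra downtime with doubling lands somewhere between zero and roughly the length of the outage.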

2. it makes debugging difficult

Let's say you get in to work and there's an outage of your important Fooservice. You work out that the root cause of your outage is that Barservice was down.

You fix Barservice. (Let's be honest, you simply restart Barservice and it mysteriously starts working again. Whatever, doesn't matter.)

But your customer-facing, actually-money-making, super-important end-user product Fooservice still isn't working! What's going on?

It turns out that Fooservice is stuck in sleep(1800) for the next 25 minutes. It hasn't even tried to see if Barservice is back.

Is that what you want?

What do you do next?

Of course you restart Fooservice because you know Barservice is back and you want Fooservice to start working again. When you're looking at it, your revealed preference is for the service to try again much sooner than in 25 minutes' time.

But if Fooservice wasn't using exponential backoff it would have already started working again on its own.

3. it composes geometrically

If Fooservice depends on Barservice, which depends on Bazservice, and Bazservice is down for half an hour, then Barservice might be down for an hour and Fooservice might be down for 2 hours!

Is that what you want?
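
A rough sketch of how the delays stack up across two layers. Everything here is an assumption for illustration: both services first fail at time 0, attempts are instantaneous, Fooservice only succeeds once Barservice has recovered, and the two services happen to use different initial sleeps (1 second and 0.75 seconds).

```python
def first_success_time(outage_end, base_delay, factor):
    """Time of the first retry at or after outage_end, starting to fail at t=0."""
    t = 0.0
    delay = base_delay
    while t < outage_end:
        t += delay
        delay *= factor
    return t


baz_back = 1025.0  # Bazservice comes back after about 17 minutes

# Doubling backoff in both Barservice and Fooservice.
bar_back = first_success_time(baz_back, base_delay=1.0, factor=2.0)
foo_back = first_success_time(bar_back, base_delay=0.75, factor=2.0)
print(bar_back, foo_back)  # 2047.0 3071.25

# Constant sleeps in both (factor=1.0 never grows the delay).
bar_back_const = first_success_time(baz_back, base_delay=1.0, factor=1.0)
foo_back_const = first_success_time(bar_back_const, base_delay=0.75, factor=1.0)
print(bar_back_const, foo_back_const)  # 1025.0 1025.25
```

With these particular numbers a 17 minute outage at the bottom turns into about 51 minutes at the top. In the worst case each layer can nearly double the outage it sees, which is where the half hour, 1 hour, 2 hours figure comes from.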

4. compute is cheap

If you have no delay between retries then yeah you max out a CPU core on your machine and you hammer the remote service.

But compute is cheap enough that if you wait 1 second between retries then it probably won't be a problem.

If it is a problem then pick a number larger than 1. Say, your end of the request takes 9 seconds of CPU time to initialise, so if you only sleep 1 second every time then you still have a 90% duty cycle on CPU usage which is unacceptable. Sleep more than 1 second! I don't care. Just don't let it keep growing. Pick an allowable duty cycle, set your sleep accordingly, don't let it grow.
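
The duty cycle arithmetic is simple enough to write down. This is only a sketch, with hypothetical function names: duty cycle is work time divided by work time plus sleep time, so you can solve for the constant sleep that keeps you under whatever limit you pick.

```python
def duty_cycle(work_seconds, sleep_seconds):
    """Fraction of wall-clock time spent working rather than sleeping."""
    return work_seconds / (work_seconds + sleep_seconds)


def sleep_for_duty_cycle(work_seconds, max_duty_cycle):
    """Constant sleep between retries that gives exactly max_duty_cycle.

    From duty = work / (work + sleep), solving for sleep gives
    sleep = work * (1 - duty) / duty.
    """
    if not 0 < max_duty_cycle <= 1:
        raise ValueError("max_duty_cycle must be in (0, 1]")
    return work_seconds * (1 - max_duty_cycle) / max_duty_cycle


# The example from the text: 9 seconds of CPU per attempt, 1 second sleep.
print(duty_cycle(9, 1))  # 0.9

# If a 10% duty cycle is acceptable, sleep about 81 seconds, every time.
print(round(sleep_for_duty_cycle(9, 0.10), 3))  # 81.0
```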

If you really must...

If you really must use exponential backoff then at least make it a bounded exponential backoff. Instead of doubling forever, pick a (low!) upper limit on how long you will sleep. Ideally your maximum sleep should be almost imperceptible at human time scales. You should be able to fix whatever dependency is broken, and then your dependent service should be working again faster than you can find out that it's not.

Don't make it possible for your program to sleep for an hour just because a dependency has been down for an hour. Make your program cap out at 10 second sleeps or something. Please.
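
A minimal sketch of what a capped backoff could look like. The names capped_delays and retry_forever are hypothetical, the 10 second cap is just the number from above, and catching every Exception is a simplification for the example; real code would probably want to catch only the errors that mean "the dependency is down".

```python
import itertools
import time


def capped_delays(base_delay=1.0, factor=2.0, max_delay=10.0):
    """Yield sleep times that grow by `factor` but never exceed max_delay."""
    delay = base_delay
    while True:
        yield min(delay, max_delay)
        delay = min(delay * factor, max_delay)


def retry_forever(operation, delays, sleep=time.sleep):
    """Call operation() until it succeeds, sleeping between failed attempts.

    `delays` is expected to be an endless iterable of sleep times, such as
    capped_delays(). `sleep` is injectable so the loop can be tested
    without actually waiting.
    """
    for delay in delays:
        try:
            return operation()
        except Exception:  # simplification: treat any error as "try again"
            sleep(delay)


print(list(itertools.islice(capped_delays(), 6)))  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

Passing itertools.repeat(1.0) as the delays gives you plain constant backoff with the same retry loop.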

The downfall of civilisation

Btw I think we're living through the downfall of civilisation. It seems like every day stuff gets broken faster than it gets repaired. OK, it doesn't seem like it's changing that much from day to day, but Rome didn't fall in a day. What can we do about it? Probably nothing. But we can at least prolong it by eschewing exponential backoff in favour of constant backoff. Thanks for listening to my TED talk.

And remember: friends don't let friends implement exponential backoff.


