Load averages can be very useful, but it's problematic to combine them with something like readiness checks. I must point out the corresponding comment in loadavg.c which is required reading:

https://github.com/torvalds/linux/blob/b97d64c722598ffed42ec...

The parent's concept of reporting "hotness" is really powerful. It lets a central authority (the load balancer in this scenario) make an informed decision about where to send traffic: the coolest backend. In the case where all systems are actually quite hot, it probably makes sense in most scenarios to continue sending traffic as best-effort.
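A minimal sketch of that idea (names and the hotness scale are invented for illustration, not taken from the thread): the balancer always picks the backend with the lowest reported hotness, so even a fully hot fleet keeps serving best-effort instead of going empty.

```python
# Hypothetical "coolest backend" selection. Hotness is assumed to be a
# backend-reported score in [0.0, 1.0], where 1.0 means saturated.
def pick_backend(backends):
    """Return the backend with the lowest reported hotness.

    Even when every backend is hot, one is still chosen, so traffic
    continues best-effort rather than being dropped outright.
    """
    return min(backends, key=lambda b: b["hotness"])

backends = [
    {"name": "a", "hotness": 0.92},
    {"name": "b", "hotness": 0.45},
    {"name": "c", "hotness": 0.99},
]
print(pick_backend(backends)["name"])  # -> b
```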

In the case of readiness checks, failing (NACKing) a check removes the backend from the pool of eligible new-traffic recipients -- at least in the systems I've encountered. The proposed solution is problematic because it's possible for all systems to decline traffic, resulting in no available backends -- rarely a desirable state, and one the load balancer usually can't do much about.

A good system will allow you to decouple the signals (loadavg, connections, latency, CPU, etc.) and the health decision (readiness checks) and the interpretation of that decision (load balancer policy) to provide the best of both these worlds.
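To illustrate that decoupling (all names and thresholds here are invented, not from the thread): the backend reports raw signals, a separate health function turns them into a verdict, and the balancer applies its own policy to that verdict.

```python
# Hypothetical three-layer split: signals -> health decision -> LB policy.
def signals():
    # Signal layer: the backend just reports raw measurements.
    return {"loadavg": 7.5, "p99_ms": 480, "cpu": 0.93}

def health(sig):
    # Decision layer: combine signals into a single verdict.
    return "overloaded" if sig["cpu"] > 0.9 else "ok"

def policy(verdict):
    # Interpretation layer: the balancer, not the backend, decides what
    # "overloaded" means for routing (here: deprioritize, don't eject).
    return {"ok": 1.0, "overloaded": 0.1}[verdict]  # routing weight

print(policy(health(signals())))  # -> 0.1
```

The point of the split is that an "overloaded" verdict can mean "send less traffic" under one policy and "eject from the pool" under another, without changing the backend.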

Now, there are many scenarios where simply excusing yourself works fine -- processing expensive batch jobs, for example -- so this can work nicely, but for typical production traffic scenarios I'd advise against this approach.



> as it's possible for all systems to decline traffic resulting in no available backends

I don't see why this is a problem. That's when we should start rejecting requests at the frontend.


Presumably you want your service to be up. If you can smooth the load out across other nodes, your service stays up. If you have a bad strategy, you start losing capacity as nodes get removed.

Let's say each node can handle 5k rps and you have 10 nodes, so you can handle 50k rps total. If you are receiving 40k rps, a good strategy will put each node at 80% capacity. A bad strategy will knock a node out, reducing your total capacity, putting extra pressure on the rest of the system, causing more failures and more pressure. This is a cascading failure.
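The arithmetic above can be played out directly (the 85% readiness cutoff is an assumption for the sketch): once one node is removed, the survivors cross the cutoff, get removed in turn, and the pool collapses.

```python
# Toy model of the cascade: 10 nodes x 5k rps capacity, 40k rps incoming.
CAPACITY = 5_000
INCOMING = 40_000
THRESHOLD = 0.85   # assumed readiness cutoff for this sketch

def utilization(nodes):
    return INCOMING / nodes / CAPACITY

# Healthy fleet: 10 nodes at 80% -- below the cutoff, so nothing is removed.
print(f"10 nodes: {utilization(10):.0%}")

# Now one node fails its check. Its share lands on the survivors, pushing
# them over the cutoff, so they get removed too, and so on.
nodes = 9
while nodes > 0 and utilization(nodes) > THRESHOLD:
    print(f"{nodes} nodes: {utilization(nodes):.0%} -> removed")
    nodes -= 1
print(f"remaining nodes: {nodes}")
```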

At some point, your only option is load shedding. But with a bad LB strategy, you start load shedding much much earlier than you should. This is a bad experience for customers that is avoidable with good LB strategies.


Fair. Our solution is to make sure a biased load balancer (like in TFA) isn't sending the workload to a select few machines while others may not be as overworked, in as simple a way as possible.

We run load balancers in a fail open mode. As in, if every backend is excusing itself, then none are excused.
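A sketch of that fail-open rule (the field names are invented for illustration): filter out excused backends, but if that leaves nothing, fall back to the full pool rather than returning an empty one.

```python
# Hypothetical fail-open eligibility filter.
def eligible(backends):
    """Return backends that haven't excused themselves; if all have,
    fail open and return the full pool rather than an empty one."""
    ready = [b for b in backends if not b["excused"]]
    return ready if ready else backends

pool = [{"name": "a", "excused": True}, {"name": "b", "excused": True}]
print([b["name"] for b in eligible(pool)])  # -> ['a', 'b']: fail open
```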

But as you point out, load balancing is a hairy beast unto itself.


and also spinning up additional instances on the backend


> Load averages can be very useful, but it's problematic to combine them with something like readiness checks.

We don't use loadavg. The linked article in my parent comment makes the same point as you do (:



