Goodput Matters More Than Throughput

Goodput, the throughput of useful work, is the metric that actually matters. A system can have high throughput while accomplishing almost nothing if most of that work is retries, wasted computation, or requests that will never complete.

"A good way to weaken or break the feedback loop is to ensure that the goodput remains high even during overload." (Huang et al., "Metastable Failures in Distributed Systems")

The distinction between throughput and goodput becomes critical during overload. A system processing 10,000 requests per second might have goodput near zero if 9,900 of those requests are retries of the same failed operations. The system looks busy (CPU is pegged, queues are full, the network is saturated), but no useful work is being completed. This is the hallmark of a metastable failure: the system is drowning in self-generated load while customers see nothing but errors.
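The arithmetic behind that example can be made concrete with a small sketch. It assumes, as an upper bound, that every retry fails and every non-retry request succeeds; the function and variable names are illustrative and not taken from any library.

```python
def throughput_rps(total_handled: int, window_seconds: float) -> float:
    """All requests the system handled per second, useful or not."""
    return total_handled / window_seconds


def goodput_rps(useful_completions: int, window_seconds: float) -> float:
    """Only the requests that completed useful work per second."""
    return useful_completions / window_seconds


# Numbers from the example above, measured over a one-second window.
total_handled = 10_000
failed_retries = 9_900

# Assumption: all remaining (non-retry) requests succeed, so this is the
# best case for goodput; in a real overload it could be lower still.
useful_completions = total_handled - failed_retries

throughput = throughput_rps(total_handled, window_seconds=1.0)
goodput = goodput_rps(useful_completions, window_seconds=1.0)
goodput_ratio = goodput / throughput

print(f"throughput={throughput:.0f} rps, goodput={goodput:.0f} rps, "
      f"ratio={goodput_ratio:.0%}")
```

Even in this best case, at most 1% of the work the system performs is useful, which is why a throughput dashboard alone can look healthy during an outage.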

Work amplification is the mechanism that drives goodput to zero. Retries multiply failed requests. Failover routes traffic to healthy replicas that then become unhealthy. Cache misses cascade into database queries that generate more cache misses. In one extreme case, a geo-distributed system's combination of retries and failover destinations produced a worst-case work amplification of over 100x: each user request could generate more than a hundred internal requests during overload.
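One way to see how amplification compounds is to multiply attempts and failover destinations across the layers of a call path. The sketch below is a simplified worst-case model; the layer names and settings are hypothetical, chosen only for illustration, and are not the configuration of the system described above.

```python
from math import prod
from typing import NamedTuple


class Layer(NamedTuple):
    name: str
    attempts: int      # initial try plus retries at this layer
    destinations: int  # failover targets each attempt may fan out to


def worst_case_amplification(layers: list[Layer]) -> int:
    """Worst-case number of requests reaching the last layer for one user
    request, assuming every attempt fails and every destination is tried."""
    return prod(layer.attempts * layer.destinations for layer in layers)


# Hypothetical three-layer call path (illustrative values only).
call_path = [
    Layer("client", attempts=3, destinations=2),
    Layer("frontend", attempts=3, destinations=2),
    Layer("storage", attempts=3, destinations=1),
]

amplification = worst_case_amplification(call_path)
print(amplification)  # 108
```

Because the factors multiply, modest per-layer settings are enough to pass 100x in this model, and capping attempts at any single layer shrinks the whole product.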

The remedy is to design systems that preserve goodput under stress. Switch to LIFO scheduling during overload so at least some requests meet their deadlines. Shed load through admission control rather than letting everything queue up and time out. Prioritize fresh user requests over retried ones. DynamoDB provides an instructive model: its performance is predictable because it does bounded work for every operation. There is an upper limit on work amplification by design, not just by hope. The system that processes fewer total requests but completes more of them successfully is the one your users actually experience as reliable.
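The first three tactics can be combined in a single bounded queue. What follows is a minimal sketch under stated assumptions, not DynamoDB's design or any published implementation: the class name `OverloadAwareQueue`, the two depth thresholds, and the injectable `clock` are all hypothetical choices made for illustration.

```python
import time
from collections import deque
from typing import Callable, Optional


class OverloadAwareQueue:
    """Bounded queue: FIFO normally, LIFO once depth passes a threshold."""

    def __init__(self, max_depth: int, overload_depth: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_depth = max_depth            # hard cap: shed beyond this
        self.overload_depth = overload_depth  # depth where overload policy starts
        self._clock = clock
        self._items = deque()                 # (request_id, deadline) pairs

    def offer(self, request_id: str, deadline: float,
              is_retry: bool = False) -> bool:
        """Admission control: return False to reject instead of queueing."""
        depth = len(self._items)
        if depth >= self.max_depth:
            return False                      # full: shed all new work
        if is_retry and depth >= self.overload_depth:
            return False                      # overloaded: fresh requests only
        self._items.append((request_id, deadline))
        return True

    def take(self) -> Optional[str]:
        """Return the next request worth serving, or None if there is none."""
        while self._items:
            overloaded = len(self._items) > self.overload_depth
            # Under overload, serve the newest request first: it is the one
            # most likely to still meet its deadline.
            request_id, deadline = (self._items.pop() if overloaded
                                    else self._items.popleft())
            if deadline > self._clock():
                return request_id
            # Deadline already passed: drop it rather than do wasted work.
        return None
```

A caller would reject with a fast error whenever `offer` returns False, so clients learn about overload immediately instead of waiting in a queue for a timeout. The thresholds would need tuning against real latency targets.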

Takeaway: Optimize for useful work completed, not total work attempted. During overload, the system that does less total work often delivers more value.


See also: Metastable Failures Are the Hardest to Prevent | Efficiency Is The Enemy of Resilience | Circuit Breakers Are Not Enough