Make Your Failure Paths Cheap
In most systems, the success path is carefully optimized while the failure path is an afterthought. When failures start cascading, the expensive failure path becomes the dominant code path, and the system collapses under its own error handling.
"The sustaining effect is almost always associated with exhaustion of some resource. Surprisingly, feedback loops associated with resource exhaustion are often created by features that improve efficiency and reliability in the steady state." Huang et al., Metastable Failures in Distributed Systems
The success path in a performance-critical application might require only RAM access, with engineers optimizing even TLB miss rates. The failure path, coded for debuggability, might capture a stack trace (heavy CPU), perform a DNS lookup to identify the client (blocking a thread), write a detailed message to disk (occasionally blocking on I/O), and send telemetry to a centralized logging service (consuming network bandwidth). Under normal operation, failures are rare and this overhead is invisible. During an incident, when the failure path becomes the hot path, each of these operations amplifies the resource shortage that caused the failure in the first place.
This is a direct mechanism for metastable failures. The system runs out of a resource. Error handling consumes more of that resource. More errors occur. More error handling runs. The feedback loop is self-sustaining. An analysis of production failures in distributed data-intensive systems found that the majority of catastrophic failures could have been prevented by simple testing of error-handling code, not because the error handling was logically wrong, but because it was never exercised at scale.
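The feedback loop can be made concrete with a toy model; the `simulate` function and all numbers are illustrative, not taken from any measured system:

```python
def simulate(ticks, capacity, load, error_cost):
    """Toy model of the feedback loop: each tick the server can do
    `capacity` units of work; requests beyond that fail; handling each
    failure consumes `error_cost` units of the NEXT tick's capacity."""
    failures_history = []
    overhead = 0.0
    for _ in range(ticks):
        effective = max(capacity - overhead, 0)   # capacity left for real work
        failures = max(load - effective, 0)       # requests that miss out fail
        failures_history.append(failures)
        overhead = failures * error_cost          # error handling eats capacity
    return failures_history
```

With capacity 100, load 105, and an error-handling cost of 2 units per failure, a 5-request overload grows every tick until the server does nothing but handle errors; with a free error path (`error_cost=0`), failures stay at 5 indefinitely.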
The fix is to treat error paths as performance-critical code. Send failures to a dedicated error-logging thread via a bounded, lock-free queue. If the queue overflows, reflect errors only in a counter, reducing per-failure overhead dramatically. Throttle expensive operations like stack traces; when there are many errors, a sample is sufficient for diagnosis. As TigerBeetle puts it: "all errors must be handled," but handling them must not cost more than the error itself. Your system's behavior under failure should be a deliberate design choice, not an accidental consequence of debug-friendly coding.
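A minimal sketch of that pattern, with all names illustrative rather than from any particular library (Python's stdlib `queue.Queue` is lock-based; it stands in here for the lock-free queue the text describes):

```python
import queue
import threading
import traceback

class CheapErrorPath:
    """Errors go to a dedicated logging thread via a bounded queue;
    overflow is a counter bump; stack traces are sampled, not captured
    on every failure."""

    def __init__(self, capacity=1024, trace_every=100, start_worker=True):
        self._queue = queue.Queue(maxsize=capacity)  # bounded by design
        self.dropped = 0        # overflowed errors become a counter bump
        self._trace_every = trace_every
        self._seen = 0
        if start_worker:
            threading.Thread(target=self._drain, daemon=True).start()

    def report(self, exc):
        """Called on the failure path; everything here must stay cheap."""
        self._seen += 1
        # Throttle the expensive part: capture a stack trace for only a
        # sample of failures -- a sample is enough for diagnosis.
        trace = None
        if self._seen % self._trace_every == 1:
            trace = "".join(traceback.format_stack())
        try:
            self._queue.put_nowait((repr(exc), trace))  # never blocks
        except queue.Full:
            self.dropped += 1  # overflow costs one increment, nothing more

    def _drain(self):
        # Dedicated logging thread: formatting, disk writes, and telemetry
        # happen here, off the request path (stubbed out in this sketch).
        while True:
            msg, trace = self._queue.get()
            _ = (msg, trace)
```

The key property is that `report` does a bounded, small amount of work no matter how badly the rest of the system is doing; the queue and the counter absorb the difference.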
Takeaway: Your error path is your hot path during an outage; optimize it accordingly, or it will be the thing that turns a recoverable incident into a metastable failure.
See also: Metastable Failures Are the Hardest to Prevent | Goodput Matters More Than Throughput | Efficiency Is The Enemy of Resilience