Leading Indicators Beat Lagging Ones

November 4, 2020

Lagging indicators tell you what already happened. Leading indicators tell you what is about to happen. In distributed systems, the difference between the two is the difference between preventing an outage and writing a postmortem.

"Metrics that describe the past are called lagging indicators. They have some uses, such as awarding bonuses and promotions, but are not enough for decision-making. They cannot alert you of problems before it's too late." Luca Dellanna

In the context of distributed systems, characteristic metrics are the leading indicators you need. These are measurements that are affected by a trigger and only return to normal after a metastable failure resolves. Retry rate, queueing delay, cache hit rate, thread counts, connection counts, timeout rates: these metrics reveal the state of feedback loops directly or indirectly. Queueing delay is particularly valuable because it is resilient to changes in workload, unlike metrics like queries-per-second that conflate load with health.

The concept of hidden capacity makes this concrete. A web application might have an advertised capacity of 3,000 QPS with a warm cache, but its hidden capacity the threshold below which the system will self-heal after a disruption, might be only 300 QPS (the raw database capacity). If you only monitor QPS and error rates (lagging indicators), you will not know the system is in the vulnerable zone until it is already in a metastable failure state. Monitoring the characteristic metric, say the fraction of requests hitting the database versus the cache, gives you a leading indicator of how close you are to the cliff.

The practical discipline is to identify characteristic metrics for your system, establish safe ranges, and alarm when those ranges are exited. Get into the habit of investigating outliers in latency or error clusters. Metastable failures often manifest as latency outliers long before they become full outages. An hour spent understanding a p999 spike today could prevent a two-hour outage next month.

Monitor the health of your feedback loops, not just the outcomes. Characteristic metrics give you a window to act before failure becomes self-sustaining.

Leading Indicators Beat Lagging Ones

Linked from