Leading Indicators Beat Lagging Ones
Lagging indicators tell you what already happened. Leading indicators tell you what is about to happen. In distributed systems, the difference between the two is the difference between preventing an outage and writing a postmortem.
"Metrics that describe the past are called lagging indicators. They have some uses, such as awarding bonuses and promotions, but are not enough for decision-making. They cannot alert you of problems before it's too late." (Luca Dellanna)
In the context of distributed systems, characteristic metrics are the leading indicators you need. These are measurements that are affected by a trigger and only return to normal after a metastable failure resolves. Retry rate, queueing delay, cache hit rate, thread counts, connection counts, and timeout rates all reveal the state of feedback loops, directly or indirectly. Queueing delay is particularly valuable because it is resilient to changes in workload, unlike metrics such as queries per second (QPS) that conflate load with health.
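One way to observe queueing delay directly is to timestamp each item when it is enqueued and measure the wait when it is dequeued. The sketch below is a minimal illustration in Python; `TimedQueue` and its injectable `clock` parameter are hypothetical names chosen for this example, not part of any particular library.

```python
import time
from collections import deque
from typing import Any, Callable, Tuple


class TimedQueue:
    """FIFO queue that reports how long each item waited (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._items: deque = deque()
        # The clock is injectable so the delay calculation can be tested.
        self._clock = clock

    def put(self, item: Any) -> None:
        # Record the enqueue time alongside the item.
        self._items.append((self._clock(), item))

    def get(self) -> Tuple[Any, float]:
        # Return the oldest item and its queueing delay in seconds.
        enqueued_at, item = self._items.popleft()
        return item, self._clock() - enqueued_at
```

Because the delay is measured per item, it should stay close to zero while workers keep up, whatever the request rate, and grow only when work arrives faster than it is drained.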
The concept of hidden capacity makes this concrete. A web application might have an advertised capacity of 3,000 QPS with a warm cache, but its hidden capacity (the threshold below which the system will self-heal after a disruption) might be only 300 QPS, the raw database capacity. If you monitor only QPS and error rates, which are lagging indicators, you will not know the system is in the vulnerable zone until it is already in a metastable failure state. Monitoring the characteristic metric (say, the fraction of requests hitting the database versus the cache) gives you a leading indicator of how close you are to the cliff.
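The arithmetic behind the 3,000 QPS and 300 QPS figures can be sketched with a toy model. It assumes, as a simplification, that every cache miss becomes exactly one database query and that nothing else limits throughput; `CacheBackedService` and its method names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheBackedService:
    """Toy model of a cache-fronted service (assumption: one DB query per miss)."""

    db_capacity_qps: float  # raw database capacity, i.e. the hidden capacity

    def db_load(self, total_qps: float, cache_hit_rate: float) -> float:
        # Only cache misses reach the database.
        return total_qps * (1.0 - cache_hit_rate)

    def min_safe_hit_rate(self, total_qps: float) -> float:
        # Lowest hit rate at which the database can still absorb the misses.
        if total_qps <= self.db_capacity_qps:
            return 0.0
        return 1.0 - self.db_capacity_qps / total_qps

    def self_heals_after_cache_loss(self, total_qps: float) -> bool:
        # With a cold cache, every request hits the database.
        return total_qps <= self.db_capacity_qps


svc = CacheBackedService(db_capacity_qps=300.0)
required = svc.min_safe_hit_rate(3000.0)         # hit rate needed at full load
vulnerable = not svc.self_heals_after_cache_loss(3000.0)
```

Under this model, serving 3,000 QPS requires a cache hit rate of about 90%, and any load above 300 QPS is in the vulnerable zone: a cache flush there leaves the database unable to refill the cache on its own.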
The practical discipline is to identify characteristic metrics for your system, establish a safe range for each, and alarm when a metric leaves its range. Get into the habit of investigating outliers in latency or error clusters. Metastable failures often manifest as latency outliers long before they become full outages. An hour spent understanding a p999 spike today could prevent a two-hour outage next month.
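The safe-range discipline described above can be sketched as a small check that runs against each metrics sample. This is an illustration, not a monitoring product's API: `SafeRange`, `out_of_range`, the metric names, and the thresholds are all hypothetical and would need to come from your own system's baseline.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping


@dataclass(frozen=True)
class SafeRange:
    """Inclusive safe range for one characteristic metric (hypothetical type)."""

    metric: str
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def out_of_range(ranges: Iterable[SafeRange], sample: Mapping[str, float]) -> List[str]:
    """Return the names of metrics that are missing or outside their safe range."""
    breaches = []
    for r in ranges:
        value = sample.get(r.metric)
        # A missing metric is treated as a breach: no data is not good news.
        if value is None or not r.contains(value):
            breaches.append(r.metric)
    return breaches


# Example thresholds; these numbers are placeholders, not recommendations.
RANGES = [
    SafeRange("cache_hit_rate", 0.90, 1.00),
    SafeRange("retry_rate", 0.00, 0.05),
    SafeRange("queueing_delay_ms", 0.0, 50.0),
]
```

Calling `out_of_range(RANGES, sample)` on each scrape and paging when the result is non-empty gives an alarm tied to the feedback loops themselves rather than to error counts after the fact.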
Takeaway: Monitor the health of your feedback loops, not just the outcomes. Characteristic metrics give you a window to act before failure becomes self-sustaining.
See also: Metastable Failures Are the Hardest to Prevent | Latency Sneaks Up On You | Monitor What Matters Not What Is Easy
Linked from
- Control Theory Applies to Software Systems
- Delays in Feedback Loops Cause Oscillation and Overshoot
- Feedback Loops Are the Hidden Architecture of Everything
- Latency Sneaks Up On You
- Monitor What Matters Not What Is Easy
- The Heilmeier Catechism for Evaluating Ideas
- The Media-Historian Gap Reveals What Actually Matters