Modern systems can emit thousands or millions of metrics, and modern monitoring tools can collect them all. Faced with such an abundance of data, it can be very difficult to know where to start looking when you’re trying to diagnose a problem. And when you’re not in diagnosis mode, but you just want to know whether there’s a problem at all, you might have the same difficulty. What are the truly key KPIs coming from your systems?
I’ve written extensively about this before, but this time I want to direct your attention to other people’s opinions, not mine. Specifically, to two of the smartest performance experts I know of: Brendan Gregg and Tom Wilkie. These two have coined acronyms—USE and RED respectively—that are easy to remember and provide good high-level guidance for system observability.
Brendan Gregg’s USE Method
USE is an acronym for Utilization, Saturation, and Errors. Brendan Gregg suggests using it to get started quickly when you’re diving into an unknown system: “I developed the USE Method to teach others how to solve common performance issues quickly, without overlooking important areas. Like an emergency checklist in a flight manual, it is intended to be simple, straightforward, complete, and fast.”
A summary of USE is “For every resource, check utilization, saturation, and errors.” What do those things mean? Brendan defines the terminology:
- utilization: the average time that the resource was busy servicing work
- saturation: the degree to which the resource has extra work which it can't service, often queued
- errors: the count of error events
This disambiguates utilization and saturation, making it clear that utilization is “busy time %” and saturation is “backlog.” These terms are very different from things a person might confuse with them, such as “disk utilization” as an expression of how much disk space is left.
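To make the three definitions concrete, here is a minimal sketch (my illustration, not Brendan's; all names are hypothetical) of how USE metrics for a single resource might be derived from two counter samples:

```python
# Minimal sketch: deriving USE metrics for a hypothetical resource from
# two samples of raw counters taken interval_s seconds apart.
# Function and parameter names are illustrative, not from any real tool.

def use_metrics(busy_ms_before, busy_ms_after, interval_s,
                queue_len, errors_before, errors_after):
    """Return (utilization %, saturation, error count) for the interval."""
    # Utilization: fraction of the interval the resource spent busy
    # servicing work, expressed as a percentage.
    utilization_pct = (busy_ms_after - busy_ms_before) / (interval_s * 1000) * 100
    # Saturation: extra work the resource can't service yet, i.e. the
    # depth of its queue at sample time.
    saturation = queue_len
    # Errors: count of error events during the interval.
    errors = errors_after - errors_before
    return utilization_pct, saturation, errors

# Example: a disk busy for 800 ms of a 1-second interval, with 3 requests
# queued and no errors, is 80% utilized with a saturation of 3.
u, s, e = use_metrics(0, 800, 1, 3, 0, 0)
# u == 80.0, s == 3, e == 0
```

Note that a resource can be far from 100% utilized and still saturated (bursty arrivals can queue up), which is exactly why the method checks both.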
Tom Wilkie’s RED Method
Tom Wilkie introduced this acronym in a talk on monitoring microservices in 2015. The acronym stands for Rate, Errors, and Duration. These are request-scoped, not resource-scoped as the USE method is. Duration is explicitly taken to mean distributions, not averages.
USE And RED: Two Sides Of The Same Coin
What may not be obvious is that USE and RED are complementary to one another. The USE method is an internal, service-centric view. The system or service’s workload is assumed, and USE directs attention to the resources that handle the workload. The goal is to understand how these resources are behaving in the presence of the load.
The RED method, on the other hand, is about the workload itself, and treats the service as a black box. It’s an externally-visible view of the behavior of the workload as serviced by the resources. I define workload as a population of requests over a period of time. I’ve spoken and written extensively before about the importance of measuring the workload, since the system’s raison d’être is to do useful work.
Taken together, RED and USE comprise minimally complete, maximally useful observability—a way to understand both aspects of a system: its users/customers and the work they request, as well as its resources/components and how they react to the workload. (I include users in the system. Users aren’t separate from the system; they’re an inextricable part of it.)
I often refer to this duality as the “Zen of Performance,” a holistic, unified system performance worldview I’m developing. It’s a work in progress!
Mapping USE And RED To Standard Terminology
USE and RED are convenient, and part of the reason they’re so valuable is that their atoms map directly to standard concepts that are core performance metrics:
- U = Utilization, as canonically defined
- S = Concurrency
- E = Error Rate, as a throughput metric
- R = Request Throughput, in requests per second
- E = Request Error Rate, as either a throughput metric or a fraction of overall throughput
- D = Latency, Residence Time, or Response Time; all three are widely used
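Several of these quantities are not independent: Little’s Law says that mean concurrency equals throughput times mean residence time, so S, R, and D above are linked. A quick sketch with illustrative numbers (my example, not from the original list):

```python
# Little's Law: N = X * R, where N is mean concurrency, X is throughput,
# and R is mean residence time. Knowing any two determines the third.

def concurrency(throughput_rps, residence_time_s):
    """Mean number of requests in flight, per Little's Law."""
    return throughput_rps * residence_time_s

# A service handling 1000 requests/sec with a 50 ms mean residence time
# holds 50 requests in flight on average.
n = concurrency(1000, 0.050)
# n == 50.0
```

This is one reason concurrency is such a useful admission-control signal: it folds both throughput and latency into a single number.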
To learn more about why these metrics are so fundamental to performance and observability, listen to Jon Moore’s talk on why API admission control should use concurrency instead of throughput. And, for further reading, consider my ebooks on queuing theory and the Universal Scalability Law.
In conclusion, if you’re unsure which metrics are most useful for both monitoring and diagnosis, USE and RED are great places to start.