Doing Work with the Law that Defines Critical APM Metrics

Posted by Alex Slotnick on Aug 15, 2016 11:39:26 AM

Later this month, on August 30th, VividCortex’s Preetam Jinka will join Datadog’s Matt Williams for a webinar to discuss “5 Tips on Determining the Most Impactful Metrics in Your App,” along with the solutions each company uses to track those metrics.

In anticipation of that conversation, this blog post goes a bit behind the scenes and introduces Little’s Law, a powerful theorem rooted in probability theory, and shows how it lets us define a set of metrics that we consider critical for gaining database insights.


Image Credit: Public Domain, Source

Yes, the more you can monitor the better, and, yes, there’s an irreplaceable sense of security that comes when you know you have an absolutely complete picture of your system, should you ever need it (and that’s why it's important to observe every query or other request sent to the server). But for all pragmatic purposes, there are certain readings that will implicitly tell you more about the most important parts of your system than others.

It's useful to label these valuable metrics as “work-centric”: they’re the ones that tell you the most about whether or not your system can actually execute the work it was designed to execute. When identified properly, work-centric metrics cut through context-specific (or system-specific) corner cases and provide absolute value. We know well that every complex system is unique and inimitable, and what’s alarming or ideal in one system might be the opposite in another. It’s normal for abnormal things to happen when an environment gets complicated, and insights like CPU and IO can be powerful, but they’re case-dependent; work-centric metrics aim to provide universal value by focusing on what every system has as its bottom-line goal: completed work.

Laying Down the Law

In order to find these metrics specifically, we need a way to turn the concept of “completed work” into quantifiable, observable traits and phenomena.

The fundamental idea is that to measure a system’s success in performing work, the phase in the system’s process we want to watch is the full process of execution, when “requests for work” pass from “waiting to complete” to the active mode of “completing.” (If this sounds like waiting in line, that’s because it is! This is queueing theory in action — more on that later.)

Here is where Little’s Law comes in. It's a powerful formula that defines the relationship between three variables central to the execution process: concurrency (L), arrival rate (λ), and residence time (R). For stable systems, Little’s Law states that:

L = λR

Or, in practice, the average number of requests-for-work currently in a system equals the rate at which they line up, multiplied by the average amount of time each one spends in the system (waiting in line plus actually being worked on). The more requests that get in line, or the longer it takes for execution to complete, the more requests there will be occupying the system at a given time.
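To make the arithmetic concrete, here is a minimal Python sketch of Little’s Law and one of its rearrangements. The function names and the sample workload (200 requests per second, 50 ms in the system) are hypothetical, chosen only for illustration.

```python
def concurrency(arrival_rate: float, residence_time: float) -> float:
    """Little's Law for a stable system: L = lambda * R."""
    return arrival_rate * residence_time


def residence_time(concurrency_level: float, arrival_rate: float) -> float:
    """Little's Law rearranged: R = L / lambda."""
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    return concurrency_level / arrival_rate


# Hypothetical workload: 200 requests/second, each in the system for 50 ms.
in_flight = concurrency(arrival_rate=200.0, residence_time=0.05)
print(f"average requests in the system: {in_flight:.1f}")

# Going the other way: 10 requests in flight at 200 requests/second.
average_r = residence_time(concurrency_level=10.0, arrival_rate=200.0)
print(f"average residence time: {average_r * 1000:.0f} ms")
```

The second function is useful when concurrency and arrival rate are easy to observe but per-request timing is not: any two of the three variables determine the third.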

This simple and elegant formula lends itself to a variety of rearrangements and dissections, each of which yields different metrics you can use to understand your system. VividCortex’s founder and CEO Baron Schwartz leads an in-depth exploration of these resultant and related equations and concepts in our free ebook Everything You Need to Know About Queueing Theory. For instance, as Baron writes, this is the Utilization Law, which relates utilization (ρ) to the product of throughput and service time (S); in a stable system, throughput equals the arrival rate λ:

ρ = λS

You can also account for the number of servers by dividing the right side of the equation by that number, which gives the average utilization per server.
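As a rough sketch of the Utilization Law, including the division by the number of servers, consider the following Python snippet. The function name, the `servers` parameter, and the sample numbers are assumptions made for illustration.

```python
def utilization(arrival_rate: float, service_time: float, servers: int = 1) -> float:
    """Utilization Law: rho = lambda * S, divided by the number of servers."""
    if servers < 1:
        raise ValueError("servers must be at least 1")
    return arrival_rate * service_time / servers


# Hypothetical workload: 200 requests/second, 4 ms of service time each.
single = utilization(arrival_rate=200.0, service_time=0.004)
spread = utilization(arrival_rate=200.0, service_time=0.004, servers=4)

print(f"one server:   {single:.0%} busy")
print(f"four servers: {spread:.0%} busy each")
```

A result at or above 1.0 means the servers cannot keep up with the arrival rate, in which case the system is no longer stable and Little’s Law in the form above does not apply.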

And here is the equation for residence time, the sum of wait time and service time:

R = W + S
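Taken together with Little’s Law, this relationship lets you estimate how much of a request’s time is spent waiting rather than being served. The Python sketch below assumes you can observe average concurrency, arrival rate, and service time; the function name and the sample values are hypothetical.

```python
def wait_time(concurrency_level: float, arrival_rate: float, service_time: float) -> float:
    """Estimate average wait time W from observed L, lambda, and S."""
    if arrival_rate <= 0:
        raise ValueError("arrival_rate must be positive")
    residence = concurrency_level / arrival_rate  # Little's Law: R = L / lambda
    return residence - service_time               # R = W + S, so W = R - S


# Hypothetical observations: 10 requests in flight, 200 requests/second,
# and 4 ms of actual service time per request.
w = wait_time(concurrency_level=10.0, arrival_rate=200.0, service_time=0.004)
print(f"average time spent queued: {w * 1000:.0f} ms")
```

When the estimated wait dominates the service time, requests are spending most of their residence time in line, which points at queueing rather than slow execution.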

You’ll notice that resource metrics are inherently secondary here — yes, an argument can be made that your system’s power affects the speed of execution (and therefore concurrency) but CPU or IO by themselves say nothing about any of the pieces of Little’s Law directly. From this perspective, resources all work in service of the completion of work, which means (correctly) that they’re understood to be necessarily relative and smaller components of a larger, complex system.

Taken all together, these powerful equations produce this list of crucial, work-centric APM metrics:

A page from VividCortex's ebook Everything You Need to Know About Queueing Theory.

Next Steps: Learn More and Apply

The next important step is to understand how to apply and interpret these important metrics. With that in mind, be sure to read the free ebook in full to see the more complex developments of these insights and the broader concepts they lead to. 

Once you know what you're looking for and why, you'll need to determine the best way to capture these metrics. With that in mind, we invite you to tune in to the webinar, co-hosted by VividCortex and Datadog at the end of this month, where you'll learn how these powerful monitoring products put knowledge of work-centric monitoring to good use and help customers keep track of the most valuable and indicative measurements in their systems. Hope to see you there!
