I’m pleased to announce that VividCortex now offers 99th percentile metrics to help understand latency outliers in a query workload. These metrics provide visibility beyond the average latency, help identify the worst outliers, and improve focus on the customer experience. They are offered for all of the databases we currently support when using On-Host monitoring.
What You'll See
Latency percentile metrics are one of our most popular feature requests, so we know this will make a lot of you very happy! We actually started collecting these metrics some time ago; you’ll have p99 latency metrics for the last couple of months if you look back at your historical metrics. These metrics are captured globally for an environment, per-query, per-database, per-user, per-verb, per-caller and per-custom-query-tag.
Why Did We Choose to Implement This?
P99 metrics are useful for several reasons. Averages can be misleading. Customers don’t just experience an application’s average behavior; they remember the worst experience they’ve had. P99 metrics show outlying behavior in the long tail, while averages don’t. You can rank by highest p99 latency in the Profiler, which makes it very easy to focus on the queries with the worst outliers.
They’re most meaningful for high-frequency queries; where other monitoring systems have trouble providing any visibility at all into fast and frequent queries, we can also identify outlier performance. This is a huge blind spot for many people.
It is also a useful feature for proactive monitoring and notification. Since we generate this value per query, you can set an alert on the performance of a specific query. This can be a much more accurate way of alerting on unusual behavior than setting a threshold against average latency.
What Exactly Are We Collecting?
There is wide variety in what monitoring tools deliver as a “percentile” measurement. The most literal definition is to take a complete population of data points, discard a certain percentage of them such as the top 1%, and then present the largest remaining value. What VividCortex returns for p99 is a metric of percentiles. We don’t keep the full dataset from which a percentile can be calculated; our agents calculate the p99 at 1-second intervals with an approximation algorithm and store a time series metric of that value. This is similar to how StatsD generates its upper_99 metrics of percentiles.
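As a rough illustration of the literal definition and of a per-second series of percentiles, here is a minimal Python sketch. The function names `percentile_literal` and `p99_per_second` are hypothetical, and the sketch computes each one-second p99 exactly from the raw samples; the VividCortex agents use an approximation algorithm instead, which is not reproduced here.

```python
import math


def percentile_literal(values, pct=99):
    """Discard the top (100 - pct)% of values and return the largest one left."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # How many values remain after discarding the top (100 - pct)%.
    keep = max(1, math.floor(len(ordered) * pct / 100.0))
    return ordered[keep - 1]


def p99_per_second(samples):
    """Group (timestamp_seconds, latency) pairs by whole second and
    return {second: p99 of that second}.

    Assumption: exact computation per bucket, used here only as a
    stand-in for an approximation algorithm.
    """
    buckets = {}
    for ts, latency in samples:
        buckets.setdefault(int(ts), []).append(latency)
    return {sec: percentile_literal(vals) for sec, vals in sorted(buckets.items())}
```

With 100 latencies numbered 1 through 100, `percentile_literal` discards the single largest value and returns 99. The per-second function keeps only one number per second, so the raw samples can be thrown away once the bucket is closed.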
When charting the metric over an arbitrary timeframe, the API averages the metrics for display. This is necessary whenever you request data at a time resolution that differs from the stored resolution. If you want to render a chart of a metric over a day at 600px wide, each pixel will represent 144 seconds of data (86,400 seconds divided by 600 pixels). We also average the data when it is downsampled for long-term storage at lower resolutions.
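The arithmetic and the averaging step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names are hypothetical, the series is a plain list with one value per second, and the actual API may bucket or weight points differently.

```python
def seconds_per_pixel(range_seconds, width_px):
    """How many seconds of data each horizontal pixel has to represent."""
    return range_seconds / width_px


def downsample_by_average(series, bucket_size):
    """Average consecutive groups of `bucket_size` points into one point each.

    The final group may be shorter if the series length is not a multiple
    of `bucket_size`; it is averaged over the points it actually contains.
    """
    out = []
    for start in range(0, len(series), bucket_size):
        chunk = series[start:start + bucket_size]
        out.append(sum(chunk) / len(chunk))
    return out


# One day of per-second values drawn on a 600px-wide chart.
bucket = int(seconds_per_pixel(86_400, 600))  # 144 seconds per pixel
```

Calling `downsample_by_average` on a day of one-second values with `bucket` as the group size yields 600 points, one per pixel.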
Interestingly, averaging percentiles is mathematically improper but still useful. If you store a p99 metric and then zoom out and view an averaged version over a long time range, it may be quite different from the actual 99th percentile. However, the ways in which it is different don’t render it unusable for the desired purpose, i.e. understanding the worst experience most of your users are having with your application. Regardless of their exact values, percentile metrics tend to show outlying behavior and get bigger when outlying behavior gets worse. Super useful!
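A small, contrived example shows the gap between an averaged p99 and the true p99. The data below is made up for illustration, and the `p99` helper uses the literal "discard the top 1%" definition rather than any approximation algorithm.

```python
def p99(values):
    """Literal p99: drop the top 1% of values, return the largest remaining."""
    ordered = sorted(values)
    keep = max(1, len(ordered) * 99 // 100)
    return ordered[keep - 1]


# Two one-second windows of made-up query latencies, in milliseconds.
quiet_second = [1.0] * 100                 # every query takes 1 ms
noisy_second = [1.0] * 98 + [100.0] * 2    # two queries take 100 ms

# What gets stored: one p99 value per second.
per_second_p99 = [p99(quiet_second), p99(noisy_second)]

# What a zoomed-out chart shows: the average of the stored values.
averaged_p99 = sum(per_second_p99) / len(per_second_p99)

# What the 99th percentile of the combined raw data would be.
true_p99 = p99(quiet_second + noisy_second)
```

Here `averaged_p99` is 50.5 while `true_p99` is 1.0, so the two clearly disagree. Even so, the averaged value rises as soon as the noisy second appears, which is the signal you want when hunting for outliers.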
VividCortex does offer a free trial; the signup page can be found here: https://app.vividcortex.com/sign-up