A few days ago Percona benchmarked our agents’ performance overhead in two extreme deployment scenarios. We engaged Percona to do this (on a paid basis) because of their sterling reputation for independent, objective third-party evaluations.
As you may know, I worked at Percona for nearly five years and participated in many of these paid evaluations myself. There is no one more highly qualified to evaluate a product’s performance than Vadim Tkachenko at Percona. I know because I worked closely with him to evaluate products including Clustrix and others.
Vadim and I corresponded and planned the evaluation over many months, and Vadim ran and re-ran the benchmarks until we agreed they met our mutual standards.
At this point Vadim wrote a blog post, which you can find linked above. Prior to publication, Vadim circulated a draft of the post and I commented on it, suggesting some areas for additional information. But the post, and its conclusions, are Vadim’s own.
Interpretation Of The Results
As a performance and benchmarking expert myself, I have my own interpretation of the results, which is more nuanced. Vadim properly restricts himself to stating the facts, but I am allowed to speculate and extrapolate!
It’s difficult to benchmark software that runs the way VividCortex’s agents do, and difficult to design a benchmark that indicates their performance impact on the system across a variety of scenarios. Part of the reason is that most benchmarks, and most benchmarking practices, are designed to measure the performance of the system under test, not the performance of a system impacted by the system under test.
Vadim and I discussed how to do this and it’s an extremely broad problem space. We elected to focus our time and money on a couple of achievable, specific results that could be used, with interpretation such as this blog post, to point to the broader picture.
First, as a baseline, in production scenarios we observe the following, which we wanted to benchmark and evaluate formally:
- VividCortex’s agents can use up to a single CPU core in extreme cases.
- In typical cases they use less than 1% of total system CPU.
- The CPU they use is usually free and idle, so there is essentially zero real performance impact (measured as response time) as seen by the user.
That’s what we actually see, running VividCortex in production on many deployments in the real world. Our agents are able to measure themselves, so we have real data on this. Example in our own production servers:
As you can see, during this time range our agent uses 16% (to use a round number) of a single core. This is a multicore box, so that is 2% of total system CPU. This is a bit higher than we usually see on most people’s servers, but as mentioned the agent can use up to a full CPU core in extreme cases (usually on 32-core and larger servers, where a full core represents ~3% of total system CPU).
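The arithmetic behind these percentages is just the single-core figure divided by the number of cores. Here is a minimal sketch in Python. The function name is illustrative, and the 8-core count is only inferred from the 16% and 2% figures above; it is not stated anywhere.

```python
def total_cpu_percent(single_core_percent: float, cores: int) -> float:
    """Convert CPU usage expressed as a share of one core
    into a share of the whole machine."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return single_core_percent / cores


# 16% of one core on an (assumed) 8-core box
print(total_cpu_percent(16, 8))    # 2.0
# A full core on a 32-core box, the extreme case
print(total_cpu_percent(100, 32))  # 3.125
```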
Now, what’s harder for us to measure with our own agents is how much response time impact this has on queries. In other words, even if there is free CPU for the agents to use, does the agent slow down the queries?
Vadim benchmarked a few different scenarios to try to evaluate that. You can read the results for yourself. The answers, in my own words, are:
- When there’s no free CPU the agent competes with the server for CPU, and makes queries slower.
- When there’s no contention for CPU there is no statistically significant difference, and the agents are truly “free.”
- Vadim’s excellent granular analysis shows that not only are the overall performance impacts bounded, but the agents’ impact on consistently fast performance is shown clearly as well (also essentially zero unless the server is being run at 100% utilization).
Benchmarking the intermediate scenario, the middle ground between zero and 100% CPU utilization, is subtle, but from queueing theory we know that queueing delay increases as utilization approaches 100%. On a single-core server you should probably start to worry around 75%; on multi-core servers the threshold of concern is much higher, typically 95% or more:
This chart, which you can explore interactively if you like, shows the “response time stretch factor” at various utilization levels for various numbers of CPU cores: 1, 2, 4, 8, 16, 32. Notice that the so-called “knee” of queueing is an optical illusion and the 75% rule of thumb is not applicable on multi-CPU systems; please see Neil Gunther for more on this topic.
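If you want to compute curves like these yourself, here is a minimal sketch in Python. It assumes the textbook M/M/m queueing model (random arrivals, exponentially distributed service times, and all cores serving one shared queue) and uses the Erlang C formula. The function names are illustrative, and the chart itself may have been produced with a different formula.

```python
def erlang_c(cores: int, utilization: float) -> float:
    """Probability that an arriving request has to wait (Erlang C formula)."""
    if cores < 1 or not 0 <= utilization < 1:
        raise ValueError("need cores >= 1 and 0 <= utilization < 1")
    offered_load = cores * utilization
    # Build Erlang B iteratively; this avoids large factorials.
    blocking = 1.0  # Erlang B with zero servers
    for k in range(1, cores + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    # Convert Erlang B to Erlang C.
    return blocking / (1 - utilization * (1 - blocking))


def stretch_factor(cores: int, utilization: float) -> float:
    """Mean response time divided by mean service time in an M/M/m queue."""
    return 1 + erlang_c(cores, utilization) / (cores * (1 - utilization))


for cores in (1, 2, 4, 8, 16, 32):
    row = ", ".join(
        f"{u:.0%}: {stretch_factor(cores, u):.2f}" for u in (0.5, 0.75, 0.95)
    )
    print(f"{cores:>2} cores  {row}")
```

Under this model a single core at 75% utilization has a stretch factor of exactly 4, because the formula reduces to 1 / (1 - utilization) when there is one core. With 32 cores at the same utilization the stretch factor stays close to 1, which is why the 75% rule of thumb does not carry over to multi-core servers.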
What About Alternatives?
Vadim benchmarked one alternative, the Performance Schema, and found that it has about half the performance impact of VividCortex. This is not surprising, since it’s internal to the server. Given that it also misses a tremendous amount of vitally important detail I’m personally willing to overlook a small extra bit of CPU usage and response time impact to get better instrumentation.
VividCortex’s agents, by the way, can use the Performance Schema as well. And someday it would be nice to measure its performance impact when it’s not only enabled, but being polled for metrics. (In Vadim’s benchmarks it’s enabled, but unused. When used it will have more performance impact, which may actually put it on par with our TCP capture solution.)
Another option is slow query logging. As mentioned in the same blog post, and demonstrated by Percona’s own benchmarks linked from there, slow query logging is typically higher-overhead and carries much more operational risk than passive analysis of TCP traffic.
In general, Percona’s benchmarks validate my decision to build our solution on top of network traffic capture analysis. I personally wrote the leading log analysis tools for MySQL and many other databases and I’ve learned my lessons from doing that. Network traffic capture is an excellent solution when you consider the alternatives.
Is VividCortex A Good Solution For You?
When evaluating whether you want to use VividCortex, therefore, you should ask:
- Do I have free CPU on my servers? If you run your servers at or near 100% CPU utilization, the agent may degrade performance.
- Do I hit an edge case where the agent will use more CPU than usual? If your workload consists of very small, high-speed queries, the agent will use a bit more CPU than it would with fewer, slower queries.
If you’re interested in benchmarking the agent’s performance without running the agent itself, you can just run our free network analyzer tools which are thin wrappers around our network decoding library.
What Is The Performance Impact Of Instrumentation?
For an answer to the broader question I would like to appeal to a higher authority, Oracle performance guru Tom Kyte:
Would Oracle run faster without this stuff? Undoubtedly – not. It would run many times slower, perhaps hundreds of times slower. Why? Because you would have no clue where to look to find performance related issues. You would have nothing to go on. Without this “overhead” (air quotes intentionally used to denote sarcasm there), Oracle would not have a chance of performing as well as it does. Because you would not have a change to make it perform well. Because you would not know where even to begin.
This is how you should think about VividCortex’s performance impact. If you’re running near 100% CPU and you think VividCortex is going to slow your servers down, you probably needed VividCortex a long time ago.
Our customers repeatedly find that we enable them to run their servers 20-50% more efficiently, meaning that they can actually run their workload on 20-50% fewer servers, or wait much longer before buying more servers.
We work extremely hard to make our agents as efficient as possible so we don’t cause performance problems. Sometimes it’s a concern that they are using some CPU. But upon looking more closely, we’ve always found that those cases aren’t actually impacting server performance; they are using otherwise-idle CPU cycles.
As a performance expert, I have used my 15 years of experience, plus more than 100 years of experience in our talented team, to design a system that improves your servers’ and applications’ performance dramatically, without causing performance problems. By the way, every time I evaluate a similar product from our competitors, I do not find this to be the case. We repeatedly and consistently identify our competitors as causing performance problems on servers. (I guess it really does help that I know what I’m doing with database performance.)
I welcome your comments and questions about VividCortex’s agents and their performance impact.