Sylvia Botros, SendGrid’s lead DBA, recently told us about one of the ways her teams have found unique value in VividCortex. “Engineering managers and individual developers have pinged me and said, ‘Hey, I saw this in VividCortex -- what do you think?’ Some of our developers are not fully familiar with the generals of MySQL,” she explained, “but VividCortex is giving them a view into how their app is dealing with databases. And at the same time it’s teaching them DB lingo, which is good.”
This feedback from Sylvia opens the door to one of the questions we hear a lot about VividCortex: “How is your solution different from my existing APM (Application Performance Management) tool?” It’s a great question with many technical answers, but ultimately the only answers that matter are the ones focused on results. As an engineering-led company, we’re naturally strong on the technical side of things, but we’ve developed a tremendous focus on our customers and the outcomes we create for them.
Simply put: VividCortex will find, solve, and prevent problems with your databases that no other solution can. Products such as New Relic, AppDynamics, and DynaTrace are incredible, and this isn’t a criticism of them. They are complementary, not competitive, and most of our customers are paying one of those companies too. But we find and prevent problems they don’t. Day in, day out.
How do we do it? Well, the flux capacitration of the differential--whoa, whoa.
There are a lot of ways we do it. We capture different data, we analyze it differently, we visualize and alert on it differently, we allow drill-down differently. We measure the system’s work, in 1-second resolution, at unmatched granularity--and we have intelligent algorithms reacting in real time to do the hard work of sorting through the resulting fire-hose of data. But in this blog post I want to focus on just one element of it: a database-centric view of the data tier.
APM products, and for that matter most developer-created instrumentation, give engineers and operators an application-centric view of the system and its workload. In simplest terms, they measure the queries the app sends to the database(s). This is great, because ultimately it’s the apps that need to perform well. But if your question is “What queries are running on the databases?”, the answer should not start with “According to the app, it looks like...”
The problem is that an app-centric view relegates the data tier to second-class status, and increasingly the data tier is where all the heavy lifting takes place. The data tier needs first-class tooling that extends across different types of databases (relational, graph, document, etc.), all operating at large scale on many servers in a distributed fashion. Together, they represent the application’s data tier; VividCortex recognizes that in a way that hasn’t been done before. We provide a high-level overview of the entire tier, with smart algorithms surfacing what you need to look at, and immediate drill-down into individual servers and queries to figure out why and what to do about it.
We were pleased to hear how our focus on the data tier proved useful for Charity Majors -- the former production engineering manager for Parse at Facebook -- who shared her thoughts about VividCortex on our blog last year. “I’ve come to see how wonderful it is when you can let [database] experts do their thing,” Charity wrote. “And VividCortex is a DB monitoring system built by database experts. They know what information you are going to need to diagnose problems, whether you know it or not… in a couple of years, I think we’re all going to look at building our own monitoring pipelines the same way we now look at running our own mail systems and spam filters: mildly insane.”
That sort of specialization is where VividCortex is indispensable. A funny thing happens when you get a database-centric view of the data tier, measuring what happens “according to the databases.” You immediately realize that the app-centric view omits a tremendous amount of workload that the databases have to handle:
- Cron tasks
- Queries from sources as diverse as BI connectors, Tableau, Excel, and third-party systems
- Queries that come from monitoring systems (which can crush databases!)
- Ad-hoc queries from humans
- Queries from legacy parts of the codebase
- Queries from portions of the app that are at a lower level than the APM tool instruments, such as database drivers
I call this non-app-generated workload out-of-band traffic. It’s not out-of-band from the database’s point of view, but it often flies under the radar for APM products.
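To make the idea of out-of-band traffic concrete, here is a minimal sketch of how one might estimate its share of the workload from measurements taken at the database. It is not how VividCortex works internally; the host names, the `APP_HOSTS` set, and the `(client_host, query_text, seconds)` record shape are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical set of hosts that the APM tool instruments (assumption).
APP_HOSTS = {"app-01", "app-02"}


def out_of_band_share(observations, app_hosts=APP_HOSTS):
    """Split query time, as measured at the database, by where it came from.

    `observations` is an iterable of (client_host, query_text, seconds).
    Returns a dict of total seconds per non-app host, plus the fraction of
    all query time that did not originate from a known app host.
    """
    by_host = defaultdict(float)
    total = 0.0
    for host, _query, seconds in observations:
        total += seconds
        if host not in app_hosts:
            by_host[host] += seconds
    out_of_band = sum(by_host.values())
    share = out_of_band / total if total else 0.0
    return dict(by_host), share


# Made-up sample data: two app queries and one query from a cron host.
sample = [
    ("app-01", "SELECT * FROM users WHERE id = ?", 0.5),
    ("app-02", "SELECT * FROM users WHERE id = ?", 0.5),
    ("cron-01", "SELECT * FROM events", 3.0),
]
hosts, share = out_of_band_share(sample)
print(hosts, share)
```

In this toy data set, most of the database's time goes to a host the app-centric view would never report on, which is the pattern described above.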
And most importantly, it is disproportionately responsible for database performance problems and outages. I can’t tell you how many times in my career I’ve solved “unsolvable” performance problems by stepping back, noting that we’re listening to the wrong witness, so to speak, measuring at the database itself, and immediately finding the source of the problem. A few months ago, for example, I was visiting a customer and asked what was going on with one of their most resource-consuming queries. The app developers immediately knew what that was and thought it was okay, but then I pointed out that it was running 30x in parallel, constantly, day and night. Two clicks later we’d determined that it was coming from a non-app server. We then looked at the per-process statistics on that server and found a cron job that had been forgotten. It was supposed to be processing an hour’s worth of data every hour, but it was in fact processing all of history every hour, and it took more than a day to complete. A few minutes later, the affected database server suddenly went from fully utilized to practically idle.
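We don't know what that job's code actually looked like, but one common way for an hourly job to end up processing all of history is a missing or wrong time bound in its query. As a hypothetical sketch (the `events` table and `created_at` column are invented for illustration), an hourly job would normally restrict itself to the last complete hour:

```python
from datetime import datetime, timedelta


def hourly_window(now):
    """Return the [start, end) bounds of the last complete hour before `now`."""
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)
    return start, end


def build_query(now):
    """Build a parameterized query limited to one hour of data.

    Table and column names are hypothetical. Dropping the WHERE clause
    here would make the job scan every row on each run.
    """
    start, end = hourly_window(now)
    sql = "SELECT * FROM events WHERE created_at >= %s AND created_at < %s"
    return sql, (start, end)


sql, params = build_query(datetime(2016, 1, 1, 10, 30))
print(sql, params)
```

A job that starts every hour but needs more than a day to finish will also pile up overlapping runs, which would be consistent with the same query showing up many times in parallel.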
App performance is what matters, and the APM tools are great at showing you when it’s not doing well, but often the cause is collateral damage from out-of-band traffic. The prototypical scenario we see is that someone finds a slow page load in the APM tool, drills down, and identifies a 10-second query at the bottom of the stack. This is great, but it’s a single-row primary-key lookup that should have taken less than a millisecond. You have identified an effect, but you’ve hit a dead end as to the cause, because the APM tool never saw what caused the problem.
Our customers find the cause of those problems. The APM’s mystery slow page load is no longer a dead end. They can drill into the database’s workload in 1-second detail--both current and historical--and find the actual cause of the problem.
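As a rough sketch of what that kind of drill-down amounts to (not VividCortex's actual implementation), suppose you have per-second samples of query work measured at the database. The `(epoch_second, query_label, seconds_of_work)` record shape and the labels below are assumptions for illustration. Given the second in which the slow lookup occurred, you can rank everything else the database was doing at that moment:

```python
from collections import defaultdict


def busiest_queries_at(samples, second, top=3):
    """Rank queries by total work done within a single 1-second bucket.

    `samples` is an iterable of (epoch_second, query_label, seconds_of_work).
    Work can exceed 1.0 in a bucket when a query runs many times in parallel.
    """
    totals = defaultdict(float)
    for ts, label, work in samples:
        if ts == second:
            totals[label] += work
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Made-up samples: the lookup is fast at second 100 and slow at second 101.
samples = [
    (100, "pk_lookup", 0.001),
    (101, "pk_lookup", 1.0),
    (101, "full_history_scan", 30.0),
    (101, "report_export", 0.5),
]
ranked = busiest_queries_at(samples, 101)
print(ranked)
```

Here the slow lookup is not the top consumer in its own second; the query ranked above it is the more likely culprit, and it is one the app-centric view never recorded.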
It doesn’t stop there; we also help our customers in many other ways, such as finding problems no one except the occasional customer knew about. But this is one of the important results of the differences between VividCortex and APM products. You need both-and, not either-or.