Deterministically Subsampling Queries: A Million Samples?!
VividCortex lets users drill down into their data at up to single-second granularity, and, using specialized sketch sampling methods (check out our free ebook, Sampling a Stream of Events With a Probabilistic Sketch, for a look at one way we employ sampling to handle big data sets), we’re able to offer customers up to 30 days of data retention. As you can imagine, 30 days of samples represents a lot of data. Exactly how much?
For each query, we capture ten or more samples per hour, per host. That means we pull thousands of samples each month — about 7,200 per host. If you’re a customer with hundreds of hosts, that multiplies into almost a million query samples, all being sent to your cloud-based view of VividCortex’s app. If that sounds like an easy way to crash your browser… it is. We’ve had to develop and employ solutions to overcome this problem.
One of our initial observations was that it’s not helpful to have a giant line of samples stacked on top of each other. A user probably isn’t interested in all of them, and, from a UI and usability standpoint, it’s difficult to click on individual samples when they’re visually overlapping in our interface.
Compare what you're seeing here (basically, a big, flat, dense bar) to what we believe a sample set should actually look like:
This is preferable because here we have enough samples to see patterns and outliers, without such a high density that those patterns get swallowed up. The samples are separated enough that you can isolate and click on individual ones. Plus, you can see the different color indicators (yellow means "warning" and red means "error"), and you can see the three-dot icon, which shows that the sample contains an explain plan.
We determined that we needed a way to view samples over long time ranges. However, for all of these reasons, we can’t show every sample, so we decided to pick a subset. Because this is a user-facing feature, the display and sampling logic needs to be intuitive from a user’s perspective, so we established some constraints. Determining those constraints, along with a selection process that satisfies them, proved to be the bulk of the problem-solving needed to get around the sample-size problem.
- If you’re viewing 30 days’ worth of samples and you share the page’s link with a coworker, they should see exactly the same samples you do. This deep linking is integral to making VividCortex collaborative, flexible, and truly helpful to multiple users across an organization.
- If you refresh the page, you should see the same samples. For obvious reasons, we want users to be able to return to the same specific datasets whenever necessary by going to the same page and URL.
- If you’re looking at 30 days’ worth of samples and you shorten your time range by an hour on each side, almost all of the samples in the new view should be ones you saw before.
All three of these constraints demand reliable, repeatable consistency, even though the sample selection is random to some degree.
With those rules in mind, how do we actually pick the subset? Two obvious candidates:

- Pick the first N.
- Pick out 1% (or some other appropriate fraction) at random.
To drill down deeper into the process:

- Picking the first N:
  - It isn’t fair, random, or representative, since it depends on the order.
  - If the order is time, we won’t have samples evenly scattered across the time range.
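To make the first-N drawback concrete, here's a small Python sketch (the per-minute timestamps and the limit of 100 are hypothetical, purely for illustration). Taking the first N selects everything from the very start of the range, while ordering by a hash of each timestamp scatters the picks across it:

```python
import hashlib

total = 30 * 24 * 60                 # one hypothetical sample per minute, 30 days
timestamps = list(range(total))

# "Pick the first N": every pick comes from the first 100 minutes.
first_n = timestamps[:100]
assert max(first_n) == 99            # clustered at the very start of the range

# Ordering by an MD5 hash instead scatters picks across the whole range.
md5_key = lambda ts: hashlib.md5(str(ts).encode()).hexdigest()
hashed = sorted(timestamps, key=md5_key)[:100]
assert max(hashed) > 99              # picks are no longer confined to the start
```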
Once we’ve ruled out simply taking the first N, choosing the subset in a way that meets our constraints becomes the heart of the problem. The selection can’t be truly random, because the constraints demand that, at some point, our selections become deterministic and repeatable.
Our current solution meets all of the constraints mentioned above. In the end, after a great deal of thought, it proved to be fairly simple. All it needed was the addition of

```sql
ORDER BY MD5(CONCAT(qs.ts, qs.host)) LIMIT <our limit>
```

to our query.
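Sketched in Python (with made-up sample rows — in production this selection happens in SQL, as shown above), the idea is simply: hash each sample's timestamp-plus-host, sort by the hash, and keep the first N. The hash order is pseudo-random but fully repeatable, which is exactly what makes a refresh or a shared link show identical samples:

```python
import hashlib

def pick_samples(samples, limit):
    """Mimics ORDER BY MD5(CONCAT(ts, host)) LIMIT <limit>: a pseudo-random
    but fully deterministic ordering, so the same inputs always yield the
    same subset."""
    md5_key = lambda s: hashlib.md5((str(s["ts"]) + s["host"]).encode()).hexdigest()
    return sorted(samples, key=md5_key)[:limit]

# Hypothetical sample rows: one per (timestamp, host) pair.
samples = [{"ts": ts, "host": h} for ts in range(1000) for h in ("db1", "db2")]

picked = pick_samples(samples, 100)
assert picked == pick_samples(samples, 100)  # refresh / shared link: same subset
assert len(picked) == 100
```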
Now, in a way similar to Google Maps, the result provides a rough overview when you’re zoomed out and adds detail as you zoom in. We can show a few hundred samples in every view. When you zoom in, we keep all of the samples that were in the previous time range and add more, up to the limit. This way, we’re essentially subsampling our original (massive) sample set while preserving consistency.
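This zoom-in behavior is a direct consequence of the hash ordering: shrinking the time range can only improve a sample's rank in that ordering, so any sample selected in the wide view that falls inside the narrower range stays selected. Here's a self-contained Python check of that property, using hypothetical data and a limit of 100:

```python
import hashlib

def pick_samples(samples, limit):
    # Deterministic pseudo-random order: sort by MD5 of (ts, host), keep `limit`.
    md5_key = lambda s: hashlib.md5((str(s["ts"]) + s["host"]).encode()).hexdigest()
    return sorted(samples, key=md5_key)[:limit]

samples = [{"ts": ts, "host": "db1"} for ts in range(10_000)]

wide = pick_samples([s for s in samples if 0 <= s["ts"] < 10_000], 100)
narrow = pick_samples([s for s in samples if 2_000 <= s["ts"] < 8_000], 100)

# Every wide-view pick that lies inside the narrower range survives the zoom;
# the narrow view then fills up to the limit with additional samples.
survivors = [s for s in wide if 2_000 <= s["ts"] < 8_000]
assert all(s in narrow for s in survivors)
assert len(narrow) == 100
```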
But why does that simple command work so well? In Part 2 of this blog post — coming soon — I’ll dig into what makes it function and why it satisfies those constraints so neatly.
Preetam Jinka works on back-end systems and anomaly detection at VividCortex during the day and hacks on storage engines and distributed systems at night. Next week, on August 30, he’ll co-host a VividCortex webinar with our friends at Datadog, discussing 5 Tips on Determining the Most Impactful Metrics in Your App — register here, and we hope to see you there.