Optimizing VividCortex’s Samples View: Part 2

Posted by Preetam Jinka on Sep 28, 2016 3:43:12 PM

A few weeks ago, I wrote a blog post explaining how sketch sampling methods can give monitoring users fast visibility into 30 days of data. The problem we faced is that, at that level of retention, we frequently see systems with nearly a million query samples in a 30-day window, so a special solution is needed to avoid overloading users’ browsers.

The solution we’ve found lies in a hash ordering that's proven to be both surprisingly simple and efficient. In this Part 2 post, I’ll look at why it works so well.

Ordering with a Hash

VividCortex’s query samples are stored in MySQL tables. The simplified schema definition looks like this:

query  bigint PRIMARY
host   int    PRIMARY
ts     int    PRIMARY
sample blob

Our primary key for those tables is (query, host, ts), as you may have guessed from the PRIMARY keyword above. When we load the query details page for a user, we know the query, the host, and the time range, so our queries use the primary key.
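
As a rough sketch, here’s what that definition might look like as actual DDL. The table name and column comments are placeholders I’ve chosen for illustration, not our real schema:

CREATE TABLE query_samples (
  query  BIGINT NOT NULL,   -- query identifier
  host   INT    NOT NULL,   -- host identifier
  ts     INT    NOT NULL,   -- sample timestamp
  sample BLOB,              -- the captured sample itself
  PRIMARY KEY (query, host, ts)
);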

As we described in Part 1 of this series, the following ordering and limit solved our subsampling problem:

ORDER BY MD5(CONCAT(qs.ts, qs.host)) LIMIT <our limit>

One optimization that we’ve made so far is replacing the MD5 operation with a CRC32. It’s less expensive and produces the same end result. So, how does this work?

MD5 and CRC32 are hash functions, which means we can use them to produce a randomly distributed ordering. They’re also deterministic: given the same inputs, we’ll get the same output, and therefore the same ordering.
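
Putting the pieces together, a sample-fetching query might look roughly like the following. The table name, placeholders, and limit are illustrative (the real query selects more than this), but it shows how the primary key lookup and the hash ordering fit together:

SELECT qs.ts, qs.host, qs.sample
FROM query_samples AS qs
WHERE qs.query = ?                        -- the query being inspected
  AND qs.host IN (?, ?)                   -- the host(s) in view
  AND qs.ts BETWEEN ? AND ?               -- the visible time range
ORDER BY CRC32(CONCAT(qs.ts, qs.host))    -- deterministic, pseudo-random order
LIMIT 100;                                -- "<our limit>"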

Let’s dig into the details and see an example.

First, think about how samples are stored as rows for the following view:

[Image: Samples_in_rows_1.png]

Here it is as a diagram:

[Image: Diagram_Samples.png]

I’ve split up the time range into two chunks so we can think about how things look when we “zoom in.” Imagine two ranges representing different zoom levels: the first includes both chunks, and the second includes only the first chunk. In each chunk I’ve listed four rows representing samples, so there are eight samples in total. Each sample also has a hash.

Let’s pick some samples! In each example, I’m limiting the count to three samples.

Picking 3 samples from timestamps 1 to 4 gives us the following samples:

  • (host 1, ts 4)
  • (host 1, ts 1)
  • (host 2, ts 4)

I’ve also bolded them in the diagram below.

[Image: Diagram_Samples_2.png]

Now, “zooming in” to timestamps 1 to 2 and picking 3 samples will give us:

  • (host 1, ts 1)
  • (host 1, ts 2)
  • (host 2, ts 2)

[Image: Diagram_Samples_3.png]

Notice how we’re still choosing the sample from the first chunk that we picked before:

  • (host 1, ts 1)

And we’re including new samples within our smaller time range to add more “detail”:

  • (host 1, ts 2)
  • (host 2, ts 2)

We still haven’t seen (host 2, ts 1) yet. How would we see that? It’s easy: zoom in even more so that we only see timestamp 1, and we’ll be able to capture that as well.
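
If you want to see this behavior for yourself, here’s a tiny self-contained sketch you can run against MySQL. The concrete rows it picks won’t match my diagrams, since the diagrams use made-up hash values, but the zoom-in property is the same:

-- Toy version of the example above: 2 hosts x 4 timestamps = 8 samples.
CREATE TEMPORARY TABLE toy_samples (host INT, ts INT);
INSERT INTO toy_samples (host, ts) VALUES
  (1, 1), (2, 1), (1, 2), (2, 2),
  (1, 3), (2, 3), (1, 4), (2, 4);

-- Zoomed out: pick 3 samples from timestamps 1 to 4.
SELECT host, ts, CRC32(CONCAT(ts, host)) AS h
FROM toy_samples
WHERE ts BETWEEN 1 AND 4
ORDER BY h
LIMIT 3;

-- Zoomed in: pick 3 samples from timestamps 1 to 2.
-- Any row from the first result that falls in this narrower range is
-- chosen again, because the ordering is deterministic; the remaining
-- slots go to new rows, adding detail.
SELECT host, ts, CRC32(CONCAT(ts, host)) AS h
FROM toy_samples
WHERE ts BETWEEN 1 AND 2
ORDER BY h
LIMIT 3;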

I hope this helps elucidate one method for using sampling to manage massive data sets and query volumes. The best way to understand such a solution, though, is to see it put to use in a real monitoring environment: you can request a free trial of VividCortex and experience high-volume, highly granular database monitoring firsthand.
