A little while ago VividCortex's engineering team encountered a minor mystery while observing trends in time_us metrics in a MongoDB database. While the conclusion of our investigation was a combination of "D'oh!" and "Ah ha!," we thought our considerations along the way were interesting and raised a few questions worth sharing.
The mystery started when we noticed that the time metrics for a certain query appeared constant even though the query was operating under varying loads, over several different time frames, as we isolated and observed it with VividCortex's Time Selector. (You can read a bit more about how VividCortex uses the time_us metric as a way to measure queries here.)
What we saw piqued our interest, even if it didn't exactly raise alarms. Even so, the readouts on the time_us metric were enough to make us really dig in and ask, "What's behind this?" This post shares the thought process behind our investigation and highlights a few points from our subsequent conversation.
Smells Like Lock Contention
What first appeared to be lock contention in MongoDB quickly revealed itself to be something else. A closer look showed something more unusual: the time_us for this MongoDB host appeared to be essentially constant.
Our initial explanations aimed to identify a cause requiring the fewest hypotheses and the least guesswork (thanks, Occam!). For instance, could it be something as simple as MongoDB's typically high-speed activity normalizing the numbers?
(For reference: host.totals.queries.time_us shows total execution time for all queries, and host.totals.queries.tput shows the rate of execution of all queries.)
However, when we observed the time_us metrics across varying time ranges and still found the same time_us per second, it became clear that something more unusual was going on.
Here's a snapshot from a 24-hour period:
And here's a 12-hour period:
Admittedly, the scales on these kinds of metrics can vary widely, making it hard to judge the nature or behavior of the data observed. But even with that caveat, we couldn't dismiss the aberration as just a "difficult metric" or a side effect of MongoDB. Time to take a closer look.
We began digging in and analyzing the disparity between what the query looked like and what it should look like. Here's a clue: the average was ~1M time_us per second, with small aberrations, regardless of host (whose qps varies a lot) or time range (qps varies over time). If the count of 0-100 µs queries goes up by ~100, the added time is on the order of 100 µs out of 1,000,000, i.e. 100/1,000,000 = 0.0001 of the total.
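To make that arithmetic concrete, here's a quick sketch with hypothetical numbers (the ~1 µs per-query figure is an assumption for illustration; any value well inside the 0-100 µs bucket tells the same story):

```python
# Hypothetical: the host reports ~1,000,000 time_us per second.
total_time_us_per_sec = 1_000_000

# Suppose the 0-100 µs bucket gains ~100 queries, each taking on the
# order of 1 µs (an assumed figure, well inside that bucket).
extra_queries = 100
per_query_us = 1
extra_time_us = extra_queries * per_query_us  # 100 µs

# The burst's share of the host's total query time per second.
fraction = extra_time_us / total_time_us_per_sec
print(fraction)  # 0.0001
```

In other words, a visible bump in fast-query counts barely registers in total query time.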
Here, a sophisticated visual like a query latency band can be useful in illustrating more granular query tput; here's the query latency band for the MongoDB query in question:
However, looking at the same test host and switching between time ranges of 4d, 1d, 12h, 3h, and 1h, all boxes showed the same behavior. Why did the total aggregate query time for the host stay essentially constant under widely varying loads? A helpful metaphor: it was like driving all kinds of cars at all kinds of speeds but finding that they all achieve exactly the same MPG, which is suspiciously always "1," under widely varying circumstances.
We determined that there should be at least a ~1% difference in total time. If we have 100 queries per second with 0.01 sec latency each, the total is 1 sec of query time per second. If we double the input by adding 100 queries per second at 0.0001 sec each, the total becomes 1.01 sec.
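That back-of-the-envelope math is easy to check directly:

```python
# Baseline: 100 queries/sec at 0.01 sec (10 ms) each.
baseline_total_sec = 100 * 0.01   # 1.0 second of query time per second

# Double the query count by adding 100 queries/sec at 0.0001 sec each.
added_total_sec = 100 * 0.0001    # 0.01 seconds

new_total_sec = baseline_total_sec + added_total_sec
print(new_total_sec)  # ~1.01: a roughly 1% bump we should be able to see
```

So even a doubling of throughput by very fast queries should have nudged the total, yet the metric didn't move.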
Query Type - D'oh! / Ah ha!
And that's when we looked more closely at the query type itself.
The query was one designed to read the MongoDB replica set oplog, which means that what VividCortex was capturing was a replica reading recent operations in order to keep itself in sync.
The reason latency appeared to drop even as operations went up was that the query was continuously reading more from the cursor: the faster it returned, the more often the replica fetched data. And because reading the replication stream is, by definition, a one-second-per-second operation, the metric shows a constant "1" across time frames and workloads.
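A toy model (not MongoDB's actual tailing protocol) makes the "one second per second" intuition concrete: a reader blocking on a tailable cursor spends essentially the whole wall-clock window inside the query, regardless of how many operations flow through it:

```python
def tailing_query_time_us(ops_per_sec: int, window_sec: int = 1) -> int:
    """Time spent 'in the query' by a reader tailing a replication stream.

    Whether ten ops or a hundred thousand ops arrive, the reader is
    always either waiting on the cursor or consuming results, so it
    spends (roughly) the entire window inside the query.
    """
    return window_sec * 1_000_000  # microseconds, independent of ops_per_sec

for rate in (10, 1_000, 100_000):
    print(rate, tailing_query_time_us(rate))  # always 1,000,000 µs per second
```

The ops_per_sec argument is deliberately unused; that's the whole point. Total query time here tracks wall-clock time, not workload.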
In cases like this one, "average latency" (time_us divided by tput) can actually be misleading. By calculating average latency, you're calculating the average for the entire system; if your system includes a significant number of particularly fast queries, they can tip the numbers and drag the average latency down, but not necessarily in a way that's representative of the queries you're interested in.
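A minimal sketch of that skew, with hypothetical latencies (a 50 ms query class of interest plus a flood of 0.1 ms queries):

```python
def avg_latency_ms(query_classes):
    """System-wide average latency: total query time divided by total count."""
    total_time = sum(count * latency for count, latency in query_classes)
    total_count = sum(count for count, _ in query_classes)
    return total_time / total_count

slow = (100, 50.0)  # 100 queries/sec at 50 ms each: the class we care about

print(avg_latency_ms([slow]))                 # 50.0 ms on its own
print(avg_latency_ms([slow, (10_000, 0.1)]))  # ~0.59 ms once a flood of fast
                                              # queries joins the mix
```

The slow class hasn't changed at all, yet the system-wide average has collapsed by nearly two orders of magnitude.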
So, a couple of key takeaways to keep in mind: knowing the latency of individual queries, not just system-wide averages, can be very helpful, and keeping the scales of your metrics in mind matters just as much.