One of our agents, vc-mysql-query, works by sniffing TCP traffic with libpcap and decoding the MySQL protocol. As you can imagine, it’s one of the most complicated portions of our codebase. It’s also difficult to test. We have a set of tests that use tcpdump files of production MySQL traffic to test the sniffer code deterministically. We run the tcpdump files through the sniffer and check the generated output. The problem is that every time we add a new dump file, it takes a lot of manual work to know for sure what the output should contain. If the agent says that 100 SELECTs were run, how do we make sure that’s true?
In order to brute-force test and try to smoke out bugs, we wrote a small Go tool that we call the sanity-check. It runs a production version of vc-mysql-query and tests it with a black-box approach.
The first thing that the sanity-check tool does is create a test database to hold all of the test data. The tool then starts vc-mysql-query, configured to send metrics to a non-default location. The sanity-check listens on that address so it receives metrics from vc-mysql-query. After this, it starts up a configurable number of goroutines which each connect to MySQL and create some query traffic.
After each query-generator goroutine starts, it creates a new table for itself and runs a deterministic set of queries, including prepared statements. It keeps internal counters of query types as they run.
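The generator pattern described above can be sketched roughly as follows. This is not the actual sanity-check code: the names `Counts`, `execFunc`, `runGenerator`, and `runAll` are hypothetical, the query set is invented for illustration, prepared statements are left out, and `execFunc` stands in for a real MySQL connection so the sketch runs without a database.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Counts maps a query type (e.g. "select") to how many times it was sent.
type Counts map[string]int

// execFunc is a hypothetical stand-in for sending one query to MySQL.
type execFunc func(query string) error

// verb returns the lowercased first word of a query, used as its type.
func verb(query string) string {
	fields := strings.Fields(query)
	if len(fields) == 0 {
		return ""
	}
	return strings.ToLower(fields[0])
}

// runGenerator creates its own table, runs a fixed set of queries against
// it, and counts each query that was sent successfully.
func runGenerator(id, inserts int, exec execFunc) (Counts, error) {
	table := fmt.Sprintf("sanity_%d", id)
	queries := []string{fmt.Sprintf("CREATE TABLE %s (id INT)", table)}
	for i := 0; i < inserts; i++ {
		queries = append(queries, fmt.Sprintf("INSERT INTO %s VALUES (%d)", table, i))
	}
	queries = append(queries,
		fmt.Sprintf("SELECT COUNT(*) FROM %s", table),
		fmt.Sprintf("DROP TABLE %s", table),
	)

	sent := Counts{}
	for _, q := range queries {
		if err := exec(q); err != nil {
			return sent, err
		}
		sent[verb(q)]++
	}
	return sent, nil
}

// runAll starts n generators concurrently and merges their counters.
func runAll(n, inserts int, exec execFunc) (Counts, error) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
	)
	total := Counts{}
	for id := 0; id < n; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sent, err := runGenerator(id, inserts, exec)
			mu.Lock()
			defer mu.Unlock()
			for k, v := range sent {
				total[k] += v
			}
			if err != nil && firstErr == nil {
				firstErr = err
			}
		}(id)
	}
	wg.Wait()
	return total, firstErr
}

func main() {
	noop := func(string) error { return nil } // no real database in this sketch
	total, err := runAll(4, 10, noop)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(total)
}
```

Because every generator runs the same fixed query set, the merged totals are fully determined by the number of goroutines and the number of inserts, which is what makes them usable as ground truth.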
vc-mysql-query sees these queries as they cross the local network and generates time series metrics for them, which it sends back to the sanity-check tool. After all of the goroutines have finished their runs, the tool compares the aggregated totals of the queries it knows it sent against the totals vc-mysql-query reports having seen. It then deletes the test database, resets all internal state, and restarts the entire run until it reaches a configurable number of iterations.
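The comparison step might look something like the sketch below. The function name `compare` and the use of plain maps are assumptions for illustration, not the tool's real code; the message wording mirrors the output format shown in this post, minus the timestamps.

```go
package main

import (
	"fmt"
	"sort"
)

// compare checks the counts the tool sent against the counts the agent
// reported. It returns one line per query type and whether all matched.
func compare(sent, seen map[string]int) ([]string, bool) {
	// Collect the union of query types so extras on either side show up.
	types := map[string]struct{}{}
	for k := range sent {
		types[k] = struct{}{}
	}
	for k := range seen {
		types[k] = struct{}{}
	}
	names := make([]string, 0, len(types))
	for k := range types {
		names = append(names, k)
	}
	sort.Strings(names) // stable ordering for readable output

	ok := true
	lines := make([]string, 0, len(names))
	for _, name := range names {
		want, got := sent[name], seen[name]
		if want == got {
			lines = append(lines, fmt.Sprintf("✓ got %d %s", got, name))
			continue
		}
		ok = false
		lines = append(lines, fmt.Sprintf("✗ expected %d %s, got %d", want, name, got))
	}
	return lines, ok
}

func main() {
	sent := map[string]int{"insert": 10, "drop": 1, "create": 2}
	seen := map[string]int{"insert": 20, "drop": 1, "create": 2}

	lines, ok := compare(sent, seen)
	for _, l := range lines {
		fmt.Println(l)
	}
	if !ok {
		fmt.Println("FAIL")
	}
}
```

Comparing over the union of keys matters: a query type the agent reports but the tool never sent is as much a discrepancy as one that went missing.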
We successfully discovered and eliminated some bugs with this approach. Here is an example of the output from sanity-check:
14:05:11 ✗ expected 10 insert, got 20
14:05:11 ✓ got 1 drop
14:05:11 ✓ got 2 create
14:05:11 FAIL
14:05:11 Total iterations: 1, failed 1
It’s clear that something is wrong when we see twice as many queries as we sent. This was one of several bugs we found.
Some “bugs” turned out not to be bugs at all. For example, we issued a single SELECT, but the sanity-check tool told us that vc-mysql-query saw two. It turns out that the Go MySQL driver issues a SELECT of its own to fetch a system variable, and the sanity-check tool did not account for this in its internal counters. Our sniffer was correct, and it exposed behavior we didn’t expect in the Go driver.
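One way to account for queries the driver sends on its own is to add them to the expected totals before comparing. The sketch below is only an illustration: `adjustForDriver` is a hypothetical helper, and the assumption that the overhead is a fixed number of queries per connection is ours, not something the driver guarantees.

```go
package main

import "fmt"

// adjustForDriver returns a copy of sent with the driver's own queries
// added. perConn is an assumed count of extra queries per connection.
func adjustForDriver(sent map[string]int, conns int, perConn map[string]int) map[string]int {
	expected := make(map[string]int, len(sent))
	for k, v := range sent {
		expected[k] = v
	}
	for k, v := range perConn {
		expected[k] += v * conns
	}
	return expected
}

func main() {
	sent := map[string]int{"select": 1}     // what the test issued
	overhead := map[string]int{"select": 1} // assumed: one extra SELECT per connection

	expected := adjustForDriver(sent, 1, overhead)
	fmt.Println(expected["select"])
}
```

The helper copies the map instead of mutating it, so the raw counters stay available for debugging when the adjusted totals still disagree with the agent.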
One of the biggest wins from the sanity-check tool involved prepared statements. We saw that vc-mysql-query did not capture all of the prepared statements that were run. After debugging, we found a subtle bug in our sniffer code that affected the protocol decoding state, which ultimately led to some prepared statements being lost.
The challenge with writing sniffers is that we cannot guarantee that we see every packet. If we lose a single packet in a TCP stream, we can’t always be sure of the state of the rest of the connection, and we lose queries. We noticed this when we bumped up the concurrency option in our tool. We’ve found ways to deliberately introduce the conditions that cause this, and to mitigate the effects.
The other challenge is knowing which is right: the agent or the sanity-check tool. Fortunately, we can work with both at the same time, along with tools like tcpdump, to figure out where the problems are.
There are many things we can do to improve our sanity-check tool. It’s surprisingly minimal, at roughly 300 lines of code. In the future we’d like to add support for more of the metrics that vc-mysql-query produces. We’d also like to use some sort of fuzzer to generate more interesting queries.
Have you tested your software in this fashion? What bugs did you find? What suggestions do you have for us? We’re always interested in hearing how we can improve further.