Back in September of last year, a customer brought to our attention an unusual behavior: they were seeing inexplicable alerts, reporting that a host was failing to send data for more than fifteen minutes. Yet when the customer checked the host a short while after, they saw that no data was actually missing. Strange, we thought, so we began to investigate.
At first, everything we looked at seemed fine, but eventually the issue revealed itself: the vc-aggregator agent did, in fact, occasionally fail to send data.
The vc-aggregator is a metrics collector for the other VividCortex agents. It receives metrics forwarded using the StatsD protocol and periodically sends them to VividCortex's API. But it was only occasionally failing? What’s the big deal? Well, missing data -- any data -- is a serious concern for us; the agent needs to be rock solid. The discovery of this issue was especially significant because we'd debugged several problems in this agent in the past to make it reliable. And, finally, I found that the issue only affected a couple of servers, so we weren't sure what caused it or why those machines were special. The plot thickened.
I started debugging by instrumenting our agent to log when we sent data to our APIs and how long each request took, including the requests that failed. Surprisingly, I saw failed requests that took up to 30 minutes to return an error. A delay like that is a prime indicator that network requests are being made without a timeout set, so I checked that right away and confirmed it was indeed the case here. VividCortex has run into this a few times in the recent past in our own services, but, in this case, it seemed we had simply forgotten to check our agents and make sure they were using timeouts too.
No big deal, I thought: I added a one-minute timeout to all requests made by that agent's http.Client. Problem solved, right?
Nope! Even with a timeout set, we were seeing requests delayed far beyond the one-minute limit I had set. At this point I suspected that we must be doing something very wrong (and -- spoiler alert -- it didn't really occur to me that there could be a problem with the standard Go library. That code is heavily used by the entire community; surely someone would have run into the problem already if it existed!?).
My next step was to look through the Go issue tracker to see if any relevant bugs had been fixed. At the time, we were still using Go 1.3, though 1.5 had been released. My hunt turned up nothing matching what I was seeing, but I figured upgrading was a good idea anyway, just in case. Meanwhile, I found no posts on the golang-nuts mailing list from anyone with a similar issue.
Okay, I thought, if there are no issues logged as fixed and nobody has submitted relevant questions, then I must be doing something wrong, right? Time to learn more about how the HTTP library and request cancellation work.
In pursuit of making sure I understood everything necessary, I did the following:
- Spent some time reading through Go's net/http implementation, tracing the lifecycle of a request and how requests are cancelled.
- Ended up writing a wrapper to report when requests in the Agent are cancelled, and measure how long we spend trying to cancel a request.
- Found that cancellations were being triggered at 1min as expected, but the cancellation became blocked for the remainder of the request!
So cancellations weren't working how I’d assumed, and nothing in the code or documentation indicated that this was expected behavior. I finally decided it was time to reach out to the Golang-Nuts community mailing list myself and see if I could get an answer.
I posted to the Google Group, asked my question, and shared the following bit of code:
I have a service written in Go that is running as a data collector, it's designed to send batches of data up to a server once a minute over HTTPS, and also does a "checkin" with the server reporting its current health. I've been dealing with a problem recently where the collector's requests posting new batches of data are getting stuck for a long period of time, and causing it to get backed up. My collector is built with Go1.5.2 using the standard net/http library to make requests up to the server, and the http.Client is built like so:
As I continued to look into the issue on my own, Brad Fitzpatrick, one of Go’s core developers and primary maintainer of the net/http package, answered my post. He didn't see anything obviously wrong with what we were doing, and he recommended getting a snapshot of the program mid-cancellation to see what was going on.
I instrumented the agent's code to take a snapshot of all goroutines after a request cancellation had been stuck for over a minute (so the request had been hanging for two minutes at that point), and found this issue:
The connection implemented by the crypto/tls package (used for encrypted HTTPS communication with servers) was written in such a way that if a request was blocked trying to write data to the server, then cancelling that request would also block, waiting for its turn to write a notification to the server that we were closing the connection. This was a pretty surprising example of two standard library packages that should be designed to work well together but weren't aligned in their goals.
I reported my findings back to the mailing list and got a quick response from Brad confirming that this was unexpected behavior. Two days later, he followed up on the mailing list and told me that he had submitted a fix for the Go 1.6 release
(crypto/tls: don't block in Conn.Close if Writes are in-flight)
And, while Brad worked on a fix in the standard library, I’d discovered a workaround for our agent: a custom connection wrapper that sets a deadline on every read and write operation the connection performs, forcing the network operation to return with a timeout error if it exceeds the deadline. Doing so has fixed all known occurrences of the issue.
Some lessons learned from this:
- Do your research, but don't be afraid to ask the community for help! golang-nuts is a great resource for getting questions answered, and also for finding out if anyone else has seen a problem you are having. If I had posted earlier, this issue could have been solved more quickly for us. That said, you need to be ready with answers for the follow-up questions you are likely to receive.
- The standard library is well tested, but it can still have bugs. I went through a bit of denial at the start of my investigation. I basically discounted the idea that the standard library could be wrong and assumed it was user error. But, of course, even well tested code can produce unexpected behavior from time to time.
- Stack dumps are the ultimate investigation tool: I have used them successfully to identify a number of problems in our internal services, and even another instance where we were running into an issue in Go's standard library. If you get stuck trying to understand how/why two goroutines are interacting the way they are, instrument your code to grab a goroutine trace and you can see exactly what is happening.
Thanks again to the Go team for their help, and I hope this incident helps others solve their own mysteries in the future.