VividCortex’s SaaS backend is a service-oriented architecture, which means that in addition to our external APIs that our agents and web UI use, we also have internal APIs. Deploy a bunch of API servers across a cluster of machines, hook everything up and make all the parts talk to each other, and what do you have? You have a distributed system that’s hard to troubleshoot, that’s what.
One of the primary problems in this type of system is finding out what’s happening, right now. Not status counters or metric rates, but what requests are in-flight and what status they are in, and most especially what they’re waiting for. At VividCortex we have built several relatively simple pieces of plumbing to help us get that kind of visibility.
These tools have been out there for a while, but we’ve never spent much time explaining what we did and what the benefits are. They have let us solve, the “easy” way, a lot of problems that otherwise might have been very hard (i.e. tedious and time-consuming, with lots of dead-end investigation).
Our system is composed of the following:
- Live process lists
- Web and commandline clients
- Chat-bot integration
- Poor man’s profiling
Let’s take a look at what these are and how we use them.
What’s Running Right Now?
Quick, what requests are pending – started but not finished – in your application? If you don’t know, you’re not alone. But wouldn’t you like to know?
A lot of systems have some notion of processes. The most basic and familiar is ps in Unix. This is an interface into the process table the operating system keeps. MySQL has a process table too, and you can view it with the SHOW PROCESSLIST command, or by selecting from system tables.
A process is the basic worker. Its job is to handle some kind of tasks. In MySQL, those tasks are queries, so when you look at SHOW PROCESSLIST you can see whether a thread (connection / worker) is idle or running a query, and if it’s running a query, what state it’s in. This notion of process state is another really important part of knowing what’s going on. Unix processes have states, too.
Finally, with the ability to view the processes and their states, another set of functionality can be built on top of that:
- Sort the processes, as top does in Unix, and innotop does for MySQL.
- Get profiles – aggregations of states and the times spent in these states.
- Kill processes that are causing problems.
Now, the question is how to build this for a SOA system like VividCortex’s backend. What are the processes and tasks? What are the states? What would we like to know about them, and what would we like to do to them?
Obviously, part of this is technology-specific and part is generally applicable. We’re dealing with services that speak JSON over HTTPS, so the tasks are clearly HTTP requests. The workers, or processes, are the tech-specific thing. We use Go to build all of our internal and external APIs. Each HTTP request is handled by a single goroutine, so a process is a goroutine, which is like a lightweight thread running in the Go runtime, multiplexed onto a pool of threads. Now the fun starts: Go doesn’t allow you to identify, inspect, or manipulate (e.g. kill) a goroutine. How can we do this?
The answer is that we built a process table library that we hook into our code. It’s simple to integrate into the code and it has a set of functionality that wraps around it to provide the things we want. This is our open-source pm library.
The pm library works by creating a new HTTP server, which we tell to listen on a new port. Now each API server process is listening on one port for inbound requests it’s supposed to handle, and each of these requests is registered with the process table. When the request finishes, it’s removed. Meanwhile, the server is also listening on another port that provides a REST interface to the process table. So with a simple HTTP request to that port, you can get a list of everything that server is doing right now: the requests, their current status, a list of their status histories, and the time spent in each status. You can also mark a request for cancellation, which will cause its goroutine to panic() the next time it updates its status. (Don’t worry, this only terminates the goroutine itself, not the whole server process.)
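The register, update, cancel, and remove cycle described above can be sketched in a few dozen lines of Go. This is only an illustration of the idea, not the pm library's actual API: the Table type, its methods, and the handle function are all hypothetical names, and the REST interface is left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// killed is the panic value used to unwind a cancelled request.
type killed struct{ id string }

// proc is one row of the table: a request and its status history.
type proc struct {
	status  string
	since   time.Time // when the current status began
	history []string  // earlier statuses, oldest first
	kill    bool      // set by Kill, checked by Status
}

// Table is a hypothetical, minimal process table (not the pm API).
type Table struct {
	mu    sync.Mutex
	procs map[string]*proc
}

func NewTable() *Table { return &Table{procs: make(map[string]*proc)} }

// Start registers a request when it begins.
func (t *Table) Start(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.procs[id] = &proc{status: "init", since: time.Now()}
}

// Status records a new status. If the request has been marked for
// cancellation, it panics instead so the handling goroutine unwinds.
func (t *Table) Status(id, status string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	p, ok := t.procs[id]
	if !ok {
		return
	}
	if p.kill {
		panic(killed{id})
	}
	p.history = append(p.history, p.status)
	p.status, p.since = status, time.Now()
}

// Kill marks a request for cancellation; it reports whether id was found.
func (t *Table) Kill(id string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	p, ok := t.procs[id]
	if ok {
		p.kill = true
	}
	return ok
}

// Done removes the request and absorbs a cancellation panic. It must be
// deferred directly (defer t.Done(id)) for recover to take effect.
func (t *Table) Done(id string) {
	t.mu.Lock()
	delete(t.procs, id)
	t.mu.Unlock()
	if r := recover(); r != nil {
		if _, ok := r.(killed); !ok {
			panic(r) // not a cancellation: let it propagate
		}
	}
}

// Len reports how many requests are currently registered.
func (t *Table) Len() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.procs)
}

// handle simulates a request handler and returns the last step it reached.
// killAfter marks the request for cancellation once that step has started.
func handle(t *Table, id, killAfter string) (reached string) {
	t.Start(id)
	defer t.Done(id)
	for _, step := range []string{"parsing", "querying", "rendering"} {
		t.Status(id, step)
		reached = step
		if step == killAfter {
			t.Kill(id)
		}
	}
	return reached
}

func main() {
	tbl := NewTable()
	fmt.Println(handle(tbl, "req-1", ""))         // runs to completion
	fmt.Println(handle(tbl, "req-2", "querying")) // cancelled mid-flight
	fmt.Println(tbl.Len())                        // both requests were removed
}
```

In this sketch the second request never reaches "rendering": the panic raised by its next status update is absorbed by the deferred Done, so only that goroutine's work stops. In a real server the Kill call would arrive from another goroutine, for example the one serving the REST interface.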
So this is the basic building block of seeing what user requests are in flight at any time. It’s REST, so we can build more on top of it. That’s where our client programs come in.
The pm project includes a commandline client. If you have Go installed on your computer, it takes a couple of seconds to get it and compile it:
baron$ go get github.com/VividCortex/pm/cli
Now we can run this and give it an endpoint to poll. Its job is to act like top, more or less. I’ll explain in a bit how I know which endpoint to poll.
baron$ ./cli --endpoints=api3:9084
The program polls that endpoint and redraws the screen at intervals, just like top:
Host       Id                Time       Status  method  uri
api3:9084  41b3f9b329e2b462  0.0006941  init    POST    /hosts/[redacted]/agents/vc-agent-007
Cool! I caught a request to our /hosts API. It told me what it was, what it was doing, and how long it’d been running. Nice, but it’s a little handier to look in a web browser. We have a browser client program, written in AngularJS, which can do the same thing. You can find this in pm-web. It runs completely in-browser and just polls a list of endpoints.
Nifty, but so far that’s just one process that was in-flight in one of the many, many API server programs we have running on lots of different hosts in our infrastructure. What about the global view of all of them? For this we need a list of endpoints.
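Given such a list, the client side is mostly a fan-out and a merge. The sketch below is an assumption-laden illustration rather than how the pm clients are written: the Proc JSON shape, the /procs/ path, and the function names are hypothetical, and a stub fetcher stands in for real servers so the example runs offline.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"sort"
	"sync"
	"time"
)

// Proc is a hypothetical JSON shape for one in-flight request;
// the real wire format may differ.
type Proc struct {
	Host   string  `json:"-"` // filled in by the poller
	ID     string  `json:"id"`
	Status string  `json:"status"`
	Time   float64 `json:"time"` // seconds spent so far
}

// Fetcher returns the process list for one endpoint.
type Fetcher func(endpoint string) ([]Proc, error)

// httpFetch GETs an assumed /procs/ path and decodes the JSON reply.
func httpFetch(endpoint string) ([]Proc, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://" + endpoint + "/procs/")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var procs []Proc
	err = json.NewDecoder(resp.Body).Decode(&procs)
	return procs, err
}

var errDown = errors.New("endpoint down")

// stubFetch stands in for real servers so the example runs offline.
func stubFetch(endpoint string) ([]Proc, error) {
	switch endpoint {
	case "api1:9084":
		return []Proc{{ID: "a1", Status: "init", Time: 0.2}}, nil
	case "api2:9084":
		return []Proc{
			{ID: "b1", Status: "querying", Time: 3.5},
			{ID: "b2", Status: "init", Time: 0.01},
		}, nil
	}
	return nil, errDown
}

// pollAll queries every endpoint concurrently and merges the results,
// longest-running first. Endpoints that fail are skipped.
func pollAll(endpoints []string, fetch Fetcher) []Proc {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		all []Proc
	)
	for _, ep := range endpoints {
		wg.Add(1)
		go func(ep string) {
			defer wg.Done()
			procs, err := fetch(ep)
			if err != nil {
				return
			}
			mu.Lock()
			defer mu.Unlock()
			for _, p := range procs {
				p.Host = ep
				all = append(all, p)
			}
		}(ep)
	}
	wg.Wait()
	sort.Slice(all, func(i, j int) bool { return all[i].Time > all[j].Time })
	return all
}

func main() {
	endpoints := []string{"api1:9084", "api2:9084", "api9:9084"}
	// Swap stubFetch for httpFetch to poll real servers.
	for _, p := range pollAll(endpoints, stubFetch) {
		fmt.Printf("%-10s %-4s %8.3f %s\n", p.Host, p.ID, p.Time, p.Status)
	}
}
```

Passing the fetcher in as a function keeps the merge logic independent of the transport, which also makes it easy to try out without a running cluster.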
ChatOps, if you’re not familiar with it, is the most awesome thing since chat. You have a bot that integrates with your systems, and you ask it to do stuff for you. One of the things we can ask our chatbot is /pm <environment> and it prints out a URL that leads to the GitHub-hosted copy of pm-web, with the URL string containing the list of all services in our architecture and their pm ports. A simple click and you’re looking at a live, realtime updating view of every request that’s in flight. Across the whole architecture. And you can kill and get status history for individual requests too. I’ve personally done this a few times when something gets stuck, as well as just identifying an API endpoint that might be causing trouble.
To see how useful this is, consider the following completely real scenario: a customer chats to our on-call support person through the in-app chat and says something’s taking a long time to load, and refreshing the browser doesn’t help. Uh-oh! The customer probably ran into a bug and fired off several long-running requests that are doing nothing but adding load and interfering with other requests. Quick: where are they running? Across a load-balanced infrastructure, this is not easy to find out. But pm makes it easy to find and kill those requests. No sweat!
We haven’t open-sourced our implementation of this, but it’s really simple. You should be able to build your own in a few lines of code for Hubot or similar. It’ll be specific to your infrastructure, of course.
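To give a rough idea of the core of such a command, here is a sketch in Go of the part that turns an environment name into a link. Everything specific in it is an assumption made for illustration: the inventory map, the base URL, and the "endpoints" query parameter are hypothetical, and the real pm-web URL format may well differ.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// inventory maps environment -> host -> pm port. It is a hypothetical
// stand-in for whatever service registry your infrastructure has.
var inventory = map[string]map[string]int{
	"staging": {"api1.staging": 9084, "api2.staging": 9085},
}

// pmWebURL builds a link to a pm-web page with the endpoint list in the
// query string. The "endpoints" parameter name is an assumption.
func pmWebURL(baseURL, env string) (string, error) {
	hosts, ok := inventory[env]
	if !ok {
		return "", fmt.Errorf("unknown environment %q", env)
	}
	endpoints := make([]string, 0, len(hosts))
	for host, port := range hosts {
		endpoints = append(endpoints, fmt.Sprintf("%s:%d", host, port))
	}
	sort.Strings(endpoints) // map order is random; keep the link stable
	q := url.Values{}
	q.Set("endpoints", strings.Join(endpoints, ","))
	return baseURL + "?" + q.Encode(), nil
}

func main() {
	// The base URL is a placeholder, not the real pm-web location.
	link, err := pmWebURL("https://pm-web.example.com/", "staging")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(link)
}
```

The bot's command handler then only has to parse the environment out of the chat message and reply with the returned link.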
Poor Man’s Profiling
So now we have the ability to see and manage every customer request that’s in-flight, across our entire distributed architecture (and so can you, with a few lines of code). But not everything is a service. There are workers, too. For example, there are Kafka consumers. They’re drinking from the fire-hose of inbound data, doing useful things with it. What about managing their tasks? They are highly concurrent and they do things continually; they don’t really have the notion of “receive a request, process it, finish.” The pm library, and the process table abstraction, isn’t appropriate for this. If one of them gets behind, what status is it in, and what’s it doing?
One way to do this is with profiling. There are basically two ways to profile programs: by timing things they do, and by sampling and seeing their current task. There are profiling tools built into Go, but sometimes you just want a stack dump of everything, too. This is akin to the processlist: it shows you every goroutine (and in other programming languages it’d show you threads) and the stack traces. By sampling this and aggregating the samples, not only do you get the ability to see what’s happening at points in time, but you see where things spend their time blocked. This is “poor man’s profiling,” or PMP.
This may sound crude, but it’s incredibly useful. It’s hard to even estimate how much of the performance improvements I’ve seen in MySQL are due to Domas and others doing this type of wait analysis. See http://poormansprofiler.org for more of the story on this.
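The aggregation half of the technique is small enough to sketch. The idea is to collapse each sampled stack into a signature and count how often each signature shows up; the stacks that appear most often are where the time goes. The types, function names, and frame names below are made up for illustration, and the samples are hard-coded rather than captured from a live process.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is one stack captured at one instant, innermost frame first.
type Sample []string

// Count pairs a collapsed stack signature with how often it was seen.
type Count struct {
	Stack string
	N     int
}

// aggregate collapses each sample to its innermost `depth` frames
// (depth should be positive) and counts identical signatures,
// returning the most frequent first.
func aggregate(samples []Sample, depth int) []Count {
	seen := make(map[string]int)
	for _, s := range samples {
		if len(s) > depth {
			s = s[:depth]
		}
		seen[strings.Join(s, " <- ")]++
	}
	out := make([]Count, 0, len(seen))
	for stack, n := range seen {
		out = append(out, Count{Stack: stack, N: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].N != out[j].N {
			return out[i].N > out[j].N
		}
		return out[i].Stack < out[j].Stack // stable order for ties
	})
	return out
}

func main() {
	// Invented frame names standing in for real sampled stacks.
	samples := []Sample{
		{"net.read", "kafka.fetch", "consumer.run"},
		{"net.read", "kafka.fetch", "consumer.run"},
		{"db.exec", "store.save", "consumer.run"},
		{"net.read", "kafka.fetch", "consumer.run"},
	}
	for _, c := range aggregate(samples, 2) {
		fmt.Printf("%3d  %s\n", c.N, c.Stack)
	}
}
```

Truncating to a fixed depth is a judgment call: too shallow and unrelated callers merge together, too deep and nearly every sample becomes unique.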
At some point, we wanted something similar for Go programs to get the high-level view on what Kafka consumers were doing. But PMP has some downsides.
- It attaches GDB to the process, freezing it while it walks the stacks and prints them out.
- It doesn’t know about Go’s internals, so you basically get to find out what the Go runtime is doing, but not what the goroutines themselves are doing.
I recalled that I’d seen a way to make the Go runtime print out a full stack trace of all goroutines, like it does when there’s a panic(). You can send a Go program a QUIT signal to do that. So I tried. It worked! But it also made the program quit. Bummer. Well, that’s not hard to fix. We just wired up a signal handler that, by default, handles SIGUSR1 by printing stack traces to the log.
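A handler along these lines might look like the following. It is a minimal sketch using only the standard library, not our actual code: the function names are hypothetical, it writes to a plain log.Logger rather than our logging setup, and the demo in main signals itself so that it terminates. It assumes a Unix-like system, since SIGUSR1 does not exist on Windows.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"syscall"
	"time"
)

// allStacks returns the stack traces of every goroutine, growing the
// buffer until the whole dump fits.
func allStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true = all goroutines
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

// dumpOnSignal logs all goroutine stacks each time sig arrives.
// Unlike the default SIGQUIT behavior, the program keeps running.
func dumpOnSignal(sig os.Signal, logger *log.Logger) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, sig)
	go func() {
		for range ch {
			logger.Printf("goroutine dump:\n%s", allStacks())
		}
	}()
}

func main() {
	dumpOnSignal(syscall.SIGUSR1, log.New(os.Stderr, "", log.LstdFlags))

	// Demo only: signal ourselves, as `kill -USR1 <pid>` would from a shell.
	if err := syscall.Kill(os.Getpid(), syscall.SIGUSR1); err != nil {
		log.Fatal(err)
	}
	time.Sleep(100 * time.Millisecond) // give the handler time to log
}
```

Because the dump is taken from inside the process by the Go runtime, it lists goroutines and their wait states directly, which is exactly what the GDB-based approach could not show.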
We haven’t open-sourced this yet, but we might. It’s a little bit integrated with our logging, and we didn’t want to refactor a bunch of code at the moment that we wrote it to help us figure out slow consumers. Doing it the way we did allowed us to get our problems solved with only a few lines of code. A full solution is a bunch more code.
We’ll likely improve this, too. I’m thinking we’ll make it set up an HTTP listener. When you do a GET to that endpoint, you’ll get a JSON-formatted version of the goroutines and their stacks. This will be much nicer for writing clients to consume, instead of text-based aggregation of stack traces in the app logs.
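As a sketch of what that could look like (this is a guess at a design, not something we have built): take the text the runtime produces, split it into one record per goroutine, and serve the result as JSON. The Goroutine shape and the function names are hypothetical, and the demo uses httptest so it can run and exit; a real service would listen on a dedicated port instead.

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"os"
	"runtime"
	"strings"
)

// Goroutine is a hypothetical JSON shape for one goroutine.
type Goroutine struct {
	Header string   `json:"header"` // e.g. "goroutine 1 [running]:"
	Stack  []string `json:"stack"`  // function and file:line entries
}

// parseDump splits the text produced by runtime.Stack, in which
// goroutines are separated by blank lines, into one record each.
func parseDump(dump string) []Goroutine {
	dump = strings.TrimSpace(dump)
	if dump == "" {
		return nil
	}
	var out []Goroutine
	for _, block := range strings.Split(dump, "\n\n") {
		lines := strings.Split(block, "\n")
		g := Goroutine{Header: lines[0]}
		for _, line := range lines[1:] {
			g.Stack = append(g.Stack, strings.TrimSpace(line))
		}
		out = append(out, g)
	}
	return out
}

// stacksHandler answers a GET with the current goroutines as JSON.
func stacksHandler(w http.ResponseWriter, r *http.Request) {
	buf := make([]byte, 1<<20) // dumps larger than 1 MiB are truncated
	n := runtime.Stack(buf, true)
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(parseDump(string(buf[:n]))); err != nil {
		log.Printf("encoding goroutines: %v", err)
	}
}

func main() {
	// httptest keeps the demo self-contained; a real service would call
	// http.ListenAndServe on its own port instead.
	srv := httptest.NewServer(http.HandlerFunc(stacksHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}
```

With structured output like this, a client can group goroutines by header state or by top stack entry without scraping logs.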
Distributed systems are hard. If you can’t follow requests through them, when something gets wedged nobody knows where to look or what to do. Being able to trace tasks, see what they’re doing, where they spend most of their time (and thus how to make them faster), where they’re waiting, and kill requests that are causing problems – and building this as service endpoints – is a huge help.
Hopefully the software we’ve written and open-sourced thus far will be a help to you, too. For the things we haven’t shared as open-source yet, I hope our descriptions of them are helpful enough that you can also see how to solve problems in your own systems. And, we’re constantly improving. We welcome not only your suggestions and ideas, but also your pull requests!