Over the last several months we’ve designed and built an events dashboard that lets you interactively inspect very large amounts of system event data. This feature was driven by customer requests and feedback. The initial proof of concept established its usefulness right away, when customers began remarking that they’d diagnosed server issues by noticing events such as database restarts, replication failures, and configuration changes. At least one customer told us this saved them a long wild-goose chase.
The key concept is that event data is loaded fully into the browser, so you can thin-slice and drill down without reloading any events. And it’s fast. Really fast – it remains responsive even with hundreds of thousands of events. Here’s what the last 30 days of fine-grained events look like on our own systems:
The skyline along the top shows events by count over time. The left-hand pane lets you select events by severity. The category selector at the top lets you select or deselect categories. We’ve loaded more than 16,000 events in the last month. Let’s look at agent startup events only:
Why would we do that? It illustrates how often we deploy new versions of agents to our systems. We ship code dozens of times a day – our agents, backend, and webapp are constantly improving. (We ship changes to customers less often, naturally.) Here’s what we see after applying this choice (cropped to show detail):
The gray bars in the skyline remain at their original height, and the blue bars show the filtered dataset. Whoops, there were clearly a lot of agent restarts during this time! In fact these weren’t code deploys – they were caused by a bug that made some agents restart. This dashboard was very helpful for finding that problem and fixing it before it caused problems for customers.
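To make the gray-versus-blue idea concrete, here is a minimal sketch of how a skyline like this could be computed from events already held in memory. It is not the dashboard’s actual code: the `SkylineEvent` shape, the field names, and the `binCounts` helper are all assumptions made for illustration.

```typescript
// Hypothetical event shape; the field names are assumptions for illustration.
interface SkylineEvent {
  timestamp: number; // milliseconds since the Unix epoch
  category: string;  // e.g. "agent_startup"
  severity: string;  // e.g. "info", "warn"
}

// Count events per fixed-width time bucket in [start, end).
function binCounts(
  events: readonly SkylineEvent[],
  start: number,
  end: number,
  bucketMs: number,
): number[] {
  const counts: number[] = new Array<number>(Math.ceil((end - start) / bucketMs)).fill(0);
  for (const e of events) {
    if (e.timestamp < start || e.timestamp >= end) continue; // outside the window
    const i = Math.floor((e.timestamp - start) / bucketMs);
    counts[i] = (counts[i] ?? 0) + 1;
  }
  return counts;
}

const HOUR = 60 * 60 * 1000;

// A tiny made-up dataset standing in for the events loaded into the browser.
const allEvents: SkylineEvent[] = [
  { timestamp: 0 * HOUR + 10, category: "agent_startup", severity: "info" },
  { timestamp: 0 * HOUR + 20, category: "db_restart", severity: "warn" },
  { timestamp: 1 * HOUR + 5, category: "agent_startup", severity: "info" },
  { timestamp: 2 * HOUR + 1, category: "config_change", severity: "info" },
  { timestamp: 2 * HOUR + 2, category: "agent_startup", severity: "info" },
  { timestamp: 2 * HOUR + 3, category: "agent_startup", severity: "info" },
];

// The category selection only filters the in-memory array; nothing is refetched.
const selectedCategories = new Set<string>(["agent_startup"]);
const filteredEvents = allEvents.filter((e) => selectedCategories.has(e.category));

// Gray bars always come from the full dataset, blue bars from the filtered one.
const grayBars = binCounts(allEvents, 0, 3 * HOUR, HOUR);      // [2, 1, 3]
const blueBars = binCounts(filteredEvents, 0, 3 * HOUR, HOUR); // [1, 1, 2]
```

Because both series use the same buckets, the blue bars can be drawn over the gray ones, which is what makes an unusual share of one category stand out.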
The skyline view is interactive, too. You can click and drag to select regions of it, which further filters the view, all without reloading the page or the dataset.
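One way to picture this is to treat the dragged region as just another filter over the same in-memory array. The sketch below is illustrative only; `ViewFilter`, `dragToRange`, and `applyFilter` are hypothetical names, and the pixel-to-time mapping assumes a linear time axis.

```typescript
// Hypothetical shapes; the names are assumptions for illustration.
interface TimedEvent {
  timestamp: number; // milliseconds
  category: string;
}

interface ViewFilter {
  categories: Set<string> | null;  // null means "all categories"
  range: [number, number] | null;  // [from, to) in ms; null means "all time"
}

// Convert a drag between two pixel positions into a time range,
// assuming the skyline maps [start, end] linearly onto [0, widthPx].
function dragToRange(
  x0: number,
  x1: number,
  widthPx: number,
  start: number,
  end: number,
): [number, number] {
  const clamp = (x: number): number => Math.min(Math.max(x, 0), widthPx);
  const toTime = (x: number): number => start + (clamp(x) * (end - start)) / widthPx;
  // Dragging right-to-left should select the same region as left-to-right.
  return [toTime(Math.min(x0, x1)), toTime(Math.max(x0, x1))];
}

// Apply every active filter to the events already loaded; no network calls.
function applyFilter(all: readonly TimedEvent[], f: ViewFilter): TimedEvent[] {
  return all.filter((e) => {
    if (f.categories !== null && !f.categories.has(e.category)) return false;
    if (f.range !== null && (e.timestamp < f.range[0] || e.timestamp >= f.range[1])) return false;
    return true;
  });
}

const loadedEvents: TimedEvent[] = [
  { timestamp: 100, category: "agent_startup" },
  { timestamp: 400, category: "db_restart" },
  { timestamp: 600, category: "agent_startup" },
  { timestamp: 900, category: "agent_startup" },
];

// A drag from pixel 350 back to pixel 100 on a 500px skyline covering times 0..1000.
const draggedRange = dragToRange(350, 100, 500, 0, 1000); // [200, 700]

const visibleEvents = applyFilter(loadedEvents, {
  categories: new Set<string>(["agent_startup"]),
  range: draggedRange,
});
// Only the agent_startup event at timestamp 600 passes both filters.
```

Keeping the category and time-range selections in one filter object means each interaction simply re-runs `applyFilter` over the full array, so the selections compose without any special cases.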
You might think the table in the main body of this tool shows events, but it actually shows groups of similar events together. Click on an item to expand it, and you see all the events in the group, on a timeline. If there’s a metric associated with the event, that’ll be charted in the timeline too.
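As a rough sketch of what grouping similar events might look like, the code below collects events under a key and keeps each group’s members in time order for the timeline. How the dashboard actually decides that two events are “similar” isn’t described here; grouping by category and host is purely an assumption, as are the `GroupableEvent` and `EventGroup` shapes.

```typescript
// Hypothetical shapes; the names and fields are assumptions for illustration.
interface GroupableEvent {
  timestamp: number; // milliseconds
  category: string;
  host: string;
  metric?: number;   // optional associated metric value, if the event has one
}

interface EventGroup {
  key: string;
  count: number;
  first: number; // earliest timestamp in the group
  last: number;  // latest timestamp in the group
  events: GroupableEvent[];
}

// Group events by a caller-supplied key, one table row per group.
function groupEvents(
  events: readonly GroupableEvent[],
  keyOf: (e: GroupableEvent) => string,
): EventGroup[] {
  const groups = new Map<string, EventGroup>();
  for (const e of events) {
    const key = keyOf(e);
    const g = groups.get(key);
    if (g === undefined) {
      groups.set(key, { key, count: 1, first: e.timestamp, last: e.timestamp, events: [e] });
    } else {
      g.count += 1;
      g.first = Math.min(g.first, e.timestamp);
      g.last = Math.max(g.last, e.timestamp);
      g.events.push(e);
    }
  }
  const rows = Array.from(groups.values());
  // Chronological order within a group, so expanding a row yields a timeline.
  for (const g of rows) g.events.sort((a, b) => a.timestamp - b.timestamp);
  // Largest groups first.
  return rows.sort((a, b) => b.count - a.count);
}

const sampleEvents: GroupableEvent[] = [
  { timestamp: 30, category: "agent_startup", host: "db1" },
  { timestamp: 20, category: "db_restart", host: "db1", metric: 4.2 },
  { timestamp: 10, category: "agent_startup", host: "db1" },
];

// Assumed similarity rule: same category on the same host.
const tableRows = groupEvents(sampleEvents, (e) => `${e.category}@${e.host}`);
// tableRows[0] is "agent_startup@db1" with 2 events spanning timestamps 10..30.
```

Passing the key function in as a parameter keeps the grouping rule swappable, which seems useful when “similar” could reasonably mean different things for different kinds of events.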
Even this chart is interactive! If points are tightly clustered you can click and drag to zoom in. Clicking on an individual event shows its details in the expanded table row.
This all takes a lot longer to read about than it does to see in action. The net effect is that you can filter and drill into tens of thousands of events in a few seconds. It’s amazingly productive and highlights all the things you always wish you knew about your systems’ behavior. There’s more than one kind of data regarding your systems, and more than one way to look at all of it.
We’re not done; there’s more to improve, and we’re building on this feature by integrating it into all of our other features. Upcoming capabilities will include customizable alerts, integrations, triggered reports, and, above all, the things customers ask us for.