Monitoring, like any other function of an application, is really a set of tradeoffs among many competing priorities in a high-dimensional space. In this post I’ll discuss some of those tradeoffs in hopes of helping you make choices that will lead to better outcomes.
Here are a few of the most important tradeoffs I’ve seen:
Developer Friendliness vs Operability
If you build your application to be easily developed but ignore how to deploy and operate it in production, you’ll likely end up with an app that is harder to operate and monitor. These need not be mutually exclusive goals—but if operability is an afterthought, you might make many decisions that do preclude choices later.
Your Process vs Their Software
All software, including monitoring systems, expresses a worldview and workflow. When these don’t match your own, the question is which one gives: do you adapt your systems and practices to fit the monitoring software, or do you require it to support your workflow?
Cost vs Visibility
In many cases, the more observable a system is, the more expensive it is to monitor. This follows rather directly from the amount of monitoring data you can collect from the system and the granularity at which you collect it. Monitoring can be expensive if you collect a lot of data, but it can pay off. I’ve heard that Netflix’s monitoring systems are a double-digit percentage of their overall operating budget. Netflix has even been described as a monitoring company that happens to stream movies. At the same time, Netflix’s revenue per employee is one of the highest among publicly traded companies. Coincidence? You decide.
As another example, I know of many companies migrating from Oracle to PostgreSQL for cost reasons. The licensing cost is certainly much lower, but there’s simply no comparison between the amount of visibility Oracle provides and what you can get in PostgreSQL. Is the compromise worth it? That’s a decision you have to make.
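To make the link between granularity and data volume concrete, here is a rough back-of-the-envelope sketch. The function names, the fleet size, and the 16-bytes-per-sample figure are all assumptions for illustration, not numbers from any particular monitoring system.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(sources: int, metrics_per_source: int, interval_seconds: int) -> int:
    """Estimate how many data points a fleet emits per day.

    sources            -- number of hosts, containers, or services being monitored
    metrics_per_source -- distinct metrics collected from each one
    interval_seconds   -- collection interval (smaller means finer granularity)
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Whole collection cycles that fit in a day (floors if the interval
    # does not divide a day evenly).
    cycles = SECONDS_PER_DAY // interval_seconds
    return sources * metrics_per_source * cycles


def bytes_per_day(samples: int, bytes_per_sample: int = 16) -> int:
    """Raw storage estimate; 16 bytes per sample is an assumed, uncompressed size."""
    return samples * bytes_per_sample


# Hypothetical fleet: 50 sources, 200 metrics each.
coarse = samples_per_day(50, 200, interval_seconds=60)  # one sample per minute
fine = samples_per_day(50, 200, interval_seconds=1)     # one sample per second
```

With these assumed inputs, moving from one-minute to one-second collection multiplies the number of samples by 60, and adding more sources scales the total in the same linear way. Real systems compress and downsample, so treat this as an upper-bound intuition rather than a pricing model.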
Isolated Services vs Monoliths
Microservices architectures are all the rage at the moment. We’re big fans of some of the principles of microservices at VividCortex. But we’ve definitely seen some customers struggle with the implications of monitoring microservices, especially when taken to extremes. Many small pieces means many sources of metrics, which means many metrics, which makes sophisticated monitoring systems a must. Likewise, lots of metrics leads to lots of cost, which I addressed in the previous point (cost versus visibility).
This point also applies to another current hot topic, containerization. If you ship tons of Docker containers and run lots of them in production, you have that many more things to monitor. The same goes for whether you isolate every workload onto its own database or let some databases handle multiple workloads, and even for whether you run a few big, powerful database servers or lots and lots of small, cheap ones.
This is not a small consideration. Depending on the monitoring system you want to use, you might find that you’re forced to move to a more scalable alternative, invest insane amounts of time, money, and hardware, or pay through the nose. Monitoring isn’t cheap no matter how you slice it, and when you multiply the number of “things” in your architecture by N, you generally are multiplying your monitoring costs too.
Any kind of shared or combined resource might amortize the monitoring cost, but at the same time it might reduce visibility. If you don’t use containers, and a server runs many different kinds of services, then which one of them is responsible for a spike in network traffic or disk IO from that server? It might be hard to tell. (VividCortex has per-process metrics on CPU, IO, and the like, but not all monitoring systems are capable of providing this level of granularity.)
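As a minimal sketch of why per-process granularity matters, suppose you can take two snapshots of cumulative bytes written per process. The process names and counter values below are made up for illustration, and this is not how VividCortex or any specific tool collects its data.

```python
from typing import Dict, Tuple


def io_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Bytes written per process between two snapshots.

    Assumes the counters are cumulative and only ever increase; a process
    missing from the first snapshot is treated as having started at zero.
    """
    return {name: after[name] - before.get(name, 0) for name in after}


def top_contributor(deltas: Dict[str, int]) -> Tuple[str, float]:
    """Return the busiest process and its share of the total (0.0 to 1.0)."""
    if not deltas:
        raise ValueError("no processes to compare")
    name = max(deltas, key=deltas.get)
    total = sum(deltas.values())
    share = deltas[name] / total if total else 0.0
    return name, share


# Hypothetical snapshots taken a few seconds apart on a shared server.
before = {"postgres": 1000, "nginx": 500, "backup": 0}
after = {"postgres": 1500, "nginx": 600, "backup": 9400}

culprit, share = top_contributor(io_deltas(before, after))
```

A host-level metric would show only the combined 10,000 bytes written in that window. The per-process breakdown shows that the hypothetical `backup` job accounts for 94% of it.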
Built-In Metrics vs By Hook Or By Crook
If the software doesn’t provide much visibility into what you decide is important, what lengths are you willing to go to in order to get it? At VividCortex, for example, we’re not willing to compromise on query-level visibility, which is why we use TCP traffic capture and decoding to measure every query a database gets over the network. We don’t need the database’s cooperation to do this, since the packet capture is an OS facility. TCP traffic capture is really hard and we don’t recommend you build your own. But you might be able to use something like DTrace probes to capture your systems’ work if they don’t expose what you want to measure. It’s just a matter of how important it is to you.
This list is not exhaustive, but hopefully it’s a good sample that illustrates some of the tradeoffs.
In my opinion, perhaps the most important set of tradeoffs is how your custom application code is instrumented. This can be expressed in a quadrant diagram of two related continuums:
These are the same two dimensions at play in the principle of convention over configuration. The idea is that you’d like code to be consistently and intuitively instrumented and observable with minimal developer effort, yet have that instrumentation be flexible if you want or need it to be changed. It’s a goal that can be achieved with frameworks in some cases.
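One way to picture convention over configuration for instrumentation is a decorator that names and records metrics by default but lets you override the name when you need to. This is only a sketch: `instrumented` and the in-memory `METRICS` registry are hypothetical stand-ins, not part of any real framework or metrics client.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-in for a real metrics client (hypothetical).
METRICS = defaultdict(lambda: {"count": 0, "total_seconds": 0.0})


def instrumented(func=None, *, name=None):
    """Record call count and total duration for the decorated function.

    Convention: the metric name defaults to "<module>.<qualified name>".
    Configuration: pass name="..." to choose a different metric name.
    """
    def decorate(f):
        metric_name = name or f"{f.__module__}.{f.__qualname__}"

        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return f(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are counted too.
                stats = METRICS[metric_name]
                stats["count"] += 1
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper

    # Support both @instrumented and @instrumented(name="...").
    if func is not None:
        return decorate(func)
    return decorate


@instrumented  # convention: the name is derived from the function itself
def load_user(user_id):
    return {"id": user_id}


@instrumented(name="checkout.charge_card")  # configuration: explicit override
def charge(amount):
    return amount
```

The bare `@instrumented` form costs the developer one line and produces consistently named metrics, while the `name=` override keeps the flexibility to rename a metric without touching the function body.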
This post was excerpted from VividCortex's new ebook "Best Practices for Architecting Highly Monitorable Applications." To download the PDF and read more about designing applications so that they can benefit fully from effective monitoring practices, check out the full ebook here.