Velocity Conference: What Should I Monitor?

Posted by Kyle Redinger on Nov 14, 2013 7:12:00 AM


Click the preview above to see the video of Baron’s Velocity Conference talk that discusses the following:

  • We should measure the system’s work, not just its status. Work is the system’s raison d’être.
  • Generic, dumb tools aren’t enough; we need to know the meaning of the metrics.
  • Some metrics are of central importance. Everything else is for reference only. What are the core metrics?
  • What’s the difference between correlation and cause, and how can we determine it?
  • We need high resolution: one-second at a minimum. One-minute or five-minute resolution is useless.
  • Fault detection should be based on whether work is getting done, again, in high resolution.
  • Graphs have no intrinsic meaning. Don’t stare at a graph and wonder what it means. That’s a backwards process.
  • Abnormality detection isn’t very useful at fine granularity, because systems are constantly abnormal.
  • End-user monitoring is great for detection, but not for diagnosis.
  • There are significant technical challenges to building capable tools, and open-source software currently leaves a lot of gaps that we need to fill.
  • Large-scale modeling and correlation, machine learning, AI, and so on have uses, but it isn’t an either-or choice. We can do a lot better than today’s crude tools without needing that kind of sophistication.
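
To make the points about measuring work and high-resolution fault detection a bit more concrete, here is a minimal Python sketch. It is not taken from the talk: the function names (per_second_counts, stalled_seconds) and the sample timestamps are hypothetical, and it assumes you can record when each request arrives and when it completes. It buckets both into one-second bins and flags any second in which work arrived but nothing finished.

```python
import math
from collections import Counter


def per_second_counts(timestamps):
    """Bucket event timestamps (in seconds) into one-second bins."""
    return Counter(math.floor(t) for t in timestamps)


def stalled_seconds(arrivals, completions, start, end):
    """Return the seconds in [start, end) where work arrived but none completed.

    This is a deliberately crude, hypothetical fault signal: it looks only at
    whether work is getting done, not at status metrics such as CPU or memory.
    """
    arrived = per_second_counts(arrivals)
    completed = per_second_counts(completions)
    return [s for s in range(start, end) if arrived[s] > 0 and completed[s] == 0]


# Made-up sample data: requests keep arriving, but nothing completes
# during second 2.
arrivals = [0.1, 0.5, 1.2, 1.7, 2.3, 2.9, 3.4]
completions = [0.3, 0.8, 1.5, 3.6, 3.7, 3.9]

print(stalled_seconds(arrivals, completions, 0, 4))  # prints [2]
```

Averaged over a one-minute window, the same data would show roughly normal throughput, and the one-second stall would disappear from view.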

Slides are available here.
