Maintaining a SaaS service in the cloud inherently involves nonstop complexity, inevitable fixes, and the threat of issues, especially when that service is growing steadily and changing actively. For VividCortex's product team to reliably pinpoint and head off potential problems, we've needed creative and resourceful DevOps systems. In this post, we'll discuss one of the methods we use to keep potential problems in check: "Chaos and the Fugly Five." (The name might not communicate the full seriousness of its use, but its creativity no doubt comes through.)
In this method, "Chaos" and "The Fugly Five" are actually two separate components, but they overlap and work toward a common goal: the proactive identification and avoidance of big operational problems. As Jay Ennis, VividCortex's Vice President of Product Development, explained, the ultimate benefit of a program such as "CATFF" (short for "Chaos and the Fugly Five") is the continual improvement of scalability, performance, and reliability across the VividCortex platform.
"Chaos" may sound like the last thing a product team would want to intentionally introduce to its environment, but in the way VividCortex uses it, it's essentially synonymous with "practice."
In this case, "chaos" represents a proactive, preventative approach to issue resolution and system maintenance. The term "chaos" comes from Chaos Monkey, a tool that's been used in DevOps circles for years. Its purpose is simple, albeit intimidating: it instigates random failure somewhere inside a given system. This serves two important purposes:
- The systems we design are meant to be able to operate through most forms of failure. The only way to really test that proposition is in production, with a real failure.
- Teams need to be able to find and fix failures in live scenarios. By actually exercising the techniques and plans we have in place for addressing failure, we can familiarize and improve upon them.
Chaos can introduce its failure states at random, within a set of failure modes, without warning. This unpredictability is a key aspect of Chaos' ability to reinforce DevOps principles, as it tests those principles in a scenario as close to a bona fide failure as possible. And, of course, these unpredictable elements are what earned it the name "Chaos."
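To make the idea of "random failure within a set of failure modes" concrete, here is a minimal Python sketch. It is not VividCortex's tooling or Chaos Monkey's API: the failure mode names, the `plan_chaos_event` function, and the one-hour delay window are all assumptions chosen for illustration. It only plans an event and does not break anything.

```python
import random

# Hypothetical failure modes; the post does not list the real ones.
FAILURE_MODES = ("kill_process", "drop_network", "fill_disk")


def plan_chaos_event(targets, rng, max_delay_s=3600.0):
    """Pick a random target, failure mode, and start delay for one event.

    Randomness is confined to a fixed set of failure modes, so the event
    is unpredictable to the team but bounded in what it can do.
    """
    targets = list(targets)
    if not targets:
        raise ValueError("need at least one target")
    return {
        "target": rng.choice(targets),
        "mode": rng.choice(FAILURE_MODES),
        # Random delay so the team cannot anticipate when the failure lands.
        "delay_s": rng.uniform(0.0, max_delay_s),
    }
```

Calling `plan_chaos_event(["api-1", "db-2"], random.Random())` returns a dictionary describing one unannounced event. Passing the random generator in explicitly is a design choice that lets a team replay a past event with a fixed seed when reviewing it.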
Jay explains that Chaos is a response to the simple fact that, when running a SaaS service in the cloud, component failures are inevitable. "A team can design and plan to accommodate failures extensively, but unless that team routinely puts those plans into practice, the service will never be truly robust. In that environment, customers will see some of those failures as outages, missed SLAs, and data loss. We are introducing Chaos as part of our DevOps culture so that failure itself is routine and expected, so that outages, missed SLAs, and data loss are ultimately rare. Our goal is that through the monitoring and automation we develop, most of these component failures are resolved by the system itself. Every Chaos event teaches us something."
And from those lessons, we not only discover whether our existing contingency plans are satisfactory; we also encounter other endemic issues in our system that were previously unknown or overlooked. Chaos can surface these bigger, possibly more subtle issues in the course of a breakage… and if those issues are potentially severe or difficult to address, they get added to an exclusive, not very flattering list: The Fugly Five.
The Fugly Five
Put simply, the Fugly Five is a formalized list of the top five issues our team feels the need to single out and target as top-priority at any given time. In the vast majority of cases, as the DevOps team becomes aware of issues in the services it runs, it addresses them immediately and cleanly. However, some issues can end up stalled in the backlog because the appropriate solution, the resources required to apply that solution, or the right team and system conditions are not yet in place to resolve them effectively.
In a dynamic DevOps environment, where changes and distractions are common, there's the risk of a team becoming desensitized to issues that have persisted in the background for too long. The Fugly Five is our solution to this potential DevOps and workflow problem.
By narrowing down various concerns into a top ranking of five, the team forces itself to address problems that might otherwise stay in the periphery — not actively problematic yet, but threatening to metastasize at any time.
As Jay says about the Fugly Five, "It's a place for the team to agree on the issues that concern them the most and finally put them to rest. The fact that it is limited to five keeps it practical and provides the necessary focus. Once any issue is fixed, there is a list of candidates the team can choose from to repopulate."
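The mechanics Jay describes (a hard cap of five, plus a pool of candidates to draw from once a slot opens) can be sketched as a small data structure. This is an illustrative assumption, not a tool the team is known to use: the `FuglyFive` class and its method names are hypothetical, and defaulting to the oldest candidate when the team names no replacement is a choice made here for simplicity.

```python
class FuglyFive:
    """A capped list of top-priority issues with a waiting pool of candidates."""

    LIMIT = 5  # the cap is what keeps the list practical and focused

    def __init__(self):
        self.active = []      # at most LIMIT issues being targeted now
        self.candidates = []  # issues waiting for a free slot

    def nominate(self, issue):
        """Add an issue to the active list if there is room, else to the pool."""
        if len(self.active) < self.LIMIT:
            self.active.append(issue)
        else:
            self.candidates.append(issue)

    def resolve(self, issue, replacement=None):
        """Retire a fixed issue and repopulate the slot from the candidates.

        The team may name the replacement; otherwise the oldest candidate
        is promoted. Returns the promoted issue, or None if the pool is empty.
        """
        if issue not in self.active:
            raise ValueError(f"{issue!r} is not on the active list")
        if replacement is not None and replacement not in self.candidates:
            raise ValueError(f"{replacement!r} is not a candidate")
        self.active.remove(issue)
        if not self.candidates:
            return None
        if replacement is None:
            replacement = self.candidates[0]
        self.candidates.remove(replacement)
        self.active.append(replacement)
        return replacement
```

For example, after seven nominations the first five issues are active and the last two wait as candidates; resolving any active issue promotes one of the two, so the active list never exceeds five.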
Even in a high-tech environment and industry, old-fashioned list-making can serve a powerful, irreplaceable function.
Iterating Upon "Chaos and the Fugly Five"
As each part of these processes continues and expands, it feeds into the other. A systemic issue we address as part of the Fugly Five might give us new means of addressing a failure state introduced by Chaos, and as Chaos continues to surface new understanding of our team's processes, new Fugly Five candidates may appear. The key, of course, is that we take a proactive role in improving VividCortex's systems. With those methods in place, our DevOps team can sleep better, knowing that effective problem prevention is underway.