People tend to look for sturdy, expansive safety nets when facing risky situations. In database management, that safety net is often the sweeping strategy of a code freeze during hectic periods of peak activity. Most engineering teams have had to impose code freezes at such times, with the noble intention of protecting the integrity of a system, on the logic that if nothing can touch it, nothing can break it.
Unfortunately, a code freeze is a flawed strategy.
During heavy shopping seasons (hi, Black Friday), the most important thing for e-commerce companies is uptime. A classic mentality might be something like, "The shoppers are coming, money will be falling from the sky, and we need to make sure our bucket for catching that money has no holes—everything works now, so don't change anything, and we won't have any problems."
In recent years, though, it's become increasingly well known that an all-out approach like this one can introduce sizable holes and risks of its own. Welcome to the Code Freeze Paradox, where absolute code freezes can't be relied upon to prevent outages.
The Trouble with Freezes
The basic assumption behind a code freeze is that it helps systems maintain their status quo. "No new changes, no unwanted surprises," is the driving mantra. The first problem with code freezes, however, is that the team's understanding of the "status quo" may itself rest on faulty assumptions: the system might have issues that haven't been discovered yet, or hiccups that only present themselves when truly heavy traffic starts.
The logic behind a code freeze might look like this:
- The systems are working now.
- A working system can only break if something changes.
- Restricting change should therefore prevent breakage.
- The riskiest change for an RDBMS is code deployment, so restricting deployment specifically eliminates risky change.
- The downsides of a freeze are less costly than a potential outage.
However, systems are dynamic things, even when code deployments cease. A freeze doesn't stop the routine activity of production: queries still execute, logs are still written, and so on.
And, of course, if you've initiated a code freeze in a system where a problem exists but has yet to manifest, the freeze can backfire and allow issues to accumulate, creating greater risk of an outage as traffic continues to surge. In a post on his own blog, VividCortex’s CEO Baron Schwartz discussed "Why Deployment Freezes Don't Prevent Outages" in depth. That argument still stands, as faith in code freezes melts more and more.
Warmer than a Freeze: Code Slush
If a total freeze is too extreme a measure, is there a less frigid alternative?
Enter the “Code Slush”—rather than completely banning changes to the system, a slush prescribes more cautious, measured deployments. With a carefully determined amount of wiggle room, engineering teams can address issues that are fixable with strategic bouts of intervention, without risking the system's fundamental stability.
Fluidity, the ability to shift the system with bits of code while most of it remains secure and solid: these are the virtues of a slush. Etsy wrote about this on their Code as Craft blog recently, noting that a slush's moderate approach has been successful for them since as long ago as 2008, and slushes are something we've embraced too.
But a code slush isn't just a code freeze, except less. Like many proper approaches to the complex task of database management, a code slush demands detail-oriented planning and intimate knowledge of the system in question.
To make a code slush successful, teams should note a few key considerations.
During a slush, the more automated (scripted) a deployment is, the safer it'll prove to be. A regular, automated deployment has the advantage of having been executed many times, which makes its effects better known and mitigates the risk of unexpected outcomes. By sticking with regularly scheduled deployments, teams can avoid the much more dangerous scenario of an emergency, one-time code push, the effects of which are untested, possibly unknown, and inherently much riskier.
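As a rough illustration of that rule of thumb, here is a minimal Python sketch of a "slush gate" that lets routine, scripted deployments through and routes everything else to a human. The `Deployment` fields, the `slush_gate` name, and the threshold of 10 prior runs are all hypothetical; they are assumptions for the example, not anything prescribed by Etsy or VividCortex.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_REVIEW = "needs_review"


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of a proposed change during a slush."""
    name: str
    automated: bool              # goes through the usual scripted pipeline
    prior_successful_runs: int   # how often this same procedure has run cleanly


def slush_gate(deploy: Deployment, min_prior_runs: int = 10) -> Decision:
    """Allow only well-exercised, automated deployments without extra review.

    The threshold is an illustrative assumption; pick one that reflects how
    well you understand your own system.
    """
    if deploy.automated and deploy.prior_successful_runs >= min_prior_runs:
        return Decision.ALLOW
    # One-off or manual pushes are the risky case, so a person should look first.
    return Decision.NEEDS_REVIEW
```

A stricter slush would raise `min_prior_runs`; a looser one would lower it, which is one simple way to express the "level of slush" as a single knob.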
A performance monitoring solution can help isolate problems in development, before they hit production, making the deployments during a slush more reliable. This adds a layer of risk mitigation to your dev process and can give a team confidence in their knowledge of what kind of effect a deployment might have beyond the most obvious impacts.
The correct "level of slush" for your organization depends on the specifics of your systems and how well you understand them. As Etsy notes in their post, their slushes have gotten much more flexible over the years, and "in the early days, [a slush] was far more strict, in part because Etsy’s infrastructure was not as robust and mature as it is today." At first, those slushes may have looked more like a traditional, total freeze, as the system and Etsy's understanding of it were still maturing. Even a code slush introduces some risk, after all.
Stay aware of how changes during a code slush or a critical peak might cause things to queue for later; disk space in particular can ultimately come back to bite you. VividCortex plans to offer predictive alerts on this sort of scenario in the future. Not many systems currently offer effective Time-to-Live alerts (stating, "Disk X has Y days until it fills, if things continue as they have been"), but knowing how much longer your system's current state is viable can prove extremely valuable. Rather than reactively scrambling to firefight problems, teams can bypass costly issues altogether, if they know where to look.
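To make the "Disk X has Y days until it fills" idea concrete, here is a minimal Python sketch that fits a straight line to recent disk-usage samples and extrapolates to capacity. It is not how VividCortex or any other product computes such alerts; the `days_until_full` function and the sample format are assumptions for illustration, and a linear trend is the simplest possible reading of "if things continue as they have been."

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

Sample = Tuple[datetime, float]  # (timestamp, bytes used); hypothetical format


def days_until_full(samples: Sequence[Sample], capacity_bytes: float) -> Optional[float]:
    """Estimate days until the disk fills, using a least-squares linear trend.

    Returns None when usage is flat or shrinking, since no fill date is implied.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")

    t0 = samples[0][0]
    xs = [(ts - t0).total_seconds() / 86400.0 for ts, _ in samples]  # days
    ys = [used for _, used in samples]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must cover more than one point in time")

    # Slope of the fitted line: growth in bytes per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope <= 0:
        return None

    # Project forward from the most recent observation.
    remaining = max(capacity_bytes - ys[-1], 0.0)
    return remaining / slope
```

Peak-season traffic rarely grows linearly, so an estimate like this is best treated as an early warning to investigate rather than a precise deadline.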
Watch out for latency increases that can show up after you conclude a strict code slush. When you start "thawing" a system, many items that had been "slushed" can start queueing, get bundled up, and deploy en masse. Changes you made a long time ago, and have since forgotten, can go out on the coattails of other changes that you now intend to ship. Latency can spike, and you should be ready for it. A monitoring platform like VividCortex can help by comparing total query time in the Profiler. By looking at top queries, you can catch the ones that are most time consuming and see whether their latency has increased. In the image below, the number one query in our platform has grown day over day, across the two displayed time regions, by roughly 14 to 15 percent in total time consumed. Its average latency has gone from 3.2 milliseconds to 3.56 milliseconds, which isn't a significant change, but if you do see something significant, you can find key information that needs your attention.
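If you export per-query totals for two time windows, the same before-and-after comparison can be sketched in a few lines of Python. This is not the VividCortex Profiler or its API; `QueryStats`, `pct_change`, `latency_regressions`, and the 20 percent threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class QueryStats:
    """Hypothetical per-query totals for one time window."""
    total_time_ms: float
    count: int

    @property
    def avg_latency_ms(self) -> float:
        return self.total_time_ms / self.count if self.count else 0.0


def pct_change(before: float, after: float) -> float:
    """Percentage change from before to after."""
    if before == 0:
        return float("inf") if after > 0 else 0.0
    return (after - before) / before * 100.0


def latency_regressions(
    before: Dict[str, QueryStats],
    after: Dict[str, QueryStats],
    threshold_pct: float = 20.0,
) -> List[Tuple[str, float, float]]:
    """Return (query, total_time_change_pct, latency_change_pct) for queries
    whose average latency grew by more than threshold_pct, ordered by total
    time consumed in the later window (most expensive first)."""
    flagged = []
    for query, old in before.items():
        new = after.get(query)
        if new is None:
            continue  # query did not run in the later window
        latency_change = pct_change(old.avg_latency_ms, new.avg_latency_ms)
        if latency_change > threshold_pct:
            total_change = pct_change(old.total_time_ms, new.total_time_ms)
            flagged.append((query, total_change, latency_change))
    flagged.sort(key=lambda row: after[row[0]].total_time_ms, reverse=True)
    return flagged
```

With the figures from the text, `pct_change(3.2, 3.56)` comes out at about 11 percent, which stays under the illustrative 20 percent threshold and so would not be flagged.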
With several days left in 2016's holiday shopping season, the virtue of code slushes can still play a role for any e-commerce organization coping with heavy traffic.
But even beyond predictable retail highs, every industry has seasonal spikes, when planned slushes are a smart strategy, if implemented wisely. Overall, it's important to keep in mind that databases are dynamic structures that require nuanced solutions—where icy, absolute methods might backfire, something more flexible and fluid can make all the difference. Don't get stuck in a critical or peak business season with the inability to adjust your system as spryly as necessary.
If you'd like to see how VividCortex can assist in something like a code slush, don't hesitate to sign up for a free trial today.