It's Inevitable. Your Database Will Fail

Posted by Matt Culmone on Jan 23, 2018 9:07:24 AM

Will my database fail?

Databases fail. No one can promise 100% uptime; it’s impossible. Whether the database is large, small, on-premises, or cloud-based, all have the potential to fail: transactional errors, system crashes, out-of-memory errors, or out-of-disk-space errors. Sometimes they fail suddenly, and sometimes they just can’t cope with growing demand and “fail slowly” over a period of time.

The list of reasons that cause database failure is long and includes:
  • Application code changes
  • Workload changes as the user base grows or shifts
  • Hardware failures or changes (Spectre and Meltdown patches, take a bow)
  • Database version upgrades
  • Configuration changes made to accommodate a new architecture or improve performance
  • Configuration assumptions made for an old workload that no longer hold

What happens when a database fails? 

  • Data loss
  • Loss of productivity
  • Other systems can be negatively affected
  • Poor user experience as entire systems fail slowly

What can you do? 

Be prepared. Failures are caused by changes, some that you control and others that you don’t. It's not about preparing for the apocalypse; it's about running the best possible application every day. Optimization is an ongoing process that should never stop. So what can you do?

If you're making changes that could result in failure, we suggest you:

  • Set up monitoring on all essential systems
  • Test in stages
  • Do gradual rollouts
  • Have a plan to roll back if changes cause problems
  • Back up and snapshot systems regularly
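As a minimal illustration of that last point, here is a sketch of an automated backup-and-verify step. It assumes a SQLite database file for the sake of a self-contained example; for MySQL or PostgreSQL the same idea applies with your engine's native tooling (mysqldump, pg_dump, snapshots).

```python
import sqlite3

def backup_and_verify(src_path: str, dest_path: str) -> int:
    """Copy the live database to a backup file, then confirm the backup
    opens and passes an integrity check. Returns the number of tables
    captured so callers can alert on a suspiciously empty backup."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # online, page-by-page copy of the live database
        status = dest.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            raise RuntimeError(f"backup failed integrity check: {status}")
        tables = dest.execute(
            "SELECT count(*) FROM sqlite_master WHERE type='table'"
        ).fetchone()[0]
    finally:
        src.close()
        dest.close()
    return tables
```

The key idea is that the backup step itself verifies its own output, so a silently broken backup raises an alert instead of being discovered during a crisis.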

If you expect things outside of your control to change and potentially cause failure, we suggest you:

  • Back up and archive regularly and consistently
  • Test backups by actually restoring them
  • Practice recovery strategies by introducing controlled failures
  • Set up alerts
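To make the "set up alerts" point concrete, here is a minimal threshold-alert sketch. The metric names and limits are illustrative assumptions, not values from any particular monitoring product; the point is that each essential signal has an explicit limit and crossing it produces an actionable message.

```python
# Illustrative thresholds; tune these to your own environment.
ALERT_THRESHOLDS = {
    "disk_used_pct": 85.0,       # running out of disk is a classic failure mode
    "replication_lag_s": 30.0,   # stale replicas degrade reads and failover
    "connections_used_pct": 90.0 # connection exhaustion locks out the app
}

def check_alerts(metrics: dict) -> list:
    """Return a human-readable alert for every metric that crosses its
    threshold; an empty list means all clear. Missing metrics are
    skipped rather than treated as failures."""
    alerts = []
    for name, limit in ALERT_THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts
```

A real system would feed this from collected metrics on a schedule and route the messages to a pager or chat channel, but the shape is the same: explicit thresholds, checked continuously, long before users notice.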

Real-world example: One of our long-standing customers, a high-profile online retailer, recently shared their story. A software developer made a structural change that caused the system to slow down, and during a surge of holiday-season shopping the system crashed. The team was notified immediately and used the VividCortex Profiler to compare environments and pinpoint the offending change out of thousands of queries. They rectified the issue and were back up and running in minutes.

Another example from our own environment: at VividCortex we are constantly ingesting metrics data from our customers’ monitored environments. If our database inserts become too slow, our data pipeline backs up and other systems are affected; the end result is visible to users as delayed data in dashboards. Catching up becomes increasingly difficult the longer the performance degradation drags on. The monitoring we have in place lets us identify the root cause of a problem before it blows up and has cascading effects.
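The "fail slowly" pattern above can be caught early with something as simple as a rolling average over insert latency. This sketch is a toy model of the idea, not VividCortex's implementation; the window size and 3x multiplier are illustrative assumptions.

```python
from collections import deque

class InsertLatencyWatch:
    """Track a rolling window of insert latencies and flag sustained
    degradation before the pipeline backs up. Single spikes are absorbed
    by the window; only a sustained drift trips the alarm."""

    def __init__(self, baseline_ms: float, window: int = 5, factor: float = 3.0):
        self.baseline_ms = baseline_ms          # expected healthy latency
        self.samples = deque(maxlen=window)     # most recent N samples
        self.factor = factor                    # how far drift may go

    def record(self, latency_ms: float) -> bool:
        """Record one sample; return True when the rolling average has
        drifted past factor * baseline, i.e. time to page someone."""
        self.samples.append(latency_ms)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.factor * self.baseline_ms
```

For example, with a 2 ms baseline a single 20 ms insert does not trip the alarm, but two in a row push the rolling average past the 6 ms limit and return True.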


Conclusion

Be ready. Things are going to break and it’s necessary to be prepared for your users, your business and your team. Invest in the necessary tools for high availability, monitoring and backups. Start a free trial here to see how VividCortex can help you today!

 
