Part 1: Why Chaos Engineering can work for your business

Everyone’s talking about it, but hardly anyone is doing it. Why is chaos engineering so rarely implemented? The truth is that, done incorrectly, chaos engineering is costly, causes downtime, and can damage your business. So why do we think business leaders like Amazon and Netflix will be doing even more of it in the future?

Simply put, because it works like nothing else to make your business systems strong. But before we get into the nitty-gritty, let’s explore the basic concepts of chaos engineering, also known as resilience engineering.

We see chaos whenever an online system fails and frustration runs high. Today’s world depends on the cloud, and the cloud depends on its systems engineers. Unfortunately, we can’t always prevent failure. The more complex a system is, the more likely failure becomes, and the longer recovery typically takes. We simply cannot plan ahead for every eventuality. What we can do is plan for different outcomes. Call it a disaster recovery plan, or maybe it’s just the intuition you develop when you’re really in touch with the big picture.

We should practice breaking things

When we respond to imaginary chaos, we can only imagine how we’ll fix things. But what happens when chaos hits in real life, and how do we practice fixing things before it does? Chaos engineering suggests that only by practicing how to break real systems do we discover what we don’t yet know how to fix.

When designing and building a new system, today’s systems engineers plan deliberate, regular moments of failure and response: for instance, turning off the power or closing a data centre, then dealing with the resulting chaos. This helps them develop better systems faster and more effectively. The message? We must design with chaos and failure in mind, because doing so makes systems more resilient.
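
To make the idea of a deliberate failure concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular team's tooling: the names with_chaos, fetch_price and price_with_fallback are hypothetical, and the "outage" is simulated inside a single process rather than by switching off real infrastructure. The wrapper makes a chosen fraction of calls fail on purpose, so we can check whether the calling code degrades gracefully.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real outage during an experiment."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency outage")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Hypothetical stand-in for a call to a remote service.
    return {"widget": 9.99}[item]


def price_with_fallback(fetch, item, default=None):
    """The response under test: fall back to a default instead of crashing."""
    try:
        return fetch(item)
    except InjectedFailure:
        return default


rng = random.Random(42)  # fixed seed so the experiment is repeatable
flaky_fetch = with_chaos(fetch_price, failure_rate=0.3, rng=rng)

results = [price_with_fallback(flaky_fetch, "widget") for _ in range(1000)]
failures = results.count(None)
print(f"{failures} of {len(results)} calls hit an injected failure and fell back")
```

If price_with_fallback had no except clause, the first injected failure would crash the loop, which is exactly the kind of gap such an experiment is meant to surface before a real outage does.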

While it’s true that the cloud promises high availability and better fault tolerance, the fact is that no single component can guarantee 100% uptime. Simplicity is no longer the norm in software systems, and it will be even less so in future. Today, a growing number of microservices interact faster than ever. Architectures are larger and more distributed than before. A typical production system runs day and night, with its components communicating constantly.
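
A rough back-of-the-envelope calculation shows why more components mean more failure. The sketch below rests on two simplifying assumptions that are ours, not measurements from any real system: every request needs all of the services in a chain, and the services fail independently of one another. The 99.9% figure and the function name chain_availability are hypothetical, chosen only for illustration.

```python
def chain_availability(per_service, count):
    """Availability of a request path that needs `count` services,
    each up `per_service` of the time, assuming independent failures."""
    return per_service ** count


for count in (1, 10, 50):
    availability = chain_availability(0.999, count)
    print(f"{count:>2} services in the path: {availability:.2%} available")
```

Under these assumptions, ten services at 99.9% each leave the whole path available about 99.0% of the time, and fifty leave it at roughly 95%. Real systems use redundancy and fallbacks to do better than this, but the direction of the trend is the point.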

Identify failures before they become outages

All this complexity means system failures are harder to predict and contain. Failure cannot be eliminated entirely, and chaos cannot be ruled out. Failures are increasingly unexpected in nature, and only real ones can teach us how to stop them from happening again. By implementing chaos engineering, we can identify failures before they become outages.

Curious to go deeper? In this podcast, Nora Jones, a senior software engineer on Netflix’s Chaos Team, talks about what chaos engineering means today. She covers what it takes to build a practice, how to establish a strategy, how to define the cost of impact, and the key technical considerations when adopting chaos engineering.

This post by Node.Kitchen is part of an ongoing series on chaos engineering. If you want to put these practices to work in your own business, we’d love to help. Let’s talk about what your business needs.