Part 2: How Chaos Engineering makes businesses more resilient

Maybe it is time to think beyond "chaos" engineering. As Vilas Veeraraghavan, Director of Engineering at Walmart Labs, puts it, "We are not trying to practice inserting chaos in a system. We are trying to practice how resilient we can be when there is chaos in a system."

A healthy system is resilient because it integrates feedback and learns from past changes, irregularities and losses.  It adapts to disturbances.  The experimental tests and studies run by Chaos Engineers at companies like Google, Microsoft, Amazon, and many others, "build confidence in the system's capability to withstand turbulent conditions in production" (principlesofchaos.org).  

Diagram by The Interlock

Failure is normal

We tend to think that a normal system is one that works perfectly, but failure happens all the time. A failure does not have to be a total failure, especially in today's increasingly complex distributed systems and microservice architectures.

Today, total disaster may be as unrealistic an idea as total management. Systems engineers are now speaking of 'going into partial failure mode'. They adapt their response in practice to integrate the constant presence of failure within any complex system.

Failure is not abnormal or wrong, or a dead end. It is only a degraded state, one for which systems engineers can plan adaptations and mitigations. Resilience (or Chaos) Engineering practices adapting, in the moment of crisis, to real circumstances that are likely to recur. Knowing how to recover builds confidence. Businesses achieve and measure resilience today through improved system stability, and through the service reliability and customer reputation that result from it.

Simply put,

"Chaos Engineering is an experiment to ensure that the impact of failures is mitigated" - Adrian Cockcroft, pioneering Netflix Cloud Architect, in his AWS re:Invent 2018 talk on chaos engineering

Beyond attempts to merely manage disaster and emergency, "mitigation" aims to promote constant readiness.  Introducing Chaos Engineering into your company means you can build a failure mitigation plan.  

Train to be ready

What's your company's failure mitigation plan? When a crisis hits your cloud computing systems, what will you do?

Currently, many disaster recovery plans for cloud infrastructures and architectures are reactive. We are alerted when an error happens, and we respond either by spinning up new servers and auto-healing the infrastructure, or by rolling back to a stable state. In such cases, traffic is likely to be diverted to other regions until the affected region is stable again.
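To make the reactive pattern concrete, here is a minimal sketch in Python of how such a plan might pick its steps once an alert arrives. Everything in it is hypothetical and for illustration only: the `RegionStatus` fields, the function name, and the wording of the actions are assumptions, not part of any real cloud provider's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionStatus:
    """Hypothetical snapshot of one region at the moment an alert fires."""
    name: str
    healthy_servers: int
    required_servers: int
    last_deploy_stable: bool


def reactive_recovery_actions(status: RegionStatus) -> List[str]:
    """Return the recovery steps for one alert, in the order they would run."""
    has_capacity = status.healthy_servers >= status.required_servers
    if has_capacity and status.last_deploy_stable:
        return []  # nothing is wrong, so a reactive plan does nothing

    # Protect users first: send traffic elsewhere while the region recovers.
    actions = [f"divert traffic away from {status.name}"]

    # A bad release is undone by rolling back to a stable state.
    if not status.last_deploy_stable:
        actions.append("roll back to last stable release")

    # Lost capacity is restored by spinning up replacement servers.
    missing = status.required_servers - status.healthy_servers
    if missing > 0:
        actions.append(f"spin up {missing} replacement servers")

    return actions
```

Note that nothing in this sketch runs until something has already gone wrong, which is exactly the limitation of a purely reactive plan.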

In most cases, these disaster recovery plans are untested, or are tested only under circumstances that avoid impacting the full functioning of a production system.  This means that operations teams never learn to be ready for unexpected system failures.

In conventional testing, engineers aim to ensure that units function normally, according to expectation. Alternatively, they test the integrations, checking against expected normal results.  This approach breaks down at the scale of today's cloud computing.  When you have thousands of services running at the same time, staging and simulation no longer work.  You need a different kind of plan, one that integrates the unknown.

Companies that implement Chaos Engineering train to be ready by testing for unexpected failures during actual operations.  In doing so, they actively build long-term infrastructure readiness. Through carefully planned chaos testing, they discover small points of failure that can be eliminated. They are moving towards auto-healing infrastructures and reducing additional strain. This gives them the peace of mind of knowing they can handle disaster.

Failure testing should be rigorous and systematic.  If done right, it will shift your production culture from fear that something might go wrong, and avoidance of anything that might cause it, to the confidence that comes from ensuring that (almost) everything goes right.
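As an illustration of what "rigorous and systematic" can mean, the following Python sketch runs one chaos experiment against a toy, in-memory service. It is a simulation under stated assumptions, not a production tool: the `Service` class, the 99% success threshold, and the function names are all hypothetical. The shape of the experiment (measure the steady state, inject a failure, measure again, always restore) follows the general approach described at principlesofchaos.org.

```python
import random


class Service:
    """Toy stand-in for a replicated service (hypothetical, in-memory only)."""

    def __init__(self, replicas: int):
        self.replicas = [True] * replicas  # True means the instance is up

    def kill_instance(self, index: int) -> None:
        self.replicas[index] = False

    def heal(self) -> None:
        self.replicas = [True] * len(self.replicas)

    def handle_request(self) -> bool:
        # A request succeeds as long as at least one instance is still up.
        return any(self.replicas)


def success_rate(service: Service, requests: int = 100) -> float:
    """Measure the steady state: the fraction of requests that succeed."""
    succeeded = sum(service.handle_request() for _ in range(requests))
    return succeeded / requests


def run_experiment(service: Service, instances_to_kill: int,
                   threshold: float = 0.99, seed: int = 0) -> dict:
    """Inject instance failures and check the steady-state hypothesis."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable

    baseline = success_rate(service)
    if baseline < threshold:
        # Never inject failure into a system that is already unhealthy.
        return {"ran": False, "baseline": baseline}

    victims = rng.sample(range(len(service.replicas)), instances_to_kill)
    try:
        for index in victims:
            service.kill_instance(index)
        during = success_rate(service)
    finally:
        service.heal()  # always restore the system, even if the check fails

    return {"ran": True, "baseline": baseline, "during": during,
            "hypothesis_held": during >= threshold}


if __name__ == "__main__":
    print(run_experiment(Service(replicas=3), instances_to_kill=1))
```

In this toy model, losing one of three instances leaves the hypothesis intact, while losing all three does not. A real experiment would replace `Service` with actual infrastructure and limit the blast radius far more carefully.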

To hear more about implementing a Resilience Engineering plan, watch this video.

This post by Node.Kitchen is part of an ongoing series on chaos engineering.  If you want to implement these tools in your company or business first-hand, we’d love to help.  Let’s talk about what your business needs.