Part 3:  The innovative history of Chaos Engineering

Chaos Engineering accompanies the historical shift to cloud computing and the culture shifts this has required of software systems engineers. What started as a brave experiment is now best practice at big tech firms like Google and Facebook. This relatively young, truly interdisciplinary 21st-century field draws on everything from the scientific method, systems theory and behavioural theory to gaming culture and firefighting.

Master of Disaster

Walking into the backup system operations of Amazon's Seattle headquarters in the mid-2000s, you might have run into software engineer Jesse Robbins, official job title: “Master of Disaster”. During those years, when its international data centres were expanding rapidly, Amazon was already experiencing the growth in system complexity that many companies are seeing today, along with the failures that come with it.

Robbins had been training for active duty as a full-time firefighter. To qualify, you need about 600 hours of training. By some reports, once you are ready to fight real fires, you still spend about 80% of your time training and keeping fit. This level of practice in disaster management keeps firefighters in regular touch with the full range of possible emergency responses. Robbins calls this level of knowledge 'intuition', and found it lacking in regular systems engineering practices.

With GameDay, he transformed his workplace at Amazon into a gamified experimental firefighting unit.  He would light proverbial fires in Amazon’s backup systems, forcing teams beyond the safe routines of the fire drill concept.  Data centres were closed, electricity shut down, machines simply unplugged. Gradually, through participating in each deliberate injection of failures, Amazon staff learnt to be disaster-ready and recover faster.  They built a more reliable system and pioneered a new approach to software engineering.  

Tools for Resilience

In 2011, Netflix engineers unleashed 'Chaos Monkey' on their operations. It was the first automated chaos tool, named after the idea of a wild monkey let loose in a server farm, randomly destroying things. Chaos Engineering relies on chaos tests: single, controlled experiments that inject a failure in order to systematically observe and analyse how the system responds. Chaos tools automate this process.
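To make the idea of a chaos test concrete, here is a minimal sketch in Python. It is not Chaos Monkey's code: `ToyCluster`, `run_chaos_test` and the instance names are hypothetical, and the "cluster" is just an in-memory set. It only illustrates the shape of one controlled experiment: check the steady state, inject a random failure, observe, and roll back.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for a pool of service instances."""
    instances: set = field(default_factory=lambda: {"i-1", "i-2", "i-3"})
    min_healthy: int = 2

    def is_steady(self) -> bool:
        # Steady state: enough instances remain to serve traffic.
        return len(self.instances) >= self.min_healthy

    def terminate_random_instance(self, rng: random.Random) -> str:
        # The injected fault: remove one instance chosen at random.
        victim = rng.choice(sorted(self.instances))
        self.instances.remove(victim)
        return victim

    def restore(self, instance_id: str) -> None:
        # Rollback: bring the instance back once the experiment is over.
        self.instances.add(instance_id)


def run_chaos_test(cluster: ToyCluster, rng: random.Random) -> dict:
    """Run one controlled experiment: check, inject, observe, roll back."""
    if not cluster.is_steady():
        # Never start an experiment on a system that is already unhealthy.
        return {"ran": False, "victim": None, "survived": None}
    victim = cluster.terminate_random_instance(rng)
    survived = cluster.is_steady()
    cluster.restore(victim)
    return {"ran": True, "victim": victim, "survived": survived}


if __name__ == "__main__":
    print(run_chaos_test(ToyCluster(), random.Random(42)))
```

A real tool would terminate actual instances and measure real user-facing metrics, but the loop stays the same: one hypothesis, one fault, one observation.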

Netflix Chaos Monkey logo

Netflix runs on a large multi-regional system, delivering streaming video on demand to customers worldwide.  As an early cloud adopter, operating via a backend on Amazon Web Services, it had to develop new tools to test the fault tolerance of this kind of architecture, while keeping customers happy.

The Chaos Monkey system resilience tools have evolved into a 'simian army', a whole library of 'monkeys' or bots that can be dropped into your system. Each monkey offers a different model of disturbance. For instance, Chaos Gorilla takes out an entire availability zone, and its bigger sibling Chaos Kong an entire geographical region. Latency Monkey causes communication delays. They deliberately and randomly attack Netflix's own system to cause failures that the team has to respond to and fix in real time.
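A different model of disturbance, in the spirit of Latency Monkey, can be sketched in a few lines of Python. This is an illustration only, not Netflix's implementation: the `inject_latency` decorator and the `fetch_title` function are hypothetical names, and the delay is applied inside a single process rather than on real network traffic.

```python
import random
import time
from functools import wraps


def inject_latency(delay_seconds: float, probability: float, rng: random.Random):
    """Wrap a function so that a share of its calls is delayed before running."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1), so probability=1.0 always delays
            # and probability=0.0 never does.
            if rng.random() < probability:
                time.sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_seconds=0.05, probability=1.0, rng=random.Random(0))
def fetch_title(video_id: int) -> str:
    # Hypothetical service call; a real one would cross the network.
    return f"title-{video_id}"
```

The caller still gets the right answer, only later. That is the point of this kind of test: it shows whether timeouts, retries and fallbacks elsewhere in the system cope with a slow dependency.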

Through repeated testing, Netflix can now run chaos tests while its customers are watching movies, with hardly any effect on operations. Some monkeys do not cause chaos but assist, like Doctor Monkey (runs automated health checks) and Janitor Monkey (cleans up unused resources). Conformity Monkey checks instances against a set of rules and reports any that do not conform. Through the simian army, Netflix has engineered a system so resilient that it can regulate itself, a property also known as auto-healing.

Failure-as-a-Service

Netflix also shares its chaos tools. The simplest way to start your own Chaos Engineering experiments is 'Chaos Toolkit', an independent open-source project for which new modules continue to be developed and made freely and publicly available. If you're more advanced, you can download Chaos Monkey from GitHub and work through its documentation. Going beyond Chaos Monkey, the platform 'Gremlin' is a fully hosted "failure-as-a-service" solution that lets engineers experiment safely. It recently went independent as a startup.

As tightly coupled systems continue to grow in complexity, and customers demand ever higher availability, the resilience engineering approach, along with its existing tools, will become the norm. Since the 2000s, Chaos Engineering has grown from an experiment into a global movement, with its own community, an official web page setting out its principles, and regular meet-ups in cities worldwide. It has its own conferences, channels, and startups selling "resilience services".

What's next? At ChaosConf 2019, Crystal Hirschorn, a VP of Engineering at Condé Nast, speculated about the future tools of Chaos Engineering and the unknown unknowns.

This post by Node.Kitchen is part of an ongoing series on chaos engineering.  If you want to implement these tools in your company or business first-hand, we’d love to help.  Let’s talk about what your business needs.