Part 4: The scientific process of Chaos Engineering
Chaos Engineering becomes more necessary the faster you develop. Left undiscovered, your new system's weaknesses might only surface once they trigger an outage in production. Implementing Chaos Engineering instead reduces those unknown costs over the long term.
Chaos testing simulates a situation in which failure happens, but it takes place under the conditions of everyday operations, whereas normal testing happens during build and compile activities. Chaos testing also probes factors beyond your knowledge or control. Perhaps the most important difference between chaos testing and normal testing is that it includes people: it trains and prepares them for the failures they will be required to fix in person at three in the morning while the rest of the world is sleeping.
Failure is erratic. How to test rigorously.
Breaking your own system may sound crazy, but Chaos Engineering is a scientific practice with a precise engineering process. The same scientific rules apply to any experimental scenario, whether it ends in failure or success.
A test determines whether something known happens or not. As Nora Jones, a Senior Chaos Engineer at Netflix, puts it: "If x happens, can you expect y to happen?" Failure is erratic, so you have to be rigorous if you want to study it. This kind of chaos testing process maximises the safety of your operations while allowing you to discover unknown threats.
Phases of Chaos Engineering
There are four basic steps to grasp when thinking about introducing Chaos Engineering to your company.
- Identify a measurable output that indicates behavior, define "steady state".
- Form a hypothesis.
- Simulate real-world events.
- Disprove your hypothesis.
Beyond these basics, the phases of Chaos Engineering differ in practice, as the diagrams above and below show. Ultimately you will have to find a practice that works for your situation.
Before you can start, you need to identify a stable or steady state that indicates normal behaviour. You can only test for failure and instability if you have a steady state to compare against. For instance, overall system stability can be determined by measuring throughput, error rates, latency percentiles and other factors over a short period of time.
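As a concrete sketch, here is one hedged way such a steady-state summary might be computed from a window of request samples. The `(latency_ms, ok)` tuple shape is a hypothetical stand-in for whatever your monitoring system actually exposes:

```python
def steady_state(samples):
    """Summarise a window of request samples into steady-state metrics.

    `samples` is a list of (latency_ms, ok) tuples -- a hypothetical
    shape; a real monitoring system would expose something similar.
    """
    latencies = sorted(s[0] for s in samples)
    errors = sum(1 for s in samples if not s[1])

    def pct(p):
        # nearest-rank percentile over the sorted latencies
        idx = min(len(latencies) - 1, int(p / 100 * len(latencies)))
        return latencies[idx]

    return {
        "throughput": len(samples),           # requests in this window
        "error_rate": errors / len(samples),  # fraction of failed requests
        "p50_ms": pct(50),
        "p99_ms": pct(99),
    }

# One short observation window: four fast successes, one slow failure.
window = [(12, True), (15, True), (11, True), (250, False), (14, True)]
print(steady_state(window))
```

Whatever numbers come out of a window like this become your baseline: any experiment is judged by whether these metrics stay within their normal range.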
1. Form a hypothesis around a steady state of behaviour: When designing a test, first ask yourself: "What do you think should happen?"
2. Plan your experiment: How can you recreate this failure safely without impacting users? This could happen on various layers.
- Simulate an attack to kill a process, reproducing conditions for an application or dependency crash
- Stop or reboot the host operating system, simulating the loss of cluster machines
- Alter the host’s system time, recreating time-related changes, e.g. the switch to daylight saving time
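The first of those layers can be sketched in a few lines: spawn a stand-in child process and kill it abruptly, the way a chaos tool would crash a real service. This assumes a POSIX system with a `sleep` binary available:

```python
import signal
import subprocess

# Stand-in for the service under test (hypothetical victim process).
victim = subprocess.Popen(["sleep", "60"])

# Simulate a sudden crash: no shutdown hooks, no cleanup.
victim.send_signal(signal.SIGKILL)
victim.wait()

# A negative returncode is the number of the signal that killed the process.
print(victim.returncode)  # -9 on POSIX
```

In a real experiment the interesting part is not the kill itself but what happens next: does a supervisor restart the process, does traffic fail over, do alerts fire?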
3. Minimize the blast radius: Start small. "What is the least disruption I can cause as an experiment to teach me about the system?"
"While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments are minimized and contained."
(We'll talk about this in detail in the final post in this series, "How to survive causing chaos").
4. Run the experiment and compare the outcome to the expectations (verify): Just observe and document what happens; analysis can happen later. Record your expectations. ("It should explode.") Record your failures. ("It didn't explode.") Record your expected non-failures. ("This time, it should not explode.") Record unexpected failures. ("It should not have exploded, but it did.")
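A minimal sketch of that record-keeping, with hypothetical experiment names and outcomes: each run logs the hypothesis next to the observation, and the comparison is made only afterwards:

```python
experiment_log = []

def record(name, expected, observed):
    """Log an observation next to the hypothesis; analysis happens later."""
    experiment_log.append({
        "experiment": name,
        "expected": expected,
        "observed": observed,
        "hypothesis_held": expected == observed,
    })

# Two hypothetical runs: one hypothesis holds, one is disproven.
record("kill app process", expected="traffic fails over",
       observed="traffic fails over")
record("reboot host", expected="cluster rebalances",
       observed="requests time out")

# Every disproven hypothesis is a bug found before it found you.
bugs = [e for e in experiment_log if not e["hypothesis_held"]]
print(len(bugs))  # 1
```

Keeping the raw log separate from the analysis is the point: during the experiment you only observe, and the verdict on each hypothesis comes later.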
5. Celebrate! "It did not work as it should." You found a bug! Success! Increase the blast radius and begin again at #1.
As failure scenarios can be reliably repeated, documented, and addressed through Chaos Engineering, these phases are in fact a cyclical process. After verifying your results, you will want to improve or strengthen the infrastructure and return to a steady behaviour state, to make room for a new hypothesis. To save cost and labour, experiments, their analysis, and mitigating failures should be automated to run continuously. Our last post mentions platforms that make this possible.
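The cycle above can be sketched as a loop. Everything here is stubbed: in a real setup `measure_steady_state` would read from your monitoring, `inject_fault` would call your fault-injection tooling, and the hypothesis check would compare real metrics:

```python
def measure_steady_state():
    return {"error_rate": 0.01}  # stub: read from monitoring

def inject_fault(blast_radius):
    pass  # stub: e.g. kill `blast_radius` instances

def hypothesis_holds(baseline, during):
    # Hypothetical acceptance criterion: errors at most double the baseline.
    return during["error_rate"] <= baseline["error_rate"] * 2

def chaos_cycle(max_radius=3):
    blast_radius = 1
    while blast_radius <= max_radius:
        baseline = measure_steady_state()
        inject_fault(blast_radius)
        during = measure_steady_state()
        if not hypothesis_holds(baseline, during):
            return ("bug found", blast_radius)  # fix it, then start over
        blast_radius += 1  # celebrate, then widen the blast radius
    return ("system tolerated all faults", max_radius)

print(chaos_cycle())
```

Because the stubs never break anything, this sketch runs through all radii; a real cycle would stop at the first disproven hypothesis, get fixed, and begin again.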
"The point of simulating potentially catastrophic events is to make them non-events that are irrelevant to our infrastructure’s ability to perform as required." - Gremlin
Chaos Engineering proves the system does work (under certain conditions) by observing systemic behaviour patterns during experiments, both regular and disruptive. Addressing systemic vulnerabilities to boost fault tolerance is a matter of taking each step as it comes. A firefighter's training in actively putting out real fires makes them ready and resilient: with practice the individual steps blur and disappear, but everyone starts at the beginning.
The ultimate goal is to run your experiment full-scale in production. No more bugs showing up at this point? Your system is tolerating failures and working to full expectations? Well done. Mission accomplished. You can end the experiments and step out of the process - for now.
This post by Node.Kitchen is part of an ongoing series on chaos engineering. If you want to implement these tools in your company or business first-hand, we’d love to help. Let’s talk about what your business needs.