You don’t have to look far to see that a competitive edge starts with resilient systems. Resilience reflects deliberate architectural thinking and mature engineering practices, and above all it demonstrates the ability to bring harmony to many disparate moving parts.
Architects have long designed for resilience as part of their non-functional discovery processes, catering for failure tolerance and quality of service, among other aspects. But system complexity has grown exponentially, so new means are needed. At the core of modern resilience practice, chaos engineering brings a whole new dimension to the game.
What is Chaos Engineering?
We’ve come a long way with advanced architectures, maturing engineering practices, and faster development, deployment and implementation of complex distributed systems. But even when all these parts are perfectly designed, there’s no guarantee they will work in harmony when faced with real-life scenarios or events.
We need a way of ensuring, with confidence, that we can predict what would happen in an inherently chaotic environment (typically your production environment). What better way is there than to unleash chaos proactively into our production environment to strengthen our confidence?
We call this notion Chaos Engineering, perhaps more accurately defined as: “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” (source: PrinciplesOfChaos.org).
Behind all the beauty lies madness and chaos
What’s the benefit?
- Higher availability/reliability – Discover failures before your customers do and save yourself some embarrassment, with the opportunity to test multiple real-life scenarios first. Win-win.
- Money, Money, Money – Discovering defects in production is expensive. If you are a busy team, getting pulled into outages takes time away from your build plan and is inherently more costly (fixing a defect during the build cycle is roughly 3x cheaper than fixing it in production).
- Less risk – Your risk/compliance/security and legal team will love you for changing the unpredictable into the predictable (before, this was only possible through flowers and chocolates, but now you can use chaos engineering as well).
- It’s cool – If you’ve ever had a fascination for breaking things to see how they were put together, then this concept (and perhaps the role of a Site Reliability Engineer) may just be for you.
How do I get started? What tools can I use?
The tooling landscape has significantly improved since Netflix first released the Chaos Monkey back in 2011.
Here are a few considerations that we found valuable:
- Netflix’s Simian Army:
  - Chaos Monkey – Kills random instances
  - Chaos Gorilla – Kills availability zones
  - Chaos Kong – Kills regions
  - Latency – Degrades the network and injects faults
  - Conformity – Looks for outliers
  - Circus – Kills and launches instances to maintain zone balance
  - Doctor – Fixes unhealthy resources
  - Janitor – Cleans up unused resources
  - Howler – Yells about bad things like Amazon limit violations
  - Security – Finds expiring security certificates
- Gremlin (not the database): Offers a chaos engineering platform that now supports testing on Kubernetes clusters
- AWS’s Fault Injection Simulator: Typically used to kill EC2 instances
- VMware’s Mangle: Can be used for killing VMs
- PowerfulSeal: Used for testing Kubernetes clusters
- Litmus: Can be used for testing stateful workloads on Kubernetes
- Pumba: Can be used with Docker for chaos testing and network emulation
- Chaos Dingo: Used for Microsoft Azure
- Chaos HTTP Proxy: Can be used to introduce failures into HTTP requests
And of course there are many, many more. We have our favorites, mainly due to familiarity, but essentially all of these tools will let you get started immediately and offer comparable functionality at the time of writing.
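At their core, most instance-killing tools in the Chaos Monkey family do something very simple: pick a small, bounded set of victims at random from a fleet and terminate them. Here is a minimal sketch of that selection logic; the fleet names, probability and `max_kills` cap are all hypothetical, and a real tool would of course call a cloud API to do the actual termination.

```python
import random

def pick_victims(instances, kill_probability=0.2, max_kills=1, rng=None):
    """Select at most `max_kills` instances to terminate, each with
    `kill_probability` of being chosen. Capping the count is one way
    to keep the blast radius small."""
    rng = rng or random.Random()
    victims = [i for i in instances if rng.random() < kill_probability]
    return victims[:max_kills]

# Hypothetical fleet of web instances; seeded RNG for repeatability.
fleet = ["web-1", "web-2", "web-3", "web-4"]
victims = pick_victims(fleet, kill_probability=0.5, max_kills=1,
                       rng=random.Random(42))
print(victims)  # at most one instance name is selected for termination
```

The `max_kills` cap matters more than the probability: it is the hard limit on how much damage a single run can do, which is the same idea the best practices below describe as controlling the blast radius.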
What’s the process involved?
Collective input and a growing body of knowledge have produced a simple, effective process for enabling chaos:
- Define a steady state (typically from non-functional values/requirements)
- Form a hypothesis (e.g. what happens if the load balancer fails? Hypotheses can target the application, host, resource, network and region levels)
- Design an experiment (pick a hypothesis, scope and identify the metrics for passing, control the blast radius)
- Execute, verify and learn (quantify the result)
- Fix (either add it to the backlog for an upcoming sprint, or fix it immediately if possible)
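The steps above can be sketched as a small experiment loop: confirm the steady state, inject the fault, check whether the steady state survives, and always back out afterwards. The callables and the load-balancer scenario below are hypothetical, purely to illustrate the shape of the process.

```python
def run_experiment(steady_state_check, inject_fault, rollback):
    """Minimal chaos-experiment loop. The caller supplies three callables:
    a steady-state probe, a fault injector, and a rollback. Returns True
    when the hypothesis holds (the steady state survives the fault)."""
    if not steady_state_check():
        raise RuntimeError("System not in steady state; aborting experiment")
    try:
        inject_fault()
        return steady_state_check()  # hypothesis: steady state survives
    finally:
        rollback()  # always back out, whether the experiment passed or failed

# Hypothetical example: "what if one load balancer backend fails?"
backends = {"lb-a": True, "lb-b": True}
healthy = lambda: any(backends.values())            # steady state: >=1 healthy backend
kill_one = lambda: backends.update({"lb-a": False})  # the injected fault
restore = lambda: backends.update({"lb-a": True})    # the rollback

print(run_experiment(healthy, kill_one, restore))  # True: the system tolerated the failure
```

Note that the loop refuses to start when the system is not already in its steady state: injecting faults into an unhealthy system tells you nothing about the hypothesis.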
Chaos Engineering Best Practices
If your aim is to get started immediately, then I suggest that as a first step you communicate your intent to operational stakeholders and get their agreement before moving forward. Outages are real in every industry, and failing to control the blast radius of your experiment could make you unpopular in a very short time frame, especially against the backdrop of attempting to prove a new concept.
Here are some best practices we’ve learnt from our efforts and research:
- Build a hypothesis around a steady state
- Simulate real-world examples
- Experiments MUST run in production (you lose the benefit of real-world scenarios when tests run in a non-prod environment; additionally, compliance, legal, security and risk policies must still apply, so the experiment stays realistic without breaching policy)
- Automate the experiments so they become part of the DevOps pipeline
- Minimize and control the blast radius (you should be able to carefully monitor and back out of the experiment at any stage before serious impact)
Good luck with your chaos engineering efforts, we hope to see you on the side of advanced architecture, engineering and chaos.