A Practical Guide to Chaos Engineering

Modern systems are built at large scale and operated in a distributed manner. With scale comes complexity, and there are many ways these large-scale distributed systems can fail. Modern systems built on cloud technologies and microservices architectures depend heavily on the internet, on infrastructure, and on services that you do not control. Cloud infrastructure can fail for many reasons:

  • Power outages
  • Unexpected surges in user traffic
  • Natural disasters
  • Cyber-attacks such as DDoS
  • Hardware failures
  • Resource exhaustion (low memory, high CPU, low bandwidth, etc.)

We cannot control or avoid failures in distributed systems, but we can control the blast radius of a failure and optimize the time to recover and restore the system. This can be achieved only by exercising as many failures as we can in a test lab, thereby building confidence in the system's resilience.

Why Chaos Engineering?

Chaos Engineering is the discipline of experimenting with distributed systems to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos Testing is the deliberate injection of faults or failures into your infrastructure in a controlled manner, to test the system’s ability to respond during a failure. This is an effective method to practice, prepare, and prevent or minimize downtime and outages before they occur.

Chaos testing is an effective way to validate a system's resilience by running failure experiments or fault injections.

What is an Experiment?

An experiment is a planned fault injection performed in a controlled manner. Experiments vary based on the architecture of the system under test; however, for a distributed system and microservices architecture deployed on the cloud, below are the most common fault injections to exercise (a minimal sketch of the first one follows the list).

  • Shut down compute instances randomly in an availability zone (or data centre)
  • Simulate an outage of an entire region or availability zone
  • Resource exhaustion: high CPU, low memory, heavy disk usage
  • Data service failure: partially delete a stream of records/messages across multiple instances to recreate a database-dependent issue
  • Network: inject latency between services for a selected percentage of traffic over a predetermined period
  • Code insertion: add instructions to the target program so the fault triggers before certain instructions execute
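As an illustration, here is a minimal sketch of the first experiment above, written with boto3. It assumes AWS credentials are configured; the region and zone names are placeholders, and it should only ever be run against disposable test infrastructure.

```python
# Minimal sketch: terminate one random running EC2 instance in a chosen
# availability zone. Region and zone are placeholder values; run this only
# against disposable test infrastructure.
import random
import boto3

def terminate_random_instance(region="us-east-1", zone="us-east-1a"):
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "availability-zone", "Values": [zone]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None  # nothing running in that zone
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])  # the actual fault injection
    return victim

if __name__ == "__main__":
    print("Terminated:", terminate_random_instance())
```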

Principles of Chaos Testing

  1. Define the system’s normal behaviour: The steady state can be defined as some measurable output, such as overall throughput, error rates, or latency, that indicates normal behaviour. Distinguish acceptable behaviour from unexpected behaviour; the normal state of the system should be considered the steady state.
  2. Hypothesize about the steady state: The hypothesis is the expected outcome of the experiment and should be in line with the objective of Chaos engineering: “the events injected into the system will not result in a change from the steady state of the target system.”
  3. Design & run experiments: Identify all the possible failure scenarios in the infrastructure, design failure experiments, run them in a controlled manner, and ensure there is a back-out plan for every failure experiment. If a back-out plan is not known, identify the path to system recovery and record the procedures used during recovery.
  4. Analyse test results: Verify whether the hypothesis was correct or whether there was a change to the system’s expected steady-state behaviour. Identify whether there was any impact on service continuity or user experience, and whether the service was resilient to the injected failures (a sketch of automating this check follows the list).
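Steps 2 and 4 lend themselves to automation. Below is a hypothetical sketch of a steady-state check; the metrics endpoint, response shape, and 1% error-rate threshold are illustrative assumptions, not part of any particular tool.

```python
# Hypothetical steady-state check. The metrics URL, JSON shape, and the
# 1% error-rate threshold are illustrative assumptions; substitute your
# own monitoring API and SLOs.
import json
import urllib.request

METRICS_URL = "http://metrics.internal/api/error_rate"  # placeholder endpoint
STEADY_STATE_MAX_ERROR_RATE = 0.01  # hypothesis: error rate stays under 1%

def current_error_rate() -> float:
    with urllib.request.urlopen(METRICS_URL) as resp:
        return json.load(resp)["error_rate"]

def verify_steady_state() -> bool:
    """True while the system still satisfies the hypothesis."""
    return current_error_rate() <= STEADY_STATE_MAX_ERROR_RATE
```

Check it before injecting the fault, poll it during the experiment, and treat the first failed check as the signal to abort and roll back.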

Benefits of Chaos Engineering

  • Prepare for the unexpected: Chaos engineering allows you to test your system against possible failures, thereby letting you use the information from the experiment to strengthen the system against such failures.
  • Uncover the unknowns: Chaos engineering helps you understand system behaviour during a failure and uncovers the path to recovery of sub-systems.
  • Reduced system downtime: You can swiftly identify common and repetitive causes of downtime by injecting failures and faults into your system, which helps strengthen it against known failures.
  • Improved customer satisfaction: Chaos engineering prevents service disruption through early identification of outage scenarios, which in turn improves the user experience.

Chaos Test Tools

Chaos Monkey:

It was one of the first open-source Chaos Engineering tools and arguably kickstarted the adoption of Chaos Engineering outside of large companies. From it, Netflix built out an entire suite of failure injection tools called the Simian Army, although many of these tools have since been retired or rolled into other tools like Swabbie.

It only has one attack type: terminating virtual machine instances. You set a general time frame for it to run, and at some point during that time it terminates a random instance. This is meant to replicate unpredictable production incidents, but it can easily cause more harm than good if you’re not prepared to respond.

Chaos Blade:

It was built for failure testing at Alibaba. It supports a wide range of platforms, including Kubernetes, cloud platforms, and bare-metal, and provides dozens of attacks, including packet loss, process killing, and resource consumption.

Chaos Mesh:

Chaos Mesh supports 17 unique attacks, including resource consumption, network latency, packet loss, bandwidth restriction, disk I/O latency, system time manipulation, and even kernel panics.

Chaos Mesh is one of the few open-source tools to include a fully-featured web user interface (UI). Chaos Mesh also integrates with Grafana to view the executions alongside the cluster’s metrics to see the direct impact.

Litmus:

Like Chaos Mesh, Litmus is a Kubernetes-native tool that is also a CNCF sandbox project. It was originally created for testing OpenEBS, an open-source storage solution for Kubernetes. Litmus includes a health checking feature called Litmus Probes, which lets you monitor the health of your application before, during, and after an experiment.

Getting started with Litmus is much harder than with most other tools. By default, Litmus requires you to create service accounts and annotations for each application and namespace that you want to experiment with.

AWS FIS (Fault Injection Simulator)

It works only with AWS services, such as:

  • Amazon Relational Database Service (RDS)
  • Elastic Compute Cloud (EC2)
  • Elastic Container Service (ECS)
  • Elastic Kubernetes Service (EKS)

FIS supports seven native attack types, including rebooting EC2 instances, draining an ECS cluster, and rebooting an RDS instance. Since FIS only supports a limited number of AWS services and has a limited number of attacks, whether you use FIS will depend on which services you use in your environment. Running an attack in FIS can be involved: you must create IAM roles that allow you to run FIS actions, target specific AWS resources by ID, and, if using SSM, construct an SSM document.
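Once an experiment template exists (created, for example, in the AWS console), starting it from code is short. Here is a minimal sketch with boto3; the template ID below is a placeholder, and the caller’s IAM role is assumed to permit fis:StartExperiment.

```python
# Minimal sketch of starting a pre-built FIS experiment with boto3.
# The experiment template ID is a placeholder; an appropriate IAM role
# is assumed to be attached to the template and the caller.
import boto3

fis = boto3.client("fis", region_name="us-east-1")

response = fis.start_experiment(experimentTemplateId="EXT1a2b3c4d5e6f7")
experiment_id = response["experiment"]["id"]

# Poll the experiment state (e.g. pending, running, completed, failed).
status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
print(experiment_id, status)
```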

Gremlin:

Offered as a SaaS (Software-as-a-Service) product, Gremlin tests system resiliency using one of three attack modes. Users provide system inputs to determine which type of attack will yield the most useful results, and tests can be performed in conjunction with one another to facilitate comprehensive infrastructure assessments.

It can test entire systems under a variety of parameters and conditions, can be automated within CI/CD, and integrates with Kubernetes clusters and public clouds.

ChaosNative:

It is a SaaS platform that hosts the LitmusChaos control plane for DevOps. LitmusChaos is a CNCF project for end-to-end Chaos Engineering. Users sign up for the ChaosNative Litmus cloud, securely connect their Kubernetes clusters or namespaces, and run chaos experiments to validate the resilience of the connected resources. The platform also offers chaos engineering services for non-Kubernetes targets, such as VMware, AWS, Azure, and Google Cloud.

Test tools comparison

[Table: side-by-side comparison of the chaos testing tools discussed above]

Best Practices

Smaller blast radius: Begin with small experiments to discover and learn about the unknowns, and scale the experiments out only as you gain confidence. Start with a single compute instance, container, or microservice to reduce the potential side effects.

Test tool selection: Study the test tools available, comparing their features against the time and effort required to build your own. We recommend not picking tools that perform random experiments, as it becomes difficult to measure the outcome. Prefer tools that perform thoughtful, planned, controlled, safe, and secure experiments.

Exercise first in lower environments: To gain confidence in the tests, start in a development or staging environment. Once the tests in these environments pass consistently, move up to production.

Roll-back and abort planning: Ensure there is an effective plan to abort any experiment immediately and revert the system or service to its normal state. If an experiment causes a severe outage, track it carefully and analyse it to prevent a recurrence. If roll-back plans do not exist or cannot be run, perform a thorough root cause analysis to learn more from the outage. A sketch of such an abort guard follows.
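Below is a hypothetical sketch of an abort guard wrapping an experiment. The inject_fault and roll_back callables are placeholders for tool-specific actions, and verify_steady_state is the check sketched earlier; none of these names come from a particular tool.

```python
# Hypothetical abort guard around a fault injection. inject_fault() and
# roll_back() are placeholders for tool-specific calls; verify_steady_state()
# is the steady-state check sketched earlier. Aborts on the first breach.
import time

def run_guarded_experiment(inject_fault, roll_back, verify_steady_state,
                           duration_s=300, poll_s=10):
    inject_fault()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if not verify_steady_state():
                print("Steady state breached: aborting experiment")
                break
            time.sleep(poll_s)
    finally:
        roll_back()  # always revert, even if the guard itself fails
```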

Path to achieve maturity of Chaos Testing

[Figure: the maturity path of Chaos Testing]

Conclusion

No system is safe from failure or outage. Cloud infrastructure platforms cannot be trusted blindly; every major cloud provider has reported at least one outage in each quarter. We cannot control failures or outages; we can only control the impact on our customers, employees, partners, and reputation by exercising failures as many times as possible in the test lab, thereby identifying the path to our systems’ recovery.

Enterprises building distributed systems must exercise Chaos engineering as part of their resilience strategy. Running Chaos tests in a continuous manner is one of several things that you can do to improve the resiliency of your applications and infrastructure.

Cigniti has built a dedicated Performance Testing CoE that focuses on providing solutions around performance testing & engineering for our global clients. We focus on performing in-depth analysis at the component level, dynamic profiling, capacity evaluation, testing and reporting to help isolate bottlenecks and provide appropriate recommendations.

Schedule a discussion with our Chaos Engineering and Testing experts to find out more about Chaos Engineering and testing tools for cloud deployment.

Author

  • Jitendra Nath Lella

Jitendra Nath Lella is a Senior Architect at Cigniti Technologies and a certified Chaos Engineering practitioner. He has practiced non-functional testing for over 17 years and specializes in building and implementing test strategies for organizations that build or migrate data centres to the cloud. His expertise also includes simulating heavy user load tests of more than 200K users.
