{"id":18074,"date":"2022-10-14T11:33:20","date_gmt":"2022-10-14T06:03:20","guid":{"rendered":"https:\/\/cigniti.com\/blog\/?p=18074"},"modified":"2024-01-31T23:53:20","modified_gmt":"2024-01-31T18:23:20","slug":"seven-steps-execute-chaos-engineering","status":"publish","type":"post","link":"https:\/\/www.cigniti.com\/blog\/seven-steps-execute-chaos-engineering\/","title":{"rendered":"7 Steps to Execute Chaos Engineering"},"content":{"rendered":"
We\u2019ve all heard about the recent major WhatsApp outage, during which the app was unavailable to the public for an hour. From a technical standpoint, however, WhatsApp was back in less than an hour. What enabled the engineers at WhatsApp to restore service so quickly?<\/p>\n
Technically speaking, the team went through an extremely stressful production failure. Large companies such as Netflix, Facebook, Google, and others prepare for failures like this with a technique called Chaos Engineering<\/strong>.<\/p>\n The purpose of chaos engineering<\/a> is to learn how our system behaves during catastrophic failures in production and how resilient it is, so that we can fix the issues we find and optimize the system.<\/p>\n Chaos engineering involves testing a system<\/a> to build confidence in its ability to withstand turbulent conditions in production. You can use chaos engineering to compare what you expected to happen with what actually happened. To learn how to build more resilient systems, we need to \u201cbreak things on purpose<\/strong>.\u201d<\/p>\n When we were little, we used to pick up wooden sticks off the ground and bend them until they snapped in half. What interests us most is the point at which the stick breaks: that point represents the stick\u2019s ability to bear stress and pressure.<\/p>\n The goal of chaos engineering is to observe, track, react to, and enhance<\/strong> the reliability of our systems under challenging conditions.<\/p>\n <\/strong><\/p>\n Before running chaos experiments, make sure you thoroughly understand your system\u2019s architecture. Discuss the application architecture in a working session with your developers, architects, and SREs, and learn about the upstream\/downstream components, dependencies, timeframe, deployment schedule, and other factors. This will help you understand exactly where your system could fail.<\/p>\n Start by writing a list of hypotheses about what might go wrong and how the system should respond. Example: If a site runs on multiple nodes and one goes down, the load balancer should quickly reroute traffic to the remaining healthy nodes. Other scenarios of this kind include failing hard drives, broken network connections, and production outages. The main point is that there is no right or wrong when listing hypotheses. 
It is an iterative process. Proving our belief TRUE or FALSE is NOT the goal. Each hypothesis will allow us to learn more about our system.<\/p>\n Always start small. By limiting the blast radius, you can run chaos experiments with less impact on users. Example: Delete the build deployment flow in Jenkins and validate the resiliency. Before deleting a deployment flow, make sure GitOps is active so that the GitOps flow recreates the build deployment automatically. Another example would be to shut down a single zone rather than an entire region, or to shut down 50% of the cluster\u2019s active nodes. You can progressively widen the blast radius once your chaos process has matured and your team is comfortable.<\/p>\n Always think ahead and keep a Plan B handy. Set up a single communication channel in Teams (or your company\u2019s communication platform) for posting periodic updates, and notify all relevant stakeholders at least one week in advance. It is advisable to assemble your own Avengers team of developers, testers, DevOps<\/a>, SREs, and others to support you when you launch your first experiment.<\/p>\n Running the first chaos experiment is like riding a thrilling roller coaster. Make sure you can stop the experiment and roll back the infrastructure with the help of your Avengers squad in case things go wrong. To conduct an experiment, you intentionally break your system so that some infrastructure components become unavailable. Examples include shutting down running processes, deleting database tables, blocking access to internal and external services, and terminating cluster machines.<\/p>\n Although these experiments are challenging, you will be astonished by how much you can learn from chaos, whatever you try. Watch your observability dashboard throughout the experiment to keep track of important metrics such as response time, disk usage, passed\/failed transactions, and health checks. Nobody is flawless. 
It\u2019s okay if your initial experiment doesn\u2019t go as planned. Post an update as soon as possible, notifying all parties involved.<\/p>\n Once the experiment is complete, record all your observations in a spreadsheet, analyze them, and reach a verdict on each hypothesis. Again, there is only learning and no PASS or FAIL. Schedule a meeting with the relevant stakeholders, including your Avengers team, to discuss your verdicts. This will help the team understand the verdicts and fix the issues you discovered. You can repeat the experiments after addressing the problems.<\/p>\n <\/p>\n If you find that the system is resilient, consider widening the blast radius and repeating the experiments.<\/p>\n Chaos engineering aims to expose your system to disastrous circumstances on purpose. Although it may seem challenging and requires a lot of imagination, the extra work is unquestionably worthwhile. You start by injecting failures that make some infrastructure components unavailable. Later, you can simulate conditions such as high latency caused by slow networks, which can also upset the steady state.<\/p>\n Enterprises building distributed systems should practice chaos engineering as a resilience strategy. Running chaos tests continuously is one of several things you can do to improve the resiliency of your applications and infrastructure.<\/p>\n\n
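To make the steps above concrete, here is a minimal sketch of the load balancer hypothesis written as a toy Python simulation. Everything in it is an assumption made for illustration: the Node and LoadBalancer classes, the run_experiment helper, and the 0.99 success threshold are hypothetical in-memory stand-ins, not part of any real chaos engineering tool. It only shows the shape of an experiment: measure a baseline, inject a failure with a small blast radius, compare the result with the hypothesis, and always roll back.

```python
# Toy chaos experiment. Node, LoadBalancer and run_experiment are
# hypothetical in-memory stand-ins, not a real chaos tool or API.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class LoadBalancer:
    '''Round-robin balancer that skips unhealthy nodes.'''

    def __init__(self, nodes):
        self.nodes = nodes
        self._next = 0

    def handle_request(self):
        # Try each node at most once, starting from the round-robin cursor.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node.healthy:
                return True
        return False  # no healthy node left, so the request fails


def success_rate(balancer, requests=100):
    '''Fraction of simulated requests that reached a healthy node.'''
    ok = sum(balancer.handle_request() for _ in range(requests))
    return ok / requests


def run_experiment(balancer, victims, threshold=0.99):
    '''Take the victim nodes down, measure, then always roll back.'''
    baseline = success_rate(balancer)
    for node in victims:
        node.healthy = False  # inject the failure
    try:
        during = success_rate(balancer)
    finally:
        for node in victims:
            node.healthy = True  # rollback runs even if measuring fails
    return {
        'baseline': baseline,
        'during': during,
        'hypothesis_held': during >= threshold,
    }


nodes = [Node('node-a'), Node('node-b'), Node('node-c')]
balancer = LoadBalancer(nodes)

# Small blast radius first: one node out of three.
small = run_experiment(balancer, victims=nodes[:1])
# Wider blast radius: the whole cluster.
large = run_experiment(balancer, victims=nodes)
print(small)
print(large)
```

In this sketch the hypothesis holds when a single node is lost and fails when every node is taken down. Neither outcome is a PASS or a FAIL; each is an observation to record and discuss. In a real experiment, the success rate would come from your observability dashboard rather than from a simulated counter.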
7 Steps to Execute Chaos Engineering<\/h2>\n
System Architecture<\/h3>\n
Step 3. Write a Hypothesis<\/h3>\n
Step 4. Minimize the Bang<\/h3>\n
Step 5. Plan for a Play Date<\/h3>\n
Step 6. Run your First Experiment<\/h3>\n
Step 7. Analyze & Brainstorm the Experiment Results<\/h3>\n
Conclusion<\/h2>\n