Our database migration was an arduous task: long hours, multiple developer reviews, and even weekend consultations to make sure everything was up to spec before the deadline. Then, mid-migration, a power surge shut the system down.
It is hard to fathom how a single error could completely derail a project or, worse still, cause permanent damage to your database.
Feeling anxious about the future is a natural biological reaction that helps us navigate unfamiliar situations. However, it is not healthy to fixate solely on the worst-case scenario.
Augustus De Morgan is remembered for observing that "whatever can happen will happen if we make trials enough", adding that "the first experiment already illustrates a truth of the theory, well confirmed by practice". This idea is widely regarded as a precursor to Murphy's Law, which holds that "anything that can go wrong, will go wrong".
A failure may be highly improbable, but it is never entirely impossible.
A colleague of mine likes to joke that at launch, the question is not whether there will be an impact, but what kind of impact it will be. Although said in jest, there is an element of truth in it.
It is common practice to mirror production settings in development environments, and vice versa, as closely as possible. Even minor tweaks to a system can have far-reaching consequences, and that is only the tip of the iceberg.
Frequent causes of downtime include human error, hardware malfunctions, sluggish networks, and problems with storage media or files. Events such as the 2023 AWS outage, which affected a large number of customers, are regrettable and hard to foresee.
Although a similar event may be unlikely to recur, it is still essential to acknowledge the possibility. This is why engineers typically build in redundancy, as exemplified by the Apollo 11 mission: NASA had a backup for its backup, which ultimately helped avert a catastrophic crash landing on the lunar surface.
Upon investigation, it turned out that the 1202 alarm raised by Apollo's on-board guidance computer was caused by overload: the machine was being asked to handle more tasks than it had cycles for. Luckily, NASA's programmers had anticipated this possibility and built in a failsafe that restarted the computer, shed its lower-priority tasks, and kept the critical landing programs running.
Minimizing Downtime and Swift Recovery
Modern engineers can treat the lunar landing as a model for reducing Mean Time to Recovery (MTTR) after a catastrophe: when an unexpected incident occurs, the goal is to restore the system to full functionality as quickly as possible.
Suppose recent records show that Company A suffered several outages, while Company B experienced only one failure. At first glance, most people would want to emulate Company B, if only they knew its winning formula.
Now suppose Company A has an MTTR of roughly 20 seconds, while Company B's MTTR runs from 4 to 6 hours. Even with 20 outages per day, Company A would accumulate only about 6 to 7 minutes of downtime, far less than Company B's single multi-hour failure. Seen this way, how often a system fails matters far less than how quickly it recovers.
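To make the arithmetic concrete, here is a minimal sketch of the comparison, using the hypothetical figures above rather than real measurements:

```python
# Hypothetical downtime comparison for the two companies described above.
SECONDS_PER_HOUR = 3600

def daily_downtime_seconds(outages_per_day: int, mttr_seconds: float) -> float:
    """Total downtime per day = number of outages x time needed to recover from each."""
    return outages_per_day * mttr_seconds

# Company A: many small failures, but a roughly 20-second recovery.
company_a = daily_downtime_seconds(outages_per_day=20, mttr_seconds=20)

# Company B: a single failure, but a 4-6 hour recovery (5 hours as a midpoint).
company_b = daily_downtime_seconds(outages_per_day=1, mttr_seconds=5 * SECONDS_PER_HOUR)

print(f"Company A: ~{company_a / 60:.1f} minutes of downtime per day")      # ~6.7 minutes
print(f"Company B: ~{company_b / SECONDS_PER_HOUR:.1f} hours of downtime")  # 5.0 hours
```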
To improve your system's ability to recover, it helps to understand the idea of controlled failure. It may sound contradictory, but a closer look reveals its advantages. To begin with, it is worth establishing a "safe mode" that your system can reboot into.
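As a rough illustration of the idea, here is a minimal safe-mode sketch; the SAFE_MODE flag and feature names are hypothetical stand-ins for your own configuration:

```python
import os

# Hypothetical feature switches; in a real system these might come from a config service.
NON_ESSENTIAL_FEATURES = ["recommendations", "analytics", "email_notifications"]

def startup_features() -> dict:
    """Decide which features to enable at boot.

    When the (hypothetical) SAFE_MODE environment variable is set, only the
    essential core is started, giving the system a known-good state to reboot
    into while an incident is investigated.
    """
    safe_mode = os.environ.get("SAFE_MODE") == "1"
    features = {"core_api": True}
    for name in NON_ESSENTIAL_FEATURES:
        features[name] = not safe_mode
    return features

if __name__ == "__main__":
    print(startup_features())
```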
In a controlled-failure drill, the team knows the precise date and time of the failure, but not its cause. The problem has to be analyzed carefully and fixed quickly to restore service.
We observe system data before and after the breakdown to gain insight into the recovery process and to facilitate future analysis and enhancement.
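As an illustration, the sketch below wraps a controlled failure with a before-and-after snapshot and times the recovery; inject_fault, check_health, and read_error_rate are hypothetical hooks you would wire up to your own system:

```python
import time

def run_recovery_drill(inject_fault, check_health, read_error_rate, poll_interval=1.0):
    """Inject a controlled failure and measure how long the system takes to recover.

    All three callables are hypothetical hooks into your own system:
      - inject_fault():    triggers the controlled failure
      - check_health():    returns True once the service is healthy again
      - read_error_rate(): returns a metric snapshot for before/after comparison
    """
    baseline = read_error_rate()              # observe the system before the breakdown
    started = time.monotonic()
    inject_fault()                            # the controlled failure begins here

    while not check_health():                 # wait for the system to report healthy again
        time.sleep(poll_interval)

    recovery_seconds = time.monotonic() - started
    after = read_error_rate()                 # observe the system after recovery
    return {"recovery_seconds": recovery_seconds, "baseline": baseline, "after": after}
```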
Witnessing the impact of an unforeseen fault firsthand is an effective way to generate new ideas. It acts as a reality check: the safety mechanisms you relied on suddenly expose their own weaknesses.
Running such drills gives teams a deeper understanding of the system's weak points, which in turn leads to better procedures. The process can be grueling, and it is arguably one of the most demanding intellectual exercises a development team can take on, but the payoff is substantial.
Tests Are About to Get Turbulent
Making controlled failure experiments a regular practice is the beginning of Chaos Testing, or Chaos Engineering: the ability to deliberately and repeatedly disrupt a production environment.
Netflix's adoption of Chaos Engineering was pivotal to its success as a prominent streaming provider. Its Chaos Monkey tool, and the wider "Simian Army" that grew around it, prepared the technical team for everything from latency spikes to the loss of entire Amazon Web Services regions.
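To give a flavor of what such an experiment can look like, here is a generic sketch, not Netflix's actual Chaos Monkey; the service names and the systemctl call are assumptions you would replace with your own inventory and orchestration:

```python
import random
import subprocess

# Hypothetical list of services that are safe to disturb during the experiment window.
CANDIDATE_SERVICES = ["api-1", "api-2", "worker-1", "worker-2"]

def unleash_chaos(dry_run: bool = True) -> str:
    """Pick one service at random and stop it, Chaos-Monkey style.

    With dry_run=True the function only reports what it would do; the systemctl
    call is an assumption about how services are managed and should be swapped
    for your own orchestration (e.g. a cloud provider's terminate-instance API).
    """
    victim = random.choice(CANDIDATE_SERVICES)
    if dry_run:
        print(f"[dry run] would stop {victim}")
    else:
        subprocess.run(["systemctl", "stop", f"{victim}.service"], check=True)
    return victim

if __name__ == "__main__":
    unleash_chaos(dry_run=True)
```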
To cope with such unforeseen challenges, your team needs a proactive approach to research and development, one aimed at making both the system and the team more resilient.
An example of a resilient system is a streaming platform that, when an unexpected latency spike slows down data delivery, can redirect its traffic elsewhere.
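As a simplified sketch of that behaviour, the snippet below routes requests to a fallback endpoint whenever the primary is slow or unreachable; the URLs and the latency budget are made-up placeholders:

```python
import time
import urllib.request

# Hypothetical endpoints and budget; replace with your own servers and SLOs.
PRIMARY = "https://primary.example.com/health"
FALLBACK = "https://fallback.example.com/health"
LATENCY_BUDGET_SECONDS = 0.5

def measure_latency(url: str) -> float:
    """Time a single health-check request to the given endpoint."""
    started = time.monotonic()
    urllib.request.urlopen(url, timeout=5).read()
    return time.monotonic() - started

def choose_endpoint() -> str:
    """Send traffic to the fallback when the primary is too slow or unreachable."""
    try:
        if measure_latency(PRIMARY) <= LATENCY_BUDGET_SECONDS:
            return PRIMARY
    except OSError:
        pass  # an unreachable primary is treated the same as a slow one
    return FALLBACK

if __name__ == "__main__":
    print("Routing traffic to:", choose_endpoint())
```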
Resilience is also a defining trait of successful teams: the ability to adapt and devise creative solutions under pressure. Instead of panicking or becoming complacent, resilient teams treat a crisis as an opportunity to be tackled with commitment and resourcefulness.
Testing in a chaotic environment builds a more resilient mindset in both the systems and the people involved. It is akin to the fire drills of our school days, which taught us how to handle stressful situations; with practice, we are better prepared for real crises.
That said, this approach can be taxing for developers, so it is best suited to experienced teams working on large, intricate projects where even small errors can have significant consequences.
Chaos testing is used by software and engineering giants such as IBM because it is recognized as one of the most effective ways to improve resilience and reduce Mean Time to Recovery (MTTR).
To secure your business, intentionally create chaos.
It may seem paradoxical, but the results of "building with chaos" speak for themselves. Netflix runs one of the most reliable systems in the world, illustrating how effective Chaos Testing can be when implemented correctly.