Knowledge Byte: Designing the Cloud to Expect Failure
Cloud Credential Council (CCC)
Designing software to tolerate failure adds an extra hurdle, but it is not especially hard, and it pays off.
Largely, it boils down to making sure that operations do not leave the system in an unstable state if they are aborted partway through. This is mainly a challenge for the frameworks and infrastructure on which applications are built. When the infrastructure for retrying failed operations is built into the system, application developers only need to worry about the areas where the system cannot automate recovery from failure (such as operations that trigger real-world actions).
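As a rough illustration of the retry idea above, the sketch below pairs a generic retry helper with an idempotent write, so that repeating a request after a partial failure does not apply it twice. Every name here (TransientError, retry, Ledger, flaky_credit) is hypothetical and chosen for this example; it is not taken from any particular framework.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


def retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff with jitter so retries do not arrive in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay * random.uniform(0.5, 1.0))


class Ledger:
    """Toy store with idempotent writes keyed by a request ID (an assumption)."""

    def __init__(self):
        self.balance = 0
        self.applied = set()

    def credit(self, request_id, amount):
        # A repeated request ID is a no-op, so a retry cannot double-apply.
        if request_id not in self.applied:
            self.balance += amount
            self.applied.add(request_id)
        return self.balance


def flaky_credit(ledger, request_id, amount, failures):
    """Build an operation whose response is 'lost' failures['left'] times."""

    def operation():
        result = ledger.credit(request_id, amount)
        if failures["left"] > 0:
            failures["left"] -= 1
            raise TransientError("response lost after the write was applied")
        return result

    return operation


if __name__ == "__main__":
    ledger = Ledger()
    op = flaky_credit(ledger, "req-1", 50, {"left": 2})
    print(retry(op, base_delay=0.01), ledger.balance)
```

The request ID is what makes the retry safe: the write may already have been applied when the failure is reported, so the operation itself has to tolerate being repeated. Operations that trigger real-world actions usually cannot be made idempotent this way, which is why they remain the application developer's concern.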
Design Tips:
● Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure. The best practice is deployment into multiple availability zones.
● Each application component must make no assumptions about the underlying infrastructure; it must be able to adapt to changes in the infrastructure without downtime.
● Each application component should be partition tolerant; it should be able to survive network latency (or loss of communication) among the nodes that support that component.
● Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure.
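The tips above can be sketched as a client that treats each replica as interchangeable and moves on to the next one when a replica is unavailable. This is a minimal in-memory model, not a real cloud API: Replica, ReplicaDown, FailoverClient, and the zone names are all assumptions made for illustration.

```python
class ReplicaDown(Exception):
    """Hypothetical error raised when a replica cannot serve a request."""


class Replica:
    """Stand-in for one copy of a component deployed in one availability zone."""

    def __init__(self, zone, healthy=True):
        self.zone = zone
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ReplicaDown(self.zone)
        return f"{request} served from {self.zone}"


class FailoverClient:
    """Tries replicas in order and returns the first successful response."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def call(self, request):
        failed_zones = []
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ReplicaDown as exc:
                # Record the failure and fall through to the next zone.
                failed_zones.append(str(exc))
        raise RuntimeError(f"all replicas failed: {failed_zones}")


if __name__ == "__main__":
    client = FailoverClient([
        Replica("zone-a", healthy=False),  # simulated zone outage
        Replica("zone-b"),
        Replica("zone-c"),
    ])
    print(client.call("GET /status"))
```

The client holds no assumption about which zone is healthy at any moment, so a zone can fail or return without any change to the calling code. In a real deployment a load balancer or service registry would typically play this role.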
Use a Chaos Monkey: the best way to avoid failure is to fail constantly. From an early stage, Netflix has used Chaos Monkey, a piece of software that randomly kills off different services or features in Netflix, with the intention of assessing how well recovery works. Initially it was used only in testing, but it is now run randomly in production. Quoting from the Netflix blog: “If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
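The idea can be approximated in a few lines. The toy experiment below is not Netflix's Chaos Monkey and uses none of its code or configuration. It stops a random in-memory service each round and then checks that a hypothetical Supervisor brings everything back. All class and function names are assumptions for this sketch.

```python
import random


class Service:
    """In-memory stand-in for a running service."""

    def __init__(self, name):
        self.name = name
        self.running = True


class Supervisor:
    """Hypothetical recovery automation: restarts any service found down."""

    def __init__(self, services):
        self.services = services
        self.restarts = 0

    def heal(self):
        for service in self.services:
            if not service.running:
                service.running = True
                self.restarts += 1


def chaos_round(services, rng):
    """Stop one randomly chosen service and return it."""
    victim = rng.choice(services)
    victim.running = False
    return victim


def run_experiment(rounds=10, seed=0):
    """Inject one failure per round and verify that recovery restores health."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    services = [Service(name) for name in ("api", "search", "billing")]
    supervisor = Supervisor(services)
    for _ in range(rounds):
        chaos_round(services, rng)
        supervisor.heal()
        if not all(service.running for service in services):
            return False, supervisor.restarts
    return True, supervisor.restarts


if __name__ == "__main__":
    recovered, restarts = run_experiment(rounds=10, seed=1)
    print(f"recovered={recovered} restarts={restarts}")
```

What matters is the check after every injected failure. The experiment is only useful if it verifies that recovery actually happened, rather than just causing outages.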