Design for Failure with the Cloud

Design for Failure Philosophy

“Everything fails, all the time” – Werner Vogels

Designing for failure is essential in a cloud computing environment because your users are the ones who suffer when a system goes down. More applications are becoming mission-critical, which means their creators must account for failure at every stage of design and operation.

Applications across the board rely on the internet. 

These applications range from electricity dashboards to other systems that affect your daily life. Remember, it was just a few days ago that Robinhood was down for almost two days. That is a serious problem for customers who use the service to trade options and need to sell them if the market turns against them. Many people were upset and frustrated.

Many lawsuits are likely to follow, and users are expected to feel the sting for some time.

So, what was the deal with Robinhood?

They noted that their systems had trouble communicating with each other. That does not sound like the resilient, reliable architecture that would serve users best.

Remember that individual components within a system can fail, but the system should be designed so that the application keeps running anyway. You might wonder how that is possible. How can components fail and still leave you with a working application?

The answer is that a failed component is swapped out for another, similar resource that keeps the system going. A critical design principle at Amazon is to design for failure now so that the system keeps running over the long term.
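To make the idea of swapping in a similar resource concrete, here is a minimal sketch in Python. It is an illustration only, not how Amazon or Robinhood implement failover: the names `call_with_failover`, `broken_replica`, and `healthy_replica` are hypothetical, and the "replicas" are plain functions standing in for real servers.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica in the pool has failed."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for {request!r}")


# Hypothetical replicas: the first simulates an outage, the second works.
def broken_replica(request: str) -> str:
    raise ConnectionError("primary is unreachable")


def healthy_replica(request: str) -> str:
    return f"handled {request}"


result = call_with_failover([broken_replica, healthy_replica], "GET /quote")
print(result)
```

The caller never sees the first replica's failure. It only sees an error if every replica in the pool is down, which is why the pool should not share a single underlying dependency.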

If long-term stability matters for more and more of the applications in your life, then reliability must be treated as a design concern from the start.

What should you do if you want your application to be highly available (HA)?

Highly Available Architecture

The first step is to keep the idea of a highly available architecture in mind. All present and future creators must examine their applications and understand how to prevent outages from the start.

What’s the next step?

– Assess for single points of failure.

Single points of failure, or choke points, limit the availability of the application. If such a component goes down, the whole system falls into disarray, and that turmoil almost certainly means more dissatisfied users.

– Decouple components.

Decoupling means minimizing the direct dependencies between components so that if one goes down, the others are unaffected. You can decouple components with load balancers (such as AWS Elastic Load Balancing), message queues (such as Amazon Simple Queue Service), and workflow engines (such as Amazon Simple Workflow Service), among other tools.

– Security

Focus on security throughout the process. The application provider is still responsible for application-level concerns such as security and encryption. Cloud providers such as Amazon, in turn, must protect the physical environments, maintain uptime, and handle the other elements needed to provide a seamless experience.
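Returning to the decoupling step, the sketch below shows roughly why a queue between two components helps. It uses Python's in-process `queue.Queue` as a stand-in for a managed message queue. The `Consumer` class and the `submit_order` function are hypothetical names invented for this example, and a real deployment would use a durable, networked queue instead.

```python
import queue

# The queue sits between producer and consumer, so neither calls the other directly.
orders = queue.Queue()


def submit_order(order: str) -> None:
    """Producer side: enqueue the order and return immediately."""
    orders.put(order)


class Consumer:
    """Hypothetical worker that may be unavailable for a while."""

    def __init__(self) -> None:
        self.available = True
        self.processed = []

    def drain(self, q: queue.Queue) -> int:
        """Process everything currently queued; do nothing while unavailable."""
        if not self.available:
            return 0
        count = 0
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                break
            self.processed.append(item)
            count += 1
        return count


worker = Consumer()
worker.available = False           # simulate a consumer outage

for order in ["order-1", "order-2", "order-3"]:
    submit_order(order)            # producer keeps working during the outage

handled_during_outage = worker.drain(orders)
backlog = orders.qsize()           # orders wait in the queue instead of being lost

worker.available = True            # consumer recovers
handled_after_recovery = worker.drain(orders)
print(handled_during_outage, backlog, handled_after_recovery)
```

While the worker is down, the producer still accepts orders and the backlog simply grows. Once the worker recovers, it catches up in order, so the outage shows up as a delay, not a failure for the user.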

Now you understand what it means to design for failure and why it is vital in a dynamic world.

