What’s New in the Updated Reliability Pillar of the AWS Well-Architected Framework?

The latest iteration of the Reliability pillar within the AWS Well-Architected Framework introduces an extensive array of enhancements across all reliability aspects. The guidance regarding distributed system architecture has been reorganized and enriched with new best practices as part of the Well-Architected Review. A more pronounced emphasis has been placed on chaos engineering, complete with additional explanations and illustrative examples. Furthermore, we have provided detailed insights on utilizing fault isolation to safeguard your workloads through Availability Zones and beyond.

In the AWS Well-Architected Tool, we’ve added new reliability best practices and updated existing ones. The Reliability Pillar whitepaper has undergone a comprehensive revision to ensure alignment with the questions and best practices detailed in the tool. We have also included the latest recommendations for implementing these best practices using cutting-edge AWS resources and partner technologies, such as AWS Transit Gateway, AWS Service Quotas, and CloudEndure Disaster Recovery.

The whitepaper clarifies definitions to enhance your understanding of the interconnections among reliability, resiliency, and availability. The focus continues to be on resiliency, emphasizing how to architect your workloads to recover from infrastructure or service disruptions, dynamically allocate computing resources to meet demand, and address disruptions like misconfigurations or intermittent network issues.

Since its launch at re:Invent 2019, the Amazon Builders’ Library has offered in-depth articles about how Amazon builds and operates resilient workloads. The revised Reliability pillar draws heavily on this library, integrating its guidance into many best practices and linking to specific articles. The hands-on reliability labs associated with the AWS Well-Architected Framework now include a lab on Implementing Health Checks and Managing Dependencies to enhance reliability, which lets you apply the practices described in the library’s Implementing health checks article directly. We have also broadened our suite of Well-Architected Reliability labs to include new topics on data backup, data replication, and automated infrastructure deployment.

The new Implementing Health Checks and Managing Dependencies lab demonstrates how to adopt practices that detect dependency failures and keep the workload resilient when they occur.

Previously, the Reliability pillar grouped its best practices into three areas: Foundations, Change Management, and Failure Management. In this update, we have added a fourth area:

Workload Architecture

This section outlines specific patterns to follow when designing and implementing software architecture for distributed systems.

This new area encompasses best practices related to service-oriented architecture, microservices architectures, and distributed systems. These have also been incorporated into the AWS Well-Architected Tool, enabling you to evaluate your workloads and ascertain whether they adhere to these architectural best practices. The whitepaper content has been expanded in this area, drawing on Amazon Builders’ Library articles, including Challenges with distributed systems and Timeouts, retries, and backoff with jitter.
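As a rough illustration of the retry pattern named in Timeouts, retries, and backoff with jitter, the sketch below retries a failing call with capped exponential backoff and full jitter. The function name `call_with_retries` and its parameters (`max_attempts`, `base_delay`, `max_delay`) are hypothetical and chosen for this example; they are not taken from the whitepaper or from any AWS SDK.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on any exception.

    Hypothetical helper for illustration only. Waits use capped
    exponential backoff with full jitter: a random delay between
    0 and min(max_delay, base_delay * 2**attempt).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters are injectable only so that the behavior can be tested without real waiting. In practice you would also retry only on errors known to be transient, and only for operations that are safe to repeat.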

The previous version highlighted the critical role of Availability Zones in ensuring a reliable architecture. In this new iteration, we delve deeper into this concept by elaborating on bulkhead architectures, such as cell-based architecture (utilized across AWS), where each cell functions as a complete, independent instance of a service.
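One way to picture a cell-based architecture is a thin routing layer that pins each customer to exactly one cell, so a failure in one cell affects only the customers assigned to it. The sketch below is a simplified assumption of how such a router might work, not a description of how AWS implements it; `cell_for` and the cell names are hypothetical.

```python
import hashlib
from typing import Sequence


def cell_for(customer_id: str, cells: Sequence[str]) -> str:
    """Map a customer to one cell deterministically (illustrative only)."""
    if not cells:
        raise ValueError("at least one cell is required")
    # A stable hash keeps the mapping identical across processes and restarts.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


# Hypothetical cell names; each would be a complete, independent service instance.
CELLS = ["cell-a", "cell-b", "cell-c"]
assigned = cell_for("customer-1234", CELLS)
```

Simple modulo mapping reassigns many customers when the number of cells changes, so a real router would more likely use an explicit mapping table or consistent hashing.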

Best practices for change implementation have always been a cornerstone of the Reliability pillar. We now offer more practical insights into reliable deployment, including runbooks and pipeline tests. The newly introduced best practice on immutable infrastructure builds upon our previous guidance regarding deployment automation, emphasizing techniques like canary deployment or blue/green deployment.

Additionally, we have expanded our coverage of Chaos Engineering. It is essential to hypothesize how your workload will respond to failures, inject those failures for testing, and then compare your hypotheses against the test outcomes. While Chaos Monkey popularized the constructive application of chaos in 2010, Amazon has been deliberately injecting failures since the early 2000s to enhance resiliency and ensure readiness in adverse conditions. This wealth of experience is increasingly relevant in the cloud context, where you can both design for recovery and validate those designs. This often-overlooked best practice is recognized as a vital and effective tool by our most successful resiliency-focused customers.
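The hypothesize, inject, and compare loop described above can be sketched in a few lines. Everything here is a toy assumption for illustration: `FlakyDependency`, `handle_request`, and `run_experiment` are hypothetical names, and a real experiment would inject failures into actual infrastructure rather than an in-process object.

```python
class FlakyDependency:
    """Stand-in for a downstream service whose failure we can switch on."""

    def __init__(self):
        self.fail = False

    def fetch(self):
        if self.fail:
            raise ConnectionError("injected failure")
        return "fresh"


def handle_request(dep, fallback="cached"):
    """Workload under test: degrade to a cached value if the dependency fails."""
    try:
        return dep.fetch()
    except ConnectionError:
        return fallback


def run_experiment(dep, requests=100):
    """Inject a dependency failure and compare the outcome with the hypothesis."""
    hypothesis = "every request is still served while the dependency is down"
    served = 0
    dep.fail = True  # inject the failure
    try:
        for _ in range(requests):
            try:
                handle_request(dep)
                served += 1
            except Exception:
                pass  # an unserved request counts against the hypothesis
    finally:
        dep.fail = False  # always end the experiment cleanly
    return {"hypothesis": hypothesis, "served": served,
            "total": requests, "holds": served == requests}
```

If `holds` comes back false, the experiment has exposed a gap between the expected and actual behavior, which is exactly the outcome the practice is designed to surface before a real outage does.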

This update to the Reliability pillar of the AWS Well-Architected Framework gives you and your teams the tools and information you need to understand the reliability of your workloads. Use it together with the AWS Well-Architected Tool to start building your plan today, and continue to learn, measure, and improve your cloud workloads.

A special thanks to everyone who has provided feedback on the tool and whitepapers, and a particular acknowledgment to Stephen Beck, Adrian Hornsby, Mahanth Jayadeva, Krupakar Pasupuleti, Jon Steele, and Jon Wright for their contributions to this update.
