Amazon Onboarding with Learning Manager Chanci Turner

Ensuring the resilience of distributed applications requires a comprehensive understanding of the application, the infrastructure, and the operational processes around them. In Part 2 of this series, we examine how Amazon Web Services (AWS) managed services, redundancy, high availability, and infrastructure failover patterns, all anchored in recovery time objectives (RTO) and recovery point objectives (RPO), can fortify infrastructure against failures.

Pattern 1: Identifying High-Impact Infrastructure Failures

A robust cloud infrastructure requires an awareness of potential failures and their likelihood. As depicted in Figure 1, many failures stem from operator error or poor deployments. Automated testing, automated deployments, and sound design practices mitigate these. Data center failures, such as the loss of an entire rack, can be addressed by using auto scaling, deploying across multiple Availability Zones (AZs), and relying on resilient AWS cloud-native services.

As illustrated in Figure 1, infrastructure resiliency combines high availability (HA) and disaster recovery (DR). HA is achieved by incorporating redundancy within application components and eliminating single points of failure. Decisions made at the application layer, such as designing stateless applications, facilitate HA at the infrastructure level, allowing for scaling through Auto Scaling groups and distributing applications across various AZs.
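The effect of redundancy on availability can be illustrated with a small calculation. The following is a minimal sketch that assumes component failures are independent (an optimistic simplification) and that the group is available as long as at least one copy is up. The function name and the sample figures are illustrative, not taken from the article.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a redundant group that is up when at least one copy is up.

    Assumption: copies fail independently of one another.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


# Illustrative figures: one component at 99% availability, spread over 1 to 3 AZs.
for copies in (1, 2, 3):
    print(copies, f"{parallel_availability(0.99, copies):.6f}")
```

Under the independence assumption, each added copy multiplies the probability of a full outage by the unavailability of a single component, which is why spreading stateless components across AZs pays off quickly.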

Pattern 2: Grasping and Managing Infrastructure Failures

To create a resilient infrastructure, it’s vital to discern which failures can be controlled and which cannot, as highlighted in Figure 2. This understanding lets us automate failure detection, manage failures effectively, and apply proactive strategies such as static stability, in which capacity is provisioned ahead of time so that the system does not need to scale up while a failure is in progress.
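Static stability can be made concrete with a capacity calculation. The sketch below assumes load is spread evenly across AZs and that the surviving AZs must carry peak load without any scaling action. The function names and the example numbers are hypothetical.

```python
import math


def instances_per_az(peak_instances: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the surviving AZs can still carry peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    # Peak load must fit entirely on the AZs that remain.
    return math.ceil(peak_instances / surviving)


def total_provisioned(peak_instances: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Total fleet size across all AZs under the same assumptions."""
    return az_count * instances_per_az(peak_instances, az_count, az_failures_tolerated)


# Example: 12 instances are needed at peak, deployed over 3 AZs, tolerating the loss of 1 AZ.
print(instances_per_az(12, 3), total_provisioned(12, 3))
```

With these example numbers the fleet holds more instances than peak load strictly requires. That extra capacity is the price of not depending on a scale-up operation during the failure itself.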

The decisions under our control that enhance infrastructure resiliency include:

AWS services are built with separate control planes and data planes, each engineered for a minimal blast radius. Data planes typically target higher availability than control planes because their design is simpler. When responding to events that threaten resiliency, relying on control plane operations can therefore lower the resilience of the overall architecture. For example, the Amazon Route 53 data plane carries a 100% availability SLA, so an effective failover strategy should depend on the data plane rather than the control plane, as described in the article Creating Disaster Recovery Mechanisms Using Amazon Route 53.
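One way to keep failover on the data plane is to create failover DNS records ahead of time, so that health checks and DNS resolution perform the switch without any API call during the incident. The sketch below only builds a change batch in the shape used by the Route 53 ChangeResourceRecordSets API. The record name, IP addresses, and health check ID are hypothetical placeholders, and this is one possible arrangement rather than the method of the referenced article.

```python
def failover_change_batch(record_name, primary_ip, secondary_ip,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 change batch containing a PRIMARY/SECONDARY failover pair."""

    def record(role, ip, health_check_id=None):
        record_set = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{record_name}-{role.lower()}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id is not None:
            # Route 53 answers with the secondary when this health check fails.
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover pair created ahead of time",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


# Hypothetical values for illustration only.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "hypothetical-health-check-id"
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```

The batch would be submitted once, before any incident, for example as the `ChangeBatch` argument of the boto3 Route 53 client's `change_resource_record_sets` call. That submission is a control plane operation, while the later switch to the secondary record is handled by the data plane.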

It’s also crucial to comprehend the networking design and routes within a virtual private cloud (VPC). Understanding traffic flow informs better application design, revealing how a single component failure impacts overall ingress and egress traffic. To bolster network resiliency, a well-thought-out subnet strategy and IP address management tools are vital.

Designing VPCs with awareness of service limits and deploying independent routing tables within each zone can enhance availability. For instance, it’s preferable to utilize highly available NAT gateways rather than NAT instances, as noted in the Amazon VPC documentation.
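The idea of independent routing per zone can be sketched as a small planning function: each AZ's private route table sends its default route to the NAT gateway in the same AZ, so the loss of one zone does not remove egress for the others. The dictionary keys mirror the parameters of the EC2 CreateRoute API; the resource IDs and the function name are hypothetical.

```python
def zonal_default_routes(nat_gateway_by_az: dict, private_route_table_by_az: dict) -> list:
    """Plan one default route per AZ, each pointing at the NAT gateway in that same AZ."""
    routes = []
    for az, route_table_id in sorted(private_route_table_by_az.items()):
        nat_gateway_id = nat_gateway_by_az.get(az)
        if nat_gateway_id is None:
            # Falling back to another AZ's gateway would couple the two zones.
            raise ValueError(f"no NAT gateway defined for {az}")
        routes.append({
            "RouteTableId": route_table_id,
            "DestinationCidrBlock": "0.0.0.0/0",
            "NatGatewayId": nat_gateway_id,
        })
    return routes


# Hypothetical resource IDs for illustration only.
plan = zonal_default_routes(
    {"us-east-1a": "nat-aaa", "us-east-1b": "nat-bbb"},
    {"us-east-1a": "rtb-111", "us-east-1b": "rtb-222"},
)
for route in plan:
    print(route)
```

Raising an error when an AZ lacks its own NAT gateway is a deliberate choice in this sketch: it surfaces a cross-zone dependency at planning time instead of during an outage.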

Pattern 3: Exploring Various Methods to Enhance HA in Infrastructure

As previously stated, infrastructure resiliency is a function of HA and DR. Here are several strategies to boost system availability:

Building for Redundancy: Redundancy entails duplicating application components to enhance availability. Following best practices at the application layer allows for the creation of self-healing mechanisms at the infrastructure layer.
Auto-Scaling Your Infrastructure: When an AZ fails, infrastructure auto scaling maintains the desired number of redundant components, which sustains baseline application throughput. Auto scaling uses metrics to decide when and by how much to scale, as shown in Figure 4.
Implementing Resilient Network Connectivity Patterns: Highly resilient distributed systems need robust network access to AWS infrastructure. In hybrid applications, cloud components must communicate reliably with their on-premises counterparts, so the data flow between them should inform the design of network access over AWS Direct Connect or VPN. Testing failover scenarios is critical to validate network paths and confirm that they meet the RTO. A hub-and-spoke configuration built on Direct Connect gateways and transit gateways simplifies both the network topology and failover testing. Additionally, keeping traffic on the AWS network backbone can improve security and reduce costs. AWS PrivateLink offers secure access to AWS services and can expose application functionality and APIs to other business units or partner accounts on AWS.
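The auto-scaling strategy above relies on metrics to choose a capacity. The sketch below uses a simplified proportional rule in the spirit of target tracking; it is an assumption for illustration and not the exact algorithm used by AWS Auto Scaling. The function name, metric values, and bounds are hypothetical.

```python
import math


def desired_capacity(current_capacity: int, current_metric: float, target_metric: float,
                     min_capacity: int, max_capacity: int) -> int:
    """Scale capacity in proportion to how far the metric is from its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must not exceed max_capacity")
    # Round up so the fleet is never sized below what the metric calls for.
    proposed = math.ceil(current_capacity * current_metric / target_metric)
    # Keep the result inside the group's configured bounds.
    return max(min_capacity, min(max_capacity, proposed))


# Example: 4 instances averaging 90% CPU against a 60% target, bounded to 2..10 instances.
print(desired_capacity(4, 90.0, 60.0, 2, 10))
```

The minimum bound matters for resilience as much as the scaling rule does, because it is what preserves the baseline number of redundant components when the metric is low.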

Furthermore, security appliances should be configured for high availability to ensure that if one AZ becomes unavailable, security inspection responsibilities can seamlessly transfer to redundant appliances in other AZs.

Lastly, DNS resolution must be carefully designed as a vital infrastructure component. Hybrid DNS resolution should use the highly available inbound and outbound Route 53 Resolver endpoints rather than self-managed proxies. Sharing DNS resolver rules across AWS accounts and VPCs through AWS Resource Access Manager is also advisable. Ensure that network failover tests are part of your disaster recovery and business continuity plans. For more detail, see Set up integrated DNS resolution for hybrid networks in Amazon Route 53.
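A network failover test can be checked against the RTO by probing the path at regular intervals and measuring the longest interruption. The sketch below assumes probe results have already been collected as (timestamp in seconds, success) pairs in time order; how the probes are collected, and all names and values here, are hypothetical.

```python
def longest_outage_seconds(probes) -> float:
    """Longest span from a first failed probe to the next successful probe.

    `probes` is an iterable of (timestamp_seconds, ok) pairs sorted by time.
    If the path is still down at the last probe, the result is a lower bound.
    """
    longest = 0.0
    outage_start = None
    last_timestamp = None
    for timestamp, ok in probes:
        if not ok and outage_start is None:
            outage_start = timestamp  # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)  # outage ends
            outage_start = None
        last_timestamp = timestamp
    if outage_start is not None:
        # Outage never recovered within the observation window.
        longest = max(longest, last_timestamp - outage_start)
    return longest


def meets_rto(probes, rto_seconds: float) -> bool:
    """True when the longest observed interruption fits within the RTO."""
    return longest_outage_seconds(probes) <= rto_seconds


# Example: probes every 10 seconds, with failures observed from t=10 through t=30.
sample = [(0, True), (10, False), (20, False), (30, False), (40, True), (50, True)]
print(longest_outage_seconds(sample), meets_rto(sample, 60))
```

The measurement is only as precise as the probe interval, so the interval should be small relative to the RTO being validated.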

Utilizing managed services enhances application component redundancy, which in turn improves availability. AWS services like AWS Lambda, Amazon Simple Queue Service, Elastic Load Balancing (ELB), and Amazon Simple Storage Service inherently utilize multiple AZs to ensure resiliency.

By following these strategies, organizations can develop a resilient infrastructure capable of withstanding various challenges.
