
Streamline Disaster Recovery Automation with Amazon Route 53 ARC and AWS Step Functions


Note: For a deeper understanding of Amazon Route 53 Application Recovery Controller (Route 53 ARC), we suggest checking out Part 1 and Part 2 of this series, and exploring the provided examples. These resources illustrate how the ARC service enables you to manage failovers and ensure your application’s recovery readiness.

In this blog post, we outline a strategic approach to automating failover during disaster recovery (DR) events using Amazon Route 53 Application Recovery Controller (Route 53 ARC), AWS Step Functions, AWS Lambda, and Amazon DynamoDB. Many organizations invest considerable effort into managing manual disaster recovery runbook actions during such scenarios. Applications may become unavailable due to various issues, including hardware malfunctions, software defects, or network device complications.

To be both efficient and dependable, the DR runbook process should be practiced regularly and automated to coordinate failover and failback with minimal manual intervention. Additionally, an effective automation strategy is essential to meet Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements. This DR automation solution significantly reduces the need for human involvement, thus minimizing recovery time in the event of a regional disruption. The solution orchestrates the DR runbook processes using the “failover step function” and “failback step function,” implemented in both primary and standby Regions. These example step functions leverage a custom Lambda function alongside global DynamoDB tables to automate the Route 53 ARC Routing Controls, toggling them between on and off states, which is crucial for managing the failover and failback of AWS service entry points.

Cross-Region Recovery with Amazon Route 53 ARC

Amazon Route 53 ARC operates as a global service, comprising a Control Plane and a Data Plane. The Control Plane resides in the us-west-2 (Oregon) Region and handles the creation and deletion of resources within the ARC Cluster, while the Data Plane spans five Regions and provides the core functionality of the service. Specifically, creating or deleting ARC Routing Controls is a Control Plane operation, while updating a Routing Control, that is, changing its on/off state, is a Data Plane operation.

Route 53 ARC’s data plane is designed for extreme reliability, so that applications can fail over during regional impairments. The service maintains the routing control states across a cluster of five regional endpoints. You can interact with any endpoint within the cluster to update a routing control state, and the change is then propagated across all five Regions.

For a robust failover mechanism, manage routing state changes programmatically using the Amazon Route 53 ARC API operations via one of the AWS SDKs, and do not depend on the endpoint of any single Region. As a best practice, we advise selecting a random cluster endpoint for getting or setting routing control states. If a request to one cluster endpoint fails, handle the error gracefully and retry with the next endpoint. This way, routing control states can still be retrieved or updated even if one of the cluster endpoints is unavailable.
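The retry pattern described above can be sketched as follows. This is a minimal illustration and not the code from the sample project: the function names, the `client_factory` parameter, and the `Region`/`Endpoint` keys of each endpoint record are assumptions made for this example. The only AWS call used is `update_routing_control_state` on the boto3 `route53-recovery-cluster` client.

```python
import random
from typing import Any, Callable, Mapping, Sequence


def default_client_factory(endpoint: Mapping[str, str]) -> Any:
    """Create a Route 53 ARC cluster (data plane) client bound to one regional endpoint."""
    import boto3  # imported lazily so the retry logic can be exercised without AWS access

    return boto3.client(
        "route53-recovery-cluster",
        region_name=endpoint["Region"],
        endpoint_url=endpoint["Endpoint"],
    )


def set_routing_control_state(
    endpoints: Sequence[Mapping[str, str]],
    routing_control_arn: str,
    state: str,
    client_factory: Callable[[Mapping[str, str]], Any] = default_client_factory,
) -> str:
    """Set one routing control to "On" or "Off"; return the Region that accepted the update."""
    if state not in ("On", "Off"):
        raise ValueError(f"state must be 'On' or 'Off', got {state!r}")

    # Shuffle a copy so that no single regional endpoint is always tried first.
    candidates = list(endpoints)
    random.shuffle(candidates)

    failures = []
    for endpoint in candidates:
        try:
            client = client_factory(endpoint)
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,
            )
            return endpoint["Region"]
        except Exception as exc:  # broad for this sketch; narrow to botocore errors in real code
            failures.append(f"{endpoint['Region']}: {exc}")

    # Every endpoint was tried once and none accepted the request.
    raise RuntimeError("all cluster endpoints failed: " + "; ".join(failures))
```

Passing the client factory as a parameter keeps the retry loop independent of AWS credentials, which makes it straightforward to rehearse the failure path with a stub client before a real DR exercise.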

It is also crucial to avoid depending on the AWS Console for changes to Route 53 ARC Routing Control states. Consequently, we store the Route 53 ARC regional cluster endpoints, control panel ARN, and the sequence of Routing Controls in global DynamoDB tables. This approach allows for failover and failback sample step functions deployed in any other AWS Regions to access the ARC parameters from the global DynamoDB tables, enabling automation without needing to access the Route 53 ARC AWS console.
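As a rough sketch of how a Lambda function might read these parameters, the example below scans three tables and assembles the result. The table layout and the attribute names (`Region`, `Endpoint`, `ControlPanelArn`, `RoutingControlName`, `Sequence`) are hypothetical and chosen only for illustration; the sample project may model them differently. The table arguments are expected to behave like boto3 DynamoDB `Table` resources, whose `scan` method pages results with `LastEvaluatedKey` and `ExclusiveStartKey`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ArcParameters:
    endpoints: List[Dict[str, str]]  # one record per regional cluster endpoint
    control_panel_arn: str
    routing_control_names: List[str]  # in runbook order


def scan_all(table: Any) -> List[Dict[str, Any]]:
    """Return every item in a table, following DynamoDB scan pagination."""
    items: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        kwargs = {"ExclusiveStartKey": last_key}


def load_arc_parameters(endpoint_table: Any, panel_table: Any, control_table: Any) -> ArcParameters:
    """Assemble the ARC parameters from three tables (attribute names are hypothetical)."""
    endpoints = [
        {"Region": item["Region"], "Endpoint": item["Endpoint"]}
        for item in scan_all(endpoint_table)
    ]
    panels = scan_all(panel_table)
    if len(panels) != 1:
        raise ValueError(f"expected exactly one control panel record, found {len(panels)}")

    # DynamoDB numbers arrive as Decimal, so convert before sorting by sequence.
    controls = sorted(scan_all(control_table), key=lambda item: int(item["Sequence"]))
    return ArcParameters(
        endpoints=endpoints,
        control_panel_arn=panels[0]["ControlPanelArn"],
        routing_control_names=[item["RoutingControlName"] for item in controls],
    )
```

In a Lambda function, the three table objects could be obtained from `boto3.resource("dynamodb")` in the Region where the function runs, so that each Region reads its local replica of the global tables.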

Solution Architecture

Route 53 ARC Control Panel and Routing Controls allow for centralized management of failover or failback across multiple application stack layers, as depicted in the provided diagram. Incorporating Step Functions and Lambda for the auto-management of ARC Routing Control states enables the deployment of the DR runbook failover/failback sequences in a specific order. The runbook should delineate the exact actions and the sequence required during DR events. While these actions may vary based on use cases, preserving the order of Routing Control switch states in a DynamoDB global table is vital.
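To illustrate why the stored order matters, the sketch below turns ordered routing control records into a list of steps and applies them one at a time, stopping at the first failure. The record attributes (`RoutingControlName`, `Region`, `Sequence`) and both function names are assumptions made for this example, not the sample project's schema.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Step = Tuple[str, str]  # (routing control name, target state)


def build_plan(controls: Iterable[Dict[str, Any]], activate_region: str) -> List[Step]:
    """Order controls by their stored sequence and pick a target state for each.

    Controls that belong to `activate_region` are switched "On"; all others "Off".
    """
    ordered = sorted(controls, key=lambda c: int(c["Sequence"]))
    return [
        (c["RoutingControlName"], "On" if c["Region"] == activate_region else "Off")
        for c in ordered
    ]


def run_plan(plan: Iterable[Step], apply_step: Callable[[str, str], None]) -> List[str]:
    """Apply each step in order; an exception from `apply_step` halts the runbook."""
    completed: List[str] = []
    for name, state in plan:
        apply_step(name, state)
        completed.append(name)
    return completed
```

Under these assumptions, failover and failback use the same two functions: failover calls `build_plan` with the standby Region as `activate_region`, and failback calls it with the primary Region.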

We utilize the Amazon Route 53 Application Recovery Controller APIs to list and update Routing Control states. For these API operations, we reference the ARC regional cluster endpoints, control panel ARN, and Routing Control names stored in three distinct DynamoDB global tables. As a best practice, our implementation also includes logic for cycling through Route 53 ARC’s five cluster endpoints, starting from a randomly selected endpoint and moving on to another if a request fails.
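Because the tables hold Routing Control names while state updates are addressed by ARN, one plausible step is to resolve names to ARNs through the list operation. The sketch below assumes a client that behaves like the boto3 `route53-recovery-cluster` client and uses its `list_routing_controls` call; the helper names are hypothetical, and the sample project may resolve ARNs differently.

```python
from typing import Any, Dict, List, Sequence, Tuple


def list_controls(client: Any, control_panel_arn: str) -> List[Dict[str, Any]]:
    """Return all routing controls in a control panel, following NextToken pagination."""
    controls: List[Dict[str, Any]] = []
    token = None
    while True:
        kwargs: Dict[str, Any] = {"ControlPanelArn": control_panel_arn}
        if token:
            kwargs["NextToken"] = token
        page = client.list_routing_controls(**kwargs)
        controls.extend(page.get("RoutingControls", []))
        token = page.get("NextToken")
        if not token:
            return controls


def resolve_controls(
    client: Any, control_panel_arn: str, names: Sequence[str]
) -> List[Tuple[str, str, str]]:
    """Map each name to (name, ARN, current state), preserving the order of `names`."""
    by_name = {c["RoutingControlName"]: c for c in list_controls(client, control_panel_arn)}
    missing = [n for n in names if n not in by_name]
    if missing:
        raise KeyError(f"routing controls not found in control panel: {missing}")
    return [
        (n, by_name[n]["RoutingControlArn"], by_name[n]["RoutingControlState"])
        for n in names
    ]
```

Returning the current state alongside the ARN lets the calling step skip controls that are already in the desired state, which keeps a repeated run of the runbook harmless.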

Additionally, this solution automates the global failover/failback of RDS clusters between the primary and standby Regions through embedded step functions and custom Lambda code.

Prerequisites

In this post, we build upon the multi-Region stack design discussed in the previously mentioned series: Building highly resilient applications using Amazon Route 53 Application Recovery Controller, Part 2: Multi-Region stack. The multi-Region stack design, as illustrated in the subsequent figure, supports an active-standby setup. In this configuration, the primary (active) Region is us-east-1 (N. Virginia), while the recovery (standby) Region is us-west-2 (Oregon).

Utilize the AWS CloudFormation template (infra-stackset) to deploy the multi-Region stack in your AWS account. For setup instructions, refer to the readme file accompanying the template. Next, to implement Route 53 ARC features in the multi-Region stack, deploy the Route 53 ARC stack using a second CloudFormation template (arc-stack). Following this, deploy the dashboard Lambda using the CloudFormation template (lambda-stackset), which sets up a pair of Lambda functions across two AWS Regions to support the intended operational behavior described in step 2. You can access the dashboard app via the DNS name for the Application Load Balancer (ALB) (“arcblog-DashboardLambdaAlb”) from the Amazon Elastic Compute Cloud (Amazon EC2) console.

Example: arcblog-DashboardLambdaAlb-xxxxxx.us-east-1.elb.amazonaws.com

Set Up the DR Automation Stack

As part of this post, we have provided an AWS Cloud Development Kit (AWS CDK) project; use the git repo to deploy the DR Automation stack. The AWS CDK is an open-source framework developed by AWS for defining and provisioning cloud infrastructure resources using familiar programming languages. The repository contains two sample configuration files under the config folder: one for configuring the Amazon Relational Database Service (Amazon RDS) step function stack and another for configuring the two main step function stacks for DR automation. To set up, follow these steps:

  1. Git clone the project.
  2. Execute the commands to deploy the Amazon RDS failover stack in both Regions.
