Automating Data Change Capture During the Restoration of an Amazon DynamoDB Table

Amazon DynamoDB is a fully managed, serverless NoSQL database service that delivers low-latency performance at scale. The point-in-time recovery (PITR) feature safeguards against data loss, but restoring a table can be complex, particularly in production environments. Manual tasks such as identifying the restore point, rerouting write operations, and adjusting application settings introduce risk and potential downtime, which may be unacceptable for critical applications.

This post is the first in a series on table restoration and data integrity. Here, we introduce a solution that automates the PITR restoration process and captures the data changes that happen during the restoration, so that you can transition to the restored DynamoDB table with minimal downtime and little impact on your application.

Advantages of PITR

Demand for data reliability, quick recovery, and minimal downtime spans many sectors. Automating the PITR restoration steps reduces service interruptions and manual error, which helps organizations address data issues quickly, preserve data integrity and business continuity, and maintain user trust.

Alternatives to PITR

Other data modeling strategies, such as versioning and optimistic locking, can help ensure that table items reference the correct metadata version, which limits the fallout from a faulty deployment. With version numbers, you can retain previous metadata for a defined period. If an erroneous application deployment occurs, you need to identify the impacted items, determine the correct metadata, and update the current values. However, if the deployment altered several versions of the same item, determining the correct version can be difficult. With versions based on date and time, the solution could be straightforward, but what if you use numeric identifiers or hashes for version control?
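As a minimal sketch of the optimistic locking idea above, the following Python helper builds the arguments for a conditional update that succeeds only if the stored version still matches the version the caller read. The attribute names `metadata` and `version` and the function name are assumptions made for illustration; they do not come from the solution described in this post.

```python
from typing import Any, Dict


def build_versioned_update(key: Dict[str, Any],
                           new_metadata: Dict[str, Any],
                           expected_version: int) -> Dict[str, Any]:
    """Build keyword arguments for a boto3 Table.update_item call.

    The attribute names "metadata" and "version" are hypothetical.
    """
    return {
        "Key": key,
        # Replace the metadata and bump the version in one atomic write.
        "UpdateExpression": "SET #meta = :meta, #ver = :next",
        # Reject the write if another writer already changed the version.
        "ConditionExpression": "#ver = :expected",
        "ExpressionAttributeNames": {"#meta": "metadata", "#ver": "version"},
        "ExpressionAttributeValues": {
            ":meta": new_metadata,
            ":expected": expected_version,
            ":next": expected_version + 1,
        },
    }
```

You could pass the result to `table.update_item(**kwargs)` on a boto3 `Table` resource. If the condition fails, boto3 raises a `ClientError` with the code `ConditionalCheckFailedException`, which signals that the item changed since it was read.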

Incremental export to Amazon S3 is another viable alternative. Once you pinpoint the time of the erroneous deployment, you can export only the DynamoDB data that changed from that point onward to Amazon S3. You can then run a custom diagnostic script that identifies the incorrect items and restores their previous values in the live DynamoDB table. This method is efficient because it analyzes only a fraction of your table data.
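The incremental export described above could be requested along the following lines. This is a sketch, not part of the solution: the helper name is hypothetical, and the 15-minute minimum and 24-hour maximum export window reflect the documented limits as we understand them, so verify them against the current DynamoDB documentation. The `NEW_AND_OLD_IMAGES` view type is chosen here so that a diagnostic script can see each item's previous value.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

# Assumed limits for an incremental export window; check the current docs.
MIN_WINDOW = timedelta(minutes=15)
MAX_WINDOW = timedelta(hours=24)


def build_incremental_export(table_arn: str, bucket: str,
                             export_from: datetime,
                             export_to: datetime) -> Dict[str, Any]:
    """Build keyword arguments for the DynamoDB export_table_to_point_in_time call."""
    window = export_to - export_from
    if not MIN_WINDOW <= window <= MAX_WINDOW:
        raise ValueError(f"export window {window} is outside the allowed range")
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "ExportFormat": "DYNAMODB_JSON",
        "ExportType": "INCREMENTAL_EXPORT",
        "IncrementalExportSpecification": {
            "ExportFromTime": export_from,
            "ExportToTime": export_to,
            # Include old images so previous values can be recovered.
            "ExportViewType": "NEW_AND_OLD_IMAGES",
        },
    }
```

The result can be passed to `boto3.client("dynamodb").export_table_to_point_in_time(**kwargs)`. If the period since the faulty deployment exceeds the maximum window, you would need to split it into several consecutive exports.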

Industries that can greatly benefit from automated PITR solutions include:

E-commerce: Frequently updated product catalogs and promotional features necessitate a reliable fallback to revert changes without losing recent customer transactions. In case a restoration is needed, the entire system can be reverted to the last known functional state.
Content Management Systems: Rapid deployment cycles to meet content demand can sometimes introduce bugs that corrupt data. An automated PITR solution can quickly rectify this without losing new content. For further insights on how media and entertainment sectors utilize DynamoDB for content management systems, check out this blog.
IoT Data Collection Systems: Continuous data collection is critical, but errors in data processing must be rectified swiftly without disrupting the flow of new, accurate data.

When faced with a PITR restoration, engineers often grapple with essential questions to define requirements and challenges, such as: What happens to the data being written to the table during restoration? Is there a method to update data changed during the restore? Can we reduce downtime and keep the system operational during the restoration?

The following diagram illustrates the common challenges encountered in a production DynamoDB environment.

The key events depicted are:

Initial State: The application writes correct data to the DynamoDB table, functioning as intended.
Issue Introduction: The deployment of a new application version introduces unintended data corruption or other issues. Factors such as software bugs or schema changes can compromise data integrity.
Troubleshooting Period: The team recognizes the data issues and begins troubleshooting. Meanwhile, the application continues to write flawed data to the table.
Restore Decision: After thorough analysis, the team concludes that restoring the DynamoDB table to a known good state using the PITR feature is the best course of action.
PITR Restore Process: The team initiates the PITR restoration to revert the table to a specific point in time prior to the data issues.
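To make the final step of this timeline concrete, the sketch below builds the request for a PITR restore. Note that a PITR restore always creates a new table rather than overwriting the source, which is why write traffic has to be rerouted afterward. The helper name is hypothetical; the parameter names are those of the DynamoDB `RestoreTableToPointInTime` API.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_restore_request(source_table: str, target_table: str,
                          restore_time: Optional[datetime] = None) -> Dict[str, Any]:
    """Build keyword arguments for the DynamoDB restore_table_to_point_in_time call."""
    if source_table == target_table:
        raise ValueError("PITR restores into a new table; choose a different target name")
    request: Dict[str, Any] = {
        "SourceTableName": source_table,
        "TargetTableName": target_table,
    }
    if restore_time is None:
        # No explicit time: restore to the most recent restorable point.
        request["UseLatestRestorableTime"] = True
    else:
        if restore_time.tzinfo is None:
            raise ValueError("restore_time must be timezone-aware")
        request["RestoreDateTime"] = restore_time
    return request
```

The result can be passed to `boto3.client("dynamodb").restore_table_to_point_in_time(**request)`. Requiring a timezone-aware timestamp is a design choice in this sketch to avoid restoring to the wrong instant because of local-time ambiguity.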

The PITR restoration solves one problem but introduces another: what happens to the data written to the table while the restoration is running? The team must capture any changes made during the PITR restoration and apply them afterward, so that the transition to the restored DynamoDB table preserves data consistency and avoids data loss.
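One way to picture capturing and integrating those changes is to translate DynamoDB Streams records into writes against the restored table. The sketch below is an illustration under assumptions, not the implementation used by this solution: the function name is hypothetical, and it assumes the stream view type includes new images (`NEW_IMAGE` or `NEW_AND_OLD_IMAGES`).

```python
from typing import Any, Dict, List, Tuple


def to_replay_operations(records: List[Dict[str, Any]],
                         target_table: str) -> List[Tuple[str, Dict[str, Any]]]:
    """Translate DynamoDB Streams records into low-level client calls.

    Returns (method_name, kwargs) pairs in the order the records arrived.
    """
    operations: List[Tuple[str, Dict[str, Any]]] = []
    for record in records:
        change = record["dynamodb"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            # The new image is already in DynamoDB JSON, as put_item expects.
            operations.append(
                ("put_item", {"TableName": target_table, "Item": change["NewImage"]})
            )
        elif record["eventName"] == "REMOVE":
            operations.append(
                ("delete_item", {"TableName": target_table, "Key": change["Keys"]})
            )
    return operations
```

Each pair could then be applied with `getattr(client, name)(**kwargs)` on a boto3 DynamoDB client. Preserving record order matters because a later change to the same key must win over an earlier one.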

The next section presents a solution that automates the PITR restoration process and addresses data changes during restoration, helping you minimize downtime and uphold data consistency.

Prerequisites

Setting Up Your Local Environment to Deploy the Solution

This solution uses AWS CloudTrail management events to detect PITR restore events and trigger the automation. Ensure that CloudTrail management events are enabled in your target account, as illustrated in the accompanying screenshot.
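As an illustration of how a restore could be detected from CloudTrail management events, the following sketch shows an Amazon EventBridge event pattern for the `RestoreTableToPointInTime` API call. Using EventBridge here is an assumption; this section does not show the rule that the solution actually deploys. The `matches` function is a simplified local stand-in for exact-value pattern matching, intended only for experimenting with the pattern.

```python
from typing import Any, Dict

# Event pattern for PITR restore calls recorded by CloudTrail.
RESTORE_EVENT_PATTERN: Dict[str, Any] = {
    "source": ["aws.dynamodb"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["dynamodb.amazonaws.com"],
        "eventName": ["RestoreTableToPointInTime"],
    },
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: every pattern field must be present in the event,
    and its value must be one of the listed values. Nested dicts recurse."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        actual = event[field]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Fields that appear in the event but not in the pattern are ignored, which mirrors how event patterns only constrain the fields they name.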

DynamoDB Table, PITR, and DynamoDB Streams

Confirm that PITR is enabled on the DynamoDB table you wish to restore. Once PITR is enabled, you can restore to any point between the EarliestRestorableDateTime and the LatestRestorableDateTime; the latter is typically five minutes before the current time. You must also enable DynamoDB Streams on the table for change data capture (CDC). After enabling DynamoDB Streams, copy the stream ARN, because it is needed as a deployment parameter.
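These prerequisites can also be checked programmatically. The sketch below inspects the responses of the DynamoDB `DescribeContinuousBackups` and `DescribeTable` API calls; the helper names are hypothetical, and the functions take the response dictionaries as input so they can be tried without AWS credentials.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Tuple


def restore_window(backups_response: Dict[str, Any]) -> Tuple[datetime, datetime]:
    """Return (earliest, latest) restorable times from a describe_continuous_backups response."""
    pitr = backups_response["ContinuousBackupsDescription"].get(
        "PointInTimeRecoveryDescription", {}
    )
    if pitr.get("PointInTimeRecoveryStatus") != "ENABLED":
        raise ValueError("PITR is not enabled on this table")
    return pitr["EarliestRestorableDateTime"], pitr["LatestRestorableDateTime"]


def is_restorable(when: datetime, backups_response: Dict[str, Any]) -> bool:
    """Check whether a timestamp falls inside the restorable window."""
    earliest, latest = restore_window(backups_response)
    return earliest <= when <= latest


def stream_arn(table_response: Dict[str, Any]) -> str:
    """Return the stream ARN from a describe_table response, or fail if streams are off."""
    table = table_response["Table"]
    if not table.get("StreamSpecification", {}).get("StreamEnabled"):
        raise ValueError("DynamoDB Streams is not enabled on this table")
    return table["LatestStreamArn"]
```

With boto3, the inputs would come from `client.describe_continuous_backups(TableName=...)` and `client.describe_table(TableName=...)`. The value returned by `stream_arn` is the ARN to supply as the stream deployment parameter.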

AWS CDK

To deploy the solution, run the following commands, which use an AWS Cloud Development Kit (AWS CDK) stack to set up, prepare, and deploy the components:

cdk bootstrap
cdk synth -c table-name=<insert table name here> -c table-streams-arn=<ddb streams arn here>
cdk deploy -c table-name=<insert table name here> -c table-streams-arn=<ddb streams arn here> --qualifier final

Solution Overview

In this post, we demonstrate how to automate many of the manual tasks and replicate the current data to the newly restored table. The following diagram outlines the solution architecture.

The workflow consists of these steps:

The source table handles live traffic.
The system administrator decides to restore the table to a specific point in time. This decision carries significant implications for data integrity and operational continuity.
