Navigating Tables Without Primary Keys in Amazon Aurora PostgreSQL Zero-ETL Integrations with Amazon Redshift

Chanci Turner

At Amazon Web Services (AWS), we have been diligently working towards realizing our vision of zero-extract, transform, and load (ETL) processes. With the Amazon Aurora zero-ETL integration with Amazon Redshift, organizations can combine transactional data from Amazon Aurora with the analytical power of Amazon Redshift. This integration facilitates comprehensive insights across applications, dismantles data silos, and leads to substantial cost efficiencies and operational improvements. During AWS re:Invent 2023, we introduced four new zero-ETL integrations designed to enhance data accessibility and analysis across different data stores.

Today, businesses across diverse sectors are eager to boost revenue and enhance customer engagement through nearly real-time analytics applications, such as fraud detection, inventory tracking, and personalized marketing strategies. The zero-ETL integrations empower users to unlock these valuable use cases.

As of now, the Aurora zero-ETL integration with Amazon Redshift is generally available for Amazon Aurora MySQL, in public preview for Amazon Aurora PostgreSQL and Amazon RDS for MySQL, and in private preview for Amazon DynamoDB. For more details, refer to the guide on working with Aurora zero-ETL integrations with Amazon Redshift. The figure below illustrates the currently available zero-ETL integrations.

In this article, we will discuss how to manage tables that lack primary keys while establishing Amazon Aurora PostgreSQL zero-ETL integrations with Amazon Redshift. A similar approach for Aurora MySQL and RDS for MySQL can be found in our article on managing tables without primary keys in those systems.

Challenges of Tables Lacking Primary Keys

A primary key serves as a unique identifier for each record (row) in a table. It consists of one or more columns that cannot be null, and the combination of these column values must be unique throughout the table. Defining a primary key offers significant advantages in relational databases, such as better data organization and improved query performance utilizing indexes. Moreover, a primary key simplifies the process of consuming database change logs by linking each change log event to the corresponding row, which is crucial for supporting zero-ETL integrations.

For PostgreSQL sources, zero-ETL integrations rely on logical replication as a foundational element. To ensure successful integration, it is vital that Aurora PostgreSQL tables designate primary keys as a replica identity, enabling the correct synchronization of rows with Amazon Redshift during update and delete transactions. Consequently, Aurora PostgreSQL tables without primary keys cannot be replicated to the Amazon Redshift target data warehouse. For additional information, see the prerequisites and limitations for zero-ETL integrations.
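
To see which tables in your source database would be affected, you can query the PostgreSQL system catalogs for tables that have no primary key constraint. The following is a sketch (the schema filter is an assumption you may want to adjust for your environment):

```sql
-- List user tables that have no primary key constraint and
-- therefore cannot be replicated by the zero-ETL integration.
SELECT n.nspname AS schema_name,
       c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
        SELECT 1
        FROM pg_constraint con
        WHERE con.conrelid = c.oid
          AND con.contype = 'p'   -- 'p' = primary key constraint
      );
```

Running this before you activate an integration lets you remediate tables proactively rather than reacting to replication failures.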

Although it is important to incorporate primary keys during schema design, there are instances where an Aurora PostgreSQL table may not have one. In this post, we will outline strategies for handling such tables.

Prerequisites

Before implementing the solutions, ensure that you have the following prerequisites:

  • An Aurora PostgreSQL (preview) provisioned or serverless cluster.
  • An Amazon Redshift provisioned or Redshift Serverless data warehouse.
  • An active Aurora PostgreSQL zero-ETL integration with Amazon Redshift (preview).

Solution Overview

If you have an active zero-ETL integration from Aurora PostgreSQL (public preview) to Amazon Redshift, tables without primary keys will fail to replicate to Amazon Redshift. The monitoring dashboard in Amazon Redshift will show their status as “Failed,” and the overall zero-ETL integration status will change to “Needs attention.”
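
You can also inspect per-table replication status directly from the target data warehouse. The sketch below assumes the SVV_INTEGRATION_TABLE_STATE system view described in the Amazon Redshift documentation; verify the view and column names against your Redshift version before relying on it:

```sql
-- Run in the target Amazon Redshift data warehouse to find tables
-- that are not replicating successfully (view/column names per the
-- Redshift system-view documentation; state values may vary).
SELECT integration_id,
       table_name,
       table_state
FROM svv_integration_table_state
WHERE table_state <> 'Synced';
```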

In such cases, consider the following potential solutions:

1. Assess Existing Columns or Indexes for Primary Key Candidates

The most straightforward approach is to identify a unique or composite natural key that can serve as a primary key in your source Aurora PostgreSQL tables. If you locate one, determine which columns can become the primary key. Keep in mind that when a primary key is applied to a partitioned table, all columns in the partition key must be included in the primary key to enforce unique constraints. Here are some methods to identify potential primary keys:

  • Look for unique indexes within the table, as these can be good candidates for primary keys.
  • If no unique index is available, consult the pg_stats catalog view to check if the n_distinct column can help identify a unique column.

Once you identify a primary key, execute the ALTER TABLE command to establish it as shown below:

ALTER TABLE <table_name> ADD PRIMARY KEY (column_1, column_2, ...);
-- or 
ALTER TABLE <table_name> ADD CONSTRAINT <constraint_name> PRIMARY KEY (column_1, column_2, ...);

Altering large PostgreSQL tables to add a primary key may block all queries on that table during index creation. The duration of these blocks can be unpredictable and may affect overall workload. Adding a primary key directly is advisable only if database administrators are confident that the workload won’t be impacted or if the operation is planned outside business hours.

To minimize blocking duration, first create a unique index concurrently on the specified column, then add a primary key constraint on that unique index:

CREATE UNIQUE INDEX CONCURRENTLY <unique_index_name> ON <table_name>(<column_names>);
ALTER TABLE <table_name> ADD CONSTRAINT <constraint_name> PRIMARY KEY USING INDEX <unique_index_name>;

Establishing a primary key constraint on an existing unique index is typically faster and less disruptive, making it an excellent option for Aurora and Amazon RDS databases serving production workloads with high read/write activity.

Note: It is advisable to test these commands in a non-production environment prior to executing them in a live setup to assess their performance and behavior.

2. Introduce a Synthetic Primary Key

If no existing columns can effectively serve as a primary key, you can create a synthetic column populated with a sequence number. This process, which involves adding a sequence number to an existing table and backfilling missing key values, requires careful planning, especially in production environments.

2.1 Adding an Identity Column to an Empty (or Small) Table

An identity column, introduced in PostgreSQL version 10, acts as a number generator and is a SQL Standard-compliant alternative to serial columns. Identity columns come with an implicit sequence that automatically assigns values from the sequence to new rows. Choose the column’s integer type judiciously, making sure it is wide enough (for example, bigint) for the table’s expected growth.
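
On an empty or small table, a single DDL statement can add the identity column and promote it to the primary key. The table and column names below are hypothetical placeholders; on a non-empty table, PostgreSQL backfills existing rows from the implicit sequence, which rewrites the table and may take time and locks on larger tables:

```sql
-- Add a synthetic bigint identity column and make it the primary key.
-- Existing rows are backfilled from the implicit sequence.
ALTER TABLE my_table                  -- hypothetical table name
  ADD COLUMN row_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY;
```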

