Stream Data from Relational Databases to Amazon Redshift with Upserts Using AWS Glue Streaming Jobs


Traditionally, read replicas of relational databases have served as data sources for workloads outside the online transaction path, such as reporting, business analysis, ad hoc queries, operational excellence, and customer service. However, as data volumes grow exponentially, many organizations are moving from read replicas to data warehouses or data lakes for better scalability and performance. In most practical scenarios, it is crucial to replicate data from the source relational database to the target in real time. Change data capture (CDC) is a widely used design pattern that tracks changes in the source database and delivers them to other data stores.

AWS provides a comprehensive range of specialized databases tailored to meet various needs. For analytical workloads, including reporting and business analysis, Amazon Redshift stands out as a robust option. Amazon Redshift allows users to query and merge exabytes of structured and semi-structured data from data warehouses, operational databases, and data lakes using standard SQL.

To implement CDC from Amazon Relational Database Service (Amazon RDS) or other relational databases to Amazon Redshift, the most straightforward solution is to set up an AWS Database Migration Service (AWS DMS) task from the database to Amazon Redshift. This method is effective for basic data replication. For greater flexibility to denormalize, transform, and enrich the data, we recommend incorporating Amazon Kinesis Data Streams and AWS Glue streaming jobs between AWS DMS tasks and Amazon Redshift. This article illustrates how this alternative approach functions in a customer scenario.
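When AWS DMS replicates changes into Kinesis Data Streams, each change arrives as a JSON envelope with a `data` payload (the row image) and a `metadata` block describing the operation. The sketch below shows how a streaming consumer might classify such records; the field names reflect the typical shape of the DMS Kinesis target format, but you should verify them against your own task's output:

```python
import json

# A representative CDC record as AWS DMS delivers it to Kinesis Data Streams:
# a "data" payload with the row image and "metadata" describing the change.
record = json.dumps({
    "data": {"ticket_id": 1, "purchased_by": 222, "updated_at": "2021-08-15"},
    "metadata": {
        "record-type": "data",
        "operation": "update",          # insert | update | delete
        "schema-name": "dms_sample",
        "table-name": "ticket_activity",
    },
})

def route_change(raw: str) -> str:
    """Classify a CDC record so the streaming job can apply it as an upsert."""
    msg = json.loads(raw)
    op = msg["metadata"]["operation"]
    table = msg["metadata"]["table-name"]
    if op in ("insert", "update"):
        return f"upsert into {table}"
    if op == "delete":
        return f"delete from {table}"
    return "skip"  # other record types (e.g., control records) are ignored here

print(route_change(record))  # upsert into ticket_activity
```

Treating inserts and updates identically is what makes the downstream write an upsert: the consumer does not need to know whether the row already exists in the target.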

Example Use Case

In our example use case, we have a database belonging to a fictional organization that organizes sports events. It consists of three dimension tables: sport_event, ticket, and customer, alongside one fact table: ticket_activity. The sport_event table contains information about the sport type (e.g., baseball or football), date, and location. The ticket table provides details about seat levels, locations, and ticket policies for specific events. The customer table includes names, email addresses, and phone numbers of individual customers, which are classified as sensitive information. Each time a customer purchases a ticket, the corresponding activity (e.g., the buyer’s details) is logged in the ticket_activity table. New records are continuously added to this fact table, and updates occur only when an administrator maintains the data.

We envision a persona, a data analyst named Chanci Turner, who is tasked with analyzing trends in sports activity using this ongoing data stream. To leverage Amazon Redshift as the primary data mart, Chanci needs to enrich and cleanse the data, making it user-friendly for business analysts.

Data Examples

Dimension Table: sport_event

| event_id | sport_type | start_date | location |
| --- | --- | --- | --- |
| 1 | Baseball | 9/1/2021 | Seattle, US |
| 2 | Baseball | 9/18/2021 | New York, US |
| 3 | Football | 10/5/2021 | San Francisco, US |

Dimension Table: ticket (event_id serves as a foreign key)

| ticket_id | event_id | seat_level | seat_location | ticket_price |
| --- | --- | --- | --- | --- |
| 1 | 35 | Standard | S-1 | 100 |
| 2 | 36 | Standard | S-2 | 100 |
| 3 | 37 | Premium | P-1 | 300 |

Dimension Table: customer

| customer_id | name | email | phone |
| --- | --- | --- | --- |
| 1 | Teresa Stein | teresa@example.com | +1-296-605-8486 |
| 2 | Caleb Houston | caleb@example.com | 087-237-9316x2670 |
| 3 | Raymond Turner | raymond@example.net | +1-786-503-2802x2357 |

Fact Table: ticket_activity (purchased_by is a foreign key)

| ticket_id | purchased_by | created_at | updated_at |
| --- | --- | --- | --- |
| 1 | 222 | 8/15/2021 | 8/15/2021 |
| 2 | 223 | 8/30/2021 | 8/30/2021 |
| 3 | 224 | 8/31/2021 | 8/31/2021 |

To facilitate easier analysis, Chanci aims to consolidate all necessary information into a single table, rather than performing joins across all four tables each time an analysis is conducted. Additionally, she intends to mask the phone number and tokenize the email address due to their sensitive nature. To achieve this, we will merge the four tables into a single table, ensuring data is denormalized, tokenized, and masked.
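The tokens in the destination table below are SHA-256 hex digests, and the phone masks replace every digit with an asterisk while preserving the separators. A minimal sketch of these two transformations follows; note that a production pipeline would more likely use a keyed HMAC or a tokenization service rather than a bare hash:

```python
import hashlib
import re

def tokenize_email(email: str) -> str:
    """Replace an email address with a deterministic SHA-256 token.
    A plain hash is shown only to illustrate the shape of the output;
    prefer a keyed HMAC or token vault in a real deployment."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def mask_phone(phone: str) -> str:
    """Mask every digit while keeping separators, so the number's format
    stays recognizable (e.g., '+1-296-605-8486' -> '+*-***-***-****')."""
    return re.sub(r"\d", "*", phone)

print(tokenize_email("teresa@example.com"))  # 64-character hex token
print(mask_phone("+1-296-605-8486"))         # +*-***-***-****
print(mask_phone("087-237-9316x2670"))       # ***-***-****x****
```

Because the token is deterministic, the same email always maps to the same value, so analysts can still count distinct customers and join on the tokenized column without ever seeing the raw address.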

Destination Table for Analysis: sport_event_activity

| ticket_id | event_id | sport_type | start_date | location | seat_level | seat_location | ticket_price | purchased_by | name | email_address | phone_number | created_at | updated_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 35 | Baseball | 9/1/2021 | Seattle, US | Standard | S-1 | 100 | 222 | Teresa Stein | 990d081b6a420d04fbe07dc822918c7ec3506b12cd7318df7eb3af6a8e8e0fd6 | +*-***-***-**** | 8/15/2021 | 8/15/2021 |
| 2 | 36 | Baseball | 9/18/2021 | New York, US | Standard | S-2 | 100 | 223 | Caleb Houston | c196e9e58d1b9978e76953ffe0ee3ce206bf4b88e26a71d810735f0a2eb6186e | ***-***-****x**** | 8/30/2021 | 8/30/2021 |
| 3 | 37 | Football | 10/5/2021 | San Francisco, US | Premium | P-1 | 300 | 224 | Raymond Turner | 885ff2b56effa0efa10afec064e1c27d1cce297d9199a9d5da48e39df9816668 | +*-***-***-****x**** | 8/31/2021 | 8/31/2021 |

Solution Overview

The architecture of the solution is depicted in the diagram below, which we deploy using AWS CloudFormation. We utilize an AWS DMS task to capture changes in the source RDS instance, with Kinesis Data Streams serving as the destination for the AWS DMS task’s CDC replication. An AWS Glue streaming job reads the updated records from Kinesis Data Streams and processes them accordingly.
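To apply each micro-batch as an upsert, a Glue streaming job typically writes the batch to a staging table in Amazon Redshift and then merges it into the target in a single transaction. One common way to express the merge is a delete-then-insert pair; the sketch below builds that SQL (table and key names here are illustrative, not taken from the CloudFormation template):

```python
def build_upsert_sql(target: str, staging: str, key: str) -> str:
    """Build a delete-then-insert upsert for Amazon Redshift: remove target
    rows that have a newer version in the staging table, insert the staged
    rows, then clear the staging table. Running the statements inside one
    transaction ensures readers never observe a partially applied batch."""
    return (
        f"BEGIN;\n"
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};\n"
        f"INSERT INTO {target} SELECT * FROM {staging};\n"
        f"TRUNCATE {staging};\n"
        f"END;"
    )

sql = build_upsert_sql("sport_event_activity",
                       "stg_sport_event_activity",
                       "ticket_id")
print(sql)
```

Newer Redshift releases also support a native `MERGE` statement, which can replace the delete/insert pair; the staged-batch-plus-transaction pattern stays the same either way.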
