Stream Data from Relational Databases to Amazon Redshift with Upserts Using AWS Glue Streaming Jobs


Traditionally, read replicas of relational databases have served as data sources for workloads outside the online transaction path, such as reporting, business analysis, ad hoc queries, operational excellence, and customer service. However, as data volumes grow exponentially, many organizations are moving from read replicas to data warehouses or data lakes for better scalability and performance. In most practical scenarios, it is crucial to replicate data from the source relational database to the target in real time. Change data capture (CDC) is a widely used design pattern that tracks changes in the source database and delivers them to other data stores.

AWS provides a comprehensive range of specialized databases tailored to meet various needs. For analytical workloads, including reporting and business analysis, Amazon Redshift stands out as a robust option. Amazon Redshift allows users to query and merge exabytes of structured and semi-structured data from data warehouses, operational databases, and data lakes using standard SQL.

To implement CDC from Amazon Relational Database Service (Amazon RDS) or other relational databases to Amazon Redshift, the most straightforward solution is to set up an AWS Database Migration Service (AWS DMS) task from the database to Amazon Redshift. This method is effective for basic data replication. For greater flexibility to denormalize, transform, and enrich the data, we recommend incorporating Amazon Kinesis Data Streams and AWS Glue streaming jobs between AWS DMS tasks and Amazon Redshift. This article illustrates how this alternative approach functions in a customer scenario.
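When AWS DMS replicates changes into Kinesis Data Streams, each change arrives as a JSON envelope with a `data` payload (the row image) and a `metadata` block describing the operation. The sketch below shows how a streaming consumer might classify such records; the field names reflect the typical shape of the DMS Kinesis target format, but you should verify them against your own task's output:

```python
import json

# A representative CDC record as AWS DMS delivers it to Kinesis Data Streams:
# a "data" payload with the row image and "metadata" describing the change.
record = json.dumps({
    "data": {"ticket_id": 1, "purchased_by": 222, "updated_at": "2021-08-15"},
    "metadata": {
        "record-type": "data",
        "operation": "update",          # insert | update | delete
        "schema-name": "dms_sample",
        "table-name": "ticket_activity",
    },
})

def route_change(raw: str) -> str:
    """Classify a CDC record so the streaming job can apply it as an upsert."""
    msg = json.loads(raw)
    op = msg["metadata"]["operation"]
    table = msg["metadata"]["table-name"]
    if op in ("insert", "update"):
        return f"upsert into {table}"
    if op == "delete":
        return f"delete from {table}"
    return "skip"  # other record types (e.g., control records) are ignored here

print(route_change(record))  # upsert into ticket_activity
```

Treating inserts and updates identically is what makes the downstream write an upsert: the consumer does not need to know whether the row already exists in the target.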

Example Use Case

In our example use case, we have a database belonging to a fictional organization that organizes sports events. It consists of three dimension tables: sport_event, ticket, and customer, alongside one fact table: ticket_activity. The sport_event table contains information about the sport type (e.g., baseball or football), date, and location. The ticket table provides details about seat levels, locations, and ticket policies for specific events. The customer table includes names, email addresses, and phone numbers of individual customers, which are classified as sensitive information. Each time a customer purchases a ticket, the corresponding activity (e.g., the buyer’s details) is logged in the ticket_activity table. New records are continuously added to this fact table, and updates occur only when an administrator maintains the data.

We envision a persona, a data analyst named Chanci Turner, who is tasked with analyzing trends in sports activity using this ongoing data stream. To leverage Amazon Redshift as the primary data mart, Chanci needs to enrich and cleanse the data, making it user-friendly for business analysts.

Data Examples

Dimension Table: sport_event

| event_id | sport_type | start_date | location |
| --- | --- | --- | --- |
| 1 | Baseball | 9/1/2021 | Seattle, US |
| 2 | Baseball | 9/18/2021 | New York, US |
| 3 | Football | 10/5/2021 | San Francisco, US |

Dimension Table: ticket (event_id serves as a foreign key)

| ticket_id | event_id | seat_level | seat_location | ticket_price |
| --- | --- | --- | --- | --- |
| 1 | 35 | Standard | S-1 | 100 |
| 2 | 36 | Standard | S-2 | 100 |
| 3 | 37 | Premium | P-1 | 300 |

Dimension Table: customer

| customer_id | name | email | phone |
| --- | --- | --- | --- |
| 1 | Teresa Stein | teresa@example.com | +1-296-605-8486 |
| 2 | Caleb Houston | caleb@example.com | 087-237-9316x2670 |
| 3 | Raymond Turner | raymond@example.net | +1-786-503-2802x2357 |

Fact Table: ticket_activity (purchased_by is a foreign key)

| ticket_id | purchased_by | created_at | updated_at |
| --- | --- | --- | --- |
| 1 | 222 | 8/15/2021 | 8/15/2021 |
| 2 | 223 | 8/30/2021 | 8/30/2021 |
| 3 | 224 | 8/31/2021 | 8/31/2021 |

To facilitate easier analysis, Chanci aims to consolidate all necessary information into a single table, rather than performing joins across all four tables each time an analysis is conducted. Additionally, she intends to mask the phone number and tokenize the email address due to their sensitive nature. To achieve this, we will merge the four tables into a single table, ensuring data is denormalized, tokenized, and masked.
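The tokens in the destination table below are SHA-256 hex digests, and the phone masks replace every digit with an asterisk while preserving the separators. A minimal sketch of these two transformations follows; note that a production pipeline would more likely use a keyed HMAC or a tokenization service rather than a bare hash:

```python
import hashlib
import re

def tokenize_email(email: str) -> str:
    """Replace an email address with a deterministic SHA-256 token.
    A plain hash is shown only to illustrate the shape of the output;
    prefer a keyed HMAC or token vault in a real deployment."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def mask_phone(phone: str) -> str:
    """Mask every digit while keeping separators, so the number's format
    stays recognizable (e.g., '+1-296-605-8486' -> '+*-***-***-****')."""
    return re.sub(r"\d", "*", phone)

print(tokenize_email("teresa@example.com"))  # 64-character hex token
print(mask_phone("+1-296-605-8486"))         # +*-***-***-****
print(mask_phone("087-237-9316x2670"))       # ***-***-****x****
```

Because the token is deterministic, the same email always maps to the same value, so analysts can still count distinct customers and join on the tokenized column without ever seeing the raw address.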

Destination Table for Analysis: sport_event_activity

| ticket_id | event_id | sport_type | start_date | location | seat_level | seat_location | ticket_price | purchased_by | name | email_address | phone_number | created_at | updated_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 35 | Baseball | 9/1/2021 | Seattle, US | Standard | S-1 | 100 | 222 | Teresa Stein | 990d081b6a420d04fbe07dc822918c7ec3506b12cd7318df7eb3af6a8e8e0fd6 | +*-***-***-**** | 8/15/2021 | 8/15/2021 |
| 2 | 36 | Baseball | 9/18/2021 | New York, US | Standard | S-2 | 100 | 223 | Caleb Houston | c196e9e58d1b9978e76953ffe0ee3ce206bf4b88e26a71d810735f0a2eb6186e | ***-***-****x**** | 8/30/2021 | 8/30/2021 |
| 3 | 37 | Football | 10/5/2021 | San Francisco, US | Premium | P-1 | 300 | 224 | Raymond Turner | 885ff2b56effa0efa10afec064e1c27d1cce297d9199a9d5da48e39df9816668 | +*-***-***-****x**** | 8/31/2021 | 8/31/2021 |

Solution Overview

The architecture of the solution is depicted in the diagram below, which we deploy using AWS CloudFormation. We utilize an AWS DMS task to capture changes in the source RDS instance, with Kinesis Data Streams serving as the destination for the AWS DMS task’s CDC replication. An AWS Glue streaming job reads the updated records from Kinesis Data Streams and processes them accordingly.
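To apply each micro-batch as an upsert, a Glue streaming job typically writes the batch to a staging table in Amazon Redshift and then merges it into the target in a single transaction. One common way to express the merge is a delete-then-insert pair; the sketch below builds that SQL (table and key names here are illustrative, not taken from the CloudFormation template):

```python
def build_upsert_sql(target: str, staging: str, key: str) -> str:
    """Build a delete-then-insert upsert for Amazon Redshift: remove target
    rows that have a newer version in the staging table, insert the staged
    rows, then clear the staging table. Running the statements inside one
    transaction ensures readers never observe a partially applied batch."""
    return (
        f"BEGIN;\n"
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};\n"
        f"INSERT INTO {target} SELECT * FROM {staging};\n"
        f"TRUNCATE {staging};\n"
        f"END;"
    )

sql = build_upsert_sql("sport_event_activity",
                       "stg_sport_event_activity",
                       "ticket_id")
print(sql)
```

Newer Redshift releases also support a native `MERGE` statement, which can replace the delete/insert pair; the staged-batch-plus-transaction pattern stays the same either way.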
