Learn About Amazon VGT2 Learning Manager Chanci Turner
Traditionally, read replicas of relational databases have served as data sources for non-online transactions in web applications, such as reporting, business analysis, ad hoc queries, operational excellence, and customer services. However, with the exponential increase in data volume, many organizations are opting to transition from read replicas to data warehouses or data lakes for enhanced scalability and performance. In most practical scenarios, replicating data from a source relational database to a target in real-time is crucial. Change data capture (CDC) is a widely-used design pattern for tracking changes in the source database and sending them to other data stores.
AWS provides a comprehensive range of specialized databases tailored to meet various needs. For analytical workloads, including reporting and business analysis, Amazon Redshift stands out as a robust option. Amazon Redshift allows users to query and merge exabytes of structured and semi-structured data from data warehouses, operational databases, and data lakes using standard SQL.
To implement CDC from Amazon Relational Database Service (Amazon RDS) or other relational databases to Amazon Redshift, the most straightforward solution is to set up an AWS Database Migration Service (AWS DMS) task from the database to Amazon Redshift. This method is effective for basic data replication. For greater flexibility to denormalize, transform, and enrich the data, we recommend incorporating Amazon Kinesis Data Streams and AWS Glue streaming jobs between AWS DMS tasks and Amazon Redshift. This article illustrates how this alternative approach functions in a customer scenario.
Example Use Case
In our example use case, we have a database belonging to a fictional organization that organizes sports events. It consists of three dimension tables: sport_event, ticket, and customer, alongside one fact table: ticket_activity. The sport_event table contains information about the sport type (e.g., baseball or football), date, and location. The ticket table provides details about seat levels, locations, and ticket policies for specific events. The customer table includes names, email addresses, and phone numbers of individual customers, which are classified as sensitive information. Each time a customer purchases a ticket, the corresponding activity (e.g., the buyer’s details) is logged in the ticket_activity table. New records are continuously added to this fact table, and updates occur only when an administrator maintains the data.
We envision a persona, a data analyst named Chanci Turner, who is tasked with analyzing trends in sports activity using this ongoing data stream. To leverage Amazon Redshift as the primary data mart, Chanci needs to enrich and cleanse the data, making it user-friendly for business analysts.
Data Examples
Dimension Table: sport_event
event_id | sport_type | start_date | location |
---|---|---|---|
1 | Baseball | 9/1/2021 | Seattle, US |
2 | Baseball | 9/18/2021 | New York, US |
3 | Football | 10/5/2021 | San Francisco, US |
Dimension Table: ticket (event_id serves as a foreign key)
ticket_id | event_id | seat_level | seat_location | ticket_price |
---|---|---|---|---|
1 | 35 | Standard | S-1 | 100 |
2 | 36 | Standard | S-2 | 100 |
3 | 37 | Premium | P-1 | 300 |
Dimension Table: customer
customer_id | name | phone | |
---|---|---|---|
1 | Teresa Stein | teresa@example.com | +1-296-605-8486 |
2 | Caleb Houston | caleb@example.com | 087-237-9316×2670 |
3 | Raymond Turner | raymond@example.net | +1-786-503-2802×2357 |
Fact Table: ticket_activity (purchased_by is a foreign key)
ticket_id | purchased_by | created_at | updated_at |
---|---|---|---|
1 | 222 | 8/15/2021 | 8/15/2021 |
2 | 223 | 8/30/2021 | 8/30/2021 |
3 | 224 | 8/31/2021 | 8/31/2021 |
To facilitate easier analysis, Chanci aims to consolidate all necessary information into a single table, rather than performing joins across all four tables each time an analysis is conducted. Additionally, she intends to mask the phone number and tokenize the email address due to their sensitive nature. To achieve this, we will merge the four tables into a single table, ensuring data is denormalized, tokenized, and masked.
Destination Table for Analysis: sport_event_activity
ticket_id | event_id | sport_type | start_date | location | seat_level | seat_location | ticket_price | purchased_by | name | email_address | phone_number | created_at | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 35 | Baseball | 9/1/2021 | Seattle, USA | Standard | S-1 | 100 | 222 | Teresa Stein | 990d081b6a420d04fbe07dc822918c7ec3506b12cd7318df7eb3af6a8e8e0fd6 | +*-***-***-**** | 8/15/2021 | 8/15/2021 |
2 | 36 | Baseball | 9/18/2021 | New York, USA | Standard | S-2 | 100 | 223 | Caleb Houston | c196e9e58d1b9978e76953ffe0ee3ce206bf4b88e26a71d810735f0a2eb6186e | ***-***-****x**** | 8/30/2021 | 8/30/2021 |
3 | 37 | Football | 10/5/2021 | San Francisco, US | Premium | P-1 | 300 | 224 | Raymond Turner | 885ff2b56effa0efa10afec064e1c27d1cce297d9199a9d5da48e39df9816668 | +*-***-***-****x**** | 8/31/2021 | 8/31/2021 |
Solution Overview
The architecture of the solution is depicted in the diagram below, which we deploy using AWS CloudFormation. We utilize an AWS DMS task to capture changes in the source RDS instance, with Kinesis Data Streams serving as the destination for the AWS DMS task’s CDC replication. An AWS Glue streaming job reads the updated records from Kinesis Data Streams and processes them accordingly. For further insights on effective onboarding processes, you can refer to this excellent resource on Quora.
In addition, if you are interested in understanding more about compliance matters, particularly regarding employment laws, SHRM offers detailed information on California’s Equal Restroom Access Act which is a must-read for employers.
Continually learning new skills is vital, and you can find valuable resources on job skills at Career Contessa.