Set Up Amazon Redshift Streaming Ingestion from Amazon MSK

In late 2022, AWS launched real-time streaming ingestion for Amazon Redshift from Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). With this capability, you no longer need to stage streaming data in Amazon Simple Storage Service (Amazon S3) before ingesting it into Amazon Redshift.

Streaming ingestion from Amazon MSK to Amazon Redshift is a significant step forward for real-time data processing and analytics. Amazon MSK is a fully managed, highly scalable service for Apache Kafka that makes it straightforward to collect and process large data streams. Streaming that data directly into Amazon Redshift lets organizations benefit from real-time analytics and make data-driven decisions sooner.

This integration delivers low latency, typically measured in seconds, while ingesting hundreds of megabytes of streaming data per second into Amazon Redshift, so the latest information is readily available for analysis. Because the data never needs to be staged in Amazon S3, Amazon Redshift ingests it with lower latency and without incurring intermediary storage costs.

You configure Amazon Redshift for streaming ingestion with SQL statements that authenticate to the MSK cluster and connect to an MSK topic. This approach is particularly useful for data engineers who want to simplify their data pipelines and reduce operational costs.
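For example, here is a minimal sketch of the authentication and connection step, assuming a hypothetical schema name and placeholder ARNs (substitute the IAM role and MSK cluster ARNs from your own account):

    -- Map a schema in Amazon Redshift to the MSK cluster, authenticating
    -- with the IAM role attached to the Redshift cluster.
    -- msk_external_schema and both ARNs are placeholders.
    CREATE EXTERNAL SCHEMA msk_external_schema
    FROM MSK
    IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-role'
    AUTHENTICATION iam
    CLUSTER_ARN '<your-msk-cluster-arn>';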

In this post, we provide a step-by-step guide to setting up Amazon Redshift streaming ingestion from Amazon MSK.

Solution Overview

The architecture diagram below outlines the AWS services and features involved in this process.

The workflow consists of the following steps:

  1. Configure an Amazon MSK Connect source connector that creates an MSK topic and writes mock customer data to it for this demonstration.
  2. Connect to the Redshift cluster using the Query Editor v2.
  3. Create an external schema and a materialized view in Amazon Redshift to consume the data from the MSK topic (see the SQL sketch after this list). Note that this solution does not rely on an MSK Connect sink connector to move data from Amazon MSK to Amazon Redshift.
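As a preview of step 3, here is a minimal sketch of the materialized view, assuming the topic is named customer and that the external schema from the earlier example is in place (both names are placeholders):

    -- Materialized view over the MSK topic; each refresh pulls new records.
    CREATE MATERIALIZED VIEW customer_mv AUTO REFRESH YES AS
    SELECT
        kafka_partition,
        kafka_offset,
        refresh_time,
        JSON_PARSE(kafka_value) AS customer_data  -- parse the payload into SUPER
    FROM msk_external_schema."customer"
    WHERE CAN_JSON_PARSE(kafka_value);

With AUTO REFRESH YES, Amazon Redshift keeps the view current as new records arrive on the topic; you can also refresh it on demand with REFRESH MATERIALIZED VIEW.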

The architecture diagram also details how these AWS services are configured and integrated. The setup includes the following components:

  • An MSK Connect source connector, an MSK cluster, and a Redshift cluster deployed within private subnets of a VPC.
  • The MSK Connect source connector utilizes specific permissions defined in an AWS Identity and Access Management (IAM) in-line policy attached to an IAM role, granting it the ability to perform actions on the MSK cluster.
  • Logs from the MSK Connect source connector are captured and sent to an Amazon CloudWatch log group.
  • The MSK cluster requires a custom configuration that permits the MSK Connect connector to create topics.
  • Logs from the MSK cluster are also sent to an Amazon CloudWatch log group.
  • The Redshift cluster employs granular permissions defined in an IAM in-line policy attached to an IAM role, allowing it to perform actions on the MSK cluster (see the policy sketch after this list).
  • You can connect to the Redshift cluster using the Query Editor v2.
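To illustrate the granular permissions mentioned above, a policy for the Redshift cluster's IAM role might look roughly like the following. The action list reflects the kafka-cluster permissions commonly required for IAM-authenticated reads; treat it as a sketch, and scope the Resource element to your own cluster, topic, and group ARNs rather than using a wildcard:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "RedshiftMskStreamingIngestionSketch",
          "Effect": "Allow",
          "Action": [
            "kafka-cluster:Connect",
            "kafka-cluster:DescribeTopic",
            "kafka-cluster:ReadData",
            "kafka-cluster:DescribeGroup",
            "kafka-cluster:AlterGroup"
          ],
          "Resource": "*"
        }
      ]
    }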

Prerequisites

To simplify provisioning and configuring the necessary resources, you can use the provided AWS CloudFormation template.

When launching the stack, follow these steps:

  • Enter a meaningful name for the stack, such as “prerequisites.”
  • Click Next.
  • Click Next again.
  • Select “I acknowledge that AWS CloudFormation might create IAM resources with custom names.”
  • Click Submit.

The CloudFormation stack will create the following resources:

  • A VPC named custom-vpc, spanning three Availability Zones and comprising three public and three private subnets.
  • The public subnets will be associated with a public route table, directing outbound traffic to an internet gateway.
  • The private subnets will connect to a private route table, routing outbound traffic to a NAT gateway.
  • An internet gateway linked to the Amazon VPC.
  • A NAT gateway with an elastic IP, positioned in one of the public subnets.
  • Three security groups:
    • msk-connect-sg, designated for the MSK Connect connector.
    • redshift-sg, allocated for the Redshift cluster.
    • msk-cluster-sg, associated with the MSK cluster to allow inbound traffic from both msk-connect-sg and redshift-sg.
  • Two CloudWatch log groups:
    • msk-connect-logs for MSK Connect logs.
    • msk-cluster-logs for MSK cluster logs.
  • Two IAM Roles:
    • msk-connect-role, featuring specific IAM permissions for MSK Connect.
    • redshift-role, with permissions for Amazon Redshift.
  • A custom MSK cluster configuration that enables the MSK Connect connector to create topics on the MSK cluster (see the configuration sketch after this list).
  • An MSK cluster with three brokers deployed across three private subnets in custom-vpc, with appropriate security group and configuration applied.
  • A Redshift cluster subnet group utilizing the three private subnets of custom-vpc.
  • A Redshift cluster featuring a single node deployed in a private subnet within the Redshift cluster subnet group, with relevant security group and IAM role applied.
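For reference, the custom MSK cluster configuration mentioned above typically needs little more than enabling automatic topic creation. A minimal sketch of the server properties:

    # Allow clients such as the MSK Connect data generator to create
    # topics automatically on first write.
    auto.create.topics.enable=true

MSK clusters disable automatic topic creation by default, which is why the stack applies a custom configuration instead of the default one.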

Create an MSK Connect Custom Plugin

In this post, we use the Amazon MSK Data Generator, deployed in MSK Connect, to produce mock customer data and send it to an MSK topic.

To do this, follow these steps:

  1. Download the Amazon MSK data generator JAR file and its dependencies from GitHub.
  2. Upload the JAR file to an S3 bucket in your AWS account.
  3. In the Amazon MSK console, navigate to “Custom plugins” under MSK Connect.
  4. Select “Create custom plugin.”
  5. Click “Browse S3,” find the Amazon MSK data generator JAR from your S3 bucket, and select it.
  6. Name the custom plugin msk-datagen-plugin.
  7. Click “Create custom plugin.”

Once the plugin is created, its status will display as Active, allowing you to proceed to the next step.

Create an MSK Connect Connector

To create your connector, follow these steps:

  1. In the Amazon MSK console, under MSK Connect, select “Connectors.”
  2. Click “Create connector.”
  3. For “Custom plugin type,” select “Use existing plugin.”
  4. Choose msk-datagen-plugin, then click Next.
  5. Name your connector msk-datagen-connector.
  6. For “Cluster type,” choose “Self-managed Apache Kafka cluster.”
  7. Select custom-vpc for VPC.
  8. Choose the private subnet in your first Availability Zone.
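In a later step, you supply the connector configuration, which tells the data generator what to produce. The following is a hypothetical sketch assuming the generator's Voluble-style properties and a topic named customer; the connector class, field names, and Java Faker expressions are illustrative assumptions, not the exact configuration:

    # Illustrative MSK Data Generator configuration; property names follow
    # the generator's Voluble-style syntax, and the values are assumptions.
    connector.class=com.amazonaws.mskdatagen.GeneratorSourceConnector
    tasks.max=1

    # Key and value generators for a "customer" topic.
    genkp.customer.with=#{Code.isbn10}
    genv.customer.name.with=#{Name.full_name}
    genv.customer.state.with=#{Address.state}

    # Generate roughly one record every 2 seconds.
    global.throttle.ms=2000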
