Data warehouses, and the analytics run on them, have grown steadily in importance. Many organizations now view these systems as essential for both immediate operational decisions and long-term strategic planning. Traditionally, data warehouses are refreshed in batch cycles, whether monthly, weekly, or daily, and businesses derive a range of insights from each refresh.
However, numerous companies are discovering that near-real-time data ingestion combined with sophisticated analytics presents new possibilities. For instance, a financial institution can identify whether a credit card transaction is fraudulent by executing an anomaly detection process in near-real-time rather than relying on batch methods.
In this article, we illustrate how Amazon Redshift can facilitate streaming ingestion and machine learning (ML) predictions within a single platform. Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift ML empowers data analysts and database developers to create, train, and apply ML models using familiar SQL commands in Amazon Redshift data warehouses.
We are thrilled to introduce Amazon Redshift Streaming Ingestion for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This feature allows you to ingest data directly from a Kinesis data stream or Kafka topic without the need to stage the data in Amazon Simple Storage Service (Amazon S3). With Amazon Redshift streaming ingestion, you can achieve low-latency ingestion of hundreds of megabytes of data per second into your data warehouse.
This article shows you how to use Amazon Redshift to build near-real-time ML predictions by combining streaming ingestion and Redshift ML, all with familiar SQL syntax.
Solution Overview
By following the outlined steps, you will set up a producer application on an Amazon Elastic Compute Cloud (Amazon EC2) instance that simulates credit card transactions and streams the data into Kinesis Data Streams in real time. You will create a materialized view in Amazon Redshift for streaming ingestion, which receives the incoming data. Additionally, you will train and deploy a Redshift ML model to generate real-time predictions on the streaming data.
The architecture and process flow can be visualized in the following diagram.
The step-by-step process is as follows:
- The EC2 instance simulates a credit card transaction application, inserting transactions into the Kinesis data stream.
- The data stream captures the incoming credit card transaction data.
- An Amazon Redshift Streaming Ingestion materialized view is generated on top of the data stream, enabling automatic ingestion of streaming data into Amazon Redshift.
- You will construct, train, and deploy an ML model using Redshift ML, with the model being trained on historical transaction data.
- The streaming data is transformed to produce ML predictions.
- Customers can then be alerted, or the application updated to mitigate risk.
This walkthrough utilizes simulated credit card transaction data. The data presented is fictitious and generated using a simulator, and the customer dataset is also generated with random data functions.
Prerequisites
- Create an Amazon Redshift cluster.
- Configure the cluster to utilize Redshift ML.
- Create an AWS Identity and Access Management (IAM) user.
- Update the IAM role attached to your Redshift cluster to include permissions for accessing the Kinesis data stream. For information about the necessary policy, refer to Getting started with streaming ingestion.
- Create an m5.4xlarge EC2 instance. We validated the Producer application using an m5.4xlarge instance, but you may opt for a different instance type. While creating the instance, use the amzn2-ami-kernel-5.10-hvm-2.0.20220426.0-x86_64-gp2 AMI.
- To ensure Python 3 is installed on the EC2 instance, verify the Python version with the following command:
python3 --version
- Install the following required packages to operate the simulator program:
sudo yum install python3-pip
pip3 install numpy
pip3 install pandas
pip3 install matplotlib
pip3 install seaborn
pip3 install boto3
- Configure the EC2 instance with the AWS credentials generated for the IAM user you created earlier, for example by running aws configure and supplying the access key, secret access key, and Region when prompted.
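To confirm that the credentials are picked up correctly before you run the simulator, you can do a quick check with Boto3. This is a minimal sketch; it assumes the default credential chain on the instance resolves to the IAM user you just configured.

import boto3

# Confirm that Boto3 resolves the configured credentials and identify the caller
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(identity["Account"], identity["Arn"])

If this prints your account ID and the IAM user's ARN, the instance is ready to send data to Kinesis.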
Set Up Kinesis Data Streams
Amazon Kinesis Data Streams is a highly scalable and durable real-time data streaming service. It can continuously capture gigabytes of data per second from numerous sources, including website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The collected data becomes available in milliseconds, facilitating real-time analytics use cases like real-time dashboards, anomaly detection, dynamic pricing, and more. Kinesis Data Streams is chosen for its serverless nature, scaling based on usage.
Create a Kinesis Data Stream
To receive the streaming data, you will begin by creating a Kinesis data stream:
- In the Amazon Kinesis console, select Data streams from the navigation menu.
- Choose Create data stream.
- For Data stream name, enter cust-payment-txn-stream.
- For Capacity mode, select On-demand.
- For the remaining options, choose the defaults, and follow the prompts to finish the setup.
- Capture the ARN for the created data stream to use in the next section when defining your IAM policy.
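If you prefer to script this step rather than click through the console, the following Boto3 sketch creates the same on-demand stream and prints its ARN. The stream name matches the walkthrough; the us-west-2 Region is an assumption that matches the ARN in the policy shown later, so adjust it if you deploy elsewhere.

import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

# Create the stream in on-demand capacity mode
kinesis.create_stream(
    StreamName="cust-payment-txn-stream",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Wait until the stream is active, then capture its ARN for the IAM policy
waiter = kinesis.get_waiter("stream_exists")
waiter.wait(StreamName="cust-payment-txn-stream")
summary = kinesis.describe_stream_summary(StreamName="cust-payment-txn-stream")
print(summary["StreamDescriptionSummary"]["StreamARN"])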
Set Up Permissions
To enable a streaming application to write to Kinesis Data Streams, the application must have access to Kinesis. You can use the following policy statement to grant access to the simulator process you will set up later. Make sure to insert the ARN of the data stream you saved previously.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt123",
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord",
        "kinesis:PutRecords",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:ListShards",
        "kinesis:DescribeStreamSummary"
      ],
      "Resource": [
        "arn:aws:kinesis:us-west-2:xxxxxxxxxxxx:stream/cust-payment-txn-stream"
      ]
    }
  ]
}
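One way to apply this is as an inline policy on the IAM user the simulator runs under. The Boto3 sketch below assumes you saved the policy JSON above as payment_stream_policy.json; the user name producer-simulator-user is hypothetical, so substitute the user you created in the prerequisites.

import json
import boto3

iam = boto3.client("iam")

# Load the policy document shown above from a local file
with open("payment_stream_policy.json") as f:
    policy_document = json.load(f)

# Attach it as an inline policy on the simulator's IAM user
# (user name is hypothetical; substitute the user you created earlier)
iam.put_user_policy(
    UserName="producer-simulator-user",
    PolicyName="cust-payment-txn-stream-access",
    PolicyDocument=json.dumps(policy_document),
)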
Configure the Stream Producer
Before consuming streaming data in Amazon Redshift, a source that writes data to the Kinesis data stream must be established. This post employs a custom-built data generator along with the AWS SDK for Python (Boto3) to submit the data to the data stream. For setup instructions, refer to the Producer Simulator. This simulator publishes streaming data to the previously created data stream (cust-payment-txn-stream).
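The full generator lives in the linked setup instructions; the following is only a minimal sketch of its core loop, showing how records reach the stream with Boto3's put_record. The transaction fields here are illustrative placeholders, not the simulator's actual schema.

import json
import random
import time
import uuid

import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

while True:
    # Build a simulated credit card transaction (fields are illustrative only)
    txn = {
        "transaction_id": str(uuid.uuid4()),
        "card_id": random.randint(1000, 9999),
        "amount": round(random.uniform(1.0, 500.0), 2),
        "timestamp": int(time.time()),
    }
    # Records with the same partition key are routed to the same shard
    kinesis.put_record(
        StreamName="cust-payment-txn-stream",
        Data=json.dumps(txn),
        PartitionKey=str(txn["card_id"]),
    )
    time.sleep(0.1)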
Configure the Stream Consumer
This section focuses on configuring the stream consumer, specifically the Amazon Redshift streaming ingestion materialized view. Amazon Redshift streaming ingestion allows for low-latency, high-speed ingestion of streaming data from Kinesis Data Streams into an Amazon Redshift materialized view.
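Conceptually, the consumer consists of two pieces of DDL: an external schema that maps to the Kinesis source, and a materialized view over the stream that unpacks each record's payload. The sketch below submits both through the Redshift Data API; the schema, view, cluster, database, and user names are illustrative assumptions, and the IAM_ROLE ARN must be the role you attached to your cluster in the prerequisites. Refer to Getting started with streaming ingestion for the authoritative syntax.

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-west-2")

# External schema mapped to the Kinesis source, using the cluster's IAM role
create_schema = """
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/redshift-streaming-role'
"""

# Materialized view over the stream; JSON_PARSE unpacks each record's payload
create_view = """
CREATE MATERIALIZED VIEW payment_txn_stream AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."cust-payment-txn-stream"
"""

# Submit both statements through the Redshift Data API
# (cluster, database, and user identifiers are illustrative)
redshift_data.batch_execute_statement(
    ClusterIdentifier="redshift-cluster-1",
    Database="dev",
    DbUser="awsuser",
    Sqls=[create_schema, create_view],
)

With AUTO REFRESH YES, Amazon Redshift keeps the materialized view current as new records arrive on the stream, so downstream queries and ML predictions always see near-real-time data.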